linux-xfs.vger.kernel.org archive mirror
* [PATCH v3 0/4] xfs: rework near mode extent allocation
@ 2019-08-15 12:55 Brian Foster
  2019-08-15 12:55 ` [PATCH v3 1/4] xfs: track active state of allocation btree cursors Brian Foster
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Brian Foster @ 2019-08-15 12:55 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's v3 of the near mode extent allocation rework. The most obvious
difference from v2 is that I've dropped the by-size and exact mode
algorithm reworks and subsequent refactoring from the series for now.
It became a bit unwieldy to keep everything together with all of the
refactoring and whatnot, so I'll plan to follow up with those bits once
the near mode algorithm and associated factoring are settled.
Refer to the previous version(s) for an idea of how this approach
facilitates reuse by the other allocation modes.

Another notable change from v2 is the reintroduction of the "last block
scan" in the near mode algorithm. I found that this logic tends to
dominate allocations from a clean filesystem for quite some time, so
restoring it keeps behavior/performance consistent between the current
code and this patch and makes the fundamental algorithm change a bit
more conservative. From an algorithmic perspective, this change
essentially enhances the fallback algorithm from a pure bnobt span to a
combined bnobt/cntbt span for improved worst case performance at a
slight cost in locality.
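
For reference, the resulting top-level decision in patch 2 boils down
to roughly the following (simplified sketch; error handling, the busy
extent retry and the minlen fallback are all elided):

	if (xfs_btree_islastblock(acur.cnt, 0))
		/* few, large free extents: scan the last cntbt block */
		error = xfs_alloc_walk_iter(args, &acur, acur.cnt,
					    true, false, INT_MAX, &i);
	else
		/* otherwise run the combined bnobt/cntbt span search */
		error = xfs_alloc_ag_vextent_lookup(args, &acur, &i);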

This version has been more thoroughly tested than the previous ones.
I've run fstests with various configurations without regression, along
with some tests comparing the existing and updated allocators with
regard to performance and resulting allocation locality. While the
latter is a bit ad hoc, it provides some data on how much this patch
changes allocation locality over time. IOW, the locality data is not
meant for deep analysis or detailed comparison of locality between the
algorithms, but rather provides a data point that locality is still
effective with the algorithm change.

I'll append raw data from some of the more recent filebench tests to
the end of this mail. For context, each test runs on the same 16xcpu,
128GB RAM box against the same striped lvm volume. The test is a simple
64x concurrent file write+fsync of a random distribution of file sizes
from 4k to 500MB, filling ~70% of a 500GB (16 AG) fs. The locality
output is from a hacked up version of the xfs_io 'frag' command that
scans for and measures distances of known near mode allocations from
the presumed hint (i.e., the previous extent in a file, or the inode
location for initial extents). The largest histogram bucket is the AG
size and represents AG switches (i.e., an allocation that landed in a
different AG from the locality hint).

The first set of tests are run against a clean filesystem and show
comparable performance and locality between the original and updated
algorithms. The second set of tests prepopulates the filesystem and
creates a sequence of randomly sized free space extents before the test
runs. This test is designed to stress the updated algorithm by creating
a large set of unique and randomly sized free extents. Not surprisingly,
this test shows slightly degraded performance for the updated algorithm
when compared to the baseline. While this is not quite an apples to
apples comparison, the degradation is significantly milder than what
occurs when the baseline algorithm encounters its worst case free space
layout (refer to the v1 cover letter for details). Note that both
variants of the test were repeated a couple of times, but I'm only
including one set of output for each since the runs were similar.

An interesting design note related to the performance results is that
some followup experimentation shows the performance is fairly sensitive
to the search key increment aggressiveness implemented in
xfs_alloc_lookup_iter(). This controls how aggressively we search for
ideal locality in the lookup algorithm once we've already found one
suitable maxlen extent. The currently implemented approach is to visit
each set of equal sized extents until we have an allocation candidate
and, from that point forward, double the cursor search key to optimize
the scan of the rest of the tree for better locality. A hack to this
logic to exit on the first available extent can turn the above slight
performance degradation into an overall improvement at the cost of
random locality (i.e., this basically becomes a by-size allocation).
I've played around with this a bit to try to improve the heuristic and
close the gap with the current algorithm, including considering some
form of sysfs knob to tune behavior (even if just for debug/testing),
but so far I haven't come up with anything preferable to the simplicity
of the current approach that doesn't hurt the locality effectiveness of
the allocator.
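
For reference, the heuristic as it currently stands in
xfs_alloc_lookup_iter() (see patch 2 below) is simply:

	/*
	 * Bump the search key for the next lookup iteration: step one
	 * size at a time until we have a candidate (or when the record
	 * we just found already exceeds the doubled key), otherwise
	 * double the key to speed up the scan of the rest of the tree.
	 */
	cur_len <<= 1;
	if (!acur->len || acur->cur_len >= cur_len)
		acur->cur_len++;
	else
		acur->cur_len = cur_len;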

Thoughts, reviews, flames appreciated.

Brian

v3:
- Drop by-size and exact allocation rework bits.
- Add near mode last block scan.
- Add debug mode patch to randomly toggle near mode algos.
- Refactor cursor setup/lookup logic.
- Refactor minlen reverse scan to be common between near mode algos.
- Fix up logic to consistently prioritize extent size over locality.
- Add more useful tracepoints.
- Miscellaneous bug fixes and code/comment cleanups.
v2: https://marc.info/?l=linux-xfs&m=155854834815400&w=2
- Lift small mode refactoring into separate patch (retained review
  tag(s)).
- Various logic cleanups and refactors.
- Push active flag down into btree cursor private area; eliminate cursor
  container struct.
- Refactor final allocation code. Fold xfs_alloc_ag_vextent_type() into
  caller and factor out accounting.
- Fix up tracepoints.
v1: https://marc.info/?l=linux-xfs&m=155742169729590&w=2
- Continued development (various fixes, refinements) on generic bits and
  near mode implementation.
- Added patches 4-6 to refactor exact, by-size and small allocation
  modes.
rfcv2: https://marc.info/?l=linux-xfs&m=155197946630582&w=2
- Dropped spurious initial refactoring.
- Added minlen functionality.
- Properly tied into near alloc path.
- General refactoring and cleanups.
rfcv1: https://marc.info/?l=linux-xfs&m=154479089914351&w=2

Brian Foster (4):
  xfs: track active state of allocation btree cursors
  xfs: use locality optimized cntbt lookups for near mode allocations
  xfs: randomly fall back to near mode lookup algorithm in debug mode
  xfs: refactor successful AG allocation accounting code

 fs/xfs/libxfs/xfs_alloc.c       | 1156 +++++++++++++++++--------------
 fs/xfs/libxfs/xfs_alloc_btree.c |    1 +
 fs/xfs/libxfs/xfs_btree.h       |    3 +
 fs/xfs/xfs_trace.h              |   43 +-
 4 files changed, 660 insertions(+), 543 deletions(-)

-- 
2.20.1

--- test results

- clean fs, baseline:

 2016: 3537.429: Run took 3536 seconds...
 2016: 3537.431: Per-Operation Breakdown
closefile1           39939ops       11ops/s   0.0mb/s      1.2ms/op      430us/op-cpu [0ms - 1812ms]
fsync1               39944ops       11ops/s   0.0mb/s   5557.3ms/op   688845us/op-cpu [3ms - 429730ms]
writefile1           39999ops       11ops/s 100.4mb/s      6.8ms/op    10280us/op-cpu [0ms - 532ms]
createfile1          40000ops       11ops/s   0.0mb/s      0.5ms/op      814us/op-cpu [0ms - 801ms]
 2016: 3537.431: IO Summary: 159882 ops, 45.211 ops/s, (0/11 r/w), 100.4mb/s,  12807us cpu/op, 5558.1ms latency
 2016: 3537.431: Shutting down processes

actual 40732, ideal 40720, fragmentation factor 0.03%
Note, this number is largely meaningless.
Files on this filesystem average 1.00 extents per file
   from      to extents  blocks    pct
      0       0       6       0   0.01
      1       1      81      81   0.20
      2       3     343     884   0.84
      4       7    1173    6786   2.88
      8      15    3527   35273   8.66
     16      31     488   11400   1.20
     32      63    3153  140510   7.74
     64     127    2567  233234   6.30
    128     255    1175  201550   2.88
    256     511     568  216547   1.39
    512    1023     627  460669   1.54
   1024    2047     831 1357531   2.04
   2048    4095     857 2498448   2.10
   4096    8191     597 3637740   1.47
   8192   16383     516 6196647   1.27
  16384   32767     690 16778626   1.69
  32768   65535    1189 58881806   2.92
  65536  131071    1582 152275877   3.88
 131072  262143    2078 397775535   5.10
 262144  524287    3163 1250504038   7.77
 524288 1048575    2789 2010791315   6.85
1048576 2097151    2956 4678090228   7.26
2097152 4194303    4084 12741930013  10.03
4194304 8191983    2239 11372991216   5.50
8191984 8191984    3453 28286920752   8.48

- clean fs, test:

19362: 3540.426: Run took 3539 seconds...
19362: 3540.428: Per-Operation Breakdown
closefile1           39942ops       11ops/s   0.0mb/s      1.1ms/op      459us/op-cpu [0ms - 1991ms]
fsync1               39945ops       11ops/s   0.0mb/s   5566.4ms/op   692442us/op-cpu [3ms - 464333ms]
writefile1           39999ops       11ops/s 100.3mb/s      6.9ms/op    10198us/op-cpu [0ms - 544ms]
createfile1          40000ops       11ops/s   0.0mb/s      0.4ms/op      675us/op-cpu [0ms - 752ms]
19362: 3540.428: IO Summary: 159886 ops, 45.174 ops/s, (0/11 r/w), 100.3mb/s,  12928us cpu/op, 5567.2ms latency
19362: 3540.428: Shutting down processes

actual 40730, ideal 40720, fragmentation factor 0.02%
Note, this number is largely meaningless.
Files on this filesystem average 1.00 extents per file
   from      to extents  blocks    pct
      0       0       6       0   0.01
      1       1     109     109   0.27
      2       3     338     856   0.83
      4       7    1136    6582   2.79
      8      15    3408   33781   8.37
     16      31     648   14649   1.59
     32      63    3145  142563   7.72
     64     127    2809  258454   6.90
    128     255    1482  255051   3.64
    256     511     502  191520   1.23
    512    1023     654  487802   1.61
   1024    2047    1067 1752771   2.62
   2048    4095    1109 3314767   2.72
   4096    8191     906 5534201   2.22
   8192   16383     649 8100247   1.59
  16384   32767     812 18740413   1.99
  32768   65535     996 46874568   2.45
  65536  131071    1861 181198557   4.57
 131072  262143    2570 480355851   6.31
 262144  524287    1801 663068591   4.42
 524288 1048575    2010 1564391753   4.93
1048576 2097151    3008 4760143526   7.39
2097152 4194303    4485 13773793202  11.01
4194304 8191983    1686 8263583893   4.14
8191984 8191984    3533 28942279472   8.67

- randomized free space, baseline:

 1933: 4053.531: Run took 4052 seconds...
 1933: 4053.533: Per-Operation Breakdown
closefile1           39939ops       10ops/s   0.0mb/s      1.1ms/op      470us/op-cpu [0ms - 1339ms]
fsync1               39939ops       10ops/s   0.0mb/s   6379.4ms/op   677902us/op-cpu [3ms - 376494ms]
writefile1           39997ops       10ops/s  87.6mb/s      6.8ms/op    10159us/op-cpu [0ms - 522ms]
createfile1          40000ops       10ops/s   0.0mb/s      0.7ms/op      887us/op-cpu [0ms - 371ms]
 1933: 4053.533: IO Summary: 159875 ops, 39.452 ops/s, (0/10 r/w),  87.6mb/s,  13219us cpu/op, 6378.8ms latency
 1933: 4053.533: Shutting down processes

actual 82107, ideal 72320, fragmentation factor 11.92%
Note, this number is largely meaningless.
Files on this filesystem average 1.14 extents per file
   from      to extents  blocks    pct
      0       0       2       0   0.00
      1       1      74      74   0.09
      2       3     371     962   0.45
      4       7    1174    6710   1.43
      8      15    3633   36965   4.42
     16      31     859   19650   1.05
     32      63    3244  145941   3.95
     64     127    2945  267753   3.59
    128     255    1940  347233   2.36
    256     511    1537  579226   1.87
    512    1023    2597 1977748   3.16
   1024    2047    4773 7310593   5.81
   2048    4095    8990 27491096  10.95
   4096    8191   16422 100901820  20.00
   8192   16383     816 9837468   0.99
  16384   32767     914 21952573   1.11
  32768   65535    2037 98635533   2.48
  65536  131071    3434 336852501   4.18
 131072  262143    3340 593712009   4.07
 262144  524287    2459 951760115   2.99
 524288 1048575    2808 2116317254   3.42
1048576 2097151    2479 3510197318   3.02
2097152 4194303    1005 2975191867   1.22
4194304 8191983     571 3166999981   0.70
8191984 8191984   13683 112090917072  16.66

- randomized free space, test:

 1837: 4117.481: Run took 4116 seconds...
 1837: 4117.483: Per-Operation Breakdown
closefile1           39937ops       10ops/s   0.0mb/s      1.5ms/op      535us/op-cpu [0ms - 1507ms]
fsync1               39937ops       10ops/s   0.0mb/s   6488.7ms/op   682138us/op-cpu [2ms - 399490ms]
writefile1           40000ops       10ops/s  86.2mb/s      6.8ms/op    10332us/op-cpu [0ms - 524ms]
createfile1          40000ops       10ops/s   0.0mb/s      0.7ms/op      852us/op-cpu [0ms - 1027ms]
 1837: 4117.483: IO Summary: 159874 ops, 38.839 ops/s, (0/10 r/w),  86.2mb/s,  13295us cpu/op, 6487.5ms latency
 1837: 4117.483: Shutting down processes

actual 82311, ideal 72323, fragmentation factor 12.13%
Note, this number is largely meaningless.
Files on this filesystem average 1.14 extents per file
   from      to extents  blocks    pct
      0       0       7       0   0.01
      1       1      95      95   0.12
      2       3     395    1032   0.48
      4       7    1169    6696   1.42
      8      15    3512   35631   4.27
     16      31    1167   25350   1.42
     32      63    3563  161086   4.33
     64     127    3296  299665   4.00
    128     255    2370  426781   2.88
    256     511    2568  977733   3.12
    512    1023    4357 3271492   5.29
   1024    2047    5871 8937475   7.13
   2048    4095    9396 28541302  11.42
   4096    8191   16546 101380257  20.10
   8192   16383     400 4415125   0.49
  16384   32767     206 4922473   0.25
  32768   65535     256 12692379   0.31
  65536  131071     318 30349888   0.39
 131072  262143     467 90335617   0.57
 262144  524287     803 312116185   0.98
 524288 1048575    1238 972993740   1.50
1048576 2097151    2173 3398364897   2.64
2097152 4194303    3594 11074400645   4.37
4194304 8191983    4362 25547937595   5.30
8191984 8191984   14182 116178717088  17.23


* [PATCH v3 1/4] xfs: track active state of allocation btree cursors
  2019-08-15 12:55 [PATCH v3 0/4] xfs: rework near mode extent allocation Brian Foster
@ 2019-08-15 12:55 ` Brian Foster
  2019-08-17  0:51   ` Darrick J. Wong
  2019-08-15 12:55 ` [PATCH v3 2/4] xfs: use locality optimized cntbt lookups for near mode allocations Brian Foster
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2019-08-15 12:55 UTC (permalink / raw)
  To: linux-xfs

The upcoming allocation algorithm update searches multiple
allocation btree cursors concurrently. As such, it requires an
active state to track when a particular cursor should continue
searching. While the active state will be modified by higher level
logic, we can define the base functionality in terms of the result
of allocation btree lookups.

Define an active flag in the private area of the btree cursor.
Update it based on the result of lookups in the existing allocation
btree helpers. Finally, provide a new helper to query the current
state.
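
To illustrate how the state is consumed, the next patch gates its
search loops on the new helper and deactivates a cursor once it walks
off either end of the tree. Condensed from the follow-on patch:

	if (!xfs_alloc_cur_active(cur))
		return 0;	/* cursor is done, nothing to search */

	if (increment)
		error = xfs_btree_increment(cur, 0, &i);
	else
		error = xfs_btree_decrement(cur, 0, &i);
	if (error)
		return error;
	if (i == 0)		/* hit the edge of the tree */
		cur->bc_private.a.priv.abt.active = false;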

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c       | 24 +++++++++++++++++++++---
 fs/xfs/libxfs/xfs_alloc_btree.c |  1 +
 fs/xfs/libxfs/xfs_btree.h       |  3 +++
 3 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 372ad55631fc..6340f59ac3f4 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -146,9 +146,13 @@ xfs_alloc_lookup_eq(
 	xfs_extlen_t		len,	/* length of extent */
 	int			*stat)	/* success/failure */
 {
+	int			error;
+
 	cur->bc_rec.a.ar_startblock = bno;
 	cur->bc_rec.a.ar_blockcount = len;
-	return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
+	error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
+	cur->bc_private.a.priv.abt.active = *stat;
+	return error;
 }
 
 /*
@@ -162,9 +166,13 @@ xfs_alloc_lookup_ge(
 	xfs_extlen_t		len,	/* length of extent */
 	int			*stat)	/* success/failure */
 {
+	int			error;
+
 	cur->bc_rec.a.ar_startblock = bno;
 	cur->bc_rec.a.ar_blockcount = len;
-	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
+	error = xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
+	cur->bc_private.a.priv.abt.active = *stat;
+	return error;
 }
 
 /*
@@ -178,9 +186,19 @@ xfs_alloc_lookup_le(
 	xfs_extlen_t		len,	/* length of extent */
 	int			*stat)	/* success/failure */
 {
+	int			error;
 	cur->bc_rec.a.ar_startblock = bno;
 	cur->bc_rec.a.ar_blockcount = len;
-	return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
+	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
+	cur->bc_private.a.priv.abt.active = *stat;
+	return error;
+}
+
+static inline bool
+xfs_alloc_cur_active(
+	struct xfs_btree_cur	*cur)
+{
+	return cur && cur->bc_private.a.priv.abt.active;
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
index 2a94543857a1..279694d73e4e 100644
--- a/fs/xfs/libxfs/xfs_alloc_btree.c
+++ b/fs/xfs/libxfs/xfs_alloc_btree.c
@@ -507,6 +507,7 @@ xfs_allocbt_init_cursor(
 
 	cur->bc_private.a.agbp = agbp;
 	cur->bc_private.a.agno = agno;
+	cur->bc_private.a.priv.abt.active = false;
 
 	if (xfs_sb_version_hascrc(&mp->m_sb))
 		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
index fa3cd8ab9aba..a66063c356cc 100644
--- a/fs/xfs/libxfs/xfs_btree.h
+++ b/fs/xfs/libxfs/xfs_btree.h
@@ -183,6 +183,9 @@ union xfs_btree_cur_private {
 		unsigned long	nr_ops;		/* # record updates */
 		int		shape_changes;	/* # of extent splits */
 	} refc;
+	struct {
+		bool		active;		/* allocation cursor state */
+	} abt;
 };
 
 /*
-- 
2.20.1


* [PATCH v3 2/4] xfs: use locality optimized cntbt lookups for near mode allocations
  2019-08-15 12:55 [PATCH v3 0/4] xfs: rework near mode extent allocation Brian Foster
  2019-08-15 12:55 ` [PATCH v3 1/4] xfs: track active state of allocation btree cursors Brian Foster
@ 2019-08-15 12:55 ` Brian Foster
  2019-08-17  1:34   ` Darrick J. Wong
  2019-08-15 12:55 ` [PATCH v3 3/4] xfs: randomly fall back to near mode lookup algorithm in debug mode Brian Foster
  2019-08-15 12:55 ` [PATCH v3 4/4] xfs: refactor successful AG allocation accounting code Brian Foster
  3 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2019-08-15 12:55 UTC (permalink / raw)
  To: linux-xfs

The extent allocation code in XFS has several allocation modes with
unique implementations. These modes are not all that different from
a high level perspective. The most involved mode is near allocation
mode which attempts to allocate an optimally sized extent with ideal
locality with respect to a provided agbno hint.

In the common case, a near mode allocation consists of a conditional
scan of the last cntbt block followed by a concurrent left and right
spanning search of the bnobt starting from the ideal point of
locality. This works reasonably well as filesystems age under most
common allocation patterns. If free space fragments as the
filesystem ages, however, the near algorithm has very poor breakdown
characteristics. If the extent size lookup happens to land outside
(i.e., before) the last cntbt block, the alloc bypasses the cntbt
entirely. If a suitably sized extent is far enough away from the
starting points of the bnobt search, the left/right scans can take a
significant amount of time to locate the target extent. This leads
to pathological allocation latencies in certain workloads.

While locality is important to near mode allocations, it is not so
important as to incur pathological allocation latency to provide the
absolute best locality for every allocation. The left/right bnobt
scan is inefficient in many large allocation scenarios. As an
alternative, we can use locality optimized lookups of the cntbt to
find the closest extent to the target agbno of a particular size. We
can repeat this lookup to cover a larger span of extents much more
efficiently. Finally, we can combine this cntbt lookup algorithm
with the existing bnobt scan to provide a natural balance between
the two for large and small allocations. For example, the bnobt scan
may be able to satisfy inode chunk or btree block allocations fairly
quickly where the cntbt search may have to search through a large
set of extent sizes. On the other hand, the cntbt search can more
deterministically scan the set of free extents available for much
larger delayed allocation requests where the bnobt scan may have to
search the entire AG.
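
Concretely, each pass of the cntbt search performs one locality
optimized lookup at the current search length, evaluates the record it
lands on and then widens the search key for the next pass. Condensed
from xfs_alloc_lookup_iter() below:

	/*
	 * agbno is the secondary key of the cntbt, so this lands on
	 * the first extent of at least cur_len blocks at or beyond
	 * the locality hint.
	 */
	error = xfs_alloc_lookup_ge(cur, args->agbno, acur->cur_len, &i);
	if (error)
		return error;
	if (i == 0)
		return 0;	/* ran off the end of the cntbt */
	/* check size/locality of the record and track the best so far */
	error = xfs_alloc_cur_check(args, acur, cur, &i);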

Rework the near mode allocation algorithm as described above to
provide predictable allocation latency under breakdown conditions
with good enough locality in the common case. In doing so, refactor
the affected code into a more generic interface and mechanisms to
facilitate future reuse by the by-size and exact mode algorithms.
Both allocation modes currently have unique, ground-up
implementations that can be replaced with minimal logic additions to
the generic code in this patch.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 1045 +++++++++++++++++++------------------
 fs/xfs/xfs_trace.h        |   43 +-
 2 files changed, 581 insertions(+), 507 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 6340f59ac3f4..7753b61ba532 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -37,7 +37,6 @@ struct workqueue_struct *xfs_alloc_wq;
 #define	XFSA_FIXUP_CNT_OK	2
 
 STATIC int xfs_alloc_ag_vextent_exact(xfs_alloc_arg_t *);
-STATIC int xfs_alloc_ag_vextent_near(xfs_alloc_arg_t *);
 STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
 
 /*
@@ -710,8 +709,446 @@ xfs_alloc_update_counters(
 }
 
 /*
- * Allocation group level functions.
+ * Block allocation algorithm and data structures.
  */
+struct xfs_alloc_cur {
+	struct xfs_btree_cur		*cnt;	/* btree cursors */
+	struct xfs_btree_cur		*bnolt;
+	struct xfs_btree_cur		*bnogt;
+	xfs_extlen_t			cur_len;/* current search length */
+	xfs_agblock_t			rec_bno;/* extent startblock */
+	xfs_extlen_t			rec_len;/* extent length */
+	xfs_agblock_t			bno;	/* alloc bno */
+	xfs_extlen_t			len;	/* alloc len */
+	xfs_extlen_t			diff;	/* diff from search bno */
+	unsigned			busy_gen;/* busy state */
+	bool				busy;
+};
+
+/*
+ * Set up cursors, etc. in the extent allocation cursor. This function can be
+ * called multiple times to reset an initialized structure without having to
+ * reallocate cursors.
+ */
+static int
+xfs_alloc_cur_setup(
+	struct xfs_alloc_arg	*args,
+	struct xfs_alloc_cur	*acur)
+{
+	int			error;
+	int			i;
+
+	ASSERT(args->alignment == 1 || args->type != XFS_ALLOCTYPE_THIS_BNO);
+
+	acur->cur_len = args->maxlen;
+	acur->rec_bno = 0;
+	acur->rec_len = 0;
+	acur->bno = 0;
+	acur->len = 0;
+	acur->diff = -1;
+	acur->busy = false;
+	acur->busy_gen = 0;
+
+	/*
+	 * Allocate the cntbt cursor and set the starting search length to
+	 * maxlen or minlen.
+	 */
+	if (!acur->cnt)
+		acur->cnt = xfs_allocbt_init_cursor(args->mp, args->tp,
+					args->agbp, args->agno, XFS_BTNUM_CNT);
+	error = xfs_alloc_lookup_ge(acur->cnt, 0, acur->cur_len, &i);
+	if (!error && !i && args->maxlen != args->minlen) {
+		acur->cur_len = args->minlen;
+		error = xfs_alloc_lookup_ge(acur->cnt, 0, acur->cur_len, &i);
+	}
+	if (error)
+		return error;
+	if (!i)
+		return -ENOSPC;
+
+	/*
+	 * Allocate the bnobt left and right search cursors.
+	 */
+	if (!acur->bnolt)
+		acur->bnolt = xfs_allocbt_init_cursor(args->mp, args->tp,
+					args->agbp, args->agno, XFS_BTNUM_BNO);
+	if (!acur->bnogt)
+		acur->bnogt = xfs_allocbt_init_cursor(args->mp, args->tp,
+					args->agbp, args->agno, XFS_BTNUM_BNO);
+	return 0;
+}
+
+static void
+xfs_alloc_cur_close(
+	struct xfs_alloc_cur	*acur,
+	bool			error)
+{
+	int			cur_error = XFS_BTREE_NOERROR;
+
+	if (error)
+		cur_error = XFS_BTREE_ERROR;
+
+	if (acur->cnt)
+		xfs_btree_del_cursor(acur->cnt, cur_error);
+	if (acur->bnolt)
+		xfs_btree_del_cursor(acur->bnolt, cur_error);
+	if (acur->bnogt)
+		xfs_btree_del_cursor(acur->bnogt, cur_error);
+	acur->cnt = acur->bnolt = acur->bnogt = NULL;
+}
+
+/*
+ * Check an extent for allocation and track the best available candidate in the
+ * allocation structure. The cursor is deactivated if it has entered an out of
+ * range state based on allocation arguments. Optionally return the
+ * extent geometry and allocation status if requested by the caller.
+ */
+static int
+xfs_alloc_cur_check(
+	struct xfs_alloc_arg		*args,
+	struct xfs_alloc_cur		*acur,
+	struct xfs_btree_cur		*cur,
+	int				*new)
+{
+	int			error, i;
+	xfs_agblock_t		bno, bnoa, bnew;
+	xfs_extlen_t		len, lena, diff = -1;
+	bool			busy;
+	unsigned		busy_gen = 0;
+	bool			deactivate = false;
+	bool			isbnobt = cur->bc_btnum == XFS_BTNUM_BNO;
+
+	*new = 0;
+
+	error = xfs_alloc_get_rec(cur, &bno, &len, &i);
+	if (error)
+		return error;
+	XFS_WANT_CORRUPTED_RETURN(args->mp, i == 1);
+
+	/*
+	 * Check minlen and deactivate a cntbt cursor if out of acceptable size
+	 * range (i.e., walking backwards looking for a minlen extent).
+	 */
+	if (len < args->minlen) {
+		deactivate = !isbnobt;
+		goto out;
+	}
+
+	busy = xfs_alloc_compute_aligned(args, bno, len, &bnoa, &lena,
+					 &busy_gen);
+	acur->busy |= busy;
+	if (busy)
+		acur->busy_gen = busy_gen;
+	/* deactivate a bnobt cursor outside of locality range */
+	if (bnoa < args->min_agbno || bnoa > args->max_agbno) {
+		deactivate = isbnobt;
+		goto out;
+	}
+	if (lena < args->minlen)
+		goto out;
+
+	args->len = XFS_EXTLEN_MIN(lena, args->maxlen);
+	xfs_alloc_fix_len(args);
+	ASSERT(args->len >= args->minlen);
+	if (args->len < acur->len)
+		goto out;
+
+	/*
+	 * We have an aligned record that satisfies minlen and beats or matches
+	 * the candidate extent size. Compare locality for near allocation mode.
+	 */
+	ASSERT(args->type == XFS_ALLOCTYPE_NEAR_BNO);
+	diff = xfs_alloc_compute_diff(args->agbno, args->len,
+				      args->alignment, args->datatype,
+				      bnoa, lena, &bnew);
+	if (bnew == NULLAGBLOCK)
+		goto out;
+
+	/*
+	 * We always prioritize allocation size over locality, so select this
+	 * extent if it is larger or matches size with equal or better locality.
+	 * Don't filter out equivalent extents because the higher level
+	 * algorithm can use selection from particular cursors to terminate the
+	 * search. For example, a bnobt cursor maxlen hit means this is the
+	 * optimal free extent.
+	 */
+	if (args->len == acur->len && diff > acur->diff) {
+		deactivate = isbnobt;
+		goto out;
+	}
+
+	ASSERT(args->len > acur->len ||
+	       (args->len == acur->len && diff <= acur->diff));
+	acur->rec_bno = bno;
+	acur->rec_len = len;
+	acur->bno = bnew;
+	acur->len = args->len;
+	acur->diff = diff;
+	*new = 1;
+out:
+	if (deactivate)
+		cur->bc_private.a.priv.abt.active = false;
+	trace_xfs_alloc_cur_check(args->mp, cur->bc_btnum, bno, len, diff,
+				  *new);
+	return 0;
+}
+
+/*
+ * Complete an allocation of a candidate extent. Remove the extent from both
+ * trees and update the args structure.
+ */
+STATIC int
+xfs_alloc_cur_finish(
+	struct xfs_alloc_arg	*args,
+	struct xfs_alloc_cur	*acur)
+{
+	int			error;
+
+	ASSERT(acur->len);
+	ASSERT(acur->cnt && acur->bnolt);
+
+	error = xfs_alloc_fixup_trees(acur->cnt, acur->bnolt, acur->rec_bno,
+				      acur->rec_len, acur->bno, acur->len, 0);
+	if (error)
+		return error;
+
+	args->agbno = acur->bno;
+	args->len = acur->len;
+	args->wasfromfl = 0;
+
+	trace_xfs_alloc_cur(args);
+	return 0;
+}
+
+/*
+ * Locality allocation lookup algorithm. This expects a cntbt cursor and uses
+ * bno optimized lookup to search for extents with ideal size and locality.
+ */
+STATIC int
+xfs_alloc_lookup_iter(
+	struct xfs_alloc_arg		*args,
+	struct xfs_alloc_cur		*acur,
+	struct xfs_btree_cur		*cur)
+{
+	xfs_agblock_t		bno;
+	xfs_extlen_t		len, cur_len;
+	int			error;
+	int			i;
+
+	if (!xfs_alloc_cur_active(cur))
+		return 0;
+
+	/* locality optimized lookup */
+	cur_len = acur->cur_len;
+	error = xfs_alloc_lookup_ge(cur, args->agbno, cur_len, &i);
+	if (error)
+		return error;
+	if (i == 0)
+		return 0;
+	error = xfs_alloc_get_rec(cur, &bno, &len, &i);
+	if (error)
+		return error;
+
+	/* check the current record and update search length from it */
+	error = xfs_alloc_cur_check(args, acur, cur, &i);
+	if (error)
+		return error;
+	ASSERT(len >= acur->cur_len);
+	acur->cur_len = len;
+
+	/*
+	 * We looked up the first record >= [agbno, len] above. The agbno is a
+	 * secondary key and so the current record may lie just before or after
+	 * agbno. If it is past agbno, check the previous record too so long as
+	 * the length matches as it may be closer. Don't check a smaller record
+	 * because that could deactivate our cursor.
+	 */
+	if (bno > args->agbno) {
+		error = xfs_btree_decrement(cur, 0, &i);
+		if (!error && i) {
+			error = xfs_alloc_get_rec(cur, &bno, &len, &i);
+			if (!error && i && len == acur->cur_len)
+				error = xfs_alloc_cur_check(args, acur, cur,
+							    &i);
+		}
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Increment the search key until we find at least one allocation
+	 * candidate or if the extent we found was larger. Otherwise, double the
+	 * search key to optimize the search. Efficiency is more important here
+	 * than absolute best locality.
+	 */
+	cur_len <<= 1;
+	if (!acur->len || acur->cur_len >= cur_len)
+		acur->cur_len++;
+	else
+		acur->cur_len = cur_len;
+
+	return error;
+}
+
+/*
+ * Incremental lookup algorithm. Walk a btree in either direction looking for
+ * candidate extents. This works for either bnobt (locality allocation) or cntbt
+ * (by-size allocation) cursors.
+ */
+STATIC int
+xfs_alloc_walk_iter(
+	struct xfs_alloc_arg		*args,
+	struct xfs_alloc_cur		*acur,
+	struct xfs_btree_cur		*cur,
+	bool				increment,
+	bool				findone,
+	int				iters,
+	int				*stat)
+{
+	int			error;
+	int			i;
+
+	*stat = 0;
+
+	if (!xfs_alloc_cur_active(cur))
+		return 0;
+
+	for (; iters > 0; iters--) {
+		error = xfs_alloc_cur_check(args, acur, cur, &i);
+		if (error)
+			return error;
+		if (i) {
+			*stat = 1;
+			if (findone)
+				break;
+		}
+		if (!xfs_alloc_cur_active(cur))
+			break;
+
+		if (increment)
+			error = xfs_btree_increment(cur, 0, &i);
+		else
+			error = xfs_btree_decrement(cur, 0, &i);
+		if (error)
+			return error;
+		if (i == 0) {
+			cur->bc_private.a.priv.abt.active = false;
+			break;
+		}
+	}
+
+	return error;
+}
+
+/*
+ * High level locality allocation algorithm. Search the bnobt (left and right)
+ * in parallel with locality-optimized cntbt lookups to find an extent with
+ * ideal locality.
+ */
+STATIC int
+xfs_alloc_ag_vextent_lookup(
+	struct xfs_alloc_arg	*args,
+	struct xfs_alloc_cur	*acur,
+	int			*stat)
+{
+	int				error;
+	int				i;
+	struct xfs_btree_cur		*fbcur = NULL;
+	bool				fbinc = false;
+
+	ASSERT(acur->cnt);
+	ASSERT(args->type == XFS_ALLOCTYPE_NEAR_BNO);
+	*stat = 0;
+
+	/*
+	 * Set up available cursors for a locality allocation search based on
+	 * the ->agbno hint. An exact allocation only needs the bnolt so disable
+	 * the cntbt cursor explicitly.
+	 */
+	error = xfs_alloc_lookup_ge(acur->cnt, args->agbno, acur->cur_len, &i);
+	if (error)
+		return error;
+	error = xfs_alloc_lookup_le(acur->bnolt, args->agbno, 0, &i);
+	if (error)
+		return error;
+	error = xfs_alloc_lookup_ge(acur->bnogt, args->agbno, 0, &i);
+	if (error)
+		return error;
+
+	/* search as long as we have at least one active cursor */
+	while (xfs_alloc_cur_active(acur->cnt) ||
+	       xfs_alloc_cur_active(acur->bnolt) ||
+	       xfs_alloc_cur_active(acur->bnogt)) {
+		trace_xfs_alloc_cur_lookup(args);
+
+		/*
+		 * Search the bnobt left and right. If either of these find a
+		 * maxlen extent, we know we've found ideal locality. Do a
+		 * capped search in the opposite direction and we're done.
+		 */
+		error = xfs_alloc_walk_iter(args, acur, acur->bnolt, false,
+					    true, 1, &i);
+		if (error)
+			return error;
+		if (i && acur->len == args->maxlen) {
+			trace_xfs_alloc_cur_left(args);
+			fbcur = acur->bnogt;
+			fbinc = true;
+			break;
+		}
+
+		error = xfs_alloc_walk_iter(args, acur, acur->bnogt, true,
+					    true, 1, &i);
+		if (error)
+			return error;
+		if (i && acur->len == args->maxlen) {
+			trace_xfs_alloc_cur_right(args);
+			fbcur = acur->bnolt;
+			break;
+		}
+
+		/*
+		 * Search the cntbt for a maximum sized extent with ideal
+		 * locality. The lookup search terminates on reaching the end of
+		 * the cntbt. Break out of the loop if this occurs to throttle
+		 * the bnobt scans.
+		 */
+		error = xfs_alloc_lookup_iter(args, acur, acur->cnt);
+		if (error)
+			return error;
+		if (!xfs_alloc_cur_active(acur->cnt)) {
+			trace_xfs_alloc_cur_lookup_done(args);
+			if (!acur->len) {
+				/*
+				 * Reset the cursor for a minlen search in the
+				 * caller.
+				 */
+				error = xfs_btree_decrement(acur->cnt, 0, &i);
+				if (error)
+					return error;
+				if (i)
+					acur->cnt->bc_private.a.priv.abt.active = true;
+			}
+			break;
+		}
+	}
+
+	/*
+	 * Perform a best-effort search in the opposite direction from a bnobt
+	 * allocation.
+	 */
+	if (fbcur) {
+		error = xfs_alloc_walk_iter(args, acur, fbcur, fbinc, true,
+					    args->mp->m_alloc_mxr[0], &i);
+		if (error)
+			return error;
+	}
+
+	if (acur->len)
+		*stat = 1;
+
+	return error;
+}
 
 /*
  * Deal with the case where only small freespaces remain. Either return the
@@ -814,6 +1251,113 @@ xfs_alloc_ag_vextent_small(
 	return error;
 }
 
+/*
+ * Allocate a variable extent near bno in the allocation group agno.
+ * Extent's length (returned in len) will be between minlen and maxlen,
+ * and of the form k * prod + mod unless there's nothing that large.
+ * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
+ */
+STATIC int
+xfs_alloc_ag_vextent_near(
+	struct xfs_alloc_arg	*args)
+{
+	struct xfs_alloc_cur	acur = {0,};
+	int			error;
+	int			i;
+	xfs_agblock_t		bno;
+	xfs_extlen_t		len;
+
+	/* handle uninitialized agbno range so caller doesn't have to */
+	if (!args->min_agbno && !args->max_agbno)
+		args->max_agbno = args->mp->m_sb.sb_agblocks - 1;
+	ASSERT(args->min_agbno <= args->max_agbno);
+
+	/* clamp agbno to the range if it's outside */
+	if (args->agbno < args->min_agbno)
+		args->agbno = args->min_agbno;
+	if (args->agbno > args->max_agbno)
+		args->agbno = args->max_agbno;
+
+restart:
+	error = xfs_alloc_cur_setup(args, &acur);
+	if (error) {
+		error = (error == -ENOSPC) ? 0 : error;
+		goto out;
+	}
+
+	/*
+	 * The cntbt cursor points at the first maxlen or minlen extent. If it
+	 * resides in the last block of the tree, scan the rest of the block.
+	 * Otherwise run the optimized lookup search algorithm from the current
+	 * location to the end of the tree.
+	 */
+	if (xfs_btree_islastblock(acur.cnt, 0)) {
+		int	j;
+
+		trace_xfs_alloc_cur_lastblock(args);
+		error = xfs_alloc_walk_iter(args, &acur, acur.cnt, true, false,
+				    INT_MAX, &i);
+		if (error)
+			goto out;
+
+		/*
+		 * If allocation fails, reset the cursor for a reverse minlen
+		 * scan. cur_len doesn't change so look up the extent just
+		 * before the starting point.
+		 */
+		if (!i && acur.cur_len > args->minlen)
+			error = xfs_alloc_lookup_le(acur.cnt, 0, acur.cur_len, &j);
+	} else {
+		error = xfs_alloc_ag_vextent_lookup(args, &acur, &i);
+	}
+	if (error)
+		goto out;
+
+	/*
+	 * Ignore locality and search backwards for the first suitable minlen
+	 * extent. This fails if we run out of minlen extents.
+	 */
+	if (!i && xfs_alloc_cur_active(acur.cnt)) {
+		trace_xfs_alloc_cur_reverse(args);
+		error = xfs_alloc_walk_iter(args, &acur, acur.cnt, false, true,
+					    INT_MAX, &i);
+		if (error)
+			goto out;
+	}
+
+	if (!i) {
+		/* flush and retry if we found busy extents */
+		if (acur.busy) {
+			trace_xfs_alloc_ag_busy(args);
+			xfs_extent_busy_flush(args->mp, args->pag,
+					      acur.busy_gen);
+			goto restart;
+		}
+
+		/*
+		 * Try the AGFL as a last resort. Don't pass a cursor so this
+		 * returns an AGFL block (i == 0) or nothing.
+		 */
+		error = xfs_alloc_ag_vextent_small(args, NULL, &bno, &len, &i);
+		if (error)
+			goto out;
+		ASSERT(i == 0 || (i && len == 0));
+		trace_xfs_alloc_ag_noentry(args);
+
+		args->agbno = bno;
+		args->len = len;
+		goto out;
+	}
+
+	/* fix up btrees on a successful allocation */
+	error = xfs_alloc_cur_finish(args, &acur);
+out:
+	xfs_alloc_cur_close(&acur, error);
+	if (error)
+		trace_xfs_alloc_ag_error(args);
+	return error;
+}
+
 /*
  * Allocate a variable extent in the allocation group agno.
  * Type and bno are used to determine where in the allocation group the
@@ -1001,503 +1545,6 @@ xfs_alloc_ag_vextent_exact(
 	return error;
 }
 
-/*
- * Search the btree in a given direction via the search cursor and compare
- * the records found against the good extent we've already found.
- */
-STATIC int
-xfs_alloc_find_best_extent(
-	struct xfs_alloc_arg	*args,	/* allocation argument structure */
-	struct xfs_btree_cur	**gcur,	/* good cursor */
-	struct xfs_btree_cur	**scur,	/* searching cursor */
-	xfs_agblock_t		gdiff,	/* difference for search comparison */
-	xfs_agblock_t		*sbno,	/* extent found by search */
-	xfs_extlen_t		*slen,	/* extent length */
-	xfs_agblock_t		*sbnoa,	/* aligned extent found by search */
-	xfs_extlen_t		*slena,	/* aligned extent length */
-	int			dir)	/* 0 = search right, 1 = search left */
-{
-	xfs_agblock_t		new;
-	xfs_agblock_t		sdiff;
-	int			error;
-	int			i;
-	unsigned		busy_gen;
-
-	/* The good extent is perfect, no need to  search. */
-	if (!gdiff)
-		goto out_use_good;
-
-	/*
-	 * Look until we find a better one, run out of space or run off the end.
-	 */
-	do {
-		error = xfs_alloc_get_rec(*scur, sbno, slen, &i);
-		if (error)
-			goto error0;
-		XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
-		xfs_alloc_compute_aligned(args, *sbno, *slen,
-				sbnoa, slena, &busy_gen);
-
-		/*
-		 * The good extent is closer than this one.
-		 */
-		if (!dir) {
-			if (*sbnoa > args->max_agbno)
-				goto out_use_good;
-			if (*sbnoa >= args->agbno + gdiff)
-				goto out_use_good;
-		} else {
-			if (*sbnoa < args->min_agbno)
-				goto out_use_good;
-			if (*sbnoa <= args->agbno - gdiff)
-				goto out_use_good;
-		}
-
-		/*
-		 * Same distance, compare length and pick the best.
-		 */
-		if (*slena >= args->minlen) {
-			args->len = XFS_EXTLEN_MIN(*slena, args->maxlen);
-			xfs_alloc_fix_len(args);
-
-			sdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-						       args->alignment,
-						       args->datatype, *sbnoa,
-						       *slena, &new);
-
-			/*
-			 * Choose closer size and invalidate other cursor.
-			 */
-			if (sdiff < gdiff)
-				goto out_use_search;
-			goto out_use_good;
-		}
-
-		if (!dir)
-			error = xfs_btree_increment(*scur, 0, &i);
-		else
-			error = xfs_btree_decrement(*scur, 0, &i);
-		if (error)
-			goto error0;
-	} while (i);
-
-out_use_good:
-	xfs_btree_del_cursor(*scur, XFS_BTREE_NOERROR);
-	*scur = NULL;
-	return 0;
-
-out_use_search:
-	xfs_btree_del_cursor(*gcur, XFS_BTREE_NOERROR);
-	*gcur = NULL;
-	return 0;
-
-error0:
-	/* caller invalidates cursors */
-	return error;
-}
-
-/*
- * Allocate a variable extent near bno in the allocation group agno.
- * Extent's length (returned in len) will be between minlen and maxlen,
- * and of the form k * prod + mod unless there's nothing that large.
- * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
- */
-STATIC int				/* error */
-xfs_alloc_ag_vextent_near(
-	xfs_alloc_arg_t	*args)		/* allocation argument structure */
-{
-	xfs_btree_cur_t	*bno_cur_gt;	/* cursor for bno btree, right side */
-	xfs_btree_cur_t	*bno_cur_lt;	/* cursor for bno btree, left side */
-	xfs_btree_cur_t	*cnt_cur;	/* cursor for count btree */
-	xfs_agblock_t	gtbno;		/* start bno of right side entry */
-	xfs_agblock_t	gtbnoa;		/* aligned ... */
-	xfs_extlen_t	gtdiff;		/* difference to right side entry */
-	xfs_extlen_t	gtlen;		/* length of right side entry */
-	xfs_extlen_t	gtlena;		/* aligned ... */
-	xfs_agblock_t	gtnew;		/* useful start bno of right side */
-	int		error;		/* error code */
-	int		i;		/* result code, temporary */
-	int		j;		/* result code, temporary */
-	xfs_agblock_t	ltbno;		/* start bno of left side entry */
-	xfs_agblock_t	ltbnoa;		/* aligned ... */
-	xfs_extlen_t	ltdiff;		/* difference to left side entry */
-	xfs_extlen_t	ltlen;		/* length of left side entry */
-	xfs_extlen_t	ltlena;		/* aligned ... */
-	xfs_agblock_t	ltnew;		/* useful start bno of left side */
-	xfs_extlen_t	rlen;		/* length of returned extent */
-	bool		busy;
-	unsigned	busy_gen;
-#ifdef DEBUG
-	/*
-	 * Randomly don't execute the first algorithm.
-	 */
-	int		dofirst;	/* set to do first algorithm */
-
-	dofirst = prandom_u32() & 1;
-#endif
-
-	/* handle unitialized agbno range so caller doesn't have to */
-	if (!args->min_agbno && !args->max_agbno)
-		args->max_agbno = args->mp->m_sb.sb_agblocks - 1;
-	ASSERT(args->min_agbno <= args->max_agbno);
-
-	/* clamp agbno to the range if it's outside */
-	if (args->agbno < args->min_agbno)
-		args->agbno = args->min_agbno;
-	if (args->agbno > args->max_agbno)
-		args->agbno = args->max_agbno;
-
-restart:
-	bno_cur_lt = NULL;
-	bno_cur_gt = NULL;
-	ltlen = 0;
-	gtlena = 0;
-	ltlena = 0;
-	busy = false;
-
-	/*
-	 * Get a cursor for the by-size btree.
-	 */
-	cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
-		args->agno, XFS_BTNUM_CNT);
-
-	/*
-	 * See if there are any free extents as big as maxlen.
-	 */
-	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
-		goto error0;
-	/*
-	 * If none, then pick up the last entry in the tree unless the
-	 * tree is empty.
-	 */
-	if (!i) {
-		if ((error = xfs_alloc_ag_vextent_small(args, cnt_cur, &ltbno,
-				&ltlen, &i)))
-			goto error0;
-		if (i == 0 || ltlen == 0) {
-			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
-			trace_xfs_alloc_near_noentry(args);
-			return 0;
-		}
-		ASSERT(i == 1);
-	}
-	args->wasfromfl = 0;
-
-	/*
-	 * First algorithm.
-	 * If the requested extent is large wrt the freespaces available
-	 * in this a.g., then the cursor will be pointing to a btree entry
-	 * near the right edge of the tree.  If it's in the last btree leaf
-	 * block, then we just examine all the entries in that block
-	 * that are big enough, and pick the best one.
-	 * This is written as a while loop so we can break out of it,
-	 * but we never loop back to the top.
-	 */
-	while (xfs_btree_islastblock(cnt_cur, 0)) {
-		xfs_extlen_t	bdiff;
-		int		besti=0;
-		xfs_extlen_t	blen=0;
-		xfs_agblock_t	bnew=0;
-
-#ifdef DEBUG
-		if (dofirst)
-			break;
-#endif
-		/*
-		 * Start from the entry that lookup found, sequence through
-		 * all larger free blocks.  If we're actually pointing at a
-		 * record smaller than maxlen, go to the start of this block,
-		 * and skip all those smaller than minlen.
-		 */
-		if (ltlen || args->alignment > 1) {
-			cnt_cur->bc_ptrs[0] = 1;
-			do {
-				if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno,
-						&ltlen, &i)))
-					goto error0;
-				XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
-				if (ltlen >= args->minlen)
-					break;
-				if ((error = xfs_btree_increment(cnt_cur, 0, &i)))
-					goto error0;
-			} while (i);
-			ASSERT(ltlen >= args->minlen);
-			if (!i)
-				break;
-		}
-		i = cnt_cur->bc_ptrs[0];
-		for (j = 1, blen = 0, bdiff = 0;
-		     !error && j && (blen < args->maxlen || bdiff > 0);
-		     error = xfs_btree_increment(cnt_cur, 0, &j)) {
-			/*
-			 * For each entry, decide if it's better than
-			 * the previous best entry.
-			 */
-			if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno, &ltlen, &i)))
-				goto error0;
-			XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
-			busy = xfs_alloc_compute_aligned(args, ltbno, ltlen,
-					&ltbnoa, &ltlena, &busy_gen);
-			if (ltlena < args->minlen)
-				continue;
-			if (ltbnoa < args->min_agbno || ltbnoa > args->max_agbno)
-				continue;
-			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
-			xfs_alloc_fix_len(args);
-			ASSERT(args->len >= args->minlen);
-			if (args->len < blen)
-				continue;
-			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-				args->alignment, args->datatype, ltbnoa,
-				ltlena, &ltnew);
-			if (ltnew != NULLAGBLOCK &&
-			    (args->len > blen || ltdiff < bdiff)) {
-				bdiff = ltdiff;
-				bnew = ltnew;
-				blen = args->len;
-				besti = cnt_cur->bc_ptrs[0];
-			}
-		}
-		/*
-		 * It didn't work.  We COULD be in a case where
-		 * there's a good record somewhere, so try again.
-		 */
-		if (blen == 0)
-			break;
-		/*
-		 * Point at the best entry, and retrieve it again.
-		 */
-		cnt_cur->bc_ptrs[0] = besti;
-		if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno, &ltlen, &i)))
-			goto error0;
-		XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
-		ASSERT(ltbno + ltlen <= be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length));
-		args->len = blen;
-
-		/*
-		 * We are allocating starting at bnew for blen blocks.
-		 */
-		args->agbno = bnew;
-		ASSERT(bnew >= ltbno);
-		ASSERT(bnew + blen <= ltbno + ltlen);
-		/*
-		 * Set up a cursor for the by-bno tree.
-		 */
-		bno_cur_lt = xfs_allocbt_init_cursor(args->mp, args->tp,
-			args->agbp, args->agno, XFS_BTNUM_BNO);
-		/*
-		 * Fix up the btree entries.
-		 */
-		if ((error = xfs_alloc_fixup_trees(cnt_cur, bno_cur_lt, ltbno,
-				ltlen, bnew, blen, XFSA_FIXUP_CNT_OK)))
-			goto error0;
-		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
-		xfs_btree_del_cursor(bno_cur_lt, XFS_BTREE_NOERROR);
-
-		trace_xfs_alloc_near_first(args);
-		return 0;
-	}
-	/*
-	 * Second algorithm.
-	 * Search in the by-bno tree to the left and to the right
-	 * simultaneously, until in each case we find a space big enough,
-	 * or run into the edge of the tree.  When we run into the edge,
-	 * we deallocate that cursor.
-	 * If both searches succeed, we compare the two spaces and pick
-	 * the better one.
-	 * With alignment, it's possible for both to fail; the upper
-	 * level algorithm that picks allocation groups for allocations
-	 * is not supposed to do this.
-	 */
-	/*
-	 * Allocate and initialize the cursor for the leftward search.
-	 */
-	bno_cur_lt = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
-		args->agno, XFS_BTNUM_BNO);
-	/*
-	 * Lookup <= bno to find the leftward search's starting point.
-	 */
-	if ((error = xfs_alloc_lookup_le(bno_cur_lt, args->agbno, args->maxlen, &i)))
-		goto error0;
-	if (!i) {
-		/*
-		 * Didn't find anything; use this cursor for the rightward
-		 * search.
-		 */
-		bno_cur_gt = bno_cur_lt;
-		bno_cur_lt = NULL;
-	}
-	/*
-	 * Found something.  Duplicate the cursor for the rightward search.
-	 */
-	else if ((error = xfs_btree_dup_cursor(bno_cur_lt, &bno_cur_gt)))
-		goto error0;
-	/*
-	 * Increment the cursor, so we will point at the entry just right
-	 * of the leftward entry if any, or to the leftmost entry.
-	 */
-	if ((error = xfs_btree_increment(bno_cur_gt, 0, &i)))
-		goto error0;
-	if (!i) {
-		/*
-		 * It failed, there are no rightward entries.
-		 */
-		xfs_btree_del_cursor(bno_cur_gt, XFS_BTREE_NOERROR);
-		bno_cur_gt = NULL;
-	}
-	/*
-	 * Loop going left with the leftward cursor, right with the
-	 * rightward cursor, until either both directions give up or
-	 * we find an entry at least as big as minlen.
-	 */
-	do {
-		if (bno_cur_lt) {
-			if ((error = xfs_alloc_get_rec(bno_cur_lt, &ltbno, &ltlen, &i)))
-				goto error0;
-			XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
-			busy |= xfs_alloc_compute_aligned(args, ltbno, ltlen,
-					&ltbnoa, &ltlena, &busy_gen);
-			if (ltlena >= args->minlen && ltbnoa >= args->min_agbno)
-				break;
-			if ((error = xfs_btree_decrement(bno_cur_lt, 0, &i)))
-				goto error0;
-			if (!i || ltbnoa < args->min_agbno) {
-				xfs_btree_del_cursor(bno_cur_lt,
-						     XFS_BTREE_NOERROR);
-				bno_cur_lt = NULL;
-			}
-		}
-		if (bno_cur_gt) {
-			if ((error = xfs_alloc_get_rec(bno_cur_gt, &gtbno, &gtlen, &i)))
-				goto error0;
-			XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
-			busy |= xfs_alloc_compute_aligned(args, gtbno, gtlen,
-					&gtbnoa, &gtlena, &busy_gen);
-			if (gtlena >= args->minlen && gtbnoa <= args->max_agbno)
-				break;
-			if ((error = xfs_btree_increment(bno_cur_gt, 0, &i)))
-				goto error0;
-			if (!i || gtbnoa > args->max_agbno) {
-				xfs_btree_del_cursor(bno_cur_gt,
-						     XFS_BTREE_NOERROR);
-				bno_cur_gt = NULL;
-			}
-		}
-	} while (bno_cur_lt || bno_cur_gt);
-
-	/*
-	 * Got both cursors still active, need to find better entry.
-	 */
-	if (bno_cur_lt && bno_cur_gt) {
-		if (ltlena >= args->minlen) {
-			/*
-			 * Left side is good, look for a right side entry.
-			 */
-			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
-			xfs_alloc_fix_len(args);
-			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-				args->alignment, args->datatype, ltbnoa,
-				ltlena, &ltnew);
-
-			error = xfs_alloc_find_best_extent(args,
-						&bno_cur_lt, &bno_cur_gt,
-						ltdiff, &gtbno, &gtlen,
-						&gtbnoa, &gtlena,
-						0 /* search right */);
-		} else {
-			ASSERT(gtlena >= args->minlen);
-
-			/*
-			 * Right side is good, look for a left side entry.
-			 */
-			args->len = XFS_EXTLEN_MIN(gtlena, args->maxlen);
-			xfs_alloc_fix_len(args);
-			gtdiff = xfs_alloc_compute_diff(args->agbno, args->len,
-				args->alignment, args->datatype, gtbnoa,
-				gtlena, &gtnew);
-
-			error = xfs_alloc_find_best_extent(args,
-						&bno_cur_gt, &bno_cur_lt,
-						gtdiff, &ltbno, &ltlen,
-						&ltbnoa, &ltlena,
-						1 /* search left */);
-		}
-
-		if (error)
-			goto error0;
-	}
-
-	/*
-	 * If we couldn't get anything, give up.
-	 */
-	if (bno_cur_lt == NULL && bno_cur_gt == NULL) {
-		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
-
-		if (busy) {
-			trace_xfs_alloc_near_busy(args);
-			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
-			goto restart;
-		}
-		trace_xfs_alloc_size_neither(args);
-		args->agbno = NULLAGBLOCK;
-		return 0;
-	}
-
-	/*
-	 * At this point we have selected a freespace entry, either to the
-	 * left or to the right.  If it's on the right, copy all the
-	 * useful variables to the "left" set so we only have one
-	 * copy of this code.
-	 */
-	if (bno_cur_gt) {
-		bno_cur_lt = bno_cur_gt;
-		bno_cur_gt = NULL;
-		ltbno = gtbno;
-		ltbnoa = gtbnoa;
-		ltlen = gtlen;
-		ltlena = gtlena;
-		j = 1;
-	} else
-		j = 0;
-
-	/*
-	 * Fix up the length and compute the useful address.
-	 */
-	args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
-	xfs_alloc_fix_len(args);
-	rlen = args->len;
-	(void)xfs_alloc_compute_diff(args->agbno, rlen, args->alignment,
-				     args->datatype, ltbnoa, ltlena, &ltnew);
-	ASSERT(ltnew >= ltbno);
-	ASSERT(ltnew + rlen <= ltbnoa + ltlena);
-	ASSERT(ltnew + rlen <= be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length));
-	ASSERT(ltnew >= args->min_agbno && ltnew <= args->max_agbno);
-	args->agbno = ltnew;
-
-	if ((error = xfs_alloc_fixup_trees(cnt_cur, bno_cur_lt, ltbno, ltlen,
-			ltnew, rlen, XFSA_FIXUP_BNO_OK)))
-		goto error0;
-
-	if (j)
-		trace_xfs_alloc_near_greater(args);
-	else
-		trace_xfs_alloc_near_lesser(args);
-
-	xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
-	xfs_btree_del_cursor(bno_cur_lt, XFS_BTREE_NOERROR);
-	return 0;
-
- error0:
-	trace_xfs_alloc_near_error(args);
-	if (cnt_cur != NULL)
-		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_ERROR);
-	if (bno_cur_lt != NULL)
-		xfs_btree_del_cursor(bno_cur_lt, XFS_BTREE_ERROR);
-	if (bno_cur_gt != NULL)
-		xfs_btree_del_cursor(bno_cur_gt, XFS_BTREE_ERROR);
-	return error;
-}
-
 /*
  * Allocate a variable extent anywhere in the allocation group agno.
  * Extent's length (returned in len) will be between minlen and maxlen,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8094b1920eef..a441a472da79 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1635,14 +1635,16 @@ DEFINE_EVENT(xfs_alloc_class, name, \
 DEFINE_ALLOC_EVENT(xfs_alloc_exact_done);
 DEFINE_ALLOC_EVENT(xfs_alloc_exact_notfound);
 DEFINE_ALLOC_EVENT(xfs_alloc_exact_error);
-DEFINE_ALLOC_EVENT(xfs_alloc_near_nominleft);
-DEFINE_ALLOC_EVENT(xfs_alloc_near_first);
-DEFINE_ALLOC_EVENT(xfs_alloc_near_greater);
-DEFINE_ALLOC_EVENT(xfs_alloc_near_lesser);
-DEFINE_ALLOC_EVENT(xfs_alloc_near_error);
-DEFINE_ALLOC_EVENT(xfs_alloc_near_noentry);
-DEFINE_ALLOC_EVENT(xfs_alloc_near_busy);
-DEFINE_ALLOC_EVENT(xfs_alloc_size_neither);
+DEFINE_ALLOC_EVENT(xfs_alloc_ag_error);
+DEFINE_ALLOC_EVENT(xfs_alloc_ag_noentry);
+DEFINE_ALLOC_EVENT(xfs_alloc_ag_busy);
+DEFINE_ALLOC_EVENT(xfs_alloc_cur);
+DEFINE_ALLOC_EVENT(xfs_alloc_cur_lastblock);
+DEFINE_ALLOC_EVENT(xfs_alloc_cur_left);
+DEFINE_ALLOC_EVENT(xfs_alloc_cur_right);
+DEFINE_ALLOC_EVENT(xfs_alloc_cur_lookup);
+DEFINE_ALLOC_EVENT(xfs_alloc_cur_lookup_done);
+DEFINE_ALLOC_EVENT(xfs_alloc_cur_reverse);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_noentry);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_nominleft);
 DEFINE_ALLOC_EVENT(xfs_alloc_size_done);
@@ -1658,6 +1660,31 @@ DEFINE_ALLOC_EVENT(xfs_alloc_vextent_noagbp);
 DEFINE_ALLOC_EVENT(xfs_alloc_vextent_loopfailed);
 DEFINE_ALLOC_EVENT(xfs_alloc_vextent_allfailed);
 
+TRACE_EVENT(xfs_alloc_cur_check,
+	TP_PROTO(struct xfs_mount *mp, xfs_btnum_t btnum, xfs_agblock_t bno,
+		 xfs_extlen_t len, xfs_extlen_t diff, bool new),
+	TP_ARGS(mp, btnum, bno, len, diff, new),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_btnum_t, btnum)
+		__field(xfs_agblock_t, bno)
+		__field(xfs_extlen_t, len)
+		__field(xfs_extlen_t, diff)
+		__field(bool, new)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->btnum = btnum;
+		__entry->bno = bno;
+		__entry->len = len;
+		__entry->diff = diff;
+		__entry->new = new;
+	),
+	TP_printk("dev %d:%d btnum %d bno 0x%x len 0x%x diff 0x%x new %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->btnum,
+		  __entry->bno, __entry->len, __entry->diff, __entry->new)
+)
+
 DECLARE_EVENT_CLASS(xfs_da_class,
 	TP_PROTO(struct xfs_da_args *args),
 	TP_ARGS(args),
-- 
2.20.1


* [PATCH v3 3/4] xfs: randomly fall back to near mode lookup algorithm in debug mode
  2019-08-15 12:55 [PATCH v3 0/4] xfs: rework near mode extent allocation Brian Foster
  2019-08-15 12:55 ` [PATCH v3 1/4] xfs: track active state of allocation btree cursors Brian Foster
  2019-08-15 12:55 ` [PATCH v3 2/4] xfs: use locality optimized cntbt lookups for near mode allocations Brian Foster
@ 2019-08-15 12:55 ` Brian Foster
  2019-08-17  1:37   ` Darrick J. Wong
  2019-08-15 12:55 ` [PATCH v3 4/4] xfs: refactor successful AG allocation accounting code Brian Foster
  3 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2019-08-15 12:55 UTC (permalink / raw)
  To: linux-xfs

The last block scan is the dominant near mode allocation algorithm
on newer filesystems with fewer, larger free extents. Add debug
mode logic to randomly fall back to the lookup algorithm to improve
regression test coverage.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 7753b61ba532..d550aa5597bf 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1266,6 +1266,7 @@ xfs_alloc_ag_vextent_near(
 	int			i;
 	xfs_agblock_t		bno;
 	xfs_extlen_t		len;
+	bool			lastblock;
 
 	/* handle uninitialized agbno range so caller doesn't have to */
 	if (!args->min_agbno && !args->max_agbno)
@@ -1291,7 +1292,12 @@ xfs_alloc_ag_vextent_near(
 	 * Otherwise run the optimized lookup search algorithm from the current
 	 * location to the end of the tree.
 	 */
-	if (xfs_btree_islastblock(acur.cnt, 0)) {
+	lastblock = xfs_btree_islastblock(acur.cnt, 0);
+#ifdef DEBUG
+	if (lastblock)
+		lastblock = prandom_u32() & 1;
+#endif
+	if (lastblock) {
 		int	j;
 
 		trace_xfs_alloc_cur_lastblock(args);
-- 
2.20.1


* [PATCH v3 4/4] xfs: refactor successful AG allocation accounting code
  2019-08-15 12:55 [PATCH v3 0/4] xfs: rework near mode extent allocation Brian Foster
                   ` (2 preceding siblings ...)
  2019-08-15 12:55 ` [PATCH v3 3/4] xfs: randomly fall back to near mode lookup algorithm in debug mode Brian Foster
@ 2019-08-15 12:55 ` Brian Foster
  2019-08-17  0:28   ` Darrick J. Wong
  3 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2019-08-15 12:55 UTC (permalink / raw)
  To: linux-xfs

The higher level allocation code is unnecessarily split across
xfs_alloc_ag_vextent() and xfs_alloc_ag_vextent_type(). In
preparation for condensing this code, factor out the AG accounting
bits and move the caller down after the generic allocation structure
and function definitions to pick them up without the need for
declarations. No functional changes.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 75 +++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 30 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index d550aa5597bf..4ae4cfa0ed7f 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -1364,6 +1364,48 @@ xfs_alloc_ag_vextent_near(
 	return error;
 }
 
+/*
+ * Various AG accounting updates for a successful allocation. This includes
+ * updating the rmapbt, AG free block accounting and AG reservation accounting.
+ */
+STATIC int
+xfs_alloc_ag_vextent_accounting(
+	struct xfs_alloc_arg	*args)
+{
+	int			error = 0;
+
+	ASSERT(args->agbno != NULLAGBLOCK);
+	ASSERT(args->len >= args->minlen);
+	ASSERT(args->len <= args->maxlen);
+	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
+	ASSERT(args->agbno % args->alignment == 0);
+
+	/* if not file data, insert new block into the reverse map btree */
+	if (!xfs_rmap_should_skip_owner_update(&args->oinfo)) {
+		error = xfs_rmap_alloc(args->tp, args->agbp, args->agno,
+				       args->agbno, args->len, &args->oinfo);
+		if (error)
+			return error;
+	}
+
+	if (!args->wasfromfl) {
+		error = xfs_alloc_update_counters(args->tp, args->pag,
+						  args->agbp,
+						  -((long)(args->len)));
+		if (error)
+			return error;
+
+		ASSERT(!xfs_extent_busy_search(args->mp, args->agno,
+					      args->agbno, args->len));
+	}
+
+	xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
+
+	XFS_STATS_INC(args->mp, xs_allocx);
+	XFS_STATS_ADD(args->mp, xs_allocb, args->len);
+	return error;
+}
+
 /*
  * Allocate a variable extent in the allocation group agno.
  * Type and bno are used to determine where in the allocation group the
@@ -1402,38 +1444,11 @@ xfs_alloc_ag_vextent(
 		ASSERT(0);
 		/* NOTREACHED */
 	}
-
-	if (error || args->agbno == NULLAGBLOCK)
+	if (error)
 		return error;
 
-	ASSERT(args->len >= args->minlen);
-	ASSERT(args->len <= args->maxlen);
-	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
-	ASSERT(args->agbno % args->alignment == 0);
-
-	/* if not file data, insert new block into the reverse map btree */
-	if (!xfs_rmap_should_skip_owner_update(&args->oinfo)) {
-		error = xfs_rmap_alloc(args->tp, args->agbp, args->agno,
-				       args->agbno, args->len, &args->oinfo);
-		if (error)
-			return error;
-	}
-
-	if (!args->wasfromfl) {
-		error = xfs_alloc_update_counters(args->tp, args->pag,
-						  args->agbp,
-						  -((long)(args->len)));
-		if (error)
-			return error;
-
-		ASSERT(!xfs_extent_busy_search(args->mp, args->agno,
-					      args->agbno, args->len));
-	}
-
-	xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
-
-	XFS_STATS_INC(args->mp, xs_allocx);
-	XFS_STATS_ADD(args->mp, xs_allocb, args->len);
+	if (args->agbno != NULLAGBLOCK)
+		error = xfs_alloc_ag_vextent_accounting(args);
 	return error;
 }
 
-- 
2.20.1

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 4/4] xfs: refactor successful AG allocation accounting code
  2019-08-15 12:55 ` [PATCH v3 4/4] xfs: refactor successful AG allocation accounting code Brian Foster
@ 2019-08-17  0:28   ` Darrick J. Wong
  0 siblings, 0 replies; 12+ messages in thread
From: Darrick J. Wong @ 2019-08-17  0:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Aug 15, 2019 at 08:55:38AM -0400, Brian Foster wrote:
> The higher level allocation code is unnecessarily split across
> xfs_alloc_ag_vextent() and xfs_alloc_ag_vextent_type(). In
> preparation for condensing this code, factor out the AG accounting
> bits and move the caller down after the generic allocation structure
> and function definitions to pick them up without the need for
> declarations. No functional changes.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/libxfs/xfs_alloc.c | 75 +++++++++++++++++++++++----------------
>  1 file changed, 45 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index d550aa5597bf..4ae4cfa0ed7f 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1364,6 +1364,48 @@ xfs_alloc_ag_vextent_near(
>  	return error;
>  }
>  
> +/*
> + * Various AG accounting updates for a successful allocation. This includes
> + * updating the rmapbt, AG free block accounting and AG reservation accounting.
> + */
> +STATIC int
> +xfs_alloc_ag_vextent_accounting(
> +	struct xfs_alloc_arg	*args)
> +{
> +	int			error = 0;
> +
> +	ASSERT(args->agbno != NULLAGBLOCK);
> +	ASSERT(args->len >= args->minlen);
> +	ASSERT(args->len <= args->maxlen);
> +	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
> +	ASSERT(args->agbno % args->alignment == 0);
> +
> +	/* if not file data, insert new block into the reverse map btree */
> +	if (!xfs_rmap_should_skip_owner_update(&args->oinfo)) {
> +		error = xfs_rmap_alloc(args->tp, args->agbp, args->agno,
> +				       args->agbno, args->len, &args->oinfo);
> +		if (error)
> +			return error;
> +	}
> +
> +	if (!args->wasfromfl) {
> +		error = xfs_alloc_update_counters(args->tp, args->pag,
> +						  args->agbp,
> +						  -((long)(args->len)));
> +		if (error)
> +			return error;
> +
> +		ASSERT(!xfs_extent_busy_search(args->mp, args->agno,
> +					      args->agbno, args->len));
> +	}
> +
> +	xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
> +
> +	XFS_STATS_INC(args->mp, xs_allocx);
> +	XFS_STATS_ADD(args->mp, xs_allocb, args->len);
> +	return error;
> +}
> +
>  /*
>   * Allocate a variable extent in the allocation group agno.
>   * Type and bno are used to determine where in the allocation group the
> @@ -1402,38 +1444,11 @@ xfs_alloc_ag_vextent(
>  		ASSERT(0);
>  		/* NOTREACHED */
>  	}
> -
> -	if (error || args->agbno == NULLAGBLOCK)
> +	if (error)
>  		return error;
>  
> -	ASSERT(args->len >= args->minlen);
> -	ASSERT(args->len <= args->maxlen);
> -	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
> -	ASSERT(args->agbno % args->alignment == 0);
> -
> -	/* if not file data, insert new block into the reverse map btree */
> -	if (!xfs_rmap_should_skip_owner_update(&args->oinfo)) {
> -		error = xfs_rmap_alloc(args->tp, args->agbp, args->agno,
> -				       args->agbno, args->len, &args->oinfo);
> -		if (error)
> -			return error;
> -	}
> -
> -	if (!args->wasfromfl) {
> -		error = xfs_alloc_update_counters(args->tp, args->pag,
> -						  args->agbp,
> -						  -((long)(args->len)));
> -		if (error)
> -			return error;
> -
> -		ASSERT(!xfs_extent_busy_search(args->mp, args->agno,
> -					      args->agbno, args->len));
> -	}
> -
> -	xfs_ag_resv_alloc_extent(args->pag, args->resv, args);
> -
> -	XFS_STATS_INC(args->mp, xs_allocx);
> -	XFS_STATS_ADD(args->mp, xs_allocb, args->len);
> +	if (args->agbno != NULLAGBLOCK)
> +		error = xfs_alloc_ag_vextent_accounting(args);
>  	return error;
>  }
>  
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/4] xfs: track active state of allocation btree cursors
  2019-08-15 12:55 ` [PATCH v3 1/4] xfs: track active state of allocation btree cursors Brian Foster
@ 2019-08-17  0:51   ` Darrick J. Wong
  2019-08-19 18:03     ` Brian Foster
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2019-08-17  0:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Aug 15, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> The upcoming allocation algorithm update searches multiple
> allocation btree cursors concurrently. As such, it requires an
> active state to track when a particular cursor should continue
> searching. While active state will be modified based on higher level
> logic, we can define base functionality based on the result of
> allocation btree lookups.
> 
> Define an active flag in the private area of the btree cursor.
> Update it based on the result of lookups in the existing allocation
> btree helpers. Finally, provide a new helper to query the current
> state.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c       | 24 +++++++++++++++++++++---
>  fs/xfs/libxfs/xfs_alloc_btree.c |  1 +
>  fs/xfs/libxfs/xfs_btree.h       |  3 +++
>  3 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 372ad55631fc..6340f59ac3f4 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -146,9 +146,13 @@ xfs_alloc_lookup_eq(
>  	xfs_extlen_t		len,	/* length of extent */
>  	int			*stat)	/* success/failure */
>  {
> +	int			error;
> +
>  	cur->bc_rec.a.ar_startblock = bno;
>  	cur->bc_rec.a.ar_blockcount = len;
> -	return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
> +	error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
> +	cur->bc_private.a.priv.abt.active = *stat;

<urk> Not really a fan of mixing types (even if they are bool and int),
how hard would it be to convert some of these *stat to bool?

Does abt.active have a use outside of the struct xfs_alloc_cur in the
next patch?

--D

> +	return error;
>  }
>  
>  /*
> @@ -162,9 +166,13 @@ xfs_alloc_lookup_ge(
>  	xfs_extlen_t		len,	/* length of extent */
>  	int			*stat)	/* success/failure */
>  {
> +	int			error;
> +
>  	cur->bc_rec.a.ar_startblock = bno;
>  	cur->bc_rec.a.ar_blockcount = len;
> -	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
> +	error = xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
> +	cur->bc_private.a.priv.abt.active = *stat;
> +	return error;
>  }
>  
>  /*
> @@ -178,9 +186,19 @@ xfs_alloc_lookup_le(
>  	xfs_extlen_t		len,	/* length of extent */
>  	int			*stat)	/* success/failure */
>  {
> +	int			error;
>  	cur->bc_rec.a.ar_startblock = bno;
>  	cur->bc_rec.a.ar_blockcount = len;
> -	return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
> +	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
> +	cur->bc_private.a.priv.abt.active = *stat;
> +	return error;
> +}
> +
> +static inline bool
> +xfs_alloc_cur_active(
> +	struct xfs_btree_cur	*cur)
> +{
> +	return cur && cur->bc_private.a.priv.abt.active;
>  }
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> index 2a94543857a1..279694d73e4e 100644
> --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> @@ -507,6 +507,7 @@ xfs_allocbt_init_cursor(
>  
>  	cur->bc_private.a.agbp = agbp;
>  	cur->bc_private.a.agno = agno;
> +	cur->bc_private.a.priv.abt.active = false;
>  
>  	if (xfs_sb_version_hascrc(&mp->m_sb))
>  		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
> diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> index fa3cd8ab9aba..a66063c356cc 100644
> --- a/fs/xfs/libxfs/xfs_btree.h
> +++ b/fs/xfs/libxfs/xfs_btree.h
> @@ -183,6 +183,9 @@ union xfs_btree_cur_private {
>  		unsigned long	nr_ops;		/* # record updates */
>  		int		shape_changes;	/* # of extent splits */
>  	} refc;
> +	struct {
> +		bool		active;		/* allocation cursor state */
> +	} abt;
>  };
>  
>  /*
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 2/4] xfs: use locality optimized cntbt lookups for near mode allocations
  2019-08-15 12:55 ` [PATCH v3 2/4] xfs: use locality optimized cntbt lookups for near mode allocations Brian Foster
@ 2019-08-17  1:34   ` Darrick J. Wong
  2019-08-19 18:06     ` Brian Foster
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2019-08-17  1:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

Whee, a huge patch! :)

On Thu, Aug 15, 2019 at 08:55:36AM -0400, Brian Foster wrote:
> The extent allocation code in XFS has several allocation modes with
> unique implementations. These modes are not all that different from
> a high level perspective. The most involved mode is near allocation
> mode which attempts to allocate an optimally sized extent with ideal
> locality with respect to a provided agbno hint.
> 
> In the common case, a near mode allocation consists of a conditional
> scan of the last cntbt block followed by a concurrent left and right
> spanning search of the bnobt starting from the ideal point of
> locality in the bnobt. This works reasonably well as filesystems age
> via most common allocation patterns. If free space fragments as the
> filesystem ages, however, the near algorithm has very poor breakdown
> characteristics. If the extent size lookup happens to land outside
> (i.e., before) the last cntbt block, the alloc bypasses the cntbt
> entirely. If a suitably sized extent is far enough away from the
> starting points of the bnobt search, the left/right scans can take a
> significant amount of time to locate the target extent. This leads
> to pathological allocation latencies in certain workloads.
> 
> While locality is important to near mode allocations, it is not so
> important as to incur pathological allocation latency to provide the
> absolute best locality for every allocation. The left/right bnobt
> scan is inefficient in many large allocation scenarios. As an
> alternative, we can use locality optimized lookups of the cntbt to
> find the closest extent to the target agbno of a particular size. We
> can repeat this lookup to cover a larger span of extents much more
> efficiently. Finally, we can combine this cntbt lookup algorithm
> with the existing bnobt scan to provide a natural balance between
> the two for large and small allocations. For example, the bnobt scan
> may be able to satisfy inode chunk or btree block allocations fairly
> quickly where the cntbt search may have to search through a large
> set of extent sizes. On the other hand, the cntbt search can more
> deterministically scan the set of free extents available for much
> larger delayed allocation requests where the bnobt scan may have to
> search the entire AG.

I sure wish this lengthy description was being put into xfs_alloc.c
itself.  It would be nice to have some pictures of how this is supposed
to work too.
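
Roughly this, if I'm reading the algorithm right?  (A paraphrase of the
commit log, so correct the picture if it's wrong:)

	/*
	 * Near mode allocation: race a locality optimized cntbt search
	 * against a left/right bnobt span out from the target agbno:
	 *
	 *   bnobt:  ... <-- bnolt   [agbno]   bnogt --> ...
	 *           (each cursor steps one record per iteration)
	 *
	 *   cntbt:  lookup_ge(agbno, cur_len), doubling cur_len between
	 *           lookups until the search runs off the end of the tree
	 *
	 * A maxlen hit on either bnobt cursor ends the search with ideal
	 * locality; otherwise the best size/locality candidate seen so far
	 * wins once the cntbt search terminates.
	 */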

> Rework the near mode allocation algorithm as described above to
> provide predictable allocation latency under breakdown conditions
> with good enough locality in the common case. In doing so, refactor
> the affected code into a more generic interface and mechanisms to
> facilitate future reuse by the by-size and exact mode algorithms.
> Both allocation modes currently have unique, ground-up
> implementations that can be replaced with minimal logic additions to
> the generic code in this patch.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 1045 +++++++++++++++++++------------------
>  fs/xfs/xfs_trace.h        |   43 +-
>  2 files changed, 581 insertions(+), 507 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 6340f59ac3f4..7753b61ba532 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -37,7 +37,6 @@ struct workqueue_struct *xfs_alloc_wq;
>  #define	XFSA_FIXUP_CNT_OK	2
>  
>  STATIC int xfs_alloc_ag_vextent_exact(xfs_alloc_arg_t *);
> -STATIC int xfs_alloc_ag_vextent_near(xfs_alloc_arg_t *);
>  STATIC int xfs_alloc_ag_vextent_size(xfs_alloc_arg_t *);
>  
>  /*
> @@ -710,8 +709,446 @@ xfs_alloc_update_counters(
>  }
>  
>  /*
> - * Allocation group level functions.
> + * Block allocation algorithm and data structures.
>   */
> +struct xfs_alloc_cur {
> +	struct xfs_btree_cur		*cnt;	/* btree cursors */
> +	struct xfs_btree_cur		*bnolt;
> +	struct xfs_btree_cur		*bnogt;
> +	xfs_extlen_t			cur_len;/* current search length */
> +	xfs_agblock_t			rec_bno;/* extent startblock */
> +	xfs_extlen_t			rec_len;/* extent length */
> +	xfs_agblock_t			bno;	/* alloc bno */
> +	xfs_extlen_t			len;	/* alloc len */
> +	xfs_extlen_t			diff;	/* diff from search bno */
> +	unsigned			busy_gen;/* busy state */
> +	bool				busy;
> +};
> +
> +/*
> + * Set up cursors, etc. in the extent allocation cursor. This function can be
> + * called multiple times to reset an initialized structure without having to
> + * reallocate cursors.
> + */
> +static int
> +xfs_alloc_cur_setup(
> +	struct xfs_alloc_arg	*args,
> +	struct xfs_alloc_cur	*acur)
> +{
> +	int			error;
> +	int			i;
> +
> +	ASSERT(args->alignment == 1 || args->type != XFS_ALLOCTYPE_THIS_BNO);
> +
> +	acur->cur_len = args->maxlen;
> +	acur->rec_bno = 0;
> +	acur->rec_len = 0;
> +	acur->bno = 0;
> +	acur->len = 0;
> +	acur->diff = -1;
> +	acur->busy = false;
> +	acur->busy_gen = 0;
> +
> +	/*
> +	 * Allocate the cntbt cursor and set the starting search length to
> +	 * maxlen or minlen.
> +	 */
> +	if (!acur->cnt)
> +		acur->cnt = xfs_allocbt_init_cursor(args->mp, args->tp,
> +					args->agbp, args->agno, XFS_BTNUM_CNT);
> +	error = xfs_alloc_lookup_ge(acur->cnt, 0, acur->cur_len, &i);
> +	if (!error && !i && args->maxlen != args->minlen) {
> +		acur->cur_len = args->minlen;
> +		error = xfs_alloc_lookup_ge(acur->cnt, 0, acur->cur_len, &i);
> +	}
> +	if (error)
> +		return error;
> +	if (!i)
> +		return -ENOSPC;
> +
> +	/*
> +	 * Allocate the bnobt left and right search cursors.
> +	 */
> +	if (!acur->bnolt)
> +		acur->bnolt = xfs_allocbt_init_cursor(args->mp, args->tp,
> +					args->agbp, args->agno, XFS_BTNUM_BNO);
> +	if (!acur->bnogt)
> +		acur->bnogt = xfs_allocbt_init_cursor(args->mp, args->tp,
> +					args->agbp, args->agno, XFS_BTNUM_BNO);
> +	return 0;
> +}
> +
> +static void
> +xfs_alloc_cur_close(
> +	struct xfs_alloc_cur	*acur,
> +	bool			error)
> +{
> +	int			cur_error = XFS_BTREE_NOERROR;
> +
> +	if (error)
> +		cur_error = XFS_BTREE_ERROR;
> +
> +	if (acur->cnt)
> +		xfs_btree_del_cursor(acur->cnt, cur_error);
> +	if (acur->bnolt)
> +		xfs_btree_del_cursor(acur->bnolt, cur_error);
> +	if (acur->bnogt)
> +		xfs_btree_del_cursor(acur->bnogt, cur_error);
> +	acur->cnt = acur->bnolt = acur->bnogt = NULL;
> +}
> +
> +/*
> + * Check an extent for allocation and track the best available candidate in the
> + * allocation structure. The cursor is deactivated if it has entered an out of
> + * range state based on allocation arguments. Optionally return the
> + * extent geometry and allocation status if requested by the caller.
> + */
> +static int
> +xfs_alloc_cur_check(
> +	struct xfs_alloc_arg		*args,
> +	struct xfs_alloc_cur		*acur,
> +	struct xfs_btree_cur		*cur,
> +	int				*new)
> +{
> +	int			error, i;
> +	xfs_agblock_t		bno, bnoa, bnew;
> +	xfs_extlen_t		len, lena, diff = -1;
> +	bool			busy;
> +	unsigned		busy_gen = 0;
> +	bool			deactivate = false;
> +	bool			isbnobt = cur->bc_btnum == XFS_BTNUM_BNO;
> +
> +	*new = 0;
> +
> +	error = xfs_alloc_get_rec(cur, &bno, &len, &i);
> +	if (error)
> +		return error;
> +	XFS_WANT_CORRUPTED_RETURN(args->mp, i == 1);
> +
> +	/*
> +	 * Check minlen and deactivate a cntbt cursor if out of acceptable size
> +	 * range (i.e., walking backwards looking for a minlen extent).
> +	 */
> +	if (len < args->minlen) {
> +		deactivate = !isbnobt;
> +		goto out;
> +	}
> +
> +	busy = xfs_alloc_compute_aligned(args, bno, len, &bnoa, &lena,
> +					 &busy_gen);
> +	acur->busy |= busy;
> +	if (busy)
> +		acur->busy_gen = busy_gen;
> +	/* deactivate a bnobt cursor outside of locality range */
> +	if (bnoa < args->min_agbno || bnoa > args->max_agbno) {
> +		deactivate = isbnobt;
> +		goto out;
> +	}
> +	if (lena < args->minlen)
> +		goto out;
> +
> +	args->len = XFS_EXTLEN_MIN(lena, args->maxlen);
> +	xfs_alloc_fix_len(args);
> +	ASSERT(args->len >= args->minlen);
> +	if (args->len < acur->len)
> +		goto out;
> +
> +	/*
> +	 * We have an aligned record that satisfies minlen and beats or matches
> +	 * the candidate extent size. Compare locality for near allocation mode.
> +	 */
> +	ASSERT(args->type == XFS_ALLOCTYPE_NEAR_BNO);
> +	diff = xfs_alloc_compute_diff(args->agbno, args->len,
> +				      args->alignment, args->datatype,
> +				      bnoa, lena, &bnew);
> +	if (bnew == NULLAGBLOCK)
> +		goto out;
> +
> +	/*
> +	 * We always prioritize allocation size over locality, so select this
> +	 * extent if it is larger or matches size with equal or better locality.
> +	 * Don't filter out equivalent extents because the higher level
> +	 * algorithm can use selection from particular cursors to terminate the
> +	 * search. For example, a bnobt cursor maxlen hit means this is the
> +	 * optimal free extent.
> +	 */
> +	if (args->len == acur->len && diff > acur->diff) {
> +		deactivate = isbnobt;
> +		goto out;
> +	}
> +
> +	ASSERT(args->len > acur->len ||
> +	       (args->len == acur->len && diff <= acur->diff));
> +	acur->rec_bno = bno;
> +	acur->rec_len = len;
> +	acur->bno = bnew;
> +	acur->len = args->len;
> +	acur->diff = diff;
> +	*new = 1;
> +out:
> +	if (deactivate)
> +		cur->bc_private.a.priv.abt.active = false;
> +	trace_xfs_alloc_cur_check(args->mp, cur->bc_btnum, bno, len, diff,
> +				  *new);
> +	return 0;
> +}
> +
> +/*
> + * Complete an allocation of a candidate extent. Remove the extent from both
> + * trees and update the args structure.
> + */
> +STATIC int
> +xfs_alloc_cur_finish(
> +	struct xfs_alloc_arg	*args,
> +	struct xfs_alloc_cur	*acur)
> +{
> +	int			error;
> +
> +	ASSERT(acur->len);
> +	ASSERT(acur->cnt && acur->bnolt);
> +
> +	error = xfs_alloc_fixup_trees(acur->cnt, acur->bnolt, acur->rec_bno,
> +				      acur->rec_len, acur->bno, acur->len, 0);
> +	if (error)
> +		return error;
> +
> +	args->agbno = acur->bno;
> +	args->len = acur->len;
> +	args->wasfromfl = 0;
> +
> +	trace_xfs_alloc_cur(args);
> +	return 0;
> +}
> +
> +/*
> + * Locality allocation lookup algorithm. This expects a cntbt cursor and uses
> + * bno optimized lookup to search for extents with ideal size and locality.

Is it supposed to be the case that acur->cnt == cur?  Why not omit the
third parameter?
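
i.e. (untested):

	STATIC int
	xfs_alloc_lookup_iter(
		struct xfs_alloc_arg	*args,
		struct xfs_alloc_cur	*acur)
	{
		struct xfs_btree_cur	*cur = acur->cnt;	/* always the cntbt */

		/* ...rest unchanged... */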

> + */
> +STATIC int
> +xfs_alloc_lookup_iter(
> +	struct xfs_alloc_arg		*args,
> +	struct xfs_alloc_cur		*acur,
> +	struct xfs_btree_cur		*cur)
> +{
> +	xfs_agblock_t		bno;
> +	xfs_extlen_t		len, cur_len;
> +	int			error;
> +	int			i;
> +
> +	if (!xfs_alloc_cur_active(cur))
> +		return 0;
> +
> +	/* locality optimized lookup */
> +	cur_len = acur->cur_len;
> +	error = xfs_alloc_lookup_ge(cur, args->agbno, cur_len, &i);
> +	if (error)
> +		return error;
> +	if (i == 0)
> +		return 0;
> +	error = xfs_alloc_get_rec(cur, &bno, &len, &i);
> +	if (error)
> +		return error;
> +
> +	/* check the current record and update search length from it */
> +	error = xfs_alloc_cur_check(args, acur, cur, &i);
> +	if (error)
> +		return error;
> +	ASSERT(len >= acur->cur_len);
> +	acur->cur_len = len;
> +
> +	/*
> +	 * We looked up the first record >= [agbno, len] above. The agbno is a
> +	 * secondary key and so the current record may lie just before or after
> +	 * agbno. If it is past agbno, check the previous record too so long as
> +	 * the length matches as it may be closer. Don't check a smaller record
> +	 * because that could deactivate our cursor.
> +	 */
> +	if (bno > args->agbno) {
> +		error = xfs_btree_decrement(cur, 0, &i);
> +		if (!error && i) {
> +			error = xfs_alloc_get_rec(cur, &bno, &len, &i);
> +			if (!error && i && len == acur->cur_len)
> +				error = xfs_alloc_cur_check(args, acur, cur,
> +							    &i);
> +		}
> +		if (error)
> +			return error;
> +	}
> +
> +	/*
> +	 * Increment the search key by one if we have yet to find an
> +	 * allocation candidate or if the last lookup already bumped the
> +	 * search length well past the previous key. Otherwise, double the
> +	 * search key to optimize the search. Efficiency is more important
> +	 * here than absolute best locality.
> +	 */
> +	cur_len <<= 1;
> +	if (!acur->len || acur->cur_len >= cur_len)
> +		acur->cur_len++;
> +	else
> +		acur->cur_len = cur_len;
> +
> +	return error;
> +}
> +
> +/*
> + * Incremental lookup algorithm. Walk a btree in either direction looking for
> + * candidate extents. This works for either bnobt (locality allocation) or cntbt
> + * (by-size allocation) cursors.
> + */
> +STATIC int
> +xfs_alloc_walk_iter(
> +	struct xfs_alloc_arg		*args,
> +	struct xfs_alloc_cur		*acur,
> +	struct xfs_btree_cur		*cur,
> +	bool				increment,
> +	bool				findone,
> +	int				iters,

Err... what do findone and iters do?

"If @findone, return the first thing we find" and "Only search @iters
steps away"?  Maybe the two bools should be flags?
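
e.g. (flag names made up):

	#define XFS_ALLOC_WALK_INCREMENT	(1 << 0)	/* search right */
	#define XFS_ALLOC_WALK_FIRST		(1 << 1)	/* stop at first hit */

	error = xfs_alloc_walk_iter(args, acur, acur->bnolt,
				    XFS_ALLOC_WALK_FIRST, 1, &i);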

> +	int				*stat)
> +{
> +	int			error;
> +	int			i;
> +
> +	*stat = 0;
> +
> +	if (!xfs_alloc_cur_active(cur))
> +		return 0;
> +
> +	for (; iters > 0; iters--) {
> +		error = xfs_alloc_cur_check(args, acur, cur, &i);
> +		if (error)
> +			return error;
> +		if (i) {
> +			*stat = 1;
> +			if (findone)
> +				break;
> +		}
> +		if (!xfs_alloc_cur_active(cur))
> +			break;
> +
> +		if (increment)
> +			error = xfs_btree_increment(cur, 0, &i);
> +		else
> +			error = xfs_btree_decrement(cur, 0, &i);
> +		if (error)
> +			return error;
> +		if (i == 0) {
> +			cur->bc_private.a.priv.abt.active = false;
> +			break;
> +		}
> +	}
> +
> +	return error;
> +}
> +
> +/*
> + * High level locality allocation algorithm. Search the bnobt (left and right)
> + * in parallel with locality-optimized cntbt lookups to find an extent with
> + * ideal locality.
> + */
> +STATIC int
> +xfs_alloc_ag_vextent_lookup(
> +	struct xfs_alloc_arg	*args,
> +	struct xfs_alloc_cur	*acur,
> +	int			*stat)
> +{
> +	int				error;
> +	int				i;
> +	struct xfs_btree_cur		*fbcur = NULL;
> +	bool				fbinc = false;
> +
> +	ASSERT(acur->cnt);
> +	ASSERT(args->type == XFS_ALLOCTYPE_NEAR_BNO);
> +	*stat = 0;
> +
> +	/*
> +	 * Set up available cursors for a locality allocation search based on
> +	 * the ->agbno hint. An exact allocation only needs the bnolt so disable
> +	 * the cntbt cursor explicitly.
> +	 */
> +	error = xfs_alloc_lookup_ge(acur->cnt, args->agbno, acur->cur_len, &i);
> +	if (error)
> +		return error;
> +	error = xfs_alloc_lookup_le(acur->bnolt, args->agbno, 0, &i);
> +	if (error)
> +		return error;
> +	error = xfs_alloc_lookup_ge(acur->bnogt, args->agbno, 0, &i);
> +	if (error)
> +		return error;
> +
> +	/* search as long as we have at least one active cursor */
> +	while (xfs_alloc_cur_active(acur->cnt) ||
> +	       xfs_alloc_cur_active(acur->bnolt) ||
> +	       xfs_alloc_cur_active(acur->bnogt)) {
> +		trace_xfs_alloc_cur_lookup(args);
> +
> +		/*
> +		 * Search the bnobt left and right. If either of these find a
> +		 * maxlen extent, we know we've found ideal locality. Do a
> +		 * capped search in the opposite direction and we're done.

So ... we cap the distance we'll walk the two bno cursors to 1 record,
which I think is key to preventing the "pathological allocation"
mentioned above?

> +		 */
> +		error = xfs_alloc_walk_iter(args, acur, acur->bnolt, false,
> +					    true, 1, &i);
> +		if (error)
> +			return error;
> +		if (i && acur->len == args->maxlen) {
> +			trace_xfs_alloc_cur_left(args);
> +			fbcur = acur->bnogt;
> +			fbinc = true;
> +			break;
> +		}
> +
> +		error = xfs_alloc_walk_iter(args, acur, acur->bnogt, true,
> +					    true, 1, &i);
> +		if (error)
> +			return error;
> +		if (i && acur->len == args->maxlen) {
> +			trace_xfs_alloc_cur_right(args);
> +			fbcur = acur->bnolt;
> +			break;
> +		}
> +
> +		/*
> +		 * Search the cntbt for a maximum sized extent with ideal
> +		 * locality. The lookup search terminates on reaching the end of
> +		 * the cntbt. Break out of the loop if this occurs to throttle
> +		 * the bnobt scans.
> +		 */
> +		error = xfs_alloc_lookup_iter(args, acur, acur->cnt);
> +		if (error)
> +			return error;
> +		if (!xfs_alloc_cur_active(acur->cnt)) {
> +			trace_xfs_alloc_cur_lookup_done(args);

...and I guess this means that now we're racing incremental steps of
three different free space btree cursors to see if any of the three will
find a match with the length we want?

And if all of this fails, then we fall back to scanning backwards with
the cntbt until we find something?

Ok, I think I get how this is a behavior improvement on the old code.

The splitting of the old _near function into smaller pieces makes it a
little easier to read the code, though I sorta wish this was more like
the March(?) version where the splitting happened and /then/ the
behavior change came after, instead of a 33K patch...

--D

> +			if (!acur->len) {
> +				/*
> +				 * Reset the cursor for a minlen search in the
> +				 * caller.
> +				 */
> +				error = xfs_btree_decrement(acur->cnt, 0, &i);
> +				if (error)
> +					return error;
> +				if (i)
> +					acur->cnt->bc_private.a.priv.abt.active = true;
> +			}
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * Perform a best-effort search in the opposite direction from a bnobt
> +	 * allocation.
> +	 */
> +	if (fbcur) {
> +		error = xfs_alloc_walk_iter(args, acur, fbcur, fbinc, true,
> +					    args->mp->m_alloc_mxr[0], &i);
> +		if (error)
> +			return error;
> +	}
> +
> +	if (acur->len)
> +		*stat = 1;
> +
> +	return error;
> +}
>  
>  /*
>   * Deal with the case where only small freespaces remain. Either return the
> @@ -814,6 +1251,113 @@ xfs_alloc_ag_vextent_small(
>  	return error;
>  }
>  
> +/*
> + * Allocate a variable extent near bno in the allocation group agno.
> + * Extent's length (returned in len) will be between minlen and maxlen,
> + * and of the form k * prod + mod unless there's nothing that large.
> + * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
> + */
> +STATIC int
> +xfs_alloc_ag_vextent_near(
> +	struct xfs_alloc_arg	*args)
> +{
> +	struct xfs_alloc_cur	acur = {0,};
> +	int			error;
> +	int			i;
> +	xfs_agblock_t		bno;
> +	xfs_extlen_t		len;
> +
> +	/* handle uninitialized agbno range so caller doesn't have to */
> +	if (!args->min_agbno && !args->max_agbno)
> +		args->max_agbno = args->mp->m_sb.sb_agblocks - 1;
> +	ASSERT(args->min_agbno <= args->max_agbno);
> +
> +	/* clamp agbno to the range if it's outside */
> +	if (args->agbno < args->min_agbno)
> +		args->agbno = args->min_agbno;
> +	if (args->agbno > args->max_agbno)
> +		args->agbno = args->max_agbno;
> +
> +restart:
> +	error = xfs_alloc_cur_setup(args, &acur);
> +	if (error) {
> +		error = (error == -ENOSPC) ? 0 : error;
> +		goto out;
> +	}
> +
> +	/*
> +	 * The cntbt cursor points at the first maxlen or minlen extent. If it
> +	 * resides in the last block of the tree, scan the rest of the block.
> +	 * Otherwise run the optimized lookup search algorithm from the current
> +	 * location to the end of the tree.
> +	 */
> +	if (xfs_btree_islastblock(acur.cnt, 0)) {
> +		int	j;
> +
> +		trace_xfs_alloc_cur_lastblock(args);
> +		error = xfs_alloc_walk_iter(args, &acur, acur.cnt, true, false,
> +				    INT_MAX, &i);
> +		if (error)
> +			goto out;
> +
> +		/*
> +		 * If allocation fails, reset the cursor for a reverse minlen
> +		 * scan. cur_len doesn't change so look up the extent just
> +		 * before the starting point.
> +		 */
> +		if (!i && acur.cur_len > args->minlen)
> +			error = xfs_alloc_lookup_le(acur.cnt, 0, acur.cur_len, &j);
> +	} else {
> +		error = xfs_alloc_ag_vextent_lookup(args, &acur, &i);
> +	}
> +	if (error)
> +		goto out;
> +
> +	/*
> +	 * Ignore locality and search backwards for the first suitable minlen
> +	 * extent. This fails if we run out of minlen extents.
> +	 */
> +	if (!i && xfs_alloc_cur_active(acur.cnt)) {
> +		trace_xfs_alloc_cur_reverse(args);
> +		error = xfs_alloc_walk_iter(args, &acur, acur.cnt, false, true,
> +					    INT_MAX, &i);
> +		if (error)
> +			goto out;
> +	}
> +
> +	if (!i) {
> +		/* flush and retry if we found busy extents */
> +		if (acur.busy) {
> +			trace_xfs_alloc_ag_busy(args);
> +			xfs_extent_busy_flush(args->mp, args->pag,
> +					      acur.busy_gen);
> +			goto restart;
> +		}
> +
> +		/*
> +		 * Try the AGFL as a last resort. Don't pass a cursor so this
> +		 * returns an AGFL block (i == 0) or nothing.
> +		 */
> +		error = xfs_alloc_ag_vextent_small(args, NULL, &bno, &len, &i);
> +		if (error)
> +			goto out;
> +		ASSERT(i == 0 || (i && len == 0));
> +		trace_xfs_alloc_ag_noentry(args);
> +
> +		args->agbno = bno;
> +		args->len = len;
> +		goto out;
> +	}
> +
> +	/* fix up btrees on a successful allocation */
> +	error = xfs_alloc_cur_finish(args, &acur);
> +out:
> +	xfs_alloc_cur_close(&acur, error);
> +	if (error)
> +		trace_xfs_alloc_ag_error(args);
> +	return error;
> +}
> +
>  /*
>   * Allocate a variable extent in the allocation group agno.
>   * Type and bno are used to determine where in the allocation group the
> @@ -1001,503 +1545,6 @@ xfs_alloc_ag_vextent_exact(
>  	return error;
>  }
>  
> -/*
> - * Search the btree in a given direction via the search cursor and compare
> - * the records found against the good extent we've already found.
> - */
> -STATIC int
> -xfs_alloc_find_best_extent(
> -	struct xfs_alloc_arg	*args,	/* allocation argument structure */
> -	struct xfs_btree_cur	**gcur,	/* good cursor */
> -	struct xfs_btree_cur	**scur,	/* searching cursor */
> -	xfs_agblock_t		gdiff,	/* difference for search comparison */
> -	xfs_agblock_t		*sbno,	/* extent found by search */
> -	xfs_extlen_t		*slen,	/* extent length */
> -	xfs_agblock_t		*sbnoa,	/* aligned extent found by search */
> -	xfs_extlen_t		*slena,	/* aligned extent length */
> -	int			dir)	/* 0 = search right, 1 = search left */
> -{
> -	xfs_agblock_t		new;
> -	xfs_agblock_t		sdiff;
> -	int			error;
> -	int			i;
> -	unsigned		busy_gen;
> -
> -	/* The good extent is perfect, no need to  search. */
> -	if (!gdiff)
> -		goto out_use_good;
> -
> -	/*
> -	 * Look until we find a better one, run out of space or run off the end.
> -	 */
> -	do {
> -		error = xfs_alloc_get_rec(*scur, sbno, slen, &i);
> -		if (error)
> -			goto error0;
> -		XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> -		xfs_alloc_compute_aligned(args, *sbno, *slen,
> -				sbnoa, slena, &busy_gen);
> -
> -		/*
> -		 * The good extent is closer than this one.
> -		 */
> -		if (!dir) {
> -			if (*sbnoa > args->max_agbno)
> -				goto out_use_good;
> -			if (*sbnoa >= args->agbno + gdiff)
> -				goto out_use_good;
> -		} else {
> -			if (*sbnoa < args->min_agbno)
> -				goto out_use_good;
> -			if (*sbnoa <= args->agbno - gdiff)
> -				goto out_use_good;
> -		}
> -
> -		/*
> -		 * Same distance, compare length and pick the best.
> -		 */
> -		if (*slena >= args->minlen) {
> -			args->len = XFS_EXTLEN_MIN(*slena, args->maxlen);
> -			xfs_alloc_fix_len(args);
> -
> -			sdiff = xfs_alloc_compute_diff(args->agbno, args->len,
> -						       args->alignment,
> -						       args->datatype, *sbnoa,
> -						       *slena, &new);
> -
> -			/*
> -			 * Choose closer size and invalidate other cursor.
> -			 */
> -			if (sdiff < gdiff)
> -				goto out_use_search;
> -			goto out_use_good;
> -		}
> -
> -		if (!dir)
> -			error = xfs_btree_increment(*scur, 0, &i);
> -		else
> -			error = xfs_btree_decrement(*scur, 0, &i);
> -		if (error)
> -			goto error0;
> -	} while (i);
> -
> -out_use_good:
> -	xfs_btree_del_cursor(*scur, XFS_BTREE_NOERROR);
> -	*scur = NULL;
> -	return 0;
> -
> -out_use_search:
> -	xfs_btree_del_cursor(*gcur, XFS_BTREE_NOERROR);
> -	*gcur = NULL;
> -	return 0;
> -
> -error0:
> -	/* caller invalidates cursors */
> -	return error;
> -}
> -
> -/*
> - * Allocate a variable extent near bno in the allocation group agno.
> - * Extent's length (returned in len) will be between minlen and maxlen,
> - * and of the form k * prod + mod unless there's nothing that large.
> - * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
> - */
> -STATIC int				/* error */
> -xfs_alloc_ag_vextent_near(
> -	xfs_alloc_arg_t	*args)		/* allocation argument structure */
> -{
> -	xfs_btree_cur_t	*bno_cur_gt;	/* cursor for bno btree, right side */
> -	xfs_btree_cur_t	*bno_cur_lt;	/* cursor for bno btree, left side */
> -	xfs_btree_cur_t	*cnt_cur;	/* cursor for count btree */
> -	xfs_agblock_t	gtbno;		/* start bno of right side entry */
> -	xfs_agblock_t	gtbnoa;		/* aligned ... */
> -	xfs_extlen_t	gtdiff;		/* difference to right side entry */
> -	xfs_extlen_t	gtlen;		/* length of right side entry */
> -	xfs_extlen_t	gtlena;		/* aligned ... */
> -	xfs_agblock_t	gtnew;		/* useful start bno of right side */
> -	int		error;		/* error code */
> -	int		i;		/* result code, temporary */
> -	int		j;		/* result code, temporary */
> -	xfs_agblock_t	ltbno;		/* start bno of left side entry */
> -	xfs_agblock_t	ltbnoa;		/* aligned ... */
> -	xfs_extlen_t	ltdiff;		/* difference to left side entry */
> -	xfs_extlen_t	ltlen;		/* length of left side entry */
> -	xfs_extlen_t	ltlena;		/* aligned ... */
> -	xfs_agblock_t	ltnew;		/* useful start bno of left side */
> -	xfs_extlen_t	rlen;		/* length of returned extent */
> -	bool		busy;
> -	unsigned	busy_gen;
> -#ifdef DEBUG
> -	/*
> -	 * Randomly don't execute the first algorithm.
> -	 */
> -	int		dofirst;	/* set to do first algorithm */
> -
> -	dofirst = prandom_u32() & 1;
> -#endif
> -
> -	/* handle unitialized agbno range so caller doesn't have to */
> -	if (!args->min_agbno && !args->max_agbno)
> -		args->max_agbno = args->mp->m_sb.sb_agblocks - 1;
> -	ASSERT(args->min_agbno <= args->max_agbno);
> -
> -	/* clamp agbno to the range if it's outside */
> -	if (args->agbno < args->min_agbno)
> -		args->agbno = args->min_agbno;
> -	if (args->agbno > args->max_agbno)
> -		args->agbno = args->max_agbno;
> -
> -restart:
> -	bno_cur_lt = NULL;
> -	bno_cur_gt = NULL;
> -	ltlen = 0;
> -	gtlena = 0;
> -	ltlena = 0;
> -	busy = false;
> -
> -	/*
> -	 * Get a cursor for the by-size btree.
> -	 */
> -	cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
> -		args->agno, XFS_BTNUM_CNT);
> -
> -	/*
> -	 * See if there are any free extents as big as maxlen.
> -	 */
> -	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
> -		goto error0;
> -	/*
> -	 * If none, then pick up the last entry in the tree unless the
> -	 * tree is empty.
> -	 */
> -	if (!i) {
> -		if ((error = xfs_alloc_ag_vextent_small(args, cnt_cur, &ltbno,
> -				&ltlen, &i)))
> -			goto error0;
> -		if (i == 0 || ltlen == 0) {
> -			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> -			trace_xfs_alloc_near_noentry(args);
> -			return 0;
> -		}
> -		ASSERT(i == 1);
> -	}
> -	args->wasfromfl = 0;
> -
> -	/*
> -	 * First algorithm.
> -	 * If the requested extent is large wrt the freespaces available
> -	 * in this a.g., then the cursor will be pointing to a btree entry
> -	 * near the right edge of the tree.  If it's in the last btree leaf
> -	 * block, then we just examine all the entries in that block
> -	 * that are big enough, and pick the best one.
> -	 * This is written as a while loop so we can break out of it,
> -	 * but we never loop back to the top.
> -	 */
> -	while (xfs_btree_islastblock(cnt_cur, 0)) {
> -		xfs_extlen_t	bdiff;
> -		int		besti=0;
> -		xfs_extlen_t	blen=0;
> -		xfs_agblock_t	bnew=0;
> -
> -#ifdef DEBUG
> -		if (dofirst)
> -			break;
> -#endif
> -		/*
> -		 * Start from the entry that lookup found, sequence through
> -		 * all larger free blocks.  If we're actually pointing at a
> -		 * record smaller than maxlen, go to the start of this block,
> -		 * and skip all those smaller than minlen.
> -		 */
> -		if (ltlen || args->alignment > 1) {
> -			cnt_cur->bc_ptrs[0] = 1;
> -			do {
> -				if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno,
> -						&ltlen, &i)))
> -					goto error0;
> -				XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> -				if (ltlen >= args->minlen)
> -					break;
> -				if ((error = xfs_btree_increment(cnt_cur, 0, &i)))
> -					goto error0;
> -			} while (i);
> -			ASSERT(ltlen >= args->minlen);
> -			if (!i)
> -				break;
> -		}
> -		i = cnt_cur->bc_ptrs[0];
> -		for (j = 1, blen = 0, bdiff = 0;
> -		     !error && j && (blen < args->maxlen || bdiff > 0);
> -		     error = xfs_btree_increment(cnt_cur, 0, &j)) {
> -			/*
> -			 * For each entry, decide if it's better than
> -			 * the previous best entry.
> -			 */
> -			if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno, &ltlen, &i)))
> -				goto error0;
> -			XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> -			busy = xfs_alloc_compute_aligned(args, ltbno, ltlen,
> -					&ltbnoa, &ltlena, &busy_gen);
> -			if (ltlena < args->minlen)
> -				continue;
> -			if (ltbnoa < args->min_agbno || ltbnoa > args->max_agbno)
> -				continue;
> -			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
> -			xfs_alloc_fix_len(args);
> -			ASSERT(args->len >= args->minlen);
> -			if (args->len < blen)
> -				continue;
> -			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
> -				args->alignment, args->datatype, ltbnoa,
> -				ltlena, &ltnew);
> -			if (ltnew != NULLAGBLOCK &&
> -			    (args->len > blen || ltdiff < bdiff)) {
> -				bdiff = ltdiff;
> -				bnew = ltnew;
> -				blen = args->len;
> -				besti = cnt_cur->bc_ptrs[0];
> -			}
> -		}
> -		/*
> -		 * It didn't work.  We COULD be in a case where
> -		 * there's a good record somewhere, so try again.
> -		 */
> -		if (blen == 0)
> -			break;
> -		/*
> -		 * Point at the best entry, and retrieve it again.
> -		 */
> -		cnt_cur->bc_ptrs[0] = besti;
> -		if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno, &ltlen, &i)))
> -			goto error0;
> -		XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> -		ASSERT(ltbno + ltlen <= be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length));
> -		args->len = blen;
> -
> -		/*
> -		 * We are allocating starting at bnew for blen blocks.
> -		 */
> -		args->agbno = bnew;
> -		ASSERT(bnew >= ltbno);
> -		ASSERT(bnew + blen <= ltbno + ltlen);
> -		/*
> -		 * Set up a cursor for the by-bno tree.
> -		 */
> -		bno_cur_lt = xfs_allocbt_init_cursor(args->mp, args->tp,
> -			args->agbp, args->agno, XFS_BTNUM_BNO);
> -		/*
> -		 * Fix up the btree entries.
> -		 */
> -		if ((error = xfs_alloc_fixup_trees(cnt_cur, bno_cur_lt, ltbno,
> -				ltlen, bnew, blen, XFSA_FIXUP_CNT_OK)))
> -			goto error0;
> -		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> -		xfs_btree_del_cursor(bno_cur_lt, XFS_BTREE_NOERROR);
> -
> -		trace_xfs_alloc_near_first(args);
> -		return 0;
> -	}
> -	/*
> -	 * Second algorithm.
> -	 * Search in the by-bno tree to the left and to the right
> -	 * simultaneously, until in each case we find a space big enough,
> -	 * or run into the edge of the tree.  When we run into the edge,
> -	 * we deallocate that cursor.
> -	 * If both searches succeed, we compare the two spaces and pick
> -	 * the better one.
> -	 * With alignment, it's possible for both to fail; the upper
> -	 * level algorithm that picks allocation groups for allocations
> -	 * is not supposed to do this.
> -	 */
> -	/*
> -	 * Allocate and initialize the cursor for the leftward search.
> -	 */
> -	bno_cur_lt = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
> -		args->agno, XFS_BTNUM_BNO);
> -	/*
> -	 * Lookup <= bno to find the leftward search's starting point.
> -	 */
> -	if ((error = xfs_alloc_lookup_le(bno_cur_lt, args->agbno, args->maxlen, &i)))
> -		goto error0;
> -	if (!i) {
> -		/*
> -		 * Didn't find anything; use this cursor for the rightward
> -		 * search.
> -		 */
> -		bno_cur_gt = bno_cur_lt;
> -		bno_cur_lt = NULL;
> -	}
> -	/*
> -	 * Found something.  Duplicate the cursor for the rightward search.
> -	 */
> -	else if ((error = xfs_btree_dup_cursor(bno_cur_lt, &bno_cur_gt)))
> -		goto error0;
> -	/*
> -	 * Increment the cursor, so we will point at the entry just right
> -	 * of the leftward entry if any, or to the leftmost entry.
> -	 */
> -	if ((error = xfs_btree_increment(bno_cur_gt, 0, &i)))
> -		goto error0;
> -	if (!i) {
> -		/*
> -		 * It failed, there are no rightward entries.
> -		 */
> -		xfs_btree_del_cursor(bno_cur_gt, XFS_BTREE_NOERROR);
> -		bno_cur_gt = NULL;
> -	}
> -	/*
> -	 * Loop going left with the leftward cursor, right with the
> -	 * rightward cursor, until either both directions give up or
> -	 * we find an entry at least as big as minlen.
> -	 */
> -	do {
> -		if (bno_cur_lt) {
> -			if ((error = xfs_alloc_get_rec(bno_cur_lt, &ltbno, &ltlen, &i)))
> -				goto error0;
> -			XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> -			busy |= xfs_alloc_compute_aligned(args, ltbno, ltlen,
> -					&ltbnoa, &ltlena, &busy_gen);
> -			if (ltlena >= args->minlen && ltbnoa >= args->min_agbno)
> -				break;
> -			if ((error = xfs_btree_decrement(bno_cur_lt, 0, &i)))
> -				goto error0;
> -			if (!i || ltbnoa < args->min_agbno) {
> -				xfs_btree_del_cursor(bno_cur_lt,
> -						     XFS_BTREE_NOERROR);
> -				bno_cur_lt = NULL;
> -			}
> -		}
> -		if (bno_cur_gt) {
> -			if ((error = xfs_alloc_get_rec(bno_cur_gt, &gtbno, &gtlen, &i)))
> -				goto error0;
> -			XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> -			busy |= xfs_alloc_compute_aligned(args, gtbno, gtlen,
> -					&gtbnoa, &gtlena, &busy_gen);
> -			if (gtlena >= args->minlen && gtbnoa <= args->max_agbno)
> -				break;
> -			if ((error = xfs_btree_increment(bno_cur_gt, 0, &i)))
> -				goto error0;
> -			if (!i || gtbnoa > args->max_agbno) {
> -				xfs_btree_del_cursor(bno_cur_gt,
> -						     XFS_BTREE_NOERROR);
> -				bno_cur_gt = NULL;
> -			}
> -		}
> -	} while (bno_cur_lt || bno_cur_gt);
> -
> -	/*
> -	 * Got both cursors still active, need to find better entry.
> -	 */
> -	if (bno_cur_lt && bno_cur_gt) {
> -		if (ltlena >= args->minlen) {
> -			/*
> -			 * Left side is good, look for a right side entry.
> -			 */
> -			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
> -			xfs_alloc_fix_len(args);
> -			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
> -				args->alignment, args->datatype, ltbnoa,
> -				ltlena, &ltnew);
> -
> -			error = xfs_alloc_find_best_extent(args,
> -						&bno_cur_lt, &bno_cur_gt,
> -						ltdiff, &gtbno, &gtlen,
> -						&gtbnoa, &gtlena,
> -						0 /* search right */);
> -		} else {
> -			ASSERT(gtlena >= args->minlen);
> -
> -			/*
> -			 * Right side is good, look for a left side entry.
> -			 */
> -			args->len = XFS_EXTLEN_MIN(gtlena, args->maxlen);
> -			xfs_alloc_fix_len(args);
> -			gtdiff = xfs_alloc_compute_diff(args->agbno, args->len,
> -				args->alignment, args->datatype, gtbnoa,
> -				gtlena, &gtnew);
> -
> -			error = xfs_alloc_find_best_extent(args,
> -						&bno_cur_gt, &bno_cur_lt,
> -						gtdiff, &ltbno, &ltlen,
> -						&ltbnoa, &ltlena,
> -						1 /* search left */);
> -		}
> -
> -		if (error)
> -			goto error0;
> -	}
> -
> -	/*
> -	 * If we couldn't get anything, give up.
> -	 */
> -	if (bno_cur_lt == NULL && bno_cur_gt == NULL) {
> -		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> -
> -		if (busy) {
> -			trace_xfs_alloc_near_busy(args);
> -			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> -			goto restart;
> -		}
> -		trace_xfs_alloc_size_neither(args);
> -		args->agbno = NULLAGBLOCK;
> -		return 0;
> -	}
> -
> -	/*
> -	 * At this point we have selected a freespace entry, either to the
> -	 * left or to the right.  If it's on the right, copy all the
> -	 * useful variables to the "left" set so we only have one
> -	 * copy of this code.
> -	 */
> -	if (bno_cur_gt) {
> -		bno_cur_lt = bno_cur_gt;
> -		bno_cur_gt = NULL;
> -		ltbno = gtbno;
> -		ltbnoa = gtbnoa;
> -		ltlen = gtlen;
> -		ltlena = gtlena;
> -		j = 1;
> -	} else
> -		j = 0;
> -
> -	/*
> -	 * Fix up the length and compute the useful address.
> -	 */
> -	args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
> -	xfs_alloc_fix_len(args);
> -	rlen = args->len;
> -	(void)xfs_alloc_compute_diff(args->agbno, rlen, args->alignment,
> -				     args->datatype, ltbnoa, ltlena, &ltnew);
> -	ASSERT(ltnew >= ltbno);
> -	ASSERT(ltnew + rlen <= ltbnoa + ltlena);
> -	ASSERT(ltnew + rlen <= be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length));
> -	ASSERT(ltnew >= args->min_agbno && ltnew <= args->max_agbno);
> -	args->agbno = ltnew;
> -
> -	if ((error = xfs_alloc_fixup_trees(cnt_cur, bno_cur_lt, ltbno, ltlen,
> -			ltnew, rlen, XFSA_FIXUP_BNO_OK)))
> -		goto error0;
> -
> -	if (j)
> -		trace_xfs_alloc_near_greater(args);
> -	else
> -		trace_xfs_alloc_near_lesser(args);
> -
> -	xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> -	xfs_btree_del_cursor(bno_cur_lt, XFS_BTREE_NOERROR);
> -	return 0;
> -
> - error0:
> -	trace_xfs_alloc_near_error(args);
> -	if (cnt_cur != NULL)
> -		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_ERROR);
> -	if (bno_cur_lt != NULL)
> -		xfs_btree_del_cursor(bno_cur_lt, XFS_BTREE_ERROR);
> -	if (bno_cur_gt != NULL)
> -		xfs_btree_del_cursor(bno_cur_gt, XFS_BTREE_ERROR);
> -	return error;
> -}
> -
>  /*
>   * Allocate a variable extent anywhere in the allocation group agno.
>   * Extent's length (returned in len) will be between minlen and maxlen,
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 8094b1920eef..a441a472da79 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -1635,14 +1635,16 @@ DEFINE_EVENT(xfs_alloc_class, name, \
>  DEFINE_ALLOC_EVENT(xfs_alloc_exact_done);
>  DEFINE_ALLOC_EVENT(xfs_alloc_exact_notfound);
>  DEFINE_ALLOC_EVENT(xfs_alloc_exact_error);
> -DEFINE_ALLOC_EVENT(xfs_alloc_near_nominleft);
> -DEFINE_ALLOC_EVENT(xfs_alloc_near_first);
> -DEFINE_ALLOC_EVENT(xfs_alloc_near_greater);
> -DEFINE_ALLOC_EVENT(xfs_alloc_near_lesser);
> -DEFINE_ALLOC_EVENT(xfs_alloc_near_error);
> -DEFINE_ALLOC_EVENT(xfs_alloc_near_noentry);
> -DEFINE_ALLOC_EVENT(xfs_alloc_near_busy);
> -DEFINE_ALLOC_EVENT(xfs_alloc_size_neither);
> +DEFINE_ALLOC_EVENT(xfs_alloc_ag_error);
> +DEFINE_ALLOC_EVENT(xfs_alloc_ag_noentry);
> +DEFINE_ALLOC_EVENT(xfs_alloc_ag_busy);
> +DEFINE_ALLOC_EVENT(xfs_alloc_cur);
> +DEFINE_ALLOC_EVENT(xfs_alloc_cur_lastblock);
> +DEFINE_ALLOC_EVENT(xfs_alloc_cur_left);
> +DEFINE_ALLOC_EVENT(xfs_alloc_cur_right);
> +DEFINE_ALLOC_EVENT(xfs_alloc_cur_lookup);
> +DEFINE_ALLOC_EVENT(xfs_alloc_cur_lookup_done);
> +DEFINE_ALLOC_EVENT(xfs_alloc_cur_reverse);
>  DEFINE_ALLOC_EVENT(xfs_alloc_size_noentry);
>  DEFINE_ALLOC_EVENT(xfs_alloc_size_nominleft);
>  DEFINE_ALLOC_EVENT(xfs_alloc_size_done);
> @@ -1658,6 +1660,31 @@ DEFINE_ALLOC_EVENT(xfs_alloc_vextent_noagbp);
>  DEFINE_ALLOC_EVENT(xfs_alloc_vextent_loopfailed);
>  DEFINE_ALLOC_EVENT(xfs_alloc_vextent_allfailed);
>  
> +TRACE_EVENT(xfs_alloc_cur_check,
> +	TP_PROTO(struct xfs_mount *mp, xfs_btnum_t btnum, xfs_agblock_t bno,
> +		 xfs_extlen_t len, xfs_extlen_t diff, bool new),
> +	TP_ARGS(mp, btnum, bno, len, diff, new),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_btnum_t, btnum)
> +		__field(xfs_agblock_t, bno)
> +		__field(xfs_extlen_t, len)
> +		__field(xfs_extlen_t, diff)
> +		__field(bool, new)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = mp->m_super->s_dev;
> +		__entry->btnum = btnum;
> +		__entry->bno = bno;
> +		__entry->len = len;
> +		__entry->diff = diff;
> +		__entry->new = new;
> +	),
> +	TP_printk("dev %d:%d btnum %d bno 0x%x len 0x%x diff 0x%x new %d",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->btnum,
> +		  __entry->bno, __entry->len, __entry->diff, __entry->new)
> +)
> +
>  DECLARE_EVENT_CLASS(xfs_da_class,
>  	TP_PROTO(struct xfs_da_args *args),
>  	TP_ARGS(args),
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 3/4] xfs: randomly fall back to near mode lookup algorithm in debug mode
  2019-08-15 12:55 ` [PATCH v3 3/4] xfs: randomly fall back to near mode lookup algorithm in debug mode Brian Foster
@ 2019-08-17  1:37   ` Darrick J. Wong
  2019-08-19 18:19     ` Brian Foster
  0 siblings, 1 reply; 12+ messages in thread
From: Darrick J. Wong @ 2019-08-17  1:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Aug 15, 2019 at 08:55:37AM -0400, Brian Foster wrote:
> The last block scan is the dominant near mode allocation algorithm
> for a newer filesystem with fewer, large free extents. Add debug
> mode logic to randomly fall back to lookup mode to improve
> regression test coverage.

How about just using an errortag since the new sysfs interface lets
testcases / admins control the frequency?
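
Something like the following, maybe?  (XFS_ERRTAG_ALLOC_LOOKUP is made
up here and would need the usual errortag plumbing -- a tag number, a
knob name in xfs_errortag_attrs, a default frequency:)

	lastblock = xfs_btree_islastblock(acur.cnt, 0);
	/* error injection knocks us off the last block scan path */
	if (XFS_TEST_ERROR(false, args->mp, XFS_ERRTAG_ALLOC_LOOKUP))
		lastblock = false;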

--D

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 7753b61ba532..d550aa5597bf 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -1266,6 +1266,7 @@ xfs_alloc_ag_vextent_near(
>  	int			i;
>  	xfs_agblock_t		bno;
>  	xfs_extlen_t		len;
> +	bool			lastblock;
>  
>  	/* handle uninitialized agbno range so caller doesn't have to */
>  	if (!args->min_agbno && !args->max_agbno)
> @@ -1291,7 +1292,12 @@ xfs_alloc_ag_vextent_near(
>  	 * Otherwise run the optimized lookup search algorithm from the current
>  	 * location to the end of the tree.
>  	 */
> -	if (xfs_btree_islastblock(acur.cnt, 0)) {
> +	lastblock = xfs_btree_islastblock(acur.cnt, 0);
> +#ifdef DEBUG
> +	if (lastblock)
> +		lastblock = prandom_u32() & 1;
> +#endif
> +	if (lastblock) {
>  		int	j;
>  
>  		trace_xfs_alloc_cur_lastblock(args);
> -- 
> 2.20.1
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 1/4] xfs: track active state of allocation btree cursors
  2019-08-17  0:51   ` Darrick J. Wong
@ 2019-08-19 18:03     ` Brian Foster
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2019-08-19 18:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Aug 16, 2019 at 05:51:49PM -0700, Darrick J. Wong wrote:
> On Thu, Aug 15, 2019 at 08:55:35AM -0400, Brian Foster wrote:
> > The upcoming allocation algorithm update searches multiple
> > allocation btree cursors concurrently. As such, it requires an
> > active state to track when a particular cursor should continue
> > searching. While active state will be modified based on higher level
> > logic, we can define base functionality based on the result of
> > allocation btree lookups.
> > 
> > Define an active flag in the private area of the btree cursor.
> > Update it based on the result of lookups in the existing allocation
> > btree helpers. Finally, provide a new helper to query the current
> > state.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c       | 24 +++++++++++++++++++++---
> >  fs/xfs/libxfs/xfs_alloc_btree.c |  1 +
> >  fs/xfs/libxfs/xfs_btree.h       |  3 +++
> >  3 files changed, 25 insertions(+), 3 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 372ad55631fc..6340f59ac3f4 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -146,9 +146,13 @@ xfs_alloc_lookup_eq(
> >  	xfs_extlen_t		len,	/* length of extent */
> >  	int			*stat)	/* success/failure */
> >  {
> > +	int			error;
> > +
> >  	cur->bc_rec.a.ar_startblock = bno;
> >  	cur->bc_rec.a.ar_blockcount = len;
> > -	return xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
> > +	error = xfs_btree_lookup(cur, XFS_LOOKUP_EQ, stat);
> > +	cur->bc_private.a.priv.abt.active = *stat;
> 
> <urk> Not really a fan of mixing types (even if they are bool and int),
> how hard would it be to convert some of these *stat to bool?
> 

Hmm... that might be slightly annoying because some of these functions
use an 'i' variable and pass it all around to various other similar
helpers that expect an int. I was trying to break that mold by using a
boolean here, but not necessarily tie in this kind of refactoring as a
dependency for this work, particularly since it will ultimately replace
much of the allocator code. 

I could either code this without the direct assignment between int/bool
or just use an int for 'active' for now (and even rename it to stat if
that's more clear) and switch both over to bool once the other
allocation modes are done. Thoughts?
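
(For the former I mean something like:

	cur->bc_private.a.priv.abt.active = (*stat == 1);

... which at least makes the int to bool conversion explicit.)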

> Does abt.active have a use outside of the struct xfs_alloc_cur in the
> next patch?
> 

No. It was originally a field in the xfs_alloc_cur and then pushed down
into the cursor private area based on feedback on previous versions of
this set.

Brian

> --D
> 
> > +	return error;
> >  }
> >  
> >  /*
> > @@ -162,9 +166,13 @@ xfs_alloc_lookup_ge(
> >  	xfs_extlen_t		len,	/* length of extent */
> >  	int			*stat)	/* success/failure */
> >  {
> > +	int			error;
> > +
> >  	cur->bc_rec.a.ar_startblock = bno;
> >  	cur->bc_rec.a.ar_blockcount = len;
> > -	return xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
> > +	error = xfs_btree_lookup(cur, XFS_LOOKUP_GE, stat);
> > +	cur->bc_private.a.priv.abt.active = *stat;
> > +	return error;
> >  }
> >  
> >  /*
> > @@ -178,9 +186,19 @@ xfs_alloc_lookup_le(
> >  	xfs_extlen_t		len,	/* length of extent */
> >  	int			*stat)	/* success/failure */
> >  {
> > +	int			error;
> >  	cur->bc_rec.a.ar_startblock = bno;
> >  	cur->bc_rec.a.ar_blockcount = len;
> > -	return xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
> > +	error = xfs_btree_lookup(cur, XFS_LOOKUP_LE, stat);
> > +	cur->bc_private.a.priv.abt.active = *stat;
> > +	return error;
> > +}
> > +
> > +static inline bool
> > +xfs_alloc_cur_active(
> > +	struct xfs_btree_cur	*cur)
> > +{
> > +	return cur && cur->bc_private.a.priv.abt.active;
> >  }
> >  
> >  /*
> > diff --git a/fs/xfs/libxfs/xfs_alloc_btree.c b/fs/xfs/libxfs/xfs_alloc_btree.c
> > index 2a94543857a1..279694d73e4e 100644
> > --- a/fs/xfs/libxfs/xfs_alloc_btree.c
> > +++ b/fs/xfs/libxfs/xfs_alloc_btree.c
> > @@ -507,6 +507,7 @@ xfs_allocbt_init_cursor(
> >  
> >  	cur->bc_private.a.agbp = agbp;
> >  	cur->bc_private.a.agno = agno;
> > +	cur->bc_private.a.priv.abt.active = false;
> >  
> >  	if (xfs_sb_version_hascrc(&mp->m_sb))
> >  		cur->bc_flags |= XFS_BTREE_CRC_BLOCKS;
> > diff --git a/fs/xfs/libxfs/xfs_btree.h b/fs/xfs/libxfs/xfs_btree.h
> > index fa3cd8ab9aba..a66063c356cc 100644
> > --- a/fs/xfs/libxfs/xfs_btree.h
> > +++ b/fs/xfs/libxfs/xfs_btree.h
> > @@ -183,6 +183,9 @@ union xfs_btree_cur_private {
> >  		unsigned long	nr_ops;		/* # record updates */
> >  		int		shape_changes;	/* # of extent splits */
> >  	} refc;
> > +	struct {
> > +		bool		active;		/* allocation cursor state */
> > +	} abt;
> >  };
> >  
> >  /*
> > -- 
> > 2.20.1
> > 


* Re: [PATCH v3 2/4] xfs: use locality optimized cntbt lookups for near mode allocations
  2019-08-17  1:34   ` Darrick J. Wong
@ 2019-08-19 18:06     ` Brian Foster
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2019-08-19 18:06 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Aug 16, 2019 at 06:34:59PM -0700, Darrick J. Wong wrote:
> Whee, a huge patch! :)
> 

Heh.

> On Thu, Aug 15, 2019 at 08:55:36AM -0400, Brian Foster wrote:
> > The extent allocation code in XFS has several allocation modes with
> > unique implementations. These modes are not all that different from
> > a high level perspective. The most involved mode is near allocation
> > mode which attempts to allocate an optimally sized extent with ideal
> > locality with respect to a provided agbno hint.
> > 
> > In the common case, a near mode allocation consists of a conditional
> > scan of the last cntbt block followed by a concurrent left and right
> > spanning search of the bnobt starting from the ideal point of
> > locality in the bnobt. This works reasonably well as filesystems age
> > via most common allocation patterns. If free space fragments as the
> > filesystem ages, however, the near algorithm has very poor breakdown
> > characteristics. If the extent size lookup happens to land outside
> > (i.e., before) the last cntbt block, the alloc bypasses the cntbt
> > entirely. If a suitably sized extent is far enough away from the
> > starting points of the bnobt search, the left/right scans can take a
> > significant amount of time to locate the target extent. This leads
> > to pathological allocation latencies in certain workloads.
> > 
> > While locality is important to near mode allocations, it is not so
> > important as to incur pathological allocation latency to provide the
> > absolute best locality for every allocation. The left/right bnobt
> > scan is inefficient in many large allocation scenarios. As an
> > alternative, we can use locality optimized lookups of the cntbt to
> > find the closest extent to the target agbno of a particular size. We
> > can repeat this lookup to cover a larger span of extents much more
> > efficiently. Finally, we can combine this cntbt lookup algorithm
> > with the existing bnobt scan to provide a natural balance between
> > the two for large and small allocations. For example, the bnobt scan
> > may be able to satisfy inode chunk or btree block allocations fairly
> > quickly where the cntbt search may have to search through a large
> > set of extent sizes. On the other hand, the cntbt search can more
> > deterministically scan the set of free extents available for much
> > larger delayed allocation requests where the bnobt scan may have to
> > search the entire AG.
> 
> I sure wish this lengthy description was being put into xfs_alloc.c
> itself.  It would be nice to have some pictures of how this is supposed
> to work too.
> 

I can try to filter some of this down into the functions that implement
the appropriate bits of the algorithm. I'm not sure what a picture might
look like though... can you elaborate on what you're looking for here?

> > Rework the near mode allocation algorithm as described above to
> > provide predictable allocation latency under breakdown conditions
> > with good enough locality in the common case. In doing so, refactor
> > the affected code into a more generic interface and mechanisms to
> > facilitate future reuse by the by-size and exact mode algorithms.
> > Both allocation modes currently have unique, ground-up
> > implementations that can be replaced with minimal logic additions to
> > the generic code in this patch.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c | 1045 +++++++++++++++++++------------------
> >  fs/xfs/xfs_trace.h        |   43 +-
> >  2 files changed, 581 insertions(+), 507 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 6340f59ac3f4..7753b61ba532 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
...
> > @@ -710,8 +709,446 @@ xfs_alloc_update_counters(
...
> > +
> > +/*
> > + * Locality allocation lookup algorithm. This expects a cntbt cursor and uses
> > + * bno optimized lookup to search for extents with ideal size and locality.
> 
> Is it supposed to be the case that acur->cnt == cur?  Why not omit the
> third parameter?
> 

Yeah, it was originally written to facilitate reuse but didn't end up
that way. I should be able to just drop the cursor parameter and grab
acur->cnt internally, as in the sketch below. I might rename it to
something like cntbt_lookup_iter() too just so it's clear, FWIW.
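
I.e. (sketch only; final name TBD):

	STATIC int
	xfs_alloc_cntbt_lookup_iter(
		struct xfs_alloc_arg		*args,
		struct xfs_alloc_cur		*acur)
	{
		struct xfs_btree_cur		*cur = acur->cnt;

		/* ... body otherwise unchanged ... */
	}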

> > + */
> > +STATIC int
> > +xfs_alloc_lookup_iter(
> > +	struct xfs_alloc_arg		*args,
> > +	struct xfs_alloc_cur		*acur,
> > +	struct xfs_btree_cur		*cur)
> > +{
> > +	xfs_agblock_t		bno;
> > +	xfs_extlen_t		len, cur_len;
> > +	int			error;
> > +	int			i;
> > +
> > +	if (!xfs_alloc_cur_active(cur))
> > +		return 0;
> > +
> > +	/* locality optimized lookup */
> > +	cur_len = acur->cur_len;
> > +	error = xfs_alloc_lookup_ge(cur, args->agbno, cur_len, &i);
> > +	if (error)
> > +		return error;
> > +	if (i == 0)
> > +		return 0;
> > +	error = xfs_alloc_get_rec(cur, &bno, &len, &i);
> > +	if (error)
> > +		return error;
> > +
> > +	/* check the current record and update search length from it */
> > +	error = xfs_alloc_cur_check(args, acur, cur, &i);
> > +	if (error)
> > +		return error;
> > +	ASSERT(len >= acur->cur_len);
> > +	acur->cur_len = len;
> > +
> > +	/*
> > +	 * We looked up the first record >= [agbno, len] above. The agbno is a
> > +	 * secondary key and so the current record may lie just before or after
> > +	 * agbno. If it is past agbno, check the previous record too so long as
> > +	 * the length matches as it may be closer. Don't check a smaller record
> > +	 * because that could deactivate our cursor.
> > +	 */
> > +	if (bno > args->agbno) {
> > +		error = xfs_btree_decrement(cur, 0, &i);
> > +		if (!error && i) {
> > +			error = xfs_alloc_get_rec(cur, &bno, &len, &i);
> > +			if (!error && i && len == acur->cur_len)
> > +				error = xfs_alloc_cur_check(args, acur, cur,
> > +							    &i);
> > +		}
> > +		if (error)
> > +			return error;
> > +	}
> > +
> > +	/*
> > +	 * Increment the search key until we find at least one allocation
> > +	 * candidate or if the extent we found was larger. Otherwise, double the
> > +	 * search key to optimize the search. Efficiency is more important here
> > +	 * than absolute best locality.
> > +	 */
> > +	cur_len <<= 1;
> > +	if (!acur->len || acur->cur_len >= cur_len)
> > +		acur->cur_len++;
> > +	else
> > +		acur->cur_len = cur_len;
> > +
> > +	return error;
> > +}
> > +
> > +/*
> > + * Incremental lookup algorithm. Walk a btree in either direction looking for
> > + * candidate extents. This works for either bnobt (locality allocation) or cntbt
> > + * (by-size allocation) cursors.
> > + */
> > +STATIC int
> > +xfs_alloc_walk_iter(
> > +	struct xfs_alloc_arg		*args,
> > +	struct xfs_alloc_cur		*acur,
> > +	struct xfs_btree_cur		*cur,
> > +	bool				increment,
> > +	bool				findone,
> > +	int				iters,
> 
> Err... what do @findone and @iters do?
> 
> "If @findone, return the first thing we find" and "Only search @iters
> steps away"?  Maybe the two bools should be flags?
> 

Yeah.. basically @iters caps how many records we visit and @findone
controls whether to end the search as soon as we find an allocation
candidate. It doesn't matter to me whether we use bools or flags, but
how do you prefer to define flags for the increment/decrement control?
Increment by default unless XFS_ALLOC_DECREMENT or some such flag is
passed..? Vice versa?
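
E.g., with decrement as the non-default direction, something like this
(flag names made up, just to illustrate the calling convention):

	/* xfs_alloc_walk_iter() flags */
	#define XFS_ALLOC_FLAG_DECREMENT	(1 << 0)	/* walk the tree backwards */
	#define XFS_ALLOC_FLAG_FINDONE		(1 << 1)	/* stop at first candidate */

...so the left/right bnobt scans would become:

	error = xfs_alloc_walk_iter(args, acur, acur->bnolt,
			XFS_ALLOC_FLAG_DECREMENT | XFS_ALLOC_FLAG_FINDONE,
			1, &i);
	...
	error = xfs_alloc_walk_iter(args, acur, acur->bnogt,
			XFS_ALLOC_FLAG_FINDONE, 1, &i);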

> > +	int				*stat)
> > +{
> > +	int			error;
> > +	int			i;
> > +
> > +	*stat = 0;
> > +
> > +	if (!xfs_alloc_cur_active(cur))
> > +		return 0;
> > +
> > +	for (; iters > 0; iters--) {
> > +		error = xfs_alloc_cur_check(args, acur, cur, &i);
> > +		if (error)
> > +			return error;
> > +		if (i) {
> > +			*stat = 1;
> > +			if (findone)
> > +				break;
> > +		}
> > +		if (!xfs_alloc_cur_active(cur))
> > +			break;
> > +
> > +		if (increment)
> > +			error = xfs_btree_increment(cur, 0, &i);
> > +		else
> > +			error = xfs_btree_decrement(cur, 0, &i);
> > +		if (error)
> > +			return error;
> > +		if (i == 0) {
> > +			cur->bc_private.a.priv.abt.active = false;
> > +			break;
> > +		}
> > +	}
> > +
> > +	return error;
> > +}
> > +
> > +/*
> > + * High level locality allocation algorithm. Search the bnobt (left and right)
> > + * in parallel with locality-optimized cntbt lookups to find an extent with
> > + * ideal locality.
> > + */
> > +STATIC int
> > +xfs_alloc_ag_vextent_lookup(
> > +	struct xfs_alloc_arg	*args,
> > +	struct xfs_alloc_cur	*acur,
> > +	int			*stat)
> > +{
> > +	int				error;
> > +	int				i;
> > +	struct xfs_btree_cur		*fbcur = NULL;
> > +	bool				fbinc = false;
> > +
> > +	ASSERT(acur->cnt);
> > +	ASSERT(args->type == XFS_ALLOCTYPE_NEAR_BNO);
> > +	*stat = 0;
> > +
> > +	/*
> > +	 * Set up available cursors for a locality allocation search based on
> > +	 * the ->agbno hint. An exact allocation only needs the bnolt so disable
> > +	 * the cntbt cursor explicitly.
> > +	 */
> > +	error = xfs_alloc_lookup_ge(acur->cnt, args->agbno, acur->cur_len, &i);
> > +	if (error)
> > +		return error;
> > +	error = xfs_alloc_lookup_le(acur->bnolt, args->agbno, 0, &i);
> > +	if (error)
> > +		return error;
> > +	error = xfs_alloc_lookup_ge(acur->bnogt, args->agbno, 0, &i);
> > +	if (error)
> > +		return error;
> > +
> > +	/* search as long as we have at least one active cursor */
> > +	while (xfs_alloc_cur_active(acur->cnt) ||
> > +	       xfs_alloc_cur_active(acur->bnolt) ||
> > +	       xfs_alloc_cur_active(acur->bnogt)) {
> > +		trace_xfs_alloc_cur_lookup(args);
> > +
> > +		/*
> > +		 * Search the bnobt left and right. If either of these find a
> > +		 * maxlen extent, we know we've found ideal locality. Do a
> > +		 * capped search in the opposite direction and we're done.
> 
> So ... we cap the distance we'll walk the two bno cursors to 1 record,
> which I think is key to preventing the "pathological allocation"
> mentioned above?
> 

The key to avoiding the pathological behavior is having the cntbt
lookup search run in parallel. That prevents us from ultimately having to
scan through bunches of bnobt blocks without any records that meet the
minimum requirements, particularly for larger allocation requests. The
cntbt lookup guarantees we'll find some extents that at least meet the
size requirement of the allocation in a timely fashion.

In that sense, the bnobt scan becomes more of an optimization/balance
for smaller allocations. If the bnobt scan finds a maxlen extent before
the cntbt lookup completes, we know we've found best case locality.
E.g., this is pretty much what happens every time for single block
allocations where otherwise the cntbt search would have to jump from
single block extents all the way to the largest free extent before it
would terminate.

With regard to the per-iteration tuning, I was originally thinking about
having this scan a full block at a time, since we've already had to read
it off disk. That turned out to be more overhead than I initially
thought, so I just set it back to 1 to keep it simple. I'm guessing we
could bump it to something slightly larger (more on the order of a
handful), but I haven't yet come across a case where that level of fine
tuning matters. If it ever does, it's a one-liner; see below.
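
E.g., the cap is just the iters argument to xfs_alloc_walk_iter(), so a
knob would be as simple as (name made up):

	#define XFS_ALLOC_BNOBT_SCAN_BATCH	4	/* records per bnobt scan step */

	error = xfs_alloc_walk_iter(args, acur, acur->bnolt, false, true,
				    XFS_ALLOC_BNOBT_SCAN_BATCH, &i);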

> > +		 */
> > +		error = xfs_alloc_walk_iter(args, acur, acur->bnolt, false,
> > +					    true, 1, &i);
> > +		if (error)
> > +			return error;
> > +		if (i && acur->len == args->maxlen) {
> > +			trace_xfs_alloc_cur_left(args);
> > +			fbcur = acur->bnogt;
> > +			fbinc = true;
> > +			break;
> > +		}
> > +
> > +		error = xfs_alloc_walk_iter(args, acur, acur->bnogt, true,
> > +					    true, 1, &i);
> > +		if (error)
> > +			return error;
> > +		if (i && acur->len == args->maxlen) {
> > +			trace_xfs_alloc_cur_right(args);
> > +			fbcur = acur->bnolt;
> > +			break;
> > +		}
> > +
> > +		/*
> > +		 * Search the cntbt for a maximum sized extent with ideal
> > +		 * locality. The lookup search terminates on reaching the end of
> > +		 * the cntbt. Break out of the loop if this occurs to throttle
> > +		 * the bnobt scans.
> > +		 */
> > +		error = xfs_alloc_lookup_iter(args, acur, acur->cnt);
> > +		if (error)
> > +			return error;
> > +		if (!xfs_alloc_cur_active(acur->cnt)) {
> > +			trace_xfs_alloc_cur_lookup_done(args);
> 
> ...and I guess this means that now we're racing incremental steps of
> three different free space btree cursors to see if any of the three will
> find a match with the length we want?
> 
> And if all of this fails, then we fall back to scanning backwards with
> the cntbt until we find something?
> 

Pretty much. The bnobt search is by definition a locality search, so we
can terminate immediately on a maxlen hit. The cntbt search could
essentially provide random locality because of how the tree is
organized, hence the need for multiple lookups of >=maxlen sized extents
to improve the quality of the locality.

If the search fails, the fallbacks are less performance sensitive and
oriented towards just finding something that satisfies minlen.

> Ok, I think I get how this is a behavior improvement on the old code.
> 
> The splitting of the old _near function into smaller pieces makes it a
> little easier to read the code, though I sorta wish this was more like
> the March(?) version where the splitting happened and /then/ the
> behavior change came after, instead of a 33K patch...
> 

I'm not sure what you're referring to here. I never split up the current
implementation because I didn't have a clear path to do that at the
time. This was partly because the final algorithm wasn't really
established, so I needed to work that out, but also because there are
some weird quirks with the current code where helper functions are tied
into behavior of the higher level algorithm and it seemed like kind of a
mess to untangle just to potentially replace that code.

The previous version of this series converted more allocation modes
(exact and by-size) than this version. Those have been temporarily
dropped for now so I can make progress on the core mechanism/interfaces
without continuously refactoring the rest. Is that what you're referring
to?

I can take another look at moving to the updated algorithm more
incrementally now that the end result is more of a known quantity, but
I'll have to go back and stare at it some more to figure out if/how
much it can be broken up...

Brian

> --D
> 
> > +			if (!acur->len) {
> > +				/*
> > +				 * Reset the cursor for a minlen search in the
> > +				 * caller.
> > +				 */
> > +				error = xfs_btree_decrement(acur->cnt, 0, &i);
> > +				if (error)
> > +					return error;
> > +				if (i)
> > +					acur->cnt->bc_private.a.priv.abt.active = true;
> > +			}
> > +			break;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Perform a best-effort search in the opposite direction from a bnobt
> > +	 * allocation.
> > +	 */
> > +	if (fbcur) {
> > +		error = xfs_alloc_walk_iter(args, acur, fbcur, fbinc, true,
> > +					    args->mp->m_alloc_mxr[0], &i);
> > +		if (error)
> > +			return error;
> > +	}
> > +
> > +	if (acur->len)
> > +		*stat = 1;
> > +
> > +	return error;
> > +}
> >  
> >  /*
> >   * Deal with the case where only small freespaces remain. Either return the
> > @@ -814,6 +1251,113 @@ xfs_alloc_ag_vextent_small(
> >  	return error;
> >  }
> >  
> > +/*
> > + * Allocate a variable extent near bno in the allocation group agno.
> > + * Extent's length (returned in len) will be between minlen and maxlen,
> > + * and of the form k * prod + mod unless there's nothing that large.
> > + * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
> > + */
> > +STATIC int
> > +xfs_alloc_ag_vextent_near(
> > +	struct xfs_alloc_arg	*args)
> > +{
> > +	struct xfs_alloc_cur	acur = {0,};
> > +	int			error;
> > +	int			i;
> > +	xfs_agblock_t		bno;
> > +	xfs_extlen_t		len;
> > +
> > +	/* handle uninitialized agbno range so caller doesn't have to */
> > +	if (!args->min_agbno && !args->max_agbno)
> > +		args->max_agbno = args->mp->m_sb.sb_agblocks - 1;
> > +	ASSERT(args->min_agbno <= args->max_agbno);
> > +
> > +	/* clamp agbno to the range if it's outside */
> > +	if (args->agbno < args->min_agbno)
> > +		args->agbno = args->min_agbno;
> > +	if (args->agbno > args->max_agbno)
> > +		args->agbno = args->max_agbno;
> > +
> > +restart:
> > +	error = xfs_alloc_cur_setup(args, &acur);
> > +	if (error) {
> > +		error = (error == -ENOSPC) ? 0 : error;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The cntbt cursor points at the first maxlen or minlen extent. If it
> > +	 * resides in the last block of the tree, scan the rest of the block.
> > +	 * Otherwise run the optimized lookup search algorithm from the current
> > +	 * location to the end of the tree.
> > +	 */
> > +	if (xfs_btree_islastblock(acur.cnt, 0)) {
> > +		int	j;
> > +
> > +		trace_xfs_alloc_cur_lastblock(args);
> > +		error = xfs_alloc_walk_iter(args, &acur, acur.cnt, true, false,
> > +				    INT_MAX, &i);
> > +		if (error)
> > +			goto out;
> > +
> > +		/*
> > +		 * If allocation fails, reset the cursor for a reverse minlen
> > +		 * scan. cur_len doesn't change so look up the extent just
> > +		 * before the starting point.
> > +		 */
> > +		if (!i && acur.cur_len > args->minlen)
> > +			error = xfs_alloc_lookup_le(acur.cnt, 0, acur.cur_len, &j);
> > +	} else {
> > +		error = xfs_alloc_ag_vextent_lookup(args, &acur, &i);
> > +	}
> > +	if (error)
> > +		goto out;
> > +
> > +	/*
> > +	 * Ignore locality and search backwards for the first suitable minlen
> > +	 * extent. This fails if we run out of minlen extents.
> > +	 */
> > +	if (!i && xfs_alloc_cur_active(acur.cnt)) {
> > +		trace_xfs_alloc_cur_reverse(args);
> > +		error = xfs_alloc_walk_iter(args, &acur, acur.cnt, false, true,
> > +					    INT_MAX, &i);
> > +		if (error)
> > +			goto out;
> > +	}
> > +
> > +	if (!i) {
> > +		/* flush and retry if we found busy extents */
> > +		if (acur.busy) {
> > +			trace_xfs_alloc_ag_busy(args);
> > +			xfs_extent_busy_flush(args->mp, args->pag,
> > +					      acur.busy_gen);
> > +			goto restart;
> > +		}
> > +
> > +		/*
> > +		 * Try the AGFL as a last resort. Don't pass a cursor so this
> > +		 * returns an AGFL block (i == 0) or nothing.
> > +		 */
> > +		error = xfs_alloc_ag_vextent_small(args, NULL, &bno, &len, &i);
> > +		if (error)
> > +			goto out;
> > +		ASSERT(i == 0 || (i && len == 0));
> > +		trace_xfs_alloc_ag_noentry(args);
> > +
> > +		args->agbno = bno;
> > +		args->len = len;
> > +		goto out;
> > +	}
> > +
> > +	/* fix up btrees on a successful allocation */
> > +	error = xfs_alloc_cur_finish(args, &acur);
> > +out:
> > +	xfs_alloc_cur_close(&acur, error);
> > +	if (error)
> > +		trace_xfs_alloc_ag_error(args);
> > +	return error;
> > +}
> > +
> >  /*
> >   * Allocate a variable extent in the allocation group agno.
> >   * Type and bno are used to determine where in the allocation group the
> > @@ -1001,503 +1545,6 @@ xfs_alloc_ag_vextent_exact(
> >  	return error;
> >  }
> >  
> > -/*
> > - * Search the btree in a given direction via the search cursor and compare
> > - * the records found against the good extent we've already found.
> > - */
> > -STATIC int
> > -xfs_alloc_find_best_extent(
> > -	struct xfs_alloc_arg	*args,	/* allocation argument structure */
> > -	struct xfs_btree_cur	**gcur,	/* good cursor */
> > -	struct xfs_btree_cur	**scur,	/* searching cursor */
> > -	xfs_agblock_t		gdiff,	/* difference for search comparison */
> > -	xfs_agblock_t		*sbno,	/* extent found by search */
> > -	xfs_extlen_t		*slen,	/* extent length */
> > -	xfs_agblock_t		*sbnoa,	/* aligned extent found by search */
> > -	xfs_extlen_t		*slena,	/* aligned extent length */
> > -	int			dir)	/* 0 = search right, 1 = search left */
> > -{
> > -	xfs_agblock_t		new;
> > -	xfs_agblock_t		sdiff;
> > -	int			error;
> > -	int			i;
> > -	unsigned		busy_gen;
> > -
> > -	/* The good extent is perfect, no need to  search. */
> > -	if (!gdiff)
> > -		goto out_use_good;
> > -
> > -	/*
> > -	 * Look until we find a better one, run out of space or run off the end.
> > -	 */
> > -	do {
> > -		error = xfs_alloc_get_rec(*scur, sbno, slen, &i);
> > -		if (error)
> > -			goto error0;
> > -		XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> > -		xfs_alloc_compute_aligned(args, *sbno, *slen,
> > -				sbnoa, slena, &busy_gen);
> > -
> > -		/*
> > -		 * The good extent is closer than this one.
> > -		 */
> > -		if (!dir) {
> > -			if (*sbnoa > args->max_agbno)
> > -				goto out_use_good;
> > -			if (*sbnoa >= args->agbno + gdiff)
> > -				goto out_use_good;
> > -		} else {
> > -			if (*sbnoa < args->min_agbno)
> > -				goto out_use_good;
> > -			if (*sbnoa <= args->agbno - gdiff)
> > -				goto out_use_good;
> > -		}
> > -
> > -		/*
> > -		 * Same distance, compare length and pick the best.
> > -		 */
> > -		if (*slena >= args->minlen) {
> > -			args->len = XFS_EXTLEN_MIN(*slena, args->maxlen);
> > -			xfs_alloc_fix_len(args);
> > -
> > -			sdiff = xfs_alloc_compute_diff(args->agbno, args->len,
> > -						       args->alignment,
> > -						       args->datatype, *sbnoa,
> > -						       *slena, &new);
> > -
> > -			/*
> > -			 * Choose closer size and invalidate other cursor.
> > -			 */
> > -			if (sdiff < gdiff)
> > -				goto out_use_search;
> > -			goto out_use_good;
> > -		}
> > -
> > -		if (!dir)
> > -			error = xfs_btree_increment(*scur, 0, &i);
> > -		else
> > -			error = xfs_btree_decrement(*scur, 0, &i);
> > -		if (error)
> > -			goto error0;
> > -	} while (i);
> > -
> > -out_use_good:
> > -	xfs_btree_del_cursor(*scur, XFS_BTREE_NOERROR);
> > -	*scur = NULL;
> > -	return 0;
> > -
> > -out_use_search:
> > -	xfs_btree_del_cursor(*gcur, XFS_BTREE_NOERROR);
> > -	*gcur = NULL;
> > -	return 0;
> > -
> > -error0:
> > -	/* caller invalidates cursors */
> > -	return error;
> > -}
> > -
> > -/*
> > - * Allocate a variable extent near bno in the allocation group agno.
> > - * Extent's length (returned in len) will be between minlen and maxlen,
> > - * and of the form k * prod + mod unless there's nothing that large.
> > - * Return the starting a.g. block, or NULLAGBLOCK if we can't do it.
> > - */
> > -STATIC int				/* error */
> > -xfs_alloc_ag_vextent_near(
> > -	xfs_alloc_arg_t	*args)		/* allocation argument structure */
> > -{
> > -	xfs_btree_cur_t	*bno_cur_gt;	/* cursor for bno btree, right side */
> > -	xfs_btree_cur_t	*bno_cur_lt;	/* cursor for bno btree, left side */
> > -	xfs_btree_cur_t	*cnt_cur;	/* cursor for count btree */
> > -	xfs_agblock_t	gtbno;		/* start bno of right side entry */
> > -	xfs_agblock_t	gtbnoa;		/* aligned ... */
> > -	xfs_extlen_t	gtdiff;		/* difference to right side entry */
> > -	xfs_extlen_t	gtlen;		/* length of right side entry */
> > -	xfs_extlen_t	gtlena;		/* aligned ... */
> > -	xfs_agblock_t	gtnew;		/* useful start bno of right side */
> > -	int		error;		/* error code */
> > -	int		i;		/* result code, temporary */
> > -	int		j;		/* result code, temporary */
> > -	xfs_agblock_t	ltbno;		/* start bno of left side entry */
> > -	xfs_agblock_t	ltbnoa;		/* aligned ... */
> > -	xfs_extlen_t	ltdiff;		/* difference to left side entry */
> > -	xfs_extlen_t	ltlen;		/* length of left side entry */
> > -	xfs_extlen_t	ltlena;		/* aligned ... */
> > -	xfs_agblock_t	ltnew;		/* useful start bno of left side */
> > -	xfs_extlen_t	rlen;		/* length of returned extent */
> > -	bool		busy;
> > -	unsigned	busy_gen;
> > -#ifdef DEBUG
> > -	/*
> > -	 * Randomly don't execute the first algorithm.
> > -	 */
> > -	int		dofirst;	/* set to do first algorithm */
> > -
> > -	dofirst = prandom_u32() & 1;
> > -#endif
> > -
> > -	/* handle unitialized agbno range so caller doesn't have to */
> > -	if (!args->min_agbno && !args->max_agbno)
> > -		args->max_agbno = args->mp->m_sb.sb_agblocks - 1;
> > -	ASSERT(args->min_agbno <= args->max_agbno);
> > -
> > -	/* clamp agbno to the range if it's outside */
> > -	if (args->agbno < args->min_agbno)
> > -		args->agbno = args->min_agbno;
> > -	if (args->agbno > args->max_agbno)
> > -		args->agbno = args->max_agbno;
> > -
> > -restart:
> > -	bno_cur_lt = NULL;
> > -	bno_cur_gt = NULL;
> > -	ltlen = 0;
> > -	gtlena = 0;
> > -	ltlena = 0;
> > -	busy = false;
> > -
> > -	/*
> > -	 * Get a cursor for the by-size btree.
> > -	 */
> > -	cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
> > -		args->agno, XFS_BTNUM_CNT);
> > -
> > -	/*
> > -	 * See if there are any free extents as big as maxlen.
> > -	 */
> > -	if ((error = xfs_alloc_lookup_ge(cnt_cur, 0, args->maxlen, &i)))
> > -		goto error0;
> > -	/*
> > -	 * If none, then pick up the last entry in the tree unless the
> > -	 * tree is empty.
> > -	 */
> > -	if (!i) {
> > -		if ((error = xfs_alloc_ag_vextent_small(args, cnt_cur, &ltbno,
> > -				&ltlen, &i)))
> > -			goto error0;
> > -		if (i == 0 || ltlen == 0) {
> > -			xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> > -			trace_xfs_alloc_near_noentry(args);
> > -			return 0;
> > -		}
> > -		ASSERT(i == 1);
> > -	}
> > -	args->wasfromfl = 0;
> > -
> > -	/*
> > -	 * First algorithm.
> > -	 * If the requested extent is large wrt the freespaces available
> > -	 * in this a.g., then the cursor will be pointing to a btree entry
> > -	 * near the right edge of the tree.  If it's in the last btree leaf
> > -	 * block, then we just examine all the entries in that block
> > -	 * that are big enough, and pick the best one.
> > -	 * This is written as a while loop so we can break out of it,
> > -	 * but we never loop back to the top.
> > -	 */
> > -	while (xfs_btree_islastblock(cnt_cur, 0)) {
> > -		xfs_extlen_t	bdiff;
> > -		int		besti=0;
> > -		xfs_extlen_t	blen=0;
> > -		xfs_agblock_t	bnew=0;
> > -
> > -#ifdef DEBUG
> > -		if (dofirst)
> > -			break;
> > -#endif
> > -		/*
> > -		 * Start from the entry that lookup found, sequence through
> > -		 * all larger free blocks.  If we're actually pointing at a
> > -		 * record smaller than maxlen, go to the start of this block,
> > -		 * and skip all those smaller than minlen.
> > -		 */
> > -		if (ltlen || args->alignment > 1) {
> > -			cnt_cur->bc_ptrs[0] = 1;
> > -			do {
> > -				if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno,
> > -						&ltlen, &i)))
> > -					goto error0;
> > -				XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> > -				if (ltlen >= args->minlen)
> > -					break;
> > -				if ((error = xfs_btree_increment(cnt_cur, 0, &i)))
> > -					goto error0;
> > -			} while (i);
> > -			ASSERT(ltlen >= args->minlen);
> > -			if (!i)
> > -				break;
> > -		}
> > -		i = cnt_cur->bc_ptrs[0];
> > -		for (j = 1, blen = 0, bdiff = 0;
> > -		     !error && j && (blen < args->maxlen || bdiff > 0);
> > -		     error = xfs_btree_increment(cnt_cur, 0, &j)) {
> > -			/*
> > -			 * For each entry, decide if it's better than
> > -			 * the previous best entry.
> > -			 */
> > -			if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno, &ltlen, &i)))
> > -				goto error0;
> > -			XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> > -			busy = xfs_alloc_compute_aligned(args, ltbno, ltlen,
> > -					&ltbnoa, &ltlena, &busy_gen);
> > -			if (ltlena < args->minlen)
> > -				continue;
> > -			if (ltbnoa < args->min_agbno || ltbnoa > args->max_agbno)
> > -				continue;
> > -			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
> > -			xfs_alloc_fix_len(args);
> > -			ASSERT(args->len >= args->minlen);
> > -			if (args->len < blen)
> > -				continue;
> > -			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
> > -				args->alignment, args->datatype, ltbnoa,
> > -				ltlena, &ltnew);
> > -			if (ltnew != NULLAGBLOCK &&
> > -			    (args->len > blen || ltdiff < bdiff)) {
> > -				bdiff = ltdiff;
> > -				bnew = ltnew;
> > -				blen = args->len;
> > -				besti = cnt_cur->bc_ptrs[0];
> > -			}
> > -		}
> > -		/*
> > -		 * It didn't work.  We COULD be in a case where
> > -		 * there's a good record somewhere, so try again.
> > -		 */
> > -		if (blen == 0)
> > -			break;
> > -		/*
> > -		 * Point at the best entry, and retrieve it again.
> > -		 */
> > -		cnt_cur->bc_ptrs[0] = besti;
> > -		if ((error = xfs_alloc_get_rec(cnt_cur, &ltbno, &ltlen, &i)))
> > -			goto error0;
> > -		XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> > -		ASSERT(ltbno + ltlen <= be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length));
> > -		args->len = blen;
> > -
> > -		/*
> > -		 * We are allocating starting at bnew for blen blocks.
> > -		 */
> > -		args->agbno = bnew;
> > -		ASSERT(bnew >= ltbno);
> > -		ASSERT(bnew + blen <= ltbno + ltlen);
> > -		/*
> > -		 * Set up a cursor for the by-bno tree.
> > -		 */
> > -		bno_cur_lt = xfs_allocbt_init_cursor(args->mp, args->tp,
> > -			args->agbp, args->agno, XFS_BTNUM_BNO);
> > -		/*
> > -		 * Fix up the btree entries.
> > -		 */
> > -		if ((error = xfs_alloc_fixup_trees(cnt_cur, bno_cur_lt, ltbno,
> > -				ltlen, bnew, blen, XFSA_FIXUP_CNT_OK)))
> > -			goto error0;
> > -		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> > -		xfs_btree_del_cursor(bno_cur_lt, XFS_BTREE_NOERROR);
> > -
> > -		trace_xfs_alloc_near_first(args);
> > -		return 0;
> > -	}
> > -	/*
> > -	 * Second algorithm.
> > -	 * Search in the by-bno tree to the left and to the right
> > -	 * simultaneously, until in each case we find a space big enough,
> > -	 * or run into the edge of the tree.  When we run into the edge,
> > -	 * we deallocate that cursor.
> > -	 * If both searches succeed, we compare the two spaces and pick
> > -	 * the better one.
> > -	 * With alignment, it's possible for both to fail; the upper
> > -	 * level algorithm that picks allocation groups for allocations
> > -	 * is not supposed to do this.
> > -	 */
> > -	/*
> > -	 * Allocate and initialize the cursor for the leftward search.
> > -	 */
> > -	bno_cur_lt = xfs_allocbt_init_cursor(args->mp, args->tp, args->agbp,
> > -		args->agno, XFS_BTNUM_BNO);
> > -	/*
> > -	 * Lookup <= bno to find the leftward search's starting point.
> > -	 */
> > -	if ((error = xfs_alloc_lookup_le(bno_cur_lt, args->agbno, args->maxlen, &i)))
> > -		goto error0;
> > -	if (!i) {
> > -		/*
> > -		 * Didn't find anything; use this cursor for the rightward
> > -		 * search.
> > -		 */
> > -		bno_cur_gt = bno_cur_lt;
> > -		bno_cur_lt = NULL;
> > -	}
> > -	/*
> > -	 * Found something.  Duplicate the cursor for the rightward search.
> > -	 */
> > -	else if ((error = xfs_btree_dup_cursor(bno_cur_lt, &bno_cur_gt)))
> > -		goto error0;
> > -	/*
> > -	 * Increment the cursor, so we will point at the entry just right
> > -	 * of the leftward entry if any, or to the leftmost entry.
> > -	 */
> > -	if ((error = xfs_btree_increment(bno_cur_gt, 0, &i)))
> > -		goto error0;
> > -	if (!i) {
> > -		/*
> > -		 * It failed, there are no rightward entries.
> > -		 */
> > -		xfs_btree_del_cursor(bno_cur_gt, XFS_BTREE_NOERROR);
> > -		bno_cur_gt = NULL;
> > -	}
> > -	/*
> > -	 * Loop going left with the leftward cursor, right with the
> > -	 * rightward cursor, until either both directions give up or
> > -	 * we find an entry at least as big as minlen.
> > -	 */
> > -	do {
> > -		if (bno_cur_lt) {
> > -			if ((error = xfs_alloc_get_rec(bno_cur_lt, &ltbno, &ltlen, &i)))
> > -				goto error0;
> > -			XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> > -			busy |= xfs_alloc_compute_aligned(args, ltbno, ltlen,
> > -					&ltbnoa, &ltlena, &busy_gen);
> > -			if (ltlena >= args->minlen && ltbnoa >= args->min_agbno)
> > -				break;
> > -			if ((error = xfs_btree_decrement(bno_cur_lt, 0, &i)))
> > -				goto error0;
> > -			if (!i || ltbnoa < args->min_agbno) {
> > -				xfs_btree_del_cursor(bno_cur_lt,
> > -						     XFS_BTREE_NOERROR);
> > -				bno_cur_lt = NULL;
> > -			}
> > -		}
> > -		if (bno_cur_gt) {
> > -			if ((error = xfs_alloc_get_rec(bno_cur_gt, &gtbno, &gtlen, &i)))
> > -				goto error0;
> > -			XFS_WANT_CORRUPTED_GOTO(args->mp, i == 1, error0);
> > -			busy |= xfs_alloc_compute_aligned(args, gtbno, gtlen,
> > -					&gtbnoa, &gtlena, &busy_gen);
> > -			if (gtlena >= args->minlen && gtbnoa <= args->max_agbno)
> > -				break;
> > -			if ((error = xfs_btree_increment(bno_cur_gt, 0, &i)))
> > -				goto error0;
> > -			if (!i || gtbnoa > args->max_agbno) {
> > -				xfs_btree_del_cursor(bno_cur_gt,
> > -						     XFS_BTREE_NOERROR);
> > -				bno_cur_gt = NULL;
> > -			}
> > -		}
> > -	} while (bno_cur_lt || bno_cur_gt);
> > -
> > -	/*
> > -	 * Got both cursors still active, need to find better entry.
> > -	 */
> > -	if (bno_cur_lt && bno_cur_gt) {
> > -		if (ltlena >= args->minlen) {
> > -			/*
> > -			 * Left side is good, look for a right side entry.
> > -			 */
> > -			args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
> > -			xfs_alloc_fix_len(args);
> > -			ltdiff = xfs_alloc_compute_diff(args->agbno, args->len,
> > -				args->alignment, args->datatype, ltbnoa,
> > -				ltlena, &ltnew);
> > -
> > -			error = xfs_alloc_find_best_extent(args,
> > -						&bno_cur_lt, &bno_cur_gt,
> > -						ltdiff, &gtbno, &gtlen,
> > -						&gtbnoa, &gtlena,
> > -						0 /* search right */);
> > -		} else {
> > -			ASSERT(gtlena >= args->minlen);
> > -
> > -			/*
> > -			 * Right side is good, look for a left side entry.
> > -			 */
> > -			args->len = XFS_EXTLEN_MIN(gtlena, args->maxlen);
> > -			xfs_alloc_fix_len(args);
> > -			gtdiff = xfs_alloc_compute_diff(args->agbno, args->len,
> > -				args->alignment, args->datatype, gtbnoa,
> > -				gtlena, &gtnew);
> > -
> > -			error = xfs_alloc_find_best_extent(args,
> > -						&bno_cur_gt, &bno_cur_lt,
> > -						gtdiff, &ltbno, &ltlen,
> > -						&ltbnoa, &ltlena,
> > -						1 /* search left */);
> > -		}
> > -
> > -		if (error)
> > -			goto error0;
> > -	}
> > -
> > -	/*
> > -	 * If we couldn't get anything, give up.
> > -	 */
> > -	if (bno_cur_lt == NULL && bno_cur_gt == NULL) {
> > -		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> > -
> > -		if (busy) {
> > -			trace_xfs_alloc_near_busy(args);
> > -			xfs_extent_busy_flush(args->mp, args->pag, busy_gen);
> > -			goto restart;
> > -		}
> > -		trace_xfs_alloc_size_neither(args);
> > -		args->agbno = NULLAGBLOCK;
> > -		return 0;
> > -	}
> > -
> > -	/*
> > -	 * At this point we have selected a freespace entry, either to the
> > -	 * left or to the right.  If it's on the right, copy all the
> > -	 * useful variables to the "left" set so we only have one
> > -	 * copy of this code.
> > -	 */
> > -	if (bno_cur_gt) {
> > -		bno_cur_lt = bno_cur_gt;
> > -		bno_cur_gt = NULL;
> > -		ltbno = gtbno;
> > -		ltbnoa = gtbnoa;
> > -		ltlen = gtlen;
> > -		ltlena = gtlena;
> > -		j = 1;
> > -	} else
> > -		j = 0;
> > -
> > -	/*
> > -	 * Fix up the length and compute the useful address.
> > -	 */
> > -	args->len = XFS_EXTLEN_MIN(ltlena, args->maxlen);
> > -	xfs_alloc_fix_len(args);
> > -	rlen = args->len;
> > -	(void)xfs_alloc_compute_diff(args->agbno, rlen, args->alignment,
> > -				     args->datatype, ltbnoa, ltlena, &ltnew);
> > -	ASSERT(ltnew >= ltbno);
> > -	ASSERT(ltnew + rlen <= ltbnoa + ltlena);
> > -	ASSERT(ltnew + rlen <= be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_length));
> > -	ASSERT(ltnew >= args->min_agbno && ltnew <= args->max_agbno);
> > -	args->agbno = ltnew;
> > -
> > -	if ((error = xfs_alloc_fixup_trees(cnt_cur, bno_cur_lt, ltbno, ltlen,
> > -			ltnew, rlen, XFSA_FIXUP_BNO_OK)))
> > -		goto error0;
> > -
> > -	if (j)
> > -		trace_xfs_alloc_near_greater(args);
> > -	else
> > -		trace_xfs_alloc_near_lesser(args);
> > -
> > -	xfs_btree_del_cursor(cnt_cur, XFS_BTREE_NOERROR);
> > -	xfs_btree_del_cursor(bno_cur_lt, XFS_BTREE_NOERROR);
> > -	return 0;
> > -
> > - error0:
> > -	trace_xfs_alloc_near_error(args);
> > -	if (cnt_cur != NULL)
> > -		xfs_btree_del_cursor(cnt_cur, XFS_BTREE_ERROR);
> > -	if (bno_cur_lt != NULL)
> > -		xfs_btree_del_cursor(bno_cur_lt, XFS_BTREE_ERROR);
> > -	if (bno_cur_gt != NULL)
> > -		xfs_btree_del_cursor(bno_cur_gt, XFS_BTREE_ERROR);
> > -	return error;
> > -}
> > -
> >  /*
> >   * Allocate a variable extent anywhere in the allocation group agno.
> >   * Extent's length (returned in len) will be between minlen and maxlen,
> > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> > index 8094b1920eef..a441a472da79 100644
> > --- a/fs/xfs/xfs_trace.h
> > +++ b/fs/xfs/xfs_trace.h
> > @@ -1635,14 +1635,16 @@ DEFINE_EVENT(xfs_alloc_class, name, \
> >  DEFINE_ALLOC_EVENT(xfs_alloc_exact_done);
> >  DEFINE_ALLOC_EVENT(xfs_alloc_exact_notfound);
> >  DEFINE_ALLOC_EVENT(xfs_alloc_exact_error);
> > -DEFINE_ALLOC_EVENT(xfs_alloc_near_nominleft);
> > -DEFINE_ALLOC_EVENT(xfs_alloc_near_first);
> > -DEFINE_ALLOC_EVENT(xfs_alloc_near_greater);
> > -DEFINE_ALLOC_EVENT(xfs_alloc_near_lesser);
> > -DEFINE_ALLOC_EVENT(xfs_alloc_near_error);
> > -DEFINE_ALLOC_EVENT(xfs_alloc_near_noentry);
> > -DEFINE_ALLOC_EVENT(xfs_alloc_near_busy);
> > -DEFINE_ALLOC_EVENT(xfs_alloc_size_neither);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_ag_error);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_ag_noentry);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_ag_busy);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_cur);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_cur_lastblock);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_cur_left);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_cur_right);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_cur_lookup);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_cur_lookup_done);
> > +DEFINE_ALLOC_EVENT(xfs_alloc_cur_reverse);
> >  DEFINE_ALLOC_EVENT(xfs_alloc_size_noentry);
> >  DEFINE_ALLOC_EVENT(xfs_alloc_size_nominleft);
> >  DEFINE_ALLOC_EVENT(xfs_alloc_size_done);
> > @@ -1658,6 +1660,31 @@ DEFINE_ALLOC_EVENT(xfs_alloc_vextent_noagbp);
> >  DEFINE_ALLOC_EVENT(xfs_alloc_vextent_loopfailed);
> >  DEFINE_ALLOC_EVENT(xfs_alloc_vextent_allfailed);
> >  
> > +TRACE_EVENT(xfs_alloc_cur_check,
> > +	TP_PROTO(struct xfs_mount *mp, xfs_btnum_t btnum, xfs_agblock_t bno,
> > +		 xfs_extlen_t len, xfs_extlen_t diff, bool new),
> > +	TP_ARGS(mp, btnum, bno, len, diff, new),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(xfs_btnum_t, btnum)
> > +		__field(xfs_agblock_t, bno)
> > +		__field(xfs_extlen_t, len)
> > +		__field(xfs_extlen_t, diff)
> > +		__field(bool, new)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = mp->m_super->s_dev;
> > +		__entry->btnum = btnum;
> > +		__entry->bno = bno;
> > +		__entry->len = len;
> > +		__entry->diff = diff;
> > +		__entry->new = new;
> > +	),
> > +	TP_printk("dev %d:%d btnum %d bno 0x%x len 0x%x diff 0x%x new %d",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->btnum,
> > +		  __entry->bno, __entry->len, __entry->diff, __entry->new)
> > +)
> > +
> >  DECLARE_EVENT_CLASS(xfs_da_class,
> >  	TP_PROTO(struct xfs_da_args *args),
> >  	TP_ARGS(args),
> > -- 
> > 2.20.1
> > 


* Re: [PATCH v3 3/4] xfs: randomly fall back to near mode lookup algorithm in debug mode
  2019-08-17  1:37   ` Darrick J. Wong
@ 2019-08-19 18:19     ` Brian Foster
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2019-08-19 18:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Aug 16, 2019 at 06:37:03PM -0700, Darrick J. Wong wrote:
> On Thu, Aug 15, 2019 at 08:55:37AM -0400, Brian Foster wrote:
> > The last block scan is the dominant near mode allocation algorithm
> > for a newer filesystem with fewer, larger free extents. Add debug
> > mode logic to randomly fall back to lookup mode to improve
> > regression test coverage.
> 
> How about just using an errortag since the new sysfs interface lets
> testcases / admins control the frequency?
> 

We could do that, but my understanding of the equivalent logic in the
current algorithm is that we want broad coverage of both near mode
sub-algorithms across the entire suite of tests. Hence we randomly drop
allocations into either algorithm when DEBUG mode is enabled. IIRC, we
do something similar with sparse inodes (i.e., randomly allocate sparse
inode chunks even when unnecessary) so the functionality isn't only
covered by targeted tests.

Do we have the ability to have always-on error tags as such? I thought
we had default frequency values for each tag, but that they still had to
be explicitly enabled. If that's the case, I'm sure we could come up
with such an on-by-default mechanism and perhaps switch over these
remaining DEBUG mode hacks, but that's a follow-up thing IMO.
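
For reference, an errortag variant would presumably look something like
this (completely untested; the tag name is made up and it would need to
be registered in xfs_errortag.h along with a default frequency):

	lastblock = xfs_btree_islastblock(acur.cnt, 0);
	if (lastblock &&
	    XFS_TEST_ERROR(false, args->mp, XFS_ERRTAG_ALLOC_NEAR_LOOKUP))
		lastblock = false;	/* force the lookup algorithm */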

Brian

> --D
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 7753b61ba532..d550aa5597bf 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -1266,6 +1266,7 @@ xfs_alloc_ag_vextent_near(
> >  	int			i;
> >  	xfs_agblock_t		bno;
> >  	xfs_extlen_t		len;
> > +	bool			lastblock;
> >  
> >  	/* handle uninitialized agbno range so caller doesn't have to */
> >  	if (!args->min_agbno && !args->max_agbno)
> > @@ -1291,7 +1292,12 @@ xfs_alloc_ag_vextent_near(
> >  	 * Otherwise run the optimized lookup search algorithm from the current
> >  	 * location to the end of the tree.
> >  	 */
> > -	if (xfs_btree_islastblock(acur.cnt, 0)) {
> > +	lastblock = xfs_btree_islastblock(acur.cnt, 0);
> > +#ifdef DEBUG
> > +	if (lastblock)
> > +		lastblock = prandom_u32() & 1;
> > +#endif
> > +	if (lastblock) {
> >  		int	j;
> >  
> >  		trace_xfs_alloc_cur_lastblock(args);
> > -- 
> > 2.20.1
> > 


Thread overview: 12+ messages
2019-08-15 12:55 [PATCH v3 0/4] xfs: rework near mode extent allocation Brian Foster
2019-08-15 12:55 ` [PATCH v3 1/4] xfs: track active state of allocation btree cursors Brian Foster
2019-08-17  0:51   ` Darrick J. Wong
2019-08-19 18:03     ` Brian Foster
2019-08-15 12:55 ` [PATCH v3 2/4] xfs: use locality optimized cntbt lookups for near mode allocations Brian Foster
2019-08-17  1:34   ` Darrick J. Wong
2019-08-19 18:06     ` Brian Foster
2019-08-15 12:55 ` [PATCH v3 3/4] xfs: randomly fall back to near mode lookup algorithm in debug mode Brian Foster
2019-08-17  1:37   ` Darrick J. Wong
2019-08-19 18:19     ` Brian Foster
2019-08-15 12:55 ` [PATCH v3 4/4] xfs: refactor successful AG allocation accounting code Brian Foster
2019-08-17  0:28   ` Darrick J. Wong
