All of lore.kernel.org
 help / color / mirror / Atom feed
* [QUESTION] about the freelist allocator in XFS
@ 2016-07-07 11:01 Kaho Ng
  2016-07-07 12:13 ` Brian Foster
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Kaho Ng @ 2016-07-07 11:01 UTC (permalink / raw)
  To: xfs

I am trying to investigate how freelist allocator in xfs interacts
with freespace B+Tree allocator.
First I prepared a patch
<https://gist.github.com/22ffca35929e67c08759b057779b7566> on
linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
(The kernel version used is linux-3.10.0-327.22.2.el7).
Then, I wrote a simple utility
<https://gist.github.com/992364ceca984d3f14099ec94aaacd9d> to make
TONS of
holes in a filesystem by calling fallocate() to punch holes in a file
that is almost as large as the volume size.

I created an XFS filesystem image by the following steps:
1. fallocate -l 80G /mnt/disk2/xfs
2. mkfs.xfs -f -d agcount=1 /mnt/disk2/xfs

Then I created a large file by fallocate:
fallocate -l 85823746048 /mnt/test/abc

which left only 4 blocks available in the volume finally:
/dev/loop0      20961280 20961276         4 100% /mnt/test

The result of xfs_bmap against /mnt/test/abc:
/mnt/test/abc:
 EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET              TOTAL FLAGS
   0: [0..167624503]:  83000..167707503  0 (83000..167707503) 167624504 10000

After that, I used the hole-punching utility above to create holes on
the files, and captured the output of kmsg.

When reading the log output
<https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
realised that there is no B+Tree split
triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
Isn't B+Tree split possible in by-size B+Tree even when truncating a
longer freespace record to shorter one? But what I found in the log is
only a few tree shrinks... And when reading the source code of
freespace allocator I found that a B+Tree growth in this case is
impossible at least...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-07 11:01 [QUESTION] about the freelist allocator in XFS Kaho Ng
@ 2016-07-07 12:13 ` Brian Foster
  2016-07-07 22:28 ` Dave Chinner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Brian Foster @ 2016-07-07 12:13 UTC (permalink / raw)
  To: Kaho Ng; +Cc: xfs

On Thu, Jul 07, 2016 at 07:01:35PM +0800, Kaho Ng wrote:
> I am trying to investigate how freelist allocator in xfs interacts
> with freespace B+Tree allocator.
> First I prepared a patch
> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
> (The kernel version used is linux-3.10.0-327.22.2.el7).
> Then, I wrote a simple utility
> <https://gist.github.com/992364ceca984d3f14099ec94aaacd9d> to make
> TONS of
> holes in a filesystem by calling fallocate() to punch holes in a file
> that is almost as large as the volume size.
> 
> I created an XFS filesystem image by the following steps:
> 1. fallocate -l 80G /mnt/disk2/xfs
> 2. mkfs.xfs -f -d agcount=1 /mnt/disk2/xfs
> 
> Then I created a large file by fallocate:
> fallocate -l 85823746048 /mnt/test/abc
> 
> which left only 4 blocks available in the volume finally:
> /dev/loop0      20961280 20961276         4 100% /mnt/test
> 
> The result of xfs_bmap against /mnt/test/abc:
> /mnt/test/abc:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET              TOTAL FLAGS
>    0: [0..167624503]:  83000..167707503  0 (83000..167707503) 167624504 10000
> 
> After that, I used the hole-punching utility above to create holes on
> the files, and captured the output of kmsg.
> 
> When reading the log output
> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
> realised that there is no B+Tree split
> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
> Isn't B+Tree split possible in by-size B+Tree even when truncating a
> longer freespace record to shorter one? But what I found in the log is
> only a few tree shrinks... And when reading the source code of
> freespace allocator I found that a B+Tree growth in this case is
> impossible at least...
> 

I'd suggest to use a combination of xfs_db and tracepoints/xfsstats to
identify what's happening in your test sequence. E.g., unmount and use
xfs_db to identify the state of the free space btree(s) before and after
various points of your test. See [1] for examples of how to use xfs_db
to explore on-disk data structures. See 'man trace-cmd' to work with
tracepoints and /proc/fs/xfs/stats (and /proc/sys/fs/xfs/stats_clear) to
view runtime statistics (which I believe already includes the number of
btree splits).

Brian

[1] http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/index.html

> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-07 11:01 [QUESTION] about the freelist allocator in XFS Kaho Ng
  2016-07-07 12:13 ` Brian Foster
@ 2016-07-07 22:28 ` Dave Chinner
       [not found]   ` <CAGeO4WNAdmeXgL4+CAQ1Yqo18XFgv3NZxWVbDTS0xDZLyb3e2w@mail.gmail.com>
  2016-07-08  5:48   ` Kaho Ng
  2016-07-08 19:17 ` Kaho Ng
  2016-07-10 16:57 ` Kaho Ng
  3 siblings, 2 replies; 13+ messages in thread
From: Dave Chinner @ 2016-07-07 22:28 UTC (permalink / raw)
  To: Kaho Ng; +Cc: xfs

On Thu, Jul 07, 2016 at 07:01:35PM +0800, Kaho Ng wrote:
> I am trying to investigate how freelist allocator in xfs interacts
> with freespace B+Tree allocator.
> First I prepared a patch
> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
> (The kernel version used is linux-3.10.0-327.22.2.el7).
......
> When reading the log output
> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
> realised that there is no B+Tree split
> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
> Isn't B+Tree split possible in by-size B+Tree even when truncating a
> longer freespace record to shorter one? But what I found in the log is
> only a few tree shrinks... And when reading the source code of
> freespace allocator I found that a B+Tree growth in this case is
> impossible at least...

args->isfl doesn't mean what you think it means.

args->isfl is only set when moving blocks from the freespace btree
to the AGFL, which only occurs when a previous operation allocated a
new freespace btree block and depleted the current freelist. i.e.
"AG Free List" != "AG freespace btree" - they are different
structures on disk...

And when you consider that a freelist refill can only remove records
from the the freespace btree, it's should be clear that a btree
split won't occur during a freelist refill...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Fwd: [QUESTION] about the freelist allocator in XFS
       [not found]   ` <CAGeO4WNAdmeXgL4+CAQ1Yqo18XFgv3NZxWVbDTS0xDZLyb3e2w@mail.gmail.com>
@ 2016-07-08  2:29     ` Kaho Ng
       [not found]     ` <20160708034710.GL12670@dastard>
  1 sibling, 0 replies; 13+ messages in thread
From: Kaho Ng @ 2016-07-08  2:29 UTC (permalink / raw)
  To: xfs

---------- Forwarded message ----------
From: Kaho Ng <ngkaho1234@gmail.com>
Date: Fri, Jul 8, 2016 at 10:26 AM
Subject: Re: [QUESTION] about the freelist allocator in XFS
To: Dave Chinner <david@fromorbit.com>


Hmm, wouldn't xfs_alloc_ag_vextent_size() first remove the free extent
record, and insert a new extent record into the freespace by-size
btree if the found free extent record is longer than args->maxlen?

On Fri, Jul 8, 2016 at 6:28 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Jul 07, 2016 at 07:01:35PM +0800, Kaho Ng wrote:
>> I am trying to investigate how freelist allocator in xfs interacts
>> with freespace B+Tree allocator.
>> First I prepared a patch
>> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
>> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
>> (The kernel version used is linux-3.10.0-327.22.2.el7).
> ......
>> When reading the log output
>> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
>> realised that there is no B+Tree split
>> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
>> Isn't B+Tree split possible in by-size B+Tree even when truncating a
>> longer freespace record to shorter one? But what I found in the log is
>> only a few tree shrinks... And when reading the source code of
>> freespace allocator I found that a B+Tree growth in this case is
>> impossible at least...
>
> args->isfl doesn't mean what you think it means.
>
> args->isfl is only set when moving blocks from the freespace btree
> to the AGFL, which only occurs when a previous operation allocated a
> new freespace btree block and depleted the current freelist. i.e.
> "AG Free List" != "AG freespace btree" - they are different
> structures on disk...
>
> And when you consider that a freelist refill can only remove records
> from the the freespace btree, it's should be clear that a btree
> split won't occur during a freelist refill...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
       [not found]     ` <20160708034710.GL12670@dastard>
@ 2016-07-08  4:05       ` Kaho Ng
  0 siblings, 0 replies; 13+ messages in thread
From: Kaho Ng @ 2016-07-08  4:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hmm, wouldn't xfs_alloc_ag_vextent_size() first remove the free extent
record, and insert a new extent record into the freespace by-size
btree if the found free extent record is longer than args->maxlen?

On Fri, Jul 8, 2016 at 11:47 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Fri, Jul 08, 2016 at 10:26:33AM +0800, Kaho Ng wrote:
>> Hmm, wouldn't xfs_alloc_ag_vextent_size() first remove the free extent
>> record, and insert a new extent record into the freespace by-size
>> btree if the found free extent record is longer than args->maxlen?
>
> Please reply to the list, not privately.
>
> -Dave.
>
>>
>> On Fri, Jul 8, 2016 at 6:28 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Jul 07, 2016 at 07:01:35PM +0800, Kaho Ng wrote:
>> >> I am trying to investigate how freelist allocator in xfs interacts
>> >> with freespace B+Tree allocator.
>> >> First I prepared a patch
>> >> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
>> >> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
>> >> (The kernel version used is linux-3.10.0-327.22.2.el7).
>> > ......
>> >> When reading the log output
>> >> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
>> >> realised that there is no B+Tree split
>> >> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
>> >> Isn't B+Tree split possible in by-size B+Tree even when truncating a
>> >> longer freespace record to shorter one? But what I found in the log is
>> >> only a few tree shrinks... And when reading the source code of
>> >> freespace allocator I found that a B+Tree growth in this case is
>> >> impossible at least...
>> >
>> > args->isfl doesn't mean what you think it means.
>> >
>> > args->isfl is only set when moving blocks from the freespace btree
>> > to the AGFL, which only occurs when a previous operation allocated a
>> > new freespace btree block and depleted the current freelist. i.e.
>> > "AG Free List" != "AG freespace btree" - they are different
>> > structures on disk...
>> >
>> > And when you consider that a freelist refill can only remove records
>> > from the the freespace btree, it's should be clear that a btree
>> > split won't occur during a freelist refill...
>> >
>> > Cheers,
>> >
>> > Dave.
>> > --
>> > Dave Chinner
>> > david@fromorbit.com
>>
>
> --
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-07 22:28 ` Dave Chinner
       [not found]   ` <CAGeO4WNAdmeXgL4+CAQ1Yqo18XFgv3NZxWVbDTS0xDZLyb3e2w@mail.gmail.com>
@ 2016-07-08  5:48   ` Kaho Ng
  2016-07-10 23:22     ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Kaho Ng @ 2016-07-08  5:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Hmm, wouldn't xfs_alloc_ag_vextent_size() first remove the free extent
record, and insert a new extent record into the freespace by-size
btree if the found free extent record is longer than args->maxlen?

On Fri, Jul 8, 2016 at 6:28 AM, Dave Chinner <david@fromorbit.com> wrote:
> On Thu, Jul 07, 2016 at 07:01:35PM +0800, Kaho Ng wrote:
>> I am trying to investigate how freelist allocator in xfs interacts
>> with freespace B+Tree allocator.
>> First I prepared a patch
>> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
>> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
>> (The kernel version used is linux-3.10.0-327.22.2.el7).
> ......
>> When reading the log output
>> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
>> realised that there is no B+Tree split
>> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
>> Isn't B+Tree split possible in by-size B+Tree even when truncating a
>> longer freespace record to shorter one? But what I found in the log is
>> only a few tree shrinks... And when reading the source code of
>> freespace allocator I found that a B+Tree growth in this case is
>> impossible at least...
>
> args->isfl doesn't mean what you think it means.
>
> args->isfl is only set when moving blocks from the freespace btree
> to the AGFL, which only occurs when a previous operation allocated a
> new freespace btree block and depleted the current freelist. i.e.
> "AG Free List" != "AG freespace btree" - they are different
> structures on disk...
>
> And when you consider that a freelist refill can only remove records
> from the the freespace btree, it's should be clear that a btree
> split won't occur during a freelist refill...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-07 11:01 [QUESTION] about the freelist allocator in XFS Kaho Ng
  2016-07-07 12:13 ` Brian Foster
  2016-07-07 22:28 ` Dave Chinner
@ 2016-07-08 19:17 ` Kaho Ng
  2016-07-09 12:26   ` Kaho Ng
  2016-07-10 16:57 ` Kaho Ng
  3 siblings, 1 reply; 13+ messages in thread
From: Kaho Ng @ 2016-07-08 19:17 UTC (permalink / raw)
  To: xfs

Maybe i should clarify my inquiries first... It is about whether
freelist refilling will trigger any tree splits or even tree growths
in by-size freespace tree.

When reading the source code of XFS(xfs_alloc.c) to find information
about freelist refilling in xfs_free_extent(), I found that insertion
to by-block B+ Tree is not possible to happen since there is only
record updates in this tree. That sounds clear to me.
But insertion to by-size B+ Tree may happen in xfs_alloc_fixup_trees()
after removing an record from the tree.

Thus I come up with a doubt. Is tree split or tree growth in by-size
B+ Tree possible in the above case?

On Thu, Jul 7, 2016 at 7:01 PM, Kaho Ng <ngkaho1234@gmail.com> wrote:
> I am trying to investigate how freelist allocator in xfs interacts
> with freespace B+Tree allocator.
> First I prepared a patch
> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
> (The kernel version used is linux-3.10.0-327.22.2.el7).
> Then, I wrote a simple utility
> <https://gist.github.com/992364ceca984d3f14099ec94aaacd9d> to make
> TONS of
> holes in a filesystem by calling fallocate() to punch holes in a file
> that is almost as large as the volume size.
>
> I created an XFS filesystem image by the following steps:
> 1. fallocate -l 80G /mnt/disk2/xfs
> 2. mkfs.xfs -f -d agcount=1 /mnt/disk2/xfs
>
> Then I created a large file by fallocate:
> fallocate -l 85823746048 /mnt/test/abc
>
> which left only 4 blocks available in the volume finally:
> /dev/loop0      20961280 20961276         4 100% /mnt/test
>
> The result of xfs_bmap against /mnt/test/abc:
> /mnt/test/abc:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET              TOTAL FLAGS
>    0: [0..167624503]:  83000..167707503  0 (83000..167707503) 167624504 10000
>
> After that, I used the hole-punching utility above to create holes on
> the files, and captured the output of kmsg.
>
> When reading the log output
> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
> realised that there is no B+Tree split
> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
> Isn't B+Tree split possible in by-size B+Tree even when truncating a
> longer freespace record to shorter one? But what I found in the log is
> only a few tree shrinks... And when reading the source code of
> freespace allocator I found that a B+Tree growth in this case is
> impossible at least...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-08 19:17 ` Kaho Ng
@ 2016-07-09 12:26   ` Kaho Ng
  0 siblings, 0 replies; 13+ messages in thread
From: Kaho Ng @ 2016-07-09 12:26 UTC (permalink / raw)
  To: xfs

if ((error = xfs_alloc_lookup_eq(cnt_cur, nfbno1, nflen1, &i)))
        return error;
XFS_WANT_CORRUPTED_RETURN(mp, i == 0);
if ((error = xfs_btree_insert(cnt_cur, &i)))
        return error;
XFS_WANT_CORRUPTED_RETURN(mp, i == 1);

The code is extracted from xfs_alloc_fixup_trees(). When the layout of
the leaf in the by-size tree is like this:
+------------------------------+-------------------------------+----
| blkcnt: 5, startblk: xxxx  | blkcnt: 7, startblk: yyyyy  | ...
+------------------------------+-------------------------------+----
which is full of items with blkcnt == 7 in the remaining space, if the
freelist refilling process requires 6 blocks to be allocated, wouldn't
a tree split is required for the insertion to proceed in case the
siblings are also full?

On Sat, Jul 9, 2016 at 3:17 AM, Kaho Ng <ngkaho1234@gmail.com> wrote:
> Maybe i should clarify my inquiries first... It is about whether
> freelist refilling will trigger any tree splits or even tree growths
> in by-size freespace tree.
>
> When reading the source code of XFS(xfs_alloc.c) to find information
> about freelist refilling in xfs_free_extent(), I found that insertion
> to by-block B+ Tree is not possible to happen since there is only
> record updates in this tree. That sounds clear to me.
> But insertion to by-size B+ Tree may happen in xfs_alloc_fixup_trees()
> after removing an record from the tree.
>
> Thus I come up with a doubt. Is tree split or tree growth in by-size
> B+ Tree possible in the above case?
>
> On Thu, Jul 7, 2016 at 7:01 PM, Kaho Ng <ngkaho1234@gmail.com> wrote:
>> I am trying to investigate how freelist allocator in xfs interacts
>> with freespace B+Tree allocator.
>> First I prepared a patch
>> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
>> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
>> (The kernel version used is linux-3.10.0-327.22.2.el7).
>> Then, I wrote a simple utility
>> <https://gist.github.com/992364ceca984d3f14099ec94aaacd9d> to make
>> TONS of
>> holes in a filesystem by calling fallocate() to punch holes in a file
>> that is almost as large as the volume size.
>>
>> I created an XFS filesystem image by the following steps:
>> 1. fallocate -l 80G /mnt/disk2/xfs
>> 2. mkfs.xfs -f -d agcount=1 /mnt/disk2/xfs
>>
>> Then I created a large file by fallocate:
>> fallocate -l 85823746048 /mnt/test/abc
>>
>> which left only 4 blocks available in the volume finally:
>> /dev/loop0      20961280 20961276         4 100% /mnt/test
>>
>> The result of xfs_bmap against /mnt/test/abc:
>> /mnt/test/abc:
>>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET              TOTAL FLAGS
>>    0: [0..167624503]:  83000..167707503  0 (83000..167707503) 167624504 10000
>>
>> After that, I used the hole-punching utility above to create holes on
>> the files, and captured the output of kmsg.
>>
>> When reading the log output
>> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
>> realised that there is no B+Tree split
>> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
>> Isn't B+Tree split possible in by-size B+Tree even when truncating a
>> longer freespace record to shorter one? But what I found in the log is
>> only a few tree shrinks... And when reading the source code of
>> freespace allocator I found that a B+Tree growth in this case is
>> impossible at least...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-07 11:01 [QUESTION] about the freelist allocator in XFS Kaho Ng
                   ` (2 preceding siblings ...)
  2016-07-08 19:17 ` Kaho Ng
@ 2016-07-10 16:57 ` Kaho Ng
  2016-07-10 23:27   ` Dave Chinner
  3 siblings, 1 reply; 13+ messages in thread
From: Kaho Ng @ 2016-07-10 16:57 UTC (permalink / raw)
  To: xfs

well, a piece of comment about the corner case i mentioned is found in
xfsprogs/repair/phase5.c, but i still have no idea how that is
prevented by the xfs kernel module.

/*
 * We need to leave some free records in the tree for the corner case of
 * setting up the AGFL. This may require allocation of blocks, and as
 * such can require insertion of new records into the tree (e.g. moving
 * a record in the by-count tree when a long extent is shortened). If we
 * pack the records into the leaves with no slack space, this requires a
 * leaf split to occur and a block to be allocated from the free list.
 * If we don't have any blocks on the free list (because we are setting
 * it up!), then we fail, and the filesystem will fail with the same
 * failure at runtime. Hence leave a couple of records slack space in
 * each block to allow immediate modification of the tree without
 * requiring splits to be done.
 *
 * XXX(hch): any reason we don't just look at mp->m_alloc_mxr?
 */

On Thu, Jul 7, 2016 at 7:01 PM, Kaho Ng <ngkaho1234@gmail.com> wrote:
> I am trying to investigate how freelist allocator in xfs interacts
> with freespace B+Tree allocator.
> First I prepared a patch
> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
> (The kernel version used is linux-3.10.0-327.22.2.el7).
> Then, I wrote a simple utility
> <https://gist.github.com/992364ceca984d3f14099ec94aaacd9d> to make
> TONS of
> holes in a filesystem by calling fallocate() to punch holes in a file
> that is almost as large as the volume size.
>
> I created an XFS filesystem image by the following steps:
> 1. fallocate -l 80G /mnt/disk2/xfs
> 2. mkfs.xfs -f -d agcount=1 /mnt/disk2/xfs
>
> Then I created a large file by fallocate:
> fallocate -l 85823746048 /mnt/test/abc
>
> which left only 4 blocks available in the volume finally:
> /dev/loop0      20961280 20961276         4 100% /mnt/test
>
> The result of xfs_bmap against /mnt/test/abc:
> /mnt/test/abc:
>  EXT: FILE-OFFSET      BLOCK-RANGE      AG AG-OFFSET              TOTAL FLAGS
>    0: [0..167624503]:  83000..167707503  0 (83000..167707503) 167624504 10000
>
> After that, I used the hole-punching utility above to create holes on
> the files, and captured the output of kmsg.
>
> When reading the log output
> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
> realised that there is no B+Tree split
> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
> Isn't B+Tree split possible in by-size B+Tree even when truncating a
> longer freespace record to shorter one? But what I found in the log is
> only a few tree shrinks... And when reading the source code of
> freespace allocator I found that a B+Tree growth in this case is
> impossible at least...

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-08  5:48   ` Kaho Ng
@ 2016-07-10 23:22     ` Dave Chinner
  2016-07-11  7:06       ` Kaho Ng
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-07-10 23:22 UTC (permalink / raw)
  To: Kaho Ng; +Cc: xfs

[please don't top post]

On Fri, Jul 08, 2016 at 01:48:43PM +0800, Kaho Ng wrote:
> On Fri, Jul 8, 2016 at 6:28 AM, Dave Chinner <david@fromorbit.com> wrote:
> > On Thu, Jul 07, 2016 at 07:01:35PM +0800, Kaho Ng wrote:
> >> I am trying to investigate how freelist allocator in xfs interacts
> >> with freespace B+Tree allocator.
> >> First I prepared a patch
> >> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
> >> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
> >> (The kernel version used is linux-3.10.0-327.22.2.el7).
> > ......
> >> When reading the log output
> >> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
> >> realised that there is no B+Tree split
> >> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
> >> Isn't B+Tree split possible in by-size B+Tree even when truncating a
> >> longer freespace record to shorter one? But what I found in the log is
> >> only a few tree shrinks... And when reading the source code of
> >> freespace allocator I found that a B+Tree growth in this case is
> >> impossible at least...
> >
> > args->isfl doesn't mean what you think it means.
> >
> > args->isfl is only set when moving blocks from the freespace btree
> > to the AGFL, which only occurs when a previous operation allocated a
> > new freespace btree block and depleted the current freelist. i.e.
> > "AG Free List" != "AG freespace btree" - they are different
> > structures on disk...
> >
> > And when you consider that a freelist refill can only remove records
> > from the the freespace btree, it's should be clear that a btree
> > split won't occur during a freelist refill...
>
> Hmm, wouldn't xfs_alloc_ag_vextent_size() first remove the free extent
> record, and insert a new extent record into the freespace by-size
> btree if the found free extent record is longer than args->maxlen?

Good, you went and looked to verify what I said(*).  Indeed, when
you read xfs_alloc_fixup_trees() where the two trees are modified
after an extent has been selected for allocation:

...
        * Delete the entry from the by-size btree.
...
	* Add new by-size btree entry(s).
...
	* Fix up the by-block btree entry(s)

It's pretty clear what the answer is. IOWs, yes, you're correct - we
only ever modify or delete records from the by-bno freespace tree,
but the by-size tree does do a delete and insert and so can
split.(**)

So, while it is possible for a split to occur, you still didn't see
one in your test. That's because a split is rather unlikely in your
scenario because once we have a large enough freespace btree records
stored for a split to occur, we almost always have exact matches in
the by-count tree for freelist fills and so record deletes are the
usual operationon the by-count tree.

And even if we are doing a by-count insert, the chances the target
block for the insert is full is quite small, as are the chances the
target block for the insert is different to the block the original
record was deleted from (consider ordering, what record index the
initial lookup returns and that freelist fills usually only require
a block or two to be allocated).

Cheers,

Dave.

(*) That was what my answer was aimed at getting you to do as I get
a lot of timewasters asking me via direct email to answer their
homework questions for them. Anyone who is trying to understand the
basics of the freespace btrees should have realised the by-count
tree needs a delete/insert operation pair if we modify the length of
a freespace record.

(**) The lesson here is this: trust what people say, but always
verify it when you can. e.g. I now know you'll look at the
code to try to understand answers you are given, so my time
answering your questions is not going to be wasted.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-10 16:57 ` Kaho Ng
@ 2016-07-10 23:27   ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-07-10 23:27 UTC (permalink / raw)
  To: Kaho Ng; +Cc: xfs

On Mon, Jul 11, 2016 at 12:57:59AM +0800, Kaho Ng wrote:
> well, a piece of comment about the corner case i mentioned is found in
> xfsprogs/repair/phase5.c, but i still have no idea how that is
> prevented by the xfs kernel module.
> 
> /*
>  * We need to leave some free records in the tree for the corner case of
>  * setting up the AGFL. This may require allocation of blocks, and as
>  * such can require insertion of new records into the tree (e.g. moving
>  * a record in the by-count tree when a long extent is shortened). If we
>  * pack the records into the leaves with no slack space, this requires a
>  * leaf split to occur and a block to be allocated from the free list.
>  * If we don't have any blocks on the free list (because we are setting
>  * it up!), then we fail, and the filesystem will fail with the same
>  * failure at runtime. Hence leave a couple of records slack space in
>  * each block to allow immediate modification of the tree without
>  * requiring splits to be done.
>  *
>  * XXX(hch): any reason we don't just look at mp->m_alloc_mxr?
>  */

Once mkfs fills the AGFL with the minimum number of blocks to ensure
allocations always succeed, the kernel guarantees that minimum
number of blocks will always be available on the AGFL and hence
allocation and freeing of blocks will always succeed, regardless of
whether btree splits are needed or not.

i.e. ENOSPC is not reported at "filesystem has zero freespace" but
at "freespace btrees are empty and each AGFL is at minimum free
blocks". See XFS_ALLOC_SET_ASIDE() and XFS_ALLOC_AG_MAX_USABLE().

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-10 23:22     ` Dave Chinner
@ 2016-07-11  7:06       ` Kaho Ng
  2016-07-11 22:53         ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Kaho Ng @ 2016-07-11  7:06 UTC (permalink / raw)
  To: xfs

On Mon, Jul 11, 2016 at 7:22 AM, Dave Chinner <david@fromorbit.com> wrote:
> [please don't top post]
>
> On Fri, Jul 08, 2016 at 01:48:43PM +0800, Kaho Ng wrote:
>> On Fri, Jul 8, 2016 at 6:28 AM, Dave Chinner <david@fromorbit.com> wrote:
>> > On Thu, Jul 07, 2016 at 07:01:35PM +0800, Kaho Ng wrote:
>> >> I am trying to investigate how freelist allocator in xfs interacts
>> >> with freespace B+Tree allocator.
>> >> First I prepared a patch
>> >> <https://gist.github.com/22ffca35929e67c08759b057779b7566> on
>> >> linux-source/fs/xfs/libxfs/xfs_alloc.c to print debugging messages
>> >> (The kernel version used is linux-3.10.0-327.22.2.el7).
>> > ......
>> >> When reading the log output
>> >> <https://gist.github.com/890076405e1c13c0a952a579e25e6afe> , I
>> >> realised that there is no B+Tree split
>> >> triggered by xfs_alloc_fix_freelist() when calling xfs_free_extent().
>> >> Isn't B+Tree split possible in by-size B+Tree even when truncating a
>> >> longer freespace record to shorter one? But what I found in the log is
>> >> only a few tree shrinks... And when reading the source code of
>> >> freespace allocator I found that a B+Tree growth in this case is
>> >> impossible at least...
>> >
>> > args->isfl doesn't mean what you think it means.
>> >
>> > args->isfl is only set when moving blocks from the freespace btree
>> > to the AGFL, which only occurs when a previous operation allocated a
>> > new freespace btree block and depleted the current freelist. i.e.
>> > "AG Free List" != "AG freespace btree" - they are different
>> > structures on disk...
>> >
>> > And when you consider that a freelist refill can only remove records
>> > from the the freespace btree, it's should be clear that a btree
>> > split won't occur during a freelist refill...
>>
>> Hmm, wouldn't xfs_alloc_ag_vextent_size() first remove the free extent
>> record, and insert a new extent record into the freespace by-size
>> btree if the found free extent record is longer than args->maxlen?
>
> Good, you went and looked to verify what I said(*).  Indeed, when
> you read xfs_alloc_fixup_trees() where the two trees are modified
> after an extent has been selected for allocation:
>
> ...
>         * Delete the entry from the by-size btree.
> ...
>         * Add new by-size btree entry(s).
> ...
>         * Fix up the by-block btree entry(s)
>
> It's pretty clear what the answer is. IOWs, yes, you're correct - we
> only ever modify or delete records from the by-bno freespace tree,
> but the by-size tree does do a delete and insert and so can
> split.(**)
>
> So, while it is possible for a split to occur, you still didn't see
> one in your test. That's because a split is rather unlikely in your
> scenario because once we have a large enough freespace btree records
> stored for a split to occur, we almost always have exact matches in
> the by-count tree for freelist fills and so record deletes are the
> usual operationon the by-count tree.
>
> And even if we are doing a by-count insert, the chances the target
> block for the insert is full is quite small, as are the chances the
> target block for the insert is different to the block the original
> record was deleted from (consider ordering, what record index the
> initial lookup returns and that freelist fills usually only require
> a block or two to be allocated).
>
> Cheers,
>
> Dave.
>
> (*) That was what my answer was aimed at getting you to do as I get
> a lot of timewasters asking me via direct email to answer their
> homework questions for them. Anyone who is trying to understand the
> basics of the freespace btrees should have realised the by-count
> tree needs a delete/insert operation pair if we modify the length of
> a freespace record.
>
> (**) The lesson here is this: trust what people say, but always
> verify it when you can. e.g. I now know you'll look at the
> code to try to understand answers you are given, so my time
> answering your questions is not going to be wasted.
>
> --
> Dave Chinner
> david@fromorbit.com

Thanks for the detailed explaination!

Just wonders why we prefer failing the request of refilling freelist
with XFS_WANT_CORRUPTED_RETURN(mp, i == 1) in some rare case, rather
than returning NULLAGBLOCK and allowing the loop in
xfs_alloc_ag_vextent_size() to try xfs_alloc_ag_vextent_small()... In
such corner case there will always be a lot of small extents at the
front of the by-count tree, and any truncation changes to the first
entry in the tree will not result in tree splits and triggering
assertion failure.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] about the freelist allocator in XFS
  2016-07-11  7:06       ` Kaho Ng
@ 2016-07-11 22:53         ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-07-11 22:53 UTC (permalink / raw)
  To: Kaho Ng; +Cc: xfs

On Mon, Jul 11, 2016 at 03:06:03PM +0800, Kaho Ng wrote:
> Just wonders why we prefer failing the request of refilling freelist
> with XFS_WANT_CORRUPTED_RETURN(mp, i == 1) in some rare case, rather
> than returning NULLAGBLOCK and allowing the loop in
> xfs_alloc_ag_vextent_size() to try xfs_alloc_ag_vextent_small()...

Have a look at where xfs_alloc_ag_vextent_small() gets the blocks it
returns to the caller if the btree cursor doesn't point to a btree
record we can use. i.e. you can't refill the free list from
xfs_alloc_ag_vextent_small() because it allocates blocks from ...

> In
> such corner case there will always be a lot of small extents at the
> front of the by-count tree, and any truncation changes to the first
> entry in the tree will not result in tree splits and triggering
> assertion failure.

If there are records we can use, then we'll allocate them from the
btree. Failure to allocate from the btree indicates something is
inconsistent, there's a bug in the code or we've got corruption
occuring. A corruption shutdown is the only safe course of action
when we find something confusing like this - if we guess wrong them
we'll only make the bad state/corruption worse.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2016-07-11 22:54 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-07-07 11:01 [QUESTION] about the freelist allocator in XFS Kaho Ng
2016-07-07 12:13 ` Brian Foster
2016-07-07 22:28 ` Dave Chinner
     [not found]   ` <CAGeO4WNAdmeXgL4+CAQ1Yqo18XFgv3NZxWVbDTS0xDZLyb3e2w@mail.gmail.com>
2016-07-08  2:29     ` Fwd: " Kaho Ng
     [not found]     ` <20160708034710.GL12670@dastard>
2016-07-08  4:05       ` Kaho Ng
2016-07-08  5:48   ` Kaho Ng
2016-07-10 23:22     ` Dave Chinner
2016-07-11  7:06       ` Kaho Ng
2016-07-11 22:53         ` Dave Chinner
2016-07-08 19:17 ` Kaho Ng
2016-07-09 12:26   ` Kaho Ng
2016-07-10 16:57 ` Kaho Ng
2016-07-10 23:27   ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.