* Extreme fragmentation ho!
@ 2020-12-21 21:54 Chris Dunlop
  2020-12-22 13:03 ` Brian Foster
  2020-12-28 22:06 ` Dave Chinner
  0 siblings, 2 replies; 5+ messages in thread
From: Chris Dunlop @ 2020-12-21 21:54 UTC (permalink / raw)
  To: linux-xfs

Hi,

I have a 2T file fragmented into 841891 randomly placed extents. It takes 
4-6 minutes (depending on what else the filesystem is doing) to delete the 
file. This is causing a timeout in the application doing the removal, and 
hilarity ensues.

The fragmentation is the result of reflinking bits and bobs from other 
files into the subject file, so it's probably unavoidable.

The file is sitting on XFS on LV on a raid6 comprising 6 x 5400 RPM HDD:

# xfs_info /home
meta-data=/dev/mapper/vg00-home  isize=512    agcount=32, agsize=244184192 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=1
          =                       reflink=1
data     =                       bsize=4096   blocks=7813893120, imaxpct=5
          =                       sunit=128    swidth=512 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

I'm guessing the time taken to remove is not unreasonable given the speed 
of the underlying storage and the amount of metadata involved. Does my 
guess seem correct?

I'd like to do some experimentation with a facsimile of this file, e.g. try
the removal on different storage subsystems, and/or with an external fast
journal, etc., to see how they compare.

What is the easiest way to recreate a similarly (or even better, 
identically) fragmented file?

One way would be to use xfs_metadump / xfs_mdrestore to create an entire 
copy of the original filesystem, but I'd really prefer not taking the 
original fs offline for the time required. I also don't have the space to 
restore the whole fs but perhaps using lvmthin can address the restore 
issue, at the cost of a slight(?) performance impact due to the extra 
layer.

Is it possible to use the output of xfs_bmap on the original file to
drive ...something, maybe xfs_io, to recreate the fragmentation? A naive
test using xfs_io pwrite didn't produce any fragmentation - unsurprisingly, 
given the effort XFS puts into reducing fragmentation.
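
For example, something along these lines (an untested sketch; the path is a
placeholder and the awk is just my reading of the -v columns) to flatten the
extent list into byte offsets and lengths that a replay script could consume:

  xfs_bmap -v /path/to/fragmented.file | awk '
      NR > 2 && $3 != "hole" {
          gsub(/[\[\]:]/, "", $2)          # "[0..127]:" -> "0..127"
          split($2, r, "\\.\\.")           # r[1]=start, r[2]=end (512B units)
          printf "%d %d\n", r[1] * 512, (r[2] - r[1] + 1) * 512
      }'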

Cheers,

Chris


* Re: Extreme fragmentation ho!
  2020-12-21 21:54 Extreme fragmentation ho! Chris Dunlop
@ 2020-12-22 13:03 ` Brian Foster
  2020-12-28 22:06 ` Dave Chinner
  1 sibling, 0 replies; 5+ messages in thread
From: Brian Foster @ 2020-12-22 13:03 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: linux-xfs

On Tue, Dec 22, 2020 at 08:54:53AM +1100, Chris Dunlop wrote:
> Hi,
> 
> I have a 2T file fragmented into 841891 randomly placed extents. It takes
> 4-6 minutes (depending on what else the filesystem is doing) to delete the
> file. This is causing a timeout in the application doing the removal, and
> hilarity ensues.
> 
> The fragmentation is the result of reflinking bits and bobs from other files
> into the subject file, so it's probably unavoidable.
> 
> The file is sitting on XFS on LV on a raid6 comprising 6 x 5400 RPM HDD:
> 
> # xfs_info /home
> meta-data=/dev/mapper/vg00-home  isize=512    agcount=32, agsize=244184192 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1
> data     =                       bsize=4096   blocks=7813893120, imaxpct=5
>          =                       sunit=128    swidth=512 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> I'm guessing the time taken to remove is not unreasonable given the speed of
> the underlying storage and the amount of metadata involved. Does my guess
> seem correct?
> 
> I'd like to do some experimentation with a facsimile of this file, e.g. try
> the removal on different storage subsystems, and/or with an external fast
> journal, etc., to see how they compare.
> 
> What is the easiest way to recreate a similarly (or even better,
> identically) fragmented file?
> 
> One way would be to use xfs_metadump / xfs_mdrestore to create an entire
> copy of the original filesystem, but I'd really prefer not taking the
> original fs offline for the time required. I also don't have the space to
> restore the whole fs but perhaps using lvmthin can address the restore
> issue, at the cost of a slight(?) performance impact due to the extra layer.
> 

Note that xfs_metadump doesn't include file data, only metadata, so it
might actually be the most time and space efficient way to replicate the
large file. You would need a similarly sized block device to restore to
and would not be able to change filesystem geometry and whatnot. The
former can be easily worked around by restoring the image to a file on a
smaller fs though, which may or may not interfere with whatever
performance testing you're doing.
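
E.g., something along these lines (untested; the /scratch and /mnt/restore
paths are placeholders):

  # Sketch: metadata-only dump of the original fs, restored into a sparse
  # image file on some other filesystem, then loop mounted. No file data
  # is copied, so the image stays small, but the 2T file's extent map and
  # the rmap/refcount metadata come across intact.
  # (Take the dump while the source fs is unmounted, read-only or frozen;
  # filenames are obfuscated unless you pass -o, and nouuid is needed if
  # the original fs is still mounted on the same host.)
  xfs_metadump -g -o /dev/mapper/vg00-home /scratch/home.metadump
  xfs_mdrestore /scratch/home.metadump /scratch/home.img
  mount -o loop,nouuid /scratch/home.img /mnt/restore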

> Is it possible to use the output of xfs_bmap on the original file to drive
> ...something, maybe xfs_io, to recreate the fragmentation? A naive test
> using xfs_io pwrite didn't produce any fragmentation - unsurprisingly, given
> the effort XFS puts into reducing fragmentation.
> 

fstests has a helper program (xfstests-dev/src/punch-alternating) that
helps create fragmented files. IIRC, you create a fully allocated file
in advance and it will punch out alternating ranges based on the
offset/size parameters. You might have to wait a bit for it to complete,
but it's pretty easy to use (and you can always create a metadump image
from the result for quicker restoration).
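
E.g. (untested; sizes picked out of the air):

  # Sketch: preallocate a file, then punch out every other filesystem
  # block so each remaining 4k block becomes its own extent. An 8GiB
  # file ends up with ~1M single-block extents, the same order of
  # magnitude as the original 841891.
  xfs_io -f -c "falloc 0 8g" /mnt/test/punchy

  # punch-alternating gets built in an xfstests checkout ('make' in the tree)
  /path/to/xfstests-dev/src/punch-alternating /mnt/test/punchy

  # rough extent count (skip the two header lines and any holes)
  xfs_bmap -v /mnt/test/punchy | tail -n +3 | grep -vc hole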

Yet another option might be to try a write workload that attempts to
defeat the allocator heuristics. For example, do direct I/O or falloc
requests in reverse order and in small sizes across a file. xfs_io has a
couple flags you can pass to pwrite (i.e., -B, -R) to make that easier,
but that's more manual and you may have to play around with it to get
the behavior you want.
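
E.g. (again untested; file name and sizes invented):

  # Sketch: 4k direct I/O writes sprayed at random offsets across a 1GiB
  # range (-R writes at random offsets within the range; -B would write
  # it backwards instead), trying to defeat the allocator's contiguity
  # heuristics.
  xfs_io -f -d -c "pwrite -b 4k -R 0 1g" /mnt/test/scatter

  xfs_bmap -v /mnt/test/scatter | tail -n +3 | grep -vc hole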

Brian

> Cheers,
> 
> Chris
> 



* Re: Extreme fragmentation ho!
  2020-12-21 21:54 Extreme fragmentation ho! Chris Dunlop
  2020-12-22 13:03 ` Brian Foster
@ 2020-12-28 22:06 ` Dave Chinner
  2020-12-30  6:28   ` Chris Dunlop
  1 sibling, 1 reply; 5+ messages in thread
From: Dave Chinner @ 2020-12-28 22:06 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: linux-xfs

On Tue, Dec 22, 2020 at 08:54:53AM +1100, Chris Dunlop wrote:
> Hi,
> 
> I have a 2T file fragmented into 841891 randomly placed extents. It takes
> 4-6 minutes (depending on what else the filesystem is doing) to delete the
> file. This is causing a timeout in the application doing the removal, and
> hilarity ensues.

~3,000 extents/s being removed, with reflink+rmap mods being made for
every extent. Seems a little slow compared to what I typically see,
but...

> The fragmentation is the result of reflinking bits and bobs from other files
> into the subject file, so it's probably unavoidable.
> 
> The file is sitting on XFS on LV on a raid6 comprising 6 x 5400 RPM HDD:

... probably not that unreasonable for pretty much the slowest
storage configuration you can possibly come up with for small,
metadata write intensive workloads.

> # xfs_info /home
> meta-data=/dev/mapper/vg00-home  isize=512    agcount=32, agsize=244184192 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=1
>          =                       reflink=1
> data     =                       bsize=4096   blocks=7813893120, imaxpct=5
>          =                       sunit=128    swidth=512 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> I'm guessing the time taken to remove is not unreasonable given the speed of
> the underlying storage and the amount of metadata involved. Does my guess
> seem correct?

Yup.

> I'd like to do some experimentation with a facsimile of this file, e.g. try
> the removal on different storage subsystems, and/or with an external fast
> journal, etc., to see how they compare.

I think you'll find a limit at ~20,000 extents/s, regardless of your
storage subsystem. Once you take away IO latency, it's basically
single threaded and CPU bound so performance is largely dependent
on how fast your CPUs are. IOWs, the moment you move to SSDs, it
will be CPU bound and still take a minute or two to remove all the
extents....

> What is the easiest way to recreate a similarly (or even better,
> identically) fragmented file?

Just script xfs_io to reflink random bits and bobs from other files
into a larger file?
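
Something like this, for the record (an untested sketch; names, counts and
sizes are made up, and it assumes a 4k block size and a fully written donor
file of at least 1GiB):

  #!/bin/bash
  # Sketch: build a heavily fragmented file by reflinking many small,
  # randomly chosen block-sized ranges out of a donor file. Adjacent
  # logical blocks in the destination end up pointing at unrelated
  # physical blocks, so nothing merges and you get one extent per
  # iteration.
  donor=/mnt/test/donor          # existing, fully written source file
  dest=/mnt/test/frankenfile
  blksz=4096
  donor_blocks=262144            # 1GiB worth of 4k blocks
  nextents=100000

  for ((i = 0; i < nextents; i++)); do
      srcoff=$(( (RANDOM * 32768 + RANDOM) % donor_blocks * blksz ))
      dstoff=$(( i * blksz ))
      xfs_io -f -c "reflink $donor $srcoff $dstoff $blksz" "$dest"
  done

Forking one xfs_io per extent is slow; passing a batch of -c commands to
each xfs_io invocation cuts the overhead down considerably.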

> One way would be to use xfs_metadump / xfs_mdrestore to create an entire
> copy of the original filesystem, but I'd really prefer not taking the
> original fs offline for the time required. I also don't have the space to
> restore the whole fs but perhaps using lvmthin can address the restore
> issue, at the cost of a slight(?) performance impact due to the extra layer.

Easiest, most space-efficient way is to mdrestore to a file (it ends up
sparse, containing only metadata), then mount it via loopback.
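
e.g. (paths made up), so every timing run starts from an identical image:

  xfs_mdrestore /scratch/home.metadump /scratch/home.img
  mount -o loop,nouuid /scratch/home.img /mnt/restore
  time rm /mnt/restore/path/to/the-big-file
  umount /mnt/restore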

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: Extreme fragmentation ho!
  2020-12-28 22:06 ` Dave Chinner
@ 2020-12-30  6:28   ` Chris Dunlop
  2020-12-30 22:03     ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Chris Dunlop @ 2020-12-30  6:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Dec 29, 2020 at 09:06:22AM +1100, Dave Chinner wrote:
> On Tue, Dec 22, 2020 at 08:54:53AM +1100, Chris Dunlop wrote:
>> The file is sitting on XFS on LV on a raid6 comprising 6 x 5400 RPM HDD:
>
> ... probably not that unreasonable for pretty much the slowest
> storage configuration you can possibly come up with for small,
> metadata write intensive workloads.

[ Chris grimaces and glances over at the 8+3 erasure-encoded ceph rbd 
   sitting like a pitch drop experiment in the corner. ]

Speaking of slow storage and metadata write intensive workloads, what's
the reason reflinks with a realtime device aren't supported? That was one
approach I wanted to try, to get the metadata ops running on small fast
storage with the bulk data sitting on big slow bulk storage. But:

# mkfs.xfs -m reflink=1 -d rtinherit=1 -r rtdev=/dev/fast /dev/slow
reflink not supported with realtime devices

My naive thought was a reflink was probably "just" a block range 
referenced from multiple places, and probably a refcount somewhere. It 
seems like it should be possible to have the range, references and 
refcount sitting on the fast storage pointing to the actual data blocks on 
the slow storage.

>> What is the easiest way to recreate a similarly (or even better,
>> identically) fragmented file?
>
> Just script xfs_io to reflink random bits and bobs from other files
> into a larger file?

Thanks - that did it.

Cheers,

Chris


* Re: Extreme fragmentation ho!
  2020-12-30  6:28   ` Chris Dunlop
@ 2020-12-30 22:03     ` Dave Chinner
  0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2020-12-30 22:03 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: linux-xfs

On Wed, Dec 30, 2020 at 05:28:36PM +1100, Chris Dunlop wrote:
> On Tue, Dec 29, 2020 at 09:06:22AM +1100, Dave Chinner wrote:
> > On Tue, Dec 22, 2020 at 08:54:53AM +1100, Chris Dunlop wrote:
> > > The file is sitting on XFS on LV on a raid6 comprising 6 x 5400 RPM HDD:
> > 
> > ... probably not that unreasonable for pretty much the slowest
> > storage configuration you can possibly come up with for small,
> > metadata write intensive workloads.
> 
> [ Chris grimaces and glances over at the 8+3 erasure-encoded ceph rbd
> sitting like a pitch drop experiment in the corner. ]

I would have thought that should be able to do more than the 20 IOPS
the raid6 above will do on random 4kB writes.... :)

> Speaking of slow storage and metadata write intensive workloads, what's the
> reason reflinks with a realtime device aren't supported? That was one
> approach I wanted to try, to get the metadata ops running on small fast
> storage with the bulk data sitting on big slow bulk storage. But:
> 
> # mkfs.xfs -m reflink=1 -d rtinherit=1 -r rtdev=/dev/fast /dev/slow
> reflink not supported with realtime devices

Yup, the realtime device is a pure data device, so all its metadata
is held externally to the device (i.e. it is held in the "data
device", not the RT device). IOWs, it's a completely separate
filesystem implementation within XFS, and so requires independent
functional extensions to support reflink + rmap...

> My naive thought was a reflink was probably "just" a block range referenced
> from multiple places, and probably a refcount somewhere. It seems like it
> should be possible to have the range, references and refcount sitting on the
> fast storage pointing to the actual data blocks on the slow storage.

Yes, it is possible, but the current reflink implementation is based
on allocation group internal structures (rmap is the same), and the
realtime device doesn't have these. Hence there are new metadata
structures that need to be added (refcount btrees rooted in inodes,
not fixed location AG headers) and a bunch of new supporting code to
be written. Darrick has largely done this already; it's just a
matter of review bandwidth and validation time:

https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=realtime-reflink-extsize

(which also includes realtime rmap support, a whole new internal
metadata inode directory to index all the new inode btrees for the
rt device, etc)

It's a pretty large chunk of shiny new code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


