* fstrim and strace considered harmful?
@ 2022-05-18  6:59 Chris Dunlop
  2022-05-18  7:07 ` Chris Dunlop
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Dunlop @ 2022-05-18  6:59 UTC (permalink / raw)
  To: linux-xfs

Hi,

I have an fstrim that's been running for over 48 hours on a 256T thin 
provisioned XFS fs containing around 55T of actual data on a slow 
subsystem (ceph 8,3 erasure-coded rbd). I don't think there would be an 
enormous amount of data to trim, maybe a few T, but I've no idea how 
long it might be expected to take. In an attempt to see what the 
fstrim was doing, I ran an strace on it. The strace has been 
sitting there without output and unkillable since then, now 5+ hours ago.  
Since the strace, on that same filesystem I now have 123 df processes and 
615 rm processes -- and growing -- that are blocked in xfs_inodegc_flush, 
e.g.:

May 18 15:31:52 d5 kernel: task:df              state:D stack:    0 pid:31741 ppid:     1 flags:0x00004004
May 18 15:31:52 d5 kernel: Call Trace:
May 18 15:31:52 d5 kernel:  <TASK>
May 18 15:31:52 d5 kernel:  __schedule+0x241/0x740
May 18 15:31:52 d5 kernel:  ? lock_is_held_type+0x97/0x100
May 18 15:31:52 d5 kernel:  schedule+0x3a/0xa0
May 18 15:31:52 d5 kernel:  schedule_timeout+0x271/0x310
May 18 15:31:52 d5 kernel:  ? find_held_lock+0x2d/0x90
May 18 15:31:52 d5 kernel:  ? sched_clock_cpu+0x9/0xa0
May 18 15:31:52 d5 kernel:  ? lock_release+0x214/0x350
May 18 15:31:52 d5 kernel:  wait_for_completion+0x7b/0xc0
May 18 15:31:52 d5 kernel:  __flush_work+0x217/0x350
May 18 15:31:52 d5 kernel:  ? flush_workqueue_prep_pwqs+0x120/0x120
May 18 15:31:52 d5 kernel:  ? wait_for_completion+0x1c/0xc0
May 18 15:31:52 d5 kernel:  xfs_inodegc_flush.part.24+0x62/0xc0 [xfs]
May 18 15:31:52 d5 kernel:  xfs_fs_statfs+0x37/0x1a0 [xfs]
May 18 15:31:52 d5 kernel:  statfs_by_dentry+0x3c/0x60
May 18 15:31:52 d5 kernel:  vfs_statfs+0x16/0xd0
May 18 15:31:52 d5 kernel:  user_statfs+0x44/0x80
May 18 15:31:52 d5 kernel:  __do_sys_statfs+0x10/0x30
May 18 15:31:52 d5 kernel:  do_syscall_64+0x34/0x80
May 18 15:31:52 d5 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 18 15:31:52 d5 kernel: RIP: 0033:0x7fe9e9db3c07
May 18 15:31:52 d5 kernel: RSP: 002b:00007ffe08f50178 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
May 18 15:31:52 d5 kernel: RAX: ffffffffffffffda RBX: 0000555963fcae40 RCX: 00007fe9e9db3c07
May 18 15:31:52 d5 kernel: RDX: 00007ffe08f50400 RSI: 00007ffe08f50180 RDI: 0000555963fcae40
May 18 15:31:52 d5 kernel: RBP: 00007ffe08f50180 R08: 0000555963fcae80 R09: 0000000000000000
May 18 15:31:52 d5 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe08f50220
May 18 15:31:52 d5 kernel: R13: 0000000000000000 R14: 0000555963fcae80 R15: 0000555963fcae40
May 18 15:31:52 d5 kernel:  </TASK>

Full 1.5M sysrq output at: https://file.io/bWOL8F7mzKI6

That stack trace is uncomfortably familiar:

Subject: Highly reflinked and fragmented considered harmful?
https://lore.kernel.org/linux-xfs/20220509024659.GA62606@onthe.net.au/

FYI:

# xfs_info /vol
meta-data=/dev/vg01/vol          isize=512    agcount=257, agsize=268434432 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=1
          =                       reflink=1    bigtime=1 inobtcount=1
data     =                       bsize=4096   blocks=68719475712, imaxpct=1
          =                       sunit=1024   swidth=8192 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

Is there something I can do to "unstick" things, or is it time to hit the 
reset, and hope the recovery on mount isn't onerous?

Aside from that immediate issue, what has gone wrong here?

Cheers,

Chris


* Re: fstrim and strace considered harmful?
  2022-05-18  6:59 fstrim and strace considered harmful? Chris Dunlop
@ 2022-05-18  7:07 ` Chris Dunlop
  2022-05-18 15:59   ` Darrick J. Wong
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Dunlop @ 2022-05-18  7:07 UTC (permalink / raw)
  To: linux-xfs

Oh, sorry... on linux v5.15.34

On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
> Hi,
>
> I have an fstrim that's been running for over 48 hours on a 256T thin 
> provisioned XFS fs containing around 55T of actual data on a slow 
> subsystem (ceph 8,3 erasure-coded rbd). I don't think there would be 
> an enormous amount of data to trim, maybe a few T, but I've no 
> idea how long it might be expected to take. In an attempt to 
> see what the fstrim was doing, I ran an strace on it. The 
> strace has been sitting there without output and unkillable since 
> then, now 5+ hours ago.  Since the strace, on that same filesystem I 
> now have 123 df processes and 615 rm processes -- and growing -- that 
> are blocked in xfs_inodegc_flush, e.g.:
>
> May 18 15:31:52 d5 kernel: task:df              state:D stack:    0 pid:31741 ppid:     1 flags:0x00004004
> May 18 15:31:52 d5 kernel: Call Trace:
> May 18 15:31:52 d5 kernel:  <TASK>
> May 18 15:31:52 d5 kernel:  __schedule+0x241/0x740
> May 18 15:31:52 d5 kernel:  ? lock_is_held_type+0x97/0x100
> May 18 15:31:52 d5 kernel:  schedule+0x3a/0xa0
> May 18 15:31:52 d5 kernel:  schedule_timeout+0x271/0x310
> May 18 15:31:52 d5 kernel:  ? find_held_lock+0x2d/0x90
> May 18 15:31:52 d5 kernel:  ? sched_clock_cpu+0x9/0xa0
> May 18 15:31:52 d5 kernel:  ? lock_release+0x214/0x350
> May 18 15:31:52 d5 kernel:  wait_for_completion+0x7b/0xc0
> May 18 15:31:52 d5 kernel:  __flush_work+0x217/0x350
> May 18 15:31:52 d5 kernel:  ? flush_workqueue_prep_pwqs+0x120/0x120
> May 18 15:31:52 d5 kernel:  ? wait_for_completion+0x1c/0xc0
> May 18 15:31:52 d5 kernel:  xfs_inodegc_flush.part.24+0x62/0xc0 [xfs]
> May 18 15:31:52 d5 kernel:  xfs_fs_statfs+0x37/0x1a0 [xfs]
> May 18 15:31:52 d5 kernel:  statfs_by_dentry+0x3c/0x60
> May 18 15:31:52 d5 kernel:  vfs_statfs+0x16/0xd0
> May 18 15:31:52 d5 kernel:  user_statfs+0x44/0x80
> May 18 15:31:52 d5 kernel:  __do_sys_statfs+0x10/0x30
> May 18 15:31:52 d5 kernel:  do_syscall_64+0x34/0x80
> May 18 15:31:52 d5 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
> May 18 15:31:52 d5 kernel: RIP: 0033:0x7fe9e9db3c07
> May 18 15:31:52 d5 kernel: RSP: 002b:00007ffe08f50178 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
> May 18 15:31:52 d5 kernel: RAX: ffffffffffffffda RBX: 0000555963fcae40 RCX: 00007fe9e9db3c07
> May 18 15:31:52 d5 kernel: RDX: 00007ffe08f50400 RSI: 00007ffe08f50180 RDI: 0000555963fcae40
> May 18 15:31:52 d5 kernel: RBP: 00007ffe08f50180 R08: 0000555963fcae80 R09: 0000000000000000
> May 18 15:31:52 d5 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe08f50220
> May 18 15:31:52 d5 kernel: R13: 0000000000000000 R14: 0000555963fcae80 R15: 0000555963fcae40
> May 18 15:31:52 d5 kernel:  </TASK>
>
> Full 1.5M sysrq output at: https://file.io/bWOL8F7mzKI6
>
> That stack trace is uncomfortably familiar:
>
> Subject: Highly reflinked and fragmented considered harmful?
> https://lore.kernel.org/linux-xfs/20220509024659.GA62606@onthe.net.au/
>
> FYI:
>
> # xfs_info /vol
> meta-data=/dev/vg01/vol          isize=512    agcount=257, agsize=268434432 blks
>         =                       sectsz=4096  attr=2, projid32bit=1
>         =                       crc=1        finobt=1, sparse=1, rmapbt=1
>         =                       reflink=1    bigtime=1 inobtcount=1
> data     =                       bsize=4096   blocks=68719475712, imaxpct=1
>         =                       sunit=1024   swidth=8192 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>         =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> Is there something I can do to "unstick" things, or is it time to hit 
> the reset, and hope the recovery on mount isn't onerous?
>
> Aside from that immediate issue, what has gone wrong here?
>
> Cheers,
>
> Chris


* Re: fstrim and strace considered harmful?
  2022-05-18  7:07 ` Chris Dunlop
@ 2022-05-18 15:59   ` Darrick J. Wong
  2022-05-18 22:36     ` Chris Dunlop
  0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2022-05-18 15:59 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: linux-xfs

On Wed, May 18, 2022 at 05:07:13PM +1000, Chris Dunlop wrote:
> Oh, sorry... on linux v5.15.34
> 
> On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
> > Hi,
> > 
> > I have an fstrim that's been running for over 48 hours on a 256T thin
> > provisioned XFS fs containing around 55T of actual data on a slow
> > subsystem (ceph 8,3 erasure-coded rbd). I don't think there would be
> > an enormous amount of data to trim, maybe a few T, but I've no idea
> > how long it might be expected to take. In an attempt to see
> > what the fstrim was doing, I ran an strace on it. The strace
> > has been sitting there without output and unkillable since then, now 5+
> > hours ago.  Since the strace, on that same filesystem I now have 123 df
> > processes and 615 rm processes -- and growing -- that are blocked in
> > xfs_inodegc_flush, e.g.:
> > 
> > May 18 15:31:52 d5 kernel: task:df              state:D stack:    0 pid:31741 ppid:     1 flags:0x00004004
> > May 18 15:31:52 d5 kernel: Call Trace:
> > May 18 15:31:52 d5 kernel:  <TASK>
> > May 18 15:31:52 d5 kernel:  __schedule+0x241/0x740
> > May 18 15:31:52 d5 kernel:  ? lock_is_held_type+0x97/0x100
> > May 18 15:31:52 d5 kernel:  schedule+0x3a/0xa0
> > May 18 15:31:52 d5 kernel:  schedule_timeout+0x271/0x310
> > May 18 15:31:52 d5 kernel:  ? find_held_lock+0x2d/0x90
> > May 18 15:31:52 d5 kernel:  ? sched_clock_cpu+0x9/0xa0
> > May 18 15:31:52 d5 kernel:  ? lock_release+0x214/0x350
> > May 18 15:31:52 d5 kernel:  wait_for_completion+0x7b/0xc0
> > May 18 15:31:52 d5 kernel:  __flush_work+0x217/0x350
> > May 18 15:31:52 d5 kernel:  ? flush_workqueue_prep_pwqs+0x120/0x120
> > May 18 15:31:52 d5 kernel:  ? wait_for_completion+0x1c/0xc0
> > May 18 15:31:52 d5 kernel:  xfs_inodegc_flush.part.24+0x62/0xc0 [xfs]
> > May 18 15:31:52 d5 kernel:  xfs_fs_statfs+0x37/0x1a0 [xfs]
> > May 18 15:31:52 d5 kernel:  statfs_by_dentry+0x3c/0x60
> > May 18 15:31:52 d5 kernel:  vfs_statfs+0x16/0xd0
> > May 18 15:31:52 d5 kernel:  user_statfs+0x44/0x80
> > May 18 15:31:52 d5 kernel:  __do_sys_statfs+0x10/0x30
> > May 18 15:31:52 d5 kernel:  do_syscall_64+0x34/0x80
> > May 18 15:31:52 d5 kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > May 18 15:31:52 d5 kernel: RIP: 0033:0x7fe9e9db3c07
> > May 18 15:31:52 d5 kernel: RSP: 002b:00007ffe08f50178 EFLAGS: 00000246 ORIG_RAX: 0000000000000089
> > May 18 15:31:52 d5 kernel: RAX: ffffffffffffffda RBX: 0000555963fcae40 RCX: 00007fe9e9db3c07
> > May 18 15:31:52 d5 kernel: RDX: 00007ffe08f50400 RSI: 00007ffe08f50180 RDI: 0000555963fcae40
> > May 18 15:31:52 d5 kernel: RBP: 00007ffe08f50180 R08: 0000555963fcae80 R09: 0000000000000000
> > May 18 15:31:52 d5 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe08f50220
> > May 18 15:31:52 d5 kernel: R13: 0000000000000000 R14: 0000555963fcae80 R15: 0000555963fcae40
> > May 18 15:31:52 d5 kernel:  </TASK>
> > 
> > Full 1.5M sysrq output at: https://file.io/bWOL8F7mzKI6

 task:fstrim          state:D stack:    0 pid: 3552 ppid:  2091 flags:0x00004006
 Call Trace:
  <TASK>
  __schedule+0x241/0x740
  schedule+0x3a/0xa0
  schedule_timeout+0x1c9/0x310
  ? del_timer_sync+0x90/0x90
  io_schedule_timeout+0x19/0x40
  wait_for_completion_io_timeout+0x75/0xd0
  submit_bio_wait+0x63/0x90
  ? wait_for_completion_io_timeout+0x1f/0xd0
  blkdev_issue_discard+0x6a/0xa0
  ? _raw_spin_unlock+0x1f/0x30
  xfs_trim_extents+0x1a7/0x3d0 [xfs]
  xfs_ioc_trim+0x161/0x1e0 [xfs]
  xfs_file_ioctl+0x914/0xbf0 [xfs]
  ? __do_sys_newfstat+0x2d/0x40
  __x64_sys_ioctl+0x71/0xb0
  do_syscall_64+0x34/0x80
  entry_SYSCALL_64_after_hwframe+0x44/0xae
 RIP: 0033:0x7fa84e61ae57
 RSP: 002b:00007ffe90fa1da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
 RAX: ffffffffffffffda RBX: 00007ffe90fa1f10 RCX: 00007fa84e61ae57
 RDX: 00007ffe90fa1db0 RSI: 00000000c0185879 RDI: 0000000000000003
 RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffe90fa3d10
 R13: 00007ffe90fa3d10 R14: 0000000000000000 R15: 00007fa84e1fdff8
  </TASK>

It looks like the storage device is stalled on the discard, and most
everything else is stuck waiting for buffer locks?  The statfs threads
are the same symptom as last time.

--D

> > 
> > That stack trace is uncomfortably familiar:
> > 
> > Subject: Highly reflinked and fragmented considered harmful?
> > https://lore.kernel.org/linux-xfs/20220509024659.GA62606@onthe.net.au/
> > 
> > FYI:
> > 
> > # xfs_info /vol
> > meta-data=/dev/vg01/vol          isize=512    agcount=257, agsize=268434432 blks
> >         =                       sectsz=4096  attr=2, projid32bit=1
> >         =                       crc=1        finobt=1, sparse=1, rmapbt=1
> >         =                       reflink=1    bigtime=1 inobtcount=1
> > data     =                       bsize=4096   blocks=68719475712, imaxpct=1
> >         =                       sunit=1024   swidth=8192 blks
> > naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> > log      =internal log           bsize=4096   blocks=521728, version=2
> >         =                       sectsz=4096  sunit=1 blks, lazy-count=1
> > realtime =none                   extsz=4096   blocks=0, rtextents=0
> > 
> > Is there something I can do to "unstick" things, or is it time to hit
> > the reset, and hope the recovery on mount isn't onerous?
> > 
> > Aside from that immediate issue, what has gone wrong here?
> > 
> > Cheers,
> > 
> > Chris


* Re: fstrim and strace considered harmful?
  2022-05-18 15:59   ` Darrick J. Wong
@ 2022-05-18 22:36     ` Chris Dunlop
  2022-05-19  0:50       ` Dave Chinner
  0 siblings, 1 reply; 8+ messages in thread
From: Chris Dunlop @ 2022-05-18 22:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, May 18, 2022 at 08:59:00AM -0700, Darrick J. Wong wrote:
> On Wed, May 18, 2022 at 05:07:13PM +1000, Chris Dunlop wrote:
>> Oh, sorry... on linux v5.15.34
>>
>> On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
>>> I have an fstrim that's been running for over 48 hours on a 256T thin
>>> provisioned XFS fs containing around 55T of actual data on a slow
>>> subsystem (ceph 8,3 erasure-coded rbd). I don't think there would be
>>> an enormous amount of data to trim, maybe a few T, but I've no idea
>>> how long it might be expected to take. In an attempt to see
>>> what the fstrim was doing, I ran an strace on it. The strace
>>> has been sitting there without output and unkillable since then, now 5+
>>> hours ago.  Since the strace, on that same filesystem I now have 123 df
>>> processes and 615 rm processes -- and growing -- that are blocked in
>>> xfs_inodegc_flush, e.g.:
...
> It looks like the storage device is stalled on the discard, and most
> everything else is stuck waiting for buffer locks?  The statfs threads
> are the same symptom as last time.

Note: the box has been rebooted and it's back to normal after an anxious 
30 minutes waiting for the mount recovery. (Not an entirely wasted 30 
minutes - what a thrilling stage of the Giro d'Italia!)

I'm not sure if the fstrim was stalled, unless the strace had stalled it 
somehow: it had been running for ~48 hours without apparent issues before 
the strace was attached, and then it was another hour before the first 
process stuck on xfs_inodegc_flush appeared.

The open question is what caused the stuck processes? It's possible the 
strace was involved: the stuck process with the earliest start time, a 
"df", was started an hour after the strace and it's entirely plausible 
that was the very first df or rm issued after the strace. However it's 
also plausible that was a coincidence and the strace had nothing to do 
with it. Indeed it's even plausible the fstrim had nothing to do with the 
stuck processes and there's something else entirely going on: I don't know 
if there's a ticking time bomb somewhere in the system.

It's now no mystery to me why the fstrim was taking so long, nor why the 
strace didn't produce any output: it turns out fstrim, without an explicit 
--offset --length range, issues a single ioctl() to trim from the start of 
the device to the end, and without an explicit --minimum, uses 
/sys/block/xxx/queue/discard_granularity as the minimum block size to 
discard, in this case 64kB. So it would have been issuing a metric 
shit-ton of discard requests to the underlying storage, something close 
to:

   (fs-size - fs-used) / discard-size
   (256T - 26T) / 64k
   3,858,759,680 requests

It was after figuring out all that that I hit the reset.

Note: it turns out the actual used space per the filesystem is 26T, whilst 
the underlying storage shows 55T used, i.e. there's 29T of real discards 
to process. With this ceph rbd storage I don't know if a "real" discard 
takes any more or less time than a discard to already-unoccupied storage. 

Next time I'll issue the fstrim in much smaller increments, e.g. starting 
with perhaps 128G (at least at first), and use a --minimum that matches 
the underlying object size (4MB). Then play around and monitor it to work 
out what parameters work best for this system.
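
Roughly along these lines - an untested sketch, with the 128G step and
4M minimum taken from the numbers above, and the mountpoint and size
handling obviously to taste:

  fs=/vol
  step=$((128 * 1024**3))                       # 128G per FITRIM call
  size=$(df -B1 --output=size "$fs" | tail -1)  # fs size in bytes
  for ((off = 0; off < size; off += step)); do
      fstrim --offset "$off" --length "$step" --minimum 4M --verbose "$fs"
  done

That way each step is a separate FITRIM ioctl, so progress is visible
and it can be stopped between steps.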

Cheers,

Chris - older, wiser, a little more sleep deprived


* Re: fstrim and strace considered harmful?
  2022-05-18 22:36     ` Chris Dunlop
@ 2022-05-19  0:50       ` Dave Chinner
  2022-05-19  2:33         ` Chris Dunlop
  2022-05-19 15:25         ` Chris Murphy
  0 siblings, 2 replies; 8+ messages in thread
From: Dave Chinner @ 2022-05-19  0:50 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Darrick J. Wong, linux-xfs

On Thu, May 19, 2022 at 08:36:06AM +1000, Chris Dunlop wrote:
> On Wed, May 18, 2022 at 08:59:00AM -0700, Darrick J. Wong wrote:
> > On Wed, May 18, 2022 at 05:07:13PM +1000, Chris Dunlop wrote:
> > > Oh, sorry... on linux v5.15.34
> > > 
> > > On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
> > > > I have an fstrim that's been running for over 48 hours on a 256T thin
> > > > provisioned XFS fs containing around 55T of actual data on a slow
> > > > subsystem (ceph 8,3 erasure-coded rbd). I don't think there would be
> > > > an enormous amount of data to trim, maybe a few T, but I've no idea
> > > > how long it might be expected to take. In an attempt to see
> > > > what the fstrim was doing, I ran an strace on it. The strace
> > > > has been sitting there without output and unkillable since then, now 5+
> > > > hours ago.  Since the strace, on that same filesystem I now have 123 df
> > > > processes and 615 rm processes -- and growing -- that are blocked in
> > > > xfs_inodegc_flush, e.g.:
> ...
> > It looks like the storage device is stalled on the discard, and most
> > everything else is stuck waiting for buffer locks?  The statfs threads
> > are the same symptom as last time.
> 
> Note: the box has been rebooted and it's back to normal after an anxious 30
> minutes waiting for the mount recovery. (Not an entirely wasted 30 minutes -
> what a thrilling stage of the Giro d'Italia!)
> 
> I'm not sure if the fstrim was stalled, unless the strace had stalled it
> somehow: it had been running for ~48 hours without apparent issues before
> the strace was attached, and then it was another hour before the first
> process stuck on xfs_inodegc_flush appeared.

I suspect that it's just that your storage device is really slow at
small trims. If you didn't set a minimum trim size, XFS will issue
discards on every free space in its trees. If you have fragmented
free space (quite possible if you're using reflink and removing
files that have been reflinked and modified) then you could have
millions of tiny free spaces that XFS is now asking the storage to
free.

Dumping the free space histogram of the filesystem will tell us just
how much work you asked the storage to do. e.g:

# xfs_spaceman -c "freesp" /
   from      to extents  blocks    pct
      1       1   20406   20406   0.03
      2       3   14974   35666   0.06
      4       7   11773   61576   0.10
      8      15   11935  131561   0.22
     16      31   15428  359009   0.60
     32      63   13594  620194   1.04
     64     127   15354 1415541   2.38
    128     255   19269 3560215   5.98
    256     511     975  355811   0.60
    512    1023     831  610381   1.02
   1024    2047     398  580983   0.98
   2048    4095     275  827636   1.39
   4096    8191     156  911802   1.53
   8192   16383      90 1051443   1.77
  16384   32767      54 1257999   2.11
  32768   65535      17  813203   1.37
  65536  131071      13 1331349   2.24
 131072  262143      18 3501547   5.88
 262144  524287       8 2834877   4.76
 524288 1048575       8 5722448   9.61
1048576 2097151       6 9189190  15.43
2097152 4194303       4 14026658  23.55
4194304 8388607       2 10348013  17.37
#

So on this 1TB filesystem, there's ~125,000 free space extents and
the vast majority of them are less than 255 blocks in length (1MB).
Hence if I run fstrim on this filesystem without a minimum size limit, it
will issue roughly 125,000 discard requests.

If I set a 1MB minimum size, it will issue discards on all free
spaces 256 blocks or larger, i.e. it will only issue ~2000 discards
and that will cover ~92% of the free space in the filesystem....
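
If you want to check other cut-offs, you can total it straight off the
histogram - rough sketch, assuming the column layout above
(from/to/extents/blocks/pct) and substituting whatever mountpoint
you're looking at:

  # free space extents of at least 256 blocks (1MB) and how much of
  # the free space they cover
  xfs_spaceman -c freesp / |
      awk 'NR > 1 && $1 >= 256 { ext += $3; pct += $5 }
           END { printf "%d free space extents, %.1f%% of free space\n", ext, pct }'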

> The open question is what caused the stuck processes?

Oh, that's the easy bit to explain: discard runs with the AGF
locked because it is iterating the free space tree directly. Hence
operations on that AG are blocked until all the free space in that
AG has been discarded. Could be smarter, never needed to be
smarter.

Now inodegc comes along, and tries to free an inode in that AG, and
blocks getting the AGF lock during the inode free operation (likely
inode chunk freeing or finobt block allocation). Everything then
backs up on inodegc flushes, which is backed up on discard
operations....

> It's now no mystery to me why the fstrim was taking so long, nor why the
> strace didn't produce any output: it turns out fstrim, without an explicit
> --offset --length range, issues a single ioctl() to trim from the start of
> the device to the end, and without an explicit --minimum, uses
> /sys/block/xxx/queue/discard_granularity as the minimum block size to
> discard, in this case 64kB. So it would have been issuing a metric shit-ton
> of discard requests to the underlying storage, something close to:
> 
>   (fs-size - fs-used) / discard-size
>   (256T - 26T) / 64k
>   3,858,759,680 requests

Won't be anywhere near that number - free space in a 256TB
filesystem with only 29TB used will have lots of really large
contiguous free spaces. Those will get broken down into max discard
length chunks, not minimum. Of course, if the bdev is setting a
really small max discard size, then that's going to be just as big a
problem for you....

> It was after figuring out all that that I hit the reset.

Yup, see above for how to actually determine what minimum size to
set for a trim....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: fstrim and strace considered harmful?
  2022-05-19  0:50       ` Dave Chinner
@ 2022-05-19  2:33         ` Chris Dunlop
  2022-05-19  6:33           ` Dave Chinner
  2022-05-19 15:25         ` Chris Murphy
  1 sibling, 1 reply; 8+ messages in thread
From: Chris Dunlop @ 2022-05-19  2:33 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, May 19, 2022 at 10:50:14AM +1000, Dave Chinner wrote:
> On Thu, May 19, 2022 at 08:36:06AM +1000, Chris Dunlop wrote:
>> On Wed, May 18, 2022 at 08:59:00AM -0700, Darrick J. Wong wrote:
>>> On Wed, May 18, 2022 at 05:07:13PM +1000, Chris Dunlop wrote:
>>>> Oh, sorry... on linux v5.15.34
>>>>
>>>> On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
>>>>> I have an fstrim that's been running for over 48 hours on a 256T thin
>>>>> provisioned XFS fs containing around 55T of actual data on a slow
>>>>> subsystem (ceph 8,3 erasure-coded rbd). I don't think there would be
>>>>> an enormous amount of data to trim, maybe a few T, but I've no idea
>>>>> how long it might be expected to take. In an attempt to see
>>>>> what the fstrim was doing, I ran an strace on it. The strace
>>>>> has been sitting there without output and unkillable since then, now 5+
>>>>> hours ago.  Since the strace, on that same filesystem I now have 123 df
>>>>> processes and 615 rm processes -- and growing -- that are blocked in
>>>>> xfs_inodegc_flush, e.g.:
...
> I suspect that it's just that your storage device is really slow at
> small trims. If you didn't set a minimum trim size, XFS will issue
> discards on every free space in its trees. If you have fragmented
> free space (quite possible if you're using reflink and removing
> files that have been reflinked and modified) then you could have
> millions of tiny free spaces that XFS is now asking the storage to
> free.
>
> Dumping the free space histogram of the filesystem will tell us just
> how much work you asked the storage to do. e.g:

# xfs_spaceman -c freesp /vol
    from      to extents  blocks    pct
       1       1    2368    2368   0.00
       2       3    4493   11211   0.00
       4       7    6827   38214   0.00
       8      15   12656  144036   0.00
      16      31   35988  878969   0.00
      32      63  163747 8091729   0.01
      64     127  248625 22572336   0.04
     128     255  367889 71796010   0.11
     256     511  135012 48176856   0.08
     512    1023   92534 74126716   0.12
    1024    2047   13464 18608536   0.03
    2048    4095    3873 10930189   0.02
    4096    8191    1374 7886168   0.01
    8192   16383     598 6875977   0.01
   16384   32767     340 7729264   0.01
   32768   65535     146 6745043   0.01
   65536  131071      48 4419901   0.01
  131072  262143      12 2380800   0.00
  262144  524287       5 1887092   0.00
  524288 1048575       2 1105184   0.00
1048576 2097151       4 5316211   0.01
2097152 4194303       3 8747030   0.01
4194304 8388607      65 522142416   0.83
8388608 16777215       2 21411858   0.03
67108864 134217727       4 379247828   0.60
134217728 268434432     236 62042143854  98.05

I guess from/to are in units of filesystem blocks, 4kB in this case?

Not that it makes much difference here, but for the sake of accuracy... does 
the default fstrim without range/size args issue discard requests for 
*all* the extents, or, if I'm reading this right:

fs/xfs/xfs_discard.c
--
xfs_ioc_trim(
         struct xfs_mount                *mp,
         struct fstrim_range __user      *urange)
{
         struct request_queue    *q = bdev_get_queue(mp->m_ddev_targp->bt_bdev);
         unsigned int            granularity = q->limits.discard_granularity;
...
         if (copy_from_user(&range, urange, sizeof(range)))
                 return -EFAULT;

         range.minlen = max_t(u64, granularity, range.minlen);
...
}

...does it take into account /sys/block/xxx/queue/discard_granularity, in 
this case 64kB, or 16 blocks @ 4kB, so issuing discards only for extents 
>= 16 blocks?

>> The open question is what caused the stuck processes?
>
> Oh, that's the easy bit to explain: discard runs with the AGF
> locked because it is iterating the free space tree directly. Hence
> operations on that AG are blocked until all the free space in that
> AG has been discarded. Could be smarter, never needed to be
> smarter.
>
> Now inodegc comes along, and tries to free an inode in that AG, and
> blocks getting the AGF lock during the inode free operation (likely
> inode chunk freeing or finobt block allocation). Everything then
> backs up on inodegc flushes, which is backed up on discard
> operations....

I'm not sure that explains how the first stuck process only appeared >48 
hours after initiating the fstrim. Unless that's because it may have 
finally got to the AG(s) with a lot of free extents?

#
# AGs w/ at least 400 free extents: only 31 out of 256 AGs
#
d5# xfs_spaceman -c "freesp -gs" /chroot | awk '$2>=400 {print}' 
         AG    extents     blocks
         43      69435   29436263
         47      14984    5623982
         48      42482  166285283
         49      56284  218913822
         50      10354  240969820
{ sequential range...
         54      60416   11292928
         55      72742   15344807
         56      88455   17204239
         57      81594   15218624
         58     126731   27719558
         59      64525   10047593
         60      37324    8591083
         61      57267  113745589
         62      36501   18360912
         63       3998  255040699
         64       2684  258737072
         65       2047  263904736
         66       1503  265467595
         67        920  263457328
         68       1393  264277701
}
         70       1150  266485834
         72        406  267609078
         77        429  267479272
         79        911  267625473
         80       1182  267354313
{ sequential range...
        105      39345  151535709
        106      69512   57950504
        107      46464   10236142
        108      40795    8666966
        109      14431  208027543
        110      15800  258264227
}
total free extents 1090313
total free blocks 63273250933
average free extent size 58032.2

The number of free space extents per AG seems oddly "lumpy", e.g. the 
sequential cluster ranges beginning with AGs 54 and 105 with a large 
number of extents. Is that simply due to the pattern of frees in this 
specific case or is there some underlying design to that?

>>   (fs-size - fs-used) / discard-size
>>   (256T - 26T) / 64k
>>   3,858,759,680 requests
>
> Won't be anywhere near that number - free space in a 256TB
> filesystem with only 29TB used will have lots of really large
> contiguous free spaces. Those will get broken down into max discard
> length chunks, not minimum. Of course, if the bdev is setting a
> really small max discard size, then that's going to be just as big a
> problem for you....

Is this the bdev's max discard size?

# cat /sys/block/dm-0/queue/discard_max_bytes
4194304

And does that mean, for instance, each of these 236 extents will be split 
into somewhere between 131072 and 262143 individual discard requests (i.e. 
the size of the extent in bytes divided by discard_max_bytes) being sent to 
the underlying "device" (ceph rbd)?

# xfs_spaceman -c freesp /vol
      from        to extents      blocks    pct
...
134217728 268434432     236 62042143854  98.05



Cheers,

Chris


* Re: fstrim and strace considered harmful?
  2022-05-19  2:33         ` Chris Dunlop
@ 2022-05-19  6:33           ` Dave Chinner
  0 siblings, 0 replies; 8+ messages in thread
From: Dave Chinner @ 2022-05-19  6:33 UTC (permalink / raw)
  To: Chris Dunlop; +Cc: Darrick J. Wong, linux-xfs

On Thu, May 19, 2022 at 12:33:51PM +1000, Chris Dunlop wrote:
> On Thu, May 19, 2022 at 10:50:14AM +1000, Dave Chinner wrote:
> > On Thu, May 19, 2022 at 08:36:06AM +1000, Chris Dunlop wrote:
> > > On Wed, May 18, 2022 at 08:59:00AM -0700, Darrick J. Wong wrote:
> > > > On Wed, May 18, 2022 at 05:07:13PM +1000, Chris Dunlop wrote:
> > > > > Oh, sorry... on linux v5.15.34
> > > > > 
> > > > > On Wed, May 18, 2022 at 04:59:49PM +1000, Chris Dunlop wrote:
> > > > > > I have an fstrim that's been running for over 48 hours on a 256T thin
> > > > > > provisioned XFS fs containing around 55T of actual data on a slow
> > > > > > subsystem (ceph 8,3 erasure-coded rbd). I don't think there would be
> > > > > > an enormous amount of data to trim, maybe a few T, but I've no idea
> > > > > > how long it might be expected to take. In an attempt to see
> > > > > > what the fstrim was doing, I ran an strace on it. The strace
> > > > > > has been sitting there without output and unkillable since then, now 5+

And now I sort of understand why the strace appeared hung. Read below....

> > > > > > hours ago.  Since the strace, on that same filesystem I now have 123 df
> > > > > > processes and 615 rm processes -- and growing -- that are blocked in
> > > > > > xfs_inodegc_flush, e.g.:
> ...
> > I suspect that it's just that your storage device is really slow at
> > small trims. If you didn't set a minimum trim size, XFS will issue
> > discards on every free space in its trees. If you have fragmented
> > free space (quite possible if you're using reflink and removing
> > files that have been reflinked and modified) then you could have
> > millions of tiny free spaces that XFS is now asking the storage to
> > free.
> > 
> > Dumping the free space histogram of the filesystem will tell us just
> > how much work you asked the storage to do. e.g:
> 
> # xfs_spaceman -c freesp /vol
>    from      to extents  blocks    pct
>       1       1    2368    2368   0.00
>       2       3    4493   11211   0.00
>       4       7    6827   38214   0.00
>       8      15   12656  144036   0.00
>      16      31   35988  878969   0.00
>      32      63  163747 8091729   0.01
>      64     127  248625 22572336   0.04
>     128     255  367889 71796010   0.11
>     256     511  135012 48176856   0.08
>     512    1023   92534 74126716   0.12
>    1024    2047   13464 18608536   0.03
>    2048    4095    3873 10930189   0.02
>    4096    8191    1374 7886168   0.01
>    8192   16383     598 6875977   0.01
>   16384   32767     340 7729264   0.01
>   32768   65535     146 6745043   0.01
>   65536  131071      48 4419901   0.01
>  131072  262143      12 2380800   0.00
>  262144  524287       5 1887092   0.00
>  524288 1048575       2 1105184   0.00
> 1048576 2097151       4 5316211   0.01
> 2097152 4194303       3 8747030   0.01
> 4194304 8388607      65 522142416   0.83
> 8388608 16777215       2 21411858   0.03
> 67108864 134217727       4 379247828   0.60
> 134217728 268434432     236 62042143854  98.05
> 
> I guess from/to are in units of filesystem blocks, 4kB in this case?

Yes.

> 
> Not that it makes much difference here, but the sake of accuracy... does the
> default fstrim without range/size args issue discard requests for *all* the
> extents, or, if I'm reading this right:
> 
> fs/xfs/xfs_discard.c
> --
> xfs_ioc_trim(
>         struct xfs_mount                *mp,
>         struct fstrim_range __user      *urange)
> {
>         struct request_queue    *q = bdev_get_queue(mp->m_ddev_targp->bt_bdev);
>         unsigned int            granularity = q->limits.discard_granularity;
> ...
>         if (copy_from_user(&range, urange, sizeof(range)))
>                 return -EFAULT;
> 
>         range.minlen = max_t(u64, granularity, range.minlen);
> ...
> }
> 
> ...does it take into account /sys/block/xxx/queue/discard_granularity, in

Yes, that's the sysfs file that exports the current value of
q->limits.discard_granularity.

> this case 64kB, or 16 blocks @ 4kB, so issuing discards only for extents
> >= 16 blocks?

Yes. But that's not the same on every device. My local nvme SSDs
give:

$ cat /sys/block/nvme0n1/queue/discard_granularity
512
$

So the disk has a granularity of 1 sector or 512 bytes. Hence fstrim
will try to trim right down to single filesystem block free spaces by
default.

> 
> > > The open question is what caused the stuck processes?
> > 
> > Oh, that's the easy bit to explain: discard runs with the AGF
> > locked because it is iterating the free space tree directly. Hence
> > operations on that AG are blocked until all the free space in that
> > AG has been discarded. Could be smarter, never needed to be
> > smarter.
> > 
> > Now inodegc comes along, and tries to free an inode in that AG, and
> > blocks getting the AGF lock during the inode free operation (likely
> > inode chunk freeing or finobt block allocation). Everything then
> > backs up on inodegc flushes, which is backed up on discard
> > operations....
> 
> I'm not sure that explains how the first stuck process only appeared >48
> hours after initiating the fstrim. Unless that's because it may have finally
> got to the AG(s) with a lot of free extents?

I know why now - see below. It's the same reason the strace is hung.
It took 2 days for fstrim to progress to an AG that had active inode
freeing going on in it.

> # AGs w/ at least 400 free extents: only 31 out of 256 AGs
> #
> d5# xfs_spaceman -c "freesp -gs" /chroot | awk '$2>=400 {print}'
>          AG    extents     blocks
>         43      69435   29436263
>         47      14984    5623982
>         48      42482  166285283
>         49      56284  218913822
>         50      10354  240969820
> { sequential range...
>         54      60416   11292928
>         55      72742   15344807
>         56      88455   17204239
>         57      81594   15218624
>         58     126731   27719558
>         59      64525   10047593
>         60      37324    8591083
>         61      57267  113745589
>         62      36501   18360912
>         63       3998  255040699
>         64       2684  258737072
>         65       2047  263904736
>         66       1503  265467595
>         67        920  263457328
>         68       1393  264277701
> }
>         70       1150  266485834
>         72        406  267609078
>         77        429  267479272
>         79        911  267625473
>         80       1182  267354313
> { sequential range...
>        105      39345  151535709
>        106      69512   57950504
>        107      46464   10236142
>        108      40795    8666966
>        109      14431  208027543
>        110      15800  258264227
> }
> total free extents 1090313
> total free blocks 63273250933

That's 250GB of free space in 1 million extents spread across 31
AG btrees. Not a huge deal for the filesystem, but for trim on
a really slow block device....

> average free extent size 58032.2
> 
> The number of free space extents per AG seems oddly "lumpy", e.g. the
> sequential cluster ranges beginning with AGs 54 and 105 with a large number
> of extents. Is that simply due to the pattern of frees in this specific case
> or is there some underlying design to that?

That's likely showing how the allocator uses locality for related
inodes.  The underlying locality algorithm will give files in the
same directory the same target AG.  So if you have a directory with
100 100GB files in it, they will largely all be allocated from AGs
target to target + 10. When you reflink and overwrite, the overwrite
allocations will also get targeted to the same locality, and so
you'll get lots of small extents allocated across the same AGs, too.
When you then remove one of the reflinked files, only the overwrites
unique to that copy are freed, and now you get new small free spaces
in amongst largely full AGs. Do that often enough, and you get a
significant number of small free spaces occurring.

IIRC, your last problem was one directory with 50 highly reflinked
files totalling 29TB of capacity - the above is pretty much the sort
of allocation and free space patterns I'd expect to see for that
pattern of data storage....

> > Won't be anywhere near that number - free space in a 256TB
> > filesystem with only 29TB used will have lots of really large
> > contiguous free spaces. Those will get broken down into max discard
> > length chunks, not minimum. Of course, if the bdev is setting a
> > really small max discard size, then that's going to be just as big a
> > problem for you....
> 
> Is this the bdev's max discard size?

Yup.

> # cat /sys/block/dm-0/queue/discard_max_bytes
> 4194304

And that's the problem right there.

What is the value for the underlying ceph rbds? That will tell us
if this number comes from the ceph rbds or the dm layers you have
above the ceph rbds. What are you using for thin provisioning? Is
that the dm layer (dm-thin) or something you are getting from the
ceph rbds?
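
Something like this should dump the limits at each layer of the
stack (adjust the device globs to whatever you actually have there):

  grep . /sys/block/rbd*/queue/discard_{granularity,max_bytes} \
         /sys/block/dm-*/queue/discard_{granularity,max_bytes}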

FWIW, the 1TB nvme drive I mentioned above?

$ cat /sys/block/nvme0n1/queue/discard_max_bytes 
2199023255040
$ echo $(((2**32 - 1 ) * 512))
2199023255040
$

That can handle one sector short of 2TB per discard, which I think
is the maximum the nvme protocol allows in a single discard
operation. IOWs, I can issue a single discard request to free the
entire device.

> And does that mean, for instance, these 236 extents will be split into
> somewhere between 131072 and 262143 individual discard requests (i.e. size
> of extent in bytes divided by discard_max_bytes) being sent to the
> underlying "device" (ceph rbd)?
> 
> # xfs_spaceman -c freesp /vol
>      from        to extents      blocks    pct
> ...
> 134217728 268434432     236 62042143854  98.05

Yes, these are largely empty AGs - most of them have more than half
their space in one large contiguous free extent. And this will be
why the strace appears hung - discarding a single AG full of free
space (1TB) will be done as 250,000+ individual discard operations.

So that's something like 55 million discards for the ~220TB of free
space you have in that filesystem....

As to the strace that has hung - the discard loop has this in it:

                if (fatal_signal_pending(current)) {
                        error = -ERESTARTSYS;
                        goto out_del_cursor;
                }

So if you ctrl-c the fstrim and we are doing discards on small free
spaces, it should abort almost straight away.  However, if we've
issued a single discard for an entirely empty AG - we've called
blkdev_issue_discard() with a 1TB length.  blkdev_issue_discard()
does the slice and dice into device length and aligned discards,
and it has no checks for signals in its main loop.

So strace/fstrim are stuck until those 250,000+ discards are issued
and completed - however many hours that will take. I suspect that
we should push a signal check into blkdev_issue_discard() and
friends.

If you've got a recent iostat on your system:

$ iostat -dxyzm 5

<trimmed read/write output for readability>

Device                 d/s     dMB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
nvme0n1          0    0.00      0.00     0.00   0.00    0.00     0.00    5.60    1.21    0.01   2.56
nvme0n1          0  367.60    204.51     0.00   0.00    2.58   569.69    0.80    8.75    0.96  97.04
nvme0n1          0  362.20    579.89     0.00   0.00    2.71  1639.44    3.60    1.89    1.00 100.00
nvme0n1          0  376.20   2070.96     0.00   0.00    2.59  5637.08    0.60    0.67    0.98  99.44
....

You can see what sort of progress the discards are making.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: fstrim and strace considered harmful?
  2022-05-19  0:50       ` Dave Chinner
  2022-05-19  2:33         ` Chris Dunlop
@ 2022-05-19 15:25         ` Chris Murphy
  1 sibling, 0 replies; 8+ messages in thread
From: Chris Murphy @ 2022-05-19 15:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chris Dunlop, Darrick J. Wong, xfs list

On Wed, May 18, 2022 at 8:50 PM Dave Chinner <david@fromorbit.com> wrote:

> I suspect that it's just that your storage device is really slow at
> small trims. If you didn't set a minimum trim size, XFS will issue
> discards on every free space in its trees. If you have fragmented
> free space (quite possible if you're using reflink and removing
> files that have been reflinked and modified) then you could have
> millions of tiny free spaces that XFS is now asking the storage to
> free.

Yeah, the fstrim man page says the minimum-size default is 0, so it'll
trim every filesystem free block. I asked about it a while ago on the
linux-block@ list and didn't get a reply; maybe it's the wrong list.
https://lore.kernel.org/linux-block/CAJCQCtRM4Gn_EY_A0Da7qz=MFfmw08+oD=syQEQt=9DrE8_gFw@mail.gmail.com/

If the context includes trimming down to SSD hardware, and the workload
involves lots of small files so that many/most of the mixed used+free
portions of the filesystem look like swiss cheese, then I suppose a 1M
granularity means quite a lot isn't trimmed, and ends up getting
needlessly moved around by the firmware's wear leveling? But oh well?
Maybe in that case the discard mount option makes more sense.

But if the context is only LVM thin provisioning, and not pass-through
to an SSD, then a 4M granularity is adequate (matching the LVM logical
extent size). Offhand I can't see a benefit to trimming thin
provisioning at less than the LE size.


--
Chris Murphy

