All of lore.kernel.org
* v5.10.1 xfs deadlock
@ 2020-12-17 17:44 Donald Buczek
  2020-12-17 19:43 ` Brian Foster
  2020-12-18 21:49 ` v5.10.1 xfs deadlock Dave Chinner
  0 siblings, 2 replies; 29+ messages in thread
From: Donald Buczek @ 2020-12-17 17:44 UTC (permalink / raw)
  To: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

Dear xfs developer,

I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.

The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.

After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.

     root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
     root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
     root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
     root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP

Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt

It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this:

     [<0>] xfs_log_commit_cil+0x6cc/0x7c0
     [<0>] __xfs_trans_commit+0xab/0x320
     [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
     [<0>] xfs_end_ioend+0xc6/0x110
     [<0>] xfs_end_io+0xad/0xe0
     [<0>] process_one_work+0x1dd/0x3e0
     [<0>] worker_thread+0x2d/0x3b0
     [<0>] kthread+0x118/0x130
     [<0>] ret_from_fork+0x22/0x30

xfs_log_commit_cil+0x6cc is

   xfs_log_commit_cil()
     xlog_cil_push_background(log)
       xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);

Some other threads, including the four "cp" commands, are also blocked at xfs_log_commit_cil+0x6cc.

There is also a single "flush" process for each md device with this stack signature:

     [<0>] xfs_map_blocks+0xbf/0x400
     [<0>] iomap_do_writepage+0x15e/0x880
     [<0>] write_cache_pages+0x175/0x3f0
     [<0>] iomap_writepages+0x1c/0x40
     [<0>] xfs_vm_writepages+0x59/0x80
     [<0>] do_writepages+0x4b/0xe0
     [<0>] __writeback_single_inode+0x42/0x300
     [<0>] writeback_sb_inodes+0x198/0x3f0
     [<0>] __writeback_inodes_wb+0x5e/0xc0
     [<0>] wb_writeback+0x246/0x2d0
     [<0>] wb_workfn+0x26e/0x490
     [<0>] process_one_work+0x1dd/0x3e0
     [<0>] worker_thread+0x2d/0x3b0
     [<0>] kthread+0x118/0x130
     [<0>] ret_from_fork+0x22/0x30

xfs_map_blocks+0xbf is the

     xfs_ilock(ip, XFS_ILOCK_SHARED);

in xfs_map_blocks().

The system is low on free memory

     MemTotal:       197587764 kB
     MemFree:          2196496 kB
     MemAvailable:   189895408 kB

but responsive.

I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005, pulled from linux-scsi), but it is unlikely that this blocking is related to it, because the md block devices themselves are responsive (`xxd /dev/md0`).

I can keep the system in this state for a while. Is there an idea what is going on, or what data I could collect from the running system to help? I have full debug info and could walk lists or retrieve data structures with gdb.
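
[Editorial sketch, not part of the original report: the per-stack summary above can be collected with a short script like the following. The function name `stack_signatures` is made up, and reading /proc/<pid>/stack generally requires root; unreadable or exited tasks are skipped.]

```python
# Sketch: group kernel task stacks by identical signature, as in the summary
# above (e.g. 549 "xfs-conv" threads sharing one stack). Requires root to
# read /proc/<pid>/stack; inaccessible or vanished tasks are skipped.
from collections import Counter
from pathlib import Path

def stack_signatures():
    counts = Counter()
    for task in Path("/proc").glob("[0-9]*/task/[0-9]*"):
        try:
            stack = (task / "stack").read_text()
        except OSError:
            continue  # permission denied, or the task exited meanwhile
        counts[stack] += 1
    return counts

if __name__ == "__main__":
    for stack, n in stack_signatures().most_common(5):
        print(f"{n} task(s) with stack:\n{stack}")
```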

Best
   Donald


* Re: v5.10.1 xfs deadlock
  2020-12-17 17:44 v5.10.1 xfs deadlock Donald Buczek
@ 2020-12-17 19:43 ` Brian Foster
  2020-12-17 21:30   ` Donald Buczek
  2020-12-18 21:49 ` v5.10.1 xfs deadlock Dave Chinner
  1 sibling, 1 reply; 29+ messages in thread
From: Brian Foster @ 2020-12-17 19:43 UTC (permalink / raw)
  To: Donald Buczek; +Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
> Dear xfs developer,
> 
> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
> 
> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
> 
> After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.
> 
>     root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>     root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>     root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>     root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
> 
> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
> 
> It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this:
> 
>     [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>     [<0>] __xfs_trans_commit+0xab/0x320
>     [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>     [<0>] xfs_end_ioend+0xc6/0x110
>     [<0>] xfs_end_io+0xad/0xe0
>     [<0>] process_one_work+0x1dd/0x3e0
>     [<0>] worker_thread+0x2d/0x3b0
>     [<0>] kthread+0x118/0x130
>     [<0>] ret_from_fork+0x22/0x30
> 
> xfs_log_commit_cil+0x6cc is
> 
>   xfs_log_commit_cil()
>     xlog_cil_push_background(log)
>       xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
> 
> Some other threads, including the four "cp" commands, are also blocked at xfs_log_commit_cil+0x6cc.
> 
> There is also a single "flush" process for each md device with this stack signature:
> 
>     [<0>] xfs_map_blocks+0xbf/0x400
>     [<0>] iomap_do_writepage+0x15e/0x880
>     [<0>] write_cache_pages+0x175/0x3f0
>     [<0>] iomap_writepages+0x1c/0x40
>     [<0>] xfs_vm_writepages+0x59/0x80
>     [<0>] do_writepages+0x4b/0xe0
>     [<0>] __writeback_single_inode+0x42/0x300
>     [<0>] writeback_sb_inodes+0x198/0x3f0
>     [<0>] __writeback_inodes_wb+0x5e/0xc0
>     [<0>] wb_writeback+0x246/0x2d0
>     [<0>] wb_workfn+0x26e/0x490
>     [<0>] process_one_work+0x1dd/0x3e0
>     [<0>] worker_thread+0x2d/0x3b0
>     [<0>] kthread+0x118/0x130
>     [<0>] ret_from_fork+0x22/0x30
> 
> xfs_map_blocks+0xbf is the
> 
>     xfs_ilock(ip, XFS_ILOCK_SHARED);
> 
> in xfs_map_blocks().
> 
> The system is low on free memory
> 
>     MemTotal:       197587764 kB
>     MemFree:          2196496 kB
>     MemAvailable:   189895408 kB
> 
> but responsive.
> 
> I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005, pulled from linux-scsi), but it is unlikely that this blocking is related to it, because the md block devices themselves are responsive (`xxd /dev/md0`).
> 
> I can keep the system in this state for a while. Is there an idea what is going on, or what data I could collect from the running system to help? I have full debug info and could walk lists or retrieve data structures with gdb.
> 

It might be useful to dump the values under /sys/fs/xfs/<dev>/log/* for
each fs to get an idea of the state of the logs as well...

Brian

> Best
>   Donald
> 



* Re: v5.10.1 xfs deadlock
  2020-12-17 19:43 ` Brian Foster
@ 2020-12-17 21:30   ` Donald Buczek
  2020-12-18 15:35     ` Brian Foster
  0 siblings, 1 reply; 29+ messages in thread
From: Donald Buczek @ 2020-12-17 21:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On 17.12.20 20:43, Brian Foster wrote:
> On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
>> Dear xfs developer,
>>
>> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
>>
>> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
>>
>> After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.
>>
>>      root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>>      root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>>      root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>>      root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>>
>> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>>
>> It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this:
>>
>>      [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>>      [<0>] __xfs_trans_commit+0xab/0x320
>>      [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>>      [<0>] xfs_end_ioend+0xc6/0x110
>>      [<0>] xfs_end_io+0xad/0xe0
>>      [<0>] process_one_work+0x1dd/0x3e0
>>      [<0>] worker_thread+0x2d/0x3b0
>>      [<0>] kthread+0x118/0x130
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> xfs_log_commit_cil+0x6cc is
>>
>>    xfs_log_commit_cil()
>>      xlog_cil_push_background(log)
>>        xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>>
>> Some other threads, including the four "cp" commands, are also blocked at xfs_log_commit_cil+0x6cc.
>>
>> There is also a single "flush" process for each md device with this stack signature:
>>
>>      [<0>] xfs_map_blocks+0xbf/0x400
>>      [<0>] iomap_do_writepage+0x15e/0x880
>>      [<0>] write_cache_pages+0x175/0x3f0
>>      [<0>] iomap_writepages+0x1c/0x40
>>      [<0>] xfs_vm_writepages+0x59/0x80
>>      [<0>] do_writepages+0x4b/0xe0
>>      [<0>] __writeback_single_inode+0x42/0x300
>>      [<0>] writeback_sb_inodes+0x198/0x3f0
>>      [<0>] __writeback_inodes_wb+0x5e/0xc0
>>      [<0>] wb_writeback+0x246/0x2d0
>>      [<0>] wb_workfn+0x26e/0x490
>>      [<0>] process_one_work+0x1dd/0x3e0
>>      [<0>] worker_thread+0x2d/0x3b0
>>      [<0>] kthread+0x118/0x130
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> xfs_map_blocks+0xbf is the
>>
>>      xfs_ilock(ip, XFS_ILOCK_SHARED);
>>
>> in xfs_map_blocks().
>>
>> The system is low on free memory
>>
>>      MemTotal:       197587764 kB
>>      MemFree:          2196496 kB
>>      MemAvailable:   189895408 kB
>>
>> but responsive.
>>
>> I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005, pulled from linux-scsi), but it is unlikely that this blocking is related to it, because the md block devices themselves are responsive (`xxd /dev/md0`).
>>
>> I can keep the system in this state for a while. Is there an idea what is going on, or what data I could collect from the running system to help? I have full debug info and could walk lists or retrieve data structures with gdb.
>>
> 
> It might be useful to dump the values under /sys/fs/xfs/<dev>/log/* for
> each fs to get an idea of the state of the logs as well...


root@deadbird:~# for f in /sys/fs/xfs/*/log/*; do echo $f : $(cat $f);done
/sys/fs/xfs/md0/log/log_head_lsn : 5:714808
/sys/fs/xfs/md0/log/log_tail_lsn : 5:581592
/sys/fs/xfs/md0/log/reserve_grant_head : 5:365981696
/sys/fs/xfs/md0/log/write_grant_head : 5:365981696
/sys/fs/xfs/md1/log/log_head_lsn : 3:2963880
/sys/fs/xfs/md1/log/log_tail_lsn : 3:2772656
/sys/fs/xfs/md1/log/reserve_grant_head : 3:1517506560
/sys/fs/xfs/md1/log/write_grant_head : 3:1517506560
/sys/fs/xfs/sda1/log/log_head_lsn : 233:106253
/sys/fs/xfs/sda1/log/log_tail_lsn : 233:106251
/sys/fs/xfs/sda1/log/reserve_grant_head : 233:54403812
/sys/fs/xfs/sda1/log/write_grant_head : 233:54403812
/sys/fs/xfs/sda2/log/log_head_lsn : 84:5653
/sys/fs/xfs/sda2/log/log_tail_lsn : 84:5651
/sys/fs/xfs/sda2/log/reserve_grant_head : 84:2894336
/sys/fs/xfs/sda2/log/write_grant_head : 84:2894336



> 
> Brian
> 
>> Best
>>    Donald
>>
> 

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: v5.10.1 xfs deadlock
  2020-12-17 21:30   ` Donald Buczek
@ 2020-12-18 15:35     ` Brian Foster
  2020-12-18 18:35       ` Donald Buczek
  0 siblings, 1 reply; 29+ messages in thread
From: Brian Foster @ 2020-12-18 15:35 UTC (permalink / raw)
  To: Donald Buczek; +Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
> On 17.12.20 20:43, Brian Foster wrote:
> > On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
> > > Dear xfs developer,
> > > 
> > > I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
> > > 
> > > The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
> > > 
> > > After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.
> > > 
> > >      root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
> > >      root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
> > >      root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
> > >      root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
> > > 

Do you have any indication of whether these workloads actually hung or
just became incredibly slow?

> > > Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
> > > 
> > > It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this:
> > > 
> > >      [<0>] xfs_log_commit_cil+0x6cc/0x7c0
> > >      [<0>] __xfs_trans_commit+0xab/0x320
> > >      [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
> > >      [<0>] xfs_end_ioend+0xc6/0x110
> > >      [<0>] xfs_end_io+0xad/0xe0
> > >      [<0>] process_one_work+0x1dd/0x3e0
> > >      [<0>] worker_thread+0x2d/0x3b0
> > >      [<0>] kthread+0x118/0x130
> > >      [<0>] ret_from_fork+0x22/0x30
> > > 
> > > xfs_log_commit_cil+0x6cc is
> > > 
> > >    xfs_log_commit_cil()
> > >      xlog_cil_push_background(log)
> > >        xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
> > > 

This looks like the transaction commit throttling code. That was
introduced earlier this year in v5.7 via commit 0e7ab7efe7745 ("xfs:
Throttle commits on delayed background CIL push"). The purpose of that
change was to prevent the CIL from growing too large. FWIW, I don't
recall that being a functional problem so it should be possible to
simply remove that blocking point and see if that avoids the problem or
if we simply stall out somewhere else, if you wanted to give that a
test.
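
[Editorial sketch, not part of Brian's reply: the blocking point can be pictured with a toy model. This is purely illustrative Python, not the kernel code; `ToyCIL` and `BLOCKING_LIMIT` are invented names. Committers accumulate space in the CIL, and once a threshold is crossed they sleep until a background push drains it, which is where the traces above are parked.]

```python
# Toy model of CIL commit throttling (illustrative only; the real logic lives
# in fs/xfs/xfs_log_cil.c and is more involved). Committers add their sizes
# to the CIL; once the accumulated size crosses BLOCKING_LIMIT, further
# committers block until a background push drains the CIL and wakes them.
import threading

BLOCKING_LIMIT = 1024  # made-up threshold (the kernel derives it from log size)

class ToyCIL:
    def __init__(self):
        self.lock = threading.Lock()
        self.pushed = threading.Condition(self.lock)
        self.space_used = 0

    def commit(self, size):
        with self.lock:
            # Throttle: sleep while the CIL is over the limit (xlog_wait analogue).
            while self.space_used >= BLOCKING_LIMIT:
                self.pushed.wait()
            self.space_used += size

    def background_push(self):
        with self.lock:
            self.space_used = 0       # checkpoint written; CIL drained
            self.pushed.notify_all()  # wake the throttled committers
```

If the background push never runs, or never wakes the waiters, every subsequent committer piles up in commit() just like the xfs-conv and cp tasks above.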

> > > Some other threads, including the four "cp" commands, are also blocked at xfs_log_commit_cil+0x6cc.
> > > 
> > > There is also a single "flush" process for each md device with this stack signature:
> > > 
> > >      [<0>] xfs_map_blocks+0xbf/0x400
> > >      [<0>] iomap_do_writepage+0x15e/0x880
> > >      [<0>] write_cache_pages+0x175/0x3f0
> > >      [<0>] iomap_writepages+0x1c/0x40
> > >      [<0>] xfs_vm_writepages+0x59/0x80
> > >      [<0>] do_writepages+0x4b/0xe0
> > >      [<0>] __writeback_single_inode+0x42/0x300
> > >      [<0>] writeback_sb_inodes+0x198/0x3f0
> > >      [<0>] __writeback_inodes_wb+0x5e/0xc0
> > >      [<0>] wb_writeback+0x246/0x2d0
> > >      [<0>] wb_workfn+0x26e/0x490
> > >      [<0>] process_one_work+0x1dd/0x3e0
> > >      [<0>] worker_thread+0x2d/0x3b0
> > >      [<0>] kthread+0x118/0x130
> > >      [<0>] ret_from_fork+0x22/0x30
> > > 

Is writeback still blocked as such or was this just a transient stack?

> > > xfs_map_blocks+0xbf is the
> > > 
> > >      xfs_ilock(ip, XFS_ILOCK_SHARED);
> > > 
> > > in xfs_map_blocks().
> > > 
> > > The system is low on free memory
> > > 
> > >      MemTotal:       197587764 kB
> > >      MemFree:          2196496 kB
> > >      MemAvailable:   189895408 kB
> > > 
> > > but responsive.
> > > 
> > > I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005, pulled from linux-scsi), but it is unlikely that this blocking is related to it, because the md block devices themselves are responsive (`xxd /dev/md0`).
> > > 
> > > I can keep the system in this state for a while. Is there an idea what is going on, or what data I could collect from the running system to help? I have full debug info and could walk lists or retrieve data structures with gdb.
> > > 
> > 
> > It might be useful to dump the values under /sys/fs/xfs/<dev>/log/* for
> > each fs to get an idea of the state of the logs as well...
> 
> 
> root@deadbird:~# for f in /sys/fs/xfs/*/log/*; do echo $f : $(cat $f);done
> /sys/fs/xfs/md0/log/log_head_lsn : 5:714808
> /sys/fs/xfs/md0/log/log_tail_lsn : 5:581592
> /sys/fs/xfs/md0/log/reserve_grant_head : 5:365981696
> /sys/fs/xfs/md0/log/write_grant_head : 5:365981696

Hm, so it looks like the log is populated but not necessarily full. What
looks more interesting is that the grant heads (365981696 bytes) line up
with the physical log head (714808 512b sectors). That suggests there is
no outstanding transaction reservation and thus perhaps all workload
tasks are sitting at that throttling point just after the current
transaction commits and releases unused reservation. That certainly
shouldn't be such a longstanding blocking point as it only waits for the
CIL push to start.
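
[Editorial sketch: the correspondence is easy to check arithmetically, under the assumptions of the reading above — head/tail LSN offsets in 512-byte basic blocks, grant-head offsets in bytes.]

```python
# Sketch: compare the physical log head with the grant heads dumped from
# /sys/fs/xfs/<dev>/log. Assumes log_head_lsn offsets are in 512-byte basic
# blocks and grant-head offsets are in bytes, as in the analysis above.
BBSIZE = 512  # basic block size

def parse_lsn(s):
    """Split a 'cycle:offset' sysfs value into integers."""
    cycle, offset = s.split(":")
    return int(cycle), int(offset)

def heads_line_up(log_head_lsn, grant_head):
    """True if the grant head (bytes) sits exactly at the physical log head."""
    head_cycle, head_bb = parse_lsn(log_head_lsn)
    grant_cycle, grant_bytes = parse_lsn(grant_head)
    return head_cycle == grant_cycle and head_bb * BBSIZE == grant_bytes

# md0 values from the sysfs dump quoted above:
print(heads_line_up("5:714808", "5:365981696"))  # 714808 * 512 == 365981696 -> True
```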

Out of curiosity, have any of the above values changed since the sample
provided here was collected? As above, I'm curious if the filesystem
happens to be moving along slowly or not at all, whether the AIL has
been drained in the background, etc.

Could you post the xfs_info for the affected filesystems?

Also since it seems like you should have plenty of available log
reservation, are you able to perform any writable operations on the fs
(i.e., touch <file>)? If so, I wonder if you were able to start a new
copy workload on one of the fs' capable of triggering the blocking threshold
again, if that might eventually unstick the currently blocked tasks when
the next CIL push occurs...

Brian

> /sys/fs/xfs/md1/log/log_head_lsn : 3:2963880
> /sys/fs/xfs/md1/log/log_tail_lsn : 3:2772656
> /sys/fs/xfs/md1/log/reserve_grant_head : 3:1517506560
> /sys/fs/xfs/md1/log/write_grant_head : 3:1517506560
> /sys/fs/xfs/sda1/log/log_head_lsn : 233:106253
> /sys/fs/xfs/sda1/log/log_tail_lsn : 233:106251
> /sys/fs/xfs/sda1/log/reserve_grant_head : 233:54403812
> /sys/fs/xfs/sda1/log/write_grant_head : 233:54403812
> /sys/fs/xfs/sda2/log/log_head_lsn : 84:5653
> /sys/fs/xfs/sda2/log/log_tail_lsn : 84:5651
> /sys/fs/xfs/sda2/log/reserve_grant_head : 84:2894336
> /sys/fs/xfs/sda2/log/write_grant_head : 84:2894336
> 
> 
> 
> > 
> > Brian
> > 
> > > Best
> > >    Donald
> > > 
> > 
> 
> -- 
> Donald Buczek
> buczek@molgen.mpg.de
> Tel: +49 30 8413 1433
> 



* Re: v5.10.1 xfs deadlock
  2020-12-18 15:35     ` Brian Foster
@ 2020-12-18 18:35       ` Donald Buczek
  2020-12-27 17:34         ` Donald Buczek
  0 siblings, 1 reply; 29+ messages in thread
From: Donald Buczek @ 2020-12-18 18:35 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On 18.12.20 16:35, Brian Foster wrote:
> On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
>> On 17.12.20 20:43, Brian Foster wrote:
>>> On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
>>>> Dear xfs developer,
>>>>
>>>> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
>>>>
>>>> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
>>>>
>>>> After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.
>>>>
>>>>       root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>>>>       root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>>>>       root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>>>>       root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>>>>
> 
> Do you have any indication of whether these workloads actually hung or
> just became incredibly slow?

There is zero progress. iostat doesn't show any I/O on any of the block devices (md or member).

>>>> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>>>>
>>>> It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this:
>>>>
>>>>       [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>>>>       [<0>] __xfs_trans_commit+0xab/0x320
>>>>       [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>>>>       [<0>] xfs_end_ioend+0xc6/0x110
>>>>       [<0>] xfs_end_io+0xad/0xe0
>>>>       [<0>] process_one_work+0x1dd/0x3e0
>>>>       [<0>] worker_thread+0x2d/0x3b0
>>>>       [<0>] kthread+0x118/0x130
>>>>       [<0>] ret_from_fork+0x22/0x30
>>>>
>>>> xfs_log_commit_cil+0x6cc is
>>>>
>>>>     xfs_log_commit_cil()
>>>>       xlog_cil_push_background(log)
>>>>         xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>>>>
> 
> This looks like the transaction commit throttling code. That was
> introduced earlier this year in v5.7 via commit 0e7ab7efe7745 ("xfs:
> Throttle commits on delayed background CIL push"). The purpose of that
> change was to prevent the CIL from growing too large. FWIW, I don't
> recall that being a functional problem so it should be possible to
> simply remove that blocking point and see if that avoids the problem or
> if we simply stall out somewhere else, if you wanted to give that a
> test.

Will do. Before trying with this commit reverted, I will repeat the test without any change to see if the problem is reproducible at all.

>>>> Some other threads, including the four "cp" commands, are also blocked at xfs_log_commit_cil+0x6cc.
>>>>
>>>> There is also a single "flush" process for each md device with this stack signature:
>>>>
>>>>       [<0>] xfs_map_blocks+0xbf/0x400
>>>>       [<0>] iomap_do_writepage+0x15e/0x880
>>>>       [<0>] write_cache_pages+0x175/0x3f0
>>>>       [<0>] iomap_writepages+0x1c/0x40
>>>>       [<0>] xfs_vm_writepages+0x59/0x80
>>>>       [<0>] do_writepages+0x4b/0xe0
>>>>       [<0>] __writeback_single_inode+0x42/0x300
>>>>       [<0>] writeback_sb_inodes+0x198/0x3f0
>>>>       [<0>] __writeback_inodes_wb+0x5e/0xc0
>>>>       [<0>] wb_writeback+0x246/0x2d0
>>>>       [<0>] wb_workfn+0x26e/0x490
>>>>       [<0>] process_one_work+0x1dd/0x3e0
>>>>       [<0>] worker_thread+0x2d/0x3b0
>>>>       [<0>] kthread+0x118/0x130
>>>>       [<0>] ret_from_fork+0x22/0x30
>>>>
> 
> Is writeback still blocked as such or was this just a transient stack?

This is frozen. I did another collection of the stack dumps today, and all the stack traces (related to these two filesystems) are unchanged.

However, I discovered that my original report was not correct. The above stack trace is from PID 348 ("kworker/u81:8+flush-9:0"), while the flush thread for the other device (PID 20, "kworker/u82:0+flush-9:1") doesn't have an identical stack: it has three more frames and is blocked at xfs_log_commit_cil+0x6cc like most other processes:

     [<0>] xfs_log_commit_cil+0x6cc/0x7c0
     [<0>] __xfs_trans_commit+0xab/0x320
     [<0>] xfs_bmapi_convert_delalloc+0x437/0x4b0
     [<0>] xfs_map_blocks+0x1e3/0x400
     [<0>] iomap_do_writepage+0x15e/0x880
     [<0>] write_cache_pages+0x175/0x3f0
     [<0>] iomap_writepages+0x1c/0x40
     [<0>] xfs_vm_writepages+0x59/0x80
     [<0>] do_writepages+0x4b/0xe0
     [<0>] __writeback_single_inode+0x42/0x300
     [<0>] writeback_sb_inodes+0x198/0x3f0
     [<0>] __writeback_inodes_wb+0x5e/0xc0
     [<0>] wb_writeback+0x246/0x2d0
     [<0>] wb_workfn+0x26e/0x490
     [<0>] process_one_work+0x1dd/0x3e0
     [<0>] worker_thread+0x2d/0x3b0
     [<0>] kthread+0x118/0x130
     [<0>] ret_from_fork+0x22/0x30

> 
>>>> xfs_map_blocks+0xbf is the
>>>>
>>>>       xfs_ilock(ip, XFS_ILOCK_SHARED);
>>>>
>>>> in xfs_map_blocks().
>>>>
>>>> The system is low on free memory
>>>>
>>>>       MemTotal:       197587764 kB
>>>>       MemFree:          2196496 kB
>>>>       MemAvailable:   189895408 kB
>>>>
>>>> but responsive.
>>>>
>>>> I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005, pulled from linux-scsi), but it is unlikely that this blocking is related to it, because the md block devices themselves are responsive (`xxd /dev/md0`).
>>>>
>>>> I can keep the system in this state for a while. Is there an idea what is going on, or what data I could collect from the running system to help? I have full debug info and could walk lists or retrieve data structures with gdb.
>>>>
>>>
>>> It might be useful to dump the values under /sys/fs/xfs/<dev>/log/* for
>>> each fs to get an idea of the state of the logs as well...
>>
>>
>> root@deadbird:~# for f in /sys/fs/xfs/*/log/*; do echo $f : $(cat $f);done
>> /sys/fs/xfs/md0/log/log_head_lsn : 5:714808
>> /sys/fs/xfs/md0/log/log_tail_lsn : 5:581592
>> /sys/fs/xfs/md0/log/reserve_grant_head : 5:365981696
>> /sys/fs/xfs/md0/log/write_grant_head : 5:365981696
> 
> Hm, so it looks like the log is populated but not necessarily full. What
> looks more interesting is that the grant heads (365981696 bytes) line up
> with the physical log head (714808 512b sectors). That suggests there is
> no outstanding transaction reservation and thus perhaps all workload
> tasks are sitting at that throttling point just after the current
> transaction commits and releases unused reservation. That certainly
> shouldn't be such a longstanding blocking point as it only waits for the
> CIL push to start.
> 
> Out of curiosity, have any of the above values changed since the sample
> provided here was collected? As above, I'm curious if the filesystem
> happens to be moving along slowly or not at all, whether the AIL has
> been drained in the background, etc.

Zero I/O on the block devices.
All related threads have the exact same stack frames after >2 days.

> Could you post the xfs_info for the affected filesystems?

buczek@deadbird:~/linux_problems/mdX_raid6_looping/tests_with_deadbird/2020-12-16-01$ xfs_info /dev/md0
meta-data=/dev/md0               isize=512    agcount=102, agsize=268435328 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=1
data     =                       bsize=4096   blocks=27348629504, imaxpct=1
          =                       sunit=128    swidth=1792 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
buczek@deadbird:~/linux_problems/mdX_raid6_looping/tests_with_deadbird/2020-12-16-01$ xfs_info /dev/md1
meta-data=/dev/md1               isize=512    agcount=102, agsize=268435328 blks
          =                       sectsz=4096  attr=2, projid32bit=1
          =                       crc=1        finobt=1, sparse=1, rmapbt=0
          =                       reflink=1
data     =                       bsize=4096   blocks=27348629504, imaxpct=1
          =                       sunit=128    swidth=1792 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
          =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
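
[Editorial sketch, a rough and purely illustrative calculation relating this geometry to the earlier md0 log values: the internal log is 521728 blocks of 4096 bytes, so the grant heads sit roughly a sixth of the way into a ~2 GiB log, consistent with "populated but not necessarily full".]

```python
# Rough reading of the md0 numbers: log size from the xfs_info geometry above
# vs. the grant-head byte offset reported under /sys/fs/xfs/md0/log.
log_blocks = 521728           # "log ... blocks=521728" from xfs_info
block_size = 4096             # "bsize=4096"
grant_head_bytes = 365981696  # md0 reserve/write grant head offset

log_bytes = log_blocks * block_size
print(f"log size: {log_bytes} bytes (~{log_bytes / 2**30:.2f} GiB)")
print(f"grant head at ~{100 * grant_head_bytes / log_bytes:.0f}% of the log")
```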


> Also since it seems like you should have plenty of available log
> reservation, are you able to perform any writable operations on the fs
> (i.e., touch <file>)? If so, I wonder if you were able to start a new
> copy workload on one of the fs' capable of triggering the blocking threshold
> again, if that might eventually unstick the currently blocked tasks when
> the next CIL push occurs...

When I do something like

     touch /jbod/M8067/scratch/x.001

This doesn't block, and shortly thereafter I see momentary write activity on the md and member devices. But the blocked processes don't resume.

     root@deadbird:~# iostat|grep md0 ; sleep 30 ; iostat|grep md0 ; touch /jbod/M8067/scratch/x.008 ; sleep 30 ; iostat|grep md0 ; sleep 30 ; iostat|grep md0
     md0              25.81        52.45      2385.56         0.00   12806780  582460904          0
     md0              25.81        52.45      2385.27         0.00   12806780  582460904          0
     md0              25.81        52.44      2384.98         0.00   12806780  582460908          0
     md0              25.80        52.43      2384.68         0.00   12806780  582460908          0

The touched files are visible with `ls`.

So then I tried something longer:

     cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/linux.001

which also triggered a few I/Os in the beginning but then came to a perceived halt (no I/O for several minutes).

Interestingly, this new "cp" is not blocked at xfs_log_commit_cil+0x6cc like the four "cp" commands from the stress test, but has this stack:

     [<0>] balance_dirty_pages+0x31c/0xd80
     [<0>] balance_dirty_pages_ratelimited+0x2f9/0x3c0
     [<0>] iomap_write_actor+0x11d/0x190
     [<0>] iomap_apply+0x117/0x2e0
     [<0>] iomap_file_buffered_write+0x62/0x90
     [<0>] xfs_file_buffered_aio_write+0xd3/0x320
     [<0>] new_sync_write+0x11f/0x1b0
     [<0>] vfs_write+0x1ea/0x250
     [<0>] ksys_write+0xa1/0xe0
     [<0>] do_syscall_64+0x33/0x40
     [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Maybe this would change if I waited a few hours more.

So I will reboot the system now and retry the test with exactly the same configuration, this time with 0e7ab7efe7745 ("xfs: Throttle commits on delayed background CIL push") reverted.

Donald

> Brian
> 
>> /sys/fs/xfs/md1/log/log_head_lsn : 3:2963880
>> /sys/fs/xfs/md1/log/log_tail_lsn : 3:2772656
>> /sys/fs/xfs/md1/log/reserve_grant_head : 3:1517506560
>> /sys/fs/xfs/md1/log/write_grant_head : 3:1517506560
>> /sys/fs/xfs/sda1/log/log_head_lsn : 233:106253
>> /sys/fs/xfs/sda1/log/log_tail_lsn : 233:106251
>> /sys/fs/xfs/sda1/log/reserve_grant_head : 233:54403812
>> /sys/fs/xfs/sda1/log/write_grant_head : 233:54403812
>> /sys/fs/xfs/sda2/log/log_head_lsn : 84:5653
>> /sys/fs/xfs/sda2/log/log_tail_lsn : 84:5651
>> /sys/fs/xfs/sda2/log/reserve_grant_head : 84:2894336
>> /sys/fs/xfs/sda2/log/write_grant_head : 84:2894336
>>
>>
>>
>>>
>>> Brian
>>>
>>>> Best
>>>>     Donald
>>>>
>>>
>>
>> -- 
>> Donald Buczek
>> buczek@molgen.mpg.de
>> Tel: +49 30 8413 1433
>>
> 

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: v5.10.1 xfs deadlock
  2020-12-17 17:44 v5.10.1 xfs deadlock Donald Buczek
  2020-12-17 19:43 ` Brian Foster
@ 2020-12-18 21:49 ` Dave Chinner
  2020-12-21 12:22   ` Donald Buczek
  1 sibling, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2020-12-18 21:49 UTC (permalink / raw)
  To: Donald Buczek; +Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
> Dear xfs developer,
> 
> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
> 
> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
> 
> After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.
> 
>     root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>     root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>     root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>     root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
> 
> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
> 
> It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this
> 
>     [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>     [<0>] __xfs_trans_commit+0xab/0x320
>     [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>     [<0>] xfs_end_ioend+0xc6/0x110
>     [<0>] xfs_end_io+0xad/0xe0
>     [<0>] process_one_work+0x1dd/0x3e0
>     [<0>] worker_thread+0x2d/0x3b0
>     [<0>] kthread+0x118/0x130
>     [<0>] ret_from_fork+0x22/0x30
> 
> xfs_log_commit_cil+0x6cc is
> 
>   xfs_log_commit_cil()
>     xlog_cil_push_background(log)
>       xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
> 
> Some other threads, including the four "cp" commands are also blocking at xfs_log_commit_cil+0x6cc
> 
> There are also single "flush" process for each md device with this stack signature:
> 
>     [<0>] xfs_map_blocks+0xbf/0x400
>     [<0>] iomap_do_writepage+0x15e/0x880
>     [<0>] write_cache_pages+0x175/0x3f0
>     [<0>] iomap_writepages+0x1c/0x40
>     [<0>] xfs_vm_writepages+0x59/0x80
>     [<0>] do_writepages+0x4b/0xe0
>     [<0>] __writeback_single_inode+0x42/0x300
>     [<0>] writeback_sb_inodes+0x198/0x3f0
>     [<0>] __writeback_inodes_wb+0x5e/0xc0
>     [<0>] wb_writeback+0x246/0x2d0
>     [<0>] wb_workfn+0x26e/0x490
>     [<0>] process_one_work+0x1dd/0x3e0
>     [<0>] worker_thread+0x2d/0x3b0
>     [<0>] kthread+0x118/0x130
>     [<0>] ret_from_fork+0x22/0x30
> 
> xfs_map_blocks+0xbf is the
> 
>     xfs_ilock(ip, XFS_ILOCK_SHARED);
> 
> in xfs_map_blocks().

Can you post the entire dmesg output after running
'echo w > /proc/sysrq-trigger' to dump all the block threads to
dmesg?

> I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005
> pulled from linux-scsi), but it is unlikely that this blocking is
> related to that, because the md block devices themselves are
> responsive (`xxd /dev/md0`).

My bet is that the OOT driver/hardware had dropped a log IO on the
floor - XFS is waiting for the CIL push to complete, and I'm betting
that is stuck waiting for iclog IO completion while writing the CIL
to the journal. The sysrq output will tell us if this is the case,
so that's the first place to look.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: v5.10.1 xfs deadlock
  2020-12-18 21:49 ` v5.10.1 xfs deadlock Dave Chinner
@ 2020-12-21 12:22   ` Donald Buczek
  2020-12-27 17:22     ` Donald Buczek
  0 siblings, 1 reply; 29+ messages in thread
From: Donald Buczek @ 2020-12-21 12:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On 18.12.20 22:49, Dave Chinner wrote:
> On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
>> Dear xfs developer,
>>
>> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
>>
>> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
>>
>> After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.
>>
>>      root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>>      root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>>      root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>>      root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>>
>> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>>
>> It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this
>>
>>      [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>>      [<0>] __xfs_trans_commit+0xab/0x320
>>      [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>>      [<0>] xfs_end_ioend+0xc6/0x110
>>      [<0>] xfs_end_io+0xad/0xe0
>>      [<0>] process_one_work+0x1dd/0x3e0
>>      [<0>] worker_thread+0x2d/0x3b0
>>      [<0>] kthread+0x118/0x130
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> xfs_log_commit_cil+0x6cc is
>>
>>    xfs_log_commit_cil()
>>      xlog_cil_push_background(log)
>>        xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>>
>> Some other threads, including the four "cp" commands are also blocking at xfs_log_commit_cil+0x6cc
>>
>> There are also single "flush" process for each md device with this stack signature:
>>
>>      [<0>] xfs_map_blocks+0xbf/0x400
>>      [<0>] iomap_do_writepage+0x15e/0x880
>>      [<0>] write_cache_pages+0x175/0x3f0
>>      [<0>] iomap_writepages+0x1c/0x40
>>      [<0>] xfs_vm_writepages+0x59/0x80
>>      [<0>] do_writepages+0x4b/0xe0
>>      [<0>] __writeback_single_inode+0x42/0x300
>>      [<0>] writeback_sb_inodes+0x198/0x3f0
>>      [<0>] __writeback_inodes_wb+0x5e/0xc0
>>      [<0>] wb_writeback+0x246/0x2d0
>>      [<0>] wb_workfn+0x26e/0x490
>>      [<0>] process_one_work+0x1dd/0x3e0
>>      [<0>] worker_thread+0x2d/0x3b0
>>      [<0>] kthread+0x118/0x130
>>      [<0>] ret_from_fork+0x22/0x30
>>
>> xfs_map_blocks+0xbf is the
>>
>>      xfs_ilock(ip, XFS_ILOCK_SHARED);
>>
>> in xfs_map_blocks().
> 
> Can you post the entire dmesg output after running
> 'echo w > /proc/sysrq-trigger' to dump all the block threads to
> dmesg?
> 
>> I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005
>> pulled from linux-scsi), but it is unlikely that this blocking is
>> related to that, because the md block devices themselves are
>> responsive (`xxd /dev/md0`).
> 
> My bet is that the OOT driver/hardware had dropped a log IO on the
> floor - XFS is waiting for the CIL push to complete, and I'm betting
> that is stuck waiting for iclog IO completion while writing the CIL
> to the journal. The sysrq output will tell us if this is the case,
> so that's the first place to look.

I think you are right here, and I'm sorry for blaming the wrong layer.

I've got the system into another (though slightly different) zero-progress situation. This time it happened on md0 only, while md1 was still working.

I think this should prove that the failure is in the block layer of the member disks:

     root:deadbird:/scratch/local/# for f in /sys/devices/virtual/block/md?/md/rd*/block/inflight;do echo $f: $(cat $f);done
     /sys/devices/virtual/block/md0/md/rd0/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd1/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd10/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd11/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd12/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd13/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd14/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd15/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd2/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd3/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd4/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd5/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd6/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd7/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd8/block/inflight: 1 0
     /sys/devices/virtual/block/md0/md/rd9/block/inflight: 1 0
     /sys/devices/virtual/block/md1/md/rd0/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd1/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd10/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd11/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd12/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd13/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd14/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd15/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd2/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd3/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd4/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd5/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd6/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd7/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd8/block/inflight: 0 0
     /sys/devices/virtual/block/md1/md/rd9/block/inflight: 0 0

Best

   Donald

> 
> Cheers,
> 
> Dave.
> 
-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: v5.10.1 xfs deadlock
  2020-12-21 12:22   ` Donald Buczek
@ 2020-12-27 17:22     ` Donald Buczek
  0 siblings, 0 replies; 29+ messages in thread
From: Donald Buczek @ 2020-12-27 17:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On 21.12.20 13:22, Donald Buczek wrote:
> On 18.12.20 22:49, Dave Chinner wrote:
>> On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
>>> Dear xfs developer,
>>>
>>> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
>>>
>>> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
>>>
>>> After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.
>>>
>>>      root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>>>      root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>>>      root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>>>      root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>>>
>>> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>>>
>>> It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this
>>>
>>>      [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>>>      [<0>] __xfs_trans_commit+0xab/0x320
>>>      [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>>>      [<0>] xfs_end_ioend+0xc6/0x110
>>>      [<0>] xfs_end_io+0xad/0xe0
>>>      [<0>] process_one_work+0x1dd/0x3e0
>>>      [<0>] worker_thread+0x2d/0x3b0
>>>      [<0>] kthread+0x118/0x130
>>>      [<0>] ret_from_fork+0x22/0x30
>>>
>>> xfs_log_commit_cil+0x6cc is
>>>
>>>    xfs_log_commit_cil()
>>>      xlog_cil_push_background(log)
>>>        xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>>>
>>> Some other threads, including the four "cp" commands are also blocking at xfs_log_commit_cil+0x6cc
>>>
>>> There are also single "flush" process for each md device with this stack signature:
>>>
>>>      [<0>] xfs_map_blocks+0xbf/0x400
>>>      [<0>] iomap_do_writepage+0x15e/0x880
>>>      [<0>] write_cache_pages+0x175/0x3f0
>>>      [<0>] iomap_writepages+0x1c/0x40
>>>      [<0>] xfs_vm_writepages+0x59/0x80
>>>      [<0>] do_writepages+0x4b/0xe0
>>>      [<0>] __writeback_single_inode+0x42/0x300
>>>      [<0>] writeback_sb_inodes+0x198/0x3f0
>>>      [<0>] __writeback_inodes_wb+0x5e/0xc0
>>>      [<0>] wb_writeback+0x246/0x2d0
>>>      [<0>] wb_workfn+0x26e/0x490
>>>      [<0>] process_one_work+0x1dd/0x3e0
>>>      [<0>] worker_thread+0x2d/0x3b0
>>>      [<0>] kthread+0x118/0x130
>>>      [<0>] ret_from_fork+0x22/0x30
>>>
>>> xfs_map_blocks+0xbf is the
>>>
>>>      xfs_ilock(ip, XFS_ILOCK_SHARED);
>>>
>>> in xfs_map_blocks().
>>
>> Can you post the entire dmesg output after running
>> 'echo w > /proc/sysrq-trigger' to dump all the block threads to
>> dmesg?
>>
>>> I have an out-of-tree driver for the HBA (smartpqi 2.1.6-005
>>> pulled from linux-scsi), but it is unlikely that this blocking is
>>> related to that, because the md block devices themselves are
>>> responsive (`xxd /dev/md0`).
>>
>> My bet is that the OOT driver/hardware had dropped a log IO on the
>> floor - XFS is waiting for the CIL push to complete, and I'm betting
>> that is stuck waiting for iclog IO completion while writing the CIL
>> to the journal. The sysrq output will tell us if this is the case,
>> so that's the first place to look.
> 
> I think you are right here, and I'm sorry for blaming the wrong layer.

I was too fast to accept that. The non-zero "inflight" counters of the underlying block devices couldn't be reproduced; they are usually "0 0" when the filesystem is deadlocked.

To make sure, I've also added my own "bio requested" and "bio completed" atomic counters to the two submit_bio calls and to xlog_bio_end_io in xfs_log.c. When the system is deadlocked, the requests and completions match:

     sudo gdb linux/vmlinux /proc/kcore  -ex 'print xxx_bio_requested' -ex 'print xxx_bio_completed'
     [...]
     $1 = {counter = 34723}
     $2 = {counter = 34723}

So at least log writes are not lost.

Best
   Donald

> I've got the system into another (though slightly different) zero-progress situation. This time it happened on md0 only, while md1 was still working.
> 
> I think this should prove that the failure is in the block layer of the member disks:
> 
>      root:deadbird:/scratch/local/# for f in /sys/devices/virtual/block/md?/md/rd*/block/inflight;do echo $f: $(cat $f);done
>      /sys/devices/virtual/block/md0/md/rd0/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd1/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd10/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd11/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd12/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd13/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd14/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd15/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd2/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd3/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd4/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd5/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd6/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd7/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd8/block/inflight: 1 0
>      /sys/devices/virtual/block/md0/md/rd9/block/inflight: 1 0
>      /sys/devices/virtual/block/md1/md/rd0/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd1/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd10/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd11/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd12/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd13/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd14/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd15/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd2/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd3/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd4/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd5/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd6/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd7/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd8/block/inflight: 0 0
>      /sys/devices/virtual/block/md1/md/rd9/block/inflight: 0 0
> 
> Best
> 
>    Donald
> 
>>
>> Cheers,
>>
>> Dave.
>>

-- 
Donald Buczek
buczek@molgen.mpg.de
Tel: +49 30 8413 1433


* Re: v5.10.1 xfs deadlock
  2020-12-18 18:35       ` Donald Buczek
@ 2020-12-27 17:34         ` Donald Buczek
  2020-12-28 23:13           ` Donald Buczek
  0 siblings, 1 reply; 29+ messages in thread
From: Donald Buczek @ 2020-12-27 17:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On 18.12.20 19:35, Donald Buczek wrote:
> On 18.12.20 16:35, Brian Foster wrote:
>> On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
>>> On 17.12.20 20:43, Brian Foster wrote:
>>>> On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
>>>>> Dear xfs developer,
>>>>>
>>>>> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
>>>>>
>>>>> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
>>>>>
>>>>> After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.
>>>>>
>>>>>       root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>>>>>       root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>>>>>       root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>>>>>       root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>>>>>
>>
>> Do you have any indication of whether these workloads actually hung or
>> just became incredibly slow?
> 
> There is zero progress. iostat doesn't show any I/O on any of the block devices (md or member)
> 
>>>>> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>>>>>
>>>>> It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this
>>>>>
>>>>>       [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>>>>>       [<0>] __xfs_trans_commit+0xab/0x320
>>>>>       [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>>>>>       [<0>] xfs_end_ioend+0xc6/0x110
>>>>>       [<0>] xfs_end_io+0xad/0xe0
>>>>>       [<0>] process_one_work+0x1dd/0x3e0
>>>>>       [<0>] worker_thread+0x2d/0x3b0
>>>>>       [<0>] kthread+0x118/0x130
>>>>>       [<0>] ret_from_fork+0x22/0x30
>>>>>
>>>>> xfs_log_commit_cil+0x6cc is
>>>>>
>>>>>     xfs_log_commit_cil()
>>>>>       xlog_cil_push_background(log)
>>>>>         xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>>>>>
>>
>> This looks like the transaction commit throttling code. That was
>> introduced earlier this year in v5.7 via commit 0e7ab7efe7745 ("xfs:
>> Throttle commits on delayed background CIL push"). The purpose of that
>> change was to prevent the CIL from growing too large. FWIW, I don't
>> recall that being a functional problem so it should be possible to
>> simply remove that blocking point and see if that avoids the problem or
>> if we simply stall out somewhere else, if you wanted to give that a
>> test.
> 
> Will do. Before trying with this commit reverted, I will repeat the test without any change to see if the problem is reproducible at all.

I'm now able to reliably reproduce the deadlock with a little less complex setup (e.g. with only one filesystem involved). One key to that was to run the test against a freshly created filesystem (mkfs).

And, yes, you are right: When I revert ef565ab8cc2e ("xfs: Throttle commits on delayed background CIL push") and 7ee6dfa2a245 ("xfs: fix use-after-free on CIL context on shutdown") the deadlock seems to be gone.

Best
   Donald


* Re: v5.10.1 xfs deadlock
  2020-12-27 17:34         ` Donald Buczek
@ 2020-12-28 23:13           ` Donald Buczek
  2020-12-29 23:56             ` [PATCH] xfs: Wake CIL push waiters more reliably Donald Buczek
       [not found]             ` <20201230024642.2171-1-hdanton@sina.com>
  0 siblings, 2 replies; 29+ messages in thread
From: Donald Buczek @ 2020-12-28 23:13 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs



On 27.12.20 18:34, Donald Buczek wrote:
> On 18.12.20 19:35, Donald Buczek wrote:
>> On 18.12.20 16:35, Brian Foster wrote:
>>> On Thu, Dec 17, 2020 at 10:30:37PM +0100, Donald Buczek wrote:
>>>> On 17.12.20 20:43, Brian Foster wrote:
>>>>> On Thu, Dec 17, 2020 at 06:44:51PM +0100, Donald Buczek wrote:
>>>>>> Dear xfs developer,
>>>>>>
>>>>>> I was doing some testing on a Linux 5.10.1 system with two 100 TB xfs filesystems on md raid6 raids.
>>>>>>
>>>>>> The stress test was essentially `cp -a`ing a Linux source repository with two threads in parallel on each filesystem.
>>>>>>
>>>>>> After about an hour, the processes on one filesystem (md1) blocked; 30 minutes later, the processes on the other filesystem (md0) did.
>>>>>>
>>>>>>       root      7322  2167  0 Dec16 pts/1    00:00:06 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/1/linux.018.TMP
>>>>>>       root      7329  2169  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8068/scratch/linux /jbod/M8068/scratch/2/linux.019.TMP
>>>>>>       root     13856  2170  0 Dec16 pts/1    00:00:08 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/2/linux.028.TMP
>>>>>>       root     13899  2168  0 Dec16 pts/1    00:00:05 cp -a /jbod/M8067/scratch/linux /jbod/M8067/scratch/1/linux.027.TMP
>>>>>>
>>>
>>> Do you have any indication of whether these workloads actually hung or
>>> just became incredibly slow?
>>
>> There is zero progress. iostat doesn't show any I/O on any of the block devices (md or member)
>>
>>>>>> Some info from the system (all stack traces, slabinfo) is available here: https://owww.molgen.mpg.de/~buczek/2020-12-16.info.txt
>>>>>>
>>>>>> It stands out that there are many (549 for md0, but only 10 for md1) "xfs-conv" threads, all with stacks like this
>>>>>>
>>>>>>       [<0>] xfs_log_commit_cil+0x6cc/0x7c0
>>>>>>       [<0>] __xfs_trans_commit+0xab/0x320
>>>>>>       [<0>] xfs_iomap_write_unwritten+0xcb/0x2e0
>>>>>>       [<0>] xfs_end_ioend+0xc6/0x110
>>>>>>       [<0>] xfs_end_io+0xad/0xe0
>>>>>>       [<0>] process_one_work+0x1dd/0x3e0
>>>>>>       [<0>] worker_thread+0x2d/0x3b0
>>>>>>       [<0>] kthread+0x118/0x130
>>>>>>       [<0>] ret_from_fork+0x22/0x30
>>>>>>
>>>>>> xfs_log_commit_cil+0x6cc is
>>>>>>
>>>>>>     xfs_log_commit_cil()
>>>>>>       xlog_cil_push_background(log)
>>>>>>         xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
>>>>>>
>>>
>>> This looks like the transaction commit throttling code. That was
>>> introduced earlier this year in v5.7 via commit 0e7ab7efe7745 ("xfs:
>>> Throttle commits on delayed background CIL push"). The purpose of that
>>> change was to prevent the CIL from growing too large. FWIW, I don't
>>> recall that being a functional problem so it should be possible to
>>> simply remove that blocking point and see if that avoids the problem or
>>> if we simply stall out somewhere else, if you wanted to give that a
>>> test.
>>
>> Will do. Before trying with this commit reverted, I will repeat the test without any change to see if the problem is reproducible at all.
> 
> I'm now able to reliably reproduce the deadlock with a little less complex setup (e.g. with only one filesystem involved). One key to that was to run the test against a freshly created filesystem (mkfs).
> 
> And, yes, you are right: When I revert ef565ab8cc2e ("xfs: Throttle commits on delayed background CIL push") and 7ee6dfa2a245 ("xfs: fix use-after-free on CIL context on shutdown") the deadlock seems to be gone.

Hash Correction: 0e7ab7efe774 ("xfs: Throttle commits on delayed background CIL push") and c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown")

When I tried to remove only the wait:

     --- a/fs/xfs/xfs_log_cil.c
     +++ b/fs/xfs/xfs_log_cil.c
     @@ -936,7 +936,6 @@ xlog_cil_push_background(
             if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
                     trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
                     ASSERT(cil->xc_ctx->space_used < log->l_logsize);
     -               xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
                     return;
             }

the system got into a critical overload situation, with many kworker threads at 100% CPU, and became totally unresponsive after a while, requiring a sysrq-triggered reboot.

I noticed that 7ee6dfa2a245 ("xfs: fix use-after-free on CIL context on shutdown") put the `wake_up_all(&ctx->push_wait)` behind an additional conditional, `if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))`, the same condition that decides whether the background push is triggered at all. But in the deadlock situation, I've seen waiters on cil->xc_push_wait with ctx->space_used being zero in the debugger:

     $L_003: (struct xfs_cil *) 0xffff889886db0700 : {
         xc_cil = {next = 0xffff889886db0708, prev = 0xffff889886db0708}    /* empty */
         xc_ctx = 0xffff888e2cb6dd80 /* --> $L_004 */
         xc_push_seq = 31
         xc_committing = {next = 0xffff889886db0790, prev = 0xffff889886db0790}    /* empty */
         xc_current_sequence = 32
         xc_push_wait = {
             head = {next = 0xffffc9000dbcbd48, prev = 0xffffc90021a1bd20}
         }
     }
     $L_004: (struct xfs_cil_ctx *) 0xffff888e2cb6dd80 : {
         sequence = 32
         start_lsn = 0
         commit_lsn = 0
         space_used = 0
         committing = {next = 0xffff888e2cb6ddd8, prev = 0xffff888e2cb6ddd8}    /* empty */
     }


So, out of curiosity I've tried

     --- a/fs/xfs/xfs_log_cil.c
     +++ b/fs/xfs/xfs_log_cil.c
     @@ -670,8 +670,7 @@ xlog_cil_push_work(
             /*
              * Wake up any background push waiters now this context is being pushed.
              */
     -       if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
     -               wake_up_all(&cil->xc_push_wait);
     +       wake_up_all(&cil->xc_push_wait);
      
             /*
              * Check if we've anything to push. If there is nothing, then we don't

and this seems to fix the problem. But I don't understand how this is possible.


* [PATCH] xfs: Wake CIL push waiters more reliably
  2020-12-28 23:13           ` Donald Buczek
@ 2020-12-29 23:56             ` Donald Buczek
  2020-12-30 22:16               ` Dave Chinner
       [not found]             ` <20201230024642.2171-1-hdanton@sina.com>
  1 sibling, 1 reply; 29+ messages in thread
From: Donald Buczek @ 2020-12-29 23:56 UTC (permalink / raw)
  To: linux-xfs, Brian Foster, Dave Chinner
  Cc: Linux Kernel Mailing List, it+linux-xfs, Donald Buczek

Threads, which committed items to the CIL, wait in the xc_push_wait
waitqueue when used_space in the push context goes over a limit. These
threads need to be woken when the CIL is pushed.

The CIL push worker tries to avoid the overhead of calling wake_all()
when there are no waiters waiting. It does so by checking the same
condition which caused the waits to happen. This, however, is
unreliable, because ctx->space_used can actually decrease when items are
recommitted. If the value goes below the limit while some threads are
already waiting but before the push worker gets to it, these threads are
not woken.

Always wake all CIL push waiters. Test with waitqueue_active() as an
optimization. This is possible, because we hold the xc_push_lock
spinlock, which prevents additions to the waitqueue.

Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
---
 fs/xfs/xfs_log_cil.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index b0ef071b3cb5..d620de8e217c 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -670,7 +670,7 @@ xlog_cil_push_work(
 	/*
 	 * Wake up any background push waiters now this context is being pushed.
 	 */
-	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
+	if (waitqueue_active(&cil->xc_push_wait))
 		wake_up_all(&cil->xc_push_wait);
 
 	/*
-- 
2.26.2



* Re: [PATCH] xfs: Wake CIL push waiters more reliably
       [not found]             ` <20201230024642.2171-1-hdanton@sina.com>
@ 2020-12-30 16:54               ` Donald Buczek
  0 siblings, 0 replies; 29+ messages in thread
From: Donald Buczek @ 2020-12-30 16:54 UTC (permalink / raw)
  To: Hillf Danton; +Cc: linux-xfs, Brian Foster, Dave Chinner, LKML, it+linux-xfs

On 30.12.20 03:46, Hillf Danton wrote:
> On Wed, 30 Dec 2020 00:56:27 +0100
>> Threads, which committed items to the CIL, wait in the xc_push_wait
>> waitqueue when used_space in the push context goes over a limit. These
>> threads need to be woken when the CIL is pushed.
>>
>> The CIL push worker tries to avoid the overhead of calling wake_all()
>> when there are no waiters waiting. It does so by checking the same
>> condition which caused the waits to happen. This, however, is
>> unreliable, because ctx->space_used can actually decrease when items are
>> recommitted. If the value goes below the limit while some threads are
>> already waiting but before the push worker gets to it, these threads are
>> not woken.
> 
> Looks like you are fixing a typo in c7f87f3984cf ("xfs: fix
> use-after-free on CIL context on shutdown") in mainline, and
> it may mean
> 
> 	/*
> 	 * Wake up any background push waiters now this context is being pushed
> 	 * if we are no longer over the space limit
> 	 */
> 
> given waiters throttled for comsuming more space than limit in
> xlog_cil_push_background().

I'm not sure I understand you correctly. Do you suggest updating the comment to "...if we are no longer over the space limit" and changing the code to `if (ctx->space_used < XLOG_CIL_BLOCKING_SPACE_LIMIT(log))`?

I don't think, that would be correct.

The current push context is most probably still over the limit if it ever was; it would be exceptional for the few bytes that might be returned to ctx->space_used to bring us back below the limit. The new context, on the other hand, will have space_used=0.

We need to resume any thread which is waiting for the push.

Best
   Donald

>> Always wake all CIL push waiters. Test with waitqueue_active() as an
>> optimization. This is possible, because we hold the xc_push_lock
>> spinlock, which prevents additions to the waitqueue.
>>
>> Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
>> ---
>>   fs/xfs/xfs_log_cil.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
>> index b0ef071b3cb5..d620de8e217c 100644
>> --- a/fs/xfs/xfs_log_cil.c
>> +++ b/fs/xfs/xfs_log_cil.c
>> @@ -670,7 +670,7 @@ xlog_cil_push_work(
>>   	/*
>>   	 * Wake up any background push waiters now this context is being pushed.
>>   	 */
>> -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
>> +	if (waitqueue_active(&cil->xc_push_wait))
>>   		wake_up_all(&cil->xc_push_wait);
>>   
>>   	/*
>> -- 
>> 2.26.2


* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2020-12-29 23:56             ` [PATCH] xfs: Wake CIL push waiters more reliably Donald Buczek
@ 2020-12-30 22:16               ` Dave Chinner
  2020-12-31 11:48                 ` Donald Buczek
  2021-01-04 16:23                 ` Brian Foster
  0 siblings, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2020-12-30 22:16 UTC (permalink / raw)
  To: Donald Buczek
  Cc: linux-xfs, Brian Foster, Linux Kernel Mailing List, it+linux-xfs

On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> Threads that committed items to the CIL wait in the xc_push_wait
> waitqueue when space_used in the push context goes over a limit. These
> threads need to be woken when the CIL is pushed.
> 
> The CIL push worker tries to avoid the overhead of calling wake_up_all()
> when there are no waiters. It does so by checking the same
> condition which caused the waits to happen. This, however, is
> unreliable, because ctx->space_used can actually decrease when items are
> recommitted.

When does this happen?

Do you have tracing showing the operation where the relogged item
has actually gotten smaller? By definition, relogging in the CIL
should only grow the size of the object in the CIL because it must
relog all the existing changes on top of the new changed being made
to the object. Hence the CIL reservation should only ever grow.

IOWs, returning negative lengths from the formatting code is
unexpected and probably a bug and requires further investigation,
not papering over the occurrence with broadcast wakeups...

> If the value goes below the limit while some threads are
> already waiting but before the push worker gets to it, these threads are
> not woken.
> 
> Always wake all CIL push waiters. Test with waitqueue_active() as an
> optimization. This is possible, because we hold the xc_push_lock
> spinlock, which prevents additions to the waitqueue.
> 
> Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> ---
>  fs/xfs/xfs_log_cil.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index b0ef071b3cb5..d620de8e217c 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -670,7 +670,7 @@ xlog_cil_push_work(
>  	/*
>  	 * Wake up any background push waiters now this context is being pushed.
>  	 */
> -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> +	if (waitqueue_active(&cil->xc_push_wait))
>  		wake_up_all(&cil->xc_push_wait);

That just smells wrong to me. It *might* be correct, but this
condition should pair with the sleep condition, as space used by a
CIL context should never actually decrease....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2020-12-30 22:16               ` Dave Chinner
@ 2020-12-31 11:48                 ` Donald Buczek
  2020-12-31 21:59                   ` Dave Chinner
  2021-01-04 16:23                 ` Brian Foster
  1 sibling, 1 reply; 29+ messages in thread
From: Donald Buczek @ 2020-12-31 11:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, Brian Foster, Linux Kernel Mailing List, it+linux-xfs

On 30.12.20 23:16, Dave Chinner wrote:
> On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
>> Threads that committed items to the CIL wait in the xc_push_wait
>> waitqueue when space_used in the push context goes over a limit. These
>> threads need to be woken when the CIL is pushed.
>>
>> The CIL push worker tries to avoid the overhead of calling wake_up_all()
>> when there are no waiters. It does so by checking the same
>> condition which caused the waits to happen. This, however, is
>> unreliable, because ctx->space_used can actually decrease when items are
>> recommitted.
> 
> When does this happen?
> 
> Do you have tracing showing the operation where the relogged item
> has actually gotten smaller? By definition, relogging in the CIL
> should only grow the size of the object in the CIL because it must
> relog all the existing changes on top of the new changes being made
> to the object. Hence the CIL reservation should only ever grow.

I have a (very ugly, printk-based) log (see below), but it only shows that it happened (space_used decreasing), not what caused it.

I only browsed the ( xfs_*_item.c ) code and got the impression that the size of a log item is rather dynamic (e.g. the number of extents in an inode, extended attributes in an inode, contiguity of chunks in a buffer) and wasn't surprised that a relogged item might need less space from time to time.

> IOWs, returning negative lengths from the formatting code is
> unexpected and probably a bug and requires further investigation,
> not papering over the occurrence with broadcast wakeups...

One could argue that the code is more robust after the change: it wakes up every thread that is waiting for the next push when that push happens, without making assumptions about why those threads are waiting and without duplicating code from the waiters' side. The proposed waitqueue_active() is inlined to two instructions and avoids the call overhead if there are no waiters as well.

But, of course, if the size is not expected to decrease, this discrepancy must be clarified anyway. I am sure I can find out the exact circumstances under which this happens. I will follow up on this.

Happy New Year!

   Donald

Log extract following....

========

+++ b/fs/xfs/xfs_log_cil.c
@@ -17,6 +17,8 @@
  #include "xfs_log_priv.h"
  #include "xfs_trace.h"
  
+#include <linux/printk.h>
+
  struct workqueue_struct *xfs_discard_wq;
  
  /*
@@ -670,8 +672,25 @@ xlog_cil_push_work(
         /*
          * Wake up any background push waiters now this context is being pushed.
          */
-       if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
+       if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
                 wake_up_all(&cil->xc_push_wait);
+               pr_warn("XXX wake    cil %p ctx %p  ctx->space_used=%d >= %d, push_seq=%lld, ctx->sequence=%lld\n",
+                       cil, ctx,
+                       ctx->space_used,
+                       XLOG_CIL_BLOCKING_SPACE_LIMIT(log),
+                       cil->xc_push_seq,
+                       ctx->sequence);
+       } else {
+               pr_warn("XXX no wake cil %p ctx %p ctx->space_used=%d < %d, push_seq=%lld, ctx->sequence=%lld\n",
+                       cil, ctx,
+                       ctx->space_used,
+                       XLOG_CIL_BLOCKING_SPACE_LIMIT(log),
+                       cil->xc_push_seq,
+                       ctx->sequence);
+               if (waitqueue_active(&cil->xc_push_wait)) {
+                       pr_warn("XXX xc_push_wait ACTIVE!\n");
+               }
+       }
  
         /*
          * Check if we've anything to push. If there is nothing, then we don't
@@ -918,6 +937,11 @@ xlog_cil_push_background(
         spin_lock(&cil->xc_push_lock);
         if (cil->xc_push_seq < cil->xc_current_sequence) {
                 cil->xc_push_seq = cil->xc_current_sequence;
+               pr_warn("XXX trigger cil %p ctx %p  ctx->space_used=%d      , push_seq=%lld, ctx->sequence=%lld\n",
+                       cil, cil->xc_ctx,
+                       cil->xc_ctx->space_used,
+                       cil->xc_push_seq,
+                       cil->xc_ctx->sequence);
                 queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
         }
  
@@ -936,6 +960,12 @@ xlog_cil_push_background(
         if (cil->xc_ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log)) {
                 trace_xfs_log_cil_wait(log, cil->xc_ctx->ticket);
                 ASSERT(cil->xc_ctx->space_used < log->l_logsize);
+               pr_warn("XXX pushw   cil %p ctx %p  ctx->space_used=%d >= %d, push_seq=%lld, ctx->sequence=%lld\n",
+                       cil, cil->xc_ctx,
+                       cil->xc_ctx->space_used,
+                       XLOG_CIL_BLOCKING_SPACE_LIMIT(log),
+                       cil->xc_push_seq,
+                       cil->xc_ctx->sequence);
                 xlog_wait(&cil->xc_push_wait, &cil->xc_push_lock);
                 return;
         }

=====

14606 "XXX" lines logged before deadlock
4847 "XXX" lines logged for the deadlocked filesystem (CIL 00000000e374c6f1) before deadlock

relevant lines:

# seq 29

2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.392095+01:00 deadbird kernel: [ 1071.600503] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109068 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.392120+01:00 deadbird kernel: [ 1071.616245] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109108 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.408102+01:00 deadbird kernel: [ 1071.632024] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109108 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.425086+01:00 deadbird kernel: [ 1071.648913] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109108 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.442109+01:00 deadbird kernel: [ 1071.666410] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109108 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.482769+01:00 deadbird kernel: [ 1071.706666] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.501063+01:00 deadbird kernel: [ 1071.725484] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.535711+01:00 deadbird kernel: [ 1071.742926] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.535728+01:00 deadbird kernel: [ 1071.760346] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.552079+01:00 deadbird kernel: [ 1071.776167] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:33.569070+01:00 deadbird kernel: [ 1071.793576] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
[ 953 more of these ...]
2020-12-29T20:08:50.368192+01:00 deadbird kernel: [ 1088.576346] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:50.385023+01:00 deadbird kernel: [ 1088.593009] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:50.418318+01:00 deadbird kernel: [ 1088.609811] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:08:50.418330+01:00 deadbird kernel: [ 1088.626431] XXX pushw   cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29

# seq 30

2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=33554480      , push_seq=30, ctx->sequence=30
2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.076095+01:00 deadbird kernel: [ 1180.283748] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.193070+01:00 deadbird kernel: [ 1180.401590] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.421082+01:00 deadbird kernel: [ 1180.629682] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108908 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.507085+01:00 deadbird kernel: [ 1180.715657] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108892 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.507094+01:00 deadbird kernel: [ 1180.731757] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108876 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.659070+01:00 deadbird kernel: [ 1180.867812] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.771081+01:00 deadbird kernel: [ 1180.980187] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:22.791116+01:00 deadbird kernel: [ 1180.996535] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:32.512085+01:00 deadbird kernel: [ 1190.725044] XXX no wake cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108856 < 67108864, push_seq=30, ctx->sequence=30
2020-12-29T20:10:32.528119+01:00 deadbird kernel: [ 1190.753321] XXX xc_push_wait ACTIVE!

After this "xc_push_wait ACTIVE!" message, I/O to the filesystem ceased.

space_used was reduced from 67108924 .. 67108908 .. 67108892 .. 67108876 .. 67108872, as seen by the waiters, down to 67108856, as seen by the push worker. Presumably by other calls to xfs_log_commit_cil which did not show up in the log, because they neither triggered the worker nor waited for the push.

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2020-12-31 11:48                 ` Donald Buczek
@ 2020-12-31 21:59                   ` Dave Chinner
  2021-01-02 19:12                     ` Donald Buczek
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2020-12-31 21:59 UTC (permalink / raw)
  To: Donald Buczek
  Cc: linux-xfs, Brian Foster, Linux Kernel Mailing List, it+linux-xfs

On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> On 30.12.20 23:16, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > Threads that committed items to the CIL wait in the
> > > xc_push_wait waitqueue when space_used in the push context
> > > goes over a limit. These threads need to be woken when the CIL
> > > is pushed.
> > > 
> > > The CIL push worker tries to avoid the overhead of calling
> > > wake_up_all() when there are no waiters. It does so by
> > > checking the same condition which caused the waits to happen.
> > > This, however, is unreliable, because ctx->space_used can
> > > actually decrease when items are recommitted.
> > 
> > When does this happen?
> > 
> > Do you have tracing showing the operation where the relogged
> > item has actually gotten smaller? By definition, relogging in
> > the CIL should only grow the size of the object in the CIL
> > because it must relog all the existing changes on top of the new
> > changes being made to the object. Hence the CIL reservation
> > should only ever grow.
> 
> I have (very ugly printk based) log (see below), but it only
> shows, that it happened (space_used decreasing), not what caused
> it.
> 
> I only browsed the ( xfs_*_item.c ) code and got the impression
> that the size of a log item is rather dynamic (e.g. number of
>> extents in an inode, extended attributes in an inode, contiguity
> of chunks in a buffer) and wasn't surprised that a relogged item
> might need less space from time to time.
> 
> > IOWs, returning negative lengths from the formatting code is
> > unexpected and probably a bug and requires further
> > investigation, not papering over the occurrence with broadcast
> > wakeups...
> 
> One could argue that the code is more robust after the change:
> it wakes up every thread that is waiting for the next push when
> that push happens, without making assumptions about why those
> threads are waiting and without duplicating code from the
> waiters' side. The proposed waitqueue_active() is inlined
> to two instructions and avoids the call overhead if there are no
> waiters as well.

One could argue that, but one should also understand what the
design constraints for a particular algorithm are before
suggesting that their solution is "robust". :)

> 
> # seq 29
> 
> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29

So, at 20:08:15 we get a push trigger and the work is queued. But...

.....
> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29

It takes the best part of *50 seconds* before the push work actually
runs?

That's .... well and truly screwed up - the work should run on that
CPU the very next time it yields the CPU. If we're holding the
CPU without yielding it for that long, hangcheck and RCU warnings
should be going off...

> # seq 30
> 
> 2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=33554480      , push_seq=30, ctx->sequence=30

20:09:39 for the next trigger,

> 2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.076095+01:00 deadbird kernel: [ 1180.283748] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.193070+01:00 deadbird kernel: [ 1180.401590] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.421082+01:00 deadbird kernel: [ 1180.629682] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108908 >= 67108864, push_seq=30, ctx->sequence=30

So it dropped by 16 bytes (seems to be common) which is unexpected.
I wonder if it filled a hole in a buffer and so needed one less
xlog_op_header()? But then the size would have gone up by at least
128 bytes for the hole that was filled, so it still shouldn't go
down in size.

I think you need to instrument xlog_cil_insert_items() and catch
a negative length here:

	/* account for space used by new iovec headers  */
	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
	len += iovhdr_res;
	ctx->nvecs += diff_iovecs;

(diff_iovecs will be negative if the number of xlog_op_header
structures goes down)

And if this happens, then dump the transaction ticket via
xlog_print_trans(tp) so we can see all the log items types and
vectors that the transaction has formatted...

> 2020-12-29T20:10:22.507085+01:00 deadbird kernel: [ 1180.715657] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108892 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.507094+01:00 deadbird kernel: [ 1180.731757] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108876 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.659070+01:00 deadbird kernel: [ 1180.867812] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.771081+01:00 deadbird kernel: [ 1180.980187] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:22.791116+01:00 deadbird kernel: [ 1180.996535] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108872 >= 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:32.512085+01:00 deadbird kernel: [ 1190.725044] XXX no wake cil 00000000e374c6f1 ctx 00000000c46ab121 ctx->space_used=67108856 < 67108864, push_seq=30, ctx->sequence=30
> 2020-12-29T20:10:32.528119+01:00 deadbird kernel: [ 1190.753321] XXX xc_push_wait ACTIVE!

Also, another 50s hold-off from push work being queued to the work
actually running. That also needs to be understood, because that's
clearly contributing to hitting the hard limit regularly and that
should mostly never happen....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2020-12-31 21:59                   ` Dave Chinner
@ 2021-01-02 19:12                     ` Donald Buczek
  2021-01-02 22:44                       ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Donald Buczek @ 2021-01-02 19:12 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, Brian Foster, Linux Kernel Mailing List, it+linux-xfs

On 31.12.20 22:59, Dave Chinner wrote:

Hey, funny, your email could celebrate New Year a second time :-)

> On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
>> On 30.12.20 23:16, Dave Chinner wrote:
>>> On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
>>>> Threads that committed items to the CIL wait in the
>>>> xc_push_wait waitqueue when space_used in the push context
>>>> goes over a limit. These threads need to be woken when the CIL
>>>> is pushed.
>>>>
>>>> The CIL push worker tries to avoid the overhead of calling
>>>> wake_up_all() when there are no waiters. It does so by
>>>> checking the same condition which caused the waits to happen.
>>>> This, however, is unreliable, because ctx->space_used can
>>>> actually decrease when items are recommitted.
>>>
>>> When does this happen?
>>>
>>> Do you have tracing showing the operation where the relogged
>>> item has actually gotten smaller? By definition, relogging in
>>> the CIL should only grow the size of the object in the CIL
>>> because it must relog all the existing changes on top of the new
>>> changes being made to the object. Hence the CIL reservation
>>> should only ever grow.
>>
>> I have (very ugly printk based) log (see below), but it only
>> shows, that it happened (space_used decreasing), not what caused
>> it.
>>
>> I only browsed the ( xfs_*_item.c ) code and got the impression
>> that the size of a log item is rather dynamic (e.g. number of
>> extents in an inode, extended attributes in an inode, contiguity
>> of chunks in a buffer) and wasn't surprised that a relogged item
>> might need less space from time to time.
>>
>>> IOWs, returning negative lengths from the formatting code is
>>> unexpected and probably a bug and requires further
>>> investigation, not papering over the occurrence with broadcast
>>> wakeups...
>>
>> One could argue that the code is more robust after the change:
>> it wakes up every thread that is waiting for the next push when
>> that push happens, without making assumptions about why those
>> threads are waiting and without duplicating code from the
>> waiters' side. The proposed waitqueue_active() is inlined
>> to two instructions and avoids the call overhead if there are no
>> waiters as well.
> 
> One could argue that, but one should also understand what the
> design constraints for a particular algorithm are before
> suggesting that their solution is "robust". :)

Yes, but an understanding to the extent required by the argument should be sufficient :-)

>> # seq 29
>>
>> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
> 
> So, at 20:08:15 we get a push trigger and the work is queued. But...
> 
> .....
>> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
> 
> It takes the best part of *50 seconds* before the push work actually
> runs?
> 
> That's .... well and truly screwed up - the work should run on that
> CPU the very next time it yields the CPU. If we're holding the
> CPU without yielding it for that long, hangcheck and RCU warnings
> should be going off...

No such warnings.

But the load is probably I/O bound to the log:

- creates `cp -a` copies of a directory tree with small files (linux source repository)
- source tree probably completely cached.
- two copies in parallel
- slow log (on software raid6)

Isn't it to be expected that sooner or later you need to wait for log writes when you write as fast as possible with lots of metadata updates and not so much data?

I'm a bit concerned, though, that there seems to be a rather unlimited number (~1000) of kernel worker threads waiting for the CIL push and, indirectly, for log writes.

>> # seq 30
>>
>> 2020-12-29T20:09:39.305108+01:00 deadbird kernel: [ 1137.514718] XXX trigger cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=33554480      , push_seq=30, ctx->sequence=30
> 
> 20:09:39 for the next trigger,
> 
>> 2020-12-29T20:10:20.389104+01:00 deadbird kernel: [ 1178.597976] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:20.389117+01:00 deadbird kernel: [ 1178.613792] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:20.619077+01:00 deadbird kernel: [ 1178.827935] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:21.129074+01:00 deadbird kernel: [ 1179.337996] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:21.190101+01:00 deadbird kernel: [ 1179.398869] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:21.866096+01:00 deadbird kernel: [ 1180.074325] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:22.076095+01:00 deadbird kernel: [ 1180.283748] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:22.193070+01:00 deadbird kernel: [ 1180.401590] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108924 >= 67108864, push_seq=30, ctx->sequence=30
>> 2020-12-29T20:10:22.421082+01:00 deadbird kernel: [ 1180.629682] XXX pushw   cil 00000000e374c6f1 ctx 00000000c46ab121  ctx->space_used=67108908 >= 67108864, push_seq=30, ctx->sequence=30
> 
> So it dropped by 16 bytes (seems to be common) which is unexpected.
> I wonder if it filled a hole in a buffer and so needed one less
> xlog_op_header()? But then the size would have gone up by at least
> 128 bytes for the hole that was filled, so it still shouldn't go
> down in size.
> 
> I think you need to instrument xlog_cil_insert_items() and catch
> a negative length here:
> 
> 	/* account for space used by new iovec headers  */
> 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> 	len += iovhdr_res;
> 	ctx->nvecs += diff_iovecs;
> 
> (diff_iovecs will be negative if the number of xlog_op_header
> structures goes down)
> 
> And if this happens, then dump the transaction ticket via
> xlog_print_trans(tp) so we can see all the log items types and
> vectors that the transaction has formatted...

I tried that, but the output was difficult to understand, because at that point you can only log the complete transaction with the items already updated. And a shrinking item is not switched to the shadow vector, so the formatted content is already overwritten and no longer available for analysis.

So I've added some code to dump an item in its old and its new state in xlog_cil_insert_format_items() when either the requested buffer length or the number of vectors decreased.

Several examples of different kinds, with a little bit of manual annotation, follow.

Best
   Donald

- XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 32 to 20 bytes

[   29.606212] XFS (sda1): XXX required buf size 184 -> 172
[   29.612591] XFS (sda1): XXX niovecs           3 -> 3

[   29.618570] XFS (sda1): XXX old log item:
[   29.623469] XFS (sda1): log item:
[   29.627683] XFS (sda1):   type       = 0x123b                                               # XFS_LI_INODE
[   29.632375] XFS (sda1):   flags      = 0x8
[   29.636858] XFS (sda1):   niovecs    = 3
[   29.647442] XFS (sda1):   size       = 312
[   29.651814] XFS (sda1):   bytes      = 184
[   29.656278] XFS (sda1):   buf len    = 184
[   29.660927] XFS (sda1):   iovec[0]
[   29.665071] XFS (sda1):     type     = 0x5                                                  # XLOG_REG_TYPE_IFORMAT
[   29.669592] XFS (sda1):     len      = 56
[   29.673914] XFS (sda1):     first 32 bytes of iovec[0]:
[   29.680079] 00000000: 3b 12 03 00 03 00 00 00 00 00 20 00 00 00 00 00  ;......... .....
[   29.689363] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   29.698633] XFS (sda1):   iovec[1]
[   29.702756] XFS (sda1):     type     = 0x6                                                  # XLOG_REG_TYPE_ICORE
[   29.707263] XFS (sda1):     len      = 96
[   29.711571] XFS (sda1):     first 32 bytes of iovec[1]:
[   29.717720] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   29.726986] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   29.736241] XFS (sda1):   iovec[2]
[   29.740364] XFS (sda1):     type     = 0x9                                                  # XLOG_REG_TYPE_ILOCAL
[   29.744873] XFS (sda1):     len      = 32
[   29.749184] XFS (sda1):     first 32 bytes of iovec[2]:
[   29.755336] 00000000: 02 00 30 e7 02 26 06 00 40 73 65 72 76 65 72 00  ..0..&..@server.
[   29.764612] 00000010: 08 92 ef 04 00 58 6d 78 36 34 00 d3 93 58 00 58  .....Xmx64...X.X

[   29.773900] XFS (sda1): XXX new log item:
[   29.778718] XFS (sda1): log item:
[   29.782856] XFS (sda1):   type       = 0x123b
[   29.787478] XFS (sda1):   flags      = 0x8
[   29.791902] XFS (sda1):   niovecs    = 3
[   29.796321] XFS (sda1):   size       = 312
[   29.800640] XFS (sda1):   bytes      = 172
[   29.805052] XFS (sda1):   buf len    = 176
[   29.809659] XFS (sda1):   iovec[0]
[   29.813781] XFS (sda1):     type     = 0x5
[   29.818289] XFS (sda1):     len      = 56
[   29.822599] XFS (sda1):     first 32 bytes of iovec[0]:
[   29.828754] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
[   29.838024] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   29.847292] XFS (sda1):   iovec[1]
[   29.851420] XFS (sda1):     type     = 0x6
[   29.855933] XFS (sda1):     len      = 96
[   29.860247] XFS (sda1):     first 32 bytes of iovec[1]:
[   29.866406] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   29.875677] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   29.884949] XFS (sda1):   iovec[2]
[   29.889081] XFS (sda1):     type     = 0x9
[   29.893601] XFS (sda1):     len      = 20
[   29.897924] XFS (sda1):     first 20 bytes of iovec[2]:
[   29.904096] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
[   29.913381] 00000010: 58 92 ef 04                                      X...

- (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 20 to 8 bytes

[   29.982907] XFS (sda1): XXX required buf size 172 -> 160
[   29.992716] XFS (sda1): XXX niovecs           3 -> 3

[   29.998728] XFS (sda1): XXX old log item:
[   30.003624] XFS (sda1): log item:
[   30.007835] XFS (sda1):   type       = 0x123b
[   30.012654] XFS (sda1):   flags      = 0x8
[   30.017145] XFS (sda1):   niovecs    = 3
[   30.021638] XFS (sda1):   size       = 312
[   30.026012] XFS (sda1):   bytes      = 172
[   30.030610] XFS (sda1):   buf len    = 176
[   30.035292] XFS (sda1):   iovec[0]
[   30.039480] XFS (sda1):     type     = 0x5
[   30.044054] XFS (sda1):     len      = 56
[   30.048534] XFS (sda1):     first 32 bytes of iovec[0]:
[   30.054749] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
[   30.064091] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   30.073545] XFS (sda1):   iovec[1]
[   30.077744] XFS (sda1):     type     = 0x6
[   30.082455] XFS (sda1):     len      = 96
[   30.086824] XFS (sda1):     first 32 bytes of iovec[1]:
[   30.093025] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   30.102346] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   30.111801] XFS (sda1):   iovec[2]
[   30.115989] XFS (sda1):     type     = 0x9
[   30.120715] XFS (sda1):     len      = 20
[   30.125102] XFS (sda1):     first 20 bytes of iovec[2]:
[   30.131331] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
[   30.140808] 00000010: 58 92 ef 04                                      X...

[   30.149006] XFS (sda1): XXX new log item:
[   30.154039] XFS (sda1): log item:
[   30.154039] XFS (sda1):   type       = 0x123b
[   30.154041] XFS (sda1):   flags      = 0x8
[   30.167436] XFS (sda1):   niovecs    = 3
[   30.167437] XFS (sda1):   size       = 312
[   30.167438] XFS (sda1):   bytes      = 160
[   30.180881] XFS (sda1):   buf len    = 160
[   30.180882] XFS (sda1):   iovec[0]
[   30.180882] XFS (sda1):     type     = 0x5
[   30.180883] XFS (sda1):     len      = 56
[   30.180884] XFS (sda1):     first 32 bytes of iovec[0]:
[   30.180884] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
[   30.180885] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   30.180886] XFS (sda1):   iovec[1]
[   30.180886] XFS (sda1):     type     = 0x6
[   30.180887] XFS (sda1):     len      = 96
[   30.180887] XFS (sda1):     first 32 bytes of iovec[1]:
[   30.180888] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   30.180889] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   30.180889] XFS (sda1):   iovec[2]
[   30.180890] XFS (sda1):     type     = 0x9
[   30.180890] XFS (sda1):     len      = 8
[   30.180890] XFS (sda1):     first 8 bytes of iovec[2]:
[   30.180891] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..

- (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL removed - niovecs from 3 to 2

[   30.197403] XFS (sda1): XXX required buf size 160 -> 152
[   30.296091] XFS (sda1): XXX niovecs           3 -> 2

[   30.296092] XFS (sda1): XXX old log item:
[   30.296093] XFS (sda1): log item:
[   30.297552] ixgbe 0000:01:00.1 net03: renamed from eth3
[   30.317524] XFS (sda1):   type       = 0x123b
[   30.317524] XFS (sda1):   flags      = 0x8
[   30.317525] XFS (sda1):   niovecs    = 3
[   30.317525] XFS (sda1):   size       = 304
[   30.317526] XFS (sda1):   bytes      = 160
[   30.317526] XFS (sda1):   buf len    = 160
[   30.317527] XFS (sda1):   iovec[0]
[   30.317527] XFS (sda1):     type     = 0x5
[   30.317528] XFS (sda1):     len      = 56
[   30.317528] XFS (sda1):     first 32 bytes of iovec[0]:
[   30.317529] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
[   30.317530] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   30.317531] XFS (sda1):   iovec[1]
[   30.317531] XFS (sda1):     type     = 0x6
[   30.317531] XFS (sda1):     len      = 96
[   30.317532] XFS (sda1):     first 32 bytes of iovec[1]:
[   30.317533] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
[   30.317533] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
[   30.317533] XFS (sda1):   iovec[2]
[   30.317534] XFS (sda1):     type     = 0x9
[   30.317534] XFS (sda1):     len      = 8
[   30.317535] XFS (sda1):     first 8 bytes of iovec[2]:
[   30.317535] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..

[   30.317536] XFS (sda1): XXX new log item:
[   30.317537] XFS (sda1): log item:
[   30.317537] XFS (sda1):   type       = 0x123b
[   30.317538] XFS (sda1):   flags      = 0x8
[   30.317539] XFS (sda1):   niovecs    = 2
[   30.317539] XFS (sda1):   size       = 304
[   30.317540] XFS (sda1):   bytes      = 152
[   30.317540] XFS (sda1):   buf len    = 152
[   30.317541] XFS (sda1):   iovec[0]
[   30.317541] XFS (sda1):     type     = 0x5
[   30.317542] XFS (sda1):     len      = 56
[   30.317542] XFS (sda1):     first 32 bytes of iovec[0]:
[   30.317543] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
[   30.317543] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
[   30.317544] XFS (sda1):   iovec[1]
[   30.317544] XFS (sda1):     type     = 0x6
[   30.317545] XFS (sda1):     len      = 96
[   30.317545] XFS (sda1):     first 32 bytes of iovec[1]:
[   30.317546] 00000000: 4e 49 00 00 02 02 00 00 00 00 00 00 00 00 00 00  NI..............
[   30.317546] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................

- XFS_LI_INODE.XLOG_REG_TYPE_IEXT removed - niovecs from 3 to 2

[   37.983756] XFS (sda1): XXX required buf size 168 -> 152
[   37.990253] XFS (sda1): XXX niovecs           3 -> 2

[   37.996202] XFS (sda1): XXX old log item:
[   38.001061] XFS (sda1): log item:
[   38.005239] XFS (sda1):   type       = 0x123b
[   38.009885] XFS (sda1):   flags      = 0x9
[   38.014330] XFS (sda1):   niovecs    = 3
[   38.018764] XFS (sda1):   size       = 440
[   38.023100] XFS (sda1):   bytes      = 168
[   38.027533] XFS (sda1):   buf len    = 168
[   38.032157] XFS (sda1):   iovec[0]
[   38.036286] XFS (sda1):     type     = 0x5
[   38.040796] XFS (sda1):     len      = 56
[   38.045114] XFS (sda1):     first 32 bytes of iovec[0]:
[   38.051277] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
[   38.060562] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   38.069853] XFS (sda1):   iovec[1]
[   38.073989] XFS (sda1):     type     = 0x6
[   38.078525] XFS (sda1):     len      = 96
[   38.082871] XFS (sda1):     first 32 bytes of iovec[1]:
[   38.089052] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
[   38.098331] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.
[   38.107611] XFS (sda1):   iovec[2]
[   38.111754] XFS (sda1):     type     = 0x7                                                # XLOG_REG_TYPE_IEXT
[   38.116285] XFS (sda1):     len      = 16
[   38.120608] XFS (sda1):     first 16 bytes of iovec[2]:
[   38.126770] 00000000: 00 00 00 00 00 00 00 00 00 00 00 11 31 80 00 01  ............1...

[   38.136054] XFS (sda1): XXX new log item:
[   38.140878] XFS (sda1): log item:
[   38.145025] XFS (sda1):   type       = 0x123b
[   38.149645] XFS (sda1):   flags      = 0x9
[   38.154067] XFS (sda1):   niovecs    = 2
[   38.158490] XFS (sda1):   size       = 440
[   38.162799] XFS (sda1):   bytes      = 152
[   38.167202] XFS (sda1):   buf len    = 152
[   38.171801] XFS (sda1):   iovec[0]
[   38.175911] XFS (sda1):     type     = 0x5
[   38.180409] XFS (sda1):     len      = 56
[   38.184708] XFS (sda1):     first 32 bytes of iovec[0]:
[   38.190852] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
[   38.200115] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
[   38.209373] XFS (sda1):   iovec[1]
[   38.213488] XFS (sda1):     type     = 0x6
[   38.217990] XFS (sda1):     len      = 96
[   38.222297] XFS (sda1):     first 32 bytes of iovec[1]:
[   38.228442] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
[   38.237707] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.

- XFS_LI_BUF.XLOG_REG_TYPE_BCHUNK removed - niovecs from 3 to 2

[ 2120.030590] XFS (md0): XXX required buf size 3992 -> 4120
[ 2120.037257] XFS (md0): XXX niovecs           3 -> 2

[ 2120.043295] XFS (md0): XXX old log item:
[ 2120.048235] XFS (md0): log item:
[ 2120.052394] XFS (md0):   type        = 0x123c                                             # XFS_LI_BUF
[ 2120.057112] XFS (md0):   flags       = 0x8
[ 2120.061624] XFS (md0):   niovecs     = 3
[ 2120.066029] XFS (md0):   size        = 4120
[ 2120.070528] XFS (md0):   bytes       = 3992
[ 2120.075120] XFS (md0):   buf len     = 3992
[ 2120.079990] XFS (md0):   iovec[0]
[ 2120.084200] XFS (md0):     type      = 0x1                                                # XLOG_REG_TYPE_BFORMAT
[ 2120.088891] XFS (md0):     len       = 24
[ 2120.093287] XFS (md0):     first 24 bytes of iovec[0]:
[ 2120.099622] 00000000: 3c 12 03 00 00 50 08 00 78 dc ff 7f 04 00 00 00  <....P..x.......
[ 2120.109088] 00000010: 01 00 00 00 ff ff ff fe                          ........
[ 2120.117878] XFS (md0):   iovec[1]
[ 2120.122012] XFS (md0):     type      = 0x2                                                # XLOG_REG_TYPE_BCHUNK
[ 2120.126532] XFS (md0):     len       = 3072
[ 2120.131130] XFS (md0):     first 32 bytes of iovec[1]:
[ 2120.137376] 00000000: 58 44 42 33 00 00 00 00 00 00 00 04 7f ff dc 78  XDB3...........x
[ 2120.146740] 00000010: 00 00 00 00 00 00 00 00 58 03 ee 68 c3 6b 48 0c  ........X..h.kH.
[ 2120.156111] XFS (md0):   iovec[2]
[ 2120.160339] XFS (md0):     type      = 0x2                                                # XLOG_REG_TYPE_BCHUNK
[ 2120.164860] XFS (md0):     len       = 896
[ 2120.169354] XFS (md0):     first 32 bytes of iovec[2]:
[ 2120.175509] 00000000: 00 00 00 00 00 00 0b 80 00 00 00 2e 00 00 00 08  ................
[ 2120.184963] 00000010: 00 00 17 2e 00 00 00 0a 04 7c 79 34 00 00 00 16  .........|y4....

[ 2120.194426] XFS (md0): XXX new log item:
[ 2120.199236] XFS (md0): log item:
[ 2120.203459] XFS (md0):   type        = 0x123c
[ 2120.208064] XFS (md0):   flags       = 0x8
[ 2120.212465] XFS (md0):   niovecs     = 2
[ 2120.216949] XFS (md0):   size        = 4224
[ 2120.221434] XFS (md0):   bytes       = 4120
[ 2120.225921] XFS (md0):   buf len     = 4120
[ 2120.230695] XFS (md0):   iovec[0]
[ 2120.234780] XFS (md0):     type      = 0x1
[ 2120.239256] XFS (md0):     len       = 24
[ 2120.243619] XFS (md0):     first 24 bytes of iovec[0]:
[ 2120.249735] 00000000: 3c 12 02 00 00 50 08 00 78 dc ff 7f 04 00 00 00  <....P..x.......
[ 2120.259148] 00000010: 01 00 00 00 ff ff ff ff                          ........
[ 2120.267776] XFS (md0):   iovec[1]
[ 2120.271845] XFS (md0):     type      = 0x2
[ 2120.276287] XFS (md0):     len       = 4096
[ 2120.280803] XFS (md0):     first 32 bytes of iovec[1]:
[ 2120.286884] 00000000: 58 44 42 33 00 00 00 00 00 00 00 04 7f ff dc 78  XDB3...........x
[ 2120.296254] 00000010: 00 00 00 00 00 00 00 00 58 03 ee 68 c3 6b 48 0c  ........X..h.kH.

- only XFS_LI_BUF.XLOG_REG_TYPE_BCHUNK removed - niovecs from 2 to 1

[ 2279.729095] XFS (sda1): XXX required buf size 152 -> 24
[ 2279.735437] XFS (sda1): XXX niovecs           2 -> 1

[ 2279.741360] XFS (sda1): XXX old log item:
[ 2279.746199] XFS (sda1): log item:
[ 2279.750512] XFS (sda1):   type       = 0x123c
[ 2279.755181] XFS (sda1):   flags      = 0x8
[ 2279.759644] XFS (sda1):   niovecs    = 2
[ 2279.764246] XFS (sda1):   size       = 256
[ 2279.768625] XFS (sda1):   bytes      = 152
[ 2279.773094] XFS (sda1):   buf len    = 152
[ 2279.777741] XFS (sda1):   iovec[0]
[ 2279.782044] XFS (sda1):     type     = 0x1
[ 2279.786607] XFS (sda1):     len      = 24
[ 2279.790980] XFS (sda1):     first 24 bytes of iovec[0]:
[ 2279.797290] 00000000: 3c 12 02 00 00 20 08 00 b0 41 f8 08 00 00 00 00  <.... ...A......
[ 2279.806730] 00000010: 01 00 00 00 01 00 00 00                          ........
[ 2279.815287] XFS (sda1):   iovec[1]
[ 2279.819604] XFS (sda1):     type     = 0x2
[ 2279.824408] XFS (sda1):     len      = 128
[ 2279.828888] XFS (sda1):     first 32 bytes of iovec[1]:
[ 2279.835126] 00000000: 42 4d 41 50 00 00 00 09 ff ff ff ff ff ff ff ff  BMAP............
[ 2279.844617] 00000010: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00  ................

[ 2279.853964] XFS (sda1): XXX new log item:
[ 2279.858958] XFS (sda1): log item:
[ 2279.863149] XFS (sda1):   type       = 0x123c
[ 2279.867834] XFS (sda1):   flags      = 0x8
[ 2279.872410] XFS (sda1):   niovecs    = 1
[ 2279.876874] XFS (sda1):   size       = 256
[ 2279.881242] XFS (sda1):   bytes      = 24
[ 2279.885585] XFS (sda1):   buf len    = 24
[ 2279.890248] XFS (sda1):   iovec[0]
[ 2279.894408] XFS (sda1):     type     = 0x1
[ 2279.898954] XFS (sda1):     len      = 24
[ 2279.903291] XFS (sda1):     first 24 bytes of iovec[0]:
[ 2279.909608] 00000000: 3c 12 01 00 02 00 08 00 b0 41 f8 08 00 00 00 00  <........A......
[ 2279.918937] 00000010: 01 00 00 00 00 00 00 00                          ........

Log code for reference

commit 137ce3815d4d3ec47a2cf043a9c26aff1892eabe
Author: Donald Buczek <buczek@molgen.mpg.de>
Date:   Fri Jan 1 15:36:30 2021 +0100

     log negativ log items

diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 58c3fcbec94a..d2e6c7d78047 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -17,6 +17,7 @@ struct xfs_log_vec {
  	int			lv_bytes;	/* accounted space in buffer */
  	int			lv_buf_len;	/* aligned size of buffer */
  	int			lv_size;	/* size of allocated lv */
+	int			lv_debug_required_bytes;
  };
  
  #define XFS_LOG_VEC_ORDERED	(-1)
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index d620de8e217c..3931ae9c8996 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -17,6 +17,8 @@
  #include "xfs_log_priv.h"
  #include "xfs_trace.h"
  
+#include <linux/ratelimit_types.h>
+
  struct workqueue_struct *xfs_discard_wq;
  
  /*
@@ -133,6 +135,7 @@ xlog_cil_alloc_shadow_bufs(
  		int	nbytes = 0;
  		int	buf_size;
  		bool	ordered = false;
+		int	debug_required_bytes;
  
  		/* Skip items which aren't dirty in this transaction. */
  		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
@@ -141,6 +144,8 @@ xlog_cil_alloc_shadow_bufs(
  		/* get number of vecs and size of data to be stored */
  		lip->li_ops->iop_size(lip, &niovecs, &nbytes);
  
+		debug_required_bytes = nbytes;
+
  		/*
  		 * Ordered items need to be tracked but we do not wish to write
  		 * them. We need a logvec to track the object, but we do not
@@ -208,6 +213,7 @@ xlog_cil_alloc_shadow_bufs(
  
  		/* Ensure the lv is set up according to ->iop_size */
  		lv->lv_niovecs = niovecs;
+		lv->lv_debug_required_bytes = debug_required_bytes;
  
  		/* The allocated data region lies beyond the iovec region */
  		lv->lv_buf = (char *)lv + xlog_cil_iovec_space(niovecs);
@@ -266,6 +272,39 @@ xfs_cil_prepare_item(
  		lv->lv_item->li_seq = log->l_cilp->xc_ctx->sequence;
  }
  
+static void xlog_print_logitem(
+	struct xfs_log_item *lip) {
+
+	struct xfs_mount	*mp = lip->li_mountp;
+	struct xfs_log_vec	*lv = lip->li_lv;
+	struct xfs_log_iovec	*vec;
+	int			i;
+
+	xfs_warn(mp, "log item: ");
+	xfs_warn(mp, "  type	= 0x%x", lip->li_type);
+	xfs_warn(mp, "  flags	= 0x%lx", lip->li_flags);
+	if (!lv)
+		return;
+	xfs_warn(mp, "  niovecs	= %d", lv->lv_niovecs);
+	xfs_warn(mp, "  size	= %d", lv->lv_size);
+	xfs_warn(mp, "  bytes	= %d", lv->lv_bytes);
+	xfs_warn(mp, "  buf len	= %d", lv->lv_buf_len);
+
+	/* dump each iovec for the log item */
+	vec = lv->lv_iovecp;
+	for (i = 0; i < lv->lv_niovecs; i++) {
+		int dumplen = min(vec->i_len, 32);
+
+		xfs_warn(mp, "  iovec[%d]", i);
+		xfs_warn(mp, "    type	= 0x%x", vec->i_type);
+		xfs_warn(mp, "    len	= %d", vec->i_len);
+		xfs_warn(mp, "    first %d bytes of iovec[%d]:", dumplen, i);
+		xfs_hex_dump(vec->i_addr, dumplen);
+
+		vec++;
+	}
+}
+
  /*
   * Format log item into a flat buffers
   *
@@ -315,6 +354,8 @@ xlog_cil_insert_format_items(
  		struct xfs_log_vec *old_lv = NULL;
  		struct xfs_log_vec *shadow;
  		bool	ordered = false;
+		bool	want_dump = false;
+		static DEFINE_RATELIMIT_STATE(_rs, 1*HZ, 5);
  
  		/* Skip items which aren't dirty in this transaction. */
  		if (!test_bit(XFS_LI_DIRTY, &lip->li_flags))
@@ -334,6 +375,21 @@ xlog_cil_insert_format_items(
  
  		/* compare to existing item size */
  		old_lv = lip->li_lv;
+
+		if (lip->li_lv
+			&& (
+				shadow->lv_debug_required_bytes < old_lv->lv_bytes
+				|| shadow->lv_niovecs < old_lv->lv_niovecs )
+			&& __ratelimit(&_rs) ) {
+			xfs_warn(lip->li_mountp, "XXX required buf size %d -> %d",
+				old_lv->lv_bytes, shadow->lv_debug_required_bytes);
+			xfs_warn(lip->li_mountp, "XXX niovecs           %d -> %d",
+				old_lv->lv_niovecs, shadow->lv_niovecs);
+			xfs_warn(lip->li_mountp, "XXX old log item:");
+			xlog_print_logitem(lip);
+			want_dump = true;
+		}
+
  		if (lip->li_lv && shadow->lv_size <= lip->li_lv->lv_size) {
  			/* same or smaller, optimise common overwrite case */
  			lv = lip->li_lv;
@@ -372,6 +428,10 @@ xlog_cil_insert_format_items(
  		lip->li_ops->iop_format(lip, lv);
  insert:
  		xfs_cil_prepare_item(log, lv, old_lv, diff_len, diff_iovecs);
+		if (want_dump) {
+			xfs_warn(lip->li_mountp, "XXX new log item:");
+			xlog_print_logitem(lip);
+		}
  	}
  }
  


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-02 19:12                     ` Donald Buczek
@ 2021-01-02 22:44                       ` Dave Chinner
  2021-01-03 16:03                         ` Donald Buczek
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2021-01-02 22:44 UTC (permalink / raw)
  To: Donald Buczek
  Cc: linux-xfs, Brian Foster, Linux Kernel Mailing List, it+linux-xfs

On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> On 31.12.20 22:59, Dave Chinner wrote:
> > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> > > On 30.12.20 23:16, Dave Chinner wrote:

> > One could argue that, but one should also understand the design
> > constraints for a particular algorithm are before suggesting
> > that their solution is "robust". :)
>
> Yes, but an understanding to the extent required by the
> argument should be sufficient :-)

And that's the fundamental conceit described by Dunning-Kruger.

I.e. someone thinks they know enough to understand the argument,
when in fact they don't understand enough about the subject matter
to realise they don't understand what the expert at the other end is
saying at all....

> > > # seq 29
> > > 
> > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
> > 
> > So, at 20:08:15 we get a push trigger and the work is queued. But...
> > 
> > .....
> > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
> > 
> > It takes the best part of *50 seconds* before the push work actually
> > runs?
> > 
> > That's .... well and truly screwed up - the work should run on that
> > CPU on the very next time it yeilds the CPU. If we're holding the
> > CPU without yeilding it for that long, hangcheck and RCU warnings
> > should be going off...
> 
> No such warnings.
> 
> But the load is probably I/O bound to the log:
> 
> - creates `cp -a` copies of a directory tree with small files (linux source repository)
> - source tree probably completely cached.
> - two copies in parallel
> - slow log (on software raid6)
> 
> Isn't it to be expected that sooner or later you need to wait for
> log writes when you write as fast as possible with lots of
> metadata updates and not so much data?

No, incoming transactions wait for transaction reservation space,
not log writes. Reservation space is freed up by metadata writeback.
So if you have new transactions blocking in xfs_trans_reserve(),
then we are blocking on metadata writeback.

The CIL worker thread may be waiting for log IO completion before it
issues more log IO, and in that case it is waiting on iclog buffer
space (i.e. only waiting on internal log state, not metadata
writeback).  Can you find that kworker thread stack and paste it? If
bound on log IO, it will be blocked in xlog_get_iclog_space().

Please paste the entire stack output, too, not just the bits you
think are relevant or understand....
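[Editorial note: the blocked-task stacks being asked for can be captured without any instrumentation. A sketch — both commands need root, and the PID shown is purely illustrative; substitute the kworker PID from your own ps output.]

```shell
# Dump stacks of all uninterruptible (D-state) tasks into the kernel log:
echo w > /proc/sysrq-trigger
dmesg | tail -n 200

# Or read one specific thread's stack directly (4242 is a made-up PID):
cat /proc/4242/stack
```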

Also, cp -a of a linux source tree is just as data intensive as it
is metadata intensive. There's probably more data IO than metadata
IO, so that's more likely to be what is slowing the disks down
than metadata writeback is...

> I'm a bit concerned, though, that there seem to be a rather
> unlimited (~ 1000) number of kernel worker threads waiting for the
> cil push and indirectly for log writes.

That's normal - XFS is highly asynchronous and defers a lot of work
to completion threads.

> > So it dropped by 16 bytes (seems to be common) which is
> > unexpected.  I wonder if it filled a hole in a buffer and so
> > needed one less xlog_op_header()? But then the size would have
> > gone up by at least 128 bytes for the hole that was filled, so
> > it still shouldn't go down in size.
> > 
> > I think you need to instrument xlog_cil_insert_items() and catch
> > a negative length here:
> > 
> > 	/* account for space used by new iovec headers  */
> > 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
> > 	len += iovhdr_res;
> > 	ctx->nvecs += diff_iovecs;
> > 
> > (diff_iovecs will be negative if the number of xlog_op_header
> > structures goes down)
> > 
> > And if this happens, then dump the transaction ticket via
> > xlog_print_trans(tp) so we can see all the log items types and
> > vectors that the transaction has formatted...
> 
> I tried that, but the output was difficult to understand, because
> at that point you can only log the complete transaction with the
> items already updated.  And a shrinking item is not switched to the
> shadow vector, so the formatted content is already overwritten and
> not available for analysis.

Yes, that's exactly the information I need to see.

But the fact you think this is something I don't need to know about
is classic Dunning-Kruger in action. You don't understand why I
asked for this information, and found the information "difficult to
understand", so you decided I didn't need it either, despite the
fact I asked explicitly for it.

What I first need to know is what operations are being performed by
the transaction that shrunk and what all the items in it are, not
which specific items shrunk and by how much. There can be tens to
hundreds of items in a single transaction, and it's the combination
of the transaction state, the reservation, the amount of the
reservation that has been consumed, what items are consuming that
reservation, etc. that I need to first see and analyse.

I don't ask for specific information just for fun - I ask for
specific information because it is *necessary*. If you want the
problem triaged and fixed, then please supply the information you
are asked for, even if you don't understand why or what it means.

> So I've added some code to dump an item in its old and its new
> state in xlog_cil_insert_format_items when either the requested
> buffer len or the number of vectors decreased.

Yes, that would be useful when trying to understand the low level
details of where a specific operation might be doing something
unexpected, but I don't have a high level picture of what operations
are triggering this issue so have no clue about the context in which
these isolated, context-free changes are occurring.

That said ....

> Several examples of different kinds follow, with a little manual annotation
> 
> Best
>   Donald
> 
> - XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 32 to 20 bytes
> 
> [   29.606212] XFS (sda1): XXX required buf size 184 -> 172
> [   29.612591] XFS (sda1): XXX niovecs           3 -> 3
> 
> [   29.618570] XFS (sda1): XXX old log item:
> [   29.623469] XFS (sda1): log item:
> [   29.627683] XFS (sda1):   type       = 0x123b                                               # XFS_LI_INODE
> [   29.632375] XFS (sda1):   flags      = 0x8
> [   29.636858] XFS (sda1):   niovecs    = 3
> [   29.647442] XFS (sda1):   size       = 312
> [   29.651814] XFS (sda1):   bytes      = 184
> [   29.656278] XFS (sda1):   buf len    = 184
> [   29.660927] XFS (sda1):   iovec[0]
> [   29.665071] XFS (sda1):     type     = 0x5                                                  # XLOG_REG_TYPE_IFORMAT
> [   29.669592] XFS (sda1):     len      = 56
> [   29.673914] XFS (sda1):     first 32 bytes of iovec[0]:
> [   29.680079] 00000000: 3b 12 03 00 03 00 00 00 00 00 20 00 00 00 00 00  ;......... .....
> [   29.689363] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   29.698633] XFS (sda1):   iovec[1]
> [   29.702756] XFS (sda1):     type     = 0x6                                                  # XLOG_REG_TYPE_ICORE
> [   29.707263] XFS (sda1):     len      = 96
> [   29.711571] XFS (sda1):     first 32 bytes of iovec[1]:
> [   29.717720] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   29.726986] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   29.736241] XFS (sda1):   iovec[2]
> [   29.740364] XFS (sda1):     type     = 0x9                                                  # XLOG_REG_TYPE_ILOCAL
> [   29.744873] XFS (sda1):     len      = 32
> [   29.749184] XFS (sda1):     first 32 bytes of iovec[2]:
> [   29.755336] 00000000: 02 00 30 e7 02 26 06 00 40 73 65 72 76 65 72 00  ..0..&..@server.
> [   29.764612] 00000010: 08 92 ef 04 00 58 6d 78 36 34 00 d3 93 58 00 58  .....Xmx64...X.X
> 
> [   29.773900] XFS (sda1): XXX new log item:
> [   29.778718] XFS (sda1): log item:
> [   29.782856] XFS (sda1):   type       = 0x123b
> [   29.787478] XFS (sda1):   flags      = 0x8
> [   29.791902] XFS (sda1):   niovecs    = 3
> [   29.796321] XFS (sda1):   size       = 312
> [   29.800640] XFS (sda1):   bytes      = 172
> [   29.805052] XFS (sda1):   buf len    = 176
> [   29.809659] XFS (sda1):   iovec[0]
> [   29.813781] XFS (sda1):     type     = 0x5
> [   29.818289] XFS (sda1):     len      = 56
> [   29.822599] XFS (sda1):     first 32 bytes of iovec[0]:
> [   29.828754] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
> [   29.838024] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   29.847292] XFS (sda1):   iovec[1]
> [   29.851420] XFS (sda1):     type     = 0x6
> [   29.855933] XFS (sda1):     len      = 96
> [   29.860247] XFS (sda1):     first 32 bytes of iovec[1]:
> [   29.866406] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   29.875677] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   29.884949] XFS (sda1):   iovec[2]
> [   29.889081] XFS (sda1):     type     = 0x9
> [   29.893601] XFS (sda1):     len      = 20
> [   29.897924] XFS (sda1):     first 20 bytes of iovec[2]:
> [   29.904096] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
> [   29.913381] 00000010: 58 92 ef 04                                      X...

That may be removal of a directory entry from a local form
directory. Hmmm - local form inode forks only log the active byte
range so that could be the issue here.

Hold on - where did this come from? "cp -a" doesn't result in file
and directory removals, right?

> - (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 20 to 8 bytes
> 
> [   29.982907] XFS (sda1): XXX required buf size 172 -> 160
> [   29.992716] XFS (sda1): XXX niovecs           3 -> 3
> 
> [   29.998728] XFS (sda1): XXX old log item:
> [   30.003624] XFS (sda1): log item:
> [   30.007835] XFS (sda1):   type       = 0x123b
> [   30.012654] XFS (sda1):   flags      = 0x8
> [   30.017145] XFS (sda1):   niovecs    = 3
> [   30.021638] XFS (sda1):   size       = 312
> [   30.026012] XFS (sda1):   bytes      = 172
> [   30.030610] XFS (sda1):   buf len    = 176
> [   30.035292] XFS (sda1):   iovec[0]
> [   30.039480] XFS (sda1):     type     = 0x5
> [   30.044054] XFS (sda1):     len      = 56
> [   30.048534] XFS (sda1):     first 32 bytes of iovec[0]:
> [   30.054749] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
> [   30.064091] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   30.073545] XFS (sda1):   iovec[1]
> [   30.077744] XFS (sda1):     type     = 0x6
> [   30.082455] XFS (sda1):     len      = 96
> [   30.086824] XFS (sda1):     first 32 bytes of iovec[1]:
> [   30.093025] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   30.102346] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   30.111801] XFS (sda1):   iovec[2]
> [   30.115989] XFS (sda1):     type     = 0x9
> [   30.120715] XFS (sda1):     len      = 20
> [   30.125102] XFS (sda1):     first 20 bytes of iovec[2]:
> [   30.131331] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
> [   30.140808] 00000010: 58 92 ef 04                                      X...
> 
> [   30.149006] XFS (sda1): XXX new log item:
> [   30.154039] XFS (sda1): log item:
> [   30.154039] XFS (sda1):   type       = 0x123b
> [   30.154041] XFS (sda1):   flags      = 0x8
> [   30.167436] XFS (sda1):   niovecs    = 3
> [   30.167437] XFS (sda1):   size       = 312
> [   30.167438] XFS (sda1):   bytes      = 160
> [   30.180881] XFS (sda1):   buf len    = 160
> [   30.180882] XFS (sda1):   iovec[0]
> [   30.180882] XFS (sda1):     type     = 0x5
> [   30.180883] XFS (sda1):     len      = 56
> [   30.180884] XFS (sda1):     first 32 bytes of iovec[0]:
> [   30.180884] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
> [   30.180885] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   30.180886] XFS (sda1):   iovec[1]
> [   30.180886] XFS (sda1):     type     = 0x6
> [   30.180887] XFS (sda1):     len      = 96
> [   30.180887] XFS (sda1):     first 32 bytes of iovec[1]:
> [   30.180888] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   30.180889] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   30.180889] XFS (sda1):   iovec[2]
> [   30.180890] XFS (sda1):     type     = 0x9
> [   30.180890] XFS (sda1):     len      = 8
> [   30.180890] XFS (sda1):     first 8 bytes of iovec[2]:
> [   30.180891] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..

And that looks to be another unlink in the same directory inode, now
empty.

> 
> - (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL removed - niovecs from 3 to 2
> 
> [   30.197403] XFS (sda1): XXX required buf size 160 -> 152
> [   30.296091] XFS (sda1): XXX niovecs           3 -> 2
> 
> [   30.296092] XFS (sda1): XXX old log item:
> [   30.296093] XFS (sda1): log item:
> [   30.297552] ixgbe 0000:01:00.1 net03: renamed from eth3
> [   30.317524] XFS (sda1):   type       = 0x123b
> [   30.317524] XFS (sda1):   flags      = 0x8
> [   30.317525] XFS (sda1):   niovecs    = 3
> [   30.317525] XFS (sda1):   size       = 304
> [   30.317526] XFS (sda1):   bytes      = 160
> [   30.317526] XFS (sda1):   buf len    = 160
> [   30.317527] XFS (sda1):   iovec[0]
> [   30.317527] XFS (sda1):     type     = 0x5
> [   30.317528] XFS (sda1):     len      = 56
> [   30.317528] XFS (sda1):     first 32 bytes of iovec[0]:
> [   30.317529] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
> [   30.317530] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   30.317531] XFS (sda1):   iovec[1]
> [   30.317531] XFS (sda1):     type     = 0x6
> [   30.317531] XFS (sda1):     len      = 96
> [   30.317532] XFS (sda1):     first 32 bytes of iovec[1]:
> [   30.317533] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
> [   30.317533] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> [   30.317533] XFS (sda1):   iovec[2]
> [   30.317534] XFS (sda1):     type     = 0x9
> [   30.317534] XFS (sda1):     len      = 8
> [   30.317535] XFS (sda1):     first 8 bytes of iovec[2]:
> [   30.317535] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..
> 
> [   30.317536] XFS (sda1): XXX new log item:
> [   30.317537] XFS (sda1): log item:
> [   30.317537] XFS (sda1):   type       = 0x123b
> [   30.317538] XFS (sda1):   flags      = 0x8
> [   30.317539] XFS (sda1):   niovecs    = 2
> [   30.317539] XFS (sda1):   size       = 304
> [   30.317540] XFS (sda1):   bytes      = 152
> [   30.317540] XFS (sda1):   buf len    = 152
> [   30.317541] XFS (sda1):   iovec[0]
> [   30.317541] XFS (sda1):     type     = 0x5
> [   30.317542] XFS (sda1):     len      = 56
> [   30.317542] XFS (sda1):     first 32 bytes of iovec[0]:
> [   30.317543] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
> [   30.317543] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
> [   30.317544] XFS (sda1):   iovec[1]
> [   30.317544] XFS (sda1):     type     = 0x6
> [   30.317545] XFS (sda1):     len      = 96
> [   30.317545] XFS (sda1):     first 32 bytes of iovec[1]:
> [   30.317546] 00000000: 4e 49 00 00 02 02 00 00 00 00 00 00 00 00 00 00  NI..............
> [   30.317546] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................

And maybe that's an rmdir() triggering an unlink on the same - now
empty - directory.

Does your workload also do 'rm -rf' anywhere? Because I can't explain
how a concurrent 'cp -a' workload would trigger this behaviour...

> - XFS_LI_INODE.XLOG_REG_TYPE_IEXT removed - niovecs from 3 to 2
> 
> [   37.983756] XFS (sda1): XXX required buf size 168 -> 152
> [   37.990253] XFS (sda1): XXX niovecs           3 -> 2
> 
> [   37.996202] XFS (sda1): XXX old log item:
> [   38.001061] XFS (sda1): log item:
> [   38.005239] XFS (sda1):   type       = 0x123b
> [   38.009885] XFS (sda1):   flags      = 0x9
> [   38.014330] XFS (sda1):   niovecs    = 3
> [   38.018764] XFS (sda1):   size       = 440
> [   38.023100] XFS (sda1):   bytes      = 168
> [   38.027533] XFS (sda1):   buf len    = 168
> [   38.032157] XFS (sda1):   iovec[0]
> [   38.036286] XFS (sda1):     type     = 0x5
> [   38.040796] XFS (sda1):     len      = 56
> [   38.045114] XFS (sda1):     first 32 bytes of iovec[0]:
> [   38.051277] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
> [   38.060562] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> [   38.069853] XFS (sda1):   iovec[1]
> [   38.073989] XFS (sda1):     type     = 0x6
> [   38.078525] XFS (sda1):     len      = 96
> [   38.082871] XFS (sda1):     first 32 bytes of iovec[1]:
> [   38.089052] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
> [   38.098331] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.
> [   38.107611] XFS (sda1):   iovec[2]
> [   38.111754] XFS (sda1):     type     = 0x7                                                # XLOG_REG_TYPE_IEXT
> [   38.116285] XFS (sda1):     len      = 16
> [   38.120608] XFS (sda1):     first 16 bytes of iovec[2]:
> [   38.126770] 00000000: 00 00 00 00 00 00 00 00 00 00 00 11 31 80 00 01  ............1...
> 
> [   38.136054] XFS (sda1): XXX new log item:
> [   38.140878] XFS (sda1): log item:
> [   38.145025] XFS (sda1):   type       = 0x123b
> [   38.149645] XFS (sda1):   flags      = 0x9
> [   38.154067] XFS (sda1):   niovecs    = 2
> [   38.158490] XFS (sda1):   size       = 440
> [   38.162799] XFS (sda1):   bytes      = 152
> [   38.167202] XFS (sda1):   buf len    = 152
> [   38.171801] XFS (sda1):   iovec[0]
> [   38.175911] XFS (sda1):     type     = 0x5
> [   38.180409] XFS (sda1):     len      = 56
> [   38.184708] XFS (sda1):     first 32 bytes of iovec[0]:
> [   38.190852] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
> [   38.200115] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> [   38.209373] XFS (sda1):   iovec[1]
> [   38.213488] XFS (sda1):     type     = 0x6
> [   38.217990] XFS (sda1):     len      = 96
> [   38.222297] XFS (sda1):     first 32 bytes of iovec[1]:
> [   38.228442] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
> [   38.237707] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.

truncate() or directory entry removal resulting in block to
shortform conversion, maybe? Implies 'rm -rf', too.

> 
> - XFS_LI_BUF.XLOG_REG_TYPE_BCHUNK removed - niovecs from 3 to 2
> 
> [ 2120.030590] XFS (md0): XXX required buf size 3992 -> 4120
> [ 2120.037257] XFS (md0): XXX niovecs           3 -> 2

Logged buffer size went up because a hole was filled. Not a
candidate. Was a directory buffer.

> 
> [ 2279.729095] XFS (sda1): XXX required buf size 152 -> 24
> [ 2279.735437] XFS (sda1): XXX niovecs           2 -> 1
> 
> [ 2279.741360] XFS (sda1): XXX old log item:
> [ 2279.746199] XFS (sda1): log item:
> [ 2279.750512] XFS (sda1):   type       = 0x123c
> [ 2279.755181] XFS (sda1):   flags      = 0x8
> [ 2279.759644] XFS (sda1):   niovecs    = 2
> [ 2279.764246] XFS (sda1):   size       = 256
> [ 2279.768625] XFS (sda1):   bytes      = 152
> [ 2279.773094] XFS (sda1):   buf len    = 152
> [ 2279.777741] XFS (sda1):   iovec[0]
> [ 2279.782044] XFS (sda1):     type     = 0x1
> [ 2279.786607] XFS (sda1):     len      = 24
> [ 2279.790980] XFS (sda1):     first 24 bytes of iovec[0]:
> [ 2279.797290] 00000000: 3c 12 02 00 00 20 08 00 b0 41 f8 08 00 00 00 00  <.... ...A......
> [ 2279.806730] 00000010: 01 00 00 00 01 00 00 00                          ........
> [ 2279.815287] XFS (sda1):   iovec[1]
> [ 2279.819604] XFS (sda1):     type     = 0x2
> [ 2279.824408] XFS (sda1):     len      = 128
> [ 2279.828888] XFS (sda1):     first 32 bytes of iovec[1]:
> [ 2279.835126] 00000000: 42 4d 41 50 00 00 00 09 ff ff ff ff ff ff ff ff  BMAP............
> [ 2279.844617] 00000010: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00  ................

Hold on - BMAP?

But the directory buffer above was an XDB3, and so a matching BMAP
btree buffer should have a BMA3 signature. So this is output from
two different filesystems, one formatted with crc=0, the other with
crc=1....

Oh, wait, now that I look at it, all the inodes in the output are v2
inodes. Yet your testing is supposed to be running on CRC+reflink
enabled filesystems, which use v3 inodes. So none of these traces
come from the filesystem under your 'cp -a' workload, right?

Fmeh, it's right there - "XFS (sda1)...". These are all traces from
your root directory, not md0/md1 which are your RAID6 devices
running the 'cp -a' workload. The only trace from md0 was the hole
filling buffer which grew the size of the log vector.

Ok, can you please filter the output to limit it to just the RAID6
filesystem under test so we get the transaction dumps just before it
hangs? If you want to, create a separate dump output from the root
device to capture the things it is doing, but the first thing is to
isolate the 'cp -a' hang vector...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-02 22:44                       ` Dave Chinner
@ 2021-01-03 16:03                         ` Donald Buczek
  2021-01-07 22:19                           ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Donald Buczek @ 2021-01-03 16:03 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, Brian Foster, Linux Kernel Mailing List, it+linux-xfs

On 02.01.21 23:44, Dave Chinner wrote:
> On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
>> On 31.12.20 22:59, Dave Chinner wrote:
>>> On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
>>>> On 30.12.20 23:16, Dave Chinner wrote:
> 
>>> One could argue that, but one should also understand what the
>>> design constraints for a particular algorithm are before suggesting
>>> that their solution is "robust". :)
>>
>> Yes, but an understanding to the extent required by the
>> argument should be sufficient :-)
> 
> And that's the fundamental conceit described by Dunning-Kruger.
> 
> I.e. someone thinks they know enough to understand the argument,
> when in fact they don't understand enough about the subject matter
> to realise they don't understand what the expert at the other end is
> saying at all....
> 
>>>> # seq 29
>>>>
>>>> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
>>>
>>> So, at 20:08:15 we get a push trigger and the work is queued. But...
>>>
>>> .....
>>>> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
>>>
>>> It takes the best part of *50 seconds* before the push work actually
>>> runs?
>>>
>>> That's .... well and truly screwed up - the work should run on that
>>> CPU on the very next time it yields the CPU. If we're holding the
>>> CPU without yielding it for that long, hangcheck and RCU warnings
>>> should be going off...
>>
>> No such warnings.
>>
>> But the load is probably I/O bound to the log:
>>
>> - creates `cp -a` copies of a directory tree with small files (linux source repository)
>> - source tree probably completely cached.
>> - two copies in parallel
>> - slow log (on software raid6)
>>
>> Isn't it to be expected that sooner or later you need to wait for
>> log writes when you write as fast as possible with lots of
>> metadata updates and not so much data?
> 
> No, incoming transactions waits for transaction reservation space,
> not log writes. Reservation space is freed up by metadata writeback.
> So if you have new transactions blocking in xfs_trans_reserve(),
> then we are blocking on metadata writeback.
> 
> The CIL worker thread may be waiting for log IO completion before it
> issues more log IO, and in that case it is waiting on iclog buffer
> space (i.e. only waiting on internal log state, not metadata
> writeback).  Can you find that kworker thread stack and paste it? If
> bound on log IO, it will be blocked in xlog_get_iclog_space().
> 
> Please paste the entire stack output, too, not just the bits you
> think are relevant or understand....

That would be a kworker on the "xfs-cil/%s" workqueue, correct?
And you mean before the lockup, when the I/O is still active, correct?

This is the usual stack output from that thread:

# # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
# cat /proc/2080/task/2080/stack

[<0>] md_flush_request+0x87/0x190
[<0>] raid5_make_request+0x61b/0xb30
[<0>] md_handle_request+0x127/0x1a0
[<0>] md_submit_bio+0xbd/0x100
[<0>] submit_bio_noacct+0x151/0x410
[<0>] submit_bio+0x4b/0x1a0
[<0>] xlog_state_release_iclog+0x87/0xb0
[<0>] xlog_write+0x452/0x6d0
[<0>] xlog_cil_push_work+0x2e0/0x4d0
[<0>] process_one_work+0x1dd/0x3e0
[<0>] worker_thread+0x23f/0x3b0
[<0>] kthread+0x118/0x130
[<0>] ret_from_fork+0x22/0x30

sampled three times with a few seconds in between, stack identical.
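(Repeated sampling like this can be scripted instead of done by hand. A rough sketch, assuming a Linux /proc layout and root privileges to read kernel stacks; the "xfs-cil" comm substring is taken from the kworker name shown above:)

```python
import glob

def xfs_worker_tasks(pattern="xfs-cil"):
    """Find kernel threads whose comm mentions the given workqueue name,
    e.g. 'kworker/u82:3+xfs-cil/md0'."""
    tasks = []
    for comm_path in glob.glob("/proc/[0-9]*/comm"):
        try:
            with open(comm_path) as f:
                comm = f.read().strip()
        except OSError:
            continue  # task exited while we were scanning
        if pattern in comm:
            tasks.append((comm_path.split("/")[2], comm))
    return tasks

def dump_stacks(tasks):
    """Print /proc/<pid>/stack for each (pid, comm) pair; needs root."""
    for pid, comm in tasks:
        try:
            with open(f"/proc/{pid}/stack") as f:
                print(f"--- {pid} ({comm}) ---\n{f.read()}")
        except OSError:
            pass  # task gone or insufficient privileges

dump_stacks(xfs_worker_tasks())
```

Run a few times with a sleep in between to see whether the stack stays identical, as was done manually above.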

> Also, cp -a of a linux source tree is just as data intensive as it
> is metadata intensive. There's probably more data IO than metadata
> IO, so that's more likely to be what is slowing the disks down as
> metadata writeback is...
> 
>> I'm a bit concerned, though, that there seem to be a rather
>> unlimited (~ 1000) number of kernel worker threads waiting for the
>> cil push and indirectly for log writes.
> 
> That's normal - XFS is highly asynchronous and defers a lot of work
> to completion threads.
> 
>>> So it dropped by 16 bytes (seems to be common) which is
>>> unexpected.  I wonder if it filled a hole in a buffer and so
>>> needed one less xlog_op_header()? But then the size would have
>>> gone up by at least 128 bytes for the hole that was filled, so
>>> it still shouldn't go down in size.
>>>
>>> I think you need to instrument xlog_cil_insert_items() and catch
>>> a negative length here:
>>>
>>> 	/* account for space used by new iovec headers  */
>>> 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len +=
>>> 	iovhdr_res; ctx->nvecs += diff_iovecs;
>>>
>>> (diff_iovecs will be negative if the number of xlog_op_header
>>> structures goes down)
>>>
>>> And if this happens, then dump the transaction ticket via
>>> xlog_print_trans(tp) so we can see all the log items types and
>>> vectors that the transaction has formatted...
>>
>> I tried that, but the output was difficult to understand, because
>> at that point you can only log the complete transaction with the
>> items already updated.  And a shrinking item is not switched to the
>> shadow vector, so the formatted content is already overwritten and
>> not available for analysis.
> 
> Yes, that's exactly the information I need to see.
> 
> But the fact you think this is something I don't need to know about
> is classic Dunning-Kruger in action. You don't understand why I
> asked for this information, and found the information "difficult to
> understand", so you decided I didn't need it either, despite the
> fact I asked explicitly for it.
> 
> What I first need to know is what operations are being performed by
> the transaction that shrunk and what all the items in it are, not
> which specific items shrunk and by how much. There can be tens to
> hundreds of items in a single transaction, and it's the combination
> of the transaction state, the reservation, the amount of the
> reservation that has been consumed, what items are consuming that
> reservation, etc. that I need to first see and analyse.
> 
> I don't ask for specific information just for fun - I ask for
> specific information because it is *necessary*. If you want the
> problem triaged and fixed, then please supply the information you
> are asked for, even if you don't understand why or what it means.

I see. I hope, you can make use of the following.

Attempt 1 (base: Linux v5.10.1 + smartpqi driver 2.1.6-005)

--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -405,6 +405,11 @@ xlog_cil_insert_items(
  
         spin_lock(&cil->xc_cil_lock);
  
+       if (diff_iovecs < 0 && strcmp(tp->t_mountp->m_super->s_id, "md0") == 0) {
+               xfs_warn (tp->t_mountp, "XXX diff_iovecs %d", diff_iovecs);
+               xlog_print_trans(tp);
+       }
+
         /* account for space used by new iovec headers  */
         iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
         len += iovhdr_res;

Result: No log output before the filesystem locked up after 15 iterations of the copy loop.

Attempt 2 (base: previous patch):

--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -405,8 +405,8 @@ xlog_cil_insert_items(
  
         spin_lock(&cil->xc_cil_lock);
  
-       if (diff_iovecs < 0 && strcmp(tp->t_mountp->m_super->s_id, "md0") == 0) {
-               xfs_warn (tp->t_mountp, "XXX diff_iovecs %d", diff_iovecs);
+       if ((diff_iovecs < 0 || len < 0) && strcmp(tp->t_mountp->m_super->s_id, "md0") == 0) {
+               xfs_warn (tp->t_mountp, "XXX diff_iovecs %d diff_len %d", diff_iovecs, len);
                 xlog_print_trans(tp);
         }

Result: 594 rather similar events logged before the filesystem locked up after 29 iterations of the copy loop.

The third iovec sometimes has len=48 instead of len=16.

Last two events before the lockup:

2021-01-03T14:48:43.098887+01:00 deadbird kernel: [ 3194.831260] XFS (md0): XXX diff_iovecs 0 diff_len -16
2021-01-03T14:48:43.109363+01:00 deadbird kernel: [ 3194.837364] XFS (md0): transaction summary:
2021-01-03T14:48:43.109367+01:00 deadbird kernel: [ 3194.842544] XFS (md0):   log res   = 212728
2021-01-03T14:48:43.118996+01:00 deadbird kernel: [ 3194.847617] XFS (md0):   log count = 8
2021-01-03T14:48:43.119010+01:00 deadbird kernel: [ 3194.852193] XFS (md0):   flags     = 0x25
2021-01-03T14:48:43.129701+01:00 deadbird kernel: [ 3194.857156] XFS (md0): ticket reservation summary:
2021-01-03T14:48:43.129714+01:00 deadbird kernel: [ 3194.862890] XFS (md0):   unit res    = 225140 bytes
2021-01-03T14:48:43.135515+01:00 deadbird kernel: [ 3194.868710] XFS (md0):   current res = 225140 bytes
2021-01-03T14:48:43.148684+01:00 deadbird kernel: [ 3194.874702] XFS (md0):   total reg   = 0 bytes (o/flow = 0 bytes)
2021-01-03T14:48:43.148700+01:00 deadbird kernel: [ 3194.881885] XFS (md0):   ophdrs      = 0 (ophdr space = 0 bytes)
2021-01-03T14:48:43.155781+01:00 deadbird kernel: [ 3194.888975] XFS (md0):   ophdr + reg = 0 bytes
2021-01-03T14:48:43.161210+01:00 deadbird kernel: [ 3194.894404] XFS (md0):   num regions = 0
2021-01-03T14:48:43.165948+01:00 deadbird kernel: [ 3194.899141] XFS (md0): log item:
2021-01-03T14:48:43.169996+01:00 deadbird kernel: [ 3194.903203] XFS (md0):   type      = 0x123b
2021-01-03T14:48:43.174544+01:00 deadbird kernel: [ 3194.907741] XFS (md0):   flags     = 0x8
2021-01-03T14:48:43.178996+01:00 deadbird kernel: [ 3194.912191] XFS (md0):   niovecs   = 3
2021-01-03T14:48:43.183347+01:00 deadbird kernel: [ 3194.916546] XFS (md0):   size      = 696
2021-01-03T14:48:43.187598+01:00 deadbird kernel: [ 3194.920791] XFS (md0):   bytes     = 248
2021-01-03T14:48:43.191932+01:00 deadbird kernel: [ 3194.925131] XFS (md0):   buf len   = 248
2021-01-03T14:48:43.196581+01:00 deadbird kernel: [ 3194.929776] XFS (md0):   iovec[0]
2021-01-03T14:48:43.200633+01:00 deadbird kernel: [ 3194.933832] XFS (md0):     type    = 0x5
2021-01-03T14:48:43.205069+01:00 deadbird kernel: [ 3194.938264] XFS (md0):     len     = 56
2021-01-03T14:48:43.209293+01:00 deadbird kernel: [ 3194.942497] XFS (md0):     first 32 bytes of iovec[0]:
2021-01-03T14:48:43.215494+01:00 deadbird kernel: [ 3194.948690] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
2021-01-03T14:48:43.224801+01:00 deadbird kernel: [ 3194.957997] 00000010: 35 e3 ba 80 2e 00 00 00 00 00 00 00 00 00 00 00  5...............
2021-01-03T14:48:43.234100+01:00 deadbird kernel: [ 3194.967297] XFS (md0):   iovec[1]
2021-01-03T14:48:43.238293+01:00 deadbird kernel: [ 3194.971484] XFS (md0):     type    = 0x6
2021-01-03T14:48:43.242744+01:00 deadbird kernel: [ 3194.975942] XFS (md0):     len     = 176
2021-01-03T14:48:43.247108+01:00 deadbird kernel: [ 3194.980295] XFS (md0):     first 32 bytes of iovec[1]:
2021-01-03T14:48:43.253304+01:00 deadbird kernel: [ 3194.986506] 00000000: 4e 49 b0 81 03 02 00 00 7d 00 00 00 7d 00 00 00  NI......}...}...
2021-01-03T14:48:43.262648+01:00 deadbird kernel: [ 3194.995835] 00000010: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
2021-01-03T14:48:43.272064+01:00 deadbird kernel: [ 3195.005267] XFS (md0):   iovec[2]
2021-01-03T14:48:43.276174+01:00 deadbird kernel: [ 3195.009366] XFS (md0):     type    = 0x7
2021-01-03T14:48:43.280649+01:00 deadbird kernel: [ 3195.013848] XFS (md0):     len     = 16
2021-01-03T14:48:43.285042+01:00 deadbird kernel: [ 3195.018243] XFS (md0):     first 16 bytes of iovec[2]:
2021-01-03T14:48:43.291166+01:00 deadbird kernel: [ 3195.024364] 00000000: 00 00 00 00 00 00 00 00 00 ba 02 ef e0 01 23 ef  ..............#.

2021-01-03T14:48:43.300880+01:00 deadbird kernel: [ 3195.033706] XFS (md0): XXX diff_iovecs 0 diff_len -16
2021-01-03T14:48:43.311716+01:00 deadbird kernel: [ 3195.039849] XFS (md0): transaction summary:
2021-01-03T14:48:43.311732+01:00 deadbird kernel: [ 3195.044919] XFS (md0):   log res   = 212728
2021-01-03T14:48:43.316784+01:00 deadbird kernel: [ 3195.049982] XFS (md0):   log count = 8
2021-01-03T14:48:43.321455+01:00 deadbird kernel: [ 3195.054663] XFS (md0):   flags     = 0x25
2021-01-03T14:48:43.326316+01:00 deadbird kernel: [ 3195.059519] XFS (md0): ticket reservation summary:
2021-01-03T14:48:43.332050+01:00 deadbird kernel: [ 3195.065247] XFS (md0):   unit res    = 225140 bytes
2021-01-03T14:48:43.337873+01:00 deadbird kernel: [ 3195.071065] XFS (md0):   current res = 225140 bytes
2021-01-03T14:48:43.343762+01:00 deadbird kernel: [ 3195.076967] XFS (md0):   total reg   = 0 bytes (o/flow = 0 bytes)
2021-01-03T14:48:43.350951+01:00 deadbird kernel: [ 3195.084148] XFS (md0):   ophdrs      = 0 (ophdr space = 0 bytes)
2021-01-03T14:48:43.358036+01:00 deadbird kernel: [ 3195.091238] XFS (md0):   ophdr + reg = 0 bytes
2021-01-03T14:48:43.363476+01:00 deadbird kernel: [ 3195.096671] XFS (md0):   num regions = 0
2021-01-03T14:48:43.368213+01:00 deadbird kernel: [ 3195.101416] XFS (md0): log item:
2021-01-03T14:48:43.372263+01:00 deadbird kernel: [ 3195.105465] XFS (md0):   type      = 0x123b
2021-01-03T14:48:43.376895+01:00 deadbird kernel: [ 3195.110101] XFS (md0):   flags     = 0x8
2021-01-03T14:48:43.381240+01:00 deadbird kernel: [ 3195.114443] XFS (md0):   niovecs   = 3
2021-01-03T14:48:43.385587+01:00 deadbird kernel: [ 3195.118788] XFS (md0):   size      = 696
2021-01-03T14:48:43.389830+01:00 deadbird kernel: [ 3195.123026] XFS (md0):   bytes     = 248
2021-01-03T14:48:43.394355+01:00 deadbird kernel: [ 3195.127469] XFS (md0):   buf len   = 248
2021-01-03T14:48:43.398909+01:00 deadbird kernel: [ 3195.132100] XFS (md0):   iovec[0]
2021-01-03T14:48:43.402967+01:00 deadbird kernel: [ 3195.136147] XFS (md0):     type    = 0x5
2021-01-03T14:48:43.407367+01:00 deadbird kernel: [ 3195.140574] XFS (md0):     len     = 56
2021-01-03T14:48:43.411704+01:00 deadbird kernel: [ 3195.144903] XFS (md0):     first 32 bytes of iovec[0]:
2021-01-03T14:48:43.417812+01:00 deadbird kernel: [ 3195.151002] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
2021-01-03T14:48:43.427114+01:00 deadbird kernel: [ 3195.160302] 00000010: 57 b1 d6 80 2b 00 00 00 00 00 00 00 00 00 00 00  W...+...........
2021-01-03T14:48:43.440572+01:00 deadbird kernel: [ 3195.169704] XFS (md0):   iovec[1]
2021-01-03T14:48:43.440581+01:00 deadbird kernel: [ 3195.173774] XFS (md0):     type    = 0x6
2021-01-03T14:48:43.449478+01:00 deadbird kernel: [ 3195.178225] XFS (md0):     len     = 176
2021-01-03T14:48:43.449481+01:00 deadbird kernel: [ 3195.182687] XFS (md0):     first 32 bytes of iovec[1]:
2021-01-03T14:48:43.464908+01:00 deadbird kernel: [ 3195.188797] 00000000: 4e 49 20 81 03 02 00 00 7d 00 00 00 7d 00 00 00  NI .....}...}...
2021-01-03T14:48:43.464910+01:00 deadbird kernel: [ 3195.198116] 00000010: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
2021-01-03T14:48:43.478436+01:00 deadbird kernel: [ 3195.207555] XFS (md0):   iovec[2]
2021-01-03T14:48:43.478438+01:00 deadbird kernel: [ 3195.211645] XFS (md0):     type    = 0x7
2021-01-03T14:48:43.487300+01:00 deadbird kernel: [ 3195.216229] XFS (md0):     len     = 16
2021-01-03T14:48:43.487303+01:00 deadbird kernel: [ 3195.220509] XFS (md0):     first 16 bytes of iovec[2]:
2021-01-03T14:48:43.502751+01:00 deadbird kernel: [ 3195.226631] 00000000: 00 00 00 00 00 00 00 00 00 ae 04 29 00 00 da 24  ...........)...$

Full output available at https://owww.molgen.mpg.de/~buczek/2021-01-03_13-09/log.txt (2.2M)
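(When sifting through a dump like the one linked above, it helps to turn the "first N bytes of iovec" hexdump lines back into byte strings and diff the old and new items mechanically. A small sketch; the line format — 8-digit offset, hex byte pairs, ASCII gutter, optional dmesg timestamp prefix — is assumed from the dumps quoted in this thread:)

```python
import re

# One hex-dump line: 8-digit offset, ": ", then up to 16 hex byte pairs.
# re.search (not match) so raw dmesg lines with "[ time]" prefixes work too.
HEX_LINE = re.compile(r'([0-9a-f]{8}): ((?:[0-9a-f]{2} ?)+)')

def parse_hexdump(lines):
    """Reassemble 'first N bytes of iovec' dump lines into a byte string."""
    data = bytearray()
    for line in lines:
        m = HEX_LINE.search(line)
        if m:
            data += bytes.fromhex(m.group(2).replace(' ', ''))
    return bytes(data)

# iovec[0] of the first old/new pair quoted earlier in this thread:
old = parse_hexdump([
    "[   29.680079] 00000000: 3b 12 03 00 03 00 00 00 00 00 20 00 00 00 00 00  ;......... .....",
    "[   29.689363] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............",
])
new = parse_hexdump([
    "[   29.828754] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............",
    "[   29.838024] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............",
])
print([i for i, (a, b) in enumerate(zip(old, new)) if a != b])  # byte offsets that changed
```

For the pair above this prints `[10]`, i.e. the two iovec[0] headers differ in a single byte; the helper only locates the change, interpreting what that byte encodes in the inode log format is a separate step.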

>> So I've added some code to dump an item in its old and its new
>> state in xlog_cil_insert_format_items when either the requested
>> buffer len or the number of vectors decreased.
> 
> Yes, that would be useful when trying to understand the low level
> details of where a specific operation might be doing something
> unexpected, but I don't have a high level picture of what operations
> are triggering this issue so have no clue about the context in which
> these isolated, context-free changes are occurring.
> 
> That said ....
> 
>> Several examples of different kind with a little bit of manual annotation following
>>
>> Best
>>    Donald
>>
>> - XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 32 to 20 bytes
>>
>> [   29.606212] XFS (sda1): XXX required buf size 184 -> 172
>> [   29.612591] XFS (sda1): XXX niovecs           3 -> 3
>>
>> [   29.618570] XFS (sda1): XXX old log item:
>> [   29.623469] XFS (sda1): log item:
>> [   29.627683] XFS (sda1):   type       = 0x123b                                               # XFS_LI_INODE
>> [   29.632375] XFS (sda1):   flags      = 0x8
>> [   29.636858] XFS (sda1):   niovecs    = 3
>> [   29.647442] XFS (sda1):   size       = 312
>> [   29.651814] XFS (sda1):   bytes      = 184
>> [   29.656278] XFS (sda1):   buf len    = 184
>> [   29.660927] XFS (sda1):   iovec[0]
>> [   29.665071] XFS (sda1):     type     = 0x5                                                  # XLOG_REG_TYPE_IFORMAT
>> [   29.669592] XFS (sda1):     len      = 56
>> [   29.673914] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   29.680079] 00000000: 3b 12 03 00 03 00 00 00 00 00 20 00 00 00 00 00  ;......... .....
>> [   29.689363] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   29.698633] XFS (sda1):   iovec[1]
>> [   29.702756] XFS (sda1):     type     = 0x6                                                  # XLOG_REG_TYPE_ICORE
>> [   29.707263] XFS (sda1):     len      = 96
>> [   29.711571] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   29.717720] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   29.726986] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   29.736241] XFS (sda1):   iovec[2]
>> [   29.740364] XFS (sda1):     type     = 0x9                                                  # XLOG_REG_TYPE_ILOCAL
>> [   29.744873] XFS (sda1):     len      = 32
>> [   29.749184] XFS (sda1):     first 32 bytes of iovec[2]:
>> [   29.755336] 00000000: 02 00 30 e7 02 26 06 00 40 73 65 72 76 65 72 00  ..0..&..@server.
>> [   29.764612] 00000010: 08 92 ef 04 00 58 6d 78 36 34 00 d3 93 58 00 58  .....Xmx64...X.X
>>
>> [   29.773900] XFS (sda1): XXX new log item:
>> [   29.778718] XFS (sda1): log item:
>> [   29.782856] XFS (sda1):   type       = 0x123b
>> [   29.787478] XFS (sda1):   flags      = 0x8
>> [   29.791902] XFS (sda1):   niovecs    = 3
>> [   29.796321] XFS (sda1):   size       = 312
>> [   29.800640] XFS (sda1):   bytes      = 172
>> [   29.805052] XFS (sda1):   buf len    = 176
>> [   29.809659] XFS (sda1):   iovec[0]
>> [   29.813781] XFS (sda1):     type     = 0x5
>> [   29.818289] XFS (sda1):     len      = 56
>> [   29.822599] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   29.828754] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
>> [   29.838024] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   29.847292] XFS (sda1):   iovec[1]
>> [   29.851420] XFS (sda1):     type     = 0x6
>> [   29.855933] XFS (sda1):     len      = 96
>> [   29.860247] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   29.866406] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   29.875677] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   29.884949] XFS (sda1):   iovec[2]
>> [   29.889081] XFS (sda1):     type     = 0x9
>> [   29.893601] XFS (sda1):     len      = 20
>> [   29.897924] XFS (sda1):     first 20 bytes of iovec[2]:
>> [   29.904096] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
>> [   29.913381] 00000010: 58 92 ef 04                                      X...
> 
> That may be removal of a directory entry from a local form
> directory. Hmmm - local form inode forks only log the active byte
> range so that could be the issue here.
> 
> Hold on - where did this come from? "cp -a" doesn't result in file
> and directory removals, right?

Correct. The part of the mail above, where I described the load, was referring to your question "It takes the best part of *50 seconds* before the push work actually runs?". So that description related to the specific test from which the quoted log lines were taken.

The second part of the email was just about the question whether it can be demonstrated that items in the CIL shrink. This was the follow-up to "Do you have tracing showing the operation where the relogged item has actually gotten smaller?" and "But, of course, if the size is not expected to decrease, this discrepancy must be clarified anyway. I am sure, I can find out the exact circumstances, this does happen. I will follow up on this."

As soon as I added that logging, I got events from the system disk ("sda1") during boot activity. This already seemed to answer the question whether log items shrink (rightfully or wrongfully), so I didn't even start the described workload on the md devices. I should have made this distinction clear; I apologize that you had to spend time figuring it out.

>> - (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL from 20 to 8 bytes
>>
>> [   29.982907] XFS (sda1): XXX required buf size 172 -> 160
>> [   29.992716] XFS (sda1): XXX niovecs           3 -> 3
>>
>> [   29.998728] XFS (sda1): XXX old log item:
>> [   30.003624] XFS (sda1): log item:
>> [   30.007835] XFS (sda1):   type       = 0x123b
>> [   30.012654] XFS (sda1):   flags      = 0x8
>> [   30.017145] XFS (sda1):   niovecs    = 3
>> [   30.021638] XFS (sda1):   size       = 312
>> [   30.026012] XFS (sda1):   bytes      = 172
>> [   30.030610] XFS (sda1):   buf len    = 176
>> [   30.035292] XFS (sda1):   iovec[0]
>> [   30.039480] XFS (sda1):     type     = 0x5
>> [   30.044054] XFS (sda1):     len      = 56
>> [   30.048534] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   30.054749] 00000000: 3b 12 03 00 03 00 00 00 00 00 14 00 00 00 00 00  ;...............
>> [   30.064091] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   30.073545] XFS (sda1):   iovec[1]
>> [   30.077744] XFS (sda1):     type     = 0x6
>> [   30.082455] XFS (sda1):     len      = 96
>> [   30.086824] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   30.093025] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   30.102346] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   30.111801] XFS (sda1):   iovec[2]
>> [   30.115989] XFS (sda1):     type     = 0x9
>> [   30.120715] XFS (sda1):     len      = 20
>> [   30.125102] XFS (sda1):     first 20 bytes of iovec[2]:
>> [   30.131331] 00000000: 01 00 30 e7 02 26 04 00 58 6d 78 36 34 00 d3 93  ..0..&..Xmx64...
>> [   30.140808] 00000010: 58 92 ef 04                                      X...
>>
>> [   30.149006] XFS (sda1): XXX new log item:
>> [   30.154039] XFS (sda1): log item:
>> [   30.154039] XFS (sda1):   type       = 0x123b
>> [   30.154041] XFS (sda1):   flags      = 0x8
>> [   30.167436] XFS (sda1):   niovecs    = 3
>> [   30.167437] XFS (sda1):   size       = 312
>> [   30.167438] XFS (sda1):   bytes      = 160
>> [   30.180881] XFS (sda1):   buf len    = 160
>> [   30.180882] XFS (sda1):   iovec[0]
>> [   30.180882] XFS (sda1):     type     = 0x5
>> [   30.180883] XFS (sda1):     len      = 56
>> [   30.180884] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   30.180884] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
>> [   30.180885] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   30.180886] XFS (sda1):   iovec[1]
>> [   30.180886] XFS (sda1):     type     = 0x6
>> [   30.180887] XFS (sda1):     len      = 96
>> [   30.180887] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   30.180888] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   30.180889] 00000010: 02 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   30.180889] XFS (sda1):   iovec[2]
>> [   30.180890] XFS (sda1):     type     = 0x9
>> [   30.180890] XFS (sda1):     len      = 8
>> [   30.180890] XFS (sda1):     first 8 bytes of iovec[2]:
>> [   30.180891] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..
> 
> And that looks to be another unlink in the same directory inode, now
> empty.
> 
>>
>> - (then) XFS_LI_INODE.XLOG_REG_TYPE_ILOCAL removed - niovecs from 3 to 2
>>
>> [   30.197403] XFS (sda1): XXX required buf size 160 -> 152
>> [   30.296091] XFS (sda1): XXX niovecs           3 -> 2
>>
>> [   30.296092] XFS (sda1): XXX old log item:
>> [   30.296093] XFS (sda1): log item:
>> [   30.297552] ixgbe 0000:01:00.1 net03: renamed from eth3
>> [   30.317524] XFS (sda1):   type       = 0x123b
>> [   30.317524] XFS (sda1):   flags      = 0x8
>> [   30.317525] XFS (sda1):   niovecs    = 3
>> [   30.317525] XFS (sda1):   size       = 304
>> [   30.317526] XFS (sda1):   bytes      = 160
>> [   30.317526] XFS (sda1):   buf len    = 160
>> [   30.317527] XFS (sda1):   iovec[0]
>> [   30.317527] XFS (sda1):     type     = 0x5
>> [   30.317528] XFS (sda1):     len      = 56
>> [   30.317528] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   30.317529] 00000000: 3b 12 03 00 03 00 00 00 00 00 08 00 00 00 00 00  ;...............
>> [   30.317530] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   30.317531] XFS (sda1):   iovec[1]
>> [   30.317531] XFS (sda1):     type     = 0x6
>> [   30.317531] XFS (sda1):     len      = 96
>> [   30.317532] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   30.317533] 00000000: 4e 49 ed 41 02 01 00 00 00 00 00 00 00 00 00 00  NI.A............
>> [   30.317533] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
>> [   30.317533] XFS (sda1):   iovec[2]
>> [   30.317534] XFS (sda1):     type     = 0x9
>> [   30.317534] XFS (sda1):     len      = 8
>> [   30.317535] XFS (sda1):     first 8 bytes of iovec[2]:
>> [   30.317535] 00000000: 00 00 30 e7 02 26 04 00                          ..0..&..
>>
>> [   30.317536] XFS (sda1): XXX new log item:
>> [   30.317537] XFS (sda1): log item:
>> [   30.317537] XFS (sda1):   type       = 0x123b
>> [   30.317538] XFS (sda1):   flags      = 0x8
>> [   30.317539] XFS (sda1):   niovecs    = 2
>> [   30.317539] XFS (sda1):   size       = 304
>> [   30.317540] XFS (sda1):   bytes      = 152
>> [   30.317540] XFS (sda1):   buf len    = 152
>> [   30.317541] XFS (sda1):   iovec[0]
>> [   30.317541] XFS (sda1):     type     = 0x5
>> [   30.317542] XFS (sda1):     len      = 56
>> [   30.317542] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   30.317543] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
>> [   30.317543] 00000010: 37 ab 20 00 00 00 00 00 00 00 00 00 00 00 00 00  7. .............
>> [   30.317544] XFS (sda1):   iovec[1]
>> [   30.317544] XFS (sda1):     type     = 0x6
>> [   30.317545] XFS (sda1):     len      = 96
>> [   30.317545] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   30.317546] 00000000: 4e 49 00 00 02 02 00 00 00 00 00 00 00 00 00 00  NI..............
>> [   30.317546] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 05 00  ................
> 
> And maybe that's an rmdir() triggering an unlink on the same - now
> empty - directory.
> 
> Does your workload also do 'rm -rf' anywhere? Because I can't explain
> how a concurrent 'cp -a' workload would trigger this behaviour...
> 
>> - XFS_LI_INODE.XLOG_REG_TYPE_IEXT removed - niovecs from 3 to 2
>>
>> [   37.983756] XFS (sda1): XXX required buf size 168 -> 152
>> [   37.990253] XFS (sda1): XXX niovecs           3 -> 2
>>
>> [   37.996202] XFS (sda1): XXX old log item:
>> [   38.001061] XFS (sda1): log item:
>> [   38.005239] XFS (sda1):   type       = 0x123b
>> [   38.009885] XFS (sda1):   flags      = 0x9
>> [   38.014330] XFS (sda1):   niovecs    = 3
>> [   38.018764] XFS (sda1):   size       = 440
>> [   38.023100] XFS (sda1):   bytes      = 168
>> [   38.027533] XFS (sda1):   buf len    = 168
>> [   38.032157] XFS (sda1):   iovec[0]
>> [   38.036286] XFS (sda1):     type     = 0x5
>> [   38.040796] XFS (sda1):     len      = 56
>> [   38.045114] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   38.051277] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
>> [   38.060562] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>> [   38.069853] XFS (sda1):   iovec[1]
>> [   38.073989] XFS (sda1):     type     = 0x6
>> [   38.078525] XFS (sda1):     len      = 96
>> [   38.082871] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   38.089052] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
>> [   38.098331] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.
>> [   38.107611] XFS (sda1):   iovec[2]
>> [   38.111754] XFS (sda1):     type     = 0x7                                                # XLOG_REG_TYPE_IEXT
>> [   38.116285] XFS (sda1):     len      = 16
>> [   38.120608] XFS (sda1):     first 16 bytes of iovec[2]:
>> [   38.126770] 00000000: 00 00 00 00 00 00 00 00 00 00 00 11 31 80 00 01  ............1...
>>
>> [   38.136054] XFS (sda1): XXX new log item:
>> [   38.140878] XFS (sda1): log item:
>> [   38.145025] XFS (sda1):   type       = 0x123b
>> [   38.149645] XFS (sda1):   flags      = 0x9
>> [   38.154067] XFS (sda1):   niovecs    = 2
>> [   38.158490] XFS (sda1):   size       = 440
>> [   38.162799] XFS (sda1):   bytes      = 152
>> [   38.167202] XFS (sda1):   buf len    = 152
>> [   38.171801] XFS (sda1):   iovec[0]
>> [   38.175911] XFS (sda1):     type     = 0x5
>> [   38.180409] XFS (sda1):     len      = 56
>> [   38.184708] XFS (sda1):     first 32 bytes of iovec[0]:
>> [   38.190852] 00000000: 3b 12 02 00 01 00 00 00 00 00 00 00 00 00 00 00  ;...............
>> [   38.200115] 00000010: cb 91 08 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>> [   38.209373] XFS (sda1):   iovec[1]
>> [   38.213488] XFS (sda1):     type     = 0x6
>> [   38.217990] XFS (sda1):     len      = 96
>> [   38.222297] XFS (sda1):     first 32 bytes of iovec[1]:
>> [   38.228442] 00000000: 4e 49 a4 81 02 02 00 00 62 00 00 00 62 00 00 00  NI......b...b...
>> [   38.237707] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 42 09  ..............B.
> 
> truncate() or directory entry removal resulting in block to
> shortform conversion, maybe? Implies 'rm -rf', too.
> 
>>
>> - XFS_LI_BUF.XLOG_REG_TYPE_BCHUNK removed - niovecs from 3 to 2
>>
>> [ 2120.030590] XFS (md0): XXX required buf size 3992 -> 4120
>> [ 2120.037257] XFS (md0): XXX niovecs           3 -> 2
> 
> Logged buffer size went up because a hole was filled. Not a
> candidate. Was a directory buffer.
> 
>>
>> [ 2279.729095] XFS (sda1): XXX required buf size 152 -> 24
>> [ 2279.735437] XFS (sda1): XXX niovecs           2 -> 1
>>
>> [ 2279.741360] XFS (sda1): XXX old log item:
>> [ 2279.746199] XFS (sda1): log item:
>> [ 2279.750512] XFS (sda1):   type       = 0x123c
>> [ 2279.755181] XFS (sda1):   flags      = 0x8
>> [ 2279.759644] XFS (sda1):   niovecs    = 2
>> [ 2279.764246] XFS (sda1):   size       = 256
>> [ 2279.768625] XFS (sda1):   bytes      = 152
>> [ 2279.773094] XFS (sda1):   buf len    = 152
>> [ 2279.777741] XFS (sda1):   iovec[0]
>> [ 2279.782044] XFS (sda1):     type     = 0x1
>> [ 2279.786607] XFS (sda1):     len      = 24
>> [ 2279.790980] XFS (sda1):     first 24 bytes of iovec[0]:
>> [ 2279.797290] 00000000: 3c 12 02 00 00 20 08 00 b0 41 f8 08 00 00 00 00  <.... ...A......
>> [ 2279.806730] 00000010: 01 00 00 00 01 00 00 00                          ........
>> [ 2279.815287] XFS (sda1):   iovec[1]
>> [ 2279.819604] XFS (sda1):     type     = 0x2
>> [ 2279.824408] XFS (sda1):     len      = 128
>> [ 2279.828888] XFS (sda1):     first 32 bytes of iovec[1]:
>> [ 2279.835126] 00000000: 42 4d 41 50 00 00 00 09 ff ff ff ff ff ff ff ff  BMAP............
>> [ 2279.844617] 00000010: ff ff ff ff ff ff ff ff 00 00 00 00 00 00 00 00  ................
> 
> Hold on - BMAP?
> 
> But the directory buffer above was a XDB3, and so a matching BMAP
> btree buffer should have a BMA3 signature. So this is output from
> two different filesystems, one formatted with crc=0, the other with
> crc=1....
> 
> Oh, wait, now that I look at it, all the inodes in the output are v2
> inodes. Yet your testing is supposed to be running on CRC+reflink
> enabled filesystems, which use v3 inodes. So none of these traces
> come from the filesystem under your 'cp -a' workload, right?
> 
> Fmeh, it's right there - "XFS (sda1)...". These are all traces from
> your root directory, not md0/md1 which are your RAID6 devices
> running the 'cp -a' workload. The only trace from md0 was the hole
> filling buffer which grew the size of the log vector.
> 
> Ok, can you please filter the output to limit it to just the RAID6
> filesystem under test so we get the transaction dumps just before it
> hangs? If you want to, create a separate dump output from the root
> device to capture the things it is doing, but the first thing is to
> isolate the 'cp -a' hang vector...
Okay, the (transaction) logs for the `cp -a` load to the (empty) raid are above.

Once this is understood, you will probably also know whether other transactions might still cause a problem and are of interest, so I suggest postponing those for now.

Best

   Donald

> Cheers,
> 
> Dave.


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2020-12-30 22:16               ` Dave Chinner
  2020-12-31 11:48                 ` Donald Buczek
@ 2021-01-04 16:23                 ` Brian Foster
  2021-01-07 21:54                   ` Dave Chinner
  1 sibling, 1 reply; 29+ messages in thread
From: Brian Foster @ 2021-01-04 16:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Donald Buczek, linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > Threads, which committed items to the CIL, wait in the xc_push_wait
> > waitqueue when used_space in the push context goes over a limit. These
> > threads need to be woken when the CIL is pushed.
> > 
> > The CIL push worker tries to avoid the overhead of calling wake_all()
> > when there are no waiters waiting. It does so by checking the same
> > condition which caused the waits to happen. This, however, is
> > unreliable, because ctx->space_used can actually decrease when items are
> > recommitted.
> 
> When does this happen?
> 
> Do you have tracing showing the operation where the relogged item
> has actually gotten smaller? By definition, relogging in the CIL
> should only grow the size of the object in the CIL because it must
> relog all the existing changes on top of the new changed being made
> to the object. Hence the CIL reservation should only ever grow.
> 
> IOWs, returning negative lengths from the formatting code is
> unexpected and probably a bug and requires further investigation,
> not papering over the occurrence with broadcast wakeups...
> 

I agree that this warrants a bit more explanation and analysis before
changing the current code...

> > If the value goes below the limit while some threads are
> > already waiting but before the push worker gets to it, these threads are
> > not woken.
> > 
> > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > optimization. This is possible, because we hold the xc_push_lock
> > spinlock, which prevents additions to the waitqueue.
> > 
> > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > ---
> >  fs/xfs/xfs_log_cil.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index b0ef071b3cb5..d620de8e217c 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> >  	/*
> >  	 * Wake up any background push waiters now this context is being pushed.
> >  	 */
> > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > +	if (waitqueue_active(&cil->xc_push_wait))
> >  		wake_up_all(&cil->xc_push_wait);
> 
> That just smells wrong to me. It *might* be correct, but this
> condition should pair with the sleep condition, as space used by a
> CIL context should never actually decrease....
> 

... but I'm a little confused by this assertion. The shadow buffer
allocation code refers to the possibility of shadow buffers falling out
that are smaller than currently allocated buffers. Further, the
_insert_format_items() code appears to explicitly optimize for this
possibility by reusing the active buffer, subtracting the old size/count
values from the diff variables and then reformatting the latest
(presumably smaller) item to the lv.

Of course this could just be implementation detail. I haven't dug into
the details in the remainder of this thread and I don't have specific
examples off the top of my head, but perhaps based on the ability of
various structures to change formats and the ability of log vectors to
shrink in size, shouldn't we expect the possibility of a CIL context to
shrink in size as well? Just from poking around the CIL it seems like
the surrounding code supports it (xlog_cil_insert_items() checks len > 0
for recalculating split res as well)...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-04 16:23                 ` Brian Foster
@ 2021-01-07 21:54                   ` Dave Chinner
  2021-01-08 16:56                     ` Brian Foster
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2021-01-07 21:54 UTC (permalink / raw)
  To: Brian Foster
  Cc: Donald Buczek, linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > If the value goes below the limit while some threads are
> > > already waiting but before the push worker gets to it, these threads are
> > > not woken.
> > > 
> > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > optimization. This is possible, because we hold the xc_push_lock
> > > spinlock, which prevents additions to the waitqueue.
> > > 
> > > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > > ---
> > >  fs/xfs/xfs_log_cil.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index b0ef071b3cb5..d620de8e217c 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > >  	/*
> > >  	 * Wake up any background push waiters now this context is being pushed.
> > >  	 */
> > > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > +	if (waitqueue_active(&cil->xc_push_wait))
> > >  		wake_up_all(&cil->xc_push_wait);
> > 
> > That just smells wrong to me. It *might* be correct, but this
> > condition should pair with the sleep condition, as space used by a
> > CIL context should never actually decrease....
> > 
> 
> ... but I'm a little confused by this assertion. The shadow buffer
> allocation code refers to the possibility of shadow buffers falling out
> that are smaller than currently allocated buffers. Further, the
> _insert_format_items() code appears to explicitly optimize for this
> possibility by reusing the active buffer, subtracting the old size/count
> values from the diff variables and then reformatting the latest
> (presumably smaller) item to the lv.

Individual items might shrink, but the overall transaction should
grow. Think of an extent to btree conversion of an inode fork. The
data in the inode fork decreases from a list of extents to a btree
root block pointer, so the inode item shrinks. But then we add a new
btree root block that contains all the extents + the btree block
header, and it gets rounded up to the 128 byte buffer logging chunk
size.

IOWs, while the inode item has decreased in size, the overall
space consumed by the transaction has gone up and so the CIL ctx
used_space should increase. Hence we can't just look at individual
log items and whether they have decreased in size - we have to look
at all the items in the transaction to understand how the space used
in that transaction has changed. i.e. it's the aggregation of all
items in the transaction that matter here, not so much the
individual items.

> Of course this could just be implementation detail. I haven't dug into
> the details in the remainder of this thread and I don't have specific
> examples off the top of my head, but perhaps based on the ability of
> various structures to change formats and the ability of log vectors to
> shrink in size, shouldn't we expect the possibility of a CIL context to
> shrink in size as well? Just from poking around the CIL it seems like
> the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> for recalculating split res as well)...

Yes, there may be situations where it decreases. It may be this is
fine, but the assumption *I've made* in lots of the CIL push code is
that ctx->used_space rarely, if ever, will go backwards.

e.g. we run the first transaction into the CIL, it steals the space
needed for the CIL checkpoint headers for the transaction. Then if
the space returned by the item formatting is negative (because it is
in the AIL and being relogged), the CIL checkpoint now doesn't have
the space reserved it needs to run a checkpoint. That transaction is
a sync transaction, so it forces the log, and now we push the CIL
without sufficient reservation to write out the log headers and the
items we just formatted....

So, yeah, shrinking transaction space usage definitely violates some
of the assumptions the code makes about how relogging works. It's
entirely possible the assumptions I've made are not entirely correct
in some corner cases - those particular cases are what we need to
ferret out here, and then decide if they are correct or not and deal
with it from there...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-03 16:03                         ` Donald Buczek
@ 2021-01-07 22:19                           ` Dave Chinner
  2021-01-09 14:39                             ` Donald Buczek
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2021-01-07 22:19 UTC (permalink / raw)
  To: Donald Buczek
  Cc: linux-xfs, Brian Foster, Linux Kernel Mailing List, it+linux-xfs

On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote:
> On 02.01.21 23:44, Dave Chinner wrote:
> > On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
> > > On 31.12.20 22:59, Dave Chinner wrote:
> > > > On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
> > > > > On 30.12.20 23:16, Dave Chinner wrote:
> > 
> > > > One could argue that, but one should also understand the design
> > > > constraints for a particular algorithm are before suggesting
> > > > that their solution is "robust". :)
> > > 
> > > Yes, but an understanding to the extent required by the
> > > argument should be sufficient :-)
> > 
> > And that's the fundamental conceit described by Dunning-Kruger.
> > 
> > I.e. someone thinks they know enough to understand the argument,
> > when in fact they don't understand enough about the subject matter
> > to realise they don't understand what the expert at the other end is
> > saying at all....
> > 
> > > > > # seq 29
> > > > > 
> > > > > 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
> > > > 
> > > > So, at 20:08:15 we get a push trigger and the work is queued. But...
> > > > 
> > > > .....
> > > > > 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
> > > > 
> > > > It takes the best part of *50 seconds* before the push work actually
> > > > runs?
> > > > 
> > > > That's .... well and truly screwed up - the work should run on that
> > > > CPU on the very next time it yields the CPU. If we're holding the
> > > > CPU without yielding it for that long, hangcheck and RCU warnings
> > > > should be going off...
> > > 
> > > No such warnings.
> > > 
> > > But the load is probably I/O bound to the log:
> > > 
> > > - creates `cp -a` copies of a directory tree with small files (linux source repository)
> > > - source tree probably completely cached.
> > > - two copies in parallel
> > > - slow log (on software raid6)
> > > 
> > > Isn't it to be expected that sooner or later you need to wait for
> > > log writes when you write as fast as possible with lots of
> > > metadata updates and not so much data?
> > 
> > No, incoming transactions waits for transaction reservation space,
> > not log writes. Reservation space is freed up by metadata writeback.
> > So if you have new transactions blocking in xfs_trans_reserve(),
> > then we are blocking on metadata writeback.
> > 
> > The CIL worker thread may be waiting for log IO completion before it
> > issues more log IO, and in that case it is waiting on iclog buffer
> > space (i.e. only waiting on internal log state, not metadata
> > writeback).  Can you find that kworker thread stack and paste it? If
> > bound on log IO, it will be blocked in xlog_get_iclog_space().
> > 
> > Please paste the entire stack output, too, not just the bits you
> > think are relevant or understand....
> 
> That would be a kworker on the "xfs-cil/%s" workqueue, correct?
> And you mean before the lockup, when the I/O is still active, correct?
> 
> This is the usual stack output from that thread:
> 
> # # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
> # cat /proc/2080/task/2080/stack
> 
> [<0>] md_flush_request+0x87/0x190
> [<0>] raid5_make_request+0x61b/0xb30
> [<0>] md_handle_request+0x127/0x1a0
> [<0>] md_submit_bio+0xbd/0x100
> [<0>] submit_bio_noacct+0x151/0x410
> [<0>] submit_bio+0x4b/0x1a0
> [<0>] xlog_state_release_iclog+0x87/0xb0
> [<0>] xlog_write+0x452/0x6d0
> [<0>] xlog_cil_push_work+0x2e0/0x4d0
> [<0>] process_one_work+0x1dd/0x3e0
> [<0>] worker_thread+0x23f/0x3b0
> [<0>] kthread+0x118/0x130
> [<0>] ret_from_fork+0x22/0x30
> 
> sampled three times with a few seconds in between, stack identical.

Yeah, so that is MD blocking waiting for a running device cache
flush to finish before submitting the iclog IO. IOWs, it's waiting
for hardware to flush volatile caches to stable storage. Every log
IO has a cache flush preceding it, and if the device doesn't support
FUA, then there is a post-IO cache flush as well. These flushes are
necessary for correct IO ordering between the log and metadata/data
writeback.

Essentially, you're waiting for MD to flush all its dirty stripe
cache to the backing devices and then flush the backing device
caches as well. Nothing can be done at the filesystem level to make
this faster....

> > Also, cp -a of a linux source tree is just as data intensive as it
> > is metadata intensive. There's probably more data IO than metadata
> > IO, so that's more likely to be what is slowing the disks down as
> > metadata writeback is...
> > 
> > > I'm a bit concerned, though, that there seems to be a practically
> > > unlimited (~ 1000) number of kernel worker threads waiting for the
> > > cil push and indirectly for log writes.
> > 
> > That's normal - XFS is highly asynchronous and defers a lot of work
> > to completion threads.
> > 
> > > > So it dropped by 16 bytes (seems to be common) which is
> > > > unexpected.  I wonder if it filled a hole in a buffer and so
> > > > needed one less xlog_op_header()? But then the size would have
> > > > gone up by at least 128 bytes for the hole that was filled, so
> > > > it still shouldn't go down in size.
> > > > 
> > > > I think you need to instrument xlog_cil_insert_items() and catch
> > > > a negative length here:
> > > > 
> > > > 	/* account for space used by new iovec headers  */
> > > > 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t); len +=
> > > > 	iovhdr_res; ctx->nvecs += diff_iovecs;
> > > > 
> > > > (diff_iovecs will be negative if the number of xlog_op_header
> > > > structures goes down)
> > > > 
> > > > And if this happens, then dump the transaction ticket via
> > > > xlog_print_trans(tp) so we can see all the log items types and
> > > > vectors that the transaction has formatted...
> > > 
> > > I tried that, but the output was difficult to understand, because
> > > at that point you can only log the complete transaction with the
> > > items already updated.  And a shrinking item is not switched to the
> > > shadow vector, so the formatted content is already overwritten and
> > > not available for analysis.
> > 
> > Yes, that's exactly the information I need to see.
> > 
> > But the fact you think this is something I don't need to know about
> > is classic Dunning-Kruger in action. You don't understand why I
> > asked for this information, and found the information "difficult to
> > understand", so you decided I didn't need it either, despite the
> > fact I asked explicitly for it.
> > 
> > What I first need to know is what operations are being performed by
> > the transaction that shrunk and what all the items in it are, not
> > which specific items shrunk and by how much. There can be tens to
> > hundreds of items in a single transaction, and it's the combination
> > of the transaction state, the reservation, the amount of the
> > reservation that has been consumed, what items are consuming that
> > reservation, etc. that I need to first see and analyse.
> > 
> > I don't ask for specific information just for fun - I ask for
> > specific information because it is *necessary*. If you want the
> > problem triaged and fixed, then please supply the information you
> > are asked for, even if you don't understand why or what it means.
> 
> I see. I hope, you can make use of the following.

....
> Last two events before the lockup:
> 
> 2021-01-03T14:48:43.098887+01:00 deadbird kernel: [ 3194.831260] XFS (md0): XXX diff_iovecs 0 diff_len -16
> 2021-01-03T14:48:43.109363+01:00 deadbird kernel: [ 3194.837364] XFS (md0): transaction summary:
> 2021-01-03T14:48:43.109367+01:00 deadbird kernel: [ 3194.842544] XFS (md0):   log res   = 212728
> 2021-01-03T14:48:43.118996+01:00 deadbird kernel: [ 3194.847617] XFS (md0):   log count = 8
> 2021-01-03T14:48:43.119010+01:00 deadbird kernel: [ 3194.852193] XFS (md0):   flags     = 0x25
> 2021-01-03T14:48:43.129701+01:00 deadbird kernel: [ 3194.857156] XFS (md0): ticket reservation summary:
> 2021-01-03T14:48:43.129714+01:00 deadbird kernel: [ 3194.862890] XFS (md0):   unit res    = 225140 bytes
> 2021-01-03T14:48:43.135515+01:00 deadbird kernel: [ 3194.868710] XFS (md0):   current res = 225140 bytes
> 2021-01-03T14:48:43.148684+01:00 deadbird kernel: [ 3194.874702] XFS (md0):   total reg   = 0 bytes (o/flow = 0 bytes)
> 2021-01-03T14:48:43.148700+01:00 deadbird kernel: [ 3194.881885] XFS (md0):   ophdrs      = 0 (ophdr space = 0 bytes)
> 2021-01-03T14:48:43.155781+01:00 deadbird kernel: [ 3194.888975] XFS (md0):   ophdr + reg = 0 bytes
> 2021-01-03T14:48:43.161210+01:00 deadbird kernel: [ 3194.894404] XFS (md0):   num regions = 0
> 2021-01-03T14:48:43.165948+01:00 deadbird kernel: [ 3194.899141] XFS (md0): log item:
> 2021-01-03T14:48:43.169996+01:00 deadbird kernel: [ 3194.903203] XFS (md0):   type      = 0x123b
> 2021-01-03T14:48:43.174544+01:00 deadbird kernel: [ 3194.907741] XFS (md0):   flags     = 0x8
> 2021-01-03T14:48:43.178996+01:00 deadbird kernel: [ 3194.912191] XFS (md0):   niovecs   = 3
> 2021-01-03T14:48:43.183347+01:00 deadbird kernel: [ 3194.916546] XFS (md0):   size      = 696
> 2021-01-03T14:48:43.187598+01:00 deadbird kernel: [ 3194.920791] XFS (md0):   bytes     = 248
> 2021-01-03T14:48:43.191932+01:00 deadbird kernel: [ 3194.925131] XFS (md0):   buf len   = 248
> 2021-01-03T14:48:43.196581+01:00 deadbird kernel: [ 3194.929776] XFS (md0):   iovec[0]
> 2021-01-03T14:48:43.200633+01:00 deadbird kernel: [ 3194.933832] XFS (md0):     type    = 0x5
> 2021-01-03T14:48:43.205069+01:00 deadbird kernel: [ 3194.938264] XFS (md0):     len     = 56
> 2021-01-03T14:48:43.209293+01:00 deadbird kernel: [ 3194.942497] XFS (md0):     first 32 bytes of iovec[0]:
> 2021-01-03T14:48:43.215494+01:00 deadbird kernel: [ 3194.948690] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
> 2021-01-03T14:48:43.224801+01:00 deadbird kernel: [ 3194.957997] 00000010: 35 e3 ba 80 2e 00 00 00 00 00 00 00 00 00 00 00  5...............
> 2021-01-03T14:48:43.234100+01:00 deadbird kernel: [ 3194.967297] XFS (md0):   iovec[1]
> 2021-01-03T14:48:43.238293+01:00 deadbird kernel: [ 3194.971484] XFS (md0):     type    = 0x6
> 2021-01-03T14:48:43.242744+01:00 deadbird kernel: [ 3194.975942] XFS (md0):     len     = 176
> 2021-01-03T14:48:43.247108+01:00 deadbird kernel: [ 3194.980295] XFS (md0):     first 32 bytes of iovec[1]:
> 2021-01-03T14:48:43.253304+01:00 deadbird kernel: [ 3194.986506] 00000000: 4e 49 b0 81 03 02 00 00 7d 00 00 00 7d 00 00 00  NI......}...}...
> 2021-01-03T14:48:43.262648+01:00 deadbird kernel: [ 3194.995835] 00000010: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 2021-01-03T14:48:43.272064+01:00 deadbird kernel: [ 3195.005267] XFS (md0):   iovec[2]
> 2021-01-03T14:48:43.276174+01:00 deadbird kernel: [ 3195.009366] XFS (md0):     type    = 0x7
> 2021-01-03T14:48:43.280649+01:00 deadbird kernel: [ 3195.013848] XFS (md0):     len     = 16
> 2021-01-03T14:48:43.285042+01:00 deadbird kernel: [ 3195.018243] XFS (md0):     first 16 bytes of iovec[2]:
> 2021-01-03T14:48:43.291166+01:00 deadbird kernel: [ 3195.024364] 00000000: 00 00 00 00 00 00 00 00 00 ba 02 ef e0 01 23 ef  ..............#.

Ok, they are all of this type - same transaction type/size, same
form, all a reduction in the size of the inode data fork in extent
format. I suspect this is trimming speculative block preallocation
beyond EOF on final close of a written file.

Ok, that matches the case from the root directory you gave earlier,
where the inode data fork was reducing in size from unlinks in a
shortform directory.  Different vector, different inode fork format,
but it appears to result in the same thing where just an inode with
a reducing inode fork size is relogged.

I'll think about this over the weekend and look at it more closely
next week when I'm back from PTO...

Thanks for following up and providing this information, Donald!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-07 21:54                   ` Dave Chinner
@ 2021-01-08 16:56                     ` Brian Foster
  2021-01-11 16:38                       ` Brian Foster
  2021-01-13 21:44                       ` Dave Chinner
  0 siblings, 2 replies; 29+ messages in thread
From: Brian Foster @ 2021-01-08 16:56 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Donald Buczek, linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > If the value goes below the limit while some threads are
> > > > already waiting but before the push worker gets to it, these threads are
> > > > not woken.
> > > > 
> > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > optimization. This is possible, because we hold the xc_push_lock
> > > > spinlock, which prevents additions to the waitqueue.
> > > > 
> > > > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > > > ---
> > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > --- a/fs/xfs/xfs_log_cil.c
> > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > >  	/*
> > > >  	 * Wake up any background push waiters now this context is being pushed.
> > > >  	 */
> > > > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > +	if (waitqueue_active(&cil->xc_push_wait))
> > > >  		wake_up_all(&cil->xc_push_wait);
> > > 
> > > That just smells wrong to me. It *might* be correct, but this
> > > condition should pair with the sleep condition, as space used by a
> > > CIL context should never actually decrease....
> > > 
> > 
> > ... but I'm a little confused by this assertion. The shadow buffer
> > allocation code refers to the possibility of shadow buffers falling out
> > that are smaller than currently allocated buffers. Further, the
> > _insert_format_items() code appears to explicitly optimize for this
> > possibility by reusing the active buffer, subtracting the old size/count
> > values from the diff variables and then reformatting the latest
> > (presumably smaller) item to the lv.
> 
> Individual items might shrink, but the overall transaction should
> grow. Think of an extent to btree conversion of an inode fork. The
> data in the inode fork decreases from a list of extents to a btree
> root block pointer, so the inode item shrinks. But then we add a new
> btree root block that contains all the extents + the btree block
> header, and it gets rounded up to the 128 byte buffer logging chunk
> size.
> 
> IOWs, while the inode item has decreased in size, the overall
> space consumed by the transaction has gone up and so the CIL ctx
> used_space should increase. Hence we can't just look at individual
> log items and whether they have decreased in size - we have to look
> at all the items in the transaction to understand how the space used
> in that transaction has changed. i.e. it's the aggregation of all
> items in the transaction that matter here, not so much the
> individual items.
> 

Ok, that makes more sense...

> > Of course this could just be implementation detail. I haven't dug into
> > the details in the remainder of this thread and I don't have specific
> > examples off the top of my head, but perhaps based on the ability of
> > various structures to change formats and the ability of log vectors to
> > shrink in size, shouldn't we expect the possibility of a CIL context to
> > shrink in size as well? Just from poking around the CIL it seems like
> > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > for recalculating split res as well)...
> 
> Yes, there may be situations where it decreases. It may be this is
> fine, but the assumption *I've made* in lots of the CIL push code is
> that ctx->used_space rarely, if ever, will go backwards.
> 

... and rarely seems a bit more pragmatic than never.

> e.g. we run the first transaction into the CIL, it steals the space
> needed for the cil checkpoint headers for the transaction. Then if
> the space returned by the item formatting is negative (because it is
> in the AIL and being relogged), the CIL checkpoint now doesn't have
> the space reserved it needs to run a checkpoint. That transaction is
> a sync transaction, so it forces the log, and now we push the CIL
> without sufficient reservation to write out the log headers and the
> items we just formatted....
> 

Hmmm... that seems like an odd scenario because I'd expect the space
usage delta to reflect what might or might not have already been added
to the CIL context, not necessarily the AIL. IOW, shouldn't a negative
delta only occur for items being relogged while still CIL resident
(regardless of AIL residency)?

From a code standpoint, the way a particular log item delta comes out
negative is from having a shadow lv size smaller than the ->li_lv size.
Thus, xlog_cil_insert_format_items() subtracts the currently formatted
lv size from the delta, formats the current state of the item, and
xfs_cil_prepare_item() adds the new (presumably smaller) size to the
delta. We reuse ->li_lv in this scenario so both it and the shadow
buffer remain, but a CIL push pulls ->li_lv from all log items and
chains them to the CIL context for writing, so I don't see how we could
have an item return a negative delta on an empty CIL. Hm?

(I was also wondering whether repeated smaller relogs of an item could
be a vector for this to go wrong, but it looks like
xlog_cil_insert_format_items() always uses the formatted size of the
current buffer...).

Brian

> So, yeah, shrinking transaction space usage definitely violates some
> of the assumptions the code makes about how relogging works. It's
> entirely possible the assumptions I've made are not entirely correct
> in some corner cases - those particular cases are what we need to
> ferret out here, and then decide if they are correct or not and deal
> with it from there...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-07 22:19                           ` Dave Chinner
@ 2021-01-09 14:39                             ` Donald Buczek
  0 siblings, 0 replies; 29+ messages in thread
From: Donald Buczek @ 2021-01-09 14:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, Brian Foster, Linux Kernel Mailing List, it+linux-xfs

On 07.01.21 23:19, Dave Chinner wrote:
> On Sun, Jan 03, 2021 at 05:03:33PM +0100, Donald Buczek wrote:
>> On 02.01.21 23:44, Dave Chinner wrote:
>>> On Sat, Jan 02, 2021 at 08:12:56PM +0100, Donald Buczek wrote:
>>>> On 31.12.20 22:59, Dave Chinner wrote:
>>>>> On Thu, Dec 31, 2020 at 12:48:56PM +0100, Donald Buczek wrote:
>>>>>> On 30.12.20 23:16, Dave Chinner wrote:
>>>
>>>>> One could argue that, but one should also understand what the
>>>>> design constraints for a particular algorithm are before suggesting
>>>>> that their solution is "robust". :)
>>>>
>>>> Yes, but an understanding to the extent required by the
>>>> argument should be sufficient :-)
>>>
>>> And that's the fundamental conceit described by Dunning-Kruger.
>>>
>>> I.e. someone thinks they know enough to understand the argument,
>>> when in fact they don't understand enough about the subject matter
>>> to realise they don't understand what the expert at the other end is
>>> saying at all....
>>>
>>>>>> # seq 29
>>>>>>
>>>>>> 2020-12-29T20:08:15.652167+01:00 deadbird kernel: [ 1053.860637] XXX trigger cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=33554656      , push_seq=29, ctx->sequence=29
>>>>>
>>>>> So, at 20:08:15 we get a push trigger and the work is queued. But...
>>>>>
>>>>> .....
>>>>>> 2020-12-29T20:09:04.961088+01:00 deadbird kernel: [ 1103.168964] XXX wake    cil 00000000e374c6f1 ctx 000000004967d650  ctx->space_used=67109136 >= 67108864, push_seq=29, ctx->sequence=29
>>>>>
>>>>> It takes the best part of *50 seconds* before the push work actually
>>>>> runs?
>>>>>
>>>>> That's .... well and truly screwed up - the work should run on that
>>>>> CPU on the very next time it yields the CPU. If we're holding the
>>>>> CPU without yielding it for that long, hangcheck and RCU warnings
>>>>> should be going off...
>>>>
>>>> No such warnings.
>>>>
>>>> But the load is probably I/O bound to the log:
>>>>
>>>> - creates `cp -a` copies of a directory tree with small files (linux source repository)
>>>> - source tree probably completely cached.
>>>> - two copies in parallel
>>>> - slow log (on software raid6)
>>>>
>>>> Isn't it to be expected that sooner or later you need to wait for
>>>> log writes when you write as fast as possible with lots of
>>>> metadata updates and not so much data?
>>>
>>> No, incoming transactions waits for transaction reservation space,
>>> not log writes. Reservation space is freed up by metadata writeback.
>>> So if you have new transactions blocking in xfs_trans_reserve(),
>>> then we are blocking on metadata writeback.
>>>
>>> The CIL worker thread may be waiting for log IO completion before it
>>> issues more log IO, and in that case it is waiting on iclog buffer
>>> space (i.e. only waiting on internal log state, not metadata
>>> writeback).  Can you find that kworker thread stack and paste it? If
>>> bound on log IO, it will be blocked in xlog_get_iclog_space().
>>>
>>> Please paste the entire stack output, too, not just the bits you
>>> think are relevant or understand....
>>
>> That would be a kworker on the "xfs-cil/%s" workqueue, correct?
>> And you mean before the lockup, when the I/O is still active, correct?
>>
>> This is the usual stack output from that thread:
>>
>> # # /proc/2080/task/2080: kworker/u82:3+xfs-cil/md0 :
>> # cat /proc/2080/task/2080/stack
>>
>> [<0>] md_flush_request+0x87/0x190
>> [<0>] raid5_make_request+0x61b/0xb30
>> [<0>] md_handle_request+0x127/0x1a0
>> [<0>] md_submit_bio+0xbd/0x100
>> [<0>] submit_bio_noacct+0x151/0x410
>> [<0>] submit_bio+0x4b/0x1a0
>> [<0>] xlog_state_release_iclog+0x87/0xb0
>> [<0>] xlog_write+0x452/0x6d0
>> [<0>] xlog_cil_push_work+0x2e0/0x4d0
>> [<0>] process_one_work+0x1dd/0x3e0
>> [<0>] worker_thread+0x23f/0x3b0
>> [<0>] kthread+0x118/0x130
>> [<0>] ret_from_fork+0x22/0x30
>>
>> sampled three times with a few seconds in between, stack identical.
> 
> Yeah, so that is MD blocking waiting for a running device cache
> flush to finish before submitting the iclog IO. IOWs, it's waiting
> for hardware to flush volatile caches to stable storage. Every log
> IO has a cache flush preceding it, and if the device doesn't support
> FUA, then there is a post-IO cache flush as well. These flushes are
> necessary for correct IO ordering between the log and metadata/data
> writeback.
> 
> Essentially, you're waiting for MD to flush all its dirty stripe
> cache to the backing devices and then flush the backing device
> caches as well. Nothing can be done at the filesystem level to make
> this faster....
> 
>>> Also, cp -a of a linux source tree is just as data intensive as it
>>> is metadata intensive. There's probably more data IO than metadata
>>> IO, so that's more likely to be what is slowing the disks down as
>>> metadata writeback is...
>>>
>>>> I'm a bit concerned, though, that there seem to be a rather
>>>> unlimited (~ 1000) number of kernel worker threads waiting for the
>>>> cil push and indirectly for log writes.
>>>
>>> That's normal - XFS is highly asynchronous and defers a lot of work
>>> to completion threads.
>>>
>>>>> So it dropped by 16 bytes (seems to be common) which is
>>>>> unexpected.  I wonder if it filled a hole in a buffer and so
>>>>> needed one less xlog_op_header()? But then the size would have
>>>>> gone up by at least 128 bytes for the hole that was filled, so
>>>>> it still shouldn't go down in size.
>>>>>
>>>>> I think you need to instrument xlog_cil_insert_items() and catch
>>>>> a negative length here:
>>>>>
>>>>> 	/* account for space used by new iovec headers  */
>>>>> 	iovhdr_res = diff_iovecs * sizeof(xlog_op_header_t);
>>>>> 	len += iovhdr_res;
>>>>> 	ctx->nvecs += diff_iovecs;
>>>>>
>>>>> (diff_iovecs will be negative if the number of xlog_op_header
>>>>> structures goes down)
>>>>>
>>>>> And if this happens, then dump the transaction ticket via
>>>>> xlog_print_trans(tp) so we can see all the log items types and
>>>>> vectors that the transaction has formatted...
>>>>
>>>> I tried that, but the output was difficult to understand, because
>>>> at that point you can only log the complete transaction with the
>>>> items already updated.  And a shrinking item is not switched to the
>>>> shadow vector, so the formatted content is already overwritten and
>>>> not available for analysis.
>>>
>>> Yes, that's exactly the information I need to see.
>>>
>>> But the fact you think this is something I don't need to know about
>>> is classic Dunning-Kruger in action. You don't understand why I
>>> asked for this information, and found the information "difficult to
>>> understand", so you decided I didn't need it either, despite the
>>> fact I asked explicitly for it.
>>>
>>> What I first need to know is what operations are being performed by
>>> the transaction that shrunk and what all the items in it are, not
>>> which specific items shrunk and by how much. There can be tens to
>>> hundreds of items in a single transaction, and it's the combination
>>> of the transaction state, the reservation, the amount of the
>>> reservation that has been consumed, what items are consuming that
>>> reservation, etc. that I need to first see and analyse.
>>>
>>> I don't ask for specific information just for fun - I ask for
>>> specific information because it is *necessary*. If you want the
>>> problem triaged and fixed, then please supply the information you
>>> are asked for, even if you don't understand why or what it means.
>>
>> I see. I hope, you can make use of the following.
> 
> ....
>> Last two events before the lockup:
>>
>> 2021-01-03T14:48:43.098887+01:00 deadbird kernel: [ 3194.831260] XFS (md0): XXX diff_iovecs 0 diff_len -16
>> 2021-01-03T14:48:43.109363+01:00 deadbird kernel: [ 3194.837364] XFS (md0): transaction summary:
>> 2021-01-03T14:48:43.109367+01:00 deadbird kernel: [ 3194.842544] XFS (md0):   log res   = 212728
>> 2021-01-03T14:48:43.118996+01:00 deadbird kernel: [ 3194.847617] XFS (md0):   log count = 8
>> 2021-01-03T14:48:43.119010+01:00 deadbird kernel: [ 3194.852193] XFS (md0):   flags     = 0x25
>> 2021-01-03T14:48:43.129701+01:00 deadbird kernel: [ 3194.857156] XFS (md0): ticket reservation summary:
>> 2021-01-03T14:48:43.129714+01:00 deadbird kernel: [ 3194.862890] XFS (md0):   unit res    = 225140 bytes
>> 2021-01-03T14:48:43.135515+01:00 deadbird kernel: [ 3194.868710] XFS (md0):   current res = 225140 bytes
>> 2021-01-03T14:48:43.148684+01:00 deadbird kernel: [ 3194.874702] XFS (md0):   total reg   = 0 bytes (o/flow = 0 bytes)
>> 2021-01-03T14:48:43.148700+01:00 deadbird kernel: [ 3194.881885] XFS (md0):   ophdrs      = 0 (ophdr space = 0 bytes)
>> 2021-01-03T14:48:43.155781+01:00 deadbird kernel: [ 3194.888975] XFS (md0):   ophdr + reg = 0 bytes
>> 2021-01-03T14:48:43.161210+01:00 deadbird kernel: [ 3194.894404] XFS (md0):   num regions = 0
>> 2021-01-03T14:48:43.165948+01:00 deadbird kernel: [ 3194.899141] XFS (md0): log item:
>> 2021-01-03T14:48:43.169996+01:00 deadbird kernel: [ 3194.903203] XFS (md0):   type      = 0x123b
>> 2021-01-03T14:48:43.174544+01:00 deadbird kernel: [ 3194.907741] XFS (md0):   flags     = 0x8
>> 2021-01-03T14:48:43.178996+01:00 deadbird kernel: [ 3194.912191] XFS (md0):   niovecs   = 3
>> 2021-01-03T14:48:43.183347+01:00 deadbird kernel: [ 3194.916546] XFS (md0):   size      = 696
>> 2021-01-03T14:48:43.187598+01:00 deadbird kernel: [ 3194.920791] XFS (md0):   bytes     = 248
>> 2021-01-03T14:48:43.191932+01:00 deadbird kernel: [ 3194.925131] XFS (md0):   buf len   = 248
>> 2021-01-03T14:48:43.196581+01:00 deadbird kernel: [ 3194.929776] XFS (md0):   iovec[0]
>> 2021-01-03T14:48:43.200633+01:00 deadbird kernel: [ 3194.933832] XFS (md0):     type    = 0x5
>> 2021-01-03T14:48:43.205069+01:00 deadbird kernel: [ 3194.938264] XFS (md0):     len     = 56
>> 2021-01-03T14:48:43.209293+01:00 deadbird kernel: [ 3194.942497] XFS (md0):     first 32 bytes of iovec[0]:
>> 2021-01-03T14:48:43.215494+01:00 deadbird kernel: [ 3194.948690] 00000000: 3b 12 03 00 05 00 00 00 00 00 10 00 00 00 00 00  ;...............
>> 2021-01-03T14:48:43.224801+01:00 deadbird kernel: [ 3194.957997] 00000010: 35 e3 ba 80 2e 00 00 00 00 00 00 00 00 00 00 00  5...............
>> 2021-01-03T14:48:43.234100+01:00 deadbird kernel: [ 3194.967297] XFS (md0):   iovec[1]
>> 2021-01-03T14:48:43.238293+01:00 deadbird kernel: [ 3194.971484] XFS (md0):     type    = 0x6
>> 2021-01-03T14:48:43.242744+01:00 deadbird kernel: [ 3194.975942] XFS (md0):     len     = 176
>> 2021-01-03T14:48:43.247108+01:00 deadbird kernel: [ 3194.980295] XFS (md0):     first 32 bytes of iovec[1]:
>> 2021-01-03T14:48:43.253304+01:00 deadbird kernel: [ 3194.986506] 00000000: 4e 49 b0 81 03 02 00 00 7d 00 00 00 7d 00 00 00  NI......}...}...
>> 2021-01-03T14:48:43.262648+01:00 deadbird kernel: [ 3194.995835] 00000010: 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>> 2021-01-03T14:48:43.272064+01:00 deadbird kernel: [ 3195.005267] XFS (md0):   iovec[2]
>> 2021-01-03T14:48:43.276174+01:00 deadbird kernel: [ 3195.009366] XFS (md0):     type    = 0x7
>> 2021-01-03T14:48:43.280649+01:00 deadbird kernel: [ 3195.013848] XFS (md0):     len     = 16
>> 2021-01-03T14:48:43.285042+01:00 deadbird kernel: [ 3195.018243] XFS (md0):     first 16 bytes of iovec[2]:
>> 2021-01-03T14:48:43.291166+01:00 deadbird kernel: [ 3195.024364] 00000000: 00 00 00 00 00 00 00 00 00 ba 02 ef e0 01 23 ef  ..............#.
> 
> Ok, they are all of this type - same transaction type/size, same
> form, all a reduction in the size of the inode data fork in extent
> format. I suspect this is trimming speculative block preallocation
> beyond EOF on final close of a written file.

Btw: I looked up such a file by its inode number and it was a rather
large one (a git pack).

But please keep in mind that this log was restricted to the "only
write to a fresh filesystem" scenario. If you do something else on the
filesystem, there appear to be other cases besides the trimming
of the last extent.

I've collected some more log at
http://owww.molgen.mpg.de/~buczek/2021-01-03_13-09/log2.txt

The swap to the shadow buffer was made unconditional so that the
old vectors could be logged together with the new vectors. And I
logged a few bytes more from the buffer. A post-processing
script provided the summary header and removed duplicates.

The first events are from the "cp" with the truncation of the data
fork as seen before. But starting with the transaction at 881.111968
there are six of another kind with a very high diff_len (and
apparently items without an allocated vector).

After 945.723769 I playfully tried to force truncation events on
the attribute fork, so those are not a new case, just the other fork.

Donald

> Ok, that matches the case from the root directory you gave earlier,
> where the inode data fork was reducing in size from unlinks in a
> shortform directory.  Different vector, different inode fork format,
> but it appears to result in the same thing where just an inode with
> a reducing inode fork size is relogged.
> 
> 
> I'll think about this over the weekend and look at it more closely
> next week when I'm back from PTO...
> 
> Thanks for following up and providing this information, Donald!
> 
> Cheers,
> 
> Dave.
> 


* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-08 16:56                     ` Brian Foster
@ 2021-01-11 16:38                       ` Brian Foster
  2021-01-13 21:53                         ` Dave Chinner
  2021-01-13 21:44                       ` Dave Chinner
  1 sibling, 1 reply; 29+ messages in thread
From: Brian Foster @ 2021-01-11 16:38 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Donald Buczek, linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > > If the value goes below the limit while some threads are
> > > > > already waiting but before the push worker gets to it, these threads are
> > > > > not woken.
> > > > > 
> > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > > optimization. This is possible, because we hold the xc_push_lock
> > > > > spinlock, which prevents additions to the waitqueue.
> > > > > 
> > > > > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > > > > ---
> > > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > >  	/*
> > > > >  	 * Wake up any background push waiters now this context is being pushed.
> > > > >  	 */
> > > > > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > > +	if (waitqueue_active(&cil->xc_push_wait))
> > > > >  		wake_up_all(&cil->xc_push_wait);
> > > > 
> > > > That just smells wrong to me. It *might* be correct, but this
> > > > condition should pair with the sleep condition, as space used by a
> > > > CIL context should never actually decrease....
> > > > 
> > > 
> > > ... but I'm a little confused by this assertion. The shadow buffer
> > > allocation code refers to the possibility of shadow buffers falling out
> > > that are smaller than currently allocated buffers. Further, the
> > > _insert_format_items() code appears to explicitly optimize for this
> > > possibility by reusing the active buffer, subtracting the old size/count
> > > values from the diff variables and then reformatting the latest
> > > (presumably smaller) item to the lv.
> > 
> > Individual items might shrink, but the overall transaction should
> > grow. Think of an extent to btree conversion of an inode fork. The
> > data in the inode fork decreases from a list of extents to a btree
> > root block pointer, so the inode item shrinks. But then we add a new
> > btree root block that contains all the extents + the btree block
> > header, and it gets rounded up to the 128 byte buffer logging chunk
> > size.
> > 
> > IOWs, while the inode item has decreased in size, the overall
> > space consumed by the transaction has gone up and so the CIL ctx
> > used_space should increase. Hence we can't just look at individual
> > log items and whether they have decreased in size - we have to look
> > at all the items in the transaction to understand how the space used
> > in that transaction has changed. i.e. it's the aggregation of all
> > items in the transaction that matter here, not so much the
> > individual items.
> > 
> 
> Ok, that makes more sense...
> 
> > > Of course this could just be implementation detail. I haven't dug into
> > > the details in the remainder of this thread and I don't have specific
> > > examples off the top of my head, but perhaps based on the ability of
> > > various structures to change formats and the ability of log vectors to
> > > shrink in size, shouldn't we expect the possibility of a CIL context to
> > > shrink in size as well? Just from poking around the CIL it seems like
> > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > > for recalculating split res as well)...
> > 
> > Yes, there may be situations where it decreases. It may be this is
> > fine, but the assumption *I've made* in lots of the CIL push code is
> > that ctx->used_space rarely, if ever, will go backwards.
> > 
> 
> ... and rarely seems a bit more pragmatic than never.
> 

FWIW, a cursory look at the inode size/format code (motivated by
Donald's recent log dump that appears to show inode log items changing
size) suggested that a simple local format size change might be enough
to cause this condition on an item. A subsequent test to create and
immediately remove a file from an otherwise empty directory triggers a
tracepoint I injected in xlog_cil_insert_items() to detect a negative
transaction delta. As expected, the absolute value of the delta does
seem to increase with a larger filename. This also produces a negative
iovec delta, fwiw. E.g.:

# touch `for i in $(seq 0 63); do echo -n a; done`
# rm -f `for i in $(seq 0 63); do echo -n a; done`
#

rm-9265    [001] ....  4660.177806: xfs_log_commit_cil: 409: len -72 diff_iovecs 0
rm-9265    [001] .N.1  4660.177913: xfs_log_commit_cil: 419: len -72 diff_iovecs 0
rm-9265    [001] ....  4660.178313: xfs_log_commit_cil: 409: len -52 diff_iovecs -1
rm-9265    [001] ...1  4660.178336: xfs_log_commit_cil: 419: len -64 diff_iovecs -1

... and this only seems to occur when the unlink happens before the CIL
has checkpointed and pushed out the inode (i.e. a freeze/unfreeze
cycle in between prevents it).

I've not dug into the transaction details and have no idea if this is
the variant that Donald reproduces; it wouldn't surprise me a ton if
there were various others. This is pretty straightforward, however, and
shows the negative item delta carry through the transaction. IMO, that
seems to justify a throttling fix...

Brian

> > e.g. we run the first transaction into the CIL, it steals the space
> > needed for the cil checkpoint headers for the transaction. Then if
> > the space returned by the item formatting is negative (because it is
> > in the AIL and being relogged), the CIL checkpoint now doesn't have
> > the space reserved it needs to run a checkpoint. That transaction is
> > a sync transaction, so it forces the log, and now we push the CIL
> > without sufficient reservation to write out the log headers and the
> > items we just formatted....
> > 
> 
> Hmmm... that seems like an odd scenario because I'd expect the space
> usage delta to reflect what might or might not have already been added
> to the CIL context, not necessarily the AIL. IOW, shouldn't a negative
> delta only occur for items being relogged while still CIL resident
> (regardless of AIL residency)?
> 
> From a code standpoint, the way a particular log item delta comes out
> negative is from having a shadow lv size smaller than the ->li_lv size.
> Thus, xlog_cil_insert_format_items() subtracts the currently formatted
> lv size from the delta, formats the current state of the item, and
> xfs_cil_prepare_item() adds the new (presumably smaller) size to the
> delta. We reuse ->li_lv in this scenario so both it and the shadow
> buffer remain, but a CIL push pulls ->li_lv from all log items and
> chains them to the CIL context for writing, so I don't see how we could
> have an item return a negative delta on an empty CIL. Hm?
> 
> (I was also wondering whether repeated smaller relogs of an item could
> be a vector for this to go wrong, but it looks like
> xlog_cil_insert_format_items() always uses the formatted size of the
> current buffer...).
> 
> Brian
> 
> > So, yeah, shrinking transaction space usage definitely violates some
> > of the assumptions the code makes about how relogging works. It's
> > entirely possible the assumptions I've made are not entirely correct
> > in some corner cases - those particular cases are what we need to
> > ferret out here, and then decide if they are correct or not and deal
> > with it from there...
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 



* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-08 16:56                     ` Brian Foster
  2021-01-11 16:38                       ` Brian Foster
@ 2021-01-13 21:44                       ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2021-01-13 21:44 UTC (permalink / raw)
  To: Brian Foster
  Cc: Donald Buczek, linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > e.g. we run the first transaction into the CIL, it steals the space
> > needed for the cil checkpoint headers for the transaction. Then if
> > the space returned by the item formatting is negative (because it is
> > in the AIL and being relogged), the CIL checkpoint now doesn't have
> > the space reserved it needs to run a checkpoint. That transaction is
> > a sync transaction, so it forces the log, and now we push the CIL
> > without sufficient reservation to write out the log headers and the
> > items we just formatted....
> > 
> 
> Hmmm... that seems like an odd scenario because I'd expect the space
> usage delta to reflect what might or might not have already been added
> to the CIL context, not necessarily the AIL. IOW, shouldn't a negative
> delta only occur for items being relogged while still CIL resident
> (regardless of AIL residency)?
>
> From a code standpoint, the way a particular log item delta comes out
> negative is from having a shadow lv size smaller than the ->li_lv size.
> Thus, xlog_cil_insert_format_items() subtracts the currently formatted
> lv size from the delta, formats the current state of the item, and
> xfs_cil_prepare_item() adds the new (presumably smaller) size to the
> delta. We reuse ->li_lv in this scenario so both it and the shadow
> buffer remain, but a CIL push pulls ->li_lv from all log items and
> chains them to the CIL context for writing, so I don't see how we could
> have an item return a negative delta on an empty CIL. Hm?

In theory, yes. But like I said, I've made a bunch of assumptions
that this won't happen, and so without actually auditting the code
I'm not actually sure that it won't. i.e. I need to go check what
happens with items being marked stale, how shadow buffer
reallocation interacts with items that end up being smaller than the
existing buffer, etc. I've paged a lot of this detail out of my
brain, so until I spend the time to go over it all again I'm not
going to be able to answer definitively.

> (I was also wondering whether repeated smaller relogs of an item could
> be a vector for this to go wrong, but it looks like
> xlog_cil_insert_format_items() always uses the formatted size of the
> current buffer...).

That's my nagging doubt about all this - that there's an interaction
of this nature that I haven't considered due to the assumptions I've
made that allows underflow to occur. That would be much worse than
the current situation of hanging on a missing wakeup when the CIL is
full and used_space goes backwards....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-11 16:38                       ` Brian Foster
@ 2021-01-13 21:53                         ` Dave Chinner
  2021-02-15 13:36                           ` Donald Buczek
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2021-01-13 21:53 UTC (permalink / raw)
  To: Brian Foster
  Cc: Donald Buczek, linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Mon, Jan 11, 2021 at 11:38:48AM -0500, Brian Foster wrote:
> On Fri, Jan 08, 2021 at 11:56:57AM -0500, Brian Foster wrote:
> > On Fri, Jan 08, 2021 at 08:54:44AM +1100, Dave Chinner wrote:
> > > On Mon, Jan 04, 2021 at 11:23:53AM -0500, Brian Foster wrote:
> > > > On Thu, Dec 31, 2020 at 09:16:11AM +1100, Dave Chinner wrote:
> > > > > On Wed, Dec 30, 2020 at 12:56:27AM +0100, Donald Buczek wrote:
> > > > > > If the value goes below the limit while some threads are
> > > > > > already waiting but before the push worker gets to it, these threads are
> > > > > > not woken.
> > > > > > 
> > > > > > Always wake all CIL push waiters. Test with waitqueue_active() as an
> > > > > > optimization. This is possible, because we hold the xc_push_lock
> > > > > > spinlock, which prevents additions to the waitqueue.
> > > > > > 
> > > > > > Signed-off-by: Donald Buczek <buczek@molgen.mpg.de>
> > > > > > ---
> > > > > >  fs/xfs/xfs_log_cil.c | 2 +-
> > > > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > > > > index b0ef071b3cb5..d620de8e217c 100644
> > > > > > --- a/fs/xfs/xfs_log_cil.c
> > > > > > +++ b/fs/xfs/xfs_log_cil.c
> > > > > > @@ -670,7 +670,7 @@ xlog_cil_push_work(
> > > > > >  	/*
> > > > > >  	 * Wake up any background push waiters now this context is being pushed.
> > > > > >  	 */
> > > > > > -	if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
> > > > > > +	if (waitqueue_active(&cil->xc_push_wait))
> > > > > >  		wake_up_all(&cil->xc_push_wait);
> > > > > 
> > > > > That just smells wrong to me. It *might* be correct, but this
> > > > > condition should pair with the sleep condition, as space used by a
> > > > > CIL context should never actually decrease....
> > > > > 
> > > > 
> > > > ... but I'm a little confused by this assertion. The shadow buffer
> > > > allocation code refers to the possibility of shadow buffers falling out
> > > > that are smaller than currently allocated buffers. Further, the
> > > > _insert_format_items() code appears to explicitly optimize for this
> > > > possibility by reusing the active buffer, subtracting the old size/count
> > > > values from the diff variables and then reformatting the latest
> > > > (presumably smaller) item to the lv.
> > > 
> > > Individual items might shrink, but the overall transaction should
> > > grow. Think of an extent-to-btree conversion of an inode fork. The
> > > data in the inode fork decreases from a list of extents to a btree
> > > root block pointer, so the inode item shrinks. But then we add a new
> > > btree root block that contains all the extents + the btree block
> > > header, and it gets rounded up to the 128-byte buffer logging chunk
> > > size.
> > > 
> > > IOWs, while the inode item has decreased in size, the overall
> > > space consumed by the transaction has gone up and so the CIL ctx
> > > used_space should increase. Hence we can't just look at individual
> > > log items and whether they have decreased in size - we have to look
> > > at all the items in the transaction to understand how the space used
> > > in that transaction has changed. i.e. it's the aggregation of all
> > > items in the transaction that matter here, not so much the
> > > individual items.
> > > 
> > 
> > Ok, that makes more sense...
> > 
> > > > Of course this could just be implementation detail. I haven't dug into
> > > > the details in the remainder of this thread and I don't have specific
> > > > examples off the top of my head, but perhaps based on the ability of
> > > > various structures to change formats and the ability of log vectors to
> > > > shrink in size, shouldn't we expect the possibility of a CIL context to
> > > > shrink in size as well? Just from poking around the CIL it seems like
> > > > the surrounding code supports it (xlog_cil_insert_items() checks len > 0
> > > > for recalculating split res as well)...
> > > 
> > > Yes, there may be situations where it decreases. It may be this is
> > > fine, but the assumption *I've made* in lots of the CIL push code is
> > > that ctx->used_space rarely, if ever, will go backwards.
> > > 
> > 
> > ... and rarely seems a bit more pragmatic than never.
> > 
> 
> FWIW, a cursory look at the inode size/format code (motivated by
> Donald's recent log dump that appears to show inode log items changing
> size) suggested that a simple local format size change might be enough
> to cause this condition on an item. A subsequent test to create and
> immediately remove a file from an otherwise empty directory triggers a
> tracepoint I injected in xlog_cil_insert_items() to detect a negative
> transaction delta. As expected, the absolute value of the delta does
> seem to increase with a larger filename. This also produces a negative
> iovec delta, fwiw. E.g.:
> 
> # touch `for i in $(seq 0 63); do echo -n a; done`
> # rm -f `for i in $(seq 0 63); do echo -n a; done`
> #
> 
> rm-9265    [001] ....  4660.177806: xfs_log_commit_cil: 409: len -72 diff_iovecs 0
> rm-9265    [001] .N.1  4660.177913: xfs_log_commit_cil: 419: len -72 diff_iovecs 0
> rm-9265    [001] ....  4660.178313: xfs_log_commit_cil: 409: len -52 diff_iovecs -1
> rm-9265    [001] ...1  4660.178336: xfs_log_commit_cil: 419: len -64 diff_iovecs -1
> 
> ... and this only seems to occur when the unlink occurs before the CIL
> has been checkpointed and pushed out the inode (i.e. a freeze/unfreeze
> cycle prevents it).

Yeah, it's a shortform directory removal that triggers it easily
because the other items being modified in the transaction aren't
changing size on relogging (AGI unlink list pointer, unlinked inode
core for nlink change). Hence the reduction in size of the directory
inode reduces the overall CIL size...

> I've not dug into the transaction details and have no idea if this is
> the variant that Donald reproduces; it wouldn't surprise me a ton if
> there were various others. This is pretty straightforward, however, and
> shows the negative item delta carry through the transaction. IMO, that
> seems to justify a throttling fix...

I agree that a throttling fix is needed, but I'm trying to
understand the scope and breadth of the problem first instead of
jumping the gun and making the wrong fix for the wrong reasons that
just papers over the underlying problems that the throttling bug has
made us aware of...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-01-13 21:53                         ` Dave Chinner
@ 2021-02-15 13:36                           ` Donald Buczek
  2021-02-16 11:18                             ` Brian Foster
  0 siblings, 1 reply; 29+ messages in thread
From: Donald Buczek @ 2021-02-15 13:36 UTC (permalink / raw)
  To: Dave Chinner, Brian Foster
  Cc: linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On 13.01.21 22:53, Dave Chinner wrote:
> [...]
> I agree that a throttling fix is needed, but I'm trying to
> understand the scope and breadth of the problem first instead of
> jumping the gun and making the wrong fix for the wrong reasons that
> just papers over the underlying problems that the throttling bug has
> made us aware of...

Are you still working on this?

If it takes more time to understand the potential underlying problem, the fix for the problem at hand should be applied.

This is a real-world problem, accidentally found in the wild. It appears very rarely, but it freezes a filesystem or the whole system. It exists in 5.7, 5.8, 5.9, 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown"), which silently added a condition to the wakeup. The condition is based on a wrong assumption.

Why is this "papering over"? If a reminder was needed, there were better ways than randomly hanging the system.

Why is

     if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
         wake_up_all(&cil->xc_push_wait);

, which doesn't work reliably, preferable to

     if (waitqueue_active(&cil->xc_push_wait))
         wake_up_all(&cil->xc_push_wait);

which does?

Best
   Donald

> Cheers,
> 
> Dave

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-02-15 13:36                           ` Donald Buczek
@ 2021-02-16 11:18                             ` Brian Foster
  2021-02-16 12:40                               ` Donald Buczek
  0 siblings, 1 reply; 29+ messages in thread
From: Brian Foster @ 2021-02-16 11:18 UTC (permalink / raw)
  To: Donald Buczek
  Cc: Dave Chinner, linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote:
> On 13.01.21 22:53, Dave Chinner wrote:
> > [...]
> > I agree that a throttling fix is needed, but I'm trying to
> > understand the scope and breadth of the problem first instead of
> > jumping the gun and making the wrong fix for the wrong reasons that
> > just papers over the underlying problems that the throttling bug has
> > made us aware of...
> 
> Are you still working on this?
> 
> If it takes more time to understand the potential underlying problem, the fix for the problem at hand should be applied.
> 
> This is a real-world problem, accidentally found in the wild. It appears very rarely, but it freezes a filesystem or the whole system. It exists in 5.7, 5.8, 5.9, 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown"), which silently added a condition to the wakeup. The condition is based on a wrong assumption.
> 
> Why is this "papering over"? If a reminder was needed, there were better ways than randomly hanging the system.
> 
> Why is
> 
>     if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
>         wake_up_all(&cil->xc_push_wait);
> 
> , which doesn't work reliably, preferable to
> 
>     if (waitqueue_active(&cil->xc_push_wait))
>         wake_up_all(&cil->xc_push_wait);
> 
> which does?
> 

JFYI, Dave followed up with a patch a couple weeks or so ago:

https://lore.kernel.org/linux-xfs/20210128044154.806715-5-david@fromorbit.com/

Brian

> Best
>   Donald
> 
> > Cheers,
> > 
> > Dave
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH] xfs: Wake CIL push waiters more reliably
  2021-02-16 11:18                             ` Brian Foster
@ 2021-02-16 12:40                               ` Donald Buczek
  0 siblings, 0 replies; 29+ messages in thread
From: Donald Buczek @ 2021-02-16 12:40 UTC (permalink / raw)
  To: Brian Foster
  Cc: Dave Chinner, linux-xfs, Linux Kernel Mailing List, it+linux-xfs

On 16.02.21 12:18, Brian Foster wrote:
> On Mon, Feb 15, 2021 at 02:36:38PM +0100, Donald Buczek wrote:
>> On 13.01.21 22:53, Dave Chinner wrote:
>>> [...]
>>> I agree that a throttling fix is needed, but I'm trying to
>>> understand the scope and breadth of the problem first instead of
>>> jumping the gun and making the wrong fix for the wrong reasons that
>>> just papers over the underlying problems that the throttling bug has
>>> made us aware of...
>>
>> Are you still working on this?
>>
>> If it takes more time to understand the potential underlying problem, the fix for the problem at hand should be applied.
>>
>> This is a real-world problem, accidentally found in the wild. It appears very rarely, but it freezes a filesystem or the whole system. It exists in 5.7, 5.8, 5.9, 5.10 and 5.11 and is caused by c7f87f3984cf ("xfs: fix use-after-free on CIL context on shutdown"), which silently added a condition to the wakeup. The condition is based on a wrong assumption.
>>
>> Why is this "papering over"? If a reminder was needed, there were better ways than randomly hanging the system.
>>
>> Why is
>>
>>      if (ctx->space_used >= XLOG_CIL_BLOCKING_SPACE_LIMIT(log))
>>          wake_up_all(&cil->xc_push_wait);
>>
>> , which doesn't work reliably, preferable to
>>
>>      if (waitqueue_active(&cil->xc_push_wait))
>>          wake_up_all(&cil->xc_push_wait);
>>
>> which does?
>>
> 
> JFYI, Dave followed up with a patch a couple weeks or so ago:
> 
> https://lore.kernel.org/linux-xfs/20210128044154.806715-5-david@fromorbit.com/

Oh, great. I apologize for the unneeded reminder.

Best

   Donald

> 
> Brian
> 
>> Best
>>    Donald
>>
>>> Cheers,
>>>
>>> Dave
>>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2021-02-16 12:41 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-12-17 17:44 v5.10.1 xfs deadlock Donald Buczek
2020-12-17 19:43 ` Brian Foster
2020-12-17 21:30   ` Donald Buczek
2020-12-18 15:35     ` Brian Foster
2020-12-18 18:35       ` Donald Buczek
2020-12-27 17:34         ` Donald Buczek
2020-12-28 23:13           ` Donald Buczek
2020-12-29 23:56             ` [PATCH] xfs: Wake CIL push waiters more reliably Donald Buczek
2020-12-30 22:16               ` Dave Chinner
2020-12-31 11:48                 ` Donald Buczek
2020-12-31 21:59                   ` Dave Chinner
2021-01-02 19:12                     ` Donald Buczek
2021-01-02 22:44                       ` Dave Chinner
2021-01-03 16:03                         ` Donald Buczek
2021-01-07 22:19                           ` Dave Chinner
2021-01-09 14:39                             ` Donald Buczek
2021-01-04 16:23                 ` Brian Foster
2021-01-07 21:54                   ` Dave Chinner
2021-01-08 16:56                     ` Brian Foster
2021-01-11 16:38                       ` Brian Foster
2021-01-13 21:53                         ` Dave Chinner
2021-02-15 13:36                           ` Donald Buczek
2021-02-16 11:18                             ` Brian Foster
2021-02-16 12:40                               ` Donald Buczek
2021-01-13 21:44                       ` Dave Chinner
     [not found]             ` <20201230024642.2171-1-hdanton@sina.com>
2020-12-30 16:54               ` Donald Buczek
2020-12-18 21:49 ` v5.10.1 xfs deadlock Dave Chinner
2020-12-21 12:22   ` Donald Buczek
2020-12-27 17:22     ` Donald Buczek
