* [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
@ 2018-12-24  7:16 bugzilla-daemon
  2018-12-24  7:19 ` [Bug 202053] " bugzilla-daemon
                   ` (16 more replies)
  0 siblings, 17 replies; 29+ messages in thread
From: bugzilla-daemon @ 2018-12-24  7:16 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

            Bug ID: 202053
           Summary: [xfstests generic/464]: XFS corruption and Assertion
                    failed: 0, file: fs/xfs/xfs_super.c, line: 985
           Product: File System
           Version: 2.5
    Kernel Version: 4.20-rc6 with xfs-4.21-merge-2
          Hardware: All
                OS: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: XFS
          Assignee: filesystem_xfs@kernel-bugs.kernel.org
          Reporter: zlang@redhat.com
        Regression: No

generic/464 reproduced an XFS corruption:

# while true; do generic/464 || break; done
...
...
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 ibm-x3650m4-10 4.20.0-rc6
MKFS_OPTIONS  -- -f -m reflink=1,rmapbt=1,finobt=1,crc=1 -i sparse=1
/dev/mapper/rhel_ibm--x3650m4--10-xfscratch
MOUNT_OPTIONS -- -o context=system_u:object_r:root_t:s0
/dev/mapper/rhel_ibm--x3650m4--10-xfscratch /mnt/scratch

generic/464 80s ... [failed, exit status 1]- output mismatch (see
/home/git/xfstests-dev/results//generic/464.out.bad)
    --- tests/generic/464.out   2018-11-25 23:54:50.485528602 -0500
    +++ /home/git/xfstests-dev/results//generic/464.out.bad     2018-12-24
01:47:45.633426963 -0500
    @@ -1,2 +1,3 @@
     QA output created by 464
    -Silence is golden
    +_check_xfs_filesystem: filesystem on
/dev/mapper/rhel_ibm--x3650m4--10-xfscratch is inconsistent (r)
    +(see /home/git/xfstests-dev/results//generic/464.full for details)
    ...
    (Run 'diff -u /home/git/xfstests-dev/tests/generic/464.out
/home/git/xfstests-dev/results//generic/464.out.bad'  to see the entire diff)
Ran: generic/464
Failures: generic/464
Failed 1 of 1 tests

# xfs_repair -n /dev/mapper/rhel_ibm--x3650m4--10-xfscratch
Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
sb_fdblocks 22130881, counted 22130887
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 2
        - agno = 1
        - agno = 3
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

# dmesg
[ 2121.041622] run fstests generic/464 at 2018-12-24 01:47:12
[ 2121.967789] XFS (dm-5): Unmounting Filesystem
[ 2123.883676] XFS (dm-5): Mounting V5 Filesystem
[ 2124.004010] XFS (dm-5): Ending clean mount
[ 2130.837461] XFS (dm-5): Unmounting Filesystem
[ 2131.542125] XFS (dm-5): Mounting V5 Filesystem
[ 2131.604660] XFS (dm-5): Ending clean mount
[ 2138.112176] XFS (dm-5): Unmounting Filesystem
[ 2138.825792] XFS (dm-5): Mounting V5 Filesystem
[ 2138.912444] XFS (dm-5): Ending clean mount
[ 2145.256700] XFS (dm-5): Unmounting Filesystem
[ 2145.995742] XFS (dm-5): Mounting V5 Filesystem
[ 2146.106208] XFS (dm-5): Ending clean mount
[ 2152.590844] XFS: Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
[ 2152.598602] WARNING: CPU: 22 PID: 7791 at fs/xfs/xfs_message.c:93
asswarn+0x1c/0x1f [xfs]
[ 2152.607732] Modules linked in: sunrpc intel_rapl sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ipmi_ssif ghash_clmulni_intel
intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore ipmi_si ioatdma
intel_rapl_perf sg pcspkr ipmi_devintf i2c_i801 lpc_ich ipmi_msghandler xfs
libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea wmi sysfillrect sysimgblt
fb_sys_fops ttm igb dca drm crc32c_intel
megaraid_sas i2c_algo_bit rndis_host cdc_ether usbnet mii dm_mirror
dm_region_hash dm_log dm_mod
[ 2152.661611] CPU: 22 PID: 7791 Comm: umount Not tainted 4.20.0-rc6 #4
[ 2152.668706] Hardware name: IBM System x3650 M4 -[7915ON3]-/00J6520, BIOS
-[VVE124AUS-1.30]- 11/21/2012
[ 2152.679152] RIP: 0010:asswarn+0x1c/0x1f [xfs]
[ 2152.684018] Code: 00 00 00 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 66 66 66 90
48 89 f1 41 89 d0 48 c7 c6 00 bf ec c0 48 89 fa 31 ff e8 06 fa ff ff <0f> 0b c3
66 66 66 66 90 48 89 f1 41 89 d0 48 c7 c6 00 bf ec c0 48
[ 2152.704976] RSP: 0018:ffff88845bedfc70 EFLAGS: 00010286
[ 2152.710811] RAX: 0000000000000000 RBX: ffff8881c418e0c0 RCX:
0000000000000000
[ 2152.718776] RDX: dffffc0000000000 RSI: 000000000000000a RDI:
ffffed108b7dbf80
[ 2152.726742] RBP: ffff88845d701100 R08: ffffed108ef3cf91 R09:
ffffed108ef3cf90
[ 2152.734707] R10: ffffed108ef3cf90 R11: ffff8884779e7c87 R12:
ffff8881c418de40
[ 2152.742672] R13: 0000000000000016 R14: ffffffffc0fc0000 R15:
0000000000000016
[ 2152.750631] FS:  00007f3aef255080(0000) GS:ffff888477800000(0000)
knlGS:0000000000000000
[ 2152.759662] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2152.766076] CR2: 0000564c78b672e8 CR3: 000000046771c002 CR4:
00000000000606e0
[ 2152.774042] Call Trace:
[ 2152.776829]  xfs_fs_destroy_inode+0x584/0x8e0 [xfs]
[ 2152.782286]  dispose_list+0xfa/0x1d0
[ 2152.786284]  evict_inodes+0x29e/0x390
[ 2152.790378]  ? dispose_list+0x1d0/0x1d0
[ 2152.794673]  generic_shutdown_super+0xac/0x330
[ 2152.799638]  kill_block_super+0x94/0xe0
[ 2152.803923]  deactivate_locked_super+0x82/0xd0
[ 2152.808885]  deactivate_super+0x123/0x140
[ 2152.813363]  ? get_super_exclusive_thawed+0x10/0x10
[ 2152.818813]  ? rcu_is_watching+0x2c/0x80
[ 2152.823201]  cleanup_mnt+0x9f/0x130
[ 2152.827089]  task_work_run+0x10e/0x190
[ 2152.831294]  exit_to_usermode_loop+0x136/0x160
[ 2152.836260]  do_syscall_64+0x3c3/0x480
[ 2152.840452]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 2152.846092] RIP: 0033:0x7f3aee28da3b
[ 2152.850083] Code: 14 2c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 90 f3 0f 1e fa
31 f6 e9 05 00 00 00 0f 1f 44 00 00 f3 0f 1e fa b8 a6 00 00 00 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 8b 0d 1d 14 2c 00 f7 d8 64 89 01 48
[ 2152.871041] RSP: 002b:00007ffff8962718 EFLAGS: 00000246 ORIG_RAX:
00000000000000a6
[ 2152.879485] RAX: 0000000000000000 RBX: 0000564c78b5c2d0 RCX:
00007f3aee28da3b
[ 2152.887441] RDX: 0000000000000001 RSI: 0000000000000000 RDI:
0000564c78b64db0
[ 2152.895412] RBP: 0000000000000000 R08: 0000564c78b64dd0 R09:
00007ffff8960f80
[ 2152.903376] R10: 0000000000000000 R11: 0000000000000246 R12:
0000564c78b64db0
[ 2152.911343] R13: 00007f3aef039184 R14: 0000564c78b5c4b0 R15:
00000000ffffffff
[ 2152.919317] irq event stamp: 46282
[ 2152.923116] hardirqs last  enabled at (46281): [<ffffffffb570eeb2>]
console_unlock+0x672/0xc50
[ 2152.932731] hardirqs last disabled at (46282): [<ffffffffb54054d1>]
trace_hardirqs_off_thunk+0x1a/0x1c
[ 2152.943113] softirqs last  enabled at (46278): [<ffffffffb72006a8>]
__do_softirq+0x6a8/0xa1c
[ 2152.952535] softirqs last disabled at (46271): [<ffffffffb55c0c71>]
irq_exit+0x281/0x2d0
[ 2152.961568] ---[ end trace 94749dfc843d5970 ]---
[ 2153.124997] XFS (dm-5): Unmounting Filesystem
[ 2154.012646] XFS (dm-2): Unmounting Filesystem


I tested on the xfs-linux for-next branch:
# git log --oneline
65eed012d1f2 (HEAD -> for-next, tag: xfs-4.21-merge-2, origin/xfs-4.21-merge,
origin/for-next) xfs: reallocate realtime summary cache on growfs
86d163dbfe2a (tag: xfs-4.21-merge-1) xfs: stringify scrub types in ftrace
output
...
...

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
@ 2018-12-24  7:19 ` bugzilla-daemon
  2018-12-24 10:40 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2018-12-24  7:19 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #1 from Zorro Lang (zlang@redhat.com) ---
Created attachment 280133
  --> https://bugzilla.kernel.org/attachment.cgi?id=280133&action=edit
corrupted metadump

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
  2018-12-24  7:19 ` [Bug 202053] " bugzilla-daemon
@ 2018-12-24 10:40 ` bugzilla-daemon
  2018-12-24 10:43 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2018-12-24 10:40 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #2 from Zorro Lang (zlang@redhat.com) ---
By enabling the XFS_DEBUG config option, I get the debug message "ino 28ff data
fork has delalloc extent at [0x160:0x1]":

[ 1298.421477] run fstests generic/464 at 2018-12-24 04:53:34
[ 1299.350730] XFS (dm-5): Unmounting Filesystem
[ 1301.357964] XFS (dm-5): Mounting V5 Filesystem
[ 1301.453602] XFS (dm-5): Ending clean mount
[ 1308.233740] XFS (dm-5): Unmounting Filesystem
[ 1308.921723] XFS (dm-5): Mounting V5 Filesystem
[ 1308.990435] XFS (dm-5): Ending clean mount
[ 1315.521677] XFS (dm-5): ino 28ff data fork has delalloc extent at
[0x160:0x1]
[ 1315.529686] XFS: Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
[ 1315.537563] WARNING: CPU: 12 PID: 16993 at fs/xfs/xfs_message.c:104
assfail+0x54/0x57 [xfs]

I'll upload the metadump file which matches this debug info later.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
  2018-12-24  7:19 ` [Bug 202053] " bugzilla-daemon
  2018-12-24 10:40 ` bugzilla-daemon
@ 2018-12-24 10:43 ` bugzilla-daemon
  2018-12-24 10:49 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2018-12-24 10:43 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #3 from Zorro Lang (zlang@redhat.com) ---
Created attachment 280135
  --> https://bugzilla.kernel.org/attachment.cgi?id=280135&action=edit
corrupted xfs metadump

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
                   ` (2 preceding siblings ...)
  2018-12-24 10:43 ` bugzilla-daemon
@ 2018-12-24 10:49 ` bugzilla-daemon
  2018-12-25  6:10 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2018-12-24 10:49 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #4 from Zorro Lang (zlang@redhat.com) ---
I never hit this bug before, only a similar bug which was fixed a year ago
by:
commit 40214d128e07dd21bb07a8ed6a7fe2f911281ab2
Author: Brian Foster <bfoster@redhat.com>
Date:   Fri Oct 13 09:47:46 2017 -0700

    xfs: trim writepage mapping to within eof

So I wonder if this is a regression?

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
                   ` (3 preceding siblings ...)
  2018-12-24 10:49 ` bugzilla-daemon
@ 2018-12-25  6:10 ` bugzilla-daemon
  2019-01-04 12:32   ` Brian Foster
  2019-01-04 12:40 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 29+ messages in thread
From: bugzilla-daemon @ 2018-12-25  6:10 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #5 from Zorro Lang (zlang@redhat.com) ---
(In reply to Zorro Lang from comment #4)
> I never hit this bug before, only a similar bug which was fixed a year
> ago by:
> commit 40214d128e07dd21bb07a8ed6a7fe2f911281ab2
> Author: Brian Foster <bfoster@redhat.com>
> Date:   Fri Oct 13 09:47:46 2017 -0700
> 
>     xfs: trim writepage mapping to within eof
> 
> So I wonder if this is a regression?

I just reproduced this issue on kernel 4.19, so it's not a regression from
v4.19:

[ 1297.449750] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) ||
ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 954
[ 1297.463147] WARNING: CPU: 20 PID: 26952 at fs/xfs/xfs_message.c:104
assfail+0x54/0x57 [xfs]
[ 1297.472473] Modules linked in: sunrpc intel_rapl sb_edac
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif 
intel_cstate intel_uncore iTCO_wdt iTCO_vendor_support ipmi_si sg
intel_rapl_perf ipmi_devintf wmi ioatdma i2c_i801 pcspkr ipmi_msghandler
lpc_ich xfs libcrc32c sd_mod mgag200 drm_kms_helper 
syscopyarea sysfillrect sysimgblt igb fb_sys_fops ttm dca drm crc32c_intel
megaraid_sas i2c_algo_bit cdc_ether usbnet mii dm_mirror dm_region_hash dm_log
dm_mod
[ 1297.525374] CPU: 20 PID: 26952 Comm: umount Not tainted 4.19.0-mainline #1

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-25  6:10 ` bugzilla-daemon
@ 2019-01-04 12:32   ` Brian Foster
  2019-01-04 12:52     ` Brian Foster
  2019-01-05 21:31     ` Dave Chinner
  0 siblings, 2 replies; 29+ messages in thread
From: Brian Foster @ 2019-01-04 12:32 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-xfs

On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202053
> 
> --- Comment #5 from Zorro Lang (zlang@redhat.com) ---
> (In reply to Zorro Lang from comment #4)
> > I never hit this bug before, only a similar bug which was fixed a year
> > ago by:
> > commit 40214d128e07dd21bb07a8ed6a7fe2f911281ab2
> > Author: Brian Foster <bfoster@redhat.com>
> > Date:   Fri Oct 13 09:47:46 2017 -0700
> > 
> >     xfs: trim writepage mapping to within eof
> > 
> > So I wonder if this is a regression?
> 
> I just reproduced this issue on kernel 4.19, so it's not a regression from
> v4.19:
> 
> [ 1297.449750] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) ||
> ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 954
> [ 1297.463147] WARNING: CPU: 20 PID: 26952 at fs/xfs/xfs_message.c:104
> assfail+0x54/0x57 [xfs]
> [ 1297.472473] Modules linked in: sunrpc intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif 
> intel_cstate intel_uncore iTCO_wdt iTCO_vendor_support ipmi_si sg
> intel_rapl_perf ipmi_devintf wmi ioatdma i2c_i801 pcspkr ipmi_msghandler
> lpc_ich xfs libcrc32c sd_mod mgag200 drm_kms_helper 
> syscopyarea sysfillrect sysimgblt igb fb_sys_fops ttm dca drm crc32c_intel
> megaraid_sas i2c_algo_bit cdc_ether usbnet mii dm_mirror dm_region_hash dm_log
> dm_mod
> [ 1297.525374] CPU: 20 PID: 26952 Comm: umount Not tainted 4.19.0-mainline #1
> 

I can reproduce this problem and it appears to be somewhat related to
the commit referenced above, mainly because the placement of the imap
trim leaves a larger than necessary window to race with external changes
to the extent map.

For example, a trace dump shows the following sequence of events:

- writepages is in progress on a particular file that has decently sized
  post-eof speculative preallocation
- writepages gets to the point where it looks up or allocates a new imap
  that includes the preallocation, the allocation/lookup result is
  stored in wpc
- the file is closed by one process, killing off preallocation, then
  immediately appended to by another, updating the file size by a few
  bytes
- writepages comes back around to xfs_map_blocks() and trims imap to the
  current size, but imap still includes one block of the original speculative
  prealloc (that was truncated and recreated) because the size increased
  between the time imap was stored and trimmed
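
For illustration, the userspace half of that sequence is essentially an
appending writer racing with a closer. The sketch below is hypothetical
(the path, buffer size and iteration counts are made up, and it is not
the generic/464 source); it only shows the shape of the race:

/*
 * Illustrative only: one process appends (extending i_size and creating
 * post-eof speculative preallocation) while another opens and closes
 * the same file; the close is what prompts XFS to trim the
 * preallocation out from under a cached writeback mapping.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/mnt/scratch/racer";	/* hypothetical path */
	pid_t pid = fork();

	if (pid == 0) {
		/* closer: each close may kill the appender's post-eof
		 * preallocation while writeback still holds a map for it */
		for (int i = 0; i < 100000; i++) {
			int fd = open(path, O_WRONLY | O_CREAT, 0644);
			if (fd >= 0)
				close(fd);
		}
		_exit(0);
	}

	/* appender: small extending writes keep writeback looking up
	 * maps that include speculative preallocation beyond EOF */
	int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	char buf[37];
	memset(buf, 'x', sizeof(buf));
	for (int i = 0; i < 100000; i++)
		if (write(fd, buf, sizeof(buf)) < 0)
			break;
	close(fd);
	waitpid(pid, NULL, 0);
	return 0;
}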

The EOF trim approach is known to be a bandaid and potentially racy, but
ISTM that this problem can be trivially avoided by moving or adding
trims of wpc->imap immediately after a new one is cached. I don't
reproduce the problem so far with a couple such extra calls in place.

Bigger picture, we need some kind of invalidation mechanism similar to
what we're already doing for dealing with the COW fork in this writeback
context. I'm not sure the broad semantics used by the COW fork sequence
counter mechanism is really suitable for the data fork because any
extent-related change in the fork would cause an invalidation, but I am
wondering if we could define some subset of less frequent operations for
the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
starters).

Brian

> -- 
> You are receiving this mail because:
> You are watching the assignee of the bug.

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
                   ` (4 preceding siblings ...)
  2018-12-25  6:10 ` bugzilla-daemon
@ 2019-01-04 12:40 ` bugzilla-daemon
  2019-01-04 12:52 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2019-01-04 12:40 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #6 from bfoster@redhat.com ---
On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org
wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=202053
> 
> --- Comment #5 from Zorro Lang (zlang@redhat.com) ---
> (In reply to Zorro Lang from comment #4)
> > I never hit this bug before, only a similar bug which was fixed a year
> > ago by:
> > commit 40214d128e07dd21bb07a8ed6a7fe2f911281ab2
> > Author: Brian Foster <bfoster@redhat.com>
> > Date:   Fri Oct 13 09:47:46 2017 -0700
> > 
> >     xfs: trim writepage mapping to within eof
> > 
> > So I wonder if this is a regression?
> 
> I just reproduced this issue on kernel 4.19, so it's not a regression from
> v4.19:
> 
> [ 1297.449750] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) ||
> ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 954
> [ 1297.463147] WARNING: CPU: 20 PID: 26952 at fs/xfs/xfs_message.c:104
> assfail+0x54/0x57 [xfs]
> [ 1297.472473] Modules linked in: sunrpc intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif 
> intel_cstate intel_uncore iTCO_wdt iTCO_vendor_support ipmi_si sg
> intel_rapl_perf ipmi_devintf wmi ioatdma i2c_i801 pcspkr ipmi_msghandler
> lpc_ich xfs libcrc32c sd_mod mgag200 drm_kms_helper 
> syscopyarea sysfillrect sysimgblt igb fb_sys_fops ttm dca drm crc32c_intel
> megaraid_sas i2c_algo_bit cdc_ether usbnet mii dm_mirror dm_region_hash
> dm_log
> dm_mod
> [ 1297.525374] CPU: 20 PID: 26952 Comm: umount Not tainted 4.19.0-mainline #1
> 

I can reproduce this problem and it appears to be somewhat related to
the commit referenced above, mainly because the placement of the imap
trim leaves a larger than necessary window to race with external changes
to the extent map.

For example, a trace dump shows the following sequence of events:

- writepages is in progress on a particular file that has decently sized
  post-eof speculative preallocation
- writepages gets to the point where it looks up or allocates a new imap
  that includes the preallocation, the allocation/lookup result is
  stored in wpc
- the file is closed by one process, killing off preallocation, then
  immediately appended to by another, updating the file size by a few
  bytes
- writepages comes back around to xfs_map_blocks() and trims imap to the
  current size, but imap still includes one block of the original speculative
  prealloc (that was truncated and recreated) because the size increased
  between the time imap was stored and trimmed

The EOF trim approach is known to be a bandaid and potentially racy, but
ISTM that this problem can be trivially avoided by moving or adding
trims of wpc->imap immediately after a new one is cached. I don't
reproduce the problem so far with a couple such extra calls in place.

Bigger picture, we need some kind of invalidation mechanism similar to
what we're already doing for dealing with the COW fork in this writeback
context. I'm not sure the broad semantics used by the COW fork sequence
counter mechanism is really suitable for the data fork because any
extent-related change in the fork would cause an invalidation, but I am
wondering if we could define some subset of less frequent operations for
the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
starters).

Brian

> -- 
> You are receiving this mail because:
> You are watching the assignee of the bug.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-04 12:32   ` Brian Foster
@ 2019-01-04 12:52     ` Brian Foster
  2019-01-05 21:31     ` Dave Chinner
  1 sibling, 0 replies; 29+ messages in thread
From: Brian Foster @ 2019-01-04 12:52 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: linux-xfs

On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=202053
> > 
> > --- Comment #5 from Zorro Lang (zlang@redhat.com) ---
> > (In reply to Zorro Lang from comment #4)
> > > I never hit this bug before, only a similar bug which was fixed a year
> > > ago by:
> > > commit 40214d128e07dd21bb07a8ed6a7fe2f911281ab2
> > > Author: Brian Foster <bfoster@redhat.com>
> > > Date:   Fri Oct 13 09:47:46 2017 -0700
> > > 
> > >     xfs: trim writepage mapping to within eof
> > > 
> > > So I wonder if this is a regression?
> > 
> > I just reproduced this issue on kernel 4.19, so it's not a regression from
> > v4.19:
> > 
> > [ 1297.449750] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) ||
> > ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 954
> > [ 1297.463147] WARNING: CPU: 20 PID: 26952 at fs/xfs/xfs_message.c:104
> > assfail+0x54/0x57 [xfs]
> > [ 1297.472473] Modules linked in: sunrpc intel_rapl sb_edac
> > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif 
> > intel_cstate intel_uncore iTCO_wdt iTCO_vendor_support ipmi_si sg
> > intel_rapl_perf ipmi_devintf wmi ioatdma i2c_i801 pcspkr ipmi_msghandler
> > lpc_ich xfs libcrc32c sd_mod mgag200 drm_kms_helper 
> > syscopyarea sysfillrect sysimgblt igb fb_sys_fops ttm dca drm crc32c_intel
> > megaraid_sas i2c_algo_bit cdc_ether usbnet mii dm_mirror dm_region_hash dm_log
> > dm_mod
> > [ 1297.525374] CPU: 20 PID: 26952 Comm: umount Not tainted 4.19.0-mainline #1
> > 
> 
> I can reproduce this problem and it appears to be somewhat related to
> the commit referenced above, mainly because the placement of the imap
> trim leaves a larger than necessary window to race with external changes
> to the extent map.
> 
> For example, a trace dump shows the following sequence of events:
> 
> - writepages is in progress on a particular file that has decently sized
>   post-eof speculative preallocation
> - writepages gets to the point where it looks up or allocates a new imap
>   that includes the preallocation, the allocation/lookup result is
>   stored in wpc
> - the file is closed by one process, killing off preallocation, then
>   immediately appended to by another, updating the file size by a few
>   bytes
> - writepages comes back around to xfs_map_blocks() and trims imap to the
>   current size, but imap still includes one block of the original speculative
>   prealloc (that was truncated and recreated) because the size increased
>   between the time imap was stored and trimmed
> 
> The EOF trim approach is known to be a bandaid and potentially racy, but
> ISTM that this problem can be trivially avoided by moving or adding
> trims of wpc->imap immediately after a new one is cached. I don't
> reproduce the problem so far with a couple such extra calls in place.
> 
> Bigger picture, we need some kind of invalidation mechanism similar to
> what we're already doing for dealing with the COW fork in this writeback
> context. I'm not sure the broad semantics used by the COW fork sequence
> counter mechanism is really suitable for the data fork because any
> extent-related change in the fork would cause an invalidation, but I am
> wondering if we could define some subset of less frequent operations for
> the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> starters).
> 

Zorro,

Can you still reproduce with the following patch?

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..d9048bcea49c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -449,6 +449,7 @@ xfs_map_blocks(
 	}
 
 	wpc->imap = imap;
+	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 allocate_blocks:
@@ -459,6 +460,7 @@ xfs_map_blocks(
 	ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
 	       imap.br_startoff + imap.br_blockcount <= cow_fsb);
 	wpc->imap = imap;
+	xfs_trim_extent_eof(&wpc->imap, ip);
 	trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
 	return 0;
 }

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
                   ` (5 preceding siblings ...)
  2019-01-04 12:40 ` bugzilla-daemon
@ 2019-01-04 12:52 ` bugzilla-daemon
  2019-01-05 21:31 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2019-01-04 12:52 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #7 from bfoster@redhat.com ---
On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org
> wrote:
> > https://bugzilla.kernel.org/show_bug.cgi?id=202053
> > 
> > --- Comment #5 from Zorro Lang (zlang@redhat.com) ---
> > (In reply to Zorro Lang from comment #4)
> > > I never hit this bug before, only a similar bug which was fixed a year
> > > ago by:
> > > commit 40214d128e07dd21bb07a8ed6a7fe2f911281ab2
> > > Author: Brian Foster <bfoster@redhat.com>
> > > Date:   Fri Oct 13 09:47:46 2017 -0700
> > > 
> > >     xfs: trim writepage mapping to within eof
> > > 
> > > So I wonder if this is a regression?
> > 
> > I just reproduced this issue on kernel 4.19, so it's not a regression from
> > v4.19:
> > 
> > [ 1297.449750] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) ||
> > ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 954
> > [ 1297.463147] WARNING: CPU: 20 PID: 26952 at fs/xfs/xfs_message.c:104
> > assfail+0x54/0x57 [xfs]
> > [ 1297.472473] Modules linked in: sunrpc intel_rapl sb_edac
> > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif 
> > intel_cstate intel_uncore iTCO_wdt iTCO_vendor_support ipmi_si sg
> > intel_rapl_perf ipmi_devintf wmi ioatdma i2c_i801 pcspkr ipmi_msghandler
> > lpc_ich xfs libcrc32c sd_mod mgag200 drm_kms_helper 
> > syscopyarea sysfillrect sysimgblt igb fb_sys_fops ttm dca drm crc32c_intel
> > megaraid_sas i2c_algo_bit cdc_ether usbnet mii dm_mirror dm_region_hash
> dm_log
> > dm_mod
> > [ 1297.525374] CPU: 20 PID: 26952 Comm: umount Not tainted 4.19.0-mainline
> #1
> > 
> 
> I can reproduce this problem and it appears to be somewhat related to
> the commit referenced above, mainly because the placement of the imap
> trim leaves a larger than necessary window to race with external changes
> to the extent map.
> 
> For example, a trace dump shows the following sequence of events:
> 
> - writepages is in progress on a particular file that has decently sized
>   post-eof speculative preallocation
> - writepages gets to the point where it looks up or allocates a new imap
>   that includes the preallocation, the allocation/lookup result is
>   stored in wpc
> - the file is closed by one process, killing off preallocation, then
>   immediately appended to by another, updating the file size by a few
>   bytes
> - writepages comes back around to xfs_map_blocks() and trims imap to the
>   current size, but imap still includes one block of the original speculative
>   prealloc (that was truncated and recreated) because the size increased
>   between the time imap was stored and trimmed
> 
> The EOF trim approach is known to be a bandaid and potentially racy, but
> ISTM that this problem can be trivially avoided by moving or adding
> trims of wpc->imap immediately after a new one is cached. I don't
> reproduce the problem so far with a couple such extra calls in place.
> 
> Bigger picture, we need some kind of invalidation mechanism similar to
> what we're already doing for dealing with the COW fork in this writeback
> context. I'm not sure the broad semantics used by the COW fork sequence
> counter mechanism is really suitable for the data fork because any
> extent-related change in the fork would cause an invalidation, but I am
> wondering if we could define some subset of less frequent operations for
> the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> starters).
> 

Zorro,

Can you still reproduce with the following patch?

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 338b9d9984e0..d9048bcea49c 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -449,6 +449,7 @@ xfs_map_blocks(
        }

        wpc->imap = imap;
+       xfs_trim_extent_eof(&wpc->imap, ip);
        trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
        return 0;
 allocate_blocks:
@@ -459,6 +460,7 @@ xfs_map_blocks(
        ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
               imap.br_startoff + imap.br_blockcount <= cow_fsb);
        wpc->imap = imap;
+       xfs_trim_extent_eof(&wpc->imap, ip);
        trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
        return 0;
 }

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-04 12:32   ` Brian Foster
  2019-01-04 12:52     ` Brian Foster
@ 2019-01-05 21:31     ` Dave Chinner
  2019-01-06 21:57       ` Dave Chinner
  2019-01-07 14:41       ` Brian Foster
  1 sibling, 2 replies; 29+ messages in thread
From: Dave Chinner @ 2019-01-05 21:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: bugzilla-daemon, linux-xfs

On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> - writepages is in progress on a particular file that has decently sized
>   post-eof speculative preallocation
> - writepages gets to the point where it looks up or allocates a new imap
>   that includes the preallocation, the allocation/lookup result is
>   stored in wpc
> - the file is closed by one process, killing off preallocation, then
>   immediately appended to by another, updating the file size by a few
>   bytes
> - writepages comes back around to xfs_map_blocks() and trims imap to the
>   current size, but imap still includes one block of the original speculative
>   prealloc (that was truncated and recreated) because the size increased
>   between the time imap was stored and trimmed

I'm betting hole punch can cause similar problems with cached maps
in writepage. I wrote a patch yonks ago to put a generation number
in the extent fork and to store it in the cached map, and to
invalidate the cached map if they didn't match.

> The EOF trim approach is known to be a bandaid and potentially racy, but
> ISTM that this problem can be trivially avoided by moving or adding
> trims of wpc->imap immediately after a new one is cached. I don't
> reproduce the problem so far with a couple such extra calls in place.
> 
> Bigger picture, we need some kind of invalidation mechanism similar to
> what we're already doing for dealing with the COW fork in this writeback
> context. I'm not sure the broad semantics used by the COW fork sequence
> counter mechanism is really suitable for the data fork because any
> extent-related change in the fork would cause an invalidation, but I am
> wondering if we could define some subset of less frequent operations for
> the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> starters).

The patch I had is below - I haven't forward ported it or anything,
just pulled it from my archive to demonstrate what I think we
probably need to be doing here. If we want to limit when it causes
invalidations, then we probably need to limit which operations
cause the generation number for that inode fork to be bumped.
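
Distilled down, the idea looks something like the sketch below. This is
an illustration only, using simplified userspace types and C11 atomics
in place of the real xfs_ifork/xfs_bmbt_irec structures and kernel
barriers; the actual patch follows later in the thread:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct ifork {
	_Atomic uint32_t if_generation;	/* bumped on every extent change */
};

struct cached_map {
	uint32_t br_generation;	/* snapshot taken at lookup time */
	/* ... start/length/state of the cached mapping ... */
};

/* called by every extent list insert/remove/update/destroy */
void iext_generation_inc(struct ifork *ifp)
{
	atomic_fetch_add_explicit(&ifp->if_generation, 1,
				  memory_order_release);
}

/* lookups snapshot the current generation into the returned map */
void iext_lookup(struct ifork *ifp, struct cached_map *map)
{
	map->br_generation = atomic_load_explicit(&ifp->if_generation,
						  memory_order_acquire);
	/* ... fill in the rest of the mapping ... */
}

/* after the locks are retaken, reject the cached map if anything
 * changed in the fork since the lookup */
bool iext_generation_same(struct ifork *ifp, const struct cached_map *map)
{
	return atomic_load_explicit(&ifp->if_generation,
				    memory_order_acquire) == map->br_generation;
}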

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
                   ` (6 preceding siblings ...)
  2019-01-04 12:52 ` bugzilla-daemon
@ 2019-01-05 21:31 ` bugzilla-daemon
  2019-01-06 21:57 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2019-01-05 21:31 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #8 from Dave Chinner (david@fromorbit.com) ---
On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org
> wrote:
> - writepages is in progress on a particular file that has decently sized
>   post-eof speculative preallocation
> - writepages gets to the point where it looks up or allocates a new imap
>   that includes the preallocation, the allocation/lookup result is
>   stored in wpc
> - the file is closed by one process, killing off preallocation, then
>   immediately appended to by another, updating the file size by a few
>   bytes
> - writepages comes back around to xfs_map_blocks() and trims imap to the
>   current size, but imap still includes one block of the original speculative
>   prealloc (that was truncated and recreated) because the size increased
>   between the time imap was stored and trimmed

I'm betting hole punch can cause similar problems with cached maps
in writepage. I wrote a patch yonks ago to put a generation number
in the extent fork and to store it in the cached map, and to
invalidate the cached map if they didn't match.

> The EOF trim approach is known to be a bandaid and potentially racy, but
> ISTM that this problem can be trivially avoided by moving or adding
> trims of wpc->imap immediately after a new one is cached. I don't
> reproduce the problem so far with a couple such extra calls in place.
> 
> Bigger picture, we need some kind of invalidation mechanism similar to
> what we're already doing for dealing with the COW fork in this writeback
> context. I'm not sure the broad semantics used by the COW fork sequence
> counter mechanism is really suitable for the data fork because any
> extent-related change in the fork would cause an invalidation, but I am
> wondering if we could define some subset of less frequent operations for
> the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> starters).

The patch I had is below - I haven't forward ported it or anything,
just pulled it from my archive to demonstrate what I think we
probably need to be doing here. If we want to limit when it causes
invalidations, then we probably need to limit which operations
cause the generation number for that inode fork to be bumped.

Cheers,

Dave.

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-05 21:31     ` Dave Chinner
@ 2019-01-06 21:57       ` Dave Chinner
  2019-01-07 14:41         ` Brian Foster
  2019-01-07 14:41       ` Brian Foster
  1 sibling, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2019-01-06 21:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: bugzilla-daemon, linux-xfs

On Sun, Jan 06, 2019 at 08:31:20AM +1100, Dave Chinner wrote:
> On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> > On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > - writepages is in progress on a particular file that has decently sized
> >   post-eof speculative preallocation
> > - writepages gets to the point where it looks up or allocates a new imap
> >   that includes the preallocation, the allocation/lookup result is
> >   stored in wpc
> > - the file is closed by one process, killing off preallocation, then
> >   immediately appended to by another, updating the file size by a few
> >   bytes
> > - writepages comes back around to xfs_map_blocks() and trims imap to the
> >   current size, but imap still includes one block of the original speculative
> >   prealloc (that was truncated and recreated) because the size increased
> >   between the time imap was stored and trimmed
> 
> I'm betting hole punch can cause similar problems with cached maps
> in writepage. I wrote a patch yonks ago to put a generation number
> in the extent fork and to store it in the cached map, and to
> invalidate the cached map if they didn't match.
> 
> > The EOF trim approach is known to be a bandaid and potentially racy, but
> > ISTM that this problem can be trivially avoided by moving or adding
> > trims of wpc->imap immediately after a new one is cached. I don't
> > reproduce the problem so far with a couple such extra calls in place.
> > 
> > Bigger picture, we need some kind of invalidation mechanism similar to
> > what we're already doing for dealing with the COW fork in this writeback
> > context. I'm not sure the broad semantics used by the COW fork sequence
> > counter mechanism is really suitable for the data fork because any
> > extent-related change in the fork would cause an invalidation, but I am
> > wondering if we could define some subset of less frequent operations for
> > the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> > starters).
> 
> The patch I had is below - I haven't forward ported it or anything,
> just pulled it from my archive to demonstrate what I think we
> probably need to be doing here. If we want to limit when it causes
> invalidations, then we probably need to limit which operations
> cause the generation number for that inode fork to be bumped.

Ugh. Didn't attach patch.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: add generation number for cached imap invalidation

From: Dave Chinner <dchinner@redhat.com>

Because writepage caches an extent map beyond the point where the locks
that keep it coherent are held, it can race with truncate and other
extent invalidations. This means we can attempt to write to ranges that
have been freed and/or use pages that have been unmapped.

The band-aid of using xfs_trim_extent_eof() to invalidate mappings
introduces a new race condition between truncate_setsize() and
a page under writeback - the locked page can allocate a valid map
because it's over a delalloc region, but between xfs_map_blocks()
allocating the extent and xfs_imap_valid() checking it, the
inode size can change and that will result in xfs_imap_valid()
saying the map is invalid.

Hence we get spurious writeback failures of correctly mapped
pages/buffers with allocated extents underneath them. This triggers
assert failures in writeback that are false positives - it occurs
when truncate_setsize() is waiting for IO on the page to complete
before allowing invalidation of the page to occur. Removal of the
extent being written into doesn't occur until after the page cache
truncation is completed, so xfs_trim_extent_eof() is acting
prematurely in this case.

To fix this, what we really need to do is invalidate the cached map
when the extent list changes, not when the vfs inode size changes.
Add an extent list generation number to the in-core extent list
that is bumped on every change to the extent list, and include it in
all imaps returned from xfs_bmapi_read().

When we go to use the information in the imap, and we've dropped the
locks that keep the map coherent, we need to recheck that the
generation is still valid once we regain the inode locks. If the
generation number has changed, we can't use the cached mapping
information and we need to back out and restart whatever operation
we are caching the map for.

In theory, this can be done with seqlocks, but I couldn't get my
head around how to do this sanely. Generation numbers are simple
to understand and implement...

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 08df809e2315..1ec408007cab 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3691,6 +3691,7 @@ xfs_bmapi_trim_map(
 	mval->br_blockcount = XFS_FILBLKS_MIN(end - *bno,
 			got->br_blockcount - (*bno - got->br_startoff));
 	mval->br_state = got->br_state;
+	mval->br_generation = got->br_generation;
 	ASSERT(mval->br_blockcount <= len);
 	return;
 }
@@ -3799,6 +3800,7 @@ xfs_bmapi_read(
 		mval->br_startblock = HOLESTARTBLOCK;
 		mval->br_blockcount = len;
 		mval->br_state = XFS_EXT_NORM;
+		mval->br_generation = xfs_iext_generation_fetch(ifp);
 		*nmap = 1;
 		return 0;
 	}
@@ -3815,9 +3817,14 @@ xfs_bmapi_read(
 	obno = bno;
 
 	while (bno < end && n < *nmap) {
-		/* Reading past eof, act as though there's a hole up to end. */
-		if (eof)
+		/*
+		 * If we are mapping past the last extent in the fork, act as
+		 * though there's a hole up to end of the requested mapping.
+		 */
+		if (eof) {
 			got.br_startoff = end;
+			got.br_generation = xfs_iext_generation_fetch(ifp);
+		}
 		if (got.br_startoff > bno) {
 			/* Reading in a hole.  */
 			mval->br_startoff = bno;
@@ -3825,6 +3832,7 @@ xfs_bmapi_read(
 			mval->br_blockcount =
 				XFS_FILBLKS_MIN(len, got.br_startoff - bno);
 			mval->br_state = XFS_EXT_NORM;
+			mval->br_generation = got.br_generation;
 			bno += mval->br_blockcount;
 			len -= mval->br_blockcount;
 			mval++;
diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
index 343a94246f5b..bc52269b7973 100644
--- a/fs/xfs/libxfs/xfs_iext_tree.c
+++ b/fs/xfs/libxfs/xfs_iext_tree.c
@@ -663,6 +663,7 @@ xfs_iext_insert(
 
 	if (new)
 		xfs_iext_insert_node(ifp, xfs_iext_leaf_key(new, 0), new, 2);
+	xfs_iext_generation_inc(ifp);
 }
 
 static struct xfs_iext_node *
@@ -890,6 +891,7 @@ xfs_iext_remove(
 		cur->pos = 0;
 	}
 
+	xfs_iext_generation_inc(ifp);
 	if (nr_entries >= RECS_PER_LEAF / 2)
 		return;
 
@@ -944,6 +946,7 @@ xfs_iext_lookup_extent(
 		return false;
 found:
 	xfs_iext_get(gotp, cur_rec(cur));
+	gotp->br_generation = xfs_iext_generation_fetch(ifp);
 	return true;
 }
 
@@ -990,6 +993,7 @@ xfs_iext_update_extent(
 
 	trace_xfs_bmap_pre_update(ip, cur, state, _RET_IP_);
 	xfs_iext_set(cur_rec(cur), new);
+	xfs_iext_generation_inc(ifp);
 	trace_xfs_bmap_post_update(ip, cur, state, _RET_IP_);
 }
 
@@ -1006,6 +1010,7 @@ xfs_iext_get_extent(
 	if (!xfs_iext_valid(ifp, cur))
 		return false;
 	xfs_iext_get(gotp, cur_rec(cur));
+	gotp->br_generation = xfs_iext_generation_fetch(ifp);
 	return true;
 }
 
@@ -1040,4 +1045,5 @@ xfs_iext_destroy(
 	ifp->if_bytes = 0;
 	ifp->if_height = 0;
 	ifp->if_u1.if_root = NULL;
+	xfs_iext_generation_inc(ifp);
 }
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index b9f0098e33b8..3a1c1dcb8b54 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -28,6 +28,7 @@ typedef struct xfs_ifork {
 	int			if_bytes;	/* bytes in if_u1 */
 	int			if_real_bytes;	/* bytes allocated in if_u1 */
 	struct xfs_btree_block	*if_broot;	/* file's incore btree root */
+	uint32_t		if_generation;	/* extent list generation # */
 	short			if_broot_bytes;	/* bytes allocated for root */
 	unsigned char		if_flags;	/* per-fork flags */
 	int			if_height;	/* height of the extent tree */
@@ -186,4 +187,32 @@ extern struct kmem_zone	*xfs_ifork_zone;
 
 extern void xfs_ifork_init_cow(struct xfs_inode *ip);
 
+/*
+ * Bump the fork's generation count and drop a write barrier so it can
+ * be checked without holding locks
+ */
+static inline void
+xfs_iext_generation_inc(
+	struct xfs_ifork	*ifp)
+{
+	ifp->if_generation++;
+	smp_wmb();
+}
+
+static inline uint32_t
+xfs_iext_generation_fetch(
+	struct xfs_ifork	*ifp)
+{
+	smp_rmb();
+	return READ_ONCE(ifp->if_generation);
+}
+
+static inline bool
+xfs_iext_generation_same(
+	struct xfs_ifork	*ifp,
+	uint32_t		val)
+{
+	return (val == xfs_iext_generation_fetch(ifp));
+}
+
 #endif	/* __XFS_INODE_FORK_H__ */
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 3c560695c546..781ca5252fe4 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -151,12 +151,18 @@ typedef enum {
 	XFS_EXT_NORM, XFS_EXT_UNWRITTEN,
 } xfs_exntst_t;
 
+/*
+ * There is a generation number added to the incore BMBT record structure for
+ * determining when cached records held outside of the extent list locks are
+ * still valid.  This is only filled out on lookups, and is a read-only value.
+ */
 typedef struct xfs_bmbt_irec
 {
 	xfs_fileoff_t	br_startoff;	/* starting file offset */
 	xfs_fsblock_t	br_startblock;	/* starting block number */
 	xfs_filblks_t	br_blockcount;	/* number of blocks */
 	xfs_exntst_t	br_state;	/* extent state */
+	uint32_t	br_generation;	/* extent list generation number */
 } xfs_bmbt_irec_t;
 
 #endif	/* __XFS_TYPES_H__ */
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 0ca511857f7c..4504e8ecbae1 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -793,6 +793,7 @@ xfs_map_blocks(
 			imap->br_startoff = offset_fsb;
 			imap->br_blockcount = end_fsb - offset_fsb;
 			imap->br_startblock = HOLESTARTBLOCK;
+			imap->br_generation = xfs_iext_generation_fetch(&ip->i_df);
 			*type = XFS_IO_HOLE;
 		} else if (imap->br_startblock == HOLESTARTBLOCK) {
 			/* landed in a hole */
@@ -864,23 +865,46 @@ STATIC bool
 xfs_imap_valid(
 	struct inode		*inode,
 	struct xfs_bmbt_irec	*imap,
-	xfs_off_t		offset)
+	xfs_off_t		offset,
+	int			type)
 {
-	offset >>= inode->i_blkbits;
+	struct xfs_ifork	*ifp;
+
 
 	/*
-	 * We have to make sure the cached mapping is within EOF to protect
-	 * against eofblocks trimming on file release leaving us with a stale
-	 * mapping. Otherwise, a page for a subsequent file extending buffered
-	 * write could get picked up by this writeback cycle and written to the
-	 * wrong blocks.
+	 * We have to make sure the cached mapping hasn't been invalidated by an
+	 * extent list modification (e.g. truncate, releasing post-eof blocks,
+	 * etc). If the extent list has changed, we need to get a new map.
+	 *
+	 * Removing a range of a file will clear extents from both the data and
+	 * COW fork. That means a single extent range removal will bump both the
+	 * data and cow fork generation numbers if such extents exist in the
+	 * range. Hence we can safely check against the generation number of the
+	 * specific fork we are operating on here as it will flag an
+	 * invalidation if the removal affected the specific fork we are doing
+	 * IO from.
 	 *
-	 * Note that what we really want here is a generic mapping invalidation
-	 * mechanism to protect us from arbitrary extent modifying contexts, not
-	 * just eofblocks.
+	 * Finally, page invalidation will block on the page lock we are
+	 * holding, but truncate may have already changed the file size. Hence
+	 * we could be beyond the new EOF here, but truncate can't proceed until
+	 * we finish IO on this page. Checking against EOF right now means we
+	 * have to unwind and discard the page, and we have to do similar checks
+	 * everywhere we pick up the inode locks again.
+	 *
+	 * However, it's still valid to allocate and write the data here because
+	 * we have valid data and the extent removal after the page
+	 * invalidation completes will clean up for us. Ignoring this size
+	 * change race condition and using extent map coherency checks to
+	 * determine whether to proceed or not makes everything in the writepage
+	 * path simpler.
 	 */
-	xfs_trim_extent_eof(imap, XFS_I(inode));
+	ifp = XFS_IFORK_PTR(XFS_I(inode), type == XFS_IO_COW ? XFS_COW_FORK
+							     : XFS_DATA_FORK);
+	if (!xfs_iext_generation_same(ifp, imap->br_generation))
+		return false;
 
+
+	offset >>= inode->i_blkbits;
 	return offset >= imap->br_startoff &&
 		offset < imap->br_startoff + imap->br_blockcount;
 }
@@ -932,6 +956,7 @@ xfs_writepage_map(
 	for (poffset = 0;
 	     poffset < PAGE_SIZE;
 	     poffset += len, file_offset += len, bh = bh->b_this_page) {
+		int retries = 0;
 
 		/* past the range we are writing, so nothing more to write. */
 		if (file_offset >= end_offset)
@@ -954,24 +979,54 @@ xfs_writepage_map(
 		/* Check to see if current map spans this file offset */
 		if (wpc->imap_valid)
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 file_offset);
+						 file_offset, wpc->io_type);
 
 		/*
-		 * If we don't have a valid map, now it's time to get
-		 * a new one for this offset. This will allocate delalloc
-		 * regions or COW shared extents. If we return without a valid
-		 * map, it means we landed in a hole and we skip the block.
+		 * If we don't have a valid map, now it's time to get a new one
+		 * for this offset. This will allocate delalloc regions or COW
+		 * shared extents. If we return without a valid map, it means we
+		 * raced with another extent operation.
+		 *
+		 * If we are racing with invalidation, it will wait for us to
+		 * release the page lock which means allocation and IO is ok to
+		 * finish off.
+		 *
+		 * Hence we loop here trying to remap the page until we get a
+		 * valid map that hasn't been changed by the time we check it.
 		 */
-		if (!wpc->imap_valid) {
+		while (!wpc->imap_valid) {
+			if (!(++retries % 10)) {
+				xfs_warn_ratelimited(XFS_I(inode)->i_mount,
+					"%s: looping %d times", __func__, retries);
+			}
+
 			error = xfs_map_blocks(inode, file_offset, &wpc->imap,
 					     &wpc->io_type);
+			if (error == -EAGAIN) {
+				/*
+				 * Raced with another extent list modification
+				 * between map and allocate operations. Back off
+				 * for a short while, try again.
+				 */
+				msleep(1);
+				continue;
+			}
 			if (error)
 				goto out;
+
+			/*
+			 * We can still race with another extent list mod by the
+			 * time we get back here. It may have been a truncate,
+			 * so if the extent is not valid now, we need to map
+			 * again, knowing the truncate will be complete before
+			 * the next map is executed and so any future
+			 * invalidation mod will block on the page lock we hold
+			 */
 			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
-							 file_offset);
+						 file_offset, wpc->io_type);
 		}
 
-		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
+		if (wpc->io_type == XFS_IO_HOLE) {
 			/*
 			 * set_page_dirty dirties all buffers in a page, independent
 			 * of their state.  The dirty state however is entirely
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index ad48e2f24699..3383287633ff 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -674,8 +674,8 @@ xfs_iomap_write_allocate(
 	xfs_bmbt_irec_t *imap)
 {
 	xfs_mount_t	*mp = ip->i_mount;
-	xfs_fileoff_t	offset_fsb, last_block;
-	xfs_fileoff_t	end_fsb, map_start_fsb;
+	xfs_fileoff_t	offset_fsb;
+	xfs_fileoff_t	map_start_fsb;
 	xfs_fsblock_t	first_block;
 	struct xfs_defer_ops	dfops;
 	xfs_filblks_t	count_fsb;
@@ -730,56 +730,28 @@ xfs_iomap_write_allocate(
 			xfs_defer_init(&dfops, &first_block);
 
 			/*
-			 * it is possible that the extents have changed since
-			 * we did the read call as we dropped the ilock for a
-			 * while. We have to be careful about truncates or hole
-			 * punchs here - we are not allowed to allocate
-			 * non-delalloc blocks here.
-			 *
-			 * The only protection against truncation is the pages
-			 * for the range we are being asked to convert are
-			 * locked and hence a truncate will block on them
-			 * first.
-			 *
-			 * As a result, if we go beyond the range we really
-			 * need and hit an delalloc extent boundary followed by
-			 * a hole while we have excess blocks in the map, we
-			 * will fill the hole incorrectly and overrun the
-			 * transaction reservation.
+			 * Now we've locked the inode, we can determine if the
+			 * region we are attempting to allocate over is still
+			 * valid. If it's potentially invalid, then generation
+			 * numbers won't match and we need to back out
+			 * completely and let the caller decide what to do.
 			 *
-			 * Using a single map prevents this as we are forced to
-			 * check each map we look for overlap with the desired
-			 * range and abort as soon as we find it. Also, given
-			 * that we only return a single map, having one beyond
-			 * what we can return is probably a bit silly.
-			 *
-			 * We also need to check that we don't go beyond EOF;
-			 * this is a truncate optimisation as a truncate sets
-			 * the new file size before block on the pages we
-			 * currently have locked under writeback. Because they
-			 * are about to be tossed, we don't need to write them
-			 * back....
+			 * This will occur if we race with truncate or some
+			 * other extent manipulation while we don't have the
+			 * inode locked.
 			 */
-			nimaps = 1;
-			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
-			error = xfs_bmap_last_offset(ip, &last_block,
-							XFS_DATA_FORK);
-			if (error)
+			if (!xfs_iext_generation_same(
+					XFS_IFORK_PTR(ip, whichfork),
+					imap->br_generation)) {
+				error = -EAGAIN;
 				goto trans_cancel;
-
-			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
-			if ((map_start_fsb + count_fsb) > last_block) {
-				count_fsb = last_block - map_start_fsb;
-				if (count_fsb == 0) {
-					error = -EAGAIN;
-					goto trans_cancel;
-					goto trans_cancel;
-				}
 			}
 
 			/*
 			 * From this point onwards we overwrite the imap
 			 * pointer that the caller gave to us.
 			 */
+			nimaps = 1;
 			error = xfs_bmapi_write(tp, ip, map_start_fsb,
 						count_fsb, flags, &first_block,
 						nres, imap, &nimaps,

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
                   ` (7 preceding siblings ...)
  2019-01-05 21:31 ` bugzilla-daemon
@ 2019-01-06 21:57 ` bugzilla-daemon
  2019-01-07  2:35 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2019-01-06 21:57 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #9 from Dave Chinner (david@fromorbit.com) ---
On Sun, Jan 06, 2019 at 08:31:20AM +1100, Dave Chinner wrote:
> On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> > On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > - writepages is in progress on a particular file that has decently sized
> >   post-eof speculative preallocation
> > - writepages gets to the point where it looks up or allocates a new imap
> >   that includes the preallocation, the allocation/lookup result is
> >   stored in wpc
> > - the file is closed by one process, killing off preallocation, then
> >   immediately appended to by another, updating the file size by a few
> >   bytes
> > - writepages comes back around to xfs_map_blocks() and trims imap to the
> >   current size, but imap still includes one block of the original
> >   speculative prealloc (that was truncated and recreated) because the size
> >   increased between the time imap was stored and trimmed
> 
> I'm betting hole punch can cause similar problems with cached maps
> in writepage. I wrote a patch yonks ago to put a generation number
> in the extent fork and to store it in the cached map, and to
> invalidate the cached map if they didn't match.
> 
> > The EOF trim approach is known to be a bandaid and potentially racy, but
> > ISTM that this problem can be trivially avoided by moving or adding
> > trims of wpc->imap immediately after a new one is cached. I don't
> > reproduce the problem so far with a couple such extra calls in place.
> > 
> > Bigger picture, we need some kind of invalidation mechanism similar to
> > what we're already doing for dealing with the COW fork in this writeback
> > context. I'm not sure the broad semantics used by the COW fork sequence
> > counter mechanism is really suitable for the data fork because any
> > extent-related change in the fork would cause an invalidation, but I am
> > wondering if we could define some subset of less frequent operations for
> > the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> > starters).
> 
> The patch I had is below - I haven't forward ported it or anything,
> just pulled it from my archive to demonstrate what I think we
> probably need to be doing here. If we want to limit when it causes
> invalidations, then we probably need to limit which operations
> cause the generation number for that inode fork to be bumped.

Ugh. Didn't attach patch.

-Dave.
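
For illustration, a minimal sketch of that gating idea (hypothetical
helper shown here; the patch as posted bumps the generation
unconditionally):

static inline void
xfs_iext_generation_inc_invalidating(
	struct xfs_ifork	*ifp,
	bool			invalidating)
{
	/*
	 * Hypothetical gate: skip the bump for pure appends/inserts so
	 * sustained buffered writes don't keep invalidating cached maps.
	 */
	if (!invalidating)
		return;
	ifp->if_generation++;
	smp_wmb();	/* pairs with smp_rmb() in xfs_iext_generation_fetch() */
}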

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
                   ` (8 preceding siblings ...)
  2019-01-06 21:57 ` bugzilla-daemon
@ 2019-01-07  2:35 ` bugzilla-daemon
  2019-01-07 14:41 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2019-01-07  2:35 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #10 from Zorro Lang (zlang@redhat.com) ---
(In reply to bfoster from comment #7)
> On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> > On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > https://bugzilla.kernel.org/show_bug.cgi?id=202053
> > > 
> > > --- Comment #5 from Zorro Lang (zlang@redhat.com) ---
> > > (In reply to Zorro Lang from comment #4)
> > > > I never hit this bug before, just a similar bug which has been fixed
> > > > one year ago, by:
> > > > commit 40214d128e07dd21bb07a8ed6a7fe2f911281ab2
> > > > Author: Brian Foster <bfoster@redhat.com>
> > > > Date:   Fri Oct 13 09:47:46 2017 -0700
> > > > 
> > > >     xfs: trim writepage mapping to within eof
> > > > 
> > > > So I doubt if this is a regression issue?
> > > 
> > > I just reproduced this issue on kernel 4.19, so it's not a regression
> > > from v4.19:
> > > 
> > > [ 1297.449750] XFS: Assertion failed: XFS_FORCED_SHUTDOWN(ip->i_mount) ||
> > > ip->i_delayed_blks == 0, file: fs/xfs/xfs_super.c, line: 954
> > > [ 1297.463147] WARNING: CPU: 20 PID: 26952 at fs/xfs/xfs_message.c:104
> > > assfail+0x54/0x57 [xfs]
> > > [ 1297.472473] Modules linked in: sunrpc intel_rapl sb_edac
> > > x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> > > crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_ssif 
> > > intel_cstate intel_uncore iTCO_wdt iTCO_vendor_support ipmi_si sg
> > > intel_rapl_perf ipmi_devintf wmi ioatdma i2c_i801 pcspkr ipmi_msghandler
> > > lpc_ich xfs libcrc32c sd_mod mgag200 drm_kms_helper 
> > > syscopyarea sysfillrect sysimgblt igb fb_sys_fops ttm dca drm crc32c_intel
> > > megaraid_sas i2c_algo_bit cdc_ether usbnet mii dm_mirror dm_region_hash
> > > dm_log dm_mod
> > > [ 1297.525374] CPU: 20 PID: 26952 Comm: umount Not tainted 4.19.0-mainline #1
> > > 
> > 
> > I can reproduce this problem and it appears to be somewhat related to
> > the commit referenced above, mainly because the placement of the imap
> > trim leaves a larger than necessary window to race with external changes
> > to the extent map.
> > 
> > For example, a trace dump shows the following sequence of events:
> > 
> > - writepages is in progress on a particular file that has decently sized
> >   post-eof speculative preallocation
> > - writepages gets to the point where it looks up or allocates a new imap
> >   that includes the preallocation, the allocation/lookup result is
> >   stored in wpc
> > - the file is closed by one process, killing off preallocation, then
> >   immediately appended to by another, updating the file size by a few
> >   bytes
> > - writepages comes back around to xfs_map_blocks() and trims imap to the
> >   current size, but imap still includes one block of the original
> >   speculative prealloc (that was truncated and recreated) because the size
> >   increased between the time imap was stored and trimmed
> > 
> > The EOF trim approach is known to be a bandaid and potentially racy, but
> > ISTM that this problem can be trivially avoided by moving or adding
> > trims of wpc->imap immediately after a new one is cached. I don't
> > reproduce the problem so far with a couple such extra calls in place.
> > 
> > Bigger picture, we need some kind of invalidation mechanism similar to
> > what we're already doing for dealing with the COW fork in this writeback
> > context. I'm not sure the broad semantics used by the COW fork sequence
> > counter mechanism is really suitable for the data fork because any
> > extent-related change in the fork would cause an invalidation, but I am
> > wondering if we could define some subset of less frequent operations for
> > the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> > starters).
> > 
> 
> Zorro,
> 
> Can you still reproduce with the following patch?

Hi Brian, the patch below looks good. I've kept g/464 running for a day and
can't reproduce this bug now.

Thanks,
Zorro

> 
> Brian
> 
> --- 8< ---
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 338b9d9984e0..d9048bcea49c 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -449,6 +449,7 @@ xfs_map_blocks(
>       }
>  
>       wpc->imap = imap;
> +     xfs_trim_extent_eof(&wpc->imap, ip);
>       trace_xfs_map_blocks_found(ip, offset, count, wpc->io_type, &imap);
>       return 0;
>  allocate_blocks:
> @@ -459,6 +460,7 @@ xfs_map_blocks(
>       ASSERT(whichfork == XFS_COW_FORK || cow_fsb == NULLFILEOFF ||
>              imap.br_startoff + imap.br_blockcount <= cow_fsb);
>       wpc->imap = imap;
> +     xfs_trim_extent_eof(&wpc->imap, ip);
>       trace_xfs_map_blocks_alloc(ip, offset, count, wpc->io_type, &imap);
>       return 0;
>  }

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-05 21:31     ` Dave Chinner
  2019-01-06 21:57       ` Dave Chinner
@ 2019-01-07 14:41       ` Brian Foster
  2019-01-08  5:46         ` Dave Chinner
  1 sibling, 1 reply; 29+ messages in thread
From: Brian Foster @ 2019-01-07 14:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bugzilla-daemon, linux-xfs

On Sun, Jan 06, 2019 at 08:31:20AM +1100, Dave Chinner wrote:
> On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> > On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > - writepages is in progress on a particular file that has decently sized
> >   post-eof speculative preallocation
> > - writepages gets to the point where it looks up or allocates a new imap
> >   that includes the preallocation, the allocation/lookup result is
> >   stored in wpc
> > - the file is closed by one process, killing off preallocation, then
> >   immediately appended to by another, updating the file size by a few
> >   bytes
> > - writepages comes back around to xfs_map_blocks() and trims imap to the
> >   current size, but imap still includes one block of the original speculative
> >   prealloc (that was truncated and recreated) because the size increased
> >   between the time imap was stored and trimmed
> 
> I'm betting hole punch can cause similar problems with cached maps
> in writepage. I wrote a patch yonks ago to put a generation number
> in the extent fork and to store it in the cached map, and to
> invalidate the cached map if they didn't match.
> 

Isn't hole punch already serialized with writeback? I thought the issue
was that post-eof blocks are not fully covered by the existing
serialization logic because there are no pages involved, but it's been a
while since I've looked at it..

Brian

> > The EOF trim approach is known to be a bandaid and potentially racy, but
> > ISTM that this problem can be trivially avoided by moving or adding
> > trims of wpc->imap immediately after a new one is cached. I don't
> > reproduce the problem so far with a couple such extra calls in place.
> > 
> > Bigger picture, we need some kind of invalidation mechanism similar to
> > what we're already doing for dealing with the COW fork in this writeback
> > context. I'm not sure the broad semantics used by the COW fork sequence
> > counter mechanism is really suitable for the data fork because any
> > extent-related change in the fork would cause an invalidation, but I am
> > wondering if we could define some subset of less frequent operations for
> > the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> > starters).
> 
> The patch I had is below - I haven't forward ported it or anything,
> just pulled it from my archive to demonstrate what I think we
> probably need to be doing here. If we want to limit when it causes
> invalidations, then we probably need to limit which operations
> cause the generation number for that inode fork to be bumped.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-06 21:57       ` Dave Chinner
@ 2019-01-07 14:41         ` Brian Foster
  2019-01-07 19:11           ` Brian Foster
  0 siblings, 1 reply; 29+ messages in thread
From: Brian Foster @ 2019-01-07 14:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bugzilla-daemon, linux-xfs

On Mon, Jan 07, 2019 at 08:57:37AM +1100, Dave Chinner wrote:
> On Sun, Jan 06, 2019 at 08:31:20AM +1100, Dave Chinner wrote:
> > On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> > > On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > - writepages is in progress on a particular file that has decently sized
> > >   post-eof speculative preallocation
> > > - writepages gets to the point where it looks up or allocates a new imap
> > >   that includes the preallocation, the allocation/lookup result is
> > >   stored in wpc
> > > - the file is closed by one process, killing off preallocation, then
> > >   immediately appended to by another, updating the file size by a few
> > >   bytes
> > > - writepages comes back around to xfs_map_blocks() and trims imap to the
> > >   current size, but imap still includes one block of the original speculative
> > >   prealloc (that was truncated and recreated) because the size increased
> > >   between the time imap was stored and trimmed
> > 
> > I'm betting hole punch can cause similar problems with cached maps
> > in writepage. I wrote a patch yonks ago to put a generation number
> > in the extent fork and to store it in the cached map, and to
> > invalidate the cached map if they didn't match.
> > 
> > > The EOF trim approach is known to be a bandaid and potentially racy, but
> > > ISTM that this problem can be trivially avoided by moving or adding
> > > trims of wpc->imap immediately after a new one is cached. I don't
> > > reproduce the problem so far with a couple such extra calls in place.
> > > 
> > > Bigger picture, we need some kind of invalidation mechanism similar to
> > > what we're already doing for dealing with the COW fork in this writeback
> > > context. I'm not sure the broad semantics used by the COW fork sequence
> > > counter mechanism is really suitable for the data fork because any
> > > extent-related change in the fork would cause an invalidation, but I am
> > > wondering if we could define some subset of less frequent operations for
> > > the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> > > starters).
> > 
> > The patch I had is below - I haven't forward ported it or anything,
> > just pulled it from my archive to demonstrate what I think we
> > probably need to be doing here. If we want to limit when it causes
> > invalidations, then we probably need to limit which operations
> > cause the generation number for that inode fork to be bumped.
> 
> Ugh. Didn't attach patch.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

Thanks...

> 
> xfs: add generation number for cached imap invalidation
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because writepage caches an extent map with locks being held, it can
> race with truncate and other extent invalidations. This means we can
> attempt to write to ranges that have been freed and/or use pages
> that have been unmapped.
> 
> The band-aid of using xfs_trim_extent_eof() to invalidate mappings
> introduces a new race condition between truncate_setsize() and
> a page under writeback - the locked page can allocate a valid map
> because it's over a delalloc region, but between xfs_map_blocks()
> allocating the extent and xfs_imap_valid() checking it, the
> inode size can change and that will result in xfs_imap_valid()
> saying the map is invalid.
> 

For reference, note that we partially screwed this up during review of
the original patch. I had it in my mind that the race window here would
be too small to exploit in practice; looking back, I see that was
because we tweaked the logic during review. The original patch is
here[1] and the tweaks seemed fine at the time, but this bug clearly
demonstrates they were wrong.

[1] https://patchwork.kernel.org/patch/10001703/

This of course doesn't change the fact that the approach itself is
fundamentally racy, but rather suggests that something like the patch
that Zorro verified could serve as stable fodder on the way to a proper
fix.

> Hence we get spurious writeback failures of correctly mapped
> pages/buffers with allocated extents underneath them. This triggers
> assert failures in writeback that are false positives - it occurs
> when truncate_setsize() is waiting for IO on the page to complete
> before allowing invalidation of the page to occur. Removal of the
> extent being written into doesn't occur until after the page cache
> truncation is completed, so xfs_trim_extent_eof() is acting
> prematurely in this case.
> 
> To fix this, what we really need to do is invalidate the cached map
> when the extent list changes, not when the vfs inode size changes.
> Add an extent list generation number to the in-core extent list
> that is bumped on every change to the extent list, and include it
> in all imaps returned from xfs_bmapi_read().
> 
> When we go to use the information in the imap, and we've dropped the
> locks that keep the map coherent, we need to recheck that the
> generation is still valid once we regain the inode locks. If the
> generation number has changed, we can't use the cached mapping
> information and we need to back out and restart whatever operation
> we are caching the map for.
> 
> In theory, this can be done with seqlocks, but I couldn't get my
> head around how to do this sanely. Generation numbers are simple
> to understand and implement...
> 

I've only made a quick pass and much of the affected context has
changed, but isn't this roughly equivalent to the mechanism introduced
by Christoph's commit e666aa37f4 ("xfs: avoid COW fork extent lookups in
writeback if the fork didn't change")? I was thinking we could
essentially accomplish the same thing by generalizing that for the data
fork. That's a very small change, but we'd still need to consider
whether we really want to invalidate on every data fork update...

For example, I'm concerned that something like sustained buffered writes
could completely break the writeback imap cache by continuously
invalidating it. I think speculative preallocation should help with this
in the common case by already spreading those writes over fewer
allocations, but do we care enough about the case where preallocation
might be turned down/off to try and restrict where we bump the sequence
number (to > i_size changes, for example)? Maybe it's not worth the
trouble just to optimize out a shared ilock cycle and lookup, since the
extent list is still in-core after all.
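
For illustration, a minimal sketch of that data fork generalization
(wpc->data_seq is a hypothetical field sampled under the ilock when
the map is cached; untested):

static bool
xfs_imap_seq_valid(
	struct xfs_inode		*ip,
	struct xfs_writepage_ctx	*wpc)
{
	/*
	 * Reject the cached map if the data fork extent list has
	 * changed since wpc->data_seq was sampled when the map was
	 * stored.
	 */
	return wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
}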

Brian

> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 08df809e2315..1ec408007cab 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3691,6 +3691,7 @@ xfs_bmapi_trim_map(
>  	mval->br_blockcount = XFS_FILBLKS_MIN(end - *bno,
>  			got->br_blockcount - (*bno - got->br_startoff));
>  	mval->br_state = got->br_state;
> +	mval->br_generation = got->br_generation;
>  	ASSERT(mval->br_blockcount <= len);
>  	return;
>  }
> @@ -3799,6 +3800,7 @@ xfs_bmapi_read(
>  		mval->br_startblock = HOLESTARTBLOCK;
>  		mval->br_blockcount = len;
>  		mval->br_state = XFS_EXT_NORM;
> +		mval->br_generation = xfs_iext_generation_fetch(ifp);
>  		*nmap = 1;
>  		return 0;
>  	}
> @@ -3815,9 +3817,14 @@ xfs_bmapi_read(
>  	obno = bno;
>  
>  	while (bno < end && n < *nmap) {
> -		/* Reading past eof, act as though there's a hole up to end. */
> -		if (eof)
> +		/*
> +		 * If we are mapping past the last extent in the fork, act as
> +		 * though there's a hole up to the end of the requested mapping.
> +		 */
> +		if (eof) {
>  			got.br_startoff = end;
> +			got.br_generation = xfs_iext_generation_fetch(ifp);
> +		}
>  		if (got.br_startoff > bno) {
>  			/* Reading in a hole.  */
>  			mval->br_startoff = bno;
> @@ -3825,6 +3832,7 @@ xfs_bmapi_read(
>  			mval->br_blockcount =
>  				XFS_FILBLKS_MIN(len, got.br_startoff - bno);
>  			mval->br_state = XFS_EXT_NORM;
> +			mval->br_generation = got.br_generation;
>  			bno += mval->br_blockcount;
>  			len -= mval->br_blockcount;
>  			mval++;
> diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
> index 343a94246f5b..bc52269b7973 100644
> --- a/fs/xfs/libxfs/xfs_iext_tree.c
> +++ b/fs/xfs/libxfs/xfs_iext_tree.c
> @@ -663,6 +663,7 @@ xfs_iext_insert(
>  
>  	if (new)
>  		xfs_iext_insert_node(ifp, xfs_iext_leaf_key(new, 0), new, 2);
> +	xfs_iext_generation_inc(ifp);
>  }
>  
>  static struct xfs_iext_node *
> @@ -890,6 +891,7 @@ xfs_iext_remove(
>  		cur->pos = 0;
>  	}
>  
> +	xfs_iext_generation_inc(ifp);
>  	if (nr_entries >= RECS_PER_LEAF / 2)
>  		return;
>  
> @@ -944,6 +946,7 @@ xfs_iext_lookup_extent(
>  		return false;
>  found:
>  	xfs_iext_get(gotp, cur_rec(cur));
> +	gotp->br_generation = xfs_iext_generation_fetch(ifp);
>  	return true;
>  }
>  
> @@ -990,6 +993,7 @@ xfs_iext_update_extent(
>  
>  	trace_xfs_bmap_pre_update(ip, cur, state, _RET_IP_);
>  	xfs_iext_set(cur_rec(cur), new);
> +	xfs_iext_generation_inc(ifp);
>  	trace_xfs_bmap_post_update(ip, cur, state, _RET_IP_);
>  }
>  
> @@ -1006,6 +1010,7 @@ xfs_iext_get_extent(
>  	if (!xfs_iext_valid(ifp, cur))
>  		return false;
>  	xfs_iext_get(gotp, cur_rec(cur));
> +	gotp->br_generation = xfs_iext_generation_fetch(ifp);
>  	return true;
>  }
>  
> @@ -1040,4 +1045,5 @@ xfs_iext_destroy(
>  	ifp->if_bytes = 0;
>  	ifp->if_height = 0;
>  	ifp->if_u1.if_root = NULL;
> +	xfs_iext_generation_inc(ifp);
>  }
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> index b9f0098e33b8..3a1c1dcb8b54 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.h
> +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> @@ -28,6 +28,7 @@ typedef struct xfs_ifork {
>  	int			if_bytes;	/* bytes in if_u1 */
>  	int			if_real_bytes;	/* bytes allocated in if_u1 */
>  	struct xfs_btree_block	*if_broot;	/* file's incore btree root */
> +	uint32_t		if_generation;	/* extent list generation # */
>  	short			if_broot_bytes;	/* bytes allocated for root */
>  	unsigned char		if_flags;	/* per-fork flags */
>  	int			if_height;	/* height of the extent tree */
> @@ -186,4 +187,32 @@ extern struct kmem_zone	*xfs_ifork_zone;
>  
>  extern void xfs_ifork_init_cow(struct xfs_inode *ip);
>  
> +/*
> + * Bump the fork's generation count and drop a write barrier so it can
> + * be checked without holding locks
> + */
> +static inline void
> +xfs_iext_generation_inc(
> +	struct xfs_ifork	*ifp)
> +{
> +	ifp->if_generation++;
> +	smp_wmb();
> +}
> +
> +static inline uint32_t
> +xfs_iext_generation_fetch(
> +	struct xfs_ifork	*ifp)
> +{
> +	smp_rmb();
> +	return READ_ONCE(ifp->if_generation);
> +}
> +
> +static inline bool
> +xfs_iext_generation_same(
> +	struct xfs_ifork	*ifp,
> +	uint32_t		val)
> +{
> +	return (val == xfs_iext_generation_fetch(ifp));
> +}
> +
>  #endif	/* __XFS_INODE_FORK_H__ */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 3c560695c546..781ca5252fe4 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -151,12 +151,18 @@ typedef enum {
>  	XFS_EXT_NORM, XFS_EXT_UNWRITTEN,
>  } xfs_exntst_t;
>  
> +/*
> + * There is a generation number added to the incore BMBT record structure for
> + * determining when cached records held outside of the extent list locks are
> + * still valid.  This is only filled out on lookups, and is a read-only value.
> + */
>  typedef struct xfs_bmbt_irec
>  {
>  	xfs_fileoff_t	br_startoff;	/* starting file offset */
>  	xfs_fsblock_t	br_startblock;	/* starting block number */
>  	xfs_filblks_t	br_blockcount;	/* number of blocks */
>  	xfs_exntst_t	br_state;	/* extent state */
> +	uint32_t	br_generation;	/* extent list generation number */
>  } xfs_bmbt_irec_t;
>  
>  #endif	/* __XFS_TYPES_H__ */
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0ca511857f7c..4504e8ecbae1 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -793,6 +793,7 @@ xfs_map_blocks(
>  			imap->br_startoff = offset_fsb;
>  			imap->br_blockcount = end_fsb - offset_fsb;
>  			imap->br_startblock = HOLESTARTBLOCK;
> +			imap->br_generation = xfs_iext_generation_fetch(&ip->i_df);
>  			*type = XFS_IO_HOLE;
>  		} else if (imap->br_startblock == HOLESTARTBLOCK) {
>  			/* landed in a hole */
> @@ -864,23 +865,46 @@ STATIC bool
>  xfs_imap_valid(
>  	struct inode		*inode,
>  	struct xfs_bmbt_irec	*imap,
> -	xfs_off_t		offset)
> +	xfs_off_t		offset,
> +	int			type)
>  {
> -	offset >>= inode->i_blkbits;
> +	struct xfs_ifork	*ifp;
> +
>  
>  	/*
> -	 * We have to make sure the cached mapping is within EOF to protect
> -	 * against eofblocks trimming on file release leaving us with a stale
> -	 * mapping. Otherwise, a page for a subsequent file extending buffered
> -	 * write could get picked up by this writeback cycle and written to the
> -	 * wrong blocks.
> +	 * We have to make sure the cached mapping hasn't been invalidated by an
> +	 * extent list modification (e.g. truncate, releasing post-eof blocks,
> +	 * etc). If the extent list has changed, we need to get a new map.
> +	 *
> +	 * Removing a range of a file will clear extents from both the data and
> +	 * COW fork. That means a single extent range removal will bump both the
> +	 * data and cow fork generation numbers if such extents exist in the
> +	 * range. Hence we can safely check against the generation number of the
> +	 * specific fork we are operating on here as it will flag an
> +	 * invalidation if the removal affected the specific fork we are doing
> +	 * IO from.
>  	 *
> -	 * Note that what we really want here is a generic mapping invalidation
> -	 * mechanism to protect us from arbitrary extent modifying contexts, not
> -	 * just eofblocks.
> +	 * Finally, page invalidation will block on the page lock we are
> +	 * holding, but truncate may have already changed the file size. Hence
> +	 * we could be beyond the new EOF here, but truncate can't proceed until
> +	 * we finish IO on this page. Checking against EOF right now means we
> +	 * have to unwind and discard the page, and we have to do similar checks
> +	 * everywhere we pick up the inode locks again.
> +	 *
> +	 * However, it's still valid to allocate and write the data here because
> +	 * we have valid data and the extent removal after the page
> +	 * invalidation completes will clean up for us. Ignoring this size
> +	 * change race condition and relying on extent map coherency checks to
> +	 * determine whether to proceed or not makes everything in the writepage
> +	 * path simpler.
>  	 */
> -	xfs_trim_extent_eof(imap, XFS_I(inode));
> +	ifp = XFS_IFORK_PTR(XFS_I(inode), type == XFS_IO_COW ? XFS_COW_FORK
> +							     : XFS_DATA_FORK);
> +	if (!xfs_iext_generation_same(ifp, imap->br_generation))
> +		return false;
>  
> +
> +	offset >>= inode->i_blkbits;
>  	return offset >= imap->br_startoff &&
>  		offset < imap->br_startoff + imap->br_blockcount;
>  }
> @@ -932,6 +956,7 @@ xfs_writepage_map(
>  	for (poffset = 0;
>  	     poffset < PAGE_SIZE;
>  	     poffset += len, file_offset += len, bh = bh->b_this_page) {
> +		int retries = 0;
>  
>  		/* past the range we are writing, so nothing more to write. */
>  		if (file_offset >= end_offset)
> @@ -954,24 +979,54 @@ xfs_writepage_map(
>  		/* Check to see if current map spans this file offset */
>  		if (wpc->imap_valid)
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> -							 file_offset);
> +						 file_offset, wpc->io_type);
>  
>  		/*
> -		 * If we don't have a valid map, now it's time to get
> -		 * a new one for this offset. This will allocate delalloc
> -		 * regions or COW shared extents. If we return without a valid
> -		 * map, it means we landed in a hole and we skip the block.
> +		 * If we don't have a valid map, now it's time to get a new one
> +		 * for this offset. This will allocate delalloc regions or COW
> +		 * shared extents. If we return without a valid map, it means we
> +		 * raced with another extent operation.
> +		 *
> +		 * If we are racing with invalidation, it will wait for us to
> +		 * release the page lock which means allocation and IO is ok to
> +		 * finish off.
> +		 *
> +		 * Hence we loop here trying to remap the page until we get a
> +		 * valid map that hasn't been changed by the time we check it.
>  		 */
> -		if (!wpc->imap_valid) {
> +		while (!wpc->imap_valid) {
> +			if (!(++retries % 10)) {
> +				xfs_warn_ratelimited(XFS_I(inode)->i_mount,
> +					"%s: looping %d times", __func__, retries);
> +			}
> +
>  			error = xfs_map_blocks(inode, file_offset, &wpc->imap,
>  					     &wpc->io_type);
> +			if (error == -EAGAIN) {
> +				/*
> +				 * Raced with another extent list modification
> +				 * between map and allocate operations. Back off
> +				 * for a short while, try again.
> +				 */
> +				msleep(1);
> +				continue;
> +			}
>  			if (error)
>  				goto out;
> +
> +			/*
> +			 * We can still race with another extent list modification
> +			 * by the time we get back here. It may have been a
> +			 * truncate, so if the extent is not valid now we need to
> +			 * map again, knowing the truncate will complete before
> +			 * the next map is executed, and any future invalidating
> +			 * modification will block on the page lock we hold.
> +			 */
>  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> -							 file_offset);
> +						 file_offset, wpc->io_type);
>  		}
>  
> -		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> +		if (wpc->io_type == XFS_IO_HOLE) {
>  			/*
>  			 * set_page_dirty dirties all buffers in a page, independent
>  			 * of their state.  The dirty state however is entirely
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ad48e2f24699..3383287633ff 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -674,8 +674,8 @@ xfs_iomap_write_allocate(
>  	xfs_bmbt_irec_t *imap)
>  {
>  	xfs_mount_t	*mp = ip->i_mount;
> -	xfs_fileoff_t	offset_fsb, last_block;
> -	xfs_fileoff_t	end_fsb, map_start_fsb;
> +	xfs_fileoff_t	offset_fsb;
> +	xfs_fileoff_t	map_start_fsb;
>  	xfs_fsblock_t	first_block;
>  	struct xfs_defer_ops	dfops;
>  	xfs_filblks_t	count_fsb;
> @@ -730,56 +730,28 @@ xfs_iomap_write_allocate(
>  			xfs_defer_init(&dfops, &first_block);
>  
>  			/*
> -			 * it is possible that the extents have changed since
> -			 * we did the read call as we dropped the ilock for a
> -			 * while. We have to be careful about truncates or hole
> -			 * punchs here - we are not allowed to allocate
> -			 * non-delalloc blocks here.
> -			 *
> -			 * The only protection against truncation is the pages
> -			 * for the range we are being asked to convert are
> -			 * locked and hence a truncate will block on them
> -			 * first.
> -			 *
> -			 * As a result, if we go beyond the range we really
> -			 * need and hit an delalloc extent boundary followed by
> -			 * a hole while we have excess blocks in the map, we
> -			 * will fill the hole incorrectly and overrun the
> -			 * transaction reservation.
> +			 * Now we've locked the inode, we can determine if the
> +			 * region we are attempting to allocate over is still
> +			 * valid. If it's potentially invalid, then generation
> +			 * numbers won't match and we need to back out
> +			 * completely and let the caller decide what to do.
>  			 *
> -			 * Using a single map prevents this as we are forced to
> -			 * check each map we look for overlap with the desired
> -			 * range and abort as soon as we find it. Also, given
> -			 * that we only return a single map, having one beyond
> -			 * what we can return is probably a bit silly.
> -			 *
> -			 * We also need to check that we don't go beyond EOF;
> -			 * this is a truncate optimisation as a truncate sets
> -			 * the new file size before block on the pages we
> -			 * currently have locked under writeback. Because they
> -			 * are about to be tossed, we don't need to write them
> -			 * back....
> +			 * This will occur if we race with truncate or some
> +			 * other extent manipulation while we don't have the
> +			 * inode locked.
>  			 */
> -			nimaps = 1;
> -			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> -			error = xfs_bmap_last_offset(ip, &last_block,
> -							XFS_DATA_FORK);
> -			if (error)
> +			if (!xfs_iext_generation_same(
> +					XFS_IFORK_PTR(ip, whichfork),
> +					imap->br_generation)) {
> +				error = -EAGAIN;
>  				goto trans_cancel;
> -
> -			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
> -			if ((map_start_fsb + count_fsb) > last_block) {
> -				count_fsb = last_block - map_start_fsb;
> -				if (count_fsb == 0) {
> -					error = -EAGAIN;
> -					goto trans_cancel;
> -				}
>  			}
>  
>  			/*
>  			 * From this point onwards we overwrite the imap
>  			 * pointer that the caller gave to us.
>  			 */
> +			nimaps = 1;
>  			error = xfs_bmapi_write(tp, ip, map_start_fsb,
>  						count_fsb, flags, &first_block,
>  						nres, imap, &nimaps,
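
As an aside on the seqlock idea in the commit message above, a rough
sketch using the kernel's seqcount_t (if_seqcount is a hypothetical
seqcount_t in struct xfs_ifork; writers are already serialised by the
ilock, so a bare seqcount is sufficient):

/* Writer side, called with the ilock held exclusively. */
static void
xfs_iext_modify_example(struct xfs_ifork *ifp)
{
	write_seqcount_begin(&ifp->if_seqcount);
	/* ... insert/remove/update extent records here ... */
	write_seqcount_end(&ifp->if_seqcount);
}

/*
 * Reader side: sample with read_seqcount_begin() before dropping the
 * ilock, then recheck after relocking before trusting the cached map.
 */
static bool
xfs_iext_seq_unchanged(struct xfs_ifork *ifp, unsigned int cached)
{
	return !read_seqcount_retry(&ifp->if_seqcount, cached);
}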

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
                   ` (10 preceding siblings ...)
  2019-01-07 14:41 ` bugzilla-daemon
@ 2019-01-07 14:41 ` bugzilla-daemon
  2019-01-07 19:11 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 29+ messages in thread
From: bugzilla-daemon @ 2019-01-07 14:41 UTC (permalink / raw)
  To: linux-xfs

https://bugzilla.kernel.org/show_bug.cgi?id=202053

--- Comment #12 from bfoster@redhat.com ---
On Mon, Jan 07, 2019 at 08:57:37AM +1100, Dave Chinner wrote:
> On Sun, Jan 06, 2019 at 08:31:20AM +1100, Dave Chinner wrote:
> > On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> > > On Tue, Dec 25, 2018 at 06:10:59AM +0000,
> bugzilla-daemon@bugzilla.kernel.org wrote:
> > > - writepages is in progress on a particular file that has decently sized
> > >   post-eof speculative preallocation
> > > - writepages gets to the point where it looks up or allocates a new imap
> > >   that includes the preallocation, the allocation/lookup result is
> > >   stored in wpc
> > > - the file is closed by one process, killing off preallocation, then
> > >   immediately appended to by another, updating the file size by a few
> > >   bytes
> > > - writepages comes back around to xfs_map_blocks() and trims imap to the
> > >   current size, but imap still includes one block of the original
> speculative
> > >   prealloc (that was truncated and recreated) because the size increased
> > >   between the time imap was stored and trimmed
> > 
> > I'm betting hole punch can cause similar problems with cached maps
> > in writepage. I wrote a patch yonks ago to put a generation number
> > in the extent fork and to store it in the cached map, and to
> > invalidate the cached map if they didn't match.
> > 
> > > The EOF trim approach is known to be a bandaid and potentially racy, but
> > > ISTM that this problem can be trivially avoided by moving or adding
> > > trims of wpc->imap immediately after a new one is cached. I don't
> > > reproduce the problem so far with a couple such extra calls in place.
> > > 
> > > Bigger picture, we need some kind of invalidation mechanism similar to
> > > what we're already doing for dealing with the COW fork in this writeback
> > > context. I'm not sure the broad semantics used by the COW fork sequence
> > > counter mechanism is really suitable for the data fork because any
> > > extent-related change in the fork would cause an invalidation, but I am
> > > wondering if we could define some subset of less frequent operations for
> > > the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> > > starters).
> > 
> > The patch I had is below - I haven't forward ported it or anything,
> > just pulled it from my archive to demonstrate what I think we
> > probably need to be doing here. If we want to limit when it causes
> > invalidations, then we need probably need to limit which operations
> > cause the generation number for that inode fork to be bumped.
> 
> Ugh. Didn't attach patch.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

Thanks...

> 
> xfs: add generation number for cached imap invalidation
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Because writepage caches an extent map with locks being held, it can
> race with truncate and other extent invalidations. This means we can
> attempt to write to ranges that have been freed and/or use pages
> that have been unmapped.
> 
> The band-aid of using xfs_trim_extent_eof() to invalidate mappings
> introduces a new race condition between truncate_setsize() and
> a page under writeback - the locked page can allocate a valid map
> because it's over a delalloc region, but between xfs_map_blocks()
> allocating the extent and xfs_imap_valid() checking it, the
> inode size can change and that will result in xfs_imap_valid()
> saying the map is invalid.
> 

For reference, note that we partially screwed this up during review of
the original patch. I had it in my mind that the race window here would
be too small to exploit in practice and looking back, I see that is
because we tweaked the logic during review. The original patch is
here[1] and the tweaks seemed fine at the time, but this bug clearly
demonstrates that was wrong.

[1] https://patchwork.kernel.org/patch/10001703/

This of course doesn't change the fact that the approach itself is
fundamentally racy, but rather suggests that something like the patch
that Zorro verified could serve as stable fodder on the way to a proper
fix.

> Hence we get spurious writeback failures of correctly mapped
> pages/buffers with allocated extents underneath them. This triggers
> assert failures in writeback that are false positives - it occurs
> when truncate_setsize() is waiting for IO on the page to complete
> before allowing invalidation of the page to occur. Removal of the
> extent being written into doesn't occur until after the page cache
> truncation is completed, so xfs_trim_extent_eof() is acting
> prematurely in this case.
> 
> To fix this, what we really need to do is invalidate the cached map
> when the extent list changes, not when the vfs inode size changes.
> Add a extent list generation number to the in-core extent list
> that is bumped on every change to the extent list, and add that
> in all imaps returned from xfs_bmapi_read().
> 
> When we go to use the information in the imap, and we've dropped the
> locks that keep the map coherent, we need to recheck the the
> generation is still valid once we regain the inode locks. If the
> generation number has changed, we can't use the cached mapping
> information and we need to back out and restart whatever operation
> we are caching the map for.
> 
> In theory, this can be done with seqlocks, but I couldn't get my
> head around how to do this sanely. Generation numbers are simple
> to understand and implement...
> 

I've only made a quick pass and much of the affected context has
changed, but isn't this roughly equivalent to the mechanism introduced
by Christoph's commit e666aa37f4 ("xfs: avoid COW fork extent lookups in
writeback if the fork didn't change")? I was thinking we could
essentially accomplish the same thing by generalizing that for the data
fork. That's a very small change, but we'd still need to consider
whether we really want to invalidate on every data fork update...

For example, I'm concerned that something like sustained buffered writes
could completely break the writeback imap cache by continuously
invalidating it. I think speculative preallocation should help with this
in the common case by already spreading those writes over fewer
allocations, but do we care enough about the case where preallocation
might be turned down/off to try and restrict where we bump the sequence
number (to > i_size changes, for example)? Maybe it's not worth the
trouble just to optimize out a shared ilock cycle and lookup, since the
extent list is still in-core after all.

Brian

> Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 08df809e2315..1ec408007cab 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -3691,6 +3691,7 @@ xfs_bmapi_trim_map(
>       mval->br_blockcount = XFS_FILBLKS_MIN(end - *bno,
>                       got->br_blockcount - (*bno - got->br_startoff));
>       mval->br_state = got->br_state;
> +     mval->br_generation = got->br_generation;
>       ASSERT(mval->br_blockcount <= len);
>       return;
>  }
> @@ -3799,6 +3800,7 @@ xfs_bmapi_read(
>               mval->br_startblock = HOLESTARTBLOCK;
>               mval->br_blockcount = len;
>               mval->br_state = XFS_EXT_NORM;
> +             mval->br_generation = xfs_iext_generation_fetch(ifp);
>               *nmap = 1;
>               return 0;
>       }
> @@ -3815,9 +3817,14 @@ xfs_bmapi_read(
>       obno = bno;
>  
>       while (bno < end && n < *nmap) {
> -             /* Reading past eof, act as though there's a hole up to end. */
> -             if (eof)
> +             /*
> +              * If we are mapping past the last extent in the fork, act as
> +              * though there's a hole up to end of the requested mapping.
> +              */
> +             if (eof) {
>                       got.br_startoff = end;
> +                     got.br_generation = xfs_iext_generation_fetch(ifp);
> +             }
>               if (got.br_startoff > bno) {
>                       /* Reading in a hole.  */
>                       mval->br_startoff = bno;
> @@ -3825,6 +3832,7 @@ xfs_bmapi_read(
>                       mval->br_blockcount =
>                               XFS_FILBLKS_MIN(len, got.br_startoff - bno);
>                       mval->br_state = XFS_EXT_NORM;
> +                     mval->br_generation = got.br_generation;
>                       bno += mval->br_blockcount;
>                       len -= mval->br_blockcount;
>                       mval++;
> diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
> index 343a94246f5b..bc52269b7973 100644
> --- a/fs/xfs/libxfs/xfs_iext_tree.c
> +++ b/fs/xfs/libxfs/xfs_iext_tree.c
> @@ -663,6 +663,7 @@ xfs_iext_insert(
>  
>       if (new)
>               xfs_iext_insert_node(ifp, xfs_iext_leaf_key(new, 0), new, 2);
> +     xfs_iext_generation_inc(ifp);
>  }
>  
>  static struct xfs_iext_node *
> @@ -890,6 +891,7 @@ xfs_iext_remove(
>               cur->pos = 0;
>       }
>  
> +     xfs_iext_generation_inc(ifp);
>       if (nr_entries >= RECS_PER_LEAF / 2)
>               return;
>  
> @@ -944,6 +946,7 @@ xfs_iext_lookup_extent(
>               return false;
>  found:
>       xfs_iext_get(gotp, cur_rec(cur));
> +     gotp->br_generation = xfs_iext_generation_fetch(ifp);
>       return true;
>  }
>  
> @@ -990,6 +993,7 @@ xfs_iext_update_extent(
>  
>       trace_xfs_bmap_pre_update(ip, cur, state, _RET_IP_);
>       xfs_iext_set(cur_rec(cur), new);
> +     xfs_iext_generation_inc(ifp);
>       trace_xfs_bmap_post_update(ip, cur, state, _RET_IP_);
>  }
>  
> @@ -1006,6 +1010,7 @@ xfs_iext_get_extent(
>       if (!xfs_iext_valid(ifp, cur))
>               return false;
>       xfs_iext_get(gotp, cur_rec(cur));
> +     gotp->br_generation = xfs_iext_generation_fetch(ifp);
>       return true;
>  }
>  
> @@ -1040,4 +1045,5 @@ xfs_iext_destroy(
>       ifp->if_bytes = 0;
>       ifp->if_height = 0;
>       ifp->if_u1.if_root = NULL;
> +     xfs_iext_generation_inc(ifp);
>  }
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> index b9f0098e33b8..3a1c1dcb8b54 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.h
> +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> @@ -28,6 +28,7 @@ typedef struct xfs_ifork {
>       int                     if_bytes;       /* bytes in if_u1 */
>       int                     if_real_bytes;  /* bytes allocated in if_u1 */
>       struct xfs_btree_block  *if_broot;      /* file's incore btree root */
> +     uint32_t                if_generation;  /* extent list generation # */
>       short                   if_broot_bytes; /* bytes allocated for root */
>       unsigned char           if_flags;       /* per-fork flags */
>       int                     if_height;      /* height of the extent tree */
> @@ -186,4 +187,32 @@ extern struct kmem_zone  *xfs_ifork_zone;
>  
>  extern void xfs_ifork_init_cow(struct xfs_inode *ip);
>  
> +/*
> + * Bump the fork's generation count and drop a write barrier so it can
> + * be checked without holding locks
> + */
> +static inline void
> +xfs_iext_generation_inc(
> +     struct xfs_ifork        *ifp)
> +{
> +     ifp->if_generation++;
> +     smp_wmb();
> +}
> +
> +static inline uint32_t
> +xfs_iext_generation_fetch(
> +     struct xfs_ifork        *ifp)
> +{
> +     smp_rmb();
> +     return READ_ONCE(ifp->if_generation);
> +}
> +
> +static inline bool
> +xfs_iext_generation_same(
> +     struct xfs_ifork        *ifp,
> +     uint32_t                val)
> +{
> +     return (val == xfs_iext_generation_fetch(ifp));
> +}
> +
>  #endif       /* __XFS_INODE_FORK_H__ */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 3c560695c546..781ca5252fe4 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -151,12 +151,18 @@ typedef enum {
>       XFS_EXT_NORM, XFS_EXT_UNWRITTEN,
>  } xfs_exntst_t;
>  
> +/*
> + * There is a generation number added to the incore BMBT record structure
> for
> + * determining when cached records held outside of the extent list locks are
> + * still valid.  This is only filled out on lookups, and is a read-only
> value.
> + */
>  typedef struct xfs_bmbt_irec
>  {
>       xfs_fileoff_t   br_startoff;    /* starting file offset */
>       xfs_fsblock_t   br_startblock;  /* starting block number */
>       xfs_filblks_t   br_blockcount;  /* number of blocks */
>       xfs_exntst_t    br_state;       /* extent state */
> +     uint32_t        br_generation;  /* extent list generation number */
>  } xfs_bmbt_irec_t;
>  
>  #endif       /* __XFS_TYPES_H__ */
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 0ca511857f7c..4504e8ecbae1 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -793,6 +793,7 @@ xfs_map_blocks(
>                       imap->br_startoff = offset_fsb;
>                       imap->br_blockcount = end_fsb - offset_fsb;
>                       imap->br_startblock = HOLESTARTBLOCK;
> +                     imap->br_generation =
> xfs_iext_generation_fetch(&ip->i_df);
>                       *type = XFS_IO_HOLE;
>               } else if (imap->br_startblock == HOLESTARTBLOCK) {
>                       /* landed in a hole */
> @@ -864,23 +865,46 @@ STATIC bool
>  xfs_imap_valid(
>       struct inode            *inode,
>       struct xfs_bmbt_irec    *imap,
> -     xfs_off_t               offset)
> +     xfs_off_t               offset,
> +     int                     type)
>  {
> -     offset >>= inode->i_blkbits;
> +     struct xfs_ifork        *ifp;
> +
>  
>       /*
> -      * We have to make sure the cached mapping is within EOF to protect
> -      * against eofblocks trimming on file release leaving us with a stale
> -      * mapping. Otherwise, a page for a subsequent file extending buffered
> -      * write could get picked up by this writeback cycle and written to the
> -      * wrong blocks.
> +      * We have to make sure the cached mapping hasn't been invalidated by
> an
> +      * extent list modification (e.g. truncate, releasing post-eof blocks,
> +      * etc). If the extent list has changed, we need to get a new map.
> +      *
> +      * Removing a range of a file will clear extents from both the data and
> +      * COW fork. That means a single extent range removal will bump both
> the
> +      * data and cow fork generation numbers if such extents exist in the
> +      * range. Hence we can safely check against the generation number of
> the
> +      * specific fork we are operating on here as it will flag an
> +      * invalidation if the removal affected the specific fork we are doing
> +      * IO from.
>        *
> -      * Note that what we really want here is a generic mapping invalidation
> -      * mechanism to protect us from arbitrary extent modifying contexts,
> not
> -      * just eofblocks.
> +      * Finally, page invalidation will block on the page lock we are
> +      * holding, but truncate may have already changed the file size. Hence
> +      * we could be beyond the new EOF here, but truncate can't proceed
> until
> +      * we finish IO on this page. Checking against EOF right now means we
> +      * have to unwind and discard the page, and we have to do similar
> checks
> +      * everywhere we pick up the inode locks again.
> +      *
> +      * However, it's still valid to allocate and write the data here
> because
> +      * we have valid data and the the extent removal after the page
> +      * invalidation completes will clean up for us. Ignoring this size
> +      * change race condition and using on extent map coherency checks to
> +      * determine whether to proceed or not makes everything in the
> writepage
> +      * path simpler.
>        */
> -     xfs_trim_extent_eof(imap, XFS_I(inode));
> +     ifp = XFS_IFORK_PTR(XFS_I(inode), type == XFS_IO_COW ? XFS_COW_FORK
> +                                                          : XFS_DATA_FORK);
> +     if (!xfs_iext_generation_same(ifp, imap->br_generation))
> +             return false;
>  
> +
> +     offset >>= inode->i_blkbits;
>       return offset >= imap->br_startoff &&
>               offset < imap->br_startoff + imap->br_blockcount;
>  }
> @@ -932,6 +956,7 @@ xfs_writepage_map(
>       for (poffset = 0;
>            poffset < PAGE_SIZE;
>            poffset += len, file_offset += len, bh = bh->b_this_page) {
> +             int retries = 0;
>  
>               /* past the range we are writing, so nothing more to write. */
>               if (file_offset >= end_offset)
> @@ -954,24 +979,54 @@ xfs_writepage_map(
>               /* Check to see if current map spans this file offset */
>               if (wpc->imap_valid)
>                       wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> -                                                      file_offset);
> +                                              file_offset, wpc->io_type);
>  
>               /*
> -              * If we don't have a valid map, now it's time to get
> -              * a new one for this offset. This will allocate delalloc
> -              * regions or COW shared extents. If we return without a valid
> -              * map, it means we landed in a hole and we skip the block.
> +              * If we don't have a valid map, now it's time to get a new one
> +              * for this offset. This will allocate delalloc regions or COW
> +              * shared extents. If we return without a valid map, it means
> we
> +              * raced with another extent operation.
> +              *
> +              * If we are racing with invalidation, it will wait for us to
> +              * release the page lock which means allocation and IO is ok to
> +              * finish off.
> +              *
> +              * Hence we loop here trying to remap the page until we get a
> +              * valid map that hasn't been changed by the time we check it.
>                */
> -             if (!wpc->imap_valid) {
> +             while (!wpc->imap_valid) {
> +                     if (!(++retries % 10)) {
> +                             xfs_warn_ratelimited(XFS_I(inode)->i_mount,
> +                                     "%s: looping %d times", __func__,
> retries);
> +                     }
> +
>                       error = xfs_map_blocks(inode, file_offset, &wpc->imap,
>                                            &wpc->io_type);
> +                     if (error == -EAGAIN) {
> +                             /*
> +                              * Raced with another extent list modification
> +                              * between map and allocate operations. Back off
> +                              * for a short while, try again.
> +                              */
> +                             msleep(1);
> +                             continue;
> +                     }
>                       if (error)
>                               goto out;
> +
> +                     /*
> +                      * We can still race with another extent list mod by the
> +                      * time we get back here. It may have been a truncate,
> +                      * so if the extent is not valid now, we need to map
> +                      * again, knowing the truncate will be complete before
> +                      * the next map is executed and so any future
> +                      * invalidation mod will block on the page lock we hold
> +                      */
>                       wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> -                                                      file_offset);
> +                                              file_offset, wpc->io_type);
>               }
>  
> -             if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> +             if (wpc->io_type == XFS_IO_HOLE) {
>                       /*
>                        * set_page_dirty dirties all buffers in a page, independent
>                        * of their state.  The dirty state however is entirely
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index ad48e2f24699..3383287633ff 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -674,8 +674,8 @@ xfs_iomap_write_allocate(
>       xfs_bmbt_irec_t *imap)
>  {
>       xfs_mount_t     *mp = ip->i_mount;
> -     xfs_fileoff_t   offset_fsb, last_block;
> -     xfs_fileoff_t   end_fsb, map_start_fsb;
> +     xfs_fileoff_t   offset_fsb;
> +     xfs_fileoff_t   map_start_fsb;
>       xfs_fsblock_t   first_block;
>       struct xfs_defer_ops    dfops;
>       xfs_filblks_t   count_fsb;
> @@ -730,56 +730,28 @@ xfs_iomap_write_allocate(
>                       xfs_defer_init(&dfops, &first_block);
>  
>                       /*
> -                      * it is possible that the extents have changed since
> -                      * we did the read call as we dropped the ilock for a
> -                      * while. We have to be careful about truncates or hole
> -                      * punchs here - we are not allowed to allocate
> -                      * non-delalloc blocks here.
> -                      *
> -                      * The only protection against truncation is the pages
> -                      * for the range we are being asked to convert are
> -                      * locked and hence a truncate will block on them
> -                      * first.
> -                      *
> -                      * As a result, if we go beyond the range we really
> -                      * need and hit an delalloc extent boundary followed by
> -                      * a hole while we have excess blocks in the map, we
> -                      * will fill the hole incorrectly and overrun the
> -                      * transaction reservation.
> +                      * Now we've locked the inode, we can determine if the
> +                      * region we are attempting to allocate over is still
> +                      * valid. If it's potentially invalid, then generation
> +                      * numbers won't match and we need to back out
> +                      * completely and let the caller decide what to do.
>                        *
> -                      * Using a single map prevents this as we are forced to
> -                      * check each map we look for overlap with the desired
> -                      * range and abort as soon as we find it. Also, given
> -                      * that we only return a single map, having one beyond
> -                      * what we can return is probably a bit silly.
> -                      *
> -                      * We also need to check that we don't go beyond EOF;
> -                      * this is a truncate optimisation as a truncate sets
> -                      * the new file size before block on the pages we
> -                      * currently have locked under writeback. Because they
> -                      * are about to be tossed, we don't need to write them
> -                      * back....
> +                      * This will occur if we race with truncate or some
> +                      * other extent manipulation while we don't have the
> +                      * inode locked.
>                        */
> -                     nimaps = 1;
> -                     end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> -                     error = xfs_bmap_last_offset(ip, &last_block,
> -                                                     XFS_DATA_FORK);
> -                     if (error)
> +                     if (!xfs_iext_generation_same(
> +                                     XFS_IFORK_PTR(ip, whichfork),
> +                                     imap->br_generation)) {
> +                             error = -EAGAIN;
>                               goto trans_cancel;
> -
> -                     last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
> -                     if ((map_start_fsb + count_fsb) > last_block) {
> -                             count_fsb = last_block - map_start_fsb;
> -                             if (count_fsb == 0) {
> -                                     error = -EAGAIN;
> -                                     goto trans_cancel;
> -                             }
>                       }
>  
>                       /*
>                        * From this point onwards we overwrite the imap
>                        * pointer that the caller gave to us.
>                        */
> +                     nimaps = 1;
>                       error = xfs_bmapi_write(tp, ip, map_start_fsb,
>                                               count_fsb, flags, &first_block,
>                                               nres, imap, &nimaps,

-- 
You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-07 14:41         ` Brian Foster
@ 2019-01-07 19:11           ` Brian Foster
  2019-01-08  5:55             ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Brian Foster @ 2019-01-07 19:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bugzilla-daemon, linux-xfs

On Mon, Jan 07, 2019 at 09:41:14AM -0500, Brian Foster wrote:
> On Mon, Jan 07, 2019 at 08:57:37AM +1100, Dave Chinner wrote:
> > On Sun, Jan 06, 2019 at 08:31:20AM +1100, Dave Chinner wrote:
> > > On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> > > > On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > - writepages is in progress on a particular file that has decently sized
> > > >   post-eof speculative preallocation
> > > > - writepages gets to the point where it looks up or allocates a new imap
> > > >   that includes the preallocation, the allocation/lookup result is
> > > >   stored in wpc
> > > > - the file is closed by one process, killing off preallocation, then
> > > >   immediately appended to by another, updating the file size by a few
> > > >   bytes
> > > > - writepages comes back around to xfs_map_blocks() and trims imap to the
> > > >   current size, but imap still includes one block of the original speculative
> > > >   prealloc (that was truncated and recreated) because the size increased
> > > >   between the time imap was stored and trimmed
> > > 
> > > I'm betting hole punch can cause similar problems with cached maps
> > > in writepage. I wrote a patch yonks ago to put a generation number
> > > in the extent fork and to store it in the cached map, and to
> > > invalidate the cached map if they didn't match.
> > > 
> > > > The EOF trim approach is known to be a bandaid and potentially racy, but
> > > > ISTM that this problem can be trivially avoided by moving or adding
> > > > trims of wpc->imap immediately after a new one is cached. I don't
> > > > reproduce the problem so far with a couple such extra calls in place.
> > > > 
> > > > Bigger picture, we need some kind of invalidation mechanism similar to
> > > > what we're already doing for dealing with the COW fork in this writeback
> > > > context. I'm not sure the broad semantics used by the COW fork sequence
> > > > counter mechanism is really suitable for the data fork because any
> > > > extent-related change in the fork would cause an invalidation, but I am
> > > > wondering if we could define some subset of less frequent operations for
> > > > the same mechanism to reliably invalidate (e.g., on eofblocks trims, for
> > > > starters).
> > > 
> > > The patch I had is below - I haven't forward ported it or anything,
> > > just pulled it from my archive to demonstrate what I think we
> > > probably need to be doing here. If we want to limit when it causes
> > > invalidations, then we probably need to limit which operations
> > > cause the generation number for that inode fork to be bumped.
> > 
> > Ugh. Didn't attach patch.
> > 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 
> Thanks...
> 
> > 
> > xfs: add generation number for cached imap invalidation
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Because writepage caches an extent map with locks being held, it can
> > race with truncate and other extent invalidations. This means we can
> > attempt to write to ranges that have been freed and/or use pages
> > that have been unmapped.
> > 
> > The band-aid of using xfs_trim_extent_eof() to invalidate mappings
> > introduces a new race condition between truncate_setsize() and
> > a page under writeback - the locked page can allocate a valid map
> > because it's over a delalloc region, but between xfs_map_blocks()
> > allocating the extent and xfs_imap_valid() checking it, the
> > inode size can change and that will result in xfs_imap_valid()
> > saying the map is invalid.
> > 
> 
> For reference, note that we partially screwed this up during review of
> the original patch. I had it in my mind that the race window here would
> be too small to exploit in practice and looking back, I see that is
> because we tweaked the logic during review. The original patch is
> here[1] and the tweaks seemed fine at the time, but this bug clearly
> demonstrates that was wrong.
> 
> [1] https://patchwork.kernel.org/patch/10001703/
> 
> This of course doesn't change the fact that the approach itself is
> fundamentally racy, but rather suggests that something like the patch
> that Zorro verified could serve as stable fodder on the way to a proper
> fix.
> 
> > Hence we get spurious writeback failures of correctly mapped
> > pages/buffers with allocated extents underneath them. This triggers
> > assert failures in writeback that are false positives - it occurs
> > when truncate_setsize() is waiting for IO on the page to complete
> > before allowing invalidation of the page to occur. Removal of the
> > extent being written into doesn't occur until after the page cache
> > truncation is completed, so xfs_trim_extent_eof() is acting
> > prematurely in this case.
> > 
> > To fix this, what we really need to do is invalidate the cached map
> > when the extent list changes, not when the vfs inode size changes.
> > Add an extent list generation number to the in-core extent list
> > that is bumped on every change to the extent list, and include it
> > in all imaps returned from xfs_bmapi_read().
> > 
> > When we go to use the information in the imap, and we've dropped the
> > locks that keep the map coherent, we need to recheck that the
> > generation is still valid once we regain the inode locks. If the
> > generation number has changed, we can't use the cached mapping
> > information and we need to back out and restart whatever operation
> > we are caching the map for.
> > 
> > In theory, this can be done with seqlocks, but I couldn't get my
> > head around how to do this sanely. Generation numbers are simple
> > to understand and implement...
> > 
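(For reference, the seqcount variant mentioned above would look roughly
like the sketch below. This is not part of the patch: if_seqcount is a
hypothetical field, initialised with seqcount_init() alongside the rest
of the fork, and the ilock is assumed to serialize all writers, which
is what makes a bare seqcount_t from <linux/seqlock.h> sufficient.)

	/* modification side, under XFS_ILOCK_EXCL */
	write_seqcount_begin(&ifp->if_seqcount);
	/* ... insert/remove/update extent records ... */
	write_seqcount_end(&ifp->if_seqcount);

	/* sample while the map is looked up under the ilock ... */
	seq = read_seqcount_begin(&ifp->if_seqcount);

	/* ... and revalidate once the ilock has been dropped and retaken */
	if (read_seqcount_retry(&ifp->if_seqcount, seq)) {
		/* extent list changed while unlocked: discard the cached map */
	}
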
> 
> I've only made a quick pass and much of the affected context has
> changed, but isn't this roughly equivalent to the mechanism introduced
> by Christoph's commit e666aa37f4 ("xfs: avoid COW fork extent lookups in
> writeback if the fork didn't change")? I was thinking we could
> essentially accomplish the same thing by generalizing that for the data
> fork. That's a very small change, but we'd still need to consider
> whether we really want to invalidate on every data fork update...
> 
> For example, I'm concerned that something like sustained buffered writes
> could completely break the writeback imap cache by continuously
> invalidating it. I think speculative preallocation should help with this
> in the common case by already spreading those writes over fewer
> allocations, but do we care enough about the case where preallocation
> might be turned down/off to try and restrict where we bump the sequence
> number (to > i_size changes, for example)? Maybe it's not worth the
> trouble just to optimize out a shared ilock cycle and lookup, since the
> extent list is still in-core after all.
> 

A follow up FWIW... a quick test of some changes to reuse the existing
mechanism doesn't appear to show much of a problem in this regard, even
with allocsize=4k. I think another thing that minimizes impact is that
even if we end up revalidating the same imap over and over, the ioend
construction logic is distinct and based on contiguity. IOW, writeback
is still sending the same sized I/Os for contiguous blocks...
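
Concretely, the reuse amounts to something like the sketch below (a
rough approximation of what I'm testing, not a final patch; data_seq is
a new xfs_writepage_ctx field by direct analogy with the existing
cow_seq, and the names are approximate):

	/* sampled under XFS_ILOCK_SHARED when a new data fork map is cached */
	wpc->data_seq = READ_ONCE(ip->i_df.if_seq);

	/* revalidation, mirroring the existing cow_seq check */
	static bool
	xfs_imap_valid(
		struct xfs_inode	*ip,
		struct xfs_writepage_ctx *wpc,
		xfs_fileoff_t		offset_fsb)
	{
		if (offset_fsb < wpc->imap.br_startoff ||
		    offset_fsb >= wpc->imap.br_startoff +
				  wpc->imap.br_blockcount)
			return false;
		if (wpc->io_type == XFS_IO_COW)
			return wpc->cow_seq == READ_ONCE(ip->i_cowfp->if_seq);
		return wpc->data_seq == READ_ONCE(ip->i_df.if_seq);
	}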

Brian

> Brian
> 
> > Signed-Off-By: Dave Chinner <dchinner@redhat.com>
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 08df809e2315..1ec408007cab 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -3691,6 +3691,7 @@ xfs_bmapi_trim_map(
> >  	mval->br_blockcount = XFS_FILBLKS_MIN(end - *bno,
> >  			got->br_blockcount - (*bno - got->br_startoff));
> >  	mval->br_state = got->br_state;
> > +	mval->br_generation = got->br_generation;
> >  	ASSERT(mval->br_blockcount <= len);
> >  	return;
> >  }
> > @@ -3799,6 +3800,7 @@ xfs_bmapi_read(
> >  		mval->br_startblock = HOLESTARTBLOCK;
> >  		mval->br_blockcount = len;
> >  		mval->br_state = XFS_EXT_NORM;
> > +		mval->br_generation = xfs_iext_generation_fetch(ifp);
> >  		*nmap = 1;
> >  		return 0;
> >  	}
> > @@ -3815,9 +3817,14 @@ xfs_bmapi_read(
> >  	obno = bno;
> >  
> >  	while (bno < end && n < *nmap) {
> > -		/* Reading past eof, act as though there's a hole up to end. */
> > -		if (eof)
> > +		/*
> > +		 * If we are mapping past the last extent in the fork, act as
> > +		 * though there's a hole up to the end of the requested mapping.
> > +		 */
> > +		if (eof) {
> >  			got.br_startoff = end;
> > +			got.br_generation = xfs_iext_generation_fetch(ifp);
> > +		}
> >  		if (got.br_startoff > bno) {
> >  			/* Reading in a hole.  */
> >  			mval->br_startoff = bno;
> > @@ -3825,6 +3832,7 @@ xfs_bmapi_read(
> >  			mval->br_blockcount =
> >  				XFS_FILBLKS_MIN(len, got.br_startoff - bno);
> >  			mval->br_state = XFS_EXT_NORM;
> > +			mval->br_generation = got.br_generation;
> >  			bno += mval->br_blockcount;
> >  			len -= mval->br_blockcount;
> >  			mval++;
> > diff --git a/fs/xfs/libxfs/xfs_iext_tree.c b/fs/xfs/libxfs/xfs_iext_tree.c
> > index 343a94246f5b..bc52269b7973 100644
> > --- a/fs/xfs/libxfs/xfs_iext_tree.c
> > +++ b/fs/xfs/libxfs/xfs_iext_tree.c
> > @@ -663,6 +663,7 @@ xfs_iext_insert(
> >  
> >  	if (new)
> >  		xfs_iext_insert_node(ifp, xfs_iext_leaf_key(new, 0), new, 2);
> > +	xfs_iext_generation_inc(ifp);
> >  }
> >  
> >  static struct xfs_iext_node *
> > @@ -890,6 +891,7 @@ xfs_iext_remove(
> >  		cur->pos = 0;
> >  	}
> >  
> > +	xfs_iext_generation_inc(ifp);
> >  	if (nr_entries >= RECS_PER_LEAF / 2)
> >  		return;
> >  
> > @@ -944,6 +946,7 @@ xfs_iext_lookup_extent(
> >  		return false;
> >  found:
> >  	xfs_iext_get(gotp, cur_rec(cur));
> > +	gotp->br_generation = xfs_iext_generation_fetch(ifp);
> >  	return true;
> >  }
> >  
> > @@ -990,6 +993,7 @@ xfs_iext_update_extent(
> >  
> >  	trace_xfs_bmap_pre_update(ip, cur, state, _RET_IP_);
> >  	xfs_iext_set(cur_rec(cur), new);
> > +	xfs_iext_generation_inc(ifp);
> >  	trace_xfs_bmap_post_update(ip, cur, state, _RET_IP_);
> >  }
> >  
> > @@ -1006,6 +1010,7 @@ xfs_iext_get_extent(
> >  	if (!xfs_iext_valid(ifp, cur))
> >  		return false;
> >  	xfs_iext_get(gotp, cur_rec(cur));
> > +	gotp->br_generation = xfs_iext_generation_fetch(ifp);
> >  	return true;
> >  }
> >  
> > @@ -1040,4 +1045,5 @@ xfs_iext_destroy(
> >  	ifp->if_bytes = 0;
> >  	ifp->if_height = 0;
> >  	ifp->if_u1.if_root = NULL;
> > +	xfs_iext_generation_inc(ifp);
> >  }
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> > index b9f0098e33b8..3a1c1dcb8b54 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.h
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> > @@ -28,6 +28,7 @@ typedef struct xfs_ifork {
> >  	int			if_bytes;	/* bytes in if_u1 */
> >  	int			if_real_bytes;	/* bytes allocated in if_u1 */
> >  	struct xfs_btree_block	*if_broot;	/* file's incore btree root */
> > +	uint32_t		if_generation;	/* extent list generation # */
> >  	short			if_broot_bytes;	/* bytes allocated for root */
> >  	unsigned char		if_flags;	/* per-fork flags */
> >  	int			if_height;	/* height of the extent tree */
> > @@ -186,4 +187,32 @@ extern struct kmem_zone	*xfs_ifork_zone;
> >  
> >  extern void xfs_ifork_init_cow(struct xfs_inode *ip);
> >  
> > +/*
> > + * Bump the fork's generation count and drop a write barrier so it can
> > + * be checked without holding locks
> > + */
> > +static inline void
> > +xfs_iext_generation_inc(
> > +	struct xfs_ifork	*ifp)
> > +{
> > +	ifp->if_generation++;
> > +	smp_wmb();
> > +}
> > +
> > +static inline uint32_t
> > +xfs_iext_generation_fetch(
> > +	struct xfs_ifork	*ifp)
> > +{
> > +	smp_rmb();
> > +	return READ_ONCE(ifp->if_generation);
> > +}
> > +
> > +static inline bool
> > +xfs_iext_generation_same(
> > +	struct xfs_ifork	*ifp,
> > +	uint32_t		val)
> > +{
> > +	return (val == xfs_iext_generation_fetch(ifp));
> > +}
> > +
> >  #endif	/* __XFS_INODE_FORK_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 3c560695c546..781ca5252fe4 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -151,12 +151,18 @@ typedef enum {
> >  	XFS_EXT_NORM, XFS_EXT_UNWRITTEN,
> >  } xfs_exntst_t;
> >  
> > +/*
> > + * There is a generation number added to the incore BMBT record structure for
> > + * determining when cached records held outside of the extent list locks are
> > + * still valid.  This is only filled out on lookups, and is a read-only value.
> > + */
> >  typedef struct xfs_bmbt_irec
> >  {
> >  	xfs_fileoff_t	br_startoff;	/* starting file offset */
> >  	xfs_fsblock_t	br_startblock;	/* starting block number */
> >  	xfs_filblks_t	br_blockcount;	/* number of blocks */
> >  	xfs_exntst_t	br_state;	/* extent state */
> > +	uint32_t	br_generation;	/* extent list generation number */
> >  } xfs_bmbt_irec_t;
> >  
> >  #endif	/* __XFS_TYPES_H__ */
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 0ca511857f7c..4504e8ecbae1 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -793,6 +793,7 @@ xfs_map_blocks(
> >  			imap->br_startoff = offset_fsb;
> >  			imap->br_blockcount = end_fsb - offset_fsb;
> >  			imap->br_startblock = HOLESTARTBLOCK;
> > +			imap->br_generation = xfs_iext_generation_fetch(&ip->i_df);
> >  			*type = XFS_IO_HOLE;
> >  		} else if (imap->br_startblock == HOLESTARTBLOCK) {
> >  			/* landed in a hole */
> > @@ -864,23 +865,46 @@ STATIC bool
> >  xfs_imap_valid(
> >  	struct inode		*inode,
> >  	struct xfs_bmbt_irec	*imap,
> > -	xfs_off_t		offset)
> > +	xfs_off_t		offset,
> > +	int			type)
> >  {
> > -	offset >>= inode->i_blkbits;
> > +	struct xfs_ifork	*ifp;
> > +
> >  
> >  	/*
> > -	 * We have to make sure the cached mapping is within EOF to protect
> > -	 * against eofblocks trimming on file release leaving us with a stale
> > -	 * mapping. Otherwise, a page for a subsequent file extending buffered
> > -	 * write could get picked up by this writeback cycle and written to the
> > -	 * wrong blocks.
> > +	 * We have to make sure the cached mapping hasn't been invalidated by an
> > +	 * extent list modification (e.g. truncate, releasing post-eof blocks,
> > +	 * etc). If the extent list has changed, we need to get a new map.
> > +	 *
> > +	 * Removing a range of a file will clear extents from both the data and
> > +	 * COW fork. That means a single extent range removal will bump both the
> > +	 * data and cow fork generation numbers if such extents exist in the
> > +	 * range. Hence we can safely check against the generation number of the
> > +	 * specific fork we are operating on here as it will flag an
> > +	 * invalidation if the removal affected the specific fork we are doing
> > +	 * IO from.
> >  	 *
> > -	 * Note that what we really want here is a generic mapping invalidation
> > -	 * mechanism to protect us from arbitrary extent modifying contexts, not
> > -	 * just eofblocks.
> > +	 * Finally, page invalidation will block on the page lock we are
> > +	 * holding, but truncate may have already changed the file size. Hence
> > +	 * we could be beyond the new EOF here, but truncate can't proceed until
> > +	 * we finish IO on this page. Checking against EOF right now means we
> > +	 * have to unwind and discard the page, and we have to do similar checks
> > +	 * everywhere we pick up the inode locks again.
> > +	 *
> > +	 * However, it's still valid to allocate and write the data here because
> > +	 * we have valid data and the extent removal after the page
> > +	 * invalidation completes will clean up for us. Ignoring this size
> > +	 * change race condition and relying on extent map coherency checks to
> > +	 * determine whether to proceed or not makes everything in the writepage
> > +	 * path simpler.
> >  	 */
> > -	xfs_trim_extent_eof(imap, XFS_I(inode));
> > +	ifp = XFS_IFORK_PTR(XFS_I(inode), type == XFS_IO_COW ? XFS_COW_FORK
> > +							     : XFS_DATA_FORK);
> > +	if (!xfs_iext_generation_same(ifp, imap->br_generation))
> > +		return false;
> >  
> > +
> > +	offset >>= inode->i_blkbits;
> >  	return offset >= imap->br_startoff &&
> >  		offset < imap->br_startoff + imap->br_blockcount;
> >  }
> > @@ -932,6 +956,7 @@ xfs_writepage_map(
> >  	for (poffset = 0;
> >  	     poffset < PAGE_SIZE;
> >  	     poffset += len, file_offset += len, bh = bh->b_this_page) {
> > +		int retries = 0;
> >  
> >  		/* past the range we are writing, so nothing more to write. */
> >  		if (file_offset >= end_offset)
> > @@ -954,24 +979,54 @@ xfs_writepage_map(
> >  		/* Check to see if current map spans this file offset */
> >  		if (wpc->imap_valid)
> >  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> > -							 file_offset);
> > +						 file_offset, wpc->io_type);
> >  
> >  		/*
> > -		 * If we don't have a valid map, now it's time to get
> > -		 * a new one for this offset. This will allocate delalloc
> > -		 * regions or COW shared extents. If we return without a valid
> > -		 * map, it means we landed in a hole and we skip the block.
> > +		 * If we don't have a valid map, now it's time to get a new one
> > +		 * for this offset. This will allocate delalloc regions or COW
> > +		 * shared extents. If we return without a valid map, it means we
> > +		 * raced with another extent operation.
> > +		 *
> > +		 * If we are racing with invalidation, it will wait for us to
> > +		 * release the page lock which means allocation and IO is ok to
> > +		 * finish off.
> > +		 *
> > +		 * Hence we loop here trying to remap the page until we get a
> > +		 * valid map that hasn't been changed by the time we check it.
> >  		 */
> > -		if (!wpc->imap_valid) {
> > +		while (!wpc->imap_valid) {
> > +			if (!(++retries % 10)) {
> > +				xfs_warn_ratelimited(XFS_I(inode)->i_mount,
> > +					"%s: looping %d times", __func__, retries);
> > +			}
> > +
> >  			error = xfs_map_blocks(inode, file_offset, &wpc->imap,
> >  					     &wpc->io_type);
> > +			if (error == -EAGAIN) {
> > +				/*
> > +				 * Raced with another extent list modification
> > +				 * between map and allocate operations. Back off
> > +				 * for a short while, try again.
> > +				 */
> > +				msleep(1);
> > +				continue;
> > +			}
> >  			if (error)
> >  				goto out;
> > +
> > +			/*
> > +			 * We can still race with another extent list mod by the
> > +			 * time we get back here. It may have been a truncate,
> > +			 * so if the extent is not valid now, we need to map
> > +			 * again, knowing the truncate will be complete before
> > +			 * the next map is executed and so any future
> > +			 * invalidation mod will block on the page lock we hold
> > +			 */
> >  			wpc->imap_valid = xfs_imap_valid(inode, &wpc->imap,
> > -							 file_offset);
> > +						 file_offset, wpc->io_type);
> >  		}
> >  
> > -		if (!wpc->imap_valid || wpc->io_type == XFS_IO_HOLE) {
> > +		if (wpc->io_type == XFS_IO_HOLE) {
> >  			/*
> >  			 * set_page_dirty dirties all buffers in a page, independent
> >  			 * of their state.  The dirty state however is entirely
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index ad48e2f24699..3383287633ff 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -674,8 +674,8 @@ xfs_iomap_write_allocate(
> >  	xfs_bmbt_irec_t *imap)
> >  {
> >  	xfs_mount_t	*mp = ip->i_mount;
> > -	xfs_fileoff_t	offset_fsb, last_block;
> > -	xfs_fileoff_t	end_fsb, map_start_fsb;
> > +	xfs_fileoff_t	offset_fsb;
> > +	xfs_fileoff_t	map_start_fsb;
> >  	xfs_fsblock_t	first_block;
> >  	struct xfs_defer_ops	dfops;
> >  	xfs_filblks_t	count_fsb;
> > @@ -730,56 +730,28 @@ xfs_iomap_write_allocate(
> >  			xfs_defer_init(&dfops, &first_block);
> >  
> >  			/*
> > -			 * it is possible that the extents have changed since
> > -			 * we did the read call as we dropped the ilock for a
> > -			 * while. We have to be careful about truncates or hole
> > -			 * punchs here - we are not allowed to allocate
> > -			 * non-delalloc blocks here.
> > -			 *
> > -			 * The only protection against truncation is the pages
> > -			 * for the range we are being asked to convert are
> > -			 * locked and hence a truncate will block on them
> > -			 * first.
> > -			 *
> > -			 * As a result, if we go beyond the range we really
> > -			 * need and hit an delalloc extent boundary followed by
> > -			 * a hole while we have excess blocks in the map, we
> > -			 * will fill the hole incorrectly and overrun the
> > -			 * transaction reservation.
> > +			 * Now we've locked the inode, we can determine if the
> > +			 * region we are attempting to allocate over is still
> > +			 * valid. If it's potentially invalid, then generation
> > +			 * numbers won't match and we need to back out
> > +			 * completely and let the caller decide what to do.
> >  			 *
> > -			 * Using a single map prevents this as we are forced to
> > -			 * check each map we look for overlap with the desired
> > -			 * range and abort as soon as we find it. Also, given
> > -			 * that we only return a single map, having one beyond
> > -			 * what we can return is probably a bit silly.
> > -			 *
> > -			 * We also need to check that we don't go beyond EOF;
> > -			 * this is a truncate optimisation as a truncate sets
> > -			 * the new file size before block on the pages we
> > -			 * currently have locked under writeback. Because they
> > -			 * are about to be tossed, we don't need to write them
> > -			 * back....
> > +			 * This will occur if we race with truncate or some
> > +			 * other extent manipulation while we don't have the
> > +			 * inode locked.
> >  			 */
> > -			nimaps = 1;
> > -			end_fsb = XFS_B_TO_FSB(mp, XFS_ISIZE(ip));
> > -			error = xfs_bmap_last_offset(ip, &last_block,
> > -							XFS_DATA_FORK);
> > -			if (error)
> > +			if (!xfs_iext_generation_same(
> > +					XFS_IFORK_PTR(ip, whichfork),
> > +					imap->br_generation)) {
> > +				error = -EAGAIN;
> >  				goto trans_cancel;
> > -
> > -			last_block = XFS_FILEOFF_MAX(last_block, end_fsb);
> > -			if ((map_start_fsb + count_fsb) > last_block) {
> > -				count_fsb = last_block - map_start_fsb;
> > -				if (count_fsb == 0) {
> > -					error = -EAGAIN;
> > -					goto trans_cancel;
> > -				}
> >  			}
> >  
> >  			/*
> >  			 * From this point onwards we overwrite the imap
> >  			 * pointer that the caller gave to us.
> >  			 */
> > +			nimaps = 1;
> >  			error = xfs_bmapi_write(tp, ip, map_start_fsb,
> >  						count_fsb, flags, &first_block,
> >  						nres, imap, &nimaps,

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-07 14:41       ` Brian Foster
@ 2019-01-08  5:46         ` Dave Chinner
  2019-01-08 14:54           ` Brian Foster
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2019-01-08  5:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: bugzilla-daemon, linux-xfs

On Mon, Jan 07, 2019 at 09:41:04AM -0500, Brian Foster wrote:
> On Sun, Jan 06, 2019 at 08:31:20AM +1100, Dave Chinner wrote:
> > On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> > > On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > - writepages is in progress on a particular file that has decently sized
> > >   post-eof speculative preallocation
> > > - writepages gets to the point where it looks up or allocates a new imap
> > >   that includes the preallocation, the allocation/lookup result is
> > >   stored in wpc
> > > - the file is closed by one process, killing off preallocation, then
> > >   immediately appended to by another, updating the file size by a few
> > >   bytes
> > > - writepages comes back around to xfs_map_blocks() and trims imap to the
> > >   current size, but imap still includes one block of the original speculative
> > >   prealloc (that was truncated and recreated) because the size increased
> > >   between the time imap was stored and trimmed
> > 
> > I'm betting hole punch can cause similar problems with cached maps
> > in writepage. I wrote a patch yonks ago to put a generation number
> > in the extent fork and to store it in the cached map, and to
> > invalidate the cached map if they didn't match.
> > 
> 
> Isn't hole punch already serialized with writeback? I thought the issue

Yes, dirty pages over the range of the hole are flushed
before we punch the hole. And we do that after preventing new pages
from being dirtied. But this doesn't prevent background writeback
from running at the same time on regions of the file outside the
range to be hole punched. It also does not prevent writeback from
caching a map that covers the range that has a hole punched in the
middle of it.

Say the file has one large allocated extent and all the pages are in
cache and dirty (e.g. full file overwrite). Hole punch
runs, locks out new writes and cleans the middle of the file. At the
same time, background writeback starts at offset zero and
caches the single extent that maps the entire file. Hole punch then locks the
extent list, punches out the middle of the file and drops all its
locks. Writeback is still writing pages at the start of the file,
oblivious to the hole that has just been punched and that made its
cached extent map stale.

App then dirties a new page over the hole, creating a delalloc
extent.

If writeback hasn't yet reached and skipped over the hole because
there are no dirty pages left in that range (say it blocked in the
block device because of device congestion or throttling), it will
see this newly dirtied page and check that it is within the range
of the cached map. Which it is, because the cached map spans the
entire file. So writeback will map it directly to a bio because it
considers the cached map to be correct.

However, the hole punch made the cached map stale, and the newly
dirtied page needs to call xfs_iomap_write_allocate() to convert the
delalloc extent. But because we had no way to detect that the cached
map no longer reflected the inode's extent map, writeback will
incorrectly map the newly dirtied data to a freed (and potentially
reallocated) block on disk. i.e. it's a use after free situation.

FWIW, the recent range_cyclic changes we made removed one of the
vectors for this problem (skip over hole, go back to start, hit
newly dirtied page and write with stale cached map), but the problem
still exists if the cached extent is large enough and we block
writeback lower in the IO stack....
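
A hedged userspace sketch of that access pattern (assumed file name
and sizes; error checking omitted, and whether it actually loses the
race depends on timing, writeback behaviour and the device):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define FILE_SZ		(256 << 20)	/* one large allocated extent */
#define HOLE_OFF	(FILE_SZ / 2)
#define HOLE_LEN	(1 << 20)

int main(void)
{
	static char buf[1 << 16];
	int fd = open("victim", O_RDWR | O_CREAT, 0644);
	off_t off;

	memset(buf, 0xab, sizeof(buf));
	/* Dirty every page: full file overwrite. */
	for (off = 0; off < FILE_SZ; off += sizeof(buf))
		pwrite(fd, buf, sizeof(buf), off);

	/* Kick background writeback without waiting for it. */
	sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);

	/* Punch the middle: flushes and invalidates that range, then
	 * frees the underlying blocks. */
	fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		  HOLE_OFF, HOLE_LEN);

	/* Redirty a page over the hole: a new delalloc extent that a
	 * stale cached map could wrongly send to the freed blocks. */
	pwrite(fd, buf, 4096, HOLE_OFF);

	close(fd);
	return 0;
}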

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-07 19:11           ` Brian Foster
@ 2019-01-08  5:55             ` Dave Chinner
  2019-01-08 14:57               ` Brian Foster
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2019-01-08  5:55 UTC (permalink / raw)
  To: Brian Foster; +Cc: bugzilla-daemon, linux-xfs

On Mon, Jan 07, 2019 at 02:11:01PM -0500, Brian Foster wrote:
> On Mon, Jan 07, 2019 at 09:41:14AM -0500, Brian Foster wrote:
> > On Mon, Jan 07, 2019 at 08:57:37AM +1100, Dave Chinner wrote:
> > For example, I'm concerned that something like sustained buffered writes
> > could completely break the writeback imap cache by continuously
> > invalidating it. I think speculative preallocation should help with this
> > in the common case by already spreading those writes over fewer
> > allocations, but do we care enough about the case where preallocation
> > might be turned down/off to try and restrict where we bump the sequence
> > number (to > i_size changes, for example)? Maybe it's not worth the
> > trouble just to optimize out a shared ilock cycle and lookup, since the
> > extent list is still in-core after all.
> > 
> 
> A follow up FWIW... a quick test of some changes to reuse the existing
> mechanism doesn't appear to show much of a problem in this regard, even
> with allocsize=4k. I think another thing that minimizes impact is that
> even if we end up revalidating the same imap over and over, the ioend
> construction logic is distinct and based on contiguity. IOW, writeback
> is still sending the same sized I/Os for contiguous blocks...

Ah, I think you discovered that the delay between write(),
->writepages() and the incoming write throttling in
balance_dirty_pages() creates a large enough dirty page window that
we avoid lock-stepping write and writepage in a detrimental way....

AFAICT, the only time we have to worry about this is if we are so
short of memory the kernel is cleaning every page as soon as it is
dirtied. If we get into that situation, invalidating the cached map
is the least of our worries :P

Cheers,

dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-08  5:46         ` Dave Chinner
@ 2019-01-08 14:54           ` Brian Foster
  0 siblings, 0 replies; 29+ messages in thread
From: Brian Foster @ 2019-01-08 14:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bugzilla-daemon, linux-xfs

On Tue, Jan 08, 2019 at 04:46:39PM +1100, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 09:41:04AM -0500, Brian Foster wrote:
> > On Sun, Jan 06, 2019 at 08:31:20AM +1100, Dave Chinner wrote:
> > > On Fri, Jan 04, 2019 at 07:32:17AM -0500, Brian Foster wrote:
> > > > On Tue, Dec 25, 2018 at 06:10:59AM +0000, bugzilla-daemon@bugzilla.kernel.org wrote:
> > > > - writepages is in progress on a particular file that has decently sized
> > > >   post-eof speculative preallocation
> > > > - writepages gets to the point where it looks up or allocates a new imap
> > > >   that includes the preallocation, the allocation/lookup result is
> > > >   stored in wpc
> > > > - the file is closed by one process, killing off preallocation, then
> > > >   immediately appended to by another, updating the file size by a few
> > > >   bytes
> > > > - writepages comes back around to xfs_map_blocks() and trims imap to the
> > > >   current size, but imap still includes one block of the original speculative
> > > >   prealloc (that was truncated and recreated) because the size increased
> > > >   between the time imap was stored and trimmed
> > > 
> > > I'm betting hole punch can cause similar problems with cached maps
> > > in writepage. I wrote a patch yonks ago to put a generation number
> > > in the extent fork and to store it in the cached map, and to
> > > invalidate the cached map if they didn't match.
> > > 
> > 
> > Isn't hole punch already serialized with writeback? I thought the issue
> 
> Yes, dirty pages over the range of the hole are flushed
> before we punch the hole. And we do that after preventing new pages
> from being dirtied. But this doesn't prevent background writeback
> from running at the same time on regions of the file outside the
> range to be hole punched. It also does not prevent writeback from
> caching a map that covers the range that has a hole punched in the
> middle of it.
> 
> Say the file has one large allocated extent and all the pages are in
> cache and dirty (e.g. full file overwrite). Hole punch
> runs, locks out new writes and cleans the middle of the file. At the
> same time, background writeback starts at offset zero and
> caches the single extent that maps the entire file. Hole punch then locks the
> extent list, punches out the middle of the file and drops all its
> locks. Writeback is still writing pages at the start of the file,
> oblivious to the hole that has just been punched and that made its
> cached extent map stale.
> 
> App then dirties a new page over the hole, creating a delalloc
> extent.
> 

Ah right, thanks. I had just lost track of this. I was wondering why we
didn't have a test for this problem and then I realized I wrote one[1]
over a year ago and it just never landed. :P

The test uses truncate/pwrite rather than hole punch, but it triggers
the same fundamental problem. It looks like the work left off at
finding a suitable set of parameters to reproduce the problem
(reasonably) reliably on various test setups without taking forever.
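
For illustration, a rough sketch of the shape such a reproducer takes
(hypothetical sizes, counts and file name; the real test needs the
parameter tuning described above to hit the race reliably):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 16];
	int i;

	memset(buf, 0xcd, sizeof(buf));

	if (fork() == 0) {
		/* Writer: streaming buffered writes keep writeback busy
		 * with a cached mapping for this file. */
		int fd = open("victim", O_WRONLY | O_CREAT, 0644);

		for (i = 0; i < 2048; i++)
			write(fd, buf, sizeof(buf));
		close(fd);
		_exit(0);
	}

	/* Racer: repeatedly shrink the file, then extend it by a few
	 * bytes past the new EOF, changing i_size under writeback. */
	for (i = 0; i < 2048; i++) {
		int fd = open("victim", O_WRONLY | O_CREAT, 0644);
		struct stat st;

		if (fstat(fd, &st) == 0 && st.st_size > 4096) {
			ftruncate(fd, st.st_size - 4096);
			pwrite(fd, "x", 1, st.st_size - 4096 + 7);
		}
		close(fd);
	}
	wait(NULL);
	return 0;
}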

Anyway, I have a small series that includes a stable fix and then
replaces the whole EOF trim band-aid with an invalidation fix based on
Christoph's aforementioned fork seqno patch. It needs more massage and
testing, but I'll revisit the fstest and include that as well.

Brian

[1] https://marc.info/?l=fstests&m=150902929900510&w=2


> If writeback hasn't yet reached and skipped over the hole because
> there are no dirty pages left in that range (say it blocked in the
> block device because of device congestion or throttling), it will
> see this newly dirtied page and check that it is within the range
> of the cached map. Which it is, because the cached map spans the
> entire file. So writeback will map it directly to a bio because it
> considers the cached map to be correct.
> 
> However, the hole punch made the cached map stale, and the newly
> dirtied page needs to call xfs_iomap_write_allocate() to convert the
> delalloc extent. But because we had no way to detect that the cached
> map no longer reflected the inode's extent map, writeback will
> incorrectly map the newly dirtied data to a freed (and potentially
> reallocated) block on disk. i.e. it's a use after free situation.
> 
> FWIW, the recent range_cyclic changes we made removed one of the
> vectors for this problem (skip over hole, go back to start, hit
> newly dirtied page and write with stale cached map), but the problem
> still exists if the cached extent is large enough and we block
> writeback lower in the IO stack....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [Bug 202053] [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985
  2019-01-08  5:55             ` Dave Chinner
@ 2019-01-08 14:57               ` Brian Foster
  0 siblings, 0 replies; 29+ messages in thread
From: Brian Foster @ 2019-01-08 14:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bugzilla-daemon, linux-xfs

On Tue, Jan 08, 2019 at 04:55:18PM +1100, Dave Chinner wrote:
> On Mon, Jan 07, 2019 at 02:11:01PM -0500, Brian Foster wrote:
> > On Mon, Jan 07, 2019 at 09:41:14AM -0500, Brian Foster wrote:
> > > On Mon, Jan 07, 2019 at 08:57:37AM +1100, Dave Chinner wrote:
> > > For example, I'm concerned that something like sustained buffered writes
> > > could completely break the writeback imap cache by continuously
> > > invalidating it. I think speculative preallocation should help with this
> > > in the common case by already spreading those writes over fewer
> > > allocations, but do we care enough about the case where preallocation
> > > might be turned down/off to try and restrict where we bump the sequence
> > > number (to > i_size changes, for example)? Maybe it's not worth the
> > > trouble just to optimize out a shared ilock cycle and lookup, since the
> > > extent list is still in-core after all.
> > > 
> > 
> > A follow up FWIW... a quick test of some changes to reuse the existing
> > mechanism doesn't appear to show much of a problem in this regard, even
> > with allocsize=4k. I think another thing that minimizes impact is that
> > even if we end up revalidating the same imap over and over, the ioend
> > construction logic is distinct and based on contiguity. IOW, writeback
> > is still sending the same sized I/Os for contiguous blocks...
> 
> Ah, I think you discovered that the delay between write(),
> ->writepages() and the incoming write throttling in
> balance_dirty_pages() creates a large enough dirty page window that
> we avoid lock-stepping write and writepage in a detrimental way....
> 

That certainly may be another factor, but note that I am able to
reproduce a significant number of spurious invalidations and thus a
significant increase in imap lookups (in the allocsize=4k case). Taking
another look at some trace data, I also see sequences of xfs_iomap_alloc
and xfs_map_blocks_found events in lockstep, though those sequences
aren't terribly long and aren't sustained (which may be to your
point wrt buffered write throttling).

I think the larger point is that the following factors, taken together:

- buffered write throttling (via dirty pages)
- speculative preallocation in XFS throttling fork changes in the common
  sequential write (file copy) use case
- cached map lookup being fairly cheap (i.e., no extra I/Os) relative to
  I/O
- spurious cached map revals not affecting I/O sizes

... mean that this isn't some blatant problem that makes global
invalidation a nonstarter for common usage patterns. I wouldn't
rule out other problems precisely because the spurious invals are there,
as noted above, but so far it seems it's not worth worrying about or
overcomplicating things unless/until we find a use case and workload
noticeably affected by it.

Brian

> AFAICT, the only time we have to worry about this is if we are so
> short of memory the kernel is cleaning every page as soon as it is
> dirtied. If we get into that situation, invalidating the cached map
> is the least of our worries :P
> 
> Cheers,
> 
> dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2019-01-08 14:57 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-12-24  7:16 [Bug 202053] New: [xfstests generic/464]: XFS corruption and Assertion failed: 0, file: fs/xfs/xfs_super.c, line: 985 bugzilla-daemon
2018-12-24  7:19 ` [Bug 202053] " bugzilla-daemon
2018-12-24 10:40 ` bugzilla-daemon
2018-12-24 10:43 ` bugzilla-daemon
2018-12-24 10:49 ` bugzilla-daemon
2018-12-25  6:10 ` bugzilla-daemon
2019-01-04 12:32   ` Brian Foster
2019-01-04 12:52     ` Brian Foster
2019-01-05 21:31     ` Dave Chinner
2019-01-06 21:57       ` Dave Chinner
2019-01-07 14:41         ` Brian Foster
2019-01-07 19:11           ` Brian Foster
2019-01-08  5:55             ` Dave Chinner
2019-01-08 14:57               ` Brian Foster
2019-01-07 14:41       ` Brian Foster
2019-01-08  5:46         ` Dave Chinner
2019-01-08 14:54           ` Brian Foster
2019-01-04 12:40 ` bugzilla-daemon
2019-01-04 12:52 ` bugzilla-daemon
2019-01-05 21:31 ` bugzilla-daemon
2019-01-06 21:57 ` bugzilla-daemon
2019-01-07  2:35 ` bugzilla-daemon
2019-01-07 14:41 ` bugzilla-daemon
2019-01-07 14:41 ` bugzilla-daemon
2019-01-07 19:11 ` bugzilla-daemon
2019-01-08  5:46 ` bugzilla-daemon
2019-01-08  5:55 ` bugzilla-daemon
2019-01-08 14:54 ` bugzilla-daemon
2019-01-08 14:57 ` bugzilla-daemon
