* [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen()
@ 2017-07-07 12:01 Eryu Guan
  2017-07-08  6:49 ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Eryu Guan @ 2017-07-07 12:01 UTC (permalink / raw)
  To: linux-xfs; +Cc: Darrick J. Wong

Hi all,

I recently hit a repeatable sb_fdblocks corruption as below:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
sb_fdblocks 14538692, counted 14669764
        - found root inode chunk
Phase 3 - for each AG...
...

And the count diff is always 14669764 - 14538692 = 131072 (128k). The
XFS in question was formatted with the "-m rmapbt=1 -b 1k" options.

After turning on XFS_WARN and adding some debug printks (the detailed
logs are appended at the end of this mail), I found that this was caused
by a too-large 'indlen' returned by xfs_bmap_worst_indlen(), which can't
fit in a 17-bit value (STARTBLOCKVALBITS is defined as 17), so the
assert in nullstartblock() failed: ASSERT(k < (1 << STARTBLOCKVALBITS));

From the log, newlen = 151513, which needs 18 bits, so nullstartblock()
throws away the 18th bit, and the sb_fdblocks difference is always 2^17
= 131072.

To reproduce this, you need to keep enough dirty data in memory, so
that a large enough delayed allocation extent stays in memory (not yet
converted to a real allocation by the writeback thread); speculative
preallocation can then reserve a large number of blocks based on the
existing extent size.

I first saw this when running xfs/217 on a ppc64 host with 18G memory,
where the default vm.dirty_background_ratio of 10 allows around 1.8G of
dirty memory. Now I can reproduce it by tuning
vm.dirty_background_ratio and vm.dirty_ratio on an x86_64 host with 4G
memory.

---- 8< ----
#!/bin/bash
dev=/dev/sdc1
mnt=/mnt/xfs

# write 1G file
size=$((1024*1024*1024))

echo 90 > /proc/sys/vm/dirty_background_ratio
echo 90 > /proc/sys/vm/dirty_ratio

mkfs -t xfs -m rmapbt=1 -b size=1k -f $dev
mount $dev $mnt

xfs_io -fc "pwrite -b 1m 0 $size" $mnt/testfile
umount $mnt

xfs_repair -n $dev
exit $?
---- >8 ----

This was uncovered by commit fd26a88093ba ("xfs: factor rmap btree size
into the indlen calculations"), which takes the worst-case size of the
rmapbt into account. But I'm not sure what the best fix is.

BTW, what are these magic numbers? What's the reason behind
STARTBLOCKVALBITS being 17? I can't find any explanation.

#define STARTBLOCKVALBITS       17
#define STARTBLOCKMASKBITS      (15 + 20)

Thanks,
Eryu

[11342.901613] XFS (sdc7): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
[11342.911072] XFS (sdc7): Mounting V5 Filesystem
[11342.949401] XFS (sdc7): Ending clean mount
[11342.962794] xfs_bmap_add_extent_hole_delay: oldlen(323) = startblockval(left.br_startblock)(90) + startblockval(new->br_startblock)(233)
[11342.975049] xfs_bmap_add_extent_hole_delay: temp(4032) = left.br_blockcount(1024) + new->br_blockcount(3008)
[11342.984871] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 308
[11342.992005] xfs_bmap_add_extent_hole_delay: newlen = 308
[11342.999286] xfs_bmap_add_extent_hole_delay: oldlen(620) = startblockval(left.br_startblock)(308) + startblockval(new->br_startblock)(312)
[11343.011626] xfs_bmap_add_extent_hole_delay: temp(8128) = left.br_blockcount(4032) + new->br_blockcount(4096)
[11343.021446] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 603
[11343.028578] xfs_bmap_add_extent_hole_delay: newlen = 603
[11343.036615] xfs_bmap_add_extent_hole_delay: oldlen(1211) = startblockval(left.br_startblock)(603) + startblockval(new->br_startblock)(608)
[11343.049041] xfs_bmap_add_extent_hole_delay: temp(16320) = left.br_blockcount(8128) + new->br_blockcount(8192)
[11343.058951] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 1194
[11343.066170] xfs_bmap_add_extent_hole_delay: newlen = 1194
[11343.077016] xfs_bmap_add_extent_hole_delay: oldlen(2393) = startblockval(left.br_startblock)(1194) + startblockval(new->br_startblock)(1199)
[11343.089617] xfs_bmap_add_extent_hole_delay: temp(32704) = left.br_blockcount(16320) + new->br_blockcount(16384)
[11343.099699] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 2380
[11343.106919] xfs_bmap_add_extent_hole_delay: newlen = 2380
[11343.123254] xfs_bmap_add_extent_hole_delay: oldlen(4764) = startblockval(left.br_startblock)(2380) + startblockval(new->br_startblock)(2384)
[11343.135855] xfs_bmap_add_extent_hole_delay: temp(65472) = left.br_blockcount(32704) + new->br_blockcount(32768)
[11343.145936] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 4746
[11343.153156] xfs_bmap_add_extent_hole_delay: newlen = 4746
[11343.180360] xfs_bmap_add_extent_hole_delay: oldlen(9496) = startblockval(left.br_startblock)(4746) + startblockval(new->br_startblock)(4750)
[11343.192958] xfs_bmap_add_extent_hole_delay: temp(131008) = left.br_blockcount(65472) + new->br_blockcount(65536)
[11343.203131] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 9480
[11343.210350] xfs_bmap_add_extent_hole_delay: newlen = 9480
[11343.259106] xfs_bmap_add_extent_hole_delay: oldlen(18964) = startblockval(left.br_startblock)(9480) + startblockval(new->br_startblock)(9484)
[11343.271793] xfs_bmap_add_extent_hole_delay: temp(262080) = left.br_blockcount(131008) + new->br_blockcount(131072)
[11343.282140] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 18949
[11343.289446] xfs_bmap_add_extent_hole_delay: newlen = 18949
[11343.381338] xfs_bmap_add_extent_hole_delay: oldlen(37903) = startblockval(left.br_startblock)(18949) + startblockval(new->br_startblock)(18954)
[11343.394196] xfs_bmap_add_extent_hole_delay: temp(524224) = left.br_blockcount(262080) + new->br_blockcount(262144)
[11343.404545] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 37887
[11343.411852] xfs_bmap_add_extent_hole_delay: newlen = 37887
[11343.590032] xfs_bmap_add_extent_hole_delay: oldlen(75779) = startblockval(left.br_startblock)(37887) + startblockval(new->br_startblock)(37892)
[11343.602892] xfs_bmap_add_extent_hole_delay: temp(1048512) = left.br_blockcount(524224) + new->br_blockcount(524288)
[11343.613325] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 75763
[11343.620630] xfs_bmap_add_extent_hole_delay: newlen = 75763
[11343.972010] xfs_bmap_add_extent_hole_delay: oldlen(151531) = startblockval(left.br_startblock)(75763) + startblockval(new->br_startblock)(75768)
[11343.984957] xfs_bmap_add_extent_hole_delay: temp(2097088) = left.br_blockcount(1048512) + new->br_blockcount(1048576)
[11343.995558] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 151513
[11344.002952] xfs_bmap_add_extent_hole_delay: newlen = 151513
[11344.008528] XFS: Assertion failed: k < (1 << STARTBLOCKVALBITS), file: fs/xfs/libxfs/xfs_format.h, line: 1569
[11344.018456] ------------[ cut here ]------------
[11344.023142] WARNING: CPU: 1 PID: 7594 at fs/xfs/xfs_message.c:105 asswarn+0x22/0x30 [xfs]
[11344.031379] Modules linked in: xfs(OE) libcrc32c btrfs intel_powerclamp coretemp xor kvm_intel raid6_pq kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support pcbc cdc_ether aesni_intel sg ipmi_si crypto_simd pcspkr glue_helper lpc_ich usbnet i2c_i801 cryptd mfd_core mii ipmi_devintf ioatdma shpchp ipmi_msghandler i7core_edac dca acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 jbd2 mbcache sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_generic pata_acpi drm ata_piix crc32c_intel libata megaraid_sas i2c_core bnx2 dm_mirror dm_region_hash dm_log dm_mod [last unloaded: xfs]
[11344.091315] CPU: 1 PID: 7594 Comm: xfs_io Tainted: G        W  OE   4.12.0 #102
[11344.098674] Hardware name: IBM System x3550 M3 -[7944OEJ]-/90Y4784     , BIOS -[D6E150CUS-1.11]- 02/08/2011
[11344.108423] task: ffff88017481ea40 task.stack: ffffc90003d14000
[11344.114460] RIP: 0010:asswarn+0x22/0x30 [xfs]
[11344.118881] RSP: 0018:ffffc90003d17ad0 EFLAGS: 00010282
[11344.124145] RAX: 00000000ffffffea RBX: 0000000000000051 RCX: 0000000000000021
[11344.131339] RDX: ffffc90003d179f8 RSI: 000000000000000a RDI: ffffffffa07d8b22
[11344.138470] RBP: ffffc90003d17ad0 R08: 0000000000000000 R09: 0000000000000000
[11344.145665] R10: 000000000000000a R11: f000000000000000 R12: ffff880143648000
[11344.152853] R13: ffffc90003d17c7c R14: 0000000000024fd9 R15: 0000000000024fd9
[11344.160047] FS:  00007fbe214f5740(0000) GS:ffff88017ba40000(0000) knlGS:0000000000000000
[11344.168190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[11344.173989] CR2: 00007fbe214f2000 CR3: 0000000171bb2000 CR4: 00000000000006e0
[11344.181180] Call Trace:
[11344.183715]  xfs_bmap_add_extent_hole_delay+0x76b/0x7c0 [xfs]
[11344.189539]  ? xfs_mod_fdblocks+0xa6/0x1a0 [xfs]
[11344.194247]  xfs_bmapi_reserve_delalloc+0x263/0x310 [xfs]
[11344.199731]  ? xfs_bmbt_get_all+0x18/0x20 [xfs]
[11344.204354]  ? xfs_iext_get_extent+0x35/0x40 [xfs]
[11344.209240]  xfs_file_iomap_begin+0x658/0x8c0 [xfs]
[11344.214165]  ? iomap_write_end+0x38/0x80
[11344.218127]  iomap_apply+0x44/0xf0
[11344.221571]  iomap_file_buffered_write+0x68/0xa0
[11344.226225]  ? iomap_write_begin.constprop.16+0xf0/0xf0
[11344.231545]  xfs_file_buffered_aio_write+0xca/0x240 [xfs]
[11344.237009]  ? page_add_new_anon_rmap+0x89/0xc0
[11344.241631]  xfs_file_write_iter+0xfc/0x140 [xfs]
[11344.246405]  __vfs_write+0xc5/0x140
[11344.249936]  vfs_write+0xb2/0x1b0
[11344.253315]  ? syscall_trace_enter+0x1b7/0x290
[11344.257800]  SyS_pwrite64+0x69/0xa0
[11344.261294]  do_syscall_64+0x67/0x150
[11344.265021]  entry_SYSCALL64_slow_path+0x25/0x25
[11344.269682] RIP: 0033:0x7fbe210d5fb3
[11344.273262] RSP: 002b:00007ffd75b34778 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
[11344.280888] RAX: ffffffffffffffda RBX: 000000003ff00000 RCX: 00007fbe210d5fb3
[11344.288078] RDX: 0000000000100000 RSI: 00007fbe213f4000 RDI: 0000000000000003
[11344.295269] RBP: 00000000000003ff R08: 0000000000000000 R09: 0000000000101030
[11344.302419] R10: 000000003ff00000 R11: 0000000000000246 R12: 0000000000100000
[11344.309608] R13: 000000003ff00000 R14: 00000000ffffffff R15: 0000000000000000
[11344.316804] Code: 0f 0b 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 f1 41 89 d0 48 c7 c6 98 3a 7e a0 48 89 fa 31 c0 48 89 e5 31 ff e8 2e fd ff ff <0f> ff 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
[11344.335722] ---[ end trace 812850ed381d73bb ]---


* Re: [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen()
  2017-07-07 12:01 [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen() Eryu Guan
@ 2017-07-08  6:49 ` Darrick J. Wong
  2017-07-09 14:08   ` Eryu Guan
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2017-07-08  6:49 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs

On Fri, Jul 07, 2017 at 08:01:43PM +0800, Eryu Guan wrote:
> Hi all,
> 
> I recently hit a repeatable sb_fdblocks corruption as below:
> 
> Phase 1 - find and verify superblock...
> Phase 2 - using internal log
>         - zero log...
>         - scan filesystem freespace and inode maps...
> sb_fdblocks 14538692, counted 14669764
>         - found root inode chunk
> Phase 3 - for each AG...
> ...
> 
> And the count diff is always 14669764 - 14538692 = 131072 (128k). The
> XFS in question was formatted with the "-m rmapbt=1 -b 1k" options.
> 
> After turning on XFS_WARN and adding some debug printks (I appended the
> detailed logs at the end of mail), I found that this was caused by too
> large 'indlen' returned by xfs_bmap_worst_indlen(), which can't fit in
> a 17-bit value (STARTBLOCKVALBITS is defined as 17), so the assert in
> nullstartblock() failed: ASSERT(k < (1 << STARTBLOCKVALBITS));
> 
> From the log, newlen = 151513, which needs 18 bits, so nullstartblock()
> throws away the 18th bit, and the sb_fdblocks difference is always 2^17
> = 131072.

br_startblock is encoded in memory (and in the on-disk bmbt records) as
a 52-bit unsigned integer.  We signal a delayed allocation record by
setting the uppermost STARTBLOCKMASKBITS (35) bits to 1 and stash the
'indlen' reservation (i.e. the worst case estimate of the space we need
to grow the bmbt/rmapbt to map the entire delayed allocation) in the
lower 17 bits of br_startblock.  In theory this is ok because we're
still quite a ways from having enough storage to create an fs where
the upper bits in the agno part of an xfs_fsblock_t are actually set.

> To reproduce this, you need to keep enough dirty data in memory, so that
> you can keep a large enough delay allocated extent in memory (not
> converted to allocated by writeback thread), then speculative
> preallocation could allocate large number of blocks based on the
> existing extent size.
> 
> I first saw this by running xfs/217 on a ppc64 host with 18G memory, and
> the default vm.dirty_background_ratio is 10, so it could keep around
> 1.8G dirty memory. Now I can reproduce by tuning
> vm.dirty_background_ratio and vm.dirty_ratio on an x86_64 host with 4G
> memory.
> 
> ---- 8< ----
> #!/bin/bash
> dev=/dev/sdc1
> mnt=/mnt/xfs
> 
> # write 1G file
> size=$((1024*1024*1024))
> 
> echo 90 > /proc/sys/vm/dirty_background_ratio
> echo 90 > /proc/sys/vm/dirty_ratio
> 
> mkfs -t xfs -m rmapbt=1 -b size=1k -f $dev
> mount $dev $mnt
> 
> xfs_io -fc "pwrite -b 1m 0 $size" $mnt/testfile
> umount $mnt
> 
> xfs_repair -n $dev
> exit $?
> ---- >8 ----
> 
> This is uncovered by commit fd26a88093ba ("xfs: factor rmap btree size
> into the indlen calculations"), which adds worst-case size of rmapbt
> into account. But I'm not sure what's the best fix.

Aha, that old silly fix.  In theory the per-AG reservation code is
supposed to reserve enough backup AGFL space to handle a reasonable
amount of rmapbt expansion, but then we double that up by adding
additional rmapbt block estimates to indlen, presumably so that we favor
returning ENOSPC when we go making delalloc reservations at
write_begin/page_mkwrite time.

However, we drop the indlen reservation as soon as the first transaction
in an allocate -> map -> rmap chain commits.  Since rmap is never the
first transaction in a complex transaction series, it never gets its
hands on that indlen.  Furthermore, indlen blocks are reserved from the
/global/ free block counter and not at a per-AG level, which means that
even with the indlen reservation we can still blow up during the rmap
step because a particular AG might be totally out of blocks.

So maybe the solution is to revert this patch and see if generic/224
still blows up when sunit/swidth are set?  I tried the steps given in
your email from 18 Nov 2016 ("[BUG] dd doesn't return on ENOSPC and hang
when fulfilling rmapbt XFS") with sunit=32,swidth=224 (numbers I
entirely made up) and it ran just fine.  I then ran it with the
reproducer steps you outlined above, and that ran just fine too.
I did not run the rest of xfstests.

> BTW, what are these magic numbers? What's the reason behind
> STARTBLOCKVALBITS being 17? I can't find any explanation..

(See above)

--D

> #define STARTBLOCKVALBITS       17
> #define STARTBLOCKMASKBITS      (15 + 20)
> 
> Thanks,
> Eryu
> 
> [11342.901613] XFS (sdc7): EXPERIMENTAL reverse mapping btree feature enabled. Use at your own risk!
> [11342.911072] XFS (sdc7): Mounting V5 Filesystem
> [11342.949401] XFS (sdc7): Ending clean mount
> [11342.962794] xfs_bmap_add_extent_hole_delay: oldlen(323) = startblockval(left.br_startblock)(90) + startblockval(new->br_startblock)(233)
> [11342.975049] xfs_bmap_add_extent_hole_delay: temp(4032) = left.br_blockcount(1024) + new->br_blockcount(3008)
> [11342.984871] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 308
> [11342.992005] xfs_bmap_add_extent_hole_delay: newlen = 308
> [11342.999286] xfs_bmap_add_extent_hole_delay: oldlen(620) = startblockval(left.br_startblock)(308) + startblockval(new->br_startblock)(312)
> [11343.011626] xfs_bmap_add_extent_hole_delay: temp(8128) = left.br_blockcount(4032) + new->br_blockcount(4096)
> [11343.021446] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 603
> [11343.028578] xfs_bmap_add_extent_hole_delay: newlen = 603
> [11343.036615] xfs_bmap_add_extent_hole_delay: oldlen(1211) = startblockval(left.br_startblock)(603) + startblockval(new->br_startblock)(608)
> [11343.049041] xfs_bmap_add_extent_hole_delay: temp(16320) = left.br_blockcount(8128) + new->br_blockcount(8192)
> [11343.058951] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 1194
> [11343.066170] xfs_bmap_add_extent_hole_delay: newlen = 1194
> [11343.077016] xfs_bmap_add_extent_hole_delay: oldlen(2393) = startblockval(left.br_startblock)(1194) + startblockval(new->br_startblock)(1199)
> [11343.089617] xfs_bmap_add_extent_hole_delay: temp(32704) = left.br_blockcount(16320) + new->br_blockcount(16384)
> [11343.099699] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 2380
> [11343.106919] xfs_bmap_add_extent_hole_delay: newlen = 2380
> [11343.123254] xfs_bmap_add_extent_hole_delay: oldlen(4764) = startblockval(left.br_startblock)(2380) + startblockval(new->br_startblock)(2384)
> [11343.135855] xfs_bmap_add_extent_hole_delay: temp(65472) = left.br_blockcount(32704) + new->br_blockcount(32768)
> [11343.145936] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 4746
> [11343.153156] xfs_bmap_add_extent_hole_delay: newlen = 4746
> [11343.180360] xfs_bmap_add_extent_hole_delay: oldlen(9496) = startblockval(left.br_startblock)(4746) + startblockval(new->br_startblock)(4750)
> [11343.192958] xfs_bmap_add_extent_hole_delay: temp(131008) = left.br_blockcount(65472) + new->br_blockcount(65536)
> [11343.203131] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 9480
> [11343.210350] xfs_bmap_add_extent_hole_delay: newlen = 9480
> [11343.259106] xfs_bmap_add_extent_hole_delay: oldlen(18964) = startblockval(left.br_startblock)(9480) + startblockval(new->br_startblock)(9484)
> [11343.271793] xfs_bmap_add_extent_hole_delay: temp(262080) = left.br_blockcount(131008) + new->br_blockcount(131072)
> [11343.282140] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 18949
> [11343.289446] xfs_bmap_add_extent_hole_delay: newlen = 18949
> [11343.381338] xfs_bmap_add_extent_hole_delay: oldlen(37903) = startblockval(left.br_startblock)(18949) + startblockval(new->br_startblock)(18954)
> [11343.394196] xfs_bmap_add_extent_hole_delay: temp(524224) = left.br_blockcount(262080) + new->br_blockcount(262144)
> [11343.404545] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 37887
> [11343.411852] xfs_bmap_add_extent_hole_delay: newlen = 37887
> [11343.590032] xfs_bmap_add_extent_hole_delay: oldlen(75779) = startblockval(left.br_startblock)(37887) + startblockval(new->br_startblock)(37892)
> [11343.602892] xfs_bmap_add_extent_hole_delay: temp(1048512) = left.br_blockcount(524224) + new->br_blockcount(524288)
> [11343.613325] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 75763
> [11343.620630] xfs_bmap_add_extent_hole_delay: newlen = 75763
> [11343.972010] xfs_bmap_add_extent_hole_delay: oldlen(151531) = startblockval(left.br_startblock)(75763) + startblockval(new->br_startblock)(75768)
> [11343.984957] xfs_bmap_add_extent_hole_delay: temp(2097088) = left.br_blockcount(1048512) + new->br_blockcount(1048576)
> [11343.995558] xfs_bmap_add_extent_hole_delay: xfs_bmap_worst_indlen(temp) = 151513
> [11344.002952] xfs_bmap_add_extent_hole_delay: newlen = 151513
> [11344.008528] XFS: Assertion failed: k < (1 << STARTBLOCKVALBITS), file: fs/xfs/libxfs/xfs_format.h, line: 1569
> [11344.018456] ------------[ cut here ]------------
> [11344.023142] WARNING: CPU: 1 PID: 7594 at fs/xfs/xfs_message.c:105 asswarn+0x22/0x30 [xfs]
> [11344.031379] Modules linked in: xfs(OE) libcrc32c btrfs intel_powerclamp coretemp xor kvm_intel raid6_pq kvm irqbypass crc32_pclmul ghash_clmulni_intel iTCO_wdt iTCO_vendor_support pcbc cdc_ether aesni_intel sg ipmi_si crypto_simd pcspkr glue_helper lpc_ich usbnet i2c_i801 cryptd mfd_core mii ipmi_devintf ioatdma shpchp ipmi_msghandler i7core_edac dca acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 jbd2 mbcache sr_mod sd_mod cdrom mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_generic pata_acpi drm ata_piix crc32c_intel libata megaraid_sas i2c_core bnx2 dm_mirror dm_region_hash dm_log dm_mod [last unloaded: xfs]
> [11344.091315] CPU: 1 PID: 7594 Comm: xfs_io Tainted: G        W  OE   4.12.0 #102
> [11344.098674] Hardware name: IBM System x3550 M3 -[7944OEJ]-/90Y4784     , BIOS -[D6E150CUS-1.11]- 02/08/2011
> [11344.108423] task: ffff88017481ea40 task.stack: ffffc90003d14000
> [11344.114460] RIP: 0010:asswarn+0x22/0x30 [xfs]
> [11344.118881] RSP: 0018:ffffc90003d17ad0 EFLAGS: 00010282
> [11344.124145] RAX: 00000000ffffffea RBX: 0000000000000051 RCX: 0000000000000021
> [11344.131339] RDX: ffffc90003d179f8 RSI: 000000000000000a RDI: ffffffffa07d8b22
> [11344.138470] RBP: ffffc90003d17ad0 R08: 0000000000000000 R09: 0000000000000000
> [11344.145665] R10: 000000000000000a R11: f000000000000000 R12: ffff880143648000
> [11344.152853] R13: ffffc90003d17c7c R14: 0000000000024fd9 R15: 0000000000024fd9
> [11344.160047] FS:  00007fbe214f5740(0000) GS:ffff88017ba40000(0000) knlGS:0000000000000000
> [11344.168190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [11344.173989] CR2: 00007fbe214f2000 CR3: 0000000171bb2000 CR4: 00000000000006e0
> [11344.181180] Call Trace:
> [11344.183715]  xfs_bmap_add_extent_hole_delay+0x76b/0x7c0 [xfs]
> [11344.189539]  ? xfs_mod_fdblocks+0xa6/0x1a0 [xfs]
> [11344.194247]  xfs_bmapi_reserve_delalloc+0x263/0x310 [xfs]
> [11344.199731]  ? xfs_bmbt_get_all+0x18/0x20 [xfs]
> [11344.204354]  ? xfs_iext_get_extent+0x35/0x40 [xfs]
> [11344.209240]  xfs_file_iomap_begin+0x658/0x8c0 [xfs]
> [11344.214165]  ? iomap_write_end+0x38/0x80
> [11344.218127]  iomap_apply+0x44/0xf0
> [11344.221571]  iomap_file_buffered_write+0x68/0xa0
> [11344.226225]  ? iomap_write_begin.constprop.16+0xf0/0xf0
> [11344.231545]  xfs_file_buffered_aio_write+0xca/0x240 [xfs]
> [11344.237009]  ? page_add_new_anon_rmap+0x89/0xc0
> [11344.241631]  xfs_file_write_iter+0xfc/0x140 [xfs]
> [11344.246405]  __vfs_write+0xc5/0x140
> [11344.249936]  vfs_write+0xb2/0x1b0
> [11344.253315]  ? syscall_trace_enter+0x1b7/0x290
> [11344.257800]  SyS_pwrite64+0x69/0xa0
> [11344.261294]  do_syscall_64+0x67/0x150
> [11344.265021]  entry_SYSCALL64_slow_path+0x25/0x25
> [11344.269682] RIP: 0033:0x7fbe210d5fb3
> [11344.273262] RSP: 002b:00007ffd75b34778 EFLAGS: 00000246 ORIG_RAX: 0000000000000012
> [11344.280888] RAX: ffffffffffffffda RBX: 000000003ff00000 RCX: 00007fbe210d5fb3
> [11344.288078] RDX: 0000000000100000 RSI: 00007fbe213f4000 RDI: 0000000000000003
> [11344.295269] RBP: 00000000000003ff R08: 0000000000000000 R09: 0000000000101030
> [11344.302419] R10: 000000003ff00000 R11: 0000000000000246 R12: 0000000000100000
> [11344.309608] R13: 000000003ff00000 R14: 00000000ffffffff R15: 0000000000000000
> [11344.316804] Code: 0f 0b 0f 1f 80 00 00 00 00 66 66 66 66 90 55 48 89 f1 41 89 d0 48 c7 c6 98 3a 7e a0 48 89 fa 31 c0 48 89 e5 31 ff e8 2e fd ff ff <0f> ff 5d c3 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48
> [11344.335722] ---[ end trace 812850ed381d73bb ]---


* Re: [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen()
  2017-07-08  6:49 ` Darrick J. Wong
@ 2017-07-09 14:08   ` Eryu Guan
  2017-07-10 16:18     ` Darrick J. Wong
  0 siblings, 1 reply; 9+ messages in thread
From: Eryu Guan @ 2017-07-09 14:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Jul 07, 2017 at 11:49:37PM -0700, Darrick J. Wong wrote:
> On Fri, Jul 07, 2017 at 08:01:43PM +0800, Eryu Guan wrote:
> > Hi all,
> > 
> > I recently hit a repeatable sb_fdblocks corruption as below:
> > 
> > Phase 1 - find and verify superblock...
> > Phase 2 - using internal log
> >         - zero log...
> >         - scan filesystem freespace and inode maps...
> > sb_fdblocks 14538692, counted 14669764
> >         - found root inode chunk
> > Phase 3 - for each AG...
> > ...
> > 
> > And the count diff is always 14669764 - 14538692 = 131072 (128k). The
> > XFS in question was formatted with the "-m rmapbt=1 -b 1k" options.
> > 
> > After turning on XFS_WARN and adding some debug printks (I appended the
> > detailed logs at the end of mail), I found that this was caused by too
> > large 'indlen' returned by xfs_bmap_worst_indlen(), which can't fit in
> > a 17-bit value (STARTBLOCKVALBITS is defined as 17), so the assert in
> > nullstartblock() failed: ASSERT(k < (1 << STARTBLOCKVALBITS));
> > 
> > From the log, newlen = 151513, which needs 18 bits, so nullstartblock()
> > throws away the 18th bit, and the sb_fdblocks difference is always 2^17
> > = 131072.
> 
> br_startblock is encoded in memory (and in the on-disk bmbt records) as
> a 52-bit unsigned integer.  We signal a delayed allocation record by
> setting the uppermost STARTBLOCKMASKBITS (35) bits to 1 and stash the
> 'indlen' reservation (i.e. the worst case estimate of the space we need
> to grow the bmbt/rmapbt to map the entire delayed allocation) in the
> lower 17 bits of br_startblock.  In theory this is ok because we're
> still quite a ways from having enough storage to create an fs where
> the upper bits in the agno part of an xfs_fsblock_t are actually set.

This confirms what I read from the code, thanks! But I'm still curious
about how these numbers were chosen, especially why STARTBLOCKMASKBITS
is defined as (15 + 20). Where do the 15 and 20 come from?

> 
> > To reproduce this, you need to keep enough dirty data in memory, so that
> > you can keep a large enough delay allocated extent in memory (not
> > converted to allocated by writeback thread), then speculative
> > preallocation could allocate large number of blocks based on the
> > existing extent size.
> > 
> > I first saw this by running xfs/217 on a ppc64 host with 18G memory, and
> > the default vm.dirty_background_ratio is 10, so it could keep around
> > 1.8G dirty memory. Now I can reproduce by tuning
> > vm.dirty_background_ratio and vm.dirty_ratio on an x86_64 host with 4G
> > memory.
> > 
> > ---- 8< ----
> > #!/bin/bash
> > dev=/dev/sdc1
> > mnt=/mnt/xfs
> > 
> > # write 1G file
> > size=$((1024*1024*1024))
> > 
> > echo 90 > /proc/sys/vm/dirty_background_ratio
> > echo 90 > /proc/sys/vm/dirty_ratio
> > 
> > mkfs -t xfs -m rmapbt=1 -b size=1k -f $dev
> > mount $dev $mnt
> > 
> > xfs_io -fc "pwrite -b 1m 0 $size" $mnt/testfile
> > umount $mnt
> > 
> > xfs_repair -n $dev
> > exit $?
> > ---- >8 ----
> > 
> > This is uncovered by commit fd26a88093ba ("xfs: factor rmap btree size
> > into the indlen calculations"), which adds worst-case size of rmapbt
> > into account. But I'm not sure what's the best fix.
> 
> Aha, that old silly fix.  In theory the per-AG reservation code is
> supposed to reserve enough backup AGFL space to handle a reasonable
> amount of rmapbt expansion, but then we double that up by adding
> additional rmapbt block estimates to indlen, presumably so that we favor
> returning ENOSPC when we go making delalloc reservations at
> write_begin/page_mkwrite time.
> 
> However, we drop the indlen reservation as soon as the first transaction
> in an allocate -> map -> rmap chain commits.  Since rmap is never the
> first transaction in a complex transaction series, it never gets its
> hands on that indlen.  Furthermore, indlen blocks are reserved from the
> /global/ free block counter and not at a per-AG level, that means that
> even with the indlen reservation we can still blow up during the rmap
> step because a particular AG might be totally out of blocks.
> 
> So maybe the solution is to revert this patch and see if generic/224
> still blows up when sunit/swidth are set?  I tried the steps given in
> your email from 18 Nov 2016 ("[BUG] dd doesn't return on ENOSPC and hang
> when fulfilling rmapbt XFS") with sunit=32,swidth=224 (numbers I
> entirely made up) and it ran just fine.  I then ran it with the
> reproducer steps you outlined above, and that ran just fine too.
> I did not run the rest of xfstests.

Reverting commit fd26a88093ba works for me; I can reproduce neither the
sb_fdblocks accounting error nor the dd hang bug.

> 
> > BTW, what are these magic numbers? What's the reason behind
> > STARTBLOCKVALBITS being 17? I can't find any explanation..
> 
> (See above)

Thanks!

Eryu


* Re: [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen()
  2017-07-09 14:08   ` Eryu Guan
@ 2017-07-10 16:18     ` Darrick J. Wong
  2017-07-11  0:08       ` Dave Chinner
  2017-09-02  7:49       ` Eryu Guan
  0 siblings, 2 replies; 9+ messages in thread
From: Darrick J. Wong @ 2017-07-10 16:18 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs

On Sun, Jul 09, 2017 at 10:08:18PM +0800, Eryu Guan wrote:
> On Fri, Jul 07, 2017 at 11:49:37PM -0700, Darrick J. Wong wrote:
> > On Fri, Jul 07, 2017 at 08:01:43PM +0800, Eryu Guan wrote:
> > > Hi all,
> > > 
> > > I recently hit a repeatable sb_fdblocks corruption as below:
> > > 
> > > Phase 1 - find and verify superblock...
> > > Phase 2 - using internal log
> > >         - zero log...
> > >         - scan filesystem freespace and inode maps...
> > > sb_fdblocks 14538692, counted 14669764
> > >         - found root inode chunk
> > > Phase 3 - for each AG...
> > > ...
> > > 
> > > And the count diff is always 14669764 - 14538692 = 131072 (128k). The
> > > XFS in question was formatted with the "-m rmapbt=1 -b 1k" options.
> > > 
> > > After turning on XFS_WARN and adding some debug printks (I appended the
> > > detailed logs at the end of mail), I found that this was caused by too
> > > large 'indlen' returned by xfs_bmap_worst_indlen(), which can't fit in
> > > a 17-bit value (STARTBLOCKVALBITS is defined as 17), so the assert in
> > > nullstartblock() failed: ASSERT(k < (1 << STARTBLOCKVALBITS));
> > > 
> > > From the log, newlen = 151513, which needs 18 bits, so nullstartblock()
> > > throws away the 18th bit, and the sb_fdblocks difference is always 2^17
> > > = 131072.
> > 
> > br_startblock is encoded in memory (and in the on-disk bmbt records) as
> > a 52-bit unsigned integer.  We signal a delayed allocation record by
> > setting the uppermost STARTBLOCKMASKBITS (35) bits to 1 and stash the
> > 'indlen' reservation (i.e. the worst case estimate of the space we need
> > to grow the bmbt/rmapbt to map the entire delayed allocation) in the
> > lower 17 bits of br_startblock.  In theory this is ok because we're
> > still quite a ways from having enough storage to create an fs where
> > the upper bits in the agno part of an xfs_fsblock_t are actually set.
> 
> This confirms what I read from the code, thanks! But I'm still curious
> about how these numbers are chosen, especially STARTBLOCKMASKBITS is
> defined as (15 + 20), where are they from?

<shrug> Dave? :)

In the meantime I'll stick to my hypothesis that this value was chosen
so that the AG# would be "impossibly" high if it ever escaped to disk
and thereby stand out.

> > > To reproduce this, you need to keep enough dirty data in memory so that
> > > a large enough delayed allocation extent stays in memory (not yet
> > > converted to a real allocation by the writeback thread); speculative
> > > preallocation can then reserve a large number of blocks based on the
> > > existing extent size.
> > > 
> > > I first saw this by running xfs/217 on a ppc64 host with 18G memory, and
> > > the default vm.dirty_background_ratio is 10, so it could keep around
> > > 1.8G dirty memory. Now I can reproduce by tuning
> > > vm.dirty_background_ratio and vm.dirty_ratio on a x86_64 host with 4G
> > > memory.
> > > 
> > > ---- 8< ----
> > > #!/bin/bash
> > > dev=/dev/sdc1
> > > mnt=/mnt/xfs
> > > 
> > > # write 1G file
> > > size=$((1024*1024*1024))
> > > 
> > > echo 90 > /proc/sys/vm/dirty_background_ratio
> > > echo 90 > /proc/sys/vm/dirty_ratio
> > > 
> > > mkfs -t xfs -m rmapbt=1 -b size=1k -f $dev
> > > mount $dev $mnt
> > > 
> > > xfs_io -fc "pwrite -b 1m 0 $size" $mnt/testfile
> > > umount $mnt
> > > 
> > > xfs_repair -n $dev
> > > exit $?
> > > ---- >8 ----
> > > 
> > > This was exposed by commit fd26a88093ba ("xfs: factor rmap btree size
> > > into the indlen calculations"), which takes the worst-case size of the
> > > rmapbt into account. But I'm not sure what the best fix is.
> > 
> > Aha, that old silly fix.  In theory the per-AG reservation code is
> > supposed to reserve enough backup AGFL space to handle a reasonable
> > amount of rmapbt expansion, but then we double that up by adding
> > additional rmapbt block estimates to indlen, presumably so that we favor
> > returning ENOSPC when we go making delalloc reservations at
> > write_begin/page_mkwrite time.
> > 
> > However, we drop the indlen reservation as soon as the first transaction
> > in an allocate -> map -> rmap chain commits.  Since rmap is never the
> > first transaction in a complex transaction series, it never gets its
> > hands on that indlen.  Furthermore, indlen blocks are reserved from the
> > /global/ free block counter and not at a per-AG level, which means that
> > even with the indlen reservation we can still blow up during the rmap
> > step because a particular AG might be totally out of blocks.
> > 
> > So maybe the solution is to revert this patch and see if generic/224
> > still blows up when sunit/swidth are set?  I tried the steps given in
> > your email from 18 Nov 2016 ("[BUG] dd doesn't return on ENOSPC and hang
> > when fulfilling rmapbt XFS") with sunit=32,swidth=224 (numbers I
> > entirely made up) and it ran just fine.  I then ran it with the
> > reproducer steps you outlined above, and that ran just fine too.
> > I did not run the rest of xfstests.
> 
> Reverting commit fd26a88093ba works for me; I can't reproduce the
> sb_fdblocks accounting error or the dd hang bug.

<nod> I'll consider posting a revert patch for the post-rc1 fixes.

But... merge window stuff comes first. :)

--D

> 
> > 
> > > BTW, what are these magic numbers? What's the reason behind
> > > STARTBLOCKVALBITS being 17? I can't find any explanation..
> > 
> > (See above)
> 
> Thanks!
> 
> Eryu
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen()
  2017-07-10 16:18     ` Darrick J. Wong
@ 2017-07-11  0:08       ` Dave Chinner
  2017-07-11 11:09         ` Eryu Guan
  2017-09-02  7:49       ` Eryu Guan
  1 sibling, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2017-07-11  0:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Eryu Guan, linux-xfs

On Mon, Jul 10, 2017 at 09:18:24AM -0700, Darrick J. Wong wrote:
> On Sun, Jul 09, 2017 at 10:08:18PM +0800, Eryu Guan wrote:
> > On Fri, Jul 07, 2017 at 11:49:37PM -0700, Darrick J. Wong wrote:
> > > On Fri, Jul 07, 2017 at 08:01:43PM +0800, Eryu Guan wrote:
> > > > Hi all,
> > > > 
> > > > I recently hit a repeatable sb_fdblocks corruption as below:
> > > > 
> > > > Phase 1 - find and verify superblock...
> > > > Phase 2 - using internal log
> > > >         - zero log...
> > > >         - scan filesystem freespace and inode maps...
> > > > sb_fdblocks 14538692, counted 14669764
> > > >         - found root inode chunk
> > > > Phase 3 - for each AG...
> > > > ...
> > > > 
> > > > And the count diff is always 14669764 - 14538692 = 131072 (128k). The
> > > > XFS in question was formatted with the "-m rmapbt=1 -b 1k" option.
> > > > 
> > > > After turning on XFS_WARN and adding some debug printks (I appended the
> > > > detailed logs at the end of mail), I found that this was caused by too
> > > > large 'indlen' returned by xfs_bmap_worst_indlen(), which can't fit in a
> > > > 17-bit value (STARTBLOCKVALBITS is defined as 17), so the assert in
> > > > nullstartblock() failed: ASSERT(k < (1 << STARTBLOCKVALBITS));
> > > > 
> > > > From the log, newlen = 151513, which needs 18 bits, so nullstartblock()
> > > > throws away the 18th bit, and the sb_fdblocks difference is always 2^17
> > > > = 131072.
> > > 
> > > br_startblock is encoded in memory (and in the on-disk bmbt records) as
> > > a 52-bit unsigned integer.  We signal a delayed allocation record by
> > > setting the uppermost STARTBLOCKMASKBITS (35) bits to 1 and stash the
> > > 'indlen' reservation (i.e. the worst case estimate of the space we need
> > > to grow the bmbt/rmapbt to map the entire delayed allocation) in the
> > > lower 17 bits of br_startblock.  In theory this is ok because we're
> > > still quite a ways from having enough storage to create an fs where
> > > the upper bits in the agno part of an xfs_fsblock_t are actually set.
> > 
> > This confirms what I read from the code, thanks! But I'm still curious
> > about how these numbers were chosen, especially STARTBLOCKMASKBITS,
> > which is defined as (15 + 20); where do those numbers come from?
> 
> <shrug> Dave? :)

You mean these definitions?

#define STARTBLOCKVALBITS	17
#define STARTBLOCKMASKBITS	(15 + 20)

Today: a history lesson. :)

Remember that these are stored encoded in a xfs_bmbt_rec on disk and
a xfs_bmbt_rec_host in memory, so they need to fit in this
definition both on disk and in memory:

/*
 * Bmap btree record and extent descriptor.
 *  l0:63 is an extent flag (value 1 indicates non-normal).
 *  l0:9-62 are startoff.
 *  l0:0-8 and l1:21-63 are startblock.
 *  l1:0-20 are blockcount.
 */
#define BMBT_EXNTFLAG_BITLEN    1
#define BMBT_STARTOFF_BITLEN    54
#define BMBT_STARTBLOCK_BITLEN  52
#define BMBT_BLOCKCOUNT_BITLEN  21

We're looking at BMBT_STARTBLOCK_BITLEN here, so it's obvious that
15 + 20 + 17 is 52 bits, and that the startblock encoding for delayed
allocation fits inside the startblock space of both the on-disk and
in-memory extent tree records.

But what about the old "small block" format that was originally
used on 32 bit MIPS systems? That only had 32 bits in the start
block encoding *in memory*, so it should be clear that:

	15 + 17 = 32 bits

Indeed, look back at the older code:

(http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=blob;f=fs/xfs/xfs_bmap_btree.h;h=a4555abb6622a5d2ac6c08ab0f585761d6ff4585;hb=HEAD)

#define STARTBLOCKMASKBITS      (15 + XFS_BIG_BLKNOS * 20)
#define DSTARTBLOCKMASKBITS     (15 + 20)

And you can see we had different definitions for in-memory and
on-disk start block masks, and that difference was the size of the
block addresses the compiled kernel could manage.

IOWs, the "15 + 20" is the old definition that recognised the
difference between the block address size that 32-bit and 64-bit
systems could originally support in memory and what the 64-bit on-disk
format supported. We now only support 64-bit block addresses in memory,
so the in-memory and on-disk definitions are now the same....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen()
  2017-07-11  0:08       ` Dave Chinner
@ 2017-07-11 11:09         ` Eryu Guan
  0 siblings, 0 replies; 9+ messages in thread
From: Eryu Guan @ 2017-07-11 11:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Tue, Jul 11, 2017 at 10:08:59AM +1000, Dave Chinner wrote:
> On Mon, Jul 10, 2017 at 09:18:24AM -0700, Darrick J. Wong wrote:
> > On Sun, Jul 09, 2017 at 10:08:18PM +0800, Eryu Guan wrote:
> > > On Fri, Jul 07, 2017 at 11:49:37PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Jul 07, 2017 at 08:01:43PM +0800, Eryu Guan wrote:
> > > > > Hi all,
> > > > > 
> > > > > I recently hit a repeatable sb_fdblocks corruption as below:
> > > > > 
> > > > > Phase 1 - find and verify superblock...
> > > > > Phase 2 - using internal log
> > > > >         - zero log...
> > > > >         - scan filesystem freespace and inode maps...
> > > > > sb_fdblocks 14538692, counted 14669764
> > > > >         - found root inode chunk
> > > > > Phase 3 - for each AG...
> > > > > ...
> > > > > 
> > > > > And the count diff is always 14669764 - 14538692 = 131072 (128k). The
> > > > > XFS in question was formatted with the "-m rmapbt=1 -b 1k" option.
> > > > > 
> > > > > After turning on XFS_WARN and adding some debug printks (I appended the
> > > > > detailed logs at the end of mail), I found that this was caused by too
> > > > > large 'indlen' returned by xfs_bmap_worst_indlen(), which can't fit in a
> > > > > 17-bit value (STARTBLOCKVALBITS is defined as 17), so the assert in
> > > > > nullstartblock() failed: ASSERT(k < (1 << STARTBLOCKVALBITS));
> > > > > 
> > > > > From the log, newlen = 151513, which needs 18 bits, so nullstartblock()
> > > > > throws away the 18th bit, and the sb_fdblocks difference is always 2^17
> > > > > = 131072.
> > > > 
> > > > br_startblock is encoded in memory (and in the on-disk bmbt records) as
> > > > a 52-bit unsigned integer.  We signal a delayed allocation record by
> > > > setting the uppermost STARTBLOCKMASKBITS (35) bits to 1 and stash the
> > > > 'indlen' reservation (i.e. the worst case estimate of the space we need
> > > > to grow the bmbt/rmapbt to map the entire delayed allocation) in the
> > > > lower 17 bits of br_startblock.  In theory this is ok because we're
> > > > still quite a ways from having enough storage to create an fs where
> > > > the upper bits in the agno part of an xfs_fsblock_t are actually set.
> > > 
> > > This confirms what I read from the code, thanks! But I'm still curious
> > > about how these numbers were chosen, especially STARTBLOCKMASKBITS,
> > > which is defined as (15 + 20); where do those numbers come from?
> > 
> > <shrug> Dave? :)
> 
> You mean these definitions?
> 
> #define STARTBLOCKVALBITS	17
> #define STARTBLOCKMASKBITS	(15 + 20)
> 
> Today: a history lesson. :)

Glad to study :)

> 
> Remember that these are stored encoded in a xfs_bmbt_rec on disk and
> a xfs_bmbt_rec_host in memory, so they need to fit in this
> definition both on disk and in memory:
> 
> /*
>  * Bmap btree record and extent descriptor.
>  *  l0:63 is an extent flag (value 1 indicates non-normal).
>  *  l0:9-62 are startoff.
>  *  l0:0-8 and l1:21-63 are startblock.
>  *  l1:0-20 are blockcount.
>  */
> #define BMBT_EXNTFLAG_BITLEN    1
> #define BMBT_STARTOFF_BITLEN    54
> #define BMBT_STARTBLOCK_BITLEN  52
> #define BMBT_BLOCKCOUNT_BITLEN  21
> 
> We're looking at BMBT_STARTBLOCK_BITLEN here, so it's obvious that
> 15 + 20 + 17 is 52 bits, and that the startblock encoding for delayed
> allocation fits inside the startblock space of both the on-disk and
> in-memory extent tree records.
> 
> But what about the old "small block" format that was originally
> used on 32 bit MIPS systems? That only had 32 bits in the start
> block encoding *in memory*, so it should be clear that:
> 
> 	15 + 17 = 32 bits

This makes a lot of sense.

> 
> Indeed, look back at the older code:
> 
> (http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=blob;f=fs/xfs/xfs_bmap_btree.h;h=a4555abb6622a5d2ac6c08ab0f585761d6ff4585;hb=HEAD)
> 
> #define STARTBLOCKMASKBITS      (15 + XFS_BIG_BLKNOS * 20)
> #define DSTARTBLOCKMASKBITS     (15 + 20)
> 
> And you can see we had different definitions for in-memory and
> on-disk start block masks, and that difference was the size of the
> block addresses the compiled kernel could manage.
> 
> IOWs, the "15 + 20" is the old definition that recognised the
> difference between the block address size that 32-bit and 64-bit
> systems could originally support in memory and what the 64-bit on-disk
> format supported. We now only support 64-bit block addresses in memory,
> so the in-memory and on-disk definitions are now the same....

Thanks a lot for the detailed explanation!

Eryu


* Re: [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen()
  2017-07-10 16:18     ` Darrick J. Wong
  2017-07-11  0:08       ` Dave Chinner
@ 2017-09-02  7:49       ` Eryu Guan
  2017-09-02 15:20         ` Darrick J. Wong
  1 sibling, 1 reply; 9+ messages in thread
From: Eryu Guan @ 2017-09-02  7:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Jul 10, 2017 at 09:18:24AM -0700, Darrick J. Wong wrote:
> On Sun, Jul 09, 2017 at 10:08:18PM +0800, Eryu Guan wrote:
> > On Fri, Jul 07, 2017 at 11:49:37PM -0700, Darrick J. Wong wrote:
> > > On Fri, Jul 07, 2017 at 08:01:43PM +0800, Eryu Guan wrote:
> > > > Hi all,
> > > > 
> > > > I recently hit a repeatable sb_fdblocks corruption as below:
> > > > 
> > > > Phase 1 - find and verify superblock...
> > > > Phase 2 - using internal log
> > > >         - zero log...
> > > >         - scan filesystem freespace and inode maps...
> > > > sb_fdblocks 14538692, counted 14669764
> > > >         - found root inode chunk
> > > > Phase 3 - for each AG...
> > > > ...
> > > > 
> > > > And the count diff is always 14669764 - 14538692 = 131072 (128k). The
> > > > > XFS in question was formatted with the "-m rmapbt=1 -b 1k" option.

I hit this corruption again today in my fstests release testing, and it
reminded me of this old thread :)

...

> > > 
> > > So maybe the solution is to revert this patch and see if generic/224
> > > still blows up when sunit/swidth are set?  I tried the steps given in
> > > your email from 18 Nov 2016 ("[BUG] dd doesn't return on ENOSPC and hang
> > > when fulfilling rmapbt XFS") with sunit=32,swidth=224 (numbers I
> > > entirely made up) and it ran just fine.  I then ran it with the
> > > reproducer steps you outlined above, and that ran just fine too.
> > > I did not run the rest of xfstests.
> > 
> > Reverting commit fd26a88093ba works for me, I can't reproduce the
> > sb_fdblocks accounting error nor the dd hang bug.
> 
> <nod> I'll consider posting a revert patch for the post -rc1 fixes.
> 
> But... merge window stuff comes first. :)

Is reverting commit fd26a88093ba still the plan? Maybe we can do it for
4.14.

Thanks,
Eryu


* Re: [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen()
  2017-09-02  7:49       ` Eryu Guan
@ 2017-09-02 15:20         ` Darrick J. Wong
  2017-09-03  4:01           ` Eryu Guan
  0 siblings, 1 reply; 9+ messages in thread
From: Darrick J. Wong @ 2017-09-02 15:20 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs

On Sat, Sep 02, 2017 at 03:49:45PM +0800, Eryu Guan wrote:
> On Mon, Jul 10, 2017 at 09:18:24AM -0700, Darrick J. Wong wrote:
> > On Sun, Jul 09, 2017 at 10:08:18PM +0800, Eryu Guan wrote:
> > > On Fri, Jul 07, 2017 at 11:49:37PM -0700, Darrick J. Wong wrote:
> > > > On Fri, Jul 07, 2017 at 08:01:43PM +0800, Eryu Guan wrote:
> > > > > Hi all,
> > > > > 
> > > > > I recently hit a repeatable sb_fdblocks corruption as below:
> > > > > 
> > > > > Phase 1 - find and verify superblock...
> > > > > Phase 2 - using internal log
> > > > >         - zero log...
> > > > >         - scan filesystem freespace and inode maps...
> > > > > sb_fdblocks 14538692, counted 14669764
> > > > >         - found root inode chunk
> > > > > Phase 3 - for each AG...
> > > > > ...
> > > > > 
> > > > > And the count diff is always 14669764 - 14538692 = 131072 (128k). The
> > > > > XFS in question was formatted with the "-m rmapbt=1 -b 1k" option.
> 
> I hit this corruption again today in my fstests release testing, and it
> reminded me of this old thread :)
> 
> ...
> 
> > > > 
> > > > So maybe the solution is to revert this patch and see if generic/224
> > > > still blows up when sunit/swidth are set?  I tried the steps given in
> > > > your email from 18 Nov 2016 ("[BUG] dd doesn't return on ENOSPC and hang
> > > > when fulfilling rmapbt XFS") with sunit=32,swidth=224 (numbers I
> > > > entirely made up) and it ran just fine.  I then ran it with the
> > > > reproducer steps you outlined above, and that ran just fine too.
> > > > I did not run the rest of xfstests.
> > > 
> > > Reverting commit fd26a88093ba works for me, I can't reproduce the
> > > sb_fdblocks accounting error nor the dd hang bug.
> > 
> > <nod> I'll consider posting a revert patch for the post -rc1 fixes.
> > 
> > But... merge window stuff comes first. :)
> 
> Is reverting commit fd26a88093ba still the plan? Maybe we can do it for
> 4.14.

Sure.  Can you make the reproducer script an xfstest, please?

Will post revert patch for review soon.

--D

> 
> Thanks,
> Eryu


* Re: [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen()
  2017-09-02 15:20         ` Darrick J. Wong
@ 2017-09-03  4:01           ` Eryu Guan
  0 siblings, 0 replies; 9+ messages in thread
From: Eryu Guan @ 2017-09-03  4:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Sat, Sep 02, 2017 at 08:20:59AM -0700, Darrick J. Wong wrote:
> On Sat, Sep 02, 2017 at 03:49:45PM +0800, Eryu Guan wrote:
> > On Mon, Jul 10, 2017 at 09:18:24AM -0700, Darrick J. Wong wrote:
> > > On Sun, Jul 09, 2017 at 10:08:18PM +0800, Eryu Guan wrote:
> > > > On Fri, Jul 07, 2017 at 11:49:37PM -0700, Darrick J. Wong wrote:
> > > > > On Fri, Jul 07, 2017 at 08:01:43PM +0800, Eryu Guan wrote:
> > > > > > Hi all,
> > > > > > 
> > > > > > I recently hit a repeatable sb_fdblocks corruption as below:
> > > > > > 
> > > > > > Phase 1 - find and verify superblock...
> > > > > > Phase 2 - using internal log
> > > > > >         - zero log...
> > > > > >         - scan filesystem freespace and inode maps...
> > > > > > sb_fdblocks 14538692, counted 14669764
> > > > > >         - found root inode chunk
> > > > > > Phase 3 - for each AG...
> > > > > > ...
> > > > > > 
> > > > > > And the count diff is always 14669764 - 14538692 = 131072 (128k). The
> > > > > > XFS in question was formatted with the "-m rmapbt=1 -b 1k" option.
> > 
> > I hit this corruption again today in my fstests release testing, and it
> > reminded me of this old thread :)
> > 
> > ...
> > 
> > > > > 
> > > > > So maybe the solution is to revert this patch and see if generic/224
> > > > > still blows up when sunit/swidth are set?  I tried the steps given in
> > > > > your email from 18 Nov 2016 ("[BUG] dd doesn't return on ENOSPC and hang
> > > > > when fulfilling rmapbt XFS") with sunit=32,swidth=224 (numbers I
> > > > > entirely made up) and it ran just fine.  I then ran it with the
> > > > > reproducer steps you outlined above, and that ran just fine too.
> > > > > I did not run the rest of xfstests.
> > > > 
> > > > Reverting commit fd26a88093ba works for me, I can't reproduce the
> > > > sb_fdblocks accounting error nor the dd hang bug.
> > > 
> > > <nod> I'll consider posting a revert patch for the post -rc1 fixes.
> > > 
> > > But... merge window stuff comes first. :)
> > 
> > Is reverting commit fd26a88093ba still the plan? Maybe we can do it for
> > 4.14.
> 
> Sure.  Can you make the reproducer script an xfstest, please?

Sure, will work on that.

> 
> Will post revert patch for review soon.

Thanks!

Eryu

> 
> --D
> 
> > 
> > Thanks,
> > Eryu


end of thread, other threads:[~2017-09-03  4:01 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-07 12:01 [BUG] sb_fdblocks counting error caused by too large indlen returned from xfs_bmap_worst_indlen() Eryu Guan
2017-07-08  6:49 ` Darrick J. Wong
2017-07-09 14:08   ` Eryu Guan
2017-07-10 16:18     ` Darrick J. Wong
2017-07-11  0:08       ` Dave Chinner
2017-07-11 11:09         ` Eryu Guan
2017-09-02  7:49       ` Eryu Guan
2017-09-02 15:20         ` Darrick J. Wong
2017-09-03  4:01           ` Eryu Guan
