* [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS
@ 2016-11-17 16:35 Eryu Guan
  2016-11-17 17:36 ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Eryu Guan @ 2016-11-17 16:35 UTC (permalink / raw)
  To: linux-xfs

Hi all,

I hit a test hang in generic/224 when testing rmapbt enabled XFS on a
host that has non-zero sunit/swidth reported from underlying device. And
I simplified the reproducer to the following script, and the hang can be
reproduced on any host now.

-----
#!/bin/bash

dev=/dev/sda5
mnt=/mnt/xfs

mkfs -t xfs -m rmapbt=1 -d agcount=8,size=1g -f $dev
mount $dev $mnt
xfs_io -x -c "resblks 4" $mnt

dd if=/dev/zero of=/mnt/xfs/testfile

echo "dd should return and report ENOSPC"
-----

sysrq-w output:
sysrq: SysRq : Show Blocked State
  task                        PC stack   pid father
dd              D    0  2504   2491 0x00000080
 ffff88021077dd00 0000000000000000 ffff88021623be40 ffff88021fd99300
 ffff8802107467c0 ffffc900019dbc18 ffffffff816e2cf5 ffff88020fa8ce90
 ffffc900019dbc40 0000000000000286 ffff8802107467c0 ffff88020fa8ce90
Call Trace:
 [<ffffffff816e2cf5>] ? __schedule+0x195/0x630
 [<ffffffff816e31c6>] schedule+0x36/0x80
 [<ffffffff812534b4>] wb_wait_for_completion+0x64/0x90
 [<ffffffff810d2910>] ? prepare_to_wait_event+0xf0/0xf0
 [<ffffffff81255b3d>] sync_inodes_sb+0xad/0x290
 [<ffffffff81288bc0>] ? iomap_write_end+0x80/0x80
 [<ffffffff8128922c>] ? iomap_apply+0x6c/0x130
 [<ffffffffa014c2a8>] xfs_flush_inodes+0x28/0x40 [xfs]
 [<ffffffffa013370b>] xfs_file_buffered_aio_write+0x18b/0x280 [xfs]
 [<ffffffffa0133890>] xfs_file_write_iter+0x90/0x130 [xfs]
 [<ffffffff81226b52>] __vfs_write+0xe2/0x140
 [<ffffffff812279d2>] vfs_write+0xb2/0x1b0
 [<ffffffff81003510>] ? syscall_trace_enter+0x1d0/0x2b0
 [<ffffffff81228e25>] SyS_write+0x55/0xc0
 [<ffffffff81003a47>] do_syscall_64+0x67/0x180
 [<ffffffff816e796b>] entry_SYSCALL64_slow_path+0x25/0x25

I tested on 4.9-rc5 kernel, for-next branch from linux-xfs tree and
latest djwong-devel branch from Darrick's tree.

If you need more information please let me know.

Thanks,
Eryu


* Re: [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS
  2016-11-17 16:35 [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS Eryu Guan
@ 2016-11-17 17:36 ` Darrick J. Wong
  2016-11-17 20:11   ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2016-11-17 17:36 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs

On Fri, Nov 18, 2016 at 12:35:15AM +0800, Eryu Guan wrote:
> Hi all,
> 
> I hit a test hang in generic/224 when testing rmapbt enabled XFS on a
> host that has non-zero sunit/swidth reported from underlying device. And
> I simplified the reproducer to the following script, and the hang can be
> reproduced on any host now.
> 
> -----
> #!/bin/bash
> 
> dev=/dev/sda5
> mnt=/mnt/xfs
> 
> mkfs -t xfs -m rmapbt=1 -d agcount=8,size=1g -f $dev

Hm.  I formatted with:
mkfs.xfs -m rmapbt=1 -d sunit=4096,swidth=40960 -f /dev/sdf

(made up sunit numbers just to see how whacky it could get)

and got a different hang instead.  It looks like we are unable to
allocate any blocks to the bmbt and various things blow up from
there.  Will go retry with tracepoints on to see if we're running
out of AG reservation or if we're really out of disk blocks or what.

Crash message attached at the end.

--D

> mount $dev $mnt
> xfs_io -x -c "resblks 4" $mnt
> 
> dd if=/dev/zero of=/mnt/xfs/testfile
> 
> echo "dd should return and report ENOSPC"
> -----
> 
> sysrq-w output:
> sysrq: SysRq : Show Blocked State
>   task                        PC stack   pid father
> dd              D    0  2504   2491 0x00000080
>  ffff88021077dd00 0000000000000000 ffff88021623be40 ffff88021fd99300
>  ffff8802107467c0 ffffc900019dbc18 ffffffff816e2cf5 ffff88020fa8ce90
>  ffffc900019dbc40 0000000000000286 ffff8802107467c0 ffff88020fa8ce90
> Call Trace:
>  [<ffffffff816e2cf5>] ? __schedule+0x195/0x630
>  [<ffffffff816e31c6>] schedule+0x36/0x80
>  [<ffffffff812534b4>] wb_wait_for_completion+0x64/0x90
>  [<ffffffff810d2910>] ? prepare_to_wait_event+0xf0/0xf0
>  [<ffffffff81255b3d>] sync_inodes_sb+0xad/0x290
>  [<ffffffff81288bc0>] ? iomap_write_end+0x80/0x80
>  [<ffffffff8128922c>] ? iomap_apply+0x6c/0x130
>  [<ffffffffa014c2a8>] xfs_flush_inodes+0x28/0x40 [xfs]
>  [<ffffffffa013370b>] xfs_file_buffered_aio_write+0x18b/0x280 [xfs]
>  [<ffffffffa0133890>] xfs_file_write_iter+0x90/0x130 [xfs]
>  [<ffffffff81226b52>] __vfs_write+0xe2/0x140
>  [<ffffffff812279d2>] vfs_write+0xb2/0x1b0
>  [<ffffffff81003510>] ? syscall_trace_enter+0x1d0/0x2b0
>  [<ffffffff81228e25>] SyS_write+0x55/0xc0
>  [<ffffffff81003a47>] do_syscall_64+0x67/0x180
>  [<ffffffff816e796b>] entry_SYSCALL64_slow_path+0x25/0x25
> 
> I tested on 4.9-rc5 kernel, for-next branch from linux-xfs tree and
> latest djwong-devel branch from Darrick's tree.
> 
> If you need more information please let me know.
> 
> Thanks,
> Eryu


XFS: Assertion failed: args.fsbno != NULLFSBLOCK, file: /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/libxfs/xfs_bmap.c, line: 789
------------[ cut here ]------------
WARNING: CPU: 1 PID: 104 at /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_message.c:113 assfail+0x31/0x40 [xfs]
Modules linked in: xfs libcrc32c dax_pmem dax nd_pmem sch_fq_codel af_packet
CPU: 1 PID: 104 Comm: kworker/u8:6 Not tainted 4.9.0-rc5-xfsx #41
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue: writeback wb_workfn (flush-8:80)
 ffffc9000083b398 ffffffff81340643 0000000000000000 0000000000000000
 ffffc9000083b3d8 ffffffff8108690b 000000710083b3f8 ffff88006be0a080
 ffffc9000083b838 ffff88006d437000 0000000000000001 ffff88006e407350
Call Trace:
 [<ffffffff81340643>] dump_stack+0x85/0xc2
 [<ffffffff8108690b>] __warn+0xcb/0xf0
 [<ffffffff810869fd>] warn_slowpath_null+0x1d/0x20
 [<ffffffffa00e8971>] assfail+0x31/0x40 [xfs]
 [<ffffffffa0062df5>] xfs_bmap_extents_to_btree+0x235/0x780 [xfs]
 [<ffffffffa0066a91>] xfs_bmap_add_extent_delay_real+0x1af1/0x1ce0 [xfs]
 [<ffffffffa009eb0c>] ? xfs_iext_bno_to_ext+0x8c/0x170 [xfs]
 [<ffffffffa006d06f>] xfs_bmapi_write+0xb6f/0x1180 [xfs]
 [<ffffffff810e016d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa00dda24>] xfs_iomap_write_allocate+0x184/0x360 [xfs]
 [<ffffffffa00b9baf>] xfs_map_blocks+0x34f/0x4e0 [xfs]
 [<ffffffffa00ba920>] xfs_do_writepage+0x310/0x840 [xfs]
 [<ffffffff811abc91>] write_cache_pages+0x251/0x640
 [<ffffffffa00ba610>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
 [<ffffffffa00ba043>] xfs_vm_writepages+0xa3/0xe0 [xfs]
 [<ffffffff811acb81>] do_writepages+0x21/0x30
 [<ffffffff81252031>] __writeback_single_inode+0x61/0x800
 [<ffffffff81252cf7>] writeback_sb_inodes+0x297/0x5e0
 [<ffffffff8125329c>] wb_writeback+0xfc/0x5c0
 [<ffffffff81255e8b>] wb_workfn+0x11b/0x680
 [<ffffffff810a62d8>] process_one_work+0x1f8/0x750
 [<ffffffff810a6259>] ? process_one_work+0x179/0x750
 [<ffffffff810a687b>] worker_thread+0x4b/0x4f0
 [<ffffffff810a6830>] ? process_one_work+0x750/0x750
 [<ffffffff810ad3c2>] kthread+0x102/0x120
 [<ffffffff810ad2c0>] ? kthread_park+0x60/0x60
 [<ffffffff8168243a>] ret_from_fork+0x2a/0x40
---[ end trace a2c2fccec3607587 ]---
XFS: Assertion failed: fsbno != NULLFSBLOCK, file: /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/libxfs/xfs_btree.c, line: 662
------------[ cut here ]------------
WARNING: CPU: 1 PID: 104 at /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_message.c:113 assfail+0x31/0x40 [xfs]
Modules linked in: xfs libcrc32c dax_pmem dax nd_pmem sch_fq_codel af_packet
CPU: 1 PID: 104 Comm: kworker/u8:6 Tainted: G        W       4.9.0-rc5-xfsx #41
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue: writeback wb_workfn (flush-8:80)
 ffffc9000083b358 ffffffff81340643 0000000000000000 0000000000000000
 ffffc9000083b398 ffffffff8108690b 000000710083b3b8 ffffffffffffffff
 ffff88006e407350 0000000000000000 ffff88006d437000 ffff88006e407350
Call Trace:
 [<ffffffff81340643>] dump_stack+0x85/0xc2
 [<ffffffff8108690b>] __warn+0xcb/0xf0
 [<ffffffff810869fd>] warn_slowpath_null+0x1d/0x20
 [<ffffffffa00e8971>] assfail+0x31/0x40 [xfs]
 [<ffffffffa007602e>] xfs_btree_get_bufl+0x3e/0xb0 [xfs]
 [<ffffffffa0062e8e>] xfs_bmap_extents_to_btree+0x2ce/0x780 [xfs]
 [<ffffffffa0066a91>] xfs_bmap_add_extent_delay_real+0x1af1/0x1ce0 [xfs]
 [<ffffffffa009eb0c>] ? xfs_iext_bno_to_ext+0x8c/0x170 [xfs]
 [<ffffffffa006d06f>] xfs_bmapi_write+0xb6f/0x1180 [xfs]
 [<ffffffff810e016d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa00dda24>] xfs_iomap_write_allocate+0x184/0x360 [xfs]
 [<ffffffffa00b9baf>] xfs_map_blocks+0x34f/0x4e0 [xfs]
 [<ffffffffa00ba920>] xfs_do_writepage+0x310/0x840 [xfs]
 [<ffffffff811abc91>] write_cache_pages+0x251/0x640
 [<ffffffffa00ba610>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
 [<ffffffffa00ba043>] xfs_vm_writepages+0xa3/0xe0 [xfs]
 [<ffffffff811acb81>] do_writepages+0x21/0x30
 [<ffffffff81252031>] __writeback_single_inode+0x61/0x800
 [<ffffffff81252cf7>] writeback_sb_inodes+0x297/0x5e0
 [<ffffffff8125329c>] wb_writeback+0xfc/0x5c0
 [<ffffffff81255e8b>] wb_workfn+0x11b/0x680
 [<ffffffff810a62d8>] process_one_work+0x1f8/0x750
 [<ffffffff810a6259>] ? process_one_work+0x179/0x750
 [<ffffffff810a687b>] worker_thread+0x4b/0x4f0
 [<ffffffff810a6830>] ? process_one_work+0x750/0x750
 [<ffffffff810ad3c2>] kthread+0x102/0x120
 [<ffffffff810ad2c0>] ? kthread_park+0x60/0x60
 [<ffffffff8168243a>] ret_from_fork+0x2a/0x40
---[ end trace a2c2fccec3607588 ]---
XFS (sdf): _xfs_buf_find: Block out of range: block 0x600000001fff8, EOFS 0x300000 
------------[ cut here ]------------
WARNING: CPU: 1 PID: 104 at /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_buf.c:520 _xfs_buf_find+0x448/0x4c0 [xfs]
Modules linked in: xfs libcrc32c dax_pmem dax nd_pmem sch_fq_codel af_packet
CPU: 1 PID: 104 Comm: kworker/u8:6 Tainted: G        W       4.9.0-rc5-xfsx #41
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue: writeback wb_workfn (flush-8:80)
 ffffc9000083b288 ffffffff81340643 0000000000000000 0000000000000000
 ffffc9000083b2c8 ffffffff8108690b 000002080083b2e8 0000000000000008
 ffff88006e72a400 000600000001fff8 ffffc9000083b3c8 0000000000000001
Call Trace:
 [<ffffffff81340643>] dump_stack+0x85/0xc2
 [<ffffffff8108690b>] __warn+0xcb/0xf0
 [<ffffffff810869fd>] warn_slowpath_null+0x1d/0x20
 [<ffffffffa00c49e8>] _xfs_buf_find+0x448/0x4c0 [xfs]
 [<ffffffffa00c4e5a>] xfs_buf_get_map+0x2a/0x400 [xfs]
 [<ffffffffa0126692>] xfs_trans_get_buf_map+0x222/0x3f0 [xfs]
 [<ffffffffa0076089>] xfs_btree_get_bufl+0x99/0xb0 [xfs]
 [<ffffffffa0062e8e>] xfs_bmap_extents_to_btree+0x2ce/0x780 [xfs]
 [<ffffffffa0066a91>] xfs_bmap_add_extent_delay_real+0x1af1/0x1ce0 [xfs]
 [<ffffffffa009eb0c>] ? xfs_iext_bno_to_ext+0x8c/0x170 [xfs]
 [<ffffffffa006d06f>] xfs_bmapi_write+0xb6f/0x1180 [xfs]
 [<ffffffff810e016d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa00dda24>] xfs_iomap_write_allocate+0x184/0x360 [xfs]
 [<ffffffffa00b9baf>] xfs_map_blocks+0x34f/0x4e0 [xfs]
 [<ffffffffa00ba920>] xfs_do_writepage+0x310/0x840 [xfs]
 [<ffffffff811abc91>] write_cache_pages+0x251/0x640
 [<ffffffffa00ba610>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
 [<ffffffffa00ba043>] xfs_vm_writepages+0xa3/0xe0 [xfs]
 [<ffffffff811acb81>] do_writepages+0x21/0x30
 [<ffffffff81252031>] __writeback_single_inode+0x61/0x800
 [<ffffffff81252cf7>] writeback_sb_inodes+0x297/0x5e0
 [<ffffffff8125329c>] wb_writeback+0xfc/0x5c0
 [<ffffffff81255e8b>] wb_workfn+0x11b/0x680
 [<ffffffff810a62d8>] process_one_work+0x1f8/0x750
 [<ffffffff810a6259>] ? process_one_work+0x179/0x750
 [<ffffffff810a687b>] worker_thread+0x4b/0x4f0
 [<ffffffff810a6830>] ? process_one_work+0x750/0x750
 [<ffffffff810ad3c2>] kthread+0x102/0x120
 [<ffffffff810ad2c0>] ? kthread_park+0x60/0x60
 [<ffffffff8168243a>] ret_from_fork+0x2a/0x40
---[ end trace a2c2fccec3607589 ]---
XFS (sdf): _xfs_buf_find: Block out of range: block 0x600000001fff8, EOFS 0x300000 
------------[ cut here ]------------
WARNING: CPU: 1 PID: 104 at /storage/home/djwong/cdev/work/linux-xfs/fs/xfs/xfs_buf.c:520 _xfs_buf_find+0x448/0x4c0 [xfs]
Modules linked in: xfs libcrc32c dax_pmem dax nd_pmem sch_fq_codel af_packet
CPU: 1 PID: 104 Comm: kworker/u8:6 Tainted: G        W       4.9.0-rc5-xfsx #41
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue: writeback wb_workfn (flush-8:80)
 ffffc9000083b288 ffffffff81340643 0000000000000000 0000000000000000
 ffffc9000083b2c8 ffffffff8108690b 000002080083b2e8 0000000000000008
 ffff88006e72a400 000600000001fff8 ffffc9000083b3c8 0000000000000001
Call Trace:
 [<ffffffff81340643>] dump_stack+0x85/0xc2
 [<ffffffff8108690b>] __warn+0xcb/0xf0
 [<ffffffff810869fd>] warn_slowpath_null+0x1d/0x20
 [<ffffffffa00c49e8>] _xfs_buf_find+0x448/0x4c0 [xfs]
 [<ffffffffa00c5011>] xfs_buf_get_map+0x1e1/0x400 [xfs]
 [<ffffffffa0126692>] xfs_trans_get_buf_map+0x222/0x3f0 [xfs]
 [<ffffffffa0076089>] xfs_btree_get_bufl+0x99/0xb0 [xfs]
 [<ffffffffa0062e8e>] xfs_bmap_extents_to_btree+0x2ce/0x780 [xfs]
 [<ffffffffa0066a91>] xfs_bmap_add_extent_delay_real+0x1af1/0x1ce0 [xfs]
 [<ffffffffa009eb0c>] ? xfs_iext_bno_to_ext+0x8c/0x170 [xfs]
 [<ffffffffa006d06f>] xfs_bmapi_write+0xb6f/0x1180 [xfs]
 [<ffffffff810e016d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa00dda24>] xfs_iomap_write_allocate+0x184/0x360 [xfs]
 [<ffffffffa00b9baf>] xfs_map_blocks+0x34f/0x4e0 [xfs]
 [<ffffffffa00ba920>] xfs_do_writepage+0x310/0x840 [xfs]
 [<ffffffff811abc91>] write_cache_pages+0x251/0x640
 [<ffffffffa00ba610>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
 [<ffffffffa00ba043>] xfs_vm_writepages+0xa3/0xe0 [xfs]
 [<ffffffff811acb81>] do_writepages+0x21/0x30
 [<ffffffff81252031>] __writeback_single_inode+0x61/0x800
 [<ffffffff81252cf7>] writeback_sb_inodes+0x297/0x5e0
 [<ffffffff8125329c>] wb_writeback+0xfc/0x5c0
 [<ffffffff81255e8b>] wb_workfn+0x11b/0x680
 [<ffffffff810a62d8>] process_one_work+0x1f8/0x750
 [<ffffffff810a6259>] ? process_one_work+0x179/0x750
 [<ffffffff810a687b>] worker_thread+0x4b/0x4f0
 [<ffffffff810a6830>] ? process_one_work+0x750/0x750
 [<ffffffff810ad3c2>] kthread+0x102/0x120
 [<ffffffff810ad2c0>] ? kthread_park+0x60/0x60
 [<ffffffff8168243a>] ret_from_fork+0x2a/0x40
---[ end trace a2c2fccec360758a ]---
BUG: unable to handle kernel NULL pointer dereference at 00000000000002a0
IP: [<ffffffffa0062e95>] xfs_bmap_extents_to_btree+0x2d5/0x780 [xfs]
PGD 76f16067 PUD 6f51c067 
PMD 0 
Oops: 0002 [#1] PREEMPT SMP
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in: xfs libcrc32c dax_pmem dax nd_pmem sch_fq_codel af_packet
CPU: 1 PID: 104 Comm: kworker/u8:6 Tainted: G        W       4.9.0-rc5-xfsx #41
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue: writeback wb_workfn (flush-8:80)
task: ffff8800750f9680 task.stack: ffffc90000838000
RIP: 0010:[<ffffffffa0062e95>]  [<ffffffffa0062e95>] xfs_bmap_extents_to_btree+0x2d5/0x780 [xfs]
RSP: 0018:ffffc9000083b408  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff88006be0a080 RCX: 0000000000000002
RDX: 0000000000000000 RSI: ffffffff819e97f8 RDI: 0000000000000246
RBP: ffffc9000083b520 R08: ffffea0000f8b090 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc9000083b838
R13: ffff88006d437000 R14: 0000000000000001 R15: ffff88006e407350
FS:  0000000000000000(0000) GS:ffff88007f200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000000002a0 CR3: 000000006e42f000 CR4: 00000000001406e0
Stack:
 0000000000001003 ffffc90000000009 ffffc9000083b568 0000000000000001
 0000000000000000 ffffc9000083b770 ffff880074d8c780 ffff88006e6f2db0
 ffffc9000083b848 ffff88006be0a0c8 ffff88006e407350 ffff88006d437000
Call Trace:
 [<ffffffffa0066a91>] xfs_bmap_add_extent_delay_real+0x1af1/0x1ce0 [xfs]
 [<ffffffffa009eb0c>] ? xfs_iext_bno_to_ext+0x8c/0x170 [xfs]
 [<ffffffffa006d06f>] xfs_bmapi_write+0xb6f/0x1180 [xfs]
 [<ffffffff810e016d>] ? trace_hardirqs_on+0xd/0x10
 [<ffffffffa00dda24>] xfs_iomap_write_allocate+0x184/0x360 [xfs]
 [<ffffffffa00b9baf>] xfs_map_blocks+0x34f/0x4e0 [xfs]
 [<ffffffffa00ba920>] xfs_do_writepage+0x310/0x840 [xfs]
 [<ffffffff811abc91>] write_cache_pages+0x251/0x640
 [<ffffffffa00ba610>] ? xfs_aops_discard_page+0x140/0x140 [xfs]
 [<ffffffffa00ba043>] xfs_vm_writepages+0xa3/0xe0 [xfs]
 [<ffffffff811acb81>] do_writepages+0x21/0x30
 [<ffffffff81252031>] __writeback_single_inode+0x61/0x800
 [<ffffffff81252cf7>] writeback_sb_inodes+0x297/0x5e0
 [<ffffffff8125329c>] wb_writeback+0xfc/0x5c0
 [<ffffffff81255e8b>] wb_workfn+0x11b/0x680
 [<ffffffff810a62d8>] process_one_work+0x1f8/0x750
 [<ffffffff810a6259>] ? process_one_work+0x179/0x750
 [<ffffffff810a687b>] worker_thread+0x4b/0x4f0
 [<ffffffff810a6830>] ? process_one_work+0x750/0x750
 [<ffffffff810ad3c2>] kthread+0x102/0x120
 [<ffffffff810ad2c0>] ? kthread_park+0x60/0x60
 [<ffffffff8168243a>] ret_from_fork+0x2a/0x40
Code: 00 00 00 48 83 83 08 03 00 00 01 e8 06 a3 0c 00 31 c9 4c 89 fe 4c 89 ef 48 8b 95 60 ff ff ff e8 62 31 01 00 48 89 85 08 ff ff ff <48> c7 80 a0 02 00 00 50 c8 13 a0 48 8b 80 68 01 00 00 48 89 85 
RIP  [<ffffffffa0062e95>] xfs_bmap_extents_to_btree+0x2d5/0x780 [xfs]
 RSP <ffffc9000083b408>
CR2: 00000000000002a0
---[ end trace a2c2fccec360758b ]---
------------[ cut here ]------------
WARNING: CPU: 1 PID: 104 at /storage/home/djwong/cdev/work/linux-xfs/kernel/exit.c:737 do_exit+0x4b/0xc40
Modules linked in: xfs libcrc32c dax_pmem dax nd_pmem sch_fq_codel af_packet
CPU: 1 PID: 104 Comm: kworker/u8:6 Tainted: G      D W       4.9.0-rc5-xfsx #41
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
 ffffc9000083be90 ffffffff81340643 0000000000000000 0000000000000000
 ffffc9000083bed0 ffffffff8108690b 000002e1819bf972 ffff8800750f9680
 0000000000000009 ffffc9000083b358 0000000000000002 000000000000000b
Call Trace:
 [<ffffffff81340643>] dump_stack+0x85/0xc2
 [<ffffffff8108690b>] __warn+0xcb/0xf0
 [<ffffffff810869fd>] warn_slowpath_null+0x1d/0x20
 [<ffffffff8108a99b>] do_exit+0x4b/0xc40
 [<ffffffff816836b7>] rewind_stack_do_exit+0x17/0x20
---[ end trace a2c2fccec360758c ]---
BUG: sleeping function called from invalid context at /storage/home/djwong/cdev/work/linux-xfs/include/linux/sched.h:3109
in_atomic(): 0, irqs_disabled(): 1, pid: 104, name: kworker/u8:6
INFO: lockdep is turned off.
irq event stamp: 403636
hardirqs last  enabled at (403635): [<ffffffff81205906>] kmem_cache_free+0x66/0x220
hardirqs last disabled at (403636): [<ffffffff81683489>] error_entry+0x69/0xc0
softirqs last  enabled at (403622): [<ffffffff81684c50>] __do_softirq+0x200/0x4bb
softirqs last disabled at (403615): [<ffffffff8108e979>] irq_exit+0xc9/0xd0
CPU: 1 PID: 104 Comm: kworker/u8:6 Tainted: G      D W       4.9.0-rc5-xfsx #41
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Workqueue:  0xfffed4840f02c4f6 (�Ł��)
 ffffc9000083be58 ffffffff81340643 0000000000000000 ffff8800750f9680
 ffffc9000083be90 ffffffff810b4994 ffffffff819c0848 0000000000000c25
 0000000000000000 0000000000000002 000000000000000b ffffc9000083beb8
Call Trace:
 [<ffffffff81340643>] dump_stack+0x85/0xc2
 [<ffffffff810b4994>] ___might_sleep+0x174/0x260
 [<ffffffff810b4aca>] __might_sleep+0x4a/0x80
 [<ffffffff8109b1b4>] exit_signals+0x24/0x2c0
 [<ffffffff8108a9f4>] do_exit+0xa4/0xc40
 [<ffffffff816836b7>] rewind_stack_do_exit+0x17/0x20


* Re: [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS
  2016-11-17 17:36 ` Darrick J. Wong
@ 2016-11-17 20:11   ` Darrick J. Wong
  2016-11-17 21:32     ` Dave Chinner
  2016-11-18  5:26     ` [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS Eryu Guan
  0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2016-11-17 20:11 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs

On Thu, Nov 17, 2016 at 09:36:39AM -0800, Darrick J. Wong wrote:
> On Fri, Nov 18, 2016 at 12:35:15AM +0800, Eryu Guan wrote:
> > Hi all,
> > 
> > I hit a test hang in generic/224 when testing rmapbt enabled XFS on a
> > host that has non-zero sunit/swidth reported from underlying device. And
> > I simplified the reproducer to the following script, and the hang can be
> > reproduced on any host now.
> > 
> > -----
> > #!/bin/bash
> > 
> > dev=/dev/sda5
> > mnt=/mnt/xfs
> > 
> > mkfs -t xfs -m rmapbt=1 -d agcount=8,size=1g -f $dev
> 
> Hm.  I formatted with:
> mkfs.xfs -m rmapbt=1 -d sunit=4096,swidth=40960 -f /dev/sdf
> 
> (made up sunit numbers just to see how whacky it could get)
> 
> and got a different hang instead.  It looks like we are unable to
> allocate any blocks to the bmbt and various things blow up from
> there.  Will go retry with tracepoints on to see if we're running
> out of AG reservation or if we're really out of disk blocks or what.
> 
> Crash message attached at the end.

Hm.  Looking at the indlen calculations, I see that we don't include the
space that the rmapbt might need to store all the reverse mappings.  I
think this is a problem, since we decline delalloc reservations if (len
+ indlen) > fdblocks, but we potentially end up using more than indlen
blocks to map len blocks into the file, so the allocator goes nuts.
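
To make the accounting concern above concrete, here is a minimal user-space
sketch in C. It is a toy model under stated assumptions, not kernel code:
struct toy_space, toy_reserve() and toy_convert() are hypothetical names, and
the only thing taken from the discussion is the admission test
(len + indlen) > fdblocks. It shows how an indlen estimate that is too small
lets a reservation through that cannot be satisfied later.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the free-block counter (fdblocks). */
struct toy_space {
	uint64_t fdblocks;
};

/*
 * Admission check at delalloc time: decline the reservation when
 * (len + indlen) > fdblocks, otherwise take both out of free space.
 */
static bool toy_reserve(struct toy_space *s, uint64_t len, uint64_t indlen)
{
	assert(len <= UINT64_MAX - indlen);	/* guard the sum against overflow */
	if (len + indlen > s->fdblocks)
		return false;			/* caller would report ENOSPC */
	s->fdblocks -= len + indlen;
	return true;
}

/*
 * Conversion at writeback time: the mapping metadata may need more blocks
 * than were reserved as indlen. Any shortfall has to come from free space;
 * any surplus is returned to it.
 */
static bool toy_convert(struct toy_space *s, uint64_t indlen_reserved,
			uint64_t meta_needed)
{
	if (meta_needed <= indlen_reserved) {
		s->fdblocks += indlen_reserved - meta_needed;
		return true;
	}
	if (meta_needed - indlen_reserved > s->fdblocks)
		return false;			/* nothing left to allocate from */
	s->fdblocks -= meta_needed - indlen_reserved;
	return true;
}
```

In this model, with 100 free blocks, reserving len=95 with indlen=5 succeeds,
but a later conversion that needs 8 metadata blocks fails. Reserving with
indlen=8 would have been declined up front instead.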

Eryu, does the following patch fix the problem you see?  I ran your
reproducer and mine and it fixed the problem in both cases.  I didn't
observe any issues running generic/224 either.

--D

---
From: Darrick J. Wong <darrick.wong@oracle.com>
Subject: [PATCH] xfs: factor rmap btree size into the indlen calculations

When we're estimating the amount of space it's going to take to satisfy
a delalloc reservation, we need to include the space that we might need
to grow the rmapbt.  This helps us to avoid running out of space later
when _iomap_write_allocate needs more space than we reserved.  Eryu Guan
observed this happening on generic/224 when sunit/swidth were set.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index b80a294..afedf96 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -49,6 +49,7 @@
 #include "xfs_rmap.h"
 #include "xfs_ag_resv.h"
 #include "xfs_refcount.h"
+#include "xfs_rmap_btree.h"
 
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
@@ -190,8 +191,12 @@ xfs_bmap_worst_indlen(
 	int		maxrecs;	/* maximum record count at this level */
 	xfs_mount_t	*mp;		/* mount structure */
 	xfs_filblks_t	rval;		/* return value */
+	xfs_filblks_t   orig_len;
 
 	mp = ip->i_mount;
+
+	/* Calculate the worst-case size of the bmbt. */
+	orig_len = len;
 	maxrecs = mp->m_bmap_dmxr[0];
 	for (level = 0, rval = 0;
 	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK);
@@ -199,12 +204,20 @@ xfs_bmap_worst_indlen(
 		len += maxrecs - 1;
 		do_div(len, maxrecs);
 		rval += len;
-		if (len == 1)
-			return rval + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
+		if (len == 1) {
+			rval += XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
 				level - 1;
+			break;
+		}
 		if (level == 0)
 			maxrecs = mp->m_bmap_dmxr[1];
 	}
+
+	/* Calculate the worst-case size of the rmapbt. */
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		rval += 1 + xfs_rmapbt_calc_size(mp, orig_len) +
+				mp->m_rmap_maxlevels;
+
 	return rval;
 }
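
For experimenting with the arithmetic outside the kernel, the patched
calculation can be approximated in user space as sketched below. Assumptions
are labeled: struct toy_geom and its fields are hypothetical stand-ins for the
mount fields the patch reads (m_bmap_dmxr[], XFS_BM_MAXLEVELS(),
m_rmap_maxlevels), and toy_btree_size() is a simplified replacement for
xfs_rmapbt_calc_size(), not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the geometry fields of struct xfs_mount. */
struct toy_geom {
	unsigned int bmbt_leaf_recs;	/* plays the role of m_bmap_dmxr[0] */
	unsigned int bmbt_node_recs;	/* plays the role of m_bmap_dmxr[1] */
	unsigned int bmbt_maxlevels;	/* plays the role of XFS_BM_MAXLEVELS() */
	bool has_rmapbt;		/* rmapbt feature enabled? */
	unsigned int rmap_leaf_recs;	/* assumed rmapbt records per leaf */
	unsigned int rmap_node_recs;	/* assumed rmapbt pointers per node */
	unsigned int rmap_maxlevels;	/* plays the role of m_rmap_maxlevels */
};

static uint64_t toy_ceil_div(uint64_t n, uint64_t d)
{
	return (n + d - 1) / d;
}

/* Simplified btree size: sum the blocks per level until one block remains. */
static uint64_t toy_btree_size(uint64_t nrecs, unsigned int leaf_recs,
			       unsigned int node_recs)
{
	uint64_t blocks, total;

	assert(leaf_recs > 0 && node_recs > 1);	/* otherwise the loop cannot shrink */
	blocks = toy_ceil_div(nrecs, leaf_recs);
	total = blocks;
	while (blocks > 1) {
		blocks = toy_ceil_div(blocks, node_recs);
		total += blocks;
	}
	return total;
}

/* Mirrors the loop structure of the patched xfs_bmap_worst_indlen(). */
static uint64_t toy_worst_indlen(const struct toy_geom *g, uint64_t len)
{
	uint64_t orig_len = len;
	uint64_t rval = 0;
	unsigned int maxrecs = g->bmbt_leaf_recs;
	unsigned int level;

	/* Worst-case size of the bmbt. */
	for (level = 0; level < g->bmbt_maxlevels; level++) {
		len = toy_ceil_div(len, maxrecs);
		rval += len;
		if (len == 1) {
			/* one block for each level above this one */
			rval += g->bmbt_maxlevels - level - 1;
			break;
		}
		if (level == 0)
			maxrecs = g->bmbt_node_recs;
	}

	/* Worst-case size of the rmapbt, as added by the patch. */
	if (g->has_rmapbt)
		rval += 1 + toy_btree_size(orig_len, g->rmap_leaf_recs,
					   g->rmap_node_recs) +
			g->rmap_maxlevels;
	return rval;
}
```

With made-up geometry (4 records per block everywhere, 3 bmbt levels, 2 rmapbt
levels), a 10-block reservation yields 5 indirect blocks without rmapbt and 12
with it, which illustrates how much the original estimate could fall short.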


* Re: [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS
  2016-11-17 20:11   ` Darrick J. Wong
@ 2016-11-17 21:32     ` Dave Chinner
  2016-11-17 23:55       ` [PATCH 0/4] xfs: fix rmapbt ENOSPC hangs Dave Chinner
  2016-11-18  5:26     ` [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS Eryu Guan
  1 sibling, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-11-17 21:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Eryu Guan, linux-xfs

On Thu, Nov 17, 2016 at 12:11:02PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 17, 2016 at 09:36:39AM -0800, Darrick J. Wong wrote:
> > On Fri, Nov 18, 2016 at 12:35:15AM +0800, Eryu Guan wrote:
> > > Hi all,
> > > 
> > > I hit a test hang in generic/224 when testing rmapbt enabled XFS on a
> > > host that has non-zero sunit/swidth reported from underlying device. And
> > > I simplified the reproducer to the following script, and the hang can be
> > > reproduced on any host now.
> > > 
> > > -----
> > > #!/bin/bash
> > > 
> > > dev=/dev/sda5
> > > mnt=/mnt/xfs
> > > 
> > > mkfs -t xfs -m rmapbt=1 -d agcount=8,size=1g -f $dev
> > 
> > Hm.  I formatted with:
> > mkfs.xfs -m rmapbt=1 -d sunit=4096,swidth=40960 -f /dev/sdf
> > 
> > (made up sunit numbers just to see how whacky it could get)
> > 
> > and got a different hang instead.  It looks like we are unable to
> > allocate any blocks to the bmbt and various things blow up from
> > there.  Will go retry with tracepoints on to see if we're running
> > out of AG reservation or if we're really out of disk blocks or what.
> > 
> > Crash message attached at the end.
> 
> Hm.  Looking at the indlen calculations, I see that we don't include the
> space that the rmapbt might need to store all the reverse mappings.  I
> think this is a problem, since we decline delalloc reservations if (len
> + indlen) > fdblocks, but we potentially end up using more than indlen
> blocks to map len blocks into the file, so the allocator goes nuts.
> 
> Eryu, does the following patch fix the problem you see?  I ran your
> reproducer and mine and it fixed the problem in both cases.  I didn't
> observe any issues running generic/224 either.
> 
> --D
> 
> ---
> From: Darrick J. Wong <darrick.wong@oracle.com>
> Subject: [PATCH] xfs: factor rmap btree size into the indlen calculations
> 
> When we're estimating the amount of space it's going to take to satisfy
> a delalloc reservation, we need to include the space that we might need
> to grow the rmapbt.  This helps us to avoid running out of space later
> when _iomap_write_allocate needs more space than we reserved.  Eryu Guan
> observed this happening on generic/224 when sunit/swidth were set.
> 
> Reported-by: Eryu Guan <eguan@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |   17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index b80a294..afedf96 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -49,6 +49,7 @@
>  #include "xfs_rmap.h"
>  #include "xfs_ag_resv.h"
>  #include "xfs_refcount.h"
> +#include "xfs_rmap_btree.h"
>  
>  
>  kmem_zone_t		*xfs_bmap_free_item_zone;
> @@ -190,8 +191,12 @@ xfs_bmap_worst_indlen(
>  	int		maxrecs;	/* maximum record count at this level */
>  	xfs_mount_t	*mp;		/* mount structure */
>  	xfs_filblks_t	rval;		/* return value */
> +	xfs_filblks_t   orig_len;
>  
>  	mp = ip->i_mount;
> +
> +	/* Calculate the worst-case size of the bmbt. */
> +	orig_len = len;
>  	maxrecs = mp->m_bmap_dmxr[0];
>  	for (level = 0, rval = 0;
>  	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK);
> @@ -199,12 +204,20 @@ xfs_bmap_worst_indlen(
>  		len += maxrecs - 1;
>  		do_div(len, maxrecs);
>  		rval += len;
> -		if (len == 1)
> -			return rval + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
> +		if (len == 1) {
> +			rval += XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
>  				level - 1;
> +			break;
> +		}
>  		if (level == 0)
>  			maxrecs = mp->m_bmap_dmxr[1];
>  	}
> +
> +	/* Calculate the worst-case size of the rmapbt. */
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		rval += 1 + xfs_rmapbt_calc_size(mp, orig_len) +
> +				mp->m_rmap_maxlevels;
> +
>  	return rval;
>  }

So, I wrote an identical patch back when rmap was first merged and
224 was failing, but the ENOSPC problem I was hitting with 224 went
away with the reflink merge a cycle later.

This change then uncovered a separate failure with rmap+delalloc
reservation: when we allocate the delalloc extent, we release the
entire remaining indlen reservation, and so it's not available for
the rmapbt allocation that it was reserved for. Hence we still
get lockups at ENOSPC. There were a series of patches to address
this that I never finished - let me go forward port it, smoke test
it and repost it....
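
The second failure described above can be illustrated with a small toy model in
C. Everything here is hypothetical (struct toy_fs, struct toy_resv and the
toy_* helpers are invented for illustration), and the step where another writer
consumes the released blocks is an assumption about how the reservation
becomes unavailable. The keep_for_rmap flag exists only for contrast and is
not meant to describe what the patch series does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_fs {
	uint64_t fdblocks;	/* free blocks anyone may reserve */
};

struct toy_resv {
	uint64_t indlen;	/* indirect blocks held by one delalloc extent */
};

/*
 * Convert the delalloc extent: the bmbt consumes bmbt_used blocks of the
 * reservation. Unless keep_for_rmap is set, the remainder is released to
 * free space straight away.
 */
static void toy_convert(struct toy_fs *fs, struct toy_resv *r,
			uint64_t bmbt_used, bool keep_for_rmap)
{
	assert(bmbt_used <= r->indlen);
	r->indlen -= bmbt_used;
	if (!keep_for_rmap) {
		fs->fdblocks += r->indlen;
		r->indlen = 0;
	}
}

/* Another writer taking blocks out of free space. */
static bool toy_take_free(struct toy_fs *fs, uint64_t n)
{
	if (n > fs->fdblocks)
		return false;
	fs->fdblocks -= n;
	return true;
}

/*
 * rmapbt update after the conversion: use what is left of the reservation
 * first, then fall back to free space.
 */
static bool toy_rmapbt_alloc(struct toy_fs *fs, struct toy_resv *r,
			     uint64_t need)
{
	uint64_t from_resv = need < r->indlen ? need : r->indlen;

	if (need - from_resv > fs->fdblocks)
		return false;	/* reservation gone and no free space left */
	r->indlen -= from_resv;
	fs->fdblocks -= need - from_resv;
	return true;
}
```

Starting from zero free blocks and an indlen of 6, releasing the remainder
after the bmbt uses 1 block puts 5 blocks back in free space. If another writer
takes them, a 2-block rmapbt allocation fails, whereas holding the remainder
lets it succeed.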

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* [PATCH 0/4] xfs: fix rmapbt ENOSPC hangs
  2016-11-17 21:32     ` Dave Chinner
@ 2016-11-17 23:55       ` Dave Chinner
  2016-11-17 23:55         ` [PATCH 1/4] xfs: factor rmap btree size into the indlen calculations Dave Chinner
                           ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Dave Chinner @ 2016-11-17 23:55 UTC (permalink / raw)
  To: linux-xfs

Darrick,

This is the patchset I wrote to address the rmapbt enospc issue
that generic/224 was tripping across. I replaced the first patch
with yours as it's cleaner, and forward ported the other patches.

I stopped working on this because the reflink merge fixed the
problems I was seeing. Maybe that was because I stopped testing
rmapbt=1 by itself, though, so the problem never really got fixed.

I'm running it through an auto test run right now, but I thought
I'd post it now so you can comment on the next three patches in the
series. Also to see if it passes the enospc tests on your setups...

Cheers,

Dave.



* [PATCH 1/4] xfs: factor rmap btree size into the indlen calculations
  2016-11-17 23:55       ` [PATCH 0/4] xfs: fix rmapbt ENOSPC hangs Dave Chinner
@ 2016-11-17 23:55         ` Dave Chinner
  2016-11-17 23:55         ` [PATCH 2/4] xfs: add more AGF/AGFL manipulation tracepoints Dave Chinner
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-11-17 23:55 UTC (permalink / raw)
  To: linux-xfs

From: "Darrick J. Wong" <darrick.wong@oracle.com>

When we're estimating the amount of space it's going to take to satisfy
a delalloc reservation, we need to include the space that we might need
to grow the rmapbt.  This helps us to avoid running out of space later
when _iomap_write_allocate needs more space than we reserved.  Eryu Guan
observed this happening on generic/224 when sunit/swidth were set.

Reported-by: Eryu Guan <eguan@redhat.com>
Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 5c3c4dd14735..00188f559c8d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -49,6 +49,7 @@
 #include "xfs_rmap.h"
 #include "xfs_ag_resv.h"
 #include "xfs_refcount.h"
+#include "xfs_rmap_btree.h"
 
 
 kmem_zone_t		*xfs_bmap_free_item_zone;
@@ -190,8 +191,12 @@ xfs_bmap_worst_indlen(
 	int		maxrecs;	/* maximum record count at this level */
 	xfs_mount_t	*mp;		/* mount structure */
 	xfs_filblks_t	rval;		/* return value */
+	xfs_filblks_t   orig_len;
 
 	mp = ip->i_mount;
+
+	/* Calculate the worst-case size of the bmbt. */
+	orig_len = len;
 	maxrecs = mp->m_bmap_dmxr[0];
 	for (level = 0, rval = 0;
 	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK);
@@ -199,12 +204,20 @@ xfs_bmap_worst_indlen(
 		len += maxrecs - 1;
 		do_div(len, maxrecs);
 		rval += len;
-		if (len == 1)
-			return rval + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
+		if (len == 1) {
+			rval += XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
 				level - 1;
+			break;
+		}
 		if (level == 0)
 			maxrecs = mp->m_bmap_dmxr[1];
 	}
+
+	/* Calculate the worst-case size of the rmapbt. */
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
+		rval += 1 + xfs_rmapbt_calc_size(mp, orig_len) +
+				mp->m_rmap_maxlevels;
+
 	return rval;
 }
 
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 13+ messages in thread
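
The estimate that patch 1/4 extends can be modelled outside the kernel. The following is a minimal user-space sketch, not the XFS code: struct tree_geom, worst_btree_blocks() and worst_indlen() are hypothetical names, and the rmapbt term simply reuses the same per-level loop as a stand-in for the patch's "1 + xfs_rmapbt_calc_size(mp, orig_len) + mp->m_rmap_maxlevels" expression.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical btree geometry; the names are illustrative, not XFS fields. */
struct tree_geom {
	unsigned int	leaf_recs;	/* records per leaf block */
	unsigned int	node_recs;	/* pointers per interior block */
	unsigned int	maxlevels;	/* maximum height of the tree */
};

/* Worst-case number of blocks needed to index len records in one btree. */
static uint64_t
worst_btree_blocks(uint64_t len, const struct tree_geom *g)
{
	uint64_t	total = 0;
	unsigned int	maxrecs = g->leaf_recs;
	unsigned int	level;

	assert(g->leaf_recs > 0 && g->node_recs > 0);

	for (level = 0; level < g->maxlevels; level++) {
		/* Blocks at this level: ceil(len / maxrecs). */
		len = (len + maxrecs - 1) / maxrecs;
		total += len;
		if (len == 1) {
			/* One block for each remaining level up to the root. */
			total += g->maxlevels - level - 1;
			break;
		}
		maxrecs = g->node_recs;
	}
	return total;
}

/*
 * Indirect-block reservation for a delalloc extent of len blocks.  Both
 * estimates start from the original len, which is why the patch saves
 * orig_len before its bmbt loop modifies len in place.
 */
static uint64_t
worst_indlen(uint64_t len, const struct tree_geom *bmbt,
	     const struct tree_geom *rmapbt)
{
	uint64_t	rval = worst_btree_blocks(len, bmbt);

	if (rmapbt)	/* NULL models a filesystem without rmapbt */
		rval += worst_btree_blocks(len, rmapbt);
	return rval;
}
```

With ten records per block and a five-level limit, 1000 records need 100 + 10 + 1 blocks plus two more for the levels above, 113 in total. Passing a second geometry adds that tree's worst case on top, which is the extra space the delalloc reservation was previously missing.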

* [PATCH 2/4] xfs: add more AGF/AGFL manipulation tracepoints
  2016-11-17 23:55       ` [PATCH 0/4] xfs: fix rmapbt ENOSPC hangs Dave Chinner
  2016-11-17 23:55         ` [PATCH 1/4] xfs: factor rmap btree size into the indlen calculations Dave Chinner
@ 2016-11-17 23:55         ` Dave Chinner
  2016-11-17 23:55         ` [PATCH 3/4] xfs: hold AGF buffers over defer ops Dave Chinner
  2016-11-17 23:55         ` [PATCH 4/4] xfs: defer indirect delalloc rmap reservations Dave Chinner
  3 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-11-17 23:55 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

So we can see what blocks are added to or removed from the AGFL during
allocation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c |  8 ++++++--
 fs/xfs/xfs_trace.h        | 11 ++++++++++-
 2 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index effb64cf714f..a5a9d8360e74 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2264,10 +2264,12 @@ xfs_alloc_get_freelist(
 	xfs_mount_t	*mp = tp->t_mountp;
 	xfs_perag_t	*pag;	/* per allocation group data */
 
+	agf = XFS_BUF_TO_AGF(agbp);
+	trace_xfs_alloc_get_freelist(mp, agf, 0, _RET_IP_);
+
 	/*
 	 * Freelist is empty, give up.
 	 */
-	agf = XFS_BUF_TO_AGF(agbp);
 	if (!agf->agf_flcount) {
 		*bnop = NULLAGBLOCK;
 		return 0;
@@ -2392,8 +2394,9 @@ xfs_alloc_put_freelist(
 	__be32			*agfl_bno;
 	int			startoff;
 
-	agf = XFS_BUF_TO_AGF(agbp);
 	mp = tp->t_mountp;
+	agf = XFS_BUF_TO_AGF(agbp);
+	trace_xfs_alloc_put_freelist(mp, agf, 0, _RET_IP_);
 
 	if (!agflbp && (error = xfs_alloc_read_agfl(mp, tp,
 			be32_to_cpu(agf->agf_seqno), &agflbp)))
@@ -2555,6 +2558,7 @@ xfs_read_agf(
 	if (!*bpp)
 		return 0;
 
+	trace_xfs_read_agf_detail(mp, XFS_BUF_TO_AGF(*bpp), 0, _RET_IP_);
 	ASSERT(!(*bpp)->b_error);
 	xfs_buf_set_ref(*bpp, XFS_AGF_REF);
 	return 0;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 0907752be62d..73e001d795ce 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -1516,7 +1516,7 @@ TRACE_EVENT(xfs_trans_commit_lsn,
 		  __entry->lsn)
 );
 
-TRACE_EVENT(xfs_agf,
+DECLARE_EVENT_CLASS(xfs_agf_class,
 	TP_PROTO(struct xfs_mount *mp, struct xfs_agf *agf, int flags,
 		 unsigned long caller_ip),
 	TP_ARGS(mp, agf, flags, caller_ip),
@@ -1572,6 +1572,15 @@ TRACE_EVENT(xfs_agf,
 		  __entry->longest,
 		  (void *)__entry->caller_ip)
 );
+#define DEFINE_AGF_EVENT(name) \
+DEFINE_EVENT(xfs_agf_class, name, \
+	TP_PROTO(struct xfs_mount *mp, struct xfs_agf *agf, int flags, \
+		 unsigned long caller_ip), \
+	TP_ARGS(mp, agf, flags, caller_ip))
+DEFINE_AGF_EVENT(xfs_agf);
+DEFINE_AGF_EVENT(xfs_read_agf_detail);
+DEFINE_AGF_EVENT(xfs_alloc_get_freelist);
+DEFINE_AGF_EVENT(xfs_alloc_put_freelist);
 
 TRACE_EVENT(xfs_free_extent,
 	TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, xfs_agblock_t agbno,
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 3/4] xfs: hold AGF buffers over defer ops
  2016-11-17 23:55       ` [PATCH 0/4] xfs: fix rmapbt ENOSPC hangs Dave Chinner
  2016-11-17 23:55         ` [PATCH 1/4] xfs: factor rmap btree size into the indlen calculations Dave Chinner
  2016-11-17 23:55         ` [PATCH 2/4] xfs: add more AGF/AGFL manipulation tracepoints Dave Chinner
@ 2016-11-17 23:55         ` Dave Chinner
  2016-11-18  0:53           ` Dave Chinner
  2016-11-17 23:55         ` [PATCH 4/4] xfs: defer indirect delalloc rmap reservations Dave Chinner
  3 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-11-17 23:55 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

We make space and AGFL adjustments in the initial transaction for
rmap defer ops to complete, but those reservations are only valid
while we hold the AGF locked. The issue here is that if we drop the
AGF lock, we can race with other operations that need AGFL
adjustments for their deferred rmap operations, and so we can get
multiple deferred ops running at the same time that share a single
AGFL reservation instead of having one each. With enough deferred
rmap ops running at the same time, we run the AGFL out of free blocks
and fail rmap insertions, resulting in corruption shutdowns.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c  | 14 ++++++++++
 fs/xfs/libxfs/xfs_defer.c | 69 ++++++++++++++++++++++++++++++++++++++++++++---
 fs/xfs/libxfs/xfs_defer.h |  6 +++++
 fs/xfs/xfs_trans_buf.c    | 12 ++++++++-
 4 files changed, 96 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 00188f559c8d..6211b4b5e826 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3932,6 +3932,20 @@ xfs_bmap_btalloc(
 			ap->wasdel ? XFS_TRANS_DQ_DELBCOUNT :
 					XFS_TRANS_DQ_BCOUNT,
 			(long) args.len);
+
+		/*
+		 * If we are going to update the rmap btree, we need to join the
+		 * AGF used for this allocation to the defer ops processing so
+		 * that it stays locked until we've processed all the relevant
+		 * btree updates that are required.
+		 */
+		if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+			ASSERT(args.agbp);
+			error = xfs_defer_join_buf(ap->tp, ap->dfops,
+						   args.agbp);
+			if (error)
+				return error;
+		}
 	} else {
 		ap->blkno = NULLFSBLOCK;
 		ap->length = 0;
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 5c2929f94bd3..1e5a8b04c7f5 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -271,6 +271,13 @@ xfs_defer_trans_roll(
 		xfs_trans_ijoin(*tp, dop->dop_inodes[i], 0);
 	}
 
+	/* Rejoin the held buffers */
+	for (i = 0; i < XFS_DEFER_OPS_NR_BUFS && dop->dop_bufs[i]; i++) {
+		ASSERT(dop->dop_bufs[i]->b_transp == NULL);
+		xfs_trans_bjoin(*tp, dop->dop_bufs[i]);
+		xfs_trans_bhold(*tp, dop->dop_bufs[i]);
+	}
+
 	return error;
 }
 
@@ -297,7 +304,7 @@ xfs_defer_join(
 	for (i = 0; i < XFS_DEFER_OPS_NR_INODES; i++) {
 		if (dop->dop_inodes[i] == ip)
 			return 0;
-		else if (dop->dop_inodes[i] == NULL) {
+		if (!dop->dop_inodes[i]) {
 			dop->dop_inodes[i] = ip;
 			return 0;
 		}
@@ -307,6 +314,60 @@ xfs_defer_join(
 }
 
 /*
+ * Add this buffer to the deferred op.  If this is a new buffer we are joining
+ * to the deferred ops, we need to ensure it will be held locked across
+ * transaction commits. This means each joined buffer remains locked until
+ * the final defer op commits and we clean up the xfs_defer_ops structure. This
+ * ensures atomicity of access to buffers and their state while we perform
+ * multiple operations to them through deferred ops processing.
+ */
+int
+xfs_defer_join_buf(
+	struct xfs_trans	*tp,
+	struct xfs_defer_ops	*dop,
+	struct xfs_buf		*bp)
+{
+	int				i;
+
+	for (i = 0; i < XFS_DEFER_OPS_NR_BUFS; i++) {
+		if (dop->dop_bufs[i] == bp)
+			return 0;
+		if (!dop->dop_bufs[i]) {
+			xfs_trans_bhold(tp, bp);
+			dop->dop_bufs[i] = bp;
+			return 0;
+		}
+	}
+
+	return -EFSCORRUPTED;
+}
+
+/*
+ * When we are all done with the defer processing, the buffers we hold
+ * will still be locked. We need to unlock and release them now.
+ *
+ * We can get called in from a context that doesn't pass a transaction,
+ * but the buffers are still attached to a transaction. If this happens,
+ * pull the transaction from the buffer. If the buffer is not part of a
+ * transaction, then b_transp will be null and the buffer will be released
+ * normally.
+ */
+static void
+xfs_defer_brelse(
+	struct xfs_trans	*tp,
+	struct xfs_defer_ops	*dop)
+{
+	int				i;
+
+	for (i = 0; i < XFS_DEFER_OPS_NR_BUFS; i++) {
+		if (!tp)
+			tp = dop->dop_bufs[i]->b_transp;
+		if (dop->dop_bufs[i])
+			xfs_trans_bhold_release(tp, dop->dop_bufs[i]);
+	}
+}
+
+/*
  * Finish all the pending work.  This involves logging intent items for
  * any work items that wandered in since the last transaction roll (if
  * one has even happened), rolling the transaction, and finishing the
@@ -402,6 +463,7 @@ xfs_defer_finish(
 	}
 
 out:
+	xfs_defer_brelse(*tp, dop);
 	if (error)
 		trace_xfs_defer_finish_error((*tp)->t_mountp, dop, error);
 	else
@@ -449,6 +511,7 @@ xfs_defer_cancel(
 		ASSERT(dfp->dfp_count == 0);
 		kmem_free(dfp);
 	}
+	xfs_defer_brelse(NULL, dop);
 }
 
 /* Add an item for later deferred processing. */
@@ -502,9 +565,7 @@ xfs_defer_init(
 	struct xfs_defer_ops		*dop,
 	xfs_fsblock_t			*fbp)
 {
-	dop->dop_committed = false;
-	dop->dop_low = false;
-	memset(&dop->dop_inodes, 0, sizeof(dop->dop_inodes));
+	memset(dop, 0, sizeof(*dop));
 	*fbp = NULLFSBLOCK;
 	INIT_LIST_HEAD(&dop->dop_intake);
 	INIT_LIST_HEAD(&dop->dop_pending);
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index f6e93ef0bffe..036fde7d839b 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -59,6 +59,7 @@ enum xfs_defer_ops_type {
 };
 
 #define XFS_DEFER_OPS_NR_INODES	2	/* join up to two inodes */
+#define XFS_DEFER_OPS_NR_BUFS	2	/* join up to two buffers */
 
 struct xfs_defer_ops {
 	bool			dop_committed;	/* did any trans commit? */
@@ -68,6 +69,9 @@ struct xfs_defer_ops {
 
 	/* relog these inodes with each roll */
 	struct xfs_inode	*dop_inodes[XFS_DEFER_OPS_NR_INODES];
+
+	/* hold these buffers with each roll */
+	struct xfs_buf		*dop_bufs[XFS_DEFER_OPS_NR_BUFS];
 };
 
 void xfs_defer_add(struct xfs_defer_ops *dop, enum xfs_defer_ops_type type,
@@ -78,6 +82,8 @@ void xfs_defer_cancel(struct xfs_defer_ops *dop);
 void xfs_defer_init(struct xfs_defer_ops *dop, xfs_fsblock_t *fbp);
 bool xfs_defer_has_unfinished_work(struct xfs_defer_ops *dop);
 int xfs_defer_join(struct xfs_defer_ops *dop, struct xfs_inode *ip);
+int xfs_defer_join_buf(struct xfs_trans *tp, struct xfs_defer_ops *dop,
+		struct xfs_buf *bp);
 
 /* Description of a deferred type. */
 struct xfs_defer_op_type {
diff --git a/fs/xfs/xfs_trans_buf.c b/fs/xfs/xfs_trans_buf.c
index 8ee29ca132dc..f88552182256 100644
--- a/fs/xfs/xfs_trans_buf.c
+++ b/fs/xfs/xfs_trans_buf.c
@@ -66,6 +66,12 @@ xfs_trans_buf_item_match(
  * Add the locked buffer to the transaction.
  *
  * The buffer must be locked, and it cannot be associated with any
+ * transaction other than the one we pass in. This "join" recursion
+ * is a result of needing to hold buffers locked across multiple transactions in
+ * the defer ops processing. To hold the buffer when rolling the transaction we
+ * need to first join the buffer to transaction, and hence when this gets done
+ * in the normal course of the transaction we then get called a second time.
+ * Hence just do nothing if bp->b_transp already matches the incoming
  * transaction.
  *
  * If the buffer does not yet have a buf log item associated with it,
@@ -79,7 +85,11 @@ _xfs_trans_bjoin(
 {
 	struct xfs_buf_log_item	*bip;
 
-	ASSERT(bp->b_transp == NULL);
+	ASSERT(!bp->b_transp || bp->b_transp == tp);
+
+	/* already joined to this transaction from defer ops? */
+	if (bp->b_transp == tp)
+		return;
 
 	/*
 	 * The xfs_buf_log_item pointer is stored in b_fsprivate.  If
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 13+ messages in thread
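
The hold-across-commit idea in patch 3/4 can be illustrated with a toy model. This is a sketch under stated assumptions, not kernel code: struct buf, struct defer_ops, defer_join_buf() and trans_commit_buf() are simplified stand-ins, and -EINVAL replaces the kernel-only -EFSCORRUPTED that the patch returns when both slots are taken.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_HELD_BUFS	2	/* same two-slot limit as the patch */

/* Toy stand-ins for struct xfs_buf and struct xfs_defer_ops. */
struct buf {
	bool	locked;
	bool	hold;		/* survive transaction commit while set */
};

struct defer_ops {
	struct buf	*held[NR_HELD_BUFS];
};

/* Join bp to the deferred ops; joining the same buffer twice is a no-op. */
static int
defer_join_buf(struct defer_ops *dop, struct buf *bp)
{
	int	i;

	for (i = 0; i < NR_HELD_BUFS; i++) {
		if (dop->held[i] == bp)
			return 0;
		if (!dop->held[i]) {
			bp->hold = true;
			dop->held[i] = bp;
			return 0;
		}
	}
	return -EINVAL;	/* out of slots; the patch returns -EFSCORRUPTED */
}

/* Model of a commit: the buffer is unlocked unless it is held. */
static void
trans_commit_buf(struct buf *bp)
{
	assert(bp->locked);
	if (!bp->hold)
		bp->locked = false;
}
```

A buffer that has been joined stays locked when the transaction commits, while an unjoined one is unlocked and becomes visible to other allocators. That difference is what keeps one AGFL reservation from being shared by several deferred ops.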

* [PATCH 4/4] xfs: defer indirect delalloc rmap reservations
  2016-11-17 23:55       ` [PATCH 0/4] xfs: fix rmapbt ENOSPC hangs Dave Chinner
                           ` (2 preceding siblings ...)
  2016-11-17 23:55         ` [PATCH 3/4] xfs: hold AGF buffers over defer ops Dave Chinner
@ 2016-11-17 23:55         ` Dave Chinner
  3 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-11-17 23:55 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When we do rmap additions for delalloc extents, we need to ensure we
have space reserved for them. We do this by keeping the space
required in the indirect length associated with a delalloc extent.

However, when we allocate the extent, we immediately release the
unused portion of the indlen reservation, and hence when we come to
needing it when processing the deferred rmap btree insertion, it's
no longer available. Because it gets returned to the global free
space pool, other delalloc reservations can take it before we use
it, resulting in having no space available to fix up the free list
to the correct length before doing the rmap insertion. This results
in an insertion failure and shutdown.

To avoid this problem, rather than releasing the unused indlen
reservation, store it in the transaction to be released when the
transaction is finally committed. When we roll a transaction during
defer ops processing, we transfer the unused block reservation to
the new transaction before we commit the old one. This keeps the
unused reservation local to the deferred ops context. On final
commit, the unused reservation space can be returned to the global
pool.

The final piece of the puzzle is hooking this up to the free list
fixup that ensures we have enough blocks on the free list for the
rmap insert. In this case, ensure that xfs_rmapbt_alloc_block()
always decrements a block from the reservation on the transaction.
This will track the number of blocks we've actually consumed from
the free list for the rmapbt, hence ensuring that we accurately
account for those blocks when the final transaction commit occurs.

Because we now hold the delalloc rmapbt block reservation until
we've done all the rmapbt block allocation, we should not see ENOSPC
problems as a result of the AGFL being emptied during rmap btree
insertion operations.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 14 ++++++++++++++
 fs/xfs/libxfs/xfs_bmap.c  | 16 ++++++++++++----
 fs/xfs/xfs_trans.c        | 28 ++++++++++++++++++++++++++++
 fs/xfs/xfs_trans.h        |  1 +
 4 files changed, 55 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index a5a9d8360e74..e835bf24a85b 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2228,6 +2228,20 @@ xfs_alloc_fix_freelist(
 							agflbp, bno, 0);
 			if (error)
 				goto out_agflbp_relse;
+
+			/*
+			 * If we've just done a delayed allocation and now we
+			 * are processing a deferred metadata update (such as an
+			 * rmapbt update), we'll have a space reservation for
+			 * the rmapbt blocks that may be needed. These are
+			 * allocated from the freelist, so account for them here
+			 * when we refill the AGFL. We've held the AGF locked
+			 * across the deferred transactions, so this should only
+			 * be refilling blocks we consumed from the AGFL in the
+			 * preceding transaction.
+			 */
+			if (tp->t_blk_deferred)
+				tp->t_blk_deferred--;
 		}
 	}
 	xfs_trans_brelse(tp, agflbp);
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 6211b4b5e826..f1db9a03e4c4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -2203,10 +2203,19 @@ xfs_bmap_add_extent_delay_real(
 		temp2 = xfs_bmap_worst_indlen(bma->ip, temp2);
 		diff = (int)(temp + temp2 - startblockval(PREV.br_startblock) -
 			(bma->cur ? bma->cur->bc_private.b.allocated : 0));
+
 		if (diff > 0) {
+			/*
+			 * XXX (dgc): Ouch! Pulling more blocks from the free pool
+			 * during allocation during delalloc split. This will
+			 * fail at ENOSPC and it screws up rmapbt space
+			 * accounting. We need to know when this happens so
+			 * we can isolate the typical causes of reservation
+			 * underruns so that they never happen in production.
+			 */
+			ASSERT(0);
 			error = xfs_mod_fdblocks(bma->ip->i_mount,
 						 -((int64_t)diff), false);
-			ASSERT(!error);
 			if (error)
 				goto done;
 		}
@@ -2261,8 +2270,7 @@ xfs_bmap_add_extent_delay_real(
 			temp += bma->cur->bc_private.b.allocated;
 		ASSERT(temp <= da_old);
 		if (temp < da_old)
-			xfs_mod_fdblocks(bma->ip->i_mount,
-					(int64_t)(da_old - temp), false);
+			bma->tp->t_blk_deferred += (int64_t)(da_old - temp);
 	}
 
 	/* clear out the allocated field, done with it now in any case. */
@@ -5437,7 +5445,7 @@ xfs_bmap_del_extent(
 	 */
 	ASSERT(da_old >= da_new);
 	if (da_old > da_new)
-		xfs_mod_fdblocks(mp, (int64_t)(da_old - da_new), false);
+		tp->t_blk_deferred += (int64_t)(da_old - da_new);
 done:
 	*logflagsp = flags;
 	return error;
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 70f42ea86dfb..0728ff7a04ab 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -103,6 +103,16 @@ xfs_trans_dup(
 	tp->t_rtx_res = tp->t_rtx_res_used;
 	ntp->t_pflags = tp->t_pflags;
 
+	/*
+	 * Transfer deferred block reservations to new transaction so they remain
+	 * available to the ongoing deferred ops processing. We clear the
+	 * existing transaction count so that the deferred block reservation is
+	 * not released when that transaction is committed (i.e. it's not a
+	 * regrantable reservation).
+	 */
+	ntp->t_blk_deferred = tp->t_blk_deferred;
+	tp->t_blk_deferred = 0;
+
 	xfs_trans_dup_dqinfo(tp, ntp);
 
 	atomic_inc(&tp->t_mountp->m_active_trans);
@@ -680,6 +690,21 @@ xfs_trans_unreserve_and_mod_sb(
 }
 
 /*
+ * Release unused deferred block reservations back to the global free space pool.
+ * These blocks came from the in-core counter, so return them there.
+ */
+static void
+xfs_trans_release_deferred_blocks(
+	struct xfs_trans	*tp)
+{
+	ASSERT(tp->t_blk_deferred >= 0);
+	if (!tp->t_blk_deferred)
+		return;
+	xfs_mod_fdblocks(tp->t_mountp, tp->t_blk_deferred,
+			 !!(tp->t_flags & XFS_TRANS_RESERVE));
+}
+
+/*
  * Add the given log item to the transaction's list of log items.
  *
  * The log item will now point to its new descriptor with its li_desc field.
@@ -908,6 +933,7 @@ __xfs_trans_commit(
 	/*
 	 * If we need to update the superblock, then do it now.
 	 */
+	xfs_trans_release_deferred_blocks(tp);
 	if (tp->t_flags & XFS_TRANS_SB_DIRTY)
 		xfs_trans_apply_sb_deltas(tp);
 	xfs_trans_apply_dquot_deltas(tp);
@@ -931,6 +957,7 @@ __xfs_trans_commit(
 	return error;
 
 out_unreserve:
+	xfs_trans_release_deferred_blocks(tp);
 	xfs_trans_unreserve_and_mod_sb(tp);
 
 	/*
@@ -991,6 +1018,7 @@ xfs_trans_cancel(
 			ASSERT(!(lidp->lid_item->li_type == XFS_LI_EFD));
 	}
 #endif
+	xfs_trans_release_deferred_blocks(tp);
 	xfs_trans_unreserve_and_mod_sb(tp);
 	xfs_trans_unreserve_and_mod_dquots(tp);
 
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index 61b7fbdd3ebd..6126e6fb9f5c 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -132,6 +132,7 @@ typedef struct xfs_trans {
 	int64_t			t_rblocks_delta;/* superblock rblocks change */
 	int64_t			t_rextents_delta;/* superblocks rextents chg */
 	int64_t			t_rextslog_delta;/* superblocks rextslog chg */
+	int64_t			t_blk_deferred;	/* blocks for deferred ops */
 	struct list_head	t_items;	/* log item descriptors */
 	struct list_head	t_busy;		/* list of busy extents */
 	unsigned long		t_pflags;	/* saved process flags state */
-- 
2.8.0.rc3


^ permalink raw reply related	[flat|nested] 13+ messages in thread
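
The reservation flow described in patch 4/4 can be traced with a small user-space model. It is a sketch only: struct pool, struct trans and the trans_*() helpers are hypothetical, and the single fdblocks counter stands in for the global free space pool that xfs_mod_fdblocks() adjusts.

```c
#include <assert.h>
#include <stdint.h>

struct pool {
	int64_t	fdblocks;	/* global free block counter */
};

struct trans {
	struct pool	*pool;
	int64_t		blk_deferred;	/* blocks kept back for deferred ops */
};

/* Keep an unused indlen reservation on the transaction. */
static void
trans_defer_blocks(struct trans *tp, int64_t nblocks)
{
	assert(nblocks >= 0);
	tp->blk_deferred += nblocks;
}

/* A deferred op consumed one reserved block. */
static void
trans_use_deferred_block(struct trans *tp)
{
	if (tp->blk_deferred)
		tp->blk_deferred--;
}

/* Final commit: whatever is left goes back to the global pool. */
static void
trans_commit(struct trans *tp)
{
	assert(tp->blk_deferred >= 0);
	tp->pool->fdblocks += tp->blk_deferred;
	tp->blk_deferred = 0;
}

/* Roll: hand the reservation to the new transaction, then commit the old. */
static struct trans
trans_roll(struct trans *tp)
{
	struct trans	ntp = { tp->pool, tp->blk_deferred };

	tp->blk_deferred = 0;
	trans_commit(tp);
	return ntp;
}
```

Rolling a transaction that carries five deferred blocks leaves the pool untouched and moves all five to the new transaction. If the deferred op then uses one of them, the final commit returns the remaining four.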

* Re: [PATCH 3/4] xfs: hold AGF buffers over defer ops
  2016-11-17 23:55         ` [PATCH 3/4] xfs: hold AGF buffers over defer ops Dave Chinner
@ 2016-11-18  0:53           ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2016-11-18  0:53 UTC (permalink / raw)
  To: linux-xfs

On Fri, Nov 18, 2016 at 10:55:34AM +1100, Dave Chinner wrote:
> +static void
> +xfs_defer_brelse(
> +	struct xfs_trans	*tp,
> +	struct xfs_defer_ops	*dop)
> +{
> +	int				i;
> +
> +	for (i = 0; i < XFS_DEFER_OPS_NR_BUFS; i++) {
> +		if (!tp)
> +			tp = dop->dop_bufs[i]->b_transp;
> +		if (dop->dop_bufs[i])
> +			xfs_trans_bhold_release(tp, dop->dop_bufs[i]);
> +	}

This lets the smoke out in generic/139. Will fix...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread
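
The loop quoted in the message above reads dop->dop_bufs[i]->b_transp before checking that the slot is non-NULL, so a call with tp == NULL and an empty slot dereferences a NULL pointer; that is presumably what generic/139 trips over. One plausible shape for a NULL-safe version is sketched below. It is a user-space model with hypothetical types and helpers, not the fix that was eventually posted, and it picks the transaction per buffer rather than once for the whole loop.

```c
#include <assert.h>
#include <stddef.h>

#define NR_HELD_BUFS	2

struct trans { int id; };

struct buf {
	struct trans	*transp;	/* owning transaction, or NULL */
	int		hold;
};

struct defer_ops {
	struct buf	*held[NR_HELD_BUFS];
};

/* A buffer attached to a transaction must be released through it. */
static void
buf_hold_release(struct trans *tp, struct buf *bp)
{
	assert(!bp->transp || bp->transp == tp);
	bp->hold = 0;
}

/* Release every held buffer; returns how many were released. */
static int
defer_brelse(struct trans *tp, struct defer_ops *dop)
{
	int	released = 0;
	int	i;

	for (i = 0; i < NR_HELD_BUFS; i++) {
		struct buf	*bp = dop->held[i];

		if (!bp)
			continue;	/* empty slot: nothing to dereference */
		/* No transaction passed in: fall back to the buffer's own. */
		buf_hold_release(tp ? tp : bp->transp, bp);
		dop->held[i] = NULL;
		released++;
	}
	return released;
}
```

Checking the slot first makes the cancel path safe when no buffer was ever joined, which is the common case for deferred ops that never touch the rmapbt.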

* Re: [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS
  2016-11-17 20:11   ` Darrick J. Wong
  2016-11-17 21:32     ` Dave Chinner
@ 2016-11-18  5:26     ` Eryu Guan
  2016-11-18  5:46       ` Dave Chinner
  1 sibling, 1 reply; 13+ messages in thread
From: Eryu Guan @ 2016-11-18  5:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Nov 17, 2016 at 12:11:02PM -0800, Darrick J. Wong wrote:
> On Thu, Nov 17, 2016 at 09:36:39AM -0800, Darrick J. Wong wrote:
> > On Fri, Nov 18, 2016 at 12:35:15AM +0800, Eryu Guan wrote:
> > > Hi all,
> > > 
> > > I hit a test hang in generic/224 when testing rmapbt enabled XFS on a
> > > host that has non-zero sunit/swidth reported from underlying device. And
> > > I simplified the reproducer to the following script, and the hang can be
> > > reproduced on any host now.
> > > 
> > > -----
> > > #!/bin/bash
> > > 
> > > dev=/dev/sda5
> > > mnt=/mnt/xfs
> > > 
> > > mkfs -t xfs -m rmapbt=1 -d agcount=8,size=1g -f $dev
> > 
> > Hm.  I formatted with:
> > mkfs.xfs -m rmapbt=1 -d sunit=4096,swidth=40960 -f /dev/sdf
> > 
> > (made up sunit numbers just to see how whacky it could get)
> > 
> > and got a different hang instead.  It looks like we are unable to
> > allocate any blocks to the bmbt and various things blow up from
> > there.  Will go retry with tracepoints on to see if we're running
> > out of AG reservation or if we're really out of disk blocks or what.
> > 
> > Crash message attached at the end.
> 
> Hm.  Looking at the indlen calculations, I see that we don't include the
> space that the rmapbt might need to store all the reverse mappings.  I
> think this is a problem, since we decline delalloc reservations if (len
> + indlen) > fdblocks, but we potentially end up using more than indlen
> blocks to map len blocks into the file, so the allocator goes nuts.
> 
> Eryu, does the following patch fix the problem you see?  I ran your
> reproducer and mine and it fixed the problem in both cases.  I didn't
> observe any issues running generic/224 either.

I applied your patch (and only your patch, patches posted by Dave were
not included) on top of 4.9-rc5 kernel, and it passed my simplified
reproducer, but still failed generic/224 with

MKFS_OPTIONS="-b size=4k -m crc=1,rmapbt=1 -d agcount=8"

Not all the time, but it is easy to hit. And sysrq-w showed the same traces
as before.

SECTION       -- xfs_test
RECREATING    -- xfs on /dev/mapper/testvg-testlv1
FSTYP         -- xfs (non-debug)
PLATFORM      -- Linux/x86_64 ibm-x3550m3-05 4.9.0-rc5+
MKFS_OPTIONS  -- -f -f -b size=4k -m crc=1,rmapbt=1 -d agcount=8 /dev/mapper/testvg-testlv2
MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/mapper/testvg-testlv2 /mnt/testarea/scratch

generic/224 16s ...  <===== never return

My local.config file:
[default]
TEST_DEV=/dev/mapper/testvg-testlv1
TEST_DIR=/mnt/testarea/test
SCRATCH_MNT=/mnt/testarea/scratch
SCRATCH_DEV=/dev/mapper/testvg-testlv2

[xfs_test]
FSTYP=xfs
MKFS_OPTIONS="-f -b size=4k -m crc=1,rmapbt=1 -d agcount=8"
# other unrelated configs follow

But this patch does make a difference. Prior to this patch, I saw
thousands of dd processes hang; now there are only one or two.

Thanks,
Eryu

> 
> --D
> 
> ---
> From: Darrick J. Wong <darrick.wong@oracle.com>
> Subject: [PATCH] xfs: factor rmap btree size into the indlen calculations
> 
> When we're estimating the amount of space it's going to take to satisfy
> a delalloc reservation, we need to include the space that we might need
> to grow the rmapbt.  This helps us to avoid running out of space later
> when _iomap_write_allocate needs more space than we reserved.  Eryu Guan
> observed this happening on generic/224 when sunit/swidth were set.
> 
> Reported-by: Eryu Guan <eguan@redhat.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c |   17 +++++++++++++++--
>  1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index b80a294..afedf96 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -49,6 +49,7 @@
>  #include "xfs_rmap.h"
>  #include "xfs_ag_resv.h"
>  #include "xfs_refcount.h"
> +#include "xfs_rmap_btree.h"
>  
>  
>  kmem_zone_t		*xfs_bmap_free_item_zone;
> @@ -190,8 +191,12 @@ xfs_bmap_worst_indlen(
>  	int		maxrecs;	/* maximum record count at this level */
>  	xfs_mount_t	*mp;		/* mount structure */
>  	xfs_filblks_t	rval;		/* return value */
> +	xfs_filblks_t   orig_len;
>  
>  	mp = ip->i_mount;
> +
> +	/* Calculate the worst-case size of the bmbt. */
> +	orig_len = len;
>  	maxrecs = mp->m_bmap_dmxr[0];
>  	for (level = 0, rval = 0;
>  	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK);
> @@ -199,12 +204,20 @@ xfs_bmap_worst_indlen(
>  		len += maxrecs - 1;
>  		do_div(len, maxrecs);
>  		rval += len;
> -		if (len == 1)
> -			return rval + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
> +		if (len == 1) {
> +			rval += XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
>  				level - 1;
> +			break;
> +		}
>  		if (level == 0)
>  			maxrecs = mp->m_bmap_dmxr[1];
>  	}
> +
> +	/* Calculate the worst-case size of the rmapbt. */
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
> +		rval += 1 + xfs_rmapbt_calc_size(mp, orig_len) +
> +				mp->m_rmap_maxlevels;
> +
>  	return rval;
>  }

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS
  2016-11-18  5:26     ` [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS Eryu Guan
@ 2016-11-18  5:46       ` Dave Chinner
  2016-11-18  6:52         ` Eryu Guan
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2016-11-18  5:46 UTC (permalink / raw)
  To: Eryu Guan; +Cc: Darrick J. Wong, linux-xfs

On Fri, Nov 18, 2016 at 01:26:33PM +0800, Eryu Guan wrote:
> On Thu, Nov 17, 2016 at 12:11:02PM -0800, Darrick J. Wong wrote:
> > On Thu, Nov 17, 2016 at 09:36:39AM -0800, Darrick J. Wong wrote:
> > > On Fri, Nov 18, 2016 at 12:35:15AM +0800, Eryu Guan wrote:
> > > > Hi all,
> > > > 
> > > > I hit a test hang in generic/224 when testing rmapbt enabled XFS on a
> > > > host that has non-zero sunit/swidth reported from underlying device. And
> > > > I simplified the reproducer to the following script, and the hang can be
> > > > reproduced on any host now.
> > > > 
> > > > -----
> > > > #!/bin/bash
> > > > 
> > > > dev=/dev/sda5
> > > > mnt=/mnt/xfs
> > > > 
> > > > mkfs -t xfs -m rmapbt=1 -d agcount=8,size=1g -f $dev
> > > 
> > > Hm.  I formatted with:
> > > mkfs.xfs -m rmapbt=1 -d sunit=4096,swidth=40960 -f /dev/sdf
> > > 
> > > (made up sunit numbers just to see how whacky it could get)
> > > 
> > > and got a different hang instead.  It looks like we are unable to
> > > allocate any blocks to the bmbt and various things blow up from
> > > there.  Will go retry with tracepoints on to see if we're running
> > > out of AG reservation or if we're really out of disk blocks or what.
> > > 
> > > Crash message attached at the end.
> > 
> > Hm.  Looking at the indlen calculations, I see that we don't include the
> > space that the rmapbt might need to store all the reverse mappings.  I
> > think this is a problem, since we decline delalloc reservations if (len
> > + indlen) > fdblocks, but we potentially end up using more than indlen
> > blocks to map len blocks into the file, so the allocator goes nuts.
> > 
> > Eryu, does the following patch fix the problem you see?  I ran your
> > reproducer and mine and it fixed the problem in both cases.  I didn't
> > observe any issues running generic/224 either.
> 
> I applied your patch (and only your patch, patches posted by Dave were
> not included) on top of 4.9-rc5 kernel, and it passed my simplified
> reproducer, but still failed generic/224 with
> 
> MKFS_OPTIONS="-b size=4k -m crc=1,rmapbt=1 -d agcount=8"
> 
> Not all the time, but it is easy to hit. And sysrq-w showed the same traces
> as before.
> 
> SECTION       -- xfs_test
> RECREATING    -- xfs on /dev/mapper/testvg-testlv1
> FSTYP         -- xfs (non-debug)
> PLATFORM      -- Linux/x86_64 ibm-x3550m3-05 4.9.0-rc5+
> MKFS_OPTIONS  -- -f -f -b size=4k -m crc=1,rmapbt=1 -d agcount=8 /dev/mapper/testvg-testlv2
> MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/mapper/testvg-testlv2 /mnt/testarea/scratch
> 
> generic/224 16s ...  <===== never return

My patchset does pass generic/224 here, but it fails lots of other tests
because of an accounting problem I've not yet found.

SECTION       -- xfs
FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 test2 4.9.0-rc4-dgc+
MKFS_OPTIONS  -- -f -m rmapbt=1 -i sparse=1 /dev/sdg
MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/sdg /mnt/scratch

generic/224 25s ... 24s
Ran: generic/224
Passed all 1 tests

SECTION       -- xfs
=========================
Ran: generic/224
Passed all 1 tests

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS
  2016-11-18  5:46       ` Dave Chinner
@ 2016-11-18  6:52         ` Eryu Guan
  0 siblings, 0 replies; 13+ messages in thread
From: Eryu Guan @ 2016-11-18  6:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Nov 18, 2016 at 04:46:20PM +1100, Dave Chinner wrote:
> On Fri, Nov 18, 2016 at 01:26:33PM +0800, Eryu Guan wrote:
> > On Thu, Nov 17, 2016 at 12:11:02PM -0800, Darrick J. Wong wrote:
> > > On Thu, Nov 17, 2016 at 09:36:39AM -0800, Darrick J. Wong wrote:
> > > > On Fri, Nov 18, 2016 at 12:35:15AM +0800, Eryu Guan wrote:
> > > > > Hi all,
> > > > > 
> > > > > I hit a test hang in generic/224 when testing rmapbt enabled XFS on a
> > > > > host that has non-zero sunit/swidth reported from underlying device. And
> > > > > I simplified the reproducer to the following script, and the hang can be
> > > > > reproduced on any host now.
> > > > > 
> > > > > -----
> > > > > #!/bin/bash
> > > > > 
> > > > > dev=/dev/sda5
> > > > > mnt=/mnt/xfs
> > > > > 
> > > > > mkfs -t xfs -m rmapbt=1 -d agcount=8,size=1g -f $dev
> > > > 
> > > > Hm.  I formatted with:
> > > > mkfs.xfs -m rmapbt=1 -d sunit=4096,swidth=40960 -f /dev/sdf
> > > > 
> > > > (made up sunit numbers just to see how whacky it could get)
> > > > 
> > > > and got a different hang instead.  It looks like we are unable to
> > > > allocate any blocks to the bmbt and various things blow up from
> > > > there.  Will go retry with tracepoints on to see if we're running
> > > > out of AG reservation or if we're really out of disk blocks or what.
> > > > 
> > > > Crash message attached at the end.
> > > 
> > > Hm.  Looking at the indlen calculations, I see that we don't include the
> > > space that the rmapbt might need to store all the reverse mappings.  I
> > > think this is a problem, since we decline delalloc reservations if (len
> > > + indlen) > fdblocks, but we potentially end up using more than indlen
> > > blocks to map len blocks into the file, so the allocator goes nuts.
> > > 
> > > Eryu, does the following patch fix the problem you see?  I ran your
> > > reproducer and mine and it fixed the problem in both cases.  I didn't
> > > observe any issues running generic/224 either.
> > 
> > I applied your patch (and only your patch, patches posted by Dave were
> > not included) on top of 4.9-rc5 kernel, and it passed my simplified
> > reproducer, but still failed generic/224 with
> > 
> > MKFS_OPTIONS="-b size=4k -m crc=1,rmapbt=1 -d agcount=8"
> > 
> > Not all the time, but it is easy to hit. And sysrq-w showed the same traces
> > as before.
> > 
> > SECTION       -- xfs_test
> > RECREATING    -- xfs on /dev/mapper/testvg-testlv1
> > FSTYP         -- xfs (non-debug)
> > PLATFORM      -- Linux/x86_64 ibm-x3550m3-05 4.9.0-rc5+
> > MKFS_OPTIONS  -- -f -f -b size=4k -m crc=1,rmapbt=1 -d agcount=8 /dev/mapper/testvg-testlv2
> > MOUNT_OPTIONS -- -o context=system_u:object_r:nfs_t:s0 /dev/mapper/testvg-testlv2 /mnt/testarea/scratch
> > 
> > generic/224 16s ...  <===== never return
> 
> My patchset does pass generic/224 here, but it fails lots of other tests
> because of an accounting problem I've not yet found.

I applied all four patches you posted on top of v4.9-rc5 this time, and
generic/224 still hung in my test.

> 
> SECTION       -- xfs
> FSTYP         -- xfs (debug)
> PLATFORM      -- Linux/x86_64 test2 4.9.0-rc4-dgc+
> MKFS_OPTIONS  -- -f -m rmapbt=1 -i sparse=1 /dev/sdg

Does appending "-d agcount=8" to MKFS_OPTIONS make any difference for
you? I cannot reproduce the hang either if I remove the agcount config.
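
A minimal sketch of that comparison, assuming an xfstests checkout whose
local.config already defines the test and scratch devices. The helper name
build_mkfs_opts is hypothetical, and the base options are copied from the
MKFS_OPTIONS line quoted above:

```shell
#!/bin/bash
# Hypothetical helper: print the mkfs option string, optionally pinning agcount.
build_mkfs_opts() {
    local base="-m rmapbt=1 -i sparse=1"
    if [ -n "$1" ]; then
        echo "$base -d agcount=$1"
    else
        echo "$base"
    fi
}

# Run generic/224 twice, without and with "-d agcount=8".
# Guarded so that nothing runs outside an xfstests checkout.
if [ -x ./check ]; then
    for agcount in "" 8; do
        export MKFS_OPTIONS="$(build_mkfs_opts "$agcount")"
        echo "MKFS_OPTIONS=$MKFS_OPTIONS"
        ./check generic/224
    done
fi
```

If the second run hangs, ./check will not return on its own; sysrq-w from
another terminal shows where the blocked task is waiting.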

Thanks,
Eryu

Thread overview: 13+ messages
2016-11-17 16:35 [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS Eryu Guan
2016-11-17 17:36 ` Darrick J. Wong
2016-11-17 20:11   ` Darrick J. Wong
2016-11-17 21:32     ` Dave Chinner
2016-11-17 23:55       ` [PATCH 0/4] xfs: fix rmapbt ENOSPC hangs Dave Chinner
2016-11-17 23:55         ` [PATCH 1/4] xfs: factor rmap btree size into the indlen calculations Dave Chinner
2016-11-17 23:55         ` [PATCH 2/4] xfs: add more AGF/AGFL manipulation tracepoints Dave Chinner
2016-11-17 23:55         ` [PATCH 3/4] xfs: hold AGF buffers over defer ops Dave Chinner
2016-11-18  0:53           ` Dave Chinner
2016-11-17 23:55         ` [PATCH 4/4] xfs: defer indirect delalloc rmap reservations Dave Chinner
2016-11-18  5:26     ` [BUG] dd doesn't return on ENOSPC and hang when fulfilling rmapbt XFS Eryu Guan
2016-11-18  5:46       ` Dave Chinner
2016-11-18  6:52         ` Eryu Guan
