* 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
@ 2017-05-21 21:47 Marc MERLIN
  2017-05-21 23:45 ` Marc MERLIN
  0 siblings, 1 reply; 13+ messages in thread
From: Marc MERLIN @ 2017-05-21 21:47 UTC (permalink / raw)
  To: linux-btrfs

gargamel:~# btrfs check --repair /dev/mapper/dshelf1
enabling repair mode
Checking filesystem on /dev/mapper/dshelf1
UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
checking extents

This causes a bunch of these:
btrfs-transacti: page allocation stalls for 23508ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
btrfs-transacti cpuset=/ mems_allowed=0

What's the recommended way out of this and which code is at fault? I can't tell if btrfs is doing memory allocations wrong, or if it's just being undermined by the block layer dying underneath.
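For reference, the mode=0x1400840 in the warning decodes to GFP_NOFS|__GFP_NOFAIL, as the kernel says. A quick sanity check against the flag bit values in include/linux/gfp.h circa v4.11 (bit values are kernel-version specific, so treat this as a sketch):

```python
# Flag bit values copied from include/linux/gfp.h around v4.11
# (these shift between kernel versions -- verify against your own tree).
GFP_BITS = {
    "__GFP_IO":             0x40,
    "__GFP_FS":             0x80,
    "__GFP_NOFAIL":         0x800,
    "__GFP_DIRECT_RECLAIM": 0x400000,
    "__GFP_KSWAPD_RECLAIM": 0x1000000,
}

mode = 0x1400840  # from the stall warning above
set_flags = [name for name, bit in GFP_BITS.items() if mode & bit]

# __GFP_DIRECT_RECLAIM|__GFP_KSWAPD_RECLAIM together are __GFP_RECLAIM,
# and __GFP_RECLAIM|__GFP_IO without __GFP_FS is exactly GFP_NOFS.
print(set_flags)
```

So these are order-0 allocations that may not fail and may not recurse into the filesystem, which is why they sit in the slowpath instead of giving up.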

And sadly, I'm also getting workqueue stalls like these:
[ 3996.047531] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 45s!
[ 3996.073512] BUG: workqueue lockup - pool cpus=3 node=0 flags=0x0 nice=0 stuck for 52s!
[ 3996.099466] Showing busy workqueues and worker pools:
[ 3996.116824] workqueue events: flags=0x0
[ 3996.130409]   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=3/256
[ 3996.150268]     in-flight: 9661:do_sync_work
[ 3996.165186]     pending: wait_rcu_exp_gp, cache_reap
[ 3996.182139]   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=8/256
[ 3996.202099]     in-flight: 9725:do_emergency_remount, 9738:do_poweroff
[ 3996.223543]     pending: drm_fb_helper_dirty_work, cache_reap, do_sync_work, vmstat_shepherd, update_writeback_rate [bcache], do_poweroff
[ 3996.263991] workqueue writeback: flags=0x4e
[ 3996.278116]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 3996.296586]     in-flight: 149:wb_workfn wb_workfn
[ 3996.312387] workqueue btrfs-endio-write: flags=0xe
[ 3996.328090]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/8
[ 3996.345794]     in-flight: 20326:btrfs_endio_write_helper, 2927:btrfs_endio_write_helper
[ 3996.371981] workqueue kcryptd: flags=0x2a
[ 3996.386019]   pwq 16: cpus=0-7 flags=0x4 nice=0 active=8/8
[ 3996.404325]     in-flight: 9950:kcryptd_crypt [dm_crypt], 8859:kcryptd_crypt [dm_crypt], 31087:kcryptd_crypt [dm_crypt], 2929:kcryptd_crypt [dm_crypt], 20328:kcryptd_crypt [dm_crypt], 5951:kcryptd_crypt [dm_crypt], 31084:kcryptd_crypt [dm_crypt], 7553:kcryptd_crypt [dm_crypt]
[ 3996.484333]     delayed: kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]
[ 3996.719697] , kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt], kcryptd_crypt [dm_crypt]

Problems started here:
[ 3624.349624] php5: page allocation stalls for 15028ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
[ 3624.382270] php5 cpuset=/ mems_allowed=0
[ 3624.395474] CPU: 1 PID: 9949 Comm: php5 Tainted: G     U          4.11.1-amd64-preempt-sysrq-20170406 #4
[ 3624.424907] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[ 3624.453233] Call Trace:
[ 3624.461292]  dump_stack+0x61/0x7d
[ 3624.472098]  warn_alloc+0xfc/0x18c
[ 3624.483150]  __alloc_pages_slowpath+0x3bc/0xb31
[ 3624.497528]  ? finish_wait+0x5a/0x63
[ 3624.509018]  __alloc_pages_nodemask+0x12c/0x1e0
[ 3624.523343]  alloc_pages_current+0x9b/0xbd
[ 3624.536346]  __page_cache_alloc+0x8e/0xa4
[ 3624.549067]  pagecache_get_page+0xc9/0x16b
[ 3624.562067]  alloc_extent_buffer+0xdf/0x305
[ 3624.575342]  read_tree_block+0x19/0x4e
[ 3624.587295]  read_block_for_search.isra.21+0x211/0x264
[ 3624.603420]  btrfs_search_slot+0x52b/0x72e
[ 3624.616387]  btrfs_lookup_csum+0x52/0xf7
[ 3624.628835]  __btrfs_lookup_bio_sums+0x23b/0x448
[ 3624.643396]  btrfs_lookup_bio_sums+0x16/0x18
[ 3624.656886]  btrfs_submit_bio_hook+0xcb/0x14a
[ 3624.670639]  submit_one_bio+0x60/0x7e
[ 3624.682308]  extent_readpages+0x1c8/0x1eb
[ 3624.695018]  ? insert_reserved_file_extent.constprop.41+0x28e/0x28e
[ 3624.714473]  btrfs_readpages+0x1f/0x21
[ 3624.726384]  __do_page_cache_readahead+0x1af/0x27b
[ 3624.741410]  ? pagecache_get_page+0x2b/0x16b
[ 3624.754869]  filemap_fault+0x17f/0x3a7
[ 3624.766775]  ? filemap_fault+0x17f/0x3a7
[ 3624.779193]  __do_fault+0x1b/0x67
[ 3624.789778]  __handle_mm_fault+0x67f/0xa6f
[ 3624.802699]  handle_mm_fault+0x126/0x1d9
[ 3624.815096]  __do_page_fault+0x31a/0x43c
[ 3624.827510]  do_page_fault+0x25/0x27
[ 3624.838865]  page_fault+0x28/0x30
[ 3624.849436] RIP: 0023:0x80a6d74
[ 3624.859479] RSP: 002b:00000000ffaae318 EFLAGS: 00210246
[ 3624.875761] RAX: 0000000008972dc0 RBX: 0000000008955000 RCX: 0000000000000341
[ 3624.897769] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000896f15c
[ 3624.919779] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
[ 3624.941768] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 3624.963784] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3624.985785] Mem-Info:
[ 3624.994026] active_anon:5274699 inactive_anon:378530 isolated_anon:3232
[ 3624.994026]  active_file:3711 inactive_file:3809 isolated_file:0
[ 3624.994026]  unevictable:1467 dirty:5068 writeback:49122 unstable:0
[ 3624.994026]  slab_reclaimable:8800 slab_unreclaimable:67134
[ 3624.994026]  mapped:558735 shmem:801466 pagetables:15777 bounce:0
[ 3624.994026]  free:89592 free_pcp:31 free_cma:0
[ 3625.104705] Node 0 active_anon:21098796kB inactive_anon:1513736kB active_file:14844kB inactive_file:15236kB unevictable:5868kB isolated(anon):13312kB isolated(file):0kB mapped:2234940kB dirty:20272kB writeback:196488kB shmem:3205864kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:43 all_unreclaimable? no
[ 3625.202085] Node 0 DMA free:15880kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3625.295661] lowmem_reserve[]: 0 3201 23768 23768 23768
[ 3625.312447] Node 0 DMA32 free:116696kB min:35424kB low:44280kB high:53136kB active_anon:3161376kB inactive_anon:8kB active_file:320kB inactive_file:332kB unevictable:0kB writepending:612kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:472kB slab_unreclaimable:680kB kernel_stack:16kB pagetables:7292kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3625.415160] lowmem_reserve[]: 0 0 20567 20567 20567
[ 3625.431182] Node 0 Normal free:225612kB min:226544kB low:283180kB high:339816kB active_anon:17937416kB inactive_anon:1513964kB active_file:14524kB inactive_file:14904kB unevictable:5868kB writepending:216424kB present:21485568kB managed:21080208kB mlocked:5868kB slab_reclaimable:34524kB slab_unreclaimable:268752kB kernel_stack:12480kB pagetables:55816kB bounce:0kB free_pcp:392kB local_pcp:124kB free_cma:0kB
[ 3625.546415] lowmem_reserve[]: 0 0 0 0 0
[ 3625.559648] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
[ 3625.601566] Node 0 DMA32: 768*4kB (UME) 743*8kB (UME) 686*16kB (UME) 446*32kB (UME) 426*64kB (UME) 233*128kB (UME) 79*256kB (UME) 10*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 116696kB
[ 3625.652772] Node 0 Normal: 26419*4kB (UME) 11249*8kB (UME) 918*16kB (UME) 252*32kB (UME) 74*64kB (UM) 15*128kB (UM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 225332kB
[ 3625.703427] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3625.729772] 858805 total pagecache pages
[ 3625.742561] 49207 pages in swap cache
[ 3625.754535] Swap cache stats: add 62005, delete 12831, find 75/76
[ 3625.773719] Free swap  = 15367668kB
[ 3625.785093] Total swap = 15616764kB
[ 3625.796457] 6215902 pages RAM
[ 3625.806278] 0 pages HighMem/MovableOnly
[ 3625.818722] 117753 pages reserved
[ 3625.829550] 4096 pages cma reserved
[ 3625.840989] 0 pages hwpoisoned
[ 3629.587421] btrfs-transacti: page allocation stalls for 23508ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
[ 3629.624070] btrfs-transacti cpuset=/ mems_allowed=0
[ 3629.639649] CPU: 3 PID: 957 Comm: btrfs-transacti Tainted: G     U          4.11.1-amd64-preempt-sysrq-20170406 #4
[ 3629.671670] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[ 3629.700024] Call Trace:
[ 3629.708371]  dump_stack+0x61/0x7d
[ 3629.719341]  warn_alloc+0xfc/0x18c
[ 3629.730544]  __alloc_pages_slowpath+0x3bc/0xb31
[ 3629.745094]  ? elv_set_request+0x41/0x54
[ 3629.757829]  ? search_bitmap+0x7e/0x12b
[ 3629.770305]  __alloc_pages_nodemask+0x12c/0x1e0
[ 3629.784862]  alloc_pages_current+0x9b/0xbd
[ 3629.798113]  __page_cache_alloc+0x8e/0xa4
[ 3629.811112]  pagecache_get_page+0xc9/0x16b
[ 3629.824343]  alloc_extent_buffer+0xdf/0x305
[ 3629.837796]  btrfs_find_create_tree_block+0xe/0x10
[ 3629.853062]  btrfs_alloc_tree_block+0x1c4/0x3ea
[ 3629.867523]  __btrfs_cow_block+0x109/0x4a9
[ 3629.880658]  btrfs_cow_block+0xe4/0x15b
[ 3629.893008]  btrfs_search_slot+0x2a2/0x72e
[ 3629.906129]  btrfs_insert_empty_items+0x5d/0xa6
[ 3629.920515]  alloc_reserved_file_extent+0x94/0x238
[ 3629.935697]  __btrfs_run_delayed_refs+0xae4/0xd15
[ 3629.950581]  btrfs_run_delayed_refs+0x7a/0x1cc
[ 3629.964653]  ? slab_pre_alloc_hook+0x1a/0x44
[ 3629.978178]  btrfs_commit_transaction+0x50/0x79a
[ 3629.992743]  ? start_transaction+0x2c2/0x3b4
[ 3630.006241]  transaction_kthread+0xe2/0x178
[ 3630.019459]  ? btrfs_cleanup_transaction+0x3e8/0x3e8
[ 3630.035005]  kthread+0xfb/0x100
[ 3630.045072]  ? init_completion+0x24/0x24
[ 3630.057482]  ret_from_fork+0x2c/0x40
[ 3630.068848] Mem-Info:
[ 3630.076319] active_anon:5274313 inactive_anon:378373 isolated_anon:3590
[ 3630.076319]  active_file:3711 inactive_file:3809 isolated_file:0
[ 3630.076319]  unevictable:1467 dirty:5068 writeback:49189 unstable:0
[ 3630.076319]  slab_reclaimable:8721 slab_unreclaimable:67310
[ 3630.076319]  mapped:556943 shmem:801313 pagetables:15777 bounce:0
[ 3630.076319]  free:89741 free_pcp:6 free_cma:0
[ 3630.188769] Node 0 active_anon:21097252kB inactive_anon:1513492kB active_file:14844kB inactive_file:15236kB unevictable:5868kB isolated(anon):14360kB isolated(file):0kB mapped:2227772kB dirty:20272kB writeback:196756kB shmem:3205252kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:215184 all_unreclaimable? no
[ 3630.286396] Node 0 DMA free:15880kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3630.378363] lowmem_reserve[]: 0 3201 23768 23768 23768
[ 3630.394566] Node 0 DMA32 free:116720kB min:35424kB low:44280kB high:53136kB active_anon:3161376kB inactive_anon:8kB active_file:320kB inactive_file:332kB unevictable:0kB writepending:612kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:460kB slab_unreclaimable:668kB kernel_stack:16kB pagetables:7292kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3630.494548] lowmem_reserve[]: 0 0 20567 20567 20567
[ 3630.510089] Node 0 Normal free:226664kB min:226544kB low:283180kB high:339816kB active_anon:17935552kB inactive_anon:1513564kB active_file:14524kB inactive_file:14904kB unevictable:5868kB writepending:216372kB present:21485568kB managed:21080208kB mlocked:5868kB slab_reclaimable:34412kB slab_unreclaimable:268520kB kernel_stack:12480kB pagetables:55816kB bounce:0kB free_pcp:148kB local_pcp:0kB free_cma:0kB
[ 3630.620626] lowmem_reserve[]: 0 0 0 0 0
[ 3630.632905] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
[ 3630.674463] Node 0 DMA32: 768*4kB (UME) 740*8kB (UME) 685*16kB (UME) 446*32kB (UME) 427*64kB (UME) 233*128kB (UME) 79*256kB (UME) 10*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 116720kB
[ 3630.725314] Node 0 Normal: 25803*4kB (UME) 11297*8kB (UME) 947*16kB (UME) 260*32kB (ME) 72*64kB (UM) 15*128kB (UM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 223844kB
[ 3630.773343] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3630.799583] 858720 total pagecache pages
[ 3630.812261] 49221 pages in swap cache
[ 3630.824146] Swap cache stats: add 62319, delete 13131, find 75/76
[ 3630.843322] Free swap  = 15366388kB
[ 3630.854712] Total swap = 15616764kB
[ 3630.866101] 6215902 pages RAM
[ 3630.875919] 0 pages HighMem/MovableOnly
[ 3630.888342] 117753 pages reserved
[ 3630.899212] 4096 pages cma reserved
[ 3630.910604] 0 pages hwpoisoned
[ 3637.300184] bash: page allocation stalls for 28224ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[ 3637.336505] bash cpuset=/ mems_allowed=0
[ 3637.349512] CPU: 2 PID: 9958 Comm: bash Tainted: G     U          4.11.1-amd64-preempt-sysrq-20170406 #4
[ 3637.378892] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[ 3637.407240] Call Trace:
[ 3637.415564]  dump_stack+0x61/0x7d
[ 3637.426501]  warn_alloc+0xfc/0x18c
[ 3637.437688]  __alloc_pages_slowpath+0x3bc/0xb31
[ 3637.452244]  ? __page_cache_alloc+0x8e/0xa4
[ 3637.465756]  __alloc_pages_nodemask+0x12c/0x1e0
[ 3637.480553]  alloc_pages_current+0x9b/0xbd
[ 3637.494051]  __page_cache_alloc+0x8e/0xa4
[ 3637.507288]  filemap_fault+0x298/0x3a7
[ 3637.519718]  __do_fault+0x1b/0x67
[ 3637.530838]  __handle_mm_fault+0x67f/0xa6f
[ 3637.544289]  handle_mm_fault+0x126/0x1d9
[ 3637.557153]  __do_page_fault+0x31a/0x43c
[ 3637.569996]  do_page_fault+0x25/0x27
[ 3637.581780]  page_fault+0x28/0x30
[ 3637.592746] RIP: 0023:0x8075060
[ 3637.603193] RSP: 002b:00000000ffcea1ac EFLAGS: 00010296
[ 3637.619854] RAX: 000000000923ee68 RBX: 000000000923ee68 RCX: 0000000000000000
[ 3637.642242] RDX: 000000000923eb48 RSI: 00000000000026f5 RDI: 0000000000000000
[ 3637.664599] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 3637.686949] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 3637.709252] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 3637.731566] Mem-Info:
[ 3637.739527] active_anon:5273186 inactive_anon:378405 isolated_anon:3721
[ 3637.739527]  active_file:3713 inactive_file:3807 isolated_file:0
[ 3637.739527]  unevictable:1467 dirty:5068 writeback:49193 unstable:0
[ 3637.739527]  slab_reclaimable:8704 slab_unreclaimable:67298
[ 3637.739527]  mapped:556047 shmem:800887 pagetables:15777 bounce:0
[ 3637.739527]  free:89862 free_pcp:14 free_cma:0
[ 3637.850642] Node 0 active_anon:21092744kB inactive_anon:1513620kB active_file:14852kB inactive_file:15228kB unevictable:5868kB isolated(anon):14884kB isolated(file):0kB mapped:2224188kB dirty:20272kB writeback:196772kB shmem:3203548kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:39497 all_unreclaimable? no
[ 3637.947503] Node 0 DMA free:15880kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3638.041435] lowmem_reserve[]: 0 3201 23768 23768 23768
[ 3638.058182] Node 0 DMA32 free:116768kB min:35424kB low:44280kB high:53136kB active_anon:3161376kB inactive_anon:8kB active_file:500kB inactive_file:152kB unevictable:0kB writepending:612kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:460kB slab_unreclaimable:652kB kernel_stack:16kB pagetables:7292kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[ 3638.159998] lowmem_reserve[]: 0 0 20567 20567 20567
[ 3638.175886] Node 0 Normal free:226800kB min:226544kB low:283180kB high:339816kB active_anon:17931660kB inactive_anon:1513324kB active_file:14352kB inactive_file:15076kB unevictable:5868kB writepending:216868kB present:21485568kB managed:21080208kB mlocked:5868kB slab_reclaimable:34356kB slab_unreclaimable:268532kB kernel_stack:12480kB pagetables:55816kB bounce:0kB free_pcp:56kB local_pcp:0kB free_cma:0kB
[ 3638.286889] lowmem_reserve[]: 0 0 0 0 0
[ 3638.299205] Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
[ 3638.339959] Node 0 DMA32: 773*4kB (UME) 739*8kB (UME) 684*16kB (UME) 446*32kB (UME) 426*64kB (UME) 234*128k
[ 3638.390376] Node 0 Normal: 26236*4kB (UME) 11398*8kB (UME) 993*16kB (UME) 286*32kB (UME) 70*64kB (M) 14*128kB (UM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 227696kB
[ 3638.438245] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[ 3638.464862] 858349 total pagecache pages
[ 3638.477801] 49221 pages in swap cache
[ 3638.489711] Swap cache stats: add 65079, delete 14676, find 75/76
[ 3638.489926] named: page allocation stalls for 31440ms, order:0, mode:0x14201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=(null)
[ 3638.489929] named cpuset=/ mems_allowed=0
[ 3638.489932] CPU: 1 PID: 6298 Comm: named Tainted: G     U          4.11.1-amd64-preempt-sysrq-20170406 #4
[ 3638.489932] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[ 3638.489933] Call Trace:
[ 3638.489938]  dump_stack+0x61/0x7d
[ 3638.489940]  warn_alloc+0xfc/0x18c
[ 3638.489942]  __alloc_pages_slowpath+0x3bc/0xb31
[ 3638.489944]  __alloc_pages_nodemask+0x12c/0x1e0
[ 3638.489946]  alloc_pages_current+0x9b/0xbd
[ 3638.489948]  __page_cache_alloc+0x8e/0xa4
[ 3638.489949]  filemap_fault+0x298/0x3a7
[ 3638.489952]  __do_fault+0x1b/0x67
[ 3638.489953]  __handle_mm_fault+0x67f/0xa6f
[ 3638.489955]  handle_mm_fault+0x126/0x1d9
[ 3638.489957]  __do_page_fault+0x31a/0x43c
[ 3638.489958]  do_page_fault+0x25/0x27
[ 3638.489960]  page_fault+0x28/0x30
[ 3638.489961] RIP: 0033:0x7fa929835b50
[ 3638.489962] RSP: 002b:00007fa920ee0de8 EFLAGS: 00010202
[ 3638.489963] RAX: 0000000000000208 RBX: 0000000000000208 RCX: 00007fa928c85fd3
[ 3638.489963] RDX: 0000000000000000 RSI: 00007fa92bc84010 RDI: 000055c467ef6580
[ 3638.489964] RBP: 0000000000000001 R08: 00007fa929eb47f8 R09: 00000000fffffffe
[ 3638.489964] R10: 00000000ffffffff R11: 0000000000000000 R12: 0000000000005140
[ 3638.489964] R13: 0000000000000000 R14: 0000000000000001 R15: 00007fa92bc83010
[ 3638.616669] rfx-xpl-static: page allocation stalls for 30476ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
[ 3638.616672] rfx-xpl-static cpuset=/ mems_allowed=0
[ 3638.616676] CPU: 2 PID: 2296 Comm: rfx-xpl-static Tainted: G     U          4.11.1-amd64-preempt-sysrq-20170406 #4
[ 3638.616676] Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
[ 3638.616677] Call Trace:
[ 3638.616684]  dump_stack+0x61/0x7d
[ 3638.616686]  warn_alloc+0xfc/0x18c
[ 3638.616687] php5:
[ 3638.616688]  __alloc_pages_slowpath+0x3bc/0xb31
[ 3638.616689] page allocation stalls for 29296ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=
[ 3638.616693]  ? alloc_request_size+0x55/0x6d
[ 3638.616693] (null)
[ 3638.616694]  __alloc_pages_nodemask+0x12c/0x1e0
[ 3638.616695] php5 cpuset=/
[ 3638.616697]  alloc_pages_current+0x9b/0xbd
[ 3638.616698]  mems_allowed=0
[ 3638.616700]  __page_cache_alloc+0x8e/0xa4
[ 3638.616701]  pagecache_get_page+0xc9/0x16b
[ 3638.616704]  alloc_extent_buffer+0xdf/0x305
[ 3638.616708]  read_tree_block+0x19/0x4e
[ 3638.616709]  read_block_for_search.isra.21+0x211/0x264
[ 3638.616711]  btrfs_search_slot+0x52b/0x72e
[ 3638.616713]  btrfs_lookup_file_extent+0x49/0x5f
[ 3638.616715]  btrfs_get_extent+0x17d/0x77d
[ 3638.616717]  ? submit_extent_page+0x14b/0x1c8
[ 3638.616719]  ? __do_readpage+0x4fb/0x58d
[ 3638.616720]  __do_readpage+0x259/0x58d
[ 3638.616722]  ? insert_reserved_file_extent.constprop.41+0x28e/0x28e
[ 3638.616724]  __do_contiguous_readpages.constprop.33+0xc8/0xec
[ 3638.616725]  ? insert_reserved_file_extent.constprop.41+0x28e/0x28e
[ 3638.616727]  __extent_readpages.constprop.32+0xea/0xfd
[ 3638.616729]  ? lru_cache_add+0xe/0x10
[ 3638.616731]  extent_readpages+0x139/0x1eb
[ 3638.616732]  ? insert_reserved_file_extent.constprop.41+0x28e/0x28e
[ 3638.616734]  btrfs_readpages+0x1f/0x21
[ 3638.616736]  __do_page_cache_readahead+0x1af/0x27b
[ 3638.616737]  ? pagecache_get_page+0x2b/0x16b
[ 3638.616739]  filemap_fault+0x17f/0x3a7
[ 3638.616740]  ? filemap_fault+0x17f/0x3a7
[ 3638.616742]  ? balance_dirty_pages_ratelimited+0x251/0x29b
[ 3638.616744]  __do_fault+0x1b/0x67
[ 3638.616746]  __handle_mm_fault+0x90b/0xa6f
[ 3638.616748]  handle_mm_fault+0x126/0x1d9
[ 3638.616751]  __do_page_fault+0x31a/0x43c
[ 3638.616752]  do_page_fault+0x25/0x27
[ 3638.616755]  page_fault+0x28/0x30
[ 3638.616756] RIP: 0023:0x812bbe8
[ 3638.616757] RSP: 002b:00000000ffe16f10 EFLAGS: 00010246
[ 3638.616758] RAX: 0000000000000012 RBX: 0000000008208170 RCX: 0000000009fdd4f8
[ 3638.616758] RDX: 0000000000000000 RSI: 00000000000037e0 RDI: 00000000f6cb3000
[ 3638.616759] RBP: 00000000ffe16f38 R08: 0000000000000000 R09: 0000000000000000
[ 3638.616759] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 3638.616760] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000


-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-21 21:47 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls Marc MERLIN
@ 2017-05-21 23:45 ` Marc MERLIN
  2017-05-22  1:35   ` Marc MERLIN
  0 siblings, 1 reply; 13+ messages in thread
From: Marc MERLIN @ 2017-05-21 23:45 UTC (permalink / raw)
  To: linux-btrfs

On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote:
> gargamel:~# btrfs check --repair /dev/mapper/dshelf1
> enabling repair mode
> Checking filesystem on /dev/mapper/dshelf1
> UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
> checking extents
> 
> This causes a bunch of these:
> btrfs-transacti: page allocation stalls for 23508ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
> btrfs-transacti cpuset=/ mems_allowed=0
> 
> What's the recommended way out of this and which code is at fault? I can't tell if btrfs is doing memory allocations wrong, or if it's just being undermined by the block layer dying underneath.

I went back to 4.8.10 and hit a similar problem.
It looks like btrfs check exercises the kernel in a way that grinds everything to a halt :(

Sadly, I tried a scrub on the same device, and it stalled after 6TB. The scrub process went zombie
and the scrub never succeeded, nor could it be stopped.

What do I try next? My filesystem seems ok when I use it, except for that
BUG() crash I reported a few days ago. I'm willing to believe there is some problem
with it somewhere, but if I can't scrub or check it, it's hard to look into it further.

[ 1090.912073] INFO: task kworker/dying:63 blocked for more than 120 seconds.
[ 1090.933850]       Tainted: G     U          4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[ 1090.959465] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1090.983973] kworker/dying   D ffff9a23424e3d68     0    63      2 0x00000000
[ 1091.006171]  ffff9a23424e3d68 00ff9a213ab32140 ffffffff8dc0d4c0 ffff9a23424dc100
[ 1091.029349]  ffff9a23424e3d50 ffff9a23424e4000 ffff9a234098d064 ffff9a23424dc100
[ 1091.052490]  ffff9a234098d068 00000000ffffffff ffff9a23424e3d80 ffffffff8d6cf1a6
[ 1091.075679] Call Trace:
[ 1091.083882]  [<ffffffff8d6cf1a6>] schedule+0x8b/0xa3
[ 1091.099532]  [<ffffffff8d6cf44b>] schedule_preempt_disabled+0x18/0x24
[ 1091.119518]  [<ffffffff8d6d0841>] __mutex_lock_slowpath+0xce/0x16d
[ 1091.138705]  [<ffffffff8d6d08f7>] mutex_lock+0x17/0x27
[ 1091.154772]  [<ffffffff8d6d08f7>] ? mutex_lock+0x17/0x27
[ 1091.171382]  [<ffffffff8d0da5c6>] acct_process+0x4e/0xe0
[ 1091.187974]  [<ffffffff8d07d401>] ? rescuer_thread+0x24f/0x2d1
[ 1091.206170]  [<ffffffff8d06a5c1>] do_exit+0x3ba/0x97b
[ 1091.222001]  [<ffffffff8d192a8c>] ? kfree+0x7a/0x99
[ 1091.237307]  [<ffffffff8d07d72e>] ? worker_thread+0x2ab/0x2ba
[ 1091.255219]  [<ffffffff8d07d483>] ? rescuer_thread+0x2d1/0x2d1
[ 1091.273390]  [<ffffffff8d081dfb>] kthread+0xbc/0xbc
[ 1091.288672]  [<ffffffff8d6d23df>] ret_from_fork+0x1f/0x40
[ 1091.305524]  [<ffffffff8d081d3f>] ? init_completion+0x24/0x24
[ 1091.323404] INFO: task kworker/u16:4:158 blocked for more than 120 seconds.
[ 1091.344956]       Tainted: G     U          4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[ 1091.370145] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1091.394299] kworker/u16:4   D ffff9a233e607b58     0   158      2 0x00000000
[ 1091.416260] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[ 1091.435259]  ffff9a233e607b58 00ffffff8d0899ae ffff9a21f0c76180 ffff9a233e6021c0
[ 1091.458328]  ffff9a233e607b40 ffff9a233e608000 7fffffffffffffff ffff9a233e6021c0
[ 1091.481385]  ffffffff8d6d1244 ffff9a2317491e68 ffff9a233e607b70 ffffffff8d6cf1a6
[ 1091.504472] Call Trace:
[ 1091.512751]  [<ffffffff8d6d1244>] ? usleep_range+0x65/0x65
[ 1091.530093]  [<ffffffff8d6cf1a6>] schedule+0x8b/0xa3
[ 1091.545833]  [<ffffffff8d6d1287>] schedule_timeout+0x43/0x126
[ 1091.563782]  [<ffffffff8d08a9d3>] ? wake_up_process+0x15/0x17
[ 1091.581707]  [<ffffffff8d6cfb45>] do_wait_for_common+0x123/0x15f
[ 1091.600403]  [<ffffffff8d6cfb45>] ? do_wait_for_common+0x123/0x15f
[ 1091.619625]  [<ffffffff8d08aa1c>] ? wake_up_q+0x47/0x47
[ 1091.635983]  [<ffffffff8d6cfc02>] wait_for_common+0x3b/0x55
[ 1091.653380]  [<ffffffff8d6cfc39>] wait_for_completion+0x1d/0x1f
[ 1091.671811]  [<ffffffff8d28d247>] btrfs_async_run_delayed_refs+0xd3/0xed
[ 1091.692598]  [<ffffffff8d2a5dbe>] __btrfs_end_transaction+0x2a7/0x2dd
[ 1091.712585]  [<ffffffff8d2a5e04>] btrfs_end_transaction+0x10/0x12
[ 1091.731529]  [<ffffffff8d2ad5fa>] btrfs_finish_ordered_io+0x3f7/0x4db
[ 1091.751495]  [<ffffffff8d2ad946>] finish_ordered_fn+0x15/0x17
[ 1091.769372]  [<ffffffff8d2cfd68>] btrfs_scrubparity_helper+0x10e/0x258
[ 1091.789590]  [<ffffffff8d2cff40>] btrfs_endio_write_helper+0xe/0x10
[ 1091.809014]  [<ffffffff8d07d068>] process_one_work+0x186/0x29d
[ 1091.827123]  [<ffffffff8d07d66d>] worker_thread+0x1ea/0x2ba
[ 1091.844438]  [<ffffffff8d07d483>] ? rescuer_thread+0x2d1/0x2d1
[ 1091.862521]  [<ffffffff8d081df3>] kthread+0xb4/0xbc
[ 1091.877718]  [<ffffffff8d6d23df>] ret_from_fork+0x1f/0x40
[ 1091.894476]  [<ffffffff8d081d3f>] ? init_completion+0x24/0x24
[ 1091.912276] INFO: task kworker/u16:5:159 blocked for more than 120 seconds.
[ 1091.933740]       Tainted: G     U          4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[ 1091.958847] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1091.982906] kworker/u16:5   D ffff9a233e60f9c0     0   159      2 0x00000000
[ 1092.004713] Workqueue: btrfs-extent-refs btrfs_extent_refs_helper
[ 1092.023611]  ffff9a233e60f9c0 0000000000000246 ffff9a2342fa41c0 ffff9a233e608200
[ 1092.046575]  ffff9a233e60f9a8 ffff9a233e610000 ffff9a213898bd88 ffff9a233052f800
[ 1092.069536]  0000000000000001 0000000000000001 ffff9a233e60f9d8 ffffffff8d6cf1a6
[ 1092.092523] Call Trace:
[ 1092.100496]  [<ffffffff8d6cf1a6>] schedule+0x8b/0xa3
[ 1092.115995]  [<ffffffff8d2d866d>] btrfs_tree_lock+0xd6/0x1fb
[ 1092.133574]  [<ffffffff8d09af6c>] ? wake_up_atomic_t+0x2c/0x2c
[ 1092.151696]  [<ffffffff8d2847ff>] btrfs_search_slot+0x684/0x6fb
[ 1092.170074]  [<ffffffff8d28a364>] lookup_inline_extent_backref+0xf3/0x45b
[ 1092.191060]  [<ffffffff8d190787>] ? cpu_cache_get.isra.44+0xd/0x1d
[ 1092.210225]  [<ffffffff8d0a12c5>] ? __raw_write_unlock+0x11/0x23
[ 1092.228873]  [<ffffffff8d28b994>] __btrfs_free_extent+0x100/0x9ac
[ 1092.247770]  [<ffffffff8d088000>] ? preempt_latency_start+0x34/0x5d
[ 1092.267183]  [<ffffffff8d2e860f>] ? btrfs_merge_delayed_refs+0x90/0x270
[ 1092.287638]  [<ffffffff8d2900fe>] __btrfs_run_delayed_refs+0xa02/0xd13
[ 1092.307818]  [<ffffffff8d190787>] ? cpu_cache_get.isra.44+0xd/0x1d
[ 1092.326947]  [<ffffffff8d291d92>] btrfs_run_delayed_refs+0x86/0x1dc
[ 1092.346331]  [<ffffffff8d291f42>] delayed_ref_async_start+0x5a/0x99
[ 1092.365732]  [<ffffffff8d2cfd68>] btrfs_scrubparity_helper+0x10e/0x258
[ 1092.385888]  [<ffffffff8d2cfef0>] btrfs_extent_refs_helper+0xe/0x10
[ 1092.405244]  [<ffffffff8d07d068>] process_one_work+0x186/0x29d
[ 1092.423312]  [<ffffffff8d07d66d>] worker_thread+0x1ea/0x2ba
[ 1092.440625]  [<ffffffff8d07d483>] ? rescuer_thread+0x2d1/0x2d1
[ 1092.458723]  [<ffffffff8d081df3>] kthread+0xb4/0xbc
[ 1092.473931]  [<ffffffff8d6d23df>] ret_from_fork+0x1f/0x40
[ 1092.490685]  [<ffffffff8d081d3f>] ? init_completion+0x24/0x24
[ 1092.508525] INFO: task kworker/u16:8:162 blocked for more than 120 seconds.
[ 1092.529991]       Tainted: G     U          4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[ 1092.555104] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1092.579194] kworker/u16:8   D ffff9a233e623b58     0   162      2 0x00000000
[ 1092.601033] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[ 1092.619956]  ffff9a233e623b58 00ffffff8d0899ae ffff9a2342f80080 ffff9a233e61a000
[ 1092.642934]  ffff9a233e623b40 ffff9a233e624000 7fffffffffffffff ffff9a233e61a000
[ 1092.665923]  ffffffff8d6d1244 ffff9a2326a0d628 ffff9a233e623b70 ffffffff8d6cf1a6
[ 1092.688901] Call Trace:
[ 1092.696864]  [<ffffffff8d6d1244>] ? usleep_range+0x65/0x65
[ 1092.713931]  [<ffffffff8d6cf1a6>] schedule+0x8b/0xa3
[ 1092.729451]  [<ffffffff8d6d1287>] schedule_timeout+0x43/0x126
[ 1092.747312]  [<ffffffff8d3b7e78>] ? debug_smp_processor_id+0x17/0x19
[ 1092.767021]  [<ffffffff8d08a9d3>] ? wake_up_process+0x15/0x17
[ 1092.784903]  [<ffffffff8d6cfb45>] do_wait_for_common+0x123/0x15f
[ 1092.803533]  [<ffffffff8d6cfb45>] ? do_wait_for_common+0x123/0x15f
[ 1092.822696]  [<ffffffff8d08aa1c>] ? wake_up_q+0x47/0x47
[ 1092.838989]  [<ffffffff8d6cfc02>] wait_for_common+0x3b/0x55
[ 1092.856310]  [<ffffffff8d6cfc39>] wait_for_completion+0x1d/0x1f
[ 1092.874683]  [<ffffffff8d28d247>] btrfs_async_run_delayed_refs+0xd3/0xed
[ 1092.895383]  [<ffffffff8d2a5dbe>] __btrfs_end_transaction+0x2a7/0x2dd
[ 1092.915269]  [<ffffffff8d2a5e04>] btrfs_end_transaction+0x10/0x12
[ 1092.934131]  [<ffffffff8d2ad5fa>] btrfs_finish_ordered_io+0x3f7/0x4db
[ 1092.954047]  [<ffffffff8d088000>] ? preempt_latency_start+0x34/0x5d
[ 1092.973440]  [<ffffffff8d2ad946>] finish_ordered_fn+0x15/0x17
[ 1092.991244]  [<ffffffff8d2cfd68>] btrfs_scrubparity_helper+0x10e/0x258
[ 1093.011384]  [<ffffffff8d07b8c1>] ? pwq_activate_delayed_work+0x4d/0x5b
[ 1093.031800]  [<ffffffff8d2cff40>] btrfs_endio_write_helper+0xe/0x10
[ 1093.051185]  [<ffffffff8d07d068>] process_one_work+0x186/0x29d
[ 1093.069265]  [<ffffffff8d07d66d>] worker_thread+0x1ea/0x2ba
[ 1093.086569]  [<ffffffff8d07d483>] ? rescuer_thread+0x2d1/0x2d1
[ 1093.104632]  [<ffffffff8d081df3>] kthread+0xb4/0xbc
[ 1093.119828]  [<ffffffff8d6d23df>] ret_from_fork+0x1f/0x40
[ 1093.136618]  [<ffffffff8d081d3f>] ? init_completion+0x24/0x24
[ 1093.154411] INFO: task kworker/u16:13:171 blocked for more than 120 seconds.
[ 1093.176141]       Tainted: G     U          4.8.10-amd64-preempt-sysrq-20161121vb3tj1 #12
[ 1093.201248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1093.225310] kworker/u16:13  D ffff9a233e653b58     0   171      2 0x00000000
[ 1093.247120] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[ 1093.266059]  ffff9a233e653b58 00ffffff8d0899ae ffff9a2342f8a100 ffff9a233e64a240
[ 1093.289050]  ffff9a233e653b40 ffff9a233e654000 7fffffffffffffff ffff9a233e64a240
[ 1093.312035]  ffffffff8d6d1244 ffff9a220b50ac28 ffff9a233e653b70 ffffffff8d6cf1a6
[ 1093.335029] Call Trace:
[ 1093.343008]  [<ffffffff8d6d1244>] ? usleep_range+0x65/0x65
[ 1093.360096]  [<ffffffff8d6cf1a6>] schedule+0x8b/0xa3
[ 1093.375601]  [<ffffffff8d6d1287>] schedule_timeout+0x43/0x126
[ 1093.393490]  [<ffffffff8d3b7e78>] ? debug_smp_processor_id+0x17/0x19
[ 1093.413203]  [<ffffffff8d08a9d3>] ? wake_up_process+0x15/0x17
[ 1093.431075]  [<ffffffff8d6cfb45>] do_wait_for_common+0x123/0x15f
[ 1093.449709]  [<ffffffff8d6cfb45>] ? do_wait_for_common+0x123/0x15f
[ 1093.468870]  [<ffffffff8d08aa1c>] ? wake_up_q+0x47/0x47
[ 1093.485154]  [<ffffffff8d6cfc02>] wait_for_common+0x3b/0x55
[ 1093.502461]  [<ffffffff8d6cfc39>] wait_for_completion+0x1d/0x1f
[ 1093.520811]  [<ffffffff8d28d247>] btrfs_async_run_delayed_refs+0xd3/0xed
[ 1093.541485]  [<ffffffff8d2a5dbe>] __btrfs_end_transaction+0x2a7/0x2dd
[ 1093.561398]  [<ffffffff8d2a5e04>] btrfs_end_transaction+0x10/0x12
[ 1093.580241]  [<ffffffff8d2ad5fa>] btrfs_finish_ordered_io+0x3f7/0x4db
[ 1093.600133]  [<ffffffff8d088000>] ? preempt_latency_start+0x34/0x5d
[ 1093.619502]  [<ffffffff8d2ad946>] finish_ordered_fn+0x15/0x17
[ 1093.637313]  [<ffffffff8d2cfd68>] btrfs_scrubparity_helper+0x10e/0x258
[ 1093.657467]  [<ffffffff8d07b8c1>] ? pwq_activate_delayed_work+0x4d/0x5b
[ 1093.677881]  [<ffffffff8d2cff40>] btrfs_endio_write_helper+0xe/0x10
[ 1093.697254]  [<ffffffff8d07d068>] process_one_work+0x186/0x29d
[ 1093.715324]  [<ffffffff8d07d66d>] worker_thread+0x1ea/0x2ba
[ 1093.732624]  [<ffffffff8d07d483>] ? rescuer_thread+0x2d1/0x2d1
[ 1093.750711]  [<ffffffff8d081df3>] kthread+0xb4/0xbc
[ 1093.765925]  [<ffffffff8d6d23df>] ret_from_fork+0x1f/0x40
[ 1093.782718]  [<ffffffff8d081d3f>] ? init_completion+0x24/0x24
[ 1093.800556] INFO: task kworker/u16:14:172 blocked for more than 120 seconds.
[ 1093.847428] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1093.871538] kworker/u16:14  D ffff9a233e65bb58     0   172      2 0x00000000
[ 1093.893373] Workqueue: btrfs-endio-write btrfs_endio_write_helper
[ 1093.912298]  ffff9a233e65bb58 00ffffff8d0899ae ffffffff8dc0d4c0 ffff9a233e654280
[ 1093.935281]  ffff9a233e65bb40 ffff9a233e65c000 7fffffffffffffff ffff9a233e654280
[ 1093.958274]  ffffffff8d6d1244 ffff9a2318e2e928 ffff9a233e65bb70 ffffffff8d6cf1a6
[ 1093.981288] Call Trace:
[ 1093.989347]  [<ffffffff8d6d1244>] ? usleep_range+0x65/0x65
[ 1094.006469]  [<ffffffff8d6cf1a6>] schedule+0x8b/0xa3
[ 1094.022076]  [<ffffffff8d6d1287>] schedule_timeout+0x43/0x126
[ 1094.039943]  [<ffffffff8d3b7e78>] ? debug_smp_processor_id+0x17/0x19
[ 1094.059628]  [<ffffffff8d08a9d3>] ? wake_up_process+0x15/0x17
[ 1094.077525]  [<ffffffff8d6cfb45>] do_wait_for_common+0x123/0x15f
[ 1094.096215]  [<ffffffff8d6cfb45>] ? do_wait_for_common+0x123/0x15f
[ 1094.115490]  [<ffffffff8d08aa1c>] ? wake_up_q+0x47/0x47
[ 1094.131787]  [<ffffffff8d6cfc02>] wait_for_common+0x3b/0x55
[ 1094.149116]  [<ffffffff8d6cfc39>] wait_for_completion+0x1d/0x1f
[ 1094.167457]  [<ffffffff8d28d247>] btrfs_async_run_delayed_refs+0xd3/0xed
[ 1094.188167]  [<ffffffff8d2a5dbe>] __btrfs_end_transaction+0x2a7/0x2dd
[ 1094.208087]  [<ffffffff8d2a5e04>] btrfs_end_transaction+0x10/0x12
[ 1094.226949]  [<ffffffff8d2ad5fa>] btrfs_finish_ordered_io+0x3f7/0x4db
[ 1094.246856]  [<ffffffff8d088000>] ? preempt_latency_start+0x34/0x5d
[ 1094.266223]  [<ffffffff8d2ad946>] finish_ordered_fn+0x15/0x17
[ 1094.284063]  [<ffffffff8d2cfd68>] btrfs_scrubparity_helper+0x10e/0x258
[ 1094.304354]  [<ffffffff8d07b8c1>] ? pwq_activate_delayed_work+0x4d/0x5b
[ 1094.324809]  [<ffffffff8d2cff40>] btrfs_endio_write_helper+0xe/0x10
[ 1094.344190]  [<ffffffff8d07d068>] process_one_work+0x186/0x29d
[ 1094.362283]  [<ffffffff8d07d66d>] worker_thread+0x1ea/0x2ba
[ 1094.379576]  [<ffffffff8d07d483>] ? rescuer_thread+0x2d1/0x2d1
[ 1094.397667]  [<ffffffff8d081df3>] kthread+0xb4/0xbc
[ 1094.412878]  [<ffffffff8d6d23df>] ret_from_fork+0x1f/0x40
[ 1094.429647]  [<ffffffff8d081d3f>] ? init_completion+0x24/0x24

-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-21 23:45 ` Marc MERLIN
@ 2017-05-22  1:35   ` Marc MERLIN
  2017-05-22  9:19     ` Duncan
  2017-05-22 16:31     ` Marc MERLIN
  0 siblings, 2 replies; 13+ messages in thread
From: Marc MERLIN @ 2017-05-22  1:35 UTC (permalink / raw)
  To: linux-btrfs

On Sun, May 21, 2017 at 04:45:57PM -0700, Marc MERLIN wrote:
> On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote:
> > gargamel:~# btrfs check --repair /dev/mapper/dshelf1
> > enabling repair mode
> > Checking filesystem on /dev/mapper/dshelf1
> > UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
> > checking extents
> > 
> > This causes a bunch of these:
> > btrfs-transacti: page allocation stalls for 23508ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
> > btrfs-transacti cpuset=/ mems_allowed=0
> > 
> > What's the recommended way out of this and which code is at fault? I can't tell if btrfs is doing memory allocations wrong, or if it's just being undermined by the block layer dying underneath.
> 
> I went back to 4.8.10, and similar problem.
> It looks like btrfs check exercises the kernel and causes everything to come down to a halt :(
> 
> Sadly, I tried a scrub on the same device, and it stalled after 6TB. The scrub process went zombie
> and the scrub never succeeded, nor could it be stopped.

So, setting aside the btrfs scrub stall issue, I didn't quite realize
that btrfs check memory issues actually caused the kernel to eat all the
memory until everything crashed/deadlocked/stalled.
Is that actually working as intended?
Why doesn't it fail and stop instead of taking my entire server down?
Clearly there must be a rule against a kernel subsystem taking all the
memory from everything until everything crashes/deadlocks, right?

So for now, I'm doing a lowmem check, but it's not going to be very
helpful since it cannot repair anything if it finds a problem.

At least my machine isn't crashing anymore, I suppose that's still an
improvement.
gargamel:~# btrfs check --mode=lowmem /dev/mapper/dshelf1
We'll see how many days it takes.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-22  1:35   ` Marc MERLIN
@ 2017-05-22  9:19     ` Duncan
  2017-05-23 17:15       ` Marc MERLIN
  2017-05-22 16:31     ` Marc MERLIN
  1 sibling, 1 reply; 13+ messages in thread
From: Duncan @ 2017-05-22  9:19 UTC (permalink / raw)
  To: linux-btrfs

Marc MERLIN posted on Sun, 21 May 2017 18:35:53 -0700 as excerpted:

> On Sun, May 21, 2017 at 04:45:57PM -0700, Marc MERLIN wrote:
>> On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote:
>> > gargamel:~# btrfs check --repair /dev/mapper/dshelf1 enabling repair
>> > mode Checking filesystem on /dev/mapper/dshelf1 UUID:
>> > 36f5079e-ca6c-4855-8639-ccb82695c18d checking extents
>> > 
>> > This causes a bunch of these:
>> > btrfs-transacti: page allocation stalls for 23508ms, order:0,
>> > mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
>> > btrfs-transacti cpuset=/ mems_allowed=0
>> > 
>> > What's the recommended way out of this and which code is at fault? I
>> > can't tell if btrfs is doing memory allocations wrong, or if it's
>> > just being undermined by the block layer dying underneath.
>> 
>> I went back to 4.8.10, and similar problem.
>> It looks like btrfs check exercises the kernel and causes everything to
>> come down to a halt :(

btrfs check is userspace, not kernelspace.  The btrfs-transacti threads 
are indeed kernelspace, but the problem would appear to be either IO or 
memory starvation triggered by the userspace check hogging all available 
resources, not leaving enough for normal system, including kernel, 
processes.

Check is /known/ to be memory intensive, with multi-TB filesystems often 
requiring tens of GiB of memory, and qgroups and snapshots are both known 
to dramatically intensify the scaling issues.  (btrfs balance, by 
contrast, has the same scaling issues, but is kernelspace.)

That's one reason why (not all of these may apply to your case) ...

* Keeping the number of snapshots as low as possible is strongly 
recommended by pretty much everyone here, definitely under 300 per 
subvolume and if possible, to double-digits per subvolume.

* I personally recommend disabling qgroups, unless you're actively 
working with the devs on improving them.  In addition to the scaling 
issues, quotas simply aren't reliable enough on btrfs yet to rely on them 
if the use-case requires them (in which case using a mature filesystem 
where they're proven to work is recommended), and if it doesn't, there's 
simply too many remaining issues for the qgroups option to be worth it.

* I personally recommend keeping overall filesystem size to something one 
can reasonably manage.  Most people's use-cases aren't going to allow for 
an fsck taking days and tens of GiB, but /will/ allow for multi-TB 
filesystems to be split out into multiple independent filesystems of 
perhaps a TB or two each, tops, if that's the alternative to multiple-day 
fscks taking tens of GiB.  (Some use-cases are of course exceptions.)

* The low-memory-mode btrfs check is being developed, tho unfortunately 
it doesn't yet do repairs.  (Another reason is that it's an alternate 
implementation that provides a very useful second opinion and the ability 
to cross-check one implementation against the other in hard problem 
cases.)

(The two "I personally recommend" points above aren't recommendations 
shared by everyone on the list, but obviously I've found them very useful 
here. =:^)
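The per-subvolume snapshot count mentioned above can be checked quickly from the command line. A minimal sketch: on a live filesystem you would run `btrfs check`'s sibling command `btrfs subvolume list -s <mountpoint> | wc -l` as root; below, the same pipeline is fed a hypothetical captured listing so the counting step itself is visible (real `btrfs subvolume list -s` output carries extra fields such as generation and otime).

```shell
# Count read-only snapshots on a mounted btrfs filesystem (as root):
#   btrfs subvolume list -s /mnt | wc -l
# Illustrated on a hypothetical captured listing:
printf '%s\n' \
  'ID 257 gen 100 top level 5 path backup/snap.2017-05-19' \
  'ID 258 gen 101 top level 5 path backup/snap.2017-05-20' \
  'ID 259 gen 102 top level 5 path backup/snap.2017-05-21' |
  wc -l
```

If the count per subvolume is in the hundreds, that alone can explain multi-day check times.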

>> Sadly, I tried a scrub on the same device, and it stalled after 6TB.
>> The scrub process went zombie and the scrub never succeeded, nor could
>> it be stopped.

Quite apart from the "... after 6TB" bit setting off my own "it's too big 
to reasonably manage" alarm, the filesystem obviously is bugged, and 
scrub as well, since it shouldn't just go zombie regardless of the 
problem -- it should fail much more gracefully.

Meanwhile, FWIW, unlike check, scrub /is/ kernelspace.

> So, setting aside the btrfs scrub stall issue, I didn't quite realize
> that btrfs check memory issues actually caused the kernel to eat all the
> memory until everything crashed/deadlocked/stalled.
> Is that actually working as intended?
> Why doesn't it fail and stop instead of taking my entire server down?
> Clearly there must be a rule against a kernel subsystem taking all the
> memory from everything until everything crashes/deadlocks, right?

As explained, check is userspace, but as you found, it can still 
interfere with kernelspace, including unrelated btrfs-transaction 
threads.  When the system's out of memory, it's out of memory.

Tho there is ongoing work into better predicting memory allocation needs 
for btrfs kernel threads and reserving memory space accordingly, so this 
sort of thing doesn't happen any more.

Of course it could also be some sort of (not necessarily directly btrfs) 
lockdep issue, and there's ongoing kernel-wide and btrfs work there as 
well.

> So for now, I'm doing a lowmem check, but it's not going to be very
> helpful since it cannot repair anything if it finds a problem.
> 
> At least my machine isn't crashing anymore, I suppose that's still an
> improvement.
> gargamel:~# btrfs check --mode=lowmem /dev/mapper/dshelf1 We'll see how
> many days it takes.

Agreed.  Lowmem mode looks like about your only option, beyond simply 
blowing it away, at this point.  Too bad it doesn't do repair yet, but 
with a bit of luck it should at least give you and the devs some idea 
what's wrong, information that can in turn be used to fix both scrub and 
normal check mode, as well as low-mem repair mode, once it's available.

Of course your "days" comment is triggering my "it's too big to maintain" 
reflex again, but obviously it's something you've found to be tolerable 
or possibly required in your use-case, so who am I to second-guess... 
maybe you have /files/ of multi-TB size, which of course kills the split 
the filesystem down to under that, idea. <shrug>

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-22  1:35   ` Marc MERLIN
  2017-05-22  9:19     ` Duncan
@ 2017-05-22 16:31     ` Marc MERLIN
  2017-05-22 23:26       ` Chris Murphy
  1 sibling, 1 reply; 13+ messages in thread
From: Marc MERLIN @ 2017-05-22 16:31 UTC (permalink / raw)
  To: linux-btrfs

On Sun, May 21, 2017 at 06:35:53PM -0700, Marc MERLIN wrote:
> On Sun, May 21, 2017 at 04:45:57PM -0700, Marc MERLIN wrote:
> > On Sun, May 21, 2017 at 02:47:33PM -0700, Marc MERLIN wrote:
> > > gargamel:~# btrfs check --repair /dev/mapper/dshelf1
> > > enabling repair mode
> > > Checking filesystem on /dev/mapper/dshelf1
> > > UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
> > > checking extents
> > > 
> > > This causes a bunch of these:
> > > btrfs-transacti: page allocation stalls for 23508ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
> > > btrfs-transacti cpuset=/ mems_allowed=0
> > > 
> > > What's the recommended way out of this and which code is at fault? I can't tell if btrfs is doing memory allocations wrong, or if it's just being undermined by the block layer dying underneath.
> > 
> > I went back to 4.8.10, and similar problem.
> > It looks like btrfs check exercises the kernel and causes everything to come down to a halt :(
> > 
> > Sadly, I tried a scrub on the same device, and it stalled after 6TB. The scrub process went zombie
> > and the scrub never succeeded, nor could it be stopped.
> 
> So, setting aside the btrfs scrub stall issue, I didn't quite realize
> that btrfs check memory issues actually caused the kernel to eat all the
> memory until everything crashed/deadlocked/stalled.
> Is that actually working as intended?
> Why doesn't it fail and stop instead of taking my entire server down?
> Clearly there must be a rule against a kernel subsystem taking all the
> memory from everything until everything crashes/deadlocks, right?
> 
> So for now, I'm doing a lowmem check, but it's not going to be very
> helpful since it cannot repair anything if it finds a problem.
> 
> At least my machine isn't crashing anymore, I suppose that's still an
> improvement.
> gargamel:~# btrfs check --mode=lowmem /dev/mapper/dshelf1
> We'll see how many days it takes.

Well, at least it's finding errors, but of course it can't fix them,
since lowmem mode doesn't have repair yet (yes, I know it's a WIP).

I already have 24GB of RAM in that machine, adding more for the real
fsck repair to run, is going to be difficult and nbd would take days I
guess (then again I don't have a machine with 32 or 48 or 64GB of RAM
anyway).

I'm guessing my next step is to delete a lot of data from that array
until its metadata use gets back below something that fits in RAM :-/
But hopefully check --repair can be fixed not to crash your machine if
it needs more RAM than is available.

 
Checking filesystem on /dev/mapper/dshelf1
UUID: 36f5079e-ca6c-4855-8639-ccb82695c18d
checking free space cache [.]
ERROR: root 53282 EXTENT_DATA[8244 4096] interrupt
ERROR: root 53282 EXTENT_DATA[50585 4096] interrupt
ERROR: root 53282 EXTENT_DATA[51096 4096] interrupt
ERROR: root 53282 EXTENT_DATA[182617 4096] interrupt
ERROR: root 53282 EXTENT_DATA[212972 4096] interrupt
ERROR: root 53282 EXTENT_DATA[260115 4096] interrupt
ERROR: root 53282 EXTENT_DATA[278370 4096] interrupt
ERROR: root 53282 EXTENT_DATA[323505 4096] interrupt
ERROR: root 53282 EXTENT_DATA[396923 4096] interrupt
ERROR: root 53282 EXTENT_DATA[419599 4096] interrupt
ERROR: root 53282 EXTENT_DATA[490602 4096] interrupt
ERROR: root 53282 EXTENT_DATA[555541 4096] interrupt
ERROR: root 53282 EXTENT_DATA[601942 4096] interrupt
ERROR: root 53282 EXTENT_DATA[682215 4096] interrupt
ERROR: root 53282 EXTENT_DATA[721729 4096] interrupt
ERROR: root 53282 EXTENT_DATA[916271 4096] interrupt
ERROR: root 53282 EXTENT_DATA[961074 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1118062 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1127879 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1142984 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1379975 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1398275 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1446265 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1459061 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1477900 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1477900 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1484265 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1509227 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1671096 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1692559 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1742832 4096] interrupt
ERROR: root 53282 EXTENT_DATA[1808649 4096] interrupt
ERROR: root 53292 EXTENT_DATA[57240 4096] interrupt
ERROR: root 53446 EXTENT_DATA[3554 4096] interrupt
ERROR: root 53446 EXTENT_DATA[64241 4096] interrupt
(...)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-22 16:31     ` Marc MERLIN
@ 2017-05-22 23:26       ` Chris Murphy
  2017-05-22 23:57         ` Marc MERLIN
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2017-05-22 23:26 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Btrfs BTRFS

On Mon, May 22, 2017 at 10:31 AM, Marc MERLIN <marc@merlins.org> wrote:

>
> I already have 24GB of RAM in that machine, adding more for the real
> fsck repair to run, is going to be difficult and nbd would take days I
> guess (then again I don't have a machine with 32 or 48 or 64GB of RAM
> anyway).

If you can acquire an SSD, you can give the system a bunch of swap,
and at least then hopefully the check repair can complete. Yes it'll
be slower than with real RAM but it's not nearly as bad as you might
think it'd be, based on HDD based swap.
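A minimal sketch of what that looks like in practice, assuming a spare SSD partition at the placeholder path /dev/sdX2. The script only prints the commands; run them as root after double-checking the device name, since mkswap wipes the partition.

```shell
#!/bin/sh
# Sketch: generate the commands to put swap on an SSD partition before
# retrying btrfs check --repair.  /dev/sdX2 is a placeholder device name;
# review the output, then run it as root (mkswap is destructive).
SSD_PART=${SSD_PART:-/dev/sdX2}

cat <<EOF
mkswap $SSD_PART
swapon -p 10 $SSD_PART
swapon --show
EOF
```

The `-p 10` gives the SSD swap higher priority than any existing (e.g. HDD-backed) swap, so it is used first.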



-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-22 23:26       ` Chris Murphy
@ 2017-05-22 23:57         ` Marc MERLIN
  2017-05-23  2:07           ` Chris Murphy
  0 siblings, 1 reply; 13+ messages in thread
From: Marc MERLIN @ 2017-05-22 23:57 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Btrfs BTRFS

On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:
> On Mon, May 22, 2017 at 10:31 AM, Marc MERLIN <marc@merlins.org> wrote:
> 
> >
> > I already have 24GB of RAM in that machine, adding more for the real
> > fsck repair to run, is going to be difficult and nbd would take days I
> > guess (then again I don't have a machine with 32 or 48 or 64GB of RAM
> > anyway).
> 
> If you can acquire an SSD, you can give the system a bunch of swap,
> and at least then hopefully the check repair can complete. Yes it'll
> be slower than with real RAM but it's not nearly as bad as you might
> think it'd be, based on HDD based swap.

Oh, swap will work, you're sure? 
I already have an SSD, if that's good enough, I can give it a shot.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-22 23:57         ` Marc MERLIN
@ 2017-05-23  2:07           ` Chris Murphy
  2017-05-23 11:21             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 13+ messages in thread
From: Chris Murphy @ 2017-05-23  2:07 UTC (permalink / raw)
  To: Marc MERLIN; +Cc: Chris Murphy, Btrfs BTRFS

On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN <marc@merlins.org> wrote:
> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:
>> On Mon, May 22, 2017 at 10:31 AM, Marc MERLIN <marc@merlins.org> wrote:
>>
>> >
>> > I already have 24GB of RAM in that machine, adding more for the real
>> > fsck repair to run, is going to be difficult and nbd would take days I
>> > guess (then again I don't have a machine with 32 or 48 or 64GB of RAM
>> > anyway).
>>
>> If you can acquire an SSD, you can give the system a bunch of swap,
>> and at least then hopefully the check repair can complete. Yes it'll
>> be slower than with real RAM but it's not nearly as bad as you might
>> think it'd be, based on HDD based swap.
>
> Oh, swap will work, you're sure?
> I already have an SSD, if that's good enough, I can give it a shot.

Yeah although I have no idea how much swap is needed for it to
succeed. I'm not sure what the relationship of fs metadata chunk
size to btrfs check RAM requirement is; but if it wants all of the
metadata in RAM, then whatever btrfs fi us shows you for metadata may
be a guide (?) for how much memory it's going to want.

Another possibility is zswap, which still requires a backing device,
but it might be able to limit how much swap to disk is needed if the
data to swap out is highly compressible. *shrug*

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-23  2:07           ` Chris Murphy
@ 2017-05-23 11:21             ` Austin S. Hemmelgarn
  2017-05-23 16:49               ` Marc MERLIN
  2017-05-23 18:32               ` Kai Krakow
  0 siblings, 2 replies; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2017-05-23 11:21 UTC (permalink / raw)
  To: Chris Murphy, Marc MERLIN; +Cc: Btrfs BTRFS

On 2017-05-22 22:07, Chris Murphy wrote:
> On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN <marc@merlins.org> wrote:
>> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:
>>> On Mon, May 22, 2017 at 10:31 AM, Marc MERLIN <marc@merlins.org> wrote:
>>>
>>>>
>>>> I already have 24GB of RAM in that machine, adding more for the real
>>>> fsck repair to run, is going to be difficult and nbd would take days I
>>>> guess (then again I don't have a machine with 32 or 48 or 64GB of RAM
>>>> anyway).
>>>
>>> If you can acquire an SSD, you can give the system a bunch of swap,
>>> and at least then hopefully the check repair can complete. Yes it'll
>>> be slower than with real RAM but it's not nearly as bad as you might
>>> think it'd be, based on HDD based swap.
>>
>> Oh, swap will work, you're sure?
>> I already have an SSD, if that's good enough, I can give it a shot.
>
> Yeah although I have no idea how much swap is needed for it to
> succeed. I'm not sure what the relationship of fs metadata chunk
> size to btrfs check RAM requirement is; but if it wants all of the
> metadata in RAM, then whatever btrfs fi us shows you for metadata may
> be a guide (?) for how much memory it's going to want.
I think the in-memory storage is a bit more space efficient than the 
on-disk storage, but I'm not certain, and I'm pretty sure it takes up 
more space when it's actually repairing things.  If I'm doing the math 
correctly, you _may_ need up to 50% _more_ than the total metadata size 
for the FS in virtual memory space.
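That rule of thumb can be turned into a quick back-of-the-envelope calculation. A sketch, assuming a hypothetical metadata "Used" figure of 58.21 GiB read off `btrfs filesystem usage`; the 1.5x factor is the guess above, not a documented requirement:

```shell
#!/bin/sh
# Estimate the virtual memory (RAM + swap) btrfs check may want, using the
# "metadata plus ~50% headroom" guess from this thread.  META_USED_GIB is a
# made-up sample value; read the real one from:
#   btrfs filesystem usage <mountpoint>
META_USED_GIB=${META_USED_GIB:-58.21}

awk -v m="$META_USED_GIB" 'BEGIN {
    need = m * 1.5                        # metadata plus 50% headroom
    printf "metadata used:      %.2f GiB\n", m
    printf "suggested RAM+swap: %.0f GiB\n", need
}'
```

With 24 GB of RAM and a metadata footprint anywhere near that sample figure, a large swap device is clearly needed for the non-lowmem check.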
>
> Another possibility is zswap, which still requires a backing device,
> but it might be able to limit how much swap to disk is needed if the
> data to swap out is highly compressible. *shrug*
>
zswap won't help in that respect, but it might make swapping stuff back 
in faster.  It just keeps a compressed copy in memory in parallel to 
writing the full copy out to disk, then uses that compressed copy to 
swap in instead of going to disk if the copy is still in memory (but it 
will discard the compressed copies if memory gets really low).  In 
essence, it reduces the impact of swapping when memory pressure is 
moderate (the situation for most desktops for example), but becomes 
almost useless when you have very high memory pressure (which is what 
describes this usage).
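For reference, zswap as discussed above is controlled through its module parameters; a sketch of the usual knobs (the lzo compressor and the 20% pool cap are illustrative values, not recommendations):

```shell
# Enable zswap at runtime (as root); a swap backing device must already exist.
echo 1   > /sys/module/zswap/parameters/enabled
echo lzo > /sys/module/zswap/parameters/compressor        # compression algorithm
echo 20  > /sys/module/zswap/parameters/max_pool_percent  # cap on RAM used for the pool
# Or enable it at boot via the kernel command line: zswap.enabled=1
```

Note the max_pool_percent cap is exactly why zswap helps less under sustained high memory pressure: once the pool fills, pages fall through to the backing swap device anyway.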

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-23 11:21             ` Austin S. Hemmelgarn
@ 2017-05-23 16:49               ` Marc MERLIN
  2017-05-23 18:32               ` Kai Krakow
  1 sibling, 0 replies; 13+ messages in thread
From: Marc MERLIN @ 2017-05-23 16:49 UTC (permalink / raw)
  To: Austin S. Hemmelgarn; +Cc: Chris Murphy, Btrfs BTRFS

On Tue, May 23, 2017 at 07:21:33AM -0400, Austin S. Hemmelgarn wrote:
> > Yeah although I have no idea how much swap is needed for it to
> > succeed. I'm not sure what the relationship is to fs metadata chunk
> > size to btrfs check RAM requirement is; but if it wants all of the
> > metadata in RAM, then whatever btrfs fi us shows you for metadata may
> > be a guide (?) for how much memory it's going to want.
>
> I think the in-memory storage is a bit more space efficient than the on-disk
> storage, but I'm not certain, and I'm pretty sure it takes up more space
> when it's actually repairing things.  If I'm doing the math correctly, you
> _may_ need up to 50% _more_ than the total metadata size for the FS in
> virtual memory space.

So I was able to rescue/fix my system by removing a bunch of temporary
data on it, which in turn freed up enough metadata for btrfs check to
work again.
The problems it found were minor, so they were fixed quickly.

I seem to have been the last person who edited 
https://btrfs.wiki.kernel.org/index.php/Btrfsck
and it's therefore way out of date :)

I propose the following:
1) One dev needs to confirm that, as long as you have enough swap, btrfs
check should succeed, and give some guideline of metadata size to swap
size. Then again, I think swap doesn't help; see below.


2) I still think there is an issue with either the OOM killer, or btrfs
check actually chewing up kernel RAM. I've never seen any Linux system
die in the spectacular ways mine did during that btrfs check if it were
only consuming userspace RAM.
I've filed a bug, because it looks bad:
https://bugzilla.kernel.org/show_bug.cgi?id=195863

Can someone read these dumps better than I can? Is it userspace RAM that is missing?
You said that swap would help, but in the dump below, I see:
Free swap  = 15366388kB
so my swap was unused and the system crashed due to OOM anyway.

btrfs-transacti: page allocation stalls for 23508ms, order:0, mode:0x1400840(GFP_NOFS|__GFP_NOFAIL), nodemask=(null)
btrfs-transacti cpuset=/ mems_allowed=0
Mem-Info:
active_anon:5274313 inactive_anon:378373 isolated_anon:3590
 active_file:3711 inactive_file:3809 isolated_file:0
 unevictable:1467 dirty:5068 writeback:49189 unstable:0
 slab_reclaimable:8721 slab_unreclaimable:67310
 mapped:556943 shmem:801313 pagetables:15777 bounce:0
 free:89741 free_pcp:6 free_cma:0
Node 0 active_anon:21097252kB inactive_anon:1513492kB active_file:14844kB inactive_file:15236kB unevictable:5868kB isolated(anon):14360kB isolated(file):0kB mapped:2227772kB dirty:20272kB writeback:196756kB shmem:3205252kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:215184 all_unreclaimable? no
Node 0 DMA free:15880kB min:168kB low:208kB high:248kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15972kB managed:15888kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:8kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 3201 23768 23768 23768
Node 0 DMA32 free:116720kB min:35424kB low:44280kB high:53136kB active_anon:3161376kB inactive_anon:8kB active_file:320kB inactive_file:332kB unevictable:0kB writepending:612kB present:3362068kB managed:3296500kB mlocked:0kB slab_reclaimable:460kB slab_unreclaimable:668kB kernel_stack:16kB pagetables:7292kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 20567 20567 20567
Node 0 Normal free:226664kB min:226544kB low:283180kB high:339816kB active_anon:17935552kB inactive_anon:1513564kB active_file:14524kB inactive_file:14904kB unevictable:5868kB writepending:216372kB present:21485568kB managed:21080208kB mlocked:5868kB slab_reclaimable:34412kB slab_unreclaimable:268520kB kernel_stack:12480kB pagetables:55816kB bounce:0kB free_pcp:148kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 0*4kB 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15880kB
Node 0 DMA32: 768*4kB (UME) 740*8kB (UME) 685*16kB (UME) 446*32kB (UME) 427*64kB (UME) 233*128kB (UME) 79*256kB (UME) 10*512kB (UME) 0*1024kB 0*2048kB 0*4096kB = 116720kB
Node 0 Normal: 25803*4kB (UME) 11297*8kB (UME) 947*16kB (UME) 260*32kB (ME) 72*64kB (UM) 15*128kB (UM) 1*256kB (U) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 223844kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
858720 total pagecache pages
49221 pages in swap cache
Swap cache stats: add 62319, delete 13131, find 75/76
Free swap  = 15366388kB
Total swap = 15616764kB
6215902 pages RAM
0 pages HighMem/MovableOnly
117753 pages reserved
4096 pages cma reserved


I'm also happy to modify the wiki to
1) mention that there is a lowmem mode, which in turn isn't really useful
for much yet since it won't repair even a trivial thing (I've seen patches
go around, but they're not upstream yet)

2) warn that, for now, check --repair of a big filesystem will crash your
system in bad ways if you are lacking physical RAM.
If that occurs, the only ways out are:
a) delete data/free snapshots, if the corruption is not bad enough to disallow it
b) move the filesystem to a machine with more RAM.
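If we add that warning, a rough pre-flight sketch might help people judge
whether they're about to hit it. This is only back-of-the-envelope: the ~1.5x
factor is a guess floated elsewhere in this thread, and METADATA_KB is a
placeholder you'd fill in from `btrfs filesystem usage`:

```shell
#!/bin/sh
# Back-of-the-envelope check: will `btrfs check --repair` fit in memory?
# METADATA_KB is a placeholder; on a real system it would come from
# `btrfs filesystem usage <mountpoint>`. The 1.5x multiplier is a rough
# guess, not a documented requirement.
METADATA_KB=$((40 * 1024 * 1024))   # e.g. 40GiB of metadata
NEEDED_KB=$((METADATA_KB * 3 / 2))  # ~1.5x headroom while repairing
AVAIL_KB=$(awk '/^(MemTotal|SwapTotal):/ {sum += $2} END {print sum}' /proc/meminfo)
echo "estimated need: ${NEEDED_KB}kB, RAM+swap: ${AVAIL_KB}kB"
if [ "$AVAIL_KB" -lt "$NEEDED_KB" ]; then
    echo "WARNING: check --repair may starve this machine; add swap or use a bigger box"
fi
```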

What do you think?

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-22  9:19     ` Duncan
@ 2017-05-23 17:15       ` Marc MERLIN
  0 siblings, 0 replies; 13+ messages in thread
From: Marc MERLIN @ 2017-05-23 17:15 UTC (permalink / raw)
  To: Duncan; +Cc: linux-btrfs

On Mon, May 22, 2017 at 09:19:34AM +0000, Duncan wrote:
> btrfs check is userspace, not kernelspace.  The btrfs-transacti threads 

That was my understanding, yes, but since I got it to starve my system,
including the in-kernel OOM issues I pasted in my last message and just
referenced in https://bugzilla.kernel.org/show_bug.cgi?id=195863, I think
it's not as black and white as a userland process that takes too much RAM
and gets killed when it does.

> are indeed kernelspace, but the problem would appear to be either IO or 
> memory starvation triggered by the userspace check hogging all available 
> resources, not leaving enough for normal system, including kernel, 
> processes.

Looks like it; not just IO, but memory too.

> * Keeping the number of snapshots as low as possible is strongly 
> recommended by pretty much everyone here, definitely under 300 per 
> subvolume and if possible, to double-digits per subvolume.

I agree that fewer snapshots is better, but between recovery snapshots
and btrfs snapshots for a number of subvolumes, things add up :)

gargamel:/mnt/btrfs_pool1# btrfs subvolume list . | wc -l
93
gargamel:/mnt/btrfs_pool2# btrfs subvolume list . | wc -l
103

> * I personally recommend disabling qgroups, unless you're actively 
> working with the devs on improving them.  In addition to the scaling 
> issues, quotas simply aren't reliable enough on btrfs yet to rely on them 
> if the use-case requires them (in which case using a mature filesystem 
> where they're proven to work is recommended), and if it doesn't, there's 
> simply too many remaining issues for the qgroups option to be worth it.
 
I had considered using them at some point to track the size of each
subvolume, but good to know they're still not quite ready yet.

> * I personally recommend keeping overall filesystem size to something one 
> can reasonably manage.  Most people's use-cases aren't going to allow for 
> an fsck taking days and tens of GiB, but /will/ allow for multi-TB 
> filesystems to be split out into multiple independent filesystems of 
> perhaps a TB or two each, tops, if that's the alternative to multiple-day 
> fscks taking tens of GiB.  (Some use-cases are of course exceptions.)

fsck ran in 6H with bcache, but the lowmem one could take a lot longer.
Running over nbd to another host with more RAM could indeed take days,
given the loss of bcache and the added latency/bandwidth of the
network.
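For the record, the nbd workflow I was considering looks roughly like this.
I'm writing it from memory, so treat the exact nbd-server/nbd-client
invocations as assumptions (syntax varies between versions); the commands are
left commented out since they need root and real hardware:

```shell
#!/bin/sh
# Sketch only; the nbd-server/nbd-client syntax below is assumed and
# varies between versions, so the commands are commented out rather than run.
#
# On gargamel (the machine with the disks), export the device:
#   nbd-server 10809 /dev/mapper/dshelf1
#
# On the machine with more RAM, attach it and run the check there:
#   modprobe nbd
#   nbd-client gargamel 10809 /dev/nbd0
#   btrfs check --repair /dev/nbd0
#
# Afterwards, detach cleanly before touching the device locally again:
#   nbd-client -d /dev/nbd0
echo "nbd sketch: see commented commands above"
```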

> * The low-memory-mode btrfs check is being developed, tho unfortunately 
> it doesn't yet do repairs.  (Another reason is that it's an alternate 
> implementation that provides a very useful second opinion and the ability 
> to cross-check one implementation against the other in hard problem 
> cases.)

True.

> >> Sadly, I tried a scrub on the same device, and it stalled after 6TB.
> >> The scrub process went zombie and the scrub never succeeded, nor could
> >> it be stopped.
> 
> Quite apart from the "... after 6TB" bit setting off my own "it's too big 
> to reasonably manage" alarm, the filesystem obviously is bugged, and 
> scrub as well, since it shouldn't just go zombie regardless of the 
> problem -- it should fail much more gracefully.
 
:)
In this case it's mostly big files, so it's fine metadata-wise but takes
a while to scrub (<24H though).

The problem I had is that I copied all of dshelf2 onto dshelf1 while I
blew away ds2 and rebuilt it. That extra metadata (many smaller files)
tipped the metadata size of ds1 over the edge.
Once I blew away that backup, things became ok again.

> Meanwhile, FWIW, unlike check, scrub /is/ kernelspace.

Correct, just like balance.

> As explained, check is userspace, but as you found, it can still 
> interfere with kernelspace, including unrelated btrfs-transaction 
> threads.  When the system's out of memory, it's out of memory.
 
Userspace should not be able to take the entire system down without the
OOM killer even firing.
Also, in the logs I just sent, it showed that none of my swap space had
been used. Why would that be?
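For anyone wanting to poke at the same question on their own box, these
read-only procfs reads show the knobs involved (nothing btrfs-specific, just
standard Linux VM accounting):

```shell
#!/bin/sh
# Read-only look at why swap may sit unused: the swappiness knob plus
# the current swap totals. Standard procfs, nothing btrfs-specific.
echo "vm.swappiness = $(cat /proc/sys/vm/swappiness)"
awk '/^(SwapTotal|SwapFree):/ {print}' /proc/meminfo
```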

> Tho there is ongoing work into better predicting memory allocation needs 
> for btrfs kernel threads and reserving memory space accordingly, so this 
> sort of thing doesn't happen any more.

That would be good.

> Agreed.  Lowmem mode looks like about your only option, beyond simply 
> blowing it away, at this point.  Too bad it doesn't do repair yet, but 

It's not an option, since it won't fix the small corruption issue I had.
Thankfully, deleting enough metadata allowed check --repair to run within
my RAM, and it has now fixed the problem.

> with a bit of luck it should at least give you and the devs some idea 
> what's wrong, information that can in turn be used to fix both scrub and 
> normal check mode, as well as low-mem repair mode, once it's available.

In this case, not useful information for the devs. It's a bad SAS card
that corrupted my data, not a bug in the kernel code.

> Of course your "days" comment is triggering my "it's too big to maintain" 
> reflex again, but obviously it's something you've found to be tolerable 

days would refer to either "lowmem" or "btrfs check over nbd" :)

Cheers,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems ....
                                      .... what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/                         | PGP 1024R/763BE901


* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-23 11:21             ` Austin S. Hemmelgarn
  2017-05-23 16:49               ` Marc MERLIN
@ 2017-05-23 18:32               ` Kai Krakow
  2017-05-24 11:57                 ` Austin S. Hemmelgarn
  1 sibling, 1 reply; 13+ messages in thread
From: Kai Krakow @ 2017-05-23 18:32 UTC (permalink / raw)
  To: linux-btrfs

On Tue, 23 May 2017 07:21:33 -0400,
"Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:

> On 2017-05-22 22:07, Chris Murphy wrote:
> > On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN <marc@merlins.org>
> > wrote:  
> >> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:  
>  [...]  
>  [...]  
>  [...]  
> >>
> >> Oh, swap will work, you're sure?
> >> I already have an SSD, if that's good enough, I can give it a
> >> shot.  
> >
> > Yeah although I have no idea how much swap is needed for it to
> > succeed. I'm not sure what the relationship is to fs metadata chunk
> > size to btrfs check RAM requirement is; but if it wants all of the
> > metadata in RAM, then whatever btrfs fi us shows you for metadata
> > may be a guide (?) for how much memory it's going to want.  
> I think the in-memory storage is a bit more space efficient than the 
> on-disk storage, but I'm not certain, and I'm pretty sure it takes up 
> more space when it's actually repairing things.  If I'm doing the
> math correctly, you _may_ need up to 50% _more_ than the total
> metadata size for the FS in virtual memory space.
> >
> > Another possibility is zswap, which still requires a backing device,
> > but it might be able to limit how much swap to disk is needed if the
> > data to swap out is highly compressible. *shrug*
> >  
> zswap won't help in that respect, but it might make swapping stuff
> back in faster.  It just keeps a compressed copy in memory in
> parallel to writing the full copy out to disk, then uses that
> compressed copy to swap in instead of going to disk if the copy is
> still in memory (but it will discard the compressed copies if memory
> gets really low).  In essence, it reduces the impact of swapping when
> memory pressure is moderate (the situation for most desktops for
> example), but becomes almost useless when you have very high memory
> pressure (which is what describes this usage).

Is this really how zswap works?

I always thought it acts as a compressed write-back cache in front of
the swap devices. Pages first go to zswap compressed, and later
write-back kicks in and migrates those compressed pages to real swap,
but still compressed. This is done by zswap putting two (or up to three
in modern kernels) compressed pages into one page. It has the downside
of uncompressing all "buddy pages" when only one is needed back in. But
it stays compressed. This also tells me zswap will either achieve
around 1:2 or 1:3 effective compression ratio or none. So it cannot be
compared to how streaming compression works.

OTOH, if the page is reloaded from cache before write-back kicks in, it
will never be written to swap but just uncompressed and discarded from
the cache.

Under high memory pressure it doesn't really work that well due to the
high CPU overhead when pages constantly swap out, compress, write, read,
uncompress, swap in... This usually results in very low CPU usage for
processes but high IO and disk wait and high kernel CPU usage. But it
defers memory pressure conditions a little while in exchange for a
little more IO usage and more CPU usage. If you have a lot of inactive
memory around, it can make a difference. But it is counterproductive
if almost all your memory is active and pressure is high.

So, in this scenario, it probably still doesn't help.
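For reference, zswap's knobs and counters can be inspected read-only like
this (the paths assume a kernel built with CONFIG_ZSWAP and, for the stats, a
mounted debugfs; the loops simply print nothing if they're absent):

```shell
#!/bin/sh
# Read-only inspection of zswap tunables and stats. Prints nothing when
# the kernel lacks CONFIG_ZSWAP or debugfs isn't mounted.
for f in /sys/module/zswap/parameters/*; do
    if [ -r "$f" ]; then printf '%s = %s\n' "${f##*/}" "$(cat "$f")"; fi
done
for f in /sys/kernel/debug/zswap/*; do
    if [ -r "$f" ]; then printf '%s = %s\n' "${f##*/}" "$(cat "$f" 2>/dev/null)"; fi
done
echo "zswap inspection done"
```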


-- 
Regards,
Kai

Replies to list-only preferred.



* Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls
  2017-05-23 18:32               ` Kai Krakow
@ 2017-05-24 11:57                 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 13+ messages in thread
From: Austin S. Hemmelgarn @ 2017-05-24 11:57 UTC (permalink / raw)
  To: Kai Krakow, linux-btrfs

On 2017-05-23 14:32, Kai Krakow wrote:
> On Tue, 23 May 2017 07:21:33 -0400,
> "Austin S. Hemmelgarn" <ahferroin7@gmail.com> wrote:
> 
>> On 2017-05-22 22:07, Chris Murphy wrote:
>>> On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN <marc@merlins.org>
>>> wrote:
>>>> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:
>>   [...]
>>   [...]
>>   [...]
>>>>
>>>> Oh, swap will work, you're sure?
>>>> I already have an SSD, if that's good enough, I can give it a
>>>> shot.
>>>
>>> Yeah although I have no idea how much swap is needed for it to
>>> succeed. I'm not sure what the relationship is to fs metadata chunk
>>> size to btrfs check RAM requirement is; but if it wants all of the
>>> metadata in RAM, then whatever btrfs fi us shows you for metadata
>>> may be a guide (?) for how much memory it's going to want.
>> I think the in-memory storage is a bit more space efficient than the
>> on-disk storage, but I'm not certain, and I'm pretty sure it takes up
>> more space when it's actually repairing things.  If I'm doing the
>> math correctly, you _may_ need up to 50% _more_ than the total
>> metadata size for the FS in virtual memory space.
>>>
>>> Another possibility is zswap, which still requires a backing device,
>>> but it might be able to limit how much swap to disk is needed if the
>>> data to swap out is highly compressible. *shrug*
>>>   
>> zswap won't help in that respect, but it might make swapping stuff
>> back in faster.  It just keeps a compressed copy in memory in
>> parallel to writing the full copy out to disk, then uses that
>> compressed copy to swap in instead of going to disk if the copy is
>> still in memory (but it will discard the compressed copies if memory
>> gets really low).  In essence, it reduces the impact of swapping when
>> memory pressure is moderate (the situation for most desktops for
>> example), but becomes almost useless when you have very high memory
>> pressure (which is what describes this usage).
> 
> Is this really how zswap works?
OK, looking at the documentation, you're correct, and my assumption 
based on the description of the front-end (frontswap) and how the other 
back-end (the Xen transcendent memory driver) appears to behave was 
wrong. However, given how zswap does behave, I can't see how it would 
ever be useful with the default kernel settings, since without manual 
configuration the kernel won't try to swap until memory pressure is 
pretty high, at which point zswap won't likely have much impact.
> 
> I always thought it acts as a compressed write-back cache in front of
> the swap devices. Pages first go to zswap compressed, and later
> write-back kicks in and migrates those compressed pages to real swap,
> but still compressed. This is done by zswap putting two (or up to three
> in modern kernels) compressed pages into one page. It has the downside
> of uncompressing all "buddy pages" when only one is needed back in. But
> it stays compressed. This also tells me zswap will either achieve
> around 1:2 or 1:3 effective compression ratio or none. So it cannot be
> compared to how streaming compression works.
> 
> OTOH, if the page is reloaded from cache before write-back kicks in, it
> will never be written to swap but just uncompressed and discarded from
> the cache.
> 
> Under high memory pressure it doesn't really work that well due to the
> high CPU overhead when pages constantly swap out, compress, write, read,
> uncompress, swap in... This usually results in very low CPU usage for
> processes but high IO and disk wait and high kernel CPU usage. But it
> defers memory pressure conditions a little while in exchange for a
> little more IO usage and more CPU usage. If you have a lot of inactive
> memory around, it can make a difference. But it is counterproductive
> if almost all your memory is active and pressure is high.
> 
> So, in this scenario, it probably still doesn't help.



end of thread, other threads:[~2017-05-24 11:57 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-21 21:47 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls Marc MERLIN
2017-05-21 23:45 ` Marc MERLIN
2017-05-22  1:35   ` Marc MERLIN
2017-05-22  9:19     ` Duncan
2017-05-23 17:15       ` Marc MERLIN
2017-05-22 16:31     ` Marc MERLIN
2017-05-22 23:26       ` Chris Murphy
2017-05-22 23:57         ` Marc MERLIN
2017-05-23  2:07           ` Chris Murphy
2017-05-23 11:21             ` Austin S. Hemmelgarn
2017-05-23 16:49               ` Marc MERLIN
2017-05-23 18:32               ` Kai Krakow
2017-05-24 11:57                 ` Austin S. Hemmelgarn
