All of lore.kernel.org
 help / color / mirror / Atom feed
* Writing a large file causes odd freeze
@ 2018-10-02 13:31 Gerard Saraber
  2018-10-02 15:24 ` joshua
  2018-10-03  0:22 ` Qu Wenruo
  0 siblings, 2 replies; 8+ messages in thread
From: Gerard Saraber @ 2018-10-02 13:31 UTC (permalink / raw)
  To: Btrfs BTRFS

I have a 25TB mirrored filesystem that I'm able to consistently freeze
by ungzipping a large file.
The filesystem scrubs complete without errors and smartd reports no
errors at the moment.

The command is:
gzip -dc /btrfsarray/largefile.gz > /btrfs/db/output.sql
At around 31GB it quits writing to the output file, and I start
getting the 'hung task' messages in my kernel log.
They are 3.0GB gzipped SQL dumps, that uncompress into about 42GB SQL
dump files.

after that 'ps auwx' completely hangs and I can't write to the
filesystem anymore, no errors, the processes just hang. I can still
read from the FS as far as I can tell.
I'm not sure how to diagnose it, or if i've simply ran into some kind
of bug (may not even be BTRFS related)

Linux cloud1 4.18.10 #1 SMP Fri Sep 28 07:44:06 CDT 2018 x86_64 x86_64
x86_64 GNU/Linux
[ I think It used to work on 4.18.0-rcX ]

Hardware specs:
 AMD Ryzen Threadripper 1950X 16-Core Processor
128GB ram
A bunch of 3-8TB SATA drives connected to the motherboard and a pair
of LSI cards

root@cloud1:~# btrfs fi usage /btrfsarray
Overall:
    Device size:                  79.14TiB
    Device allocated:             55.38TiB
    Device unallocated:           23.76TiB
    Device missing:                  0.00B
    Used:                         55.30TiB
    Free (estimated):             11.91TiB      (min: 11.91TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:27.61TiB, Used:27.58TiB
   /dev/sda        1.64TiB
   /dev/sdb        1.64TiB
   /dev/sdc        1.87TiB
   /dev/sdd        4.37TiB
   /dev/sde        2.55TiB
   /dev/sdf        2.55TiB
   /dev/sdg        4.37TiB
   /dev/sdh        4.36TiB
   /dev/sdi        2.55TiB
   /dev/sdj        4.36TiB
   /dev/sdk        4.37TiB
   /dev/sdl        1.64TiB
   /dev/sdm        4.37TiB
   /dev/sdn        2.54TiB
   /dev/sdo        2.55TiB
   /dev/sdp        2.55TiB
   /dev/sdq        2.55TiB
   /dev/sdr        4.37TiB

Metadata,RAID1: Size:87.00GiB, Used:77.46GiB
   /dev/sda        5.00GiB
   /dev/sdb        5.00GiB
   /dev/sdd       14.00GiB
   /dev/sde       12.00GiB
   /dev/sdf        8.00GiB
   /dev/sdg        6.00GiB
   /dev/sdh       20.00GiB
   /dev/sdi       11.00GiB
   /dev/sdj       19.00GiB
   /dev/sdk        5.00GiB
   /dev/sdl        6.00GiB
   /dev/sdm        7.00GiB
   /dev/sdn       17.00GiB
   /dev/sdo        7.00GiB
   /dev/sdp       13.00GiB
   /dev/sdq       12.00GiB
   /dev/sdr        7.00GiB

System,RAID1: Size:32.00MiB, Used:5.19MiB
   /dev/sdc       32.00MiB
   /dev/sdr       32.00MiB

Unallocated:
   /dev/sda        1.08TiB
   /dev/sdb        1.08TiB
   /dev/sdc        5.40TiB
   /dev/sdd        1.08TiB
   /dev/sde        1.08TiB
   /dev/sdf        1.08TiB
   /dev/sdg        1.08TiB
   /dev/sdh        1.08TiB
   /dev/sdi        1.08TiB
   /dev/sdj        1.08TiB
   /dev/sdk        1.08TiB
   /dev/sdl        1.08TiB
   /dev/sdm        1.08TiB
   /dev/sdn        1.08TiB
   /dev/sdo        1.08TiB
   /dev/sdp        1.08TiB
   /dev/sdq        1.08TiB
   /dev/sdr        1.08TiB


Oct  2 08:02:25 cloud1 kernel: [165537.793055] INFO: task smbd:47578
blocked for more than 120 seconds.
Oct  2 08:02:25 cloud1 kernel: [165537.793068]       Not tainted 4.18.10 #1
Oct  2 08:02:25 cloud1 kernel: [165537.793079] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  2 08:02:25 cloud1 kernel: [165537.793094] smbd            D    0
47578   3375 0x00000000
Oct  2 08:02:25 cloud1 kernel: [165537.793096] Call Trace:
Oct  2 08:02:25 cloud1 kernel: [165537.793098]  ? __schedule+0x299/0x880
Oct  2 08:02:25 cloud1 kernel: [165537.793100]  schedule+0x28/0x80
Oct  2 08:02:25 cloud1 kernel: [165537.793102]  wait_current_trans+0xad/0xe0
Oct  2 08:02:25 cloud1 kernel: [165537.793104]  ? wait_woken+0x80/0x80
Oct  2 08:02:25 cloud1 kernel: [165537.793106]  start_transaction+0x2ee/0x3c0
Oct  2 08:02:25 cloud1 kernel: [165537.793108]  btrfs_sync_file+0x279/0x400
Oct  2 08:02:25 cloud1 kernel: [165537.793110]
btrfs_file_write_iter+0x461/0x576
Oct  2 08:02:25 cloud1 kernel: [165537.793111]  __vfs_write+0x114/0x1a0
Oct  2 08:02:25 cloud1 kernel: [165537.793113]  vfs_write+0xad/0x1a0
Oct  2 08:02:25 cloud1 kernel: [165537.793115]  ksys_pwrite64+0x71/0x90
Oct  2 08:02:25 cloud1 kernel: [165537.793116]  ? __switch_to_asm+0x34/0x70
Oct  2 08:02:25 cloud1 kernel: [165537.793117]  do_syscall_64+0x4f/0x100
Oct  2 08:02:25 cloud1 kernel: [165537.793119]
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct  2 08:02:25 cloud1 kernel: [165537.793120] RIP: 0033:0x7f82af38104f
Oct  2 08:02:25 cloud1 kernel: [165537.793120] Code: Bad RIP value.
Oct  2 08:02:25 cloud1 kernel: [165537.793122] RSP:
002b:00007f8296b4fd70 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
Oct  2 08:02:25 cloud1 kernel: [165537.793124] RAX: ffffffffffffffda
RBX: 0000000000000009 RCX: 00007f82af38104f
Oct  2 08:02:25 cloud1 kernel: [165537.793124] RDX: 0000000000001000
RSI: 000055e06183fea0 RDI: 0000000000000009
Oct  2 08:02:25 cloud1 kernel: [165537.793125] RBP: 000055e06183fea0
R08: 0000000000000000 R09: 00007fffac636080
Oct  2 08:02:25 cloud1 kernel: [165537.793126] R10: 00000000089c8000
R11: 0000000000000293 R12: 0000000000001000
Oct  2 08:02:25 cloud1 kernel: [165537.793127] R13: 00000000089c8000
R14: 00007f82aa40e090 R15: 000055e0616b5f70


Oct  2 08:04:25 cloud1 kernel: [165658.622345] INFO: task
btrfs-transacti:12370 blocked for more than 120 seconds.
Oct  2 08:04:25 cloud1 kernel: [165658.622366]       Not tainted 4.18.10 #1
Oct  2 08:04:25 cloud1 kernel: [165658.622378] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  2 08:04:25 cloud1 kernel: [165658.622393] btrfs-transacti D    0
12370      2 0x80000000
Oct  2 08:04:25 cloud1 kernel: [165658.622396] Call Trace:
Oct  2 08:04:25 cloud1 kernel: [165658.622406]  ? __schedule+0x299/0x880
Oct  2 08:04:25 cloud1 kernel: [165658.622408]  ?
_raw_spin_unlock_irqrestore+0xa/0x10
Oct  2 08:04:25 cloud1 kernel: [165658.622410]  schedule+0x28/0x80
Oct  2 08:04:25 cloud1 kernel: [165658.622414]
btrfs_commit_transaction+0x760/0x870
Oct  2 08:04:25 cloud1 kernel: [165658.622418]  ? wait_woken+0x80/0x80
Oct  2 08:04:25 cloud1 kernel: [165658.622420]  transaction_kthread+0x12e/0x150
Oct  2 08:04:25 cloud1 kernel: [165658.622422]  ?
btrfs_cleanup_transaction+0x520/0x520
Oct  2 08:04:25 cloud1 kernel: [165658.622425]  kthread+0x113/0x130
Oct  2 08:04:25 cloud1 kernel: [165658.622427]  ?
kthread_create_worker_on_cpu+0x70/0x70
Oct  2 08:04:25 cloud1 kernel: [165658.622428]  ret_from_fork+0x22/0x40
Oct  2 08:04:25 cloud1 kernel: [165658.622487] INFO: task jsvc:15530
blocked for more than 120 seconds.
Oct  2 08:04:25 cloud1 kernel: [165658.622501]       Not tainted 4.18.10 #1
Oct  2 08:04:25 cloud1 kernel: [165658.622512] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  2 08:04:25 cloud1 kernel: [165658.622527] jsvc            D    0
15530  15312 0x00000120
Oct  2 08:04:25 cloud1 kernel: [165658.622529] Call Trace:
Oct  2 08:04:25 cloud1 kernel: [165658.622531]  ? __schedule+0x299/0x880
Oct  2 08:04:25 cloud1 kernel: [165658.622533]  schedule+0x28/0x80
Oct  2 08:04:25 cloud1 kernel: [165658.622536]
btrfs_start_ordered_extent+0xea/0x120
Oct  2 08:04:25 cloud1 kernel: [165658.622538]  ? wait_woken+0x80/0x80
Oct  2 08:04:25 cloud1 kernel: [165658.622540]  btrfs_page_mkwrite+0x1f3/0x500
Oct  2 08:04:25 cloud1 kernel: [165658.622544]  do_page_mkwrite+0x31/0x90
Oct  2 08:04:25 cloud1 kernel: [165658.622546]  do_wp_page+0x214/0x5a0
Oct  2 08:04:25 cloud1 kernel: [165658.622548]  __handle_mm_fault+0xb6a/0x1260
Oct  2 08:04:25 cloud1 kernel: [165658.622551]  ? __seccomp_filter+0x44/0x4c0
Oct  2 08:04:25 cloud1 kernel: [165658.622553]  handle_mm_fault+0xc6/0x200
Oct  2 08:04:25 cloud1 kernel: [165658.622556]  __do_page_fault+0x24c/0x4d0
Oct  2 08:04:25 cloud1 kernel: [165658.622558]  ? page_fault+0x8/0x30
Oct  2 08:04:25 cloud1 kernel: [165658.622559]  page_fault+0x1e/0x30
Oct  2 08:04:25 cloud1 kernel: [165658.622562] RIP: 0033:0x7f9b5aadb47e
Oct  2 08:04:25 cloud1 kernel: [165658.622563] Code: Bad RIP value.
Oct  2 08:04:25 cloud1 kernel: [165658.622569] RSP:
002b:00007f92dccced60 EFLAGS: 00010246
Oct  2 08:04:25 cloud1 kernel: [165658.622571] RAX: 00009606c50d609c
RBX: 00007f9b5c5a1c30 RCX: 0000000000000018
Oct  2 08:04:25 cloud1 kernel: [165658.622572] RDX: 0000000000000000
RSI: 00007f92dccced30 RDI: 0000000000000001
Oct  2 08:04:25 cloud1 kernel: [165658.622572] RBP: 00007f92dccced70
R08: 001f9c207abaa2b8 R09: 00007fff9f29c080
Oct  2 08:04:25 cloud1 kernel: [165658.622573] R10: 00000000048732ba
R11: 0000000000000001 R12: 0000000000000028
Oct  2 08:04:25 cloud1 kernel: [165658.622574] R13: 0000561d92af6950
R14: 0000000000000002 R15: 0000000000000001
Oct  2 08:04:25 cloud1 kernel: [165658.622613] INFO: task mongod:23344
blocked for more than 120 seconds.
Oct  2 08:04:25 cloud1 kernel: [165658.622628]       Not tainted 4.18.10 #1
Oct  2 08:04:25 cloud1 kernel: [165658.622638] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  2 08:04:25 cloud1 kernel: [165658.622653] mongod          D    0
23344  15560 0x00000120
Oct  2 08:04:25 cloud1 kernel: [165658.622655] Call Trace:
Oct  2 08:04:25 cloud1 kernel: [165658.622657]  ? __schedule+0x299/0x880
Oct  2 08:04:25 cloud1 kernel: [165658.622659]  schedule+0x28/0x80
Oct  2 08:04:25 cloud1 kernel: [165658.622661]
btrfs_start_ordered_extent+0xea/0x120
Oct  2 08:04:25 cloud1 kernel: [165658.622663]  ? wait_woken+0x80/0x80
Oct  2 08:04:25 cloud1 kernel: [165658.622665]
btrfs_wait_ordered_range+0xbb/0x100
Oct  2 08:04:25 cloud1 kernel: [165658.622667]  btrfs_sync_file+0x30c/0x400
Oct  2 08:04:25 cloud1 kernel: [165658.622671]  do_fsync+0x38/0x60
Oct  2 08:04:25 cloud1 kernel: [165658.622673]  __x64_sys_fdatasync+0x13/0x20
Oct  2 08:04:25 cloud1 kernel: [165658.622675]  do_syscall_64+0x4f/0x100
Oct  2 08:04:25 cloud1 kernel: [165658.622677]
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct  2 08:04:25 cloud1 kernel: [165658.622678] RIP: 0033:0x7f3d36c122e7
Oct  2 08:04:25 cloud1 kernel: [165658.622679] Code: Bad RIP value.
Oct  2 08:04:25 cloud1 kernel: [165658.622681] RSP:
002b:00007f3d31aef560 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
Oct  2 08:04:25 cloud1 kernel: [165658.622683] RAX: ffffffffffffffda
RBX: 000000000000000c RCX: 00007f3d36c122e7
Oct  2 08:04:25 cloud1 kernel: [165658.622683] RDX: 0000000000000000
RSI: 000000000000000c RDI: 000000000000000c
Oct  2 08:04:25 cloud1 kernel: [165658.622684] RBP: 00007f3d31aef5a0
R08: 00005633b3ab9c00 R09: 00007ffeef7a3080
Oct  2 08:04:25 cloud1 kernel: [165658.622685] R10: 0000000004873546
R11: 0000000000000293 R12: 00005633b3ab9cc0
Oct  2 08:04:25 cloud1 kernel: [165658.622686] R13: 00005633b1075d90
R14: 0000000000000000 R15: 00005633b39a20f8
Oct  2 08:04:25 cloud1 kernel: [165658.622688] INFO: task ftdc:23347
blocked for more than 120 seconds.
Oct  2 08:04:25 cloud1 kernel: [165658.622702]       Not tainted 4.18.10 #1
Oct  2 08:04:25 cloud1 kernel: [165658.622713] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct  2 08:04:25 cloud1 kernel: [165658.622728] ftdc            D    0
23347  15560 0x00000120
Oct  2 08:04:25 cloud1 kernel: [165658.622729] Call Trace:
Oct  2 08:04:25 cloud1 kernel: [165658.622731]  ? __schedule+0x299/0x880
Oct  2 08:04:25 cloud1 kernel: [165658.622733]  schedule+0x28/0x80
Oct  2 08:04:25 cloud1 kernel: [165658.622735]  wait_current_trans+0xad/0xe0
Oct  2 08:04:25 cloud1 kernel: [165658.622737]  ? wait_woken+0x80/0x80
Oct  2 08:04:25 cloud1 kernel: [165658.622738]  start_transaction+0x2ee/0x3c0
Oct  2 08:04:25 cloud1 kernel: [165658.622740]  btrfs_create+0x57/0x1f0
Oct  2 08:04:25 cloud1 kernel: [165658.622743]  path_openat+0x13c1/0x16a0
Oct  2 08:04:25 cloud1 kernel: [165658.622746]  do_filp_open+0x9b/0x110
Oct  2 08:04:25 cloud1 kernel: [165658.622748]  ? try_lookup_one_len+0x70/0x70
Oct  2 08:04:25 cloud1 kernel: [165658.622751]  ? close_pdeo+0x93/0xf0
Oct  2 08:04:25 cloud1 kernel: [165658.622753]  ? __check_object_size+0xa7/0x1a0
Oct  2 08:04:25 cloud1 kernel: [165658.622756]  ? __alloc_fd+0x3d/0x160
Oct  2 08:04:25 cloud1 kernel: [165658.622758]  ? do_sys_open+0x1bd/0x250
Oct  2 08:04:25 cloud1 kernel: [165658.622760]  do_sys_open+0x1bd/0x250
Oct  2 08:04:25 cloud1 kernel: [165658.622762]  do_syscall_64+0x4f/0x100
Oct  2 08:04:25 cloud1 kernel: [165658.622763]
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Oct  2 08:04:25 cloud1 kernel: [165658.622764] RIP: 0033:0x7f3d36c0ad19
Oct  2 08:04:25 cloud1 kernel: [165658.622765] Code: Bad RIP value.
Oct  2 08:04:25 cloud1 kernel: [165658.622767] RSP:
002b:00007f3d302ec200 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
Oct  2 08:04:25 cloud1 kernel: [165658.622769] RAX: ffffffffffffffda
RBX: 00005633b7197d80 RCX: 00007f3d36c0ad19
Oct  2 08:04:25 cloud1 kernel: [165658.622769] RDX: 0000000000000241
RSI: 00005633b7106740 RDI: 00000000ffffff9c
Oct  2 08:04:25 cloud1 kernel: [165658.622770] RBP: 00005633b1077b13
R08: 0000000000000000 R09: 00005633b1077b14
Oct  2 08:04:25 cloud1 kernel: [165658.622771] R10: 00000000000001b6
R11: 0000000000000293 R12: 0000000000000004
Oct  2 08:04:25 cloud1 kernel: [165658.622771] R13: 00005633b1077b12
R14: 00007f3d302ec720 R15: 00005633b9942004

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Writing a large file causes odd freeze
  2018-10-02 13:31 Writing a large file causes odd freeze Gerard Saraber
@ 2018-10-02 15:24 ` joshua
  2018-10-02 15:50   ` Gerard Saraber
  2018-10-03  0:22 ` Qu Wenruo
  1 sibling, 1 reply; 8+ messages in thread
From: joshua @ 2018-10-02 15:24 UTC (permalink / raw)
  To: Gerard Saraber, Btrfs BTRFS

> I have a 25TB mirrored filesystem that I'm able to consistently freeze
> by ungzipping a large file.
> The filesystem scrubs complete without errors and smartd reports no
> errors at the moment.
> 
> The command is:
> gzip -dc /btrfsarray/largefile.gz > /btrfs/db/output.sql
> At around 31GB it quits writing to the output file, and I start
> getting the 'hung task' messages in my kernel log.
> They are 3.0GB gzipped SQL dumps, that uncompress into about 42GB SQL
> dump files.
> 
> after that 'ps auwx' completely hangs and I can't write to the
> filesystem anymore, no errors, the processes just hang. I can still
> read from the FS as far as I can tell.
> I'm not sure how to diagnose it, or if i've simply ran into some kind
> of bug (may not even be BTRFS related)
> 
> Linux cloud1 4.18.10 #1 SMP Fri Sep 28 07:44:06 CDT 2018 x86_64 x86_64
> x86_64 GNU/Linux
> [ I think It used to work on 4.18.0-rcX ]
> 
> Hardware specs:
> AMD Ryzen Threadripper 1950X 16-Core Processor
> 128GB ram
> A bunch of 3-8TB SATA drives connected to the motherboard and a pair
> of LSI cards
> 
> root@cloud1:~# btrfs fi usage /btrfsarray
> Overall:
> Device size: 79.14TiB
> Device allocated: 55.38TiB
> Device unallocated: 23.76TiB
> Device missing: 0.00B
> Used: 55.30TiB
> Free (estimated): 11.91TiB (min: 11.91TiB)
> Data ratio: 2.00
> Metadata ratio: 2.00
> Global reserve: 512.00MiB (used: 0.00B)
> 
> Data,RAID1: Size:27.61TiB, Used:27.58TiB
> /dev/sda 1.64TiB
> /dev/sdb 1.64TiB
> /dev/sdc 1.87TiB
> /dev/sdd 4.37TiB
> /dev/sde 2.55TiB
> /dev/sdf 2.55TiB
> /dev/sdg 4.37TiB
> /dev/sdh 4.36TiB
> /dev/sdi 2.55TiB
> /dev/sdj 4.36TiB
> /dev/sdk 4.37TiB
> /dev/sdl 1.64TiB
> /dev/sdm 4.37TiB
> /dev/sdn 2.54TiB
> /dev/sdo 2.55TiB
> /dev/sdp 2.55TiB
> /dev/sdq 2.55TiB
> /dev/sdr 4.37TiB
> 
> Metadata,RAID1: Size:87.00GiB, Used:77.46GiB
> /dev/sda 5.00GiB
> /dev/sdb 5.00GiB
> /dev/sdd 14.00GiB
> /dev/sde 12.00GiB
> /dev/sdf 8.00GiB
> /dev/sdg 6.00GiB
> /dev/sdh 20.00GiB
> /dev/sdi 11.00GiB
> /dev/sdj 19.00GiB
> /dev/sdk 5.00GiB
> /dev/sdl 6.00GiB
> /dev/sdm 7.00GiB
> /dev/sdn 17.00GiB
> /dev/sdo 7.00GiB
> /dev/sdp 13.00GiB
> /dev/sdq 12.00GiB
> /dev/sdr 7.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:5.19MiB
> /dev/sdc 32.00MiB
> /dev/sdr 32.00MiB
> 
> Unallocated:
> /dev/sda 1.08TiB
> /dev/sdb 1.08TiB
> /dev/sdc 5.40TiB
> /dev/sdd 1.08TiB
> /dev/sde 1.08TiB
> /dev/sdf 1.08TiB
> /dev/sdg 1.08TiB
> /dev/sdh 1.08TiB
> /dev/sdi 1.08TiB
> /dev/sdj 1.08TiB
> /dev/sdk 1.08TiB
> /dev/sdl 1.08TiB
> /dev/sdm 1.08TiB
> /dev/sdn 1.08TiB
> /dev/sdo 1.08TiB
> /dev/sdp 1.08TiB
> /dev/sdq 1.08TiB
> /dev/sdr 1.08TiB
> 
> Oct 2 08:02:25 cloud1 kernel: [165537.793055] INFO: task smbd:47578
> blocked for more than 120 seconds.
> Oct 2 08:02:25 cloud1 kernel: [165537.793068] Not tainted 4.18.10 #1
> Oct 2 08:02:25 cloud1 kernel: [165537.793079] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 2 08:02:25 cloud1 kernel: [165537.793094] smbd D 0
> 47578 3375 0x00000000
> Oct 2 08:02:25 cloud1 kernel: [165537.793096] Call Trace:
> Oct 2 08:02:25 cloud1 kernel: [165537.793098] ? __schedule+0x299/0x880
> Oct 2 08:02:25 cloud1 kernel: [165537.793100] schedule+0x28/0x80
> Oct 2 08:02:25 cloud1 kernel: [165537.793102] wait_current_trans+0xad/0xe0
> Oct 2 08:02:25 cloud1 kernel: [165537.793104] ? wait_woken+0x80/0x80
> Oct 2 08:02:25 cloud1 kernel: [165537.793106] start_transaction+0x2ee/0x3c0
> Oct 2 08:02:25 cloud1 kernel: [165537.793108] btrfs_sync_file+0x279/0x400
> Oct 2 08:02:25 cloud1 kernel: [165537.793110]
> btrfs_file_write_iter+0x461/0x576
> Oct 2 08:02:25 cloud1 kernel: [165537.793111] __vfs_write+0x114/0x1a0
> Oct 2 08:02:25 cloud1 kernel: [165537.793113] vfs_write+0xad/0x1a0
> Oct 2 08:02:25 cloud1 kernel: [165537.793115] ksys_pwrite64+0x71/0x90
> Oct 2 08:02:25 cloud1 kernel: [165537.793116] ? __switch_to_asm+0x34/0x70
> Oct 2 08:02:25 cloud1 kernel: [165537.793117] do_syscall_64+0x4f/0x100
> Oct 2 08:02:25 cloud1 kernel: [165537.793119]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Oct 2 08:02:25 cloud1 kernel: [165537.793120] RIP: 0033:0x7f82af38104f
> Oct 2 08:02:25 cloud1 kernel: [165537.793120] Code: Bad RIP value.
> Oct 2 08:02:25 cloud1 kernel: [165537.793122] RSP:
> 002b:00007f8296b4fd70 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> Oct 2 08:02:25 cloud1 kernel: [165537.793124] RAX: ffffffffffffffda
> RBX: 0000000000000009 RCX: 00007f82af38104f
> Oct 2 08:02:25 cloud1 kernel: [165537.793124] RDX: 0000000000001000
> RSI: 000055e06183fea0 RDI: 0000000000000009
> Oct 2 08:02:25 cloud1 kernel: [165537.793125] RBP: 000055e06183fea0
> R08: 0000000000000000 R09: 00007fffac636080
> Oct 2 08:02:25 cloud1 kernel: [165537.793126] R10: 00000000089c8000
> R11: 0000000000000293 R12: 0000000000001000
> Oct 2 08:02:25 cloud1 kernel: [165537.793127] R13: 00000000089c8000
> R14: 00007f82aa40e090 R15: 000055e0616b5f70
> 
> Oct 2 08:04:25 cloud1 kernel: [165658.622345] INFO: task
> btrfs-transacti:12370 blocked for more than 120 seconds.
> Oct 2 08:04:25 cloud1 kernel: [165658.622366] Not tainted 4.18.10 #1
> Oct 2 08:04:25 cloud1 kernel: [165658.622378] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 2 08:04:25 cloud1 kernel: [165658.622393] btrfs-transacti D 0
> 12370 2 0x80000000
> Oct 2 08:04:25 cloud1 kernel: [165658.622396] Call Trace:
> Oct 2 08:04:25 cloud1 kernel: [165658.622406] ? __schedule+0x299/0x880
> Oct 2 08:04:25 cloud1 kernel: [165658.622408] ?
> _raw_spin_unlock_irqrestore+0xa/0x10
> Oct 2 08:04:25 cloud1 kernel: [165658.622410] schedule+0x28/0x80
> Oct 2 08:04:25 cloud1 kernel: [165658.622414]
> btrfs_commit_transaction+0x760/0x870
> Oct 2 08:04:25 cloud1 kernel: [165658.622418] ? wait_woken+0x80/0x80
> Oct 2 08:04:25 cloud1 kernel: [165658.622420] transaction_kthread+0x12e/0x150
> Oct 2 08:04:25 cloud1 kernel: [165658.622422] ?
> btrfs_cleanup_transaction+0x520/0x520
> Oct 2 08:04:25 cloud1 kernel: [165658.622425] kthread+0x113/0x130
> Oct 2 08:04:25 cloud1 kernel: [165658.622427] ?
> kthread_create_worker_on_cpu+0x70/0x70
> Oct 2 08:04:25 cloud1 kernel: [165658.622428] ret_from_fork+0x22/0x40
> Oct 2 08:04:25 cloud1 kernel: [165658.622487] INFO: task jsvc:15530
> blocked for more than 120 seconds.
> Oct 2 08:04:25 cloud1 kernel: [165658.622501] Not tainted 4.18.10 #1
> Oct 2 08:04:25 cloud1 kernel: [165658.622512] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 2 08:04:25 cloud1 kernel: [165658.622527] jsvc D 0
> 15530 15312 0x00000120
> Oct 2 08:04:25 cloud1 kernel: [165658.622529] Call Trace:
> Oct 2 08:04:25 cloud1 kernel: [165658.622531] ? __schedule+0x299/0x880
> Oct 2 08:04:25 cloud1 kernel: [165658.622533] schedule+0x28/0x80
> Oct 2 08:04:25 cloud1 kernel: [165658.622536]
> btrfs_start_ordered_extent+0xea/0x120
> Oct 2 08:04:25 cloud1 kernel: [165658.622538] ? wait_woken+0x80/0x80
> Oct 2 08:04:25 cloud1 kernel: [165658.622540] btrfs_page_mkwrite+0x1f3/0x500
> Oct 2 08:04:25 cloud1 kernel: [165658.622544] do_page_mkwrite+0x31/0x90
> Oct 2 08:04:25 cloud1 kernel: [165658.622546] do_wp_page+0x214/0x5a0
> Oct 2 08:04:25 cloud1 kernel: [165658.622548] __handle_mm_fault+0xb6a/0x1260
> Oct 2 08:04:25 cloud1 kernel: [165658.622551] ? __seccomp_filter+0x44/0x4c0
> Oct 2 08:04:25 cloud1 kernel: [165658.622553] handle_mm_fault+0xc6/0x200
> Oct 2 08:04:25 cloud1 kernel: [165658.622556] __do_page_fault+0x24c/0x4d0
> Oct 2 08:04:25 cloud1 kernel: [165658.622558] ? page_fault+0x8/0x30
> Oct 2 08:04:25 cloud1 kernel: [165658.622559] page_fault+0x1e/0x30
> Oct 2 08:04:25 cloud1 kernel: [165658.622562] RIP: 0033:0x7f9b5aadb47e
> Oct 2 08:04:25 cloud1 kernel: [165658.622563] Code: Bad RIP value.
> Oct 2 08:04:25 cloud1 kernel: [165658.622569] RSP:
> 002b:00007f92dccced60 EFLAGS: 00010246
> Oct 2 08:04:25 cloud1 kernel: [165658.622571] RAX: 00009606c50d609c
> RBX: 00007f9b5c5a1c30 RCX: 0000000000000018
> Oct 2 08:04:25 cloud1 kernel: [165658.622572] RDX: 0000000000000000
> RSI: 00007f92dccced30 RDI: 0000000000000001
> Oct 2 08:04:25 cloud1 kernel: [165658.622572] RBP: 00007f92dccced70
> R08: 001f9c207abaa2b8 R09: 00007fff9f29c080
> Oct 2 08:04:25 cloud1 kernel: [165658.622573] R10: 00000000048732ba
> R11: 0000000000000001 R12: 0000000000000028
> Oct 2 08:04:25 cloud1 kernel: [165658.622574] R13: 0000561d92af6950
> R14: 0000000000000002 R15: 0000000000000001
> Oct 2 08:04:25 cloud1 kernel: [165658.622613] INFO: task mongod:23344
> blocked for more than 120 seconds.
> Oct 2 08:04:25 cloud1 kernel: [165658.622628] Not tainted 4.18.10 #1
> Oct 2 08:04:25 cloud1 kernel: [165658.622638] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 2 08:04:25 cloud1 kernel: [165658.622653] mongod D 0
> 23344 15560 0x00000120
> Oct 2 08:04:25 cloud1 kernel: [165658.622655] Call Trace:
> Oct 2 08:04:25 cloud1 kernel: [165658.622657] ? __schedule+0x299/0x880
> Oct 2 08:04:25 cloud1 kernel: [165658.622659] schedule+0x28/0x80
> Oct 2 08:04:25 cloud1 kernel: [165658.622661]
> btrfs_start_ordered_extent+0xea/0x120
> Oct 2 08:04:25 cloud1 kernel: [165658.622663] ? wait_woken+0x80/0x80
> Oct 2 08:04:25 cloud1 kernel: [165658.622665]
> btrfs_wait_ordered_range+0xbb/0x100
> Oct 2 08:04:25 cloud1 kernel: [165658.622667] btrfs_sync_file+0x30c/0x400
> Oct 2 08:04:25 cloud1 kernel: [165658.622671] do_fsync+0x38/0x60
> Oct 2 08:04:25 cloud1 kernel: [165658.622673] __x64_sys_fdatasync+0x13/0x20
> Oct 2 08:04:25 cloud1 kernel: [165658.622675] do_syscall_64+0x4f/0x100
> Oct 2 08:04:25 cloud1 kernel: [165658.622677]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Oct 2 08:04:25 cloud1 kernel: [165658.622678] RIP: 0033:0x7f3d36c122e7
> Oct 2 08:04:25 cloud1 kernel: [165658.622679] Code: Bad RIP value.
> Oct 2 08:04:25 cloud1 kernel: [165658.622681] RSP:
> 002b:00007f3d31aef560 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
> Oct 2 08:04:25 cloud1 kernel: [165658.622683] RAX: ffffffffffffffda
> RBX: 000000000000000c RCX: 00007f3d36c122e7
> Oct 2 08:04:25 cloud1 kernel: [165658.622683] RDX: 0000000000000000
> RSI: 000000000000000c RDI: 000000000000000c
> Oct 2 08:04:25 cloud1 kernel: [165658.622684] RBP: 00007f3d31aef5a0
> R08: 00005633b3ab9c00 R09: 00007ffeef7a3080
> Oct 2 08:04:25 cloud1 kernel: [165658.622685] R10: 0000000004873546
> R11: 0000000000000293 R12: 00005633b3ab9cc0
> Oct 2 08:04:25 cloud1 kernel: [165658.622686] R13: 00005633b1075d90
> R14: 0000000000000000 R15: 00005633b39a20f8
> Oct 2 08:04:25 cloud1 kernel: [165658.622688] INFO: task ftdc:23347
> blocked for more than 120 seconds.
> Oct 2 08:04:25 cloud1 kernel: [165658.622702] Not tainted 4.18.10 #1
> Oct 2 08:04:25 cloud1 kernel: [165658.622713] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 2 08:04:25 cloud1 kernel: [165658.622728] ftdc D 0
> 23347 15560 0x00000120
> Oct 2 08:04:25 cloud1 kernel: [165658.622729] Call Trace:
> Oct 2 08:04:25 cloud1 kernel: [165658.622731] ? __schedule+0x299/0x880
> Oct 2 08:04:25 cloud1 kernel: [165658.622733] schedule+0x28/0x80
> Oct 2 08:04:25 cloud1 kernel: [165658.622735] wait_current_trans+0xad/0xe0
> Oct 2 08:04:25 cloud1 kernel: [165658.622737] ? wait_woken+0x80/0x80
> Oct 2 08:04:25 cloud1 kernel: [165658.622738] start_transaction+0x2ee/0x3c0
> Oct 2 08:04:25 cloud1 kernel: [165658.622740] btrfs_create+0x57/0x1f0
> Oct 2 08:04:25 cloud1 kernel: [165658.622743] path_openat+0x13c1/0x16a0
> Oct 2 08:04:25 cloud1 kernel: [165658.622746] do_filp_open+0x9b/0x110
> Oct 2 08:04:25 cloud1 kernel: [165658.622748] ? try_lookup_one_len+0x70/0x70
> Oct 2 08:04:25 cloud1 kernel: [165658.622751] ? close_pdeo+0x93/0xf0
> Oct 2 08:04:25 cloud1 kernel: [165658.622753] ? __check_object_size+0xa7/0x1a0
> Oct 2 08:04:25 cloud1 kernel: [165658.622756] ? __alloc_fd+0x3d/0x160
> Oct 2 08:04:25 cloud1 kernel: [165658.622758] ? do_sys_open+0x1bd/0x250
> Oct 2 08:04:25 cloud1 kernel: [165658.622760] do_sys_open+0x1bd/0x250
> Oct 2 08:04:25 cloud1 kernel: [165658.622762] do_syscall_64+0x4f/0x100
> Oct 2 08:04:25 cloud1 kernel: [165658.622763]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Oct 2 08:04:25 cloud1 kernel: [165658.622764] RIP: 0033:0x7f3d36c0ad19
> Oct 2 08:04:25 cloud1 kernel: [165658.622765] Code: Bad RIP value.
> Oct 2 08:04:25 cloud1 kernel: [165658.622767] RSP:
> 002b:00007f3d302ec200 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> Oct 2 08:04:25 cloud1 kernel: [165658.622769] RAX: ffffffffffffffda
> RBX: 00005633b7197d80 RCX: 00007f3d36c0ad19
> Oct 2 08:04:25 cloud1 kernel: [165658.622769] RDX: 0000000000000241
> RSI: 00005633b7106740 RDI: 00000000ffffff9c
> Oct 2 08:04:25 cloud1 kernel: [165658.622770] RBP: 00005633b1077b13
> R08: 0000000000000000 R09: 00005633b1077b14
> Oct 2 08:04:25 cloud1 kernel: [165658.622771] R10: 00000000000001b6
> R11: 0000000000000293 R12: 0000000000000004
> Oct 2 08:04:25 cloud1 kernel: [165658.622771] R13: 00005633b1077b12
> R14: 00007f3d302ec720 R15: 00005633b9942004

Are you using mount option space_cache=v2?

If not, you should enable that and see if you experience the same behavior.  I believe it's highly recommended on all large BTRFS arrays, and I had intermittent freezing problems on large writes before I enabled it.  I do not appear to have those problems any more...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Writing a large file causes odd freeze
  2018-10-02 15:24 ` joshua
@ 2018-10-02 15:50   ` Gerard Saraber
  2018-10-02 15:51     ` Gerard Saraber
  0 siblings, 1 reply; 8+ messages in thread
From: Gerard Saraber @ 2018-10-02 15:50 UTC (permalink / raw)
  To: joshua; +Cc: Btrfs BTRFS

I'll give that a shot! thank you!

-Gerard
On Tue, Oct 2, 2018 at 10:24 AM <joshua@mailmag.net> wrote:
>
> > I have a 25TB mirrored filesystem that I'm able to consistently freeze
> > by ungzipping a large file.
> > The filesystem scrubs complete without errors and smartd reports no
> > errors at the moment.
> >
> > The command is:
> > gzip -dc /btrfsarray/largefile.gz > /btrfs/db/output.sql
> > At around 31GB it quits writing to the output file, and I start
> > getting the 'hung task' messages in my kernel log.
> > They are 3.0GB gzipped SQL dumps, that uncompress into about 42GB SQL
> > dump files.
> >
> > after that 'ps auwx' completely hangs and I can't write to the
> > filesystem anymore, no errors, the processes just hang. I can still
> > read from the FS as far as I can tell.
> > I'm not sure how to diagnose it, or if i've simply ran into some kind
> > of bug (may not even be BTRFS related)
> >
> > Linux cloud1 4.18.10 #1 SMP Fri Sep 28 07:44:06 CDT 2018 x86_64 x86_64
> > x86_64 GNU/Linux
> > [ I think It used to work on 4.18.0-rcX ]
> >
> > Hardware specs:
> > AMD Ryzen Threadripper 1950X 16-Core Processor
> > 128GB ram
> > A bunch of 3-8TB SATA drives connected to the motherboard and a pair
> > of LSI cards
> >
> > root@cloud1:~# btrfs fi usage /btrfsarray
> > Overall:
> > Device size: 79.14TiB
> > Device allocated: 55.38TiB
> > Device unallocated: 23.76TiB
> > Device missing: 0.00B
> > Used: 55.30TiB
> > Free (estimated): 11.91TiB (min: 11.91TiB)
> > Data ratio: 2.00
> > Metadata ratio: 2.00
> > Global reserve: 512.00MiB (used: 0.00B)
> >
> > Data,RAID1: Size:27.61TiB, Used:27.58TiB
> > /dev/sda 1.64TiB
> > /dev/sdb 1.64TiB
> > /dev/sdc 1.87TiB
> > /dev/sdd 4.37TiB
> > /dev/sde 2.55TiB
> > /dev/sdf 2.55TiB
> > /dev/sdg 4.37TiB
> > /dev/sdh 4.36TiB
> > /dev/sdi 2.55TiB
> > /dev/sdj 4.36TiB
> > /dev/sdk 4.37TiB
> > /dev/sdl 1.64TiB
> > /dev/sdm 4.37TiB
> > /dev/sdn 2.54TiB
> > /dev/sdo 2.55TiB
> > /dev/sdp 2.55TiB
> > /dev/sdq 2.55TiB
> > /dev/sdr 4.37TiB
> >
> > Metadata,RAID1: Size:87.00GiB, Used:77.46GiB
> > /dev/sda 5.00GiB
> > /dev/sdb 5.00GiB
> > /dev/sdd 14.00GiB
> > /dev/sde 12.00GiB
> > /dev/sdf 8.00GiB
> > /dev/sdg 6.00GiB
> > /dev/sdh 20.00GiB
> > /dev/sdi 11.00GiB
> > /dev/sdj 19.00GiB
> > /dev/sdk 5.00GiB
> > /dev/sdl 6.00GiB
> > /dev/sdm 7.00GiB
> > /dev/sdn 17.00GiB
> > /dev/sdo 7.00GiB
> > /dev/sdp 13.00GiB
> > /dev/sdq 12.00GiB
> > /dev/sdr 7.00GiB
> >
> > System,RAID1: Size:32.00MiB, Used:5.19MiB
> > /dev/sdc 32.00MiB
> > /dev/sdr 32.00MiB
> >
> > Unallocated:
> > /dev/sda 1.08TiB
> > /dev/sdb 1.08TiB
> > /dev/sdc 5.40TiB
> > /dev/sdd 1.08TiB
> > /dev/sde 1.08TiB
> > /dev/sdf 1.08TiB
> > /dev/sdg 1.08TiB
> > /dev/sdh 1.08TiB
> > /dev/sdi 1.08TiB
> > /dev/sdj 1.08TiB
> > /dev/sdk 1.08TiB
> > /dev/sdl 1.08TiB
> > /dev/sdm 1.08TiB
> > /dev/sdn 1.08TiB
> > /dev/sdo 1.08TiB
> > /dev/sdp 1.08TiB
> > /dev/sdq 1.08TiB
> > /dev/sdr 1.08TiB
> >
> > Oct 2 08:02:25 cloud1 kernel: [165537.793055] INFO: task smbd:47578
> > blocked for more than 120 seconds.
> > Oct 2 08:02:25 cloud1 kernel: [165537.793068] Not tainted 4.18.10 #1
> > Oct 2 08:02:25 cloud1 kernel: [165537.793079] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 2 08:02:25 cloud1 kernel: [165537.793094] smbd D 0
> > 47578 3375 0x00000000
> > Oct 2 08:02:25 cloud1 kernel: [165537.793096] Call Trace:
> > Oct 2 08:02:25 cloud1 kernel: [165537.793098] ? __schedule+0x299/0x880
> > Oct 2 08:02:25 cloud1 kernel: [165537.793100] schedule+0x28/0x80
> > Oct 2 08:02:25 cloud1 kernel: [165537.793102] wait_current_trans+0xad/0xe0
> > Oct 2 08:02:25 cloud1 kernel: [165537.793104] ? wait_woken+0x80/0x80
> > Oct 2 08:02:25 cloud1 kernel: [165537.793106] start_transaction+0x2ee/0x3c0
> > Oct 2 08:02:25 cloud1 kernel: [165537.793108] btrfs_sync_file+0x279/0x400
> > Oct 2 08:02:25 cloud1 kernel: [165537.793110]
> > btrfs_file_write_iter+0x461/0x576
> > Oct 2 08:02:25 cloud1 kernel: [165537.793111] __vfs_write+0x114/0x1a0
> > Oct 2 08:02:25 cloud1 kernel: [165537.793113] vfs_write+0xad/0x1a0
> > Oct 2 08:02:25 cloud1 kernel: [165537.793115] ksys_pwrite64+0x71/0x90
> > Oct 2 08:02:25 cloud1 kernel: [165537.793116] ? __switch_to_asm+0x34/0x70
> > Oct 2 08:02:25 cloud1 kernel: [165537.793117] do_syscall_64+0x4f/0x100
> > Oct 2 08:02:25 cloud1 kernel: [165537.793119]
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > Oct 2 08:02:25 cloud1 kernel: [165537.793120] RIP: 0033:0x7f82af38104f
> > Oct 2 08:02:25 cloud1 kernel: [165537.793120] Code: Bad RIP value.
> > Oct 2 08:02:25 cloud1 kernel: [165537.793122] RSP:
> > 002b:00007f8296b4fd70 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> > Oct 2 08:02:25 cloud1 kernel: [165537.793124] RAX: ffffffffffffffda
> > RBX: 0000000000000009 RCX: 00007f82af38104f
> > Oct 2 08:02:25 cloud1 kernel: [165537.793124] RDX: 0000000000001000
> > RSI: 000055e06183fea0 RDI: 0000000000000009
> > Oct 2 08:02:25 cloud1 kernel: [165537.793125] RBP: 000055e06183fea0
> > R08: 0000000000000000 R09: 00007fffac636080
> > Oct 2 08:02:25 cloud1 kernel: [165537.793126] R10: 00000000089c8000
> > R11: 0000000000000293 R12: 0000000000001000
> > Oct 2 08:02:25 cloud1 kernel: [165537.793127] R13: 00000000089c8000
> > R14: 00007f82aa40e090 R15: 000055e0616b5f70
> >
> > Oct 2 08:04:25 cloud1 kernel: [165658.622345] INFO: task
> > btrfs-transacti:12370 blocked for more than 120 seconds.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622366] Not tainted 4.18.10 #1
> > Oct 2 08:04:25 cloud1 kernel: [165658.622378] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622393] btrfs-transacti D 0
> > 12370 2 0x80000000
> > Oct 2 08:04:25 cloud1 kernel: [165658.622396] Call Trace:
> > Oct 2 08:04:25 cloud1 kernel: [165658.622406] ? __schedule+0x299/0x880
> > Oct 2 08:04:25 cloud1 kernel: [165658.622408] ?
> > _raw_spin_unlock_irqrestore+0xa/0x10
> > Oct 2 08:04:25 cloud1 kernel: [165658.622410] schedule+0x28/0x80
> > Oct 2 08:04:25 cloud1 kernel: [165658.622414]
> > btrfs_commit_transaction+0x760/0x870
> > Oct 2 08:04:25 cloud1 kernel: [165658.622418] ? wait_woken+0x80/0x80
> > Oct 2 08:04:25 cloud1 kernel: [165658.622420] transaction_kthread+0x12e/0x150
> > Oct 2 08:04:25 cloud1 kernel: [165658.622422] ?
> > btrfs_cleanup_transaction+0x520/0x520
> > Oct 2 08:04:25 cloud1 kernel: [165658.622425] kthread+0x113/0x130
> > Oct 2 08:04:25 cloud1 kernel: [165658.622427] ?
> > kthread_create_worker_on_cpu+0x70/0x70
> > Oct 2 08:04:25 cloud1 kernel: [165658.622428] ret_from_fork+0x22/0x40
> > Oct 2 08:04:25 cloud1 kernel: [165658.622487] INFO: task jsvc:15530
> > blocked for more than 120 seconds.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622501] Not tainted 4.18.10 #1
> > Oct 2 08:04:25 cloud1 kernel: [165658.622512] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622527] jsvc D 0
> > 15530 15312 0x00000120
> > Oct 2 08:04:25 cloud1 kernel: [165658.622529] Call Trace:
> > Oct 2 08:04:25 cloud1 kernel: [165658.622531] ? __schedule+0x299/0x880
> > Oct 2 08:04:25 cloud1 kernel: [165658.622533] schedule+0x28/0x80
> > Oct 2 08:04:25 cloud1 kernel: [165658.622536]
> > btrfs_start_ordered_extent+0xea/0x120
> > Oct 2 08:04:25 cloud1 kernel: [165658.622538] ? wait_woken+0x80/0x80
> > Oct 2 08:04:25 cloud1 kernel: [165658.622540] btrfs_page_mkwrite+0x1f3/0x500
> > Oct 2 08:04:25 cloud1 kernel: [165658.622544] do_page_mkwrite+0x31/0x90
> > Oct 2 08:04:25 cloud1 kernel: [165658.622546] do_wp_page+0x214/0x5a0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622548] __handle_mm_fault+0xb6a/0x1260
> > Oct 2 08:04:25 cloud1 kernel: [165658.622551] ? __seccomp_filter+0x44/0x4c0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622553] handle_mm_fault+0xc6/0x200
> > Oct 2 08:04:25 cloud1 kernel: [165658.622556] __do_page_fault+0x24c/0x4d0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622558] ? page_fault+0x8/0x30
> > Oct 2 08:04:25 cloud1 kernel: [165658.622559] page_fault+0x1e/0x30
> > Oct 2 08:04:25 cloud1 kernel: [165658.622562] RIP: 0033:0x7f9b5aadb47e
> > Oct 2 08:04:25 cloud1 kernel: [165658.622563] Code: Bad RIP value.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622569] RSP:
> > 002b:00007f92dccced60 EFLAGS: 00010246
> > Oct 2 08:04:25 cloud1 kernel: [165658.622571] RAX: 00009606c50d609c
> > RBX: 00007f9b5c5a1c30 RCX: 0000000000000018
> > Oct 2 08:04:25 cloud1 kernel: [165658.622572] RDX: 0000000000000000
> > RSI: 00007f92dccced30 RDI: 0000000000000001
> > Oct 2 08:04:25 cloud1 kernel: [165658.622572] RBP: 00007f92dccced70
> > R08: 001f9c207abaa2b8 R09: 00007fff9f29c080
> > Oct 2 08:04:25 cloud1 kernel: [165658.622573] R10: 00000000048732ba
> > R11: 0000000000000001 R12: 0000000000000028
> > Oct 2 08:04:25 cloud1 kernel: [165658.622574] R13: 0000561d92af6950
> > R14: 0000000000000002 R15: 0000000000000001
> > Oct 2 08:04:25 cloud1 kernel: [165658.622613] INFO: task mongod:23344
> > blocked for more than 120 seconds.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622628] Not tainted 4.18.10 #1
> > Oct 2 08:04:25 cloud1 kernel: [165658.622638] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622653] mongod D 0
> > 23344 15560 0x00000120
> > Oct 2 08:04:25 cloud1 kernel: [165658.622655] Call Trace:
> > Oct 2 08:04:25 cloud1 kernel: [165658.622657] ? __schedule+0x299/0x880
> > Oct 2 08:04:25 cloud1 kernel: [165658.622659] schedule+0x28/0x80
> > Oct 2 08:04:25 cloud1 kernel: [165658.622661]
> > btrfs_start_ordered_extent+0xea/0x120
> > Oct 2 08:04:25 cloud1 kernel: [165658.622663] ? wait_woken+0x80/0x80
> > Oct 2 08:04:25 cloud1 kernel: [165658.622665]
> > btrfs_wait_ordered_range+0xbb/0x100
> > Oct 2 08:04:25 cloud1 kernel: [165658.622667] btrfs_sync_file+0x30c/0x400
> > Oct 2 08:04:25 cloud1 kernel: [165658.622671] do_fsync+0x38/0x60
> > Oct 2 08:04:25 cloud1 kernel: [165658.622673] __x64_sys_fdatasync+0x13/0x20
> > Oct 2 08:04:25 cloud1 kernel: [165658.622675] do_syscall_64+0x4f/0x100
> > Oct 2 08:04:25 cloud1 kernel: [165658.622677]
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > Oct 2 08:04:25 cloud1 kernel: [165658.622678] RIP: 0033:0x7f3d36c122e7
> > Oct 2 08:04:25 cloud1 kernel: [165658.622679] Code: Bad RIP value.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622681] RSP:
> > 002b:00007f3d31aef560 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
> > Oct 2 08:04:25 cloud1 kernel: [165658.622683] RAX: ffffffffffffffda
> > RBX: 000000000000000c RCX: 00007f3d36c122e7
> > Oct 2 08:04:25 cloud1 kernel: [165658.622683] RDX: 0000000000000000
> > RSI: 000000000000000c RDI: 000000000000000c
> > Oct 2 08:04:25 cloud1 kernel: [165658.622684] RBP: 00007f3d31aef5a0
> > R08: 00005633b3ab9c00 R09: 00007ffeef7a3080
> > Oct 2 08:04:25 cloud1 kernel: [165658.622685] R10: 0000000004873546
> > R11: 0000000000000293 R12: 00005633b3ab9cc0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622686] R13: 00005633b1075d90
> > R14: 0000000000000000 R15: 00005633b39a20f8
> > Oct 2 08:04:25 cloud1 kernel: [165658.622688] INFO: task ftdc:23347
> > blocked for more than 120 seconds.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622702] Not tainted 4.18.10 #1
> > Oct 2 08:04:25 cloud1 kernel: [165658.622713] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622728] ftdc D 0
> > 23347 15560 0x00000120
> > Oct 2 08:04:25 cloud1 kernel: [165658.622729] Call Trace:
> > Oct 2 08:04:25 cloud1 kernel: [165658.622731] ? __schedule+0x299/0x880
> > Oct 2 08:04:25 cloud1 kernel: [165658.622733] schedule+0x28/0x80
> > Oct 2 08:04:25 cloud1 kernel: [165658.622735] wait_current_trans+0xad/0xe0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622737] ? wait_woken+0x80/0x80
> > Oct 2 08:04:25 cloud1 kernel: [165658.622738] start_transaction+0x2ee/0x3c0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622740] btrfs_create+0x57/0x1f0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622743] path_openat+0x13c1/0x16a0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622746] do_filp_open+0x9b/0x110
> > Oct 2 08:04:25 cloud1 kernel: [165658.622748] ? try_lookup_one_len+0x70/0x70
> > Oct 2 08:04:25 cloud1 kernel: [165658.622751] ? close_pdeo+0x93/0xf0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622753] ? __check_object_size+0xa7/0x1a0
> > Oct 2 08:04:25 cloud1 kernel: [165658.622756] ? __alloc_fd+0x3d/0x160
> > Oct 2 08:04:25 cloud1 kernel: [165658.622758] ? do_sys_open+0x1bd/0x250
> > Oct 2 08:04:25 cloud1 kernel: [165658.622760] do_sys_open+0x1bd/0x250
> > Oct 2 08:04:25 cloud1 kernel: [165658.622762] do_syscall_64+0x4f/0x100
> > Oct 2 08:04:25 cloud1 kernel: [165658.622763]
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > Oct 2 08:04:25 cloud1 kernel: [165658.622764] RIP: 0033:0x7f3d36c0ad19
> > Oct 2 08:04:25 cloud1 kernel: [165658.622765] Code: Bad RIP value.
> > Oct 2 08:04:25 cloud1 kernel: [165658.622767] RSP:
> > 002b:00007f3d302ec200 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> > Oct 2 08:04:25 cloud1 kernel: [165658.622769] RAX: ffffffffffffffda
> > RBX: 00005633b7197d80 RCX: 00007f3d36c0ad19
> > Oct 2 08:04:25 cloud1 kernel: [165658.622769] RDX: 0000000000000241
> > RSI: 00005633b7106740 RDI: 00000000ffffff9c
> > Oct 2 08:04:25 cloud1 kernel: [165658.622770] RBP: 00005633b1077b13
> > R08: 0000000000000000 R09: 00005633b1077b14
> > Oct 2 08:04:25 cloud1 kernel: [165658.622771] R10: 00000000000001b6
> > R11: 0000000000000293 R12: 0000000000000004
> > Oct 2 08:04:25 cloud1 kernel: [165658.622771] R13: 00005633b1077b12
> > R14: 00007f3d302ec720 R15: 00005633b9942004
>
> Are you using mount option space_cache=v2?
>
> If not, you should enable that and see if you experience the same behavior.  I believe it's highly recommended on all large BTRFS arrays, and I had intermittent freezing problems on large writes before I enabled it.  I do not appear to have those problems any more...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Writing a large file causes odd freeze
  2018-10-02 15:50   ` Gerard Saraber
@ 2018-10-02 15:51     ` Gerard Saraber
  2018-10-02 18:22       ` Eli V
  0 siblings, 1 reply; 8+ messages in thread
From: Gerard Saraber @ 2018-10-02 15:51 UTC (permalink / raw)
  To: joshua; +Cc: Btrfs BTRFS

Correction, I am using that option:
mount UUID="${FS}" -o space_cache=v2,noatime,degraded /btrfsarray
On Tue, Oct 2, 2018 at 10:50 AM Gerard Saraber <gsaraber@rarcoa.com> wrote:
>
> I'll give that a shot! thank you!
>
> -Gerard
> On Tue, Oct 2, 2018 at 10:24 AM <joshua@mailmag.net> wrote:
> >
> > > I have a 25TB mirrored filesystem that I'm able to consistently freeze
> > > by ungzipping a large file.
> > > The filesystem scrubs complete without errors and smartd reports no
> > > errors at the moment.
> > >
> > > The command is:
> > > gzip -dc /btrfsarray/largefile.gz > /btrfs/db/output.sql
> > > At around 31GB it quits writing to the output file, and I start
> > > getting the 'hung task' messages in my kernel log.
> > > They are 3.0GB gzipped SQL dumps, that uncompress into about 42GB SQL
> > > dump files.
> > >
> > > after that 'ps auwx' completely hangs and I can't write to the
> > > filesystem anymore, no errors, the processes just hang. I can still
> > > read from the FS as far as I can tell.
> > > I'm not sure how to diagnose it, or if i've simply ran into some kind
> > > of bug (may not even be BTRFS related)
> > >
> > > Linux cloud1 4.18.10 #1 SMP Fri Sep 28 07:44:06 CDT 2018 x86_64 x86_64
> > > x86_64 GNU/Linux
> > > [ I think It used to work on 4.18.0-rcX ]
> > >
> > > Hardware specs:
> > > AMD Ryzen Threadripper 1950X 16-Core Processor
> > > 128GB ram
> > > A bunch of 3-8TB SATA drives connected to the motherboard and a pair
> > > of LSI cards
> > >
> > > root@cloud1:~# btrfs fi usage /btrfsarray
> > > Overall:
> > > Device size: 79.14TiB
> > > Device allocated: 55.38TiB
> > > Device unallocated: 23.76TiB
> > > Device missing: 0.00B
> > > Used: 55.30TiB
> > > Free (estimated): 11.91TiB (min: 11.91TiB)
> > > Data ratio: 2.00
> > > Metadata ratio: 2.00
> > > Global reserve: 512.00MiB (used: 0.00B)
> > >
> > > Data,RAID1: Size:27.61TiB, Used:27.58TiB
> > > /dev/sda 1.64TiB
> > > /dev/sdb 1.64TiB
> > > /dev/sdc 1.87TiB
> > > /dev/sdd 4.37TiB
> > > /dev/sde 2.55TiB
> > > /dev/sdf 2.55TiB
> > > /dev/sdg 4.37TiB
> > > /dev/sdh 4.36TiB
> > > /dev/sdi 2.55TiB
> > > /dev/sdj 4.36TiB
> > > /dev/sdk 4.37TiB
> > > /dev/sdl 1.64TiB
> > > /dev/sdm 4.37TiB
> > > /dev/sdn 2.54TiB
> > > /dev/sdo 2.55TiB
> > > /dev/sdp 2.55TiB
> > > /dev/sdq 2.55TiB
> > > /dev/sdr 4.37TiB
> > >
> > > Metadata,RAID1: Size:87.00GiB, Used:77.46GiB
> > > /dev/sda 5.00GiB
> > > /dev/sdb 5.00GiB
> > > /dev/sdd 14.00GiB
> > > /dev/sde 12.00GiB
> > > /dev/sdf 8.00GiB
> > > /dev/sdg 6.00GiB
> > > /dev/sdh 20.00GiB
> > > /dev/sdi 11.00GiB
> > > /dev/sdj 19.00GiB
> > > /dev/sdk 5.00GiB
> > > /dev/sdl 6.00GiB
> > > /dev/sdm 7.00GiB
> > > /dev/sdn 17.00GiB
> > > /dev/sdo 7.00GiB
> > > /dev/sdp 13.00GiB
> > > /dev/sdq 12.00GiB
> > > /dev/sdr 7.00GiB
> > >
> > > System,RAID1: Size:32.00MiB, Used:5.19MiB
> > > /dev/sdc 32.00MiB
> > > /dev/sdr 32.00MiB
> > >
> > > Unallocated:
> > > /dev/sda 1.08TiB
> > > /dev/sdb 1.08TiB
> > > /dev/sdc 5.40TiB
> > > /dev/sdd 1.08TiB
> > > /dev/sde 1.08TiB
> > > /dev/sdf 1.08TiB
> > > /dev/sdg 1.08TiB
> > > /dev/sdh 1.08TiB
> > > /dev/sdi 1.08TiB
> > > /dev/sdj 1.08TiB
> > > /dev/sdk 1.08TiB
> > > /dev/sdl 1.08TiB
> > > /dev/sdm 1.08TiB
> > > /dev/sdn 1.08TiB
> > > /dev/sdo 1.08TiB
> > > /dev/sdp 1.08TiB
> > > /dev/sdq 1.08TiB
> > > /dev/sdr 1.08TiB
> > >
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793055] INFO: task smbd:47578
> > > blocked for more than 120 seconds.
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793068] Not tainted 4.18.10 #1
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793079] "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793094] smbd D 0
> > > 47578 3375 0x00000000
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793096] Call Trace:
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793098] ? __schedule+0x299/0x880
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793100] schedule+0x28/0x80
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793102] wait_current_trans+0xad/0xe0
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793104] ? wait_woken+0x80/0x80
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793106] start_transaction+0x2ee/0x3c0
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793108] btrfs_sync_file+0x279/0x400
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793110]
> > > btrfs_file_write_iter+0x461/0x576
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793111] __vfs_write+0x114/0x1a0
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793113] vfs_write+0xad/0x1a0
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793115] ksys_pwrite64+0x71/0x90
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793116] ? __switch_to_asm+0x34/0x70
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793117] do_syscall_64+0x4f/0x100
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793119]
> > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793120] RIP: 0033:0x7f82af38104f
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793120] Code: Bad RIP value.
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793122] RSP:
> > > 002b:00007f8296b4fd70 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793124] RAX: ffffffffffffffda
> > > RBX: 0000000000000009 RCX: 00007f82af38104f
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793124] RDX: 0000000000001000
> > > RSI: 000055e06183fea0 RDI: 0000000000000009
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793125] RBP: 000055e06183fea0
> > > R08: 0000000000000000 R09: 00007fffac636080
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793126] R10: 00000000089c8000
> > > R11: 0000000000000293 R12: 0000000000001000
> > > Oct 2 08:02:25 cloud1 kernel: [165537.793127] R13: 00000000089c8000
> > > R14: 00007f82aa40e090 R15: 000055e0616b5f70
> > >
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622345] INFO: task
> > > btrfs-transacti:12370 blocked for more than 120 seconds.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622366] Not tainted 4.18.10 #1
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622378] "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622393] btrfs-transacti D 0
> > > 12370 2 0x80000000
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622396] Call Trace:
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622406] ? __schedule+0x299/0x880
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622408] ?
> > > _raw_spin_unlock_irqrestore+0xa/0x10
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622410] schedule+0x28/0x80
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622414]
> > > btrfs_commit_transaction+0x760/0x870
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622418] ? wait_woken+0x80/0x80
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622420] transaction_kthread+0x12e/0x150
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622422] ?
> > > btrfs_cleanup_transaction+0x520/0x520
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622425] kthread+0x113/0x130
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622427] ?
> > > kthread_create_worker_on_cpu+0x70/0x70
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622428] ret_from_fork+0x22/0x40
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622487] INFO: task jsvc:15530
> > > blocked for more than 120 seconds.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622501] Not tainted 4.18.10 #1
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622512] "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622527] jsvc D 0
> > > 15530 15312 0x00000120
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622529] Call Trace:
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622531] ? __schedule+0x299/0x880
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622533] schedule+0x28/0x80
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622536]
> > > btrfs_start_ordered_extent+0xea/0x120
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622538] ? wait_woken+0x80/0x80
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622540] btrfs_page_mkwrite+0x1f3/0x500
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622544] do_page_mkwrite+0x31/0x90
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622546] do_wp_page+0x214/0x5a0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622548] __handle_mm_fault+0xb6a/0x1260
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622551] ? __seccomp_filter+0x44/0x4c0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622553] handle_mm_fault+0xc6/0x200
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622556] __do_page_fault+0x24c/0x4d0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622558] ? page_fault+0x8/0x30
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622559] page_fault+0x1e/0x30
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622562] RIP: 0033:0x7f9b5aadb47e
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622563] Code: Bad RIP value.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622569] RSP:
> > > 002b:00007f92dccced60 EFLAGS: 00010246
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622571] RAX: 00009606c50d609c
> > > RBX: 00007f9b5c5a1c30 RCX: 0000000000000018
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622572] RDX: 0000000000000000
> > > RSI: 00007f92dccced30 RDI: 0000000000000001
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622572] RBP: 00007f92dccced70
> > > R08: 001f9c207abaa2b8 R09: 00007fff9f29c080
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622573] R10: 00000000048732ba
> > > R11: 0000000000000001 R12: 0000000000000028
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622574] R13: 0000561d92af6950
> > > R14: 0000000000000002 R15: 0000000000000001
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622613] INFO: task mongod:23344
> > > blocked for more than 120 seconds.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622628] Not tainted 4.18.10 #1
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622638] "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622653] mongod D 0
> > > 23344 15560 0x00000120
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622655] Call Trace:
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622657] ? __schedule+0x299/0x880
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622659] schedule+0x28/0x80
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622661]
> > > btrfs_start_ordered_extent+0xea/0x120
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622663] ? wait_woken+0x80/0x80
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622665]
> > > btrfs_wait_ordered_range+0xbb/0x100
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622667] btrfs_sync_file+0x30c/0x400
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622671] do_fsync+0x38/0x60
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622673] __x64_sys_fdatasync+0x13/0x20
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622675] do_syscall_64+0x4f/0x100
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622677]
> > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622678] RIP: 0033:0x7f3d36c122e7
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622679] Code: Bad RIP value.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622681] RSP:
> > > 002b:00007f3d31aef560 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622683] RAX: ffffffffffffffda
> > > RBX: 000000000000000c RCX: 00007f3d36c122e7
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622683] RDX: 0000000000000000
> > > RSI: 000000000000000c RDI: 000000000000000c
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622684] RBP: 00007f3d31aef5a0
> > > R08: 00005633b3ab9c00 R09: 00007ffeef7a3080
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622685] R10: 0000000004873546
> > > R11: 0000000000000293 R12: 00005633b3ab9cc0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622686] R13: 00005633b1075d90
> > > R14: 0000000000000000 R15: 00005633b39a20f8
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622688] INFO: task ftdc:23347
> > > blocked for more than 120 seconds.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622702] Not tainted 4.18.10 #1
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622713] "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622728] ftdc D 0
> > > 23347 15560 0x00000120
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622729] Call Trace:
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622731] ? __schedule+0x299/0x880
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622733] schedule+0x28/0x80
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622735] wait_current_trans+0xad/0xe0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622737] ? wait_woken+0x80/0x80
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622738] start_transaction+0x2ee/0x3c0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622740] btrfs_create+0x57/0x1f0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622743] path_openat+0x13c1/0x16a0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622746] do_filp_open+0x9b/0x110
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622748] ? try_lookup_one_len+0x70/0x70
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622751] ? close_pdeo+0x93/0xf0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622753] ? __check_object_size+0xa7/0x1a0
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622756] ? __alloc_fd+0x3d/0x160
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622758] ? do_sys_open+0x1bd/0x250
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622760] do_sys_open+0x1bd/0x250
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622762] do_syscall_64+0x4f/0x100
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622763]
> > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622764] RIP: 0033:0x7f3d36c0ad19
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622765] Code: Bad RIP value.
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622767] RSP:
> > > 002b:00007f3d302ec200 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622769] RAX: ffffffffffffffda
> > > RBX: 00005633b7197d80 RCX: 00007f3d36c0ad19
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622769] RDX: 0000000000000241
> > > RSI: 00005633b7106740 RDI: 00000000ffffff9c
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622770] RBP: 00005633b1077b13
> > > R08: 0000000000000000 R09: 00005633b1077b14
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622771] R10: 00000000000001b6
> > > R11: 0000000000000293 R12: 0000000000000004
> > > Oct 2 08:04:25 cloud1 kernel: [165658.622771] R13: 00005633b1077b12
> > > R14: 00007f3d302ec720 R15: 00005633b9942004
> >
> > Are you using mount option space_cache=v2?
> >
> > If not, you should enable that and see if you experience the same behavior.  I believe it's highly recommended on all large BTRFS arrays, and I had intermittent freezing problems on large writes before I enabled it.  I do not appear to have those problems any more...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Writing a large file causes odd freeze
  2018-10-02 15:51     ` Gerard Saraber
@ 2018-10-02 18:22       ` Eli V
  2018-10-02 21:12         ` Gerard Saraber
  0 siblings, 1 reply; 8+ messages in thread
From: Eli V @ 2018-10-02 18:22 UTC (permalink / raw)
  To: gsaraber; +Cc: joshua, linux-btrfs

I've seen similar symptoms running
rsync(https://bugzilla.kernel.org/show_bug.cgi?id=199753), and the
hangs always seem to have the wait_woken() in the stack trace. Doesn't
happen repeatably for me, though I ran 4.18.6 for a month without
issues, and had freezes the first(and only) two times I ran with
4.18.8. So I've rebooted back to 4.18.6 for now. You could try that to
see if it's any different for you or 4.19 is about done so could try
that as well.

On Tue, Oct 2, 2018 at 1:43 PM Gerard Saraber <gsaraber@rarcoa.com> wrote:
>
> Correction, I am using that option:
> mount UUID="${FS}" -o space_cache=v2,noatime,degraded /btrfsarray
> On Tue, Oct 2, 2018 at 10:50 AM Gerard Saraber <gsaraber@rarcoa.com> wrote:
> >
> > I'll give that a shot! thank you!
> >
> > -Gerard
> > On Tue, Oct 2, 2018 at 10:24 AM <joshua@mailmag.net> wrote:
> > >
> > > > I have a 25TB mirrored filesystem that I'm able to consistently freeze
> > > > by ungzipping a large file.
> > > > The filesystem scrubs complete without errors and smartd reports no
> > > > errors at the moment.
> > > >
> > > > The command is:
> > > > gzip -dc /btrfsarray/largefile.gz > /btrfs/db/output.sql
> > > > At around 31GB it quits writing to the output file, and I start
> > > > getting the 'hung task' messages in my kernel log.
> > > > They are 3.0GB gzipped SQL dumps, that uncompress into about 42GB SQL
> > > > dump files.
> > > >
> > > > after that 'ps auwx' completely hangs and I can't write to the
> > > > filesystem anymore, no errors, the processes just hang. I can still
> > > > read from the FS as far as I can tell.
> > > > I'm not sure how to diagnose it, or if i've simply ran into some kind
> > > > of bug (may not even be BTRFS related)
> > > >
> > > > Linux cloud1 4.18.10 #1 SMP Fri Sep 28 07:44:06 CDT 2018 x86_64 x86_64
> > > > x86_64 GNU/Linux
> > > > [ I think It used to work on 4.18.0-rcX ]
> > > >
> > > > Hardware specs:
> > > > AMD Ryzen Threadripper 1950X 16-Core Processor
> > > > 128GB ram
> > > > A bunch of 3-8TB SATA drives connected to the motherboard and a pair
> > > > of LSI cards
> > > >
> > > > root@cloud1:~# btrfs fi usage /btrfsarray
> > > > Overall:
> > > > Device size: 79.14TiB
> > > > Device allocated: 55.38TiB
> > > > Device unallocated: 23.76TiB
> > > > Device missing: 0.00B
> > > > Used: 55.30TiB
> > > > Free (estimated): 11.91TiB (min: 11.91TiB)
> > > > Data ratio: 2.00
> > > > Metadata ratio: 2.00
> > > > Global reserve: 512.00MiB (used: 0.00B)
> > > >
> > > > Data,RAID1: Size:27.61TiB, Used:27.58TiB
> > > > /dev/sda 1.64TiB
> > > > /dev/sdb 1.64TiB
> > > > /dev/sdc 1.87TiB
> > > > /dev/sdd 4.37TiB
> > > > /dev/sde 2.55TiB
> > > > /dev/sdf 2.55TiB
> > > > /dev/sdg 4.37TiB
> > > > /dev/sdh 4.36TiB
> > > > /dev/sdi 2.55TiB
> > > > /dev/sdj 4.36TiB
> > > > /dev/sdk 4.37TiB
> > > > /dev/sdl 1.64TiB
> > > > /dev/sdm 4.37TiB
> > > > /dev/sdn 2.54TiB
> > > > /dev/sdo 2.55TiB
> > > > /dev/sdp 2.55TiB
> > > > /dev/sdq 2.55TiB
> > > > /dev/sdr 4.37TiB
> > > >
> > > > Metadata,RAID1: Size:87.00GiB, Used:77.46GiB
> > > > /dev/sda 5.00GiB
> > > > /dev/sdb 5.00GiB
> > > > /dev/sdd 14.00GiB
> > > > /dev/sde 12.00GiB
> > > > /dev/sdf 8.00GiB
> > > > /dev/sdg 6.00GiB
> > > > /dev/sdh 20.00GiB
> > > > /dev/sdi 11.00GiB
> > > > /dev/sdj 19.00GiB
> > > > /dev/sdk 5.00GiB
> > > > /dev/sdl 6.00GiB
> > > > /dev/sdm 7.00GiB
> > > > /dev/sdn 17.00GiB
> > > > /dev/sdo 7.00GiB
> > > > /dev/sdp 13.00GiB
> > > > /dev/sdq 12.00GiB
> > > > /dev/sdr 7.00GiB
> > > >
> > > > System,RAID1: Size:32.00MiB, Used:5.19MiB
> > > > /dev/sdc 32.00MiB
> > > > /dev/sdr 32.00MiB
> > > >
> > > > Unallocated:
> > > > /dev/sda 1.08TiB
> > > > /dev/sdb 1.08TiB
> > > > /dev/sdc 5.40TiB
> > > > /dev/sdd 1.08TiB
> > > > /dev/sde 1.08TiB
> > > > /dev/sdf 1.08TiB
> > > > /dev/sdg 1.08TiB
> > > > /dev/sdh 1.08TiB
> > > > /dev/sdi 1.08TiB
> > > > /dev/sdj 1.08TiB
> > > > /dev/sdk 1.08TiB
> > > > /dev/sdl 1.08TiB
> > > > /dev/sdm 1.08TiB
> > > > /dev/sdn 1.08TiB
> > > > /dev/sdo 1.08TiB
> > > > /dev/sdp 1.08TiB
> > > > /dev/sdq 1.08TiB
> > > > /dev/sdr 1.08TiB
> > > >
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793055] INFO: task smbd:47578
> > > > blocked for more than 120 seconds.
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793068] Not tainted 4.18.10 #1
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793079] "echo 0 >
> > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793094] smbd D 0
> > > > 47578 3375 0x00000000
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793096] Call Trace:
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793098] ? __schedule+0x299/0x880
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793100] schedule+0x28/0x80
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793102] wait_current_trans+0xad/0xe0
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793104] ? wait_woken+0x80/0x80
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793106] start_transaction+0x2ee/0x3c0
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793108] btrfs_sync_file+0x279/0x400
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793110]
> > > > btrfs_file_write_iter+0x461/0x576
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793111] __vfs_write+0x114/0x1a0
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793113] vfs_write+0xad/0x1a0
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793115] ksys_pwrite64+0x71/0x90
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793116] ? __switch_to_asm+0x34/0x70
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793117] do_syscall_64+0x4f/0x100
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793119]
> > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793120] RIP: 0033:0x7f82af38104f
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793120] Code: Bad RIP value.
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793122] RSP:
> > > > 002b:00007f8296b4fd70 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793124] RAX: ffffffffffffffda
> > > > RBX: 0000000000000009 RCX: 00007f82af38104f
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793124] RDX: 0000000000001000
> > > > RSI: 000055e06183fea0 RDI: 0000000000000009
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793125] RBP: 000055e06183fea0
> > > > R08: 0000000000000000 R09: 00007fffac636080
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793126] R10: 00000000089c8000
> > > > R11: 0000000000000293 R12: 0000000000001000
> > > > Oct 2 08:02:25 cloud1 kernel: [165537.793127] R13: 00000000089c8000
> > > > R14: 00007f82aa40e090 R15: 000055e0616b5f70
> > > >
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622345] INFO: task
> > > > btrfs-transacti:12370 blocked for more than 120 seconds.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622366] Not tainted 4.18.10 #1
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622378] "echo 0 >
> > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622393] btrfs-transacti D 0
> > > > 12370 2 0x80000000
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622396] Call Trace:
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622406] ? __schedule+0x299/0x880
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622408] ?
> > > > _raw_spin_unlock_irqrestore+0xa/0x10
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622410] schedule+0x28/0x80
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622414]
> > > > btrfs_commit_transaction+0x760/0x870
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622418] ? wait_woken+0x80/0x80
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622420] transaction_kthread+0x12e/0x150
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622422] ?
> > > > btrfs_cleanup_transaction+0x520/0x520
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622425] kthread+0x113/0x130
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622427] ?
> > > > kthread_create_worker_on_cpu+0x70/0x70
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622428] ret_from_fork+0x22/0x40
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622487] INFO: task jsvc:15530
> > > > blocked for more than 120 seconds.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622501] Not tainted 4.18.10 #1
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622512] "echo 0 >
> > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622527] jsvc D 0
> > > > 15530 15312 0x00000120
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622529] Call Trace:
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622531] ? __schedule+0x299/0x880
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622533] schedule+0x28/0x80
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622536]
> > > > btrfs_start_ordered_extent+0xea/0x120
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622538] ? wait_woken+0x80/0x80
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622540] btrfs_page_mkwrite+0x1f3/0x500
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622544] do_page_mkwrite+0x31/0x90
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622546] do_wp_page+0x214/0x5a0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622548] __handle_mm_fault+0xb6a/0x1260
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622551] ? __seccomp_filter+0x44/0x4c0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622553] handle_mm_fault+0xc6/0x200
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622556] __do_page_fault+0x24c/0x4d0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622558] ? page_fault+0x8/0x30
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622559] page_fault+0x1e/0x30
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622562] RIP: 0033:0x7f9b5aadb47e
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622563] Code: Bad RIP value.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622569] RSP:
> > > > 002b:00007f92dccced60 EFLAGS: 00010246
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622571] RAX: 00009606c50d609c
> > > > RBX: 00007f9b5c5a1c30 RCX: 0000000000000018
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622572] RDX: 0000000000000000
> > > > RSI: 00007f92dccced30 RDI: 0000000000000001
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622572] RBP: 00007f92dccced70
> > > > R08: 001f9c207abaa2b8 R09: 00007fff9f29c080
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622573] R10: 00000000048732ba
> > > > R11: 0000000000000001 R12: 0000000000000028
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622574] R13: 0000561d92af6950
> > > > R14: 0000000000000002 R15: 0000000000000001
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622613] INFO: task mongod:23344
> > > > blocked for more than 120 seconds.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622628] Not tainted 4.18.10 #1
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622638] "echo 0 >
> > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622653] mongod D 0
> > > > 23344 15560 0x00000120
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622655] Call Trace:
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622657] ? __schedule+0x299/0x880
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622659] schedule+0x28/0x80
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622661]
> > > > btrfs_start_ordered_extent+0xea/0x120
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622663] ? wait_woken+0x80/0x80
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622665]
> > > > btrfs_wait_ordered_range+0xbb/0x100
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622667] btrfs_sync_file+0x30c/0x400
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622671] do_fsync+0x38/0x60
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622673] __x64_sys_fdatasync+0x13/0x20
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622675] do_syscall_64+0x4f/0x100
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622677]
> > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622678] RIP: 0033:0x7f3d36c122e7
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622679] Code: Bad RIP value.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622681] RSP:
> > > > 002b:00007f3d31aef560 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622683] RAX: ffffffffffffffda
> > > > RBX: 000000000000000c RCX: 00007f3d36c122e7
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622683] RDX: 0000000000000000
> > > > RSI: 000000000000000c RDI: 000000000000000c
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622684] RBP: 00007f3d31aef5a0
> > > > R08: 00005633b3ab9c00 R09: 00007ffeef7a3080
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622685] R10: 0000000004873546
> > > > R11: 0000000000000293 R12: 00005633b3ab9cc0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622686] R13: 00005633b1075d90
> > > > R14: 0000000000000000 R15: 00005633b39a20f8
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622688] INFO: task ftdc:23347
> > > > blocked for more than 120 seconds.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622702] Not tainted 4.18.10 #1
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622713] "echo 0 >
> > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622728] ftdc D 0
> > > > 23347 15560 0x00000120
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622729] Call Trace:
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622731] ? __schedule+0x299/0x880
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622733] schedule+0x28/0x80
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622735] wait_current_trans+0xad/0xe0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622737] ? wait_woken+0x80/0x80
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622738] start_transaction+0x2ee/0x3c0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622740] btrfs_create+0x57/0x1f0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622743] path_openat+0x13c1/0x16a0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622746] do_filp_open+0x9b/0x110
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622748] ? try_lookup_one_len+0x70/0x70
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622751] ? close_pdeo+0x93/0xf0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622753] ? __check_object_size+0xa7/0x1a0
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622756] ? __alloc_fd+0x3d/0x160
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622758] ? do_sys_open+0x1bd/0x250
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622760] do_sys_open+0x1bd/0x250
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622762] do_syscall_64+0x4f/0x100
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622763]
> > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622764] RIP: 0033:0x7f3d36c0ad19
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622765] Code: Bad RIP value.
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622767] RSP:
> > > > 002b:00007f3d302ec200 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622769] RAX: ffffffffffffffda
> > > > RBX: 00005633b7197d80 RCX: 00007f3d36c0ad19
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622769] RDX: 0000000000000241
> > > > RSI: 00005633b7106740 RDI: 00000000ffffff9c
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622770] RBP: 00005633b1077b13
> > > > R08: 0000000000000000 R09: 00005633b1077b14
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622771] R10: 00000000000001b6
> > > > R11: 0000000000000293 R12: 0000000000000004
> > > > Oct 2 08:04:25 cloud1 kernel: [165658.622771] R13: 00005633b1077b12
> > > > R14: 00007f3d302ec720 R15: 00005633b9942004
> > >
> > > Are you using mount option space_cache=v2?
> > >
> > > If not, you should enable that and see if you experience the same behavior.  I believe it's highly recommended on all large BTRFS arrays, and I had intermittent freezing problems on large writes before I enabled it.  I do not appear to have those problems any more...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Writing a large file causes odd freeze
  2018-10-02 18:22       ` Eli V
@ 2018-10-02 21:12         ` Gerard Saraber
  0 siblings, 0 replies; 8+ messages in thread
From: Gerard Saraber @ 2018-10-02 21:12 UTC (permalink / raw)
  To: eliventer; +Cc: joshua, Btrfs BTRFS

That was going to be my next move, downgrade kernels, but I wasn't
sure how far back to go..
I'll give 4.18.6 a shot and see how it does the next time I need to reboot it :)

-G
On Tue, Oct 2, 2018 at 1:23 PM Eli V <eliventer@gmail.com> wrote:
>
> I've seen similar symptoms running
> rsync(https://bugzilla.kernel.org/show_bug.cgi?id=199753), and the
> hangs always seem to have the wait_woken() in the stack trace. Doesn't
> happen repeatably for me, though I ran 4.18.6 for a month without
> issues, and had freezes the first(and only) two times I ran with
> 4.18.8. So I've rebooted back to 4.18.6 for now. You could try that to
> see if it's any different for you or 4.19 is about done so could try
> that as well.
>
> On Tue, Oct 2, 2018 at 1:43 PM Gerard Saraber <gsaraber@rarcoa.com> wrote:
> >
> > Correction, I am using that option:
> > mount UUID="${FS}" -o space_cache=v2,noatime,degraded /btrfsarray
> > On Tue, Oct 2, 2018 at 10:50 AM Gerard Saraber <gsaraber@rarcoa.com> wrote:
> > >
> > > I'll give that a shot! thank you!
> > >
> > > -Gerard
> > > On Tue, Oct 2, 2018 at 10:24 AM <joshua@mailmag.net> wrote:
> > > >
> > > > > I have a 25TB mirrored filesystem that I'm able to consistently freeze
> > > > > by ungzipping a large file.
> > > > > The filesystem scrubs complete without errors and smartd reports no
> > > > > errors at the moment.
> > > > >
> > > > > The command is:
> > > > > gzip -dc /btrfsarray/largefile.gz > /btrfs/db/output.sql
> > > > > At around 31GB it quits writing to the output file, and I start
> > > > > getting the 'hung task' messages in my kernel log.
> > > > > They are 3.0GB gzipped SQL dumps, that uncompress into about 42GB SQL
> > > > > dump files.
> > > > >
> > > > > after that 'ps auwx' completely hangs and I can't write to the
> > > > > filesystem anymore, no errors, the processes just hang. I can still
> > > > > read from the FS as far as I can tell.
> > > > > I'm not sure how to diagnose it, or if i've simply ran into some kind
> > > > > of bug (may not even be BTRFS related)
> > > > >
> > > > > Linux cloud1 4.18.10 #1 SMP Fri Sep 28 07:44:06 CDT 2018 x86_64 x86_64
> > > > > x86_64 GNU/Linux
> > > > > [ I think It used to work on 4.18.0-rcX ]
> > > > >
> > > > > Hardware specs:
> > > > > AMD Ryzen Threadripper 1950X 16-Core Processor
> > > > > 128GB ram
> > > > > A bunch of 3-8TB SATA drives connected to the motherboard and a pair
> > > > > of LSI cards
> > > > >
> > > > > root@cloud1:~# btrfs fi usage /btrfsarray
> > > > > Overall:
> > > > > Device size: 79.14TiB
> > > > > Device allocated: 55.38TiB
> > > > > Device unallocated: 23.76TiB
> > > > > Device missing: 0.00B
> > > > > Used: 55.30TiB
> > > > > Free (estimated): 11.91TiB (min: 11.91TiB)
> > > > > Data ratio: 2.00
> > > > > Metadata ratio: 2.00
> > > > > Global reserve: 512.00MiB (used: 0.00B)
> > > > >
> > > > > Data,RAID1: Size:27.61TiB, Used:27.58TiB
> > > > > /dev/sda 1.64TiB
> > > > > /dev/sdb 1.64TiB
> > > > > /dev/sdc 1.87TiB
> > > > > /dev/sdd 4.37TiB
> > > > > /dev/sde 2.55TiB
> > > > > /dev/sdf 2.55TiB
> > > > > /dev/sdg 4.37TiB
> > > > > /dev/sdh 4.36TiB
> > > > > /dev/sdi 2.55TiB
> > > > > /dev/sdj 4.36TiB
> > > > > /dev/sdk 4.37TiB
> > > > > /dev/sdl 1.64TiB
> > > > > /dev/sdm 4.37TiB
> > > > > /dev/sdn 2.54TiB
> > > > > /dev/sdo 2.55TiB
> > > > > /dev/sdp 2.55TiB
> > > > > /dev/sdq 2.55TiB
> > > > > /dev/sdr 4.37TiB
> > > > >
> > > > > Metadata,RAID1: Size:87.00GiB, Used:77.46GiB
> > > > > /dev/sda 5.00GiB
> > > > > /dev/sdb 5.00GiB
> > > > > /dev/sdd 14.00GiB
> > > > > /dev/sde 12.00GiB
> > > > > /dev/sdf 8.00GiB
> > > > > /dev/sdg 6.00GiB
> > > > > /dev/sdh 20.00GiB
> > > > > /dev/sdi 11.00GiB
> > > > > /dev/sdj 19.00GiB
> > > > > /dev/sdk 5.00GiB
> > > > > /dev/sdl 6.00GiB
> > > > > /dev/sdm 7.00GiB
> > > > > /dev/sdn 17.00GiB
> > > > > /dev/sdo 7.00GiB
> > > > > /dev/sdp 13.00GiB
> > > > > /dev/sdq 12.00GiB
> > > > > /dev/sdr 7.00GiB
> > > > >
> > > > > System,RAID1: Size:32.00MiB, Used:5.19MiB
> > > > > /dev/sdc 32.00MiB
> > > > > /dev/sdr 32.00MiB
> > > > >
> > > > > Unallocated:
> > > > > /dev/sda 1.08TiB
> > > > > /dev/sdb 1.08TiB
> > > > > /dev/sdc 5.40TiB
> > > > > /dev/sdd 1.08TiB
> > > > > /dev/sde 1.08TiB
> > > > > /dev/sdf 1.08TiB
> > > > > /dev/sdg 1.08TiB
> > > > > /dev/sdh 1.08TiB
> > > > > /dev/sdi 1.08TiB
> > > > > /dev/sdj 1.08TiB
> > > > > /dev/sdk 1.08TiB
> > > > > /dev/sdl 1.08TiB
> > > > > /dev/sdm 1.08TiB
> > > > > /dev/sdn 1.08TiB
> > > > > /dev/sdo 1.08TiB
> > > > > /dev/sdp 1.08TiB
> > > > > /dev/sdq 1.08TiB
> > > > > /dev/sdr 1.08TiB
> > > > >
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793055] INFO: task smbd:47578
> > > > > blocked for more than 120 seconds.
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793068] Not tainted 4.18.10 #1
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793079] "echo 0 >
> > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793094] smbd D 0
> > > > > 47578 3375 0x00000000
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793096] Call Trace:
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793098] ? __schedule+0x299/0x880
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793100] schedule+0x28/0x80
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793102] wait_current_trans+0xad/0xe0
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793104] ? wait_woken+0x80/0x80
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793106] start_transaction+0x2ee/0x3c0
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793108] btrfs_sync_file+0x279/0x400
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793110]
> > > > > btrfs_file_write_iter+0x461/0x576
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793111] __vfs_write+0x114/0x1a0
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793113] vfs_write+0xad/0x1a0
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793115] ksys_pwrite64+0x71/0x90
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793116] ? __switch_to_asm+0x34/0x70
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793117] do_syscall_64+0x4f/0x100
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793119]
> > > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793120] RIP: 0033:0x7f82af38104f
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793120] Code: Bad RIP value.
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793122] RSP:
> > > > > 002b:00007f8296b4fd70 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793124] RAX: ffffffffffffffda
> > > > > RBX: 0000000000000009 RCX: 00007f82af38104f
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793124] RDX: 0000000000001000
> > > > > RSI: 000055e06183fea0 RDI: 0000000000000009
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793125] RBP: 000055e06183fea0
> > > > > R08: 0000000000000000 R09: 00007fffac636080
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793126] R10: 00000000089c8000
> > > > > R11: 0000000000000293 R12: 0000000000001000
> > > > > Oct 2 08:02:25 cloud1 kernel: [165537.793127] R13: 00000000089c8000
> > > > > R14: 00007f82aa40e090 R15: 000055e0616b5f70
> > > > >
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622345] INFO: task
> > > > > btrfs-transacti:12370 blocked for more than 120 seconds.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622366] Not tainted 4.18.10 #1
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622378] "echo 0 >
> > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622393] btrfs-transacti D 0
> > > > > 12370 2 0x80000000
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622396] Call Trace:
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622406] ? __schedule+0x299/0x880
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622408] ?
> > > > > _raw_spin_unlock_irqrestore+0xa/0x10
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622410] schedule+0x28/0x80
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622414]
> > > > > btrfs_commit_transaction+0x760/0x870
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622418] ? wait_woken+0x80/0x80
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622420] transaction_kthread+0x12e/0x150
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622422] ?
> > > > > btrfs_cleanup_transaction+0x520/0x520
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622425] kthread+0x113/0x130
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622427] ?
> > > > > kthread_create_worker_on_cpu+0x70/0x70
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622428] ret_from_fork+0x22/0x40
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622487] INFO: task jsvc:15530
> > > > > blocked for more than 120 seconds.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622501] Not tainted 4.18.10 #1
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622512] "echo 0 >
> > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622527] jsvc D 0
> > > > > 15530 15312 0x00000120
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622529] Call Trace:
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622531] ? __schedule+0x299/0x880
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622533] schedule+0x28/0x80
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622536]
> > > > > btrfs_start_ordered_extent+0xea/0x120
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622538] ? wait_woken+0x80/0x80
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622540] btrfs_page_mkwrite+0x1f3/0x500
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622544] do_page_mkwrite+0x31/0x90
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622546] do_wp_page+0x214/0x5a0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622548] __handle_mm_fault+0xb6a/0x1260
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622551] ? __seccomp_filter+0x44/0x4c0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622553] handle_mm_fault+0xc6/0x200
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622556] __do_page_fault+0x24c/0x4d0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622558] ? page_fault+0x8/0x30
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622559] page_fault+0x1e/0x30
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622562] RIP: 0033:0x7f9b5aadb47e
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622563] Code: Bad RIP value.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622569] RSP:
> > > > > 002b:00007f92dccced60 EFLAGS: 00010246
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622571] RAX: 00009606c50d609c
> > > > > RBX: 00007f9b5c5a1c30 RCX: 0000000000000018
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622572] RDX: 0000000000000000
> > > > > RSI: 00007f92dccced30 RDI: 0000000000000001
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622572] RBP: 00007f92dccced70
> > > > > R08: 001f9c207abaa2b8 R09: 00007fff9f29c080
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622573] R10: 00000000048732ba
> > > > > R11: 0000000000000001 R12: 0000000000000028
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622574] R13: 0000561d92af6950
> > > > > R14: 0000000000000002 R15: 0000000000000001
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622613] INFO: task mongod:23344
> > > > > blocked for more than 120 seconds.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622628] Not tainted 4.18.10 #1
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622638] "echo 0 >
> > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622653] mongod D 0
> > > > > 23344 15560 0x00000120
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622655] Call Trace:
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622657] ? __schedule+0x299/0x880
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622659] schedule+0x28/0x80
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622661]
> > > > > btrfs_start_ordered_extent+0xea/0x120
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622663] ? wait_woken+0x80/0x80
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622665]
> > > > > btrfs_wait_ordered_range+0xbb/0x100
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622667] btrfs_sync_file+0x30c/0x400
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622671] do_fsync+0x38/0x60
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622673] __x64_sys_fdatasync+0x13/0x20
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622675] do_syscall_64+0x4f/0x100
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622677]
> > > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622678] RIP: 0033:0x7f3d36c122e7
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622679] Code: Bad RIP value.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622681] RSP:
> > > > > 002b:00007f3d31aef560 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622683] RAX: ffffffffffffffda
> > > > > RBX: 000000000000000c RCX: 00007f3d36c122e7
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622683] RDX: 0000000000000000
> > > > > RSI: 000000000000000c RDI: 000000000000000c
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622684] RBP: 00007f3d31aef5a0
> > > > > R08: 00005633b3ab9c00 R09: 00007ffeef7a3080
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622685] R10: 0000000004873546
> > > > > R11: 0000000000000293 R12: 00005633b3ab9cc0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622686] R13: 00005633b1075d90
> > > > > R14: 0000000000000000 R15: 00005633b39a20f8
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622688] INFO: task ftdc:23347
> > > > > blocked for more than 120 seconds.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622702] Not tainted 4.18.10 #1
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622713] "echo 0 >
> > > > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622728] ftdc D 0
> > > > > 23347 15560 0x00000120
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622729] Call Trace:
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622731] ? __schedule+0x299/0x880
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622733] schedule+0x28/0x80
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622735] wait_current_trans+0xad/0xe0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622737] ? wait_woken+0x80/0x80
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622738] start_transaction+0x2ee/0x3c0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622740] btrfs_create+0x57/0x1f0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622743] path_openat+0x13c1/0x16a0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622746] do_filp_open+0x9b/0x110
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622748] ? try_lookup_one_len+0x70/0x70
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622751] ? close_pdeo+0x93/0xf0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622753] ? __check_object_size+0xa7/0x1a0
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622756] ? __alloc_fd+0x3d/0x160
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622758] ? do_sys_open+0x1bd/0x250
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622760] do_sys_open+0x1bd/0x250
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622762] do_syscall_64+0x4f/0x100
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622763]
> > > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622764] RIP: 0033:0x7f3d36c0ad19
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622765] Code: Bad RIP value.
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622767] RSP:
> > > > > 002b:00007f3d302ec200 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622769] RAX: ffffffffffffffda
> > > > > RBX: 00005633b7197d80 RCX: 00007f3d36c0ad19
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622769] RDX: 0000000000000241
> > > > > RSI: 00005633b7106740 RDI: 00000000ffffff9c
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622770] RBP: 00005633b1077b13
> > > > > R08: 0000000000000000 R09: 00005633b1077b14
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622771] R10: 00000000000001b6
> > > > > R11: 0000000000000293 R12: 0000000000000004
> > > > > Oct 2 08:04:25 cloud1 kernel: [165658.622771] R13: 00005633b1077b12
> > > > > R14: 00007f3d302ec720 R15: 00005633b9942004
> > > >
> > > > Are you using mount option space_cache=v2?
> > > >
> > > > If not, you should enable that and see if you experience the same behavior.  I believe it's highly recommended on all large BTRFS arrays, and I had intermittent freezing problems on large writes before I enabled it.  I do not appear to have those problems any more...

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Writing a large file causes odd freeze
  2018-10-02 13:31 Writing a large file causes odd freeze Gerard Saraber
  2018-10-02 15:24 ` joshua
@ 2018-10-03  0:22 ` Qu Wenruo
  2018-10-03 12:51   ` Gerard Saraber
  1 sibling, 1 reply; 8+ messages in thread
From: Qu Wenruo @ 2018-10-03  0:22 UTC (permalink / raw)
  To: Gerard Saraber, Btrfs BTRFS


[-- Attachment #1.1: Type: text/plain, Size: 14597 bytes --]



On 2018/10/2 下午9:31, Gerard Saraber wrote:
> I have a 25TB mirrored filesystem that I'm able to consistently freeze
> by ungzipping a large file.
> The filesystem scrubs complete without errors and smartd reports no
> errors at the moment.
> 
> The command is:
> gzip -dc /btrfsarray/largefile.gz > /btrfs/db/output.sql
> At around 31GB it quits writing to the output file, and I start
> getting the 'hung task' messages in my kernel log.
> They are 3.0GB gzipped SQL dumps, that uncompress into about 42GB SQL
> dump files.

I assume you're not using qgroups, right?

Just to rule out some obvious transaction slowdown causes.

BTW, as you're already using space_cache=v2, what about disable free
space cache completely?
If it's really caused by space cache writing at transaction committing
time, disabling it more or release reduce the hang time.

Thanks,
Qu

> 
> after that 'ps auwx' completely hangs and I can't write to the
> filesystem anymore, no errors, the processes just hang. I can still
> read from the FS as far as I can tell.
> I'm not sure how to diagnose it, or if i've simply ran into some kind
> of bug (may not even be BTRFS related)
> 
> Linux cloud1 4.18.10 #1 SMP Fri Sep 28 07:44:06 CDT 2018 x86_64 x86_64
> x86_64 GNU/Linux
> [ I think It used to work on 4.18.0-rcX ]
> 
> Hardware specs:
>  AMD Ryzen Threadripper 1950X 16-Core Processor
> 128GB ram
> A bunch of 3-8TB SATA drives connected to the motherboard and a pair
> of LSI cards
> 
> root@cloud1:~# btrfs fi usage /btrfsarray
> Overall:
>     Device size:                  79.14TiB
>     Device allocated:             55.38TiB
>     Device unallocated:           23.76TiB
>     Device missing:                  0.00B
>     Used:                         55.30TiB
>     Free (estimated):             11.91TiB      (min: 11.91TiB)
>     Data ratio:                       2.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
> 
> Data,RAID1: Size:27.61TiB, Used:27.58TiB
>    /dev/sda        1.64TiB
>    /dev/sdb        1.64TiB
>    /dev/sdc        1.87TiB
>    /dev/sdd        4.37TiB
>    /dev/sde        2.55TiB
>    /dev/sdf        2.55TiB
>    /dev/sdg        4.37TiB
>    /dev/sdh        4.36TiB
>    /dev/sdi        2.55TiB
>    /dev/sdj        4.36TiB
>    /dev/sdk        4.37TiB
>    /dev/sdl        1.64TiB
>    /dev/sdm        4.37TiB
>    /dev/sdn        2.54TiB
>    /dev/sdo        2.55TiB
>    /dev/sdp        2.55TiB
>    /dev/sdq        2.55TiB
>    /dev/sdr        4.37TiB
> 
> Metadata,RAID1: Size:87.00GiB, Used:77.46GiB
>    /dev/sda        5.00GiB
>    /dev/sdb        5.00GiB
>    /dev/sdd       14.00GiB
>    /dev/sde       12.00GiB
>    /dev/sdf        8.00GiB
>    /dev/sdg        6.00GiB
>    /dev/sdh       20.00GiB
>    /dev/sdi       11.00GiB
>    /dev/sdj       19.00GiB
>    /dev/sdk        5.00GiB
>    /dev/sdl        6.00GiB
>    /dev/sdm        7.00GiB
>    /dev/sdn       17.00GiB
>    /dev/sdo        7.00GiB
>    /dev/sdp       13.00GiB
>    /dev/sdq       12.00GiB
>    /dev/sdr        7.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:5.19MiB
>    /dev/sdc       32.00MiB
>    /dev/sdr       32.00MiB
> 
> Unallocated:
>    /dev/sda        1.08TiB
>    /dev/sdb        1.08TiB
>    /dev/sdc        5.40TiB
>    /dev/sdd        1.08TiB
>    /dev/sde        1.08TiB
>    /dev/sdf        1.08TiB
>    /dev/sdg        1.08TiB
>    /dev/sdh        1.08TiB
>    /dev/sdi        1.08TiB
>    /dev/sdj        1.08TiB
>    /dev/sdk        1.08TiB
>    /dev/sdl        1.08TiB
>    /dev/sdm        1.08TiB
>    /dev/sdn        1.08TiB
>    /dev/sdo        1.08TiB
>    /dev/sdp        1.08TiB
>    /dev/sdq        1.08TiB
>    /dev/sdr        1.08TiB
> 
> 
> Oct  2 08:02:25 cloud1 kernel: [165537.793055] INFO: task smbd:47578
> blocked for more than 120 seconds.
> Oct  2 08:02:25 cloud1 kernel: [165537.793068]       Not tainted 4.18.10 #1
> Oct  2 08:02:25 cloud1 kernel: [165537.793079] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct  2 08:02:25 cloud1 kernel: [165537.793094] smbd            D    0
> 47578   3375 0x00000000
> Oct  2 08:02:25 cloud1 kernel: [165537.793096] Call Trace:
> Oct  2 08:02:25 cloud1 kernel: [165537.793098]  ? __schedule+0x299/0x880
> Oct  2 08:02:25 cloud1 kernel: [165537.793100]  schedule+0x28/0x80
> Oct  2 08:02:25 cloud1 kernel: [165537.793102]  wait_current_trans+0xad/0xe0
> Oct  2 08:02:25 cloud1 kernel: [165537.793104]  ? wait_woken+0x80/0x80
> Oct  2 08:02:25 cloud1 kernel: [165537.793106]  start_transaction+0x2ee/0x3c0
> Oct  2 08:02:25 cloud1 kernel: [165537.793108]  btrfs_sync_file+0x279/0x400
> Oct  2 08:02:25 cloud1 kernel: [165537.793110]
> btrfs_file_write_iter+0x461/0x576
> Oct  2 08:02:25 cloud1 kernel: [165537.793111]  __vfs_write+0x114/0x1a0
> Oct  2 08:02:25 cloud1 kernel: [165537.793113]  vfs_write+0xad/0x1a0
> Oct  2 08:02:25 cloud1 kernel: [165537.793115]  ksys_pwrite64+0x71/0x90
> Oct  2 08:02:25 cloud1 kernel: [165537.793116]  ? __switch_to_asm+0x34/0x70
> Oct  2 08:02:25 cloud1 kernel: [165537.793117]  do_syscall_64+0x4f/0x100
> Oct  2 08:02:25 cloud1 kernel: [165537.793119]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Oct  2 08:02:25 cloud1 kernel: [165537.793120] RIP: 0033:0x7f82af38104f
> Oct  2 08:02:25 cloud1 kernel: [165537.793120] Code: Bad RIP value.
> Oct  2 08:02:25 cloud1 kernel: [165537.793122] RSP:
> 002b:00007f8296b4fd70 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> Oct  2 08:02:25 cloud1 kernel: [165537.793124] RAX: ffffffffffffffda
> RBX: 0000000000000009 RCX: 00007f82af38104f
> Oct  2 08:02:25 cloud1 kernel: [165537.793124] RDX: 0000000000001000
> RSI: 000055e06183fea0 RDI: 0000000000000009
> Oct  2 08:02:25 cloud1 kernel: [165537.793125] RBP: 000055e06183fea0
> R08: 0000000000000000 R09: 00007fffac636080
> Oct  2 08:02:25 cloud1 kernel: [165537.793126] R10: 00000000089c8000
> R11: 0000000000000293 R12: 0000000000001000
> Oct  2 08:02:25 cloud1 kernel: [165537.793127] R13: 00000000089c8000
> R14: 00007f82aa40e090 R15: 000055e0616b5f70
> 
> 
> Oct  2 08:04:25 cloud1 kernel: [165658.622345] INFO: task
> btrfs-transacti:12370 blocked for more than 120 seconds.
> Oct  2 08:04:25 cloud1 kernel: [165658.622366]       Not tainted 4.18.10 #1
> Oct  2 08:04:25 cloud1 kernel: [165658.622378] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct  2 08:04:25 cloud1 kernel: [165658.622393] btrfs-transacti D    0
> 12370      2 0x80000000
> Oct  2 08:04:25 cloud1 kernel: [165658.622396] Call Trace:
> Oct  2 08:04:25 cloud1 kernel: [165658.622406]  ? __schedule+0x299/0x880
> Oct  2 08:04:25 cloud1 kernel: [165658.622408]  ?
> _raw_spin_unlock_irqrestore+0xa/0x10
> Oct  2 08:04:25 cloud1 kernel: [165658.622410]  schedule+0x28/0x80
> Oct  2 08:04:25 cloud1 kernel: [165658.622414]
> btrfs_commit_transaction+0x760/0x870
> Oct  2 08:04:25 cloud1 kernel: [165658.622418]  ? wait_woken+0x80/0x80
> Oct  2 08:04:25 cloud1 kernel: [165658.622420]  transaction_kthread+0x12e/0x150
> Oct  2 08:04:25 cloud1 kernel: [165658.622422]  ?
> btrfs_cleanup_transaction+0x520/0x520
> Oct  2 08:04:25 cloud1 kernel: [165658.622425]  kthread+0x113/0x130
> Oct  2 08:04:25 cloud1 kernel: [165658.622427]  ?
> kthread_create_worker_on_cpu+0x70/0x70
> Oct  2 08:04:25 cloud1 kernel: [165658.622428]  ret_from_fork+0x22/0x40
> Oct  2 08:04:25 cloud1 kernel: [165658.622487] INFO: task jsvc:15530
> blocked for more than 120 seconds.
> Oct  2 08:04:25 cloud1 kernel: [165658.622501]       Not tainted 4.18.10 #1
> Oct  2 08:04:25 cloud1 kernel: [165658.622512] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct  2 08:04:25 cloud1 kernel: [165658.622527] jsvc            D    0
> 15530  15312 0x00000120
> Oct  2 08:04:25 cloud1 kernel: [165658.622529] Call Trace:
> Oct  2 08:04:25 cloud1 kernel: [165658.622531]  ? __schedule+0x299/0x880
> Oct  2 08:04:25 cloud1 kernel: [165658.622533]  schedule+0x28/0x80
> Oct  2 08:04:25 cloud1 kernel: [165658.622536]
> btrfs_start_ordered_extent+0xea/0x120
> Oct  2 08:04:25 cloud1 kernel: [165658.622538]  ? wait_woken+0x80/0x80
> Oct  2 08:04:25 cloud1 kernel: [165658.622540]  btrfs_page_mkwrite+0x1f3/0x500
> Oct  2 08:04:25 cloud1 kernel: [165658.622544]  do_page_mkwrite+0x31/0x90
> Oct  2 08:04:25 cloud1 kernel: [165658.622546]  do_wp_page+0x214/0x5a0
> Oct  2 08:04:25 cloud1 kernel: [165658.622548]  __handle_mm_fault+0xb6a/0x1260
> Oct  2 08:04:25 cloud1 kernel: [165658.622551]  ? __seccomp_filter+0x44/0x4c0
> Oct  2 08:04:25 cloud1 kernel: [165658.622553]  handle_mm_fault+0xc6/0x200
> Oct  2 08:04:25 cloud1 kernel: [165658.622556]  __do_page_fault+0x24c/0x4d0
> Oct  2 08:04:25 cloud1 kernel: [165658.622558]  ? page_fault+0x8/0x30
> Oct  2 08:04:25 cloud1 kernel: [165658.622559]  page_fault+0x1e/0x30
> Oct  2 08:04:25 cloud1 kernel: [165658.622562] RIP: 0033:0x7f9b5aadb47e
> Oct  2 08:04:25 cloud1 kernel: [165658.622563] Code: Bad RIP value.
> Oct  2 08:04:25 cloud1 kernel: [165658.622569] RSP:
> 002b:00007f92dccced60 EFLAGS: 00010246
> Oct  2 08:04:25 cloud1 kernel: [165658.622571] RAX: 00009606c50d609c
> RBX: 00007f9b5c5a1c30 RCX: 0000000000000018
> Oct  2 08:04:25 cloud1 kernel: [165658.622572] RDX: 0000000000000000
> RSI: 00007f92dccced30 RDI: 0000000000000001
> Oct  2 08:04:25 cloud1 kernel: [165658.622572] RBP: 00007f92dccced70
> R08: 001f9c207abaa2b8 R09: 00007fff9f29c080
> Oct  2 08:04:25 cloud1 kernel: [165658.622573] R10: 00000000048732ba
> R11: 0000000000000001 R12: 0000000000000028
> Oct  2 08:04:25 cloud1 kernel: [165658.622574] R13: 0000561d92af6950
> R14: 0000000000000002 R15: 0000000000000001
> Oct  2 08:04:25 cloud1 kernel: [165658.622613] INFO: task mongod:23344
> blocked for more than 120 seconds.
> Oct  2 08:04:25 cloud1 kernel: [165658.622628]       Not tainted 4.18.10 #1
> Oct  2 08:04:25 cloud1 kernel: [165658.622638] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct  2 08:04:25 cloud1 kernel: [165658.622653] mongod          D    0
> 23344  15560 0x00000120
> Oct  2 08:04:25 cloud1 kernel: [165658.622655] Call Trace:
> Oct  2 08:04:25 cloud1 kernel: [165658.622657]  ? __schedule+0x299/0x880
> Oct  2 08:04:25 cloud1 kernel: [165658.622659]  schedule+0x28/0x80
> Oct  2 08:04:25 cloud1 kernel: [165658.622661]
> btrfs_start_ordered_extent+0xea/0x120
> Oct  2 08:04:25 cloud1 kernel: [165658.622663]  ? wait_woken+0x80/0x80
> Oct  2 08:04:25 cloud1 kernel: [165658.622665]
> btrfs_wait_ordered_range+0xbb/0x100
> Oct  2 08:04:25 cloud1 kernel: [165658.622667]  btrfs_sync_file+0x30c/0x400
> Oct  2 08:04:25 cloud1 kernel: [165658.622671]  do_fsync+0x38/0x60
> Oct  2 08:04:25 cloud1 kernel: [165658.622673]  __x64_sys_fdatasync+0x13/0x20
> Oct  2 08:04:25 cloud1 kernel: [165658.622675]  do_syscall_64+0x4f/0x100
> Oct  2 08:04:25 cloud1 kernel: [165658.622677]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Oct  2 08:04:25 cloud1 kernel: [165658.622678] RIP: 0033:0x7f3d36c122e7
> Oct  2 08:04:25 cloud1 kernel: [165658.622679] Code: Bad RIP value.
> Oct  2 08:04:25 cloud1 kernel: [165658.622681] RSP:
> 002b:00007f3d31aef560 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
> Oct  2 08:04:25 cloud1 kernel: [165658.622683] RAX: ffffffffffffffda
> RBX: 000000000000000c RCX: 00007f3d36c122e7
> Oct  2 08:04:25 cloud1 kernel: [165658.622683] RDX: 0000000000000000
> RSI: 000000000000000c RDI: 000000000000000c
> Oct  2 08:04:25 cloud1 kernel: [165658.622684] RBP: 00007f3d31aef5a0
> R08: 00005633b3ab9c00 R09: 00007ffeef7a3080
> Oct  2 08:04:25 cloud1 kernel: [165658.622685] R10: 0000000004873546
> R11: 0000000000000293 R12: 00005633b3ab9cc0
> Oct  2 08:04:25 cloud1 kernel: [165658.622686] R13: 00005633b1075d90
> R14: 0000000000000000 R15: 00005633b39a20f8
> Oct  2 08:04:25 cloud1 kernel: [165658.622688] INFO: task ftdc:23347
> blocked for more than 120 seconds.
> Oct  2 08:04:25 cloud1 kernel: [165658.622702]       Not tainted 4.18.10 #1
> Oct  2 08:04:25 cloud1 kernel: [165658.622713] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct  2 08:04:25 cloud1 kernel: [165658.622728] ftdc            D    0
> 23347  15560 0x00000120
> Oct  2 08:04:25 cloud1 kernel: [165658.622729] Call Trace:
> Oct  2 08:04:25 cloud1 kernel: [165658.622731]  ? __schedule+0x299/0x880
> Oct  2 08:04:25 cloud1 kernel: [165658.622733]  schedule+0x28/0x80
> Oct  2 08:04:25 cloud1 kernel: [165658.622735]  wait_current_trans+0xad/0xe0
> Oct  2 08:04:25 cloud1 kernel: [165658.622737]  ? wait_woken+0x80/0x80
> Oct  2 08:04:25 cloud1 kernel: [165658.622738]  start_transaction+0x2ee/0x3c0
> Oct  2 08:04:25 cloud1 kernel: [165658.622740]  btrfs_create+0x57/0x1f0
> Oct  2 08:04:25 cloud1 kernel: [165658.622743]  path_openat+0x13c1/0x16a0
> Oct  2 08:04:25 cloud1 kernel: [165658.622746]  do_filp_open+0x9b/0x110
> Oct  2 08:04:25 cloud1 kernel: [165658.622748]  ? try_lookup_one_len+0x70/0x70
> Oct  2 08:04:25 cloud1 kernel: [165658.622751]  ? close_pdeo+0x93/0xf0
> Oct  2 08:04:25 cloud1 kernel: [165658.622753]  ? __check_object_size+0xa7/0x1a0
> Oct  2 08:04:25 cloud1 kernel: [165658.622756]  ? __alloc_fd+0x3d/0x160
> Oct  2 08:04:25 cloud1 kernel: [165658.622758]  ? do_sys_open+0x1bd/0x250
> Oct  2 08:04:25 cloud1 kernel: [165658.622760]  do_sys_open+0x1bd/0x250
> Oct  2 08:04:25 cloud1 kernel: [165658.622762]  do_syscall_64+0x4f/0x100
> Oct  2 08:04:25 cloud1 kernel: [165658.622763]
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> Oct  2 08:04:25 cloud1 kernel: [165658.622764] RIP: 0033:0x7f3d36c0ad19
> Oct  2 08:04:25 cloud1 kernel: [165658.622765] Code: Bad RIP value.
> Oct  2 08:04:25 cloud1 kernel: [165658.622767] RSP:
> 002b:00007f3d302ec200 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> Oct  2 08:04:25 cloud1 kernel: [165658.622769] RAX: ffffffffffffffda
> RBX: 00005633b7197d80 RCX: 00007f3d36c0ad19
> Oct  2 08:04:25 cloud1 kernel: [165658.622769] RDX: 0000000000000241
> RSI: 00005633b7106740 RDI: 00000000ffffff9c
> Oct  2 08:04:25 cloud1 kernel: [165658.622770] RBP: 00005633b1077b13
> R08: 0000000000000000 R09: 00005633b1077b14
> Oct  2 08:04:25 cloud1 kernel: [165658.622771] R10: 00000000000001b6
> R11: 0000000000000293 R12: 0000000000000004
> Oct  2 08:04:25 cloud1 kernel: [165658.622771] R13: 00005633b1077b12
> R14: 00007f3d302ec720 R15: 00005633b9942004
> 


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Writing a large file causes odd freeze
  2018-10-03  0:22 ` Qu Wenruo
@ 2018-10-03 12:51   ` Gerard Saraber
  0 siblings, 0 replies; 8+ messages in thread
From: Gerard Saraber @ 2018-10-03 12:51 UTC (permalink / raw)
  To: Qu Wenruo; +Cc: Btrfs BTRFS

No qgroups, just a big FS to dump backups.
I'll try it without free space cache too.

I appreciate all the suggestions!
On Tue, Oct 2, 2018 at 7:22 PM Qu Wenruo <quwenruo.btrfs@gmx.com> wrote:
>
>
>
> On 2018/10/2 下午9:31, Gerard Saraber wrote:
> > I have a 25TB mirrored filesystem that I'm able to consistently freeze
> > by ungzipping a large file.
> > The filesystem scrubs complete without errors and smartd reports no
> > errors at the moment.
> >
> > The command is:
> > gzip -dc /btrfsarray/largefile.gz > /btrfs/db/output.sql
> > At around 31GB it quits writing to the output file, and I start
> > getting the 'hung task' messages in my kernel log.
> > They are 3.0GB gzipped SQL dumps, that uncompress into about 42GB SQL
> > dump files.
>
> I assume you're not using qgroups, right?
>
> Just to rule out some obvious transaction slowdown causes.
>
> BTW, as you're already using space_cache=v2, what about disable free
> space cache completely?
> If it's really caused by space cache writing at transaction committing
> time, disabling it more or release reduce the hang time.
>
> Thanks,
> Qu
>
> >
> > after that 'ps auwx' completely hangs and I can't write to the
> > filesystem anymore, no errors, the processes just hang. I can still
> > read from the FS as far as I can tell.
> > I'm not sure how to diagnose it, or if i've simply ran into some kind
> > of bug (may not even be BTRFS related)
> >
> > Linux cloud1 4.18.10 #1 SMP Fri Sep 28 07:44:06 CDT 2018 x86_64 x86_64
> > x86_64 GNU/Linux
> > [ I think It used to work on 4.18.0-rcX ]
> >
> > Hardware specs:
> >  AMD Ryzen Threadripper 1950X 16-Core Processor
> > 128GB ram
> > A bunch of 3-8TB SATA drives connected to the motherboard and a pair
> > of LSI cards
> >
> > root@cloud1:~# btrfs fi usage /btrfsarray
> > Overall:
> >     Device size:                  79.14TiB
> >     Device allocated:             55.38TiB
> >     Device unallocated:           23.76TiB
> >     Device missing:                  0.00B
> >     Used:                         55.30TiB
> >     Free (estimated):             11.91TiB      (min: 11.91TiB)
> >     Data ratio:                       2.00
> >     Metadata ratio:                   2.00
> >     Global reserve:              512.00MiB      (used: 0.00B)
> >
> > Data,RAID1: Size:27.61TiB, Used:27.58TiB
> >    /dev/sda        1.64TiB
> >    /dev/sdb        1.64TiB
> >    /dev/sdc        1.87TiB
> >    /dev/sdd        4.37TiB
> >    /dev/sde        2.55TiB
> >    /dev/sdf        2.55TiB
> >    /dev/sdg        4.37TiB
> >    /dev/sdh        4.36TiB
> >    /dev/sdi        2.55TiB
> >    /dev/sdj        4.36TiB
> >    /dev/sdk        4.37TiB
> >    /dev/sdl        1.64TiB
> >    /dev/sdm        4.37TiB
> >    /dev/sdn        2.54TiB
> >    /dev/sdo        2.55TiB
> >    /dev/sdp        2.55TiB
> >    /dev/sdq        2.55TiB
> >    /dev/sdr        4.37TiB
> >
> > Metadata,RAID1: Size:87.00GiB, Used:77.46GiB
> >    /dev/sda        5.00GiB
> >    /dev/sdb        5.00GiB
> >    /dev/sdd       14.00GiB
> >    /dev/sde       12.00GiB
> >    /dev/sdf        8.00GiB
> >    /dev/sdg        6.00GiB
> >    /dev/sdh       20.00GiB
> >    /dev/sdi       11.00GiB
> >    /dev/sdj       19.00GiB
> >    /dev/sdk        5.00GiB
> >    /dev/sdl        6.00GiB
> >    /dev/sdm        7.00GiB
> >    /dev/sdn       17.00GiB
> >    /dev/sdo        7.00GiB
> >    /dev/sdp       13.00GiB
> >    /dev/sdq       12.00GiB
> >    /dev/sdr        7.00GiB
> >
> > System,RAID1: Size:32.00MiB, Used:5.19MiB
> >    /dev/sdc       32.00MiB
> >    /dev/sdr       32.00MiB
> >
> > Unallocated:
> >    /dev/sda        1.08TiB
> >    /dev/sdb        1.08TiB
> >    /dev/sdc        5.40TiB
> >    /dev/sdd        1.08TiB
> >    /dev/sde        1.08TiB
> >    /dev/sdf        1.08TiB
> >    /dev/sdg        1.08TiB
> >    /dev/sdh        1.08TiB
> >    /dev/sdi        1.08TiB
> >    /dev/sdj        1.08TiB
> >    /dev/sdk        1.08TiB
> >    /dev/sdl        1.08TiB
> >    /dev/sdm        1.08TiB
> >    /dev/sdn        1.08TiB
> >    /dev/sdo        1.08TiB
> >    /dev/sdp        1.08TiB
> >    /dev/sdq        1.08TiB
> >    /dev/sdr        1.08TiB
> >
> >
> > Oct  2 08:02:25 cloud1 kernel: [165537.793055] INFO: task smbd:47578
> > blocked for more than 120 seconds.
> > Oct  2 08:02:25 cloud1 kernel: [165537.793068]       Not tainted 4.18.10 #1
> > Oct  2 08:02:25 cloud1 kernel: [165537.793079] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:02:25 cloud1 kernel: [165537.793094] smbd            D    0
> > 47578   3375 0x00000000
> > Oct  2 08:02:25 cloud1 kernel: [165537.793096] Call Trace:
> > Oct  2 08:02:25 cloud1 kernel: [165537.793098]  ? __schedule+0x299/0x880
> > Oct  2 08:02:25 cloud1 kernel: [165537.793100]  schedule+0x28/0x80
> > Oct  2 08:02:25 cloud1 kernel: [165537.793102]  wait_current_trans+0xad/0xe0
> > Oct  2 08:02:25 cloud1 kernel: [165537.793104]  ? wait_woken+0x80/0x80
> > Oct  2 08:02:25 cloud1 kernel: [165537.793106]  start_transaction+0x2ee/0x3c0
> > Oct  2 08:02:25 cloud1 kernel: [165537.793108]  btrfs_sync_file+0x279/0x400
> > Oct  2 08:02:25 cloud1 kernel: [165537.793110]
> > btrfs_file_write_iter+0x461/0x576
> > Oct  2 08:02:25 cloud1 kernel: [165537.793111]  __vfs_write+0x114/0x1a0
> > Oct  2 08:02:25 cloud1 kernel: [165537.793113]  vfs_write+0xad/0x1a0
> > Oct  2 08:02:25 cloud1 kernel: [165537.793115]  ksys_pwrite64+0x71/0x90
> > Oct  2 08:02:25 cloud1 kernel: [165537.793116]  ? __switch_to_asm+0x34/0x70
> > Oct  2 08:02:25 cloud1 kernel: [165537.793117]  do_syscall_64+0x4f/0x100
> > Oct  2 08:02:25 cloud1 kernel: [165537.793119]
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > Oct  2 08:02:25 cloud1 kernel: [165537.793120] RIP: 0033:0x7f82af38104f
> > Oct  2 08:02:25 cloud1 kernel: [165537.793120] Code: Bad RIP value.
> > Oct  2 08:02:25 cloud1 kernel: [165537.793122] RSP:
> > 002b:00007f8296b4fd70 EFLAGS: 00000293 ORIG_RAX: 0000000000000012
> > Oct  2 08:02:25 cloud1 kernel: [165537.793124] RAX: ffffffffffffffda
> > RBX: 0000000000000009 RCX: 00007f82af38104f
> > Oct  2 08:02:25 cloud1 kernel: [165537.793124] RDX: 0000000000001000
> > RSI: 000055e06183fea0 RDI: 0000000000000009
> > Oct  2 08:02:25 cloud1 kernel: [165537.793125] RBP: 000055e06183fea0
> > R08: 0000000000000000 R09: 00007fffac636080
> > Oct  2 08:02:25 cloud1 kernel: [165537.793126] R10: 00000000089c8000
> > R11: 0000000000000293 R12: 0000000000001000
> > Oct  2 08:02:25 cloud1 kernel: [165537.793127] R13: 00000000089c8000
> > R14: 00007f82aa40e090 R15: 000055e0616b5f70
> >
> >
> > Oct  2 08:04:25 cloud1 kernel: [165658.622345] INFO: task
> > btrfs-transacti:12370 blocked for more than 120 seconds.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622366]       Not tainted 4.18.10 #1
> > Oct  2 08:04:25 cloud1 kernel: [165658.622378] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622393] btrfs-transacti D    0
> > 12370      2 0x80000000
> > Oct  2 08:04:25 cloud1 kernel: [165658.622396] Call Trace:
> > Oct  2 08:04:25 cloud1 kernel: [165658.622406]  ? __schedule+0x299/0x880
> > Oct  2 08:04:25 cloud1 kernel: [165658.622408]  ?
> > _raw_spin_unlock_irqrestore+0xa/0x10
> > Oct  2 08:04:25 cloud1 kernel: [165658.622410]  schedule+0x28/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622414]
> > btrfs_commit_transaction+0x760/0x870
> > Oct  2 08:04:25 cloud1 kernel: [165658.622418]  ? wait_woken+0x80/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622420]  transaction_kthread+0x12e/0x150
> > Oct  2 08:04:25 cloud1 kernel: [165658.622422]  ?
> > btrfs_cleanup_transaction+0x520/0x520
> > Oct  2 08:04:25 cloud1 kernel: [165658.622425]  kthread+0x113/0x130
> > Oct  2 08:04:25 cloud1 kernel: [165658.622427]  ?
> > kthread_create_worker_on_cpu+0x70/0x70
> > Oct  2 08:04:25 cloud1 kernel: [165658.622428]  ret_from_fork+0x22/0x40
> > Oct  2 08:04:25 cloud1 kernel: [165658.622487] INFO: task jsvc:15530
> > blocked for more than 120 seconds.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622501]       Not tainted 4.18.10 #1
> > Oct  2 08:04:25 cloud1 kernel: [165658.622512] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622527] jsvc            D    0
> > 15530  15312 0x00000120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622529] Call Trace:
> > Oct  2 08:04:25 cloud1 kernel: [165658.622531]  ? __schedule+0x299/0x880
> > Oct  2 08:04:25 cloud1 kernel: [165658.622533]  schedule+0x28/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622536]
> > btrfs_start_ordered_extent+0xea/0x120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622538]  ? wait_woken+0x80/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622540]  btrfs_page_mkwrite+0x1f3/0x500
> > Oct  2 08:04:25 cloud1 kernel: [165658.622544]  do_page_mkwrite+0x31/0x90
> > Oct  2 08:04:25 cloud1 kernel: [165658.622546]  do_wp_page+0x214/0x5a0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622548]  __handle_mm_fault+0xb6a/0x1260
> > Oct  2 08:04:25 cloud1 kernel: [165658.622551]  ? __seccomp_filter+0x44/0x4c0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622553]  handle_mm_fault+0xc6/0x200
> > Oct  2 08:04:25 cloud1 kernel: [165658.622556]  __do_page_fault+0x24c/0x4d0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622558]  ? page_fault+0x8/0x30
> > Oct  2 08:04:25 cloud1 kernel: [165658.622559]  page_fault+0x1e/0x30
> > Oct  2 08:04:25 cloud1 kernel: [165658.622562] RIP: 0033:0x7f9b5aadb47e
> > Oct  2 08:04:25 cloud1 kernel: [165658.622563] Code: Bad RIP value.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622569] RSP:
> > 002b:00007f92dccced60 EFLAGS: 00010246
> > Oct  2 08:04:25 cloud1 kernel: [165658.622571] RAX: 00009606c50d609c
> > RBX: 00007f9b5c5a1c30 RCX: 0000000000000018
> > Oct  2 08:04:25 cloud1 kernel: [165658.622572] RDX: 0000000000000000
> > RSI: 00007f92dccced30 RDI: 0000000000000001
> > Oct  2 08:04:25 cloud1 kernel: [165658.622572] RBP: 00007f92dccced70
> > R08: 001f9c207abaa2b8 R09: 00007fff9f29c080
> > Oct  2 08:04:25 cloud1 kernel: [165658.622573] R10: 00000000048732ba
> > R11: 0000000000000001 R12: 0000000000000028
> > Oct  2 08:04:25 cloud1 kernel: [165658.622574] R13: 0000561d92af6950
> > R14: 0000000000000002 R15: 0000000000000001
> > Oct  2 08:04:25 cloud1 kernel: [165658.622613] INFO: task mongod:23344
> > blocked for more than 120 seconds.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622628]       Not tainted 4.18.10 #1
> > Oct  2 08:04:25 cloud1 kernel: [165658.622638] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622653] mongod          D    0
> > 23344  15560 0x00000120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622655] Call Trace:
> > Oct  2 08:04:25 cloud1 kernel: [165658.622657]  ? __schedule+0x299/0x880
> > Oct  2 08:04:25 cloud1 kernel: [165658.622659]  schedule+0x28/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622661]
> > btrfs_start_ordered_extent+0xea/0x120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622663]  ? wait_woken+0x80/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622665]
> > btrfs_wait_ordered_range+0xbb/0x100
> > Oct  2 08:04:25 cloud1 kernel: [165658.622667]  btrfs_sync_file+0x30c/0x400
> > Oct  2 08:04:25 cloud1 kernel: [165658.622671]  do_fsync+0x38/0x60
> > Oct  2 08:04:25 cloud1 kernel: [165658.622673]  __x64_sys_fdatasync+0x13/0x20
> > Oct  2 08:04:25 cloud1 kernel: [165658.622675]  do_syscall_64+0x4f/0x100
> > Oct  2 08:04:25 cloud1 kernel: [165658.622677]
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > Oct  2 08:04:25 cloud1 kernel: [165658.622678] RIP: 0033:0x7f3d36c122e7
> > Oct  2 08:04:25 cloud1 kernel: [165658.622679] Code: Bad RIP value.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622681] RSP:
> > 002b:00007f3d31aef560 EFLAGS: 00000293 ORIG_RAX: 000000000000004b
> > Oct  2 08:04:25 cloud1 kernel: [165658.622683] RAX: ffffffffffffffda
> > RBX: 000000000000000c RCX: 00007f3d36c122e7
> > Oct  2 08:04:25 cloud1 kernel: [165658.622683] RDX: 0000000000000000
> > RSI: 000000000000000c RDI: 000000000000000c
> > Oct  2 08:04:25 cloud1 kernel: [165658.622684] RBP: 00007f3d31aef5a0
> > R08: 00005633b3ab9c00 R09: 00007ffeef7a3080
> > Oct  2 08:04:25 cloud1 kernel: [165658.622685] R10: 0000000004873546
> > R11: 0000000000000293 R12: 00005633b3ab9cc0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622686] R13: 00005633b1075d90
> > R14: 0000000000000000 R15: 00005633b39a20f8
> > Oct  2 08:04:25 cloud1 kernel: [165658.622688] INFO: task ftdc:23347
> > blocked for more than 120 seconds.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622702]       Not tainted 4.18.10 #1
> > Oct  2 08:04:25 cloud1 kernel: [165658.622713] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622728] ftdc            D    0
> > 23347  15560 0x00000120
> > Oct  2 08:04:25 cloud1 kernel: [165658.622729] Call Trace:
> > Oct  2 08:04:25 cloud1 kernel: [165658.622731]  ? __schedule+0x299/0x880
> > Oct  2 08:04:25 cloud1 kernel: [165658.622733]  schedule+0x28/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622735]  wait_current_trans+0xad/0xe0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622737]  ? wait_woken+0x80/0x80
> > Oct  2 08:04:25 cloud1 kernel: [165658.622738]  start_transaction+0x2ee/0x3c0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622740]  btrfs_create+0x57/0x1f0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622743]  path_openat+0x13c1/0x16a0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622746]  do_filp_open+0x9b/0x110
> > Oct  2 08:04:25 cloud1 kernel: [165658.622748]  ? try_lookup_one_len+0x70/0x70
> > Oct  2 08:04:25 cloud1 kernel: [165658.622751]  ? close_pdeo+0x93/0xf0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622753]  ? __check_object_size+0xa7/0x1a0
> > Oct  2 08:04:25 cloud1 kernel: [165658.622756]  ? __alloc_fd+0x3d/0x160
> > Oct  2 08:04:25 cloud1 kernel: [165658.622758]  ? do_sys_open+0x1bd/0x250
> > Oct  2 08:04:25 cloud1 kernel: [165658.622760]  do_sys_open+0x1bd/0x250
> > Oct  2 08:04:25 cloud1 kernel: [165658.622762]  do_syscall_64+0x4f/0x100
> > Oct  2 08:04:25 cloud1 kernel: [165658.622763]
> > entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > Oct  2 08:04:25 cloud1 kernel: [165658.622764] RIP: 0033:0x7f3d36c0ad19
> > Oct  2 08:04:25 cloud1 kernel: [165658.622765] Code: Bad RIP value.
> > Oct  2 08:04:25 cloud1 kernel: [165658.622767] RSP:
> > 002b:00007f3d302ec200 EFLAGS: 00000293 ORIG_RAX: 0000000000000101
> > Oct  2 08:04:25 cloud1 kernel: [165658.622769] RAX: ffffffffffffffda
> > RBX: 00005633b7197d80 RCX: 00007f3d36c0ad19
> > Oct  2 08:04:25 cloud1 kernel: [165658.622769] RDX: 0000000000000241
> > RSI: 00005633b7106740 RDI: 00000000ffffff9c
> > Oct  2 08:04:25 cloud1 kernel: [165658.622770] RBP: 00005633b1077b13
> > R08: 0000000000000000 R09: 00005633b1077b14
> > Oct  2 08:04:25 cloud1 kernel: [165658.622771] R10: 00000000000001b6
> > R11: 0000000000000293 R12: 0000000000000004
> > Oct  2 08:04:25 cloud1 kernel: [165658.622771] R13: 00005633b1077b12
> > R14: 00007f3d302ec720 R15: 00005633b9942004
> >
>

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-10-03 12:52 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-02 13:31 Writing a large file causes odd freeze Gerard Saraber
2018-10-02 15:24 ` joshua
2018-10-02 15:50   ` Gerard Saraber
2018-10-02 15:51     ` Gerard Saraber
2018-10-02 18:22       ` Eli V
2018-10-02 21:12         ` Gerard Saraber
2018-10-03  0:22 ` Qu Wenruo
2018-10-03 12:51   ` Gerard Saraber

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.