linux-btrfs.vger.kernel.org archive mirror
* applications hang on a btrfs spanning two partitions
@ 2019-01-08 19:38 Florian Stecker
  2019-01-09  6:24 ` Nikolay Borisov
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Stecker @ 2019-01-08 19:38 UTC (permalink / raw)
  To: linux-btrfs

Hi everyone,

I extended the btrfs volume on my laptop by adding a second partition to 
it which lies on the same SSD (using btrfs device add). Since I did 
this, all kinds of applications regularly hang for up to 30 seconds. It 
seems they are stuck in the fdatasync syscall. For example:

$ strace -tt -T gajim 2>&1 | grep fdatasync
[...]
11:36:31.112200 fdatasync(25)           = 0 <0.006958>
11:36:32.147525 fdatasync(25)           = 0 <0.008138>
11:36:32.156882 fdatasync(25)           = 0 <0.006866>
11:36:32.165979 fdatasync(25)           = 0 <0.011797>
11:36:32.178867 fdatasync(25)           = 0 <23.636614>
11:36:55.827726 fdatasync(25)           = 0 <0.009595>
11:36:55.838702 fdatasync(25)           = 0 <0.007261>
11:36:55.850440 fdatasync(25)           = 0 <0.006807>
11:36:55.858168 fdatasync(25)           = 0 <0.006767>
[...]

File descriptor 25 here points to a file which is just ~90KB, so it 
really shouldn't take that long.
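
One way to confirm which file the descriptor refers to while gajim is 
running (the pgrep call is just one way of getting the PID):

$ ls -l /proc/$(pgrep -n gajim)/fd/25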

Removing the second partition again resolves the problem. Does anyone 
know this issue? Is it related to btrfs? Or am I just doing something wrong?

Best,
Florian

Some more info:

$ btrfs device usage /
/dev/sda2, ID: 2
    Device size:            52.16GiB
    Device slack:              0.00B
    Data,single:             1.00GiB
    Unallocated:            51.16GiB

/dev/sda8, ID: 1
    Device size:           174.92GiB
    Device slack:              0.00B
    Data,single:           168.91GiB
    Metadata,single:         3.01GiB
    System,single:           4.00MiB
    Unallocated:             3.00GiB

$ fdisk -l /dev/sda
Disk /dev/sda: 238.5 GiB, 256060514304 bytes, 500118192 sectors
Disk model: SAMSUNG SSD PM87
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: A48B5A25-AA84-4D3F-90DD-E8A4991BDF03

Device         Start       End   Sectors   Size Type
/dev/sda1       2048   1026047   1024000   500M EFI System
/dev/sda2    1026048 110422015 109395968  52.2G Linux filesystem
/dev/sda8  110422016 477263871 366841856 174.9G Linux filesystem
/dev/sda9  477263872 481458175   4194304     2G Linux swap

$ uname -a
Linux dell 4.20.0-arch1-1-ARCH #1 SMP PREEMPT Mon Dec 24 03:00:40 UTC 
2018 x86_64 GNU/Linux


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-08 19:38 applications hang on a btrfs spanning two partitions Florian Stecker
@ 2019-01-09  6:24 ` Nikolay Borisov
  2019-01-09  9:16   ` Florian Stecker
  0 siblings, 1 reply; 12+ messages in thread
From: Nikolay Borisov @ 2019-01-09  6:24 UTC (permalink / raw)
  To: Florian Stecker, linux-btrfs



On 8.01.19 г. 21:38 ч., Florian Stecker wrote:
> Hi everyone,
> 
> I extended the btrfs volume on my laptop by adding a second partition to
> it which lies on the same SSD (using btrfs device add). Since I did
> this, all kinds of applications regularly hang for up to 30 seconds. It
> seems they are stuck in the fdatasync syscall. For example:
> 
> $ strace -tt -T gajim 2>&1 | grep fdatasync
> [...]
> 11:36:31.112200 fdatasync(25)           = 0 <0.006958>
> 11:36:32.147525 fdatasync(25)           = 0 <0.008138>
> 11:36:32.156882 fdatasync(25)           = 0 <0.006866>
> 11:36:32.165979 fdatasync(25)           = 0 <0.011797>
> 11:36:32.178867 fdatasync(25)           = 0 <23.636614>
> 11:36:55.827726 fdatasync(25)           = 0 <0.009595>
> 11:36:55.838702 fdatasync(25)           = 0 <0.007261>
> 11:36:55.850440 fdatasync(25)           = 0 <0.006807>
> 11:36:55.858168 fdatasync(25)           = 0 <0.006767>
> [...]
> 
> File descriptor 25 here points to a file which is just ~90KB, so it
> really shouldn't take that long.
> 
> Removing the second partition again resolves the problem. Does anyone
> know this issue? Is it related to btrfs? Or am I just doing something
> wrong?
> 
> Best,
> Florian
> 
> Some more info:
> 
> $ btrfs device usage /
> /dev/sda2, ID: 2
>    Device size:            52.16GiB
>    Device slack:              0.00B
>    Data,single:             1.00GiB
>    Unallocated:            51.16GiB
> 
> /dev/sda8, ID: 1
>    Device size:           174.92GiB
>    Device slack:              0.00B
>    Data,single:           168.91GiB
>    Metadata,single:         3.01GiB
>    System,single:           4.00MiB
>    Unallocated:             3.00GiB
> 
> $ fdisk -l /dev/sda
> Disk /dev/sda: 238.5 GiB, 256060514304 bytes, 500118192 sectors
> Disk model: SAMSUNG SSD PM87
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: gpt
> Disk identifier: A48B5A25-AA84-4D3F-90DD-E8A4991BDF03
> 
> Device         Start       End   Sectors   Size Type
> /dev/sda1       2048   1026047   1024000   500M EFI System
> /dev/sda2    1026048 110422015 109395968  52.2G Linux filesystem
> /dev/sda8  110422016 477263871 366841856 174.9G Linux filesystem
> /dev/sda9  477263872 481458175   4194304     2G Linux swap
> 
> $ uname -a
> Linux dell 4.20.0-arch1-1-ARCH #1 SMP PREEMPT Mon Dec 24 03:00:40 UTC
> 2018 x86_64 GNU/Linux

Provide the output of echo w > /proc/sysrq-trigger when the hang occurs, 
otherwise it's hard to figure out what's going on.
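
Roughly, as root, while the hang is ongoing (assuming sysrq isn't 
restricted on your kernel):

echo 1 > /proc/sys/kernel/sysrq    # enable all SysRq functions if restricted
echo w > /proc/sysrq-trigger       # dump stacks of blocked (D state) tasks
dmesg | tail -n 60                 # the traces end up in the kernel log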


> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-09  6:24 ` Nikolay Borisov
@ 2019-01-09  9:16   ` Florian Stecker
  2019-01-09 10:03     ` Nikolay Borisov
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Stecker @ 2019-01-09  9:16 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs

 >
 > Provide output of echo w > /proc/sysrq-trigger when the hang occurs
 > otherwise it's hard to figure what's going on.
 >

Here's one, again in gajim. This time, fdatasync() took "only" 2 seconds:

[42481.243491] sysrq: SysRq : Show Blocked State
[42481.243494]   task                        PC stack   pid father
[42481.243566] gajim           D    0 15778  15774 0x00000083
[42481.243569] Call Trace:
[42481.243575]  ? __schedule+0x29b/0x8b0
[42481.243576]  ? bit_wait+0x50/0x50
[42481.243578]  schedule+0x32/0x90
[42481.243580]  io_schedule+0x12/0x40
[42481.243582]  bit_wait_io+0xd/0x50
[42481.243583]  __wait_on_bit+0x6c/0x80
[42481.243585]  out_of_line_wait_on_bit+0x91/0xb0
[42481.243587]  ? init_wait_var_entry+0x40/0x40
[42481.243605]  write_all_supers+0x418/0xa70 [btrfs]
[42481.243622]  btrfs_sync_log+0x695/0x910 [btrfs]
[42481.243625]  ? _raw_spin_lock_irqsave+0x25/0x50
[42481.243641]  ? btrfs_log_dentry_safe+0x54/0x70 [btrfs]
[42481.243655]  btrfs_sync_file+0x3a9/0x3d0 [btrfs]
[42481.243659]  do_fsync+0x38/0x70
[42481.243661]  __x64_sys_fdatasync+0x13/0x20
[42481.243663]  do_syscall_64+0x5b/0x170
[42481.243666]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[42481.243667] RIP: 0033:0x7fd4022f873f
[42481.243671] Code: Bad RIP value.
[42481.243672] RSP: 002b:00007ffd3710a300 EFLAGS: 00000293 ORIG_RAX: 
000000000000004b
[42481.243674] RAX: ffffffffffffffda RBX: 0000000000000019 RCX: 
00007fd4022f873f
[42481.243675] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 
0000000000000019
[42481.243675] RBP: 0000000000000000 R08: 000055d8d8649f68 R09: 
00007ffd3710a320
[42481.243676] R10: 0000000000013000 R11: 0000000000000293 R12: 
0000000000000000
[42481.243677] R13: 0000000000000000 R14: 000055d8d8363fa0 R15: 
000055d8d8613040


On 1/9/19 7:24 AM, Nikolay Borisov wrote:
> 
> 
> On 8.01.19 г. 21:38 ч., Florian Stecker wrote:
>> Hi everyone,
>>
>> I extended the btrfs volume on my laptop by adding a second partition to
>> it which lies on the same SSD (using btrfs device add). Since I did
>> this, all kinds of applications regularly hang for up to 30 seconds. It
>> seems they are stuck in the fdatasync syscall. For example:
>>
>> $ strace -tt -T gajim 2>&1 | grep fdatasync
>> [...]
>> 11:36:31.112200 fdatasync(25)           = 0 <0.006958>
>> 11:36:32.147525 fdatasync(25)           = 0 <0.008138>
>> 11:36:32.156882 fdatasync(25)           = 0 <0.006866>
>> 11:36:32.165979 fdatasync(25)           = 0 <0.011797>
>> 11:36:32.178867 fdatasync(25)           = 0 <23.636614>
>> 11:36:55.827726 fdatasync(25)           = 0 <0.009595>
>> 11:36:55.838702 fdatasync(25)           = 0 <0.007261>
>> 11:36:55.850440 fdatasync(25)           = 0 <0.006807>
>> 11:36:55.858168 fdatasync(25)           = 0 <0.006767>
>> [...]
>>
>> File descriptor 25 here points to a file which is just ~90KB, so it
>> really shouldn't take that long.
>>
>> Removing the second partition again resolves the problem. Does anyone
>> know this issue? Is it related to btrfs? Or am I just doing something
>> wrong?
>>
>> Best,
>> Florian
>>
>> Some more info:
>>
>> $ btrfs device usage /
>> /dev/sda2, ID: 2
>>     Device size:            52.16GiB
>>     Device slack:              0.00B
>>     Data,single:             1.00GiB
>>     Unallocated:            51.16GiB
>>
>> /dev/sda8, ID: 1
>>     Device size:           174.92GiB
>>     Device slack:              0.00B
>>     Data,single:           168.91GiB
>>     Metadata,single:         3.01GiB
>>     System,single:           4.00MiB
>>     Unallocated:             3.00GiB
>>
>> $ fdisk -l /dev/sda
>> Disk /dev/sda: 238.5 GiB, 256060514304 bytes, 500118192 sectors
>> Disk model: SAMSUNG SSD PM87
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 512 bytes
>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>> Disklabel type: gpt
>> Disk identifier: A48B5A25-AA84-4D3F-90DD-E8A4991BDF03
>>
>> Device         Start       End   Sectors   Size Type
>> /dev/sda1       2048   1026047   1024000   500M EFI System
>> /dev/sda2    1026048 110422015 109395968  52.2G Linux filesystem
>> /dev/sda8  110422016 477263871 366841856 174.9G Linux filesystem
>> /dev/sda9  477263872 481458175   4194304     2G Linux swap
>>
>> $ uname -a
>> Linux dell 4.20.0-arch1-1-ARCH #1 SMP PREEMPT Mon Dec 24 03:00:40 UTC
>> 2018 x86_64 GNU/Linux

> 
>>
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-09  9:16   ` Florian Stecker
@ 2019-01-09 10:03     ` Nikolay Borisov
  2019-01-09 20:10       ` Florian Stecker
  0 siblings, 1 reply; 12+ messages in thread
From: Nikolay Borisov @ 2019-01-09 10:03 UTC (permalink / raw)
  To: Florian Stecker, linux-btrfs



On 9.01.19 г. 11:16 ч., Florian Stecker wrote:
>>
>> Provide output of echo w > /proc/sysrq-trigger when the hang occurs
>> otherwise it's hard to figure what's going on.
>>
> 
> Here's one, again in gajim. This time, fdatasync() took "only" 2 seconds:
> 
> [42481.243491] sysrq: SysRq : Show Blocked State
> [42481.243494]   task                        PC stack   pid father
> [42481.243566] gajim           D    0 15778  15774 0x00000083
> [42481.243569] Call Trace:
> [42481.243575]  ? __schedule+0x29b/0x8b0
> [42481.243576]  ? bit_wait+0x50/0x50
> [42481.243578]  schedule+0x32/0x90
> [42481.243580]  io_schedule+0x12/0x40
> [42481.243582]  bit_wait_io+0xd/0x50
> [42481.243583]  __wait_on_bit+0x6c/0x80
> [42481.243585]  out_of_line_wait_on_bit+0x91/0xb0
> [42481.243587]  ? init_wait_var_entry+0x40/0x40
> [42481.243605]  write_all_supers+0x418/0xa70 [btrfs]
> [42481.243622]  btrfs_sync_log+0x695/0x910 [btrfs]
> [42481.243625]  ? _raw_spin_lock_irqsave+0x25/0x50
> [42481.243641]  ? btrfs_log_dentry_safe+0x54/0x70 [btrfs]
> [42481.243655]  btrfs_sync_file+0x3a9/0x3d0 [btrfs]
> [42481.243659]  do_fsync+0x38/0x70
> [42481.243661]  __x64_sys_fdatasync+0x13/0x20
> [42481.243663]  do_syscall_64+0x5b/0x170
> [42481.243666]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [42481.243667] RIP: 0033:0x7fd4022f873f
> [42481.243671] Code: Bad RIP value.
> [42481.243672] RSP: 002b:00007ffd3710a300 EFLAGS: 00000293 ORIG_RAX:
> 000000000000004b
> [42481.243674] RAX: ffffffffffffffda RBX: 0000000000000019 RCX:
> 00007fd4022f873f
> [42481.243675] RDX: 0000000000000000 RSI: 0000000000000002 RDI:
> 0000000000000019
> [42481.243675] RBP: 0000000000000000 R08: 000055d8d8649f68 R09:
> 00007ffd3710a320
> [42481.243676] R10: 0000000000013000 R11: 0000000000000293 R12:
> 0000000000000000
> [42481.243677] R13: 0000000000000000 R14: 000055d8d8363fa0 R15:
> 000055d8d8613040

This shows that IO was sent to disk to write the super blocks following
an fsync and it's waiting for that IO to finish. This looks like a problem
in the storage layer, i.e. IOs getting stuck. Check your dmesg for any errors.
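
For example, something along these lines (smartctl assumes smartmontools
is installed; adjust as needed):

dmesg | grep -iE 'error|ata[0-9]|blk'
smartctl -a /dev/sda
btrfs device stats /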

> 
> 
> On 1/9/19 7:24 AM, Nikolay Borisov wrote:
>>
>>
>> On 8.01.19 г. 21:38 ч., Florian Stecker wrote:
>>> Hi everyone,
>>>
>>> I extended the btrfs volume on my laptop by adding a second partition to
>>> it which lies on the same SSD (using btrfs device add). Since I did
>>> this, all kinds of applications regularly hang for up to 30 seconds. It
>>> seems they are stuck in the fdatasync syscall. For example:
>>>
>>> $ strace -tt -T gajim 2>&1 | grep fdatasync
>>> [...]
>>> 11:36:31.112200 fdatasync(25)           = 0 <0.006958>
>>> 11:36:32.147525 fdatasync(25)           = 0 <0.008138>
>>> 11:36:32.156882 fdatasync(25)           = 0 <0.006866>
>>> 11:36:32.165979 fdatasync(25)           = 0 <0.011797>
>>> 11:36:32.178867 fdatasync(25)           = 0 <23.636614>
>>> 11:36:55.827726 fdatasync(25)           = 0 <0.009595>
>>> 11:36:55.838702 fdatasync(25)           = 0 <0.007261>
>>> 11:36:55.850440 fdatasync(25)           = 0 <0.006807>
>>> 11:36:55.858168 fdatasync(25)           = 0 <0.006767>
>>> [...]
>>>
>>> File descriptor 25 here points to a file which is just ~90KB, so it
>>> really shouldn't take that long.
>>>
>>> Removing the second partition again resolves the problem. Does anyone
>>> know this issue? Is it related to btrfs? Or am I just doing something
>>> wrong?
>>>
>>> Best,
>>> Florian
>>>
>>> Some more info:
>>>
>>> $ btrfs device usage /
>>> /dev/sda2, ID: 2
>>>     Device size:            52.16GiB
>>>     Device slack:              0.00B
>>>     Data,single:             1.00GiB
>>>     Unallocated:            51.16GiB
>>>
>>> /dev/sda8, ID: 1
>>>     Device size:           174.92GiB
>>>     Device slack:              0.00B
>>>     Data,single:           168.91GiB
>>>     Metadata,single:         3.01GiB
>>>     System,single:           4.00MiB
>>>     Unallocated:             3.00GiB
>>>
>>> $ fdisk -l /dev/sda
>>> Disk /dev/sda: 238.5 GiB, 256060514304 bytes, 500118192 sectors
>>> Disk model: SAMSUNG SSD PM87
>>> Units: sectors of 1 * 512 = 512 bytes
>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>> Disklabel type: gpt
>>> Disk identifier: A48B5A25-AA84-4D3F-90DD-E8A4991BDF03
>>>
>>> Device         Start       End   Sectors   Size Type
>>> /dev/sda1       2048   1026047   1024000   500M EFI System
>>> /dev/sda2    1026048 110422015 109395968  52.2G Linux filesystem
>>> /dev/sda8  110422016 477263871 366841856 174.9G Linux filesystem
>>> /dev/sda9  477263872 481458175   4194304     2G Linux swap
>>>
>>> $ uname -a
>>> Linux dell 4.20.0-arch1-1-ARCH #1 SMP PREEMPT Mon Dec 24 03:00:40 UTC
>>> 2018 x86_64 GNU/Linux
> 
>>
>>>
>>>
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-09 10:03     ` Nikolay Borisov
@ 2019-01-09 20:10       ` Florian Stecker
  2019-01-12  2:12         ` Chris Murphy
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Stecker @ 2019-01-09 20:10 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs



On 1/9/19 11:03 AM, Nikolay Borisov wrote:
> 
> 
> On 9.01.19 г. 11:16 ч., Florian Stecker wrote:
>>>
>>> Provide output of echo w > /proc/sysrq-trigger when the hang occurs
>>> otherwise it's hard to figure what's going on.
>>>
>>
>> Here's one, again in gajim. This time, fdatasync() took "only" 2 seconds:
>>
>> [42481.243491] sysrq: SysRq : Show Blocked State
>> [42481.243494]   task                        PC stack   pid father
>> [42481.243566] gajim           D    0 15778  15774 0x00000083
>> [42481.243569] Call Trace:
>> [42481.243575]  ? __schedule+0x29b/0x8b0
>> [42481.243576]  ? bit_wait+0x50/0x50
>> [42481.243578]  schedule+0x32/0x90
>> [42481.243580]  io_schedule+0x12/0x40
>> [42481.243582]  bit_wait_io+0xd/0x50
>> [42481.243583]  __wait_on_bit+0x6c/0x80
>> [42481.243585]  out_of_line_wait_on_bit+0x91/0xb0
>> [42481.243587]  ? init_wait_var_entry+0x40/0x40
>> [42481.243605]  write_all_supers+0x418/0xa70 [btrfs]
>> [42481.243622]  btrfs_sync_log+0x695/0x910 [btrfs]
>> [42481.243625]  ? _raw_spin_lock_irqsave+0x25/0x50
>> [42481.243641]  ? btrfs_log_dentry_safe+0x54/0x70 [btrfs]
>> [42481.243655]  btrfs_sync_file+0x3a9/0x3d0 [btrfs]
>> [42481.243659]  do_fsync+0x38/0x70
>> [42481.243661]  __x64_sys_fdatasync+0x13/0x20
>> [42481.243663]  do_syscall_64+0x5b/0x170
>> [42481.243666]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [42481.243667] RIP: 0033:0x7fd4022f873f
>> [42481.243671] Code: Bad RIP value.
>> [42481.243672] RSP: 002b:00007ffd3710a300 EFLAGS: 00000293 ORIG_RAX:
>> 000000000000004b
>> [42481.243674] RAX: ffffffffffffffda RBX: 0000000000000019 RCX:
>> 00007fd4022f873f
>> [42481.243675] RDX: 0000000000000000 RSI: 0000000000000002 RDI:
>> 0000000000000019
>> [42481.243675] RBP: 0000000000000000 R08: 000055d8d8649f68 R09:
>> 00007ffd3710a320
>> [42481.243676] R10: 0000000000013000 R11: 0000000000000293 R12:
>> 0000000000000000
>> [42481.243677] R13: 0000000000000000 R14: 000055d8d8363fa0 R15:
>> 000055d8d8613040
> 
> This shows that IO was send to disk to write the supper blocks following
> an fsync and it's waiting for IO to finish. This seems like a problem in
> the storage layer, i.e IOs being stuck. Check your dmesg for any error.
There are no IO errors in dmesg. Also, I have never had any problems with 
this disk: SMART reports no issues, and btrfs dev stats and btrfs 
scrub both say everything is OK.

I have now found a way to reproduce the issue more reliably: if I just 
write 10KB of random data to a file and sync, that usually takes only a 
few ms, but on my setup, if I do it 1000 times, about 10 of the writes 
take longer than 100ms, sometimes much longer:

$ for i in $(seq 0 1000); do \
      dd if=/dev/urandom of=/home/stecker/test bs=10k count=1 conv=fdatasync 2>&1 && sleep 0.1; \
  done | grep -E '([1-9][0-9]*\.|0\.[1-9])[0-9]* s'
10240 bytes (10 kB, 10 KiB) copied, 1.12436 s, 9.1 kB/s
10240 bytes (10 kB, 10 KiB) copied, 1.33179 s, 7.7 kB/s
10240 bytes (10 kB, 10 KiB) copied, 1.27658 s, 8.0 kB/s
10240 bytes (10 kB, 10 KiB) copied, 0.401769 s, 25.5 kB/s
10240 bytes (10 kB, 10 KiB) copied, 1.019 s, 10.0 kB/s
10240 bytes (10 kB, 10 KiB) copied, 1.95148 s, 5.2 kB/s
10240 bytes (10 kB, 10 KiB) copied, 1.48939 s, 6.9 kB/s
10240 bytes (10 kB, 10 KiB) copied, 1.9071 s, 5.4 kB/s
10240 bytes (10 kB, 10 KiB) copied, 1.90988 s, 5.4 kB/s
10240 bytes (10 kB, 10 KiB) copied, 0.845141 s, 12.1 kB/s
10240 bytes (10 kB, 10 KiB) copied, 0.184172 s, 55.6 kB/s
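
For comparison, a roughly equivalent test with fio should report sync 
latency percentiles directly (assuming fio is installed; the exact output 
format depends on the fio version):

$ fio --name=fsync-test --filename=/home/stecker/test --size=10m --bs=10k \
      --rw=write --fsync=1 --runtime=60 --time_based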

If I use the two partitions (/dev/sda2 and /dev/sda8) not as parts of a 
single fs, but as two separate btrfs filesystems, this does not happen; 
all writes are fast. But that should not make a difference for the 
storage layer, should it? I mean, it writes the superblocks to the exact 
same positions on the disk?

By the way, thanks a lot for your help!

> 
>>
>>
>> On 1/9/19 7:24 AM, Nikolay Borisov wrote:
>>>
>>>
>>> On 8.01.19 г. 21:38 ч., Florian Stecker wrote:
>>>> Hi everyone,
>>>>
>>>> I extended the btrfs volume on my laptop by adding a second partition to
>>>> it which lies on the same SSD (using btrfs device add). Since I did
>>>> this, all kinds of applications regularly hang for up to 30 seconds. It
>>>> seems they are stuck in the fdatasync syscall. For example:
>>>>
>>>> $ strace -tt -T gajim 2>&1 | grep fdatasync
>>>> [...]
>>>> 11:36:31.112200 fdatasync(25)           = 0 <0.006958>
>>>> 11:36:32.147525 fdatasync(25)           = 0 <0.008138>
>>>> 11:36:32.156882 fdatasync(25)           = 0 <0.006866>
>>>> 11:36:32.165979 fdatasync(25)           = 0 <0.011797>
>>>> 11:36:32.178867 fdatasync(25)           = 0 <23.636614>
>>>> 11:36:55.827726 fdatasync(25)           = 0 <0.009595>
>>>> 11:36:55.838702 fdatasync(25)           = 0 <0.007261>
>>>> 11:36:55.850440 fdatasync(25)           = 0 <0.006807>
>>>> 11:36:55.858168 fdatasync(25)           = 0 <0.006767>
>>>> [...]
>>>>
>>>> File descriptor 25 here points to a file which is just ~90KB, so it
>>>> really shouldn't take that long.
>>>>
>>>> Removing the second partition again resolves the problem. Does anyone
>>>> know this issue? Is it related to btrfs? Or am I just doing something
>>>> wrong?
>>>>
>>>> Best,
>>>> Florian
>>>>
>>>> Some more info:
>>>>
>>>> $ btrfs device usage /
>>>> /dev/sda2, ID: 2
>>>>      Device size:            52.16GiB
>>>>      Device slack:              0.00B
>>>>      Data,single:             1.00GiB
>>>>      Unallocated:            51.16GiB
>>>>
>>>> /dev/sda8, ID: 1
>>>>      Device size:           174.92GiB
>>>>      Device slack:              0.00B
>>>>      Data,single:           168.91GiB
>>>>      Metadata,single:         3.01GiB
>>>>      System,single:           4.00MiB
>>>>      Unallocated:             3.00GiB
>>>>
>>>> $ fdisk -l /dev/sda
>>>> Disk /dev/sda: 238.5 GiB, 256060514304 bytes, 500118192 sectors
>>>> Disk model: SAMSUNG SSD PM87
>>>> Units: sectors of 1 * 512 = 512 bytes
>>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>>> Disklabel type: gpt
>>>> Disk identifier: A48B5A25-AA84-4D3F-90DD-E8A4991BDF03
>>>>
>>>> Device         Start       End   Sectors   Size Type
>>>> /dev/sda1       2048   1026047   1024000   500M EFI System
>>>> /dev/sda2    1026048 110422015 109395968  52.2G Linux filesystem
>>>> /dev/sda8  110422016 477263871 366841856 174.9G Linux filesystem
>>>> /dev/sda9  477263872 481458175   4194304     2G Linux swap
>>>>
>>>> $ uname -a
>>>> Linux dell 4.20.0-arch1-1-ARCH #1 SMP PREEMPT Mon Dec 24 03:00:40 UTC
>>>> 2018 x86_64 GNU/Linux
>>
>>>
>>>>
>>>>
>>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-09 20:10       ` Florian Stecker
@ 2019-01-12  2:12         ` Chris Murphy
  2019-01-12 10:19           ` Florian Stecker
  0 siblings, 1 reply; 12+ messages in thread
From: Chris Murphy @ 2019-01-12  2:12 UTC (permalink / raw)
  To: Florian Stecker; +Cc: Nikolay Borisov, Btrfs BTRFS

On Wed, Jan 9, 2019 at 1:10 PM Florian Stecker <m19@florianstecker.de> wrote:
>
>
>
> On 1/9/19 11:03 AM, Nikolay Borisov wrote:
> >
> >
> > On 9.01.19 г. 11:16 ч., Florian Stecker wrote:
> >>>
> >>> Provide output of echo w > /proc/sysrq-trigger when the hang occurs
> >>> otherwise it's hard to figure what's going on.
> >>>
> >>
> >> Here's one, again in gajim. This time, fdatasync() took "only" 2 seconds:
> >>
> >> [42481.243491] sysrq: SysRq : Show Blocked State
> >> [42481.243494]   task                        PC stack   pid father
> >> [42481.243566] gajim           D    0 15778  15774 0x00000083
> >> [42481.243569] Call Trace:
> >> [42481.243575]  ? __schedule+0x29b/0x8b0
> >> [42481.243576]  ? bit_wait+0x50/0x50
> >> [42481.243578]  schedule+0x32/0x90
> >> [42481.243580]  io_schedule+0x12/0x40
> >> [42481.243582]  bit_wait_io+0xd/0x50
> >> [42481.243583]  __wait_on_bit+0x6c/0x80
> >> [42481.243585]  out_of_line_wait_on_bit+0x91/0xb0
> >> [42481.243587]  ? init_wait_var_entry+0x40/0x40
> >> [42481.243605]  write_all_supers+0x418/0xa70 [btrfs]
> >> [42481.243622]  btrfs_sync_log+0x695/0x910 [btrfs]
> >> [42481.243625]  ? _raw_spin_lock_irqsave+0x25/0x50
> >> [42481.243641]  ? btrfs_log_dentry_safe+0x54/0x70 [btrfs]
> >> [42481.243655]  btrfs_sync_file+0x3a9/0x3d0 [btrfs]
> >> [42481.243659]  do_fsync+0x38/0x70
> >> [42481.243661]  __x64_sys_fdatasync+0x13/0x20
> >> [42481.243663]  do_syscall_64+0x5b/0x170
> >> [42481.243666]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [42481.243667] RIP: 0033:0x7fd4022f873f
> >> [42481.243671] Code: Bad RIP value.
> >> [42481.243672] RSP: 002b:00007ffd3710a300 EFLAGS: 00000293 ORIG_RAX:
> >> 000000000000004b
> >> [42481.243674] RAX: ffffffffffffffda RBX: 0000000000000019 RCX:
> >> 00007fd4022f873f
> >> [42481.243675] RDX: 0000000000000000 RSI: 0000000000000002 RDI:
> >> 0000000000000019
> >> [42481.243675] RBP: 0000000000000000 R08: 000055d8d8649f68 R09:
> >> 00007ffd3710a320
> >> [42481.243676] R10: 0000000000013000 R11: 0000000000000293 R12:
> >> 0000000000000000
> >> [42481.243677] R13: 0000000000000000 R14: 000055d8d8363fa0 R15:
> >> 000055d8d8613040
> >
> > This shows that IO was send to disk to write the supper blocks following
> > an fsync and it's waiting for IO to finish. This seems like a problem in
> > the storage layer, i.e IOs being stuck. Check your dmesg for any error.
> There are no IO errors in dmesg. Also, I never had any problems with
> this disk, SMART reports nothing, and also btrfs dev stats and btrfs
> scrub say everything's ok.

What do you get for:
mount | grep btrfs
btrfs insp dump-s -f /dev/sda8

I ran in this same configuration for a long time, maybe 5 months, and
never ran into this problem. But it was with a much older kernel,
perhaps circa the 4.8 era.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-12  2:12         ` Chris Murphy
@ 2019-01-12 10:19           ` Florian Stecker
  2019-01-14  5:49             ` Duncan
  0 siblings, 1 reply; 12+ messages in thread
From: Florian Stecker @ 2019-01-12 10:19 UTC (permalink / raw)
  To: Chris Murphy; +Cc: Nikolay Borisov, Btrfs BTRFS

I found out a few things in the meantime:

* My IO scheduler is mq-deadline by default. When I switch it to none, 
the problem disappears (see the commands just below).
* What hangs is the call to wait_on_buffer inside wait_dev_supers, while 
waiting for superblock 0 of device 1 to be written. That is /dev/sda8, 
i.e. the partition which physically lies after /dev/sda2 on the disk but 
has the lower device id.
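
For reference, the scheduler switch is just the usual sysfs knob (per 
device, resets on reboot, needs root):

cat /sys/block/sda/queue/scheduler          # active scheduler is shown in brackets
echo none > /sys/block/sda/queue/scheduler  # switch to none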

So it seems to me as if btrfs produces some strange sequence of writes 
which confuses the scheduler and causes it to hang? Could this have to do 
with the fact that the order of devids differs from the physical order 
of the partitions?

If you guys want me to, I can definitely put some printks etc. into my 
kernel. I just know too little about the code to see what information 
could be useful.
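
Alternatively, instead of printks, something like this bpftrace one-liner 
could time the fsync path without rebuilding the kernel (a sketch only; 
assumes bpftrace is installed and that btrfs_sync_file, the function from 
the stack trace above, is traceable on this kernel):

bpftrace -e '
kprobe:btrfs_sync_file { @start[tid] = nsecs; }
kretprobe:btrfs_sync_file /@start[tid]/ {
    $ms = (nsecs - @start[tid]) / 1000000;    /* duration of this fsync */
    if ($ms > 100) { printf("%s: btrfs_sync_file took %d ms\n", comm, $ms); }
    delete(@start[tid]);
}'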

 > What do you get for:
 > mount | grep btrfs
 > btrfs insp dump-s -f /dev/sda8

$ mount | grep btrfs
/dev/sda8 on / type btrfs (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)

$ btrfs insp dump-s -f /dev/sda8
superblock: bytenr=65536, device=/dev/sda8
---------------------------------------------------------
csum_type		0 (crc32c)
csum_size		4
csum			0x5a2fbdf1 [match]
bytenr			65536
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			c4c0b512-00d3-42f2-a2e1-dcc62a2acd98
label			
generation		575201
root			622264320
sys_array_size		97
chunk_root_generation	574575
root_level		1
chunk_root		241407393792
chunk_root_level	1
log_root		0
log_root_transid	0
log_root_level		0
total_bytes		243285360640
bytes_used		184232562688
sectorsize		4096
nodesize		16384
leafsize (deprecated)		16384
stripesize		4096
root_dir		6
num_devices		2
compat_flags		0x0
compat_ro_flags		0x0
incompat_flags		0x161
			( MIXED_BACKREF |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA )
cache_generation	575201
uuid_tree_generation	574411
dev_item.uuid		3e8d6ecb-a595-4c6a-aae8-e6a09e5b151d
dev_item.fsid		c4c0b512-00d3-42f2-a2e1-dcc62a2acd98 [match]
dev_item.type		0
dev_item.total_bytes	187823030272
dev_item.bytes_used	183523868672
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		1
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0
sys_chunk_array[2048]:
	item 0 key (FIRST_CHUNK_TREE CHUNK_ITEM 241407361024)
		length 33554432 owner 2 stripe_len 65536 type SYSTEM
		io_align 65536 io_width 65536 sector_size 4096
		num_stripes 1 sub_stripes 1
			stripe 0 devid 2 offset 1074790400
			dev_uuid d56afd3b-d89c-4e59-8850-7dedd84900b9
backup_roots[4]:
	backup 0:
		backup_tree_root:	618283008	gen: 575198	level: 1
		backup_chunk_root:	241407393792	gen: 574575	level: 1
		backup_extent_root:	616038400	gen: 575198	level: 2
		backup_fs_root:		593952768	gen: 575198	level: 2
		backup_dev_root:	487751680	gen: 575188	level: 0
		backup_csum_root:	594149376	gen: 575198	level: 2
		backup_total_bytes:	243285360640
		backup_bytes_used:	184231972864
		backup_num_devices:	2

	backup 1:
		backup_tree_root:	613908480	gen: 575199	level: 1
		backup_chunk_root:	241407393792	gen: 574575	level: 1
		backup_extent_root:	607731712	gen: 575199	level: 2
		backup_fs_root:		622444544	gen: 575200	level: 2
		backup_dev_root:	487751680	gen: 575188	level: 0
		backup_csum_root:	603389952	gen: 575199	level: 2
		backup_total_bytes:	243285360640
		backup_bytes_used:	184232038400
		backup_num_devices:	2

	backup 2:
		backup_tree_root:	628539392	gen: 575200	level: 1
		backup_chunk_root:	241407393792	gen: 574575	level: 1
		backup_extent_root:	616103936	gen: 575200	level: 2
		backup_fs_root:		622444544	gen: 575200	level: 2
		backup_dev_root:	487751680	gen: 575188	level: 0
		backup_csum_root:	616890368	gen: 575200	level: 2
		backup_total_bytes:	243285360640
		backup_bytes_used:	184232468480
		backup_num_devices:	2

	backup 3:
		backup_tree_root:	622264320	gen: 575201	level: 1
		backup_chunk_root:	241407393792	gen: 574575	level: 1
		backup_extent_root:	617791488	gen: 575201	level: 2
		backup_fs_root:		615432192	gen: 575201	level: 2
		backup_dev_root:	487751680	gen: 575188	level: 0
		backup_csum_root:	615841792	gen: 575201	level: 2
		backup_total_bytes:	243285360640
		backup_bytes_used:	184232562688
		backup_num_devices:	2


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-12 10:19           ` Florian Stecker
@ 2019-01-14  5:49             ` Duncan
  2019-01-14 11:35               ` Marc Joliet
  0 siblings, 1 reply; 12+ messages in thread
From: Duncan @ 2019-01-14  5:49 UTC (permalink / raw)
  To: linux-btrfs

Florian Stecker posted on Sat, 12 Jan 2019 11:19:14 +0100 as excerpted:

> $ mount | grep btrfs
> /dev/sda8 on / type btrfs
> (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)

Unlikely to be apropos to the problem at hand, but FYI...

Unless you have a known reason not to[1], running noatime with btrfs 
instead of the kernel-default relatime is strongly recommended, 
especially if you use btrfs snapshotting on the filesystem.

The reasoning is that even tho relatime reduces the default access-time 
updates to once a day, it still likely-unnecessarily turns otherwise 
read-only operations into read-write operations.  And atimes are metadata, 
which btrfs always COWs (copy-on-writes), meaning atime updates can 
trigger cascading metadata block-writes and much larger than 
anticipated[2] write-amplification, potentially hurting performance, yes, 
even with relatime, depending on your usage.

In addition, if you're using snapshotting and not using noatime, it can 
easily happen that a large portion of the change between one snapshot and 
the next is simply atime updates, thus making the space referenced 
exclusively by individual affected snapshots far larger than it would 
otherwise be.
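
FWIW it costs nothing to try; a remount is enough to test it on the 
running system (make it permanent in fstab afterward):

mount -o remount,noatime /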

---
[1] mutt is AFAIK the only widely used application that still depends on 
atime updates, and it only does so in certain modes, not with mbox-format 
mailboxes, for instance.  So unless you're using it, or your backup 
solution happens to use atime, chances are quite high that noatime won't 
disrupt your usage at all.

[2] Larger than anticipated write-amplification:  Especially when you 
/thought/ you were only reading the files and hadn't considered the atime 
update that read could trigger, thus effectively generating infinite 
write amplification, because the read access did an atime update and 
turned what otherwise wouldn't have been a write operation at all into one!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-14  5:49             ` Duncan
@ 2019-01-14 11:35               ` Marc Joliet
  2019-01-15  8:33                 ` Duncan
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Joliet @ 2019-01-14 11:35 UTC (permalink / raw)
  To: linux-btrfs


Am Montag, 14. Januar 2019, 06:49:58 CET schrieb Duncan:
[...]
> Unless you have a known reason not to[1], running noatime with btrfs
> instead of the kernel-default relatime is strongly recommended,
> especially if you use btrfs snapshotting on the filesystem.
[...]

The one reason I decided to remove noatime from my systems' mount options is 
that I use systemd-tmpfiles to clean up cache directories, for which it is 
necessary to leave atime intact (since caches are often Write Once Read Many).

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-14 11:35               ` Marc Joliet
@ 2019-01-15  8:33                 ` Duncan
  2019-01-15 22:40                   ` Marc Joliet
  0 siblings, 1 reply; 12+ messages in thread
From: Duncan @ 2019-01-15  8:33 UTC (permalink / raw)
  To: linux-btrfs

Marc Joliet posted on Mon, 14 Jan 2019 12:35:05 +0100 as excerpted:

> Am Montag, 14. Januar 2019, 06:49:58 CET schrieb Duncan:
> [...]
>> Unless you have a known reason not to[1], running noatime with btrfs
>> instead of the kernel-default relatime is strongly recommended,
>> especially if you use btrfs snapshotting on the filesystem.
> [...]
> 
> The one reason I decided to remove noatime from my systems' mount
> options is because I use systemd-tmpfiles to clean up cache directories,
> for which it is necessary to leave atime intact (since caches are often
> Write Once Read Many).

Thanks for the reply.  I hadn't really thought of that use, but it makes 
sense...

FWIW systemd here too, but I suppose it depends on what's being cached 
and particularly on the expense of recreation of cached data.  I actually 
have many of my caches (user/browser caches, etc) on tmpfs and reboot 
several times a week, so much of the cached data is only trivially cached 
as it's trivial to recreate/redownload.

OTOH, running gentoo, my ccache and binpkg cache are seriously CPU-cycle 
expensive to recreate, so you can bet those are _not_ tmpfs, but OTTH, 
they're not managed by systemd-tmpfiles either.  (Ccache manages its own 
cache and together with the source-tarballs cache and git-managed repo 
trees along with binpkgs, I have a dedicated packages btrfs containing 
all of them, so I eclean binpkgs and distfiles whenever the 24-gigs space 
(48-gig total, 24-gig each on pair-device btrfs raid1) gets too close to 
full, then btrfs balance with -dusage= to reclaim partial chunks to 
unallocated.)
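
Concretely, the reclaim step is just the usual balance filter run against 
the packages filesystem mountpoint (/p here, per my layout); the usage 
threshold is whatever seems appropriate at the time:

btrfs balance start -dusage=50 /p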

Anyway, if you're not regularly snapshotting, relatime is reasonably 
fine, tho I'd still keep the atime effects in mind and switch to noatime 
if you end up in a recovery situation that requires writable mounting.  
(Losing a device in btrfs raid1 and mounting writable in order to 
replace it and rebalance comes to mind as one example of a writable-mount 
recovery scenario where noatime until full replace/rebalance/scrub 
completion would prevent unnecessary writes until the raid1 is safely 
complete and scrub-verified again.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-15  8:33                 ` Duncan
@ 2019-01-15 22:40                   ` Marc Joliet
  2019-01-17 11:15                     ` Duncan
  0 siblings, 1 reply; 12+ messages in thread
From: Marc Joliet @ 2019-01-15 22:40 UTC (permalink / raw)
  To: linux-btrfs


Am Dienstag, 15. Januar 2019, 09:33:40 CET schrieb Duncan:
> Marc Joliet posted on Mon, 14 Jan 2019 12:35:05 +0100 as excerpted:
> > Am Montag, 14. Januar 2019, 06:49:58 CET schrieb Duncan:
> > [...]
> > 
> >> Unless you have a known reason not to[1], running noatime with btrfs
> >> instead of the kernel-default relatime is strongly recommended,
> >> especially if you use btrfs snapshotting on the filesystem.
> > 
> > [...]
> > 
> > The one reason I decided to remove noatime from my systems' mount
> > options is because I use systemd-tmpfiles to clean up cache directories,
> > for which it is necessary to leave atime intact (since caches are often
> > Write Once Read Many).
> 
> Thanks for the reply.  I hadn't really thought of that use, but it makes
> sense...

Specifically, I mean ~/.cache/ (plus a separate entry for ~/.cache/
thumbnails/, since I want thumbnails to live longer):

% grep \^[\^#] .config/user-tmpfiles.d/*.conf
.config/user-tmpfiles.d/clean.conf:d %C/thumbnails - - - 730d -
.config/user-tmpfiles.d/subvolumes.conf:q %C 0700 - - 60d -
.config/user-tmpfiles.d/subvolumes.conf:q %h/tmp 0700 - - - -

I don't use qgroups now, but will probably in the future, hence the use of "q" 
instead of "v".  ~/tmp/ is just a scratch space that I don't want snapshotted.

I haven't bothered configuring /var/cache/, other than making it a subvolume 
so it's not a part of my snapshots (overriding the systemd default of creating 
it as a directory).  It appears to me that it's managed just fine by pre-
existing tmpfiles.d snippets and by the applications that use it cleaning up 
after themselves (except for portage, see below).

> FWIW systemd here too, but I suppose it depends on what's being cached
> and particularly on the expense of recreation of cached data.  I actually
> have many of my caches (user/browser caches, etc) on tmpfs and reboot
> several times a week, so much of the cached data is only trivially cached
> as it's trivial to recreate/redownload.

While that sort of tmpfs hackery is definitely cool, my system is, despite its 
age, fast enough for me that I don't want to bother with that (plus I like my 
8 GB of RAM to be used just for applications and whatever Linux decides to 
cache in RAM).  Also, modern SSDs live long enough that I'm not worried about 
wearing them out through my daily usage (which IIRC was a major reason for you 
to do things that way).

> OTOH, running gentoo, my ccache and binpkg cache are seriously CPU-cycle
> expensive to recreate, so you can bet those are _not_ tmpfs, but OTTH,
> they're not managed by systemd-tmpfiles either.  (Ccache manages its own
> cache and together with the source-tarballs cache and git-managed repo
> trees along with binpkgs, I have a dedicated packages btrfs containing
> all of them, so I eclean binpkgs and distfiles whenever the 24-gigs space
> (48-gig total, 24-gig each on pair-device btrfs raid1) gets too close to
> full, then btrfs balance with -dusage= to reclaim partial chunks to
> unallocated.)

For distfiles I just have a weekly systemd timer that runs "eclean-dist -d" (I 
stopped using the buildpkg feature, so no eclean-pkg), and have moved both 
$DISTDIR and $PKGDIR to their future default locations in /var/cache/.  (They 
used to reside on my desktop's HDD RAID1 as distinct subvolumes, but I recently 
bought a larger SSD, so I set up the above and got rid of two fstab entries.)

> Anyway, if you're not regularly snapshotting, relatime is reasonably
> fine, 

Personally, I don't notice the difference between noatime and relatime in day-
to-day usage (perhaps I just don't snapshot often enough).

> tho I'd still keep the atime effects in mind and switch to noatime
> if you end up in a recovery situation that requires writable mounting.
> (Losing a device in btrfs raid1 and mounting writable in ordered to
> replace it and rebalance comes to mind as one example of a writable-mount
> recovery scenario where noatime until full replace/rebalance/scrub
> completion would prevent unnecessary writes until the raid1 is safely
> complete and scrub-verified again.)

That all makes sense.  I was going to argue that I can't imagine randomly 
reading files in a recovery situation, but eventually realized that "ls" would 
be enough to trigger a directory atime update.  So yeah, one should keep the 
above in mind.

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: applications hang on a btrfs spanning two partitions
  2019-01-15 22:40                   ` Marc Joliet
@ 2019-01-17 11:15                     ` Duncan
  0 siblings, 0 replies; 12+ messages in thread
From: Duncan @ 2019-01-17 11:15 UTC (permalink / raw)
  To: linux-btrfs

Marc Joliet posted on Tue, 15 Jan 2019 23:40:18 +0100 as excerpted:

> Am Dienstag, 15. Januar 2019, 09:33:40 CET schrieb Duncan:
>> Marc Joliet posted on Mon, 14 Jan 2019 12:35:05 +0100 as excerpted:
>> > Am Montag, 14. Januar 2019, 06:49:58 CET schrieb Duncan:
>> > 
>> >> ... noatime ...
>> > 
>> > The one reason I decided to remove noatime from my systems' mount
>> > options is because I use systemd-tmpfiles to clean up cache
>> > directories, for which it is necessary to leave atime intact
>> > (since caches are often Write Once Read Many).
>> 
>> Thanks for the reply.  I hadn't really thought of that use, but it
>> makes sense...

I really enjoy these "tips" subthreads.  As I said I hadn't really 
thought of that use, and seeing and understanding other people's 
solutions helps when I later find reason to review/change my own. =:^)

One example is an ssd brand reliability discussion from a couple years 
ago.  I had the main system on ssds then and wasn't planning on an 
immediate upgrade, but later on, I got tired of the media partition and a 
main system backup being on slow spinning rust, and dug out that ssd 
discussion to help me decide what to buy.  (Samsung 1 TB evo 850s, FWIW.)

> Specifically, I mean ~/.cache/ (plus a separate entry for ~/.cache/
> thumbnails/, since I want thumbnails to live longer):

Here, ~/.cache -> tmp/cache/ and ~/tmp -> /tmp/tmp-$USER/, plus 
XDG_CACHE_HOME=$HOME/tmp/cache/, with /tmp being tmpfs.

So as I said, user cache is on tmpfs.

Thumbnails... I actually did an experiment with the .thumbnails backed up 
elsewhere and empty, and found that with my ssds anyway, rethumbnailing 
was close enough to having them cached that it didn't really matter to my 
visual browsing experience.  So not only do I not mind thumbnails being 
on tmpfs, I actually have gwenview, my primary images browser, set to 
delete its thumbnails dir on close.

> I haven't bothered configuring /var/cache/, other than making it a
> subvolume so it's not a part of my snapshots (overriding the systemd
> default of creating it as a directory).  It appears to me that it's
> managed just fine by pre- existing tmpfiles.d snippets and by the
> applications that use it cleaning up after themselves (except for
> portage, see below).

Here, /var/cache/ is on /, which remains mounted read-only by default.  
The only things using it are package-updates related, and I obviously 
have to mount / rw for package updates, so it works fine.  (My sync 
script mounts the dedicated packages filesystem containing the repos, 
ccache, distdir, and binpkgs, and remounting / rw, and that's the first 
thing I run doing an update, so I don't even have to worry about doing 
the mounts manually.)

>> FWIW systemd here too, but I suppose it depends on what's being cached
>> and particularly on the expense of recreation of cached data.  I
>> actually have many of my caches (user/browser caches, etc) on tmpfs and
>> reboot several times a week, so much of the cached data is only
>> trivially cached as it's trivial to recreate/redownload.
> 
> While that sort of tmpfs hackery is definitely cool, my system is,
> despite its age, fast enough for me that I don't want to bother with
> that (plus I like my 8 GB of RAM to be used just for applications and
> whatever Linux decides to cache in RAM).  Also, modern SSDs live long
> enough that I'm not worried about wearing them out through my daily
> usage (which IIRC was a major reason for you to do things that way).

16 gigs RAM here, and except for building chromium (in tmpfs), I seldom 
fill it even with cache -- most of the time several gigs remain entirely 
empty.  With 8 gig I'd obviously have to worry a bit more about what I 
put in tmpfs, but given that I have the RAM space, I might as well use it.

When I set up this system I was upgrading from a 4-core (original 2-socket 
dual-core 3-digit Opterons, purchased in 2003 and ran until the caps 
started dying in 2011), this system being a 6-core fx-series, and based 
on the experience with the quad-core, I figured 12 gig RAM for the 6-
core.  But with pairs of RAM sticks for dual-channel, powers of two 
worked better, so it was 8 gig or 16 gig.  And given that I had worked 
with 8 gig on the quad-core, I knew that would be OK, but 12 gig would 
mean less cache dumping, so 16 gig it was.

And my estimate was right on.  Since 2011, I've typically run up to ~12 
gigs RAM used including cache, leaving ~4 gigs of the 16 entirely unused 
most of the time, tho I do use the full 16 gig sometimes when doing 
updates, since I have PORTAGE_TMPDIR set to tmpfs.

Of course since my purchase in 2011 I've upgraded to SSDs and RAM-based 
storage cache isn't as important as it was back on spinning rust, so for 
my routine usage 8 gig RAM with ssds would be just fine, today.

But building chromium on tmpfs is the exception.

Until recently I was running firefox, but for various reasons including 
firefox upstream requiring pulse-audio now so I can't just run upstream 
firefox binaries, and gentoo's firefox updates unfortunately sometimes 
being uncomfortably late for a security-minded user aware that their 
primary browser is the single most security-exposed application they run, 
and often build or run problems after gentoo /did/ have a firefox build, 
making reliably running a secure-as-possible firefox even *more* of a 
problem, a few months ago I switched to chromium.

And chromium is over a half-gig of compressed sources that expands to 
several gigs of build dir.  Put that in tmpfs along with the memory 
requirements of a multi-threaded build, with USE=jumbo-build and a couple 
gigs of other stuff (an X/kde-plasma session, building in a konsole 
window, often with chromium and minitube running) in memory too, and...

That 16 gig RAM isn't enough for that sort of chromium build. =:^(

So for the first time on the ssds, I reconfigured and rebuilt the kernel 
with swap support, and added a pair of 16-gig each swap partitions on the 
ssds, for now 16 gig RAM and 32 gig swap.

With the parallel-jobs cut down slightly via a package.env setting to 
better control memory usage, to -j7 from the normal -j8, and with 
PORTAGE_TMPDIR still pointed at tmpfs, I run about 16 gig into swap 
building chromium now.  So for that I could now use 32 gig of RAM.
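
The package.env bit is just the standard portage mechanism, something 
like the following (file names are whatever you prefer):

# /etc/portage/env/j7.conf
MAKEOPTS="-j7"

# /etc/portage/package.env
www-client/chromium j7.conf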

Meanwhile, it's 2019, and this 2011 system's starting to feel a bit dated 
in other ways too, now, and I'm already at the ~8 years my last system 
lasted, so I'm thinking about upgrading.  I've upgraded to SSDs and to 
big-screen monitors (a 65-inch/165cm 4K TV as primary) on this system, 
but I've not done the CPU or memory upgrades on it that I did on the last 
one, and having to enable swap to build chromium just seems so last 
century.

So I'm thinking about upgrading later this year, probably to a zen-2-
based system with hardware spectre mitigations.

And I want at least 32-gig RAM when I do, depending on the number of 
cores/threads.  I'm figuring 4-gig/thread now, 4-core/8-thread minimum, 
which would be the 32-gig.  But 8-core/16-thread, 64-gig RAM, would be 
nice.

But I'm moving this spring and am busy with that first.  When that's done 
and I'm settled in the new place I'll see what my financials look like 
and go from there.

>> OTOH, running gentoo, my ccache and binpkg cache are seriously
>> CPU-cycle expensive to recreate, so you can bet those are _not_ tmpfs,
>> but OTTH, they're not managed by systemd-tmpfiles either.  (Ccache
>> manages its own cache and together with the source-tarballs cache and
>> git-managed repo trees along with binpkgs, I have a dedicated packages
>> btrfs containing all of them, so I eclean binpkgs and distfiles
>> whenever the 24-gigs space (48-gig total, 24-gig each on pair-device
>> btrfs raid1) gets too close to full, then btrfs balance with -dusage=
>> to reclaim partial chunks to unallocated.)
> 
> For distfiles I just have a weekly systemd timer that runs "eclean-dist
> -d" (I stopped using the buildpkg feature, so no eclean-pkg), and have
> moved both $DISTDIR and $PKGDIR to their future default locations in
> /var/cache/.  (They used to reside on my desktops HDD RAID1 as distinct
> subvolumes, but I recently bought a larger SSD, so I set up the above
> and got rid of two fstab entries.)

I like short paths.

So my packages filesystem mountpoint is /p, with /p/gentoo and /p/kde 
being my main repos, DISTDIR=/p/src, PKGDIR=/p/pkw (w=workstation, back 
when I had my 32-bit netbook and 32-bit chroot build image on the 
workstation too, I had its packages in pkn, IIRC), /p/linux for the linux 
git tree, /p/kpatch for local kernel patches, /p/cc for ccache, and /p/
initramfs for my (dracut-generated) initramfs.

And FWIW, /h is the home mountpoint, /lg the log mountpoint (with
/var/log -> /lg) /l the system-local dir (with /var/local -> /l) on /, 
/mnt for auxiliary mounts, /bk the root-backup mountpoint, etc.


You stopped using binpkgs?  I can't imagine doing that.  Not only does it 
make the occasional downgrade easier, older binpkgs come in handy for 
checking whether a file location moved in recent versions, looking up 
default configs and seeing how they've changed, checking the dates on 
them to know when I was running version X or whether I upgraded package Y 
before or after package Z, etc.

Of course I could use btrfs snapshotting for most of that and could get 
the other info in other ways, but I had this setup working and tested 
long before btrfs, and it seems less risky and easier to quantify and 
manage than btrfs snapshotting.  But surely that's because I /did/ have 
it up, running and tested, before btrfs, so it's old hat to me now.  If I 
were starting with it now, I imagine I might well find the btrfs 
snapshotting thing simpler to manage, and covering a broader use-case too.

>> tho I'd still keep the atime effects in mind and switch to noatime if
>> you end up in a recovery situation that requires writable mounting.
>> (Losing a device in btrfs raid1 and mounting writable in ordered to
>> replace it and rebalance comes to mind as one example of a
>> writable-mount recovery scenario where noatime until full
>> replace/rebalance/scrub completion would prevent unnecessary writes
>> until the raid1 is safely complete and scrub-verified again.)
> 
> That all makes sense.  I was going to argue that I can't imagine
> randomly reading files in a recovery situation, but eventually realized
> that "ls" would be enough to trigger a directory atime update.  So yeah,
> one should keep the above mind.

Not just ls, etc, either.  Consider manpage access, etc, as well.  Plus 
of course any executable binaries you run, the libs they load, 
scripts...  If atime's on, all those otherwise read-only accesses will 
trigger atime-update writes, and with btrfs, updating that bit of 
metadata copies and writes the entire updated metadata block, triggering 
an update and thus a COW of the metadata block tracking the one just 
written... all the way up the metadata tree.  In a recovery situation 
where every write is an additional risk, that's a lot of additional risk, 
all for not-so-necessary atime updates!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2019-01-17 11:18 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-08 19:38 applications hang on a btrfs spanning two partitions Florian Stecker
2019-01-09  6:24 ` Nikolay Borisov
2019-01-09  9:16   ` Florian Stecker
2019-01-09 10:03     ` Nikolay Borisov
2019-01-09 20:10       ` Florian Stecker
2019-01-12  2:12         ` Chris Murphy
2019-01-12 10:19           ` Florian Stecker
2019-01-14  5:49             ` Duncan
2019-01-14 11:35               ` Marc Joliet
2019-01-15  8:33                 ` Duncan
2019-01-15 22:40                   ` Marc Joliet
2019-01-17 11:15                     ` Duncan
