* csum failed root raveled during balance
@ 2018-05-22 20:05 ein
  2018-05-23  6:32 ` Nikolay Borisov
  0 siblings, 1 reply; 14+ messages in thread
From: ein @ 2018-05-22 20:05 UTC (permalink / raw)
  To: linux-btrfs

Hello devs,

I tested BTRFS in production for about a month:

21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00

There was no power blackout or hardware failure, the SSDs' SMART data
is flawless, etc. The tests ended with:

root@node0:~# dmesg | grep BTRFS | grep warn
185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
-9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
-9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
-9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
-9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
-9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
-9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
mirror 1
209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
mirror 1
210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
mirror 1
211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
mirror 1

The above appeared while running the command below, under fairly high
IO load from a few VMs (one Linux guest on Ext4 with a Firebird
database doing lots of random reads/writes, and two others running
Windows 2016 with Windows Update in the background):

btrfs balance start -dusage=85 ./

Surprisingly, the error counters did not increase (how is this even
possible?!):

root@node0:/var/lib/libvirt/images# backup_dir="/var/lib/libvirt"; btrfs
fi df $backup_dir; btrfs filesystem usage $backup_dir; btrfs dev stats
$backup_dir
Data, single: total=203.00GiB, used=188.28GiB
System, single: total=32.00MiB, used=48.00KiB
Metadata, single: total=3.00GiB, used=856.27MiB
GlobalReserve, single: total=512.00MiB, used=0.00B
Overall:
    Device size:                 300.00GiB
    Device allocated:            206.03GiB
    Device unallocated:           93.97GiB
    Device missing:                  0.00B
    Used:                        189.12GiB
    Free (estimated):            108.68GiB      (min: 108.68GiB)
    Data ratio:                       1.00
    Metadata ratio:                   1.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,single: Size:203.00GiB, Used:188.28GiB
   /dev/mapper/raid10-images     203.00GiB

Metadata,single: Size:3.00GiB, Used:856.27MiB
   /dev/mapper/raid10-images       3.00GiB

System,single: Size:32.00MiB, Used:48.00KiB
   /dev/mapper/raid10-images      32.00MiB

Unallocated:
   /dev/mapper/raid10-images      93.97GiB
[/dev/mapper/raid10-images].write_io_errs    0
[/dev/mapper/raid10-images].read_io_errs     0
[/dev/mapper/raid10-images].flush_io_errs    0
[/dev/mapper/raid10-images].corruption_errs  0

A few words about the setup. Hardware layout:

/dev/sda Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q,
SerialNo=S251NX0H822367V
/dev/sdb Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q,
SerialNo=S251NX0H822370A
/dev/sdc Model=INTEL SSDSC2BP240G4, FwRev=L2010420,
SerialNo=BTJR6141002D240AGN
/dev/sdd Model=INTEL SSDSC2BP240G4, FwRev=L2010420,
SerialNo=BTJR6063000F240AGN

Linux Software RAID layout (mdraid):

md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
      468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

md0 : active raid1 sda1[0] sdb1[1]
      15616000 blocks super 1.2 [2/2] [UU]

Logical layout:

NAME                MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda                   8:0    0 238.5G  0 disk  
├─sda1                8:1    0  14.9G  0 part  
│ └─md0               9:0    0  14.9G  0 raid1 
│   └─raid1-rootfs  253:2    0  14.9G  0 lvm    /
└─sda2                8:2    0 223.6G  0 part  
  └─md1               9:1    0 446.9G  0 raid10
    ├─raid10-images 253:0    0   300G  0 lvm    /var/lib/libvirt
    └─raid10-swapfs 253:1    0     4G  0 lvm   
sdb                   8:16   0 238.5G  0 disk  
├─sdb1                8:17   0  14.9G  0 part  
│ └─md0               9:0    0  14.9G  0 raid1 
│   └─raid1-rootfs  253:2    0  14.9G  0 lvm    /
└─sdb2                8:18   0 223.6G  0 part  
  └─md1               9:1    0 446.9G  0 raid10
    ├─raid10-images 253:0    0   300G  0 lvm    /var/lib/libvirt
    └─raid10-swapfs 253:1    0     4G  0 lvm   
sdc                   8:32   0 223.6G  0 disk  
└─sdc1                8:33   0 223.6G  0 part  
  └─md1               9:1    0 446.9G  0 raid10
    ├─raid10-images 253:0    0   300G  0 lvm    /var/lib/libvirt
    └─raid10-swapfs 253:1    0     4G  0 lvm   
sdd                   8:48   0 223.6G  0 disk  
└─sdd1                8:49   0 223.6G  0 part  
  └─md1               9:1    0 446.9G  0 raid10
    ├─raid10-images 253:0    0   300G  0 lvm    /var/lib/libvirt
    └─raid10-swapfs 253:1    0     4G  0 lvm   

BTRFS is mounted with the options below:

/dev/mapper/raid10-images on /var/lib/libvirt type btrfs
(rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)

As you can see, the KVM VMs reside on a BTRFS volume on top of an LVM
partition, on top of software RAID10, on top of 4 SSD drives. Let's set
aside the performance aspect, which is at least 15-20 times worse (in
terms of possible IO ops in the VM and latency) compared to the same
setup with the image residing on an Ext4 volume. Do you also see such a
huge performance impact in your environment? The VMs live on
preallocated RAW images on top of the BTRFS filesystem.
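
Something like the following fio run illustrates the kind of random
read/write load I mean (the tool, path and parameters here are only an
example, not the exact benchmark I ran):

fio --name=vm-like-randrw \
    --filename=/var/lib/libvirt/images/fio-test.bin --size=4G \
    --rw=randrw --rwmixread=70 --bs=8k \
    --ioengine=libaio --iodepth=32 --direct=1 \
    --runtime=60 --time_based --group_reporting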

The filesystem was created and used with the following tools:

root@node0:/var/lib/libvirt/images# uname -a
Linux node0 4.15.0-0.bpo.2-amd64 #1 SMP Debian 4.15.11-1~bpo9+1
(2018-04-07) x86_64 GNU/Linux

root@node0:/var/lib/libvirt/images# btrfs version
btrfs-progs v4.13.3

root@node0:/var/lib/libvirt/images# cat /etc/debian_version
9.4

Both btrfs-progs and the kernel are from backports. After I finished
restoring from backup I ran btrfs balance again:

root@node0:/var/lib/libvirt/images# btrfs balance start ./          
WARNING:

        Full balance without filters requested. This operation is very
        intense and takes potentially very long. It is recommended to
        use the balance filters to narrow down the scope of balance.
        Use 'btrfs balance start --full-balance' option to skip this
        warning. The operation will start in 10 seconds.
        Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting balance without any filters.
May 22 21:41:28 node0 kernel: [2948094.226177] BTRFS info (device dm-0):
relocating block group 437201666048 flags data
May 22 21:41:28 node0 kernel: [2948094.818796] BTRFS info (device dm-0):
found 1518 extents
May 22 21:41:30 node0 kernel: [2948096.678234] BTRFS info (device dm-0):
found 1518 extents
May 22 21:41:30 node0 kernel: [2948096.736582] BTRFS info (device dm-0):
relocating block group 436127924224 flags data
May 22 21:41:32 node0 kernel: [2948098.811837] BTRFS info (device dm-0):
found 3932 extents
May 22 21:41:34 node0 kernel: [2948100.397485] BTRFS info (device dm-0):
found 3932 extents
May 22 21:41:34 node0 kernel: [2948100.473214] BTRFS info (device dm-0):
relocating block group 428611731456 flags data
May 22 21:41:37 node0 kernel: [2948103.862030] BTRFS info (device dm-0):
found 3519 extents
May 22 21:41:38 node0 kernel: [2948104.961367] BTRFS info (device dm-0):
found 3519 extents
May 22 21:41:39 node0 kernel: [2948105.026999] BTRFS info (device dm-0):
relocating block group 427537989632 flags data
May 22 21:41:40 node0 kernel: [2948106.155623] BTRFS warning (device
dm-0): csum failed root -9 ino 336 off 608284672 csum 0x7da1b152
expected csum 0x3163a9b7 mirror 1
May 22 21:41:40 node0 kernel: [2948106.156301] BTRFS warning (device
dm-0): csum failed root -9 ino 336 off 608284672 csum 0x7d03a376
expected csum 0x3163a9b7 mirror 1

ERROR: error during balancing './': Input/output error
There may be more info in syslog - try dmesg | tail

root@node0:/var/lib/libvirt/images# backup_dir="/var/lib/libvirt"; btrfs
dev stats $backup_dir
[/dev/mapper/raid10-images].write_io_errs    0
[/dev/mapper/raid10-images].read_io_errs     0
[/dev/mapper/raid10-images].flush_io_errs    0
[/dev/mapper/raid10-images].corruption_errs  0
[/dev/mapper/raid10-images].generation_errs  0

Again no errors in btrfs dev stats. I am fairly sure that I corrupted
the filesystem during a remount:

mount -o
remount,noatime,nodiratime,rw,ssd,space_cache,compress=lzo,autodefrag
/dev/mapper/raid10-images /var/lib/libvirt

when I changed the BTRFS compression parameters, or during umount (I
can't recall now):

May  2 07:15:39 node0 kernel: [1168145.677431] WARNING: CPU: 6 PID: 3763
at /build/linux-8B5M4n/linux-4.15.11/fs/direct-io.c:293
dio_complete+0x1d6/0x220
May  2 07:15:39 node0 kernel: [1168145.678811] Modules linked in: fuse
ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs vhost_net vhost
tap tun ebtable_filter ebtables ip6tab
le_filter ip6_tables iptable_filter binfmt_misc bridge 8021q garp mrp
stp llc snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal
intel_powerclamp coretemp snd_hda_codec_realtek kvm
_intel snd_hda_codec_generic kvm i915 irqbypass crct10dif_pclmul
snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_hda_codec
intel_cstate snd_hda_core iTCO_wdt iTCO_vendor_support
 intel_uncore drm_kms_helper snd_hwdep wmi_bmof intel_rapl_perf joydev
evdev pcspkr snd_pcm snd_timer drm snd soundcore i2c_algo_bit sg mei_me
lpc_ich shpchp mfd_core mei ie31200_e
dac wmi video button ib_iser rdma_cm iw_cm ib_cm ib_core configfs
iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables
May  2 07:15:39 node0 kernel: [1168145.685202]  x_tables autofs4 ext4
crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress zstd_compress
xxhash raid456 async_raid6_recov async_mem
cpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic
raid0 multipath linear hid_generic usbhid hid dm_mod raid10 raid1 md_mod
sd_mod crc32c_intel ahci i2c_i801 lib
ahci aesni_intel xhci_pci aes_x86_64 ehci_pci libata crypto_simd
xhci_hcd ehci_hcd cryptd glue_helper e1000e scsi_mod ptp usbcore
pps_core usb_common fan thermal
May  2 07:15:39 node0 kernel: [1168145.689057] CPU: 6 PID: 3763 Comm:
kworker/6:2 Not tainted 4.15.0-0.bpo.2-amd64 #1 Debian 4.15.11-1~bpo9+1
May  2 07:15:39 node0 kernel: [1168145.690347] Hardware name: LENOVO
ThinkServer TS140/ThinkServer TS140, BIOS FBKTB3AUS 06/16/2015
May  2 07:15:39 node0 kernel: [1168145.691659] Workqueue: dio/dm-0
dio_aio_complete_work
May  2 07:15:39 node0 kernel: [1168145.692935] RIP:
0010:dio_complete+0x1d6/0x220
May  2 07:15:39 node0 kernel: [1168145.694275] RSP:
0018:ffff9abc68447e50 EFLAGS: 00010286
May  2 07:15:39 node0 kernel: [1168145.695605] RAX: 00000000fffffff0
RBX: ffff8e33712e3480 RCX: ffff9abc68447c88
May  2 07:15:39 node0 kernel: [1168145.697024] RDX: fffff1dcc92e4c1f
RSI: 0000000000000000 RDI: 0000000000000246
May  2 07:15:39 node0 kernel: [1168145.698389] RBP: 0000000000005000
R08: 0000000000000000 R09: ffffffffb7a075c0
May  2 07:15:39 node0 kernel: [1168145.699703] R10: ffff8e33bb4223c0
R11: 0000000000000195 R12: 0000000000005000
May  2 07:15:39 node0 kernel: [1168145.701044] R13: 0000000000000003
R14: 0000000403060000 R15: ffff8e33712e3500
May  2 07:15:39 node0 kernel: [1168145.702238] FS: 
0000000000000000(0000) GS:ffff8e349eb80000(0000) knlGS:0000000000000000
May  2 07:15:39 node0 kernel: [1168145.703475] CS:  0010 DS: 0000 ES:
0000 CR0: 0000000080050033
May  2 07:15:39 node0 kernel: [1168145.704733] CR2: 00007ff89915b08e
CR3: 00000005b2e0a005 CR4: 00000000001626e0
May  2 07:15:39 node0 kernel: [1168145.705955] Call Trace:
May  2 07:15:39 node0 kernel: [1168145.707151]  process_one_work+0x177/0x360
May  2 07:15:39 node0 kernel: [1168145.708373]  worker_thread+0x4d/0x3c0
May  2 07:15:39 node0 kernel: [1168145.709501]  kthread+0xf8/0x130
May  2 07:15:39 node0 kernel: [1168145.710603]  ?
process_one_work+0x360/0x360
May  2 07:15:39 node0 kernel: [1168145.711701]  ?
kthread_create_worker_on_cpu+0x70/0x70
May  2 07:15:39 node0 kernel: [1168145.712845]  ? SyS_exit_group+0x10/0x10
May  2 07:15:39 node0 kernel: [1168145.713973]  ret_from_fork+0x35/0x40
May  2 07:15:39 node0 kernel: [1168145.715072] Code: 8b 78 30 48 83 7f
58 00 0f 84 e5 fe ff ff 49 8d 54 2e ff 4c 89 f6 48 c1 fe 0c 48 c1 fa 0c
e8 c2 e0 f3 ff 85 c0 0f 84 c8 fe ff f
f <0f> 0b e9 c1 fe ff ff 8b 47 20 a8 10 0f 84 e2 fe ff ff 48 8b 77
May  2 07:15:39 node0 kernel: [1168145.717349] ---[ end trace
cfa707d6465e13d2 ]---

If someone is interested in investigating then please let me know; the
data is not important. The fact that read_io_errs does not increment is
particularly critical IMHO.




-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10




* Re: csum failed root raveled during balance
  2018-05-22 20:05 csum failed root raveled during balance ein
@ 2018-05-23  6:32 ` Nikolay Borisov
  2018-05-23  8:03   ` ein
  2018-05-27  5:50   ` Andrei Borzenkov
  0 siblings, 2 replies; 14+ messages in thread
From: Nikolay Borisov @ 2018-05-23  6:32 UTC (permalink / raw)
  To: ein, linux-btrfs



On 22.05.2018 23:05, ein wrote:
> Hello devs,
> 
> I tested BTRFS in production for about a month:
> 
> 21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00
> 
> Without power blackout, hardware failure, SSD's SMART is flawless etc.
> The tests ended with:
> 
> root@node0:~# dmesg | grep BTRFS | grep warn
> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
> mirror 1
> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
> mirror 1
> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
> mirror 1
> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
> mirror 1
> 
> Above has been revealed during below command and quite high IO usage by
> few VMs (Linux on top Ext4 with firebird database, lots of random
> read/writes, two others with Windows 2016 and Windows Update in the
> background):

I believe you are hitting the issue described here:

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html

Essentially the way qemu operates on VM images atop btrfs is prone to
producing such errors. As a matter of fact, other filesystems also
suffer from this (i.e. pages modified while being written; however, due
to the lack of CRCs on the data they don't detect it). Can you confirm
that those inodes (312/314/319/219962/219915) belong to VM image files?

IMHO the best course of action would be to disable checksumming for
your VM files.


For some background I suggest you read the following LWN articles:

https://lwn.net/Articles/486311/
https://lwn.net/Articles/442355/

> 
> when I changed BTRFS compress parameters. Or during umount (can't recall
> now):
> 
> May  2 07:15:39 node0 kernel: [1168145.677431] WARNING: CPU: 6 PID: 3763
> at /build/linux-8B5M4n/linux-4.15.11/fs/direct-io.c:293
> dio_complete+0x1d6/0x220
> May  2 07:15:39 node0 kernel: [1168145.678811] Modules linked in: fuse
> ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs vhost_net vhost
> tap tun ebtable_filter ebtables ip6tab
> le_filter ip6_tables iptable_filter binfmt_misc bridge 8021q garp mrp
> stp llc snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal
> intel_powerclamp coretemp snd_hda_codec_realtek kvm
> _intel snd_hda_codec_generic kvm i915 irqbypass crct10dif_pclmul
> snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_hda_codec
> intel_cstate snd_hda_core iTCO_wdt iTCO_vendor_support
>  intel_uncore drm_kms_helper snd_hwdep wmi_bmof intel_rapl_perf joydev
> evdev pcspkr snd_pcm snd_timer drm snd soundcore i2c_algo_bit sg mei_me
> lpc_ich shpchp mfd_core mei ie31200_e
> dac wmi video button ib_iser rdma_cm iw_cm ib_cm ib_core configfs
> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables
> May  2 07:15:39 node0 kernel: [1168145.685202]  x_tables autofs4 ext4
> crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress zstd_compress
> xxhash raid456 async_raid6_recov async_mem
> cpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic
> raid0 multipath linear hid_generic usbhid hid dm_mod raid10 raid1 md_mod
> sd_mod crc32c_intel ahci i2c_i801 lib
> ahci aesni_intel xhci_pci aes_x86_64 ehci_pci libata crypto_simd
> xhci_hcd ehci_hcd cryptd glue_helper e1000e scsi_mod ptp usbcore
> pps_core usb_common fan thermal
> May  2 07:15:39 node0 kernel: [1168145.689057] CPU: 6 PID: 3763 Comm:
> kworker/6:2 Not tainted 4.15.0-0.bpo.2-amd64 #1 Debian 4.15.11-1~bpo9+1
> May  2 07:15:39 node0 kernel: [1168145.690347] Hardware name: LENOVO
> ThinkServer TS140/ThinkServer TS140, BIOS FBKTB3AUS 06/16/2015
> May  2 07:15:39 node0 kernel: [1168145.691659] Workqueue: dio/dm-0
> dio_aio_complete_work
> May  2 07:15:39 node0 kernel: [1168145.692935] RIP:
> 0010:dio_complete+0x1d6/0x220
> May  2 07:15:39 node0 kernel: [1168145.694275] RSP:
> 0018:ffff9abc68447e50 EFLAGS: 00010286
> May  2 07:15:39 node0 kernel: [1168145.695605] RAX: 00000000fffffff0
> RBX: ffff8e33712e3480 RCX: ffff9abc68447c88
> May  2 07:15:39 node0 kernel: [1168145.697024] RDX: fffff1dcc92e4c1f
> RSI: 0000000000000000 RDI: 0000000000000246
> May  2 07:15:39 node0 kernel: [1168145.698389] RBP: 0000000000005000
> R08: 0000000000000000 R09: ffffffffb7a075c0
> May  2 07:15:39 node0 kernel: [1168145.699703] R10: ffff8e33bb4223c0
> R11: 0000000000000195 R12: 0000000000005000
> May  2 07:15:39 node0 kernel: [1168145.701044] R13: 0000000000000003
> R14: 0000000403060000 R15: ffff8e33712e3500
> May  2 07:15:39 node0 kernel: [1168145.702238] FS: 
> 0000000000000000(0000) GS:ffff8e349eb80000(0000) knlGS:0000000000000000
> May  2 07:15:39 node0 kernel: [1168145.703475] CS:  0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
> May  2 07:15:39 node0 kernel: [1168145.704733] CR2: 00007ff89915b08e
> CR3: 00000005b2e0a005 CR4: 00000000001626e0
> May  2 07:15:39 node0 kernel: [1168145.705955] Call Trace:
> May  2 07:15:39 node0 kernel: [1168145.707151]  process_one_work+0x177/0x360
> May  2 07:15:39 node0 kernel: [1168145.708373]  worker_thread+0x4d/0x3c0
> May  2 07:15:39 node0 kernel: [1168145.709501]  kthread+0xf8/0x130
> May  2 07:15:39 node0 kernel: [1168145.710603]  ?
> process_one_work+0x360/0x360
> May  2 07:15:39 node0 kernel: [1168145.711701]  ?
> kthread_create_worker_on_cpu+0x70/0x70
> May  2 07:15:39 node0 kernel: [1168145.712845]  ? SyS_exit_group+0x10/0x10
> May  2 07:15:39 node0 kernel: [1168145.713973]  ret_from_fork+0x35/0x40
> May  2 07:15:39 node0 kernel: [1168145.715072] Code: 8b 78 30 48 83 7f
> 58 00 0f 84 e5 fe ff ff 49 8d 54 2e ff 4c 89 f6 48 c1 fe 0c 48 c1 fa 0c
> e8 c2 e0 f3 ff 85 c0 0f 84 c8 fe ff f
> f <0f> 0b e9 c1 fe ff ff 8b 47 20 a8 10 0f 84 e2 fe ff ff 48 8b 77
> May  2 07:15:39 node0 kernel: [1168145.717349] ---[ end trace
> cfa707d6465e13d2 ]---
> 
> If someone is interested in investigating then please let me know. The
> data is not important. The lack of incrementing read_io_errs is
> particularly critical IMHO.

This warning is due to mixing buffered and direct IO. For more info
check the commit log of:

332391a9935d ("fs: Fix page cache inconsistency when mixing buffered and
AIO DIO")


* Re: csum failed root raveled during balance
  2018-05-23  6:32 ` Nikolay Borisov
@ 2018-05-23  8:03   ` ein
  2018-05-23  9:09     ` Duncan
  2018-05-23 11:12     ` Nikolay Borisov
  2018-05-27  5:50   ` Andrei Borzenkov
  1 sibling, 2 replies; 14+ messages in thread
From: ein @ 2018-05-23  8:03 UTC (permalink / raw)
  To: Nikolay Borisov, linux-btrfs

On 05/23/2018 08:32 AM, Nikolay Borisov wrote:

Nikolay, thank you for the answer.

>> [...]
>> root@node0:~# dmesg | grep BTRFS | grep warn
>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>> mirror 1
>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>> mirror 1
>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>>
>> Above has been revealed during below command and quite high IO usage by
>> few VMs (Linux on top Ext4 with firebird database, lots of random
>> read/writes, two others with Windows 2016 and Windows Update in the
>> background):
> 
> I believe you are hitting the issue described here:
> 
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html

That makes sense: fsck.ext4, gbak (the Firebird integrity checking
tool), chkdsk and sfc /scannow don't show any errors internally within
the VMs. As far as I can tell the data inside the VMs is not corrupted.

> Essentially the way qemu operates on vm images atop btrfs is prone to
> producing such errors. As a matter of fact, other filesystems also
> suffer from this(i.e pages modified while being written, however due to
> lack of CRC on the data they don't detect it). Can you confirm that
> those inodes (312/314/319/219962/219915) belong to vm images files?

root@node0:/var/lib/libvirt# find  ./ -inum 312
root@node0:/var/lib/libvirt# find  ./ -inum 314
root@node0:/var/lib/libvirt# find  ./ -inum 319
root@node0:/var/lib/libvirt# find  ./ -inum 219962
./images/rds.raw
root@node0:/var/lib/libvirt# find  ./ -inum 219915
./images/database.raw

It seems so (219962, 219915):
- rds.raw - Windows 2016 Server, Remote Desktop Server, raw
preallocated image, NTFS
- database.raw - Linux 3.8, Firebird DB server, raw preallocated image,
Ext4

> IMHO the best course of action would be to disable checksumming for you
> vm files.
> 

Do you mean the '-o nodatasum' mount flag? Is it possible to disable
checksumming for a single file by setting some magical chattr? Google
thinks it's not possible to disable csums for a single file.

> For some background I suggest you read the following LWN articles:
> 
> https://lwn.net/Articles/486311/
> https://lwn.net/Articles/442355/
> 
>>
>> when I changed BTRFS compress parameters. Or during umount (can't recall
>> now):
>>
>> May  2 07:15:39 node0 kernel: [1168145.677431] WARNING: CPU: 6 PID: 3763
>> at /build/linux-8B5M4n/linux-4.15.11/fs/direct-io.c:293
>> dio_complete+0x1d6/0x220
>> May  2 07:15:39 node0 kernel: [1168145.678811] Modules linked in: fuse
>> ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs vhost_net vhost
>> tap tun ebtable_filter ebtables ip6tab
>> le_filter ip6_tables iptable_filter binfmt_misc bridge 8021q garp mrp
>> stp llc snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal
>> intel_powerclamp coretemp snd_hda_codec_realtek kvm
>> _intel snd_hda_codec_generic kvm i915 irqbypass crct10dif_pclmul
>> snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_hda_codec
>> intel_cstate snd_hda_core iTCO_wdt iTCO_vendor_support
>>  intel_uncore drm_kms_helper snd_hwdep wmi_bmof intel_rapl_perf joydev
>> evdev pcspkr snd_pcm snd_timer drm snd soundcore i2c_algo_bit sg mei_me
>> lpc_ich shpchp mfd_core mei ie31200_e
>> dac wmi video button ib_iser rdma_cm iw_cm ib_cm ib_core configfs
>> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables
>> May  2 07:15:39 node0 kernel: [1168145.685202]  x_tables autofs4 ext4
>> crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress zstd_compress
>> xxhash raid456 async_raid6_recov async_mem
>> cpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic
>> raid0 multipath linear hid_generic usbhid hid dm_mod raid10 raid1 md_mod
>> sd_mod crc32c_intel ahci i2c_i801 lib
>> ahci aesni_intel xhci_pci aes_x86_64 ehci_pci libata crypto_simd
>> xhci_hcd ehci_hcd cryptd glue_helper e1000e scsi_mod ptp usbcore
>> pps_core usb_common fan thermal
>> May  2 07:15:39 node0 kernel: [1168145.689057] CPU: 6 PID: 3763 Comm:
>> kworker/6:2 Not tainted 4.15.0-0.bpo.2-amd64 #1 Debian 4.15.11-1~bpo9+1
>> May  2 07:15:39 node0 kernel: [1168145.690347] Hardware name: LENOVO
>> ThinkServer TS140/ThinkServer TS140, BIOS FBKTB3AUS 06/16/2015
>> May  2 07:15:39 node0 kernel: [1168145.691659] Workqueue: dio/dm-0
>> dio_aio_complete_work
>> May  2 07:15:39 node0 kernel: [1168145.692935] RIP:
>> 0010:dio_complete+0x1d6/0x220
>> May  2 07:15:39 node0 kernel: [1168145.694275] RSP:
>> 0018:ffff9abc68447e50 EFLAGS: 00010286
>> May  2 07:15:39 node0 kernel: [1168145.695605] RAX: 00000000fffffff0
>> RBX: ffff8e33712e3480 RCX: ffff9abc68447c88
>> May  2 07:15:39 node0 kernel: [1168145.697024] RDX: fffff1dcc92e4c1f
>> RSI: 0000000000000000 RDI: 0000000000000246
>> May  2 07:15:39 node0 kernel: [1168145.698389] RBP: 0000000000005000
>> R08: 0000000000000000 R09: ffffffffb7a075c0
>> May  2 07:15:39 node0 kernel: [1168145.699703] R10: ffff8e33bb4223c0
>> R11: 0000000000000195 R12: 0000000000005000
>> May  2 07:15:39 node0 kernel: [1168145.701044] R13: 0000000000000003
>> R14: 0000000403060000 R15: ffff8e33712e3500
>> May  2 07:15:39 node0 kernel: [1168145.702238] FS: 
>> 0000000000000000(0000) GS:ffff8e349eb80000(0000) knlGS:0000000000000000
>> May  2 07:15:39 node0 kernel: [1168145.703475] CS:  0010 DS: 0000 ES:
>> 0000 CR0: 0000000080050033
>> May  2 07:15:39 node0 kernel: [1168145.704733] CR2: 00007ff89915b08e
>> CR3: 00000005b2e0a005 CR4: 00000000001626e0
>> May  2 07:15:39 node0 kernel: [1168145.705955] Call Trace:
>> May  2 07:15:39 node0 kernel: [1168145.707151]  process_one_work+0x177/0x360
>> May  2 07:15:39 node0 kernel: [1168145.708373]  worker_thread+0x4d/0x3c0
>> May  2 07:15:39 node0 kernel: [1168145.709501]  kthread+0xf8/0x130
>> May  2 07:15:39 node0 kernel: [1168145.710603]  ?
>> process_one_work+0x360/0x360
>> May  2 07:15:39 node0 kernel: [1168145.711701]  ?
>> kthread_create_worker_on_cpu+0x70/0x70
>> May  2 07:15:39 node0 kernel: [1168145.712845]  ? SyS_exit_group+0x10/0x10
>> May  2 07:15:39 node0 kernel: [1168145.713973]  ret_from_fork+0x35/0x40
>> May  2 07:15:39 node0 kernel: [1168145.715072] Code: 8b 78 30 48 83 7f
>> 58 00 0f 84 e5 fe ff ff 49 8d 54 2e ff 4c 89 f6 48 c1 fe 0c 48 c1 fa 0c
>> e8 c2 e0 f3 ff 85 c0 0f 84 c8 fe ff f
>> f <0f> 0b e9 c1 fe ff ff 8b 47 20 a8 10 0f 84 e2 fe ff ff 48 8b 77
>> May  2 07:15:39 node0 kernel: [1168145.717349] ---[ end trace
>> cfa707d6465e13d2 ]---
>>
>> If someone is interested in investigating then please let me know. The
>> data is not important. The lack of incrementing read_io_errs is
>> particularly critical IMHO.
> 
> This warning is due to mixing buffered/dio. For more info check the
> commit log of :
> 
> 332391a9935d ("fs: Fix page cache inconsistency when mixing buffered and
> AIO DIO")

Reading the BTRFS code is beyond my understanding. Have you thought
about the read_io_errs counter?

Balance reveals an IO read error, copying the VM file ends with an IO
read error, yet read_io_errs is unchanged - it still shows "0".


-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10


* Re: csum failed root raveled during balance
  2018-05-23  8:03   ` ein
@ 2018-05-23  9:09     ` Duncan
  2018-05-23 10:09       ` ein
  2018-05-23 11:12     ` Nikolay Borisov
  1 sibling, 1 reply; 14+ messages in thread
From: Duncan @ 2018-05-23  9:09 UTC (permalink / raw)
  To: linux-btrfs

ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:

>> IMHO the best course of action would be to disable checksumming for you
>> vm files.
>> 
>> 
> Do you mean '-o nodatasum' mount flag? Is it possible to disable
> checksumming for singe file by setting some magical chattr? Google
> thinks it's not possible to disable csums for a single file.

You can use nocow (-C), but of course that has other restrictions (it
must be set on files while they're still zero-length, which for
existing data is easiest done by setting it on the containing dir and
copying the files in without reflinking) as well as the usual nocow
effects. Also, nocow becomes cow1 after a snapshot: the snapshot locks
the existing copy in place, so the next change to a block /must/ be
written elsewhere - cow the first time a block is written after the
snapshot, but nocow again for repeated writes between snapshots.

But if you're disabling checksumming anyway, nocow's likely the way to go.
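
For an existing image that could look roughly like this (using the
paths from your layout; adjust as needed):

# illustrative only: files copied (without reflink) into a +C directory
# inherit nocow and thus lose data checksums
mkdir /var/lib/libvirt/images-nocow
chattr +C /var/lib/libvirt/images-nocow
cp --reflink=never /var/lib/libvirt/images/rds.raw \
    /var/lib/libvirt/images-nocow/
lsattr /var/lib/libvirt/images-nocow/rds.raw   # should show the 'C' attribute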

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



* Re: csum failed root raveled during balance
  2018-05-23  9:09     ` Duncan
@ 2018-05-23 10:09       ` ein
  2018-05-23 11:03         ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 14+ messages in thread
From: ein @ 2018-05-23 10:09 UTC (permalink / raw)
  To: Duncan, linux-btrfs

On 05/23/2018 11:09 AM, Duncan wrote:
> ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:
> 
>>> IMHO the best course of action would be to disable checksumming for you
>>> vm files.
>>
>> Do you mean '-o nodatasum' mount flag? Is it possible to disable
>> checksumming for singe file by setting some magical chattr? Google
>> thinks it's not possible to disable csums for a single file.
> 
> You can use nocow (-C), but of course that has other restrictions (like 
> setting it on the files when they're zero-length, easiest done for 
> existing data by setting it on the containing dir and copying files (no 
> reflink) in) as well as the nocow effects, and nocow becomes cow1 after a 
> snapshot (which locks the existing copy in place so changes written to a 
> block /must/ be written elsewhere, thus the cow1, aka cow the first time 
> written after the snapshot but retain the nocow for repeated writes 
> between snapshots).
> 
> But if you're disabling checksumming anyway, nocow's likely the way to go.

Disabling checksumming alone may be a way to go - we live without it
every day on other filesystems. But nocow on the VM files defeats the
whole purpose of using BTRFS for me, even with the huge performance
penalty, because of backups - I mean keeping a few snapshots (20-30)
plus send & receive.
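
Roughly this kind of flow (assuming the images live in their own
subvolume; snapshot names and the backup host are only placeholders):

btrfs subvolume snapshot -r /var/lib/libvirt/images \
    /var/lib/libvirt/snap/images-2018-05-23
# full send the first time...
btrfs send /var/lib/libvirt/snap/images-2018-05-23 | \
    ssh backuphost 'btrfs receive /backup/node0'
# ...then incrementals against the previous read-only snapshot
btrfs send -p /var/lib/libvirt/snap/images-2018-05-22 \
    /var/lib/libvirt/snap/images-2018-05-23 | \
    ssh backuphost 'btrfs receive /backup/node0'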


-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10


* Re: csum failed root raveled during balance
  2018-05-23 10:09       ` ein
@ 2018-05-23 11:03         ` Austin S. Hemmelgarn
  2018-05-28 17:10           ` ein
  0 siblings, 1 reply; 14+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-23 11:03 UTC (permalink / raw)
  To: ein, Duncan, linux-btrfs

On 2018-05-23 06:09, ein wrote:
> On 05/23/2018 11:09 AM, Duncan wrote:
>> ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:
>>
>>>> IMHO the best course of action would be to disable checksumming for you
>>>> vm files.
>>>
>>> Do you mean '-o nodatasum' mount flag? Is it possible to disable
>>> checksumming for singe file by setting some magical chattr? Google
>>> thinks it's not possible to disable csums for a single file.
>>
>> You can use nocow (-C), but of course that has other restrictions (like
>> setting it on the files when they're zero-length, easiest done for
>> existing data by setting it on the containing dir and copying files (no
>> reflink) in) as well as the nocow effects, and nocow becomes cow1 after a
>> snapshot (which locks the existing copy in place so changes written to a
>> block /must/ be written elsewhere, thus the cow1, aka cow the first time
>> written after the snapshot but retain the nocow for repeated writes
>> between snapshots).
>>
>> But if you're disabling checksumming anyway, nocow's likely the way to go.
> 
> Disabling checksumming only may be a way to go - we live without it
> every day. But nocow @ VM files defeats whole purpose of using BTRFS for
> me, even with huge performance penalty - backup reasons - I mean few
> snapshots (20-30), send & receive.
> 
Setting NOCOW on a file doesn't prevent it from being snapshotted, it 
just prevents COW operations from happening under most normal 
circumstances.  In essence, it prevents COW from happening except for 
writing right after the snapshot.  More specifically, the first write to 
a given block in a file set for NOCOW after taking a snapshot will 
trigger a _single_ COW operation for _only_ that block (unless you have 
autodefrag enabled too), after which that block will revert to not doing 
COW operations on write.  This way, you still get consistent and working 
snapshots, but you also don't take the performance hits from COW except 
right after taking a snapshot.



* Re: csum failed root raveled during balance
  2018-05-23  8:03   ` ein
  2018-05-23  9:09     ` Duncan
@ 2018-05-23 11:12     ` Nikolay Borisov
  1 sibling, 0 replies; 14+ messages in thread
From: Nikolay Borisov @ 2018-05-23 11:12 UTC (permalink / raw)
  To: ein, linux-btrfs



On 23.05.2018 11:03, ein wrote:
> On 05/23/2018 08:32 AM, Nikolay Borisov wrote:
> 
> Nikolay, thank you for the answer.
> 
>>> [...]
>>> root@node0:~# dmesg | grep BTRFS | grep warn
>>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>>> mirror 1
>>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>>> mirror 1
>>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>> mirror 1
>>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>> mirror 1
>>>
>>> Above has been revealed during below command and quite high IO usage by
>>> few VMs (Linux on top Ext4 with firebird database, lots of random
>>> read/writes, two others with Windows 2016 and Windows Update in the
>>> background):
>>
>> I believe you are hitting the issue described here:
>>
>> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
> 
> It make sense, fsck.ext4, gbak - firebird integrity checking tool,
> chkdsk and sfc /scannow don't show any errors internally within VM. As
> far I can tell the data inside VMs is not corrupted somehow.
> 
>> Essentially the way qemu operates on vm images atop btrfs is prone to
>> producing such errors. As a matter of fact, other filesystems also
>> suffer from this(i.e pages modified while being written, however due to
>> lack of CRC on the data they don't detect it). Can you confirm that
>> those inodes (312/314/319/219962/219915) belong to vm images files?
> 
> root@node0:/var/lib/libvirt# find  ./ -inum 312
> root@node0:/var/lib/libvirt# find  ./ -inum 314
> root@node0:/var/lib/libvirt# find  ./ -inum 319
> root@node0:/var/lib/libvirt# find  ./ -inum 219962
> ./images/rds.raw
> root@node0:/var/lib/libvirt# find  ./ -inum 219915
> ./images/database.raw
> 
> It seems so (219962, 219915):
> - rds.raw - Windows 2016 server, Remote Desktop Server, raw preallocated
> image, NTFS
> database.raw - Linux 3.8, Firebird DB server, raw preallocated image, Ext4
> 
>> IMHO the best course of action would be to disable checksumming for you
>> vm files.
>>
> 
> Do you mean '-o nodatasum' mount flag? Is it possible to disable
> checksumming for singe file by setting some magical chattr? Google
> thinks it's not possible to disable csums for a single file.

You can't disable checksumming for a single file. However, what you
could do is set the "No CoW" flag via chattr +C /path/to/file, since it
also disables checksumming. Bear in mind you can't set this flag on a
file which already has allocated blocks, so you'd have to create an
empty file, set the +C flag and then copy the data in, with dd for
example.
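
Something along these lines (purely illustrative; do this while the VM
is shut down):

cd /var/lib/libvirt/images
touch rds.raw.nocow
chattr +C rds.raw.nocow        # only works while the file is still empty
dd if=rds.raw of=rds.raw.nocow bs=1M conv=fsync status=progress
mv rds.raw rds.raw.old && mv rds.raw.nocow rds.raw
lsattr rds.raw                 # should now show the 'C' attribute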

On a different note - for database workloads, and random workloads in
general, it makes no sense to have CoW, since you'd see very spiky IO
performance.

> 
>> For some background I suggest you read the following LWN articles:
>>
>> https://lwn.net/Articles/486311/
>> https://lwn.net/Articles/442355/
>>
>>>
>>> when I changed BTRFS compress parameters. Or during umount (can't recall
>>> now):
>>>
>>> May  2 07:15:39 node0 kernel: [1168145.677431] WARNING: CPU: 6 PID: 3763
>>> at /build/linux-8B5M4n/linux-4.15.11/fs/direct-io.c:293
>>> dio_complete+0x1d6/0x220
>>> May  2 07:15:39 node0 kernel: [1168145.678811] Modules linked in: fuse
>>> ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs vhost_net vhost
>>> tap tun ebtable_filter ebtables ip6tab
>>> le_filter ip6_tables iptable_filter binfmt_misc bridge 8021q garp mrp
>>> stp llc snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal
>>> intel_powerclamp coretemp snd_hda_codec_realtek kvm
>>> _intel snd_hda_codec_generic kvm i915 irqbypass crct10dif_pclmul
>>> snd_hda_intel crc32_pclmul ghash_clmulni_intel snd_hda_codec
>>> intel_cstate snd_hda_core iTCO_wdt iTCO_vendor_support
>>>  intel_uncore drm_kms_helper snd_hwdep wmi_bmof intel_rapl_perf joydev
>>> evdev pcspkr snd_pcm snd_timer drm snd soundcore i2c_algo_bit sg mei_me
>>> lpc_ich shpchp mfd_core mei ie31200_e
>>> dac wmi video button ib_iser rdma_cm iw_cm ib_cm ib_core configfs
>>> iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables
>>> May  2 07:15:39 node0 kernel: [1168145.685202]  x_tables autofs4 ext4
>>> crc16 mbcache jbd2 fscrypto ecb btrfs zstd_decompress zstd_compress
>>> xxhash raid456 async_raid6_recov async_mem
>>> cpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic
>>> raid0 multipath linear hid_generic usbhid hid dm_mod raid10 raid1 md_mod
>>> sd_mod crc32c_intel ahci i2c_i801 lib
>>> ahci aesni_intel xhci_pci aes_x86_64 ehci_pci libata crypto_simd
>>> xhci_hcd ehci_hcd cryptd glue_helper e1000e scsi_mod ptp usbcore
>>> pps_core usb_common fan thermal
>>> May  2 07:15:39 node0 kernel: [1168145.689057] CPU: 6 PID: 3763 Comm:
>>> kworker/6:2 Not tainted 4.15.0-0.bpo.2-amd64 #1 Debian 4.15.11-1~bpo9+1
>>> May  2 07:15:39 node0 kernel: [1168145.690347] Hardware name: LENOVO
>>> ThinkServer TS140/ThinkServer TS140, BIOS FBKTB3AUS 06/16/2015
>>> May  2 07:15:39 node0 kernel: [1168145.691659] Workqueue: dio/dm-0
>>> dio_aio_complete_work
>>> May  2 07:15:39 node0 kernel: [1168145.692935] RIP:
>>> 0010:dio_complete+0x1d6/0x220
>>> May  2 07:15:39 node0 kernel: [1168145.694275] RSP:
>>> 0018:ffff9abc68447e50 EFLAGS: 00010286
>>> May  2 07:15:39 node0 kernel: [1168145.695605] RAX: 00000000fffffff0
>>> RBX: ffff8e33712e3480 RCX: ffff9abc68447c88
>>> May  2 07:15:39 node0 kernel: [1168145.697024] RDX: fffff1dcc92e4c1f
>>> RSI: 0000000000000000 RDI: 0000000000000246
>>> May  2 07:15:39 node0 kernel: [1168145.698389] RBP: 0000000000005000
>>> R08: 0000000000000000 R09: ffffffffb7a075c0
>>> May  2 07:15:39 node0 kernel: [1168145.699703] R10: ffff8e33bb4223c0
>>> R11: 0000000000000195 R12: 0000000000005000
>>> May  2 07:15:39 node0 kernel: [1168145.701044] R13: 0000000000000003
>>> R14: 0000000403060000 R15: ffff8e33712e3500
>>> May  2 07:15:39 node0 kernel: [1168145.702238] FS: 
>>> 0000000000000000(0000) GS:ffff8e349eb80000(0000) knlGS:0000000000000000
>>> May  2 07:15:39 node0 kernel: [1168145.703475] CS:  0010 DS: 0000 ES:
>>> 0000 CR0: 0000000080050033
>>> May  2 07:15:39 node0 kernel: [1168145.704733] CR2: 00007ff89915b08e
>>> CR3: 00000005b2e0a005 CR4: 00000000001626e0
>>> May  2 07:15:39 node0 kernel: [1168145.705955] Call Trace:
>>> May  2 07:15:39 node0 kernel: [1168145.707151]  process_one_work+0x177/0x360
>>> May  2 07:15:39 node0 kernel: [1168145.708373]  worker_thread+0x4d/0x3c0
>>> May  2 07:15:39 node0 kernel: [1168145.709501]  kthread+0xf8/0x130
>>> May  2 07:15:39 node0 kernel: [1168145.710603]  ?
>>> process_one_work+0x360/0x360
>>> May  2 07:15:39 node0 kernel: [1168145.711701]  ?
>>> kthread_create_worker_on_cpu+0x70/0x70
>>> May  2 07:15:39 node0 kernel: [1168145.712845]  ? SyS_exit_group+0x10/0x10
>>> May  2 07:15:39 node0 kernel: [1168145.713973]  ret_from_fork+0x35/0x40
>>> May  2 07:15:39 node0 kernel: [1168145.715072] Code: 8b 78 30 48 83 7f
>>> 58 00 0f 84 e5 fe ff ff 49 8d 54 2e ff 4c 89 f6 48 c1 fe 0c 48 c1 fa 0c
>>> e8 c2 e0 f3 ff 85 c0 0f 84 c8 fe ff f
>>> f <0f> 0b e9 c1 fe ff ff 8b 47 20 a8 10 0f 84 e2 fe ff ff 48 8b 77
>>> May  2 07:15:39 node0 kernel: [1168145.717349] ---[ end trace
>>> cfa707d6465e13d2 ]---
>>>
>>> If someone is interested in investigating then please let me know. The
>>> data is not important. The lack of incrementing read_io_errs is
>>> particularly critical IMHO.
>>
>> This warning is due to mixing buffered/dio. For more info check the
>> commit log of :
>>
>> 332391a9935d ("fs: Fix page cache inconsistency when mixing buffered and
>> AIO DIO")
> 
> Reading the BTRFS code is beyond my understanding. Have you thought
> about read_io_errs counter?

I didn't say read the btrfs code but rather read the commit messages.

> 
> Balance reveals IO read error, copying VM file ends with IO read error,
> read_io_errors is unchanged - still shows "0".

Will have to investigate and see whether the current behavior is
intentional or not.

> 
> 


* Re: csum failed root raveled during balance
  2018-05-23  6:32 ` Nikolay Borisov
  2018-05-23  8:03   ` ein
@ 2018-05-27  5:50   ` Andrei Borzenkov
  2018-05-27  9:41     ` Nikolay Borisov
  1 sibling, 1 reply; 14+ messages in thread
From: Andrei Borzenkov @ 2018-05-27  5:50 UTC (permalink / raw)
  To: Nikolay Borisov, ein, linux-btrfs

23.05.2018 09:32, Nikolay Borisov wrote:
> 
> 
> On 22.05.2018 23:05, ein wrote:
>> Hello devs,
>>
>> I tested BTRFS in production for about a month:
>>
>> 21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00
>>
>> Without power blackout, hardware failure, SSD's SMART is flawless etc.
>> The tests ended with:
>>
>> root@node0:~# dmesg | grep BTRFS | grep warn
>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>> mirror 1
>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>> mirror 1
>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>>
>> Above has been revealed during below command and quite high IO usage by
>> few VMs (Linux on top Ext4 with firebird database, lots of random
>> read/writes, two others with Windows 2016 and Windows Update in the
>> background):
> 
> I believe you are hitting the issue described here:
> 
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
> 
> Essentially the way qemu operates on vm images atop btrfs is prone to
> producing such errors. As a matter of fact, other filesystems also
> suffer from this(i.e pages modified while being written, however due to
> lack of CRC on the data they don't detect it). Can you confirm that
> those inodes (312/314/319/219962/219915) belong to vm images files?
> 
> IMHO the best course of action would be to disable checksumming for you
> vm files.
> 
> 
> For some background I suggest you read the following LWN articles:
> 
> https://lwn.net/Articles/486311/
> https://lwn.net/Articles/442355/
> 

Hmm ... according to these articles, "pages under writeback are marked
as not being writable; any process attempting to write to such a page
will block until the writeback completes". And it says this feature has
been available since 3.0 and that btrfs has it. So how come it still
happens? Were the stable-pages patches removed since then?


* Re: csum failed root raveled during balance
  2018-05-27  5:50   ` Andrei Borzenkov
@ 2018-05-27  9:41     ` Nikolay Borisov
  2018-05-28 16:51       ` ein
  0 siblings, 1 reply; 14+ messages in thread
From: Nikolay Borisov @ 2018-05-27  9:41 UTC (permalink / raw)
  To: Andrei Borzenkov, ein, linux-btrfs



On 27.05.2018 08:50, Andrei Borzenkov wrote:
> 23.05.2018 09:32, Nikolay Borisov wrote:
>>
>>
>> On 22.05.2018 23:05, ein wrote:
>>> Hello devs,
>>>
>>> I tested BTRFS in production for about a month:
>>>
>>> 21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00
>>>
>>> Without power blackout, hardware failure, SSD's SMART is flawless etc.
>>> The tests ended with:
>>>
>>> root@node0:~# dmesg | grep BTRFS | grep warn
>>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>>> mirror 1
>>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>>> mirror 1
>>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>> mirror 1
>>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>> mirror 1
>>>
>>> Above has been revealed during below command and quite high IO usage by
>>> few VMs (Linux on top Ext4 with firebird database, lots of random
>>> read/writes, two others with Windows 2016 and Windows Update in the
>>> background):
>>
>> I believe you are hitting the issue described here:
>>
>> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
>>
>> Essentially the way qemu operates on vm images atop btrfs is prone to
>> producing such errors. As a matter of fact, other filesystems also
>> suffer from this(i.e pages modified while being written, however due to
>> lack of CRC on the data they don't detect it). Can you confirm that
>> those inodes (312/314/319/219962/219915) belong to vm images files?
>>
>> IMHO the best course of action would be to disable checksumming for you
>> vm files.
>>
>>
>> For some background I suggest you read the following LWN articles:
>>
>> https://lwn.net/Articles/486311/
>> https://lwn.net/Articles/442355/
>>
> 
> Hmm ... according to these articles, "pages under writeback are marked
> as not being writable; any process attempting to write to such a page
> will block until the writeback completes". And it says this feature is
> available since 3.0 and btrfs has it. So how comes it still happens?
> Were stable patches removed since then?

If you are using buffered writes, then yes, you won't have the
problem. However qemu by default bypasses the host's page cache and
instead uses DIO:

https://btrfs.wiki.kernel.org/index.php/Gotchas#Direct_IO_and_CRCs
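
You can verify this on the host by checking the open flags of the image
file descriptor, e.g. (the pid/fd lookup below is just a sketch; adjust
the image name):

pid=$(pgrep -f 'qemu.*rds.raw' | head -n1)
ls -l /proc/$pid/fd | grep rds.raw    # note the fd number, e.g. 19
cat /proc/$pid/fdinfo/19              # 'flags:' is octal; O_DIRECT is 040000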

> 


* Re: csum failed root raveled during balance
  2018-05-27  9:41     ` Nikolay Borisov
@ 2018-05-28 16:51       ` ein
  0 siblings, 0 replies; 14+ messages in thread
From: ein @ 2018-05-28 16:51 UTC (permalink / raw)
  To: Nikolay Borisov, Andrei Borzenkov, linux-btrfs

On 05/27/2018 11:41 AM, Nikolay Borisov wrote:
> 
> 
> On 27.05.2018 08:50, Andrei Borzenkov wrote:
>> 23.05.2018 09:32, Nikolay Borisov wrote:
>>>
>>>
>>> On 22.05.2018 23:05, ein wrote:
>>>> Hello devs,
>>>>
>>>> I tested BTRFS in production for about a month:
>>>>
>>>> 21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00
>>>>
>>>> Without power blackout, hardware failure, SSD's SMART is flawless etc.
>>>> The tests ended with:
>>>>
>>>> root@node0:~# dmesg | grep BTRFS | grep warn
>>>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>>>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>>>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>>>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>>>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>>>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>>>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>>>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>>>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>>>> mirror 1
>>>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>>>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>>>> mirror 1
>>>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>>> mirror 1
>>>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>>>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>>>> mirror 1
>>>>
>>>> Above has been revealed during below command and quite high IO usage by
>>>> few VMs (Linux on top Ext4 with firebird database, lots of random
>>>> read/writes, two others with Windows 2016 and Windows Update in the
>>>> background):
>>>
>>> I believe you are hitting the issue described here:
>>>
>>> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
>>>
>>> Essentially the way qemu operates on vm images atop btrfs is prone to
>>> producing such errors. As a matter of fact, other filesystems also
>>> suffer from this(i.e pages modified while being written, however due to
>>> lack of CRC on the data they don't detect it). Can you confirm that
>>> those inodes (312/314/319/219962/219915) belong to vm images files?
>>>
>>> IMHO the best course of action would be to disable checksumming for you
>>> vm files.
>>>
>>>
>>> For some background I suggest you read the following LWN articles:
>>>
>>> https://lwn.net/Articles/486311/
>>> https://lwn.net/Articles/442355/
>>>
>>
>> Hmm ... according to these articles, "pages under writeback are marked
>> as not being writable; any process attempting to write to such a page
>> will block until the writeback completes". And it says this feature is
>> available since 3.0 and btrfs has it. So how comes it still happens?
>> Were stable patches removed since then?
> 
> If you are using buffered writes, then yes you won't have the problem.
> However qemu by default bypasses host's page cache and instead uses DIO:
> 
> https://btrfs.wiki.kernel.org/index.php/Gotchas#Direct_IO_and_CRCs

I can confirm that writing data to the filesystem on the guest side is
not buffered on the host with this config:

<disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/var/lib/libvirt/images/db.raw'/>
      <target dev='vda' bus='virtio'/>
      [...]
</disk>

Because buff/cache usage at the host stays unchanged during heavy
sequential writes and there is no kworker/flush process committing the
data. How can qemu avoid dirty page buffering? There is nothing other
than ppoll, read, io_submit and write in strace:

read(52, "\1\0\0\0\0\0\0\0", 512)       = 8
io_submit(0x7f35367f7000, 2, [{pwritev, fildes=19,
iovec=[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=368640}, {iov_base="\0\0\0\0\0\0
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=368640},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=679
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=368640},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=679936}, {iov_bas
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=368640},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=679936}, {iov_base="\0\0\0\0\0\
1048576},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=1048576},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\
read(38, "\3\0\0\0\0\0\0\0", 512)       = 8
ppoll([{fd=52, events=POLLIN|POLLERR|POLLHUP}, {fd=38,
events=POLLIN|POLLERR|POLLHUP}, {fd=10, events=POLLIN|POLLERR|POLLHUP}],
3, NULL, NULL, 8) = 1 ([{fd=52, revents=POLLIN}])
read(52, "\1\0\0\0\0\0\0\0", 512)       = 8
io_submit(0x7f35367f7000, 1, [{pwritev, fildes=19,
iovec=[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=368640}, {iov_base="\0\0\0\0\0\0
\0\0\0\0\0\0\0\0\0\0"..., iov_len=1048576},
{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=368640}, {iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0
ppoll([{fd=52, events=POLLIN|POLLERR|POLLHUP}, {fd=38,
events=POLLIN|POLLERR|POLLHUP}, {fd=10, events=POLLIN|POLLERR|POLLHUP}],
3, {tv_sec=0, tv_nsec=0}, NULL, 8) = 2 ([{fd=52, re
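
For what it's worth, O_DIRECT can also be confirmed straight from /proc.
A sketch (the binary name and fd number are taken from the trace above,
so adjust as needed):

pid=$(pidof qemu-system-x86_64)        # assumes a single qemu instance
grep ^flags: /proc/$pid/fdinfo/19      # fd 19 = image fd from the trace; octal 040000 is O_DIRECT on x86-64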

-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: csum failed root raveled during balance
  2018-05-23 11:03         ` Austin S. Hemmelgarn
@ 2018-05-28 17:10           ` ein
  2018-05-29 12:12             ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 14+ messages in thread
From: ein @ 2018-05-28 17:10 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, Duncan, linux-btrfs

On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote:
> On 2018-05-23 06:09, ein wrote:
>> On 05/23/2018 11:09 AM, Duncan wrote:
>>> ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:
>>>
>>>>> IMHO the best course of action would be to disable checksumming for
>>>>> you
>>>>> vm files.
>>>>
>>>> Do you mean '-o nodatasum' mount flag? Is it possible to disable
>>>> checksumming for singe file by setting some magical chattr? Google
>>>> thinks it's not possible to disable csums for a single file.
>>>
>>> You can use nocow (-C), but of course that has other restrictions (like
>>> setting it on the files when they're zero-length, easiest done for
>>> existing data by setting it on the containing dir and copying files (no
>>> reflink) in) as well as the nocow effects, and nocow becomes cow1
>>> after a
>>> snapshot (which locks the existing copy in place so changes written to a
>>> block /must/ be written elsewhere, thus the cow1, aka cow the first time
>>> written after the snapshot but retain the nocow for repeated writes
>>> between snapshots).
>>>
>>> But if you're disabling checksumming anyway, nocow's likely the way
>>> to go.
>>
>> Disabling checksumming only may be a way to go - we live without it
>> every day. But nocow @ VM files defeats whole purpose of using BTRFS for
>> me, even with huge performance penalty - backup reasons - I mean few
>> snapshots (20-30), send & receive.
>>
> Setting NOCOW on a file doesn't prevent it from being snapshotted, it
> just prevents COW operations from happening under most normal
> circumstances.  In essence, it prevents COW from happening except for
> writing right after the snapshot.  More specifically, the first write to
> a given block in a file set for NOCOW after taking a snapshot will
> trigger a _single_ COW operation for _only_ that block (unless you have
> autodefrag enabled too), after which that block will revert to not doing
> COW operations on write.  This way, you still get consistent and working
> snapshots, but you also don't take the performance hits from COW except
> right after taking a snapshot.

Yeah, just after I posted it I found a post by Duncan from 2015
explaining it; thank you anyway.
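
For the record, the recipe from those posts boils down to something like
this (paths are examples from my setup; NOCOW only sticks on files
created after the flag is set on the directory):

mkdir /var/lib/libvirt/images.nocow
chattr +C /var/lib/libvirt/images.nocow            # new files inherit NOCOW (and thus no csums)
cp --reflink=never /var/lib/libvirt/images/db.raw /var/lib/libvirt/images.nocow/
lsattr /var/lib/libvirt/images.nocow/db.raw        # should list the 'C' attribute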

Is there anything we can do better for a random read/write VM workload
to speed BTRFS up, and why?

My settings:

<disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native'/>
      <source file='/var/lib/libvirt/images/db.raw'/>
      <target dev='vda' bus='virtio'/>
      [...]
</disk>

/dev/mapper/raid10-images on /var/lib/libvirt type btrfs
(rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)

md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
      468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 0/4 pages [0KB], 65536KB chunk

CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's
kernel 4.15.0-0.bpo.2-amd64

As far as I understand, compress and autodefrag impact performance
(latency) negatively, especially autodefrag. I also think nodatacow
should speed things up and is a must when using qemu on BTRFS. Is it
better to use virtio or virtio-scsi with TRIM support?

-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: csum failed root raveled during balance
  2018-05-28 17:10           ` ein
@ 2018-05-29 12:12             ` Austin S. Hemmelgarn
  2018-05-29 14:02               ` ein
  0 siblings, 1 reply; 14+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-29 12:12 UTC (permalink / raw)
  To: ein, linux-btrfs

On 2018-05-28 13:10, ein wrote:
> On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote:
>> On 2018-05-23 06:09, ein wrote:
>>> On 05/23/2018 11:09 AM, Duncan wrote:
>>>> ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:
>>>>
>>>>>> IMHO the best course of action would be to disable checksumming for
>>>>>> you
>>>>>> vm files.
>>>>>
>>>>> Do you mean '-o nodatasum' mount flag? Is it possible to disable
>>>>> checksumming for singe file by setting some magical chattr? Google
>>>>> thinks it's not possible to disable csums for a single file.
>>>>
>>>> You can use nocow (-C), but of course that has other restrictions (like
>>>> setting it on the files when they're zero-length, easiest done for
>>>> existing data by setting it on the containing dir and copying files (no
>>>> reflink) in) as well as the nocow effects, and nocow becomes cow1
>>>> after a
>>>> snapshot (which locks the existing copy in place so changes written to a
>>>> block /must/ be written elsewhere, thus the cow1, aka cow the first time
>>>> written after the snapshot but retain the nocow for repeated writes
>>>> between snapshots).
>>>>
>>>> But if you're disabling checksumming anyway, nocow's likely the way
>>>> to go.
>>>
>>> Disabling checksumming only may be a way to go - we live without it
>>> every day. But nocow @ VM files defeats whole purpose of using BTRFS for
>>> me, even with huge performance penalty - backup reasons - I mean few
>>> snapshots (20-30), send & receive.
>>>
>> Setting NOCOW on a file doesn't prevent it from being snapshotted, it
>> just prevents COW operations from happening under most normal
>> circumstances.  In essence, it prevents COW from happening except for
>> writing right after the snapshot.  More specifically, the first write to
>> a given block in a file set for NOCOW after taking a snapshot will
>> trigger a _single_ COW operation for _only_ that block (unless you have
>> autodefrag enabled too), after which that block will revert to not doing
>> COW operations on write.  This way, you still get consistent and working
>> snapshots, but you also don't take the performance hits from COW except
>> right after taking a snapshot.
> 
> Yeah, just after I've post it, I've found some Duncan post from 2015,
> explaining it, thank you anyway.
> 
> Is there anything we can do better in random/write VM workload to speed
> the BTRFS up and why?
> 
> My settings:
> 
> <disk type='file' device='disk'>
>        <driver name='qemu' type='raw' cache='none' io='native'/>
>        <source file='/var/lib/libvirt/images/db.raw'/>
>        <target dev='vda' bus='virtio'/>
>        [...]
> </disk>
> 
> /dev/mapper/raid10-images on /var/lib/libvirt type btrfs
> (rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)
> 
> md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
>        468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
>        bitmap: 0/4 pages [0KB], 65536KB chunk
> 
> CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's
> kernel 4.15.0-0.bpo.2-amd64
> 
> As far as I understand compress and autodefrag are impacting negatively
> for performance (latency), especially autodefrag. I think also that
> nodatacow shall also speed things up and it's a must when using qemu and
> BTRFS. Is it better to use virtio or virt-scsi with TRIM support?
> 
FWIW, I've been doing just fine without nodatacow, but I also use raw 
images contained in sparse files, and keep autodefrag off for the 
dedicated filesystem I put the images on.

Compression shouldn't have much in the way of negative impact unless 
you're also using transparent compression (or disk or file encryption) 
inside the VM (in fact, it may speed things up significantly depending 
on what filesystem is being used by the guest OS; the ext4 inode table 
in particular seems to compress very well).  If you are using 
`nodatacow` though, you can just turn compression off, as it's not going 
to be used anyway.  If you want to keep using compression, then I'd 
suggest using `compress-force` instead of `compress`, which makes BTRFS 
a bit more aggressive about trying to compress things, but makes the 
performance much more deterministic.  You may also want to look into 
using `zstd` instead of `lzo` for the compression; it gets better ratios 
most of the time, and usually performs better than `lzo` does.
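
As a sketch (mount point as in your setup; zstd needs kernel 4.14+,
which your 4.15 backport has):

mount -o remount,compress-force=zstd /var/lib/libvirt   # only affects newly written data; old extents keep their compression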

Autodefrag should probably be off.  If you have nodatacow set (or just 
have all the files marked with the NOCOW attribute), then there's not 
really any point to having autodefrag on.  If like me you aren't turning 
off COW for data, it's still a good idea to have it off and just do 
batch defragmentation at a regularly scheduled time.
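
A scheduled batch defrag could be as simple as a periodic run of
something like this (path is yours, schedule is up to you):

# note: defragmenting can unshare extents held by snapshots, costing space
btrfs filesystem defragment -r /var/lib/libvirt/images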

For the VM settings, everything looks fine to me (though if you have 
somewhat slow storage and aren't giving the VM's lots of memory to work 
with, doing write-through caching might be helpful).  I would probably 
be using virtio-scsi for the TRIM support, as with raw images you will 
get holes in the file where the TRIM command was issued, which can 
actually improve performance and does improve storage utilization 
(though doing batch trims instead of using the `discard` mount option is 
better for performance if you have Linux guests).
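
A sketch of what that would look like on the libvirt side (this mirrors
your existing disk definition, with the discard attribute and a
virtio-scsi controller added; exact version requirements aside):

<disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none' io='native' discard='unmap'/>
      <source file='/var/lib/libvirt/images/db.raw'/>
      <target dev='sda' bus='scsi'/>
      [...]
</disk>
<controller type='scsi' model='virtio-scsi'/>

and then a periodic `fstrim -av` inside the Linux guests instead of
mounting them with `discard`.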

You're using an MD RAID10 array.  This is generally the fastest option 
in terms of performance, but it also means you can't take advantage of 
BTRFS' self repairing ability very well, and you may be wasting space 
and some performance (because you probably have the 'dup' profile set 
for metadata).  If it's an option, I'd suggest converting this to a 
BTRFS raid1 volume on top of two MD RAID0 volumes, which should either 
get the same performance, or slightly better performance, will avoid 
wasting space storing metadata, and will also let you take advantage of 
the self-repair functionality in BTRFS.
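
Purely as a sketch of that layout (device names taken from your md1
line, the LVM layer omitted; this is of course destructive and would
mean recreating the filesystem and restoring from backup):

mdadm --create /dev/md2 --level=0 --raid-devices=2 /dev/sda2 /dev/sdb2
mdadm --create /dev/md3 --level=0 --raid-devices=2 /dev/sdc1 /dev/sdd1
mkfs.btrfs -L images -d raid1 -m raid1 /dev/md2 /dev/md3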

You should probably switch the `ssd` mount option to `nossd` (and then 
run a full recursive defrag on the volume, as this option affects the 
allocation policy, so the changes only take effect for new allocations). 
The SSD allocator can actually pretty significantly hurt performance 
in many cases, and has at best very limited benefits for device 
lifetimes (you'll maybe get another few months out of a device that 
would last for ten years without issue).  Make a point to test this 
though; because you're on a RAID array, it may actually be improving 
performance slightly.
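
That is, something along the lines of (mount point as above; the
recursive defrag is the same command as in the earlier sketch):

mount -o remount,nossd /var/lib/libvirt     # then defrag so existing data gets reallocated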

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: csum failed root raveled during balance
  2018-05-29 12:12             ` Austin S. Hemmelgarn
@ 2018-05-29 14:02               ` ein
  2018-05-29 14:35                 ` Austin S. Hemmelgarn
  0 siblings, 1 reply; 14+ messages in thread
From: ein @ 2018-05-29 14:02 UTC (permalink / raw)
  To: Austin S. Hemmelgarn, linux-btrfs

On 05/29/2018 02:12 PM, Austin S. Hemmelgarn wrote:
> On 2018-05-28 13:10, ein wrote:
>> On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote:
>>> On 2018-05-23 06:09, ein wrote:
>>>> On 05/23/2018 11:09 AM, Duncan wrote:
>>>>> ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:
>>>>>
>>>>>>> IMHO the best course of action would be to disable checksumming for
>>>>>>> you
>>>>>>> vm files.
>>>>>>
>>>>>> Do you mean '-o nodatasum' mount flag? Is it possible to disable
>>>>>> checksumming for singe file by setting some magical chattr? Google
>>>>>> thinks it's not possible to disable csums for a single file.
>>>>>
>>>>> You can use nocow (-C), but of course that has other restrictions (like
>>>>> setting it on the files when they're zero-length, easiest done for
>>>>> existing data by setting it on the containing dir and copying files (no
>>>>> reflink) in) as well as the nocow effects, and nocow becomes cow1
>>>>> after a
>>>>> snapshot (which locks the existing copy in place so changes written to a
>>>>> block /must/ be written elsewhere, thus the cow1, aka cow the first time
>>>>> written after the snapshot but retain the nocow for repeated writes
>>>>> between snapshots).
>>>>>
>>>>> But if you're disabling checksumming anyway, nocow's likely the way
>>>>> to go.
>>>>
>>>> Disabling checksumming only may be a way to go - we live without it
>>>> every day. But nocow @ VM files defeats whole purpose of using BTRFS for
>>>> me, even with huge performance penalty - backup reasons - I mean few
>>>> snapshots (20-30), send & receive.
>>>>
>>> Setting NOCOW on a file doesn't prevent it from being snapshotted, it
>>> just prevents COW operations from happening under most normal
>>> circumstances.  In essence, it prevents COW from happening except for
>>> writing right after the snapshot.  More specifically, the first write to
>>> a given block in a file set for NOCOW after taking a snapshot will
>>> trigger a _single_ COW operation for _only_ that block (unless you have
>>> autodefrag enabled too), after which that block will revert to not doing
>>> COW operations on write.  This way, you still get consistent and working
>>> snapshots, but you also don't take the performance hits from COW except
>>> right after taking a snapshot.
>>
>> Yeah, just after I've post it, I've found some Duncan post from 2015,
>> explaining it, thank you anyway.
>>
>> Is there anything we can do better in random/write VM workload to speed
>> the BTRFS up and why?
>>
>> My settings:
>>
>> <disk type='file' device='disk'>
>>        <driver name='qemu' type='raw' cache='none' io='native'/>
>>        <source file='/var/lib/libvirt/images/db.raw'/>
>>        <target dev='vda' bus='virtio'/>
>>        [...]
>> </disk>
>>
>> /dev/mapper/raid10-images on /var/lib/libvirt type btrfs
>> (rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)
>>
>> md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
>>        468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
>>        bitmap: 0/4 pages [0KB], 65536KB chunk
>>
>> CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's
>> kernel 4.15.0-0.bpo.2-amd64
>>
>> As far as I understand compress and autodefrag are impacting negatively
>> for performance (latency), especially autodefrag. I think also that
>> nodatacow shall also speed things up and it's a must when using qemu and
>> BTRFS. Is it better to use virtio or virt-scsi with TRIM support?
>>
> FWIW, I've been doing just fine without nodatacow, but I also use raw images contained in sparse
> files, and keep autodefrag off for the dedicated filesystem I put the images on.

So do I: raw images created by qemu-img, but I am not sure whether preallocation works as expected.
The size of the disks in the filesystem looks fine though.

May I ask in what workloads? From my testing with VMs on BTRFS storage:
- file/web servers work perfectly on BTRFS.
- Windows (2012/2016) file servers with AD are perfect too, aside from the time required for Windows
Update, but that service is... let's say not great to begin with.
- for databases (firebird) the impact is huge; the guest filesystem is Ext4, and the database performs
slower in these conditions (4 SSDs in RAID10) than it did on RAID1 with two 10krpm SAS drives. I am
still thinking about how to benchmark it properly. A lot of iowait in the host's kernel.

> Compression shouldn't have much in the way of negative impact unless you're also using transparent
> compression (or disk for file encryption) inside the VM (in fact, it may speed things up
> significantly depending on what filesystem is being used by the guest OS, the ext4 inode table in
> particular seems to compress very well).  If you are using `nodatacow` though, you can just turn
> compression off, as it's not going to be used anyway.  If you want to keep using compression, then
> I'd suggest using `compress-force` instead of `compress`, which makes BTRFS a bit more aggressive
> about trying to compress things, but makes the performance much more deterministic.  You may also
> want to look int using `zstd` instead of `lzo` for the compression, it gets better ratios most of
> the time, and usually performs better than `lzo` does.

Yeah, I know the exact figures from the post we surely both know:
https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git/commit/?h=next&id=5c1aab1dd5445ed8bdcdbb575abc1b0d7ee5b2e7

> Autodefrag should probably be off.  If you have nodatacow set (or just have all the files marked
> with the NOCOW attribute), then there's not really any point to having autodefrag on.  If like me
> you aren't turning off COW for data, it's still a good idea to have it off and just do batch
> defragmentation at a regularly scheduled time.

Well, at least I need to try nodatacow to check the impact.
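
A minimal way to try it would be something like this (mount point as
above; nodatacow only takes effect for newly created files, so existing
images would have to be copied back in afterwards):

mount -o remount,nodatacow /var/lib/libvirt   # affects files created after the remount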

> For the VM settings, everything looks fine to me (though if you have somewhat slow storage and
> aren't giving the VM's lots of memory to work with, doing write-through caching might be helpful). 
> I would probably be using virtio-scsi for the TRIM support, as with raw images you will get holes in
> the file where the TRIM command was issued, which can actually improve performance (and does improve
> storage utilization (though doing batch trims instead of using the `discard` mount option is better
> for performance if you have Linux guests).

I don't consider these...:
4: Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q, SerialNo=S251NX0H822367V
21: Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q, SerialNo=S251NX0H822370A
38: Model=INTEL SSDSC2BP240G4, FwRev=L2010420, SerialNo=BTJR6141002D240AGN
55: Model=INTEL SSDSC2BP240G4, FwRev=L2010420, SerialNo=BTJR6063000F240AGN
as slow in RAID 10 mode, because:

root@node0:~# time dd if=/dev/md1 of=/dev/null bs=4096 count=10M
10485760+0 records in
10485760+0 records out
42949672960 bytes (43 GB, 40 GiB) copied, 31.6336 s, 1.4 GB/s

real    0m31.636s
user    0m1.949s
sys     0m12.222s

root@node0:~# iostat -x 5 /dev/md1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.63    0.00    4.85    6.61    0.00   87.91

Device:         rrqm/s   wrqm/s     r/s     w/s      rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc             306.80     0.00  672.00    0.00  333827.20     0.00   993.53     1.05    1.57    1.57    0.00   1.21  81.20
sdb             329.80     0.00  663.40    0.00  332640.00     0.00  1002.83     0.94    1.41    1.41    0.00   1.05  69.44
sdd             298.80     0.00  664.80    0.00  329110.40     0.00   990.10     1.00    1.50    1.50    0.00   1.22  80.96
sda             291.60     0.00  657.40    0.00  330297.60     0.00  1004.86     0.92    1.40    1.40    0.00   1.05  69.20
md1               0.00     0.00 3884.80    0.00 2693254.40     0.00  1386.56     0.00    0.00    0.00    0.00   0.00   0.00

If I remember correctly, it gives me well over 100k IOPS on the BTRFS filesystem in an fio
random-workload benchmark (75% reads, 25% writes, 2 threads).
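
For reference, the job was roughly the following (exact parameters
reconstructed from memory, so treat them as an example):

fio --name=randrw --directory=/var/lib/libvirt/images --size=8G \
    --rw=randrw --rwmixread=75 --bs=4k --direct=1 --ioengine=libaio \
    --iodepth=32 --numjobs=2 --runtime=60 --time_based --group_reporting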


> You're using an MD RAID10 array.  This is generally the fastest option in terms of performance, but
> it also means you can't take advantage of BTRFS' self repairing ability very well, and you may be
> wasting space and some performance (because you probably have the 'dup' profile set for metadata). 
> If it's an option, I'd suggest converting this to a BTRFS raid1 volume on top of two MD RAID0
> volumes, which should either get the same performance, or slightly better performance, will avoid
> wasting space storing metadata, and will also let you take advantage of the self-repair
> functionality in BTRFS.

That's a good point.

> You should probably switch the `ssd` mount option to `nossd` (and then run a full recursive defrag
> on the volume, as this option affects the allocation policy, so the changes only take effect for new
> allocations).  The SSD allocator can actually pretty significantly hurt performance in many cases,
> and has at best very limited benefits for device lifetimes (you'll maybe get another few months out
> of a device that will last for ten years without issue).  Make a point to test this though, because
> you're on a RAID array, this may actually be improving performance slightly.

Good point as well. I am going to test the impact of the ssd parameter too; I think recreating the
filesystem and copying the data back may be a good idea.

We should not have to care about wearout:
20 users working on the database every day (work week) for about a year, and:

Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model:     INTEL SSDSC2BP240G4

233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0

and

177 Wear_Leveling_Count     0x0013   095   095   000    Pre-fail  Always       -       282

Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 850 PRO 256GB

Which means 50 more years for the Samsung 850 Pro and 100 years for the Intel 730, which is
interesting (btw. the start date is exactly the same).
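
(The attributes above come from smartctl; the device name below is just
an example:)

smartctl -A /dev/sda | grep -E 'Media_Wearout_Indicator|Wear_Leveling_Count'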

Thank you for sharing Austin.


-- 
PGP Public Key (RSA/4096b):
ID: 0xF2C6EA10
SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: csum failed root raveled during balance
  2018-05-29 14:02               ` ein
@ 2018-05-29 14:35                 ` Austin S. Hemmelgarn
  0 siblings, 0 replies; 14+ messages in thread
From: Austin S. Hemmelgarn @ 2018-05-29 14:35 UTC (permalink / raw)
  To: ein, linux-btrfs

On 2018-05-29 10:02, ein wrote:
> On 05/29/2018 02:12 PM, Austin S. Hemmelgarn wrote:
>> On 2018-05-28 13:10, ein wrote:
>>> On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote:
>>>> On 2018-05-23 06:09, ein wrote:
>>>>> On 05/23/2018 11:09 AM, Duncan wrote:
>>>>>> ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:
>>>>>>
>>>>>>>> IMHO the best course of action would be to disable checksumming for
>>>>>>>> you
>>>>>>>> vm files.
>>>>>>>
>>>>>>> Do you mean '-o nodatasum' mount flag? Is it possible to disable
>>>>>>> checksumming for singe file by setting some magical chattr? Google
>>>>>>> thinks it's not possible to disable csums for a single file.
>>>>>>
>>>>>> You can use nocow (-C), but of course that has other restrictions (like
>>>>>> setting it on the files when they're zero-length, easiest done for
>>>>>> existing data by setting it on the containing dir and copying files (no
>>>>>> reflink) in) as well as the nocow effects, and nocow becomes cow1
>>>>>> after a
>>>>>> snapshot (which locks the existing copy in place so changes written to a
>>>>>> block /must/ be written elsewhere, thus the cow1, aka cow the first time
>>>>>> written after the snapshot but retain the nocow for repeated writes
>>>>>> between snapshots).
>>>>>>
>>>>>> But if you're disabling checksumming anyway, nocow's likely the way
>>>>>> to go.
>>>>>
>>>>> Disabling checksumming only may be a way to go - we live without it
>>>>> every day. But nocow @ VM files defeats whole purpose of using BTRFS for
>>>>> me, even with huge performance penalty - backup reasons - I mean few
>>>>> snapshots (20-30), send & receive.
>>>>>
>>>> Setting NOCOW on a file doesn't prevent it from being snapshotted, it
>>>> just prevents COW operations from happening under most normal
>>>> circumstances.  In essence, it prevents COW from happening except for
>>>> writing right after the snapshot.  More specifically, the first write to
>>>> a given block in a file set for NOCOW after taking a snapshot will
>>>> trigger a _single_ COW operation for _only_ that block (unless you have
>>>> autodefrag enabled too), after which that block will revert to not doing
>>>> COW operations on write.  This way, you still get consistent and working
>>>> snapshots, but you also don't take the performance hits from COW except
>>>> right after taking a snapshot.
>>>
>>> Yeah, just after I've post it, I've found some Duncan post from 2015,
>>> explaining it, thank you anyway.
>>>
>>> Is there anything we can do better in random/write VM workload to speed
>>> the BTRFS up and why?
>>>
>>> My settings:
>>>
>>> <disk type='file' device='disk'>
>>>         <driver name='qemu' type='raw' cache='none' io='native'/>
>>>         <source file='/var/lib/libvirt/images/db.raw'/>
>>>         <target dev='vda' bus='virtio'/>
>>>         [...]
>>> </disk>
>>>
>>> /dev/mapper/raid10-images on /var/lib/libvirt type btrfs
>>> (rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)
>>>
>>> md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
>>>         468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
>>>         bitmap: 0/4 pages [0KB], 65536KB chunk
>>>
>>> CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's
>>> kernel 4.15.0-0.bpo.2-amd64
>>>
>>> As far as I understand compress and autodefrag are impacting negatively
>>> for performance (latency), especially autodefrag. I think also that
>>> nodatacow shall also speed things up and it's a must when using qemu and
>>> BTRFS. Is it better to use virtio or virt-scsi with TRIM support?
>>>
>> FWIW, I've been doing just fine without nodatacow, but I also use raw images contained in sparse
>> files, and keep autodefrag off for the dedicated filesystem I put the images on.
> 
> So do I, RAW images created by qemu-img, but I am not sure if preallocation works as expected. The
> size of disks in filesystem looks fine though.
Unless I'm mistaken, qemu-img will fully pre-allocate the images.

You can easily check though with `ls -ls`, which will show the amount of 
space taken up by the file on-disk (before compression or deduplication) 
on the left.  If that first column on the left doesn't match up with the 
apparent file size, then the file is sparse and not fully pre-allocated.

 From a practical perspective, if you really want maximal performance, 
it's worth pre-allocating space, as that both avoids the non-determinism 
of allocating blocks on first-write, and avoids some degree of 
fragmentation.

If you would rather save the space and not pre-allocate, you can use 
truncate with the `--size` option to quickly create an appropriately 
sized sparse virtual disk image file.
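
As a sketch (path and size are examples), the two approaches would be:

# fully pre-allocated raw image
qemu-img create -f raw -o preallocation=falloc /var/lib/libvirt/images/db.raw 40G
# sparse raw image with only an apparent size
truncate --size=40G /var/lib/libvirt/images/db.raw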
> 
> May I ask in what workloads? From my testing while having VM on BTRFS storage:
> - file/web servers works perfect on BTRFS.
> - Windows (2012/2016) file servers with AD, are perfect too, besides time required for Windows
> Update, but this service is... let's say not fine enough.
> - database (firebird) impact is huuuge, guest filesystem is Ext4, the database performs slower in
> this conditions (4 SSDs in RAID10) than when it was on raid1 with 2 10krpm SASes. I am still
> thinking how to benchmark it properly. A lot of iowait in host's kernel.
In my case, I've got a couple of different types of VM's, each with its 
own type of workload:
- A total of 8 static VM's that are always running, each running a 
different distribution/version of Linux.  These see very little activity 
most of the time (I keep them around as reference systems so I have 
something I can look at directly when doing development or providing 
support), use ext4 for the internal filesystems, and are not 
particularly big to begin with.
- A bunch of transient VM's used for testing kernel patches for BTRFS. 
These literally start up, run xfstests, copy the results out to a file 
share on the host, and shut down.  The overall behavior for these 
shouldn't be too drastically different from most database workloads (the 
internals of BTRFS are very similar to many database systems).
- Less frequently, transient VM's for testing other software (mostly 
Netdata recently).  These have varied workloads depending on what 
exactly I'm testing, but often don't touch the disk much.

So, overall, I don't have any systems quite comparable to what you're 
running, but still at least have a reasonable spread of workloads.
> 
>> Compression shouldn't have much in the way of negative impact unless you're also using transparent
>> compression (or disk for file encryption) inside the VM (in fact, it may speed things up
>> significantly depending on what filesystem is being used by the guest OS, the ext4 inode table in
>> particular seems to compress very well).  If you are using `nodatacow` though, you can just turn
>> compression off, as it's not going to be used anyway.  If you want to keep using compression, then
>> I'd suggest using `compress-force` instead of `compress`, which makes BTRFS a bit more aggressive
>> about trying to compress things, but makes the performance much more deterministic.  You may also
>> want to look int using `zstd` instead of `lzo` for the compression, it gets better ratios most of
>> the time, and usually performs better than `lzo` does.
> 
> Yeah, I do know exact values from the post we both know for sure:
> https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git/commit/?h=next&id=5c1aab1dd5445ed8bdcdbb575abc1b0d7ee5b2e7
> 
>> Autodefrag should probably be off.  If you have nodatacow set (or just have all the files marked
>> with the NOCOW attribute), then there's not really any point to having autodefrag on.  If like me
>> you aren't turning off COW for data, it's still a good idea to have it off and just do batch
>> defragmentation at a regularly scheduled time.
> 
> Well, at least I need to try nodatacow to check the impact.
Provided that the files aren't fragmented, you should see an increase in 
write performance, but probably not much improvement for reads.
> 
>> For the VM settings, everything looks fine to me (though if you have somewhat slow storage and
>> aren't giving the VM's lots of memory to work with, doing write-through caching might be helpful).
>> I would probably be using virtio-scsi for the TRIM support, as with raw images you will get holes in
>> the file where the TRIM command was issued, which can actually improve performance (and does improve
>> storage utilization (though doing batch trims instead of using the `discard` mount option is better
>> for performance if you have Linux guests).
> 
> I don't consider this... :
> 4: Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q, SerialNo=S251NX0H822367V
> 21: Model=Samsung SSD 850 PRO 256GB, FwRev=EXM02B6Q, SerialNo=S251NX0H822370A
> 38: Model=INTEL SSDSC2BP240G4, FwRev=L2010420, SerialNo=BTJR6141002D240AGN
> 55: Model=INTEL SSDSC2BP240G4, FwRev=L2010420, SerialNo=BTJR6063000F240AGN
> as slow in Raid 10 mode, because:
> 
> root@node0:~# time dd if=/dev/md1 of=/dev/null bs=4096 count=10M
> 10485760+0 records in
> 10485760+0 records out
> 42949672960 bytes (43 GB, 40 GiB) copied, 31.6336 s, 1.4 GB/s
> 
> real    0m31.636s
> user    0m1.949s
> sys     0m12.222s
> 
> root@node0:~# iostat -x 5 /dev/md1 /dev/sda /dev/sdb /dev/sdc /dev/sdd
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>             0.63    0.00    4.85    6.61    0.00   87.91
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await
> w_await  svctm  %util
> sdc             306.80     0.00  672.00    0.00 333827.20     0.00   993.53     1.05    1.57    1.57
>     0.00   1.21  81.20
> sdb             329.80     0.00  663.40    0.00 332640.00     0.00  1002.83     0.94    1.41    1.41
>     0.00   1.05  69.44
> sdd             298.80     0.00  664.80    0.00 329110.40     0.00   990.10     1.00    1.50    1.50
>     0.00   1.22  80.96
> sda             291.60     0.00  657.40    0.00 330297.60     0.00  1004.86     0.92    1.40    1.40
>     0.00   1.05  69.20
> md1               0.00     0.00 3884.80    0.00 2693254.40     0.00  1386.56     0.00    0.00
> 0.00    0.00   0.00   0.00
> 
> It gives me much more than 100k IOPs to the BTRFS filesystem if I remember correctly while fio
> benchmark random workload (75% reads, 25% writes), 2 threads.
Yeah, I wouldn't consider that 'slow' either.  In my case, I'm running 
my VM's with the back-end storage being a BTRFS raid1 volume on top of 
two LVM thinp targets, which are in turn on top of a pair of consumer 
7200RPM SATA3 HDD's (because I ran out of space on the two half-TB SSD's 
that I had everything in the system on, and happened to have some 
essentially new 1TB HDD's around still from before I converted to SSD's 
everywhere).  That storage I would definitely call slow, and it's 
probably worth noting that my definition of 'works just fine' is at 
least partly based on the fact that the storage is so slow.
> 
>> You're using an MD RAID10 array.  This is generally the fastest option in terms of performance, but
>> it also means you can't take advantage of BTRFS' self repairing ability very well, and you may be
>> wasting space and some performance (because you probably have the 'dup' profile set for metadata).
>> If it's an option, I'd suggest converting this to a BTRFS raid1 volume on top of two MD RAID0
>> volumes, which should either get the same performance, or slightly better performance, will avoid
>> wasting space storing metadata, and will also let you take advantage of the self-repair
>> functionality in BTRFS.
> 
> That's a good point.
> 
>> You should probably switch the `ssd` mount option to `nossd` (and then run a full recursive defrag
>> on the volume, as this option affects the allocation policy, so the changes only take effect for new
>> allocations).  The SSD allocator can actually pretty significantly hurt performance in many cases,
>> and has at best very limited benefits for device lifetimes (you'll maybe get another few months out
>> of a device that will last for ten years without issue).  Make a point to test this though, because
>> you're on a RAID array, this may actually be improving performance slightly.
> 
> Good point either. I am going to test ssd parameter impact to, I think that recreating fs and copy
> the data may be good idea.
One quick point, do make sure you explicitly set 'nossd', as BTRFS tries 
to set the 'ssd' parameter automatically based on whether or not the 
underlying storage is rotational (and I don't remember if MD copies the 
rotational flag from the lower-level storage or not).
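
Assuming md1 as in your setup, a quick way to see what the kernel
reports (and hence what BTRFS would auto-detect) is:

cat /sys/block/md1/queue/rotational   # 0 = non-rotational, so btrfs would add 'ssd' on its own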
> 
> We should not care about the wearout:
> 20 users working on the database every day (work week), for about a year and:
> 
> Model Family:     Intel 730 and DC S35x0/3610/3700 Series SSDs
> Device Model:     INTEL SSDSC2BP240G4
> 
> 233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
> 
> and
> 
> 177 Wear_Leveling_Count     0x0013   095   095   000    Pre-fail  Always       -       282
> 
> Model Family:     Samsung based SSDs
> Device Model:     Samsung SSD 850 PRO 256GB
> 
> Which means 50 more years @ Samsung Pro 850 and 100 years @ Intel 730, which is interesting. (btw.
> start date is exactly the same).
For what it's worth, based on my own experience, the degradation isn't 
exactly linear, it's more of an exponential falloff (as more blocks go 
bad, there's less extra space for the FTL to work with for wear 
leveling, so it can't do as good a job wear-leveling, which in turn 
causes blocks to fail faster).  Realistically though, you do still 
probably have a few decades worth of life in them at minimum.
> 
> Thank you for sharing Austin.
Glad I could help!


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-05-29 14:35 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-22 20:05 csum failed root raveled during balance ein
2018-05-23  6:32 ` Nikolay Borisov
2018-05-23  8:03   ` ein
2018-05-23  9:09     ` Duncan
2018-05-23 10:09       ` ein
2018-05-23 11:03         ` Austin S. Hemmelgarn
2018-05-28 17:10           ` ein
2018-05-29 12:12             ` Austin S. Hemmelgarn
2018-05-29 14:02               ` ein
2018-05-29 14:35                 ` Austin S. Hemmelgarn
2018-05-23 11:12     ` Nikolay Borisov
2018-05-27  5:50   ` Andrei Borzenkov
2018-05-27  9:41     ` Nikolay Borisov
2018-05-28 16:51       ` ein
