* bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
@ 2016-05-08 18:39 ` James Johnston
  0 siblings, 0 replies; 28+ messages in thread
From: James Johnston @ 2016-05-08 18:39 UTC (permalink / raw)
  To: 'Kent Overstreet', 'Alasdair Kergon',
	'Mike Snitzer'
  Cc: linux-bcache, dm-devel, dm-crypt

Hi,

[1.] One line summary of the problem:

bcache gets stuck flushing writeback cache when used in combination with
LUKS/dm-crypt and non-default bucket size

[2.] Full description of the problem/report:

I've run into a problem where the bcache writeback cache can't be flushed to
disk when the backing device is a LUKS / dm-crypt device and the cache set has
a non-default bucket size.  Only a few megabytes are flushed to disk before
it gets stuck.  By "stuck" I mean that the bcache writeback task thrashes the
disk, constantly reading hundreds of MB/second from the cache set in an
infinite loop, while making no actual progress (dirty_data never decreases
beyond a certain point).

Can anybody else reproduce this apparent bug?  Apologies for mailing both
the device mapper and bcache mailing lists, but I'm not sure where the bug
lies, as I've only reproduced it with the two layers used in combination.

As far as I can tell, the situation is unrecoverable: attempting to detach
the cache set only makes the cache set disk thrash even harder, indefinitely,
and the detach never completes.  The only way out seems to be to back up the
data and destroy the volume...

[3.] Keywords (i.e., modules, networking, kernel):

bcache, dm-crypt, LUKS, device mapper, LVM

[4.] Kernel information
[4.1.] Kernel version (from /proc/version):
Linux version 4.6.0-040600rc6-generic (kernel@gloin) (gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) ) #201605012031 SMP Mon May 2 00:33:26 UTC 2016

[7.] A small shell script or example program which triggers the
     problem (if possible)

Here are the steps I used to reproduce:

1.  Set up an Ubuntu 16.04 virtual machine in VMware with three SATA hard
    drives.  Ubuntu was installed with default settings, except that: (1) guided
    partitioning used with NO LVM or dm-crypt, (2) OpenSSH server installed.
    First SATA drive has operating system installation.  Second SATA drive is
    used for bcache cache set.  Third SATA drive has dm-crypt/LUKS + bcache
    backing device.  Note that all drives have 512-byte physical sectors, and
    all virtual drives are backed by a single physical SSD with 512-byte
    sectors (i.e., not Advanced Format).

2.  Ubuntu was updated to latest packages as of 5/8/2016.  The problem
    reproduces with both distribution kernel 4.4.0-22-generic and also mainline
    kernel 4.6.0-040600rc6-generic distributed by Ubuntu kernel team.  Installed
    bcache-tools package was 1.0.8-2.  Installed cryptsetup-bin package was
    2:1.6.6-5ubuntu2.

3.  Set up the cache set, dm-crypt, and backing device:

sudo -s
# Make cache set on second drive
# IMPORTANT:  Problem does not occur if I omit --bucket parameter.
make-bcache --bucket 2M -C /dev/sdb
# Set up LUKS/dm-crypt on third drive.
# IMPORTANT:  Problem does not occur if I omit the dm-crypt layer.
cryptsetup luksFormat /dev/sdc
cryptsetup open --type luks /dev/sdc backCrypt
# Make bcache backing device & enable writeback
make-bcache -B /dev/mapper/backCrypt
bcache-super-show /dev/sdb | grep cset.uuid | \
cut -f 3 > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode
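For reference, the attach pipeline above can be factored into a tiny helper
(get_cset_uuid is a hypothetical name; it just wraps the grep | cut step and
relies on bcache-super-show's tab-separated output):

```shell
# Extract the cset.uuid value (third tab-separated field of the
# "cset.uuid" line) from bcache-super-show output on stdin.
get_cset_uuid() {
    grep cset.uuid | cut -f 3
}
# On the affected system (as root):
#   bcache-super-show /dev/sdb | get_cset_uuid > /sys/block/bcache0/bcache/attach
```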

4.  Finally, this is the kill sequence to bring the system to its knees:

sudo -s
cd /sys/block/bcache0/bcache
echo 0 > sequential_cutoff
# Verify that the cache is attached (i.e. does not say "no cache").  It should
# say that it's clean since we haven't written anything yet.
cat state
# Copy some random data.
dd if=/dev/urandom of=/dev/bcache0 bs=1M count=250
# Show current state.  On my system approximately 20 to 25 MB remain in
# writeback cache.
cat dirty_data
cat state
# Detach the cache set.  This will start the cache set disk thrashing.
echo 1 > detach
# After a few moments, confirm that the cache set is not going anywhere.  On
# my system, only a few MB have been flushed as evidenced by a small decrease
# in dirty_data.  State remains dirty.
cat dirty_data
cat state
# At this point, the hypervisor system reports hundreds of MB/second of reads
# to the underlying physical SSD coming from the virtual machine; the hard drive
# light is stuck on...  hypervisor status bar shows the activity is on cache
# set.  No writes seem to be occurring on any disk.
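To make "not going anywhere" checkable rather than eyeballed, here is a small
monitoring sketch (not a fix).  bcache reports dirty_data in human-readable
units, so the to_kb helper (a hypothetical name) converts a reading like
"20.5M" to KiB so two samples can be compared numerically:

```shell
# Convert bcache's human-readable sizes (e.g. "512k", "20.5M", "1.0G")
# to whole KiB for numeric comparison.
to_kb() {
    case $1 in
        *k) awk -v n="${1%k}" 'BEGIN { printf "%d\n", n }' ;;
        *M) awk -v n="${1%M}" 'BEGIN { printf "%d\n", n * 1024 }' ;;
        *G) awk -v n="${1%G}" 'BEGIN { printf "%d\n", n * 1024 * 1024 }' ;;
        *)  echo 0 ;;
    esac
}
# On the affected system, sample twice and compare:
#   cd /sys/block/bcache0/bcache
#   a=$(to_kb "$(cat dirty_data)"); sleep 30; b=$(to_kb "$(cat dirty_data)")
#   [ "$b" -lt "$a" ] && echo progress || echo stuck
```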

[8.] Environment
[8.1.] Software (add the output of the ver_linux script here)
Linux bcachetest2 4.6.0-040600rc6-generic #201605012031 SMP Mon May 2 00:33:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Util-linux              2.27.1
Mount                   2.27.1
Module-init-tools       22
E2fsprogs               1.42.13
Xfsprogs                4.3.0
Linux C Library         2.23
Dynamic linker (ldd)    2.23
Linux C++ Library       6.0.21
Procps                  3.3.10
Net-tools               1.60
Kbd                     1.15.5
Console-tools           1.15.5
Sh-utils                8.25
Udev                    229
Modules Loaded          8250_fintek ablk_helper aesni_intel aes_x86_64 ahci async_memcpy async_pq async_raid6_recov async_tx async_xor autofs4 btrfs configfs coretemp crc32_pclmul crct10dif_pclmul cryptd drm drm_kms_helper e1000 fb_sys_fops fjes gf128mul ghash_clmulni_intel glue_helper hid hid_generic i2c_piix4 ib_addr ib_cm ib_core ib_iser ib_mad ib_sa input_leds iscsi_tcp iw_cm joydev libahci libcrc32c libiscsi libiscsi_tcp linear lrw mac_hid mptbase mptscsih mptspi multipath nfit parport parport_pc pata_acpi ppdev psmouse raid0 raid10 raid1 raid456 raid6_pq rdma_cm scsi_transport_iscsi scsi_transport_spi serio_raw shpchp syscopyarea sysfillrect sysimgblt ttm usbhid vmw_balloon vmwgfx vmw_vmci vmw_vsock_vmci_transport vsock xor

[8.2.] Processor information (from /proc/cpuinfo):
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping        : 7
microcode       : 0x29
cpu MHz         : 2491.980
cache size      : 3072 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm epb tsc_adjust dtherm ida arat pln pts
bugs            :
bogomips        : 4983.96
clflush size    : 64
cache_alignment : 64
address sizes   : 42 bits physical, 48 bits virtual
power management:

[8.3.] Module information (from /proc/modules):
ppdev 20480 0 - Live 0x0000000000000000
vmw_balloon 20480 0 - Live 0x0000000000000000
vmw_vsock_vmci_transport 28672 1 - Live 0x0000000000000000
vsock 36864 2 vmw_vsock_vmci_transport, Live 0x0000000000000000
coretemp 16384 0 - Live 0x0000000000000000
joydev 20480 0 - Live 0x0000000000000000
input_leds 16384 0 - Live 0x0000000000000000
serio_raw 16384 0 - Live 0x0000000000000000
shpchp 36864 0 - Live 0x0000000000000000
vmw_vmci 65536 2 vmw_balloon,vmw_vsock_vmci_transport, Live 0x0000000000000000
i2c_piix4 24576 0 - Live 0x0000000000000000
nfit 40960 0 - Live 0x0000000000000000
8250_fintek 16384 0 - Live 0x0000000000000000
parport_pc 32768 0 - Live 0x0000000000000000
parport 49152 2 ppdev,parport_pc, Live 0x0000000000000000
mac_hid 16384 0 - Live 0x0000000000000000
ib_iser 49152 0 - Live 0x0000000000000000
rdma_cm 53248 1 ib_iser, Live 0x0000000000000000
iw_cm 49152 1 rdma_cm, Live 0x0000000000000000
ib_cm 45056 1 rdma_cm, Live 0x0000000000000000
ib_sa 36864 2 rdma_cm,ib_cm, Live 0x0000000000000000
ib_mad 49152 2 ib_cm,ib_sa, Live 0x0000000000000000
ib_core 122880 6 ib_iser,rdma_cm,iw_cm,ib_cm,ib_sa,ib_mad, Live 0x0000000000000000
ib_addr 20480 3 rdma_cm,ib_sa,ib_core, Live 0x0000000000000000
configfs 40960 2 rdma_cm, Live 0x0000000000000000
iscsi_tcp 20480 0 - Live 0x0000000000000000
libiscsi_tcp 24576 1 iscsi_tcp, Live 0x0000000000000000
libiscsi 53248 3 ib_iser,iscsi_tcp,libiscsi_tcp, Live 0x0000000000000000
scsi_transport_iscsi 98304 4 ib_iser,iscsi_tcp,libiscsi, Live 0x0000000000000000
autofs4 40960 2 - Live 0x0000000000000000
btrfs 1024000 0 - Live 0x0000000000000000
raid10 49152 0 - Live 0x0000000000000000
raid456 110592 0 - Live 0x0000000000000000
async_raid6_recov 20480 1 raid456, Live 0x0000000000000000
async_memcpy 16384 2 raid456,async_raid6_recov, Live 0x0000000000000000
async_pq 16384 2 raid456,async_raid6_recov, Live 0x0000000000000000
async_xor 16384 3 raid456,async_raid6_recov,async_pq, Live 0x0000000000000000
async_tx 16384 5 raid456,async_raid6_recov,async_memcpy,async_pq,async_xor, Live 0x0000000000000000
xor 24576 2 btrfs,async_xor, Live 0x0000000000000000
raid6_pq 102400 4 btrfs,raid456,async_raid6_recov,async_pq, Live 0x0000000000000000
libcrc32c 16384 1 raid456, Live 0x0000000000000000
raid1 36864 0 - Live 0x0000000000000000
raid0 20480 0 - Live 0x0000000000000000
multipath 16384 0 - Live 0x0000000000000000
linear 16384 0 - Live 0x0000000000000000
hid_generic 16384 0 - Live 0x0000000000000000
usbhid 49152 0 - Live 0x0000000000000000
hid 122880 2 hid_generic,usbhid, Live 0x0000000000000000
crct10dif_pclmul 16384 0 - Live 0x0000000000000000
crc32_pclmul 16384 0 - Live 0x0000000000000000
ghash_clmulni_intel 16384 0 - Live 0x0000000000000000
aesni_intel 167936 0 - Live 0x0000000000000000
aes_x86_64 20480 1 aesni_intel, Live 0x0000000000000000
lrw 16384 1 aesni_intel, Live 0x0000000000000000
gf128mul 16384 1 lrw, Live 0x0000000000000000
glue_helper 16384 1 aesni_intel, Live 0x0000000000000000
ablk_helper 16384 1 aesni_intel, Live 0x0000000000000000
cryptd 20480 3 ghash_clmulni_intel,aesni_intel,ablk_helper, Live 0x0000000000000000
vmwgfx 237568 1 - Live 0x0000000000000000
ttm 98304 1 vmwgfx, Live 0x0000000000000000
drm_kms_helper 147456 1 vmwgfx, Live 0x0000000000000000
syscopyarea 16384 1 drm_kms_helper, Live 0x0000000000000000
psmouse 131072 0 - Live 0x0000000000000000
sysfillrect 16384 1 drm_kms_helper, Live 0x0000000000000000
sysimgblt 16384 1 drm_kms_helper, Live 0x0000000000000000
fb_sys_fops 16384 1 drm_kms_helper, Live 0x0000000000000000
drm 364544 4 vmwgfx,ttm,drm_kms_helper, Live 0x0000000000000000
ahci 36864 2 - Live 0x0000000000000000
libahci 32768 1 ahci, Live 0x0000000000000000
e1000 135168 0 - Live 0x0000000000000000
mptspi 24576 0 - Live 0x0000000000000000
mptscsih 40960 1 mptspi, Live 0x0000000000000000
mptbase 102400 2 mptspi,mptscsih, Live 0x0000000000000000
scsi_transport_spi 32768 1 mptspi, Live 0x0000000000000000
pata_acpi 16384 0 - Live 0x0000000000000000
fjes 28672 0 - Live 0x0000000000000000

[8.6.] SCSI information (from /proc/scsi/scsi)
Attached devices:
Host: scsi3 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: VMware Virtual S Rev: 0001
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi4 Channel: 00 Id: 00 Lun: 00
  Vendor: NECVMWar Model: VMware SATA CD01 Rev: 1.00
  Type:   CD-ROM                           ANSI  SCSI revision: 05
Host: scsi5 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: VMware Virtual S Rev: 0001
  Type:   Direct-Access                    ANSI  SCSI revision: 05
Host: scsi6 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA      Model: VMware Virtual S Rev: 0001
  Type:   Direct-Access                    ANSI  SCSI revision: 05

Best regards,

James Johnston



* Re: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-08 18:39 ` [dm-crypt] " James Johnston
@ 2016-05-11  1:38   ` Eric Wheeler
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric Wheeler @ 2016-05-11  1:38 UTC (permalink / raw)
  To: James Johnston
  Cc: 'Kent Overstreet', 'Alasdair Kergon',
	'Mike Snitzer',
	linux-bcache, dm-devel, dm-crypt


On Sun, 8 May 2016, James Johnston wrote:

> Hi,
> 
> [1.] One line summary of the problem:
> 
> bcache gets stuck flushing writeback cache when used in combination with
> LUKS/dm-crypt and non-default bucket size
> 
> [2.] Full description of the problem/report:
> 
> I've run into a problem where the bcache writeback cache can't be flushed to
> disk when the backing device is a LUKS / dm-crypt device and the cache set has
> a non-default bucket size.  

You might try LUKS on top of bcache instead of under it.  That might be
better for privacy too; otherwise your cached data sits unencrypted on the
cache device.

> # Make cache set on second drive
> # IMPORTANT:  Problem does not occur if I omit --bucket parameter.
> make-bcache --bucket 2M -C /dev/sdb

2 MB is quite large; maybe it exceeds the 256-bvec limit.  I'm not sure
whether Ming Lei's patch got into 4.6 yet, but try this:
  https://lkml.org/lkml/2016/4/5/1046

and maybe Shaohua Li's patch too:
  http://www.spinics.net/lists/raid/msg51830.html
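
For intuition on the 256-bvec point, a quick back-of-envelope check (under
the assumption of one 4 KiB page per bvec and a 256-entry bio limit — my
reading of the limit, not something verified against the bcache source):

```shell
# A 2 MiB bucket split into 4 KiB pages (one page per bvec) needs more
# segments than a single 256-bvec bio can carry.
bucket_bytes=$((2 * 1024 * 1024))
page_bytes=4096
echo "bvecs needed: $((bucket_bytes / page_bytes)) (bio limit: 256)"
# prints: bvecs needed: 512 (bio limit: 256)
```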


--
Eric Wheeler

> # Set up LUKS/dm-crypt on third drive.
> # IMPORTANT:  Problem does not occur if I omit the dm-crypt layer.
> cryptsetup luksFormat /dev/sdc
> cryptsetup open --type luks /dev/sdc backCrypt
> # Make bcache backing device & enable writeback
> make-bcache -B /dev/mapper/backCrypt
> bcache-super-show /dev/sdb | grep cset.uuid | \
> cut -f 3 > /sys/block/bcache0/bcache/attach
> echo writeback > /sys/block/bcache0/bcache/cache_mode
> 
> 4.  Finally, this is the kill sequence to bring the system to its knees:
> 
> sudo -s
> cd /sys/block/bcache0/bcache
> echo 0 > sequential_cutoff
> # Verify that the cache is attached (i.e. does not say "no cache").  It should
> # say that it's clean since we haven't written anything yet.
> cat state
> # Copy some random data.
> dd if=/dev/urandom of=/dev/bcache0 bs=1M count=250
> # Show current state.  On my system approximately 20 to 25 MB remain in
> # writeback cache.
> cat dirty_data
> cat state
> # Detach the cache set.  This will start the cache set disk thrashing.
> echo 1 > detach
> # After a few moments, confirm that the cache set is not going anywhere.  On
> # my system, only a few MB have been flushed as evidenced by a small decrease
> # in dirty_data.  State remains dirty.
> cat dirty_data
> cat state
> # At this point, the hypervisor system reports hundreds of MB/second of reads
> # to the underlying physical SSD coming from the virtual machine; the hard drive
> # light is stuck on...  hypervisor status bar shows the activity is on cache
> # set.  No writes seem to be occurring on any disk.
> 
> [8.] Environment
> [8.1.] Software (add the output of the ver_linux script here)
> Linux bcachetest2 4.6.0-040600rc6-generic #201605012031 SMP Mon May 2 00:33:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
> Util-linux              2.27.1
> Mount                   2.27.1
> Module-init-tools       22
> E2fsprogs               1.42.13
> Xfsprogs                4.3.0
> Linux C Library         2.23
> Dynamic linker (ldd)    2.23
> Linux C++ Library       6.0.21
> Procps                  3.3.10
> Net-tools               1.60
> Kbd                     1.15.5
> Console-tools           1.15.5
> Sh-utils                8.25
> Udev                    229
> Modules Loaded          8250_fintek ablk_helper aesni_intel aes_x86_64 ahci async_memcpy async_pq async_raid6_recov async_tx async_xor autofs4 btrfs configfs coretemp crc32_pclmul crct10dif_pclmul cryptd drm drm_kms_helper e1000 fb_sys_fops fjes gf128mul ghash_clmulni_intel glue_helper hid hid_generic i2c_piix4 ib_addr ib_cm ib_core ib_iser ib_mad ib_sa input_leds iscsi_tcp iw_cm joydev libahci libcrc32c libiscsi libiscsi_tcp linear lrw mac_hid mptbase mptscsih mptspi multipath nfit parport parport_pc pata_acpi ppdev psmouse raid0 raid10 raid1 raid456 raid6_pq rdma_cm scsi_transport_iscsi scsi_transport_spi serio_raw shpchp syscopyarea sysfillrect sysimgblt ttm usbhid vmw_balloon vmwgfx vmw_vmci vmw_vsock_vmci_transport vsock xor
> 
> [8.2.] Processor information (from /proc/cpuinfo):
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 42
> model name      : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> stepping        : 7
> microcode       : 0x29
> cpu MHz         : 2491.980
> cache size      : 3072 KB
> physical id     : 0
> siblings        : 1
> core id         : 0
> cpu cores       : 1
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm epb tsc_adjust dtherm ida arat pln pts
> bugs            :
> bogomips        : 4983.96
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 42 bits physical, 48 bits virtual
> power management:
> 
> [8.3.] Module information (from /proc/modules):
> ppdev 20480 0 - Live 0x0000000000000000
> vmw_balloon 20480 0 - Live 0x0000000000000000
> vmw_vsock_vmci_transport 28672 1 - Live 0x0000000000000000
> vsock 36864 2 vmw_vsock_vmci_transport, Live 0x0000000000000000
> coretemp 16384 0 - Live 0x0000000000000000
> joydev 20480 0 - Live 0x0000000000000000
> input_leds 16384 0 - Live 0x0000000000000000
> serio_raw 16384 0 - Live 0x0000000000000000
> shpchp 36864 0 - Live 0x0000000000000000
> vmw_vmci 65536 2 vmw_balloon,vmw_vsock_vmci_transport, Live 0x0000000000000000
> i2c_piix4 24576 0 - Live 0x0000000000000000
> nfit 40960 0 - Live 0x0000000000000000
> 8250_fintek 16384 0 - Live 0x0000000000000000
> parport_pc 32768 0 - Live 0x0000000000000000
> parport 49152 2 ppdev,parport_pc, Live 0x0000000000000000
> mac_hid 16384 0 - Live 0x0000000000000000
> ib_iser 49152 0 - Live 0x0000000000000000
> rdma_cm 53248 1 ib_iser, Live 0x0000000000000000
> iw_cm 49152 1 rdma_cm, Live 0x0000000000000000
> ib_cm 45056 1 rdma_cm, Live 0x0000000000000000
> ib_sa 36864 2 rdma_cm,ib_cm, Live 0x0000000000000000
> ib_mad 49152 2 ib_cm,ib_sa, Live 0x0000000000000000
> ib_core 122880 6 ib_iser,rdma_cm,iw_cm,ib_cm,ib_sa,ib_mad, Live 0x0000000000000000
> ib_addr 20480 3 rdma_cm,ib_sa,ib_core, Live 0x0000000000000000
> configfs 40960 2 rdma_cm, Live 0x0000000000000000
> iscsi_tcp 20480 0 - Live 0x0000000000000000
> libiscsi_tcp 24576 1 iscsi_tcp, Live 0x0000000000000000
> libiscsi 53248 3 ib_iser,iscsi_tcp,libiscsi_tcp, Live 0x0000000000000000
> scsi_transport_iscsi 98304 4 ib_iser,iscsi_tcp,libiscsi, Live 0x0000000000000000
> autofs4 40960 2 - Live 0x0000000000000000
> btrfs 1024000 0 - Live 0x0000000000000000
> raid10 49152 0 - Live 0x0000000000000000
> raid456 110592 0 - Live 0x0000000000000000
> async_raid6_recov 20480 1 raid456, Live 0x0000000000000000
> async_memcpy 16384 2 raid456,async_raid6_recov, Live 0x0000000000000000
> async_pq 16384 2 raid456,async_raid6_recov, Live 0x0000000000000000
> async_xor 16384 3 raid456,async_raid6_recov,async_pq, Live 0x0000000000000000
> async_tx 16384 5 raid456,async_raid6_recov,async_memcpy,async_pq,async_xor, Live 0x0000000000000000
> xor 24576 2 btrfs,async_xor, Live 0x0000000000000000
> raid6_pq 102400 4 btrfs,raid456,async_raid6_recov,async_pq, Live 0x0000000000000000
> libcrc32c 16384 1 raid456, Live 0x0000000000000000
> raid1 36864 0 - Live 0x0000000000000000
> raid0 20480 0 - Live 0x0000000000000000
> multipath 16384 0 - Live 0x0000000000000000
> linear 16384 0 - Live 0x0000000000000000
> hid_generic 16384 0 - Live 0x0000000000000000
> usbhid 49152 0 - Live 0x0000000000000000
> hid 122880 2 hid_generic,usbhid, Live 0x0000000000000000
> crct10dif_pclmul 16384 0 - Live 0x0000000000000000
> crc32_pclmul 16384 0 - Live 0x0000000000000000
> ghash_clmulni_intel 16384 0 - Live 0x0000000000000000
> aesni_intel 167936 0 - Live 0x0000000000000000
> aes_x86_64 20480 1 aesni_intel, Live 0x0000000000000000
> lrw 16384 1 aesni_intel, Live 0x0000000000000000
> gf128mul 16384 1 lrw, Live 0x0000000000000000
> glue_helper 16384 1 aesni_intel, Live 0x0000000000000000
> ablk_helper 16384 1 aesni_intel, Live 0x0000000000000000
> cryptd 20480 3 ghash_clmulni_intel,aesni_intel,ablk_helper, Live 0x0000000000000000
> vmwgfx 237568 1 - Live 0x0000000000000000
> ttm 98304 1 vmwgfx, Live 0x0000000000000000
> drm_kms_helper 147456 1 vmwgfx, Live 0x0000000000000000
> syscopyarea 16384 1 drm_kms_helper, Live 0x0000000000000000
> psmouse 131072 0 - Live 0x0000000000000000
> sysfillrect 16384 1 drm_kms_helper, Live 0x0000000000000000
> sysimgblt 16384 1 drm_kms_helper, Live 0x0000000000000000
> fb_sys_fops 16384 1 drm_kms_helper, Live 0x0000000000000000
> drm 364544 4 vmwgfx,ttm,drm_kms_helper, Live 0x0000000000000000
> ahci 36864 2 - Live 0x0000000000000000
> libahci 32768 1 ahci, Live 0x0000000000000000
> e1000 135168 0 - Live 0x0000000000000000
> mptspi 24576 0 - Live 0x0000000000000000
> mptscsih 40960 1 mptspi, Live 0x0000000000000000
> mptbase 102400 2 mptspi,mptscsih, Live 0x0000000000000000
> scsi_transport_spi 32768 1 mptspi, Live 0x0000000000000000
> pata_acpi 16384 0 - Live 0x0000000000000000
> fjes 28672 0 - Live 0x0000000000000000
> 
> [8.6.] SCSI information (from /proc/scsi/scsi)
> Attached devices:
> Host: scsi3 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA      Model: VMware Virtual S Rev: 0001
>   Type:   Direct-Access                    ANSI  SCSI revision: 05
> Host: scsi4 Channel: 00 Id: 00 Lun: 00
>   Vendor: NECVMWar Model: VMware SATA CD01 Rev: 1.00
>   Type:   CD-ROM                           ANSI  SCSI revision: 05
> Host: scsi5 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA      Model: VMware Virtual S Rev: 0001
>   Type:   Direct-Access                    ANSI  SCSI revision: 05
> Host: scsi6 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA      Model: VMware Virtual S Rev: 0001
>   Type:   Direct-Access                    ANSI  SCSI revision: 05
> 
> Best regards,
> 
> James Johnston
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-crypt] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
@ 2016-05-11  1:38   ` Eric Wheeler
  0 siblings, 0 replies; 28+ messages in thread
From: Eric Wheeler @ 2016-05-11  1:38 UTC (permalink / raw)
  To: James Johnston
  Cc: 'Kent Overstreet', 'Alasdair Kergon',
	'Mike Snitzer',
	linux-bcache, dm-devel, dm-crypt


On Sun, 8 May 2016, James Johnston wrote:

> Hi,
> 
> [1.] One line summary of the problem:
> 
> bcache gets stuck flushing writeback cache when used in combination with
> LUKS/dm-crypt and non-default bucket size
> 
> [2.] Full description of the problem/report:
> 
> I've run into a problem where the bcache writeback cache can't be flushed to
> disk when the backing device is a LUKS / dm-crypt device and the cache set has
> a non-default bucket size.  

You might try LUKS atop bcache instead of under it.  This might be 
better for privacy too, since otherwise your cached data is unencrypted.

> # Make cache set on second drive
> # IMPORTANT:  Problem does not occur if I omit --bucket parameter.
> make-bcache --bucket 2M -C /dev/sdb

2MB is quite large; maybe it exceeds the 256-bvec limit.  I'm not sure if 
Ming Lei's patch got into 4.6 yet, but try this:
  https://lkml.org/lkml/2016/4/5/1046

and maybe Shaohua Li's patch too:
  http://www.spinics.net/lists/raid/msg51830.html
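
As a rough sketch of the limit in question: assuming 4 KiB pages and the
historical 256-entry bvec cap per bio (both assumptions here, not figures
stated in the thread), a single bio tops out at 1 MiB, which a 2 MiB bucket
cannot fit into:

```python
# Back-of-envelope arithmetic only; 4 KiB pages and the 256-bvec cap are
# assumptions for illustration.
PAGE_SIZE = 4096
BVEC_MAX = 256
MAX_BIO_BYTES = PAGE_SIZE * BVEC_MAX  # 1 MiB per bio under these assumptions

def exceeds_bvec_limit(bucket_bytes):
    """True if a bucket is too large to be covered by a single bio."""
    return bucket_bytes > MAX_BIO_BYTES

print(exceeds_bvec_limit(2 * 1024 * 1024))  # 2 MiB bucket -> True
print(exceeds_bvec_limit(512 * 1024))       # 512 KiB bucket -> False
```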


--
Eric Wheeler

> # Set up LUKS/dm-crypt on second drive.
> # IMPORTANT:  Problem does not occur if I omit the dm-crypt layer.
> cryptsetup luksFormat /dev/sdc
> cryptsetup open --type luks /dev/sdc backCrypt
> # Make bcache backing device & enable writeback
> make-bcache -B /dev/mapper/backCrypt
> bcache-super-show /dev/sdb | grep cset.uuid | \
> cut -f 3 > /sys/block/bcache0/bcache/attach
> echo writeback > /sys/block/bcache0/bcache/cache_mode
> 
> 4.  Finally, this is the kill sequence to bring the system to its knees:
> 
> sudo -s
> cd /sys/block/bcache0/bcache
> echo 0 > sequential_cutoff
> # Verify that the cache is attached (i.e. does not say "no cache").  It should
> # say that it's clean since we haven't written anything yet.
> cat state
> # Copy some random data.
> dd if=/dev/urandom of=/dev/bcache0 bs=1M count=250
> # Show current state.  On my system approximately 20 to 25 MB remain in
> # writeback cache.
> cat dirty_data
> cat state
> # Detach the cache set.  This will start the cache set disk thrashing.
> echo 1 > detach
> # After a few moments, confirm that the cache set is not going anywhere.  On
> # my system, only a few MB have been flushed as evidenced by a small decrease
> # in dirty_data.  State remains dirty.
> cat dirty_data
> cat state
> # At this point, the hypervisor system reports hundreds of MB/second of reads
> # to the underlying physical SSD coming from the virtual machine; the hard drive
> # light is stuck on...  hypervisor status bar shows the activity is on cache
> # set.  No writes seem to be occurring on any disk.
> 
> [8.] Environment
> [8.1.] Software (add the output of the ver_linux script here)
> Linux bcachetest2 4.6.0-040600rc6-generic #201605012031 SMP Mon May 2 00:33:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
> Util-linux              2.27.1
> Mount                   2.27.1
> Module-init-tools       22
> E2fsprogs               1.42.13
> Xfsprogs                4.3.0
> Linux C Library         2.23
> Dynamic linker (ldd)    2.23
> Linux C++ Library       6.0.21
> Procps                  3.3.10
> Net-tools               1.60
> Kbd                     1.15.5
> Console-tools           1.15.5
> Sh-utils                8.25
> Udev                    229
> Modules Loaded          8250_fintek ablk_helper aesni_intel aes_x86_64 ahci async_memcpy async_pq async_raid6_recov async_tx async_xor autofs4 btrfs configfs coretemp crc32_pclmul crct10dif_pclmul cryptd drm drm_kms_helper e1000 fb_sys_fops fjes gf128mul ghash_clmulni_intel glue_helper hid hid_generic i2c_piix4 ib_addr ib_cm ib_core ib_iser ib_mad ib_sa input_leds iscsi_tcp iw_cm joydev libahci libcrc32c libiscsi libiscsi_tcp linear lrw mac_hid mptbase mptscsih mptspi multipath nfit parport parport_pc pata_acpi ppdev psmouse raid0 raid10 raid1 raid456 raid6_pq rdma_cm scsi_transport_iscsi scsi_transport_spi serio_raw shpchp syscopyarea sysfillrect sysimgblt ttm usbhid vmw_balloon vmwgfx vmw_vmci vmw_vsock_vmci_transport vsock xor
> 
> [8.2.] Processor information (from /proc/cpuinfo):
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 42
> model name      : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> stepping        : 7
> microcode       : 0x29
> cpu MHz         : 2491.980
> cache size      : 3072 KB
> physical id     : 0
> siblings        : 1
> core id         : 0
> cpu cores       : 1
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm epb tsc_adjust dtherm ida arat pln pts
> bugs            :
> bogomips        : 4983.96
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 42 bits physical, 48 bits virtual
> power management:
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-11  1:38   ` [dm-crypt] " Eric Wheeler
  (?)
@ 2016-05-15  9:08   ` Tim Small
  2016-05-16 13:02       ` [dm-crypt] " Tim Small
  -1 siblings, 1 reply; 28+ messages in thread
From: Tim Small @ 2016-05-15  9:08 UTC (permalink / raw)
  To: Eric Wheeler, James Johnston; +Cc: linux-bcache

Hello,

I've just hit the same bug in production; as it happens, it's on a similar config:

. 4.5.1 (Debian backports kernel)
. bcache with 2M bucket (Intel DC S3500), with dm-crypt layered on top
of the backing device (4x 8TB RAID5).

I'm reducing the cc list, as I think this is bcache specific.

On 11/05/16 02:38, Eric Wheeler wrote:

> You might try LUKS atop of bcache instead of under it.  This might be 
> better for privacy too, otherwise your cached data is unencrypted.

I chose the same config as James, because the SSD has hardware
encryption (whereas the hard drives don't), and it'd be nice if cache
reads didn't incur the extra CPU overhead/latency: the workload is
read-heavy, the cache hit rate should be pretty high, and the CPU doesn't
have AES-NI.

>> # Make cache set on second drive
>> # IMPORTANT:  Problem does not occur if I omit --bucket parameter.
>> make-bcache --bucket 2M -C /dev/sdb
> 
> 2MB is quite large, maybe it exceeds the 256-bvec limit.

In my case I followed the instructions in the make-bcache manual page
which say:

"The bucket size is intended to be equal to the size of your SSD's erase
blocks"

A bit of research suggested an erase block size of either 2M or 4M for
the SSD I was using.  Is this manual page incorrect?

> I'm not sure if Ming Lei's patch got into 4.6 yet, but try this:
>   https://lkml.org/lkml/2016/4/5/1046
> 
> and maybe Shaohua Li's patch too:
>   http://www.spinics.net/lists/raid/msg51830.html

I'll give them both a go...

Perhaps there should be a bcache wiki hosted on kernel.org to
cover this sort of stuff, as the bcache docs seem to be a bit lacking
currently?

Tim.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-15  9:08   ` Tim Small
@ 2016-05-16 13:02       ` Tim Small
  0 siblings, 0 replies; 28+ messages in thread
From: Tim Small @ 2016-05-16 13:02 UTC (permalink / raw)
  To: Eric Wheeler, James Johnston; +Cc: linux-bcache, dm-crypt, dm-devel

Hi Eric,

On 15/05/16 10:08, Tim Small wrote:
> On 11/05/16 02:38, Eric Wheeler wrote:
>> I'm not sure if Ming Lei's patch got into 4.6 yet, but try this:
>> >   https://lkml.org/lkml/2016/4/5/1046
>> > 
>> > and maybe Shaohua Li's patch too:
>> >   http://www.spinics.net/lists/raid/msg51830.html

> I'll give them both a go...

I tried both of these on 4.6.0-rc7 without any change in the symptoms (the
cache device is still continuously read).  Then I also tried disabling
partial_stripes_expensive prior to registering the bcache device as per
your instructions here:

https://lkml.org/lkml/2016/2/1/636

and that seems to have improved things, but not fixed them.

The cache device is 120G; dirty_data had got up to 55.3G and has now
dropped to 44.5G, but isn't going any further...

The cache device is being read at a steady ~270 MB/s, and the backing
device (dm-crypt) being written at the same rate, but the writes aren't
flowing down to the underlying devices (md RAID5, and SATA disks).  I'm
guessing that these writes are being refused/retried, and are maybe
failing due to their size (avgrq-sz showing > 4000 sectors on the
backing device)?  Disabling partial_stripes_expensive maybe just
resulted in a few GB of small writes succeeding?
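
A quick conversion of those avgrq-sz figures (iostat reports them in
512-byte sectors) supports that hunch; assuming a 1 MiB single-bio ceiling
(256 bvecs of 4 KiB pages, an assumption, not something iostat reports),
the ~4000-sector requests hitting dm-0 are roughly double that:

```python
# Convert iostat's avgrq-sz (512-byte sectors) to bytes and compare it
# against an assumed 1 MiB single-bio ceiling.
SECTOR_BYTES = 512
MAX_BIO_BYTES = 256 * 4096  # assumed 1 MiB cap

def avgrq_bytes(avgrq_sz):
    """Bytes per request, given iostat's avgrq-sz in 512-byte sectors."""
    return avgrq_sz * SECTOR_BYTES

print(avgrq_bytes(4056.49))                  # dm-0 above: ~2.08 MB/request
print(avgrq_bytes(4056.49) > MAX_BIO_BYTES)  # True
```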

# iostat -y -d 2 -x -p /dev/sdf /dev/dm-0 /dev/md2 /dev/bcache0
Linux 4.6.0-rc7+  16/05/16        _x86_64_        (2 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdf               0.00     0.00  413.00    0.00 281422.00      0.00  1362.82   143.18  338.31  338.31    0.00   2.42 100.00
sdf1              0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf2              0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf3              0.00     0.00  413.00    0.00 281422.00      0.00  1362.82   143.18  338.31  338.31    0.00   2.42 100.00
dm-0              0.00     0.00    0.00  138.50      0.00 280912.00  4056.49     0.00    0.01    0.00    0.01   0.01   0.20
md2               0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
bcache0           0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdf               0.00     6.00  412.00    1.50 281806.00     32.00  1363.18   135.19  314.09  314.78  124.00   2.42 100.00
sdf1              0.00     6.00    0.00    1.50      0.00     32.00    42.67     4.10  124.00    0.00  124.00 388.00  58.20
sdf2              0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdf3              0.00     0.00  412.00    0.00 281806.00      0.00  1367.99   131.10  314.78  314.78    0.00   2.43 100.00
dm-0              0.00     0.00    0.00  138.50      0.00 282388.00  4077.81     0.00    0.01    0.00    0.01   0.01   0.20
md2               0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
bcache0           0.00     0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 28+ messages in thread


* Re: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-16 13:02       ` [dm-crypt] " Tim Small
@ 2016-05-16 13:53         ` Tim Small
  -1 siblings, 0 replies; 28+ messages in thread
From: Tim Small @ 2016-05-16 13:53 UTC (permalink / raw)
  To: Eric Wheeler, James Johnston; +Cc: linux-bcache, dm-crypt, dm-devel

On 16/05/16 14:02, Tim Small wrote:
> # iostat -y -d 2 -x -p /dev/sdf /dev/dm-0 /dev/md2 /dev/bcache0

... and my mail client then mangled the word-wrapping.  Trying again:

Here's a typical hand-edited excerpt from:

iostat -d 2 -x -y -m -p /dev/sdf /dev/dm-0 /dev/md2 /dev/bcache0

...

Device:    r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await
sdf     396.50   19.50   272.02     0.25  1340.38   138.44  346.09
sdf3    397.00    0.00   272.52     0.00  1405.83   130.05  338.40
dm-0      0.00  149.00     0.00   271.29  3728.81     0.01    0.04
md2       0.00    0.00     0.00     0.00     0.00     0.00    0.00
bcache0   0.00    0.00     0.00     0.00     0.00     0.00    0.00

where:

sdf is the SSD (bcache cache device is sdf3)
dm-0 is dm-crypt backing device (bcache backing store)
md2 is the underlying device for dm-crypt
bcache0 is the bcache device.

According to the iostat manual page:

"avgrq-sz The average size (in sectors) of the requests that were issued
to the device."

dm-0 is described like this in the output of 'dmsetup table':

encryptedstore01: 0 46879675392 crypt aes-xts-plain64
0000000000000000000000000000000000000000000000000000000000000000 0 9:2
3072 1 allow_discards

Tim.

^ permalink raw reply	[flat|nested] 28+ messages in thread


* Re: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-08 18:39 ` [dm-crypt] " James Johnston
@ 2016-05-16 16:08   ` Tim Small
  -1 siblings, 0 replies; 28+ messages in thread
From: Tim Small @ 2016-05-16 16:08 UTC (permalink / raw)
  To: James Johnston, 'Kent Overstreet',
	'Alasdair Kergon', 'Mike Snitzer'
  Cc: linux-bcache, dm-devel, dm-crypt

On 08/05/16 19:39, James Johnston wrote:
> I've run into a problem where the bcache writeback cache can't be flushed to
> disk when the backing device is a LUKS / dm-crypt device and the cache set has
> a non-default bucket size.  Basically, only a few megabytes will be flushed to
> disk, and then it gets stuck.  Stuck means that the bcache writeback task
> thrashes the disk by constantly reading hundreds of MB/second from the cache set
> in an infinite loop, while not actually progressing (dirty_data never decreases
> beyond a certain point).

> [...]

> The situation is basically unrecoverable as far as I can tell: if you attempt
> to detach the cache set then the cache set disk gets thrashed extra-hard
> forever, and it's impossible to actually get the cache set detached.  The only
> solution seems to be to back up the data and destroy the volume...

You can boot an older kernel to flush the device without destroying it
(I'm guessing that works because older kernels split up the big requests
which are failing on the 4.4 kernel).  Once flushed, you could put the
cache into writethrough mode, or use a smaller bucket size.
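
Something like this sketch could watch the flush and then report completion
(the sysfs paths and state strings match the ones used earlier in the
thread; the function name and polling interval are made up here):

```shell
# Hypothetical helper: after booting the older kernel, switch the cache to
# writethrough and wait for the writeback cache to drain.  The argument is
# the device's bcache sysfs directory, e.g. /sys/block/bcache0/bcache.
# Needs root to write to sysfs.
flush_and_writethrough() {
    dir="$1"
    echo writethrough > "$dir/cache_mode"
    # Poll until bcache reports the writeback cache fully flushed; "state"
    # reads "dirty" while flushing and "clean" once done.
    while [ "$(cat "$dir/state")" != "clean" ]; do
        sleep 5
        cat "$dir/dirty_data"   # progress indicator
    done
    echo "cache is clean"
}
```

Usage: `flush_and_writethrough /sys/block/bcache0/bcache`, then detach or
re-tune the bucket size once it prints "cache is clean".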

Tim.

^ permalink raw reply	[flat|nested] 28+ messages in thread


* RE: [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-11  1:38   ` [dm-crypt] " Eric Wheeler
@ 2016-05-18 17:01     ` James Johnston
  -1 siblings, 0 replies; 28+ messages in thread
From: James Johnston @ 2016-05-18 17:01 UTC (permalink / raw)
  To: 'Eric Wheeler'
  Cc: 'Mike Snitzer',
	dm-crypt, dm-devel, linux-bcache, 'Kent Overstreet',
	'Alasdair Kergon'

> On Sun, 8 May 2016, James Johnston wrote:
> 
> > Hi,
> >
> > [1.] One line summary of the problem:
> >
> > bcache gets stuck flushing writeback cache when used in combination with
> > LUKS/dm-crypt and non-default bucket size
> >
> > [2.] Full description of the problem/report:
> >
> > I've run into a problem where the bcache writeback cache can't be flushed to
> > disk when the backing device is a LUKS / dm-crypt device and the cache set has
> > a non-default bucket size.
> 
> You might try LUKS atop of bcache instead of under it.  This might be
> better for privacy too, otherwise your cached data is unencrypted.

Only in this test case; on my real setup, the cache device is also layered on top
of LUKS.  (On both backing & cache, it's LUKS --> LVM2 --> bcache.  This gives me
flexibility to adjust volumes without messing with the encryption, or having more
encryption devices than really needed.  At any rate, I expect this setup to at
least work...)

> 
> > # Make cache set on second drive
> > # IMPORTANT:  Problem does not occur if I omit --bucket parameter.
> > make-bcache --bucket 2M -C /dev/sdb
> 
> 2MB is quite large, maybe it exceeds the 256-bvec limit.  I'm not sure if
> Ming Lei's patch got in to 4.6 yet, but try this:
>   https://lkml.org/lkml/2016/4/5/1046
> 
> and maybe Shaohua Li's patch too:
>   http://www.spinics.net/lists/raid/msg51830.html

Trying these is still on my TODO list (thus the belated reply here), but based
on the responses from Tim Small I'm doubtful this will fix anything, as it
sounds like he has the same problem (symptoms sound exactly the same) and he
says the patches didn't help.

Like Tim, I also chose a large bucket size because the manual page told me to.
Based on the high-level description of bcache and my knowledge of how flash
works, it certainly sounds necessary.

Perhaps the union of people who read manpages and people who use LUKS like
this is very small. :)

James

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [dm-crypt] [dm-devel] bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
@ 2016-05-18 17:01     ` James Johnston
  0 siblings, 0 replies; 28+ messages in thread
From: James Johnston @ 2016-05-18 17:01 UTC (permalink / raw)
  To: 'Eric Wheeler'
  Cc: 'Mike Snitzer',
	dm-crypt, dm-devel, linux-bcache, 'Kent Overstreet',
	'Alasdair Kergon'

> On Sun, 8 May 2016, James Johnston wrote:
> 
> > Hi,
> >
> > [1.] One line summary of the problem:
> >
> > bcache gets stuck flushing writeback cache when used in combination with
> > LUKS/dm-crypt and non-default bucket size
> >
> > [2.] Full description of the problem/report:
> >
> > I've run into a problem where the bcache writeback cache can't be flushed to
> > disk when the backing device is a LUKS / dm-crypt device and the cache set has
> > a non-default bucket size.
> 
> You might try LUKS atop of bcache instead of under it.  This might be
> better for privacy too, otherwise your cached data is unencrypted.

Only in this test case; on my real setup, the cache device is also layered on top
of LUKS.  (On both backing & cache, it's LUKS --> LVM2 --> bcache.  This gives me
flexibility to adjust volumes without messing with the encryption, or having more
encryption devices than really needed.  At any rate, I expect this setup to at
least work...)

> 
> > # Make cache set on second drive
> > # IMPORTANT:  Problem does not occur if I omit --bucket parameter.
> > make-bcache --bucket 2M -C /dev/sdb
> 
> 2MB is quite large; maybe it exceeds the 256-bvec limit.  I'm not sure if
> Ming Lei's patch got into 4.6 yet, but try this:
>   https://lkml.org/lkml/2016/4/5/1046
> 
> and maybe Shaohua Li's patch too:
>   http://www.spinics.net/lists/raid/msg51830.html

Trying these is still on my TODO list (thus the belated reply here), but based
on the responses from Tim Small I'm doubtful this will fix anything: it sounds
like he has the same problem (the symptoms sound exactly the same) and he says
the patches didn't help.

Like Tim, I also chose a large bucket size because the manual page told me to.
Based on the high-level description of bcache and my knowledge of how flash
works, it certainly sounds necessary.

Perhaps the intersection of people who read manpages and people who use LUKS
like this is very small. :)

James

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-16 13:02       ` [dm-crypt] " Tim Small
@ 2016-05-19 23:15         ` Eric Wheeler
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric Wheeler @ 2016-05-19 23:15 UTC (permalink / raw)
  To: Tim Small; +Cc: James Johnston, linux-bcache, dm-crypt, dm-devel

On Mon, 16 May 2016, Tim Small wrote:
> Hi Eric,
> 
> On 15/05/16 10:08, Tim Small wrote:
> > On 11/05/16 02:38, Eric Wheeler wrote:
> >> Ming Lei's patch got in to 4.6 yet, but try this:
> >> >   https://lkml.org/lkml/2016/4/5/1046
> >> > 
> >> > and maybe Shaohua Li's patch too:
> >> >   http://www.spinics.net/lists/raid/msg51830.html
> 
> > I'll give them both a go...
> 
> I tried both of these on 4.6.0-rc7 without change to the symptoms (cache
> device continuously read).  Then I tried also disabling
> partial_stripes_expensive prior to registering the bcache device as per
> your instructions here:
> 
> https://lkml.org/lkml/2016/2/1/636
> 
> and that seems to have improved things, but not fixed them.

What is your /sys/class/X/queue/limits/io_opt value? (requires the sysfs 
patch)

Caution: make these changes at your own risk; I have no idea what other 
side effects there might be when modifying io_opt and dc->disk.stripe_size, 
so be sure this is a test machine.

You could update my sysfs limits patch to set QL_SYSFS_RW for io_opt and 
shrink it or set it to zero before registering.  

or,

bcache sets the disk.stripe_size at initialization, so you could just 
force this to 0 in cached_dev_init() and see if it fixes that:

-bcache/super.c:1138    dc->disk.stripe_size = q->limits.io_opt >> 9;
+bcache/super.c:1138    dc->disk.stripe_size = 0;

It then uses stripe_size in the writeback code:

writeback.c:299:        stripe_offset = offset & (d->stripe_size - 1);
writeback.c:303:                              d->stripe_size - stripe_offset);
writeback.c:313:                if (sectors_dirty == d->stripe_size)
writeback.c:357:                                        stripe * dc->disk.stripe_size, 0);
writeback.c:361:                                       next_stripe * dc->disk.stripe_size, 0),
writeback.h:20: do_div(offset, d->stripe_size);
writeback.h:34:         if (nr_sectors <= dc->disk.stripe_size)
writeback.h:37:         nr_sectors -= dc->disk.stripe_size;

Speculation only, but I've always wondered if there are issues when io_opt != 0.
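
Worth noting about the lines quoted above: the mask in writeback.c:299 and the
do_div in writeback.h:20 agree only when stripe_size is a power of two, which an
io_opt-derived value (e.g. from a RAID array) need not be.  A quick shell check
with illustrative numbers (not taken from the report):

```shell
# The mask (writeback.c:299) and the division (writeback.h:20) compute the
# same in-stripe offset only when stripe_size is a power of two.
offset=5000
echo "stripe 2048: mask=$(( offset & (2048 - 1) )) mod=$(( offset % 2048 ))"
echo "stripe 3072: mask=$(( offset & (3072 - 1) )) mod=$(( offset % 3072 ))"
# stripe 2048: mask=904 mod=904   (power of two: the two forms agree)
# stripe 3072: mask=904 mod=1928  (non-power-of-two stripe: they diverge)
```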

Are you able to test one or the other or both methods?

--
Eric Wheeler


> 
> The cache device is 120G, and dirty_data had got up to 55.3G, but has
> now dropped down to 44.5G, but isn't going any further...
> 
> The cache device is being read at a steady ~270 MB/s, and the backing
> device (dm-crypt) being written at the same rate, but the writes aren't
> flowing down to the underlying devices (md RAID5, and SATA disks).  I'm
> guessing that these writes are being refused/retried, and are maybe
> failing due to their size (avgrq-sz showing > 4000 sectors on the
> backing device)?  Disabling the partial stripes expensive maybe just
> resulted in a few GB of small writes succeeding?
> 
> # iostat -y -d 2 -x -p /dev/sdf /dev/dm-0 /dev/md2 /dev/bcache0
> Linux 4.6.0-rc7+  16/05/16        _x86_64_        (2 CPU)
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s     rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdf               0.00     0.00  413.00    0.00 281422.00       0.00  1362.82   143.18  338.31  338.31    0.00   2.42 100.00
> sdf1              0.00     0.00    0.00    0.00      0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdf2              0.00     0.00    0.00    0.00      0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdf3              0.00     0.00  413.00    0.00 281422.00       0.00  1362.82   143.18  338.31  338.31    0.00   2.42 100.00
> dm-0              0.00     0.00    0.00  138.50      0.00  280912.00  4056.49     0.00    0.01    0.00    0.01   0.01   0.20
> md2               0.00     0.00    0.00    0.00      0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> bcache0           0.00     0.00    0.00    0.00      0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s     rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> sdf               0.00     6.00  412.00    1.50 281806.00      32.00  1363.18   135.19  314.09  314.78  124.00   2.42 100.00
> sdf1              0.00     6.00    0.00    1.50      0.00      32.00    42.67     4.10  124.00    0.00  124.00 388.00  58.20
> sdf2              0.00     0.00    0.00    0.00      0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> sdf3              0.00     0.00  412.00    0.00 281806.00       0.00  1367.99   131.10  314.78  314.78    0.00   2.43 100.00
> dm-0              0.00     0.00    0.00  138.50      0.00  282388.00  4077.81     0.00    0.01    0.00    0.01   0.01   0.20
> md2               0.00     0.00    0.00    0.00      0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> bcache0           0.00     0.00    0.00    0.00      0.00       0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
> 
> Cheers,
> 
> Tim.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-16 16:08   ` [dm-crypt] " Tim Small
@ 2016-05-19 23:22     ` Eric Wheeler
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric Wheeler @ 2016-05-19 23:22 UTC (permalink / raw)
  To: Tim Small
  Cc: James Johnston, 'Kent Overstreet',
	'Alasdair Kergon', 'Mike Snitzer',
	linux-bcache, dm-devel, dm-crypt


On Mon, 16 May 2016, Tim Small wrote:

> On 08/05/16 19:39, James Johnston wrote:
> > I've run into a problem where the bcache writeback cache can't be flushed to
> > disk when the backing device is a LUKS / dm-crypt device and the cache set has
> > a non-default bucket size.  Basically, only a few megabytes will be flushed to
> > disk, and then it gets stuck.  Stuck means that the bcache writeback task
> > thrashes the disk by constantly reading hundreds of MB/second from the cache set
> > in an infinite loop, while not actually progressing (dirty_data never decreases
> > beyond a certain point).
> 
> > [...]
> 
> > The situation is basically unrecoverable as far as I can tell: if you attempt
> > to detach the cache set then the cache set disk gets thrashed extra-hard
> > forever, and it's impossible to actually get the cache set detached.  The only
> > solution seems to be to back up the data and destroy the volume...
> 
> You can boot an older kernel to flush the device without destroying it
> (I'm guessing that's because older kernels split down the big requests
> which are failing on the 4.4 kernel).  Once flushed you could put the
> cache into writethrough mode, or use a smaller bucket size.
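
The recovery Tim describes can be sketched as a short sysfs sequence (a hedged
sketch: paths assume the backing device registered as bcache0, as in the
original report, and the guard makes it a no-op on machines without it):

```shell
# After booting a working (older) kernel: stop new writeback dirty data,
# then watch the existing dirty data drain before changing bucket size.
dev=/sys/block/bcache0/bcache
if [ -w "$dev/cache_mode" ]; then
    echo writethrough > "$dev/cache_mode"
    tries=0
    until grep -q clean "$dev/state" || [ "$tries" -ge 120 ]; do
        cat "$dev/dirty_data"        # watch the flush progress
        tries=$((tries + 1))
        sleep 5
    done
else
    echo "bcache0 not registered; nothing to flush"
fi
```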

Indeed, can someone test 4.1.y and see if the problem persists with a 2M 
bucket size?  (If someone has already tested 4.1, then apologies, as I've 
not yet seen that report.)

If 4.1 works, then I think a bisect is in order.  Such a bisect would at 
least highlight the problem and might indicate a (hopefully trivial) fix.

--
Eric Wheeler



> 
> Tim.
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-19 23:22     ` [dm-crypt] " Eric Wheeler
@ 2016-05-20  6:59       ` James Johnston
  -1 siblings, 0 replies; 28+ messages in thread
From: James Johnston @ 2016-05-20  6:59 UTC (permalink / raw)
  To: 'Eric Wheeler', 'Tim Small'
  Cc: 'Kent Overstreet', 'Alasdair Kergon',
	'Mike Snitzer',
	linux-bcache, dm-devel, dm-crypt

> On Mon, 16 May 2016, Tim Small wrote:
> 
> > On 08/05/16 19:39, James Johnston wrote:
> > > I've run into a problem where the bcache writeback cache can't be flushed to
> > > disk when the backing device is a LUKS / dm-crypt device and the cache set has
> > > a non-default bucket size.  Basically, only a few megabytes will be flushed to
> > > disk, and then it gets stuck.  Stuck means that the bcache writeback task
> > > thrashes the disk by constantly reading hundreds of MB/second from the cache set
> > > in an infinite loop, while not actually progressing (dirty_data never decreases
> > > beyond a certain point).
> >
> > > [...]
> >
> > > The situation is basically unrecoverable as far as I can tell: if you attempt
> > > to detach the cache set then the cache set disk gets thrashed extra-hard
> > > forever, and it's impossible to actually get the cache set detached.  The only
> > > solution seems to be to back up the data and destroy the volume...
> >
> > You can boot an older kernel to flush the device without destroying it
> > (I'm guessing that's because older kernels split down the big requests
> > which are failing on the 4.4 kernel).  Once flushed you could put the
> > cache into writethrough mode, or use a smaller bucket size.
> 
> Indeed, can someone test 4.1.y and see if the problem persists with a 2M
> bucket size?  (If someone has already tested 4.1, then appologies as I've
> not yet seen that report.)
> 
> If 4.1 works, then I think a bisect is in order.  Such a bisect would at
> least highlight the problem and might indicate a (hopefully trivial) fix.

To help narrow this down, I tested the following generic pre-compiled mainline kernels
on Ubuntu 15.10:

 * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3.6-wily/
 * DOES NOT WORK:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc1+cod1-wily/

I also tried the default & latest distribution-provided 4.2 kernel.  It worked.
This one also worked:

 * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2.8-wily/

So it seems to be a regression between the 4.3.6 kernel and any 4.4 kernel.  That
should help save time with bisection...

James

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-08 18:39 ` [dm-crypt] " James Johnston
                   ` (2 preceding siblings ...)
  (?)
@ 2016-05-20 20:22 ` Eric Wheeler
  -1 siblings, 0 replies; 28+ messages in thread
From: Eric Wheeler @ 2016-05-20 20:22 UTC (permalink / raw)
  To: James Johnston; +Cc: 'Kent Overstreet', Tim Small, linux-bcache


On Sun, 8 May 2016, James Johnston wrote:
> [1.] One line summary of the problem:
> 
> bcache gets stuck flushing writeback cache when used in combination with
> LUKS/dm-crypt and non-default bucket size
> 
> [2.] Full description of the problem/report:
> 
> I've run into a problem where the bcache writeback cache can't be flushed to
> disk when the backing device is a LUKS / dm-crypt device and the cache set has
> a non-default bucket size.  Basically, only a few megabytes will be flushed to
> disk, and then it gets stuck.  Stuck means that the bcache writeback task
> thrashes the disk by constantly reading hundreds of MB/second from the cache set
> in an infinite loop, while not actually progressing (dirty_data never decreases
> beyond a certain point).

While it's thrashing, can you try getting a stack trace from the 
[bcache_writebac] thread with `cat /proc/<pid>/stack`?

Run it several times as it is bound to change; maybe we can track down 
where it is spinning disk IO in the writeback process and add some debug 
code.  Perhaps there is some error-and-retry logic that needs some debug 
output.
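
That sampling loop might look like this (a sketch: the kthread's comm is
truncated to 15 characters, so it shows up as bcache_writebac, and reading
/proc/<pid>/stack normally requires root):

```shell
# Grab several kernel-stack samples from the writeback thread while it spins.
pid=$(pgrep -x bcache_writebac 2>/dev/null | head -n1)
if [ -n "$pid" ]; then
    for i in 1 2 3 4 5; do
        echo "=== sample $i ==="
        cat "/proc/$pid/stack"
        sleep 1
    done
else
    echo "no bcache_writebac thread found"
fi
```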

--
Eric Wheeler



> 
> I am wondering if anybody else can reproduce this apparent bug?  Apologies for
> mailing both device mapper and bcache mailing lists, but I'm not sure where the
> bug lies as I've only reproduced it when both are used in combination.
> 
> The situation is basically unrecoverable as far as I can tell: if you attempt
> to detach the cache set then the cache set disk gets thrashed extra-hard
> forever, and it's impossible to actually get the cache set detached.  The only
> solution seems to be to back up the data and destroy the volume...
> 
> [3.] Keywords (i.e., modules, networking, kernel):
> 
> bcache, dm-crypt, LUKS, device mapper, LVM
> 
> [4.] Kernel information
> [4.1.] Kernel version (from /proc/version):
> Linux version 4.6.0-040600rc6-generic (kernel@gloin) (gcc version 5.2.1 20151010 (Ubuntu 5.2.1-22ubuntu2) ) #201605012031 SMP Mon May 2 00:33:26 UTC 2016
> 
> [7.] A small shell script or example program which triggers the
>      problem (if possible)
> 
> Here are the steps I used to reproduce:
> 
> 1.  Set up an Ubuntu 16.04 virtual machine in VMware with three SATA hard
>     drives.  Ubuntu was installed with default settings, except that: (1) guided
>     partitioning used with NO LVM or dm-crypt, (2) OpenSSH server installed.
>     First SATA drive has operating system installation.  Second SATA drive is
>     used for bcache cache set.  Third SATA drive has dm-crypt/LUKS + bcache
>     backing device.  Note that all drives have 512 byte physical sectors.  Also,
>     all virtual drives are backed by a single physical SSD with 512 byte
>     sectors. (i.e. not advanced format)
> 
> 2.  Ubuntu was updated to latest packages as of 5/8/2016.  The problem
>     reproduces with both distribution kernel 4.4.0-22-generic and also mainline
>     kernel 4.6.0-040600rc6-generic distributed by Ubuntu kernel team.  Installed
>     bcache-tools package was 1.0.8-2.  Installed cryptsetup-bin package was
>     2:1.6.6-5ubuntu2.
> 
> 3.  Set up the cache set, dm-crypt, and backing device:
> 
> sudo -s
> # Make cache set on second drive
> # IMPORTANT:  Problem does not occur if I omit --bucket parameter.
> make-bcache --bucket 2M -C /dev/sdb
> # Set up LUKS/dm-crypt on third drive.
> # IMPORTANT:  Problem does not occur if I omit the dm-crypt layer.
> cryptsetup luksFormat /dev/sdc
> cryptsetup open --type luks /dev/sdc backCrypt
> # Make bcache backing device & enable writeback
> make-bcache -B /dev/mapper/backCrypt
> bcache-super-show /dev/sdb | grep cset.uuid | \
> cut -f 3 > /sys/block/bcache0/bcache/attach
> echo writeback > /sys/block/bcache0/bcache/cache_mode
> 
> 4.  Finally, this is the kill sequence to bring the system to its knees:
> 
> sudo -s
> cd /sys/block/bcache0/bcache
> echo 0 > sequential_cutoff
> # Verify that the cache is attached (i.e. does not say "no cache").  It should
> # say that it's clean since we haven't written anything yet.
> cat state
> # Copy some random data.
> dd if=/dev/urandom of=/dev/bcache0 bs=1M count=250
> # Show current state.  On my system approximately 20 to 25 MB remain in
> # writeback cache.
> cat dirty_data
> cat state
> # Detach the cache set.  This will start the cache set disk thrashing.
> echo 1 > detach
> # After a few moments, confirm that the cache set is not going anywhere.  On
> # my system, only a few MB have been flushed as evidenced by a small decrease
> # in dirty_data.  State remains dirty.
> cat dirty_data
> cat state
> # At this point, the hypervisor system reports hundreds of MB/second of reads
> # to the underlying physical SSD coming from the virtual machine; the hard drive
> # light is stuck on...  hypervisor status bar shows the activity is on cache
> # set.  No writes seem to be occurring on any disk.
> 
> [8.] Environment
> [8.1.] Software (add the output of the ver_linux script here)
> Linux bcachetest2 4.6.0-040600rc6-generic #201605012031 SMP Mon May 2 00:33:26 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> 
> Util-linux              2.27.1
> Mount                   2.27.1
> Module-init-tools       22
> E2fsprogs               1.42.13
> Xfsprogs                4.3.0
> Linux C Library         2.23
> Dynamic linker (ldd)    2.23
> Linux C++ Library       6.0.21
> Procps                  3.3.10
> Net-tools               1.60
> Kbd                     1.15.5
> Console-tools           1.15.5
> Sh-utils                8.25
> Udev                    229
> Modules Loaded          8250_fintek ablk_helper aesni_intel aes_x86_64 ahci async_memcpy async_pq async_raid6_recov async_tx async_xor autofs4 btrfs configfs coretemp crc32_pclmul crct10dif_pclmul cryptd drm drm_kms_helper e1000 fb_sys_fops fjes gf128mul ghash_clmulni_intel glue_helper hid hid_generic i2c_piix4 ib_addr ib_cm ib_core ib_iser ib_mad ib_sa input_leds iscsi_tcp iw_cm joydev libahci libcrc32c libiscsi libiscsi_tcp linear lrw mac_hid mptbase mptscsih mptspi multipath nfit parport parport_pc pata_acpi ppdev psmouse raid0 raid10 raid1 raid456 raid6_pq rdma_cm scsi_transport_iscsi scsi_transport_spi serio_raw shpchp syscopyarea sysfillrect sysimgblt ttm usbhid vmw_balloon vmwgfx vmw_vmci vmw_vsock_vmci_transport vsock xor
> 
> [8.2.] Processor information (from /proc/cpuinfo):
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 42
> model name      : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
> stepping        : 7
> microcode       : 0x29
> cpu MHz         : 2491.980
> cache size      : 3072 KB
> physical id     : 0
> siblings        : 1
> core id         : 0
> cpu cores       : 1
> apicid          : 0
> initial apicid  : 0
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 13
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm epb tsc_adjust dtherm ida arat pln pts
> bugs            :
> bogomips        : 4983.96
> clflush size    : 64
> cache_alignment : 64
> address sizes   : 42 bits physical, 48 bits virtual
> power management:
> 
> [8.3.] Module information (from /proc/modules):
> ppdev 20480 0 - Live 0x0000000000000000
> vmw_balloon 20480 0 - Live 0x0000000000000000
> vmw_vsock_vmci_transport 28672 1 - Live 0x0000000000000000
> vsock 36864 2 vmw_vsock_vmci_transport, Live 0x0000000000000000
> coretemp 16384 0 - Live 0x0000000000000000
> joydev 20480 0 - Live 0x0000000000000000
> input_leds 16384 0 - Live 0x0000000000000000
> serio_raw 16384 0 - Live 0x0000000000000000
> shpchp 36864 0 - Live 0x0000000000000000
> vmw_vmci 65536 2 vmw_balloon,vmw_vsock_vmci_transport, Live 0x0000000000000000
> i2c_piix4 24576 0 - Live 0x0000000000000000
> nfit 40960 0 - Live 0x0000000000000000
> 8250_fintek 16384 0 - Live 0x0000000000000000
> parport_pc 32768 0 - Live 0x0000000000000000
> parport 49152 2 ppdev,parport_pc, Live 0x0000000000000000
> mac_hid 16384 0 - Live 0x0000000000000000
> ib_iser 49152 0 - Live 0x0000000000000000
> rdma_cm 53248 1 ib_iser, Live 0x0000000000000000
> iw_cm 49152 1 rdma_cm, Live 0x0000000000000000
> ib_cm 45056 1 rdma_cm, Live 0x0000000000000000
> ib_sa 36864 2 rdma_cm,ib_cm, Live 0x0000000000000000
> ib_mad 49152 2 ib_cm,ib_sa, Live 0x0000000000000000
> ib_core 122880 6 ib_iser,rdma_cm,iw_cm,ib_cm,ib_sa,ib_mad, Live 0x0000000000000000
> ib_addr 20480 3 rdma_cm,ib_sa,ib_core, Live 0x0000000000000000
> configfs 40960 2 rdma_cm, Live 0x0000000000000000
> iscsi_tcp 20480 0 - Live 0x0000000000000000
> libiscsi_tcp 24576 1 iscsi_tcp, Live 0x0000000000000000
> libiscsi 53248 3 ib_iser,iscsi_tcp,libiscsi_tcp, Live 0x0000000000000000
> scsi_transport_iscsi 98304 4 ib_iser,iscsi_tcp,libiscsi, Live 0x0000000000000000
> autofs4 40960 2 - Live 0x0000000000000000
> btrfs 1024000 0 - Live 0x0000000000000000
> raid10 49152 0 - Live 0x0000000000000000
> raid456 110592 0 - Live 0x0000000000000000
> async_raid6_recov 20480 1 raid456, Live 0x0000000000000000
> async_memcpy 16384 2 raid456,async_raid6_recov, Live 0x0000000000000000
> async_pq 16384 2 raid456,async_raid6_recov, Live 0x0000000000000000
> async_xor 16384 3 raid456,async_raid6_recov,async_pq, Live 0x0000000000000000
> async_tx 16384 5 raid456,async_raid6_recov,async_memcpy,async_pq,async_xor, Live 0x0000000000000000
> xor 24576 2 btrfs,async_xor, Live 0x0000000000000000
> raid6_pq 102400 4 btrfs,raid456,async_raid6_recov,async_pq, Live 0x0000000000000000
> libcrc32c 16384 1 raid456, Live 0x0000000000000000
> raid1 36864 0 - Live 0x0000000000000000
> raid0 20480 0 - Live 0x0000000000000000
> multipath 16384 0 - Live 0x0000000000000000
> linear 16384 0 - Live 0x0000000000000000
> hid_generic 16384 0 - Live 0x0000000000000000
> usbhid 49152 0 - Live 0x0000000000000000
> hid 122880 2 hid_generic,usbhid, Live 0x0000000000000000
> crct10dif_pclmul 16384 0 - Live 0x0000000000000000
> crc32_pclmul 16384 0 - Live 0x0000000000000000
> ghash_clmulni_intel 16384 0 - Live 0x0000000000000000
> aesni_intel 167936 0 - Live 0x0000000000000000
> aes_x86_64 20480 1 aesni_intel, Live 0x0000000000000000
> lrw 16384 1 aesni_intel, Live 0x0000000000000000
> gf128mul 16384 1 lrw, Live 0x0000000000000000
> glue_helper 16384 1 aesni_intel, Live 0x0000000000000000
> ablk_helper 16384 1 aesni_intel, Live 0x0000000000000000
> cryptd 20480 3 ghash_clmulni_intel,aesni_intel,ablk_helper, Live 0x0000000000000000
> vmwgfx 237568 1 - Live 0x0000000000000000
> ttm 98304 1 vmwgfx, Live 0x0000000000000000
> drm_kms_helper 147456 1 vmwgfx, Live 0x0000000000000000
> syscopyarea 16384 1 drm_kms_helper, Live 0x0000000000000000
> psmouse 131072 0 - Live 0x0000000000000000
> sysfillrect 16384 1 drm_kms_helper, Live 0x0000000000000000
> sysimgblt 16384 1 drm_kms_helper, Live 0x0000000000000000
> fb_sys_fops 16384 1 drm_kms_helper, Live 0x0000000000000000
> drm 364544 4 vmwgfx,ttm,drm_kms_helper, Live 0x0000000000000000
> ahci 36864 2 - Live 0x0000000000000000
> libahci 32768 1 ahci, Live 0x0000000000000000
> e1000 135168 0 - Live 0x0000000000000000
> mptspi 24576 0 - Live 0x0000000000000000
> mptscsih 40960 1 mptspi, Live 0x0000000000000000
> mptbase 102400 2 mptspi,mptscsih, Live 0x0000000000000000
> scsi_transport_spi 32768 1 mptspi, Live 0x0000000000000000
> pata_acpi 16384 0 - Live 0x0000000000000000
> fjes 28672 0 - Live 0x0000000000000000
> 
> [8.6.] SCSI information (from /proc/scsi/scsi)
> Attached devices:
> Host: scsi3 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA      Model: VMware Virtual S Rev: 0001
>   Type:   Direct-Access                    ANSI  SCSI revision: 05
> Host: scsi4 Channel: 00 Id: 00 Lun: 00
>   Vendor: NECVMWar Model: VMware SATA CD01 Rev: 1.00
>   Type:   CD-ROM                           ANSI  SCSI revision: 05
> Host: scsi5 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA      Model: VMware Virtual S Rev: 0001
>   Type:   Direct-Access                    ANSI  SCSI revision: 05
> Host: scsi6 Channel: 00 Id: 00 Lun: 00
>   Vendor: ATA      Model: VMware Virtual S Rev: 0001
>   Type:   Direct-Access                    ANSI  SCSI revision: 05
> 
> Best regards,
> 
> James Johnston
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-bcache" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* RE: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-20  6:59       ` [dm-crypt] " James Johnston
@ 2016-05-20 21:37         ` 'Eric Wheeler'
  -1 siblings, 0 replies; 28+ messages in thread
From: 'Eric Wheeler' @ 2016-05-20 21:37 UTC (permalink / raw)
  To: James Johnston
  Cc: 'Tim Small', 'Kent Overstreet',
	'Alasdair Kergon', 'Mike Snitzer',
	linux-bcache, dm-devel, dm-crypt

On Fri, 20 May 2016, James Johnston wrote:

> > On Mon, 16 May 2016, Tim Small wrote:
> > 
> > > On 08/05/16 19:39, James Johnston wrote:
> > > > I've run into a problem where the bcache writeback cache can't be flushed to
> > > > disk when the backing device is a LUKS / dm-crypt device and the cache set has
> > > > a non-default bucket size.  Basically, only a few megabytes will be flushed to
> > > > disk, and then it gets stuck.  Stuck means that the bcache writeback task
> > > > thrashes the disk by constantly reading hundreds of MB/second from the cache set
> > > > in an infinite loop, while not actually progressing (dirty_data never decreases
> > > > beyond a certain point).
> > >
> > > > [...]
> > >
> > > > The situation is basically unrecoverable as far as I can tell: if you attempt
> > > > to detach the cache set then the cache set disk gets thrashed extra-hard
> > > > forever, and it's impossible to actually get the cache set detached.  The only
> > > > solution seems to be to back up the data and destroy the volume...
> > >
> > > You can boot an older kernel to flush the device without destroying it
> > > (I'm guessing that's because older kernels split down the big requests
> > > which are failing on the 4.4 kernel).  Once flushed you could put the
> > > cache into writethrough mode, or use a smaller bucket size.
> > 
> > Indeed, can someone test 4.1.y and see if the problem persists with a 2M
> > bucket size?  (If someone has already tested 4.1, then apologies as I've
> > not yet seen that report.)
> > 
> > If 4.1 works, then I think a bisect is in order.  Such a bisect would at
> > least highlight the problem and might indicate a (hopefully trivial) fix.
> 
> To help narrow this down, I tested the following generic pre-compiled mainline kernels
> on Ubuntu 15.10:
> 
>  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3.6-wily/
>  * DOES NOT WORK:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc1+cod1-wily/
> 
> I also tried the default & latest distribution-provided 4.2 kernel.  It worked.
> This one also worked:
> 
>  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2.8-wily/
> 
> So it seems to me that it is a regression from 4.3.6 kernel to any 4.4 kernel.  That
> should help save time with bisection...

Below is the patchlist for md and block that might help with a place to 
start.  Are there any other places in the Linux tree where we should watch 
for changes?

I'm wondering if it might be in dm-4.4-changes since this is dm-crypt
related, but it could be ac322de which was quite large.

James or Tim,

Can you try building ac322de?  If that produces the problem, then there 
are only 3 more to try (unless this was actually a problem in 4.3 which 
was fixed in 4.3.y, but hopefully that isn't so). 

ccf21b6 is probably the next to test, to rule out Neil's big md patch, 
which Linus abbreviated in the commit log so it must be quite long.  OTOH, 
if dm-4.4-changes works, then I'm not sure which commit might produce the 
problem, because the more recent commits are not obviously relevant to the 
issue.  

-Eric

]# git log --oneline v4.3~1..v4.4-rc1 drivers/md/ block/ Makefile | egrep -v 'md-cluster|raid5|blk-mq'

 8005c49 Linux 4.4-rc1
 ccc2600 block: fix blk-core.c kernel-doc warning
 c34e6e0 Merge branch 'kbuild' of git://git.kernel.org/pub/scm/linux/kernel/git/mmarek/kbuild
 3419b45 Merge branch 'for-4.4/io-poll' of git://git.kernel.dk/linux-block
 3934bbc Merge tag 'md/4.4-rc0-fix' of git://neil.brown.name/md
 ad804a0 Merge branch 'akpm' (patches from Andrew)
 75021d2 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial
 05229be block: add block polling support
 dece163 block: change ->make_request_fn() and users to return a queue cookie
 8639b46 pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
 71baba4 mm, page_alloc: rename __GFP_WAIT to __GFP_RECLAIM
 d0164ad mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd
 8d090f4 bcache: Really show state of work pending bit
 933425fb Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
 5ebe0ee Merge tag 'docs-for-linus' of git://git.lwn.net/linux
 69234ac Merge branch 'for-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
 e0700ce Merge tag 'dm-4.4-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
 ac322de Merge tag 'md/4.4' of git://neil.brown.name/md
 ccf21b6 Merge branch 'for-4.4/reservations' of git://git.kernel.dk/linux-block
 527d152 Merge branch 'for-4.4/integrity' of git://git.kernel.dk/linux-block
 d9734e0 Merge branch 'for-4.4/core' of git://git.kernel.dk/linux-block
 6a13feb Linux 4.3


--
Eric Wheeler



> 
> James
> 
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread


* RE: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size
  2016-05-20 21:37         ` [dm-crypt] " 'Eric Wheeler'
@ 2016-05-22  4:26           ` James Johnston
  -1 siblings, 0 replies; 28+ messages in thread
From: James Johnston @ 2016-05-22  4:26 UTC (permalink / raw)
  To: 'Eric Wheeler'
  Cc: 'Tim Small', 'Kent Overstreet',
	'Alasdair Kergon', 'Mike Snitzer',
	linux-bcache, dm-devel, dm-crypt, 'Neil Brown',
	linux-raid, 'Mikulas Patocka'

> On Fri, 20 May 2016, James Johnston wrote:
> 
> > > On Mon, 16 May 2016, Tim Small wrote:
> > >
> > > > On 08/05/16 19:39, James Johnston wrote:
> > > > > I've run into a problem where the bcache writeback cache can't be flushed to
> > > > > disk when the backing device is a LUKS / dm-crypt device and the cache set has
> > > > > a non-default bucket size.  Basically, only a few megabytes will be flushed to
> > > > > disk, and then it gets stuck.  Stuck means that the bcache writeback task
> > > > > thrashes the disk by constantly reading hundreds of MB/second from the cache set
> > > > > in an infinite loop, while not actually progressing (dirty_data never decreases
> > > > > beyond a certain point).
> > > >
> > > > > [...]
> > > >
> > > > > The situation is basically unrecoverable as far as I can tell: if you attempt
> > > > > to detach the cache set then the cache set disk gets thrashed extra-hard
> > > > > forever, and it's impossible to actually get the cache set detached.  The only
> > > > > solution seems to be to back up the data and destroy the volume...
> > > >
> > > > You can boot an older kernel to flush the device without destroying it
> > > > (I'm guessing that's because older kernels split down the big requests
> > > > which are failing on the 4.4 kernel).  Once flushed you could put the
> > > > cache into writethrough mode, or use a smaller bucket size.
> > >
> > > Indeed, can someone test 4.1.y and see if the problem persists with a 2M
> > > bucket size?  (If someone has already tested 4.1, then apologies as I've
> > > not yet seen that report.)
> > >
> > > If 4.1 works, then I think a bisect is in order.  Such a bisect would at
> > > least highlight the problem and might indicate a (hopefully trivial) fix.
> >
> > To help narrow this down, I tested the following generic pre-compiled mainline kernels
> > on Ubuntu 15.10:
> >
> >  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3.6-wily/
> >  * DOES NOT WORK:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc1+cod1-wily/
> >
> > I also tried the default & latest distribution-provided 4.2 kernel.  It worked.
> > This one also worked:
> >
> >  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2.8-wily/
> >
> > So it seems to me that it is a regression from 4.3.6 kernel to any 4.4 kernel.  That
> > should help save time with bisection...
> 
> Below is the patchlist for md and block that might help with a place to
> start.  Are there any other places in the Linux tree where we should watch
> for changes?
> 
> I'm wondering if it might be in dm-4.4-changes since this is dm-crypt
> related, but it could be ac322de which was quite large.
> 
> James or Tim,
> 
> Can you try building ac322de?  If that produces the problem, then there
> are only 3 more to try (unless this was actually a problem in 4.3 which
> was fixed in 4.3.y, but hopefully that isn't so).
> 
> ccf21b6 is probably the next to test, to rule out Neil's big md patch,
> which Linus abbreviated in the commit log so it must be quite long.  OTOH,
> if dm-4.4-changes works, then I'm not sure which commit might produce the
> problem, because the more recent commits are not obviously relevant to the
> issue. 

So I decided to go ahead and bisect it today.  The first bad commit is the
one below: the commit prior to it flushed the bcache writeback cache without
incident, while this one does not, so it appears to have introduced this
bcache regression.  (FWIW, ac322de came up during bisection and tested good.)

johnstonj@kernel-build:~/linux$ git bisect bad
dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7 is the first bad commit
commit dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7
Author: Mikulas Patocka <mpatocka@redhat.com>
Date:   Wed Oct 21 16:34:20 2015 -0400

    dm: eliminate unused "bioset" process for each bio-based DM device

    Commit 54efd50bfd873e2dbf784e0b21a8027ba4299a3e ("block: make
    generic_make_request handle arbitrarily sized bios") makes it possible
    for block devices to process large bios.  In doing so that commit
    allocates a new queue->bio_split bioset for each block device, this
    bioset is used for allocating bios when the driver needs to split large
    bios.

    Each bioset allocates a workqueue process, thus the above commit
    increases the number of processes allocated per block device.

    DM doesn't need the queue->bio_split bioset, thus we can deallocate it.
    This reduces the number of allocated processes per bio-based DM device
    from 3 to 2.  Also remove the call to blk_queue_split(), it is not
    needed because DM does its own splitting.

    Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
    Signed-off-by: Mike Snitzer <snitzer@redhat.com>

The patch for this commit is very brief; reproduced here:

diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 9555843..64b50b7 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -1763,8 +1763,6 @@ static void dm_make_request(struct request_queue *q, struct bio *bio)

        map = dm_get_live_table(md, &srcu_idx);

-       blk_queue_split(q, &bio, q->bio_split);
-
        generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0);

        /* if we're suspended, we have to queue this io for later */
@@ -2792,6 +2790,12 @@ int dm_setup_md_queue(struct mapped_device *md)
        case DM_TYPE_BIO_BASED:
                dm_init_old_md_queue(md);
                blk_queue_make_request(md->queue, dm_make_request);
+               /*
+                * DM handles splitting bios as needed.  Free the bio_split bioset
+                * since it won't be used (saves 1 process per bio-based DM device).
+                */
+               bioset_free(md->queue->bio_split);
+               md->queue->bio_split = NULL;
                break;
        }
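
If the hypothesis quoted earlier is right (older kernels "split down the big
requests"), then with blk_queue_split() removed, a bucket-sized bio from
bcache writeback reaches the dm device un-split.  A runnable sketch of the
size comparison involved; the stand-in directory and the 512 KB limit are
assumptions for illustration, not values read from the affected machine (on
a live system QUEUE would be something like /sys/block/dm-0/queue):

```shell
#!/bin/sh
# Compare bcache's bucket size against a device queue's max_sectors_kb.
# QUEUE is a stand-in directory here so the sketch runs anywhere; the
# 512 KB limit is an assumed example value, not a measured one.
BUCKET_KB=2048                        # from make-bcache --bucket 2M
QUEUE="${QUEUE:-$(mktemp -d)}"
[ -f "$QUEUE/max_sectors_kb" ] || echo 512 > "$QUEUE/max_sectors_kb"

limit_kb=$(cat "$QUEUE/max_sectors_kb")
if [ "$BUCKET_KB" -gt "$limit_kb" ]; then
    echo "bucket ($BUCKET_KB KB) exceeds max_sectors_kb ($limit_kb KB): needs splitting"
else
    echo "bucket fits within device limits"
fi
```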

Here is the bisect log:

johnstonj@kernel-build:~/linux$ git bisect log
git bisect start
# good: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3
git bisect good 6a13feb9c82803e2b815eca72fa7a9f5561d7861
# bad: [8005c49d9aea74d382f474ce11afbbc7d7130bec] Linux 4.4-rc1
git bisect bad 8005c49d9aea74d382f474ce11afbbc7d7130bec
# bad: [118c216e16c5ccb028cd03a0dcd56d17a07ff8d7] Merge tag 'staging-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect bad 118c216e16c5ccb028cd03a0dcd56d17a07ff8d7
# good: [e627078a0cbdc0c391efeb5a2c4eb287328fd633] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
git bisect good e627078a0cbdc0c391efeb5a2c4eb287328fd633
# good: [c17c6da659571a115c7b4983da6c6ac464317c34] staging: wilc1000: rename pfScanResult of struct scan_attr
git bisect good c17c6da659571a115c7b4983da6c6ac464317c34
# good: [7bdb7d554e0e433b92b63f3472523cc3067f8ab4] Staging: rtl8192u: ieee80211: corrected indent
git bisect good 7bdb7d554e0e433b92b63f3472523cc3067f8ab4
# good: [ac322de6bf5416cb145b58599297b8be73cd86ac] Merge tag 'md/4.4' of git://neil.brown.name/md
git bisect good ac322de6bf5416cb145b58599297b8be73cd86ac
# good: [a4d8e93c3182a54d8d21a4d1cec6538ae1be9e16] Merge tag 'usb-for-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-next
git bisect good a4d8e93c3182a54d8d21a4d1cec6538ae1be9e16
# good: [4f56f3fdca43c9a18339b6e0c3b1aa2f57f6d0b0] serial: 8250: Tolerate clock variance for max baud rate
git bisect good 4f56f3fdca43c9a18339b6e0c3b1aa2f57f6d0b0
# good: [e052c6d15c61cc4caff2f06cbca72b183da9f15e] tty: Use unbound workqueue for all input workers
git bisect good e052c6d15c61cc4caff2f06cbca72b183da9f15e
# good: [b9ca0c948c921e960006aaf319a29c004917cdf6] uwb: neh: Use setup_timer
git bisect good b9ca0c948c921e960006aaf319a29c004917cdf6
# bad: [aad9ae4550755edc020b5c511a8b54f0104b2f47] dm switch: simplify conditional in alloc_region_table()
git bisect bad aad9ae4550755edc020b5c511a8b54f0104b2f47
# good: [a3d939ae7b5f82688a6d3450f95286eaea338328] dm: convert ffs to __ffs
git bisect good a3d939ae7b5f82688a6d3450f95286eaea338328
# bad: [00272c854ee17b804ce81ef706f611dac17f4f89] dm linear: remove redundant target name from error messages
git bisect bad 00272c854ee17b804ce81ef706f611dac17f4f89
# bad: [4c7da06f5a780bbf44ebd7547789e48536d0a823] dm persistent data: eliminate unnecessary return values
git bisect bad 4c7da06f5a780bbf44ebd7547789e48536d0a823
# bad: [dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7] dm: eliminate unused "bioset" process for each bio-based DM device
git bisect bad dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7
# first bad commit: [dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7] dm: eliminate unused "bioset" process for each bio-based DM device
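
For readers less familiar with the workflow behind the log above, the same
good/bad narrowing can be demonstrated in a throwaway repository (synthetic
commits, not the kernel tree; a `flag` file containing "bug" stands in for
running the reproducer):

```shell
#!/bin/sh
# Toy demonstration of the git-bisect workflow in the log above: a
# five-commit throwaway repo where commit 3 introduces the "bug".
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email bisect@example.com
git config user.name bisect-demo
for i in 1 2 3 4 5; do
    if [ "$i" -ge 3 ]; then echo "bug $i" > flag; else echo "ok $i" > flag; fi
    git add flag
    git commit -qm "commit $i"
done
# HEAD (commit 5) is known bad, HEAD~4 (commit 1) known good
git bisect start HEAD HEAD~4 > /dev/null
while :; do
    # "run the reproducer": here, just inspect the flag file
    if grep -q bug flag; then out=$(git bisect bad); else out=$(git bisect good); fi
    case "$out" in *"is the first bad commit"*) break ;; esac
done
echo "first bad: $(git log -1 --format=%s refs/bisect/bad)"
```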

Commands used for testing:

# Make cache set
make-bcache --bucket 2M -C /dev/sdb
# Set up backing device crypto
cryptsetup luksFormat /dev/sdc
cryptsetup open --type luks /dev/sdc backCrypt
# Make backing device & enable writeback
make-bcache -B /dev/mapper/backCrypt
bcache-super-show /dev/sdb | grep cset.uuid | cut -f 3 > /sys/block/bcache0/bcache/attach
echo writeback > /sys/block/bcache0/bcache/cache_mode

# KILL SEQUENCE

cd /sys/block/bcache0/bcache
echo 0 > sequential_cutoff
# Verify that the cache is attached (i.e. does not say "no cache")
cat state
dd if=/dev/urandom of=/dev/bcache0 bs=1M count=250
cat dirty_data
cat state
# Next line causes severe disk thrashing and failure to flush writeback cache
# on bad commits.
echo 1 > detach
cat dirty_data
cat state
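
A hedged sketch for telling the stuck state apart from a merely slow flush:
sample dirty_data a few times and see whether it ever decreases.  On a live
system SYSFS would be /sys/block/bcache0/bcache (with the sleep uncommented);
here a stand-in directory with a fixed example value keeps the sketch
runnable:

```shell
#!/bin/sh
# Poll bcache's dirty_data; if it never changes across samples, writeback
# is likely stuck in the thrashing loop described above, not just slow.
# SYSFS is a stand-in directory and "25.4M" an assumed example value.
SYSFS="${SYSFS:-$(mktemp -d)}"
[ -f "$SYSFS/dirty_data" ] || echo "25.4M" > "$SYSFS/dirty_data"

prev=""
same=0
for i in 1 2 3 4; do
    cur=$(cat "$SYSFS/dirty_data")
    if [ "$cur" = "$prev" ]; then same=$((same + 1)); else same=0; fi
    prev="$cur"
    # sleep 5   # uncomment on a live system
done
if [ "$same" -ge 3 ]; then
    echo "dirty_data stuck at $cur"
else
    echo "dirty_data still changing ($cur)"
fi
```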

Hope this provides some insight into the problem...

James

^ permalink raw reply related	[flat|nested] 28+ messages in thread

echo 1 > detach
cat dirty_data
cat state

Hope this provides some insight into the problem...

James

* [PATCH] dm-crypt: Fix error with too large bios (was: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size)
  2016-05-22  4:26           ` [dm-crypt] " James Johnston
@ 2016-05-27 14:47             ` Mikulas Patocka
  -1 siblings, 0 replies; 28+ messages in thread
From: Mikulas Patocka @ 2016-05-27 14:47 UTC (permalink / raw)
  To: James Johnston
  Cc: 'Eric Wheeler', 'Tim Small',
	'Kent Overstreet', 'Alasdair Kergon',
	'Mike Snitzer',
	linux-bcache, dm-devel, dm-crypt, 'Neil Brown',
	linux-raid

Hi

Here I'm sending a patch for this bug.

BTW. I found several other bugs in bcache when testing this.

1) make-bcache and the other tools do not perform endian conversion - 
consequently bcache doesn't work on big-endian machines.

2) bcache-tools cannot be compiled with newer gcc because of the inline
keyword. Note that in GNU C, the inline keyword is just a hint that doesn't
change the correctness or behavior of a program. However, under ISO C99
inline semantics the keyword changes the meaning of a program - GCC
recently switched to the ISO semantics by default, so the code no longer
compiles. Here is a patch:

	--- bcache-tools.orig/bcache.c
	+++ bcache-tools/bcache.c
	@@ -115,7 +115,7 @@ static const uint64_t crc_table[256] = {
	        0x9AFCE626CE85B507ULL
	 };

	-inline uint64_t crc64(const void *_data, size_t len)
	+uint64_t crc64(const void *_data, size_t len)
	 {
	        uint64_t crc = 0xFFFFFFFFFFFFFFFFULL;
	        const unsigned char *data = _data;

3) dm-crypt returns large bios with -EIO and bcache responds by attempting 
to submit the bios again and again (which results in the reported loop). 
The patch below fixes dm-crypt to not return errors; however, you should
also fix bcache to handle errors gracefully (i.e. stop using the device on
I/O error, and not submit the bios over and over again).

Mikulas



On Sun, 22 May 2016, James Johnston wrote:

> > On Fri, 20 May 2016, James Johnston wrote:
> > 
> > > > On Mon, 16 May 2016, Tim Small wrote:
> > > >
> > > > > On 08/05/16 19:39, James Johnston wrote:
> > > > > > I've run into a problem where the bcache writeback cache can't be flushed to
> > > > > > disk when the backing device is a LUKS / dm-crypt device and the cache set has
> > > > > > a non-default bucket size.  Basically, only a few megabytes will be flushed to
> > > > > > disk, and then it gets stuck.  Stuck means that the bcache writeback task
> > > > > > thrashes the disk by constantly reading hundreds of MB/second from the cache set
> > > > > > in an infinite loop, while not actually progressing (dirty_data never decreases
> > > > > > beyond a certain point).
> > > > >
> > > > > > [...]
> > > > >
> > > > > > The situation is basically unrecoverable as far as I can tell: if you attempt
> > > > > > to detach the cache set then the cache set disk gets thrashed extra-hard
> > > > > > forever, and it's impossible to actually get the cache set detached.  The only
> > > > > > solution seems to be to back up the data and destroy the volume...
> > > > >
> > > > > You can boot an older kernel to flush the device without destroying it
> > > > > (I'm guessing that's because older kernels split down the big requests
> > > > > which are failing on the 4.4 kernel).  Once flushed you could put the
> > > > > cache into writethrough mode, or use a smaller bucket size.
> > > >
> > > > Indeed, can someone test 4.1.y and see if the problem persists with a 2M
> > > > bucket size?  (If someone has already tested 4.1, then apologies as I've
> > > > not yet seen that report.)
> > > >
> > > > If 4.1 works, then I think a bisect is in order.  Such a bisect would at
> > > > least highlight the problem and might indicate a (hopefully trivial) fix.
> > >
> > > To help narrow this down, I tested the following generic pre-compiled mainline kernels
> > > on Ubuntu 15.10:
> > >
> > >  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3.6-wily/
> > >  * DOES NOT WORK:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc1+cod1-wily/
> > >
> > > I also tried the default & latest distribution-provided 4.2 kernel.  It worked.
> > > This one also worked:
> > >
> > >  * WORKS:  http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2.8-wily/
> > >
> > > So it seems to be a regression between the 4.3.6 kernel and any 4.4
> > > kernel.  That should help save time with bisection...
> > 
> > Below is the patchlist for md and block that might help with a place to
> > start.  Are there any other places in the Linux tree where we should watch
> > for changes?
> > 
> > I'm wondering if it might be in dm-4.4-changes since this is dm-crypt
> > related, but it could be ac322de which was quite large.
> > 
> > James or Tim,
> > 
> > Can you try building ac322de?  If that produces the problem, then there
> > are only 3 more to try (unless this was actually a problem in 4.3 which
> > was fixed in 4.3.y, but hopefully that isn't so).
> > 
> > ccf21b6 is probably the next to test to rule out Neil's big md patch,
> > which Linus abbreviated in the commit log, so it must be quite long.  OTOH,
> > if dm-4.4-changes works, then I'm not sure what commit might produce the
> > problem, because the more recent commits are not obviously relevant to
> > the issue.
> 
> So I decided to go ahead and bisect it today.  Looks like the bad commit is
> this one.  The commit prior flushed the bcache writeback cache without
> incident; this one does not and I guess caused this bcache regression.
> (FWIW ac322de came up during bisection, and tested good.)
> 
> johnstonj@kernel-build:~/linux$ git bisect bad
> dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7 is the first bad commit
> commit dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7
> Author: Mikulas Patocka <mpatocka@redhat.com>
> Date:   Wed Oct 21 16:34:20 2015 -0400
> 
>     dm: eliminate unused "bioset" process for each bio-based DM device
> 
>     Commit 54efd50bfd873e2dbf784e0b21a8027ba4299a3e ("block: make
>     generic_make_request handle arbitrarily sized bios") makes it possible
>     for block devices to process large bios.  In doing so that commit
>     allocates a new queue->bio_split bioset for each block device, this
>     bioset is used for allocating bios when the driver needs to split large
>     bios.
> 
>     Each bioset allocates a workqueue process, thus the above commit
>     increases the number of processes allocated per block device.
> 
>     DM doesn't need the queue->bio_split bioset, thus we can deallocate it.
>     This reduces the number of allocated processes per bio-based DM device
>     from 3 to 2.  Also remove the call to blk_queue_split(), it is not
>     needed because DM does its own splitting.
> 
>     Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
>     Signed-off-by: Mike Snitzer <snitzer@redhat.com>
> 
> The patch for this commit is very brief; reproduced here:
> 
> diff --git a/drivers/md/dm.c b/drivers/md/dm.c
> index 9555843..64b50b7 100644
> --- a/drivers/md/dm.c
> +++ b/drivers/md/dm.c
> @@ -1763,8 +1763,6 @@ static void dm_make_request(struct request_queue *q, struct bio *bio)
> 
>         map = dm_get_live_table(md, &srcu_idx);
> 
> -       blk_queue_split(q, &bio, q->bio_split);
> -
>         generic_start_io_acct(rw, bio_sectors(bio), &dm_disk(md)->part0);
> 
>         /* if we're suspended, we have to queue this io for later */
> @@ -2792,6 +2790,12 @@ int dm_setup_md_queue(struct mapped_device *md)
>         case DM_TYPE_BIO_BASED:
>                 dm_init_old_md_queue(md);
>                 blk_queue_make_request(md->queue, dm_make_request);
> +               /*
> +                * DM handles splitting bios as needed.  Free the bio_split bioset
> +                * since it won't be used (saves 1 process per bio-based DM device).
> +                */
> +               bioset_free(md->queue->bio_split);
> +               md->queue->bio_split = NULL;
>                 break;
>         }
> 
> Here is the bisect log:
> 
> johnstonj@kernel-build:~/linux$ git bisect log
> git bisect start
> # good: [6a13feb9c82803e2b815eca72fa7a9f5561d7861] Linux 4.3
> git bisect good 6a13feb9c82803e2b815eca72fa7a9f5561d7861
> # bad: [8005c49d9aea74d382f474ce11afbbc7d7130bec] Linux 4.4-rc1
> git bisect bad 8005c49d9aea74d382f474ce11afbbc7d7130bec
> # bad: [118c216e16c5ccb028cd03a0dcd56d17a07ff8d7] Merge tag 'staging-4.4-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
> git bisect bad 118c216e16c5ccb028cd03a0dcd56d17a07ff8d7
> # good: [e627078a0cbdc0c391efeb5a2c4eb287328fd633] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
> git bisect good e627078a0cbdc0c391efeb5a2c4eb287328fd633
> # good: [c17c6da659571a115c7b4983da6c6ac464317c34] staging: wilc1000: rename pfScanResult of struct scan_attr
> git bisect good c17c6da659571a115c7b4983da6c6ac464317c34
> # good: [7bdb7d554e0e433b92b63f3472523cc3067f8ab4] Staging: rtl8192u: ieee80211: corrected indent
> git bisect good 7bdb7d554e0e433b92b63f3472523cc3067f8ab4
> # good: [ac322de6bf5416cb145b58599297b8be73cd86ac] Merge tag 'md/4.4' of git://neil.brown.name/md
> git bisect good ac322de6bf5416cb145b58599297b8be73cd86ac
> # good: [a4d8e93c3182a54d8d21a4d1cec6538ae1be9e16] Merge tag 'usb-for-v4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/balbi/usb into usb-next
> git bisect good a4d8e93c3182a54d8d21a4d1cec6538ae1be9e16
> # good: [4f56f3fdca43c9a18339b6e0c3b1aa2f57f6d0b0] serial: 8250: Tolerate clock variance for max baud rate
> git bisect good 4f56f3fdca43c9a18339b6e0c3b1aa2f57f6d0b0
> # good: [e052c6d15c61cc4caff2f06cbca72b183da9f15e] tty: Use unbound workqueue for all input workers
> git bisect good e052c6d15c61cc4caff2f06cbca72b183da9f15e
> # good: [b9ca0c948c921e960006aaf319a29c004917cdf6] uwb: neh: Use setup_timer
> git bisect good b9ca0c948c921e960006aaf319a29c004917cdf6
> # bad: [aad9ae4550755edc020b5c511a8b54f0104b2f47] dm switch: simplify conditional in alloc_region_table()
> git bisect bad aad9ae4550755edc020b5c511a8b54f0104b2f47
> # good: [a3d939ae7b5f82688a6d3450f95286eaea338328] dm: convert ffs to __ffs
> git bisect good a3d939ae7b5f82688a6d3450f95286eaea338328
> # bad: [00272c854ee17b804ce81ef706f611dac17f4f89] dm linear: remove redundant target name from error messages
> git bisect bad 00272c854ee17b804ce81ef706f611dac17f4f89
> # bad: [4c7da06f5a780bbf44ebd7547789e48536d0a823] dm persistent data: eliminate unnecessary return values
> git bisect bad 4c7da06f5a780bbf44ebd7547789e48536d0a823
> # bad: [dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7] dm: eliminate unused "bioset" process for each bio-based DM device
> git bisect bad dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7
> # first bad commit: [dbba42d8a9ebddcc1c1412e8457f79f3cb6ef6e7] dm: eliminate unused "bioset" process for each bio-based DM device
> 
> Commands used for testing:
> 
> # Make cache set
> make-bcache --bucket 2M -C /dev/sdb
> # Set up backing device crypto
> cryptsetup luksFormat /dev/sdc
> cryptsetup open --type luks /dev/sdc backCrypt
> # Make backing device & enable writeback
> make-bcache -B /dev/mapper/backCrypt
> bcache-super-show /dev/sdb | grep cset.uuid | cut -f 3 > /sys/block/bcache0/bcache/attach
> echo writeback > /sys/block/bcache0/bcache/cache_mode
> 
> # KILL SEQUENCE
> 
> cd /sys/block/bcache0/bcache
> echo 0 > sequential_cutoff
> # Verify that the cache is attached (i.e. does not say "no cache")
> cat state
> dd if=/dev/urandom of=/dev/bcache0 bs=1M count=250
> cat dirty_data
> cat state
> # Next line causes severe disk thrashing and failure to flush writeback cache
> # on bad commits.
> echo 1 > detach
> cat dirty_data
> cat state
> 
> Hope this provides some insight into the problem...
> 
> James

dm-crypt: Fix error with too large bios

When dm-crypt processes writes, it allocates a new bio in the function
crypt_alloc_buffer. The bio is allocated from a bio set and can have at
most BIO_MAX_PAGES vector entries; however, the incoming bio can be larger
if it was allocated by other means. For example, bcache creates bios
larger than BIO_MAX_PAGES. If the incoming bio is larger, bio_alloc_bioset
fails and an error is returned.

To avoid the error, crypt_map tests for a too-large bio and calls
dm_accept_partial_bio to split it. dm_accept_partial_bio trims the current
bio to the desired size and requests that the device mapper core send
another bio with the rest of the data.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: stable@vger.kernel.org	# v3.16+

Index: linux-4.6/drivers/md/dm-crypt.c
===================================================================
--- linux-4.6.orig/drivers/md/dm-crypt.c
+++ linux-4.6/drivers/md/dm-crypt.c
@@ -2137,6 +2137,10 @@ static int crypt_map(struct dm_target *t
 	struct dm_crypt_io *io;
 	struct crypt_config *cc = ti->private;
 
+	if (unlikely(bio->bi_iter.bi_size > BIO_MAX_SIZE) &&
+	    (bio->bi_rw & (REQ_FLUSH | REQ_DISCARD | REQ_WRITE)) == REQ_WRITE)
+		dm_accept_partial_bio(bio, BIO_MAX_SIZE >> SECTOR_SHIFT);
+
 	/*
 	 * If bio is REQ_FLUSH or REQ_DISCARD, just bypass crypt queues.
 	 * - for REQ_FLUSH device-mapper core ensures that no IO is in-flight

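The arithmetic behind the check above is simple; a sketch with assumed constants (BIO_MAX_PAGES = 256 and PAGE_SIZE = 4096 are typical on x86, giving BIO_MAX_SIZE = 1 MiB; SECTOR_SHIFT = 9 for 512-byte sectors):

```python
# Sketch of the size check in the patch, using assumed constants
# (BIO_MAX_PAGES = 256, PAGE_SIZE = 4096 -> BIO_MAX_SIZE = 1 MiB;
# SECTOR_SHIFT = 9 for 512-byte sectors).  Illustrative only.
PAGE_SIZE = 4096
BIO_MAX_PAGES = 256
BIO_MAX_SIZE = BIO_MAX_PAGES * PAGE_SIZE   # 1 MiB
SECTOR_SHIFT = 9

def accepted_sectors(bi_size, is_plain_write=True):
    """How many sectors crypt_map would accept from an incoming bio."""
    if bi_size > BIO_MAX_SIZE and is_plain_write:
        # Trim to BIO_MAX_SIZE; the DM core resends the remainder.
        return BIO_MAX_SIZE >> SECTOR_SHIFT
    # Small enough (or flush/discard): take the bio whole.
    return bi_size >> SECTOR_SHIFT

print(accepted_sectors(2 * 1024 * 1024))  # 2 MiB bucket-sized write -> 2048
print(accepted_sectors(512 * 1024))       # 512 KiB write fits -> 1024
```

So a 2 MiB bio from a 2 MiB-bucket cache set is accepted 1 MiB (2048 sectors) at a time instead of being rejected outright.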
* RE: [PATCH] dm-crypt: Fix error with too large bios (was: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size)
  2016-05-27 14:47             ` [dm-crypt] " Mikulas Patocka
@ 2016-06-01  4:19               ` James Johnston
  -1 siblings, 0 replies; 28+ messages in thread
From: James Johnston @ 2016-06-01  4:19 UTC (permalink / raw)
  To: 'Mikulas Patocka'
  Cc: 'Eric Wheeler', 'Tim Small',
	'Kent Overstreet', 'Alasdair Kergon',
	'Mike Snitzer',
	linux-bcache, dm-devel, dm-crypt, 'Neil Brown',
	linux-raid

On Fri, 27 May 2016, Mikulas Patocka wrote:
> dm-crypt: Fix error with too large bios
> 
> When dm-crypt processes writes, it allocates a new bio in the function
> crypt_alloc_buffer. The bio is allocated from a bio set and can have at
> most BIO_MAX_PAGES vector entries; however, the incoming bio can be larger
> if it was allocated by other means. For example, bcache creates bios
> larger than BIO_MAX_PAGES. If the incoming bio is larger, bio_alloc_bioset
> fails and an error is returned.
> 
> To avoid the error, crypt_map tests for a too-large bio and calls
> dm_accept_partial_bio to split it. dm_accept_partial_bio trims the current
> bio to the desired size and requests that the device mapper core send
> another bio with the rest of the data.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org	# v3.16+

Tested-by: James Johnston <johnstonj.public@codenest.com>

I tested this patch by:

1.  Building v4.7-rc1 from Torvalds git repo.  Confirmed that original bug
    still occurs on Ubuntu 15.10.

2.  Applying your patch to v4.7-rc1.  My kill sequence no longer triggers
    the bug: the writeback cache is now successfully flushed to disk, and
    the cache can be detached from the backing device.

3.  To check data integrity, copied 250 MB of /dev/urandom to some file
    on main volume.  Then, dd copy this file to /dev/bcache0.  Then,
    detached the cache device from the backing device.  Then, rebooted.
    Then, dd copy /dev/bcache0 to another file on main volume.  Then,
    diff the files and confirm no changes.

So it looks like it works, based on this admittedly brief testing.  Thanks!
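The integrity check in step 3 follows a generic round-trip pattern; here is a sketch with temporary files standing in for the block devices (all paths and names are illustrative, not taken from the report):

```python
# Round-trip integrity check as in step 3, sketched with temporary files
# standing in for /dev/bcache0 (names here are illustrative only).
import filecmp
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
try:
    # 1. Reference blob of random data (250 MB in the report; 1 MB here).
    reference = os.path.join(workdir, "reference")
    with open(reference, "wb") as f:
        f.write(os.urandom(1024 * 1024))

    # 2. Write it through the "device" (a plain file in this sketch),
    #    then read it back, as with dd to and from /dev/bcache0.
    device = os.path.join(workdir, "device")
    shutil.copyfile(reference, device)
    readback = os.path.join(workdir, "readback")
    shutil.copyfile(device, readback)

    # 3. Byte-for-byte comparison, like diff-ing the two files.
    ok = filecmp.cmp(reference, readback, shallow=False)
    print("integrity OK" if ok else "integrity FAILED")
finally:
    shutil.rmtree(workdir)
```

With real devices the reboot between the write and the read-back is the important part: it forces the read to come from the detached backing device rather than from any cache.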

Best regards,

James Johnston

* Re: [dm-crypt] [PATCH] dm-crypt: Fix error with too large bios (was: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size)
@ 2016-06-01  4:19               ` James Johnston
  0 siblings, 0 replies; 28+ messages in thread
From: James Johnston @ 2016-06-01  4:19 UTC (permalink / raw)
  To: 'Mikulas Patocka'
  Cc: 'Eric Wheeler', 'Tim Small',
	'Kent Overstreet', 'Alasdair Kergon',
	'Mike Snitzer',
	linux-bcache, dm-devel, dm-crypt, 'Neil Brown',
	linux-raid

On Fri, 27 May 2016, Mikulas Patocka wrote:
> dm-crypt: Fix error with too large bios
> 
> When dm-crypt processes writes, it allocates a new bio in the function
> crypt_alloc_buffer. The bio is allocated from a bio set and can have at
> most BIO_MAX_PAGES vector entries; however, the incoming bio can be
> larger if it was allocated by other means. For example, bcache creates
> bios larger than BIO_MAX_PAGES. If the incoming bio is larger,
> bio_alloc_bioset fails and an error is returned.
> 
> To avoid the error, we test for a too-large bio in the function
> crypt_map and use dm_accept_partial_bio to split it.
> dm_accept_partial_bio trims the current bio to the desired size and
> requests that the device mapper core send another bio with the rest of
> the data.
> 
> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
> Cc: stable@vger.kernel.org	# v3.16+

Tested-by: James Johnston <johnstonj.public@codenest.com>

I tested this patch by:

1.  Built v4.7-rc1 from the Torvalds git repo and confirmed that the
    original bug still occurs on Ubuntu 15.10.

2.  Applied your patch to v4.7-rc1.  My kill sequence no longer
    reproduces the problem: the writeback cache is now successfully
    flushed to disk, and the cache can be detached from the backing
    device.

3.  To check data integrity, copied 250 MB from /dev/urandom to a file
    on the main volume, then dd'd that file to /dev/bcache0.  Then
    detached the cache device from the backing device and rebooted.
    Then dd'd /dev/bcache0 back to another file on the main volume and
    confirmed with diff that the two files are identical.

So it looks like it works, based on this admittedly brief testing.  Thanks!

Best regards,

James Johnston


end of thread, other threads:[~2016-06-01  4:19 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-05-08 18:39 bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size James Johnston
2016-05-08 18:39 ` [dm-crypt] " James Johnston
2016-05-11  1:38 ` Eric Wheeler
2016-05-11  1:38   ` [dm-crypt] " Eric Wheeler
2016-05-15  9:08   ` Tim Small
2016-05-16 13:02     ` Tim Small
2016-05-16 13:02       ` [dm-crypt] " Tim Small
2016-05-16 13:53       ` Tim Small
2016-05-16 13:53         ` [dm-crypt] " Tim Small
2016-05-19 23:15       ` Eric Wheeler
2016-05-19 23:15         ` [dm-crypt] " Eric Wheeler
2016-05-18 17:01   ` [dm-devel] " James Johnston
2016-05-18 17:01     ` [dm-crypt] " James Johnston
2016-05-16 16:08 ` Tim Small
2016-05-16 16:08   ` [dm-crypt] " Tim Small
2016-05-19 23:22   ` Eric Wheeler
2016-05-19 23:22     ` [dm-crypt] " Eric Wheeler
2016-05-20  6:59     ` James Johnston
2016-05-20  6:59       ` [dm-crypt] " James Johnston
2016-05-20 21:37       ` 'Eric Wheeler'
2016-05-20 21:37         ` [dm-crypt] " 'Eric Wheeler'
2016-05-22  4:26         ` James Johnston
2016-05-22  4:26           ` [dm-crypt] " James Johnston
2016-05-27 14:47           ` [PATCH] dm-crypt: Fix error with too large bios (was: bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size) Mikulas Patocka
2016-05-27 14:47             ` [dm-crypt] " Mikulas Patocka
2016-06-01  4:19             ` James Johnston
2016-06-01  4:19               ` [dm-crypt] " James Johnston
2016-05-20 20:22 ` bcache gets stuck flushing writeback cache when used in combination with LUKS/dm-crypt and non-default bucket size Eric Wheeler
