* Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
@ 2018-07-23 12:29 Tino Lehnig
  2018-07-24  1:03 ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-23 12:29 UTC (permalink / raw)
  To: minchan, ngupta; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7164 bytes --]

Hello,

After enabling the writeback feature in zram, I encountered the kernel 
bug below with heavy swap utilization. There is one specific workload 
that triggers the bug reliably and that is running Windows in KVM while 
overcommitting memory. The Windows VMs would fill all allocated memory 
with zero pages while booting. A few seconds after the host hits zram 
swap, the console on the host is flooded with the bug message. A few 
more seconds later I also encountered filesystem errors on the host 
causing the root filesystem to be mounted read-only. The filesystem 
errors do not occur when leaving RAM available for the host OS by 
limiting physical memory of the QEMU processes via cgroups.

I started three KVM instances with the following commands in my tests. 
Any Windows ISO or disk image can be used. Fewer instances and less 
allocated memory will also trigger the bug, as long as swapping occurs. 
The type of writeback device does not seem to matter. I have tried a 
SATA SSD and an NVMe Optane drive so far. My test machine has 256 GB of 
RAM and one CPU. I saw the same behavior on another machine with two 
CPUs and 128 GB of RAM.

The bug does not occur when using zram as swap without "backing_dev" 
being set, but I had even more severe problems when running the same 
test on Ubuntu kernels 4.15 and 4.17. Regardless of whether the 
writeback feature was used, the host would eventually lock up entirely 
when swap was in use on zram. The lockups may not be directly related 
to zram, though, and were apparently fixed in 4.18. I had absolutely no 
problems on Ubuntu kernel 4.13 either, before the writeback feature was 
introduced.

Thank you for your attention.

--

commands used:

modprobe zram
echo 1 > /sys/block/zram0/reset                    # clear any previous device state
echo lz4 > /sys/block/zram0/comp_algorithm         # select compressor
echo /dev/nvme0n1 > /sys/block/zram0/backing_dev   # writeback device (must be set before disksize)
echo 256G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0

kvm -nographic -smp 20 -m 131072 -cdrom winpe.iso

--

log message:

BUG: Bad page state in process qemu-system-x86  pfn:3dfab21
page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1
flags: 0x17fffc000000008(uptodate)
raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
bad because of flags: 0x8(uptodate)
Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl sb_edac 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel bin
fmt_misc pcbc aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt 
glue_helper iTCO_vendor_support intel_cstate lpc_ich mei_me intel_uncore 
intel_rapl_perf pcspkr joydev sg mfd_core ioatdma mei wmi evdev ipmi_si 
ipmi_devintf ipmi_msghandler
acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4 
crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod 
xhci_pci ehci_pci ahci libahci xhci_hcd ehci_hcd libata igb i2c_algo_bit 
crc32c_intel scsi_mod i2c_i8
01 dca usbcore
CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G    B 
4.18.0-rc5+ #1
Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
Call Trace:
  dump_stack+0x5c/0x7b
  bad_page+0xba/0x120
  get_page_from_freelist+0x1016/0x1250
  __alloc_pages_nodemask+0xfa/0x250
  alloc_pages_vma+0x7c/0x1c0
  do_swap_page+0x347/0x920
  ? __update_load_avg_se.isra.38+0x1eb/0x1f0
  ? cpumask_next_wrap+0x3d/0x60
  __handle_mm_fault+0x7b4/0x1110
  ? update_load_avg+0x5ea/0x720
  handle_mm_fault+0xfc/0x1f0
  __get_user_pages+0x12f/0x690
  get_user_pages_unlocked+0x148/0x1f0
  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
  try_async_pf+0x87/0x230 [kvm]
  tdp_page_fault+0x132/0x290 [kvm]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  kvm_mmu_page_fault+0x74/0x570 [kvm]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
  ? __update_load_avg_se.isra.38+0x1eb/0x1f0
  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
  ? __switch_to+0x395/0x450
  ? __switch_to+0x395/0x450
  do_vfs_ioctl+0xa2/0x630
  ? __schedule+0x3fd/0x890
  ksys_ioctl+0x70/0x80
  ? exit_to_usermode_loop+0xca/0xf0
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x55/0x100
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb30361add7
Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff 
ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007fb2e97f98b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fb30361add7
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
RBP: 00005652b984e0f0 R08: 00005652b7d513d0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb308c66000 R14: 0000000000000000 R15: 00005652b984e0f0

--

ver_linux: Debian 9.5 with Kernel 4.18.0-rc5+

GNU C               	6.3.0
GNU Make            	4.1
Binutils            	2.28
Util-linux          	2.29.2
Mount               	2.29.2
Module-init-tools   	23
E2fsprogs           	1.43.4
Linux C Library     	2.24
Dynamic linker (ldd)	2.24
Linux C++ Library   	6.0.22
Procps              	3.3.12
Kbd                 	2.0.3
Console-tools       	2.0.3
Sh-utils            	8.26
Udev                	232

--

cpuinfo:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
stepping	: 1
microcode	: 0xb000021
cpu MHz		: 1200.632
cache size	: 25600 KB
physical id	: 0
siblings	: 20
core id		: 0
cpu cores	: 10
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand 
lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single 
pti intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust 
bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap 
intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local 
dtherm ida arat pln pts
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 4400.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

--
Kind regards,

Tino Lehnig

[-- Attachment #2: config-4.18.0-rc5+.gz --]
[-- Type: application/gzip, Size: 40628 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-23 12:29 Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process Tino Lehnig
@ 2018-07-24  1:03 ` Minchan Kim
  2018-07-24  2:53   ` Sergey Senozhatsky
  2018-07-24  7:30   ` Tino Lehnig
  0 siblings, 2 replies; 30+ messages in thread
From: Minchan Kim @ 2018-07-24  1:03 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi Tino,

Thanks for the report.

On Mon, Jul 23, 2018 at 02:29:32PM +0200, Tino Lehnig wrote:
> Hello,
> 
> after enabling the writeback feature in zram, I encountered the kernel bug
> below with heavy swap utilization. There is one specific workload that
> triggers the bug reliably and that is running Windows in KVM while
> overcommitting memory. The Windows VMs would fill all allocated memory with
> zero pages while booting. A few seconds after the host hits zram swap, the
> console on the host is flooded with the bug message. A few more seconds
> later I also encountered filesystem errors on the host causing the root
> filesystem to be mounted read-only. The filesystem errors do not occur when
> leaving RAM available for the host OS by limiting physical memory of the
> QEMU processes via cgroups.
> 
> I started three KVM instances with the following commands in my tests. Any
> Windows ISO or disk image can be used. Less instances and smaller allocated
> memory will also trigger the bug as long as swapping occurs. The type of
> writeback device does not seem to matter. I have tried a SATA SSD and an
> NVMe Optane drive so far. My test machine has 256 GB of RAM and one CPU. I
> saw the same behavior on another machine with two CPUs and 128 GB of RAM.
> 
> The bug does not occur when using zram as swap without "backing_dev" being
> set, but I had even more severe problems when running the same test on
> Ubuntu Kernels 4.15 and 4.17. Regardless of the writeback feature being used
> or not, the host would eventually lock up entirely when swap is in use on
> zram. The lockups may not be related directly to zram though and were
> apparently fixed in 4.18. I had absolutely no problems on Ubuntu Kernel 4.13
> either, before the writeback feature was introduced.

We haven't released v4.18 yet. Could you tell us which kernel tree and
version you used?

I don't have enough time to dig into this right now.

Sergey, I would really appreciate it if you could find time to look into
this. If you are not available, I will try to look at it as soon as
possible. No worries.

Thanks.


> 
> Thank you for your attention.
> 
> --
> 
> commands used:
> 
> modprobe zram
> echo 1 > /sys/block/zram0/reset
> echo lz4 > /sys/block/zram0/comp_algorithm
> echo /dev/nvme0n1 > /sys/block/zram0/backing_dev
> echo 256G > /sys/block/zram0/disksize
> mkswap /dev/zram0
> swapon /dev/zram0
> 
> kvm -nographic -smp 20 -m 131072 -cdrom winpe.iso
> 
> --
> 
> log message:
> 
> BUG: Bad page state in process qemu-system-x86  pfn:3dfab21
> page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1
> flags: 0x17fffc000000008(uptodate)
> raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000
> raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> bad because of flags: 0x8(uptodate)
> Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel bin
> fmt_misc pcbc aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper
> iTCO_vendor_support intel_cstate lpc_ich mei_me intel_uncore intel_rapl_perf
> pcspkr joydev sg mfd_core ioatdma mei wmi evdev ipmi_si ipmi_devintf
> ipmi_msghandler
> acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4
> crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod
> xhci_pci ehci_pci ahci libahci xhci_hcd ehci_hcd libata igb i2c_algo_bit
> crc32c_intel scsi_mod i2c_i8
> 01 dca usbcore
> CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G    B 4.18.0-rc5+ #1
> Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
> Call Trace:
>  dump_stack+0x5c/0x7b
>  bad_page+0xba/0x120
>  get_page_from_freelist+0x1016/0x1250
>  __alloc_pages_nodemask+0xfa/0x250
>  alloc_pages_vma+0x7c/0x1c0
>  do_swap_page+0x347/0x920
>  ? __update_load_avg_se.isra.38+0x1eb/0x1f0
>  ? cpumask_next_wrap+0x3d/0x60
>  __handle_mm_fault+0x7b4/0x1110
>  ? update_load_avg+0x5ea/0x720
>  handle_mm_fault+0xfc/0x1f0
>  __get_user_pages+0x12f/0x690
>  get_user_pages_unlocked+0x148/0x1f0
>  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
>  try_async_pf+0x87/0x230 [kvm]
>  tdp_page_fault+0x132/0x290 [kvm]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  kvm_mmu_page_fault+0x74/0x570 [kvm]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
>  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
>  ? __update_load_avg_se.isra.38+0x1eb/0x1f0
>  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
>  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
>  ? __switch_to+0x395/0x450
>  ? __switch_to+0x395/0x450
>  do_vfs_ioctl+0xa2/0x630
>  ? __schedule+0x3fd/0x890
>  ksys_ioctl+0x70/0x80
>  ? exit_to_usermode_loop+0xca/0xf0
>  __x64_sys_ioctl+0x16/0x20
>  do_syscall_64+0x55/0x100
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fb30361add7
> Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff
> ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff
> 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
> RSP: 002b:00007fb2e97f98b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fb30361add7
> RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
> RBP: 00005652b984e0f0 R08: 00005652b7d513d0 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007fb308c66000 R14: 0000000000000000 R15: 00005652b984e0f0
> 
> --
> 
> ver_linux: Debian 9.5 with Kernel 4.18.0-rc5+
> 
> GNU C               	6.3.0
> GNU Make            	4.1
> Binutils            	2.28
> Util-linux          	2.29.2
> Mount               	2.29.2
> Module-init-tools   	23
> E2fsprogs           	1.43.4
> Linux C Library     	2.24
> Dynamic linker (ldd)	2.24
> Linux C++ Library   	6.0.22
> Procps              	3.3.12
> Kbd                 	2.0.3
> Console-tools       	2.0.3
> Sh-utils            	8.26
> Udev                	232
> 
> --
> 
> cpuinfo:
> 
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 79
> model name	: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> stepping	: 1
> microcode	: 0xb000021
> cpu MHz		: 1200.632
> cache size	: 25600 KB
> physical id	: 0
> siblings	: 20
> core id		: 0
> cpu cores	: 10
> apicid		: 0
> initial apicid	: 0
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 20
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
> tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
> cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin tpr_shadow vnmi
> flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms
> invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc
> cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
> bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
> bogomips	: 4400.00
> clflush size	: 64
> cache_alignment	: 64
> address sizes	: 46 bits physical, 48 bits virtual
> power management:
> 
> --
> Kind regards,
> 
> Tino Lehnig




* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  1:03 ` Minchan Kim
@ 2018-07-24  2:53   ` Sergey Senozhatsky
  2018-07-24  6:47     ` Minchan Kim
  2018-07-24  7:30   ` Tino Lehnig
  1 sibling, 1 reply; 30+ messages in thread
From: Sergey Senozhatsky @ 2018-07-24  2:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On (07/24/18 10:03), Minchan Kim wrote:
> On Mon, Jul 23, 2018 at 02:29:32PM +0200, Tino Lehnig wrote:
> > Hello,
> > 
> > after enabling the writeback feature in zram, I encountered the kernel bug
> > below with heavy swap utilization. There is one specific workload that
> > triggers the bug reliably and that is running Windows in KVM while
> > overcommitting memory. The Windows VMs would fill all allocated memory with
> > zero pages while booting. A few seconds after the host hits zram swap, the
> > console on the host is flooded with the bug message. A few more seconds
> > later I also encountered filesystem errors on the host causing the root
> > filesystem to be mounted read-only. The filesystem errors do not occur when
> > leaving RAM available for the host OS by limiting physical memory of the
> > QEMU processes via cgroups.
> > 
> > I started three KVM instances with the following commands in my tests. Any
> > Windows ISO or disk image can be used. Less instances and smaller allocated
> > memory will also trigger the bug as long as swapping occurs. The type of
> > writeback device does not seem to matter. I have tried a SATA SSD and an
> > NVMe Optane drive so far. My test machine has 256 GB of RAM and one CPU. I
> > saw the same behavior on another machine with two CPUs and 128 GB of RAM.
> > 
> > The bug does not occur when using zram as swap without "backing_dev" being
> > set, but I had even more severe problems when running the same test on
> > Ubuntu Kernels 4.15 and 4.17. Regardless of the writeback feature being used
> > or not, the host would eventually lock up entirely when swap is in use on
> > zram. The lockups may not be related directly to zram though and were
> > apparently fixed in 4.18. I had absolutely no problems on Ubuntu Kernel 4.13
> > either, before the writeback feature was introduced.
> 
> We didn't release v4.18 yet. Could you say what kernel tree/what version
> you used?
> 
> Now I don't have enough time to dig in.
> 
> Sergey, I really appreciate if you could have availabe time to look into.
> Anyway, I could try to see it asap if Sergey is not available.
> No worry.

Interesting case.

Will take me several days to get to it.
Sorry, quite busy at the moment.

	-ss


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  2:53   ` Sergey Senozhatsky
@ 2018-07-24  6:47     ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2018-07-24  6:47 UTC (permalink / raw)
  To: Sergey Senozhatsky, mawilcox
  Cc: Tino Lehnig, ngupta, linux-kernel, Andrew Morton

On Tue, Jul 24, 2018 at 11:53:30AM +0900, Sergey Senozhatsky wrote:
> On (07/24/18 10:03), Minchan Kim wrote:
> > On Mon, Jul 23, 2018 at 02:29:32PM +0200, Tino Lehnig wrote:
> > > Hello,
> > > 
> > > after enabling the writeback feature in zram, I encountered the kernel bug
> > > below with heavy swap utilization. There is one specific workload that
> > > triggers the bug reliably and that is running Windows in KVM while
> > > overcommitting memory. The Windows VMs would fill all allocated memory with
> > > zero pages while booting. A few seconds after the host hits zram swap, the
> > > console on the host is flooded with the bug message. A few more seconds
> > > later I also encountered filesystem errors on the host causing the root
> > > filesystem to be mounted read-only. The filesystem errors do not occur when
> > > leaving RAM available for the host OS by limiting physical memory of the
> > > QEMU processes via cgroups.
> > > 
> > > I started three KVM instances with the following commands in my tests. Any
> > > Windows ISO or disk image can be used. Less instances and smaller allocated
> > > memory will also trigger the bug as long as swapping occurs. The type of
> > > writeback device does not seem to matter. I have tried a SATA SSD and an
> > > NVMe Optane drive so far. My test machine has 256 GB of RAM and one CPU. I
> > > saw the same behavior on another machine with two CPUs and 128 GB of RAM.
> > > 
> > > The bug does not occur when using zram as swap without "backing_dev" being
> > > set, but I had even more severe problems when running the same test on
> > > Ubuntu Kernels 4.15 and 4.17. Regardless of the writeback feature being used
> > > or not, the host would eventually lock up entirely when swap is in use on
> > > zram. The lockups may not be related directly to zram though and were
> > > apparently fixed in 4.18. I had absolutely no problems on Ubuntu Kernel 4.13
> > > either, before the writeback feature was introduced.
> > 
> > We didn't release v4.18 yet. Could you say what kernel tree/what version
> > you used?
> > 
> > Now I don't have enough time to dig in.
> > 
> > Sergey, I really appreciate if you could have availabe time to look into.
> > Anyway, I could try to see it asap if Sergey is not available.
> > No worry.
> 
> Interesting case.
> 
> Will take me several days to get to it.
> Sorry, quite busy at the moment.

I looked through v4.18-rcX quickly to see what changed for zsmalloc.
The struct page fields were totally reordered, so it could be a
regression related to that. Ccing Matthew.


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  1:03 ` Minchan Kim
  2018-07-24  2:53   ` Sergey Senozhatsky
@ 2018-07-24  7:30   ` Tino Lehnig
  2018-07-25  1:32     ` Minchan Kim
  2018-07-25 13:21     ` Minchan Kim
  1 sibling, 2 replies; 30+ messages in thread
From: Tino Lehnig @ 2018-07-24  7:30 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi,

The first build I used was from the master branch of the mainline 
kernel, somewhere between rc5 and rc6. I have just reproduced the bug 
with 4.17.9 and 4.18-rc6. Kernel messages below.

The bug does not appear on 4.14.57. I can test more versions if it helps.

On 07/24/2018 03:03 AM, Minchan Kim wrote:
> We didn't release v4.18 yet. Could you say what kernel tree/what version
> you used?

--

[  804.485321] BUG: Bad page state in process qemu-system-x86  pfn:1c4b08e
[  804.485403] page:ffffe809312c2380 count:0 mapcount:0 
mapping:0000000000000000 index:0x1
[  804.485483] flags: 0x17fffc000000008(uptodate)
[  804.485554] raw: 017fffc000000008 0000000000000000 0000000000000001 
00000000ffffffff
[  804.485632] raw: dead000000000100 dead000000000200 0000000000000000 
0000000000000000
[  804.485709] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[  804.485782] bad because of flags: 0x8(uptodate)
[  804.485852] Modules linked in: lz4 lz4_compress zram zsmalloc 
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcb
c aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper 
iTCO_vendor_support intel_cstate binfmt_misc intel_uncore 
intel_rapl_perf pcspkr mei_me lpc_ich joydev sg mfd_core mei ioatdma 
shpchp wmi evdev ipmi_si ipmi_devintf ipmi_msgh
andler acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4 
crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod 
ahci libahci xhci_pci ehci_pci libata igb xhci_hcd ehci_hcd crc32c_intel 
i2c_algo_bit scsi_mod
  i2c_i801 dca usbcore
[  804.485890] CPU: 17 PID: 1165 Comm: qemu-system-x86 Not tainted 4.17.9 #1
[  804.485891] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 
2.0b 05/02/2017
[  804.485891] Call Trace:
[  804.485899]  dump_stack+0x5c/0x7b
[  804.485902]  bad_page+0xba/0x120
[  804.485905]  get_page_from_freelist+0x1016/0x1250
[  804.485908]  __alloc_pages_nodemask+0xfa/0x250
[  804.485911]  alloc_pages_vma+0x7c/0x1c0
[  804.485915]  __handle_mm_fault+0xcf6/0x1110
[  804.485918]  handle_mm_fault+0xfc/0x1f0
[  804.485921]  __get_user_pages+0x12f/0x670
[  804.485923]  get_user_pages_unlocked+0x148/0x1f0
[  804.485945]  __gfn_to_pfn_memslot+0xff/0x390 [kvm]
[  804.485959]  try_async_pf+0x67/0x200 [kvm]
[  804.485971]  tdp_page_fault+0x132/0x290 [kvm]
[  804.485975]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  804.485987]  kvm_mmu_page_fault+0x59/0x140 [kvm]
[  804.485999]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
[  804.486003]  ? futex_wake+0x94/0x170
[  804.486012]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
[  804.486021]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
[  804.486024]  ? __switch_to+0x395/0x450
[  804.486026]  ? __switch_to+0x395/0x450
[  804.486029]  do_vfs_ioctl+0xa2/0x620
[  804.486030]  ? __x64_sys_futex+0x88/0x180
[  804.486032]  ksys_ioctl+0x70/0x80
[  804.486034]  __x64_sys_ioctl+0x16/0x20
[  804.486037]  do_syscall_64+0x55/0x100
[  804.486039]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  804.486041] RIP: 0033:0x7f82db677dd7
[  804.486042] RSP: 002b:00007f82c1ffa8b8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  804.486044] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 
00007f82db677dd7
[  804.486044] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 
0000000000000014
[  804.486045] RBP: 000055b592a1ddf0 R08: 000055b5914bb3d0 R09: 
00000000ffffffff
[  804.486046] R10: 00007f82c1ffa670 R11: 0000000000000246 R12: 
0000000000000000
[  804.486047] R13: 00007f82e0cc6000 R14: 0000000000000000 R15: 
000055b592a1ddf0
[  804.486048] Disabling lock debugging due to kernel taint

--

[  170.707761] BUG: Bad page state in process qemu-system-x86  pfn:1901199
[  170.707842] page:ffffe453e4046640 count:0 mapcount:0 
mapping:0000000000000000 index:0x1
[  170.707923] flags: 0x17fffc000000008(uptodate)
[  170.707996] raw: 017fffc000000008 dead000000000100 dead000000000200 
0000000000000000
[  170.708074] raw: 0000000000000001 0000000000000000 00000000ffffffff 
0000000000000000
[  170.708151] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[  170.708225] bad because of flags: 0x8(uptodate)
[  170.708295] Modules linked in: lz4 lz4_compress zram zsmalloc 
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel iTCO_wdt iTCO_vendor_support binfmt_misc pcbc 
aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate 
mei_me intel_uncore lpc_ich intel_rapl_perf pcspkr joydev sg mfd_core 
mei ioatdma wmi evdev ipmi_si ipmi_devintf ipmi_msghandler 
acpi_power_meter acpi_pad pcc_cpufreq button ip_tables x_tables autofs4 
ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid 
sd_mod ahci libahci libata xhci_pci ehci_pci crc32c_intel xhci_hcd 
ehci_hcd scsi_mod i2c_i801 igb i2c_algo_bit dca usbcore
[  170.708344] CPU: 8 PID: 1031 Comm: qemu-system-x86 Not tainted 
4.18.0-rc6 #1
[  170.708345] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 
2.0b 05/02/2017
[  170.708346] Call Trace:
[  170.708354]  dump_stack+0x5c/0x7b
[  170.708357]  bad_page+0xba/0x120
[  170.708360]  get_page_from_freelist+0x1016/0x1250
[  170.708364]  __alloc_pages_nodemask+0xfa/0x250
[  170.708368]  alloc_pages_vma+0x7c/0x1c0
[  170.708371]  do_swap_page+0x347/0x920
[  170.708375]  ? do_huge_pmd_anonymous_page+0x461/0x6f0
[  170.708377]  __handle_mm_fault+0x7b4/0x1110
[  170.708380]  ? call_function_interrupt+0xa/0x20
[  170.708383]  handle_mm_fault+0xfc/0x1f0
[  170.708385]  __get_user_pages+0x12f/0x690
[  170.708387]  get_user_pages_unlocked+0x148/0x1f0
[  170.708415]  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
[  170.708433]  try_async_pf+0x87/0x230 [kvm]
[  170.708450]  tdp_page_fault+0x132/0x290 [kvm]
[  170.708455]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708470]  kvm_mmu_page_fault+0x74/0x570 [kvm]
[  170.708474]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708477]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708480]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708484]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708487]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708490]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708493]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708497]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708500]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708503]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708506]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708510]  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
[  170.708526]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
[  170.708529]  ? futex_wake+0x94/0x170
[  170.708542]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
[  170.708555]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
[  170.708558]  ? __handle_mm_fault+0x7c4/0x1110
[  170.708561]  do_vfs_ioctl+0xa2/0x630
[  170.708563]  ? __x64_sys_futex+0x88/0x180
[  170.708565]  ksys_ioctl+0x70/0x80
[  170.708568]  ? exit_to_usermode_loop+0xca/0xf0
[  170.708570]  __x64_sys_ioctl+0x16/0x20
[  170.708572]  do_syscall_64+0x55/0x100
[  170.708574]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  170.708577] RIP: 0033:0x7fc9e4889dd7
[  170.708577] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 
48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
[  170.708610] RSP: 002b:00007fc9c27fb8b8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  170.708612] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 
00007fc9e4889dd7
[  170.708613] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 
0000000000000015
[  170.708614] RBP: 000055dbb5f263e0 R08: 000055dbb34f03d0 R09: 
00000000ffffffff
[  170.708616] R10: 00007fc9c27fb670 R11: 0000000000000246 R12: 
0000000000000000
[  170.708617] R13: 00007fc9e9ed5000 R14: 0000000000000000 R15: 
000055dbb5f263e0
[  170.708618] Disabling lock debugging due to kernel taint

--
Kind regards,

Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  7:30   ` Tino Lehnig
@ 2018-07-25  1:32     ` Minchan Kim
  2018-07-25  1:55       ` Matthew Wilcox
  2018-07-25  2:51       ` Matthew Wilcox
  2018-07-25 13:21     ` Minchan Kim
  1 sibling, 2 replies; 30+ messages in thread
From: Minchan Kim @ 2018-07-25  1:32 UTC (permalink / raw)
  To: Tino Lehnig, willy
  Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi Tino,

On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> Hi,
> 
> The first build I used was from the master branch of the mainline kernel,
> somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> and 4.18-rc6. Kernel messages below.
> 
> The bug does not appear on 4.14.57. I can test more versions if it helps.

Could you try 4.15?

I think it's a regression caused by the struct page field reordering,
which started in v4.16.

zsmalloc uses page->units as the offset of the first object on the
zspage. However, the patch below unified it with page->_refcount.

I believe it's the culprit of the regression.

commit ca9c88c781b8
Author: Matthew Wilcox <mawilcox@microsoft.com>
Date:   Wed Jan 31 16:18:47 2018 -0800

    mm: de-indent struct page

    I found the struct { union { struct { union { struct { } } } } } layout
    rather confusing.  Fortunately, there is an easier way to write this.

    The innermost union is of four things which are the size of an int, so

> 
> On 07/24/2018 03:03 AM, Minchan Kim wrote:
> > We didn't release v4.18 yet. Could you say what kernel tree/what version
> > you used?
> 
> --
> 
> [  804.485321] BUG: Bad page state in process qemu-system-x86  pfn:1c4b08e
> [  804.485403] page:ffffe809312c2380 count:0 mapcount:0
> mapping:0000000000000000 index:0x1
> [  804.485483] flags: 0x17fffc000000008(uptodate)
> [  804.485554] raw: 017fffc000000008 0000000000000000 0000000000000001
> 00000000ffffffff
> [  804.485632] raw: dead000000000100 dead000000000200 0000000000000000
> 0000000000000000
> [  804.485709] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  804.485782] bad because of flags: 0x8(uptodate)
> [  804.485852] Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl
> sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcb
> c aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper
> iTCO_vendor_support intel_cstate binfmt_misc intel_uncore intel_rapl_perf
> pcspkr mei_me lpc_ich joydev sg mfd_core mei ioatdma shpchp wmi evdev
> ipmi_si ipmi_devintf ipmi_msgh
> andler acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4
> crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod
> ahci libahci xhci_pci ehci_pci libata igb xhci_hcd ehci_hcd crc32c_intel
> i2c_algo_bit scsi_mod
>  i2c_i801 dca usbcore
> [  804.485890] CPU: 17 PID: 1165 Comm: qemu-system-x86 Not tainted 4.17.9 #1
> [  804.485891] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b
> 05/02/2017
> [  804.485891] Call Trace:
> [  804.485899]  dump_stack+0x5c/0x7b
> [  804.485902]  bad_page+0xba/0x120
> [  804.485905]  get_page_from_freelist+0x1016/0x1250
> [  804.485908]  __alloc_pages_nodemask+0xfa/0x250
> [  804.485911]  alloc_pages_vma+0x7c/0x1c0
> [  804.485915]  __handle_mm_fault+0xcf6/0x1110
> [  804.485918]  handle_mm_fault+0xfc/0x1f0
> [  804.485921]  __get_user_pages+0x12f/0x670
> [  804.485923]  get_user_pages_unlocked+0x148/0x1f0
> [  804.485945]  __gfn_to_pfn_memslot+0xff/0x390 [kvm]
> [  804.485959]  try_async_pf+0x67/0x200 [kvm]
> [  804.485971]  tdp_page_fault+0x132/0x290 [kvm]
> [  804.485975]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  804.485987]  kvm_mmu_page_fault+0x59/0x140 [kvm]
> [  804.485999]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
> [  804.486003]  ? futex_wake+0x94/0x170
> [  804.486012]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  804.486021]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  804.486024]  ? __switch_to+0x395/0x450
> [  804.486026]  ? __switch_to+0x395/0x450
> [  804.486029]  do_vfs_ioctl+0xa2/0x620
> [  804.486030]  ? __x64_sys_futex+0x88/0x180
> [  804.486032]  ksys_ioctl+0x70/0x80
> [  804.486034]  __x64_sys_ioctl+0x16/0x20
> [  804.486037]  do_syscall_64+0x55/0x100
> [  804.486039]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  804.486041] RIP: 0033:0x7f82db677dd7
> [  804.486042] RSP: 002b:00007f82c1ffa8b8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [  804.486044] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX:
> 00007f82db677dd7
> [  804.486044] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000014
> [  804.486045] RBP: 000055b592a1ddf0 R08: 000055b5914bb3d0 R09:
> 00000000ffffffff
> [  804.486046] R10: 00007f82c1ffa670 R11: 0000000000000246 R12:
> 0000000000000000
> [  804.486047] R13: 00007f82e0cc6000 R14: 0000000000000000 R15:
> 000055b592a1ddf0
> [  804.486048] Disabling lock debugging due to kernel taint
> 
> --
> 
> [  170.707761] BUG: Bad page state in process qemu-system-x86  pfn:1901199
> [  170.707842] page:ffffe453e4046640 count:0 mapcount:0
> mapping:0000000000000000 index:0x1
> [  170.707923] flags: 0x17fffc000000008(uptodate)
> [  170.707996] raw: 017fffc000000008 dead000000000100 dead000000000200
> 0000000000000000
> [  170.708074] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [  170.708151] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  170.708225] bad because of flags: 0x8(uptodate)
> [  170.708295] Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl
> sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
> iTCO_vendor_support binfmt_misc pcbc aesni_intel aes_x86_64 crypto_simd
> cryptd glue_helper intel_cstate mei_me intel_uncore lpc_ich intel_rapl_perf
> pcspkr joydev sg mfd_core mei ioatdma wmi evdev ipmi_si ipmi_devintf
> ipmi_msghandler acpi_power_meter acpi_pad pcc_cpufreq button ip_tables
> x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic
> usbhid hid sd_mod ahci libahci libata xhci_pci ehci_pci crc32c_intel
> xhci_hcd ehci_hcd scsi_mod i2c_i801 igb i2c_algo_bit dca usbcore
> [  170.708344] CPU: 8 PID: 1031 Comm: qemu-system-x86 Not tainted 4.18.0-rc6
> #1
> [  170.708345] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b
> 05/02/2017
> [  170.708346] Call Trace:
> [  170.708354]  dump_stack+0x5c/0x7b
> [  170.708357]  bad_page+0xba/0x120
> [  170.708360]  get_page_from_freelist+0x1016/0x1250
> [  170.708364]  __alloc_pages_nodemask+0xfa/0x250
> [  170.708368]  alloc_pages_vma+0x7c/0x1c0
> [  170.708371]  do_swap_page+0x347/0x920
> [  170.708375]  ? do_huge_pmd_anonymous_page+0x461/0x6f0
> [  170.708377]  __handle_mm_fault+0x7b4/0x1110
> [  170.708380]  ? call_function_interrupt+0xa/0x20
> [  170.708383]  handle_mm_fault+0xfc/0x1f0
> [  170.708385]  __get_user_pages+0x12f/0x690
> [  170.708387]  get_user_pages_unlocked+0x148/0x1f0
> [  170.708415]  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
> [  170.708433]  try_async_pf+0x87/0x230 [kvm]
> [  170.708450]  tdp_page_fault+0x132/0x290 [kvm]
> [  170.708455]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708470]  kvm_mmu_page_fault+0x74/0x570 [kvm]
> [  170.708474]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708477]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708480]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708484]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708487]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708490]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708493]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708497]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708500]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708503]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708506]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708510]  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
> [  170.708526]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
> [  170.708529]  ? futex_wake+0x94/0x170
> [  170.708542]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  170.708555]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  170.708558]  ? __handle_mm_fault+0x7c4/0x1110
> [  170.708561]  do_vfs_ioctl+0xa2/0x630
> [  170.708563]  ? __x64_sys_futex+0x88/0x180
> [  170.708565]  ksys_ioctl+0x70/0x80
> [  170.708568]  ? exit_to_usermode_loop+0xca/0xf0
> [  170.708570]  __x64_sys_ioctl+0x16/0x20
> [  170.708572]  do_syscall_64+0x55/0x100
> [  170.708574]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  170.708577] RIP: 0033:0x7fc9e4889dd7
> [  170.708577] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48
> c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
> [  170.708610] RSP: 002b:00007fc9c27fb8b8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [  170.708612] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX:
> 00007fc9e4889dd7
> [  170.708613] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000015
> [  170.708614] RBP: 000055dbb5f263e0 R08: 000055dbb34f03d0 R09:
> 00000000ffffffff
> [  170.708616] R10: 00007fc9c27fb670 R11: 0000000000000246 R12:
> 0000000000000000
> [  170.708617] R13: 00007fc9e9ed5000 R14: 0000000000000000 R15:
> 000055dbb5f263e0
> [  170.708618] Disabling lock debugging due to kernel taint
> 
> --
> Kind regards,
> 
> Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  1:32     ` Minchan Kim
@ 2018-07-25  1:55       ` Matthew Wilcox
  2018-07-25  2:16         ` Minchan Kim
  2018-07-25  2:51       ` Matthew Wilcox
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2018-07-25  1:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> Hi Tino,
> 
> On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> > Hi,
> > 
> > The first build I used was from the master branch of the mainline kernel,
> > somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> > and 4.18-rc6. Kernel messages below.
> > 
> > The bug does not appear on 4.14.57. I can test more versions if it helps.
> 
> Could you try 4.15?
> 
> I think it's a regression caused by the struct page field reordering that
> started in v4.16.
> 
> zsmalloc uses page->units as the offset of the first object on the zspage.
> However, the patch below unified it with page->_refcount.

No it didn't.  It's in a union with _mapcount, which is where it was before
my patches.

It's entirely possible that my patches caused this, but the explanation
you're offering is wrong.


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  1:55       ` Matthew Wilcox
@ 2018-07-25  2:16         ` Minchan Kim
  2018-07-25  2:35           ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-25  2:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Tue, Jul 24, 2018 at 06:55:02PM -0700, Matthew Wilcox wrote:
> On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> > Hi Tino,
> > 
> > On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> > > Hi,
> > > 
> > > The first build I used was from the master branch of the mainline kernel,
> > > somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> > > and 4.18-rc6. Kernel messages below.
> > > 
> > > The bug does not appear on 4.14.57. I can test more versions if it helps.
> > 
> > Could you try 4.15?
> > 
> > I think it's a regression caused by the struct page field reordering that
> > started in v4.16.
> > 
> > zsmalloc uses page->units as the offset of the first object on the zspage.
> > However, the patch below unified it with page->_refcount.
> 
> No it didn't.  It's in a union with _mapcount, which is where it was before
> my patches.
> 
> It's entirely possible that my patches caused this, but the explanation
> you're offering is wrong.

Before your patch, _mapcount and _refcount occupied separate space, so we could
use _mapcount for page->units on non-mapped pages because units was unified with
_mapcount. However, with your patch, units is now unified with _refcount.


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:16         ` Minchan Kim
@ 2018-07-25  2:35           ` Matthew Wilcox
  2018-07-25  2:51             ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2018-07-25  2:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Wed, Jul 25, 2018 at 11:16:57AM +0900, Minchan Kim wrote:
> On Tue, Jul 24, 2018 at 06:55:02PM -0700, Matthew Wilcox wrote:
> > On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> > > Hi Tino,
> > > 
> > > On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> > > > Hi,
> > > > 
> > > > The first build I used was from the master branch of the mainline kernel,
> > > > somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> > > > and 4.18-rc6. Kernel messages below.
> > > > 
> > > > The bug does not appear on 4.14.57. I can test more versions if it helps.
> > > 
> > > Could you try 4.15?
> > > 
> > > I think it's a regression caused by the struct page field reordering that
> > > started in v4.16.
> > > 
> > > zsmalloc uses page->units as the offset of the first object on the zspage.
> > > However, the patch below unified it with page->_refcount.
> > 
> > No it didn't.  It's in a union with _mapcount, which is where it was before
> > my patches.
> > 
> > It's entirely possible that my patches caused this, but the explanation
> > you're offering is wrong.
> 
> Before your patch, _mapcount and _refcount occupied separate space, so we could
> use _mapcount for page->units on non-mapped pages because units was unified with
> _mapcount. However, with your patch, units is now unified with _refcount.

No.  That's completely untrue.

        union {         /* This union is 4 bytes in size. */
                atomic_t _mapcount;
                unsigned int page_type;
                unsigned int active;            /* SLAB */
                int units;                      /* SLOB */
        };

        atomic_t _refcount;

There is NOTHING in a union with _refcount.



* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:35           ` Matthew Wilcox
@ 2018-07-25  2:51             ` Minchan Kim
  2018-07-25  2:55               ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-25  2:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Tue, Jul 24, 2018 at 07:35:32PM -0700, Matthew Wilcox wrote:
> On Wed, Jul 25, 2018 at 11:16:57AM +0900, Minchan Kim wrote:
> > On Tue, Jul 24, 2018 at 06:55:02PM -0700, Matthew Wilcox wrote:
> > > On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> > > > Hi Tino,
> > > > 
> > > > On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> > > > > Hi,
> > > > > 
> > > > > The first build I used was from the master branch of the mainline kernel,
> > > > > somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> > > > > and 4.18-rc6. Kernel messages below.
> > > > > 
> > > > > The bug does not appear on 4.14.57. I can test more versions if it helps.
> > > > 
> > > > Could you try 4.15?
> > > > 
> > > > I think it's a regression caused by the struct page field reordering that
> > > > started in v4.16.
> > > > 
> > > > zsmalloc uses page->units as the offset of the first object on the zspage.
> > > > However, the patch below unified it with page->_refcount.
> > > 
> > > No it didn't.  It's in a union with _mapcount, which is where it was before
> > > my patches.
> > > 
> > > It's entirely possible that my patches caused this, but the explanation
> > > you're offering is wrong.
> > 
> > Before your patch, _mapcount and _refcount occupied separate space, so we could
> > use _mapcount for page->units on non-mapped pages because units was unified with
> > _mapcount. However, with your patch, units is now unified with _refcount.
> 
> No.  That's completely untrue.
> 
>         union {         /* This union is 4 bytes in size. */
>                 atomic_t _mapcount;
>                 unsigned int page_type;
>                 unsigned int active;            /* SLAB */
>                 int units;                      /* SLOB */
>         };
> 
>         atomic_t _refcount;
> 
> There is NOTHING in a union with _refcount.
> 


Confusing. Matthew, what am I missing?

Before:

counters consumes 8 bytes;
units and _refcount each consume 4 bytes, so their memory does not overlap.

        union {
                unsigned long counters;
                struct {

                        union {
                                atomic_t _mapcount;
                                unsigned int active;            /* SLAB */
                                struct {                        /* SLUB */
                                        unsigned inuse:16;
                                        unsigned objects:15;
                                        unsigned frozen:1;
                                };
                                int units;                      /* SLOB */
                        };
                        /*
                         * Usage count, *USE WRAPPER FUNCTION* when manual
                         * accounting. See page_ref.h
                         */
                        atomic_t _refcount;
                };
        };


After:

Now units overlaps with both _refcount and _mapcount.

        union {
                unsigned long counters;
                unsigned int active;            /* SLAB */
                struct {                        /* SLUB */
                        unsigned inuse:16;
                        unsigned objects:15;
                        unsigned frozen:1;
                };
                int units;                      /* SLOB */
                struct {                        /* Page cache */
                        atomic_t _mapcount;
                        atomic_t _refcount;
                };
        };




* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  1:32     ` Minchan Kim
  2018-07-25  1:55       ` Matthew Wilcox
@ 2018-07-25  2:51       ` Matthew Wilcox
  2018-07-25  4:07         ` Sergey Senozhatsky
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2018-07-25  2:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> > [  804.485321] BUG: Bad page state in process qemu-system-x86  pfn:1c4b08e
> > [  804.485403] page:ffffe809312c2380 count:0 mapcount:0
> > mapping:0000000000000000 index:0x1
> > [  804.485483] flags: 0x17fffc000000008(uptodate)
> > [  804.485554] raw: 017fffc000000008 0000000000000000 0000000000000001
> > 00000000ffffffff
> > [  804.485632] raw: dead000000000100 dead000000000200 0000000000000000
> > 0000000000000000
> > [  804.485709] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> > [  804.485782] bad because of flags: 0x8(uptodate)

The message here even tells you what the problem is.  The page
was marked Uptodate at the time that it was allocated.  There aren't
any other flags that alias Uptodate, so somebody is failing to clear
the Uptodate bit when they ought to.

> > [  804.485891] Call Trace:
> > [  804.485899]  dump_stack+0x5c/0x7b
> > [  804.485902]  bad_page+0xba/0x120
> > [  804.485905]  get_page_from_freelist+0x1016/0x1250
> > [  804.485908]  __alloc_pages_nodemask+0xfa/0x250

Also note that this is in the allocation path; this flag isn't checked
at free.  But it is cleared on free, so someone's stomping on page->flags
after they're freed.  I suggest enabling more debugging code.
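One hedged way to act on that suggestion (these are standard mm debugging options and boot parameters, not something prescribed in this thread): rebuild with the VM debug checks and page_owner, then boot with them enabled so the allocator records who last allocated each page:

```shell
# Kernel config (e.g. via make menuconfig or scripts/config):
#   CONFIG_DEBUG_VM=y         # extra VM sanity checks
#   CONFIG_DEBUG_PAGEALLOC=y  # poison/unmap freed pages to catch stomping
#   CONFIG_PAGE_OWNER=y       # record the allocation stack of each page

# Enable the runtime parts on the kernel command line:
#   debug_pagealloc=on page_owner=on

# After reproducing the bug, read back the owner records:
mount -t debugfs none /sys/kernel/debug 2>/dev/null || true
cat /sys/kernel/debug/page_owner > page_owner_dump.txt
```

With page_owner enabled, the "Bad page state" report also prints the last allocation stack, which usually narrows down who freed or corrupted the page.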


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:51             ` Minchan Kim
@ 2018-07-25  2:55               ` Matthew Wilcox
  2018-07-25  3:02                 ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2018-07-25  2:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Wed, Jul 25, 2018 at 11:51:06AM +0900, Minchan Kim wrote:
> On Tue, Jul 24, 2018 at 07:35:32PM -0700, Matthew Wilcox wrote:
> > There is NOTHING in a union with _refcount.
> 
> Confusing. Matthew, what am I missing?
> 
> Before:
> 
> counters consumes 8 bytes;
> units and _refcount each consume 4 bytes, so their memory does not overlap.

Correct.  _mapcount is at offset 24, as is units.  _refcount is at offset 28.

>         union {
>                 unsigned long counters;
>                 struct {
>                         union {
>                                 atomic_t _mapcount;
>                                 int units;                      /* SLOB */
>                         };
>                         atomic_t _refcount;
>                 };
>         };
> 
> 
> After:
> 
> Now units overlaps with both _refcount and _mapcount.

No.  units is at offset 24, as is _mapcount.  But _refcount is still at
offset 28.

>         union {
>                 unsigned long counters;
>                 int units;                      /* SLOB */
>                 struct {                        /* Page cache */
>                         atomic_t _mapcount;
>                         atomic_t _refcount;
>                 };
>         };


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:55               ` Matthew Wilcox
@ 2018-07-25  3:02                 ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2018-07-25  3:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Tue, Jul 24, 2018 at 07:55:25PM -0700, Matthew Wilcox wrote:
> On Wed, Jul 25, 2018 at 11:51:06AM +0900, Minchan Kim wrote:
> > On Tue, Jul 24, 2018 at 07:35:32PM -0700, Matthew Wilcox wrote:
> > > There is NOTHING in a union with _refcount.
> > 
> > Confusing. Matthew, what am I missing?
> > 
> > Before:
> > 
> > counters consumes 8 bytes;
> > units and _refcount each consume 4 bytes, so their memory does not overlap.
> 
> Correct.  _mapcount is at offset 24, as is units.  _refcount is at offset 28.
> 
> >         union {
> >                 unsigned long counters;
> >                 struct {
> >                         union {
> >                                 atomic_t _mapcount;
> >                                 int units;                      /* SLOB */
> >                         };
> >                         atomic_t _refcount;
> >                 };
> >         };
> > 
> > 
> > After:
> > 
> > Now units overlaps with both _refcount and _mapcount.
> 
> No.  units is at offset 24, as is _mapcount.  But _refcount is still at
> offset 28.

Ah, now I see what you mean. Sorry for the noise, Matthew.

> 
> >         union {
> >                 unsigned long counters;
> >                 int units;                      /* SLOB */
> >                 struct {                        /* Page cache */
> >                         atomic_t _mapcount;
> >                         atomic_t _refcount;
> >                 };
> >         };


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:51       ` Matthew Wilcox
@ 2018-07-25  4:07         ` Sergey Senozhatsky
  0 siblings, 0 replies; 30+ messages in thread
From: Sergey Senozhatsky @ 2018-07-25  4:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Minchan Kim, Tino Lehnig, ngupta, linux-kernel,
	Sergey Senozhatsky, Andrew Morton

On (07/24/18 19:51), Matthew Wilcox wrote:
> 
> Also note that this is in the allocation path; this flag isn't checked
> at free.  But it is cleared on free, so someone's stomping on page->flags
> after they're freed.  I suggest enabling more debugging code.

Would be lovely if Tino could bisect it, perhaps.

	-ss


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  7:30   ` Tino Lehnig
  2018-07-25  1:32     ` Minchan Kim
@ 2018-07-25 13:21     ` Minchan Kim
  2018-07-25 15:12       ` Tino Lehnig
  1 sibling, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-25 13:21 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> Hi,
> 
> The first build I used was from the master branch of the mainline kernel,
> somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> and 4.18-rc6. Kernel messages below.
> 
> The bug does not appear on 4.14.57. I can test more versions if it helps.

It would be very helpful if you could check more versions with git-bisect.

I also want to reproduce it.

Today, I downloaded a Windows ISO and booted it as a CD-ROM under KVM with my
own compiled kernel, but I couldn't reproduce the bug.
I also tested a heavy swap workload (a kernel build with multiple CPUs
on small memory), but failed to reproduce it that way, too.

Could you please describe your reproduction method in more detail?

Thanks.
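A sketch of the bisect run asked for above, with the good/bad endpoints taken from the versions already reported in this thread (4.14.57 good, 4.17.9 bad; mainline tags are used since stable point releases are not on the bisect path):

```shell
# In a mainline kernel tree:
git bisect start
git bisect bad v4.17      # 4.17.9 reproduces the bug
git bisect good v4.14     # 4.14.57 does not

# At each step git checks out a midpoint commit. Build, install,
# reboot, and run the KVM/zram-writeback reproducer:
make -j"$(nproc)" && make modules_install install
# ...reboot and test...

# Then report the result and let git pick the next commit:
git bisect good           # or: git bisect bad
git bisect log            # keep a record; ~15 steps cover this range
```

Each step halves the remaining commit range, so even the roughly 45k commits between v4.14 and v4.17 take only about 15 or 16 build-and-boot cycles.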


> 
> On 07/24/2018 03:03 AM, Minchan Kim wrote:
> > We didn't release v4.18 yet. Could you say what kernel tree/what version
> > you used?
> 
> --
> 
> [  804.485321] BUG: Bad page state in process qemu-system-x86  pfn:1c4b08e
> [  804.485403] page:ffffe809312c2380 count:0 mapcount:0
> mapping:0000000000000000 index:0x1
> [  804.485483] flags: 0x17fffc000000008(uptodate)
> [  804.485554] raw: 017fffc000000008 0000000000000000 0000000000000001
> 00000000ffffffff
> [  804.485632] raw: dead000000000100 dead000000000200 0000000000000000
> 0000000000000000
> [  804.485709] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  804.485782] bad because of flags: 0x8(uptodate)
> [  804.485852] Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl
> sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcb
> c aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper
> iTCO_vendor_support intel_cstate binfmt_misc intel_uncore intel_rapl_perf
> pcspkr mei_me lpc_ich joydev sg mfd_core mei ioatdma shpchp wmi evdev
> ipmi_si ipmi_devintf ipmi_msgh
> andler acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4
> crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod
> ahci libahci xhci_pci ehci_pci libata igb xhci_hcd ehci_hcd crc32c_intel
> i2c_algo_bit scsi_mod
>  i2c_i801 dca usbcore
> [  804.485890] CPU: 17 PID: 1165 Comm: qemu-system-x86 Not tainted 4.17.9 #1
> [  804.485891] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b
> 05/02/2017
> [  804.485891] Call Trace:
> [  804.485899]  dump_stack+0x5c/0x7b
> [  804.485902]  bad_page+0xba/0x120
> [  804.485905]  get_page_from_freelist+0x1016/0x1250
> [  804.485908]  __alloc_pages_nodemask+0xfa/0x250
> [  804.485911]  alloc_pages_vma+0x7c/0x1c0
> [  804.485915]  __handle_mm_fault+0xcf6/0x1110
> [  804.485918]  handle_mm_fault+0xfc/0x1f0
> [  804.485921]  __get_user_pages+0x12f/0x670
> [  804.485923]  get_user_pages_unlocked+0x148/0x1f0
> [  804.485945]  __gfn_to_pfn_memslot+0xff/0x390 [kvm]
> [  804.485959]  try_async_pf+0x67/0x200 [kvm]
> [  804.485971]  tdp_page_fault+0x132/0x290 [kvm]
> [  804.485975]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  804.485987]  kvm_mmu_page_fault+0x59/0x140 [kvm]
> [  804.485999]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
> [  804.486003]  ? futex_wake+0x94/0x170
> [  804.486012]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  804.486021]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  804.486024]  ? __switch_to+0x395/0x450
> [  804.486026]  ? __switch_to+0x395/0x450
> [  804.486029]  do_vfs_ioctl+0xa2/0x620
> [  804.486030]  ? __x64_sys_futex+0x88/0x180
> [  804.486032]  ksys_ioctl+0x70/0x80
> [  804.486034]  __x64_sys_ioctl+0x16/0x20
> [  804.486037]  do_syscall_64+0x55/0x100
> [  804.486039]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  804.486041] RIP: 0033:0x7f82db677dd7
> [  804.486042] RSP: 002b:00007f82c1ffa8b8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [  804.486044] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX:
> 00007f82db677dd7
> [  804.486044] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000014
> [  804.486045] RBP: 000055b592a1ddf0 R08: 000055b5914bb3d0 R09:
> 00000000ffffffff
> [  804.486046] R10: 00007f82c1ffa670 R11: 0000000000000246 R12:
> 0000000000000000
> [  804.486047] R13: 00007f82e0cc6000 R14: 0000000000000000 R15:
> 000055b592a1ddf0
> [  804.486048] Disabling lock debugging due to kernel taint
> 
> --
> 
> [  170.707761] BUG: Bad page state in process qemu-system-x86  pfn:1901199
> [  170.707842] page:ffffe453e4046640 count:0 mapcount:0
> mapping:0000000000000000 index:0x1
> [  170.707923] flags: 0x17fffc000000008(uptodate)
> [  170.707996] raw: 017fffc000000008 dead000000000100 dead000000000200
> 0000000000000000
> [  170.708074] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [  170.708151] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  170.708225] bad because of flags: 0x8(uptodate)
> [  170.708295] Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl
> sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
> iTCO_vendor_support binfmt_misc pcbc aesni_intel aes_x86_64 crypto_simd
> cryptd glue_helper intel_cstate mei_me intel_uncore lpc_ich intel_rapl_perf
> pcspkr joydev sg mfd_core mei ioatdma wmi evdev ipmi_si ipmi_devintf
> ipmi_msghandler acpi_power_meter acpi_pad pcc_cpufreq button ip_tables
> x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic
> usbhid hid sd_mod ahci libahci libata xhci_pci ehci_pci crc32c_intel
> xhci_hcd ehci_hcd scsi_mod i2c_i801 igb i2c_algo_bit dca usbcore
> [  170.708344] CPU: 8 PID: 1031 Comm: qemu-system-x86 Not tainted 4.18.0-rc6
> #1
> [  170.708345] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b
> 05/02/2017
> [  170.708346] Call Trace:
> [  170.708354]  dump_stack+0x5c/0x7b
> [  170.708357]  bad_page+0xba/0x120
> [  170.708360]  get_page_from_freelist+0x1016/0x1250
> [  170.708364]  __alloc_pages_nodemask+0xfa/0x250
> [  170.708368]  alloc_pages_vma+0x7c/0x1c0
> [  170.708371]  do_swap_page+0x347/0x920
> [  170.708375]  ? do_huge_pmd_anonymous_page+0x461/0x6f0
> [  170.708377]  __handle_mm_fault+0x7b4/0x1110
> [  170.708380]  ? call_function_interrupt+0xa/0x20
> [  170.708383]  handle_mm_fault+0xfc/0x1f0
> [  170.708385]  __get_user_pages+0x12f/0x690
> [  170.708387]  get_user_pages_unlocked+0x148/0x1f0
> [  170.708415]  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
> [  170.708433]  try_async_pf+0x87/0x230 [kvm]
> [  170.708450]  tdp_page_fault+0x132/0x290 [kvm]
> [  170.708455]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708470]  kvm_mmu_page_fault+0x74/0x570 [kvm]
> [  170.708474]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708477]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708480]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708484]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708487]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708490]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708493]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708497]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708500]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708503]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708506]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708510]  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
> [  170.708526]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
> [  170.708529]  ? futex_wake+0x94/0x170
> [  170.708542]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  170.708555]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  170.708558]  ? __handle_mm_fault+0x7c4/0x1110
> [  170.708561]  do_vfs_ioctl+0xa2/0x630
> [  170.708563]  ? __x64_sys_futex+0x88/0x180
> [  170.708565]  ksys_ioctl+0x70/0x80
> [  170.708568]  ? exit_to_usermode_loop+0xca/0xf0
> [  170.708570]  __x64_sys_ioctl+0x16/0x20
> [  170.708572]  do_syscall_64+0x55/0x100
> [  170.708574]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  170.708577] RIP: 0033:0x7fc9e4889dd7
> [  170.708577] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
> [  170.708610] RSP: 002b:00007fc9c27fb8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  170.708612] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fc9e4889dd7
> [  170.708613] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
> [  170.708614] RBP: 000055dbb5f263e0 R08: 000055dbb34f03d0 R09: 00000000ffffffff
> [  170.708616] R10: 00007fc9c27fb670 R11: 0000000000000246 R12: 0000000000000000
> [  170.708617] R13: 00007fc9e9ed5000 R14: 0000000000000000 R15: 000055dbb5f263e0
> [  170.708618] Disabling lock debugging due to kernel taint
> 
> --
> Kind regards,
> 
> Tino Lehnig

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25 13:21     ` Minchan Kim
@ 2018-07-25 15:12       ` Tino Lehnig
  2018-07-26  2:03         ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-25 15:12 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi,

On 07/25/2018 03:21 PM, Minchan Kim wrote:
> It would be much helpful if you could check more versions with git-bisect.

I started bisecting today, but my results are not conclusive yet. It is 
certain that the problem started with 4.15, though. I have not 
encountered the bug message in 4.15-rc1 so far, but the KVM processes 
always became unresponsive after hitting swap and could not be killed 
there. I saw the same behavior in rc2, rc3, and other builds in between, 
but the bad page state bug would only trigger occasionally there. The 
behavior in 4.15.18 is the same as in newer kernels.
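
For reference, the bisect runs above follow the standard git-bisect 
workflow; a sketch (the good/bad endpoints are taken from my 
observations, and each step needs a build, boot, and test cycle):

```shell
# Bisect mainline between the last known-good and first known-bad tags.
cd linux
git bisect start
git bisect bad v4.15-rc1     # KVM processes hang under zram swap here
git bisect good v4.14        # behaves normally here
# At each step: build and boot the kernel, run the KVM/zram swap
# workload, then record the outcome; repeat until git reports the
# first bad commit.
git bisect good              # or: git bisect bad
git bisect reset             # return to the original branch when done
```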

> I also want to reproduce it.
> 
> Today, I downloaded a Windows ISO and ran it as a cdrom with my own
> compiled kernel under KVM, but I couldn't reproduce the bug.
> I also tested some heavy swap workloads (a kernel build with multiple
> CPUs on small memory), but I failed to reproduce it there, too.
> 
> Could you please tell me about your method in more detail?

I found that running Windows in KVM really is the only reliable method, 
maybe because the zero pages are easily compressible. There is actually 
not a lot of disk utilization on the backing device when running this test.

My operating system is a minimal install of Debian 9. I took the kernel 
configuration from the default Debian kernel and built my own kernel 
with "make oldconfig" leaving all settings at their defaults. The only 
thing I changed in the configuration was enabling the zram writeback 
feature.
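
For the record, that single change can be applied like this (a sketch 
using the `scripts/config` helper shipped in the kernel tree; the config 
file path is illustrative):

```shell
# Start from the distribution config, then enable only zram writeback.
cp /boot/config-"$(uname -r)" .config
./scripts/config --enable CONFIG_ZRAM_WRITEBACK
make oldconfig    # keep all other settings at their defaults
```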

All my tests were done on bare-metal hardware with Xeon processors and 
lots of RAM. I encounter the bug quite quickly, but it still takes 
several GBs of swap usage. Below is my /proc/meminfo with enough KVM 
instances running (3 in my case) to trigger the bug on my test machine.

I will also try to reproduce the problem on some different hardware next.

--

MemTotal:       264033384 kB
MemFree:         1232968 kB
MemAvailable:          0 kB
Buffers:            1152 kB
Cached:             5036 kB
SwapCached:        49200 kB
Active:         249955744 kB
Inactive:        5096148 kB
Active(anon):   249953396 kB
Inactive(anon):  5093084 kB
Active(file):       2348 kB
Inactive(file):     3064 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      1073741820 kB
SwapFree:       938603260 kB
Dirty:                68 kB
Writeback:             0 kB
AnonPages:      255007752 kB
Mapped:             4708 kB
Shmem:              1212 kB
Slab:              88500 kB
SReclaimable:      16096 kB
SUnreclaim:        72404 kB
KernelStack:        5040 kB
PageTables:       765560 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    1205758512 kB
Committed_AS:   403586176 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:  254799872 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       75136 kB
DirectMap2M:    10295296 kB
DirectMap1G:    260046848 kB

-- 
Kind regards,

Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25 15:12       ` Tino Lehnig
@ 2018-07-26  2:03         ` Minchan Kim
  2018-07-26  6:10           ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-26  2:03 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi Tino,

On Wed, Jul 25, 2018 at 05:12:13PM +0200, Tino Lehnig wrote:
> Hi,
> 
> On 07/25/2018 03:21 PM, Minchan Kim wrote:
> > It would be much helpful if you could check more versions with git-bisect.
> 
> I started bisecting today, but my results are not conclusive yet. It is
> certain that the problem started with 4.15 though. I have not encountered

A thing I could imagine is 
[0bcac06f27d75, skip swapcache for swapin of synchronous device]
It was merged into v4.15. Could you check it by bisecting?

With the writeback feature enabled, zram is no longer a synchronous
device, so that could explain the unresponsive symptom you are seeing now.
I should disable the synchronous flag when zram works with the writeback feature.
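
For context, the swap-in change introduced by that commit can be 
sketched roughly like this (simplified pseudocode based on my reading of 
do_swap_page(); not the literal kernel source):

```c
/* Sketch: swap-in decision after 0bcac06f27d75 (simplified) */
if ((si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(si, entry) == 1) {
        /* Fast path: skip the swapcache and read the page back
         * synchronously from the swap device (e.g. zram). */
        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
        swap_readpage(page, true);   /* polls until the IO completes */
} else {
        /* Slow path: go through the swapcache as before. */
        page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
                                vma, vmf->address);
}
```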

However, I still need to work out why the PG_uptodate WARN is triggering.
I guess there are some places that assume all anonymous pages with
PG_swapbacked and !PG_dirty are in the swapcache, so the reference
accounting could be corrupted.

> the bug message in 4.15-rc1 so far, but the kvm processes always became
> unresponsive after hitting swap and could not be killed there. I saw the
> same behavior in rc2, rc3, and other builds in between, but the bad state
> bug would only trigger occasionally there. The behavior in 4.15.18 is the
> same as in newer kernels.
> 
> > I also want to reproduce it.
> > 
> > Today, I downloaded a Windows ISO and ran it as a cdrom with my own
> > compiled kernel under KVM, but I couldn't reproduce the bug.
> > I also tested some heavy swap workloads (a kernel build with multiple
> > CPUs on small memory), but I failed to reproduce it there, too.
> > 
> > Could you please tell me about your method in more detail?
> 
> I found that running Windows in KVM really is the only reliable method,
> maybe because the zero pages are easily compressible. There is actually not
> a lot of disk utilization on the backing device when running this test.

Hmm, my testing created a ton of disk IO on the backing device.
Maybe that's why I couldn't reproduce it.
Zero pages should never be written back to the device, so I don't
understand how that kind of testing reproduces the problem.
Anyway, I will try it again with zero pages plus a few random pages.

> 
> My operating system is a minimal install of Debian 9. I took the kernel
> configuration from the default Debian kernel and built my own kernel with
> "make oldconfig" leaving all settings at their defaults. The only thing I
> changed in the configuration was enabling the zram writeback feature.

You mean you changed the host kernel configuration?

> 
> All my tests were done on bare-metal hardware with Xeon processors and lots
> of RAM. I encounter the bug quite quickly, but it still takes several GBs of
> swap usage. Below is my /proc/meminfo with enough KVM instances running (3
> in my case) to trigger the bug on my test machine.

Aha.. so you enabled the writeback feature on your bare-metal host machine
and ran KVM with Windows images as guests. So the PG_uptodate warning
happens on the host side, not in the guest, right?

Thanks!

> 
> I will also try to reproduce the problem on some different hardware next.
> 
> --
> 
> MemTotal:       264033384 kB
> MemFree:         1232968 kB
> MemAvailable:          0 kB
> Buffers:            1152 kB
> Cached:             5036 kB
> SwapCached:        49200 kB
> Active:         249955744 kB
> Inactive:        5096148 kB
> Active(anon):   249953396 kB
> Inactive(anon):  5093084 kB
> Active(file):       2348 kB
> Inactive(file):     3064 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:      1073741820 kB
> SwapFree:       938603260 kB
> Dirty:                68 kB
> Writeback:             0 kB
> AnonPages:      255007752 kB
> Mapped:             4708 kB
> Shmem:              1212 kB
> Slab:              88500 kB
> SReclaimable:      16096 kB
> SUnreclaim:        72404 kB
> KernelStack:        5040 kB
> PageTables:       765560 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    1205758512 kB
> Committed_AS:   403586176 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:           0 kB
> VmallocChunk:          0 kB
> HardwareCorrupted:     0 kB
> AnonHugePages:  254799872 kB
> ShmemHugePages:        0 kB
> ShmemPmdMapped:        0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:       75136 kB
> DirectMap2M:    10295296 kB
> DirectMap1G:    260046848 kB
> 
> -- 
> Kind regards,
> 
> Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26  2:03         ` Minchan Kim
@ 2018-07-26  6:10           ` Tino Lehnig
  2018-07-26  6:21             ` Minchan Kim
  2018-07-26 10:00             ` Tino Lehnig
  0 siblings, 2 replies; 30+ messages in thread
From: Tino Lehnig @ 2018-07-26  6:10 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi,

On 07/26/2018 04:03 AM, Minchan Kim wrote:
> A thing I could imagine is
> [0bcac06f27d75, skip swapcache for swapin of synchronous device]
> It was merged into v4.15. Could you check it by bisecting?

Thanks, I will check that.

>> My operating system is a minimal install of Debian 9. I took the kernel
>> configuration from the default Debian kernel and built my own kernel with
>> "make oldconfig" leaving all settings at their defaults. The only thing I
>> changed in the configuration was enabling the zram writeback feature.
> 
> You mean you changed host kernel configuration?
> 
>>
>> All my tests were done on bare-metal hardware with Xeon processors and lots
>> of RAM. I encounter the bug quite quickly, but it still takes several GBs of
>> swap usage. Below is my /proc/meminfo with enough KVM instances running (3
>> in my case) to trigger the bug on my test machine.
> 
> Aha.. you did writeback feature into your bare-metal host machine and execute
> kvm with window images as a guest. So, PG_uptodate warning happens on host side,
> not guest? Right?

Yes, I am only talking about the host kernel. Zram swap is set up on the 
host. I just used Windows guests to fill up the host RAM and force it 
into swap.

>> I will also try to reproduce the problem on some different hardware next.

Just to confirm, I was able to reproduce the problem on another machine 
running Ubuntu 18.04 with the Ubuntu stock kernel (4.15) and no 
modifications to the kernel configuration whatsoever. The host had 8 GB 
of RAM, 32 GB of swap with zram and a 32 GB SSD as backing device. I had 
to start only one Windows VM with "-m 32768" to trigger the bug.
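
For anyone reproducing this: the swap setup described above boils down 
to something like the following sketch (the backing device `/dev/sdb` 
and the compression algorithm are illustrative, not a literal transcript 
of my commands):

```shell
# zram swap with an SSD partition as writeback backing device
modprobe zram num_devices=1
echo /dev/sdb > /sys/block/zram0/backing_dev  # must be set before disksize
echo lz4 > /sys/block/zram0/comp_algorithm
echo 32G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0
```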

-- 
Kind regards,

Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26  6:10           ` Tino Lehnig
@ 2018-07-26  6:21             ` Minchan Kim
  2018-07-26  6:34               ` Tino Lehnig
  2018-07-26 10:00             ` Tino Lehnig
  1 sibling, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-26  6:21 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Thu, Jul 26, 2018 at 08:10:41AM +0200, Tino Lehnig wrote:
> Hi,
> 
> On 07/26/2018 04:03 AM, Minchan Kim wrote:
> > A thing I could imagine is
> > [0bcac06f27d75, skip swapcache for swapin of synchronous device]
> > It was merged into v4.15. Could you check it by bisecting?
> 
> Thanks, I will check that.
> 
> > > My operating system is a minimal install of Debian 9. I took the kernel
> > > configuration from the default Debian kernel and built my own kernel with
> > > "make oldconfig" leaving all settings at their defaults. The only thing I
> > > changed in the configuration was enabling the zram writeback feature.
> > 
> > You mean you changed host kernel configuration?
> > 
> > > 
> > > All my tests were done on bare-metal hardware with Xeon processors and lots
> > > of RAM. I encounter the bug quite quickly, but it still takes several GBs of
> > > swap usage. Below is my /proc/meminfo with enough KVM instances running (3
> > > in my case) to trigger the bug on my test machine.
> > 
> > Aha.. so you enabled the writeback feature on your bare-metal host machine
> > and ran KVM with Windows images as guests. So the PG_uptodate warning
> > happens on the host side, not in the guest, right?
> 
> Yes, I am only talking about the host kernel. Zram swap is set up on the
> host. I just used Windows guests to fill up the host RAM and force it into
> swap.
> 
> > > I will also try to reproduce the problem on some different hardware next.
> 
> Just to confirm, I was able to reproduce the problem on another machine
> running Ubuntu 18.04 with the Ubuntu stock kernel (4.15) and no
> modifications to the kernel configuration whatsoever. The host had 8 GB of

That means you could reproduce it without the writeback feature?
If so, it would be more reasonable to check [0bcac06f27d75, skip swapcache for swapin of synchronous device]

> RAM, 32 GB of swap with zram and a 32 GB SSD as backing device. I had to
> start only one Windows VM with "-m 32768" to trigger the bug.

Thanks. I will try it later today.


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26  6:21             ` Minchan Kim
@ 2018-07-26  6:34               ` Tino Lehnig
  0 siblings, 0 replies; 30+ messages in thread
From: Tino Lehnig @ 2018-07-26  6:34 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/26/2018 08:21 AM, Minchan Kim wrote:
> That means you could reproduce it without the writeback feature?
> If so, it would be more reasonable to check [0bcac06f27d75, skip swapcache for swapin of synchronous device]

No, the bug only occurs with a backing device. The writeback feature is 
enabled in the default Ubuntu kernel configuration.

-- 
Kind regards,

Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26  6:10           ` Tino Lehnig
  2018-07-26  6:21             ` Minchan Kim
@ 2018-07-26 10:00             ` Tino Lehnig
  2018-07-26 10:30               ` Minchan Kim
  1 sibling, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-26 10:00 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/26/2018 08:10 AM, Tino Lehnig wrote:
>> A thing I could imagine is
>> [0bcac06f27d75, skip swapcache for swapin of synchronous device]
>> It was merged into v4.15. Could you check it by bisecting?
> 
> Thanks, I will check that.

So I get the same behavior as in v4.15-rc1 after this commit. All prior 
builds are fine.

I have also tested all other 4.15 rc builds now and the symptoms are the 
same through rc8. KVM processes become unresponsive and I see kernel 
messages like the one below. This happens with and without the writeback 
feature being used. The bad page state bug appears very rarely in these 
versions and only when writeback is active.

Starting with rc9, I only get the same bad page state bug as in all 
newer kernels.

--

[  363.494793] INFO: task kworker/4:2:498 blocked for more than 120 seconds.
[  363.494872]       Not tainted 4.14.0-zram-pre-rc1 #17
[  363.494943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.495021] kworker/4:2     D    0   498      2 0x80000000
[  363.495029] Workqueue: events async_pf_execute
[  363.495030] Call Trace:
[  363.495037]  ? __schedule+0x3bc/0x830
[  363.495039]  schedule+0x32/0x80
[  363.495042]  io_schedule+0x12/0x40
[  363.495045]  __lock_page_or_retry+0x302/0x320
[  363.495047]  ? page_cache_tree_insert+0xa0/0xa0
[  363.495051]  do_swap_page+0x4ab/0x860
[  363.495054]  __handle_mm_fault+0x77b/0x10c0
[  363.495056]  handle_mm_fault+0xc6/0x1b0
[  363.495059]  __get_user_pages+0xf9/0x620
[  363.495061]  ? update_load_avg+0x5d6/0x6d0
[  363.495064]  get_user_pages_remote+0x137/0x1f0
[  363.495067]  async_pf_execute+0x62/0x180
[  363.495071]  process_one_work+0x184/0x380
[  363.495073]  worker_thread+0x4d/0x3c0
[  363.495076]  kthread+0xf5/0x130
[  363.495078]  ? process_one_work+0x380/0x380
[  363.495080]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.495083]  ? do_group_exit+0x3a/0xa0
[  363.495086]  ret_from_fork+0x1f/0x30

-- 
Kind regards,

Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26 10:00             ` Tino Lehnig
@ 2018-07-26 10:30               ` Minchan Kim
  2018-07-26 12:35                 ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-26 10:30 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Thu, Jul 26, 2018 at 12:00:44PM +0200, Tino Lehnig wrote:
> On 07/26/2018 08:10 AM, Tino Lehnig wrote:
> > > A thing I could imagine is
> > > [0bcac06f27d75, skip swapcache for swapin of synchronous device]
> > > It was merged into v4.15. Could you check it by bisecting?
> > 
> > Thanks, I will check that.
> 
> So I get the same behavior as in v4.15-rc1 after this commit. All prior
> builds are fine.
> 
> I have also tested all other 4.15 rc builds now and the symptoms are the
> same through rc8. KVM processes become unresponsive and I see kernel

Yup, I think it's the polling routine of swap_readpage. With the patch
I mentioned, the swap layer will wait for the IO synchronously, and I believe
the page was on the backing device, not in zram memory.

> messages like the one below. This happens with and without the writeback

Huh, you see it without writeback? That's weird. Without the writeback feature,
zram operation is always synchronous memory compression/decompression,
so you shouldn't see the io_schedule logic below, which happens only for
asynchronous IO operations.
Could you check one more time that it happens without writeback?

> feature being used. The bad page state bug appears very rarely in these
> versions and only when writeback is active.

Yup, I will review the code some more. I guess there are some places that assume
anonymous pages are in the swapcache, so the refcount accounting could be broken.

> 
> Starting with rc9, I only get the same bad page state bug as in all newer
> kernels.

So, you mean you couldn't see the bad page state bug until 4.15-rc8?
You just saw the hung task message below until 4.15-rc8, not the bad page bug?

> 
> --
> 
> [  363.494793] INFO: task kworker/4:2:498 blocked for more than 120 seconds.
> [  363.494872]       Not tainted 4.14.0-zram-pre-rc1 #17
> [  363.494943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.495021] kworker/4:2     D    0   498      2 0x80000000
> [  363.495029] Workqueue: events async_pf_execute
> [  363.495030] Call Trace:
> [  363.495037]  ? __schedule+0x3bc/0x830
> [  363.495039]  schedule+0x32/0x80
> [  363.495042]  io_schedule+0x12/0x40
> [  363.495045]  __lock_page_or_retry+0x302/0x320
> [  363.495047]  ? page_cache_tree_insert+0xa0/0xa0
> [  363.495051]  do_swap_page+0x4ab/0x860
> [  363.495054]  __handle_mm_fault+0x77b/0x10c0
> [  363.495056]  handle_mm_fault+0xc6/0x1b0
> [  363.495059]  __get_user_pages+0xf9/0x620
> [  363.495061]  ? update_load_avg+0x5d6/0x6d0
> [  363.495064]  get_user_pages_remote+0x137/0x1f0
> [  363.495067]  async_pf_execute+0x62/0x180
> [  363.495071]  process_one_work+0x184/0x380
> [  363.495073]  worker_thread+0x4d/0x3c0
> [  363.495076]  kthread+0xf5/0x130
> [  363.495078]  ? process_one_work+0x380/0x380
> [  363.495080]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.495083]  ? do_group_exit+0x3a/0xa0
> [  363.495086]  ret_from_fork+0x1f/0x30
> 
> -- 
> Kind regards,
> 
> Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26 10:30               ` Minchan Kim
@ 2018-07-26 12:35                 ` Tino Lehnig
  2018-07-27  9:14                   ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-26 12:35 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/26/2018 12:30 PM, Minchan Kim wrote:
> Huh, you see it without writeback? It's weird. Without writeback feature,
> zram operaion is always synchronous on memory compression/decompression
> so you shouldn't see below io_schedule logic which happens only for
> asynchronous IO operation.
> Could you check one more time that it happens without writeback?

Confirmed. More kernel logs below. backing_dev was not set in this run. 
This does not happen with 4.15-rc9 and newer.

> So, you mean you couldn't se bad page state bug until 4.15-rc8?
> You just see below hung message until 4.15-rc8, not bad page bug?
> 

I did see the bad page state bug first in 4.15-rc2, but only very 
rarely. I have attached the log below the other one. Most test runs from 
commit 0bcac06f27d75 through 4.15-rc8 just resulted in hung task errors.

--

[  363.116153] INFO: task kworker/0:1:130 blocked for more than 120 seconds.
[  363.116237]       Not tainted 4.15.0-rc8-zram #22
[  363.116311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.116392] kworker/0:1     D    0   130      2 0x80000000
[  363.116405] Workqueue: events async_pf_execute
[  363.116407] Call Trace:
[  363.116416]  ? __schedule+0x3d0/0x850
[  363.116419]  schedule+0x32/0x80
[  363.116426]  io_schedule+0x12/0x40
[  363.116433]  __lock_page_or_retry+0x302/0x320
[  363.116437]  ? page_cache_tree_insert+0xb0/0xb0
[  363.116442]  do_swap_page+0x4dd/0x870
[  363.116446]  __handle_mm_fault+0x790/0x10c0
[  363.116450]  handle_mm_fault+0xc6/0x1b0
[  363.116453]  __get_user_pages+0xf9/0x620
[  363.116457]  get_user_pages_remote+0x137/0x1f0
[  363.116462]  async_pf_execute+0x62/0x180
[  363.116469]  process_one_work+0x189/0x380
[  363.116474]  worker_thread+0x4d/0x3c0
[  363.116478]  kthread+0xf8/0x130
[  363.116482]  ? process_one_work+0x380/0x380
[  363.116486]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.116492]  ret_from_fork+0x32/0x40
[  363.116498] INFO: task kswapd0:159 blocked for more than 120 seconds.
[  363.116576]       Not tainted 4.15.0-rc8-zram #22
[  363.116650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.116731] kswapd0         D    0   159      2 0x80000000
[  363.116734] Call Trace:
[  363.116738]  ? __schedule+0x3d0/0x850
[  363.116741]  schedule+0x32/0x80
[  363.116745]  io_schedule+0x12/0x40
[  363.116749]  __lock_page+0x109/0x130
[  363.116753]  ? page_cache_tree_insert+0xb0/0xb0
[  363.116759]  deferred_split_scan+0x1e9/0x2a0
[  363.116762]  shrink_slab.part.49+0x1e6/0x3d0
[  363.116768]  shrink_node+0x2e7/0x2f0
[  363.116771]  kswapd+0x35b/0x6f0
[  363.116776]  kthread+0xf8/0x130
[  363.116779]  ? mem_cgroup_shrink_node+0x150/0x150
[  363.116782]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.116786]  ret_from_fork+0x32/0x40
[  363.116790] INFO: task kworker/9:1:196 blocked for more than 120 seconds.
[  363.116869]       Not tainted 4.15.0-rc8-zram #22
[  363.116942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.117023] kworker/9:1     D    0   196      2 0x80000000
[  363.117029] Workqueue: events async_pf_execute
[  363.117031] Call Trace:
[  363.117034]  ? __schedule+0x3d0/0x850
[  363.117037]  schedule+0x32/0x80
[  363.117041]  io_schedule+0x12/0x40
[  363.117046]  __lock_page_or_retry+0x302/0x320
[  363.117050]  ? page_cache_tree_insert+0xb0/0xb0
[  363.117053]  do_swap_page+0x4dd/0x870
[  363.117057]  __handle_mm_fault+0x790/0x10c0
[  363.117061]  handle_mm_fault+0xc6/0x1b0
[  363.117063]  __get_user_pages+0xf9/0x620
[  363.117068]  ? update_load_avg+0x5c0/0x6f0
[  363.117071]  get_user_pages_remote+0x137/0x1f0
[  363.117076]  async_pf_execute+0x62/0x180
[  363.117081]  process_one_work+0x189/0x380
[  363.117085]  worker_thread+0x4d/0x3c0
[  363.117089]  kthread+0xf8/0x130
[  363.117093]  ? process_one_work+0x380/0x380
[  363.117096]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.117100]  ret_from_fork+0x32/0x40
[  363.117104] INFO: task kworker/19:1:201 blocked for more than 120 seconds.
[  363.117183]       Not tainted 4.15.0-rc8-zram #22
[  363.117256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.117337] kworker/19:1    D    0   201      2 0x80000000
[  363.117343] Workqueue: events async_pf_execute
[  363.117345] Call Trace:
[  363.117349]  ? __schedule+0x3d0/0x850
[  363.117352]  schedule+0x32/0x80
[  363.117356]  io_schedule+0x12/0x40
[  363.117360]  __lock_page_or_retry+0x302/0x320
[  363.117364]  ? page_cache_tree_insert+0xb0/0xb0
[  363.117367]  do_swap_page+0x4dd/0x870
[  363.117370]  __handle_mm_fault+0x790/0x10c0
[  363.117374]  handle_mm_fault+0xc6/0x1b0
[  363.117377]  __get_user_pages+0xf9/0x620
[  363.117380]  ? update_load_avg+0x5c0/0x6f0
[  363.117383]  get_user_pages_remote+0x137/0x1f0
[  363.117388]  async_pf_execute+0x62/0x180
[  363.117393]  process_one_work+0x189/0x380
[  363.117397]  worker_thread+0x4d/0x3c0
[  363.117401]  kthread+0xf8/0x130
[  363.117404]  ? process_one_work+0x380/0x380
[  363.117407]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.117411]  ret_from_fork+0x32/0x40
[  363.117415] INFO: task kworker/5:1:207 blocked for more than 120 seconds.
[  363.117494]       Not tainted 4.15.0-rc8-zram #22
[  363.117567] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.117648] kworker/5:1     D    0   207      2 0x80000000
[  363.117654] Workqueue: events async_pf_execute
[  363.117655] Call Trace:
[  363.117659]  ? __schedule+0x3d0/0x850
[  363.117662]  schedule+0x32/0x80
[  363.117665]  io_schedule+0x12/0x40
[  363.117669]  __lock_page_or_retry+0x302/0x320
[  363.117673]  ? page_cache_tree_insert+0xb0/0xb0
[  363.117676]  do_swap_page+0x4dd/0x870
[  363.117680]  ? check_preempt_curr+0x83/0x90
[  363.117683]  __handle_mm_fault+0x790/0x10c0
[  363.117687]  handle_mm_fault+0xc6/0x1b0
[  363.117689]  __get_user_pages+0xf9/0x620
[  363.117693]  ? update_load_avg+0x5c0/0x6f0
[  363.117695]  get_user_pages_remote+0x137/0x1f0
[  363.117700]  async_pf_execute+0x62/0x180
[  363.117705]  process_one_work+0x189/0x380
[  363.117709]  worker_thread+0x4d/0x3c0
[  363.117713]  kthread+0xf8/0x130
[  363.117717]  ? process_one_work+0x380/0x380
[  363.117720]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.117723]  ret_from_fork+0x32/0x40
[  363.117731] INFO: task kworker/5:2:437 blocked for more than 120 seconds.
[  363.117828]       Not tainted 4.15.0-rc8-zram #22
[  363.117920] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.118039] kworker/5:2     D    0   437      2 0x80000000
[  363.118044] Workqueue: events async_pf_execute
[  363.118046] Call Trace:
[  363.118049]  ? __schedule+0x3d0/0x850
[  363.118052]  schedule+0x32/0x80
[  363.118056]  io_schedule+0x12/0x40
[  363.118060]  __lock_page_or_retry+0x302/0x320
[  363.118064]  ? page_cache_tree_insert+0xb0/0xb0
[  363.118067]  do_swap_page+0x4dd/0x870
[  363.118070]  __handle_mm_fault+0x790/0x10c0
[  363.118074]  handle_mm_fault+0xc6/0x1b0
[  363.118077]  __get_user_pages+0xf9/0x620
[  363.118080]  ? update_load_avg+0x5c0/0x6f0
[  363.118083]  get_user_pages_remote+0x137/0x1f0
[  363.118087]  async_pf_execute+0x62/0x180
[  363.118092]  process_one_work+0x189/0x380
[  363.118096]  worker_thread+0x4d/0x3c0
[  363.118100]  kthread+0xf8/0x130
[  363.118104]  ? process_one_work+0x380/0x380
[  363.118107]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.118113]  ? do_group_exit+0x3a/0xa0
[  363.118116]  ret_from_fork+0x32/0x40
[  363.118120] INFO: task kworker/8:2:501 blocked for more than 120 seconds.
[  363.118217]       Not tainted 4.15.0-rc8-zram #22
[  363.118309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.118427] kworker/8:2     D    0   501      2 0x80000000
[  363.118433] Workqueue: events async_pf_execute
[  363.118435] Call Trace:
[  363.118438]  ? __schedule+0x3d0/0x850
[  363.118441]  schedule+0x32/0x80
[  363.118445]  io_schedule+0x12/0x40
[  363.118448]  __lock_page_or_retry+0x302/0x320
[  363.118452]  ? page_cache_tree_insert+0xb0/0xb0
[  363.118455]  do_swap_page+0x4dd/0x870
[  363.118459]  __handle_mm_fault+0x790/0x10c0
[  363.118463]  handle_mm_fault+0xc6/0x1b0
[  363.118465]  __get_user_pages+0xf9/0x620
[  363.118469]  ? update_load_avg+0x5c0/0x6f0
[  363.118471]  get_user_pages_remote+0x137/0x1f0
[  363.118476]  async_pf_execute+0x62/0x180
[  363.118481]  process_one_work+0x189/0x380
[  363.118485]  worker_thread+0x4d/0x3c0
[  363.118489]  kthread+0xf8/0x130
[  363.118492]  ? process_one_work+0x380/0x380
[  363.118495]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.118498]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.118502]  ret_from_fork+0x32/0x40
[  363.118511] INFO: task qemu-system-x86:943 blocked for more than 120 seconds.
[  363.118609]       Not tainted 4.15.0-rc8-zram #22
[  363.118701] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.118820] qemu-system-x86 D    0   943    931 0x00000000
[  363.118822] Call Trace:
[  363.118826]  ? __schedule+0x3d0/0x850
[  363.118829]  schedule+0x32/0x80
[  363.118833]  io_schedule+0x12/0x40
[  363.118837]  __lock_page+0x109/0x130
[  363.118840]  ? page_cache_tree_insert+0xb0/0xb0
[  363.118844]  deferred_split_scan+0x1e9/0x2a0
[  363.118848]  shrink_slab.part.49+0x1e6/0x3d0
[  363.118852]  shrink_node+0x2e7/0x2f0
[  363.118856]  do_try_to_free_pages+0xd5/0x320
[  363.118859]  try_to_free_pages+0xd6/0x190
[  363.118864]  __alloc_pages_slowpath+0x34d/0xdc0
[  363.118869]  __alloc_pages_nodemask+0x214/0x230
[  363.118873]  do_huge_pmd_anonymous_page+0x13a/0x610
[  363.118877]  __handle_mm_fault+0xe04/0x10c0
[  363.118879]  ? kvm_vcpu_mmap+0x20/0x20
[  363.118884]  ? bsearch+0x52/0x90
[  363.118887]  handle_mm_fault+0xc6/0x1b0
[  363.118890]  __get_user_pages+0xf9/0x620
[  363.118893]  get_user_pages+0x3e/0x50
[  363.118898]  __gfn_to_pfn_memslot+0xff/0x3d0
[  363.118903]  try_async_pf+0x53/0x1d0
[  363.118907]  tdp_page_fault+0x10e/0x260
[  363.118912]  ? vmexit_fill_RSB+0x11/0x30
[  363.118915]  kvm_mmu_page_fault+0x59/0x130
[  363.118920]  vmx_handle_exit+0x9f/0x1530
[  363.118924]  ? vmexit_fill_RSB+0x11/0x30
[  363.118927]  ? vmx_vcpu_run+0x32f/0x4d0
[  363.118932]  kvm_arch_vcpu_ioctl_run+0x90c/0x1670
[  363.118936]  ? handle_machine_check+0x10/0x10
[  363.118938]  ? kvm_arch_vcpu_load+0x68/0x250
[  363.118943]  ? kvm_vcpu_ioctl+0x2e8/0x580
[  363.118946]  kvm_vcpu_ioctl+0x2e8/0x580
[  363.118952]  do_vfs_ioctl+0x92/0x5f0
[  363.118955]  ? handle_mm_fault+0xc6/0x1b0
[  363.118960]  ? kvm_on_user_return+0x68/0xa0
[  363.118964]  SyS_ioctl+0x74/0x80
[  363.118969]  entry_SYSCALL_64_fastpath+0x24/0x87
[  363.118972] RIP: 0033:0x7fcc51943dd7

--

[  967.078557] BUG: Bad page state in process qemu-system-x86  pfn:3f6d853
[  967.078637] page:000000004d26db38 count:0 mapcount:0 mapping:  (null) index:0x1
[  967.078715] flags: 0x17fffc000000008(uptodate)
[  967.078786] raw: 017fffc000000008 0000000000000000 0000000000000001 00000000ffffffff
[  967.078863] raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
[  967.078939] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[  967.079012] bad because of flags: 0x8(uptodate)
[  967.079081] Modules linked in: lz4 lz4_compress zram
[  967.079085] CPU: 2 PID: 946 Comm: qemu-system-x86 Not tainted 4.15.0-rc2-zram #4
[  967.079086] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
[  967.079087] Call Trace:
[  967.079093]  dump_stack+0x5c/0x84
[  967.079097]  bad_page+0xba/0x120
[  967.079098]  get_page_from_freelist+0xfe7/0x1230
[  967.079101]  __alloc_pages_nodemask+0xea/0x230
[  967.079104]  alloc_pages_vma+0x7c/0x1c0
[  967.079106]  do_swap_page+0x474/0x870
[  967.079109]  ? do_huge_pmd_anonymous_page+0x417/0x610
[  967.079110]  __handle_mm_fault+0xa53/0x1160
[  967.079112]  handle_mm_fault+0xc6/0x1b0
[  967.079114]  __get_user_pages+0xf9/0x620
[  967.079117]  get_user_pages+0x3e/0x50
[  967.079119]  __gfn_to_pfn_memslot+0xff/0x3d0
[  967.079122]  try_async_pf+0x53/0x1c0
[  967.079124]  tdp_page_fault+0x10e/0x260
[  967.079125]  kvm_mmu_page_fault+0x53/0x130
[  967.079128]  vmx_handle_exit+0x9c/0x1500
[  967.079129]  ? atomic_switch_perf_msrs+0x5f/0x80
[  967.079130]  ? vmx_vcpu_run+0x30a/0x4b0
[  967.079133]  kvm_arch_vcpu_ioctl_run+0x8dc/0x15e0
[  967.079135]  ? kvm_arch_vcpu_load+0x62/0x230
[  967.079136]  ? kvm_vcpu_ioctl+0x2e8/0x580
[  967.079137]  kvm_vcpu_ioctl+0x2e8/0x580
[  967.079141]  ? wake_up_q+0x70/0x70
[  967.079144]  do_vfs_ioctl+0x8f/0x5e0
[  967.079147]  ? kvm_on_user_return+0x68/0xa0
[  967.079148]  SyS_ioctl+0x74/0x80
[  967.079152]  entry_SYSCALL_64_fastpath+0x1e/0x81
[  967.079154] RIP: 0033:0x7f4410161dd7
[  967.079154] RSP: 002b:00007f43eeffc8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  967.079156] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f4410161dd7
[  967.079156] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000013
[  967.079157] RBP: 000055c5a36f8e50 R08: 000055c5a0fa73d0 R09: 00000000ffffffff
[  967.079158] R10: 003b83305f306bdf R11: 0000000000000246 R12: 0000000000000000
[  967.079158] R13: 00007f44157ad000 R14: 0000000000000000 R15: 000055c5a36f8e50
[  967.079160] Disabling lock debugging due to kernel taint
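(Side note on the dump above: the "bad because of flags" value is just the reported flags word masked against the PAGE_FLAGS_CHECK_AT_PREP set. The PG_uptodate bit position below matches this 4.15 x86_64 kernel; treat it as an assumption for other kernels or configs.)

```shell
# Check the "Bad page state" dump arithmetic: bit 3 is PG_uptodate here,
# so masking the reported flags word yields the "bad because of flags" value.
flags=0x17fffc000000008
pg_uptodate=$(( 1 << 3 ))    # PG_uptodate mask, i.e. 0x8
printf 'bad because of flags: 0x%x\n' $(( flags & pg_uptodate ))    # -> 0x8
```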

-- 
Kind regards,

Tino Lehnig

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26 12:35                 ` Tino Lehnig
@ 2018-07-27  9:14                   ` Minchan Kim
  2018-07-27 11:00                     ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-27  9:14 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 14468 bytes --]

I tried to reproduce this with KVM but was not successful, and I don't have
a real machine to reproduce it on. I am asking for one device for it.

Anyway, I want to try this patch.
Could you apply the two attached patches?

On Thu, Jul 26, 2018 at 02:35:15PM +0200, Tino Lehnig wrote:
> On 07/26/2018 12:30 PM, Minchan Kim wrote:
> > Huh, you see it without writeback? It's weird. Without writeback feature,
> > zram operation is always synchronous on memory compression/decompression
> > so you shouldn't see below io_schedule logic which happens only for
> > asynchronous IO operation.
> > Could you check one more time that it happens without writeback?
> 
> Confirmed. More kernel logs below. backing_dev was not set in this run. This
> does not happen with 4.15-rc9 and newer.

I am confused. You mean after 4.15-rc9, you are not seeing the *hung* problem?

> 
> > So, you mean you couldn't see the bad page state bug until 4.15-rc8?
> > You just see below hung message until 4.15-rc8, not bad page bug?
> > 
> 
> I did see the bad page state bug first in 4.15-rc2, but only very rarely. I
> have attached the log below the other one. Most test runs from commit
> 0bcac06f27d75 through 4.15-rc8 just resulted in hung task errors.

So you mean you see the bad page state bug with recent kernels, right?
It seems there are two problems now.

1. Hung and 2. bad page

Which of these bugs happens on which kernel version?
Could you clarify it?

> 
> --
> 
> [  363.116153] INFO: task kworker/0:1:130 blocked for more than 120 seconds.
> [  363.116237]       Not tainted 4.15.0-rc8-zram #22
> [  363.116311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  363.116392] kworker/0:1     D    0   130      2 0x80000000
> [  363.116405] Workqueue: events async_pf_execute
> [  363.116407] Call Trace:
> [  363.116416]  ? __schedule+0x3d0/0x850
> [  363.116419]  schedule+0x32/0x80
> [  363.116426]  io_schedule+0x12/0x40
> [  363.116433]  __lock_page_or_retry+0x302/0x320
> [  363.116437]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.116442]  do_swap_page+0x4dd/0x870
> [  363.116446]  __handle_mm_fault+0x790/0x10c0
> [  363.116450]  handle_mm_fault+0xc6/0x1b0
> [  363.116453]  __get_user_pages+0xf9/0x620
> [  363.116457]  get_user_pages_remote+0x137/0x1f0
> [  363.116462]  async_pf_execute+0x62/0x180
> [  363.116469]  process_one_work+0x189/0x380
> [  363.116474]  worker_thread+0x4d/0x3c0
> [  363.116478]  kthread+0xf8/0x130
> [  363.116482]  ? process_one_work+0x380/0x380
> [  363.116486]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.116492]  ret_from_fork+0x32/0x40
> [  363.116498] INFO: task kswapd0:159 blocked for more than 120 seconds.
> [  363.116576]       Not tainted 4.15.0-rc8-zram #22
> [  363.116650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.116731] kswapd0         D    0   159      2 0x80000000
> [  363.116734] Call Trace:
> [  363.116738]  ? __schedule+0x3d0/0x850
> [  363.116741]  schedule+0x32/0x80
> [  363.116745]  io_schedule+0x12/0x40
> [  363.116749]  __lock_page+0x109/0x130
> [  363.116753]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.116759]  deferred_split_scan+0x1e9/0x2a0
> [  363.116762]  shrink_slab.part.49+0x1e6/0x3d0
> [  363.116768]  shrink_node+0x2e7/0x2f0
> [  363.116771]  kswapd+0x35b/0x6f0
> [  363.116776]  kthread+0xf8/0x130
> [  363.116779]  ? mem_cgroup_shrink_node+0x150/0x150
> [  363.116782]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.116786]  ret_from_fork+0x32/0x40
> [  363.116790] INFO: task kworker/9:1:196 blocked for more than 120 seconds.
> [  363.116869]       Not tainted 4.15.0-rc8-zram #22
> [  363.116942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.117023] kworker/9:1     D    0   196      2 0x80000000
> [  363.117029] Workqueue: events async_pf_execute
> [  363.117031] Call Trace:
> [  363.117034]  ? __schedule+0x3d0/0x850
> [  363.117037]  schedule+0x32/0x80
> [  363.117041]  io_schedule+0x12/0x40
> [  363.117046]  __lock_page_or_retry+0x302/0x320
> [  363.117050]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.117053]  do_swap_page+0x4dd/0x870
> [  363.117057]  __handle_mm_fault+0x790/0x10c0
> [  363.117061]  handle_mm_fault+0xc6/0x1b0
> [  363.117063]  __get_user_pages+0xf9/0x620
> [  363.117068]  ? update_load_avg+0x5c0/0x6f0
> [  363.117071]  get_user_pages_remote+0x137/0x1f0
> [  363.117076]  async_pf_execute+0x62/0x180
> [  363.117081]  process_one_work+0x189/0x380
> [  363.117085]  worker_thread+0x4d/0x3c0
> [  363.117089]  kthread+0xf8/0x130
> [  363.117093]  ? process_one_work+0x380/0x380
> [  363.117096]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.117100]  ret_from_fork+0x32/0x40
> [  363.117104] INFO: task kworker/19:1:201 blocked for more than 120 seconds.
> [  363.117183]       Not tainted 4.15.0-rc8-zram #22
> [  363.117256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.117337] kworker/19:1    D    0   201      2 0x80000000
> [  363.117343] Workqueue: events async_pf_execute
> [  363.117345] Call Trace:
> [  363.117349]  ? __schedule+0x3d0/0x850
> [  363.117352]  schedule+0x32/0x80
> [  363.117356]  io_schedule+0x12/0x40
> [  363.117360]  __lock_page_or_retry+0x302/0x320
> [  363.117364]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.117367]  do_swap_page+0x4dd/0x870
> [  363.117370]  __handle_mm_fault+0x790/0x10c0
> [  363.117374]  handle_mm_fault+0xc6/0x1b0
> [  363.117377]  __get_user_pages+0xf9/0x620
> [  363.117380]  ? update_load_avg+0x5c0/0x6f0
> [  363.117383]  get_user_pages_remote+0x137/0x1f0
> [  363.117388]  async_pf_execute+0x62/0x180
> [  363.117393]  process_one_work+0x189/0x380
> [  363.117397]  worker_thread+0x4d/0x3c0
> [  363.117401]  kthread+0xf8/0x130
> [  363.117404]  ? process_one_work+0x380/0x380
> [  363.117407]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.117411]  ret_from_fork+0x32/0x40
> [  363.117415] INFO: task kworker/5:1:207 blocked for more than 120 seconds.
> [  363.117494]       Not tainted 4.15.0-rc8-zram #22
> [  363.117567] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.117648] kworker/5:1     D    0   207      2 0x80000000
> [  363.117654] Workqueue: events async_pf_execute
> [  363.117655] Call Trace:
> [  363.117659]  ? __schedule+0x3d0/0x850
> [  363.117662]  schedule+0x32/0x80
> [  363.117665]  io_schedule+0x12/0x40
> [  363.117669]  __lock_page_or_retry+0x302/0x320
> [  363.117673]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.117676]  do_swap_page+0x4dd/0x870
> [  363.117680]  ? check_preempt_curr+0x83/0x90
> [  363.117683]  __handle_mm_fault+0x790/0x10c0
> [  363.117687]  handle_mm_fault+0xc6/0x1b0
> [  363.117689]  __get_user_pages+0xf9/0x620
> [  363.117693]  ? update_load_avg+0x5c0/0x6f0
> [  363.117695]  get_user_pages_remote+0x137/0x1f0
> [  363.117700]  async_pf_execute+0x62/0x180
> [  363.117705]  process_one_work+0x189/0x380
> [  363.117709]  worker_thread+0x4d/0x3c0
> [  363.117713]  kthread+0xf8/0x130
> [  363.117717]  ? process_one_work+0x380/0x380
> [  363.117720]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.117723]  ret_from_fork+0x32/0x40
> [  363.117731] INFO: task kworker/5:2:437 blocked for more than 120 seconds.
> [  363.117828]       Not tainted 4.15.0-rc8-zram #22
> [  363.117920] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.118039] kworker/5:2     D    0   437      2 0x80000000
> [  363.118044] Workqueue: events async_pf_execute
> [  363.118046] Call Trace:
> [  363.118049]  ? __schedule+0x3d0/0x850
> [  363.118052]  schedule+0x32/0x80
> [  363.118056]  io_schedule+0x12/0x40
> [  363.118060]  __lock_page_or_retry+0x302/0x320
> [  363.118064]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.118067]  do_swap_page+0x4dd/0x870
> [  363.118070]  __handle_mm_fault+0x790/0x10c0
> [  363.118074]  handle_mm_fault+0xc6/0x1b0
> [  363.118077]  __get_user_pages+0xf9/0x620
> [  363.118080]  ? update_load_avg+0x5c0/0x6f0
> [  363.118083]  get_user_pages_remote+0x137/0x1f0
> [  363.118087]  async_pf_execute+0x62/0x180
> [  363.118092]  process_one_work+0x189/0x380
> [  363.118096]  worker_thread+0x4d/0x3c0
> [  363.118100]  kthread+0xf8/0x130
> [  363.118104]  ? process_one_work+0x380/0x380
> [  363.118107]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.118113]  ? do_group_exit+0x3a/0xa0
> [  363.118116]  ret_from_fork+0x32/0x40
> [  363.118120] INFO: task kworker/8:2:501 blocked for more than 120 seconds.
> [  363.118217]       Not tainted 4.15.0-rc8-zram #22
> [  363.118309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.118427] kworker/8:2     D    0   501      2 0x80000000
> [  363.118433] Workqueue: events async_pf_execute
> [  363.118435] Call Trace:
> [  363.118438]  ? __schedule+0x3d0/0x850
> [  363.118441]  schedule+0x32/0x80
> [  363.118445]  io_schedule+0x12/0x40
> [  363.118448]  __lock_page_or_retry+0x302/0x320
> [  363.118452]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.118455]  do_swap_page+0x4dd/0x870
> [  363.118459]  __handle_mm_fault+0x790/0x10c0
> [  363.118463]  handle_mm_fault+0xc6/0x1b0
> [  363.118465]  __get_user_pages+0xf9/0x620
> [  363.118469]  ? update_load_avg+0x5c0/0x6f0
> [  363.118471]  get_user_pages_remote+0x137/0x1f0
> [  363.118476]  async_pf_execute+0x62/0x180
> [  363.118481]  process_one_work+0x189/0x380
> [  363.118485]  worker_thread+0x4d/0x3c0
> [  363.118489]  kthread+0xf8/0x130
> [  363.118492]  ? process_one_work+0x380/0x380
> [  363.118495]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.118498]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.118502]  ret_from_fork+0x32/0x40
> [  363.118511] INFO: task qemu-system-x86:943 blocked for more than 120 seconds.
> [  363.118609]       Not tainted 4.15.0-rc8-zram #22
> [  363.118701] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.118820] qemu-system-x86 D    0   943    931 0x00000000
> [  363.118822] Call Trace:
> [  363.118826]  ? __schedule+0x3d0/0x850
> [  363.118829]  schedule+0x32/0x80
> [  363.118833]  io_schedule+0x12/0x40
> [  363.118837]  __lock_page+0x109/0x130
> [  363.118840]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.118844]  deferred_split_scan+0x1e9/0x2a0
> [  363.118848]  shrink_slab.part.49+0x1e6/0x3d0
> [  363.118852]  shrink_node+0x2e7/0x2f0
> [  363.118856]  do_try_to_free_pages+0xd5/0x320
> [  363.118859]  try_to_free_pages+0xd6/0x190
> [  363.118864]  __alloc_pages_slowpath+0x34d/0xdc0
> [  363.118869]  __alloc_pages_nodemask+0x214/0x230
> [  363.118873]  do_huge_pmd_anonymous_page+0x13a/0x610
> [  363.118877]  __handle_mm_fault+0xe04/0x10c0
> [  363.118879]  ? kvm_vcpu_mmap+0x20/0x20
> [  363.118884]  ? bsearch+0x52/0x90
> [  363.118887]  handle_mm_fault+0xc6/0x1b0
> [  363.118890]  __get_user_pages+0xf9/0x620
> [  363.118893]  get_user_pages+0x3e/0x50
> [  363.118898]  __gfn_to_pfn_memslot+0xff/0x3d0
> [  363.118903]  try_async_pf+0x53/0x1d0
> [  363.118907]  tdp_page_fault+0x10e/0x260
> [  363.118912]  ? vmexit_fill_RSB+0x11/0x30
> [  363.118915]  kvm_mmu_page_fault+0x59/0x130
> [  363.118920]  vmx_handle_exit+0x9f/0x1530
> [  363.118924]  ? vmexit_fill_RSB+0x11/0x30
> [  363.118927]  ? vmx_vcpu_run+0x32f/0x4d0
> [  363.118932]  kvm_arch_vcpu_ioctl_run+0x90c/0x1670
> [  363.118936]  ? handle_machine_check+0x10/0x10
> [  363.118938]  ? kvm_arch_vcpu_load+0x68/0x250
> [  363.118943]  ? kvm_vcpu_ioctl+0x2e8/0x580
> [  363.118946]  kvm_vcpu_ioctl+0x2e8/0x580
> [  363.118952]  do_vfs_ioctl+0x92/0x5f0
> [  363.118955]  ? handle_mm_fault+0xc6/0x1b0
> [  363.118960]  ? kvm_on_user_return+0x68/0xa0
> [  363.118964]  SyS_ioctl+0x74/0x80
> [  363.118969]  entry_SYSCALL_64_fastpath+0x24/0x87
> [  363.118972] RIP: 0033:0x7fcc51943dd7
> 
> --
> 
> [  967.078557] BUG: Bad page state in process qemu-system-x86  pfn:3f6d853
> [  967.078637] page:000000004d26db38 count:0 mapcount:0 mapping:  (null) index:0x1
> [  967.078715] flags: 0x17fffc000000008(uptodate)
> [  967.078786] raw: 017fffc000000008 0000000000000000 0000000000000001 00000000ffffffff
> [  967.078863] raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
> [  967.078939] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  967.079012] bad because of flags: 0x8(uptodate)
> [  967.079081] Modules linked in: lz4 lz4_compress zram
> [  967.079085] CPU: 2 PID: 946 Comm: qemu-system-x86 Not tainted 4.15.0-rc2-zram #4
> [  967.079086] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
> [  967.079087] Call Trace:
> [  967.079093]  dump_stack+0x5c/0x84
> [  967.079097]  bad_page+0xba/0x120
> [  967.079098]  get_page_from_freelist+0xfe7/0x1230
> [  967.079101]  __alloc_pages_nodemask+0xea/0x230
> [  967.079104]  alloc_pages_vma+0x7c/0x1c0
> [  967.079106]  do_swap_page+0x474/0x870
> [  967.079109]  ? do_huge_pmd_anonymous_page+0x417/0x610
> [  967.079110]  __handle_mm_fault+0xa53/0x1160
> [  967.079112]  handle_mm_fault+0xc6/0x1b0
> [  967.079114]  __get_user_pages+0xf9/0x620
> [  967.079117]  get_user_pages+0x3e/0x50
> [  967.079119]  __gfn_to_pfn_memslot+0xff/0x3d0
> [  967.079122]  try_async_pf+0x53/0x1c0
> [  967.079124]  tdp_page_fault+0x10e/0x260
> [  967.079125]  kvm_mmu_page_fault+0x53/0x130
> [  967.079128]  vmx_handle_exit+0x9c/0x1500
> [  967.079129]  ? atomic_switch_perf_msrs+0x5f/0x80
> [  967.079130]  ? vmx_vcpu_run+0x30a/0x4b0
> [  967.079133]  kvm_arch_vcpu_ioctl_run+0x8dc/0x15e0
> [  967.079135]  ? kvm_arch_vcpu_load+0x62/0x230
> [  967.079136]  ? kvm_vcpu_ioctl+0x2e8/0x580
> [  967.079137]  kvm_vcpu_ioctl+0x2e8/0x580
> [  967.079141]  ? wake_up_q+0x70/0x70
> [  967.079144]  do_vfs_ioctl+0x8f/0x5e0
> [  967.079147]  ? kvm_on_user_return+0x68/0xa0
> [  967.079148]  SyS_ioctl+0x74/0x80
> [  967.079152]  entry_SYSCALL_64_fastpath+0x1e/0x81
> [  967.079154] RIP: 0033:0x7f4410161dd7
> [  967.079154] RSP: 002b:00007f43eeffc8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  967.079156] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f4410161dd7
> [  967.079156] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000013
> [  967.079157] RBP: 000055c5a36f8e50 R08: 000055c5a0fa73d0 R09: 00000000ffffffff
> [  967.079158] R10: 003b83305f306bdf R11: 0000000000000246 R12: 0000000000000000
> [  967.079158] R13: 00007f44157ad000 R14: 0000000000000000 R15: 000055c5a36f8e50
> [  967.079160] Disabling lock debugging due to kernel taint
> 
> -- 
> Kind regards,
> 
> Tino Lehnig

[-- Attachment #2: 0001-zram-remove-BD_CAP_SYNCHRONOUS_IO-with-writeback-fea.patch --]
[-- Type: text/x-diff, Size: 1193 bytes --]

From fa946405d137fbc9a433f9a150929e93f7f0b308 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 27 Jul 2018 15:15:33 +0900
Subject: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature

If zram supports the writeback feature, it is no longer a synchronous
device because we need asynchronous IO operations.

Do not pretend to be a synchronous IO device.
It makes the system very sluggish because the upper layer
waits for IO completion.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7436b2d27fa3..8610987b7b5a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1699,6 +1699,11 @@ static int zram_add(void)
 
 	zram->disk->queue->backing_dev_info->capabilities |=
 			(BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);
+#ifdef CONFIG_ZRAM_WRITEBACK
+	if (zram->backing_dev)
+		zram->disk->queue->backing_dev_info->capabilities &=
+			~BDI_CAP_SYNCHRONOUS_IO;
+#endif
 	add_disk(zram->disk);
 
 	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-- 
2.18.0.345.g5c9ce644c3-goog


[-- Attachment #3: 0002-swap-free-allocated-page-if-swap_read-fails.patch --]
[-- Type: text/x-diff, Size: 936 bytes --]

From be50669d38fd61259e787e337a7c82d73b42bb91 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 27 Jul 2018 17:17:01 +0900
Subject: [PATCH 2/2] swap: free allocated page if swap_read fails

swap_readpage() can fail with -ENOMEM. In this case, we should free
the allocated page immediately. There is no reason to retry.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..8a5e304ffd91 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2943,7 +2943,8 @@ int do_swap_page(struct vm_fault *vmf)
 				__SetPageSwapBacked(page);
 				set_page_private(page, entry.val);
 				lru_cache_add_anon(page);
-				swap_readpage(page, true);
+				if (swap_readpage(page, true))
+					goto out_page;
 			}
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-- 
2.18.0.345.g5c9ce644c3-goog



* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27  9:14                   ` Minchan Kim
@ 2018-07-27 11:00                     ` Tino Lehnig
  2018-07-27 12:05                       ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-27 11:00 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/27/2018 11:14 AM, Minchan Kim wrote:
> I tried to reproduce this with KVM but was not successful, and I don't have
> a real machine to reproduce it on. I am asking for one device for it.
> 
> Anyway, I want to try this patch.
> Could you apply the two attached patches?

Thanks, I applied the patches on 4.18-rc6, but unfortunately, they do 
not solve the problem for me. Kernel message below.

> I am confused. You mean after 4.15-rc9, you are not seeing the *hung* problem?

Correct.

> So you mean you see the bad page state bug with recent kernels, right?
> It seems there are two problems now.
> 
> 1. Hung and 2. bad page
> 
> Which of these bugs happens on which kernel version?
> Could you clarify it?

* pre 0bcac06f27d75 (4.15-rc1): all good
* 4.15-rc1: hung task (I have not encountered bad page here yet...)
* 4.15-rc2 through 4.15-rc8: hung task + bad page (very rare)
* 4.15-rc9 and newer: bad page

--

[  809.149272] BUG: Bad page state in process kvm  pfn:1cb08a8
[  809.149332] flags: 0x57ffffc0000008(uptodate)
[  809.149350] raw: 0057ffffc0000008 dead000000000100 dead000000000200 0000000000000000
[  809.149378] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[  809.149405] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[  809.149427] bad because of flags: 0x8(uptodate)
[  809.149444] Modules linked in: lz4 lz4_compress zram
[  809.149450] CPU: 14 PID: 3734 Comm: kvm Not tainted 4.18.0-rc6+ #1
[  809.149450] Hardware name: Supermicro Super Server/X10DRL-i, BIOS 3.0a 02/09/2018
[  809.149451] Call Trace:
[  809.149458]  dump_stack+0x63/0x85
[  809.149463]  bad_page+0xc1/0x120
[  809.149465]  check_new_page_bad+0x67/0x80
[  809.149467]  get_page_from_freelist+0xe25/0x12f0
[  809.149469]  __alloc_pages_nodemask+0xfd/0x280
[  809.149472]  alloc_pages_vma+0x88/0x1c0
[  809.149475]  do_swap_page+0x346/0x910
[  809.149477]  __handle_mm_fault+0x815/0x1170
[  809.149479]  handle_mm_fault+0x102/0x200
[  809.149481]  __get_user_pages+0x131/0x680
[  809.149483]  get_user_pages_unlocked+0x145/0x1e0
[  809.149488]  __gfn_to_pfn_memslot+0x10b/0x3c0
[  809.149491]  try_async_pf+0x86/0x230
[  809.149494]  tdp_page_fault+0x12d/0x290
[  809.149496]  kvm_mmu_page_fault+0x74/0x5d0
[  809.149499]  ? call_function_interrupt+0xa/0x20
[  809.149502]  ? vmexit_fill_RSB+0x10/0x40
[  809.149503]  ? vmexit_fill_RSB+0x1c/0x40
[  809.149504]  ? vmexit_fill_RSB+0x10/0x40
[  809.149505]  ? vmexit_fill_RSB+0x1c/0x40
[  809.149506]  ? vmexit_fill_RSB+0x10/0x40
[  809.149507]  ? vmexit_fill_RSB+0x1c/0x40
[  809.149508]  ? vmexit_fill_RSB+0x10/0x40
[  809.149509]  ? vmexit_fill_RSB+0x1c/0x40
[  809.149510]  ? vmexit_fill_RSB+0x10/0x40
[  809.149513]  handle_ept_violation+0xdf/0x1a0
[  809.149514]  vmx_handle_exit+0xa5/0x11c0
[  809.149516]  ? vmx_vcpu_run+0x3bb/0x620
[  809.149519]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1980
[  809.149522]  kvm_vcpu_ioctl+0x3a0/0x5e0
[  809.149523]  ? kvm_vcpu_ioctl+0x3a0/0x5e0
[  809.149526]  do_vfs_ioctl+0xa6/0x620
[  809.149527]  ksys_ioctl+0x75/0x80
[  809.149529]  __x64_sys_ioctl+0x1a/0x20
[  809.149532]  do_syscall_64+0x5a/0x110
[  809.149534]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  809.149536] RIP: 0033:0x7fd3c5572dd7
[  809.149536] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
[  809.149563] RSP: 002b:00007fd3b07fc538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  809.149565] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fd3c5572dd7
[  809.149566] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000014
[  809.149566] RBP: 00007fd3b9b13000 R08: 0000558cb94bb350 R09: 00000000ffffffff
[  809.149567] R10: 0005577fd3b06fe6 R11: 0000000000000246 R12: 0000000000000000
[  809.149568] R13: 00007fd3ba146000 R14: 0000000000000000 R15: 00007fd3b9b13000
[  809.149570] Disabling lock debugging due to kernel taint

-- 
Kind regards,

Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27 11:00                     ` Tino Lehnig
@ 2018-07-27 12:05                       ` Minchan Kim
  2018-07-27 12:13                         ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-27 12:05 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Fri, Jul 27, 2018 at 01:00:01PM +0200, Tino Lehnig wrote:
> On 07/27/2018 11:14 AM, Minchan Kim wrote:
> > I tried to reproduce this with KVM but was not successful, and I don't have
> > a real machine to reproduce it on. I am asking for one device for it.
> > 
> > Anyway, I want to try this patch.
> > Could you apply the two attached patches?
> 
> Thanks, I applied the patches on 4.18-rc6, but unfortunately, they do not
> solve the problem for me. Kernel message below.

Thanks for testing.

> 
> > I am confused. You mean after 4.15-rc9, you are not seeing the *hung* problem?
> 
> Correct.
> 
> > So you mean you see the bad page state bug with recent kernels, right?
> > It seems there are two problems now.
> > 
> > 1. Hung and 2. bad page
> > 
> > Which of these bugs happens on which kernel version?
> > Could you clarify it?
> 
> * pre 0bcac06f27d75 (4.15-rc1): all good
> * 4.15-rc1: hung task (I have not encountered bad page here yet...)
> * 4.15-rc2 through 4.15-rc8: hung task + bad page (very rare)
> * 4.15-rc9 and newer: bad page

And does the bad page bug always occur with writeback enabled?

Writeback enabled means 'echo "some dev" > /sys/block/zram0/backing_dev',
not just enabling CONFIG_ZRAM_WRITEBACK.
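For completeness, a minimal sketch of what "writeback enabled" entails; the device name and sizes below are illustrative assumptions, not the reporter's exact configuration:

```shell
# Minimal zram-with-writeback setup sketch (/dev/sdb1 and 4G are assumptions).
modprobe zram num_devices=1
# backing_dev and comp_algorithm must be written before disksize is set
echo /dev/sdb1 > /sys/block/zram0/backing_dev
echo lz4 > /sys/block/zram0/comp_algorithm
echo 4G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0
```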

> --
> 
> [  809.149272] BUG: Bad page state in process kvm  pfn:1cb08a8
> [  809.149332] flags: 0x57ffffc0000008(uptodate)
> [  809.149350] raw: 0057ffffc0000008 dead000000000100 dead000000000200 0000000000000000
> [  809.149378] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> [  809.149405] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  809.149427] bad because of flags: 0x8(uptodate)
> [  809.149444] Modules linked in: lz4 lz4_compress zram
> [  809.149450] CPU: 14 PID: 3734 Comm: kvm Not tainted 4.18.0-rc6+ #1
> [  809.149450] Hardware name: Supermicro Super Server/X10DRL-i, BIOS 3.0a 02/09/2018
> [  809.149451] Call Trace:
> [  809.149458]  dump_stack+0x63/0x85
> [  809.149463]  bad_page+0xc1/0x120
> [  809.149465]  check_new_page_bad+0x67/0x80
> [  809.149467]  get_page_from_freelist+0xe25/0x12f0
> [  809.149469]  __alloc_pages_nodemask+0xfd/0x280
> [  809.149472]  alloc_pages_vma+0x88/0x1c0
> [  809.149475]  do_swap_page+0x346/0x910
> [  809.149477]  __handle_mm_fault+0x815/0x1170
> [  809.149479]  handle_mm_fault+0x102/0x200
> [  809.149481]  __get_user_pages+0x131/0x680
> [  809.149483]  get_user_pages_unlocked+0x145/0x1e0
> [  809.149488]  __gfn_to_pfn_memslot+0x10b/0x3c0
> [  809.149491]  try_async_pf+0x86/0x230
> [  809.149494]  tdp_page_fault+0x12d/0x290
> [  809.149496]  kvm_mmu_page_fault+0x74/0x5d0
> [  809.149499]  ? call_function_interrupt+0xa/0x20
> [  809.149502]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149503]  ? vmexit_fill_RSB+0x1c/0x40
> [  809.149504]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149505]  ? vmexit_fill_RSB+0x1c/0x40
> [  809.149506]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149507]  ? vmexit_fill_RSB+0x1c/0x40
> [  809.149508]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149509]  ? vmexit_fill_RSB+0x1c/0x40
> [  809.149510]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149513]  handle_ept_violation+0xdf/0x1a0
> [  809.149514]  vmx_handle_exit+0xa5/0x11c0
> [  809.149516]  ? vmx_vcpu_run+0x3bb/0x620
> [  809.149519]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1980
> [  809.149522]  kvm_vcpu_ioctl+0x3a0/0x5e0
> [  809.149523]  ? kvm_vcpu_ioctl+0x3a0/0x5e0
> [  809.149526]  do_vfs_ioctl+0xa6/0x620
> [  809.149527]  ksys_ioctl+0x75/0x80
> [  809.149529]  __x64_sys_ioctl+0x1a/0x20
> [  809.149532]  do_syscall_64+0x5a/0x110
> [  809.149534]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  809.149536] RIP: 0033:0x7fd3c5572dd7
> [  809.149536] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
> [  809.149563] RSP: 002b:00007fd3b07fc538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  809.149565] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fd3c5572dd7
> [  809.149566] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000014
> [  809.149566] RBP: 00007fd3b9b13000 R08: 0000558cb94bb350 R09: 00000000ffffffff
> [  809.149567] R10: 0005577fd3b06fe6 R11: 0000000000000246 R12: 0000000000000000
> [  809.149568] R13: 00007fd3ba146000 R14: 0000000000000000 R15: 00007fd3b9b13000
> [  809.149570] Disabling lock debugging due to kernel taint
> 
> -- 
> Kind regards,
> 
> Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27 12:05                       ` Minchan Kim
@ 2018-07-27 12:13                         ` Tino Lehnig
  2018-07-27 22:58                           ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-27 12:13 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/27/2018 02:05 PM, Minchan Kim wrote:
> And does the bad page bug always occur with writeback enabled?
> 
> Writeback enabled means 'echo "some dev" > /sys/block/zram0/backing_dev',
> not just enabling CONFIG_ZRAM_WRITEBACK.

Yes, the bug only appears when backing_dev is set.
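A quick way to confirm that state on a test host (zram0 is an assumption; adjust for the device under test):

```shell
# Confirm whether zram0 actually has a backing device configured;
# this prints "none" when no backing_dev has been set.
cat /sys/block/zram0/backing_dev
# Confirm zram is actually serving swap at the time of the test.
grep zram /proc/swaps
```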

-- 
Kind regards,

Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27 12:13                         ` Tino Lehnig
@ 2018-07-27 22:58                           ` Minchan Kim
  2018-07-30  6:09                             ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-27 22:58 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 475 bytes --]

On Fri, Jul 27, 2018 at 02:13:57PM +0200, Tino Lehnig wrote:
> On 07/27/2018 02:05 PM, Minchan Kim wrote:
> > And does the bad page bug always occur with writeback enabled?
> > 
> > Writeback enabled means 'echo "some dev" > /sys/block/zram0/backing_dev',
> > not just enabling CONFIG_ZRAM_WRITEBACK.
> 
> Yes, the bug only appears when backing_dev is set.

Thanks for the clarification.

I made a mistake in the previous patch.
Could you test these patches?

> 
> -- 
> Kind regards,
> 
> Tino Lehnig

[-- Attachment #2: 0001-zram-remove-BD_CAP_SYNCHRONOUS_IO-with-writeback-fea.patch --]
[-- Type: text/x-diff, Size: 1706 bytes --]

From 77a5fc378dfae733af5a0f0c7ef901668d8c9778 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 27 Jul 2018 15:15:33 +0900
Subject: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature

If zram supports the writeback feature, it is no longer a synchronous
device because we need asynchronous IO operations.

Do not pretend to be a synchronous IO device. It makes the system
very sluggish because the upper layer waits for IO completion.
Furthermore, it causes a use-after-free problem: swap thinks the
operation is done as soon as the IO function returns, so it may free
the page at will, but the IO is actually asynchronous, so the driver
could access the freed page afterward.
(I will make the description clearer in the formal patch.)

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7436b2d27fa3..0b6eda1bd77a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -298,7 +298,8 @@ static void reset_bdev(struct zram *zram)
 	zram->backing_dev = NULL;
 	zram->old_block_size = 0;
 	zram->bdev = NULL;
-
+	zram->disk->queue->backing_dev_info->capabilities |=
+				BDI_CAP_SYNCHRONOUS_IO;
 	kvfree(zram->bitmap);
 	zram->bitmap = NULL;
 }
@@ -400,6 +401,8 @@ static ssize_t backing_dev_store(struct device *dev,
 	zram->backing_dev = backing_dev;
 	zram->bitmap = bitmap;
 	zram->nr_pages = nr_pages;
+	zram->disk->queue->backing_dev_info->capabilities &=
+			~BDI_CAP_SYNCHRONOUS_IO;
 	up_write(&zram->init_lock);
 
 	pr_info("setup backing device %s\n", file_name);
-- 
2.18.0.345.g5c9ce644c3-goog


[-- Attachment #3: 0002-swap-free-allocated-page-if-swap_read-fails.patch --]
[-- Type: text/x-diff, Size: 949 bytes --]

From 7c263ac8b9557c60c631c239cc7b863ed762098f Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 27 Jul 2018 17:17:01 +0900
Subject: [PATCH 2/2] swap: free allocated page if swap_read fails

swap_readpage() can fail with -ENOMEM. In this case, we should free
the allocated page immediately and bail out. There is no reason to retry.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..8a5e304ffd91 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2943,7 +2943,8 @@ int do_swap_page(struct vm_fault *vmf)
 				__SetPageSwapBacked(page);
 				set_page_private(page, entry.val);
 				lru_cache_add_anon(page);
-				swap_readpage(page, true);
+				if (swap_readpage(page, true))
+					goto out_page;
 			}
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-- 
2.18.0.345.g5c9ce644c3-goog
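The error path added by this patch follows a common pattern: if a read into a freshly allocated page fails, release the page and propagate the error instead of retrying. A minimal user-space sketch of that flow, with `fake_swap_readpage`/`fault_in_page` as hypothetical stand-ins for the kernel functions:

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-in for swap_readpage(): returns 0 on success or a negative
 * errno such as -ENOMEM on failure. */
static int fake_swap_readpage(void *page, int simulate_error)
{
	(void)page;
	return simulate_error ? -ENOMEM : 0;
}

/* Mirrors the fixed do_swap_page() flow: allocate a page, try to read
 * into it, and on failure free the page and bail out. */
static int fault_in_page(int simulate_error)
{
	void *page = malloc(4096);
	int ret;

	if (!page)
		return -ENOMEM;

	ret = fake_swap_readpage(page, simulate_error);
	if (ret) {
		free(page);	/* the "goto out_page" path in the patch */
		return ret;
	}

	free(page);		/* success path; placeholder for normal use */
	return 0;
}
```

Before the patch, a failed read left the page on the LRU and the fault was retried; freeing it immediately is both simpler and correct, since -ENOMEM will not resolve by repeating the same read.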



* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27 22:58                           ` Minchan Kim
@ 2018-07-30  6:09                             ` Tino Lehnig
  2018-08-02  5:15                               ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-30  6:09 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/28/2018 12:58 AM, Minchan Kim wrote:
> I made a mistake in the previous patch.
> Could you test these patches?

Thanks! Looking good so far! No errors whatsoever with the new patch. I 
will let my test workload run for a while to be sure, but I think we 
are good.

-- 
Kind regards,

Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-30  6:09                             ` Tino Lehnig
@ 2018-08-02  5:15                               ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2018-08-02  5:15 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Mon, Jul 30, 2018 at 08:09:33AM +0200, Tino Lehnig wrote:
> On 07/28/2018 12:58 AM, Minchan Kim wrote:
> > I made a mistake in the previous patch.
> > Could you test these patches?
> 
> Thanks! Looking good so far! No errors whatsoever with the new patch. I will
> let my test workload run for a while to be sure, but I think we are good.

Thanks for confirming, Tino.
I believe you haven't seen any problems since then, so I sent the formal
patch with your 'Tested-by' tag:
https://lore.kernel.org/lkml/20180802051112.86174-1-minchan@kernel.org/
If you see any problems, feel free to reply in that thread.

Thanks for the report and help!


end of thread, other threads:[~2018-08-02  5:15 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-23 12:29 Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process Tino Lehnig
2018-07-24  1:03 ` Minchan Kim
2018-07-24  2:53   ` Sergey Senozhatsky
2018-07-24  6:47     ` Minchan Kim
2018-07-24  7:30   ` Tino Lehnig
2018-07-25  1:32     ` Minchan Kim
2018-07-25  1:55       ` Matthew Wilcox
2018-07-25  2:16         ` Minchan Kim
2018-07-25  2:35           ` Matthew Wilcox
2018-07-25  2:51             ` Minchan Kim
2018-07-25  2:55               ` Matthew Wilcox
2018-07-25  3:02                 ` Minchan Kim
2018-07-25  2:51       ` Matthew Wilcox
2018-07-25  4:07         ` Sergey Senozhatsky
2018-07-25 13:21     ` Minchan Kim
2018-07-25 15:12       ` Tino Lehnig
2018-07-26  2:03         ` Minchan Kim
2018-07-26  6:10           ` Tino Lehnig
2018-07-26  6:21             ` Minchan Kim
2018-07-26  6:34               ` Tino Lehnig
2018-07-26 10:00             ` Tino Lehnig
2018-07-26 10:30               ` Minchan Kim
2018-07-26 12:35                 ` Tino Lehnig
2018-07-27  9:14                   ` Minchan Kim
2018-07-27 11:00                     ` Tino Lehnig
2018-07-27 12:05                       ` Minchan Kim
2018-07-27 12:13                         ` Tino Lehnig
2018-07-27 22:58                           ` Minchan Kim
2018-07-30  6:09                             ` Tino Lehnig
2018-08-02  5:15                               ` Minchan Kim
