* Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
@ 2018-07-23 12:29 Tino Lehnig
  2018-07-24  1:03 ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-23 12:29 UTC (permalink / raw)
  To: minchan, ngupta; +Cc: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 7164 bytes --]

Hello,

After enabling the writeback feature in zram, I encountered the kernel 
bug below with heavy swap utilization. There is one specific workload 
that triggers the bug reliably and that is running Windows in KVM while 
overcommitting memory. The Windows VMs would fill all allocated memory 
with zero pages while booting. A few seconds after the host hits zram 
swap, the console on the host is flooded with the bug message. A few 
more seconds later I also encountered filesystem errors on the host 
causing the root filesystem to be mounted read-only. The filesystem 
errors do not occur when leaving RAM available for the host OS by 
limiting physical memory of the QEMU processes via cgroups.

I started three KVM instances with the following commands in my tests. 
Any Windows ISO or disk image can be used. Fewer instances and less 
allocated memory will also trigger the bug, as long as swapping occurs. 
The type of writeback device does not seem to matter. I have tried a 
SATA SSD and an NVMe Optane drive so far. My test machine has 256 GB of 
RAM and one CPU. I saw the same behavior on another machine with two 
CPUs and 128 GB of RAM.

The bug does not occur when using zram as swap without "backing_dev" 
being set, but I had even more severe problems when running the same 
test on Ubuntu kernels 4.15 and 4.17. Regardless of whether the 
writeback feature was used, the host would eventually lock up entirely 
when swap was in use on zram. The lockups may not be directly related 
to zram, though, and were apparently fixed in 4.18. I had absolutely no 
problems on Ubuntu kernel 4.13 either, before the writeback feature was 
introduced.

Thank you for your attention.

--

commands used:

modprobe zram
echo 1 > /sys/block/zram0/reset                    # clear any previous device state
echo lz4 > /sys/block/zram0/comp_algorithm         # select compressor
echo /dev/nvme0n1 > /sys/block/zram0/backing_dev   # writeback device (must be set before disksize)
echo 256G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0

kvm -nographic -smp 20 -m 131072 -cdrom winpe.iso

--

log message:

BUG: Bad page state in process qemu-system-x86  pfn:3dfab21
page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1
flags: 0x17fffc000000008(uptodate)
raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000
raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
bad because of flags: 0x8(uptodate)
Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl sb_edac 
x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass 
crct10dif_pclmul crc32_pclmul ghash_clmulni_intel bin
fmt_misc pcbc aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt 
glue_helper iTCO_vendor_support intel_cstate lpc_ich mei_me intel_uncore 
intel_rapl_perf pcspkr joydev sg mfd_core ioatdma mei wmi evdev ipmi_si 
ipmi_devintf ipmi_msghandler
acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4 
crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod 
xhci_pci ehci_pci ahci libahci xhci_hcd ehci_hcd libata igb i2c_algo_bit 
crc32c_intel scsi_mod i2c_i8
01 dca usbcore
CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G    B 
4.18.0-rc5+ #1
Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
Call Trace:
  dump_stack+0x5c/0x7b
  bad_page+0xba/0x120
  get_page_from_freelist+0x1016/0x1250
  __alloc_pages_nodemask+0xfa/0x250
  alloc_pages_vma+0x7c/0x1c0
  do_swap_page+0x347/0x920
  ? __update_load_avg_se.isra.38+0x1eb/0x1f0
  ? cpumask_next_wrap+0x3d/0x60
  __handle_mm_fault+0x7b4/0x1110
  ? update_load_avg+0x5ea/0x720
  handle_mm_fault+0xfc/0x1f0
  __get_user_pages+0x12f/0x690
  get_user_pages_unlocked+0x148/0x1f0
  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
  try_async_pf+0x87/0x230 [kvm]
  tdp_page_fault+0x132/0x290 [kvm]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  kvm_mmu_page_fault+0x74/0x570 [kvm]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
  ? __update_load_avg_se.isra.38+0x1eb/0x1f0
  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
  ? __switch_to+0x395/0x450
  ? __switch_to+0x395/0x450
  do_vfs_ioctl+0xa2/0x630
  ? __schedule+0x3fd/0x890
  ksys_ioctl+0x70/0x80
  ? exit_to_usermode_loop+0xca/0xf0
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x55/0x100
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fb30361add7
Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff 
ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 
f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
RSP: 002b:00007fb2e97f98b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fb30361add7
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
RBP: 00005652b984e0f0 R08: 00005652b7d513d0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fb308c66000 R14: 0000000000000000 R15: 00005652b984e0f0

--

ver_linux: Debian 9.5 with Kernel 4.18.0-rc5+

GNU C               	6.3.0
GNU Make            	4.1
Binutils            	2.28
Util-linux          	2.29.2
Mount               	2.29.2
Module-init-tools   	23
E2fsprogs           	1.43.4
Linux C Library     	2.24
Dynamic linker (ldd)	2.24
Linux C++ Library   	6.0.22
Procps              	3.3.12
Kbd                 	2.0.3
Console-tools       	2.0.3
Sh-utils            	8.26
Udev                	232

--

cpuinfo:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 79
model name	: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
stepping	: 1
microcode	: 0xb000021
cpu MHz		: 1200.632
cache size	: 25600 KB
physical id	: 0
siblings	: 20
core id		: 0
cpu cores	: 10
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 20
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx 
pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl 
xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 
sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand 
lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single 
pti intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust 
bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap 
intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local 
dtherm ida arat pln pts
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
bogomips	: 4400.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

--
Kind regards,

Tino Lehnig

[-- Attachment #2: config-4.18.0-rc5+.gz --]
[-- Type: application/gzip, Size: 40628 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-23 12:29 Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process Tino Lehnig
@ 2018-07-24  1:03 ` Minchan Kim
  2018-07-24  2:53   ` Sergey Senozhatsky
  2018-07-24  7:30   ` Tino Lehnig
  0 siblings, 2 replies; 30+ messages in thread
From: Minchan Kim @ 2018-07-24  1:03 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi Tino,

Thanks for the report.

On Mon, Jul 23, 2018 at 02:29:32PM +0200, Tino Lehnig wrote:
> Hello,
> 
> after enabling the writeback feature in zram, I encountered the kernel bug
> below with heavy swap utilization. There is one specific workload that
> triggers the bug reliably and that is running Windows in KVM while
> overcommitting memory. The Windows VMs would fill all allocated memory with
> zero pages while booting. A few seconds after the host hits zram swap, the
> console on the host is flooded with the bug message. A few more seconds
> later I also encountered filesystem errors on the host causing the root
> filesystem to be mounted read-only. The filesystem errors do not occur when
> leaving RAM available for the host OS by limiting physical memory of the
> QEMU processes via cgroups.
> 
> I started three KVM instances with the following commands in my tests. Any
> Windows ISO or disk image can be used. Less instances and smaller allocated
> memory will also trigger the bug as long as swapping occurs. The type of
> writeback device does not seem to matter. I have tried a SATA SSD and an
> NVMe Optane drive so far. My test machine has 256 GB of RAM and one CPU. I
> saw the same behavior on another machine with two CPUs and 128 GB of RAM.
> 
> The bug does not occur when using zram as swap without "backing_dev" being
> set, but I had even more severe problems when running the same test on
> Ubuntu Kernels 4.15 and 4.17. Regardless of the writeback feature being used
> or not, the host would eventually lock up entirely when swap is in use on
> zram. The lockups may not be related directly to zram though and were
> apparently fixed in 4.18. I had absolutely no problems on Ubuntu Kernel 4.13
> either, before the writeback feature was introduced.

We haven't released v4.18 yet. Could you tell us which kernel tree and
version you used?

I don't have enough time to dig into this right now.

Sergey, I would really appreciate it if you could find time to look into
this. If you are not available, I will try to look at it as soon as
possible. No worries.

Thanks.


> 
> Thank you for your attention.
> 
> --
> 
> commands used:
> 
> modprobe zram
> echo 1 > /sys/block/zram0/reset
> echo lz4 > /sys/block/zram0/comp_algorithm
> echo /dev/nvme0n1 > /sys/block/zram0/backing_dev
> echo 256G > /sys/block/zram0/disksize
> mkswap /dev/zram0
> swapon /dev/zram0
> 
> kvm -nographic -smp 20 -m 131072 -cdrom winpe.iso
> 
> --
> 
> log message:
> 
> BUG: Bad page state in process qemu-system-x86  pfn:3dfab21
> page:ffffdfb137eac840 count:0 mapcount:0 mapping:0000000000000000 index:0x1
> flags: 0x17fffc000000008(uptodate)
> raw: 017fffc000000008 dead000000000100 dead000000000200 0000000000000000
> raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> bad because of flags: 0x8(uptodate)
> Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel bin
> fmt_misc pcbc aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper
> iTCO_vendor_support intel_cstate lpc_ich mei_me intel_uncore intel_rapl_perf
> pcspkr joydev sg mfd_core ioatdma mei wmi evdev ipmi_si ipmi_devintf
> ipmi_msghandler
> acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4
> crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod
> xhci_pci ehci_pci ahci libahci xhci_hcd ehci_hcd libata igb i2c_algo_bit
> crc32c_intel scsi_mod i2c_i8
> 01 dca usbcore
> CPU: 4 PID: 1039 Comm: qemu-system-x86 Tainted: G    B 4.18.0-rc5+ #1
> Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
> Call Trace:
>  dump_stack+0x5c/0x7b
>  bad_page+0xba/0x120
>  get_page_from_freelist+0x1016/0x1250
>  __alloc_pages_nodemask+0xfa/0x250
>  alloc_pages_vma+0x7c/0x1c0
>  do_swap_page+0x347/0x920
>  ? __update_load_avg_se.isra.38+0x1eb/0x1f0
>  ? cpumask_next_wrap+0x3d/0x60
>  __handle_mm_fault+0x7b4/0x1110
>  ? update_load_avg+0x5ea/0x720
>  handle_mm_fault+0xfc/0x1f0
>  __get_user_pages+0x12f/0x690
>  get_user_pages_unlocked+0x148/0x1f0
>  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
>  try_async_pf+0x87/0x230 [kvm]
>  tdp_page_fault+0x132/0x290 [kvm]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  kvm_mmu_page_fault+0x74/0x570 [kvm]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
>  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
>  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
>  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
>  ? __update_load_avg_se.isra.38+0x1eb/0x1f0
>  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
>  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
>  ? __switch_to+0x395/0x450
>  ? __switch_to+0x395/0x450
>  do_vfs_ioctl+0xa2/0x630
>  ? __schedule+0x3fd/0x890
>  ksys_ioctl+0x70/0x80
>  ? exit_to_usermode_loop+0xca/0xf0
>  __x64_sys_ioctl+0x16/0x20
>  do_syscall_64+0x55/0x100
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x7fb30361add7
> Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff
> ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff
> 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
> RSP: 002b:00007fb2e97f98b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fb30361add7
> RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
> RBP: 00005652b984e0f0 R08: 00005652b7d513d0 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> R13: 00007fb308c66000 R14: 0000000000000000 R15: 00005652b984e0f0
> 
> --
> 
> ver_linux: Debian 9.5 with Kernel 4.18.0-rc5+
> 
> GNU C               	6.3.0
> GNU Make            	4.1
> Binutils            	2.28
> Util-linux          	2.29.2
> Mount               	2.29.2
> Module-init-tools   	23
> E2fsprogs           	1.43.4
> Linux C Library     	2.24
> Dynamic linker (ldd)	2.24
> Linux C++ Library   	6.0.22
> Procps              	3.3.12
> Kbd                 	2.0.3
> Console-tools       	2.0.3
> Sh-utils            	8.26
> Udev                	232
> 
> --
> 
> cpuinfo:
> 
> processor	: 0
> vendor_id	: GenuineIntel
> cpu family	: 6
> model		: 79
> model name	: Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
> stepping	: 1
> microcode	: 0xb000021
> cpu MHz		: 1200.632
> cache size	: 25600 KB
> physical id	: 0
> siblings	: 20
> core id		: 0
> cpu cores	: 10
> apicid		: 0
> initial apicid	: 0
> fpu		: yes
> fpu_exception	: yes
> cpuid level	: 20
> wp		: yes
> flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat
> pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb
> rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
> nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est
> tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt
> tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch
> cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin tpr_shadow vnmi
> flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms
> invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc
> cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
> bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
> bogomips	: 4400.00
> clflush size	: 64
> cache_alignment	: 64
> address sizes	: 46 bits physical, 48 bits virtual
> power management:
> 
> --
> Kind regards,
> 
> Tino Lehnig




* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  1:03 ` Minchan Kim
@ 2018-07-24  2:53   ` Sergey Senozhatsky
  2018-07-24  6:47     ` Minchan Kim
  2018-07-24  7:30   ` Tino Lehnig
  1 sibling, 1 reply; 30+ messages in thread
From: Sergey Senozhatsky @ 2018-07-24  2:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On (07/24/18 10:03), Minchan Kim wrote:
> On Mon, Jul 23, 2018 at 02:29:32PM +0200, Tino Lehnig wrote:
> > Hello,
> > 
> > after enabling the writeback feature in zram, I encountered the kernel bug
> > below with heavy swap utilization. There is one specific workload that
> > triggers the bug reliably and that is running Windows in KVM while
> > overcommitting memory. The Windows VMs would fill all allocated memory with
> > zero pages while booting. A few seconds after the host hits zram swap, the
> > console on the host is flooded with the bug message. A few more seconds
> > later I also encountered filesystem errors on the host causing the root
> > filesystem to be mounted read-only. The filesystem errors do not occur when
> > leaving RAM available for the host OS by limiting physical memory of the
> > QEMU processes via cgroups.
> > 
> > I started three KVM instances with the following commands in my tests. Any
> > Windows ISO or disk image can be used. Less instances and smaller allocated
> > memory will also trigger the bug as long as swapping occurs. The type of
> > writeback device does not seem to matter. I have tried a SATA SSD and an
> > NVMe Optane drive so far. My test machine has 256 GB of RAM and one CPU. I
> > saw the same behavior on another machine with two CPUs and 128 GB of RAM.
> > 
> > The bug does not occur when using zram as swap without "backing_dev" being
> > set, but I had even more severe problems when running the same test on
> > Ubuntu Kernels 4.15 and 4.17. Regardless of the writeback feature being used
> > or not, the host would eventually lock up entirely when swap is in use on
> > zram. The lockups may not be related directly to zram though and were
> > apparently fixed in 4.18. I had absolutely no problems on Ubuntu Kernel 4.13
> > either, before the writeback feature was introduced.
> 
> We didn't release v4.18 yet. Could you say what kernel tree/what version
> you used?
> 
> Now I don't have enough time to dig in.
> 
> Sergey, I really appreciate if you could have availabe time to look into.
> Anyway, I could try to see it asap if Sergey is not available.
> No worry.

Interesting case.

Will take me several days to get to it.
Sorry, quite busy at the moment.

	-ss


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  2:53   ` Sergey Senozhatsky
@ 2018-07-24  6:47     ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2018-07-24  6:47 UTC (permalink / raw)
  To: Sergey Senozhatsky, mawilcox
  Cc: Tino Lehnig, ngupta, linux-kernel, Andrew Morton

On Tue, Jul 24, 2018 at 11:53:30AM +0900, Sergey Senozhatsky wrote:
> On (07/24/18 10:03), Minchan Kim wrote:
> > On Mon, Jul 23, 2018 at 02:29:32PM +0200, Tino Lehnig wrote:
> > > Hello,
> > > 
> > > after enabling the writeback feature in zram, I encountered the kernel bug
> > > below with heavy swap utilization. There is one specific workload that
> > > triggers the bug reliably and that is running Windows in KVM while
> > > overcommitting memory. The Windows VMs would fill all allocated memory with
> > > zero pages while booting. A few seconds after the host hits zram swap, the
> > > console on the host is flooded with the bug message. A few more seconds
> > > later I also encountered filesystem errors on the host causing the root
> > > filesystem to be mounted read-only. The filesystem errors do not occur when
> > > leaving RAM available for the host OS by limiting physical memory of the
> > > QEMU processes via cgroups.
> > > 
> > > I started three KVM instances with the following commands in my tests. Any
> > > Windows ISO or disk image can be used. Less instances and smaller allocated
> > > memory will also trigger the bug as long as swapping occurs. The type of
> > > writeback device does not seem to matter. I have tried a SATA SSD and an
> > > NVMe Optane drive so far. My test machine has 256 GB of RAM and one CPU. I
> > > saw the same behavior on another machine with two CPUs and 128 GB of RAM.
> > > 
> > > The bug does not occur when using zram as swap without "backing_dev" being
> > > set, but I had even more severe problems when running the same test on
> > > Ubuntu Kernels 4.15 and 4.17. Regardless of the writeback feature being used
> > > or not, the host would eventually lock up entirely when swap is in use on
> > > zram. The lockups may not be related directly to zram though and were
> > > apparently fixed in 4.18. I had absolutely no problems on Ubuntu Kernel 4.13
> > > either, before the writeback feature was introduced.
> > 
> > We didn't release v4.18 yet. Could you say what kernel tree/what version
> > you used?
> > 
> > Now I don't have enough time to dig in.
> > 
> > Sergey, I really appreciate if you could have availabe time to look into.
> > Anyway, I could try to see it asap if Sergey is not available.
> > No worry.
> 
> Interesting case.
> 
> Will take me several days to get to it.
> Sorry, quite busy at the moment.

I looked through v4.18-rcX quickly to see what changed for zsmalloc.
The struct page fields were totally reordered, so it could be a
regression related to that. Ccing Matthew.


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  1:03 ` Minchan Kim
  2018-07-24  2:53   ` Sergey Senozhatsky
@ 2018-07-24  7:30   ` Tino Lehnig
  2018-07-25  1:32     ` Minchan Kim
  2018-07-25 13:21     ` Minchan Kim
  1 sibling, 2 replies; 30+ messages in thread
From: Tino Lehnig @ 2018-07-24  7:30 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi,

The first build I used was from the master branch of the mainline 
kernel, somewhere between rc5 and rc6. I have just reproduced the bug 
with 4.17.9 and 4.18-rc6. Kernel messages below.

The bug does not appear on 4.14.57. I can test more versions if it helps.

On 07/24/2018 03:03 AM, Minchan Kim wrote:
> We didn't release v4.18 yet. Could you say what kernel tree/what version
> you used?

--

[  804.485321] BUG: Bad page state in process qemu-system-x86  pfn:1c4b08e
[  804.485403] page:ffffe809312c2380 count:0 mapcount:0 
mapping:0000000000000000 index:0x1
[  804.485483] flags: 0x17fffc000000008(uptodate)
[  804.485554] raw: 017fffc000000008 0000000000000000 0000000000000001 
00000000ffffffff
[  804.485632] raw: dead000000000100 dead000000000200 0000000000000000 
0000000000000000
[  804.485709] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[  804.485782] bad because of flags: 0x8(uptodate)
[  804.485852] Modules linked in: lz4 lz4_compress zram zsmalloc 
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcb
c aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper 
iTCO_vendor_support intel_cstate binfmt_misc intel_uncore 
intel_rapl_perf pcspkr mei_me lpc_ich joydev sg mfd_core mei ioatdma 
shpchp wmi evdev ipmi_si ipmi_devintf ipmi_msgh
andler acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4 
crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod 
ahci libahci xhci_pci ehci_pci libata igb xhci_hcd ehci_hcd crc32c_intel 
i2c_algo_bit scsi_mod
  i2c_i801 dca usbcore
[  804.485890] CPU: 17 PID: 1165 Comm: qemu-system-x86 Not tainted 4.17.9 #1
[  804.485891] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 
2.0b 05/02/2017
[  804.485891] Call Trace:
[  804.485899]  dump_stack+0x5c/0x7b
[  804.485902]  bad_page+0xba/0x120
[  804.485905]  get_page_from_freelist+0x1016/0x1250
[  804.485908]  __alloc_pages_nodemask+0xfa/0x250
[  804.485911]  alloc_pages_vma+0x7c/0x1c0
[  804.485915]  __handle_mm_fault+0xcf6/0x1110
[  804.485918]  handle_mm_fault+0xfc/0x1f0
[  804.485921]  __get_user_pages+0x12f/0x670
[  804.485923]  get_user_pages_unlocked+0x148/0x1f0
[  804.485945]  __gfn_to_pfn_memslot+0xff/0x390 [kvm]
[  804.485959]  try_async_pf+0x67/0x200 [kvm]
[  804.485971]  tdp_page_fault+0x132/0x290 [kvm]
[  804.485975]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  804.485987]  kvm_mmu_page_fault+0x59/0x140 [kvm]
[  804.485999]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
[  804.486003]  ? futex_wake+0x94/0x170
[  804.486012]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
[  804.486021]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
[  804.486024]  ? __switch_to+0x395/0x450
[  804.486026]  ? __switch_to+0x395/0x450
[  804.486029]  do_vfs_ioctl+0xa2/0x620
[  804.486030]  ? __x64_sys_futex+0x88/0x180
[  804.486032]  ksys_ioctl+0x70/0x80
[  804.486034]  __x64_sys_ioctl+0x16/0x20
[  804.486037]  do_syscall_64+0x55/0x100
[  804.486039]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  804.486041] RIP: 0033:0x7f82db677dd7
[  804.486042] RSP: 002b:00007f82c1ffa8b8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  804.486044] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 
00007f82db677dd7
[  804.486044] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 
0000000000000014
[  804.486045] RBP: 000055b592a1ddf0 R08: 000055b5914bb3d0 R09: 
00000000ffffffff
[  804.486046] R10: 00007f82c1ffa670 R11: 0000000000000246 R12: 
0000000000000000
[  804.486047] R13: 00007f82e0cc6000 R14: 0000000000000000 R15: 
000055b592a1ddf0
[  804.486048] Disabling lock debugging due to kernel taint

--

[  170.707761] BUG: Bad page state in process qemu-system-x86  pfn:1901199
[  170.707842] page:ffffe453e4046640 count:0 mapcount:0 
mapping:0000000000000000 index:0x1
[  170.707923] flags: 0x17fffc000000008(uptodate)
[  170.707996] raw: 017fffc000000008 dead000000000100 dead000000000200 
0000000000000000
[  170.708074] raw: 0000000000000001 0000000000000000 00000000ffffffff 
0000000000000000
[  170.708151] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[  170.708225] bad because of flags: 0x8(uptodate)
[  170.708295] Modules linked in: lz4 lz4_compress zram zsmalloc 
intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp 
kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel iTCO_wdt iTCO_vendor_support binfmt_misc pcbc 
aesni_intel aes_x86_64 crypto_simd cryptd glue_helper intel_cstate 
mei_me intel_uncore lpc_ich intel_rapl_perf pcspkr joydev sg mfd_core 
mei ioatdma wmi evdev ipmi_si ipmi_devintf ipmi_msghandler 
acpi_power_meter acpi_pad pcc_cpufreq button ip_tables x_tables autofs4 
ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid 
sd_mod ahci libahci libata xhci_pci ehci_pci crc32c_intel xhci_hcd 
ehci_hcd scsi_mod i2c_i801 igb i2c_algo_bit dca usbcore
[  170.708344] CPU: 8 PID: 1031 Comm: qemu-system-x86 Not tainted 
4.18.0-rc6 #1
[  170.708345] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 
2.0b 05/02/2017
[  170.708346] Call Trace:
[  170.708354]  dump_stack+0x5c/0x7b
[  170.708357]  bad_page+0xba/0x120
[  170.708360]  get_page_from_freelist+0x1016/0x1250
[  170.708364]  __alloc_pages_nodemask+0xfa/0x250
[  170.708368]  alloc_pages_vma+0x7c/0x1c0
[  170.708371]  do_swap_page+0x347/0x920
[  170.708375]  ? do_huge_pmd_anonymous_page+0x461/0x6f0
[  170.708377]  __handle_mm_fault+0x7b4/0x1110
[  170.708380]  ? call_function_interrupt+0xa/0x20
[  170.708383]  handle_mm_fault+0xfc/0x1f0
[  170.708385]  __get_user_pages+0x12f/0x690
[  170.708387]  get_user_pages_unlocked+0x148/0x1f0
[  170.708415]  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
[  170.708433]  try_async_pf+0x87/0x230 [kvm]
[  170.708450]  tdp_page_fault+0x132/0x290 [kvm]
[  170.708455]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708470]  kvm_mmu_page_fault+0x74/0x570 [kvm]
[  170.708474]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708477]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708480]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708484]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708487]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708490]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708493]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708497]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708500]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708503]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
[  170.708506]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
[  170.708510]  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
[  170.708526]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
[  170.708529]  ? futex_wake+0x94/0x170
[  170.708542]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
[  170.708555]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
[  170.708558]  ? __handle_mm_fault+0x7c4/0x1110
[  170.708561]  do_vfs_ioctl+0xa2/0x630
[  170.708563]  ? __x64_sys_futex+0x88/0x180
[  170.708565]  ksys_ioctl+0x70/0x80
[  170.708568]  ? exit_to_usermode_loop+0xca/0xf0
[  170.708570]  __x64_sys_ioctl+0x16/0x20
[  170.708572]  do_syscall_64+0x55/0x100
[  170.708574]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  170.708577] RIP: 0033:0x7fc9e4889dd7
[  170.708577] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 
48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
[  170.708610] RSP: 002b:00007fc9c27fb8b8 EFLAGS: 00000246 ORIG_RAX: 
0000000000000010
[  170.708612] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 
00007fc9e4889dd7
[  170.708613] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 
0000000000000015
[  170.708614] RBP: 000055dbb5f263e0 R08: 000055dbb34f03d0 R09: 
00000000ffffffff
[  170.708616] R10: 00007fc9c27fb670 R11: 0000000000000246 R12: 
0000000000000000
[  170.708617] R13: 00007fc9e9ed5000 R14: 0000000000000000 R15: 
000055dbb5f263e0
[  170.708618] Disabling lock debugging due to kernel taint

--
Kind regards,

Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  7:30   ` Tino Lehnig
@ 2018-07-25  1:32     ` Minchan Kim
  2018-07-25  1:55       ` Matthew Wilcox
  2018-07-25  2:51       ` Matthew Wilcox
  2018-07-25 13:21     ` Minchan Kim
  1 sibling, 2 replies; 30+ messages in thread
From: Minchan Kim @ 2018-07-25  1:32 UTC (permalink / raw)
  To: Tino Lehnig, willy
  Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi Tino,

On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> Hi,
> 
> The first build I used was from the master branch of the mainline kernel,
> somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> and 4.18-rc6. Kernel messages below.
> 
> The bug does not appear on 4.14.57. I can test more versions if it helps.

Could you try 4.15?

I think it's a regression caused by the struct page field reordering,
which started in v4.16.

zsmalloc uses page->units as the offset of the first object on the
zspage. However, the patch below unified it with page->_refcount.

I believe it's the culprit of the regression.

commit ca9c88c781b8
Author: Matthew Wilcox <mawilcox@microsoft.com>
Date:   Wed Jan 31 16:18:47 2018 -0800

    mm: de-indent struct page

    I found the struct { union { struct { union { struct { } } } } } layout
    rather confusing.  Fortunately, there is an easier way to write this.

    The innermost union is of four things which are the size of an int, so

> 
> On 07/24/2018 03:03 AM, Minchan Kim wrote:
> > We didn't release v4.18 yet. Could you say what kernel tree/what version
> > you used?
> 
> --
> 
> [  804.485321] BUG: Bad page state in process qemu-system-x86  pfn:1c4b08e
> [  804.485403] page:ffffe809312c2380 count:0 mapcount:0
> mapping:0000000000000000 index:0x1
> [  804.485483] flags: 0x17fffc000000008(uptodate)
> [  804.485554] raw: 017fffc000000008 0000000000000000 0000000000000001
> 00000000ffffffff
> [  804.485632] raw: dead000000000100 dead000000000200 0000000000000000
> 0000000000000000
> [  804.485709] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  804.485782] bad because of flags: 0x8(uptodate)
> [  804.485852] Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl
> sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcb
> c aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper
> iTCO_vendor_support intel_cstate binfmt_misc intel_uncore intel_rapl_perf
> pcspkr mei_me lpc_ich joydev sg mfd_core mei ioatdma shpchp wmi evdev
> ipmi_si ipmi_devintf ipmi_msgh
> andler acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4
> crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod
> ahci libahci xhci_pci ehci_pci libata igb xhci_hcd ehci_hcd crc32c_intel
> i2c_algo_bit scsi_mod
>  i2c_i801 dca usbcore
> [  804.485890] CPU: 17 PID: 1165 Comm: qemu-system-x86 Not tainted 4.17.9 #1
> [  804.485891] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b
> 05/02/2017
> [  804.485891] Call Trace:
> [  804.485899]  dump_stack+0x5c/0x7b
> [  804.485902]  bad_page+0xba/0x120
> [  804.485905]  get_page_from_freelist+0x1016/0x1250
> [  804.485908]  __alloc_pages_nodemask+0xfa/0x250
> [  804.485911]  alloc_pages_vma+0x7c/0x1c0
> [  804.485915]  __handle_mm_fault+0xcf6/0x1110
> [  804.485918]  handle_mm_fault+0xfc/0x1f0
> [  804.485921]  __get_user_pages+0x12f/0x670
> [  804.485923]  get_user_pages_unlocked+0x148/0x1f0
> [  804.485945]  __gfn_to_pfn_memslot+0xff/0x390 [kvm]
> [  804.485959]  try_async_pf+0x67/0x200 [kvm]
> [  804.485971]  tdp_page_fault+0x132/0x290 [kvm]
> [  804.485975]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  804.485987]  kvm_mmu_page_fault+0x59/0x140 [kvm]
> [  804.485999]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
> [  804.486003]  ? futex_wake+0x94/0x170
> [  804.486012]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  804.486021]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  804.486024]  ? __switch_to+0x395/0x450
> [  804.486026]  ? __switch_to+0x395/0x450
> [  804.486029]  do_vfs_ioctl+0xa2/0x620
> [  804.486030]  ? __x64_sys_futex+0x88/0x180
> [  804.486032]  ksys_ioctl+0x70/0x80
> [  804.486034]  __x64_sys_ioctl+0x16/0x20
> [  804.486037]  do_syscall_64+0x55/0x100
> [  804.486039]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  804.486041] RIP: 0033:0x7f82db677dd7
> [  804.486042] RSP: 002b:00007f82c1ffa8b8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [  804.486044] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX:
> 00007f82db677dd7
> [  804.486044] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000014
> [  804.486045] RBP: 000055b592a1ddf0 R08: 000055b5914bb3d0 R09:
> 00000000ffffffff
> [  804.486046] R10: 00007f82c1ffa670 R11: 0000000000000246 R12:
> 0000000000000000
> [  804.486047] R13: 00007f82e0cc6000 R14: 0000000000000000 R15:
> 000055b592a1ddf0
> [  804.486048] Disabling lock debugging due to kernel taint
> 
> --
> 
> [  170.707761] BUG: Bad page state in process qemu-system-x86  pfn:1901199
> [  170.707842] page:ffffe453e4046640 count:0 mapcount:0
> mapping:0000000000000000 index:0x1
> [  170.707923] flags: 0x17fffc000000008(uptodate)
> [  170.707996] raw: 017fffc000000008 dead000000000100 dead000000000200
> 0000000000000000
> [  170.708074] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [  170.708151] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  170.708225] bad because of flags: 0x8(uptodate)
> [  170.708295] Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl
> sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
> iTCO_vendor_support binfmt_misc pcbc aesni_intel aes_x86_64 crypto_simd
> cryptd glue_helper intel_cstate mei_me intel_uncore lpc_ich intel_rapl_perf
> pcspkr joydev sg mfd_core mei ioatdma wmi evdev ipmi_si ipmi_devintf
> ipmi_msghandler acpi_power_meter acpi_pad pcc_cpufreq button ip_tables
> x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic
> usbhid hid sd_mod ahci libahci libata xhci_pci ehci_pci crc32c_intel
> xhci_hcd ehci_hcd scsi_mod i2c_i801 igb i2c_algo_bit dca usbcore
> [  170.708344] CPU: 8 PID: 1031 Comm: qemu-system-x86 Not tainted 4.18.0-rc6
> #1
> [  170.708345] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b
> 05/02/2017
> [  170.708346] Call Trace:
> [  170.708354]  dump_stack+0x5c/0x7b
> [  170.708357]  bad_page+0xba/0x120
> [  170.708360]  get_page_from_freelist+0x1016/0x1250
> [  170.708364]  __alloc_pages_nodemask+0xfa/0x250
> [  170.708368]  alloc_pages_vma+0x7c/0x1c0
> [  170.708371]  do_swap_page+0x347/0x920
> [  170.708375]  ? do_huge_pmd_anonymous_page+0x461/0x6f0
> [  170.708377]  __handle_mm_fault+0x7b4/0x1110
> [  170.708380]  ? call_function_interrupt+0xa/0x20
> [  170.708383]  handle_mm_fault+0xfc/0x1f0
> [  170.708385]  __get_user_pages+0x12f/0x690
> [  170.708387]  get_user_pages_unlocked+0x148/0x1f0
> [  170.708415]  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
> [  170.708433]  try_async_pf+0x87/0x230 [kvm]
> [  170.708450]  tdp_page_fault+0x132/0x290 [kvm]
> [  170.708455]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708470]  kvm_mmu_page_fault+0x74/0x570 [kvm]
> [  170.708474]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708477]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708480]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708484]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708487]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708490]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708493]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708497]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708500]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708503]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708506]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708510]  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
> [  170.708526]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
> [  170.708529]  ? futex_wake+0x94/0x170
> [  170.708542]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  170.708555]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  170.708558]  ? __handle_mm_fault+0x7c4/0x1110
> [  170.708561]  do_vfs_ioctl+0xa2/0x630
> [  170.708563]  ? __x64_sys_futex+0x88/0x180
> [  170.708565]  ksys_ioctl+0x70/0x80
> [  170.708568]  ? exit_to_usermode_loop+0xca/0xf0
> [  170.708570]  __x64_sys_ioctl+0x16/0x20
> [  170.708572]  do_syscall_64+0x55/0x100
> [  170.708574]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  170.708577] RIP: 0033:0x7fc9e4889dd7
> [  170.708577] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48
> c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48>
> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
> [  170.708610] RSP: 002b:00007fc9c27fb8b8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [  170.708612] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX:
> 00007fc9e4889dd7
> [  170.708613] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000015
> [  170.708614] RBP: 000055dbb5f263e0 R08: 000055dbb34f03d0 R09:
> 00000000ffffffff
> [  170.708616] R10: 00007fc9c27fb670 R11: 0000000000000246 R12:
> 0000000000000000
> [  170.708617] R13: 00007fc9e9ed5000 R14: 0000000000000000 R15:
> 000055dbb5f263e0
> [  170.708618] Disabling lock debugging due to kernel taint
> 
> --
> Kind regards,
> 
> Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  1:32     ` Minchan Kim
@ 2018-07-25  1:55       ` Matthew Wilcox
  2018-07-25  2:16         ` Minchan Kim
  2018-07-25  2:51       ` Matthew Wilcox
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2018-07-25  1:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> Hi Tino,
> 
> On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> > Hi,
> > 
> > The first build I used was from the master branch of the mainline kernel,
> > somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> > and 4.18-rc6. Kernel messages below.
> > 
> > The bug does not appear on 4.14.57. I can test more versions if it helps.
> 
> Could you try 4.15?
> 
> I think it's a regression caused by the struct page field reordering that
> started in v4.16.
> 
> zsmalloc uses page->units as the offset of the first object on the zspage.
> However, the patch below unified it with page->_refcount.

No it didn't.  It's in a union with _mapcount, which is where it was before
my patches.

It's entirely possible that my patches caused this, but the explanation
you're offering is wrong.


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  1:55       ` Matthew Wilcox
@ 2018-07-25  2:16         ` Minchan Kim
  2018-07-25  2:35           ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-25  2:16 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Tue, Jul 24, 2018 at 06:55:02PM -0700, Matthew Wilcox wrote:
> On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> > Hi Tino,
> > 
> > On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> > > Hi,
> > > 
> > > The first build I used was from the master branch of the mainline kernel,
> > > somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> > > and 4.18-rc6. Kernel messages below.
> > > 
> > > The bug does not appear on 4.14.57. I can test more versions if it helps.
> > 
> > Could you try 4.15?
> > 
> > I think it's a regression caused by the struct page field reordering that
> > started in v4.16.
> > 
> > zsmalloc uses page->units as the offset of the first object on the zspage.
> > However, the patch below unified it with page->_refcount.
> 
> No it didn't.  It's in a union with _mapcount, which is where it was before
> my patches.
> 
> It's entirely possible that my patches caused this, but the explanation
> you're offering is wrong.

Before your patch, _mapcount and _refcount occupied separate space, so we could
use _mapcount for page->units on non-mapped pages because units was unified with
_mapcount. However, with your patch, units is now unified with _refcount.


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:16         ` Minchan Kim
@ 2018-07-25  2:35           ` Matthew Wilcox
  2018-07-25  2:51             ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2018-07-25  2:35 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Wed, Jul 25, 2018 at 11:16:57AM +0900, Minchan Kim wrote:
> On Tue, Jul 24, 2018 at 06:55:02PM -0700, Matthew Wilcox wrote:
> > On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> > > Hi Tino,
> > > 
> > > On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> > > > Hi,
> > > > 
> > > > The first build I used was from the master branch of the mainline kernel,
> > > > somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> > > > and 4.18-rc6. Kernel messages below.
> > > > 
> > > > The bug does not appear on 4.14.57. I can test more versions if it helps.
> > > 
> > > Could you try 4.15?
> > > 
> > > I think it's a regression caused by the struct page field reordering that
> > > started in v4.16.
> > > 
> > > zsmalloc uses page->units as the offset of the first object on the zspage.
> > > However, the patch below unified it with page->_refcount.
> > 
> > No it didn't.  It's in a union with _mapcount, which is where it was before
> > my patches.
> > 
> > It's entirely possible that my patches caused this, but the explanation
> > you're offering is wrong.
> 
> Before your patch, _mapcount and _refcount occupied separate space, so we could
> use _mapcount for page->units on non-mapped pages because units was unified with
> _mapcount. However, with your patch, units is now unified with _refcount.

No.  That's completely untrue.

        union {         /* This union is 4 bytes in size. */
                atomic_t _mapcount;
                unsigned int page_type;
                unsigned int active;            /* SLAB */
                int units;                      /* SLOB */
        };

        atomic_t _refcount;

There is NOTHING in a union with _refcount.



* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:35           ` Matthew Wilcox
@ 2018-07-25  2:51             ` Minchan Kim
  2018-07-25  2:55               ` Matthew Wilcox
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-25  2:51 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Tue, Jul 24, 2018 at 07:35:32PM -0700, Matthew Wilcox wrote:
> On Wed, Jul 25, 2018 at 11:16:57AM +0900, Minchan Kim wrote:
> > On Tue, Jul 24, 2018 at 06:55:02PM -0700, Matthew Wilcox wrote:
> > > On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> > > > Hi Tino,
> > > > 
> > > > On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> > > > > Hi,
> > > > > 
> > > > > The first build I used was from the master branch of the mainline kernel,
> > > > > somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> > > > > and 4.18-rc6. Kernel messages below.
> > > > > 
> > > > > The bug does not appear on 4.14.57. I can test more versions if it helps.
> > > > 
> > > > Could you try 4.15?
> > > > 
> > > > I think it's a regression caused by the struct page field reordering that
> > > > started in v4.16.
> > > > 
> > > > zsmalloc uses page->units as the offset of the first object on the zspage.
> > > > However, the patch below unified it with page->_refcount.
> > > 
> > > No it didn't.  It's in a union with _mapcount, which is where it was before
> > > my patches.
> > > 
> > > It's entirely possible that my patches caused this, but the explanation
> > > you're offering is wrong.
> > 
> > Before your patch, _mapcount and _refcount occupied separate space, so we could
> > use _mapcount for page->units on non-mapped pages because units was unified with
> > _mapcount. However, with your patch, units is now unified with _refcount.
> 
> No.  That's completely untrue.
> 
>         union {         /* This union is 4 bytes in size. */
>                 atomic_t _mapcount;
>                 unsigned int page_type;
>                 unsigned int active;            /* SLAB */
>                 int units;                      /* SLOB */
>         };
> 
>         atomic_t _refcount;
> 
> There is NOTHING in a union with _refcount.
> 


Confusing. Matthew, what am I missing?

Before:

counters consumes 8 bytes;
units and _refcount each consume 4 bytes, so their memory does not overlap.

        union {
                unsigned long counters;
                struct {

                        union {
                                atomic_t _mapcount;
                                unsigned int active;            /* SLAB */
                                struct {                        /* SLUB */
                                        unsigned inuse:16;
                                        unsigned objects:15;
                                        unsigned frozen:1;
                                };
                                int units;                      /* SLOB */
                        };
                        /*
                         * Usage count, *USE WRAPPER FUNCTION* when manual
                         * accounting. See page_ref.h
                         */
                        atomic_t _refcount;
                };
        };


After:

Now units overlaps with both _refcount and _mapcount.

        union {
                unsigned long counters;
                unsigned int active;            /* SLAB */
                struct {                        /* SLUB */
                        unsigned inuse:16;
                        unsigned objects:15;
                        unsigned frozen:1;
                };
                int units;                      /* SLOB */
                struct {                        /* Page cache */
                        atomic_t _mapcount;
                        atomic_t _refcount;
                };
        };




* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  1:32     ` Minchan Kim
  2018-07-25  1:55       ` Matthew Wilcox
@ 2018-07-25  2:51       ` Matthew Wilcox
  2018-07-25  4:07         ` Sergey Senozhatsky
  1 sibling, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2018-07-25  2:51 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Wed, Jul 25, 2018 at 10:32:50AM +0900, Minchan Kim wrote:
> > [  804.485321] BUG: Bad page state in process qemu-system-x86  pfn:1c4b08e
> > [  804.485403] page:ffffe809312c2380 count:0 mapcount:0
> > mapping:0000000000000000 index:0x1
> > [  804.485483] flags: 0x17fffc000000008(uptodate)
> > [  804.485554] raw: 017fffc000000008 0000000000000000 0000000000000001
> > 00000000ffffffff
> > [  804.485632] raw: dead000000000100 dead000000000200 0000000000000000
> > 0000000000000000
> > [  804.485709] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> > [  804.485782] bad because of flags: 0x8(uptodate)

The message here even tells you what the problem is.  The page
was marked Uptodate at the time that it was allocated.  There aren't
any other flags that alias Uptodate, so somebody is failing to clear
the Uptodate bit when they ought to.

> > [  804.485891] Call Trace:
> > [  804.485899]  dump_stack+0x5c/0x7b
> > [  804.485902]  bad_page+0xba/0x120
> > [  804.485905]  get_page_from_freelist+0x1016/0x1250
> > [  804.485908]  __alloc_pages_nodemask+0xfa/0x250

Also note that this is in the allocation path; this flag isn't checked
at free.  But it is cleared on free, so someone's stomping on page->flags
after they're freed.  I suggest enabling more debugging code.
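One hedged way to act on that suggestion (these are standard mm debugging options and boot parameters, not something prescribed in this thread): rebuild with the VM debug checks and page_owner, then boot with them enabled so the allocator records who last allocated each page:

```shell
# Kernel config (e.g. via make menuconfig or scripts/config):
#   CONFIG_DEBUG_VM=y         # extra VM sanity checks
#   CONFIG_DEBUG_PAGEALLOC=y  # poison/unmap freed pages to catch stomping
#   CONFIG_PAGE_OWNER=y       # record the allocation stack of each page

# Enable the runtime parts on the kernel command line:
#   debug_pagealloc=on page_owner=on

# After reproducing the bug, read back the owner records:
mount -t debugfs none /sys/kernel/debug 2>/dev/null || true
cat /sys/kernel/debug/page_owner > page_owner_dump.txt
```

With page_owner enabled, the "Bad page state" report also prints the last allocation stack, which usually narrows down who freed or corrupted the page.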


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:51             ` Minchan Kim
@ 2018-07-25  2:55               ` Matthew Wilcox
  2018-07-25  3:02                 ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Matthew Wilcox @ 2018-07-25  2:55 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Wed, Jul 25, 2018 at 11:51:06AM +0900, Minchan Kim wrote:
> On Tue, Jul 24, 2018 at 07:35:32PM -0700, Matthew Wilcox wrote:
> > There is NOTHING in a union with _refcount.
> 
> Confusing. Matthew, what am I missing?
> 
> Before:
> 
> counters consumes 8 bytes;
> units and _refcount each consume 4 bytes, so their memory does not overlap.

Correct.  _mapcount is at offset 24, as is units.  _refcount is at offset 28.

>         union {
>                 unsigned long counters;
>                 struct {
>                         union {
>                                 atomic_t _mapcount;
>                                 int units;                      /* SLOB */
>                         };
>                         atomic_t _refcount;
>                 };
>         };
> 
> 
> After:
> 
> Now units overlaps with both _refcount and _mapcount.

No.  units is at offset 24, as is _mapcount.  But _refcount is still at
offset 28.

>         union {
>                 unsigned long counters;
>                 int units;                      /* SLOB */
>                 struct {                        /* Page cache */
>                         atomic_t _mapcount;
>                         atomic_t _refcount;
>                 };
>         };


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:55               ` Matthew Wilcox
@ 2018-07-25  3:02                 ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2018-07-25  3:02 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Tino Lehnig, ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Tue, Jul 24, 2018 at 07:55:25PM -0700, Matthew Wilcox wrote:
> On Wed, Jul 25, 2018 at 11:51:06AM +0900, Minchan Kim wrote:
> > On Tue, Jul 24, 2018 at 07:35:32PM -0700, Matthew Wilcox wrote:
> > > There is NOTHING in a union with _refcount.
> > 
> > Confusing. Matthew, what am I missing?
> > 
> > Before:
> > 
> > counters consumes 8 bytes;
> > units and _refcount each consume 4 bytes, so their memory does not overlap.
> 
> Correct.  _mapcount is at offset 24, as is units.  _refcount is at offset 28.
> 
> >         union {
> >                 unsigned long counters;
> >                 struct {
> >                         union {
> >                                 atomic_t _mapcount;
> >                                 int units;                      /* SLOB */
> >                         };
> >                         atomic_t _refcount;
> >                 };
> >         };
> > 
> > 
> > After:
> > 
> > Now units overlaps with both _refcount and _mapcount.
> 
> No.  units is at offset 24, as is _mapcount.  But _refcount is still at
> offset 28.

Ah, now I see what you mean. Sorry for the noise, Matthew.

> 
> >         union {
> >                 unsigned long counters;
> >                 int units;                      /* SLOB */
> >                 struct {                        /* Page cache */
> >                         atomic_t _mapcount;
> >                         atomic_t _refcount;
> >                 };
> >         };


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25  2:51       ` Matthew Wilcox
@ 2018-07-25  4:07         ` Sergey Senozhatsky
  0 siblings, 0 replies; 30+ messages in thread
From: Sergey Senozhatsky @ 2018-07-25  4:07 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Minchan Kim, Tino Lehnig, ngupta, linux-kernel,
	Sergey Senozhatsky, Andrew Morton

On (07/24/18 19:51), Matthew Wilcox wrote:
> 
> Also note that this is in the allocation path; this flag isn't checked
> at free.  But it is cleared on free, so someone's stomping on page->flags
> after they're freed.  I suggest enabling more debugging code.

Would be lovely if Tino could bisect it, perhaps.

	-ss


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-24  7:30   ` Tino Lehnig
  2018-07-25  1:32     ` Minchan Kim
@ 2018-07-25 13:21     ` Minchan Kim
  2018-07-25 15:12       ` Tino Lehnig
  1 sibling, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-25 13:21 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Tue, Jul 24, 2018 at 09:30:34AM +0200, Tino Lehnig wrote:
> Hi,
> 
> The first build I used was from the master branch of the mainline kernel,
> somewhere between rc5 and rc6. I have just reproduced the bug with 4.17.9
> and 4.18-rc6. Kernel messages below.
> 
> The bug does not appear on 4.14.57. I can test more versions if it helps.

It would be very helpful if you could check more versions with git-bisect.

I also want to reproduce it.

Today, I downloaded a Windows ISO and booted it as a CD-ROM under KVM with my
own compiled kernel, but I couldn't reproduce the bug.
I also tested a heavy swap workload (a kernel build with multiple CPUs
on small memory), but failed to reproduce it that way, too.

Could you please describe your reproduction method in more detail?

Thanks.
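A sketch of the bisect run asked for above, with the good/bad endpoints taken from the versions already reported in this thread (4.14.57 good, 4.17.9 bad; mainline tags are used since stable point releases are not on the bisect path):

```shell
# In a mainline kernel tree:
git bisect start
git bisect bad v4.17      # 4.17.9 reproduces the bug
git bisect good v4.14     # 4.14.57 does not

# At each step git checks out a midpoint commit. Build, install,
# reboot, and run the KVM/zram-writeback reproducer:
make -j"$(nproc)" && make modules_install install
# ...reboot and test...

# Then report the result and let git pick the next commit:
git bisect good           # or: git bisect bad
git bisect log            # keep a record; ~15 steps cover this range
```

Each step halves the remaining commit range, so even the roughly 45k commits between v4.14 and v4.17 take only about 15 or 16 build-and-boot cycles.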


> 
> On 07/24/2018 03:03 AM, Minchan Kim wrote:
> > We didn't release v4.18 yet. Could you say what kernel tree/what version
> > you used?
> 
> --
> 
> [  804.485321] BUG: Bad page state in process qemu-system-x86  pfn:1c4b08e
> [  804.485403] page:ffffe809312c2380 count:0 mapcount:0
> mapping:0000000000000000 index:0x1
> [  804.485483] flags: 0x17fffc000000008(uptodate)
> [  804.485554] raw: 017fffc000000008 0000000000000000 0000000000000001
> 00000000ffffffff
> [  804.485632] raw: dead000000000100 dead000000000200 0000000000000000
> 0000000000000000
> [  804.485709] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  804.485782] bad because of flags: 0x8(uptodate)
> [  804.485852] Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl
> sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcb
> c aesni_intel aes_x86_64 crypto_simd cryptd iTCO_wdt glue_helper
> iTCO_vendor_support intel_cstate binfmt_misc intel_uncore intel_rapl_perf
> pcspkr mei_me lpc_ich joydev sg mfd_core mei ioatdma shpchp wmi evdev
> ipmi_si ipmi_devintf ipmi_msgh
> andler acpi_power_meter acpi_pad button ip_tables x_tables autofs4 ext4
> crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic usbhid hid sd_mod
> ahci libahci xhci_pci ehci_pci libata igb xhci_hcd ehci_hcd crc32c_intel
> i2c_algo_bit scsi_mod
>  i2c_i801 dca usbcore
> [  804.485890] CPU: 17 PID: 1165 Comm: qemu-system-x86 Not tainted 4.17.9 #1
> [  804.485891] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b
> 05/02/2017
> [  804.485891] Call Trace:
> [  804.485899]  dump_stack+0x5c/0x7b
> [  804.485902]  bad_page+0xba/0x120
> [  804.485905]  get_page_from_freelist+0x1016/0x1250
> [  804.485908]  __alloc_pages_nodemask+0xfa/0x250
> [  804.485911]  alloc_pages_vma+0x7c/0x1c0
> [  804.485915]  __handle_mm_fault+0xcf6/0x1110
> [  804.485918]  handle_mm_fault+0xfc/0x1f0
> [  804.485921]  __get_user_pages+0x12f/0x670
> [  804.485923]  get_user_pages_unlocked+0x148/0x1f0
> [  804.485945]  __gfn_to_pfn_memslot+0xff/0x390 [kvm]
> [  804.485959]  try_async_pf+0x67/0x200 [kvm]
> [  804.485971]  tdp_page_fault+0x132/0x290 [kvm]
> [  804.485975]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  804.485987]  kvm_mmu_page_fault+0x59/0x140 [kvm]
> [  804.485999]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
> [  804.486003]  ? futex_wake+0x94/0x170
> [  804.486012]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  804.486021]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  804.486024]  ? __switch_to+0x395/0x450
> [  804.486026]  ? __switch_to+0x395/0x450
> [  804.486029]  do_vfs_ioctl+0xa2/0x620
> [  804.486030]  ? __x64_sys_futex+0x88/0x180
> [  804.486032]  ksys_ioctl+0x70/0x80
> [  804.486034]  __x64_sys_ioctl+0x16/0x20
> [  804.486037]  do_syscall_64+0x55/0x100
> [  804.486039]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  804.486041] RIP: 0033:0x7f82db677dd7
> [  804.486042] RSP: 002b:00007f82c1ffa8b8 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000010
> [  804.486044] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX:
> 00007f82db677dd7
> [  804.486044] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
> 0000000000000014
> [  804.486045] RBP: 000055b592a1ddf0 R08: 000055b5914bb3d0 R09:
> 00000000ffffffff
> [  804.486046] R10: 00007f82c1ffa670 R11: 0000000000000246 R12:
> 0000000000000000
> [  804.486047] R13: 00007f82e0cc6000 R14: 0000000000000000 R15:
> 000055b592a1ddf0
> [  804.486048] Disabling lock debugging due to kernel taint
> 
> --
> 
> [  170.707761] BUG: Bad page state in process qemu-system-x86  pfn:1901199
> [  170.707842] page:ffffe453e4046640 count:0 mapcount:0
> mapping:0000000000000000 index:0x1
> [  170.707923] flags: 0x17fffc000000008(uptodate)
> [  170.707996] raw: 017fffc000000008 dead000000000100 dead000000000200
> 0000000000000000
> [  170.708074] raw: 0000000000000001 0000000000000000 00000000ffffffff
> 0000000000000000
> [  170.708151] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  170.708225] bad because of flags: 0x8(uptodate)
> [  170.708295] Modules linked in: lz4 lz4_compress zram zsmalloc intel_rapl
> sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
> irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iTCO_wdt
> iTCO_vendor_support binfmt_misc pcbc aesni_intel aes_x86_64 crypto_simd
> cryptd glue_helper intel_cstate mei_me intel_uncore lpc_ich intel_rapl_perf
> pcspkr joydev sg mfd_core mei ioatdma wmi evdev ipmi_si ipmi_devintf
> ipmi_msghandler acpi_power_meter acpi_pad pcc_cpufreq button ip_tables
> x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 fscrypto hid_generic
> usbhid hid sd_mod ahci libahci libata xhci_pci ehci_pci crc32c_intel
> xhci_hcd ehci_hcd scsi_mod i2c_i801 igb i2c_algo_bit dca usbcore
> [  170.708344] CPU: 8 PID: 1031 Comm: qemu-system-x86 Not tainted 4.18.0-rc6
> #1
> [  170.708345] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b
> 05/02/2017
> [  170.708346] Call Trace:
> [  170.708354]  dump_stack+0x5c/0x7b
> [  170.708357]  bad_page+0xba/0x120
> [  170.708360]  get_page_from_freelist+0x1016/0x1250
> [  170.708364]  __alloc_pages_nodemask+0xfa/0x250
> [  170.708368]  alloc_pages_vma+0x7c/0x1c0
> [  170.708371]  do_swap_page+0x347/0x920
> [  170.708375]  ? do_huge_pmd_anonymous_page+0x461/0x6f0
> [  170.708377]  __handle_mm_fault+0x7b4/0x1110
> [  170.708380]  ? call_function_interrupt+0xa/0x20
> [  170.708383]  handle_mm_fault+0xfc/0x1f0
> [  170.708385]  __get_user_pages+0x12f/0x690
> [  170.708387]  get_user_pages_unlocked+0x148/0x1f0
> [  170.708415]  __gfn_to_pfn_memslot+0xff/0x3c0 [kvm]
> [  170.708433]  try_async_pf+0x87/0x230 [kvm]
> [  170.708450]  tdp_page_fault+0x132/0x290 [kvm]
> [  170.708455]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708470]  kvm_mmu_page_fault+0x74/0x570 [kvm]
> [  170.708474]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708477]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708480]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708484]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708487]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708490]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708493]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708497]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708500]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708503]  ? vmexit_fill_RSB+0x18/0x30 [kvm_intel]
> [  170.708506]  ? vmexit_fill_RSB+0xc/0x30 [kvm_intel]
> [  170.708510]  ? vmx_vcpu_run+0x375/0x620 [kvm_intel]
> [  170.708526]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1990 [kvm]
> [  170.708529]  ? futex_wake+0x94/0x170
> [  170.708542]  ? kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  170.708555]  kvm_vcpu_ioctl+0x388/0x5d0 [kvm]
> [  170.708558]  ? __handle_mm_fault+0x7c4/0x1110
> [  170.708561]  do_vfs_ioctl+0xa2/0x630
> [  170.708563]  ? __x64_sys_futex+0x88/0x180
> [  170.708565]  ksys_ioctl+0x70/0x80
> [  170.708568]  ? exit_to_usermode_loop+0xca/0xf0
> [  170.708570]  __x64_sys_ioctl+0x16/0x20
> [  170.708572]  do_syscall_64+0x55/0x100
> [  170.708574]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  170.708577] RIP: 0033:0x7fc9e4889dd7
> [  170.708577] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
> [  170.708610] RSP: 002b:00007fc9c27fb8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  170.708612] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fc9e4889dd7
> [  170.708613] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000015
> [  170.708614] RBP: 000055dbb5f263e0 R08: 000055dbb34f03d0 R09: 00000000ffffffff
> [  170.708616] R10: 00007fc9c27fb670 R11: 0000000000000246 R12: 0000000000000000
> [  170.708617] R13: 00007fc9e9ed5000 R14: 0000000000000000 R15: 000055dbb5f263e0
> [  170.708618] Disabling lock debugging due to kernel taint
> 
> --
> Kind regards,
> 
> Tino Lehnig

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25 13:21     ` Minchan Kim
@ 2018-07-25 15:12       ` Tino Lehnig
  2018-07-26  2:03         ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-25 15:12 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi,

On 07/25/2018 03:21 PM, Minchan Kim wrote:
> It would be much helpful if you could check more versions with git-bisect.

I started bisecting today, but my results are not conclusive yet. It is 
certain that the problem started with 4.15, though. I have not 
encountered the bug message in 4.15-rc1 so far, but the KVM processes 
always became unresponsive after hitting swap and could not be killed 
there. I saw the same behavior in rc2, rc3, and other builds in between, 
but the bad page state bug would only trigger occasionally there. The 
behavior in 4.15.18 is the same as in newer kernels.
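
For reference, the bisect runs above follow the standard git-bisect 
workflow; a sketch (the good/bad endpoints are taken from my 
observations, and each step needs a build, boot, and test cycle):

```shell
# Bisect mainline between the last known-good and first known-bad tags.
cd linux
git bisect start
git bisect bad v4.15-rc1     # KVM processes hang under zram swap here
git bisect good v4.14        # behaves normally here
# At each step: build and boot the kernel, run the KVM/zram swap
# workload, then record the outcome; repeat until git reports the
# first bad commit.
git bisect good              # or: git bisect bad
git bisect reset             # return to the original branch when done
```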

> I also want to reproduce it.
> 
> Today, I downloaded a Windows ISO and ran it as a cdrom with my own
> compiled kernel under KVM, but I couldn't reproduce the bug.
> I also tested some heavy swap workloads (a kernel build with multiple
> CPUs on small memory), but I failed to reproduce it there, too.
> 
> Could you please tell me about your method in more detail?

I found that running Windows in KVM really is the only reliable method, 
maybe because the zero pages are easily compressible. There is actually 
not a lot of disk utilization on the backing device when running this test.

My operating system is a minimal install of Debian 9. I took the kernel 
configuration from the default Debian kernel and built my own kernel 
with "make oldconfig" leaving all settings at their defaults. The only 
thing I changed in the configuration was enabling the zram writeback 
feature.
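
For the record, that single change can be applied like this (a sketch 
using the `scripts/config` helper shipped in the kernel tree; the config 
file path is illustrative):

```shell
# Start from the distribution config, then enable only zram writeback.
cp /boot/config-"$(uname -r)" .config
./scripts/config --enable CONFIG_ZRAM_WRITEBACK
make oldconfig    # keep all other settings at their defaults
```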

All my tests were done on bare-metal hardware with Xeon processors and 
lots of RAM. I encounter the bug quite quickly, but it still takes 
several GBs of swap usage. Below is my /proc/meminfo with enough KVM 
instances running (3 in my case) to trigger the bug on my test machine.

I will also try to reproduce the problem on some different hardware next.

--

MemTotal:       264033384 kB
MemFree:         1232968 kB
MemAvailable:          0 kB
Buffers:            1152 kB
Cached:             5036 kB
SwapCached:        49200 kB
Active:         249955744 kB
Inactive:        5096148 kB
Active(anon):   249953396 kB
Inactive(anon):  5093084 kB
Active(file):       2348 kB
Inactive(file):     3064 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:      1073741820 kB
SwapFree:       938603260 kB
Dirty:                68 kB
Writeback:             0 kB
AnonPages:      255007752 kB
Mapped:             4708 kB
Shmem:              1212 kB
Slab:              88500 kB
SReclaimable:      16096 kB
SUnreclaim:        72404 kB
KernelStack:        5040 kB
PageTables:       765560 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    1205758512 kB
Committed_AS:   403586176 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:  254799872 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       75136 kB
DirectMap2M:    10295296 kB
DirectMap1G:    260046848 kB

-- 
Kind regards,

Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-25 15:12       ` Tino Lehnig
@ 2018-07-26  2:03         ` Minchan Kim
  2018-07-26  6:10           ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-26  2:03 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi Tino,

On Wed, Jul 25, 2018 at 05:12:13PM +0200, Tino Lehnig wrote:
> Hi,
> 
> On 07/25/2018 03:21 PM, Minchan Kim wrote:
> > It would be much helpful if you could check more versions with git-bisect.
> 
> I started bisecting today, but my results are not conclusive yet. It is
> certain that the problem started with 4.15 though. I have not encountered

A thing I could imagine is 
[0bcac06f27d75, skip swapcache for swapin of synchronous device]
It was merged into v4.15. Could you check it by bisecting?

With the writeback feature enabled, zram is no longer a synchronous
device, so that could explain the unresponsive symptom you are seeing now.
I should disable the synchronous flag when zram works with the writeback feature.
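
For context, the swap-in change introduced by that commit can be 
sketched roughly like this (simplified pseudocode based on my reading of 
do_swap_page(); not the literal kernel source):

```c
/* Sketch: swap-in decision after 0bcac06f27d75 (simplified) */
if ((si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(si, entry) == 1) {
        /* Fast path: skip the swapcache and read the page back
         * synchronously from the swap device (e.g. zram). */
        page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);
        swap_readpage(page, true);   /* polls until the IO completes */
} else {
        /* Slow path: go through the swapcache as before. */
        page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
                                vma, vmf->address);
}
```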

However, I still need to work out why the PG_uptodate WARN is triggering.
I guess there are some places that assume all anonymous pages with
PG_swapbacked and !PG_dirty are in the swapcache, so the reference
accounting could be corrupted.

> the bug message in 4.15-rc1 so far, but the kvm processes always became
> unresponsive after hitting swap and could not be killed there. I saw the
> same behavior in rc2, rc3, and other builds in between, but the bad state
> bug would only trigger occasionally there. The behavior in 4.15.18 is the
> same as in newer kernels.
> 
> > I also want to reproduce it.
> > 
> > Today, I downloaded a Windows ISO and ran it as a cdrom with my own
> > compiled kernel under KVM, but I couldn't reproduce the bug.
> > I also tested some heavy swap workloads (a kernel build with multiple
> > CPUs on small memory), but I failed to reproduce it there, too.
> > 
> > Could you please tell me about your method in more detail?
> 
> I found that running Windows in KVM really is the only reliable method,
> maybe because the zero pages are easily compressible. There is actually not
> a lot of disk utilization on the backing device when running this test.

Hmm, my testing created a ton of disk IO on the backing device.
Maybe that's why I couldn't reproduce it.
Zero pages should never be written back to the device, so I don't
understand how that kind of testing reproduces the problem.
Anyway, I will try it again with zero pages plus a few random pages.

> 
> My operating system is a minimal install of Debian 9. I took the kernel
> configuration from the default Debian kernel and built my own kernel with
> "make oldconfig" leaving all settings at their defaults. The only thing I
> changed in the configuration was enabling the zram writeback feature.

You mean you changed the host kernel configuration?

> 
> All my tests were done on bare-metal hardware with Xeon processors and lots
> of RAM. I encounter the bug quite quickly, but it still takes several GBs of
> swap usage. Below is my /proc/meminfo with enough KVM instances running (3
> in my case) to trigger the bug on my test machine.

Aha.. so you enabled the writeback feature on your bare-metal host machine
and ran KVM with Windows images as guests. So the PG_uptodate warning
happens on the host side, not in the guest, right?

Thanks!

> 
> I will also try to reproduce the problem on some different hardware next.
> 
> --
> 
> MemTotal:       264033384 kB
> MemFree:         1232968 kB
> MemAvailable:          0 kB
> Buffers:            1152 kB
> Cached:             5036 kB
> SwapCached:        49200 kB
> Active:         249955744 kB
> Inactive:        5096148 kB
> Active(anon):   249953396 kB
> Inactive(anon):  5093084 kB
> Active(file):       2348 kB
> Inactive(file):     3064 kB
> Unevictable:           0 kB
> Mlocked:               0 kB
> SwapTotal:      1073741820 kB
> SwapFree:       938603260 kB
> Dirty:                68 kB
> Writeback:             0 kB
> AnonPages:      255007752 kB
> Mapped:             4708 kB
> Shmem:              1212 kB
> Slab:              88500 kB
> SReclaimable:      16096 kB
> SUnreclaim:        72404 kB
> KernelStack:        5040 kB
> PageTables:       765560 kB
> NFS_Unstable:          0 kB
> Bounce:                0 kB
> WritebackTmp:          0 kB
> CommitLimit:    1205758512 kB
> Committed_AS:   403586176 kB
> VmallocTotal:   34359738367 kB
> VmallocUsed:           0 kB
> VmallocChunk:          0 kB
> HardwareCorrupted:     0 kB
> AnonHugePages:  254799872 kB
> ShmemHugePages:        0 kB
> ShmemPmdMapped:        0 kB
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
> Hugepagesize:       2048 kB
> DirectMap4k:       75136 kB
> DirectMap2M:    10295296 kB
> DirectMap1G:    260046848 kB
> 
> -- 
> Kind regards,
> 
> Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26  2:03         ` Minchan Kim
@ 2018-07-26  6:10           ` Tino Lehnig
  2018-07-26  6:21             ` Minchan Kim
  2018-07-26 10:00             ` Tino Lehnig
  0 siblings, 2 replies; 30+ messages in thread
From: Tino Lehnig @ 2018-07-26  6:10 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

Hi,

On 07/26/2018 04:03 AM, Minchan Kim wrote:
> A thing I could imagine is
> [0bcac06f27d75, skip swapcache for swapin of synchronous device]
> It was merged into v4.15. Could you check it by bisecting?

Thanks, I will check that.

>> My operating system is a minimal install of Debian 9. I took the kernel
>> configuration from the default Debian kernel and built my own kernel with
>> "make oldconfig" leaving all settings at their defaults. The only thing I
>> changed in the configuration was enabling the zram writeback feature.
> 
> You mean you changed host kernel configuration?
> 
>>
>> All my tests were done on bare-metal hardware with Xeon processors and lots
>> of RAM. I encounter the bug quite quickly, but it still takes several GBs of
>> swap usage. Below is my /proc/meminfo with enough KVM instances running (3
>> in my case) to trigger the bug on my test machine.
> 
> Aha.. you did writeback feature into your bare-metal host machine and execute
> kvm with window images as a guest. So, PG_uptodate warning happens on host side,
> not guest? Right?

Yes, I am only talking about the host kernel. Zram swap is set up on the 
host. I just used Windows guests to fill up the host RAM and force it 
into swap.

>> I will also try to reproduce the problem on some different hardware next.

Just to confirm, I was able to reproduce the problem on another machine 
running Ubuntu 18.04 with the Ubuntu stock kernel (4.15) and no 
modifications to the kernel configuration whatsoever. The host had 8 GB 
of RAM, 32 GB of swap with zram and a 32 GB SSD as backing device. I had 
to start only one Windows VM with "-m 32768" to trigger the bug.
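
For anyone reproducing this: the swap setup described above boils down 
to something like the following sketch (the backing device `/dev/sdb` 
and the compression algorithm are illustrative, not a literal transcript 
of my commands):

```shell
# zram swap with an SSD partition as writeback backing device
modprobe zram num_devices=1
echo /dev/sdb > /sys/block/zram0/backing_dev  # must be set before disksize
echo lz4 > /sys/block/zram0/comp_algorithm
echo 32G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0
```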

-- 
Kind regards,

Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26  6:10           ` Tino Lehnig
@ 2018-07-26  6:21             ` Minchan Kim
  2018-07-26  6:34               ` Tino Lehnig
  2018-07-26 10:00             ` Tino Lehnig
  1 sibling, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-26  6:21 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Thu, Jul 26, 2018 at 08:10:41AM +0200, Tino Lehnig wrote:
> Hi,
> 
> On 07/26/2018 04:03 AM, Minchan Kim wrote:
> > A thing I could imagine is
> > [0bcac06f27d75, skip swapcache for swapin of synchronous device]
> > It was merged into v4.15. Could you check it by bisecting?
> 
> Thanks, I will check that.
> 
> > > My operating system is a minimal install of Debian 9. I took the kernel
> > > configuration from the default Debian kernel and built my own kernel with
> > > "make oldconfig" leaving all settings at their defaults. The only thing I
> > > changed in the configuration was enabling the zram writeback feature.
> > 
> > You mean you changed host kernel configuration?
> > 
> > > 
> > > All my tests were done on bare-metal hardware with Xeon processors and lots
> > > of RAM. I encounter the bug quite quickly, but it still takes several GBs of
> > > swap usage. Below is my /proc/meminfo with enough KVM instances running (3
> > > in my case) to trigger the bug on my test machine.
> > 
> > Aha.. so you enabled the writeback feature on your bare-metal host machine
> > and ran KVM with Windows images as guests. So the PG_uptodate warning
> > happens on the host side, not in the guest, right?
> 
> Yes, I am only talking about the host kernel. Zram swap is set up on the
> host. I just used Windows guests to fill up the host RAM and force it into
> swap.
> 
> > > I will also try to reproduce the problem on some different hardware next.
> 
> Just to confirm, I was able to reproduce the problem on another machine
> running Ubuntu 18.04 with the Ubuntu stock kernel (4.15) and no
> modifications to the kernel configuration whatsoever. The host had 8 GB of

That means you could reproduce it without the writeback feature?
If so, it would be more reasonable to check [0bcac06f27d75, skip swapcache for swapin of synchronous device]

> RAM, 32 GB of swap with zram and a 32 GB SSD as backing device. I had to
> start only one Windows VM with "-m 32768" to trigger the bug.

Thanks. I will try it later today.


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26  6:21             ` Minchan Kim
@ 2018-07-26  6:34               ` Tino Lehnig
  0 siblings, 0 replies; 30+ messages in thread
From: Tino Lehnig @ 2018-07-26  6:34 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/26/2018 08:21 AM, Minchan Kim wrote:
> That means you could reproduce it without the writeback feature?
> If so, it would be more reasonable to check [0bcac06f27d75, skip swapcache for swapin of synchronous device]

No, the bug only occurs with a backing device. The writeback feature is 
enabled in the default Ubuntu kernel configuration.

-- 
Kind regards,

Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26  6:10           ` Tino Lehnig
  2018-07-26  6:21             ` Minchan Kim
@ 2018-07-26 10:00             ` Tino Lehnig
  2018-07-26 10:30               ` Minchan Kim
  1 sibling, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-26 10:00 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/26/2018 08:10 AM, Tino Lehnig wrote:
>> A thing I could imagine is
>> [0bcac06f27d75, skip swapcache for swapin of synchronous device]
>> It was merged into v4.15. Could you check it by bisecting?
> 
> Thanks, I will check that.

So I get the same behavior as in v4.15-rc1 after this commit. All prior 
builds are fine.

I have also tested all other 4.15 rc builds now and the symptoms are the 
same through rc8. KVM processes become unresponsive and I see kernel 
messages like the one below. This happens with and without the writeback 
feature being used. The bad page state bug appears very rarely in these 
versions and only when writeback is active.

Starting with rc9, I only get the same bad page state bug as in all 
newer kernels.

--

[  363.494793] INFO: task kworker/4:2:498 blocked for more than 120 seconds.
[  363.494872]       Not tainted 4.14.0-zram-pre-rc1 #17
[  363.494943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.495021] kworker/4:2     D    0   498      2 0x80000000
[  363.495029] Workqueue: events async_pf_execute
[  363.495030] Call Trace:
[  363.495037]  ? __schedule+0x3bc/0x830
[  363.495039]  schedule+0x32/0x80
[  363.495042]  io_schedule+0x12/0x40
[  363.495045]  __lock_page_or_retry+0x302/0x320
[  363.495047]  ? page_cache_tree_insert+0xa0/0xa0
[  363.495051]  do_swap_page+0x4ab/0x860
[  363.495054]  __handle_mm_fault+0x77b/0x10c0
[  363.495056]  handle_mm_fault+0xc6/0x1b0
[  363.495059]  __get_user_pages+0xf9/0x620
[  363.495061]  ? update_load_avg+0x5d6/0x6d0
[  363.495064]  get_user_pages_remote+0x137/0x1f0
[  363.495067]  async_pf_execute+0x62/0x180
[  363.495071]  process_one_work+0x184/0x380
[  363.495073]  worker_thread+0x4d/0x3c0
[  363.495076]  kthread+0xf5/0x130
[  363.495078]  ? process_one_work+0x380/0x380
[  363.495080]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.495083]  ? do_group_exit+0x3a/0xa0
[  363.495086]  ret_from_fork+0x1f/0x30

-- 
Kind regards,

Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26 10:00             ` Tino Lehnig
@ 2018-07-26 10:30               ` Minchan Kim
  2018-07-26 12:35                 ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-26 10:30 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Thu, Jul 26, 2018 at 12:00:44PM +0200, Tino Lehnig wrote:
> On 07/26/2018 08:10 AM, Tino Lehnig wrote:
> > > A thing I could imagine is
> > > [0bcac06f27d75, skip swapcache for swapin of synchronous device]
> > > It was merged into v4.15. Could you check it by bisecting?
> > 
> > Thanks, I will check that.
> 
> So I get the same behavior as in v4.15-rc1 after this commit. All prior
> builds are fine.
> 
> I have also tested all other 4.15 rc builds now and the symptoms are the
> same through rc8. KVM processes become unresponsive and I see kernel

Yup, I think it's the polling routine of swap_readpage. With the patch
I mentioned, the swap layer will wait for the IO synchronously, and I believe
the page was on the backing device, not in zram memory.

> messages like the one below. This happens with and without the writeback

Huh, you see it without writeback? That's weird. Without the writeback feature,
zram operation is always synchronous memory compression/decompression,
so you shouldn't see the io_schedule logic below, which happens only for
asynchronous IO operations.
Could you check one more time that it happens without writeback?

> feature being used. The bad page state bug appears very rarely in these
> versions and only when writeback is active.

Yup, I will review the code some more. I guess there are some places that assume
anonymous pages are in the swapcache, so the refcount accounting could be broken.

> 
> Starting with rc9, I only get the same bad page state bug as in all newer
> kernels.

So, you mean you couldn't see the bad page state bug until 4.15-rc8?
You just saw the hung task message below until 4.15-rc8, not the bad page bug?

> 
> --
> 
> [  363.494793] INFO: task kworker/4:2:498 blocked for more than 120 seconds.
> [  363.494872]       Not tainted 4.14.0-zram-pre-rc1 #17
> [  363.494943] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.495021] kworker/4:2     D    0   498      2 0x80000000
> [  363.495029] Workqueue: events async_pf_execute
> [  363.495030] Call Trace:
> [  363.495037]  ? __schedule+0x3bc/0x830
> [  363.495039]  schedule+0x32/0x80
> [  363.495042]  io_schedule+0x12/0x40
> [  363.495045]  __lock_page_or_retry+0x302/0x320
> [  363.495047]  ? page_cache_tree_insert+0xa0/0xa0
> [  363.495051]  do_swap_page+0x4ab/0x860
> [  363.495054]  __handle_mm_fault+0x77b/0x10c0
> [  363.495056]  handle_mm_fault+0xc6/0x1b0
> [  363.495059]  __get_user_pages+0xf9/0x620
> [  363.495061]  ? update_load_avg+0x5d6/0x6d0
> [  363.495064]  get_user_pages_remote+0x137/0x1f0
> [  363.495067]  async_pf_execute+0x62/0x180
> [  363.495071]  process_one_work+0x184/0x380
> [  363.495073]  worker_thread+0x4d/0x3c0
> [  363.495076]  kthread+0xf5/0x130
> [  363.495078]  ? process_one_work+0x380/0x380
> [  363.495080]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.495083]  ? do_group_exit+0x3a/0xa0
> [  363.495086]  ret_from_fork+0x1f/0x30
> 
> -- 
> Kind regards,
> 
> Tino Lehnig

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26 10:30               ` Minchan Kim
@ 2018-07-26 12:35                 ` Tino Lehnig
  2018-07-27  9:14                   ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-26 12:35 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/26/2018 12:30 PM, Minchan Kim wrote:
> Huh, you see it without writeback? It's weird. Without writeback feature,
> zram operaion is always synchronous on memory compression/decompression
> so you shouldn't see below io_schedule logic which happens only for
> asynchronous IO operation.
> Could you check one more time that it happens without writeback?

Confirmed. More kernel logs below. backing_dev was not set in this run. 
This does not happen with 4.15-rc9 and newer.

> So, you mean you couldn't se bad page state bug until 4.15-rc8?
> You just see below hung message until 4.15-rc8, not bad page bug?
> 

I did see the bad page state bug first in 4.15-rc2, but only very 
rarely. I have attached the log below the other one. Most test runs from 
commit 0bcac06f27d75 through 4.15-rc8 just resulted in hung task errors.

--

[  363.116153] INFO: task kworker/0:1:130 blocked for more than 120 seconds.
[  363.116237]       Not tainted 4.15.0-rc8-zram #22
[  363.116311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.116392] kworker/0:1     D    0   130      2 0x80000000
[  363.116405] Workqueue: events async_pf_execute
[  363.116407] Call Trace:
[  363.116416]  ? __schedule+0x3d0/0x850
[  363.116419]  schedule+0x32/0x80
[  363.116426]  io_schedule+0x12/0x40
[  363.116433]  __lock_page_or_retry+0x302/0x320
[  363.116437]  ? page_cache_tree_insert+0xb0/0xb0
[  363.116442]  do_swap_page+0x4dd/0x870
[  363.116446]  __handle_mm_fault+0x790/0x10c0
[  363.116450]  handle_mm_fault+0xc6/0x1b0
[  363.116453]  __get_user_pages+0xf9/0x620
[  363.116457]  get_user_pages_remote+0x137/0x1f0
[  363.116462]  async_pf_execute+0x62/0x180
[  363.116469]  process_one_work+0x189/0x380
[  363.116474]  worker_thread+0x4d/0x3c0
[  363.116478]  kthread+0xf8/0x130
[  363.116482]  ? process_one_work+0x380/0x380
[  363.116486]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.116492]  ret_from_fork+0x32/0x40
[  363.116498] INFO: task kswapd0:159 blocked for more than 120 seconds.
[  363.116576]       Not tainted 4.15.0-rc8-zram #22
[  363.116650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.116731] kswapd0         D    0   159      2 0x80000000
[  363.116734] Call Trace:
[  363.116738]  ? __schedule+0x3d0/0x850
[  363.116741]  schedule+0x32/0x80
[  363.116745]  io_schedule+0x12/0x40
[  363.116749]  __lock_page+0x109/0x130
[  363.116753]  ? page_cache_tree_insert+0xb0/0xb0
[  363.116759]  deferred_split_scan+0x1e9/0x2a0
[  363.116762]  shrink_slab.part.49+0x1e6/0x3d0
[  363.116768]  shrink_node+0x2e7/0x2f0
[  363.116771]  kswapd+0x35b/0x6f0
[  363.116776]  kthread+0xf8/0x130
[  363.116779]  ? mem_cgroup_shrink_node+0x150/0x150
[  363.116782]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.116786]  ret_from_fork+0x32/0x40
[  363.116790] INFO: task kworker/9:1:196 blocked for more than 120 seconds.
[  363.116869]       Not tainted 4.15.0-rc8-zram #22
[  363.116942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.117023] kworker/9:1     D    0   196      2 0x80000000
[  363.117029] Workqueue: events async_pf_execute
[  363.117031] Call Trace:
[  363.117034]  ? __schedule+0x3d0/0x850
[  363.117037]  schedule+0x32/0x80
[  363.117041]  io_schedule+0x12/0x40
[  363.117046]  __lock_page_or_retry+0x302/0x320
[  363.117050]  ? page_cache_tree_insert+0xb0/0xb0
[  363.117053]  do_swap_page+0x4dd/0x870
[  363.117057]  __handle_mm_fault+0x790/0x10c0
[  363.117061]  handle_mm_fault+0xc6/0x1b0
[  363.117063]  __get_user_pages+0xf9/0x620
[  363.117068]  ? update_load_avg+0x5c0/0x6f0
[  363.117071]  get_user_pages_remote+0x137/0x1f0
[  363.117076]  async_pf_execute+0x62/0x180
[  363.117081]  process_one_work+0x189/0x380
[  363.117085]  worker_thread+0x4d/0x3c0
[  363.117089]  kthread+0xf8/0x130
[  363.117093]  ? process_one_work+0x380/0x380
[  363.117096]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.117100]  ret_from_fork+0x32/0x40
[  363.117104] INFO: task kworker/19:1:201 blocked for more than 120 seconds.
[  363.117183]       Not tainted 4.15.0-rc8-zram #22
[  363.117256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.117337] kworker/19:1    D    0   201      2 0x80000000
[  363.117343] Workqueue: events async_pf_execute
[  363.117345] Call Trace:
[  363.117349]  ? __schedule+0x3d0/0x850
[  363.117352]  schedule+0x32/0x80
[  363.117356]  io_schedule+0x12/0x40
[  363.117360]  __lock_page_or_retry+0x302/0x320
[  363.117364]  ? page_cache_tree_insert+0xb0/0xb0
[  363.117367]  do_swap_page+0x4dd/0x870
[  363.117370]  __handle_mm_fault+0x790/0x10c0
[  363.117374]  handle_mm_fault+0xc6/0x1b0
[  363.117377]  __get_user_pages+0xf9/0x620
[  363.117380]  ? update_load_avg+0x5c0/0x6f0
[  363.117383]  get_user_pages_remote+0x137/0x1f0
[  363.117388]  async_pf_execute+0x62/0x180
[  363.117393]  process_one_work+0x189/0x380
[  363.117397]  worker_thread+0x4d/0x3c0
[  363.117401]  kthread+0xf8/0x130
[  363.117404]  ? process_one_work+0x380/0x380
[  363.117407]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.117411]  ret_from_fork+0x32/0x40
[  363.117415] INFO: task kworker/5:1:207 blocked for more than 120 seconds.
[  363.117494]       Not tainted 4.15.0-rc8-zram #22
[  363.117567] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.117648] kworker/5:1     D    0   207      2 0x80000000
[  363.117654] Workqueue: events async_pf_execute
[  363.117655] Call Trace:
[  363.117659]  ? __schedule+0x3d0/0x850
[  363.117662]  schedule+0x32/0x80
[  363.117665]  io_schedule+0x12/0x40
[  363.117669]  __lock_page_or_retry+0x302/0x320
[  363.117673]  ? page_cache_tree_insert+0xb0/0xb0
[  363.117676]  do_swap_page+0x4dd/0x870
[  363.117680]  ? check_preempt_curr+0x83/0x90
[  363.117683]  __handle_mm_fault+0x790/0x10c0
[  363.117687]  handle_mm_fault+0xc6/0x1b0
[  363.117689]  __get_user_pages+0xf9/0x620
[  363.117693]  ? update_load_avg+0x5c0/0x6f0
[  363.117695]  get_user_pages_remote+0x137/0x1f0
[  363.117700]  async_pf_execute+0x62/0x180
[  363.117705]  process_one_work+0x189/0x380
[  363.117709]  worker_thread+0x4d/0x3c0
[  363.117713]  kthread+0xf8/0x130
[  363.117717]  ? process_one_work+0x380/0x380
[  363.117720]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.117723]  ret_from_fork+0x32/0x40
[  363.117731] INFO: task kworker/5:2:437 blocked for more than 120 seconds.
[  363.117828]       Not tainted 4.15.0-rc8-zram #22
[  363.117920] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.118039] kworker/5:2     D    0   437      2 0x80000000
[  363.118044] Workqueue: events async_pf_execute
[  363.118046] Call Trace:
[  363.118049]  ? __schedule+0x3d0/0x850
[  363.118052]  schedule+0x32/0x80
[  363.118056]  io_schedule+0x12/0x40
[  363.118060]  __lock_page_or_retry+0x302/0x320
[  363.118064]  ? page_cache_tree_insert+0xb0/0xb0
[  363.118067]  do_swap_page+0x4dd/0x870
[  363.118070]  __handle_mm_fault+0x790/0x10c0
[  363.118074]  handle_mm_fault+0xc6/0x1b0
[  363.118077]  __get_user_pages+0xf9/0x620
[  363.118080]  ? update_load_avg+0x5c0/0x6f0
[  363.118083]  get_user_pages_remote+0x137/0x1f0
[  363.118087]  async_pf_execute+0x62/0x180
[  363.118092]  process_one_work+0x189/0x380
[  363.118096]  worker_thread+0x4d/0x3c0
[  363.118100]  kthread+0xf8/0x130
[  363.118104]  ? process_one_work+0x380/0x380
[  363.118107]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.118113]  ? do_group_exit+0x3a/0xa0
[  363.118116]  ret_from_fork+0x32/0x40
[  363.118120] INFO: task kworker/8:2:501 blocked for more than 120 seconds.
[  363.118217]       Not tainted 4.15.0-rc8-zram #22
[  363.118309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.118427] kworker/8:2     D    0   501      2 0x80000000
[  363.118433] Workqueue: events async_pf_execute
[  363.118435] Call Trace:
[  363.118438]  ? __schedule+0x3d0/0x850
[  363.118441]  schedule+0x32/0x80
[  363.118445]  io_schedule+0x12/0x40
[  363.118448]  __lock_page_or_retry+0x302/0x320
[  363.118452]  ? page_cache_tree_insert+0xb0/0xb0
[  363.118455]  do_swap_page+0x4dd/0x870
[  363.118459]  __handle_mm_fault+0x790/0x10c0
[  363.118463]  handle_mm_fault+0xc6/0x1b0
[  363.118465]  __get_user_pages+0xf9/0x620
[  363.118469]  ? update_load_avg+0x5c0/0x6f0
[  363.118471]  get_user_pages_remote+0x137/0x1f0
[  363.118476]  async_pf_execute+0x62/0x180
[  363.118481]  process_one_work+0x189/0x380
[  363.118485]  worker_thread+0x4d/0x3c0
[  363.118489]  kthread+0xf8/0x130
[  363.118492]  ? process_one_work+0x380/0x380
[  363.118495]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.118498]  ? kthread_create_worker_on_cpu+0x50/0x50
[  363.118502]  ret_from_fork+0x32/0x40
[  363.118511] INFO: task qemu-system-x86:943 blocked for more than 120 seconds.
[  363.118609]       Not tainted 4.15.0-rc8-zram #22
[  363.118701] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  363.118820] qemu-system-x86 D    0   943    931 0x00000000
[  363.118822] Call Trace:
[  363.118826]  ? __schedule+0x3d0/0x850
[  363.118829]  schedule+0x32/0x80
[  363.118833]  io_schedule+0x12/0x40
[  363.118837]  __lock_page+0x109/0x130
[  363.118840]  ? page_cache_tree_insert+0xb0/0xb0
[  363.118844]  deferred_split_scan+0x1e9/0x2a0
[  363.118848]  shrink_slab.part.49+0x1e6/0x3d0
[  363.118852]  shrink_node+0x2e7/0x2f0
[  363.118856]  do_try_to_free_pages+0xd5/0x320
[  363.118859]  try_to_free_pages+0xd6/0x190
[  363.118864]  __alloc_pages_slowpath+0x34d/0xdc0
[  363.118869]  __alloc_pages_nodemask+0x214/0x230
[  363.118873]  do_huge_pmd_anonymous_page+0x13a/0x610
[  363.118877]  __handle_mm_fault+0xe04/0x10c0
[  363.118879]  ? kvm_vcpu_mmap+0x20/0x20
[  363.118884]  ? bsearch+0x52/0x90
[  363.118887]  handle_mm_fault+0xc6/0x1b0
[  363.118890]  __get_user_pages+0xf9/0x620
[  363.118893]  get_user_pages+0x3e/0x50
[  363.118898]  __gfn_to_pfn_memslot+0xff/0x3d0
[  363.118903]  try_async_pf+0x53/0x1d0
[  363.118907]  tdp_page_fault+0x10e/0x260
[  363.118912]  ? vmexit_fill_RSB+0x11/0x30
[  363.118915]  kvm_mmu_page_fault+0x59/0x130
[  363.118920]  vmx_handle_exit+0x9f/0x1530
[  363.118924]  ? vmexit_fill_RSB+0x11/0x30
[  363.118927]  ? vmx_vcpu_run+0x32f/0x4d0
[  363.118932]  kvm_arch_vcpu_ioctl_run+0x90c/0x1670
[  363.118936]  ? handle_machine_check+0x10/0x10
[  363.118938]  ? kvm_arch_vcpu_load+0x68/0x250
[  363.118943]  ? kvm_vcpu_ioctl+0x2e8/0x580
[  363.118946]  kvm_vcpu_ioctl+0x2e8/0x580
[  363.118952]  do_vfs_ioctl+0x92/0x5f0
[  363.118955]  ? handle_mm_fault+0xc6/0x1b0
[  363.118960]  ? kvm_on_user_return+0x68/0xa0
[  363.118964]  SyS_ioctl+0x74/0x80
[  363.118969]  entry_SYSCALL_64_fastpath+0x24/0x87
[  363.118972] RIP: 0033:0x7fcc51943dd7

--

[  967.078557] BUG: Bad page state in process qemu-system-x86  pfn:3f6d853
[  967.078637] page:000000004d26db38 count:0 mapcount:0 mapping:  (null) index:0x1
[  967.078715] flags: 0x17fffc000000008(uptodate)
[  967.078786] raw: 017fffc000000008 0000000000000000 0000000000000001 00000000ffffffff
[  967.078863] raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
[  967.078939] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[  967.079012] bad because of flags: 0x8(uptodate)
[  967.079081] Modules linked in: lz4 lz4_compress zram
[  967.079085] CPU: 2 PID: 946 Comm: qemu-system-x86 Not tainted 4.15.0-rc2-zram #4
[  967.079086] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
[  967.079087] Call Trace:
[  967.079093]  dump_stack+0x5c/0x84
[  967.079097]  bad_page+0xba/0x120
[  967.079098]  get_page_from_freelist+0xfe7/0x1230
[  967.079101]  __alloc_pages_nodemask+0xea/0x230
[  967.079104]  alloc_pages_vma+0x7c/0x1c0
[  967.079106]  do_swap_page+0x474/0x870
[  967.079109]  ? do_huge_pmd_anonymous_page+0x417/0x610
[  967.079110]  __handle_mm_fault+0xa53/0x1160
[  967.079112]  handle_mm_fault+0xc6/0x1b0
[  967.079114]  __get_user_pages+0xf9/0x620
[  967.079117]  get_user_pages+0x3e/0x50
[  967.079119]  __gfn_to_pfn_memslot+0xff/0x3d0
[  967.079122]  try_async_pf+0x53/0x1c0
[  967.079124]  tdp_page_fault+0x10e/0x260
[  967.079125]  kvm_mmu_page_fault+0x53/0x130
[  967.079128]  vmx_handle_exit+0x9c/0x1500
[  967.079129]  ? atomic_switch_perf_msrs+0x5f/0x80
[  967.079130]  ? vmx_vcpu_run+0x30a/0x4b0
[  967.079133]  kvm_arch_vcpu_ioctl_run+0x8dc/0x15e0
[  967.079135]  ? kvm_arch_vcpu_load+0x62/0x230
[  967.079136]  ? kvm_vcpu_ioctl+0x2e8/0x580
[  967.079137]  kvm_vcpu_ioctl+0x2e8/0x580
[  967.079141]  ? wake_up_q+0x70/0x70
[  967.079144]  do_vfs_ioctl+0x8f/0x5e0
[  967.079147]  ? kvm_on_user_return+0x68/0xa0
[  967.079148]  SyS_ioctl+0x74/0x80
[  967.079152]  entry_SYSCALL_64_fastpath+0x1e/0x81
[  967.079154] RIP: 0033:0x7f4410161dd7
[  967.079154] RSP: 002b:00007f43eeffc8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  967.079156] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f4410161dd7
[  967.079156] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000013
[  967.079157] RBP: 000055c5a36f8e50 R08: 000055c5a0fa73d0 R09: 00000000ffffffff
[  967.079158] R10: 003b83305f306bdf R11: 0000000000000246 R12: 0000000000000000
[  967.079158] R13: 00007f44157ad000 R14: 0000000000000000 R15: 000055c5a36f8e50
[  967.079160] Disabling lock debugging due to kernel taint
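(Side note on the dump above: the "bad because of flags" value is just the reported flags word masked against the PAGE_FLAGS_CHECK_AT_PREP set. The PG_uptodate bit position below matches this 4.15 x86_64 kernel; treat it as an assumption for other kernels or configs.)

```shell
# Check the "Bad page state" dump arithmetic: bit 3 is PG_uptodate here,
# so masking the reported flags word yields the "bad because of flags" value.
flags=0x17fffc000000008
pg_uptodate=$(( 1 << 3 ))    # PG_uptodate mask, i.e. 0x8
printf 'bad because of flags: 0x%x\n' $(( flags & pg_uptodate ))    # -> 0x8
```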

-- 
Kind regards,

Tino Lehnig

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-26 12:35                 ` Tino Lehnig
@ 2018-07-27  9:14                   ` Minchan Kim
  2018-07-27 11:00                     ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-27  9:14 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 14468 bytes --]

I tried to reproduce this with KVM but was not successful, and I don't have
a real machine to reproduce it on. I am asking for one device for it.

Anyway, I want to try this patch.
Could you apply the two attached patches?

On Thu, Jul 26, 2018 at 02:35:15PM +0200, Tino Lehnig wrote:
> On 07/26/2018 12:30 PM, Minchan Kim wrote:
> > Huh, you see it without writeback? It's weird. Without writeback feature,
> > zram operation is always synchronous on memory compression/decompression
> > so you shouldn't see below io_schedule logic which happens only for
> > asynchronous IO operation.
> > Could you check one more time that it happens without writeback?
> 
> Confirmed. More kernel logs below. backing_dev was not set in this run. This
> does not happen with 4.15-rc9 and newer.

I am confused. You mean after 4.15-rc9, you are not seeing the *hung* problem?

> 
> > So, you mean you couldn't see the bad page state bug until 4.15-rc8?
> > You just see below hung message until 4.15-rc8, not bad page bug?
> > 
> 
> I did see the bad page state bug first in 4.15-rc2, but only very rarely. I
> have attached the log below the other one. Most test runs from commit
> 0bcac06f27d75 through 4.15-rc8 just resulted in hung task errors.

So you mean you see the bad page state bug with recent kernels, right?
It seems there are two problems now.

1. Hung and 2. bad page

Which of these bugs happens on which kernel version?
Could you clarify it?

> 
> --
> 
> [  363.116153] INFO: task kworker/0:1:130 blocked for more than 120 seconds.
> [  363.116237]       Not tainted 4.15.0-rc8-zram #22
> [  363.116311] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [  363.116392] kworker/0:1     D    0   130      2 0x80000000
> [  363.116405] Workqueue: events async_pf_execute
> [  363.116407] Call Trace:
> [  363.116416]  ? __schedule+0x3d0/0x850
> [  363.116419]  schedule+0x32/0x80
> [  363.116426]  io_schedule+0x12/0x40
> [  363.116433]  __lock_page_or_retry+0x302/0x320
> [  363.116437]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.116442]  do_swap_page+0x4dd/0x870
> [  363.116446]  __handle_mm_fault+0x790/0x10c0
> [  363.116450]  handle_mm_fault+0xc6/0x1b0
> [  363.116453]  __get_user_pages+0xf9/0x620
> [  363.116457]  get_user_pages_remote+0x137/0x1f0
> [  363.116462]  async_pf_execute+0x62/0x180
> [  363.116469]  process_one_work+0x189/0x380
> [  363.116474]  worker_thread+0x4d/0x3c0
> [  363.116478]  kthread+0xf8/0x130
> [  363.116482]  ? process_one_work+0x380/0x380
> [  363.116486]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.116492]  ret_from_fork+0x32/0x40
> [  363.116498] INFO: task kswapd0:159 blocked for more than 120 seconds.
> [  363.116576]       Not tainted 4.15.0-rc8-zram #22
> [  363.116650] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.116731] kswapd0         D    0   159      2 0x80000000
> [  363.116734] Call Trace:
> [  363.116738]  ? __schedule+0x3d0/0x850
> [  363.116741]  schedule+0x32/0x80
> [  363.116745]  io_schedule+0x12/0x40
> [  363.116749]  __lock_page+0x109/0x130
> [  363.116753]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.116759]  deferred_split_scan+0x1e9/0x2a0
> [  363.116762]  shrink_slab.part.49+0x1e6/0x3d0
> [  363.116768]  shrink_node+0x2e7/0x2f0
> [  363.116771]  kswapd+0x35b/0x6f0
> [  363.116776]  kthread+0xf8/0x130
> [  363.116779]  ? mem_cgroup_shrink_node+0x150/0x150
> [  363.116782]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.116786]  ret_from_fork+0x32/0x40
> [  363.116790] INFO: task kworker/9:1:196 blocked for more than 120 seconds.
> [  363.116869]       Not tainted 4.15.0-rc8-zram #22
> [  363.116942] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.117023] kworker/9:1     D    0   196      2 0x80000000
> [  363.117029] Workqueue: events async_pf_execute
> [  363.117031] Call Trace:
> [  363.117034]  ? __schedule+0x3d0/0x850
> [  363.117037]  schedule+0x32/0x80
> [  363.117041]  io_schedule+0x12/0x40
> [  363.117046]  __lock_page_or_retry+0x302/0x320
> [  363.117050]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.117053]  do_swap_page+0x4dd/0x870
> [  363.117057]  __handle_mm_fault+0x790/0x10c0
> [  363.117061]  handle_mm_fault+0xc6/0x1b0
> [  363.117063]  __get_user_pages+0xf9/0x620
> [  363.117068]  ? update_load_avg+0x5c0/0x6f0
> [  363.117071]  get_user_pages_remote+0x137/0x1f0
> [  363.117076]  async_pf_execute+0x62/0x180
> [  363.117081]  process_one_work+0x189/0x380
> [  363.117085]  worker_thread+0x4d/0x3c0
> [  363.117089]  kthread+0xf8/0x130
> [  363.117093]  ? process_one_work+0x380/0x380
> [  363.117096]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.117100]  ret_from_fork+0x32/0x40
> [  363.117104] INFO: task kworker/19:1:201 blocked for more than 120 seconds.
> [  363.117183]       Not tainted 4.15.0-rc8-zram #22
> [  363.117256] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.117337] kworker/19:1    D    0   201      2 0x80000000
> [  363.117343] Workqueue: events async_pf_execute
> [  363.117345] Call Trace:
> [  363.117349]  ? __schedule+0x3d0/0x850
> [  363.117352]  schedule+0x32/0x80
> [  363.117356]  io_schedule+0x12/0x40
> [  363.117360]  __lock_page_or_retry+0x302/0x320
> [  363.117364]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.117367]  do_swap_page+0x4dd/0x870
> [  363.117370]  __handle_mm_fault+0x790/0x10c0
> [  363.117374]  handle_mm_fault+0xc6/0x1b0
> [  363.117377]  __get_user_pages+0xf9/0x620
> [  363.117380]  ? update_load_avg+0x5c0/0x6f0
> [  363.117383]  get_user_pages_remote+0x137/0x1f0
> [  363.117388]  async_pf_execute+0x62/0x180
> [  363.117393]  process_one_work+0x189/0x380
> [  363.117397]  worker_thread+0x4d/0x3c0
> [  363.117401]  kthread+0xf8/0x130
> [  363.117404]  ? process_one_work+0x380/0x380
> [  363.117407]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.117411]  ret_from_fork+0x32/0x40
> [  363.117415] INFO: task kworker/5:1:207 blocked for more than 120 seconds.
> [  363.117494]       Not tainted 4.15.0-rc8-zram #22
> [  363.117567] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.117648] kworker/5:1     D    0   207      2 0x80000000
> [  363.117654] Workqueue: events async_pf_execute
> [  363.117655] Call Trace:
> [  363.117659]  ? __schedule+0x3d0/0x850
> [  363.117662]  schedule+0x32/0x80
> [  363.117665]  io_schedule+0x12/0x40
> [  363.117669]  __lock_page_or_retry+0x302/0x320
> [  363.117673]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.117676]  do_swap_page+0x4dd/0x870
> [  363.117680]  ? check_preempt_curr+0x83/0x90
> [  363.117683]  __handle_mm_fault+0x790/0x10c0
> [  363.117687]  handle_mm_fault+0xc6/0x1b0
> [  363.117689]  __get_user_pages+0xf9/0x620
> [  363.117693]  ? update_load_avg+0x5c0/0x6f0
> [  363.117695]  get_user_pages_remote+0x137/0x1f0
> [  363.117700]  async_pf_execute+0x62/0x180
> [  363.117705]  process_one_work+0x189/0x380
> [  363.117709]  worker_thread+0x4d/0x3c0
> [  363.117713]  kthread+0xf8/0x130
> [  363.117717]  ? process_one_work+0x380/0x380
> [  363.117720]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.117723]  ret_from_fork+0x32/0x40
> [  363.117731] INFO: task kworker/5:2:437 blocked for more than 120 seconds.
> [  363.117828]       Not tainted 4.15.0-rc8-zram #22
> [  363.117920] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.118039] kworker/5:2     D    0   437      2 0x80000000
> [  363.118044] Workqueue: events async_pf_execute
> [  363.118046] Call Trace:
> [  363.118049]  ? __schedule+0x3d0/0x850
> [  363.118052]  schedule+0x32/0x80
> [  363.118056]  io_schedule+0x12/0x40
> [  363.118060]  __lock_page_or_retry+0x302/0x320
> [  363.118064]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.118067]  do_swap_page+0x4dd/0x870
> [  363.118070]  __handle_mm_fault+0x790/0x10c0
> [  363.118074]  handle_mm_fault+0xc6/0x1b0
> [  363.118077]  __get_user_pages+0xf9/0x620
> [  363.118080]  ? update_load_avg+0x5c0/0x6f0
> [  363.118083]  get_user_pages_remote+0x137/0x1f0
> [  363.118087]  async_pf_execute+0x62/0x180
> [  363.118092]  process_one_work+0x189/0x380
> [  363.118096]  worker_thread+0x4d/0x3c0
> [  363.118100]  kthread+0xf8/0x130
> [  363.118104]  ? process_one_work+0x380/0x380
> [  363.118107]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.118113]  ? do_group_exit+0x3a/0xa0
> [  363.118116]  ret_from_fork+0x32/0x40
> [  363.118120] INFO: task kworker/8:2:501 blocked for more than 120 seconds.
> [  363.118217]       Not tainted 4.15.0-rc8-zram #22
> [  363.118309] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.118427] kworker/8:2     D    0   501      2 0x80000000
> [  363.118433] Workqueue: events async_pf_execute
> [  363.118435] Call Trace:
> [  363.118438]  ? __schedule+0x3d0/0x850
> [  363.118441]  schedule+0x32/0x80
> [  363.118445]  io_schedule+0x12/0x40
> [  363.118448]  __lock_page_or_retry+0x302/0x320
> [  363.118452]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.118455]  do_swap_page+0x4dd/0x870
> [  363.118459]  __handle_mm_fault+0x790/0x10c0
> [  363.118463]  handle_mm_fault+0xc6/0x1b0
> [  363.118465]  __get_user_pages+0xf9/0x620
> [  363.118469]  ? update_load_avg+0x5c0/0x6f0
> [  363.118471]  get_user_pages_remote+0x137/0x1f0
> [  363.118476]  async_pf_execute+0x62/0x180
> [  363.118481]  process_one_work+0x189/0x380
> [  363.118485]  worker_thread+0x4d/0x3c0
> [  363.118489]  kthread+0xf8/0x130
> [  363.118492]  ? process_one_work+0x380/0x380
> [  363.118495]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.118498]  ? kthread_create_worker_on_cpu+0x50/0x50
> [  363.118502]  ret_from_fork+0x32/0x40
> [  363.118511] INFO: task qemu-system-x86:943 blocked for more than 120 seconds.
> [  363.118609]       Not tainted 4.15.0-rc8-zram #22
> [  363.118701] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  363.118820] qemu-system-x86 D    0   943    931 0x00000000
> [  363.118822] Call Trace:
> [  363.118826]  ? __schedule+0x3d0/0x850
> [  363.118829]  schedule+0x32/0x80
> [  363.118833]  io_schedule+0x12/0x40
> [  363.118837]  __lock_page+0x109/0x130
> [  363.118840]  ? page_cache_tree_insert+0xb0/0xb0
> [  363.118844]  deferred_split_scan+0x1e9/0x2a0
> [  363.118848]  shrink_slab.part.49+0x1e6/0x3d0
> [  363.118852]  shrink_node+0x2e7/0x2f0
> [  363.118856]  do_try_to_free_pages+0xd5/0x320
> [  363.118859]  try_to_free_pages+0xd6/0x190
> [  363.118864]  __alloc_pages_slowpath+0x34d/0xdc0
> [  363.118869]  __alloc_pages_nodemask+0x214/0x230
> [  363.118873]  do_huge_pmd_anonymous_page+0x13a/0x610
> [  363.118877]  __handle_mm_fault+0xe04/0x10c0
> [  363.118879]  ? kvm_vcpu_mmap+0x20/0x20
> [  363.118884]  ? bsearch+0x52/0x90
> [  363.118887]  handle_mm_fault+0xc6/0x1b0
> [  363.118890]  __get_user_pages+0xf9/0x620
> [  363.118893]  get_user_pages+0x3e/0x50
> [  363.118898]  __gfn_to_pfn_memslot+0xff/0x3d0
> [  363.118903]  try_async_pf+0x53/0x1d0
> [  363.118907]  tdp_page_fault+0x10e/0x260
> [  363.118912]  ? vmexit_fill_RSB+0x11/0x30
> [  363.118915]  kvm_mmu_page_fault+0x59/0x130
> [  363.118920]  vmx_handle_exit+0x9f/0x1530
> [  363.118924]  ? vmexit_fill_RSB+0x11/0x30
> [  363.118927]  ? vmx_vcpu_run+0x32f/0x4d0
> [  363.118932]  kvm_arch_vcpu_ioctl_run+0x90c/0x1670
> [  363.118936]  ? handle_machine_check+0x10/0x10
> [  363.118938]  ? kvm_arch_vcpu_load+0x68/0x250
> [  363.118943]  ? kvm_vcpu_ioctl+0x2e8/0x580
> [  363.118946]  kvm_vcpu_ioctl+0x2e8/0x580
> [  363.118952]  do_vfs_ioctl+0x92/0x5f0
> [  363.118955]  ? handle_mm_fault+0xc6/0x1b0
> [  363.118960]  ? kvm_on_user_return+0x68/0xa0
> [  363.118964]  SyS_ioctl+0x74/0x80
> [  363.118969]  entry_SYSCALL_64_fastpath+0x24/0x87
> [  363.118972] RIP: 0033:0x7fcc51943dd7
> 
> --
> 
> [  967.078557] BUG: Bad page state in process qemu-system-x86  pfn:3f6d853
> [  967.078637] page:000000004d26db38 count:0 mapcount:0 mapping:  (null) index:0x1
> [  967.078715] flags: 0x17fffc000000008(uptodate)
> [  967.078786] raw: 017fffc000000008 0000000000000000 0000000000000001 00000000ffffffff
> [  967.078863] raw: dead000000000100 dead000000000200 0000000000000000 0000000000000000
> [  967.078939] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  967.079012] bad because of flags: 0x8(uptodate)
> [  967.079081] Modules linked in: lz4 lz4_compress zram
> [  967.079085] CPU: 2 PID: 946 Comm: qemu-system-x86 Not tainted 4.15.0-rc2-zram #4
> [  967.079086] Hardware name: Supermicro Super Server/X10SRL-F, BIOS 2.0b 05/02/2017
> [  967.079087] Call Trace:
> [  967.079093]  dump_stack+0x5c/0x84
> [  967.079097]  bad_page+0xba/0x120
> [  967.079098]  get_page_from_freelist+0xfe7/0x1230
> [  967.079101]  __alloc_pages_nodemask+0xea/0x230
> [  967.079104]  alloc_pages_vma+0x7c/0x1c0
> [  967.079106]  do_swap_page+0x474/0x870
> [  967.079109]  ? do_huge_pmd_anonymous_page+0x417/0x610
> [  967.079110]  __handle_mm_fault+0xa53/0x1160
> [  967.079112]  handle_mm_fault+0xc6/0x1b0
> [  967.079114]  __get_user_pages+0xf9/0x620
> [  967.079117]  get_user_pages+0x3e/0x50
> [  967.079119]  __gfn_to_pfn_memslot+0xff/0x3d0
> [  967.079122]  try_async_pf+0x53/0x1c0
> [  967.079124]  tdp_page_fault+0x10e/0x260
> [  967.079125]  kvm_mmu_page_fault+0x53/0x130
> [  967.079128]  vmx_handle_exit+0x9c/0x1500
> [  967.079129]  ? atomic_switch_perf_msrs+0x5f/0x80
> [  967.079130]  ? vmx_vcpu_run+0x30a/0x4b0
> [  967.079133]  kvm_arch_vcpu_ioctl_run+0x8dc/0x15e0
> [  967.079135]  ? kvm_arch_vcpu_load+0x62/0x230
> [  967.079136]  ? kvm_vcpu_ioctl+0x2e8/0x580
> [  967.079137]  kvm_vcpu_ioctl+0x2e8/0x580
> [  967.079141]  ? wake_up_q+0x70/0x70
> [  967.079144]  do_vfs_ioctl+0x8f/0x5e0
> [  967.079147]  ? kvm_on_user_return+0x68/0xa0
> [  967.079148]  SyS_ioctl+0x74/0x80
> [  967.079152]  entry_SYSCALL_64_fastpath+0x1e/0x81
> [  967.079154] RIP: 0033:0x7f4410161dd7
> [  967.079154] RSP: 002b:00007f43eeffc8b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  967.079156] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007f4410161dd7
> [  967.079156] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000013
> [  967.079157] RBP: 000055c5a36f8e50 R08: 000055c5a0fa73d0 R09: 00000000ffffffff
> [  967.079158] R10: 003b83305f306bdf R11: 0000000000000246 R12: 0000000000000000
> [  967.079158] R13: 00007f44157ad000 R14: 0000000000000000 R15: 000055c5a36f8e50
> [  967.079160] Disabling lock debugging due to kernel taint
> 
> -- 
> Kind regards,
> 
> Tino Lehnig

[-- Attachment #2: 0001-zram-remove-BD_CAP_SYNCHRONOUS_IO-with-writeback-fea.patch --]
[-- Type: text/x-diff, Size: 1193 bytes --]

From fa946405d137fbc9a433f9a150929e93f7f0b308 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 27 Jul 2018 15:15:33 +0900
Subject: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature

If zram supports the writeback feature, it is no longer a synchronous
device because we need asynchronous IO operations.

Do not pretend to be a synchronous IO device.
It makes the system very sluggish because the upper layer
waits for IO completion.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7436b2d27fa3..8610987b7b5a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1699,6 +1699,11 @@ static int zram_add(void)
 
 	zram->disk->queue->backing_dev_info->capabilities |=
 			(BDI_CAP_STABLE_WRITES | BDI_CAP_SYNCHRONOUS_IO);
+#ifdef CONFIG_ZRAM_WRITEBACK
+	if (zram->backing_dev)
+		zram->disk->queue->backing_dev_info->capabilities &=
+			~BDI_CAP_SYNCHRONOUS_IO;
+#endif
 	add_disk(zram->disk);
 
 	ret = sysfs_create_group(&disk_to_dev(zram->disk)->kobj,
-- 
2.18.0.345.g5c9ce644c3-goog


[-- Attachment #3: 0002-swap-free-allocated-page-if-swap_read-fails.patch --]
[-- Type: text/x-diff, Size: 936 bytes --]

From be50669d38fd61259e787e337a7c82d73b42bb91 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 27 Jul 2018 17:17:01 +0900
Subject: [PATCH 2/2] swap: free allocated page if swap_read fails

swap_readpage() can fail with -ENOMEM. In this case, we should free
the allocated page immediately. There is no reason to retry.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..8a5e304ffd91 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2943,7 +2943,8 @@ int do_swap_page(struct vm_fault *vmf)
 				__SetPageSwapBacked(page);
 				set_page_private(page, entry.val);
 				lru_cache_add_anon(page);
-				swap_readpage(page, true);
+				if (swap_readpage(page, true))
+					goto out_page;
 			}
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-- 
2.18.0.345.g5c9ce644c3-goog



* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27  9:14                   ` Minchan Kim
@ 2018-07-27 11:00                     ` Tino Lehnig
  2018-07-27 12:05                       ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-27 11:00 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/27/2018 11:14 AM, Minchan Kim wrote:
> I tried to reproduce this with KVM but was not successful, and I don't have
> a real machine to reproduce it on. I am asking for one device for it.
> 
> Anyway, I want to try this patch.
> Could you apply the two attached patches?

Thanks, I applied the patches on 4.18-rc6, but unfortunately, they do 
not solve the problem for me. Kernel message below.

> I am confused. You mean after 4.15-rc9, you are not seeing the *hung* problem?

Correct.

> So you mean you see the bad page state bug with recent kernels, right?
> It seems there are two problems now.
> 
> 1. Hung and 2. bad page
> 
> Which of these bugs happens on which kernel version?
> Could you clarify it?

* pre 0bcac06f27d75 (4.15-rc1): all good
* 4.15-rc1: hung task (I have not encountered bad page here yet...)
* 4.15-rc2 through 4.15-rc8: hung task + bad page (very rare)
* 4.15-rc9 and newer: bad page

--

[  809.149272] BUG: Bad page state in process kvm  pfn:1cb08a8
[  809.149332] flags: 0x57ffffc0000008(uptodate)
[  809.149350] raw: 0057ffffc0000008 dead000000000100 dead000000000200 0000000000000000
[  809.149378] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[  809.149405] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
[  809.149427] bad because of flags: 0x8(uptodate)
[  809.149444] Modules linked in: lz4 lz4_compress zram
[  809.149450] CPU: 14 PID: 3734 Comm: kvm Not tainted 4.18.0-rc6+ #1
[  809.149450] Hardware name: Supermicro Super Server/X10DRL-i, BIOS 3.0a 02/09/2018
[  809.149451] Call Trace:
[  809.149458]  dump_stack+0x63/0x85
[  809.149463]  bad_page+0xc1/0x120
[  809.149465]  check_new_page_bad+0x67/0x80
[  809.149467]  get_page_from_freelist+0xe25/0x12f0
[  809.149469]  __alloc_pages_nodemask+0xfd/0x280
[  809.149472]  alloc_pages_vma+0x88/0x1c0
[  809.149475]  do_swap_page+0x346/0x910
[  809.149477]  __handle_mm_fault+0x815/0x1170
[  809.149479]  handle_mm_fault+0x102/0x200
[  809.149481]  __get_user_pages+0x131/0x680
[  809.149483]  get_user_pages_unlocked+0x145/0x1e0
[  809.149488]  __gfn_to_pfn_memslot+0x10b/0x3c0
[  809.149491]  try_async_pf+0x86/0x230
[  809.149494]  tdp_page_fault+0x12d/0x290
[  809.149496]  kvm_mmu_page_fault+0x74/0x5d0
[  809.149499]  ? call_function_interrupt+0xa/0x20
[  809.149502]  ? vmexit_fill_RSB+0x10/0x40
[  809.149503]  ? vmexit_fill_RSB+0x1c/0x40
[  809.149504]  ? vmexit_fill_RSB+0x10/0x40
[  809.149505]  ? vmexit_fill_RSB+0x1c/0x40
[  809.149506]  ? vmexit_fill_RSB+0x10/0x40
[  809.149507]  ? vmexit_fill_RSB+0x1c/0x40
[  809.149508]  ? vmexit_fill_RSB+0x10/0x40
[  809.149509]  ? vmexit_fill_RSB+0x1c/0x40
[  809.149510]  ? vmexit_fill_RSB+0x10/0x40
[  809.149513]  handle_ept_violation+0xdf/0x1a0
[  809.149514]  vmx_handle_exit+0xa5/0x11c0
[  809.149516]  ? vmx_vcpu_run+0x3bb/0x620
[  809.149519]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1980
[  809.149522]  kvm_vcpu_ioctl+0x3a0/0x5e0
[  809.149523]  ? kvm_vcpu_ioctl+0x3a0/0x5e0
[  809.149526]  do_vfs_ioctl+0xa6/0x620
[  809.149527]  ksys_ioctl+0x75/0x80
[  809.149529]  __x64_sys_ioctl+0x1a/0x20
[  809.149532]  do_syscall_64+0x5a/0x110
[  809.149534]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  809.149536] RIP: 0033:0x7fd3c5572dd7
[  809.149536] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
[  809.149563] RSP: 002b:00007fd3b07fc538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  809.149565] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fd3c5572dd7
[  809.149566] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000014
[  809.149566] RBP: 00007fd3b9b13000 R08: 0000558cb94bb350 R09: 00000000ffffffff
[  809.149567] R10: 0005577fd3b06fe6 R11: 0000000000000246 R12: 0000000000000000
[  809.149568] R13: 00007fd3ba146000 R14: 0000000000000000 R15: 00007fd3b9b13000
[  809.149570] Disabling lock debugging due to kernel taint

-- 
Kind regards,

Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27 11:00                     ` Tino Lehnig
@ 2018-07-27 12:05                       ` Minchan Kim
  2018-07-27 12:13                         ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-27 12:05 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Fri, Jul 27, 2018 at 01:00:01PM +0200, Tino Lehnig wrote:
> On 07/27/2018 11:14 AM, Minchan Kim wrote:
> > I tried to reproduce this with KVM but was not successful, and I don't have
> > a real machine to reproduce it on. I am asking for one device for it.
> > 
> > Anyway, I want to try this patch.
> > Could you apply the two attached patches?
> 
> Thanks, I applied the patches on 4.18-rc6, but unfortunately, they do not
> solve the problem for me. Kernel message below.

Thanks for testing.

> 
> > I am confused. You mean after 4.15-rc9, you are not seeing the *hung* problem?
> 
> Correct.
> 
> > So you mean you see the bad page state bug with recent kernels, right?
> > It seems there are two problems now.
> > 
> > 1. Hung and 2. bad page
> > 
> > Which of these bugs happens on which kernel version?
> > Could you clarify it?
> 
> * pre 0bcac06f27d75 (4.15-rc1): all good
> * 4.15-rc1: hung task (I have not encountered bad page here yet...)
> * 4.15-rc2 through 4.15-rc8: hung task + bad page (very rare)
> * 4.15-rc9 and newer: bad page

And does the bad page bug always occur with writeback enabled?

Writeback enabled means 'echo "some dev" > /sys/block/zram0/backing_dev',
not just enabling CONFIG_ZRAM_WRITEBACK.
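For completeness, a minimal sketch of what "writeback enabled" entails; the device name and sizes below are illustrative assumptions, not the reporter's exact configuration:

```shell
# Minimal zram-with-writeback setup sketch (/dev/sdb1 and 4G are assumptions).
modprobe zram num_devices=1
# backing_dev and comp_algorithm must be written before disksize is set
echo /dev/sdb1 > /sys/block/zram0/backing_dev
echo lz4 > /sys/block/zram0/comp_algorithm
echo 4G > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon /dev/zram0
```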

> --
> 
> [  809.149272] BUG: Bad page state in process kvm  pfn:1cb08a8
> [  809.149332] flags: 0x57ffffc0000008(uptodate)
> [  809.149350] raw: 0057ffffc0000008 dead000000000100 dead000000000200 0000000000000000
> [  809.149378] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
> [  809.149405] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
> [  809.149427] bad because of flags: 0x8(uptodate)
> [  809.149444] Modules linked in: lz4 lz4_compress zram
> [  809.149450] CPU: 14 PID: 3734 Comm: kvm Not tainted 4.18.0-rc6+ #1
> [  809.149450] Hardware name: Supermicro Super Server/X10DRL-i, BIOS 3.0a 02/09/2018
> [  809.149451] Call Trace:
> [  809.149458]  dump_stack+0x63/0x85
> [  809.149463]  bad_page+0xc1/0x120
> [  809.149465]  check_new_page_bad+0x67/0x80
> [  809.149467]  get_page_from_freelist+0xe25/0x12f0
> [  809.149469]  __alloc_pages_nodemask+0xfd/0x280
> [  809.149472]  alloc_pages_vma+0x88/0x1c0
> [  809.149475]  do_swap_page+0x346/0x910
> [  809.149477]  __handle_mm_fault+0x815/0x1170
> [  809.149479]  handle_mm_fault+0x102/0x200
> [  809.149481]  __get_user_pages+0x131/0x680
> [  809.149483]  get_user_pages_unlocked+0x145/0x1e0
> [  809.149488]  __gfn_to_pfn_memslot+0x10b/0x3c0
> [  809.149491]  try_async_pf+0x86/0x230
> [  809.149494]  tdp_page_fault+0x12d/0x290
> [  809.149496]  kvm_mmu_page_fault+0x74/0x5d0
> [  809.149499]  ? call_function_interrupt+0xa/0x20
> [  809.149502]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149503]  ? vmexit_fill_RSB+0x1c/0x40
> [  809.149504]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149505]  ? vmexit_fill_RSB+0x1c/0x40
> [  809.149506]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149507]  ? vmexit_fill_RSB+0x1c/0x40
> [  809.149508]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149509]  ? vmexit_fill_RSB+0x1c/0x40
> [  809.149510]  ? vmexit_fill_RSB+0x10/0x40
> [  809.149513]  handle_ept_violation+0xdf/0x1a0
> [  809.149514]  vmx_handle_exit+0xa5/0x11c0
> [  809.149516]  ? vmx_vcpu_run+0x3bb/0x620
> [  809.149519]  kvm_arch_vcpu_ioctl_run+0x9b3/0x1980
> [  809.149522]  kvm_vcpu_ioctl+0x3a0/0x5e0
> [  809.149523]  ? kvm_vcpu_ioctl+0x3a0/0x5e0
> [  809.149526]  do_vfs_ioctl+0xa6/0x620
> [  809.149527]  ksys_ioctl+0x75/0x80
> [  809.149529]  __x64_sys_ioctl+0x1a/0x20
> [  809.149532]  do_syscall_64+0x5a/0x110
> [  809.149534]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  809.149536] RIP: 0033:0x7fd3c5572dd7
> [  809.149536] Code: 00 00 00 48 8b 05 c1 80 2b 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 91 80 2b 00 f7 d8 64 89 01 48
> [  809.149563] RSP: 002b:00007fd3b07fc538 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  809.149565] RAX: ffffffffffffffda RBX: 000000000000ae80 RCX: 00007fd3c5572dd7
> [  809.149566] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000014
> [  809.149566] RBP: 00007fd3b9b13000 R08: 0000558cb94bb350 R09: 00000000ffffffff
> [  809.149567] R10: 0005577fd3b06fe6 R11: 0000000000000246 R12: 0000000000000000
> [  809.149568] R13: 00007fd3ba146000 R14: 0000000000000000 R15: 00007fd3b9b13000
> [  809.149570] Disabling lock debugging due to kernel taint
> 
> -- 
> Kind regards,
> 
> Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27 12:05                       ` Minchan Kim
@ 2018-07-27 12:13                         ` Tino Lehnig
  2018-07-27 22:58                           ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-27 12:13 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/27/2018 02:05 PM, Minchan Kim wrote:
> And does the bad page bug always occur with writeback enabled?
> 
> Writeback enabled means 'echo "some dev" > /sys/block/zram0/backing_dev',
> not just enabling CONFIG_ZRAM_WRITEBACK.

Yes, the bug only appears when backing_dev is set.
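A quick way to confirm that state on a test host (zram0 is an assumption; adjust for the device under test):

```shell
# Confirm whether zram0 actually has a backing device configured;
# this prints "none" when no backing_dev has been set.
cat /sys/block/zram0/backing_dev
# Confirm zram is actually serving swap at the time of the test.
grep zram /proc/swaps
```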

-- 
Kind regards,

Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27 12:13                         ` Tino Lehnig
@ 2018-07-27 22:58                           ` Minchan Kim
  2018-07-30  6:09                             ` Tino Lehnig
  0 siblings, 1 reply; 30+ messages in thread
From: Minchan Kim @ 2018-07-27 22:58 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

[-- Attachment #1: Type: text/plain, Size: 475 bytes --]

On Fri, Jul 27, 2018 at 02:13:57PM +0200, Tino Lehnig wrote:
> On 07/27/2018 02:05 PM, Minchan Kim wrote:
> > And does the bad page bug always occur with writeback enabled?
> > 
> > Writeback enabled means 'echo "some dev" > /sys/block/zram0/backing_dev',
> > not just enabling CONFIG_ZRAM_WRITEBACK.
> 
> Yes, the bug only appears when backing_dev is set.

Thanks for the clarification.

I made a mistake in the previous patch.
Could you test these patches?

> 
> -- 
> Kind regards,
> 
> Tino Lehnig

[-- Attachment #2: 0001-zram-remove-BD_CAP_SYNCHRONOUS_IO-with-writeback-fea.patch --]
[-- Type: text/x-diff, Size: 1706 bytes --]

From 77a5fc378dfae733af5a0f0c7ef901668d8c9778 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 27 Jul 2018 15:15:33 +0900
Subject: [PATCH 1/2] zram: remove BD_CAP_SYNCHRONOUS_IO with writeback feature

If zram supports the writeback feature, it is no longer a synchronous
device because we need asynchronous IO operations.

Do not pretend to be a synchronous IO device. It makes the system
very sluggish because the upper layer waits for IO completion.
Furthermore, it causes a use-after-free problem: swap thinks the
operation is done as soon as the IO function returns, so it may free
the page at will, but the IO is actually asynchronous, so the driver
could access the freed page afterward.
(I will make the description clearer in the formal patch.)

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7436b2d27fa3..0b6eda1bd77a 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -298,7 +298,8 @@ static void reset_bdev(struct zram *zram)
 	zram->backing_dev = NULL;
 	zram->old_block_size = 0;
 	zram->bdev = NULL;
-
+	zram->disk->queue->backing_dev_info->capabilities |=
+				BDI_CAP_SYNCHRONOUS_IO;
 	kvfree(zram->bitmap);
 	zram->bitmap = NULL;
 }
@@ -400,6 +401,8 @@ static ssize_t backing_dev_store(struct device *dev,
 	zram->backing_dev = backing_dev;
 	zram->bitmap = bitmap;
 	zram->nr_pages = nr_pages;
+	zram->disk->queue->backing_dev_info->capabilities &=
+			~BDI_CAP_SYNCHRONOUS_IO;
 	up_write(&zram->init_lock);
 
 	pr_info("setup backing device %s\n", file_name);
-- 
2.18.0.345.g5c9ce644c3-goog


[-- Attachment #3: 0002-swap-free-allocated-page-if-swap_read-fails.patch --]
[-- Type: text/x-diff, Size: 949 bytes --]

From 7c263ac8b9557c60c631c239cc7b863ed762098f Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Fri, 27 Jul 2018 17:17:01 +0900
Subject: [PATCH 2/2] swap: free allocated page if swap_read fails

swap_readpage() can fail with -ENOMEM. In this case, we should free
the allocated page immediately and bail out. There is no reason to retry.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/memory.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..8a5e304ffd91 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2943,7 +2943,8 @@ int do_swap_page(struct vm_fault *vmf)
 				__SetPageSwapBacked(page);
 				set_page_private(page, entry.val);
 				lru_cache_add_anon(page);
-				swap_readpage(page, true);
+				if (swap_readpage(page, true))
+					goto out_page;
 			}
 		} else {
 			page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE,
-- 
2.18.0.345.g5c9ce644c3-goog
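The error path added by this patch follows a common pattern: if a read into a freshly allocated page fails, release the page and propagate the error instead of retrying. A minimal user-space sketch of that flow, with `fake_swap_readpage`/`fault_in_page` as hypothetical stand-ins for the kernel functions:

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-in for swap_readpage(): returns 0 on success or a negative
 * errno such as -ENOMEM on failure. */
static int fake_swap_readpage(void *page, int simulate_error)
{
	(void)page;
	return simulate_error ? -ENOMEM : 0;
}

/* Mirrors the fixed do_swap_page() flow: allocate a page, try to read
 * into it, and on failure free the page and bail out. */
static int fault_in_page(int simulate_error)
{
	void *page = malloc(4096);
	int ret;

	if (!page)
		return -ENOMEM;

	ret = fake_swap_readpage(page, simulate_error);
	if (ret) {
		free(page);	/* the "goto out_page" path in the patch */
		return ret;
	}

	free(page);		/* success path; placeholder for normal use */
	return 0;
}
```

Before the patch, a failed read left the page on the LRU and the fault was retried; freeing it immediately is both simpler and correct, since -ENOMEM will not resolve by repeating the same read.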



* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-27 22:58                           ` Minchan Kim
@ 2018-07-30  6:09                             ` Tino Lehnig
  2018-08-02  5:15                               ` Minchan Kim
  0 siblings, 1 reply; 30+ messages in thread
From: Tino Lehnig @ 2018-07-30  6:09 UTC (permalink / raw)
  To: Minchan Kim; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On 07/28/2018 12:58 AM, Minchan Kim wrote:
> I made a mistake in the previous patch.
> Could you test these patches?

Thanks! Looking good so far! No errors whatsoever with the new patch. I 
will let my test workload run for a while to be sure, but I think we 
are good.

-- 
Kind regards,

Tino Lehnig


* Re: Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process...
  2018-07-30  6:09                             ` Tino Lehnig
@ 2018-08-02  5:15                               ` Minchan Kim
  0 siblings, 0 replies; 30+ messages in thread
From: Minchan Kim @ 2018-08-02  5:15 UTC (permalink / raw)
  To: Tino Lehnig; +Cc: ngupta, linux-kernel, Sergey Senozhatsky, Andrew Morton

On Mon, Jul 30, 2018 at 08:09:33AM +0200, Tino Lehnig wrote:
> On 07/28/2018 12:58 AM, Minchan Kim wrote:
> > I made a mistake in the previous patch.
> > Could you test these patches?
> 
> Thanks! Looking good so far! No errors whatsoever with the new patch. I will
> let my test workload run for a while to be sure, but I think we are good.

Thanks for confirming, Tino.
I believe you haven't seen any problems since then, so I sent the formal
patch with your 'Tested-by' tag:
https://lore.kernel.org/lkml/20180802051112.86174-1-minchan@kernel.org/
If you see any problems, feel free to reply in that thread.

Thanks for the report and help!


end of thread, other threads:[~2018-08-02  5:15 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-23 12:29 Zram writeback feature unstable with heavy swap utilization - BUG: Bad page state in process Tino Lehnig
2018-07-24  1:03 ` Minchan Kim
2018-07-24  2:53   ` Sergey Senozhatsky
2018-07-24  6:47     ` Minchan Kim
2018-07-24  7:30   ` Tino Lehnig
2018-07-25  1:32     ` Minchan Kim
2018-07-25  1:55       ` Matthew Wilcox
2018-07-25  2:16         ` Minchan Kim
2018-07-25  2:35           ` Matthew Wilcox
2018-07-25  2:51             ` Minchan Kim
2018-07-25  2:55               ` Matthew Wilcox
2018-07-25  3:02                 ` Minchan Kim
2018-07-25  2:51       ` Matthew Wilcox
2018-07-25  4:07         ` Sergey Senozhatsky
2018-07-25 13:21     ` Minchan Kim
2018-07-25 15:12       ` Tino Lehnig
2018-07-26  2:03         ` Minchan Kim
2018-07-26  6:10           ` Tino Lehnig
2018-07-26  6:21             ` Minchan Kim
2018-07-26  6:34               ` Tino Lehnig
2018-07-26 10:00             ` Tino Lehnig
2018-07-26 10:30               ` Minchan Kim
2018-07-26 12:35                 ` Tino Lehnig
2018-07-27  9:14                   ` Minchan Kim
2018-07-27 11:00                     ` Tino Lehnig
2018-07-27 12:05                       ` Minchan Kim
2018-07-27 12:13                         ` Tino Lehnig
2018-07-27 22:58                           ` Minchan Kim
2018-07-30  6:09                             ` Tino Lehnig
2018-08-02  5:15                               ` Minchan Kim
