From: Yi Zhang <yi.zhang@redhat.com>
To: Jason Gunthorpe <jgg@nvidia.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Ralph Campbell <rcampbell@nvidia.com>
Cc: linux-nvdimm <linux-nvdimm@lists.01.org>
Subject: Re: regression from 5.10.0-rc3: BUG: Bad page state in process kworker/41:0 pfn:891066 during fio on devdax
Date: Wed, 18 Nov 2020 22:02:00 +0800
Message-ID: <51e938d1-aff7-0fa4-1a79-f77ac8bb2f8b@redhat.com>
In-Reply-To: <ef5aca5c-6d32-8d01-81d6-ac65558115fa@redhat.com>

Ping. This issue can still be reproduced on 5.10.0-rc4; the splats are below.
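(For reference, a fio run of the sort named in the subject might look like the sketch below. The original job file is not included in this thread, so the device path and all job parameters here are assumptions, not the actual reproducer.)

  # Minimal sketch of a fio job against a device-dax node (all values assumed).
  # fio's dev-dax ioengine requires libpmem and a namespace in devdax mode.
  fio --name=devdax-rw \
      --filename=/dev/dax0.0 \
      --ioengine=dev-dax \
      --rw=randrw \
      --bs=4k \
      --size=4g \
      --numjobs=4 \
      --time_based \
      --runtime=300 \
      --group_reporting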

[ 1914.356562] BUG: Bad page state in process kworker/58:0  pfn:1fadf5
[ 1914.390159] page:00000000fee4d2a1 refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x1fadf5
[ 1914.436292] flags: 0x17ffffc0000000()
[ 1914.452792] raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
[ 1914.488322] raw: 0000000000000000 0000000000000000 00000000fffffbff 0000000000000000
[ 1914.523625] page dumped because: nonzero mapcount
[ 1914.544972] Modules linked in: dm_log_writes loop ext4 mbcache jbd2 rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mgag200 crct10dif_pclmul i2c_algo_bit drm_kms_helper syscopyarea crc32_pclmul ghash_clmulni_intel iTCO_wdt sysfillrect sysimgblt rapl fb_sys_fops intel_cstate iTCO_vendor_support drm dax_pmem_compat ipmi_ssif device_dax intel_uncore pcspkr dax_pmem_core i2c_i801 lpc_ich acpi_ipmi ipmi_si joydev ipmi_devintf acpi_tad ipmi_msghandler hpilo hpwdt i2c_smbus ioatdma acpi_power_meter dca ip_tables xfs sr_mod cdrom sd_mod t10_pi sg nd_pmem nd_btt ahci bnx2x nfit libahci libata tg3 libnvdimm hpsa mdio libcrc32c scsi_transport_sas crc32c_intel wmi dm_mirror dm_region_hash dm_log dm_mod
[ 1914.862181] CPU: 58 PID: 14617 Comm: kworker/58:0 Tainted: G S  B             5.10.0-rc4 #1
[ 1914.903469] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
[ 1914.945189] Workqueue: mm_percpu_wq vmstat_update
[ 1914.966350] Call Trace:
[ 1914.977331]  dump_stack+0x57/0x6a
[ 1914.992193]  bad_page.cold.114+0x9b/0xa0
[ 1915.009908]  free_pcppages_bulk+0x538/0x760
[ 1915.029226]  drain_zone_pages+0x1f/0x30
[ 1915.046526]  refresh_cpu_vm_stats+0x1ea/0x2b0
[ 1915.066113]  vmstat_update+0xf/0x50
[ 1915.081784]  process_one_work+0x1a4/0x340
[ 1915.099858]  ? process_one_work+0x340/0x340
[ 1915.118741]  worker_thread+0x30/0x370
[ 1915.135268]  ? process_one_work+0x340/0x340
[ 1915.154211]  kthread+0x116/0x130
[ 1915.168771]  ? kthread_park+0x80/0x80
[ 1915.185635]  ret_from_fork+0x22/0x30
[ 1972.063440] restraintd[2377]: *** Current Time: Mon Nov 16 00:56:57 2020  Localwatchdog at: Mon Nov 16 02:55:57 2020
[ 1976.501706] BUG: Bad page state in process kworker/4:0  pfn:a24692
[ 1976.532586] page:00000000f000e4ba refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0xa24692
[ 1976.581869] flags: 0x57ffffc0000000()
[ 1976.599064] raw: 0057ffffc0000000 dead000000000100 dead000000000122 0000000000000000
[ 1976.635786] raw: 0000000000000000 0000000000000000 00000000fffffbff 0000000000000000
[ 1976.671862] page dumped because: nonzero mapcount
[ 1976.694287] Modules linked in: dm_log_writes loop ext4 mbcache jbd2 rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass mgag200 crct10dif_pclmul i2c_algo_bit drm_kms_helper syscopyarea crc32_pclmul ghash_clmulni_intel iTCO_wdt sysfillrect sysimgblt rapl fb_sys_fops intel_cstate iTCO_vendor_support drm dax_pmem_compat ipmi_ssif device_dax intel_uncore pcspkr dax_pmem_core i2c_i801 lpc_ich acpi_ipmi ipmi_si joydev ipmi_devintf acpi_tad ipmi_msghandler hpilo hpwdt i2c_smbus ioatdma acpi_power_meter dca ip_tables xfs sr_mod cdrom sd_mod t10_pi sg nd_pmem nd_btt ahci bnx2x nfit libahci libata tg3 libnvdimm hpsa mdio libcrc32c scsi_transport_sas crc32c_intel wmi dm_mirror dm_region_hash dm_log dm_mod
[ 1977.024006] CPU: 4 PID: 23471 Comm: kworker/4:0 Tainted: G S  B             5.10.0-rc4 #1
[ 1977.067069] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
[ 1977.106156] Workqueue: mm_percpu_wq vmstat_update
[ 1977.128645] Call Trace:
[ 1977.140263]  dump_stack+0x57/0x6a
[ 1977.155844]  bad_page.cold.114+0x9b/0xa0
[ 1977.174451]  free_pcppages_bulk+0x538/0x760
[ 1977.194417]  drain_zone_pages+0x1f/0x30
[ 1977.212748]  refresh_cpu_vm_stats+0x1ea/0x2b0
[ 1977.233450]  vmstat_update+0xf/0x50
[ 1977.249779]  process_one_work+0x1a4/0x340
[ 1977.268797]  ? process_one_work+0x340/0x340
[ 1977.288564]  worker_thread+0x30/0x370
[ 1977.306138]  ? process_one_work+0x340/0x340
[ 1977.326017]  kthread+0x116/0x130
[ 1977.341274]  ? kthread_park+0x80/0x80
[ 1977.358649]  ret_from_fork+0x22/0x30
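The "page dumped because: nonzero mapcount" line means the free path found a _mapcount value other than the -1 "no mapping" sentinel on a page being returned to the allocator. A quick sanity check, sketched below, is to confirm that the reported pfn really falls inside the persistent-memory / device-dax range (this assumes 4K pages, and the /proc/iomem labels vary by configuration, so the grep pattern is only a guess):

  # Convert the pfn from the first splat above to a physical address
  # (4K pages assumed) and compare it against the pmem/dax ranges in /proc/iomem.
  printf 'phys addr: 0x%x\n' $(( 0x1fadf5 << 12 ))
  # Run as root, otherwise /proc/iomem hides the actual addresses.
  grep -i -e 'persistent' -e 'dax' -e 'namespace' /proc/iomem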

On 11/11/20 11:44 AM, Yi Zhang wrote:
> Add Ralph
>
>>>
>> Hi Dan/Jason
>>
>> It turns out that this was introduced by the patch below [1], which fixed
>> the "static key devmap_managed_key" issue but introduced the bad-page splat in [2].
>> I also found that it is not 100% reproducible; sorry for my earlier mistake.
>>
>> [1]
>> commit 46b1ee38b2ba1a9524c8e886ad078bd3ca40de2a (HEAD)
>> Author: Ralph Campbell <rcampbell@nvidia.com>
>> Date:   Sun Nov 1 17:07:23 2020 -0800
>>
>>     mm/mremap_pages: fix static key devmap_managed_key updates
>>
>> [2]
>> [ 1129.792673] memmap_init_zone_device initialised 2063872 pages in 34ms
>> [ 1129.865469] memmap_init_zone_device initialised 2063872 pages in 34ms
>> [ 1129.924080] memmap_init_zone_device initialised 2063872 pages in 24ms
>> [ 1129.987160] memmap_init_zone_device initialised 2063872 pages in 25ms
>> [ 1170.785114] BUG: Bad page state in process kworker/67:2 pfn:189e3e
>> [ 1170.815859] page:000000002f5fe047 refcount:0 mapcount:-1024 mapping:0000000000000000 index:0x0 pfn:0x189e3e
>> [ 1170.864772] flags: 0x17ffffc0000000()
>> [ 1170.883291] raw: 0017ffffc0000000 dead000000000100 dead000000000122 0000000000000000
>> [ 1170.920537] raw: 0000000000000000 0000000000000000 00000000fffffbff 0000000000000000
>> [ 1170.957627] page dumped because: nonzero mapcount
>> [ 1170.980101] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace nfs_ssc fscache rfkill sunrpc vfat fat dm_multipath intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel ipmi_ssif kvm irqbypass mgag200 crct10dif_pclmul iTCO_wdt i2c_algo_bit crc32_pclmul iTCO_vendor_support drm_kms_helper syscopyarea acpi_ipmi ghash_clmulni_intel sysfillrect ipmi_si rapl sysimgblt fb_sys_fops i2c_i801 ipmi_devintf drm ipmi_msghandler intel_cstate intel_uncore dax_pmem_compat device_dax ioatdma i2c_smbus acpi_tad joydev dax_pmem_core pcspkr hpwdt lpc_ich acpi_power_meter hpilo dca ip_tables xfs sr_mod cdrom sd_mod t10_pi sg nd_pmem nd_btt ahci bnx2x libahci nfit libata tg3 libnvdimm hpsa mdio scsi_transport_sas libcrc32c wmi crc32c_intel dm_mirror dm_region_hash dm_log dm_mod
>> [ 1171.332281] CPU: 67 PID: 2700 Comm: kworker/67:2 Tainted: G S                5.10.0-rc2.46b1ee38b2ba+ #4
>> [ 1171.378334] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 10/05/2016
>> [ 1171.419774] Workqueue: mm_percpu_wq vmstat_update
>> [ 1171.442726] Call Trace:
>> [ 1171.454481]  dump_stack+0x57/0x6a
>> [ 1171.470597]  bad_page.cold.114+0x9b/0xa0
>> [ 1171.489841]  free_pcppages_bulk+0x538/0x760
>> [ 1171.509124]  drain_zone_pages+0x1f/0x30
>> [ 1171.527649]  refresh_cpu_vm_stats+0x1ea/0x2b0
>> [ 1171.548935]  vmstat_update+0xf/0x50
>> [ 1171.565961]  process_one_work+0x1a4/0x340
>> [ 1171.585142]  ? process_one_work+0x340/0x340
>> [ 1171.605147]  worker_thread+0x30/0x370
>> [ 1171.622603]  ? process_one_work+0x340/0x340
>> [ 1171.642355]  kthread+0x116/0x130
>> [ 1171.657519]  ? kthread_park+0x80/0x80
>> [ 1171.674713]  ret_from_fork+0x22/0x30
>> [ 1171.691291] Disabling lock debugging due to kernel taint
>>
>>>> How confident are you in the bisection?
>>>>
>>>> Jason
>>>>

_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org
