Re: GPF on 0xdead000000000100 in nvme_map_data - Linux 5.9.9 - Marek Marczykowski-Górecki

From: "Marek Marczykowski-Górecki" <marmarek@invisiblethingslab.com>
To: "Roger Pau Monné" <roger.pau@citrix.com>,
	"Juergen Gross" <jgross@suse.com>
Cc: Sagi Grimberg <sagi@grimberg.me>,
	linux-nvme@lists.infradead.org, Jens Axboe <axboe@fb.com>,
	Keith Busch <kbusch@kernel.org>,
	xen-devel <xen-devel@lists.xenproject.org>,
	Christoph Hellwig <hch@lst.de>
Subject: Re: GPF on 0xdead000000000100 in nvme_map_data - Linux 5.9.9
Date: Fri, 4 Dec 2020 12:08:47 +0100
Message-ID: <20201204110847.GU201140@mail-itl>
In-Reply-To: <20201202000642.GJ201140@mail-itl>

On Wed, Dec 02, 2020 at 01:06:46AM +0100, Marek Marczykowski-Górecki wrote:
> On Tue, Dec 01, 2020 at 01:40:10AM +0900, Keith Busch wrote:
> > On Sun, Nov 29, 2020 at 04:56:39AM +0100, Marek Marczykowski-Górecki wrote:
> > > I can reliably hit kernel panic in nvme_map_data() which looks like the
> > > one below. It happens on Linux 5.9.9, while 5.4.75 works fine. I haven't
> > > tried other versions on this hardware. Linux is running as Xen
> > > PV dom0, on top of nvme there is LUKS and then LVM with thin
> > > provisioning. The crash happens reliably when starting a Xen domU (which
> > > uses one of thin provisioned LVM volumes as its disk). But booting dom0
> > > works fine (even though it is using the same disk setup for its root
> > > filesystem).
> > > 
> > > I did a bit of debugging and found it's about this part:
> > > 
> > > drivers/nvme/host/pci.c:
> > >  800 static blk_status_t nvme_map_data(struct nvme_dev *dev, struct request *req,
> > >  801         struct nvme_command *cmnd)
> > >  802 {
> > >  803     struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> > >  804     blk_status_t ret = BLK_STS_RESOURCE;
> > >  805     int nr_mapped;
> > >  806 
> > >  807     if (blk_rq_nr_phys_segments(req) == 1) {
> > >  808         struct bio_vec bv = req_bvec(req);
> > >  809 
> > >  810         if (!is_pci_p2pdma_page(bv.bv_page)) {
> > > 
> > > Here, bv.bv_page->pgmap is LIST_POISON1, while page_zonenum(bv.bv_page)
> > > says ZONE_DEVICE. So, is_pci_p2pdma_page() crashes on accessing
> > > bv.bv_page->pgmap->type.
> > 
> > Something sounds off. I thought all ZONE_DEVICE pages require a pgmap
> > because that's what holds a reference to the device's live-ness. What
> > are you allocating this memory from that makes ZONE_DEVICE true without
> > a pgmap?
> 
> Well, I don't allocate anything myself. I just try to start the system with
> unmodified Linux 5.9.9 and an NVMe drive...
> I didn't manage to find where this page is allocated, nor where it gets
> broken. I _suspect_ it gets allocated as ZONE_DEVICE page and then gets
> released as ZONE_NORMAL which sets another part of the union to
> LIST_POISON1. But I have absolutely no data to confirm/deny this theory.

I've bisected this (with a bit of scripting, PXE and git bisect run it
took a while, but was fairly painless) and identified this commit as the
culprit:

commit 9e2369c06c8a181478039258a4598c1ddd2cadfa
Author: Roger Pau Monne <roger.pau@citrix.com>
Date:   Tue Sep 1 10:33:26 2020 +0200

    xen: add helpers to allocate unpopulated memory

I'm adding relevant people and xen-devel to the thread.
For completeness, here is the original crash message:

general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] SMP NOPTI
CPU: 1 PID: 134 Comm: kworker/u12:2 Not tainted 5.9.9-1.qubes.x86_64 #1
Hardware name: LENOVO 20M9CTO1WW/20M9CTO1WW, BIOS N2CET50W (1.33 ) 01/15/2020
Workqueue: dm-thin do_worker [dm_thin_pool]
RIP: e030:nvme_map_data+0x300/0x3a0 [nvme]
Code: b8 fe ff ff e9 a8 fe ff ff 4c 8b 56 68 8b 5e 70 8b 76 74 49 8b 02 48 c1 e8 33 83 e0 07 83 f8 04 0f 85 f2 fe ff ff 49 8b 42 08 <83> b8 d0 00 00 00 04 0f 85 e1 fe ff ff e9 38 fd ff ff 8b 55 70 be
RSP: e02b:ffffc900010e7ad8 EFLAGS: 00010246
RAX: dead000000000100 RBX: 0000000000001000 RCX: ffff8881a58f5000
RDX: 0000000000001000 RSI: 0000000000000000 RDI: ffff8881a679e000
RBP: ffff8881a5ef4c80 R08: ffff8881a5ef4c80 R09: 0000000000000002
R10: ffffea0003dfff40 R11: 0000000000000008 R12: ffff8881a679e000
R13: ffffc900010e7b20 R14: ffff8881a70b5980 R15: ffff8881a679e000
FS:  0000000000000000(0000) GS:ffff8881b5440000(0000) knlGS:0000000000000000
CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000001d64408 CR3: 00000001aa2c0000 CR4: 0000000000050660
Call Trace:
 nvme_queue_rq+0xa7/0x1a0 [nvme]
 __blk_mq_try_issue_directly+0x11d/0x1e0
 ? add_wait_queue_exclusive+0x70/0x70
 blk_mq_try_issue_directly+0x35/0xc0
 blk_mq_submit_bio+0x58f/0x660
 __submit_bio_noacct+0x300/0x330
 process_shared_bio+0x126/0x1b0 [dm_thin_pool]
 process_cell+0x226/0x280 [dm_thin_pool]
 process_thin_deferred_cells+0x185/0x320 [dm_thin_pool]
 process_deferred_bios+0xa4/0x2a0 [dm_thin_pool]
 do_worker+0xcc/0x130 [dm_thin_pool]
 process_one_work+0x1b4/0x370
 worker_thread+0x4c/0x310
 ? process_one_work+0x370/0x370
 kthread+0x11b/0x140
 ? __kthread_bind_mask+0x60/0x60
 ret_from_fork+0x22/0x30
Modules linked in: loop snd_seq_dummy snd_hrtimer nf_tables nfnetlink vfat fat snd_sof_pci snd_sof_intel_byt snd_sof_intel_ipc snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_xtensa_dsp snd_sof_intel_hda snd_sof snd_soc_skl
snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_acpi_intel_match snd_soc_acpi snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine elan_i2c snd_hda_codec_hdmi mei_hdcp iTCO_wdt intel_powerclamp intel_pmc_bxt ee1004 intel_rapl_msr
iTCO_vendor_support joydev pcspkr intel_wmi_thunderbolt wmi_bmof thunderbolt ucsi_acpi idma64 typec_ucsi snd_hda_codec_realtek typec snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec thinkpad_acpi snd_hda_core ledtrig_audio
int3403_thermal snd_hwdep snd_seq snd_seq_device snd_pcm iwlwifi snd_timer processor_thermal_device mei_me cfg80211 intel_rapl_common snd e1000e mei int3400_thermal int340x_thermal_zone i2c_i801 acpi_thermal_rel soundcore intel_soc_dts_iosf
i2c_smbus rfkill intel_pch_thermal xenfs
 ip_tables dm_thin_pool dm_persistent_data dm_bio_prison dm_crypt nouveau rtsx_pci_sdmmc mmc_core mxm_wmi crct10dif_pclmul ttm crc32_pclmul crc32c_intel i915 ghash_clmulni_intel i2c_algo_bit serio_raw nvme drm_kms_helper cec xhci_pci
nvme_core rtsx_pci xhci_pci_renesas drm xhci_hcd wmi video pinctrl_cannonlake pinctrl_intel xen_privcmd xen_pciback xen_blkback xen_gntalloc xen_gntdev xen_evtchn uinput
---[ end trace f8d47e4aa6724df4 ]---
RIP: e030:nvme_map_data+0x300/0x3a0 [nvme]
Code: b8 fe ff ff e9 a8 fe ff ff 4c 8b 56 68 8b 5e 70 8b 76 74 49 8b 02 48 c1 e8 33 83 e0 07 83 f8 04 0f 85 f2 fe ff ff 49 8b 42 08 <83> b8 d0 00 00 00 04 0f 85 e1 fe ff ff e9 38 fd ff ff 8b 55 70 be
RSP: e02b:ffffc900010e7ad8 EFLAGS: 00010246
RAX: dead000000000100 RBX: 0000000000001000 RCX: ffff8881a58f5000
RDX: 0000000000001000 RSI: 0000000000000000 RDI: ffff8881a679e000
RBP: ffff8881a5ef4c80 R08: ffff8881a5ef4c80 R09: 0000000000000002
R10: ffffea0003dfff40 R11: 0000000000000008 R12: ffff8881a679e000
R13: ffffc900010e7b20 R14: ffff8881a70b5980 R15: ffff8881a679e000
FS:  0000000000000000(0000) GS:ffff8881b5440000(0000) knlGS:0000000000000000
CS:  e030 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000001d64408 CR3: 00000001aa2c0000 CR4: 0000000000050660
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?