On Mon, Dec 07, 2020 at 11:55:01AM +0100, Jürgen Groß wrote:
> Marek,
>
> On 06.12.20 17:47, Jason Andryuk wrote:
> > On Sat, Dec 5, 2020 at 3:29 AM Roger Pau Monné wrote:
> > >
> > > On Fri, Dec 04, 2020 at 01:20:54PM +0100, Marek Marczykowski-Górecki wrote:
> > > > On Fri, Dec 04, 2020 at 01:08:03PM +0100, Christoph Hellwig wrote:
> > > > > On Fri, Dec 04, 2020 at 12:08:47PM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > culprit:
> > > > > >
> > > > > > commit 9e2369c06c8a181478039258a4598c1ddd2cadfa
> > > > > > Author: Roger Pau Monne
> > > > > > Date: Tue Sep 1 10:33:26 2020 +0200
> > > > > >
> > > > > > xen: add helpers to allocate unpopulated memory
> > > > > >
> > > > > > I'm adding relevant people and xen-devel to the thread.
> > > > > > For completeness, here is the original crash message:
> > > > >
> > > > > That commit definitively adds a new ZONE_DEVICE user, so it does look
> > > > > related. But you are not running on Xen, are you?
> > > >
> > > > I am. It is Xen dom0.
> > >
> > > I'm afraid I'm on leave and won't be able to look into this until the
> > > beginning of January. I would guess it's some kind of bad
> > > interaction between blkback and NVMe drivers both using ZONE_DEVICE?
> > >
> > > Maybe the best is to revert this change and I will look into it when
> > > I get back, unless someone is willing to debug this further.
> >
> > Looking at commit 9e2369c06c8a and xen-blkback put_free_pages(), they
> > both use page->lru which is part of the anonymous union shared with
> > *pgmap. That matches Marek's suspicion that the ZONE_DEVICE memory is
> > being used as ZONE_NORMAL.
> >
> > memmap_init_zone_device() says:
> > * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
> > * and zone_device_data. It is a bug if a ZONE_DEVICE page is
> > * ever freed or placed on a driver-private list.
>
> Second try, now even tested to work on a test system (without NVMe).

It doesn't work for me:

[ 526.023340] xen-blkback: backend/vbd/1/51712: using 2 queues, protocol 1 (x86_64-abi) persistent grants
[ 526.030550] xen-blkback: backend/vbd/1/51728: using 2 queues, protocol 1 (x86_64-abi) persistent grants
[ 526.034810] BUG: kernel NULL pointer dereference, address: 0000000000000010
[ 526.034841] #PF: supervisor read access in kernel mode
[ 526.034857] #PF: error_code(0x0000) - not-present page
[ 526.034875] PGD 105428067 P4D 105428067 PUD 105b92067 PMD 0
[ 526.034896] Oops: 0000 [#1] SMP NOPTI
[ 526.034909] CPU: 3 PID: 4007 Comm: 1.xvda-0 Tainted: G W 5.10.0-rc6-1.qubes.x86_64+ #108
[ 526.034933] Hardware name: LENOVO 20M9CTO1WW/20M9CTO1WW, BIOS N2CET50W (1.33 ) 01/15/2020
[ 526.034974] RIP: e030:gnttab_page_cache_get+0x32/0x60
[ 526.034990] Code: 89 f4 55 48 89 fd e8 4d e3 80 00 48 83 7d 08 00 48 89 c6 74 15 48 89 ef e8 5b e0 80 00 4c 89 e6 5d bf 01 00 00 00 41 5c eb 8e <48> 8b 04 25 10 00 00 00 48 89 ef 48 89 45 08 49 c7 04 24 00 00 00
[ 526.035035] RSP: e02b:ffffc90003e27a40 EFLAGS: 00010046
[ 526.035052] RAX: 0000000000000200 RBX: 0000000000000001 RCX: 0000000000000000
[ 526.035072] RDX: 0000000000000001 RSI: 0000000000000200 RDI: ffff888104275518
[ 526.035092] RBP: ffff888104275518 R08: 0000000000000000 R09: 0000000000000000
[ 526.035113] R10: ffff888104275400 R11: 0000000000000000 R12: ffff888109b5d3a0
[ 526.035133] R13: 0000000000000000 R14: 0000000000000000 R15: ffff888104275400
[ 526.035159] FS: 0000000000000000(0000) GS:ffff8881b54c0000(0000) knlGS:0000000000000000
[ 526.035194] CS: 10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 526.035214] CR2: 0000000000000010 CR3: 0000000103b5a000 CR4: 0000000000050660
[ 526.035239] Call Trace:
[ 526.035253] xen_blkbk_map+0x131/0x5a0
[ 526.035268] dispatch_rw_block_io+0x42a/0x9c0
[ 526.035284] ? xen_mc_flush+0xcb/0x190
[ 526.035298] __do_block_io_op+0x314/0x630
[ 526.035312] xen_blkif_schedule+0x182/0x790
[ 526.035327] ? finish_wait+0x80/0x80
[ 526.035340] ? xen_blkif_be_int+0x30/0x30
[ 526.035355] kthread+0xfe/0x140
[ 526.035371] ? kthread_park+0x90/0x90
[ 526.035385] ret_from_fork+0x22/0x30
[ 526.035398] Modules linked in:
[ 526.035410] CR2: 0000000000000010
[ 526.035440] ---[ end trace 431ea72658d96c9d ]---
[ 526.176390] RIP: e030:gnttab_page_cache_get+0x32/0x60
[ 526.176460] Code: 89 f4 55 48 89 fd e8 4d e3 80 00 48 83 7d 08 00 48 89 c6 74 15 48 89 ef e8 5b e0 80 00 4c 89 e6 5d bf 01 00 00 00 41 5c eb 8e <48> 8b 04 25 10 00 00 00 48 89 ef 48 89 45 08 49 c7 04 24 00 00 00
[ 526.250734] RSP: e02b:ffffc90003e27a40 EFLAGS: 00010046
[ 526.250751] RAX: 0000000000000200 RBX: 0000000000000001 RCX: 0000000000000000
[ 526.250771] RDX: 0000000000000001 RSI: 0000000000000200 RDI: ffff888104275518
[ 526.250790] RBP: ffff888104275518 R08: 0000000000000000 R09: 0000000000000000
[ 526.250808] R10: ffff888104275400 R11: 0000000000000000 R12: ffff888109b5d3a0
[ 526.250827] R13: 0000000000000000 R14: 0000000000000000 R15: ffff888104275400
[ 526.250863] FS: 0000000000000000(0000) GS:ffff8881b54c0000(0000) knlGS:0000000000000000
[ 526.250884] CS: 10000e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 526.250901] CR2: 0000000000000010 CR3: 0000000103b5a000 CR4: 0000000000050660
[ 526.250924] Kernel panic - not syncing: Fatal exception
[ 526.250972] Kernel Offset: disabled

This is 7059c2c00a2196865c2139083cbef47cd18109b6 with your patches on top.

--
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?