On Mon, Dec 07, 2020 at 01:00:14PM +0100, Jürgen Groß wrote:
> On 07.12.20 12:48, Marek Marczykowski-Górecki wrote:
> > On Mon, Dec 07, 2020 at 11:55:01AM +0100, Jürgen Groß wrote:
> > > Marek,
> > >
> > > On 06.12.20 17:47, Jason Andryuk wrote:
> > > > On Sat, Dec 5, 2020 at 3:29 AM Roger Pau Monné wrote:
> > > > >
> > > > > On Fri, Dec 04, 2020 at 01:20:54PM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > On Fri, Dec 04, 2020 at 01:08:03PM +0100, Christoph Hellwig wrote:
> > > > > > > On Fri, Dec 04, 2020 at 12:08:47PM +0100, Marek Marczykowski-Górecki wrote:
> > > > > > > > culprit:
> > > > > > > >
> > > > > > > > commit 9e2369c06c8a181478039258a4598c1ddd2cadfa
> > > > > > > > Author: Roger Pau Monne
> > > > > > > > Date: Tue Sep 1 10:33:26 2020 +0200
> > > > > > > >
> > > > > > > > xen: add helpers to allocate unpopulated memory
> > > > > > > >
> > > > > > > > I'm adding relevant people and xen-devel to the thread.
> > > > > > > > For completeness, here is the original crash message:
> > > > > > >
> > > > > > > That commit definitively adds a new ZONE_DEVICE user, so it does look
> > > > > > > related. But you are not running on Xen, are you?
> > > > > >
> > > > > > I am. It is Xen dom0.
> > > > >
> > > > > I'm afraid I'm on leave and won't be able to look into this until the
> > > > > beginning of January. I would guess it's some kind of bad
> > > > > interaction between blkback and NVMe drivers both using ZONE_DEVICE?
> > > > >
> > > > > Maybe the best is to revert this change and I will look into it when
> > > > > I get back, unless someone is willing to debug this further.
> > > >
> > > > Looking at commit 9e2369c06c8a and xen-blkback put_free_pages() , they
> > > > both use page->lru which is part of the anonymous union shared with
> > > > *pgmap. That matches Marek's suspicion that the ZONE_DEVICE memory is
> > > > being used as ZONE_NORMAL.
> > > >
> > > > memmap_init_zone_device() says:
> > > > * ZONE_DEVICE pages union ->lru with a ->pgmap back pointer
> > > > * and zone_device_data. It is a bug if a ZONE_DEVICE page is
> > > > * ever freed or placed on a driver-private list.
> > >
> > > Second try, now even tested to work on a test system (without NVMe).
> >
> > It doesn't work for me:
> >
> > [ 526.023340] xen-blkback: backend/vbd/1/51712: using 2 queues, protocol 1 (x86_64-abi) persistent grants
> > [ 526.030550] xen-blkback: backend/vbd/1/51728: using 2 queues, protocol 1 (x86_64-abi) persistent grants
> > [ 526.034810] BUG: kernel NULL pointer dereference, address: 0000000000000010
>
> Oh, indeed. Silly bug. My test was with qdisk as backend :-(
>
> 3rd try...

Now it works :)

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
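
For anyone reading this thread in the archive, here is a small userspace sketch of the aliasing Jason describes above: ->lru and ->pgmap share storage, so putting a ZONE_DEVICE page on a driver-private list overwrites the pgmap back pointer. This is not kernel code. struct sketch_page, sketch_page_to_pool() and pgmap_survives_list_add() are made-up names for illustration, the structures are heavily simplified stand-ins, and the list insertion only mimics the pointer updates that list_add() performs.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types; NOT the real definitions. */
struct list_head { struct list_head *next, *prev; };
struct dev_pagemap { int type; };

struct sketch_page {
    unsigned long flags;
    union {
        struct list_head lru;          /* linkage used for ordinary pages */
        struct {                       /* fields used for ZONE_DEVICE pages */
            struct dev_pagemap *pgmap;
            void *zone_device_data;
        };
    };
};

/* Put the page on a list right after head, using the same pointer
 * updates a list_add() would do on page->lru. */
static void sketch_page_to_pool(struct sketch_page *page,
                                struct list_head *head)
{
    page->lru.next = head->next;
    page->lru.prev = head;
    head->next->prev = &page->lru;
    head->next = &page->lru;
}

/* Returns 1 if the pgmap back pointer is still intact after the page
 * has been placed on a list, 0 if it was overwritten. */
static int pgmap_survives_list_add(void)
{
    struct dev_pagemap pgmap = { 0 };
    struct list_head pool = { &pool, &pool };  /* empty driver-private list */
    struct sketch_page page = { 0 };

    page.pgmap = &pgmap;               /* what ZONE_DEVICE setup would store */
    assert(page.pgmap == &pgmap);

    sketch_page_to_pool(&page, &pool); /* overwrites the shared storage */

    return page.pgmap == &pgmap;
}
```

In this sketch pgmap_survives_list_add() returns 0: after the insertion, the storage that held the pgmap pointer holds list linkage instead, which is the situation the memmap_init_zone_device() comment quoted above calls a bug.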