Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device

From: Klaus Jensen <its@irrelevant.dk>
To: Andrzej Jakowski <andrzej.jakowski@linux.intel.com>
Cc: kbusch@kernel.org, kwolf@redhat.com, qemu-devel@nongnu.org,
	qemu-block@nongnu.org, mreitz@redhat.com
Subject: Re: [PATCH v4 2/2] nvme: allow cmb and pmr to be enabled on same device
Date: Wed, 22 Jul 2020 19:21:27 +0200	[thread overview]
Message-ID: <20200722172127.GA712967@apples.localdomain> (raw)
In-Reply-To: <63d2e5ed-3c73-4a80-ae45-ce3665b406c8@linux.intel.com>

On Jul 22 10:00, Andrzej Jakowski wrote:
> On 7/22/20 12:43 AM, Klaus Jensen wrote:
> > @keith, please see below - can you comment on the Linux kernel 2 MB
> > boundary requirement for the CMB? Or should we hail Stephen (or Logan
> > maybe) since this seems to be related to p2pdma?
> > 
> > On Jul 21 14:54, Andrzej Jakowski wrote:
> >> On 7/15/20 1:06 AM, Klaus Jensen wrote:
> >>> Hi Andrzej,
> >>>
> >>> I've not been ignoring this, but sorry for not following up earlier.
> >>>
> >>> I'm hesitent to merge anything that very obviously breaks an OS that we
> >>> know is used a lot to this using this device. Also because the issue has
> >>> not been analyzed well enough to actually know if this is a QEMU or
> >>> kernel issue.
> >>
> >> Hi Klaus,
> >>
> >> Thx for your response! I understand your hesitance on merging stuff that
> >> obviously breaks guest OS. 
> >>
> >>>
> >>> Now, as far as I can test, having the MSI-X vector table and PBA in BAR
> >>> 0, PMR in BAR 2 and CMB in BAR 4 seems to make everyone happy
> >>> (irregardless of IOMMU on/off).
> >>>
> >>> Later, when the issue is better understood, we can add options to set
> >>> offsets, BIRs etc.
> >>>
> >>> The patch below replaces your "[PATCH v4 2/2] nvme: allow cmb and pmr to
> >>> be enabled" (but still requires "[PATCH v4 1/2] ...") and applies to
> >>> git://git.infradead.org/qemu-nvme.git nvme-next branch.
> >>>
> >>> Can you reproduce the issues with that patch? I can't on a stock Arch
> >>> Linux 5.7.5-arch1-1 kernel.
> >>
> >> While I'm happy that approach with MSIX and PBA in BAR0 works fine, I
> >> feel that investigation part why it works while mine doesn't is
> >> missing. It looks to me that both patches are basically following same 
> >> approach: create memory subregion and overlay on top of other memory
> >> region. Why one works and the other doesn't then?
> >>
> >> Having in mind that, I have recently focused on understanding problem.
> >> I observed that when guest assigns address to BAR4, addr field in
> >> nvme-bar4 memory region gets populated, but it doesn't get automatically
> >> populated in ctrl_mem (cmb) memory subregion, so later when nvme_addr_is_cmb() 
> >> is called address check works incorrectly and as a consequence vmm does dma 
> >> read instead of memcpy.
> >> I created a patch that sets correct address on ctrl_mem subregion and guest 
> >> OS boots up correctly.
> >>
> >> When I looked into pci and memory region code I noticed that indeed address
> >> is only assigned to top level memory region but not to contained subregions.
> >> I think that because in your approach cmb grabs whole bar exclusively it works
> >> fine.
> >>
> >> Here is my question (perhaps pci folks can help answer :)): if we consider 
> >> memory region overlapping for pci devices as valid use case should pci 
> >> code on configuration cycles walk through all contained subregion and
> >> update addr field accordingly?
> >>
> >> Thx!
> >>
> > 
> > Hi Andrzej,
> > 
> > Thanks for looking into this. I think your analysis helped me nail this.
> > The problem is that we added the use of a subregion and have some
> > assumptions that no longer hold.
> > 
> > nvme_addr_is_cmb() assumes that n->ctrl_mem.addr is an absolute address.
> > But when the memory region is a subregion, addr holds an offset into the
> > parent container instead. Thus, changing all occurances of
> > n->ctrl_mem.addr to (n->bar0.addr + n->ctrl_mem.addr) fixes the issue
> > (this is required in nvme_addr_is_cmb and nvme_map_prp). I patched that
> > in your original patch[1]. The reason my version worked is because there
> > was no subregion involved for the CMB, so the existing address
> > validation calculations were still correct.
> 
> I'm a little bit concerned with this approach:
> (n->bar0.addr + n->ctrl_mem.addr) and hoping to have some debate. Let me 
> describe my understanding of the problem.

Oh. In the context of your patch I meant bar4 of course, but anyway.

> It looks to me that addr field sometimes contains *absolute* address (when no 
> hierarchy is used) and other times it contains *relative* address (when
> hierarchy is created). From my perspective use of this field is inconsistent
> and thus error-prone.  
> Because of that I think that doing n->bar0.addr + n->ctrl_mem.addr doesn't
> solve root problem and is still prone to the same problem if in the future
> we potentially build even more complicated hierarchy.
> I think that we could solve it by introducing helper function like
> 
> hwaddr memory_region_get_abs_addr(MemoryRegion *mr) 
> 
> to retrieve absolute address and in the documentation indicate that addr field
> can be relative or absolute and it is recommended to use above function to 
> retrieve absolute address.
> What do you think?
> 

I'm all for a helper - I was not gonna cheer for the quick'n'dirty fix I
did just to convince myself that this was the issue ;)

I think the helper might already be there in memory.c. It's just not
exported.

   static hwaddr memory_region_to_absolute_addr(MemoryRegion *mr, hwaddr offset)

> > 
> > This leaves us with the Linux kernel complaining about not being able to
> > register the CMB if it is not on a 2MB boundary - this is probably just
> > a constraint in the kernel that we can't do anything about (but I'm no
> > kernel hacker...), which can be fixed by either being "nice" towards the
> > Linux kernel by forcing a 2 MB alignment in the device or exposing the
> > SZU field such that the user can choose 16MiB size units (or higher)
> > instead of 1MiB. I'm leaning towards ensuring the 2 MB alignment in the
> > device such that we do not have to introduce new cmb_size parameters,
> > while also making it easier for the user to configure. But I'm not
> > really sure.
> This is kernel limitation that we have to live with. The minimum granularity
> of devm_memremap_pages() is 2MB and it must be 2MB aligned. It used to worse
> at 128MB size+align (section-size), but sub-section memory-hotplug patches 
> adjusted that to a 2MB section. 

Thanks for that explanation!

> Ensuring 2MiB size and alignment in the device emulation makes sense to me.
> Perhaps we could document that limitations - making user more aware of it.
> 

Sounds good to me.