From: Leonardo Bras <leobras.c@gmail.com>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Paul Mackerras <paulus@samba.org>,
	Brian King <brking@linux.vnet.ibm.com>,
	linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Subject: Re: [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR
Date: Fri, 9 Apr 2021 01:44:54 -0300	[thread overview]
Message-ID: <CADvQ+rGq8-gp6vf1QLgKCWTVHjX7O1EuKo8gciXeKv5z9UkiOA@mail.gmail.com> (raw)
In-Reply-To: <21407a96-5b20-3fae-f1c8-895973b655ef@ozlabs.ru>


On Fri, Apr 9, 2021, 01:36, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

>
>
> On 08/04/2021 19:04, Michael Ellerman wrote:
> > Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> >> On 08/04/2021 15:37, Michael Ellerman wrote:
> >>> Leonardo Bras <leobras.c@gmail.com> writes:
> >>>> According to LoPAR, ibm,query-pe-dma-window output named "IO Page
> >>>> Sizes" will let the OS know all possible pagesizes that can be used
> >>>> for creating a new DDW.
> >>>>
> >>>> Currently Linux will only try using 3 of the 8 available options:
> >>>> 4K, 64K and 16M. According to LoPAR, the hypervisor may also offer
> >>>> 32M, 64M, 128M, 256M and 16G.
> >>>
> >>> Do we know of any hardware & hypervisor combination that will actually
> >>> give us bigger pages?
> >>
> >>
> >> On P8 16MB host pages and 16MB hardware iommu pages worked.
> >>
> >> On P9, VM's 16MB IOMMU pages worked on top of 2MB host pages + 2MB
> >> hardware IOMMU pages.
> >
> > The current code already tries 16MB though.
> >
> > I'm wondering if we're going to ask for larger sizes that have never
> > been tested and possibly expose bugs. But it sounds like this is mainly
> > targeted at future platforms.
>
>
> For fun, I tried passing a PCI device through to a guest with this patch, like so:
>
> pbuild/qemu-killslof-aiku1904le-ppc64/qemu-system-ppc64 \
> -nodefaults \
> -chardev stdio,id=STDIO0,signal=off,mux=on \
> -device spapr-vty,id=svty0,reg=0x71000110,chardev=STDIO0 \
> -mon id=MON0,chardev=STDIO0,mode=readline \
> -nographic \
> -vga none \
> -enable-kvm \
> -m 16G \
> -kernel ./vmldbg \
> -initrd /home/aik/t/le.cpio \
> -device vfio-pci,id=vfio0001_01_00_0,host=0001:01:00.0 \
> -mem-prealloc \
> -mem-path qemu_hp_1G_node0 \
> -global spapr-pci-host-bridge.pgsz=0xffffff000 \
> -machine cap-cfpc=broken,cap-ccf-assist=off \
> -smp 1,threads=1 \
> -L /home/aik/t/qemu-ppc64-bios/ \
> -trace events=qemu_trace_events \
> -d guest_errors,mmu \
> -chardev socket,id=SOCKET0,server=on,wait=off,path=qemu.mon.1_1_0_0 \
> -mon chardev=SOCKET0,mode=control
>
>
> The guest created a huge window:
>
> xhci_hcd 0000:00:00.0: ibm,create-pe-dma-window(2027) 0 8000000 20000000
> 22 22 returned 0 (liobn = 0x80000001 starting addr = 8000000 0)
>
> The first "22" is page_shift in hex (16GB), the second "22" is
> window_shift (so we have 1 TCE).
>
> On the host side the window#1 was created with 1GB pages:
> pci 0001:01     : [PE# fd] Setting up window#1
> 800000000000000..80007ffffffffff pg=40000000
>
>
> The XHCI seems to be working. Without the patch, 16MB was the maximum.
>
>
> >
> >>>> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> >>>> index 9fc5217f0c8e..6cda1c92597d 100644
> >>>> --- a/arch/powerpc/platforms/pseries/iommu.c
> >>>> +++ b/arch/powerpc/platforms/pseries/iommu.c
> >>>> @@ -53,6 +53,20 @@ enum {
> >>>>            DDW_EXT_QUERY_OUT_SIZE = 2
> >>>>    };
> >>>
> >>> A comment saying where the values come from would be good.
> >>>
> >>>> +#define QUERY_DDW_PGSIZE_4K       0x01
> >>>> +#define QUERY_DDW_PGSIZE_64K      0x02
> >>>> +#define QUERY_DDW_PGSIZE_16M      0x04
> >>>> +#define QUERY_DDW_PGSIZE_32M      0x08
> >>>> +#define QUERY_DDW_PGSIZE_64M      0x10
> >>>> +#define QUERY_DDW_PGSIZE_128M     0x20
> >>>> +#define QUERY_DDW_PGSIZE_256M     0x40
> >>>> +#define QUERY_DDW_PGSIZE_16G      0x80
> >>>
> >>> I'm not sure the #defines really gain us much vs just putting the
> >>> literal values in the array below?
> >>
> >> Then someone says "uuuuu magic values" :) I do not mind either way. Thanks,
> >
> > Yeah that's true. But #defining them doesn't make them less magic, if
> > you only use them in one place :)
>
> Defining them with "QUERY_DDW" in the names kinda tells where they come
> from. We can also grep QEMU for these to see how the other side handles
> them. Dunno.
>
> btw the bot complained about __builtin_ctz(SZ_16G) which should be
> __builtin_ctzl(SZ_16G) so we have to ask Leonardo to repost anyway :)
>

Thanks for testing!
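
For anyone decoding the numbers in the test above (my reading of the log,
not something stated in the thread):

  page_shift   = 0x22 = 34  =>  IOMMU page size = 2^34 = 16 GiB
  window_shift = 0x22 = 34  =>  window size     = 2^34 = 16 GiB,
                                so 16 GiB / 16 GiB per entry = exactly 1 TCE
  host window:  pg = 0x40000000 = 2^30  =>  1 GiB host IOMMU pages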

http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210408201915.174217-1-leobras.c@gmail.com/

I sent a v3 a few hours ago, fixing this by using __builtin_ctzll() instead
of __builtin_ctz() for all sizes, and it worked like a charm.
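
For context on the fix: __builtin_ctz() takes an unsigned int, so SZ_16G
(1ULL << 34) gets truncated when passed to it, while __builtin_ctzll() takes
an unsigned long long and handles every size in the list. A minimal
standalone illustration (the SZ_* constants here are local stand-ins for the
kernel's <linux/sizes.h> macros, not the patch itself):

#include <stdio.h>

#define SZ_16M (1UL << 24)
#define SZ_16G (1ULL << 34)	/* does not fit in a 32-bit unsigned int */

int main(void)
{
	/* Fine: 16M fits in unsigned int, so this prints 24. */
	printf("ctz(SZ_16M)   = %d\n", __builtin_ctz(SZ_16M));

	/*
	 * Broken: SZ_16G is silently truncated to 0 on conversion to
	 * unsigned int, and __builtin_ctz(0) is undefined behaviour.
	 */
	/* printf("%d\n", __builtin_ctz(SZ_16G)); */

	/* Correct: __builtin_ctzll() takes unsigned long long, prints 34. */
	printf("ctzll(SZ_16G) = %d\n", __builtin_ctzll(SZ_16G));

	return 0;
}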

I also reverted to the previous approach of not having QUERY_DDW defines
for masks, as Michael suggested.
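
To illustrate the direction this takes (a rough standalone sketch, with names
and layout guessed by me, not the actual v3 patch): the "IO Page Sizes" bit
mask from ibm,query-pe-dma-window is walked from the largest size down and
translated straight into a page shift from an array of literal sizes, so no
extra mask defines are needed. Bit 0 = 4K, bit 1 = 64K, ... bit 7 = 16G,
matching the 0x01..0x80 values quoted above.

#include <stdio.h>

/* Local stand-ins for the kernel's SZ_* constants from <linux/sizes.h>. */
#define SZ_4K   (1ULL << 12)
#define SZ_64K  (1ULL << 16)
#define SZ_16M  (1ULL << 24)
#define SZ_32M  (1ULL << 25)
#define SZ_64M  (1ULL << 26)
#define SZ_128M (1ULL << 27)
#define SZ_256M (1ULL << 28)
#define SZ_16G  (1ULL << 34)

/* Pick the largest IOMMU page shift the hypervisor reported as supported. */
static int query_mask_to_page_shift(unsigned int query_page_size)
{
	const unsigned long long sizes[] = {
		SZ_4K, SZ_64K, SZ_16M, SZ_32M, SZ_64M, SZ_128M, SZ_256M, SZ_16G,
	};
	int i;

	for (i = (int)(sizeof(sizes) / sizeof(sizes[0])) - 1; i >= 0; i--) {
		if (query_page_size & (1u << i))
			return __builtin_ctzll(sizes[i]);
	}

	return 0;	/* no supported page size reported */
}

int main(void)
{
	/* Example: hypervisor offers 4K, 64K, 16M and 16G (0x01|0x02|0x04|0x80). */
	printf("best page shift = %d\n", query_mask_to_page_shift(0x87));	/* 34 */
	return 0;
}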

I can go back to the v2 approach if you decide it's better.

Best regards,
Leonardo Bras


Thread overview: 13+ messages
2021-04-07 19:56 [PATCH v2 1/1] powerpc/iommu: Enable remaining IOMMU Pagesizes present in LoPAR Leonardo Bras
2021-04-08  0:22 ` kernel test robot
2021-04-08  5:37 ` Michael Ellerman
2021-04-08  6:20   ` Leonardo Bras
2021-04-08  6:35     ` Leonardo Bras
2021-04-08  9:08       ` Michael Ellerman
2021-04-08  7:13   ` Alexey Kardashevskiy
2021-04-08  9:04     ` Michael Ellerman
2021-04-09  4:36       ` Alexey Kardashevskiy
2021-04-09  4:44         ` Leonardo Bras [this message]
2021-04-12 22:21         ` Segher Boessenkool
2021-04-14  4:02           ` Leonardo Bras
2021-04-08  7:48 ` kernel test robot
