From: Christoph Hellwig <hch@infradead.org>
To: Dongli Zhang <dongli.zhang@oracle.com>
Cc: iommu@lists.linux.dev, x86@kernel.org,
xen-devel@lists.xenproject.org,
linux-arm-kernel@lists.infradead.org, linux-mips@vger.kernel.org,
linux-kernel@vger.kernel.org, hch@infradead.org,
sstabellini@kernel.org, linux@armlinux.org.uk,
tsbogend@alpha.franken.de, jgross@suse.com,
boris.ostrovsky@oracle.com, tglx@linutronix.de, mingo@redhat.com,
bp@alien8.de, dave.hansen@linux.intel.com,
m.szyprowski@samsung.com, konrad.wilk@oracle.com,
robin.murphy@arm.com, hpa@zytor.com,
oleksandr_tyshchenko@epam.com, joe.jin@oracle.com
Subject: Re: [PATCH RFC v2 1/1] wiotlb: split buffer into 32-bit default and 64-bit extra zones
Date: Tue, 20 Sep 2022 00:55:03 -0700 [thread overview]
Message-ID: <Yylx1xPdEbq3k4v1@infradead.org> (raw)
In-Reply-To: <20220820074250.5460-1-dongli.zhang@oracle.com>
On Sat, Aug 20, 2022 at 12:42:50AM -0700, Dongli Zhang wrote:
> Hello,
>
> I used to send out RFC v1 to introduce an extra io_tlb_mem (created with
> SWIOTLB_ANY) in addition to the default io_tlb_mem (32-bit). The
> dev->dma_io_tlb_mem is set to either default or the extra io_tlb_mem,
> depending on dma mask. However, that is not good for setting
> dev->dma_io_tlb_mem at swiotlb layer transparently as suggested by
> Christoph Hellwig.
>
> https://lore.kernel.org/all/20220609005553.30954-1-dongli.zhang@oracle.com/
>
> Therefore, this is another RFC v2 implementation following a different
> direction. The core ideas are:
>
> 1. The swiotlb is splited into two zones, io_tlb_mem->zone[0] (32-bit) and
> io_tlb_mem->zone[1] (64-bit).
>
> struct io_tlb_mem {
> struct io_tlb_zone zone[SWIOTLB_NR];
> struct dentry *debugfs;
> bool late_alloc;
> bool force_bounce;
> bool for_alloc;
> bool has_extra;
> };
>
> struct io_tlb_zone {
> phys_addr_t start;
> phys_addr_t end;
> void *vaddr;
> unsigned long nslabs;
> unsigned long used;
> unsigned int nareas;
> unsigned int area_nslabs;
> struct io_tlb_area *areas;
> struct io_tlb_slot *slots;
> };
>
> 2. By default, only io_tlb_mem->zone[0] is available. The
> io_tlb_mem->zone[1] is allocated conditionally if:
>
> - the "swiotlb=" is configured to allocate extra buffer, and
> - the SWIOTLB_EXTRA is set in the flag (this is to make sure arch(s) other
> than x86/sev/xen will not enable it until it is fully tested by each
> arch, e.g., mips/powerpc). Currently it is enabled for x86 and xen.
>
> 3. During swiotlb map, whether zone[0] (32-bit) or zone[1] (64-bit
> SWIOTLB_ANY)
> is used depends on min_not_zero(*dev->dma_mask, dev->bus_dma_limit).
>
> To test the RFC v2, here is the QEMU command line.
>
> qemu-system-x86_64 -smp 8 -m 32G -enable-kvm -vnc :5 -hda disk.img \
> -kernel path-to-linux/arch/x86_64/boot/bzImage \
> -append "root=/dev/sda1 init=/sbin/init text console=ttyS0 loglevel=7 swiotlb=32768,4194304,force" \
> -net nic -net user,hostfwd=tcp::5025-:22 \
> -device nvme,drive=nvme01,serial=helloworld -drive file=test.qcow2,if=none,id=nvme01 \
> -serial stdio
>
> There is below in syslog. The extra 8GB buffer is allocated.
>
> [ 0.152251] software IO TLB: area num 8.
> ... ...
> [ 3.706088] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
> [ 3.707334] software IO TLB: mapped default [mem 0x00000000bbfd7000-0x00000000bffd7000] (64MB)
> [ 3.708585] software IO TLB: mapped extra [mem 0x000000061cc00000-0x000000081cc00000] (8192MB)
>
> After the FIO is triggered over NVMe, the 64-bit buffer is used.
>
> $ cat /sys/kernel/debug/swiotlb/io_tlb_nslabs_extra
> 4194304
> $ cat /sys/kernel/debug/swiotlb/io_tlb_used_extra
> 327552
>
> Would you mind helping if this is the right direction to go?
>
> Thank you very much!
>
> Cc: Konrad Wilk <konrad.wilk@oracle.com>
> Cc: Joe Jin <joe.jin@oracle.com>
> Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
> ---
> arch/arm/xen/mm.c | 2 +-
> arch/mips/pci/pci-octeon.c | 5 +-
> arch/x86/include/asm/xen/swiotlb-xen.h | 2 +-
> arch/x86/kernel/pci-dma.c | 6 +-
> drivers/xen/swiotlb-xen.c | 18 +-
> include/linux/swiotlb.h | 73 +++--
> diff --git a/arch/arm/xen/mm.c b/arch/arm/xen/mm.c
> index 3d826c0..4edfa42 100644
> --- a/arch/arm/xen/mm.c
> +++ b/arch/arm/xen/mm.c
> @@ -125,7 +125,7 @@ static int __init xen_mm_init(void)
> return 0;
>
> /* we can work with the default swiotlb */
> - if (!io_tlb_default_mem.nslabs) {
> + if (!io_tlb_default_mem.zone[SWIOTLB_DF].nslabs) {
> rc = swiotlb_init_late(swiotlb_size_or_default(),
> xen_swiotlb_gfp(), NULL);
> if (rc < 0)
First thing we need before doing anything about multiple default
pools is to get all the knowledge of details hidden inside swiotlb.c.
For swiotlb_init_late that seems easy as we can just move the check
into it.
> diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
> index e457a18..0bf0859 100644
> --- a/arch/mips/pci/pci-octeon.c
> +++ b/arch/mips/pci/pci-octeon.c
> @@ -654,6 +654,9 @@ static int __init octeon_pci_setup(void)
> octeon_pci_mem_resource.end =
> octeon_pci_mem_resource.start + (1ul << 30);
> } else {
> + struct io_tlb_mem *mem = &io_tlb_default_mem;
> + struct io_tlb_zone *zone = &mem->zone[SWIOTLB_DF];
> +
> /* Remap the Octeon BAR 0 to map 128MB-(128MB+4KB) */
> octeon_npi_write32(CVMX_NPI_PCI_CFG04, 128ul << 20);
> octeon_npi_write32(CVMX_NPI_PCI_CFG05, 0);
> @@ -664,7 +667,7 @@ static int __init octeon_pci_setup(void)
>
> /* BAR1 movable regions contiguous to cover the swiotlb */
> octeon_bar1_pci_phys =
> - io_tlb_default_mem.start & ~((1ull << 22) - 1);
> + zone->start & ~((1ull << 22) - 1);
But we'll need to do something about this mess. I'll need help from
the octeon maintainer on it.
> - x86_swiotlb_flags |= SWIOTLB_ANY;
> + x86_swiotlb_flags |= SWIOTLB_ANY | SWIOTLB_EXTRA;
I don't think this is how it is suppoed to be. SWIOTLB_ANY already
says give me a pool with no addressing constrains. We don't need
two pools without that. EXTRA is also not exactly a very helpful
name here.
> #ifdef CONFIG_X86
> -int xen_swiotlb_fixup(void *buf, unsigned long nslabs)
> +int xen_swiotlb_fixup(void *buf, unsigned long nslabs, unsigned int flags)
> {
> int rc;
> unsigned int order = get_order(IO_TLB_SEGSIZE << IO_TLB_SHIFT);
> unsigned int i, dma_bits = order + PAGE_SHIFT;
> dma_addr_t dma_handle;
> phys_addr_t p = virt_to_phys(buf);
> + unsigned int max_dma_bits = 32;
I think live will be a lot simple if the addressing bits are passed to
this function, and not some kind of flags.
> +#define SWIOTLB_DF 0
> +#define SWIOTLB_EX 1
> +#define SWIOTLB_NR 2
These names are not very descriptive.
prev parent reply other threads:[~2022-09-20 7:55 UTC|newest]
Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-20 7:42 [PATCH RFC v2 1/1] wiotlb: split buffer into 32-bit default and 64-bit extra zones Dongli Zhang
2022-09-20 7:55 ` Christoph Hellwig [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yylx1xPdEbq3k4v1@infradead.org \
--to=hch@infradead.org \
--cc=boris.ostrovsky@oracle.com \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=dongli.zhang@oracle.com \
--cc=hpa@zytor.com \
--cc=iommu@lists.linux.dev \
--cc=jgross@suse.com \
--cc=joe.jin@oracle.com \
--cc=konrad.wilk@oracle.com \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mips@vger.kernel.org \
--cc=linux@armlinux.org.uk \
--cc=m.szyprowski@samsung.com \
--cc=mingo@redhat.com \
--cc=oleksandr_tyshchenko@epam.com \
--cc=robin.murphy@arm.com \
--cc=sstabellini@kernel.org \
--cc=tglx@linutronix.de \
--cc=tsbogend@alpha.franken.de \
--cc=x86@kernel.org \
--cc=xen-devel@lists.xenproject.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).