Subject: Re: [RFT PATCH 0/3] Performance regression noted in v5.11-rc after c062db039f40
To: Chuck Lever
References: <20210125023858.570175-1-baolu.lu@linux.intel.com>
 <83EE54C6-F654-4D1D-92F7-F442ACEC8D70@oracle.com>
 <7937bfa5-5cb9-0a23-176c-e91e5e9ac962@linux.intel.com>
From: Lu Baolu
Date: Wed, 27 Jan 2021 09:53:59 +0800
Cc: isaacm@codeaurora.org, Robin Murphy, iommu@lists.linux-foundation.org, Will Deacon

Hi Chuck,

On 1/26/21 11:52 PM, Chuck Lever wrote:
>
>
>> On Jan 26, 2021, at 1:18 AM, Lu Baolu wrote:
>>
>> On 2021/1/26 3:31, Chuck Lever wrote:
>>>> On Jan 25, 2021, at 12:39 PM, Chuck Lever wrote:
>>>>
>>>> Hello Lu -
>>>>
>>>> Many thanks for your prototype.
>>>>
>>>>
>>>>> On Jan 24, 2021, at 9:38 PM, Lu Baolu wrote:
>>>>>
>>>>> This patch series is only for Request-For-Testing purposes. It aims to
>>>>> fix the performance regression reported here.
>>>>>
>>>>> https://lore.kernel.org/linux-iommu/D81314ED-5673-44A6-B597-090E3CB83EB0@oracle.com/
>>>>>
>>>>> The first two patches are borrowed from here.
>>>>>
>>>>> https://lore.kernel.org/linux-iommu/20210107122909.16317-1-yong.wu@mediatek.com/
>>>>>
>>>>> Please kindly help with verification.
>>>>>
>>>>> Best regards,
>>>>> baolu
>>>>>
>>>>> Lu Baolu (1):
>>>>>   iommu/vt-d: Add iotlb_sync_map callback
>>>>>
>>>>> Yong Wu (2):
>>>>>   iommu: Move iotlb_sync_map out from __iommu_map
>>>>>   iommu: Add iova and size as parameters in iotlb_sync_map
>>>>>
>>>>>  drivers/iommu/intel/iommu.c | 86 +++++++++++++++++++++++++------------
>>>>>  drivers/iommu/iommu.c       | 23 +++++++---
>>>>>  drivers/iommu/tegra-gart.c  |  7 ++-
>>>>>  include/linux/iommu.h       |  3 +-
>>>>>  4 files changed, 83 insertions(+), 36 deletions(-)
>>>>
>>>> Here are results with the NFS client at stock v5.11-rc5 and the
>>>> NFS server at v5.10, showing the regression I reported earlier.
>>>>
>>>>    Children see throughput for 12 initial writers = 4534582.00 kB/sec
>>>>    Parent sees throughput for 12 initial writers  = 4458145.56 kB/sec
>>>>    Min throughput per process                     =  373101.59 kB/sec
>>>>    Max throughput per process                     =  382669.50 kB/sec
>>>>    Avg throughput per process                     =  377881.83 kB/sec
>>>>    Min xfer                                       = 1022720.00 kB
>>>>    CPU Utilization: Wall time 2.787  CPU time 1.922  CPU utilization 68.95 %
>>>>
>>>>
>>>>    Children see throughput for 12 rewriters       = 4542003.12 kB/sec
>>>>    Parent sees throughput for 12 rewriters        = 4538024.19 kB/sec
>>>>    Min throughput per process                     =  374672.00 kB/sec
>>>>    Max throughput per process                     =  383983.78 kB/sec
>>>>    Avg throughput per process                     =  378500.26 kB/sec
>>>>    Min xfer                                       = 1022976.00 kB
>>>>    CPU utilization: Wall time 2.733  CPU time 1.947  CPU utilization 71.25 %
>>>>
>>>>
>>>>    Children see throughput for 12 readers         = 4568632.03 kB/sec
>>>>    Parent sees throughput for 12 readers          = 4563672.02 kB/sec
>>>>    Min throughput per process                     =  376727.56 kB/sec
>>>>    Max throughput per process                     =  383783.91 kB/sec
>>>>    Avg throughput per process                     =  380719.34 kB/sec
>>>>    Min xfer                                       = 1029376.00 kB
>>>>    CPU utilization: Wall time 2.733  CPU time 1.898  CPU utilization 69.46 %
>>>>
>>>>
>>>>    Children see throughput for 12 re-readers      = 4610702.78 kB/sec
>>>>    Parent sees throughput for 12 re-readers       = 4606135.66 kB/sec
>>>>    Min throughput per process                     =  381532.78 kB/sec
>>>>    Max throughput per process                     =  387072.53 kB/sec
>>>>    Avg throughput per process                     =  384225.23 kB/sec
>>>>    Min xfer                                       = 1034496.00 kB
>>>>    CPU utilization: Wall time 2.711  CPU time 1.910  CPU utilization 70.45 %
>>>>
>>>> Here's the NFS client at v5.11-rc5 with your series applied.
>>>> The NFS server remains at v5.10:
>>>>
>>>>    Children see throughput for 12 initial writers = 4434778.81 kB/sec
>>>>    Parent sees throughput for 12 initial writers  = 4408190.69 kB/sec
>>>>    Min throughput per process                     =  367865.28 kB/sec
>>>>    Max throughput per process                     =  371134.38 kB/sec
>>>>    Avg throughput per process                     =  369564.90 kB/sec
>>>>    Min xfer                                       = 1039360.00 kB
>>>>    CPU Utilization: Wall time 2.842  CPU time 1.904  CPU utilization 66.99 %
>>>>
>>>>
>>>>    Children see throughput for 12 rewriters       = 4476870.69 kB/sec
>>>>    Parent sees throughput for 12 rewriters        = 4471701.48 kB/sec
>>>>    Min throughput per process                     =  370985.34 kB/sec
>>>>    Max throughput per process                     =  374752.28 kB/sec
>>>>    Avg throughput per process                     =  373072.56 kB/sec
>>>>    Min xfer                                       = 1038592.00 kB
>>>>    CPU utilization: Wall time 2.801  CPU time 1.902  CPU utilization 67.91 %
>>>>
>>>>
>>>>    Children see throughput for 12 readers         = 5865268.88 kB/sec
>>>>    Parent sees throughput for 12 readers          = 5854519.73 kB/sec
>>>>    Min throughput per process                     =  487766.81 kB/sec
>>>>    Max throughput per process                     =  489623.88 kB/sec
>>>>    Avg throughput per process                     =  488772.41 kB/sec
>>>>    Min xfer                                       = 1044736.00 kB
>>>>    CPU utilization: Wall time 2.144  CPU time 1.895  CPU utilization 88.41 %
>>>>
>>>>
>>>>    Children see throughput for 12 re-readers      = 5847438.62 kB/sec
>>>>    Parent sees throughput for 12 re-readers       = 5839292.18 kB/sec
>>>>    Min throughput per process                     =  485835.03 kB/sec
>>>>    Max throughput per process                     =  488702.12 kB/sec
>>>>    Avg throughput per process                     =  487286.55 kB/sec
>>>>    Min xfer                                       = 1042688.00 kB
>>>>    CPU utilization: Wall time 2.148  CPU time 1.909  CPU utilization 88.84 %
>>>>
>>>> NFS READ throughput is almost fully restored.
>>>> A normal-looking throughput result, copied from the previous thread, is:
>>>>
>>>>    Children see throughput for 12 readers         = 5921370.94 kB/sec
>>>>    Parent sees throughput for 12 readers          = 5914106.69 kB/sec
>>>>
>>>> The NFS WRITE throughput result appears to be unchanged, or slightly
>>>> worse than before. I don't have an explanation for this result. I applied
>>>> your patches on the NFS server also without noting improvement.
>>>
>>> Function-boundary tracing shows some interesting results.
>>>
>>>   # trace-cmd record -e rpcrdma -e iommu -p function_graph --max-graph-depth=5 -g dma_map_sg_attrs
>>>
>>> Some 120KB SGLs are DMA-mapped in a single call to __iommu_map(). Other SGLs of
>>> the same size need as many as one __iommu_map() call per SGL element (which
>>> would be 30 for a 120KB SGL).
>>>
>>> In v5.10, intel_map_sg() was structured such that an SGL is always handled with
>>> a single call to domain_mapping() and thus always just a single TLB flush.
>>>
>>> My amateur theorizing suggests that the SGL element coalescing done in
>>> __iommu_map_sg() is not working as well as intel_map_sg() used to, which results
>>> in more calls to domain_mapping(). Not only does that take longer, but it creates
>>> many more DMA maps. Could that also have some impact on device TLB resources?
>>
>> It seems that the additional domain_mapping() calls are caused not by
>> __iommu_map_sg() but by __iommu_map().
>>
>> Can you please test the changes below? They call intel_iommu_map() directly
>> instead of __iommu_map().
>>
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index f5a236e63ded..660d5744a117 100644
>> --- a/drivers/iommu/intel/iommu.c
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -4916,7 +4916,7 @@ intel_iommu_sva_invalidate(struct iommu_domain *domain, struct device *dev,
>>  }
>>  #endif
>>
>> -static int intel_iommu_map(struct iommu_domain *domain,
>> +int intel_iommu_map(struct iommu_domain *domain,
>>  			   unsigned long iova, phys_addr_t hpa,
>>  			   size_t size, int iommu_prot, gfp_t gfp)
>>  {
>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>> index 3d099a31ddca..a1b41fd3fb4e 100644
>> --- a/drivers/iommu/iommu.c
>> +++ b/drivers/iommu/iommu.c
>> @@ -23,8 +23,13 @@
>>  #include
>>  #include
>>  #include
>> +#include
>>  #include
>>
>> +extern int intel_iommu_map(struct iommu_domain *domain,
>> +			   unsigned long iova, phys_addr_t hpa,
>> +			   size_t size, int iommu_prot, gfp_t gfp);
>> +
>>  static struct kset *iommu_group_kset;
>>  static DEFINE_IDA(iommu_group_ida);
>>
>> @@ -2553,8 +2558,7 @@ static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
>>  		phys_addr_t s_phys = sg_phys(sg);
>>
>>  		if (len && s_phys != start + len) {
>> -			ret = __iommu_map(domain, iova + mapped, start,
>> -					  len, prot, gfp);
>> +			ret = intel_iommu_map(domain, iova + mapped, start, len, prot, gfp);
>>
>>  			if (ret)
>>  				goto out_err;
>>
>> Does it change anything?
>
> I removed yesterday's 3-patch series, and applied the above.
> Not a full restoration, but interesting nonetheless.

Would you mind giving it a try with all 4 patches applied?
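
For reference, the hunk quoted above sits in the element-coalescing loop of
__iommu_map_sg(). Simplified (iotlb sync hooks trimmed), that loop looks
roughly like this:

        /*
         * Simplified sketch of __iommu_map_sg(), not a verbatim copy of the
         * v5.11-rc source. Physically contiguous scatterlist elements are
         * merged into one region before __iommu_map() is called, so a fully
         * contiguous 120KB SGL maps with one call, while a fragmented one
         * maps element by element.
         */
        static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
                                     struct scatterlist *sg, unsigned int nents,
                                     int prot, gfp_t gfp)
        {
                size_t len = 0, mapped = 0;
                phys_addr_t start = 0;
                unsigned int i = 0;
                int ret;

                while (i <= nents) {
                        phys_addr_t s_phys = sg_phys(sg);

                        /* Flush the accumulated region once contiguity breaks. */
                        if (len && s_phys != start + len) {
                                ret = __iommu_map(domain, iova + mapped, start,
                                                  len, prot, gfp);
                                if (ret)
                                        goto out_err;
                                mapped += len;
                                len = 0;
                        }

                        if (len) {
                                len += sg->length;      /* extend the current region */
                        } else {
                                len = sg->length;       /* start a new region */
                                start = s_phys;
                        }

                        if (++i < nents)
                                sg = sg_next(sg);
                }

                return mapped;

        out_err:
                /* undo the mappings already done */
                iommu_unmap(domain, iova, mapped);
                return 0;
        }

Every break in physical contiguity costs another __iommu_map() call, which is
what your function_graph trace is counting.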
>
>    Children see throughput for 12 initial writers = 4852657.22 kB/sec
>    Parent sees throughput for 12 initial writers  = 4826730.38 kB/sec
>    Min throughput per process                     =  403196.09 kB/sec
>    Max throughput per process                     =  406071.47 kB/sec
>    Avg throughput per process                     =  404388.10 kB/sec
>    Min xfer                                       = 1041408.00 kB
>    CPU Utilization: Wall time 2.596  CPU time 2.047  CPU utilization 78.84 %
>
>
>    Children see throughput for 12 rewriters       = 4853900.22 kB/sec
>    Parent sees throughput for 12 rewriters        = 4848623.31 kB/sec
>    Min throughput per process                     =  403380.81 kB/sec
>    Max throughput per process                     =  405478.53 kB/sec
>    Avg throughput per process                     =  404491.68 kB/sec
>    Min xfer                                       = 1042944.00 kB
>    CPU utilization: Wall time 2.589  CPU time 2.048  CPU utilization 79.12 %
>
>
>    Children see throughput for 12 readers         = 4938124.12 kB/sec
>    Parent sees throughput for 12 readers          = 4932862.08 kB/sec
>    Min throughput per process                     =  408768.81 kB/sec
>    Max throughput per process                     =  413879.25 kB/sec
>    Avg throughput per process                     =  411510.34 kB/sec
>    Min xfer                                       = 1036800.00 kB
>    CPU utilization: Wall time 2.536  CPU time 1.906  CPU utilization 75.16 %
>
>
>    Children see throughput for 12 re-readers      = 4992115.16 kB/sec
>    Parent sees throughput for 12 re-readers       = 4986102.07 kB/sec
>    Min throughput per process                     =  411103.00 kB/sec
>    Max throughput per process                     =  420302.97 kB/sec
>    Avg throughput per process                     =  416009.60 kB/sec
>    Min xfer                                       = 1025792.00 kB
>    CPU utilization: Wall time 2.497  CPU time 1.887  CPU utilization 75.57 %
>
> NFS WRITE throughput improves significantly. NFS READ throughput
> improves somewhat, but not to the degree it did with yesterday's
> patch series.
>
> function_graph shows a single intel_iommu_map() is used more
> frequently, but the following happens on occasion:

This happens when the pages are not physically contiguous, so __iommu_map_sg()
cannot coalesce them.

If there is no full restoration even with all four patches applied, it could be
due to the ping-pong between the iommu core and the vendor iommu driver through
indirect calls. Hence, a possible solution is to add an iommu_ops->map_sg()
callback so that the whole scatterlist can be handled in a single function, as
before. This has already been proposed by Isaac J. Manjarres:

https://lore.kernel.org/linux-iommu/1610376862-927-5-git-send-email-isaacm@codeaurora.org/
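
To make that concrete, here is a rough sketch of the shape such a callback
could take. The signature and the __iommu_map_sg_legacy() name below are
illustrative only, not copied from Isaac's series; see the link above for the
real patches.

        /*
         * Rough sketch of the idea only. A map_sg callback in iommu_ops lets
         * the core hand the whole scatterlist to the vendor driver in one
         * indirect call, instead of one __iommu_map()/ops->map round trip per
         * coalesced region.
         */

        /* New (hypothetical) member in struct iommu_ops: */
        int (*map_sg)(struct iommu_domain *domain, unsigned long iova,
                      struct scatterlist *sg, unsigned int nents,
                      int prot, gfp_t gfp, size_t *mapped);

        /* __iommu_map_sg() would then reduce to roughly: */
        static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
                                     struct scatterlist *sg, unsigned int nents,
                                     int prot, gfp_t gfp)
        {
                const struct iommu_ops *ops = domain->ops;
                size_t mapped = 0;

                if (!ops->map_sg)
                        /* hypothetical name for the existing per-region loop */
                        return __iommu_map_sg_legacy(domain, iova, sg, nents,
                                                     prot, gfp);

                /* Single indirect call; the driver walks and coalesces the SGL. */
                if (ops->map_sg(domain, iova, sg, nents, prot, gfp, &mapped))
                        return 0;

                /* One IOTLB sync for the whole range, as in the series under test. */
                if (ops->iotlb_sync_map)
                        ops->iotlb_sync_map(domain, iova, mapped);

                return mapped;
        }

With something like this in place, a 30-element SGL would cost one indirect
call into the driver plus one IOTLB sync, rather than up to 30 per-region
round trips.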
Best regards,
baolu

>
>  395.678889: funcgraph_entry:              |  dma_map_sg_attrs() {
>  395.678889: funcgraph_entry:              |    iommu_dma_map_sg() {
>  395.678890: funcgraph_entry:    0.257 us  |      iommu_get_dma_domain();
>  395.678890: funcgraph_entry:    0.255 us  |      iommu_dma_deferred_attach();
>  395.678891: funcgraph_entry:              |      iommu_dma_sync_sg_for_device() {
>  395.678891: funcgraph_entry:    0.253 us  |        dev_is_untrusted();
>  395.678891: funcgraph_exit:     0.773 us  |      }
>  395.678892: funcgraph_entry:    0.250 us  |      dev_is_untrusted();
>  395.678893: funcgraph_entry:              |      iommu_dma_alloc_iova() {
>  395.678893: funcgraph_entry:              |        alloc_iova_fast() {
>  395.678893: funcgraph_entry:    0.255 us  |          _raw_spin_lock_irqsave();
>  395.678894: funcgraph_entry:    0.375 us  |          __lock_text_start();
>  395.678894: funcgraph_exit:     1.435 us  |        }
>  395.678895: funcgraph_exit:     2.002 us  |      }
>  395.678895: funcgraph_entry:    0.252 us  |      dma_info_to_prot();
>  395.678895: funcgraph_entry:              |      iommu_map_sg_atomic() {
>  395.678896: funcgraph_entry:              |        __iommu_map_sg() {
>  395.678896: funcgraph_entry:    1.675 us  |          intel_iommu_map();
>  395.678898: funcgraph_entry:    1.365 us  |          intel_iommu_map();
>  395.678900: funcgraph_entry:    1.373 us  |          intel_iommu_map();
>  395.678901: funcgraph_entry:    1.378 us  |          intel_iommu_map();
>  395.678903: funcgraph_entry:    1.380 us  |          intel_iommu_map();
>  395.678905: funcgraph_entry:    1.380 us  |          intel_iommu_map();
>  395.678906: funcgraph_entry:    1.378 us  |          intel_iommu_map();
>  395.678908: funcgraph_entry:    1.380 us  |          intel_iommu_map();
>  395.678910: funcgraph_entry:    1.345 us  |          intel_iommu_map();
>  395.678911: funcgraph_entry:    1.342 us  |          intel_iommu_map();
>  395.678913: funcgraph_entry:    1.342 us  |          intel_iommu_map();
>  395.678915: funcgraph_entry:    1.395 us  |          intel_iommu_map();
>  395.678916: funcgraph_entry:    1.343 us  |          intel_iommu_map();
>  395.678918: funcgraph_entry:    1.350 us  |          intel_iommu_map();
>  395.678920: funcgraph_entry:    1.395 us  |          intel_iommu_map();
>  395.678921: funcgraph_entry:    1.342 us  |          intel_iommu_map();
>  395.678923: funcgraph_entry:    1.350 us  |          intel_iommu_map();
>  395.678924: funcgraph_entry:    1.345 us  |          intel_iommu_map();
>  395.678926: funcgraph_entry:    1.345 us  |          intel_iommu_map();
>  395.678928: funcgraph_entry:    1.340 us  |          intel_iommu_map();
>  395.678929: funcgraph_entry:    1.342 us  |          intel_iommu_map();
>  395.678931: funcgraph_entry:    1.335 us  |          intel_iommu_map();
>  395.678933: funcgraph_entry:    1.345 us  |          intel_iommu_map();
>  395.678934: funcgraph_entry:    1.337 us  |          intel_iommu_map();
>  395.678936: funcgraph_entry:    1.305 us  |          intel_iommu_map();
>  395.678938: funcgraph_entry:    1.380 us  |          intel_iommu_map();
>  395.678939: funcgraph_entry:    1.365 us  |          intel_iommu_map();
>  395.678941: funcgraph_entry:    1.370 us  |          intel_iommu_map();
>  395.678943: funcgraph_entry:    1.365 us  |          intel_iommu_map();
>  395.678945: funcgraph_entry:    1.482 us  |          intel_iommu_map();
>  395.678946: funcgraph_exit:   + 50.753 us |        }
>  395.678947: funcgraph_exit:   + 51.348 us |      }
>  395.678947: funcgraph_exit:   + 57.975 us |    }
>  395.678948: funcgraph_exit:   + 58.700 us |  }
>  395.678953: xprtrdma_mr_map: task:64127@5 mr.id=104 nents=30 122880@0xc5e467fde9380000:0xc0011103 (TO_DEVICE)
>  395.678953: xprtrdma_chunk_read: task:64127@5 pos=148 122880@0xc5e467fde9380000:0xc0011103 (more)
>
>
> --
> Chuck Lever