Subject: Re: [PATCH rfc v6 2/4] page_pool: add interface to manipulate frag
 count in page pool
From: Yunsheng Lin <linyunsheng@huawei.com>
To: Alexander Duyck, Yunsheng Lin
Cc: David Miller, Jakub Kicinski, Russell King - ARM Linux, Marcin Wojtas,
 Salil Mehta, Ilias Apalodimas, Alexei Starovoitov, Daniel Borkmann,
 John Fastabend, Andrew Morton, Peter Zijlstra, Will Deacon, Matthew Wilcox,
 Vlastimil Babka, Peter Xu, Feng Tang, Jason Gunthorpe, Matteo Croce,
 Hugh Dickins, Jonathan Lemon, Alexander Lobakin, Willem de Bruijn,
 Cong Wang, Kevin Hao, Marco Elver, Yonghong Song, Martin KaFai Lau,
 Netdev, LKML, bpf
Date: Tue, 27 Jul 2021 15:54:07 +0800
Message-ID: <75213c28-d586-3dfe-c2a7-738af9dd9864@huawei.com>
References: <1626752145-27266-1-git-send-email-linyunsheng@huawei.com>
 <1626752145-27266-3-git-send-email-linyunsheng@huawei.com>
 <92e68f4e-49a4-568c-a281-2865b54a146e@huawei.com>
 <41283c5f-2f58-7fa7-e8fe-a91207a57353@huawei.com>
 <20210724130709.GA1461@ip-172-31-30-86.us-east-2.compute.internal>

On 2021/7/26 0:49, Alexander Duyck wrote:
> On Sat, Jul 24, 2021 at 6:07 AM Yunsheng Lin wrote:
>>
>> On Fri, Jul 23, 2021 at 09:08:00AM -0700, Alexander Duyck wrote:
>>> On Fri, Jul 23, 2021 at 4:12 AM Yunsheng Lin wrote:
>>>>
>>>> On 2021/7/22 23:18, Alexander Duyck wrote:
>>>>>>>
>>>>>>>> You are right that that may
>>>>>>>> cover up the reference count errors. How about
>>>>>>>> something like below:
>>>>>>>>
>>>>>>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>>>>>                                                           long nr)
>>>>>>>> {
>>>>>>>> #ifdef CONFIG_DEBUG_PAGE_REF
>>>>>>>>         long ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>>>
>>>>>>>>         WARN_ON(ret < 0);
>>>>>>>>
>>>>>>>>         return ret;
>>>>>>>> #else
>>>>>>>>         if (atomic_long_read(&page->pp_frag_count) == nr)
>>>>>>>>                 return 0;
>>>>>>>>
>>>>>>>>         return atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>>> #endif
>>>>>>>> }
>>>>>>>>
>>>>>>>> Or any better suggestion?
>>>>>>>
>>>>>>> So the one thing I might change would be to make it so that you only
>>>>>>> do the atomic_long_read if nr is a constant via __builtin_constant_p.
>>>>>>> That way you would be performing the comparison in
>>>>>>> __page_pool_put_page, and in the cases of freeing or draining the
>>>>>>> page_frags you would be using the atomic_long_sub_return, which should
>>>>>>> be paths where you would not expect it to match or that are slowpath
>>>>>>> anyway.
>>>>>>>
>>>>>>> Also I would keep the WARN_ON in both paths just to be on the safe side.
>>>>>>
>>>>>> If I understand it correctly, we should change it as below, right?
>>>>>>
>>>>>> static inline long page_pool_atomic_sub_frag_count_return(struct page *page,
>>>>>>                                                           long nr)
>>>>>> {
>>>>>>         long ret;
>>>>>>
>>>>>>         /* As suggested by Alexander, atomic_long_read() may cover up the
>>>>>>          * reference count errors, so avoid calling atomic_long_read() in
>>>>>>          * the cases of freeing or draining the page_frags, where we would
>>>>>>          * not expect it to match or that are slowpath anyway.
>>>>>>          */
>>>>>>         if (__builtin_constant_p(nr) &&
>>>>>>             atomic_long_read(&page->pp_frag_count) == nr)
>>>>>>                 return 0;
>>>>>>
>>>>>>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>>>>>>         WARN_ON(ret < 0);
>>>>>>         return ret;
>>>>>> }
>>>>>
>>>>> Yes, that is what I had in mind.
>>>>>
>>>>> One thought I had for a future optimization is that we could look at
>>>>> reducing the count by 1 so that we could essentially combine the
>>>>> non-frag and frag cases. Then instead of testing for 1 we would test
>>>>> for 0 at the start of the function and test for < 0 to decide if we
>>>>> want to free it or not. With that we can essentially reduce the calls
>>>>> to the WARN_ON, since we should only have one case where we actually
>>>>> return a value < 0, and we can then check to see if we overshot -1,
>>>>> which would be the WARN_ON case.
>>>>>
>>>>> With that, a value of 0 instead of 1 would indicate that page frag is
>>>>> not in use for the page *AND/OR* that the page has reached the state
>>>>> where there are no other frags present, so the page can be recycled.
>>>>> In effect it would allow us to mix page frags and no frags within the
>>>>> same pool. The added bonus would be that we could get rid of the check
>>>>> for the PP_FLAG_PAGE_FRAG flag in the __page_pool_put_page function
>>>>> and replace it with a check for PAGE_POOL_DMA_USE_PP_FRAG_COUNT, since
>>>>> we cannot read frag_count in that case.
>>>>
>>>> Let's leave it for a future optimization.
>>>> I am not sure if there is a use case to support both frag pages and
>>>> non-frag pages in the same page pool. If there is, maybe we can use
>>>> "page->pp_frag_count > 0" to indicate that a page is a frag page and
>>>> "page->pp_frag_count == 0" to indicate that a page is a non-frag page,
>>>> so that we can support frag pages and non-frag pages in the same page
>>>> pool instead of disabling non-frag page support when the
>>>> PP_FLAG_PAGE_FRAG flag is set, which might conflict with the above
>>>> optimization?
>>>
>>> As far as use cases, I can see a number of potential ones. For example,
>>> in the case of drivers that do something like a header/data split, I
>>> could see potentially having the header pages be frags while the data
>>> pages are whole 4K blocks. Basically the big optimization of the
>>> count == 1/0/nr case is that you aren't increasing/decreasing the count
>>> and the page is immediately being recycled/reused. So in such a case
>>> being able to add a frag count to some pages, and not to others, would
>>> likely be quite useful.
>>
>> I am not sure how the header/data split is implemented in hw, but it
>> seems the driver is not able to tell which desc will be filled with
>> header or data in advance, so it might need to allocate a 4K block for
>> every desc?
>
> It all depends on the hardware config. In theory you could have
> anything from a single use for a page to multiple uses for a page in
> the case of headers and/or packets being small. The overhead for
> adding/removing the frag count could end up being more than what is
> needed if the page is only used once. That is why I was thinking it
> might make sense to allow both to coexist in the same pool.

I agree that there may be use cases for using both frag pages and
non-frag pages in the same page pool. Let's leave it for now.

>
>>>
>>> Basically by shifting the pool values by 1 you can have both in the
>>> same pool with little issue. However the big change is that instead of
>>> testing for count == nr it would end up being pp_frag_count == nr - 1.
>>> So in the case of the standard page pool pages being freed, or the
>>> last frag, you would be looking at pp_frag_count == 0. In addition we
>>> can mask the WARN_ON overhead, as you would be using -1 as the point
>>> to free, so you would only have to perform the WARN_ON check for the
>>> last frag instead of every frag.
>>
>> Yes, it seems doable.
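To make sure I follow the shift-by-one scheme, an untested sketch of my
understanding of it is below. The helper name is made up here, and the
assumption is that the caller recycles the page whenever the return
value is <= 0:

/* Untested sketch of the shift-by-one idea above: pp_frag_count is
 * biased down by one, so 0 means "non-frag page or last frag", and
 * only a negative return can indicate an underflow bug.
 */
static inline long page_pool_sub_frag_count_return(struct page *page,
                                                   long nr)
{
        long ret;

        /* Covers both the non-frag page and the last frag: the page
         * can be recycled directly without touching the atomic count.
         */
        if (atomic_long_read(&page->pp_frag_count) == 0)
                return 0;

        ret = atomic_long_sub_return(nr, &page->pp_frag_count);

        /* With the bias, only the last put can go negative, so the
         * underflow check runs once per page instead of once per
         * frag: anything below -1 means an over-decrement somewhere
         * (assuming nr == 1 on the last-frag path).
         */
        WARN_ON(ret < -1);
        return ret;
}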
>>
>>>> Also, I am prototyping the tx recycling based on page pool in order
>>>> to see if there is any value in supporting tx recycling.
>>>
>>> Just to clarify here, when you say Tx recycling you are talking about
>>> socket to netdev, correct? Just want to be certain, since the netdev
>>> to netdev case should already have recycling for page pool pages as
>>> long as it follows a 1<->1 path.
>>
>> Yes, the above Tx recycling meant socket to netdev.
>> Also, the above "netdev to netdev" only means XDP for now, not the IP
>> forwarding path in the network stack, right?
>>
>>>> As busypoll has enabled a one-to-one relation between NAPI and sock,
>>>> and there is a one-to-one relation between NAPI and page pool,
>>>> perhaps it makes sense to use the page pool to recycle the tx pages
>>>> too?
>>>>
>>>> There are possibly the below problems when doing that, as far as I
>>>> am aware now:
>>>> 1. busypoll is for rx, and tx may not be using the same queue as rx
>>>> even if they are *technically* the same flow, so I am not sure it is
>>>> ok to use the busypoll infrastructure to get the page pool ptr for a
>>>> specific sock.
>>>>
>>>> 2. There may be multiple socks using the same page pool ptr to
>>>> allocate pages for multiple flows, so we cannot assume the same NAPI
>>>> polling protection as rx, which might mean we can only use the
>>>> recyclable pages from pool->ring under the r->consumer_lock
>>>> protection.
>>>>
>>>> 3. Right now tcp_sendmsg_locked() uses sk_page_frag_refill() to
>>>> refill the page frag for tcp xmit. When implementing a similar
>>>> sk_page_pool_frag_refill() based on page pool, I found that tcp
>>>> coalescing in tcp_mtu_probe() and tcp fragmenting in tso_fragment()
>>>> might mess with the page_ref_count directly.
>>>>
>>>> Those are the problems I am aware of (and I believe there are other
>>>> problems I am not aware of yet), so I am not sure if tcp tx page
>>>> recycling based on page pool is doable or not. I would like to hear
>>>> your opinion about tcp tx recycling support based on page pool
>>>> first, in case it is a dead end to support that.
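To make problem 3 above more concrete, the refill helper I was
prototyping looks roughly like below. This is heavily simplified, it
assumes the page_pool_alloc_frag() added later in this series (with the
signature as I understand it), and it ignores the locking problems from
1 and 2:

/* Rough shape of the prototype: refill the sock's page frag from a
 * page pool instead of alloc_pages(), so the page comes back already
 * DMA-mapped for the device behind the pool.  pfrag->size is treated
 * as the end of the usable window inside the pool page.
 */
static bool sk_page_pool_frag_refill(struct page_pool *pool,
                                     struct page_frag *pfrag,
                                     unsigned int size)
{
        unsigned int offset;

        /* Still room in the current frag page? */
        if (pfrag->page && pfrag->offset + size <= pfrag->size)
                return true;

        pfrag->page = page_pool_alloc_frag(pool, &offset, size,
                                           GFP_KERNEL);
        if (!pfrag->page)
                return false;

        pfrag->offset = offset;
        pfrag->size = offset + size;
        return true;
}

This is exactly where tcp_mtu_probe()/tso_fragment() touching the
page_ref_count directly becomes a problem for the frag counting.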
>>>
>>> I'm honestly not sure there is much there to gain. Last I knew TCP
>>> was using order-3 pages for transmitting, and as a result the
>>> overhead for the pages should already be greatly reduced. In
>>> addition, one of the main reasons for page_pool is the fact that the
>>> device has to DMA map the page, and that can have very high overhead
>>> on systems with an IOMMU.
>>
>> Yes, avoiding the IOMMU overhead is the main gain. And "order-3
>> pages" seems to be disabled by default?
>>
>>>
>>> Rather than trying to reuse the device's page pool it might make
>>> more sense to see if you couldn't have TCP just use some sort of
>>> circular buffer of memory that is directly mapped for the device
>>> that it is going to be transmitting to. Essentially what you would
>>> be doing is creating a pre-mapped page and would need to communicate
>>> that the memory is already mapped for the device you want to send it
>>> to so that it could skip that step.
>>
>> IIUC, sk_page_frag_refill() is already doing a reuse similar to the
>> rx reuse implemented in most drivers, except for the pre-mapping
>> part.
>>
>> And it seems that even if we pre-map the page and communicate that
>> the memory is already mapped to the driver, it is likely that we
>> will not be able to reuse the page when the circular buffer is not
>> big enough or tx completion/tcp ack is not happening quickly enough,
>> which might mean unmapping/deallocating the old circular buffer and
>> allocating/mapping a new circular buffer.
>>
>> Using a page pool we might be able to alleviate the above problem as
>> it does for rx?
>
> I would say that instead of going straight for the page pool it might
> make more sense to see if we can coalesce the DMA mapping of the
> pages first at the socket layer rather than trying to introduce the
> overhead for the page pool. In the case of sockets we already have
> the destructors that are called when the memory is freed, so instead
> of making sockets use page pool it might make more sense to extend
> the socket buffer allocation/freeing to incorporate bulk mapping and
> unmapping of pages to optimize the socket Tx path in the 32K page
> case.
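If I follow the suggestion, the idea on the refill side would be to
keep the mapping across uses, roughly like below. This is pure
guesswork on my side; every name except the dma_map_page() /
dma_mapping_error() API is invented here:

/* Invented structure: an order-3 sk_page_frag page that stays
 * DMA-mapped for one device across multiple sendmsg() calls.
 */
struct sk_dma_frag {
        struct page     *page;
        dma_addr_t      dma_addr;       /* mapping kept across uses */
        struct device   *dev;           /* device the page is mapped for */
};

/* Map once when the frag page is first used for this device; the
 * unmap would only happen when the page is finally retired, e.g.
 * from the skb destructor path as in the sketch at the end of this
 * mail.
 */
static int sk_dma_frag_map(struct sk_dma_frag *frag, struct device *dev)
{
        frag->dma_addr = dma_map_page(dev, frag->page, 0, 32768,
                                      DMA_TO_DEVICE);
        if (dma_mapping_error(dev, frag->dma_addr))
                return -ENOMEM;

        frag->dev = dev;
        return 0;
}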
I was able to get the tx recycling prototype based on page pool
running for some performance tests. The performance improvement is
about +20% (30Gbit -> 38Gbit) for a single-thread iperf tcp flow when
the IOMMU is in strict mode, and CPU usage decreases by about 10% for
four-thread iperf tcp flows at a line speed of 100Gbit when the IOMMU
is in strict mode.

Looking at the prototyping code, I agree that it is a bit
controversial to use the page pool for tx, as the page pool assumes
NAPI polling protection on the allocation side. So I will take a
deeper look at your suggestion above to see how to implement it.

Also, I am assuming the "destructors" means tcp_wfree() for TCP,
right? It seems tcp_wfree() is mainly used to do memory accounting
and free the "struct sock" if necessary. I am not familiar enough
with the socket layer to see how the "destructors" would be helpful
here. Any detailed idea of how to use the "destructors" here?
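To make the question concrete, the most naive shape I can imagine is
something like below, which is probably too simple; everything except
sock_wfree() and skb_shinfo() is an invented name:

/* Probably too naive: instead of unmapping on every free, the
 * destructor returns the still-mapped page to a per-socket cache so
 * the next sendmsg() can reuse the mapping; dma_unmap_page() would
 * only run when the cache itself is torn down.
 */
static void sk_dma_frag_wfree(struct sk_buff *skb)
{
        sk_dma_frag_cache_put(skb->sk, skb_shinfo(skb)->frags,
                              skb_shinfo(skb)->nr_frags);

        sock_wfree(skb);        /* keep the normal wmem accounting */
}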