Subject: Re: [PATCH net] net: avoid 32 x truesize under-estimation for tiny skbs
From: Alexander H Duyck
To: Paolo Abeni, Eric Dumazet, "David S. Miller", Jakub Kicinski
Cc: netdev, Eric Dumazet, Alexander Duyck, "Michael S. Tsirkin", Greg Thelen
Tsirkin" , Greg Thelen Date: Thu, 08 Sep 2022 07:53:32 -0700 In-Reply-To: <498a25e4f7ba4e21d688ca74f335b28cadcb3381.camel@redhat.com> References: <20210113161819.1155526-1-eric.dumazet@gmail.com> <498a25e4f7ba4e21d688ca74f335b28cadcb3381.camel@redhat.com> Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.44.4 (3.44.4-1.fc36) MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org On Thu, 2022-09-08 at 13:00 +0200, Paolo Abeni wrote: > On Wed, 2022-09-07 at 14:36 -0700, Alexander H Duyck wrote: > > On Wed, 2022-09-07 at 22:19 +0200, Paolo Abeni wrote: > > > What outlined above will allow for 10 min size frags in page_order0, = as > > > (SKB_DATA_ALIGN(0) + SKB_DATA_ALIGN(struct skb_shared_info) =3D=3D 38= 4. I'm > > > not sure that anything will allocate such small frags. > > > With a more reasonable GRO_MAX_HEAD, there will be 6 frags per page.= =C2=A0 > >=20 > > That doesn't account for any headroom though.=C2=A0 >=20 > Yes, the 0-size data packet was just a theoretical example to make the > really worst case scenario. >=20 > > Most of the time you have > > to reserve some space for headroom so that if this buffer ends up > > getting routed off somewhere to be tunneled there is room for adding to > > the header. I think the default ends up being NET_SKB_PAD, though many > > NICs use larger values. So adding any data onto that will push you up > > to a minimum of 512 per skb for the first 64B for header data. > >=20 > > With that said it would probably put you in the range of 8 or fewer > > skbs per page assuming at least 1 byte for data: > > 512 =3D SKB_DATA_ALIGN(NET_SKB_PAD + 1) + > > SKB_DATA_ALIGN(struct skb_shared_info) >=20 > In most build GRO_MAX_HEAD packets are even larger (should be 640) Right, which is why I am thinking we may want to default to a 1K slice. > > > The maximum truesize underestimation in both cases will be lower than > > > what we can get with the current code in the worst case (almost 32x > > > AFAICS).=C2=A0 > > >=20 > > > Is the above schema safe enough or should the requested size > > > artificially inflatted to fit at most 4 allocations per page_order0? > > > Am I miss something else? Apart from omitting a good deal of testing = in > > > the above list ;)=20 > >=20 > > If we are working with an order 0 page we may just want to split it up > > into a fixed 1K fragments and not bother with a variable pagecnt bias. > > Doing that we would likely simplify this quite a bit and avoid having > > to do as much page count manipulation which could get expensive if we > > are not getting many uses out of the page. An added advantage is that > > we can get rid of the pagecnt_bias and just work based on the page > > offset. > >=20 > > As such I am not sure the page frag cache would really be that good of > > a fit since we have quite a bit of overhead in terms of maintaining the > > pagecnt_bias which assumes the page is a bit longer lived so the ratio > > of refcnt updates vs pagecnt_bias updates is better. >=20 > I see. With the above schema there will be 4-6 frags per packet. I'm > wild guessing that the pagecnt_bias optimization still give some gain > in that case, but I really shold collect some data points. As I recall one of the big advantages of the 32k page was that we were reducing the atomic ops by nearly half. Essentially we did a page_ref_add at the start and a page_ref_sub_and_test when we were out of space. 
By contrast, a single 4K allocation would be 2 atomic ops per
allocation, whereas we were only averaging 1.13 per 2k slice. With the
Intel NICs I was able to get even closer to 1, since I was able to do
the 2k flip/flop setup and could get up to 64K uses out of a single
page.

Then again, I am not sure how much the atomic ops penalty will matter
for your use case. Where it came into play is that MMIO writes to the
device will block atomic ops until they can be completed, so in a
device driver atomic ops become very expensive and we want to batch
them as much as possible.

> If the pagecnt optimization should be dropped, it would probably be
> more straightforward to use/adapt 'page_frag' for the page_order0
> allocator.

That would make sense. Basically we could get rid of the pagecnt bias
and add the fixed number of slices to the count at allocation, so we
would just need to track the offset to decide when we need to allocate
a new page. In addition, if we are flushing the page when it is
depleted we don't have to mess with the pfmemalloc logic.

> BTW it's quite strange/confusing having two very similar APIs
> (page_frag and page_frag_cache) with very similar names and no
> references between them.

I'm not sure what you are getting at here. There are plenty of
references between them, they just aren't direct. I viewed it as being
more like the get_free_pages() type logic, where you end up allocating
a page but what you get back is an address to the page fragment instead
of the page_frag struct.
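To put the offset-only idea above in rough code form, here is a
userspace toy. The names (refill(), alloc_1k(), struct slice_cache) are
hypothetical, malloc/free stand in for the page allocator, and
pfmemalloc, headroom and alignment are ignored. Note that the toy frees
the old page as soon as the allocator moves on, whereas the real scheme
would rely on the references pre-charged at allocation to keep the page
alive for slices still in flight.

/*
 * Toy model of the fixed 1K slice idea for an order-0 page: no
 * pagecnt_bias tracking, just an offset.  In the kernel, one reference
 * per slice would be charged when the page is grabbed, so the allocator
 * never touches the refcount again until the page is flushed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ORDER0_PAGE	4096
#define SLICE		1024

struct slice_cache {
	unsigned char *page;	/* stands in for page_address(page) */
	unsigned int offset;
};

/* Grab a fresh page; this is where the slice refs would be charged. */
static int refill(struct slice_cache *sc)
{
	sc->page = malloc(ORDER0_PAGE);	/* ~alloc_page(GFP_ATOMIC) */
	if (!sc->page)
		return -1;
	sc->offset = 0;
	return 0;
}

/* Hand out the next 1K slice; flush and refill when the offset runs out. */
static void *alloc_1k(struct slice_cache *sc)
{
	if (!sc->page || sc->offset >= ORDER0_PAGE) {
		free(sc->page);		/* ~drop our remaining references */
		if (refill(sc))
			return NULL;
	}
	void *slice = sc->page + sc->offset;
	sc->offset += SLICE;
	return slice;
}

int main(void)
{
	struct slice_cache sc = { 0 };

	for (int i = 0; i < 10; i++) {
		unsigned char *buf = alloc_1k(&sc);
		if (!buf)
			return 1;
		memset(buf, 0, SLICE);	/* build an skb head here */
		printf("slice %d at offset %u\n", i,
		       (unsigned int)(buf - sc.page));
	}
	free(sc.page);
	return 0;
}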