linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Alexander Lobakin <alobakin@pm.me>
To: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Alexander Lobakin <alobakin@pm.me>,
	Mel Gorman <mgorman@techsingularity.net>,
	Andrew Morton <akpm@linux-foundation.org>,
	Chuck Lever <chuck.lever@oracle.com>,
	Christoph Hellwig <hch@infradead.org>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux-Net <netdev@vger.kernel.org>, Linux-MM <linux-mm@kvack.org>,
	Linux-NFS <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users
Date: Wed, 17 Mar 2021 22:25:24 +0000	[thread overview]
Message-ID: <20210317222506.1266004-1-alobakin@pm.me> (raw)
In-Reply-To: <20210317181943.1a339b1e@carbon>

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Wed, 17 Mar 2021 18:19:43 +0100

> On Wed, 17 Mar 2021 16:52:32 +0000
> Alexander Lobakin <alobakin@pm.me> wrote:
>
> > From: Jesper Dangaard Brouer <brouer@redhat.com>
> > Date: Wed, 17 Mar 2021 17:38:44 +0100
> >
> > > On Wed, 17 Mar 2021 16:31:07 +0000
> > > Alexander Lobakin <alobakin@pm.me> wrote:
> > >
> > > > From: Mel Gorman <mgorman@techsingularity.net>
> > > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > > >
> > > > Hi there,
> > > >
> > > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > > following git tree
> > > > >
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> > > >
> > > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > > >
> > > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > > GRO + TSO/USO)
> > >
> > > What NIC driver is this?
> >
> > Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> > The NIC itself is basically on-SoC 1G chip.
>
> Hmm, then it is really hard to check if your driver is doing something
> else that could cause this.
>
> Well, can you try to lower the page_pool bulking size, to test the
> theory from Wilcox that we should do smaller bulking to avoid pushing
> cachelines into L2 when walking the LRU list.  You might have to go as
> low as bulk=8 (for N-way associative level of L1 cache).

Turned out it suffered from GCC's decisions.
All of the following was taken on GCC 10.2.0 with -O2 in dotconfig.

vmlinux differences between baseline and this series:

(I used your followup instead of the last patch from the tree)

Function                                     old     new   delta
__rmqueue_pcplist                              -    2024   +2024
__alloc_pages_bulk                             -    1456   +1456
__page_pool_alloc_pages_slow                 284     600    +316
page_pool_dma_map                              -     164    +164
get_page_from_freelist                      5676    3760   -1916

The uninlining of __rmqueue_pcplist() hurts a lot. It slightly slows
down the "regular" page allocator, but makes __alloc_pages_bulk()
much slower than the per-page (in my case at least) due to calling
this function out from the loop.

One possible solution is to mark __rmqueue_pcplist() and
rmqueue_bulk() as __always_inline. Only both and only with
__always_inline, or GCC will emit rmqueue_bulk.constprop and
make the numbers even poorer.
This nearly doubles the size of bulk allocator, but eliminates
all performance hits.

Function                                     old     new   delta
__alloc_pages_bulk                          1456    3512   +2056
get_page_from_freelist                      3760    5744   +1984
find_suitable_fallback.part                    -     160    +160
min_free_kbytes_sysctl_handler                96     128     +32
find_suitable_fallback                       164      28    -136
__rmqueue_pcplist                           2024       -   -2024

Between baseline and this series with __always_inline hints:

Function                                     old     new   delta
__alloc_pages_bulk                             -    3512   +3512
find_suitable_fallback.part                    -     160    +160
get_page_from_freelist                      5676    5744     +68
min_free_kbytes_sysctl_handler                96     128     +32
find_suitable_fallback                       164      28    -136

Another suboptimal place I've found is two functions in Page Pool
code which are marked as 'noinline'.
Maybe there's a reason behind this, but removing the annotations
and additionally marking page_pool_dma_map() as inline simplifies
the object code and in fact improves the performance (+15 Mbps on
my setup):

add/remove: 0/3 grow/shrink: 1/0 up/down: 1024/-1096 (-72)
Function                                     old     new   delta
page_pool_alloc_pages                        100    1124   +1024
page_pool_dma_map                            164       -    -164
page_pool_refill_alloc_cache                 332       -    -332
__page_pool_alloc_pages_slow                 600       -    -600

1124 is a normal size for a hotpath function.
These fragmentation and jumps between page_pool_alloc_pages(),
__page_pool_alloc_pages_slow() and page_pool_refill_alloc_cache()
are really excessive and unhealthy for performance, as well as
page_pool_dma_map() uninlined by GCC.

So the best results I got so far were with these additional changes:
 - mark __rmqueue_pcplist() as __always_inline;
 - mark rmqueue_bulk() as __always_inline;
 - drop 'noinline' from page_pool_refill_alloc_cache();
 - drop 'noinline' from __page_pool_alloc_pages_slow();
 - mark page_pool_dma_map() as inline.

(inlines in C files aren't generally recommended, but well, GCC
 is far from perfect)

> In function: __page_pool_alloc_pages_slow() adjust variable:
>   const int bulk = PP_ALLOC_CACHE_REFILL;

Regarding bulk size, it makes no sense on my machine. I tried
{ 8, 16, 32, 64 } and they differed by 1-2 Mbps max / standard
deviation.
Most of the bulk operations I've seen usually take the value of
16 as a "golden ratio" though.

>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

Thanks,
Al


      reply	other threads:[~2021-03-17 22:26 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-12 15:43 [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Mel Gorman
2021-03-12 15:43 ` [PATCH 1/7] mm/page_alloc: Move gfp_allowed_mask enforcement to prepare_alloc_pages Mel Gorman
2021-03-19 16:11   ` Vlastimil Babka
2021-03-19 17:49     ` Mel Gorman
2021-03-12 15:43 ` [PATCH 2/7] mm/page_alloc: Rename alloced to allocated Mel Gorman
2021-03-19 16:22   ` Vlastimil Babka
2021-03-12 15:43 ` [PATCH 3/7] mm/page_alloc: Add a bulk page allocator Mel Gorman
2021-03-19 18:18   ` Vlastimil Babka
2021-03-22  8:30     ` Mel Gorman
2021-03-12 15:43 ` [PATCH 4/7] SUNRPC: Set rq_page_end differently Mel Gorman
2021-03-12 15:43 ` [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator Mel Gorman
2021-03-12 18:44   ` Alexander Duyck
2021-03-12 19:22     ` Chuck Lever III
2021-03-13 12:59       ` Mel Gorman
2021-03-12 15:43 ` [PATCH 6/7] net: page_pool: refactor dma_map into own function page_pool_dma_map Mel Gorman
2021-03-12 15:43 ` [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path Mel Gorman
2021-03-12 19:44   ` Alexander Duyck
2021-03-12 20:05     ` Ilias Apalodimas
2021-03-15 13:39       ` Jesper Dangaard Brouer
2021-03-13 13:30     ` Mel Gorman
2021-03-15  8:40       ` Jesper Dangaard Brouer
2021-03-15 19:33         ` [PATCH mel-git] Followup: Update [PATCH 7/7] in Mel's series Jesper Dangaard Brouer
2021-03-15 19:33           ` [PATCH mel-git] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
2021-03-17 16:31 ` [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users Alexander Lobakin
2021-03-17 16:38   ` Jesper Dangaard Brouer
2021-03-17 16:52     ` Alexander Lobakin
2021-03-17 17:19       ` Jesper Dangaard Brouer
2021-03-17 22:25         ` Alexander Lobakin [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210317222506.1266004-1-alobakin@pm.me \
    --to=alobakin@pm.me \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.duyck@gmail.com \
    --cc=brouer@redhat.com \
    --cc=chuck.lever@oracle.com \
    --cc=hch@infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=netdev@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).