All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-mm@kvack.org, chuck.lever@oracle.com,
	netdev@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, brouer@redhat.com
Subject: Re: [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster
Date: Thu, 25 Feb 2021 16:16:33 +0100	[thread overview]
Message-ID: <20210225161633.53e5f910@carbon> (raw)
In-Reply-To: <20210225112849.GM3697@techsingularity.net>

On Thu, 25 Feb 2021 11:28:49 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> As a side-node, I didn't pick up the other patches as there is review
> feedback and I didn't have strong opinions either way. Patch 3 is curious
> though, it probably should be split out and sent separetly but still;
> 
> On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote:
> > Avoid multiplication (imul) operations when accessing:
> >  zone->free_area[order].nr_free
> > 
> > This was really tricky to find. I was puzzled why perf reported that
> > rmqueue_bulk was using 44% of the time in an imul operation:
> > 
> >        ???     del_page_from_free_list():
> >  44,54 ??? e2:   imul   $0x58,%rax,%rax
> > 
> > This operation was generated (by compiler) because the struct free_area have
> > size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use
> > and instead choose to use a more expensive imul, to find the offset into the
> > array free_area[].
> > 
> > The patch align struct free_area to a cache-line, which cause the
> > compiler avoid the imul operation. The imul operation is very fast on
> > modern Intel CPUs. To help fast-path that decrement 'nr_free' move the
> > member 'nr_free' to be first element, which saves one 'add' operation.
> > 
> > Looking up instruction latency this exchange a 3-cycle imul with a
> > 1-cycle shl, saving 2-cycles. It does trade some space to do this.
> > 
> > Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
> >   
> 
> I'm having some trouble parsing this and matching it to the patch itself.
> 
> First off, on my system (x86-64), the size of struct free area is 72,
> not 88 bytes. For either size, cache-aligning the structure is a big
> increase in the struct size.

Yes, the increase in size is big. For the struct free_area 40 bytes for
my case and 56 bytes for your case.  The real problem is that this is
multiplied by 11 (MAX_ORDER) and multiplied by number of zone structs
(is it 5?).  Thus, 56*11*5 = 3080 bytes.

Thus, I'm not sure it is worth it!  As I'm only saving 2-cycles, for
something that depends on the compiler generating specific code.  And
the compiler can easily change, and "fix" this on-its-own in a later
release, and then we are just wasting memory.

I did notice this imul happens 45 times in mm/page_alloc.o, with this
offset 0x58, but still this is likely not on hot-path.

> struct free_area {
>         struct list_head           free_list[4];         /*     0    64 */
>         /* --- cacheline 1 boundary (64 bytes) --- */
>         long unsigned int          nr_free;              /*    64     8 */
> 
>         /* size: 72, cachelines: 2, members: 2 */
>         /* last cacheline: 8 bytes */
> };
> 
> Are there other patches in the tree? What does pahole say?

The size of size of struct free_area varies based on some CONFIG
setting, as free_list[] array size is determined by MIGRATE_TYPES,
which on my system is 5, and not 4 as on your system.

  struct list_head	free_list[MIGRATE_TYPES];

CONFIG_CMA and CONFIG_MEMORY_ISOLATION both increase MIGRATE_TYPES with one.
Thus, the array size can vary from 4 to 6.


> With gcc-9, I'm also not seeing the imul instruction outputted like you
> described in rmqueue_pcplist which inlines rmqueue_bulk. At the point
> where it calls get_page_from_free_area, it's using shl for the page list
> operation. This might be a compiler glitch but given that free_area is a
> different size, I'm less certain and wonder if something else is going on.

I think it is the size variation.

> Finally, moving nr_free to the end and cache aligning it will make the
> started of each free_list cache-aligned because of its location in the
> struct zone so what purpose does __pad_to_align_free_list serve?

The purpose of purpose of __pad_to_align_free_list is because struct
list_head is 16 bytes, thus I wanted to align free_list to 16, given we
already have wasted the space.

Notice I added some more detailed notes in[1]:

 [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#micro-optimisations

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


  reply	other threads:[~2021-02-25 15:18 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
2021-02-24 10:26 ` [PATCH 1/3] SUNRPC: Set rq_page_end differently Mel Gorman
2021-02-24 10:26 ` [PATCH 2/3] mm, page_alloc: Add a bulk page allocator Mel Gorman
2021-02-24 10:26 ` [PATCH 3/3] SUNRPC: Refresh rq_pages using " Mel Gorman
2021-02-24 11:27 ` [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Jesper Dangaard Brouer
2021-02-24 11:55   ` Mel Gorman
2021-02-24 13:20 ` Chuck Lever
2021-02-24 18:56 ` [PATCH RFC net-next 0/3] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
2021-02-24 18:56   ` [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
2021-02-24 20:11     ` Ilias Apalodimas
2021-02-24 18:56   ` [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
2021-02-24 20:15     ` Ilias Apalodimas
2021-02-26 14:31       ` Jesper Dangaard Brouer
2021-02-25  0:06     ` kernel test robot
2021-02-24 18:56   ` [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster Jesper Dangaard Brouer
2021-02-25 11:28     ` Mel Gorman
2021-02-25 15:16       ` Jesper Dangaard Brouer [this message]
2021-02-25 15:38         ` Mel Gorman
2021-02-26 14:34           ` Jesper Dangaard Brouer
2021-03-01 13:29 ` [PATCH RFC V2 net-next 0/2] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
2021-03-01 13:29   ` [PATCH RFC V2 net-next 1/2] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
2021-03-01 13:29   ` [PATCH RFC V2 net-next 2/2] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
2021-03-02 17:40     ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210225161633.53e5f910@carbon \
    --to=brouer@redhat.com \
    --cc=chuck.lever@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.