All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jesper Dangaard Brouer <brouer@redhat.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: linux-mm@kvack.org, chuck.lever@oracle.com,
	netdev@vger.kernel.org, linux-nfs@vger.kernel.org,
	linux-kernel@vger.kernel.org, brouer@redhat.com
Subject: Re: [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster
Date: Fri, 26 Feb 2021 15:34:35 +0100	[thread overview]
Message-ID: <20210226153435.6708d171@carbon> (raw)
In-Reply-To: <20210225153815.GN3697@techsingularity.net>

On Thu, 25 Feb 2021 15:38:15 +0000
Mel Gorman <mgorman@techsingularity.net> wrote:

> On Thu, Feb 25, 2021 at 04:16:33PM +0100, Jesper Dangaard Brouer wrote:
> > > On Wed, Feb 24, 2021 at 07:56:51PM +0100, Jesper Dangaard Brouer wrote:  
> > > > Avoid multiplication (imul) operations when accessing:
> > > >  zone->free_area[order].nr_free
> > > > 
> > > > This was really tricky to find. I was puzzled why perf reported that
> > > > rmqueue_bulk was using 44% of the time in an imul operation:
> > > > 
> > > >        ???     del_page_from_free_list():
> > > >  44,54 ??? e2:   imul   $0x58,%rax,%rax
> > > > 
> > > > This operation was generated (by compiler) because the struct free_area have
> > > > size 88 bytes or 0x58 hex. The compiler cannot find a shift operation to use
> > > > and instead choose to use a more expensive imul, to find the offset into the
> > > > array free_area[].
> > > > 
> > > > The patch align struct free_area to a cache-line, which cause the
> > > > compiler avoid the imul operation. The imul operation is very fast on
> > > > modern Intel CPUs. To help fast-path that decrement 'nr_free' move the
> > > > member 'nr_free' to be first element, which saves one 'add' operation.
> > > > 
> > > > Looking up instruction latency this exchange a 3-cycle imul with a
> > > > 1-cycle shl, saving 2-cycles. It does trade some space to do this.
> > > > 
> > > > Used: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
> > > >     
> > > 
> > > I'm having some trouble parsing this and matching it to the patch itself.
> > > 
> > > First off, on my system (x86-64), the size of struct free area is 72,
> > > not 88 bytes. For either size, cache-aligning the structure is a big
> > > increase in the struct size.  
> > 
> > Yes, the increase in size is big. For the struct free_area 40 bytes for
> > my case and 56 bytes for your case.  The real problem is that this is
> > multiplied by 11 (MAX_ORDER) and multiplied by number of zone structs
> > (is it 5?).  Thus, 56*11*5 = 3080 bytes.
> > 
> > Thus, I'm not sure it is worth it!  As I'm only saving 2-cycles, for
> > something that depends on the compiler generating specific code.  And
> > the compiler can easily change, and "fix" this on-its-own in a later
> > release, and then we are just wasting memory.
> > 
> > I did notice this imul happens 45 times in mm/page_alloc.o, with this
> > offset 0x58, but still this is likely not on hot-path.
> >   
> 
> Yeah, I'm not convinced it's worth it. The benefit of 2 cycles is small and
> it's config-dependant. While some configurations will benefit, others do
> not but the increased consumption is universal. I think there are better
> ways to save 2 cycles in the page allocator and this seems like a costly
> micro-optimisation.
> 
> > > <SNIP>
> > >
> > > With gcc-9, I'm also not seeing the imul instruction outputted like you
> > > described in rmqueue_pcplist which inlines rmqueue_bulk. At the point
> > > where it calls get_page_from_free_area, it's using shl for the page list
> > > operation. This might be a compiler glitch but given that free_area is a
> > > different size, I'm less certain and wonder if something else is going on.  
> > 
> > I think it is the size variation.
> >   
> 
> Yes.
> 
> > > Finally, moving nr_free to the end and cache aligning it will make the
> > > started of each free_list cache-aligned because of its location in the
> > > struct zone so what purpose does __pad_to_align_free_list serve?  
> > 
> > The purpose of purpose of __pad_to_align_free_list is because struct
> > list_head is 16 bytes, thus I wanted to align free_list to 16, given we
> > already have wasted the space.
> >   
> 
> Ok, that's fair enough but it's also somewhat of a micro-optimisation as
> whether it helps or not depends on the architecture.
> 
> I don't think I'll pick this up, certainly in the context of the bulk
> allocator but it's worth keeping in mind. It's an interesting corner case
> at least.

I fully agree. Lets drop this patch.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


  reply	other threads:[~2021-02-26 14:36 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-02-24 10:26 [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Mel Gorman
2021-02-24 10:26 ` [PATCH 1/3] SUNRPC: Set rq_page_end differently Mel Gorman
2021-02-24 10:26 ` [PATCH 2/3] mm, page_alloc: Add a bulk page allocator Mel Gorman
2021-02-24 10:26 ` [PATCH 3/3] SUNRPC: Refresh rq_pages using " Mel Gorman
2021-02-24 11:27 ` [RFC PATCH 0/3] Introduce a bulk order-0 page allocator for sunrpc Jesper Dangaard Brouer
2021-02-24 11:55   ` Mel Gorman
2021-02-24 13:20 ` Chuck Lever
2021-02-24 18:56 ` [PATCH RFC net-next 0/3] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
2021-02-24 18:56   ` [PATCH RFC net-next 1/3] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
2021-02-24 20:11     ` Ilias Apalodimas
2021-02-24 18:56   ` [PATCH RFC net-next 2/3] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
2021-02-24 20:15     ` Ilias Apalodimas
2021-02-26 14:31       ` Jesper Dangaard Brouer
2021-02-25  0:06     ` kernel test robot
2021-02-24 18:56   ` [PATCH RFC net-next 3/3] mm: make zone->free_area[order] access faster Jesper Dangaard Brouer
2021-02-25 11:28     ` Mel Gorman
2021-02-25 15:16       ` Jesper Dangaard Brouer
2021-02-25 15:38         ` Mel Gorman
2021-02-26 14:34           ` Jesper Dangaard Brouer [this message]
2021-03-01 13:29 ` [PATCH RFC V2 net-next 0/2] Use bulk order-0 page allocator API for page_pool Jesper Dangaard Brouer
2021-03-01 13:29   ` [PATCH RFC V2 net-next 1/2] net: page_pool: refactor dma_map into own function page_pool_dma_map Jesper Dangaard Brouer
2021-03-01 13:29   ` [PATCH RFC V2 net-next 2/2] net: page_pool: use alloc_pages_bulk in refill code path Jesper Dangaard Brouer
2021-03-02 17:40     ` kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210226153435.6708d171@carbon \
    --to=brouer@redhat.com \
    --cc=chuck.lever@oracle.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-nfs@vger.kernel.org \
    --cc=mgorman@techsingularity.net \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.