From: Daniel Jordan <daniel.m.jordan@oracle.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	Steffen Klassert <steffen.klassert@secunet.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	Jason Gunthorpe <jgg@ziepe.ca>, Jonathan Corbet <corbet@lwn.net>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Michal Hocko <mhocko@kernel.org>, Pavel Machek <pavel@ucw.cz>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	Shile Zhang <shile.zhang@linux.alibaba.com>,
	Tejun Heo <tj@kernel.org>, Zi Yan <ziy@nvidia.com>,
	linux-crypto@vger.kernel.org, linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()
Date: Mon, 4 May 2020 21:48:44 -0400	[thread overview]
Message-ID: <20200505014844.ulp4rtih7adtcicm@ca-dmjordan1.us.oracle.com> (raw)
In-Reply-To: <CAKgT0UdBv-Wj98P2wMFGDSihPLKWFsqpu77ZmO+eA51uteZ-Ag@mail.gmail.com>

On Mon, May 04, 2020 at 05:40:19PM -0700, Alexander Duyck wrote:
> On Mon, May 4, 2020 at 4:44 PM Josh Triplett <josh@joshtriplett.org> wrote:
> >
> > On May 4, 2020 3:33:58 PM PDT, Alexander Duyck <alexander.duyck@gmail.com> wrote:
> > >On Thu, Apr 30, 2020 at 1:12 PM Daniel Jordan
> > ><daniel.m.jordan@oracle.com> wrote:
> > >>         /*
> > >> -        * Initialize and free pages in MAX_ORDER sized increments so
> > >> -        * that we can avoid introducing any issues with the buddy
> > >> -        * allocator.
> > >> +        * More CPUs always led to greater speedups on tested systems, up to
> > >> +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
> > >>          */
> > >
> > >I would be curious about your data. That isn't what I have seen in the
> > >past. Typically only up to about 8 or 10 CPUs gives you any benefit,
> > >beyond that I was usually cache/memory bandwidth bound.

On Skylake the speedup kept improving well past 8 or 10 CPUs, though on the other
machines the benefit of using all the CPUs versus half or three quarters of them is
less significant.

Given that the rest of the system is idle at this point, my main concern is
whether other archs regress past a certain thread count.
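
For reference, here's roughly how the series sizes the per-node job, combining the
multithreaded-job interface from patch 4/7 with the caller added in patch 6/7.  This
is reconstructed from the posted patches rather than quoted, so treat the exact field
names and the deferred_page_init_max_threads() helper as my paraphrase:

    static int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
    {
            /*
             * The comment under discussion: the system is otherwise idle at
             * this point, so use every CPU of the node rather than capping
             * at 8 or 10.
             */
            return max_t(int, cpumask_weight(node_cpumask), 1);
    }

    /* In deferred_init_memmap(), once per maximal aligned pfn range (sketch): */
    struct padata_mt_job job = {
            .thread_fn   = deferred_init_memmap_chunk,
            .fn_arg      = zone,
            .start       = spfn,
            .size        = epfn_align - spfn,
            .align       = PAGES_PER_SECTION,
            .min_chunk   = PAGES_PER_SECTION,
            .max_threads = deferred_page_init_max_threads(cpumask_of_node(nid)),
    };

    padata_do_multithreaded(&job);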


    Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
      2 nodes * 26 cores * 2 threads = 104 CPUs
      384G/node = 768G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   4056.7 (  5.5)         --   1763.3 (  4.2)
          (  1)      -2.3%   4153.3 (  2.5)      -5.3%   1861.7 (  5.5)
      12% (  6)      53.8%   2637.7 ( 38.7)     408.7%    346.7 ( 37.5)
      25% ( 13)      62.4%   2497.3 ( 38.5)     739.7%    210.0 ( 41.8)
      37% ( 19)      63.8%   2477.0 ( 19.0)     851.4%    185.3 ( 21.5)
      50% ( 26)      64.1%   2471.7 ( 21.4)     881.4%    179.7 ( 25.8)
      75% ( 39)      65.2%   2455.7 ( 33.2)     990.7%    161.7 ( 29.3)
     100% ( 52)      66.5%   2436.7 (  2.1)    1121.7%    144.3 (  5.9)


    Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
      1 node * 16 cores * 2 threads = 32 CPUs
      192G/node = 192G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1957.3 ( 14.0)         --   1093.7 ( 12.9)
          (  1)       1.4%   1930.7 ( 10.0)       3.8%   1053.3 (  7.6)
      12% (  4)      70.0%   1151.7 (  9.0)     292.5%    278.7 (  0.6)
      25% (  8)      86.2%   1051.0 (  7.8)     514.4%    178.0 (  2.6)
      37% ( 12)      95.1%   1003.3 (  7.6)     672.0%    141.7 (  3.8)
      50% ( 16)      93.0%   1014.3 ( 20.0)     720.2%    133.3 (  3.2)
      75% ( 24)      97.8%    989.3 (  6.7)     765.7%    126.3 (  1.5)
     100% ( 32)      96.5%    996.0 (  7.2)     758.9%    127.3 (  5.1)
    

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
      2 nodes * 18 cores * 2 threads = 72 CPUs
      128G/node = 256G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1666.0 (  3.5)         --    618.0 (  3.5)
          (  1)       1.0%   1649.7 (  1.5)       3.0%    600.0 (  1.0)
      12% (  4)      34.9%   1234.7 ( 21.4)     237.7%    183.0 ( 22.5)
      25% (  9)      42.0%   1173.0 ( 10.0)     417.9%    119.3 (  9.6)
      37% ( 13)      44.4%   1153.7 ( 17.0)     524.2%     99.0 ( 15.6)
      50% ( 18)      44.8%   1150.3 ( 15.5)     534.9%     97.3 ( 16.2)
      75% ( 27)      44.8%   1150.3 (  2.5)     550.5%     95.0 (  5.6)
     100% ( 36)      45.5%   1145.3 (  1.5)     594.4%     89.0 (  1.7)
    

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 8 cores * 2 threads = 16 CPUs
      64G/node = 64G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1029.7 ( 42.3)         --    253.7 (  3.1)
          (  1)       3.4%    995.3 ( 21.4)       4.5%    242.7 (  5.5)
      12% (  2)      16.3%    885.7 ( 24.4)      86.5%    136.0 (  5.2)
      25% (  4)      23.3%    835.0 ( 21.5)     195.0%     86.0 (  1.7)
      37% (  6)      28.0%    804.7 ( 15.7)     249.1%     72.7 (  2.1)
      50% (  8)      26.3%    815.3 ( 11.7)     290.3%     65.0 (  3.5)
      75% ( 12)      30.7%    787.7 (  2.1)     284.3%     66.0 (  3.6)
     100% ( 16)      30.4%    789.3 ( 15.0)     322.8%     60.0 (  5.6)
    
    
    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      16G/node = 16G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    757.7 ( 17.1)         --     57.0 (  0.0)
      25% (  1)      -1.0%    765.3 (  5.5)       3.6%     55.0 (  0.0)
      50% (  2)       4.9%    722.3 ( 21.5)      74.5%     32.7 (  4.6)
      75% (  3)       3.8%    729.7 (  4.9)     119.2%     26.0 (  0.0)
     100% (  4)       6.7%    710.3 ( 15.0)     171.4%     21.0 (  0.0)
    

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      14G/node = 14G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    656.3 (  7.1)         --     57.3 (  1.5)
      25% (  1)       1.8%    644.7 (  3.1)       0.6%     57.0 (  0.0)
      50% (  2)       7.0%    613.7 (  5.1)      68.6%     34.0 (  5.3)
      75% (  3)       7.4%    611.3 (  6.7)     135.6%     24.3 (  0.6)
     100% (  4)       9.4%    599.7 (  5.9)     168.8%     21.3 (  1.2)


> > I've found pretty much linear performance up to memory bandwidth, and on the systems I was testing, I didn't saturate memory bandwidth until about the full number of physical cores. From number of cores up to number of threads, the performance stayed about flat; it didn't get any better or worse.
> 
> That doesn't sound right though based on the numbers you provided. The
> system you had was 192GB spread over 2 nodes with 48thread/24core per
> node, correct? Your numbers went from ~290ms to ~28ms so a 10x
> decrease, that doesn't sound linear when you spread the work over 24
> cores to get there. I agree that the numbers largely stay flat once
> you hit the peak, I have seen similar behavior when I was working on
> the deferred init code previously. One concern I have though is that
> we may end up seeing better performance with a subset of cores instead
> of running all of the cores/threads, especially if features such as
> turbo come into play. In addition we are talking x86 only so far. I
> would be interested in seeing if this has benefits or not for other
> architectures.
> 
> Also what is the penalty that is being paid in order to break up the
> work before-hand and set it up for the parallel work? I would be
> interested in seeing what the cost is on a system with fewer cores per
> node, maybe even down to 1. That would tell us how much additional
> overhead is being added to set things up to run in parallel.

The numbers above include the 1-thread case; the extra overhead appears to be within
the noise.
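
(To put a rough number on that setup cost: my reading of the chunking in patch 4/7 is
the arithmetic below, so the pre-work per node is just a handful of work items and
queue_work() calls.  Paraphrase, not a quote of the patch.)

    /*
     * How padata_do_multithreaded() splits a job, per my reading of 4/7:
     * never more threads than there are chunks of at least min_chunk, and
     * each chunk rounded up to the requested alignment.
     */
    nworks = max(job->size / job->min_chunk, 1ul);
    nworks = min(nworks, (unsigned long)job->max_threads);

    chunk_size = job->size / nworks;
    chunk_size = max(chunk_size, job->min_chunk);
    chunk_size = roundup(chunk_size, job->align);

    /* One padata_work plus one queue_work() per chunk is the whole setup. */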

> If I get
> a chance tomorrow I might try applying the patches and doing some
> testing myself.

If you end up doing that, you might find this helpful:
    https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=patch;h=afc72bf8478b95a1d6d174c269ff3693c60630e0
    
and maybe this:
    https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=patch;h=dff6537eab281e5a9917682c4adf9059c0574223

Thanks for looking this over.

[ By the way, I'm going to be out Tuesday but back the rest of the week. ]
