From: Daniel Jordan <daniel.m.jordan@oracle.com>
To: Alexander Duyck <alexander.duyck@gmail.com>
Cc: Josh Triplett <josh@joshtriplett.org>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Herbert Xu <herbert@gondor.apana.org.au>,
	Steffen Klassert <steffen.klassert@secunet.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	Jason Gunthorpe <jgg@ziepe.ca>, Jonathan Corbet <corbet@lwn.net>,
	Kirill Tkhai <ktkhai@virtuozzo.com>,
	Michal Hocko <mhocko@kernel.org>, Pavel Machek <pavel@ucw.cz>,
	Pavel Tatashin <pasha.tatashin@soleen.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Randy Dunlap <rdunlap@infradead.org>,
	Shile Zhang <shile.zhang@linux.alibaba.com>,
	Tejun Heo <tj@kernel.org>, Zi Yan <ziy@nvidia.com>,
	linux-crypto@vger.kernel.org, linux-mm <linux-mm@kvack.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()
Date: Mon, 4 May 2020 21:48:44 -0400	[thread overview]
Message-ID: <20200505014844.ulp4rtih7adtcicm@ca-dmjordan1.us.oracle.com> (raw)
In-Reply-To: <CAKgT0UdBv-Wj98P2wMFGDSihPLKWFsqpu77ZmO+eA51uteZ-Ag@mail.gmail.com>

On Mon, May 04, 2020 at 05:40:19PM -0700, Alexander Duyck wrote:
> On Mon, May 4, 2020 at 4:44 PM Josh Triplett <josh@joshtriplett.org> wrote:
> >
> > On May 4, 2020 3:33:58 PM PDT, Alexander Duyck <alexander.duyck@gmail.com> wrote:
> > >On Thu, Apr 30, 2020 at 1:12 PM Daniel Jordan
> > ><daniel.m.jordan@oracle.com> wrote:
> > >>         /*
> > >> -        * Initialize and free pages in MAX_ORDER sized increments so
> > >> -        * that we can avoid introducing any issues with the buddy
> > >> -        * allocator.
> > >> +        * More CPUs always led to greater speedups on tested systems, up to
> > >> +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
> > >>          */
> > >
> > >I would be curious about your data. That isn't what I have seen in the
> > >past. Typically only up to about 8 or 10 CPUs gives you any benefit,
> > >beyond that I was usually cache/memory bandwidth bound.

On Skylake the speedup kept improving well past 8 or 10 CPUs, though on the other
machines the benefit of using all the CPUs versus half or three quarters of them is
less significant.

Given that the rest of the system is idle at this point, my main concern is
whether other archs regress past a certain thread count.
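
For reference, here's roughly how the series sizes the per-node job, combining the
multithreaded-job interface from patch 4/7 with the caller added in patch 6/7.  This
is reconstructed from the posted patches rather than quoted, so treat the exact field
names and the deferred_page_init_max_threads() helper as my paraphrase:

    static int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
    {
            /*
             * The comment under discussion: the system is otherwise idle at
             * this point, so use every CPU of the node rather than capping
             * at 8 or 10.
             */
            return max_t(int, cpumask_weight(node_cpumask), 1);
    }

    /* In deferred_init_memmap(), once per maximal aligned pfn range (sketch): */
    struct padata_mt_job job = {
            .thread_fn   = deferred_init_memmap_chunk,
            .fn_arg      = zone,
            .start       = spfn,
            .size        = epfn_align - spfn,
            .align       = PAGES_PER_SECTION,
            .min_chunk   = PAGES_PER_SECTION,
            .max_threads = deferred_page_init_max_threads(cpumask_of_node(nid)),
    };

    padata_do_multithreaded(&job);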


    Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
      2 nodes * 26 cores * 2 threads = 104 CPUs
      384G/node = 768G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   4056.7 (  5.5)         --   1763.3 (  4.2)
          (  1)      -2.3%   4153.3 (  2.5)      -5.3%   1861.7 (  5.5)
      12% (  6)      53.8%   2637.7 ( 38.7)     408.7%    346.7 ( 37.5)
      25% ( 13)      62.4%   2497.3 ( 38.5)     739.7%    210.0 ( 41.8)
      37% ( 19)      63.8%   2477.0 ( 19.0)     851.4%    185.3 ( 21.5)
      50% ( 26)      64.1%   2471.7 ( 21.4)     881.4%    179.7 ( 25.8)
      75% ( 39)      65.2%   2455.7 ( 33.2)     990.7%    161.7 ( 29.3)
     100% ( 52)      66.5%   2436.7 (  2.1)    1121.7%    144.3 (  5.9)


    Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
      1 node * 16 cores * 2 threads = 32 CPUs
      192G/node = 192G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1957.3 ( 14.0)         --   1093.7 ( 12.9)
          (  1)       1.4%   1930.7 ( 10.0)       3.8%   1053.3 (  7.6)
      12% (  4)      70.0%   1151.7 (  9.0)     292.5%    278.7 (  0.6)
      25% (  8)      86.2%   1051.0 (  7.8)     514.4%    178.0 (  2.6)
      37% ( 12)      95.1%   1003.3 (  7.6)     672.0%    141.7 (  3.8)
      50% ( 16)      93.0%   1014.3 ( 20.0)     720.2%    133.3 (  3.2)
      75% ( 24)      97.8%    989.3 (  6.7)     765.7%    126.3 (  1.5)
     100% ( 32)      96.5%    996.0 (  7.2)     758.9%    127.3 (  5.1)
    

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
      2 nodes * 18 cores * 2 threads = 72 CPUs
      128G/node = 256G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1666.0 (  3.5)         --    618.0 (  3.5)
          (  1)       1.0%   1649.7 (  1.5)       3.0%    600.0 (  1.0)
      12% (  4)      34.9%   1234.7 ( 21.4)     237.7%    183.0 ( 22.5)
      25% (  9)      42.0%   1173.0 ( 10.0)     417.9%    119.3 (  9.6)
      37% ( 13)      44.4%   1153.7 ( 17.0)     524.2%     99.0 ( 15.6)
      50% ( 18)      44.8%   1150.3 ( 15.5)     534.9%     97.3 ( 16.2)
      75% ( 27)      44.8%   1150.3 (  2.5)     550.5%     95.0 (  5.6)
     100% ( 36)      45.5%   1145.3 (  1.5)     594.4%     89.0 (  1.7)
    

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 8 cores * 2 threads = 16 CPUs
      64G/node = 64G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1029.7 ( 42.3)         --    253.7 (  3.1)
          (  1)       3.4%    995.3 ( 21.4)       4.5%    242.7 (  5.5)
      12% (  2)      16.3%    885.7 ( 24.4)      86.5%    136.0 (  5.2)
      25% (  4)      23.3%    835.0 ( 21.5)     195.0%     86.0 (  1.7)
      37% (  6)      28.0%    804.7 ( 15.7)     249.1%     72.7 (  2.1)
      50% (  8)      26.3%    815.3 ( 11.7)     290.3%     65.0 (  3.5)
      75% ( 12)      30.7%    787.7 (  2.1)     284.3%     66.0 (  3.6)
     100% ( 16)      30.4%    789.3 ( 15.0)     322.8%     60.0 (  5.6)
    
    
    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      16G/node = 16G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    757.7 ( 17.1)         --     57.0 (  0.0)
      25% (  1)      -1.0%    765.3 (  5.5)       3.6%     55.0 (  0.0)
      50% (  2)       4.9%    722.3 ( 21.5)      74.5%     32.7 (  4.6)
      75% (  3)       3.8%    729.7 (  4.9)     119.2%     26.0 (  0.0)
     100% (  4)       6.7%    710.3 ( 15.0)     171.4%     21.0 (  0.0)
    

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      14G/node = 14G memory
    
                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    656.3 (  7.1)         --     57.3 (  1.5)
      25% (  1)       1.8%    644.7 (  3.1)       0.6%     57.0 (  0.0)
      50% (  2)       7.0%    613.7 (  5.1)      68.6%     34.0 (  5.3)
      75% (  3)       7.4%    611.3 (  6.7)     135.6%     24.3 (  0.6)
     100% (  4)       9.4%    599.7 (  5.9)     168.8%     21.3 (  1.2)


> > I've found pretty much linear performance up to memory bandwidth, and on the systems I was testing, I didn't saturate memory bandwidth until about the full number of physical cores. From number of cores up to number of threads, the performance stayed about flat; it didn't get any better or worse.
> 
> That doesn't sound right though based on the numbers you provided. The
> system you had was 192GB spread over 2 nodes with 48thread/24core per
> node, correct? Your numbers went from ~290ms to ~28ms so a 10x
> decrease, that doesn't sound linear when you spread the work over 24
> cores to get there. I agree that the numbers largely stay flat once
> you hit the peak, I have seen similar behavior when I was working on
> the deferred init code previously. One concern I have though is that
> we may end up seeing better performance with a subset of cores instead
> of running all of the cores/threads, especially if features such as
> turbo come into play. In addition we are talking x86 only so far. I
> would be interested in seeing if this has benefits or not for other
> architectures.
> 
> Also what is the penalty that is being paid in order to break up the
> work before-hand and set it up for the parallel work? I would be
> interested in seeing what the cost is on a system with fewer cores per
> node, maybe even down to 1. That would tell us how much additional
> overhead is being added to set things up to run in parallel.

The numbers above include the 1-thread case; the extra overhead appears to be within
the noise.
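
(To put a rough number on that setup cost: my reading of the chunking in patch 4/7 is
the arithmetic below, so the pre-work per node is just a handful of work items and
queue_work() calls.  Paraphrase, not a quote of the patch.)

    /*
     * How padata_do_multithreaded() splits a job, per my reading of 4/7:
     * never more threads than there are chunks of at least min_chunk, and
     * each chunk rounded up to the requested alignment.
     */
    nworks = max(job->size / job->min_chunk, 1ul);
    nworks = min(nworks, (unsigned long)job->max_threads);

    chunk_size = job->size / nworks;
    chunk_size = max(chunk_size, job->min_chunk);
    chunk_size = roundup(chunk_size, job->align);

    /* One padata_work plus one queue_work() per chunk is the whole setup. */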

> If I get
> a chance tomorrow I might try applying the patches and doing some
> testing myself.

If you end up doing that, you might find this helpful:
    https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=patch;h=afc72bf8478b95a1d6d174c269ff3693c60630e0
    
and maybe this:
    https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=patch;h=dff6537eab281e5a9917682c4adf9059c0574223

Thanks for looking this over.

[ By the way, I'm going to be out Tuesday but back the rest of the week. ]
