Re: [PATCH v8 74/74] cputlb: queue async flush jobs without the BQL

From: Robert Foley <robert.foley@linaro.org>
To: "Alex Bennée" <alex.bennee@linaro.org>
Cc: Peter Puhov <peter.puhov@linaro.org>,
	"Emilio G. Cota" <cota@braap.org>,
	Richard Henderson <richard.henderson@linaro.org>,
	QEMU Developers <qemu-devel@nongnu.org>
Subject: Re: [PATCH v8 74/74] cputlb: queue async flush jobs without the BQL
Date: Mon, 18 May 2020 09:46:36 -0400
Message-ID: <CAEyhzFuiDWYvu3FZNYy5M0FQ91Cs=-4=kV80xQZHEWX+ejhyTw@mail.gmail.com>
In-Reply-To: <CAEyhzFt1=xDMN5KdQvVx8QyS5n35THa2vY9D3rV8S9emyTYpSw@mail.gmail.com>

We re-ran the numbers with the latest re-based series.

We used an aarch64 Ubuntu VM image. The host CPU was:
Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz, 2 CPUs, 10 cores/CPU,
20 threads/CPU, 40 threads total.

For the bare hardware and KVM tests (first chart) the host CPU was:
HiSilicon 1620 CPU 2600 MHz, 2 CPUs, 64 cores per CPU, 128 cores total.

First, we ran a test of building the kernel in the VM.
We saw neither major improvements nor major regressions.
The chart below shows the speedup of the kernel build on bare hardware,
under KVM, and under QEMU (both the baseline and cpu locks).

                   Speedup vs a single thread for kernel build

  40 +----------------------------------------------------------------------+
     |         +         +         +          +         +         +  **     |
     |                                                bare hardwar********* |
     |                                                          kvm ####### |
  35 |-+                                                   baseline $$$$$$$-|
     |                                                    *cpu lock %%%%%%% |
     |                                                 ***                  |
     |                                               **                     |
  30 |-+                                          ***                     +-|
     |                                         ***                          |
     |                                      ***                             |
     |                                    **                                |
  25 |-+                               ***                                +-|
     |                              ***                                     |
     |                            **                                        |
     |                          **                                          |
  20 |-+                      **                                          +-|
     |                      **                                #########     |
     |                    **                  ################              |
     |                  **          ##########                              |
     |                **         ###                                        |
  15 |-+             *       ####                                         +-|
     |             **     ###                                               |
     |            *    ###                                                  |
     |           *  ###                                                     |
  10 |-+       **###                                                      +-|
     |        *##                                                           |
     |       ##  $$$$$$$$$$$$$$$$                                           |
     |     #$$$$$%%%%%%%%%%%%%%%%%%%%                                       |
   5 |-+  $%%%%%%                    %%%$%$%$%$%$%$%$%$%$%$%$%$%$%$%$%    +-|
     |   %%                                                           %     |
     | %%                                                                   |
     |%        +         +         +          +         +         +         |
   0 +----------------------------------------------------------------------+
     0         10        20        30         40        50        60        70
                                   Guest vCPUs
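
For anyone who wants to reproduce a chart like the one above, here is a
minimal sketch of the speedup calculation. It assumes speedup is the
single-vCPU build time divided by the N-vCPU build time. The timings in
build_seconds are placeholder values for illustration only, not
measurements from this thread, and the function name is hypothetical.

```python
# Placeholder wall-clock build times in seconds, keyed by guest vCPU count.
# These are illustrative values only, NOT measurements from this thread.
build_seconds = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 170.0}


def speedup_vs_single(times):
    """Return {vcpus: T(1) / T(vcpus)} for each measured vCPU count."""
    base = times[1]  # the single-vCPU run is the reference point
    return {n: base / t for n, t in sorted(times.items())}


speedups = speedup_vs_single(build_seconds)
for n, s in speedups.items():
    # Efficiency is speedup divided by vCPU count (1.0 would be ideal scaling).
    print(f"{n:3d} vCPUs: speedup {s:.2f}x, efficiency {s / n:.0%}")
```

Dividing the speedup by the vCPU count gives a per-vCPU efficiency, which
makes it easier to see where a curve flattens out.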

After seeing these results and the scaling limits inherent in the build itself,
we decided to run a test that might show the scaling improvements more clearly.
So we chose Unix bench.

               Unix bench result (higher is better) vs number vCPUs.

  3000 +--------------------------------------------------------------------+
       |      +      +      +      +      +     +      +      +      +      |
       |                                                   baseline ******* |
       |             #                                     cpu lock ####### |
       |           ##*#                                                     |
  2500 |-+        #** *#                                                  +-|
       |          #    *#                                                   |
       |         #*    *#                                                   |
       |         #      *#                                                  |
       |        #*       #                                                  |
       |        #        *#                                                 |
  2000 |-+     #*         #                                               +-|
       |       #          *#                                                |
       |      #*           *#                                               |
       |      #             *####                                           |
       |     #*             *    ###                                        |
  1500 |-+   #               ***    ##                                    +-|
       |     #                  *     ##                                    |
       |    #                    *      ###                                 |
       |    #                     **       ##                               |
       |    #                       *        ###                            |
       |   #                         *          ##                          |
  1000 |-+ #                          **          #                       +-|
       |  #                             *          ###                      |
       |  #                              **           #                     |
       |  #                                *           #                    |
       | #*                                 *           ##                  |
   500 |-#                                   **           #         #     +-|
       | #                                     *           #      ## #      |
       |#*                                      *           ##   #    #     |
       |#*                                       **            ##      #    |
       |*                                                     #         #   |
       |*     +      +      +      +      +     +  **********************#  |
     0 +--------------------------------------------------------------------+
       0      10     20     30     40     50    60     70     80     90    100
                                    Guest vCPUs

We also ran tests to compare boot times.  This test showed the largest
improvement over the baseline.

              Boot time in seconds (lower is better) vs number vCPUs.

  550 +---------------------------------------------------------------------+
      |      +      +      +      +      +      +      +      +      +   *  |
      |                                                    baseline ******* |
  500 |-+                                                  cpu lock #######-|
      |                                                              *      |
      |                                                             *       |
      |                                                            *        |
  450 |-+                                                        **      #+-|
      |                                                         *       #   |
      |                                            **          *      ##    |
  400 |-+                                         *  **      **      #    +-|
      |                                           *    *   **       #       |
      |                                          *       **       ##        |
  350 |-+                                       *       *        #        +-|
      |                                         *              ##           |
      |                                        *              #             |
  300 |-+                                     *             ##            +-|
      |                                       *            #                |
      |                                      *           ##                 |
      |                                     *           #                   |
  250 |-+                                 **           #                  +-|
      |                                  *           ##                     |
      |                                **           #                       |
  200 |-+                           ***           ##                      +-|
      |                           **           ###                          |
      |                          *         ####                             |
  150 |-+                       *    ######                               +-|
      |                     ****  ###                                       |
      |*                   *    ##                                          |
      |#*                #######                                            |
  100 |-#          ***###                                                 +-|
      | #*     #######                                                      |
      |  ######     +      +      +      +      +      +      +      +      |
   50 +---------------------------------------------------------------------+
      0      10     20     30     40     50     60     70     80     90    100
                                    Guest vCPUs

Pictures are also here:
https://drive.google.com/file/d/1ASg5XyP9hNfN9VysXC3qe5s9QSJlwFAt/view?usp=sharing

We plan to update this commit in the series with the final two results
(Unix bench and boot times).

Regards,
-Rob

On Tue, 12 May 2020 at 15:26, Robert Foley <robert.foley@linaro.org> wrote:
>
> On Tue, 12 May 2020 at 12:27, Alex Bennée <alex.bennee@linaro.org> wrote:
> > Robert Foley <robert.foley@linaro.org> writes:
> >
> > > From: "Emilio G. Cota" <cota@braap.org>
> > >
> > > This yields sizable scalability improvements, as the below results show.
> > >
> > > Host: Two Intel E5-2683 v3 14-core CPUs at 2.00 GHz (Haswell)
> > >
> > > Workload: Ubuntu 18.04 ppc64 compiling the linux kernel with
> > > "make -j N", where N is the number of cores in the guest.
> > >
> > >                       Speedup vs a single thread (higher is better):
> snip
> > >   png: https://imgur.com/zZRvS7q
> >
> > Can we re-run these numbers on the re-based series?
>
> Sure, we will re-run the numbers.
>
> Regards,
> -Rob