Re: [Qemu-devel] [PATCH v2 4/4] cputlb: read CPUTLBEntry.addr_write atomically

From: "Emilio G. Cota" <cota@braap.org>
To: qemu-devel@nongnu.org
Cc: "Paolo Bonzini" <pbonzini@redhat.com>,
	"Richard Henderson" <rth@twiddle.net>,
	"Alex Bennée" <alex.bennee@linaro.org>
Subject: Re: [Qemu-devel] [PATCH v2 4/4] cputlb: read CPUTLBEntry.addr_write atomically
Date: Thu, 4 Oct 2018 00:01:47 -0400	[thread overview]
Message-ID: <20181004040147.GA22844@flamenco> (raw)
In-Reply-To: <20181003200454.18384-5-cota@braap.org>

On Wed, Oct 03, 2018 at 16:04:54 -0400, Emilio G. Cota wrote:
> Updates can come from other threads, so readers that do not
> take tlb_lock must use atomic_read to avoid undefined
> behaviour (UB).
> 
> This and the previous commit result in a small performance decrease,
> but this is a fair price for removing UB.
(snip)
> That is, a ~2% slowdown for the aarch64 bootup+shutdown test.

I've run more tests. This slowdown is much more pronounced on
memory-heavy workloads. These are the numbers for SPEC06int:

                                Speedup over master

  1.05 +-+--+----+----+----+----+----+----+---+----+----+----+----+----+--+-+
       |                                 +++  ||      +++                   |
       |tlb-lock-noatomic      +++        |  **|       |+++                 |
       |          +atomic       |  ++++   |  **##      | |                  |
     1 +-+..+++...............++##.***#...|..**|#......**|................+-+
       |    ###     ***++     ***# *+*# +++  **+#  +++ **##                 |
       |    # #     *+*#      *|*# *+*#  ||  ** # **## **|#                 |
       |    # #     * *#+     *+*# * *#  ||  ** # **+#+**|#     +**  ++###  |
  0.95 +-+..#.#.....*.*#......*.*#.*.*#.***#.**.#.**.#.**|#......**##***+#+-+
       |    # #     * *#      * *# * *# *|*# ** # ** # **+#      **+#* * #  |
       |    # #     * *#      * *# * *# *|*# ** # ** # ** #+++++ ** #* * #  |
   0.9 +-+***.#..+++*.*#......*.*#.*.*#.*+*#.**.#.**.#.**.#+**|..**.#*.*.#+-+
       |  * * #***##* *#      * *# * *# * *# ** # ** # ** # **## ** #* * #  |
       |  * * #* *+#* *#   +++* *# * *# * *# ** # ** # ** # **|# ** #* * #  |
       |  * * #* * #* *# ***# * *# * *# *+*# ** # ** # ** # **+# ** #* * #  |
  0.85 +-+*.*.#*.*.#*.*#.*.*#+*.*#.*.*#.*.*#.**.#.**.#.**.#.**.#.**.#*.*.#+-+
       |  * * #* * #* *# * *# * *# * *# * *# ** # ** # ** # ** # ** #* * #  |
       |  * * #* * #* *# * *# * *# * *# * *# ** # ** # ** # ** # ** #* * #  |
       |  * * #* * #* *# * *# * *# * *# * *# ** # ** # ** # ** # ** #* * #  |
   0.8 +-+***##***##***#-***#-***#-***#-***#-**##-**##-**##-**##-**##***##+-+
        401.bzi403.g429445.g456.462.libq464.h471.omn4483.xalancbgeomean

That is, a 5% average slowdown, with a max slowdown of ~14% for
mcf :-(

I'll profile tomorrow and see where the slowdown comes from.
If the lock is the issue, we might be better off shifting
all the work to the cross-vCPU call (e.g. doing a round of
synchronous cross-vCPU calls via run_on_cpu), if the assumption
that those calls are very rare is correct.

		Emilio