Re: [PATCH v3 1/2] x86/vdso: Move mult and shift into struct vgtod_ts

From: Thomas Gleixner <tglx@linutronix.de>
To: "Sverdlin, Alexander (Nokia - DE/Ulm)" <alexander.sverdlin@nokia.com>
Cc: "x86@kernel.org" <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	"H. Peter Anvin" <hpa@zytor.com>,
	Jason Vas Dias <jason.vas.dias@gmail.com>,
	Andy Lutomirski <luto@kernel.org>
Subject: Re: [PATCH v3 1/2] x86/vdso: Move mult and shift into struct vgtod_ts
Date: Sun, 23 Jun 2019 12:18:26 +0200 (CEST)	[thread overview]
Message-ID: <alpine.DEB.2.21.1906231008170.32342@nanos.tec.linutronix.de> (raw)
In-Reply-To: <20190605144116.28553-2-alexander.sverdlin@nokia.com>

Alexander,

On Wed, 5 Jun 2019, Sverdlin, Alexander (Nokia - DE/Ulm) wrote:

> From: Alexander Sverdlin <alexander.sverdlin@nokia.com>
> 
> This is a preparation for CLOCK_MONOTONIC_RAW vDSO implementation.
> Coincidentally it had a slight performance improvement as well:
> 
> ---- Test code ----
>  #include <errno.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <time.h>
>  #include <unistd.h>
> 
>  #define CLOCK_TYPE CLOCK_MONOTONIC_RAW
>  #define DURATION_SEC 10
> 
> int main(int argc, char **argv)
> {
> 	struct timespec t, end;
> 	unsigned long long cnt = 0;
> 
> 	clock_gettime(CLOCK_TYPE, &end);
> 	end.tv_sec += DURATION_SEC;
> 
> 	do {
> 		clock_gettime(CLOCK_TYPE, &t);
> 		++cnt;
> 	} while (t.tv_sec < end.tv_sec || t.tv_nsec < end.tv_nsec);
> 
> 	dprintf(STDOUT_FILENO, "%llu", cnt);
> 
> 	return EXIT_SUCCESS;
> }
> -------------------
> 
> The results from the above test program:
> 
> Clock                   Before  After   Diff
> -----                   ------  -----   ----
> CLOCK_MONOTONIC         355.5M  359.6M  +1%
> CLOCK_REALTIME          355.5M  359.6M  +1%

Been there done that. But the results of the micro benchmark above have to
be taken with a grain of salt.

Let's look at the cache side effects of this change:

Before:

struct vsyscall_gtod_data {
	unsigned int               seq;                  /*     0     4 */
	int                        vclock_mode;          /*     4     4 */
	u64                        cycle_last;           /*     8     8 */
	u64                        mask;                 /*    16     8 */
	u32                        mult;                 /*    24     4 */
	u32                        shift;                /*    28     4 */
	struct vgtod_ts            basetime[12];         /*    32   192 */
	/* --- cacheline 3 boundary (192 bytes) was 32 bytes ago --- */
	int                        tz_minuteswest;       /*   224     4 */
	int                        tz_dsttime;           /*   228     4 */

	/* size: 232, cachelines: 4, members: 9 */
	/* last cacheline: 40 bytes */
};

After:

struct vsyscall_gtod_data {
	unsigned int               seq;                  /*     0     4 */
	int                        vclock_mode;          /*     4     4 */
	u64                        cycle_last;           /*     8     8 */
	u64                        mask;                 /*    16     8 */
	struct vgtod_ts            basetime[12];         /*    24   288 */
	/* --- cacheline 4 boundary (256 bytes) was 56 bytes ago --- */
	int                        tz_minuteswest;       /*   312     4 */
	int                        tz_dsttime;           /*   316     4 */

	/* size: 320, cachelines: 5, members: 7 */
};

So the interesting change is here:

	struct vgtod_ts            basetime[12];         /*    32   192 */

vs.

	struct vgtod_ts            basetime[12];         /*    24   288 */

In the original version struct vgtod_ts occupies 16 bytes in the modified
version it has 24 bytes. As a consequence:

	CLOCK				Before		After
	data->basetime[CLOCK_REALTIME]	32 .. 47	24 .. 47
	data->basetime[CLOCK_MONOTONIC]	48 .. 63	48 .. 71

That means that the CLOCK_MONOTONIC storage now overlaps into a second
cache line. CLOCK_MONOTONIC is widely used.

Of course on a micro benchmark this does not matter at all, but in real
world use cases which care about cache pressure it does.

Another thing is that just adding some small computation to the benchmark
loop makes the advantage of the modified version go away and it becomes
slightly slower.

On a real world application which is a cache heavy and uses CLOCK MONOTONIC
extensivly I can observe a significant L1 miss increase (%2) with the
modified version.

Now there is something else which is interesting:

On a guest (pvclock) I can observe with your benchmark:

 Before:

	MONO  cnt: 576.7
	REAL  cnt: 576.6

 After: 

 	MONO  cnt: 577.1
	REAL  cnt: 577.0

But on a (different) host:

 Before:

       MONO   cnt: 353.9
       REAL   cnt: 354.4

 After:

       MONO   cnt: 347.9
       REAL   cnt: 347.1

Yuck!

So this is really sensitive and I'm inclined not to do that due to the
cache line side effect.

The alternative solution for this is what Vincenzo has in his unified VDSO
patch series:

  https://lkml.kernel.org/r/20190621095252.32307-1-vincenzo.frascino@arm.com

It leaves the data struct unmodified and has a separate array for the raw
clock. That does not have the side effects at all.

I'm in the process of merging that series and I actually adapted your
scheme to the new unified infrastructure where it has exactly the same
effects as with your original patches against the x86 version.

Thanks,

	tglx