Re: propagating vmgenid outward and upward

From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: "Laszlo Ersek" <lersek@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>,
	"KVM list" <kvm@vger.kernel.org>,
	"QEMU Developers" <qemu-devel@nongnu.org>,
	linux-hyperv@vger.kernel.org,
	"Linux Crypto Mailing List" <linux-crypto@vger.kernel.org>,
	"Alexander Graf" <graf@amazon.com>,
	"Michael Kelley (LINUX)" <mikelley@microsoft.com>,
	"Greg Kroah-Hartman" <gregkh@linuxfoundation.org>,
	adrian@parity.io, "Daniel P. Berrangé" <berrange@redhat.com>,
	"Dominik Brodowski" <linux@dominikbrodowski.net>,
	"Jann Horn" <jannh@google.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	"Brown, Len" <len.brown@intel.com>, "Pavel Machek" <pavel@ucw.cz>,
	"Linux PM" <linux-pm@vger.kernel.org>,
	"Colm MacCarthaigh" <colmmacc@amazon.com>,
	"Theodore Ts'o" <tytso@mit.edu>, "Arnd Bergmann" <arnd@arndb.de>
Subject: Re: propagating vmgenid outward and upward
Date: Thu, 3 Mar 2022 08:07:16 -0500	[thread overview]
Message-ID: <20220303075426-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <Yh+cB5bWarl8CFN1@zx2c4.com>

On Wed, Mar 02, 2022 at 05:32:07PM +0100, Jason A. Donenfeld wrote:
> Hi Michael,
> 
> On Wed, Mar 02, 2022 at 11:22:46AM -0500, Michael S. Tsirkin wrote:
> > > Because that 16 byte read of vmgenid is not atomic. Let's say you read
> > > the first 8 bytes, and then the VM is forked.
> > 
> > But at this point when VM was forked plaintext key and nonce are all in
> > buffer, and you previously indicated a fork at this point is harmless.
> > You wrote "If it changes _after_ that point of check ... it doesn't
> > matter:"
> 
> Ahhh, fair point. I think you're right.
> 
> Alright, so all we're talking about here is an ordinary 16-byte read,
> and 16 bytes of storage per keypair, and a 16-byte comparison.
> 
> Still seems much worse than just having a single word...
> 
> Jason

Oh I forgot about __int128.

#include <stdio.h>
#include <assert.h>
#include <limits.h>
#include <string.h>

struct lng {
	__int128 l;
};

struct shrt {
	unsigned long s;
};

struct lng l = { 1 };
struct shrt s = { 3 };

static void test1(volatile struct shrt *sp)
{
	if (sp->s != s.s) {
		printf("short mismatch!\n");
		s.s = sp->s;
	}
}
static void test2(volatile struct lng *lp)
{
	if (lp->l != l.l) {
		printf("long mismatch!\n");
		l.l = lp->l;
	}
}

int main(int argc, char **argv)
{
	volatile struct shrt sv = { 4 };
	volatile struct lng lv = { 5 };

	if (argc > 1) {
		printf("test 1\n");
		for (int i = 0; i < 100000000; ++i) 
			test1(&sv);
	} else {
		printf("test 2\n");
		for (int i = 0; i < 100000000; ++i)
			test2(&lv);
	}
	return 0;
}

with that the compiler has an easier time to produce optimal
code, so the difference is smaller.
Note: compiled with
gcc -O2 -mno-sse -mno-sse2 -ggdb bench3.c 

since with sse there's no difference at all.

[mst@tuck ~]$ perf stat -r 100 ./a.out 1 > /dev/null 

 Performance counter stats for './a.out 1' (100 runs):

             94.55 msec task-clock:u              #    0.996 CPUs utilized            ( +-  0.09% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
                52      page-faults:u             #  548.914 /sec                     ( +-  0.21% )
       400,459,851      cycles:u                  #    4.227 GHz                      ( +-  0.03% )
       500,147,935      instructions:u            #    1.25  insn per cycle           ( +-  0.00% )
       200,032,462      branches:u                #    2.112 G/sec                    ( +-  0.00% )
             1,810      branch-misses:u           #    0.00% of all branches          ( +-  0.73% )

         0.0949732 +- 0.0000875 seconds time elapsed  ( +-  0.09% )

[mst@tuck ~]$ 
[mst@tuck ~]$ perf stat -r 100 ./a.out > /dev/null 

 Performance counter stats for './a.out' (100 runs):

            110.19 msec task-clock:u              #    1.136 CPUs utilized            ( +-  0.18% )
                 0      context-switches:u        #    0.000 /sec                   
                 0      cpu-migrations:u          #    0.000 /sec                   
                52      page-faults:u             #  537.743 /sec                     ( +-  0.22% )
       428,518,442      cycles:u                  #    4.431 GHz                      ( +-  0.07% )
       900,147,986      instructions:u            #    2.24  insn per cycle           ( +-  0.00% )
       200,032,505      branches:u                #    2.069 G/sec                    ( +-  0.00% )
             2,139      branch-misses:u           #    0.00% of all branches          ( +-  0.77% )

          0.096956 +- 0.000203 seconds time elapsed  ( +-  0.21% )

-- 
MST