From: Marcelo Tosatti <mtosatti@redhat.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	torvalds@linux-foundation.org, linux-kernel@vger.kernel.org,
	gleb@kernel.org, kvm@vger.kernel.org,
	Ralf Baechle <ralf@linux-mips.org>,
	luto@kernel.org
Subject: Re: [GIT PULL] First batch of KVM changes for 4.1
Date: Fri, 17 Apr 2015 17:18:41 -0300
Message-ID: <20150417201841.GA31302@amt.cnet>
In-Reply-To: <55316598.908@redhat.com>

On Fri, Apr 17, 2015 at 09:57:12PM +0200, Paolo Bonzini wrote:
> 
> 
> >> From 4eb9d7132e1990c0586f28af3103675416d38974 Mon Sep 17 00:00:00 2001
> >> From: Paolo Bonzini <pbonzini@redhat.com>
> >> Date: Fri, 17 Apr 2015 14:57:34 +0200
> >> Subject: [PATCH] sched: add CONFIG_TASK_MIGRATION_NOTIFIER
> >>
> >> The task migration notifier is only used in x86 paravirt.  Make it
> >> possible to compile it out.
> >>
> >> While at it, move some code around to ensure tmn is filled from CPU
> >> registers.
> >>
> >> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> >> ---
> >>  arch/x86/Kconfig    | 1 +
> >>  init/Kconfig        | 3 +++
> >>  kernel/sched/core.c | 9 ++++++++-
> >>  3 files changed, 12 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> >> index d43e7e1c784b..9af252c8698d 100644
> >> --- a/arch/x86/Kconfig
> >> +++ b/arch/x86/Kconfig
> >> @@ -649,6 +649,7 @@ if HYPERVISOR_GUEST
> >>  
> >>  config PARAVIRT
> >>  	bool "Enable paravirtualization code"
> >> +	select TASK_MIGRATION_NOTIFIER
> >>  	---help---
> >>  	  This changes the kernel so it can modify itself when it is run
> >>  	  under a hypervisor, potentially improving performance significantly
> >> diff --git a/init/Kconfig b/init/Kconfig
> >> index 3b9df1aa35db..891917123338 100644
> >> --- a/init/Kconfig
> >> +++ b/init/Kconfig
> >> @@ -2016,6 +2016,9 @@ source "block/Kconfig"
> >>  config PREEMPT_NOTIFIERS
> >>  	bool
> >>  
> >> +config TASK_MIGRATION_NOTIFIER
> >> +	bool
> >> +
> >>  config PADATA
> >>  	depends on SMP
> >>  	bool
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index f9123a82cbb6..c07a53aa543c 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -1016,12 +1016,14 @@ void check_preempt_curr(struct rq *rq, struct task_struct *p, int flags)
> >>  		rq_clock_skip_update(rq, true);
> >>  }
> >>  
> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
> >>  static ATOMIC_NOTIFIER_HEAD(task_migration_notifier);
> >>  
> >>  void register_task_migration_notifier(struct notifier_block *n)
> >>  {
> >>  	atomic_notifier_chain_register(&task_migration_notifier, n);
> >>  }
> >> +#endif
> >>  
> >>  #ifdef CONFIG_SMP
> >>  void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> >> @@ -1053,18 +1055,23 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
> >>  	trace_sched_migrate_task(p, new_cpu);
> >>  
> >>  	if (task_cpu(p) != new_cpu) {
> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
> >>  		struct task_migration_notifier tmn;
> >> +		int from_cpu = task_cpu(p);
> >> +#endif
> >>  
> >>  		if (p->sched_class->migrate_task_rq)
> >>  			p->sched_class->migrate_task_rq(p, new_cpu);
> >>  		p->se.nr_migrations++;
> >>  		perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
> >>  
> >> +#ifdef CONFIG_TASK_MIGRATION_NOTIFIER
> >>  		tmn.task = p;
> >> -		tmn.from_cpu = task_cpu(p);
> >> +		tmn.from_cpu = from_cpu;
> >>  		tmn.to_cpu = new_cpu;
> >>  
> >>  		atomic_notifier_call_chain(&task_migration_notifier, 0, &tmn);
> >> +#endif
> >>  	}
> >>  
> >>  	__set_task_cpu(p, new_cpu);
> >> -- 
> >> 2.3.5
> > 
> > Paolo, 
> > 
> > Please revert the patch -- can fix properly in the host
> > which also conforms the KVM guest/host documented protocol.
> > 
> > Radim submitted a patch to kvm@ to split 
> > the kvm_write_guest in two with a barrier in between, i think.
> > 
> > I'll review that patch.
> 
> You're thinking of
> http://article.gmane.org/gmane.linux.kernel.stable/129187, but see
> Andy's reply:
> 
> > 
> > I think there are at least two ways that would work:
> > 
> > a) If KVM incremented version as advertised:
> > 
> > cpu = getcpu();
> > pvti = pvti for cpu;
> > 
> > ver1 = pvti->version;
> > check stable bit;
> > rdtsc_barrier, rdtsc, read scale, shift, etc.
> > if (getcpu() != cpu) retry;
> > if (pvti->version != ver1) retry;
> > 
> > I think this is safe because, we're guaranteed that there was an
> > interval (between the two version reads) in which the vcpu we think
> > we're on was running and the kvmclock data was valid and marked
> > stable, and we know that the tsc we read came from that interval.
> > 
> > Note: rdtscp isn't needed. If we're stable, is makes no difference
> > which cpu's tsc we actually read.
> > 
> > b) If version remains buggy but we use this migrations_from hack:
> > 
> > cpu = getcpu();
> > pvti = pvti for cpu;
> > m1 = pvti->migrations_from;
> > barrier();
> > 
> > ver1 = pvti->version;
> > check stable bit;
> > rdtsc_barrier, rdtsc, read scale, shift, etc.
> > if (getcpu() != cpu) retry;
> > if (pvti->version != ver1) retry;  /* probably not really needed */
> > 
> > barrier();
> > if (pvti->migrations_from != m1) retry;
> > 
> > This is just like (a), except that we're using a guest kernel hack to
> > ensure that no one migrated off the vcpu during the version-protected
> > critical section and that we were, in fact, on that vcpu at some point
> > during that critical section.  Once we've ensured that we were on
> > pvti's associated vcpu for the entire time we were reading it, then we
> > are protected by the existing versioning in the host.
> 
> (a) is not going to happen until 4.2, and there are too many buggy hosts
> around so we'd have to define new ABI that lets the guest distinguish a
> buggy host from a fixed one.
> 
> (b) works now, is not invasive, and I still maintain that the cost is
> negligible.  I'm going to run for a while with CONFIG_SCHEDSTATS to see
> how often you have a migration.
> 
> Anyhow if the task migration notifier is reverted we have to disable the
> whole vsyscall support altogether.

The bug this fixes is very rare; I have no memory of a report.

In fact, it's difficult even to create a synthetic reproducer. You need:

1) an update of the kvmclock data structure (happens once every 5 minutes).
2) migration of a task from vcpu1 to vcpu2 and back to vcpu1.
3) a data race between kvm_write_guest (string copy) and 
2 above.

All three at the same time.
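
For reference, here is a minimal userspace sketch of the version-protected
read that option (a) above relies on. It is a model only, not the kernel or
vDSO code: struct pvti_model, host_update(), pvti_try_read(), pvti_read() and
racing_update() are hypothetical names, the scale/shift math is reduced to a
single multiplier, the TSC value is passed in rather than read with rdtsc,
and the getcpu() check is left out. The hook parameter exists only to
simulate a host update landing in the middle of a read.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one per-vcpu time record; NOT the real ABI layout. */
struct pvti_model {
	volatile uint32_t version;  /* odd while an update is in progress */
	volatile uint64_t tsc_base; /* TSC value at the last update */
	volatile uint64_t ns_base;  /* nanoseconds at the last update */
	volatile uint32_t mul;      /* simplified scale: ns per TSC tick */
	volatile bool stable;       /* models the "stable" flag bit */
};

/* Compiler-only barrier; a stand-in for barrier() in the pseudocode. */
static void compiler_barrier(void)
{
	atomic_signal_fence(memory_order_seq_cst);
}

/* Writer side, assuming the version is incremented as advertised. */
static void host_update(struct pvti_model *p, uint64_t tsc, uint64_t ns)
{
	assert((p->version & 1) == 0);
	p->version++;           /* now odd: readers must not trust the data */
	compiler_barrier();
	p->tsc_base = tsc;
	p->ns_base = ns;
	compiler_barrier();
	p->version++;           /* even again: data is consistent */
}

typedef void (*race_hook)(struct pvti_model *p);

/* Test-only hook: an update that lands between the two version reads. */
static void racing_update(struct pvti_model *p)
{
	host_update(p, 200, 5000);
}

/* One read attempt; returns false when the caller has to retry. */
static bool pvti_try_read(struct pvti_model *p, uint64_t tsc_now,
			  race_hook hook, uint64_t *ns)
{
	uint32_t ver1 = p->version;
	uint64_t val;

	compiler_barrier();
	if ((ver1 & 1) || !p->stable)
		return false;
	val = p->ns_base + (tsc_now - p->tsc_base) * p->mul;
	if (hook)
		hook(p);        /* simulate a concurrent host update */
	compiler_barrier();
	if (p->version != ver1)
		return false;   /* data changed underneath us */
	*ns = val;
	return true;
}

/* Retry until a consistent snapshot is observed. */
static uint64_t pvti_read(struct pvti_model *p, uint64_t tsc_now)
{
	uint64_t ns;

	while (!pvti_try_read(p, tsc_now, NULL, &ns))
		;
	return ns;
}
```

The value is only handed back when the version read after the computation
matches the one read before it, so an update that lands in between forces a
retry instead of returning a mix of old and new fields.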


