linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Peter Zijlstra <peterz@infradead.org>
To: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>,
	tglx@linutronix.de, mingo@kernel.org, bigeasy@linutronix.de,
	umgwanakikbuti@gmail.com, paulmck@linux.vnet.ibm.com,
	linux-kernel@vger.kernel.org, kmo@daterainc.com
Subject: Re: [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock
Date: Tue, 22 Mar 2016 14:55:30 +0100	[thread overview]
Message-ID: <20160322135530.GR6344@twins.programming.kicks-ass.net> (raw)
In-Reply-To: <20160322132600.GC3921@osiris>

On Tue, Mar 22, 2016 at 02:26:00PM +0100, Heiko Carstens wrote:
> > Clearly something magical is going on and its not clear.
> 
> The mechanism of our pfault code: if Linux is running as guest, runs a user
> space process and the user space process accesses a page that the host has
> paged out we get a pfault interrupt.
> 
> This allows us, within the guest, to schedule a different process. Without
> this mechanism the host would have to suspend the whole virtual CPU until
> the page has been paged in.
> 
> So when we get such an interrupt then we set the state of the current task
> to uninterruptible and also set the need_resched flag. Both happens within
> interrupt context(!). If we later on want to return to user space we
> recognize the need_resched flag and then call schedule().
> It's not very obvious how this works...

A few lines like the above near that function would go a long while I
think.

And, ah!, you rely on the return to user resched to not be a
preempt_schedule, how very icky :-)

Now, what happens if that task gets a spurious wakeup? Will it take the
fault again, raise the PF int again etc.. ?

> Of course we have a lot of additional fun with the completion interrupt (->
> host signals that a page of a process has been paged in and the process can
> continue to run). This interrupt can arrive on any cpu and, since we have
> virtual cpus, actually appear before the interrupt that signals that a page
> is missing.

Of course :-)

Something like the below perhaps?

---
 arch/s390/mm/fault.c | 44 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 36 insertions(+), 8 deletions(-)

diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index 791a4146052c..52cc8c99e62c 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -629,6 +629,29 @@ void pfault_fini(void)
 static DEFINE_SPINLOCK(pfault_lock);
 static LIST_HEAD(pfault_list);
 
+#define PF_COMPLETE	0x0080
+
+/*
+ * The mechanism of our pfault code: if Linux is running as guest, runs a user
+ * space process and the user space process accesses a page that the host has
+ * paged out we get a pfault interrupt.
+ *
+ * This allows us, within the guest, to schedule a different process. Without
+ * this mechanism the host would have to suspend the whole virtual CPU until
+ * the page has been paged in.
+ *
+ * So when we get such an interrupt then we set the state of the current task
+ * to uninterruptible and also set the need_resched flag. Both happens within
+ * interrupt context(!). If we later on want to return to user space we
+ * recognize the need_resched flag and then call schedule().  It's not very
+ * obvious how this works...
+ *
+ * Of course we have a lot of additional fun with the completion interrupt (->
+ * host signals that a page of a process has been paged in and the process can
+ * continue to run). This interrupt can arrive on any cpu and, since we have
+ * virtual cpus, actually appear before the interrupt that signals that a page
+ * is missing.
+ */
 static void pfault_interrupt(struct ext_code ext_code,
 			     unsigned int param32, unsigned long param64)
 {
@@ -637,14 +660,14 @@ static void pfault_interrupt(struct ext_code ext_code,
 	pid_t pid;
 
 	/*
-	 * Get the external interruption subcode & pfault
-	 * initial/completion signal bit. VM stores this 
-	 * in the 'cpu address' field associated with the
-         * external interrupt. 
+	 * Get the external interruption subcode & pfault initial/completion
+	 * signal bit. VM stores this in the 'cpu address' field associated
+	 * with the external interrupt.
 	 */
 	subcode = ext_code.subcode;
 	if ((subcode & 0xff00) != __SUBCODE_MASK)
 		return;
+
 	inc_irq_stat(IRQEXT_PFL);
 	/* Get the token (= pid of the affected task). */
 	pid = param64 & LPP_PFAULT_PID_MASK;
@@ -655,8 +678,9 @@ static void pfault_interrupt(struct ext_code ext_code,
 	rcu_read_unlock();
 	if (!tsk)
 		return;
+
 	spin_lock(&pfault_lock);
-	if (subcode & 0x0080) {
+	if (subcode & PF_COMPLETE) {
 		/* signal bit is set -> a page has been swapped in by VM */
 		if (tsk->thread.pfault_wait == 1) {
 			/* Initial interrupt was faster than the completion
@@ -683,10 +707,10 @@ static void pfault_interrupt(struct ext_code ext_code,
 		/* signal bit not set -> a real page is missing. */
 		if (WARN_ON_ONCE(tsk != current))
 			goto out;
+
 		if (tsk->thread.pfault_wait == 1) {
 			/* Already on the list with a reference: put to sleep */
-			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
-			set_tsk_need_resched(tsk);
+			goto block;
 		} else if (tsk->thread.pfault_wait == -1) {
 			/* Completion interrupt was faster than the initial
 			 * interrupt (pfault_wait == -1). Set pfault_wait
@@ -701,7 +725,11 @@ static void pfault_interrupt(struct ext_code ext_code,
 			get_task_struct(tsk);
 			tsk->thread.pfault_wait = 1;
 			list_add(&tsk->thread.list, &pfault_list);
-			__set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+block:
+			/* Since this must be a userspace fault, there
+			 * is no kernel task state to trample. Rely on the
+			 * return to userspace schedule() to block */
+			__set_current_state(TASK_UNINTERRUPTIBLE);
 			set_tsk_need_resched(tsk);
 		}
 	}

  reply	other threads:[~2016-03-22 13:55 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-08 18:20 [PATCH -tip 0/3] locking/rtmutex: Another crack at spin on owner Davidlohr Bueso
2016-03-08 18:20 ` [PATCH 1/3] rtmutex: Delete save_state member of struct rt_mutex Davidlohr Bueso
2016-03-08 18:20 ` [PATCH 2/3] rtmutex: Add rt_mutex_init_waiter helper Davidlohr Bueso
2016-03-14 13:16   ` Peter Zijlstra
2016-03-08 18:20 ` [PATCH 3/3] rtmutex: Reduce top-waiter blocking on a lock Davidlohr Bueso
2016-03-14 13:23   ` Peter Zijlstra
2016-03-08 22:05 ` [PATCH 4/3] rtmutex: Avoid barrier in rt_mutex_handle_deadlock Davidlohr Bueso
2016-03-14 13:40   ` Peter Zijlstra
2016-03-21 18:16     ` Davidlohr Bueso
2016-03-22 10:21       ` Peter Zijlstra
2016-03-22 11:32         ` Heiko Carstens
2016-03-22 12:20           ` Peter Zijlstra
2016-03-22 13:26             ` Heiko Carstens
2016-03-22 13:55               ` Peter Zijlstra [this message]
2016-03-22 14:45                 ` Heiko Carstens
2016-03-22 16:41                   ` Peter Zijlstra
2016-03-22 21:46                     ` Heiko Carstens
2016-03-25  2:30         ` Davidlohr Bueso

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160322135530.GR6344@twins.programming.kicks-ass.net \
    --to=peterz@infradead.org \
    --cc=bigeasy@linutronix.de \
    --cc=dave@stgolabs.net \
    --cc=heiko.carstens@de.ibm.com \
    --cc=kmo@daterainc.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=tglx@linutronix.de \
    --cc=umgwanakikbuti@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).