From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752412AbbDAV0k (ORCPT <rfc822;w@1wt.eu>);
	Wed, 1 Apr 2015 17:26:40 -0400
Received: from mail.kernel.org ([198.145.29.136]:53459 "EHLO mail.kernel.org"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1751459AbbDAV0h (ORCPT <rfc822;linux-kernel@vger.kernel.org>);
	Wed, 1 Apr 2015 17:26:37 -0400
From: Andy Lutomirski <luto@kernel.org>
To: Ingo Molnar <mingo@kernel.org>, x86@kernel.org,
        linux-kernel@vger.kernel.org
Cc: Borislav Petkov <bp@suse.de>, Denys Vlasenko <dvlasenk@redhat.com>,
        Andy Lutomirski <luto@kernel.org>
Subject: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
Date: Wed,  1 Apr 2015 14:26:34 -0700
Message-Id: <9472f1ca4c19a38ecda45bba9c91b7168135fcfa.1427923514.git.luto@kernel.org>
X-Mailer: git-send-email 2.3.0
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

When I wrote the opportunistic SYSRET code, I missed an important
difference between SYSRET and IRET.  Both instructions are capable
of setting EFLAGS.TF, but they behave differently when doing so.
IRET will not issue a #DB trap after execution when it sets TF This
is critical -- otherwise you'd never be able to make forward
progress when returning to userspace.  SYSRET, on the other hand,
will trap with #DB immediately after returning to CPL3, and the next
instruction will never execute.

This breaks anything that opportunistically SYSRETs to a user
context with TF set.  For example, running this code with TF set and
a SIGTRAP handler loaded never gets past post_nop.

	extern unsigned char post_nop[];
	asm volatile ("pushfq\n\t"
		      "popq %%r11\n\t"
		      "nop\n\t"
		      "post_nop:"
		      : : "c" (post_nop) : "r11");

In my defense, I can't find this documented in the AMD or Intel
manual.

Fix it by using IRET to restore TF.

Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

This affects 4.0-rc as well as -tip.  A full test case lives here:

https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/

It's called single_step_syscall_64.

On Intel systems, the 32-bit version of that test fails for unrelated
reasons, but that's not a regression, and fixing it will be much more
intrusive.

Changes from v1:
 - Remove mention of testl from changelog.
 - Improve comment per Denys' suggestion.

arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 750c6efcb718..537716380959 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -715,7 +715,21 @@ retint_swapgs:		/* return to user-space */
 	cmpq %r11,EFLAGS(%rsp)		/* R11 == RFLAGS */
 	jne opportunistic_sysret_failed
 
-	testq $X86_EFLAGS_RF,%r11	/* sysret can't restore RF */
+	/*
+	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
+	 * restoring TF results in a trap from userspace immediately after
+	 * SYSRET.  This would cause an infinite loop whenever #DB happens
+	 * with register state that satisfies the opportunistic SYSRET
+	 * conditions.  For example, single-stepping this user code:
+	 *
+	 *           movq $stuck_here,%rcx
+	 *           pushfq
+	 *           popq %r11
+	 *   stuck_here:
+	 *
+	 * would never get past stuck_here.
+	 */
+	testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
 	jnz opportunistic_sysret_failed
 
 	/* nothing to check for RSP */
-- 
2.3.0