From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: Jeremy Fitzhardinge <jeremy@goop.org>
Cc: linux-kernel@vger.kernel.org, xen-devel@lists.xensource.com,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, "H. Peter Anvin" <hpa@zytor.com>,
	x86@kernel.org, Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>,
	stable@kernel.org
Subject: Is: [PATCH] x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode. Was: Re: [PATCH] x86/paravirt: Partially revert "remove lazy mode in interrupts"
Date: Mon, 26 Sep 2011 15:34:53 -0400
Message-ID: <20110926193453.GA9717@phenom.oracle.com>
In-Reply-To: <4E80A6BD.3070703@goop.org>

On Mon, Sep 26, 2011 at 09:22:21AM -0700, Jeremy Fitzhardinge wrote:
> On 09/26/2011 06:13 AM, Konrad Rzeszutek Wilk wrote:
> > which has git commit b8bcfe997e46150fedcc3f5b26b846400122fdd9.
> >
> > The unintended consequence of removing the flushing of MMU
> > updates when doing kmap_atomic (or kunmap_atomic) is that we can
> > hit a dereference bug when processing a "fork()" under a heavy loaded
> > machine. Specifically we can hit:
> 
> The patch is all OK, but I wouldn't have headlined it as a "partial
> revert" - the important point is that the pte updates in k(un)map_atomic
> need to be synchronous, regardless of whether we're in lazy_mmu mode.
> 
> The fact that b8bcfe997e4 introduced the problem is interesting to note,
> but only somewhat relevant to the analysis of what's being fixed here.

Good point. How about

>From 09966678dd645b68a422c9bf0223b13e73387302 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 23 Sep 2011 17:02:29 -0400
Subject: [PATCH] x86/paravirt: PTE updates in k(un)map_atomic need to be synchronous, regardless of lazy_mmu mode.

This patch fixes an outstanding issue that has been reported since 2.6.37.
On a heavily loaded machine, processing "fork()" calls could keel over with:

BUG: unable to handle kernel paging request at f573fc8c
IP: [<c01abc54>] swap_count_continued+0x104/0x180
*pdpt = 000000002a3b9027 *pde = 0000000001bed067 *pte = 0000000000000000
Oops: 0000 [#1] SMP
Modules linked in:
Pid: 1638, comm: apache2 Not tainted 3.0.4-linode37 #1
EIP: 0061:[<c01abc54>] EFLAGS: 00210246 CPU: 3
EIP is at swap_count_continued+0x104/0x180
.. snip..
Call Trace:
 [<c01ac222>] ? __swap_duplicate+0xc2/0x160
 [<c01040f7>] ? pte_mfn_to_pfn+0x87/0xe0
 [<c01ac2e4>] ? swap_duplicate+0x14/0x40
 [<c01a0a6b>] ? copy_pte_range+0x45b/0x500
 [<c01a0ca5>] ? copy_page_range+0x195/0x200
 [<c01328c6>] ? dup_mmap+0x1c6/0x2c0
 [<c0132cf8>] ? dup_mm+0xa8/0x130
 [<c013376a>] ? copy_process+0x98a/0xb30
 [<c013395f>] ? do_fork+0x4f/0x280
 [<c01573b3>] ? getnstimeofday+0x43/0x100
 [<c010f770>] ? sys_clone+0x30/0x40
 [<c06c048d>] ? ptregs_clone+0x15/0x48
 [<c06bfb71>] ? syscall_call+0x7/0xb

The problem is that in copy_page_range we turn lazy mode on, and then
in swap_entry_free we call swap_count_continued which ends up in:

         map = kmap_atomic(page, KM_USER0) + offset;

and then later we touch *map.

Since we are running in batched (lazy) mode, the PTE update is queued
instead of being applied, so the kmap_atomic is not done synchronously and
we end up dereferencing an address whose mapping has not been set up yet.

kmap_atomic_prot_pfn already calls 'arch_flush_lazy_mmu_mode', and
doing the same in kmap_atomic_prot and __kunmap_atomic makes the problem
go away.

Interestingly, git commit b8bcfe997e46150fedcc3f5b26b846400122fdd9
removed part of this to fix an interrupt issue - but it went too far
and did not consider this scenario.

CC: Thomas Gleixner <tglx@linutronix.de>
CC: Ingo Molnar <mingo@redhat.com>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: x86@kernel.org
CC: Peter Zijlstra <a.p.zijlstra@chello.nl>
CC: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
CC: stable@kernel.org
[v1: Redid the commit description per Jeremy's apt suggestion]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 arch/x86/mm/highmem_32.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/x86/mm/highmem_32.c b/arch/x86/mm/highmem_32.c
index b499626..f4f29b1 100644
--- a/arch/x86/mm/highmem_32.c
+++ b/arch/x86/mm/highmem_32.c
@@ -45,6 +45,7 @@ void *kmap_atomic_prot(struct page *page, pgprot_t prot)
 	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
 	BUG_ON(!pte_none(*(kmap_pte-idx)));
 	set_pte(kmap_pte-idx, mk_pte(page, prot));
+	arch_flush_lazy_mmu_mode();
 
 	return (void *)vaddr;
 }
@@ -88,6 +89,7 @@ void __kunmap_atomic(void *kvaddr)
 		 */
 		kpte_clear_flush(kmap_pte-idx, vaddr);
 		kmap_atomic_idx_pop();
+		arch_flush_lazy_mmu_mode();
 	}
 #ifdef CONFIG_DEBUG_HIGHMEM
 	else {
-- 
1.7.4.1
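
The failure mode described in the commit message can be modeled in user
space. The sketch below is only an illustration under stated assumptions:
every model_* name is hypothetical and none of this is kernel code. It
assumes that lazy mode queues PTE writes until a flush, which is the
behavior the commit message describes, and shows that a read right after
the "map" step sees an empty entry unless the queue is flushed first.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical user-space model; not the kernel's data structures. */
#define MODEL_SLOTS 4
#define MODEL_QUEUE 8

static int model_pte[MODEL_SLOTS];      /* 0 stands for "not present" */
static bool model_lazy;                 /* true while updates are batched */
static struct { int slot; int value; } model_queue[MODEL_QUEUE];
static int model_queued;

/* Apply every queued update, in order (stand-in for the lazy MMU flush). */
static void model_flush_lazy(void)
{
	for (int i = 0; i < model_queued; i++)
		model_pte[model_queue[i].slot] = model_queue[i].value;
	model_queued = 0;
}

/* Write an entry directly, or queue the write when in lazy mode. */
static void model_set_pte(int slot, int value)
{
	assert(slot >= 0 && slot < MODEL_SLOTS);
	if (!model_lazy) {
		model_pte[slot] = value;
		return;
	}
	if (model_queued == MODEL_QUEUE)
		model_flush_lazy();     /* make room, preserving order */
	model_queue[model_queued].slot = slot;
	model_queue[model_queued].value = value;
	model_queued++;
}

/*
 * Map slot 0 while in lazy mode and return what an immediate
 * dereference would observe. flush_after_set mirrors the call
 * the patch adds after set_pte().
 */
int model_kmap_then_read(bool flush_after_set)
{
	int seen;

	memset(model_pte, 0, sizeof(model_pte));
	model_queued = 0;
	model_lazy = true;              /* batching is active */

	model_set_pte(0, 42);
	if (flush_after_set)
		model_flush_lazy();
	seen = model_pte[0];            /* the "touch *map" step */

	model_lazy = false;
	model_flush_lazy();             /* leaving lazy mode drains the queue */
	return seen;
}
```

In this model, model_kmap_then_read(false) returns 0 because the update
is still sitting in the queue, while model_kmap_then_read(true) returns
42. The real kernel path differs in many details; the model only captures
the ordering problem between the queued update and the first access.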


