linux-kernel.vger.kernel.org archive mirror
* [PATCH] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
@ 2019-03-02 15:11 Jan Stancek
  2019-03-02 17:10 ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: Jan Stancek @ 2019-03-02 15:11 UTC (permalink / raw)
  To: linux-mm, akpm, willy, peterz, riel, mhocko, ying.huang,
	jrdr.linux, jglisse, aneesh.kumar, david, aarcange, raquini,
	rientjes, kirill, mgorman, jstancek
  Cc: linux-kernel

LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
This is a stress test in which one thread mmaps, writes to, and munmaps
a memory area while another thread tries to read from it:

  CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
  Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
  Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
  Call Trace:
  ([<0000000000000000>]           (null))
   [<00000000001adae4>] lock_acquire+0xec/0x258
   [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
   [<000000000012a780>] page_table_free+0x48/0x1a8
   [<00000000002f6e54>] do_fault+0xdc/0x670
   [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
   [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
   [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
   [<000000000080e5ee>] pgm_check_handler+0x19e/0x200

page_table_free() is called with a NULL mm parameter, but because
"0" is a valid address on s390 (see S390_lowcore), it keeps
going until it eventually crashes in lockdep's lock_acquire.
This crash has been reproducible since at least 4.14.

The problem is that "vmf->vma" used in do_fault() can become stale.
Because mmap_sem may be released, other threads can come in and
call munmap(), causing "vma" to be returned to the kmem cache, then
zeroed/re-initialized and re-used:

handle_mm_fault                           |
  __handle_mm_fault                       |
    do_fault                              |
      vma = vmf->vma                      |
      do_read_fault                       |
        __do_fault                        |
          vma->vm_ops->fault(vmf);        |
            mmap_sem is released          |
                                          |
                                          | do_munmap()
                                          |   remove_vma_list()
                                          |     remove_vma()
                                          |       vm_area_free()
                                          |         # vma is released
                                          | ...
                                          | # same vma is allocated
                                          | # from kmem cache
                                          | do_mmap()
                                          |   vm_area_alloc()
                                          |     memset(vma, 0, ...)
                                          |
      pte_free(vma->vm_mm, ...);          |
        page_table_free                   |
          spin_lock_bh(&mm->context.lock);|
            <crash>                       |

This patch pins the mm_struct and caches its pointer, to avoid using
a potentially stale "vma" when calling pte_free().

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

Signed-off-by: Jan Stancek <jstancek@redhat.com>
---
 mm/memory.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..1287ee9acbdc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3517,12 +3517,17 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
  * but allow concurrent faults).
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
+ * If mmap_sem is released, vma may become invalid (for example
+ * by other thread calling munmap()).
  */
 static vm_fault_t do_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);
 	vm_fault_t ret;
 
+	mmgrab(vm_mm);
+
 	/*
 	 * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
 	 */
@@ -3561,9 +3566,12 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 
 	/* preallocated pagetable is unused: free it */
 	if (vmf->prealloc_pte) {
-		pte_free(vma->vm_mm, vmf->prealloc_pte);
+		pte_free(vm_mm, vmf->prealloc_pte);
 		vmf->prealloc_pte = NULL;
 	}
+
+	mmdrop(vm_mm);
+
 	return ret;
 }
 
-- 
1.8.3.1
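
The race described in the commit message can be modeled in user space.
The following is a minimal single-threaded sketch, not kernel code:
struct mm_model, struct vma_model and all function names are hypothetical
stand-ins, and the concurrent munmap()/mmap() recycling of the vma is
approximated by zeroing the object in place, so the example avoids a
real use-after-free.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for mm_struct and vm_area_struct. */
struct mm_model { int id; };
struct vma_model { struct mm_model *vm_mm; };

/*
 * Models a ->fault() handler that releases the lock. While the lock is
 * dropped, another thread may unmap the area and the same slab object may
 * be handed out again and zeroed. Here that is approximated by zeroing
 * the object in place.
 */
void fault_handler_drops_lock(struct vma_model *vma)
{
	memset(vma, 0, sizeof(*vma));
}

/* Pattern before the patch: vma is dereferenced after the handler ran. */
struct mm_model *mm_for_cleanup_stale(struct vma_model *vma)
{
	fault_handler_drops_lock(vma);
	return vma->vm_mm;	/* reads the recycled object: NULL */
}

/* Pattern after the patch: the pointer is cached before the handler runs. */
struct mm_model *mm_for_cleanup_cached(struct vma_model *vma)
{
	struct mm_model *vm_mm = vma->vm_mm;

	fault_handler_drops_lock(vma);
	return vm_mm;		/* still the original mm */
}
```

In this model the first function hands NULL to the cleanup step, which
corresponds to the NULL mm seen by page_table_free() in the trace, while
the second one keeps returning the mm that was valid when the fault began.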



* Re: [PATCH] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-02 15:11 [PATCH] mm/memory.c: do_fault: avoid usage of stale vm_area_struct Jan Stancek
@ 2019-03-02 17:10 ` Matthew Wilcox
  2019-03-02 18:00   ` Jan Stancek
  2019-03-02 18:19   ` [PATCH v2] " Jan Stancek
  0 siblings, 2 replies; 12+ messages in thread
From: Matthew Wilcox @ 2019-03-02 17:10 UTC (permalink / raw)
  To: Jan Stancek
  Cc: linux-mm, akpm, peterz, riel, mhocko, ying.huang, jrdr.linux,
	jglisse, aneesh.kumar, david, aarcange, raquini, rientjes,
	kirill, mgorman, linux-kernel

On Sat, Mar 02, 2019 at 04:11:26PM +0100, Jan Stancek wrote:
> Problem is that "vmf->vma" used in do_fault() can become stale.
> Because mmap_sem may be released, other threads can come in,
> call munmap() and cause "vma" be returned to kmem cache, and
> get zeroed/re-initialized and re-used:

> This patch pins mm_struct and stores its value, to avoid using
> potentially stale "vma" when calling pte_free().

OK, we need to cache the mm_struct, but why do we need the extra atomic op?
There's surely no way the mm can be freed while the thread is in the middle
of handling a fault.

ie I would drop these lines:

> +	mmgrab(vm_mm);
> +
...
> +
> +	mmdrop(vm_mm);
> +
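
The argument above can be illustrated with a simplified reference-count
model. This is only a sketch under stated assumptions: struct mm_refmodel
and its helpers are hypothetical, and a single counter is used, whereas
the kernel tracks mm_users and mm_count separately. The point shown is
that the task handling the fault already owns a reference, so an extra
grab/drop pair costs two atomic operations without changing the outcome.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-counter model of an mm. */
struct mm_refmodel {
	atomic_int count;	/* references currently held */
	bool freed;		/* set when the last reference is dropped */
};

void mm_refmodel_init(struct mm_refmodel *mm)
{
	atomic_init(&mm->count, 1);	/* reference owned by the running task */
	mm->freed = false;
}

void mm_refmodel_grab(struct mm_refmodel *mm)
{
	atomic_fetch_add(&mm->count, 1);
}

void mm_refmodel_drop(struct mm_refmodel *mm)
{
	/* atomic_fetch_sub() returns the value before the decrement. */
	if (atomic_fetch_sub(&mm->count, 1) == 1)
		mm->freed = true;
}

/* Fault path with the extra pin, as in v1 of the patch. */
bool fault_path_pinned(struct mm_refmodel *mm)
{
	bool usable;

	mm_refmodel_grab(mm);
	usable = !mm->freed;	/* fault handling would happen here */
	mm_refmodel_drop(mm);
	return usable;
}

/* Fault path without the pin: the task's own reference keeps mm alive. */
bool fault_path_unpinned(struct mm_refmodel *mm)
{
	return !mm->freed;
}
```

Both paths report the mm as usable, and the counter is back at 1 after
the pinned variant returns.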


* Re: [PATCH] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-02 17:10 ` Matthew Wilcox
@ 2019-03-02 18:00   ` Jan Stancek
  2019-03-02 18:19   ` [PATCH v2] " Jan Stancek
  1 sibling, 0 replies; 12+ messages in thread
From: Jan Stancek @ 2019-03-02 18:00 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: linux-mm, akpm, peterz, riel, mhocko, ying.huang, jrdr.linux,
	jglisse, aneesh.kumar, david, aarcange, raquini, rientjes,
	kirill, mgorman, linux-kernel



----- Original Message -----
> On Sat, Mar 02, 2019 at 04:11:26PM +0100, Jan Stancek wrote:
> > Problem is that "vmf->vma" used in do_fault() can become stale.
> > Because mmap_sem may be released, other threads can come in,
> > call munmap() and cause "vma" be returned to kmem cache, and
> > get zeroed/re-initialized and re-used:
> 
> > This patch pins mm_struct and stores its value, to avoid using
> > potentially stale "vma" when calling pte_free().
> 
> OK, we need to cache the mm_struct, but why do we need the extra atomic op?
> There's surely no way the mm can be freed while the thread is in the middle
> of handling a fault.

You're right, I was needlessly paranoid.

> 
> ie I would drop these lines:

I'll send v2.

Thanks,
Jan

> 
> > +	mmgrab(vm_mm);
> > +
> ...
> > +
> > +	mmdrop(vm_mm);
> > +
> 


* [PATCH v2] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-02 17:10 ` Matthew Wilcox
  2019-03-02 18:00   ` Jan Stancek
@ 2019-03-02 18:19   ` Jan Stancek
  2019-03-02 18:45     ` Peter Zijlstra
  2019-03-02 18:51     ` Andrea Arcangeli
  1 sibling, 2 replies; 12+ messages in thread
From: Jan Stancek @ 2019-03-02 18:19 UTC (permalink / raw)
  To: linux-mm, akpm, willy, peterz, riel, mhocko, ying.huang,
	jrdr.linux, jglisse, aneesh.kumar, david, aarcange, raquini,
	rientjes, kirill, mgorman, jstancek
  Cc: linux-kernel

LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
This is a stress test in which one thread mmaps, writes to, and munmaps
a memory area while another thread tries to read from it:

  CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
  Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
  Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
  Call Trace:
  ([<0000000000000000>]           (null))
   [<00000000001adae4>] lock_acquire+0xec/0x258
   [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
   [<000000000012a780>] page_table_free+0x48/0x1a8
   [<00000000002f6e54>] do_fault+0xdc/0x670
   [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
   [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
   [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
   [<000000000080e5ee>] pgm_check_handler+0x19e/0x200

page_table_free() is called with a NULL mm parameter, but because
"0" is a valid address on s390 (see S390_lowcore), it keeps
going until it eventually crashes in lockdep's lock_acquire.
This crash has been reproducible since at least 4.14.

The problem is that "vmf->vma" used in do_fault() can become stale.
Because mmap_sem may be released, other threads can come in and
call munmap(), causing "vma" to be returned to the kmem cache, then
zeroed/re-initialized and re-used:

handle_mm_fault                           |
  __handle_mm_fault                       |
    do_fault                              |
      vma = vmf->vma                      |
      do_read_fault                       |
        __do_fault                        |
          vma->vm_ops->fault(vmf);        |
            mmap_sem is released          |
                                          |
                                          | do_munmap()
                                          |   remove_vma_list()
                                          |     remove_vma()
                                          |       vm_area_free()
                                          |         # vma is released
                                          | ...
                                          | # same vma is allocated
                                          | # from kmem cache
                                          | do_mmap()
                                          |   vm_area_alloc()
                                          |     memset(vma, 0, ...)
                                          |
      pte_free(vma->vm_mm, ...);          |
        page_table_free                   |
          spin_lock_bh(&mm->context.lock);|
            <crash>                       |

Cache mm_struct to avoid using potentially stale "vma".

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

Signed-off-by: Jan Stancek <jstancek@redhat.com>
---
 mm/memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..6c1afc1ece50 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3517,10 +3517,13 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
  * but allow concurrent faults).
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
+ * If mmap_sem is released, vma may become invalid (for example
+ * by other thread calling munmap()).
  */
 static vm_fault_t do_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);
 	vm_fault_t ret;
 
 	/*
@@ -3561,7 +3564,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 
 	/* preallocated pagetable is unused: free it */
 	if (vmf->prealloc_pte) {
-		pte_free(vma->vm_mm, vmf->prealloc_pte);
+		pte_free(vm_mm, vmf->prealloc_pte);
 		vmf->prealloc_pte = NULL;
 	}
 	return ret;
-- 
1.8.3.1



* Re: [PATCH v2] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-02 18:19   ` [PATCH v2] " Jan Stancek
@ 2019-03-02 18:45     ` Peter Zijlstra
  2019-03-02 18:51     ` Andrea Arcangeli
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Zijlstra @ 2019-03-02 18:45 UTC (permalink / raw)
  To: Jan Stancek
  Cc: linux-mm, akpm, willy, riel, mhocko, ying.huang, jrdr.linux,
	jglisse, aneesh.kumar, david, aarcange, raquini, rientjes,
	kirill, mgorman, linux-kernel

On Sat, Mar 02, 2019 at 07:19:39PM +0100, Jan Stancek wrote:
>  static vm_fault_t do_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> +	struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);

Would this not need a corresponding WRITE_ONCE() in vma_init() ?


* Re: [PATCH v2] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-02 18:19   ` [PATCH v2] " Jan Stancek
  2019-03-02 18:45     ` Peter Zijlstra
@ 2019-03-02 18:51     ` Andrea Arcangeli
  2019-03-03  7:27       ` Jan Stancek
  2019-03-03  7:28       ` [PATCH v3] " Jan Stancek
  1 sibling, 2 replies; 12+ messages in thread
From: Andrea Arcangeli @ 2019-03-02 18:51 UTC (permalink / raw)
  To: Jan Stancek
  Cc: linux-mm, akpm, willy, peterz, riel, mhocko, ying.huang,
	jrdr.linux, jglisse, aneesh.kumar, david, raquini, rientjes,
	kirill, mgorman, linux-kernel

Hello Jan,

On Sat, Mar 02, 2019 at 07:19:39PM +0100, Jan Stancek wrote:
> +	struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);

vma->vm_mm cannot change under gcc there, so there is no need for
READ_ONCE. The release of mmap_sem has release semantics, so the
vma->vm_mm access cannot be reordered after up_read(mmap_sem) either.

Other than the above detail:

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Thanks,
Andrea


* Re: [PATCH v2] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-02 18:51     ` Andrea Arcangeli
@ 2019-03-03  7:27       ` Jan Stancek
  2019-03-03  7:28       ` [PATCH v3] " Jan Stancek
  1 sibling, 0 replies; 12+ messages in thread
From: Jan Stancek @ 2019-03-03  7:27 UTC (permalink / raw)
  To: Andrea Arcangeli, peterz
  Cc: linux-mm, akpm, willy, riel, mhocko, ying.huang, jrdr.linux,
	jglisse, aneesh.kumar, david, raquini, rientjes, kirill, mgorman,
	linux-kernel



----- Original Message -----
> Hello Jan,
> 
> On Sat, Mar 02, 2019 at 07:19:39PM +0100, Jan Stancek wrote:
> > +	struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);
> 
> The vma->vm_mm cannot change under gcc there, so no need of
> READ_ONCE. The release of mmap_sem has release semantics so the
> vma->vm_mm access cannot be reordered after up_read(mmap_sem) either.
> 
> Other than the above detail:
> 
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Thank you for the review. I dropped READ_ONCE and sent v3 with your
Reviewed-by included. I also successfully re-ran the tests overnight.

> Would this not need a corresponding WRITE_ONCE() in vma_init() ?

There are at least two context switches in between, so I think it wouldn't
matter. My concern was gcc optimizing away the vm_mm local, so that the
vma->vm_mm access would happen only after do_read_fault().
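
The concern about where the load happens can be sketched as follows.
This is an illustration under assumptions, not the kernel's definition:
READ_ONCE_SKETCH is a simplified volatile-cast stand-in for READ_ONCE()
(it relies on the GNU __typeof__ extension), and the struct and function
names are hypothetical. It contrasts a forced single load with a plain
load that is followed by an opaque call through a function pointer.

```c
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct and vm_area_struct. */
struct mm_sketch { int id; };
struct vma_sketch { struct mm_sketch *vm_mm; };

/* Simplified stand-in for READ_ONCE(): a volatile access at this point. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* v2 style: the load is forced to happen before the handler is called. */
struct mm_sketch *snapshot_forced(struct vma_sketch *vma,
				  void (*handler)(struct vma_sketch *))
{
	struct mm_sketch *vm_mm = READ_ONCE_SKETCH(vma->vm_mm);

	handler(vma);
	return vm_mm;
}

/*
 * v3 style: a plain load. The handler is an opaque call that may modify
 * *vma, so the value read before the call is the one that is returned.
 */
struct mm_sketch *snapshot_plain(struct vma_sketch *vma,
				 void (*handler)(struct vma_sketch *))
{
	struct mm_sketch *vm_mm = vma->vm_mm;

	handler(vma);
	return vm_mm;
}

/* Example handler that invalidates the vma, like a recycled object. */
void clobber_vma(struct vma_sketch *vma)
{
	vma->vm_mm = NULL;
}
```

In this single-threaded model both variants return the pointer that was
valid before the handler ran, which matches the review comment that the
plain load is sufficient here.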



* [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-02 18:51     ` Andrea Arcangeli
  2019-03-03  7:27       ` Jan Stancek
@ 2019-03-03  7:28       ` Jan Stancek
  2019-03-03 10:36         ` Matthew Wilcox
                           ` (3 more replies)
  1 sibling, 4 replies; 12+ messages in thread
From: Jan Stancek @ 2019-03-03  7:28 UTC (permalink / raw)
  To: linux-mm, akpm, willy, peterz, riel, mhocko, ying.huang,
	jrdr.linux, jglisse, aneesh.kumar, david, aarcange, raquini,
	rientjes, kirill, mgorman, jstancek
  Cc: linux-kernel

LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
This is a stress test in which one thread mmaps, writes to, and munmaps
a memory area while another thread tries to read from it:

  CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
  Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
  Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
  Call Trace:
  ([<0000000000000000>]           (null))
   [<00000000001adae4>] lock_acquire+0xec/0x258
   [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
   [<000000000012a780>] page_table_free+0x48/0x1a8
   [<00000000002f6e54>] do_fault+0xdc/0x670
   [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
   [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
   [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
   [<000000000080e5ee>] pgm_check_handler+0x19e/0x200

page_table_free() is called with a NULL mm parameter, but because
"0" is a valid address on s390 (see S390_lowcore), it keeps
going until it eventually crashes in lockdep's lock_acquire.
This crash has been reproducible since at least 4.14.

The problem is that "vmf->vma" used in do_fault() can become stale.
Because mmap_sem may be released, other threads can come in and
call munmap(), causing "vma" to be returned to the kmem cache, then
zeroed/re-initialized and re-used:

handle_mm_fault                           |
  __handle_mm_fault                       |
    do_fault                              |
      vma = vmf->vma                      |
      do_read_fault                       |
        __do_fault                        |
          vma->vm_ops->fault(vmf);        |
            mmap_sem is released          |
                                          |
                                          | do_munmap()
                                          |   remove_vma_list()
                                          |     remove_vma()
                                          |       vm_area_free()
                                          |         # vma is released
                                          | ...
                                          | # same vma is allocated
                                          | # from kmem cache
                                          | do_mmap()
                                          |   vm_area_alloc()
                                          |     memset(vma, 0, ...)
                                          |
      pte_free(vma->vm_mm, ...);          |
        page_table_free                   |
          spin_lock_bh(&mm->context.lock);|
            <crash>                       |

Cache mm_struct to avoid using potentially stale "vma".

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

Signed-off-by: Jan Stancek <jstancek@redhat.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/memory.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..e8d69ade5acc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3517,10 +3517,13 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
  * but allow concurrent faults).
  * The mmap_sem may have been released depending on flags and our
  * return value.  See filemap_fault() and __lock_page_or_retry().
+ * If mmap_sem is released, vma may become invalid (for example
+ * by other thread calling munmap()).
  */
 static vm_fault_t do_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *vm_mm = vma->vm_mm;
 	vm_fault_t ret;
 
 	/*
@@ -3561,7 +3564,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
 
 	/* preallocated pagetable is unused: free it */
 	if (vmf->prealloc_pte) {
-		pte_free(vma->vm_mm, vmf->prealloc_pte);
+		pte_free(vm_mm, vmf->prealloc_pte);
 		vmf->prealloc_pte = NULL;
 	}
 	return ret;
-- 
1.8.3.1



* Re: [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-03  7:28       ` [PATCH v3] " Jan Stancek
@ 2019-03-03 10:36         ` Matthew Wilcox
  2019-03-04  0:13         ` Rafael Aquini
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2019-03-03 10:36 UTC (permalink / raw)
  To: Jan Stancek
  Cc: linux-mm, akpm, peterz, riel, mhocko, ying.huang, jrdr.linux,
	jglisse, aneesh.kumar, david, aarcange, raquini, rientjes,
	kirill, mgorman, linux-kernel

On Sun, Mar 03, 2019 at 08:28:04AM +0100, Jan Stancek wrote:
> Cache mm_struct to avoid using potentially stale "vma".
> 
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> 
> Signed-off-by: Jan Stancek <jstancek@redhat.com>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Reviewed-by: Matthew Wilcox <willy@infradead.org>


* Re: [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-03  7:28       ` [PATCH v3] " Jan Stancek
  2019-03-03 10:36         ` Matthew Wilcox
@ 2019-03-04  0:13         ` Rafael Aquini
  2019-03-04  8:10         ` Minchan Kim
  2019-03-04  8:19         ` Kirill A. Shutemov
  3 siblings, 0 replies; 12+ messages in thread
From: Rafael Aquini @ 2019-03-04  0:13 UTC (permalink / raw)
  To: Jan Stancek
  Cc: linux-mm, akpm, willy, peterz, riel, mhocko, ying.huang,
	jrdr.linux, jglisse, aneesh.kumar, david, aarcange, raquini,
	rientjes, kirill, mgorman, linux-kernel

On Sun, Mar 03, 2019 at 08:28:04AM +0100, Jan Stancek wrote:
> LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
> This is a stress test, where one thread mmaps/writes/munmaps memory area
> and other thread is trying to read from it:
> 
>   CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
>   Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
>   Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
>   Call Trace:
>   ([<0000000000000000>]           (null))
>    [<00000000001adae4>] lock_acquire+0xec/0x258
>    [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
>    [<000000000012a780>] page_table_free+0x48/0x1a8
>    [<00000000002f6e54>] do_fault+0xdc/0x670
>    [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
>    [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
>    [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
>    [<000000000080e5ee>] pgm_check_handler+0x19e/0x200
> 
> page_table_free() is called with NULL mm parameter, but because
> "0" is a valid address on s390 (see S390_lowcore), it keeps
> going until it eventually crashes in lockdep's lock_acquire.
> This crash is reproducible at least since 4.14.
> 
> Problem is that "vmf->vma" used in do_fault() can become stale.
> Because mmap_sem may be released, other threads can come in,
> call munmap() and cause "vma" be returned to kmem cache, and
> get zeroed/re-initialized and re-used:
> 
> handle_mm_fault                           |
>   __handle_mm_fault                       |
>     do_fault                              |
>       vma = vmf->vma                      |
>       do_read_fault                       |
>         __do_fault                        |
>           vma->vm_ops->fault(vmf);        |
>             mmap_sem is released          |
>                                           |
>                                           | do_munmap()
>                                           |   remove_vma_list()
>                                           |     remove_vma()
>                                           |       vm_area_free()
>                                           |         # vma is released
>                                           | ...
>                                           | # same vma is allocated
>                                           | # from kmem cache
>                                           | do_mmap()
>                                           |   vm_area_alloc()
>                                           |     memset(vma, 0, ...)
>                                           |
>       pte_free(vma->vm_mm, ...);          |
>         page_table_free                   |
>           spin_lock_bh(&mm->context.lock);|
>             <crash>                       |
> 
> Cache mm_struct to avoid using potentially stale "vma".
> 
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> 
> Signed-off-by: Jan Stancek <jstancek@redhat.com>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/memory.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e11ca9dd823f..e8d69ade5acc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3517,10 +3517,13 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
>   * but allow concurrent faults).
>   * The mmap_sem may have been released depending on flags and our
>   * return value.  See filemap_fault() and __lock_page_or_retry().
> + * If mmap_sem is released, vma may become invalid (for example
> + * by other thread calling munmap()).
>   */
>  static vm_fault_t do_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> +	struct mm_struct *vm_mm = vma->vm_mm;
>  	vm_fault_t ret;
>  
>  	/*
> @@ -3561,7 +3564,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
>  
>  	/* preallocated pagetable is unused: free it */
>  	if (vmf->prealloc_pte) {
> -		pte_free(vma->vm_mm, vmf->prealloc_pte);
> +		pte_free(vm_mm, vmf->prealloc_pte);
>  		vmf->prealloc_pte = NULL;
>  	}
>  	return ret;
> -- 
> 1.8.3.1
> 
Acked-by: Rafael Aquini <aquini@redhat.com>


* Re: [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-03  7:28       ` [PATCH v3] " Jan Stancek
  2019-03-03 10:36         ` Matthew Wilcox
  2019-03-04  0:13         ` Rafael Aquini
@ 2019-03-04  8:10         ` Minchan Kim
  2019-03-04  8:19         ` Kirill A. Shutemov
  3 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2019-03-04  8:10 UTC (permalink / raw)
  To: Jan Stancek
  Cc: linux-mm, akpm, willy, peterz, riel, mhocko, ying.huang,
	jrdr.linux, jglisse, aneesh.kumar, david, aarcange, raquini,
	rientjes, kirill, mgorman, linux-kernel

On Sun, Mar 03, 2019 at 08:28:04AM +0100, Jan Stancek wrote:
> LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
> This is a stress test, where one thread mmaps/writes/munmaps memory area
> and other thread is trying to read from it:
> 
>   CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
>   Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
>   Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
>   Call Trace:
>   ([<0000000000000000>]           (null))
>    [<00000000001adae4>] lock_acquire+0xec/0x258
>    [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
>    [<000000000012a780>] page_table_free+0x48/0x1a8
>    [<00000000002f6e54>] do_fault+0xdc/0x670
>    [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
>    [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
>    [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
>    [<000000000080e5ee>] pgm_check_handler+0x19e/0x200
> 
> page_table_free() is called with NULL mm parameter, but because
> "0" is a valid address on s390 (see S390_lowcore), it keeps
> going until it eventually crashes in lockdep's lock_acquire.
> This crash is reproducible at least since 4.14.
> 
> Problem is that "vmf->vma" used in do_fault() can become stale.
> Because mmap_sem may be released, other threads can come in,
> call munmap() and cause "vma" be returned to kmem cache, and
> get zeroed/re-initialized and re-used:
> 
> handle_mm_fault                           |
>   __handle_mm_fault                       |
>     do_fault                              |
>       vma = vmf->vma                      |
>       do_read_fault                       |
>         __do_fault                        |
>           vma->vm_ops->fault(vmf);        |
>             mmap_sem is released          |
>                                           |
>                                           | do_munmap()
>                                           |   remove_vma_list()
>                                           |     remove_vma()
>                                           |       vm_area_free()
>                                           |         # vma is released
>                                           | ...
>                                           | # same vma is allocated
>                                           | # from kmem cache
>                                           | do_mmap()
>                                           |   vm_area_alloc()
>                                           |     memset(vma, 0, ...)
>                                           |
>       pte_free(vma->vm_mm, ...);          |
>         page_table_free                   |
>           spin_lock_bh(&mm->context.lock);|
>             <crash>                       |
> 
> Cache mm_struct to avoid using potentially stale "vma".
> 
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> 
> Signed-off-by: Jan Stancek <jstancek@redhat.com>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Minchan Kim <minchan@kernel.org>

Isn't it -stable material?



* Re: [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct
  2019-03-03  7:28       ` [PATCH v3] " Jan Stancek
                           ` (2 preceding siblings ...)
  2019-03-04  8:10         ` Minchan Kim
@ 2019-03-04  8:19         ` Kirill A. Shutemov
  3 siblings, 0 replies; 12+ messages in thread
From: Kirill A. Shutemov @ 2019-03-04  8:19 UTC (permalink / raw)
  To: Jan Stancek
  Cc: linux-mm, akpm, willy, peterz, riel, mhocko, ying.huang,
	jrdr.linux, jglisse, aneesh.kumar, david, aarcange, raquini,
	rientjes, mgorman, linux-kernel

On Sun, Mar 03, 2019 at 08:28:04AM +0100, Jan Stancek wrote:
> LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
> This is a stress test, where one thread mmaps/writes/munmaps memory area
> and other thread is trying to read from it:
> 
>   CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
>   Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
>   Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
>   Call Trace:
>   ([<0000000000000000>]           (null))
>    [<00000000001adae4>] lock_acquire+0xec/0x258
>    [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
>    [<000000000012a780>] page_table_free+0x48/0x1a8
>    [<00000000002f6e54>] do_fault+0xdc/0x670
>    [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
>    [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
>    [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
>    [<000000000080e5ee>] pgm_check_handler+0x19e/0x200
> 
> page_table_free() is called with NULL mm parameter, but because
> "0" is a valid address on s390 (see S390_lowcore), it keeps
> going until it eventually crashes in lockdep's lock_acquire.
> This crash is reproducible at least since 4.14.
> 
> Problem is that "vmf->vma" used in do_fault() can become stale.
> Because mmap_sem may be released, other threads can come in,
> call munmap() and cause "vma" be returned to kmem cache, and
> get zeroed/re-initialized and re-used:
> 
> handle_mm_fault                           |
>   __handle_mm_fault                       |
>     do_fault                              |
>       vma = vmf->vma                      |
>       do_read_fault                       |
>         __do_fault                        |
>           vma->vm_ops->fault(vmf);        |
>             mmap_sem is released          |
>                                           |
>                                           | do_munmap()
>                                           |   remove_vma_list()
>                                           |     remove_vma()
>                                           |       vm_area_free()
>                                           |         # vma is released
>                                           | ...
>                                           | # same vma is allocated
>                                           | # from kmem cache
>                                           | do_mmap()
>                                           |   vm_area_alloc()
>                                           |     memset(vma, 0, ...)
>                                           |
>       pte_free(vma->vm_mm, ...);          |
>         page_table_free                   |
>           spin_lock_bh(&mm->context.lock);|
>             <crash>                       |
> 
> Cache mm_struct to avoid using potentially stale "vma".
> 
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
> 
> Signed-off-by: Jan Stancek <jstancek@redhat.com>
> Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

