linux-fsdevel.vger.kernel.org archive mirror
* [PATCH] mm: Introduce i_mmap_lock_write_killable().
@ 2018-03-27 11:19 Tetsuo Handa
  2018-03-27 14:52 ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2018-03-27 11:19 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel
  Cc: Tetsuo Handa, Alexander Viro, Andrew Morton, Kirill A. Shutemov,
	Michal Hocko, Rik van Riel

I found that it is not difficult to hit "oom_reaper: unable to reap pid:"
messages if the victim thread is doing copy_process(). If we check where
the OOM victims are stuck, we can find that they are waiting at
i_mmap_lock_write() in dup_mmap().

----------------------------------------
[  239.804758] Out of memory: Kill process 28909 (a.out) score 0 or sacrifice child
[  239.808597] Killed process 31088 (a.out) total-vm:4176kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
[  239.929470] oom_reaper: reaped process 31088 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  240.520586] Out of memory: Kill process 28909 (a.out) score 0 or sacrifice child
[  240.524264] Killed process 29307 (a.out) total-vm:4176kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
[  241.568323] oom_reaper: reaped process 29307 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  242.228607] Out of memory: Kill process 28909 (a.out) score 0 or sacrifice child
[  242.232281] Killed process 29323 (a.out) total-vm:4176kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  242.902598] Out of memory: Kill process 28909 (a.out) score 0 or sacrifice child
[  242.906366] Killed process 31097 (a.out) total-vm:4176kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
[  243.240908] oom_reaper: reaped process 31097 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  243.854813] Out of memory: Kill process 28909 (a.out) score 0 or sacrifice child
[  243.858490] Killed process 31100 (a.out) total-vm:4176kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
[  244.120162] oom_reaper: reaped process 31100 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[  244.750778] Out of memory: Kill process 28909 (a.out) score 0 or sacrifice child
[  244.754505] Killed process 31106 (a.out) total-vm:4176kB, anon-rss:88kB, file-rss:0kB, shmem-rss:0kB
[  245.815781] oom_reaper: unable to reap pid:31106 (a.out)
[  245.818786] 
[  245.818786] Showing all locks held in the system:
(...snipped...)
[  245.869500] 1 lock held by kswapd0/79:
[  245.872655]  #0: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: rmap_walk_file+0x205/0x310
(...snipped...)
[  246.565465] 1 lock held by a.out/29307: /* Already reaped OOM victim */
[  246.568926]  #0: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: unlink_file_vma+0x28/0x50
(...snipped...)
[  250.812088] 3 locks held by a.out/30940:
[  250.815543]  #0: 00000000cd61a8e0 (&mm->mmap_sem){++++}, at: copy_process.part.41+0x10d5/0x1fe0
[  250.820980]  #1: 00000000cf6d4f24 (&mm->mmap_sem/1){+.+.}, at: copy_process.part.41+0x10fe/0x1fe0
[  250.826401]  #2: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: rmap_walk_file+0x205/0x310
(...snipped...)
[  251.233604] 1 lock held by a.out/31088: /* Already reaped OOM victim */
[  251.236953]  #0: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: unlink_file_vma+0x28/0x50
(...snipped...)
[  251.258144] 1 lock held by a.out/31097: /* Already reaped OOM victim */
[  251.261531]  #0: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: unlink_file_vma+0x28/0x50
[  251.266789] 1 lock held by a.out/31100: /* Already reaped OOM victim */
[  251.270208]  #0: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: unlink_file_vma+0x28/0x50
(...snipped...)
[  251.305475] 3 locks held by a.out/31106: /* Unable to reap OOM victim */
[  251.308949]  #0: 00000000b0f753ba (&mm->mmap_sem){++++}, at: copy_process.part.41+0x10d5/0x1fe0
[  251.314283]  #1: 00000000ef64d539 (&mm->mmap_sem/1){+.+.}, at: copy_process.part.41+0x10fe/0x1fe0
[  251.319618]  #2: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: copy_process.part.41+0x12f2/0x1fe0
(...snipped...)
[  259.196415] 1 lock held by a.out/33338:
[  259.199837]  #0: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: unlink_file_vma+0x28/0x50
(...snipped...)
[  264.040902] 3 locks held by a.out/34558:
[  264.044475]  #0: 00000000348405b9 (&mm->mmap_sem){++++}, at: __do_page_fault+0x16f/0x4d0
[  264.049516]  #1: 00000000962671a1 (&(&ip->i_mmaplock)->mr_lock){++++}, at: xfs_ilock+0x10a/0x190 [xfs]
[  264.055108]  #2: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: rmap_walk_file+0x205/0x310
(...snipped...)
[  267.747036] 3 locks held by a.out/35518:
[  267.750545]  #0: 0000000098a5825d (&mm->mmap_sem){++++}, at: copy_process.part.41+0x10d5/0x1fe0
[  267.755955]  #1: 000000002b63c006 (&mm->mmap_sem/1){+.+.}, at: copy_process.part.41+0x10fe/0x1fe0
[  267.761466]  #2: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: rmap_walk_file+0x205/0x310
(...snipped...)
[  295.198803] 1 lock held by a.out/42524:
[  295.202483]  #0: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: rmap_walk_file+0x205/0x310
(...snipped...)
[  295.599000] 2 locks held by a.out/42684:
[  295.602901]  #0: 000000003cd42787 (&mm->mmap_sem){++++}, at: __do_page_fault+0x457/0x4d0
[  295.608495]  #1: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: rmap_walk_file+0x205/0x310
(...snipped...)
[  296.202611] 2 locks held by a.out/42885:
[  296.206546]  #0: 0000000065124e3d (&mm->mmap_sem){++++}, at: __do_page_fault+0x16f/0x4d0
[  296.212185]  #1: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: rmap_walk_file+0x205/0x310
(...snipped...)
[  300.594196] 2 locks held by a.out/44035:
[  300.599942]  #0: 00000000a4e2de40 (&mm->mmap_sem){++++}, at: __do_page_fault+0x457/0x4d0
[  300.605533]  #1: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: rmap_walk_file+0x205/0x310
(...snipped...)
[  302.278287] 3 locks held by a.out/44420:
[  302.282104]  #0: 00000000f043328f (&mm->mmap_sem){++++}, at: copy_process.part.41+0x10d5/0x1fe0
[  302.287959]  #1: 000000007f312097 (&mm->mmap_sem/1){+.+.}, at: copy_process.part.41+0x10fe/0x1fe0
[  302.293872]  #2: 00000000b41a282e (&mapping->i_mmap_rwsem){++++}, at: copy_process.part.41+0x12f2/0x1fe0
----------------------------------------

In dup_mmap(), we are using down_write_killable() for mm->mmap_sem but
we are not using down_write_killable() for mapping->i_mmap_rwsem. What is
unfortunate is that the set of processes contending on mapping->i_mmap_rwsem
is much wider than the set of processes accessing the OOM victim's
mm->mmap_sem.

If the OOM victim is holding mm->mmap_sem held for write, and if the OOM
victim can interrupt operations which need mm->mmap_sem held for write,
we can downgrade mm->mmap_sem upon SIGKILL and the OOM reaper will be
able to reap the OOM victim's memory.

Therefore, this patch introduces i_mmap_lock_write_killable() and downgrades
mm->mmap_sem upon SIGKILL. Since mm->mmap_sem is still held for read, nobody
can acquire it for write. Thus, the only thing we need to be careful about is
whether we can safely interrupt at this point. (But I'm not familiar with the
mmap code. Thus, please review carefully...)

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/fs.h |  5 +++++
 kernel/fork.c      | 16 +++++++++++++---
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index bb45c48..2f11c55 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -468,6 +468,11 @@ static inline void i_mmap_lock_write(struct address_space *mapping)
 	down_write(&mapping->i_mmap_rwsem);
 }
 
+static inline int i_mmap_lock_write_killable(struct address_space *mapping)
+{
+	return down_write_killable(&mapping->i_mmap_rwsem);
+}
+
 static inline void i_mmap_unlock_write(struct address_space *mapping)
 {
 	up_write(&mapping->i_mmap_rwsem);
diff --git a/kernel/fork.c b/kernel/fork.c
index 1e8c9a7..b4384e2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -400,6 +400,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	int retval;
 	unsigned long charge;
 	LIST_HEAD(uf);
+	bool downgraded = false;
 
 	uprobe_start_dup_mmap();
 	if (down_write_killable(&oldmm->mmap_sem)) {
@@ -476,7 +477,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			get_file(file);
 			if (tmp->vm_flags & VM_DENYWRITE)
 				atomic_dec(&inode->i_writecount);
-			i_mmap_lock_write(mapping);
+			if (i_mmap_lock_write_killable(mapping)) {
+				downgrade_write(&oldmm->mmap_sem);
+				downgraded = true;
+				i_mmap_lock_write(mapping);
+			}
 			if (tmp->vm_flags & VM_SHARED)
 				atomic_inc(&mapping->i_mmap_writable);
 			flush_dcache_mmap_lock(mapping);
@@ -508,7 +513,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		if (downgraded)
+			retval = -EINTR;
+		else if (!(tmp->vm_flags & VM_WIPEONFORK))
 			retval = copy_page_range(mm, oldmm, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
@@ -523,7 +530,10 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);
-	up_write(&oldmm->mmap_sem);
+	if (!downgraded)
+		up_write(&oldmm->mmap_sem);
+	else
+		up_read(&oldmm->mmap_sem);
 	dup_userfaultfd_complete(&uf);
 fail_uprobe_end:
 	uprobe_end_dup_mmap();
-- 
1.8.3.1


* Re: [PATCH] mm: Introduce i_mmap_lock_write_killable().
  2018-03-27 11:19 [PATCH] mm: Introduce i_mmap_lock_write_killable() Tetsuo Handa
@ 2018-03-27 14:52 ` Michal Hocko
  2018-03-28 10:23   ` Tetsuo Handa
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Hocko @ 2018-03-27 14:52 UTC (permalink / raw)
  To: Tetsuo Handa
  Cc: linux-mm, linux-fsdevel, Alexander Viro, Andrew Morton,
	Kirill A. Shutemov, Rik van Riel

On Tue 27-03-18 20:19:30, Tetsuo Handa wrote:
> If the OOM victim is holding mm->mmap_sem held for write, and if the OOM
> victim can interrupt operations which need mm->mmap_sem held for write,
> we can downgrade mm->mmap_sem upon SIGKILL and the OOM reaper will be
> able to reap the OOM victim's memory.

This really begs for much better explanation. Why is it safe? Are you
assuming that the killed task will not perform any changes on the
address space? What about ongoing page faults or other operations deeper
in the call chain. Why they are safe to change things for the child
during the copy?

I am not saying this is wrong, I would have to think about that much
more because mmap_sem tends to be used on many surprising places and the
write lock just hide them all.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: Introduce i_mmap_lock_write_killable().
  2018-03-27 14:52 ` Michal Hocko
@ 2018-03-28 10:23   ` Tetsuo Handa
  2018-03-28 11:05     ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2018-03-28 10:23 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, linux-fsdevel, viro, akpm, kirill.shutemov, riel

Michal Hocko wrote:
> On Tue 27-03-18 20:19:30, Tetsuo Handa wrote:
> > If the OOM victim is holding mm->mmap_sem held for write, and if the OOM
> > victim can interrupt operations which need mm->mmap_sem held for write,
> > we can downgrade mm->mmap_sem upon SIGKILL and the OOM reaper will be
> > able to reap the OOM victim's memory.
> 
> This really begs for much better explanation. Why is it safe?

Basic idea is

  bool downgraded = false;
  down_write(mmap_sem);
  for (something_1_that_might_depend_mmap_sem_held_for_write;
       something_2_that_might_depend_mmap_sem_held_for_write;
       something_3_that_might_depend_mmap_sem_held_for_write) {
     something_4_that_might_depend_mmap_sem_held_for_write();
     if (fatal_signal_pending(current)) {
        downgrade_write(mmap_sem);
        downgraded = true;
        break;
     }
     something_5_that_might_depend_mmap_sem_held_for_write();
  }
  if (!downgraded)
    up_write(mmap_sem);
  else
    up_read(mmap_sem);

. That is, try to interrupt critical sections at locations where it is
known to be safe and consistent.

>                                                               Are you
> assuming that the killed task will not perform any changes on the
> address space?

If somebody drops mmap_sem held for write is not safe, how can the OOM
reaper work safely?

The OOM reaper is assuming that the thread who got mmap_sem held for write
is responsible to complete critical sections before dropping mmap_sem held
for write, isn't it?

Then, how an attempt to perform changes on the address space can become a
problem given that the thread who got mmap_sem held for write is responsible
to complete critical sections before dropping mmap_sem held for write?

>                What about ongoing page faults or other operations deeper
> in the call chain.

Even if there are ongoing page faults or other operations deeper in the call
chain, there should be no problem as long as the thread who got mmap_sem
held for write is responsible to complete critical sections before dropping
mmap_sem held for write.

>                    Why they are safe to change things for the child
> during the copy?

In this patch, the current thread who got mmap_sem held for write (which is
likely an OOM victim thread) downgrades mmap_sem, with an assumption that
current thread no longer accesses memory which might depend on mmap_sem held
for write.

dup_mmap() duplicates current->mm and to-be-duplicated mm is not visible yet.
If dup_mmap() failed, to-be-duplicated incomplete mm is discarded via mmput()
in dup_mm() rather than assigned to the child. Thus, this patch should not
change things which are visible to the child during the copy.

What we need to be careful is making changes to current->mm.
I'm assuming that current->mm->mmap_sem held for read is enough for
i_mmap_lock_write()/flush_dcache_mmap_lock()/vma_interval_tree_insert_after()/
flush_dcache_mmap_unlock()/i_mmap_unlock_write()/is_vm_hugetlb_page()/
reset_vma_resv_huge_pages()/__vma_link_rb(). But I'm not sure.

> 
> I am not saying this is wrong, I would have to think about that much
> more because mmap_sem tends to be used on many surprising places and the
> write lock just hide them all.

Then, an alternative approach which interrupts without downgrading is shown
below. But I'm not sure.

diff --git a/include/linux/fs.h b/include/linux/fs.h
index bb45c48..2f11c55 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -468,6 +468,11 @@ static inline void i_mmap_lock_write(struct address_space *mapping)
 	down_write(&mapping->i_mmap_rwsem);
 }
 
+static inline int i_mmap_lock_write_killable(struct address_space *mapping)
+{
+	return down_write_killable(&mapping->i_mmap_rwsem);
+}
+
 static inline void i_mmap_unlock_write(struct address_space *mapping)
 {
 	up_write(&mapping->i_mmap_rwsem);
diff --git a/kernel/fork.c b/kernel/fork.c
index 1e8c9a7..c9c141d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -473,10 +473,21 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			struct inode *inode = file_inode(file);
 			struct address_space *mapping = file->f_mapping;
 
+			if (i_mmap_lock_write_killable(mapping)) {
+				/*
+				 * Pretend that this is not a file mapping, for
+				 * dup_mm() will after all discard this mm due
+				 * to fatal_signal_pending() check below. But
+				 * make sure not to call open()/close() hook
+				 * which might expect tmp->vm_file != NULL.
+				 */
+				tmp->vm_file = NULL;
+				tmp->vm_ops = NULL;
+				goto skip_file;
+			}
 			get_file(file);
 			if (tmp->vm_flags & VM_DENYWRITE)
 				atomic_dec(&inode->i_writecount);
-			i_mmap_lock_write(mapping);
 			if (tmp->vm_flags & VM_SHARED)
 				atomic_inc(&mapping->i_mmap_writable);
 			flush_dcache_mmap_lock(mapping);
@@ -486,6 +497,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			flush_dcache_mmap_unlock(mapping);
 			i_mmap_unlock_write(mapping);
 		}
+skip_file:
 
 		/*
 		 * Clear hugetlb-related page reserves for children. This only
@@ -508,7 +520,13 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		/*
+		 * Bail out as soon as possible, for dup_mm() will after all
+		 * discard this mm by returning an error.
+		 */
+		if (fatal_signal_pending(current))
+			retval = -EINTR;
+		else if (!(tmp->vm_flags & VM_WIPEONFORK))
 			retval = copy_page_range(mm, oldmm, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)


* Re: [PATCH] mm: Introduce i_mmap_lock_write_killable().
  2018-03-28 10:23   ` Tetsuo Handa
@ 2018-03-28 11:05     ` Michal Hocko
  2018-03-28 12:26       ` Tetsuo Handa
  0 siblings, 1 reply; 6+ messages in thread
From: Michal Hocko @ 2018-03-28 11:05 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, linux-fsdevel, viro, akpm, kirill.shutemov, riel

On Wed 28-03-18 19:23:20, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 27-03-18 20:19:30, Tetsuo Handa wrote:
> > > If the OOM victim is holding mm->mmap_sem held for write, and if the OOM
> > > victim can interrupt operations which need mm->mmap_sem held for write,
> > > we can downgrade mm->mmap_sem upon SIGKILL and the OOM reaper will be
> > > able to reap the OOM victim's memory.
> > 
> > This really begs for much better explanation. Why is it safe?
> 
> Basic idea is
> 
>   bool downgraded = false;
>   down_write(mmap_sem);
>   for (something_1_that_might_depend_mmap_sem_held_for_write;
>        something_2_that_might_depend_mmap_sem_held_for_write;
>        something_3_that_might_depend_mmap_sem_held_for_write) {
>      something_4_that_might_depend_mmap_sem_held_for_write();
>      if (fatal_signal_pending(current)) {
>         downgrade_write(mmap_sem);
>         downgraded = true;
>         break;
>      }
>      something_5_that_might_depend_mmap_sem_held_for_write();
>   }
>   if (!downgraded)
>     up_write(mmap_sem);
>   else
>     up_read(mmap_sem);
> 
> . That is, try to interrupt critical sections at locations where it is
> known to be safe and consistent.

Please explain why those places are safe to interrupt.

> >                                                               Are you
> > assuming that the killed task will not perform any changes on the
> > address space?
> 
> If somebody drops mmap_sem held for write is not safe, how can the OOM
> reaper work safely?
> 
> The OOM reaper is assuming that the thread who got mmap_sem held for write
> is responsible to complete critical sections before dropping mmap_sem held
> for write, isn't it?
> 
> Then, how an attempt to perform changes on the address space can become a
> problem given that the thread who got mmap_sem held for write is responsible
> to complete critical sections before dropping mmap_sem held for write?

ENOPARSE. How does this have anything to do with oom_reaper? Sure you
want to _help_ the oom_reaper do its job, but you are downgrading the
lock in the middle of dup_mmap and that is what we are discussing here.
So please explain why it is safe. It is really not straightforward.

> >                What about ongoing page faults or other operations deeper
> > in the call chain.
> 
> Even if there are ongoing page faults or other operations deeper in the call
> chain, there should be no problem as long as the thread who got mmap_sem
> held for write is responsible to complete critical sections before dropping
> mmap_sem held for write.
> 
> >                    Why they are safe to change things for the child
> > during the copy?
> 
> In this patch, the current thread who got mmap_sem held for write (which is
> likely an OOM victim thread) downgrades mmap_sem, with an assumption that
> current thread no longer accesses memory which might depend on mmap_sem held
> for write.
> 
> dup_mmap() duplicates current->mm and to-be-duplicated mm is not visible yet.
> If dup_mmap() failed, to-be-duplicated incomplete mm is discarded via mmput()
> in dup_mm() rather than assigned to the child. Thus, this patch should not
> change things which are visible to the child during the copy.
> 
> What we need to be careful is making changes to current->mm.
> I'm assuming that current->mm->mmap_sem held for read is enough for
> i_mmap_lock_write()/flush_dcache_mmap_lock()/vma_interval_tree_insert_after()/
> flush_dcache_mmap_unlock()/i_mmap_unlock_write()/is_vm_hugetlb_page()/
> reset_vma_resv_huge_pages()/__vma_link_rb(). But I'm not sure.

But as soon as you downgrade the lock then all other threads can
interfere and perform page faults or update respective mappings. Does
this matter? If not then why?

> > I am not saying this is wrong, I would have to think about that much
> > more because mmap_sem tends to be used on many surprising places and the
> > write lock just hide them all.
> 
> Then, an alternative approach which interrupts without downgrading is shown
> below. But I'm not sure.

Failing the whole dup_mmap might be quite reasonable, yes. I haven't
checked your particular patch because this code path needs much more
time than I can give this, though.
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH] mm: Introduce i_mmap_lock_write_killable().
  2018-03-28 11:05     ` Michal Hocko
@ 2018-03-28 12:26       ` Tetsuo Handa
  2018-03-28 12:39         ` Michal Hocko
  0 siblings, 1 reply; 6+ messages in thread
From: Tetsuo Handa @ 2018-03-28 12:26 UTC (permalink / raw)
  To: mhocko; +Cc: linux-mm, linux-fsdevel, viro, akpm, kirill.shutemov, riel

Michal Hocko wrote:
> > > I am not saying this is wrong, I would have to think about that much
> > > more because mmap_sem tends to be used on many surprising places and the
> > > write lock just hide them all.
> > 
> > Then, an alternative approach which interrupts without downgrading is shown
> > below. But I'm not sure.
> 
> Failing the whole dup_mmap might be quite reasonable, yes. I haven't
> checked your particular patch because this code path needs much more
> time than I can give this, though.

I think that interrupting at

diff --git a/kernel/fork.c b/kernel/fork.c
index 1e8c9a7..851c675 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -514,6 +514,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		if (tmp->vm_ops && tmp->vm_ops->open)
 			tmp->vm_ops->open(tmp);
 
+		if (!retval && fatal_signal_pending(current))
+			retval = -EINTR;
+
 		if (retval)
 			goto out;
 	}
-- 

is safe, because there is no difference (except the error code) between the
above change and hitting the "goto fail_nomem;" path after "mpnt = mpnt->vm_next;".

Therefore, I think that interrupting at

diff --git a/kernel/fork.c b/kernel/fork.c
index 851c675..2706acc 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -508,7 +508,9 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_parent = &tmp->vm_rb;
 
 		mm->map_count++;
-		if (!(tmp->vm_flags & VM_WIPEONFORK))
+		if (fatal_signal_pending(current))
+			retval = -EINTR;
+		else if (!(tmp->vm_flags & VM_WIPEONFORK))
 			retval = copy_page_range(mm, oldmm, mpnt);
 
 		if (tmp->vm_ops && tmp->vm_ops->open)
-- 

is also safe, because cleanup after a copy_page_range() failure is already
handled by mmput().

Thus, I think that there are locations inside the
"for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next)" loop where interruption
is known to be safe and consistent.

> On Wed 28-03-18 19:23:20, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 27-03-18 20:19:30, Tetsuo Handa wrote:
> > > > If the OOM victim is holding mm->mmap_sem held for write, and if the OOM
> > > > victim can interrupt operations which need mm->mmap_sem held for write,
> > > > we can downgrade mm->mmap_sem upon SIGKILL and the OOM reaper will be
> > > > able to reap the OOM victim's memory.
> > > 
> > > This really begs for much better explanation. Why is it safe?
> > 
> > Basic idea is
> > 
> >   bool downgraded = false;
> >   down_write(mmap_sem);
> >   for (something_1_that_might_depend_mmap_sem_held_for_write;
> >        something_2_that_might_depend_mmap_sem_held_for_write;
> >        something_3_that_might_depend_mmap_sem_held_for_write) {
> >      something_4_that_might_depend_mmap_sem_held_for_write();
> >      if (fatal_signal_pending(current)) {
> >         downgrade_write(mmap_sem);
> >         downgraded = true;
> >         break;
> >      }
> >      something_5_that_might_depend_mmap_sem_held_for_write();
> >   }
> >   if (!downgraded)
> >     up_write(mmap_sem);
> >   else
> >     up_read(mmap_sem);
> > 
> > . That is, try to interrupt critical sections at locations where it is
> > known to be safe and consistent.
> 
> Please explain why those places are safe to interrupt.

Because (regarding the downgrade_write() approach), as far as I know,
the current thread does not access memory which needs to be protected with
mmap_sem held for write.

> 
> > >                                                               Are you
> > > assuming that the killed task will not perform any changes on the
> > > address space?
> > 
> > If somebody drops mmap_sem held for write is not safe, how can the OOM
> > reaper work safely?
> > 
> > The OOM reaper is assuming that the thread who got mmap_sem held for write
> > is responsible to complete critical sections before dropping mmap_sem held
> > for write, isn't it?
> > 
> > Then, how an attempt to perform changes on the address space can become a
> > problem given that the thread who got mmap_sem held for write is responsible
> > to complete critical sections before dropping mmap_sem held for write?
> 
> ENOPARSE. How does this have anything to do with oom_reaper.

The oom_reaper can work safely as long as mmap_sem held for write is released
only at locations where interruption is safe and consistent.

>                                                              Sure you
> want to _help_ the oom_reaper do its job, but you are downgrading the
> lock in the middle of dup_mmap and that is what we are discussing here.

Yes.

>                             So please explain why it is safe. It is
> really not straightforward.

So, please explain why it is not safe. How can releasing mmap_sem held for
write at safely and consistently interruptible locations be not safe?

> 
> > >                What about ongoing page faults or other operations deeper
> > > in the call chain.
> > 
> > Even if there are ongoing page faults or other operations deeper in the call
> > chain, there should be no problem as long as the thread who got mmap_sem
> > held for write is responsible to complete critical sections before dropping
> > mmap_sem held for write.
> > 
> > >                    Why they are safe to change things for the child
> > > during the copy?
> > 
> > In this patch, the current thread who got mmap_sem held for write (which is
> > likely an OOM victim thread) downgrades mmap_sem, with an assumption that
> > current thread no longer accesses memory which might depend on mmap_sem held
> > for write.
> > 
> > dup_mmap() duplicates current->mm and to-be-duplicated mm is not visible yet.
> > If dup_mmap() failed, to-be-duplicated incomplete mm is discarded via mmput()
> > in dup_mm() rather than assigned to the child. Thus, this patch should not
> > change things which are visible to the child during the copy.
> > 
> > What we need to be careful is making changes to current->mm.
> > I'm assuming that current->mm->mmap_sem held for read is enough for
> > i_mmap_lock_write()/flush_dcache_mmap_lock()/vma_interval_tree_insert_after()/
> > flush_dcache_mmap_unlock()/i_mmap_unlock_write()/is_vm_hugetlb_page()/
> > reset_vma_resv_huge_pages()/__vma_link_rb(). But I'm not sure.
> 
> But as soon as you downgrade the lock then all other threads can
> interfere and perform page faults or update respective mappings. Does
> this matter? If not then why?
> 

Why does this matter?

I don't know what "update respective mappings" means.
Is that about mmap()/munmap() which need mmap_sem held for write?
Since mmap_sem is still held for read, operations which needs
mmap_sem held for write cannot happen.

Anyway, as long as I downgrade the mmap_sem at safely and consistently
interruptible locations, there cannot be a problem.


* Re: [PATCH] mm: Introduce i_mmap_lock_write_killable().
  2018-03-28 12:26       ` Tetsuo Handa
@ 2018-03-28 12:39         ` Michal Hocko
  0 siblings, 0 replies; 6+ messages in thread
From: Michal Hocko @ 2018-03-28 12:39 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-mm, linux-fsdevel, viro, akpm, kirill.shutemov, riel

On Wed 28-03-18 21:26:43, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > > Basic idea is
> > > 
> > >   bool downgraded = false;
> > >   down_write(mmap_sem);
> > >   for (something_1_that_might_depend_mmap_sem_held_for_write;
> > >        something_2_that_might_depend_mmap_sem_held_for_write;
> > >        something_3_that_might_depend_mmap_sem_held_for_write) {
> > >      something_4_that_might_depend_mmap_sem_held_for_write();
> > >      if (fatal_signal_pending(current)) {
> > >         downgrade_write(mmap_sem);
> > >         downgraded = true;
> > >         break;
> > >      }
> > >      something_5_that_might_depend_mmap_sem_held_for_write();
> > >   }
> > >   if (!downgraded)
> > >     up_write(mmap_sem);
> > >   else
> > >     up_read(mmap_sem);
> > > 
> > > . That is, try to interrupt critical sections at locations where it is
> > > known to be safe and consistent.
> > 
> > Please explain why those places are safe to interrupt.
> 
> Because (regarding the downgrade_write() approach), as far as I know,
> the current thread does not access memory which needs to be protected with
> mmap_sem held for write.

But there are other threads which can be in an arbitrary code path
before they get to the signal handling code.

> >                             So please explain why it is safe. It is
> > really not straightforward.
> 
> So, please explain why it is not safe. How can releasing mmap_sem held for
> write at safely and consistently interruptible locations be not safe?

Look, you are proposing a patch and the onus to justify its correctness
is on you. You are playing with code of which you have only a vague
understanding, so you should study and understand it more. You seem to be
inferring invariants from the current function scope and that is quite
dangerous.

[...]
> > > What we need to be careful is making changes to current->mm.
> > > I'm assuming that current->mm->mmap_sem held for read is enough for
> > > i_mmap_lock_write()/flush_dcache_mmap_lock()/vma_interval_tree_insert_after()/
> > > flush_dcache_mmap_unlock()/i_mmap_unlock_write()/is_vm_hugetlb_page()/
> > > reset_vma_resv_huge_pages()/__vma_link_rb(). But I'm not sure.
> > 
> > But as soon as you downgrade the lock then all other threads can
> interfere and perform page faults or update respective mappings. Does
> > this matter? If not then why?
> > 
> 
> Why does this matter?
> 
> I don't know what "update respecive mappings" means.

E.g. any get_user_pages or an ongoing #PF can update a mapping, and so the
child process might see some inconsistencies (i.e. parts of the address
space with the previous content).

> Is that about mmap()/munmap() which need mmap_sem held for write?

No, it is about those who take it for read.

> Since mmap_sem is still held for read, operations which needs
> mmap_sem held for write cannot happen.
> 
> Anyway, as long as I downgrade the mmap_sem at safely and consistently
> interruptible locations, there cannot be a problem.

I have a strong feeling that the whole address space has to be copied
atomically from the address space POV. I might be wrong, but it is not
really obvious to me why, and if you want to downgrade the lock there
then please make sure to document what your assumptions are and why they
are valid.
-- 
Michal Hocko
SUSE Labs

