linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] exit: clear TIF_MEMDIE after exit_task_work
@ 2016-02-29 17:02 Vladimir Davydov
  2016-02-29 18:21 ` Michal Hocko
  2016-03-01 15:52 ` Michal Hocko
  0 siblings, 2 replies; 18+ messages in thread
From: Vladimir Davydov @ 2016-02-29 17:02 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Michal Hocko, Tetsuo Handa, David Rientjes, linux-mm, linux-kernel

An mm_struct may be pinned by a file. An example is vhost-net device
created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
vhost_dev_set_owner). If such process gets OOM-killed, the reference to
its mm_struct will only be released from exit_task_work -> ____fput ->
__fput -> vhost_net_release -> vhost_dev_cleanup, which is called after
exit_mmap, where TIF_MEMDIE is cleared. As a result, we can start
selecting the next victim before giving the last one a chance to free
its memory. In practice, this leads to killing several VMs along with
the fattest one.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
---
 kernel/exit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index fd90195667e1..cc50e12165f7 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,8 +434,6 @@ static void exit_mm(struct task_struct *tsk)
 	task_unlock(tsk);
 	mm_update_next_owner(mm);
 	mmput(mm);
-	if (test_thread_flag(TIF_MEMDIE))
-		exit_oom_victim(tsk);
 }
 
 static struct task_struct *find_alive_thread(struct task_struct *p)
@@ -746,6 +744,8 @@ void do_exit(long code)
 		disassociate_ctty(1);
 	exit_task_namespaces(tsk);
 	exit_task_work(tsk);
+	if (test_thread_flag(TIF_MEMDIE))
+		exit_oom_victim(tsk);
 	exit_thread();
 
 	/*
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-02-29 17:02 [PATCH] exit: clear TIF_MEMDIE after exit_task_work Vladimir Davydov
@ 2016-02-29 18:21 ` Michal Hocko
  2016-02-29 18:44   ` Michal Hocko
  2016-03-01 15:52 ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2016-02-29 18:21 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, linux-mm, linux-kernel

On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> An mm_struct may be pinned by a file. An example is vhost-net device
> created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> vhost_dev_set_owner). If such process gets OOM-killed, the reference to
> its mm_struct will only be released from exit_task_work -> ____fput ->
> __fput -> vhost_net_release -> vhost_dev_cleanup, which is called after
> exit_mmap, where TIF_MEMDIE is cleared. As a result, we can start
> selecting the next victim before giving the last one a chance to free
> its memory. In practice, this leads to killing several VMs along with
> the fattest one.

I am wondering why our PF_EXITING protection hasn't fired up. This is
not done in the mmotm tree but I guess you have seen the issue with the
linus tree, right? Do you have a log with oom reports available?

To be honest I do not feel very comfortable about moving the
exit_oom_victim even further down in do_exit path behind even less clear
locking or other dependencies.

Let's see if we can do any better for this particular case. 

> Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
> ---
>  kernel/exit.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index fd90195667e1..cc50e12165f7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -434,8 +434,6 @@ static void exit_mm(struct task_struct *tsk)
>  	task_unlock(tsk);
>  	mm_update_next_owner(mm);
>  	mmput(mm);
> -	if (test_thread_flag(TIF_MEMDIE))
> -		exit_oom_victim(tsk);
>  }
>  
>  static struct task_struct *find_alive_thread(struct task_struct *p)
> @@ -746,6 +744,8 @@ void do_exit(long code)
>  		disassociate_ctty(1);
>  	exit_task_namespaces(tsk);
>  	exit_task_work(tsk);
> +	if (test_thread_flag(TIF_MEMDIE))
> +		exit_oom_victim(tsk);
>  	exit_thread();
>  
>  	/*
> -- 
> 2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-02-29 18:21 ` Michal Hocko
@ 2016-02-29 18:44   ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2016-02-29 18:44 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, linux-mm, linux-kernel

On Mon 29-02-16 19:21:31, Michal Hocko wrote:
> On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> > An mm_struct may be pinned by a file. An example is vhost-net device
> > created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> > vhost_dev_set_owner). If such process gets OOM-killed, the reference to
> > its mm_struct will only be released from exit_task_work -> ____fput ->
> > __fput -> vhost_net_release -> vhost_dev_cleanup, which is called after
> > exit_mmap, where TIF_MEMDIE is cleared. As a result, we can start
> > selecting the next victim before giving the last one a chance to free
> > its memory. In practice, this leads to killing several VMs along with
> > the fattest one.
> 
> I am wondering why our PF_EXITING protection hasn't fired up.

OK, I guess I can see it. exit_mm has done tsk->mm = NULL and so we are
skipping over that task because oom_scan_process_thread hasn't checked
PF_EXITING. I will try to think about this some more tomorrow with a
fresh brain.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-02-29 17:02 [PATCH] exit: clear TIF_MEMDIE after exit_task_work Vladimir Davydov
  2016-02-29 18:21 ` Michal Hocko
@ 2016-03-01 15:52 ` Michal Hocko
  2016-03-01 15:57   ` Michael S. Tsirkin
  1 sibling, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2016-03-01 15:52 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Andrew Morton, Tetsuo Handa, David Rientjes, linux-mm,
	linux-kernel, Michael S. Tsirkin

[CCing vhost-net maintainer]

On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> An mm_struct may be pinned by a file. An example is vhost-net device
> created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> vhost_dev_set_owner).

The more I think about that the more I am wondering whether this is
actually OK and correct. Why does the driver have to pin the address
space? Nothing really prevents from parallel tearing down of the address
space anyway so the code cannot expect all the vmas to stay. Would it be
enough to pin the mm_struct only?

I am not sure I understand the code properly but what prevents from
the situation when a VHOST_SET_OWNER caller dies without calling
VHOST_RESET_OWNER and so the mm would be pinned indefinitely?

[Keeping the reset of the email for reference]

> If such process gets OOM-killed, the reference to
> its mm_struct will only be released from exit_task_work -> ____fput ->
> __fput -> vhost_net_release -> vhost_dev_cleanup, which is called after
> exit_mmap, where TIF_MEMDIE is cleared. As a result, we can start
> selecting the next victim before giving the last one a chance to free
> its memory. In practice, this leads to killing several VMs along with
> the fattest one.
> 
> Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
> ---
>  kernel/exit.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index fd90195667e1..cc50e12165f7 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -434,8 +434,6 @@ static void exit_mm(struct task_struct *tsk)
>  	task_unlock(tsk);
>  	mm_update_next_owner(mm);
>  	mmput(mm);
> -	if (test_thread_flag(TIF_MEMDIE))
> -		exit_oom_victim(tsk);
>  }
>  
>  static struct task_struct *find_alive_thread(struct task_struct *p)
> @@ -746,6 +744,8 @@ void do_exit(long code)
>  		disassociate_ctty(1);
>  	exit_task_namespaces(tsk);
>  	exit_task_work(tsk);
> +	if (test_thread_flag(TIF_MEMDIE))
> +		exit_oom_victim(tsk);
>  	exit_thread();
>  
>  	/*
> -- 
> 2.1.4

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-01 15:52 ` Michal Hocko
@ 2016-03-01 15:57   ` Michael S. Tsirkin
  2016-03-01 16:08     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2016-03-01 15:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue, Mar 01, 2016 at 04:52:12PM +0100, Michal Hocko wrote:
> [CCing vhost-net maintainer]
> 
> On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> > An mm_struct may be pinned by a file. An example is vhost-net device
> > created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> > vhost_dev_set_owner).
> 
> The more I think about that the more I am wondering whether this is
> actually OK and correct. Why does the driver have to pin the address
> space? Nothing really prevents from parallel tearing down of the address
> space anyway so the code cannot expect all the vmas to stay. Would it be
> enough to pin the mm_struct only?

I'll need to research this. It's a fact that as long as the
device is not stopped, vhost can attempt to access
the address space.

> I am not sure I understand the code properly but what prevents from
> the situation when a VHOST_SET_OWNER caller dies without calling
> VHOST_RESET_OWNER and so the mm would be pinned indefinitely?
> 
> [Keeping the reset of the email for reference]

We have:

static const struct file_operations vhost_net_fops = {
        .owner          = THIS_MODULE,
        .release        = vhost_net_release,
...
};

When caller dies and after fds are closed,
vhost_net_release calls vhost_dev_cleanup and that
drops the mm reference.

> > If such process gets OOM-killed, the reference to
> > its mm_struct will only be released from exit_task_work -> ____fput ->
> > __fput -> vhost_net_release -> vhost_dev_cleanup, which is called after
> > exit_mmap, where TIF_MEMDIE is cleared. As a result, we can start
> > selecting the next victim before giving the last one a chance to free
> > its memory. In practice, this leads to killing several VMs along with
> > the fattest one.
> > 
> > Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
> > ---
> >  kernel/exit.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/exit.c b/kernel/exit.c
> > index fd90195667e1..cc50e12165f7 100644
> > --- a/kernel/exit.c
> > +++ b/kernel/exit.c
> > @@ -434,8 +434,6 @@ static void exit_mm(struct task_struct *tsk)
> >  	task_unlock(tsk);
> >  	mm_update_next_owner(mm);
> >  	mmput(mm);
> > -	if (test_thread_flag(TIF_MEMDIE))
> > -		exit_oom_victim(tsk);
> >  }
> >  
> >  static struct task_struct *find_alive_thread(struct task_struct *p)
> > @@ -746,6 +744,8 @@ void do_exit(long code)
> >  		disassociate_ctty(1);
> >  	exit_task_namespaces(tsk);
> >  	exit_task_work(tsk);
> > +	if (test_thread_flag(TIF_MEMDIE))
> > +		exit_oom_victim(tsk);
> >  	exit_thread();
> >  
> >  	/*
> > -- 
> > 2.1.4
> 
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-01 15:57   ` Michael S. Tsirkin
@ 2016-03-01 16:08     ` Michal Hocko
  2016-03-01 16:14       ` Michael S. Tsirkin
  2016-03-01 16:22       ` Michael S. Tsirkin
  0 siblings, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2016-03-01 16:08 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue 01-03-16 17:57:04, Michael S. Tsirkin wrote:
> On Tue, Mar 01, 2016 at 04:52:12PM +0100, Michal Hocko wrote:
> > [CCing vhost-net maintainer]
> > 
> > On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> > > An mm_struct may be pinned by a file. An example is vhost-net device
> > > created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> > > vhost_dev_set_owner).
> > 
> > The more I think about that the more I am wondering whether this is
> > actually OK and correct. Why does the driver have to pin the address
> > space? Nothing really prevents from parallel tearing down of the address
> > space anyway so the code cannot expect all the vmas to stay. Would it be
> > enough to pin the mm_struct only?
> 
> I'll need to research this. It's a fact that as long as the
> device is not stopped, vhost can attempt to access
> the address space.

But does it expect any specific parts of the address space to be mapped?
E.g. proc needs to keep the mm allocated as well for some files but it
doesn't pin the address space (mm_users) but rather mm_count (see
proc_mem_open).

> > I am not sure I understand the code properly but what prevents from
> > the situation when a VHOST_SET_OWNER caller dies without calling
> > VHOST_RESET_OWNER and so the mm would be pinned indefinitely?
> > 
> > [Keeping the reset of the email for reference]
> 
> We have:
> 
> static const struct file_operations vhost_net_fops = {
>         .owner          = THIS_MODULE,
>         .release        = vhost_net_release,
> ...
> };
> 
> When caller dies and after fds are closed,
> vhost_net_release calls vhost_dev_cleanup and that
> drops the mm reference.

Can another process have the device open as well and prevent from
destruction?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-01 16:08     ` Michal Hocko
@ 2016-03-01 16:14       ` Michael S. Tsirkin
  2016-03-01 16:22       ` Michael S. Tsirkin
  1 sibling, 0 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2016-03-01 16:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue, Mar 01, 2016 at 05:08:13PM +0100, Michal Hocko wrote:
> On Tue 01-03-16 17:57:04, Michael S. Tsirkin wrote:
> > On Tue, Mar 01, 2016 at 04:52:12PM +0100, Michal Hocko wrote:
> > > [CCing vhost-net maintainer]
> > > 
> > > On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> > > > An mm_struct may be pinned by a file. An example is vhost-net device
> > > > created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> > > > vhost_dev_set_owner).
> > > 
> > > The more I think about that the more I am wondering whether this is
> > > actually OK and correct. Why does the driver have to pin the address
> > > space? Nothing really prevents from parallel tearing down of the address
> > > space anyway so the code cannot expect all the vmas to stay. Would it be
> > > enough to pin the mm_struct only?
> > 
> > I'll need to research this. It's a fact that as long as the
> > device is not stopped, vhost can attempt to access
> > the address space.
> 
> But does it expect any specific parts of the address space to be mapped?
> E.g. proc needs to keep the mm allocated as well for some files but it
> doesn't pin the address space (mm_users) but rather mm_count (see
> proc_mem_open).

As I said, I need to research this.

> > > I am not sure I understand the code properly but what prevents from
> > > the situation when a VHOST_SET_OWNER caller dies without calling
> > > VHOST_RESET_OWNER and so the mm would be pinned indefinitely?
> > > 
> > > [Keeping the reset of the email for reference]
> > 
> > We have:
> > 
> > static const struct file_operations vhost_net_fops = {
> >         .owner          = THIS_MODULE,
> >         .release        = vhost_net_release,
> > ...
> > };
> > 
> > When caller dies and after fds are closed,
> > vhost_net_release calls vhost_dev_cleanup and that
> > drops the mm reference.
> 
> Can another process have the device open as well and prevent from
> destruction?

Yes.

> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-01 16:08     ` Michal Hocko
  2016-03-01 16:14       ` Michael S. Tsirkin
@ 2016-03-01 16:22       ` Michael S. Tsirkin
  2016-03-01 16:35         ` Michal Hocko
  1 sibling, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2016-03-01 16:22 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue, Mar 01, 2016 at 05:08:13PM +0100, Michal Hocko wrote:
> On Tue 01-03-16 17:57:04, Michael S. Tsirkin wrote:
> > On Tue, Mar 01, 2016 at 04:52:12PM +0100, Michal Hocko wrote:
> > > [CCing vhost-net maintainer]
> > > 
> > > On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> > > > An mm_struct may be pinned by a file. An example is vhost-net device
> > > > created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> > > > vhost_dev_set_owner).
> > > 
> > > The more I think about that the more I am wondering whether this is
> > > actually OK and correct. Why does the driver have to pin the address
> > > space? Nothing really prevents from parallel tearing down of the address
> > > space anyway so the code cannot expect all the vmas to stay. Would it be
> > > enough to pin the mm_struct only?
> > 
> > I'll need to research this. It's a fact that as long as the
> > device is not stopped, vhost can attempt to access
> > the address space.
> 
> But does it expect any specific parts of the address space to be mapped?
> E.g. proc needs to keep the mm allocated as well for some files but it
> doesn't pin the address space (mm_users) but rather mm_count (see
> proc_mem_open).

At a quick glance, it seems that it's needed: it calls
get_user_pages(mm) and that looks like it will not DTRT (or even fail
gracefully) if mm->mm_users == 0 and exit_mmap/etc was already called
(or is in progress).

-- 
MST

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-01 16:22       ` Michael S. Tsirkin
@ 2016-03-01 16:35         ` Michal Hocko
  2016-03-01 16:46           ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2016-03-01 16:35 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue 01-03-16 18:22:32, Michael S. Tsirkin wrote:
> On Tue, Mar 01, 2016 at 05:08:13PM +0100, Michal Hocko wrote:
> > On Tue 01-03-16 17:57:04, Michael S. Tsirkin wrote:
> > > On Tue, Mar 01, 2016 at 04:52:12PM +0100, Michal Hocko wrote:
> > > > [CCing vhost-net maintainer]
> > > > 
> > > > On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> > > > > An mm_struct may be pinned by a file. An example is vhost-net device
> > > > > created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> > > > > vhost_dev_set_owner).
> > > > 
> > > > The more I think about that the more I am wondering whether this is
> > > > actually OK and correct. Why does the driver have to pin the address
> > > > space? Nothing really prevents from parallel tearing down of the address
> > > > space anyway so the code cannot expect all the vmas to stay. Would it be
> > > > enough to pin the mm_struct only?
> > > 
> > > I'll need to research this. It's a fact that as long as the
> > > device is not stopped, vhost can attempt to access
> > > the address space.
> > 
> > But does it expect any specific parts of the address space to be mapped?
> > E.g. proc needs to keep the mm allocated as well for some files but it
> > doesn't pin the address space (mm_users) but rather mm_count (see
> > proc_mem_open).
> 
> At a quick glance, it seems that it's needed: it calls
> get_user_pages(mm) and that looks like it will not DTRT (or even fail
> gracefully) if mm->mm_users == 0 and exit_mmap/etc was already called
> (or is in progress).

yes it will fail gracefully but what does prevent from munmap now? The
VMA can go away and get_user_pages would fail as well.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-01 16:35         ` Michal Hocko
@ 2016-03-01 16:46           ` Michael S. Tsirkin
  2016-03-01 17:17             ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2016-03-01 16:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue, Mar 01, 2016 at 05:35:37PM +0100, Michal Hocko wrote:
> On Tue 01-03-16 18:22:32, Michael S. Tsirkin wrote:
> > On Tue, Mar 01, 2016 at 05:08:13PM +0100, Michal Hocko wrote:
> > > On Tue 01-03-16 17:57:04, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 01, 2016 at 04:52:12PM +0100, Michal Hocko wrote:
> > > > > [CCing vhost-net maintainer]
> > > > > 
> > > > > On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> > > > > > An mm_struct may be pinned by a file. An example is vhost-net device
> > > > > > created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> > > > > > vhost_dev_set_owner).
> > > > > 
> > > > > The more I think about that the more I am wondering whether this is
> > > > > actually OK and correct. Why does the driver have to pin the address
> > > > > space? Nothing really prevents from parallel tearing down of the address
> > > > > space anyway so the code cannot expect all the vmas to stay. Would it be
> > > > > enough to pin the mm_struct only?
> > > > 
> > > > I'll need to research this. It's a fact that as long as the
> > > > device is not stopped, vhost can attempt to access
> > > > the address space.
> > > 
> > > But does it expect any specific parts of the address space to be mapped?
> > > E.g. proc needs to keep the mm allocated as well for some files but it
> > > doesn't pin the address space (mm_users) but rather mm_count (see
> > > proc_mem_open).
> > 
> > At a quick glance, it seems that it's needed: it calls
> > get_user_pages(mm) and that looks like it will not DTRT (or even fail
> > gracefully) if mm->mm_users == 0 and exit_mmap/etc was already called
> > (or is in progress).
> 
> yes it will fail gracefully


What makes get_user_pages fail gracefully in this case,
if it races with task exiting?

> but what does prevent from munmap now? The
> VMA can go away and get_user_pages would fail as well.
> -- 
> Michal Hocko
> SUSE Labs

That's user error -> user gets -EFAULT.

-- 
MST

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-01 16:46           ` Michael S. Tsirkin
@ 2016-03-01 17:17             ` Michal Hocko
  2016-03-01 17:20               ` Michael S. Tsirkin
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2016-03-01 17:17 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue 01-03-16 18:46:38, Michael S. Tsirkin wrote:
> On Tue, Mar 01, 2016 at 05:35:37PM +0100, Michal Hocko wrote:
> > On Tue 01-03-16 18:22:32, Michael S. Tsirkin wrote:
> > > On Tue, Mar 01, 2016 at 05:08:13PM +0100, Michal Hocko wrote:
> > > > On Tue 01-03-16 17:57:04, Michael S. Tsirkin wrote:
> > > > > On Tue, Mar 01, 2016 at 04:52:12PM +0100, Michal Hocko wrote:
> > > > > > [CCing vhost-net maintainer]
> > > > > > 
> > > > > > On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> > > > > > > An mm_struct may be pinned by a file. An example is vhost-net device
> > > > > > > created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> > > > > > > vhost_dev_set_owner).
> > > > > > 
> > > > > > The more I think about that the more I am wondering whether this is
> > > > > > actually OK and correct. Why does the driver have to pin the address
> > > > > > space? Nothing really prevents from parallel tearing down of the address
> > > > > > space anyway so the code cannot expect all the vmas to stay. Would it be
> > > > > > enough to pin the mm_struct only?
> > > > > 
> > > > > I'll need to research this. It's a fact that as long as the
> > > > > device is not stopped, vhost can attempt to access
> > > > > the address space.
> > > > 
> > > > But does it expect any specific parts of the address space to be mapped?
> > > > E.g. proc needs to keep the mm allocated as well for some files but it
> > > > doesn't pin the address space (mm_users) but rather mm_count (see
> > > > proc_mem_open).
> > > 
> > > At a quick glance, it seems that it's needed: it calls
> > > get_user_pages(mm) and that looks like it will not DTRT (or even fail
> > > gracefully) if mm->mm_users == 0 and exit_mmap/etc was already called
> > > (or is in progress).
> > 
> > yes it will fail gracefully
> 
> 
> What makes get_user_pages fail gracefully in this case,
> if it races with task exiting?

Sorry, I could have been more verbose... The code would have to make sure
that the mm is still alive before calling g-u-p by
atomic_inc_not_zero(&mm->mm_users) and fail if the user count dropped to
0 in the mean time. See how fs/proc/task_mmu.c does that (proc_mem_open
+ m_start + m_stop.

The biggest advanatage would be that the mm address space pin would be
only for the particular operation. Not sure whether that is possible in
the driver though. Anyway pinning the mm for a potentially unbounded
amount of time doesn't sound too nice.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-01 17:17             ` Michal Hocko
@ 2016-03-01 17:20               ` Michael S. Tsirkin
  2016-03-14 16:39                 ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Michael S. Tsirkin @ 2016-03-01 17:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue, Mar 01, 2016 at 06:17:58PM +0100, Michal Hocko wrote:
> On Tue 01-03-16 18:46:38, Michael S. Tsirkin wrote:
> > On Tue, Mar 01, 2016 at 05:35:37PM +0100, Michal Hocko wrote:
> > > On Tue 01-03-16 18:22:32, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 01, 2016 at 05:08:13PM +0100, Michal Hocko wrote:
> > > > > On Tue 01-03-16 17:57:04, Michael S. Tsirkin wrote:
> > > > > > On Tue, Mar 01, 2016 at 04:52:12PM +0100, Michal Hocko wrote:
> > > > > > > [CCing vhost-net maintainer]
> > > > > > > 
> > > > > > > On Mon 29-02-16 20:02:09, Vladimir Davydov wrote:
> > > > > > > > An mm_struct may be pinned by a file. An example is vhost-net device
> > > > > > > > created by a qemu/kvm (see vhost_net_ioctl -> vhost_net_set_owner ->
> > > > > > > > vhost_dev_set_owner).
> > > > > > > 
> > > > > > > The more I think about that the more I am wondering whether this is
> > > > > > > actually OK and correct. Why does the driver have to pin the address
> > > > > > > space? Nothing really prevents from parallel tearing down of the address
> > > > > > > space anyway so the code cannot expect all the vmas to stay. Would it be
> > > > > > > enough to pin the mm_struct only?
> > > > > > 
> > > > > > I'll need to research this. It's a fact that as long as the
> > > > > > device is not stopped, vhost can attempt to access
> > > > > > the address space.
> > > > > 
> > > > > But does it expect any specific parts of the address space to be mapped?
> > > > > E.g. proc needs to keep the mm allocated as well for some files but it
> > > > > doesn't pin the address space (mm_users) but rather mm_count (see
> > > > > proc_mem_open).
> > > > 
> > > > At a quick glance, it seems that it's needed: it calls
> > > > get_user_pages(mm) and that looks like it will not DTRT (or even fail
> > > > gracefully) if mm->mm_users == 0 and exit_mmap/etc was already called
> > > > (or is in progress).
> > > 
> > > yes it will fail gracefully
> > 
> > 
> > What makes get_user_pages fail gracefully in this case,
> > if it races with task exiting?
> 
> Sorry, I could have been more verbose... The code would have to make sure
> that the mm is still alive before calling g-u-p by
> atomic_inc_not_zero(&mm->mm_users) and fail if the user count dropped to
> 0 in the mean time. See how fs/proc/task_mmu.c does that (proc_mem_open
> + m_start + m_stop.
> 
> The biggest advanatage would be that the mm address space pin would be
> only for the particular operation. Not sure whether that is possible in
> the driver though. Anyway pinning the mm for a potentially unbounded
> amount of time doesn't sound too nice.
> -- 
> Michal Hocko
> SUSE Labs

Hmm that would be another atomic on data path ...
I'd have to explore that.

-- 
MST

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-01 17:20               ` Michael S. Tsirkin
@ 2016-03-14 16:39                 ` Michal Hocko
  2016-06-07 12:50                   ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2016-03-14 16:39 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue 01-03-16 19:20:24, Michael S. Tsirkin wrote:
> On Tue, Mar 01, 2016 at 06:17:58PM +0100, Michal Hocko wrote:
[...]
> > Sorry, I could have been more verbose... The code would have to make sure
> > that the mm is still alive before calling g-u-p by
> > atomic_inc_not_zero(&mm->mm_users) and fail if the user count dropped to
> > 0 in the mean time. See how fs/proc/task_mmu.c does that (proc_mem_open
> > + m_start + m_stop.
> > 
> > The biggest advanatage would be that the mm address space pin would be
> > only for the particular operation. Not sure whether that is possible in
> > the driver though. Anyway pinning the mm for a potentially unbounded
> > amount of time doesn't sound too nice.
> 
> Hmm that would be another atomic on data path ...
> I'd have to explore that.

Did you have any chance to look into this?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-03-14 16:39                 ` Michal Hocko
@ 2016-06-07 12:50                   ` Michal Hocko
  2016-06-13 11:50                     ` Michal Hocko
  0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2016-06-07 12:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Mon 14-03-16 17:39:43, Michal Hocko wrote:
> On Tue 01-03-16 19:20:24, Michael S. Tsirkin wrote:
> > On Tue, Mar 01, 2016 at 06:17:58PM +0100, Michal Hocko wrote:
> [...]
> > > Sorry, I could have been more verbose... The code would have to make sure
> > > that the mm is still alive before calling g-u-p by
> > > atomic_inc_not_zero(&mm->mm_users) and fail if the user count dropped to
> > > 0 in the mean time. See how fs/proc/task_mmu.c does that (proc_mem_open
> > > + m_start + m_stop.
> > > 
> > > The biggest advanatage would be that the mm address space pin would be
> > > only for the particular operation. Not sure whether that is possible in
> > > the driver though. Anyway pinning the mm for a potentially unbounded
> > > amount of time doesn't sound too nice.
> > 
> > Hmm that would be another atomic on data path ...
> > I'd have to explore that.
> 
> Did you have any chance to look into this?

So this is my take to get rid of mm_users pinning for an unbounded
amount of time. This is even not compile tested. I am not sure how to
handle when the mm goes away while there are still work items pending.
It seems this is not handled current anyway and only shouts with a
warning so this shouldn't cause a new regression AFAICS. I am not
familiar with the vnet code at all so I might be missing many things,
though. Does the below sound even remotely reasonable to you Michael?
---
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 669fef1e2bb6..47a3e2c832ea 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -343,7 +343,12 @@ static int vhost_worker(void *data)
 
 		if (work) {
 			__set_current_state(TASK_RUNNING);
+			if (!mmget_not_zero(dev->mm)) {
+				pr_warn("vhost: device owner mm got released unexpectedly\n");
+				break;
+			}
 			work->fn(work);
+			mmput(dev->mm);
 			if (need_resched())
 				schedule();
 		} else
@@ -481,7 +486,16 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	}
 
 	/* No owner, become one */
-	dev->mm = get_task_mm(current);
+	task_lock(current);
+	if (current->mm) {
+		dev->mm = current->mm;
+		atomic_inc(&curent->mm->mm_count);
+	}
+	task_unlock(current);
+	if (!dev->mm) {
+		err = -EINVAL;
+		goto err_mm;
+	}
 	worker = kthread_create(vhost_worker, dev, "vhost-%d", current->pid);
 	if (IS_ERR(worker)) {
 		err = PTR_ERR(worker);
@@ -505,7 +519,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
 	dev->worker = NULL;
 err_worker:
 	if (dev->mm)
-		mmput(dev->mm);
+		mmdrop(dev->mm);
 	dev->mm = NULL;
 err_mm:
 	return err;
@@ -583,7 +597,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev, bool locked)
 		dev->worker = NULL;
 	}
 	if (dev->mm)
-		mmput(dev->mm);
+		mmdrop(dev->mm);
 	dev->mm = NULL;
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-06-07 12:50                   ` Michal Hocko
@ 2016-06-13 11:50                     ` Michal Hocko
  2016-06-13 13:52                       ` Tetsuo Handa
  2016-06-13 18:11                       ` Michael S. Tsirkin
  0 siblings, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2016-06-13 11:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Tue 07-06-16 14:50:14, Michal Hocko wrote:
> On Mon 14-03-16 17:39:43, Michal Hocko wrote:
> > On Tue 01-03-16 19:20:24, Michael S. Tsirkin wrote:
> > > On Tue, Mar 01, 2016 at 06:17:58PM +0100, Michal Hocko wrote:
> > [...]
> > > > Sorry, I could have been more verbose... The code would have to make sure
> > > > that the mm is still alive before calling g-u-p by
> > > > atomic_inc_not_zero(&mm->mm_users) and fail if the user count dropped to
> > > > 0 in the mean time. See how fs/proc/task_mmu.c does that (proc_mem_open
> > > > + m_start + m_stop.
> > > > 
> > > > The biggest advanatage would be that the mm address space pin would be
> > > > only for the particular operation. Not sure whether that is possible in
> > > > the driver though. Anyway pinning the mm for a potentially unbounded
> > > > amount of time doesn't sound too nice.
> > > 
> > > Hmm that would be another atomic on data path ...
> > > I'd have to explore that.
> > 
> > Did you have any chance to look into this?
> 
> So this is my take to get rid of mm_users pinning for an unbounded
> amount of time. This is even not compile tested. I am not sure how to
> handle when the mm goes away while there are still work items pending.
> It seems this is not handled current anyway and only shouts with a
> warning so this shouldn't cause a new regression AFAICS. I am not
> familiar with the vnet code at all so I might be missing many things,
> though. Does the below sound even remotely reasonable to you Michael?

I have checked the vnet code and it doesn't seem to rely on
copy_from_user/get_user AFAICS. Other users of use_mm() need to copy to
the userspace only as well. So we should be perfectly safe to OOM reap
address space even when it is shared by the kthread [1] so this is
not really needed for the OOM correctness purpose. It would be much
nicer if the kthread didn't pin the mm for two long outside of the OOM
handling as well of course but that lowers the priority of the change.

[1] http://lkml.kernel.org/r/20160613112348.GC6518@dhcp22.suse.cz
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-06-13 11:50                     ` Michal Hocko
@ 2016-06-13 13:52                       ` Tetsuo Handa
  2016-06-13 14:00                         ` Michal Hocko
  2016-06-13 18:11                       ` Michael S. Tsirkin
  1 sibling, 1 reply; 18+ messages in thread
From: Tetsuo Handa @ 2016-06-13 13:52 UTC (permalink / raw)
  To: mhocko, mst; +Cc: vdavydov, akpm, rientjes, linux-mm, linux-kernel

Michal Hocko wrote:
> I have checked the vnet code and it doesn't seem to rely on
> copy_from_user/get_user AFAICS. Other users of use_mm() need to copy to
> the userspace only as well. So we should be perfectly safe to OOM reap
> address space even when it is shared by the kthread [1] so this is
> not really needed for the OOM correctness purpose. It would be much
> nicer if the kthread didn't pin the mm for two long outside of the OOM
> handling as well of course but that lowers the priority of the change.
> 
> [1] http://lkml.kernel.org/r/20160613112348.GC6518@dhcp22.suse.cz

It seems to me that vhost code relies on copy from the userspace.

use_mm(dev->mm) and unuse_mm(dev->mm) are used inside vhost_worker().
work->fn(work) is initialized by vhost_work_init().
vhost_scsi_open() passes vhost_scsi_complete_cmd_work() and
vhost_scsi_evt_work() as ->fn, and both functions call __get_user().

vhost_scsi_complete_cmd_work() {
  vhost_signal() {
    vhost_notify() {
      __get_user()
    }
  }
}

vhost_scsi_evt_work() {
  vhost_scsi_do_evt_work() {
    vhost_get_vq_desc() {
      __get_user() / __copy_from_user()
      get_indirect() {
        copy_from_iter()
      }
    }
  }
}

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-06-13 13:52                       ` Tetsuo Handa
@ 2016-06-13 14:00                         ` Michal Hocko
  0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2016-06-13 14:00 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: mst, vdavydov, akpm, rientjes, linux-mm, linux-kernel

On Mon 13-06-16 22:52:43, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I have checked the vnet code and it doesn't seem to rely on
> > copy_from_user/get_user AFAICS. Other users of use_mm() need to copy to
> > the userspace only as well. So we should be perfectly safe to OOM reap
> > address space even when it is shared by the kthread [1] so this is
> > not really needed for the OOM correctness purpose. It would be much
> > nicer if the kthread didn't pin the mm for two long outside of the OOM
> > handling as well of course but that lowers the priority of the change.
> > 
> > [1] http://lkml.kernel.org/r/20160613112348.GC6518@dhcp22.suse.cz
> 
> It seems to me that vhost code relies on copy from the userspace.
> 
> use_mm(dev->mm) and unuse_mm(dev->mm) are used inside vhost_worker().
> work->fn(work) is initialized by vhost_work_init().
> vhost_scsi_open() passes vhost_scsi_complete_cmd_work() and
> vhost_scsi_evt_work() as ->fn, and both functions call __get_user().
> 
> vhost_scsi_complete_cmd_work() {
>   vhost_signal() {
>     vhost_notify() {
>       __get_user()
>     }
>   }
> }
> 
> vhost_scsi_evt_work() {
>   vhost_scsi_do_evt_work() {
>     vhost_get_vq_desc() {
>       __get_user() / __copy_from_user()
>       get_indirect() {
>         copy_from_iter()
>       }
>     }
>   }
> }

Ahh, I've missed those. Thanks for pointing this out! Let me try to find
out whether the code is robust to see unexpected 0 when reading from the
userspace.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH] exit: clear TIF_MEMDIE after exit_task_work
  2016-06-13 11:50                     ` Michal Hocko
  2016-06-13 13:52                       ` Tetsuo Handa
@ 2016-06-13 18:11                       ` Michael S. Tsirkin
  1 sibling, 0 replies; 18+ messages in thread
From: Michael S. Tsirkin @ 2016-06-13 18:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vladimir Davydov, Andrew Morton, Tetsuo Handa, David Rientjes,
	linux-mm, linux-kernel

On Mon, Jun 13, 2016 at 01:50:41PM +0200, Michal Hocko wrote:
> On Tue 07-06-16 14:50:14, Michal Hocko wrote:
> > On Mon 14-03-16 17:39:43, Michal Hocko wrote:
> > > On Tue 01-03-16 19:20:24, Michael S. Tsirkin wrote:
> > > > On Tue, Mar 01, 2016 at 06:17:58PM +0100, Michal Hocko wrote:
> > > [...]
> > > > > Sorry, I could have been more verbose... The code would have to make sure
> > > > > that the mm is still alive before calling g-u-p by
> > > > > atomic_inc_not_zero(&mm->mm_users) and fail if the user count dropped to
> > > > > 0 in the mean time. See how fs/proc/task_mmu.c does that (proc_mem_open
> > > > > + m_start + m_stop.
> > > > > 
> > > > > The biggest advanatage would be that the mm address space pin would be
> > > > > only for the particular operation. Not sure whether that is possible in
> > > > > the driver though. Anyway pinning the mm for a potentially unbounded
> > > > > amount of time doesn't sound too nice.
> > > > 
> > > > Hmm that would be another atomic on data path ...
> > > > I'd have to explore that.
> > > 
> > > Did you have any chance to look into this?
> > 
> > So this is my take to get rid of mm_users pinning for an unbounded
> > amount of time. This is even not compile tested. I am not sure how to
> > handle when the mm goes away while there are still work items pending.
> > It seems this is not handled current anyway and only shouts with a
> > warning so this shouldn't cause a new regression AFAICS. I am not
> > familiar with the vnet code at all so I might be missing many things,
> > though. Does the below sound even remotely reasonable to you Michael?
> 
> I have checked the vnet code and it doesn't seem to rely on
> copy_from_user/get_user AFAICS. Other users of use_mm() need to copy to
> the userspace only as well.

E.g. vhost_get_vq_desc calls __get_user quite a lot.
I think it should be safe for it to fail, but
not to trigger warnings/errors.

> So we should be perfectly safe to OOM reap
> address space even when it is shared by the kthread [1] so this is
> not really needed for the OOM correctness purpose. It would be much
> nicer if the kthread didn't pin the mm for two long outside of the OOM
> handling as well of course but that lowers the priority of the change.
> 
> [1] http://lkml.kernel.org/r/20160613112348.GC6518@dhcp22.suse.cz
> -- 
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2016-06-13 18:11 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-02-29 17:02 [PATCH] exit: clear TIF_MEMDIE after exit_task_work Vladimir Davydov
2016-02-29 18:21 ` Michal Hocko
2016-02-29 18:44   ` Michal Hocko
2016-03-01 15:52 ` Michal Hocko
2016-03-01 15:57   ` Michael S. Tsirkin
2016-03-01 16:08     ` Michal Hocko
2016-03-01 16:14       ` Michael S. Tsirkin
2016-03-01 16:22       ` Michael S. Tsirkin
2016-03-01 16:35         ` Michal Hocko
2016-03-01 16:46           ` Michael S. Tsirkin
2016-03-01 17:17             ` Michal Hocko
2016-03-01 17:20               ` Michael S. Tsirkin
2016-03-14 16:39                 ` Michal Hocko
2016-06-07 12:50                   ` Michal Hocko
2016-06-13 11:50                     ` Michal Hocko
2016-06-13 13:52                       ` Tetsuo Handa
2016-06-13 14:00                         ` Michal Hocko
2016-06-13 18:11                       ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).