[v7,7/7] mm/madvise: allow KSM hints for remote API
diff mbox series

Message ID 20200302193630.68771-8-minchan@kernel.org
State In Next
Commit 56377dc6bed84e61da5772c739087b8b7d78d350
Headers show
Series
  • introduce memory hinting API for external process
Related show

Commit Message

Minchan Kim March 2, 2020, 7:36 p.m. UTC
From: Oleksandr Natalenko <oleksandr@redhat.com>

It all began with the fact that KSM works only on memory that is marked
by madvise(). And the only way to get around that is to either:

  * use LD_PRELOAD; or
  * patch the kernel with something like UKSM or PKSM.

(i skip ptrace can of worms here intentionally)

To overcome this restriction, lets employ a new remote madvise API. This
can be used by some small userspace helper daemon that will do auto-KSM
job for us.

I think of two major consumers of remote KSM hints:

  * hosts, that run containers, especially similar ones and especially in
    a trusted environment, sharing the same runtime like Node.js;

  * heavy applications, that can be run in multiple instances, not
    limited to opensource ones like Firefox, but also those that cannot be
    modified since they are binary-only and, maybe, statically linked.

Speaking of statistics, more numbers can be found in the very first
submission, that is related to this one [1]. For my current setup with
two Firefox instances I get 100 to 200 MiB saved for the second instance
depending on the amount of tabs.

1 FF instance with 15 tabs:

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   410

2 FF instances, second one has 12 tabs (all the tabs are different):

   $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
   592

At the very moment I do not have specific numbers for containerised
workload, but those should be comparable in case the containers share
similar/same runtime.

[1] https://lore.kernel.org/patchwork/patch/1012142/

Reviewed-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/madvise.c | 4 ++++
 1 file changed, 4 insertions(+)

Comments

Vlastimil Babka March 6, 2020, 1:13 p.m. UTC | #1
On 3/2/20 8:36 PM, Minchan Kim wrote:
> From: Oleksandr Natalenko <oleksandr@redhat.com>
> 
> It all began with the fact that KSM works only on memory that is marked
> by madvise(). And the only way to get around that is to either:
> 
>   * use LD_PRELOAD; or
>   * patch the kernel with something like UKSM or PKSM.
> 
> (i skip ptrace can of worms here intentionally)
> 
> To overcome this restriction, lets employ a new remote madvise API. This
> can be used by some small userspace helper daemon that will do auto-KSM
> job for us.
> 
> I think of two major consumers of remote KSM hints:
> 
>   * hosts, that run containers, especially similar ones and especially in
>     a trusted environment, sharing the same runtime like Node.js;
> 
>   * heavy applications, that can be run in multiple instances, not
>     limited to opensource ones like Firefox, but also those that cannot be
>     modified since they are binary-only and, maybe, statically linked.
> 
> Speaking of statistics, more numbers can be found in the very first
> submission, that is related to this one [1]. For my current setup with
> two Firefox instances I get 100 to 200 MiB saved for the second instance
> depending on the amount of tabs.
> 
> 1 FF instance with 15 tabs:
> 
>    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
>    410
> 
> 2 FF instances, second one has 12 tabs (all the tabs are different):
> 
>    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
>    592
> 
> At the very moment I do not have specific numbers for containerised
> workload, but those should be comparable in case the containers share
> similar/same runtime.
> 
> [1] https://lore.kernel.org/patchwork/patch/1012142/
> 
> Reviewed-by: SeongJae Park <sjpark@amazon.de>
> Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

This will lead to one process calling unmerge_ksm_pages() of another. There's a
(signal_pending(current)) test there, should it check also the other task,
analogically to task 3?
Then break_ksm() is fine as it is, as ksmd also calls it, right?

> ---
>  mm/madvise.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index e77c6c1fad34..f4fa962ee74d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior)
>  	switch (behavior) {
>  	case MADV_COLD:
>  	case MADV_PAGEOUT:
> +#ifdef CONFIG_KSM
> +	case MADV_MERGEABLE:
> +	case MADV_UNMERGEABLE:
> +#endif
>  		return true;
>  	default:
>  		return false;
>
Oleksandr Natalenko March 6, 2020, 1:41 p.m. UTC | #2
On Fri, Mar 06, 2020 at 02:13:49PM +0100, Vlastimil Babka wrote:
> On 3/2/20 8:36 PM, Minchan Kim wrote:
> > From: Oleksandr Natalenko <oleksandr@redhat.com>
> > 
> > It all began with the fact that KSM works only on memory that is marked
> > by madvise(). And the only way to get around that is to either:
> > 
> >   * use LD_PRELOAD; or
> >   * patch the kernel with something like UKSM or PKSM.
> > 
> > (i skip ptrace can of worms here intentionally)
> > 
> > To overcome this restriction, lets employ a new remote madvise API. This
> > can be used by some small userspace helper daemon that will do auto-KSM
> > job for us.
> > 
> > I think of two major consumers of remote KSM hints:
> > 
> >   * hosts, that run containers, especially similar ones and especially in
> >     a trusted environment, sharing the same runtime like Node.js;
> > 
> >   * heavy applications, that can be run in multiple instances, not
> >     limited to opensource ones like Firefox, but also those that cannot be
> >     modified since they are binary-only and, maybe, statically linked.
> > 
> > Speaking of statistics, more numbers can be found in the very first
> > submission, that is related to this one [1]. For my current setup with
> > two Firefox instances I get 100 to 200 MiB saved for the second instance
> > depending on the amount of tabs.
> > 
> > 1 FF instance with 15 tabs:
> > 
> >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> >    410
> > 
> > 2 FF instances, second one has 12 tabs (all the tabs are different):
> > 
> >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> >    592
> > 
> > At the very moment I do not have specific numbers for containerised
> > workload, but those should be comparable in case the containers share
> > similar/same runtime.
> > 
> > [1] https://lore.kernel.org/patchwork/patch/1012142/
> > 
> > Reviewed-by: SeongJae Park <sjpark@amazon.de>
> > Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> This will lead to one process calling unmerge_ksm_pages() of another. There's a
> (signal_pending(current)) test there, should it check also the other task,
> analogically to task 3?

Do we care about current there then? Shall we just pass mm into unmerge_ksm_pages and check the signals of the target task only, be it current or something else?

> Then break_ksm() is fine as it is, as ksmd also calls it, right?

I think break_ksm() cares only about mmap_sem protection, so we should
be fine here.

> 
> > ---
> >  mm/madvise.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index e77c6c1fad34..f4fa962ee74d 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior)
> >  	switch (behavior) {
> >  	case MADV_COLD:
> >  	case MADV_PAGEOUT:
> > +#ifdef CONFIG_KSM
> > +	case MADV_MERGEABLE:
> > +	case MADV_UNMERGEABLE:
> > +#endif
> >  		return true;
> >  	default:
> >  		return false;
> > 
>
Vlastimil Babka March 6, 2020, 4:08 p.m. UTC | #3
On 3/6/20 2:41 PM, Oleksandr Natalenko wrote:
> On Fri, Mar 06, 2020 at 02:13:49PM +0100, Vlastimil Babka wrote:
>> On 3/2/20 8:36 PM, Minchan Kim wrote:
>> > From: Oleksandr Natalenko <oleksandr@redhat.com>
>> > 
>> > It all began with the fact that KSM works only on memory that is marked
>> > by madvise(). And the only way to get around that is to either:
>> > 
>> >   * use LD_PRELOAD; or
>> >   * patch the kernel with something like UKSM or PKSM.
>> > 
>> > (i skip ptrace can of worms here intentionally)
>> > 
>> > To overcome this restriction, lets employ a new remote madvise API. This
>> > can be used by some small userspace helper daemon that will do auto-KSM
>> > job for us.
>> > 
>> > I think of two major consumers of remote KSM hints:
>> > 
>> >   * hosts, that run containers, especially similar ones and especially in
>> >     a trusted environment, sharing the same runtime like Node.js;

Ah, I forgot to ask, given the discussion of races in patch 2 (Question 2),
where android can stop the tasks to apply the madvise hints in a race-free
manner, how does that work for remote KSM hints in your scenarios, especially
the one above?

>> > 
>> >   * heavy applications, that can be run in multiple instances, not
>> >     limited to opensource ones like Firefox, but also those that cannot be
>> >     modified since they are binary-only and, maybe, statically linked.
>> > 
>> > Speaking of statistics, more numbers can be found in the very first
>> > submission, that is related to this one [1]. For my current setup with
>> > two Firefox instances I get 100 to 200 MiB saved for the second instance
>> > depending on the amount of tabs.
>> > 
>> > 1 FF instance with 15 tabs:
>> > 
>> >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
>> >    410
>> > 
>> > 2 FF instances, second one has 12 tabs (all the tabs are different):
>> > 
>> >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
>> >    592
>> > 
>> > At the very moment I do not have specific numbers for containerised
>> > workload, but those should be comparable in case the containers share
>> > similar/same runtime.
>> > 
>> > [1] https://lore.kernel.org/patchwork/patch/1012142/
>> > 
>> > Reviewed-by: SeongJae Park <sjpark@amazon.de>
>> > Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
>> > Signed-off-by: Minchan Kim <minchan@kernel.org>
>> 
>> This will lead to one process calling unmerge_ksm_pages() of another. There's a
>> (signal_pending(current)) test there, should it check also the other task,
>> analogically to task 3?
> 
> Do we care about current there then? Shall we just pass mm into unmerge_ksm_pages and check the signals of the target task only, be it current or something else?

Dunno, it's nice to react to signals quickly, for any proces that gets them, no?

>> Then break_ksm() is fine as it is, as ksmd also calls it, right?
> 
> I think break_ksm() cares only about mmap_sem protection, so we should
> be fine here.
> 
>> 
>> > ---
>> >  mm/madvise.c | 4 ++++
>> >  1 file changed, 4 insertions(+)
>> > 
>> > diff --git a/mm/madvise.c b/mm/madvise.c
>> > index e77c6c1fad34..f4fa962ee74d 100644
>> > --- a/mm/madvise.c
>> > +++ b/mm/madvise.c
>> > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior)
>> >  	switch (behavior) {
>> >  	case MADV_COLD:
>> >  	case MADV_PAGEOUT:
>> > +#ifdef CONFIG_KSM
>> > +	case MADV_MERGEABLE:
>> > +	case MADV_UNMERGEABLE:
>> > +#endif
>> >  		return true;
>> >  	default:
>> >  		return false;
>> > 
>> 
>
Oleksandr Natalenko March 9, 2020, 1:11 p.m. UTC | #4
On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
> On 3/6/20 2:41 PM, Oleksandr Natalenko wrote:
> > On Fri, Mar 06, 2020 at 02:13:49PM +0100, Vlastimil Babka wrote:
> >> On 3/2/20 8:36 PM, Minchan Kim wrote:
> >> > From: Oleksandr Natalenko <oleksandr@redhat.com>
> >> > 
> >> > It all began with the fact that KSM works only on memory that is marked
> >> > by madvise(). And the only way to get around that is to either:
> >> > 
> >> >   * use LD_PRELOAD; or
> >> >   * patch the kernel with something like UKSM or PKSM.
> >> > 
> >> > (i skip ptrace can of worms here intentionally)
> >> > 
> >> > To overcome this restriction, lets employ a new remote madvise API. This
> >> > can be used by some small userspace helper daemon that will do auto-KSM
> >> > job for us.
> >> > 
> >> > I think of two major consumers of remote KSM hints:
> >> > 
> >> >   * hosts, that run containers, especially similar ones and especially in
> >> >     a trusted environment, sharing the same runtime like Node.js;
> 
> Ah, I forgot to ask, given the discussion of races in patch 2 (Question 2),
> where android can stop the tasks to apply the madvise hints in a race-free
> manner, how does that work for remote KSM hints in your scenarios, especially
> the one above?

We have cgroup.freeze for that.

> 
> >> > 
> >> >   * heavy applications, that can be run in multiple instances, not
> >> >     limited to opensource ones like Firefox, but also those that cannot be
> >> >     modified since they are binary-only and, maybe, statically linked.
> >> > 
> >> > Speaking of statistics, more numbers can be found in the very first
> >> > submission, that is related to this one [1]. For my current setup with
> >> > two Firefox instances I get 100 to 200 MiB saved for the second instance
> >> > depending on the amount of tabs.
> >> > 
> >> > 1 FF instance with 15 tabs:
> >> > 
> >> >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> >> >    410
> >> > 
> >> > 2 FF instances, second one has 12 tabs (all the tabs are different):
> >> > 
> >> >    $ echo "$(cat /sys/kernel/mm/ksm/pages_sharing) * 4 / 1024" | bc
> >> >    592
> >> > 
> >> > At the very moment I do not have specific numbers for containerised
> >> > workload, but those should be comparable in case the containers share
> >> > similar/same runtime.
> >> > 
> >> > [1] https://lore.kernel.org/patchwork/patch/1012142/
> >> > 
> >> > Reviewed-by: SeongJae Park <sjpark@amazon.de>
> >> > Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
> >> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> >> 
> >> This will lead to one process calling unmerge_ksm_pages() of another. There's a
> >> (signal_pending(current)) test there, should it check also the other task,
> >> analogically to task 3?
> > 
> > Do we care about current there then? Shall we just pass mm into unmerge_ksm_pages and check the signals of the target task only, be it current or something else?
> 
> Dunno, it's nice to react to signals quickly, for any proces that gets them, no?

So, do you mean something like this?

===
diff --git a/mm/ksm.c b/mm/ksm.c
index 363ec8189561..b39c237cfcf4 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
 	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
 		if (ksm_test_exit(vma->vm_mm))
 			break;
-		if (signal_pending(current))
+		if (signal_pending(current) ||
+		    signal_pending(rcu_dereference(vma->vm_mm->owner)))
 			err = -ERESTARTSYS;
 		else
 			err = break_ksm(vma, addr);
===

BTW, this won't work with !CONFIG_MEMCG, so probably task_struct should be
passed through instead. IIUC, this would also require amending struct
mm_slot in order to share the same code path with ksmd.

I'm not sure I've seen such a culprit anywhere else, so I'm in doubt
this would be a correct thing to do.

Ideas?

> 
> >> Then break_ksm() is fine as it is, as ksmd also calls it, right?
> > 
> > I think break_ksm() cares only about mmap_sem protection, so we should
> > be fine here.
> > 
> >> 
> >> > ---
> >> >  mm/madvise.c | 4 ++++
> >> >  1 file changed, 4 insertions(+)
> >> > 
> >> > diff --git a/mm/madvise.c b/mm/madvise.c
> >> > index e77c6c1fad34..f4fa962ee74d 100644
> >> > --- a/mm/madvise.c
> >> > +++ b/mm/madvise.c
> >> > @@ -1005,6 +1005,10 @@ process_madvise_behavior_valid(int behavior)
> >> >  	switch (behavior) {
> >> >  	case MADV_COLD:
> >> >  	case MADV_PAGEOUT:
> >> > +#ifdef CONFIG_KSM
> >> > +	case MADV_MERGEABLE:
> >> > +	case MADV_UNMERGEABLE:
> >> > +#endif
> >> >  		return true;
> >> >  	default:
> >> >  		return false;
> >> > 
> >> 
> > 
>
Michal Hocko March 9, 2020, 3:08 p.m. UTC | #5
On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote:
> On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
[...]
> > Dunno, it's nice to react to signals quickly, for any proces that gets them, no?
> 
> So, do you mean something like this?
> 
> ===
> diff --git a/mm/ksm.c b/mm/ksm.c
> index 363ec8189561..b39c237cfcf4 100644
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
>  	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
>  		if (ksm_test_exit(vma->vm_mm))
>  			break;
> -		if (signal_pending(current))
> +		if (signal_pending(current) ||
> +		    signal_pending(rcu_dereference(vma->vm_mm->owner)))
>  			err = -ERESTARTSYS;
>  		else
>  			err = break_ksm(vma, addr);
> ===

This is broken because mm might be attached to different tasks.
AFAIU this check is meant to allow quick backoff of the _calling_
process so that it doesn't waste time when the context is killed
already. I do not understand why should we care about any other context
here? What is the actual problem this would solve?
Oleksandr Natalenko March 9, 2020, 3:19 p.m. UTC | #6
On Mon, Mar 09, 2020 at 04:08:15PM +0100, Michal Hocko wrote:
> On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote:
> > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
> [...]
> > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no?
> > 
> > So, do you mean something like this?
> > 
> > ===
> > diff --git a/mm/ksm.c b/mm/ksm.c
> > index 363ec8189561..b39c237cfcf4 100644
> > --- a/mm/ksm.c
> > +++ b/mm/ksm.c
> > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
> >  	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
> >  		if (ksm_test_exit(vma->vm_mm))
> >  			break;
> > -		if (signal_pending(current))
> > +		if (signal_pending(current) ||
> > +		    signal_pending(rcu_dereference(vma->vm_mm->owner)))
> >  			err = -ERESTARTSYS;
> >  		else
> >  			err = break_ksm(vma, addr);
> > ===
> 
> This is broken because mm might be attached to different tasks.
> AFAIU this check is meant to allow quick backoff of the _calling_
> process so that it doesn't waste time when the context is killed
> already. I do not understand why should we care about any other context
> here? What is the actual problem this would solve?

I agree with you, but still trying to understand what does Vlastimil mean
:).

> 
> -- 
> Michal Hocko
> SUSE Labs
>
Vlastimil Babka March 9, 2020, 3:42 p.m. UTC | #7
On 3/9/20 4:19 PM, Oleksandr Natalenko wrote:
> On Mon, Mar 09, 2020 at 04:08:15PM +0100, Michal Hocko wrote:
>> On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote:
>> > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
>> [...]
>> > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no?
>> > 
>> > So, do you mean something like this?
>> > 
>> > ===
>> > diff --git a/mm/ksm.c b/mm/ksm.c
>> > index 363ec8189561..b39c237cfcf4 100644
>> > --- a/mm/ksm.c
>> > +++ b/mm/ksm.c
>> > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
>> >  	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
>> >  		if (ksm_test_exit(vma->vm_mm))
>> >  			break;
>> > -		if (signal_pending(current))
>> > +		if (signal_pending(current) ||
>> > +		    signal_pending(rcu_dereference(vma->vm_mm->owner)))
>> >  			err = -ERESTARTSYS;
>> >  		else
>> >  			err = break_ksm(vma, addr);
>> > ===
>> 
>> This is broken because mm might be attached to different tasks.
>> AFAIU this check is meant to allow quick backoff of the _calling_
>> process so that it doesn't waste time when the context is killed
>> already. I do not understand why should we care about any other context
>> here? What is the actual problem this would solve?
> 
> I agree with you, but still trying to understand what does Vlastimil mean
> :).

Well you wondered if we should stop caring about current, and I said that
probably wouldn't be nice.
As for caring about the other task, patch 3/7 does that for
(MADV_COLD|MADV_PAGEOUT) so I just pointed out that the KSM case doesn't. AFAIU
if we don't check the signals, we might be blocking the killed task from exiting?

>> 
>> -- 
>> Michal Hocko
>> SUSE Labs
>> 
>
Michal Hocko March 9, 2020, 4:03 p.m. UTC | #8
On Mon 09-03-20 16:42:43, Vlastimil Babka wrote:
> On 3/9/20 4:19 PM, Oleksandr Natalenko wrote:
> > On Mon, Mar 09, 2020 at 04:08:15PM +0100, Michal Hocko wrote:
> >> On Mon 09-03-20 14:11:17, Oleksandr Natalenko wrote:
> >> > On Fri, Mar 06, 2020 at 05:08:18PM +0100, Vlastimil Babka wrote:
> >> [...]
> >> > > Dunno, it's nice to react to signals quickly, for any proces that gets them, no?
> >> > 
> >> > So, do you mean something like this?
> >> > 
> >> > ===
> >> > diff --git a/mm/ksm.c b/mm/ksm.c
> >> > index 363ec8189561..b39c237cfcf4 100644
> >> > --- a/mm/ksm.c
> >> > +++ b/mm/ksm.c
> >> > @@ -849,7 +849,8 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
> >> >  	for (addr = start; addr < end && !err; addr += PAGE_SIZE) {
> >> >  		if (ksm_test_exit(vma->vm_mm))
> >> >  			break;
> >> > -		if (signal_pending(current))
> >> > +		if (signal_pending(current) ||
> >> > +		    signal_pending(rcu_dereference(vma->vm_mm->owner)))
> >> >  			err = -ERESTARTSYS;
> >> >  		else
> >> >  			err = break_ksm(vma, addr);
> >> > ===
> >> 
> >> This is broken because mm might be attached to different tasks.
> >> AFAIU this check is meant to allow quick backoff of the _calling_
> >> process so that it doesn't waste time when the context is killed
> >> already. I do not understand why should we care about any other context
> >> here? What is the actual problem this would solve?
> > 
> > I agree with you, but still trying to understand what does Vlastimil mean
> > :).
> 
> Well you wondered if we should stop caring about current, and I said that
> probably wouldn't be nice.
> As for caring about the other task, patch 3/7 does that for
> (MADV_COLD|MADV_PAGEOUT) so I just pointed out that the KSM case doesn't. AFAIU
> if we don't check the signals, we might be blocking the killed task from exiting?

I would have to double check but I do not think this would be a problem
because the remote task should take mmget to prevent address space to
vanish under its feet. That should also rule out the exclusive mmap_sem
usage from the exit path.
Jann Horn June 11, 2020, 2:21 a.m. UTC | #9
On Mon, Mar 2, 2020 at 8:36 PM Minchan Kim <minchan@kernel.org> wrote:
> From: Oleksandr Natalenko <oleksandr@redhat.com>
>
> It all began with the fact that KSM works only on memory that is marked
> by madvise(). And the only way to get around that is to either:
[...]
> To overcome this restriction, lets employ a new remote madvise API. This
> can be used by some small userspace helper daemon that will do auto-KSM
> job for us.
>
> I think of two major consumers of remote KSM hints:
[...]
>   * heavy applications, that can be run in multiple instances, not
>     limited to opensource ones like Firefox, but also those that cannot be
>     modified since they are binary-only and, maybe, statically linked.

Just as a note, since you're mentioning Firefox as a usecase: Memory
deduplication between browser renderers creates new side channels and
is a questionable idea from a security standpoint. Memory
deduplication is (mostly) fine if either all involved processes are
trusted or no involved processes contain secrets, but browsers usually
run tons of untrusted code while at the same time containing lots of
valuable secrets.

Patch
diff mbox series

diff --git a/mm/madvise.c b/mm/madvise.c
index e77c6c1fad34..f4fa962ee74d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1005,6 +1005,10 @@  process_madvise_behavior_valid(int behavior)
 	switch (behavior) {
 	case MADV_COLD:
 	case MADV_PAGEOUT:
+#ifdef CONFIG_KSM
+	case MADV_MERGEABLE:
+	case MADV_UNMERGEABLE:
+#endif
 		return true;
 	default:
 		return false;