linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
@ 2022-07-01  8:43 cgel.zte
  2022-07-01  9:11 ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: cgel.zte @ 2022-07-01  8:43 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, mhocko, vbabka, minchan, oleksandr, xu xin

From: xu xin <xu.xin16@zte.com.cn>

The benefits of doing this are obvious because using madvise in user code
is the only current way to enable KSM, which is inconvenient for those
compiled app without marking MERGEABLE wanting to enable KSM.

Since we already have the syscall of process_madvise(), then reusing the
interface to allow external KSM hints is more acceptable [1].

Although this patch was released by Oleksandr Natalenko, but it was
unfortunately terminated without any conclusions because there was debate
on whether it should use signal_pending() to check the target task besides
the task of current() when calling unmerge_ksm_pages of other task [2].

I think it's unneeded to check the target task. For example, when we set
the klob /sys/kernel/mm/ksm/run from 1 to 2,
unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
all other target tasks either.

I hope this patch can get attention again.

[1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@dhcp22.suse.cz/
[2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/

Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
---
 mm/madvise.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index 851fa4e134bc..df915531ad9f 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1173,6 +1173,10 @@ process_madvise_behavior_valid(int behavior)
 	case MADV_COLD:
 	case MADV_PAGEOUT:
 	case MADV_WILLNEED:
+#ifdef CONFIG_KSM
+	case MADV_MERGEABLE:
+	case MADV_UNMERGEABLE:
+#endif
 		return true;
 	default:
 		return false;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01  8:43 [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise cgel.zte
@ 2022-07-01  9:11 ` Michal Hocko
  2022-07-01 10:32   ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2022-07-01  9:11 UTC (permalink / raw)
  To: cgel.zte
  Cc: akpm, linux-mm, linux-kernel, vbabka, minchan, oleksandr, xu xin,
	Jann Horn

[Cc Jann]

On Fri 01-07-22 08:43:23, cgel.zte@gmail.com wrote:
> From: xu xin <xu.xin16@zte.com.cn>
> 
> The benefits of doing this are obvious because using madvise in user code
> is the only current way to enable KSM, which is inconvenient for those
> compiled app without marking MERGEABLE wanting to enable KSM.

I would rephrase:
"
KSM functionality is currently available only to processes which are
using MADV_MERGEABLE directly. This is limiting because there are
usecases which will benefit from enabling KSM on a remote process. One
example would be an application which cannot be modified (e.g. because
it is only distributed as a binary). MORE EXAMPLES WOULD BE REALLY
BENEFICIAL.
"

> Since we already have the syscall of process_madvise(), then reusing the
> interface to allow external KSM hints is more acceptable [1].
> 
> Although this patch was released by Oleksandr Natalenko, but it was
> unfortunately terminated without any conclusions because there was debate
> on whether it should use signal_pending() to check the target task besides
> the task of current() when calling unmerge_ksm_pages of other task [2].

I am not sure this is particularly interesting. I do not remember
details of that discussion but checking signal_pending on a different
task is rarely the right thing to do. In this case the check is meant to
allow bailing out from the operation so that the caller could be
terminated for example.

> I think it's unneeded to check the target task. For example, when we set
> the klob /sys/kernel/mm/ksm/run from 1 to 2,
> unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
> all other target tasks either.
> 
> I hope this patch can get attention again.

One thing that the changelog is missing and it is quite important IMHO
is the permission model. As we have discussed in previous incarnations
of the remote KSM functionality that KSM has some security implications.
It would be really great to refer to that in the changelog for the
future reference (http://lkml.kernel.org/r/CAG48ez0riS60zcA9CC9rUDV=kLS0326Rr23OKv1_RHaTkOOj7A@mail.gmail.com)

So this implementation requires PTRACE_MODE_READ_FSCREDS and
CAP_SYS_NICE so the remote process would need to be allowed to
introspect the address space. This is the same constrain applied to the
remote momory reclaim. Is this sufficient?

I would say yes because to some degree KSM mergning can have very
similar effect to memory reclaim from the side channel POV. But it
should be really documented in the changelog so that it is clear that
this has been a deliberate decision and thought through.

Other than that this looks like the most reasonable approach to me.

> [1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@dhcp22.suse.cz/
> [2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
> 
> Signed-off-by: Oleksandr Natalenko <oleksandr@redhat.com>
> Signed-off-by: xu xin <xu.xin16@zte.com.cn>
> ---
>  mm/madvise.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 851fa4e134bc..df915531ad9f 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1173,6 +1173,10 @@ process_madvise_behavior_valid(int behavior)
>  	case MADV_COLD:
>  	case MADV_PAGEOUT:
>  	case MADV_WILLNEED:
> +#ifdef CONFIG_KSM
> +	case MADV_MERGEABLE:
> +	case MADV_UNMERGEABLE:
> +#endif
>  		return true;
>  	default:
>  		return false;
> -- 
> 2.25.1

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01  9:11 ` Michal Hocko
@ 2022-07-01 10:32   ` David Hildenbrand
  2022-07-01 10:50     ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2022-07-01 10:32 UTC (permalink / raw)
  To: Michal Hocko, cgel.zte
  Cc: akpm, linux-mm, linux-kernel, vbabka, minchan, oleksandr, xu xin,
	Jann Horn

On 01.07.22 11:11, Michal Hocko wrote:
> [Cc Jann]
> 
> On Fri 01-07-22 08:43:23, cgel.zte@gmail.com wrote:
>> From: xu xin <xu.xin16@zte.com.cn>
>>
>> The benefits of doing this are obvious because using madvise in user code
>> is the only current way to enable KSM, which is inconvenient for those
>> compiled app without marking MERGEABLE wanting to enable KSM.
> 
> I would rephrase:
> "
> KSM functionality is currently available only to processes which are
> using MADV_MERGEABLE directly. This is limiting because there are
> usecases which will benefit from enabling KSM on a remote process. One
> example would be an application which cannot be modified (e.g. because
> it is only distributed as a binary). MORE EXAMPLES WOULD BE REALLY
> BENEFICIAL.
> "
> 
>> Since we already have the syscall of process_madvise(), then reusing the
>> interface to allow external KSM hints is more acceptable [1].
>>
>> Although this patch was released by Oleksandr Natalenko, but it was
>> unfortunately terminated without any conclusions because there was debate
>> on whether it should use signal_pending() to check the target task besides
>> the task of current() when calling unmerge_ksm_pages of other task [2].
> 
> I am not sure this is particularly interesting. I do not remember
> details of that discussion but checking signal_pending on a different
> task is rarely the right thing to do. In this case the check is meant to
> allow bailing out from the operation so that the caller could be
> terminated for example.
> 
>> I think it's unneeded to check the target task. For example, when we set
>> the klob /sys/kernel/mm/ksm/run from 1 to 2,
>> unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
>> all other target tasks either.
>>
>> I hope this patch can get attention again.
> 
> One thing that the changelog is missing and it is quite important IMHO
> is the permission model. As we have discussed in previous incarnations
> of the remote KSM functionality that KSM has some security implications.
> It would be really great to refer to that in the changelog for the
> future reference (http://lkml.kernel.org/r/CAG48ez0riS60zcA9CC9rUDV=kLS0326Rr23OKv1_RHaTkOOj7A@mail.gmail.com)
> 
> So this implementation requires PTRACE_MODE_READ_FSCREDS and
> CAP_SYS_NICE so the remote process would need to be allowed to
> introspect the address space. This is the same constrain applied to the
> remote momory reclaim. Is this sufficient?
> 
> I would say yes because to some degree KSM mergning can have very
> similar effect to memory reclaim from the side channel POV. But it
> should be really documented in the changelog so that it is clear that
> this has been a deliberate decision and thought through.
> 
> Other than that this looks like the most reasonable approach to me.
> 
>> [1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@dhcp22.suse.cz/
>> [2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
>>

I have various concerns, but the biggest concern is that this modifies
VMA flags and can possibly break applications.

process_madvise must not modify remote process state.

That's why we only allow a very limited selection that are merely hints.

So nack from my side.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01 10:32   ` David Hildenbrand
@ 2022-07-01 10:50     ` David Hildenbrand
  2022-07-01 12:02       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2022-07-01 10:50 UTC (permalink / raw)
  To: Michal Hocko, cgel.zte
  Cc: akpm, linux-mm, linux-kernel, vbabka, minchan, oleksandr, xu xin,
	Jann Horn

On 01.07.22 12:32, David Hildenbrand wrote:
> On 01.07.22 11:11, Michal Hocko wrote:
>> [Cc Jann]
>>
>> On Fri 01-07-22 08:43:23, cgel.zte@gmail.com wrote:
>>> From: xu xin <xu.xin16@zte.com.cn>
>>>
>>> The benefits of doing this are obvious because using madvise in user code
>>> is the only current way to enable KSM, which is inconvenient for those
>>> compiled app without marking MERGEABLE wanting to enable KSM.
>>
>> I would rephrase:
>> "
>> KSM functionality is currently available only to processes which are
>> using MADV_MERGEABLE directly. This is limiting because there are
>> usecases which will benefit from enabling KSM on a remote process. One
>> example would be an application which cannot be modified (e.g. because
>> it is only distributed as a binary). MORE EXAMPLES WOULD BE REALLY
>> BENEFICIAL.
>> "
>>
>>> Since we already have the syscall of process_madvise(), then reusing the
>>> interface to allow external KSM hints is more acceptable [1].
>>>
>>> Although this patch was released by Oleksandr Natalenko, but it was
>>> unfortunately terminated without any conclusions because there was debate
>>> on whether it should use signal_pending() to check the target task besides
>>> the task of current() when calling unmerge_ksm_pages of other task [2].
>>
>> I am not sure this is particularly interesting. I do not remember
>> details of that discussion but checking signal_pending on a different
>> task is rarely the right thing to do. In this case the check is meant to
>> allow bailing out from the operation so that the caller could be
>> terminated for example.
>>
>>> I think it's unneeded to check the target task. For example, when we set
>>> the klob /sys/kernel/mm/ksm/run from 1 to 2,
>>> unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
>>> all other target tasks either.
>>>
>>> I hope this patch can get attention again.
>>
>> One thing that the changelog is missing and it is quite important IMHO
>> is the permission model. As we have discussed in previous incarnations
>> of the remote KSM functionality that KSM has some security implications.
>> It would be really great to refer to that in the changelog for the
>> future reference (http://lkml.kernel.org/r/CAG48ez0riS60zcA9CC9rUDV=kLS0326Rr23OKv1_RHaTkOOj7A@mail.gmail.com)
>>
>> So this implementation requires PTRACE_MODE_READ_FSCREDS and
>> CAP_SYS_NICE so the remote process would need to be allowed to
>> introspect the address space. This is the same constrain applied to the
>> remote momory reclaim. Is this sufficient?
>>
>> I would say yes because to some degree KSM mergning can have very
>> similar effect to memory reclaim from the side channel POV. But it
>> should be really documented in the changelog so that it is clear that
>> this has been a deliberate decision and thought through.
>>
>> Other than that this looks like the most reasonable approach to me.
>>
>>> [1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@dhcp22.suse.cz/
>>> [2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
>>>
> 
> I have various concerns, but the biggest concern is that this modifies
> VMA flags and can possibly break applications.
> 
> process_madvise must not modify remote process state.
> 
> That's why we only allow a very limited selection that are merely hints.
> 
> So nack from my side.
> 

[I'm quit ebusy, but I think some more explanation might be of value]

One COW example where I think force-enabling KSM for processes is
*currently* not a good idea (besides the side channel discussions, which
is also why Windows stopped to enable KSM system wide a while ago):

App:

a) memset(page, 0);
b) trigger R/O long-term pin on page (e.g., vfio)

If between a) and b) KSM replaces the page by the shared zeropage you'll
get an unreliable pin because we don't break yet COW when taking a R/O
pin on the shared zeropage. And in the traditional sense, the app did
everything right to guarantee that the pin will stay reliable.


Further, if an app explicitly decides to disable KSM one some region, we
should not overwrite that.


I can see that we might want such a (VMA-wide? process-wide?
system-wide?) overwrite, but IMHO we would have to make sure that

a) Any eventual side effects (see above) are handled correctly.
b) The app has an explicit mechanism to certainly disable KSM for a
   region (e.g., storing secrets) -- similarly to MADV_NOHUGEPAGE that
   ensures that there will *not* be huge pages.

IOW, fixes for a) [I'm planning on doing that] and two sets of flags for
b) to distinguish what the app wants and what somebody else might want.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01 10:50     ` David Hildenbrand
@ 2022-07-01 12:02       ` Michal Hocko
  2022-07-01 12:09         ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2022-07-01 12:02 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: cgel.zte, linux-mm, linux-kernel, vbabka, minchan, oleksandr,
	xu xin, Jann Horn, Andrew Morton

On Fri 01-07-22 12:50:59, David Hildenbrand wrote:
> On 01.07.22 12:32, David Hildenbrand wrote:
> > On 01.07.22 11:11, Michal Hocko wrote:
> >> [Cc Jann]
> >>
> >> On Fri 01-07-22 08:43:23, cgel.zte@gmail.com wrote:
> >>> From: xu xin <xu.xin16@zte.com.cn>
> >>>
> >>> The benefits of doing this are obvious because using madvise in user code
> >>> is the only current way to enable KSM, which is inconvenient for those
> >>> compiled app without marking MERGEABLE wanting to enable KSM.
> >>
> >> I would rephrase:
> >> "
> >> KSM functionality is currently available only to processes which are
> >> using MADV_MERGEABLE directly. This is limiting because there are
> >> usecases which will benefit from enabling KSM on a remote process. One
> >> example would be an application which cannot be modified (e.g. because
> >> it is only distributed as a binary). MORE EXAMPLES WOULD BE REALLY
> >> BENEFICIAL.
> >> "
> >>
> >>> Since we already have the syscall of process_madvise(), then reusing the
> >>> interface to allow external KSM hints is more acceptable [1].
> >>>
> >>> Although this patch was released by Oleksandr Natalenko, but it was
> >>> unfortunately terminated without any conclusions because there was debate
> >>> on whether it should use signal_pending() to check the target task besides
> >>> the task of current() when calling unmerge_ksm_pages of other task [2].
> >>
> >> I am not sure this is particularly interesting. I do not remember
> >> details of that discussion but checking signal_pending on a different
> >> task is rarely the right thing to do. In this case the check is meant to
> >> allow bailing out from the operation so that the caller could be
> >> terminated for example.
> >>
> >>> I think it's unneeded to check the target task. For example, when we set
> >>> the klob /sys/kernel/mm/ksm/run from 1 to 2,
> >>> unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
> >>> all other target tasks either.
> >>>
> >>> I hope this patch can get attention again.
> >>
> >> One thing that the changelog is missing and it is quite important IMHO
> >> is the permission model. As we have discussed in previous incarnations
> >> of the remote KSM functionality that KSM has some security implications.
> >> It would be really great to refer to that in the changelog for the
> >> future reference (http://lkml.kernel.org/r/CAG48ez0riS60zcA9CC9rUDV=kLS0326Rr23OKv1_RHaTkOOj7A@mail.gmail.com)
> >>
> >> So this implementation requires PTRACE_MODE_READ_FSCREDS and
> >> CAP_SYS_NICE so the remote process would need to be allowed to
> >> introspect the address space. This is the same constrain applied to the
> >> remote momory reclaim. Is this sufficient?
> >>
> >> I would say yes because to some degree KSM mergning can have very
> >> similar effect to memory reclaim from the side channel POV. But it
> >> should be really documented in the changelog so that it is clear that
> >> this has been a deliberate decision and thought through.
> >>
> >> Other than that this looks like the most reasonable approach to me.
> >>
> >>> [1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@dhcp22.suse.cz/
> >>> [2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
> >>>
> > 
> > I have various concerns, but the biggest concern is that this modifies
> > VMA flags and can possibly break applications.
> > 
> > process_madvise must not modify remote process state.
> > 
> > That's why we only allow a very limited selection that are merely hints.
> > 
> > So nack from my side.
> > 
> 
> [I'm quit ebusy, but I think some more explanation might be of value]
> 
> One COW example where I think force-enabling KSM for processes is
> *currently* not a good idea (besides the side channel discussions, which
> is also why Windows stopped to enable KSM system wide a while ago):
> 
> App:
> 
> a) memset(page, 0);
> b) trigger R/O long-term pin on page (e.g., vfio)
> 
> If between a) and b) KSM replaces the page by the shared zeropage you'll
> get an unreliable pin because we don't break yet COW when taking a R/O
> pin on the shared zeropage. And in the traditional sense, the app did
> everything right to guarantee that the pin will stay reliable.

Isn't this a bug in the existing implementation of the CoW?

> Further, if an app explicitly decides to disable KSM one some region, we
> should not overwrite that.

Well, the interface is rather spartan. You cannot really tell "disable
KSM on some reqion". You can only tell "KSM can be applied to this
region" and later change your mind. Maybe this is what you had in
mind though.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01 12:02       ` Michal Hocko
@ 2022-07-01 12:09         ` David Hildenbrand
  2022-07-01 12:36           ` Michal Hocko
  2022-07-04  8:13           ` CGEL
  0 siblings, 2 replies; 16+ messages in thread
From: David Hildenbrand @ 2022-07-01 12:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgel.zte, linux-mm, linux-kernel, vbabka, minchan, oleksandr,
	xu xin, Jann Horn, Andrew Morton

On 01.07.22 14:02, Michal Hocko wrote:
> On Fri 01-07-22 12:50:59, David Hildenbrand wrote:
>> On 01.07.22 12:32, David Hildenbrand wrote:
>>> On 01.07.22 11:11, Michal Hocko wrote:
>>>> [Cc Jann]
>>>>
>>>> On Fri 01-07-22 08:43:23, cgel.zte@gmail.com wrote:
>>>>> From: xu xin <xu.xin16@zte.com.cn>
>>>>>
>>>>> The benefits of doing this are obvious because using madvise in user code
>>>>> is the only current way to enable KSM, which is inconvenient for those
>>>>> compiled app without marking MERGEABLE wanting to enable KSM.
>>>>
>>>> I would rephrase:
>>>> "
>>>> KSM functionality is currently available only to processes which are
>>>> using MADV_MERGEABLE directly. This is limiting because there are
>>>> usecases which will benefit from enabling KSM on a remote process. One
>>>> example would be an application which cannot be modified (e.g. because
>>>> it is only distributed as a binary). MORE EXAMPLES WOULD BE REALLY
>>>> BENEFICIAL.
>>>> "
>>>>
>>>>> Since we already have the syscall of process_madvise(), then reusing the
>>>>> interface to allow external KSM hints is more acceptable [1].
>>>>>
>>>>> Although this patch was released by Oleksandr Natalenko, but it was
>>>>> unfortunately terminated without any conclusions because there was debate
>>>>> on whether it should use signal_pending() to check the target task besides
>>>>> the task of current() when calling unmerge_ksm_pages of other task [2].
>>>>
>>>> I am not sure this is particularly interesting. I do not remember
>>>> details of that discussion but checking signal_pending on a different
>>>> task is rarely the right thing to do. In this case the check is meant to
>>>> allow bailing out from the operation so that the caller could be
>>>> terminated for example.
>>>>
>>>>> I think it's unneeded to check the target task. For example, when we set
>>>>> the klob /sys/kernel/mm/ksm/run from 1 to 2,
>>>>> unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
>>>>> all other target tasks either.
>>>>>
>>>>> I hope this patch can get attention again.
>>>>
>>>> One thing that the changelog is missing and it is quite important IMHO
>>>> is the permission model. As we have discussed in previous incarnations
>>>> of the remote KSM functionality that KSM has some security implications.
>>>> It would be really great to refer to that in the changelog for the
>>>> future reference (http://lkml.kernel.org/r/CAG48ez0riS60zcA9CC9rUDV=kLS0326Rr23OKv1_RHaTkOOj7A@mail.gmail.com)
>>>>
>>>> So this implementation requires PTRACE_MODE_READ_FSCREDS and
>>>> CAP_SYS_NICE so the remote process would need to be allowed to
>>>> introspect the address space. This is the same constrain applied to the
>>>> remote momory reclaim. Is this sufficient?
>>>>
>>>> I would say yes because to some degree KSM mergning can have very
>>>> similar effect to memory reclaim from the side channel POV. But it
>>>> should be really documented in the changelog so that it is clear that
>>>> this has been a deliberate decision and thought through.
>>>>
>>>> Other than that this looks like the most reasonable approach to me.
>>>>
>>>>> [1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@dhcp22.suse.cz/
>>>>> [2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
>>>>>
>>>
>>> I have various concerns, but the biggest concern is that this modifies
>>> VMA flags and can possibly break applications.
>>>
>>> process_madvise must not modify remote process state.
>>>
>>> That's why we only allow a very limited selection that are merely hints.
>>>
>>> So nack from my side.
>>>
>>
>> [I'm quit ebusy, but I think some more explanation might be of value]
>>
>> One COW example where I think force-enabling KSM for processes is
>> *currently* not a good idea (besides the side channel discussions, which
>> is also why Windows stopped to enable KSM system wide a while ago):
>>
>> App:
>>
>> a) memset(page, 0);
>> b) trigger R/O long-term pin on page (e.g., vfio)
>>
>> If between a) and b) KSM replaces the page by the shared zeropage you'll
>> get an unreliable pin because we don't break yet COW when taking a R/O
>> pin on the shared zeropage. And in the traditional sense, the app did
>> everything right to guarantee that the pin will stay reliable.
> 
> Isn't this a bug in the existing implementation of the CoW?

One the one hand yes (pinning the shared zeropage is questionable), on
the other hand no (user space did modify that memory ahead of time and
filled it with something reasonable, that's how why always worked
correctly in the absence of KSM).

> 
>> Further, if an app explicitly decides to disable KSM one some region, we
>> should not overwrite that.
> 
> Well, the interface is rather spartan. You cannot really tell "disable
> KSM on some reqion". You can only tell "KSM can be applied to this
> region" and later change your mind. Maybe this is what you had in
> mind though.

That's what I meant. The hugepage interface has different semantics and
you get three possible states:

1: yes please: MADV_HUGEPAGE
2: don't care -- don't set anything
3. please no: MADV_NOHUGEPAGE

Currently for KSM we only have 1 and 2 internally I think (single
flag), because it didn't matter in the past ebcause there was no
force-enablement. One could convert it into all 3 states, changing the
semantics of MADV_UNMERGEABLE slightly from


1: yes please: MADV_MERGEABLE
2: don't care: MADV_UNMERGEABLE

to

1: yes please: MADV_MERGEABLE
2: don't care -- don't set anything
3. please no: MADV_UNMERGEABLE


-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01 12:09         ` David Hildenbrand
@ 2022-07-01 12:36           ` Michal Hocko
  2022-07-01 12:39             ` David Hildenbrand
  2022-07-04  8:13           ` CGEL
  1 sibling, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2022-07-01 12:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: cgel.zte, linux-mm, linux-kernel, vbabka, minchan, oleksandr,
	xu xin, Jann Horn, Andrew Morton

On Fri 01-07-22 14:09:24, David Hildenbrand wrote:
> On 01.07.22 14:02, Michal Hocko wrote:
> > On Fri 01-07-22 12:50:59, David Hildenbrand wrote:
> >> On 01.07.22 12:32, David Hildenbrand wrote:
> >>> On 01.07.22 11:11, Michal Hocko wrote:
> >>>> [Cc Jann]
> >>>>
> >>>> On Fri 01-07-22 08:43:23, cgel.zte@gmail.com wrote:
> >>>>> From: xu xin <xu.xin16@zte.com.cn>
> >>>>>
> >>>>> The benefits of doing this are obvious because using madvise in user code
> >>>>> is the only current way to enable KSM, which is inconvenient for those
> >>>>> compiled app without marking MERGEABLE wanting to enable KSM.
> >>>>
> >>>> I would rephrase:
> >>>> "
> >>>> KSM functionality is currently available only to processes which are
> >>>> using MADV_MERGEABLE directly. This is limiting because there are
> >>>> usecases which will benefit from enabling KSM on a remote process. One
> >>>> example would be an application which cannot be modified (e.g. because
> >>>> it is only distributed as a binary). MORE EXAMPLES WOULD BE REALLY
> >>>> BENEFICIAL.
> >>>> "
> >>>>
> >>>>> Since we already have the syscall of process_madvise(), then reusing the
> >>>>> interface to allow external KSM hints is more acceptable [1].
> >>>>>
> >>>>> Although this patch was released by Oleksandr Natalenko, but it was
> >>>>> unfortunately terminated without any conclusions because there was debate
> >>>>> on whether it should use signal_pending() to check the target task besides
> >>>>> the task of current() when calling unmerge_ksm_pages of other task [2].
> >>>>
> >>>> I am not sure this is particularly interesting. I do not remember
> >>>> details of that discussion but checking signal_pending on a different
> >>>> task is rarely the right thing to do. In this case the check is meant to
> >>>> allow bailing out from the operation so that the caller could be
> >>>> terminated for example.
> >>>>
> >>>>> I think it's unneeded to check the target task. For example, when we set
> >>>>> the klob /sys/kernel/mm/ksm/run from 1 to 2,
> >>>>> unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
> >>>>> all other target tasks either.
> >>>>>
> >>>>> I hope this patch can get attention again.
> >>>>
> >>>> One thing that the changelog is missing and it is quite important IMHO
> >>>> is the permission model. As we have discussed in previous incarnations
> >>>> of the remote KSM functionality that KSM has some security implications.
> >>>> It would be really great to refer to that in the changelog for the
> >>>> future reference (http://lkml.kernel.org/r/CAG48ez0riS60zcA9CC9rUDV=kLS0326Rr23OKv1_RHaTkOOj7A@mail.gmail.com)
> >>>>
> >>>> So this implementation requires PTRACE_MODE_READ_FSCREDS and
> >>>> CAP_SYS_NICE so the remote process would need to be allowed to
> >>>> introspect the address space. This is the same constrain applied to the
> >>>> remote momory reclaim. Is this sufficient?
> >>>>
> >>>> I would say yes because to some degree KSM mergning can have very
> >>>> similar effect to memory reclaim from the side channel POV. But it
> >>>> should be really documented in the changelog so that it is clear that
> >>>> this has been a deliberate decision and thought through.
> >>>>
> >>>> Other than that this looks like the most reasonable approach to me.
> >>>>
> >>>>> [1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@dhcp22.suse.cz/
> >>>>> [2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
> >>>>>
> >>>
> >>> I have various concerns, but the biggest concern is that this modifies
> >>> VMA flags and can possibly break applications.
> >>>
> >>> process_madvise must not modify remote process state.
> >>>
> >>> That's why we only allow a very limited selection that are merely hints.
> >>>
> >>> So nack from my side.
> >>>
> >>
> >> [I'm quit ebusy, but I think some more explanation might be of value]
> >>
> >> One COW example where I think force-enabling KSM for processes is
> >> *currently* not a good idea (besides the side channel discussions, which
> >> is also why Windows stopped to enable KSM system wide a while ago):
> >>
> >> App:
> >>
> >> a) memset(page, 0);
> >> b) trigger R/O long-term pin on page (e.g., vfio)
> >>
> >> If between a) and b) KSM replaces the page by the shared zeropage you'll
> >> get an unreliable pin because we don't break yet COW when taking a R/O
> >> pin on the shared zeropage. And in the traditional sense, the app did
> >> everything right to guarantee that the pin will stay reliable.
> > 
> > Isn't this a bug in the existing implementation of the CoW?
> 
> One the one hand yes (pinning the shared zeropage is questionable), on
> the other hand no (user space did modify that memory ahead of time and
> filled it with something reasonable, that's how why always worked
> correctly in the absence of KSM).

I am not sure about exact details of the KSM implementation but if that
is not a desirable behavior then it should be handled on the KSM level.
The very sam thing can easily happen in a multithreaded (or in general
multi-process with shared mm) environment as well.
 
> >> Further, if an app explicitly decides to disable KSM one some region, we
> >> should not overwrite that.
> > 
> > Well, the interface is rather spartan. You cannot really tell "disable
> > KSM on some reqion". You can only tell "KSM can be applied to this
> > region" and later change your mind. Maybe this is what you had in
> > mind though.
> 
> That's what I meant. The hugepage interface has different semantics and
> you get three possible states:
> 
> 1: yes please: MADV_HUGEPAGE
> 2: don't care -- don't set anything
> 3. please no: MADV_NOHUGEPAGE
> 
> Currently for KSM we only have 1 and 2 internally I think (single
> flag), because it didn't matter in the past ebcause there was no
> force-enablement. One could convert it into all 3 states, changing the
> semantics of MADV_UNMERGEABLE slightly from
> 
> 
> 1: yes please: MADV_MERGEABLE
> 2: don't care: MADV_UNMERGEABLE
> 
> to
> 
> 1: yes please: MADV_MERGEABLE
> 2: don't care -- don't set anything
> 3. please no: MADV_UNMERGEABLE

Are you saying that any remote handling of the KSM has to deal with a
pre-existing semantic as well? Are we aware of any existing application
that really uses MADV_UNMERGEABLE in a hope to disable KSM for any of
its sensitive memory ranges? My understanding is that this is simply a
on/off knob and a remote way to do the same is in line with the existing
API.

To be completely honest I do not really buy an argument that this might
break something much more than the original application can do already.
Unless I am missing the ptrace check puts the bar rather high. Adversary
with this level of access to the target application has already broken
it. Or am I missing something?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01 12:36           ` Michal Hocko
@ 2022-07-01 12:39             ` David Hildenbrand
  2022-07-01 13:19               ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2022-07-01 12:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgel.zte, linux-mm, linux-kernel, vbabka, minchan, oleksandr,
	xu xin, Jann Horn, Andrew Morton

> I am not sure about exact details of the KSM implementation but if that
> is not a desirable behavior then it should be handled on the KSM level.
> The very sam thing can easily happen in a multithreaded (or in general
> multi-process with shared mm) environment as well.

I don't quite get what you mean.

>  
>>>> Further, if an app explicitly decides to disable KSM one some region, we
>>>> should not overwrite that.
>>>
>>> Well, the interface is rather spartan. You cannot really tell "disable
>>> KSM on some reqion". You can only tell "KSM can be applied to this
>>> region" and later change your mind. Maybe this is what you had in
>>> mind though.
>>
>> That's what I meant. The hugepage interface has different semantics and
>> you get three possible states:
>>
>> 1: yes please: MADV_HUGEPAGE
>> 2: don't care -- don't set anything
>> 3. please no: MADV_NOHUGEPAGE
>>
>> Currently for KSM we only have 1 and 2 internally I think (single
>> flag), because it didn't matter in the past ebcause there was no
>> force-enablement. One could convert it into all 3 states, changing the
>> semantics of MADV_UNMERGEABLE slightly from
>>
>>
>> 1: yes please: MADV_MERGEABLE
>> 2: don't care: MADV_UNMERGEABLE
>>
>> to
>>
>> 1: yes please: MADV_MERGEABLE
>> 2: don't care -- don't set anything
>> 3. please no: MADV_UNMERGEABLE
> 
> Are you saying that any remote handling of the KSM has to deal with a
> pre-existing semantic as well? Are we aware of any existing application
> that really uses MADV_UNMERGEABLE in a hope to disable KSM for any of
> its sensitive memory ranges? My understanding is that this is simply a
> on/off knob and a remote way to do the same is in line with the existing
> API.

"its sensitive memory ranges" that's exactly what I am concerned of.
There should be a toggle, and existing applciations will not be using it.

> 
> To be completely honest I do not really buy an argument that this might
> break something much more than the original application can do already.

How can you get a shared zeropage in a private mapping after a previous
write if not via KSM?

> Unless I am missing the ptrace check puts the bar rather high. Adversary
> with this level of access to the target application has already broken
> it. Or am I missing something?

I don't see what you mean.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01 12:39             ` David Hildenbrand
@ 2022-07-01 13:19               ` Michal Hocko
  2022-07-01 19:12                 ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2022-07-01 13:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: cgel.zte, linux-mm, linux-kernel, vbabka, minchan, oleksandr,
	xu xin, Jann Horn, Andrew Morton

On Fri 01-07-22 14:39:24, David Hildenbrand wrote:
> > I am not sure about exact details of the KSM implementation but if that
> > is not a desirable behavior then it should be handled on the KSM level.
> > The very sam thing can easily happen in a multithreaded (or in general
> > multi-process with shared mm) environment as well.
> 
> I don't quite get what you mean.

I meant to say that if KSM needs to be aware of a special CoW semantic
then it should be handled on the KSM layer regardless whether the KSM
has been set by the process itself or any other process that has acccess
to the MM. process_madvise is just another way to access a remote MM
other than sharing the full MM.
 
[...]
> > Are you saying that any remote handling of the KSM has to deal with a
> > pre-existing semantic as well? Are we aware of any existing application
> > that really uses MADV_UNMERGEABLE in a hope to disable KSM for any of
> > its sensitive memory ranges? My understanding is that this is simply a
> > on/off knob and a remote way to do the same is in line with the existing
> > API.
> 
> "its sensitive memory ranges" that's exactly what I am concerned of.
> There should be a toggle, and existing applciations will not be using it.

The thing is that most applications (are there any?) do not actively
say that something is not KSM safe, right? They expect they opt in where
it makes sense. So my question is, whether any remote way to opt in for
KSM has to redefine the existing semantic or the same should be achieved
by a sufficient privileges?

The former would have really hard times to be applicable to the very
likely first hand usecase - unmodifiable binaries...

> > To be completely honest I do not really buy an argument that this might
> > break something much more than the original application can do already.
> 
> How can you get a shared zeropage in a private mapping after a previous
> write if not via KSM?

I was not referring to KSM specifically here. My recollection is that
PTRACE_MODE_READ_FSCREDS is quite powerful already.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01 13:19               ` Michal Hocko
@ 2022-07-01 19:12                 ` David Hildenbrand
  2022-07-04  6:48                   ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: David Hildenbrand @ 2022-07-01 19:12 UTC (permalink / raw)
  To: Michal Hocko
  Cc: cgel.zte, linux-mm, linux-kernel, vbabka, minchan, oleksandr,
	xu xin, Jann Horn, Andrew Morton

On 01.07.22 15:19, Michal Hocko wrote:
> On Fri 01-07-22 14:39:24, David Hildenbrand wrote:
>>> I am not sure about exact details of the KSM implementation but if that
>>> is not a desirable behavior then it should be handled on the KSM level.
>>> The very sam thing can easily happen in a multithreaded (or in general
>>> multi-process with shared mm) environment as well.
>>
>> I don't quite get what you mean.
> 
> I meant to say that if KSM needs to be aware of a special CoW semantic
> then it should be handled on the KSM layer regardless whether the KSM
> has been set by the process itself or any other process that has acccess
> to the MM. process_madvise is just another way to access a remote MM
> other than sharing the full MM.

Okay.

KSM has been a corner case feature that was restricted to well-defined
and well-tested environments. Until recently, R/O pins of any KSM pages
was essentially completely unreliably. And applications don't expect
such surprises. The shared zeropage is most probably the last
problematic piece.

Yes, we're getting there that it's a real feature that can see more
(forced) wide-spread use. However, until the known issues in KSM have
been fixed (e.g., below -- there is a whole list of papers regarding
attacks on memory deduplication), it should be limited to well defined
environments and applications only -- IMHO.

So what I want to express here is that if we're adding an interface that
can be used to just enable KSM on the whole system easily, it might be a
bit to soon for that. No matter what you document, people will ignore it.

OTOH, if this is a real debug feature that will only be available in
specific debug/test scenarios (kernel config? toggle? whatsoever?), then
it's "better". If that is already the case, good.

>  
> [...]
>>> Are you saying that any remote handling of the KSM has to deal with a
>>> pre-existing semantic as well? Are we aware of any existing application
>>> that really uses MADV_UNMERGEABLE in a hope to disable KSM for any of
>>> its sensitive memory ranges? My understanding is that this is simply a
>>> on/off knob and a remote way to do the same is in line with the existing
>>> API.
>>
>> "its sensitive memory ranges" that's exactly what I am concerned of.
>> There should be a toggle, and existing applciations will not be using it.
> 
> The thing is that most applications (are there any?) do not actively
> say that something is not KSM safe, right? They expect they opt in where

They can't. But knowing about stuff like
https://access.redhat.com/security/cve/cve-2021-3714 makes me be sure
that there are applications that don't want this force-enabled ever.

> it makes sense. So my question is, whether any remote way to opt in for
> KSM has to redefine the existing semantic or the same should be achieved
> by a sufficient privileges?
> 
> The former would have really hard times to be applicable to the very
> likely first hand usecase - unmodifiable binaries...

Yes, I know. I also don't have a good answer to all of that.

> 
>>> To be completely honest I do not really buy an argument that this might
>>> break something much more than the original application can do already.
>>
>> How can you get a shared zeropage in a private mapping after a previous
>> write if not via KSM?
> 
> I was not referring to KSM specifically here. My recollection is that
> PTRACE_MODE_READ_FSCREDS is quite powerful already.

Ah, you mean process_madvise() permissions.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01 19:12                 ` David Hildenbrand
@ 2022-07-04  6:48                   ` Michal Hocko
  2022-07-04  7:29                     ` CGEL
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2022-07-04  6:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: cgel.zte, linux-mm, linux-kernel, vbabka, minchan, oleksandr,
	xu xin, Jann Horn, Andrew Morton

On Fri 01-07-22 21:12:56, David Hildenbrand wrote:
> On 01.07.22 15:19, Michal Hocko wrote:
> > On Fri 01-07-22 14:39:24, David Hildenbrand wrote:
> >>> I am not sure about exact details of the KSM implementation but if that
> >>> is not a desirable behavior then it should be handled on the KSM level.
> >>> The very sam thing can easily happen in a multithreaded (or in general
> >>> multi-process with shared mm) environment as well.
> >>
> >> I don't quite get what you mean.
> > 
> > I meant to say that if KSM needs to be aware of a special CoW semantic
> > then it should be handled on the KSM layer regardless whether the KSM
> > has been set by the process itself or any other process that has acccess
> > to the MM. process_madvise is just another way to access a remote MM
> > other than sharing the full MM.
> 
> Okay.
> 
> KSM has been a corner case feature that was restricted to well-defined
> and well-tested environments. Until recently, R/O pins of any KSM pages
> was essentially completely unreliably. And applications don't expect
> such surprises. The shared zeropage is most probably the last
> problematic piece.
> 
> Yes, we're getting there that it's a real feature that can see more
> (forced) wide-spread use. However, until the known issues in KSM have
> been fixed (e.g., below -- there is a whole list of papers regarding
> attacks on memory deduplication), it should be limited to well defined
> environments and applications only -- IMHO.

Very much agreed on all this! To be completely honest I am not really
sure that all those consequences are widely understood and optmizing
solely on memory savings is a very short sighted strategy IMO. But, it
seems that there is a demand for this feature and previous attempts for
APIs were much worse both from the semantic and maintainability POV. I
am not sure we can get anything more sane than madvise.

I also very much agree that current shortcomings have to be adressed
first before we open this can of worms to 3rd party actors. I was not
aware of those so thank for bringing them up. Maybe I was overly
optimistic here.

So I guess we have following questions to answer:
1) Do we really want to support KSM triggered by 3rd party? Does it
impose new challenges other than existing ones in multi "threaded"
environemnts?
2) If yes, is the process_madvise the most appropriate existing API? Or
do we need a new one?
3) Should this be a highly privileged operation or we want to allow
userspace to shoot its feet because consequences are subtle and not very
well understood?

> So what I want to express here is that if we're adding an interface that
> can be used to just enable KSM on the whole system easily, it might be a
> bit to soon for that. No matter what you document, people will ignore it.

Agreed.

> OTOH, if this is a real debug feature that will only be available in
> specific debug/test scenarios (kernel config? toggle? whatsoever?), then
> it's "better". If that is already the case, good.

No, I think this is aimed to real production deployments.

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-04  6:48                   ` Michal Hocko
@ 2022-07-04  7:29                     ` CGEL
  2022-07-04  8:40                       ` Michal Hocko
  0 siblings, 1 reply; 16+ messages in thread
From: CGEL @ 2022-07-04  7:29 UTC (permalink / raw)
  To: Michal Hocko
  Cc: David Hildenbrand, linux-mm, linux-kernel, vbabka, minchan,
	oleksandr, xu xin, Jann Horn, Andrew Morton

On Mon, Jul 04, 2022 at 08:48:06AM +0200, Michal Hocko wrote:
> On Fri 01-07-22 21:12:56, David Hildenbrand wrote:
> > On 01.07.22 15:19, Michal Hocko wrote:
> > > On Fri 01-07-22 14:39:24, David Hildenbrand wrote:
> > >>> I am not sure about exact details of the KSM implementation but if that
> > >>> is not a desirable behavior then it should be handled on the KSM level.
> > >>> The very sam thing can easily happen in a multithreaded (or in general
> > >>> multi-process with shared mm) environment as well.
> > >>
> > >> I don't quite get what you mean.
> > > 
> > > I meant to say that if KSM needs to be aware of a special CoW semantic
> > > then it should be handled on the KSM layer regardless whether the KSM
> > > has been set by the process itself or any other process that has acccess
> > > to the MM. process_madvise is just another way to access a remote MM
> > > other than sharing the full MM.
> > 
> > Okay.
> > 
> > KSM has been a corner case feature that was restricted to well-defined
> > and well-tested environments. Until recently, R/O pins of any KSM pages
> > was essentially completely unreliably. And applications don't expect
> > such surprises. The shared zeropage is most probably the last
> > problematic piece.
> > 
> > Yes, we're getting there that it's a real feature that can see more
> > (forced) wide-spread use. However, until the known issues in KSM have
> > been fixed (e.g., below -- there is a whole list of papers regarding
> > attacks on memory deduplication), it should be limited to well defined
> > environments and applications only -- IMHO.
> 
> Very much agreed on all this! To be completely honest I am not really
> sure that all those consequences are widely understood and optmizing
> solely on memory savings is a very short sighted strategy IMO. But, it
> seems that there is a demand for this feature and previous attempts for
> APIs were much worse both from the semantic and maintainability POV. I
> am not sure we can get anything more sane than madvise.
> 
> I also very much agree that current shortcomings have to be adressed
> first before we open this can of worms to 3rd party actors. I was not
> aware of those so thank for bringing them up. Maybe I was overly
> optimistic here.
> 
> So I guess we have following questions to answer:
> 1) Do we really want to support KSM triggered by 3rd party? Does it
> impose new challenges other than existing ones in multi "threaded"
> environemnts?
> 2) If yes, is the process_madvise the most appropriate existing API? Or
> do we need a new one?

Maybe new semantics is needed similarly to MADV_NOHUGEPAGE that ensures that
there will *not* be huge pages.

> 3) Should this be a highly privileged operation or we want to allow
> userspace to shoot its feet because consequences are subtle and not very
> well understood?
> 
> > So what I want to express here is that if we're adding an interface that
> > can be used to just enable KSM on the whole system easily, it might be a
> > bit to soon for that. No matter what you document, people will ignore it.
> 
> Agreed.
> 

Agree too.
Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-01 12:09         ` David Hildenbrand
  2022-07-01 12:36           ` Michal Hocko
@ 2022-07-04  8:13           ` CGEL
  2022-07-04  9:30             ` David Hildenbrand
  1 sibling, 1 reply; 16+ messages in thread
From: CGEL @ 2022-07-04  8:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Michal Hocko, linux-mm, linux-kernel, vbabka, minchan, oleksandr,
	xu xin, Jann Horn, Andrew Morton

On Fri, Jul 01, 2022 at 02:09:24PM +0200, David Hildenbrand wrote:
> On 01.07.22 14:02, Michal Hocko wrote:
> > On Fri 01-07-22 12:50:59, David Hildenbrand wrote:
> >> On 01.07.22 12:32, David Hildenbrand wrote:
> >>> On 01.07.22 11:11, Michal Hocko wrote:
> >>>> [Cc Jann]
> >>>>
> >>>> On Fri 01-07-22 08:43:23, cgel.zte@gmail.com wrote:
> >>>>> From: xu xin <xu.xin16@zte.com.cn>
> >>>>>
> >>>>> The benefits of doing this are obvious because using madvise in user code
> >>>>> is the only current way to enable KSM, which is inconvenient for those
> >>>>> compiled app without marking MERGEABLE wanting to enable KSM.
> >>>>
> >>>> I would rephrase:
> >>>> "
> >>>> KSM functionality is currently available only to processes which are
> >>>> using MADV_MERGEABLE directly. This is limiting because there are
> >>>> usecases which will benefit from enabling KSM on a remote process. One
> >>>> example would be an application which cannot be modified (e.g. because
> >>>> it is only distributed as a binary). MORE EXAMPLES WOULD BE REALLY
> >>>> BENEFICIAL.
> >>>> "
> >>>>
> >>>>> Since we already have the syscall of process_madvise(), then reusing the
> >>>>> interface to allow external KSM hints is more acceptable [1].
> >>>>>
> >>>>> Although this patch was released by Oleksandr Natalenko, but it was
> >>>>> unfortunately terminated without any conclusions because there was debate
> >>>>> on whether it should use signal_pending() to check the target task besides
> >>>>> the task of current() when calling unmerge_ksm_pages of other task [2].
> >>>>
> >>>> I am not sure this is particularly interesting. I do not remember
> >>>> details of that discussion but checking signal_pending on a different
> >>>> task is rarely the right thing to do. In this case the check is meant to
> >>>> allow bailing out from the operation so that the caller could be
> >>>> terminated for example.
> >>>>
> >>>>> I think it's unneeded to check the target task. For example, when we set
> >>>>> the klob /sys/kernel/mm/ksm/run from 1 to 2,
> >>>>> unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
> >>>>> all other target tasks either.
> >>>>>
> >>>>> I hope this patch can get attention again.
> >>>>
> >>>> One thing that the changelog is missing and it is quite important IMHO
> >>>> is the permission model. As we have discussed in previous incarnations
> >>>> of the remote KSM functionality that KSM has some security implications.
> >>>> It would be really great to refer to that in the changelog for the
> >>>> future reference (http://lkml.kernel.org/r/CAG48ez0riS60zcA9CC9rUDV=kLS0326Rr23OKv1_RHaTkOOj7A@mail.gmail.com)
> >>>>
> >>>> So this implementation requires PTRACE_MODE_READ_FSCREDS and
> >>>> CAP_SYS_NICE so the remote process would need to be allowed to
> >>>> introspect the address space. This is the same constrain applied to the
> >>>> remote momory reclaim. Is this sufficient?
> >>>>
> >>>> I would say yes because to some degree KSM mergning can have very
> >>>> similar effect to memory reclaim from the side channel POV. But it
> >>>> should be really documented in the changelog so that it is clear that
> >>>> this has been a deliberate decision and thought through.
> >>>>
> >>>> Other than that this looks like the most reasonable approach to me.
> >>>>
> >>>>> [1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@dhcp22.suse.cz/
> >>>>> [2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
> >>>>>
> >>>
> >>> I have various concerns, but the biggest concern is that this modifies
> >>> VMA flags and can possibly break applications.
> >>>
> >>> process_madvise must not modify remote process state.
> >>>
> >>> That's why we only allow a very limited selection that are merely hints.
> >>>
> >>> So nack from my side.
> >>>
> >>
> >> [I'm quit ebusy, but I think some more explanation might be of value]
> >>
> >> One COW example where I think force-enabling KSM for processes is
> >> *currently* not a good idea (besides the side channel discussions, which
> >> is also why Windows stopped to enable KSM system wide a while ago):
> >>
> >> App:
> >>
> >> a) memset(page, 0);
> >> b) trigger R/O long-term pin on page (e.g., vfio)
> >>
> >> If between a) and b) KSM replaces the page by the shared zeropage you'll
> >> get an unreliable pin because we don't break yet COW when taking a R/O
> >> pin on the shared zeropage. And in the traditional sense, the app did
> >> everything right to guarantee that the pin will stay reliable.
> > 
> > Isn't this a bug in the existing implementation of the CoW?
> 
> One the one hand yes (pinning the shared zeropage is questionable), on
> the other hand no (user space did modify that memory ahead of time and
> filled it with something reasonable, that's how why always worked
> correctly in the absence of KSM).
>

Thanks for your information.

So does it needs to be fixed? and if yes, are you planning to fix it.

> > 
> >> Further, if an app explicitly decides to disable KSM one some region, we
> >> should not overwrite that.
> > 
> > Well, the interface is rather spartan. You cannot really tell "disable
> > KSM on some reqion". You can only tell "KSM can be applied to this
> > region" and later change your mind. Maybe this is what you had in
> > mind though.
> 
> That's what I meant. The hugepage interface has different semantics and
> you get three possible states:
> 
> 1: yes please: MADV_HUGEPAGE
> 2: don't care -- don't set anything
> 3. please no: MADV_NOHUGEPAGE
> 
> Currently for KSM we only have 1 and 2 internally I think (single
> flag), because it didn't matter in the past ebcause there was no
> force-enablement. One could convert it into all 3 states, changing the
> semantics of MADV_UNMERGEABLE slightly from
> 
> 
> 1: yes please: MADV_MERGEABLE
> 2: don't care: MADV_UNMERGEABLE
> 
> to
> 
> 1: yes please: MADV_MERGEABLE
> 2: don't care -- don't set anything
> 3. please no: MADV_UNMERGEABLE
> 
> 
> -- 
> Thanks,
> 
> David / dhildenb

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-04  7:29                     ` CGEL
@ 2022-07-04  8:40                       ` Michal Hocko
  2022-07-04  9:35                         ` David Hildenbrand
  0 siblings, 1 reply; 16+ messages in thread
From: Michal Hocko @ 2022-07-04  8:40 UTC (permalink / raw)
  To: CGEL
  Cc: David Hildenbrand, linux-mm, linux-kernel, vbabka, minchan,
	oleksandr, xu xin, Jann Horn, Andrew Morton

On Mon 04-07-22 07:29:41, CGEL wrote:
[...]
> Maybe new semantics is needed similarly to MADV_NOHUGEPAGE that ensures that
> there will *not* be huge pages.

How do you achieve that with a backward compatibility?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-04  8:13           ` CGEL
@ 2022-07-04  9:30             ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2022-07-04  9:30 UTC (permalink / raw)
  To: CGEL
  Cc: Michal Hocko, linux-mm, linux-kernel, vbabka, minchan, oleksandr,
	xu xin, Jann Horn, Andrew Morton

On 04.07.22 10:13, CGEL wrote:
> On Fri, Jul 01, 2022 at 02:09:24PM +0200, David Hildenbrand wrote:
>> On 01.07.22 14:02, Michal Hocko wrote:
>>> On Fri 01-07-22 12:50:59, David Hildenbrand wrote:
>>>> On 01.07.22 12:32, David Hildenbrand wrote:
>>>>> On 01.07.22 11:11, Michal Hocko wrote:
>>>>>> [Cc Jann]
>>>>>>
>>>>>> On Fri 01-07-22 08:43:23, cgel.zte@gmail.com wrote:
>>>>>>> From: xu xin <xu.xin16@zte.com.cn>
>>>>>>>
>>>>>>> The benefits of doing this are obvious because using madvise in user code
>>>>>>> is the only current way to enable KSM, which is inconvenient for those
>>>>>>> compiled app without marking MERGEABLE wanting to enable KSM.
>>>>>>
>>>>>> I would rephrase:
>>>>>> "
>>>>>> KSM functionality is currently available only to processes which are
>>>>>> using MADV_MERGEABLE directly. This is limiting because there are
>>>>>> usecases which will benefit from enabling KSM on a remote process. One
>>>>>> example would be an application which cannot be modified (e.g. because
>>>>>> it is only distributed as a binary). MORE EXAMPLES WOULD BE REALLY
>>>>>> BENEFICIAL.
>>>>>> "
>>>>>>
>>>>>>> Since we already have the syscall of process_madvise(), then reusing the
>>>>>>> interface to allow external KSM hints is more acceptable [1].
>>>>>>>
>>>>>>> Although this patch was released by Oleksandr Natalenko, but it was
>>>>>>> unfortunately terminated without any conclusions because there was debate
>>>>>>> on whether it should use signal_pending() to check the target task besides
>>>>>>> the task of current() when calling unmerge_ksm_pages of other task [2].
>>>>>>
>>>>>> I am not sure this is particularly interesting. I do not remember
>>>>>> details of that discussion but checking signal_pending on a different
>>>>>> task is rarely the right thing to do. In this case the check is meant to
>>>>>> allow bailing out from the operation so that the caller could be
>>>>>> terminated for example.
>>>>>>
>>>>>>> I think it's unneeded to check the target task. For example, when we set
>>>>>>> the klob /sys/kernel/mm/ksm/run from 1 to 2,
>>>>>>> unmerge_and_remove_all_rmap_items() doesn't use signal_pending() to check
>>>>>>> all other target tasks either.
>>>>>>>
>>>>>>> I hope this patch can get attention again.
>>>>>>
>>>>>> One thing that the changelog is missing and it is quite important IMHO
>>>>>> is the permission model. As we have discussed in previous incarnations
>>>>>> of the remote KSM functionality that KSM has some security implications.
>>>>>> It would be really great to refer to that in the changelog for the
>>>>>> future reference (http://lkml.kernel.org/r/CAG48ez0riS60zcA9CC9rUDV=kLS0326Rr23OKv1_RHaTkOOj7A@mail.gmail.com)
>>>>>>
>>>>>> So this implementation requires PTRACE_MODE_READ_FSCREDS and
>>>>>> CAP_SYS_NICE so the remote process would need to be allowed to
>>>>>> introspect the address space. This is the same constrain applied to the
>>>>>> remote momory reclaim. Is this sufficient?
>>>>>>
>>>>>> I would say yes because to some degree KSM mergning can have very
>>>>>> similar effect to memory reclaim from the side channel POV. But it
>>>>>> should be really documented in the changelog so that it is clear that
>>>>>> this has been a deliberate decision and thought through.
>>>>>>
>>>>>> Other than that this looks like the most reasonable approach to me.
>>>>>>
>>>>>>> [1] https://lore.kernel.org/lkml/YoOrdh85+AqJH8w1@dhcp22.suse.cz/
>>>>>>> [2] https://lore.kernel.org/lkml/2a66abd8-4103-f11b-06d1-07762667eee6@suse.cz/
>>>>>>>
>>>>>
>>>>> I have various concerns, but the biggest concern is that this modifies
>>>>> VMA flags and can possibly break applications.
>>>>>
>>>>> process_madvise must not modify remote process state.
>>>>>
>>>>> That's why we only allow a very limited selection that are merely hints.
>>>>>
>>>>> So nack from my side.
>>>>>
>>>>
>>>> [I'm quit ebusy, but I think some more explanation might be of value]
>>>>
>>>> One COW example where I think force-enabling KSM for processes is
>>>> *currently* not a good idea (besides the side channel discussions, which
>>>> is also why Windows stopped to enable KSM system wide a while ago):
>>>>
>>>> App:
>>>>
>>>> a) memset(page, 0);
>>>> b) trigger R/O long-term pin on page (e.g., vfio)
>>>>
>>>> If between a) and b) KSM replaces the page by the shared zeropage you'll
>>>> get an unreliable pin because we don't break yet COW when taking a R/O
>>>> pin on the shared zeropage. And in the traditional sense, the app did
>>>> everything right to guarantee that the pin will stay reliable.
>>>
>>> Isn't this a bug in the existing implementation of the CoW?
>>
>> One the one hand yes (pinning the shared zeropage is questionable), on
>> the other hand no (user space did modify that memory ahead of time and
>> filled it with something reasonable, that's how why always worked
>> correctly in the absence of KSM).
>>
> 
> Thanks for your information.
> 
> So does it needs to be fixed? and if yes, are you planning to fix it.

Very high on my todo list. So yes, I think it really needs fixing,
especially with KSM in mind.

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise
  2022-07-04  8:40                       ` Michal Hocko
@ 2022-07-04  9:35                         ` David Hildenbrand
  0 siblings, 0 replies; 16+ messages in thread
From: David Hildenbrand @ 2022-07-04  9:35 UTC (permalink / raw)
  To: Michal Hocko, CGEL
  Cc: linux-mm, linux-kernel, vbabka, minchan, oleksandr, xu xin,
	Jann Horn, Andrew Morton

On 04.07.22 10:40, Michal Hocko wrote:
> On Mon 04-07-22 07:29:41, CGEL wrote:
> [...]
>> Maybe new semantics is needed similarly to MADV_NOHUGEPAGE that ensures that
>> there will *not* be huge pages.
> 
> How do you achieve that with a backward compatibility?

Some apps mark secrets via mlock(). I did not check, but I assume
mlock()'ed pages (in VMAs) would not be applicable to KSM. If they would
be, one option would be to not deduplicate them.

But then, I have no clue what's exploitable via a side channel and
what's not. Eventually, having a proper fix for most side channels would
make KSM safer to use in the the general case. But then, who knows what
security researchers will be able to come up with ...

As a very minimum, there would have to be some kind of toggle to allow
forcing KSM on other applications at all. Either/o a compile-time and a
run-time option. Once most of the known side channels are fixed we could
adjust the default of the run-time option.

(I think to this day, Windows still disables system-wide deduplication
as a default and requires an admin to explicitly enable it)

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-07-04  9:35 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-01  8:43 [PATCH linux-next] mm/madvise: allow KSM hints for process_madvise cgel.zte
2022-07-01  9:11 ` Michal Hocko
2022-07-01 10:32   ` David Hildenbrand
2022-07-01 10:50     ` David Hildenbrand
2022-07-01 12:02       ` Michal Hocko
2022-07-01 12:09         ` David Hildenbrand
2022-07-01 12:36           ` Michal Hocko
2022-07-01 12:39             ` David Hildenbrand
2022-07-01 13:19               ` Michal Hocko
2022-07-01 19:12                 ` David Hildenbrand
2022-07-04  6:48                   ` Michal Hocko
2022-07-04  7:29                     ` CGEL
2022-07-04  8:40                       ` Michal Hocko
2022-07-04  9:35                         ` David Hildenbrand
2022-07-04  8:13           ` CGEL
2022-07-04  9:30             ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).