Linux-mm Archive on lore.kernel.org
 help / color / Atom feed
* interaction of MADV_PAGEOUT with CoW anonymous mappings?
@ 2020-03-10 18:08 Jann Horn
  2020-03-10 18:48 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Jann Horn @ 2020-03-10 18:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Linux-MM, kernel list, Daniel Colascione,
	Dave Hansen, Joel Fernandes (Google)

Hi!

From looking at the source code, it looks to me as if using
MADV_PAGEOUT on a CoW anonymous mapping will page out the page if
possible, even if other processes still have the same page mapped. Is
that correct?

If so, that's probably bad in environments where many processes (with
different privileges) are forked from a single zygote process (like
Android and Chrome), I think? If you accidentally call it on a CoW
anonymous mapping with shared pages, you'll degrade the performance of
other processes. And if an attacker does it intentionally, they could
use that to aid with exploiting race conditions or weird
microarchitectural stuff (e.g. the new https://lviattack.eu/lvi.pdf
talks about "the assumption that attackers can provoke page faults or
microcode assists for (arbitrary) load operations in the victim
domain").

Should madvise_cold_or_pageout_pte_range() maybe refuse to operate on
pages with mapcount>1, or something like that? Or does it already do
that, and I just missed the check?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 18:08 interaction of MADV_PAGEOUT with CoW anonymous mappings? Jann Horn
@ 2020-03-10 18:48 ` Michal Hocko
  2020-03-10 19:11   ` Jann Horn
  2020-03-10 20:19   ` Daniel Colascione
  2020-03-10 22:14 ` Minchan Kim
  2020-03-12  8:22 ` Michal Hocko
  2 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2020-03-10 18:48 UTC (permalink / raw)
  To: Jann Horn
  Cc: Minchan Kim, Linux-MM, kernel list, Daniel Colascione,
	Dave Hansen, Joel Fernandes (Google)

On Tue 10-03-20 19:08:28, Jann Horn wrote:
> Hi!
> 
> >From looking at the source code, it looks to me as if using
> MADV_PAGEOUT on a CoW anonymous mapping will page out the page if
> possible, even if other processes still have the same page mapped. Is
> that correct?
> 
> If so, that's probably bad in environments where many processes (with
> different privileges) are forked from a single zygote process (like
> Android and Chrome), I think? If you accidentally call it on a CoW
> anonymous mapping with shared pages, you'll degrade the performance of
> other processes. And if an attacker does it intentionally, they could
> use that to aid with exploiting race conditions or weird
> microarchitectural stuff (e.g. the new https://lviattack.eu/lvi.pdf
> talks about "the assumption that attackers can provoke page faults or
> microcode assists for (arbitrary) load operations in the victim
> domain").
> 
> Should madvise_cold_or_pageout_pte_range() maybe refuse to operate on
> pages with mapcount>1, or something like that? Or does it already do
> that, and I just missed the check?

I have brought up side channel attacks earlier [1] but only in the
context of shared page cache pages. I didn't really consider shared
anonymous pages to be a real problem. I was under impression that CoW
pages shouldn't be a real problem because any security sensible
applications shouldn't allow untrusted code to be forked and CoW
anything really important. I believe we have made this assumption
in other places - IIRC on gup with FOLL_FORCE but I admit I have
very happily forgot most details.

[1] http://lkml.kernel.org/r/20190619132450.GQ2968@dhcp22.suse.cz

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 18:48 ` Michal Hocko
@ 2020-03-10 19:11   ` Jann Horn
  2020-03-10 21:09     ` Michal Hocko
  2020-03-10 20:19   ` Daniel Colascione
  1 sibling, 1 reply; 34+ messages in thread
From: Jann Horn @ 2020-03-10 19:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Minchan Kim, Linux-MM, kernel list, Daniel Colascione,
	Dave Hansen, Joel Fernandes (Google)

On Tue, Mar 10, 2020 at 7:48 PM Michal Hocko <mhocko@kernel.org> wrote:
> On Tue 10-03-20 19:08:28, Jann Horn wrote:
> > Hi!
> >
> > >From looking at the source code, it looks to me as if using
> > MADV_PAGEOUT on a CoW anonymous mapping will page out the page if
> > possible, even if other processes still have the same page mapped. Is
> > that correct?
> >
> > If so, that's probably bad in environments where many processes (with
> > different privileges) are forked from a single zygote process (like
> > Android and Chrome), I think? If you accidentally call it on a CoW
> > anonymous mapping with shared pages, you'll degrade the performance of
> > other processes. And if an attacker does it intentionally, they could
> > use that to aid with exploiting race conditions or weird
> > microarchitectural stuff (e.g. the new https://lviattack.eu/lvi.pdf
> > talks about "the assumption that attackers can provoke page faults or
> > microcode assists for (arbitrary) load operations in the victim
> > domain").
> >
> > Should madvise_cold_or_pageout_pte_range() maybe refuse to operate on
> > pages with mapcount>1, or something like that? Or does it already do
> > that, and I just missed the check?
>
> I have brought up side channel attacks earlier [1] but only in the
> context of shared page cache pages. I didn't really consider shared
> anonymous pages to be a real problem. I was under impression that CoW
> pages shouldn't be a real problem because any security sensible
> applications shouldn't allow untrusted code to be forked and CoW
> anything really important. I believe we have made this assumption
> in other places - IIRC on gup with FOLL_FORCE but I admit I have
> very happily forgot most details.

Android has a "zygote" process that starts up the whole Java
environment with a bunch of libraries before entering into a loop that
fork()s off a child every time the user wants to launch an app. So all
the apps, and even browser renderer processes, on the device share
many CoW VMAs. See
<https://developer.android.com/topic/performance/memory-overview#SharingRAM>.

I think Chrome on Linux desktop systems also forks off renderers from
a common zygote process after initializing libraries and so on. See
<https://chromium.googlesource.com/chromium/src.git/+/master/docs/linux/zygote.md>.
(But they use a relatively strict seccomp sandbox that e.g. doesn't
permit MADV_PAGEOUT.)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 18:48 ` Michal Hocko
  2020-03-10 19:11   ` Jann Horn
@ 2020-03-10 20:19   ` Daniel Colascione
  2020-03-10 21:40     ` Jann Horn
  1 sibling, 1 reply; 34+ messages in thread
From: Daniel Colascione @ 2020-03-10 20:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Minchan Kim, Linux-MM, kernel list, Dave Hansen,
	Joel Fernandes (Google)

On Tue, Mar 10, 2020 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 10-03-20 19:08:28, Jann Horn wrote:
> > Hi!
> >
> > >From looking at the source code, it looks to me as if using
> > MADV_PAGEOUT on a CoW anonymous mapping will page out the page if
> > possible, even if other processes still have the same page mapped. Is
> > that correct?
> >
> > If so, that's probably bad in environments where many processes (with
> > different privileges) are forked from a single zygote process (like
> > Android and Chrome), I think? If you accidentally call it on a CoW
> > anonymous mapping with shared pages, you'll degrade the performance of
> > other processes. And if an attacker does it intentionally, they could
> > use that to aid with exploiting race conditions or weird
> > microarchitectural stuff (e.g. the new https://lviattack.eu/lvi.pdf
> > talks about "the assumption that attackers can provoke page faults or
> > microcode assists for (arbitrary) load operations in the victim
> > domain").
> >
> > Should madvise_cold_or_pageout_pte_range() maybe refuse to operate on
> > pages with mapcount>1, or something like that? Or does it already do
> > that, and I just missed the check?
>
> I have brought up side channel attacks earlier [1] but only in the
> context of shared page cache pages. I didn't really consider shared
> anonymous pages to be a real problem. I was under impression that CoW
> pages shouldn't be a real problem because any security sensible
> applications shouldn't allow untrusted code to be forked and CoW
> anything really important. I believe we have made this assumption
> in other places - IIRC on gup with FOLL_FORCE but I admit I have
> very happily forgot most details.

I'm more worried about the performance implications. Consider
libc.so's data section: that's a COW mapping, and we COW it during
zygote initialization as we load and relocate libc.so. Child processes
shouldn't be dirtying and re-COWing those relocated pages. If I
understand Jann's message correctly, MADV_PAGEOUT would force the
pages corresponding to the libc.so data segment out to zram just
because we MADV_PAGEOUT-ed a single process that happened to use libc.
We should leave those pages in memory, IMHO.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 19:11   ` Jann Horn
@ 2020-03-10 21:09     ` Michal Hocko
  2020-03-10 22:48       ` Dave Hansen
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-03-10 21:09 UTC (permalink / raw)
  To: Jann Horn
  Cc: Minchan Kim, Linux-MM, kernel list, Daniel Colascione,
	Dave Hansen, Joel Fernandes (Google)

On Tue 10-03-20 20:11:45, Jann Horn wrote:
> On Tue, Mar 10, 2020 at 7:48 PM Michal Hocko <mhocko@kernel.org> wrote:
> > On Tue 10-03-20 19:08:28, Jann Horn wrote:
> > > Hi!
> > >
> > > >From looking at the source code, it looks to me as if using
> > > MADV_PAGEOUT on a CoW anonymous mapping will page out the page if
> > > possible, even if other processes still have the same page mapped. Is
> > > that correct?
> > >
> > > If so, that's probably bad in environments where many processes (with
> > > different privileges) are forked from a single zygote process (like
> > > Android and Chrome), I think? If you accidentally call it on a CoW
> > > anonymous mapping with shared pages, you'll degrade the performance of
> > > other processes. And if an attacker does it intentionally, they could
> > > use that to aid with exploiting race conditions or weird
> > > microarchitectural stuff (e.g. the new https://lviattack.eu/lvi.pdf
> > > talks about "the assumption that attackers can provoke page faults or
> > > microcode assists for (arbitrary) load operations in the victim
> > > domain").
> > >
> > > Should madvise_cold_or_pageout_pte_range() maybe refuse to operate on
> > > pages with mapcount>1, or something like that? Or does it already do
> > > that, and I just missed the check?
> >
> > I have brought up side channel attacks earlier [1] but only in the
> > context of shared page cache pages. I didn't really consider shared
> > anonymous pages to be a real problem. I was under impression that CoW
> > pages shouldn't be a real problem because any security sensible
> > applications shouldn't allow untrusted code to be forked and CoW
> > anything really important. I believe we have made this assumption
> > in other places - IIRC on gup with FOLL_FORCE but I admit I have
> > very happily forgot most details.

I have quickly checked FOLL_FORCE and it is careful to break CoW on the
write access.

> Android has a "zygote" process that starts up the whole Java
> environment with a bunch of libraries before entering into a loop that
> fork()s off a child every time the user wants to launch an app. So all
> the apps, and even browser renderer processes, on the device share
> many CoW VMAs. See
> <https://developer.android.com/topic/performance/memory-overview#SharingRAM>.

I still have to think about how this could be used for any reasonable
attack. But certainly the simplest workaround is to simply back off on
pages mapped multiple times as we do for THP already. Something like the
following should work but I haven't tested it at all
diff --git a/mm/madvise.c b/mm/madvise.c
index 43b47d3fae02..02daa447bf47 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -351,6 +351,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 				goto regular_page;
 			return 0;
 		}
+		
+		/* Do not interfere with other mappings of this page */
+		if (page_mapcount(page) != 1)
+			goto huge_unlock;
 
 		if (pmd_young(orig_pmd)) {
 			pmdp_invalidate(vma, addr, pmd);
@@ -426,6 +430,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			continue;
 		}
 
+		/* Do not interfere with other mappings of this page */
+		if (page_mapcount(page) != 1)
+			continue;
+
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
 		if (pte_young(ptent)) {
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 20:19   ` Daniel Colascione
@ 2020-03-10 21:40     ` Jann Horn
  2020-03-10 21:52       ` Daniel Colascione
  0 siblings, 1 reply; 34+ messages in thread
From: Jann Horn @ 2020-03-10 21:40 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Michal Hocko, Minchan Kim, Linux-MM, kernel list, Dave Hansen,
	Joel Fernandes (Google)

On Tue, Mar 10, 2020 at 9:19 PM Daniel Colascione <dancol@google.com> wrote:
> On Tue, Mar 10, 2020 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> > On Tue 10-03-20 19:08:28, Jann Horn wrote:
> > > >From looking at the source code, it looks to me as if using
> > > MADV_PAGEOUT on a CoW anonymous mapping will page out the page if
> > > possible, even if other processes still have the same page mapped. Is
> > > that correct?
> > >
> > > If so, that's probably bad in environments where many processes (with
> > > different privileges) are forked from a single zygote process (like
> > > Android and Chrome), I think? If you accidentally call it on a CoW
> > > anonymous mapping with shared pages, you'll degrade the performance of
> > > other processes. And if an attacker does it intentionally, they could
> > > use that to aid with exploiting race conditions or weird
> > > microarchitectural stuff (e.g. the new https://lviattack.eu/lvi.pdf
> > > talks about "the assumption that attackers can provoke page faults or
> > > microcode assists for (arbitrary) load operations in the victim
> > > domain").
> > >
> > > Should madvise_cold_or_pageout_pte_range() maybe refuse to operate on
> > > pages with mapcount>1, or something like that? Or does it already do
> > > that, and I just missed the check?
> >
> > I have brought up side channel attacks earlier [1] but only in the
> > context of shared page cache pages. I didn't really consider shared
> > anonymous pages to be a real problem. I was under impression that CoW
> > pages shouldn't be a real problem because any security sensible
> > applications shouldn't allow untrusted code to be forked and CoW
> > anything really important. I believe we have made this assumption
> > in other places - IIRC on gup with FOLL_FORCE but I admit I have
> > very happily forgot most details.
>
> I'm more worried about the performance implications. Consider
> libc.so's data section: that's a COW mapping, and we COW it during
> zygote initialization as we load and relocate libc.so. Child processes
> shouldn't be dirtying and re-COWing those relocated pages. If I
> understand Jann's message correctly, MADV_PAGEOUT would force the
> pages corresponding to the libc.so data segment out to zram just
> because we MADV_PAGEOUT-ed a single process that happened to use libc.
> We should leave those pages in memory, IMHO.

Actually, the libc.so data section is a file mapping, so I think
can_do_pageout() would decide whether the caller is allowed to force
pageout based on whether the caller is the owner of (or capable over)
libc (in other words, root, basically). But I think the bss section,
as well as heap memory, could have pageout forced by anyone.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 21:40     ` Jann Horn
@ 2020-03-10 21:52       ` Daniel Colascione
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Colascione @ 2020-03-10 21:52 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michal Hocko, Minchan Kim, Linux-MM, kernel list, Dave Hansen,
	Joel Fernandes (Google)

On Tue, Mar 10, 2020 at 2:40 PM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Mar 10, 2020 at 9:19 PM Daniel Colascione <dancol@google.com> wrote:
> > On Tue, Mar 10, 2020 at 11:48 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > On Tue 10-03-20 19:08:28, Jann Horn wrote:
> > > > >From looking at the source code, it looks to me as if using
> > > > MADV_PAGEOUT on a CoW anonymous mapping will page out the page if
> > > > possible, even if other processes still have the same page mapped. Is
> > > > that correct?
> > > >
> > > > If so, that's probably bad in environments where many processes (with
> > > > different privileges) are forked from a single zygote process (like
> > > > Android and Chrome), I think? If you accidentally call it on a CoW
> > > > anonymous mapping with shared pages, you'll degrade the performance of
> > > > other processes. And if an attacker does it intentionally, they could
> > > > use that to aid with exploiting race conditions or weird
> > > > microarchitectural stuff (e.g. the new https://lviattack.eu/lvi.pdf
> > > > talks about "the assumption that attackers can provoke page faults or
> > > > microcode assists for (arbitrary) load operations in the victim
> > > > domain").
> > > >
> > > > Should madvise_cold_or_pageout_pte_range() maybe refuse to operate on
> > > > pages with mapcount>1, or something like that? Or does it already do
> > > > that, and I just missed the check?
> > >
> > > I have brought up side channel attacks earlier [1] but only in the
> > > context of shared page cache pages. I didn't really consider shared
> > > anonymous pages to be a real problem. I was under impression that CoW
> > > pages shouldn't be a real problem because any security sensible
> > > applications shouldn't allow untrusted code to be forked and CoW
> > > anything really important. I believe we have made this assumption
> > > in other places - IIRC on gup with FOLL_FORCE but I admit I have
> > > very happily forgot most details.
> >
> > I'm more worried about the performance implications. Consider
> > libc.so's data section: that's a COW mapping, and we COW it during
> > zygote initialization as we load and relocate libc.so. Child processes
> > shouldn't be dirtying and re-COWing those relocated pages. If I
> > understand Jann's message correctly, MADV_PAGEOUT would force the
> > pages corresponding to the libc.so data segment out to zram just
> > because we MADV_PAGEOUT-ed a single process that happened to use libc.
> > We should leave those pages in memory, IMHO.
>
> Actually, the libc.so data section is a file mapping, so I think
> can_do_pageout() would decide whether the caller is allowed to force
> pageout based on whether the caller is the owner of (or capable over)
> libc (in other words, root, basically). But I think the bss section,
> as well as heap memory, could have pageout forced by anyone.

lmkd would have that capability though, right? But the point stands
regardless. It sounds like both security and performance suggest a
behavior change here. Thanks for bringing it up!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 18:08 interaction of MADV_PAGEOUT with CoW anonymous mappings? Jann Horn
  2020-03-10 18:48 ` Michal Hocko
@ 2020-03-10 22:14 ` Minchan Kim
  2020-03-12  8:22 ` Michal Hocko
  2 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2020-03-10 22:14 UTC (permalink / raw)
  To: Jann Horn
  Cc: Michal Hocko, Linux-MM, kernel list, Daniel Colascione,
	Dave Hansen, Joel Fernandes (Google)

Hi Jann,

On Tue, Mar 10, 2020 at 07:08:28PM +0100, Jann Horn wrote:
> Hi!
> 
> From looking at the source code, it looks to me as if using
> MADV_PAGEOUT on a CoW anonymous mapping will page out the page if
> possible, even if other processes still have the same page mapped. Is
> that correct?
> 
> If so, that's probably bad in environments where many processes (with
> different privileges) are forked from a single zygote process (like
> Android and Chrome), I think? If you accidentally call it on a CoW
> anonymous mapping with shared pages, you'll degrade the performance of
> other processes. And if an attacker does it intentionally, they could
> use that to aid with exploiting race conditions or weird
> microarchitectural stuff (e.g. the new https://lviattack.eu/lvi.pdf
> talks about "the assumption that attackers can provoke page faults or
> microcode assists for (arbitrary) load operations in the victim
> domain").
> 
> Should madvise_cold_or_pageout_pte_range() maybe refuse to operate on
> pages with mapcount>1, or something like that? Or does it already do
> that, and I just missed the check?

Originally, patchset had the mapcount check to filer out shared page
due to performance concern, not security stuff. However, the code
was removed because reviewer asked me "let the shared pages rely on
general LRU" because shared pages would have higher chance to be
touched compared to private pages so naturally, they will keep in
the memory. It did make sense for me.

I am not familiar with the security stuff but if it's really vulnerable
and everyone agree on that, it's easy to add mapcount check there.

Thanks.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 21:09     ` Michal Hocko
@ 2020-03-10 22:48       ` Dave Hansen
  2020-03-11  8:45         ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2020-03-10 22:48 UTC (permalink / raw)
  To: Michal Hocko, Jann Horn
  Cc: Minchan Kim, Linux-MM, kernel list, Daniel Colascione,
	Joel Fernandes (Google)

Maybe instead of just punting on MADV_PAGEOUT for map_count>1 we should
only let it affect the *local* process.  We could still put the page in
the swap cache, we just wouldn't go do the rmap walk.

On 3/10/20 2:09 PM, Michal Hocko wrote:
> I still have to think about how this could be used for any reasonable
> attack.

While this sounds theoretically vulnerable, I'm having a hard time
convincing myself that it's in any way practical.

Code that's a victim to LVI has to hit a fault or an "assist".  The most
likely assist in this case is if the victim needs to set the PTE
Accessed or Dirty bits.

There would first need to be some secret that an attacker wants to
steal, then some *other* data that directly or indirectly protects the
secret.  The load of the "other" data would be what needs to encounter
the fault or the assist, and would also need to be able to speculatively
reach a "disclosure gadget".

The attacker would sit there and do MADV_PAGEOUT constantly, inducing
lots of faults and assists and then _separately_ poison the
microarchitectural buffers that lead to the "other" data getting an
injected speculative value.

If an app is vulnerable to this, it's probably going to need to be using
the "other" data a *lot*.  It's going to have a really hard time doing
that because it's going to be handling page faults all the time.  It
also can't ever write to the data being shared, or it will break the COW.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 22:48       ` Dave Hansen
@ 2020-03-11  8:45         ` Michal Hocko
  2020-03-11 22:02           ` Minchan Kim
  2020-03-11 23:53           ` Shakeel Butt
  0 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2020-03-11  8:45 UTC (permalink / raw)
  To: Dave Hansen, Minchan Kim
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione,
	Joel Fernandes (Google)

On Tue 10-03-20 15:48:31, Dave Hansen wrote:
> Maybe instead of just punting on MADV_PAGEOUT for map_count>1 we should
> only let it affect the *local* process.  We could still put the page in
> the swap cache, we just wouldn't go do the rmap walk.

Is it really worth medling with the reclaim code and special case
MADV_PAGEOUT there? I mean it is quite reasonable to have an initial
implementation that doesn't really touch shared pages because that can
lead to all sorts of hard to debug and unexpected problems. So I would
much rather go with a simple patch to check map count first and see
whether somebody actually cares about those shared pages and go from
there.

Minchan, do you want to take my diff and turn it into the proper patch
or should I do it.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-11  8:45         ` Michal Hocko
@ 2020-03-11 22:02           ` Minchan Kim
  2020-03-11 23:53           ` Shakeel Butt
  1 sibling, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2020-03-11 22:02 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, Jann Horn, Linux-MM, kernel list, Daniel Colascione,
	Joel Fernandes (Google)

On Wed, Mar 11, 2020 at 09:45:13AM +0100, Michal Hocko wrote:
> On Tue 10-03-20 15:48:31, Dave Hansen wrote:
> > Maybe instead of just punting on MADV_PAGEOUT for map_count>1 we should
> > only let it affect the *local* process.  We could still put the page in
> > the swap cache, we just wouldn't go do the rmap walk.
> 
> Is it really worth medling with the reclaim code and special case
> MADV_PAGEOUT there? I mean it is quite reasonable to have an initial
> implementation that doesn't really touch shared pages because that can
> lead to all sorts of hard to debug and unexpected problems. So I would
> much rather go with a simple patch to check map count first and see
> whether somebody actually cares about those shared pages and go from
> there.
> 
> Minchan, do you want to take my diff and turn it into the proper patch
> or should I do it.

Hey Michal,

It would be great if you could send it.
Thanks.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-11  8:45         ` Michal Hocko
  2020-03-11 22:02           ` Minchan Kim
@ 2020-03-11 23:53           ` Shakeel Butt
  2020-03-12  0:18             ` Minchan Kim
  1 sibling, 1 reply; 34+ messages in thread
From: Shakeel Butt @ 2020-03-11 23:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Dave Hansen, Minchan Kim, Jann Horn, Linux-MM, kernel list,
	Daniel Colascione, Joel Fernandes (Google)

On Wed, Mar 11, 2020 at 1:45 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Tue 10-03-20 15:48:31, Dave Hansen wrote:
> > Maybe instead of just punting on MADV_PAGEOUT for map_count>1 we should
> > only let it affect the *local* process.  We could still put the page in
> > the swap cache, we just wouldn't go do the rmap walk.
>
> Is it really worth medling with the reclaim code and special case
> MADV_PAGEOUT there? I mean it is quite reasonable to have an initial
> implementation that doesn't really touch shared pages because that can
> lead to all sorts of hard to debug and unexpected problems. So I would
> much rather go with a simple patch to check map count first and see
> whether somebody actually cares about those shared pages and go from
> there.
>
> Minchan, do you want to take my diff and turn it into the proper patch
> or should I do it.
>

What about the remote_madvise(MADV_PAGEOUT)? Will your patch disable
the pageout from that code path as well for pages with mapcount > 1?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-11 23:53           ` Shakeel Butt
@ 2020-03-12  0:18             ` Minchan Kim
  2020-03-12  2:03               ` Daniel Colascione
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2020-03-12  0:18 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Michal Hocko, Dave Hansen, Jann Horn, Linux-MM, kernel list,
	Daniel Colascione, Joel Fernandes (Google)

On Wed, Mar 11, 2020 at 04:53:17PM -0700, Shakeel Butt wrote:
> On Wed, Mar 11, 2020 at 1:45 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Tue 10-03-20 15:48:31, Dave Hansen wrote:
> > > Maybe instead of just punting on MADV_PAGEOUT for map_count>1 we should
> > > only let it affect the *local* process.  We could still put the page in
> > > the swap cache, we just wouldn't go do the rmap walk.
> >
> > Is it really worth medling with the reclaim code and special case
> > MADV_PAGEOUT there? I mean it is quite reasonable to have an initial
> > implementation that doesn't really touch shared pages because that can
> > lead to all sorts of hard to debug and unexpected problems. So I would
> > much rather go with a simple patch to check map count first and see
> > whether somebody actually cares about those shared pages and go from
> > there.
> >
> > Minchan, do you want to take my diff and turn it into the proper patch
> > or should I do it.
> >
> 
> What about the remote_madvise(MADV_PAGEOUT)? Will your patch disable
> the pageout from that code path as well for pages with mapcount > 1?

Maybe, not because process_madvise syscall needs more previliedge(ie,
PTRACE_MODE_ATTACH_FSCREDS) so I guess it would be more secure.
So in that case, I want to rely on the LRU chance for shared pages.

With that, the manager process could give a hint to several processes
and finally makes them paging out.

What do you think?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12  0:18             ` Minchan Kim
@ 2020-03-12  2:03               ` Daniel Colascione
  2020-03-12 15:15                 ` Shakeel Butt
  0 siblings, 1 reply; 34+ messages in thread
From: Daniel Colascione @ 2020-03-12  2:03 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Shakeel Butt, Michal Hocko, Dave Hansen, Jann Horn, Linux-MM,
	kernel list, Joel Fernandes (Google)

On Wed, Mar 11, 2020 at 5:18 PM Minchan Kim <minchan@kernel.org> wrote:
>
> On Wed, Mar 11, 2020 at 04:53:17PM -0700, Shakeel Butt wrote:
> > On Wed, Mar 11, 2020 at 1:45 AM Michal Hocko <mhocko@kernel.org> wrote:
> > >
> > > On Tue 10-03-20 15:48:31, Dave Hansen wrote:
> > > > Maybe instead of just punting on MADV_PAGEOUT for map_count>1 we should
> > > > only let it affect the *local* process.  We could still put the page in
> > > > the swap cache, we just wouldn't go do the rmap walk.
> > >
> > > Is it really worth medling with the reclaim code and special case
> > > MADV_PAGEOUT there? I mean it is quite reasonable to have an initial
> > > implementation that doesn't really touch shared pages because that can
> > > lead to all sorts of hard to debug and unexpected problems. So I would
> > > much rather go with a simple patch to check map count first and see
> > > whether somebody actually cares about those shared pages and go from
> > > there.
> > >
> > > Minchan, do you want to take my diff and turn it into the proper patch
> > > or should I do it.
> > >
> >
> > What about the remote_madvise(MADV_PAGEOUT)? Will your patch disable
> > the pageout from that code path as well for pages with mapcount > 1?
>
> Maybe, not because process_madvise syscall needs more previliedge(ie,
> PTRACE_MODE_ATTACH_FSCREDS) so I guess it would be more secure.
> So in that case, I want to rely on the LRU chance for shared pages.

I don't want the behavior of an madvise command to change depending on
*how* the command is invoked. MADV_PAGEOUT should do the same thing
regardless. If you want to allow purging of shared pages as well,
please add a new MADV_PAGEOUT_ALL or something and require a privilege
to use it.

> With that, the manager process could give a hint to several processes
> and finally makes them paging out.

On many different occasions over the past few years, I've found myself
wanting to ask the kernel to do bulk memory management operations. I'd
much rather add *one* facility to allow for multiple-target mm
operations than add one-off special cases for specific callers cases
as they come up. If we had such a bulk operation, the kernel could
inspect the bulk operation payload, see whether the number of
MADV_PAGEOUT requests for a given page were equal to the share count
for that page, and, if so, evict that page despite it being shared.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-10 18:08 interaction of MADV_PAGEOUT with CoW anonymous mappings? Jann Horn
  2020-03-10 18:48 ` Michal Hocko
  2020-03-10 22:14 ` Minchan Kim
@ 2020-03-12  8:22 ` Michal Hocko
  2020-03-12 15:40   ` Vlastimil Babka
  2020-03-12 20:16   ` Minchan Kim
  2 siblings, 2 replies; 34+ messages in thread
From: Michal Hocko @ 2020-03-12  8:22 UTC (permalink / raw)
  To: Jann Horn
  Cc: Minchan Kim, Linux-MM, kernel list, Daniel Colascione,
	Dave Hansen, Joel Fernandes (Google),
	Andrew Morton

[Cc akpm]

So what about this?

From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Thu, 12 Mar 2020 09:04:35 +0100
Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages

Jann has brought up a very interesting point [1]. While shared pages are
excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
that way. This can lead to all sorts of hard to debug problems. E.g.
performance problems outlined by Daniel [2]. There are runtime
environments where there is a substantial memory shared among security
domains via CoW memory and a easy to reclaim way of that memory, which
MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
in for the parent process which might be more privileged or even open
side channel attacks. The feasibility of the later is not really clear
to me TBH but there is no real reason for exposure at this stage. It
seems there is no real use case to depend on reclaiming CoW memory via
madvise at this stage so it is much easier to simply disallow it and
this is what this patch does. Put it simply MADV_{PAGEOUT,COLD} can
operate only on the exclusively owned memory which is a straightforward
semantic.

[1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
[2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 mm/madvise.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 43b47d3fae02..4bb30ed6c8d2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -335,12 +335,14 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		}
 
 		page = pmd_page(orig_pmd);
+
+		/* Do not interfere with other mappings of this page */
+		if (page_mapcount(page) != 1)
+			goto huge_unlock;
+
 		if (next - addr != HPAGE_PMD_SIZE) {
 			int err;
 
-			if (page_mapcount(page) != 1)
-				goto huge_unlock;
-
 			get_page(page);
 			spin_unlock(ptl);
 			lock_page(page);
@@ -426,6 +428,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			continue;
 		}
 
+		/* Do not interfere with other mappings of this page */
+		if (page_mapcount(page) != 1)
+			continue;
+
 		VM_BUG_ON_PAGE(PageTransCompound(page), page);
 
 		if (pte_young(ptent)) {
-- 
2.24.1

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12  2:03               ` Daniel Colascione
@ 2020-03-12 15:15                 ` Shakeel Butt
  0 siblings, 0 replies; 34+ messages in thread
From: Shakeel Butt @ 2020-03-12 15:15 UTC (permalink / raw)
  To: Daniel Colascione
  Cc: Minchan Kim, Michal Hocko, Dave Hansen, Jann Horn, Linux-MM,
	kernel list, Joel Fernandes (Google)

On Wed, Mar 11, 2020 at 7:04 PM Daniel Colascione <dancol@google.com> wrote:
>
> On Wed, Mar 11, 2020 at 5:18 PM Minchan Kim <minchan@kernel.org> wrote:
> >
> > On Wed, Mar 11, 2020 at 04:53:17PM -0700, Shakeel Butt wrote:
> > > On Wed, Mar 11, 2020 at 1:45 AM Michal Hocko <mhocko@kernel.org> wrote:
> > > >
> > > > On Tue 10-03-20 15:48:31, Dave Hansen wrote:
> > > > > Maybe instead of just punting on MADV_PAGEOUT for map_count>1 we should
> > > > > only let it affect the *local* process.  We could still put the page in
> > > > > the swap cache, we just wouldn't go do the rmap walk.
> > > >
> > > > Is it really worth medling with the reclaim code and special case
> > > > MADV_PAGEOUT there? I mean it is quite reasonable to have an initial
> > > > implementation that doesn't really touch shared pages because that can
> > > > lead to all sorts of hard to debug and unexpected problems. So I would
> > > > much rather go with a simple patch to check map count first and see
> > > > whether somebody actually cares about those shared pages and go from
> > > > there.
> > > >
> > > > Minchan, do you want to take my diff and turn it into the proper patch
> > > > or should I do it.
> > > >
> > >
> > > What about the remote_madvise(MADV_PAGEOUT)? Will your patch disable
> > > the pageout from that code path as well for pages with mapcount > 1?
> >
> > Maybe, not because process_madvise syscall needs more previliedge(ie,
> > PTRACE_MODE_ATTACH_FSCREDS) so I guess it would be more secure.
> > So in that case, I want to rely on the LRU chance for shared pages.
>
> I don't want the behavior of an madvise command to change depending on
> *how* the command is invoked. MADV_PAGEOUT should do the same thing
> regardless. If you want to allow purging of shared pages as well,
> please add a new MADV_PAGEOUT_ALL or something and require a privilege
> to use it.
>

I would like to have a way to pageout the shared pages and
MADV_PAGEOUT_ALL approach looks fine to me.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12  8:22 ` Michal Hocko
@ 2020-03-12 15:40   ` Vlastimil Babka
  2020-03-12 20:16   ` Minchan Kim
  1 sibling, 0 replies; 34+ messages in thread
From: Vlastimil Babka @ 2020-03-12 15:40 UTC (permalink / raw)
  To: Michal Hocko, Jann Horn
  Cc: Minchan Kim, Linux-MM, kernel list, Daniel Colascione,
	Dave Hansen, Joel Fernandes (Google),
	Andrew Morton

On 3/12/20 9:22 AM, Michal Hocko wrote:
> [Cc akpm]
> 
> So what about this?
> 
> From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 12 Mar 2020 09:04:35 +0100
> Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> 
> Jann has brought up a very interesting point [1]. While shared pages are
> excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> that way. This can lead to all sorts of hard to debug problems. E.g.
> performance problems outlined by Daniel [2]. There are runtime
> environments where there is a substantial memory shared among security
> domains via CoW memory and a easy to reclaim way of that memory, which
> MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> in for the parent process which might be more privileged or even open
> side channel attacks. The feasibility of the later is not really clear
> to me TBH but there is no real reason for exposure at this stage. It
> seems there is no real use case to depend on reclaiming CoW memory via
> madvise at this stage so it is much easier to simply disallow it and
> this is what this patch does. Put it simply MADV_{PAGEOUT,COLD} can
> operate only on the exclusively owned memory which is a straightforward
> semantic.
> 
> [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
> [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com
> 

Reported-by: Jann Horn <jannh@google.com>
Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD")

Maybe CC stable?

> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/madvise.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 43b47d3fae02..4bb30ed6c8d2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -335,12 +335,14 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		}
>  
>  		page = pmd_page(orig_pmd);
> +
> +		/* Do not interfere with other mappings of this page */
> +		if (page_mapcount(page) != 1)
> +			goto huge_unlock;
> +
>  		if (next - addr != HPAGE_PMD_SIZE) {
>  			int err;
>  
> -			if (page_mapcount(page) != 1)
> -				goto huge_unlock;
> -
>  			get_page(page);
>  			spin_unlock(ptl);
>  			lock_page(page);
> @@ -426,6 +428,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  			continue;
>  		}
>  
> +		/* Do not interfere with other mappings of this page */
> +		if (page_mapcount(page) != 1)
> +			continue;
> +
>  		VM_BUG_ON_PAGE(PageTransCompound(page), page);
>  
>  		if (pte_young(ptent)) {
> 



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12  8:22 ` Michal Hocko
  2020-03-12 15:40   ` Vlastimil Babka
@ 2020-03-12 20:16   ` Minchan Kim
  2020-03-12 20:26     ` Dave Hansen
                       ` (3 more replies)
  1 sibling, 4 replies; 34+ messages in thread
From: Minchan Kim @ 2020-03-12 20:16 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> [Cc akpm]
> 
> So what about this?

Thanks, Michal.

I't likde to wait Jann's reply since Dave gave his opinion about the vulnerability.
https://lore.kernel.org/linux-mm/cf95db88-968d-fee5-1c15-10d024c09d8a@intel.com/
Jann, could you give your insigh about that practically it's possible?

A real dumb question to understand vulnerability:

The attacker would be able to trigger heavy memory consumption so that he
could make paging them out without MADV_PAGEOUT. I know MADV_PAGEOUT makes
it easier but he still could do without MADV_PAGEOUT.
What makes difference here?

To clarify how MADV_PAGEWORK works:
If other process has accessed the page so that his page table has access
bit marked, MADV_PAGEOUT couldn't page it out.

> 
> From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <mhocko@suse.com>
> Date: Thu, 12 Mar 2020 09:04:35 +0100
> Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> 
> Jann has brought up a very interesting point [1]. While shared pages are
> excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> that way. This can lead to all sorts of hard to debug problems. E.g.
> performance problems outlined by Daniel [2]. There are runtime
> environments where there is a substantial memory shared among security
> domains via CoW memory and a easy to reclaim way of that memory, which
> MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> in for the parent process which might be more privileged or even open
> side channel attacks. The feasibility of the later is not really clear

I am not sure it's a good idea to mention performance stuff because
it's rather arguble. You and Johannes already pointed it out when I sbumit
early draft which had shared page filtering out logic due to performance
reason. You guys suggested the shared pages has higher chance to be touched
so that if it's really hot pages, that whould keep in the memory. I agree.

I think the only reason at this moment is just vulnerability.

> to me TBH but there is no real reason for exposure at this stage. It
> seems there is no real use case to depend on reclaiming CoW memory via
> madvise at this stage so it is much easier to simply disallow it and
> this is what this patch does. Put it simply MADV_{PAGEOUT,COLD} can
> operate only on the exclusively owned memory which is a straightforward
> semantic.
> 
> [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
> [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/madvise.c | 12 +++++++++---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 43b47d3fae02..4bb30ed6c8d2 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -335,12 +335,14 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		}
>  
>  		page = pmd_page(orig_pmd);
> +
> +		/* Do not interfere with other mappings of this page */


How about this?
/*
 * paging out only single mapped private pages for anonymous mapping,
 * otherwise, it opens a side channel.
 */

Otherwise, looks good to me.

> +		if (page_mapcount(page) != 1)
> +			goto huge_unlock;
> +
>  		if (next - addr != HPAGE_PMD_SIZE) {
>  			int err;
>  
> -			if (page_mapcount(page) != 1)
> -				goto huge_unlock;
> -
>  			get_page(page);
>  			spin_unlock(ptl);
>  			lock_page(page);
> @@ -426,6 +428,10 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  			continue;
>  		}
>  
> +		/* Do not interfere with other mappings of this page */
> +		if (page_mapcount(page) != 1)
> +			continue;
> +
>  		VM_BUG_ON_PAGE(PageTransCompound(page), page);
>  
>  		if (pte_young(ptent)) {
> -- 
> 2.24.1
> 
> -- 
> Michal Hocko
> SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12 20:16   ` Minchan Kim
@ 2020-03-12 20:26     ` Dave Hansen
  2020-03-12 20:41     ` Michal Hocko
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 34+ messages in thread
From: Dave Hansen @ 2020-03-12 20:26 UTC (permalink / raw)
  To: Minchan Kim, Michal Hocko
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione,
	Joel Fernandes (Google),
	Andrew Morton

On 3/12/20 1:16 PM, Minchan Kim wrote:
> On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> I't likde to wait Jann's reply since Dave gave his opinion about the vulnerability.
> https://lore.kernel.org/linux-mm/cf95db88-968d-fee5-1c15-10d024c09d8a@intel.com/
> Jann, could you give your insigh about that practically it's possible?

FWIW, just checking for mapcount>=1 seems like a pretty sane fix to me.
 I went looking at doing it another way, but Michal was quite correct.
We'd probably end up having to special-case something underneath
shrink_page_list().

> A real dumb question to understand vulnerability:
> 
> The attacker would be able to trigger heavy memory consumption so that he
> could make paging them out without MADV_PAGEOUT. I know MADV_PAGEOUT makes
> it easier but he still could do without MADV_PAGEOUT.
> What makes difference here?

Causing memory pressure is quite a bit more disruptive than
MADV_PAGEOUT.  It's a much more blunt instrument and is likely to result
in a lot of collateral damage and a lot of I/O.

MADV_PAGEOUT is *surgical*.  You can target one very specific page if,
for instance, you think that your victim is reading it in a way that is
vulnerable.  You can also do it with zero I/O (after the initial pageout).

> To clarify how MADV_PAGEWORK works:
> If other process has accessed the page so that his page table has access
> bit marked, MADV_PAGEOUT couldn't page it out.

The attacker doesn't need to get the victim to get a major fault, it
just needs to induce *a* fault.  I actually did an experiment to see how
this would work in practice.

1. Allocate some memory(), touch it
2. fork()
3. In the parent: Loop reading the memory
4. In the child: loop running MADV_PAGEOUT

The pages stayed in the swap cache and the parent reading the memory saw
a constant stream of faults.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12 20:16   ` Minchan Kim
  2020-03-12 20:26     ` Dave Hansen
@ 2020-03-12 20:41     ` Michal Hocko
  2020-03-13  2:08       ` Minchan Kim
  2020-03-12 21:41     ` Dave Hansen
  2020-03-12 23:29     ` Jann Horn
  3 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-03-12 20:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Thu 12-03-20 13:16:02, Minchan Kim wrote:
> On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
[...]
> > From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <mhocko@suse.com>
> > Date: Thu, 12 Mar 2020 09:04:35 +0100
> > Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> > 
> > Jann has brought up a very interesting point [1]. While shared pages are
> > excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> > that way. This can lead to all sorts of hard to debug problems. E.g.
> > performance problems outlined by Daniel [2]. There are runtime
> > environments where there is a substantial memory shared among security
> > domains via CoW memory and a easy to reclaim way of that memory, which
> > MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> > in for the parent process which might be more privileged or even open
> > side channel attacks. The feasibility of the later is not really clear
> 
> I am not sure it's a good idea to mention performance stuff because
> it's rather arguble. You and Johannes already pointed it out when I sbumit
> early draft which had shared page filtering out logic due to performance
> reason. You guys suggested the shared pages has higher chance to be touched
> so that if it's really hot pages, that whould keep in the memory. I agree.

Yes, the hot memory is likely to be referenced but the point was an
unexpected latency because of the major fault. I have to say that I have
underestimated the issue because I was not aware of runtimes mentioned
in the referenced links. Essentially a lot of CoW memory shared over
security domains.

> I think the only reason at this moment is just vulnerability.
> 
> > to me TBH but there is no real reason for exposure at this stage. It
> > seems there is no real use case to depend on reclaiming CoW memory via
> > madvise at this stage so it is much easier to simply disallow it and
> > this is what this patch does. Put it simply MADV_{PAGEOUT,COLD} can
> > operate only on the exclusively owned memory which is a straightforward
> > semantic.
> > 
> > [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
> > [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > ---
> >  mm/madvise.c | 12 +++++++++---
> >  1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 43b47d3fae02..4bb30ed6c8d2 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -335,12 +335,14 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >  		}
> >  
> >  		page = pmd_page(orig_pmd);
> > +
> > +		/* Do not interfere with other mappings of this page */
> 
> 
> How about this?
> /*
>  * paging out only single mapped private pages for anonymous mapping,
>  * otherwise, it opens a side channel.
>  */

I am not sure this is much more helpful without a larger context. I
would stick with the wording unless you insist.
 
> Otherwise, looks good to me.

Thanks for the review.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12 20:16   ` Minchan Kim
  2020-03-12 20:26     ` Dave Hansen
  2020-03-12 20:41     ` Michal Hocko
@ 2020-03-12 21:41     ` Dave Hansen
  2020-03-13  2:00       ` Minchan Kim
  2020-03-12 23:29     ` Jann Horn
  3 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2020-03-12 21:41 UTC (permalink / raw)
  To: Minchan Kim, Michal Hocko
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione,
	Joel Fernandes (Google),
	Andrew Morton


[-- Attachment #1: Type: text/plain, Size: 694 bytes --]

One other fun thing.  I have a "victim" thread sitting in a loop doing:

	sleep(1)
	memcpy(&garbage, buffer, sz);

The "attacker" is doing

	madvise(buffer, sz, MADV_PAGEOUT);

in a loop.  That, oddly enough doesn't cause the victim to page fault.
But, if I do:

	memcpy(&garbage, buffer, sz);
	madvise(buffer, sz, MADV_PAGEOUT);

It *does* cause the memory to get paged out.  The MADV_PAGEOUT code
actually has a !pte_present() check.  It will punt on a PTE if it sees
it.  In other words, if a page is in the swap cache but not mapped by a
pte_present() PTE, MADV_PAGEOUT won't touch it.

Shouldn't MADV_PAGEOUT be able to find and reclaim those pages?  Patch
attached.

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: madv-pageout-find-swap-cache.patch --]
[-- Type: text/x-patch; name="madv-pageout-find-swap-cache.patch", Size: 1692 bytes --]



---

 b/mm/madvise.c |   38 +++++++++++++++++++++++++++++++-------
 1 file changed, 31 insertions(+), 7 deletions(-)

diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c
--- a/mm/madvise.c~madv-pageout-find-swap-cache	2020-03-12 14:24:45.178775035 -0700
+++ b/mm/madvise.c	2020-03-12 14:35:49.706773378 -0700
@@ -248,6 +248,36 @@ static void force_shm_swapin_readahead(s
 #endif		/* CONFIG_SWAP */
 
 /*
+ * Given a PTE, find the corresponding 'struct page'.  Also handles
+ * non-present swap PTEs.
+ */
+struct page *pte_to_reclaim_page(struct vm_area_struct *vma,
+				 unsigned long addr, pte_t ptent)
+{
+	swp_entry_t entry;
+
+	/* Totally empty PTE: */
+	if (pte_none(ptent))
+		return NULL;
+
+	/* A normal, present page is mapped: */
+	if (pte_present(ptent))
+		return vm_normal_page(vma, addr, ptent);
+
+	entry = pte_to_swp_entry(vmf->orig_pte);
+	/* Is it one of the "swap PTEs" that's not really swap? */
+	if (non_swap_entry(entry))
+		return false;
+
+	/*
+	 * The PTE was a true swap entry.  The page may be in the
+	 * swap cache.  If so, find it and return it so it may be
+	 * reclaimed.
+	 */
+	return lookup_swap_cache(entry, vma, addr);
+}
+
+/*
  * Schedule all required I/O operations.  Do not wait for completion.
  */
 static long madvise_willneed(struct vm_area_struct *vma,
@@ -389,13 +419,7 @@ regular_page:
 	for (; addr < end; pte++, addr += PAGE_SIZE) {
 		ptent = *pte;
 
-		if (pte_none(ptent))
-			continue;
-
-		if (!pte_present(ptent))
-			continue;
-
-		page = vm_normal_page(vma, addr, ptent);
+		page = pte_to_reclaim_page(vma, addr, ptent);
 		if (!page)
 			continue;
 
_

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12 20:16   ` Minchan Kim
                       ` (2 preceding siblings ...)
  2020-03-12 21:41     ` Dave Hansen
@ 2020-03-12 23:29     ` Jann Horn
  3 siblings, 0 replies; 34+ messages in thread
From: Jann Horn @ 2020-03-12 23:29 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Linux-MM, kernel list, Daniel Colascione,
	Dave Hansen, Joel Fernandes (Google),
	Andrew Morton

On Thu, Mar 12, 2020 at 9:16 PM Minchan Kim <minchan@kernel.org> wrote:
> On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> > [Cc akpm]
> >
> > So what about this?
>
> Thanks, Michal.
>
> I't likde to wait Jann's reply since Dave gave his opinion about the vulnerability.
> https://lore.kernel.org/linux-mm/cf95db88-968d-fee5-1c15-10d024c09d8a@intel.com/
> Jann, could you give your insigh about that practically it's possible?

Since I don't really know much about the whole LVI attack, and in
particular don't have any practical experience with it, I don't think
I can help with insight on that. So since Dave said that he doesn't
think it's practical, I'll defer to him on that.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12 21:41     ` Dave Hansen
@ 2020-03-13  2:00       ` Minchan Kim
  2020-03-13 16:59         ` Dave Hansen
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2020-03-13  2:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Michal Hocko, Jann Horn, Linux-MM, kernel list,
	Daniel Colascione, Joel Fernandes (Google),
	Andrew Morton

On Thu, Mar 12, 2020 at 02:41:07PM -0700, Dave Hansen wrote:
> One other fun thing.  I have a "victim" thread sitting in a loop doing:
> 
> 	sleep(1)
> 	memcpy(&garbage, buffer, sz);
> 
> The "attacker" is doing
> 
> 	madvise(buffer, sz, MADV_PAGEOUT);
> 
> in a loop.  That, oddly enough doesn't cause the victim to page fault.
> But, if I do:
> 
> 	memcpy(&garbage, buffer, sz);
> 	madvise(buffer, sz, MADV_PAGEOUT);
> 
> It *does* cause the memory to get paged out.  The MADV_PAGEOUT code
> actually has a !pte_present() check.  It will punt on a PTE if it sees
> it.  In other words, if a page is in the swap cache but not mapped by a
> pte_present() PTE, MADV_PAGEOUT won't touch it.
> 
> Shouldn't MADV_PAGEOUT be able to find and reclaim those pages?  Patch
> attached.

> 
> 
> ---
> 
>  b/mm/madvise.c |   38 +++++++++++++++++++++++++++++++-------
>  1 file changed, 31 insertions(+), 7 deletions(-)
> 
> diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c
> --- a/mm/madvise.c~madv-pageout-find-swap-cache	2020-03-12 14:24:45.178775035 -0700
> +++ b/mm/madvise.c	2020-03-12 14:35:49.706773378 -0700
> @@ -248,6 +248,36 @@ static void force_shm_swapin_readahead(s
>  #endif		/* CONFIG_SWAP */
>  
>  /*
> + * Given a PTE, find the corresponding 'struct page'.  Also handles
> + * non-present swap PTEs.
> + */
> +struct page *pte_to_reclaim_page(struct vm_area_struct *vma,
> +				 unsigned long addr, pte_t ptent)
> +{
> +	swp_entry_t entry;
> +
> +	/* Totally empty PTE: */
> +	if (pte_none(ptent))
> +		return NULL;
> +
> +	/* A normal, present page is mapped: */
> +	if (pte_present(ptent))
> +		return vm_normal_page(vma, addr, ptent);
> +

Please check is_swap_pte first.

> +	entry = pte_to_swp_entry(vmf->orig_pte);
> +	/* Is it one of the "swap PTEs" that's not really swap? */
> +	if (non_swap_entry(entry))
> +		return false;
> +
> +	/*
> +	 * The PTE was a true swap entry.  The page may be in the
> +	 * swap cache.  If so, find it and return it so it may be
> +	 * reclaimed.
> +	 */
> +	return lookup_swap_cache(entry, vma, addr);

If we go with handling only exclusived owned page for anon,
I think we should apply the rule to swap cache, too.

Do you mind posting it as formal patch?

Thanks for the explain about vulnerability and the patch, Dave!

> +}
> +
> +/*
>   * Schedule all required I/O operations.  Do not wait for completion.
>   */
>  static long madvise_willneed(struct vm_area_struct *vma,
> @@ -389,13 +419,7 @@ regular_page:
>  	for (; addr < end; pte++, addr += PAGE_SIZE) {
>  		ptent = *pte;
>  
> -		if (pte_none(ptent))
> -			continue;
> -
> -		if (!pte_present(ptent))
> -			continue;
> -
> -		page = vm_normal_page(vma, addr, ptent);
> +		page = pte_to_reclaim_page(vma, addr, ptent);
>  		if (!page)
>  			continue;
>  
> _



^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-12 20:41     ` Michal Hocko
@ 2020-03-13  2:08       ` Minchan Kim
  2020-03-13  8:05         ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2020-03-13  2:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Thu, Mar 12, 2020 at 09:41:55PM +0100, Michal Hocko wrote:
> On Thu 12-03-20 13:16:02, Minchan Kim wrote:
> > On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> [...]
> > > From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> > > From: Michal Hocko <mhocko@suse.com>
> > > Date: Thu, 12 Mar 2020 09:04:35 +0100
> > > Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> > > 
> > > Jann has brought up a very interesting point [1]. While shared pages are
> > > excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> > > that way. This can lead to all sorts of hard to debug problems. E.g.
> > > performance problems outlined by Daniel [2]. There are runtime
> > > environments where there is a substantial memory shared among security
> > > domains via CoW memory and a easy to reclaim way of that memory, which
> > > MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> > > in for the parent process which might be more privileged or even open
> > > side channel attacks. The feasibility of the later is not really clear
> > 
> > I am not sure it's a good idea to mention performance stuff because
> > it's rather arguble. You and Johannes already pointed it out when I sbumit
> > early draft which had shared page filtering out logic due to performance
> > reason. You guys suggested the shared pages has higher chance to be touched
> > so that if it's really hot pages, that whould keep in the memory. I agree.
> 
> Yes, the hot memory is likely to be referenced but the point was an
> unexpected latency because of the major fault. I have to say that I have

I don't understand your point here. If it's likely to be referenced
among several processes, it doesn't have the major fault latency.
What's your point here?

> underestimated the issue because I was not aware of runtimes mentioned
> in the referenced links. Essentially a lot of CoW memory shared over
> security domains.

I tend to agree about security part in the description, but still don't
agree with performance concern in the description so I'd like to remove
it in the description. Current situation is caused by security concern
unfortunately, not performance reason.

> 
> > I think the only reason at this moment is just vulnerability.
> > 
> > > to me TBH but there is no real reason for exposure at this stage. It
> > > seems there is no real use case to depend on reclaiming CoW memory via
> > > madvise at this stage so it is much easier to simply disallow it and
> > > this is what this patch does. Put it simply MADV_{PAGEOUT,COLD} can
> > > operate only on the exclusively owned memory which is a straightforward
> > > semantic.
> > > 
> > > [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
> > > [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com
> > > 
> > > Signed-off-by: Michal Hocko <mhocko@suse.com>
> > > ---
> > >  mm/madvise.c | 12 +++++++++---
> > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 43b47d3fae02..4bb30ed6c8d2 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -335,12 +335,14 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> > >  		}
> > >  
> > >  		page = pmd_page(orig_pmd);
> > > +
> > > +		/* Do not interfere with other mappings of this page */
> > 
> > 
> > How about this?
> > /*
> >  * paging out only single mapped private pages for anonymous mapping,
> >  * otherwise, it opens a side channel.
> >  */
> 
> I am not sure this is much more helpful without a larger context. I
> would stick with the wording unless you insist.

The comment you provides explain what code does, not *why*.
Comment is always lack of explaining the whole story but we try to
demonstrate a certain context to make reader careful.
Havind said, I will not insist on it.

Thanks.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-13  2:08       ` Minchan Kim
@ 2020-03-13  8:05         ` Michal Hocko
  2020-03-13 20:59           ` Minchan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-03-13  8:05 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Thu 12-03-20 19:08:51, Minchan Kim wrote:
> On Thu, Mar 12, 2020 at 09:41:55PM +0100, Michal Hocko wrote:
> > On Thu 12-03-20 13:16:02, Minchan Kim wrote:
> > > On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> > [...]
> > > > From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> > > > From: Michal Hocko <mhocko@suse.com>
> > > > Date: Thu, 12 Mar 2020 09:04:35 +0100
> > > > Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> > > > 
> > > > Jann has brought up a very interesting point [1]. While shared pages are
> > > > excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> > > > that way. This can lead to all sorts of hard to debug problems. E.g.
> > > > performance problems outlined by Daniel [2]. There are runtime
> > > > environments where there is a substantial memory shared among security
> > > > domains via CoW memory and a easy to reclaim way of that memory, which
> > > > MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> > > > in for the parent process which might be more privileged or even open
> > > > side channel attacks. The feasibility of the later is not really clear
> > > 
> > > I am not sure it's a good idea to mention performance stuff because
> > > it's rather arguble. You and Johannes already pointed it out when I sbumit
> > > early draft which had shared page filtering out logic due to performance
> > > reason. You guys suggested the shared pages has higher chance to be touched
> > > so that if it's really hot pages, that whould keep in the memory. I agree.
> > 
> > Yes, the hot memory is likely to be referenced but the point was an
> > unexpected latency because of the major fault. I have to say that I have
> 
> I don't understand your point here. If it's likely to be referenced
> among several processes, it doesn't have the major fault latency.
> What's your point here?

a) the particular CoW page might be cold enough to be reclaimed and b)
nothing really prevents the MADV_PAGEOUT to be called faster than the
reference bit being readded.

> > underestimated the issue because I was not aware of runtimes mentioned
> > in the referenced links. Essentially a lot of CoW memory shared over
> > security domains.
> 
> I tend to agree about security part in the description, but still don't
> agree with performance concern in the description so I'd like to remove
> it in the description. Current situation is caused by security concern
> unfortunately, not performance reason.

Well, I have to admit that I haven't seen any actual numbers here but
considering zygote like workload I would rather not even risk it. Even
if the risk is theoretical I would rather put the restriction and
mention it in the changelog. If somebody would like to drop this
restriction it is at least more clear what to test for.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-13  2:00       ` Minchan Kim
@ 2020-03-13 16:59         ` Dave Hansen
  2020-03-13 21:13           ` Minchan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Hansen @ 2020-03-13 16:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Jann Horn, Linux-MM, kernel list,
	Daniel Colascione, Joel Fernandes (Google),
	Andrew Morton

On 3/12/20 7:00 PM, Minchan Kim wrote:
> On Thu, Mar 12, 2020 at 02:41:07PM -0700, Dave Hansen wrote:
>> One other fun thing.  I have a "victim" thread sitting in a loop doing:
>>
>> 	sleep(1)
>> 	memcpy(&garbage, buffer, sz);
>>
>> The "attacker" is doing
>>
>> 	madvise(buffer, sz, MADV_PAGEOUT);
>>
>> in a loop.  That, oddly enough doesn't cause the victim to page fault.
>> But, if I do:
>>
>> 	memcpy(&garbage, buffer, sz);
>> 	madvise(buffer, sz, MADV_PAGEOUT);
>>
>> It *does* cause the memory to get paged out.  The MADV_PAGEOUT code
>> actually has a !pte_present() check.  It will punt on a PTE if it sees
>> it.  In other words, if a page is in the swap cache but not mapped by a
>> pte_present() PTE, MADV_PAGEOUT won't touch it.
>>
>> Shouldn't MADV_PAGEOUT be able to find and reclaim those pages?  Patch
>> attached.
> 
>>
>>
>> ---
>>
>>  b/mm/madvise.c |   38 +++++++++++++++++++++++++++++++-------
>>  1 file changed, 31 insertions(+), 7 deletions(-)
>>
>> diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c
>> --- a/mm/madvise.c~madv-pageout-find-swap-cache	2020-03-12 14:24:45.178775035 -0700
>> +++ b/mm/madvise.c	2020-03-12 14:35:49.706773378 -0700
>> @@ -248,6 +248,36 @@ static void force_shm_swapin_readahead(s
>>  #endif		/* CONFIG_SWAP */
>>  
>>  /*
>> + * Given a PTE, find the corresponding 'struct page'.  Also handles
>> + * non-present swap PTEs.
>> + */
>> +struct page *pte_to_reclaim_page(struct vm_area_struct *vma,
>> +				 unsigned long addr, pte_t ptent)
>> +{
>> +	swp_entry_t entry;
>> +
>> +	/* Totally empty PTE: */
>> +	if (pte_none(ptent))
>> +		return NULL;
>> +
>> +	/* A normal, present page is mapped: */
>> +	if (pte_present(ptent))
>> +		return vm_normal_page(vma, addr, ptent);
>> +
> 
> Please check is_swap_pte first.

Why?

is_swap_pte() duplicates the first two checks.  But, I need an explicit
pte_present() check somewhere because I need to call vm_normal_page()
only on present PTEs.

I guess the pte_present() check could be:

	if (!is_swap_pte(ptent))
		return vm_normal_page(...);

*after* the pte_none() check.

>> +	entry = pte_to_swp_entry(vmf->orig_pte);
>> +	/* Is it one of the "swap PTEs" that's not really swap? */
>> +	if (non_swap_entry(entry))
>> +		return false;
>> +
>> +	/*
>> +	 * The PTE was a true swap entry.  The page may be in the
>> +	 * swap cache.  If so, find it and return it so it may be
>> +	 * reclaimed.
>> +	 */
>> +	return lookup_swap_cache(entry, vma, addr);
> 
> If we go with handling only exclusived owned page for anon,
> I think we should apply the rule to swap cache, too.

I'm going back and forth on it.  If we're just trying to avoid causing
faults in other processes, we could add a mapcount>0 check here in
addition to the mapcount>1 checks that were added in the other patch.

But, if we want a check for true exclusivity: no other swap entries or
mappings, we need to check swap_count() too.  It's getting quite a bit
uglier as I add that it, but I guess we'll see how it looks in the end.

> Do you mind posting it as formal patch?

Yeah, I'll send something out.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-13  8:05         ` Michal Hocko
@ 2020-03-13 20:59           ` Minchan Kim
  2020-03-16  9:20             ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2020-03-13 20:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Fri, Mar 13, 2020 at 09:05:46AM +0100, Michal Hocko wrote:
> On Thu 12-03-20 19:08:51, Minchan Kim wrote:
> > On Thu, Mar 12, 2020 at 09:41:55PM +0100, Michal Hocko wrote:
> > > On Thu 12-03-20 13:16:02, Minchan Kim wrote:
> > > > On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> > > [...]
> > > > > From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > Date: Thu, 12 Mar 2020 09:04:35 +0100
> > > > > Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> > > > > 
> > > > > Jann has brought up a very interesting point [1]. While shared pages are
> > > > > excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> > > > > that way. This can lead to all sorts of hard to debug problems. E.g.
> > > > > performance problems outlined by Daniel [2]. There are runtime
> > > > > environments where there is a substantial memory shared among security
> > > > > domains via CoW memory and a easy to reclaim way of that memory, which
> > > > > MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> > > > > in for the parent process which might be more privileged or even open
> > > > > side channel attacks. The feasibility of the later is not really clear
> > > > 
> > > > I am not sure it's a good idea to mention performance stuff because
> > > > it's rather arguble. You and Johannes already pointed it out when I sbumit
> > > > early draft which had shared page filtering out logic due to performance
> > > > reason. You guys suggested the shared pages has higher chance to be touched
> > > > so that if it's really hot pages, that whould keep in the memory. I agree.
> > > 
> > > Yes, the hot memory is likely to be referenced but the point was an
> > > unexpected latency because of the major fault. I have to say that I have
> > 
> > I don't understand your point here. If it's likely to be referenced
> > among several processes, it doesn't have the major fault latency.
> > What's your point here?
> 
> a) the particular CoW page might be cold enough to be reclaimed and b)

If it is, that means it's *cold* so it's really worth to be reclaimed.

> nothing really prevents the MADV_PAGEOUT to be called faster than the
> reference bit being readded.

Yeb, that's undesirable. I should admit it was not intended when I implemented
PAGEOUT. The thing is page_check_references clears access bit of pte for every
process are sharing the page so that two times MADV_PAGEOUT from a process could
evict the page. That's the really bug. It shouldn't have cleared the
access bit for other processes. What I wanted to do with MADV_PAGEOUT is just
check the reference from the other processes without clearing access bit of pte
and if it found other process pte has access bit, just bail out.

Okay so you're right. Current implementation could cause performance impact
since MADV_PAGEOUT unconditionally clear the access bit of other processes so
your patch will fix it.

Thanks.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-13 16:59         ` Dave Hansen
@ 2020-03-13 21:13           ` Minchan Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2020-03-13 21:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Michal Hocko, Jann Horn, Linux-MM, kernel list,
	Daniel Colascione, Joel Fernandes (Google),
	Andrew Morton

On Fri, Mar 13, 2020 at 09:59:50AM -0700, Dave Hansen wrote:
> On 3/12/20 7:00 PM, Minchan Kim wrote:
> > On Thu, Mar 12, 2020 at 02:41:07PM -0700, Dave Hansen wrote:
> >> One other fun thing.  I have a "victim" thread sitting in a loop doing:
> >>
> >> 	sleep(1)
> >> 	memcpy(&garbage, buffer, sz);
> >>
> >> The "attacker" is doing
> >>
> >> 	madvise(buffer, sz, MADV_PAGEOUT);
> >>
> >> in a loop.  That, oddly enough doesn't cause the victim to page fault.
> >> But, if I do:
> >>
> >> 	memcpy(&garbage, buffer, sz);
> >> 	madvise(buffer, sz, MADV_PAGEOUT);
> >>
> >> It *does* cause the memory to get paged out.  The MADV_PAGEOUT code
> >> actually has a !pte_present() check.  It will punt on a PTE if it sees
> >> it.  In other words, if a page is in the swap cache but not mapped by a
> >> pte_present() PTE, MADV_PAGEOUT won't touch it.
> >>
> >> Shouldn't MADV_PAGEOUT be able to find and reclaim those pages?  Patch
> >> attached.
> > 
> >>
> >>
> >> ---
> >>
> >>  b/mm/madvise.c |   38 +++++++++++++++++++++++++++++++-------
> >>  1 file changed, 31 insertions(+), 7 deletions(-)
> >>
> >> diff -puN mm/madvise.c~madv-pageout-find-swap-cache mm/madvise.c
> >> --- a/mm/madvise.c~madv-pageout-find-swap-cache	2020-03-12 14:24:45.178775035 -0700
> >> +++ b/mm/madvise.c	2020-03-12 14:35:49.706773378 -0700
> >> @@ -248,6 +248,36 @@ static void force_shm_swapin_readahead(s
> >>  #endif		/* CONFIG_SWAP */
> >>  
> >>  /*
> >> + * Given a PTE, find the corresponding 'struct page'.  Also handles
> >> + * non-present swap PTEs.
> >> + */
> >> +struct page *pte_to_reclaim_page(struct vm_area_struct *vma,
> >> +				 unsigned long addr, pte_t ptent)
> >> +{
> >> +	swp_entry_t entry;
> >> +
> >> +	/* Totally empty PTE: */
> >> +	if (pte_none(ptent))
> >> +		return NULL;
> >> +
> >> +	/* A normal, present page is mapped: */
> >> +	if (pte_present(ptent))
> >> +		return vm_normal_page(vma, addr, ptent);
> >> +
> > 
> > Please check is_swap_pte first.
> 
> Why?
> 
> is_swap_pte() duplicates the first two checks.  But, I need an explicit
> pte_present() check somewhere because I need to call vm_normal_page()
> only on present PTEs.
> 
> I guess the pte_present() check could be:
> 
> 	if (!is_swap_pte(ptent))
> 		return vm_normal_page(...);
> 
> *after* the pte_none() check.

Yub, I thought is_swap_pte looks more readable and maintainable for
the change of pte encoding in future. Anyway, I am not insisting.

> 
> >> +	entry = pte_to_swp_entry(vmf->orig_pte);
> >> +	/* Is it one of the "swap PTEs" that's not really swap? */
> >> +	if (non_swap_entry(entry))
> >> +		return false;
> >> +
> >> +	/*
> >> +	 * The PTE was a true swap entry.  The page may be in the
> >> +	 * swap cache.  If so, find it and return it so it may be
> >> +	 * reclaimed.
> >> +	 */
> >> +	return lookup_swap_cache(entry, vma, addr);
> > 
> > If we go with handling only exclusived owned page for anon,
> > I think we should apply the rule to swap cache, too.
> 
> I'm going back and forth on it.  If we're just trying to avoid causing
> faults in other processes, we could add a mapcount>0 check here in
> addition to the mapcount>1 checks that were added in the other patch.
> 
> But, if we want a check for true exclusivity: no other swap entries or
> mappings, we need to check swap_count() too.  It's getting quite a bit
> uglier as I add that it, but I guess we'll see how it looks in the end.

If we go to the map_count > 1 check here and follows the Daniel's suggestion
of MADV_PAGEOUT_ALL to make shared page paging out, that means it clearly
makes semantic change for MADV_PAGEOUT: "paging out only exclusive owned page"
so it would be rather weired if we reclaim swap_count() > 1 of swap cache.
> 
> > Do you mind posting it as formal patch?
> 
> Yeah, I'll send something out.

Thanks for bring up the issue, Dave!


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-13 20:59           ` Minchan Kim
@ 2020-03-16  9:20             ` Michal Hocko
  2020-03-17  1:43               ` Minchan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-03-16  9:20 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Fri 13-03-20 13:59:41, Minchan Kim wrote:
> On Fri, Mar 13, 2020 at 09:05:46AM +0100, Michal Hocko wrote:
> > On Thu 12-03-20 19:08:51, Minchan Kim wrote:
> > > On Thu, Mar 12, 2020 at 09:41:55PM +0100, Michal Hocko wrote:
> > > > On Thu 12-03-20 13:16:02, Minchan Kim wrote:
> > > > > On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> > > > [...]
> > > > > > From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> > > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > > Date: Thu, 12 Mar 2020 09:04:35 +0100
> > > > > > Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> > > > > > 
> > > > > > Jann has brought up a very interesting point [1]. While shared pages are
> > > > > > excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> > > > > > that way. This can lead to all sorts of hard to debug problems. E.g.
> > > > > > performance problems outlined by Daniel [2]. There are runtime
> > > > > > environments where there is a substantial memory shared among security
> > > > > > domains via CoW memory and a easy to reclaim way of that memory, which
> > > > > > MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> > > > > > in for the parent process which might be more privileged or even open
> > > > > > side channel attacks. The feasibility of the later is not really clear
> > > > > 
> > > > > I am not sure it's a good idea to mention performance stuff because
> > > > > it's rather arguble. You and Johannes already pointed it out when I sbumit
> > > > > early draft which had shared page filtering out logic due to performance
> > > > > reason. You guys suggested the shared pages has higher chance to be touched
> > > > > so that if it's really hot pages, that whould keep in the memory. I agree.
> > > > 
> > > > Yes, the hot memory is likely to be referenced but the point was an
> > > > unexpected latency because of the major fault. I have to say that I have
> > > 
> > > I don't understand your point here. If it's likely to be referenced
> > > among several processes, it doesn't have the major fault latency.
> > > What's your point here?
> > 
> > a) the particular CoW page might be cold enough to be reclaimed and b)
> 
> If it is, that means it's *cold* so it's really worth to be reclaimed.
> 
> > nothing really prevents the MADV_PAGEOUT to be called faster than the
> > reference bit being readded.
> 
> Yeb, that's undesirable. I should admit it was not intended when I implemented
> PAGEOUT. The thing is page_check_references clears access bit of pte for every
> process are sharing the page so that two times MADV_PAGEOUT from a process could
> evict the page. That's the really bug.

I do not really think this is a bug. This is a side effect of the
reclaim process and we do not really want MADV_{PAGEOUT,COLD} behave
differently here because then the behavior would be even harder to
understand.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-16  9:20             ` Michal Hocko
@ 2020-03-17  1:43               ` Minchan Kim
  2020-03-17  7:12                 ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2020-03-17  1:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Mon, Mar 16, 2020 at 10:20:52AM +0100, Michal Hocko wrote:
> On Fri 13-03-20 13:59:41, Minchan Kim wrote:
> > On Fri, Mar 13, 2020 at 09:05:46AM +0100, Michal Hocko wrote:
> > > On Thu 12-03-20 19:08:51, Minchan Kim wrote:
> > > > On Thu, Mar 12, 2020 at 09:41:55PM +0100, Michal Hocko wrote:
> > > > > On Thu 12-03-20 13:16:02, Minchan Kim wrote:
> > > > > > On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> > > > > [...]
> > > > > > > From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> > > > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > > > Date: Thu, 12 Mar 2020 09:04:35 +0100
> > > > > > > Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> > > > > > > 
> > > > > > > Jann has brought up a very interesting point [1]. While shared pages are
> > > > > > > excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> > > > > > > that way. This can lead to all sorts of hard to debug problems. E.g.
> > > > > > > performance problems outlined by Daniel [2]. There are runtime
> > > > > > > environments where there is a substantial memory shared among security
> > > > > > > domains via CoW memory and a easy to reclaim way of that memory, which
> > > > > > > MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> > > > > > > in for the parent process which might be more privileged or even open
> > > > > > > side channel attacks. The feasibility of the later is not really clear
> > > > > > 
> > > > > > I am not sure it's a good idea to mention performance stuff because
> > > > > > it's rather arguble. You and Johannes already pointed it out when I sbumit
> > > > > > early draft which had shared page filtering out logic due to performance
> > > > > > reason. You guys suggested the shared pages has higher chance to be touched
> > > > > > so that if it's really hot pages, that whould keep in the memory. I agree.
> > > > > 
> > > > > Yes, the hot memory is likely to be referenced but the point was an
> > > > > unexpected latency because of the major fault. I have to say that I have
> > > > 
> > > > I don't understand your point here. If it's likely to be referenced
> > > > among several processes, it doesn't have the major fault latency.
> > > > What's your point here?
> > > 
> > > a) the particular CoW page might be cold enough to be reclaimed and b)
> > 
> > If it is, that means it's *cold* so it's really worth to be reclaimed.
> > 
> > > nothing really prevents the MADV_PAGEOUT to be called faster than the
> > > reference bit being readded.
> > 
> > Yeb, that's undesirable. I should admit it was not intended when I implemented
> > PAGEOUT. The thing is page_check_references clears access bit of pte for every
> > process are sharing the page so that two times MADV_PAGEOUT from a process could
> > evict the page. That's the really bug.
> 
> I do not really think this is a bug. This is a side effect of the
> reclaim process and we do not really want MADV_{PAGEOUT,COLD} behave

No, that's the bug since we didn't consider the side effect.

> differently here because then the behavior would be even harder to

No, I do want to have difference because it's per-process hint. IOW,
what he know is for only his context, not others so it shouldn't clean
others' pte. That makes difference between LRU aging and the hint.

> understand.

It's not hard to understand.. MADV_PAGEOUT should consider only his
context since it's per-process hint(Even, he couldn't know others'
context) so it shouldn't bother others.

Actually, Dave's suggestion is correct to fix the issue if there
was no isse with side channel attack. However, due to the attack
issue, page_mapcount could prevent the problem effectively.
That's why I am not against of the patch now since it fixes
the bug as well as vulnerability.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-17  1:43               ` Minchan Kim
@ 2020-03-17  7:12                 ` Michal Hocko
  2020-03-17 15:00                   ` Minchan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-03-17  7:12 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Mon 16-03-20 18:43:40, Minchan Kim wrote:
> On Mon, Mar 16, 2020 at 10:20:52AM +0100, Michal Hocko wrote:
> > On Fri 13-03-20 13:59:41, Minchan Kim wrote:
> > > On Fri, Mar 13, 2020 at 09:05:46AM +0100, Michal Hocko wrote:
> > > > On Thu 12-03-20 19:08:51, Minchan Kim wrote:
> > > > > On Thu, Mar 12, 2020 at 09:41:55PM +0100, Michal Hocko wrote:
> > > > > > On Thu 12-03-20 13:16:02, Minchan Kim wrote:
> > > > > > > On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> > > > > > [...]
> > > > > > > > From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> > > > > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > > > > Date: Thu, 12 Mar 2020 09:04:35 +0100
> > > > > > > > Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> > > > > > > > 
> > > > > > > > Jann has brought up a very interesting point [1]. While shared pages are
> > > > > > > > excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> > > > > > > > that way. This can lead to all sorts of hard to debug problems. E.g.
> > > > > > > > performance problems outlined by Daniel [2]. There are runtime
> > > > > > > > environments where there is a substantial memory shared among security
> > > > > > > > domains via CoW memory and a easy to reclaim way of that memory, which
> > > > > > > > MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> > > > > > > > in for the parent process which might be more privileged or even open
> > > > > > > > side channel attacks. The feasibility of the later is not really clear
> > > > > > > 
> > > > > > > I am not sure it's a good idea to mention performance stuff because
> > > > > > > it's rather arguble. You and Johannes already pointed it out when I sbumit
> > > > > > > early draft which had shared page filtering out logic due to performance
> > > > > > > reason. You guys suggested the shared pages has higher chance to be touched
> > > > > > > so that if it's really hot pages, that whould keep in the memory. I agree.
> > > > > > 
> > > > > > Yes, the hot memory is likely to be referenced but the point was an
> > > > > > unexpected latency because of the major fault. I have to say that I have
> > > > > 
> > > > > I don't understand your point here. If it's likely to be referenced
> > > > > among several processes, it doesn't have the major fault latency.
> > > > > What's your point here?
> > > > 
> > > > a) the particular CoW page might be cold enough to be reclaimed and b)
> > > 
> > > If it is, that means it's *cold* so it's really worth to be reclaimed.
> > > 
> > > > nothing really prevents the MADV_PAGEOUT to be called faster than the
> > > > reference bit being readded.
> > > 
> > > Yeb, that's undesirable. I should admit it was not intended when I implemented
> > > PAGEOUT. The thing is page_check_references clears access bit of pte for every
> > > process are sharing the page so that two times MADV_PAGEOUT from a process could
> > > evict the page. That's the really bug.
> > 
> > I do not really think this is a bug. This is a side effect of the
> > reclaim process and we do not really want MADV_{PAGEOUT,COLD} behave
> 
> No, that's the bug since we didn't consider the side effect.
> 
> > differently here because then the behavior would be even harder to
> 
> No, I do want to have difference because it's per-process hint. IOW,
> what he know is for only his context, not others so it shouldn't clean
> others' pte. That makes difference between LRU aging and the hint.

Just to make it clear, are you really suggesting to special case
page_check_references for madvise path?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-17  7:12                 ` Michal Hocko
@ 2020-03-17 15:00                   ` Minchan Kim
  2020-03-17 15:58                     ` Michal Hocko
  0 siblings, 1 reply; 34+ messages in thread
From: Minchan Kim @ 2020-03-17 15:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Tue, Mar 17, 2020 at 08:12:39AM +0100, Michal Hocko wrote:
> On Mon 16-03-20 18:43:40, Minchan Kim wrote:
> > On Mon, Mar 16, 2020 at 10:20:52AM +0100, Michal Hocko wrote:
> > > On Fri 13-03-20 13:59:41, Minchan Kim wrote:
> > > > On Fri, Mar 13, 2020 at 09:05:46AM +0100, Michal Hocko wrote:
> > > > > On Thu 12-03-20 19:08:51, Minchan Kim wrote:
> > > > > > On Thu, Mar 12, 2020 at 09:41:55PM +0100, Michal Hocko wrote:
> > > > > > > On Thu 12-03-20 13:16:02, Minchan Kim wrote:
> > > > > > > > On Thu, Mar 12, 2020 at 09:22:48AM +0100, Michal Hocko wrote:
> > > > > > > [...]
> > > > > > > > > From eca97990372679c097a88164ff4b3d7879b0e127 Mon Sep 17 00:00:00 2001
> > > > > > > > > From: Michal Hocko <mhocko@suse.com>
> > > > > > > > > Date: Thu, 12 Mar 2020 09:04:35 +0100
> > > > > > > > > Subject: [PATCH] mm: do not allow MADV_PAGEOUT for CoW pages
> > > > > > > > > 
> > > > > > > > > Jann has brought up a very interesting point [1]. While shared pages are
> > > > > > > > > excluded from MADV_PAGEOUT normally, CoW pages can be easily reclaimed
> > > > > > > > > that way. This can lead to all sorts of hard to debug problems. E.g.
> > > > > > > > > performance problems outlined by Daniel [2]. There are runtime
> > > > > > > > > environments where there is a substantial memory shared among security
> > > > > > > > > domains via CoW memory and a easy to reclaim way of that memory, which
> > > > > > > > > MADV_{COLD,PAGEOUT} offers, can lead to either performance degradation
> > > > > > > > > in for the parent process which might be more privileged or even open
> > > > > > > > > side channel attacks. The feasibility of the later is not really clear
> > > > > > > > 
> > > > > > > > I am not sure it's a good idea to mention performance stuff because
> > > > > > > > it's rather arguble. You and Johannes already pointed it out when I sbumit
> > > > > > > > early draft which had shared page filtering out logic due to performance
> > > > > > > > reason. You guys suggested the shared pages has higher chance to be touched
> > > > > > > > so that if it's really hot pages, that whould keep in the memory. I agree.
> > > > > > > 
> > > > > > > Yes, the hot memory is likely to be referenced but the point was an
> > > > > > > unexpected latency because of the major fault. I have to say that I have
> > > > > > 
> > > > > > I don't understand your point here. If it's likely to be referenced
> > > > > > among several processes, it doesn't have the major fault latency.
> > > > > > What's your point here?
> > > > > 
> > > > > a) the particular CoW page might be cold enough to be reclaimed and b)
> > > > 
> > > > If it is, that means it's *cold* so it's really worth to be reclaimed.
> > > > 
> > > > > nothing really prevents the MADV_PAGEOUT to be called faster than the
> > > > > reference bit being readded.
> > > > 
> > > > Yeb, that's undesirable. I should admit it was not intended when I implemented
> > > > PAGEOUT. The thing is page_check_references clears access bit of pte for every
> > > > process are sharing the page so that two times MADV_PAGEOUT from a process could
> > > > evict the page. That's the really bug.
> > > 
> > > I do not really think this is a bug. This is a side effect of the
> > > reclaim process and we do not really want MADV_{PAGEOUT,COLD} behave
> > 
> > No, that's the bug since we didn't consider the side effect.
> > 
> > > differently here because then the behavior would be even harder to
> > 
> > No, I do want to have difference because it's per-process hint. IOW,
> > what he know is for only his context, not others so it shouldn't clean
> > others' pte. That makes difference between LRU aging and the hint.
> 
> Just to make it clear, are you really suggesting to special case
> page_check_references for madvise path?
> 

No, (page_mapcount() >  1) checks *effectively* fixes the performance
bug as well as vulnerability issue.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-17 15:00                   ` Minchan Kim
@ 2020-03-17 15:58                     ` Michal Hocko
  2020-03-17 17:20                       ` Minchan Kim
  0 siblings, 1 reply; 34+ messages in thread
From: Michal Hocko @ 2020-03-17 15:58 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Tue 17-03-20 08:00:55, Minchan Kim wrote:
> On Tue, Mar 17, 2020 at 08:12:39AM +0100, Michal Hocko wrote:
[...]
> > Just to make it clear, are you really suggesting to special case
> > page_check_references for madvise path?
> > 
> 
> No, (page_mapcount() >  1) checks *effectively* fixes the performance
> bug as well as vulnerability issue.

Ahh, ok then we are on the same page. You were replying to the part
where I have pointed out that you can control aging by these calls
and your response suggested that this is somehow undesirable behavior or
even a bug.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: interaction of MADV_PAGEOUT with CoW anonymous mappings?
  2020-03-17 15:58                     ` Michal Hocko
@ 2020-03-17 17:20                       ` Minchan Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Minchan Kim @ 2020-03-17 17:20 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Jann Horn, Linux-MM, kernel list, Daniel Colascione, Dave Hansen,
	Joel Fernandes (Google),
	Andrew Morton

On Tue, Mar 17, 2020 at 04:58:55PM +0100, Michal Hocko wrote:
> On Tue 17-03-20 08:00:55, Minchan Kim wrote:
> > On Tue, Mar 17, 2020 at 08:12:39AM +0100, Michal Hocko wrote:
> [...]
> > > Just to make it clear, are you really suggesting to special case
> > > page_check_references for madvise path?
> > > 
> > 
> > No, (page_mapcount() >  1) checks *effectively* fixes the performance
> > bug as well as vulnerability issue.
> 
> Ahh, ok then we are on the same page. You were replying to the part
> where I have pointed out that you can control aging by these calls
> and your response suggested that this is somehow undesirable behavior or
> even a bug.

Sorry about the confusing.

I want to clarify my speaking.

If we don't have vulnerability issue Jann raised, the performance issue
Daniel pointed should be fixed by introducing a special flag in
page_check_references from madvise path to avoid cleaning of access bit
from other processes's pte. With it, we don't need to limit semantic of
MADV_PAGEOUT as "exclusive page only" so that MADV_PAGEOUT will work
*cold* shared pages as well as exclusive one.

However, since we have the vulnerability issue, *unfortunately*, we need
to make MADV_PAGEOUT's semantic working with only exclusive page. 
Thus, page_mapcount check in madvise patch will fix both issues
*effectively*.

Thanks.


^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, back to index

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-10 18:08 interaction of MADV_PAGEOUT with CoW anonymous mappings? Jann Horn
2020-03-10 18:48 ` Michal Hocko
2020-03-10 19:11   ` Jann Horn
2020-03-10 21:09     ` Michal Hocko
2020-03-10 22:48       ` Dave Hansen
2020-03-11  8:45         ` Michal Hocko
2020-03-11 22:02           ` Minchan Kim
2020-03-11 23:53           ` Shakeel Butt
2020-03-12  0:18             ` Minchan Kim
2020-03-12  2:03               ` Daniel Colascione
2020-03-12 15:15                 ` Shakeel Butt
2020-03-10 20:19   ` Daniel Colascione
2020-03-10 21:40     ` Jann Horn
2020-03-10 21:52       ` Daniel Colascione
2020-03-10 22:14 ` Minchan Kim
2020-03-12  8:22 ` Michal Hocko
2020-03-12 15:40   ` Vlastimil Babka
2020-03-12 20:16   ` Minchan Kim
2020-03-12 20:26     ` Dave Hansen
2020-03-12 20:41     ` Michal Hocko
2020-03-13  2:08       ` Minchan Kim
2020-03-13  8:05         ` Michal Hocko
2020-03-13 20:59           ` Minchan Kim
2020-03-16  9:20             ` Michal Hocko
2020-03-17  1:43               ` Minchan Kim
2020-03-17  7:12                 ` Michal Hocko
2020-03-17 15:00                   ` Minchan Kim
2020-03-17 15:58                     ` Michal Hocko
2020-03-17 17:20                       ` Minchan Kim
2020-03-12 21:41     ` Dave Hansen
2020-03-13  2:00       ` Minchan Kim
2020-03-13 16:59         ` Dave Hansen
2020-03-13 21:13           ` Minchan Kim
2020-03-12 23:29     ` Jann Horn

Linux-mm Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linux-mm/0 linux-mm/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linux-mm linux-mm/ https://lore.kernel.org/linux-mm \
		linux-mm@kvack.org
	public-inbox-index linux-mm

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kvack.linux-mm


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git