linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
@ 2015-04-24 15:01 Mark Williamson
  2015-04-24 15:26 ` Mark Seaborn
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Williamson @ 2015-04-24 15:01 UTC (permalink / raw)
  To: linux-kernel
  Cc: Kirill A. Shutemov, Pavel Emelyanov, Konstantin Khlebnikov,
	Andrew Morton, Linus Torvalds, Mark Seaborn, Andy Lutomirski,
	linux-api, Finn Grimwood, Daniel James

Hi all,

<resending without unwanted HTML-ifying - apologies for the noise if
this appears twice for you!>

Recent changes have restricted a userspace interface used by our
product; specifically, a security patch to require CAP_SYS_ADMIN when
opening /proc/PID/pagemap
(https://github.com/torvalds/linux/commit/ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce,
original LKML discussion here: https://lkml.org/lkml/2015/3/9/864).

Although I've marked this as a "Regression", we do realise there are
legitimate security concerns over the original implementation of this
interface.  Still, given the kernel's strong stance on preserving
userspace interfaces, we thought we ought to flag this quickly as
something that has changed application-relevant behaviour.

We believe this change came into released kernels with Linux 4.0.  We
first observed problems when testing on Ubuntu 15.04 this week; I see
the patch is now backported to the various -stable kernel lines, so
I'd expect it to show up in other distros in due course.  The obvious
solution (to simply run with CAP_SYS_ADMIN) is quite undesirable for
our product, which is a debugger; we're expecting our users to run
without special privileges.

In our use of /proc/PID/pagemap, we currently make use of the physical
pageframe addresses.  We should be able to work with a scrambled
representation of these (Andy Lutomirski suggested this in the
original discussion - https://lkml.org/lkml/2015/3/16/1273) so long as
the scrambling remained consistent during the lifetime of the open
pagemap file.  Alternatively, if physical addresses were simply zeroed
(also suggested by Pavel Emelyanov -
https://lkml.org/lkml/2015/3/9/871) we would be able to change our
code to rely only on the soft-dirty flag and thus still work
correctly.

I propose to follow up with a patch that provides unprivileged access
to /proc/PID/pagemap with the physical pageframe addresses zeroed.
Would this be an acceptable approach?

Thank you,
Mark Williamson

---
Undo Software - http://undo-software.com/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-24 15:01 Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage Mark Williamson
@ 2015-04-24 15:26 ` Mark Seaborn
  2015-04-24 16:43   ` Mark Williamson
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Seaborn @ 2015-04-24 15:26 UTC (permalink / raw)
  To: Mark Williamson
  Cc: kernel list, Kirill A. Shutemov, Pavel Emelyanov,
	Konstantin Khlebnikov, Andrew Morton, Linus Torvalds,
	Andy Lutomirski, linux-api, Finn Grimwood, Daniel James

On 24 April 2015 at 08:01, Mark Williamson
<mwilliamson@undo-software.com> wrote:
> In our use of /proc/PID/pagemap, we currently make use of the physical
> pageframe addresses.  We should be able to work with a scrambled
> representation of these (Andy Lutomirski suggested this in the
> original discussion - https://lkml.org/lkml/2015/3/16/1273) so long as
> the scrambling remained consistent during the lifetime of the open
> pagemap file.  Alternatively, if physical addresses were simply zeroed
> (also suggested by Pavel Emelyanov -
> https://lkml.org/lkml/2015/3/9/871) we would be able to change our
> code to rely only on the soft-dirty flag and thus still work
> correctly.

I'm curious, what do you use the physical page addresses for?

Since you pointed to http://undo-software.com, which talks about
reversible debugging tools, I can guess you would use the soft-dirty
flag to implement copy-on-write snapshotting.  I'm guessing you might
use physical page addresses for determining when the same page is
mapped twice (in the same process or different processes)?

Cheers,
Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-24 15:26 ` Mark Seaborn
@ 2015-04-24 16:43   ` Mark Williamson
  2015-04-29 18:44     ` Mark Williamson
  0 siblings, 1 reply; 25+ messages in thread
From: Mark Williamson @ 2015-04-24 16:43 UTC (permalink / raw)
  To: Mark Seaborn
  Cc: kernel list, Kirill A. Shutemov, Pavel Emelyanov,
	Konstantin Khlebnikov, Andrew Morton, Linus Torvalds,
	Andy Lutomirski, linux-api, Finn Grimwood, Daniel James

Hi Mark,

On Fri, Apr 24, 2015 at 4:26 PM, Mark Seaborn <mseaborn@chromium.org> wrote:
> I'm curious, what do you use the physical page addresses for?
>
> Since you pointed to http://undo-software.com, which talks about
> reversible debugging tools, I can guess you would use the soft-dirty
> flag to implement copy-on-write snapshotting.  I'm guessing you might
> use physical page addresses for determining when the same page is
> mapped twice (in the same process or different processes)?

That's pretty much it.  Actually, we're effectively using the physical
addresses to emulate soft-dirty.  For certain operations (e.g. some
system calls) we need to track what memory has changed since we last
looked at the process state.  We have a mechanism that forks a child
process, runs the system call, then refers to pagemap to figure out
what's been modified.

Currently, our mechanism compares the physical addresses of pages
before and after the syscall so that we can see which pages got CoWed.
This is perhaps a slightly "unconventional" use of the interface but
we support kernels that predate the soft-dirty mechanism and (as far
as we know) this is probably the best way we can answer "What got
changed?" on those releases.

Using the soft-dirty mechanism where available should make our code
both cleaner and faster, so if we can fix the pagemap file to allow
that then we'll be quite happy!

Cheers,
Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-24 16:43   ` Mark Williamson
@ 2015-04-29 18:44     ` Mark Williamson
  2015-04-29 19:23       ` Mark Williamson
  2015-04-29 19:36       ` Kirill A. Shutemov
  0 siblings, 2 replies; 25+ messages in thread
From: Mark Williamson @ 2015-04-29 18:44 UTC (permalink / raw)
  To: Mark Seaborn
  Cc: kernel list, Kirill A. Shutemov, Pavel Emelyanov,
	Konstantin Khlebnikov, Andrew Morton, Linus Torvalds,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

Hi all,

We've been investigating further and found a snag with the PFN-hiding
approach discussed last week - looks like it won't be enough on all
the architectures we support.  Our product runs on x86_32, x86_64 and
ARM.  For now, it looks like soft-dirty is only available on x86_64.
A patch that simply zeros out the physical addresses in
/proc/PID/pagemap will therefore help us on x86_64 but we'll still
have problems on other platforms[1].

For context, we were previously using pagemap as a cross-platform way
to get soft-dirty-like functionality.  Specifically, to ask "did a
process write to any pages since fork()" by comparing addresses and
deducing where CoW must have occurred.  In the absence of soft-dirty
and the physical addresses, it looks like we can't figure that out
with the remaining information in pagemap.

If the pagemap file included the "writeable" bit from the PTE, we
think we'd have all the information required to deduce what we need
(although I realise that's a bit of a nasty workaround).  If I
proposed including the PTE protection bits in pagemap, would that be
controversial?  I'm guessing yes but thought it was worth a shot ;-)
Would anybody be able to suggest a more tasteful approach?

Thanks,
Mark

[1] I'd note that using soft-dirty is clearly the right approach for
us on x64, where available and that ideally we'd use it on other
architectures - cross-arch support for soft-dirty is a slightly
different discussion, which I hope to post another thread for.

On Fri, Apr 24, 2015 at 5:43 PM, Mark Williamson
<mwilliamson@undo-software.com> wrote:
> Hi Mark,
>
> On Fri, Apr 24, 2015 at 4:26 PM, Mark Seaborn <mseaborn@chromium.org> wrote:
>> I'm curious, what do you use the physical page addresses for?
>>
>> Since you pointed to http://undo-software.com, which talks about
>> reversible debugging tools, I can guess you would use the soft-dirty
>> flag to implement copy-on-write snapshotting.  I'm guessing you might
>> use physical page addresses for determining when the same page is
>> mapped twice (in the same process or different processes)?
>
> That's pretty much it.  Actually, we're effectively using the physical
> addresses to emulate soft-dirty.  For certain operations (e.g. some
> system calls) we need to track what memory has changed since we last
> looked at the process state.  We have a mechanism that forks a child
> process, runs the system call, then refers to pagemap to figure out
> what's been modified.
>
> Currently, our mechanism compares the physical addresses of pages
> before and after the syscall so that we can see which pages got CoWed.
> This is perhaps a slightly "unconventional" use of the interface but
> we support kernels that predate the soft-dirty mechanism and (as far
> as we know) this is probably the best way we can answer "What got
> changed?" on those releases.
>
> Using the soft-dirty mechanism where available should make our code
> both cleaner and faster, so if we can fix the pagemap file to allow
> that then we'll be quite happy!
>
> Cheers,
> Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 18:44     ` Mark Williamson
@ 2015-04-29 19:23       ` Mark Williamson
  2015-04-29 19:36       ` Kirill A. Shutemov
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Williamson @ 2015-04-29 19:23 UTC (permalink / raw)
  To: Mark Seaborn
  Cc: kernel list, Kirill A. Shutemov, Pavel Emelyanov,
	Konstantin Khlebnikov, Andrew Morton, Linus Torvalds,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

Hi again,

On Wed, Apr 29, 2015 at 7:44 PM, Mark Williamson
<mwilliamson@undo-software.com> wrote:
> We've been investigating further and found a snag with the PFN-hiding
> approach discussed last week - looks like it won't be enough on all
> the architectures we support.  Our product runs on x86_32, x86_64 and
> ARM.  For now, it looks like soft-dirty is only available on x86_64.
> A patch that simply zeros out the physical addresses in
> /proc/PID/pagemap will therefore help us on x86_64 but we'll still
> have problems on other platforms[1].

Another thought occurs - although we *strictly* want to know "what got
written to", we might be able to get by with a superset of that, such
as "what got accessed, read or write"...

Thus, we could investigate clearing the Referenced bit (which I
understand we can do through /proc/PID/clear_refs) and then just treat
any subsequently-referenced pages as being potentially modified.  It's
not ideal but it might be enough to get by...

I still feel a little nervous with this, since we support distros
(e.g. RHEL5) that are too old to have clear_refs.  Still, it would
result in less disruption to the format of pagemap.

Thanks,
Mark

> For context, we were previously using pagemap as a cross-platform way
> to get soft-dirty-like functionality.  Specifically, to ask "did a
> process write to any pages since fork()" by comparing addresses and
> deducing where CoW must have occurred.  In the absence of soft-dirty
> and the physical addresses, it looks like we can't figure that out
> with the remaining information in pagemap.
>
> If the pagemap file included the "writeable" bit from the PTE, we
> think we'd have all the information required to deduce what we need
> (although I realise that's a bit of a nasty workaround).  If I
> proposed including the PTE protection bits in pagemap, would that be
> controversial?  I'm guessing yes but thought it was worth a shot ;-)
> Would anybody be able to suggest a more tasteful approach?
>
> Thanks,
> Mark
>
> [1] I'd note that using soft-dirty is clearly the right approach for
> us on x64, where available and that ideally we'd use it on other
> architectures - cross-arch support for soft-dirty is a slightly
> different discussion, which I hope to post another thread for.
>
> On Fri, Apr 24, 2015 at 5:43 PM, Mark Williamson
> <mwilliamson@undo-software.com> wrote:
>> Hi Mark,
>>
>> On Fri, Apr 24, 2015 at 4:26 PM, Mark Seaborn <mseaborn@chromium.org> wrote:
>>> I'm curious, what do you use the physical page addresses for?
>>>
>>> Since you pointed to http://undo-software.com, which talks about
>>> reversible debugging tools, I can guess you would use the soft-dirty
>>> flag to implement copy-on-write snapshotting.  I'm guessing you might
>>> use physical page addresses for determining when the same page is
>>> mapped twice (in the same process or different processes)?
>>
>> That's pretty much it.  Actually, we're effectively using the physical
>> addresses to emulate soft-dirty.  For certain operations (e.g. some
>> system calls) we need to track what memory has changed since we last
>> looked at the process state.  We have a mechanism that forks a child
>> process, runs the system call, then refers to pagemap to figure out
>> what's been modified.
>>
>> Currently, our mechanism compares the physical addresses of pages
>> before and after the syscall so that we can see which pages got CoWed.
>> This is perhaps a slightly "unconventional" use of the interface but
>> we support kernels that predate the soft-dirty mechanism and (as far
>> as we know) this is probably the best way we can answer "What got
>> changed?" on those releases.
>>
>> Using the soft-dirty mechanism where available should make our code
>> both cleaner and faster, so if we can fix the pagemap file to allow
>> that then we'll be quite happy!
>>
>> Cheers,
>> Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 18:44     ` Mark Williamson
  2015-04-29 19:23       ` Mark Williamson
@ 2015-04-29 19:36       ` Kirill A. Shutemov
  2015-04-29 20:24         ` Mark Williamson
  2015-04-29 20:33         ` Linus Torvalds
  1 sibling, 2 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2015-04-29 19:36 UTC (permalink / raw)
  To: Mark Williamson
  Cc: Mark Seaborn, kernel list, Kirill A. Shutemov, Pavel Emelyanov,
	Konstantin Khlebnikov, Andrew Morton, Linus Torvalds,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Wed, Apr 29, 2015 at 07:44:57PM +0100, Mark Williamson wrote:
> Hi all,
> 
> We've been investigating further and found a snag with the PFN-hiding
> approach discussed last week - looks like it won't be enough on all
> the architectures we support.  Our product runs on x86_32, x86_64 and
> ARM.  For now, it looks like soft-dirty is only available on x86_64.
> A patch that simply zeros out the physical addresses in
> /proc/PID/pagemap will therefore help us on x86_64 but we'll still
> have problems on other platforms[1].
> 
> For context, we were previously using pagemap as a cross-platform way
> to get soft-dirty-like functionality.  Specifically, to ask "did a
> process write to any pages since fork()" by comparing addresses and
> deducing where CoW must have occurred.  In the absence of soft-dirty
> and the physical addresses, it looks like we can't figure that out
> with the remaining information in pagemap.
> 
> If the pagemap file included the "writeable" bit from the PTE, we
> think we'd have all the information required to deduce what we need
> (although I realise that's a bit of a nasty workaround).  If I
> proposed including the PTE protection bits in pagemap, would that be
> controversial?  I'm guessing yes but thought it was worth a shot ;-)
> Would anybody be able to suggest a more tasteful approach?

Emm.. I have hard time to understand how writable bit is enough to get
soft-dirty-alike functionality.

Let's say we have anon-mapping with COW setup after the fork(). It's not
writable PTEs to trigger COW on wp faults. But you can easily get to the
same non-writable PTE after breaking COW: fork() again or
mprotect(PROT_READ) and mprotect(PROT_READ|PROT_WRITE) back.

?

> 
> Thanks,
> Mark
> 
> [1] I'd note that using soft-dirty is clearly the right approach for
> us on x64, where available and that ideally we'd use it on other
> architectures - cross-arch support for soft-dirty is a slightly
> different discussion, which I hope to post another thread for.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 19:36       ` Kirill A. Shutemov
@ 2015-04-29 20:24         ` Mark Williamson
  2015-04-29 20:33         ` Linus Torvalds
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Williamson @ 2015-04-29 20:24 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mark Seaborn, kernel list, Kirill A. Shutemov, Pavel Emelyanov,
	Konstantin Khlebnikov, Andrew Morton, Linus Torvalds,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

Hi,

On Wed, Apr 29, 2015 at 8:36 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Wed, Apr 29, 2015 at 07:44:57PM +0100, Mark Williamson wrote:
>> Hi all,
 ... snip ...
>> For context, we were previously using pagemap as a cross-platform way
>> to get soft-dirty-like functionality.  Specifically, to ask "did a
>> process write to any pages since fork()" by comparing addresses and
>> deducing where CoW must have occurred.  In the absence of soft-dirty
>> and the physical addresses, it looks like we can't figure that out
>> with the remaining information in pagemap.
>>
>> If the pagemap file included the "writeable" bit from the PTE, we
>> think we'd have all the information required to deduce what we need
>> (although I realise that's a bit of a nasty workaround).  If I
>> proposed including the PTE protection bits in pagemap, would that be
>> controversial?  I'm guessing yes but thought it was worth a shot ;-)
>> Would anybody be able to suggest a more tasteful approach?
>
> Emm.. I have hard time to understand how writable bit is enough to get
> soft-dirty-alike functionality.

In the general case, you are of course correct - in our specific case
I *think* we'd be able to manage OK ...  (see below).

> Let's say we have anon-mapping with COW setup after the fork(). It's not
> writable PTEs to trigger COW on wp faults. But you can easily get to the
> same non-writable PTE after breaking COW: fork() again or
> mprotect(PROT_READ) and mprotect(PROT_READ|PROT_WRITE) back.

I believe we'll be able to get away with this in our particular
usecase.  The process is running in our debugger at the time and so we
can interpose on the system calls that are happening.  That should
give us the opportunity to check for CoW-breaking before the debuggee
is allowed to alter page protections itself.

It ends up not being full soft-dirty behaviour but it's similar enough
to tell us what we need to know.

Cheers,
Mark

> ?
>
>>
>> Thanks,
>> Mark
>>
>> [1] I'd note that using soft-dirty is clearly the right approach for
>> us on x64, where available and that ideally we'd use it on other
>> architectures - cross-arch support for soft-dirty is a slightly
>> different discussion, which I hope to post another thread for.
>
> --
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 19:36       ` Kirill A. Shutemov
  2015-04-29 20:24         ` Mark Williamson
@ 2015-04-29 20:33         ` Linus Torvalds
  2015-04-29 20:44           ` Konstantin Khlebnikov
  1 sibling, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2015-04-29 20:33 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Mark Williamson, Mark Seaborn, kernel list, Kirill A. Shutemov,
	Pavel Emelyanov, Konstantin Khlebnikov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Wed, Apr 29, 2015 at 12:36 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> Emm.. I have hard time to understand how writable bit is enough to get
> soft-dirty-alike functionality.

I don't think it is.

For anonymous pages, maybe you can play tricks with comparing the page
'anon_vma' with the vma->anon_vma.

I haven't really thought that through, but does something like

   static inline bool page_is_dirty_in_vma(struct page *page, struct
vm_area_struct *vma)
   {
        struct anon_vma *anon_vma = vma->anon_vma;

        return page->mapping == (void *)anon_vma + PAGE_MAPPING_ANON;
   }

end up working as a "page has been dirtied in this mapping"?

If the page came from another process and hasn't been written to, it
will have the anon_vma pointing to the originalting vma.

I may be high on some bad drugs, though. As mentioned, I didn't really
think this through.

                        Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 20:33         ` Linus Torvalds
@ 2015-04-29 20:44           ` Konstantin Khlebnikov
  2015-04-29 21:02             ` Linus Torvalds
  0 siblings, 1 reply; 25+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-29 20:44 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Mark Williamson, Mark Seaborn, kernel list,
	Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Wed, Apr 29, 2015 at 11:33 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 29, 2015 at 12:36 PM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
>>
>> Emm.. I have hard time to understand how writable bit is enough to get
>> soft-dirty-alike functionality.
>
> I don't think it is.
>
> For anonymous pages, maybe you can play tricks with comparing the page
> 'anon_vma' with the vma->anon_vma.
>
> I haven't really thought that through, but does something like
>
>    static inline bool page_is_dirty_in_vma(struct page *page, struct
> vm_area_struct *vma)
>    {
>         struct anon_vma *anon_vma = vma->anon_vma;
>
>         return page->mapping == (void *)anon_vma + PAGE_MAPPING_ANON;
>    }
>
> end up working as a "page has been dirtied in this mapping"?

This's no longer true. After recent fixes for "anon_vma endless growing" new vma
might reuse old anon_vma from grandparent vma.

>
> If the page came from another process and hasn't been written to, it
> will have the anon_vma pointing to the originalting vma.
>
> I may be high on some bad drugs, though. As mentioned, I didn't really
> think this through.
>
>                         Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 20:44           ` Konstantin Khlebnikov
@ 2015-04-29 21:02             ` Linus Torvalds
  2015-04-29 21:05               ` Kirill A. Shutemov
                                 ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Linus Torvalds @ 2015-04-29 21:02 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Kirill A. Shutemov, Mark Williamson, Mark Seaborn, kernel list,
	Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>
> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
> might reuse old anon_vma from grandparent vma.

Oh well. I guess that was too simple.

If Mark is ok with the rule that "it's not reliably if you have two
nested forks" (ie it only works if you exec for every fork you do), it
should still work, right? It sounds like Mark doesn't necessarily need
to handle the *generic* case.

                       Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 21:02             ` Linus Torvalds
@ 2015-04-29 21:05               ` Kirill A. Shutemov
  2015-04-29 21:18                 ` Linus Torvalds
  2015-04-30 11:43               ` Konstantin Khlebnikov
  2015-04-30 11:50               ` Mark Williamson
  2 siblings, 1 reply; 25+ messages in thread
From: Kirill A. Shutemov @ 2015-04-29 21:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Konstantin Khlebnikov, Mark Williamson, Mark Seaborn,
	kernel list, Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Wed, Apr 29, 2015 at 02:02:01PM -0700, Linus Torvalds wrote:
> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> >
> > This's no longer true. After recent fixes for "anon_vma endless growing" new vma
> > might reuse old anon_vma from grandparent vma.
> 
> Oh well. I guess that was too simple.
> 
> If Mark is ok with the rule that "it's not reliably if you have two
> nested forks" (ie it only works if you exec for every fork you do), it
> should still work, right? It sounds like Mark doesn't necessarily need
> to handle the *generic* case.

This sounds too ugly to be exposed it as ABI.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 21:05               ` Kirill A. Shutemov
@ 2015-04-29 21:18                 ` Linus Torvalds
  2015-04-29 21:37                   ` Kirill A. Shutemov
  0 siblings, 1 reply; 25+ messages in thread
From: Linus Torvalds @ 2015-04-29 21:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Konstantin Khlebnikov, Mark Williamson, Mark Seaborn,
	kernel list, Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Wed, Apr 29, 2015 at 2:05 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> This sounds too ugly to be exposed it as ABI.

Oh, pretty it ain't. However, regressions in many ways are worse. If
it makes it possible to not regress...

                     Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 21:18                 ` Linus Torvalds
@ 2015-04-29 21:37                   ` Kirill A. Shutemov
  0 siblings, 0 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2015-04-29 21:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Konstantin Khlebnikov, Mark Williamson, Mark Seaborn,
	kernel list, Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James,
	Cyrill Gorcunov

On Wed, Apr 29, 2015 at 02:18:49PM -0700, Linus Torvalds wrote:
> On Wed, Apr 29, 2015 at 2:05 PM, Kirill A. Shutemov
> <kirill@shutemov.name> wrote:
> >
> > This sounds too ugly to be exposed it as ABI.
> 
> Oh, pretty it ain't. However, regressions in many ways are worse. If
> it makes it possible to not regress...

One idea is to extend kcmp(2) with KCMP_PAGE. idx1 and idx2 are virtual
addresses in two processes. It returns 0 if addresses points to the same
page and 3 otherwise.

Would it be enough for the use case?

I guess it could be too slow to check one page a time...

Invent new kcmpv(2)? ;)

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 21:02             ` Linus Torvalds
  2015-04-29 21:05               ` Kirill A. Shutemov
@ 2015-04-30 11:43               ` Konstantin Khlebnikov
  2015-04-30 13:11                 ` Konstantin Khlebnikov
  2015-04-30 11:50               ` Mark Williamson
  2 siblings, 1 reply; 25+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-30 11:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Mark Williamson, Mark Seaborn, kernel list,
	Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Thu, Apr 30, 2015 at 12:02 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>
>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
>> might reuse old anon_vma from grandparent vma.
>
> Oh well. I guess that was too simple.
>
> If Mark is ok with the rule that "it's not reliably if you have two
> nested forks" (ie it only works if you exec for every fork you do), it
> should still work, right? It sounds like Mark doesn't necessarily need
> to handle the *generic* case.

What about exposing shared/exclusive bit in pagemap == 1 if
page_mapcount() > 1, otherwise 0 (or vise versa).

Seems like this should work for detecting CoWed pages in child mm.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-29 21:02             ` Linus Torvalds
  2015-04-29 21:05               ` Kirill A. Shutemov
  2015-04-30 11:43               ` Konstantin Khlebnikov
@ 2015-04-30 11:50               ` Mark Williamson
  2 siblings, 0 replies; 25+ messages in thread
From: Mark Williamson @ 2015-04-30 11:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Konstantin Khlebnikov, Kirill A. Shutemov, Mark Seaborn,
	kernel list, Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Wed, Apr 29, 2015 at 10:02 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>
>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
>> might reuse old anon_vma from grandparent vma.
>
> Oh well. I guess that was too simple.
>
> If Mark is ok with the rule that "it's not reliably if you have two
> nested forks" (ie it only works if you exec for every fork you do), it
> should still work, right? It sounds like Mark doesn't necessarily need
> to handle the *generic* case.

Yes, it sounds like that should be OK for us.  Our usecase is pretty
restricted, so we're a long way off requiring a generic solution.

Our code will always fork() a fresh child in which to monitor memory
changes.  We run the operations we're interested in, use pagemap to
figure out "what changed" (by comparing whether the pagemap_entry_t
values are different from their parent) and then throw away the child
process.

Currently our code does an entry-by-entry compare of pagemap, so
anything that exposes writes as a change to values in there would
allow us to run unmodified.  That would be really nice.  That said, I
think we'd still be OK to modify our own code too if we can find a
solution that would continue to function on older kernel releases,
-stable trees, etc.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-30 11:43               ` Konstantin Khlebnikov
@ 2015-04-30 13:11                 ` Konstantin Khlebnikov
  2015-04-30 13:22                   ` Kirill A. Shutemov
  2015-04-30 18:32                   ` Mark Williamson
  0 siblings, 2 replies; 25+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-30 13:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Kirill A. Shutemov, Mark Williamson, Mark Seaborn, kernel list,
	Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

[-- Attachment #1: Type: text/plain, Size: 930 bytes --]

On Thu, Apr 30, 2015 at 2:43 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Thu, Apr 30, 2015 at 12:02 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>>>
>>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
>>> might reuse old anon_vma from grandparent vma.
>>
>> Oh well. I guess that was too simple.
>>
>> If Mark is ok with the rule that "it's not reliably if you have two
>> nested forks" (ie it only works if you exec for every fork you do), it
>> should still work, right? It sounds like Mark doesn't necessarily need
>> to handle the *generic* case.
>
> What about exposing shared/exclusive bit in pagemap == 1 if
> page_mapcount() > 1, otherwise 0 (or vise versa).
>
> Seems like this should work for detecting CoWed pages in child mm.

Something like this (see patch in attachment)

[-- Attachment #2: pagemap-add-mmap-shared-bit-for-marking-pages-mapped-multiple-times --]
[-- Type: application/octet-stream, Size: 3456 bytes --]

pagemap: add mmap-shared bit for marking pages mapped multiple times

From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>

This patch sets bit 56 in pagemap if this page is mapped more than once.
It allows to detect shared file and CoWed anon pages without exposing PFN.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
---
 Documentation/vm/pagemap.txt |    3 ++-
 fs/proc/task_mmu.c           |    3 +++
 tools/vm/Makefile            |    2 +-
 tools/vm/page-types.c        |    6 ++++++
 4 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/Documentation/vm/pagemap.txt b/Documentation/vm/pagemap.txt
index 6bfbc172cdb9..435b14c13fb9 100644
--- a/Documentation/vm/pagemap.txt
+++ b/Documentation/vm/pagemap.txt
@@ -16,7 +16,8 @@ There are three components to pagemap:
     * Bits 0-4   swap type if swapped
     * Bits 5-54  swap offset if swapped
     * Bit  55    pte is soft-dirty (see Documentation/vm/soft-dirty.txt)
-    * Bits 56-60 zero
+    * Bit  56    page mapped multiple times (mapcount > 1)
+    * Bits 57-60 zero
     * Bit  61    page is file-page or shared-anon
     * Bit  62    page swapped
     * Bit  63    page present
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 6dee68d013ff..d4b1df22988b 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -982,6 +982,7 @@ struct pagemapread {
 #define PM_STATUS2(v2, x)   (__PM_PSHIFT(v2 ? x : PAGE_SHIFT))
 
 #define __PM_SOFT_DIRTY      (1LL)
+#define __PM_MMAP_SHARED     (2LL)
 #define PM_PRESENT          PM_STATUS(4LL)
 #define PM_SWAP             PM_STATUS(2LL)
 #define PM_FILE             PM_STATUS(1LL)
@@ -1074,6 +1075,8 @@ static void pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *pm,
 
 	if (page && !PageAnon(page))
 		flags |= PM_FILE;
+	if (page && page_mapcount(page) > 1)
+		flags2 |= __PM_MMAP_SHARED;
 	if ((vma->vm_flags & VM_SOFTDIRTY))
 		flags2 |= __PM_SOFT_DIRTY;
 
diff --git a/tools/vm/Makefile b/tools/vm/Makefile
index ac884b65a072..93aadaf7ff63 100644
--- a/tools/vm/Makefile
+++ b/tools/vm/Makefile
@@ -3,7 +3,7 @@
 TARGETS=page-types slabinfo page_owner_sort
 
 LIB_DIR = ../lib/api
-LIBS = $(LIB_DIR)/libapikfs.a
+LIBS = $(LIB_DIR)/libapi.a
 
 CC = $(CROSS_COMPILE)gcc
 CFLAGS = -Wall -Wextra -I../lib/
diff --git a/tools/vm/page-types.c b/tools/vm/page-types.c
index 8bdf16b8ba60..f903196fb883 100644
--- a/tools/vm/page-types.c
+++ b/tools/vm/page-types.c
@@ -70,9 +70,11 @@
 #define PM_PFRAME(x)        ((x) & PM_PFRAME_MASK)
 
 #define __PM_SOFT_DIRTY      (1LL)
+#define __PM_MMAP_SHARED     (2LL)
 #define PM_PRESENT          PM_STATUS(4LL)
 #define PM_SWAP             PM_STATUS(2LL)
 #define PM_SOFT_DIRTY       __PM_PSHIFT(__PM_SOFT_DIRTY)
+#define PM_MMAP_SHARED      __PM_PSHIFT(__PM_MMAP_SHARED)
 
 
 /*
@@ -100,6 +102,7 @@
 #define KPF_SLOB_FREE		49
 #define KPF_SLUB_FROZEN		50
 #define KPF_SLUB_DEBUG		51
+#define KPF_MMAP_SHARED		52
 
 #define KPF_ALL_BITS		((uint64_t)~0ULL)
 #define KPF_HACKERS_BITS	(0xffffULL << 32)
@@ -149,6 +152,7 @@ static const char * const page_flag_names[] = {
 	[KPF_SLOB_FREE]		= "P:slob_free",
 	[KPF_SLUB_FROZEN]	= "A:slub_frozen",
 	[KPF_SLUB_DEBUG]	= "E:slub_debug",
+	[KPF_MMAP_SHARED]	= "5:mmap_shared",
 };
 
 
@@ -452,6 +456,8 @@ static uint64_t expand_overloaded_flags(uint64_t flags, uint64_t pme)
 
 	if (pme & PM_SOFT_DIRTY)
 		flags |= BIT(SOFTDIRTY);
+	if (pme & PM_MMAP_SHARED)
+		flags |= BIT(MMAP_SHARED);
 
 	return flags;
 }

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-30 13:11                 ` Konstantin Khlebnikov
@ 2015-04-30 13:22                   ` Kirill A. Shutemov
  2015-04-30 13:32                     ` Konstantin Khlebnikov
  2015-04-30 18:45                     ` Mark Williamson
  2015-04-30 18:32                   ` Mark Williamson
  1 sibling, 2 replies; 25+ messages in thread
From: Kirill A. Shutemov @ 2015-04-30 13:22 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Linus Torvalds, Mark Williamson, Mark Seaborn, kernel list,
	Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Thu, Apr 30, 2015 at 04:11:30PM +0300, Konstantin Khlebnikov wrote:
> On Thu, Apr 30, 2015 at 2:43 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> > On Thu, Apr 30, 2015 at 12:02 AM, Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> >>>
> >>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
> >>> might reuse old anon_vma from grandparent vma.
> >>
> >> Oh well. I guess that was too simple.
> >>
> >> If Mark is ok with the rule that "it's not reliably if you have two
> >> nested forks" (ie it only works if you exec for every fork you do), it
> >> should still work, right? It sounds like Mark doesn't necessarily need
> >> to handle the *generic* case.
> >
> > What about exposing shared/exclusive bit in pagemap == 1 if
> > page_mapcount() > 1, otherwise 0 (or vise versa).
> >
> > Seems like this should work for detecting CoWed pages in child mm.
> 
> Something like this (see patch in attachment)

THP is not covered.

Any comments on kcmp() idea?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-30 13:22                   ` Kirill A. Shutemov
@ 2015-04-30 13:32                     ` Konstantin Khlebnikov
  2015-04-30 18:45                     ` Mark Williamson
  1 sibling, 0 replies; 25+ messages in thread
From: Konstantin Khlebnikov @ 2015-04-30 13:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Linus Torvalds, Mark Williamson, Mark Seaborn, kernel list,
	Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Thu, Apr 30, 2015 at 4:22 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Thu, Apr 30, 2015 at 04:11:30PM +0300, Konstantin Khlebnikov wrote:
>> On Thu, Apr 30, 2015 at 2:43 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> > On Thu, Apr 30, 2015 at 12:02 AM, Linus Torvalds
>> > <torvalds@linux-foundation.org> wrote:
>> >> On Wed, Apr 29, 2015 at 1:44 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> >>>
>> >>> This's no longer true. After recent fixes for "anon_vma endless growing" new vma
>> >>> might reuse old anon_vma from grandparent vma.
>> >>
>> >> Oh well. I guess that was too simple.
>> >>
>> >> If Mark is ok with the rule that "it's not reliably if you have two
>> >> nested forks" (ie it only works if you exec for every fork you do), it
>> >> should still work, right? It sounds like Mark doesn't necessarily need
>> >> to handle the *generic* case.
>> >
>> > What about exposing shared/exclusive bit in pagemap == 1 if
>> > page_mapcount() > 1, otherwise 0 (or vise versa).
>> >
>> > Seems like this should work for detecting CoWed pages in child mm.
>>
>> Something like this (see patch in attachment)
>
> THP is not covered.

Ok. Thanks.

>
> Any comments on kcmp() idea?

Should work too. Supporing full equal-less-greater semantics seems
safe -- it's obfuscation is strong enough for that.

>
> --
>  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-30 13:11                 ` Konstantin Khlebnikov
  2015-04-30 13:22                   ` Kirill A. Shutemov
@ 2015-04-30 18:32                   ` Mark Williamson
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Williamson @ 2015-04-30 18:32 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Linus Torvalds, Kirill A. Shutemov, Mark Seaborn, kernel list,
	Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

Hi all,

On Thu, Apr 30, 2015 at 2:11 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Thu, Apr 30, 2015 at 2:43 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> What about exposing shared/exclusive bit in pagemap == 1 if
>> page_mapcount() > 1, otherwise 0 (or vise versa).
>>
>> Seems like this should work for detecting CoWed pages in child mm.
>
> Something like this (see patch in attachment)

Either something like this patch (updated to cover THPs), or Linus's
suggestion seems worth a try.  Could I perhaps get a steer on which
would be more likely to be accepted / preferred?

Either way, we'd want to expose the resulting flag somewhere within
pagemap.  We could do this either within the normal flags region, or
potentially even repurpose one of the (now censored) physical bits.

If there's a general feeling then I'll update my work-in-progress and
post it here.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-30 13:22                   ` Kirill A. Shutemov
  2015-04-30 13:32                     ` Konstantin Khlebnikov
@ 2015-04-30 18:45                     ` Mark Williamson
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Williamson @ 2015-04-30 18:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Konstantin Khlebnikov, Linus Torvalds, Mark Seaborn, kernel list,
	Kirill A. Shutemov, Pavel Emelyanov, Andrew Morton,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

>> Something like this (see patch in attachment)
>
> THP is not covered.
>
> Any comments on kcmp() idea?

It seems like a modified kcmp() would also be a valid approach but, as
you noted, probably speed-limited for our purposes.  As you say, there
is the option of a vector-oriented equivalent.  It seems like a
generally nice facility to have available in the kernel but my
suspicion is that it wouldn't be optimal for us...

My thinking is that using soft-dirty might give us the best
performance on platforms where it's available.  We're only using
fork() as a cunning/hacky way of tracking what pages change;
soft-dirty would allow us to avoid that too, whereas using kcmp()
would still require the forking overhead.

Does that make sense, or have I missed something?

Thanks,
Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-24 16:10   ` Andy Lutomirski
  2015-04-24 16:27     ` Linus Torvalds
@ 2015-04-29 14:38     ` Mark Williamson
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Williamson @ 2015-04-29 14:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Linux Kernel Mailing List, Kirill A. Shutemov,
	Pavel Emelyanov, Konstantin Khlebnikov, Andrew Morton,
	Mark Seaborn, Linux API, Finn Grimwood, Daniel James

Hi Andy,

On Fri, Apr 24, 2015 at 5:10 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Even though I've been accused (correctly?) of suggesting that, I'm not
> sure I like it anymore.  Suppose I map some anonymous memory, learn
> its (scrambled) pfn, then unmap it and remap a setuid file.  Now I can
> tell whether I've mapped the setuid file at the same pfn that was
> mapped as my anonymous memory.  IIRC that's sufficient for one of the
> variants of Mark's attack.

In fairness, you may have mentioned it but it's entirely possible you
didn't originate the suggestion and I quoted out of context.  Sorry
for implicating you ;-)

That's an attack that I hadn't considered when thinking about this
stuff.  Zeroing the page frame numbers is an easier patch, so
arguments in favour of that are a happy answer as far as I'm
concerned!

Thanks,
Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-24 16:08 ` Linus Torvalds
  2015-04-24 16:10   ` Andy Lutomirski
@ 2015-04-24 16:46   ` Mark Williamson
  1 sibling, 0 replies; 25+ messages in thread
From: Mark Williamson @ 2015-04-24 16:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Kirill A. Shutemov, Pavel Emelyanov,
	Konstantin Khlebnikov, Andrew Morton, Mark Seaborn,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

Hi Linus,

Thanks for responding so quickly!

On Fri, Apr 24, 2015 at 5:08 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> So the one exception to the regression rule is "security fixes", but
> even for security fixes we do try to be as reasonable as humanly
> possible to make them not break things.

Understood - there are clear reasons something had to be done here.

> Now, as you mentioned, one option is to not outright disallow accesses
> to the /proc/PID/pagemap, but to at least hide the page frame numbers.
> However, I don't believe that we have a good enough scrambling model
> to make that reasonable. Remember: any attacker will be able to see
> our scrambling code, so it would need to be both cryptographically
> secure *and* use a truly random per-VM secret key. Quite frankly,
> that's a _lot_ of effort for dubious gain...

*nod*

> So the "just show physical addresses as zero for non-root users"
> (instead of the outright ban on opening the file) is likely the only
> really viable alternative.
>
> It sounds like that could work for you. So if you can modify the app
> to do that, and send me a tested kernel patch that moves the
> permission check into the read phase (remember to use the open-time
> credentials in "file->f_cred" rather than the read-time credentials in
> "current" - otherwise you can trick some suid program to read the fily
> that an unauthorized user opened), then we can have this fixed. Does
> that sound reasonable?

That sounds very reasonable, thank you!  We'll cook up a patch and get
back to you.

Thanks,
Mark

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-24 16:10   ` Andy Lutomirski
@ 2015-04-24 16:27     ` Linus Torvalds
  2015-04-29 14:38     ` Mark Williamson
  1 sibling, 0 replies; 25+ messages in thread
From: Linus Torvalds @ 2015-04-24 16:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Mark Williamson, Linux Kernel Mailing List, Kirill A. Shutemov,
	Pavel Emelyanov, Konstantin Khlebnikov, Andrew Morton,
	Mark Seaborn, Linux API, Finn Grimwood, Daniel James

On Fri, Apr 24, 2015 at 9:10 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> Even though I've been accused (correctly?) of suggesting that, I'm not
> sure I like it anymore.  Suppose I map some anonymous memory, learn
> its (scrambled) pfn, then unmap it and remap a setuid file.  Now I can
> tell whether I've mapped the setuid file at the same pfn that was
> mapped as my anonymous memory.  IIRC that's sufficient for one of the
> variants of Mark's attack.

Ack. So we really do have to zero out the pfn entirely for security
reasons, and not just because it's less effort.

                         Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
  2015-04-24 16:08 ` Linus Torvalds
@ 2015-04-24 16:10   ` Andy Lutomirski
  2015-04-24 16:27     ` Linus Torvalds
  2015-04-29 14:38     ` Mark Williamson
  2015-04-24 16:46   ` Mark Williamson
  1 sibling, 2 replies; 25+ messages in thread
From: Andy Lutomirski @ 2015-04-24 16:10 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mark Williamson, Linux Kernel Mailing List, Kirill A. Shutemov,
	Pavel Emelyanov, Konstantin Khlebnikov, Andrew Morton,
	Mark Seaborn, Linux API, Finn Grimwood, Daniel James

On Fri, Apr 24, 2015 at 9:08 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Apr 24, 2015 at 7:55 AM, Mark Williamson
> <mwilliamson@undo-software.com> wrote:
>>
>> Although I've marked this as a "Regression", we do realise there are
>> legitimate security concerns over the original implementation of this
>> interface.  Still, given the kernel's strong stance on preserving userspace
>> interfaces, we thought we ought to flag this quickly as something that has
>> changed application-relevant behaviour.
>
> So the one exception to the regression rule is "security fixes", but
> even for security fixes we do try to be as reasonable as humanly
> possible to make them not break things.
>
> Now, as you mentioned, one option is to not outright disallow accesses
> to the /proc/PID/pagemap, but to at least hide the page frame numbers.
> However, I don't believe that we have a good enough scrambling model
> to make that reasonable. Remember: any attacker will be able to see
> our scrambling code, so it would need to be both cryptographically
> secure *and* use a truly random per-VM secret key. Quite frankly,
> that's a _lot_ of effort for dubious gain...

Even though I've been accused (correctly?) of suggesting that, I'm not
sure I like it anymore.  Suppose I map some anonymous memory, learn
its (scrambled) pfn, then unmap it and remap a setuid file.  Now I can
tell whether I've mapped the setuid file at the same pfn that was
mapped as my anonymous memory.  IIRC that's sufficient for one of the
variants of Mark's attack.

--Andy

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage
       [not found] <CAEVpBa+7Yp+zCTczZqBd6Qp_uM7yy0i8YZfZkUbDeUsPpKtqRQ@mail.gmail.com>
@ 2015-04-24 16:08 ` Linus Torvalds
  2015-04-24 16:10   ` Andy Lutomirski
  2015-04-24 16:46   ` Mark Williamson
  0 siblings, 2 replies; 25+ messages in thread
From: Linus Torvalds @ 2015-04-24 16:08 UTC (permalink / raw)
  To: Mark Williamson
  Cc: Linux Kernel Mailing List, Kirill A. Shutemov, Pavel Emelyanov,
	Konstantin Khlebnikov, Andrew Morton, Mark Seaborn,
	Andy Lutomirski, Linux API, Finn Grimwood, Daniel James

On Fri, Apr 24, 2015 at 7:55 AM, Mark Williamson
<mwilliamson@undo-software.com> wrote:
>
> Although I've marked this as a "Regression", we do realise there are
> legitimate security concerns over the original implementation of this
> interface.  Still, given the kernel's strong stance on preserving userspace
> interfaces, we thought we ought to flag this quickly as something that has
> changed application-relevant behaviour.

So the one exception to the regression rule is "security fixes", but
even for security fixes we do try to be as reasonable as humanly
possible to make them not break things.

Now, as you mentioned, one option is to not outright disallow accesses
to the /proc/PID/pagemap, but to at least hide the page frame numbers.
However, I don't believe that we have a good enough scrambling model
to make that reasonable. Remember: any attacker will be able to see
our scrambling code, so it would need to be both cryptographically
secure *and* use a truly random per-VM secret key. Quite frankly,
that's a _lot_ of effort for dubious gain...

So the "just show physical addresses as zero for non-root users"
(instead of the outright ban on opening the file) is likely the only
really viable alternative.

It sounds like that could work for you. So if you can modify the app
to do that, and send me a tested kernel patch that moves the
permission check into the read phase (remember to use the open-time
credentials in "file->f_cred" rather than the read-time credentials in
"current" - otherwise you can trick some suid program to read the fily
that an unauthorized user opened), then we can have this fixed. Does
that sound reasonable?

                      Linus

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2015-04-30 18:45 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-24 15:01 Regression: Requiring CAP_SYS_ADMIN for /proc/<pid>/pagemap causes application-level breakage Mark Williamson
2015-04-24 15:26 ` Mark Seaborn
2015-04-24 16:43   ` Mark Williamson
2015-04-29 18:44     ` Mark Williamson
2015-04-29 19:23       ` Mark Williamson
2015-04-29 19:36       ` Kirill A. Shutemov
2015-04-29 20:24         ` Mark Williamson
2015-04-29 20:33         ` Linus Torvalds
2015-04-29 20:44           ` Konstantin Khlebnikov
2015-04-29 21:02             ` Linus Torvalds
2015-04-29 21:05               ` Kirill A. Shutemov
2015-04-29 21:18                 ` Linus Torvalds
2015-04-29 21:37                   ` Kirill A. Shutemov
2015-04-30 11:43               ` Konstantin Khlebnikov
2015-04-30 13:11                 ` Konstantin Khlebnikov
2015-04-30 13:22                   ` Kirill A. Shutemov
2015-04-30 13:32                     ` Konstantin Khlebnikov
2015-04-30 18:45                     ` Mark Williamson
2015-04-30 18:32                   ` Mark Williamson
2015-04-30 11:50               ` Mark Williamson
     [not found] <CAEVpBa+7Yp+zCTczZqBd6Qp_uM7yy0i8YZfZkUbDeUsPpKtqRQ@mail.gmail.com>
2015-04-24 16:08 ` Linus Torvalds
2015-04-24 16:10   ` Andy Lutomirski
2015-04-24 16:27     ` Linus Torvalds
2015-04-29 14:38     ` Mark Williamson
2015-04-24 16:46   ` Mark Williamson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).