* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-04 20:02 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-04 20:02 UTC (permalink / raw)
To: will.deacon, nadav.amit
Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
linux-mm, jannh, Dock, Deneen T, peterz, kristen, akpm, mingo,
luto, Keshavamurthy, Anil S, kernel-hardening, mhiramat,
naveen.n.rao, davem, netdev, Hansen, Dave
On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > wrote:
> > >
> > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > pages,
> > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > This is
> > > undesirable for cases where the memory being freed has special permissions
> > > such
> > > as executable.
> >
> > So I am trying to finish my patch-set for preventing transient W+X mappings
> > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > for
> > pointing it out).
> >
> > But all of the sudden, I don’t understand why we have the problem that this
> > (your) patch-set deals with at all. We already change the mappings to make
> > the memory writable before freeing the memory, so why can’t we make it
> > non-executable at the same time? Actually, why do we make the module memory,
> > including its data executable before freeing it???
>
> Yeah, this is really confusing, but I have a suspicion it's a combination
> of the various different configurations and hysterical raisins. We can't
> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> can we rely on disable_ro_nx() being available at build time.
>
> If we *could* rely on module allocations always using vmalloc(), then
> we could pass in Rick's new flag and drop disable_ro_nx() altogether
> afaict -- who cares about the memory attributes of a mapping that's about
> to disappear anyway?
>
> Is it just nios2 that does something different?
>
> Will
Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well, in fact that was what I first thought the solution should be
until this was suggested. It's interesting that from the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But, on the other architectures I have since
learned it is a bit different.
It looks like actually most arch's don't re-define set_memory_*, and so all of
the frob_* functions are actually just noops. In which case allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So in these archs, set_memory_nx won't solve it because it will do
nothing.
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
The other reasoning was that calling set_memory_nx isn't doing what we are
actually trying to do which is prevent the pages from getting released too
early.
A more clear solution for all of this might involve refactoring some of the
set_memory_ de-allocation logic out into __weak functions in either modules or
vmalloc. As Jessica points out in the other thread though, modules does a lot
more stuff there than the other module_alloc callers. I think it may take some
thought to centralize AND make it optimal for every module_alloc/vmalloc_exec
user and arch.
But for now with the change in vmalloc, we can block the executable mapping
freed page re-use issue in a cross platform way.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-04 20:02 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-04 20:02 UTC (permalink / raw)
To: will.deacon, nadav.amit
Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
linux-mm, jannh, Dock, Deneen T, peterz, kristen, akpm, mingo,
luto, Keshavamurthy,
Anil S <anil.s.keshavamurthy@intel.com>,
kernel-hardening@lists.openwall.com, mhiramat, naveen.n.rao,
davem, netdev, Hansen, Dave
On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > wrote:
> > >
> > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > pages,
> > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > This is
> > > undesirable for cases where the memory being freed has special permissions
> > > such
> > > as executable.
> >
> > So I am trying to finish my patch-set for preventing transient W+X mappings
> > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > for
> > pointing it out).
> >
> > But all of the sudden, I don’t understand why we have the problem that this
> > (your) patch-set deals with at all. We already change the mappings to make
> > the memory writable before freeing the memory, so why can’t we make it
> > non-executable at the same time? Actually, why do we make the module memory,
> > including its data executable before freeing it???
>
> Yeah, this is really confusing, but I have a suspicion it's a combination
> of the various different configurations and hysterical raisins. We can't
> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> can we rely on disable_ro_nx() being available at build time.
>
> If we *could* rely on module allocations always using vmalloc(), then
> we could pass in Rick's new flag and drop disable_ro_nx() altogether
> afaict -- who cares about the memory attributes of a mapping that's about
> to disappear anyway?
>
> Is it just nios2 that does something different?
>
> Will
Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well, in fact that was what I first thought the solution should be
until this was suggested. It's interesting that from the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But, on the other architectures I have since
learned it is a bit different.
It looks like actually most arch's don't re-define set_memory_*, and so all of
the frob_* functions are actually just noops. In which case allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So in these archs, set_memory_nx won't solve it because it will do
nothing.
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
The other reasoning was that calling set_memory_nx isn't doing what we are
actually trying to do which is prevent the pages from getting released too
early.
A more clear solution for all of this might involve refactoring some of the
set_memory_ de-allocation logic out into __weak functions in either modules or
vmalloc. As Jessica points out in the other thread though, modules does a lot
more stuff there than the other module_alloc callers. I think it may take some
thought to centralize AND make it optimal for every module_alloc/vmalloc_exec
user and arch.
But for now with the change in vmalloc, we can block the executable mapping
freed page re-use issue in a cross platform way.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-04 20:02 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-04 20:02 UTC (permalink / raw)
To: will.deacon, nadav.amit
Cc: linux-kernel, daniel, jeyu, rostedt, ast, ard.biesheuvel,
linux-mm, jannh, Dock, Deneen T, peterz, kristen, akpm, mingo,
luto, Keshavamurthy, Anil S, kernel-hardening@lists.openwall.com
On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > wrote:
> > >
> > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > pages,
> > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > This is
> > > undesirable for cases where the memory being freed has special permissions
> > > such
> > > as executable.
> >
> > So I am trying to finish my patch-set for preventing transient W+X mappings
> > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > for
> > pointing it out).
> >
> > But all of the sudden, I don’t understand why we have the problem that this
> > (your) patch-set deals with at all. We already change the mappings to make
> > the memory writable before freeing the memory, so why can’t we make it
> > non-executable at the same time? Actually, why do we make the module memory,
> > including its data executable before freeing it???
>
> Yeah, this is really confusing, but I have a suspicion it's a combination
> of the various different configurations and hysterical raisins. We can't
> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> can we rely on disable_ro_nx() being available at build time.
>
> If we *could* rely on module allocations always using vmalloc(), then
> we could pass in Rick's new flag and drop disable_ro_nx() altogether
> afaict -- who cares about the memory attributes of a mapping that's about
> to disappear anyway?
>
> Is it just nios2 that does something different?
>
> Will
Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
solve it as well, in fact that was what I first thought the solution should be
until this was suggested. It's interesting that from the other thread Masami
Hiramatsu referenced, set_memory_nx was suggested last year and would have
inadvertently blocked this on x86. But, on the other architectures I have since
learned it is a bit different.
It looks like actually most arch's don't re-define set_memory_*, and so all of
the frob_* functions are actually just noops. In which case allocating RWX is
needed to make it work at all, because that is what the allocation is going to
stay at. So in these archs, set_memory_nx won't solve it because it will do
nothing.
On x86 I think you cannot get rid of disable_ro_nx fully because there is the
changing of the permissions on the directmap as well. You don't want some other
caller getting a page that was left RO when freed and then trying to write to
it, if I understand this.
The other reasoning was that calling set_memory_nx isn't doing what we are
actually trying to do which is prevent the pages from getting released too
early.
A more clear solution for all of this might involve refactoring some of the
set_memory_ de-allocation logic out into __weak functions in either modules or
vmalloc. As Jessica points out in the other thread though, modules does a lot
more stuff there than the other module_alloc callers. I think it may take some
thought to centralize AND make it optimal for every module_alloc/vmalloc_exec
user and arch.
But for now with the change in vmalloc, we can block the executable mapping
freed page re-use issue in a cross platform way.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-04 20:02 ` Edgecombe, Rick P
@ 2018-12-04 20:09 ` Andy Lutomirski
-1 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-04 20:09 UTC (permalink / raw)
To: Rick Edgecombe
Cc: Will Deacon, Nadav Amit, LKML, Daniel Borkmann, jeyu,
Steven Rostedt, Alexei Starovoitov, Ard Biesheuvel, Linux-MM,
Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Andrew Lutomirski, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu, Naveen N . Rao, David S. Miller,
Network Development, Dave Hansen
On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > wrote:
> > > >
> > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > pages,
> > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > This is
> > > > undesirable for cases where the memory being freed has special permissions
> > > > such
> > > > as executable.
> > >
> > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > for
> > > pointing it out).
> > >
> > > But all of the sudden, I don’t understand why we have the problem that this
> > > (your) patch-set deals with at all. We already change the mappings to make
> > > the memory writable before freeing the memory, so why can’t we make it
> > > non-executable at the same time? Actually, why do we make the module memory,
> > > including its data executable before freeing it???
> >
> > Yeah, this is really confusing, but I have a suspicion it's a combination
> > of the various different configurations and hysterical raisins. We can't
> > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > can we rely on disable_ro_nx() being available at build time.
> >
> > If we *could* rely on module allocations always using vmalloc(), then
> > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > afaict -- who cares about the memory attributes of a mapping that's about
> > to disappear anyway?
> >
> > Is it just nios2 that does something different?
> >
> > Will
>
> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> solve it as well, in fact that was what I first thought the solution should be
> until this was suggested. It's interesting that from the other thread Masami
> Hiramatsu referenced, set_memory_nx was suggested last year and would have
> inadvertently blocked this on x86. But, on the other architectures I have since
> learned it is a bit different.
>
> It looks like actually most arch's don't re-define set_memory_*, and so all of
> the frob_* functions are actually just noops. In which case allocating RWX is
> needed to make it work at all, because that is what the allocation is going to
> stay at. So in these archs, set_memory_nx won't solve it because it will do
> nothing.
>
> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> changing of the permissions on the directmap as well. You don't want some other
> caller getting a page that was left RO when freed and then trying to write to
> it, if I understand this.
>
Exactly.
After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
but it would also call some arch hooks to put back the direct map
permissions before the flush. Does that seem reasonable? It would
need to be hooked up that implement set_memory_ro(), but that should
be quite easy. If nothing else, it could fall back to set_memory_ro()
in the absence of a better implementation.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-04 20:09 ` Andy Lutomirski
0 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-04 20:09 UTC (permalink / raw)
To: Rick Edgecombe
Cc: Will Deacon, Nadav Amit, LKML, Daniel Borkmann, jeyu,
Steven Rostedt, Alexei Starovoitov, Ard Biesheuvel, Linux-MM,
Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Andrew Lutomirski, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu, Naveen N . Rao
On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > wrote:
> > > >
> > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > pages,
> > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > This is
> > > > undesirable for cases where the memory being freed has special permissions
> > > > such
> > > > as executable.
> > >
> > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > for
> > > pointing it out).
> > >
> > > But all of the sudden, I don’t understand why we have the problem that this
> > > (your) patch-set deals with at all. We already change the mappings to make
> > > the memory writable before freeing the memory, so why can’t we make it
> > > non-executable at the same time? Actually, why do we make the module memory,
> > > including its data executable before freeing it???
> >
> > Yeah, this is really confusing, but I have a suspicion it's a combination
> > of the various different configurations and hysterical raisins. We can't
> > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > can we rely on disable_ro_nx() being available at build time.
> >
> > If we *could* rely on module allocations always using vmalloc(), then
> > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > afaict -- who cares about the memory attributes of a mapping that's about
> > to disappear anyway?
> >
> > Is it just nios2 that does something different?
> >
> > Will
>
> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> solve it as well, in fact that was what I first thought the solution should be
> until this was suggested. It's interesting that from the other thread Masami
> Hiramatsu referenced, set_memory_nx was suggested last year and would have
> inadvertently blocked this on x86. But, on the other architectures I have since
> learned it is a bit different.
>
> It looks like actually most arch's don't re-define set_memory_*, and so all of
> the frob_* functions are actually just noops. In which case allocating RWX is
> needed to make it work at all, because that is what the allocation is going to
> stay at. So in these archs, set_memory_nx won't solve it because it will do
> nothing.
>
> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> changing of the permissions on the directmap as well. You don't want some other
> caller getting a page that was left RO when freed and then trying to write to
> it, if I understand this.
>
Exactly.
After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
but it would also call some arch hooks to put back the direct map
permissions before the flush. Does that seem reasonable? It would
need to be hooked up that implement set_memory_ro(), but that should
be quite easy. If nothing else, it could fall back to set_memory_ro()
in the absence of a better implementation.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-04 20:09 ` Andy Lutomirski
(?)
@ 2018-12-04 23:52 ` Edgecombe, Rick P
-1 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-04 23:52 UTC (permalink / raw)
To: luto
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, nadav.amit, Dock, Deneen T, jannh, kristen, akpm,
peterz, will.deacon, mingo, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
On Tue, 2018-12-04 at 12:09 -0800, Andy Lutomirski wrote:
> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > rick.p.edgecombe@intel.com>
> > > > > wrote:
> > > > >
> > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > underlying
> > > > > pages,
> > > > > it often leaves stale TLB entries to freed pages that could get re-
> > > > > used.
> > > > > This is
> > > > > undesirable for cases where the memory being freed has special
> > > > > permissions
> > > > > such
> > > > > as executable.
> > > >
> > > > So I am trying to finish my patch-set for preventing transient W+X
> > > > mappings
> > > > from taking space, by handling kprobes & ftrace that I missed (thanks
> > > > again
> > > > for
> > > > pointing it out).
> > > >
> > > > But all of the sudden, I don’t understand why we have the problem that
> > > > this
> > > > (your) patch-set deals with at all. We already change the mappings to
> > > > make
> > > > the memory writable before freeing the memory, so why can’t we make it
> > > > non-executable at the same time? Actually, why do we make the module
> > > > memory,
> > > > including its data executable before freeing it???
> > >
> > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > of the various different configurations and hysterical raisins. We can't
> > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > can we rely on disable_ro_nx() being available at build time.
> > >
> > > If we *could* rely on module allocations always using vmalloc(), then
> > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > afaict -- who cares about the memory attributes of a mapping that's about
> > > to disappear anyway?
> > >
> > > Is it just nios2 that does something different?
> > >
> > > Will
> >
> > Yea it is really intertwined. I think for x86, set_memory_nx everywhere
> > would
> > solve it as well, in fact that was what I first thought the solution should
> > be
> > until this was suggested. It's interesting that from the other thread Masami
> > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > inadvertently blocked this on x86. But, on the other architectures I have
> > since
> > learned it is a bit different.
> >
> > It looks like actually most arch's don't re-define set_memory_*, and so all
> > of
> > the frob_* functions are actually just noops. In which case allocating RWX
> > is
> > needed to make it work at all, because that is what the allocation is going
> > to
> > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > nothing.
> >
> > On x86 I think you cannot get rid of disable_ro_nx fully because there is
> > the
> > changing of the permissions on the directmap as well. You don't want some
> > other
> > caller getting a page that was left RO when freed and then trying to write
> > to
> > it, if I understand this.
> >
>
> Exactly.
>
> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> but it would also call some arch hooks to put back the direct map
> permissions before the flush. Does that seem reasonable? It would
> need to be hooked up that implement set_memory_ro(), but that should
> be quite easy. If nothing else, it could fall back to set_memory_ro()
> in the absence of a better implementation.
With arch hooks, I guess we could remove disable_ro_nx then. I think you would
still have to flush twice on x86 to really have no W^X violating window from the
direct map (I think x86 is the only one that sets permissions there?). But this
could be down from sometimes 3. You could also directly vfree non exec RO memory
without set_memory_, like in BPF.
The vfree deferred list would need to be moved since it then couldn't reuse the
allocations since now the vfreed memory might be RO. It could kmalloc, or lookup
the vm_struct. So would probably be a little slower in the interrupt case. Is
this ok?
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-04 23:52 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-04 23:52 UTC (permalink / raw)
To: luto
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, nadav.amit, Dock, Deneen T, jannh, kristen, akpm,
peterz, will.deacon, mingo, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
On Tue, 2018-12-04 at 12:09 -0800, Andy Lutomirski wrote:
> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > rick.p.edgecombe@intel.com>
> > > > > wrote:
> > > > >
> > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > underlying
> > > > > pages,
> > > > > it often leaves stale TLB entries to freed pages that could get re-
> > > > > used.
> > > > > This is
> > > > > undesirable for cases where the memory being freed has special
> > > > > permissions
> > > > > such
> > > > > as executable.
> > > >
> > > > So I am trying to finish my patch-set for preventing transient W+X
> > > > mappings
> > > > from taking space, by handling kprobes & ftrace that I missed (thanks
> > > > again
> > > > for
> > > > pointing it out).
> > > >
> > > > But all of the sudden, I don’t understand why we have the problem that
> > > > this
> > > > (your) patch-set deals with at all. We already change the mappings to
> > > > make
> > > > the memory writable before freeing the memory, so why can’t we make it
> > > > non-executable at the same time? Actually, why do we make the module
> > > > memory,
> > > > including its data executable before freeing it???
> > >
> > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > of the various different configurations and hysterical raisins. We can't
> > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > can we rely on disable_ro_nx() being available at build time.
> > >
> > > If we *could* rely on module allocations always using vmalloc(), then
> > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > afaict -- who cares about the memory attributes of a mapping that's about
> > > to disappear anyway?
> > >
> > > Is it just nios2 that does something different?
> > >
> > > Will
> >
> > Yea it is really intertwined. I think for x86, set_memory_nx everywhere
> > would
> > solve it as well, in fact that was what I first thought the solution should
> > be
> > until this was suggested. It's interesting that from the other thread Masami
> > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > inadvertently blocked this on x86. But, on the other architectures I have
> > since
> > learned it is a bit different.
> >
> > It looks like actually most arch's don't re-define set_memory_*, and so all
> > of
> > the frob_* functions are actually just noops. In which case allocating RWX
> > is
> > needed to make it work at all, because that is what the allocation is going
> > to
> > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > nothing.
> >
> > On x86 I think you cannot get rid of disable_ro_nx fully because there is
> > the
> > changing of the permissions on the directmap as well. You don't want some
> > other
> > caller getting a page that was left RO when freed and then trying to write
> > to
> > it, if I understand this.
> >
>
> Exactly.
>
> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> but it would also call some arch hooks to put back the direct map
> permissions before the flush. Does that seem reasonable? It would
> need to be hooked up that implement set_memory_ro(), but that should
> be quite easy. If nothing else, it could fall back to set_memory_ro()
> in the absence of a better implementation.
With arch hooks, I guess we could remove disable_ro_nx then. I think you would
still have to flush twice on x86 to really have no W^X violating window from the
direct map (I think x86 is the only one that sets permissions there?). But this
could be down from sometimes 3. You could also directly vfree non exec RO memory
without set_memory_, like in BPF.
The vfree deferred list would need to be moved since it then couldn't reuse the
allocations since now the vfreed memory might be RO. It could kmalloc, or lookup
the vm_struct. So would probably be a little slower in the interrupt case. Is
this ok?
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-04 23:52 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-04 23:52 UTC (permalink / raw)
To: luto
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, nadav.amit, Dock, Deneen T, jannh, kristen, akpm,
peterz, will.deacon, mingo, Keshavamurthy, Anil S,
On Tue, 2018-12-04 at 12:09 -0800, Andy Lutomirski wrote:
> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > rick.p.edgecombe@intel.com>
> > > > > wrote:
> > > > >
> > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > underlying
> > > > > pages,
> > > > > it often leaves stale TLB entries to freed pages that could get re-
> > > > > used.
> > > > > This is
> > > > > undesirable for cases where the memory being freed has special
> > > > > permissions
> > > > > such
> > > > > as executable.
> > > >
> > > > So I am trying to finish my patch-set for preventing transient W+X
> > > > mappings
> > > > from taking space, by handling kprobes & ftrace that I missed (thanks
> > > > again
> > > > for
> > > > pointing it out).
> > > >
> > > > But all of the sudden, I don’t understand why we have the problem that
> > > > this
> > > > (your) patch-set deals with at all. We already change the mappings to
> > > > make
> > > > the memory writable before freeing the memory, so why can’t we make it
> > > > non-executable at the same time? Actually, why do we make the module
> > > > memory,
> > > > including its data executable before freeing it???
> > >
> > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > of the various different configurations and hysterical raisins. We can't
> > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > can we rely on disable_ro_nx() being available at build time.
> > >
> > > If we *could* rely on module allocations always using vmalloc(), then
> > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > afaict -- who cares about the memory attributes of a mapping that's about
> > > to disappear anyway?
> > >
> > > Is it just nios2 that does something different?
> > >
> > > Will
> >
> > Yea it is really intertwined. I think for x86, set_memory_nx everywhere
> > would
> > solve it as well, in fact that was what I first thought the solution should
> > be
> > until this was suggested. It's interesting that from the other thread Masami
> > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > inadvertently blocked this on x86. But, on the other architectures I have
> > since
> > learned it is a bit different.
> >
> > It looks like actually most arch's don't re-define set_memory_*, and so all
> > of
> > the frob_* functions are actually just noops. In which case allocating RWX
> > is
> > needed to make it work at all, because that is what the allocation is going
> > to
> > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > nothing.
> >
> > On x86 I think you cannot get rid of disable_ro_nx fully because there is
> > the
> > changing of the permissions on the directmap as well. You don't want some
> > other
> > caller getting a page that was left RO when freed and then trying to write
> > to
> > it, if I understand this.
> >
>
> Exactly.
>
> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> but it would also call some arch hooks to put back the direct map
> permissions before the flush. Does that seem reasonable? It would
> need to be hooked up that implement set_memory_ro(), but that should
> be quite easy. If nothing else, it could fall back to set_memory_ro()
> in the absence of a better implementation.
With arch hooks, I guess we could remove disable_ro_nx then. I think you would
still have to flush twice on x86 to really have no W^X violating window from the
direct map (I think x86 is the only one that sets permissions there?). But this
could be down from sometimes 3. You could also directly vfree non exec RO memory
without set_memory_, like in BPF.
The vfree deferred list would need to be moved since it then couldn't reuse the
allocations since now the vfreed memory might be RO. It could kmalloc, or lookup
the vm_struct. So would probably be a little slower in the interrupt case. Is
this ok?
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-04 23:52 ` Edgecombe, Rick P
(?)
@ 2018-12-05 1:57 ` Andy Lutomirski
-1 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-05 1:57 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: luto, linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, nadav.amit, Dock, Deneen T, jannh, kristen, akpm,
peterz, will.deacon, mingo, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
> On Dec 4, 2018, at 3:52 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
>> On Tue, 2018-12-04 at 12:09 -0800, Andy Lutomirski wrote:
>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
>>>
>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>> rick.p.edgecombe@intel.com>
>>>>>> wrote:
>>>>>>
>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>> underlying
>>>>>> pages,
>>>>>> it often leaves stale TLB entries to freed pages that could get re-
>>>>>> used.
>>>>>> This is
>>>>>> undesirable for cases where the memory being freed has special
>>>>>> permissions
>>>>>> such
>>>>>> as executable.
>>>>>
>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>> mappings
>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks
>>>>> again
>>>>> for
>>>>> pointing it out).
>>>>>
>>>>> But all of the sudden, I don’t understand why we have the problem that
>>>>> this
>>>>> (your) patch-set deals with at all. We already change the mappings to
>>>>> make
>>>>> the memory writable before freeing the memory, so why can’t we make it
>>>>> non-executable at the same time? Actually, why do we make the module
>>>>> memory,
>>>>> including its data executable before freeing it???
>>>>
>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>> of the various different configurations and hysterical raisins. We can't
>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>> can we rely on disable_ro_nx() being available at build time.
>>>>
>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>> to disappear anyway?
>>>>
>>>> Is it just nios2 that does something different?
>>>>
>>>> Will
>>>
>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>> would
>>> solve it as well, in fact that was what I first thought the solution should
>>> be
>>> until this was suggested. It's interesting that from the other thread Masami
>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>> inadvertently blocked this on x86. But, on the other architectures I have
>>> since
>>> learned it is a bit different.
>>>
>>> It looks like actually most arch's don't re-define set_memory_*, and so all
>>> of
>>> the frob_* functions are actually just noops. In which case allocating RWX
>>> is
>>> needed to make it work at all, because that is what the allocation is going
>>> to
>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>> nothing.
>>>
>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is
>>> the
>>> changing of the permissions on the directmap as well. You don't want some
>>> other
>>> caller getting a page that was left RO when freed and then trying to write
>>> to
>>> it, if I understand this.
>>>
>>
>> Exactly.
>>
>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
>> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
>> but it would also call some arch hooks to put back the direct map
>> permissions before the flush. Does that seem reasonable? It would
>> need to be hooked up that implement set_memory_ro(), but that should
>> be quite easy. If nothing else, it could fall back to set_memory_ro()
>> in the absence of a better implementation.
>
> With arch hooks, I guess we could remove disable_ro_nx then. I think you would
> still have to flush twice on x86 to really have no W^X violating window from the
> direct map (I think x86 is the only one that sets permissions there?). But this
> could be down from sometimes 3. You could also directly vfree non exec RO memory
> without set_memory_, like in BPF.
Just one flush if you’re careful. Set the memory not-present in the direct map and zap it from the vmap area, then flush, then set it RW in the
>
> The vfree deferred list would need to be moved since it then couldn't reuse the
> allocations since now the vfreed memory might be RO. It could kmalloc, or lookup
> the vm_struct. So would probably be a little slower in the interrupt case. Is
> this ok?
I’m fine with that. For eBPF, we should really have a lookaside list for small allocations.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 1:57 ` Andy Lutomirski
0 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-05 1:57 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: luto, linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, nadav.amit, Dock, Deneen T, jannh, kristen, akpm,
peterz, will.deacon, mingo, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
> On Dec 4, 2018, at 3:52 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
>> On Tue, 2018-12-04 at 12:09 -0800, Andy Lutomirski wrote:
>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
>>>
>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>> rick.p.edgecombe@intel.com>
>>>>>> wrote:
>>>>>>
>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>> underlying
>>>>>> pages,
>>>>>> it often leaves stale TLB entries to freed pages that could get re-
>>>>>> used.
>>>>>> This is
>>>>>> undesirable for cases where the memory being freed has special
>>>>>> permissions
>>>>>> such
>>>>>> as executable.
>>>>>
>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>> mappings
>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks
>>>>> again
>>>>> for
>>>>> pointing it out).
>>>>>
>>>>> But all of the sudden, I don’t understand why we have the problem that
>>>>> this
>>>>> (your) patch-set deals with at all. We already change the mappings to
>>>>> make
>>>>> the memory writable before freeing the memory, so why can’t we make it
>>>>> non-executable at the same time? Actually, why do we make the module
>>>>> memory,
>>>>> including its data executable before freeing it???
>>>>
>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>> of the various different configurations and hysterical raisins. We can't
>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>> can we rely on disable_ro_nx() being available at build time.
>>>>
>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>> to disappear anyway?
>>>>
>>>> Is it just nios2 that does something different?
>>>>
>>>> Will
>>>
>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>> would
>>> solve it as well, in fact that was what I first thought the solution should
>>> be
>>> until this was suggested. It's interesting that from the other thread Masami
>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>> inadvertently blocked this on x86. But, on the other architectures I have
>>> since
>>> learned it is a bit different.
>>>
>>> It looks like actually most arch's don't re-define set_memory_*, and so all
>>> of
>>> the frob_* functions are actually just noops. In which case allocating RWX
>>> is
>>> needed to make it work at all, because that is what the allocation is going
>>> to
>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>> nothing.
>>>
>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is
>>> the
>>> changing of the permissions on the directmap as well. You don't want some
>>> other
>>> caller getting a page that was left RO when freed and then trying to write
>>> to
>>> it, if I understand this.
>>>
>>
>> Exactly.
>>
>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
>> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
>> but it would also call some arch hooks to put back the direct map
>> permissions before the flush. Does that seem reasonable? It would
>> need to be hooked up that implement set_memory_ro(), but that should
>> be quite easy. If nothing else, it could fall back to set_memory_ro()
>> in the absence of a better implementation.
>
> With arch hooks, I guess we could remove disable_ro_nx then. I think you would
> still have to flush twice on x86 to really have no W^X violating window from the
> direct map (I think x86 is the only one that sets permissions there?). But this
> could be down from sometimes 3. You could also directly vfree non exec RO memory
> without set_memory_, like in BPF.
Just one flush if you’re careful. Set the memory not-present in the direct map and zap it from the vmap area, then flush, then set it RW in the
>
> The vfree deferred list would need to be moved since it then couldn't reuse the
> allocations since now the vfreed memory might be RO. It could kmalloc, or lookup
> the vm_struct. So would probably be a little slower in the interrupt case. Is
> this ok?
I’m fine with that. For eBPF, we should really have a lookaside list for small allocations.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 1:57 ` Andy Lutomirski
0 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-05 1:57 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: luto, linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, nadav.amit, Dock, Deneen T, jannh, kristen, akpm,
peterz, will.deacon, mingo, Keshavamurthy, Anil S
> On Dec 4, 2018, at 3:52 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
>> On Tue, 2018-12-04 at 12:09 -0800, Andy Lutomirski wrote:
>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
>>>
>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>> rick.p.edgecombe@intel.com>
>>>>>> wrote:
>>>>>>
>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>> underlying
>>>>>> pages,
>>>>>> it often leaves stale TLB entries to freed pages that could get re-
>>>>>> used.
>>>>>> This is
>>>>>> undesirable for cases where the memory being freed has special
>>>>>> permissions
>>>>>> such
>>>>>> as executable.
>>>>>
>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>> mappings
>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks
>>>>> again
>>>>> for
>>>>> pointing it out).
>>>>>
>>>>> But all of the sudden, I don’t understand why we have the problem that
>>>>> this
>>>>> (your) patch-set deals with at all. We already change the mappings to
>>>>> make
>>>>> the memory writable before freeing the memory, so why can’t we make it
>>>>> non-executable at the same time? Actually, why do we make the module
>>>>> memory,
>>>>> including its data executable before freeing it???
>>>>
>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>> of the various different configurations and hysterical raisins. We can't
>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>> can we rely on disable_ro_nx() being available at build time.
>>>>
>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>> to disappear anyway?
>>>>
>>>> Is it just nios2 that does something different?
>>>>
>>>> Will
>>>
>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>> would
>>> solve it as well, in fact that was what I first thought the solution should
>>> be
>>> until this was suggested. It's interesting that from the other thread Masami
>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>> inadvertently blocked this on x86. But, on the other architectures I have
>>> since
>>> learned it is a bit different.
>>>
>>> It looks like actually most arch's don't re-define set_memory_*, and so all
>>> of
>>> the frob_* functions are actually just noops. In which case allocating RWX
>>> is
>>> needed to make it work at all, because that is what the allocation is going
>>> to
>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>> nothing.
>>>
>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is
>>> the
>>> changing of the permissions on the directmap as well. You don't want some
>>> other
>>> caller getting a page that was left RO when freed and then trying to write
>>> to
>>> it, if I understand this.
>>>
>>
>> Exactly.
>>
>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
>> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
>> but it would also call some arch hooks to put back the direct map
>> permissions before the flush. Does that seem reasonable? It would
>> need to be hooked up that implement set_memory_ro(), but that should
>> be quite easy. If nothing else, it could fall back to set_memory_ro()
>> in the absence of a better implementation.
>
> With arch hooks, I guess we could remove disable_ro_nx then. I think you would
> still have to flush twice on x86 to really have no W^X violating window from the
> direct map (I think x86 is the only one that sets permissions there?). But this
> could be down from sometimes 3. You could also directly vfree non exec RO memory
> without set_memory_, like in BPF.
Just one flush if you’re careful. Set the memory not-present in the direct map and zap it from the vmap area, then flush, then set it RW in the
>
> The vfree deferred list would need to be moved since it then couldn't reuse the
> allocations since now the vfreed memory might be RO. It could kmalloc, or lookup
> the vm_struct. So would probably be a little slower in the interrupt case. Is
> this ok?
I’m fine with that. For eBPF, we should really have a lookaside list for small allocations.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-04 20:09 ` Andy Lutomirski
@ 2018-12-05 11:41 ` Will Deacon
-1 siblings, 0 replies; 117+ messages in thread
From: Will Deacon @ 2018-12-05 11:41 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Rick Edgecombe, Nadav Amit, LKML, Daniel Borkmann, jeyu,
Steven Rostedt, Alexei Starovoitov, Ard Biesheuvel, Linux-MM,
Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masami Hiramatsu,
Naveen N . Rao, David S. Miller, Network Development,
Dave Hansen
On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > wrote:
> > > > >
> > > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > > pages,
> > > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > > This is
> > > > > undesirable for cases where the memory being freed has special permissions
> > > > > such
> > > > > as executable.
> > > >
> > > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > > for
> > > > pointing it out).
> > > >
> > > > But all of the sudden, I don’t understand why we have the problem that this
> > > > (your) patch-set deals with at all. We already change the mappings to make
> > > > the memory writable before freeing the memory, so why can’t we make it
> > > > non-executable at the same time? Actually, why do we make the module memory,
> > > > including its data executable before freeing it???
> > >
> > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > of the various different configurations and hysterical raisins. We can't
> > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > can we rely on disable_ro_nx() being available at build time.
> > >
> > > If we *could* rely on module allocations always using vmalloc(), then
> > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > afaict -- who cares about the memory attributes of a mapping that's about
> > > to disappear anyway?
> > >
> > > Is it just nios2 that does something different?
> > >
> > Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> > solve it as well, in fact that was what I first thought the solution should be
> > until this was suggested. It's interesting that from the other thread Masami
> > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > inadvertently blocked this on x86. But, on the other architectures I have since
> > learned it is a bit different.
> >
> > It looks like actually most arch's don't re-define set_memory_*, and so all of
> > the frob_* functions are actually just noops. In which case allocating RWX is
> > needed to make it work at all, because that is what the allocation is going to
> > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > nothing.
> >
> > On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> > changing of the permissions on the directmap as well. You don't want some other
> > caller getting a page that was left RO when freed and then trying to write to
> > it, if I understand this.
> >
>
> Exactly.
Of course, I forgot about the linear mapping. On arm64, we've just queued
support for reflecting changes to read-only permissions in the linear map
[1]. So, whilst the linear map is always non-executable, we will need to
make parts of it writable again when freeing the module.
> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> but it would also call some arch hooks to put back the direct map
> permissions before the flush. Does that seem reasonable? It would
> need to be hooked up that implement set_memory_ro(), but that should
> be quite easy. If nothing else, it could fall back to set_memory_ro()
> in the absence of a better implementation.
You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
would open up a window where the vmap mapping is executable and the linear
mapping is writable, which is a bit rubbish.
Will
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=for-next/core&id=c55191e96caa9d787e8f682c5e525b7f8172a3b4
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 11:41 ` Will Deacon
0 siblings, 0 replies; 117+ messages in thread
From: Will Deacon @ 2018-12-05 11:41 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Rick Edgecombe, Nadav Amit, LKML, Daniel Borkmann, jeyu,
Steven Rostedt, Alexei Starovoitov, Ard Biesheuvel, Linux-MM,
Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masami Hiramatsu,
Naveen N . Rao
On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> <rick.p.edgecombe@intel.com> wrote:
> >
> > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > wrote:
> > > > >
> > > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > > pages,
> > > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > > This is
> > > > > undesirable for cases where the memory being freed has special permissions
> > > > > such
> > > > > as executable.
> > > >
> > > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > > for
> > > > pointing it out).
> > > >
> > > > But all of the sudden, I don’t understand why we have the problem that this
> > > > (your) patch-set deals with at all. We already change the mappings to make
> > > > the memory writable before freeing the memory, so why can’t we make it
> > > > non-executable at the same time? Actually, why do we make the module memory,
> > > > including its data executable before freeing it???
> > >
> > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > of the various different configurations and hysterical raisins. We can't
> > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > can we rely on disable_ro_nx() being available at build time.
> > >
> > > If we *could* rely on module allocations always using vmalloc(), then
> > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > afaict -- who cares about the memory attributes of a mapping that's about
> > > to disappear anyway?
> > >
> > > Is it just nios2 that does something different?
> > >
> > Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> > solve it as well, in fact that was what I first thought the solution should be
> > until this was suggested. It's interesting that from the other thread Masami
> > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > inadvertently blocked this on x86. But, on the other architectures I have since
> > learned it is a bit different.
> >
> > It looks like actually most arch's don't re-define set_memory_*, and so all of
> > the frob_* functions are actually just noops. In which case allocating RWX is
> > needed to make it work at all, because that is what the allocation is going to
> > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > nothing.
> >
> > On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> > changing of the permissions on the directmap as well. You don't want some other
> > caller getting a page that was left RO when freed and then trying to write to
> > it, if I understand this.
> >
>
> Exactly.
Of course, I forgot about the linear mapping. On arm64, we've just queued
support for reflecting changes to read-only permissions in the linear map
[1]. So, whilst the linear map is always non-executable, we will need to
make parts of it writable again when freeing the module.
> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> but it would also call some arch hooks to put back the direct map
> permissions before the flush. Does that seem reasonable? It would
> need to be hooked up that implement set_memory_ro(), but that should
> be quite easy. If nothing else, it could fall back to set_memory_ro()
> in the absence of a better implementation.
You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
would open up a window where the vmap mapping is executable and the linear
mapping is writable, which is a bit rubbish.
Will
[1]
https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/commit/?h=for-next/core&id=c55191e96caa9d787e8f682c5e525b7f8172a3b4
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-05 11:41 ` Will Deacon
@ 2018-12-05 23:16 ` Andy Lutomirski
-1 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-05 23:16 UTC (permalink / raw)
To: Will Deacon
Cc: Andrew Lutomirski, Rick Edgecombe, Nadav Amit, LKML,
Daniel Borkmann, jeyu, Steven Rostedt, Alexei Starovoitov,
Ard Biesheuvel, Linux-MM, Jann Horn, Dock, Deneen T,
Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu, Naveen N . Rao, David S. Miller,
Network Development, Dave Hansen
On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
>
> On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> > On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > wrote:
> > > > > >
> > > > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > > > pages,
> > > > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > > > This is
> > > > > > undesirable for cases where the memory being freed has special permissions
> > > > > > such
> > > > > > as executable.
> > > > >
> > > > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > > > for
> > > > > pointing it out).
> > > > >
> > > > > But all of the sudden, I don’t understand why we have the problem that this
> > > > > (your) patch-set deals with at all. We already change the mappings to make
> > > > > the memory writable before freeing the memory, so why can’t we make it
> > > > > non-executable at the same time? Actually, why do we make the module memory,
> > > > > including its data executable before freeing it???
> > > >
> > > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > > of the various different configurations and hysterical raisins. We can't
> > > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > > can we rely on disable_ro_nx() being available at build time.
> > > >
> > > > If we *could* rely on module allocations always using vmalloc(), then
> > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > afaict -- who cares about the memory attributes of a mapping that's about
> > > > to disappear anyway?
> > > >
> > > > Is it just nios2 that does something different?
> > > >
> > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> > > solve it as well, in fact that was what I first thought the solution should be
> > > until this was suggested. It's interesting that from the other thread Masami
> > > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > > inadvertently blocked this on x86. But, on the other architectures I have since
> > > learned it is a bit different.
> > >
> > > It looks like actually most arch's don't re-define set_memory_*, and so all of
> > > the frob_* functions are actually just noops. In which case allocating RWX is
> > > needed to make it work at all, because that is what the allocation is going to
> > > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > > nothing.
> > >
> > > On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> > > changing of the permissions on the directmap as well. You don't want some other
> > > caller getting a page that was left RO when freed and then trying to write to
> > > it, if I understand this.
> > >
> >
> > Exactly.
>
> Of course, I forgot about the linear mapping. On arm64, we've just queued
> support for reflecting changes to read-only permissions in the linear map
> [1]. So, whilst the linear map is always non-executable, we will need to
> make parts of it writable again when freeing the module.
>
> > After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> > VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> > but it would also call some arch hooks to put back the direct map
> > permissions before the flush. Does that seem reasonable? It would
> > need to be hooked up that implement set_memory_ro(), but that should
> > be quite easy. If nothing else, it could fall back to set_memory_ro()
> > in the absence of a better implementation.
>
> You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> would open up a window where the vmap mapping is executable and the linear
> mapping is writable, which is a bit rubbish.
>
Right, and Rick pointed out the same issue. Instead, we should set
the direct map not-present or its ARM equivalent, then do the flush,
then make it RW. I assume this also works on arm and arm64, although
I don't know for sure. On x86, the CPU won't cache not-present PTEs.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 23:16 ` Andy Lutomirski
0 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-05 23:16 UTC (permalink / raw)
To: Will Deacon
Cc: Andrew Lutomirski, Rick Edgecombe, Nadav Amit, LKML,
Daniel Borkmann, jeyu, Steven Rostedt, Alexei Starovoitov,
Ard Biesheuvel, Linux-MM, Jann Horn, Dock, Deneen T,
Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu,
On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
>
> On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> > On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> > <rick.p.edgecombe@intel.com> wrote:
> > >
> > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > wrote:
> > > > > >
> > > > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > > > pages,
> > > > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > > > This is
> > > > > > undesirable for cases where the memory being freed has special permissions
> > > > > > such
> > > > > > as executable.
> > > > >
> > > > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > > > for
> > > > > pointing it out).
> > > > >
> > > > > But all of the sudden, I don’t understand why we have the problem that this
> > > > > (your) patch-set deals with at all. We already change the mappings to make
> > > > > the memory writable before freeing the memory, so why can’t we make it
> > > > > non-executable at the same time? Actually, why do we make the module memory,
> > > > > including its data executable before freeing it???
> > > >
> > > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > > of the various different configurations and hysterical raisins. We can't
> > > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > > can we rely on disable_ro_nx() being available at build time.
> > > >
> > > > If we *could* rely on module allocations always using vmalloc(), then
> > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > afaict -- who cares about the memory attributes of a mapping that's about
> > > > to disappear anyway?
> > > >
> > > > Is it just nios2 that does something different?
> > > >
> > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> > > solve it as well, in fact that was what I first thought the solution should be
> > > until this was suggested. It's interesting that from the other thread Masami
> > > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > > inadvertently blocked this on x86. But, on the other architectures I have since
> > > learned it is a bit different.
> > >
> > > It looks like actually most arch's don't re-define set_memory_*, and so all of
> > > the frob_* functions are actually just noops. In which case allocating RWX is
> > > needed to make it work at all, because that is what the allocation is going to
> > > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > > nothing.
> > >
> > > On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> > > changing of the permissions on the directmap as well. You don't want some other
> > > caller getting a page that was left RO when freed and then trying to write to
> > > it, if I understand this.
> > >
> >
> > Exactly.
>
> Of course, I forgot about the linear mapping. On arm64, we've just queued
> support for reflecting changes to read-only permissions in the linear map
> [1]. So, whilst the linear map is always non-executable, we will need to
> make parts of it writable again when freeing the module.
>
> > After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> > VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> > but it would also call some arch hooks to put back the direct map
> > permissions before the flush. Does that seem reasonable? It would
> > need to be hooked up that implement set_memory_ro(), but that should
> > be quite easy. If nothing else, it could fall back to set_memory_ro()
> > in the absence of a better implementation.
>
> You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> would open up a window where the vmap mapping is executable and the linear
> mapping is writable, which is a bit rubbish.
>
Right, and Rick pointed out the same issue. Instead, we should set
the direct map not-present or its ARM equivalent, then do the flush,
then make it RW. I assume this also works on arm and arm64, although
I don't know for sure. On x86, the CPU won't cache not-present PTEs.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-05 23:16 ` Andy Lutomirski
@ 2018-12-06 7:29 ` Ard Biesheuvel
-1 siblings, 0 replies; 117+ messages in thread
From: Ard Biesheuvel @ 2018-12-06 7:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Will Deacon, Rick Edgecombe, nadav.amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller, <netdev@vger.kernel.org>,
Dave Hansen
On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> > > On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> > > <rick.p.edgecombe@intel.com> wrote:
> > > >
> > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > > > > pages,
> > > > > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > > > > This is
> > > > > > > undesirable for cases where the memory being freed has special permissions
> > > > > > > such
> > > > > > > as executable.
> > > > > >
> > > > > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > > > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > > > > for
> > > > > > pointing it out).
> > > > > >
> > > > > > But all of the sudden, I don’t understand why we have the problem that this
> > > > > > (your) patch-set deals with at all. We already change the mappings to make
> > > > > > the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
itable before freeing the memory, so why can’t we make it
> > > > > > non-executable at the same time? Actually, why do we make the module memory,
> > > > > > including its data executable before freeing it???
> > > > >
> > > > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > > > of the various different configurations and hysterical raisins. We can't
> > > > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > > > can we rely on disable_ro_nx() being available at build time.
> > > > >
> > > > > If we *could* rely on module allocations always using vmalloc(), then
> > > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > > afaict -- who cares about the memory attributes of a mapping that's about
> > > > > to disappear anyway?
> > > > >
> > > > > Is it just nios2 that does something different?
> > > > >
> > > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> > > > solve it as well, in fact that was what I first thought the solution should be
> > > > until this was suggested. It's interesting that from the other thread Masami
> > > > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > > > inadvertently blocked this on x86. But, on the other architectures I have since
> > > > learned it is a bit different.
> > > >
> > > > It looks like actually most arch's don't re-define set_memory_*, and so all of
> > > > the frob_* functions are actually just noops. In which case allocating RWX is
> > > > needed to make it work at all, because that is what the allocation is going to
> > > > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > > > nothing.
> > > >
> > > > On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> > > > changing of the permissions on the directmap as well. You don't want some other
> > > > caller getting a page that was left RO when freed and then trying to write to
> > > > it, if I understand this.
> > > >
> > >
> > > Exactly.
> >
> > Of course, I forgot about the linear mapping. On arm64, we've just queued
> > support for reflecting changes to read-only permissions in the linear map
> > [1]. So, whilst the linear map is always non-executable, we will need to
> > make parts of it writable again when freeing the module.
> >
> > > After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> > > VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> > > but it would also call some arch hooks to put back the direct map
> > > permissions before the flush. Does that seem reasonable? It would
> > > need to be hooked up that implement set_memory_ro(), but that should
> > > be quite easy. If nothing else, it could fall back to set_memory_ro()
> > > in the absence of a better implementation.
> >
> > You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> > would open up a window where the vmap mapping is executable and the linear
> > mapping is writable, which is a bit rubbish.
> >
>
> Right, and Rick pointed out the same issue. Instead, we should set
> the direct map not-present or its ARM equivalent, then do the flush,
> then make it RW. I assume this also works on arm and arm64, although
> I don't know for sure. On x86, the CPU won't cache not-present PTEs.
If we are going to unmap the linear alias, why not do it at vmalloc()
time rather than vfree() time?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 7:29 ` Ard Biesheuvel
0 siblings, 0 replies; 117+ messages in thread
From: Ard Biesheuvel @ 2018-12-06 7:29 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Will Deacon, Rick Edgecombe, nadav.amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller,
On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> > > On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> > > <rick.p.edgecombe@intel.com> wrote:
> > > >
> > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > > > > pages,
> > > > > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > > > > This is
> > > > > > > undesirable for cases where the memory being freed has special permissions
> > > > > > > such
> > > > > > > as executable.
> > > > > >
> > > > > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > > > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > > > > for
> > > > > > pointing it out).
> > > > > >
> > > > > > But all of the sudden, I don’t understand why we have the problem that this
> > > > > > (your) patch-set deals with at all. We already change the mappings to make
> > > > > > the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
itable before freeing the memory, so why can’t we make it
> > > > > > non-executable at the same time? Actually, why do we make the module memory,
> > > > > > including its data executable before freeing it???
> > > > >
> > > > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > > > of the various different configurations and hysterical raisins. We can't
> > > > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > > > can we rely on disable_ro_nx() being available at build time.
> > > > >
> > > > > If we *could* rely on module allocations always using vmalloc(), then
> > > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > > afaict -- who cares about the memory attributes of a mapping that's about
> > > > > to disappear anyway?
> > > > >
> > > > > Is it just nios2 that does something different?
> > > > >
> > > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> > > > solve it as well, in fact that was what I first thought the solution should be
> > > > until this was suggested. It's interesting that from the other thread Masami
> > > > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > > > inadvertently blocked this on x86. But, on the other architectures I have since
> > > > learned it is a bit different.
> > > >
> > > > It looks like actually most arch's don't re-define set_memory_*, and so all of
> > > > the frob_* functions are actually just noops. In which case allocating RWX is
> > > > needed to make it work at all, because that is what the allocation is going to
> > > > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > > > nothing.
> > > >
> > > > On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> > > > changing of the permissions on the directmap as well. You don't want some other
> > > > caller getting a page that was left RO when freed and then trying to write to
> > > > it, if I understand this.
> > > >
> > >
> > > Exactly.
> >
> > Of course, I forgot about the linear mapping. On arm64, we've just queued
> > support for reflecting changes to read-only permissions in the linear map
> > [1]. So, whilst the linear map is always non-executable, we will need to
> > make parts of it writable again when freeing the module.
> >
> > > After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> > > VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> > > but it would also call some arch hooks to put back the direct map
> > > permissions before the flush. Does that seem reasonable? It would
> > > need to be hooked up that implement set_memory_ro(), but that should
> > > be quite easy. If nothing else, it could fall back to set_memory_ro()
> > > in the absence of a better implementation.
> >
> > You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> > would open up a window where the vmap mapping is executable and the linear
> > mapping is writable, which is a bit rubbish.
> >
>
> Right, and Rick pointed out the same issue. Instead, we should set
> the direct map not-present or its ARM equivalent, then do the flush,
> then make it RW. I assume this also works on arm and arm64, although
> I don't know for sure. On x86, the CPU won't cache not-present PTEs.
If we are going to unmap the linear alias, why not do it at vmalloc()
time rather than vfree() time?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 7:29 ` Ard Biesheuvel
@ 2018-12-06 11:10 ` Will Deacon
-1 siblings, 0 replies; 117+ messages in thread
From: Will Deacon @ 2018-12-06 11:10 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Andy Lutomirski, Rick Edgecombe, nadav.amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller, <netdev@vger.kernel.org>,
Dave Hansen
On Thu, Dec 06, 2018 at 08:29:03AM +0100, Ard Biesheuvel wrote:
> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
> > >
> > > On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> > > > On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> > > > <rick.p.edgecombe@intel.com> wrote:
> > > > >
> > > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > > > > > pages,
> > > > > > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > > > > > This is
> > > > > > > > undesirable for cases where the memory being freed has special permissions
> > > > > > > > such
> > > > > > > > as executable.
> > > > > > >
> > > > > > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > > > > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > > > > > for
> > > > > > > pointing it out).
> > > > > > >
> > > > > > > But all of the sudden, I don’t understand why we have the problem that this
> > > > > > > (your) patch-set deals with at all. We already change the mappings to make
> > > > > > > the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> itable before freeing the memory, so why can’t we make it
> > > > > > > non-executable at the same time? Actually, why do we make the module memory,
> > > > > > > including its data executable before freeing it???
> > > > > >
> > > > > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > > > > of the various different configurations and hysterical raisins. We can't
> > > > > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > > > > can we rely on disable_ro_nx() being available at build time.
> > > > > >
> > > > > > If we *could* rely on module allocations always using vmalloc(), then
> > > > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > > > afaict -- who cares about the memory attributes of a mapping that's about
> > > > > > to disappear anyway?
> > > > > >
> > > > > > Is it just nios2 that does something different?
> > > > > >
> > > > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> > > > > solve it as well, in fact that was what I first thought the solution should be
> > > > > until this was suggested. It's interesting that from the other thread Masami
> > > > > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > > > > inadvertently blocked this on x86. But, on the other architectures I have since
> > > > > learned it is a bit different.
> > > > >
> > > > > It looks like actually most arch's don't re-define set_memory_*, and so all of
> > > > > the frob_* functions are actually just noops. In which case allocating RWX is
> > > > > needed to make it work at all, because that is what the allocation is going to
> > > > > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > > > > nothing.
> > > > >
> > > > > On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> > > > > changing of the permissions on the directmap as well. You don't want some other
> > > > > caller getting a page that was left RO when freed and then trying to write to
> > > > > it, if I understand this.
> > > > >
> > > >
> > > > Exactly.
> > >
> > > Of course, I forgot about the linear mapping. On arm64, we've just queued
> > > support for reflecting changes to read-only permissions in the linear map
> > > [1]. So, whilst the linear map is always non-executable, we will need to
> > > make parts of it writable again when freeing the module.
> > >
> > > > After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> > > > VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> > > > but it would also call some arch hooks to put back the direct map
> > > > permissions before the flush. Does that seem reasonable? It would
> > > > need to be hooked up that implement set_memory_ro(), but that should
> > > > be quite easy. If nothing else, it could fall back to set_memory_ro()
> > > > in the absence of a better implementation.
> > >
> > > You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> > > would open up a window where the vmap mapping is executable and the linear
> > > mapping is writable, which is a bit rubbish.
> > >
> >
> > Right, and Rick pointed out the same issue. Instead, we should set
> > the direct map not-present or its ARM equivalent, then do the flush,
> > then make it RW. I assume this also works on arm and arm64, although
> > I don't know for sure. On x86, the CPU won't cache not-present PTEs.
>
> If we are going to unmap the linear alias, why not do it at vmalloc()
> time rather than vfree() time?
Right, that should be pretty straightforward. We're basically saying that
RO in the vmalloc area implies PROT_NONE in the linear map, so we could
just do this in our set_memory_ro() function.
Will
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 11:10 ` Will Deacon
0 siblings, 0 replies; 117+ messages in thread
From: Will Deacon @ 2018-12-06 11:10 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Andy Lutomirski, Rick Edgecombe, nadav.amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller,
On Thu, Dec 06, 2018 at 08:29:03AM +0100, Ard Biesheuvel wrote:
> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
> > >
> > > On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> > > > On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> > > > <rick.p.edgecombe@intel.com> wrote:
> > > > >
> > > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > Since vfree will lazily flush the TLB, but not lazily free the underlying
> > > > > > > > pages,
> > > > > > > > it often leaves stale TLB entries to freed pages that could get re-used.
> > > > > > > > This is
> > > > > > > > undesirable for cases where the memory being freed has special permissions
> > > > > > > > such
> > > > > > > > as executable.
> > > > > > >
> > > > > > > So I am trying to finish my patch-set for preventing transient W+X mappings
> > > > > > > from taking space, by handling kprobes & ftrace that I missed (thanks again
> > > > > > > for
> > > > > > > pointing it out).
> > > > > > >
> > > > > > > But all of the sudden, I don’t understand why we have the problem that this
> > > > > > > (your) patch-set deals with at all. We already change the mappings to make
> > > > > > > the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> itable before freeing the memory, so why can’t we make it
> > > > > > > non-executable at the same time? Actually, why do we make the module memory,
> > > > > > > including its data executable before freeing it???
> > > > > >
> > > > > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > > > > of the various different configurations and hysterical raisins. We can't
> > > > > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > > > > can we rely on disable_ro_nx() being available at build time.
> > > > > >
> > > > > > If we *could* rely on module allocations always using vmalloc(), then
> > > > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > > > afaict -- who cares about the memory attributes of a mapping that's about
> > > > > > to disappear anyway?
> > > > > >
> > > > > > Is it just nios2 that does something different?
> > > > > >
> > > > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> > > > > solve it as well, in fact that was what I first thought the solution should be
> > > > > until this was suggested. It's interesting that from the other thread Masami
> > > > > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > > > > inadvertently blocked this on x86. But, on the other architectures I have since
> > > > > learned it is a bit different.
> > > > >
> > > > > It looks like actually most arch's don't re-define set_memory_*, and so all of
> > > > > the frob_* functions are actually just noops. In which case allocating RWX is
> > > > > needed to make it work at all, because that is what the allocation is going to
> > > > > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > > > > nothing.
> > > > >
> > > > > On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> > > > > changing of the permissions on the directmap as well. You don't want some other
> > > > > caller getting a page that was left RO when freed and then trying to write to
> > > > > it, if I understand this.
> > > > >
> > > >
> > > > Exactly.
> > >
> > > Of course, I forgot about the linear mapping. On arm64, we've just queued
> > > support for reflecting changes to read-only permissions in the linear map
> > > [1]. So, whilst the linear map is always non-executable, we will need to
> > > make parts of it writable again when freeing the module.
> > >
> > > > After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> > > > VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> > > > but it would also call some arch hooks to put back the direct map
> > > > permissions before the flush. Does that seem reasonable? It would
> > > > need to be hooked up that implement set_memory_ro(), but that should
> > > > be quite easy. If nothing else, it could fall back to set_memory_ro()
> > > > in the absence of a better implementation.
> > >
> > > You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> > > would open up a window where the vmap mapping is executable and the linear
> > > mapping is writable, which is a bit rubbish.
> > >
> >
> > Right, and Rick pointed out the same issue. Instead, we should set
> > the direct map not-present or its ARM equivalent, then do the flush,
> > then make it RW. I assume this also works on arm and arm64, although
> > I don't know for sure. On x86, the CPU won't cache not-present PTEs.
>
> If we are going to unmap the linear alias, why not do it at vmalloc()
> time rather than vfree() time?
Right, that should be pretty straightforward. We're basically saying that
RO in the vmalloc area implies PROT_NONE in the linear map, so we could
just do this in our set_memory_ro() function.
Will
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 7:29 ` Ard Biesheuvel
@ 2018-12-06 18:53 ` Andy Lutomirski
-1 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 18:53 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Andrew Lutomirski, Will Deacon, Rick Edgecombe, Nadav Amit, LKML,
Daniel Borkmann, Jessica Yu, Steven Rostedt, Alexei Starovoitov,
Linux-MM, Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masami Hiramatsu,
Naveen N . Rao, David S. Miller, Network Development,
Dave Hansen
> On Dec 5, 2018, at 11:29 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
>> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
>>
>>> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
>>>
>>>> On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
>>>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
>>>> <rick.p.edgecombe@intel.com> wrote:
>>>>>
>>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the underlying
>>>>>>>> pages,
>>>>>>>> it often leaves stale TLB entries to freed pages that could get re-used.
>>>>>>>> This is
>>>>>>>> undesirable for cases where the memory being freed has special permissions
>>>>>>>> such
>>>>>>>> as executable.
>>>>>>>
>>>>>>> So I am trying to finish my patch-set for preventing transient W+X mappings
>>>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks again
>>>>>>> for
>>>>>>> pointing it out).
>>>>>>>
>>>>>>> But all of the sudden, I don’t understand why we have the problem that this
>>>>>>> (your) patch-set deals with at all. We already change the mappings to make
>>>>>>> the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> itable before freeing the memory, so why can’t we make it
>>>>>>> non-executable at the same time? Actually, why do we make the module memory,
>>>>>>> including its data executable before freeing it???
>>>>>>
>>>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>>>> of the various different configurations and hysterical raisins. We can't
>>>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>>>> can we rely on disable_ro_nx() being available at build time.
>>>>>>
>>>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>>>> to disappear anyway?
>>>>>>
>>>>>> Is it just nios2 that does something different?
>>>>>>
>>>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
>>>>> solve it as well, in fact that was what I first thought the solution should be
>>>>> until this was suggested. It's interesting that from the other thread Masami
>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>>>> inadvertently blocked this on x86. But, on the other architectures I have since
>>>>> learned it is a bit different.
>>>>>
>>>>> It looks like actually most arch's don't re-define set_memory_*, and so all of
>>>>> the frob_* functions are actually just noops. In which case allocating RWX is
>>>>> needed to make it work at all, because that is what the allocation is going to
>>>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>>>> nothing.
>>>>>
>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
>>>>> changing of the permissions on the directmap as well. You don't want some other
>>>>> caller getting a page that was left RO when freed and then trying to write to
>>>>> it, if I understand this.
>>>>>
>>>>
>>>> Exactly.
>>>
>>> Of course, I forgot about the linear mapping. On arm64, we've just queued
>>> support for reflecting changes to read-only permissions in the linear map
>>> [1]. So, whilst the linear map is always non-executable, we will need to
>>> make parts of it writable again when freeing the module.
>>>
>>>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
>>>> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
>>>> but it would also call some arch hooks to put back the direct map
>>>> permissions before the flush. Does that seem reasonable? It would
>>>> need to be hooked up that implement set_memory_ro(), but that should
>>>> be quite easy. If nothing else, it could fall back to set_memory_ro()
>>>> in the absence of a better implementation.
>>>
>>> You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
>>> would open up a window where the vmap mapping is executable and the linear
>>> mapping is writable, which is a bit rubbish.
>>>
>>
>> Right, and Rick pointed out the same issue. Instead, we should set
>> the direct map not-present or its ARM equivalent, then do the flush,
>> then make it RW. I assume this also works on arm and arm64, although
>> I don't know for sure. On x86, the CPU won't cache not-present PTEs.
>
> If we are going to unmap the linear alias, why not do it at vmalloc()
> time rather than vfree() time?
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
RO instead of not present might be safer. But I do like the idea of
renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
making it do all of this.
(It seems like some people call it the linear map and some people call
it the direct map. Is there any preference?)
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 18:53 ` Andy Lutomirski
0 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 18:53 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Andrew Lutomirski, Will Deacon, Rick Edgecombe, Nadav Amit, LKML,
Daniel Borkmann, Jessica Yu, Steven Rostedt, Alexei Starovoitov,
Linux-MM, Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masami Hiramatsu
> On Dec 5, 2018, at 11:29 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
>
>> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
>>
>>> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
>>>
>>>> On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
>>>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
>>>> <rick.p.edgecombe@intel.com> wrote:
>>>>>
>>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the underlying
>>>>>>>> pages,
>>>>>>>> it often leaves stale TLB entries to freed pages that could get re-used.
>>>>>>>> This is
>>>>>>>> undesirable for cases where the memory being freed has special permissions
>>>>>>>> such
>>>>>>>> as executable.
>>>>>>>
>>>>>>> So I am trying to finish my patch-set for preventing transient W+X mappings
>>>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks again
>>>>>>> for
>>>>>>> pointing it out).
>>>>>>>
>>>>>>> But all of the sudden, I don’t understand why we have the problem that this
>>>>>>> (your) patch-set deals with at all. We already change the mappings to make
>>>>>>> the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> itable before freeing the memory, so why can’t we make it
>>>>>>> non-executable at the same time? Actually, why do we make the module memory,
>>>>>>> including its data executable before freeing it???
>>>>>>
>>>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>>>> of the various different configurations and hysterical raisins. We can't
>>>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>>>> can we rely on disable_ro_nx() being available at build time.
>>>>>>
>>>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>>>> to disappear anyway?
>>>>>>
>>>>>> Is it just nios2 that does something different?
>>>>>>
>>>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
>>>>> solve it as well, in fact that was what I first thought the solution should be
>>>>> until this was suggested. It's interesting that from the other thread Masami
>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>>>> inadvertently blocked this on x86. But, on the other architectures I have since
>>>>> learned it is a bit different.
>>>>>
>>>>> It looks like actually most arch's don't re-define set_memory_*, and so all of
>>>>> the frob_* functions are actually just noops. In which case allocating RWX is
>>>>> needed to make it work at all, because that is what the allocation is going to
>>>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>>>> nothing.
>>>>>
>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
>>>>> changing of the permissions on the directmap as well. You don't want some other
>>>>> caller getting a page that was left RO when freed and then trying to write to
>>>>> it, if I understand this.
>>>>>
>>>>
>>>> Exactly.
>>>
>>> Of course, I forgot about the linear mapping. On arm64, we've just queued
>>> support for reflecting changes to read-only permissions in the linear map
>>> [1]. So, whilst the linear map is always non-executable, we will need to
>>> make parts of it writable again when freeing the module.
>>>
>>>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
>>>> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
>>>> but it would also call some arch hooks to put back the direct map
>>>> permissions before the flush. Does that seem reasonable? It would
>>>> need to be hooked up that implement set_memory_ro(), but that should
>>>> be quite easy. If nothing else, it could fall back to set_memory_ro()
>>>> in the absence of a better implementation.
>>>
>>> You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
>>> would open up a window where the vmap mapping is executable and the linear
>>> mapping is writable, which is a bit rubbish.
>>>
>>
>> Right, and Rick pointed out the same issue. Instead, we should set
>> the direct map not-present or its ARM equivalent, then do the flush,
>> then make it RW. I assume this also works on arm and arm64, although
>> I don't know for sure. On x86, the CPU won't cache not-present PTEs.
>
> If we are going to unmap the linear alias, why not do it at vmalloc()
> time rather than vfree() time?
That’s not totally nuts. Do we ever have code that expects __va() to
work on module data? Perhaps crypto code trying to encrypt static
data because our APIs don’t understand virtual addresses. I guess if
highmem is ever used for modules, then we should be fine.
RO instead of not present might be safer. But I do like the idea of
renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
making it do all of this.
(It seems like some people call it the linear map and some people call
it the direct map. Is there any preference?)
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 18:53 ` Andy Lutomirski
@ 2018-12-06 19:01 ` Tycho Andersen
-1 siblings, 0 replies; 117+ messages in thread
From: Tycho Andersen @ 2018-12-06 19:01 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Ard Biesheuvel, Will Deacon, Rick Edgecombe, Nadav Amit, LKML,
Daniel Borkmann, Jessica Yu, Steven Rostedt, Alexei Starovoitov,
Linux-MM, Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masami Hiramatsu,
Naveen N . Rao, David S. Miller, Network Development,
Dave Hansen
On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > On Dec 5, 2018, at 11:29 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >
> >> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >>> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
> >>>
> >>>> On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> >>>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> >>>> <rick.p.edgecombe@intel.com> wrote:
> >>>>>
> >>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> >>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> >>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the underlying
> >>>>>>>> pages,
> >>>>>>>> it often leaves stale TLB entries to freed pages that could get re-used.
> >>>>>>>> This is
> >>>>>>>> undesirable for cases where the memory being freed has special permissions
> >>>>>>>> such
> >>>>>>>> as executable.
> >>>>>>>
> >>>>>>> So I am trying to finish my patch-set for preventing transient W+X mappings
> >>>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks again
> >>>>>>> for
> >>>>>>> pointing it out).
> >>>>>>>
> >>>>>>> But all of the sudden, I don’t understand why we have the problem that this
> >>>>>>> (your) patch-set deals with at all. We already change the mappings to make
> >>>>>>> the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > itable before freeing the memory, so why can’t we make it
> >>>>>>> non-executable at the same time? Actually, why do we make the module memory,
> >>>>>>> including its data executable before freeing it???
> >>>>>>
> >>>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
> >>>>>> of the various different configurations and hysterical raisins. We can't
> >>>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> >>>>>> can we rely on disable_ro_nx() being available at build time.
> >>>>>>
> >>>>>> If we *could* rely on module allocations always using vmalloc(), then
> >>>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
> >>>>>> afaict -- who cares about the memory attributes of a mapping that's about
> >>>>>> to disappear anyway?
> >>>>>>
> >>>>>> Is it just nios2 that does something different?
> >>>>>>
> >>>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> >>>>> solve it as well, in fact that was what I first thought the solution should be
> >>>>> until this was suggested. It's interesting that from the other thread Masami
> >>>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
> >>>>> inadvertently blocked this on x86. But, on the other architectures I have since
> >>>>> learned it is a bit different.
> >>>>>
> >>>>> It looks like actually most arch's don't re-define set_memory_*, and so all of
> >>>>> the frob_* functions are actually just noops. In which case allocating RWX is
> >>>>> needed to make it work at all, because that is what the allocation is going to
> >>>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
> >>>>> nothing.
> >>>>>
> >>>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> >>>>> changing of the permissions on the directmap as well. You don't want some other
> >>>>> caller getting a page that was left RO when freed and then trying to write to
> >>>>> it, if I understand this.
> >>>>>
> >>>>
> >>>> Exactly.
> >>>
> >>> Of course, I forgot about the linear mapping. On arm64, we've just queued
> >>> support for reflecting changes to read-only permissions in the linear map
> >>> [1]. So, whilst the linear map is always non-executable, we will need to
> >>> make parts of it writable again when freeing the module.
> >>>
> >>>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> >>>> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> >>>> but it would also call some arch hooks to put back the direct map
> >>>> permissions before the flush. Does that seem reasonable? It would
> >>>> need to be hooked up that implement set_memory_ro(), but that should
> >>>> be quite easy. If nothing else, it could fall back to set_memory_ro()
> >>>> in the absence of a better implementation.
> >>>
> >>> You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> >>> would open up a window where the vmap mapping is executable and the linear
> >>> mapping is writable, which is a bit rubbish.
> >>>
> >>
> >> Right, and Rick pointed out the same issue. Instead, we should set
> >> the direct map not-present or its ARM equivalent, then do the flush,
> >> then make it RW. I assume this also works on arm and arm64, although
> >> I don't know for sure. On x86, the CPU won't cache not-present PTEs.
> >
> > If we are going to unmap the linear alias, why not do it at vmalloc()
> > time rather than vfree() time?
>
> That’s not totally nuts. Do we ever have code that expects __va() to
> work on module data? Perhaps crypto code trying to encrypt static
> data because our APIs don’t understand virtual addresses. I guess if
> highmem is ever used for modules, then we should be fine.
>
> RO instead of not present might be safer. But I do like the idea of
> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> making it do all of this.
Yeah, doing it for everything automatically seemed like it was/is
going to be a lot of work to debug all the corner cases where things
expect memory to be mapped but don't explicitly say it. And in
particular, the XPFO series only does it for user memory, whereas an
additional flag like this would work for extra paranoid allocations
of kernel memory too.
Seems like maybe we should do this for rodata today?
> (It seems like some people call it the linear map and some people call
> it the direct map. Is there any preference?)
...and some people call it the physmap :)
Tycho
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 19:01 ` Tycho Andersen
0 siblings, 0 replies; 117+ messages in thread
From: Tycho Andersen @ 2018-12-06 19:01 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Ard Biesheuvel, Will Deacon, Rick Edgecombe, Nadav Amit, LKML,
Daniel Borkmann, Jessica Yu, Steven Rostedt, Alexei Starovoitov,
Linux-MM, Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masami Hiramatsu
On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > On Dec 5, 2018, at 11:29 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >
> >> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >>> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
> >>>
> >>>> On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> >>>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> >>>> <rick.p.edgecombe@intel.com> wrote:
> >>>>>
> >>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> >>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> >>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the underlying
> >>>>>>>> pages,
> >>>>>>>> it often leaves stale TLB entries to freed pages that could get re-used.
> >>>>>>>> This is
> >>>>>>>> undesirable for cases where the memory being freed has special permissions
> >>>>>>>> such
> >>>>>>>> as executable.
> >>>>>>>
> >>>>>>> So I am trying to finish my patch-set for preventing transient W+X mappings
> >>>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks again
> >>>>>>> for
> >>>>>>> pointing it out).
> >>>>>>>
> >>>>>>> But all of the sudden, I don’t understand why we have the problem that this
> >>>>>>> (your) patch-set deals with at all. We already change the mappings to make
> >>>>>>> the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > itable before freeing the memory, so why can’t we make it
> >>>>>>> non-executable at the same time? Actually, why do we make the module memory,
> >>>>>>> including its data executable before freeing it???
> >>>>>>
> >>>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
> >>>>>> of the various different configurations and hysterical raisins. We can't
> >>>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> >>>>>> can we rely on disable_ro_nx() being available at build time.
> >>>>>>
> >>>>>> If we *could* rely on module allocations always using vmalloc(), then
> >>>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
> >>>>>> afaict -- who cares about the memory attributes of a mapping that's about
> >>>>>> to disappear anyway?
> >>>>>>
> >>>>>> Is it just nios2 that does something different?
> >>>>>>
> >>>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> >>>>> solve it as well, in fact that was what I first thought the solution should be
> >>>>> until this was suggested. It's interesting that from the other thread Masami
> >>>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
> >>>>> inadvertently blocked this on x86. But, on the other architectures I have since
> >>>>> learned it is a bit different.
> >>>>>
> >>>>> It looks like actually most arch's don't re-define set_memory_*, and so all of
> >>>>> the frob_* functions are actually just noops. In which case allocating RWX is
> >>>>> needed to make it work at all, because that is what the allocation is going to
> >>>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
> >>>>> nothing.
> >>>>>
> >>>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> >>>>> changing of the permissions on the directmap as well. You don't want some other
> >>>>> caller getting a page that was left RO when freed and then trying to write to
> >>>>> it, if I understand this.
> >>>>>
> >>>>
> >>>> Exactly.
> >>>
> >>> Of course, I forgot about the linear mapping. On arm64, we've just queued
> >>> support for reflecting changes to read-only permissions in the linear map
> >>> [1]. So, whilst the linear map is always non-executable, we will need to
> >>> make parts of it writable again when freeing the module.
> >>>
> >>>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> >>>> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> >>>> but it would also call some arch hooks to put back the direct map
> >>>> permissions before the flush. Does that seem reasonable? It would
> >>>> need to be hooked up that implement set_memory_ro(), but that should
> >>>> be quite easy. If nothing else, it could fall back to set_memory_ro()
> >>>> in the absence of a better implementation.
> >>>
> >>> You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> >>> would open up a window where the vmap mapping is executable and the linear
> >>> mapping is writable, which is a bit rubbish.
> >>>
> >>
> >> Right, and Rick pointed out the same issue. Instead, we should set
> >> the direct map not-present or its ARM equivalent, then do the flush,
> >> then make it RW. I assume this also works on arm and arm64, although
> >> I don't know for sure. On x86, the CPU won't cache not-present PTEs.
> >
> > If we are going to unmap the linear alias, why not do it at vmalloc()
> > time rather than vfree() time?
>
> That’s not totally nuts. Do we ever have code that expects __va() to
> work on module data? Perhaps crypto code trying to encrypt static
> data because our APIs don’t understand virtual addresses. I guess if
> highmem is ever used for modules, then we should be fine.
>
> RO instead of not present might be safer. But I do like the idea of
> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> making it do all of this.
Yeah, doing it for everything automatically seemed like it was/is
going to be a lot of work to debug all the corner cases where things
expect memory to be mapped but don't explicitly say it. And in
particular, the XPFO series only does it for user memory, whereas an
additional flag like this would work for extra paranoid allocations
of kernel memory too.
Seems like maybe we should do this for rodata today?
> (It seems like some people call it the linear map and some people call
> it the direct map. Is there any preference?)
...and some people call it the physmap :)
Tycho
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 19:01 ` Tycho Andersen
@ 2018-12-06 19:19 ` Andy Lutomirski
-1 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 19:19 UTC (permalink / raw)
To: Tycho Andersen
Cc: Andrew Lutomirski, Ard Biesheuvel, Will Deacon, Rick Edgecombe,
Nadav Amit, LKML, Daniel Borkmann, Jessica Yu, Steven Rostedt,
Alexei Starovoitov, Linux-MM, Jann Horn, Dock, Deneen T,
Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu, Naveen N . Rao, David S. Miller,
Network Development, Dave Hansen
On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > time rather than vfree() time?
> >
> > That’s not totally nuts. Do we ever have code that expects __va() to
> > work on module data? Perhaps crypto code trying to encrypt static
> > data because our APIs don’t understand virtual addresses. I guess if
> > highmem is ever used for modules, then we should be fine.
> >
> > RO instead of not present might be safer. But I do like the idea of
> > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > making it do all of this.
>
> Yeah, doing it for everything automatically seemed like it was/is
> going to be a lot of work to debug all the corner cases where things
> expect memory to be mapped but don't explicitly say it. And in
> particular, the XPFO series only does it for user memory, whereas an
> additional flag like this would work for extra paranoid allocations
> of kernel memory too.
>
I just read the code, and I looks like vmalloc() is already using
highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
example, we already don't have modules in the direct map.
So I say we go for it. This should be quite simple to implement --
the pageattr code already has almost all the needed logic on x86. The
only arch support we should need is a pair of functions to remove a
vmalloc address range from the address map (if it was present in the
first place) and a function to put it back. On x86, this should only
be a few lines of code.
What do you all think? This should solve most of the problems we have.
If we really wanted to optimize this, we'd make it so that
module_alloc() allocates memory the normal way, then, later on, we
call some function that, all at once, removes the memory from the
direct map and applies the right permissions to the vmalloc alias (or
just makes the vmalloc alias not-present so we can add permissions
later without flushing), and flushes the TLB. And we arrange for
vunmap to zap the vmalloc range, then put the memory back into the
direct map, then free the pages back to the page allocator, with the
flush in the appropriate place.
I don't see why the page allocator needs to know about any of this.
It's already okay with the permissions being changed out from under it
on x86, and it seems fine. Rick, do you want to give some variant of
this a try?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 19:19 ` Andy Lutomirski
0 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 19:19 UTC (permalink / raw)
To: Tycho Andersen
Cc: Andrew Lutomirski, Ard Biesheuvel, Will Deacon, Rick Edgecombe,
Nadav Amit, LKML, Daniel Borkmann, Jessica Yu, Steven Rostedt,
Alexei Starovoitov, Linux-MM, Jann Horn, Dock, Deneen T,
Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening
On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > time rather than vfree() time?
> >
> > That’s not totally nuts. Do we ever have code that expects __va() to
> > work on module data? Perhaps crypto code trying to encrypt static
> > data because our APIs don’t understand virtual addresses. I guess if
> > highmem is ever used for modules, then we should be fine.
> >
> > RO instead of not present might be safer. But I do like the idea of
> > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > making it do all of this.
>
> Yeah, doing it for everything automatically seemed like it was/is
> going to be a lot of work to debug all the corner cases where things
> expect memory to be mapped but don't explicitly say it. And in
> particular, the XPFO series only does it for user memory, whereas an
> additional flag like this would work for extra paranoid allocations
> of kernel memory too.
>
I just read the code, and I looks like vmalloc() is already using
highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
example, we already don't have modules in the direct map.
So I say we go for it. This should be quite simple to implement --
the pageattr code already has almost all the needed logic on x86. The
only arch support we should need is a pair of functions to remove a
vmalloc address range from the address map (if it was present in the
first place) and a function to put it back. On x86, this should only
be a few lines of code.
What do you all think? This should solve most of the problems we have.
If we really wanted to optimize this, we'd make it so that
module_alloc() allocates memory the normal way, then, later on, we
call some function that, all at once, removes the memory from the
direct map and applies the right permissions to the vmalloc alias (or
just makes the vmalloc alias not-present so we can add permissions
later without flushing), and flushes the TLB. And we arrange for
vunmap to zap the vmalloc range, then put the memory back into the
direct map, then free the pages back to the page allocator, with the
flush in the appropriate place.
I don't see why the page allocator needs to know about any of this.
It's already okay with the permissions being changed out from under it
on x86, and it seems fine. Rick, do you want to give some variant of
this a try?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 19:19 ` Andy Lutomirski
@ 2018-12-06 19:39 ` Nadav Amit
-1 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-06 19:39 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Tycho Andersen, Ard Biesheuvel, Will Deacon, Rick Edgecombe,
LKML, Daniel Borkmann, Jessica Yu, Steven Rostedt,
Alexei Starovoitov, Linux-MM, Jann Horn, Dock, Deneen T,
Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu, Naveen N . Rao, David S. Miller,
Network Development, Dave Hansen
> On Dec 6, 2018, at 11:19 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
>> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
>>>> If we are going to unmap the linear alias, why not do it at vmalloc()
>>>> time rather than vfree() time?
>>>
>>> That’s not totally nuts. Do we ever have code that expects __va() to
>>> work on module data? Perhaps crypto code trying to encrypt static
>>> data because our APIs don’t understand virtual addresses. I guess if
>>> highmem is ever used for modules, then we should be fine.
>>>
>>> RO instead of not present might be safer. But I do like the idea of
>>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
>>> making it do all of this.
>>
>> Yeah, doing it for everything automatically seemed like it was/is
>> going to be a lot of work to debug all the corner cases where things
>> expect memory to be mapped but don't explicitly say it. And in
>> particular, the XPFO series only does it for user memory, whereas an
>> additional flag like this would work for extra paranoid allocations
>> of kernel memory too.
>
> I just read the code, and I looks like vmalloc() is already using
> highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> example, we already don't have modules in the direct map.
>
> So I say we go for it. This should be quite simple to implement --
> the pageattr code already has almost all the needed logic on x86. The
> only arch support we should need is a pair of functions to remove a
> vmalloc address range from the address map (if it was present in the
> first place) and a function to put it back. On x86, this should only
> be a few lines of code.
>
> What do you all think? This should solve most of the problems we have.
>
> If we really wanted to optimize this, we'd make it so that
> module_alloc() allocates memory the normal way, then, later on, we
> call some function that, all at once, removes the memory from the
> direct map and applies the right permissions to the vmalloc alias (or
> just makes the vmalloc alias not-present so we can add permissions
> later without flushing), and flushes the TLB. And we arrange for
> vunmap to zap the vmalloc range, then put the memory back into the
> direct map, then free the pages back to the page allocator, with the
> flush in the appropriate place.
>
> I don't see why the page allocator needs to know about any of this.
> It's already okay with the permissions being changed out from under it
> on x86, and it seems fine. Rick, do you want to give some variant of
> this a try?
Setting it as read-only may work (and already happens for the read-only
module data). I am not sure about setting it as non-present.
At some point, a discussion about a threat-model, as Rick indicated, would
be required. I presume ROP attacks can easily call set_all_modules_text_rw()
and override all the protections.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 19:39 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-06 19:39 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Tycho Andersen, Ard Biesheuvel, Will Deacon, Rick Edgecombe,
LKML, Daniel Borkmann, Jessica Yu, Steven Rostedt,
Alexei Starovoitov, Linux-MM, Jann Horn, Dock, Deneen T,
Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu
> On Dec 6, 2018, at 11:19 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
>> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
>>>> If we are going to unmap the linear alias, why not do it at vmalloc()
>>>> time rather than vfree() time?
>>>
>>> That’s not totally nuts. Do we ever have code that expects __va() to
>>> work on module data? Perhaps crypto code trying to encrypt static
>>> data because our APIs don’t understand virtual addresses. I guess if
>>> highmem is ever used for modules, then we should be fine.
>>>
>>> RO instead of not present might be safer. But I do like the idea of
>>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
>>> making it do all of this.
>>
>> Yeah, doing it for everything automatically seemed like it was/is
>> going to be a lot of work to debug all the corner cases where things
>> expect memory to be mapped but don't explicitly say it. And in
>> particular, the XPFO series only does it for user memory, whereas an
>> additional flag like this would work for extra paranoid allocations
>> of kernel memory too.
>
> I just read the code, and I looks like vmalloc() is already using
> highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> example, we already don't have modules in the direct map.
>
> So I say we go for it. This should be quite simple to implement --
> the pageattr code already has almost all the needed logic on x86. The
> only arch support we should need is a pair of functions to remove a
> vmalloc address range from the address map (if it was present in the
> first place) and a function to put it back. On x86, this should only
> be a few lines of code.
>
> What do you all think? This should solve most of the problems we have.
>
> If we really wanted to optimize this, we'd make it so that
> module_alloc() allocates memory the normal way, then, later on, we
> call some function that, all at once, removes the memory from the
> direct map and applies the right permissions to the vmalloc alias (or
> just makes the vmalloc alias not-present so we can add permissions
> later without flushing), and flushes the TLB. And we arrange for
> vunmap to zap the vmalloc range, then put the memory back into the
> direct map, then free the pages back to the page allocator, with the
> flush in the appropriate place.
>
> I don't see why the page allocator needs to know about any of this.
> It's already okay with the permissions being changed out from under it
> on x86, and it seems fine. Rick, do you want to give some variant of
> this a try?
Setting it as read-only may work (and already happens for the read-only
module data). I am not sure about setting it as non-present.
At some point, a discussion about a threat-model, as Rick indicated, would
be required. I presume ROP attacks can easily call set_all_modules_text_rw()
and override all the protections.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 19:39 ` Nadav Amit
@ 2018-12-06 20:17 ` Andy Lutomirski
-1 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 20:17 UTC (permalink / raw)
To: Nadav Amit
Cc: Andrew Lutomirski, Tycho Andersen, Ard Biesheuvel, Will Deacon,
Rick Edgecombe, LKML, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu, Naveen N . Rao, David S. Miller,
Network Development, Dave Hansen
On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > On Dec 6, 2018, at 11:19 AM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> >>>> If we are going to unmap the linear alias, why not do it at vmalloc()
> >>>> time rather than vfree() time?
> >>>
> >>> That’s not totally nuts. Do we ever have code that expects __va() to
> >>> work on module data? Perhaps crypto code trying to encrypt static
> >>> data because our APIs don’t understand virtual addresses. I guess if
> >>> highmem is ever used for modules, then we should be fine.
> >>>
> >>> RO instead of not present might be safer. But I do like the idea of
> >>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> >>> making it do all of this.
> >>
> >> Yeah, doing it for everything automatically seemed like it was/is
> >> going to be a lot of work to debug all the corner cases where things
> >> expect memory to be mapped but don't explicitly say it. And in
> >> particular, the XPFO series only does it for user memory, whereas an
> >> additional flag like this would work for extra paranoid allocations
> >> of kernel memory too.
> >
> > I just read the code, and I looks like vmalloc() is already using
> > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> > example, we already don't have modules in the direct map.
> >
> > So I say we go for it. This should be quite simple to implement --
> > the pageattr code already has almost all the needed logic on x86. The
> > only arch support we should need is a pair of functions to remove a
> > vmalloc address range from the address map (if it was present in the
> > first place) and a function to put it back. On x86, this should only
> > be a few lines of code.
> >
> > What do you all think? This should solve most of the problems we have.
> >
> > If we really wanted to optimize this, we'd make it so that
> > module_alloc() allocates memory the normal way, then, later on, we
> > call some function that, all at once, removes the memory from the
> > direct map and applies the right permissions to the vmalloc alias (or
> > just makes the vmalloc alias not-present so we can add permissions
> > later without flushing), and flushes the TLB. And we arrange for
> > vunmap to zap the vmalloc range, then put the memory back into the
> > direct map, then free the pages back to the page allocator, with the
> > flush in the appropriate place.
> >
> > I don't see why the page allocator needs to know about any of this.
> > It's already okay with the permissions being changed out from under it
> > on x86, and it seems fine. Rick, do you want to give some variant of
> > this a try?
>
> Setting it as read-only may work (and already happens for the read-only
> module data). I am not sure about setting it as non-present.
>
> At some point, a discussion about a threat-model, as Rick indicated, would
> be required. I presume ROP attacks can easily call set_all_modules_text_rw()
> and override all the protections.
>
I am far from an expert on exploit techniques, but here's a
potentially useful model: let's assume there's an attacker who can
write controlled data to a controlled kernel address but cannot
directly modify control flow. It would be nice for such an attacker
to have a very difficult time of modifying kernel text or of
compromising control flow. So we're assuming a feature like kernel
CET or that the attacker finds it very difficult to do something like
modifying some thread's IRET frame.
Admittedly, for the kernel, this is an odd threat model, since an
attacker can presumably quite easily learn the kernel stack address of
one of their tasks, do some syscall, and then modify their kernel
thread's stack such that it will IRET right back to a fully controlled
register state with RSP pointing at an attacker-supplied kernel stack.
So this threat model gives very strong ROP powers. unless we have
either CET or some software technique to harden all the RET
instructions in the kernel.
I wonder if there's a better model to use. Maybe with stack-protector
we get some degree of protection? Or is all of this is rather weak
until we have CET or a RAP-like feature.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 20:17 ` Andy Lutomirski
0 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 20:17 UTC (permalink / raw)
To: Nadav Amit
Cc: Andrew Lutomirski, Tycho Andersen, Ard Biesheuvel, Will Deacon,
Rick Edgecombe, LKML, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening
On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>
> > On Dec 6, 2018, at 11:19 AM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> >>>> If we are going to unmap the linear alias, why not do it at vmalloc()
> >>>> time rather than vfree() time?
> >>>
> >>> That’s not totally nuts. Do we ever have code that expects __va() to
> >>> work on module data? Perhaps crypto code trying to encrypt static
> >>> data because our APIs don’t understand virtual addresses. I guess if
> >>> highmem is ever used for modules, then we should be fine.
> >>>
> >>> RO instead of not present might be safer. But I do like the idea of
> >>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> >>> making it do all of this.
> >>
> >> Yeah, doing it for everything automatically seemed like it was/is
> >> going to be a lot of work to debug all the corner cases where things
> >> expect memory to be mapped but don't explicitly say it. And in
> >> particular, the XPFO series only does it for user memory, whereas an
> >> additional flag like this would work for extra paranoid allocations
> >> of kernel memory too.
> >
> > I just read the code, and I looks like vmalloc() is already using
> > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> > example, we already don't have modules in the direct map.
> >
> > So I say we go for it. This should be quite simple to implement --
> > the pageattr code already has almost all the needed logic on x86. The
> > only arch support we should need is a pair of functions to remove a
> > vmalloc address range from the address map (if it was present in the
> > first place) and a function to put it back. On x86, this should only
> > be a few lines of code.
> >
> > What do you all think? This should solve most of the problems we have.
> >
> > If we really wanted to optimize this, we'd make it so that
> > module_alloc() allocates memory the normal way, then, later on, we
> > call some function that, all at once, removes the memory from the
> > direct map and applies the right permissions to the vmalloc alias (or
> > just makes the vmalloc alias not-present so we can add permissions
> > later without flushing), and flushes the TLB. And we arrange for
> > vunmap to zap the vmalloc range, then put the memory back into the
> > direct map, then free the pages back to the page allocator, with the
> > flush in the appropriate place.
> >
> > I don't see why the page allocator needs to know about any of this.
> > It's already okay with the permissions being changed out from under it
> > on x86, and it seems fine. Rick, do you want to give some variant of
> > this a try?
>
> Setting it as read-only may work (and already happens for the read-only
> module data). I am not sure about setting it as non-present.
>
> At some point, a discussion about a threat-model, as Rick indicated, would
> be required. I presume ROP attacks can easily call set_all_modules_text_rw()
> and override all the protections.
>
I am far from an expert on exploit techniques, but here's a
potentially useful model: let's assume there's an attacker who can
write controlled data to a controlled kernel address but cannot
directly modify control flow. It would be nice for such an attacker
to have a very difficult time of modifying kernel text or of
compromising control flow. So we're assuming a feature like kernel
CET or that the attacker finds it very difficult to do something like
modifying some thread's IRET frame.
Admittedly, for the kernel, this is an odd threat model, since an
attacker can presumably quite easily learn the kernel stack address of
one of their tasks, do some syscall, and then modify their kernel
thread's stack such that it will IRET right back to a fully controlled
register state with RSP pointing at an attacker-supplied kernel stack.
So this threat model gives very strong ROP powers. unless we have
either CET or some software technique to harden all the RET
instructions in the kernel.
I wonder if there's a better model to use. Maybe with stack-protector
we get some degree of protection? Or is all of this is rather weak
until we have CET or a RAP-like feature.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 20:17 ` Andy Lutomirski
@ 2018-12-06 23:08 ` Nadav Amit
-1 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-06 23:08 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Tycho Andersen, Ard Biesheuvel, Will Deacon, Rick Edgecombe,
LKML, Daniel Borkmann, Jessica Yu, Steven Rostedt,
Alexei Starovoitov, Linux-MM, Jann Horn, Dock, Deneen T,
Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu, Naveen N . Rao, David S. Miller,
Network Development, Dave Hansen, Igor Stoppa
> On Dec 6, 2018, at 12:17 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> On Dec 6, 2018, at 11:19 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
>>>> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
>>>>>> If we are going to unmap the linear alias, why not do it at vmalloc()
>>>>>> time rather than vfree() time?
>>>>>
>>>>> That’s not totally nuts. Do we ever have code that expects __va() to
>>>>> work on module data? Perhaps crypto code trying to encrypt static
>>>>> data because our APIs don’t understand virtual addresses. I guess if
>>>>> highmem is ever used for modules, then we should be fine.
>>>>>
>>>>> RO instead of not present might be safer. But I do like the idea of
>>>>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
>>>>> making it do all of this.
>>>>
>>>> Yeah, doing it for everything automatically seemed like it was/is
>>>> going to be a lot of work to debug all the corner cases where things
>>>> expect memory to be mapped but don't explicitly say it. And in
>>>> particular, the XPFO series only does it for user memory, whereas an
>>>> additional flag like this would work for extra paranoid allocations
>>>> of kernel memory too.
>>>
>>> I just read the code, and I looks like vmalloc() is already using
>>> highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
>>> example, we already don't have modules in the direct map.
>>>
>>> So I say we go for it. This should be quite simple to implement --
>>> the pageattr code already has almost all the needed logic on x86. The
>>> only arch support we should need is a pair of functions to remove a
>>> vmalloc address range from the address map (if it was present in the
>>> first place) and a function to put it back. On x86, this should only
>>> be a few lines of code.
>>>
>>> What do you all think? This should solve most of the problems we have.
>>>
>>> If we really wanted to optimize this, we'd make it so that
>>> module_alloc() allocates memory the normal way, then, later on, we
>>> call some function that, all at once, removes the memory from the
>>> direct map and applies the right permissions to the vmalloc alias (or
>>> just makes the vmalloc alias not-present so we can add permissions
>>> later without flushing), and flushes the TLB. And we arrange for
>>> vunmap to zap the vmalloc range, then put the memory back into the
>>> direct map, then free the pages back to the page allocator, with the
>>> flush in the appropriate place.
>>>
>>> I don't see why the page allocator needs to know about any of this.
>>> It's already okay with the permissions being changed out from under it
>>> on x86, and it seems fine. Rick, do you want to give some variant of
>>> this a try?
>>
>> Setting it as read-only may work (and already happens for the read-only
>> module data). I am not sure about setting it as non-present.
>>
>> At some point, a discussion about a threat-model, as Rick indicated, would
>> be required. I presume ROP attacks can easily call set_all_modules_text_rw()
>> and override all the protections.
>
> I am far from an expert on exploit techniques, but here's a
> potentially useful model: let's assume there's an attacker who can
> write controlled data to a controlled kernel address but cannot
> directly modify control flow. It would be nice for such an attacker
> to have a very difficult time of modifying kernel text or of
> compromising control flow. So we're assuming a feature like kernel
> CET or that the attacker finds it very difficult to do something like
> modifying some thread's IRET frame.
>
> Admittedly, for the kernel, this is an odd threat model, since an
> attacker can presumably quite easily learn the kernel stack address of
> one of their tasks, do some syscall, and then modify their kernel
> thread's stack such that it will IRET right back to a fully controlled
> register state with RSP pointing at an attacker-supplied kernel stack.
> So this threat model gives very strong ROP powers. unless we have
> either CET or some software technique to harden all the RET
> instructions in the kernel.
>
> I wonder if there's a better model to use. Maybe with stack-protector
> we get some degree of protection? Or is all of this is rather weak
> until we have CET or a RAP-like feature.
I believe that seeing the end-goal would make reasoning about patches
easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping
up.
I’m not sure CET or other CFI would be enough even with this threat-model.
The page-tables (the very least) need to be write-protected, as otherwise
controlled data writes may just modify them. There are various possible
solutions I presume: write_rare for page-tables, hypervisor-assisted
security to obtain physical level NX/RO (a-la Microsoft VBS) or some sort of
hardware enclave.
What do you think?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 23:08 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-06 23:08 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Tycho Andersen, Ard Biesheuvel, Will Deacon, Rick Edgecombe,
LKML, Daniel Borkmann, Jessica Yu, Steven Rostedt,
Alexei Starovoitov, Linux-MM, Jann Horn, Dock, Deneen T,
Peter Zijlstra, Kristen Carlson Accardi, Andrew Morton,
Ingo Molnar, Anil S Keshavamurthy, Kernel Hardening,
Masami Hiramatsu
> On Dec 6, 2018, at 12:17 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit <nadav.amit@gmail.com> wrote:
>>> On Dec 6, 2018, at 11:19 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>
>>> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
>>>> On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
>>>>>> If we are going to unmap the linear alias, why not do it at vmalloc()
>>>>>> time rather than vfree() time?
>>>>>
>>>>> That’s not totally nuts. Do we ever have code that expects __va() to
>>>>> work on module data? Perhaps crypto code trying to encrypt static
>>>>> data because our APIs don’t understand virtual addresses. I guess if
>>>>> highmem is ever used for modules, then we should be fine.
>>>>>
>>>>> RO instead of not present might be safer. But I do like the idea of
>>>>> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
>>>>> making it do all of this.
>>>>
>>>> Yeah, doing it for everything automatically seemed like it was/is
>>>> going to be a lot of work to debug all the corner cases where things
>>>> expect memory to be mapped but don't explicitly say it. And in
>>>> particular, the XPFO series only does it for user memory, whereas an
>>>> additional flag like this would work for extra paranoid allocations
>>>> of kernel memory too.
>>>
>>> I just read the code, and I looks like vmalloc() is already using
>>> highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
>>> example, we already don't have modules in the direct map.
>>>
>>> So I say we go for it. This should be quite simple to implement --
>>> the pageattr code already has almost all the needed logic on x86. The
>>> only arch support we should need is a pair of functions to remove a
>>> vmalloc address range from the address map (if it was present in the
>>> first place) and a function to put it back. On x86, this should only
>>> be a few lines of code.
>>>
>>> What do you all think? This should solve most of the problems we have.
>>>
>>> If we really wanted to optimize this, we'd make it so that
>>> module_alloc() allocates memory the normal way, then, later on, we
>>> call some function that, all at once, removes the memory from the
>>> direct map and applies the right permissions to the vmalloc alias (or
>>> just makes the vmalloc alias not-present so we can add permissions
>>> later without flushing), and flushes the TLB. And we arrange for
>>> vunmap to zap the vmalloc range, then put the memory back into the
>>> direct map, then free the pages back to the page allocator, with the
>>> flush in the appropriate place.
>>>
>>> I don't see why the page allocator needs to know about any of this.
>>> It's already okay with the permissions being changed out from under it
>>> on x86, and it seems fine. Rick, do you want to give some variant of
>>> this a try?
>>
>> Setting it as read-only may work (and already happens for the read-only
>> module data). I am not sure about setting it as non-present.
>>
>> At some point, a discussion about a threat-model, as Rick indicated, would
>> be required. I presume ROP attacks can easily call set_all_modules_text_rw()
>> and override all the protections.
>
> I am far from an expert on exploit techniques, but here's a
> potentially useful model: let's assume there's an attacker who can
> write controlled data to a controlled kernel address but cannot
> directly modify control flow. It would be nice for such an attacker
> to have a very difficult time of modifying kernel text or of
> compromising control flow. So we're assuming a feature like kernel
> CET or that the attacker finds it very difficult to do something like
> modifying some thread's IRET frame.
>
> Admittedly, for the kernel, this is an odd threat model, since an
> attacker can presumably quite easily learn the kernel stack address of
> one of their tasks, do some syscall, and then modify their kernel
> thread's stack such that it will IRET right back to a fully controlled
> register state with RSP pointing at an attacker-supplied kernel stack.
> So this threat model gives very strong ROP powers. unless we have
> either CET or some software technique to harden all the RET
> instructions in the kernel.
>
> I wonder if there's a better model to use. Maybe with stack-protector
> we get some degree of protection? Or is all of this is rather weak
> until we have CET or a RAP-like feature.
I believe that seeing the end-goal would make reasoning about patches
easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping
up.
I’m not sure CET or other CFI would be enough even with this threat-model.
The page-tables (the very least) need to be write-protected, as otherwise
controlled data writes may just modify them. There are various possible
solutions I presume: write_rare for page-tables, hypervisor-assisted
security to obtain physical level NX/RO (a-la Microsoft VBS) or some sort of
hardware enclave.
What do you think?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 23:08 ` Nadav Amit
(?)
@ 2018-12-07 3:06 ` Edgecombe, Rick P
-1 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-07 3:06 UTC (permalink / raw)
To: keescook, luto, nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, ast, rostedt, jeyu,
linux-mm, jannh, Dock, Deneen T, peterz, kristen, akpm,
igor.stoppa, tycho, will.deacon, mingo, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
On Thu, 2018-12-06 at 15:08 -0800, Nadav Amit wrote:
> > On Dec 6, 2018, at 12:17 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> > > > On Dec 6, 2018, at 11:19 AM, Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> > > > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > > > > If we are going to unmap the linear alias, why not do it at
> > > > > > > vmalloc()
> > > > > > > time rather than vfree() time?
> > > > > >
> > > > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > > > work on module data? Perhaps crypto code trying to encrypt static
> > > > > > data because our APIs don’t understand virtual addresses. I guess
> > > > > > if
> > > > > > highmem is ever used for modules, then we should be fine.
> > > > > >
> > > > > > RO instead of not present might be safer. But I do like the idea of
> > > > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP
> > > > > > and
> > > > > > making it do all of this.
> > > > >
> > > > > Yeah, doing it for everything automatically seemed like it was/is
> > > > > going to be a lot of work to debug all the corner cases where things
> > > > > expect memory to be mapped but don't explicitly say it. And in
> > > > > particular, the XPFO series only does it for user memory, whereas an
> > > > > additional flag like this would work for extra paranoid allocations
> > > > > of kernel memory too.
> > > >
> > > > I just read the code, and I looks like vmalloc() is already using
> > > > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> > > > example, we already don't have modules in the direct map.
> > > >
> > > > So I say we go for it. This should be quite simple to implement --
> > > > the pageattr code already has almost all the needed logic on x86. The
> > > > only arch support we should need is a pair of functions to remove a
> > > > vmalloc address range from the address map (if it was present in the
> > > > first place) and a function to put it back. On x86, this should only
> > > > be a few lines of code.
> > > >
> > > > What do you all think? This should solve most of the problems we have.
> > > >
> > > > If we really wanted to optimize this, we'd make it so that
> > > > module_alloc() allocates memory the normal way, then, later on, we
> > > > call some function that, all at once, removes the memory from the
> > > > direct map and applies the right permissions to the vmalloc alias (or
> > > > just makes the vmalloc alias not-present so we can add permissions
> > > > later without flushing), and flushes the TLB. And we arrange for
> > > > vunmap to zap the vmalloc range, then put the memory back into the
> > > > direct map, then free the pages back to the page allocator, with the
> > > > flush in the appropriate place.
> > > >
> > > > I don't see why the page allocator needs to know about any of this.
> > > > It's already okay with the permissions being changed out from under it
> > > > on x86, and it seems fine. Rick, do you want to give some variant of
> > > > this a try?
> > >
> > > Setting it as read-only may work (and already happens for the read-only
> > > module data). I am not sure about setting it as non-present.
> > >
> > > At some point, a discussion about a threat-model, as Rick indicated, would
> > > be required. I presume ROP attacks can easily call
> > > set_all_modules_text_rw()
> > > and override all the protections.
> >
> > I am far from an expert on exploit techniques, but here's a
> > potentially useful model: let's assume there's an attacker who can
> > write controlled data to a controlled kernel address but cannot
> > directly modify control flow. It would be nice for such an attacker
> > to have a very difficult time of modifying kernel text or of
> > compromising control flow. So we're assuming a feature like kernel
> > CET or that the attacker finds it very difficult to do something like
> > modifying some thread's IRET frame.
> >
> > Admittedly, for the kernel, this is an odd threat model, since an
> > attacker can presumably quite easily learn the kernel stack address of
> > one of their tasks, do some syscall, and then modify their kernel
> > thread's stack such that it will IRET right back to a fully controlled
> > register state with RSP pointing at an attacker-supplied kernel stack.
> > So this threat model gives very strong ROP powers. unless we have
> > either CET or some software technique to harden all the RET
> > instructions in the kernel.
> >
> > I wonder if there's a better model to use. Maybe with stack-protector
> > we get some degree of protection? Or is all of this is rather weak
> > until we have CET or a RAP-like feature.
>
> I believe that seeing the end-goal would make reasoning about patches
> easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping
> up.
>
> I’m not sure CET or other CFI would be enough even with this threat-model.
> The page-tables (the very least) need to be write-protected, as otherwise
> controlled data writes may just modify them. There are various possible
> solutions I presume: write_rare for page-tables, hypervisor-assisted
> security to obtain physical level NX/RO (a-la Microsoft VBS) or some sort of
> hardware enclave.
>
> What do you think?
I am not sure which issue you are talking about. I think there are actually two
separate issues that are merged discussions from overlap of fix for the teardown
W^X window.
For the W^X stuff I had originally imagined the protection was for when an
attacker has a limited bug that could write to a location in the module space,
but not other locations due to only having the ability to overwrite part of a
pointer or some something like that. Then the module could execute the new code
as it ran normally after finishing loading. So that is why I was wondering about
the RW window during load. Still seems generally sensible to enforce W^X though.
I like your idea about something like text_poke to load modules. I think maybe
my modules KASLR patchset could help the above somewhat too since it loads at a
freshly randomized address.
Since the issue with the freed pages before flush (the original source of this
thread) doesn't require a write bug to insert the code, but does require a way
to jump to it, its kind of the opposite model of the above. So that's why I
think they are different.
I am still learning lots on kernel exploits though, maybe Kees can provide some
better insight here?
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-07 3:06 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-07 3:06 UTC (permalink / raw)
To: keescook, luto, nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, ast, rostedt, jeyu,
linux-mm, jannh, Dock, Deneen T, peterz, kristen, akpm,
igor.stoppa, tycho, will.deacon, mingo, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
On Thu, 2018-12-06 at 15:08 -0800, Nadav Amit wrote:
> > On Dec 6, 2018, at 12:17 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> > > > On Dec 6, 2018, at 11:19 AM, Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> > > > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > > > > If we are going to unmap the linear alias, why not do it at
> > > > > > > vmalloc()
> > > > > > > time rather than vfree() time?
> > > > > >
> > > > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > > > work on module data? Perhaps crypto code trying to encrypt static
> > > > > > data because our APIs don’t understand virtual addresses. I guess
> > > > > > if
> > > > > > highmem is ever used for modules, then we should be fine.
> > > > > >
> > > > > > RO instead of not present might be safer. But I do like the idea of
> > > > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP
> > > > > > and
> > > > > > making it do all of this.
> > > > >
> > > > > Yeah, doing it for everything automatically seemed like it was/is
> > > > > going to be a lot of work to debug all the corner cases where things
> > > > > expect memory to be mapped but don't explicitly say it. And in
> > > > > particular, the XPFO series only does it for user memory, whereas an
> > > > > additional flag like this would work for extra paranoid allocations
> > > > > of kernel memory too.
> > > >
> > > > I just read the code, and I looks like vmalloc() is already using
> > > > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> > > > example, we already don't have modules in the direct map.
> > > >
> > > > So I say we go for it. This should be quite simple to implement --
> > > > the pageattr code already has almost all the needed logic on x86. The
> > > > only arch support we should need is a pair of functions to remove a
> > > > vmalloc address range from the address map (if it was present in the
> > > > first place) and a function to put it back. On x86, this should only
> > > > be a few lines of code.
> > > >
> > > > What do you all think? This should solve most of the problems we have.
> > > >
> > > > If we really wanted to optimize this, we'd make it so that
> > > > module_alloc() allocates memory the normal way, then, later on, we
> > > > call some function that, all at once, removes the memory from the
> > > > direct map and applies the right permissions to the vmalloc alias (or
> > > > just makes the vmalloc alias not-present so we can add permissions
> > > > later without flushing), and flushes the TLB. And we arrange for
> > > > vunmap to zap the vmalloc range, then put the memory back into the
> > > > direct map, then free the pages back to the page allocator, with the
> > > > flush in the appropriate place.
> > > >
> > > > I don't see why the page allocator needs to know about any of this.
> > > > It's already okay with the permissions being changed out from under it
> > > > on x86, and it seems fine. Rick, do you want to give some variant of
> > > > this a try?
> > >
> > > Setting it as read-only may work (and already happens for the read-only
> > > module data). I am not sure about setting it as non-present.
> > >
> > > At some point, a discussion about a threat-model, as Rick indicated, would
> > > be required. I presume ROP attacks can easily call
> > > set_all_modules_text_rw()
> > > and override all the protections.
> >
> > I am far from an expert on exploit techniques, but here's a
> > potentially useful model: let's assume there's an attacker who can
> > write controlled data to a controlled kernel address but cannot
> > directly modify control flow. It would be nice for such an attacker
> > to have a very difficult time of modifying kernel text or of
> > compromising control flow. So we're assuming a feature like kernel
> > CET or that the attacker finds it very difficult to do something like
> > modifying some thread's IRET frame.
> >
> > Admittedly, for the kernel, this is an odd threat model, since an
> > attacker can presumably quite easily learn the kernel stack address of
> > one of their tasks, do some syscall, and then modify their kernel
> > thread's stack such that it will IRET right back to a fully controlled
> > register state with RSP pointing at an attacker-supplied kernel stack.
> > So this threat model gives very strong ROP powers. unless we have
> > either CET or some software technique to harden all the RET
> > instructions in the kernel.
> >
> > I wonder if there's a better model to use. Maybe with stack-protector
> > we get some degree of protection? Or is all of this is rather weak
> > until we have CET or a RAP-like feature.
>
> I believe that seeing the end-goal would make reasoning about patches
> easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping
> up.
>
> I’m not sure CET or other CFI would be enough even with this threat-model.
> The page-tables (the very least) need to be write-protected, as otherwise
> controlled data writes may just modify them. There are various possible
> solutions I presume: write_rare for page-tables, hypervisor-assisted
> security to obtain physical level NX/RO (a-la Microsoft VBS) or some sort of
> hardware enclave.
>
> What do you think?
I am not sure which issue you are talking about. I think there are actually two
separate issues that are merged discussions from overlap of fix for the teardown
W^X window.
For the W^X stuff I had originally imagined the protection was for when an
attacker has a limited bug that could write to a location in the module space,
but not other locations due to only having the ability to overwrite part of a
pointer or some something like that. Then the module could execute the new code
as it ran normally after finishing loading. So that is why I was wondering about
the RW window during load. Still seems generally sensible to enforce W^X though.
I like your idea about something like text_poke to load modules. I think maybe
my modules KASLR patchset could help the above somewhat too since it loads at a
freshly randomized address.
Since the issue with the freed pages before flush (the original source of this
thread) doesn't require a write bug to insert the code, but does require a way
to jump to it, its kind of the opposite model of the above. So that's why I
think they are different.
I am still learning lots on kernel exploits though, maybe Kees can provide some
better insight here?
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-07 3:06 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-07 3:06 UTC (permalink / raw)
To: keescook, luto, nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, ast, rostedt, jeyu,
linux-mm, jannh, Dock, Deneen T, peterz, kristen, akpm,
igor.stoppa, tycho, will.deacon, mingo, Keshavamurthy, Anil S
On Thu, 2018-12-06 at 15:08 -0800, Nadav Amit wrote:
> > On Dec 6, 2018, at 12:17 PM, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Thu, Dec 6, 2018 at 11:39 AM Nadav Amit <nadav.amit@gmail.com> wrote:
> > > > On Dec 6, 2018, at 11:19 AM, Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> > > > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > > > > If we are going to unmap the linear alias, why not do it at
> > > > > > > vmalloc()
> > > > > > > time rather than vfree() time?
> > > > > >
> > > > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > > > work on module data? Perhaps crypto code trying to encrypt static
> > > > > > data because our APIs don’t understand virtual addresses. I guess
> > > > > > if
> > > > > > highmem is ever used for modules, then we should be fine.
> > > > > >
> > > > > > RO instead of not present might be safer. But I do like the idea of
> > > > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP
> > > > > > and
> > > > > > making it do all of this.
> > > > >
> > > > > Yeah, doing it for everything automatically seemed like it was/is
> > > > > going to be a lot of work to debug all the corner cases where things
> > > > > expect memory to be mapped but don't explicitly say it. And in
> > > > > particular, the XPFO series only does it for user memory, whereas an
> > > > > additional flag like this would work for extra paranoid allocations
> > > > > of kernel memory too.
> > > >
> > > > I just read the code, and I looks like vmalloc() is already using
> > > > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> > > > example, we already don't have modules in the direct map.
> > > >
> > > > So I say we go for it. This should be quite simple to implement --
> > > > the pageattr code already has almost all the needed logic on x86. The
> > > > only arch support we should need is a pair of functions to remove a
> > > > vmalloc address range from the address map (if it was present in the
> > > > first place) and a function to put it back. On x86, this should only
> > > > be a few lines of code.
> > > >
> > > > What do you all think? This should solve most of the problems we have.
> > > >
> > > > If we really wanted to optimize this, we'd make it so that
> > > > module_alloc() allocates memory the normal way, then, later on, we
> > > > call some function that, all at once, removes the memory from the
> > > > direct map and applies the right permissions to the vmalloc alias (or
> > > > just makes the vmalloc alias not-present so we can add permissions
> > > > later without flushing), and flushes the TLB. And we arrange for
> > > > vunmap to zap the vmalloc range, then put the memory back into the
> > > > direct map, then free the pages back to the page allocator, with the
> > > > flush in the appropriate place.
> > > >
> > > > I don't see why the page allocator needs to know about any of this.
> > > > It's already okay with the permissions being changed out from under it
> > > > on x86, and it seems fine. Rick, do you want to give some variant of
> > > > this a try?
> > >
> > > Setting it as read-only may work (and already happens for the read-only
> > > module data). I am not sure about setting it as non-present.
> > >
> > > At some point, a discussion about a threat-model, as Rick indicated, would
> > > be required. I presume ROP attacks can easily call
> > > set_all_modules_text_rw()
> > > and override all the protections.
> >
> > I am far from an expert on exploit techniques, but here's a
> > potentially useful model: let's assume there's an attacker who can
> > write controlled data to a controlled kernel address but cannot
> > directly modify control flow. It would be nice for such an attacker
> > to have a very difficult time of modifying kernel text or of
> > compromising control flow. So we're assuming a feature like kernel
> > CET or that the attacker finds it very difficult to do something like
> > modifying some thread's IRET frame.
> >
> > Admittedly, for the kernel, this is an odd threat model, since an
> > attacker can presumably quite easily learn the kernel stack address of
> > one of their tasks, do some syscall, and then modify their kernel
> > thread's stack such that it will IRET right back to a fully controlled
> > register state with RSP pointing at an attacker-supplied kernel stack.
> > So this threat model gives very strong ROP powers. unless we have
> > either CET or some software technique to harden all the RET
> > instructions in the kernel.
> >
> > I wonder if there's a better model to use. Maybe with stack-protector
> > we get some degree of protection? Or is all of this is rather weak
> > until we have CET or a RAP-like feature.
>
> I believe that seeing the end-goal would make reasoning about patches
> easier, otherwise the complaint “but anyhow it’s all insecure” keeps popping
> up.
>
> I’m not sure CET or other CFI would be enough even with this threat-model.
> The page-tables (the very least) need to be write-protected, as otherwise
> controlled data writes may just modify them. There are various possible
> solutions I presume: write_rare for page-tables, hypervisor-assisted
> security to obtain physical level NX/RO (a-la Microsoft VBS) or some sort of
> hardware enclave.
>
> What do you think?
I am not sure which issue you are talking about. I think there are actually two
separate issues that are merged discussions from overlap of fix for the teardown
W^X window.
For the W^X stuff I had originally imagined the protection was for when an
attacker has a limited bug that could write to a location in the module space,
but not other locations due to only having the ability to overwrite part of a
pointer or some something like that. Then the module could execute the new code
as it ran normally after finishing loading. So that is why I was wondering about
the RW window during load. Still seems generally sensible to enforce W^X though.
I like your idea about something like text_poke to load modules. I think maybe
my modules KASLR patchset could help the above somewhat too since it loads at a
freshly randomized address.
Since the issue with the freed pages before flush (the original source of this
thread) doesn't require a write bug to insert the code, but does require a way
to jump to it, its kind of the opposite model of the above. So that's why I
think they are different.
I am still learning lots on kernel exploits though, maybe Kees can provide some
better insight here?
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 19:19 ` Andy Lutomirski
(?)
@ 2018-12-06 20:19 ` Edgecombe, Rick P
-1 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-06 20:19 UTC (permalink / raw)
To: luto, tycho
Cc: linux-kernel, daniel, ard.biesheuvel, ast, rostedt, jeyu,
linux-mm, jannh, nadav.amit, Dock, Deneen T, peterz, kristen,
akpm, will.deacon, mingo, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
On Thu, 2018-12-06 at 11:19 -0800, Andy Lutomirski wrote:
> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > > time rather than vfree() time?
> > >
> > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > work on module data? Perhaps crypto code trying to encrypt static
> > > data because our APIs don’t understand virtual addresses. I guess if
> > > highmem is ever used for modules, then we should be fine.
> > >
> > > RO instead of not present might be safer. But I do like the idea of
> > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > > making it do all of this.
> >
> > Yeah, doing it for everything automatically seemed like it was/is
> > going to be a lot of work to debug all the corner cases where things
> > expect memory to be mapped but don't explicitly say it. And in
> > particular, the XPFO series only does it for user memory, whereas an
> > additional flag like this would work for extra paranoid allocations
> > of kernel memory too.
> >
>
> I just read the code, and I looks like vmalloc() is already using
> highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> example, we already don't have modules in the direct map.
>
> So I say we go for it. This should be quite simple to implement --
> the pageattr code already has almost all the needed logic on x86. The
> only arch support we should need is a pair of functions to remove a
> vmalloc address range from the address map (if it was present in the
> first place) and a function to put it back. On x86, this should only
> be a few lines of code.
>
> What do you all think? This should solve most of the problems we have.
>
> If we really wanted to optimize this, we'd make it so that
> module_alloc() allocates memory the normal way, then, later on, we
> call some function that, all at once, removes the memory from the
> direct map and applies the right permissions to the vmalloc alias (or
> just makes the vmalloc alias not-present so we can add permissions
> later without flushing), and flushes the TLB. And we arrange for
> vunmap to zap the vmalloc range, then put the memory back into the
> direct map, then free the pages back to the page allocator, with the
> flush in the appropriate place.
>
> I don't see why the page allocator needs to know about any of this.
> It's already okay with the permissions being changed out from under it
> on x86, and it seems fine. Rick, do you want to give some variant of
> this a try?
Hi,
Sorry, I've been having email troubles today.
I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set
NP/RO in the directmap, so it would be sort of inconsistent whether the
directmap of vmalloc range allocations were readable or not. I couldn't see any
places where it would cause problems today though.
I was ready to assume that all TLBs don't cache NP, because I don't know how
usages where a page fault is used to load something could work without lots of
flushes. If that's the case, then all archs with directmap permissions could
share a single vmalloc special permission flush implementation that works like
Andy described originally. It could be controlled with an
ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and
set_pages_rw on any archs with directmap permissions. So seems simpler to me
(and what I have been doing) unless I'm missing the problem.
If you all think so I can indeed take a shot at it, I just don't see what the
problem was with the original solution, that seems less likely to break
anything.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 20:19 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-06 20:19 UTC (permalink / raw)
To: luto, tycho
Cc: linux-kernel, daniel, ard.biesheuvel, ast, rostedt, jeyu,
linux-mm, jannh, nadav.amit, Dock, Deneen T, peterz, kristen,
akpm, will.deacon, mingo, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
On Thu, 2018-12-06 at 11:19 -0800, Andy Lutomirski wrote:
> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > > time rather than vfree() time?
> > >
> > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > work on module data? Perhaps crypto code trying to encrypt static
> > > data because our APIs don’t understand virtual addresses. I guess if
> > > highmem is ever used for modules, then we should be fine.
> > >
> > > RO instead of not present might be safer. But I do like the idea of
> > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > > making it do all of this.
> >
> > Yeah, doing it for everything automatically seemed like it was/is
> > going to be a lot of work to debug all the corner cases where things
> > expect memory to be mapped but don't explicitly say it. And in
> > particular, the XPFO series only does it for user memory, whereas an
> > additional flag like this would work for extra paranoid allocations
> > of kernel memory too.
> >
>
> I just read the code, and I looks like vmalloc() is already using
> highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> example, we already don't have modules in the direct map.
>
> So I say we go for it. This should be quite simple to implement --
> the pageattr code already has almost all the needed logic on x86. The
> only arch support we should need is a pair of functions to remove a
> vmalloc address range from the address map (if it was present in the
> first place) and a function to put it back. On x86, this should only
> be a few lines of code.
>
> What do you all think? This should solve most of the problems we have.
>
> If we really wanted to optimize this, we'd make it so that
> module_alloc() allocates memory the normal way, then, later on, we
> call some function that, all at once, removes the memory from the
> direct map and applies the right permissions to the vmalloc alias (or
> just makes the vmalloc alias not-present so we can add permissions
> later without flushing), and flushes the TLB. And we arrange for
> vunmap to zap the vmalloc range, then put the memory back into the
> direct map, then free the pages back to the page allocator, with the
> flush in the appropriate place.
>
> I don't see why the page allocator needs to know about any of this.
> It's already okay with the permissions being changed out from under it
> on x86, and it seems fine. Rick, do you want to give some variant of
> this a try?
Hi,
Sorry, I've been having email troubles today.
I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set
NP/RO in the directmap, so it would be sort of inconsistent whether the
directmap of vmalloc range allocations were readable or not. I couldn't see any
places where it would cause problems today though.
I was ready to assume that all TLBs don't cache NP, because I don't know how
usages where a page fault is used to load something could work without lots of
flushes. If that's the case, then all archs with directmap permissions could
share a single vmalloc special permission flush implementation that works like
Andy described originally. It could be controlled with an
ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and
set_pages_rw on any archs with directmap permissions. So seems simpler to me
(and what I have been doing) unless I'm missing the problem.
If you all think so I can indeed take a shot at it, I just don't see what the
problem was with the original solution, that seems less likely to break
anything.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 20:19 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-06 20:19 UTC (permalink / raw)
To: luto, tycho
Cc: linux-kernel, daniel, ard.biesheuvel, ast, rostedt, jeyu,
linux-mm, jannh, nadav.amit, Dock, Deneen T, peterz, kristen,
akpm, will.deacon, mingo, Keshavamurthy, Anil S,
On Thu, 2018-12-06 at 11:19 -0800, Andy Lutomirski wrote:
> On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > > time rather than vfree() time?
> > >
> > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > work on module data? Perhaps crypto code trying to encrypt static
> > > data because our APIs don’t understand virtual addresses. I guess if
> > > highmem is ever used for modules, then we should be fine.
> > >
> > > RO instead of not present might be safer. But I do like the idea of
> > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > > making it do all of this.
> >
> > Yeah, doing it for everything automatically seemed like it was/is
> > going to be a lot of work to debug all the corner cases where things
> > expect memory to be mapped but don't explicitly say it. And in
> > particular, the XPFO series only does it for user memory, whereas an
> > additional flag like this would work for extra paranoid allocations
> > of kernel memory too.
> >
>
> I just read the code, and I looks like vmalloc() is already using
> highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> example, we already don't have modules in the direct map.
>
> So I say we go for it. This should be quite simple to implement --
> the pageattr code already has almost all the needed logic on x86. The
> only arch support we should need is a pair of functions to remove a
> vmalloc address range from the address map (if it was present in the
> first place) and a function to put it back. On x86, this should only
> be a few lines of code.
>
> What do you all think? This should solve most of the problems we have.
>
> If we really wanted to optimize this, we'd make it so that
> module_alloc() allocates memory the normal way, then, later on, we
> call some function that, all at once, removes the memory from the
> direct map and applies the right permissions to the vmalloc alias (or
> just makes the vmalloc alias not-present so we can add permissions
> later without flushing), and flushes the TLB. And we arrange for
> vunmap to zap the vmalloc range, then put the memory back into the
> direct map, then free the pages back to the page allocator, with the
> flush in the appropriate place.
>
> I don't see why the page allocator needs to know about any of this.
> It's already okay with the permissions being changed out from under it
> on x86, and it seems fine. Rick, do you want to give some variant of
> this a try?
Hi,
Sorry, I've been having email troubles today.
I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set
NP/RO in the directmap, so it would be sort of inconsistent whether the
directmap of vmalloc range allocations were readable or not. I couldn't see any
places where it would cause problems today though.
I was ready to assume that all TLBs don't cache NP, because I don't know how
usages where a page fault is used to load something could work without lots of
flushes. If that's the case, then all archs with directmap permissions could
share a single vmalloc special permission flush implementation that works like
Andy described originally. It could be controlled with an
ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and
set_pages_rw on any archs with directmap permissions. So seems simpler to me
(and what I have been doing) unless I'm missing the problem.
If you all think so I can indeed take a shot at it, I just don't see what the
problem was with the original solution, that seems less likely to break
anything.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 20:19 ` Edgecombe, Rick P
@ 2018-12-06 20:26 ` Andy Lutomirski
-1 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 20:26 UTC (permalink / raw)
To: Rick Edgecombe
Cc: Andrew Lutomirski, Tycho Andersen, LKML, Daniel Borkmann,
Ard Biesheuvel, Alexei Starovoitov, Steven Rostedt, Jessica Yu,
Linux-MM, Jann Horn, Nadav Amit, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Will Deacon, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masami Hiramatsu,
Naveen N . Rao, David S. Miller, Network Development,
Dave Hansen
On Thu, Dec 6, 2018 at 12:20 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Thu, 2018-12-06 at 11:19 -0800, Andy Lutomirski wrote:
> > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > >
> > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > > > time rather than vfree() time?
> > > >
> > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > work on module data? Perhaps crypto code trying to encrypt static
> > > > data because our APIs don’t understand virtual addresses. I guess if
> > > > highmem is ever used for modules, then we should be fine.
> > > >
> > > > RO instead of not present might be safer. But I do like the idea of
> > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > > > making it do all of this.
> > >
> > > Yeah, doing it for everything automatically seemed like it was/is
> > > going to be a lot of work to debug all the corner cases where things
> > > expect memory to be mapped but don't explicitly say it. And in
> > > particular, the XPFO series only does it for user memory, whereas an
> > > additional flag like this would work for extra paranoid allocations
> > > of kernel memory too.
> > >
> >
> > I just read the code, and I looks like vmalloc() is already using
> > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> > example, we already don't have modules in the direct map.
> >
> > So I say we go for it. This should be quite simple to implement --
> > the pageattr code already has almost all the needed logic on x86. The
> > only arch support we should need is a pair of functions to remove a
> > vmalloc address range from the address map (if it was present in the
> > first place) and a function to put it back. On x86, this should only
> > be a few lines of code.
> >
> > What do you all think? This should solve most of the problems we have.
> >
> > If we really wanted to optimize this, we'd make it so that
> > module_alloc() allocates memory the normal way, then, later on, we
> > call some function that, all at once, removes the memory from the
> > direct map and applies the right permissions to the vmalloc alias (or
> > just makes the vmalloc alias not-present so we can add permissions
> > later without flushing), and flushes the TLB. And we arrange for
> > vunmap to zap the vmalloc range, then put the memory back into the
> > direct map, then free the pages back to the page allocator, with the
> > flush in the appropriate place.
> >
> > I don't see why the page allocator needs to know about any of this.
> > It's already okay with the permissions being changed out from under it
> > on x86, and it seems fine. Rick, do you want to give some variant of
> > this a try?
> Hi,
>
> Sorry, I've been having email troubles today.
>
> I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set
> NP/RO in the directmap, so it would be sort of inconsistent whether the
> directmap of vmalloc range allocations were readable or not. I couldn't see any
> places where it would cause problems today though.
>
> I was ready to assume that all TLBs don't cache NP, because I don't know how
> usages where a page fault is used to load something could work without lots of
> flushes.
Or the architecture just fixes up the spurious faults, I suppose. I'm
only well-educated on the x86 mmu.
> If that's the case, then all archs with directmap permissions could
> share a single vmalloc special permission flush implementation that works like
> Andy described originally. It could be controlled with an
> ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and
> set_pages_rw on any archs with directmap permissions. So seems simpler to me
> (and what I have been doing) unless I'm missing the problem.
Hmm. The only reason I've proposed anything fancier was because I was
thinking of minimizing flushes, but I think I'm being silly. This
sequence ought to work optimally:
- vmalloc(..., VM_HAS_DIRECT_MAP_PERMS); /* no flushes */
- Write some data, via vmalloc's return address.
- Use some set_memory_whatever() functions to update permissions,
which will flush, hopefully just once.
- Run the module code!
- vunmap -- this will do a single flush that will fix everything.
This does require that set_pages_np() or set_memory_np() or whatever
exists and that it's safe to do that, then flush, and then
set_pages_rw(). So maybe you want set_pages_np_noflush() and
set_pages_rw_noflush() to make it totally clear what's supposed to
happen.
--Andy
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 20:26 ` Andy Lutomirski
0 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 20:26 UTC (permalink / raw)
To: Rick Edgecombe
Cc: Andrew Lutomirski, Tycho Andersen, LKML, Daniel Borkmann,
Ard Biesheuvel, Alexei Starovoitov, Steven Rostedt, Jessica Yu,
Linux-MM, Jann Horn, Nadav Amit, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Will Deacon, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masam
On Thu, Dec 6, 2018 at 12:20 PM Edgecombe, Rick P
<rick.p.edgecombe@intel.com> wrote:
>
> On Thu, 2018-12-06 at 11:19 -0800, Andy Lutomirski wrote:
> > On Thu, Dec 6, 2018 at 11:01 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > >
> > > On Thu, Dec 06, 2018 at 10:53:50AM -0800, Andy Lutomirski wrote:
> > > > > If we are going to unmap the linear alias, why not do it at vmalloc()
> > > > > time rather than vfree() time?
> > > >
> > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > work on module data? Perhaps crypto code trying to encrypt static
> > > > data because our APIs don’t understand virtual addresses. I guess if
> > > > highmem is ever used for modules, then we should be fine.
> > > >
> > > > RO instead of not present might be safer. But I do like the idea of
> > > > renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> > > > making it do all of this.
> > >
> > > Yeah, doing it for everything automatically seemed like it was/is
> > > going to be a lot of work to debug all the corner cases where things
> > > expect memory to be mapped but don't explicitly say it. And in
> > > particular, the XPFO series only does it for user memory, whereas an
> > > additional flag like this would work for extra paranoid allocations
> > > of kernel memory too.
> > >
> >
> > I just read the code, and I looks like vmalloc() is already using
> > highmem (__GFP_HIGH) if available, so, on big x86_32 systems, for
> > example, we already don't have modules in the direct map.
> >
> > So I say we go for it. This should be quite simple to implement --
> > the pageattr code already has almost all the needed logic on x86. The
> > only arch support we should need is a pair of functions to remove a
> > vmalloc address range from the address map (if it was present in the
> > first place) and a function to put it back. On x86, this should only
> > be a few lines of code.
> >
> > What do you all think? This should solve most of the problems we have.
> >
> > If we really wanted to optimize this, we'd make it so that
> > module_alloc() allocates memory the normal way, then, later on, we
> > call some function that, all at once, removes the memory from the
> > direct map and applies the right permissions to the vmalloc alias (or
> > just makes the vmalloc alias not-present so we can add permissions
> > later without flushing), and flushes the TLB. And we arrange for
> > vunmap to zap the vmalloc range, then put the memory back into the
> > direct map, then free the pages back to the page allocator, with the
> > flush in the appropriate place.
> >
> > I don't see why the page allocator needs to know about any of this.
> > It's already okay with the permissions being changed out from under it
> > on x86, and it seems fine. Rick, do you want to give some variant of
> > this a try?
> Hi,
>
> Sorry, I've been having email troubles today.
>
> I found some cases where vmap with PAGE_KERNEL_RO happens, which would not set
> NP/RO in the directmap, so it would be sort of inconsistent whether the
> directmap of vmalloc range allocations were readable or not. I couldn't see any
> places where it would cause problems today though.
>
> I was ready to assume that all TLBs don't cache NP, because I don't know how
> usages where a page fault is used to load something could work without lots of
> flushes.
Or the architecture just fixes up the spurious faults, I suppose. I'm
only well-educated on the x86 mmu.
> If that's the case, then all archs with directmap permissions could
> share a single vmalloc special permission flush implementation that works like
> Andy described originally. It could be controlled with an
> ARCH_HAS_DIRECT_MAP_PERMS. We would just need something like set_pages_np and
> set_pages_rw on any archs with directmap permissions. So seems simpler to me
> (and what I have been doing) unless I'm missing the problem.
Hmm. The only reason I've proposed anything fancier was because I was
thinking of minimizing flushes, but I think I'm being silly. This
sequence ought to work optimally:
- vmalloc(..., VM_HAS_DIRECT_MAP_PERMS); /* no flushes */
- Write some data, via vmalloc's return address.
- Use some set_memory_whatever() functions to update permissions,
which will flush, hopefully just once.
- Run the module code!
- vunmap -- this will do a single flush that will fix everything.
This does require that set_pages_np() or set_memory_np() or whatever
exists and that it's safe to do that, then flush, and then
set_pages_rw(). So maybe you want set_pages_np_noflush() and
set_pages_rw_noflush() to make it totally clear what's supposed to
happen.
--Andy
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 18:53 ` Andy Lutomirski
@ 2018-12-06 19:04 ` Ard Biesheuvel
-1 siblings, 0 replies; 117+ messages in thread
From: Ard Biesheuvel @ 2018-12-06 19:04 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Will Deacon, Rick Edgecombe, Nadav Amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller, <netdev@vger.kernel.org>,
Dave Hansen
On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
>
> > On Dec 5, 2018, at 11:29 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >
> >> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >>> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
> >>>
> >>>> On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> >>>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> >>>> <rick.p.edgecombe@intel.com> wrote:
> >>>>>
> >>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> >>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> >>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the underlying
> >>>>>>>> pages,
> >>>>>>>> it often leaves stale TLB entries to freed pages that could get re-used.
> >>>>>>>> This is
> >>>>>>>> undesirable for cases where the memory being freed has special permissions
> >>>>>>>> such
> >>>>>>>> as executable.
> >>>>>>>
> >>>>>>> So I am trying to finish my patch-set for preventing transient W+X mappings
> >>>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks again
> >>>>>>> for
> >>>>>>> pointing it out).
> >>>>>>>
> >>>>>>> But all of the sudden, I don’t understand why we have the problem that this
> >>>>>>> (your) patch-set deals with at all. We already change the mappings to make
> >>>>>>> the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > itable before freeing the memory, so why can’t we make it
> >>>>>>> non-executable at the same time? Actually, why do we make the module memory,
> >>>>>>> including its data executable before freeing it???
> >>>>>>
> >>>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
> >>>>>> of the various different configurations and hysterical raisins. We can't
> >>>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> >>>>>> can we rely on disable_ro_nx() being available at build time.
> >>>>>>
> >>>>>> If we *could* rely on module allocations always using vmalloc(), then
> >>>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
> >>>>>> afaict -- who cares about the memory attributes of a mapping that's about
> >>>>>> to disappear anyway?
> >>>>>>
> >>>>>> Is it just nios2 that does something different?
> >>>>>>
> >>>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> >>>>> solve it as well, in fact that was what I first thought the solution should be
> >>>>> until this was suggested. It's interesting that from the other thread Masami
> >>>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
> >>>>> inadvertently blocked this on x86. But, on the other architectures I have since
> >>>>> learned it is a bit different.
> >>>>>
> >>>>> It looks like actually most arch's don't re-define set_memory_*, and so all of
> >>>>> the frob_* functions are actually just noops. In which case allocating RWX is
> >>>>> needed to make it work at all, because that is what the allocation is going to
> >>>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
> >>>>> nothing.
> >>>>>
> >>>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> >>>>> changing of the permissions on the directmap as well. You don't want some other
> >>>>> caller getting a page that was left RO when freed and then trying to write to
> >>>>> it, if I understand this.
> >>>>>
> >>>>
> >>>> Exactly.
> >>>
> >>> Of course, I forgot about the linear mapping. On arm64, we've just queued
> >>> support for reflecting changes to read-only permissions in the linear map
> >>> [1]. So, whilst the linear map is always non-executable, we will need to
> >>> make parts of it writable again when freeing the module.
> >>>
> >>>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> >>>> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> >>>> but it would also call some arch hooks to put back the direct map
> >>>> permissions before the flush. Does that seem reasonable? It would
> >>>> need to be hooked up that implement set_memory_ro(), but that should
> >>>> be quite easy. If nothing else, it could fall back to set_memory_ro()
> >>>> in the absence of a better implementation.
> >>>
> >>> You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> >>> would open up a window where the vmap mapping is executable and the linear
> >>> mapping is writable, which is a bit rubbish.
> >>>
> >>
> >> Right, and Rick pointed out the same issue. Instead, we should set
> >> the direct map not-present or its ARM equivalent, then do the flush,
> >> then make it RW. I assume this also works on arm and arm64, although
> >> I don't know for sure. On x86, the CPU won't cache not-present PTEs.
> >
> > If we are going to unmap the linear alias, why not do it at vmalloc()
> > time rather than vfree() time?
>
> That’s not totally nuts. Do we ever have code that expects __va() to
> work on module data? Perhaps crypto code trying to encrypt static
> data because our APIs don’t understand virtual addresses. I guess if
> highmem is ever used for modules, then we should be fine.
>
The crypto code shouldn't care, but I think it will probably break hibernate :-(
> RO instead of not present might be safer. But I do like the idea of
> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> making it do all of this.
>
> (It seems like some people call it the linear map and some people call
> it the direct map. Is there any preference?)
Either is fine with me.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 19:04 ` Ard Biesheuvel
0 siblings, 0 replies; 117+ messages in thread
From: Ard Biesheuvel @ 2018-12-06 19:04 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Will Deacon, Rick Edgecombe, Nadav Amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller
On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
>
> > On Dec 5, 2018, at 11:29 PM, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> >
> >> On Thu, 6 Dec 2018 at 00:16, Andy Lutomirski <luto@kernel.org> wrote:
> >>
> >>> On Wed, Dec 5, 2018 at 3:41 AM Will Deacon <will.deacon@arm.com> wrote:
> >>>
> >>>> On Tue, Dec 04, 2018 at 12:09:49PM -0800, Andy Lutomirski wrote:
> >>>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
> >>>> <rick.p.edgecombe@intel.com> wrote:
> >>>>>
> >>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> >>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> >>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the underlying
> >>>>>>>> pages,
> >>>>>>>> it often leaves stale TLB entries to freed pages that could get re-used.
> >>>>>>>> This is
> >>>>>>>> undesirable for cases where the memory being freed has special permissions
> >>>>>>>> such
> >>>>>>>> as executable.
> >>>>>>>
> >>>>>>> So I am trying to finish my patch-set for preventing transient W+X mappings
> >>>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks again
> >>>>>>> for
> >>>>>>> pointing it out).
> >>>>>>>
> >>>>>>> But all of the sudden, I don’t understand why we have the problem that this
> >>>>>>> (your) patch-set deals with at all. We already change the mappings to make
> >>>>>>> the memory wrAcked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> > itable before freeing the memory, so why can’t we make it
> >>>>>>> non-executable at the same time? Actually, why do we make the module memory,
> >>>>>>> including its data executable before freeing it???
> >>>>>>
> >>>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
> >>>>>> of the various different configurations and hysterical raisins. We can't
> >>>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> >>>>>> can we rely on disable_ro_nx() being available at build time.
> >>>>>>
> >>>>>> If we *could* rely on module allocations always using vmalloc(), then
> >>>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
> >>>>>> afaict -- who cares about the memory attributes of a mapping that's about
> >>>>>> to disappear anyway?
> >>>>>>
> >>>>>> Is it just nios2 that does something different?
> >>>>>>
> >>>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> >>>>> solve it as well, in fact that was what I first thought the solution should be
> >>>>> until this was suggested. It's interesting that from the other thread Masami
> >>>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
> >>>>> inadvertently blocked this on x86. But, on the other architectures I have since
> >>>>> learned it is a bit different.
> >>>>>
> >>>>> It looks like actually most arch's don't re-define set_memory_*, and so all of
> >>>>> the frob_* functions are actually just noops. In which case allocating RWX is
> >>>>> needed to make it work at all, because that is what the allocation is going to
> >>>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
> >>>>> nothing.
> >>>>>
> >>>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> >>>>> changing of the permissions on the directmap as well. You don't want some other
> >>>>> caller getting a page that was left RO when freed and then trying to write to
> >>>>> it, if I understand this.
> >>>>>
> >>>>
> >>>> Exactly.
> >>>
> >>> Of course, I forgot about the linear mapping. On arm64, we've just queued
> >>> support for reflecting changes to read-only permissions in the linear map
> >>> [1]. So, whilst the linear map is always non-executable, we will need to
> >>> make parts of it writable again when freeing the module.
> >>>
> >>>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
> >>>> VM_MAY_ADJUST_PERMS or similar. It would have the semantics you want,
> >>>> but it would also call some arch hooks to put back the direct map
> >>>> permissions before the flush. Does that seem reasonable? It would
> >>>> need to be hooked up that implement set_memory_ro(), but that should
> >>>> be quite easy. If nothing else, it could fall back to set_memory_ro()
> >>>> in the absence of a better implementation.
> >>>
> >>> You mean set_memory_rw() here, right? Although, eliding the TLB invalidation
> >>> would open up a window where the vmap mapping is executable and the linear
> >>> mapping is writable, which is a bit rubbish.
> >>>
> >>
> >> Right, and Rick pointed out the same issue. Instead, we should set
> >> the direct map not-present or its ARM equivalent, then do the flush,
> >> then make it RW. I assume this also works on arm and arm64, although
> >> I don't know for sure. On x86, the CPU won't cache not-present PTEs.
> >
> > If we are going to unmap the linear alias, why not do it at vmalloc()
> > time rather than vfree() time?
>
> That’s not totally nuts. Do we ever have code that expects __va() to
> work on module data? Perhaps crypto code trying to encrypt static
> data because our APIs don’t understand virtual addresses. I guess if
> highmem is ever used for modules, then we should be fine.
>
The crypto code shouldn't care, but I think it will probably break hibernate :-(
> RO instead of not present might be safer. But I do like the idea of
> renaming Rick's flag to something like VM_XPFO or VM_NO_DIRECT_MAP and
> making it do all of this.
>
> (It seems like some people call it the linear map and some people call
> it the direct map. Is there any preference?)
Either is fine with me.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 19:04 ` Ard Biesheuvel
@ 2018-12-06 19:20 ` Andy Lutomirski
-1 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 19:20 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Andrew Lutomirski, Will Deacon, Rick Edgecombe, Nadav Amit, LKML,
Daniel Borkmann, Jessica Yu, Steven Rostedt, Alexei Starovoitov,
Linux-MM, Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masami Hiramatsu,
Naveen N . Rao, David S. Miller, Network Development,
Dave Hansen
On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > That’s not totally nuts. Do we ever have code that expects __va() to
> > work on module data? Perhaps crypto code trying to encrypt static
> > data because our APIs don’t understand virtual addresses. I guess if
> > highmem is ever used for modules, then we should be fine.
> >
>
> The crypto code shouldn't care, but I think it will probably break hibernate :-(
How so? Hibernate works (or at least should work) on x86 PAE, where
__va doesn't work on module data, and, on x86, the direct map has some
RO parts with where the module is, so hibernate can't be writing to
the memory through the direct map with its final permissions.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 19:20 ` Andy Lutomirski
0 siblings, 0 replies; 117+ messages in thread
From: Andy Lutomirski @ 2018-12-06 19:20 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Andrew Lutomirski, Will Deacon, Rick Edgecombe, Nadav Amit, LKML,
Daniel Borkmann, Jessica Yu, Steven Rostedt, Alexei Starovoitov,
Linux-MM, Jann Horn, Dock, Deneen T, Peter Zijlstra,
Kristen Carlson Accardi, Andrew Morton, Ingo Molnar,
Anil S Keshavamurthy, Kernel Hardening, Masami Hiramatsu
On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > That’s not totally nuts. Do we ever have code that expects __va() to
> > work on module data? Perhaps crypto code trying to encrypt static
> > data because our APIs don’t understand virtual addresses. I guess if
> > highmem is ever used for modules, then we should be fine.
> >
>
> The crypto code shouldn't care, but I think it will probably break hibernate :-(
How so? Hibernate works (or at least should work) on x86 PAE, where
__va doesn't work on module data, and, on x86, the direct map has some
RO parts with where the module is, so hibernate can't be writing to
the memory through the direct map with its final permissions.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 19:20 ` Andy Lutomirski
@ 2018-12-06 19:23 ` Ard Biesheuvel
-1 siblings, 0 replies; 117+ messages in thread
From: Ard Biesheuvel @ 2018-12-06 19:23 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Will Deacon, Rick Edgecombe, Nadav Amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller, <netdev@vger.kernel.org>,
Dave Hansen
On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
> > >
>
> > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > work on module data? Perhaps crypto code trying to encrypt static
> > > data because our APIs don’t understand virtual addresses. I guess if
> > > highmem is ever used for modules, then we should be fine.
> > >
> >
> > The crypto code shouldn't care, but I think it will probably break hibernate :-(
>
> How so? Hibernate works (or at least should work) on x86 PAE, where
> __va doesn't work on module data, and, on x86, the direct map has some
> RO parts with where the module is, so hibernate can't be writing to
> the memory through the direct map with its final permissions.
On arm64 at least, hibernate reads the contents of memory via the
linear mapping. Not sure about other arches.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 19:23 ` Ard Biesheuvel
0 siblings, 0 replies; 117+ messages in thread
From: Ard Biesheuvel @ 2018-12-06 19:23 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Will Deacon, Rick Edgecombe, Nadav Amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller
On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski <luto@kernel.org> wrote:
>
> On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
> > >
>
> > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > work on module data? Perhaps crypto code trying to encrypt static
> > > data because our APIs don’t understand virtual addresses. I guess if
> > > highmem is ever used for modules, then we should be fine.
> > >
> >
> > The crypto code shouldn't care, but I think it will probably break hibernate :-(
>
> How so? Hibernate works (or at least should work) on x86 PAE, where
> __va doesn't work on module data, and, on x86, the direct map has some
> RO parts with where the module is, so hibernate can't be writing to
> the memory through the direct map with its final permissions.
On arm64 at least, hibernate reads the contents of memory via the
linear mapping. Not sure about other arches.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 19:23 ` Ard Biesheuvel
@ 2018-12-06 19:31 ` Will Deacon
-1 siblings, 0 replies; 117+ messages in thread
From: Will Deacon @ 2018-12-06 19:31 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Andy Lutomirski, Rick Edgecombe, Nadav Amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller, <netdev@vger.kernel.org>,
Dave Hansen
On Thu, Dec 06, 2018 at 08:23:20PM +0100, Ard Biesheuvel wrote:
> On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> > >
> > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> >
> > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > work on module data? Perhaps crypto code trying to encrypt static
> > > > data because our APIs don’t understand virtual addresses. I guess if
> > > > highmem is ever used for modules, then we should be fine.
> > > >
> > >
> > > The crypto code shouldn't care, but I think it will probably break hibernate :-(
> >
> > How so? Hibernate works (or at least should work) on x86 PAE, where
> > __va doesn't work on module data, and, on x86, the direct map has some
> > RO parts with where the module is, so hibernate can't be writing to
> > the memory through the direct map with its final permissions.
>
> On arm64 at least, hibernate reads the contents of memory via the
> linear mapping. Not sure about other arches.
Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from
the pte when we see that it's PROT_NONE?
Will
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 19:31 ` Will Deacon
0 siblings, 0 replies; 117+ messages in thread
From: Will Deacon @ 2018-12-06 19:31 UTC (permalink / raw)
To: Ard Biesheuvel
Cc: Andy Lutomirski, Rick Edgecombe, Nadav Amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller
On Thu, Dec 06, 2018 at 08:23:20PM +0100, Ard Biesheuvel wrote:
> On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski <luto@kernel.org> wrote:
> >
> > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> > >
> > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
> > > >
> >
> > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > work on module data? Perhaps crypto code trying to encrypt static
> > > > data because our APIs don’t understand virtual addresses. I guess if
> > > > highmem is ever used for modules, then we should be fine.
> > > >
> > >
> > > The crypto code shouldn't care, but I think it will probably break hibernate :-(
> >
> > How so? Hibernate works (or at least should work) on x86 PAE, where
> > __va doesn't work on module data, and, on x86, the direct map has some
> > RO parts with where the module is, so hibernate can't be writing to
> > the memory through the direct map with its final permissions.
>
> On arm64 at least, hibernate reads the contents of memory via the
> linear mapping. Not sure about other arches.
Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from
the pte when we see that it's PROT_NONE?
Will
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-06 19:31 ` Will Deacon
@ 2018-12-06 19:36 ` Ard Biesheuvel
-1 siblings, 0 replies; 117+ messages in thread
From: Ard Biesheuvel @ 2018-12-06 19:36 UTC (permalink / raw)
To: Will Deacon
Cc: Andy Lutomirski, Rick Edgecombe, Nadav Amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller, <netdev@vger.kernel.org>,
Dave Hansen
On Thu, 6 Dec 2018 at 20:30, Will Deacon <will.deacon@arm.com> wrote:
>
> On Thu, Dec 06, 2018 at 08:23:20PM +0100, Ard Biesheuvel wrote:
> > On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
> > > <ard.biesheuvel@linaro.org> wrote:
> > > >
> > > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > >
> > > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > > work on module data? Perhaps crypto code trying to encrypt static
> > > > > data because our APIs don’t understand virtual addresses. I guess if
> > > > > highmem is ever used for modules, then we should be fine.
> > > > >
> > > >
> > > > The crypto code shouldn't care, but I think it will probably break hibernate :-(
> > >
> > > How so? Hibernate works (or at least should work) on x86 PAE, where
> > > __va doesn't work on module data, and, on x86, the direct map has some
> > > RO parts with where the module is, so hibernate can't be writing to
> > > the memory through the direct map with its final permissions.
> >
> > On arm64 at least, hibernate reads the contents of memory via the
> > linear mapping. Not sure about other arches.
>
> Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from
> the pte when we see that it's PROT_NONE?
>
As long as we can easily figure out whether a certain linear address
is mapped or not, having a special case like that for these mappings
doesn't sound unreasonable.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-06 19:36 ` Ard Biesheuvel
0 siblings, 0 replies; 117+ messages in thread
From: Ard Biesheuvel @ 2018-12-06 19:36 UTC (permalink / raw)
To: Will Deacon
Cc: Andy Lutomirski, Rick Edgecombe, Nadav Amit,
Linux Kernel Mailing List, Daniel Borkmann, Jessica Yu,
Steven Rostedt, Alexei Starovoitov, Linux-MM, Jann Horn, Dock,
Deneen T, Peter Zijlstra, kristen, Andrew Morton, Ingo Molnar,
anil.s.keshavamurthy, Kernel Hardening, Masami Hiramatsu,
naveen.n.rao, David S. Miller
On Thu, 6 Dec 2018 at 20:30, Will Deacon <will.deacon@arm.com> wrote:
>
> On Thu, Dec 06, 2018 at 08:23:20PM +0100, Ard Biesheuvel wrote:
> > On Thu, 6 Dec 2018 at 20:21, Andy Lutomirski <luto@kernel.org> wrote:
> > >
> > > On Thu, Dec 6, 2018 at 11:04 AM Ard Biesheuvel
> > > <ard.biesheuvel@linaro.org> wrote:
> > > >
> > > > On Thu, 6 Dec 2018 at 19:54, Andy Lutomirski <luto@kernel.org> wrote:
> > > > >
> > >
> > > > > That’s not totally nuts. Do we ever have code that expects __va() to
> > > > > work on module data? Perhaps crypto code trying to encrypt static
> > > > > data because our APIs don’t understand virtual addresses. I guess if
> > > > > highmem is ever used for modules, then we should be fine.
> > > > >
> > > >
> > > > The crypto code shouldn't care, but I think it will probably break hibernate :-(
> > >
> > > How so? Hibernate works (or at least should work) on x86 PAE, where
> > > __va doesn't work on module data, and, on x86, the direct map has some
> > > RO parts with where the module is, so hibernate can't be writing to
> > > the memory through the direct map with its final permissions.
> >
> > On arm64 at least, hibernate reads the contents of memory via the
> > linear mapping. Not sure about other arches.
>
> Can we handle this like the DEBUG_PAGEALLOC case, and extract the pfn from
> the pte when we see that it's PROT_NONE?
>
As long as we can easily figure out whether a certain linear address
is mapped or not, having a special case like that for these mappings
doesn't sound unreasonable.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-04 20:02 ` Edgecombe, Rick P
(?)
@ 2018-12-04 20:36 ` Nadav Amit
-1 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-04 20:36 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: will.deacon, linux-kernel, daniel, jeyu, rostedt, ast,
ard.biesheuvel, linux-mm, jannh, Dock, Deneen T, peterz, kristen,
akpm, mingo, luto, Keshavamurthy, Anil S, kernel-hardening,
mhiramat, naveen.n.rao, davem, netdev, Hansen, Dave
> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
>>>> wrote:
>>>>
>>>> Since vfree will lazily flush the TLB, but not lazily free the underlying
>>>> pages,
>>>> it often leaves stale TLB entries to freed pages that could get re-used.
>>>> This is
>>>> undesirable for cases where the memory being freed has special permissions
>>>> such
>>>> as executable.
>>>
>>> So I am trying to finish my patch-set for preventing transient W+X mappings
>>> from taking space, by handling kprobes & ftrace that I missed (thanks again
>>> for
>>> pointing it out).
>>>
>>> But all of the sudden, I don’t understand why we have the problem that this
>>> (your) patch-set deals with at all. We already change the mappings to make
>>> the memory writable before freeing the memory, so why can’t we make it
>>> non-executable at the same time? Actually, why do we make the module memory,
>>> including its data executable before freeing it???
>>
>> Yeah, this is really confusing, but I have a suspicion it's a combination
>> of the various different configurations and hysterical raisins. We can't
>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>> can we rely on disable_ro_nx() being available at build time.
>>
>> If we *could* rely on module allocations always using vmalloc(), then
>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>> afaict -- who cares about the memory attributes of a mapping that's about
>> to disappear anyway?
>>
>> Is it just nios2 that does something different?
>>
>> Will
>
> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> solve it as well, in fact that was what I first thought the solution should be
> until this was suggested. It's interesting that from the other thread Masami
> Hiramatsu referenced, set_memory_nx was suggested last year and would have
> inadvertently blocked this on x86. But, on the other architectures I have since
> learned it is a bit different.
>
> It looks like actually most arch's don't re-define set_memory_*, and so all of
> the frob_* functions are actually just noops. In which case allocating RWX is
> needed to make it work at all, because that is what the allocation is going to
> stay at. So in these archs, set_memory_nx won't solve it because it will do
> nothing.
>
> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> changing of the permissions on the directmap as well. You don't want some other
> caller getting a page that was left RO when freed and then trying to write to
> it, if I understand this.
>
> The other reasoning was that calling set_memory_nx isn't doing what we are
> actually trying to do which is prevent the pages from getting released too
> early.
>
> A more clear solution for all of this might involve refactoring some of the
> set_memory_ de-allocation logic out into __weak functions in either modules or
> vmalloc. As Jessica points out in the other thread though, modules does a lot
> more stuff there than the other module_alloc callers. I think it may take some
> thought to centralize AND make it optimal for every module_alloc/vmalloc_exec
> user and arch.
>
> But for now with the change in vmalloc, we can block the executable mapping
> freed page re-use issue in a cross platform way.
Please understand me correctly - I didn’t mean that your patches are not
needed.
All I did is asking - how come the PTEs are executable when they are cleared
they are executable, when in fact we manipulate them when the module is
removed.
I think I try to deal with a similar problem to the one you encounter -
broken W^X. The only thing that bothered me in regard to your patches (and
only after I played with the code) is that there is still a time-window in
which W^X is broken due to disable_ro_nx().
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-04 20:36 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-04 20:36 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: will.deacon, linux-kernel, daniel, jeyu, rostedt, ast,
ard.biesheuvel, linux-mm, jannh, Dock, Deneen T, peterz, kristen,
akpm, mingo, luto, Keshavamurthy, Anil S, kernel-hardening,
mhiramat, naveen.n.rao, davem, netdev, Hansen, Dave
> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
>>>> wrote:
>>>>
>>>> Since vfree will lazily flush the TLB, but not lazily free the underlying
>>>> pages,
>>>> it often leaves stale TLB entries to freed pages that could get re-used.
>>>> This is
>>>> undesirable for cases where the memory being freed has special permissions
>>>> such
>>>> as executable.
>>>
>>> So I am trying to finish my patch-set for preventing transient W+X mappings
>>> from taking space, by handling kprobes & ftrace that I missed (thanks again
>>> for
>>> pointing it out).
>>>
>>> But all of the sudden, I don’t understand why we have the problem that this
>>> (your) patch-set deals with at all. We already change the mappings to make
>>> the memory writable before freeing the memory, so why can’t we make it
>>> non-executable at the same time? Actually, why do we make the module memory,
>>> including its data executable before freeing it???
>>
>> Yeah, this is really confusing, but I have a suspicion it's a combination
>> of the various different configurations and hysterical raisins. We can't
>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>> can we rely on disable_ro_nx() being available at build time.
>>
>> If we *could* rely on module allocations always using vmalloc(), then
>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>> afaict -- who cares about the memory attributes of a mapping that's about
>> to disappear anyway?
>>
>> Is it just nios2 that does something different?
>>
>> Will
>
> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> solve it as well, in fact that was what I first thought the solution should be
> until this was suggested. It's interesting that from the other thread Masami
> Hiramatsu referenced, set_memory_nx was suggested last year and would have
> inadvertently blocked this on x86. But, on the other architectures I have since
> learned it is a bit different.
>
> It looks like actually most arch's don't re-define set_memory_*, and so all of
> the frob_* functions are actually just noops. In which case allocating RWX is
> needed to make it work at all, because that is what the allocation is going to
> stay at. So in these archs, set_memory_nx won't solve it because it will do
> nothing.
>
> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> changing of the permissions on the directmap as well. You don't want some other
> caller getting a page that was left RO when freed and then trying to write to
> it, if I understand this.
>
> The other reasoning was that calling set_memory_nx isn't doing what we are
> actually trying to do which is prevent the pages from getting released too
> early.
>
> A more clear solution for all of this might involve refactoring some of the
> set_memory_ de-allocation logic out into __weak functions in either modules or
> vmalloc. As Jessica points out in the other thread though, modules does a lot
> more stuff there than the other module_alloc callers. I think it may take some
> thought to centralize AND make it optimal for every module_alloc/vmalloc_exec
> user and arch.
>
> But for now with the change in vmalloc, we can block the executable mapping
> freed page re-use issue in a cross platform way.
Please understand me correctly - I didn’t mean that your patches are not
needed.
All I did is asking - how come the PTEs are executable when they are cleared
they are executable, when in fact we manipulate them when the module is
removed.
I think I try to deal with a similar problem to the one you encounter -
broken W^X. The only thing that bothered me in regard to your patches (and
only after I played with the code) is that there is still a time-window in
which W^X is broken due to disable_ro_nx().
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-04 20:36 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-04 20:36 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: will.deacon, linux-kernel, daniel, jeyu, rostedt, ast,
ard.biesheuvel, linux-mm, jannh, Dock, Deneen T, peterz, kristen,
akpm, mingo, luto, Keshavamurthy, Anil S,
> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <rick.p.edgecombe@intel.com>
>>>> wrote:
>>>>
>>>> Since vfree will lazily flush the TLB, but not lazily free the underlying
>>>> pages,
>>>> it often leaves stale TLB entries to freed pages that could get re-used.
>>>> This is
>>>> undesirable for cases where the memory being freed has special permissions
>>>> such
>>>> as executable.
>>>
>>> So I am trying to finish my patch-set for preventing transient W+X mappings
>>> from taking space, by handling kprobes & ftrace that I missed (thanks again
>>> for
>>> pointing it out).
>>>
>>> But all of the sudden, I don’t understand why we have the problem that this
>>> (your) patch-set deals with at all. We already change the mappings to make
>>> the memory writable before freeing the memory, so why can’t we make it
>>> non-executable at the same time? Actually, why do we make the module memory,
>>> including its data executable before freeing it???
>>
>> Yeah, this is really confusing, but I have a suspicion it's a combination
>> of the various different configurations and hysterical raisins. We can't
>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>> can we rely on disable_ro_nx() being available at build time.
>>
>> If we *could* rely on module allocations always using vmalloc(), then
>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>> afaict -- who cares about the memory attributes of a mapping that's about
>> to disappear anyway?
>>
>> Is it just nios2 that does something different?
>>
>> Will
>
> Yea it is really intertwined. I think for x86, set_memory_nx everywhere would
> solve it as well, in fact that was what I first thought the solution should be
> until this was suggested. It's interesting that from the other thread Masami
> Hiramatsu referenced, set_memory_nx was suggested last year and would have
> inadvertently blocked this on x86. But, on the other architectures I have since
> learned it is a bit different.
>
> It looks like actually most arch's don't re-define set_memory_*, and so all of
> the frob_* functions are actually just noops. In which case allocating RWX is
> needed to make it work at all, because that is what the allocation is going to
> stay at. So in these archs, set_memory_nx won't solve it because it will do
> nothing.
>
> On x86 I think you cannot get rid of disable_ro_nx fully because there is the
> changing of the permissions on the directmap as well. You don't want some other
> caller getting a page that was left RO when freed and then trying to write to
> it, if I understand this.
>
> The other reasoning was that calling set_memory_nx isn't doing what we are
> actually trying to do which is prevent the pages from getting released too
> early.
>
> A more clear solution for all of this might involve refactoring some of the
> set_memory_ de-allocation logic out into __weak functions in either modules or
> vmalloc. As Jessica points out in the other thread though, modules does a lot
> more stuff there than the other module_alloc callers. I think it may take some
> thought to centralize AND make it optimal for every module_alloc/vmalloc_exec
> user and arch.
>
> But for now with the change in vmalloc, we can block the executable mapping
> freed page re-use issue in a cross platform way.
Please understand me correctly - I didn’t mean that your patches are not
needed.
All I did is asking - how come the PTEs are executable when they are cleared
they are executable, when in fact we manipulate them when the module is
removed.
I think I try to deal with a similar problem to the one you encounter -
broken W^X. The only thing that bothered me in regard to your patches (and
only after I played with the code) is that there is still a time-window in
which W^X is broken due to disable_ro_nx().
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-04 20:36 ` Nadav Amit
@ 2018-12-04 23:51 ` Edgecombe, Rick P
-1 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-04 23:51 UTC (permalink / raw)
To: nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, Dock, Deneen T, jannh, kristen, akpm, peterz,
will.deacon, mingo, luto, Keshavamurthy, Anil S,
kernel-hardening, mhiramat,
On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
> > On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> >
> > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > rick.p.edgecombe@intel.com>
> > > > > wrote:
> > > > >
> > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > underlying
> > > > > pages,
> > > > > it often leaves stale TLB entries to freed pages that could get re-
> > > > > used.
> > > > > This is
> > > > > undesirable for cases where the memory being freed has special
> > > > > permissions
> > > > > such
> > > > > as executable.
> > > >
> > > > So I am trying to finish my patch-set for preventing transient W+X
> > > > mappings
> > > > from taking space, by handling kprobes & ftrace that I missed (thanks
> > > > again
> > > > for
> > > > pointing it out).
> > > >
> > > > But all of the sudden, I don’t understand why we have the problem that
> > > > this
> > > > (your) patch-set deals with at all. We already change the mappings to
> > > > make
> > > > the memory writable before freeing the memory, so why can’t we make it
> > > > non-executable at the same time? Actually, why do we make the module
> > > > memory,
> > > > including its data executable before freeing it???
> > >
> > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > of the various different configurations and hysterical raisins. We can't
> > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > can we rely on disable_ro_nx() being available at build time.
> > >
> > > If we *could* rely on module allocations always using vmalloc(), then
> > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > afaict -- who cares about the memory attributes of a mapping that's about
> > > to disappear anyway?
> > >
> > > Is it just nios2 that does something different?
> > >
> > > Will
> >
> > Yea it is really intertwined. I think for x86, set_memory_nx everywhere
> > would
> > solve it as well, in fact that was what I first thought the solution should
> > be
> > until this was suggested. It's interesting that from the other thread Masami
> > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > inadvertently blocked this on x86. But, on the other architectures I have
> > since
> > learned it is a bit different.
> >
> > It looks like actually most arch's don't re-define set_memory_*, and so all
> > of
> > the frob_* functions are actually just noops. In which case allocating RWX
> > is
> > needed to make it work at all, because that is what the allocation is going
> > to
> > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > nothing.
> >
> > On x86 I think you cannot get rid of disable_ro_nx fully because there is
> > the
> > changing of the permissions on the directmap as well. You don't want some
> > other
> > caller getting a page that was left RO when freed and then trying to write
> > to
> > it, if I understand this.
> >
> > The other reasoning was that calling set_memory_nx isn't doing what we are
> > actually trying to do which is prevent the pages from getting released too
> > early.
> >
> > A more clear solution for all of this might involve refactoring some of the
> > set_memory_ de-allocation logic out into __weak functions in either modules
> > or
> > vmalloc. As Jessica points out in the other thread though, modules does a
> > lot
> > more stuff there than the other module_alloc callers. I think it may take
> > some
> > thought to centralize AND make it optimal for every
> > module_alloc/vmalloc_exec
> > user and arch.
> >
> > But for now with the change in vmalloc, we can block the executable mapping
> > freed page re-use issue in a cross platform way.
>
> Please understand me correctly - I didn’t mean that your patches are not
> needed.
Ok, I think I understand. I have been pondering these same things after Masami
Hiramatsu's comments on this thread the other day.
> All I did is asking - how come the PTEs are executable when they are cleared
> they are executable, when in fact we manipulate them when the module is
> removed.
I think the directmap used to be RWX so maybe historically its trying to return
it to its default state? Not sure.
> I think I try to deal with a similar problem to the one you encounter -
> broken W^X. The only thing that bothered me in regard to your patches (and
> only after I played with the code) is that there is still a time-window in
> which W^X is broken due to disable_ro_nx().
>
Totally agree there is overlap in the fixes and we should sync.
What do you think about Andy's suggestion for doing the vfree cleanup in vmalloc
with arch hooks? So the allocation goes into vfree fully setup and vmalloc frees
it and on x86 resets the direct map.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-04 23:51 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-04 23:51 UTC (permalink / raw)
To: nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, Dock, Deneen T, jannh, kristen, akpm, peterz,
will.deacon, mingo, luto, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
> > On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> >
> > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > rick.p.edgecombe@intel.com>
> > > > > wrote:
> > > > >
> > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > underlying
> > > > > pages,
> > > > > it often leaves stale TLB entries to freed pages that could get re-
> > > > > used.
> > > > > This is
> > > > > undesirable for cases where the memory being freed has special
> > > > > permissions
> > > > > such
> > > > > as executable.
> > > >
> > > > So I am trying to finish my patch-set for preventing transient W+X
> > > > mappings
> > > > from taking space, by handling kprobes & ftrace that I missed (thanks
> > > > again
> > > > for
> > > > pointing it out).
> > > >
> > > > But all of the sudden, I don’t understand why we have the problem that
> > > > this
> > > > (your) patch-set deals with at all. We already change the mappings to
> > > > make
> > > > the memory writable before freeing the memory, so why can’t we make it
> > > > non-executable at the same time? Actually, why do we make the module
> > > > memory,
> > > > including its data executable before freeing it???
> > >
> > > Yeah, this is really confusing, but I have a suspicion it's a combination
> > > of the various different configurations and hysterical raisins. We can't
> > > rely on module_alloc() allocating from the vmalloc area (see nios2) nor
> > > can we rely on disable_ro_nx() being available at build time.
> > >
> > > If we *could* rely on module allocations always using vmalloc(), then
> > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > afaict -- who cares about the memory attributes of a mapping that's about
> > > to disappear anyway?
> > >
> > > Is it just nios2 that does something different?
> > >
> > > Will
> >
> > Yea it is really intertwined. I think for x86, set_memory_nx everywhere
> > would
> > solve it as well, in fact that was what I first thought the solution should
> > be
> > until this was suggested. It's interesting that from the other thread Masami
> > Hiramatsu referenced, set_memory_nx was suggested last year and would have
> > inadvertently blocked this on x86. But, on the other architectures I have
> > since
> > learned it is a bit different.
> >
> > It looks like actually most arch's don't re-define set_memory_*, and so all
> > of
> > the frob_* functions are actually just noops. In which case allocating RWX
> > is
> > needed to make it work at all, because that is what the allocation is going
> > to
> > stay at. So in these archs, set_memory_nx won't solve it because it will do
> > nothing.
> >
> > On x86 I think you cannot get rid of disable_ro_nx fully because there is
> > the
> > changing of the permissions on the directmap as well. You don't want some
> > other
> > caller getting a page that was left RO when freed and then trying to write
> > to
> > it, if I understand this.
> >
> > The other reasoning was that calling set_memory_nx isn't doing what we are
> > actually trying to do which is prevent the pages from getting released too
> > early.
> >
> > A more clear solution for all of this might involve refactoring some of the
> > set_memory_ de-allocation logic out into __weak functions in either modules
> > or
> > vmalloc. As Jessica points out in the other thread though, modules does a
> > lot
> > more stuff there than the other module_alloc callers. I think it may take
> > some
> > thought to centralize AND make it optimal for every
> > module_alloc/vmalloc_exec
> > user and arch.
> >
> > But for now with the change in vmalloc, we can block the executable mapping
> > freed page re-use issue in a cross platform way.
>
> Please understand me correctly - I didn’t mean that your patches are not
> needed.
Ok, I think I understand. I have been pondering these same things after Masami
Hiramatsu's comments on this thread the other day.
> All I did is asking - how come the PTEs are executable when they are cleared
> they are executable, when in fact we manipulate them when the module is
> removed.
I think the directmap used to be RWX so maybe historically its trying to return
it to its default state? Not sure.
> I think I try to deal with a similar problem to the one you encounter -
> broken W^X. The only thing that bothered me in regard to your patches (and
> only after I played with the code) is that there is still a time-window in
> which W^X is broken due to disable_ro_nx().
>
Totally agree there is overlap in the fixes and we should sync.
What do you think about Andy's suggestion for doing the vfree cleanup in vmalloc
with arch hooks? So the allocation goes into vfree fully setup and vmalloc frees
it and on x86 resets the direct map.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-04 23:51 ` Edgecombe, Rick P
(?)
@ 2018-12-05 0:01 ` Nadav Amit
-1 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-05 0:01 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, Dock, Deneen T, jannh, kristen, akpm, peterz,
will.deacon, mingo, luto, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>> rick.p.edgecombe@intel.com>
>>>>>> wrote:
>>>>>>
>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>> underlying
>>>>>> pages,
>>>>>> it often leaves stale TLB entries to freed pages that could get re-
>>>>>> used.
>>>>>> This is
>>>>>> undesirable for cases where the memory being freed has special
>>>>>> permissions
>>>>>> such
>>>>>> as executable.
>>>>>
>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>> mappings
>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks
>>>>> again
>>>>> for
>>>>> pointing it out).
>>>>>
>>>>> But all of the sudden, I don’t understand why we have the problem that
>>>>> this
>>>>> (your) patch-set deals with at all. We already change the mappings to
>>>>> make
>>>>> the memory writable before freeing the memory, so why can’t we make it
>>>>> non-executable at the same time? Actually, why do we make the module
>>>>> memory,
>>>>> including its data executable before freeing it???
>>>>
>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>> of the various different configurations and hysterical raisins. We can't
>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>> can we rely on disable_ro_nx() being available at build time.
>>>>
>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>> to disappear anyway?
>>>>
>>>> Is it just nios2 that does something different?
>>>>
>>>> Will
>>>
>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>> would
>>> solve it as well, in fact that was what I first thought the solution should
>>> be
>>> until this was suggested. It's interesting that from the other thread Masami
>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>> inadvertently blocked this on x86. But, on the other architectures I have
>>> since
>>> learned it is a bit different.
>>>
>>> It looks like actually most arch's don't re-define set_memory_*, and so all
>>> of
>>> the frob_* functions are actually just noops. In which case allocating RWX
>>> is
>>> needed to make it work at all, because that is what the allocation is going
>>> to
>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>> nothing.
>>>
>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is
>>> the
>>> changing of the permissions on the directmap as well. You don't want some
>>> other
>>> caller getting a page that was left RO when freed and then trying to write
>>> to
>>> it, if I understand this.
>>>
>>> The other reasoning was that calling set_memory_nx isn't doing what we are
>>> actually trying to do which is prevent the pages from getting released too
>>> early.
>>>
>>> A more clear solution for all of this might involve refactoring some of the
>>> set_memory_ de-allocation logic out into __weak functions in either modules
>>> or
>>> vmalloc. As Jessica points out in the other thread though, modules does a
>>> lot
>>> more stuff there than the other module_alloc callers. I think it may take
>>> some
>>> thought to centralize AND make it optimal for every
>>> module_alloc/vmalloc_exec
>>> user and arch.
>>>
>>> But for now with the change in vmalloc, we can block the executable mapping
>>> freed page re-use issue in a cross platform way.
>>
>> Please understand me correctly - I didn’t mean that your patches are not
>> needed.
> Ok, I think I understand. I have been pondering these same things after Masami
> Hiramatsu's comments on this thread the other day.
>
>> All I did is asking - how come the PTEs are executable when they are cleared
>> they are executable, when in fact we manipulate them when the module is
>> removed.
> I think the directmap used to be RWX so maybe historically its trying to return
> it to its default state? Not sure.
>
>> I think I try to deal with a similar problem to the one you encounter -
>> broken W^X. The only thing that bothered me in regard to your patches (and
>> only after I played with the code) is that there is still a time-window in
>> which W^X is broken due to disable_ro_nx().
> Totally agree there is overlap in the fixes and we should sync.
>
> What do you think about Andy's suggestion for doing the vfree cleanup in vmalloc
> with arch hooks? So the allocation goes into vfree fully setup and vmalloc frees
> it and on x86 resets the direct map.
As long as you do it, I have no problem ;-)
You would need to consider all the callers of module_memfree(), and probably
to untangle at least part of the mess in pageattr.c . If you are up to it,
just say so, and I’ll drop this patch. All I can say is “good luck with all
that”.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 0:01 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-05 0:01 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, Dock, Deneen T, jannh, kristen, akpm, peterz,
will.deacon, mingo, luto, Keshavamurthy, Anil S,
kernel-hardening, mhiramat, naveen.n.rao, davem, netdev, Hansen,
Dave
> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>> rick.p.edgecombe@intel.com>
>>>>>> wrote:
>>>>>>
>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>> underlying
>>>>>> pages,
>>>>>> it often leaves stale TLB entries to freed pages that could get re-
>>>>>> used.
>>>>>> This is
>>>>>> undesirable for cases where the memory being freed has special
>>>>>> permissions
>>>>>> such
>>>>>> as executable.
>>>>>
>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>> mappings
>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks
>>>>> again
>>>>> for
>>>>> pointing it out).
>>>>>
>>>>> But all of the sudden, I don’t understand why we have the problem that
>>>>> this
>>>>> (your) patch-set deals with at all. We already change the mappings to
>>>>> make
>>>>> the memory writable before freeing the memory, so why can’t we make it
>>>>> non-executable at the same time? Actually, why do we make the module
>>>>> memory,
>>>>> including its data executable before freeing it???
>>>>
>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>> of the various different configurations and hysterical raisins. We can't
>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>> can we rely on disable_ro_nx() being available at build time.
>>>>
>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>> to disappear anyway?
>>>>
>>>> Is it just nios2 that does something different?
>>>>
>>>> Will
>>>
>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>> would
>>> solve it as well, in fact that was what I first thought the solution should
>>> be
>>> until this was suggested. It's interesting that from the other thread Masami
>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>> inadvertently blocked this on x86. But, on the other architectures I have
>>> since
>>> learned it is a bit different.
>>>
>>> It looks like actually most arch's don't re-define set_memory_*, and so all
>>> of
>>> the frob_* functions are actually just noops. In which case allocating RWX
>>> is
>>> needed to make it work at all, because that is what the allocation is going
>>> to
>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>> nothing.
>>>
>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is
>>> the
>>> changing of the permissions on the directmap as well. You don't want some
>>> other
>>> caller getting a page that was left RO when freed and then trying to write
>>> to
>>> it, if I understand this.
>>>
>>> The other reasoning was that calling set_memory_nx isn't doing what we are
>>> actually trying to do which is prevent the pages from getting released too
>>> early.
>>>
>>> A more clear solution for all of this might involve refactoring some of the
>>> set_memory_ de-allocation logic out into __weak functions in either modules
>>> or
>>> vmalloc. As Jessica points out in the other thread though, modules does a
>>> lot
>>> more stuff there than the other module_alloc callers. I think it may take
>>> some
>>> thought to centralize AND make it optimal for every
>>> module_alloc/vmalloc_exec
>>> user and arch.
>>>
>>> But for now with the change in vmalloc, we can block the executable mapping
>>> freed page re-use issue in a cross platform way.
>>
>> Please understand me correctly - I didn’t mean that your patches are not
>> needed.
> Ok, I think I understand. I have been pondering these same things after Masami
> Hiramatsu's comments on this thread the other day.
>
>> All I did is asking - how come the PTEs are executable when they are cleared
>> they are executable, when in fact we manipulate them when the module is
>> removed.
> I think the directmap used to be RWX so maybe historically its trying to return
> it to its default state? Not sure.
>
>> I think I try to deal with a similar problem to the one you encounter -
>> broken W^X. The only thing that bothered me in regard to your patches (and
>> only after I played with the code) is that there is still a time-window in
>> which W^X is broken due to disable_ro_nx().
> Totally agree there is overlap in the fixes and we should sync.
>
> What do you think about Andy's suggestion for doing the vfree cleanup in vmalloc
> with arch hooks? So the allocation goes into vfree fully setup and vmalloc frees
> it and on x86 resets the direct map.
As long as you do it, I have no problem ;-)
You would need to consider all the callers of module_memfree(), and probably
to untangle at least part of the mess in pageattr.c . If you are up to it,
just say so, and I’ll drop this patch. All I can say is “good luck with all
that”.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 0:01 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-05 0:01 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, ast,
linux-mm, Dock, Deneen T, jannh, kristen, akpm, peterz,
will.deacon, mingo, luto, Keshavamurthy, Anil S,
> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>> rick.p.edgecombe@intel.com>
>>>>>> wrote:
>>>>>>
>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>> underlying
>>>>>> pages,
>>>>>> it often leaves stale TLB entries to freed pages that could get re-
>>>>>> used.
>>>>>> This is
>>>>>> undesirable for cases where the memory being freed has special
>>>>>> permissions
>>>>>> such
>>>>>> as executable.
>>>>>
>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>> mappings
>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks
>>>>> again
>>>>> for
>>>>> pointing it out).
>>>>>
>>>>> But all of the sudden, I don’t understand why we have the problem that
>>>>> this
>>>>> (your) patch-set deals with at all. We already change the mappings to
>>>>> make
>>>>> the memory writable before freeing the memory, so why can’t we make it
>>>>> non-executable at the same time? Actually, why do we make the module
>>>>> memory,
>>>>> including its data executable before freeing it???
>>>>
>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>> of the various different configurations and hysterical raisins. We can't
>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>> can we rely on disable_ro_nx() being available at build time.
>>>>
>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>> to disappear anyway?
>>>>
>>>> Is it just nios2 that does something different?
>>>>
>>>> Will
>>>
>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>> would
>>> solve it as well, in fact that was what I first thought the solution should
>>> be
>>> until this was suggested. It's interesting that from the other thread Masami
>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>> inadvertently blocked this on x86. But, on the other architectures I have
>>> since
>>> learned it is a bit different.
>>>
>>> It looks like actually most arch's don't re-define set_memory_*, and so all
>>> of
>>> the frob_* functions are actually just noops. In which case allocating RWX
>>> is
>>> needed to make it work at all, because that is what the allocation is going
>>> to
>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>> nothing.
>>>
>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is
>>> the
>>> changing of the permissions on the directmap as well. You don't want some
>>> other
>>> caller getting a page that was left RO when freed and then trying to write
>>> to
>>> it, if I understand this.
>>>
>>> The other reasoning was that calling set_memory_nx isn't doing what we are
>>> actually trying to do which is prevent the pages from getting released too
>>> early.
>>>
>>> A more clear solution for all of this might involve refactoring some of the
>>> set_memory_ de-allocation logic out into __weak functions in either modules
>>> or
>>> vmalloc. As Jessica points out in the other thread though, modules does a
>>> lot
>>> more stuff there than the other module_alloc callers. I think it may take
>>> some
>>> thought to centralize AND make it optimal for every
>>> module_alloc/vmalloc_exec
>>> user and arch.
>>>
>>> But for now with the change in vmalloc, we can block the executable mapping
>>> freed page re-use issue in a cross platform way.
>>
>> Please understand me correctly - I didn’t mean that your patches are not
>> needed.
> Ok, I think I understand. I have been pondering these same things after Masami
> Hiramatsu's comments on this thread the other day.
>
>> All I did is asking - how come the PTEs are executable when they are cleared
>> they are executable, when in fact we manipulate them when the module is
>> removed.
> I think the directmap used to be RWX so maybe historically its trying to return
> it to its default state? Not sure.
>
>> I think I try to deal with a similar problem to the one you encounter -
>> broken W^X. The only thing that bothered me in regard to your patches (and
>> only after I played with the code) is that there is still a time-window in
>> which W^X is broken due to disable_ro_nx().
> Totally agree there is overlap in the fixes and we should sync.
>
> What do you think about Andy's suggestion for doing the vfree cleanup in vmalloc
> with arch hooks? So the allocation goes into vfree fully setup and vmalloc frees
> it and on x86 resets the direct map.
As long as you do it, I have no problem ;-)
You would need to consider all the callers of module_memfree(), and probably
to untangle at least part of the mess in pageattr.c . If you are up to it,
just say so, and I’ll drop this patch. All I can say is “good luck with all
that”.
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-05 0:01 ` Nadav Amit
(?)
@ 2018-12-05 0:29 ` Edgecombe, Rick P
-1 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-05 0:29 UTC (permalink / raw)
To: nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S, kernel-hardening, mhiramat,
naveen.n.rao, davem, netdev, Hansen, Dave
On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
> > On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> >
> > On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
> > > > On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
> > > > rick.p.edgecombe@intel.com>
> > > > wrote:
> > > >
> > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > > > rick.p.edgecombe@intel.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > > > underlying
> > > > > > > pages,
> > > > > > > it often leaves stale TLB entries to freed pages that could get
> > > > > > > re-
> > > > > > > used.
> > > > > > > This is
> > > > > > > undesirable for cases where the memory being freed has special
> > > > > > > permissions
> > > > > > > such
> > > > > > > as executable.
> > > > > >
> > > > > > So I am trying to finish my patch-set for preventing transient W+X
> > > > > > mappings
> > > > > > from taking space, by handling kprobes & ftrace that I missed
> > > > > > (thanks
> > > > > > again
> > > > > > for
> > > > > > pointing it out).
> > > > > >
> > > > > > But all of the sudden, I don’t understand why we have the problem
> > > > > > that
> > > > > > this
> > > > > > (your) patch-set deals with at all. We already change the mappings
> > > > > > to
> > > > > > make
> > > > > > the memory writable before freeing the memory, so why can’t we make
> > > > > > it
> > > > > > non-executable at the same time? Actually, why do we make the module
> > > > > > memory,
> > > > > > including its data executable before freeing it???
> > > > >
> > > > > Yeah, this is really confusing, but I have a suspicion it's a
> > > > > combination
> > > > > of the various different configurations and hysterical raisins. We
> > > > > can't
> > > > > rely on module_alloc() allocating from the vmalloc area (see nios2)
> > > > > nor
> > > > > can we rely on disable_ro_nx() being available at build time.
> > > > >
> > > > > If we *could* rely on module allocations always using vmalloc(), then
> > > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > > afaict -- who cares about the memory attributes of a mapping that's
> > > > > about
> > > > > to disappear anyway?
> > > > >
> > > > > Is it just nios2 that does something different?
> > > > >
> > > > > Will
> > > >
> > > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere
> > > > would
> > > > solve it as well, in fact that was what I first thought the solution
> > > > should
> > > > be
> > > > until this was suggested. It's interesting that from the other thread
> > > > Masami
> > > > Hiramatsu referenced, set_memory_nx was suggested last year and would
> > > > have
> > > > inadvertently blocked this on x86. But, on the other architectures I
> > > > have
> > > > since
> > > > learned it is a bit different.
> > > >
> > > > It looks like actually most arch's don't re-define set_memory_*, and so
> > > > all
> > > > of
> > > > the frob_* functions are actually just noops. In which case allocating
> > > > RWX
> > > > is
> > > > needed to make it work at all, because that is what the allocation is
> > > > going
> > > > to
> > > > stay at. So in these archs, set_memory_nx won't solve it because it will
> > > > do
> > > > nothing.
> > > >
> > > > On x86 I think you cannot get rid of disable_ro_nx fully because there
> > > > is
> > > > the
> > > > changing of the permissions on the directmap as well. You don't want
> > > > some
> > > > other
> > > > caller getting a page that was left RO when freed and then trying to
> > > > write
> > > > to
> > > > it, if I understand this.
> > > >
> > > > The other reasoning was that calling set_memory_nx isn't doing what we
> > > > are
> > > > actually trying to do which is prevent the pages from getting released
> > > > too
> > > > early.
> > > >
> > > > A more clear solution for all of this might involve refactoring some of
> > > > the
> > > > set_memory_ de-allocation logic out into __weak functions in either
> > > > modules
> > > > or
> > > > vmalloc. As Jessica points out in the other thread though, modules does
> > > > a
> > > > lot
> > > > more stuff there than the other module_alloc callers. I think it may
> > > > take
> > > > some
> > > > thought to centralize AND make it optimal for every
> > > > module_alloc/vmalloc_exec
> > > > user and arch.
> > > >
> > > > But for now with the change in vmalloc, we can block the executable
> > > > mapping
> > > > freed page re-use issue in a cross platform way.
> > >
> > > Please understand me correctly - I didn’t mean that your patches are not
> > > needed.
> >
> > Ok, I think I understand. I have been pondering these same things after
> > Masami
> > Hiramatsu's comments on this thread the other day.
> >
> > > All I did is asking - how come the PTEs are executable when they are
> > > cleared
> > > they are executable, when in fact we manipulate them when the module is
> > > removed.
> >
> > I think the directmap used to be RWX so maybe historically its trying to
> > return
> > it to its default state? Not sure.
> >
> > > I think I try to deal with a similar problem to the one you encounter -
> > > broken W^X. The only thing that bothered me in regard to your patches (and
> > > only after I played with the code) is that there is still a time-window in
> > > which W^X is broken due to disable_ro_nx().
> >
> > Totally agree there is overlap in the fixes and we should sync.
> >
> > What do you think about Andy's suggestion for doing the vfree cleanup in
> > vmalloc
> > with arch hooks? So the allocation goes into vfree fully setup and vmalloc
> > frees
> > it and on x86 resets the direct map.
>
> As long as you do it, I have no problem ;-)
>
> You would need to consider all the callers of module_memfree(), and probably
> to untangle at least part of the mess in pageattr.c . If you are up to it,
> just say so, and I’ll drop this patch. All I can say is “good luck with all
> that”.
>
I thought you were trying to prevent having any memory that at any time was W+X,
how does vfree help with the module load time issues, where it starts WRX on
x86?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 0:29 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-05 0:29 UTC (permalink / raw)
To: nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S, kernel-hardening, mhiramat,
naveen.n.rao, davem, netdev, Hansen, Dave
On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
> > On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> >
> > On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
> > > > On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
> > > > rick.p.edgecombe@intel.com>
> > > > wrote:
> > > >
> > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > > > rick.p.edgecombe@intel.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > > > underlying
> > > > > > > pages,
> > > > > > > it often leaves stale TLB entries to freed pages that could get
> > > > > > > re-
> > > > > > > used.
> > > > > > > This is
> > > > > > > undesirable for cases where the memory being freed has special
> > > > > > > permissions
> > > > > > > such
> > > > > > > as executable.
> > > > > >
> > > > > > So I am trying to finish my patch-set for preventing transient W+X
> > > > > > mappings
> > > > > > from taking space, by handling kprobes & ftrace that I missed
> > > > > > (thanks
> > > > > > again
> > > > > > for
> > > > > > pointing it out).
> > > > > >
> > > > > > But all of the sudden, I don’t understand why we have the problem
> > > > > > that
> > > > > > this
> > > > > > (your) patch-set deals with at all. We already change the mappings
> > > > > > to
> > > > > > make
> > > > > > the memory writable before freeing the memory, so why can’t we make
> > > > > > it
> > > > > > non-executable at the same time? Actually, why do we make the module
> > > > > > memory,
> > > > > > including its data executable before freeing it???
> > > > >
> > > > > Yeah, this is really confusing, but I have a suspicion it's a
> > > > > combination
> > > > > of the various different configurations and hysterical raisins. We
> > > > > can't
> > > > > rely on module_alloc() allocating from the vmalloc area (see nios2)
> > > > > nor
> > > > > can we rely on disable_ro_nx() being available at build time.
> > > > >
> > > > > If we *could* rely on module allocations always using vmalloc(), then
> > > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > > afaict -- who cares about the memory attributes of a mapping that's
> > > > > about
> > > > > to disappear anyway?
> > > > >
> > > > > Is it just nios2 that does something different?
> > > > >
> > > > > Will
> > > >
> > > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere
> > > > would
> > > > solve it as well, in fact that was what I first thought the solution
> > > > should
> > > > be
> > > > until this was suggested. It's interesting that from the other thread
> > > > Masami
> > > > Hiramatsu referenced, set_memory_nx was suggested last year and would
> > > > have
> > > > inadvertently blocked this on x86. But, on the other architectures I
> > > > have
> > > > since
> > > > learned it is a bit different.
> > > >
> > > > It looks like actually most arch's don't re-define set_memory_*, and so
> > > > all
> > > > of
> > > > the frob_* functions are actually just noops. In which case allocating
> > > > RWX
> > > > is
> > > > needed to make it work at all, because that is what the allocation is
> > > > going
> > > > to
> > > > stay at. So in these archs, set_memory_nx won't solve it because it will
> > > > do
> > > > nothing.
> > > >
> > > > On x86 I think you cannot get rid of disable_ro_nx fully because there
> > > > is
> > > > the
> > > > changing of the permissions on the directmap as well. You don't want
> > > > some
> > > > other
> > > > caller getting a page that was left RO when freed and then trying to
> > > > write
> > > > to
> > > > it, if I understand this.
> > > >
> > > > The other reasoning was that calling set_memory_nx isn't doing what we
> > > > are
> > > > actually trying to do which is prevent the pages from getting released
> > > > too
> > > > early.
> > > >
> > > > A more clear solution for all of this might involve refactoring some of
> > > > the
> > > > set_memory_ de-allocation logic out into __weak functions in either
> > > > modules
> > > > or
> > > > vmalloc. As Jessica points out in the other thread though, modules does
> > > > a
> > > > lot
> > > > more stuff there than the other module_alloc callers. I think it may
> > > > take
> > > > some
> > > > thought to centralize AND make it optimal for every
> > > > module_alloc/vmalloc_exec
> > > > user and arch.
> > > >
> > > > But for now with the change in vmalloc, we can block the executable
> > > > mapping
> > > > freed page re-use issue in a cross platform way.
> > >
> > > Please understand me correctly - I didn’t mean that your patches are not
> > > needed.
> >
> > Ok, I think I understand. I have been pondering these same things after
> > Masami
> > Hiramatsu's comments on this thread the other day.
> >
> > > All I did is asking - how come the PTEs are executable when they are
> > > cleared
> > > they are executable, when in fact we manipulate them when the module is
> > > removed.
> >
> > I think the directmap used to be RWX so maybe historically its trying to
> > return
> > it to its default state? Not sure.
> >
> > > I think I try to deal with a similar problem to the one you encounter -
> > > broken W^X. The only thing that bothered me in regard to your patches (and
> > > only after I played with the code) is that there is still a time-window in
> > > which W^X is broken due to disable_ro_nx().
> >
> > Totally agree there is overlap in the fixes and we should sync.
> >
> > What do you think about Andy's suggestion for doing the vfree cleanup in
> > vmalloc
> > with arch hooks? So the allocation goes into vfree fully setup and vmalloc
> > frees
> > it and on x86 resets the direct map.
>
> As long as you do it, I have no problem ;-)
>
> You would need to consider all the callers of module_memfree(), and probably
> to untangle at least part of the mess in pageattr.c . If you are up to it,
> just say so, and I’ll drop this patch. All I can say is “good luck with all
> that”.
>
I thought you were trying to prevent having any memory that at any time was W+X,
how does vfree help with the module load time issues, where it starts WRX on
x86?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 0:29 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-05 0:29 UTC (permalink / raw)
To: nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S,
On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
> > On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> >
> > On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
> > > > On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
> > > > rick.p.edgecombe@intel.com>
> > > > wrote:
> > > >
> > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > > > rick.p.edgecombe@intel.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > > > underlying
> > > > > > > pages,
> > > > > > > it often leaves stale TLB entries to freed pages that could get
> > > > > > > re-
> > > > > > > used.
> > > > > > > This is
> > > > > > > undesirable for cases where the memory being freed has special
> > > > > > > permissions
> > > > > > > such
> > > > > > > as executable.
> > > > > >
> > > > > > So I am trying to finish my patch-set for preventing transient W+X
> > > > > > mappings
> > > > > > from taking space, by handling kprobes & ftrace that I missed
> > > > > > (thanks
> > > > > > again
> > > > > > for
> > > > > > pointing it out).
> > > > > >
> > > > > > But all of the sudden, I don’t understand why we have the problem
> > > > > > that
> > > > > > this
> > > > > > (your) patch-set deals with at all. We already change the mappings
> > > > > > to
> > > > > > make
> > > > > > the memory writable before freeing the memory, so why can’t we make
> > > > > > it
> > > > > > non-executable at the same time? Actually, why do we make the module
> > > > > > memory,
> > > > > > including its data executable before freeing it???
> > > > >
> > > > > Yeah, this is really confusing, but I have a suspicion it's a
> > > > > combination
> > > > > of the various different configurations and hysterical raisins. We
> > > > > can't
> > > > > rely on module_alloc() allocating from the vmalloc area (see nios2)
> > > > > nor
> > > > > can we rely on disable_ro_nx() being available at build time.
> > > > >
> > > > > If we *could* rely on module allocations always using vmalloc(), then
> > > > > we could pass in Rick's new flag and drop disable_ro_nx() altogether
> > > > > afaict -- who cares about the memory attributes of a mapping that's
> > > > > about
> > > > > to disappear anyway?
> > > > >
> > > > > Is it just nios2 that does something different?
> > > > >
> > > > > Will
> > > >
> > > > Yea it is really intertwined. I think for x86, set_memory_nx everywhere
> > > > would
> > > > solve it as well, in fact that was what I first thought the solution
> > > > should
> > > > be
> > > > until this was suggested. It's interesting that from the other thread
> > > > Masami
> > > > Hiramatsu referenced, set_memory_nx was suggested last year and would
> > > > have
> > > > inadvertently blocked this on x86. But, on the other architectures I
> > > > have
> > > > since
> > > > learned it is a bit different.
> > > >
> > > > It looks like actually most arch's don't re-define set_memory_*, and so
> > > > all
> > > > of
> > > > the frob_* functions are actually just noops. In which case allocating
> > > > RWX
> > > > is
> > > > needed to make it work at all, because that is what the allocation is
> > > > going
> > > > to
> > > > stay at. So in these archs, set_memory_nx won't solve it because it will
> > > > do
> > > > nothing.
> > > >
> > > > On x86 I think you cannot get rid of disable_ro_nx fully because there
> > > > is
> > > > the
> > > > changing of the permissions on the directmap as well. You don't want
> > > > some
> > > > other
> > > > caller getting a page that was left RO when freed and then trying to
> > > > write
> > > > to
> > > > it, if I understand this.
> > > >
> > > > The other reasoning was that calling set_memory_nx isn't doing what we
> > > > are
> > > > actually trying to do which is prevent the pages from getting released
> > > > too
> > > > early.
> > > >
> > > > A more clear solution for all of this might involve refactoring some of
> > > > the
> > > > set_memory_ de-allocation logic out into __weak functions in either
> > > > modules
> > > > or
> > > > vmalloc. As Jessica points out in the other thread though, modules does
> > > > a
> > > > lot
> > > > more stuff there than the other module_alloc callers. I think it may
> > > > take
> > > > some
> > > > thought to centralize AND make it optimal for every
> > > > module_alloc/vmalloc_exec
> > > > user and arch.
> > > >
> > > > But for now with the change in vmalloc, we can block the executable
> > > > mapping
> > > > freed page re-use issue in a cross platform way.
> > >
> > > Please understand me correctly - I didn’t mean that your patches are not
> > > needed.
> >
> > Ok, I think I understand. I have been pondering these same things after
> > Masami
> > Hiramatsu's comments on this thread the other day.
> >
> > > All I did is asking - how come the PTEs are executable when they are
> > > cleared
> > > they are executable, when in fact we manipulate them when the module is
> > > removed.
> >
> > I think the directmap used to be RWX so maybe historically its trying to
> > return
> > it to its default state? Not sure.
> >
> > > I think I try to deal with a similar problem to the one you encounter -
> > > broken W^X. The only thing that bothered me in regard to your patches (and
> > > only after I played with the code) is that there is still a time-window in
> > > which W^X is broken due to disable_ro_nx().
> >
> > Totally agree there is overlap in the fixes and we should sync.
> >
> > What do you think about Andy's suggestion for doing the vfree cleanup in
> > vmalloc
> > with arch hooks? So the allocation goes into vfree fully setup and vmalloc
> > frees
> > it and on x86 resets the direct map.
>
> As long as you do it, I have no problem ;-)
>
> You would need to consider all the callers of module_memfree(), and probably
> to untangle at least part of the mess in pageattr.c . If you are up to it,
> just say so, and I’ll drop this patch. All I can say is “good luck with all
> that”.
>
I thought you were trying to prevent having any memory that at any time was W+X,
how does vfree help with the module load time issues, where it starts WRX on
x86?
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-05 0:29 ` Edgecombe, Rick P
(?)
@ 2018-12-05 0:53 ` Nadav Amit
-1 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-05 0:53 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S, kernel-hardening, mhiramat,
naveen.n.rao, davem, netdev, Hansen, Dave
> On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
>>>>> rick.p.edgecombe@intel.com>
>>>>> wrote:
>>>>>
>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>>>> underlying
>>>>>>>> pages,
>>>>>>>> it often leaves stale TLB entries to freed pages that could get
>>>>>>>> re-
>>>>>>>> used.
>>>>>>>> This is
>>>>>>>> undesirable for cases where the memory being freed has special
>>>>>>>> permissions
>>>>>>>> such
>>>>>>>> as executable.
>>>>>>>
>>>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>>>> mappings
>>>>>>> from taking space, by handling kprobes & ftrace that I missed
>>>>>>> (thanks
>>>>>>> again
>>>>>>> for
>>>>>>> pointing it out).
>>>>>>>
>>>>>>> But all of the sudden, I don’t understand why we have the problem
>>>>>>> that
>>>>>>> this
>>>>>>> (your) patch-set deals with at all. We already change the mappings
>>>>>>> to
>>>>>>> make
>>>>>>> the memory writable before freeing the memory, so why can’t we make
>>>>>>> it
>>>>>>> non-executable at the same time? Actually, why do we make the module
>>>>>>> memory,
>>>>>>> including its data executable before freeing it???
>>>>>>
>>>>>> Yeah, this is really confusing, but I have a suspicion it's a
>>>>>> combination
>>>>>> of the various different configurations and hysterical raisins. We
>>>>>> can't
>>>>>> rely on module_alloc() allocating from the vmalloc area (see nios2)
>>>>>> nor
>>>>>> can we rely on disable_ro_nx() being available at build time.
>>>>>>
>>>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>>>> afaict -- who cares about the memory attributes of a mapping that's
>>>>>> about
>>>>>> to disappear anyway?
>>>>>>
>>>>>> Is it just nios2 that does something different?
>>>>>>
>>>>>> Will
>>>>>
>>>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>>>> would
>>>>> solve it as well, in fact that was what I first thought the solution
>>>>> should
>>>>> be
>>>>> until this was suggested. It's interesting that from the other thread
>>>>> Masami
>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and would
>>>>> have
>>>>> inadvertently blocked this on x86. But, on the other architectures I
>>>>> have
>>>>> since
>>>>> learned it is a bit different.
>>>>>
>>>>> It looks like actually most arch's don't re-define set_memory_*, and so
>>>>> all
>>>>> of
>>>>> the frob_* functions are actually just noops. In which case allocating
>>>>> RWX
>>>>> is
>>>>> needed to make it work at all, because that is what the allocation is
>>>>> going
>>>>> to
>>>>> stay at. So in these archs, set_memory_nx won't solve it because it will
>>>>> do
>>>>> nothing.
>>>>>
>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because there
>>>>> is
>>>>> the
>>>>> changing of the permissions on the directmap as well. You don't want
>>>>> some
>>>>> other
>>>>> caller getting a page that was left RO when freed and then trying to
>>>>> write
>>>>> to
>>>>> it, if I understand this.
>>>>>
>>>>> The other reasoning was that calling set_memory_nx isn't doing what we
>>>>> are
>>>>> actually trying to do which is prevent the pages from getting released
>>>>> too
>>>>> early.
>>>>>
>>>>> A more clear solution for all of this might involve refactoring some of
>>>>> the
>>>>> set_memory_ de-allocation logic out into __weak functions in either
>>>>> modules
>>>>> or
>>>>> vmalloc. As Jessica points out in the other thread though, modules does
>>>>> a
>>>>> lot
>>>>> more stuff there than the other module_alloc callers. I think it may
>>>>> take
>>>>> some
>>>>> thought to centralize AND make it optimal for every
>>>>> module_alloc/vmalloc_exec
>>>>> user and arch.
>>>>>
>>>>> But for now with the change in vmalloc, we can block the executable
>>>>> mapping
>>>>> freed page re-use issue in a cross platform way.
>>>>
>>>> Please understand me correctly - I didn’t mean that your patches are not
>>>> needed.
>>>
>>> Ok, I think I understand. I have been pondering these same things after
>>> Masami
>>> Hiramatsu's comments on this thread the other day.
>>>
>>>> All I did is asking - how come the PTEs are executable when they are
>>>> cleared
>>>> they are executable, when in fact we manipulate them when the module is
>>>> removed.
>>>
>>> I think the directmap used to be RWX so maybe historically its trying to
>>> return
>>> it to its default state? Not sure.
>>>
>>>> I think I try to deal with a similar problem to the one you encounter -
>>>> broken W^X. The only thing that bothered me in regard to your patches (and
>>>> only after I played with the code) is that there is still a time-window in
>>>> which W^X is broken due to disable_ro_nx().
>>>
>>> Totally agree there is overlap in the fixes and we should sync.
>>>
>>> What do you think about Andy's suggestion for doing the vfree cleanup in
>>> vmalloc
>>> with arch hooks? So the allocation goes into vfree fully setup and vmalloc
>>> frees
>>> it and on x86 resets the direct map.
>>
>> As long as you do it, I have no problem ;-)
>>
>> You would need to consider all the callers of module_memfree(), and probably
>> to untangle at least part of the mess in pageattr.c . If you are up to it,
>> just say so, and I’ll drop this patch. All I can say is “good luck with all
>> that”.
> I thought you were trying to prevent having any memory that at any time was W+X,
> how does vfree help with the module load time issues, where it starts WRX on
> x86?
I didn’t say it does. The patch I submitted before [1] should deal with the
issue of module loading, and I still think it is required. I also addressed
the kprobe and ftrace issues that you raised.
Perhaps it makes more sense that I will include the patch I proposed for
module cleanup to make the patch-set “complete”. If you finish the changes
you propose before the patch is applied, it could be dropped. I just want to
get rid of this series, as it keeps collecting more and more patches.
I suspect it will not be the last version anyhow.
[1] https://lkml.org/lkml/2018/11/21/305
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 0:53 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-05 0:53 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S, kernel-hardening, mhiramat,
naveen.n.rao, davem, netdev, Hansen, Dave
> On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
>>>>> rick.p.edgecombe@intel.com>
>>>>> wrote:
>>>>>
>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>>>> underlying
>>>>>>>> pages,
>>>>>>>> it often leaves stale TLB entries to freed pages that could get
>>>>>>>> re-
>>>>>>>> used.
>>>>>>>> This is
>>>>>>>> undesirable for cases where the memory being freed has special
>>>>>>>> permissions
>>>>>>>> such
>>>>>>>> as executable.
>>>>>>>
>>>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>>>> mappings
>>>>>>> from taking space, by handling kprobes & ftrace that I missed
>>>>>>> (thanks
>>>>>>> again
>>>>>>> for
>>>>>>> pointing it out).
>>>>>>>
>>>>>>> But all of the sudden, I don’t understand why we have the problem
>>>>>>> that
>>>>>>> this
>>>>>>> (your) patch-set deals with at all. We already change the mappings
>>>>>>> to
>>>>>>> make
>>>>>>> the memory writable before freeing the memory, so why can’t we make
>>>>>>> it
>>>>>>> non-executable at the same time? Actually, why do we make the module
>>>>>>> memory,
>>>>>>> including its data executable before freeing it???
>>>>>>
>>>>>> Yeah, this is really confusing, but I have a suspicion it's a
>>>>>> combination
>>>>>> of the various different configurations and hysterical raisins. We
>>>>>> can't
>>>>>> rely on module_alloc() allocating from the vmalloc area (see nios2)
>>>>>> nor
>>>>>> can we rely on disable_ro_nx() being available at build time.
>>>>>>
>>>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>>>> afaict -- who cares about the memory attributes of a mapping that's
>>>>>> about
>>>>>> to disappear anyway?
>>>>>>
>>>>>> Is it just nios2 that does something different?
>>>>>>
>>>>>> Will
>>>>>
>>>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>>>> would
>>>>> solve it as well, in fact that was what I first thought the solution
>>>>> should
>>>>> be
>>>>> until this was suggested. It's interesting that from the other thread
>>>>> Masami
>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and would
>>>>> have
>>>>> inadvertently blocked this on x86. But, on the other architectures I
>>>>> have
>>>>> since
>>>>> learned it is a bit different.
>>>>>
>>>>> It looks like actually most arch's don't re-define set_memory_*, and so
>>>>> all
>>>>> of
>>>>> the frob_* functions are actually just noops. In which case allocating
>>>>> RWX
>>>>> is
>>>>> needed to make it work at all, because that is what the allocation is
>>>>> going
>>>>> to
>>>>> stay at. So in these archs, set_memory_nx won't solve it because it will
>>>>> do
>>>>> nothing.
>>>>>
>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because there
>>>>> is
>>>>> the
>>>>> changing of the permissions on the directmap as well. You don't want
>>>>> some
>>>>> other
>>>>> caller getting a page that was left RO when freed and then trying to
>>>>> write
>>>>> to
>>>>> it, if I understand this.
>>>>>
>>>>> The other reasoning was that calling set_memory_nx isn't doing what we
>>>>> are
>>>>> actually trying to do which is prevent the pages from getting released
>>>>> too
>>>>> early.
>>>>>
>>>>> A more clear solution for all of this might involve refactoring some of
>>>>> the
>>>>> set_memory_ de-allocation logic out into __weak functions in either
>>>>> modules
>>>>> or
>>>>> vmalloc. As Jessica points out in the other thread though, modules does
>>>>> a
>>>>> lot
>>>>> more stuff there than the other module_alloc callers. I think it may
>>>>> take
>>>>> some
>>>>> thought to centralize AND make it optimal for every
>>>>> module_alloc/vmalloc_exec
>>>>> user and arch.
>>>>>
>>>>> But for now with the change in vmalloc, we can block the executable
>>>>> mapping
>>>>> freed page re-use issue in a cross platform way.
>>>>
>>>> Please understand me correctly - I didn’t mean that your patches are not
>>>> needed.
>>>
>>> Ok, I think I understand. I have been pondering these same things after
>>> Masami
>>> Hiramatsu's comments on this thread the other day.
>>>
>>>> All I did is asking - how come the PTEs are executable when they are
>>>> cleared
>>>> they are executable, when in fact we manipulate them when the module is
>>>> removed.
>>>
>>> I think the directmap used to be RWX so maybe historically its trying to
>>> return
>>> it to its default state? Not sure.
>>>
>>>> I think I try to deal with a similar problem to the one you encounter -
>>>> broken W^X. The only thing that bothered me in regard to your patches (and
>>>> only after I played with the code) is that there is still a time-window in
>>>> which W^X is broken due to disable_ro_nx().
>>>
>>> Totally agree there is overlap in the fixes and we should sync.
>>>
>>> What do you think about Andy's suggestion for doing the vfree cleanup in
>>> vmalloc
>>> with arch hooks? So the allocation goes into vfree fully setup and vmalloc
>>> frees
>>> it and on x86 resets the direct map.
>>
>> As long as you do it, I have no problem ;-)
>>
>> You would need to consider all the callers of module_memfree(), and probably
>> to untangle at least part of the mess in pageattr.c . If you are up to it,
>> just say so, and I’ll drop this patch. All I can say is “good luck with all
>> that”.
> I thought you were trying to prevent having any memory that at any time was W+X,
> how does vfree help with the module load time issues, where it starts WRX on
> x86?
I didn’t say it does. The patch I submitted before [1] should deal with the
issue of module loading, and I still think it is required. I also addressed
the kprobe and ftrace issues that you raised.
Perhaps it makes more sense that I will include the patch I proposed for
module cleanup to make the patch-set “complete”. If you finish the changes
you propose before the patch is applied, it could be dropped. I just want to
get rid of this series, as it keeps collecting more and more patches.
I suspect it will not be the last version anyhow.
[1] https://lkml.org/lkml/2018/11/21/305
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 0:53 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-05 0:53 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S,
> On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
>>>>> rick.p.edgecombe@intel.com>
>>>>> wrote:
>>>>>
>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>>>> underlying
>>>>>>>> pages,
>>>>>>>> it often leaves stale TLB entries to freed pages that could get
>>>>>>>> re-
>>>>>>>> used.
>>>>>>>> This is
>>>>>>>> undesirable for cases where the memory being freed has special
>>>>>>>> permissions
>>>>>>>> such
>>>>>>>> as executable.
>>>>>>>
>>>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>>>> mappings
>>>>>>> from taking space, by handling kprobes & ftrace that I missed
>>>>>>> (thanks
>>>>>>> again
>>>>>>> for
>>>>>>> pointing it out).
>>>>>>>
>>>>>>> But all of the sudden, I don’t understand why we have the problem
>>>>>>> that
>>>>>>> this
>>>>>>> (your) patch-set deals with at all. We already change the mappings
>>>>>>> to
>>>>>>> make
>>>>>>> the memory writable before freeing the memory, so why can’t we make
>>>>>>> it
>>>>>>> non-executable at the same time? Actually, why do we make the module
>>>>>>> memory,
>>>>>>> including its data executable before freeing it???
>>>>>>
>>>>>> Yeah, this is really confusing, but I have a suspicion it's a
>>>>>> combination
>>>>>> of the various different configurations and hysterical raisins. We
>>>>>> can't
>>>>>> rely on module_alloc() allocating from the vmalloc area (see nios2)
>>>>>> nor
>>>>>> can we rely on disable_ro_nx() being available at build time.
>>>>>>
>>>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>>>> afaict -- who cares about the memory attributes of a mapping that's
>>>>>> about
>>>>>> to disappear anyway?
>>>>>>
>>>>>> Is it just nios2 that does something different?
>>>>>>
>>>>>> Will
>>>>>
>>>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>>>> would
>>>>> solve it as well, in fact that was what I first thought the solution
>>>>> should
>>>>> be
>>>>> until this was suggested. It's interesting that from the other thread
>>>>> Masami
>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and would
>>>>> have
>>>>> inadvertently blocked this on x86. But, on the other architectures I
>>>>> have
>>>>> since
>>>>> learned it is a bit different.
>>>>>
>>>>> It looks like actually most arch's don't re-define set_memory_*, and so
>>>>> all
>>>>> of
>>>>> the frob_* functions are actually just noops. In which case allocating
>>>>> RWX
>>>>> is
>>>>> needed to make it work at all, because that is what the allocation is
>>>>> going
>>>>> to
>>>>> stay at. So in these archs, set_memory_nx won't solve it because it will
>>>>> do
>>>>> nothing.
>>>>>
>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because there
>>>>> is
>>>>> the
>>>>> changing of the permissions on the directmap as well. You don't want
>>>>> some
>>>>> other
>>>>> caller getting a page that was left RO when freed and then trying to
>>>>> write
>>>>> to
>>>>> it, if I understand this.
>>>>>
>>>>> The other reasoning was that calling set_memory_nx isn't doing what we
>>>>> are
>>>>> actually trying to do which is prevent the pages from getting released
>>>>> too
>>>>> early.
>>>>>
>>>>> A more clear solution for all of this might involve refactoring some of
>>>>> the
>>>>> set_memory_ de-allocation logic out into __weak functions in either
>>>>> modules
>>>>> or
>>>>> vmalloc. As Jessica points out in the other thread though, modules does
>>>>> a
>>>>> lot
>>>>> more stuff there than the other module_alloc callers. I think it may
>>>>> take
>>>>> some
>>>>> thought to centralize AND make it optimal for every
>>>>> module_alloc/vmalloc_exec
>>>>> user and arch.
>>>>>
>>>>> But for now with the change in vmalloc, we can block the executable
>>>>> mapping
>>>>> freed page re-use issue in a cross platform way.
>>>>
>>>> Please understand me correctly - I didn’t mean that your patches are not
>>>> needed.
>>>
>>> Ok, I think I understand. I have been pondering these same things after
>>> Masami
>>> Hiramatsu's comments on this thread the other day.
>>>
>>>> All I did is asking - how come the PTEs are executable when they are
>>>> cleared
>>>> they are executable, when in fact we manipulate them when the module is
>>>> removed.
>>>
>>> I think the directmap used to be RWX so maybe historically its trying to
>>> return
>>> it to its default state? Not sure.
>>>
>>>> I think I try to deal with a similar problem to the one you encounter -
>>>> broken W^X. The only thing that bothered me in regard to your patches (and
>>>> only after I played with the code) is that there is still a time-window in
>>>> which W^X is broken due to disable_ro_nx().
>>>
>>> Totally agree there is overlap in the fixes and we should sync.
>>>
>>> What do you think about Andy's suggestion for doing the vfree cleanup in
>>> vmalloc
>>> with arch hooks? So the allocation goes into vfree fully setup and vmalloc
>>> frees
>>> it and on x86 resets the direct map.
>>
>> As long as you do it, I have no problem ;-)
>>
>> You would need to consider all the callers of module_memfree(), and probably
>> to untangle at least part of the mess in pageattr.c . If you are up to it,
>> just say so, and I’ll drop this patch. All I can say is “good luck with all
>> that”.
> I thought you were trying to prevent having any memory that at any time was W+X,
> how does vfree help with the module load time issues, where it starts WRX on
> x86?
I didn’t say it does. The patch I submitted before [1] should deal with the
issue of module loading, and I still think it is required. I also addressed
the kprobe and ftrace issues that you raised.
Perhaps it makes more sense that I will include the patch I proposed for
module cleanup to make the patch-set “complete”. If you finish the changes
you propose before the patch is applied, it could be dropped. I just want to
get rid of this series, as it keeps collecting more and more patches.
I suspect it will not be the last version anyhow.
[1] https://lkml.org/lkml/2018/11/21/305
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-05 0:53 ` Nadav Amit
(?)
@ 2018-12-05 1:45 ` Edgecombe, Rick P
-1 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-05 1:45 UTC (permalink / raw)
To: nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S, kernel-hardening, mhiramat,
naveen.n.rao, davem, netdev, Hansen, Dave
On Tue, 2018-12-04 at 16:53 -0800, Nadav Amit wrote:
> > On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> >
> > On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
> > > > On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <
> > > > rick.p.edgecombe@intel.com>
> > > > wrote:
> > > >
> > > > On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
> > > > > > On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
> > > > > > rick.p.edgecombe@intel.com>
> > > > > > wrote:
> > > > > >
> > > > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > > > > > rick.p.edgecombe@intel.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > > > > > underlying
> > > > > > > > > pages,
> > > > > > > > > it often leaves stale TLB entries to freed pages that could
> > > > > > > > > get
> > > > > > > > > re-
> > > > > > > > > used.
> > > > > > > > > This is
> > > > > > > > > undesirable for cases where the memory being freed has special
> > > > > > > > > permissions
> > > > > > > > > such
> > > > > > > > > as executable.
> > > > > > > >
> > > > > > > > So I am trying to finish my patch-set for preventing transient
> > > > > > > > W+X
> > > > > > > > mappings
> > > > > > > > from taking space, by handling kprobes & ftrace that I missed
> > > > > > > > (thanks
> > > > > > > > again
> > > > > > > > for
> > > > > > > > pointing it out).
> > > > > > > >
> > > > > > > > But all of the sudden, I don’t understand why we have the
> > > > > > > > problem
> > > > > > > > that
> > > > > > > > this
> > > > > > > > (your) patch-set deals with at all. We already change the
> > > > > > > > mappings
> > > > > > > > to
> > > > > > > > make
> > > > > > > > the memory writable before freeing the memory, so why can’t we
> > > > > > > > make
> > > > > > > > it
> > > > > > > > non-executable at the same time? Actually, why do we make the
> > > > > > > > module
> > > > > > > > memory,
> > > > > > > > including its data executable before freeing it???
> > > > > > >
> > > > > > > Yeah, this is really confusing, but I have a suspicion it's a
> > > > > > > combination
> > > > > > > of the various different configurations and hysterical raisins. We
> > > > > > > can't
> > > > > > > rely on module_alloc() allocating from the vmalloc area (see
> > > > > > > nios2)
> > > > > > > nor
> > > > > > > can we rely on disable_ro_nx() being available at build time.
> > > > > > >
> > > > > > > If we *could* rely on module allocations always using vmalloc(),
> > > > > > > then
> > > > > > > we could pass in Rick's new flag and drop disable_ro_nx()
> > > > > > > altogether
> > > > > > > afaict -- who cares about the memory attributes of a mapping
> > > > > > > that's
> > > > > > > about
> > > > > > > to disappear anyway?
> > > > > > >
> > > > > > > Is it just nios2 that does something different?
> > > > > > >
> > > > > > > Will
> > > > > >
> > > > > > Yea it is really intertwined. I think for x86, set_memory_nx
> > > > > > everywhere
> > > > > > would
> > > > > > solve it as well, in fact that was what I first thought the solution
> > > > > > should
> > > > > > be
> > > > > > until this was suggested. It's interesting that from the other
> > > > > > thread
> > > > > > Masami
> > > > > > Hiramatsu referenced, set_memory_nx was suggested last year and
> > > > > > would
> > > > > > have
> > > > > > inadvertently blocked this on x86. But, on the other architectures I
> > > > > > have
> > > > > > since
> > > > > > learned it is a bit different.
> > > > > >
> > > > > > It looks like actually most arch's don't re-define set_memory_*, and
> > > > > > so
> > > > > > all
> > > > > > of
> > > > > > the frob_* functions are actually just noops. In which case
> > > > > > allocating
> > > > > > RWX
> > > > > > is
> > > > > > needed to make it work at all, because that is what the allocation
> > > > > > is
> > > > > > going
> > > > > > to
> > > > > > stay at. So in these archs, set_memory_nx won't solve it because it
> > > > > > will
> > > > > > do
> > > > > > nothing.
> > > > > >
> > > > > > On x86 I think you cannot get rid of disable_ro_nx fully because
> > > > > > there
> > > > > > is
> > > > > > the
> > > > > > changing of the permissions on the directmap as well. You don't want
> > > > > > some
> > > > > > other
> > > > > > caller getting a page that was left RO when freed and then trying to
> > > > > > write
> > > > > > to
> > > > > > it, if I understand this.
> > > > > >
> > > > > > The other reasoning was that calling set_memory_nx isn't doing what
> > > > > > we
> > > > > > are
> > > > > > actually trying to do which is prevent the pages from getting
> > > > > > released
> > > > > > too
> > > > > > early.
> > > > > >
> > > > > > A more clear solution for all of this might involve refactoring some
> > > > > > of
> > > > > > the
> > > > > > set_memory_ de-allocation logic out into __weak functions in either
> > > > > > modules
> > > > > > or
> > > > > > vmalloc. As Jessica points out in the other thread though, modules
> > > > > > does
> > > > > > a
> > > > > > lot
> > > > > > more stuff there than the other module_alloc callers. I think it may
> > > > > > take
> > > > > > some
> > > > > > thought to centralize AND make it optimal for every
> > > > > > module_alloc/vmalloc_exec
> > > > > > user and arch.
> > > > > >
> > > > > > But for now with the change in vmalloc, we can block the executable
> > > > > > mapping
> > > > > > freed page re-use issue in a cross platform way.
> > > > >
> > > > > Please understand me correctly - I didn’t mean that your patches are
> > > > > not
> > > > > needed.
> > > >
> > > > Ok, I think I understand. I have been pondering these same things after
> > > > Masami
> > > > Hiramatsu's comments on this thread the other day.
> > > >
> > > > > All I did is asking - how come the PTEs are executable when they are
> > > > > cleared
> > > > > they are executable, when in fact we manipulate them when the module
> > > > > is
> > > > > removed.
> > > >
> > > > I think the directmap used to be RWX so maybe historically its trying to
> > > > return
> > > > it to its default state? Not sure.
> > > >
> > > > > I think I try to deal with a similar problem to the one you encounter
> > > > > -
> > > > > broken W^X. The only thing that bothered me in regard to your patches
> > > > > (and
> > > > > only after I played with the code) is that there is still a time-
> > > > > window in
> > > > > which W^X is broken due to disable_ro_nx().
> > > >
> > > > Totally agree there is overlap in the fixes and we should sync.
> > > >
> > > > What do you think about Andy's suggestion for doing the vfree cleanup in
> > > > vmalloc
> > > > with arch hooks? So the allocation goes into vfree fully setup and
> > > > vmalloc
> > > > frees
> > > > it and on x86 resets the direct map.
> > >
> > > As long as you do it, I have no problem ;-)
> > >
> > > You would need to consider all the callers of module_memfree(), and
> > > probably
> > > to untangle at least part of the mess in pageattr.c . If you are up to it,
> > > just say so, and I’ll drop this patch. All I can say is “good luck with
> > > all
> > > that”.
> >
> > I thought you were trying to prevent having any memory that at any time was
> > W+X,
> > how does vfree help with the module load time issues, where it starts WRX on
> > x86?
>
> I didn’t say it does. The patch I submitted before [1] should deal with the
> issue of module loading, and I still think it is required. I also addressed
> the kprobe and ftrace issues that you raised.
>
> Perhaps it makes more sense that I will include the patch I proposed for
> module cleanup to make the patch-set “complete”. If you finish the changes
> you propose before the patch is applied, it could be dropped. I just want to
> get rid of this series, as it keeps collecting more and more patches.
>
> I suspect it will not be the last version anyhow.
>
> [1] https://lkml.org/lkml/2018/11/21/305
That seems fine.
And not to make it any more complicated, but how much different is a W+X mapping
from a RW mapping that is about to turn X? Can't an attacker with the ability to
write to the module space just write and wait a short time? If that is the
threat model, I think there may still be additional work to do here even after
you found all the W+X cases.
I'll take a shot at what Andy suggested in the next few days.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 1:45 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-05 1:45 UTC (permalink / raw)
To: nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S, kernel-hardening, mhiramat,
naveen.n.rao, davem, netdev, Hansen, Dave
On Tue, 2018-12-04 at 16:53 -0800, Nadav Amit wrote:
> > On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> >
> > On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
> > > > On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <
> > > > rick.p.edgecombe@intel.com>
> > > > wrote:
> > > >
> > > > On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
> > > > > > On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
> > > > > > rick.p.edgecombe@intel.com>
> > > > > > wrote:
> > > > > >
> > > > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > > > > > rick.p.edgecombe@intel.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > > > > > underlying
> > > > > > > > > pages,
> > > > > > > > > it often leaves stale TLB entries to freed pages that could
> > > > > > > > > get
> > > > > > > > > re-
> > > > > > > > > used.
> > > > > > > > > This is
> > > > > > > > > undesirable for cases where the memory being freed has special
> > > > > > > > > permissions
> > > > > > > > > such
> > > > > > > > > as executable.
> > > > > > > >
> > > > > > > > So I am trying to finish my patch-set for preventing transient
> > > > > > > > W+X
> > > > > > > > mappings
> > > > > > > > from taking space, by handling kprobes & ftrace that I missed
> > > > > > > > (thanks
> > > > > > > > again
> > > > > > > > for
> > > > > > > > pointing it out).
> > > > > > > >
> > > > > > > > But all of the sudden, I don’t understand why we have the
> > > > > > > > problem
> > > > > > > > that
> > > > > > > > this
> > > > > > > > (your) patch-set deals with at all. We already change the
> > > > > > > > mappings
> > > > > > > > to
> > > > > > > > make
> > > > > > > > the memory writable before freeing the memory, so why can’t we
> > > > > > > > make
> > > > > > > > it
> > > > > > > > non-executable at the same time? Actually, why do we make the
> > > > > > > > module
> > > > > > > > memory,
> > > > > > > > including its data executable before freeing it???
> > > > > > >
> > > > > > > Yeah, this is really confusing, but I have a suspicion it's a
> > > > > > > combination
> > > > > > > of the various different configurations and hysterical raisins. We
> > > > > > > can't
> > > > > > > rely on module_alloc() allocating from the vmalloc area (see
> > > > > > > nios2)
> > > > > > > nor
> > > > > > > can we rely on disable_ro_nx() being available at build time.
> > > > > > >
> > > > > > > If we *could* rely on module allocations always using vmalloc(),
> > > > > > > then
> > > > > > > we could pass in Rick's new flag and drop disable_ro_nx()
> > > > > > > altogether
> > > > > > > afaict -- who cares about the memory attributes of a mapping
> > > > > > > that's
> > > > > > > about
> > > > > > > to disappear anyway?
> > > > > > >
> > > > > > > Is it just nios2 that does something different?
> > > > > > >
> > > > > > > Will
> > > > > >
> > > > > > Yea it is really intertwined. I think for x86, set_memory_nx
> > > > > > everywhere
> > > > > > would
> > > > > > solve it as well, in fact that was what I first thought the solution
> > > > > > should
> > > > > > be
> > > > > > until this was suggested. It's interesting that from the other
> > > > > > thread
> > > > > > Masami
> > > > > > Hiramatsu referenced, set_memory_nx was suggested last year and
> > > > > > would
> > > > > > have
> > > > > > inadvertently blocked this on x86. But, on the other architectures I
> > > > > > have
> > > > > > since
> > > > > > learned it is a bit different.
> > > > > >
> > > > > > It looks like actually most arch's don't re-define set_memory_*, and
> > > > > > so
> > > > > > all
> > > > > > of
> > > > > > the frob_* functions are actually just noops. In which case
> > > > > > allocating
> > > > > > RWX
> > > > > > is
> > > > > > needed to make it work at all, because that is what the allocation
> > > > > > is
> > > > > > going
> > > > > > to
> > > > > > stay at. So in these archs, set_memory_nx won't solve it because it
> > > > > > will
> > > > > > do
> > > > > > nothing.
> > > > > >
> > > > > > On x86 I think you cannot get rid of disable_ro_nx fully because
> > > > > > there
> > > > > > is
> > > > > > the
> > > > > > changing of the permissions on the directmap as well. You don't want
> > > > > > some
> > > > > > other
> > > > > > caller getting a page that was left RO when freed and then trying to
> > > > > > write
> > > > > > to
> > > > > > it, if I understand this.
> > > > > >
> > > > > > The other reasoning was that calling set_memory_nx isn't doing what
> > > > > > we
> > > > > > are
> > > > > > actually trying to do which is prevent the pages from getting
> > > > > > released
> > > > > > too
> > > > > > early.
> > > > > >
> > > > > > A more clear solution for all of this might involve refactoring some
> > > > > > of
> > > > > > the
> > > > > > set_memory_ de-allocation logic out into __weak functions in either
> > > > > > modules
> > > > > > or
> > > > > > vmalloc. As Jessica points out in the other thread though, modules
> > > > > > does
> > > > > > a
> > > > > > lot
> > > > > > more stuff there than the other module_alloc callers. I think it may
> > > > > > take
> > > > > > some
> > > > > > thought to centralize AND make it optimal for every
> > > > > > module_alloc/vmalloc_exec
> > > > > > user and arch.
> > > > > >
> > > > > > But for now with the change in vmalloc, we can block the executable
> > > > > > mapping
> > > > > > freed page re-use issue in a cross platform way.
> > > > >
> > > > > Please understand me correctly - I didn’t mean that your patches are
> > > > > not
> > > > > needed.
> > > >
> > > > Ok, I think I understand. I have been pondering these same things after
> > > > Masami
> > > > Hiramatsu's comments on this thread the other day.
> > > >
> > > > > All I did is asking - how come the PTEs are executable when they are
> > > > > cleared
> > > > > they are executable, when in fact we manipulate them when the module
> > > > > is
> > > > > removed.
> > > >
> > > > I think the directmap used to be RWX so maybe historically its trying to
> > > > return
> > > > it to its default state? Not sure.
> > > >
> > > > > I think I try to deal with a similar problem to the one you encounter
> > > > > -
> > > > > broken W^X. The only thing that bothered me in regard to your patches
> > > > > (and
> > > > > only after I played with the code) is that there is still a time-
> > > > > window in
> > > > > which W^X is broken due to disable_ro_nx().
> > > >
> > > > Totally agree there is overlap in the fixes and we should sync.
> > > >
> > > > What do you think about Andy's suggestion for doing the vfree cleanup in
> > > > vmalloc
> > > > with arch hooks? So the allocation goes into vfree fully setup and
> > > > vmalloc
> > > > frees
> > > > it and on x86 resets the direct map.
> > >
> > > As long as you do it, I have no problem ;-)
> > >
> > > You would need to consider all the callers of module_memfree(), and
> > > probably
> > > to untangle at least part of the mess in pageattr.c . If you are up to it,
> > > just say so, and I’ll drop this patch. All I can say is “good luck with
> > > all
> > > that”.
> >
> > I thought you were trying to prevent having any memory that at any time was
> > W+X,
> > how does vfree help with the module load time issues, where it starts WRX on
> > x86?
>
> I didn’t say it does. The patch I submitted before [1] should deal with the
> issue of module loading, and I still think it is required. I also addressed
> the kprobe and ftrace issues that you raised.
>
> Perhaps it makes more sense that I will include the patch I proposed for
> module cleanup to make the patch-set “complete”. If you finish the changes
> you propose before the patch is applied, it could be dropped. I just want to
> get rid of this series, as it keeps collecting more and more patches.
>
> I suspect it will not be the last version anyhow.
>
> [1] https://lkml.org/lkml/2018/11/21/305
That seems fine.
And not to make it any more complicated, but how much different is a W+X mapping
from a RW mapping that is about to turn X? Can't an attacker with the ability to
write to the module space just write and wait a short time? If that is the
threat model, I think there may still be additional work to do here even after
you found all the W+X cases.
I'll take a shot at what Andy suggested in the next few days.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 1:45 ` Edgecombe, Rick P
0 siblings, 0 replies; 117+ messages in thread
From: Edgecombe, Rick P @ 2018-12-05 1:45 UTC (permalink / raw)
To: nadav.amit
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S,
On Tue, 2018-12-04 at 16:53 -0800, Nadav Amit wrote:
> > On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
> > wrote:
> >
> > On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
> > > > On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <
> > > > rick.p.edgecombe@intel.com>
> > > > wrote:
> > > >
> > > > On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
> > > > > > On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
> > > > > > rick.p.edgecombe@intel.com>
> > > > > > wrote:
> > > > > >
> > > > > > On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
> > > > > > > On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
> > > > > > > > > On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
> > > > > > > > > rick.p.edgecombe@intel.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Since vfree will lazily flush the TLB, but not lazily free the
> > > > > > > > > underlying
> > > > > > > > > pages,
> > > > > > > > > it often leaves stale TLB entries to freed pages that could
> > > > > > > > > get
> > > > > > > > > re-
> > > > > > > > > used.
> > > > > > > > > This is
> > > > > > > > > undesirable for cases where the memory being freed has special
> > > > > > > > > permissions
> > > > > > > > > such
> > > > > > > > > as executable.
> > > > > > > >
> > > > > > > > So I am trying to finish my patch-set for preventing transient
> > > > > > > > W+X
> > > > > > > > mappings
> > > > > > > > from taking space, by handling kprobes & ftrace that I missed
> > > > > > > > (thanks
> > > > > > > > again
> > > > > > > > for
> > > > > > > > pointing it out).
> > > > > > > >
> > > > > > > > But all of the sudden, I don’t understand why we have the
> > > > > > > > problem
> > > > > > > > that
> > > > > > > > this
> > > > > > > > (your) patch-set deals with at all. We already change the
> > > > > > > > mappings
> > > > > > > > to
> > > > > > > > make
> > > > > > > > the memory writable before freeing the memory, so why can’t we
> > > > > > > > make
> > > > > > > > it
> > > > > > > > non-executable at the same time? Actually, why do we make the
> > > > > > > > module
> > > > > > > > memory,
> > > > > > > > including its data executable before freeing it???
> > > > > > >
> > > > > > > Yeah, this is really confusing, but I have a suspicion it's a
> > > > > > > combination
> > > > > > > of the various different configurations and hysterical raisins. We
> > > > > > > can't
> > > > > > > rely on module_alloc() allocating from the vmalloc area (see
> > > > > > > nios2)
> > > > > > > nor
> > > > > > > can we rely on disable_ro_nx() being available at build time.
> > > > > > >
> > > > > > > If we *could* rely on module allocations always using vmalloc(),
> > > > > > > then
> > > > > > > we could pass in Rick's new flag and drop disable_ro_nx()
> > > > > > > altogether
> > > > > > > afaict -- who cares about the memory attributes of a mapping
> > > > > > > that's
> > > > > > > about
> > > > > > > to disappear anyway?
> > > > > > >
> > > > > > > Is it just nios2 that does something different?
> > > > > > >
> > > > > > > Will
> > > > > >
> > > > > > Yea it is really intertwined. I think for x86, set_memory_nx
> > > > > > everywhere
> > > > > > would
> > > > > > solve it as well, in fact that was what I first thought the solution
> > > > > > should
> > > > > > be
> > > > > > until this was suggested. It's interesting that from the other
> > > > > > thread
> > > > > > Masami
> > > > > > Hiramatsu referenced, set_memory_nx was suggested last year and
> > > > > > would
> > > > > > have
> > > > > > inadvertently blocked this on x86. But, on the other architectures I
> > > > > > have
> > > > > > since
> > > > > > learned it is a bit different.
> > > > > >
> > > > > > It looks like actually most arch's don't re-define set_memory_*, and
> > > > > > so
> > > > > > all
> > > > > > of
> > > > > > the frob_* functions are actually just noops. In which case
> > > > > > allocating
> > > > > > RWX
> > > > > > is
> > > > > > needed to make it work at all, because that is what the allocation
> > > > > > is
> > > > > > going
> > > > > > to
> > > > > > stay at. So in these archs, set_memory_nx won't solve it because it
> > > > > > will
> > > > > > do
> > > > > > nothing.
> > > > > >
> > > > > > On x86 I think you cannot get rid of disable_ro_nx fully because
> > > > > > there
> > > > > > is
> > > > > > the
> > > > > > changing of the permissions on the directmap as well. You don't want
> > > > > > some
> > > > > > other
> > > > > > caller getting a page that was left RO when freed and then trying to
> > > > > > write
> > > > > > to
> > > > > > it, if I understand this.
> > > > > >
> > > > > > The other reasoning was that calling set_memory_nx isn't doing what
> > > > > > we
> > > > > > are
> > > > > > actually trying to do which is prevent the pages from getting
> > > > > > released
> > > > > > too
> > > > > > early.
> > > > > >
> > > > > > A more clear solution for all of this might involve refactoring some
> > > > > > of
> > > > > > the
> > > > > > set_memory_ de-allocation logic out into __weak functions in either
> > > > > > modules
> > > > > > or
> > > > > > vmalloc. As Jessica points out in the other thread though, modules
> > > > > > does
> > > > > > a
> > > > > > lot
> > > > > > more stuff there than the other module_alloc callers. I think it may
> > > > > > take
> > > > > > some
> > > > > > thought to centralize AND make it optimal for every
> > > > > > module_alloc/vmalloc_exec
> > > > > > user and arch.
> > > > > >
> > > > > > But for now with the change in vmalloc, we can block the executable
> > > > > > mapping
> > > > > > freed page re-use issue in a cross platform way.
> > > > >
> > > > > Please understand me correctly - I didn’t mean that your patches are
> > > > > not
> > > > > needed.
> > > >
> > > > Ok, I think I understand. I have been pondering these same things after
> > > > Masami
> > > > Hiramatsu's comments on this thread the other day.
> > > >
> > > > > All I did is asking - how come the PTEs are executable when they are
> > > > > cleared
> > > > > they are executable, when in fact we manipulate them when the module
> > > > > is
> > > > > removed.
> > > >
> > > > I think the directmap used to be RWX so maybe historically its trying to
> > > > return
> > > > it to its default state? Not sure.
> > > >
> > > > > I think I try to deal with a similar problem to the one you encounter
> > > > > -
> > > > > broken W^X. The only thing that bothered me in regard to your patches
> > > > > (and
> > > > > only after I played with the code) is that there is still a time-
> > > > > window in
> > > > > which W^X is broken due to disable_ro_nx().
> > > >
> > > > Totally agree there is overlap in the fixes and we should sync.
> > > >
> > > > What do you think about Andy's suggestion for doing the vfree cleanup in
> > > > vmalloc
> > > > with arch hooks? So the allocation goes into vfree fully setup and
> > > > vmalloc
> > > > frees
> > > > it and on x86 resets the direct map.
> > >
> > > As long as you do it, I have no problem ;-)
> > >
> > > You would need to consider all the callers of module_memfree(), and
> > > probably
> > > to untangle at least part of the mess in pageattr.c . If you are up to it,
> > > just say so, and I’ll drop this patch. All I can say is “good luck with
> > > all
> > > that”.
> >
> > I thought you were trying to prevent having any memory that at any time was
> > W+X,
> > how does vfree help with the module load time issues, where it starts WRX on
> > x86?
>
> I didn’t say it does. The patch I submitted before [1] should deal with the
> issue of module loading, and I still think it is required. I also addressed
> the kprobe and ftrace issues that you raised.
>
> Perhaps it makes more sense that I will include the patch I proposed for
> module cleanup to make the patch-set “complete”. If you finish the changes
> you propose before the patch is applied, it could be dropped. I just want to
> get rid of this series, as it keeps collecting more and more patches.
>
> I suspect it will not be the last version anyhow.
>
> [1] https://lkml.org/lkml/2018/11/21/305
That seems fine.
And not to make it any more complicated, but how much different is a W+X mapping
from a RW mapping that is about to turn X? Can't an attacker with the ability to
write to the module space just write and wait a short time? If that is the
threat model, I think there may still be additional work to do here even after
you found all the W+X cases.
I'll take a shot at what Andy suggested in the next few days.
Thanks,
Rick
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
2018-12-05 1:45 ` Edgecombe, Rick P
(?)
@ 2018-12-05 2:09 ` Nadav Amit
-1 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-05 2:09 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S, kernel-hardening, mhiramat,
naveen.n.rao, davem, netdev, Hansen, Dave
> On Dec 4, 2018, at 5:45 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:53 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
>>>>> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <
>>>>> rick.p.edgecombe@intel.com>
>>>>> wrote:
>>>>>
>>>>> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>>>>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>>>>>> underlying
>>>>>>>>>> pages,
>>>>>>>>>> it often leaves stale TLB entries to freed pages that could
>>>>>>>>>> get
>>>>>>>>>> re-
>>>>>>>>>> used.
>>>>>>>>>> This is
>>>>>>>>>> undesirable for cases where the memory being freed has special
>>>>>>>>>> permissions
>>>>>>>>>> such
>>>>>>>>>> as executable.
>>>>>>>>>
>>>>>>>>> So I am trying to finish my patch-set for preventing transient
>>>>>>>>> W+X
>>>>>>>>> mappings
>>>>>>>>> from taking space, by handling kprobes & ftrace that I missed
>>>>>>>>> (thanks
>>>>>>>>> again
>>>>>>>>> for
>>>>>>>>> pointing it out).
>>>>>>>>>
>>>>>>>>> But all of the sudden, I don’t understand why we have the
>>>>>>>>> problem
>>>>>>>>> that
>>>>>>>>> this
>>>>>>>>> (your) patch-set deals with at all. We already change the
>>>>>>>>> mappings
>>>>>>>>> to
>>>>>>>>> make
>>>>>>>>> the memory writable before freeing the memory, so why can’t we
>>>>>>>>> make
>>>>>>>>> it
>>>>>>>>> non-executable at the same time? Actually, why do we make the
>>>>>>>>> module
>>>>>>>>> memory,
>>>>>>>>> including its data executable before freeing it???
>>>>>>>>
>>>>>>>> Yeah, this is really confusing, but I have a suspicion it's a
>>>>>>>> combination
>>>>>>>> of the various different configurations and hysterical raisins. We
>>>>>>>> can't
>>>>>>>> rely on module_alloc() allocating from the vmalloc area (see
>>>>>>>> nios2)
>>>>>>>> nor
>>>>>>>> can we rely on disable_ro_nx() being available at build time.
>>>>>>>>
>>>>>>>> If we *could* rely on module allocations always using vmalloc(),
>>>>>>>> then
>>>>>>>> we could pass in Rick's new flag and drop disable_ro_nx()
>>>>>>>> altogether
>>>>>>>> afaict -- who cares about the memory attributes of a mapping
>>>>>>>> that's
>>>>>>>> about
>>>>>>>> to disappear anyway?
>>>>>>>>
>>>>>>>> Is it just nios2 that does something different?
>>>>>>>>
>>>>>>>> Will
>>>>>>>
>>>>>>> Yea it is really intertwined. I think for x86, set_memory_nx
>>>>>>> everywhere
>>>>>>> would
>>>>>>> solve it as well, in fact that was what I first thought the solution
>>>>>>> should
>>>>>>> be
>>>>>>> until this was suggested. It's interesting that from the other
>>>>>>> thread
>>>>>>> Masami
>>>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and
>>>>>>> would
>>>>>>> have
>>>>>>> inadvertently blocked this on x86. But, on the other architectures I
>>>>>>> have
>>>>>>> since
>>>>>>> learned it is a bit different.
>>>>>>>
>>>>>>> It looks like actually most arch's don't re-define set_memory_*, and
>>>>>>> so
>>>>>>> all
>>>>>>> of
>>>>>>> the frob_* functions are actually just noops. In which case
>>>>>>> allocating
>>>>>>> RWX
>>>>>>> is
>>>>>>> needed to make it work at all, because that is what the allocation
>>>>>>> is
>>>>>>> going
>>>>>>> to
>>>>>>> stay at. So in these archs, set_memory_nx won't solve it because it
>>>>>>> will
>>>>>>> do
>>>>>>> nothing.
>>>>>>>
>>>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because
>>>>>>> there
>>>>>>> is
>>>>>>> the
>>>>>>> changing of the permissions on the directmap as well. You don't want
>>>>>>> some
>>>>>>> other
>>>>>>> caller getting a page that was left RO when freed and then trying to
>>>>>>> write
>>>>>>> to
>>>>>>> it, if I understand this.
>>>>>>>
>>>>>>> The other reasoning was that calling set_memory_nx isn't doing what
>>>>>>> we
>>>>>>> are
>>>>>>> actually trying to do which is prevent the pages from getting
>>>>>>> released
>>>>>>> too
>>>>>>> early.
>>>>>>>
>>>>>>> A more clear solution for all of this might involve refactoring some
>>>>>>> of
>>>>>>> the
>>>>>>> set_memory_ de-allocation logic out into __weak functions in either
>>>>>>> modules
>>>>>>> or
>>>>>>> vmalloc. As Jessica points out in the other thread though, modules
>>>>>>> does
>>>>>>> a
>>>>>>> lot
>>>>>>> more stuff there than the other module_alloc callers. I think it may
>>>>>>> take
>>>>>>> some
>>>>>>> thought to centralize AND make it optimal for every
>>>>>>> module_alloc/vmalloc_exec
>>>>>>> user and arch.
>>>>>>>
>>>>>>> But for now with the change in vmalloc, we can block the executable
>>>>>>> mapping
>>>>>>> freed page re-use issue in a cross platform way.
>>>>>>
>>>>>> Please understand me correctly - I didn’t mean that your patches are
>>>>>> not
>>>>>> needed.
>>>>>
>>>>> Ok, I think I understand. I have been pondering these same things after
>>>>> Masami
>>>>> Hiramatsu's comments on this thread the other day.
>>>>>
>>>>>> All I did is asking - how come the PTEs are executable when they are
>>>>>> cleared
>>>>>> they are executable, when in fact we manipulate them when the module
>>>>>> is
>>>>>> removed.
>>>>>
>>>>> I think the directmap used to be RWX so maybe historically its trying to
>>>>> return
>>>>> it to its default state? Not sure.
>>>>>
>>>>>> I think I try to deal with a similar problem to the one you encounter
>>>>>> -
>>>>>> broken W^X. The only thing that bothered me in regard to your patches
>>>>>> (and
>>>>>> only after I played with the code) is that there is still a time-
>>>>>> window in
>>>>>> which W^X is broken due to disable_ro_nx().
>>>>>
>>>>> Totally agree there is overlap in the fixes and we should sync.
>>>>>
>>>>> What do you think about Andy's suggestion for doing the vfree cleanup in
>>>>> vmalloc
>>>>> with arch hooks? So the allocation goes into vfree fully setup and
>>>>> vmalloc
>>>>> frees
>>>>> it and on x86 resets the direct map.
>>>>
>>>> As long as you do it, I have no problem ;-)
>>>>
>>>> You would need to consider all the callers of module_memfree(), and
>>>> probably
>>>> to untangle at least part of the mess in pageattr.c . If you are up to it,
>>>> just say so, and I’ll drop this patch. All I can say is “good luck with
>>>> all
>>>> that”.
>>>
>>> I thought you were trying to prevent having any memory that at any time was
>>> W+X,
>>> how does vfree help with the module load time issues, where it starts WRX on
>>> x86?
>>
>> I didn’t say it does. The patch I submitted before [1] should deal with the
>> issue of module loading, and I still think it is required. I also addressed
>> the kprobe and ftrace issues that you raised.
>>
>> Perhaps it makes more sense that I will include the patch I proposed for
>> module cleanup to make the patch-set “complete”. If you finish the changes
>> you propose before the patch is applied, it could be dropped. I just want to
>> get rid of this series, as it keeps collecting more and more patches.
>>
>> I suspect it will not be the last version anyhow.
>>
>> [1] https://lkml.org/lkml/2018/11/21/305
>
> That seems fine.
>
> And not to make it any more complicated, but how much different is a W+X mapping
> from a RW mapping that is about to turn X? Can't an attacker with the ability to
> write to the module space just write and wait a short time? If that is the
> threat model, I think there may still be additional work to do here even after
> you found all the W+X cases.
I agree that a complete solution may require to block any direct write onto
a code-page. When I say “complete”, I mean for a threat model in which
dangling pointers are used to inject code, but not to run existing ROP/JOP
gadgets. (I didn’t think too deeply on the threat-model, so perhaps it needs
to be further refined).
I think the first stage is to make everybody go through a unified interface
(text_poke() and text_poke_early()). ftrace, for example, uses an
independent mechanism to change the code.
Eventually, after boot text_poke_early() should not be used, and text_poke()
(or something similar) should be used instead. Alternatively, when module
text is loaded, a hash value can be computed and calculated over it.
Since Igor Stoppa wants to use the infrastructure that is included in the
first patches, and since I didn’t intend this patch-set to be a full
solution for W^X (I was pushed there by tglx+Andy [1]), it may be enough
as a first step.
[1] https://lore.kernel.org/patchwork/patch/1006293/#1191341
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 2:09 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-05 2:09 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S, kernel-hardening, mhiramat,
naveen.n.rao, davem, netdev, Hansen, Dave
> On Dec 4, 2018, at 5:45 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:53 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
>>>>> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <
>>>>> rick.p.edgecombe@intel.com>
>>>>> wrote:
>>>>>
>>>>> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>>>>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>>>>>> underlying
>>>>>>>>>> pages,
>>>>>>>>>> it often leaves stale TLB entries to freed pages that could
>>>>>>>>>> get
>>>>>>>>>> re-
>>>>>>>>>> used.
>>>>>>>>>> This is
>>>>>>>>>> undesirable for cases where the memory being freed has special
>>>>>>>>>> permissions
>>>>>>>>>> such
>>>>>>>>>> as executable.
>>>>>>>>>
>>>>>>>>> So I am trying to finish my patch-set for preventing transient
>>>>>>>>> W+X
>>>>>>>>> mappings
>>>>>>>>> from taking space, by handling kprobes & ftrace that I missed
>>>>>>>>> (thanks
>>>>>>>>> again
>>>>>>>>> for
>>>>>>>>> pointing it out).
>>>>>>>>>
>>>>>>>>> But all of the sudden, I don’t understand why we have the
>>>>>>>>> problem
>>>>>>>>> that
>>>>>>>>> this
>>>>>>>>> (your) patch-set deals with at all. We already change the
>>>>>>>>> mappings
>>>>>>>>> to
>>>>>>>>> make
>>>>>>>>> the memory writable before freeing the memory, so why can’t we
>>>>>>>>> make
>>>>>>>>> it
>>>>>>>>> non-executable at the same time? Actually, why do we make the
>>>>>>>>> module
>>>>>>>>> memory,
>>>>>>>>> including its data executable before freeing it???
>>>>>>>>
>>>>>>>> Yeah, this is really confusing, but I have a suspicion it's a
>>>>>>>> combination
>>>>>>>> of the various different configurations and hysterical raisins. We
>>>>>>>> can't
>>>>>>>> rely on module_alloc() allocating from the vmalloc area (see
>>>>>>>> nios2)
>>>>>>>> nor
>>>>>>>> can we rely on disable_ro_nx() being available at build time.
>>>>>>>>
>>>>>>>> If we *could* rely on module allocations always using vmalloc(),
>>>>>>>> then
>>>>>>>> we could pass in Rick's new flag and drop disable_ro_nx()
>>>>>>>> altogether
>>>>>>>> afaict -- who cares about the memory attributes of a mapping
>>>>>>>> that's
>>>>>>>> about
>>>>>>>> to disappear anyway?
>>>>>>>>
>>>>>>>> Is it just nios2 that does something different?
>>>>>>>>
>>>>>>>> Will
>>>>>>>
>>>>>>> Yea it is really intertwined. I think for x86, set_memory_nx
>>>>>>> everywhere
>>>>>>> would
>>>>>>> solve it as well, in fact that was what I first thought the solution
>>>>>>> should
>>>>>>> be
>>>>>>> until this was suggested. It's interesting that from the other
>>>>>>> thread
>>>>>>> Masami
>>>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and
>>>>>>> would
>>>>>>> have
>>>>>>> inadvertently blocked this on x86. But, on the other architectures I
>>>>>>> have
>>>>>>> since
>>>>>>> learned it is a bit different.
>>>>>>>
>>>>>>> It looks like actually most arch's don't re-define set_memory_*, and
>>>>>>> so
>>>>>>> all
>>>>>>> of
>>>>>>> the frob_* functions are actually just noops. In which case
>>>>>>> allocating
>>>>>>> RWX
>>>>>>> is
>>>>>>> needed to make it work at all, because that is what the allocation
>>>>>>> is
>>>>>>> going
>>>>>>> to
>>>>>>> stay at. So in these archs, set_memory_nx won't solve it because it
>>>>>>> will
>>>>>>> do
>>>>>>> nothing.
>>>>>>>
>>>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because
>>>>>>> there
>>>>>>> is
>>>>>>> the
>>>>>>> changing of the permissions on the directmap as well. You don't want
>>>>>>> some
>>>>>>> other
>>>>>>> caller getting a page that was left RO when freed and then trying to
>>>>>>> write
>>>>>>> to
>>>>>>> it, if I understand this.
>>>>>>>
>>>>>>> The other reasoning was that calling set_memory_nx isn't doing what
>>>>>>> we
>>>>>>> are
>>>>>>> actually trying to do which is prevent the pages from getting
>>>>>>> released
>>>>>>> too
>>>>>>> early.
>>>>>>>
>>>>>>> A more clear solution for all of this might involve refactoring some
>>>>>>> of
>>>>>>> the
>>>>>>> set_memory_ de-allocation logic out into __weak functions in either
>>>>>>> modules
>>>>>>> or
>>>>>>> vmalloc. As Jessica points out in the other thread though, modules
>>>>>>> does
>>>>>>> a
>>>>>>> lot
>>>>>>> more stuff there than the other module_alloc callers. I think it may
>>>>>>> take
>>>>>>> some
>>>>>>> thought to centralize AND make it optimal for every
>>>>>>> module_alloc/vmalloc_exec
>>>>>>> user and arch.
>>>>>>>
>>>>>>> But for now with the change in vmalloc, we can block the executable
>>>>>>> mapping
>>>>>>> freed page re-use issue in a cross platform way.
>>>>>>
>>>>>> Please understand me correctly - I didn’t mean that your patches are
>>>>>> not
>>>>>> needed.
>>>>>
>>>>> Ok, I think I understand. I have been pondering these same things after
>>>>> Masami
>>>>> Hiramatsu's comments on this thread the other day.
>>>>>
>>>>>> All I did is asking - how come the PTEs are executable when they are
>>>>>> cleared
>>>>>> they are executable, when in fact we manipulate them when the module
>>>>>> is
>>>>>> removed.
>>>>>
>>>>> I think the directmap used to be RWX so maybe historically its trying to
>>>>> return
>>>>> it to its default state? Not sure.
>>>>>
>>>>>> I think I try to deal with a similar problem to the one you encounter
>>>>>> -
>>>>>> broken W^X. The only thing that bothered me in regard to your patches
>>>>>> (and
>>>>>> only after I played with the code) is that there is still a time-
>>>>>> window in
>>>>>> which W^X is broken due to disable_ro_nx().
>>>>>
>>>>> Totally agree there is overlap in the fixes and we should sync.
>>>>>
>>>>> What do you think about Andy's suggestion for doing the vfree cleanup in
>>>>> vmalloc
>>>>> with arch hooks? So the allocation goes into vfree fully setup and
>>>>> vmalloc
>>>>> frees
>>>>> it and on x86 resets the direct map.
>>>>
>>>> As long as you do it, I have no problem ;-)
>>>>
>>>> You would need to consider all the callers of module_memfree(), and
>>>> probably
>>>> to untangle at least part of the mess in pageattr.c . If you are up to it,
>>>> just say so, and I’ll drop this patch. All I can say is “good luck with
>>>> all
>>>> that”.
>>>
>>> I thought you were trying to prevent having any memory that at any time was
>>> W+X,
>>> how does vfree help with the module load time issues, where it starts WRX on
>>> x86?
>>
>> I didn’t say it does. The patch I submitted before [1] should deal with the
>> issue of module loading, and I still think it is required. I also addressed
>> the kprobe and ftrace issues that you raised.
>>
>> Perhaps it makes more sense that I will include the patch I proposed for
>> module cleanup to make the patch-set “complete”. If you finish the changes
>> you propose before the patch is applied, it could be dropped. I just want to
>> get rid of this series, as it keeps collecting more and more patches.
>>
>> I suspect it will not be the last version anyhow.
>>
>> [1] https://lkml.org/lkml/2018/11/21/305
>
> That seems fine.
>
> And not to make it any more complicated, but how much different is a W+X mapping
> from a RW mapping that is about to turn X? Can't an attacker with the ability to
> write to the module space just write and wait a short time? If that is the
> threat model, I think there may still be additional work to do here even after
> you found all the W+X cases.
I agree that a complete solution may require to block any direct write onto
a code-page. When I say “complete”, I mean for a threat model in which
dangling pointers are used to inject code, but not to run existing ROP/JOP
gadgets. (I didn’t think too deeply on the threat-model, so perhaps it needs
to be further refined).
I think the first stage is to make everybody go through a unified interface
(text_poke() and text_poke_early()). ftrace, for example, uses an
independent mechanism to change the code.
Eventually, after boot text_poke_early() should not be used, and text_poke()
(or something similar) should be used instead. Alternatively, when module
text is loaded, a hash value can be computed and calculated over it.
Since Igor Stoppa wants to use the infrastructure that is included in the
first patches, and since I didn’t intend this patch-set to be a full
solution for W^X (I was pushed there by tglx+Andy [1]), it may be enough
as a first step.
[1] https://lore.kernel.org/patchwork/patch/1006293/#1191341
^ permalink raw reply [flat|nested] 117+ messages in thread
* Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
@ 2018-12-05 2:09 ` Nadav Amit
0 siblings, 0 replies; 117+ messages in thread
From: Nadav Amit @ 2018-12-05 2:09 UTC (permalink / raw)
To: Edgecombe, Rick P
Cc: linux-kernel, daniel, ard.biesheuvel, jeyu, rostedt, linux-mm,
jannh, ast, Dock, Deneen T, peterz, kristen, akpm, will.deacon,
mingo, luto, Keshavamurthy, Anil S,
> On Dec 4, 2018, at 5:45 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
>
> On Tue, 2018-12-04 at 16:53 -0800, Nadav Amit wrote:
>>> On Dec 4, 2018, at 4:29 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com>
>>> wrote:
>>>
>>> On Tue, 2018-12-04 at 16:01 -0800, Nadav Amit wrote:
>>>>> On Dec 4, 2018, at 3:51 PM, Edgecombe, Rick P <
>>>>> rick.p.edgecombe@intel.com>
>>>>> wrote:
>>>>>
>>>>> On Tue, 2018-12-04 at 12:36 -0800, Nadav Amit wrote:
>>>>>>> On Dec 4, 2018, at 12:02 PM, Edgecombe, Rick P <
>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>>>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>>>>>> rick.p.edgecombe@intel.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>>>>>> underlying
>>>>>>>>>> pages,
>>>>>>>>>> it often leaves stale TLB entries to freed pages that could
>>>>>>>>>> get
>>>>>>>>>> re-
>>>>>>>>>> used.
>>>>>>>>>> This is
>>>>>>>>>> undesirable for cases where the memory being freed has special
>>>>>>>>>> permissions
>>>>>>>>>> such
>>>>>>>>>> as executable.
>>>>>>>>>
>>>>>>>>> So I am trying to finish my patch-set for preventing transient
>>>>>>>>> W+X
>>>>>>>>> mappings
>>>>>>>>> from taking space, by handling kprobes & ftrace that I missed
>>>>>>>>> (thanks
>>>>>>>>> again
>>>>>>>>> for
>>>>>>>>> pointing it out).
>>>>>>>>>
>>>>>>>>> But all of the sudden, I don’t understand why we have the
>>>>>>>>> problem
>>>>>>>>> that
>>>>>>>>> this
>>>>>>>>> (your) patch-set deals with at all. We already change the
>>>>>>>>> mappings
>>>>>>>>> to
>>>>>>>>> make
>>>>>>>>> the memory writable before freeing the memory, so why can’t we
>>>>>>>>> make
>>>>>>>>> it
>>>>>>>>> non-executable at the same time? Actually, why do we make the
>>>>>>>>> module
>>>>>>>>> memory,
>>>>>>>>> including its data executable before freeing it???
>>>>>>>>
>>>>>>>> Yeah, this is really confusing, but I have a suspicion it's a
>>>>>>>> combination
>>>>>>>> of the various different configurations and hysterical raisins. We
>>>>>>>> can't
>>>>>>>> rely on module_alloc() allocating from the vmalloc area (see
>>>>>>>> nios2)
>>>>>>>> nor
>>>>>>>> can we rely on disable_ro_nx() being available at build time.
>>>>>>>>
>>>>>>>> If we *could* rely on module allocations always using vmalloc(),
>>>>>>>> then
>>>>>>>> we could pass in Rick's new flag and drop disable_ro_nx()
>>>>>>>> altogether
>>>>>>>> afaict -- who cares about the memory attributes of a mapping
>>>>>>>> that's
>>>>>>>> about
>>>>>>>> to disappear anyway?
>>>>>>>>
>>>>>>>> Is it just nios2 that does something different?
>>>>>>>>
>>>>>>>> Will
>>>>>>>
>>>>>>> Yea it is really intertwined. I think for x86, set_memory_nx
>>>>>>> everywhere
>>>>>>> would
>>>>>>> solve it as well, in fact that was what I first thought the solution
>>>>>>> should
>>>>>>> be
>>>>>>> until this was suggested. It's interesting that from the other
>>>>>>> thread
>>>>>>> Masami
>>>>>>> Hiramatsu referenced, set_memory_nx was suggested last year and
>>>>>>> would
>>>>>>> have
>>>>>>> inadvertently blocked this on x86. But, on the other architectures I
>>>>>>> have
>>>>>>> since
>>>>>>> learned it is a bit different.
>>>>>>>
>>>>>>> It looks like actually most arch's don't re-define set_memory_*, and
>>>>>>> so
>>>>>>> all
>>>>>>> of
>>>>>>> the frob_* functions are actually just noops. In which case
>>>>>>> allocating
>>>>>>> RWX
>>>>>>> is
>>>>>>> needed to make it work at all, because that is what the allocation
>>>>>>> is
>>>>>>> going
>>>>>>> to
>>>>>>> stay at. So in these archs, set_memory_nx won't solve it because it
>>>>>>> will
>>>>>>> do
>>>>>>> nothing.
>>>>>>>
>>>>>>> On x86 I think you cannot get rid of disable_ro_nx fully because
>>>>>>> there
>>>>>>> is
>>>>>>> the
>>>>>>> changing of the permissions on the directmap as well. You don't want
>>>>>>> some
>>>>>>> other
>>>>>>> caller getting a page that was left RO when freed and then trying to
>>>>>>> write
>>>>>>> to
>>>>>>> it, if I understand this.
>>>>>>>
>>>>>>> The other reasoning was that calling set_memory_nx isn't doing what
>>>>>>> we
>>>>>>> are
>>>>>>> actually trying to do which is prevent the pages from getting
>>>>>>> released
>>>>>>> too
>>>>>>> early.
>>>>>>>
>>>>>>> A more clear solution for all of this might involve refactoring some
>>>>>>> of
>>>>>>> the
>>>>>>> set_memory_ de-allocation logic out into __weak functions in either
>>>>>>> modules
>>>>>>> or
>>>>>>> vmalloc. As Jessica points out in the other thread though, modules
>>>>>>> does
>>>>>>> a
>>>>>>> lot
>>>>>>> more stuff there than the other module_alloc callers. I think it may
>>>>>>> take
>>>>>>> some
>>>>>>> thought to centralize AND make it optimal for every
>>>>>>> module_alloc/vmalloc_exec
>>>>>>> user and arch.
>>>>>>>
>>>>>>> But for now with the change in vmalloc, we can block the executable
>>>>>>> mapping
>>>>>>> freed page re-use issue in a cross platform way.
>>>>>>
>>>>>> Please understand me correctly - I didn’t mean that your patches are
>>>>>> not
>>>>>> needed.
>>>>>
>>>>> Ok, I think I understand. I have been pondering these same things after
>>>>> Masami
>>>>> Hiramatsu's comments on this thread the other day.
>>>>>
>>>>>> All I did is asking - how come the PTEs are executable when they are
>>>>>> cleared
>>>>>> they are executable, when in fact we manipulate them when the module
>>>>>> is
>>>>>> removed.
>>>>>
>>>>> I think the directmap used to be RWX so maybe historically its trying to
>>>>> return
>>>>> it to its default state? Not sure.
>>>>>
>>>>>> I think I try to deal with a similar problem to the one you encounter
>>>>>> -
>>>>>> broken W^X. The only thing that bothered me in regard to your patches
>>>>>> (and
>>>>>> only after I played with the code) is that there is still a time-
>>>>>> window in
>>>>>> which W^X is broken due to disable_ro_nx().
>>>>>
>>>>> Totally agree there is overlap in the fixes and we should sync.
>>>>>
>>>>> What do you think about Andy's suggestion for doing the vfree cleanup in
>>>>> vmalloc
>>>>> with arch hooks? So the allocation goes into vfree fully setup and
>>>>> vmalloc
>>>>> frees
>>>>> it and on x86 resets the direct map.
>>>>
>>>> As long as you do it, I have no problem ;-)
>>>>
>>>> You would need to consider all the callers of module_memfree(), and
>>>> probably
>>>> to untangle at least part of the mess in pageattr.c . If you are up to it,
>>>> just say so, and I’ll drop this patch. All I can say is “good luck with
>>>> all
>>>> that”.
>>>
>>> I thought you were trying to prevent having any memory that at any time was
>>> W+X,
>>> how does vfree help with the module load time issues, where it starts WRX on
>>> x86?
>>
>> I didn’t say it does. The patch I submitted before [1] should deal with the
>> issue of module loading, and I still think it is required. I also addressed
>> the kprobe and ftrace issues that you raised.
>>
>> Perhaps it makes more sense that I will include the patch I proposed for
>> module cleanup to make the patch-set “complete”. If you finish the changes
>> you propose before the patch is applied, it could be dropped. I just want to
>> get rid of this series, as it keeps collecting more and more patches.
>>
>> I suspect it will not be the last version anyhow.
>>
>> [1] https://lkml.org/lkml/2018/11/21/305
>
> That seems fine.
>
> And not to make it any more complicated, but how much different is a W+X mapping
> from a RW mapping that is about to turn X? Can't an attacker with the ability to
> write to the module space just write and wait a short time? If that is the
> threat model, I think there may still be additional work to do here even after
> you found all the W+X cases.
I agree that a complete solution may require to block any direct write onto
a code-page. When I say “complete”, I mean for a threat model in which
dangling pointers are used to inject code, but not to run existing ROP/JOP
gadgets. (I didn’t think too deeply on the threat-model, so perhaps it needs
to be further refined).
I think the first stage is to make everybody go through a unified interface
(text_poke() and text_poke_early()). ftrace, for example, uses an
independent mechanism to change the code.
Eventually, after boot text_poke_early() should not be used, and text_poke()
(or something similar) should be used instead. Alternatively, when module
text is loaded, a hash value can be computed and calculated over it.
Since Igor Stoppa wants to use the infrastructure that is included in the
first patches, and since I didn’t intend this patch-set to be a full
solution for W^X (I was pushed there by tglx+Andy [1]), it may be enough
as a first step.
[1] https://lore.kernel.org/patchwork/patch/1006293/#1191341
^ permalink raw reply [flat|nested] 117+ messages in thread