All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andy Lutomirski <luto@amacapital.net>
To: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: "luto@kernel.org" <luto@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"ard.biesheuvel@linaro.org" <ard.biesheuvel@linaro.org>,
	"jeyu@kernel.org" <jeyu@kernel.org>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	"ast@kernel.org" <ast@kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"nadav.amit@gmail.com" <nadav.amit@gmail.com>,
	"Dock, Deneen T" <deneen.t.dock@intel.com>,
	"jannh@google.com" <jannh@google.com>,
	"kristen@linux.intel.com" <kristen@linux.intel.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"will.deacon@arm.com" <will.deacon@arm.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"Keshavamurthy, Anil S" <anil.s.keshavamurthy@intel.com>,
	"kernel-hardening@lists.openwall.com" 
	<kernel-hardening@lists.openwall.com>,
	"mhiramat@kernel.org" <mhiramat@kernel.org>,
	"naveen.n.rao@linux.vnet.ibm.com"
	<naveen.n.rao@linux.vnet.ibm.com>,
	"davem@davemloft.net" <davem@davemloft.net>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"Hansen, Dave" <dave.hansen@intel.com>
Subject: Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
Date: Tue, 4 Dec 2018 17:57:26 -0800	[thread overview]
Message-ID: <1913CD9F-B912-490A-8DEC-8C24CFF0F6D6@amacapital.net> (raw)
In-Reply-To: <58a3b01c78b6c299f76c156f96211ff22ec28751.camel@intel.com>



> On Dec 4, 2018, at 3:52 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
>> On Tue, 2018-12-04 at 12:09 -0800, Andy Lutomirski wrote:
>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
>>> 
>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>> rick.p.edgecombe@intel.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>> underlying
>>>>>> pages,
>>>>>> it often leaves stale TLB entries to freed pages that could get re-
>>>>>> used.
>>>>>> This is
>>>>>> undesirable for cases where the memory being freed has special
>>>>>> permissions
>>>>>> such
>>>>>> as executable.
>>>>> 
>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>> mappings
>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks
>>>>> again
>>>>> for
>>>>> pointing it out).
>>>>> 
>>>>> But all of the sudden, I don’t understand why we have the problem that
>>>>> this
>>>>> (your) patch-set deals with at all. We already change the mappings to
>>>>> make
>>>>> the memory writable before freeing the memory, so why can’t we make it
>>>>> non-executable at the same time? Actually, why do we make the module
>>>>> memory,
>>>>> including its data executable before freeing it???
>>>> 
>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>> of the various different configurations and hysterical raisins. We can't
>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>> can we rely on disable_ro_nx() being available at build time.
>>>> 
>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>> to disappear anyway?
>>>> 
>>>> Is it just nios2 that does something different?
>>>> 
>>>> Will
>>> 
>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>> would
>>> solve it as well, in fact that was what I first thought the solution should
>>> be
>>> until this was suggested. It's interesting that from the other thread Masami
>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>> inadvertently blocked this on x86. But, on the other architectures I have
>>> since
>>> learned it is a bit different.
>>> 
>>> It looks like actually most arch's don't re-define set_memory_*, and so all
>>> of
>>> the frob_* functions are actually just noops. In which case allocating RWX
>>> is
>>> needed to make it work at all, because that is what the allocation is going
>>> to
>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>> nothing.
>>> 
>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is
>>> the
>>> changing of the permissions on the directmap as well. You don't want some
>>> other
>>> caller getting a page that was left RO when freed and then trying to write
>>> to
>>> it, if I understand this.
>>> 
>> 
>> Exactly.
>> 
>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
>> VM_MAY_ADJUST_PERMS or similar.  It would have the semantics you want,
>> but it would also call some arch hooks to put back the direct map
>> permissions before the flush.  Does that seem reasonable?  It would
>> need to be hooked up that implement set_memory_ro(), but that should
>> be quite easy.  If nothing else, it could fall back to set_memory_ro()
>> in the absence of a better implementation.
> 
> With arch hooks, I guess we could remove disable_ro_nx then. I think you would
> still have to flush twice on x86 to really have no W^X violating window from the
> direct map (I think x86 is the only one that sets permissions there?). But this
> could be down from sometimes 3. You could also directly vfree non exec RO memory
> without set_memory_, like in BPF. 

Just one flush if you’re careful. Set the memory not-present in the direct map and zap it from the vmap area, then flush, then set it RW in the 

> 
> The vfree deferred list would need to be moved since it then couldn't reuse the
> allocations since now the vfreed memory might be RO. It could kmalloc, or lookup
> the vm_struct. So would probably be a little slower in the interrupt case. Is
> this ok?

I’m fine with that. For eBPF, we should really have a lookaside list for small allocations.


WARNING: multiple messages have this Message-ID (diff)
From: Andy Lutomirski <luto@amacapital.net>
To: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: "luto@kernel.org" <luto@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"ard.biesheuvel@linaro.org" <ard.biesheuvel@linaro.org>,
	"jeyu@kernel.org" <jeyu@kernel.org>,
	"rostedt@goodmis.org" <rostedt@goodmis.org>,
	"ast@kernel.org" <ast@kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"nadav.amit@gmail.com" <nadav.amit@gmail.com>,
	"Dock, Deneen T" <deneen.t.dock@intel.com>,
	"jannh@google.com" <jannh@google.com>,
	"kristen@linux.intel.com" <kristen@linux.intel.com>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"peterz@infradead.org" <peterz@infradead.org>,
	"will.deacon@arm.com" <will.deacon@arm.com>,
	"mingo@redhat.com" <mingo@redhat.com>,
	"Keshavamurthy, Anil S" <anil.s.k
Subject: Re: [PATCH 1/2] vmalloc: New flag for flush before releasing pages
Date: Tue, 4 Dec 2018 17:57:26 -0800	[thread overview]
Message-ID: <1913CD9F-B912-490A-8DEC-8C24CFF0F6D6@amacapital.net> (raw)
In-Reply-To: <58a3b01c78b6c299f76c156f96211ff22ec28751.camel@intel.com>



> On Dec 4, 2018, at 3:52 PM, Edgecombe, Rick P <rick.p.edgecombe@intel.com> wrote:
> 
>> On Tue, 2018-12-04 at 12:09 -0800, Andy Lutomirski wrote:
>> On Tue, Dec 4, 2018 at 12:02 PM Edgecombe, Rick P
>> <rick.p.edgecombe@intel.com> wrote:
>>> 
>>>> On Tue, 2018-12-04 at 16:03 +0000, Will Deacon wrote:
>>>> On Mon, Dec 03, 2018 at 05:43:11PM -0800, Nadav Amit wrote:
>>>>>> On Nov 27, 2018, at 4:07 PM, Rick Edgecombe <
>>>>>> rick.p.edgecombe@intel.com>
>>>>>> wrote:
>>>>>> 
>>>>>> Since vfree will lazily flush the TLB, but not lazily free the
>>>>>> underlying
>>>>>> pages,
>>>>>> it often leaves stale TLB entries to freed pages that could get re-
>>>>>> used.
>>>>>> This is
>>>>>> undesirable for cases where the memory being freed has special
>>>>>> permissions
>>>>>> such
>>>>>> as executable.
>>>>> 
>>>>> So I am trying to finish my patch-set for preventing transient W+X
>>>>> mappings
>>>>> from taking space, by handling kprobes & ftrace that I missed (thanks
>>>>> again
>>>>> for
>>>>> pointing it out).
>>>>> 
>>>>> But all of the sudden, I don’t understand why we have the problem that
>>>>> this
>>>>> (your) patch-set deals with at all. We already change the mappings to
>>>>> make
>>>>> the memory writable before freeing the memory, so why can’t we make it
>>>>> non-executable at the same time? Actually, why do we make the module
>>>>> memory,
>>>>> including its data executable before freeing it???
>>>> 
>>>> Yeah, this is really confusing, but I have a suspicion it's a combination
>>>> of the various different configurations and hysterical raisins. We can't
>>>> rely on module_alloc() allocating from the vmalloc area (see nios2) nor
>>>> can we rely on disable_ro_nx() being available at build time.
>>>> 
>>>> If we *could* rely on module allocations always using vmalloc(), then
>>>> we could pass in Rick's new flag and drop disable_ro_nx() altogether
>>>> afaict -- who cares about the memory attributes of a mapping that's about
>>>> to disappear anyway?
>>>> 
>>>> Is it just nios2 that does something different?
>>>> 
>>>> Will
>>> 
>>> Yea it is really intertwined. I think for x86, set_memory_nx everywhere
>>> would
>>> solve it as well, in fact that was what I first thought the solution should
>>> be
>>> until this was suggested. It's interesting that from the other thread Masami
>>> Hiramatsu referenced, set_memory_nx was suggested last year and would have
>>> inadvertently blocked this on x86. But, on the other architectures I have
>>> since
>>> learned it is a bit different.
>>> 
>>> It looks like actually most arch's don't re-define set_memory_*, and so all
>>> of
>>> the frob_* functions are actually just noops. In which case allocating RWX
>>> is
>>> needed to make it work at all, because that is what the allocation is going
>>> to
>>> stay at. So in these archs, set_memory_nx won't solve it because it will do
>>> nothing.
>>> 
>>> On x86 I think you cannot get rid of disable_ro_nx fully because there is
>>> the
>>> changing of the permissions on the directmap as well. You don't want some
>>> other
>>> caller getting a page that was left RO when freed and then trying to write
>>> to
>>> it, if I understand this.
>>> 
>> 
>> Exactly.
>> 
>> After slightly more thought, I suggest renaming VM_IMMEDIATE_UNMAP to
>> VM_MAY_ADJUST_PERMS or similar.  It would have the semantics you want,
>> but it would also call some arch hooks to put back the direct map
>> permissions before the flush.  Does that seem reasonable?  It would
>> need to be hooked up that implement set_memory_ro(), but that should
>> be quite easy.  If nothing else, it could fall back to set_memory_ro()
>> in the absence of a better implementation.
> 
> With arch hooks, I guess we could remove disable_ro_nx then. I think you would
> still have to flush twice on x86 to really have no W^X violating window from the
> direct map (I think x86 is the only one that sets permissions there?). But this
> could be down from sometimes 3. You could also directly vfree non exec RO memory
> without set_memory_, like in BPF. 

Just one flush if you’re careful. Set the memory not-present in the direct map and zap it from the vmap area, then flush, then set it RW in the 

> 
> The vfree deferred list would need to be moved since it then couldn't reuse the
> allocations since now the vfreed memory might be RO. It could kmalloc, or lookup
> the vm_struct. So would probably be a little slower in the interrupt case. Is
> this ok?

I’m fine with that. For eBPF, we should really have a lookaside list for small allocations.

  reply	other threads:[~2018-12-05  1:57 UTC|newest]

Thread overview: 117+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-11-28  0:07 [PATCH 0/2] Don’t leave executable TLB entries to freed pages Rick Edgecombe
2018-11-28  0:07 ` [PATCH 1/2] vmalloc: New flag for flush before releasing pages Rick Edgecombe
2018-12-04  0:04   ` Edgecombe, Rick P
2018-12-04  0:04     ` Edgecombe, Rick P
2018-12-04  0:04     ` Edgecombe, Rick P
2018-12-04  0:04     ` Edgecombe, Rick P
2018-12-04  1:43   ` Nadav Amit
2018-12-04 16:03     ` Will Deacon
2018-12-04 20:02       ` Edgecombe, Rick P
2018-12-04 20:02         ` Edgecombe, Rick P
2018-12-04 20:02         ` Edgecombe, Rick P
2018-12-04 20:02         ` Edgecombe, Rick P
2018-12-04 20:09         ` Andy Lutomirski
2018-12-04 20:09           ` Andy Lutomirski
2018-12-04 23:52           ` Edgecombe, Rick P
2018-12-04 23:52             ` Edgecombe, Rick P
2018-12-04 23:52             ` Edgecombe, Rick P
2018-12-05  1:57             ` Andy Lutomirski [this message]
2018-12-05  1:57               ` Andy Lutomirski
2018-12-05  1:57               ` Andy Lutomirski
2018-12-05 11:41           ` Will Deacon
2018-12-05 11:41             ` Will Deacon
2018-12-05 23:16             ` Andy Lutomirski
2018-12-05 23:16               ` Andy Lutomirski
2018-12-06  7:29               ` Ard Biesheuvel
2018-12-06  7:29                 ` Ard Biesheuvel
2018-12-06 11:10                 ` Will Deacon
2018-12-06 11:10                   ` Will Deacon
2018-12-06 18:53                 ` Andy Lutomirski
2018-12-06 18:53                   ` Andy Lutomirski
2018-12-06 19:01                   ` Tycho Andersen
2018-12-06 19:01                     ` Tycho Andersen
2018-12-06 19:19                     ` Andy Lutomirski
2018-12-06 19:19                       ` Andy Lutomirski
2018-12-06 19:39                       ` Nadav Amit
2018-12-06 19:39                         ` Nadav Amit
2018-12-06 20:17                         ` Andy Lutomirski
2018-12-06 20:17                           ` Andy Lutomirski
2018-12-06 23:08                           ` Nadav Amit
2018-12-06 23:08                             ` Nadav Amit
2018-12-07  3:06                             ` Edgecombe, Rick P
2018-12-07  3:06                               ` Edgecombe, Rick P
2018-12-07  3:06                               ` Edgecombe, Rick P
2018-12-06 20:19                       ` Edgecombe, Rick P
2018-12-06 20:19                         ` Edgecombe, Rick P
2018-12-06 20:19                         ` Edgecombe, Rick P
2018-12-06 20:26                         ` Andy Lutomirski
2018-12-06 20:26                           ` Andy Lutomirski
2018-12-06 19:04                   ` Ard Biesheuvel
2018-12-06 19:04                     ` Ard Biesheuvel
2018-12-06 19:20                     ` Andy Lutomirski
2018-12-06 19:20                       ` Andy Lutomirski
2018-12-06 19:23                       ` Ard Biesheuvel
2018-12-06 19:23                         ` Ard Biesheuvel
2018-12-06 19:31                         ` Will Deacon
2018-12-06 19:31                           ` Will Deacon
2018-12-06 19:36                           ` Ard Biesheuvel
2018-12-06 19:36                             ` Ard Biesheuvel
2018-12-04 20:36         ` Nadav Amit
2018-12-04 20:36           ` Nadav Amit
2018-12-04 20:36           ` Nadav Amit
2018-12-04 23:51           ` Edgecombe, Rick P
2018-12-04 23:51             ` Edgecombe, Rick P
2018-12-05  0:01             ` Nadav Amit
2018-12-05  0:01               ` Nadav Amit
2018-12-05  0:01               ` Nadav Amit
2018-12-05  0:29               ` Edgecombe, Rick P
2018-12-05  0:29                 ` Edgecombe, Rick P
2018-12-05  0:29                 ` Edgecombe, Rick P
2018-12-05  0:53                 ` Nadav Amit
2018-12-05  0:53                   ` Nadav Amit
2018-12-05  0:53                   ` Nadav Amit
2018-12-05  1:45                   ` Edgecombe, Rick P
2018-12-05  1:45                     ` Edgecombe, Rick P
2018-12-05  1:45                     ` Edgecombe, Rick P
2018-12-05  2:09                     ` Nadav Amit
2018-12-05  2:09                       ` Nadav Amit
2018-12-05  2:09                       ` Nadav Amit
2018-12-04 18:56     ` Andy Lutomirski
2018-12-04 18:56       ` Andy Lutomirski
2018-12-04 19:44       ` Nadav Amit
2018-12-04 19:44         ` Nadav Amit
2018-12-04 19:48         ` Andy Lutomirski
2018-12-04 19:48           ` Andy Lutomirski
2018-12-04 22:48           ` Nadav Amit
2018-12-04 22:48             ` Nadav Amit
2018-12-04 23:27             ` Andy Lutomirski
2018-12-04 23:27               ` Andy Lutomirski
2018-12-04 23:34               ` Nadav Amit
2018-12-04 23:34                 ` Nadav Amit
2018-12-05  1:09             ` Edgecombe, Rick P
2018-12-05  1:09               ` Edgecombe, Rick P
2018-12-05  1:09               ` Edgecombe, Rick P
2018-12-05  1:45               ` Nadav Amit
2018-12-05  1:45                 ` Nadav Amit
2018-12-05  1:45                 ` Nadav Amit
2018-11-28  0:07 ` [PATCH 2/2] x86/modules: Make x86 allocs to flush when free Rick Edgecombe
2018-11-28 23:11   ` Andrew Morton
2018-11-29  0:02     ` Edgecombe, Rick P
2018-11-29  0:02       ` Edgecombe, Rick P
2018-11-29  0:02       ` Edgecombe, Rick P
2018-11-29  1:40   ` Andy Lutomirski
2018-11-29  1:40     ` Andy Lutomirski
2018-11-29  6:14     ` Edgecombe, Rick P
2018-11-29  6:14       ` Edgecombe, Rick P
2018-11-29  6:14       ` Edgecombe, Rick P
2018-11-28  1:06 ` [PATCH 0/2] Don’t leave executable TLB entries to freed pages Nadav Amit
2018-11-28  1:21   ` Nadav Amit
2018-11-28  9:57     ` Will Deacon
2018-11-28 18:29       ` Nadav Amit
2018-11-29 14:06 ` Masami Hiramatsu
2018-11-29 18:49   ` Edgecombe, Rick P
2018-11-29 18:49     ` Edgecombe, Rick P
2018-11-29 18:49     ` Edgecombe, Rick P
2018-11-29 23:19     ` Masami Hiramatsu
2018-11-29 23:19       ` Masami Hiramatsu
2018-11-29 23:19       ` Masami Hiramatsu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1913CD9F-B912-490A-8DEC-8C24CFF0F6D6@amacapital.net \
    --to=luto@amacapital.net \
    --cc=akpm@linux-foundation.org \
    --cc=anil.s.keshavamurthy@intel.com \
    --cc=ard.biesheuvel@linaro.org \
    --cc=ast@kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dave.hansen@intel.com \
    --cc=davem@davemloft.net \
    --cc=deneen.t.dock@intel.com \
    --cc=jannh@google.com \
    --cc=jeyu@kernel.org \
    --cc=kernel-hardening@lists.openwall.com \
    --cc=kristen@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mhiramat@kernel.org \
    --cc=mingo@redhat.com \
    --cc=nadav.amit@gmail.com \
    --cc=naveen.n.rao@linux.vnet.ibm.com \
    --cc=netdev@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=rick.p.edgecombe@intel.com \
    --cc=rostedt@goodmis.org \
    --cc=will.deacon@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.