linux-mm.kvack.org archive mirror
* Why do we let munmap fail?
@ 2018-05-21 22:07 Daniel Colascione
  2018-05-21 22:12 ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Colascione @ 2018-05-21 22:07 UTC (permalink / raw)
  To: linux-mm; +Cc: Tim Murray, Minchan Kim

Right now, we have this system knob max_map_count that caps the number of
VMAs we can have in a single address space. Put aside for the moment the question of
whether this knob should exist: even if it does, enforcing it for munmap,
mprotect, etc. produces weird and counter-intuitive situations in which
it's possible to fail to return resources (address space and commit charge)
to the system. At a deep philosophical level, that's the kind of operation
that should never fail. A library that does all the right things can still
experience a failure to deallocate resources it allocated itself if it gets
unlucky with VMA merging. Why should we allow that to happen?
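
A contrived sketch of the failure mode I mean (assuming the default
vm.max_map_count of 65530; the sizes are arbitrary): punch holes in one large
anonymous mapping until munmap itself starts failing.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        size_t npages = 1UL << 20;              /* 4 GiB of address space */
        char *p = mmap(NULL, npages * page, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (p == MAP_FAILED)
                return 1;

        /* Each interior hole splits one VMA into two. */
        for (size_t i = 1; i < npages; i += 2) {
                if (munmap(p + i * page, page) != 0) {
                        printf("munmap: %s after %zu holes\n",
                               strerror(errno), i / 2);
                        return 0;
                }
        }
        printf("never hit the limit\n");
        return 0;
}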

Now let's return to max_map_count itself: what is it supposed to achieve?
If we want to limit application kernel memory resource consumption, let's
limit application kernel memory resource consumption, accounting for it on
a byte basis the same way we account for other kernel objects allocated on
behalf of userspace. Why should we have a separate cap just for the VMA
count?

I propose the following changes:

1) Let -1 mean "no VMA count limit".
2) Default max_map_count to -1.
3) Do not enforce max_map_count on munmap and mprotect.

Alternatively, can we account VMAs toward max_map_count on a page count
basis instead of a VMA basis? This way, no matter how you split and merge
your VMAs, you'll never see a weird failure to release resources. We'd have
to bump the default value of max_map_count to compensate for its new
interpretation.


* Re: Why do we let munmap fail?
  2018-05-21 22:07 Why do we let munmap fail? Daniel Colascione
@ 2018-05-21 22:12 ` Dave Hansen
  2018-05-21 22:20   ` Daniel Colascione
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2018-05-21 22:12 UTC (permalink / raw)
  To: Daniel Colascione, linux-mm; +Cc: Tim Murray, Minchan Kim

On 05/21/2018 03:07 PM, Daniel Colascione wrote:
> Now let's return to max_map_count itself: what is it supposed to achieve?
> If we want to limit application kernel memory resource consumption, let's
> limit application kernel memory resource consumption, accounting for it on
> a byte basis the same way we account for other kernel objects allocated on
> behalf of userspace. Why should we have a separate cap just for the VMA
> count?

VMAs consume kernel memory and we can't reclaim them.  That's what it
boils down to.


* Re: Why do we let munmap fail?
  2018-05-21 22:12 ` Dave Hansen
@ 2018-05-21 22:20   ` Daniel Colascione
  2018-05-21 22:29     ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Colascione @ 2018-05-21 22:20 UTC (permalink / raw)
  To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 3:12 PM Dave Hansen <dave.hansen@intel.com> wrote:

> On 05/21/2018 03:07 PM, Daniel Colascione wrote:
> > Now let's return to max_map_count itself: what is it supposed to achieve?
> > If we want to limit application kernel memory resource consumption, let's
> > limit application kernel memory resource consumption, accounting for it on
> > a byte basis the same way we account for other kernel objects allocated on
> > behalf of userspace. Why should we have a separate cap just for the VMA
> > count?

> VMAs consume kernel memory and we can't reclaim them.  That's what it
> boils down to.

How is it different from memfd in that respect?


* Re: Why do we let munmap fail?
  2018-05-21 22:20   ` Daniel Colascione
@ 2018-05-21 22:29     ` Dave Hansen
  2018-05-21 22:35       ` Daniel Colascione
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2018-05-21 22:29 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: linux-mm, Tim Murray, Minchan Kim

On 05/21/2018 03:20 PM, Daniel Colascione wrote:
>> VMAs consume kernel memory and we can't reclaim them.  That's what it
>> boils down to.
> How is it different from memfd in that respect?

I don't really know what you mean.  I know folks use memfd to figure out
how much memory pressure we are under.  I guess that would trigger when
you consume lots of memory with VMAs.

VMAs are probably the most similar to things like page tables that are
kernel memory that can't be directly reclaimed, but do get freed at
OOM-kill-time.  But, VMAs are a bit harder than page tables because
freeing a page worth of VMAs does not necessarily free an entire page.


* Re: Why do we let munmap fail?
  2018-05-21 22:29     ` Dave Hansen
@ 2018-05-21 22:35       ` Daniel Colascione
  2018-05-21 22:48         ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Colascione @ 2018-05-21 22:35 UTC (permalink / raw)
  To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 3:29 PM Dave Hansen <dave.hansen@intel.com> wrote:

> On 05/21/2018 03:20 PM, Daniel Colascione wrote:
> >> VMAs consume kernel memory and we can't reclaim them.  That's what it
> >> boils down to.
> > How is it different from memfd in that respect?

> I don't really know what you mean.

I should have been more clear. I meant, in general, that processes can
*already* ask the kernel to allocate memory on behalf of the process, and
sometimes this memory can't be reclaimed without an OOM kill. (You can swap
memfd/tmpfs contents, but for simplicity, imagine we're running without a
pagefile.)
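
Concretely, something like this (just a sketch; it assumes a glibc new enough
to expose memfd_create(), otherwise you'd invoke the syscall directly) already
lets a process tie up a large chunk of kernel-managed memory that, with no swap
configured, only exiting or an OOM kill will release:

#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        size_t len = 256UL << 20;               /* 256 MiB of tmpfs pages */
        int fd = memfd_create("pinned", 0);
        char *p;

        if (fd < 0 || ftruncate(fd, len) != 0)
                return 1;

        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        memset(p, 0x5a, len);   /* instantiate the tmpfs pages */
        pause();                /* without swap, they stay until kill/OOM */
        return 0;
}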

> I know folks use memfd to figure out
> how much memory pressure we are under.  I guess that would trigger when
> you consume lots of memory with VMAs.

I think you're thinking of the VM pressure level special files, not memfd,
which creates an anonymous tmpfs file.

> VMAs are probably the most similar to things like page tables that are
> kernel memory that can't be directly reclaimed, but do get freed at
> OOM-kill-time.  But, VMAs are a bit harder than page tables because
> freeing a page worth of VMAs does not necessarily free an entire page.

I don't understand. We can reclaim memory used by VMAs by killing the
process or processes attached to the address space that owns those VMAs.
The OOM killer should Just Work. Why do we have to have some special limit
of VMA count?


* Re: Why do we let munmap fail?
  2018-05-21 22:35       ` Daniel Colascione
@ 2018-05-21 22:48         ` Dave Hansen
  2018-05-21 22:54           ` Daniel Colascione
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2018-05-21 22:48 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: linux-mm, Tim Murray, Minchan Kim

On 05/21/2018 03:35 PM, Daniel Colascione wrote:
>> I know folks use memfd to figure out
>> how much memory pressure we are under.  I guess that would trigger when
>> you consume lots of memory with VMAs.
> 
> I think you're thinking of the VM pressure level special files, not memfd,
> which creates an anonymous tmpfs file.

Yep, you're right.

>> VMAs are probably the most similar to things like page tables that are
>> kernel memory that can't be directly reclaimed, but do get freed at
>> OOM-kill-time.  But, VMAs are a bit harder than page tables because
>> freeing a page worth of VMAs does not necessarily free an entire page.
> 
> I don't understand. We can reclaim memory used by VMAs by killing the
> process or processes attached to the address space that owns those VMAs.
> The OOM killer should Just Work. Why do we have to have some special limit
> of VMA count?

The OOM killer doesn't take the VMA count into consideration as far as I
remember.  I can't think of any reason why not except for the internal
fragmentation problem.

The current VMA limit is ~12MB of VMAs per process, which is quite a
bit.  I think it would be reasonable to start considering that in OOM
decisions, although it's surely inconsequential except on very small
systems.
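
(Back-of-the-envelope for that figure, assuming sizeof(struct vm_area_struct)
is on the order of 200 bytes on x86_64: 65530 VMAs * ~200 bytes ~= 12-13 MB of
unreclaimable slab per address space.)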

There are also certainly denial-of-service concerns if you allow
arbitrary numbers of VMAs.  The rbtree, for instance, is O(log(n)), but
I'd be willing to bet there are plenty of things that fall over if you
let the ~65k limit get 10x or 100x larger.


* Re: Why do we let munmap fail?
  2018-05-21 22:48         ` Dave Hansen
@ 2018-05-21 22:54           ` Daniel Colascione
  2018-05-21 23:02             ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Colascione @ 2018-05-21 22:54 UTC (permalink / raw)
  To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 3:48 PM Dave Hansen <dave.hansen@intel.com> wrote:

> On 05/21/2018 03:35 PM, Daniel Colascione wrote:
> >> I know folks use memfd to figure out
> >> how much memory pressure we are under.  I guess that would trigger when
> >> you consume lots of memory with VMAs.
> >
> > I think you're thinking of the VM pressure level special files, not memfd,
> > which creates an anonymous tmpfs file.

> Yep, you're right.

> >> VMAs are probably the most similar to things like page tables that are
> >> kernel memory that can't be directly reclaimed, but do get freed at
> >> OOM-kill-time.  But, VMAs are a bit harder than page tables because
> >> freeing a page worth of VMAs does not necessarily free an entire page.
> >
> > I don't understand. We can reclaim memory used by VMAs by killing the
> > process or processes attached to the address space that owns those VMAs.
> > The OOM killer should Just Work. Why do we have to have some special limit
> > of VMA count?

> The OOM killer doesn't take the VMA count into consideration as far as I
> remember.  I can't think of any reason why not except for the internal
> fragmentation problem.

> The current VMA limit is ~12MB of VMAs per process, which is quite a
> bit.  I think it would be reasonable to start considering that in OOM
> decisions, although it's surely inconsequential except on very small
> systems.

> There are also certainly denial-of-service concerns if you allow
> arbitrary numbers of VMAs.  The rbtree, for instance, is O(log(n)), but
> I'd be willing to bet there are plenty of things that fall over if you
> let the ~65k limit get 10x or 100x larger.

Sure. I'm receptive to the idea of having *some* VMA limit. I just think
it's unacceptable to let deallocation routines fail.

What about the proposal at the end of my original message? If we account
for mapped address space by counting pages instead of counting VMAs, no
amount of VMA splitting can trip us over the threshold. We could just
impose a system-wide vsize limit in addition to RLIMIT_AS, with the
effective limit being the smaller of the two. (On further thought, we'd
probably want to leave the meaning of max_map_count unchanged and introduce
a new knob.)


* Re: Why do we let munmap fail?
  2018-05-21 22:54           ` Daniel Colascione
@ 2018-05-21 23:02             ` Dave Hansen
  2018-05-21 23:16               ` Daniel Colascione
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2018-05-21 23:02 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: linux-mm, Tim Murray, Minchan Kim

On 05/21/2018 03:54 PM, Daniel Colascione wrote:
>> There are also certainly denial-of-service concerns if you allow
>> arbitrary numbers of VMAs.  The rbtree, for instance, is O(log(n)), but
>> I'd be willing to bet there are plenty of things that fall over if you
>> let the ~65k limit get 10x or 100x larger.
> Sure. I'm receptive to the idea of having *some* VMA limit. I just think
> it's unacceptable to let deallocation routines fail.

If you have a resource limit and deallocation consumes resources, you
*eventually* have to fail a deallocation.  Right?


* Re: Why do we let munmap fail?
  2018-05-21 23:02             ` Dave Hansen
@ 2018-05-21 23:16               ` Daniel Colascione
  2018-05-21 23:32                 ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Colascione @ 2018-05-21 23:16 UTC (permalink / raw)
  To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@intel.com> wrote:

> On 05/21/2018 03:54 PM, Daniel Colascione wrote:
> >> There are also certainly denial-of-service concerns if you allow
> >> arbitrary numbers of VMAs.  The rbtree, for instance, is O(log(n)), but
> >> I'd be willing to bet there are plenty of things that fall over if you
> >> let the ~65k limit get 10x or 100x larger.
> > Sure. I'm receptive to the idea of having *some* VMA limit. I just think
> > it's unacceptable to let deallocation routines fail.

> If you have a resource limit and deallocation consumes resources, you
> *eventually* have to fail a deallocation.  Right?

That's why robust software sets aside at allocation time whatever resources
are needed to make forward progress at deallocation time. That's what I'm
trying to propose here, essentially: if we specify the VMA limit in terms
of pages and not the number of VMAs, we've effectively "budgeted" for the
worst case of VMA splitting, since in the worst case, you end up with one
page per VMA.
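
(To put rough numbers on it, under the same ~200-bytes-per-vm_area_struct
assumption: a budget of 65,530 pages, i.e. roughly 256 MiB of mappable address
space with 4 KiB pages, bounds the worst case at 65,530 single-page VMAs, about
the same ~12 MB of kernel memory the current limit allows, no matter how the
mappings get split.)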

Done this way, we still prevent runaway VMA tree growth, but we can also
make sure that anyone who's successfully called mmap can successfully call
munmap.


* Re: Why do we let munmap fail?
  2018-05-21 23:16               ` Daniel Colascione
@ 2018-05-21 23:32                 ` Dave Hansen
  2018-05-22  0:00                   ` Daniel Colascione
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2018-05-21 23:32 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: linux-mm, Tim Murray, Minchan Kim

On 05/21/2018 04:16 PM, Daniel Colascione wrote:
> On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@intel.com> wrote:
> 
>> On 05/21/2018 03:54 PM, Daniel Colascione wrote:
>>>> There are also certainly denial-of-service concerns if you allow
>>>> arbitrary numbers of VMAs.  The rbtree, for instance, is O(log(n)), but
>>>> I'd be willing to bet there are plenty of things that fall over if you
>>>> let the ~65k limit get 10x or 100x larger.
>>> Sure. I'm receptive to the idea of having *some* VMA limit. I just think
>>> it's unacceptable to let deallocation routines fail.
>> If you have a resource limit and deallocation consumes resources, you
>> *eventually* have to fail a deallocation.  Right?
> That's why robust software sets aside at allocation time whatever resources
> are needed to make forward progress at deallocation time.

I think there's still a potential dead-end here.  "Deallocation" does
not always free resources.

> That's what I'm trying to propose here, essentially: if we specify
> the VMA limit in terms of pages and not the number of VMAs, we've
> effectively "budgeted" for the worst case of VMA splitting, since in
> the worst case, you end up with one page per VMA.
Not a bad idea, but it's not really how we allocate VMAs today.  You
would somehow need per-process (mm?) slabs.  Such a scheme would
probably, on average, waste half of a page per mm.

> Done this way, we still prevent runaway VMA tree growth, but we can also
> make sure that anyone who's successfully called mmap can successfully call
> munmap.

I'd be curious how this works out, but I bet you end up reserving a lot
more resources than people want.


* Re: Why do we let munmap fail?
  2018-05-21 23:32                 ` Dave Hansen
@ 2018-05-22  0:00                   ` Daniel Colascione
  2018-05-22  0:22                     ` Matthew Wilcox
  2018-05-22  5:34                     ` Nicholas Piggin
  0 siblings, 2 replies; 19+ messages in thread
From: Daniel Colascione @ 2018-05-22  0:00 UTC (permalink / raw)
  To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote:

> On 05/21/2018 04:16 PM, Daniel Colascione wrote:
> > On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> >> On 05/21/2018 03:54 PM, Daniel Colascione wrote:
> >>>> There are also certainly denial-of-service concerns if you allow
> >>>> arbitrary numbers of VMAs.  The rbtree, for instance, is O(log(n)), but
> >>>> I'd be willing to bet there are plenty of things that fall over if you
> >>>> let the ~65k limit get 10x or 100x larger.
> >>> Sure. I'm receptive to the idea of having *some* VMA limit. I just think
> >>> it's unacceptable to let deallocation routines fail.
> >> If you have a resource limit and deallocation consumes resources, you
> >> *eventually* have to fail a deallocation.  Right?
> > That's why robust software sets aside at allocation time whatever resources
> > are needed to make forward progress at deallocation time.

> I think there's still a potential dead-end here.  "Deallocation" does
> not always free resources.

Sure, but the general principle applies: reserve resources when you *can*
fail so that you don't fail where you can't fail.

> > That's what I'm trying to propose here, essentially: if we specify
> > the VMA limit in terms of pages and not the number of VMAs, we've
> > effectively "budgeted" for the worst case of VMA splitting, since in
> > the worst case, you end up with one page per VMA.
> Not a bad idea, but it's not really how we allocate VMAs today.  You
> would somehow need per-process (mm?) slabs.  Such a scheme would
> probably, on average, waste half of a page per mm.

> > Done this way, we still prevent runaway VMA tree growth, but we can also
> > make sure that anyone who's successfully called mmap can successfully call
> > munmap.

> I'd be curious how this works out, but I bet you end up reserving a lot
> more resources than people want.

I'm not sure. We're talking about two separate goals, I think. Goal #1 is
preventing the VMA tree becoming so large that we effectively DoS the
system. Goal #2 is about ensuring that the munmap path can't fail. Right
now, the system only achieves goal #1.

All we have to do to continue to achieve goal #1 is impose *some* sanity
limit on the VMA count, right? It doesn't really matter whether the limit
is specified in pages or number-of-VMAs so long as it's larger than most
applications will need but smaller than the DoS threshold. The resource
we're allocating at mmap time isn't really bytes of
struct-vm_area_struct-backing-storage, but sort of virtual anti-DoS
credits. Right now, these anti-DoS credits are denominated in number of
VMAs, but if we changed the denomination to page counts instead, we'd still
achieve goal #1 while avoiding the munmap-failing-with-ENOMEM weirdness.
Granted, if we make only this change, then munmap internal allocations
*still* fail if the actual VMA allocation failed, but I think the default
kernel OOM killer strategy will suffice for handling this kind of global
extreme memory pressure situation. All we have to do is change the *limit
check* during VMA creation, not the actual allocation strategy.
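
In pseudo-kernel-C, purely to illustrate the shape of the check I mean (this is
not the actual mm/mmap.c code; sysctl_max_map_pages is a made-up knob, while
mm->total_vm is the existing per-mm count of mapped pages):

/* Illustration only: deny new address space, never its release. */
static bool may_add_mapping(struct mm_struct *mm, unsigned long npages)
{
        /* Consulted on mmap/brk/mremap-grow paths only; munmap and
         * mprotect splits never check it, so they cannot fail here. */
        return mm->total_vm + npages <= sysctl_max_map_pages;
}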

Another way of looking at it: Linux is usually configured to overcommit
with respect to *commit charge*. This behavior is well-known and widely
understood. What the VMA limit does is effectively overcommit with respect
to *address space*, which is weird and surprising because we normally think
of address space as being strictly accounted. If we can easily and cheaply
make address space actually strictly accounted, why not give it a shot?

Goal #2 is interesting as well, and I think it's what your slab-allocation
proposal would help address. If we literally set aside memory for all
possible VMAs, we'd ensure that internal allocations on the munmap path
could never fail. In the abstract, I'd like that (I'm a fan of strict
commit accounting generally), but I don't think it's necessary for fixing
the problem that motivated this thread.


* Re: Why do we let munmap fail?
  2018-05-22  0:00                   ` Daniel Colascione
@ 2018-05-22  0:22                     ` Matthew Wilcox
  2018-05-22  0:38                       ` Daniel Colascione
  2018-05-22  5:34                     ` Nicholas Piggin
  1 sibling, 1 reply; 19+ messages in thread
From: Matthew Wilcox @ 2018-05-22  0:22 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 05:00:47PM -0700, Daniel Colascione wrote:
> On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > I think there's still a potential dead-end here.  "Deallocation" does
> > not always free resources.
> 
> Sure, but the general principle applies: reserve resources when you *can*
> fail so that you don't fail where you can't fail.

Umm.  OK.  But you want an mmap of 4TB to succeed, right?  That implies
preallocating one billion * sizeof(*vma).  That's, what, dozens of
gigabytes right there?

I'm sympathetic to wanting to keep both vma-merging and
unmap-anything-i-mapped working, but your proposal isn't going to fix it.

You need to handle the attacker writing a program which mmaps 46 bits
of address space and then munmaps alternate pages.  That program needs
to be detected and stopped.


* Re: Why do we let munmap fail?
  2018-05-22  0:22                     ` Matthew Wilcox
@ 2018-05-22  0:38                       ` Daniel Colascione
  2018-05-22  1:19                         ` Theodore Y. Ts'o
  2018-05-22  1:22                         ` Matthew Wilcox
  0 siblings, 2 replies; 19+ messages in thread
From: Daniel Colascione @ 2018-05-22  0:38 UTC (permalink / raw)
  To: willy; +Cc: dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 5:22 PM Matthew Wilcox <willy@infradead.org> wrote:

> On Mon, May 21, 2018 at 05:00:47PM -0700, Daniel Colascione wrote:
> > On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > I think there's still a potential dead-end here.  "Deallocation" does
> > > not always free resources.
> >
> > Sure, but the general principle applies: reserve resources when you *can*
> > fail so that you don't fail where you can't fail.

> Umm.  OK.  But you want an mmap of 4TB to succeed, right?  That implies
> preallocating one billion * sizeof(*vma).  That's, what, dozens of
> gigabytes right there?

That's not what I'm proposing here. I'd hoped to make that clear in the
remainder of the email to which you've replied.

> I'm sympathetic to wanting to keep both vma-merging and
> unmap-anything-i-mapped working, but your proposal isn't going to fix it.

> You need to handle the attacker writing a program which mmaps 46 bits
> of address space and then munmaps alternate pages.  That program needs
> to be detected and stopped.

Let's look at why it's bad to mmap 46 bits of address space and munmap
alternate pages. It can't be that doing so would just use too much memory:
you can mmap 46 bits of address space *already* and touch each page, one by
one, until the kernel gets fed up and the OOM killer kills you.

So it's not because we'd allocate a lot of memory that having a huge VMA
tree is bad, because we already let processes allocate globs of memory in
other ways. The badness comes, AIUI, from the asymptotic behavior of the
address lookup algorithm in a tree that big.

One approach to dealing with this badness, the one I proposed earlier, is
to prevent that giant mmap from appearing in the first place (because we'd
cap vsize). If that giant mmap never appears, you can't generate a huge VMA
tree by splitting it.

Maybe that's not a good approach. Maybe processes really need mappings that
big. If they do, then maybe the right approach is to just make 8 billion
VMAs not "DoS the system". What actually goes wrong if we just let the VMA
tree grow that large? So what if VMA lookup ends up taking a while --- the
process with the pathological allocation pattern is paying the cost, right?


* Re: Why do we let munmap fail?
  2018-05-22  0:38                       ` Daniel Colascione
@ 2018-05-22  1:19                         ` Theodore Y. Ts'o
  2018-05-22  1:41                           ` Daniel Colascione
  2018-05-22  1:22                         ` Matthew Wilcox
  1 sibling, 1 reply; 19+ messages in thread
From: Theodore Y. Ts'o @ 2018-05-22  1:19 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: willy, dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 05:38:06PM -0700, Daniel Colascione wrote:
> 
> One approach to dealing with this badness, the one I proposed earlier, is
> to prevent that giant mmap from appearing in the first place (because we'd
> cap vsize). If that giant mmap never appears, you can't generate a huge VMA
> tree by splitting it.
> 
> Maybe that's not a good approach. Maybe processes really need mappings that
> big. If they do, then maybe the right approach is to just make 8 billion
> VMAs not "DoS the system". What actually goes wrong if we just let the VMA
> tree grow that large? So what if VMA lookup ends up taking a while --- the
> process with the pathological allocation pattern is paying the cost, right?
>

Fine.  Let's pick a more reasonable size --- say, 1GB.  That's still
2**18 4k pages.  Someone who munmap's every other 4k page is going to
create 2**17 VMA's.  That's a lot of VMA's.  So now the question is do
we pre-preserve enough VMA's for this worst case scenario, for all
processes in the system?  Or do we fail or otherwise kill the process
who is clearly attempting a DOS attack on the system?
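
(Rough numbers, assuming ~200 bytes per vm_area_struct: 2**17 VMA's is about
131,072 * 200 bytes ~= 25 MB of unreclaimable kernel memory for every GB that
gets mapped and split this way.)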

If your goal is that munmap must ***never*** fail, then effectively
you have to preserve enough resources for 50% of all 4k pages in all
of the virtual address spaces in use by all of the processes in the
system.  That's a horrible waste of resources, just to guarantee that
munmap(2) must never fail.

Personally, I think it's not worth it.

Why is it so important to you that munmap(2) must not fail?  Is it not
enough to say that if you mmap(2) a region, if you munmap(2) that
exact same size region as you mmap(2)'ed, it must not fail?  That's a
much easier guarantee to make....

						- Ted


* Re: Why do we let munmap fail?
  2018-05-22  0:38                       ` Daniel Colascione
  2018-05-22  1:19                         ` Theodore Y. Ts'o
@ 2018-05-22  1:22                         ` Matthew Wilcox
  1 sibling, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2018-05-22  1:22 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 05:38:06PM -0700, Daniel Colascione wrote:
> On Mon, May 21, 2018 at 5:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Mon, May 21, 2018 at 05:00:47PM -0700, Daniel Colascione wrote:
> > > On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > > I think there's still a potential dead-end here.  "Deallocation" does
> > > > not always free resources.
> > >
> > > Sure, but the general principle applies: reserve resources when you *can*
> > > fail so that you don't fail where you can't fail.
> 
> > Umm.  OK.  But you want an mmap of 4TB to succeed, right?  That implies
> > preallocating one billion * sizeof(*vma).  That's, what, dozens of
> > gigabytes right there?
> 
> That's not what I'm proposing here. I'd hoped to make that clear in the
> remainder of the email to which you've replied.
> 
> > I'm sympathetic to wanting to keep both vma-merging and
> > unmap-anything-i-mapped working, but your proposal isn't going to fix it.
> 
> > You need to handle the attacker writing a program which mmaps 46 bits
> > of address space and then munmaps alternate pages.  That program needs
> > to be detected and stopped.
> 
> Let's look at why it's bad to mmap 46 bits of address space and munmap
> alternate pages. It can't be that doing so would just use too much memory:
> you can mmap 46 bits of address space *already* and touch each page, one by
> one, until the kernel gets fed up and the OOM killer kills you.

If it's anonymous memory, sure, the kernel will kill you.  If it's
file-backed memory, the kernel will page it out again.  Sure, page
table consumption might also kill you, but 8 bytes per page is a lot
less memory consumption than ~200 bytes per page!

> So it's not because we'd allocate a lot of memory that having a huge VMA
> tree is bad, because we already let processes allocate globs of memory in
> other ways. The badness comes, AIUI, from the asymptotic behavior of the
> address lookup algorithm in a tree that big.

There's an order of magnitude difference in memory consumption though.

> One approach to dealing with this badness, the one I proposed earlier, is
> to prevent that giant mmap from appearing in the first place (because we'd
> cap vsize). If that giant mmap never appears, you can't generate a huge VMA
> tree by splitting it.

I have 16GB of memory in this laptop.  At 200 bytes per page, allocating
10% of my memory to vm_area_structs (a ridiculously high overhead)
restricts the total amount I can mmap (spread between all processes)
to 8 million pages, 32GB.  Firefox alone is taking 3.6GB and gnome-shell
another 4.4GB.  Your proposal just doesn't work.

> Maybe that's not a good approach. Maybe processes really need mappings that
> big. If they do, then maybe the right approach is to just make 8 billion
> VMAs not "DoS the system". What actually goes wrong if we just let the VMA
> tree grow that large? So what if VMA lookup ends up taking a while --- the
> process with the pathological allocation pattern is paying the cost, right?

There's a per-inode tree of every mapping of that file, so if I mmap
libc and then munmap alternate pages, every user of libc pays the price.


* Re: Why do we let munmap fail?
  2018-05-22  1:19                         ` Theodore Y. Ts'o
@ 2018-05-22  1:41                           ` Daniel Colascione
  2018-05-22  2:09                             ` Daniel Colascione
  2018-05-22  2:11                             ` Matthew Wilcox
  0 siblings, 2 replies; 19+ messages in thread
From: Daniel Colascione @ 2018-05-22  1:41 UTC (permalink / raw)
  To: tytso; +Cc: willy, dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 6:19 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:

> On Mon, May 21, 2018 at 05:38:06PM -0700, Daniel Colascione wrote:
> >
> > One approach to dealing with this badness, the one I proposed earlier, is
> > to prevent that giant mmap from appearing in the first place (because we'd
> > cap vsize). If that giant mmap never appears, you can't generate a huge VMA
> > tree by splitting it.
> >
> > Maybe that's not a good approach. Maybe processes really need mappings that
> > big. If they do, then maybe the right approach is to just make 8 billion
> > VMAs not "DoS the system". What actually goes wrong if we just let the VMA
> > tree grow that large? So what if VMA lookup ends up taking a while --- the
> > process with the pathological allocation pattern is paying the cost, right?
> >

> Fine.  Let's pick a more reasonable size --- say, 1GB.  That's still
> 2**18 4k pages.  Someone who munmap's every other 4k page is going to
> create 2**17 VMA's.  That's a lot of VMA's.  So now the question is do
> we pre-preserve enough VMA's for this worst case scenario, for all
> processes in the system?  Or do we fail or otherwise kill the process
> who is clearly attempting a DOS attack on the system?

> If your goal is that munmap must ***never*** fail, then effectively
> you have to preserve enough resources for 50% of all 4k pages in all
> of the virtual address spaces in use by all of the processes in the
> system.  That's a horrible waste of resources, just to guarantee that
> munmap(2) must never fail.

To be clear, I'm not suggesting that we actually perform this
preallocation. (Maybe in the distant future, with strict commit accounting,
it'd be useful.) I'm just suggesting that we perform the accounting as if
we did. But I think Matthew's convinced me that there's no vsize cap small
enough to be safe and still large enough to be useful, so I'll retract the
vsize cap idea.

> Personally, I think it's not worth it.

> Why is it so important to you that munmap(2) must not fail?  Is it not
> enough to say that if you mmap(2) a region, if you munmap(2) that
> exact same size region as you mmap(2)'ed, it must not fail?  That's a
> much easier guarantee to make....

That'd be good too, but I don't see how this guarantee would be easier to
make. If you call mmap three times, those three allocations might end up
merged into the same VMA, and if you called munmap on the middle
allocation, you'd still have to split. Am I misunderstanding something?
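
For concreteness, a sketch of the case I have in mind (illustrative only;
whether the three mappings actually merge is up to the kernel):

/* Three adjacent anonymous mappings with identical flags will typically
 * merge into a single VMA; unmapping the middle one then forces a split. */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        long page = sysconf(_SC_PAGESIZE);
        char *base = mmap(NULL, 3 * page, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (base == MAP_FAILED)
                return 1;

        /* Three separate mmap() calls, back to back. */
        for (int i = 0; i < 3; i++)
                mmap(base + i * page, page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

        system("cat /proc/self/maps");  /* usually one merged rw region */
        munmap(base + page, page);      /* the middle "allocation" goes away */
        system("cat /proc/self/maps");  /* now two VMAs around the hole */
        return 0;
}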


* Re: Why do we let munmap fail?
  2018-05-22  1:41                           ` Daniel Colascione
@ 2018-05-22  2:09                             ` Daniel Colascione
  2018-05-22  2:11                             ` Matthew Wilcox
  1 sibling, 0 replies; 19+ messages in thread
From: Daniel Colascione @ 2018-05-22  2:09 UTC (permalink / raw)
  To: tytso; +Cc: willy, dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 6:41 PM Daniel Colascione <dancol@google.com> wrote:
> That'd be good too, but I don't see how this guarantee would be easier to
> make. If you call mmap three times, those three allocations might end up
> merged into the same VMA, and if you called munmap on the middle
> allocation, you'd still have to split. Am I misunderstanding something?

Oh: a sequence number stored in the VMA, combined with a refusal to merge
across sequence number differences.
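
Hand-waving in pseudo-C, just to sketch the idea (vm_seq and the merge check
below are invented for illustration, not existing kernel fields):

/* Tag each VMA with the mmap() call that created it and refuse to merge
 * VMAs from different calls, so an exact munmap() of an earlier mapping
 * never needs a split. */
struct tagged_vma {
        /* ...the usual vm_area_struct fields... */
        u64 vm_seq;     /* invented: id of the creating mmap() call */
};

static bool seq_mergeable(const struct tagged_vma *a,
                          const struct tagged_vma *b)
{
        return a->vm_seq == b->vm_seq;
}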


* Re: Why do we let munmap fail?
  2018-05-22  1:41                           ` Daniel Colascione
  2018-05-22  2:09                             ` Daniel Colascione
@ 2018-05-22  2:11                             ` Matthew Wilcox
  1 sibling, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2018-05-22  2:11 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: tytso, dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 06:41:12PM -0700, Daniel Colascione wrote:
> On Mon, May 21, 2018 at 6:19 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
> 
> > On Mon, May 21, 2018 at 05:38:06PM -0700, Daniel Colascione wrote:
> > >
> > > One approach to dealing with this badness, the one I proposed earlier, is
> > > to prevent that giant mmap from appearing in the first place (because we'd
> > > cap vsize). If that giant mmap never appears, you can't generate a huge VMA
> > > tree by splitting it.
> > >
> > > Maybe that's not a good approach. Maybe processes really need mappings that
> > > big. If they do, then maybe the right approach is to just make 8 billion
> > > VMAs not "DoS the system". What actually goes wrong if we just let the VMA
> > > tree grow that large? So what if VMA lookup ends up taking a while --- the
> > > process with the pathological allocation pattern is paying the cost, right?
> > >
> 
> > Fine.  Let's pick a more reasonable size --- say, 1GB.  That's still
> > 2**18 4k pages.  Someone who munmap's every other 4k page is going to
> > create 2**17 VMA's.  That's a lot of VMA's.  So now the question is do
> > we pre-preserve enough VMA's for this worst case scenario, for all
> > processes in the system?  Or do we fail or otherwise kill the process
> > who is clearly attempting a DOS attack on the system?
> 
> > If your goal is that munmap must ***never*** fail, then effectively
> > you have to preserve enough resources for 50% of all 4k pages in all
> > of the virtual address spaces in use by all of the processes in the
> > system.  That's a horrible waste of resources, just to guarantee that
> > munmap(2) must never fail.
> 
> To be clear, I'm not suggesting that we actually perform this
> preallocation. (Maybe in the distant future, with strict commit accounting,
> it'd be useful.) I'm just suggesting that we perform the accounting as if
> we did. But I think Matthew's convinced me that there's no vsize cap small
> enough to be safe and still large enough to be useful, so I'll retract the
> vsize cap idea.
> 
> > Personally, I think it's not worth it.
> 
> > Why is it so important to you that munmap(2) must not fail?  Is it not
> > enough to say that if you mmap(2) a region, if you munmap(2) that
> > exact same size region as you mmap(2)'ed, it must not fail?  That's a
> > much easier guarantee to make....
> 
> That'd be good too, but I don't see how this guarantee would be easier to
> make. If you call mmap three times, those three allocations might end up
> merged into the same VMA, and if you called munmap on the middle
> allocation, you'd still have to split. Am I misunderstanding something?

What I think Ted's proposing (and I was too) is that we either preallocate
or make a note of how many VMAs we've merged.  So you can unmap as many
times as you've mapped without risking failure.  If you start unmapping
in the middle, then you might see munmap failures, but if you only unmap
things that you already mapped, we can guarantee that munmap won't fail.


* Re: Why do we let munmap fail?
  2018-05-22  0:00                   ` Daniel Colascione
  2018-05-22  0:22                     ` Matthew Wilcox
@ 2018-05-22  5:34                     ` Nicholas Piggin
  1 sibling, 0 replies; 19+ messages in thread
From: Nicholas Piggin @ 2018-05-22  5:34 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, 21 May 2018 17:00:47 -0700
Daniel Colascione <dancol@google.com> wrote:

> On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
> 
> > On 05/21/2018 04:16 PM, Daniel Colascione wrote:
> > > On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > >
> > >> On 05/21/2018 03:54 PM, Daniel Colascione wrote:
> > >>>> There are also certainly denial-of-service concerns if you allow
> > >>>> arbitrary numbers of VMAs.  The rbtree, for instance, is O(log(n)), but
> > >>>> I'd be willing to bet there are plenty of things that fall over if you
> > >>>> let the ~65k limit get 10x or 100x larger.
> > >>> Sure. I'm receptive to the idea of having *some* VMA limit. I just think
> > >>> it's unacceptable to let deallocation routines fail.
> > >> If you have a resource limit and deallocation consumes resources, you
> > >> *eventually* have to fail a deallocation.  Right?
> > > That's why robust software sets aside at allocation time whatever resources
> > > are needed to make forward progress at deallocation time.
> 
> > I think there's still a potential dead-end here.  "Deallocation" does
> > not always free resources.  
> 
> Sure, but the general principle applies: reserve resources when you *can*
> fail so that you don't fail where you can't fail.

munmap != deallocation, it's a request to change the address mapping.
A more complex mapping uses more resources. mmap can free resources
if it transforms your mapping to a simpler one.

Thanks,
Nick

