* Why do we let munmap fail?
@ 2018-05-21 22:07 Daniel Colascione
  2018-05-21 22:12 ` Dave Hansen
  0 siblings, 1 reply; 19+ messages in thread
From: Daniel Colascione @ 2018-05-21 22:07 UTC (permalink / raw)
  To: linux-mm; +Cc: Tim Murray, Minchan Kim

Right now, we have this system knob max_map_count that caps the number of
VMAs we can have in a single address space. Put aside for the moment the
question of whether this knob should exist: even if it does, enforcing it
for munmap, mprotect, etc. produces weird and counter-intuitive situations
in which it's possible to fail to return resources (address space and
commit charge) to the system. At a deep philosophical level, that's the
kind of operation that should never fail. A library that does all the
right things can still experience a failure to deallocate resources it
allocated itself if it gets unlucky with VMA merging. Why should we allow
that to happen?

Now let's return to max_map_count itself: what is it supposed to achieve?
If we want to limit application kernel memory resource consumption, let's
limit application kernel memory resource consumption, accounting for it on
a byte basis the same way we account for other kernel objects allocated on
behalf of userspace. Why should we have a separate cap just for the VMA
count?

I propose the following changes:

1) Let -1 mean "no VMA count limit".
2) Default max_map_count to -1.
3) Do not enforce max_map_count on munmap and mprotect.

Alternatively, can we account VMAs toward max_map_count on a page count
basis instead of a VMA basis? This way, no matter how you split and merge
your VMAs, you'll never see a weird failure to release resources. We'd
have to bump the default value of max_map_count to compensate for its new
interpretation.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-21 22:07 Why do we let munmap fail? Daniel Colascione @ 2018-05-21 22:12 ` Dave Hansen 2018-05-21 22:20 ` Daniel Colascione 0 siblings, 1 reply; 19+ messages in thread From: Dave Hansen @ 2018-05-21 22:12 UTC (permalink / raw) To: Daniel Colascione, linux-mm; +Cc: Tim Murray, Minchan Kim On 05/21/2018 03:07 PM, Daniel Colascione wrote: > Now let's return to max_map_count itself: what is it supposed to achieve? > If we want to limit application kernel memory resource consumption, let's > limit application kernel memory resource consumption, accounting for it on > a byte basis the same way we account for other kernel objects allocated on > behalf of userspace. Why should we have a separate cap just for the VMA > count? VMAs consume kernel memory and we can't reclaim them. That's what it boils down to. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-21 22:12 ` Dave Hansen @ 2018-05-21 22:20 ` Daniel Colascione 2018-05-21 22:29 ` Dave Hansen 0 siblings, 1 reply; 19+ messages in thread From: Daniel Colascione @ 2018-05-21 22:20 UTC (permalink / raw) To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 3:12 PM Dave Hansen <dave.hansen@intel.com> wrote: > On 05/21/2018 03:07 PM, Daniel Colascione wrote: > > Now let's return to max_map_count itself: what is it supposed to achieve? > > If we want to limit application kernel memory resource consumption, let's > > limit application kernel memory resource consumption, accounting for it on > > a byte basis the same way we account for other kernel objects allocated on > > behalf of userspace. Why should we have a separate cap just for the VMA > > count? > VMAs consume kernel memory and we can't reclaim them. That's what it > boils down to. How is it different from memfd in that respect? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-21 22:20 ` Daniel Colascione @ 2018-05-21 22:29 ` Dave Hansen 2018-05-21 22:35 ` Daniel Colascione 0 siblings, 1 reply; 19+ messages in thread From: Dave Hansen @ 2018-05-21 22:29 UTC (permalink / raw) To: Daniel Colascione; +Cc: linux-mm, Tim Murray, Minchan Kim On 05/21/2018 03:20 PM, Daniel Colascione wrote: >> VMAs consume kernel memory and we can't reclaim them. That's what it >> boils down to. > How is it different from memfd in that respect? I don't really know what you mean. I know folks use memfd to figure out how much memory pressure we are under. I guess that would trigger when you consume lots of memory with VMAs. VMAs are probably the most similar to things like page tables that are kernel memory that can't be directly reclaimed, but do get freed at OOM-kill-time. But, VMAs are a bit harder than page tables because freeing a page worth of VMAs does not necessarily free an entire page. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-21 22:29 ` Dave Hansen @ 2018-05-21 22:35 ` Daniel Colascione 2018-05-21 22:48 ` Dave Hansen 0 siblings, 1 reply; 19+ messages in thread From: Daniel Colascione @ 2018-05-21 22:35 UTC (permalink / raw) To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 3:29 PM Dave Hansen <dave.hansen@intel.com> wrote: > On 05/21/2018 03:20 PM, Daniel Colascione wrote: > >> VMAs consume kernel memory and we can't reclaim them. That's what it > >> boils down to. > > How is it different from memfd in that respect? > I don't really know what you mean. I should have been more clear. I meant, in general, that processes can *already* ask the kernel to allocate memory on behalf of the process, and sometimes this memory can't be reclaimed without an OOM kill. (You can swap memfd/tmpfs contents, but for simplicity, imagine we're running without a pagefile.) > I know folks use memfd to figure out > how much memory pressure we are under. I guess that would trigger when > you consume lots of memory with VMAs. I think you're thinking of the VM pressure level special files, not memfd, which creates an anonymous tmpfs file. > VMAs are probably the most similar to things like page tables that are > kernel memory that can't be directly reclaimed, but do get freed at > OOM-kill-time. But, VMAs are a bit harder than page tables because > freeing a page worth of VMAs does not necessarily free an entire page. I don't understand. We can reclaim memory used by VMAs by killing the process or processes attached to the address space that owns those VMAs. The OOM killer should Just Work. Why do we have to have some special limit of VMA count? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail?
  2018-05-21 22:35 ` Daniel Colascione
@ 2018-05-21 22:48   ` Dave Hansen
  2018-05-21 22:54     ` Daniel Colascione
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Hansen @ 2018-05-21 22:48 UTC (permalink / raw)
  To: Daniel Colascione; +Cc: linux-mm, Tim Murray, Minchan Kim

On 05/21/2018 03:35 PM, Daniel Colascione wrote:
>> I know folks use memfd to figure out
>> how much memory pressure we are under.  I guess that would trigger when
>> you consume lots of memory with VMAs.
>
> I think you're thinking of the VM pressure level special files, not memfd,
> which creates an anonymous tmpfs file.

Yep, you're right.

>> VMAs are probably the most similar to things like page tables that are
>> kernel memory that can't be directly reclaimed, but do get freed at
>> OOM-kill-time.  But, VMAs are a bit harder than page tables because
>> freeing a page worth of VMAs does not necessarily free an entire page.
>
> I don't understand. We can reclaim memory used by VMAs by killing the
> process or processes attached to the address space that owns those VMAs.
> The OOM killer should Just Work. Why do we have to have some special limit
> of VMA count?

The OOM killer doesn't take the VMA count into consideration as far as I
remember.  I can't think of any reason why not except for the internal
fragmentation problem.

The current VMA limit is ~12MB of VMAs per process, which is quite a
bit.  I think it would be reasonable to start considering that in OOM
decisions, although it's surely inconsequential except on very small
systems.

There are also certainly denial-of-service concerns if you allow
arbitrary numbers of VMAs.  The rbtree, for instance, is O(log(n)), but
I'd be willing to bet there are plenty of things that fall over if you
let the ~65k limit get 10x or 100x larger.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-21 22:48 ` Dave Hansen @ 2018-05-21 22:54 ` Daniel Colascione 2018-05-21 23:02 ` Dave Hansen 0 siblings, 1 reply; 19+ messages in thread From: Daniel Colascione @ 2018-05-21 22:54 UTC (permalink / raw) To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 3:48 PM Dave Hansen <dave.hansen@intel.com> wrote: > On 05/21/2018 03:35 PM, Daniel Colascione wrote: > >> I know folks use memfd to figure out > >> how much memory pressure we are under. I guess that would trigger when > >> you consume lots of memory with VMAs. > > > > I think you're thinking of the VM pressure level special files, not memfd, > > which creates an anonymous tmpfs file. > Yep, you're right. > >> VMAs are probably the most similar to things like page tables that are > >> kernel memory that can't be directly reclaimed, but do get freed at > >> OOM-kill-time. But, VMAs are a bit harder than page tables because > >> freeing a page worth of VMAs does not necessarily free an entire page. > > > > I don't understand. We can reclaim memory used by VMAs by killing the > > process or processes attached to the address space that owns those VMAs. > > The OOM killer should Just Work. Why do we have to have some special limit > > of VMA count? > The OOM killer doesn't take the VMA count into consideration as far as I > remember. I can't think of any reason why not except for the internal > fragmentation problem. > The current VMA limit is ~12MB of VMAs per process, which is quite a > bit. I think it would be reasonable to start considering that in OOM > decisions, although it's surely inconsequential except on very small > systems. > There are also certainly denial-of-service concerns if you allow > arbitrary numbers of VMAs. The rbtree, for instance, is O(log(n)), but > I 'd be willing to be there are plenty of things that fall over if you > let the ~65k limit get 10x or 100x larger. Sure. I'm receptive to the idea of having *some* VMA limit. 
I just think it's unacceptable to let deallocation routines fail.

What about the proposal at the end of my original message? If we account
for mapped address space by counting pages instead of counting VMAs, no
amount of VMA splitting can trip us over the threshold. We could just
impose a system-wide vsize limit in addition to RLIMIT_AS, with the
effective limit being the smaller of the two. (On further thought, we'd
probably want to leave the meaning of max_map_count unchanged and
introduce a new knob.)

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-21 22:54 ` Daniel Colascione @ 2018-05-21 23:02 ` Dave Hansen 2018-05-21 23:16 ` Daniel Colascione 0 siblings, 1 reply; 19+ messages in thread From: Dave Hansen @ 2018-05-21 23:02 UTC (permalink / raw) To: Daniel Colascione; +Cc: linux-mm, Tim Murray, Minchan Kim On 05/21/2018 03:54 PM, Daniel Colascione wrote: >> There are also certainly denial-of-service concerns if you allow >> arbitrary numbers of VMAs. The rbtree, for instance, is O(log(n)), but >> I 'd be willing to be there are plenty of things that fall over if you >> let the ~65k limit get 10x or 100x larger. > Sure. I'm receptive to the idea of having *some* VMA limit. I just think > it's unacceptable let deallocation routines fail. If you have a resource limit and deallocation consumes resources, you *eventually* have to fail a deallocation. Right? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-21 23:02 ` Dave Hansen @ 2018-05-21 23:16 ` Daniel Colascione 2018-05-21 23:32 ` Dave Hansen 0 siblings, 1 reply; 19+ messages in thread From: Daniel Colascione @ 2018-05-21 23:16 UTC (permalink / raw) To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@intel.com> wrote: > On 05/21/2018 03:54 PM, Daniel Colascione wrote: > >> There are also certainly denial-of-service concerns if you allow > >> arbitrary numbers of VMAs. The rbtree, for instance, is O(log(n)), but > >> I 'd be willing to be there are plenty of things that fall over if you > >> let the ~65k limit get 10x or 100x larger. > > Sure. I'm receptive to the idea of having *some* VMA limit. I just think > > it's unacceptable let deallocation routines fail. > If you have a resource limit and deallocation consumes resources, you > *eventually* have to fail a deallocation. Right? That's why robust software sets aside at allocation time whatever resources are needed to make forward progress at deallocation time. That's what I'm trying to propose here, essentially: if we specify the VMA limit in terms of pages and not the number of VMAs, we've effectively "budgeted" for the worst case of VMA splitting, since in the worst case, you end up with one page per VMA. Done this way, we still prevent runaway VMA tree growth, but we can also make sure that anyone who's successfully called mmap can successfully call munmap. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-21 23:16 ` Daniel Colascione @ 2018-05-21 23:32 ` Dave Hansen 2018-05-22 0:00 ` Daniel Colascione 0 siblings, 1 reply; 19+ messages in thread From: Dave Hansen @ 2018-05-21 23:32 UTC (permalink / raw) To: Daniel Colascione; +Cc: linux-mm, Tim Murray, Minchan Kim On 05/21/2018 04:16 PM, Daniel Colascione wrote: > On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@intel.com> wrote: > >> On 05/21/2018 03:54 PM, Daniel Colascione wrote: >>>> There are also certainly denial-of-service concerns if you allow >>>> arbitrary numbers of VMAs. The rbtree, for instance, is O(log(n)), but >>>> I 'd be willing to be there are plenty of things that fall over if you >>>> let the ~65k limit get 10x or 100x larger. >>> Sure. I'm receptive to the idea of having *some* VMA limit. I just think >>> it's unacceptable let deallocation routines fail. >> If you have a resource limit and deallocation consumes resources, you >> *eventually* have to fail a deallocation. Right? > That's why robust software sets aside at allocation time whatever resources > are needed to make forward progress at deallocation time. I think there's still a potential dead-end here. "Deallocation" does not always free resources. > That's what I'm trying to propose here, essentially: if we specify > the VMA limit in terms of pages and not the number of VMAs, we've > effectively "budgeted" for the worst case of VMA splitting, since in > the worst case, you end up with one page per VMA. Not a bad idea, but it's not really how we allocate VMAs today. You would somehow need per-process (mm?) slabs. Such a scheme would probably, on average, waste half of a page per mm. > Done this way, we still prevent runaway VMA tree growth, but we can also > make sure that anyone who's successfully called mmap can successfully call > munmap. I'd be curious how this works out, but I bet you end up reserving a lot more resources than people want. 
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-21 23:32 ` Dave Hansen @ 2018-05-22 0:00 ` Daniel Colascione 2018-05-22 0:22 ` Matthew Wilcox 2018-05-22 5:34 ` Nicholas Piggin 0 siblings, 2 replies; 19+ messages in thread From: Daniel Colascione @ 2018-05-22 0:00 UTC (permalink / raw) To: dave.hansen; +Cc: linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote: > On 05/21/2018 04:16 PM, Daniel Colascione wrote: > > On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > >> On 05/21/2018 03:54 PM, Daniel Colascione wrote: > >>>> There are also certainly denial-of-service concerns if you allow > >>>> arbitrary numbers of VMAs. The rbtree, for instance, is O(log(n)), but > >>>> I 'd be willing to be there are plenty of things that fall over if you > >>>> let the ~65k limit get 10x or 100x larger. > >>> Sure. I'm receptive to the idea of having *some* VMA limit. I just think > >>> it's unacceptable let deallocation routines fail. > >> If you have a resource limit and deallocation consumes resources, you > >> *eventually* have to fail a deallocation. Right? > > That's why robust software sets aside at allocation time whatever resources > > are needed to make forward progress at deallocation time. > I think there's still a potential dead-end here. "Deallocation" does > not always free resources. Sure, but the general principle applies: reserve resources when you *can* fail so that you don't fail where you can't fail. > > That's what I'm trying to propose here, essentially: if we specify > > the VMA limit in terms of pages and not the number of VMAs, we've > > effectively "budgeted" for the worst case of VMA splitting, since in > > the worst case, you end up with one page per VMA. > Not a bad idea, but it's not really how we allocate VMAs today. You > would somehow need per-process (mm?) slabs. Such a scheme would > probably, on average, waste half of a page per mm. 
> > Done this way, we still prevent runaway VMA tree growth, but we can also > > make sure that anyone who's successfully called mmap can successfully call > > munmap. > I'd be curious how this works out, but I bet you end up reserving a lot > more resources than people want. I'm not sure. We're talking about two separate goals, I think. Goal #1 is preventing the VMA tree becoming so large that we effectively DoS the system. Goal #2 is about ensuring that the munmap path can't fail. Right now, the system only achieves goal #1. All we have to do to continue to achieve goal #1 is impose *some* sanity limit on the VMA count, right? It doesn't really matter whether the limit is specified in pages or number-of-VMAs so long as it's larger than most applications will need but smaller than the DoS threshold. The resource we're allocating at mmap time isn't really bytes of struct-vm_area_struct-backing-storage, but sort of virtual anti-DoS credits. Right now, these anti-DoS credits are denominated in number of VMAs, but if we changed the denomination to page counts instead, we'd still achieve goal #1 while avoiding the munmap-failing-with-ENOMEM weirdness. Granted, if we make only this change, then munmap internal allocations *still* fail if the actual VMA allocation failed, but I think the default kernel OOM killer strategy will suffice for handling this kind of global extreme memory pressure situation. All we have to do is change the *limit check* during VMA creation, not the actual allocation strategy. Another way of looking at it: Linux is usually configured to overcommit with respect to *commit charge*. This behavior is well-known and widely understood. What the VMA limit does is effectively overcommit with respect to *address space*, which is weird and surprising because we normally think of address space as being strictly accounted. If we can easily and cheaply make address space actually strictly accounted, why not give it a shot? 
Goal #2 is interesting as well, and I think it's what your slab-allocation proposal would help address. If we literally set aside memory for all possible VMAs, we'd ensure that internal allocations on the munmap path could never fail. In the abstract, I'd like that (I'm a fan of strict commit accounting generally), but I don't think it's necessary for fixing the problem that motivated this thread. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-22 0:00 ` Daniel Colascione @ 2018-05-22 0:22 ` Matthew Wilcox 2018-05-22 0:38 ` Daniel Colascione 2018-05-22 5:34 ` Nicholas Piggin 1 sibling, 1 reply; 19+ messages in thread From: Matthew Wilcox @ 2018-05-22 0:22 UTC (permalink / raw) To: Daniel Colascione; +Cc: dave.hansen, linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 05:00:47PM -0700, Daniel Colascione wrote: > On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote: > > I think there's still a potential dead-end here. "Deallocation" does > > not always free resources. > > Sure, but the general principle applies: reserve resources when you *can* > fail so that you don't fail where you can't fail. Umm. OK. But you want an mmap of 4TB to succeed, right? That implies preallocating one billion * sizeof(*vma). That's, what, dozens of gigabytes right there? I'm sympathetic to wanting to keep both vma-merging and unmap-anything-i-mapped working, but your proposal isn't going to fix it. You need to handle the attacker writing a program which mmaps 46 bits of address space and then munmaps alternate pages. That program needs to be detected and stopped. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-22 0:22 ` Matthew Wilcox @ 2018-05-22 0:38 ` Daniel Colascione 2018-05-22 1:19 ` Theodore Y. Ts'o 2018-05-22 1:22 ` Matthew Wilcox 0 siblings, 2 replies; 19+ messages in thread From: Daniel Colascione @ 2018-05-22 0:38 UTC (permalink / raw) To: willy; +Cc: dave.hansen, linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 5:22 PM Matthew Wilcox <willy@infradead.org> wrote: > On Mon, May 21, 2018 at 05:00:47PM -0700, Daniel Colascione wrote: > > On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote: > > > I think there's still a potential dead-end here. "Deallocation" does > > > not always free resources. > > > > Sure, but the general principle applies: reserve resources when you *can* > > fail so that you don't fail where you can't fail. > Umm. OK. But you want an mmap of 4TB to succeed, right? That implies > preallocating one billion * sizeof(*vma). That's, what, dozens of > gigabytes right there? That's not what I'm proposing here. I'd hoped to make that clear in the remainder of the email to which you've replied. > I'm sympathetic to wanting to keep both vma-merging and > unmap-anything-i-mapped working, but your proposal isn't going to fix it. > You need to handle the attacker writing a program which mmaps 46 bits > of address space and then munmaps alternate pages. That program needs > to be detected and stopped. Let's look at why it's bad to mmap 46 bits of address space and munmap alternate pages. It can't be that doing so would just use too much memory: you can mmap 46 bits of address space *already* and touch each page, one by one, until the kernel gets fed up and the OOM killer kills you. So it's not because we'd allocate a lot of memory that having a huge VMA tree is bad, because we already let processes allocate globs of memory in other ways. The badness comes, AIUI, from the asymptotic behavior of the address lookup algorithm in a tree that big. 
One approach to dealing with this badness, the one I proposed earlier, is to prevent that giant mmap from appearing in the first place (because we'd cap vsize). If that giant mmap never appears, you can't generate a huge VMA tree by splitting it. Maybe that's not a good approach. Maybe processes really need mappings that big. If they do, then maybe the right approach is to just make 8 billion VMAs not "DoS the system". What actually goes wrong if we just let the VMA tree grow that large? So what if VMA lookup ends up taking a while --- the process with the pathological allocation pattern is paying the cost, right? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-22 0:38 ` Daniel Colascione @ 2018-05-22 1:19 ` Theodore Y. Ts'o 2018-05-22 1:41 ` Daniel Colascione 2018-05-22 1:22 ` Matthew Wilcox 1 sibling, 1 reply; 19+ messages in thread From: Theodore Y. Ts'o @ 2018-05-22 1:19 UTC (permalink / raw) To: Daniel Colascione; +Cc: willy, dave.hansen, linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 05:38:06PM -0700, Daniel Colascione wrote: > > One approach to dealing with this badness, the one I proposed earlier, is > to prevent that giant mmap from appearing in the first place (because we'd > cap vsize). If that giant mmap never appears, you can't generate a huge VMA > tree by splitting it. > > Maybe that's not a good approach. Maybe processes really need mappings that > big. If they do, then maybe the right approach is to just make 8 billion > VMAs not "DoS the system". What actually goes wrong if we just let the VMA > tree grow that large? So what if VMA lookup ends up taking a while --- the > process with the pathological allocation pattern is paying the cost, right? > Fine. Let's pick a more reasonable size --- say, 1GB. That's still 2**18 4k pages. Someone who munmap's every other 4k page is going to create 2**17 VMA's. That's a lot of VMA's. So now the question is do we pre-preserve enough VMA's for this worst case scenario, for all processes in the system? Or do we fail or otherwise kill the process who is clearly attempting a DOS attack on the system? If your goal is that munmap must ***never*** fail, then effectively you have to preserve enough resources for 50% of all 4k pages in all of the virtual address spaces in use by all of the processes in the system. That's a horrible waste of resources, just to guarantee that munmap(2) must never fail. Personally, I think it's not worth it. Why is it so important to you that munmap(2) must not fail? 
Is it not enough to say that if you mmap(2) a region and then munmap(2)
that exact same region you mmap(2)'ed, it must not fail?  That's a much
easier guarantee to make....

					- Ted

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-22 1:19 ` Theodore Y. Ts'o @ 2018-05-22 1:41 ` Daniel Colascione 2018-05-22 2:09 ` Daniel Colascione 2018-05-22 2:11 ` Matthew Wilcox 0 siblings, 2 replies; 19+ messages in thread From: Daniel Colascione @ 2018-05-22 1:41 UTC (permalink / raw) To: tytso; +Cc: willy, dave.hansen, linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 6:19 PM Theodore Y. Ts'o <tytso@mit.edu> wrote: > On Mon, May 21, 2018 at 05:38:06PM -0700, Daniel Colascione wrote: > > > > One approach to dealing with this badness, the one I proposed earlier, is > > to prevent that giant mmap from appearing in the first place (because we'd > > cap vsize). If that giant mmap never appears, you can't generate a huge VMA > > tree by splitting it. > > > > Maybe that's not a good approach. Maybe processes really need mappings that > > big. If they do, then maybe the right approach is to just make 8 billion > > VMAs not "DoS the system". What actually goes wrong if we just let the VMA > > tree grow that large? So what if VMA lookup ends up taking a while --- the > > process with the pathological allocation pattern is paying the cost, right? > > > Fine. Let's pick a more reasonable size --- say, 1GB. That's still > 2**18 4k pages. Someone who munmap's every other 4k page is going to > create 2**17 VMA's. That's a lot of VMA's. So now the question is do > we pre-preserve enough VMA's for this worst case scenario, for all > processes in the system? Or do we fail or otherwise kill the process > who is clearly attempting a DOS attack on the system? > If your goal is that munmap must ***never*** fail, then effectively > you have to preserve enough resources for 50% of all 4k pages in all > of the virtual address spaces in use by all of the processes in the > system. That's a horrible waste of resources, just to guarantee that > munmap(2) must never fail. To be clear, I'm not suggesting that we actually perform this preallocation. 
(Maybe in the distant future, with strict commit accounting, it'd be useful.) I'm just suggesting that we perform the accounting as if we did. But I think Matthew's convinced me that there's no vsize cap small enough to be safe and still large enough to be useful, so I'll retract the vsize cap idea. > Personally, I think it's not worth it. > Why is it so important to you that munmap(2) must not fail? Is it not > enough to say that if you mmap(2) a region, if you munmap(2) that > exact same size region as you mmap(2)'ed, it must not fail? That's a > much easier guarantee to make.... That'd be good too, but I don't see how this guarantee would be easier to make. If you call mmap three times, those three allocations might end up merged into the same VMA, and if you called munmap on the middle allocation, you'd still have to split. Am I misunderstanding something? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-22 1:41 ` Daniel Colascione @ 2018-05-22 2:09 ` Daniel Colascione 2018-05-22 2:11 ` Matthew Wilcox 1 sibling, 0 replies; 19+ messages in thread From: Daniel Colascione @ 2018-05-22 2:09 UTC (permalink / raw) To: tytso; +Cc: willy, dave.hansen, linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 6:41 PM Daniel Colascione <dancol@google.com> wrote: > That'd be good too, but I don't see how this guarantee would be easier to > make. If you call mmap three times, those three allocations might end up > merged into the same VMA, and if you called munmap on the middle > allocation, you'd still have to split. Am I misunderstanding something? Oh: a sequence number stored in the VMA, combined with a refusal to merge across sequence number differences. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail? 2018-05-22 1:41 ` Daniel Colascione 2018-05-22 2:09 ` Daniel Colascione @ 2018-05-22 2:11 ` Matthew Wilcox 1 sibling, 0 replies; 19+ messages in thread From: Matthew Wilcox @ 2018-05-22 2:11 UTC (permalink / raw) To: Daniel Colascione; +Cc: tytso, dave.hansen, linux-mm, Tim Murray, Minchan Kim On Mon, May 21, 2018 at 06:41:12PM -0700, Daniel Colascione wrote: > On Mon, May 21, 2018 at 6:19 PM Theodore Y. Ts'o <tytso@mit.edu> wrote: > > > On Mon, May 21, 2018 at 05:38:06PM -0700, Daniel Colascione wrote: > > > > > > One approach to dealing with this badness, the one I proposed earlier, > is > > > to prevent that giant mmap from appearing in the first place (because > we'd > > > cap vsize). If that giant mmap never appears, you can't generate a huge > VMA > > > tree by splitting it. > > > > > > Maybe that's not a good approach. Maybe processes really need mappings > that > > > big. If they do, then maybe the right approach is to just make 8 billion > > > VMAs not "DoS the system". What actually goes wrong if we just let the > VMA > > > tree grow that large? So what if VMA lookup ends up taking a while --- > the > > > process with the pathological allocation pattern is paying the cost, > right? > > > > > > Fine. Let's pick a more reasonable size --- say, 1GB. That's still > > 2**18 4k pages. Someone who munmap's every other 4k page is going to > > create 2**17 VMA's. That's a lot of VMA's. So now the question is do > > we pre-preserve enough VMA's for this worst case scenario, for all > > processes in the system? Or do we fail or otherwise kill the process > > who is clearly attempting a DOS attack on the system? > > > If your goal is that munmap must ***never*** fail, then effectively > > you have to preserve enough resources for 50% of all 4k pages in all > > of the virtual address spaces in use by all of the processes in the > > system. That's a horrible waste of resources, just to guarantee that > > munmap(2) must never fail. 
> > To be clear, I'm not suggesting that we actually perform this > preallocation. (Maybe in the distant future, with strict commit accounting, > it'd be useful.) I'm just suggesting that we perform the accounting as if > we did. But I think Matthew's convinced me that there's no vsize cap small > enough to be safe and still large enough to be useful, so I'll retract the > vsize cap idea. > > > Personally, I think it's not worth it. > > > Why is it so important to you that munmap(2) must not fail? Is it not > > enough to say that if you mmap(2) a region, if you munmap(2) that > > exact same size region as you mmap(2)'ed, it must not fail? That's a > > much easier guarantee to make.... > > That'd be good too, but I don't see how this guarantee would be easier to > make. If you call mmap three times, those three allocations might end up > merged into the same VMA, and if you called munmap on the middle > allocation, you'd still have to split. Am I misunderstanding something? What I think Ted's proposing (and I was too) is that we either preallocate or make a note of how many VMAs we've merged. So you can unmap as many times as you've mapped without risking failure. If you start unmapping in the middle, then you might see munmap failures, but if you only unmap things that you already mapped, we can guarantee that munmap won't fail. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail?
  2018-05-22  0:38 ` Daniel Colascione
  2018-05-22  1:19 ` Theodore Y. Ts'o
@ 2018-05-22  1:22 ` Matthew Wilcox
  1 sibling, 0 replies; 19+ messages in thread
From: Matthew Wilcox @ 2018-05-22 1:22 UTC (permalink / raw)
To: Daniel Colascione; +Cc: dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, May 21, 2018 at 05:38:06PM -0700, Daniel Colascione wrote:
> On Mon, May 21, 2018 at 5:22 PM Matthew Wilcox <willy@infradead.org> wrote:
> > On Mon, May 21, 2018 at 05:00:47PM -0700, Daniel Colascione wrote:
> > > On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > > I think there's still a potential dead-end here.  "Deallocation"
> > > > does not always free resources.
> > >
> > > Sure, but the general principle applies: reserve resources when you
> > > *can* fail so that you don't fail where you can't fail.
> >
> > Umm.  OK.  But you want an mmap of 4TB to succeed, right?  That implies
> > preallocating one billion * sizeof(*vma).  That's, what, dozens of
> > gigabytes right there?
>
> That's not what I'm proposing here. I'd hoped to make that clear in the
> remainder of the email to which you've replied.
>
> > I'm sympathetic to wanting to keep both vma-merging and
> > unmap-anything-i-mapped working, but your proposal isn't going to fix
> > it.
> >
> > You need to handle the attacker writing a program which mmaps 46 bits
> > of address space and then munmaps alternate pages.  That program needs
> > to be detected and stopped.
>
> Let's look at why it's bad to mmap 46 bits of address space and munmap
> alternate pages. It can't be that doing so would just use too much
> memory: you can mmap 46 bits of address space *already* and touch each
> page, one by one, until the kernel gets fed up and the OOM killer kills
> you.

If it's anonymous memory, sure, the kernel will kill you.  If it's
file-backed memory, the kernel will page it out again.
Sure, page table consumption might also kill you, but 8 bytes per page
is a lot less memory consumption than ~200 bytes per page!

> So it's not because we'd allocate a lot of memory that having a huge VMA
> tree is bad, because we already let processes allocate globs of memory
> in other ways. The badness comes, AIUI, from the asymptotic behavior of
> the address lookup algorithm in a tree that big.

There's an order of magnitude difference in memory consumption though.

> One approach to dealing with this badness, the one I proposed earlier,
> is to prevent that giant mmap from appearing in the first place (because
> we'd cap vsize). If that giant mmap never appears, you can't generate a
> huge VMA tree by splitting it.

I have 16GB of memory in this laptop.  At 200 bytes per page, allocating
10% of my memory to vm_area_structs (a ridiculously high overhead)
restricts the total amount I can mmap (spread between all processes) to
8 million pages, 32GB.  Firefox alone is taking 3.6GB; gnome-shell is
taking another 4.4GB.  Your proposal just doesn't work.

> Maybe that's not a good approach. Maybe processes really need mappings
> that big. If they do, then maybe the right approach is to just make 8
> billion VMAs not "DoS the system". What actually goes wrong if we just
> let the VMA tree grow that large? So what if VMA lookup ends up taking a
> while --- the process with the pathological allocation pattern is paying
> the cost, right?

There's a per-inode tree of every mapping of that file, so if I mmap
libc and then munmap alternate pages, every user of libc pays the price.

^ permalink raw reply	[flat|nested] 19+ messages in thread
* Re: Why do we let munmap fail?
  2018-05-22  0:00 ` Daniel Colascione
  2018-05-22  0:22 ` Matthew Wilcox
@ 2018-05-22  5:34 ` Nicholas Piggin
  1 sibling, 0 replies; 19+ messages in thread
From: Nicholas Piggin @ 2018-05-22 5:34 UTC (permalink / raw)
To: Daniel Colascione; +Cc: dave.hansen, linux-mm, Tim Murray, Minchan Kim

On Mon, 21 May 2018 17:00:47 -0700 Daniel Colascione <dancol@google.com> wrote:
> On Mon, May 21, 2018 at 4:32 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > On 05/21/2018 04:16 PM, Daniel Colascione wrote:
> > > On Mon, May 21, 2018 at 4:02 PM Dave Hansen <dave.hansen@intel.com> wrote:
> > > > On 05/21/2018 03:54 PM, Daniel Colascione wrote:
> > > > > > There are also certainly denial-of-service concerns if you
> > > > > > allow arbitrary numbers of VMAs.  The rbtree, for instance, is
> > > > > > O(log(n)), but I'd be willing to bet there are plenty of things
> > > > > > that fall over if you let the ~65k limit get 10x or 100x larger.
> > > > > Sure. I'm receptive to the idea of having *some* VMA limit. I
> > > > > just think it's unacceptable to let deallocation routines fail.
> > > > If you have a resource limit and deallocation consumes resources,
> > > > you *eventually* have to fail a deallocation.  Right?
> > > That's why robust software sets aside at allocation time whatever
> > > resources are needed to make forward progress at deallocation time.
> > I think there's still a potential dead-end here.  "Deallocation" does
> > not always free resources.
>
> Sure, but the general principle applies: reserve resources when you
> *can* fail so that you don't fail where you can't fail.

munmap != deallocation, it's a request to change the address mapping. A
more complex mapping uses more resources.

mmap can free resources if it transforms your mapping to a simpler one.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 19+ messages in thread
end of thread, other threads:[~2018-05-22  5:34 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-05-21 22:07 Why do we let munmap fail? Daniel Colascione
2018-05-21 22:12 ` Dave Hansen
2018-05-21 22:20 ` Daniel Colascione
2018-05-21 22:29 ` Dave Hansen
2018-05-21 22:35 ` Daniel Colascione
2018-05-21 22:48 ` Dave Hansen
2018-05-21 22:54 ` Daniel Colascione
2018-05-21 23:02 ` Dave Hansen
2018-05-21 23:16 ` Daniel Colascione
2018-05-21 23:32 ` Dave Hansen
2018-05-22  0:00 ` Daniel Colascione
2018-05-22  0:22 ` Matthew Wilcox
2018-05-22  0:38 ` Daniel Colascione
2018-05-22  1:19 ` Theodore Y. Ts'o
2018-05-22  1:41 ` Daniel Colascione
2018-05-22  2:09 ` Daniel Colascione
2018-05-22  2:11 ` Matthew Wilcox
2018-05-22  1:22 ` Matthew Wilcox
2018-05-22  5:34 ` Nicholas Piggin