linux-kernel.vger.kernel.org archive mirror
* mremap vs sysctl_max_map_count
@ 2019-02-18  8:33 Oscar Salvador
  2019-02-18  9:57 ` Vlastimil Babka
  0 siblings, 1 reply; 6+ messages in thread
From: Oscar Salvador @ 2019-02-18  8:33 UTC (permalink / raw)
  To: linux-mm, linux-kernel, linux-api; +Cc: hughd, viro, torvalds


Hi all,

I would like to bring up a topic that comes from an issue a customer of ours
is facing with the mremap syscall + hitting the max_map_count threshold:

When passing the MREMAP_FIXED flag, mremap() calls mremap_to() which does the
following:

1) it unmaps the region where we want to put the new map:
   [new_addr, new_addr + new_len) [1]
2) IFF old_len > new_len, it unmaps the region:
   [old_addr + new_len, (old_addr + new_len) + (old_len - new_len)) [2]

Now, having gone through steps 1) and 2), we eventually call move_vma() to do
the actual move.

move_vma() checks if we are at least 4 maps below max_map_count, otherwise
it bails out with -ENOMEM [3].
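
For reference, the check in move_vma() [3] is roughly the following; the slack
of 3 is there because a single do_munmap() may split one vma into three:

	if (mm->map_count >= sysctl_max_map_count - 3)
		return -ENOMEM;
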
The problem is that we might have already unmapped the vma's in steps 1) and 2),
so it is not possible for userspace to figure out the state of the vma's after
it gets -ENOMEM.

- Did new_addr get unmapped?
- Did part of old_addr get unmapped?

Because of that, it gets tricky for userspace to clean up properly on error
path.
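
To make the failure mode concrete, here is a minimal userspace sketch (not the
actual workload; it assumes vm.max_map_count has been lowered, e.g. to a few
thousand, so the table fills quickly, keeps the filler vma's from merging by
alternating protections, and only uses mincore() as a cheap probe for "is this
range still mapped"):

#define _GNU_SOURCE
#include <errno.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char vec[4];
	void *src, *dst, *p;
	int i;

	/*
	 * src is the region to move; dst is the MREMAP_FIXED target.
	 * dst is PROT_NONE so the filler mappings below cannot merge with it.
	 */
	src = mmap(NULL, 4 * page, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	dst = mmap(NULL, 4 * page, PROT_NONE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	/*
	 * Exhaust the map count; alternating protections keep the
	 * single-page fillers from merging with each other.
	 */
	for (i = 0; ; i++) {
		p = mmap(NULL, page,
			 (i & 1) ? PROT_READ : (PROT_READ | PROT_WRITE),
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			break;	/* map_count has hit the limit */
	}

	/* This can return ENOMEM *after* dst has already been unmapped. */
	p = mremap(src, 4 * page, 4 * page,
		   MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	if (p == MAP_FAILED && errno == ENOMEM) {
		/* mincore() reports ENOMEM for ranges with unmapped pages. */
		if (mincore(dst, 4 * page, vec) < 0 && errno == ENOMEM)
			printf("mremap() failed and dst was already unmapped\n");
	}
	return 0;
}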

While it is true that we can return -ENOMEM for more reasons
(e.g: see vma_to_resize()->may_expand_vm()), I think that we might be able to
pre-compute the number of maps that we are going to add/release during the first
two do_munmaps(), and check whether we are 4 maps below the threshold
(as move_vma() does).
Should that not be the case, we can bail out early before we unmap anything, so we
make sure the vma's are left untouched in case we are going to be short of maps.

I am not sure if that is realistically doable, or there are limitations
I overlooked, or we simply do not want to do that.

Before investing more time and giving it a shot, I just wanted to bring
this upstream to get feedback on this matter.

Thanks

[1] https://github.com/torvalds/linux/blob/master/mm/mremap.c#L519
[2] https://github.com/torvalds/linux/blob/master/mm/mremap.c#L523
[3] https://github.com/torvalds/linux/blob/master/mm/mremap.c#L338

-- 
Oscar Salvador
SUSE L3


* Re: mremap vs sysctl_max_map_count
  2019-02-18  8:33 mremap vs sysctl_max_map_count Oscar Salvador
@ 2019-02-18  9:57 ` Vlastimil Babka
  2019-02-18 10:29   ` Mike Rapoport
  2019-02-18 11:15   ` Kirill A. Shutemov
  0 siblings, 2 replies; 6+ messages in thread
From: Vlastimil Babka @ 2019-02-18  9:57 UTC (permalink / raw)
  To: Oscar Salvador, linux-mm, linux-kernel, linux-api; +Cc: hughd, viro, torvalds

On 2/18/19 9:33 AM, Oscar Salvador wrote:
> 
> Hi all,
> 
> I would like to bring up a topic that comes from an issue a customer of ours
> is facing with the mremap syscall + hitting the max_map_count threshold:
> 
> When passing the MREMAP_FIXED flag, mremap() calls mremap_to() which does the
> following:
> 
> 1) it unmaps the region where we want to put the new map:
>    [new_addr, new_addr + new_len) [1]
> 2) IFF old_len > new_len, it unmaps the region:
>    [old_addr + new_len, (old_addr + new_len) + (old_len - new_len)) [2]
> 
> Now, having gone through steps 1) and 2), we eventually call move_vma() to do
> the actual move.
> 
> move_vma() checks if we are at least 4 maps below max_map_count, otherwise
> it bails out with -ENOMEM [3].
> The problem is that we might have already unmapped the vma's in steps 1) and 2),
> so it is not possible for userspace to figure out the state of the vma's after
> it gets -ENOMEM.
> 
> - Did new_addr get unmapped?
> - Did part of old_addr get unmapped?
> 
> Because of that, it gets tricky for userspace to clean up properly on error
> path.
> 
> While it is true that we can return -ENOMEM for more reasons
> (e.g: see vma_to_resize()->may_expand_vm()), I think that we might be able to
> pre-compute the number of maps that we are going to add/release during the first
> two do_munmaps(), and check whether we are 4 maps below the threshold
> (as move_vma() does).
> Should that not be the case, we can bail out early before we unmap anything, so we
> make sure the vma's are left untouched in case we are going to be short of maps.
> 
> I am not sure if that is realistically doable, or there are limitations
> I overlooked, or we simply do not want to do that.

IMHO it makes sense to do all such resource limit checks upfront. It
should all be protected by mmap_sem and thus stable, right? Even if it
was racy, I'd think it's better to breach the limit a bit due to a race
than bail out in the middle of operation. Being also resilient against
"real" ENOMEM's due to e.g. failure to allocate a vma would be much
harder perhaps (but maybe it's already mostly covered by the
too-small-to-fail in page allocator), but I'd try with the artificial
limits at least.

> Before investing more time and giving it a shot, I just wanted to bring
> this upstream to get feedback on this matter.
> 
> Thanks
> 
> [1] https://github.com/torvalds/linux/blob/master/mm/mremap.c#L519
> [2] https://github.com/torvalds/linux/blob/master/mm/mremap.c#L523
> [3] https://github.com/torvalds/linux/blob/master/mm/mremap.c#L338
> 



* Re: mremap vs sysctl_max_map_count
  2019-02-18  9:57 ` Vlastimil Babka
@ 2019-02-18 10:29   ` Mike Rapoport
  2019-02-18 11:15   ` Kirill A. Shutemov
  1 sibling, 0 replies; 6+ messages in thread
From: Mike Rapoport @ 2019-02-18 10:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Oscar Salvador, linux-mm, linux-kernel, linux-api, hughd, viro, torvalds

On Mon, Feb 18, 2019 at 10:57:18AM +0100, Vlastimil Babka wrote:
> On 2/18/19 9:33 AM, Oscar Salvador wrote:
> > 
> > Hi all,
> > 
> > I would like to bring up a topic that comes from an issue a customer of ours
> > is facing with the mremap syscall + hitting the max_map_count threshold:
> > 
> > When passing the MREMAP_FIXED flag, mremap() calls mremap_to() which does the
> > following:
> > 
> > 1) it unmaps the region where we want to put the new map:
> >    [new_addr, new_addr + new_len) [1]
> > 2) IFF old_len > new_len, it unmaps the region:
> >    [old_addr + new_len, (old_addr + new_len) + (old_len - new_len)) [2]
> > 
> > Now, having gone through steps 1) and 2), we eventually call move_vma() to do
> > the actual move.
> > 
> > move_vma() checks if we are at least 4 maps below max_map_count, otherwise
> > it bails out with -ENOMEM [3].
> > The problem is that we might have already unmapped the vma's in steps 1) and 2),
> > so it is not possible for userspace to figure out the state of the vma's after
> > it gets -ENOMEM.
> > 
> > - Did new_addr get unmapped?
> > - Did part of old_addr get unmapped?
> > 
> > Because of that, it gets tricky for userspace to clean up properly on error
> > path.
> > 
> > While it is true that we can return -ENOMEM for more reasons
> > (e.g: see vma_to_resize()->may_expand_vm()), I think that we might be able to
> > pre-compute the number of maps that we are going to add/release during the first
> > two do_munmaps(), and check whether we are 4 maps below the threshold
> > (as move_vma() does).
> > Should that not be the case, we can bail out early before we unmap anything, so we
> > make sure the vma's are left untouched in case we are going to be short of maps.
> > 
> > I am not sure if that is realistically doable, or there are limitations
> > I overlooked, or we simply do not want to do that.
> 
> IMHO it makes sense to do all such resource limit checks upfront. It
> should all be protected by mmap_sem and thus stable, right? Even if it
> was racy, I'd think it's better to breach the limit a bit due to a race
> than bail out in the middle of operation. Being also resilient against
> "real" ENOMEM's due to e.g. failure to allocate a vma would be much
> harder perhaps (but maybe it's already mostly covered by the
> too-small-to-fail in page allocator), but I'd try with the artificial
> limits at least.

mremap_to() is called with mmap_sem held, so there won't be a race.

But it seems mremap_to() is not the only path that calls do_munmap(). There is
also an unmap in the shrinking remap case, and move_vma() is possible even
without MREMAP_FIXED.

Maybe it'd make sense to check the limits right after taking the mmap_sem?
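
Something like the following rough sketch, perhaps (not a tested patch; the
constant 5 is just a guessed worst case of 2 extra splits from mremap_to()'s
early unmaps plus the 3-map slack move_vma() already assumes, and a fixed
upfront bound like this can of course still produce false positives):

	if (down_write_killable(&current->mm->mmap_sem))
		return -EINTR;

	/* Conservative headroom check before any do_munmap() can run. */
	if (current->mm->map_count + 5 >= sysctl_max_map_count) {
		up_write(&current->mm->mmap_sem);
		return -ENOMEM;
	}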
 
> > Before investing more time and giving it a shot, I just wanted to bring
> > this upstream to get feedback on this matter.
> > 
> > Thanks
> > 
> > [1] https://github.com/torvalds/linux/blob/master/mm/mremap.c#L519
> > [2] https://github.com/torvalds/linux/blob/master/mm/mremap.c#L523
> > [3] https://github.com/torvalds/linux/blob/master/mm/mremap.c#L338
> > 
> 

-- 
Sincerely yours,
Mike.



* Re: mremap vs sysctl_max_map_count
  2019-02-18  9:57 ` Vlastimil Babka
  2019-02-18 10:29   ` Mike Rapoport
@ 2019-02-18 11:15   ` Kirill A. Shutemov
  2019-02-19 15:53     ` Oscar Salvador
  1 sibling, 1 reply; 6+ messages in thread
From: Kirill A. Shutemov @ 2019-02-18 11:15 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Oscar Salvador, linux-mm, linux-kernel, linux-api, hughd, viro, torvalds

On Mon, Feb 18, 2019 at 10:57:18AM +0100, Vlastimil Babka wrote:
> On 2/18/19 9:33 AM, Oscar Salvador wrote:
> > 
> > Hi all,
> > 
> > I would like to bring up a topic that comes from an issue a customer of ours
> > is facing with the mremap syscall + hitting the max_map_count threshold:
> > 
> > When passing the MREMAP_FIXED flag, mremap() calls mremap_to() which does the
> > following:
> > 
> > 1) it unmaps the region where we want to put the new map:
> >    [new_addr, new_addr + new_len) [1]
> > 2) IFF old_len > new_len, it unmaps the region:
> >    [old_addr + new_len, (old_addr + new_len) + (old_len - new_len)) [2]
> > 
> > Now, having gone through steps 1) and 2), we eventually call move_vma() to do
> > the actual move.
> > 
> > move_vma() checks if we are at least 4 maps below max_map_count, otherwise
> > it bails out with -ENOMEM [3].
> > The problem is that we might have already unmapped the vma's in steps 1) and 2),
> > so it is not possible for userspace to figure out the state of the vma's after
> > it gets -ENOMEM.
> > 
> > - Did new_addr get unmapped?
> > - Did part of old_addr get unmapped?
> > 
> > Because of that, it gets tricky for userspace to clean up properly on error
> > path.
> > 
> > While it is true that we can return -ENOMEM for more reasons
> > (e.g: see vma_to_resize()->may_expand_vm()), I think that we might be able to
> > pre-compute the number of maps that we are going to add/release during the first
> > two do_munmaps(), and check whether we are 4 maps below the threshold
> > (as move_vma() does).
> > Should that not be the case, we can bail out early before we unmap anything, so we
> > make sure the vma's are left untouched in case we are going to be short of maps.
> > 
> > I am not sure if that is realistically doable, or there are limitations
> > I overlooked, or we simply do not want to do that.
> 
> IMHO it makes sense to do all such resource limit checks upfront. It
> should all be protected by mmap_sem and thus stable, right? Even if it
> was racy, I'd think it's better to breach the limit a bit due to a race
> than bail out in the middle of operation. Being also resilient against
> "real" ENOMEM's due to e.g. failure to allocate a vma would be much
> harder perhaps (but maybe it's already mostly covered by the
> too-small-to-fail in page allocator), but I'd try with the artificial
> limits at least.

There's a slight chance of a false-positive -ENOMEM with the upfront approach:
unmapping can reduce the number of VMAs, so in some cases an upfront check would
fail something that could otherwise succeed.
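
For instance (hypothetical numbers): with map_count == max_map_count - 2, an
upfront check mirroring move_vma()'s worst case would already refuse the call;
but if the new_addr range happens to cover three whole VMAs, the early
do_munmap() would drop map_count to max_map_count - 5, and the existing check
in move_vma() (map_count >= max_map_count - 3) would have let the move proceed.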

We could also check how many VMAs the unmaps would free (if any). But it
complicates the picture and I don't think it's worth it in the end.

-- 
 Kirill A. Shutemov


* Re: mremap vs sysctl_max_map_count
  2019-02-18 11:15   ` Kirill A. Shutemov
@ 2019-02-19 15:53     ` Oscar Salvador
  2019-02-20 10:30       ` Vlastimil Babka
  0 siblings, 1 reply; 6+ messages in thread
From: Oscar Salvador @ 2019-02-19 15:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Vlastimil Babka, linux-mm, linux-kernel, linux-api, hughd, viro,
	torvalds

On Mon, Feb 18, 2019 at 02:15:35PM +0300, Kirill A. Shutemov wrote:
> On Mon, Feb 18, 2019 at 10:57:18AM +0100, Vlastimil Babka wrote:
> > IMHO it makes sense to do all such resource limit checks upfront. It
> > should all be protected by mmap_sem and thus stable, right? Even if it
> > was racy, I'd think it's better to breach the limit a bit due to a race
> > than bail out in the middle of operation. Being also resilient against
> > "real" ENOMEM's due to e.g. failure to allocate a vma would be much
> > harder perhaps (but maybe it's already mostly covered by the
> > too-small-to-fail in page allocator), but I'd try with the artificial
> > limits at least.
> 
> There's a slight chance of a false-positive -ENOMEM with the upfront approach:
> unmapping can reduce the number of VMAs, so in some cases an upfront check would
> fail something that could otherwise succeed.
> 
> We could also check how many VMAs the unmaps would free (if any). But it
> complicates the picture and I don't think it's worth it in the end.

I came up with an approach which tries to check how many vma's we are going
to split and the number of vma's that we are going to free.
I did several tests and it worked for me, but I am not sure if I overlooked
something due to false assumptions.
I am also not sure if the extra code is worth it, but from my POV
it could avoid such cases where we unmap regions but move_vma()
is not going to succeed at all.


It is not yet complete (sanity checks are missing), but I wanted to show it
to see whether it is something that is worth spending time with:

diff --git a/mm/mremap.c b/mm/mremap.c
index 3320616ed93f..f504c29d2af4 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -494,6 +494,51 @@ static struct vm_area_struct *vma_to_resize(unsigned long addr,
 	return vma;
 }
 
+static int pre_compute_maps(unsigned long addr, unsigned long len)
+{
+	struct mm_struct *mm = current->mm;
+	struct vm_area_struct *vma, *vma_end;
+	unsigned long end;
+	int maps_needed = 0;
+
+	end = addr + len;
+
+	vma = find_vma(mm, addr);
+	if (!vma)
+		return 0;
+	vma_end = find_vma(mm, end);
+
+	if (addr >= vma->vm_start && end <= vma->vm_end) {
+		/*
+		 * Possible outcomes when dealing with a single vma:
+		 * the vma will be entirely removed: map_count will be decremented by 1
+		 * it needs to be split in 2 before unmapping: map_count not changed
+		 * it needs to be split in 3 before unmapping: map_count incremented by 1
+		 */
+		if (addr > vma->vm_start && end < vma->vm_end)
+			maps_needed++;
+		else if (addr == vma->vm_start && end == vma->vm_end)
+			maps_needed--;
+	} else {
+		struct vm_area_struct *tmp = vma;
+		int vmas;
+
+		if (addr > tmp->vm_start)
+			vmas = -1;
+		else
+			vmas = 0;
+
+		while (tmp != vma_end) {
+			if (end >= tmp->vm_end)
+				vmas++;
+			tmp = tmp->vm_next;
+		}
+		maps_needed -= vmas;
+	}
+
+	return maps_needed;
+}
+
 static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 		unsigned long new_addr, unsigned long new_len, bool *locked,
 		struct vm_userfaultfd_ctx *uf,
@@ -516,6 +561,24 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (addr + old_len > new_addr && new_addr + new_len > addr)
 		goto out;
 
+	/*
+	 * Worst-case scenario is when a vma gets split in 3 before unmapping it.
+	 * So, that would mean 2 (1 for new_addr and 1 for addr) more maps to
+	 * the ones we already hold.
+	 * If that is the case, let us check further if we are going to free
+	 * enough to go beyond the check in move_vma().
+	 */
+	if ((mm->map_count + 2) >= sysctl_max_map_count - 3) {
+		int maps_needed = 0;
+
+		maps_needed += pre_compute_maps(new_addr, new_len);
+		if (old_len > new_len)
+			maps_needed += pre_compute_maps(addr + new_len, old_len - new_len);
+
+		if ((mm->map_count + maps_needed) >= sysctl_max_map_count - 3)
+			return -ENOMEM;
+	}
+
 	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
 	if (ret)
 		goto out;
Thanks
-- 
Oscar Salvador
SUSE L3


* Re: mremap vs sysctl_max_map_count
  2019-02-19 15:53     ` Oscar Salvador
@ 2019-02-20 10:30       ` Vlastimil Babka
  0 siblings, 0 replies; 6+ messages in thread
From: Vlastimil Babka @ 2019-02-20 10:30 UTC (permalink / raw)
  To: Oscar Salvador, Kirill A. Shutemov
  Cc: linux-mm, linux-kernel, linux-api, hughd, viro, torvalds

On 2/19/19 4:53 PM, Oscar Salvador wrote:
> On Mon, Feb 18, 2019 at 02:15:35PM +0300, Kirill A. Shutemov wrote:
>> On Mon, Feb 18, 2019 at 10:57:18AM +0100, Vlastimil Babka wrote:
>>> IMHO it makes sense to do all such resource limit checks upfront. It
>>> should all be protected by mmap_sem and thus stable, right? Even if it
>>> was racy, I'd think it's better to breach the limit a bit due to a race
>>> than bail out in the middle of operation. Being also resilient against
>>> "real" ENOMEM's due to e.g. failure to allocate a vma would be much
>>> harder perhaps (but maybe it's already mostly covered by the
>>> too-small-to-fail in page allocator), but I'd try with the artificial
>>> limits at least.
>>
>> There's a slight chance of a false-positive -ENOMEM with the upfront approach:
>> unmapping can reduce the number of VMAs, so in some cases an upfront check would
>> fail something that could otherwise succeed.
>>
>> We could also check how many VMAs the unmaps would free (if any). But it
>> complicates the picture and I don't think it's worth it in the end.
> 
> I came up with an approach which tries to check how many vma's we are going
> to split and the number of vma's that we are going to free.
> I did several tests and it worked for me, but I am not sure if I overlooked
> something due to false assumptions.
> I am also not sure if the extra code is worth it, but from my POV
> it could avoid such cases where we unmap regions but move_vma()
> is not going to succeed at all.
> 
> 
> It is not yet complete (sanity checks are missing), but I wanted to show it
> to see whether it is something that is worth spending time with:

Since move_vma() seems to consider only the worst case with the
hardcoded slack value of 3, I think we can afford to do that here as
well. And IIRC also nothing considers the possibility that the moved
area might merge with neighbours at the new address?

What worries me more is the amount of checks in vma_to_resize() that can
make things fail after the munmap was already done. Could it also be
called upfront? (And shouldn't it only be called when newsize > oldsize?)

Vlastimil


end of thread

Thread overview: 6 messages
2019-02-18  8:33 mremap vs sysctl_max_map_count Oscar Salvador
2019-02-18  9:57 ` Vlastimil Babka
2019-02-18 10:29   ` Mike Rapoport
2019-02-18 11:15   ` Kirill A. Shutemov
2019-02-19 15:53     ` Oscar Salvador
2019-02-20 10:30       ` Vlastimil Babka
