linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH] mm,mremap: Bail out earlier in mremap_to under map pressure
@ 2019-02-21  8:54 Oscar Salvador
  2019-02-22 13:01 ` Kirill A. Shutemov
  0 siblings, 1 reply; 4+ messages in thread
From: Oscar Salvador @ 2019-02-21  8:54 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, linux-api, hughd, kirill, vbabka, joel, jglisse,
	yang.shi, mgorman, Oscar Salvador

When the mremap() syscall is used with the MREMAP_FIXED flag,
mremap() calls mremap_to() which does the following:

1) unmaps the destination region where we are going to move the map
2) If the new region is going to be smaller, we unmap the last part
   of the old region

Then, we will eventually call move_vma() to do the actual move.

move_vma() checks whether we are at least 4 maps below max_map_count
before going further, otherwise it bails out with -ENOMEM.
The problem is that we might have already unmapped the vma's in steps
1) and 2), so it is not possible for userspace to figure out the state
of the vma's after it gets -ENOMEM, and it gets tricky for userspace
to clean up properly on error path.
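
To make the userspace side concrete, here is a minimal sketch of such a
call and its error path. The helper name move_mapping_fixed() and the two
anonymous mappings are made up for illustration; they are not part of the
patch. It only assumes the documented mremap(2) interface.

```c
#define _GNU_SOURCE		/* needed for mremap() and MREMAP_* in glibc */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo helper: map two anonymous regions of 'pages' pages and
 * move the first onto the second with MREMAP_FIXED.
 * Returns 0 on success, a positive errno value if a syscall failed, and
 * -1 if the moved data does not show up at the destination.
 */
static int move_mapping_fixed(size_t pages)
{
	size_t len = pages * (size_t)sysconf(_SC_PAGESIZE);
	char *src, *dst;
	void *res;
	int err, ok;

	src = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED)
		return errno;

	dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED) {
		err = errno;
		munmap(src, len);
		return err;
	}

	src[0] = 'x';

	res = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	if (res == MAP_FAILED) {
		err = errno;
		/*
		 * On -ENOMEM we cannot tell whether dst was already unmapped
		 * by the kernel. Unmapping both ranges is only safe here
		 * because this demo is single-threaded and nothing else can
		 * have reused the addresses in the meantime.
		 */
		munmap(src, len);
		munmap(dst, len);
		return err;
	}

	/* After a successful move the old range is gone, data lives at dst. */
	ok = (res == dst && dst[0] == 'x');
	munmap(dst, len);
	return ok ? 0 : -1;
}
```

In a multi-threaded program the blind munmap() in the error path would not
be safe, which is the cleanup problem described above.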

While it is true that we can return -ENOMEM for more reasons
(e.g., see may_expand_vm() or move_page_tables()), I think that we can
avoid this scenario in particular if we check early in mremap_to() whether
the operation has a high chance to succeed map-wise.

Should that not be the case, we can bail out before we even try to unmap
anything, so we make sure the vma's are left untouched in case we are likely
to be short of maps.

The rule of thumb now is to rely on the worst-case scenario we can have.
That is when both vma's (old region and new region) are going to be split
in 3, so we get two more maps to the ones we already hold (one per each).
If current map count + 2 maps still leaves us 4 maps below the threshold,
we are going to pass the check in move_vma().
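
The arithmetic can be sketched as two standalone predicates. This is only a
userspace model with made-up helper names, not kernel code; the value 65530
used in the note below is the usual default of vm.max_map_count and is an
assumption about the running system.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the existing check in move_vma(): it refuses to
 * proceed once the map count reaches max_map_count - 3.
 */
static bool move_vma_would_bail(int map_count, int max_map_count)
{
	return map_count >= max_map_count - 3;
}

/*
 * Hypothetical model of the proposed early check in mremap_to(): assume
 * the worst case, where each of the two unmaps splits a vma and leaves us
 * with one extra map, i.e. 2 extra maps in total.
 */
static bool mremap_to_would_bail(int map_count, int max_map_count)
{
	return map_count + 2 >= max_map_count - 3;
}
```

With a limit of 65530, a process holding 65524 maps passes the early check,
and even after gaining 2 maps it still passes the model of the move_vma()
check. A process holding 65525 maps is rejected before anything is unmapped.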

Of course, this is not free, as it might generate false positives when we
are indeed tight map-wise, but the unmap operations would have released
several vma's, leaving us in a good state.
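
A small model of such a false positive, again with made-up names and as a
userspace sketch only. The parameter net_delta stands for the net change in
the number of maps caused by the two unmaps: +2 in the worst case, negative
when whole vma's get released.

```c
#include <stdbool.h>

/* Hypothetical model of the proposed early check in mremap_to(). */
static bool early_check_rejects(int map_count, int max_map_count)
{
	return map_count + 2 >= max_map_count - 3;
}

/* Hypothetical model of the existing check in move_vma(). */
static bool move_vma_rejects(int map_count, int max_map_count)
{
	return map_count >= max_map_count - 3;
}

/*
 * True when the early check refuses an operation that move_vma() would
 * have accepted once the unmaps had changed the map count by net_delta.
 */
static bool is_false_positive(int map_count, int max_map_count, int net_delta)
{
	return early_check_rejects(map_count, max_map_count) &&
	       !move_vma_rejects(map_count + net_delta, max_map_count);
}
```

For example, with a limit of 65530 and 65526 maps in use, the early check
rejects the call, although an unmap that released 10 vma's would have left
plenty of room for move_vma().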

Because of that I am sending this as an RFC.
Another approach was also investigated [1], but it may be too much hassle
for what it brings.

[1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/mremap.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/mm/mremap.c b/mm/mremap.c
index 3320616ed93f..e3edef6b7a12 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -516,6 +516,23 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
 	if (addr + old_len > new_addr && new_addr + new_len > addr)
 		goto out;
 
+	/*
+	 * move_vma() needs us to stay 4 maps below the threshold, otherwise
+	 * it will bail out at the very beginning.
+	 * That is a problem if we have already unmapped the regions here
+	 * (new_addr, and old_addr), because userspace will not know the
+	 * state of the vma's after it gets -ENOMEM.
+	 * So, to avoid such a scenario we can pre-compute whether the whole
+	 * operation has a high chance to succeed map-wise.
+	 * The worst-case scenario is when both vma's (new_addr and old_addr)
+	 * get split in 3 before unmapping them.
+	 * That means 2 more maps (1 for each) to the ones we already hold.
+	 * Check whether current map count plus 2 still leads us to 4 maps below
+	 * the threshold, otherwise return -ENOMEM here to be more safe.
+	 */
+	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)
+		return -ENOMEM;
+
 	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
 	if (ret)
 		goto out;
-- 
2.13.7



* Re: [RFC PATCH] mm,mremap: Bail out earlier in mremap_to under map pressure
  2019-02-21  8:54 [RFC PATCH] mm,mremap: Bail out earlier in mremap_to under map pressure Oscar Salvador
@ 2019-02-22 13:01 ` Kirill A. Shutemov
  2019-02-25 11:46   ` Vlastimil Babka
  0 siblings, 1 reply; 4+ messages in thread
From: Kirill A. Shutemov @ 2019-02-22 13:01 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: linux-mm, linux-kernel, linux-api, hughd, vbabka, joel, jglisse,
	yang.shi, mgorman

On Thu, Feb 21, 2019 at 09:54:06AM +0100, Oscar Salvador wrote:
> When the mremap() syscall is used with the MREMAP_FIXED flag,
> mremap() calls mremap_to() which does the following:
> 
> 1) unmaps the destination region where we are going to move the map
> 2) If the new region is going to be smaller, we unmap the last part
>    of the old region
> 
> Then, we will eventually call move_vma() to do the actual move.
> 
> move_vma() checks whether we are at least 4 maps below max_map_count
> before going further, otherwise it bails out with -ENOMEM.
> The problem is that we might have already unmapped the vma's in steps
> 1) and 2), so it is not possible for userspace to figure out the state
> of the vma's after it gets -ENOMEM, and it gets tricky for userspace
> to clean up properly on error path.
> 
> While it is true that we can return -ENOMEM for more reasons
> (e.g., see may_expand_vm() or move_page_tables()), I think that we can
> avoid this scenario in particular if we check early in mremap_to() whether
> the operation has a high chance to succeed map-wise.
> 
> Should that not be the case, we can bail out before we even try to unmap
> anything, so we make sure the vma's are left untouched in case we are likely
> to be short of maps.
> 
> The rule of thumb now is to rely on the worst-case scenario we can have.
> That is when both vma's (old region and new region) are going to be split
> in 3, so we get two more maps to the ones we already hold (one per each).
> If current map count + 2 maps still leaves us 4 maps below the threshold,
> we are going to pass the check in move_vma().
> 
> Of course, this is not free, as it might generate false positives when we
> are indeed tight map-wise, but the unmap operations would have released
> several vma's, leaving us in a good state.
> 
> Because of that I am sending this as an RFC.
> Another approach was also investigated [1], but it may be too much hassle
> for what it brings.

I believe we don't need the check in move_vma() with this patch. Or do we?

> 
> [1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/mremap.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 3320616ed93f..e3edef6b7a12 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -516,6 +516,23 @@ static unsigned long mremap_to(unsigned long addr, unsigned long old_len,
>  	if (addr + old_len > new_addr && new_addr + new_len > addr)
>  		goto out;
>  
> +	/*
> +	 * move_vma() needs us to stay 4 maps below the threshold, otherwise
> +	 * it will bail out at the very beginning.
> +	 * That is a problem if we have already unmapped the regions here
> +	 * (new_addr, and old_addr), because userspace will not know the
> +	 * state of the vma's after it gets -ENOMEM.
> +	 * So, to avoid such a scenario we can pre-compute whether the whole
> +	 * operation has a high chance to succeed map-wise.
> +	 * The worst-case scenario is when both vma's (new_addr and old_addr)
> +	 * get split in 3 before unmapping them.
> +	 * That means 2 more maps (1 for each) to the ones we already hold.
> +	 * Check whether current map count plus 2 still leads us to 4 maps below
> +	 * the threshold, otherwise return -ENOMEM here to be more safe.
> +	 */
> +	if ((mm->map_count + 2) >= sysctl_max_map_count - 3)

Nit: redundant parentheses around 'mm->map_count + 2'.

> +		return -ENOMEM;
> +
>  	ret = do_munmap(mm, new_addr, new_len, uf_unmap_early);
>  	if (ret)
>  		goto out;
> -- 
> 2.13.7
> 

-- 
 Kirill A. Shutemov


* Re: [RFC PATCH] mm,mremap: Bail out earlier in mremap_to under map pressure
  2019-02-22 13:01 ` Kirill A. Shutemov
@ 2019-02-25 11:46   ` Vlastimil Babka
  2019-02-25 12:16     ` Kirill A. Shutemov
  0 siblings, 1 reply; 4+ messages in thread
From: Vlastimil Babka @ 2019-02-25 11:46 UTC (permalink / raw)
  To: Kirill A. Shutemov, Oscar Salvador
  Cc: linux-mm, linux-kernel, linux-api, hughd, joel, jglisse,
	yang.shi, mgorman

On 2/22/19 2:01 PM, Kirill A. Shutemov wrote:
> On Thu, Feb 21, 2019 at 09:54:06AM +0100, Oscar Salvador wrote:
>> When the mremap() syscall is used with the MREMAP_FIXED flag,
>> mremap() calls mremap_to() which does the following:
>>
>> 1) unmaps the destination region where we are going to move the map
>> 2) If the new region is going to be smaller, we unmap the last part
>>    of the old region
>>
>> Then, we will eventually call move_vma() to do the actual move.
>>
>> move_vma() checks whether we are at least 4 maps below max_map_count
>> before going further, otherwise it bails out with -ENOMEM.
>> The problem is that we might have already unmapped the vma's in steps
>> 1) and 2), so it is not possible for userspace to figure out the state
>> of the vma's after it gets -ENOMEM, and it gets tricky for userspace
>> to clean up properly on error path.
>>
>> While it is true that we can return -ENOMEM for more reasons
>> (e.g., see may_expand_vm() or move_page_tables()), I think that we can
>> avoid this scenario in particular if we check early in mremap_to() whether
>> the operation has a high chance to succeed map-wise.
>>
>> Should that not be the case, we can bail out before we even try to unmap
>> anything, so we make sure the vma's are left untouched in case we are likely
>> to be short of maps.
>>
>> The rule of thumb now is to rely on the worst-case scenario we can have.
>> That is when both vma's (old region and new region) are going to be split
>> in 3, so we get two more maps to the ones we already hold (one per each).
>> If current map count + 2 maps still leaves us 4 maps below the threshold,
>> we are going to pass the check in move_vma().
>>
>> Of course, this is not free, as it might generate false positives when we
>> are indeed tight map-wise, but the unmap operations would have released
>> several vma's, leaving us in a good state.
>>
>> Because of that I am sending this as an RFC.
>> Another approach was also investigated [1], but it may be too much hassle
>> for what it brings.
> 
> I believe we don't need the check in move_vma() with this patch. Or do we?

move_vma() can also be called directly from SYSCALL_DEFINE5(mremap) for
the non-MREMAP_FIXED case. So unless there's further refactoring, the
check is still needed.

>>
>> [1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/
>>
>> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Vlastimil Babka <vbabka@suse.cz>


* Re: [RFC PATCH] mm,mremap: Bail out earlier in mremap_to under map pressure
  2019-02-25 11:46   ` Vlastimil Babka
@ 2019-02-25 12:16     ` Kirill A. Shutemov
  0 siblings, 0 replies; 4+ messages in thread
From: Kirill A. Shutemov @ 2019-02-25 12:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Oscar Salvador, linux-mm, linux-kernel, linux-api, hughd, joel,
	jglisse, yang.shi, mgorman

On Mon, Feb 25, 2019 at 12:46:46PM +0100, Vlastimil Babka wrote:
> On 2/22/19 2:01 PM, Kirill A. Shutemov wrote:
> > On Thu, Feb 21, 2019 at 09:54:06AM +0100, Oscar Salvador wrote:
> >> When the mremap() syscall is used with the MREMAP_FIXED flag,
> >> mremap() calls mremap_to() which does the following:
> >>
> >> 1) unmaps the destination region where we are going to move the map
> >> 2) If the new region is going to be smaller, we unmap the last part
> >>    of the old region
> >>
> >> Then, we will eventually call move_vma() to do the actual move.
> >>
> >> move_vma() checks whether we are at least 4 maps below max_map_count
> >> before going further, otherwise it bails out with -ENOMEM.
> >> The problem is that we might have already unmapped the vma's in steps
> >> 1) and 2), so it is not possible for userspace to figure out the state
> >> of the vma's after it gets -ENOMEM, and it gets tricky for userspace
> >> to clean up properly on error path.
> >>
> >> While it is true that we can return -ENOMEM for more reasons
> >> (e.g., see may_expand_vm() or move_page_tables()), I think that we can
> >> avoid this scenario in particular if we check early in mremap_to() whether
> >> the operation has a high chance to succeed map-wise.
> >>
> >> Should that not be the case, we can bail out before we even try to unmap
> >> anything, so we make sure the vma's are left untouched in case we are likely
> >> to be short of maps.
> >>
> >> The rule of thumb now is to rely on the worst-case scenario we can have.
> >> That is when both vma's (old region and new region) are going to be split
> >> in 3, so we get two more maps to the ones we already hold (one per each).
> >> If current map count + 2 maps still leaves us 4 maps below the threshold,
> >> we are going to pass the check in move_vma().
> >>
> >> Of course, this is not free, as it might generate false positives when we
> >> are indeed tight map-wise, but the unmap operations would have released
> >> several vma's, leaving us in a good state.
> >>
> >> Because of that I am sending this as an RFC.
> >> Another approach was also investigated [1], but it may be too much hassle
> >> for what it brings.
> > 
> > I believe we don't need the check in move_vma() with this patch. Or do we?
> 
> move_vma() can also be called directly from SYSCALL_DEFINE5(mremap) for
> the non-MREMAP_FIXED case. So unless there's further refactoring, the
> check is still needed.

Okay, makes sense.

> >>
> >> [1] https://lore.kernel.org/lkml/20190219155320.tkfkwvqk53tfdojt@d104.suse.de/
> >>
> >> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

