* [PATCH 0/1] make start_isolate_page_range() thread safe
@ 2018-02-26 19:10 Mike Kravetz
  2018-02-26 19:10 ` [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated Mike Kravetz
  2018-03-09 22:47 ` [PATCH v2] " Mike Kravetz
  0 siblings, 2 replies; 11+ messages in thread
From: Mike Kravetz @ 2018-02-26 19:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: KAMEZAWA Hiroyuki, Luiz Capitulino, Michal Nazarewicz,
	Michal Hocko, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	Andrew Morton, Mike Kravetz

This patch was included in the RFC series "Interface for higher order
contiguous allocations".
http://lkml.kernel.org/r/20180212222056.9735-1-mike.kravetz@oracle.com

Since there have been few comments on the RFC and this patch addresses
a real issue with the current code, I am sending it separately.

To verify this is a real issue, I created a large CMA area at boot time.
I wrote some code to exercise large allocations and frees via cma_alloc()
and cma_release().  At the same time, I had a script just allocate and
free gigantic pages via the sysfs interface.
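
For reference, the gigantic page side of the test was little more than a
loop over the sysfs interface.  A minimal sketch in C is below; the sysfs
path assumes x86-64 with 1GB gigantic pages and the page count is arbitrary.
The CMA exerciser was separate kernel-side test code and is not shown.

/*
 * Repeatedly allocate and free gigantic (1GB) pages by writing to
 * nr_hugepages.  Adjust the path for other architectures/page sizes.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NR_GIGANTIC \
	"/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages"

static void set_nr_hugepages(const char *val)
{
	int fd = open(NR_GIGANTIC, O_WRONLY);

	if (fd < 0) {
		perror("open");
		exit(1);
	}
	if (write(fd, val, strlen(val)) < 0)
		perror("write");	/* allocation may legitimately fail */
	close(fd);
}

int main(void)
{
	for (;;) {
		set_nr_hugepages("8");	/* try to allocate 8 gigantic pages */
		set_nr_hugepages("0");	/* free them again */
	}
	return 0;
}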

After a little bit of running, 'free memory' on the system went to
zero.  After 'stopping' the tests, I observed that most zone normal
page blocks were marked as MIGRATE_ISOLATE.  Hence 'not available'.

I suspect there are few (if any) systems employing both CMA and
dynamic gigantic huge page allocation.  However, it is probably a
good idea to fix this issue.  Because hitting this race is so unlikely,
I am not sure whether this should go to stable releases as well.

Mike Kravetz (1):
  mm: make start_isolate_page_range() fail if already isolated

 mm/page_alloc.c     |  8 ++++----
 mm/page_isolation.c | 10 +++++++++-
 2 files changed, 13 insertions(+), 5 deletions(-)

-- 
2.13.6


* [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated
  2018-02-26 19:10 [PATCH 0/1] make start_isolate_page_range() thread safe Mike Kravetz
@ 2018-02-26 19:10 ` Mike Kravetz
  2018-03-03  0:06   ` Andrew Morton
  2018-03-09 22:47 ` [PATCH v2] " Mike Kravetz
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2018-02-26 19:10 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: KAMEZAWA Hiroyuki, Luiz Capitulino, Michal Nazarewicz,
	Michal Hocko, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	Andrew Morton, Mike Kravetz

start_isolate_page_range() is used to set the migrate type of a
set of page blocks to MIGRATE_ISOLATE while attempting to start
a migration operation.  It assumes that only one thread is
calling it for the specified range.  This routine is used by
CMA, memory hotplug and gigantic huge pages.  Each of these users
synchronizes access to the range within its own subsystem.  However,
two subsystems (CMA and gigantic huge pages for example) could
attempt operations on the same range.  If this happens, page
blocks may be incorrectly left marked as MIGRATE_ISOLATE and
therefore not available for page allocation.

Without 'locking code' there is no easy way to synchronize access
to the range of page blocks passed to start_isolate_page_range.
However, if two threads are working on the same set of page blocks
one will stumble upon blocks set to MIGRATE_ISOLATE by the other.
In such conditions, make the thread noticing MIGRATE_ISOLATE
clean up as normal and return -EBUSY to the caller.

This will allow start_isolate_page_range to serve as a
synchronization mechanism and will allow for more general use
of these interfaces by callers.  So, update comments
in alloc_contig_range to reflect this new functionality.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/page_alloc.c     |  8 ++++----
 mm/page_isolation.c | 10 +++++++++-
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb416723538f..02a17efac233 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7621,11 +7621,11 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
  * @gfp_mask:	GFP mask to use during compaction
  *
  * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
- * aligned, however it's the caller's responsibility to guarantee that
- * we are the only thread that changes migrate type of pageblocks the
- * pages fall in.
+ * aligned.  The PFN range must belong to a single zone.
  *
- * The PFN range must belong to a single zone.
+ * The first thing this routine does is attempt to MIGRATE_ISOLATE all
+ * pageblocks in the range.  Once isolated, the pageblocks should not
+ * be modified by others.
  *
  * Returns zero on success or negative error code.  On success all
  * pages which PFN is in [start, end) are allocated for the caller and
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 165ed8117bd1..70d01ec5b463 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -28,6 +28,13 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
 
 	spin_lock_irqsave(&zone->lock, flags);
 
+	/*
+	 * We assume we are the only ones trying to isolate this block.
+	 * If MIGRATE_ISOLATE already set, return -EBUSY
+	 */
+	if (is_migrate_isolate_page(page))
+		goto out;
+
 	pfn = page_to_pfn(page);
 	arg.start_pfn = pfn;
 	arg.nr_pages = pageblock_nr_pages;
@@ -166,7 +173,8 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * future will not be allocated again.
  *
  * start_pfn/end_pfn must be aligned to pageblock_order.
- * Returns 0 on success and -EBUSY if any part of range cannot be isolated.
+ * Return 0 on success and -EBUSY if any part of range cannot be isolated
+ * or any part of the range is already set to MIGRATE_ISOLATE.
  */
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     unsigned migratetype, bool skip_hwpoisoned_pages)
-- 
2.13.6


* Re: [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated
  2018-02-26 19:10 ` [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated Mike Kravetz
@ 2018-03-03  0:06   ` Andrew Morton
  2018-03-03  0:38     ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2018-03-03  0:06 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Luiz Capitulino,
	Michal Nazarewicz, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Johannes Weiner

On Mon, 26 Feb 2018 11:10:54 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> start_isolate_page_range() is used to set the migrate type of a
> set of page blocks to MIGRATE_ISOLATE while attempting to start
> a migration operation.  It assumes that only one thread is
> calling it for the specified range.  This routine is used by
> CMA, memory hotplug and gigantic huge pages.  Each of these users
> synchronize access to the range within their subsystem.  However,
> two subsystems (CMA and gigantic huge pages for example) could
> attempt operations on the same range.  If this happens, page
> blocks may be incorrectly left marked as MIGRATE_ISOLATE and
> therefore not available for page allocation.
> 
> Without 'locking code' there is no easy way to synchronize access
> to the range of page blocks passed to start_isolate_page_range.
> However, if two threads are working on the same set of page blocks
> one will stumble upon blocks set to MIGRATE_ISOLATE by the other.
> In such conditions, make the thread noticing MIGRATE_ISOLATE
> clean up as normal and return -EBUSY to the caller.
> 
> This will allow start_isolate_page_range to serve as a
> synchronization mechanism and will allow for more general use
> of callers making use of these interfaces.  So, update comments
> in alloc_contig_range to reflect this new functionality.
> 
> ...
>
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -28,6 +28,13 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
>  
>  	spin_lock_irqsave(&zone->lock, flags);
>  
> +	/*
> +	 * We assume we are the only ones trying to isolate this block.
> +	 * If MIGRATE_ISOLATE already set, return -EBUSY
> +	 */
> +	if (is_migrate_isolate_page(page))
> +		goto out;
> +
>  	pfn = page_to_pfn(page);
>  	arg.start_pfn = pfn;
>  	arg.nr_pages = pageblock_nr_pages;

Seems a bit ugly and I'm not sure that it's correct.  If the loop in
start_isolate_page_range() gets partway through a number of pages then
we hit the race, start_isolate_page_range() will then go and "undo" the
work being done by the thread which it is racing against?

Even if that can't happen, blundering through a whole bunch of pages
then saying whoops then undoing everything is unpleasing.

Should we be looking at preventing these races at a higher level?


* Re: [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated
  2018-03-03  0:06   ` Andrew Morton
@ 2018-03-03  0:38     ` Mike Kravetz
  2018-03-03  0:56       ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2018-03-03  0:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Luiz Capitulino,
	Michal Nazarewicz, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Johannes Weiner

On 03/02/2018 04:06 PM, Andrew Morton wrote:
> On Mon, 26 Feb 2018 11:10:54 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> start_isolate_page_range() is used to set the migrate type of a
>> set of page blocks to MIGRATE_ISOLATE while attempting to start
>> a migration operation.  It assumes that only one thread is
>> calling it for the specified range.  This routine is used by
>> CMA, memory hotplug and gigantic huge pages.  Each of these users
>> synchronize access to the range within their subsystem.  However,
>> two subsystems (CMA and gigantic huge pages for example) could
>> attempt operations on the same range.  If this happens, page
>> blocks may be incorrectly left marked as MIGRATE_ISOLATE and
>> therefore not available for page allocation.
>>
>> Without 'locking code' there is no easy way to synchronize access
>> to the range of page blocks passed to start_isolate_page_range.
>> However, if two threads are working on the same set of page blocks
>> one will stumble upon blocks set to MIGRATE_ISOLATE by the other.
>> In such conditions, make the thread noticing MIGRATE_ISOLATE
>> clean up as normal and return -EBUSY to the caller.
>>
>> This will allow start_isolate_page_range to serve as a
>> synchronization mechanism and will allow for more general use
>> of callers making use of these interfaces.  So, update comments
>> in alloc_contig_range to reflect this new functionality.
>>
>> ...
>>
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -28,6 +28,13 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
>>  
>>  	spin_lock_irqsave(&zone->lock, flags);
>>  
>> +	/*
>> +	 * We assume we are the only ones trying to isolate this block.
>> +	 * If MIGRATE_ISOLATE already set, return -EBUSY
>> +	 */
>> +	if (is_migrate_isolate_page(page))
>> +		goto out;
>> +
>>  	pfn = page_to_pfn(page);
>>  	arg.start_pfn = pfn;
>>  	arg.nr_pages = pageblock_nr_pages;
> 
> Seems a bit ugly and I'm not sure that it's correct.  If the loop in
> start_isolate_page_range() gets partway through a number of pages then
> we hit the race, start_isolate_page_range() will then go and "undo" the
> work being done by the thread which it is racing against?

I agree that it is a bit ugly.  However, when a thread hits the above
condition it will only undo what it has done.  Only one thread is able
to set the migrate state of a given pageblock to isolate (under the
zone lock), so each thread knows exactly which blocks it set and only
undoes those.

The exact problem of one thread undoing what another thread has done
is possible with the code today and is what this patch is attempting
to address.
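
To make the undo behavior concrete, the loop in start_isolate_page_range()
with this patch is roughly of the following shape.  This is a simplified
sketch (pfn validity checks and some details trimmed), not the literal
mm/page_isolation.c source:

int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
			     unsigned migratetype, bool skip_hwpoisoned_pages)
{
	unsigned long pfn, undo_pfn;
	struct page *page;

	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		page = __first_valid_page(pfn, pageblock_nr_pages);
		if (page &&
		    set_migratetype_isolate(page, migratetype,
					    skip_hwpoisoned_pages)) {
			/*
			 * Failed: unmovable pages in the block, or (with
			 * this patch) another thread already isolated it.
			 */
			undo_pfn = pfn;
			goto undo;
		}
	}
	return 0;

undo:
	/* walk back over only the pageblocks *this* thread isolated */
	for (pfn = start_pfn; pfn < undo_pfn; pfn += pageblock_nr_pages)
		unset_migratetype_isolate(pfn_to_page(pfn), migratetype);

	return -EBUSY;
}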

> Even if that can't happen, blundering through a whole bunch of pages
> then saying whoops then undoing everything is unpleasing.
> 
> Should we be looking at preventing these races at a higher level?

I could not immediately come up with a good idea here.  The zone lock
would be the obvious choice, but I don't think we want to hold it while
examining each of the page blocks.  Perhaps a new lock or semaphore
associated with the zone?  I'm open to suggestions.

-- 
Mike Kravetz


* Re: [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated
  2018-03-03  0:38     ` Mike Kravetz
@ 2018-03-03  0:56       ` Andrew Morton
  2018-03-03  1:39         ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2018-03-03  0:56 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Luiz Capitulino,
	Michal Nazarewicz, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Johannes Weiner

On Fri, 2 Mar 2018 16:38:33 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> On 03/02/2018 04:06 PM, Andrew Morton wrote:
> > On Mon, 26 Feb 2018 11:10:54 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > 
> >> start_isolate_page_range() is used to set the migrate type of a
> >> set of page blocks to MIGRATE_ISOLATE while attempting to start
> >> a migration operation.  It assumes that only one thread is
> >> calling it for the specified range.  This routine is used by
> >> CMA, memory hotplug and gigantic huge pages.  Each of these users
> >> synchronize access to the range within their subsystem.  However,
> >> two subsystems (CMA and gigantic huge pages for example) could
> >> attempt operations on the same range.  If this happens, page
> >> blocks may be incorrectly left marked as MIGRATE_ISOLATE and
> >> therefore not available for page allocation.
> >>
> >> Without 'locking code' there is no easy way to synchronize access
> >> to the range of page blocks passed to start_isolate_page_range.
> >> However, if two threads are working on the same set of page blocks
> >> one will stumble upon blocks set to MIGRATE_ISOLATE by the other.
> >> In such conditions, make the thread noticing MIGRATE_ISOLATE
> >> clean up as normal and return -EBUSY to the caller.
> >>
> >> This will allow start_isolate_page_range to serve as a
> >> synchronization mechanism and will allow for more general use
> >> of callers making use of these interfaces.  So, update comments
> >> in alloc_contig_range to reflect this new functionality.
> >>
> >> ...
> >>
> >> --- a/mm/page_isolation.c
> >> +++ b/mm/page_isolation.c
> >> @@ -28,6 +28,13 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
> >>  
> >>  	spin_lock_irqsave(&zone->lock, flags);
> >>  
> >> +	/*
> >> +	 * We assume we are the only ones trying to isolate this block.
> >> +	 * If MIGRATE_ISOLATE already set, return -EBUSY
> >> +	 */
> >> +	if (is_migrate_isolate_page(page))
> >> +		goto out;
> >> +
> >>  	pfn = page_to_pfn(page);
> >>  	arg.start_pfn = pfn;
> >>  	arg.nr_pages = pageblock_nr_pages;
> > 
> > Seems a bit ugly and I'm not sure that it's correct.  If the loop in
> > start_isolate_page_range() gets partway through a number of pages then
> > we hit the race, start_isolate_page_range() will then go and "undo" the
> > work being done by the thread which it is racing against?
> 
> I agree that it is a bit ugly.  However, when a thread hits the above
> condition it will only undo what it has done.  Only one thread is able
> to set migrate state to isolate (under the zone lock).  So, a thread
> will only undo what it has done.

I don't get it.  That would make sense if start_isolate_page_range()
held zone->lock across the entire loop, but it doesn't do that.

> The exact problem of one thread undoing what another thread has done
> is possible with the code today and is what this patch is attempting
> to address.
> 
> > Even if that can't happen, blundering through a whole bunch of pages
> > then saying whoops then undoing everything is unpleasing.
> > 
> > Should we be looking at preventing these races at a higher level?
> 
> I could not immediately come up with a good idea here.  The zone lock
> would be the obvious choice, but I don't think we want to hold it while
> examining each of the page blocks.  Perhaps a new lock or semaphore
> associated with the zone?  I'm open to suggestions.

Yes, I think it would need a new lock.  Hopefully a mutex.


* Re: [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated
  2018-03-03  0:56       ` Andrew Morton
@ 2018-03-03  1:39         ` Mike Kravetz
  2018-03-06  0:57           ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2018-03-03  1:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Luiz Capitulino,
	Michal Nazarewicz, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Johannes Weiner

On 03/02/2018 04:56 PM, Andrew Morton wrote:
> On Fri, 2 Mar 2018 16:38:33 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> On 03/02/2018 04:06 PM, Andrew Morton wrote:
>>> On Mon, 26 Feb 2018 11:10:54 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>>> start_isolate_page_range() is used to set the migrate type of a
>>>> set of page blocks to MIGRATE_ISOLATE while attempting to start
>>>> a migration operation.  It assumes that only one thread is
>>>> calling it for the specified range.  This routine is used by
>>>> CMA, memory hotplug and gigantic huge pages.  Each of these users
>>>> synchronize access to the range within their subsystem.  However,
>>>> two subsystems (CMA and gigantic huge pages for example) could
>>>> attempt operations on the same range.  If this happens, page
>>>> blocks may be incorrectly left marked as MIGRATE_ISOLATE and
>>>> therefore not available for page allocation.
>>>>
>>>> Without 'locking code' there is no easy way to synchronize access
>>>> to the range of page blocks passed to start_isolate_page_range.
>>>> However, if two threads are working on the same set of page blocks
>>>> one will stumble upon blocks set to MIGRATE_ISOLATE by the other.
>>>> In such conditions, make the thread noticing MIGRATE_ISOLATE
>>>> clean up as normal and return -EBUSY to the caller.
>>>>
>>>> This will allow start_isolate_page_range to serve as a
>>>> synchronization mechanism and will allow for more general use
>>>> of callers making use of these interfaces.  So, update comments
>>>> in alloc_contig_range to reflect this new functionality.
>>>>
>>>> ...
>>>>
>>>> --- a/mm/page_isolation.c
>>>> +++ b/mm/page_isolation.c
>>>> @@ -28,6 +28,13 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
>>>>  
>>>>  	spin_lock_irqsave(&zone->lock, flags);
>>>>  
>>>> +	/*
>>>> +	 * We assume we are the only ones trying to isolate this block.
>>>> +	 * If MIGRATE_ISOLATE already set, return -EBUSY
>>>> +	 */
>>>> +	if (is_migrate_isolate_page(page))
>>>> +		goto out;
>>>> +
>>>>  	pfn = page_to_pfn(page);
>>>>  	arg.start_pfn = pfn;
>>>>  	arg.nr_pages = pageblock_nr_pages;
>>>
>>> Seems a bit ugly and I'm not sure that it's correct.  If the loop in
>>> start_isolate_page_range() gets partway through a number of pages then
>>> we hit the race, start_isolate_page_range() will then go and "undo" the
>>> work being done by the thread which it is racing against?
>>
>> I agree that it is a bit ugly.  However, when a thread hits the above
>> condition it will only undo what it has done.  Only one thread is able
>> to set migrate state to isolate (under the zone lock).  So, a thread
>> will only undo what it has done.
> 
> I don't get it.  That would make sense if start_isolate_page_range()
> held zone->lock across the entire loop, but it doesn't do that.
> 

It works because all threads set migrate isolate on page blocks going
from low pfn to high pfn.  When a thread encounters a conflict, it knows
exactly which blocks it set and only undoes those blocks.  Perhaps I
am missing something, but it does not matter because ...

>> The exact problem of one thread undoing what another thread has done
>> is possible with the code today and is what this patch is attempting
>> to address.
>>
>>> Even if that can't happen, blundering through a whole bunch of pages
>>> then saying whoops then undoing everything is unpleasing.
>>>
>>> Should we be looking at preventing these races at a higher level?
>>
>> I could not immediately come up with a good idea here.  The zone lock
>> would be the obvious choice, but I don't think we want to hold it while
>> examining each of the page blocks.  Perhaps a new lock or semaphore
>> associated with the zone?  I'm open to suggestions.
> 
> Yes, I think it would need a new lock.  Hopefully a mutex.

I'll look into adding an 'isolate' mutex to the zone structure and reworking
this patch.
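
Purely to illustrate the idea (hypothetical sketch only -- the field and
the __-prefixed helper are invented names, and this was never implemented
in this form), the per-zone mutex approach would look something like:

/* in struct zone (include/linux/mmzone.h): */
/*	struct mutex isolate_mutex;	serializes pageblock isolation   */

int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
			     unsigned migratetype, bool skip_hwpoisoned_pages)
{
	struct zone *zone = page_zone(pfn_to_page(start_pfn));
	int ret;

	mutex_lock(&zone->isolate_mutex);
	ret = __start_isolate_page_range(start_pfn, end_pfn,
					 migratetype, skip_hwpoisoned_pages);
	mutex_unlock(&zone->isolate_mutex);

	return ret;
}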

-- 
Mike Kravetz


* Re: [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated
  2018-03-03  1:39         ` Mike Kravetz
@ 2018-03-06  0:57           ` Mike Kravetz
  2018-03-06 22:32             ` Andrew Morton
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2018-03-06  0:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Luiz Capitulino,
	Michal Nazarewicz, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Johannes Weiner

On 03/02/2018 05:39 PM, Mike Kravetz wrote:
> On 03/02/2018 04:56 PM, Andrew Morton wrote:
>> On Fri, 2 Mar 2018 16:38:33 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>>> On 03/02/2018 04:06 PM, Andrew Morton wrote:
>>>> On Mon, 26 Feb 2018 11:10:54 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>>
>>>>> start_isolate_page_range() is used to set the migrate type of a
>>>>> set of page blocks to MIGRATE_ISOLATE while attempting to start
>>>>> a migration operation.  It assumes that only one thread is
>>>>> calling it for the specified range.  This routine is used by
>>>>> CMA, memory hotplug and gigantic huge pages.  Each of these users
>>>>> synchronize access to the range within their subsystem.  However,
>>>>> two subsystems (CMA and gigantic huge pages for example) could
>>>>> attempt operations on the same range.  If this happens, page
>>>>> blocks may be incorrectly left marked as MIGRATE_ISOLATE and
>>>>> therefore not available for page allocation.
>>>>>
>>>>> Without 'locking code' there is no easy way to synchronize access
>>>>> to the range of page blocks passed to start_isolate_page_range.
>>>>> However, if two threads are working on the same set of page blocks
>>>>> one will stumble upon blocks set to MIGRATE_ISOLATE by the other.
>>>>> In such conditions, make the thread noticing MIGRATE_ISOLATE
>>>>> clean up as normal and return -EBUSY to the caller.
>>>>>
>>>>> This will allow start_isolate_page_range to serve as a
>>>>> synchronization mechanism and will allow for more general use
>>>>> of callers making use of these interfaces.  So, update comments
>>>>> in alloc_contig_range to reflect this new functionality.
>>>>>
>>>>> ...
>>>>>
>>>>> --- a/mm/page_isolation.c
>>>>> +++ b/mm/page_isolation.c
>>>>> @@ -28,6 +28,13 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
>>>>>  
>>>>>  	spin_lock_irqsave(&zone->lock, flags);
>>>>>  
>>>>> +	/*
>>>>> +	 * We assume we are the only ones trying to isolate this block.
>>>>> +	 * If MIGRATE_ISOLATE already set, return -EBUSY
>>>>> +	 */
>>>>> +	if (is_migrate_isolate_page(page))
>>>>> +		goto out;
>>>>> +
>>>>>  	pfn = page_to_pfn(page);
>>>>>  	arg.start_pfn = pfn;
>>>>>  	arg.nr_pages = pageblock_nr_pages;
>>>>
>>>> Seems a bit ugly and I'm not sure that it's correct.  If the loop in
>>>> start_isolate_page_range() gets partway through a number of pages then
>>>> we hit the race, start_isolate_page_range() will then go and "undo" the
>>>> work being done by the thread which it is racing against?
>>>
>>> I agree that it is a bit ugly.  However, when a thread hits the above
>>> condition it will only undo what it has done.  Only one thread is able
>>> to set migrate state to isolate (under the zone lock).  So, a thread
>>> will only undo what it has done.
>>
>> I don't get it.  That would make sense if start_isolate_page_range()
>> held zone->lock across the entire loop, but it doesn't do that.
>>
> 
> It works because all threads set migrate isolate on page blocks going
> from pfn low to pfn high.  When they encounter a conflict, they know
> exactly which blocks they set and only undo those blocks.  Perhaps, I
> am missing something, but it does not matter because ...
> 
>>> The exact problem of one thread undoing what another thread has done
>>> is possible with the code today and is what this patch is attempting
>>> to address.
>>>
>>>> Even if that can't happen, blundering through a whole bunch of pages
>>>> then saying whoops then undoing everything is unpleasing.
>>>>
>>>> Should we be looking at preventing these races at a higher level?
>>>
>>> I could not immediately come up with a good idea here.  The zone lock
>>> would be the obvious choice, but I don't think we want to hold it while
>>> examining each of the page blocks.  Perhaps a new lock or semaphore
>>> associated with the zone?  I'm open to suggestions.
>>
>> Yes, I think it would need a new lock.  Hopefully a mutex.
> 
> I'll look into adding an 'isolate' mutex to the zone structure and reworking
> this patch.

I went back and examined the 'isolation functionality' with an eye toward
adding a mutex for some higher level synchronization.  However, there does
not appear to be a straightforward solution.

What we really need is some way of preventing two threads from operating on
the same set of page blocks concurrently.  We do not want a big mutex, as
we do want two threads to run in parallel if operating on separate
non-overlapping ranges (CMA does this today).  To support that, I think we
would need a new data structure to represent page blocks within a zone.
start_isolate_page_range() would then check the new data structure for
conflicts and, if none are found, mark the range it is operating on as
'in use'.  undo_isolate_page_range() would clear the entries for the range
in the new data structure.  Such information would hang off the zone and be
protected by the zone lock.  The new data structure could be static (like a
bit map), or dynamic.  It certainly is doable, but ...
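
A rough sketch of what such 'in use' tracking might look like (hypothetical
only -- the bitmap field and the helper are invented names, and this was
never implemented):

/*
 * One bit per pageblock in the zone; all updates under zone->lock.
 * Returns true if the whole range was free and is now marked 'in use'.
 */
static bool reserve_isolation_range(struct zone *zone,
				    unsigned long start_pfn,
				    unsigned long end_pfn)
{
	unsigned long flags, pfn;
	bool ret = true;

	spin_lock_irqsave(&zone->lock, flags);
	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
		if (test_bit((pfn - zone->zone_start_pfn) / pageblock_nr_pages,
			     zone->isolate_in_use)) {
			ret = false;	/* conflict with another range */
			goto out;
		}
	}
	for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages)
		set_bit((pfn - zone->zone_start_pfn) / pageblock_nr_pages,
			zone->isolate_in_use);
out:
	spin_unlock_irqrestore(&zone->lock, flags);
	return ret;
}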

The more I think about it, the more I like my original proposal.  The
comment "blundering through a whole bunch of pages then saying whoops
then undoing everything is unpleasing" is certainly true.  But do note
that after isolating the page blocks, we will then attempt to migrate
pages within those blocks.  There is more than a minimal chance that
we will not be able to migrate something within the set of page blocks.
In that case we again say whoops and undo even more work.

I am relatively new to this area of code.  Therefore, it would be good to
get comments from some of the original authors.
-- 
Mike Kravetz


* Re: [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated
  2018-03-06  0:57           ` Mike Kravetz
@ 2018-03-06 22:32             ` Andrew Morton
  0 siblings, 0 replies; 11+ messages in thread
From: Andrew Morton @ 2018-03-06 22:32 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Luiz Capitulino,
	Michal Nazarewicz, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Johannes Weiner

On Mon, 5 Mar 2018 16:57:40 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> >>>
> >>> I could not immediately come up with a good idea here.  The zone lock
> >>> would be the obvious choice, but I don't think we want to hold it while
> >>> examining each of the page blocks.  Perhaps a new lock or semaphore
> >>> associated with the zone?  I'm open to suggestions.
> >>
> >> Yes, I think it would need a new lock.  Hopefully a mutex.
> > 
> > I'll look into adding an 'isolate' mutex to the zone structure and reworking
> > this patch.
> 
> I went back and examined the 'isolation functionality' with an eye on perhaps
> adding a mutex for some higher level synchronization.  However, there does
> not appear to be a straight forward solution.
> 
> What we really need is some way of preventing two threads from operating on
> the same set of page blocks concurrently.  We do not want a big mutex, as
> we do want two threads to run in parallel if operating on separate
> non-overlapping ranges (CMA does this today).  If we did this, I think we
> would need a new data structure to represent page blocks within a zone.
> start_isolate_page_range() would then then check the new data structure for
> conflicts, and if none found mark the range it is operating on as 'in use'.
> undo_isolate_page_range() would clear the entries for the range in the new
> data structure.  Such information would hang off the zone and be protected
> by the zone lock.  The new data structure could be static (like a bit map),
> or dynamic.  It certainly is doable, but ...
> 
> The more I think about it, the more I like my original proposal.  The
> comment "blundering through a whole bunch of pages then saying whoops
> then undoing everything is unpleasing" is certainly true.  But do note
> that after isolating the page blocks, we will then attempt to migrate
> pages within those blocks.  There is a more than a minimal chance that
> we will not be able to migrate something within the set of page blocks.
> In that case we again say whoops and undo even more work.
> 
> I am relatively new to this area of code.  Therefore, it would be good to
> get comments from some of the original authors.

hm, OK.  Perhaps it would help to produce a v2 which has more comments
and changelogging describing what's happening here and why things are
as they are.


* [PATCH v2] mm: make start_isolate_page_range() fail if already isolated
  2018-02-26 19:10 [PATCH 0/1] make start_isolate_page_range() thread safe Mike Kravetz
  2018-02-26 19:10 ` [PATCH 1/1] mm: make start_isolate_page_range() fail if already isolated Mike Kravetz
@ 2018-03-09 22:47 ` Mike Kravetz
  2018-03-13 21:14   ` Andrew Morton
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2018-03-09 22:47 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: KAMEZAWA Hiroyuki, Luiz Capitulino, Michal Nazarewicz,
	Michal Hocko, Vlastimil Babka, Mel Gorman, Johannes Weiner,
	Andrew Morton, Mike Kravetz

start_isolate_page_range() is used to set the migrate type of a
set of pageblocks to MIGRATE_ISOLATE while attempting to start
a migration operation.  It assumes that only one thread is
calling it for the specified range.  This routine is used by
CMA, memory hotplug and gigantic huge pages.  Each of these users
synchronizes access to the range within its own subsystem.  However,
two subsystems (CMA and gigantic huge pages for example) could
attempt operations on the same range.  If this happens, one thread
may 'undo' the work another thread is doing.  This can result in
pageblocks being incorrectly left marked as MIGRATE_ISOLATE and
therefore not available for page allocation.

What is ideally needed is a way to synchronize access to a set
of pageblocks that are undergoing isolation and migration.  The
only thing we know about these pageblocks is that they are all
in the same zone.  A per-node mutex is too coarse as we want to
allow multiple operations on different ranges within the same zone
concurrently.  Instead, we will use the migration type of the
pageblocks themselves as a form of synchronization.

start_isolate_page_range sets the migration type on a set of
pageblocks, going in order from the one associated with the smallest
pfn to the one with the largest pfn.  The zone lock is acquired to
check and set the migration type.  When going through the list of
pageblocks, check if MIGRATE_ISOLATE is already set.  If so, this
indicates another thread is working on that pageblock.  We know
exactly which pageblocks we set, so clean up by undoing those and
return -EBUSY.

This allows start_isolate_page_range to serve as a synchronization
mechanism and will allow for more general use of these interfaces
by callers.  Update comments in alloc_contig_range
to reflect this new functionality.
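
As a usage note, a caller can now treat -EBUSY as "part of this range is
owned by another thread" and simply move on to another candidate range.
A hypothetical caller sketch (not actual kernel code; the function name
and retry policy are invented for illustration):

static struct page *alloc_contig_pages_sketch(unsigned long start,
					      unsigned long end,
					      unsigned long nr_pages)
{
	unsigned long pfn;
	int ret;

	for (pfn = start; pfn + nr_pages <= end; pfn += nr_pages) {
		ret = alloc_contig_range(pfn, pfn + nr_pages,
					 MIGRATE_MOVABLE, GFP_KERNEL);
		if (!ret)
			return pfn_to_page(pfn);	/* success */
		/*
		 * -EBUSY (range busy or already isolated elsewhere) or a
		 * migration failure: just try the next candidate range.
		 */
	}
	return NULL;
}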

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
Changes in v2
  * Updated commit message and comments as suggested by Andrew Morton

 mm/page_alloc.c     |  8 ++++----
 mm/page_isolation.c | 18 +++++++++++++++++-
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb416723538f..02a17efac233 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7621,11 +7621,11 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
  * @gfp_mask:	GFP mask to use during compaction
  *
  * The PFN range does not have to be pageblock or MAX_ORDER_NR_PAGES
- * aligned, however it's the caller's responsibility to guarantee that
- * we are the only thread that changes migrate type of pageblocks the
- * pages fall in.
+ * aligned.  The PFN range must belong to a single zone.
  *
- * The PFN range must belong to a single zone.
+ * The first thing this routine does is attempt to MIGRATE_ISOLATE all
+ * pageblocks in the range.  Once isolated, the pageblocks should not
+ * be modified by others.
  *
  * Returns zero on success or negative error code.  On success all
  * pages which PFN is in [start, end) are allocated for the caller and
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 165ed8117bd1..61dee77bb211 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -28,6 +28,14 @@ static int set_migratetype_isolate(struct page *page, int migratetype,
 
 	spin_lock_irqsave(&zone->lock, flags);
 
+	/*
+	 * We assume the caller intended to SET migrate type to isolate.
+	 * If it is already set, then someone else must have raced and
+	 * set it before us.  Return -EBUSY
+	 */
+	if (is_migrate_isolate_page(page))
+		goto out;
+
 	pfn = page_to_pfn(page);
 	arg.start_pfn = pfn;
 	arg.nr_pages = pageblock_nr_pages;
@@ -166,7 +174,15 @@ __first_valid_page(unsigned long pfn, unsigned long nr_pages)
  * future will not be allocated again.
  *
  * start_pfn/end_pfn must be aligned to pageblock_order.
- * Returns 0 on success and -EBUSY if any part of range cannot be isolated.
+ * Return 0 on success and -EBUSY if any part of range cannot be isolated.
+ *
+ * There is no high level synchronization mechanism that prevents two threads
+ * from trying to isolate overlapping ranges.  If this happens, one thread
+ * will notice pageblocks in the overlapping range already set to isolate.
+ * This happens in set_migratetype_isolate, and set_migratetype_isolate
+ * returns an error.  We then clean up by restoring the migration type on
+ * pageblocks we may have modified and return -EBUSY to caller.  This
+ * prevents two threads from simultaneously working on overlapping ranges.
  */
 int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
 			     unsigned migratetype, bool skip_hwpoisoned_pages)
-- 
2.13.6


* Re: [PATCH v2] mm: make start_isolate_page_range() fail if already isolated
  2018-03-09 22:47 ` [PATCH v2] " Mike Kravetz
@ 2018-03-13 21:14   ` Andrew Morton
  2018-03-13 21:27     ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2018-03-13 21:14 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Luiz Capitulino,
	Michal Nazarewicz, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Johannes Weiner

On Fri,  9 Mar 2018 14:47:31 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> start_isolate_page_range() is used to set the migrate type of a
> set of pageblocks to MIGRATE_ISOLATE while attempting to start
> a migration operation.  It assumes that only one thread is
> calling it for the specified range.  This routine is used by
> CMA, memory hotplug and gigantic huge pages.  Each of these users
> synchronize access to the range within their subsystem.  However,
> two subsystems (CMA and gigantic huge pages for example) could
> attempt operations on the same range.  If this happens, one thread
> may 'undo' the work another thread is doing.  This can result in
> pageblocks being incorrectly left marked as MIGRATE_ISOLATE and
> therefore not available for page allocation.
> 
> What is ideally needed is a way to synchronize access to a set
> of pageblocks that are undergoing isolation and migration.  The
> only thing we know about these pageblocks is that they are all
> in the same zone.  A per-node mutex is too coarse as we want to
> allow multiple operations on different ranges within the same zone
> concurrently.  Instead, we will use the migration type of the
> pageblocks themselves as a form of synchronization.
> 
> start_isolate_page_range sets the migration type on a set of page-
> blocks going in order from the one associated with the smallest
> pfn to the largest pfn.  The zone lock is acquired to check and
> set the migration type.  When going through the list of pageblocks
> check if MIGRATE_ISOLATE is already set.  If so, this indicates
> another thread is working on this pageblock.  We know exactly
> which pageblocks we set, so clean up by undo those and return
> -EBUSY.
> 
> This allows start_isolate_page_range to serve as a synchronization
> mechanism and will allow for more general use of callers making
> use of these interfaces.  Update comments in alloc_contig_range
> to reflect this new functionality.
> 
> ...
>
> + * There is no high level synchronization mechanism that prevents two threads
> + * from trying to isolate overlapping ranges.  If this happens, one thread
> + * will notice pageblocks in the overlapping range already set to isolate.
> + * This happens in set_migratetype_isolate, and set_migratetype_isolate
> + * returns an error.  We then clean up by restoring the migration type on
> + * pageblocks we may have modified and return -EBUSY to caller.  This
> + * prevents two threads from simultaneously working on overlapping ranges.
>   */

Well I can kinda visualize how this works, with two CPUs chewing away
at two overlapping blocks of pfns, possibly with different starting
pfns.  And I can't immediately see any holes in it, apart from possible
memory ordering issues.  What guarantee is there that CPU1 will see
CPU2's writes in the order in which CPU2 performed them?  And what
guarantee is there that CPU1 will see CPU2's writes in a sequential
manner?  If four of CPU2's writes get written back in a single atomic
flush, what will CPU1 make of that?


* Re: [PATCH v2] mm: make start_isolate_page_range() fail if already isolated
  2018-03-13 21:14   ` Andrew Morton
@ 2018-03-13 21:27     ` Mike Kravetz
  0 siblings, 0 replies; 11+ messages in thread
From: Mike Kravetz @ 2018-03-13 21:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, KAMEZAWA Hiroyuki, Luiz Capitulino,
	Michal Nazarewicz, Michal Hocko, Vlastimil Babka, Mel Gorman,
	Johannes Weiner

On 03/13/2018 02:14 PM, Andrew Morton wrote:
> On Fri,  9 Mar 2018 14:47:31 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:
> 
>> start_isolate_page_range() is used to set the migrate type of a
>> set of pageblocks to MIGRATE_ISOLATE while attempting to start
>> a migration operation.  It assumes that only one thread is
>> calling it for the specified range.  This routine is used by
>> CMA, memory hotplug and gigantic huge pages.  Each of these users
>> synchronize access to the range within their subsystem.  However,
>> two subsystems (CMA and gigantic huge pages for example) could
>> attempt operations on the same range.  If this happens, one thread
>> may 'undo' the work another thread is doing.  This can result in
>> pageblocks being incorrectly left marked as MIGRATE_ISOLATE and
>> therefore not available for page allocation.
>>
>> What is ideally needed is a way to synchronize access to a set
>> of pageblocks that are undergoing isolation and migration.  The
>> only thing we know about these pageblocks is that they are all
>> in the same zone.  A per-node mutex is too coarse as we want to
>> allow multiple operations on different ranges within the same zone
>> concurrently.  Instead, we will use the migration type of the
>> pageblocks themselves as a form of synchronization.
>>
>> start_isolate_page_range sets the migration type on a set of page-
>> blocks going in order from the one associated with the smallest
>> pfn to the largest pfn.  The zone lock is acquired to check and
>> set the migration type.  When going through the list of pageblocks
>> check if MIGRATE_ISOLATE is already set.  If so, this indicates
>> another thread is working on this pageblock.  We know exactly
>> which pageblocks we set, so clean up by undo those and return
>> -EBUSY.
>>
>> This allows start_isolate_page_range to serve as a synchronization
>> mechanism and will allow for more general use of callers making
>> use of these interfaces.  Update comments in alloc_contig_range
>> to reflect this new functionality.
>>
>> ...
>>
>> + * There is no high level synchronization mechanism that prevents two threads
>> + * from trying to isolate overlapping ranges.  If this happens, one thread
>> + * will notice pageblocks in the overlapping range already set to isolate.
>> + * This happens in set_migratetype_isolate, and set_migratetype_isolate
>> + * returns an error.  We then clean up by restoring the migration type on
>> + * pageblocks we may have modified and return -EBUSY to caller.  This
>> + * prevents two threads from simultaneously working on overlapping ranges.
>>   */
> 
> Well I can kinda visualize how this works, with two CPUs chewing away
> at two overlapping blocks of pfns, possibly with different starting
> pfns.  And I can't immediately see any holes in it, apart from possible
> memory ordering issues.  What guarantee is there that CPU1 will see
> CPU2's writes in the order in which CPU2 performed them?  And what
> guarantee is there that CPU1 will see CPU2's writes in a sequential
> manner?  If four of CPU2's writes get written back in a single atomic
> flush, what will CPU1 make of that?
> 

Each CPU holds the associated zone lock to modify or examine the migration
type of a pageblock.  And, it will only examine/update a single pageblock
per lock acquire/release cycle.

-- 
Mike Kravetz

