linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim
       [not found]   ` <87lfi9wxk9.fsf@yhuang-dev.intel.com>
@ 2020-08-20 15:21     ` Dave Hansen
  2020-08-20 16:26       ` Yang Shi
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2020-08-20 15:21 UTC (permalink / raw)
  To: Huang, Ying, Dave Hansen
  Cc: linux-kernel, yang.shi, rientjes, dan.j.williams, Linux-MM

On 8/20/20 1:06 AM, Huang, Ying wrote:
>> +	/* Migrate pages selected for demotion */
>> +	nr_reclaimed += demote_page_list(&ret_pages, &demote_pages, pgdat, sc);
>> +
>>  	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
>>  
>>  	mem_cgroup_uncharge_list(&free_pages);
>> _
> Generally, it's good to batch the page migration.  But one side effect
> is that, if the pages are failed to be migrated, they will be placed
> back to the LRU list instead of falling back to be reclaimed really.
> This may cause some issue in some situation.  For example, if there's no
> enough space in the PMEM (slow) node, so the page migration fails, OOM
> may be triggered, because the direct reclaiming on the DRAM (fast) node
> may make no progress, while it can reclaim some pages really before.

Yes, agreed.

There are a couple of ways we could fix this.  Instead of splicing
'demote_pages' back into 'ret_pages', we could try to get them back on
'page_list' and goto the beginning on shrink_page_list().  This will
probably yield the best behavior, but might be a bit ugly.

We could also add a field to 'struct scan_control' and just stop trying
to migrate after it has failed one or more times.  The trick will be
picking a threshold that doesn't mess with either the normal reclaim
rate or the migration rate.

This is on my list to fix up next.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim
  2020-08-20 15:21     ` [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim Dave Hansen
@ 2020-08-20 16:26       ` Yang Shi
  2020-08-21  0:57         ` Huang, Ying
  0 siblings, 1 reply; 7+ messages in thread
From: Yang Shi @ 2020-08-20 16:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Huang, Ying, Dave Hansen, Linux Kernel Mailing List, Yang Shi,
	David Rientjes, Dan Williams, Linux-MM

On Thu, Aug 20, 2020 at 8:22 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 8/20/20 1:06 AM, Huang, Ying wrote:
> >> +    /* Migrate pages selected for demotion */
> >> +    nr_reclaimed += demote_page_list(&ret_pages, &demote_pages, pgdat, sc);
> >> +
> >>      pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
> >>
> >>      mem_cgroup_uncharge_list(&free_pages);
> >> _
> > Generally, it's good to batch the page migration.  But one side effect
> > is that, if the pages are failed to be migrated, they will be placed
> > back to the LRU list instead of falling back to be reclaimed really.
> > This may cause some issue in some situation.  For example, if there's no
> > enough space in the PMEM (slow) node, so the page migration fails, OOM
> > may be triggered, because the direct reclaiming on the DRAM (fast) node
> > may make no progress, while it can reclaim some pages really before.
>
> Yes, agreed.

Kind of. But I think that should be transient and very rare. The
kswapd on pmem nodes will be waken up to drop pages when we try to
allocate migration target pages. It should be very rare that there is
not reclaimable page on pmem nodes.

>
> There are a couple of ways we could fix this.  Instead of splicing
> 'demote_pages' back into 'ret_pages', we could try to get them back on
> 'page_list' and goto the beginning on shrink_page_list().  This will
> probably yield the best behavior, but might be a bit ugly.
>
> We could also add a field to 'struct scan_control' and just stop trying
> to migrate after it has failed one or more times.  The trick will be
> picking a threshold that doesn't mess with either the normal reclaim
> rate or the migration rate.

In my patchset I implemented a fallback mechanism via adding a new
PGDAT_CONTENDED node flag. Please check this out:
https://patchwork.kernel.org/patch/10993839/.

Basically the PGDAT_CONTENDED flag will be set once migrate_pages()
return -ENOMEM which indicates the target pmem node is under memory
pressure, then it would fallback to regular reclaim path. The flag
would be cleared by clear_pgdat_congested() once the pmem node memory
pressure is gone.

We already use node flags to indicate the state of node in reclaim
code, i.e. PGDAT_WRITEBACK, PGDAT_DIRTY, etc. So, adding a new flag
sounds more straightforward to me IMHO.

>
> This is on my list to fix up next.
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim
  2020-08-20 16:26       ` Yang Shi
@ 2020-08-21  0:57         ` Huang, Ying
  2020-08-21 16:17           ` Yang Shi
  0 siblings, 1 reply; 7+ messages in thread
From: Huang, Ying @ 2020-08-21  0:57 UTC (permalink / raw)
  To: Yang Shi
  Cc: Dave Hansen, Dave Hansen, Linux Kernel Mailing List, Yang Shi,
	David Rientjes, Dan Williams, Linux-MM

Yang Shi <shy828301@gmail.com> writes:

> On Thu, Aug 20, 2020 at 8:22 AM Dave Hansen <dave.hansen@intel.com> wrote:
>>
>> On 8/20/20 1:06 AM, Huang, Ying wrote:
>> >> +    /* Migrate pages selected for demotion */
>> >> +    nr_reclaimed += demote_page_list(&ret_pages, &demote_pages, pgdat, sc);
>> >> +
>> >>      pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
>> >>
>> >>      mem_cgroup_uncharge_list(&free_pages);
>> >> _
>> > Generally, it's good to batch the page migration.  But one side effect
>> > is that, if the pages are failed to be migrated, they will be placed
>> > back to the LRU list instead of falling back to be reclaimed really.
>> > This may cause some issue in some situation.  For example, if there's no
>> > enough space in the PMEM (slow) node, so the page migration fails, OOM
>> > may be triggered, because the direct reclaiming on the DRAM (fast) node
>> > may make no progress, while it can reclaim some pages really before.
>>
>> Yes, agreed.
>
> Kind of. But I think that should be transient and very rare. The
> kswapd on pmem nodes will be waken up to drop pages when we try to
> allocate migration target pages. It should be very rare that there is
> not reclaimable page on pmem nodes.
>
>>
>> There are a couple of ways we could fix this.  Instead of splicing
>> 'demote_pages' back into 'ret_pages', we could try to get them back on
>> 'page_list' and goto the beginning on shrink_page_list().  This will
>> probably yield the best behavior, but might be a bit ugly.
>>
>> We could also add a field to 'struct scan_control' and just stop trying
>> to migrate after it has failed one or more times.  The trick will be
>> picking a threshold that doesn't mess with either the normal reclaim
>> rate or the migration rate.
>
> In my patchset I implemented a fallback mechanism via adding a new
> PGDAT_CONTENDED node flag. Please check this out:
> https://patchwork.kernel.org/patch/10993839/.
>
> Basically the PGDAT_CONTENDED flag will be set once migrate_pages()
> return -ENOMEM which indicates the target pmem node is under memory
> pressure, then it would fallback to regular reclaim path. The flag
> would be cleared by clear_pgdat_congested() once the pmem node memory
> pressure is gone.

There may be some races between the flag set and clear.  For example,

- try to migrate some pages from DRAM node to PMEM node

- no enough free pages on the PMEM node, so wakeup kswapd

- kswapd on PMEM node reclaimed some page and try to clear
  PGDAT_CONTENDED on DRAM node

- set PGDAT_CONTENDED on DRAM node
 
This may be resolvable.  But I still prefer to fallback to real page
reclaiming directly for the pages failed to be migrated.  That looks
more robust.

Best Regards,
Huang, Ying

> We already use node flags to indicate the state of node in reclaim
> code, i.e. PGDAT_WRITEBACK, PGDAT_DIRTY, etc. So, adding a new flag
> sounds more straightforward to me IMHO.
>
>>
>> This is on my list to fix up next.
>>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim
  2020-08-21  0:57         ` Huang, Ying
@ 2020-08-21 16:17           ` Yang Shi
  0 siblings, 0 replies; 7+ messages in thread
From: Yang Shi @ 2020-08-21 16:17 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Dave Hansen, Dave Hansen, Linux Kernel Mailing List, Yang Shi,
	David Rientjes, Dan Williams, Linux-MM

On Thu, Aug 20, 2020 at 5:57 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Yang Shi <shy828301@gmail.com> writes:
>
> > On Thu, Aug 20, 2020 at 8:22 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >>
> >> On 8/20/20 1:06 AM, Huang, Ying wrote:
> >> >> +    /* Migrate pages selected for demotion */
> >> >> +    nr_reclaimed += demote_page_list(&ret_pages, &demote_pages, pgdat, sc);
> >> >> +
> >> >>      pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
> >> >>
> >> >>      mem_cgroup_uncharge_list(&free_pages);
> >> >> _
> >> > Generally, it's good to batch the page migration.  But one side effect
> >> > is that, if the pages are failed to be migrated, they will be placed
> >> > back to the LRU list instead of falling back to be reclaimed really.
> >> > This may cause some issue in some situation.  For example, if there's no
> >> > enough space in the PMEM (slow) node, so the page migration fails, OOM
> >> > may be triggered, because the direct reclaiming on the DRAM (fast) node
> >> > may make no progress, while it can reclaim some pages really before.
> >>
> >> Yes, agreed.
> >
> > Kind of. But I think that should be transient and very rare. The
> > kswapd on pmem nodes will be waken up to drop pages when we try to
> > allocate migration target pages. It should be very rare that there is
> > not reclaimable page on pmem nodes.
> >
> >>
> >> There are a couple of ways we could fix this.  Instead of splicing
> >> 'demote_pages' back into 'ret_pages', we could try to get them back on
> >> 'page_list' and goto the beginning on shrink_page_list().  This will
> >> probably yield the best behavior, but might be a bit ugly.
> >>
> >> We could also add a field to 'struct scan_control' and just stop trying
> >> to migrate after it has failed one or more times.  The trick will be
> >> picking a threshold that doesn't mess with either the normal reclaim
> >> rate or the migration rate.
> >
> > In my patchset I implemented a fallback mechanism via adding a new
> > PGDAT_CONTENDED node flag. Please check this out:
> > https://patchwork.kernel.org/patch/10993839/.
> >
> > Basically the PGDAT_CONTENDED flag will be set once migrate_pages()
> > return -ENOMEM which indicates the target pmem node is under memory
> > pressure, then it would fallback to regular reclaim path. The flag
> > would be cleared by clear_pgdat_congested() once the pmem node memory
> > pressure is gone.
>
> There may be some races between the flag set and clear.  For example,
>
> - try to migrate some pages from DRAM node to PMEM node
>
> - no enough free pages on the PMEM node, so wakeup kswapd
>
> - kswapd on PMEM node reclaimed some page and try to clear
>   PGDAT_CONTENDED on DRAM node
>
> - set PGDAT_CONTENDED on DRAM node

Yes, the race is true. Someone else may set PGDAT_CONTENDED, but pmem
node's kswapd already went to sleep, so the flag might be not be able
to be cleared for a while.

I think this can be solved easily. We can just move the flag set to
kswapd. Once kswapd is waken up we know there is kind of memory
pressure on that node, then set the flag, clear the flag when kswapd
goes to sleep. kswapd is single threaded and just set/clear its own
node's flag, so there should be no race if I don't miss something.

>
> This may be resolvable.  But I still prefer to fallback to real page
> reclaiming directly for the pages failed to be migrated.  That looks
> more robust.
>
> Best Regards,
> Huang, Ying
>
> > We already use node flags to indicate the state of node in reclaim
> > code, i.e. PGDAT_WRITEBACK, PGDAT_DIRTY, etc. So, adding a new flag
> > sounds more straightforward to me IMHO.
> >
> >>
> >> This is on my list to fix up next.
> >>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim
  2020-10-27 15:29   ` Oscar Salvador
@ 2020-10-27 16:53     ` Yang Shi
  0 siblings, 0 replies; 7+ messages in thread
From: Yang Shi @ 2020-10-27 16:53 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Dave Hansen, Linux Kernel Mailing List, Linux MM, Yang Shi,
	David Rientjes, Huang Ying, Dan Williams

On Tue, Oct 27, 2020 at 8:29 AM Oscar Salvador <osalvador@suse.de> wrote:
>
> On Wed, Oct 07, 2020 at 09:17:45AM -0700, Dave Hansen wrote:
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Cc: Yang Shi <yang.shi@linux.alibaba.com>
> > Cc: David Rientjes <rientjes@google.com>
> > Cc: Huang Ying <ying.huang@intel.com>
> > Cc: Dan Williams <dan.j.williams@intel.com>
>
> I am still going through all the details, but just my thoughts on things
> that caught my eye:
>
> > --- a/include/linux/migrate.h~demote-with-migrate_pages       2020-10-07 09:15:31.028642442 -0700
> > +++ b/include/linux/migrate.h 2020-10-07 09:15:31.034642442 -0700
> > @@ -27,6 +27,7 @@ enum migrate_reason {
> >       MR_MEMPOLICY_MBIND,
> >       MR_NUMA_MISPLACED,
> >       MR_CONTIG_RANGE,
> > +     MR_DEMOTION,
> >       MR_TYPES
>
> I think you also need to add it under include/trace/events/migrate.h, so
> mm_migrate_pages event can know about it.

Agree.

>
> > +bool migrate_demote_page_ok(struct page *page, struct scan_control *sc)
>
> Make it static?
> Also, scan_control seems to be unused here.
>
> > +{
> > +     int next_nid = next_demotion_node(page_to_nid(page));
> > +
> > +     VM_BUG_ON_PAGE(!PageLocked(page), page);
>
> Right after the call to migrate_demote_page_ok, we call unlock_page
> which already has this check in place.
> I know that this is only to be on the safe side and we do not loss anything,
> but just my thoughts.
>
> > +static struct page *alloc_demote_page(struct page *page, unsigned long node)
> > +{
> > +     /*
> > +      * Try to fail quickly if memory on the target node is not
> > +      * available.  Leaving out __GFP_IO and __GFP_FS helps with
> > +      * this.  If the desintation node is full, we want kswapd to
> > +      * run there so that its pages will get reclaimed and future
> > +      * migration attempts may succeed.
> > +      */
> > +     gfp_t flags = (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY |
> > +                    __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE |
> > +                    __GFP_KSWAPD_RECLAIM);
>
> I think it would be nicer to have this as a real GFP_ thingy defined.
> e.g: GFP_DEMOTION
>
> > +     /* HugeTLB pages should not be on the LRU */
> > +     WARN_ON_ONCE(PageHuge(page));
>
> I am not sure about this one.
> This could only happen if the page, which now it is in another list, ends up in
> the buddy system. That is quite unlikely bth.
> And nevertheless, this is only a warning, which means that if this scenario gets
> to happen, we will be allocating a single page to satisfy a higher-order page, and
> I am not sure about the situation we will end up with.

IMHO, we should use BUG_ON instead of WARN_ON or we should just back
off if we see hugetlb page in this path and print out some warning.

>
> > +
> > +     if (PageTransHuge(page)) {
> > +             struct page *thp;
> > +
> > +             flags |= __GFP_COMP;
> > +
> > +             thp = alloc_pages_node(node, flags, HPAGE_PMD_ORDER);
> > +             if (!thp)
> > +                     return NULL;
> > +             prep_transhuge_page(thp);
> > +             return thp;
> > +     }
> > +
> > +     return __alloc_pages_node(node, flags, 0);
>
> Would make sense to transform this in some sort of new_demotion_page,
> which actually calls alloc_migration_target with the right stuff in place?
> And then pass a struct migration_target_control so alloc_migration_target
> does the right thing.
> alloc_migration_target also takes care of calling prep_transhuge_page
> when needed.
> e.g:
>
> static struct page *new_demotion_node(struct page *page, unsigned long private)
> {
>         struct migration_target_control mtc = {
>                 .nid = private,
>                 .gfp_mask = GFP_DEMOTION,
>         };
>
>         if (PageTransHuge(page))
>                 mtc.gfp_mask |= __GFP_COMP;
>
>         return alloc_migration_target(page, (unsigned long)&mtc);
> }
>
> The only thing I see is that alloc_migration_target seems to "override"
> the gfp_mask and does ORs GFP_TRANSHUGE for THP pages, which includes
> __GFP_DIRECT_RECLAIM (not appreciated in this case).
> But maybe this can be worked around by checking if gfp_mask == GFP_DEMOTION,
> and if so, just keep the mask as it is.

Makes sense to me.

>
> > +
> > +     if (list_empty(demote_pages))
> > +             return 0;
> > +
> > +     /* Demotion ignores all cpuset and mempolicy settings */
> > +     err = migrate_pages(demote_pages, alloc_demote_page, NULL,
> > +                         target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> > +                         &nr_succeeded);
>
> As I said, instead of alloc_demote_page, use a new_demote_page and make
> alloc_migration_target handle the allocations and prep thp pages.
>
>
> --
> Oscar Salvador
> SUSE L3
>


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim
  2020-10-07 16:17 ` [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim Dave Hansen
@ 2020-10-27 15:29   ` Oscar Salvador
  2020-10-27 16:53     ` Yang Shi
  0 siblings, 1 reply; 7+ messages in thread
From: Oscar Salvador @ 2020-10-27 15:29 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, yang.shi, rientjes, ying.huang, dan.j.williams

On Wed, Oct 07, 2020 at 09:17:45AM -0700, Dave Hansen wrote:
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Yang Shi <yang.shi@linux.alibaba.com>
> Cc: David Rientjes <rientjes@google.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Cc: Dan Williams <dan.j.williams@intel.com>

I am still going through all the details, but just my thoughts on things
that caught my eye:

> --- a/include/linux/migrate.h~demote-with-migrate_pages	2020-10-07 09:15:31.028642442 -0700
> +++ b/include/linux/migrate.h	2020-10-07 09:15:31.034642442 -0700
> @@ -27,6 +27,7 @@ enum migrate_reason {
>  	MR_MEMPOLICY_MBIND,
>  	MR_NUMA_MISPLACED,
>  	MR_CONTIG_RANGE,
> +	MR_DEMOTION,
>  	MR_TYPES

I think you also need to add it under include/trace/events/migrate.h, so
mm_migrate_pages event can know about it.

> +bool migrate_demote_page_ok(struct page *page, struct scan_control *sc)

Make it static?
Also, scan_control seems to be unused here.

> +{
> +	int next_nid = next_demotion_node(page_to_nid(page));
> +
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);

Right after the call to migrate_demote_page_ok, we call unlock_page
which already has this check in place.
I know that this is only to be on the safe side and we do not loss anything,
but just my thoughts.

> +static struct page *alloc_demote_page(struct page *page, unsigned long node)
> +{
> +	/*
> +	 * Try to fail quickly if memory on the target node is not
> +	 * available.  Leaving out __GFP_IO and __GFP_FS helps with
> +	 * this.  If the desintation node is full, we want kswapd to
> +	 * run there so that its pages will get reclaimed and future
> +	 * migration attempts may succeed.
> +	 */
> +	gfp_t flags = (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY |
> +		       __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE |
> +		       __GFP_KSWAPD_RECLAIM);

I think it would be nicer to have this as a real GFP_ thingy defined.
e.g: GFP_DEMOTION

> +	/* HugeTLB pages should not be on the LRU */
> +	WARN_ON_ONCE(PageHuge(page));

I am not sure about this one.
This could only happen if the page, which now it is in another list, ends up in
the buddy system. That is quite unlikely bth.
And nevertheless, this is only a warning, which means that if this scenario gets
to happen, we will be allocating a single page to satisfy a higher-order page, and
I am not sure about the situation we will end up with.

> +
> +	if (PageTransHuge(page)) {
> +		struct page *thp;
> +
> +		flags |= __GFP_COMP;
> +
> +		thp = alloc_pages_node(node, flags, HPAGE_PMD_ORDER);
> +		if (!thp)
> +			return NULL;
> +		prep_transhuge_page(thp);
> +		return thp;
> +	}
> +
> +	return __alloc_pages_node(node, flags, 0);

Would make sense to transform this in some sort of new_demotion_page,
which actually calls alloc_migration_target with the right stuff in place?
And then pass a struct migration_target_control so alloc_migration_target
does the right thing.
alloc_migration_target also takes care of calling prep_transhuge_page
when needed.
e.g:

static struct page *new_demotion_node(struct page *page, unsigned long private)
{
        struct migration_target_control mtc = {
                .nid = private,
                .gfp_mask = GFP_DEMOTION,
        };

        if (PageTransHuge(page))
                mtc.gfp_mask |= __GFP_COMP;

        return alloc_migration_target(page, (unsigned long)&mtc);
}

The only thing I see is that alloc_migration_target seems to "override"
the gfp_mask and does ORs GFP_TRANSHUGE for THP pages, which includes
__GFP_DIRECT_RECLAIM (not appreciated in this case).
But maybe this can be worked around by checking if gfp_mask == GFP_DEMOTION,
and if so, just keep the mask as it is.

> +
> +	if (list_empty(demote_pages))
> +		return 0;
> +
> +	/* Demotion ignores all cpuset and mempolicy settings */
> +	err = migrate_pages(demote_pages, alloc_demote_page, NULL,
> +			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
> +			    &nr_succeeded);

As I said, instead of alloc_demote_page, use a new_demote_page and make
alloc_migration_target handle the allocations and prep thp pages.


-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 7+ messages in thread

* [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim
  2020-10-07 16:17 [RFC][PATCH 0/9] [v4][RESEND] Migrate Pages in lieu of discard Dave Hansen
@ 2020-10-07 16:17 ` Dave Hansen
  2020-10-27 15:29   ` Oscar Salvador
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2020-10-07 16:17 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, Dave Hansen, yang.shi, rientjes, ying.huang, dan.j.williams


From: Dave Hansen <dave.hansen@linux.intel.com>

This is mostly derived from a patch from Yang Shi:

	https://lore.kernel.org/linux-mm/1560468577-101178-10-git-send-email-yang.shi@linux.alibaba.com/

Add code to the reclaim path (shrink_page_list()) to "demote" data
to another NUMA node instead of discarding the data.  This always
avoids the cost of I/O needed to read the page back in and sometimes
avoids the writeout cost when the pagee is dirty.

A second pass through shrink_page_list() will be made if any demotions
fail.  This essentally falls back to normal reclaim behavior in the
case that demotions fail.  Previous versions of this patch may have
simply failed to reclaim pages which were eligible for demotion but
were unable to be demoted in practice.

Note: This just adds the start of infratructure for migration. It is
actually disabled next to the FIXME in migrate_demote_page_ok().

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Yang Shi <yang.shi@linux.alibaba.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>

--

changes from 20200730:
 * Add another pass through shrink_page_list() when demotion
   fails.
---

 b/include/linux/migrate.h |    2 
 b/mm/vmscan.c             |   97 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+)

diff -puN include/linux/migrate.h~demote-with-migrate_pages include/linux/migrate.h
--- a/include/linux/migrate.h~demote-with-migrate_pages	2020-10-07 09:15:31.028642442 -0700
+++ b/include/linux/migrate.h	2020-10-07 09:15:31.034642442 -0700
@@ -27,6 +27,7 @@ enum migrate_reason {
 	MR_MEMPOLICY_MBIND,
 	MR_NUMA_MISPLACED,
 	MR_CONTIG_RANGE,
+	MR_DEMOTION,
 	MR_TYPES
 };
 
@@ -196,6 +197,7 @@ struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
+int next_demotion_node(int node);
 
 #endif /* CONFIG_MIGRATION */
 
diff -puN mm/vmscan.c~demote-with-migrate_pages mm/vmscan.c
--- a/mm/vmscan.c~demote-with-migrate_pages	2020-10-07 09:15:31.030642442 -0700
+++ b/mm/vmscan.c	2020-10-07 09:15:31.037642442 -0700
@@ -43,6 +43,7 @@
 #include <linux/kthread.h>
 #include <linux/freezer.h>
 #include <linux/memcontrol.h>
+#include <linux/migrate.h>
 #include <linux/delayacct.h>
 #include <linux/sysctl.h>
 #include <linux/oom.h>
@@ -1034,6 +1035,24 @@ static enum page_references page_check_r
 	return PAGEREF_RECLAIM;
 }
 
+bool migrate_demote_page_ok(struct page *page, struct scan_control *sc)
+{
+	int next_nid = next_demotion_node(page_to_nid(page));
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageHuge(page), page);
+	VM_BUG_ON_PAGE(PageLRU(page), page);
+
+	if (next_nid == NUMA_NO_NODE)
+		return false;
+	if (PageTransHuge(page) && !thp_migration_supported())
+		return false;
+
+	// FIXME: actually enable this later in the series
+	return false;
+}
+
+
 /* Check if a page is dirty or under writeback */
 static void page_check_dirty_writeback(struct page *page,
 				       bool *dirty, bool *writeback)
@@ -1064,6 +1083,60 @@ static void page_check_dirty_writeback(s
 		mapping->a_ops->is_dirty_writeback(page, dirty, writeback);
 }
 
+static struct page *alloc_demote_page(struct page *page, unsigned long node)
+{
+	/*
+	 * Try to fail quickly if memory on the target node is not
+	 * available.  Leaving out __GFP_IO and __GFP_FS helps with
+	 * this.  If the desintation node is full, we want kswapd to
+	 * run there so that its pages will get reclaimed and future
+	 * migration attempts may succeed.
+	 */
+	gfp_t flags = (__GFP_HIGHMEM | __GFP_MOVABLE | __GFP_NORETRY |
+		       __GFP_NOMEMALLOC | __GFP_NOWARN | __GFP_THISNODE |
+		       __GFP_KSWAPD_RECLAIM);
+	/* HugeTLB pages should not be on the LRU */
+	WARN_ON_ONCE(PageHuge(page));
+
+	if (PageTransHuge(page)) {
+		struct page *thp;
+
+		flags |= __GFP_COMP;
+
+		thp = alloc_pages_node(node, flags, HPAGE_PMD_ORDER);
+		if (!thp)
+			return NULL;
+		prep_transhuge_page(thp);
+		return thp;
+	}
+
+	return __alloc_pages_node(node, flags, 0);
+}
+
+/*
+ * Take pages on @demote_list and attempt to demote them to
+ * another node.  Pages which are not demoted are left on
+ * @demote_pages.
+ */
+static unsigned int demote_page_list(struct list_head *demote_pages,
+				     struct pglist_data *pgdat,
+				     struct scan_control *sc)
+{
+	int target_nid = next_demotion_node(pgdat->node_id);
+	unsigned int nr_succeeded = 0;
+	int err;
+
+	if (list_empty(demote_pages))
+		return 0;
+
+	/* Demotion ignores all cpuset and mempolicy settings */
+	err = migrate_pages(demote_pages, alloc_demote_page, NULL,
+			    target_nid, MIGRATE_ASYNC, MR_DEMOTION,
+			    &nr_succeeded);
+
+	return nr_succeeded;
+}
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -1076,12 +1149,15 @@ static unsigned int shrink_page_list(str
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
+	LIST_HEAD(demote_pages);
 	unsigned int nr_reclaimed = 0;
 	unsigned int pgactivate = 0;
+	bool do_demote_pass = true;
 
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
 
+retry:
 	while (!list_empty(page_list)) {
 		struct address_space *mapping;
 		struct page *page;
@@ -1231,6 +1307,16 @@ static unsigned int shrink_page_list(str
 		}
 
 		/*
+		 * Before reclaiming the page, try to relocate
+		 * its contents to another node.
+		 */
+		if (do_demote_pass && migrate_demote_page_ok(page, sc)) {
+			list_add(&page->lru, &demote_pages);
+			unlock_page(page);
+			continue;
+		}
+
+		/*
 		 * Anonymous process memory has backing store?
 		 * Try to allocate it some swap space here.
 		 * Lazyfree page could be freed directly
@@ -1477,6 +1563,17 @@ keep:
 		list_add(&page->lru, &ret_pages);
 		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
 	}
+	/* 'page_list' is always empty here */
+
+	/* Migrate pages selected for demotion */
+	nr_reclaimed += demote_page_list(&demote_pages, pgdat, sc);
+	/* Pages that could not be demoted are still in @demote_pages */
+	if (!list_empty(&demote_pages)) {
+		/* Pages which failed to demoted go back on on @page_list for retry: */
+		list_splice_init(&demote_pages, page_list);
+		do_demote_pass = false;
+		goto retry;
+	}
 
 	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
 
_


^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2020-10-27 16:54 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20200818184122.29C415DF@viggo.jf.intel.com>
     [not found] ` <20200818184131.C972AFCC@viggo.jf.intel.com>
     [not found]   ` <87lfi9wxk9.fsf@yhuang-dev.intel.com>
2020-08-20 15:21     ` [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim Dave Hansen
2020-08-20 16:26       ` Yang Shi
2020-08-21  0:57         ` Huang, Ying
2020-08-21 16:17           ` Yang Shi
2020-10-07 16:17 [RFC][PATCH 0/9] [v4][RESEND] Migrate Pages in lieu of discard Dave Hansen
2020-10-07 16:17 ` [RFC][PATCH 5/9] mm/migrate: demote pages during reclaim Dave Hansen
2020-10-27 15:29   ` Oscar Salvador
2020-10-27 16:53     ` Yang Shi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).