* Re: [RFC 1/7] mm: introduce MADV_COOL
@ 2019-05-28 12:15 Hillf Danton
2019-05-28 12:39 ` Minchan Kim
0 siblings, 1 reply; 18+ messages in thread
From: Hillf Danton @ 2019-05-28 12:15 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon
On Tue, 28 May 2019 18:58:15 +0800 Minchan Kim wrote:
> On Tue, May 28, 2019 at 04:53:01PM +0800, Hillf Danton wrote:
> >
> > On Mon, 20 May 2019 12:52:48 +0900 Minchan Kim wrote:
> > > +static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> > > + unsigned long end, struct mm_walk *walk)
> > > +{
> > > + pte_t *orig_pte, *pte, ptent;
> > > + spinlock_t *ptl;
> > > + struct page *page;
> > > + struct vm_area_struct *vma = walk->vma;
> > > + unsigned long next;
> > > +
> > > + next = pmd_addr_end(addr, end);
> > > + if (pmd_trans_huge(*pmd)) {
> > > + spinlock_t *ptl;
> >
> > Seems not needed with another ptl declared above.
>
> Will remove it.
>
> > > +
> > > + ptl = pmd_trans_huge_lock(pmd, vma);
> > > + if (!ptl)
> > > + return 0;
> > > +
> > > + if (is_huge_zero_pmd(*pmd))
> > > + goto huge_unlock;
> > > +
> > > + page = pmd_page(*pmd);
> > > + if (page_mapcount(page) > 1)
> > > + goto huge_unlock;
> > > +
> > > + if (next - addr != HPAGE_PMD_SIZE) {
> > > + int err;
> >
> > Alternatively, we could deactivate a THP only if the address range from userspace
> > is sane enough, in order to avoid the complex work we have to do here.
>
> Not sure it's a good idea. That's the way we have done it for MADV_FREE,
> so I want to be consistent.
>
Fair.
> > > +
> > > + get_page(page);
> > > + spin_unlock(ptl);
> > > + lock_page(page);
> > > + err = split_huge_page(page);
> > > + unlock_page(page);
> > > + put_page(page);
> > > + if (!err)
> > > + goto regular_page;
> > > + return 0;
> > > + }
> > > +
> > > + pmdp_test_and_clear_young(vma, addr, pmd);
> > > + deactivate_page(page);
> > > +huge_unlock:
> > > + spin_unlock(ptl);
> > > + return 0;
> > > + }
> > > +
> > > + if (pmd_trans_unstable(pmd))
> > > + return 0;
> > > +
> > > +regular_page:
> >
> > Take a look at pending signal?
>
> Do you have any reason to check for a pending signal here? I want to know
> your requirement so we can find the better place to handle it.
>
We could bail out, IMO, with no work done if there is a fatal signal pending.
And we can do that, if it makes sense to you, before the hard work.
> >
> > > + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> >
> > s/end/next/ ?
>
> Why do you think it should be next?
>
Simply based on the following line; I am afraid that next != end
> > > + next = pmd_addr_end(addr, end);
> > > + ptent = *pte;
> > > +
> > > + if (pte_none(ptent))
> > > + continue;
> > > +
> > > + if (!pte_present(ptent))
> > > + continue;
> > > +
> > > + page = vm_normal_page(vma, addr, ptent);
> > > + if (!page)
> > > + continue;
> > > +
> > > + if (page_mapcount(page) > 1)
> > > + continue;
> > > +
> > > + ptep_test_and_clear_young(vma, addr, pte);
> > > + deactivate_page(page);
> > > + }
> > > +
> > > + pte_unmap_unlock(orig_pte, ptl);
> > > + cond_resched();
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static long madvise_cool(struct vm_area_struct *vma,
> > > + unsigned long start_addr, unsigned long end_addr)
> > > +{
> > > + struct mm_struct *mm = vma->vm_mm;
> > > + struct mmu_gather tlb;
> > > +
> > > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > > + return -EINVAL;
> >
> > No service in case of VM_IO?
>
> I don't know whether VM_IO would have regular LRU pages; I just followed the
> normal convention for DONTNEED and FREE.
> Do you have anything in your mind?
>
I want to skip a mapping set up for DMA.
BR
Hillf
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-28 12:15 [RFC 1/7] mm: introduce MADV_COOL Hillf Danton
@ 2019-05-28 12:39 ` Minchan Kim
0 siblings, 0 replies; 18+ messages in thread
From: Minchan Kim @ 2019-05-28 12:39 UTC (permalink / raw)
To: Hillf Danton
Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon
On Tue, May 28, 2019 at 08:15:23PM +0800, Hillf Danton wrote:
< snip >
> > > > +
> > > > + get_page(page);
> > > > + spin_unlock(ptl);
> > > > + lock_page(page);
> > > > + err = split_huge_page(page);
> > > > + unlock_page(page);
> > > > + put_page(page);
> > > > + if (!err)
> > > > + goto regular_page;
> > > > + return 0;
> > > > + }
> > > > +
> > > > + pmdp_test_and_clear_young(vma, addr, pmd);
> > > > + deactivate_page(page);
> > > > +huge_unlock:
> > > > + spin_unlock(ptl);
> > > > + return 0;
> > > > + }
> > > > +
> > > > + if (pmd_trans_unstable(pmd))
> > > > + return 0;
> > > > +
> > > > +regular_page:
> > >
> > > Take a look at pending signal?
> >
> > Do you have any reason to check for a pending signal here? I want to know
> > your requirement so we can find the better place to handle it.
> >
> We could bail out, IMO, with no work done if there is a fatal signal pending.
> And we can do that, if it makes sense to you, before the hard work.
Makes sense, especially for swapping out.
I will add it in the next revision.
>
> > >
> > > > + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > >
> > > s/end/next/ ?
> >
> > Why do you think it should be next?
> >
> Simply based on the following line; I am afraid that next != end
> > > > + next = pmd_addr_end(addr, end);
pmd_addr_end will return the smaller address, so end is more proper.
>
> > > > + ptent = *pte;
> > > > +
> > > > + if (pte_none(ptent))
> > > > + continue;
> > > > +
> > > > + if (!pte_present(ptent))
> > > > + continue;
> > > > +
> > > > + page = vm_normal_page(vma, addr, ptent);
> > > > + if (!page)
> > > > + continue;
> > > > +
> > > > + if (page_mapcount(page) > 1)
> > > > + continue;
> > > > +
> > > > + ptep_test_and_clear_young(vma, addr, pte);
> > > > + deactivate_page(page);
> > > > + }
> > > > +
> > > > + pte_unmap_unlock(orig_pte, ptl);
> > > > + cond_resched();
> > > > +
> > > > + return 0;
> > > > +}
> > > > +
> > > > +static long madvise_cool(struct vm_area_struct *vma,
> > > > + unsigned long start_addr, unsigned long end_addr)
> > > > +{
> > > > + struct mm_struct *mm = vma->vm_mm;
> > > > + struct mmu_gather tlb;
> > > > +
> > > > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > > > + return -EINVAL;
> > >
> > > No service in case of VM_IO?
> >
> > I don't know whether VM_IO would have regular LRU pages; I just followed the
> > normal convention for DONTNEED and FREE.
> > Do you have anything in your mind?
> >
> I want to skip a mapping set up for DMA.
Do you mean that pages in a VM_IO vma are not on the LRU list?
Or
that pages in the vma are always pinned, so they are not worth deactivating or reclaiming?
* Re: [RFC 1/7] mm: introduce MADV_COOL
@ 2019-05-29 8:52 Hillf Danton
0 siblings, 0 replies; 18+ messages in thread
From: Hillf Danton @ 2019-05-29 8:52 UTC (permalink / raw)
To: Michal Hocko
Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon
On Wed, 29 May 2019 13:05:52 +0800 Michal Hocko wrote:
> On Wed 29-05-19 10:40:33, Hillf Danton wrote:
> > On Wed, 29 May 2019 00:11:15 +0800 Michal Hocko wrote:
> > > On Tue 28-05-19 23:38:11, Hillf Danton wrote:
> > > >
> > > > In short, I prefer to skip IO mappings, since any kind of address range
> > > > can come from userspace and may well cover an IO mapping.
> > > > And things can get out of control if, for instance, we reclaim some IO
> > > > pages while the underlying device is filling data into them.
> > >
> > > What do you mean by IO pages, and what is the actual problem?
> > >
> > IO pages are the backing-store pages of a mapping whose vm_flags has
> > VM_IO set, and the comment in mm/memory.c says:
> > /*
> > * Physically remapped pages are special. Tell the
> > * rest of the world about it:
> > * VM_IO tells people not to look at these pages
> > * (accesses can have side effects).
> >
>
> OK, thanks for the clarification of the first part of the question. Now
> to the second and more important one. What is the actual concern?
> AFAIK those pages shouldn't be on the LRU list.
The backing pages for a GEM object are LRU pages; see the function
drm_gem_get_pages() in drivers/gpu/drm/drm_gem.c.
> If they are, then they should
> be safe to reclaim; otherwise we would have a problem when
> reclaiming them under normal memory pressure.
Yes, Sir, they could be swapped out.
> Why is this madvise any different?
Now, it is not, thanks to the light you are casting.
BR
Hillf
* Re: [RFC 1/7] mm: introduce MADV_COOL
@ 2019-05-29 2:40 Hillf Danton
2019-05-29 5:05 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Hillf Danton @ 2019-05-29 2:40 UTC (permalink / raw)
To: Michal Hocko
Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon
On Wed, 29 May 2019 00:11:15 +0800 Michal Hocko wrote:
> On Tue 28-05-19 23:38:11, Hillf Danton wrote:
> >
> > In short, I prefer to skip IO mappings, since any kind of address range
> > can come from userspace and may well cover an IO mapping.
> > And things can get out of control if, for instance, we reclaim some IO
> > pages while the underlying device is filling data into them.
>
> What do you mean by IO pages, and what is the actual problem?
>
IO pages are the backing-store pages of a mapping whose vm_flags has
VM_IO set, and the comment in mm/memory.c says:
/*
* Physically remapped pages are special. Tell the
* rest of the world about it:
* VM_IO tells people not to look at these pages
* (accesses can have side effects).
BR
Hillf
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-29 2:40 Hillf Danton
@ 2019-05-29 5:05 ` Michal Hocko
0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2019-05-29 5:05 UTC (permalink / raw)
To: Hillf Danton
Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon
On Wed 29-05-19 10:40:33, Hillf Danton wrote:
>
> On Wed, 29 May 2019 00:11:15 +0800 Michal Hocko wrote:
> > On Tue 28-05-19 23:38:11, Hillf Danton wrote:
> > >
> > > In short, I prefer to skip IO mappings, since any kind of address range
> > > can come from userspace and may well cover an IO mapping.
> > > And things can get out of control if, for instance, we reclaim some IO
> > > pages while the underlying device is filling data into them.
> >
> > What do you mean by IO pages, and what is the actual problem?
> >
> IO pages are the backing-store pages of a mapping whose vm_flags has
> VM_IO set, and the comment in mm/memory.c says:
> /*
> * Physically remapped pages are special. Tell the
> * rest of the world about it:
> * VM_IO tells people not to look at these pages
> * (accesses can have side effects).
>
OK, thanks for the clarification of the first part of the question. Now
to the second and more important one. What is the actual concern?
AFAIK those pages shouldn't be on the LRU list. If they are, then they should
be safe to reclaim; otherwise we would have a problem when
reclaiming them under normal memory pressure. Why is this madvise any
different?
--
Michal Hocko
SUSE Labs
* Re: [RFC 1/7] mm: introduce MADV_COOL
@ 2019-05-28 15:38 Hillf Danton
2019-05-28 16:11 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Hillf Danton @ 2019-05-28 15:38 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon
On Tue, 28 May 2019 20:39:36 +0800 Minchan Kim wrote:
> On Tue, May 28, 2019 at 08:15:23PM +0800, Hillf Danton wrote:
> < snip >
> > > > > + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > >
> > > > s/end/next/ ?
> > >
> > > Why do you think it should be next?
> > >
> > Simply based on the following line; I am afraid that next != end
> > > > > + next = pmd_addr_end(addr, end);
>
> pmd_addr_end will return the smaller address, so end is more proper.
>
Fair.
> > > > > +static long madvise_cool(struct vm_area_struct *vma,
> > > > > + unsigned long start_addr, unsigned long end_addr)
> > > > > +{
> > > > > + struct mm_struct *mm = vma->vm_mm;
> > > > > + struct mmu_gather tlb;
> > > > > +
> > > > > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > > > > + return -EINVAL;
> > > >
> > > > No service in case of VM_IO?
> > >
> > > I don't know whether VM_IO would have regular LRU pages; I just followed the
> > > normal convention for DONTNEED and FREE.
> > > Do you have anything in your mind?
> > >
> > I want to skip a mapping set up for DMA.
>
> Do you mean that pages in a VM_IO vma are not on the LRU list?
What concerns me is the case where there are IO pages on the LRU list.
> Or
> that pages in the vma are always pinned, so they are not worth deactivating or reclaiming?
>
I will not be nervous or paranoid if they are pinned.
In short, I prefer to skip IO mappings, since any kind of address range
can come from userspace and may well cover an IO mapping.
And things can get out of control if, for instance, we reclaim some IO
pages while the underlying device is filling data into them.
BR
Hillf
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-28 15:38 Hillf Danton
@ 2019-05-28 16:11 ` Michal Hocko
0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2019-05-28 16:11 UTC (permalink / raw)
To: Hillf Danton
Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon
On Tue 28-05-19 23:38:11, Hillf Danton wrote:
>
> On Tue, 28 May 2019 20:39:36 +0800 Minchan Kim wrote:
> > On Tue, May 28, 2019 at 08:15:23PM +0800, Hillf Danton wrote:
> > < snip >
> > > > > > + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > > > > > + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
> > > > >
> > > > > s/end/next/ ?
> > > >
> > > > Why do you think it should be next?
> > > >
> > > Simply based on the following line; I am afraid that next != end
> > > > > > + next = pmd_addr_end(addr, end);
> >
> > pmd_addr_end will return the smaller address, so end is more proper.
> >
> Fair.
>
> > > > > > +static long madvise_cool(struct vm_area_struct *vma,
> > > > > > + unsigned long start_addr, unsigned long end_addr)
> > > > > > +{
> > > > > > + struct mm_struct *mm = vma->vm_mm;
> > > > > > + struct mmu_gather tlb;
> > > > > > +
> > > > > > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > > > > > + return -EINVAL;
> > > > >
> > > > > No service in case of VM_IO?
> > > >
> > > > I don't know whether VM_IO would have regular LRU pages; I just followed the
> > > > normal convention for DONTNEED and FREE.
> > > > Do you have anything in your mind?
> > > >
> > > I want to skip a mapping set up for DMA.
> >
> > Do you mean that pages in a VM_IO vma are not on the LRU list?
>
> What concerns me is the case where there are IO pages on the LRU list.
> > Or
> > that pages in the vma are always pinned, so they are not worth deactivating or reclaiming?
> >
> I will not be nervous or paranoid if they are pinned.
>
> In short, I prefer to skip IO mappings, since any kind of address range
> can come from userspace and may well cover an IO mapping.
> And things can get out of control if, for instance, we reclaim some IO
> pages while the underlying device is filling data into them.
What do you mean by IO pages, and what is the actual problem?
--
Michal Hocko
SUSE Labs
* [RFC 0/7] introduce memory hinting API for external process
@ 2019-05-20 3:52 Minchan Kim
2019-05-20 3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
0 siblings, 1 reply; 18+ messages in thread
From: Minchan Kim @ 2019-05-20 3:52 UTC (permalink / raw)
To: Andrew Morton
Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim
- Background
The Android terminology for forking a new process and starting an app
from scratch is a cold start, while resuming an existing app is a hot start.
While we continually try to improve the performance of cold starts, hot
starts will always be significantly less power hungry as well as faster, so
we try to make hot starts more likely than cold starts.
To increase hot starts, Android userspace manages the order in which apps
should be killed in a process called ActivityManagerService.
ActivityManagerService tracks every Android app or service that the user
could be interacting with at any time and translates that into a ranked list
for lmkd (the low memory killer daemon). They are likely to be killed by lmkd
if the system has to reclaim memory; in that sense they are similar to
entries in any other cache. Those apps are kept alive for opportunistic
performance improvements, but those improvements will vary based on the
memory requirements of individual workloads.
- Problem
Naturally, cached apps were dominant consumers of memory on the system.
However, they were not significant consumers of swap, even though they are
good candidates for swap. On investigation, swapping out only begins
once the low zone watermark is hit and kswapd wakes up, but the overall
allocation rate in the system might trip lmkd thresholds and cause a cached
process to be killed (we measured the performance of swapping out vs. zapping
the memory by killing a process; unsurprisingly, zapping is 10x faster
even though we use zram, which is much faster than real storage), so a kill
from lmkd will often satisfy the high zone watermark, resulting in very
few pages actually being moved to swap.
- Approach
The approach we chose was to use a new interface to allow userspace to
proactively reclaim entire processes by leveraging platform information.
This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
that are known to be cold from userspace and to avoid races with lmkd
by reclaiming apps as soon as they entered the cached state. Additionally,
it gives the platform many chances to use its own information to
optimize memory efficiency.
To achieve the goal, the patchset introduces two new options for madvise.
One is MADV_COOL, which will deactivate active pages, and the other is
MADV_COLD, which will reclaim private pages instantly. These new options
complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in that
it hints the kernel that the memory region is not currently needed and
should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in that
it hints the kernel that the memory region is not currently needed and
should be reclaimed when memory pressure rises.
This approach is similar in spirit to madvise(MADV_DONTNEED), but the
information required to make the reclaim decision is not known to the app.
Instead, it is known to a centralized userspace daemon, and that daemon
must be able to initiate reclaim on its own without any app involvement.
To address that concern, this patchset introduces a new syscall:
struct pr_madvise_param {
	int size;
	const struct iovec *vec;
};

int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
		    struct pr_madvise_param *results,
		    struct pr_madvise_param *ranges,
		    unsigned long flags);
The syscall takes a pidfd to give hints to an external process, and provides
a pair of result/ranges vector arguments so that it can apply several
hints to multiple address ranges all at once.
I guess others may have different ideas about the naming of the syscall and
its options, so feel free to suggest better names.
- Experiment
We did a bunch of testing with several hundred real users on Android, not an
artificial benchmark. We saw about a 17% reduction in cold starts without any
significant battery or app-startup latency issues. And with an artificial
benchmark which launches and switches apps, we saw on average a 7% app-launch
improvement, 18% fewer lmkd kills, and good stats from vmstat.
A is vanilla and B is process_madvise.
A B delta ratio(%)
allocstall_dma 0 0 0 0.00
allocstall_movable 1464 457 -1007 -69.00
allocstall_normal 263210 190763 -72447 -28.00
allocstall_total 264674 191220 -73454 -28.00
compact_daemon_wake 26912 25294 -1618 -7.00
compact_fail 17885 14151 -3734 -21.00
compact_free_scanned 4204766409 3835994922 -368771487 -9.00
compact_isolated 3446484 2967618 -478866 -14.00
compact_migrate_scanned 1621336411 1324695710 -296640701 -19.00
compact_stall 19387 15343 -4044 -21.00
compact_success 1502 1192 -310 -21.00
kswapd_high_wmark_hit_quickly 234 184 -50 -22.00
kswapd_inodesteal 221635 233093 11458 5.00
kswapd_low_wmark_hit_quickly 66065 54009 -12056 -19.00
nr_dirtied 259934 296476 36542 14.00
nr_vmscan_immediate_reclaim 2587 2356 -231 -9.00
nr_vmscan_write 1274232 2661733 1387501 108.00
nr_written 1514060 2937560 1423500 94.00
pageoutrun 67561 55133 -12428 -19.00
pgactivate 2335060 1984882 -350178 -15.00
pgalloc_dma 13743011 14096463 353452 2.00
pgalloc_movable 0 0 0 0.00
pgalloc_normal 18742440 16802065 -1940375 -11.00
pgalloc_total 32485451 30898528 -1586923 -5.00
pgdeactivate 4262210 2930670 -1331540 -32.00
pgfault 30812334 31085065 272731 0.00
pgfree 33553970 31765164 -1788806 -6.00
pginodesteal 33411 15084 -18327 -55.00
pglazyfreed 0 0 0 0.00
pgmajfault 551312 1508299 956987 173.00
pgmigrate_fail 43927 29330 -14597 -34.00
pgmigrate_success 1399851 1203922 -195929 -14.00
pgpgin 24141776 19032156 -5109620 -22.00
pgpgout 959344 1103316 143972 15.00
pgpgoutclean 4639732 3765868 -873864 -19.00
pgrefill 4884560 3006938 -1877622 -39.00
pgrotated 37828 25897 -11931 -32.00
pgscan_direct 1456037 957567 -498470 -35.00
pgscan_direct_throttle 0 0 0 0.00
pgscan_kswapd 6667767 5047360 -1620407 -25.00
pgscan_total 8123804 6004927 -2118877 -27.00
pgskip_dma 0 0 0 0.00
pgskip_movable 0 0 0 0.00
pgskip_normal 14907 25382 10475 70.00
pgskip_total 14907 25382 10475 70.00
pgsteal_direct 1118986 690215 -428771 -39.00
pgsteal_kswapd 4750223 3657107 -1093116 -24.00
pgsteal_total 5869209 4347322 -1521887 -26.00
pswpin 417613 1392647 975034 233.00
pswpout 1274224 2661731 1387507 108.00
slabs_scanned 13686905 10807200 -2879705 -22.00
workingset_activate 668966 569444 -99522 -15.00
workingset_nodereclaim 38957 32621 -6336 -17.00
workingset_refault 2816795 2179782 -637013 -23.00
workingset_restore 294320 168601 -125719 -43.00
pgmajfault increased by 173% because swapin increased by roughly 200% due to
the process_madvise hints. However, a swap read backed by zram is much cheaper
than file IO from a performance point of view, and an app hot start via swapin
is also cheaper than a cold start from scratch, which needs much IO from
storage plus initialization steps.
This patchset is against next-20190517.
Minchan Kim (7):
mm: introduce MADV_COOL
mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
mm: introduce MADV_COLD
mm: factor out madvise's core functionality
mm: introduce external memory hinting API
mm: extend process_madvise syscall to support vector array
mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/page-flags.h | 1 +
include/linux/page_idle.h | 15 +
include/linux/proc_fs.h | 1 +
include/linux/swap.h | 2 +
include/linux/syscalls.h | 2 +
include/uapi/asm-generic/mman-common.h | 12 +
include/uapi/asm-generic/unistd.h | 2 +
kernel/signal.c | 2 +-
kernel/sys_ni.c | 1 +
mm/madvise.c | 600 +++++++++++++++++++++----
mm/swap.c | 43 ++
mm/vmscan.c | 80 +++-
14 files changed, 680 insertions(+), 83 deletions(-)
--
2.21.0.1020.gf2820cf01a-goog
* [RFC 1/7] mm: introduce MADV_COOL
2019-05-20 3:52 [RFC 0/7] introduce memory hinting API for external process Minchan Kim
@ 2019-05-20 3:52 ` Minchan Kim
2019-05-20 8:16 ` Michal Hocko
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Minchan Kim @ 2019-05-20 3:52 UTC (permalink / raw)
To: Andrew Morton
Cc: LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray,
Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, Minchan Kim
When a process expects no accesses to a certain memory range,
it can hint the kernel that the pages can be reclaimed
when memory pressure happens but that the data should be preserved
for future use. This could reduce working-set eviction and so
end up increasing performance.
This patch introduces the new MADV_COOL hint to madvise(2)
syscall. MADV_COOL can be used by a process to mark a memory range
as not expected to be used in the near future. The hint can help the
kernel decide which pages to evict early during memory
pressure.
Internally, it works by deactivating memory from the active list to the
head of the inactive list, so that when memory pressure happens those
pages are reclaimed earlier than other active pages, unless they are
accessed again before then.
* v1r2
* use clear_page_young in deactivate_page - joelaf
* v1r1
* Revise the description - surenb
* Renaming from MADV_WARM to MADV_COOL - surenb
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/page-flags.h | 1 +
include/linux/page_idle.h | 15 ++++
include/linux/swap.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 112 +++++++++++++++++++++++++
mm/swap.c | 43 ++++++++++
6 files changed, 173 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
TESTPAGEFLAG(Young, young, PF_ANY)
SETPAGEFLAG(Young, young, PF_ANY)
TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
PAGEFLAG(Idle, idle, PF_ANY)
#endif
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
SetPageYoung(page);
}
+static inline void clear_page_young(struct page *page)
+{
+ ClearPageYoung(page);
+}
+
static inline bool test_and_clear_page_young(struct page *page)
{
return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
}
+static void clear_page_young(struct page *page)
+{
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (unlikely(!page_ext))
+ return;
+
+ clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
static inline bool test_and_clear_page_young(struct page *page)
{
struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 4bfb5c4ac108..64795abea003 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index abd238d0f7a4..f7a4a5d4b642 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -42,6 +42,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_COOL 5 /* deactivate these pages */
/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE 8 /* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..c05817fb570d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -8,6 +8,7 @@
#include <linux/mman.h>
#include <linux/pagemap.h>
+#include <linux/page_idle.h>
#include <linux/syscalls.h>
#include <linux/mempolicy.h>
#include <linux/page-isolation.h>
@@ -40,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_COOL:
case MADV_FREE:
return 0;
default:
@@ -307,6 +309,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}
+static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ struct page *page;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ spinlock_t *ptl;
+
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+ if (is_huge_zero_pmd(*pmd))
+ goto huge_unlock;
+
+ page = pmd_page(*pmd);
+ if (page_mapcount(page) > 1)
+ goto huge_unlock;
+
+ if (next - addr != HPAGE_PMD_SIZE) {
+ int err;
+
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ err = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (!err)
+ goto regular_page;
+ return 0;
+ }
+
+ pmdp_test_and_clear_young(vma, addr, pmd);
+ deactivate_page(page);
+huge_unlock:
+ spin_unlock(ptl);
+ return 0;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+regular_page:
+ orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (pte_none(ptent))
+ continue;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (page_mapcount(page) > 1)
+ continue;
+
+ ptep_test_and_clear_young(vma, addr, pte);
+ deactivate_page(page);
+ }
+
+ pte_unmap_unlock(orig_pte, ptl);
+ cond_resched();
+
+ return 0;
+}
+
+static void madvise_cool_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk cool_walk = {
+ .pmd_entry = madvise_cool_pte_range,
+ .mm = vma->vm_mm,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &cool_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cool(struct vm_area_struct *vma,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+ madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+ return 0;
+}
+
static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
@@ -695,6 +804,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_COOL:
+ return madvise_cool(vma, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +827,7 @@ madvise_behavior_valid(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_FREE:
+ case MADV_COOL:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/swap.c b/mm/swap.c
index 3a75722e68a9..0f94c3b5397d 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -46,6 +46,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -537,6 +538,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, file, 0);
}
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ int file = page_is_file_cache(page);
+ int lru = page_lru_base_type(page);
+
+ del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ clear_page_young(page);
+ add_page_to_lru_list(page, lruvec, lru);
+
+ __count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+ update_page_reclaim_stat(lruvec, file, 0);
+ }
+}
static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
void *arg)
@@ -589,6 +607,10 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -622,6 +644,26 @@ void deactivate_file_page(struct page *page)
}
}
+/*
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page. This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ get_page(page);
+ if (!pagevec_add(pvec, page) || PageCompound(page))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
+}
+
/**
* mark_page_lazyfree - make an anon page lazyfree
* @page: page to deactivate
@@ -686,6 +728,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
--
2.21.0.1020.gf2820cf01a-goog
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-20 3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
@ 2019-05-20 8:16 ` Michal Hocko
2019-05-20 8:19 ` Michal Hocko
2019-05-20 22:54 ` Minchan Kim
2019-05-28 8:53 ` Hillf Danton
2019-05-28 10:58 ` Minchan Kim
2 siblings, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2019-05-20 8:16 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
[CC linux-api]
On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> it could hint kernel that the pages can be reclaimed
> when memory pressure happens but data should be preserved
> for future use. This could reduce workingset eviction so it
> ends up increasing performance.
>
> This patch introduces the new MADV_COOL hint to madvise(2)
> syscall. MADV_COOL can be used by a process to mark a memory range
> as not expected to be used in the near future. The hint can help
> kernel in deciding which pages to evict early during memory
> pressure.
I do not want to start naming fight but MADV_COOL sounds a bit
misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
or MADV_DONTNEED_PRESERVE.
> Internally, it works via deactivating memory from active list to
> inactive's head so when the memory pressure happens, they will be
> reclaimed earlier than other active pages unless there is no
> access until the time.
Could you elaborate about the decision to move to the head rather than
tail? What should happen to inactive pages? Should we move them to the
tail? Your implementation seems to ignore those completely. Why?
What should happen for shared pages? In other words do we want to allow
less privileged process to control evicting of shared pages with a more
privileged one? E.g. think of all sorts of side channel attacks. Maybe
we want to do the same thing as for mincore where write access is
required.
--
Michal Hocko
SUSE Labs
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-20 8:16 ` Michal Hocko
@ 2019-05-20 8:19 ` Michal Hocko
2019-05-20 15:08 ` Suren Baghdasaryan
2019-05-20 22:55 ` Minchan Kim
2019-05-20 22:54 ` Minchan Kim
1 sibling, 2 replies; 18+ messages in thread
From: Michal Hocko @ 2019-05-20 8:19 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
On Mon 20-05-19 10:16:21, Michal Hocko wrote:
> [CC linux-api]
>
> On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > it could hint kernel that the pages can be reclaimed
> > when memory pressure happens but data should be preserved
> > for future use. This could reduce workingset eviction so it
> > ends up increasing performance.
> >
> > This patch introduces the new MADV_COOL hint to madvise(2)
> > syscall. MADV_COOL can be used by a process to mark a memory range
> > as not expected to be used in the near future. The hint can help
> > kernel in deciding which pages to evict early during memory
> > pressure.
>
> I do not want to start naming fight but MADV_COOL sounds a bit
> misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> or MADV_DONTNEED_PRESERVE.
OK, I can see that you have used MADV_COLD for a different mode.
So this one is effectively a non destructive MADV_FREE alternative
so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD
in other patch would then be MADV_DONTNEED_PRESERVE. Right?
--
Michal Hocko
SUSE Labs
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-20 8:19 ` Michal Hocko
@ 2019-05-20 15:08 ` Suren Baghdasaryan
2019-05-20 22:55 ` Minchan Kim
1 sibling, 0 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2019-05-20 15:08 UTC (permalink / raw)
To: Michal Hocko
Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner,
Tim Murray, Joel Fernandes, Daniel Colascione, Shakeel Butt,
Sonny Rao, Brian Geffon, linux-api
On Mon, May 20, 2019 at 1:19 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Mon 20-05-19 10:16:21, Michal Hocko wrote:
> > [CC linux-api]
> >
> > On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range
> > > it could hint kernel that the pages can be reclaimed
> > > when memory pressure happens but data should be preserved
> > > for future use. This could reduce workingset eviction so it
> > > ends up increasing performance.
> > >
> > > This patch introduces the new MADV_COOL hint to madvise(2)
> > > syscall. MADV_COOL can be used by a process to mark a memory range
> > > as not expected to be used in the near future. The hint can help
> > > kernel in deciding which pages to evict early during memory
> > > pressure.
> >
> > I do not want to start naming fight but MADV_COOL sounds a bit
> > misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> > or MADV_DONTNEED_PRESERVE.
>
> OK, I can see that you have used MADV_COLD for a different mode.
> So this one is effectively a non destructive MADV_FREE alternative
> so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD
> in other patch would then be MADV_DONTNEED_PRESERVE. Right?
>
I agree that naming them this way would be more in-line with the
existing API. Another good option IMO could be MADV_RECLAIM_NOW /
MADV_RECLAIM_LAZY which might explain a bit better what they do but
Michal's proposal is more consistent with the current API.
> --
> Michal Hocko
> SUSE Labs
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-20 8:19 ` Michal Hocko
2019-05-20 15:08 ` Suren Baghdasaryan
@ 2019-05-20 22:55 ` Minchan Kim
1 sibling, 0 replies; 18+ messages in thread
From: Minchan Kim @ 2019-05-20 22:55 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
On Mon, May 20, 2019 at 10:19:43AM +0200, Michal Hocko wrote:
> On Mon 20-05-19 10:16:21, Michal Hocko wrote:
> > [CC linux-api]
> >
> > On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range
> > > it could hint kernel that the pages can be reclaimed
> > > when memory pressure happens but data should be preserved
> > > for future use. This could reduce workingset eviction so it
> > > ends up increasing performance.
> > >
> > > This patch introduces the new MADV_COOL hint to madvise(2)
> > > syscall. MADV_COOL can be used by a process to mark a memory range
> > > as not expected to be used in the near future. The hint can help
> > > kernel in deciding which pages to evict early during memory
> > > pressure.
> >
> > I do not want to start naming fight but MADV_COOL sounds a bit
> > misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> > or MADV_DONTNEED_PRESERVE.
>
> OK, I can see that you have used MADV_COLD for a different mode.
> So this one is effectively a non destructive MADV_FREE alternative
> so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD
> in other patch would then be MADV_DONTNEED_PRESERVE. Right?
Correct.
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-20 8:16 ` Michal Hocko
2019-05-20 8:19 ` Michal Hocko
@ 2019-05-20 22:54 ` Minchan Kim
2019-05-21 6:04 ` Michal Hocko
1 sibling, 1 reply; 18+ messages in thread
From: Minchan Kim @ 2019-05-20 22:54 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> [CC linux-api]
Thanks, Michal. I forgot to add it.
>
> On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > it could hint kernel that the pages can be reclaimed
> > when memory pressure happens but data should be preserved
> > for future use. This could reduce workingset eviction so it
> > ends up increasing performance.
> >
> > This patch introduces the new MADV_COOL hint to madvise(2)
> > syscall. MADV_COOL can be used by a process to mark a memory range
> > as not expected to be used in the near future. The hint can help
> > kernel in deciding which pages to evict early during memory
> > pressure.
>
> I do not want to start naming fight but MADV_COOL sounds a bit
> misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> or MADV_DONTNEED_PRESERVE.
Thanks for the suggestion. Since I got several naming suggestions, let's discuss
them all at once in the cover letter.
>
> > Internally, it works via deactivating memory from active list to
> > inactive's head so when the memory pressure happens, they will be
> > reclaimed earlier than other active pages unless there is no
> > access until the time.
>
> Could you elaborate about the decision to move to the head rather than
> tail? What should happen to inactive pages? Should we move them to the
> tail? Your implementation seems to ignore those completely. Why?
Normally, inactive LRU could have used-once pages without any mapping
to user's address space. Such pages would be better candidates to
reclaim when the memory pressure happens. With deactivating only
active LRU pages of the process to the head of inactive LRU, we will
keep them in RAM longer than used-once pages and could have more chance
to be activated once the process is resumed.
>
> What should happen for shared pages? In other words do we want to allow
> less privileged process to control evicting of shared pages with a more
> privileged one? E.g. think of all sorts of side channel attacks. Maybe
> we want to do the same thing as for mincore where write access is
> required.
It doesn't work with shared pages (i.e., page_mapcount > 1). I will add that
to the description.
> --
> Michal Hocko
> SUSE Labs
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-20 22:54 ` Minchan Kim
@ 2019-05-21 6:04 ` Michal Hocko
2019-05-21 9:11 ` Minchan Kim
0 siblings, 1 reply; 18+ messages in thread
From: Michal Hocko @ 2019-05-21 6:04 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
[...]
> > > Internally, it works via deactivating memory from active list to
> > > inactive's head so when the memory pressure happens, they will be
> > > reclaimed earlier than other active pages unless there is no
> > > access until the time.
> >
> > Could you elaborate about the decision to move to the head rather than
> > tail? What should happen to inactive pages? Should we move them to the
> > tail? Your implementation seems to ignore those completely. Why?
>
> Normally, inactive LRU could have used-once pages without any mapping
> to user's address space. Such pages would be better candidates to
> reclaim when the memory pressure happens. With deactivating only
> active LRU pages of the process to the head of inactive LRU, we will
> keep them in RAM longer than used-once pages and could have more chance
> to be activated once the process is resumed.
You are making some assumptions here. You have an explicit call about what is
cold, and now you are assuming something else is even colder. Is this
assumption general enough for people to depend on it? Not that we wouldn't be
able to change the logic later, but that will always be risky - especially in
an area where somebody wants to do user-space-driven memory management.
> > What should happen for shared pages? In other words do we want to allow
> > less privileged process to control evicting of shared pages with a more
> > privileged one? E.g. think of all sorts of side channel attacks. Maybe
> > we want to do the same thing as for mincore where write access is
> > required.
>
> It doesn't work with shared pages(ie, page_mapcount > 1). I will add it
> in the description.
OK, this is good for the starter. It makes the implementation simpler
and we can add shared mappings coverage later.
Although I would argue that touching only writeable mappings should be
reasonably safe.
--
Michal Hocko
SUSE Labs
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-21 6:04 ` Michal Hocko
@ 2019-05-21 9:11 ` Minchan Kim
2019-05-21 10:05 ` Michal Hocko
0 siblings, 1 reply; 18+ messages in thread
From: Minchan Kim @ 2019-05-21 9:11 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
On Tue, May 21, 2019 at 08:04:43AM +0200, Michal Hocko wrote:
> On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> [...]
> > > > Internally, it works via deactivating memory from active list to
> > > > inactive's head so when the memory pressure happens, they will be
> > > > reclaimed earlier than other active pages unless there is no
> > > > access until the time.
> > >
> > > Could you elaborate about the decision to move to the head rather than
> > > tail? What should happen to inactive pages? Should we move them to the
> > > tail? Your implementation seems to ignore those completely. Why?
> >
> > Normally, inactive LRU could have used-once pages without any mapping
> > to user's address space. Such pages would be better candidates to
> > reclaim when the memory pressure happens. With deactivating only
> > active LRU pages of the process to the head of inactive LRU, we will
> > keep them in RAM longer than used-once pages and could have more chance
> > to be activated once the process is resumed.
>
> You are making some assumptions here. You have an explicit call what is
> cold now you are assuming something is even colder. Is this assumption a
> general enough to make people depend on it? Not that we wouldn't be able
> to change to logic later but that will always be risky - especially in
> the area when somebody want to make a user space driven memory
> management.
Think about MADV_FREE. It moves those pages to the head of the inactive file
LRU. See get_scan_count, which forces scanning of the inactive file LRU if it
is large enough, based on the memory pressure.
The reason is that the inactive file LRU is generally likely to hold used-once
pages. Those pages have been the top-priority candidates for reclaim for a
long time.
The only places I am aware of that move pages to the tail of the inactive LRU
are where writeback is done for pages the VM has already decided to reclaim
through LRU aging, or where a destructive operation like invalidation could
not be completed. Those are really strong hints, no doubt.
>
> > > What should happen for shared pages? In other words do we want to allow
> > > less privileged process to control evicting of shared pages with a more
> > > privileged one? E.g. think of all sorts of side channel attacks. Maybe
> > > we want to do the same thing as for mincore where write access is
> > > required.
> >
> > It doesn't work with shared pages(ie, page_mapcount > 1). I will add it
> > in the description.
>
> OK, this is good for the starter. It makes the implementation simpler
> and we can add shared mappings coverage later.
>
> Although I would argue that touching only writeable mappings should be
> reasonably safe.
>
> --
> Michal Hocko
> SUSE Labs
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-21 9:11 ` Minchan Kim
@ 2019-05-21 10:05 ` Michal Hocko
0 siblings, 0 replies; 18+ messages in thread
From: Michal Hocko @ 2019-05-21 10:05 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
On Tue 21-05-19 18:11:34, Minchan Kim wrote:
> On Tue, May 21, 2019 at 08:04:43AM +0200, Michal Hocko wrote:
> > On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> > > On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> > [...]
> > > > > Internally, it works via deactivating memory from active list to
> > > > > inactive's head so when the memory pressure happens, they will be
> > > > > reclaimed earlier than other active pages unless there is no
> > > > > access until the time.
> > > >
> > > > Could you elaborate about the decision to move to the head rather than
> > > > tail? What should happen to inactive pages? Should we move them to the
> > > > tail? Your implementation seems to ignore those completely. Why?
> > >
> > > Normally, inactive LRU could have used-once pages without any mapping
> > > to user's address space. Such pages would be better candicate to
> > > reclaim when the memory pressure happens. With deactivating only
> > > active LRU pages of the process to the head of inactive LRU, we will
> > > keep them in RAM longer than used-once pages and could have more chance
> > > to be activated once the process is resumed.
> >
> > You are making some assumptions here. You have an explicit call what is
> > cold now you are assuming something is even colder. Is this assumption a
> > general enough to make people depend on it? Not that we wouldn't be able
> > to change to logic later but that will always be risky - especially in
> > the area when somebody want to make a user space driven memory
> > management.
>
> Think about MADV_FREE. It moves those pages into inactive file LRU's head.
> See the get_scan_count which makes forceful scanning of inactive file LRU
> if it has enough size based on the memory pressure.
> The reason is it's likely to have used-once pages in inactive file LRU,
> generally. Those pages has been top-priority candidate to be reclaimed
> for a long time.
OK, fair enough. Being consistent with MADV_FREE is reasonable. I just
forgot we do rotate like this there.
--
Michal Hocko
SUSE Labs
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-20 3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
2019-05-20 8:16 ` Michal Hocko
@ 2019-05-28 8:53 ` Hillf Danton
2019-05-28 10:58 ` Minchan Kim
2 siblings, 0 replies; 18+ messages in thread
From: Hillf Danton @ 2019-05-28 8:53 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon
On Mon, 20 May 2019 12:52:48 +0900 Minchan Kim wrote:
> +static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> +{
> + pte_t *orig_pte, *pte, ptent;
> + spinlock_t *ptl;
> + struct page *page;
> + struct vm_area_struct *vma = walk->vma;
> + unsigned long next;
> +
> + next = pmd_addr_end(addr, end);
> + if (pmd_trans_huge(*pmd)) {
> + spinlock_t *ptl;
Seems not needed with another ptl declared above.
> +
> + ptl = pmd_trans_huge_lock(pmd, vma);
> + if (!ptl)
> + return 0;
> +
> + if (is_huge_zero_pmd(*pmd))
> + goto huge_unlock;
> +
> + page = pmd_page(*pmd);
> + if (page_mapcount(page) > 1)
> + goto huge_unlock;
> +
> + if (next - addr != HPAGE_PMD_SIZE) {
> + int err;
Alternatively, we could deactivate the thp only if the address range from
userspace is sane enough, in order to avoid the complex work we have to do here.
> +
> + get_page(page);
> + spin_unlock(ptl);
> + lock_page(page);
> + err = split_huge_page(page);
> + unlock_page(page);
> + put_page(page);
> + if (!err)
> + goto regular_page;
> + return 0;
> + }
> +
> + pmdp_test_and_clear_young(vma, addr, pmd);
> + deactivate_page(page);
> +huge_unlock:
> + spin_unlock(ptl);
> + return 0;
> + }
> +
> + if (pmd_trans_unstable(pmd))
> + return 0;
> +
> +regular_page:
Take a look at pending signal?
> + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
s/end/next/ ?
> + ptent = *pte;
> +
> + if (pte_none(ptent))
> + continue;
> +
> + if (!pte_present(ptent))
> + continue;
> +
> + page = vm_normal_page(vma, addr, ptent);
> + if (!page)
> + continue;
> +
> + if (page_mapcount(page) > 1)
> + continue;
> +
> + ptep_test_and_clear_young(vma, addr, pte);
> + deactivate_page(page);
> + }
> +
> + pte_unmap_unlock(orig_pte, ptl);
> + cond_resched();
> +
> + return 0;
> +}
> +
> +static long madvise_cool(struct vm_area_struct *vma,
> + unsigned long start_addr, unsigned long end_addr)
> +{
> + struct mm_struct *mm = vma->vm_mm;
> + struct mmu_gather tlb;
> +
> + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> + return -EINVAL;
No service in case of VM_IO?
> +
> + lru_add_drain();
> + tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> + madvise_cool_page_range(&tlb, vma, start_addr, end_addr);
> + tlb_finish_mmu(&tlb, start_addr, end_addr);
> +
> + return 0;
> +}
> +
> +/*
> + * deactivate_page - deactivate a page
> + * @page: page to deactivate
> + *
> + * deactivate_page() moves @page to the inactive list if @page was on the active
> + * list and was not an unevictable page. This is done to accelerate the reclaim
> + * of @page.
> + */
> +void deactivate_page(struct page *page)
> +{
> + if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
> + struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
> +
> + get_page(page);
A line of comment seems needed for pinning the page.
> + if (!pagevec_add(pvec, page) || PageCompound(page))
> + pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
> + put_cpu_var(lru_deactivate_pvecs);
> + }
> +}
> +
--
Hillf
* Re: [RFC 1/7] mm: introduce MADV_COOL
2019-05-20 3:52 ` [RFC 1/7] mm: introduce MADV_COOL Minchan Kim
2019-05-20 8:16 ` Michal Hocko
2019-05-28 8:53 ` Hillf Danton
@ 2019-05-28 10:58 ` Minchan Kim
2 siblings, 0 replies; 18+ messages in thread
From: Minchan Kim @ 2019-05-28 10:58 UTC (permalink / raw)
To: Hillf Danton
Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon
On Tue, May 28, 2019 at 04:53:01PM +0800, Hillf Danton wrote:
>
> On Mon, 20 May 2019 12:52:48 +0900 Minchan Kim wrote:
> > +static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr,
> > + unsigned long end, struct mm_walk *walk)
> > +{
> > + pte_t *orig_pte, *pte, ptent;
> > + spinlock_t *ptl;
> > + struct page *page;
> > + struct vm_area_struct *vma = walk->vma;
> > + unsigned long next;
> > +
> > + next = pmd_addr_end(addr, end);
> > + if (pmd_trans_huge(*pmd)) {
> > + spinlock_t *ptl;
>
> Seems not needed with another ptl declared above.
Will remove it.
> > +
> > + ptl = pmd_trans_huge_lock(pmd, vma);
> > + if (!ptl)
> > + return 0;
> > +
> > + if (is_huge_zero_pmd(*pmd))
> > + goto huge_unlock;
> > +
> > + page = pmd_page(*pmd);
> > + if (page_mapcount(page) > 1)
> > + goto huge_unlock;
> > +
> > + if (next - addr != HPAGE_PMD_SIZE) {
> > + int err;
>
> Alternately, we deactivate thp only if the address range from userspace
> is sane enough, in order to avoid complex works we have to do here.
Not sure it's a good idea. That's the way we have done it for MADV_FREE,
so I want to be consistent.
> > +
> > + get_page(page);
> > + spin_unlock(ptl);
> > + lock_page(page);
> > + err = split_huge_page(page);
> > + unlock_page(page);
> > + put_page(page);
> > + if (!err)
> > + goto regular_page;
> > + return 0;
> > + }
> > +
> > + pmdp_test_and_clear_young(vma, addr, pmd);
> > + deactivate_page(page);
> > +huge_unlock:
> > + spin_unlock(ptl);
> > + return 0;
> > + }
> > +
> > + if (pmd_trans_unstable(pmd))
> > + return 0;
> > +
> > +regular_page:
>
> Take a look at pending signal?
Do you have any reason to check for a pending signal here? I want to know what
your requirement is so that I can figure out the better place to handle it.
>
> > + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
>
> s/end/next/ ?
Why do you think it should be next?
> > + ptent = *pte;
> > +
> > + if (pte_none(ptent))
> > + continue;
> > +
> > + if (!pte_present(ptent))
> > + continue;
> > +
> > + page = vm_normal_page(vma, addr, ptent);
> > + if (!page)
> > + continue;
> > +
> > + if (page_mapcount(page) > 1)
> > + continue;
> > +
> > + ptep_test_and_clear_young(vma, addr, pte);
> > + deactivate_page(page);
> > + }
> > +
> > + pte_unmap_unlock(orig_pte, ptl);
> > + cond_resched();
> > +
> > + return 0;
> > +}
> > +
> > +static long madvise_cool(struct vm_area_struct *vma,
> > + unsigned long start_addr, unsigned long end_addr)
> > +{
> > + struct mm_struct *mm = vma->vm_mm;
> > + struct mmu_gather tlb;
> > +
> > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
> > + return -EINVAL;
>
> No service in case of VM_IO?
I don't think VM_IO would have regular LRU pages, but I just followed the
normal convention for DONTNEED and FREE.
Do you have anything in mind?