* [RFCv2 1/6] mm: introduce MADV_COLD
2019-05-31 6:43 [RFCv2 0/6] introduce memory hinting API for external process Minchan Kim
@ 2019-05-31 6:43 ` Minchan Kim
2019-05-31 8:47 ` Michal Hocko
2019-05-31 6:43 ` [RFCv2 2/6] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
` (4 subsequent siblings)
5 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton, Minchan Kim
When a process expects no accesses to a certain memory range, it can
give a hint to the kernel that the pages can be reclaimed when memory
pressure happens, but that the data should be preserved for future use.
This could reduce workingset eviction, so it ends up increasing performance.
This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
MADV_COLD can be used by a process to mark a memory range as not expected
to be used in the near future. The hint can help the kernel decide which
pages to evict early during memory pressure.
Internally, it works by deactivating pages: moving them from the active
list to the head of the inactive list if the page is private. The inactive
list could be full of used-once pages, which are the first candidates for
reclaim; that is also the reason MADV_FREE moves pages to the head of the
inactive LRU list. Therefore, when memory pressure happens, these pages
will be reclaimed earlier than other active pages unless they are accessed
again before then.
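For context, a userspace caller would use the new hint roughly as in the
minimal sketch below (the helper name is hypothetical; the MADV_COLD value
of 5 is the one used by this RFC and is only defined as a fallback when the
libc headers do not provide it):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 5	/* value used by this RFC patch */
#endif

/* Populate an anonymous mapping, mark it cold, and verify the data
 * survived.  Returns 1 on success, 0 on setup failure.  Kernels
 * without this patch reject the hint with EINVAL; the demo treats
 * that as a no-op rather than an error. */
static int cold_hint_demo(size_t npages)
{
	size_t len = npages * (size_t)sysconf(_SC_PAGESIZE);
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 0;

	memset(buf, 0xaa, len);		/* fault the pages in */

	if (madvise(buf, len, MADV_COLD) != 0 && errno != EINVAL)
		goto fail;

	/* Unlike MADV_DONTNEED, MADV_COLD must preserve contents. */
	if (buf[0] != (char)0xaa || buf[len - 1] != (char)0xaa)
		goto fail;

	munmap(buf, len);
	return 1;
fail:
	munmap(buf, len);
	return 0;
}
```

Note the hint is purely advisory: the pages stay mapped and readable, they
are merely better candidates for reclaim under pressure.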
* RFCv1
* renaming from MADV_COOL to MADV_COLD - hannes
* internal review
* use clear_page_young in deactivate_page - joelaf
* Revise the description - surenb
* Renaming from MADV_WARM to MADV_COOL - surenb
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/page-flags.h | 1 +
include/linux/page_idle.h | 15 ++++
include/linux/swap.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 111 +++++++++++++++++++++++++
mm/swap.c | 43 ++++++++++
6 files changed, 172 insertions(+)
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 9f8712a4b1a5..58b06654c8dd 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -424,6 +424,7 @@ static inline bool set_hwpoison_free_buddy_page(struct page *page)
TESTPAGEFLAG(Young, young, PF_ANY)
SETPAGEFLAG(Young, young, PF_ANY)
TESTCLEARFLAG(Young, young, PF_ANY)
+CLEARPAGEFLAG(Young, young, PF_ANY)
PAGEFLAG(Idle, idle, PF_ANY)
#endif
diff --git a/include/linux/page_idle.h b/include/linux/page_idle.h
index 1e894d34bdce..f3f43b317150 100644
--- a/include/linux/page_idle.h
+++ b/include/linux/page_idle.h
@@ -19,6 +19,11 @@ static inline void set_page_young(struct page *page)
SetPageYoung(page);
}
+static inline void clear_page_young(struct page *page)
+{
+ ClearPageYoung(page);
+}
+
static inline bool test_and_clear_page_young(struct page *page)
{
return TestClearPageYoung(page);
@@ -65,6 +70,16 @@ static inline void set_page_young(struct page *page)
set_bit(PAGE_EXT_YOUNG, &page_ext->flags);
}
+static inline void clear_page_young(struct page *page)
+{
+ struct page_ext *page_ext = lookup_page_ext(page);
+
+ if (unlikely(!page_ext))
+ return;
+
+ clear_bit(PAGE_EXT_YOUNG, &page_ext->flags);
+}
+
static inline bool test_and_clear_page_young(struct page *page)
{
struct page_ext *page_ext = lookup_page_ext(page);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index de2c67a33b7e..0ce997edb8bb 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -340,6 +340,7 @@ extern void lru_add_drain_cpu(int cpu);
extern void lru_add_drain_all(void);
extern void rotate_reclaimable_page(struct page *page);
extern void deactivate_file_page(struct page *page);
+extern void deactivate_page(struct page *page);
extern void mark_page_lazyfree(struct page *page);
extern void swap_setup(void);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index bea0278f65ab..1190f4e7f7b9 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -43,6 +43,7 @@
#define MADV_SEQUENTIAL 2 /* expect sequential page references */
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
+#define MADV_COLD 5 /* deactivate these pages */
/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE 8 /* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index 628022e674a7..bff150eab6da 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -40,6 +40,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_REMOVE:
case MADV_WILLNEED:
case MADV_DONTNEED:
+ case MADV_COLD:
case MADV_FREE:
return 0;
default:
@@ -307,6 +308,113 @@ static long madvise_willneed(struct vm_area_struct *vma,
return 0;
}
+static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ struct page *page;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long next;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+ if (is_huge_zero_pmd(*pmd))
+ goto huge_unlock;
+
+ page = pmd_page(*pmd);
+ if (page_mapcount(page) > 1)
+ goto huge_unlock;
+
+ if (next - addr != HPAGE_PMD_SIZE) {
+ int err;
+
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ err = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (!err)
+ goto regular_page;
+ return 0;
+ }
+
+ pmdp_test_and_clear_young(vma, addr, pmd);
+ deactivate_page(page);
+huge_unlock:
+ spin_unlock(ptl);
+ return 0;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+
+regular_page:
+ orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+
+ if (pte_none(ptent))
+ continue;
+
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (page_mapcount(page) > 1)
+ continue;
+
+ ptep_test_and_clear_young(vma, addr, pte);
+ deactivate_page(page);
+ }
+
+ pte_unmap_unlock(orig_pte, ptl);
+ cond_resched();
+
+ return 0;
+}
+
+static void madvise_cold_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk cold_walk = {
+ .pmd_entry = madvise_cold_pte_range,
+ .mm = vma->vm_mm,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &cold_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+static long madvise_cold(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ *prev = vma;
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+ madvise_cold_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+ return 0;
+}
+
static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
@@ -695,6 +803,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_remove(vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
+ case MADV_COLD:
+ return madvise_cold(vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -716,6 +826,7 @@ madvise_behavior_valid(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_FREE:
+ case MADV_COLD:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/swap.c b/mm/swap.c
index 7b079976cbec..cebedab15aa2 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -47,6 +47,7 @@ int page_cluster;
static DEFINE_PER_CPU(struct pagevec, lru_add_pvec);
static DEFINE_PER_CPU(struct pagevec, lru_rotate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_deactivate_file_pvecs);
+static DEFINE_PER_CPU(struct pagevec, lru_deactivate_pvecs);
static DEFINE_PER_CPU(struct pagevec, lru_lazyfree_pvecs);
#ifdef CONFIG_SMP
static DEFINE_PER_CPU(struct pagevec, activate_page_pvecs);
@@ -538,6 +539,23 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec,
update_page_reclaim_stat(lruvec, file, 0);
}
+static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec,
+ void *arg)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ int file = page_is_file_cache(page);
+ int lru = page_lru_base_type(page);
+
+ del_page_from_lru_list(page, lruvec, lru + LRU_ACTIVE);
+ ClearPageActive(page);
+ ClearPageReferenced(page);
+ clear_page_young(page);
+ add_page_to_lru_list(page, lruvec, lru);
+
+ __count_vm_events(PGDEACTIVATE, hpage_nr_pages(page));
+ update_page_reclaim_stat(lruvec, file, 0);
+ }
+}
static void lru_lazyfree_fn(struct page *page, struct lruvec *lruvec,
void *arg)
@@ -590,6 +608,10 @@ void lru_add_drain_cpu(int cpu)
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_deactivate_file_fn, NULL);
+ pvec = &per_cpu(lru_deactivate_pvecs, cpu);
+ if (pagevec_count(pvec))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+
pvec = &per_cpu(lru_lazyfree_pvecs, cpu);
if (pagevec_count(pvec))
pagevec_lru_move_fn(pvec, lru_lazyfree_fn, NULL);
@@ -623,6 +645,26 @@ void deactivate_file_page(struct page *page)
}
}
+/**
+ * deactivate_page - deactivate a page
+ * @page: page to deactivate
+ *
+ * deactivate_page() moves @page to the inactive list if @page was on the active
+ * list and was not an unevictable page. This is done to accelerate the reclaim
+ * of @page.
+ */
+void deactivate_page(struct page *page)
+{
+ if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
+ struct pagevec *pvec = &get_cpu_var(lru_deactivate_pvecs);
+
+ get_page(page);
+ if (!pagevec_add(pvec, page) || PageCompound(page))
+ pagevec_lru_move_fn(pvec, lru_deactivate_fn, NULL);
+ put_cpu_var(lru_deactivate_pvecs);
+ }
+}
+
/**
* mark_page_lazyfree - make an anon page lazyfree
* @page: page to deactivate
@@ -687,6 +729,7 @@ void lru_add_drain_all(void)
if (pagevec_count(&per_cpu(lru_add_pvec, cpu)) ||
pagevec_count(&per_cpu(lru_rotate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_deactivate_file_pvecs, cpu)) ||
+ pagevec_count(&per_cpu(lru_deactivate_pvecs, cpu)) ||
pagevec_count(&per_cpu(lru_lazyfree_pvecs, cpu)) ||
need_activate_page_drain(cpu)) {
INIT_WORK(work, lru_add_drain_per_cpu);
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-05-31 6:43 ` [RFCv2 1/6] mm: introduce MADV_COLD Minchan Kim
@ 2019-05-31 8:47 ` Michal Hocko
2019-05-31 13:39 ` Minchan Kim
0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-05-31 8:47 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range, it could
> give a hint to kernel that the pages can be reclaimed when memory pressure
> happens but data should be preserved for future use. This could reduce
> workingset eviction so it ends up increasing performance.
>
> This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> MADV_COLD can be used by a process to mark a memory range as not expected
> to be used in the near future. The hint can help kernel in deciding which
> pages to evict early during memory pressure.
>
> Internally, it works via deactivating pages from active list to inactive's
> head if the page is private because inactive list could be full of
> used-once pages which are first candidate for the reclaiming and that's a
> reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> if the memory pressure happens, they will be reclaimed earlier than other
> active pages unless there is no access until the time.
[I am intentionally not looking at the implementation because below
points should be clear from the changelog - sorry about nagging ;)]
What kind of pages can be deactivated? Anonymous/File backed.
Private/shared? If shared, are there any restrictions?
Are there any restrictions on mappings? E.g. what would be an effect of
this operation on hugetlbfs mapping?
Also you are talking about inactive LRU but what kind of LRU is that? Is
it the anonymous LRU? If yes, don't we have the same problem as with the
early MADV_FREE implementation when enough page cache causes that
deactivated anonymous memory doesn't get reclaimed anytime soon. Or
worse never when there is no swap available?
--
Michal Hocko
SUSE Labs
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-05-31 8:47 ` Michal Hocko
@ 2019-05-31 13:39 ` Minchan Kim
2019-05-31 14:03 ` Michal Hocko
0 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 13:39 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range, it could
> > give a hint to kernel that the pages can be reclaimed when memory pressure
> > happens but data should be preserved for future use. This could reduce
> > workingset eviction so it ends up increasing performance.
> >
> > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > MADV_COLD can be used by a process to mark a memory range as not expected
> > to be used in the near future. The hint can help kernel in deciding which
> > pages to evict early during memory pressure.
> >
> > Internally, it works via deactivating pages from active list to inactive's
> > head if the page is private because inactive list could be full of
> > used-once pages which are first candidate for the reclaiming and that's a
> > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > if the memory pressure happens, they will be reclaimed earlier than other
> > active pages unless there is no access until the time.
>
> [I am intentionally not looking at the implementation because below
> points should be clear from the changelog - sorry about nagging ;)]
>
> What kind of pages can be deactivated? Anonymous/File backed.
> Private/shared? If shared, are there any restrictions?
Both file and private pages can be deactivated, from each active LRU
to the corresponding inactive LRU, if the page's map count is one.
In other words,
if (page_mapcount(page) <= 1)
deactivate_page(page);
>
> Are there any restrictions on mappings? E.g. what would be an effect of
> this operation on hugetlbfs mapping?
A VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped, just like with
MADV_FREE|MADV_DONTNEED.
>
> Also you are talking about inactive LRU but what kind of LRU is that? Is
> it the anonymous LRU? If yes, don't we have the same problem as with the
active file page -> inactive file LRU
active anon page -> inactive anon LRU
> early MADV_FREE implementation when enough page cache causes that
> deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> worse never when there is no swap available?
I think MADV_COLD has slightly different semantics from MADV_FREE.
MADV_FREE means it's okay to discard the page under memory pressure
because the content of the page is *garbage*. Furthermore, freeing such
pages has almost zero overhead, since we don't need to swap them out, and
a later access only causes a minor fault. Thus, it makes sense to put
those freeable pages on the inactive file LRU to compete with other
used-once pages.
However, MADV_COLD doesn't mean the content is garbage, and freeing the
page requires a swap-out, with a swap-in afterward. So it is better to
move such pages to the inactive anon LRU list, not the file LRU.
Furthermore, that avoids unnecessarily scanning those cold anonymous
pages when the system doesn't have a swap device.
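The distinction drawn here can be sketched from the userspace side. The
helper names below are hypothetical allocator hooks; the MADV_COLD value is
the one used by this RFC and both fallback defines only apply when the libc
headers lack the constants:

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* upstream value */
#endif
#ifndef MADV_COLD
#define MADV_COLD 5	/* value used by this RFC patch */
#endif

/* The range's contents are dead (e.g. a freed arena): MADV_FREE lets
 * reclaim discard the pages outright, so they compete with used-once
 * pages on the inactive LRU. */
static inline int release_garbage(void *addr, size_t len)
{
	return madvise(addr, len, MADV_FREE);
}

/* The contents are still valid (e.g. a cache we may reuse), just
 * unlikely to be touched soon: MADV_COLD only deactivates the pages,
 * and reclaiming them later still requires swap-out/swap-in. */
static inline int mark_cold(void *addr, size_t len)
{
	return madvise(addr, len, MADV_COLD);
}
```

In short, MADV_FREE declares the data disposable, while MADV_COLD only
declares it unlikely to be needed soon.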
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-05-31 13:39 ` Minchan Kim
@ 2019-05-31 14:03 ` Michal Hocko
2019-05-31 14:34 ` Minchan Kim
0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-05-31 14:03 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > When a process expects no accesses to a certain memory range, it could
> > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > happens but data should be preserved for future use. This could reduce
> > > workingset eviction so it ends up increasing performance.
> > >
> > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > to be used in the near future. The hint can help kernel in deciding which
> > > pages to evict early during memory pressure.
> > >
> > > Internally, it works via deactivating pages from active list to inactive's
> > > head if the page is private because inactive list could be full of
> > > used-once pages which are first candidate for the reclaiming and that's a
> > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > if the memory pressure happens, they will be reclaimed earlier than other
> > > active pages unless there is no access until the time.
> >
> > [I am intentionally not looking at the implementation because below
> > points should be clear from the changelog - sorry about nagging ;)]
> >
> > What kind of pages can be deactivated? Anonymous/File backed.
> > Private/shared? If shared, are there any restrictions?
>
> Both file and private pages could be deactived from each active LRU
> to each inactive LRU if the page has one map_count. In other words,
>
> if (page_mapcount(page) <= 1)
> deactivate_page(page);
Why do we restrict to pages that are single mapped?
> > Are there any restrictions on mappings? E.g. what would be an effect of
> > this operation on hugetlbfs mapping?
>
> VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED
OK documenting that this is restricted to the same vmas as MADV_FREE|DONTNEED
is really useful to mention.
>
> >
> > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > it the anonymous LRU? If yes, don't we have the same problem as with the
>
> active file page -> inactive file LRU
> active anon page -> inacdtive anon LRU
>
> > early MADV_FREE implementation when enough page cache causes that
> > deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> > worse never when there is no swap available?
>
> I think MADV_COLD is a little bit different symantic with MAVD_FREE.
> MADV_FREE means it's okay to discard when the memory pressure because
> the content of the page is *garbage*. Furthemore, freeing such pages is
> almost zero overhead since we don't need to swap out and access
> afterward causes minor fault. Thus, it would make sense to put those
> freeable pages in inactive file LRU to compete other used-once pages.
>
> However, MADV_COLD doesn't means it's a garbage and freeing requires
> swap out/swap in afterward. So, it would be better to move inactive
> anon's LRU list, not file LRU. Furthermore, it would avoid unnecessary
> scanning of those cold anonymous if system doesn't have a swap device.
Please document this, if this is really a desirable semantic because
then you have the same set of problems as we've had with the early
MADV_FREE implementation mentioned above.
--
Michal Hocko
SUSE Labs
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-05-31 14:03 ` Michal Hocko
@ 2019-05-31 14:34 ` Minchan Kim
2019-06-03 7:16 ` Michal Hocko
0 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 14:34 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > When a process expects no accesses to a certain memory range, it could
> > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > happens but data should be preserved for future use. This could reduce
> > > > workingset eviction so it ends up increasing performance.
> > > >
> > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > to be used in the near future. The hint can help kernel in deciding which
> > > > pages to evict early during memory pressure.
> > > >
> > > > Internally, it works via deactivating pages from active list to inactive's
> > > > head if the page is private because inactive list could be full of
> > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > active pages unless there is no access until the time.
> > >
> > > [I am intentionally not looking at the implementation because below
> > > points should be clear from the changelog - sorry about nagging ;)]
> > >
> > > What kind of pages can be deactivated? Anonymous/File backed.
> > > Private/shared? If shared, are there any restrictions?
> >
> > Both file and private pages could be deactived from each active LRU
> > to each inactive LRU if the page has one map_count. In other words,
> >
> > if (page_mapcount(page) <= 1)
> > deactivate_page(page);
>
> Why do we restrict to pages that are single mapped?
Because the page table of one of the processes sharing the page would
have the access bit set, so in the end we couldn't reclaim the page.
The more processes share it, the more likely reclaim is to fail.
>
> > > Are there any restrictions on mappings? E.g. what would be an effect of
> > > this operation on hugetlbfs mapping?
> >
> > VM_LOCKED|VM_HUGETLB|VM_PFNMAP vma will be skipped like MADV_FREE|DONTNEED
>
> OK documenting that this is restricted to the same vmas as MADV_FREE|DONTNEED
> is really useful to mention.
Sure.
>
> >
> > >
> > > Also you are talking about inactive LRU but what kind of LRU is that? Is
> > > it the anonymous LRU? If yes, don't we have the same problem as with the
> >
> > active file page -> inactive file LRU
> > active anon page -> inacdtive anon LRU
> >
> > > early MADV_FREE implementation when enough page cache causes that
> > > deactivated anonymous memory doesn't get reclaimed anytime soon. Or
> > > worse never when there is no swap available?
> >
> > I think MADV_COLD is a little bit different symantic with MAVD_FREE.
> > MADV_FREE means it's okay to discard when the memory pressure because
> > the content of the page is *garbage*. Furthemore, freeing such pages is
> > almost zero overhead since we don't need to swap out and access
> > afterward causes minor fault. Thus, it would make sense to put those
> > freeable pages in inactive file LRU to compete other used-once pages.
> >
> > However, MADV_COLD doesn't means it's a garbage and freeing requires
> > swap out/swap in afterward. So, it would be better to move inactive
> > anon's LRU list, not file LRU. Furthermore, it would avoid unnecessary
> > scanning of those cold anonymous if system doesn't have a swap device.
>
> Please document this, if this is really a desirable semantic because
> then you have the same set of problems as we've had with the early
> MADV_FREE implementation mentioned above.
IIRC, the problem with MADV_FREE was that we couldn't discard freeable
pages because the VM never scans the anonymous LRU on a swapless system.
However, that's not our case, because these pages should be reclaimed,
not discarded.
I will include it in the description.
Thanks.
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-05-31 14:34 ` Minchan Kim
@ 2019-06-03 7:16 ` Michal Hocko
2019-06-03 15:43 ` Daniel Colascione
` (2 more replies)
0 siblings, 3 replies; 39+ messages in thread
From: Michal Hocko @ 2019-06-03 7:16 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > When a process expects no accesses to a certain memory range, it could
> > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > happens but data should be preserved for future use. This could reduce
> > > > > workingset eviction so it ends up increasing performance.
> > > > >
> > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > pages to evict early during memory pressure.
> > > > >
> > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > head if the page is private because inactive list could be full of
> > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > active pages unless there is no access until the time.
> > > >
> > > > [I am intentionally not looking at the implementation because below
> > > > points should be clear from the changelog - sorry about nagging ;)]
> > > >
> > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > Private/shared? If shared, are there any restrictions?
> > >
> > > Both file and private pages could be deactived from each active LRU
> > > to each inactive LRU if the page has one map_count. In other words,
> > >
> > > if (page_mapcount(page) <= 1)
> > > deactivate_page(page);
> >
> > Why do we restrict to pages that are single mapped?
>
> Because page table in one of process shared the page would have access bit
> so finally we couldn't reclaim the page. The more process it is shared,
> the more fail to reclaim.
So what? In other words, why should it be restricted solely based on the
map count? I can see a reason to restrict based on the access
permissions, because we do not want to make all sorts of side channel
attacks easier, but memory reclaim is capable of reclaiming shared pages,
and so far I haven't heard any sound argument why madvise should skip
those. Again, if there are any reasons, then document them in the changelog.
[...]
> > Please document this, if this is really a desirable semantic because
> > then you have the same set of problems as we've had with the early
> > MADV_FREE implementation mentioned above.
>
> IIRC, the problem of MADV_FREE was that we couldn't discard freeable
> pages because VM never scan anonymous LRU with swapless system.
> However, it's not the our case because we should reclaim them, not
> discarding.
Right. But there is still page cache reclaim. Is it expected that
explicitly cold memory doesn't get reclaimed because we have a
sufficient amount of page cache (a very common case) and we never age
anonymous memory because of that?
--
Michal Hocko
SUSE Labs
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-03 7:16 ` Michal Hocko
@ 2019-06-03 15:43 ` Daniel Colascione
2019-06-03 17:27 ` Johannes Weiner
2019-06-04 4:26 ` Minchan Kim
2 siblings, 0 replies; 39+ messages in thread
From: Daniel Colascione @ 2019-06-03 15:43 UTC (permalink / raw)
To: Michal Hocko
Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, Linux API,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn, Oleg Nesterov,
Christian Brauner, oleksandr, hdanton
On Mon, Jun 3, 2019 at 12:16 AM Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it could
> > > > > > give a hint to kernel that the pages can be reclaimed when memory pressure
> > > > > > happens but data should be preserved for future use. This could reduce
> > > > > > workingset eviction so it ends up increasing performance.
> > > > > >
> > > > > > This patch introduces the new MADV_COLD hint to madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > to be used in the near future. The hint can help kernel in deciding which
> > > > > > pages to evict early during memory pressure.
> > > > > >
> > > > > > Internally, it works via deactivating pages from active list to inactive's
> > > > > > head if the page is private because inactive list could be full of
> > > > > > used-once pages which are first candidate for the reclaiming and that's a
> > > > > > reason why MADV_FREE move pages to head of inactive LRU list. Therefore,
> > > > > > if the memory pressure happens, they will be reclaimed earlier than other
> > > > > > active pages unless there is no access until the time.
> > > > >
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > >
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > >
> > > > Both file and private pages could be deactived from each active LRU
> > > > to each inactive LRU if the page has one map_count. In other words,
> > > >
> > > > if (page_mapcount(page) <= 1)
> > > > deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because page table in one of process shared the page would have access bit
> > so finally we couldn't reclaim the page. The more process it is shared,
> > the more fail to reclaim.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.
Whether to reclaim shared pages is a policy decision best left to
userland, IMHO.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-03 7:16 ` Michal Hocko
2019-06-03 15:43 ` Daniel Colascione
@ 2019-06-03 17:27 ` Johannes Weiner
2019-06-03 20:32 ` Michal Hocko
2019-06-04 4:26 ` Minchan Kim
2 siblings, 1 reply; 39+ messages in thread
From: Johannes Weiner @ 2019-06-03 17:27 UTC (permalink / raw)
To: Michal Hocko
Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it can
> > > > > > give the kernel a hint that the pages can be reclaimed under memory
> > > > > > pressure but that their data should be preserved for future use. This
> > > > > > reduces workingset eviction and so ends up improving performance.
> > > > > >
> > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > to be used in the near future. The hint can help the kernel decide which
> > > > > > pages to evict early during memory pressure.
> > > > > >
> > > > > > Internally, it works by deactivating private pages from the active list
> > > > > > to the head of the inactive list. The inactive list tends to be full of
> > > > > > used-once pages, which are the first candidates for reclaim; that is also
> > > > > > why MADV_FREE moves pages to the head of the inactive LRU list. Therefore,
> > > > > > under memory pressure these pages will be reclaimed earlier than other
> > > > > > active pages, unless they are accessed again before then.
> > > > >
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > >
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > >
> > > > Both file-backed and private pages can be deactivated from their active
> > > > LRU to the corresponding inactive LRU if the page is mapped only once.
> > > > In other words,
> > > >
> > > > if (page_mapcount(page) <= 1)
> > > > deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because a page table of one of the processes sharing the page would have
> > the access bit set, and then we could not reclaim the page. The more
> > processes share it, the more likely reclaim is to fail.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.
I think it makes sense. It could be explained, but it also follows
established madvise semantics, and I'm not sure it's necessarily
Minchan's job to re-iterate those.
Sharing isn't exactly transparent to userspace. The kernel does COW,
ksm etc. When you madvise, you can really only speak for your own
reference to that memory - "*I* am not using this."
This is in line with other madvise calls: MADV_DONTNEED clears the
local page table entries and drops the corresponding references, so
shared pages won't get freed. MADV_FREE clears the pte dirty bit and
also has explicit mapcount checks before clearing PG_dirty, so again
shared pages don't get freed.
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-03 17:27 ` Johannes Weiner
@ 2019-06-03 20:32 ` Michal Hocko
2019-06-03 21:50 ` Johannes Weiner
0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-06-03 20:32 UTC (permalink / raw)
To: Johannes Weiner
Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > [...]
> > > > > >
> > > > > > [I am intentionally not looking at the implementation because below
> > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > >
> > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > Private/shared? If shared, are there any restrictions?
> > > > >
> > > > > Both file-backed and private pages can be deactivated from their active
> > > > > LRU to the corresponding inactive LRU if the page is mapped only once.
> > > > > In other words,
> > > > >
> > > > > if (page_mapcount(page) <= 1)
> > > > > deactivate_page(page);
> > > >
> > > > Why do we restrict to pages that are single mapped?
> > >
> > > Because a page table of one of the processes sharing the page would have
> > > the access bit set, and then we could not reclaim the page. The more
> > > processes share it, the more likely reclaim is to fail.
> >
> > So what? In other words why should it be restricted solely based on the
> > map count. I can see a reason to restrict based on the access
> > permissions because we do not want to simplify all sorts of side channel
> > attacks but memory reclaim is capable of reclaiming shared pages and so
> > far I haven't heard any sound argument why madvise should skip those.
> > Again if there are any reasons, then document them in the changelog.
>
> I think it makes sense. It could be explained, but it also follows
> established madvise semantics, and I'm not sure it's necessarily
> Minchan's job to re-iterate those.
>
> Sharing isn't exactly transparent to userspace. The kernel does COW,
> ksm etc. When you madvise, you can really only speak for your own
> reference to that memory - "*I* am not using this."
>
> This is in line with other madvise calls: MADV_DONTNEED clears the
> local page table entries and drops the corresponding references, so
> shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> also has explicit mapcount checks before clearing PG_dirty, so again
> shared pages don't get freed.
Right, being consistent with other madvise syscalls is certainly a way
to go. And I am not pushing one way or another, I just want this to be
documented with a reasoning behind. Consistency is certainly an argument
to use.
On the other hand these non-destructive madvise operations are quite
different and the shared policy might differ as a result as well. We are
aging objects rather than destroying them after all. Being able to age
the pagecache with sufficient privileges sounds like a useful use case to
me. In other words you are able to cause the same effect indirectly
without the madvise operation so it kinda makes sense to allow it in a
more sophisticated way.
That being said, madvise is just a _hint_ and the kernel will always be
free to ignore it, so the implementation may change in the future; we can
start simple and consistent with existing MADV_$FOO operations now and
extend later on. But let's document the intention in the changelog and
make the decision clear. I am sorry to be so anal about this, but I have
seen so many ad-hoc policies that were undocumented, and it was so hard
to make sense of them when revisiting later on.
--
Michal Hocko
SUSE Labs
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-03 20:32 ` Michal Hocko
@ 2019-06-03 21:50 ` Johannes Weiner
2019-06-03 23:02 ` Minchan Kim
2019-06-04 6:55 ` Michal Hocko
0 siblings, 2 replies; 39+ messages in thread
From: Johannes Weiner @ 2019-06-03 21:50 UTC (permalink / raw)
To: Michal Hocko
Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > [...]
> > > > > > >
> > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > >
> > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > >
> > > > > > Both file-backed and private pages can be deactivated from their active
> > > > > > LRU to the corresponding inactive LRU if the page is mapped only once.
> > > > > > In other words,
> > > > > >
> > > > > > if (page_mapcount(page) <= 1)
> > > > > > deactivate_page(page);
> > > > >
> > > > > Why do we restrict to pages that are single mapped?
> > > >
> > > > Because a page table of one of the processes sharing the page would have
> > > > the access bit set, and then we could not reclaim the page. The more
> > > > processes share it, the more likely reclaim is to fail.
> > >
> > > So what? In other words why should it be restricted solely based on the
> > > map count. I can see a reason to restrict based on the access
> > > permissions because we do not want to simplify all sorts of side channel
> > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > far I haven't heard any sound argument why madvise should skip those.
> > > Again if there are any reasons, then document them in the changelog.
> >
> > I think it makes sense. It could be explained, but it also follows
> > established madvise semantics, and I'm not sure it's necessarily
> > Minchan's job to re-iterate those.
> >
> > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > ksm etc. When you madvise, you can really only speak for your own
> > reference to that memory - "*I* am not using this."
> >
> > This is in line with other madvise calls: MADV_DONTNEED clears the
> > local page table entries and drops the corresponding references, so
> > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > also has explicit mapcount checks before clearing PG_dirty, so again
> > shared pages don't get freed.
>
> Right, being consistent with other madvise syscalls is certainly a way
> to go. And I am not pushing one way or another, I just want this to be
> documented with a reasoning behind. Consistency is certainly an argument
> to use.
>
> On the other hand these non-destructive madvise operations are quite
> different and the shared policy might differ as a result as well. We are
> aging objects rather than destroying them after all. Being able to age
> the pagecache with sufficient privileges sounds like a useful use case to
> me. In other words you are able to cause the same effect indirectly
> without the madvise operation so it kinda makes sense to allow it in a
> more sophisticated way.
Right, I don't think it's about permission - as you say, you can do
this indirectly. Page reclaim is all about relative page order, so if
we thwarted you from demoting some pages, you could instead promote
other pages to cause a similar end result.
I think it's about intent. You're advising the kernel that *you're*
not using this memory and would like to have it cleared out based on
that knowledge. You could do the same by simply allocating the new
pages and have the kernel sort it out. However, if the kernel sorts it
out, it *will* look at other users of the page, and it might decide
that other pages are actually colder when considering all users.
When you ignore shared state, on the other hand, the pages you advise
out could refault right after. And then, not only did you not free up
the memory, but you also caused IO that may interfere with bringing in
the new data for which you tried to create room in the first place.
So I don't think it ever makes sense to override it.
But it might be better to drop the explicit mapcount check and instead
make the local pte young and call shrink_page_list() without the
TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
to handle references and shared pages exactly the same way it would if
those pages came fresh off the LRU tail, excluding only the reference
from the mapping that we're madvising.
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-03 21:50 ` Johannes Weiner
@ 2019-06-03 23:02 ` Minchan Kim
2019-06-04 6:56 ` Michal Hocko
2019-06-04 12:06 ` Johannes Weiner
2019-06-04 6:55 ` Michal Hocko
1 sibling, 2 replies; 39+ messages in thread
From: Minchan Kim @ 2019-06-03 23:02 UTC (permalink / raw)
To: Johannes Weiner
Cc: Michal Hocko, Andrew Morton, linux-mm, LKML, linux-api,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
Hi Johannes,
On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > [...]
> > > > > > > >
> > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > >
> > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > >
> > > > > > > Both file-backed and private pages can be deactivated from their active
> > > > > > > LRU to the corresponding inactive LRU if the page is mapped only once.
> > > > > > > In other words,
> > > > > > >
> > > > > > > if (page_mapcount(page) <= 1)
> > > > > > > deactivate_page(page);
> > > > > >
> > > > > > Why do we restrict to pages that are single mapped?
> > > > >
> > > > > Because a page table of one of the processes sharing the page would have
> > > > > the access bit set, and then we could not reclaim the page. The more
> > > > > processes share it, the more likely reclaim is to fail.
> > > >
> > > > So what? In other words why should it be restricted solely based on the
> > > > map count. I can see a reason to restrict based on the access
> > > > permissions because we do not want to simplify all sorts of side channel
> > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > far I haven't heard any sound argument why madvise should skip those.
> > > > Again if there are any reasons, then document them in the changelog.
> > >
> > > I think it makes sense. It could be explained, but it also follows
> > > established madvise semantics, and I'm not sure it's necessarily
> > > Minchan's job to re-iterate those.
> > >
> > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > ksm etc. When you madvise, you can really only speak for your own
> > > reference to that memory - "*I* am not using this."
> > >
> > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > local page table entries and drops the corresponding references, so
> > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > shared pages don't get freed.
> >
> > Right, being consistent with other madvise syscalls is certainly a way
> > to go. And I am not pushing one way or another, I just want this to be
> > documented with a reasoning behind. Consistency is certainly an argument
> > to use.
> >
> > On the other hand these non-destructive madvise operations are quite
> > different and the shared policy might differ as a result as well. We are
> > aging objects rather than destroying them after all. Being able to age
> > the pagecache with sufficient privileges sounds like a useful use case to
> > me. In other words you are able to cause the same effect indirectly
> > without the madvise operation so it kinda makes sense to allow it in a
> > more sophisticated way.
>
> Right, I don't think it's about permission - as you say, you can do
> this indirectly. Page reclaim is all about relative page order, so if
> we thwarted you from demoting some pages, you could instead promote
> other pages to cause a similar end result.
>
> I think it's about intent. You're advising the kernel that *you're*
> not using this memory and would like to have it cleared out based on
> that knowledge. You could do the same by simply allocating the new
> pages and have the kernel sort it out. However, if the kernel sorts it
> out, it *will* look at other users of the page, and it might decide
> that other pages are actually colder when considering all users.
>
> When you ignore shared state, on the other hand, the pages you advise
> out could refault right after. And then, not only did you not free up
> the memory, but you also caused IO that may interfere with bringing in
> the new data for which you tried to create room in the first place.
>
> So I don't think it ever makes sense to override it.
>
> But it might be better to drop the explicit mapcount check and instead
> make the local pte young and call shrink_page_list() without the
^
old?
> TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> to handle references and shared pages exactly the same way it would if
> those pages came fresh off the LRU tail, excluding only the reference
> from the mapping that we're madvising.
I think the name change confused you. Here, MADV_COLD only deactivates
pages, it does not page them out, so shrink_page_list doesn't come into
play. And madvise_cold_pte_range already makes the local pte *old* (I
guess "young" in your reply was a typo).
That seems to be exactly what Michal wanted: just remove the page_mapcount
check and defer the decision to the normal page reclaim policy.
If I didn't miss your intention, it seems you and Michal are on the same
page. (Please correct me if you meant something else.)
I can drop the page_mapcount check in the next revision.
Thanks for the review!
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-03 23:02 ` Minchan Kim
@ 2019-06-04 6:56 ` Michal Hocko
2019-06-04 12:06 ` Johannes Weiner
1 sibling, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2019-06-04 6:56 UTC (permalink / raw)
To: Minchan Kim
Cc: Johannes Weiner, Andrew Morton, linux-mm, LKML, linux-api,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Tue 04-06-19 08:02:05, Minchan Kim wrote:
> Hi Johannes,
>
> On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > >
> > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > >
> > > > > > > > Both file-backed and private pages can be deactivated from their active
> > > > > > > > LRU to the corresponding inactive LRU if the page is mapped only once.
> > > > > > > > In other words,
> > > > > > > >
> > > > > > > > if (page_mapcount(page) <= 1)
> > > > > > > > deactivate_page(page);
> > > > > > >
> > > > > > > Why do we restrict to pages that are single mapped?
> > > > > >
> > > > > > Because a page table of one of the processes sharing the page would have
> > > > > > the access bit set, and then we could not reclaim the page. The more
> > > > > > processes share it, the more likely reclaim is to fail.
> > > > >
> > > > > So what? In other words why should it be restricted solely based on the
> > > > > map count. I can see a reason to restrict based on the access
> > > > > permissions because we do not want to simplify all sorts of side channel
> > > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > > far I haven't heard any sound argument why madvise should skip those.
> > > > > Again if there are any reasons, then document them in the changelog.
> > > >
> > > > I think it makes sense. It could be explained, but it also follows
> > > > established madvise semantics, and I'm not sure it's necessarily
> > > > Minchan's job to re-iterate those.
> > > >
> > > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > > ksm etc. When you madvise, you can really only speak for your own
> > > > reference to that memory - "*I* am not using this."
> > > >
> > > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > > local page table entries and drops the corresponding references, so
> > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > > shared pages don't get freed.
> > >
> > > Right, being consistent with other madvise syscalls is certainly a way
> > > to go. And I am not pushing one way or another, I just want this to be
> > > documented with a reasoning behind. Consistency is certainly an argument
> > > to use.
> > >
> > > On the other hand these non-destructive madvise operations are quite
> > > different and the shared policy might differ as a result as well. We are
> > > aging objects rather than destroying them after all. Being able to age
> > > the pagecache with sufficient privileges sounds like a useful use case to
> > > me. In other words you are able to cause the same effect indirectly
> > > without the madvise operation so it kinda makes sense to allow it in a
> > > more sophisticated way.
> >
> > Right, I don't think it's about permission - as you say, you can do
> > this indirectly. Page reclaim is all about relative page order, so if
> > we thwarted you from demoting some pages, you could instead promote
> > other pages to cause a similar end result.
> >
> > I think it's about intent. You're advising the kernel that *you're*
> > not using this memory and would like to have it cleared out based on
> > that knowledge. You could do the same by simply allocating the new
> > pages and have the kernel sort it out. However, if the kernel sorts it
> > out, it *will* look at other users of the page, and it might decide
> > that other pages are actually colder when considering all users.
> >
> > When you ignore shared state, on the other hand, the pages you advise
> > out could refault right after. And then, not only did you not free up
> > the memory, but you also caused IO that may interfere with bringing in
> > the new data for which you tried to create room in the first place.
> >
> > So I don't think it ever makes sense to override it.
> >
> > But it might be better to drop the explicit mapcount check and instead
> > make the local pte young and call shrink_page_list() without the
> ^
> old?
>
> > TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> > to handle references and shared pages exactly the same way it would if
> > those pages came fresh off the LRU tail, excluding only the reference
> > from the mapping that we're madvising.
>
> I think the name change confused you. Here, MADV_COLD only deactivates
> pages, it does not page them out, so shrink_page_list doesn't come into
> play. And madvise_cold_pte_range already makes the local pte *old* (I
> guess "young" in your reply was a typo).
> That seems to be exactly what Michal wanted: just remove the page_mapcount
> check and defer the decision to the normal page reclaim policy.
> If I didn't miss your intention, it seems you and Michal are on the same
> page. (Please correct me if you meant something else.)
Indeed.
> I can drop the page_mapcount check in the next revision.
Yes please.
--
Michal Hocko
SUSE Labs
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-03 23:02 ` Minchan Kim
2019-06-04 6:56 ` Michal Hocko
@ 2019-06-04 12:06 ` Johannes Weiner
1 sibling, 0 replies; 39+ messages in thread
From: Johannes Weiner @ 2019-06-04 12:06 UTC (permalink / raw)
To: Minchan Kim
Cc: Michal Hocko, Andrew Morton, linux-mm, LKML, linux-api,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Tue, Jun 04, 2019 at 08:02:05AM +0900, Minchan Kim wrote:
> Hi Johannes,
>
> On Mon, Jun 03, 2019 at 05:50:59PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > > [...]
> > > > > > > > >
> > > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > > >
> > > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > > >
> > > > > > > > Both file and private pages can be deactivated from their active LRU
> > > > > > > > to the corresponding inactive LRU if the page has a map_count of one.
> > > > > > > > In other words,
> > > > > > > >
> > > > > > > > if (page_mapcount(page) <= 1)
> > > > > > > > deactivate_page(page);
> > > > > > >
> > > > > > > Why do we restrict to pages that are single mapped?
> > > > > >
> > > > > > Because the page table of any process sharing the page would have the
> > > > > > access bit set, so in the end we couldn't reclaim the page. The more
> > > > > > processes share the page, the more likely reclaim is to fail.
> > > > >
> > > > > So what? In other words why should it be restricted solely based on the
> > > > > map count. I can see a reason to restrict based on the access
> > > > > permissions because we do not want to simplify all sorts of side channel
> > > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > > far I haven't heard any sound argument why madvise should skip those.
> > > > > Again if there are any reasons, then document them in the changelog.
> > > >
> > > > I think it makes sense. It could be explained, but it also follows
> > > > established madvise semantics, and I'm not sure it's necessarily
> > > > Minchan's job to re-iterate those.
> > > >
> > > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > > ksm etc. When you madvise, you can really only speak for your own
> > > > reference to that memory - "*I* am not using this."
> > > >
> > > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > > local page table entries and drops the corresponding references, so
> > > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > > shared pages don't get freed.
> > >
> > > Right, being consistent with other madvise syscalls is certainly a way
> > > to go. And I am not pushing one way or another, I just want this to be
> > > documented with a reasoning behind. Consistency is certainly an argument
> > > to use.
> > >
> > > On the other hand these non-destructive madvise operations are quite
> > > different and the shared policy might differ as a result as well. We are
> > > aging objects rather than destroying them after all. Being able to age
> > > a pagecache with sufficient privileges sounds like a useful usecase to
> > > me. In other words you are able to cause the same effect indirectly
> > > without the madvise operation so it kinda makes sense to allow it in a
> > > more sophisticated way.
> >
> > Right, I don't think it's about permission - as you say, you can do
> > this indirectly. Page reclaim is all about relative page order, so if
> > we thwarted you from demoting some pages, you could instead promote
> > other pages to cause a similar end result.
> >
> > I think it's about intent. You're advising the kernel that *you're*
> > not using this memory and would like to have it cleared out based on
> > that knowledge. You could do the same by simply allocating the new
> > pages and have the kernel sort it out. However, if the kernel sorts it
> > out, it *will* look at other users of the page, and it might decide
> > that other pages are actually colder when considering all users.
> >
> > When you ignore shared state, on the other hand, the pages you advise
> > out could refault right after. And then, not only did you not free up
> > the memory, but you also caused IO that may interfere with bringing in
> > the new data for which you tried to create room in the first place.
> >
> > So I don't think it ever makes sense to override it.
> >
> > But it might be better to drop the explicit mapcount check and instead
> > make the local pte young and call shrink_page_list() without the
> ^
> old?
Ah yes, of course. Clear the reference bit.
> > TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> > to handle references and shared pages exactly the same way it would if
> > those pages came fresh off the LRU tail, excluding only the reference
> > from the mapping that we're madvising.
>
> I think the name change confused things: here, MADV_COLD is deactivating,
> not paging out, so shrink_page_list doesn't matter.
> And madvise_cold_pte_range already makes the local pte *old* (I guess
> what you wrote was a typo).
> I guess that's exactly what Michal wanted: just remove the page_mapcount
> check and defer the decision to the normal page reclaim policy.
> If I didn't miss your intention, it seems you and Michal are on the same page.
> (Please correct me if you meant something else.)
> I could drop the page_mapcount check in the next revision.
Sorry, I was actually talking about the MADV_PAGEOUT patch in this
case, since this turned into a general discussion about how shared
pages should be handled, which applies to both operations.
My argument was for removing the check in both patches, yes, but to
additionally change the pageout patch to 1) make the advised pte old
and then 2) call shrink_page_list() WITHOUT ignore_access/references
so that it respects references from other mappings, if any.
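To make the proposed scheme concrete, here is a small userspace toy model (an illustration only; the types and names are invented, not kernel code): the advising mapping clears only its own reference bit, and a generic reclaim pass then honors any remaining references from other mappings.

```c
#include <stdbool.h>

#define MAX_MAPPINGS 4

/* Toy stand-in for a page shared by several mappings. */
struct toy_page {
	bool referenced[MAX_MAPPINGS];	/* per-mapping "young"/accessed bit */
	int nr_mappings;
	bool reclaimed;
};

/* Step 1: the advising mapping clears only ITS reference bit. */
static void advise_pageout(struct toy_page *p, int advising_mapping)
{
	p->referenced[advising_mapping] = false;
}

/*
 * Step 2: a reclaim pass without ignore_references treats the page as if
 * it came off the LRU tail: a reference from ANY remaining mapping keeps
 * the page; only a fully unreferenced page is reclaimed.
 */
static bool try_reclaim(struct toy_page *p)
{
	for (int i = 0; i < p->nr_mappings; i++)
		if (p->referenced[i])
			return false;	/* another user is active: keep it */
	p->reclaimed = true;
	return true;
}
```

With this split there is no need for an up-front page_mapcount() check: a page still in use by another process simply survives the reclaim pass on its own.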
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-03 21:50 ` Johannes Weiner
2019-06-03 23:02 ` Minchan Kim
@ 2019-06-04 6:55 ` Michal Hocko
1 sibling, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2019-06-04 6:55 UTC (permalink / raw)
To: Johannes Weiner
Cc: Minchan Kim, Andrew Morton, linux-mm, LKML, linux-api,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Mon 03-06-19 17:50:59, Johannes Weiner wrote:
> On Mon, Jun 03, 2019 at 10:32:30PM +0200, Michal Hocko wrote:
> > On Mon 03-06-19 13:27:17, Johannes Weiner wrote:
> > > On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> > > > On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > > > > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > > > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > > > > When a process expects no accesses to a certain memory range, it can
> > > > > > > > > give the kernel a hint that the pages can be reclaimed under memory
> > > > > > > > > pressure but that the data should be preserved for future use. This can
> > > > > > > > > reduce working-set eviction and thus improve performance.
> > > > > > > > >
> > > > > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > > > > to be used in the near future. The hint helps the kernel decide which
> > > > > > > > > pages to evict early under memory pressure.
> > > > > > > > >
> > > > > > > > > Internally, it works by deactivating pages, moving them from the active
> > > > > > > > > list to the head of the inactive list if the page is private. The
> > > > > > > > > inactive list could be full of used-once pages, which are the first
> > > > > > > > > candidates for reclaim; that is also why MADV_FREE moves pages to the
> > > > > > > > > head of the inactive LRU list. Therefore, when memory pressure happens,
> > > > > > > > > they will be reclaimed earlier than other active pages, unless they are
> > > > > > > > > accessed again in the meantime.
> > > > > > > >
> > > > > > > > [I am intentionally not looking at the implementation because below
> > > > > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > > > > >
> > > > > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > > > > Private/shared? If shared, are there any restrictions?
> > > > > > >
> > > > > > > Both file and private pages can be deactivated from their active LRU
> > > > > > > to the corresponding inactive LRU if the page has a map_count of one.
> > > > > > > In other words,
> > > > > > >
> > > > > > > if (page_mapcount(page) <= 1)
> > > > > > > deactivate_page(page);
> > > > > >
> > > > > > Why do we restrict to pages that are single mapped?
> > > > >
> > > > > Because the page table of any process sharing the page would have the
> > > > > access bit set, so in the end we couldn't reclaim the page. The more
> > > > > processes share the page, the more likely reclaim is to fail.
> > > >
> > > > So what? In other words why should it be restricted solely based on the
> > > > map count. I can see a reason to restrict based on the access
> > > > permissions because we do not want to simplify all sorts of side channel
> > > > attacks but memory reclaim is capable of reclaiming shared pages and so
> > > > far I haven't heard any sound argument why madvise should skip those.
> > > > Again if there are any reasons, then document them in the changelog.
> > >
> > > I think it makes sense. It could be explained, but it also follows
> > > established madvise semantics, and I'm not sure it's necessarily
> > > Minchan's job to re-iterate those.
> > >
> > > Sharing isn't exactly transparent to userspace. The kernel does COW,
> > > ksm etc. When you madvise, you can really only speak for your own
> > > reference to that memory - "*I* am not using this."
> > >
> > > This is in line with other madvise calls: MADV_DONTNEED clears the
> > > local page table entries and drops the corresponding references, so
> > > shared pages won't get freed. MADV_FREE clears the pte dirty bit and
> > > also has explicit mapcount checks before clearing PG_dirty, so again
> > > shared pages don't get freed.
> >
> > Right, being consistent with other madvise syscalls is certainly a way
> > to go. And I am not pushing one way or another, I just want this to be
> > documented with a reasoning behind. Consistency is certainly an argument
> > to use.
> >
> > On the other hand these non-destructive madvise operations are quite
> > different and the shared policy might differ as a result as well. We are
> > aging objects rather than destroying them after all. Being able to age
> > a pagecache with sufficient privileges sounds like a useful usecase to
> > me. In other words you are able to cause the same effect indirectly
> > without the madvise operation so it kinda makes sense to allow it in a
> > more sophisticated way.
>
> Right, I don't think it's about permission - as you say, you can do
> this indirectly. Page reclaim is all about relative page order, so if
> we thwarted you from demoting some pages, you could instead promote
> other pages to cause a similar end result.
There is one notable difference. If we allow an easy way to demote a
shared resource then we have to think about potential side channel
attacks. Sure, you can generate memory pressure to cause the same
effect, but that is much harder and impractical in many cases.
> I think it's about intent. You're advising the kernel that *you're*
> not using this memory and would like to have it cleared out based on
> that knowledge. You could do the same by simply allocating the new
> pages and have the kernel sort it out. However, if the kernel sorts it
> out, it *will* look at other users of the page, and it might decide
> that other pages are actually colder when considering all users.
>
> When you ignore shared state, on the other hand, the pages you advise
> out could refault right after. And then, not only did you not free up
> the memory, but you also caused IO that may interfere with bringing in
> the new data for which you tried to create room in the first place.
That is a fair argument and I would tend to agree. On the other hand, we
are talking about potential usecases which tend to _know_ what they are
doing, and removing the possibility completely means they will not be
able to exploit the interface to the maximum. But as already mentioned,
starting simpler and more restricted is usually the better choice when
the semantic is not carved in stone, and documented as such, from the
very beginning.
> So I don't think it ever makes sense to override it.
>
> But it might be better to drop the explicit mapcount check and instead
> make the local pte young and call shrink_page_list() without the
> TTU_IGNORE_ACCESS, ignore_references flags - leave it to reclaim code
> to handle references and shared pages exactly the same way it would if
> those pages came fresh off the LRU tail, excluding only the reference
> from the mapping that we're madvising.
Yeah that makes sense to me.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-03 7:16 ` Michal Hocko
2019-06-03 15:43 ` Daniel Colascione
2019-06-03 17:27 ` Johannes Weiner
@ 2019-06-04 4:26 ` Minchan Kim
2019-06-04 7:02 ` Michal Hocko
2 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2019-06-04 4:26 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 23:34:07, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 04:03:32PM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 22:39:04, Minchan Kim wrote:
> > > > On Fri, May 31, 2019 at 10:47:52AM +0200, Michal Hocko wrote:
> > > > > On Fri 31-05-19 15:43:08, Minchan Kim wrote:
> > > > > > When a process expects no accesses to a certain memory range, it can
> > > > > > give the kernel a hint that the pages can be reclaimed under memory
> > > > > > pressure but that the data should be preserved for future use. This can
> > > > > > reduce working-set eviction and thus improve performance.
> > > > > >
> > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
> > > > > > MADV_COLD can be used by a process to mark a memory range as not expected
> > > > > > to be used in the near future. The hint helps the kernel decide which
> > > > > > pages to evict early under memory pressure.
> > > > > >
> > > > > > Internally, it works by deactivating pages, moving them from the active
> > > > > > list to the head of the inactive list if the page is private. The
> > > > > > inactive list could be full of used-once pages, which are the first
> > > > > > candidates for reclaim; that is also why MADV_FREE moves pages to the
> > > > > > head of the inactive LRU list. Therefore, when memory pressure happens,
> > > > > > they will be reclaimed earlier than other active pages, unless they are
> > > > > > accessed again in the meantime.
> > > > >
> > > > > [I am intentionally not looking at the implementation because below
> > > > > points should be clear from the changelog - sorry about nagging ;)]
> > > > >
> > > > > What kind of pages can be deactivated? Anonymous/File backed.
> > > > > Private/shared? If shared, are there any restrictions?
> > > >
> > > > Both file and private pages can be deactivated from their active LRU
> > > > to the corresponding inactive LRU if the page has a map_count of one.
> > > > In other words,
> > > >
> > > > if (page_mapcount(page) <= 1)
> > > > deactivate_page(page);
> > >
> > > Why do we restrict to pages that are single mapped?
> >
> > Because the page table of any process sharing the page would have the
> > access bit set, so in the end we couldn't reclaim the page. The more
> > processes share the page, the more likely reclaim is to fail.
>
> So what? In other words why should it be restricted solely based on the
> map count. I can see a reason to restrict based on the access
> permissions because we do not want to simplify all sorts of side channel
> attacks but memory reclaim is capable of reclaiming shared pages and so
> far I haven't heard any sound argument why madvise should skip those.
> Again if there are any reasons, then document them in the changelog.
Based on the review, I will go with removing that part so that the
decision is deferred to the normal VM reclaim policy.
>
> [...]
>
> > > Please document this, if this is really a desirable semantic because
> > > then you have the same set of problems as we've had with the early
> > > MADV_FREE implementation mentioned above.
> >
> > IIRC, the problem with MADV_FREE was that we couldn't discard freeable
> > pages because the VM never scans the anonymous LRU on a swapless system.
> > However, that's not our case here because we should reclaim those pages,
> > not discard them.
>
> Right. But there is still the page cache reclaim. Is it expected that
> an explicitly cold memory doesn't get reclaimed because we have a
> sufficient amount of page cache (a very common case) and we never age
> anonymous memory because of that?
If there are lots of used-once pages in the file LRU, I think there is no
need to reclaim anonymous pages because that incurs bigger overhead due
to swap IO. That has been true of the current VM policy for a long time.
The reclaim preference model based on hints is as follows, ordered by cost:
MADV_DONTNEED >> MADV_PAGEOUT > used-once pages > MADV_FREE >= MADV_COLD
It is desirable for the new hints to be placed in the reclaim preference
order such that a) they don't overlap functionally with existing hints and
b) we have a balanced ordering of disruptive and non-disruptive hints.
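As a sanity check, the ordering above can be encoded as a toy ranking function (the numeric ranks are invented purely to make the ranking concrete and testable; they are not kernel values):

```c
/* Higher rank = reclaimed sooner / more aggressively. */
enum hint {
	HINT_MADV_COLD,
	HINT_MADV_FREE,
	HINT_USED_ONCE,
	HINT_MADV_PAGEOUT,
	HINT_MADV_DONTNEED,
};

static int reclaim_rank(enum hint h)
{
	switch (h) {
	case HINT_MADV_DONTNEED:
		return 10;	/* destroyed immediately, cheapest to "reclaim" */
	case HINT_MADV_PAGEOUT:
		return 3;	/* paged out right away */
	case HINT_USED_ONCE:
		return 2;	/* inactive-list churn */
	case HINT_MADV_FREE:
		return 1;	/* dropped without IO under pressure */
	case HINT_MADV_COLD:
		return 1;	/* deactivated; roughly tied with MADV_FREE */
	}
	return 0;
}
```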
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 1/6] mm: introduce MADV_COLD
2019-06-04 4:26 ` Minchan Kim
@ 2019-06-04 7:02 ` Michal Hocko
0 siblings, 0 replies; 39+ messages in thread
From: Michal Hocko @ 2019-06-04 7:02 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Tue 04-06-19 13:26:51, Minchan Kim wrote:
> On Mon, Jun 03, 2019 at 09:16:07AM +0200, Michal Hocko wrote:
[...]
> > Right. But there is still the page cache reclaim. Is it expected that
> > an explicitly cold memory doesn't get reclaimed because we have a
> > sufficient amount of page cache (a very common case) and we never age
> > anonymous memory because of that?
>
> If there are lots of used-once pages in the file LRU, I think there is no
> need to reclaim anonymous pages because that incurs bigger overhead due
> to swap IO. That has been true of the current VM policy for a long time.
You are making an assumption which is not universally true. If I _know_
that there is a considerable amount of idle anonymous memory then I would
really prefer it to go to swap rather than put pressure on caching. The
inactive list is not guaranteed to contain only used-once pages, right?
Anyway, as already mentioned, we can start with a simpler implementation
for now and explicitly note that pagecache-biased reclaim is a known
potential problem. I am pretty sure somebody will come along sooner or
later and we can then address the problem with some good numbers to back
the additional complexity.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread
* [RFCv2 2/6] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
2019-05-31 6:43 [RFCv2 0/6] introduce memory hinting API for external process Minchan Kim
2019-05-31 6:43 ` [RFCv2 1/6] mm: introduce MADV_COLD Minchan Kim
@ 2019-05-31 6:43 ` Minchan Kim
2019-05-31 6:43 ` [RFCv2 3/6] mm: introduce MADV_PAGEOUT Minchan Kim
` (3 subsequent siblings)
5 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton, Minchan Kim
The local variable references in shrink_page_list defaults to
PAGEREF_RECLAIM_CLEAN. It exists to avoid reclaiming dirty pages when CMA
tries to migrate pages. Strictly speaking, we don't need it because CMA
already prevents writeback via .may_writepage = 0 in
reclaim_clean_pages_from_list. Moreover, it wrongly prevents anonymous
pages from being swapped out even when force_reclaim = true in
shrink_page_list, which matters for the upcoming patch. So this patch
changes the default value of references to PAGEREF_RECLAIM and renames
force_reclaim to ignore_references to make the intent clearer.
This is a preparatory work for the next patch.
* RFCv1
* use ignore_references as parameter name - hannes
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/vmscan.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 84dcb651d05c..0973a46a0472 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1102,7 +1102,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct scan_control *sc,
enum ttu_flags ttu_flags,
struct reclaim_stat *stat,
- bool force_reclaim)
+ bool ignore_references)
{
LIST_HEAD(ret_pages);
LIST_HEAD(free_pages);
@@ -1116,7 +1116,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
struct address_space *mapping;
struct page *page;
int may_enter_fs;
- enum page_references references = PAGEREF_RECLAIM_CLEAN;
+ enum page_references references = PAGEREF_RECLAIM;
bool dirty, writeback;
unsigned int nr_pages;
@@ -1247,7 +1247,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
}
}
- if (!force_reclaim)
+ if (!ignore_references)
references = page_check_references(page, sc);
switch (references) {
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related [flat|nested] 39+ messages in thread
* [RFCv2 3/6] mm: introduce MADV_PAGEOUT
2019-05-31 6:43 [RFCv2 0/6] introduce memory hinting API for external process Minchan Kim
2019-05-31 6:43 ` [RFCv2 1/6] mm: introduce MADV_COLD Minchan Kim
2019-05-31 6:43 ` [RFCv2 2/6] mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM Minchan Kim
@ 2019-05-31 6:43 ` Minchan Kim
2019-05-31 8:50 ` Michal Hocko
2019-05-31 16:59 ` Johannes Weiner
2019-05-31 6:43 ` [RFCv2 4/6] mm: factor out madvise's core functionality Minchan Kim
` (2 subsequent siblings)
5 siblings, 2 replies; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton, Minchan Kim
When a process expects no accesses to a certain memory range for a long
time, it can give the kernel a hint that the pages can be reclaimed
instantly while the data is preserved for future use. This can reduce
working-set eviction and thus improve performance.
This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
syscall. MADV_PAGEOUT can be used by a process to mark a memory range as
not expected to be used for a long time so that the kernel reclaims the
memory instantly. The hint helps the kernel decide which pages to evict
proactively.
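For reference, a minimal userspace sketch of how the hint might be used (assumptions: the MADV_PAGEOUT value below is the one later merged into mainline, not this RFC's tentative 6, and a kernel that rejects the advice with EINVAL is treated as a harmless no-op, since the call is only a hint):

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <unistd.h>
#include <errno.h>
#include <stddef.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21		/* mainline value; this RFC proposes 6 */
#endif

/* Hint that [addr, addr+len) will not be used for a long time. */
static int hint_pageout(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_PAGEOUT) == 0)
		return 0;
	/* Unknown advice => EINVAL on older kernels: ignore the hint. */
	return errno == EINVAL ? 0 : -1;
}

/* Map a page, write to it, hint it out, and check the data survived. */
static int pageout_demo(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	unsigned char *p = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	p[0] = 42;
	if (hint_pageout(p, pagesz) != 0) {
		munmap(p, pagesz);
		return -1;
	}
	/* Unlike MADV_DONTNEED, the contents must be preserved. */
	int ok = (p[0] == 42);
	munmap(p, pagesz);
	return ok ? 0 : -1;
}
```

The read of `p[0]` after the hint may fault the page back in from swap; either way the value is preserved, which is exactly the non-destructive semantic distinguishing MADV_PAGEOUT from MADV_DONTNEED.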
* RFCv1
* rename from MADV_COLD to MADV_PAGEOUT - hannes
* bail out if process is being killed - Hillf
* fix reclaim_pages bugs - Hillf
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/swap.h | 1 +
include/uapi/asm-generic/mman-common.h | 1 +
mm/madvise.c | 126 +++++++++++++++++++++++++
mm/vmscan.c | 77 +++++++++++++++
4 files changed, 205 insertions(+)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 0ce997edb8bb..063c0c1e112b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -365,6 +365,7 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern unsigned long vm_total_pages;
+extern unsigned long reclaim_pages(struct list_head *page_list);
#ifdef CONFIG_NUMA
extern int node_reclaim_mode;
extern int sysctl_min_unmapped_ratio;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 1190f4e7f7b9..92e347a89ddc 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -44,6 +44,7 @@
#define MADV_WILLNEED 3 /* will need these pages */
#define MADV_DONTNEED 4 /* don't need these pages */
#define MADV_COLD 5 /* deactivatie these pages */
+#define MADV_PAGEOUT 6 /* reclaim these pages */
/* common parameters: try to keep these consistent across architectures */
#define MADV_FREE 8 /* free pages only if memory pressure */
diff --git a/mm/madvise.c b/mm/madvise.c
index bff150eab6da..9d749a1420b4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -41,6 +41,7 @@ static int madvise_need_mmap_write(int behavior)
case MADV_WILLNEED:
case MADV_DONTNEED:
case MADV_COLD:
+ case MADV_PAGEOUT:
case MADV_FREE:
return 0;
default:
@@ -415,6 +416,128 @@ static long madvise_cold(struct vm_area_struct *vma,
return 0;
}
+static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ pte_t *orig_pte, *pte, ptent;
+ spinlock_t *ptl;
+ LIST_HEAD(page_list);
+ struct page *page;
+ int isolated = 0;
+ struct vm_area_struct *vma = walk->vma;
+ unsigned long next;
+
+ if (fatal_signal_pending(current))
+ return -EINTR;
+
+ next = pmd_addr_end(addr, end);
+ if (pmd_trans_huge(*pmd)) {
+ ptl = pmd_trans_huge_lock(pmd, vma);
+ if (!ptl)
+ return 0;
+
+ if (is_huge_zero_pmd(*pmd))
+ goto huge_unlock;
+
+ page = pmd_page(*pmd);
+ if (page_mapcount(page) > 1)
+ goto huge_unlock;
+
+ if (next - addr != HPAGE_PMD_SIZE) {
+ int err;
+
+ get_page(page);
+ spin_unlock(ptl);
+ lock_page(page);
+ err = split_huge_page(page);
+ unlock_page(page);
+ put_page(page);
+ if (!err)
+ goto regular_page;
+ return 0;
+ }
+
+ if (isolate_lru_page(page))
+ goto huge_unlock;
+
+ list_add(&page->lru, &page_list);
+huge_unlock:
+ spin_unlock(ptl);
+ reclaim_pages(&page_list);
+ return 0;
+ }
+
+ if (pmd_trans_unstable(pmd))
+ return 0;
+regular_page:
+ orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) {
+ ptent = *pte;
+ if (!pte_present(ptent))
+ continue;
+
+ page = vm_normal_page(vma, addr, ptent);
+ if (!page)
+ continue;
+
+ if (page_mapcount(page) > 1)
+ continue;
+
+ if (isolate_lru_page(page))
+ continue;
+
+ isolated++;
+ list_add(&page->lru, &page_list);
+ if (isolated >= SWAP_CLUSTER_MAX) {
+ pte_unmap_unlock(orig_pte, ptl);
+ reclaim_pages(&page_list);
+ isolated = 0;
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ orig_pte = pte;
+ }
+ }
+
+ pte_unmap_unlock(orig_pte, ptl);
+ reclaim_pages(&page_list);
+ cond_resched();
+
+ return 0;
+}
+
+static void madvise_pageout_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
+{
+ struct mm_walk warm_walk = {
+ .pmd_entry = madvise_pageout_pte_range,
+ .mm = vma->vm_mm,
+ };
+
+ tlb_start_vma(tlb, vma);
+ walk_page_range(addr, end, &warm_walk);
+ tlb_end_vma(tlb, vma);
+}
+
+
+static long madvise_pageout(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
+{
+ struct mm_struct *mm = vma->vm_mm;
+ struct mmu_gather tlb;
+
+ *prev = vma;
+ if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP))
+ return -EINVAL;
+
+ lru_add_drain();
+ tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
+ madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ tlb_finish_mmu(&tlb, start_addr, end_addr);
+
+ return 0;
+}
+
static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
@@ -805,6 +928,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
return madvise_cold(vma, prev, start, end);
+ case MADV_PAGEOUT:
+ return madvise_pageout(vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
return madvise_dontneed_free(vma, prev, start, end, behavior);
@@ -827,6 +952,7 @@ madvise_behavior_valid(int behavior)
case MADV_DONTNEED:
case MADV_FREE:
case MADV_COLD:
+ case MADV_PAGEOUT:
#ifdef CONFIG_KSM
case MADV_MERGEABLE:
case MADV_UNMERGEABLE:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0973a46a0472..280dd808fb91 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2126,6 +2126,83 @@ static void shrink_active_list(unsigned long nr_to_scan,
nr_deactivate, nr_rotated, sc->priority, file);
}
+unsigned long reclaim_pages(struct list_head *page_list)
+{
+ int nid = -1;
+ unsigned long nr_isolated[2] = {0, };
+ unsigned long nr_reclaimed = 0;
+ LIST_HEAD(node_page_list);
+ struct reclaim_stat dummy_stat;
+ struct scan_control sc = {
+ .gfp_mask = GFP_KERNEL,
+ .priority = DEF_PRIORITY,
+ .may_writepage = 1,
+ .may_unmap = 1,
+ .may_swap = 1,
+ };
+
+ while (!list_empty(page_list)) {
+ struct page *page;
+
+ page = lru_to_page(page_list);
+ if (nid == -1) {
+ nid = page_to_nid(page);
+ INIT_LIST_HEAD(&node_page_list);
+ nr_isolated[0] = nr_isolated[1] = 0;
+ }
+
+ if (nid == page_to_nid(page)) {
+ list_move(&page->lru, &node_page_list);
+ nr_isolated[!!page_is_file_cache(page)] +=
+ hpage_nr_pages(page);
+ continue;
+ }
+
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ nr_isolated[1]);
+ nr_reclaimed += shrink_page_list(&node_page_list,
+ NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+ &dummy_stat, true);
+ while (!list_empty(&node_page_list)) {
+ struct page *page = lru_to_page(&node_page_list);
+
+ list_del(&page->lru);
+ putback_lru_page(page);
+ }
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ -nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ -nr_isolated[1]);
+ nid = -1;
+ }
+
+ if (!list_empty(&node_page_list)) {
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ nr_isolated[1]);
+ nr_reclaimed += shrink_page_list(&node_page_list,
+ NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
+ &dummy_stat, true);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
+ -nr_isolated[0]);
+ mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
+ -nr_isolated[1]);
+
+ while (!list_empty(&node_page_list)) {
+ struct page *page = lru_to_page(&node_page_list);
+
+ list_del(&page->lru);
+ putback_lru_page(page);
+ }
+
+ }
+
+ return nr_reclaimed;
+}
+
/*
* The inactive anon list should be small enough that the VM never has
* to do too much work.
--
2.22.0.rc1.257.g3120a18244-goog
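The loop in reclaim_pages() above hands shrink_page_list() one batch per contiguous run of same-node pages: it accumulates pages while the node id matches, flushes the batch when it changes, and flushes the leftovers after the loop. A toy model of just that batching logic (made-up types, not the kernel's):

```c
#include <stddef.h>

struct toy_isolated_page {
	int nid;	/* NUMA node the page belongs to */
};

/*
 * Count how many shrink_page_list() calls the batching loop would make:
 * one per contiguous run of pages with the same node id, plus one for
 * the final leftover batch.
 */
static int count_node_batches(const struct toy_isolated_page *pages, size_t n)
{
	int batches = 0;
	int nid = -1;

	for (size_t i = 0; i < n; i++) {
		if (nid == -1) {
			nid = pages[i].nid;	/* start a new batch */
		} else if (pages[i].nid != nid) {
			batches++;		/* flush the current batch */
			nid = pages[i].nid;
		}
	}
	if (nid != -1)
		batches++;			/* flush what is left */
	return batches;
}
```

Note that the caller never sorts: pages of the same node separated by a page from another node land in separate batches, matching the flush-on-change structure of the kernel loop.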
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT
2019-05-31 6:43 ` [RFCv2 3/6] mm: introduce MADV_PAGEOUT Minchan Kim
@ 2019-05-31 8:50 ` Michal Hocko
2019-05-31 13:44 ` Minchan Kim
2019-05-31 16:59 ` Johannes Weiner
1 sibling, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-05-31 8:50 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri 31-05-19 15:43:10, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> for a long time, it could hint the kernel that the pages can be
> reclaimed instantly but the data should be preserved for future use.
> This could reduce workingset eviction, so it ends up increasing
> performance.
>
> This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
> syscall. MADV_PAGEOUT can be used by a process to mark a memory
> range as not expected to be used for a long time so that the kernel
> reclaims the memory instantly. The hint can help the kernel in deciding
> which pages to evict proactively.
Again, are there any restrictions on what kind of memory can be paged out?
Private/Shared, anonymous/file backed. Any restrictions on mapping type.
Etc. Please make sure all that is in the changelog.
What are the failure modes? E.g. what if the swap is full: does the call
fail or does it silently ignore the error?
Thanks!
--
Michal Hocko
SUSE Labs
* Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT
2019-05-31 8:50 ` Michal Hocko
@ 2019-05-31 13:44 ` Minchan Kim
0 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 13:44 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri, May 31, 2019 at 10:50:44AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:10, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > for a long time, it could hint the kernel that the pages can be
> > reclaimed instantly but the data should be preserved for future use.
> > This could reduce workingset eviction, so it ends up increasing
> > performance.
> >
> > This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
> > syscall. MADV_PAGEOUT can be used by a process to mark a memory
> > range as not expected to be used for a long time so that the kernel
> > reclaims the memory instantly. The hint can help the kernel in deciding
> > which pages to evict proactively.
>
> Again, are there any restrictions on what kind of memory can be paged out?
> Private/Shared, anonymous/file backed. Any restrictions on mapping type.
> Etc. Please make sure all that is in the changelog.
It's the same as with MADV_COLD. Yes, I will include all the details in the
description.
>
> What are the failure modes? E.g. what if the swap is full: does the call
> fail or does it silently ignore the error?
In such a case, it just ignores the swapout. At this moment it returns
-EINVAL only if the vma has one of (VM_LOCKED|VM_HUGETLB|VM_PFNMAP) set.
>
> Thanks!
> --
> Michal Hocko
> SUSE Labs
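For context, the failure mode Minchan describes can be probed from userspace. The sketch below is my own illustration, not part of the patch: it maps one anonymous page, faults it in, and requests MADV_PAGEOUT, returning 0 on success or the errno value otherwise (EINVAL on kernels without the hint, and, per this RFC, for VM_LOCKED/VM_HUGETLB/VM_PFNMAP mappings). The fallback value for MADV_PAGEOUT is an assumption for pre-UAPI headers.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21   /* assumed fallback for headers lacking the UAPI bit */
#endif

/* Ask the kernel to page out [addr, addr+len).
 * Returns 0 on success, or the errno value on failure. */
int try_pageout(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_PAGEOUT) != 0)
		return errno;
	return 0;
}

/* Map one anonymous page, touch it, and request pageout.
 * Returns 0 or an errno as above; -1 only if mmap itself failed. */
int demo_pageout(void)
{
	size_t len = 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	p[0] = 1;                       /* fault the page in */
	int rc = try_pageout(p, len);
	munmap(p, len);
	return rc;
}
```

On a kernel with MADV_PAGEOUT this returns 0; on an older kernel the same call fails with EINVAL, which is why callers should treat the hint as advisory.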
* Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT
2019-05-31 6:43 ` [RFCv2 3/6] mm: introduce MADV_PAGEOUT Minchan Kim
2019-05-31 8:50 ` Michal Hocko
@ 2019-05-31 16:59 ` Johannes Weiner
2019-05-31 23:14 ` Minchan Kim
1 sibling, 1 reply; 39+ messages in thread
From: Johannes Weiner @ 2019-05-31 16:59 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
Hi Minchan,
this looks pretty straightforward to me, only one kink:
On Fri, May 31, 2019 at 03:43:10PM +0900, Minchan Kim wrote:
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2126,6 +2126,83 @@ static void shrink_active_list(unsigned long nr_to_scan,
> nr_deactivate, nr_rotated, sc->priority, file);
> }
>
> +unsigned long reclaim_pages(struct list_head *page_list)
> +{
> + int nid = -1;
> + unsigned long nr_isolated[2] = {0, };
> + unsigned long nr_reclaimed = 0;
> + LIST_HEAD(node_page_list);
> + struct reclaim_stat dummy_stat;
> + struct scan_control sc = {
> + .gfp_mask = GFP_KERNEL,
> + .priority = DEF_PRIORITY,
> + .may_writepage = 1,
> + .may_unmap = 1,
> + .may_swap = 1,
> + };
> +
> + while (!list_empty(page_list)) {
> + struct page *page;
> +
> + page = lru_to_page(page_list);
> + if (nid == -1) {
> + nid = page_to_nid(page);
> + INIT_LIST_HEAD(&node_page_list);
> + nr_isolated[0] = nr_isolated[1] = 0;
> + }
> +
> + if (nid == page_to_nid(page)) {
> + list_move(&page->lru, &node_page_list);
> + nr_isolated[!!page_is_file_cache(page)] +=
> + hpage_nr_pages(page);
> + continue;
> + }
> +
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> + nr_isolated[0]);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> + nr_isolated[1]);
> + nr_reclaimed += shrink_page_list(&node_page_list,
> + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> + &dummy_stat, true);
> + while (!list_empty(&node_page_list)) {
> + struct page *page = lru_to_page(&node_page_list);
> +
> + list_del(&page->lru);
> + putback_lru_page(page);
> + }
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> + -nr_isolated[0]);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> + -nr_isolated[1]);
> + nid = -1;
> + }
> +
> + if (!list_empty(&node_page_list)) {
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> + nr_isolated[0]);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> + nr_isolated[1]);
> + nr_reclaimed += shrink_page_list(&node_page_list,
> + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> + &dummy_stat, true);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> + -nr_isolated[0]);
> + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> + -nr_isolated[1]);
> +
> + while (!list_empty(&node_page_list)) {
> + struct page *page = lru_to_page(&node_page_list);
> +
> + list_del(&page->lru);
> + putback_lru_page(page);
> + }
> +
> + }
The NR_ISOLATED accounting, nid parsing etc. is really awkward and
makes it hard to see what the function actually does.
Can you please make those ISOLATED counters part of the isolation API?
Your patch really shows this is an overdue cleanup.
These are fast local percpu counters; we don't need the sprawling
batching we do all over vmscan.c, migrate.c, khugepaged.c,
compaction.c etc. Isolation can increase the counter page by page, and
reclaim or putback can likewise decrease them one by one.
It looks like mlock is the only user of the isolation api that does
not participate in the NR_ISOLATED_* counters protocol, but I don't
see why it wouldn't, or why doing so would hurt.
There also seem to be quite a few callsites that use the atomic
versions of the counter API when they're clearly under the irqsafe
lru_lock. That would be fixed automatically by this work as well.
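The protocol Johannes proposes — isolation bumps the per-node NR_ISOLATED_* counter page by page, and putback or reclaim drops it again, so callers like reclaim_pages() never touch the counters directly — can be sketched as a toy userspace model. All names below are hypothetical stand-ins; the real change would live in the kernel's isolation API.

```c
/* Toy model of folding NR_ISOLATED_* accounting into the isolation
 * API: every isolate increments the per-node counter, every putback
 * decrements it, and the batching disappears from the callers. */
enum { ISOLATED_ANON, ISOLATED_FILE, ISOLATED_TYPES };

#define MAX_NODES 4
static long nr_isolated[MAX_NODES][ISOLATED_TYPES];

/* Stand-in for an isolate_lru_page()-style helper doing the accounting. */
void isolate_page_model(int nid, int is_file, int nr_pages)
{
	nr_isolated[nid][is_file ? ISOLATED_FILE : ISOLATED_ANON] += nr_pages;
}

/* Stand-in for putback_lru_page()/reclaim undoing it one page at a time. */
void putback_page_model(int nid, int is_file, int nr_pages)
{
	nr_isolated[nid][is_file ? ISOLATED_FILE : ISOLATED_ANON] -= nr_pages;
}

long isolated_count_model(int nid, int is_file)
{
	return nr_isolated[nid][is_file ? ISOLATED_FILE : ISOLATED_ANON];
}
```

With this shape, a caller's isolate/putback pairs leave the counters balanced automatically, which is the cleanup being requested.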
* Re: [RFCv2 3/6] mm: introduce MADV_PAGEOUT
2019-05-31 16:59 ` Johannes Weiner
@ 2019-05-31 23:14 ` Minchan Kim
0 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 23:14 UTC (permalink / raw)
To: Johannes Weiner
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
Hey Johannes,
On Fri, May 31, 2019 at 12:59:27PM -0400, Johannes Weiner wrote:
> Hi Minchan,
>
> this looks pretty straightforward to me, only one kink:
>
> On Fri, May 31, 2019 at 03:43:10PM +0900, Minchan Kim wrote:
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2126,6 +2126,83 @@ static void shrink_active_list(unsigned long nr_to_scan,
> > nr_deactivate, nr_rotated, sc->priority, file);
> > }
> >
> > +unsigned long reclaim_pages(struct list_head *page_list)
> > +{
> > + int nid = -1;
> > + unsigned long nr_isolated[2] = {0, };
> > + unsigned long nr_reclaimed = 0;
> > + LIST_HEAD(node_page_list);
> > + struct reclaim_stat dummy_stat;
> > + struct scan_control sc = {
> > + .gfp_mask = GFP_KERNEL,
> > + .priority = DEF_PRIORITY,
> > + .may_writepage = 1,
> > + .may_unmap = 1,
> > + .may_swap = 1,
> > + };
> > +
> > + while (!list_empty(page_list)) {
> > + struct page *page;
> > +
> > + page = lru_to_page(page_list);
> > + if (nid == -1) {
> > + nid = page_to_nid(page);
> > + INIT_LIST_HEAD(&node_page_list);
> > + nr_isolated[0] = nr_isolated[1] = 0;
> > + }
> > +
> > + if (nid == page_to_nid(page)) {
> > + list_move(&page->lru, &node_page_list);
> > + nr_isolated[!!page_is_file_cache(page)] +=
> > + hpage_nr_pages(page);
> > + continue;
> > + }
> > +
> > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > + nr_isolated[0]);
> > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > + nr_isolated[1]);
> > + nr_reclaimed += shrink_page_list(&node_page_list,
> > + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> > + &dummy_stat, true);
> > + while (!list_empty(&node_page_list)) {
> > + struct page *page = lru_to_page(&node_page_list);
> > +
> > + list_del(&page->lru);
> > + putback_lru_page(page);
> > + }
> > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > + -nr_isolated[0]);
> > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > + -nr_isolated[1]);
> > + nid = -1;
> > + }
> > +
> > + if (!list_empty(&node_page_list)) {
> > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > + nr_isolated[0]);
> > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > + nr_isolated[1]);
> > + nr_reclaimed += shrink_page_list(&node_page_list,
> > + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS,
> > + &dummy_stat, true);
> > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON,
> > + -nr_isolated[0]);
> > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE,
> > + -nr_isolated[1]);
> > +
> > + while (!list_empty(&node_page_list)) {
> > + struct page *page = lru_to_page(&node_page_list);
> > +
> > + list_del(&page->lru);
> > + putback_lru_page(page);
> > + }
> > +
> > + }
>
> The NR_ISOLATED accounting, nid parsing etc. is really awkward and
> makes it hard to see what the function actually does.
>
> Can you please make those ISOLATED counters part of the isolation API?
> Your patch really shows this is an overdue cleanup.
Yeah, that was very painful.
>
> These are fast local percpu counters; we don't need the sprawling
> batching we do all over vmscan.c, migrate.c, khugepaged.c,
> compaction.c etc. Isolation can increase the counter page by page, and
> reclaim or putback can likewise decrease them one by one.
>
> It looks like mlock is the only user of the isolation api that does
> not participate in the NR_ISOLATED_* counters protocol, but I don't
> see why it wouldn't, or why doing so would hurt.
>
> There also seem to be quite a few callsites that use the atomic
> versions of the counter API when they're clearly under the irqsafe
> lru_lock. That would be fixed automatically by this work as well.
I agree with all the points, so I will prepare a cleanup patch.
* [RFCv2 4/6] mm: factor out madvise's core functionality
2019-05-31 6:43 [RFCv2 0/6] introduce memory hinting API for external process Minchan Kim
` (2 preceding siblings ...)
2019-05-31 6:43 ` [RFCv2 3/6] mm: introduce MADV_PAGEOUT Minchan Kim
@ 2019-05-31 6:43 ` Minchan Kim
2019-05-31 7:04 ` Oleksandr Natalenko
2019-05-31 6:43 ` [RFCv2 5/6] mm: introduce external memory hinting API Minchan Kim
2019-05-31 6:43 ` [RFCv2 6/6] mm: extend process_madvise syscall to support vector arrary Minchan Kim
5 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton, Minchan Kim
This patch factors out madvise's core functionality so that an upcoming
patch can reuse it without duplication. It shouldn't change any behavior.
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
1 file changed, 101 insertions(+), 87 deletions(-)
diff --git a/mm/madvise.c b/mm/madvise.c
index 9d749a1420b4..466623ea8c36 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
struct page *page;
int isolated = 0;
struct vm_area_struct *vma = walk->vma;
+ struct task_struct *task = walk->private;
unsigned long next;
- if (fatal_signal_pending(current))
+ if (fatal_signal_pending(task))
return -EINTR;
next = pmd_addr_end(addr, end);
@@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
}
static void madvise_pageout_page_range(struct mmu_gather *tlb,
- struct vm_area_struct *vma,
- unsigned long addr, unsigned long end)
+ struct task_struct *task,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end)
{
struct mm_walk warm_walk = {
.pmd_entry = madvise_pageout_pte_range,
.mm = vma->vm_mm,
+ .private = task,
};
tlb_start_vma(tlb, vma);
@@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
}
-static long madvise_pageout(struct vm_area_struct *vma,
- struct vm_area_struct **prev,
- unsigned long start_addr, unsigned long end_addr)
+static long madvise_pageout(struct task_struct *task,
+ struct vm_area_struct *vma, struct vm_area_struct **prev,
+ unsigned long start_addr, unsigned long end_addr)
{
struct mm_struct *mm = vma->vm_mm;
struct mmu_gather tlb;
@@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
lru_add_drain();
tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
- madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
+ madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
tlb_finish_mmu(&tlb, start_addr, end_addr);
return 0;
@@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
return 0;
}
-static long madvise_dontneed_free(struct vm_area_struct *vma,
+static long madvise_dontneed_free(struct mm_struct *mm,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end,
int behavior)
@@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
if (!userfaultfd_remove(vma, start, end)) {
*prev = NULL; /* mmap_sem has been dropped, prev is stale */
- down_read(&current->mm->mmap_sem);
- vma = find_vma(current->mm, start);
+ down_read(&mm->mmap_sem);
+ vma = find_vma(mm, start);
if (!vma)
return -ENOMEM;
if (start < vma->vm_start) {
@@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
* Application wants to free up the pages and associated backing store.
* This is effectively punching a hole into the middle of a file.
*/
-static long madvise_remove(struct vm_area_struct *vma,
+static long madvise_remove(struct mm_struct *mm,
+ struct vm_area_struct *vma,
struct vm_area_struct **prev,
unsigned long start, unsigned long end)
{
@@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
get_file(f);
if (userfaultfd_remove(vma, start, end)) {
/* mmap_sem was not released by userfaultfd_remove() */
- up_read(&current->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
}
error = vfs_fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
offset, end - start);
fput(f);
- down_read(&current->mm->mmap_sem);
+ down_read(&mm->mmap_sem);
return error;
}
@@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
#endif
static long
-madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
+madvise_vma(struct task_struct *task, struct mm_struct *mm,
+ struct vm_area_struct *vma, struct vm_area_struct **prev,
unsigned long start, unsigned long end, int behavior)
{
switch (behavior) {
case MADV_REMOVE:
- return madvise_remove(vma, prev, start, end);
+ return madvise_remove(mm, vma, prev, start, end);
case MADV_WILLNEED:
return madvise_willneed(vma, prev, start, end);
case MADV_COLD:
return madvise_cold(vma, prev, start, end);
case MADV_PAGEOUT:
- return madvise_pageout(vma, prev, start, end);
+ return madvise_pageout(task, vma, prev, start, end);
case MADV_FREE:
case MADV_DONTNEED:
- return madvise_dontneed_free(vma, prev, start, end, behavior);
+ return madvise_dontneed_free(mm, vma, prev, start,
+ end, behavior);
default:
return madvise_behavior(vma, prev, start, end, behavior);
}
@@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
}
}
-/*
- * The madvise(2) system call.
- *
- * Applications can use madvise() to advise the kernel how it should
- * handle paging I/O in this VM area. The idea is to help the kernel
- * use appropriate read-ahead and caching techniques. The information
- * provided is advisory only, and can be safely disregarded by the
- * kernel without affecting the correct operation of the application.
- *
- * behavior values:
- * MADV_NORMAL - the default behavior is to read clusters. This
- * results in some read-ahead and read-behind.
- * MADV_RANDOM - the system should read the minimum amount of data
- * on any access, since it is unlikely that the appli-
- * cation will need more than what it asks for.
- * MADV_SEQUENTIAL - pages in the given range will probably be accessed
- * once, so they can be aggressively read ahead, and
- * can be freed soon after they are accessed.
- * MADV_WILLNEED - the application is notifying the system to read
- * some pages ahead.
- * MADV_DONTNEED - the application is finished with the given range,
- * so the kernel can free resources associated with it.
- * MADV_FREE - the application marks pages in the given range as lazy free,
- * where actual purges are postponed until memory pressure happens.
- * MADV_REMOVE - the application wants to free up the given range of
- * pages and associated backing store.
- * MADV_DONTFORK - omit this area from child's address space when forking:
- * typically, to avoid COWing pages pinned by get_user_pages().
- * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
- * MADV_WIPEONFORK - present the child process with zero-filled memory in this
- * range after a fork.
- * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
- * MADV_HWPOISON - trigger memory error handler as if the given memory range
- * were corrupted by unrecoverable hardware memory failure.
- * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
- * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
- * this area with pages of identical content from other such areas.
- * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
- * MADV_HUGEPAGE - the application wants to back the given range by transparent
- * huge pages in the future. Existing pages might be coalesced and
- * new pages might be allocated as THP.
- * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
- * transparent huge pages so the existing pages will not be
- * coalesced into THP and new pages will not be allocated as THP.
- * MADV_DONTDUMP - the application wants to prevent pages in the given range
- * from being included in its core dump.
- * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
- *
- * return values:
- * zero - success
- * -EINVAL - start + len < 0, start is not page-aligned,
- * "behavior" is not a valid value, or application
- * is attempting to release locked or shared pages,
- * or the specified address range includes file, Huge TLB,
- * MAP_SHARED or VMPFNMAP range.
- * -ENOMEM - addresses in the specified range are not currently
- * mapped, or are outside the AS of the process.
- * -EIO - an I/O error occurred while paging in data.
- * -EBADF - map exists, but area maps something that isn't a file.
- * -EAGAIN - a kernel resource was temporarily unavailable.
- */
-SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+static int madvise_core(struct task_struct *task, struct mm_struct *mm,
+ unsigned long start, size_t len_in, int behavior)
{
unsigned long end, tmp;
struct vm_area_struct *vma, *prev;
@@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
#ifdef CONFIG_MEMORY_FAILURE
if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
- return madvise_inject_error(behavior, start, start + len_in);
+ return madvise_inject_error(behavior,
+ start, start + len_in);
#endif
write = madvise_need_mmap_write(behavior);
if (write) {
- if (down_write_killable(&current->mm->mmap_sem))
+ if (down_write_killable(&mm->mmap_sem))
return -EINTR;
} else {
- down_read(&current->mm->mmap_sem);
+ down_read(&mm->mmap_sem);
}
/*
@@ -1084,7 +1032,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
* ranges, just ignore them, but return -ENOMEM at the end.
* - different from the way of handling in mlock etc.
*/
- vma = find_vma_prev(current->mm, start, &prev);
+ vma = find_vma_prev(mm, start, &prev);
if (vma && start > vma->vm_start)
prev = vma;
@@ -1109,7 +1057,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
tmp = end;
/* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
- error = madvise_vma(vma, &prev, start, tmp, behavior);
+ error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior);
if (error)
goto out;
start = tmp;
@@ -1121,14 +1069,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
if (prev)
vma = prev->vm_next;
else /* madvise_remove dropped mmap_sem */
- vma = find_vma(current->mm, start);
+ vma = find_vma(mm, start);
}
out:
blk_finish_plug(&plug);
if (write)
- up_write(&current->mm->mmap_sem);
+ up_write(&mm->mmap_sem);
else
- up_read(&current->mm->mmap_sem);
+ up_read(&mm->mmap_sem);
return error;
}
+
+/*
+ * The madvise(2) system call.
+ *
+ * Applications can use madvise() to advise the kernel how it should
+ * handle paging I/O in this VM area. The idea is to help the kernel
+ * use appropriate read-ahead and caching techniques. The information
+ * provided is advisory only, and can be safely disregarded by the
+ * kernel without affecting the correct operation of the application.
+ *
+ * behavior values:
+ * MADV_NORMAL - the default behavior is to read clusters. This
+ * results in some read-ahead and read-behind.
+ * MADV_RANDOM - the system should read the minimum amount of data
+ * on any access, since it is unlikely that the appli-
+ * cation will need more than what it asks for.
+ * MADV_SEQUENTIAL - pages in the given range will probably be accessed
+ * once, so they can be aggressively read ahead, and
+ * can be freed soon after they are accessed.
+ * MADV_WILLNEED - the application is notifying the system to read
+ * some pages ahead.
+ * MADV_DONTNEED - the application is finished with the given range,
+ * so the kernel can free resources associated with it.
+ * MADV_FREE - the application marks pages in the given range as lazy free,
+ * where actual purges are postponed until memory pressure happens.
+ * MADV_REMOVE - the application wants to free up the given range of
+ * pages and associated backing store.
+ * MADV_DONTFORK - omit this area from child's address space when forking:
+ * typically, to avoid COWing pages pinned by get_user_pages().
+ * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
+ * MADV_WIPEONFORK - present the child process with zero-filled memory in this
+ * range after a fork.
+ * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
+ * MADV_HWPOISON - trigger memory error handler as if the given memory range
+ * were corrupted by unrecoverable hardware memory failure.
+ * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
+ * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
+ * this area with pages of identical content from other such areas.
+ * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
+ * MADV_HUGEPAGE - the application wants to back the given range by transparent
+ * huge pages in the future. Existing pages might be coalesced and
+ * new pages might be allocated as THP.
+ * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
+ * transparent huge pages so the existing pages will not be
+ * coalesced into THP and new pages will not be allocated as THP.
+ * MADV_DONTDUMP - the application wants to prevent pages in the given range
+ * from being included in its core dump.
+ * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
+ *
+ * return values:
+ * zero - success
+ * -EINVAL - start + len < 0, start is not page-aligned,
+ * "behavior" is not a valid value, or application
+ * is attempting to release locked or shared pages,
+ * or the specified address range includes file, Huge TLB,
+ * MAP_SHARED or VMPFNMAP range.
+ * -ENOMEM - addresses in the specified range are not currently
+ * mapped, or are outside the AS of the process.
+ * -EIO - an I/O error occurred while paging in data.
+ * -EBADF - map exists, but area maps something that isn't a file.
+ * -EAGAIN - a kernel resource was temporarily unavailable.
+ */
+SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
+{
+ return madvise_core(current, current->mm, start, len_in, behavior);
+}
--
2.22.0.rc1.257.g3120a18244-goog
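The shape of the refactor above — the core logic takes an explicit task/mm pair while the syscall wrapper supplies current — can be sketched with a userspace model. The struct and function names below are hypothetical; the point is only the parameterization that lets a future remote-hinting caller reuse the same core.

```c
/* Hypothetical stand-in for task_struct/mm_struct context. */
struct ctx { int id; };

/* The core takes its context explicitly, like madvise_core(). */
int advise_core(const struct ctx *task, long start, long len, int behavior)
{
	(void)start; (void)len; (void)behavior;
	return task->id;        /* stand-in for operating on task's mm */
}

/* The local "syscall" path passes the caller's own context ... */
int advise_self(const struct ctx *self)
{
	return advise_core(self, 0, 4096, 0);
}

/* ... while a remote path can pass another process's context. */
int advise_remote(const struct ctx *target)
{
	return advise_core(target, 0, 4096, 0);
}
```

Nothing in the core references the implicit caller, which is why the patch claims no behavior change for the plain madvise(2) path.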
* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
2019-05-31 6:43 ` [RFCv2 4/6] mm: factor out madvise's core functionality Minchan Kim
@ 2019-05-31 7:04 ` Oleksandr Natalenko
2019-05-31 13:12 ` Minchan Kim
0 siblings, 1 reply; 39+ messages in thread
From: Oleksandr Natalenko @ 2019-05-31 7:04 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, hdanton
On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> This patch factors out madvise's core functionality so that an upcoming
> patch can reuse it without duplication. It shouldn't change any behavior.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
> mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> 1 file changed, 101 insertions(+), 87 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 9d749a1420b4..466623ea8c36 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> struct page *page;
> int isolated = 0;
> struct vm_area_struct *vma = walk->vma;
> + struct task_struct *task = walk->private;
> unsigned long next;
>
> - if (fatal_signal_pending(current))
> + if (fatal_signal_pending(task))
> return -EINTR;
>
> next = pmd_addr_end(addr, end);
> @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> }
>
> static void madvise_pageout_page_range(struct mmu_gather *tlb,
> - struct vm_area_struct *vma,
> - unsigned long addr, unsigned long end)
> + struct task_struct *task,
> + struct vm_area_struct *vma,
> + unsigned long addr, unsigned long end)
> {
> struct mm_walk warm_walk = {
> .pmd_entry = madvise_pageout_pte_range,
> .mm = vma->vm_mm,
> + .private = task,
> };
>
> tlb_start_vma(tlb, vma);
> @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> }
>
>
> -static long madvise_pageout(struct vm_area_struct *vma,
> - struct vm_area_struct **prev,
> - unsigned long start_addr, unsigned long end_addr)
> +static long madvise_pageout(struct task_struct *task,
> + struct vm_area_struct *vma, struct vm_area_struct **prev,
> + unsigned long start_addr, unsigned long end_addr)
> {
> struct mm_struct *mm = vma->vm_mm;
> struct mmu_gather tlb;
> @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
>
> lru_add_drain();
> tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> tlb_finish_mmu(&tlb, start_addr, end_addr);
>
> return 0;
> @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> return 0;
> }
>
> -static long madvise_dontneed_free(struct vm_area_struct *vma,
> +static long madvise_dontneed_free(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> struct vm_area_struct **prev,
> unsigned long start, unsigned long end,
> int behavior)
> @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> if (!userfaultfd_remove(vma, start, end)) {
> *prev = NULL; /* mmap_sem has been dropped, prev is stale */
>
> - down_read(&current->mm->mmap_sem);
> - vma = find_vma(current->mm, start);
> + down_read(&mm->mmap_sem);
> + vma = find_vma(mm, start);
> if (!vma)
> return -ENOMEM;
> if (start < vma->vm_start) {
> @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> * Application wants to free up the pages and associated backing store.
> * This is effectively punching a hole into the middle of a file.
> */
> -static long madvise_remove(struct vm_area_struct *vma,
> +static long madvise_remove(struct mm_struct *mm,
> + struct vm_area_struct *vma,
> struct vm_area_struct **prev,
> unsigned long start, unsigned long end)
> {
> @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> get_file(f);
> if (userfaultfd_remove(vma, start, end)) {
> /* mmap_sem was not released by userfaultfd_remove() */
> - up_read(&current->mm->mmap_sem);
> + up_read(&mm->mmap_sem);
> }
> error = vfs_fallocate(f,
> FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> offset, end - start);
> fput(f);
> - down_read(&current->mm->mmap_sem);
> + down_read(&mm->mmap_sem);
> return error;
> }
>
> @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> #endif
>
> static long
> -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> + struct vm_area_struct *vma, struct vm_area_struct **prev,
> unsigned long start, unsigned long end, int behavior)
> {
> switch (behavior) {
> case MADV_REMOVE:
> - return madvise_remove(vma, prev, start, end);
> + return madvise_remove(mm, vma, prev, start, end);
> case MADV_WILLNEED:
> return madvise_willneed(vma, prev, start, end);
> case MADV_COLD:
> return madvise_cold(vma, prev, start, end);
> case MADV_PAGEOUT:
> - return madvise_pageout(vma, prev, start, end);
> + return madvise_pageout(task, vma, prev, start, end);
> case MADV_FREE:
> case MADV_DONTNEED:
> - return madvise_dontneed_free(vma, prev, start, end, behavior);
> + return madvise_dontneed_free(mm, vma, prev, start,
> + end, behavior);
> default:
> return madvise_behavior(vma, prev, start, end, behavior);
> }
> @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> }
> }
>
> -/*
> - * The madvise(2) system call.
> - *
> - * Applications can use madvise() to advise the kernel how it should
> - * handle paging I/O in this VM area. The idea is to help the kernel
> - * use appropriate read-ahead and caching techniques. The information
> - * provided is advisory only, and can be safely disregarded by the
> - * kernel without affecting the correct operation of the application.
> - *
> - * behavior values:
> - * MADV_NORMAL - the default behavior is to read clusters. This
> - * results in some read-ahead and read-behind.
> - * MADV_RANDOM - the system should read the minimum amount of data
> - * on any access, since it is unlikely that the appli-
> - * cation will need more than what it asks for.
> - * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> - * once, so they can be aggressively read ahead, and
> - * can be freed soon after they are accessed.
> - * MADV_WILLNEED - the application is notifying the system to read
> - * some pages ahead.
> - * MADV_DONTNEED - the application is finished with the given range,
> - * so the kernel can free resources associated with it.
> - * MADV_FREE - the application marks pages in the given range as lazy free,
> - * where actual purges are postponed until memory pressure happens.
> - * MADV_REMOVE - the application wants to free up the given range of
> - * pages and associated backing store.
> - * MADV_DONTFORK - omit this area from child's address space when forking:
> - * typically, to avoid COWing pages pinned by get_user_pages().
> - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> - * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> - * range after a fork.
> - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> - * MADV_HWPOISON - trigger memory error handler as if the given memory range
> - * were corrupted by unrecoverable hardware memory failure.
> - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> - * this area with pages of identical content from other such areas.
> - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> - * MADV_HUGEPAGE - the application wants to back the given range by transparent
> - * huge pages in the future. Existing pages might be coalesced and
> - * new pages might be allocated as THP.
> - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> - * transparent huge pages so the existing pages will not be
> - * coalesced into THP and new pages will not be allocated as THP.
> - * MADV_DONTDUMP - the application wants to prevent pages in the given range
> - * from being included in its core dump.
> - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> - *
> - * return values:
> - * zero - success
> - * -EINVAL - start + len < 0, start is not page-aligned,
> - * "behavior" is not a valid value, or application
> - * is attempting to release locked or shared pages,
> - * or the specified address range includes file, Huge TLB,
> - * MAP_SHARED or VMPFNMAP range.
> - * -ENOMEM - addresses in the specified range are not currently
> - * mapped, or are outside the AS of the process.
> - * -EIO - an I/O error occurred while paging in data.
> - * -EBADF - map exists, but area maps something that isn't a file.
> - * -EAGAIN - a kernel resource was temporarily unavailable.
> - */
> -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> + unsigned long start, size_t len_in, int behavior)
Just a minor nitpick, but can we please have it named madvise_common,
not madvise_core? This would follow the usual naming scheme for
factored-out common functionality (as with mutexes, semaphores, etc.),
and within the kernel "core" usually means something completely
different.
> {
> unsigned long end, tmp;
> struct vm_area_struct *vma, *prev;
> @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>
> #ifdef CONFIG_MEMORY_FAILURE
> if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> - return madvise_inject_error(behavior, start, start + len_in);
> + return madvise_inject_error(behavior,
> + start, start + len_in);
Not sure what this change is about except changing the line length.
Note, madvise_inject_error() still operates on "current" through
get_user_pages_fast() and gup_pgd_range(), but that was not changed
here. I know you've filtered out this hint later, so technically this
is not an issue, but, maybe, this needs some attention too since we've
already spotted it?
> #endif
>
> write = madvise_need_mmap_write(behavior);
> if (write) {
> - if (down_write_killable(&current->mm->mmap_sem))
> + if (down_write_killable(&mm->mmap_sem))
> return -EINTR;
Do you still need that trick with mmget_still_valid() here?
Something like:
if (current->mm != mm && !mmget_still_valid(mm))
goto skip_mm;
and that skip_mm label would be before
if (write)
up_write(&mm->mmap_sem);
below.
(see 04f5866e41fb70690e28397487d8bd8eea7d712a for details on this)
> } else {
> - down_read(&current->mm->mmap_sem);
> + down_read(&mm->mmap_sem);
> }
>
> /*
> @@ -1084,7 +1032,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> * ranges, just ignore them, but return -ENOMEM at the end.
> * - different from the way of handling in mlock etc.
> */
> - vma = find_vma_prev(current->mm, start, &prev);
> + vma = find_vma_prev(mm, start, &prev);
> if (vma && start > vma->vm_start)
> prev = vma;
>
> @@ -1109,7 +1057,7 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> tmp = end;
>
> /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */
> - error = madvise_vma(vma, &prev, start, tmp, behavior);
> + error = madvise_vma(task, mm, vma, &prev, start, tmp, behavior);
> if (error)
> goto out;
> start = tmp;
> @@ -1121,14 +1069,80 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> if (prev)
> vma = prev->vm_next;
> else /* madvise_remove dropped mmap_sem */
> - vma = find_vma(current->mm, start);
> + vma = find_vma(mm, start);
> }
> out:
> blk_finish_plug(&plug);
skip_mm:
> if (write)
> - up_write(&current->mm->mmap_sem);
> + up_write(&mm->mmap_sem);
> else
> - up_read(&current->mm->mmap_sem);
> + up_read(&mm->mmap_sem);
>
> return error;
> }
> +
> +/*
> + * The madvise(2) system call.
> + *
> + * Applications can use madvise() to advise the kernel how it should
> + * handle paging I/O in this VM area. The idea is to help the kernel
> + * use appropriate read-ahead and caching techniques. The information
> + * provided is advisory only, and can be safely disregarded by the
> + * kernel without affecting the correct operation of the application.
> + *
> + * behavior values:
> + * MADV_NORMAL - the default behavior is to read clusters. This
> + * results in some read-ahead and read-behind.
> + * MADV_RANDOM - the system should read the minimum amount of data
> + * on any access, since it is unlikely that the appli-
> + * cation will need more than what it asks for.
> + * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> + * once, so they can be aggressively read ahead, and
> + * can be freed soon after they are accessed.
> + * MADV_WILLNEED - the application is notifying the system to read
> + * some pages ahead.
> + * MADV_DONTNEED - the application is finished with the given range,
> + * so the kernel can free resources associated with it.
> + * MADV_FREE - the application marks pages in the given range as lazy free,
> + * where actual purges are postponed until memory pressure happens.
> + * MADV_REMOVE - the application wants to free up the given range of
> + * pages and associated backing store.
> + * MADV_DONTFORK - omit this area from child's address space when forking:
> + * typically, to avoid COWing pages pinned by get_user_pages().
> + * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> + * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> + * range after a fork.
> + * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> + * MADV_HWPOISON - trigger memory error handler as if the given memory range
> + * were corrupted by unrecoverable hardware memory failure.
> + * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> + * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> + * this area with pages of identical content from other such areas.
> + * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> + * MADV_HUGEPAGE - the application wants to back the given range by transparent
> + * huge pages in the future. Existing pages might be coalesced and
> + * new pages might be allocated as THP.
> + * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> + * transparent huge pages so the existing pages will not be
> + * coalesced into THP and new pages will not be allocated as THP.
> + * MADV_DONTDUMP - the application wants to prevent pages in the given range
> + * from being included in its core dump.
> + * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *
> + * return values:
> + * zero - success
> + * -EINVAL - start + len < 0, start is not page-aligned,
> + * "behavior" is not a valid value, or application
> + * is attempting to release locked or shared pages,
> + * or the specified address range includes file, Huge TLB,
> + * MAP_SHARED or VMPFNMAP range.
> + * -ENOMEM - addresses in the specified range are not currently
> + * mapped, or are outside the AS of the process.
> + * -EIO - an I/O error occurred while paging in data.
> + * -EBADF - map exists, but area maps something that isn't a file.
> + * -EAGAIN - a kernel resource was temporarily unavailable.
> + */
> +SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> +{
> + return madvise_core(current, current->mm, start, len_in, behavior);
> +}
> --
> 2.22.0.rc1.257.g3120a18244-goog
>
--
Best regards,
Oleksandr Natalenko (post-factum)
Senior Software Maintenance Engineer
* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
2019-05-31 7:04 ` Oleksandr Natalenko
@ 2019-05-31 13:12 ` Minchan Kim
2019-05-31 14:35 ` Oleksandr Natalenko
0 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 13:12 UTC (permalink / raw)
To: Oleksandr Natalenko
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, hdanton
On Fri, May 31, 2019 at 09:04:20AM +0200, Oleksandr Natalenko wrote:
> On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> > This patch factors out madvise's core functionality so that an upcoming
> > patch can reuse it without duplication. It shouldn't change any behavior.
> >
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > ---
> > mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> > 1 file changed, 101 insertions(+), 87 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 9d749a1420b4..466623ea8c36 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > struct page *page;
> > int isolated = 0;
> > struct vm_area_struct *vma = walk->vma;
> > + struct task_struct *task = walk->private;
> > unsigned long next;
> >
> > - if (fatal_signal_pending(current))
> > + if (fatal_signal_pending(task))
> > return -EINTR;
> >
> > next = pmd_addr_end(addr, end);
> > @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > }
> >
> > static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > - struct vm_area_struct *vma,
> > - unsigned long addr, unsigned long end)
> > + struct task_struct *task,
> > + struct vm_area_struct *vma,
> > + unsigned long addr, unsigned long end)
> > {
> > struct mm_walk warm_walk = {
> > .pmd_entry = madvise_pageout_pte_range,
> > .mm = vma->vm_mm,
> > + .private = task,
> > };
> >
> > tlb_start_vma(tlb, vma);
> > @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > }
> >
> >
> > -static long madvise_pageout(struct vm_area_struct *vma,
> > - struct vm_area_struct **prev,
> > - unsigned long start_addr, unsigned long end_addr)
> > +static long madvise_pageout(struct task_struct *task,
> > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > + unsigned long start_addr, unsigned long end_addr)
> > {
> > struct mm_struct *mm = vma->vm_mm;
> > struct mmu_gather tlb;
> > @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
> >
> > lru_add_drain();
> > tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> > + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> > tlb_finish_mmu(&tlb, start_addr, end_addr);
> >
> > return 0;
> > @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > return 0;
> > }
> >
> > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > +static long madvise_dontneed_free(struct mm_struct *mm,
> > + struct vm_area_struct *vma,
> > struct vm_area_struct **prev,
> > unsigned long start, unsigned long end,
> > int behavior)
> > @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > if (!userfaultfd_remove(vma, start, end)) {
> > *prev = NULL; /* mmap_sem has been dropped, prev is stale */
> >
> > - down_read(&current->mm->mmap_sem);
> > - vma = find_vma(current->mm, start);
> > + down_read(&mm->mmap_sem);
> > + vma = find_vma(mm, start);
> > if (!vma)
> > return -ENOMEM;
> > if (start < vma->vm_start) {
> > @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > * Application wants to free up the pages and associated backing store.
> > * This is effectively punching a hole into the middle of a file.
> > */
> > -static long madvise_remove(struct vm_area_struct *vma,
> > +static long madvise_remove(struct mm_struct *mm,
> > + struct vm_area_struct *vma,
> > struct vm_area_struct **prev,
> > unsigned long start, unsigned long end)
> > {
> > @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> > get_file(f);
> > if (userfaultfd_remove(vma, start, end)) {
> > /* mmap_sem was not released by userfaultfd_remove() */
> > - up_read(&current->mm->mmap_sem);
> > + up_read(&mm->mmap_sem);
> > }
> > error = vfs_fallocate(f,
> > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > offset, end - start);
> > fput(f);
> > - down_read(&current->mm->mmap_sem);
> > + down_read(&mm->mmap_sem);
> > return error;
> > }
> >
> > @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> > #endif
> >
> > static long
> > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > unsigned long start, unsigned long end, int behavior)
> > {
> > switch (behavior) {
> > case MADV_REMOVE:
> > - return madvise_remove(vma, prev, start, end);
> > + return madvise_remove(mm, vma, prev, start, end);
> > case MADV_WILLNEED:
> > return madvise_willneed(vma, prev, start, end);
> > case MADV_COLD:
> > return madvise_cold(vma, prev, start, end);
> > case MADV_PAGEOUT:
> > - return madvise_pageout(vma, prev, start, end);
> > + return madvise_pageout(task, vma, prev, start, end);
> > case MADV_FREE:
> > case MADV_DONTNEED:
> > - return madvise_dontneed_free(vma, prev, start, end, behavior);
> > + return madvise_dontneed_free(mm, vma, prev, start,
> > + end, behavior);
> > default:
> > return madvise_behavior(vma, prev, start, end, behavior);
> > }
> > @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> > }
> > }
> >
> > -/*
> > - * The madvise(2) system call.
> > - *
> > - * Applications can use madvise() to advise the kernel how it should
> > - * handle paging I/O in this VM area. The idea is to help the kernel
> > - * use appropriate read-ahead and caching techniques. The information
> > - * provided is advisory only, and can be safely disregarded by the
> > - * kernel without affecting the correct operation of the application.
> > - *
> > - * behavior values:
> > - * MADV_NORMAL - the default behavior is to read clusters. This
> > - * results in some read-ahead and read-behind.
> > - * MADV_RANDOM - the system should read the minimum amount of data
> > - * on any access, since it is unlikely that the appli-
> > - * cation will need more than what it asks for.
> > - * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> > - * once, so they can be aggressively read ahead, and
> > - * can be freed soon after they are accessed.
> > - * MADV_WILLNEED - the application is notifying the system to read
> > - * some pages ahead.
> > - * MADV_DONTNEED - the application is finished with the given range,
> > - * so the kernel can free resources associated with it.
> > - * MADV_FREE - the application marks pages in the given range as lazy free,
> > - * where actual purges are postponed until memory pressure happens.
> > - * MADV_REMOVE - the application wants to free up the given range of
> > - * pages and associated backing store.
> > - * MADV_DONTFORK - omit this area from child's address space when forking:
> > - * typically, to avoid COWing pages pinned by get_user_pages().
> > - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> > - * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> > - * range after a fork.
> > - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> > - * MADV_HWPOISON - trigger memory error handler as if the given memory range
> > - * were corrupted by unrecoverable hardware memory failure.
> > - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> > - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> > - * this area with pages of identical content from other such areas.
> > - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> > - * MADV_HUGEPAGE - the application wants to back the given range by transparent
> > - * huge pages in the future. Existing pages might be coalesced and
> > - * new pages might be allocated as THP.
> > - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > - * transparent huge pages so the existing pages will not be
> > - * coalesced into THP and new pages will not be allocated as THP.
> > - * MADV_DONTDUMP - the application wants to prevent pages in the given range
> > - * from being included in its core dump.
> > - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > - *
> > - * return values:
> > - * zero - success
> > - * -EINVAL - start + len < 0, start is not page-aligned,
> > - * "behavior" is not a valid value, or application
> > - * is attempting to release locked or shared pages,
> > - * or the specified address range includes file, Huge TLB,
> > - * MAP_SHARED or VMPFNMAP range.
> > - * -ENOMEM - addresses in the specified range are not currently
> > - * mapped, or are outside the AS of the process.
> > - * -EIO - an I/O error occurred while paging in data.
> > - * -EBADF - map exists, but area maps something that isn't a file.
> > - * -EAGAIN - a kernel resource was temporarily unavailable.
> > - */
> > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> > + unsigned long start, size_t len_in, int behavior)
>
> Just a minor nitpick, but can we please have it named madvise_common,
> not madvise_core? This would follow the usual naming scheme for
> factored-out common functionality (as with mutexes, semaphores, etc.),
> and within the kernel "core" usually means something completely
> different.
Sure.
>
> > {
> > unsigned long end, tmp;
> > struct vm_area_struct *vma, *prev;
> > @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> >
> > #ifdef CONFIG_MEMORY_FAILURE
> > if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > - return madvise_inject_error(behavior, start, start + len_in);
> > + return madvise_inject_error(behavior,
> > + start, start + len_in);
>
> Not sure what this change is about except changing the line length.
> Note, madvise_inject_error() still operates on "current" through
> get_user_pages_fast() and gup_pgd_range(), but that was not changed
> here. I know you've filtered out this hint later, so technically this
> is not an issue, but, maybe, this needs some attention too since we've
> already spotted it?
It is a leftover from an earlier revision. I actually modified it to
handle a remote task but changed my mind and didn't fix it, because
process_madvise will not support it at this moment. I'm not sure it's a
good idea to change it now for something that *might be done in
future*, even though we have spotted it.
>
> > #endif
> >
> > write = madvise_need_mmap_write(behavior);
> > if (write) {
> > - if (down_write_killable(&current->mm->mmap_sem))
> > + if (down_write_killable(&mm->mmap_sem))
> > return -EINTR;
>
> Do you still need that trick with mmget_still_valid() here?
> Something like:
Since MADV_COLD|PAGEOUT doesn't change address space layout or
vma->vm_flags, technically, we don't need it if I understand
correctly. Right?
>
> if (current->mm != mm && !mmget_still_valid(mm))
> goto skip_mm;
>
> and that skip_mm label would be before
>
> if (write)
> up_write(&mm->mmap_sem);
>
> below.
>
> (see 04f5866e41fb70690e28397487d8bd8eea7d712a for details on this)
* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
2019-05-31 13:12 ` Minchan Kim
@ 2019-05-31 14:35 ` Oleksandr Natalenko
2019-05-31 23:29 ` Minchan Kim
0 siblings, 1 reply; 39+ messages in thread
From: Oleksandr Natalenko @ 2019-05-31 14:35 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, hdanton
On Fri, May 31, 2019 at 10:12:26PM +0900, Minchan Kim wrote:
> On Fri, May 31, 2019 at 09:04:20AM +0200, Oleksandr Natalenko wrote:
> > On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> > > This patch factors out madvise's core functionality so that an upcoming
> > > patch can reuse it without duplication. It shouldn't change any behavior.
> > >
> > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > ---
> > > mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> > > 1 file changed, 101 insertions(+), 87 deletions(-)
> > >
> > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > index 9d749a1420b4..466623ea8c36 100644
> > > --- a/mm/madvise.c
> > > +++ b/mm/madvise.c
> > > @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > struct page *page;
> > > int isolated = 0;
> > > struct vm_area_struct *vma = walk->vma;
> > > + struct task_struct *task = walk->private;
> > > unsigned long next;
> > >
> > > - if (fatal_signal_pending(current))
> > > + if (fatal_signal_pending(task))
> > > return -EINTR;
> > >
> > > next = pmd_addr_end(addr, end);
> > > @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > }
> > >
> > > static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > > - struct vm_area_struct *vma,
> > > - unsigned long addr, unsigned long end)
> > > + struct task_struct *task,
> > > + struct vm_area_struct *vma,
> > > + unsigned long addr, unsigned long end)
> > > {
> > > struct mm_walk warm_walk = {
> > > .pmd_entry = madvise_pageout_pte_range,
> > > .mm = vma->vm_mm,
> > > + .private = task,
> > > };
> > >
> > > tlb_start_vma(tlb, vma);
> > > @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > > }
> > >
> > >
> > > -static long madvise_pageout(struct vm_area_struct *vma,
> > > - struct vm_area_struct **prev,
> > > - unsigned long start_addr, unsigned long end_addr)
> > > +static long madvise_pageout(struct task_struct *task,
> > > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > + unsigned long start_addr, unsigned long end_addr)
> > > {
> > > struct mm_struct *mm = vma->vm_mm;
> > > struct mmu_gather tlb;
> > > @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
> > >
> > > lru_add_drain();
> > > tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > > - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> > > + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> > > tlb_finish_mmu(&tlb, start_addr, end_addr);
> > >
> > > return 0;
> > > @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > > return 0;
> > > }
> > >
> > > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > +static long madvise_dontneed_free(struct mm_struct *mm,
> > > + struct vm_area_struct *vma,
> > > struct vm_area_struct **prev,
> > > unsigned long start, unsigned long end,
> > > int behavior)
> > > @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > if (!userfaultfd_remove(vma, start, end)) {
> > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */
> > >
> > > - down_read(&current->mm->mmap_sem);
> > > - vma = find_vma(current->mm, start);
> > > + down_read(&mm->mmap_sem);
> > > + vma = find_vma(mm, start);
> > > if (!vma)
> > > return -ENOMEM;
> > > if (start < vma->vm_start) {
> > > @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > * Application wants to free up the pages and associated backing store.
> > > * This is effectively punching a hole into the middle of a file.
> > > */
> > > -static long madvise_remove(struct vm_area_struct *vma,
> > > +static long madvise_remove(struct mm_struct *mm,
> > > + struct vm_area_struct *vma,
> > > struct vm_area_struct **prev,
> > > unsigned long start, unsigned long end)
> > > {
> > > @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> > > get_file(f);
> > > if (userfaultfd_remove(vma, start, end)) {
> > > /* mmap_sem was not released by userfaultfd_remove() */
> > > - up_read(&current->mm->mmap_sem);
> > > + up_read(&mm->mmap_sem);
> > > }
> > > error = vfs_fallocate(f,
> > > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > > offset, end - start);
> > > fput(f);
> > > - down_read(&current->mm->mmap_sem);
> > > + down_read(&mm->mmap_sem);
> > > return error;
> > > }
> > >
> > > @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> > > #endif
> > >
> > > static long
> > > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> > > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > unsigned long start, unsigned long end, int behavior)
> > > {
> > > switch (behavior) {
> > > case MADV_REMOVE:
> > > - return madvise_remove(vma, prev, start, end);
> > > + return madvise_remove(mm, vma, prev, start, end);
> > > case MADV_WILLNEED:
> > > return madvise_willneed(vma, prev, start, end);
> > > case MADV_COLD:
> > > return madvise_cold(vma, prev, start, end);
> > > case MADV_PAGEOUT:
> > > - return madvise_pageout(vma, prev, start, end);
> > > + return madvise_pageout(task, vma, prev, start, end);
> > > case MADV_FREE:
> > > case MADV_DONTNEED:
> > > - return madvise_dontneed_free(vma, prev, start, end, behavior);
> > > + return madvise_dontneed_free(mm, vma, prev, start,
> > > + end, behavior);
> > > default:
> > > return madvise_behavior(vma, prev, start, end, behavior);
> > > }
> > > @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> > > }
> > > }
> > >
> > > -/*
> > > - * The madvise(2) system call.
> > > - *
> > > - * Applications can use madvise() to advise the kernel how it should
> > > - * handle paging I/O in this VM area. The idea is to help the kernel
> > > - * use appropriate read-ahead and caching techniques. The information
> > > - * provided is advisory only, and can be safely disregarded by the
> > > - * kernel without affecting the correct operation of the application.
> > > - *
> > > - * behavior values:
> > > - * MADV_NORMAL - the default behavior is to read clusters. This
> > > - * results in some read-ahead and read-behind.
> > > - * MADV_RANDOM - the system should read the minimum amount of data
> > > - * on any access, since it is unlikely that the appli-
> > > - * cation will need more than what it asks for.
> > > - * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> > > - * once, so they can be aggressively read ahead, and
> > > - * can be freed soon after they are accessed.
> > > - * MADV_WILLNEED - the application is notifying the system to read
> > > - * some pages ahead.
> > > - * MADV_DONTNEED - the application is finished with the given range,
> > > - * so the kernel can free resources associated with it.
> > > - * MADV_FREE - the application marks pages in the given range as lazy free,
> > > - * where actual purges are postponed until memory pressure happens.
> > > - * MADV_REMOVE - the application wants to free up the given range of
> > > - * pages and associated backing store.
> > > - * MADV_DONTFORK - omit this area from child's address space when forking:
> > > - * typically, to avoid COWing pages pinned by get_user_pages().
> > > - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> > > - * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> > > - * range after a fork.
> > > - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> > > - * MADV_HWPOISON - trigger memory error handler as if the given memory range
> > > - * were corrupted by unrecoverable hardware memory failure.
> > > - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> > > - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> > > - * this area with pages of identical content from other such areas.
> > > - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> > > - * MADV_HUGEPAGE - the application wants to back the given range by transparent
> > > - * huge pages in the future. Existing pages might be coalesced and
> > > - * new pages might be allocated as THP.
> > > - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > > - * transparent huge pages so the existing pages will not be
> > > - * coalesced into THP and new pages will not be allocated as THP.
> > > - * MADV_DONTDUMP - the application wants to prevent pages in the given range
> > > - * from being included in its core dump.
> > > - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > - *
> > > - * return values:
> > > - * zero - success
> > > - * -EINVAL - start + len < 0, start is not page-aligned,
> > > - * "behavior" is not a valid value, or application
> > > - * is attempting to release locked or shared pages,
> > > - * or the specified address range includes file, Huge TLB,
> > > - * MAP_SHARED or VMPFNMAP range.
> > > - * -ENOMEM - addresses in the specified range are not currently
> > > - * mapped, or are outside the AS of the process.
> > > - * -EIO - an I/O error occurred while paging in data.
> > > - * -EBADF - map exists, but area maps something that isn't a file.
> > > - * -EAGAIN - a kernel resource was temporarily unavailable.
> > > - */
> > > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > > +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> > > + unsigned long start, size_t len_in, int behavior)
> >
> > Just a minor nitpick, but can we please have it named madvise_common,
> > not madvise_core? This would follow the usual naming scheme for
> > factored-out common functionality (as with mutexes, semaphores, etc.),
> > and within the kernel "core" usually means something completely
> > different.
>
> Sure.
>
> >
> > > {
> > > unsigned long end, tmp;
> > > struct vm_area_struct *vma, *prev;
> > > @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > >
> > > #ifdef CONFIG_MEMORY_FAILURE
> > > if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > > - return madvise_inject_error(behavior, start, start + len_in);
> > > + return madvise_inject_error(behavior,
> > > + start, start + len_in);
> >
> > Not sure what this change is about except changing the line length.
> > Note, madvise_inject_error() still operates on "current" through
> > get_user_pages_fast() and gup_pgd_range(), but that was not changed
> > here. I Know you've filtered out this hint later, so technically this
> > is not an issue, but, maybe, this needs some attention too since we've
> > already spotted it?
>
> It is leftover I had done. I actually modified it to handle remote
> task but changed my mind not to fix it because process_madvise
> will not support it at this moment. I'm not sure it's a good idea
> to change it for *might-be-done-in-future* at this moment even though
> we have spotted.
I'd expect to have at least some comments in the code on why other hints
are disabled, so that if we already know some shortcomings, this
information would not be lost.
Of course, I don't care much about memory poisoning, but if it can be
addressed now, let's address it now.
>
> >
> > > #endif
> > >
> > > write = madvise_need_mmap_write(behavior);
> > > if (write) {
> > > - if (down_write_killable(&current->mm->mmap_sem))
> > > + if (down_write_killable(&mm->mmap_sem))
> > > return -EINTR;
> >
> > Do you still need that trick with mmget_still_valid() here?
> > Something like:
>
> Since MADV_COLD|PAGEOUT doesn't change address space layout or
> vma->vm_flags, technically, we don't need it if I understand
> correctly. Right?
I'd expect so, yes. But.
Since we want this interface to be universal and to be able to cover
various needs, and since my initial intention with working in this
direction involved KSM, I'd ask you to enable KSM hints too, and once
(and if) that happens, the work there is done under write lock, and
you'll need this trick to be applied.
Of course, I can do that myself later in a subsequent patch series once
(and, again, if) your series is merged, but, maybe, we can cover this
already especially given the fact that KSM hinting is a relatively easy
task in this pile. I did some preliminary tests with it, and so far no
dragons have started to roar.
>
> >
> > if (current->mm != mm && !mmget_still_valid(mm))
> > goto skip_mm;
> >
> > and that skip_mm label would be before
> >
> > if (write)
> > up_write(&mm->mmap_sem);
> >
> > below.
> >
> > (see 04f5866e41fb70690e28397487d8bd8eea7d712a for details on this)
--
Best regards,
Oleksandr Natalenko (post-factum)
Senior Software Maintenance Engineer
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
2019-05-31 14:35 ` Oleksandr Natalenko
@ 2019-05-31 23:29 ` Minchan Kim
2019-06-05 13:27 ` Oleksandr Natalenko
0 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 23:29 UTC (permalink / raw)
To: Oleksandr Natalenko
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, hdanton
On Fri, May 31, 2019 at 04:35:45PM +0200, Oleksandr Natalenko wrote:
> On Fri, May 31, 2019 at 10:12:26PM +0900, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 09:04:20AM +0200, Oleksandr Natalenko wrote:
> > > On Fri, May 31, 2019 at 03:43:11PM +0900, Minchan Kim wrote:
> > > > This patch factor out madvise's core functionality so that upcoming
> > > > patch can reuse it without duplication. It shouldn't change any behavior.
> > > >
> > > > Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > ---
> > > > mm/madvise.c | 188 +++++++++++++++++++++++++++------------------------
> > > > 1 file changed, 101 insertions(+), 87 deletions(-)
> > > >
> > > > diff --git a/mm/madvise.c b/mm/madvise.c
> > > > index 9d749a1420b4..466623ea8c36 100644
> > > > --- a/mm/madvise.c
> > > > +++ b/mm/madvise.c
> > > > @@ -425,9 +425,10 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > > struct page *page;
> > > > int isolated = 0;
> > > > struct vm_area_struct *vma = walk->vma;
> > > > + struct task_struct *task = walk->private;
> > > > unsigned long next;
> > > >
> > > > - if (fatal_signal_pending(current))
> > > > + if (fatal_signal_pending(task))
> > > > return -EINTR;
> > > >
> > > > next = pmd_addr_end(addr, end);
> > > > @@ -505,12 +506,14 @@ static int madvise_pageout_pte_range(pmd_t *pmd, unsigned long addr,
> > > > }
> > > >
> > > > static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > > > - struct vm_area_struct *vma,
> > > > - unsigned long addr, unsigned long end)
> > > > + struct task_struct *task,
> > > > + struct vm_area_struct *vma,
> > > > + unsigned long addr, unsigned long end)
> > > > {
> > > > struct mm_walk warm_walk = {
> > > > .pmd_entry = madvise_pageout_pte_range,
> > > > .mm = vma->vm_mm,
> > > > + .private = task,
> > > > };
> > > >
> > > > tlb_start_vma(tlb, vma);
> > > > @@ -519,9 +522,9 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
> > > > }
> > > >
> > > >
> > > > -static long madvise_pageout(struct vm_area_struct *vma,
> > > > - struct vm_area_struct **prev,
> > > > - unsigned long start_addr, unsigned long end_addr)
> > > > +static long madvise_pageout(struct task_struct *task,
> > > > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > > + unsigned long start_addr, unsigned long end_addr)
> > > > {
> > > > struct mm_struct *mm = vma->vm_mm;
> > > > struct mmu_gather tlb;
> > > > @@ -532,7 +535,7 @@ static long madvise_pageout(struct vm_area_struct *vma,
> > > >
> > > > lru_add_drain();
> > > > tlb_gather_mmu(&tlb, mm, start_addr, end_addr);
> > > > - madvise_pageout_page_range(&tlb, vma, start_addr, end_addr);
> > > > + madvise_pageout_page_range(&tlb, task, vma, start_addr, end_addr);
> > > > tlb_finish_mmu(&tlb, start_addr, end_addr);
> > > >
> > > > return 0;
> > > > @@ -744,7 +747,8 @@ static long madvise_dontneed_single_vma(struct vm_area_struct *vma,
> > > > return 0;
> > > > }
> > > >
> > > > -static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > > +static long madvise_dontneed_free(struct mm_struct *mm,
> > > > + struct vm_area_struct *vma,
> > > > struct vm_area_struct **prev,
> > > > unsigned long start, unsigned long end,
> > > > int behavior)
> > > > @@ -756,8 +760,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > > if (!userfaultfd_remove(vma, start, end)) {
> > > > *prev = NULL; /* mmap_sem has been dropped, prev is stale */
> > > >
> > > > - down_read(&current->mm->mmap_sem);
> > > > - vma = find_vma(current->mm, start);
> > > > + down_read(&mm->mmap_sem);
> > > > + vma = find_vma(mm, start);
> > > > if (!vma)
> > > > return -ENOMEM;
> > > > if (start < vma->vm_start) {
> > > > @@ -804,7 +808,8 @@ static long madvise_dontneed_free(struct vm_area_struct *vma,
> > > > * Application wants to free up the pages and associated backing store.
> > > > * This is effectively punching a hole into the middle of a file.
> > > > */
> > > > -static long madvise_remove(struct vm_area_struct *vma,
> > > > +static long madvise_remove(struct mm_struct *mm,
> > > > + struct vm_area_struct *vma,
> > > > struct vm_area_struct **prev,
> > > > unsigned long start, unsigned long end)
> > > > {
> > > > @@ -838,13 +843,13 @@ static long madvise_remove(struct vm_area_struct *vma,
> > > > get_file(f);
> > > > if (userfaultfd_remove(vma, start, end)) {
> > > > /* mmap_sem was not released by userfaultfd_remove() */
> > > > - up_read(&current->mm->mmap_sem);
> > > > + up_read(&mm->mmap_sem);
> > > > }
> > > > error = vfs_fallocate(f,
> > > > FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> > > > offset, end - start);
> > > > fput(f);
> > > > - down_read(&current->mm->mmap_sem);
> > > > + down_read(&mm->mmap_sem);
> > > > return error;
> > > > }
> > > >
> > > > @@ -918,21 +923,23 @@ static int madvise_inject_error(int behavior,
> > > > #endif
> > > >
> > > > static long
> > > > -madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > > +madvise_vma(struct task_struct *task, struct mm_struct *mm,
> > > > + struct vm_area_struct *vma, struct vm_area_struct **prev,
> > > > unsigned long start, unsigned long end, int behavior)
> > > > {
> > > > switch (behavior) {
> > > > case MADV_REMOVE:
> > > > - return madvise_remove(vma, prev, start, end);
> > > > + return madvise_remove(mm, vma, prev, start, end);
> > > > case MADV_WILLNEED:
> > > > return madvise_willneed(vma, prev, start, end);
> > > > case MADV_COLD:
> > > > return madvise_cold(vma, prev, start, end);
> > > > case MADV_PAGEOUT:
> > > > - return madvise_pageout(vma, prev, start, end);
> > > > + return madvise_pageout(task, vma, prev, start, end);
> > > > case MADV_FREE:
> > > > case MADV_DONTNEED:
> > > > - return madvise_dontneed_free(vma, prev, start, end, behavior);
> > > > + return madvise_dontneed_free(mm, vma, prev, start,
> > > > + end, behavior);
> > > > default:
> > > > return madvise_behavior(vma, prev, start, end, behavior);
> > > > }
> > > > @@ -976,68 +983,8 @@ madvise_behavior_valid(int behavior)
> > > > }
> > > > }
> > > >
> > > > -/*
> > > > - * The madvise(2) system call.
> > > > - *
> > > > - * Applications can use madvise() to advise the kernel how it should
> > > > - * handle paging I/O in this VM area. The idea is to help the kernel
> > > > - * use appropriate read-ahead and caching techniques. The information
> > > > - * provided is advisory only, and can be safely disregarded by the
> > > > - * kernel without affecting the correct operation of the application.
> > > > - *
> > > > - * behavior values:
> > > > - * MADV_NORMAL - the default behavior is to read clusters. This
> > > > - * results in some read-ahead and read-behind.
> > > > - * MADV_RANDOM - the system should read the minimum amount of data
> > > > - * on any access, since it is unlikely that the appli-
> > > > - * cation will need more than what it asks for.
> > > > - * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> > > > - * once, so they can be aggressively read ahead, and
> > > > - * can be freed soon after they are accessed.
> > > > - * MADV_WILLNEED - the application is notifying the system to read
> > > > - * some pages ahead.
> > > > - * MADV_DONTNEED - the application is finished with the given range,
> > > > - * so the kernel can free resources associated with it.
> > > > - * MADV_FREE - the application marks pages in the given range as lazy free,
> > > > - * where actual purges are postponed until memory pressure happens.
> > > > - * MADV_REMOVE - the application wants to free up the given range of
> > > > - * pages and associated backing store.
> > > > - * MADV_DONTFORK - omit this area from child's address space when forking:
> > > > - * typically, to avoid COWing pages pinned by get_user_pages().
> > > > - * MADV_DOFORK - cancel MADV_DONTFORK: no longer omit this area when forking.
> > > > - * MADV_WIPEONFORK - present the child process with zero-filled memory in this
> > > > - * range after a fork.
> > > > - * MADV_KEEPONFORK - undo the effect of MADV_WIPEONFORK
> > > > - * MADV_HWPOISON - trigger memory error handler as if the given memory range
> > > > - * were corrupted by unrecoverable hardware memory failure.
> > > > - * MADV_SOFT_OFFLINE - try to soft-offline the given range of memory.
> > > > - * MADV_MERGEABLE - the application recommends that KSM try to merge pages in
> > > > - * this area with pages of identical content from other such areas.
> > > > - * MADV_UNMERGEABLE- cancel MADV_MERGEABLE: no longer merge pages with others.
> > > > - * MADV_HUGEPAGE - the application wants to back the given range by transparent
> > > > - * huge pages in the future. Existing pages might be coalesced and
> > > > - * new pages might be allocated as THP.
> > > > - * MADV_NOHUGEPAGE - mark the given range as not worth being backed by
> > > > - * transparent huge pages so the existing pages will not be
> > > > - * coalesced into THP and new pages will not be allocated as THP.
> > > > - * MADV_DONTDUMP - the application wants to prevent pages in the given range
> > > > - * from being included in its core dump.
> > > > - * MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> > > > - *
> > > > - * return values:
> > > > - * zero - success
> > > > - * -EINVAL - start + len < 0, start is not page-aligned,
> > > > - * "behavior" is not a valid value, or application
> > > > - * is attempting to release locked or shared pages,
> > > > - * or the specified address range includes file, Huge TLB,
> > > > - * MAP_SHARED or VMPFNMAP range.
> > > > - * -ENOMEM - addresses in the specified range are not currently
> > > > - * mapped, or are outside the AS of the process.
> > > > - * -EIO - an I/O error occurred while paging in data.
> > > > - * -EBADF - map exists, but area maps something that isn't a file.
> > > > - * -EAGAIN - a kernel resource was temporarily unavailable.
> > > > - */
> > > > -SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > > > +static int madvise_core(struct task_struct *task, struct mm_struct *mm,
> > > > + unsigned long start, size_t len_in, int behavior)
> > >
> > > Just a minor nitpick, but can we please have it named madvise_common,
> > > not madvise_core? This would follow a usual naming scheme, when some
> > > common functionality is factored out (like, for mutexes, semaphores
> > > etc), and within the kernel "core" usually means something completely
> > > different.
> >
> > Sure.
> >
> > >
> > > > {
> > > > unsigned long end, tmp;
> > > > struct vm_area_struct *vma, *prev;
> > > > @@ -1068,15 +1015,16 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
> > > >
> > > > #ifdef CONFIG_MEMORY_FAILURE
> > > > if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > > > - return madvise_inject_error(behavior, start, start + len_in);
> > > > + return madvise_inject_error(behavior,
> > > > + start, start + len_in);
> > >
> > > Not sure what this change is about except changing the line length.
> > > Note, madvise_inject_error() still operates on "current" through
> > > get_user_pages_fast() and gup_pgd_range(), but that was not changed
> > > here. I Know you've filtered out this hint later, so technically this
> > > is not an issue, but, maybe, this needs some attention too since we've
> > > already spotted it?
> >
> > It is leftover I had done. I actually modified it to handle remote
> > task but changed my mind not to fix it because process_madvise
> > will not support it at this moment. I'm not sure it's a good idea
> > to change it for *might-be-done-in-future* at this moment even though
> > we have spotted.
>
> I'd expect to have at least some comments in code on why other hints
> are disabled, so if we already know some shortcomings, this information
> would not be lost.
Okay, I will add some comments but do not want to change the code until
someone wants to expose the poisoning to an external process.
>
> Of course, I don't care much about memory poisoning, but if it can be
> addressed now, let's address it now.
>
> >
> > >
> > > > #endif
> > > >
> > > > write = madvise_need_mmap_write(behavior);
> > > > if (write) {
> > > > - if (down_write_killable(&current->mm->mmap_sem))
> > > > + if (down_write_killable(&mm->mmap_sem))
> > > > return -EINTR;
> > >
> > > Do you still need that trick with mmget_still_valid() here?
> > > Something like:
> >
> > Since MADV_COLD|PAGEOUT doesn't change address space layout or
> > vma->vm_flags, technically, we don't need it if I understand
> > correctly. Right?
>
> I'd expect so, yes. But.
>
> Since we want this interface to be universal and to be able to cover
> various needs, and since my initial intention with working in this
> direction involved KSM, I'd ask you to enable KSM hints too, and once
> (and if) that happens, the work there is done under write lock, and
> you'll need this trick to be applied.
>
> Of course, I can do that myself later in a subsequent patch series once
> (and, again, if) your series is merged, but, maybe, we can cover this
> already especially given the fact that KSM hinting is a relatively easy
> task in this pile. I did some preliminary tests with it, and so far no
> dragons have started to roar.
Then, do you mind sending a patch based upon this series to expose
MADV_MERGEABLE to process_madvise? It would carry the right description
of why you want such a feature, which I couldn't provide since I don't
have enough material to write the motivation. The patch could also
include the logic to prevent the coredump race, which is the proper place
for it since we will ultimately need to hold the mmap_sem write-side lock.
I will pick it up and rebase on it then.
Thanks.
* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
2019-05-31 23:29 ` Minchan Kim
@ 2019-06-05 13:27 ` Oleksandr Natalenko
2019-06-10 10:12 ` Minchan Kim
0 siblings, 1 reply; 39+ messages in thread
From: Oleksandr Natalenko @ 2019-06-05 13:27 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, hdanton
Hi.
On Sat, Jun 01, 2019 at 08:29:59AM +0900, Minchan Kim wrote:
> > > > > /* snip a lot */
> > > > >
> > > > > #ifdef CONFIG_MEMORY_FAILURE
> > > > > if (behavior == MADV_HWPOISON || behavior == MADV_SOFT_OFFLINE)
> > > > > - return madvise_inject_error(behavior, start, start + len_in);
> > > > > + return madvise_inject_error(behavior,
> > > > > + start, start + len_in);
> > > >
> > > > Not sure what this change is about except changing the line length.
> > > > Note, madvise_inject_error() still operates on "current" through
> > > > get_user_pages_fast() and gup_pgd_range(), but that was not changed
> > > > here. I Know you've filtered out this hint later, so technically this
> > > > is not an issue, but, maybe, this needs some attention too since we've
> > > > already spotted it?
> > >
> > > It is leftover I had done. I actually modified it to handle remote
> > > task but changed my mind not to fix it because process_madvise
> > > will not support it at this moment. I'm not sure it's a good idea
> > > to change it for *might-be-done-in-future* at this moment even though
> > > we have spotted.
> >
> > I'd expect to have at least some comments in code on why other hints
> > are disabled, so if we already know some shortcomings, this information
> > would not be lost.
>
> Okay, I will add some comment but do not want to fix code piece until
> someone want to expose the poisoning to external process.
Fair enough.
> > > > > write = madvise_need_mmap_write(behavior);
> > > > > if (write) {
> > > > > - if (down_write_killable(&current->mm->mmap_sem))
> > > > > + if (down_write_killable(&mm->mmap_sem))
> > > > > return -EINTR;
> > > >
> > > > Do you still need that trick with mmget_still_valid() here?
> > > > Something like:
> > >
> > > Since MADV_COLD|PAGEOUT doesn't change address space layout or
> > > vma->vm_flags, technically, we don't need it if I understand
> > > correctly. Right?
> >
> > I'd expect so, yes. But.
> >
> > Since we want this interface to be universal and to be able to cover
> > various needs, and since my initial intention with working in this
> > direction involved KSM, I'd ask you to enable KSM hints too, and once
> > (and if) that happens, the work there is done under write lock, and
> > you'll need this trick to be applied.
> >
> > Of course, I can do that myself later in a subsequent patch series once
> > (and, again, if) your series is merged, but, maybe, we can cover this
> > already especially given the fact that KSM hinting is a relatively easy
> > task in this pile. I did some preliminary tests with it, and so far no
> > dragons have started to roar.
>
> Then, do you mind sending a patch based upon this series to expose
> MADV_MERGEABLE to process_madvise? It will have the right description
> why you want to have such feature which I couldn't provide since I don't
> have enough material to write the motivation. And the patch also could
> include the logic to prevent coredump race, which is more proper since
> finally we need to hold mmap_sem write-side lock, finally.
> I will pick it up and will rebase since then.
Sure, I can. Would you really like to have it based on this exact
revision, or should I wait until you deal with MADV_COLD & Co and
re-iterate this part again?
Thanks.
--
Best regards,
Oleksandr Natalenko (post-factum)
Senior Software Maintenance Engineer
* Re: [RFCv2 4/6] mm: factor out madvise's core functionality
2019-06-05 13:27 ` Oleksandr Natalenko
@ 2019-06-10 10:12 ` Minchan Kim
0 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2019-06-10 10:12 UTC (permalink / raw)
To: Oleksandr Natalenko
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, hdanton
Hi Oleksandr,
On Wed, Jun 05, 2019 at 03:27:28PM +0200, Oleksandr Natalenko wrote:
< snip >
> > > > > > write = madvise_need_mmap_write(behavior);
> > > > > > if (write) {
> > > > > > - if (down_write_killable(&current->mm->mmap_sem))
> > > > > > + if (down_write_killable(&mm->mmap_sem))
> > > > > > return -EINTR;
> > > > >
> > > > > Do you still need that trick with mmget_still_valid() here?
> > > > > Something like:
> > > >
> > > > Since MADV_COLD|PAGEOUT doesn't change address space layout or
> > > > vma->vm_flags, technically, we don't need it if I understand
> > > > correctly. Right?
> > >
> > > I'd expect so, yes. But.
> > >
> > > Since we want this interface to be universal and to be able to cover
> > > various needs, and since my initial intention with working in this
> > > direction involved KSM, I'd ask you to enable KSM hints too, and once
> > > (and if) that happens, the work there is done under write lock, and
> > > you'll need this trick to be applied.
> > >
> > > Of course, I can do that myself later in a subsequent patch series once
> > > (and, again, if) your series is merged, but, maybe, we can cover this
> > > already especially given the fact that KSM hinting is a relatively easy
> > > task in this pile. I did some preliminary tests with it, and so far no
> > > dragons have started to roar.
> >
> > Then, do you mind sending a patch based upon this series to expose
> > MADV_MERGEABLE to process_madvise? It will have the right description
> > why you want to have such feature which I couldn't provide since I don't
> > have enough material to write the motivation. And the patch also could
> > include the logic to prevent coredump race, which is more proper since
> > finally we need to hold mmap_sem write-side lock, finally.
> > I will pick it up and will rebase since then.
>
> Sure, I can. Would you really like to have it being based on this exact
> revision, or I should wait till you deal with MADV_COLD & Co and re-iterate
> this part again?
I'm okay with you sending your patch against this revision. I'm happy to
include it when I start a new thread for the process_madvise discussion
after resolving MADV_COLD|PAGEOUT.
Thanks.
* [RFCv2 5/6] mm: introduce external memory hinting API
2019-05-31 6:43 [RFCv2 0/6] introduce memory hinting API for external process Minchan Kim
` (3 preceding siblings ...)
2019-05-31 6:43 ` [RFCv2 4/6] mm: factor out madvise's core functionality Minchan Kim
@ 2019-05-31 6:43 ` Minchan Kim
2019-05-31 8:37 ` Michal Hocko
2019-05-31 17:35 ` Daniel Colascione
2019-05-31 6:43 ` [RFCv2 6/6] mm: extend process_madvise syscall to support vector arrary Minchan Kim
5 siblings, 2 replies; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton, Minchan Kim
There are usecases where a centralized userspace daemon wants to give
a memory hint like MADV_[COLD|PAGEOUT] to another process. Android's
ActivityManagerService is one of them.
It's similar in spirit to madvise(MADV_DONTNEED), but the information
required to make the reclaim decision is not known to the app. Instead,
it is known to the centralized userspace daemon (ActivityManagerService),
and that daemon must be able to initiate reclaim on its own without
any app involvement.
To solve the issue, this patch introduces a new syscall, process_madvise(2).
It can give a hint to the external process identified by a pidfd.
int process_madvise(int pidfd, void *addr, size_t length, int advise,
unsigned long cookie, unsigned long flag);
Since it could affect another process's address range, only a privileged
process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace
the target (e.g., by being the same UID), can use it successfully.
The syscall has a cookie argument to provide atomicity (i.e., to detect
the target process's address space changing after the monitor process has
parsed its address ranges, so the operation can fail if a race happens).
Although there is no interface to get a cookie at this moment, it is
useful to have it as an argument now to avoid introducing another new
syscall in the future. It could support *atomicity* for disruptive hints
(e.g., MADV_DONTNEED|FREE).
The flag argument is reserved for future use in case we need to extend the API.
I think supporting in process_madvise every hint that madvise supports
(or will support) is rather risky: I'm not sure all hints make sense
when issued from an external process, and the implementation of a hint
may rely on the caller being the current context (like MADV_INJECT_ERROR),
which would be error-prone if the caller is an external process. Another
example is userfaultfd, because userfaultfd_remove needs to release
mmap_sem during the operation, which would be an obstacle to implementing
atomicity later. So I have limited the hints to MADV_[COLD|PAGEOUT] at
this moment. If someone wants to expose another hint, we need to hear
about their workload/scenario and carefully review the design and
implementation of that hint. That is safer for maintenance - once we
ship a buggy syscall that turns out to be hard to fix, we can never
roll it back.
TODO: once we agree on the direction, I need to define the syscall for
every architecture.
* RFCv1
* not export pidfd_to_pid. Use pidfd_pid - Christian
* use mm struct instead of task_struct for madvise_core - Oleg
* add cookie variable as syscall argument to guarantee atomicity - dancol
* internal review
* use ptrace capability - surenb, dancol
I didn't solve the issue Oleg pointed out (the task we get from
pid_task could be a zombie leader, so the syscall will fail
in mm_access even though the process is alive with other threads)
because it's not a problem of this new syscall alone; it's a
general problem for other MM syscalls like process_vm_readv and
move_pages, so it needs more discussion.
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/pid.h | 4 ++
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/fork.c | 8 +++
kernel/signal.c | 7 ++-
kernel/sys_ni.c | 1 +
mm/madvise.c | 87 ++++++++++++++++++++++++++
9 files changed, 113 insertions(+), 3 deletions(-)
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 43e4429a5272..5f44a29b7882 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -439,3 +439,4 @@
432 i386 fsmount sys_fsmount __ia32_sys_fsmount
433 i386 fspick sys_fspick __ia32_sys_fspick
434 i386 pidfd_open sys_pidfd_open __ia32_sys_pidfd_open
+435 i386 process_madvise sys_process_madvise __ia32_sys_process_madvise
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 1bee0a77fdd3..35e91f3e9646 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -356,6 +356,7 @@
432 common fsmount __x64_sys_fsmount
433 common fspick __x64_sys_fspick
434 common pidfd_open __x64_sys_pidfd_open
+435 common process_madvise __x64_sys_process_madvise
#
# x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/pid.h b/include/linux/pid.h
index a261712ac3fe..a49ef789c034 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -73,6 +73,10 @@ extern struct pid init_struct_pid;
extern const struct file_operations pidfd_fops;
extern int pidfd_create(struct pid *pid);
+struct file;
+
+extern struct pid *pidfd_pid(const struct file *file);
+
static inline struct pid *get_pid(struct pid *pid)
{
if (pid)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 1a4dc53f40d9..6ba081c955f6 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -872,6 +872,9 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
+asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
+ size_t len, int behavior,
+ unsigned long cookie, unsigned long flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e5684a4512c0..082d1f3fe3a2 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -846,9 +846,11 @@ __SYSCALL(__NR_fsmount, sys_fsmount)
__SYSCALL(__NR_fspick, sys_fspick)
#define __NR_pidfd_open 434
__SYSCALL(__NR_pidfd_open, sys_pidfd_open)
+#define __NR_process_madvise 435
+__SYSCALL(__NR_process_madvise, sys_process_madvise)
#undef __NR_syscalls
-#define __NR_syscalls 435
+#define __NR_syscalls 436
/*
* 32 bit systems traditionally used different
diff --git a/kernel/fork.c b/kernel/fork.c
index 9f238cdd886e..b76aade51631 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1738,6 +1738,14 @@ const struct file_operations pidfd_fops = {
#endif
};
+struct pid *pidfd_pid(const struct file *file)
+{
+ if (file->f_op == &pidfd_fops)
+ return file->private_data;
+
+ return ERR_PTR(-EBADF);
+}
+
/**
* pidfd_create() - Create a new pid file descriptor.
*
diff --git a/kernel/signal.c b/kernel/signal.c
index b477e21ecafc..b376870d7565 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3702,8 +3702,11 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info)
static struct pid *pidfd_to_pid(const struct file *file)
{
- if (file->f_op == &pidfd_fops)
- return file->private_data;
+ struct pid *pid;
+
+ pid = pidfd_pid(file);
+ if (!IS_ERR(pid))
+ return pid;
return tgid_pidfd_to_pid(file);
}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 4d9ae5ea6caf..5277421795ab 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -278,6 +278,7 @@ COND_SYSCALL(mlockall);
COND_SYSCALL(munlockall);
COND_SYSCALL(mincore);
COND_SYSCALL(madvise);
+COND_SYSCALL(process_madvise);
COND_SYSCALL(remap_file_pages);
COND_SYSCALL(mbind);
COND_SYSCALL_COMPAT(mbind);
diff --git a/mm/madvise.c b/mm/madvise.c
index 466623ea8c36..fd205e928a1b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -15,6 +15,7 @@
#include <linux/hugetlb.h>
#include <linux/falloc.h>
#include <linux/sched.h>
+#include <linux/sched/mm.h>
#include <linux/ksm.h>
#include <linux/fs.h>
#include <linux/file.h>
@@ -983,6 +984,31 @@ madvise_behavior_valid(int behavior)
}
}
+static bool
+process_madvise_behavior_valid(int behavior)
+{
+ switch (behavior) {
+ case MADV_COLD:
+ case MADV_PAGEOUT:
+ return true;
+
+ default:
+ return false;
+ }
+}
+
+/*
+ * madvise_core - request behavior hint to address range of the target process
+ *
+ * @task: task_struct got behavior hint, not giving the hint
+ * @mm: mm_struct got behavior hint, not giving the hint
+ * @start: base address of the hinted range
+ * @len_in: length of the hinted range
+ * @behavior: requested hint
+ *
+ * @task could be a zombie leader if it has called sys_exit, so accessing its
+ * mm_struct via task->mm is prohibited. Please use @mm instead of task->mm.
+ */
static int madvise_core(struct task_struct *task, struct mm_struct *mm,
unsigned long start, size_t len_in, int behavior)
{
@@ -1146,3 +1172,64 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
{
return madvise_core(current, current->mm, start, len_in, behavior);
}
+
+SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
+ size_t, len_in, int, behavior, unsigned long, cookie,
+ unsigned long, flags)
+{
+ int ret;
+ struct fd f;
+ struct pid *pid;
+ struct task_struct *task;
+ struct mm_struct *mm;
+
+ if (flags != 0)
+ return -EINVAL;
+
+ /*
+ * We don't support a cookie to guarantee address space change
+ * atomicity yet.
+ */
+ if (cookie != 0)
+ return -EINVAL;
+
+ if (!process_madvise_behavior_valid(behavior))
+ return -EINVAL;
+
+ f = fdget(pidfd);
+ if (!f.file)
+ return -EBADF;
+
+ pid = pidfd_pid(f.file);
+ if (IS_ERR(pid)) {
+ ret = PTR_ERR(pid);
+ goto err;
+ }
+
+ rcu_read_lock();
+ task = pid_task(pid, PIDTYPE_PID);
+ if (!task) {
+ rcu_read_unlock();
+ ret = -ESRCH;
+ goto err;
+ }
+
+ get_task_struct(task);
+ rcu_read_unlock();
+
+ mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
+ if (!mm || IS_ERR(mm)) {
+ ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
+ if (ret == -EACCES)
+ ret = -EPERM;
+ goto release_task;
+ }
+
+ ret = madvise_core(task, mm, start, len_in, behavior);
+ mmput(mm);
+release_task:
+ put_task_struct(task);
+err:
+ fdput(f);
+ return ret;
+}
--
2.22.0.rc1.257.g3120a18244-goog
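For illustration, a userspace caller of the syscall proposed above could be wrapped as in the sketch below. Note this is an assumption-laden sketch, not part of the patch: the RFC does not allocate a syscall number, so `__NR_process_madvise_rfc` is a placeholder (on current kernels the call simply fails with ENOSYS), and the MADV_COLD value is the one proposed earlier in this series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder: this RFC does not allocate a real syscall number. */
#ifndef __NR_process_madvise_rfc
#define __NR_process_madvise_rfc 1000
#endif

/* MADV_COLD value assumed from earlier patches in this series. */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Thin wrapper over the proposed 6-argument syscall:
 *   process_madvise(pidfd, start, len_in, behavior, cookie, flags)
 * Returns 0 on success, -1 with errno set on failure.
 */
static long hint_process(int pidfd, void *start, size_t len, int behavior)
{
	return syscall(__NR_process_madvise_rfc, pidfd,
		       (unsigned long)start, len, behavior, 0UL, 0UL);
}
```

A monitor daemon would obtain the pidfd via pidfd_open(2) and pass ranges parsed from the target's /proc/&lt;pid&gt;/maps; without the patch applied, the wrapper returns -1 and sets errno.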
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [RFCv2 5/6] mm: introduce external memory hinting API
2019-05-31 6:43 ` [RFCv2 5/6] mm: introduce external memory hinting API Minchan Kim
@ 2019-05-31 8:37 ` Michal Hocko
2019-05-31 13:19 ` Minchan Kim
2019-05-31 17:35 ` Daniel Colascione
1 sibling, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-05-31 8:37 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> There is some usecase that centralized userspace daemon want to give
> a memory hint like MADV_[COLD|PAGEEOUT] to other process. Android's
> ActivityManagerService is one of them.
>
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
>
> To solve the issue, this patch introduces new syscall process_madvise(2).
> It could give a hint to the exeternal process of pidfd.
>
> int process_madvise(int pidfd, void *addr, size_t length, int advise,
> unsigned long cookie, unsigned long flag);
>
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrace the process could use it successfully.
>
> The syscall has a cookie argument to privode atomicity(i.e., detect
> target process's address space change since monitor process has parsed
> the address range of target process so the operaion could fail in case
> of happening race). Although there is no interface to get a cookie
> at this moment, it could be useful to consider it as argument to avoid
> introducing another new syscall in future. It could support *atomicity*
> for disruptive hint(e.g., MADV_DONTNEED|FREE).
> flag argument is reserved for future use if we need to extend the API.
Providing an API that is incomplete will not fly. Really. This really
begs for much more discussion, and it would be good to move on with the
core idea of proactive memory management from userspace use cases.
Could you split out the core change so that we can move on and leave
the external API for a later discussion? I believe this would lead to
a smoother integration.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 5/6] mm: introduce external memory hinting API
2019-05-31 8:37 ` Michal Hocko
@ 2019-05-31 13:19 ` Minchan Kim
2019-05-31 14:00 ` Michal Hocko
0 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 13:19 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri, May 31, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> > [...]
>
> Providing an API that is incomplete will not fly. Really. As this really
> begs for much more discussion and it would be good to move on with the
> core idea of the pro active memory memory management from userspace
> usecase. Could you split out the core change so that we can move on and
> leave the external for a later discussion. I believe this would lead to
> a smoother integration.
No problem, but I need to understand what you want a little bit more,
because I thought this patchset already proceeds step by step, so if we
reach agreement on part of it, like [1-5/6], that part could be merged first.
Could you say how you want to split the patchset for forward progress?
Thanks.
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 5/6] mm: introduce external memory hinting API
2019-05-31 13:19 ` Minchan Kim
@ 2019-05-31 14:00 ` Michal Hocko
2019-05-31 14:11 ` Minchan Kim
0 siblings, 1 reply; 39+ messages in thread
From: Michal Hocko @ 2019-05-31 14:00 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri 31-05-19 22:19:00, Minchan Kim wrote:
> On Fri, May 31, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> > On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> > > [...]
> >
> > Providing an API that is incomplete will not fly. Really. As this really
> > begs for much more discussion and it would be good to move on with the
> > core idea of the pro active memory memory management from userspace
> > usecase. Could you split out the core change so that we can move on and
> > leave the external for a later discussion. I believe this would lead to
> > a smoother integration.
>
> No problem but I need to understand what you want a little bit more because
> I thought this patchset is already step by step so if we reach the agreement
> of part of them like [1-5/6], it could be merged first.
>
> Could you say how you want to split the patchset for forward progress?
I would start with the new madvise modes, and once they are in shape to
be merged, then we can start on the remote madvise API. I believe that even
local process reclaim modes are interesting and useful. I haven't heard
anybody objecting to them without having a remote API so far.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 5/6] mm: introduce external memory hinting API
2019-05-31 14:00 ` Michal Hocko
@ 2019-05-31 14:11 ` Minchan Kim
0 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 14:11 UTC (permalink / raw)
To: Michal Hocko
Cc: Andrew Morton, linux-mm, LKML, linux-api, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
On Fri, May 31, 2019 at 04:00:50PM +0200, Michal Hocko wrote:
> On Fri 31-05-19 22:19:00, Minchan Kim wrote:
> > On Fri, May 31, 2019 at 10:37:57AM +0200, Michal Hocko wrote:
> > > On Fri 31-05-19 15:43:12, Minchan Kim wrote:
> > > > [...]
> > >
> > > Providing an API that is incomplete will not fly. Really. As this really
> > > begs for much more discussion and it would be good to move on with the
> > > core idea of the pro active memory memory management from userspace
> > > usecase. Could you split out the core change so that we can move on and
> > > leave the external for a later discussion. I believe this would lead to
> > > a smoother integration.
> >
> > No problem but I need to understand what you want a little bit more because
> > I thought this patchset is already step by step so if we reach the agreement
> > of part of them like [1-5/6], it could be merged first.
> >
> > Could you say how you want to split the patchset for forward progress?
>
> I would start with new madvise modes and once they are in a shape to be
> merged then we can start the remote madvise API. I believe that even
> local process reclaim modes are interesting and useful. I haven't heard
> anybody objecting to them without having a remote API so far.
Okay, let's focus on [1-3/6] at this moment.
> --
> Michal Hocko
> SUSE Labs
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 5/6] mm: introduce external memory hinting API
2019-05-31 6:43 ` [RFCv2 5/6] mm: introduce external memory hinting API Minchan Kim
2019-05-31 8:37 ` Michal Hocko
@ 2019-05-31 17:35 ` Daniel Colascione
1 sibling, 0 replies; 39+ messages in thread
From: Daniel Colascione @ 2019-05-31 17:35 UTC (permalink / raw)
To: Minchan Kim
Cc: Andrew Morton, linux-mm, LKML, Linux API, Michal Hocko,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Shakeel Butt, Sonny Rao, Brian Geffon, Jann Horn, Oleg Nesterov,
Christian Brauner, oleksandr, hdanton
On Thu, May 30, 2019 at 11:43 PM Minchan Kim <minchan@kernel.org> wrote:
>
> There is some usecase that centralized userspace daemon want to give
> a memory hint like MADV_[COLD|PAGEEOUT] to other process. Android's
> ActivityManagerService is one of them.
>
> It's similar in spirit to madvise(MADV_WONTNEED), but the information
> required to make the reclaim decision is not known to the app. Instead,
> it is known to the centralized userspace daemon(ActivityManagerService),
> and that daemon must be able to initiate reclaim on its own without
> any app involvement.
>
> To solve the issue, this patch introduces new syscall process_madvise(2).
> It could give a hint to the exeternal process of pidfd.
>
> int process_madvise(int pidfd, void *addr, size_t length, int advise,
> unsigned long cookie, unsigned long flag);
>
> Since it could affect other process's address range, only privileged
> process(CAP_SYS_PTRACE) or something else(e.g., being the same UID)
> gives it the right to ptrace the process could use it successfully.
>
> The syscall has a cookie argument to privode atomicity(i.e., detect
> target process's address space change since monitor process has parsed
> the address range of target process so the operaion could fail in case
> of happening race). Although there is no interface to get a cookie
> at this moment, it could be useful to consider it as argument to avoid
> introducing another new syscall in future. It could support *atomicity*
> for disruptive hint(e.g., MADV_DONTNEED|FREE).
> flag argument is reserved for future use if we need to extend the API.
How about a compromise? Let's allow all madvise hints if the process
is calling process_madvise *on itself* (which will be useful once we
wire up the atomicity cookie) and restrict the cross-process case to
the hints you've mentioned. This way, the restriction on madvise hints
isn't tied to the specific API, but to the relationship between hinter
and hintee.
^ permalink raw reply [flat|nested] 39+ messages in thread
* [RFCv2 6/6] mm: extend process_madvise syscall to support vector arrary
2019-05-31 6:43 [RFCv2 0/6] introduce memory hinting API for external process Minchan Kim
` (4 preceding siblings ...)
2019-05-31 6:43 ` [RFCv2 5/6] mm: introduce external memory hinting API Minchan Kim
@ 2019-05-31 6:43 ` Minchan Kim
2019-05-31 10:06 ` Yann Droneaud
5 siblings, 1 reply; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 6:43 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton, Minchan Kim
Currently, the process_madvise syscall works on only one address range, so
a user has to call the syscall several times to give hints to multiple
address ranges. That is inefficient, and it also makes it hard to guarantee
atomicity across the address range operations.
This patch extends the process_madvise syscall to support multiple hints,
address ranges and return values so the user can give all the hints at once.
struct pr_madvise_param {
int size; /* the size of this structure */
int cookie; /* reserved to support atomicity */
int nr_elem; /* count of below array fields */
int __user *hints; /* hints for each range */
/* to store result of each operation */
const struct iovec __user *results;
/* input address ranges */
const struct iovec __user *ranges;
};
int process_madvise(int pidfd, struct pr_madvise_param *u_param,
unsigned long flags);
About the cookie, Daniel Colascione suggested an idea[1] to support atomicity
as well as to speed up parsing of the target process's address ranges.
A process_getinfo(2) syscall could maintain a vma configuration sequence
number (e.g., the seq number would be increased whenever the target process
takes mmap_sem for exclusive access) and return that number together with the
address ranges in binary form. By calling this vector syscall with the
sequence number and address ranges obtained from process_getinfo, we could
detect that the target process's address space layout changed in the meantime
and fail the syscall if the user wants atomicity. It also speeds up address
range parsing because we no longer need to parse human-friendly strings
from /proc.
[1] https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
struct pr_madvise_param {
int size; /* the size of this structure */
int cookie; /* reserved to support atomicity */
int nr_elem; /* count of below array fields */
int *hints; /* hints for each range */
/* to store result of each operation */
const struct iovec *results;
/* input address ranges */
const struct iovec *ranges;
};
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>

#define NR_ADDR_RANGE 2
#define ALLOC_SIZE (2 * 1024 * 1024) /* arbitrary size for the example */

int main(int argc, char *argv[])
{
struct pr_madvise_param param;
int hints[NR_ADDR_RANGE];
long ret[NR_ADDR_RANGE];
struct iovec ret_vec[NR_ADDR_RANGE];
struct iovec range_vec[NR_ADDR_RANGE];
void *addr[NR_ADDR_RANGE];
pid_t pid;
addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
if (MAP_FAILED == addr[0]) {
printf("Fail to alloc\n");
return 1;
}
addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE,
MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
if (MAP_FAILED == addr[1]) {
printf("Fail to alloc\n");
return 1;
}
ret_vec[0].iov_base = &ret[0];
ret_vec[0].iov_len = sizeof(long);
ret_vec[1].iov_base = &ret[1];
ret_vec[1].iov_len = sizeof(long);
range_vec[0].iov_base = addr[0];
range_vec[0].iov_len = ALLOC_SIZE;
range_vec[1].iov_base = addr[1];
range_vec[1].iov_len = ALLOC_SIZE;
hints[0] = MADV_COLD;
hints[1] = MADV_PAGEOUT;
param.size = sizeof(struct pr_madvise_param);
param.cookie = 0;
param.nr_elem = NR_ADDR_RANGE;
param.hints = hints;
param.results = ret_vec;
param.ranges = range_vec;
pid = fork();
if (!pid) {
sleep(10);
} else {
int pidfd = syscall(__NR_pidfd_open, pid, 0);
if (pidfd < 0) {
printf("Fail to open process file descriptor\n");
return 1;
}
munmap(addr[0], ALLOC_SIZE);
munmap(addr[1], ALLOC_SIZE);
system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
if (syscall(__NR_process_madvise, pidfd, &param, 0))
perror("process_madvise fail\n");
system("cat /proc/vmstat | egrep 'pswpout|deactivate'");
}
return 0;
}
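The size field in pr_madvise_param is what lets the structure grow later: pr_madvise_copy_param() in the diff below accepts any caller-declared size up to the current structure size and zero-fills the rest, so newly added fields read as zero for old callers. The same extensible-struct pattern can be sketched in a userspace-testable form (the function name and FAKE_PAGE_SIZE are illustrative, not part of the patch):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FAKE_PAGE_SIZE 4096 /* stand-in for the kernel's PAGE_SIZE */

/*
 * Mirror of the pr_madvise_copy_param() checks: reject absurd sizes,
 * reject zero or larger-than-current sizes, then copy what the caller
 * declared and zero-fill the tail.
 */
static int copy_versioned(void *dst, size_t cur_size,
			  const void *src, uint32_t user_size)
{
	if (user_size > FAKE_PAGE_SIZE)
		return -E2BIG;
	if (!user_size || user_size > cur_size)
		return -EINVAL;
	memset(dst, 0, cur_size);
	memcpy(dst, src, user_size);
	return 0;
}
```

An old binary compiled against a smaller struct passes its smaller size and the newer fields stay zero, which the kernel treats as "feature not requested".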
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
include/linux/syscalls.h | 6 +-
include/uapi/asm-generic/mman-common.h | 11 +++
mm/madvise.c | 126 ++++++++++++++++++++++---
3 files changed, 126 insertions(+), 17 deletions(-)
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 6ba081c955f6..05627718a547 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -872,9 +872,9 @@ asmlinkage long sys_munlockall(void);
asmlinkage long sys_mincore(unsigned long start, size_t len,
unsigned char __user * vec);
asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior);
-asmlinkage long sys_process_madvise(int pidfd, unsigned long start,
- size_t len, int behavior,
- unsigned long cookie, unsigned long flags);
+asmlinkage long sys_process_madvise(int pidfd,
+ struct pr_madvise_param __user *u_params,
+ unsigned long flags);
asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size,
unsigned long prot, unsigned long pgoff,
unsigned long flags);
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 92e347a89ddc..220c2b5eb961 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -75,4 +75,15 @@
#define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
PKEY_DISABLE_WRITE)
+struct pr_madvise_param {
+ int size; /* the size of this structure */
+ int cookie; /* reserved to support atomicity */
+ int nr_elem; /* count of below array fields */
+ int __user *hints; /* hints for each range */
+ /* to store result of each operation */
+ const struct iovec __user *results;
+ /* input address ranges */
+ const struct iovec __user *ranges;
+};
+
#endif /* __ASM_GENERIC_MMAN_COMMON_H */
diff --git a/mm/madvise.c b/mm/madvise.c
index fd205e928a1b..94d782097afd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1107,6 +1107,56 @@ static int madvise_core(struct task_struct *task, struct mm_struct *mm,
return error;
}
+static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param,
+ struct pr_madvise_param *param)
+{
+ u32 size;
+ int ret;
+
+ memset(param, 0, sizeof(*param));
+
+ ret = get_user(size, &u_param->size);
+ if (ret)
+ return ret;
+
+ if (size > PAGE_SIZE)
+ return -E2BIG;
+
+ if (!size || size > sizeof(struct pr_madvise_param))
+ return -EINVAL;
+
+ ret = copy_from_user(param, u_param, size);
+ if (ret)
+ return -EFAULT;
+
+ return ret;
+}
+
+static int process_madvise_core(struct task_struct *tsk, struct mm_struct *mm,
+ int *behaviors,
+ struct iov_iter *iter,
+ const struct iovec *range_vec,
+ unsigned long riovcnt)
+{
+ int i;
+ long err;
+
+ for (i = 0; i < riovcnt && iov_iter_count(iter); i++) {
+ err = -EINVAL;
+ if (process_madvise_behavior_valid(behaviors[i]))
+ err = madvise_core(tsk, mm,
+ (unsigned long)range_vec[i].iov_base,
+ range_vec[i].iov_len, behaviors[i]);
+
+ if (copy_to_iter(&err, sizeof(long), iter) !=
+ sizeof(long)) {
+ return -EFAULT;
+ }
+ }
+
+ return 0;
+}
+
/*
* The madvise(2) system call.
*
@@ -1173,37 +1223,78 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
return madvise_core(current, current->mm, start, len_in, behavior);
}
-SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
- size_t, len_in, int, behavior, unsigned long, cookie,
- unsigned long, flags)
+
+SYSCALL_DEFINE3(process_madvise, int, pidfd,
+ struct pr_madvise_param __user *, u_params,
+ unsigned long, flags)
{
int ret;
struct fd f;
struct pid *pid;
struct task_struct *task;
struct mm_struct *mm;
+ struct pr_madvise_param params;
+ const struct iovec __user *result_vec, __user *range_vec;
+ int *behaviors;
+ struct iovec iovstack_result[UIO_FASTIOV];
+ struct iovec iovstack_r[UIO_FASTIOV];
+ struct iovec *iov_l = iovstack_result;
+ struct iovec *iov_r = iovstack_r;
+ struct iov_iter iter;
+ int nr_elem;
if (flags != 0)
return -EINVAL;
+ ret = pr_madvise_copy_param(u_params, &params);
+ if (ret)
+ return ret;
+
/*
- * We don't support cookie to gaurantee address space change
- * atomicity yet.
+ * We don't support a cookie to guarantee address space atomicity yet.
+ * Once we implement the cookie, process_madvise_core needs to hold mmap_sem
+ * during the entire operation to guarantee atomicity.
*/
- if (cookie != 0)
+ if (params.cookie != 0)
return -EINVAL;
- if (!process_madvise_behavior_valid(behavior))
- return -EINVAL;
+ range_vec = params.ranges;
+ result_vec = params.results;
+ nr_elem = params.nr_elem;
+
+ behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL);
+ if (!behaviors)
+ return -ENOMEM;
+
+ ret = copy_from_user(behaviors, params.hints, sizeof(int) * nr_elem);
+ if (ret) {
+ ret = -EFAULT;
+ goto free_behavior_vec;
+ }
+
+ ret = import_iovec(READ, result_vec, params.nr_elem, UIO_FASTIOV,
+ &iov_l, &iter);
+ if (ret < 0)
+ goto free_behavior_vec;
+
+ if (!iov_iter_count(&iter)) {
+ ret = -EINVAL;
+ goto free_iovecs;
+ }
+
+ ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem,
+ UIO_FASTIOV, iovstack_r, &iov_r);
+ if (ret <= 0)
+ goto free_iovecs;
f = fdget(pidfd);
- if (!f.file)
- return -EBADF;
+ if (!f.file) {
+ ret = -EBADF;
+ goto free_iovecs;
+ }
pid = pidfd_pid(f.file);
if (IS_ERR(pid)) {
ret = PTR_ERR(pid);
- goto err;
+ goto put_fd;
}
rcu_read_lock();
@@ -1211,7 +1302,7 @@ SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
if (!task) {
rcu_read_unlock();
ret = -ESRCH;
- goto err;
+ goto put_fd;
}
get_task_struct(task);
@@ -1225,11 +1316,18 @@ SYSCALL_DEFINE6(process_madvise, int, pidfd, unsigned long, start,
goto release_task;
}
- ret = madvise_core(task, mm, start, len_in, behavior);
+ ret = process_madvise_core(task, mm, behaviors, &iter, iov_r, nr_elem);
mmput(mm);
release_task:
put_task_struct(task);
-err:
+put_fd:
fdput(f);
+free_iovecs:
+ if (iov_r != iovstack_r)
+ kfree(iov_r);
+ kfree(iov_l);
+free_behavior_vec:
+ kfree(behaviors);
+
return ret;
}
--
2.22.0.rc1.257.g3120a18244-goog
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [RFCv2 6/6] mm: extend process_madvise syscall to support vector arrary
2019-05-31 6:43 ` [RFCv2 6/6] mm: extend process_madvise syscall to support vector arrary Minchan Kim
@ 2019-05-31 10:06 ` Yann Droneaud
2019-05-31 23:18 ` Minchan Kim
0 siblings, 1 reply; 39+ messages in thread
From: Yann Droneaud @ 2019-05-31 10:06 UTC (permalink / raw)
To: Minchan Kim, Andrew Morton
Cc: linux-mm, LKML, linux-api, Michal Hocko, Johannes Weiner,
Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
Hi,
Le vendredi 31 mai 2019 à 15:43 +0900, Minchan Kim a écrit :
>
> diff --git a/include/uapi/asm-generic/mman-common.h
> b/include/uapi/asm-generic/mman-common.h
> index 92e347a89ddc..220c2b5eb961 100644
> --- a/include/uapi/asm-generic/mman-common.h
> +++ b/include/uapi/asm-generic/mman-common.h
> @@ -75,4 +75,15 @@
> #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> PKEY_DISABLE_WRITE)
>
> +struct pr_madvise_param {
> + int size; /* the size of this structure */
> + int cookie; /* reserved to support atomicity */
> + int nr_elem; /* count of below arrary fields */
Those should be unsigned.
There's an implicit hole here in the ABI on targets with 64-bit aligned pointers.
> + int __user *hints; /* hints for each range */
> + /* to store result of each operation */
> + const struct iovec __user *results;
> + /* input address ranges */
> + const struct iovec __user *ranges;
Using pointer types in a uAPI structure requires a 'compat' version of
the syscall to be provided.
The same applies when using struct iovec.
> +};
> +
> #endif /* __ASM_GENERIC_MMAN_COMMON_H */
Regards.
--
Yann Droneaud
OPTEYA
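One way to address both review points above — no implicit padding hole and no compat translation for pointers — is to carry user pointers as 64-bit integers, as newer syscall ABIs do. A hypothetical reworked layout (an editor's sketch, not part of the posted patch) could look like this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical compat-clean layout: fixed-width unsigned fields, the
 * padding made explicit, and user pointers carried as u64 so 32-bit
 * and 64-bit callers share one structure with no compat shim.
 */
struct pr_madvise_param_v2 {
	uint32_t size;     /* sizeof(struct pr_madvise_param_v2) */
	uint32_t cookie;   /* reserved to support atomicity */
	uint32_t nr_elem;  /* count of the array fields below */
	uint32_t reserved; /* explicit padding, must be zero */
	uint64_t hints;    /* int[nr_elem], as a user address */
	uint64_t results;  /* struct iovec[nr_elem], user address */
	uint64_t ranges;   /* struct iovec[nr_elem], user address */
};

/* The layout is identical on 32-bit and 64-bit ABIs: */
_Static_assert(sizeof(struct pr_madvise_param_v2) == 40,
	       "no implicit padding");
_Static_assert(offsetof(struct pr_madvise_param_v2, hints) == 16,
	       "64-bit fields start at an 8-byte boundary");
```

The kernel would then use u64_to_user_ptr() to recover the pointers, and a single copy_from_user() handles native and compat callers alike.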
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [RFCv2 6/6] mm: extend process_madvise syscall to support vector arrary
2019-05-31 10:06 ` Yann Droneaud
@ 2019-05-31 23:18 ` Minchan Kim
0 siblings, 0 replies; 39+ messages in thread
From: Minchan Kim @ 2019-05-31 23:18 UTC (permalink / raw)
To: Yann Droneaud
Cc: Andrew Morton, linux-mm, LKML, linux-api, Michal Hocko,
Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan,
Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, jannh,
oleg, christian, oleksandr, hdanton
Hi Yann,
On Fri, May 31, 2019 at 12:06:52PM +0200, Yann Droneaud wrote:
> Hi,
>
> Le vendredi 31 mai 2019 à 15:43 +0900, Minchan Kim a écrit :
> >
> > diff --git a/include/uapi/asm-generic/mman-common.h
> > b/include/uapi/asm-generic/mman-common.h
> > index 92e347a89ddc..220c2b5eb961 100644
> > --- a/include/uapi/asm-generic/mman-common.h
> > +++ b/include/uapi/asm-generic/mman-common.h
> > @@ -75,4 +75,15 @@
> > #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\
> > PKEY_DISABLE_WRITE)
> >
> > +struct pr_madvise_param {
> > + int size; /* the size of this structure */
> > + int cookie; /* reserved to support atomicity */
> > + int nr_elem; /* count of below arrary fields */
>
> Those should be unsigned.
>
> There's an implicit hole here on ABI with 64bits aligned pointers
>
> > + int __user *hints; /* hints for each range */
> > + /* to store result of each operation */
> > + const struct iovec __user *results;
> > + /* input address ranges */
> > + const struct iovec __user *ranges;
>
> Using pointer type in uAPI structure require a 'compat' version of the
> syscall need to be provided.
>
> If using iovec too.
I will fix them when I submit the next revision.
Thanks for the review.
^ permalink raw reply [flat|nested] 39+ messages in thread