* Re: [RFC 1/7] mm: introduce MADV_COOL
  [not found] ` <20190520035254.57579-2-minchan@kernel.org>
@ 2019-05-20  8:16 ` Michal Hocko
  2019-05-20  8:19   ` Michal Hocko
  2019-05-20 22:54   ` Minchan Kim
  0 siblings, 2 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-20  8:16 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
      Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt,
      Sonny Rao, Brian Geffon, linux-api

[CC linux-api]

On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> it could hint the kernel that the pages can be reclaimed
> when memory pressure happens but the data should be preserved
> for future use. This could reduce workingset eviction so it
> ends up increasing performance.
>
> This patch introduces the new MADV_COOL hint to the madvise(2)
> syscall. MADV_COOL can be used by a process to mark a memory range
> as not expected to be used in the near future. The hint can help
> the kernel in deciding which pages to evict early during memory
> pressure.

I do not want to start a naming fight but MADV_COOL sounds a bit
misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
or MADV_DONTNEED_PRESERVE.

> Internally, it works via deactivating memory from the active list to
> the inactive list's head, so when memory pressure happens they will be
> reclaimed earlier than other active pages unless they are accessed
> in the meantime.

Could you elaborate on the decision to move to the head rather than
the tail? What should happen to inactive pages? Should we move them to
the tail? Your implementation seems to ignore those completely. Why?

What should happen for shared pages? In other words, do we want to allow
a less privileged process to control evicting of pages shared with a more
privileged one? E.g. think of all sorts of side channel attacks. Maybe
we want to do the same thing as for mincore where write access is
required.
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 1/7] mm: introduce MADV_COOL 2019-05-20 8:16 ` [RFC 1/7] mm: introduce MADV_COOL Michal Hocko @ 2019-05-20 8:19 ` Michal Hocko 2019-05-20 15:08 ` Suren Baghdasaryan 2019-05-20 22:55 ` Minchan Kim 2019-05-20 22:54 ` Minchan Kim 1 sibling, 2 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-20 8:19 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Mon 20-05-19 10:16:21, Michal Hocko wrote: > [CC linux-api] > > On Mon 20-05-19 12:52:48, Minchan Kim wrote: > > When a process expects no accesses to a certain memory range > > it could hint kernel that the pages can be reclaimed > > when memory pressure happens but data should be preserved > > for future use. This could reduce workingset eviction so it > > ends up increasing performance. > > > > This patch introduces the new MADV_COOL hint to madvise(2) > > syscall. MADV_COOL can be used by a process to mark a memory range > > as not expected to be used in the near future. The hint can help > > kernel in deciding which pages to evict early during memory > > pressure. > > I do not want to start naming fight but MADV_COOL sounds a bit > misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD > or MADV_DONTNEED_PRESERVE. OK, I can see that you have used MADV_COLD for a different mode. So this one is effectively a non destructive MADV_FREE alternative so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD in other patch would then be MADV_DONTNEED_PRESERVE. Right? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 1/7] mm: introduce MADV_COOL 2019-05-20 8:19 ` Michal Hocko @ 2019-05-20 15:08 ` Suren Baghdasaryan 2019-05-20 22:55 ` Minchan Kim 1 sibling, 0 replies; 68+ messages in thread From: Suren Baghdasaryan @ 2019-05-20 15:08 UTC (permalink / raw) To: Michal Hocko Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Mon, May 20, 2019 at 1:19 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Mon 20-05-19 10:16:21, Michal Hocko wrote: > > [CC linux-api] > > > > On Mon 20-05-19 12:52:48, Minchan Kim wrote: > > > When a process expects no accesses to a certain memory range > > > it could hint kernel that the pages can be reclaimed > > > when memory pressure happens but data should be preserved > > > for future use. This could reduce workingset eviction so it > > > ends up increasing performance. > > > > > > This patch introduces the new MADV_COOL hint to madvise(2) > > > syscall. MADV_COOL can be used by a process to mark a memory range > > > as not expected to be used in the near future. The hint can help > > > kernel in deciding which pages to evict early during memory > > > pressure. > > > > I do not want to start naming fight but MADV_COOL sounds a bit > > misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD > > or MADV_DONTNEED_PRESERVE. > > OK, I can see that you have used MADV_COLD for a different mode. > So this one is effectively a non destructive MADV_FREE alternative > so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD > in other patch would then be MADV_DONTNEED_PRESERVE. Right? > I agree that naming them this way would be more in-line with the existing API. Another good option IMO could be MADV_RECLAIM_NOW / MADV_RECLAIM_LAZY which might explain a bit better what they do but Michal's proposal is more consistent with the current API. 
> -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 1/7] mm: introduce MADV_COOL 2019-05-20 8:19 ` Michal Hocko 2019-05-20 15:08 ` Suren Baghdasaryan @ 2019-05-20 22:55 ` Minchan Kim 1 sibling, 0 replies; 68+ messages in thread From: Minchan Kim @ 2019-05-20 22:55 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Mon, May 20, 2019 at 10:19:43AM +0200, Michal Hocko wrote: > On Mon 20-05-19 10:16:21, Michal Hocko wrote: > > [CC linux-api] > > > > On Mon 20-05-19 12:52:48, Minchan Kim wrote: > > > When a process expects no accesses to a certain memory range > > > it could hint kernel that the pages can be reclaimed > > > when memory pressure happens but data should be preserved > > > for future use. This could reduce workingset eviction so it > > > ends up increasing performance. > > > > > > This patch introduces the new MADV_COOL hint to madvise(2) > > > syscall. MADV_COOL can be used by a process to mark a memory range > > > as not expected to be used in the near future. The hint can help > > > kernel in deciding which pages to evict early during memory > > > pressure. > > > > I do not want to start naming fight but MADV_COOL sounds a bit > > misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD > > or MADV_DONTNEED_PRESERVE. > > OK, I can see that you have used MADV_COLD for a different mode. > So this one is effectively a non destructive MADV_FREE alternative > so MADV_FREE_PRESERVE would sound like a good fit. Your MADV_COLD > in other patch would then be MADV_DONTNEED_PRESERVE. Right? Correct. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20  8:16 ` [RFC 1/7] mm: introduce MADV_COOL Michal Hocko
  2019-05-20  8:19   ` Michal Hocko
@ 2019-05-20 22:54   ` Minchan Kim
  2019-05-21  6:04     ` Michal Hocko
  1 sibling, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-20 22:54 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
      Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt,
      Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
> [CC linux-api]

Thanks, Michal. I forgot to add it.

>
> On Mon 20-05-19 12:52:48, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > it could hint kernel that the pages can be reclaimed
> > when memory pressure happens but data should be preserved
> > for future use. This could reduce workingset eviction so it
> > ends up increasing performance.
> >
> > This patch introduces the new MADV_COOL hint to madvise(2)
> > syscall. MADV_COOL can be used by a process to mark a memory range
> > as not expected to be used in the near future. The hint can help
> > kernel in deciding which pages to evict early during memory
> > pressure.
>
> I do not want to start naming fight but MADV_COOL sounds a bit
> misleading. Everybody thinks his pages are cool ;). Probably MADV_COLD
> or MADV_DONTNEED_PRESERVE.

Thanks for the suggestion. Since I got several suggestions, let's
discuss them all at once in the cover letter.

> > Internally, it works via deactivating memory from active list to
> > inactive's head so when the memory pressure happens, they will be
> > reclaimed earlier than other active pages unless there is no
> > access until the time.
>
> Could you elaborate about the decision to move to the head rather than
> tail? What should happen to inactive pages? Should we move them to the
> tail? Your implementation seems to ignore those completely. Why?
Normally, the inactive LRU could have used-once pages without any mapping
to the user's address space. Such pages would be better candidates to
reclaim when memory pressure happens. By deactivating only the active LRU
pages of the process to the head of the inactive LRU, we will keep them
in RAM longer than used-once pages, and they could have a better chance
to be activated once the process is resumed.

> > What should happen for shared pages? In other words do we want to allow
> > less privileged process to control evicting of shared pages with a more
> > privileged one? E.g. think of all sorts of side channel attacks. Maybe
> > we want to do the same thing as for mincore where write access is
> > required.

It doesn't work with shared pages (i.e., page_mapcount > 1). I will add
that to the description.

> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [RFC 1/7] mm: introduce MADV_COOL
  2019-05-20 22:54 ` Minchan Kim
@ 2019-05-21  6:04   ` Michal Hocko
  2019-05-21  9:11     ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-21  6:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
      Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt,
      Sonny Rao, Brian Geffon, linux-api

On Tue 21-05-19 07:54:19, Minchan Kim wrote:
> On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote:
[...]
> > > Internally, it works via deactivating memory from active list to
> > > inactive's head so when the memory pressure happens, they will be
> > > reclaimed earlier than other active pages unless there is no
> > > access until the time.
> >
> > Could you elaborate about the decision to move to the head rather than
> > tail? What should happen to inactive pages? Should we move them to the
> > tail? Your implementation seems to ignore those completely. Why?
>
> Normally, inactive LRU could have used-once pages without any mapping
> to user's address space. Such pages would be better candicate to
> reclaim when the memory pressure happens. With deactivating only
> active LRU pages of the process to the head of inactive LRU, we will
> keep them in RAM longer than used-once pages and could have more chance
> to be activated once the process is resumed.

You are making some assumptions here. You have an explicit call about
what is cold; now you are assuming something is even colder. Is this
assumption general enough to make people depend on it? Not that we
wouldn't be able to change the logic later, but that will always be
risky - especially in an area where somebody wants to do user-space
driven memory management.

> > What should happen for shared pages? In other words do we want to allow
> > less privileged process to control evicting of shared pages with a more
> > privileged one? E.g. think of all sorts of side channel attacks. Maybe
> > we want to do the same thing as for mincore where write access is
> > required.
>
> It doesn't work with shared pages(ie, page_mapcount > 1). I will add it
> in the description.

OK, this is good for a starter. It makes the implementation simpler
and we can add shared mappings coverage later.

Although I would argue that touching only writeable mappings should be
reasonably safe.

--
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [RFC 1/7] mm: introduce MADV_COOL 2019-05-21 6:04 ` Michal Hocko @ 2019-05-21 9:11 ` Minchan Kim 2019-05-21 10:05 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-21 9:11 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue, May 21, 2019 at 08:04:43AM +0200, Michal Hocko wrote: > On Tue 21-05-19 07:54:19, Minchan Kim wrote: > > On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote: > [...] > > > > Internally, it works via deactivating memory from active list to > > > > inactive's head so when the memory pressure happens, they will be > > > > reclaimed earlier than other active pages unless there is no > > > > access until the time. > > > > > > Could you elaborate about the decision to move to the head rather than > > > tail? What should happen to inactive pages? Should we move them to the > > > tail? Your implementation seems to ignore those completely. Why? > > > > Normally, inactive LRU could have used-once pages without any mapping > > to user's address space. Such pages would be better candicate to > > reclaim when the memory pressure happens. With deactivating only > > active LRU pages of the process to the head of inactive LRU, we will > > keep them in RAM longer than used-once pages and could have more chance > > to be activated once the process is resumed. > > You are making some assumptions here. You have an explicit call what is > cold now you are assuming something is even colder. Is this assumption a > general enough to make people depend on it? Not that we wouldn't be able > to change to logic later but that will always be risky - especially in > the area when somebody want to make a user space driven memory > management. Think about MADV_FREE. It moves those pages into inactive file LRU's head. 
It moves those pages into the inactive file LRU's head.
See get_scan_count, which forces scanning of the inactive file LRU if
it has enough size, based on the memory pressure. The reason is that
the inactive file LRU is generally likely to hold used-once pages.
Those pages have been the top-priority candidates to be reclaimed for
a long time.
The only places I am aware of that move pages to the tail of the
inactive LRU are where writeback completes for pages the VM has already
decided to reclaim by LRU aging, or where a destructive operation like
invalidation couldn't be completed. Those are really strong hints, no
doubt.

>
> > > What should happen for shared pages? In other words do we want to allow
> > > less privileged process to control evicting of shared pages with a more
> > > privileged one? E.g. think of all sorts of side channel attacks. Maybe
> > > we want to do the same thing as for mincore where write access is
> > > required.
> >
> > It doesn't work with shared pages(ie, page_mapcount > 1). I will add it
> > in the description.
>
> OK, this is good for the starter. It makes the implementation simpler
> and we can add shared mappings coverage later.
>
> Although I would argue that touching only writeable mappings should be
> reasonably safe.
>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
* Re: [RFC 1/7] mm: introduce MADV_COOL 2019-05-21 9:11 ` Minchan Kim @ 2019-05-21 10:05 ` Michal Hocko 0 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-21 10:05 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue 21-05-19 18:11:34, Minchan Kim wrote: > On Tue, May 21, 2019 at 08:04:43AM +0200, Michal Hocko wrote: > > On Tue 21-05-19 07:54:19, Minchan Kim wrote: > > > On Mon, May 20, 2019 at 10:16:21AM +0200, Michal Hocko wrote: > > [...] > > > > > Internally, it works via deactivating memory from active list to > > > > > inactive's head so when the memory pressure happens, they will be > > > > > reclaimed earlier than other active pages unless there is no > > > > > access until the time. > > > > > > > > Could you elaborate about the decision to move to the head rather than > > > > tail? What should happen to inactive pages? Should we move them to the > > > > tail? Your implementation seems to ignore those completely. Why? > > > > > > Normally, inactive LRU could have used-once pages without any mapping > > > to user's address space. Such pages would be better candicate to > > > reclaim when the memory pressure happens. With deactivating only > > > active LRU pages of the process to the head of inactive LRU, we will > > > keep them in RAM longer than used-once pages and could have more chance > > > to be activated once the process is resumed. > > > > You are making some assumptions here. You have an explicit call what is > > cold now you are assuming something is even colder. Is this assumption a > > general enough to make people depend on it? Not that we wouldn't be able > > to change to logic later but that will always be risky - especially in > > the area when somebody want to make a user space driven memory > > management. > > Think about MADV_FREE. 
It moves those pages into inactive file LRU's head. > See the get_scan_count which makes forceful scanning of inactive file LRU > if it has enough size based on the memory pressure. > The reason is it's likely to have used-once pages in inactive file LRU, > generally. Those pages has been top-priority candidate to be reclaimed > for a long time. OK, fair enough. Being consistent with MADV_FREE is reasonable. I just forgot we do rotate like this there. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20190520035254.57579-4-minchan@kernel.org>]
* Re: [RFC 3/7] mm: introduce MADV_COLD
  [not found] ` <20190520035254.57579-4-minchan@kernel.org>
@ 2019-05-20  8:27 ` Michal Hocko
  2019-05-20 23:00   ` Minchan Kim
  0 siblings, 1 reply; 68+ messages in thread
From: Michal Hocko @ 2019-05-20  8:27 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
      Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt,
      Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> When a process expects no accesses to a certain memory range
> for a long time, it could hint the kernel that the pages can be
> reclaimed instantly but the data should be preserved for future use.
> This could reduce workingset eviction so it ends up increasing
> performance.
>
> This patch introduces the new MADV_COLD hint to the madvise(2)
> syscall. MADV_COLD can be used by a process to mark a memory range
> as not expected to be used for a long time. The hint can help
> the kernel in deciding which pages to evict proactively.

As mentioned in the other email, this looks like a non-destructive
MADV_DONTNEED alternative.

> Internally, it works via reclaiming memory in the context of the
> process that calls the syscall. If a page is dirty but the backing
> storage is not a synchronous device, the written page will be rotated
> back to the LRU's tail once the write is done, so it will be reclaimed
> easily when memory pressure happens. If the backing storage is a
> synchronous device (e.g., zram), the page will be reclaimed instantly.

Why do we special-case async backing storage? Please always try to
explain _why_ the decision is made.

I haven't checked the implementation yet so I cannot comment on that.
> Signed-off-by: Minchan Kim <minchan@kernel.org> > --- > include/linux/swap.h | 1 + > include/uapi/asm-generic/mman-common.h | 1 + > mm/madvise.c | 123 +++++++++++++++++++++++++ > mm/vmscan.c | 74 +++++++++++++++ > 4 files changed, 199 insertions(+) > > diff --git a/include/linux/swap.h b/include/linux/swap.h > index 64795abea003..7f32a948fc6a 100644 > --- a/include/linux/swap.h > +++ b/include/linux/swap.h > @@ -365,6 +365,7 @@ extern int vm_swappiness; > extern int remove_mapping(struct address_space *mapping, struct page *page); > extern unsigned long vm_total_pages; > > +extern unsigned long reclaim_pages(struct list_head *page_list); > #ifdef CONFIG_NUMA > extern int node_reclaim_mode; > extern int sysctl_min_unmapped_ratio; > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index f7a4a5d4b642..b9b51eeb8e1a 100644 > --- a/include/uapi/asm-generic/mman-common.h > +++ b/include/uapi/asm-generic/mman-common.h > @@ -43,6 +43,7 @@ > #define MADV_WILLNEED 3 /* will need these pages */ > #define MADV_DONTNEED 4 /* don't need these pages */ > #define MADV_COOL 5 /* deactivatie these pages */ > +#define MADV_COLD 6 /* reclaim these pages */ > > /* common parameters: try to keep these consistent across architectures */ > #define MADV_FREE 8 /* free pages only if memory pressure */ > diff --git a/mm/madvise.c b/mm/madvise.c > index c05817fb570d..9a6698b56845 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior) > case MADV_WILLNEED: > case MADV_DONTNEED: > case MADV_COOL: > + case MADV_COLD: > case MADV_FREE: > return 0; > default: > @@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma, > return 0; > } > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr, > + unsigned long end, struct mm_walk *walk) > +{ > + pte_t *orig_pte, *pte, ptent; > + spinlock_t *ptl; > + LIST_HEAD(page_list); > + struct page *page; > + int isolated = 
0; > + struct vm_area_struct *vma = walk->vma; > + unsigned long next; > + > + next = pmd_addr_end(addr, end); > + if (pmd_trans_huge(*pmd)) { > + spinlock_t *ptl; > + > + ptl = pmd_trans_huge_lock(pmd, vma); > + if (!ptl) > + return 0; > + > + if (is_huge_zero_pmd(*pmd)) > + goto huge_unlock; > + > + page = pmd_page(*pmd); > + if (page_mapcount(page) > 1) > + goto huge_unlock; > + > + if (next - addr != HPAGE_PMD_SIZE) { > + int err; > + > + get_page(page); > + spin_unlock(ptl); > + lock_page(page); > + err = split_huge_page(page); > + unlock_page(page); > + put_page(page); > + if (!err) > + goto regular_page; > + return 0; > + } > + > + if (isolate_lru_page(page)) > + goto huge_unlock; > + > + list_add(&page->lru, &page_list); > +huge_unlock: > + spin_unlock(ptl); > + reclaim_pages(&page_list); > + return 0; > + } > + > + if (pmd_trans_unstable(pmd)) > + return 0; > +regular_page: > + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) { > + ptent = *pte; > + if (!pte_present(ptent)) > + continue; > + > + page = vm_normal_page(vma, addr, ptent); > + if (!page) > + continue; > + > + if (page_mapcount(page) > 1) > + continue; > + > + if (isolate_lru_page(page)) > + continue; > + > + isolated++; > + list_add(&page->lru, &page_list); > + if (isolated >= SWAP_CLUSTER_MAX) { > + pte_unmap_unlock(orig_pte, ptl); > + reclaim_pages(&page_list); > + isolated = 0; > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > + orig_pte = pte; > + } > + } > + > + pte_unmap_unlock(orig_pte, ptl); > + reclaim_pages(&page_list); > + cond_resched(); > + > + return 0; > +} > + > +static void madvise_cold_page_range(struct mmu_gather *tlb, > + struct vm_area_struct *vma, > + unsigned long addr, unsigned long end) > +{ > + struct mm_walk warm_walk = { > + .pmd_entry = madvise_cold_pte_range, > + .mm = vma->vm_mm, > + }; > + > + tlb_start_vma(tlb, vma); > + walk_page_range(addr, end, &warm_walk); > + 
tlb_end_vma(tlb, vma); > +} > + > + > +static long madvise_cold(struct vm_area_struct *vma, > + unsigned long start_addr, unsigned long end_addr) > +{ > + struct mm_struct *mm = vma->vm_mm; > + struct mmu_gather tlb; > + > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)) > + return -EINVAL; > + > + lru_add_drain(); > + tlb_gather_mmu(&tlb, mm, start_addr, end_addr); > + madvise_cold_page_range(&tlb, vma, start_addr, end_addr); > + tlb_finish_mmu(&tlb, start_addr, end_addr); > + > + return 0; > +} > + > static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > unsigned long end, struct mm_walk *walk) > > @@ -806,6 +926,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, > return madvise_willneed(vma, prev, start, end); > case MADV_COOL: > return madvise_cool(vma, start, end); > + case MADV_COLD: > + return madvise_cold(vma, start, end); > case MADV_FREE: > case MADV_DONTNEED: > return madvise_dontneed_free(vma, prev, start, end, behavior); > @@ -828,6 +950,7 @@ madvise_behavior_valid(int behavior) > case MADV_DONTNEED: > case MADV_FREE: > case MADV_COOL: > + case MADV_COLD: > #ifdef CONFIG_KSM > case MADV_MERGEABLE: > case MADV_UNMERGEABLE: > diff --git a/mm/vmscan.c b/mm/vmscan.c > index a28e5d17b495..1701b31f70a8 100644 > --- a/mm/vmscan.c > +++ b/mm/vmscan.c > @@ -2096,6 +2096,80 @@ static void shrink_active_list(unsigned long nr_to_scan, > nr_deactivate, nr_rotated, sc->priority, file); > } > > +unsigned long reclaim_pages(struct list_head *page_list) > +{ > + int nid = -1; > + unsigned long nr_isolated[2] = {0, }; > + unsigned long nr_reclaimed = 0; > + LIST_HEAD(node_page_list); > + struct reclaim_stat dummy_stat; > + struct scan_control sc = { > + .gfp_mask = GFP_KERNEL, > + .priority = DEF_PRIORITY, > + .may_writepage = 1, > + .may_unmap = 1, > + .may_swap = 1, > + }; > + > + while (!list_empty(page_list)) { > + struct page *page; > + > + page = lru_to_page(page_list); > + list_del(&page->lru); > + > + if (nid == -1) { 
> + nid = page_to_nid(page); > + INIT_LIST_HEAD(&node_page_list); > + nr_isolated[0] = nr_isolated[1] = 0; > + } > + > + if (nid == page_to_nid(page)) { > + list_add(&page->lru, &node_page_list); > + nr_isolated[!!page_is_file_cache(page)] += > + hpage_nr_pages(page); > + continue; > + } > + > + nid = page_to_nid(page); > + > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, > + nr_isolated[0]); > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, > + nr_isolated[1]); > + nr_reclaimed += shrink_page_list(&node_page_list, > + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS, > + &dummy_stat, true); > + while (!list_empty(&node_page_list)) { > + struct page *page = lru_to_page(page_list); > + > + list_del(&page->lru); > + putback_lru_page(page); > + } > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, > + -nr_isolated[0]); > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, > + -nr_isolated[1]); > + nr_isolated[0] = nr_isolated[1] = 0; > + INIT_LIST_HEAD(&node_page_list); > + } > + > + if (!list_empty(&node_page_list)) { > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, > + nr_isolated[0]); > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, > + nr_isolated[1]); > + nr_reclaimed += shrink_page_list(&node_page_list, > + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS, > + &dummy_stat, true); > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, > + -nr_isolated[0]); > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, > + -nr_isolated[1]); > + } > + > + return nr_reclaimed; > +} > + > /* > * The inactive anon list should be small enough that the VM never has > * to do too much work. > -- > 2.21.0.1020.gf2820cf01a-goog > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 3/7] mm: introduce MADV_COLD
  2019-05-20  8:27 ` [RFC 3/7] mm: introduce MADV_COLD Michal Hocko
@ 2019-05-20 23:00   ` Minchan Kim
  2019-05-21  6:08     ` Michal Hocko
  0 siblings, 1 reply; 68+ messages in thread
From: Minchan Kim @ 2019-05-20 23:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
      Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt,
      Sonny Rao, Brian Geffon, linux-api

On Mon, May 20, 2019 at 10:27:03AM +0200, Michal Hocko wrote:
> [Cc linux-api]
>
> On Mon 20-05-19 12:52:50, Minchan Kim wrote:
> > When a process expects no accesses to a certain memory range
> > for a long time, it could hint kernel that the pages can be
> > reclaimed instantly but data should be preserved for future use.
> > This could reduce workingset eviction so it ends up increasing
> > performance.
> >
> > This patch introduces the new MADV_COLD hint to madvise(2)
> > syscall. MADV_COLD can be used by a process to mark a memory range
> > as not expected to be used for a long time. The hint can help
> > kernel in deciding which pages to evict proactively.
>
> As mentioned in other email this looks like a non-destructive
> MADV_DONTNEED alternative.
>
> > Internally, it works via reclaiming memory in process context
> > the syscall is called. If the page is dirty but backing storage
> > is not synchronous device, the written page will be rotate back
> > into LRU's tail once the write is done so they will reclaim easily
> > when memory pressure happens. If backing storage is
> > synchrnous device(e.g., zram), hte page will be reclaimed instantly.
>
> Why do we special case async backing storage? Please always try to
> explain _why_ the decision is made.

I didn't make any decision. ;-) That's how current reclaim works, to
avoid the latency of freeing pages in interrupt context. I had a
patchset to resolve the concern a few years ago but got distracted.
> > I haven't checked the implementation yet so I cannot comment on that. > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > --- > > include/linux/swap.h | 1 + > > include/uapi/asm-generic/mman-common.h | 1 + > > mm/madvise.c | 123 +++++++++++++++++++++++++ > > mm/vmscan.c | 74 +++++++++++++++ > > 4 files changed, 199 insertions(+) > > > > diff --git a/include/linux/swap.h b/include/linux/swap.h > > index 64795abea003..7f32a948fc6a 100644 > > --- a/include/linux/swap.h > > +++ b/include/linux/swap.h > > @@ -365,6 +365,7 @@ extern int vm_swappiness; > > extern int remove_mapping(struct address_space *mapping, struct page *page); > > extern unsigned long vm_total_pages; > > > > +extern unsigned long reclaim_pages(struct list_head *page_list); > > #ifdef CONFIG_NUMA > > extern int node_reclaim_mode; > > extern int sysctl_min_unmapped_ratio; > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > > index f7a4a5d4b642..b9b51eeb8e1a 100644 > > --- a/include/uapi/asm-generic/mman-common.h > > +++ b/include/uapi/asm-generic/mman-common.h > > @@ -43,6 +43,7 @@ > > #define MADV_WILLNEED 3 /* will need these pages */ > > #define MADV_DONTNEED 4 /* don't need these pages */ > > #define MADV_COOL 5 /* deactivatie these pages */ > > +#define MADV_COLD 6 /* reclaim these pages */ > > > > /* common parameters: try to keep these consistent across architectures */ > > #define MADV_FREE 8 /* free pages only if memory pressure */ > > diff --git a/mm/madvise.c b/mm/madvise.c > > index c05817fb570d..9a6698b56845 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -42,6 +42,7 @@ static int madvise_need_mmap_write(int behavior) > > case MADV_WILLNEED: > > case MADV_DONTNEED: > > case MADV_COOL: > > + case MADV_COLD: > > case MADV_FREE: > > return 0; > > default: > > @@ -416,6 +417,125 @@ static long madvise_cool(struct vm_area_struct *vma, > > return 0; > > } > > > > +static int madvise_cold_pte_range(pmd_t *pmd, unsigned long 
addr, > > + unsigned long end, struct mm_walk *walk) > > +{ > > + pte_t *orig_pte, *pte, ptent; > > + spinlock_t *ptl; > > + LIST_HEAD(page_list); > > + struct page *page; > > + int isolated = 0; > > + struct vm_area_struct *vma = walk->vma; > > + unsigned long next; > > + > > + next = pmd_addr_end(addr, end); > > + if (pmd_trans_huge(*pmd)) { > > + spinlock_t *ptl; > > + > > + ptl = pmd_trans_huge_lock(pmd, vma); > > + if (!ptl) > > + return 0; > > + > > + if (is_huge_zero_pmd(*pmd)) > > + goto huge_unlock; > > + > > + page = pmd_page(*pmd); > > + if (page_mapcount(page) > 1) > > + goto huge_unlock; > > + > > + if (next - addr != HPAGE_PMD_SIZE) { > > + int err; > > + > > + get_page(page); > > + spin_unlock(ptl); > > + lock_page(page); > > + err = split_huge_page(page); > > + unlock_page(page); > > + put_page(page); > > + if (!err) > > + goto regular_page; > > + return 0; > > + } > > + > > + if (isolate_lru_page(page)) > > + goto huge_unlock; > > + > > + list_add(&page->lru, &page_list); > > +huge_unlock: > > + spin_unlock(ptl); > > + reclaim_pages(&page_list); > > + return 0; > > + } > > + > > + if (pmd_trans_unstable(pmd)) > > + return 0; > > +regular_page: > > + orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > + for (pte = orig_pte; addr < end; pte++, addr += PAGE_SIZE) { > > + ptent = *pte; > > + if (!pte_present(ptent)) > > + continue; > > + > > + page = vm_normal_page(vma, addr, ptent); > > + if (!page) > > + continue; > > + > > + if (page_mapcount(page) > 1) > > + continue; > > + > > + if (isolate_lru_page(page)) > > + continue; > > + > > + isolated++; > > + list_add(&page->lru, &page_list); > > + if (isolated >= SWAP_CLUSTER_MAX) { > > + pte_unmap_unlock(orig_pte, ptl); > > + reclaim_pages(&page_list); > > + isolated = 0; > > + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > > + orig_pte = pte; > > + } > > + } > > + > > + pte_unmap_unlock(orig_pte, ptl); > > + reclaim_pages(&page_list); > > + cond_resched(); > > + > > + return 0; 
> > +} > > + > > +static void madvise_cold_page_range(struct mmu_gather *tlb, > > + struct vm_area_struct *vma, > > + unsigned long addr, unsigned long end) > > +{ > > + struct mm_walk warm_walk = { > > + .pmd_entry = madvise_cold_pte_range, > > + .mm = vma->vm_mm, > > + }; > > + > > + tlb_start_vma(tlb, vma); > > + walk_page_range(addr, end, &warm_walk); > > + tlb_end_vma(tlb, vma); > > +} > > + > > + > > +static long madvise_cold(struct vm_area_struct *vma, > > + unsigned long start_addr, unsigned long end_addr) > > +{ > > + struct mm_struct *mm = vma->vm_mm; > > + struct mmu_gather tlb; > > + > > + if (vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP)) > > + return -EINVAL; > > + > > + lru_add_drain(); > > + tlb_gather_mmu(&tlb, mm, start_addr, end_addr); > > + madvise_cold_page_range(&tlb, vma, start_addr, end_addr); > > + tlb_finish_mmu(&tlb, start_addr, end_addr); > > + > > + return 0; > > +} > > + > > static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr, > > unsigned long end, struct mm_walk *walk) > > > > @@ -806,6 +926,8 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev, > > return madvise_willneed(vma, prev, start, end); > > case MADV_COOL: > > return madvise_cool(vma, start, end); > > + case MADV_COLD: > > + return madvise_cold(vma, start, end); > > case MADV_FREE: > > case MADV_DONTNEED: > > return madvise_dontneed_free(vma, prev, start, end, behavior); > > @@ -828,6 +950,7 @@ madvise_behavior_valid(int behavior) > > case MADV_DONTNEED: > > case MADV_FREE: > > case MADV_COOL: > > + case MADV_COLD: > > #ifdef CONFIG_KSM > > case MADV_MERGEABLE: > > case MADV_UNMERGEABLE: > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index a28e5d17b495..1701b31f70a8 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -2096,6 +2096,80 @@ static void shrink_active_list(unsigned long nr_to_scan, > > nr_deactivate, nr_rotated, sc->priority, file); > > } > > > > +unsigned long reclaim_pages(struct list_head *page_list) > > +{ > > + int 
nid = -1; > > + unsigned long nr_isolated[2] = {0, }; > > + unsigned long nr_reclaimed = 0; > > + LIST_HEAD(node_page_list); > > + struct reclaim_stat dummy_stat; > > + struct scan_control sc = { > > + .gfp_mask = GFP_KERNEL, > > + .priority = DEF_PRIORITY, > > + .may_writepage = 1, > > + .may_unmap = 1, > > + .may_swap = 1, > > + }; > > + > > + while (!list_empty(page_list)) { > > + struct page *page; > > + > > + page = lru_to_page(page_list); > > + list_del(&page->lru); > > + > > + if (nid == -1) { > > + nid = page_to_nid(page); > > + INIT_LIST_HEAD(&node_page_list); > > + nr_isolated[0] = nr_isolated[1] = 0; > > + } > > + > > + if (nid == page_to_nid(page)) { > > + list_add(&page->lru, &node_page_list); > > + nr_isolated[!!page_is_file_cache(page)] += > > + hpage_nr_pages(page); > > + continue; > > + } > > + > > + /* flush the batch for the previous node before switching */ > > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, > > + nr_isolated[0]); > > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, > > + nr_isolated[1]); > > + nr_reclaimed += shrink_page_list(&node_page_list, > > + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS, > > + &dummy_stat, true); > > + while (!list_empty(&node_page_list)) { > > + struct page *tmp = lru_to_page(&node_page_list); > > + > > + list_del(&tmp->lru); > > + putback_lru_page(tmp); > > + } > > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, > > + -nr_isolated[0]); > > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, > > + -nr_isolated[1]); > > + > > + /* start a new batch on the current page's node */ > > + nid = page_to_nid(page); > > + INIT_LIST_HEAD(&node_page_list); > > + nr_isolated[0] = nr_isolated[1] = 0; > > + list_add(&page->lru, &node_page_list); > > + nr_isolated[!!page_is_file_cache(page)] += > > + hpage_nr_pages(page); > > + } > > + > > + if (!list_empty(&node_page_list)) { > > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, > > + nr_isolated[0]); > > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, > > + nr_isolated[1]); > > + nr_reclaimed += shrink_page_list(&node_page_list, > > + NODE_DATA(nid), &sc, TTU_IGNORE_ACCESS, > > + &dummy_stat, true); > > + while (!list_empty(&node_page_list)) { > > + struct page *page = lru_to_page(&node_page_list); > > + > > + list_del(&page->lru); > > + putback_lru_page(page); > > + } > > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_ANON, > > + -nr_isolated[0]); > > + mod_node_page_state(NODE_DATA(nid), NR_ISOLATED_FILE, > > + -nr_isolated[1]); > > + } > > + > > + return nr_reclaimed; > > +} > > + > > /* > > * The inactive anon list should be small enough that the VM never has > > * to do too much work. > > -- > > 2.21.0.1020.gf2820cf01a-goog > > > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 3/7] mm: introduce MADV_COLD 2019-05-20 23:00 ` Minchan Kim @ 2019-05-21 6:08 ` Michal Hocko 2019-05-21 9:13 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-21 6:08 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue 21-05-19 08:00:38, Minchan Kim wrote: > On Mon, May 20, 2019 at 10:27:03AM +0200, Michal Hocko wrote: > > [Cc linux-api] > > > > On Mon 20-05-19 12:52:50, Minchan Kim wrote: > > > When a process expects no accesses to a certain memory range > > > for a long time, it could hint the kernel that the pages can be > > > reclaimed instantly but the data should be preserved for future use. > > > This could reduce workingset eviction so it ends up increasing > > > performance. > > > > > > This patch introduces the new MADV_COLD hint to the madvise(2) > > > syscall. MADV_COLD can be used by a process to mark a memory range > > > as not expected to be used for a long time. The hint can help > > > the kernel in deciding which pages to evict proactively. > > > > As mentioned in the other email this looks like a non-destructive > > MADV_DONTNEED alternative. > > > > > Internally, it works by reclaiming memory in the context of the > > > process calling the syscall. If a page is dirty and its backing storage > > > is not a synchronous device, the written page will be rotated back > > > to the LRU's tail once the write is done, so it will be reclaimed easily > > > when memory pressure happens. If the backing storage is a > > > synchronous device (e.g., zram), the page will be reclaimed instantly. > > > > Why do we special case async backing storage? Please always try to > > explain _why_ the decision is made. > > I didn't make any decision. ;-) That's how the current reclaim works to > avoid the latency of freeing pages in interrupt context.
I had a patchset > to resolve the concern a few years ago but got distracted. Please articulate that in the changelog then. Or even do not go into implementation details and stick with "reuse the current reclaim implementation". If you call out some of the specific details you are risking that people will start depending on them. The fact that this reuses the current reclaim logic is enough from the review point of view because we know that there is no additional special casing to worry about. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 3/7] mm: introduce MADV_COLD 2019-05-21 6:08 ` Michal Hocko @ 2019-05-21 9:13 ` Minchan Kim 0 siblings, 0 replies; 68+ messages in thread From: Minchan Kim @ 2019-05-21 9:13 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue, May 21, 2019 at 08:08:20AM +0200, Michal Hocko wrote: > On Tue 21-05-19 08:00:38, Minchan Kim wrote: > > On Mon, May 20, 2019 at 10:27:03AM +0200, Michal Hocko wrote: > > > [Cc linux-api] > > > > > > On Mon 20-05-19 12:52:50, Minchan Kim wrote: > > > > When a process expects no accesses to a certain memory range > > > > for a long time, it could hint kernel that the pages can be > > > > reclaimed instantly but data should be preserved for future use. > > > > This could reduce workingset eviction so it ends up increasing > > > > performance. > > > > > > > > This patch introduces the new MADV_COLD hint to madvise(2) > > > > syscall. MADV_COLD can be used by a process to mark a memory range > > > > as not expected to be used for a long time. The hint can help > > > > kernel in deciding which pages to evict proactively. > > > > > > As mentioned in other email this looks like a non-destructive > > > MADV_DONTNEED alternative. > > > > > > > Internally, it works via reclaiming memory in process context > > > > the syscall is called. If the page is dirty but backing storage > > > > is not synchronous device, the written page will be rotate back > > > > into LRU's tail once the write is done so they will reclaim easily > > > > when memory pressure happens. If backing storage is > > > > synchrnous device(e.g., zram), hte page will be reclaimed instantly. > > > > > > Why do we special case async backing storage? Please always try to > > > explain _why_ the decision is made. > > > > I didn't make any decesion. 
;-) That's how current reclaim works to > > avoid latency of freeing page in interrupt context. I had a patchset > > to resolve the concern a few years ago but got distracted. > > Please articulate that in the changelog then. Or even do not go into > implementation details and stick with - reuse the current reclaim > implementation. If you call out some of the specific details you are > risking people will start depending on them. The fact that this reuses > the currect reclaim logic is enough from the review point of view > because we know that there is no additional special casing to worry > about. I should have clarified. I will remove those lines in respin. > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20190520035254.57579-6-minchan@kernel.org>]
* Re: [RFC 5/7] mm: introduce external memory hinting API [not found] ` <20190520035254.57579-6-minchan@kernel.org> @ 2019-05-20 9:18 ` Michal Hocko 2019-05-21 2:41 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-20 9:18 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api [Cc linux-api] On Mon 20-05-19 12:52:52, Minchan Kim wrote: > There are some usecases where a centralized userspace daemon wants to give > a memory hint like MADV_[COOL|COLD] to another process. Android's > ActivityManagerService is one of them. > > It's similar in spirit to madvise(MADV_DONTNEED), but the information > required to make the reclaim decision is not known to the app. Instead, > it is known to the centralized userspace daemon (ActivityManagerService), > and that daemon must be able to initiate reclaim on its own without > any app involvement. Could you expand some more about how this all works? How does the centralized daemon track the respective ranges? How does it synchronize against parallel modification of the address space etc.? > To solve the issue, this patch introduces a new syscall, process_madvise(2), > which works based on a pidfd so it can give a hint to an external > process. > > int process_madvise(int pidfd, void *addr, size_t length, int advise); OK, this makes some sense from the API point of view. When we discussed that at LSFMM I was contemplating something like that, except the fd would be a VMA fd rather than the process. We could extend and reuse the /proc/<pid>/map_files interface, which doesn't support anonymous memory right now. I am not saying this would be a better interface but I wanted to mention it here for further discussion.
One slight advantage would be that you know the exact object that you are operating on because you have an fd for the VMA and we would have a more straightforward way to reject the operation if the underlying object has changed (e.g. unmapped and reused for a different mapping). > All the advice values madvise provides can be supported in process_madvise, too. > Since it could affect another process's address range, only a privileged > process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace > the process (e.g., being the same UID), can use it successfully. The proc_mem_open model we use for accessing the address space via proc sounds like a good model. You are doing something similar. > Please suggest a better idea if you have another idea about the permission model. > > * from v1r1 > * use ptrace capability - surenb, dancol > > Signed-off-by: Minchan Kim <minchan@kernel.org> > --- > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > include/linux/proc_fs.h | 1 + > include/linux/syscalls.h | 2 ++ > include/uapi/asm-generic/unistd.h | 2 ++ > kernel/signal.c | 2 +- > kernel/sys_ni.c | 1 + > mm/madvise.c | 45 ++++++++++++++++++++++++++ > 8 files changed, 54 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > index 4cd5f982b1e5..5b9dd55d6b57 100644 > --- a/arch/x86/entry/syscalls/syscall_32.tbl > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > @@ -438,3 +438,4 @@ > 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup > 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter > 427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register > +428 i386 process_madvise sys_process_madvise __ia32_sys_process_madvise > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > index 64ca0d06259a..0e5ee78161c9 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++
b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -355,6 +355,7 @@ > 425 common io_uring_setup __x64_sys_io_uring_setup > 426 common io_uring_enter __x64_sys_io_uring_enter > 427 common io_uring_register __x64_sys_io_uring_register > +428 common process_madvise __x64_sys_process_madvise > > # > # x32-specific system call numbers start at 512 to avoid cache impact > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h > index 52a283ba0465..f8545d7c5218 100644 > --- a/include/linux/proc_fs.h > +++ b/include/linux/proc_fs.h > @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file) > > #endif /* CONFIG_PROC_FS */ > > +extern struct pid *pidfd_to_pid(const struct file *file); > struct net; > > static inline struct proc_dir_entry *proc_net_mkdir( > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index e2870fe1be5b..21c6c9a62006 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void); > asmlinkage long sys_mincore(unsigned long start, size_t len, > unsigned char __user * vec); > asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); > +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start, > + size_t len, int behavior); > asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, > unsigned long prot, unsigned long pgoff, > unsigned long flags); > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > index dee7292e1df6..7ee82ce04620 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) > __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) > #define __NR_io_uring_register 427 > __SYSCALL(__NR_io_uring_register, sys_io_uring_register) > +#define __NR_process_madvise 428 > +__SYSCALL(__NR_process_madvise, sys_process_madvise) > > #undef __NR_syscalls > 
#define __NR_syscalls 428 > diff --git a/kernel/signal.c b/kernel/signal.c > index 1c86b78a7597..04e75daab1f8 100644 > --- a/kernel/signal.c > +++ b/kernel/signal.c > @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info) > return copy_siginfo_from_user(kinfo, info); > } > > -static struct pid *pidfd_to_pid(const struct file *file) > +struct pid *pidfd_to_pid(const struct file *file) > { > if (file->f_op == &pidfd_fops) > return file->private_data; > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index 4d9ae5ea6caf..5277421795ab 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall); > COND_SYSCALL(munlockall); > COND_SYSCALL(mincore); > COND_SYSCALL(madvise); > +COND_SYSCALL(process_madvise); > COND_SYSCALL(remap_file_pages); > COND_SYSCALL(mbind); > COND_SYSCALL_COMPAT(mbind); > diff --git a/mm/madvise.c b/mm/madvise.c > index 119e82e1f065..af02aa17e5c1 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -9,6 +9,7 @@ > #include <linux/mman.h> > #include <linux/pagemap.h> > #include <linux/page_idle.h> > +#include <linux/proc_fs.h> > #include <linux/syscalls.h> > #include <linux/mempolicy.h> > #include <linux/page-isolation.h> > @@ -16,6 +17,7 @@ > #include <linux/hugetlb.h> > #include <linux/falloc.h> > #include <linux/sched.h> > +#include <linux/sched/mm.h> > #include <linux/ksm.h> > #include <linux/fs.h> > #include <linux/file.h> > @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > { > return madvise_core(current, start, len_in, behavior); > } > + > +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start, > + size_t, len_in, int, behavior) > +{ > + int ret; > + struct fd f; > + struct pid *pid; > + struct task_struct *tsk; > + struct mm_struct *mm; > + > + f = fdget(pidfd); > + if (!f.file) > + return -EBADF; > + > + pid = pidfd_to_pid(f.file); > + if (IS_ERR(pid)) { > + ret = PTR_ERR(pid); > + goto err; 
> + } > + > + ret = -EINVAL; > + rcu_read_lock(); > + tsk = pid_task(pid, PIDTYPE_PID); > + if (!tsk) { > + rcu_read_unlock(); > + goto err; > + } > + get_task_struct(tsk); > + rcu_read_unlock(); > + mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS); > + if (!mm || IS_ERR(mm)) { > + ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; > + if (ret == -EACCES) > + ret = -EPERM; > + goto err; > + } > + ret = madvise_core(tsk, start, len_in, behavior); > + mmput(mm); > + put_task_struct(tsk); > +err: > + fdput(f); > + return ret; > +} > -- > 2.21.0.1020.gf2820cf01a-goog > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 5/7] mm: introduce external memory hinting API 2019-05-20 9:18 ` [RFC 5/7] mm: introduce external memory hinting API Michal Hocko @ 2019-05-21 2:41 ` Minchan Kim 2019-05-21 6:17 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-21 2:41 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote: > [Cc linux-api] > > On Mon 20-05-19 12:52:52, Minchan Kim wrote: > > There are some usecases where a centralized userspace daemon wants to give > > a memory hint like MADV_[COOL|COLD] to another process. Android's > > ActivityManagerService is one of them. > > > > It's similar in spirit to madvise(MADV_DONTNEED), but the information > > required to make the reclaim decision is not known to the app. Instead, > > it is known to the centralized userspace daemon (ActivityManagerService), > > and that daemon must be able to initiate reclaim on its own without > > any app involvement. > > > > Could you expand some more about how this all works? How does the > > centralized daemon track the respective ranges? How does it synchronize > > against parallel modification of the address space etc.? > > Currently, we don't track each address range because we have two > policies at this moment: > > deactivate file pages and reclaim anonymous pages of the app. > > Since the daemon has the ability to let background apps resume (IOW, the process > will be run by the daemon) and both hints are non-disruptive from a stability point > of view, we are okay with the race. > > > To solve the issue, this patch introduces a new syscall, process_madvise(2), > > which works based on a pidfd so it can give a hint to an external > > process. > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise); > > OK, this makes some sense from the API point of view.
When we > discussed that at LSFMM I was contemplating something like that, > except the fd would be a VMA fd rather than the process. We could extend > and reuse the /proc/<pid>/map_files interface, which doesn't support > anonymous memory right now. > > I am not saying this would be a better interface but I wanted to mention > it here for further discussion. One slight advantage would be that > you know the exact object that you are operating on because you have an > fd for the VMA and we would have a more straightforward way to reject > the operation if the underlying object has changed (e.g. unmapped and reused > for a different mapping). I agree with your point. If I didn't miss something, such vma-level modification notification doesn't work even for file-mapped vmas at this moment. For anonymous vmas, I think we could potentially use userfaultfd. It would be great if someone wants to do that with disruptive hints like MADV_DONTNEED. I'd like to see such further enhancement after landing the address-range-based operation, limiting the hints process_madvise supports to non-disruptive ones only (e.g., MADV_[COOL|COLD]), so that we can understand the usecases/workloads when someone wants to extend the API. > > > All the advice values madvise provides can be supported in process_madvise, too. > > Since it could affect another process's address range, only a privileged > > process (CAP_SYS_PTRACE), or one that otherwise has the right to ptrace > > the process (e.g., being the same UID), can use it successfully. > > The proc_mem_open model we use for accessing the address space via proc sounds > like a good model. You are doing something similar. > > > Please suggest a better idea if you have another idea about the permission model.
> > > > * from v1r1 > > * use ptrace capability - surenb, dancol > > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > --- > > arch/x86/entry/syscalls/syscall_32.tbl | 1 + > > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > > include/linux/proc_fs.h | 1 + > > include/linux/syscalls.h | 2 ++ > > include/uapi/asm-generic/unistd.h | 2 ++ > > kernel/signal.c | 2 +- > > kernel/sys_ni.c | 1 + > > mm/madvise.c | 45 ++++++++++++++++++++++++++ > > 8 files changed, 54 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl > > index 4cd5f982b1e5..5b9dd55d6b57 100644 > > --- a/arch/x86/entry/syscalls/syscall_32.tbl > > +++ b/arch/x86/entry/syscalls/syscall_32.tbl > > @@ -438,3 +438,4 @@ > > 425 i386 io_uring_setup sys_io_uring_setup __ia32_sys_io_uring_setup > > 426 i386 io_uring_enter sys_io_uring_enter __ia32_sys_io_uring_enter > > 427 i386 io_uring_register sys_io_uring_register __ia32_sys_io_uring_register > > +428 i386 process_madvise sys_process_madvise __ia32_sys_process_madvise > > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl > > index 64ca0d06259a..0e5ee78161c9 100644 > > --- a/arch/x86/entry/syscalls/syscall_64.tbl > > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > > @@ -355,6 +355,7 @@ > > 425 common io_uring_setup __x64_sys_io_uring_setup > > 426 common io_uring_enter __x64_sys_io_uring_enter > > 427 common io_uring_register __x64_sys_io_uring_register > > +428 common process_madvise __x64_sys_process_madvise > > > > # > > # x32-specific system call numbers start at 512 to avoid cache impact > > diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h > > index 52a283ba0465..f8545d7c5218 100644 > > --- a/include/linux/proc_fs.h > > +++ b/include/linux/proc_fs.h > > @@ -122,6 +122,7 @@ static inline struct pid *tgid_pidfd_to_pid(const struct file *file) > > > > #endif /* CONFIG_PROC_FS */ > > > > +extern struct pid *pidfd_to_pid(const 
struct file *file); > > struct net; > > > > static inline struct proc_dir_entry *proc_net_mkdir( > > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > > index e2870fe1be5b..21c6c9a62006 100644 > > --- a/include/linux/syscalls.h > > +++ b/include/linux/syscalls.h > > @@ -872,6 +872,8 @@ asmlinkage long sys_munlockall(void); > > asmlinkage long sys_mincore(unsigned long start, size_t len, > > unsigned char __user * vec); > > asmlinkage long sys_madvise(unsigned long start, size_t len, int behavior); > > +asmlinkage long sys_process_madvise(int pid_fd, unsigned long start, > > + size_t len, int behavior); > > asmlinkage long sys_remap_file_pages(unsigned long start, unsigned long size, > > unsigned long prot, unsigned long pgoff, > > unsigned long flags); > > diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h > > index dee7292e1df6..7ee82ce04620 100644 > > --- a/include/uapi/asm-generic/unistd.h > > +++ b/include/uapi/asm-generic/unistd.h > > @@ -832,6 +832,8 @@ __SYSCALL(__NR_io_uring_setup, sys_io_uring_setup) > > __SYSCALL(__NR_io_uring_enter, sys_io_uring_enter) > > #define __NR_io_uring_register 427 > > __SYSCALL(__NR_io_uring_register, sys_io_uring_register) > > +#define __NR_process_madvise 428 > > +__SYSCALL(__NR_process_madvise, sys_process_madvise) > > > > #undef __NR_syscalls > > #define __NR_syscalls 428 > > diff --git a/kernel/signal.c b/kernel/signal.c > > index 1c86b78a7597..04e75daab1f8 100644 > > --- a/kernel/signal.c > > +++ b/kernel/signal.c > > @@ -3620,7 +3620,7 @@ static int copy_siginfo_from_user_any(kernel_siginfo_t *kinfo, siginfo_t *info) > > return copy_siginfo_from_user(kinfo, info); > > } > > > > -static struct pid *pidfd_to_pid(const struct file *file) > > +struct pid *pidfd_to_pid(const struct file *file) > > { > > if (file->f_op == &pidfd_fops) > > return file->private_data; > > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > > index 4d9ae5ea6caf..5277421795ab 100644 > > --- 
a/kernel/sys_ni.c > > +++ b/kernel/sys_ni.c > > @@ -278,6 +278,7 @@ COND_SYSCALL(mlockall); > > COND_SYSCALL(munlockall); > > COND_SYSCALL(mincore); > > COND_SYSCALL(madvise); > > +COND_SYSCALL(process_madvise); > > COND_SYSCALL(remap_file_pages); > > COND_SYSCALL(mbind); > > COND_SYSCALL_COMPAT(mbind); > > diff --git a/mm/madvise.c b/mm/madvise.c > > index 119e82e1f065..af02aa17e5c1 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -9,6 +9,7 @@ > > #include <linux/mman.h> > > #include <linux/pagemap.h> > > #include <linux/page_idle.h> > > +#include <linux/proc_fs.h> > > #include <linux/syscalls.h> > > #include <linux/mempolicy.h> > > #include <linux/page-isolation.h> > > @@ -16,6 +17,7 @@ > > #include <linux/hugetlb.h> > > #include <linux/falloc.h> > > #include <linux/sched.h> > > +#include <linux/sched/mm.h> > > #include <linux/ksm.h> > > #include <linux/fs.h> > > #include <linux/file.h> > > @@ -1140,3 +1142,46 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > > { > > return madvise_core(current, start, len_in, behavior); > > } > > + > > +SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start, > > + size_t, len_in, int, behavior) > > +{ > > + int ret; > > + struct fd f; > > + struct pid *pid; > > + struct task_struct *tsk; > > + struct mm_struct *mm; > > + > > + f = fdget(pidfd); > > + if (!f.file) > > + return -EBADF; > > + > > + pid = pidfd_to_pid(f.file); > > + if (IS_ERR(pid)) { > > + ret = PTR_ERR(pid); > > + goto err; > > + } > > + > > + ret = -EINVAL; > > + rcu_read_lock(); > > + tsk = pid_task(pid, PIDTYPE_PID); > > + if (!tsk) { > > + rcu_read_unlock(); > > + goto err; > > + } > > + get_task_struct(tsk); > > + rcu_read_unlock(); > > + mm = mm_access(tsk, PTRACE_MODE_ATTACH_REALCREDS); > > + if (!mm || IS_ERR(mm)) { > > + ret = IS_ERR(mm) ? 
PTR_ERR(mm) : -ESRCH; > > + if (ret == -EACCES) > > + ret = -EPERM; > > + goto err; > > + } > > + ret = madvise_core(tsk, start, len_in, behavior); > > + mmput(mm); > > + put_task_struct(tsk); > > +err: > > + fdput(f); > > + return ret; > > +} > > -- > > 2.21.0.1020.gf2820cf01a-goog > > > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 5/7] mm: introduce external memory hinting API 2019-05-21 2:41 ` Minchan Kim @ 2019-05-21 6:17 ` Michal Hocko 2019-05-21 10:32 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-21 6:17 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue 21-05-19 11:41:07, Minchan Kim wrote: > On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote: > > [Cc linux-api] > > > > On Mon 20-05-19 12:52:52, Minchan Kim wrote: > > > There is some usecase that centralized userspace daemon want to give > > > a memory hint like MADV_[COOL|COLD] to other process. Android's > > > ActivityManagerService is one of them. > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > required to make the reclaim decision is not known to the app. Instead, > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > and that daemon must be able to initiate reclaim on its own without > > > any app involvement. > > > > Could you expand some more about how this all works? How does the > > centralized daemon track respective ranges? How does it synchronize > > against parallel modification of the address space etc. > > Currently, we don't track each address ranges because we have two > policies at this moment: > > deactive file pages and reclaim anonymous pages of the app. > > Since the daemon has a ability to let background apps resume(IOW, process > will be run by the daemon) and both hints are non-disruptive stabilty point > of view, we are okay for the race. Fair enough but the API should consider future usecases where this might be a problem. So we should really think about those potential scenarios now. If we are ok with that, fine, but then we should be explicit and document it that way. 
Essentially say that any sort of synchronization is supposed to be done by monitor. This will make the API less usable but maybe that is enough. > > > To solve the issue, this patch introduces new syscall process_madvise(2) > > > which works based on pidfd so it could give a hint to the exeternal > > > process. > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise); > > > > OK, this makes some sense from the API point of view. When we have > > discussed that at LSFMM I was contemplating about something like that > > except the fd would be a VMA fd rather than the process. We could extend > > and reuse /proc/<pid>/map_files interface which doesn't support the > > anonymous memory right now. > > > > I am not saying this would be a better interface but I wanted to mention > > it here for a further discussion. One slight advantage would be that > > you know the exact object that you are operating on because you have a > > fd for the VMA and we would have a more straightforward way to reject > > operation if the underlying object has changed (e.g. unmapped and reused > > for a different mapping). > > I agree your point. If I didn't miss something, such kinds of vma level > modify notification doesn't work even file mapped vma at this moment. > For anonymous vma, I think we could use userfaultfd, pontentially. > It would be great if someone want to do with disruptive hints like > MADV_DONTNEED. > > I'd like to see it further enhancement after landing address range based > operation via limiting hints process_madvise supports to non-disruptive > only(e.g., MADV_[COOL|COLD]) so that we could catch up the usercase/workload > when someone want to extend the API. So do you think we want both interfaces (process_madvise and madvisefd)? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 5/7] mm: introduce external memory hinting API 2019-05-21 6:17 ` Michal Hocko @ 2019-05-21 10:32 ` Minchan Kim 0 siblings, 0 replies; 68+ messages in thread From: Minchan Kim @ 2019-05-21 10:32 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue, May 21, 2019 at 08:17:43AM +0200, Michal Hocko wrote: > On Tue 21-05-19 11:41:07, Minchan Kim wrote: > > On Mon, May 20, 2019 at 11:18:29AM +0200, Michal Hocko wrote: > > > [Cc linux-api] > > > > > > On Mon 20-05-19 12:52:52, Minchan Kim wrote: > > > > There is some usecase that centralized userspace daemon want to give > > > > a memory hint like MADV_[COOL|COLD] to other process. Android's > > > > ActivityManagerService is one of them. > > > > > > > > It's similar in spirit to madvise(MADV_WONTNEED), but the information > > > > required to make the reclaim decision is not known to the app. Instead, > > > > it is known to the centralized userspace daemon(ActivityManagerService), > > > > and that daemon must be able to initiate reclaim on its own without > > > > any app involvement. > > > > > > Could you expand some more about how this all works? How does the > > > centralized daemon track respective ranges? How does it synchronize > > > against parallel modification of the address space etc. > > > > Currently, we don't track each address ranges because we have two > > policies at this moment: > > > > deactive file pages and reclaim anonymous pages of the app. > > > > Since the daemon has a ability to let background apps resume(IOW, process > > will be run by the daemon) and both hints are non-disruptive stabilty point > > of view, we are okay for the race. > > Fair enough but the API should consider future usecases where this might > be a problem. So we should really think about those potential scenarios > now. 
If we are ok with that, fine, but then we should be explicit and > document it that way. Essentially say that any sort of synchronization > is supposed to be done by the monitor. This will make the API less usable > but maybe that is enough. Okay, I will add more about that in the description. > > > > > To solve the issue, this patch introduces a new syscall process_madvise(2) > > > > which works based on pidfd so it could give a hint to the external > > > > process. > > > > > > > > int process_madvise(int pidfd, void *addr, size_t length, int advise); > > > > > > OK, this makes some sense from the API point of view. When we have > > > discussed that at LSFMM I was contemplating about something like that > > > except the fd would be a VMA fd rather than the process. We could extend > > > and reuse the /proc/<pid>/map_files interface which doesn't support > > > anonymous memory right now. > > > > > > I am not saying this would be a better interface but I wanted to mention > > > it here for a further discussion. One slight advantage would be that > > > you know the exact object that you are operating on because you have a > > > fd for the VMA and we would have a more straightforward way to reject > > > the operation if the underlying object has changed (e.g. unmapped and reused > > > for a different mapping). > > > > I agree with your point. If I didn't miss something, such kinds of VMA-level > > modification notifications don't work even for file-mapped VMAs at this moment. > > For anonymous VMAs, I think we could use userfaultfd, potentially. > > It would be great if someone wants to do that with disruptive hints like > > MADV_DONTNEED. > > > > I'd like to see such further enhancement after landing the address-range-based > > operation, by limiting the hints process_madvise supports to non-disruptive > > ones (e.g., MADV_[COOL|COLD]), so that we can catch up with the use cases/workloads > > when someone wants to extend the API. > > So do you think we want both interfaces (process_madvise and madvisefd)?
What I have in mind is to extend process_madvise later like this:

struct pr_madvise_param {
	int size;	/* the size of this structure */
	union {
		const struct iovec __user *vec;	/* address range array */
		int fd;				/* supported from 6.0 */
	};
};

with the introduction of a new OR-able hint, PR_MADV_RANGE_FD, so that process_madvise can go with an fd instead of an address range. ^ permalink raw reply [flat|nested] 68+ messages in thread
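To make the proposal above concrete, here is a userspace sketch of how such a union-based parameter block might be filled in either mode. Everything here is hypothetical: neither this layout nor PR_MADV_RANGE_FD exists in any released kernel, and the flag value is invented for illustration.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>

/* Hypothetical: the flag and the union member follow the proposal in
 * the mail above; they are not part of any released kernel ABI. */
#define PR_MADV_RANGE_FD (1 << 30)

struct pr_madvise_param {
	int size;			/* the size of this structure */
	union {
		const struct iovec *vec;	/* address range array */
		int fd;				/* VMA fd mode */
	};
};

/* Address-range mode: the kernel would read .vec. */
static void param_set_vec(struct pr_madvise_param *p, const struct iovec *vec)
{
	memset(p, 0, sizeof(*p));
	p->size = sizeof(*p);
	p->vec = vec;
}

/* VMA-fd mode: the kernel would read .fd instead, selected by OR-ing
 * the new flag into the behavior value. */
static void param_set_fd(struct pr_madvise_param *p, int fd, int *behavior)
{
	memset(p, 0, sizeof(*p));
	p->size = sizeof(*p);
	p->fd = fd;
	*behavior |= PR_MADV_RANGE_FD;
}
```

The `size` field carries the same forward-compatibility idea the RFC already uses: old binaries keep working if the struct later grows.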
[parent not found: <20190520035254.57579-7-minchan@kernel.org>]
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector array [not found] ` <20190520035254.57579-7-minchan@kernel.org> @ 2019-05-20 9:22 ` Michal Hocko 2019-05-21 2:48 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-20 9:22 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api [Cc linux-api] On Mon 20-05-19 12:52:53, Minchan Kim wrote: > Currently, the process_madvise syscall works for only one address range, > so the user should call the syscall several times to give hints to > multiple address ranges. Is that a problem? How big of a problem? Any numbers? > This patch extends the process_madvise syscall to support multiple > hints, address ranges and return values so the user can give hints > all at once. > > struct pr_madvise_param { > int size; /* the size of this structure */ > const struct iovec __user *vec; /* address range array */ > }; > > int process_madvise(int pidfd, ssize_t nr_elem, > int *behavior, > struct pr_madvise_param *results, > struct pr_madvise_param *ranges, > unsigned long flags); > > - pidfd > > target process fd > > - nr_elem > > the number of elements of the behavior, results, and ranges arrays > > - behavior > > hints for each address range in the remote process so that the user can > give different hints for each range. What is the guarantee of a single call? Do all hints get applied or does the first failure back off? What are the atomicity guarantees? > > - results > > array of buffers to get results for the associated remote address range > action. > > - ranges > > array of buffers holding the remote process's address ranges to be > processed > > - flags > > extra argument for the future. It should be zero at the moment. 
> > Example) > > struct pr_madvise_param { > int size; > const struct iovec *vec; > }; > > int main(int argc, char *argv[]) > { > struct pr_madvise_param retp, rangep; > struct iovec result_vec[2], range_vec[2]; > int hints[2]; > long ret[2]; > void *addr[2]; > > pid_t pid; > char cmd[64] = {0,}; > addr[0] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE, > MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0); > > if (MAP_FAILED == addr[0]) > return 1; > > addr[1] = mmap(NULL, ALLOC_SIZE, PROT_READ|PROT_WRITE, > MAP_POPULATE|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0); > > if (MAP_FAILED == addr[1]) > return 1; > > hints[0] = MADV_COLD; > range_vec[0].iov_base = addr[0]; > range_vec[0].iov_len = ALLOC_SIZE; > result_vec[0].iov_base = &ret[0]; > result_vec[0].iov_len = sizeof(long); > retp.vec = result_vec; > retp.size = sizeof(struct pr_madvise_param); > > hints[1] = MADV_COOL; > range_vec[1].iov_base = addr[1]; > range_vec[1].iov_len = ALLOC_SIZE; > result_vec[1].iov_base = &ret[1]; > result_vec[1].iov_len = sizeof(long); > rangep.vec = range_vec; > rangep.size = sizeof(struct pr_madvise_param); > > pid = fork(); > if (!pid) { > sleep(10); > } else { > int pidfd = open(cmd, O_DIRECTORY | O_CLOEXEC); > if (pidfd < 0) > return 1; > > /* munmap to make pages private for the child */ > munmap(addr[0], ALLOC_SIZE); > munmap(addr[1], ALLOC_SIZE); > system("cat /proc/vmstat | egrep 'pswpout|deactivate'"); > if (syscall(__NR_process_madvise, pidfd, 2, behaviors, > &retp, &rangep, 0)) > perror("process_madvise fail\n"); > system("cat /proc/vmstat | egrep 'pswpout|deactivate'"); > } > > return 0; > } > > Signed-off-by: Minchan Kim <minchan@kernel.org> > --- > include/uapi/asm-generic/mman-common.h | 5 + > mm/madvise.c | 184 +++++++++++++++++++++---- > 2 files changed, 166 insertions(+), 23 deletions(-) > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index b9b51eeb8e1a..b8e230de84a6 100644 > --- a/include/uapi/asm-generic/mman-common.h > +++ 
b/include/uapi/asm-generic/mman-common.h > @@ -74,4 +74,9 @@ > #define PKEY_ACCESS_MASK (PKEY_DISABLE_ACCESS |\ > PKEY_DISABLE_WRITE) > > +struct pr_madvise_param { > + int size; /* the size of this structure */ > + const struct iovec __user *vec; /* address range array */ > +}; > + > #endif /* __ASM_GENERIC_MMAN_COMMON_H */ > diff --git a/mm/madvise.c b/mm/madvise.c > index af02aa17e5c1..f4f569dac2bd 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -320,6 +320,7 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr, > struct page *page; > struct vm_area_struct *vma = walk->vma; > unsigned long next; > + long nr_pages = 0; > > next = pmd_addr_end(addr, end); > if (pmd_trans_huge(*pmd)) { > @@ -380,9 +381,12 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr, > > ptep_test_and_clear_young(vma, addr, pte); > deactivate_page(page); > + nr_pages++; > + > } > > pte_unmap_unlock(orig_pte, ptl); > + *(long *)walk->private += nr_pages; > cond_resched(); > > return 0; > @@ -390,11 +394,13 @@ static int madvise_cool_pte_range(pmd_t *pmd, unsigned long addr, > > static void madvise_cool_page_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, > - unsigned long addr, unsigned long end) > + unsigned long addr, unsigned long end, > + long *nr_pages) > { > struct mm_walk cool_walk = { > .pmd_entry = madvise_cool_pte_range, > .mm = vma->vm_mm, > + .private = nr_pages > }; > > tlb_start_vma(tlb, vma); > @@ -403,7 +409,8 @@ static void madvise_cool_page_range(struct mmu_gather *tlb, > } > > static long madvise_cool(struct vm_area_struct *vma, > - unsigned long start_addr, unsigned long end_addr) > + unsigned long start_addr, unsigned long end_addr, > + long *nr_pages) > { > struct mm_struct *mm = vma->vm_mm; > struct mmu_gather tlb; > @@ -413,7 +420,7 @@ static long madvise_cool(struct vm_area_struct *vma, > > lru_add_drain(); > tlb_gather_mmu(&tlb, mm, start_addr, end_addr); > - madvise_cool_page_range(&tlb, vma, start_addr, end_addr); > 
+ madvise_cool_page_range(&tlb, vma, start_addr, end_addr, nr_pages); > tlb_finish_mmu(&tlb, start_addr, end_addr); > > return 0; > @@ -429,6 +436,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr, > int isolated = 0; > struct vm_area_struct *vma = walk->vma; > unsigned long next; > + long nr_pages = 0; > > next = pmd_addr_end(addr, end); > if (pmd_trans_huge(*pmd)) { > @@ -492,7 +500,7 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr, > list_add(&page->lru, &page_list); > if (isolated >= SWAP_CLUSTER_MAX) { > pte_unmap_unlock(orig_pte, ptl); > - reclaim_pages(&page_list); > + nr_pages += reclaim_pages(&page_list); > isolated = 0; > pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > orig_pte = pte; > @@ -500,19 +508,22 @@ static int madvise_cold_pte_range(pmd_t *pmd, unsigned long addr, > } > > pte_unmap_unlock(orig_pte, ptl); > - reclaim_pages(&page_list); > + nr_pages += reclaim_pages(&page_list); > cond_resched(); > > + *(long *)walk->private += nr_pages; > return 0; > } > > static void madvise_cold_page_range(struct mmu_gather *tlb, > struct vm_area_struct *vma, > - unsigned long addr, unsigned long end) > + unsigned long addr, unsigned long end, > + long *nr_pages) > { > struct mm_walk warm_walk = { > .pmd_entry = madvise_cold_pte_range, > .mm = vma->vm_mm, > + .private = nr_pages, > }; > > tlb_start_vma(tlb, vma); > @@ -522,7 +533,8 @@ static void madvise_cold_page_range(struct mmu_gather *tlb, > > > static long madvise_cold(struct vm_area_struct *vma, > - unsigned long start_addr, unsigned long end_addr) > + unsigned long start_addr, unsigned long end_addr, > + long *nr_pages) > { > struct mm_struct *mm = vma->vm_mm; > struct mmu_gather tlb; > @@ -532,7 +544,7 @@ static long madvise_cold(struct vm_area_struct *vma, > > lru_add_drain(); > tlb_gather_mmu(&tlb, mm, start_addr, end_addr); > - madvise_cold_page_range(&tlb, vma, start_addr, end_addr); > + madvise_cold_page_range(&tlb, vma, start_addr, end_addr, 
nr_pages); > tlb_finish_mmu(&tlb, start_addr, end_addr); > > return 0; > @@ -922,7 +934,7 @@ static int madvise_inject_error(int behavior, > static long > madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma, > struct vm_area_struct **prev, unsigned long start, > - unsigned long end, int behavior) > + unsigned long end, int behavior, long *nr_pages) > { > switch (behavior) { > case MADV_REMOVE: > @@ -930,9 +942,9 @@ madvise_vma(struct task_struct *tsk, struct vm_area_struct *vma, > case MADV_WILLNEED: > return madvise_willneed(vma, prev, start, end); > case MADV_COOL: > - return madvise_cool(vma, start, end); > + return madvise_cool(vma, start, end, nr_pages); > case MADV_COLD: > - return madvise_cold(vma, start, end); > + return madvise_cold(vma, start, end, nr_pages); > case MADV_FREE: > case MADV_DONTNEED: > return madvise_dontneed_free(tsk, vma, prev, start, > @@ -981,7 +993,7 @@ madvise_behavior_valid(int behavior) > } > > static int madvise_core(struct task_struct *tsk, unsigned long start, > - size_t len_in, int behavior) > + size_t len_in, int behavior, long *nr_pages) > { > unsigned long end, tmp; > struct vm_area_struct *vma, *prev; > @@ -996,6 +1008,7 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, > > if (start & ~PAGE_MASK) > return error; > + > len = (len_in + ~PAGE_MASK) & PAGE_MASK; > > /* Check to see whether len was rounded up from small -ve to zero */ > @@ -1035,6 +1048,8 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, > blk_start_plug(&plug); > for (;;) { > /* Still start < end. */ > + long pages = 0; > + > error = -ENOMEM; > if (!vma) > goto out; > @@ -1053,9 +1068,11 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, > tmp = end; > > /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). 
*/ > - error = madvise_vma(tsk, vma, &prev, start, tmp, behavior); > + error = madvise_vma(tsk, vma, &prev, start, tmp, > + behavior, &pages); > if (error) > goto out; > + *nr_pages += pages; > start = tmp; > if (prev && start < prev->vm_end) > start = prev->vm_end; > @@ -1140,26 +1157,137 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, > */ > SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior) > { > - return madvise_core(current, start, len_in, behavior); > + unsigned long dummy; > + > + return madvise_core(current, start, len_in, behavior, &dummy); > } > > -SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start, > - size_t, len_in, int, behavior) > +static int pr_madvise_copy_param(struct pr_madvise_param __user *u_param, > + struct pr_madvise_param *param) > +{ > + u32 size; > + int ret; > + > + memset(param, 0, sizeof(*param)); > + > + ret = get_user(size, &u_param->size); > + if (ret) > + return ret; > + > + if (size > PAGE_SIZE) > + return -E2BIG; > + > + if (!size || size > sizeof(struct pr_madvise_param)) > + return -EINVAL; > + > + ret = copy_from_user(param, u_param, size); > + if (ret) > + return -EFAULT; > + > + return ret; > +} > + > +static int process_madvise_core(struct task_struct *tsk, int *behaviors, > + struct iov_iter *iter, > + const struct iovec *range_vec, > + unsigned long riovcnt, > + unsigned long flags) > +{ > + int i; > + long err; > + > + for (err = 0, i = 0; i < riovcnt && iov_iter_count(iter); i++) { > + long ret = 0; > + > + err = madvise_core(tsk, (unsigned long)range_vec[i].iov_base, > + range_vec[i].iov_len, behaviors[i], > + &ret); > + if (err) > + ret = err; > + > + if (copy_to_iter(&ret, sizeof(long), iter) != > + sizeof(long)) { > + err = -EFAULT; > + break; > + } > + > + err = 0; > + } > + > + return err; > +} > + > +SYSCALL_DEFINE6(process_madvise, int, pidfd, ssize_t, nr_elem, > + const int __user *, hints, > + struct pr_madvise_param __user *, results, > + 
struct pr_madvise_param __user *, ranges, > + unsigned long, flags) > { > int ret; > struct fd f; > struct pid *pid; > struct task_struct *tsk; > struct mm_struct *mm; > + struct pr_madvise_param result_p, range_p; > + const struct iovec __user *result_vec, __user *range_vec; > + int *behaviors; > + struct iovec iovstack_result[UIO_FASTIOV]; > + struct iovec iovstack_r[UIO_FASTIOV]; > + struct iovec *iov_l = iovstack_result; > + struct iovec *iov_r = iovstack_r; > + struct iov_iter iter; > + > + if (flags != 0) > + return -EINVAL; > + > + ret = pr_madvise_copy_param(results, &result_p); > + if (ret) > + return ret; > + > + ret = pr_madvise_copy_param(ranges, &range_p); > + if (ret) > + return ret; > + > + result_vec = result_p.vec; > + range_vec = range_p.vec; > + > + if (result_p.size != sizeof(struct pr_madvise_param) || > + range_p.size != sizeof(struct pr_madvise_param)) > + return -EINVAL; > + > + behaviors = kmalloc_array(nr_elem, sizeof(int), GFP_KERNEL); > + if (!behaviors) > + return -ENOMEM; > + > + ret = copy_from_user(behaviors, hints, sizeof(int) * nr_elem); > + if (ret < 0) > + goto free_behavior_vec; > + > + ret = import_iovec(READ, result_vec, nr_elem, UIO_FASTIOV, > + &iov_l, &iter); > + if (ret < 0) > + goto free_behavior_vec; > + > + if (!iov_iter_count(&iter)) { > + ret = -EINVAL; > + goto free_iovecs; > + } > + > + ret = rw_copy_check_uvector(CHECK_IOVEC_ONLY, range_vec, nr_elem, > + UIO_FASTIOV, iovstack_r, &iov_r); > + if (ret <= 0) > + goto free_iovecs; > > f = fdget(pidfd); > - if (!f.file) > - return -EBADF; > + if (!f.file) { > + ret = -EBADF; > + goto free_iovecs; > + } > > pid = pidfd_to_pid(f.file); > if (IS_ERR(pid)) { > ret = PTR_ERR(pid); > - goto err; > + goto put_fd; > } > > ret = -EINVAL; > @@ -1167,7 +1295,7 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start, > tsk = pid_task(pid, PIDTYPE_PID); > if (!tsk) { > rcu_read_unlock(); > - goto err; > + goto put_fd; > } > get_task_struct(tsk); > rcu_read_unlock(); > 
@@ -1176,12 +1304,22 @@ SYSCALL_DEFINE4(process_madvise, int, pidfd, unsigned long, start, > ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH; > if (ret == -EACCES) > ret = -EPERM; > - goto err; > + goto put_task; > } > - ret = madvise_core(tsk, start, len_in, behavior); > + > + ret = process_madvise_core(tsk, behaviors, &iter, iov_r, > + nr_elem, flags); > mmput(mm); > +put_task: > put_task_struct(tsk); > -err: > +put_fd: > fdput(f); > +free_iovecs: > + if (iov_r != iovstack_r) > + kfree(iov_r); > + kfree(iov_l); > +free_behavior_vec: > + kfree(behaviors); > + > return ret; > } > -- > 2.21.0.1020.gf2820cf01a-goog > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
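Two small bugs in the changelog example above are worth flagging: the syscall is passed `behaviors` while the array is declared as `hints`, and `cmd` is handed to `open()` without ever being set to the `/proc/<pid>` path. A corrected, runnable sketch of just the argument packing, using userspace mirrors of the RFC's structures (the 6-argument syscall itself was never merged, so the actual invocation is only described, not performed):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Userspace mirror of the RFC's uapi struct; not part of any released
 * kernel headers. */
struct pr_madvise_param {
	int size;			/* the size of this structure */
	const struct iovec *vec;	/* address range array */
};

/* Pack n address ranges and their per-range result slots in the layout
 * the RFC expects: range_vec[i] describes a range in the target
 * process, result_vec[i] points at a long receiving that range's
 * outcome. */
static void pack_madvise_args(struct pr_madvise_param *rangep,
			      struct pr_madvise_param *retp,
			      struct iovec *range_vec,
			      struct iovec *result_vec,
			      void *const *addrs, const size_t *lens,
			      long *rets, int n)
{
	int i;

	for (i = 0; i < n; i++) {
		range_vec[i].iov_base = addrs[i];
		range_vec[i].iov_len = lens[i];
		result_vec[i].iov_base = &rets[i];
		result_vec[i].iov_len = sizeof(long);
	}
	rangep->size = sizeof(*rangep);
	rangep->vec = range_vec;
	retp->size = sizeof(*retp);
	retp->vec = result_vec;
}
```

Per the RFC's prototype, a caller would then issue `syscall(__NR_process_madvise, pidfd, n, hints, &retp, &rangep, 0)` with the `hints` array matching the packed ranges element by element.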
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector array 2019-05-20 9:22 ` [RFC 6/7] mm: extend process_madvise syscall to support vector array Michal Hocko @ 2019-05-21 2:48 ` Minchan Kim 2019-05-21 6:24 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-21 2:48 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > [Cc linux-api] > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > Currently, the process_madvise syscall works for only one address range, > > so the user should call the syscall several times to give hints to > > multiple address ranges. > > Is that a problem? How big of a problem? Any numbers? We easily have 2000+ VMAs, so it's non-trivial overhead. I will come up with numbers in the description at respin. > > > This patch extends the process_madvise syscall to support multiple > > hints, address ranges and return values so the user can give hints > > all at once. > > > > struct pr_madvise_param { > > int size; /* the size of this structure */ > > const struct iovec __user *vec; /* address range array */ > > }; > > > > int process_madvise(int pidfd, ssize_t nr_elem, > > int *behavior, > > struct pr_madvise_param *results, > > struct pr_madvise_param *ranges, > > unsigned long flags); > > > > - pidfd > > > > target process fd > > > > - nr_elem > > > > the number of elements of the behavior, results, and ranges arrays > > > > - behavior > > > > hints for each address range in the remote process so that the user can > > give different hints for each range. > > What is the guarantee of a single call? Do all hints get applied or does the > first failure back off? What are the atomicity guarantees? All hints will be tried even if one of them fails. The user will see the success or failure in the results parameter. 
For the single call, there is no guarantee of atomicity.

> [the remainder of the patch was quoted in full without further comments; trimmed]

^ permalink raw reply [flat|nested] 68+ messages in thread
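Given the answer above — every element is attempted and there is no atomicity — the monitor has to walk the per-range results itself. A sketch of that consumer-side pattern, with the syscall replaced by an injectable function (and a test stub standing in for it) so the loop can be shown standalone:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for one process_madvise element: returns a negative errno
 * on failure or the number of pages affected on success, mirroring
 * the per-range results the RFC writes back through the result iovecs. */
typedef long (*advise_fn)(void *addr, size_t len, int behavior);

/* Apply each hint in order; a failing range is recorded in results[]
 * but does not stop the remaining ranges, matching the behavior of
 * process_madvise_core() in the patch. Returns the success count. */
static int apply_hints(advise_fn fn, void *const *addrs,
		       const size_t *lens, const int *behaviors,
		       long *results, int n)
{
	int i, ok = 0;

	for (i = 0; i < n; i++) {
		results[i] = fn(addrs[i], lens[i], behaviors[i]);
		if (results[i] >= 0)
			ok++;
	}
	return ok;
}

/* Demo stub: rejects negative "behaviors", otherwise reports one
 * result unit per 4 KiB page. Purely for illustration/testing. */
static long demo_advise(void *addr, size_t len, int behavior)
{
	(void)addr;
	if (behavior < 0)
		return -22; /* -EINVAL */
	return (long)(len / 4096);
}
```

A monitor built on this pattern can retry or log the failed ranges individually instead of treating the whole vectored call as one success/failure.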
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-21 2:48 ` Minchan Kim @ 2019-05-21 6:24 ` Michal Hocko 2019-05-21 10:26 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-21 6:24 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue 21-05-19 11:48:20, Minchan Kim wrote: > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > [Cc linux-api] > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > Currently, process_madvise syscall works for only one address range > > > so user should call the syscall several times to give hints to > > > multiple address range. > > > > Is that a problem? How big of a problem? Any numbers? > > We easily have 2000+ vma so it's not trivial overhead. I will come up > with number in the description at respin. Does this really have to be a fast operation? I would expect the monitor is by no means a fast path. The system call overhead is not what it used to be, sigh, but still for something that is not a hot path it should be tolerable, especially when the whole operation is quite expensive on its own (wrt. the syscall entry/exit). I am not saying we do not need a multiplexing API, I am just not sure we need it right away. Btw. there was some demand for other MM syscalls to provide a multiplexing API (e.g. mprotect), maybe it would be better to handle those in one go? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-21 6:24 ` Michal Hocko @ 2019-05-21 10:26 ` Minchan Kim 2019-05-21 10:37 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-21 10:26 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote: > On Tue 21-05-19 11:48:20, Minchan Kim wrote: > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > > [Cc linux-api] > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > > Currently, process_madvise syscall works for only one address range > > > > so user should call the syscall several times to give hints to > > > > multiple address range. > > > > > > Is that a problem? How big of a problem? Any numbers? > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up > > with number in the description at respin. > > Does this really have to be a fast operation? I would expect the monitor > is by no means a fast path. The system call overhead is not what it used > to be, sigh, but still for something that is not a hot path it should be > tolerable, especially when the whole operation is quite expensive on its > own (wrt. the syscall entry/exit). How is this different from process_vm_[readv|writev] and vmsplice? If the range to be covered is large, a vector operation makes sense to me. > > I am not saying we do not need a multiplexing API, I am just not sure > we need it right away. Btw. there was some demand for other MM syscalls > to provide a multiplexing API (e.g. mprotect), maybe it would be better > to handle those in one go? That's exactly what Daniel Colascione suggested in internal review. It would be an interesting approach if we could aggregate all of those system calls in one go.
> -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-21 10:26 ` Minchan Kim @ 2019-05-21 10:37 ` Michal Hocko 2019-05-27 7:49 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-21 10:37 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue 21-05-19 19:26:13, Minchan Kim wrote: > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote: > > On Tue 21-05-19 11:48:20, Minchan Kim wrote: > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > > > [Cc linux-api] > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > > > Currently, process_madvise syscall works for only one address range > > > > > so user should call the syscall several times to give hints to > > > > > multiple address range. > > > > > > > > Is that a problem? How big of a problem? Any numbers? > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up > > > with number in the description at respin. > > > > Does this really have to be a fast operation? I would expect the monitor > > is by no means a fast path. The system call overhead is not what it used > > to be, sigh, but still for something that is not a hot path it should be > > tolerable, especially when the whole operation is quite expensive on its > > own (wrt. the syscall entry/exit). > > What's different with process_vm_[readv|writev] and vmsplice? > If the range needed to be covered is a lot, vector operation makes senese > to me. I am not saying that the vector API is wrong. All I am trying to say is that the benefit is not really clear so far. If you want to push it through then you should better get some supporting data. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-21 10:37 ` Michal Hocko @ 2019-05-27 7:49 ` Minchan Kim 2019-05-29 10:08 ` Daniel Colascione 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-27 7:49 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote: > On Tue 21-05-19 19:26:13, Minchan Kim wrote: > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote: > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote: > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > > > > [Cc linux-api] > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > > > > Currently, process_madvise syscall works for only one address range > > > > > > so user should call the syscall several times to give hints to > > > > > > multiple address range. > > > > > > > > > > Is that a problem? How big of a problem? Any numbers? > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up > > > > with number in the description at respin. > > > > > > Does this really have to be a fast operation? I would expect the monitor > > > is by no means a fast path. The system call overhead is not what it used > > > to be, sigh, but still for something that is not a hot path it should be > > > tolerable, especially when the whole operation is quite expensive on its > > > own (wrt. the syscall entry/exit). > > > > What's different with process_vm_[readv|writev] and vmsplice? > > If the range needed to be covered is a lot, vector operation makes senese > > to me. > > I am not saying that the vector API is wrong. All I am trying to say is > that the benefit is not really clear so far. If you want to push it > through then you should better get some supporting data. 
I measured 1000 madvise syscalls vs. a single vector-range syscall covering 1000 ranges on a modern ARM64 device. Although I saw a 15% improvement, the absolute gain is just 1ms, so I don't think it's worth supporting. I will drop vector support in the next revision. Thanks for the review, Michal! ^ permalink raw reply [flat|nested] 68+ messages in thread
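For readers who want to reproduce this kind of measurement in spirit, a minimal benchmark sketch follows: many small madvise(2) calls vs. one call covering the same pages. Absolute numbers are machine-dependent, and MADV_DONTNEED on an anonymous mapping merely stands in for the hints under discussion; the function names are made up:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* n page-sized madvise() calls: one syscall entry/exit (and one
 * mmap_sem round-trip) per range. */
static long long bench_split(char *buf, long page, int n)
{
    long long t0 = now_ns();
    for (int i = 0; i < n; i++)
        madvise(buf + (long)i * page, page, MADV_DONTNEED);
    return now_ns() - t0;
}

/* One madvise() covering the same n pages. */
static long long bench_batched(char *buf, long page, int n)
{
    long long t0 = now_ns();
    madvise(buf, (long)n * page, MADV_DONTNEED);
    return now_ns() - t0;
}

int run_bench(void)
{
    const int n = 1000;
    long page = sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, (long)n * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    printf("split:   %lld ns\n", bench_split(buf, page, n));
    printf("batched: %lld ns\n", bench_batched(buf, page, n));
    munmap(buf, (long)n * page);
    return 0;
}
```

As the thread notes, an uncontended tight loop like this understates the benefit of batching; contention on mmap_sem from other threads is where coalescing tends to pay off.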
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-27 7:49 ` Minchan Kim @ 2019-05-29 10:08 ` Daniel Colascione 2019-05-29 10:33 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Daniel Colascione @ 2019-05-29 10:08 UTC (permalink / raw) To: Minchan Kim Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote: > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote: > > On Tue 21-05-19 19:26:13, Minchan Kim wrote: > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote: > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote: > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > > > > > [Cc linux-api] > > > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > > > > > Currently, process_madvise syscall works for only one address range > > > > > > > so user should call the syscall several times to give hints to > > > > > > > multiple address range. > > > > > > > > > > > > Is that a problem? How big of a problem? Any numbers? > > > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up > > > > > with number in the description at respin. > > > > > > > > Does this really have to be a fast operation? I would expect the monitor > > > > is by no means a fast path. The system call overhead is not what it used > > > > to be, sigh, but still for something that is not a hot path it should be > > > > tolerable, especially when the whole operation is quite expensive on its > > > > own (wrt. the syscall entry/exit). > > > > > > What's different with process_vm_[readv|writev] and vmsplice? > > > If the range needed to be covered is a lot, vector operation makes senese > > > to me. > > > > I am not saying that the vector API is wrong. 
All I am trying to say is > > that the benefit is not really clear so far. If you want to push it > > through then you should better get some supporting data. > > I measured 1000 madvise syscall vs. a vector range syscall with 1000 > ranges on ARM64 mordern device. Even though I saw 15% improvement but > absoluate gain is just 1ms so I don't think it's worth to support. > I will drop vector support at next revision. Please do keep the vector support. Absolute timing is misleading, since in a tight loop, you're not going to contend on mmap_sem. We've seen tons of improvements in things like camera start come from coalescing mprotect calls, with the gains coming from taking and releasing various locks a lot less often and bouncing around less on the contended lock paths. Raw throughput doesn't tell the whole story, especially on mobile. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-29 10:08 ` Daniel Colascione @ 2019-05-29 10:33 ` Michal Hocko 2019-05-30 2:17 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-29 10:33 UTC (permalink / raw) To: Daniel Colascione Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Wed 29-05-19 03:08:32, Daniel Colascione wrote: > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote: > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote: > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote: > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote: > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote: > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > > > > > > [Cc linux-api] > > > > > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > > > > > > Currently, process_madvise syscall works for only one address range > > > > > > > > so user should call the syscall several times to give hints to > > > > > > > > multiple address range. > > > > > > > > > > > > > > Is that a problem? How big of a problem? Any numbers? > > > > > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up > > > > > > with number in the description at respin. > > > > > > > > > > Does this really have to be a fast operation? I would expect the monitor > > > > > is by no means a fast path. The system call overhead is not what it used > > > > > to be, sigh, but still for something that is not a hot path it should be > > > > > tolerable, especially when the whole operation is quite expensive on its > > > > > own (wrt. the syscall entry/exit). > > > > > > > > What's different with process_vm_[readv|writev] and vmsplice? 
> > > > If the range needed to be covered is a lot, vector operation makes senese > > > > to me. > > > > > > I am not saying that the vector API is wrong. All I am trying to say is > > > that the benefit is not really clear so far. If you want to push it > > > through then you should better get some supporting data. > > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000 > > ranges on ARM64 mordern device. Even though I saw 15% improvement but > > absoluate gain is just 1ms so I don't think it's worth to support. > > I will drop vector support at next revision. > > Please do keep the vector support. Absolute timing is misleading, > since in a tight loop, you're not going to contend on mmap_sem. We've > seen tons of improvements in things like camera start come from > coalescing mprotect calls, with the gains coming from taking and > releasing various locks a lot less often and bouncing around less on > the contended lock paths. Raw throughput doesn't tell the whole story, > especially on mobile. This will always be a double edge sword. Taking a lock for longer can improve a throughput of a single call but it would make a latency for anybody contending on the lock much worse. Besides that, please do not overcomplicate the thing from the early beginning please. Let's start with a simple and well defined remote madvise alternative first and build a vector API on top with some numbers based on _real_ workloads. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-29 10:33 ` Michal Hocko @ 2019-05-30 2:17 ` Minchan Kim 2019-05-30 6:57 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-30 2:17 UTC (permalink / raw) To: Michal Hocko Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote: > On Wed 29-05-19 03:08:32, Daniel Colascione wrote: > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote: > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote: > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote: > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote: > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > > > > > > > [Cc linux-api] > > > > > > > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > > > > > > > Currently, process_madvise syscall works for only one address range > > > > > > > > > so user should call the syscall several times to give hints to > > > > > > > > > multiple address range. > > > > > > > > > > > > > > > > Is that a problem? How big of a problem? Any numbers? > > > > > > > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up > > > > > > > with number in the description at respin. > > > > > > > > > > > > Does this really have to be a fast operation? I would expect the monitor > > > > > > is by no means a fast path. The system call overhead is not what it used > > > > > > to be, sigh, but still for something that is not a hot path it should be > > > > > > tolerable, especially when the whole operation is quite expensive on its > > > > > > own (wrt. the syscall entry/exit). 
> > > > > > > > > > What's different with process_vm_[readv|writev] and vmsplice? > > > > > If the range needed to be covered is a lot, vector operation makes senese > > > > > to me. > > > > > > > > I am not saying that the vector API is wrong. All I am trying to say is > > > > that the benefit is not really clear so far. If you want to push it > > > > through then you should better get some supporting data. > > > > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000 > > > ranges on ARM64 mordern device. Even though I saw 15% improvement but > > > absoluate gain is just 1ms so I don't think it's worth to support. > > > I will drop vector support at next revision. > > > > Please do keep the vector support. Absolute timing is misleading, > > since in a tight loop, you're not going to contend on mmap_sem. We've > > seen tons of improvements in things like camera start come from > > coalescing mprotect calls, with the gains coming from taking and > > releasing various locks a lot less often and bouncing around less on > > the contended lock paths. Raw throughput doesn't tell the whole story, > > especially on mobile. > > This will always be a double edge sword. Taking a lock for longer can > improve a throughput of a single call but it would make a latency for > anybody contending on the lock much worse. > > Besides that, please do not overcomplicate the thing from the early > beginning please. Let's start with a simple and well defined remote > madvise alternative first and build a vector API on top with some > numbers based on _real_ workloads. First time, I didn't think about atomicity about address range race because MADV_COLD/PAGEOUT is not critical for the race. However you raised the atomicity issue because people would extend hints to destructive ones easily. I agree with that and that's why we discussed how to guarantee the race and Daniel comes up with good idea. - vma configuration seq number via process_getinfo(2). 
We discussed the race issue without _real_ workloads/requests because it's common sense that people might extend the syscall later. The same applies here. For current workloads, we don't need vector support from a performance point of view, based on my experiment. However, it was a rather limited experiment. Some configurations might have 10000+ vmas or a really slow CPU. Furthermore, I want vector support because of the atomicity issue, if that is really something we should consider. With vector support in the API and the vma configuration sequence number from Daniel, we could make operations on multiple address ranges atomic. However, since we are not introducing vectors at this moment, we would need to introduce *another syscall* later to handle multiple ranges all at once atomically, if that's okay. Other thought: maybe we could extend an address-range batch syscall to cover other MM syscalls like mmap/munmap/madvise/mprotect and so on, because there are multiple users that would benefit from such a general batching mechanism. ^ permalink raw reply [flat|nested] 68+ messages in thread
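A sketch of what such a general batching mechanism could look like from userspace. The struct, opcode names, and entry points are purely hypothetical — nothing like this exists upstream — and the batch is applied in-process with ordinary madvise(2)/mprotect(2) as stand-ins for what a multiplexed syscall could do under a single mmap_sem hold:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical batch element: one MM operation per entry. */
enum mm_opcode { MM_OP_MADVISE, MM_OP_MPROTECT };

struct mm_op {
    enum mm_opcode op;
    void *addr;
    size_t len;
    int arg;        /* madvise behavior or mprotect prot bits */
};

/* Apply a batch, recording one result per element. */
static void mm_batch(const struct mm_op *ops, long *results, int n)
{
    for (int i = 0; i < n; i++) {
        switch (ops[i].op) {
        case MM_OP_MADVISE:
            results[i] = madvise(ops[i].addr, ops[i].len, ops[i].arg);
            break;
        case MM_OP_MPROTECT:
            results[i] = mprotect(ops[i].addr, ops[i].len, ops[i].arg);
            break;
        }
    }
}

int run_batch_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    char *buf = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    struct mm_op ops[2] = {
        { MM_OP_MADVISE,  buf,        (size_t)page, MADV_DONTNEED },
        { MM_OP_MPROTECT, buf + page, (size_t)page, PROT_READ },
    };
    long results[2];
    mm_batch(ops, results, 2);
    munmap(buf, 2 * page);
    return (results[0] == 0 && results[1] == 0) ? 0 : -1;
}
```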
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-30 2:17 ` Minchan Kim @ 2019-05-30 6:57 ` Michal Hocko 2019-05-30 8:02 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-30 6:57 UTC (permalink / raw) To: Minchan Kim Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Thu 30-05-19 11:17:48, Minchan Kim wrote: > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote: > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote: > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote: > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote: > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote: > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote: > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > > > > > > > > [Cc linux-api] > > > > > > > > > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > > > > > > > > Currently, process_madvise syscall works for only one address range > > > > > > > > > > so user should call the syscall several times to give hints to > > > > > > > > > > multiple address range. > > > > > > > > > > > > > > > > > > Is that a problem? How big of a problem? Any numbers? > > > > > > > > > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up > > > > > > > > with number in the description at respin. > > > > > > > > > > > > > > Does this really have to be a fast operation? I would expect the monitor > > > > > > > is by no means a fast path. 
The system call overhead is not what it used > > > > > > > to be, sigh, but still for something that is not a hot path it should be > > > > > > > tolerable, especially when the whole operation is quite expensive on its > > > > > > > own (wrt. the syscall entry/exit). > > > > > > > > > > > > What's different with process_vm_[readv|writev] and vmsplice? > > > > > > If the range needed to be covered is a lot, vector operation makes senese > > > > > > to me. > > > > > > > > > > I am not saying that the vector API is wrong. All I am trying to say is > > > > > that the benefit is not really clear so far. If you want to push it > > > > > through then you should better get some supporting data. > > > > > > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000 > > > > ranges on ARM64 mordern device. Even though I saw 15% improvement but > > > > absoluate gain is just 1ms so I don't think it's worth to support. > > > > I will drop vector support at next revision. > > > > > > Please do keep the vector support. Absolute timing is misleading, > > > since in a tight loop, you're not going to contend on mmap_sem. We've > > > seen tons of improvements in things like camera start come from > > > coalescing mprotect calls, with the gains coming from taking and > > > releasing various locks a lot less often and bouncing around less on > > > the contended lock paths. Raw throughput doesn't tell the whole story, > > > especially on mobile. > > > > This will always be a double edge sword. Taking a lock for longer can > > improve a throughput of a single call but it would make a latency for > > anybody contending on the lock much worse. > > > > Besides that, please do not overcomplicate the thing from the early > > beginning please. Let's start with a simple and well defined remote > > madvise alternative first and build a vector API on top with some > > numbers based on _real_ workloads. 
> > First time, I didn't think about atomicity about address range race > because MADV_COLD/PAGEOUT is not critical for the race. > However you raised the atomicity issue because people would extend > hints to destructive ones easily. I agree with that and that's why > we discussed how to guarantee the race and Daniel comes up with good idea. Just for the clarification, I didn't really mean atomicity but rather a _consistency_ (essentially time to check to time to use consistency). > - vma configuration seq number via process_getinfo(2). > > We discussed the race issue without _read_ workloads/requests because > it's common sense that people might extend the syscall later. > > Here is same. For current workload, we don't need to support vector > for perfomance point of view based on my experiment. However, it's > rather limited experiment. Some configuration might have 10000+ vmas > or really slow CPU. > > Furthermore, I want to have vector support due to atomicity issue > if it's really the one we should consider. > With vector support of the API and vma configuration sequence number > from Daniel, we could support address ranges operations's atomicity. I am not sure what do you mean here. Perform all ranges atomicaly wrt. other address space modifications? If yes I am not sure we want that semantic because it can cause really long stalls for other operations but that is a discussion on its own and I would rather focus on a simple interface first. > However, since we don't introduce vector at this moment, we need to > introduce *another syscall* later to be able to handle multile ranges > all at once atomically if it's okay. Agreed. > Other thought: > Maybe we could extend address range batch syscall covers other MM > syscall like mmap/munmap/madvise/mprotect and so on because there > are multiple users that would benefit from this general batching > mechanism. 
Again a discussion on its own ;) -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-30 6:57 ` Michal Hocko @ 2019-05-30 8:02 ` Minchan Kim 2019-05-30 16:19 ` Daniel Colascione 2019-05-30 18:47 ` Michal Hocko 0 siblings, 2 replies; 68+ messages in thread From: Minchan Kim @ 2019-05-30 8:02 UTC (permalink / raw) To: Michal Hocko Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote: > On Thu 30-05-19 11:17:48, Minchan Kim wrote: > > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote: > > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote: > > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote: > > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote: > > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote: > > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote: > > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > > > > > > > > > [Cc linux-api] > > > > > > > > > > > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > > > > > > > > > Currently, process_madvise syscall works for only one address range > > > > > > > > > > > so user should call the syscall several times to give hints to > > > > > > > > > > > multiple address range. > > > > > > > > > > > > > > > > > > > > Is that a problem? How big of a problem? Any numbers? > > > > > > > > > > > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up > > > > > > > > > with number in the description at respin. > > > > > > > > > > > > > > > > Does this really have to be a fast operation? I would expect the monitor > > > > > > > > is by no means a fast path. 
The system call overhead is not what it used > > > > > > > > to be, sigh, but still for something that is not a hot path it should be > > > > > > > > tolerable, especially when the whole operation is quite expensive on its > > > > > > > > own (wrt. the syscall entry/exit). > > > > > > > > > > > > > > What's different with process_vm_[readv|writev] and vmsplice? > > > > > > > If the range needed to be covered is a lot, vector operation makes senese > > > > > > > to me. > > > > > > > > > > > > I am not saying that the vector API is wrong. All I am trying to say is > > > > > > that the benefit is not really clear so far. If you want to push it > > > > > > through then you should better get some supporting data. > > > > > > > > > > I measured 1000 madvise syscall vs. a vector range syscall with 1000 > > > > > ranges on ARM64 mordern device. Even though I saw 15% improvement but > > > > > absoluate gain is just 1ms so I don't think it's worth to support. > > > > > I will drop vector support at next revision. > > > > > > > > Please do keep the vector support. Absolute timing is misleading, > > > > since in a tight loop, you're not going to contend on mmap_sem. We've > > > > seen tons of improvements in things like camera start come from > > > > coalescing mprotect calls, with the gains coming from taking and > > > > releasing various locks a lot less often and bouncing around less on > > > > the contended lock paths. Raw throughput doesn't tell the whole story, > > > > especially on mobile. > > > > > > This will always be a double edge sword. Taking a lock for longer can > > > improve a throughput of a single call but it would make a latency for > > > anybody contending on the lock much worse. > > > > > > Besides that, please do not overcomplicate the thing from the early > > > beginning please. Let's start with a simple and well defined remote > > > madvise alternative first and build a vector API on top with some > > > numbers based on _real_ workloads. 
> > > > First time, I didn't think about atomicity about address range race > > because MADV_COLD/PAGEOUT is not critical for the race. > > However you raised the atomicity issue because people would extend > > hints to destructive ones easily. I agree with that and that's why > > we discussed how to guarantee the race and Daniel comes up with good idea. > > Just for the clarification, I didn't really mean atomicity but rather a > _consistency_ (essentially time to check to time to use consistency). What do you mean by *consistency*? Could you elaborate it more? > > > - vma configuration seq number via process_getinfo(2). > > > > We discussed the race issue without _read_ workloads/requests because > > it's common sense that people might extend the syscall later. > > > > Here is same. For current workload, we don't need to support vector > > for perfomance point of view based on my experiment. However, it's > > rather limited experiment. Some configuration might have 10000+ vmas > > or really slow CPU. > > > > Furthermore, I want to have vector support due to atomicity issue > > if it's really the one we should consider. > > With vector support of the API and vma configuration sequence number > > from Daniel, we could support address ranges operations's atomicity. > > I am not sure what do you mean here. Perform all ranges atomicaly wrt. > other address space modifications? If yes I am not sure we want that Yub, I think it's *necessary* if we want to support destructive hints via process_madvise. > semantic because it can cause really long stalls for other operations It could be or it couldn't be. For example, if we could multiplex several syscalls which we should enumerate all of page table lookup, it could be more effective rather than doing each page table on each syscall. > but that is a discussion on its own and I would rather focus on a simple > interface first. 
It seems it's time to send an RFCv2, since we have discussed a lot although we don't have a clear conclusion yet. But I still want to understand what you meant by _consistency_. Thanks for the review, Michal! It's very helpful. > > > However, since we don't introduce vector at this moment, we need to > > introduce *another syscall* later to be able to handle multile ranges > > all at once atomically if it's okay. > > Agreed. > > > Other thought: > > Maybe we could extend address range batch syscall covers other MM > > syscall like mmap/munmap/madvise/mprotect and so on because there > > are multiple users that would benefit from this general batching > > mechanism. > > Again a discussion on its own ;) > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
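One way to picture the _consistency_ question (time-of-check vs. time-of-use) together with Daniel's vma-configuration sequence number idea: the monitor snapshots the target's layout, and retries if the sequence number changed before the hints were applied. The sequence-number source below is simulated in-process; in the proposal it would come from the hypothetical process_getinfo(2), and every name here is invented for the sketch:

```c
#define _GNU_SOURCE
#include <assert.h>

/* Hypothetical: would come from process_getinfo(2) and bump whenever
 * the target's vma layout changes. Simulated with a plain counter. */
static unsigned long vma_seq;

static unsigned long get_vma_seq(void) { return vma_seq; }
static void target_maps_change(void)  { vma_seq++; }

/* Gather candidate ranges, then apply hints only if the layout is
 * unchanged since gathering; otherwise regather and retry.
 * Returns the number of attempts taken. */
int apply_hints_consistently(int simulate_race)
{
    int attempts = 0;
    for (;;) {
        attempts++;
        unsigned long seq = get_vma_seq();
        /* ... gather candidate ranges, e.g. from /proc/pid/maps ... */
        if (simulate_race && attempts == 1)
            target_maps_change();      /* layout changed under us */
        if (get_vma_seq() == seq) {
            /* ... issue the remote-madvise call on the ranges ... */
            return attempts;
        }
        /* stale snapshot: loop and regather */
    }
}
```

This gives check-to-use consistency without holding any lock across the gather/apply window, at the cost of a retry loop when the target is actively remapping.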
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-30 8:02 ` Minchan Kim @ 2019-05-30 16:19 ` Daniel Colascione 2019-05-30 18:47 ` Michal Hocko 1 sibling, 0 replies; 68+ messages in thread From: Daniel Colascione @ 2019-05-30 16:19 UTC (permalink / raw) To: Minchan Kim Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Thu, May 30, 2019 at 1:02 AM Minchan Kim <minchan@kernel.org> wrote: > > On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote: > > On Thu 30-05-19 11:17:48, Minchan Kim wrote: > > > On Wed, May 29, 2019 at 12:33:52PM +0200, Michal Hocko wrote: > > > > On Wed 29-05-19 03:08:32, Daniel Colascione wrote: > > > > > On Mon, May 27, 2019 at 12:49 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > > > > On Tue, May 21, 2019 at 12:37:26PM +0200, Michal Hocko wrote: > > > > > > > On Tue 21-05-19 19:26:13, Minchan Kim wrote: > > > > > > > > On Tue, May 21, 2019 at 08:24:21AM +0200, Michal Hocko wrote: > > > > > > > > > On Tue 21-05-19 11:48:20, Minchan Kim wrote: > > > > > > > > > > On Mon, May 20, 2019 at 11:22:58AM +0200, Michal Hocko wrote: > > > > > > > > > > > [Cc linux-api] > > > > > > > > > > > > > > > > > > > > > > On Mon 20-05-19 12:52:53, Minchan Kim wrote: > > > > > > > > > > > > Currently, process_madvise syscall works for only one address range > > > > > > > > > > > > so user should call the syscall several times to give hints to > > > > > > > > > > > > multiple address range. > > > > > > > > > > > > > > > > > > > > > > Is that a problem? How big of a problem? Any numbers? > > > > > > > > > > > > > > > > > > > > We easily have 2000+ vma so it's not trivial overhead. I will come up > > > > > > > > > > with number in the description at respin. > > > > > > > > > > > > > > > > > > Does this really have to be a fast operation? 
I would expect the monitor > > > > > > > > > is by no means a fast path. The system call overhead is not what it used > > > > > > > > > to be, sigh, but still for something that is not a hot path it should be > > > > > > > > > tolerable, especially when the whole operation is quite expensive on its > > > > > > > > > own (wrt. the syscall entry/exit). > > > > > > > > What's different with process_vm_[readv|writev] and vmsplice? > > > > > > > > If the range needed to be covered is large, a vector operation makes sense > > > > > > > > to me. > > > > > > > I am not saying that the vector API is wrong. All I am trying to say is > > > > > > > that the benefit is not really clear so far. If you want to push it > > > > > > > through then you should better get some supporting data. > > > > > > I measured 1000 madvise syscalls vs. a vector range syscall with 1000 > > > > > > ranges on a modern ARM64 device. Even though I saw a 15% improvement, the > > > > > > absolute gain is just 1ms, so I don't think it's worth supporting. > > > > > > I will drop vector support in the next revision. > > > > > Please do keep the vector support. Absolute timing is misleading, > > > > > since in a tight loop, you're not going to contend on mmap_sem. We've > > > > > seen tons of improvements in things like camera start come from > > > > > coalescing mprotect calls, with the gains coming from taking and > > > > > releasing various locks a lot less often and bouncing around less on > > > > > the contended lock paths. Raw throughput doesn't tell the whole story, > > > > > especially on mobile. > > > > This will always be a double-edged sword. Taking a lock for longer can > > > > improve the throughput of a single call but it would make the latency for > > > > anybody contending on the lock much worse. > > > > Besides that, please do not overcomplicate the thing from the very > > > > beginning.
Let's start with a simple and well defined remote > > > > madvise alternative first and build a vector API on top with some > > > > numbers based on _real_ workloads. > > > At first, I didn't think about atomicity of the address range race > > > because MADV_COLD/PAGEOUT is not critical for the race. > > > However you raised the atomicity issue because people would extend > > > hints to destructive ones easily. I agree with that and that's why > > > we discussed how to guard against the race, and Daniel came up with a good idea. > > Just for the clarification, I didn't really mean atomicity but rather > > _consistency_ (essentially time-of-check to time-of-use consistency). > What do you mean by *consistency*? Could you elaborate on it more? > > - vma configuration seq number via process_getinfo(2). > > We discussed the race issue without _real_ workloads/requests because > it's common sense that people might extend the syscall later. > > Here it is the same. For the current workload, we don't need vector support > from a performance point of view, based on my experiment. However, it's > a rather limited experiment. Some configurations might have 10000+ vmas > or a really slow CPU. > > Furthermore, I want to have vector support due to the atomicity issue > if it's really the one we should consider. > With vector support in the API and the vma configuration sequence number > from Daniel, we could support atomicity of address range operations. > I am not sure what you mean here. Perform all ranges atomically wrt. > other address space modifications? If yes I am not sure we want that Yup, I think it's *necessary* if we want to support destructive hints via process_madvise. [Puts on flame-proof suit] Here's a quick sketch of what I have in mind for process_getinfo(2). Keep in mind that it's still just a rough idea. We've had trouble efficiently learning about process and memory state of the system via procfs.
Android background memory-use scans (the android.bg daemon) consume quite a bit of CPU time for PSS; Minchan's done a lot of thinking about how we can specify desired page sets for compaction as part of this patch set; and the full procfs walks that some trace collection tools need to undertake take more than 200ms to collect (sometimes much more) due mostly to procfs iteration. ISTM we can do better. While procfs *works* on a functional level, it's inefficient due to the splatting of information we want across several different files (which need to be independently opened --- e.g., /proc/pid/oom_score_adj *and* /proc/pid/status), inefficient due to the ad-hoc text formatting, inefficient due to information over-fetch, and cumbersome due to the fundamental impedance mismatch between filesystem APIs and process lifetimes. procfs itself is also optional, which has caused various bits of awkwardness that you'll recall from the pidfd discussions. How about we solve the problem once and for all? I'm imagining a new process_getinfo(2) that solves all of these problems at the same time. I want something with a few properties: 1) if we want to learn M facts about N things, we enter the kernel once and learn all M*N things, 2) the information we collect is self-consistent (which implies atomicity most of the time), 3) we retrieve the information we want in an efficient binary format, and 4) we don't pay to learn anything not in M. I've jotted down a quick sketch of the API below; I'm curious what everyone else thinks. It'd basically look like this:

int process_getinfo(int nr_proc, int* proc, int flags,
                    unsigned long mask, void* out_buf, size_t* inout_sz)

We wouldn't use the return value for much: 0 on success and -1 on error with errno set. NR_PROC and PROC together would specify the objects we want to learn about, which would be either PIDs or PIDFDs or maybe nothing at all if FLAGS tells us to inspect every process on the system.
MASK is a bitset of facts we want to learn, described below. OUT_BUF and INOUT_SZ are for actually communicating the result. On input, the caller would fill *INOUT_SZ with the size of the buffer to which OUT_BUF points; on success, we'd fill *INOUT_SZ with the number of bytes we actually used. If the output buffer is too small, we'll truncate the result, fail the system call with E2BIG, and fill *INOUT_SZ with the number of needed bytes, inviting the caller to try again. (If a caller supplies something huge like a reusable 512KB buffer on the first call, no reallocation and retries will be necessary in practice on a typically-sized system.) The actual returned buffer is a collection of structures and data blocks starting with a struct process_info. The structures in the returned buffer sometimes contain "pointers" to other structures encoded as byte offsets from the start of the information buffer. Using offsets instead of actual pointers keeps the format the same across 32- and 64-bit versions of process_getinfo and makes it possible to relocate the returned information buffer with memcpy.

struct process_info {
  int first_procrec_offset; // struct procrec*
  // Examples of system-wide things we could ask for
  int mem_total_kb;
  int mem_free_kb;
  int mem_available_kb;
  char reserved[];
};

struct procrec {
  int next_procrec_offset; // struct procrec*
  // Following fields are self-explanatory and are only examples of the
  // kind of information we could provide.
  int tid;
  int tgid;
  char status;
  int oom_score_adj;
  struct { int real, effective, saved, fs; } uids;
  int prio;
  int comm_offset; // char*
  uint64_t rss_file_kb;
  uint64_t rss_anon_kb;
  uint64_t vm_seq;
  int first_procrec_vma_offset; // struct procrec_vma*
  char reserved[];
};

struct procrec_vma {
  int next_procrec_vma_offset; // struct procrec_vma*
  unsigned long start;
  unsigned long end;
  int backing_file_name_offset; // char*
  int prot;
  char reserved[];
};

Callers would use the returned buffer by casting it to a struct process_info and following the internal "pointers". MASK would specify which bits of information we wanted: for example, if we asked for PROCESS_VM_MEMORY_MAP, the kernel would fill in each struct procrec's first_procrec_vma_offset field and have it point to a struct procrec_vma in the returned output buffer. If we omitted PROCESS_VM_MEMORY_MAP, we'd leave that field as NULL (encoded as offset zero). The kernel would embed any strings (like comm and VMA names) into the output buffer; the precise locations would be unspecified so long as callers could find these fields via output-buffer pointers. Because all the structures are variable-length and are chained together with explicit pointers (or offsets) instead of being stuffed into a literal array, we can add additional fields to the output structures any time we want without breaking binary compatibility. Callers would tell the kernel that they're interested in the added struct fields by asking for them via bits in MASK, and kernels that don't understand those fields would just fail the system call with EINVAL or something.
Depending on how we call it, we can use this API as a bunch of different things: 1) Quick retrieval of system-wide memory counters, like /proc/meminfo 2) Quick snapshot of all process-thread identities (asking for the wildcard TID match via FLAGS) 3) Fast enumeration of one process's address space 4) Collecting process-summary VM counters (e.g., rss_anon and rss_file) for a set of processes 5) Retrieval of every VMA of every process on the system for debugging We can do all of this with one entry into the kernel and without opening any new file descriptors (unless we want to use pidfds as inputs). We can also make this operation as atomic as we want, e.g., taking mmap_sem while looking at each process and taking tasklist_lock so all the thread IDs line up with their processes. We don't necessarily need to take the mmap_sems of all processes we care about at the same time. Since this isn't a filesystem-based API, we don't have to deal with seq_file or deal with consistency issues arising from userspace programs doing strange things like reading procfs files very slowly in small chunks. Security-wise, we'd just use different access checks for different requested information bits in MASK, maybe supplying "no access" struct procrec entries if a caller doesn't happen to have access to a particular process. I suppose we can talk about whether access check failures should result in dummy values or syscall failure: maybe callers should select which behavior they want. Format-wise, we could also just return flatbuffer messages from the kernel, but I suspect that we don't want flatbuffer in the kernel right now. :-) The API I'm proposing accepts an array of processes to inspect. We could simplify it by accepting just one process and making the caller enter the kernel once per process it wants to learn about, but this simplification would make the API less useful for answering questions like "What's the RSS of every process on the system right now?". What do you think? 
^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 6/7] mm: extend process_madvise syscall to support vector arrary 2019-05-30 8:02 ` Minchan Kim 2019-05-30 16:19 ` Daniel Colascione @ 2019-05-30 18:47 ` Michal Hocko 1 sibling, 0 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-30 18:47 UTC (permalink / raw) To: Minchan Kim Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Thu 30-05-19 17:02:14, Minchan Kim wrote: > On Thu, May 30, 2019 at 08:57:55AM +0200, Michal Hocko wrote: > > On Thu 30-05-19 11:17:48, Minchan Kim wrote: [...] > > > First time, I didn't think about atomicity about address range race > > > because MADV_COLD/PAGEOUT is not critical for the race. > > > However you raised the atomicity issue because people would extend > > > hints to destructive ones easily. I agree with that and that's why > > > we discussed how to guarantee the race and Daniel comes up with good idea. > > > > Just for the clarification, I didn't really mean atomicity but rather a > > _consistency_ (essentially time to check to time to use consistency). > > What do you mean by *consistency*? Could you elaborate it more? That you operate on the object you have got by some means. In other words that the range you want to call madvise on hasn't been remapped/replaced by a different mmap operation. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20190520035254.57579-8-minchan@kernel.org>]
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER [not found] ` <20190520035254.57579-8-minchan@kernel.org> @ 2019-05-20 9:28 ` Michal Hocko 2019-05-21 2:55 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-20 9:28 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api [cc linux-api] On Mon 20-05-19 12:52:54, Minchan Kim wrote: > A system could have a much faster swap device, like zRAM. In that case, swapping > is much cheaper than file-IO on low-end storage. > In this configuration, userspace could handle a different strategy for each > kind of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > while keeping file-backed pages in the inactive LRU by MADV_COOL because > file IO is more expensive in this case, so they want to keep them in memory > until memory pressure happens. > > To support such a strategy more easily, this patch introduces > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2), like > the same filters /proc/<pid>/clear_refs already supports. > These filters can be OR-ed with other existing hints using the top two bits > of (int behavior). madvise operates on top of ranges and it is quite trivial to do the filtering from the userspace so why do we need any additional filtering? > Once either of them is set, the hint affects only the vmas of interest, > either anonymous or file-backed. > > With that, the user could call the process_madvise syscall simply with an entire > range (0x0 - 0xFFFFFFFFFFFFFFFF) and either of MADV_ANONYMOUS_FILTER and > MADV_FILE_FILTER, so there is no need to call the syscall range by range. OK, so here is the reason you want that. The immediate question is why cannot the monitor do the filtering from the userspace.
Slightly more work, all right, but less of an API to expose and that itself is a strong argument against. > * from v1r2 > * use consistent check with clear_refs to identify anon/file vma - surenb > > * from v1r1 > * use naming "filter" for new madvise option - dancol > > Signed-off-by: Minchan Kim <minchan@kernel.org> > --- > include/uapi/asm-generic/mman-common.h | 5 +++++ > mm/madvise.c | 14 ++++++++++++++ > 2 files changed, 19 insertions(+) > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > index b8e230de84a6..be59a1b90284 100644 > --- a/include/uapi/asm-generic/mman-common.h > +++ b/include/uapi/asm-generic/mman-common.h > @@ -66,6 +66,11 @@ > #define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */ > #define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */ > > +#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER)) > + > +#define MADV_ANONYMOUS_FILTER (1<<31) /* works for only anonymous vma */ > +#define MADV_FILE_FILTER (1<<30) /* works for only file-backed vma */ > + > /* compatibility flags */ > #define MAP_FILE 0 > > diff --git a/mm/madvise.c b/mm/madvise.c > index f4f569dac2bd..116131243540 100644 > --- a/mm/madvise.c > +++ b/mm/madvise.c > @@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, > int write; > size_t len; > struct blk_plug plug; > + bool anon_only, file_only; > > + anon_only = behavior & MADV_ANONYMOUS_FILTER; > + file_only = behavior & MADV_FILE_FILTER; > + > + if (anon_only && file_only) > + return error; > + > + behavior = behavior & MADV_BEHAVIOR_MASK; > if (!madvise_behavior_valid(behavior)) > return error; > > @@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, > if (end < tmp) > tmp = end; > > + if (anon_only && vma->vm_file) > + goto next; > + if (file_only && !vma->vm_file) > + goto next; > + > /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). 
*/ > error = madvise_vma(tsk, vma, &prev, start, tmp, > behavior, &pages); > if (error) > goto out; > *nr_pages += pages; > +next: > start = tmp; > if (prev && start < prev->vm_end) > start = prev->vm_end; > -- > 2.21.0.1020.gf2820cf01a-goog > -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-20 9:28 ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Michal Hocko @ 2019-05-21 2:55 ` Minchan Kim 2019-05-21 6:26 ` Michal Hocko 2019-05-21 15:33 ` Johannes Weiner 0 siblings, 2 replies; 68+ messages in thread From: Minchan Kim @ 2019-05-21 2:55 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote: > [cc linux-api] > > On Mon 20-05-19 12:52:54, Minchan Kim wrote: > > System could have much faster swap device like zRAM. In that case, swapping > > is extremely cheaper than file-IO on the low-end storage. > > In this configuration, userspace could handle different strategy for each > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > > while it keeps file-backed pages in inactive LRU by MADV_COOL because > > file IO is more expensive in this case so want to keep them in memory > > until memory pressure happens. > > > > To support such strategy easier, this patch introduces > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like > > that /proc/<pid>/clear_refs already has supported same filters. > > They are filters could be Ored with other existing hints using top two bits > > of (int behavior). > > madvise operates on top of ranges and it is quite trivial to do the > filtering from the userspace so why do we need any additional filtering? > > > Once either of them is set, the hint could affect only the interested vma > > either anonymous or file-backed. > > > > With that, user could call a process_madvise syscall simply with a entire > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and > > MADV_FILE_FILTER so there is no need to call the syscall range by range. 
> > OK, so here is the reason you want that. The immediate question is why > cannot the monitor do the filtering from the userspace. Slightly more > work, all right, but less of an API to expose and that itself is a > strong argument against. What I should do if we don't have such filter option is to enumerate all of vma via /proc/<pid>/maps and then parse every ranges and inode from string, which would be painful for 2000+ vmas. > > > * from v1r2 > > * use consistent check with clear_refs to identify anon/file vma - surenb > > > > * from v1r1 > > * use naming "filter" for new madvise option - dancol > > > > Signed-off-by: Minchan Kim <minchan@kernel.org> > > --- > > include/uapi/asm-generic/mman-common.h | 5 +++++ > > mm/madvise.c | 14 ++++++++++++++ > > 2 files changed, 19 insertions(+) > > > > diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h > > index b8e230de84a6..be59a1b90284 100644 > > --- a/include/uapi/asm-generic/mman-common.h > > +++ b/include/uapi/asm-generic/mman-common.h > > @@ -66,6 +66,11 @@ > > #define MADV_WIPEONFORK 18 /* Zero memory on fork, child only */ > > #define MADV_KEEPONFORK 19 /* Undo MADV_WIPEONFORK */ > > > > +#define MADV_BEHAVIOR_MASK (~(MADV_ANONYMOUS_FILTER|MADV_FILE_FILTER)) > > + > > +#define MADV_ANONYMOUS_FILTER (1<<31) /* works for only anonymous vma */ > > +#define MADV_FILE_FILTER (1<<30) /* works for only file-backed vma */ > > + > > /* compatibility flags */ > > #define MAP_FILE 0 > > > > diff --git a/mm/madvise.c b/mm/madvise.c > > index f4f569dac2bd..116131243540 100644 > > --- a/mm/madvise.c > > +++ b/mm/madvise.c > > @@ -1002,7 +1002,15 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, > > int write; > > size_t len; > > struct blk_plug plug; > > + bool anon_only, file_only; > > > > + anon_only = behavior & MADV_ANONYMOUS_FILTER; > > + file_only = behavior & MADV_FILE_FILTER; > > + > > + if (anon_only && file_only) > > + return error; > > + > > + 
behavior = behavior & MADV_BEHAVIOR_MASK; > > if (!madvise_behavior_valid(behavior)) > > return error; > > > > @@ -1067,12 +1075,18 @@ static int madvise_core(struct task_struct *tsk, unsigned long start, > > if (end < tmp) > > tmp = end; > > > > + if (anon_only && vma->vm_file) > > + goto next; > > + if (file_only && !vma->vm_file) > > + goto next; > > + > > /* Here vma->vm_start <= start < tmp <= (end|vma->vm_end). */ > > error = madvise_vma(tsk, vma, &prev, start, tmp, > > behavior, &pages); > > if (error) > > goto out; > > *nr_pages += pages; > > +next: > > start = tmp; > > if (prev && start < prev->vm_end) > > start = prev->vm_end; > > -- > > 2.21.0.1020.gf2820cf01a-goog > > > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-21 2:55 ` Minchan Kim @ 2019-05-21 6:26 ` Michal Hocko 2019-05-27 7:58 ` Minchan Kim 2019-05-21 15:33 ` Johannes Weiner 1 sibling, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-21 6:26 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue 21-05-19 11:55:33, Minchan Kim wrote: > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote: > > [cc linux-api] > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote: > > > System could have much faster swap device like zRAM. In that case, swapping > > > is extremely cheaper than file-IO on the low-end storage. > > > In this configuration, userspace could handle different strategy for each > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because > > > file IO is more expensive in this case so want to keep them in memory > > > until memory pressure happens. > > > > > > To support such strategy easier, this patch introduces > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like > > > that /proc/<pid>/clear_refs already has supported same filters. > > > They are filters could be Ored with other existing hints using top two bits > > > of (int behavior). > > > > madvise operates on top of ranges and it is quite trivial to do the > > filtering from the userspace so why do we need any additional filtering? > > > > > Once either of them is set, the hint could affect only the interested vma > > > either anonymous or file-backed. 
> > > > > > With that, user could call a process_madvise syscall simply with a entire > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and > > > MADV_FILE_FILTER so there is no need to call the syscall range by range. > > > > OK, so here is the reason you want that. The immediate question is why > > cannot the monitor do the filtering from the userspace. Slightly more > > work, all right, but less of an API to expose and that itself is a > > strong argument against. > > What I should do if we don't have such filter option is to enumerate all of > vma via /proc/<pid>/maps and then parse every ranges and inode from string, > which would be painful for 2000+ vmas. Painful is not an argument to add a new user API. If the existing API suits the purpose then it should be used. If it is not usable, we can think of a different way. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-21 6:26 ` Michal Hocko @ 2019-05-27 7:58 ` Minchan Kim 2019-05-27 12:44 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-27 7:58 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote: > On Tue 21-05-19 11:55:33, Minchan Kim wrote: > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote: > > > [cc linux-api] > > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote: > > > > System could have much faster swap device like zRAM. In that case, swapping > > > > is extremely cheaper than file-IO on the low-end storage. > > > > In this configuration, userspace could handle different strategy for each > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because > > > > file IO is more expensive in this case so want to keep them in memory > > > > until memory pressure happens. > > > > > > > > To support such strategy easier, this patch introduces > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like > > > > that /proc/<pid>/clear_refs already has supported same filters. > > > > They are filters could be Ored with other existing hints using top two bits > > > > of (int behavior). > > > > > > madvise operates on top of ranges and it is quite trivial to do the > > > filtering from the userspace so why do we need any additional filtering? > > > > > > > Once either of them is set, the hint could affect only the interested vma > > > > either anonymous or file-backed. 
> > > > > > > > With that, user could call a process_madvise syscall simply with a entire > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range. > > > > > > OK, so here is the reason you want that. The immediate question is why > > > cannot the monitor do the filtering from the userspace. Slightly more > > > work, all right, but less of an API to expose and that itself is a > > > strong argument against. > > > > What I should do if we don't have such filter option is to enumerate all of > > vma via /proc/<pid>/maps and then parse every ranges and inode from string, > > which would be painful for 2000+ vmas. > > Painful is not an argument to add a new user API. If the existing API > suits the purpose then it should be used. If it is not usable, we can > think of a different way. I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor. It's never trivial. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-27 7:58 ` Minchan Kim @ 2019-05-27 12:44 ` Michal Hocko 2019-05-28 3:26 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-27 12:44 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Mon 27-05-19 16:58:11, Minchan Kim wrote: > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote: > > On Tue 21-05-19 11:55:33, Minchan Kim wrote: > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote: > > > > [cc linux-api] > > > > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote: > > > > > System could have much faster swap device like zRAM. In that case, swapping > > > > > is extremely cheaper than file-IO on the low-end storage. > > > > > In this configuration, userspace could handle different strategy for each > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because > > > > > file IO is more expensive in this case so want to keep them in memory > > > > > until memory pressure happens. > > > > > > > > > > To support such strategy easier, this patch introduces > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like > > > > > that /proc/<pid>/clear_refs already has supported same filters. > > > > > They are filters could be Ored with other existing hints using top two bits > > > > > of (int behavior). > > > > > > > > madvise operates on top of ranges and it is quite trivial to do the > > > > filtering from the userspace so why do we need any additional filtering? > > > > > > > > > Once either of them is set, the hint could affect only the interested vma > > > > > either anonymous or file-backed. 
> > > > > > > > > > With that, user could call a process_madvise syscall simply with a entire > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range. > > > > > > > > OK, so here is the reason you want that. The immediate question is why > > > > cannot the monitor do the filtering from the userspace. Slightly more > > > > work, all right, but less of an API to expose and that itself is a > > > > strong argument against. > > > > > > What I should do if we don't have such filter option is to enumerate all of > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string, > > > which would be painful for 2000+ vmas. > > > > Painful is not an argument to add a new user API. If the existing API > > suits the purpose then it should be used. If it is not usable, we can > > think of a different way. > > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor. > It's never trivial. This is not the only option. Have you tried to simply use /proc/<pid>/map_files interface? This will provide you with all the file backed mappings. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-27 12:44 ` Michal Hocko @ 2019-05-28 3:26 ` Minchan Kim 2019-05-28 6:29 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-28 3:26 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote: > On Mon 27-05-19 16:58:11, Minchan Kim wrote: > > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote: > > > On Tue 21-05-19 11:55:33, Minchan Kim wrote: > > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote: > > > > > [cc linux-api] > > > > > > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote: > > > > > > System could have much faster swap device like zRAM. In that case, swapping > > > > > > is extremely cheaper than file-IO on the low-end storage. > > > > > > In this configuration, userspace could handle different strategy for each > > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because > > > > > > file IO is more expensive in this case so want to keep them in memory > > > > > > until memory pressure happens. > > > > > > > > > > > > To support such strategy easier, this patch introduces > > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like > > > > > > that /proc/<pid>/clear_refs already has supported same filters. > > > > > > They are filters could be Ored with other existing hints using top two bits > > > > > > of (int behavior). > > > > > > > > > > madvise operates on top of ranges and it is quite trivial to do the > > > > > filtering from the userspace so why do we need any additional filtering? 
> > > > > > > > > > > Once either of them is set, the hint could affect only the interested vma > > > > > > either anonymous or file-backed. > > > > > > > > > > > > With that, user could call a process_madvise syscall simply with a entire > > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and > > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range. > > > > > > > > > > OK, so here is the reason you want that. The immediate question is why > > > > > cannot the monitor do the filtering from the userspace. Slightly more > > > > > work, all right, but less of an API to expose and that itself is a > > > > > strong argument against. > > > > > > > > What I should do if we don't have such filter option is to enumerate all of > > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string, > > > > which would be painful for 2000+ vmas. > > > > > > Painful is not an argument to add a new user API. If the existing API > > > suits the purpose then it should be used. If it is not usable, we can > > > think of a different way. > > > > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern > > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor. > > It's never trivial. > > This is not the only option. Have you tried to simply use > /proc/<pid>/map_files interface? This will provide you with all the file > backed mappings. I compared maps vs. map_files with 3036 file-backed vma. Test scenario is to dump all of vmas of the process and parse address ranges. For map_files, it's easy to parse each address range because directory name itself is range. However, in case of maps, I need to parse each range line by line so need to scan all of lines. 
(maps covers additional non-file-backed vmas, so its nr_vma is a little bigger)

performance mode:
map_files: nr_vma 3036 usec 13387
maps     : nr_vma 3078 usec 12923

powersave mode:
map_files: nr_vma 3036 usec 52614
maps     : nr_vma 3078 usec 41089

map_files is slower than maps if we dump all of the vmas. I guess the
directory operations need much more work (e.g., dentry lookup,
instantiation) compared to maps.

^ permalink raw reply	[flat|nested] 68+ messages in thread
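The map_files walk being timed can be sketched as below: the address range parses straight out of each directory entry's name, but every entry still costs per-entry directory operations (the dentry work Minchan suspects). The helper names are illustrative:

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>

/* Entries in /proc/<pid>/map_files are named "<start>-<end>" in hex,
 * so no line-by-line text scanning is needed. */
static int parse_range_name(const char *name, unsigned long *start,
                            unsigned long *end)
{
    return sscanf(name, "%lx-%lx", start, end) == 2 ? 0 : -1;
}

/* Count the file-backed vmas of a process. Returns -1 if map_files is
 * unreadable (access to other processes may require privileges). */
static int count_file_backed_vmas(int pid)
{
    char path[64];
    struct dirent *de;
    DIR *dir;
    int nr_vma = 0;

    snprintf(path, sizeof(path), "/proc/%d/map_files", pid);
    dir = opendir(path);
    if (!dir)
        return -1;
    while ((de = readdir(dir)) != NULL) {
        unsigned long start, end;

        if (parse_range_name(de->d_name, &start, &end) == 0)
            nr_vma++;   /* one file-backed vma per entry */
    }
    closedir(dir);
    return nr_vma;
}
```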
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 3:26 ` Minchan Kim @ 2019-05-28 6:29 ` Michal Hocko 2019-05-28 8:13 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-28 6:29 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue 28-05-19 12:26:32, Minchan Kim wrote: > On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote: > > On Mon 27-05-19 16:58:11, Minchan Kim wrote: > > > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote: > > > > On Tue 21-05-19 11:55:33, Minchan Kim wrote: > > > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote: > > > > > > [cc linux-api] > > > > > > > > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote: > > > > > > > System could have much faster swap device like zRAM. In that case, swapping > > > > > > > is extremely cheaper than file-IO on the low-end storage. > > > > > > > In this configuration, userspace could handle different strategy for each > > > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > > > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because > > > > > > > file IO is more expensive in this case so want to keep them in memory > > > > > > > until memory pressure happens. > > > > > > > > > > > > > > To support such strategy easier, this patch introduces > > > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like > > > > > > > that /proc/<pid>/clear_refs already has supported same filters. > > > > > > > They are filters could be Ored with other existing hints using top two bits > > > > > > > of (int behavior). > > > > > > > > > > > > madvise operates on top of ranges and it is quite trivial to do the > > > > > > filtering from the userspace so why do we need any additional filtering? 
> > > > > > > > > > > > > Once either of them is set, the hint could affect only the interested vma > > > > > > > either anonymous or file-backed. > > > > > > > > > > > > > > With that, user could call a process_madvise syscall simply with a entire > > > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and > > > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range. > > > > > > > > > > > > OK, so here is the reason you want that. The immediate question is why > > > > > > cannot the monitor do the filtering from the userspace. Slightly more > > > > > > work, all right, but less of an API to expose and that itself is a > > > > > > strong argument against. > > > > > > > > > > What I should do if we don't have such filter option is to enumerate all of > > > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string, > > > > > which would be painful for 2000+ vmas. > > > > > > > > Painful is not an argument to add a new user API. If the existing API > > > > suits the purpose then it should be used. If it is not usable, we can > > > > think of a different way. > > > > > > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern > > > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor. > > > It's never trivial. > > > > This is not the only option. Have you tried to simply use > > /proc/<pid>/map_files interface? This will provide you with all the file > > backed mappings. > > I compared maps vs. map_files with 3036 file-backed vma. > Test scenario is to dump all of vmas of the process and parse address > ranges. > For map_files, it's easy to parse each address range because directory name > itself is range. However, in case of maps, I need to parse each range > line by line so need to scan all of lines. 
> > (maps cover additional non-file-backed vmas so nr_vma is a little bigger) > > performance mode: > map_files: nr_vma 3036 usec 13387 > maps : nr_vma 3078 usec 12923 > > powersave mode: > > map_files: nr_vma 3036 usec 52614 > maps : nr_vma 3078 usec 41089 > > map_files is slower than maps if we dump all of vmas. I guess directory > operation needs much more jobs(e.g., dentry lookup, instantiation) > compared to maps. OK, that is somehow surprising. I am still not convinced the filter is a good idea though. The primary reason is that it encourages using madvise on a wide range without having a clue what the range contains. E.g. the full address range and rely the right thing will happen. Do we really want madvise to operate in that mode? Btw. if we went with the per vma fd approach then you would get this feature automatically because map_files would refer to file backed mappings while map_anon could refer only to anonymous mappings. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 6:29 ` Michal Hocko @ 2019-05-28 8:13 ` Minchan Kim 2019-05-28 8:31 ` Daniel Colascione 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-28 8:13 UTC (permalink / raw) To: Michal Hocko Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue, May 28, 2019 at 08:29:47AM +0200, Michal Hocko wrote: > On Tue 28-05-19 12:26:32, Minchan Kim wrote: > > On Mon, May 27, 2019 at 02:44:11PM +0200, Michal Hocko wrote: > > > On Mon 27-05-19 16:58:11, Minchan Kim wrote: > > > > On Tue, May 21, 2019 at 08:26:28AM +0200, Michal Hocko wrote: > > > > > On Tue 21-05-19 11:55:33, Minchan Kim wrote: > > > > > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote: > > > > > > > [cc linux-api] > > > > > > > > > > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote: > > > > > > > > System could have much faster swap device like zRAM. In that case, swapping > > > > > > > > is extremely cheaper than file-IO on the low-end storage. > > > > > > > > In this configuration, userspace could handle different strategy for each > > > > > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > > > > > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because > > > > > > > > file IO is more expensive in this case so want to keep them in memory > > > > > > > > until memory pressure happens. > > > > > > > > > > > > > > > > To support such strategy easier, this patch introduces > > > > > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like > > > > > > > > that /proc/<pid>/clear_refs already has supported same filters. > > > > > > > > They are filters could be Ored with other existing hints using top two bits > > > > > > > > of (int behavior). 
> > > > > > > > > > > > > > madvise operates on top of ranges and it is quite trivial to do the > > > > > > > filtering from the userspace so why do we need any additional filtering? > > > > > > > > > > > > > > > Once either of them is set, the hint could affect only the interested vma > > > > > > > > either anonymous or file-backed. > > > > > > > > > > > > > > > > With that, user could call a process_madvise syscall simply with a entire > > > > > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and > > > > > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range. > > > > > > > > > > > > > > OK, so here is the reason you want that. The immediate question is why > > > > > > > cannot the monitor do the filtering from the userspace. Slightly more > > > > > > > work, all right, but less of an API to expose and that itself is a > > > > > > > strong argument against. > > > > > > > > > > > > What I should do if we don't have such filter option is to enumerate all of > > > > > > vma via /proc/<pid>/maps and then parse every ranges and inode from string, > > > > > > which would be painful for 2000+ vmas. > > > > > > > > > > Painful is not an argument to add a new user API. If the existing API > > > > > suits the purpose then it should be used. If it is not usable, we can > > > > > think of a different way. > > > > > > > > I measured 1568 vma parsing overhead of /proc/<pid>/maps in ARM64 modern > > > > mobile CPU. It takes 60ms and 185ms on big cores depending on cpu governor. > > > > It's never trivial. > > > > > > This is not the only option. Have you tried to simply use > > > /proc/<pid>/map_files interface? This will provide you with all the file > > > backed mappings. > > > > I compared maps vs. map_files with 3036 file-backed vma. > > Test scenario is to dump all of vmas of the process and parse address > > ranges. > > For map_files, it's easy to parse each address range because directory name > > itself is range. 
However, in case of maps, I need to parse each range
> > line by line so need to scan all of lines.
> >
> > (maps cover additional non-file-backed vmas so nr_vma is a little bigger)
> >
> > performance mode:
> > map_files: nr_vma 3036 usec 13387
> > maps     : nr_vma 3078 usec 12923
> >
> > powersave mode:
> > map_files: nr_vma 3036 usec 52614
> > maps     : nr_vma 3078 usec 41089
> >
> > map_files is slower than maps if we dump all of vmas. I guess directory
> > operation needs much more jobs(e.g., dentry lookup, instantiation)
> > compared to maps.
>
> OK, that is somehow surprising. I am still not convinced the filter is a
> good idea though. The primary reason is that it encourages using madvise
> on a wide range without having a clue what the range contains. E.g. the
> full address range and rely the right thing will happen. Do we really
> want madvise to operate in that mode?

If a user space daemon (e.g., the activity manager service) could know that
a certain process has been background and idle for a while, then yes, that
would be a good option.

>
> Btw. if we went with the per vma fd approach then you would get this
> feature automatically because map_files would refer to file backed
> mappings while map_anon could refer only to anonymous mappings.

The reason to add such a filter option is to avoid the parsing overhead, so
map_anon wouldn't be helpful.

>
> --
> Michal Hocko
> SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
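For reference, the calling convention being debated would look roughly like the sketch below. MADV_COLD and the filter bits come from the unmerged RFC, so every constant here is illustrative only, not real UAPI:

```c
#include <assert.h>

/* Hypothetical values: the RFC reserves the top two bits of the
 * (int behavior) argument for filters OR-ed into an ordinary hint.
 * None of these constants exist in any released kernel. */
#define MADV_COLD_RFC           5
#define MADV_ANONYMOUS_FILTER   (1u << 30)
#define MADV_FILE_FILTER        (1u << 31)

/* Split a combined behavior value back into its hint and filter parts,
 * as the kernel-side validity check would have to in this design. */
static unsigned int behavior_hint(unsigned int behavior)
{
    return behavior & ~(MADV_ANONYMOUS_FILTER | MADV_FILE_FILTER);
}

static unsigned int behavior_filters(unsigned int behavior)
{
    return behavior & (MADV_ANONYMOUS_FILTER | MADV_FILE_FILTER);
}
```

A caller could then cover the entire address space in one call, e.g. madvise(0, ~0UL, MADV_COLD_RFC | MADV_ANONYMOUS_FILTER), which is exactly the wide-range pattern Michal objects to.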
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 8:13 ` Minchan Kim @ 2019-05-28 8:31 ` Daniel Colascione 2019-05-28 8:49 ` Minchan Kim 0 siblings, 1 reply; 68+ messages in thread From: Daniel Colascione @ 2019-05-28 8:31 UTC (permalink / raw) To: Minchan Kim Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API

On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote:
> if we went with the per vma fd approach then you would get this
> > feature automatically because map_files would refer to file backed
> > mappings while map_anon could refer only to anonymous mappings.
>
> The reason to add such filter option is to avoid the parsing overhead
> so map_anon wouldn't be helpful.

Without chiming in on whether the filter option is a good idea, I'd like
to suggest providing an efficient binary interface for pulling memory map
information out of processes. Some single-system-call method for
retrieving a binary snapshot of a process's address space, complete with
attributes (selectable, like statx?) for each VMA, would reduce
complexity and increase performance in a variety of areas, e.g., Android
memory map debugging commands.

^ permalink raw reply	[flat|nested] 68+ messages in thread
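A purely hypothetical sketch of the "statx-like" binary per-VMA record Daniel is suggesting; no such kernel interface exists, and every name and field below is invented for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical request/result mask: the caller asks for some fields,
 * and the kernel reports which ones it actually filled in. */
#define VMA_INFO_RANGE  (1u << 0)   /* start/end valid */
#define VMA_INFO_PROT   (1u << 1)   /* prot/flags valid */
#define VMA_INFO_FILE   (1u << 2)   /* inode valid (file-backed only) */

/* One fixed-size record per vma, so an address space snapshot is just
 * an array that one syscall could fill. */
struct vma_info {
    uint64_t start;
    uint64_t end;
    uint32_t prot;   /* PROT_* bits */
    uint32_t flags;  /* e.g. shared vs. private */
    uint64_t inode;  /* backing file, if any */
    uint32_t mask;   /* VMA_INFO_* bits that are valid */
    uint32_t pad;    /* keep the record 8-byte aligned, fixed size */
};
```

Fixed-size binary records avoid the per-line text parsing of /proc/<pid>/maps entirely; variable-length data (e.g. path strings) would need the more complicated layout Daniel footnotes later.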
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 8:31 ` Daniel Colascione @ 2019-05-28 8:49 ` Minchan Kim 2019-05-28 9:08 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-28 8:49 UTC (permalink / raw) To: Daniel Colascione Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > if we went with the per vma fd approach then you would get this > > > feature automatically because map_files would refer to file backed > > > mappings while map_anon could refer only to anonymous mappings. > > > > The reason to add such filter option is to avoid the parsing overhead > > so map_anon wouldn't be helpful. > > Without chiming on whether the filter option is a good idea, I'd like > to suggest that providing an efficient binary interfaces for pulling > memory map information out of processes. Some single-system-call > method for retrieving a binary snapshot of a process's address space > complete with attributes (selectable, like statx?) for each VMA would > reduce complexity and increase performance in a variety of areas, > e.g., Android memory map debugging commands. I agree it's the best we can get *generally*. Michal, any opinion? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 8:49 ` Minchan Kim @ 2019-05-28 9:08 ` Michal Hocko 2019-05-28 9:39 ` Daniel Colascione 2019-05-28 10:32 ` Minchan Kim 0 siblings, 2 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-28 9:08 UTC (permalink / raw) To: Minchan Kim Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue 28-05-19 17:49:27, Minchan Kim wrote: > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > if we went with the per vma fd approach then you would get this > > > > feature automatically because map_files would refer to file backed > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > so map_anon wouldn't be helpful. > > > > Without chiming on whether the filter option is a good idea, I'd like > > to suggest that providing an efficient binary interfaces for pulling > > memory map information out of processes. Some single-system-call > > method for retrieving a binary snapshot of a process's address space > > complete with attributes (selectable, like statx?) for each VMA would > > reduce complexity and increase performance in a variety of areas, > > e.g., Android memory map debugging commands. > > I agree it's the best we can get *generally*. > Michal, any opinion? I am not really sure this is directly related. I think the primary question that we have to sort out first is whether we want to have the remote madvise call process or vma fd based. This is an important distinction wrt. usability. I have only seen pid vs. pidfd discussions so far unfortunately. An interface to query address range information is a separate but although a related topic. 
We have /proc/<pid>/[s]maps for that right now and I understand it is not a general win for all usecases because it tends to be slow for some. I can see how /proc/<pid>/map_anons could provide per vma information in a binary form via a fd based interface. But I would rather not conflate those two discussions much - well except if it could give one of the approaches more justification but let's focus on the madvise part first. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 9:08 ` Michal Hocko @ 2019-05-28 9:39 ` Daniel Colascione 2019-05-28 10:33 ` Michal Hocko 2019-05-28 10:32 ` Minchan Kim 1 sibling, 1 reply; 68+ messages in thread From: Daniel Colascione @ 2019-05-28 9:39 UTC (permalink / raw) To: Michal Hocko Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > if we went with the per vma fd approach then you would get this > > > > > feature automatically because map_files would refer to file backed > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > so map_anon wouldn't be helpful. > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > to suggest that providing an efficient binary interfaces for pulling > > > memory map information out of processes. Some single-system-call > > > method for retrieving a binary snapshot of a process's address space > > > complete with attributes (selectable, like statx?) for each VMA would > > > reduce complexity and increase performance in a variety of areas, > > > e.g., Android memory map debugging commands. > > > > I agree it's the best we can get *generally*. > > Michal, any opinion? > > I am not really sure this is directly related. I think the primary > question that we have to sort out first is whether we want to have > the remote madvise call process or vma fd based. This is an important > distinction wrt. usability. I have only seen pid vs. 
pidfd discussions > so far unfortunately. I don't think the vma fd approach is viable. We have some processes with a *lot* of VMAs --- system_server had 4204 when I checked just now (and that's typical) --- and an FD operation per VMA would be excessive. VMAs also come and go pretty easily depending on changes in protections and various faults. It's also not entirely clear what the semantics of vma FDs should be over address space mutations, while the semantics of address ranges are well-understood. I would much prefer an interface operating on address ranges to one operating on VMA FDs, both for efficiency and for consistency with other memory management APIs. > An interface to query address range information is a separate but > although a related topic. We have /proc/<pid>/[s]maps for that right > now and I understand it is not a general win for all usecases because > it tends to be slow for some. I can see how /proc/<pid>/map_anons could > provide per vma information in a binary form via a fd based interface. > But I would rather not conflate those two discussions much - well except > if it could give one of the approaches more justification but let's > focus on the madvise part first. I don't think it's a good idea to focus on one feature in a multi-feature change when the interactions between features can be very important for overall design of the multi-feature system and the design of each feature. Here's my thinking on the high-level design: I'm imagining an address-range system that would work like this: we'd create some kind of process_vm_getinfo(2) system call [1] that would accept a statx-like attribute map and a pid/fd parameter as input and return, on output, two things: 1) an array [2] of VMA descriptors containing the requested information, and 2) a VMA configuration sequence number. 
We'd then have process_madvise() and other cross-process VM interfaces
accept both address ranges and this sequence number; they'd succeed only if
the VMA configuration sequence number is still current, i.e., the target
process hasn't changed its VMA configuration (implicitly or explicitly)
since the call to process_vm_getinfo().

This way, a process A that wants to perform some VM operation on process B
can slurp B's VMA configuration using process_vm_getinfo(), figure out what
it wants to do, and attempt to do it. If B modifies its memory map in the
meantime, so that A finds its local knowledge of B's memory map has become
invalid between the process_vm_getinfo() and A taking some action based on
the result, A can retry [3]. While A could instead ptrace or otherwise
suspend B, *then* read B's memory map (knowing B is quiescent), *then*
operate on B, the optimistic approach I'm describing would be much
lighter-weight in the typical case. It's also pretty simple, IMHO. If the
"operate on B" step is some kind of vectorized operation over multiple
address ranges, this approach also gets us all-or-nothing semantics.

Or maybe the whole sequence number thing is overkill and we don't need
atomicity? But if there's a concern that A shouldn't operate on B's memory
without knowing what it's operating on, then the scheme I've proposed above
solves this knowledge problem in a pretty lightweight way.

[1] or some other interface
[2] or something more complicated if we want the descriptors to contain
    variable-length elements, e.g., strings
[3] or override the sequence number check if it's feeling bold?

^ permalink raw reply	[flat|nested] 68+ messages in thread
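The optimistic snapshot/act/retry protocol Daniel describes reduces to a small skeleton. The callbacks below stand in for the hypothetical process_vm_getinfo()/process_madvise() pair, neither of which exists as proposed, so the fakes only simulate the protocol:

```c
#include <assert.h>
#include <errno.h>

/* snapshot() returns the target's current layout sequence number;
 * act() performs the operation and returns -EAGAIN if the sequence
 * number it was given is stale. */
struct vm_ops {
    long (*snapshot)(void *ctx, unsigned long *seq);
    long (*act)(void *ctx, unsigned long seq);
};

static long act_with_retry(const struct vm_ops *ops, void *ctx, int max_tries)
{
    long ret = -EAGAIN;

    while (max_tries-- > 0) {
        unsigned long seq;

        ret = ops->snapshot(ctx, &seq);
        if (ret)
            return ret;
        ret = ops->act(ctx, seq);
        if (ret != -EAGAIN)
            break;              /* success, or a real error */
    }
    return ret;
}

/* A fake target whose layout changes once, concurrently with the
 * first attempt, to exercise the retry path. */
struct fake_target {
    unsigned long layout_seq;
    int attempts;
};

static long fake_snapshot(void *ctx, unsigned long *seq)
{
    *seq = ((struct fake_target *)ctx)->layout_seq;
    return 0;
}

static long fake_act(void *ctx, unsigned long seq)
{
    struct fake_target *t = ctx;

    if (t->attempts++ == 0) {
        t->layout_seq++;        /* simulated concurrent mmap/munmap */
        return -EAGAIN;
    }
    return seq == t->layout_seq ? 0 : -EAGAIN;
}
```

The first attempt fails validation because the layout changed underneath it; the second snapshot observes the new layout and the operation succeeds, which is the lightweight alternative to suspending the target.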
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 9:39 ` Daniel Colascione @ 2019-05-28 10:33 ` Michal Hocko 2019-05-28 11:21 ` Daniel Colascione 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-28 10:33 UTC (permalink / raw) To: Daniel Colascione Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue 28-05-19 02:39:03, Daniel Colascione wrote: > On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > if we went with the per vma fd approach then you would get this > > > > > > feature automatically because map_files would refer to file backed > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > so map_anon wouldn't be helpful. > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > to suggest that providing an efficient binary interfaces for pulling > > > > memory map information out of processes. Some single-system-call > > > > method for retrieving a binary snapshot of a process's address space > > > > complete with attributes (selectable, like statx?) for each VMA would > > > > reduce complexity and increase performance in a variety of areas, > > > > e.g., Android memory map debugging commands. > > > > > > I agree it's the best we can get *generally*. > > > Michal, any opinion? > > > > I am not really sure this is directly related. I think the primary > > question that we have to sort out first is whether we want to have > > the remote madvise call process or vma fd based. 
This is an important > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > so far unfortunately. > > I don't think the vma fd approach is viable. We have some processes > with a *lot* of VMAs --- system_server had 4204 when I checked just > now (and that's typical) --- and an FD operation per VMA would be > excessive. What do you mean by excessive here? Do you expect the process to have them open all at once? > VMAs also come and go pretty easily depending on changes in > protections and various faults. Is this really too much different from /proc/<pid>/map_files? [...] > > An interface to query address range information is a separate but > > although a related topic. We have /proc/<pid>/[s]maps for that right > > now and I understand it is not a general win for all usecases because > > it tends to be slow for some. I can see how /proc/<pid>/map_anons could > > provide per vma information in a binary form via a fd based interface. > > But I would rather not conflate those two discussions much - well except > > if it could give one of the approaches more justification but let's > > focus on the madvise part first. > > I don't think it's a good idea to focus on one feature in a > multi-feature change when the interactions between features can be > very important for overall design of the multi-feature system and the > design of each feature. > > Here's my thinking on the high-level design: > > I'm imagining an address-range system that would work like this: we'd > create some kind of process_vm_getinfo(2) system call [1] that would > accept a statx-like attribute map and a pid/fd parameter as input and > return, on output, two things: 1) an array [2] of VMA descriptors > containing the requested information, and 2) a VMA configuration > sequence number. 
We'd then have process_madvise() and other > cross-process VM interfaces accept both address ranges and this > sequence number; they'd succeed only if the VMA configuration sequence > number is still current, i.e., the target process hasn't changed its > VMA configuration (implicitly or explicitly) since the call to > process_vm_getinfo(). The sequence number is essentially a cookie that is transparent to the userspace right? If yes, how does it differ from a fd (returned from /proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can be used to revalidate when the operation is requested and fail if something has changed. Moreover we already do have a fd based madvise syscall so there shouldn't be really a large need to add a new set of syscalls. [...] > Or maybe the whole sequence number thing is overkill and we don't need > atomicity? But if there's a concern that A shouldn't operate on B's > memory without knowing what it's operating on, then the scheme I've > proposed above solves this knowledge problem in a pretty lightweight > way. This is the main question here. Do we really want to enforce an external synchronization between the two processes to make sure that they are both operating on the same range - aka protect from the range going away and being reused for a different purpose. Right now it wouldn't be fatal because both operations are non destructive but I can imagine that there will be more madvise operations to follow (including those that are destructive) because people will simply find usecases for that. This should be reflected in the proposed API. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 10:33 ` Michal Hocko @ 2019-05-28 11:21 ` Daniel Colascione 2019-05-28 11:49 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Daniel Colascione @ 2019-05-28 11:21 UTC (permalink / raw) To: Michal Hocko Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 28-05-19 02:39:03, Daniel Colascione wrote: > > On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > memory map information out of processes. Some single-system-call > > > > > method for retrieving a binary snapshot of a process's address space > > > > > complete with attributes (selectable, like statx?) for each VMA would > > > > > reduce complexity and increase performance in a variety of areas, > > > > > e.g., Android memory map debugging commands. > > > > > > > > I agree it's the best we can get *generally*. > > > > Michal, any opinion? > > > > > > I am not really sure this is directly related. 
I think the primary > > > question that we have to sort out first is whether we want to have > > > the remote madvise call process or vma fd based. This is an important > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > so far unfortunately. > > > > I don't think the vma fd approach is viable. We have some processes > > with a *lot* of VMAs --- system_server had 4204 when I checked just > > now (and that's typical) --- and an FD operation per VMA would be > > excessive. > > What do you mean by excessive here? Do you expect the process to have > them open all at once? Minchan's already done timing. More broadly, in an era with various speculative execution mitigations, making a system call is pretty expensive. If we have two options for remote VMA manipulation, one that requires thousands of system calls (with the count proportional to the address space size of the process) and one that requires only a few system calls no matter how large the target process is, the latter ought to start off with more points than the former under any kind of design scoring. > > VMAs also come and go pretty easily depending on changes in > > protections and various faults. > > Is this really too much different from /proc/<pid>/map_files? It's very different. See below. > > > An interface to query address range information is a separate but > > > although a related topic. We have /proc/<pid>/[s]maps for that right > > > now and I understand it is not a general win for all usecases because > > > it tends to be slow for some. I can see how /proc/<pid>/map_anons could > > > provide per vma information in a binary form via a fd based interface. > > > But I would rather not conflate those two discussions much - well except > > > if it could give one of the approaches more justification but let's > > > focus on the madvise part first. 
> > > > I don't think it's a good idea to focus on one feature in a > > multi-feature change when the interactions between features can be > > very important for overall design of the multi-feature system and the > > design of each feature. > > > > Here's my thinking on the high-level design: > > > > I'm imagining an address-range system that would work like this: we'd > > create some kind of process_vm_getinfo(2) system call [1] that would > > accept a statx-like attribute map and a pid/fd parameter as input and > > return, on output, two things: 1) an array [2] of VMA descriptors > > containing the requested information, and 2) a VMA configuration > > sequence number. We'd then have process_madvise() and other > > cross-process VM interfaces accept both address ranges and this > > sequence number; they'd succeed only if the VMA configuration sequence > > number is still current, i.e., the target process hasn't changed its > > VMA configuration (implicitly or explicitly) since the call to > > process_vm_getinfo(). > > The sequence number is essentially a cookie that is transparent to the > userspace right? If yes, how does it differ from a fd (returned from > /proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can If you want to operate on N VMAs simultaneously under an FD-per-VMA model, you'd need to have those N FDs all open at the same time *and* add some kind of system call that accepted those N FDs and an operation to perform. The sequence number I'm proposing also applies to the whole address space, not just one VMA. Even if you did have these N FDs open all at once and supplied them all to some batch operation, you couldn't guarantee via the FD mechanism that some *new* VMA didn't appear in the address range you want to manipulate. A global sequence number would catch this case. 
I still think supplying a list of address ranges (like we already do for scatter-gather IO) is less error-prone, less resource-intensive, more consistent with existing practice, and equally flexible, especially if we start supporting destructive cross-process memory operations, which may be useful for things like checkpointing and optimizing process startup. Besides: process_vm_readv and process_vm_writev already work on address ranges. Why should other cross-process memory APIs use a very different model for naming memory regions? > be used to revalidate when the operation is requested and fail if > something has changed. Moreover we already do have a fd based madvise > syscall so there shouldn't be really a large need to add a new set of > syscalls. We have various system calls that provide hints for open files, but the memory operations are distinct. Modeling anonymous memory as a kind of file-backed memory for purposes of VMA manipulation would also be a departure from existing practice. Can you help me understand why you seem to favor the FD-per-VMA approach so heavily? I don't see any arguments *for* an FD-per-VMA model for remote memory manipulation and I see a lot of arguments against it. Is there some compelling advantage I'm missing? > > Or maybe the whole sequence number thing is overkill and we don't need > > atomicity? But if there's a concern that A shouldn't operate on B's > > memory without knowing what it's operating on, then the scheme I've > > proposed above solves this knowledge problem in a pretty lightweight > > way. > > This is the main question here. Do we really want to enforce an external > synchronization between the two processes to make sure that they are > both operating on the same range - aka protect from the range going away > and being reused for a different purpose. 
Right now it wouldn't be fatal > because both operations are non destructive but I can imagine that there > will be more madvise operations to follow (including those that are > destructive) because people will simply find usecases for that. This > should be reflected in the proposed API. A sequence number gives us this synchronization at very low cost and adds safety. It's also a general-purpose mechanism that would safeguard *any* cross-process VM operation, not just the VM operations we're discussing right now. ^ permalink raw reply [flat|nested] 68+ messages in thread
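The snapshot-plus-cookie semantics proposed above can be sketched as a userspace toy model. Nothing below exists in any kernel: process_vm_getinfo() and the sequence-number revalidation are hypothetical, and the Python class merely models the check-and-fail behavior being described.

```python
import errno

class AddressSpace:
    """Toy model of a target process's VMA list plus a global
    configuration sequence number (hypothetical semantics)."""
    def __init__(self):
        self.vmas = []   # list of (start, length) tuples
        self.seq = 0     # bumped on ANY VMA-layout change

    def mmap(self, start, length):
        self.vmas.append((start, length))
        self.seq += 1    # an implicit layout change invalidates old cookies

    def getinfo(self):
        # Analogue of the proposed process_vm_getinfo(): a snapshot of
        # the VMA descriptors plus the current sequence number (cookie).
        return list(self.vmas), self.seq

    def madvise(self, ranges, seq):
        # Analogue of a process_madvise() that takes the cookie:
        # fail if the layout changed since the snapshot was taken.
        if seq != self.seq:
            return -errno.EAGAIN
        return 0  # the hint would be applied to `ranges` here

target = AddressSpace()
target.mmap(0x1000, 0x2000)

vmas, cookie = target.getinfo()
ok = target.madvise(vmas, cookie)     # layout unchanged: succeeds

target.mmap(0x5000, 0x1000)           # a *new* VMA appears
stale = target.madvise(vmas, cookie)  # stale cookie: rejected
```

Note that the cookie covers the whole address space, so the second call fails even though the originally snapshotted VMAs are untouched; that is exactly the "new VMA appeared in the range" case an FD-per-VMA cookie cannot catch.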
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:21 ` Daniel Colascione @ 2019-05-28 11:49 ` Michal Hocko 2019-05-28 12:11 ` Daniel Colascione 0 siblings, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-28 11:49 UTC (permalink / raw) To: Daniel Colascione Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue 28-05-19 04:21:44, Daniel Colascione wrote: > On Tue, May 28, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Tue 28-05-19 02:39:03, Daniel Colascione wrote: > > > On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > memory map information out of processes. Some single-system-call > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > complete with attributes (selectable, like statx?) for each VMA would > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > Michal, any opinion? 
> > > > > > > > I am not really sure this is directly related. I think the primary > > > > question that we have to sort out first is whether we want to have > > > > the remote madvise call process or vma fd based. This is an important > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > so far unfortunately. > > > > > > I don't think the vma fd approach is viable. We have some processes > > > with a *lot* of VMAs --- system_server had 4204 when I checked just > > > now (and that's typical) --- and an FD operation per VMA would be > > > excessive. > > > > What do you mean by excessive here? Do you expect the process to have > > them open all at once? > > Minchan's already done timing. More broadly, in an era with various > speculative execution mitigations, making a system call is pretty > expensive. This is a completely separate discussion. This could be argued about many other syscalls. Let's make the semantics correct first before we even start thinking about multiplexing. It is easier to multiplex on an existing and sane interface. Btw. Minchan concluded that multiplexing is not really all that important based on his numbers http://lkml.kernel.org/r/20190527074940.GB6879@google.com [...] > > Is this really too much different from /proc/<pid>/map_files? > > It's very different. See below. > > > > An interface to query address range information is a separate but > > > > although a related topic. We have /proc/<pid>/[s]maps for that right > > > > now and I understand it is not a general win for all usecases because > > > > it tends to be slow for some. I can see how /proc/<pid>/map_anons could > > > > provide per vma information in a binary form via a fd based interface. > > > > But I would rather not conflate those two discussions much - well except > > > > if it could give one of the approaches more justification but let's > > > > focus on the madvise part first. 
> > > > > > I don't think it's a good idea to focus on one feature in a > > > multi-feature change when the interactions between features can be > > > very important for overall design of the multi-feature system and the > > > design of each feature. > > > > > > Here's my thinking on the high-level design: > > > > > > I'm imagining an address-range system that would work like this: we'd > > > create some kind of process_vm_getinfo(2) system call [1] that would > > > accept a statx-like attribute map and a pid/fd parameter as input and > > > return, on output, two things: 1) an array [2] of VMA descriptors > > > containing the requested information, and 2) a VMA configuration > > > sequence number. We'd then have process_madvise() and other > > > cross-process VM interfaces accept both address ranges and this > > > sequence number; they'd succeed only if the VMA configuration sequence > > > number is still current, i.e., the target process hasn't changed its > > > VMA configuration (implicitly or explicitly) since the call to > > > process_vm_getinfo(). > > > > The sequence number is essentially a cookie that is transparent to the > > userspace right? If yes, how does it differ from a fd (returned from > > /proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can > > If you want to operate on N VMAs simultaneously under an FD-per-VMA > model, you'd need to have those N FDs all open at the same time *and* > add some kind of system call that accepted those N FDs and an > operation to perform. The sequence number I'm proposing also applies > to the whole address space, not just one VMA. Even if you did have > these N FDs open all at once and supplied them all to some batch > operation, you couldn't guarantee via the FD mechanism that some *new* > VMA didn't appear in the address range you want to manipulate. A > global sequence number would catch this case. 
I still think supplying > a list of address ranges (like we already do for scatter-gather IO) is > less error-prone, less resource-intensive, more consistent with > existing practice, and equally flexible, especially if we start > supporting destructive cross-process memory operations, which may be > useful for things like checkpointing and optimizing process startup. I have a strong feeling you are over-optimizing here. We are talking about pro-active memory management and so far I haven't heard any usecase where all this would happen in the fast path. There are essentially two usecases I have heard so far. Age/Reclaim the whole process (with anon/fs preference) and do the same on a particular and well-specified range (e.g. a garbage collector or an inactive large image in a browser etc...). The former doesn't really care about parallel address range manipulations because it can tolerate them. The latter is a completely different story. Are there any others where saving a few ms matters so much? > Besides: process_vm_readv and process_vm_writev already work on > address ranges. Why should other cross-process memory APIs use a very > different model for naming memory regions? I would consider those APIs not a great example. They are racy on more levels (pid reuse and address space modification), and require non-trivial synchronization. Do you want something similar for madvise on a non-cooperating remote application? > > be used to revalidate when the operation is requested and fail if > > something has changed. Moreover we already do have a fd based madvise > > syscall so there shouldn't be really a large need to add a new set of > > syscalls. > > We have various system calls that provide hints for open files, but > the memory operations are distinct. Modeling anonymous memory as a > kind of file-backed memory for purposes of VMA manipulation would also > be a departure from existing practice. 
Can you help me understand why > you seem to favor the FD-per-VMA approach so heavily? I don't see any > arguments *for* an FD-per-VMA model for remote memory manipulation and > I see a lot of arguments against it. Is there some compelling > advantage I'm missing? First and foremost it provides an easy cookie to the userspace to guarantee time-to-check-time-to-use consistency. It also naturally extends an existing fadvise interface that achieves madvise semantics on files. I am not really pushing hard for this particular API but I really do care about a programming model that would be sane. If we have a different means to achieve the same then all fine by me but so far I haven't heard any sound arguments to invent something completely new when we have established APIs to use. Exporting anonymous mappings via proc the same way we do for file mappings doesn't seem to be stepping outside of the current practice way too much. All I am trying to say here is that process_madvise(fd, start, len) is an inherently racy API and we should focus on discussing whether this is a sane model. And I think it would be much better to discuss that under the respective patch which introduces that API rather than here. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
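For context on the precedent invoked above: an fd-based advisory interface does already exist for files. posix_fadvise(2) takes an fd plus an (offset, length) range, much as madvise(2) takes an address range. A minimal illustration (Linux/Unix only; the effect on the page cache is purely advisory and is not observable here):

```python
import os
import tempfile

# posix_fadvise(2) is the file-descriptor analogue of madvise(2):
# an advisory hint keyed by an fd plus an (offset, length) range.
fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"x" * 4096)
    os.fsync(fd)
    # Tell the kernel the cached pages for this range are not expected
    # to be needed again soon -- a non-destructive hint, in the same
    # spirit as the madvise hints under discussion.
    result = os.posix_fadvise(fd, 0, 4096, os.POSIX_FADV_DONTNEED)
finally:
    os.close(fd)
    os.unlink(path)
```

The call succeeds (returning None in Python) whether or not the kernel actually drops anything, which is precisely the "hint, not command" semantic both madvise and fadvise share.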
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:49 ` Michal Hocko @ 2019-05-28 12:11 ` Daniel Colascione 2019-05-28 12:32 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Daniel Colascione @ 2019-05-28 12:11 UTC (permalink / raw) To: Michal Hocko Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 4:49 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 28-05-19 04:21:44, Daniel Colascione wrote: > > On Tue, May 28, 2019 at 3:33 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > On Tue 28-05-19 02:39:03, Daniel Colascione wrote: > > > > On Tue, May 28, 2019 at 2:08 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > > memory map information out of processes. Some single-system-call > > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > > complete with attributes (selectable, like statx?) 
for each VMA would > > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > > Michal, any opinion? > > > > > > > > > > I am not really sure this is directly related. I think the primary > > > > > question that we have to sort out first is whether we want to have > > > > > the remote madvise call process or vma fd based. This is an important > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > > so far unfortunately. > > > > I don't think the vma fd approach is viable. We have some processes > > > > with a *lot* of VMAs --- system_server had 4204 when I checked just > > > > now (and that's typical) --- and an FD operation per VMA would be > > > > excessive. > > > > > > What do you mean by excessive here? Do you expect the process to have > > > them open all at once? > > > > Minchan's already done timing. More broadly, in an era with various > > speculative execution mitigations, making a system call is pretty > > expensive. > > This is a completely separate discussion. This could be argued about > many other syscalls. Yes, it can be. That's why we have scatter-gather IO system calls in the first place. > Let's make the semantics correct first before we > even start thinking about multiplexing. It is easier to multiplex on an > existing and sane interface. I don't think of it as "multiplexing" when the fundamental unit of operation is the address range. > Btw. Minchan concluded that multiplexing is not really all that > important based on his numbers http://lkml.kernel.org/r/20190527074940.GB6879@google.com > > [...] > > > > Is this really too much different from /proc/<pid>/map_files? > > > > It's very different. See below. > > > > > > > An interface to query address range information is a separate but > > > > > although a related topic. 
We have /proc/<pid>/[s]maps for that right > > > > > now and I understand it is not a general win for all usecases because > > > > > it tends to be slow for some. I can see how /proc/<pid>/map_anons could > > > > > provide per vma information in a binary form via a fd based interface. > > > > > But I would rather not conflate those two discussions much - well except > > > > > if it could give one of the approaches more justification but let's > > > > > focus on the madvise part first. > > > > > > > > I don't think it's a good idea to focus on one feature in a > > > > multi-feature change when the interactions between features can be > > > > very important for overall design of the multi-feature system and the > > > > design of each feature. > > > > > > > > Here's my thinking on the high-level design: > > > > > > > > I'm imagining an address-range system that would work like this: we'd > > > > create some kind of process_vm_getinfo(2) system call [1] that would > > > > accept a statx-like attribute map and a pid/fd parameter as input and > > > > return, on output, two things: 1) an array [2] of VMA descriptors > > > > containing the requested information, and 2) a VMA configuration > > > > sequence number. We'd then have process_madvise() and other > > > > cross-process VM interfaces accept both address ranges and this > > > > sequence number; they'd succeed only if the VMA configuration sequence > > > > number is still current, i.e., the target process hasn't changed its > > > > VMA configuration (implicitly or explicitly) since the call to > > > > process_vm_getinfo(). > > > > > > The sequence number is essentially a cookie that is transparent to the > > > userspace right? 
If yes, how does it differ from a fd (returned from > > > /proc/<pid>/map_{anons,files}/range) which is a cookie itself and it can > > > > If you want to operate on N VMAs simultaneously under an FD-per-VMA > > model, you'd need to have those N FDs all open at the same time *and* > > add some kind of system call that accepted those N FDs and an > > operation to perform. The sequence number I'm proposing also applies > > to the whole address space, not just one VMA. Even if you did have > > these N FDs open all at once and supplied them all to some batch > > operation, you couldn't guarantee via the FD mechanism that some *new* > > VMA didn't appear in the address range you want to manipulate. A > > global sequence number would catch this case. I still think supplying > > a list of address ranges (like we already do for scatter-gather IO) is > > less error-prone, less resource-intensive, more consistent with > > existing practice, and equally flexible, especially if we start > > supporting destructive cross-process memory operations, which may be > > useful for things like checkpointing and optimizing process startup. > > I have a strong feeling you are over optimizing here. We are talking > about a pro-active memory management and so far I haven't heard any > usecase where all this would happen in the fast path. There are > essentially two usecases I have heard so far. Age/Reclaim the whole > process (with anon/fs preferency) and do the same on a particular > and well specified range (e.g. a garbage collector or an inactive large > image in browsert etc...). The former doesn't really care about parallel > address range manipulations because it can tolerate them. The later is a > completely different story. > > Are there any others where saving few ms matter so much? Saving ms matters quite a bit. 
We may want to perform some of this eager memory management in response to user activity, e.g., application switch, and even if that work isn't quite synchronous, every cycle the system spends on management overhead is a cycle it can't spend on rendering frames. Overhead means jank. Additionally, we're on battery-operated devices. Milliseconds of CPU overhead accumulated over a long time is a real energy sink. > > Besides: process_vm_readv and process_vm_writev already work on > > address ranges. Why should other cross-process memory APIs use a very > > different model for naming memory regions? > > I would consider those APIs not a great example. They are racy on > more levels (pid reuse and address space modification), and require a > non-trivial synchronization. Do you want something similar for madvise > on a non-cooperating remote application? > > > > be used to revalidate when the operation is requested and fail if > > > something has changed. Moreover we already do have a fd based madvise > > > syscall so there shouldn't be really a large need to add a new set of > > > syscalls. > > > > We have various system calls that provide hints for open files, but > > the memory operations are distinct. Modeling anonymous memory as a > > kind of file-backed memory for purposes of VMA manipulation would also > > be a departure from existing practice. Can you help me understand why > > you seem to favor the FD-per-VMA approach so heavily? I don't see any > > arguments *for* an FD-per-VMA model for remove memory manipulation and > > I see a lot of arguments against it. Is there some compelling > > advantage I'm missing? > > First and foremost it provides an easy cookie to the userspace to > guarantee time-to-check-time-to-use consistency. But only for one VMA at a time. > It also naturally > extend an existing fadvise interface that achieves madvise semantic on > files. 
There are lots of things that madvise can do that fadvise can't and that don't even really make sense for fadvise, e.g., MADV_FREE. It seems odd to me to duplicate much of the madvise interface into fadvise so that we can use file APIs to give madvise hints. It seems simpler to me to just provide a mechanism to put the madvise hints where they're needed. > I am not really pushing hard for this particular API but I really > do care about a programming model that would be sane. You've used "sane" twice so far in this message. Can you specify more precisely what you mean by that word? I agree that there needs to be some defense against TOCTOU races when doing remote memory management, but I don't think providing this robustness via a file descriptor is any more sane than alternative approaches. A file descriptor comes with a lot of other features --- e.g., SCM_RIGHTS, fstat, and a concept of owning a resource --- that aren't needed to achieve robustness. Normally, a file descriptor refers to some resource that the kernel holds as long as the file descriptor (well, the open file description or struct file) lives -- things like graphics buffers, files, and sockets. If we're using an FD *just* as a cookie and not a resource, I'd rather just expose the cookie directly. > If we have a > different means to achieve the same then all fine by me but so far I > haven't heard any sound arguments to invent something completely new > when we have established APIs to use. Doesn't the next sentence describe something profoundly new? :-) > Exporting anonymous mappings via > proc the same way we do for file mappings doesn't seem to be stepping > outside of the current practice way too much. It seems like a radical departure from existing practice to provide filesystem interfaces to anonymous memory regions, e.g., anon_vma. You've never been able to refer to those memory regions with file descriptors. 
All I'm suggesting is that we take the existing madvise mechanism, make it work cross-process, and make it robust against TOCTOU problems, all one step at a time. Maybe my sense of API "size" is miscalibrated, but adding a new type of FD to refer to anonymous VMA regions feels like a bigger departure and so requires stronger justification, especially if the result of the FD approach is probably something less efficient than a cookie-based one. > and we should focus on discussing whether this is a > sane model. And I think it would be much better to discuss that under > the respective patch which introduces that API rather than here. I think it's important to discuss what that API should look like. :-) ^ permalink raw reply [flat|nested] 68+ messages in thread
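The scatter-gather precedent appealed to in this message is the shape the existing vectored-I/O calls already use: one system call operating on a whole list of buffers, the same iovec-based convention process_vm_readv(2) and process_vm_writev(2) follow for lists of remote address ranges. A small illustration with an ordinary pipe:

```python
import os

# writev(2)/readv(2): a single system call taking a list of buffers,
# mirroring the iovec shape of process_vm_readv(2)/process_vm_writev(2).
r, w = os.pipe()
try:
    # One syscall writes three separate buffers ("gather").
    written = os.writev(w, [b"hello ", b"remote ", b"madvise"])
    # One syscall reads back into three separate buffers ("scatter").
    bufs = [bytearray(6), bytearray(7), bytearray(7)]
    nread = os.readv(r, bufs)
finally:
    os.close(r)
    os.close(w)
```

The point of the analogy is only the calling convention: N ranges, one syscall, no per-range kernel object held open across calls.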
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 12:11 ` Daniel Colascione @ 2019-05-28 12:32 ` Michal Hocko 0 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-28 12:32 UTC (permalink / raw) To: Daniel Colascione Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue 28-05-19 05:11:16, Daniel Colascione wrote: > On Tue, May 28, 2019 at 4:49 AM Michal Hocko <mhocko@kernel.org> wrote: [...] > > > We have various system calls that provide hints for open files, but > > > the memory operations are distinct. Modeling anonymous memory as a > > > kind of file-backed memory for purposes of VMA manipulation would also > > > be a departure from existing practice. Can you help me understand why > > > you seem to favor the FD-per-VMA approach so heavily? I don't see any > > > arguments *for* an FD-per-VMA model for remove memory manipulation and > > > I see a lot of arguments against it. Is there some compelling > > > advantage I'm missing? > > > > First and foremost it provides an easy cookie to the userspace to > > guarantee time-to-check-time-to-use consistency. > > But only for one VMA at a time. Which is the unit we operate on, right? > > It also naturally > > extend an existing fadvise interface that achieves madvise semantic on > > files. > > There are lots of things that madvise can do that fadvise can't and > that don't even really make sense for fadvise, e.g., MADV_FREE. It > seems odd to me to duplicate much of the madvise interface into > fadvise so that we can use file APIs to give madvise hints. It seems > simpler to me to just provide a mechanism to put the madvise hints > where they're needed. I do not see why we would duplicate. 
I confess I haven't tried to implement this so I might be overlooking something but it seems to me that we could simply reuse the same functionality from both APIs. > > I am not really pushing hard for this particular API but I really > > do care about a programming model that would be sane. > > You've used "sane" twice so far in this message. Can you specify more > precisely what you mean by that word? Well, I would consider a model which prevents unintended side effects (e.g. working on a completely different object) without tricky synchronization sane. > I agree that there needs to be > some defense against TOCTOU races when doing remote memory management, > but I don't think providing this robustness via a file descriptor is > any more sane than alternative approaches. A file descriptor comes > with a lot of other features --- e.g., SCM_RIGHTS, fstat, and a > concept of owning a resource --- that aren't needed to achieve > robustness. > > Normally, a file descriptor refers to some resource that the kernel > holds as long as the file descriptor (well, the open file description > or struct file) lives -- things like graphics buffers, files, and > sockets. If we're using an FD *just* as a cookie and not a resource, > I'd rather just expose the cookie directly. You are absolutely right. But doesn't that apply to any other revalidation method that would be tracking VMA status as well? As I've said I am not married to this approach as long as there are better alternatives. So far we are in a discussion about what the actual semantics of the operation should be and how much we want to tolerate races. And it seems that we are diving into implementation details rather than landing on a firm decision about whether the currently proposed API is suitable or not. 
> > Doesn't the next sentence describe something profoundly new? :-) > > > Exporting anonymous mappings via > > proc the same way we do for file mappings doesn't seem to be stepping > > outside of the current practice way too much. > > It seems like a radical departure from existing practice to provide > filesystem interfaces to anonymous memory regions, e.g., anon_vma. > You've never been able to refer to those memory regions with file > descriptors. > > All I'm suggesting is that we take the existing madvise mechanism, > make it work cross-process, and make it robust against TOCTOU > problems, all one step at a time. Maybe my sense of API "size" is > miscalibrated, but adding a new type of FD to refer to anonymous VMA > regions feels like a bigger departure and so requires stronger > justification, especially if the result of the FD approach is probably > something less efficient than a cookie-based one. Feel free to propose the way to achieve that in the respective email thread. > > and we should focus on discussing whether this is a > > sane model. And I think it would be much better to discuss that under > > the respective patch which introduces that API rather than here. > > I think it's important to discuss what that API should look like. :-) It will be fun to follow this discussion and make some sense of different parallel threads. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 9:08 ` Michal Hocko 2019-05-28 9:39 ` Daniel Colascione @ 2019-05-28 10:32 ` Minchan Kim 2019-05-28 10:41 ` Michal Hocko 1 sibling, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-28 10:32 UTC (permalink / raw) To: Michal Hocko Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > if we went with the per vma fd approach then you would get this > > > > > feature automatically because map_files would refer to file backed > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > so map_anon wouldn't be helpful. > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > to suggest that providing an efficient binary interfaces for pulling > > > memory map information out of processes. Some single-system-call > > > method for retrieving a binary snapshot of a process's address space > > > complete with attributes (selectable, like statx?) for each VMA would > > > reduce complexity and increase performance in a variety of areas, > > > e.g., Android memory map debugging commands. > > > > I agree it's the best we can get *generally*. > > Michal, any opinion? > > I am not really sure this is directly related. I think the primary > question that we have to sort out first is whether we want to have > the remote madvise call process or vma fd based. This is an important > distinction wrt. usability. I have only seen pid vs. pidfd discussions > so far unfortunately. 
With the current usecase it's a per-process API with distinguishable anon/file, but I thought it could be easily extended later to per-address-range operations as userspace gets smarter with more information. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 10:32 ` Minchan Kim @ 2019-05-28 10:41 ` Michal Hocko 2019-05-28 11:12 ` Minchan Kim 2019-05-28 11:28 ` Daniel Colascione 0 siblings, 2 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-28 10:41 UTC (permalink / raw) To: Minchan Kim Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue 28-05-19 19:32:56, Minchan Kim wrote: > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > if we went with the per vma fd approach then you would get this > > > > > > feature automatically because map_files would refer to file backed > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > so map_anon wouldn't be helpful. > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > to suggest that providing an efficient binary interfaces for pulling > > > > memory map information out of processes. Some single-system-call > > > > method for retrieving a binary snapshot of a process's address space > > > > complete with attributes (selectable, like statx?) for each VMA would > > > > reduce complexity and increase performance in a variety of areas, > > > > e.g., Android memory map debugging commands. > > > > > > I agree it's the best we can get *generally*. > > > Michal, any opinion? > > > > I am not really sure this is directly related. I think the primary > > question that we have to sort out first is whether we want to have > > the remote madvise call process or vma fd based. 
This is an important > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > so far unfortunately. > > With current usecase, it's per-process API with distinguishable anon/file > > but thought it could be easily extended later for each address range > > operation as userspace getting smarter with more information. Never design a user API based on a single usecase, please. The "easily extended" part is by far not clear to me, TBH. As I've already mentioned several times, the synchronization model has to be thought through carefully before a remote process address range operation can be implemented. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 10:41 ` Michal Hocko @ 2019-05-28 11:12 ` Minchan Kim 2019-05-28 11:28 ` Michal Hocko 2019-05-28 11:28 ` Daniel Colascione 1 sibling, 1 reply; 68+ messages in thread From: Minchan Kim @ 2019-05-28 11:12 UTC (permalink / raw) To: Michal Hocko Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > memory map information out of processes. Some single-system-call > > > > > method for retrieving a binary snapshot of a process's address space > > > > > complete with attributes (selectable, like statx?) for each VMA would > > > > > reduce complexity and increase performance in a variety of areas, > > > > > e.g., Android memory map debugging commands. > > > > > > > > I agree it's the best we can get *generally*. > > > > Michal, any opinion? > > > > > > I am not really sure this is directly related. 
I think the primary > > > question that we have to sort out first is whether we want to have > > > the remote madvise call process or vma fd based. This is an important > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > so far unfortunately. > > > > With current usecase, it's per-process API with distinguishable anon/file > > but thought it could be easily extended later for each address range > > operation as userspace getting smarter with more information. > > Never design user API based on a single usecase, please. The "easily > extended" part is by far not clear to me TBH. As I've already mentioned > several times, the synchronization model has to be thought through > carefuly before a remote process address range operation can be > implemented. I agree with you that we shouldn't design an API around a single usecase, but what you are concerned about is actually not our usecase, because we are resilient against the race: MADV_COLD|PAGEOUT is not destructive. Actually, many hints are already racy in that the upcoming access pattern may differ from the behavior you assumed at the moment of the call. If you are still concerned about address range synchronization, how about moving such hints to the per-process level, like prctl? Does that make sense to you? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:12 ` Minchan Kim @ 2019-05-28 11:28 ` Michal Hocko 2019-05-28 11:42 ` Daniel Colascione 2019-05-28 11:44 ` Minchan Kim 0 siblings, 2 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-28 11:28 UTC (permalink / raw) To: Minchan Kim Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue 28-05-19 20:12:08, Minchan Kim wrote: > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > memory map information out of processes. Some single-system-call > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > complete with attributes (selectable, like statx?) for each VMA would > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > Michal, any opinion? 
> > > > > > > > I am not really sure this is directly related. I think the primary > > > > question that we have to sort out first is whether we want to have > > > > the remote madvise call process or vma fd based. This is an important > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > so far unfortunately. > > > > > > With current usecase, it's per-process API with distinguishable anon/file > > > but thought it could be easily extended later for each address range > > > operation as userspace getting smarter with more information. > > > > Never design user API based on a single usecase, please. The "easily > > extended" part is by far not clear to me TBH. As I've already mentioned > > several times, the synchronization model has to be thought through > > carefuly before a remote process address range operation can be > > implemented. > > I agree with you that we shouldn't design API on single usecase but what > you are concerning is actually not our usecase because we are resilient > with the race since MADV_COLD|PAGEOUT is not destruptive. > Actually, many hints are already racy in that the upcoming pattern would > be different with the behavior you thought at the moment. How come they are racy wrt address ranges? You would have to be in a multithreaded environment, and then the onus of synchronization is on the threads. That model is quite clear. But we are talking about separate processes, and some of them might not even be aware of an external entity tweaking their address space. > If you are still concerning of address range synchronization, how about > moving such hints to per-process level like prctl? > Does it make sense to you? No it doesn't. How is prctl in any way relevant to address range operations? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:28 ` Michal Hocko @ 2019-05-28 11:42 ` Daniel Colascione 2019-05-28 11:56 ` Michal Hocko 2019-05-28 12:10 ` Minchan Kim 2019-05-28 11:44 ` Minchan Kim 1 sibling, 2 replies; 68+ messages in thread From: Daniel Colascione @ 2019-05-28 11:42 UTC (permalink / raw) To: Michal Hocko Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 28-05-19 20:12:08, Minchan Kim wrote: > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > > memory map information out of processes. Some single-system-call > > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > > complete with attributes (selectable, like statx?) 
for each VMA would > > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > > Michal, any opinion? > > > > > > > > > > I am not really sure this is directly related. I think the primary > > > > > question that we have to sort out first is whether we want to have > > > > > the remote madvise call process or vma fd based. This is an important > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > > so far unfortunately. > > > > > > > > With current usecase, it's per-process API with distinguishable anon/file > > > > but thought it could be easily extended later for each address range > > > > operation as userspace getting smarter with more information. > > > > > > Never design user API based on a single usecase, please. The "easily > > > extended" part is by far not clear to me TBH. As I've already mentioned > > > several times, the synchronization model has to be thought through > > > carefuly before a remote process address range operation can be > > > implemented. > > > > I agree with you that we shouldn't design API on single usecase but what > > you are concerning is actually not our usecase because we are resilient > > with the race since MADV_COLD|PAGEOUT is not destruptive. > > Actually, many hints are already racy in that the upcoming pattern would > > be different with the behavior you thought at the moment. > > How come they are racy wrt address ranges? You would have to be in > multithreaded environment and then the onus of synchronization is on > threads. That model is quite clear. But we are talking about separate > processes and some of them might be even not aware of an external entity > tweaking their address space. I don't think the difference between a thread and a process matters in this context. 
Threads race on address space operations all the time --- in the sense that multiple threads modify a process's address space without synchronization. The main reasons that these races haven't been a problem are: 1) threads mostly "mind their own business" and modify different parts of the address space or use locks to ensure that they don't stomp on each other (e.g., the malloc heap lock), and 2) POSIX mmap atomic-replacement semantics make certain classes of operation (like "magic ring buffer" setup) safe even in the presence of other threads stomping over an address space. The thing that's new in this discussion from a synchronization point of view isn't that the VM operation we're talking about is coming from outside the process, but that we want to do a read-decide-modify-ish thing. We want to affect (using various hints) classes of pages like "all file pages" or "all anonymous pages" or "some pages referring to graphics buffers up to 100MB" (to pick an example off the top of my head of a policy that might make sense). From a synchronization point of view, it doesn't really matter whether it's a thread within the target process or a thread outside the target process that does the address space manipulation. What's new is the inspection of the address space before performing an operation. Minchan started this thread by proposing some flags that would implement a few of the filtering policies I used as examples above. Personally, instead of providing a few pre-built policies as flags, I'd rather push the page manipulation policy to userspace as much as possible and just have the kernel provide a mechanism that *in general* makes these read-decide-modify operations efficient and robust. I still think there's a way to achieve this goal very inexpensively without compromising on flexibility. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:42 ` Daniel Colascione @ 2019-05-28 11:56 ` Michal Hocko 2019-05-28 12:18 ` Daniel Colascione 2019-05-28 12:10 ` Minchan Kim 1 sibling, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-28 11:56 UTC (permalink / raw) To: Daniel Colascione Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue 28-05-19 04:42:47, Daniel Colascione wrote: > On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Tue 28-05-19 20:12:08, Minchan Kim wrote: > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > > > memory map information out of processes. Some single-system-call > > > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > > > complete with attributes (selectable, like statx?) 
for each VMA would > > > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > > > Michal, any opinion? > > > > > > > > > > > > I am not really sure this is directly related. I think the primary > > > > > > question that we have to sort out first is whether we want to have > > > > > > the remote madvise call process or vma fd based. This is an important > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > > > so far unfortunately. > > > > > > > > > > With current usecase, it's per-process API with distinguishable anon/file > > > > > but thought it could be easily extended later for each address range > > > > > operation as userspace getting smarter with more information. > > > > > > > > Never design user API based on a single usecase, please. The "easily > > > > extended" part is by far not clear to me TBH. As I've already mentioned > > > > several times, the synchronization model has to be thought through > > > > carefuly before a remote process address range operation can be > > > > implemented. > > > > > > I agree with you that we shouldn't design API on single usecase but what > > > you are concerning is actually not our usecase because we are resilient > > > with the race since MADV_COLD|PAGEOUT is not destruptive. > > > Actually, many hints are already racy in that the upcoming pattern would > > > be different with the behavior you thought at the moment. > > > > How come they are racy wrt address ranges? You would have to be in > > multithreaded environment and then the onus of synchronization is on > > threads. That model is quite clear. But we are talking about separate > > processes and some of them might be even not aware of an external entity > > tweaking their address space. 
> > I don't think the difference between a thread and a process matters in > this context. Threads race on address space operations all the time > --- in the sense that multiple threads modify a process's address > space without synchronization. I would disagree. They do have in-kernel synchronization as long as they do not use MAP_FIXED. If they do want to use MAP_FIXED then they had better synchronize, or the result is undefined. > The main reasons that these races > hasn't been a problem are: 1) threads mostly "mind their own business" > and modify different parts of the address space or use locks to ensure > that they don't stop on each other (e.g., the malloc heap lock), and > 2) POSIX mmap atomic-replacement semantics make certain classes of > operation (like "magic ring buffer" setup) safe even in the presence > of other threads stomping over an address space. Agreed here. [...] > From a synchronization point > of view, it doesn't really matter whether it's a thread within the > target process or a thread outside the target process that does the > address space manipulation. What's new is the inspection of the > address space before performing an operation. The fundamental difference is that if you want to achieve the same inside the process then your application is inherently aware of the operation and uses whatever synchronization is needed to achieve consistency. As soon as you allow the same from outside, you either have to have an aware target application as well, or you need a mechanism to find out that your decision has been invalidated by a later unsynchronized action. > Minchan started this thread by proposing some flags that would > implement a few of the filtering policies I used as examples above.
> Personally, instead of providing a few pre-built policies as flags, > I'd rather push the page manipulation policy to userspace as much as > possible and just have the kernel provide a mechanism that *in > general* makes these read-decide-modify operations efficient and > robust. I still think there's way to achieve this goal very > inexpensively without compromising on flexibility. Agreed here. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:56 ` Michal Hocko @ 2019-05-28 12:18 ` Daniel Colascione 2019-05-28 12:38 ` Michal Hocko 0 siblings, 1 reply; 68+ messages in thread From: Daniel Colascione @ 2019-05-28 12:18 UTC (permalink / raw) To: Michal Hocko Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 4:56 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 28-05-19 04:42:47, Daniel Colascione wrote: > > On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > > > On Tue 28-05-19 20:12:08, Minchan Kim wrote: > > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > > > > memory map information out of processes. 
Some single-system-call > > > > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > > > > complete with attributes (selectable, like statx?) for each VMA would > > > > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > > > > Michal, any opinion? > > > > > > > > > > > > > > I am not really sure this is directly related. I think the primary > > > > > > > question that we have to sort out first is whether we want to have > > > > > > > the remote madvise call process or vma fd based. This is an important > > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > > > > so far unfortunately. > > > > > > > > > > > > With current usecase, it's per-process API with distinguishable anon/file > > > > > > but thought it could be easily extended later for each address range > > > > > > operation as userspace getting smarter with more information. > > > > > > > > > > Never design user API based on a single usecase, please. The "easily > > > > > extended" part is by far not clear to me TBH. As I've already mentioned > > > > > several times, the synchronization model has to be thought through > > > > > carefuly before a remote process address range operation can be > > > > > implemented. > > > > > > > > I agree with you that we shouldn't design API on single usecase but what > > > > you are concerning is actually not our usecase because we are resilient > > > > with the race since MADV_COLD|PAGEOUT is not destruptive. > > > > Actually, many hints are already racy in that the upcoming pattern would > > > > be different with the behavior you thought at the moment. > > > > > > How come they are racy wrt address ranges? You would have to be in > > > multithreaded environment and then the onus of synchronization is on > > > threads. 
That model is quite clear. But we are talking about separate > > > processes and some of them might be even not aware of an external entity > > > tweaking their address space. > > > > I don't think the difference between a thread and a process matters in > > this context. Threads race on address space operations all the time > > --- in the sense that multiple threads modify a process's address > > space without synchronization. > > I would disagree. They do have in-kernel synchronization as long as they > do not use MAP_FIXED. If they do want to use MAP_FIXED then they better > synchronize or the result is undefined. Right. It's because the kernel hands off different regions to different non-MAP_FIXED mmap callers that it's pretty easy for threads to mind their own business, but they're all still using the same address space. > > From a synchronization point > > of view, it doesn't really matter whether it's a thread within the > > target process or a thread outside the target process that does the > > address space manipulation. What's new is the inspection of the > > address space before performing an operation. > > The fundamental difference is that if you want to achieve the same > inside the process then your application is inherenly aware of the > operation and use whatever synchronization is needed to achieve a > consistency. As soon as you allow the same from outside you either > have to have an aware target application as well or you need a mechanism > to find out that your decision has been invalidated by a later > unsynchronized action. I thought of this objection immediately after I hit send. :-) I still don't think the intra- vs inter-process difference matters. It's true that threads can synchronize with each other, but different processes can synchronize with each other too. I mean, you *could* use sem_open(3) for your heap lock and open the semaphore from two different processes. That's silly, but it'd work. 
The important requirement, I think, is that we need to support managing "memory-naive" uncooperative tasks (perhaps legacy ones written before cross-process memory management even became possible), and I think that the cooperative-vs-uncooperative distinction matters a lot more than the tgid of the thread doing the memory manipulation. (Although in our case, we really do need a separate tgid. :-)) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 12:18 ` Daniel Colascione @ 2019-05-28 12:38 ` Michal Hocko 0 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-28 12:38 UTC (permalink / raw) To: Daniel Colascione Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue 28-05-19 05:18:48, Daniel Colascione wrote: [...] > The important requirement, I think, is that we need to support > managing "memory-naive" uncooperative tasks (perhaps legacy ones > written before cross-process memory management even became possible), > and I think that the cooperative-vs-uncooperative distinction matters > a lot more than the tgid of the thread doing the memory manipulation. > (Although in our case, we really do need a separate tgid. :-)) Agreed here and that requires some sort of revalidation and failure on "object has changed" in one form or another IMHO. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:42 ` Daniel Colascione 2019-05-28 11:56 ` Michal Hocko @ 2019-05-28 12:10 ` Minchan Kim 1 sibling, 0 replies; 68+ messages in thread From: Minchan Kim @ 2019-05-28 12:10 UTC (permalink / raw) To: Daniel Colascione Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 04:42:47AM -0700, Daniel Colascione wrote: > On Tue, May 28, 2019 at 4:28 AM Michal Hocko <mhocko@kernel.org> wrote: > > > > On Tue 28-05-19 20:12:08, Minchan Kim wrote: > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > > > memory map information out of processes. Some single-system-call > > > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > > > complete with attributes (selectable, like statx?) 
for each VMA would > > > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > > > Michal, any opinion? > > > > > > > > > > > > I am not really sure this is directly related. I think the primary > > > > > > question that we have to sort out first is whether we want to have > > > > > > the remote madvise call process or vma fd based. This is an important > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > > > so far unfortunately. > > > > > > > > > > With current usecase, it's per-process API with distinguishable anon/file > > > > > but thought it could be easily extended later for each address range > > > > > operation as userspace getting smarter with more information. > > > > > > > > Never design user API based on a single usecase, please. The "easily > > > > extended" part is by far not clear to me TBH. As I've already mentioned > > > > several times, the synchronization model has to be thought through > > > > carefuly before a remote process address range operation can be > > > > implemented. > > > > > > I agree with you that we shouldn't design API on single usecase but what > > > you are concerning is actually not our usecase because we are resilient > > > with the race since MADV_COLD|PAGEOUT is not destruptive. > > > Actually, many hints are already racy in that the upcoming pattern would > > > be different with the behavior you thought at the moment. > > > > How come they are racy wrt address ranges? You would have to be in > > multithreaded environment and then the onus of synchronization is on > > threads. That model is quite clear. But we are talking about separate > > processes and some of them might be even not aware of an external entity > > tweaking their address space. 
> > I don't think the difference between a thread and a process matters in > this context. Threads race on address space operations all the time > --- in the sense that multiple threads modify a process's address > space without synchronization. The main reasons that these races > hasn't been a problem are: 1) threads mostly "mind their own business" > and modify different parts of the address space or use locks to ensure > that they don't stop on each other (e.g., the malloc heap lock), and > 2) POSIX mmap atomic-replacement semantics make certain classes of > operation (like "magic ring buffer" setup) safe even in the presence > of other threads stomping over an address space. > > The thing that's new in this discussion from a synchronization point > of view isn't that the VM operation we're talking about is coming from > outside the process, but that we want to do a read-decide-modify-ish > thing. We want to affect (using various hints) classes of pages like > "all file pages" or "all anonymous pages" or "some pages referring to > graphics buffers up to 100MB" (to pick an example off the top of my > head of a policy that might make sense). From a synchronization point > of view, it doesn't really matter whether it's a thread within the > target process or a thread outside the target process that does the > address space manipulation. What's new is the inspection of the > address space before performing an operation. > > Minchan started this thread by proposing some flags that would > implement a few of the filtering policies I used as examples above. > Personally, instead of providing a few pre-built policies as flags, > I'd rather push the page manipulation policy to userspace as much as > possible and just have the kernel provide a mechanism that *in > general* makes these read-decide-modify operations efficient and > robust. I still think there's way to achieve this goal very > inexpensively without compromising on flexibility. 
I'm looking forward to seeing the way. ;-) ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:28 ` Michal Hocko 2019-05-28 11:42 ` Daniel Colascione @ 2019-05-28 11:44 ` Minchan Kim 2019-05-28 11:51 ` Daniel Colascione 2019-05-28 12:06 ` Michal Hocko 1 sibling, 2 replies; 68+ messages in thread From: Minchan Kim @ 2019-05-28 11:44 UTC (permalink / raw) To: Michal Hocko Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote: > On Tue 28-05-19 20:12:08, Minchan Kim wrote: > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > > memory map information out of processes. Some single-system-call > > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > > complete with attributes (selectable, like statx?) 
for each VMA would > > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > > Michal, any opinion? > > > > > > > > > > I am not really sure this is directly related. I think the primary > > > > > question that we have to sort out first is whether we want to have > > > > > the remote madvise call process or vma fd based. This is an important > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > > so far unfortunately. > > > > > > > > With current usecase, it's per-process API with distinguishable anon/file > > > > but thought it could be easily extended later for each address range > > > > operation as userspace getting smarter with more information. > > > > > > Never design user API based on a single usecase, please. The "easily > > > extended" part is by far not clear to me TBH. As I've already mentioned > > > several times, the synchronization model has to be thought through > > > carefuly before a remote process address range operation can be > > > implemented. > > > > I agree with you that we shouldn't design API on single usecase but what > > you are concerning is actually not our usecase because we are resilient > > with the race since MADV_COLD|PAGEOUT is not destruptive. > > Actually, many hints are already racy in that the upcoming pattern would > > be different with the behavior you thought at the moment. > > How come they are racy wrt address ranges? You would have to be in > multithreaded environment and then the onus of synchronization is on > threads. That model is quite clear. But we are talking about separate Think about MADV_FREE. Allocator would think the chunk is worth to mark "freeable" but soon, user of the allocator asked the chunk - ie, it's not freeable any longer once user start to use it. 
My point is that these kinds of *hints* are always racy, so synchronization wouldn't help much. That's why I want to restrict the hints process_madvise supports to such non-destructive ones in the next respin. > processes and some of them might be even not aware of an external entity > tweaking their address space. > > > If you are still concerning of address range synchronization, how about > > moving such hints to per-process level like prctl? > > Does it make sense to you? > > No it doesn't. How is prctl any relevant to any address range > operations. "whether we want to have the remote madvise call process or vma fd based." You asked the above question, and I answered that we are using process-level hints with an anon/file filter at this moment. That's why I brought up prctl, to make forward progress on the discussion. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:44 ` Minchan Kim @ 2019-05-28 11:51 ` Daniel Colascione 2019-05-28 12:06 ` Michal Hocko 1 sibling, 0 replies; 68+ messages in thread From: Daniel Colascione @ 2019-05-28 11:51 UTC (permalink / raw) To: Minchan Kim Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 4:44 AM Minchan Kim <minchan@kernel.org> wrote: > > On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote: > > On Tue 28-05-19 20:12:08, Minchan Kim wrote: > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > > > memory map information out of processes. Some single-system-call > > > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > > > complete with attributes (selectable, like statx?) 
for each VMA would > > > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > > > Michal, any opinion? > > > > > > > > > > > > I am not really sure this is directly related. I think the primary > > > > > > question that we have to sort out first is whether we want to have > > > > > > the remote madvise call process or vma fd based. This is an important > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > > > so far unfortunately. > > > > > > > > > > With current usecase, it's per-process API with distinguishable anon/file > > > > > but thought it could be easily extended later for each address range > > > > > operation as userspace getting smarter with more information. > > > > > > > > Never design user API based on a single usecase, please. The "easily > > > > extended" part is by far not clear to me TBH. As I've already mentioned > > > > several times, the synchronization model has to be thought through > > > > carefuly before a remote process address range operation can be > > > > implemented. > > > > > > I agree with you that we shouldn't design API on single usecase but what > > > you are concerning is actually not our usecase because we are resilient > > > with the race since MADV_COLD|PAGEOUT is not destruptive. > > > Actually, many hints are already racy in that the upcoming pattern would > > > be different with the behavior you thought at the moment. > > > > How come they are racy wrt address ranges? You would have to be in > > multithreaded environment and then the onus of synchronization is on > > threads. That model is quite clear. But we are talking about separate > > Think about MADV_FREE. 
Allocator would think the chunk is worth to mark > "freeable" but soon, user of the allocator asked the chunk - ie, it's not > freeable any longer once user start to use it. > > My point is that kinds of *hints* are always racy so any synchronization > couldn't help a lot. That's why I want to restrict hints process_madvise > supports as such kinds of non-destruptive one at next respin. I think it's more natural for process_madvise to be a superset of regular madvise. What's the harm? There are no security implications, since anyone who could process_madvise could just ptrace anyway. I also don't think limiting the hinting to non-destructive operations guarantees safety (in a broad sense) either, since operating on the wrong memory range can still cause unexpected system performance issues even if there's no data loss. More broadly, what I want to see is a family of process_* APIs that provide a superset of the functionality that the existing intraprocess APIs provide. I think this approach is elegant and generalizes easily. I'm worried about prematurely limiting the interprocess memory APIs and creating limitations that will last a long time in order to avoid having to consider issues like VMA synchronization. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 11:44 ` Minchan Kim 2019-05-28 11:51 ` Daniel Colascione @ 2019-05-28 12:06 ` Michal Hocko 2019-05-28 12:22 ` Minchan Kim 1 sibling, 1 reply; 68+ messages in thread From: Michal Hocko @ 2019-05-28 12:06 UTC (permalink / raw) To: Minchan Kim Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue 28-05-19 20:44:36, Minchan Kim wrote: > On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote: > > On Tue 28-05-19 20:12:08, Minchan Kim wrote: > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > > > memory map information out of processes. Some single-system-call > > > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > > > complete with attributes (selectable, like statx?) 
for each VMA would > > > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > > > Michal, any opinion? > > > > > > > > > > > > I am not really sure this is directly related. I think the primary > > > > > > question that we have to sort out first is whether we want to have > > > > > > the remote madvise call process or vma fd based. This is an important > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > > > so far unfortunately. > > > > > > > > > > With current usecase, it's per-process API with distinguishable anon/file > > > > > but thought it could be easily extended later for each address range > > > > > operation as userspace getting smarter with more information. > > > > > > > > Never design user API based on a single usecase, please. The "easily > > > > extended" part is by far not clear to me TBH. As I've already mentioned > > > > several times, the synchronization model has to be thought through > > > > carefuly before a remote process address range operation can be > > > > implemented. > > > > > > I agree with you that we shouldn't design API on single usecase but what > > > you are concerning is actually not our usecase because we are resilient > > > with the race since MADV_COLD|PAGEOUT is not destruptive. > > > Actually, many hints are already racy in that the upcoming pattern would > > > be different with the behavior you thought at the moment. > > > > How come they are racy wrt address ranges? You would have to be in > > multithreaded environment and then the onus of synchronization is on > > threads. That model is quite clear. But we are talking about separate > > Think about MADV_FREE. 
Allocator would think the chunk is worth to mark > "freeable" but soon, user of the allocator asked the chunk - ie, it's not > freeable any longer once user start to use it. That is not a race in the address space, right. The underlying object hasn't changed. It has been declared as freeable and since that moment nobody can rely on the content because it might have been discarded. Or put simply, the content is undefined. It is the responsibility of the madvise caller to make sure that the object is not in active use while marking it. > My point is that kinds of *hints* are always racy so any synchronization > couldn't help a lot. That's why I want to restrict hints process_madvise > supports as such kinds of non-destruptive one at next respin. I agree that non-destructive operations are safer against parallel modifications, because at worst you just get annoying and unexpected latency. But we should discuss whether this assumption is sufficient for further development. I am pretty sure that once we open up remote madvise, people will find use cases for destructive operations or even new madvise modes we haven't heard of. What then? > > processes and some of them might be even not aware of an external entity > > tweaking their address space. > > > > > If you are still concerning of address range synchronization, how about > > > moving such hints to per-process level like prctl? > > > Does it make sense to you? > > > > No it doesn't. How is prctl any relevant to any address range > > operations. > > "whether we want to have the remote madvise call process or vma fd based." Still not following. So you want a prctl (one of the worst APIs we have, along with ioctl) to set the semantics? This sounds like a terrible idea to me. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 12:06 ` Michal Hocko @ 2019-05-28 12:22 ` Minchan Kim 0 siblings, 0 replies; 68+ messages in thread From: Minchan Kim @ 2019-05-28 12:22 UTC (permalink / raw) To: Michal Hocko Cc: Daniel Colascione, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 02:06:14PM +0200, Michal Hocko wrote: > On Tue 28-05-19 20:44:36, Minchan Kim wrote: > > On Tue, May 28, 2019 at 01:28:40PM +0200, Michal Hocko wrote: > > > On Tue 28-05-19 20:12:08, Minchan Kim wrote: > > > > On Tue, May 28, 2019 at 12:41:17PM +0200, Michal Hocko wrote: > > > > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > > > > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > > > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > > > > > memory map information out of processes. Some single-system-call > > > > > > > > > method for retrieving a binary snapshot of a process's address space > > > > > > > > > complete with attributes (selectable, like statx?) 
for each VMA would > > > > > > > > > reduce complexity and increase performance in a variety of areas, > > > > > > > > > e.g., Android memory map debugging commands. > > > > > > > > > > > > > > > > I agree it's the best we can get *generally*. > > > > > > > > Michal, any opinion? > > > > > > > > > > > > > > I am not really sure this is directly related. I think the primary > > > > > > > question that we have to sort out first is whether we want to have > > > > > > > the remote madvise call process or vma fd based. This is an important > > > > > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > > > > > so far unfortunately. > > > > > > > > > > > > With current usecase, it's per-process API with distinguishable anon/file > > > > > > but thought it could be easily extended later for each address range > > > > > > operation as userspace getting smarter with more information. > > > > > > > > > > Never design user API based on a single usecase, please. The "easily > > > > > extended" part is by far not clear to me TBH. As I've already mentioned > > > > > several times, the synchronization model has to be thought through > > > > > carefuly before a remote process address range operation can be > > > > > implemented. > > > > > > > > I agree with you that we shouldn't design API on single usecase but what > > > > you are concerning is actually not our usecase because we are resilient > > > > with the race since MADV_COLD|PAGEOUT is not destruptive. > > > > Actually, many hints are already racy in that the upcoming pattern would > > > > be different with the behavior you thought at the moment. > > > > > > How come they are racy wrt address ranges? You would have to be in > > > multithreaded environment and then the onus of synchronization is on > > > threads. That model is quite clear. But we are talking about separate > > > > Think about MADV_FREE. 
Allocator would think the chunk is worth to mark > > "freeable" but soon, user of the allocator asked the chunk - ie, it's not > > freeable any longer once user start to use it. > > That is not a race in the address space, right. The underlying object > hasn't changed. It has been declared as freeable and since that moment > nobody can rely on the content because it might have been discarded. > Or put simply, the content is undefined. It is responsibility of the > madvise caller to make sure that the object is not in active use while > it is marking it. > > > My point is that kinds of *hints* are always racy so any synchronization > > couldn't help a lot. That's why I want to restrict hints process_madvise > > supports as such kinds of non-destruptive one at next respin. > > I agree that a non-destructive operations are safer against paralel > modifications because you just get a annoying and unexpected latency at > worst case. But we should discuss whether this assumption is sufficient > for further development. I am pretty sure once we open remote madvise > people will find usecases for destructive operations or even new madvise > modes we haven't heard of. What then? I support Daniel's vma seq number approach for the future plan. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-28 10:41 ` Michal Hocko 2019-05-28 11:12 ` Minchan Kim @ 2019-05-28 11:28 ` Daniel Colascione 1 sibling, 0 replies; 68+ messages in thread From: Daniel Colascione @ 2019-05-28 11:28 UTC (permalink / raw) To: Michal Hocko Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Shakeel Butt, Sonny Rao, Brian Geffon, Linux API On Tue, May 28, 2019 at 3:41 AM Michal Hocko <mhocko@kernel.org> wrote: > > On Tue 28-05-19 19:32:56, Minchan Kim wrote: > > On Tue, May 28, 2019 at 11:08:21AM +0200, Michal Hocko wrote: > > > On Tue 28-05-19 17:49:27, Minchan Kim wrote: > > > > On Tue, May 28, 2019 at 01:31:13AM -0700, Daniel Colascione wrote: > > > > > On Tue, May 28, 2019 at 1:14 AM Minchan Kim <minchan@kernel.org> wrote: > > > > > > if we went with the per vma fd approach then you would get this > > > > > > > feature automatically because map_files would refer to file backed > > > > > > > mappings while map_anon could refer only to anonymous mappings. > > > > > > > > > > > > The reason to add such filter option is to avoid the parsing overhead > > > > > > so map_anon wouldn't be helpful. > > > > > > > > > > Without chiming on whether the filter option is a good idea, I'd like > > > > > to suggest that providing an efficient binary interfaces for pulling > > > > > memory map information out of processes. Some single-system-call > > > > > method for retrieving a binary snapshot of a process's address space > > > > > complete with attributes (selectable, like statx?) for each VMA would > > > > > reduce complexity and increase performance in a variety of areas, > > > > > e.g., Android memory map debugging commands. > > > > > > > > I agree it's the best we can get *generally*. > > > > Michal, any opinion? > > > > > > I am not really sure this is directly related. 
I think the primary > > > question that we have to sort out first is whether we want to have > > > the remote madvise call process or vma fd based. This is an important > > > distinction wrt. usability. I have only seen pid vs. pidfd discussions > > > so far unfortunately. > > > > With current usecase, it's per-process API with distinguishable anon/file > > but thought it could be easily extended later for each address range > > operation as userspace getting smarter with more information. > > Never design user API based on a single usecase, please. The "easily > extended" part is by far not clear to me TBH. As I've already mentioned > several times, the synchronization model has to be thought through > carefuly before a remote process address range operation can be > implemented. I don't think anyone is overfitting for a specific use case. When some process A wants to manipulate process B's memory, it's fair for A to want to know what memory it's manipulating. That's a general concern that applies to a large family of cross-process memory operations. It's less important for non-destructive hints than for some kind of destructive operation, but the same idea applies. If there's a simple way to solve this A-B information problem in a general way, it seems to be that we should apply that general solution. Likewise, an API to get an efficiently-collected snapshot of a process's address space would be immediately useful in several very different use cases, including debuggers, Android memory use reporting tools, and various kinds of metric collection. Because we're talking about mechanisms that solve several independent problems at the same time and in a general way, it doesn't sound to me like overfitting for a particular use case. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-21 2:55 ` Minchan Kim 2019-05-21 6:26 ` Michal Hocko @ 2019-05-21 15:33 ` Johannes Weiner 2019-05-22 1:50 ` Minchan Kim 1 sibling, 1 reply; 68+ messages in thread From: Johannes Weiner @ 2019-05-21 15:33 UTC (permalink / raw) To: Minchan Kim Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue, May 21, 2019 at 11:55:33AM +0900, Minchan Kim wrote: > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote: > > [cc linux-api] > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote: > > > System could have much faster swap device like zRAM. In that case, swapping > > > is extremely cheaper than file-IO on the low-end storage. > > > In this configuration, userspace could handle different strategy for each > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because > > > file IO is more expensive in this case so want to keep them in memory > > > until memory pressure happens. > > > > > > To support such strategy easier, this patch introduces > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like > > > that /proc/<pid>/clear_refs already has supported same filters. > > > They are filters could be Ored with other existing hints using top two bits > > > of (int behavior). > > > > madvise operates on top of ranges and it is quite trivial to do the > > filtering from the userspace so why do we need any additional filtering? > > > > > Once either of them is set, the hint could affect only the interested vma > > > either anonymous or file-backed. 
> > > > > > With that, user could call a process_madvise syscall simply with a entire > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and > > > MADV_FILE_FILTER so there is no need to call the syscall range by range. > > > > OK, so here is the reason you want that. The immediate question is why > > cannot the monitor do the filtering from the userspace. Slightly more > > work, all right, but less of an API to expose and that itself is a > > strong argument against. > > What I should do if we don't have such filter option is to enumerate all of > vma via /proc/<pid>/maps and then parse every ranges and inode from string, > which would be painful for 2000+ vmas. Just out of curiosity, how do you get to 2000+ distinct memory regions in the address space of a mobile app? I'm assuming these aren't files, but rather anon objects with poor grouping. Is that from guard pages between individual heap allocations or something? ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER 2019-05-21 15:33 ` Johannes Weiner @ 2019-05-22 1:50 ` Minchan Kim 0 siblings, 0 replies; 68+ messages in thread From: Minchan Kim @ 2019-05-22 1:50 UTC (permalink / raw) To: Johannes Weiner Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api On Tue, May 21, 2019 at 11:33:10AM -0400, Johannes Weiner wrote: > On Tue, May 21, 2019 at 11:55:33AM +0900, Minchan Kim wrote: > > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote: > > > [cc linux-api] > > > > > > On Mon 20-05-19 12:52:54, Minchan Kim wrote: > > > > System could have much faster swap device like zRAM. In that case, swapping > > > > is extremely cheaper than file-IO on the low-end storage. > > > > In this configuration, userspace could handle different strategy for each > > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD > > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because > > > > file IO is more expensive in this case so want to keep them in memory > > > > until memory pressure happens. > > > > > > > > To support such strategy easier, this patch introduces > > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like > > > > that /proc/<pid>/clear_refs already has supported same filters. > > > > They are filters could be Ored with other existing hints using top two bits > > > > of (int behavior). > > > > > > madvise operates on top of ranges and it is quite trivial to do the > > > filtering from the userspace so why do we need any additional filtering? > > > > > > > Once either of them is set, the hint could affect only the interested vma > > > > either anonymous or file-backed. 
> > > > > > > > With that, user could call a process_madvise syscall simply with a entire > > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and > > > > MADV_FILE_FILTER so there is no need to call the syscall range by range. > > > > > > OK, so here is the reason you want that. The immediate question is why > > > cannot the monitor do the filtering from the userspace. Slightly more > > > work, all right, but less of an API to expose and that itself is a > > > strong argument against. > > > > What I should do if we don't have such filter option is to enumerate all of > > vma via /proc/<pid>/maps and then parse every ranges and inode from string, > > which would be painful for 2000+ vmas. > > Just out of curiosity, how do you get to 2000+ distinct memory regions > in the address space of a mobile app? I'm assuming these aren't files, > but rather anon objects with poor grouping. Is that from guard pages > between individual heap allocations or something? Android uses preload library model to speed up app launch so it loads all of library in advance on zygote and forks new app based on it. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 0/7] introduce memory hinting API for external process
  [not found] <20190520035254.57579-1-minchan@kernel.org>
  ` (4 preceding siblings ...)
  [not found] ` <20190520035254.57579-8-minchan@kernel.org>
@ 2019-05-20  9:28 ` Michal Hocko
  [not found] ` <20190520164605.GA11665@cmpxchg.org>
  ` (3 subsequent siblings)
  9 siblings, 0 replies; 68+ messages in thread
From: Michal Hocko @ 2019-05-20 9:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api

[Cc linux-api]

On Mon 20-05-19 12:52:47, Minchan Kim wrote:
> - Background
>
> The Android terminology used for forking a new process and starting an
> app from scratch is a cold start, while resuming an existing app is a
> hot start. While we continually try to improve the performance of cold
> starts, hot starts will always be significantly less power hungry as
> well as faster, so we are trying to make hot starts more likely than
> cold starts.
>
> To increase hot starts, Android userspace manages the order in which
> apps should be killed in a process called ActivityManagerService.
> ActivityManagerService tracks every Android app or service that the
> user could be interacting with at any time and translates that into a
> ranked list for lmkd (the low memory killer daemon). They are likely to
> be killed by lmkd if the system has to reclaim memory. In that sense
> they are similar to entries in any other cache. Those apps are kept
> alive for opportunistic performance improvements, but those performance
> improvements will vary based on the memory requirements of individual
> workloads.
>
> - Problem
>
> Naturally, cached apps were dominant consumers of memory on the system.
> However, they were not significant consumers of swap even though they
> are good candidates for swap.
> Under investigation, we found that swapping out only begins once the low
> zone watermark is hit and kswapd wakes up, but the overall allocation
> rate in the system might trip lmkd thresholds and cause a cached process
> to be killed (we measured the performance of swapping out vs. zapping
> the memory by killing a process; unsurprisingly, zapping is 10x faster
> even though we use zram, which is much faster than real storage), so a
> kill from lmkd will often satisfy the high zone watermark, resulting in
> very few pages actually being moved to swap.
>
> - Approach
>
> The approach we chose was to use a new interface to allow userspace to
> proactively reclaim entire processes by leveraging platform information.
> This allowed us to bypass the inaccuracy of the kernel's LRUs for pages
> that are known to be cold from userspace and to avoid races with lmkd
> by reclaiming apps as soon as they enter the cached state. Additionally,
> it gives the platform many more chances to use its information to
> optimize memory efficiency.
>
> IMHO we should spell it out that this patchset complements MADV_WONTNEED
> and MADV_FREE by adding non-destructive ways to gain some free memory
> space. MADV_COLD is similar to MADV_WONTNEED in that it hints to the
> kernel that the memory region is not currently needed and should be
> reclaimed immediately; MADV_COOL is similar to MADV_FREE in that it
> hints to the kernel that the memory region is not currently needed and
> should be reclaimed when memory pressure rises.
>
> To achieve the goal, the patchset introduces two new options for
> madvise. One is MADV_COOL, which deactivates active pages, and the
> other is MADV_COLD, which reclaims private pages instantly. These new
> options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
> ways to gain some free memory space. MADV_COLD is similar to
> MADV_DONTNEED in that it hints to the kernel that the memory region is
> not currently needed and should be reclaimed immediately; MADV_COOL is
> similar to MADV_FREE in that it hints to the kernel that the memory
> region is not currently needed and should be reclaimed when memory
> pressure rises.
>
> This approach is similar in spirit to madvise(MADV_WONTNEED), but the
> information required to make the reclaim decision is not known to the
> app. Instead, it is known to a centralized userspace daemon, and that
> daemon must be able to initiate reclaim on its own without any app
> involvement. To address that, this patch introduces a new syscall:
>
>	struct pr_madvise_param {
>		int size;
>		const struct iovec *vec;
>	}
>
>	int process_madvise(int pidfd, ssize_t nr_elem, int *behavior,
>			struct pr_madvise_param *results,
>			struct pr_madvise_param *ranges,
>			unsigned long flags);
>
> The syscall takes a pidfd to give hints to an external process and
> provides a pair of results/ranges vector arguments so that it can apply
> several hints to multiple address ranges all at once.
>
> I guess others have different ideas about the naming of the syscall and
> options, so feel free to suggest better naming.
>
> - Experiment
>
> We did a bunch of testing with several hundred real users, not an
> artificial benchmark, on Android. We saw about a 17% cold start
> reduction without any significant battery or app startup latency
> issues. And with an artificial benchmark that launches and switches
> apps, we saw an average 7% app launch improvement, 18% fewer lmkd
> kills, and good stats from vmstat.
>
> A is vanilla and B is process_madvise.
>
>                                     A          B      delta  ratio(%)
> allocstall_dma                      0          0          0      0.00
> allocstall_movable               1464        457      -1007    -69.00
> allocstall_normal              263210     190763     -72447    -28.00
> allocstall_total               264674     191220     -73454    -28.00
> compact_daemon_wake             26912      25294      -1618     -7.00
> compact_fail                    17885      14151      -3734    -21.00
> compact_free_scanned       4204766409 3835994922 -368771487     -9.00
> compact_isolated              3446484    2967618    -478866    -14.00
> compact_migrate_scanned    1621336411 1324695710 -296640701    -19.00
> compact_stall                   19387      15343      -4044    -21.00
> compact_success                  1502       1192       -310    -21.00
> kswapd_high_wmark_hit_quickly     234        184        -50    -22.00
> kswapd_inodesteal              221635     233093      11458      5.00
> kswapd_low_wmark_hit_quickly    66065      54009     -12056    -19.00
> nr_dirtied                     259934     296476      36542     14.00
> nr_vmscan_immediate_reclaim      2587       2356       -231     -9.00
> nr_vmscan_write               1274232    2661733    1387501    108.00
> nr_written                    1514060    2937560    1423500     94.00
> pageoutrun                      67561      55133     -12428    -19.00
> pgactivate                    2335060    1984882    -350178    -15.00
> pgalloc_dma                  13743011   14096463     353452      2.00
> pgalloc_movable                     0          0          0      0.00
> pgalloc_normal               18742440   16802065   -1940375    -11.00
> pgalloc_total                32485451   30898528   -1586923     -5.00
> pgdeactivate                  4262210    2930670   -1331540    -32.00
> pgfault                      30812334   31085065     272731      0.00
> pgfree                       33553970   31765164   -1788806     -6.00
> pginodesteal                    33411      15084     -18327    -55.00
> pglazyfreed                         0          0          0      0.00
> pgmajfault                     551312    1508299     956987    173.00
> pgmigrate_fail                  43927      29330     -14597    -34.00
> pgmigrate_success             1399851    1203922    -195929    -14.00
> pgpgin                       24141776   19032156   -5109620    -22.00
> pgpgout                        959344    1103316     143972     15.00
> pgpgoutclean                  4639732    3765868    -873864    -19.00
> pgrefill                      4884560    3006938   -1877622    -39.00
> pgrotated                       37828      25897     -11931    -32.00
> pgscan_direct                 1456037     957567    -498470    -35.00
> pgscan_direct_throttle              0          0          0      0.00
> pgscan_kswapd                 6667767    5047360   -1620407    -25.00
> pgscan_total                  8123804    6004927   -2118877    -27.00
> pgskip_dma                          0          0          0      0.00
> pgskip_movable                      0          0          0      0.00
> pgskip_normal                   14907      25382      10475     70.00
> pgskip_total                    14907      25382      10475     70.00
> pgsteal_direct                1118986     690215    -428771    -39.00
> pgsteal_kswapd                4750223    3657107   -1093116    -24.00
> pgsteal_total                 5869209    4347322   -1521887    -26.00
> pswpin                         417613    1392647     975034    233.00
> pswpout                       1274224    2661731    1387507    108.00
> slabs_scanned                13686905   10807200   -2879705    -22.00
> workingset_activate            668966     569444     -99522    -15.00
> workingset_nodereclaim          38957      32621      -6336    -17.00
> workingset_refault            2816795    2179782    -637013    -23.00
> workingset_restore             294320     168601    -125719    -43.00
>
> pgmajfault is increased by 173% because swapin is increased by 200% by
> the process_madvise hint. However, a swap read from zram is much cheaper
> than file IO from a performance point of view, and an app hot start via
> swapin is also cheaper than a cold start from the beginning of the app,
> which needs many IOs from storage plus initialization steps.
>
> This patchset is against next-20190517.
>
> Minchan Kim (7):
>   mm: introduce MADV_COOL
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: introduce MADV_COLD
>   mm: factor out madvise's core functionality
>   mm: introduce external memory hinting API
>   mm: extend process_madvise syscall to support vector array
>   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
>
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 +
>  include/linux/proc_fs.h                |   1 +
>  include/linux/swap.h                   |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/mman-common.h |  12 +
>  include/uapi/asm-generic/unistd.h      |   2 +
>  kernel/signal.c                        |   2 +-
>  kernel/sys_ni.c                        |   1 +
>  mm/madvise.c                           | 600 +++++++++++++++++++++----
>  mm/swap.c                              |  43 ++
>  mm/vmscan.c                            |  80 +++-
>  14 files changed, 680 insertions(+), 83 deletions(-)
>
> --
> 2.21.0.1020.gf2820cf01a-goog
>
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 68+ messages in thread
[parent not found: <20190520164605.GA11665@cmpxchg.org>]
[parent not found: <20190521043950.GJ10039@google.com>]
* Re: [RFC 0/7] introduce memory hinting API for external process [not found] ` <20190521043950.GJ10039@google.com> @ 2019-05-21 6:32 ` Michal Hocko 0 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-21 6:32 UTC (permalink / raw) To: Minchan Kim Cc: Johannes Weiner, Andrew Morton, LKML, linux-mm, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api [Cc linux-api] On Tue 21-05-19 13:39:50, Minchan Kim wrote: > On Mon, May 20, 2019 at 12:46:05PM -0400, Johannes Weiner wrote: > > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > > - Approach > > > > > > The approach we chose was to use a new interface to allow userspace to > > > proactively reclaim entire processes by leveraging platform information. > > > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > > > that are known to be cold from userspace and to avoid races with lmkd > > > by reclaiming apps as soon as they entered the cached state. Additionally, > > > it could provide many chances for platform to use much information to > > > optimize memory efficiency. > > > > > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > > and MADV_FREE by adding non-destructive ways to gain some free memory > > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > > kernel that memory region is not currently needed and should be reclaimed > > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > > kernel that memory region is not currently needed and should be reclaimed > > > when memory pressure rises. > > > > I agree with this approach and the semantics. But these names are very > > vague and extremely easy to confuse since they're so similar. > > > > MADV_COLD could be a good name, but for deactivating pages, not > > reclaiming them - marking memory "cold" on the LRU for later reclaim. 
> >
> > For the immediate reclaim one, I think there is a better option too:
> > In virtual memory speak, putting a page into secondary storage (or
> > ensuring it's already there), and then freeing its in-memory copy, is
> > called "paging out". And that's what this flag is supposed to do. So
> > how about MADV_PAGEOUT?
> >
> > With that, we'd have:
> >
> > MADV_FREE: Mark data invalid, free memory when needed
> > MADV_DONTNEED: Mark data invalid, free memory immediately
> >
> > MADV_COLD: Data is not used for a while, free memory when needed
> > MADV_PAGEOUT: Data is not used for a while, free memory immediately
> >
> > What do you think?
>
> There have been several suggestions so far. Thanks, folks!
>
> For deactivating:
>
> - MADV_COOL
> - MADV_RECLAIM_LAZY
> - MADV_DEACTIVATE
> - MADV_COLD
> - MADV_FREE_PRESERVE
>
> For reclaiming:
>
> - MADV_COLD
> - MADV_RECLAIM_NOW
> - MADV_RECLAIMING
> - MADV_PAGEOUT
> - MADV_DONTNEED_PRESERVE
>
> It seems everybody dislikes MADV_COLD, so we want to go with something
> else. For consistency with the other existing madvise hints, a
> -PRESERVE postfix fits well. However, I never liked the naming of FREE
> vs. DONTNEED in the first place; they are easily confused.
> I prefer PAGEOUT to RECLAIM since it carries the nuance of reclaiming
> under memory pressure while the data is expected to be paged back in
> if someone needs it later, so it implies PRESERVE.
> If there is no strong objection, I want to go with MADV_COLD and
> MADV_PAGEOUT.
>
> Other opinions?

I do not really care strongly. I am pretty sure we will have a lot of
suggestions because people tend to be good at arguing about that...
Anyway, unlike DONTNEED/FREE, we do not have any other OS implementing
these features, right? So we shouldn't be tied to existing names. On the
other hand, I kind of like the reference to the existing names, but
DEACTIVATE/PAGEOUT seem a good fit to me as well. Unless a much better
name is suggested, I would go with one of those.
Up to you. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
[parent not found: <20190521014452.GA6738@bombadil.infradead.org>]
* Re: [RFC 0/7] introduce memory hinting API for external process [not found] ` <20190521014452.GA6738@bombadil.infradead.org> @ 2019-05-21 6:34 ` Michal Hocko 0 siblings, 0 replies; 68+ messages in thread From: Michal Hocko @ 2019-05-21 6:34 UTC (permalink / raw) To: Matthew Wilcox Cc: Minchan Kim, Andrew Morton, LKML, linux-mm, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Shakeel Butt, Sonny Rao, Brian Geffon, linux-api [linux-api] On Mon 20-05-19 18:44:52, Matthew Wilcox wrote: > On Mon, May 20, 2019 at 12:52:47PM +0900, Minchan Kim wrote: > > IMHO we should spell it out that this patchset complements MADV_WONTNEED > > and MADV_FREE by adding non-destructive ways to gain some free memory > > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > > kernel that memory region is not currently needed and should be reclaimed > > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > > kernel that memory region is not currently needed and should be reclaimed > > when memory pressure rises. > > Do we tear down page tables for these ranges? That seems like a good > way of reclaiming potentially a substantial amount of memory. I do not think we can in general because this is a non-destructive operation. So at least we cannot tear down anonymous ptes (they will turn into swap entries). -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 0/7] introduce memory hinting API for external process [not found] <20190520035254.57579-1-minchan@kernel.org> ` (7 preceding siblings ...) [not found] ` <20190521014452.GA6738@bombadil.infradead.org> @ 2019-05-21 12:53 ` Shakeel Butt [not found] ` <dbe801f0-4bbe-5f6e-9053-4b7deb38e235@arm.com> 9 siblings, 0 replies; 68+ messages in thread From: Shakeel Butt @ 2019-05-21 12:53 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner, Tim Murray, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Sonny Rao, Brian Geffon, linux-api On Sun, May 19, 2019 at 8:53 PM Minchan Kim <minchan@kernel.org> wrote: > > - Background > > The Android terminology used for forking a new process and starting an app > from scratch is a cold start, while resuming an existing app is a hot start. > While we continually try to improve the performance of cold starts, hot > starts will always be significantly less power hungry as well as faster so > we are trying to make hot start more likely than cold start. > > To increase hot start, Android userspace manages the order that apps should > be killed in a process called ActivityManagerService. ActivityManagerService > tracks every Android app or service that the user could be interacting with > at any time and translates that into a ranked list for lmkd(low memory > killer daemon). They are likely to be killed by lmkd if the system has to > reclaim memory. In that sense they are similar to entries in any other cache. > Those apps are kept alive for opportunistic performance improvements but > those performance improvements will vary based on the memory requirements of > individual workloads. > > - Problem > > Naturally, cached apps were dominant consumers of memory on the system. > However, they were not significant consumers of swap even though they are > good candidate for swap. 
Under investigation, swapping out only begins > once the low zone watermark is hit and kswapd wakes up, but the overall > allocation rate in the system might trip lmkd thresholds and cause a cached > process to be killed(we measured performance swapping out vs. zapping the > memory by killing a process. Unsurprisingly, zapping is 10x times faster > even though we use zram which is much faster than real storage) so kill > from lmkd will often satisfy the high zone watermark, resulting in very > few pages actually being moved to swap. It is not clear what exactly is the problem from the above para. IMO low usage of swap is not the problem but rather global memory pressure and the reactive response to it is the problem. Killing apps over swap is preferred as you have noted zapping frees memory faster but it indirectly increases cold start. Also swapping on allocation causes latency issues for the app. So, a proactive mechanism is needed to keep global pressure away and indirectly reduces cold starts and alloc stalls. > > - Approach > > The approach we chose was to use a new interface to allow userspace to > proactively reclaim entire processes by leveraging platform information. > This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages > that are known to be cold from userspace and to avoid races with lmkd > by reclaiming apps as soon as they entered the cached state. Additionally, > it could provide many chances for platform to use much information to > optimize memory efficiency. I think it would be good to have clear reasoning on why "reclaim from userspace" approach is taken. Android runtime clearly has more accurate stale/cold information at the app/process level and can positively influence kernel's reclaim decisions. So, "reclaim from userspace" approach makes total sense for Android. I envision that Chrome OS would be another very obvious user of this approach. There can be tens of tabs which the user have not touched for sometime. 
Chrome OS can proactively reclaim memory from such tabs. > > IMHO we should spell it out that this patchset complements MADV_WONTNEED MADV_DONTNEED? same at couple of places below. > and MADV_FREE by adding non-destructive ways to gain some free memory > space. MADV_COLD is similar to MADV_WONTNEED in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > immediately; MADV_COOL is similar to MADV_FREE in a way that it hints the > kernel that memory region is not currently needed and should be reclaimed > when memory pressure rises. > > To achieve the goal, the patchset introduce two new options for madvise. > One is MADV_COOL which will deactive activated pages and the other is > MADV_COLD which will reclaim private pages instantly. These new options > complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to > gain some free memory space. MADV_COLD is similar to MADV_DONTNEED in a way > that it hints the kernel that memory region is not currently needed and > should be reclaimed immediately; MADV_COOL is similar to MADV_FREE in a way > that it hints the kernel that memory region is not currently needed and > should be reclaimed when memory pressure rises. > > This approach is similar in spirit to madvise(MADV_WONTNEED), but the > information required to make the reclaim decision is not known to the app. > Instead, it is known to a centralized userspace daemon, and that daemon > must be able to initiate reclaim on its own without any app involvement. 
> To solve the concern, this patch introduces new syscall - > > struct pr_madvise_param { > int size; > const struct iovec *vec; > } > > int process_madvise(int pidfd, ssize_t nr_elem, int *behavior, > struct pr_madvise_param *restuls, > struct pr_madvise_param *ranges, > unsigned long flags); > > The syscall get pidfd to give hints to external process and provides > pair of result/ranges vector arguments so that it could give several > hints to each address range all at once. > > I guess others have different ideas about the naming of syscall and options > so feel free to suggest better naming. > > - Experiment > > We did bunch of testing with several hundreds of real users, not artificial > benchmark on android. We saw about 17% cold start decreasement without any > significant battery/app startup latency issues. And with artificial benchmark > which launches and switching apps, we saw average 7% app launching improvement, > 18% less lmkd kill and good stat from vmstat. > > A is vanilla and B is process_madvise. 
>
>                                     A          B      delta  ratio(%)
> allocstall_dma                      0          0          0      0.00
> allocstall_movable               1464        457      -1007    -69.00
> allocstall_normal              263210     190763     -72447    -28.00
> allocstall_total               264674     191220     -73454    -28.00
> compact_daemon_wake             26912      25294      -1618     -7.00
> compact_fail                    17885      14151      -3734    -21.00
> compact_free_scanned       4204766409 3835994922 -368771487     -9.00
> compact_isolated              3446484    2967618    -478866    -14.00
> compact_migrate_scanned    1621336411 1324695710 -296640701    -19.00
> compact_stall                   19387      15343      -4044    -21.00
> compact_success                  1502       1192       -310    -21.00
> kswapd_high_wmark_hit_quickly     234        184        -50    -22.00
> kswapd_inodesteal              221635     233093      11458      5.00
> kswapd_low_wmark_hit_quickly    66065      54009     -12056    -19.00
> nr_dirtied                     259934     296476      36542     14.00
> nr_vmscan_immediate_reclaim      2587       2356       -231     -9.00
> nr_vmscan_write               1274232    2661733    1387501    108.00
> nr_written                    1514060    2937560    1423500     94.00
> pageoutrun                      67561      55133     -12428    -19.00
> pgactivate                    2335060    1984882    -350178    -15.00
> pgalloc_dma                  13743011   14096463     353452      2.00
> pgalloc_movable                     0          0          0      0.00
> pgalloc_normal               18742440   16802065   -1940375    -11.00
> pgalloc_total                32485451   30898528   -1586923     -5.00
> pgdeactivate                  4262210    2930670   -1331540    -32.00
> pgfault                      30812334   31085065     272731      0.00
> pgfree                       33553970   31765164   -1788806     -6.00
> pginodesteal                    33411      15084     -18327    -55.00
> pglazyfreed                         0          0          0      0.00
> pgmajfault                     551312    1508299     956987    173.00
> pgmigrate_fail                  43927      29330     -14597    -34.00
> pgmigrate_success             1399851    1203922    -195929    -14.00
> pgpgin                       24141776   19032156   -5109620    -22.00
> pgpgout                        959344    1103316     143972     15.00
> pgpgoutclean                  4639732    3765868    -873864    -19.00
> pgrefill                      4884560    3006938   -1877622    -39.00
> pgrotated                       37828      25897     -11931    -32.00
> pgscan_direct                 1456037     957567    -498470    -35.00
> pgscan_direct_throttle              0          0          0      0.00
> pgscan_kswapd                 6667767    5047360   -1620407    -25.00
> pgscan_total                  8123804    6004927   -2118877    -27.00
> pgskip_dma                          0          0          0      0.00
> pgskip_movable                      0          0          0      0.00
> pgskip_normal                   14907      25382      10475     70.00
> pgskip_total                    14907      25382      10475     70.00
> pgsteal_direct                1118986     690215    -428771    -39.00
> pgsteal_kswapd                4750223    3657107   -1093116    -24.00
> pgsteal_total                 5869209    4347322   -1521887    -26.00
> pswpin                         417613    1392647     975034    233.00
> pswpout                       1274224    2661731    1387507    108.00
> slabs_scanned                13686905   10807200   -2879705    -22.00
> workingset_activate            668966     569444     -99522    -15.00
> workingset_nodereclaim          38957      32621      -6336    -17.00
> workingset_refault            2816795    2179782    -637013    -23.00
> workingset_restore             294320     168601    -125719    -43.00
>
> pgmajfault is increased by 173% because swapin is increased by 200% by
> the process_madvise hint. However, a swap read from zram is much cheaper
> than file IO from a performance point of view, and an app hot start via
> swapin is also cheaper than a cold start from the beginning of the app,
> which needs many IOs from storage plus initialization steps.
>
> This patchset is against next-20190517.
>
> Minchan Kim (7):
>   mm: introduce MADV_COOL
>   mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM
>   mm: introduce MADV_COLD
>   mm: factor out madvise's core functionality
>   mm: introduce external memory hinting API
>   mm: extend process_madvise syscall to support vector array
>   mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
>
>  arch/x86/entry/syscalls/syscall_32.tbl |   1 +
>  arch/x86/entry/syscalls/syscall_64.tbl |   1 +
>  include/linux/page-flags.h             |   1 +
>  include/linux/page_idle.h              |  15 +
>  include/linux/proc_fs.h                |   1 +
>  include/linux/swap.h                   |   2 +
>  include/linux/syscalls.h               |   2 +
>  include/uapi/asm-generic/mman-common.h |  12 +
>  include/uapi/asm-generic/unistd.h      |   2 +
>  kernel/signal.c                        |   2 +-
>  kernel/sys_ni.c                        |   1 +
>  mm/madvise.c                           | 600 +++++++++++++++++++++----
>  mm/swap.c                              |  43 ++
>  mm/vmscan.c                            |  80 +++-
>  14 files changed, 680 insertions(+), 83 deletions(-)
>
> --
> 2.21.0.1020.gf2820cf01a-goog
>

^ permalink raw reply	[flat|nested] 68+ messages in thread
[parent not found: <dbe801f0-4bbe-5f6e-9053-4b7deb38e235@arm.com>]
[parent not found: <CAEe=Sxka3Q3vX+7aWUJGKicM+a9Px0rrusyL+5bB1w4ywF6N4Q@mail.gmail.com>]
[parent not found: <1754d0ef-6756-d88b-f728-17b1fe5d5b07@arm.com>]
* Re: [RFC 0/7] introduce memory hinting API for external process [not found] ` <1754d0ef-6756-d88b-f728-17b1fe5d5b07@arm.com> @ 2019-05-21 12:56 ` Shakeel Butt 2019-05-22 4:23 ` Brian Geffon 0 siblings, 1 reply; 68+ messages in thread From: Shakeel Butt @ 2019-05-21 12:56 UTC (permalink / raw) To: Anshuman Khandual Cc: Tim Murray, Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Sonny Rao, Brian Geffon, linux-api On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual <anshuman.khandual@arm.com> wrote: > > > > On 05/20/2019 10:29 PM, Tim Murray wrote: > > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual > > <anshuman.khandual@arm.com> wrote: > >> > >> Or Is the objective here is reduce the number of processes which get killed by > >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that > >> they dont get picked by lmkd. Under utilization for zram hardware is a concern > >> here as well ? > > > > The objective is to avoid some instances of memory pressure by > > proactively swapping pages that userspace knows to be cold before > > those pages reach the end of the LRUs, which in turn can prevent some > > apps from being killed by lmk/lmkd. As soon as Android userspace knows > > that an application is not being used and is only resident to improve > > performance if the user returns to that app, we can kick off > > process_madvise on that process's pages (or some portion of those > > pages) in a power-efficient way to reduce memory pressure long before > > the system hits the free page watermark. This allows the system more > > time to put pages into zram versus waiting for the watermark to > > trigger kswapd, which decreases the likelihood that later memory > > allocations will cause enough pressure to trigger a kill of one of > > these apps. > > So this opens up bit of LRU management to user space hints. 
Also because the app > in itself wont know about the memory situation of the entire system, new system > call needs to be called from an external process. > > > > >> Swapping out memory into zram wont increase the latency for a hot start ? Or > >> is it because as it will prevent a fresh cold start which anyway will be slower > >> than a slow hot start. Just being curious. > > > > First, not all swapped pages will be reloaded immediately once an app > > is resumed. We've found that an app's working set post-process_madvise > > is significantly smaller than what an app allocates when it first > > launches (see the delta between pswpin and pswpout in Minchan's > > results). Presumably because of this, faulting to fetch from zram does > > pswpin 417613 1392647 975034 233.00 > pswpout 1274224 2661731 1387507 108.00 > > IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that > always the case ? Or it tend to swap out from an active area of the working set > which faulted back again. > > > not seem to introduce a noticeable hot start penalty, not does it > > cause an increase in performance problems later in the app's > > lifecycle. I've measured with and without process_madvise, and the > > differences are within our noise bounds. Second, because we're not > > That is assuming that post process_madvise() working set for the application is > always smaller. There is another challenge. The external process should ideally > have the knowledge of active areas of the working set for an application in > question for it to invoke process_madvise() correctly to prevent such scenarios. > > > preemptively evicting file pages and only making them more likely to > > be evicted when there's already memory pressure, we avoid the case > > where we process_madvise an app then immediately return to the app and > > reload all file pages in the working set even though there was no > > intervening memory pressure. 
Our initial version of this work evicted > > That would be the worst case scenario which should be avoided. Memory pressure > must be a parameter before actually doing the swap out. But pages if know to be > inactive/cold can be marked high priority to be swapped out. > > > file pages preemptively and did cause a noticeable slowdown (~15%) for > > that case; this patch set avoids that slowdown. Finally, the benefit > > from avoiding cold starts is huge. The performance improvement from > > having a hot start instead of a cold start ranges from 3x for very > > small apps to 50x+ for larger apps like high-fidelity games. > > Is there any other real world scenario apart from this app based ecosystem where > user hinted LRU management might be helpful ? Just being curious. Thanks for the > detailed explanation. I will continue looking into this series. Chrome OS is another real world use-case for this user hinted LRU management approach by proactively reclaiming reclaim from tabs not accessed by the user for some time. ^ permalink raw reply [flat|nested] 68+ messages in thread
* Re: [RFC 0/7] introduce memory hinting API for external process 2019-05-21 12:56 ` Shakeel Butt @ 2019-05-22 4:23 ` Brian Geffon 0 siblings, 0 replies; 68+ messages in thread From: Brian Geffon @ 2019-05-22 4:23 UTC (permalink / raw) To: Shakeel Butt Cc: Anshuman Khandual, Tim Murray, Minchan Kim, Andrew Morton, LKML, linux-mm, Michal Hocko, Johannes Weiner, Joel Fernandes, Suren Baghdasaryan, Daniel Colascione, Sonny Rao, linux-api To expand on the ChromeOS use case we're in a very similar situation to Android. For example, the Chrome browser uses a separate process for each individual tab (with some exceptions) and over time many tabs remain open in a back-grounded or idle state. Given that we have a lot of information about the weight of a tab, when it was last active, etc, we can benefit tremendously from per-process reclaim. We're working on getting real world numbers but all of our initial testing shows very promising results. On Tue, May 21, 2019 at 5:57 AM Shakeel Butt <shakeelb@google.com> wrote: > > On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual > <anshuman.khandual@arm.com> wrote: > > > > > > > > On 05/20/2019 10:29 PM, Tim Murray wrote: > > > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual > > > <anshuman.khandual@arm.com> wrote: > > >> > > >> Or Is the objective here is reduce the number of processes which get killed by > > >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that > > >> they dont get picked by lmkd. Under utilization for zram hardware is a concern > > >> here as well ? > > > > > > The objective is to avoid some instances of memory pressure by > > > proactively swapping pages that userspace knows to be cold before > > > those pages reach the end of the LRUs, which in turn can prevent some > > > apps from being killed by lmk/lmkd. 
As soon as Android userspace knows > > > that an application is not being used and is only resident to improve > > > performance if the user returns to that app, we can kick off > > > process_madvise on that process's pages (or some portion of those > > > pages) in a power-efficient way to reduce memory pressure long before > > > the system hits the free page watermark. This allows the system more > > > time to put pages into zram versus waiting for the watermark to > > > trigger kswapd, which decreases the likelihood that later memory > > > allocations will cause enough pressure to trigger a kill of one of > > > these apps. > > > > So this opens up a bit of LRU management to user space hints. Also, because the app > > in itself won't know about the memory situation of the entire system, the new system > > call needs to be called from an external process. > > > > > > > >> Swapping out memory into zram won't increase the latency for a hot start? Or > > >> is it because it will prevent a fresh cold start, which anyway will be slower > > >> than a slow hot start? Just being curious. > > > > > > First, not all swapped pages will be reloaded immediately once an app > > > is resumed. We've found that an app's working set post-process_madvise > > > is significantly smaller than what an app allocates when it first > > > launches (see the delta between pswpin and pswpout in Minchan's > > > results). Presumably because of this, faulting to fetch from zram does > > > > pswpin 417613 1392647 975034 233.00 > > pswpout 1274224 2661731 1387507 108.00 > > > > IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that > > always the case? Or does it tend to swap out from an active area of the working set > > which faulted back again? > > > > > not seem to introduce a noticeable hot start penalty, nor does it > > > cause an increase in performance problems later in the app's > > > lifecycle. 
I've measured with and without process_madvise, and the > > > differences are within our noise bounds. Second, because we're not > > > > That is assuming that the post-process_madvise() working set for the application is > > always smaller. There is another challenge. The external process should ideally > > have the knowledge of active areas of the working set for an application in > > question for it to invoke process_madvise() correctly to prevent such scenarios. > > > > > preemptively evicting file pages and only making them more likely to > > > be evicted when there's already memory pressure, we avoid the case > > > where we process_madvise an app then immediately return to the app and > > > reload all file pages in the working set even though there was no > > > intervening memory pressure. Our initial version of this work evicted > > > > That would be the worst case scenario which should be avoided. Memory pressure > > must be a parameter before actually doing the swap out. But pages, if known to be > > inactive/cold, can be marked high priority to be swapped out. > > > > > file pages preemptively and did cause a noticeable slowdown (~15%) for > > > that case; this patch set avoids that slowdown. Finally, the benefit > > > from avoiding cold starts is huge. The performance improvement from > > > having a hot start instead of a cold start ranges from 3x for very > > > small apps to 50x+ for larger apps like high-fidelity games. > > > > Is there any other real-world scenario apart from this app-based ecosystem where > > user-hinted LRU management might be helpful? Just being curious. Thanks for the > > detailed explanation. I will continue looking into this series. > > Chrome OS is another real-world use-case for this user-hinted LRU > management approach by proactively reclaiming memory from tabs not > accessed by the user for some time. ^ permalink raw reply [flat|nested] 68+ messages in thread
end of thread, other threads:[~2019-05-30 18:47 UTC | newest] Thread overview: 68+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20190520035254.57579-1-minchan@kernel.org> [not found] ` <20190520035254.57579-2-minchan@kernel.org> 2019-05-20 8:16 ` [RFC 1/7] mm: introduce MADV_COOL Michal Hocko 2019-05-20 8:19 ` Michal Hocko 2019-05-20 15:08 ` Suren Baghdasaryan 2019-05-20 22:55 ` Minchan Kim 2019-05-20 22:54 ` Minchan Kim 2019-05-21 6:04 ` Michal Hocko 2019-05-21 9:11 ` Minchan Kim 2019-05-21 10:05 ` Michal Hocko [not found] ` <20190520035254.57579-4-minchan@kernel.org> 2019-05-20 8:27 ` [RFC 3/7] mm: introduce MADV_COLD Michal Hocko 2019-05-20 23:00 ` Minchan Kim 2019-05-21 6:08 ` Michal Hocko 2019-05-21 9:13 ` Minchan Kim [not found] ` <20190520035254.57579-6-minchan@kernel.org> 2019-05-20 9:18 ` [RFC 5/7] mm: introduce external memory hinting API Michal Hocko 2019-05-21 2:41 ` Minchan Kim 2019-05-21 6:17 ` Michal Hocko 2019-05-21 10:32 ` Minchan Kim [not found] ` <20190520035254.57579-7-minchan@kernel.org> 2019-05-20 9:22 ` [RFC 6/7] mm: extend process_madvise syscall to support vector arrary Michal Hocko 2019-05-21 2:48 ` Minchan Kim 2019-05-21 6:24 ` Michal Hocko 2019-05-21 10:26 ` Minchan Kim 2019-05-21 10:37 ` Michal Hocko 2019-05-27 7:49 ` Minchan Kim 2019-05-29 10:08 ` Daniel Colascione 2019-05-29 10:33 ` Michal Hocko 2019-05-30 2:17 ` Minchan Kim 2019-05-30 6:57 ` Michal Hocko 2019-05-30 8:02 ` Minchan Kim 2019-05-30 16:19 ` Daniel Colascione 2019-05-30 18:47 ` Michal Hocko [not found] ` <20190520035254.57579-8-minchan@kernel.org> 2019-05-20 9:28 ` [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER Michal Hocko 2019-05-21 2:55 ` Minchan Kim 2019-05-21 6:26 ` Michal Hocko 2019-05-27 7:58 ` Minchan Kim 2019-05-27 12:44 ` Michal Hocko 2019-05-28 3:26 ` Minchan Kim 2019-05-28 6:29 ` Michal Hocko 2019-05-28 8:13 ` Minchan Kim 2019-05-28 8:31 ` Daniel Colascione 2019-05-28 
8:49 ` Minchan Kim 2019-05-28 9:08 ` Michal Hocko 2019-05-28 9:39 ` Daniel Colascione 2019-05-28 10:33 ` Michal Hocko 2019-05-28 11:21 ` Daniel Colascione 2019-05-28 11:49 ` Michal Hocko 2019-05-28 12:11 ` Daniel Colascione 2019-05-28 12:32 ` Michal Hocko 2019-05-28 10:32 ` Minchan Kim 2019-05-28 10:41 ` Michal Hocko 2019-05-28 11:12 ` Minchan Kim 2019-05-28 11:28 ` Michal Hocko 2019-05-28 11:42 ` Daniel Colascione 2019-05-28 11:56 ` Michal Hocko 2019-05-28 12:18 ` Daniel Colascione 2019-05-28 12:38 ` Michal Hocko 2019-05-28 12:10 ` Minchan Kim 2019-05-28 11:44 ` Minchan Kim 2019-05-28 11:51 ` Daniel Colascione 2019-05-28 12:06 ` Michal Hocko 2019-05-28 12:22 ` Minchan Kim 2019-05-28 11:28 ` Daniel Colascione 2019-05-21 15:33 ` Johannes Weiner 2019-05-22 1:50 ` Minchan Kim 2019-05-20 9:28 ` [RFC 0/7] introduce memory hinting API for external process Michal Hocko [not found] ` <20190520164605.GA11665@cmpxchg.org> [not found] ` <20190521043950.GJ10039@google.com> 2019-05-21 6:32 ` Michal Hocko [not found] ` <20190521014452.GA6738@bombadil.infradead.org> 2019-05-21 6:34 ` Michal Hocko 2019-05-21 12:53 ` Shakeel Butt [not found] ` <dbe801f0-4bbe-5f6e-9053-4b7deb38e235@arm.com> [not found] ` <CAEe=Sxka3Q3vX+7aWUJGKicM+a9Px0rrusyL+5bB1w4ywF6N4Q@mail.gmail.com> [not found] ` <1754d0ef-6756-d88b-f728-17b1fe5d5b07@arm.com> 2019-05-21 12:56 ` Shakeel Butt 2019-05-22 4:23 ` Brian Geffon