* [PATCH -mm -v4 0/5] mm, swap: VMA based swap readahead
From: Huang, Ying @ 2017-08-07 5:40 UTC
To: Andrew Morton
Cc: linux-mm, linux-kernel, Huang, Ying, Johannes Weiner, Minchan Kim, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen

Swap readahead is an important mechanism for reducing swap-in latency.
Although a purely sequential memory access pattern isn't very common
for anonymous memory, space locality is still considered valid.

In the original swap readahead implementation, consecutive blocks in
the swap device are read ahead based on a global space locality
estimation. But the consecutive blocks in the swap device just reflect
the order of page reclaiming; they don't necessarily reflect the access
pattern in virtual memory space. Also, different tasks in the system
may have different access patterns, which makes the global space
locality estimation incorrect.

In this patchset, when a page fault occurs, the virtual pages near the
fault address are read ahead instead of the swap slots near the
faulting swap slot in the swap device. This avoids reading ahead
unrelated swap slots. At the same time, swap readahead is changed to
work per VMA instead of globally, so that the different access patterns
of different VMAs can be distinguished and a different readahead policy
can be applied to each accordingly. The original core readahead
detection and scaling algorithm is reused, because it is an effective
algorithm for detecting space locality.

In addition to the swap readahead changes, some new sysfs interfaces
are added to show the efficiency of the readahead algorithm and some
other swap statistics.

This new implementation will incur more small random reads. On SSD,
the improved correctness of the estimation and of the readahead target
should outweigh the potentially increased overhead; this is also
illustrated in the test results below. But on HDD, the overhead may
outweigh the benefit, so the original implementation will be used by
default in that case.

The tests and results are as follows.

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case: 4 processes eat 50G of
virtual memory space and repeat the sequential memory writes for 300
seconds. The first round of writes triggers swap-out; the following
rounds trigger sequential swap-in and swap-out.

At the same time, the vm-scalability random swap test case runs in the
background: 8 processes eat 30G of virtual memory space and repeat the
random memory writes for 300 seconds. This triggers random swap-in in
the background.

This is a combined workload with sequential and random memory accesses
at the same time. The results (for the sequential workload) are as
follows,

                    Base            Optimized
                    ----            ---------
throughput          345413 KB/s     414029 KB/s (+19.9%)
latency.average     97.14 us        61.06 us (-37.1%)
latency.50th        2 us            1 us
latency.60th        2 us            1 us
latency.70th        98 us           2 us
latency.80th        160 us          2 us
latency.90th        260 us          217 us
latency.95th        346 us          369 us
latency.99th        1.34 ms         1.09 ms
ra_hit%             52.69%          99.98%

The original swap readahead algorithm is confused by the background
random access workload, so its readahead hit rate is lower. The
VMA-based readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM, to trigger swapping.

                    Base            Optimized
                    ----            ---------
elapsed_time        393.49 s        329.88 s (-16.2%)
ra_hit%             86.21%          98.82%

The scores of the base and optimized kernels show no visible change.
But the elapsed time is reduced and the readahead hit rate is improved,
so the optimized kernel runs better during the startup and tear-down
stages. The absolute readahead hit rate is also high, which shows that
space locality is still valid in some practical workloads.

Changelogs:

v4:

- Rebased on latest -mm tree.
- Remove swap cache statistics interface, because we found that the
  interface for readahead statistics should be sufficient.
- Use /proc/vmstat for swap readahead statistics, because that is the
  interface used by other similar statistics.
- Add ABI document for newly added sysfs interface.

v3:

- Rebased on latest -mm tree.
- Use percpu_counter for swap readahead statistics per Dave Hansen's
  comment.

Best Regards,
Huang, Ying

* [PATCH -mm -v4 1/5] mm, swap: Add swap readahead hit statistics
From: Huang, Ying @ 2017-08-07 5:40 UTC
To: Andrew Morton
Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen

From: Huang Ying <ying.huang@intel.com>

The statistics for total readahead pages and total readahead hits are
recorded and exported via the following sysfs interface.

/sys/kernel/mm/swap/ra_hits
/sys/kernel/mm/swap/ra_total

With them, the efficiency of the swap readahead could be measured, so
that the swap readahead algorithm and parameters could be tuned
accordingly.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/linux/vm_event_item.h | 2 ++
 mm/swap_state.c               | 9 +++++++--
 mm/vmstat.c                   | 3 +++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index e02820fc2861..27e3339cfd65 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -106,6 +106,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		VMACACHE_FIND_HITS,
 		VMACACHE_FULL_FLUSHES,
 #endif
+		SWAP_RA,
+		SWAP_RA_HIT,
 		NR_VM_EVENT_ITEMS
 };
 
diff --git a/mm/swap_state.c b/mm/swap_state.c
index b68c93014f50..d1bdb31cab13 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -305,8 +305,10 @@ struct page * lookup_swap_cache(swp_entry_t entry)
 
 	if (page && likely(!PageTransCompound(page))) {
 		INC_CACHE_INFO(find_success);
-		if (TestClearPageReadahead(page))
+		if (TestClearPageReadahead(page)) {
 			atomic_inc(&swapin_readahead_hits);
+			count_vm_event(SWAP_RA_HIT);
+		}
 	}
 
 	INC_CACHE_INFO(find_total);
@@ -516,8 +518,11 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 						gfp_mask, vma, addr, false);
 		if (!page)
 			continue;
-		if (offset != entry_offset && likely(!PageTransCompound(page)))
+		if (offset != entry_offset &&
+		    likely(!PageTransCompound(page))) {
 			SetPageReadahead(page);
+			count_vm_event(SWAP_RA);
+		}
 		put_page(page);
 	}
 	blk_finish_plug(&plug);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ba9b202e8500..4c2121a8b877 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1095,6 +1095,9 @@ const char * const vmstat_text[] = {
 	"vmacache_find_hits",
 	"vmacache_full_flushes",
 #endif
+
+	"swap_ra",
+	"swap_ra_hit",
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
-- 
2.11.0

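For readers who want to turn the two new counters into a hit rate, the following is a minimal userspace sketch. It assumes the rate is simply swap_ra_hit divided by swap_ra as exposed in /proc/vmstat; the patch does not spell out how ra_hit% was computed, so treat that formula as an assumption. The helper names (vmstat_lookup, swap_ra_hit_percent, swap_ra_hit_percent_from) are made up for this illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look up one "name value" line in a /proc/vmstat-style stream.
 * Returns 0 and stores the value on success, -1 if the name is absent
 * (for example on a kernel that does not have the counters).
 */
static int vmstat_lookup(FILE *fp, const char *name, unsigned long long *val)
{
	char key[64];
	unsigned long long v;

	assert(fp && name && val);
	rewind(fp);
	while (fscanf(fp, "%63s %llu", key, &v) == 2) {
		if (strcmp(key, name) == 0) {
			*val = v;
			return 0;
		}
	}
	return -1;
}

/* Hit rate in percent, or -1.0 when no page has been read ahead yet. */
static double swap_ra_hit_percent(unsigned long long ra, unsigned long long hit)
{
	if (ra == 0)
		return -1.0;
	return 100.0 * (double)hit / (double)ra;
}

/* Read both counters from an open stream and compute the hit rate. */
static double swap_ra_hit_percent_from(FILE *fp)
{
	unsigned long long ra, hit;

	if (vmstat_lookup(fp, "swap_ra", &ra) ||
	    vmstat_lookup(fp, "swap_ra_hit", &hit))
		return -1.0;
	return swap_ra_hit_percent(ra, hit);
}
```

A caller would pass a stream opened with fopen("/proc/vmstat", "r"). The counters are cumulative, so for a single benchmark run it is probably more useful to sample them before and after and apply swap_ra_hit_percent() to the two differences.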
* Re: [PATCH -mm -v4 1/5] mm, swap: Add swap readahead hit statistics
From: Andrew Morton @ 2017-08-09 21:50 UTC
To: Huang, Ying
Cc: linux-mm, linux-kernel, Johannes Weiner, Minchan Kim, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen

On Mon, 7 Aug 2017 13:40:34 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:

> From: Huang Ying <ying.huang@intel.com>
>
> The statistics for total readahead pages and total readahead hits are
> recorded and exported via the following sysfs interface.
>
> /sys/kernel/mm/swap/ra_hits
> /sys/kernel/mm/swap/ra_total
>
> With them, the efficiency of the swap readahead could be measured, so
> that the swap readahead algorithm and parameters could be tuned
> accordingly.
>
> ...
>
> --- a/include/linux/vm_event_item.h
> +++ b/include/linux/vm_event_item.h
> @@ -106,6 +106,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>  		VMACACHE_FIND_HITS,
>  		VMACACHE_FULL_FLUSHES,
>  #endif
> +		SWAP_RA,
> +		SWAP_RA_HIT,
>  		NR_VM_EVENT_ITEMS
>  };

swap_state.o isn't even compiled if CONFIG_SWAP=n so there doesn't seem
much point in displaying these?

--- a/include/linux/vm_event_item.h~mm-swap-add-swap-readahead-hit-statistics-fix
+++ a/include/linux/vm_event_item.h
@@ -106,8 +106,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 		VMACACHE_FIND_HITS,
 		VMACACHE_FULL_FLUSHES,
 #endif
+#ifdef CONFIG_SWAP
 		SWAP_RA,
 		SWAP_RA_HIT,
+#endif
 		NR_VM_EVENT_ITEMS
 };
 
--- a/mm/vmstat.c~mm-swap-add-swap-readahead-hit-statistics-fix
+++ a/mm/vmstat.c
@@ -1098,9 +1098,10 @@ const char * const vmstat_text[] = {
 	"vmacache_find_hits",
 	"vmacache_full_flushes",
 #endif
-
+#ifdef CONFIG_SWAP
 	"swap_ra",
 	"swap_ra_hit",
+#endif
 #endif /* CONFIG_VM_EVENTS_COUNTERS */
 };
 #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
_

* Re: [PATCH -mm -v4 1/5] mm, swap: Add swap readahead hit statistics
From: Huang, Ying @ 2017-08-09 23:17 UTC
To: Andrew Morton
Cc: Huang, Ying, linux-mm, linux-kernel, Johannes Weiner, Minchan Kim, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen

Andrew Morton <akpm@linux-foundation.org> writes:

> On Mon, 7 Aug 2017 13:40:34 +0800 "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> From: Huang Ying <ying.huang@intel.com>
>>
>> The statistics for total readahead pages and total readahead hits are
>> recorded and exported via the following sysfs interface.
>>
>> /sys/kernel/mm/swap/ra_hits
>> /sys/kernel/mm/swap/ra_total
>>
>> With them, the efficiency of the swap readahead could be measured, so
>> that the swap readahead algorithm and parameters could be tuned
>> accordingly.
>>
>> ...
>>
>> --- a/include/linux/vm_event_item.h
>> +++ b/include/linux/vm_event_item.h
>> @@ -106,6 +106,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
>>  		VMACACHE_FIND_HITS,
>>  		VMACACHE_FULL_FLUSHES,
>>  #endif
>> +		SWAP_RA,
>> +		SWAP_RA_HIT,
>>  		NR_VM_EVENT_ITEMS
>>  };
>
> swap_state.o isn't even compiled if CONFIG_SWAP=n so there doesn't seem
> much point in displaying these?

Oh, Yes!  Thanks for pointing this out.

Best Regards,
Huang, Ying

> --- a/include/linux/vm_event_item.h~mm-swap-add-swap-readahead-hit-statistics-fix
> +++ a/include/linux/vm_event_item.h
> @@ -106,8 +106,10 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
>  		VMACACHE_FIND_HITS,
>  		VMACACHE_FULL_FLUSHES,
>  #endif
> +#ifdef CONFIG_SWAP
>  		SWAP_RA,
>  		SWAP_RA_HIT,
> +#endif
>  		NR_VM_EVENT_ITEMS
>  };
>
> --- a/mm/vmstat.c~mm-swap-add-swap-readahead-hit-statistics-fix
> +++ a/mm/vmstat.c
> @@ -1098,9 +1098,10 @@ const char * const vmstat_text[] = {
>  	"vmacache_find_hits",
>  	"vmacache_full_flushes",
>  #endif
> -
> +#ifdef CONFIG_SWAP
>  	"swap_ra",
>  	"swap_ra_hit",
> +#endif
>  #endif /* CONFIG_VM_EVENTS_COUNTERS */
>  };
>  #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA */
> _

* [PATCH -mm -v4 2/5] mm, swap: Fix swap readahead marking
From: Huang, Ying @ 2017-08-07 5:40 UTC
To: Andrew Morton
Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen

From: Huang Ying <ying.huang@intel.com>

In the original implementation, pages that already exist in the swap
cache (that is, pages not newly read ahead) could be marked as
readahead pages. This makes the swap readahead statistics wrong and
also influences the swap readahead algorithm.

This is fixed by marking a page as a readahead page only if it is
newly allocated and read from the disk.

When testing with linpack, the swap readahead hit rate increased from
~66% to ~86% after the fix.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 mm/swap_state.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/swap_state.c b/mm/swap_state.c
index d1bdb31cab13..a901afe9da61 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -498,7 +498,7 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	unsigned long start_offset, end_offset;
 	unsigned long mask;
 	struct blk_plug plug;
-	bool do_poll = true;
+	bool do_poll = true, page_allocated;
 
 	mask = swapin_nr_pages(offset) - 1;
 	if (!mask)
@@ -514,14 +514,18 @@ struct page *swapin_readahead(swp_entry_t entry, gfp_t gfp_mask,
 	blk_start_plug(&plug);
 	for (offset = start_offset; offset <= end_offset ; offset++) {
 		/* Ok, do the async read-ahead now */
-		page = read_swap_cache_async(swp_entry(swp_type(entry), offset),
-						gfp_mask, vma, addr, false);
+		page = __read_swap_cache_async(
+			swp_entry(swp_type(entry), offset),
+			gfp_mask, vma, addr, &page_allocated);
 		if (!page)
 			continue;
-		if (offset != entry_offset &&
-		    likely(!PageTransCompound(page))) {
-			SetPageReadahead(page);
-			count_vm_event(SWAP_RA);
+		if (page_allocated) {
+			swap_readpage(page, false);
+			if (offset != entry_offset &&
+			    likely(!PageTransCompound(page))) {
+				SetPageReadahead(page);
+				count_vm_event(SWAP_RA);
+			}
 		}
 		put_page(page);
 	}
-- 
2.11.0

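The marking rule of this patch can be modelled in userspace as below. This is only a sketch: struct slot and mark_readahead() are hypothetical stand-ins invented for illustration, and the compound page check from the real code is left out. It shows that a page which was already in the swap cache keeps its flag untouched and does not inflate the readahead count, while the faulting slot itself is read but never marked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a swap slot; not a kernel structure. */
struct slot {
	bool in_swap_cache;	/* page already present before this pass */
	bool readahead_flag;	/* models the page's readahead flag */
};

/*
 * Walk the readahead window [start, end] around the faulting slot and
 * return how many pages were marked as readahead (what SWAP_RA would
 * count). Only pages newly brought in by this pass are marked, and the
 * faulting slot itself is never marked.
 */
static unsigned int mark_readahead(struct slot *slots, size_t start,
				   size_t end, size_t fault)
{
	unsigned int marked = 0;

	assert(slots && start <= fault && fault <= end);
	for (size_t i = start; i <= end; i++) {
		bool page_allocated = !slots[i].in_swap_cache;

		if (!page_allocated)
			continue;	/* already cached: leave its flag alone */
		slots[i].in_swap_cache = true;	/* models the read from disk */
		if (i != fault) {
			slots[i].readahead_flag = true;
			marked++;
		}
	}
	return marked;
}
```

With a window of four slots where one neighbour is already cached, only the two newly read neighbours are counted, which is the behaviour the commit message describes.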
* [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead
From: Huang, Ying @ 2017-08-07 5:40 UTC
To: Andrew Morton
Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen

From: Huang Ying <ying.huang@intel.com>

Swap readahead is an important mechanism for reducing swap-in latency.
Although a purely sequential memory access pattern isn't very common
for anonymous memory, space locality is still considered valid.

In the original swap readahead implementation, consecutive blocks in
the swap device are read ahead based on a global space locality
estimation. But the consecutive blocks in the swap device just reflect
the order of page reclaiming; they don't necessarily reflect the access
pattern in virtual memory. Also, different tasks in the system may have
different access patterns, which makes the global space locality
estimation incorrect.

In this patch, when a page fault occurs, the virtual pages near the
fault address are read ahead instead of the swap slots near the
faulting swap slot in the swap device. This avoids reading ahead
unrelated swap slots. At the same time, swap readahead is changed to
work per VMA instead of globally, so that the different access patterns
of different VMAs can be distinguished and a different readahead policy
can be applied to each accordingly. The original core readahead
detection and scaling algorithm is reused, because it is an effective
algorithm for detecting space locality.

The tests and results are as follows.

Common test condition
=====================

Test Machine: Xeon E5 v3 (2 sockets, 72 threads, 32G RAM)
Swap device: NVMe disk

Micro-benchmark with combined access pattern
============================================

vm-scalability, sequential swap test case: 4 processes eat 50G of
virtual memory space and repeat the sequential memory writes for 300
seconds. The first round of writes triggers swap-out; the following
rounds trigger sequential swap-in and swap-out.

At the same time, the vm-scalability random swap test case runs in the
background: 8 processes eat 30G of virtual memory space and repeat the
random memory writes for 300 seconds. This triggers random swap-in in
the background.

This is a combined workload with sequential and random memory accesses
at the same time. The results (for the sequential workload) are as
follows,

                    Base            Optimized
                    ----            ---------
throughput          345413 KB/s     414029 KB/s (+19.9%)
latency.average     97.14 us        61.06 us (-37.1%)
latency.50th        2 us            1 us
latency.60th        2 us            1 us
latency.70th        98 us           2 us
latency.80th        160 us          2 us
latency.90th        260 us          217 us
latency.95th        346 us          369 us
latency.99th        1.34 ms         1.09 ms
ra_hit%             52.69%          99.98%

The original swap readahead algorithm is confused by the background
random access workload, so its readahead hit rate is lower. The
VMA-based readahead algorithm works much better.

Linpack
=======

The test memory size is bigger than RAM, to trigger swapping.

                    Base            Optimized
                    ----            ---------
elapsed_time        393.49 s        329.88 s (-16.2%)
ra_hit%             86.21%          98.82%

The scores of the base and optimized kernels show no visible change.
But the elapsed time is reduced and the readahead hit rate is improved,
so the optimized kernel runs better during the startup and tear-down
stages. The absolute readahead hit rate is also high, which shows that
space locality is still valid in some practical workloads.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Shaohua Li <shli@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Tim Chen <tim.c.chen@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> --- include/linux/mm_types.h | 1 + include/linux/swap.h | 57 ++++++++++++- mm/memory.c | 23 +++-- mm/shmem.c | 2 +- mm/swap_state.c | 215 +++++++++++++++++++++++++++++++++++++++++++---- 5 files changed, 273 insertions(+), 25 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 7f384bb62d8e..5c02027050a2 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -335,6 +335,7 @@ struct vm_area_struct { struct file * vm_file; /* File we map to (can be NULL). */ void * vm_private_data; /* was vm_pte (shared mem) */ + atomic_long_t swap_readahead_info; #ifndef CONFIG_MMU struct vm_region *vm_region; /* NOMMU mapping region */ #endif diff --git a/include/linux/swap.h b/include/linux/swap.h index 76f1632eea5a..61d63379e956 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -251,6 +251,25 @@ struct swap_info_struct { struct swap_cluster_list discard_clusters; /* discard clusters list */ }; +#ifdef CONFIG_64BIT +#define SWAP_RA_ORDER_CEILING 5 +#else +/* Avoid stack overflow, because we need to save part of page table */ +#define SWAP_RA_ORDER_CEILING 3 +#define SWAP_RA_PTE_CACHE_SIZE (1 << SWAP_RA_ORDER_CEILING) +#endif + +struct vma_swap_readahead { + unsigned short win; + unsigned short offset; + unsigned short nr_pte; +#ifdef CONFIG_64BIT + pte_t *ptes; +#else + pte_t ptes[SWAP_RA_PTE_CACHE_SIZE]; +#endif +}; + /* linux/mm/workingset.c */ void *workingset_eviction(struct address_space *mapping, struct page *page); bool workingset_refault(void *shadow); @@ -350,6 +369,7 @@ int generic_swapfile_activate(struct swap_info_struct *, struct file *, #define 
SWAP_ADDRESS_SPACE_SHIFT 14 #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT) extern struct address_space *swapper_spaces[]; +extern bool swap_vma_readahead; #define swap_address_space(entry) \ (&swapper_spaces[swp_type(entry)][swp_offset(entry) \ >> SWAP_ADDRESS_SPACE_SHIFT]) @@ -362,7 +382,9 @@ extern void __delete_from_swap_cache(struct page *); extern void delete_from_swap_cache(struct page *); extern void free_page_and_swap_cache(struct page *); extern void free_pages_and_swap_cache(struct page **, int); -extern struct page *lookup_swap_cache(swp_entry_t); +extern struct page *lookup_swap_cache(swp_entry_t entry, + struct vm_area_struct *vma, + unsigned long addr); extern struct page *read_swap_cache_async(swp_entry_t, gfp_t, struct vm_area_struct *vma, unsigned long addr, bool do_poll); @@ -372,6 +394,17 @@ extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t, extern struct page *swapin_readahead(swp_entry_t, gfp_t, struct vm_area_struct *vma, unsigned long addr); +extern struct page *swap_readahead_detect(struct vm_fault *vmf, + struct vma_swap_readahead *swap_ra); +extern struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask, + struct vm_fault *vmf, + struct vma_swap_readahead *swap_ra); + +static inline bool swap_use_vma_readahead(void) +{ + return READ_ONCE(swap_vma_readahead); +} + /* linux/mm/swapfile.c */ extern atomic_long_t nr_swap_pages; extern long total_swap_pages; @@ -466,12 +499,32 @@ static inline struct page *swapin_readahead(swp_entry_t swp, gfp_t gfp_mask, return NULL; } +static inline bool swap_use_vma_readahead(void) +{ + return false; +} + +static inline struct page *swap_readahead_detect( + struct vm_fault *vmf, struct vma_swap_readahead *swap_ra) +{ + return NULL; +} + +static inline struct page *do_swap_page_readahead( + swp_entry_t fentry, gfp_t gfp_mask, + struct vm_fault *vmf, struct vma_swap_readahead *swap_ra) +{ + return NULL; +} + static inline int swap_writepage(struct page *p, struct 
writeback_control *wbc) { return 0; } -static inline struct page *lookup_swap_cache(swp_entry_t swp) +static inline struct page *lookup_swap_cache(swp_entry_t swp, + struct vm_area_struct *vma, + unsigned long addr) { return NULL; } diff --git a/mm/memory.c b/mm/memory.c index 5cd37032ca67..e44fba697fd7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2733,16 +2733,23 @@ noinline int swap_lock_page_or_retry(struct page *page, struct mm_struct *mm, int do_swap_page(struct vm_fault *vmf) { struct vm_area_struct *vma = vmf->vma; - struct page *page, *swapcache; + struct page *page = NULL, *swapcache; struct mem_cgroup *memcg; + struct vma_swap_readahead swap_ra; swp_entry_t entry; pte_t pte; int locked; int exclusive = 0; int ret = 0; + bool vma_readahead = swap_use_vma_readahead(); - if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) + if (vma_readahead) + page = swap_readahead_detect(vmf, &swap_ra); + if (!pte_unmap_same(vma->vm_mm, vmf->pmd, vmf->pte, vmf->orig_pte)) { + if (page) + put_page(page); goto out; + } entry = pte_to_swp_entry(vmf->orig_pte); if (unlikely(non_swap_entry(entry))) { @@ -2758,10 +2765,16 @@ int do_swap_page(struct vm_fault *vmf) goto out; } delayacct_set_flag(DELAYACCT_PF_SWAPIN); - page = lookup_swap_cache(entry); + if (!page) + page = lookup_swap_cache(entry, vma_readahead ? vma : NULL, + vmf->address); if (!page) { - page = swapin_readahead(entry, GFP_HIGHUSER_MOVABLE, vma, - vmf->address); + if (vma_readahead) + page = do_swap_page_readahead(entry, + GFP_HIGHUSER_MOVABLE, vmf, &swap_ra); + else + page = swapin_readahead(entry, + GFP_HIGHUSER_MOVABLE, vma, vmf->address); if (!page) { /* * Back out if somebody else faulted in this pte diff --git a/mm/shmem.c b/mm/shmem.c index 3c381ef18c30..3d366c6e5608 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1645,7 +1645,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index, if (swap.val) { /* Look it up and read it in.. 
*/ - page = lookup_swap_cache(swap); + page = lookup_swap_cache(swap, NULL, 0); if (!page) { /* Or update major stats only when swapin succeeds?? */ if (fault_type) { diff --git a/mm/swap_state.c b/mm/swap_state.c index a901afe9da61..3885fef7bdf5 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -37,6 +37,29 @@ static const struct address_space_operations swap_aops = { struct address_space *swapper_spaces[MAX_SWAPFILES]; static unsigned int nr_swapper_spaces[MAX_SWAPFILES]; +bool swap_vma_readahead = true; + +#define SWAP_RA_MAX_ORDER_DEFAULT 3 + +static int swap_ra_max_order = SWAP_RA_MAX_ORDER_DEFAULT; + +#define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2) +#define SWAP_RA_HITS_MASK ((1UL << SWAP_RA_WIN_SHIFT) - 1) +#define SWAP_RA_HITS_MAX SWAP_RA_HITS_MASK +#define SWAP_RA_WIN_MASK (~PAGE_MASK & ~SWAP_RA_HITS_MASK) + +#define SWAP_RA_HITS(v) ((v) & SWAP_RA_HITS_MASK) +#define SWAP_RA_WIN(v) (((v) & SWAP_RA_WIN_MASK) >> SWAP_RA_WIN_SHIFT) +#define SWAP_RA_ADDR(v) ((v) & PAGE_MASK) + +#define SWAP_RA_VAL(addr, win, hits) \ + (((addr) & PAGE_MASK) | \ + (((win) << SWAP_RA_WIN_SHIFT) & SWAP_RA_WIN_MASK) | \ + ((hits) & SWAP_RA_HITS_MASK)) + +/* Initial readahead hits is 4 to start up with a small window */ +#define GET_SWAP_RA_VAL(vma) \ + (atomic_long_read(&(vma)->swap_readahead_info) ? : 4) #define INC_CACHE_INFO(x) do { swap_cache_info.x++; } while (0) #define ADD_CACHE_INFO(x, nr) do { swap_cache_info.x += (nr); } while (0) @@ -297,21 +320,36 @@ void free_pages_and_swap_cache(struct page **pages, int nr) * lock getting page table operations atomic even if we drop the page * lock before returning. 
*/ -struct page * lookup_swap_cache(swp_entry_t entry) +struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma, + unsigned long addr) { struct page *page; + unsigned long ra_info; + int win, hits, readahead; page = find_get_page(swap_address_space(entry), swp_offset(entry)); - if (page && likely(!PageTransCompound(page))) { + INC_CACHE_INFO(find_total); + if (page) { INC_CACHE_INFO(find_success); - if (TestClearPageReadahead(page)) { - atomic_inc(&swapin_readahead_hits); + if (unlikely(PageTransCompound(page))) + return page; + readahead = TestClearPageReadahead(page); + if (vma) { + ra_info = GET_SWAP_RA_VAL(vma); + win = SWAP_RA_WIN(ra_info); + hits = SWAP_RA_HITS(ra_info); + if (readahead) + hits = min_t(int, hits + 1, SWAP_RA_HITS_MAX); + atomic_long_set(&vma->swap_readahead_info, + SWAP_RA_VAL(addr, win, hits)); + } + if (readahead) { count_vm_event(SWAP_RA_HIT); + if (!vma) + atomic_inc(&swapin_readahead_hits); } } - - INC_CACHE_INFO(find_total); return page; } @@ -426,22 +464,20 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, return retpage; } -static unsigned long swapin_nr_pages(unsigned long offset) +static unsigned int __swapin_nr_pages(unsigned long prev_offset, + unsigned long offset, + int hits, + int max_pages, + int prev_win) { - static unsigned long prev_offset; - unsigned int pages, max_pages, last_ra; - static atomic_t last_readahead_pages; - - max_pages = 1 << READ_ONCE(page_cluster); - if (max_pages <= 1) - return 1; + unsigned int pages, last_ra; /* * This heuristic has been found to work well on both sequential and * random loads, swapping to hard disk or to SSD: please don't ask * what the "+ 2" means, it just happens to work well, that's all. 
*/ - pages = atomic_xchg(&swapin_readahead_hits, 0) + 2; + pages = hits + 2; if (pages == 2) { /* * We can have no readahead hits to judge by: but must not get @@ -450,7 +486,6 @@ static unsigned long swapin_nr_pages(unsigned long offset) */ if (offset != prev_offset + 1 && offset != prev_offset - 1) pages = 1; - prev_offset = offset; } else { unsigned int roundup = 4; while (roundup < pages) @@ -462,9 +497,28 @@ static unsigned long swapin_nr_pages(unsigned long offset) pages = max_pages; /* Don't shrink readahead too fast */ - last_ra = atomic_read(&last_readahead_pages) / 2; + last_ra = prev_win / 2; if (pages < last_ra) pages = last_ra; + + return pages; +} + +static unsigned long swapin_nr_pages(unsigned long offset) +{ + static unsigned long prev_offset; + unsigned int hits, pages, max_pages; + static atomic_t last_readahead_pages; + + max_pages = 1 << READ_ONCE(page_cluster); + if (max_pages <= 1) + return 1; + + hits = atomic_xchg(&swapin_readahead_hits, 0); + pages = __swapin_nr_pages(prev_offset, offset, hits, max_pages, + atomic_read(&last_readahead_pages)); + if (!hits) + prev_offset = offset; atomic_set(&last_readahead_pages, pages); return pages; @@ -570,3 +624,130 @@ void exit_swap_address_space(unsigned int type) synchronize_rcu(); kvfree(spaces); } + +static inline void swap_ra_clamp_pfn(struct vm_area_struct *vma, + unsigned long faddr, + unsigned long lpfn, + unsigned long rpfn, + unsigned long *start, + unsigned long *end) +{ + *start = max3(lpfn, PFN_DOWN(vma->vm_start), + PFN_DOWN(faddr & PMD_MASK)); + *end = min3(rpfn, PFN_DOWN(vma->vm_end), + PFN_DOWN((faddr & PMD_MASK) + PMD_SIZE)); +} + +struct page *swap_readahead_detect(struct vm_fault *vmf, + struct vma_swap_readahead *swap_ra) +{ + struct vm_area_struct *vma = vmf->vma; + unsigned long swap_ra_info; + struct page *page; + swp_entry_t entry; + unsigned long faddr, pfn, fpfn; + unsigned long start, end; + pte_t *pte; + unsigned int max_win, hits, prev_win, win, left; +#ifndef 
CONFIG_64BIT + pte_t *tpte; +#endif + + faddr = vmf->address; + entry = pte_to_swp_entry(vmf->orig_pte); + if ((unlikely(non_swap_entry(entry)))) + return NULL; + page = lookup_swap_cache(entry, vma, faddr); + if (page) + return page; + + max_win = 1 << READ_ONCE(swap_ra_max_order); + if (max_win == 1) { + swap_ra->win = 1; + return NULL; + } + + fpfn = PFN_DOWN(faddr); + swap_ra_info = GET_SWAP_RA_VAL(vma); + pfn = PFN_DOWN(SWAP_RA_ADDR(swap_ra_info)); + prev_win = SWAP_RA_WIN(swap_ra_info); + hits = SWAP_RA_HITS(swap_ra_info); + swap_ra->win = win = __swapin_nr_pages(pfn, fpfn, hits, + max_win, prev_win); + atomic_long_set(&vma->swap_readahead_info, + SWAP_RA_VAL(faddr, win, 0)); + + if (win == 1) + return NULL; + + /* Copy the PTEs because the page table may be unmapped */ + if (fpfn == pfn + 1) + swap_ra_clamp_pfn(vma, faddr, fpfn, fpfn + win, &start, &end); + else if (pfn == fpfn + 1) + swap_ra_clamp_pfn(vma, faddr, fpfn - win + 1, fpfn + 1, + &start, &end); + else { + left = (win - 1) / 2; + swap_ra_clamp_pfn(vma, faddr, fpfn - left, fpfn + win - left, + &start, &end); + } + swap_ra->nr_pte = end - start; + swap_ra->offset = fpfn - start; + pte = vmf->pte - swap_ra->offset; +#ifdef CONFIG_64BIT + swap_ra->ptes = pte; +#else + tpte = swap_ra->ptes; + for (pfn = start; pfn != end; pfn++) + *tpte++ = *pte++; +#endif + + return NULL; +} + +struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask, + struct vm_fault *vmf, + struct vma_swap_readahead *swap_ra) +{ + struct blk_plug plug; + struct vm_area_struct *vma = vmf->vma; + struct page *page; + pte_t *pte, pentry; + swp_entry_t entry; + unsigned int i; + bool page_allocated; + + if (swap_ra->win == 1) + goto skip; + + blk_start_plug(&plug); + for (i = 0, pte = swap_ra->ptes; i < swap_ra->nr_pte; + i++, pte++) { + pentry = *pte; + if (pte_none(pentry)) + continue; + if (pte_present(pentry)) + continue; + entry = pte_to_swp_entry(pentry); + if (unlikely(non_swap_entry(entry))) + continue; + page = 
__read_swap_cache_async(entry, gfp_mask, vma, + vmf->address, &page_allocated); + if (!page) + continue; + if (page_allocated) { + swap_readpage(page, false); + if (i != swap_ra->offset && + likely(!PageTransCompound(page))) { + SetPageReadahead(page); + count_vm_event(SWAP_RA); + } + } + put_page(page); + } + blk_finish_plug(&plug); + lru_add_drain(); +skip: + return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address, + swap_ra->win == 1); +} -- 2.11.0 ^ permalink raw reply related [flat|nested] 19+ messages in thread
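To make the per-VMA state in the patch above easier to follow, here is a userspace sketch of two pieces of it: the packing of (fault address, window, hits) into a single long, and the window sizing heuristic. The SWAP_RA_* macros are copied from the diff. PAGE_SHIFT = 12 is an assumption of this sketch (4 KiB pages), and ra_window() is a renamed userspace copy of __swapin_nr_pages(), not kernel code.

```c
#include <assert.h>

/* Assumption for this sketch: 4 KiB pages. */
#define PAGE_SHIFT		12
#define PAGE_MASK		(~((1UL << PAGE_SHIFT) - 1))

/* Low 6 bits hold the hit count, next 6 bits hold the window size. */
#define SWAP_RA_WIN_SHIFT	(PAGE_SHIFT / 2)
#define SWAP_RA_HITS_MASK	((1UL << SWAP_RA_WIN_SHIFT) - 1)
#define SWAP_RA_WIN_MASK	(~PAGE_MASK & ~SWAP_RA_HITS_MASK)

#define SWAP_RA_HITS(v)		((v) & SWAP_RA_HITS_MASK)
#define SWAP_RA_WIN(v)		(((v) & SWAP_RA_WIN_MASK) >> SWAP_RA_WIN_SHIFT)
#define SWAP_RA_ADDR(v)		((v) & PAGE_MASK)
#define SWAP_RA_VAL(addr, win, hits)				\
	(((addr) & PAGE_MASK) |					\
	 (((win) << SWAP_RA_WIN_SHIFT) & SWAP_RA_WIN_MASK) |	\
	 ((hits) & SWAP_RA_HITS_MASK))

/* Userspace copy of the window sizing logic in __swapin_nr_pages(). */
static unsigned int ra_window(unsigned long prev_pfn, unsigned long pfn,
			      unsigned int hits, unsigned int max_pages,
			      unsigned int prev_win)
{
	unsigned int pages = hits + 2, last_ra;

	assert(max_pages >= 1);
	if (pages == 2) {
		/* No hits: read ahead only if adjacent to the last fault. */
		if (pfn != prev_pfn + 1 && pfn != prev_pfn - 1)
			pages = 1;
	} else {
		/* Round up to a power of two, starting at 4. */
		unsigned int roundup = 4;

		while (roundup < pages)
			roundup <<= 1;
		pages = roundup;
	}
	if (pages > max_pages)
		pages = max_pages;

	/* Don't shrink the window by more than half per fault. */
	last_ra = prev_win / 2;
	if (pages < last_ra)
		pages = last_ra;
	return pages;
}
```

Under the 4 KiB assumption the hit count saturates at 63 and the window field is also 6 bits wide, which is enough for the largest SWAP_RA_ORDER_CEILING in the patch (a window of 32 pages). The page offset bits of the fault address are what make room for both fields.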
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead
  2017-08-07 5:40 ` [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead Huang, Ying
@ 2017-09-13 1:40 ` Minchan Kim
  2017-09-13 21:02 ` Andrew Morton
  0 siblings, 1 reply; 19+ messages in thread
From: Minchan Kim @ 2017-09-13 1:40 UTC (permalink / raw)
  To: Huang, Ying, Andrew Morton
  Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

On Mon, Aug 07, 2017 at 01:40:36PM +0800, Huang, Ying wrote:
> From: Huang Ying <ying.huang@intel.com>
>
> The swap readahead is an important mechanism to reduce the swap in
> latency.  Although pure sequential memory access pattern isn't very
> popular for anonymous memory, the space locality is still considered
> valid.
>
> In the original swap readahead implementation, the consecutive blocks
> in swap device are readahead based on the global space locality
> estimation.  But the consecutive blocks in swap device just reflect
> the order of page reclaiming, don't necessarily reflect the access
> pattern in virtual memory.  And the different tasks in the system may
> have different access patterns, which makes the global space locality
> estimation incorrect.
>
> In this patch, when page fault occurs, the virtual pages near the
> fault address will be readahead instead of the swap slots near the
> fault swap slot in swap device.  This avoid to readahead the unrelated
> swap slots.  At the same time, the swap readahead is changed to work
> on per-VMA from globally.  So that the different access patterns of
> the different VMAs could be distinguished, and the different readahead
> policy could be applied accordingly.  The original core readahead
> detection and scaling algorithm is reused, because it is an effect
> algorithm to detect the space locality.

Andrew,

Every zram user, such as low-end Android devices, has set page-cluster to 0
to disable swap readahead. zram has no seek cost and works as a synchronous
IO operation, so if we read ahead multiple pages, the swap fault latency
grows to (4K * readahead window size). IOW, readahead is meaningful only
if it doesn't hurt the faulted page's latency.

However, this patch introduces an additional knob,
/sys/kernel/mm/swap/vma_ra_max_order, alongside page-cluster. That means
setups that disabled swap readahead via page-cluster no longer have it
disabled until their users become aware of the new knob and modify their
scripts/code to zero vma_ra_max_order as well as page-cluster.

I say it's a *regression* and wanted to fix it, but Huang's opinion is
that it's not a functional regression, so userspace should be fixed
instead. Please see the details of the discussion in
http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E

The discussion has not been productive, so it's time to follow the
maintainer's opinion. Could you share your opinion?

Thanks.
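The "(4K * readahead window size)" estimate above can be made concrete with a small sketch. It assumes 4 KiB pages and a window of 2^order pages, which is how the patch derives `max_win` from `swap_ra_max_order` and how page-cluster is documented. The helper names `ra_window_pages()` and `ra_sync_read_bytes()` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Pages covered by a readahead window of the given order (2^order). */
static size_t ra_window_pages(unsigned int order)
{
	assert(order < 16);	/* keep the shift well within range */
	return (size_t)1 << order;
}

/*
 * Bytes read for one fault if the whole window is fetched. On a
 * synchronous device such as zram, the fault waits for all of them.
 */
static size_t ra_sync_read_bytes(unsigned int order, size_t page_size)
{
	return ra_window_pages(order) * page_size;
}
```

With order 3 and 4096-byte pages this is 8 pages, or 32768 bytes per fault, versus 4096 bytes when the order is 0. That ratio is the latency cost zram users avoid by zeroing the knob.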
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead
  2017-09-13 1:40 ` Minchan Kim
@ 2017-09-13 21:02 ` Andrew Morton
  2017-09-14 0:53 ` Huang, Ying
  2017-09-14 7:53 ` Minchan Kim
  0 siblings, 2 replies; 19+ messages in thread
From: Andrew Morton @ 2017-09-13 21:02 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, linux-mm, linux-kernel, Johannes Weiner,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Every zram users like low-end android device has used 0 page-cluster
> to disable swap readahead because it has no seek cost and works as
> synchronous IO operation so if we do readahead multiple pages,
> swap falut latency would be (4K * readahead window size). IOW,
> readahead is meaningful only if it doesn't bother faulted page's
> latency.
>
> However, this patch introduces additional knob /sys/kernel/mm/swap/
> vma_ra_max_order as well as page-cluster. It means existing users
> has used disabled swap readahead doesn't work until they should be
> aware of new knob and modification of their script/code to disable
> vma_ra_max_order as well as page-cluster.
>
> I say it's a *regression* and wanted to fix it but Huang's opinion
> is that it's not a functional regression so userspace should be fixed
> by themselves.
> Please look into detail of discussion in
> http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E

hm, tricky problem.  I do agree that linking the physical and virtual
readahead schemes in the proposed fashion is unfortunate.  I also agree
that breaking existing setups (a bit) is also unfortunate.

Would it help if, when page-cluster is written to zero, we do

printk_once("physical readahead disabled, virtual readahead still
enabled.  Disable virtual readahead via
/sys/kernel/mm/swap/vma_ra_max_order").

Or something like that.  It's pretty lame, but it should help alert the
zram-readahead-disabling people to the issue?
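A userspace sketch of the warn-once idea proposed above, for illustration only. `struct swap_ra_knobs` and `set_page_cluster()` are hypothetical names, a static flag stands in for the kernel's `printk_once()`, and the check that virtual readahead is still enabled is an addition of this sketch rather than part of the proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical mirror of the two tunables discussed in the thread. */
struct swap_ra_knobs {
	int page_cluster;	/* physical readahead: 2^page_cluster pages */
	int vma_ra_max_order;	/* virtual readahead: 2^order pages */
};

/*
 * Stand-in for a page-cluster write handler. Returns true only when the
 * warning was emitted by this call, so callers and tests can observe the
 * once-only behaviour.
 */
static bool set_page_cluster(struct swap_ra_knobs *k, int val)
{
	static bool warned;	/* emulates printk_once(): one message at most */

	assert(k != NULL && val >= 0);
	k->page_cluster = val;

	if (val == 0 && k->vma_ra_max_order > 0 && !warned) {
		warned = true;
		fprintf(stderr,
			"physical readahead disabled, virtual readahead still enabled. "
			"Disable virtual readahead via "
			"/sys/kernel/mm/swap/vma_ra_max_order\n");
		return true;
	}
	return false;
}
```

The message is only useful if someone reads the log, which is the weakness raised later in the thread.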
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead 2017-09-13 21:02 ` Andrew Morton @ 2017-09-14 0:53 ` Huang, Ying 2017-09-14 8:15 ` Minchan Kim 2017-09-14 7:53 ` Minchan Kim 1 sibling, 1 reply; 19+ messages in thread From: Huang, Ying @ 2017-09-14 0:53 UTC (permalink / raw) To: Andrew Morton Cc: Minchan Kim, Huang, Ying, linux-mm, linux-kernel, Johannes Weiner, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen Hi, Andrew, Andrew Morton <akpm@linux-foundation.org> writes: > On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim <minchan@kernel.org> wrote: > >> Every zram users like low-end android device has used 0 page-cluster >> to disable swap readahead because it has no seek cost and works as >> synchronous IO operation so if we do readahead multiple pages, >> swap falut latency would be (4K * readahead window size). IOW, >> readahead is meaningful only if it doesn't bother faulted page's >> latency. >> >> However, this patch introduces additional knob /sys/kernel/mm/swap/ >> vma_ra_max_order as well as page-cluster. It means existing users >> has used disabled swap readahead doesn't work until they should be >> aware of new knob and modification of their script/code to disable >> vma_ra_max_order as well as page-cluster. >> >> I say it's a *regression* and wanted to fix it but Huang's opinion >> is that it's not a functional regression so userspace should be fixed >> by themselves. >> Please look into detail of discussion in >> http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E > > hm, tricky problem. I do agree that linking the physical and virtual > readahead schemes in the proposed fashion is unfortunate. I also agree > that breaking existing setups (a bit) is also unfortunate. > > Would it help if, when page-cluster is written to zero, we do > > printk_once("physical readahead disabled, virtual readahead still > enabled. Disable virtual readhead via > /sys/kernel/mm/swap/vma_ra_max_order"). 
>
> Or something like that.  It's pretty lame, but it should help alert the
> zram-readahead-disabling people to the issue?

This sounds good to me.

Hi, Minchan, what do you think about this?  I think that for low-end
Android devices, the end user may have no opportunity to upgrade to the
latest kernel, so the device vendor should take care of this.  For desktop
users, the warning proposed by Andrew may help to remind them of the new
knob.

Best Regards,
Huang, Ying
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead 2017-09-14 0:53 ` Huang, Ying @ 2017-09-14 8:15 ` Minchan Kim 0 siblings, 0 replies; 19+ messages in thread From: Minchan Kim @ 2017-09-14 8:15 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen On Thu, Sep 14, 2017 at 08:53:04AM +0800, Huang, Ying wrote: > Hi, Andrew, > > Andrew Morton <akpm@linux-foundation.org> writes: > > > On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim <minchan@kernel.org> wrote: > > > >> Every zram users like low-end android device has used 0 page-cluster > >> to disable swap readahead because it has no seek cost and works as > >> synchronous IO operation so if we do readahead multiple pages, > >> swap falut latency would be (4K * readahead window size). IOW, > >> readahead is meaningful only if it doesn't bother faulted page's > >> latency. > >> > >> However, this patch introduces additional knob /sys/kernel/mm/swap/ > >> vma_ra_max_order as well as page-cluster. It means existing users > >> has used disabled swap readahead doesn't work until they should be > >> aware of new knob and modification of their script/code to disable > >> vma_ra_max_order as well as page-cluster. > >> > >> I say it's a *regression* and wanted to fix it but Huang's opinion > >> is that it's not a functional regression so userspace should be fixed > >> by themselves. > >> Please look into detail of discussion in > >> http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E > > > > hm, tricky problem. I do agree that linking the physical and virtual > > readahead schemes in the proposed fashion is unfortunate. I also agree > > that breaking existing setups (a bit) is also unfortunate. > > > > Would it help if, when page-cluster is written to zero, we do > > > > printk_once("physical readahead disabled, virtual readahead still > > enabled. 
Disable virtual readhead via
> > /sys/kernel/mm/swap/vma_ra_max_order").
> >
> > Or something like that.  It's pretty lame, but it should help alert the
> > zram-readahead-disabling people to the issue?
>
> This sounds good for me.
>
> Hi, Minchan, what do you think about this?  I think for low-end android
> device, the end-user may have no opportunity to upgrade to the latest
> kernel, the device vendor should care about this.  For desktop users,
> the warning proposed by Andrew may help to remind them for the new knob.

Yes, that would be an option. At the least, we should alert users so they
get a chance to fix their setup.

However, can't we make vma-based readahead a new config option? Please
see the details in my reply to Andrew. With that, there is no regression
for current users and, as a bonus, users can measure both algorithms with
their real workloads rather than with an artificial benchmark.

I think recency and spatial locality each have pros and cons, so that kind
of soft landing would be a safer option than a sudden replacement. After a
while, we can make the new algorithm the default.
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead 2017-09-13 21:02 ` Andrew Morton 2017-09-14 0:53 ` Huang, Ying @ 2017-09-14 7:53 ` Minchan Kim 2017-09-14 12:01 ` Huang, Ying 1 sibling, 1 reply; 19+ messages in thread From: Minchan Kim @ 2017-09-14 7:53 UTC (permalink / raw) To: Andrew Morton Cc: Huang, Ying, linux-mm, linux-kernel, Johannes Weiner, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote: > On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim <minchan@kernel.org> wrote: > > > Every zram users like low-end android device has used 0 page-cluster > > to disable swap readahead because it has no seek cost and works as > > synchronous IO operation so if we do readahead multiple pages, > > swap falut latency would be (4K * readahead window size). IOW, > > readahead is meaningful only if it doesn't bother faulted page's > > latency. > > > > However, this patch introduces additional knob /sys/kernel/mm/swap/ > > vma_ra_max_order as well as page-cluster. It means existing users > > has used disabled swap readahead doesn't work until they should be > > aware of new knob and modification of their script/code to disable > > vma_ra_max_order as well as page-cluster. > > > > I say it's a *regression* and wanted to fix it but Huang's opinion > > is that it's not a functional regression so userspace should be fixed > > by themselves. > > Please look into detail of discussion in > > http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E > > hm, tricky problem. I do agree that linking the physical and virtual > readahead schemes in the proposed fashion is unfortunate. I also agree > that breaking existing setups (a bit) is also unfortunate. > > Would it help if, when page-cluster is written to zero, we do > > printk_once("physical readahead disabled, virtual readahead still > enabled. 
Disable virtual readhead via
> /sys/kernel/mm/swap/vma_ra_max_order").
>
> Or something like that.  It's pretty lame, but it should help alert the
> zram-readahead-disabling people to the issue?

That was my last resort. If we cannot find another way after all, then
yes, it is the minimum we should do. But it still breaks users who don't
or can't read the alert and modify their programs.

How about this?

Can't we make vma-based readahead a config option? With that, users who
have no interest in readahead don't enable vma-based readahead. In that
case, page-cluster works as expected ("disable readahead completely"), so
it doesn't break anything.

People who want to use the upcoming vma-based readahead can enable the
feature, and we can describe this unfortunate situation in the
config/documentation description somewhere, so that new users will be
aware of the two unfortunate knobs.
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead 2017-09-14 7:53 ` Minchan Kim @ 2017-09-14 12:01 ` Huang, Ying 2017-09-14 13:14 ` Minchan Kim 0 siblings, 1 reply; 19+ messages in thread From: Huang, Ying @ 2017-09-14 12:01 UTC (permalink / raw) To: Minchan Kim Cc: Andrew Morton, Huang, Ying, linux-mm, linux-kernel, Johannes Weiner, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen Minchan Kim <minchan@kernel.org> writes: > On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote: >> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim <minchan@kernel.org> wrote: >> >> > Every zram users like low-end android device has used 0 page-cluster >> > to disable swap readahead because it has no seek cost and works as >> > synchronous IO operation so if we do readahead multiple pages, >> > swap falut latency would be (4K * readahead window size). IOW, >> > readahead is meaningful only if it doesn't bother faulted page's >> > latency. >> > >> > However, this patch introduces additional knob /sys/kernel/mm/swap/ >> > vma_ra_max_order as well as page-cluster. It means existing users >> > has used disabled swap readahead doesn't work until they should be >> > aware of new knob and modification of their script/code to disable >> > vma_ra_max_order as well as page-cluster. >> > >> > I say it's a *regression* and wanted to fix it but Huang's opinion >> > is that it's not a functional regression so userspace should be fixed >> > by themselves. >> > Please look into detail of discussion in >> > http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E >> >> hm, tricky problem. I do agree that linking the physical and virtual >> readahead schemes in the proposed fashion is unfortunate. I also agree >> that breaking existing setups (a bit) is also unfortunate. >> >> Would it help if, when page-cluster is written to zero, we do >> >> printk_once("physical readahead disabled, virtual readahead still >> enabled. 
Disable virtual readhead via
>> /sys/kernel/mm/swap/vma_ra_max_order").
>>
>> Or something like that.  It's pretty lame, but it should help alert the
>> zram-readahead-disabling people to the issue?
>
> It was my last resort. If we cannot find other ways after all, yes, it would
> be a minimum we should do. But it still breaks users don't/can't read/modify
> alert and program.
>
> How about this?
>
> Can't we make vma-based readahead config option?
> With that, users who no interest on readahead don't enable vma-based
> readahead. In this case, page-cluster works as expected "disable readahead
> completely" so it doesn't break anything.

Right now, users can choose between VMA based readahead and the original
readahead at runtime via the following knob:

/sys/kernel/mm/swap/vma_ra_enabled

Best Regards,
Huang, Ying

> People who want to use upcoming vma-based readahead can enable the feature
> and we can say such unfortunate things in config/document description
> somewhere so upcoming users will be aware of that unforunate two knobs.
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead 2017-09-14 12:01 ` Huang, Ying @ 2017-09-14 13:14 ` Minchan Kim 2017-09-14 21:21 ` Andrew Morton 2017-09-15 3:15 ` Huang, Ying 0 siblings, 2 replies; 19+ messages in thread From: Minchan Kim @ 2017-09-14 13:14 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen On Thu, Sep 14, 2017 at 08:01:30PM +0800, Huang, Ying wrote: > Minchan Kim <minchan@kernel.org> writes: > > > On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote: > >> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim <minchan@kernel.org> wrote: > >> > >> > Every zram users like low-end android device has used 0 page-cluster > >> > to disable swap readahead because it has no seek cost and works as > >> > synchronous IO operation so if we do readahead multiple pages, > >> > swap falut latency would be (4K * readahead window size). IOW, > >> > readahead is meaningful only if it doesn't bother faulted page's > >> > latency. > >> > > >> > However, this patch introduces additional knob /sys/kernel/mm/swap/ > >> > vma_ra_max_order as well as page-cluster. It means existing users > >> > has used disabled swap readahead doesn't work until they should be > >> > aware of new knob and modification of their script/code to disable > >> > vma_ra_max_order as well as page-cluster. > >> > > >> > I say it's a *regression* and wanted to fix it but Huang's opinion > >> > is that it's not a functional regression so userspace should be fixed > >> > by themselves. > >> > Please look into detail of discussion in > >> > http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E > >> > >> hm, tricky problem. I do agree that linking the physical and virtual > >> readahead schemes in the proposed fashion is unfortunate. I also agree > >> that breaking existing setups (a bit) is also unfortunate. 
> >>
> >> Would it help if, when page-cluster is written to zero, we do
> >>
> >> printk_once("physical readahead disabled, virtual readahead still
> >> enabled.  Disable virtual readhead via
> >> /sys/kernel/mm/swap/vma_ra_max_order").
> >>
> >> Or something like that.  It's pretty lame, but it should help alert the
> >> zram-readahead-disabling people to the issue?
> >
> > It was my last resort. If we cannot find other ways after all, yes, it would
> > be a minimum we should do. But it still breaks users don't/can't read/modify
> > alert and program.
> >
> > How about this?
> >
> > Can't we make vma-based readahead config option?
> > With that, users who no interest on readahead don't enable vma-based
> > readahead. In this case, page-cluster works as expected "disable readahead
> > completely" so it doesn't break anything.
>
> Now.  Users can choose between VMA based readahead and original
> readahead via a knob as follow at runtime,
>
> /sys/kernel/mm/swap/vma_ra_enabled

It's not a config option, and it is enabled by default. IOW, it's under
the radar, so current users cannot notice it. That's why we want to emit a
big fat warning when an old user sets page-cluster to 0. However, as
Andrew said, it's lame.

If we make it a config option, product makers and users upgrading their
kernel get a chance to notice it and read the description, so they can
become aware of the two weird knobs and solve the problem in advance,
without the printk_once warning. If a user has no interest in swap
readahead, or skips the new config option by mistake, physical readahead
is used, which means no regression.

>
> Best Regards,
> Huang, Ying
>
>
> > People who want to use upcoming vma-based readahead can enable the feature
> > and we can say such unfortunate things in config/document description
> > somewhere so upcoming users will be aware of that unforunate two knobs.
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead
  2017-09-14 13:14 ` Minchan Kim
@ 2017-09-14 21:21 ` Andrew Morton
  2017-09-15 3:15 ` Huang, Ying
  1 sibling, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2017-09-14 21:21 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Huang, Ying, linux-mm, linux-kernel, Johannes Weiner,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

On Thu, 14 Sep 2017 22:14:46 +0900 Minchan Kim <minchan@kernel.org> wrote:

> > Now.  Users can choose between VMA based readahead and original
> > readahead via a knob as follow at runtime,
> >
> > /sys/kernel/mm/swap/vma_ra_enabled
>
> It's not a config option and is enabled by default. IOW, it's under the radar
> so current users cannot notice it. That's why we want to emit big fat warnning.
> when old user set 0 to page-cluster. However, as Andrew said, it's lame.
>
> If we make it config option, product maker/kernel upgrade user can have
> a chance to notice and read description so they could be aware of two weird
> knobs and help to solve the problem in advance without printk_once warn.
> If user has no interest about swap-readahead or skip the new config option
> by mistake, it works physcial readahead which means no regression.

Yup, a Kconfig option sounds like a good idea.  And that's a bit more
friendly to tiny kernels as well.
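How a build-time option would interact with the runtime knobs can be sketched as below. This is an illustration under assumptions: `CONFIG_VMA_SWAP_READAHEAD` is a hypothetical symbol name (the thread does not name the option), and `pick_swap_ra_mode()` is a hypothetical helper that models the behaviour described in the discussion, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical Kconfig symbol; in a kernel build it would be defined by
 * the configuration, here it is simply absent unless passed with -D.
 */
#ifdef CONFIG_VMA_SWAP_READAHEAD
#define VMA_RA_BUILT_IN true
#else
#define VMA_RA_BUILT_IN false
#endif

enum swap_ra_mode { SWAP_RA_NONE, SWAP_RA_PHYSICAL, SWAP_RA_VMA };

/*
 * Decide which readahead scheme applies to a swap-in fault.
 * vma_ra_enabled and vma_ra_max_order model the sysfs knobs from the
 * thread; page_cluster models the existing sysctl.
 */
static enum swap_ra_mode pick_swap_ra_mode(bool vma_built_in,
					   bool vma_ra_enabled,
					   int vma_ra_max_order,
					   int page_cluster)
{
	if (vma_built_in && vma_ra_enabled)
		return vma_ra_max_order > 0 ? SWAP_RA_VMA : SWAP_RA_NONE;

	/* Option off (or knob disabled): page-cluster alone decides. */
	return page_cluster > 0 ? SWAP_RA_PHYSICAL : SWAP_RA_NONE;
}
```

With the option compiled out, writing 0 to page-cluster disables readahead completely, which is the existing behaviour zram setups rely on. With it compiled in and enabled, page-cluster no longer has the final say.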
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead 2017-09-14 13:14 ` Minchan Kim 2017-09-14 21:21 ` Andrew Morton @ 2017-09-15 3:15 ` Huang, Ying 2017-09-15 3:42 ` Minchan Kim 1 sibling, 1 reply; 19+ messages in thread From: Huang, Ying @ 2017-09-15 3:15 UTC (permalink / raw) To: Minchan Kim Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, Johannes Weiner, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen Minchan Kim <minchan@kernel.org> writes: > On Thu, Sep 14, 2017 at 08:01:30PM +0800, Huang, Ying wrote: >> Minchan Kim <minchan@kernel.org> writes: >> >> > On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote: >> >> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim <minchan@kernel.org> wrote: >> >> >> >> > Every zram users like low-end android device has used 0 page-cluster >> >> > to disable swap readahead because it has no seek cost and works as >> >> > synchronous IO operation so if we do readahead multiple pages, >> >> > swap falut latency would be (4K * readahead window size). IOW, >> >> > readahead is meaningful only if it doesn't bother faulted page's >> >> > latency. >> >> > >> >> > However, this patch introduces additional knob /sys/kernel/mm/swap/ >> >> > vma_ra_max_order as well as page-cluster. It means existing users >> >> > has used disabled swap readahead doesn't work until they should be >> >> > aware of new knob and modification of their script/code to disable >> >> > vma_ra_max_order as well as page-cluster. >> >> > >> >> > I say it's a *regression* and wanted to fix it but Huang's opinion >> >> > is that it's not a functional regression so userspace should be fixed >> >> > by themselves. >> >> > Please look into detail of discussion in >> >> > http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E >> >> >> >> hm, tricky problem. I do agree that linking the physical and virtual >> >> readahead schemes in the proposed fashion is unfortunate. 
I also agree >> >> that breaking existing setups (a bit) is also unfortunate. >> >> >> >> Would it help if, when page-cluster is written to zero, we do >> >> >> >> printk_once("physical readahead disabled, virtual readahead still >> >> enabled. Disable virtual readhead via >> >> /sys/kernel/mm/swap/vma_ra_max_order"). >> >> >> >> Or something like that. It's pretty lame, but it should help alert the >> >> zram-readahead-disabling people to the issue? >> > >> > It was my last resort. If we cannot find other ways after all, yes, it would >> > be a minimum we should do. But it still breaks users don't/can't read/modify >> > alert and program. >> > >> > How about this? >> > >> > Can't we make vma-based readahead config option? >> > With that, users who no interest on readahead don't enable vma-based >> > readahead. In this case, page-cluster works as expected "disable readahead >> > completely" so it doesn't break anything. >> >> Now. Users can choose between VMA based readahead and original >> readahead via a knob as follow at runtime, >> >> /sys/kernel/mm/swap/vma_ra_enabled > > It's not a config option and is enabled by default. IOW, it's under the radar > so current users cannot notice it. That's why we want to emit big fat warnning. > when old user set 0 to page-cluster. However, as Andrew said, it's lame. > > If we make it config option, product maker/kernel upgrade user can have > a chance to notice and read description so they could be aware of two weird > knobs and help to solve the problem in advance without printk_once warn. > If user has no interest about swap-readahead or skip the new config option > by mistake, it works physcial readahead which means no regression. I am OK to make it config option. But I think VMA based swap readahead should be enabled by default. Because per my understanding, default option should be set for most common desktop users. And VMA based swap readahead should benefit them. 
People who need to turn off swap readahead are special users; the original
swap readahead default configuration isn't meant for them either.

Best Regards,
Huang, Ying

>>
>> Best Regards,
>> Huang, Ying
>>
>>
>> > People who want to use upcoming vma-based readahead can enable the feature
>> > and we can say such unfortunate things in config/document description
>> > somewhere so upcoming users will be aware of that unforunate two knobs.
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead 2017-09-15 3:15 ` Huang, Ying @ 2017-09-15 3:42 ` Minchan Kim 2017-09-15 4:46 ` Huang, Ying 0 siblings, 1 reply; 19+ messages in thread From: Minchan Kim @ 2017-09-15 3:42 UTC (permalink / raw) To: Huang, Ying Cc: Andrew Morton, linux-mm, linux-kernel, Johannes Weiner, Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen, Dave Hansen On Fri, Sep 15, 2017 at 11:15:08AM +0800, Huang, Ying wrote: > Minchan Kim <minchan@kernel.org> writes: > > > On Thu, Sep 14, 2017 at 08:01:30PM +0800, Huang, Ying wrote: > >> Minchan Kim <minchan@kernel.org> writes: > >> > >> > On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote: > >> >> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim <minchan@kernel.org> wrote: > >> >> > >> >> > Every zram users like low-end android device has used 0 page-cluster > >> >> > to disable swap readahead because it has no seek cost and works as > >> >> > synchronous IO operation so if we do readahead multiple pages, > >> >> > swap falut latency would be (4K * readahead window size). IOW, > >> >> > readahead is meaningful only if it doesn't bother faulted page's > >> >> > latency. > >> >> > > >> >> > However, this patch introduces additional knob /sys/kernel/mm/swap/ > >> >> > vma_ra_max_order as well as page-cluster. It means existing users > >> >> > has used disabled swap readahead doesn't work until they should be > >> >> > aware of new knob and modification of their script/code to disable > >> >> > vma_ra_max_order as well as page-cluster. > >> >> > > >> >> > I say it's a *regression* and wanted to fix it but Huang's opinion > >> >> > is that it's not a functional regression so userspace should be fixed > >> >> > by themselves. > >> >> > Please look into detail of discussion in > >> >> > http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E > >> >> > >> >> hm, tricky problem. 
I do agree that linking the physical and virtual > >> >> readahead schemes in the proposed fashion is unfortunate. I also agree > >> >> that breaking existing setups (a bit) is also unfortunate. > >> >> > >> >> Would it help if, when page-cluster is written to zero, we do > >> >> > >> >> printk_once("physical readahead disabled, virtual readahead still > >> >> enabled. Disable virtual readhead via > >> >> /sys/kernel/mm/swap/vma_ra_max_order"). > >> >> > >> >> Or something like that. It's pretty lame, but it should help alert the > >> >> zram-readahead-disabling people to the issue? > >> > > >> > It was my last resort. If we cannot find other ways after all, yes, it would > >> > be a minimum we should do. But it still breaks users don't/can't read/modify > >> > alert and program. > >> > > >> > How about this? > >> > > >> > Can't we make vma-based readahead config option? > >> > With that, users who no interest on readahead don't enable vma-based > >> > readahead. In this case, page-cluster works as expected "disable readahead > >> > completely" so it doesn't break anything. > >> > >> Now. Users can choose between VMA based readahead and original > >> readahead via a knob as follow at runtime, > >> > >> /sys/kernel/mm/swap/vma_ra_enabled > > > > It's not a config option and is enabled by default. IOW, it's under the radar > > so current users cannot notice it. That's why we want to emit big fat warnning. > > when old user set 0 to page-cluster. However, as Andrew said, it's lame. > > > > If we make it config option, product maker/kernel upgrade user can have > > a chance to notice and read description so they could be aware of two weird > > knobs and help to solve the problem in advance without printk_once warn. > > If user has no interest about swap-readahead or skip the new config option > > by mistake, it works physcial readahead which means no regression. > > I am OK to make it config option. But I think VMA based swap readahead > should be enabled by default. 
Because per my understanding, default
> option should be set for most common desktop users.  And VMA based swap
> readahead should benefit them.  People needs to turn off swap readahead
> is some special users, the original swap readahead default configuration
> isn't for them too.

Okay. I don't care which one is the default, as long as it is a config
option. It still gives users a chance to notice the new algorithm and
decide for themselves. That is absolutely better than a silent regression
or the printk trick.

Please add more description of the two parallel readahead algorithms
somewhere (e.g., vm.txt) so that users can understand the situation exactly
and can handle both tunable knobs at the same time.
* Re: [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead
  2017-09-15  3:42 ` Minchan Kim
@ 2017-09-15  4:46 ` Huang, Ying
From: Huang, Ying @ 2017-09-15  4:46 UTC
To: Minchan Kim
Cc: Huang, Ying, Andrew Morton, linux-mm, linux-kernel, Johannes Weiner,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

Minchan Kim <minchan@kernel.org> writes:

> On Fri, Sep 15, 2017 at 11:15:08AM +0800, Huang, Ying wrote:
>> Minchan Kim <minchan@kernel.org> writes:
>>
>> > On Thu, Sep 14, 2017 at 08:01:30PM +0800, Huang, Ying wrote:
>> >> Minchan Kim <minchan@kernel.org> writes:
>> >>
>> >> > On Wed, Sep 13, 2017 at 02:02:29PM -0700, Andrew Morton wrote:
>> >> >> On Wed, 13 Sep 2017 10:40:19 +0900 Minchan Kim <minchan@kernel.org> wrote:
>> >> >>
>> >> >> > Every zram users like low-end android device has used 0 page-cluster
>> >> >> > to disable swap readahead because it has no seek cost and works as
>> >> >> > synchronous IO operation so if we do readahead multiple pages,
>> >> >> > swap fault latency would be (4K * readahead window size). IOW,
>> >> >> > readahead is meaningful only if it doesn't bother faulted page's
>> >> >> > latency.
>> >> >> >
>> >> >> > However, this patch introduces additional knob /sys/kernel/mm/swap/
>> >> >> > vma_ra_max_order as well as page-cluster. It means existing users
>> >> >> > has used disabled swap readahead doesn't work until they should be
>> >> >> > aware of new knob and modification of their script/code to disable
>> >> >> > vma_ra_max_order as well as page-cluster.
>> >> >> >
>> >> >> > I say it's a *regression* and wanted to fix it but Huang's opinion
>> >> >> > is that it's not a functional regression so userspace should be fixed
>> >> >> > by themselves.
>> >> >> > Please look into detail of discussion in
>> >> >> > http://lkml.kernel.org/r/%3C1505183833-4739-4-git-send-email-minchan@kernel.org%3E
>> >> >>
>> >> >> hm, tricky problem.  I do agree that linking the physical and virtual
>> >> >> readahead schemes in the proposed fashion is unfortunate.  I also agree
>> >> >> that breaking existing setups (a bit) is also unfortunate.
>> >> >>
>> >> >> Would it help if, when page-cluster is written to zero, we do
>> >> >>
>> >> >> printk_once("physical readahead disabled, virtual readahead still
>> >> >> enabled.  Disable virtual readahead via
>> >> >> /sys/kernel/mm/swap/vma_ra_max_order").
>> >> >>
>> >> >> Or something like that.  It's pretty lame, but it should help alert the
>> >> >> zram-readahead-disabling people to the issue?
>> >> >
>> >> > It was my last resort. If we cannot find other ways after all, yes, it would
>> >> > be a minimum we should do. But it still breaks users don't/can't read/modify
>> >> > alert and program.
>> >> >
>> >> > How about this?
>> >> >
>> >> > Can't we make vma-based readahead config option?
>> >> > With that, users who no interest on readahead don't enable vma-based
>> >> > readahead. In this case, page-cluster works as expected "disable readahead
>> >> > completely" so it doesn't break anything.
>> >>
>> >> Now.  Users can choose between VMA based readahead and original
>> >> readahead via a knob as follow at runtime,
>> >>
>> >> /sys/kernel/mm/swap/vma_ra_enabled
>> >
>> > It's not a config option and is enabled by default. IOW, it's under the radar
>> > so current users cannot notice it. That's why we want to emit big fat warning
>> > when old user set 0 to page-cluster. However, as Andrew said, it's lame.
>> >
>> > If we make it config option, product maker/kernel upgrade user can have
>> > a chance to notice and read description so they could be aware of two weird
>> > knobs and help to solve the problem in advance without printk_once warn.
>> > If user has no interest about swap-readahead or skip the new config option
>> > by mistake, it works physical readahead which means no regression.
>>
>> I am OK to make it config option.  But I think VMA based swap readahead
>> should be enabled by default.  Because per my understanding, default
>> option should be set for most common desktop users.  And VMA based swap
>> readahead should benefit them.  People needs to turn off swap readahead
>> is some special users, the original swap readahead default configuration
>> isn't for them too.
>
> Okay. I don't care either one is default if it is a config option.
> It still gives a chance to notice a new algorithm so users can decide it.
> It is absolutely better than silent regression and printk trick.
> Please add more description about those parallel two readahead algorithms
> in somewhere(e.g., vm.txt) so he can understand the situation exactly and
> can handle both tunable knobs at the same time.

Sure.

Best Regards,
Huang, Ying

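To make the "modification of their script/code" point from the discussion above concrete, here is a minimal userspace sketch of what a setup that wants no swap readahead at all might do under this v4 series. It rests on assumptions: the helper names `write_knob` and `disable_swap_readahead` are hypothetical, `/proc/sys/vm/page-cluster` is the usual procfs path of the page-cluster sysctl, and the sketch relies on the documented behavior that writing false to `vma_ra_enabled` falls back to the global algorithm, which page-cluster then controls. The paths are passed in as parameters so the logic can be exercised without root.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: write a short string to a sysctl/sysfs file.
 * Returns 0 on success, -1 on any failure (open, write, or close).
 */
static int write_knob(const char *path, const char *value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Sketch: switch back to the global (physical) readahead algorithm and
 * then disable it via page-cluster.  On a real system the arguments
 * would be "/proc/sys/vm/page-cluster" and
 * "/sys/kernel/mm/swap/vma_ra_enabled".
 */
static int disable_swap_readahead(const char *page_cluster_path,
				  const char *vma_ra_enabled_path)
{
	if (write_knob(vma_ra_enabled_path, "false\n"))
		return -1;
	return write_knob(page_cluster_path, "0\n");
}
```

Whether existing deployments can be expected to add the second write is exactly the disagreement in this subthread; the sketch only shows the mechanics.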
* [PATCH -mm -v4 4/5] mm, swap: Add sysfs interface for VMA based swap readahead
@ 2017-08-07  5:40 ` Huang, Ying
From: Huang, Ying @ 2017-08-07  5:40 UTC
To: Andrew Morton
Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

From: Huang Ying <ying.huang@intel.com>

The sysfs interface to control the VMA based swap readahead is added
as follows,

/sys/kernel/mm/swap/vma_ra_enabled

Enable the VMA based swap readahead algorithm, or use the original
global swap readahead algorithm.

/sys/kernel/mm/swap/vma_ra_max_order

Set the max order of the readahead window size for the VMA based swap
readahead algorithm.

The corresponding ABI documentation is added too.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 Documentation/ABI/testing/sysfs-kernel-mm-swap | 26 +++++++++
 mm/swap_state.c                                | 80 ++++++++++++++++++++++++++
 2 files changed, 106 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-swap

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-swap b/Documentation/ABI/testing/sysfs-kernel-mm-swap
new file mode 100644
index 000000000000..587db52084c7
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-swap
@@ -0,0 +1,26 @@
+What:		/sys/kernel/mm/swap/
+Date:		August 2017
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for swapping
+
+What:		/sys/kernel/mm/swap/vma_ra_enabled
+Date:		August 2017
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Enable/disable VMA based swap readahead.
+
+		If set to true, the VMA based swap readahead algorithm
+		will be used for swappable anonymous pages mapped in a
+		VMA, and the global swap readahead algorithm will be
+		still used for tmpfs etc. other users.  If set to
+		false, the global swap readahead algorithm will be
+		used for all swappable pages.
+
+What:		/sys/kernel/mm/swap/vma_ra_max_order
+Date:		August 2017
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	The max readahead size in order for VMA based swap readahead
+
+		VMA based swap readahead algorithm will readahead at
+		most 1 << max_order pages for each readahead.  The
+		real readahead size for each readahead will be scaled
+		according to the estimation algorithm.
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3885fef7bdf5..71ce2d1ccbf7 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -751,3 +751,83 @@ struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 	return read_swap_cache_async(fentry, gfp_mask, vma, vmf->address,
 				     swap_ra->win == 1);
 }
+
+#ifdef CONFIG_SYSFS
+static ssize_t vma_ra_enabled_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%s\n", swap_vma_readahead ? "true" : "false");
+}
+static ssize_t vma_ra_enabled_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
+		swap_vma_readahead = true;
+	else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
+		swap_vma_readahead = false;
+	else
+		return -EINVAL;
+
+	return count;
+}
+static struct kobj_attribute vma_ra_enabled_attr =
+	__ATTR(vma_ra_enabled, 0644, vma_ra_enabled_show,
+	       vma_ra_enabled_store);
+
+static ssize_t vma_ra_max_order_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	return sprintf(buf, "%d\n", swap_ra_max_order);
+}
+static ssize_t vma_ra_max_order_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t count)
+{
+	int err, v;
+
+	err = kstrtoint(buf, 10, &v);
+	if (err || v > SWAP_RA_ORDER_CEILING || v <= 0)
+		return -EINVAL;
+
+	swap_ra_max_order = v;
+
+	return count;
+}
+static struct kobj_attribute vma_ra_max_order_attr =
+	__ATTR(vma_ra_max_order, 0644, vma_ra_max_order_show,
+	       vma_ra_max_order_store);
+
+static struct attribute *swap_attrs[] = {
+	&vma_ra_enabled_attr.attr,
+	&vma_ra_max_order_attr.attr,
+	NULL,
+};
+
+static struct attribute_group swap_attr_group = {
+	.attrs = swap_attrs,
+};
+
+static int __init swap_init_sysfs(void)
+{
+	int err;
+	struct kobject *swap_kobj;
+
+	swap_kobj = kobject_create_and_add("swap", mm_kobj);
+	if (!swap_kobj) {
+		pr_err("failed to create swap kobject\n");
+		return -ENOMEM;
+	}
+	err = sysfs_create_group(swap_kobj, &swap_attr_group);
+	if (err) {
+		pr_err("failed to register swap group\n");
+		goto delete_obj;
+	}
+	return 0;
+
+delete_obj:
+	kobject_put(swap_kobj);
+	return err;
+}
+subsys_initcall(swap_init_sysfs);
+#endif
-- 
2.11.0

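The input handling of the two store functions in the patch above can be tried out in userspace. The following is only a sketch under stated assumptions, not kernel code: the function names `parse_vma_ra_enabled`, `parse_vma_ra_max_order` and `max_window_pages` are hypothetical, `strtol` stands in for `kstrtoint` (the two differ in details such as leading whitespace), and the ceiling is passed as a parameter because `SWAP_RA_ORDER_CEILING` is defined elsewhere in the series.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Mirrors the prefix matching of vma_ra_enabled_store() as posted:
 * returns 1 for "true"/"1", 0 for "false"/"0", -EINVAL otherwise.
 * Only a prefix is compared, so e.g. "10" is accepted as true.
 */
static int parse_vma_ra_enabled(const char *buf)
{
	if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
		return 1;
	if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
		return 0;
	return -EINVAL;
}

/*
 * Mirrors the range check of vma_ra_max_order_store(): the order must
 * satisfy 0 < order <= ceiling.  One trailing newline is tolerated,
 * as with kstrtoint().
 */
static int parse_vma_ra_max_order(const char *buf, int ceiling, int *order)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(buf, &end, 10);
	if (errno || end == buf)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;
	if (v > ceiling || v <= 0)
		return -EINVAL;
	*order = (int)v;
	return 0;
}

/* Upper bound on the readahead window, "1 << max_order pages". */
static unsigned long max_window_pages(int order)
{
	return 1UL << order;
}
```

One consequence visible in the sketch: since the posted range check rejects `v <= 0`, an order of 0 (a one-page window) cannot be written through `vma_ra_max_order` in this version of the patch.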
* [PATCH -mm -v4 5/5] mm, swap: Don't use VMA based swap readahead if HDD is used as swap
@ 2017-08-07  5:40 ` Huang, Ying
From: Huang, Ying @ 2017-08-07  5:40 UTC
To: Andrew Morton
Cc: linux-mm, linux-kernel, Huang Ying, Johannes Weiner, Minchan Kim,
	Rik van Riel, Shaohua Li, Hugh Dickins, Fengguang Wu, Tim Chen,
	Dave Hansen

From: Huang Ying <ying.huang@intel.com>

VMA based swap readahead will readahead the virtual pages that are
continuous in the virtual address space, while the original swap
readahead will readahead the swap slots that are continuous in the
swap device.  Although VMA based swap readahead is more correct for
the swap slots to be readahead, it will trigger more small random
reads, which may cause the performance of HDD (hard disk) to degrade
heavily, and the cost may finally exceed the benefit.

To avoid the issue, in this patch, if an HDD is used as swap, the VMA
based swap readahead will be disabled, and the original swap readahead
will be used instead.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Cc: Tim Chen <tim.c.chen@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
---
 include/linux/swap.h | 11 ++++++-----
 mm/swapfile.c        |  8 +++++++-
 2 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 61d63379e956..9c4ae6f14eea 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -400,16 +400,17 @@ extern struct page *do_swap_page_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 					   struct vm_fault *vmf,
 					   struct vma_swap_readahead *swap_ra);
 
-static inline bool swap_use_vma_readahead(void)
-{
-	return READ_ONCE(swap_vma_readahead);
-}
-
 /* linux/mm/swapfile.c */
 extern atomic_long_t nr_swap_pages;
 extern long total_swap_pages;
+extern atomic_t nr_rotate_swap;
 extern bool has_usable_swap(void);
 
+static inline bool swap_use_vma_readahead(void)
+{
+	return READ_ONCE(swap_vma_readahead) && !atomic_read(&nr_rotate_swap);
+}
+
 /* Swap 50% full? Release swapcache more aggressively.. */
 static inline bool vm_swap_full(void)
 {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 42eff9e4e972..4f8b3e08a547 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -96,6 +96,8 @@ static DECLARE_WAIT_QUEUE_HEAD(proc_poll_wait);
 /* Activity counter to indicate that a swapon or swapoff has occurred */
 static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
+atomic_t nr_rotate_swap = ATOMIC_INIT(0);
+
 static inline unsigned char swap_count(unsigned char ent)
 {
 	return ent & ~SWAP_HAS_CACHE;	/* may include SWAP_HAS_CONT flag */
@@ -2569,6 +2571,9 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 	if (p->flags & SWP_CONTINUED)
 		free_swap_count_continuations(p);
 
+	if (!p->bdev || !blk_queue_nonrot(bdev_get_queue(p->bdev)))
+		atomic_dec(&nr_rotate_swap);
+
 	mutex_lock(&swapon_mutex);
 	spin_lock(&swap_lock);
 	spin_lock(&p->lock);
@@ -3145,7 +3150,8 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 			cluster = per_cpu_ptr(p->percpu_cluster, cpu);
 			cluster_set_null(&cluster->index);
 		}
-	}
+	} else
+		atomic_inc(&nr_rotate_swap);
 
 	error = swap_cgroup_swapon(p->type, maxpages);
 	if (error)
-- 
2.11.0

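The counting scheme in the patch above can be modeled in a few lines of userspace C. This is a sketch, not the kernel implementation: `struct swap_dev`, `model_swapon` and `model_swapoff` are hypothetical stand-ins for the swapon/swapoff paths, and C11 atomics replace the kernel's `atomic_t` and `READ_ONCE`. It shows that VMA based readahead is used only while the knob is enabled and no active swap device counts as rotational.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical description of a swap device for this model. */
struct swap_dev {
	bool has_bdev;	/* backed by a block device */
	bool nonrot;	/* block queue reports non-rotational (e.g. SSD) */
};

/* Number of active swap devices treated as rotational. */
static atomic_int nr_rotate_swap;
/* Models the vma_ra_enabled knob; enabled by default. */
static atomic_bool swap_vma_readahead = true;

/* Same condition as the swapoff hunk: no bdev, or a rotational queue. */
static bool counts_as_rotational(const struct swap_dev *dev)
{
	return !dev->has_bdev || !dev->nonrot;
}

static void model_swapon(const struct swap_dev *dev)
{
	if (counts_as_rotational(dev))
		atomic_fetch_add(&nr_rotate_swap, 1);
}

static void model_swapoff(const struct swap_dev *dev)
{
	if (counts_as_rotational(dev))
		atomic_fetch_sub(&nr_rotate_swap, 1);
}

/* VMA readahead only if enabled and no rotational swap is active. */
static bool swap_use_vma_readahead(void)
{
	return atomic_load(&swap_vma_readahead) &&
	       atomic_load(&nr_rotate_swap) == 0;
}
```

The design choice the model makes visible is that the decision is global: a single rotational swap device switches every fault back to the original readahead, even if most swap traffic goes to an SSD.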
Thread overview: 19+ messages

2017-08-07  5:40 [PATCH -mm -v4 0/5] mm, swap: VMA based swap readahead Huang, Ying
2017-08-07  5:40 ` [PATCH -mm -v4 1/5] mm, swap: Add swap readahead hit statistics Huang, Ying
2017-08-09 21:50 ` Andrew Morton
2017-08-09 23:17 ` Huang, Ying
2017-08-07  5:40 ` [PATCH -mm -v4 2/5] mm, swap: Fix swap readahead marking Huang, Ying
2017-08-07  5:40 ` [PATCH -mm -v4 3/5] mm, swap: VMA based swap readahead Huang, Ying
2017-09-13  1:40 ` Minchan Kim
2017-09-13 21:02 ` Andrew Morton
2017-09-14  0:53 ` Huang, Ying
2017-09-14  8:15 ` Minchan Kim
2017-09-14  7:53 ` Minchan Kim
2017-09-14 12:01 ` Huang, Ying
2017-09-14 13:14 ` Minchan Kim
2017-09-14 21:21 ` Andrew Morton
2017-09-15  3:15 ` Huang, Ying
2017-09-15  3:42 ` Minchan Kim
2017-09-15  4:46 ` Huang, Ying
2017-08-07  5:40 ` [PATCH -mm -v4 4/5] mm, swap: Add sysfs interface for " Huang, Ying
2017-08-07  5:40 ` [PATCH -mm -v4 5/5] mm, swap: Don't use VMA based swap readahead if HDD is used as swap Huang, Ying