* [RFC PATCH 0/5] hot page swap to zram, cold page swap to swapfile directly
@ 2023-10-08  9:59 Lincheng Yang
  2023-10-08  9:59 ` [RFC PATCH 1/5] mm/swap_slots: cleanup swap slot cache Lincheng Yang
                   ` (4 more replies)
  0 siblings, 5 replies; 8+ messages in thread
From: Lincheng Yang @ 2023-10-08  9:59 UTC (permalink / raw)
  To: akpm, rostedt, mhiramat, willy, hughd, peterx, mike.kravetz, jgg,
	surenb, steven.price, pasha.tatashin, kirill.shutemov, yuanchu,
	david, mathieu.desnoyers, dhowells, shakeelb, pcc, tytso,
	42.hyeyoo, vbabka, catalin.marinas, lrh2000, ying.huang, mhocko,
	vishal.moola, yosryahmed, findns94, neilb
  Cc: linux-kernel, linux-mm, wanbin.wang, chunlei.zhuang,
	jinsheng.zhao, jiajun.ling, dongyun.liu, Lincheng Yang

Hi All,

Our team developed a feature on Android Linux v4.19 that swaps cold pages
out directly to the swapfile device and hot pages to the ZRAM device. This
reduces the lag caused by writing cold pages back to the backing device
through ZRAM under heavy memory pressure, because the ZRAM
compression/decompression step is skipped for them. It helps especially on
low-end Android devices with low CPU frequency and small memory.

Android currently uses the GKI strategy, so we cannot modify the Linux
kernel directly to support this feature; we can only support it through
vendor hooks. However, the feature involves too many modifications, and
our Google TAM suggested that we push it to the Linux community.

The main changes are as follows:
[PATCH 2/5]: Set a hot or cold status on each page.
             A cold page is swapped out directly to the swapfile;
             a hot page is swapped out to the ZRAM device.
[PATCH 3/5]: When a VMA has many hot pages, predict that the VMA is hot,
             so that all anonymous pages of that VMA are treated as hot
             and are swapped out only to the ZRAM device.
[PATCH 4/5]: When user space calls madvise/process_madvise(MADV_PAGEOUT),
             swap the pages out directly to the swapfile device.
[PATCH 5/5]: When the remaining life of the external storage device is too
             low, or the daily amount of writes to the swapfile is too
             high, the user can turn off swapping hot/cold pages to the
             swapfile device, so that pages are swapped out only to the
             ZRAM device.
A simplified sketch of the resulting hot/cold routing is shown below.
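
Pseudo-code sketch of the routing decision (a user-space toy model only,
with made-up names; the real logic lives in swap_folio_hot() and
get_swap_pages() in the patches below):

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy user-space model of the routing in this series: hot anon pages go
 * to the ZRAM-backed swap device, cold anon pages go directly to the
 * swapfile. The struct and helper names are illustrative only.
 */
enum swap_target { SWAP_TO_ZRAM, SWAP_TO_SWAPFILE };

struct page_model {
	bool hot;	/* models PG_hot: hot VMA or re-accessed page */
	bool cold;	/* models PG_cold: e.g. set by MADV_PAGEOUT   */
};

static enum swap_target pick_swap_target(const struct page_model *p,
					 bool swapfile_write_enable)
{
	if (!swapfile_write_enable)	/* patch 5: everything goes to ZRAM */
		return SWAP_TO_ZRAM;
	if (p->hot)			/* patches 2/3: hot pages -> ZRAM   */
		return SWAP_TO_ZRAM;
	return SWAP_TO_SWAPFILE;	/* cold/unknown pages -> swapfile   */
}

int main(void)
{
	struct page_model hot = { .hot = true }, cold = { .cold = true };

	printf("hot  -> %s\n", pick_swap_target(&hot, true) == SWAP_TO_ZRAM ?
	       "zram" : "swapfile");
	printf("cold -> %s\n", pick_swap_target(&cold, true) == SWAP_TO_ZRAM ?
	       "zram" : "swapfile");
	return 0;
}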

This series is based on Linux v6.5; it is just a port of the core
functionality to v6.5.

If a similar feature already exists in the kernel, please let me know.
Comments and suggestions are welcome.

Thanks,
Lincheng Yang


Lincheng Yang (5):
  mm/swap_slots: cleanup swap slot cache
  mm: introduce hot and cold anon page flags
  mm: add VMA hot flag
  mm: add page implyreclaim flag
  mm/swapfile: add swapfile_write_enable interface

 fs/proc/task_mmu.c             |   3 +
 include/linux/mm.h             |  32 +++++++
 include/linux/mm_types.h       |   2 +
 include/linux/mm_types_task.h  |  10 +++
 include/linux/mmzone.h         |   1 +
 include/linux/page-flags.h     |   9 ++
 include/linux/swap.h           |   8 +-
 include/linux/swap_slots.h     |   2 +-
 include/trace/events/mmflags.h |   6 +-
 mm/filemap.c                   |   2 +
 mm/madvise.c                   |   7 +-
 mm/memory.c                    |  44 ++++++++++
 mm/migrate.c                   |   6 ++
 mm/rmap.c                      |   3 +
 mm/shmem.c                     |   2 +-
 mm/swap.h                      |   4 +-
 mm/swap_slots.c                | 133 +++++++++++++++++-----------
 mm/swap_state.c                |   4 +-
 mm/swapfile.c                  | 153 +++++++++++++++++++++++++++++++--
 mm/vmscan.c                    |  22 ++++-
 20 files changed, 384 insertions(+), 69 deletions(-)

--
2.34.1



* [RFC PATCH 1/5] mm/swap_slots: cleanup swap slot cache
  2023-10-08  9:59 [RFC PATCH 0/5] hot page swap to zram, cold page swap to swapfile directly Lincheng Yang
@ 2023-10-08  9:59 ` Lincheng Yang
  2023-10-08  9:59 ` [RFC PATCH 2/5] mm: introduce hot and cold anon page flags Lincheng Yang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Lincheng Yang @ 2023-10-08  9:59 UTC (permalink / raw)
  To: akpm, rostedt, mhiramat, willy, hughd, peterx, mike.kravetz, jgg,
	surenb, steven.price, pasha.tatashin, kirill.shutemov, yuanchu,
	david, mathieu.desnoyers, dhowells, shakeelb, pcc, tytso,
	42.hyeyoo, vbabka, catalin.marinas, lrh2000, ying.huang, mhocko,
	vishal.moola, yosryahmed, findns94, neilb
  Cc: linux-kernel, linux-mm, wanbin.wang, chunlei.zhuang,
	jinsheng.zhao, jiajun.ling, dongyun.liu, Lincheng Yang

Split the swap slot cache code into helper functions that take an explicit
cache pointer, to prepare for subsequent modifications.

Signed-off-by: Lincheng Yang <lincheng.yang@transsion.com>
---
 mm/swap_slots.c | 111 ++++++++++++++++++++++++++++--------------------
 1 file changed, 66 insertions(+), 45 deletions(-)

diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index 0bec1f705f8e..bb41c8460b62 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -110,11 +110,13 @@ static bool check_cache_active(void)
 	return swap_slot_cache_active;
 }
 
-static int alloc_swap_slot_cache(unsigned int cpu)
+static int __alloc_swap_slot_cache(struct swap_slots_cache *cache)
 {
-	struct swap_slots_cache *cache;
 	swp_entry_t *slots, *slots_ret;
 
+	if (!cache)
+		return 0;
+
 	/*
 	 * Do allocation outside swap_slots_cache_mutex
 	 * as kvzalloc could trigger reclaim and folio_alloc_swap,
@@ -133,17 +135,6 @@ static int alloc_swap_slot_cache(unsigned int cpu)
 	}
 
 	mutex_lock(&swap_slots_cache_mutex);
-	cache = &per_cpu(swp_slots, cpu);
-	if (cache->slots || cache->slots_ret) {
-		/* cache already allocated */
-		mutex_unlock(&swap_slots_cache_mutex);
-
-		kvfree(slots);
-		kvfree(slots_ret);
-
-		return 0;
-	}
-
 	if (!cache->lock_initialized) {
 		mutex_init(&cache->alloc_lock);
 		spin_lock_init(&cache->free_lock);
@@ -165,13 +156,26 @@ static int alloc_swap_slot_cache(unsigned int cpu)
 	return 0;
 }
 
-static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
-				  bool free_slots)
+static int alloc_swap_slot_cache(unsigned int cpu)
 {
 	struct swap_slots_cache *cache;
-	swp_entry_t *slots = NULL;
 
+	mutex_lock(&swap_slots_cache_mutex);
 	cache = &per_cpu(swp_slots, cpu);
+	if (cache->slots || cache->slots_ret)   /* cache already allocated */
+		cache = NULL;
+	mutex_unlock(&swap_slots_cache_mutex);
+
+	__alloc_swap_slot_cache(cache);
+
+	return 0;
+}
+
+static void __drain_slots_cache_cpu(struct swap_slots_cache *cache,
+				    unsigned int type, bool free_slots)
+{
+	swp_entry_t *slots = NULL;
+
 	if ((type & SLOTS_CACHE) && cache->slots) {
 		mutex_lock(&cache->alloc_lock);
 		swapcache_free_entries(cache->slots + cache->cur, cache->nr);
@@ -196,6 +200,15 @@ static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
 	}
 }
 
+static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
+				  bool free_slots)
+{
+	struct swap_slots_cache *cache;
+
+	cache = &per_cpu(swp_slots, cpu);
+	__drain_slots_cache_cpu(cache, type, free_slots);
+}
+
 static void __drain_swap_slots_cache(unsigned int type)
 {
 	unsigned int cpu;
@@ -269,11 +282,8 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 	return cache->nr;
 }
 
-void free_swap_slot(swp_entry_t entry)
+static void __free_swap_slot(struct swap_slots_cache *cache, swp_entry_t entry)
 {
-	struct swap_slots_cache *cache;
-
-	cache = raw_cpu_ptr(&swp_slots);
 	if (likely(use_swap_slot_cache && cache->slots_ret)) {
 		spin_lock_irq(&cache->free_lock);
 		/* Swap slots cache may be deactivated before acquiring lock */
@@ -299,18 +309,18 @@ void free_swap_slot(swp_entry_t entry)
 	}
 }
 
-swp_entry_t folio_alloc_swap(struct folio *folio)
+void free_swap_slot(swp_entry_t entry)
 {
-	swp_entry_t entry;
 	struct swap_slots_cache *cache;
 
-	entry.val = 0;
+	cache = raw_cpu_ptr(&swp_slots);
+	__free_swap_slot(cache, entry);
+}
 
-	if (folio_test_large(folio)) {
-		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
-			get_swap_pages(1, &entry, folio_nr_pages(folio));
-		goto out;
-	}
+static int __folio_alloc_swap(struct swap_slots_cache *cache, swp_entry_t *entry)
+{
+	if (unlikely(!check_cache_active() || !cache->slots))
+		return -EINVAL;
 
 	/*
 	 * Preemption is allowed here, because we may sleep
@@ -321,26 +331,37 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 	 * The alloc path here does not touch cache->slots_ret
 	 * so cache->free_lock is not taken.
 	 */
-	cache = raw_cpu_ptr(&swp_slots);
-
-	if (likely(check_cache_active() && cache->slots)) {
-		mutex_lock(&cache->alloc_lock);
-		if (cache->slots) {
+	mutex_lock(&cache->alloc_lock);
 repeat:
-			if (cache->nr) {
-				entry = cache->slots[cache->cur];
-				cache->slots[cache->cur++].val = 0;
-				cache->nr--;
-			} else if (refill_swap_slots_cache(cache)) {
-				goto repeat;
-			}
-		}
-		mutex_unlock(&cache->alloc_lock);
-		if (entry.val)
-			goto out;
+	if (cache->nr) {
+		*entry = cache->slots[cache->cur];
+		cache->slots[cache->cur++].val = 0;
+		cache->nr--;
+	} else if (refill_swap_slots_cache(cache)) {
+		goto repeat;
 	}
+	mutex_unlock(&cache->alloc_lock);
+
+	return !!entry->val;
+}
+
+swp_entry_t folio_alloc_swap(struct folio *folio)
+{
+	swp_entry_t entry;
+	struct swap_slots_cache *cache;
+
+	entry.val = 0;
+
+	if (folio_test_large(folio)) {
+		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
+			get_swap_pages(1, &entry, folio_nr_pages(folio));
+		goto out;
+	}
+
+	cache = raw_cpu_ptr(&swp_slots);
+	if (__folio_alloc_swap(cache, &entry))
+		get_swap_pages(1, &entry, 1);
 
-	get_swap_pages(1, &entry, 1);
 out:
 	if (mem_cgroup_try_charge_swap(folio, entry)) {
 		put_swap_folio(folio, entry);
-- 
2.34.1



* [RFC PATCH 2/5] mm: introduce hot and cold anon page flags
  2023-10-08  9:59 [RFC PATCH 0/5] hot page swap to zram, cold page swap to swapfile directly Lincheng Yang
  2023-10-08  9:59 ` [RFC PATCH 1/5] mm/swap_slots: cleanup swap slot cache Lincheng Yang
@ 2023-10-08  9:59 ` Lincheng Yang
  2023-10-08  9:59 ` [RFC PATCH 3/5] mm: add VMA hot flag Lincheng Yang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 8+ messages in thread
From: Lincheng Yang @ 2023-10-08  9:59 UTC (permalink / raw)
  To: akpm, rostedt, mhiramat, willy, hughd, peterx, mike.kravetz, jgg,
	surenb, steven.price, pasha.tatashin, kirill.shutemov, yuanchu,
	david, mathieu.desnoyers, dhowells, shakeelb, pcc, tytso,
	42.hyeyoo, vbabka, catalin.marinas, lrh2000, ying.huang, mhocko,
	vishal.moola, yosryahmed, findns94, neilb
  Cc: linux-kernel, linux-mm, wanbin.wang, chunlei.zhuang,
	jinsheng.zhao, jiajun.ling, dongyun.liu, Lincheng Yang

When an anon page is reclaimed, a hot anon page is written back to the
zram device and a cold anon page is written back to the swapfile device.
This way, when a hot anon page is needed again it can be fetched quickly
from the zram device, reducing I/O operations and increasing read and
write speeds.

The hot or cold status of each swap device is also shown in /proc/swaps.
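
As a usage illustration (not part of the patch), here is a small
user-space reader for the extended /proc/swaps format; it assumes the
hot/cold type added by this patch is the last tab-separated field on each
line:

#include <stdio.h>
#include <string.h>

/*
 * Print each swap device together with the trailing hot/cold column that
 * this patch appends to /proc/swaps.
 */
int main(void)
{
	char line[512];
	FILE *fp = fopen("/proc/swaps", "r");

	if (!fp) {
		perror("/proc/swaps");
		return 1;
	}
	while (fgets(line, sizeof(line), fp)) {
		char path[256], type[16];
		char *last_tab = strrchr(line, '\t');

		if (!strncmp(line, "Filename", 8))	/* skip header */
			continue;
		if (sscanf(line, "%255s", path) == 1 && last_tab &&
		    sscanf(last_tab, "%15s", type) == 1)
			printf("%s: %s\n", path, type);
	}
	fclose(fp);
	return 0;
}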

Signed-off-by: Lincheng Yang <lincheng.yang@transsion.com>
---
 include/linux/mm_types.h       |  1 +
 include/linux/page-flags.h     |  6 ++++
 include/linux/swap.h           |  4 ++-
 include/linux/swap_slots.h     |  2 +-
 include/trace/events/mmflags.h |  4 ++-
 mm/filemap.c                   |  2 ++
 mm/madvise.c                   |  6 ++--
 mm/memory.c                    | 42 ++++++++++++++++++++++++++++
 mm/migrate.c                   |  4 +++
 mm/swap_slots.c                | 48 ++++++++++++++++++++------------
 mm/swapfile.c                  | 51 ++++++++++++++++++++++++++++++----
 mm/vmscan.c                    |  9 ++++++
 12 files changed, 151 insertions(+), 28 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7d30dc4ff0ff..5e5cf457a236 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -1203,6 +1203,7 @@ enum fault_flag {
 	FAULT_FLAG_UNSHARE =		1 << 10,
 	FAULT_FLAG_ORIG_PTE_VALID =	1 << 11,
 	FAULT_FLAG_VMA_LOCK =		1 << 12,
+	FAULT_FLAG_SWAP =		1 << 13,
 };
 
 typedef unsigned int __bitwise zap_flags_t;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 92a2063a0a23..a2c83c0100aa 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -136,6 +136,8 @@ enum pageflags {
 	PG_arch_2,
 	PG_arch_3,
 #endif
+	PG_hot,
+	PG_cold,
 	__NR_PAGEFLAGS,
 
 	PG_readahead = PG_reclaim,
@@ -476,6 +478,10 @@ PAGEFLAG(Active, active, PF_HEAD) __CLEARPAGEFLAG(Active, active, PF_HEAD)
 	TESTCLEARFLAG(Active, active, PF_HEAD)
 PAGEFLAG(Workingset, workingset, PF_HEAD)
 	TESTCLEARFLAG(Workingset, workingset, PF_HEAD)
+PAGEFLAG(Hot, hot, PF_HEAD)
+	TESTCLEARFLAG(Hot, hot, PF_HEAD)
+PAGEFLAG(Cold, cold, PF_HEAD)
+	TESTCLEARFLAG(Cold, cold, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
 
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 456546443f1f..70678dbd9a3a 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -219,6 +219,7 @@ enum {
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
 					/* add others here before... */
+	SWP_HOT		= (1 << 13),	/* hot swap device */
 	SWP_SCANNING	= (1 << 14),	/* refcount in scan_swap_map */
 };
 
@@ -480,7 +481,8 @@ swp_entry_t folio_alloc_swap(struct folio *folio);
 bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
-extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
+extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size, bool hot);
+bool swap_folio_hot(struct folio *folio);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h
index 15adfb8c813a..b876b1f935a2 100644
--- a/include/linux/swap_slots.h
+++ b/include/linux/swap_slots.h
@@ -24,7 +24,7 @@ struct swap_slots_cache {
 void disable_swap_slots_cache_lock(void);
 void reenable_swap_slots_cache_unlock(void);
 void enable_swap_slots_cache(void);
-void free_swap_slot(swp_entry_t entry);
+void free_swap_slot(swp_entry_t entry, bool hot);
 
 extern bool swap_slot_cache_enabled;
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 1478b9dd05fa..2bbef50d80a8 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -118,7 +118,9 @@
 	DEF_PAGEFLAG_NAME(mappedtodisk),				\
 	DEF_PAGEFLAG_NAME(reclaim),					\
 	DEF_PAGEFLAG_NAME(swapbacked),					\
-	DEF_PAGEFLAG_NAME(unevictable)					\
+	DEF_PAGEFLAG_NAME(unevictable),					\
+	DEF_PAGEFLAG_NAME(hot),						\
+	DEF_PAGEFLAG_NAME(cold)						\
 IF_HAVE_PG_MLOCK(mlocked)						\
 IF_HAVE_PG_UNCACHED(uncached)						\
 IF_HAVE_PG_HWPOISON(hwpoison)						\
diff --git a/mm/filemap.c b/mm/filemap.c
index 9e44a49bbd74..3bcd67d48fd1 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1435,6 +1435,8 @@ void migration_entry_wait_on_locked(swp_entry_t entry, spinlock_t *ptl)
 		delayacct_thrashing_end(&in_thrashing);
 		psi_memstall_leave(&pflags);
 	}
+
+	folio_set_hot(folio);
 }
 #endif
 
diff --git a/mm/madvise.c b/mm/madvise.c
index ec30f48f8f2e..a5c19bb3f392 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -513,10 +513,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		folio_test_clear_young(folio);
 		if (pageout) {
 			if (folio_isolate_lru(folio)) {
-				if (folio_test_unevictable(folio))
+				if (folio_test_unevictable(folio)) {
 					folio_putback_lru(folio);
-				else
+				} else {
 					list_add(&folio->lru, &folio_list);
+					folio_set_cold(folio);
+				}
 			}
 		} else
 			folio_deactivate(folio);
diff --git a/mm/memory.c b/mm/memory.c
index cdc4d4c1c858..ef41e65b1075 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3103,6 +3103,10 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			delayacct_wpcopy_end();
 			return ret == -EHWPOISON ? VM_FAULT_HWPOISON : 0;
 		}
+
+		if (vmf->flags & FAULT_FLAG_SWAP)
+			folio_set_hot(new_folio);
+
 		kmsan_copy_page_meta(&new_folio->page, vmf->page);
 	}
 
@@ -3966,6 +3970,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	if (should_try_to_free_swap(folio, vma, vmf->flags))
 		folio_free_swap(folio);
 
+	vmf->flags |= FAULT_FLAG_SWAP;
+
 	inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
 	dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
 	pte = mk_pte(page, vma->vm_page_prot);
@@ -3999,6 +4005,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		page_add_anon_rmap(page, vma, vmf->address, rmap_flags);
 	}
 
+	if (!(vmf->flags & FAULT_FLAG_WRITE) && !folio_test_clear_cold(folio))
+		folio_set_hot(folio);
+
 	VM_BUG_ON(!folio_test_anon(folio) ||
 			(pte_write(pte) && !PageAnonExclusive(page)));
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
@@ -4887,6 +4896,36 @@ static vm_fault_t wp_huge_pud(struct vm_fault *vmf, pud_t orig_pud)
 	return VM_FAULT_FALLBACK;
 }
 
+static inline void __do_access_page(struct folio *folio)
+{
+	if (folio_test_cold(folio))
+		folio_clear_cold(folio);
+	else
+		folio_set_hot(folio);
+}
+
+static inline void do_access_page(struct vm_fault *vmf)
+{
+	struct folio *folio;
+
+	vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
+	if (!vmf->page)
+		return;
+
+	folio = page_folio(vmf->page);
+	if (!folio_test_anon(folio))
+		return;
+
+	if (folio_trylock(folio)) {
+		__do_access_page(folio);
+		folio_unlock(folio);
+	} else {
+		folio_get(folio);
+		__do_access_page(folio);
+		folio_put(folio);
+	}
+}
+
 /*
  * These routines also need to handle stuff like marking pages dirty
  * and/or accessed for architectures that don't do it in hardware (most
@@ -4974,6 +5013,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address,
 						     vmf->pte);
 	}
+
+	do_access_page(vmf);
+
 unlock:
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	return 0;
diff --git a/mm/migrate.c b/mm/migrate.c
index 24baad2571e3..9f97744bb0a8 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -561,6 +561,10 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
 		folio_set_unevictable(newfolio);
 	if (folio_test_workingset(folio))
 		folio_set_workingset(newfolio);
+	if (folio_test_hot(folio))
+		folio_set_hot(newfolio);
+	if (folio_test_cold(folio))
+		folio_set_cold(newfolio);
 	if (folio_test_checked(folio))
 		folio_set_checked(newfolio);
 	/*
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index bb41c8460b62..dff98cf4d2e5 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -35,7 +35,8 @@
 #include <linux/mutex.h>
 #include <linux/mm.h>
 
-static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots);
+static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots_hot);
+static DEFINE_PER_CPU(struct swap_slots_cache, swp_slots_cold);
 static bool	swap_slot_cache_active;
 bool	swap_slot_cache_enabled;
 static bool	swap_slot_cache_initialized;
@@ -158,15 +159,22 @@ static int __alloc_swap_slot_cache(struct swap_slots_cache *cache)
 
 static int alloc_swap_slot_cache(unsigned int cpu)
 {
-	struct swap_slots_cache *cache;
+	struct swap_slots_cache *cache_hot, *cache_cold;
 
 	mutex_lock(&swap_slots_cache_mutex);
-	cache = &per_cpu(swp_slots, cpu);
-	if (cache->slots || cache->slots_ret)   /* cache already allocated */
-		cache = NULL;
+
+	cache_hot = &per_cpu(swp_slots_hot, cpu);
+	if (cache_hot->slots || cache_hot->slots_ret)   /* cache already allocated */
+		cache_hot = NULL;
+
+	cache_cold = &per_cpu(swp_slots_cold, cpu);
+	if (cache_cold->slots || cache_cold->slots_ret) /* cache already allocated */
+		cache_cold = NULL;
+
 	mutex_unlock(&swap_slots_cache_mutex);
 
-	__alloc_swap_slot_cache(cache);
+	__alloc_swap_slot_cache(cache_hot);
+	__alloc_swap_slot_cache(cache_cold);
 
 	return 0;
 }
@@ -205,7 +213,10 @@ static void drain_slots_cache_cpu(unsigned int cpu, unsigned int type,
 {
 	struct swap_slots_cache *cache;
 
-	cache = &per_cpu(swp_slots, cpu);
+	cache = &per_cpu(swp_slots_hot, cpu);
+	__drain_slots_cache_cpu(cache, type, free_slots);
+
+	cache = &per_cpu(swp_slots_cold, cpu);
 	__drain_slots_cache_cpu(cache, type, free_slots);
 }
 
@@ -269,7 +280,7 @@ void enable_swap_slots_cache(void)
 }
 
 /* called with swap slot cache's alloc lock held */
-static int refill_swap_slots_cache(struct swap_slots_cache *cache)
+static int refill_swap_slots_cache(struct swap_slots_cache *cache, bool hot)
 {
 	if (!use_swap_slot_cache)
 		return 0;
@@ -277,7 +288,7 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache)
 	cache->cur = 0;
 	if (swap_slot_cache_active)
 		cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE,
-					   cache->slots, 1);
+					   cache->slots, 1, hot);
 
 	return cache->nr;
 }
@@ -309,15 +320,16 @@ static void __free_swap_slot(struct swap_slots_cache *cache, swp_entry_t entry)
 	}
 }
 
-void free_swap_slot(swp_entry_t entry)
+void free_swap_slot(swp_entry_t entry, bool hot)
 {
 	struct swap_slots_cache *cache;
 
-	cache = raw_cpu_ptr(&swp_slots);
+	cache = hot ? raw_cpu_ptr(&swp_slots_hot) : raw_cpu_ptr(&swp_slots_cold);
 	__free_swap_slot(cache, entry);
 }
 
-static int __folio_alloc_swap(struct swap_slots_cache *cache, swp_entry_t *entry)
+static int __folio_alloc_swap(struct swap_slots_cache *cache, swp_entry_t *entry,
+			      bool hot)
 {
 	if (unlikely(!check_cache_active() || !cache->slots))
 		return -EINVAL;
@@ -337,7 +349,7 @@ static int __folio_alloc_swap(struct swap_slots_cache *cache, swp_entry_t *entry
 		*entry = cache->slots[cache->cur];
 		cache->slots[cache->cur++].val = 0;
 		cache->nr--;
-	} else if (refill_swap_slots_cache(cache)) {
+	} else if (refill_swap_slots_cache(cache, hot)) {
 		goto repeat;
 	}
 	mutex_unlock(&cache->alloc_lock);
@@ -349,18 +361,20 @@ swp_entry_t folio_alloc_swap(struct folio *folio)
 {
 	swp_entry_t entry;
 	struct swap_slots_cache *cache;
+	bool hot;
 
 	entry.val = 0;
+	hot = swap_folio_hot(folio);
 
 	if (folio_test_large(folio)) {
 		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
-			get_swap_pages(1, &entry, folio_nr_pages(folio));
+			get_swap_pages(1, &entry, folio_nr_pages(folio), hot);
 		goto out;
 	}
 
-	cache = raw_cpu_ptr(&swp_slots);
-	if (__folio_alloc_swap(cache, &entry))
-		get_swap_pages(1, &entry, 1);
+	cache = hot ? raw_cpu_ptr(&swp_slots_hot) : raw_cpu_ptr(&swp_slots_cold);
+	if (__folio_alloc_swap(cache, &entry, hot))
+		get_swap_pages(1, &entry, 1, hot);
 
 out:
 	if (mem_cgroup_try_charge_swap(folio, entry)) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index b15112b1f1a8..ada28c0ce569 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -113,6 +113,22 @@ static struct swap_info_struct *swap_type_to_swap_info(int type)
 	return READ_ONCE(swap_info[type]); /* rcu_dereference() */
 }
 
+static inline bool swap_info_hot(struct swap_info_struct *si)
+{
+	return !!(si->flags & SWP_HOT);
+}
+
+bool swap_folio_hot(struct folio *folio)
+{
+	if (folio_test_swapbacked(folio) && folio_test_hot(folio))
+		return true;
+
+	if (folio_test_cold(folio))
+		folio_clear_cold(folio);
+
+	return false;
+}
+
 static inline unsigned char swap_count(unsigned char ent)
 {
 	return ent & ~SWAP_HAS_CACHE;	/* may include COUNT_CONTINUED flag */
@@ -1045,7 +1061,7 @@ static void swap_free_cluster(struct swap_info_struct *si, unsigned long idx)
 	swap_range_free(si, offset, SWAPFILE_CLUSTER);
 }
 
-int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
+int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size, bool hot)
 {
 	unsigned long size = swap_entry_size(entry_size);
 	struct swap_info_struct *si, *next;
@@ -1075,6 +1091,13 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 		plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]);
 		spin_unlock(&swap_avail_lock);
 		spin_lock(&si->lock);
+
+		if (hot != swap_info_hot(si)) {
+			spin_lock(&swap_avail_lock);
+			spin_unlock(&si->lock);
+			goto nextsi;
+		}
+
 		if (!si->highest_bit || !(si->flags & SWP_WRITEOK)) {
 			spin_lock(&swap_avail_lock);
 			if (plist_node_empty(&si->avail_lists[node])) {
@@ -1119,6 +1142,11 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_size)
 			goto start_over;
 	}
 
+	if (!hot) {
+		hot = true;
+		goto start_over;
+	}
+
 	spin_unlock(&swap_avail_lock);
 
 check_out:
@@ -1300,7 +1328,7 @@ static unsigned char __swap_entry_free(struct swap_info_struct *p,
 	usage = __swap_entry_free_locked(p, offset, 1);
 	unlock_cluster_or_swap_info(p, ci);
 	if (!usage)
-		free_swap_slot(entry);
+		free_swap_slot(entry, swap_info_hot(p));
 
 	return usage;
 }
@@ -1376,7 +1404,7 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry)
 	for (i = 0; i < size; i++, entry.val++) {
 		if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) {
 			unlock_cluster_or_swap_info(si, ci);
-			free_swap_slot(entry);
+			free_swap_slot(entry, swap_info_hot(si));
 			if (i == size - 1)
 				return;
 			lock_cluster_or_swap_info(si, offset);
@@ -1560,6 +1588,8 @@ static bool folio_swapped(struct folio *folio)
  */
 bool folio_free_swap(struct folio *folio)
 {
+	struct swap_info_struct *si = page_swap_info(folio_page(folio, 0));
+
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 
 	if (!folio_test_swapcache(folio))
@@ -1589,6 +1619,10 @@ bool folio_free_swap(struct folio *folio)
 
 	delete_from_swap_cache(folio);
 	folio_set_dirty(folio);
+
+	if (swap_info_hot(si))
+		folio_set_hot(folio);
+
 	return true;
 }
 
@@ -2628,7 +2662,7 @@ static int swap_show(struct seq_file *swap, void *v)
 	unsigned long bytes, inuse;
 
 	if (si == SEQ_START_TOKEN) {
-		seq_puts(swap, "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\n");
+		seq_puts(swap, "Filename\t\t\t\tType\t\tSize\t\tUsed\t\tPriority\tType\n");
 		return 0;
 	}
 
@@ -2637,13 +2671,14 @@ static int swap_show(struct seq_file *swap, void *v)
 
 	file = si->swap_file;
 	len = seq_file_path(swap, file, " \t\n\\");
-	seq_printf(swap, "%*s%s\t%lu\t%s%lu\t%s%d\n",
+	seq_printf(swap, "%*s%s\t%lu\t%s%lu\t%s%d\t\t%s\n",
 			len < 40 ? 40 - len : 1, " ",
 			S_ISBLK(file_inode(file)->i_mode) ?
 				"partition" : "file\t",
 			bytes, bytes < 10000000 ? "\t" : "",
 			inuse, inuse < 10000000 ? "\t" : "",
-			si->prio);
+			si->prio,
+			(si->flags & SWP_HOT) ? "hot" : "cold");
 	return 0;
 }
 
@@ -3069,6 +3104,10 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	if (p->bdev && bdev_stable_writes(p->bdev))
 		p->flags |= SWP_STABLE_WRITES;
 
+	if ((p->flags & SWP_BLKDEV) &&
+	    !strncmp(strrchr(name->name, '/'), "/zram", 5))
+		p->flags |= SWP_HOT;
+
 	if (p->bdev && bdev_synchronous(p->bdev))
 		p->flags |= SWP_SYNCHRONOUS_IO;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2fe4a11d63f4..f496e1abea76 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1519,6 +1519,7 @@ enum folio_references {
 	FOLIOREF_RECLAIM_CLEAN,
 	FOLIOREF_KEEP,
 	FOLIOREF_ACTIVATE,
+	FOLIOREF_ANON_ACCESS,
 };
 
 static enum folio_references folio_check_references(struct folio *folio,
@@ -1543,6 +1544,9 @@ static enum folio_references folio_check_references(struct folio *folio,
 		return FOLIOREF_KEEP;
 
 	if (referenced_ptes) {
+		if (folio_test_swapbacked(folio))
+			return FOLIOREF_ANON_ACCESS;
+
 		/*
 		 * All mapped folios start out with page table
 		 * references from the instantiating fault, so we need
@@ -1861,6 +1865,11 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			references = folio_check_references(folio, sc);
 
 		switch (references) {
+		case FOLIOREF_ANON_ACCESS:
+			if (folio_test_cold(folio))
+				folio_clear_cold(folio);
+			else
+				folio_set_hot(folio);
 		case FOLIOREF_ACTIVATE:
 			goto activate_locked;
 		case FOLIOREF_KEEP:
-- 
2.34.1



* [RFC PATCH 3/5] mm: add VMA hot flag
  2023-10-08  9:59 [RFC PATCH 0/5] hot page swap to zram, cold page swap to swapfile directly Lincheng Yang
  2023-10-08  9:59 ` [RFC PATCH 1/5] mm/swap_slots: cleanup swap slot cache Lincheng Yang
  2023-10-08  9:59 ` [RFC PATCH 2/5] mm: introduce hot and cold anon page flags Lincheng Yang
@ 2023-10-08  9:59 ` Lincheng Yang
  2023-10-08  9:59 ` [RFC PATCH 4/5] mm: add page implyreclaim flag Lincheng Yang
  2023-10-08  9:59 ` [RFC PATCH 5/5] mm/swapfile: add swapfile_write_enable interface Lincheng Yang
  4 siblings, 0 replies; 8+ messages in thread
From: Lincheng Yang @ 2023-10-08  9:59 UTC (permalink / raw)
  To: akpm, rostedt, mhiramat, willy, hughd, peterx, mike.kravetz, jgg,
	surenb, steven.price, pasha.tatashin, kirill.shutemov, yuanchu,
	david, mathieu.desnoyers, dhowells, shakeelb, pcc, tytso,
	42.hyeyoo, vbabka, catalin.marinas, lrh2000, ying.huang, mhocko,
	vishal.moola, yosryahmed, findns94, neilb
  Cc: linux-kernel, linux-mm, wanbin.wang, chunlei.zhuang,
	jinsheng.zhao, jiajun.ling, dongyun.liu, Lincheng Yang

When a VMA is executable, a stack, or mlocked, or when the number of page
faults in the VMA exceeds twice the number of pages in the VMA's address
range, or the number of major page faults exceeds half the number of pages
in that range, the VMA is considered hot. All anon pages of such a VMA are
therefore treated as hot pages and, when memory reclaim is performed, they
are written back to zram.

The per-VMA fault and major-fault counts are also shown in
/proc/<pid>/smaps.
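
For clarity, a user-space model of this heuristic, assuming 4 KiB pages
(the kernel-side check is the vma_hot() helper added in the diff below):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SHIFT	12	/* assume 4 KiB pages for this model */

/*
 * User-space model of the vma_hot() heuristic: a VMA is hot if it is
 * executable/stack/locked, or if it has accumulated more than 2 faults
 * per page, or more than 1 major fault per 2 pages.
 */
static bool vma_hot_model(unsigned long size, bool exec_stack_or_locked,
			  unsigned long pgfault, unsigned long pgmajfault)
{
	if (exec_stack_or_locked)
		return true;
	if (pgfault > (size >> (PAGE_SHIFT - 1)))	/* > 2 * nr_pages */
		return true;
	if (pgmajfault > (size >> (PAGE_SHIFT + 1)))	/* > nr_pages / 2 */
		return true;
	return false;
}

int main(void)
{
	/* A 4 MiB VMA has 1024 pages: hot above 2048 faults or 512 majfaults. */
	printf("%d\n", vma_hot_model(4UL << 20, false, 3000, 0));	/* 1 */
	printf("%d\n", vma_hot_model(4UL << 20, false, 0, 600));	/* 1 */
	printf("%d\n", vma_hot_model(4UL << 20, false, 100, 10));	/* 0 */
	return 0;
}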

Signed-off-by: Lincheng Yang <lincheng.yang@transsion.com>
---
 fs/proc/task_mmu.c             |  3 +++
 include/linux/mm.h             | 32 ++++++++++++++++++++++++++++++++
 include/linux/mm_types.h       |  1 +
 include/linux/mm_types_task.h  | 10 ++++++++++
 include/linux/swap.h           |  6 +++---
 include/trace/events/mmflags.h |  1 +
 mm/memory.c                    |  2 ++
 mm/rmap.c                      |  3 +++
 mm/shmem.c                     |  2 +-
 mm/swap.h                      |  4 ++--
 mm/swap_slots.c                |  4 ++--
 mm/swap_state.c                |  4 ++--
 mm/swapfile.c                  |  5 ++++-
 mm/vmscan.c                    | 10 ++++++++--
 14 files changed, 74 insertions(+), 13 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fafff1bd34cd..dc2c53eb759a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -871,6 +871,9 @@ static int show_smap(struct seq_file *m, void *v)
 
 	__show_smap(m, &mss, false);
 
+	seq_printf(m, "Faults:    %ld\n", get_vma_counter(vma, VMA_PGFAULT));
+	seq_printf(m, "MajFaults:    %ld\n", get_vma_counter(vma, VMA_PGMAJFAULT));
+
 	seq_printf(m, "THPeligible:    %d\n",
 		   hugepage_vma_check(vma, vma->vm_flags, true, false, true));
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 34f9dba17c1a..5dbe5a5c49c8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -282,6 +282,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_UFFD_MISSING	0
 #endif /* CONFIG_MMU */
 #define VM_PFNMAP	0x00000400	/* Page-ranges managed without "struct page", just pure PFN */
+#define VM_HOT		0x00000800
 #define VM_UFFD_WP	0x00001000	/* wrprotect pages tracking */
 
 #define VM_LOCKED	0x00002000
@@ -763,6 +764,7 @@ static inline void vma_mark_detached(struct vm_area_struct *vma,
 static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 {
 	static const struct vm_operations_struct dummy_vm_ops = {};
+	int i;
 
 	memset(vma, 0, sizeof(*vma));
 	vma->vm_mm = mm;
@@ -770,6 +772,9 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm)
 	INIT_LIST_HEAD(&vma->anon_vma_chain);
 	vma_mark_detached(vma, false);
 	vma_numab_state_init(vma);
+
+	for (i = 0; i < NR_VMA_EVENTS; i++)
+		atomic_long_set(&vma->events_stat.count[i], 0);
 }
 
 /* Use when VMA is not part of the VMA tree and needs no locking */
@@ -2515,6 +2520,33 @@ static inline bool get_user_page_fast_only(unsigned long addr,
 {
 	return get_user_pages_fast_only(addr, 1, gup_flags, pagep) == 1;
 }
+
+static inline void count_vma_counter(struct vm_area_struct *vma, int member)
+{
+	atomic_long_inc(&vma->events_stat.count[member]);
+}
+
+static inline unsigned long get_vma_counter(struct vm_area_struct *vma, int member)
+{
+	return atomic_long_read(&vma->events_stat.count[member]);
+}
+
+static inline bool vma_hot(struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & (VM_EXEC | VM_STACK | VM_LOCKED))
+		return true;
+
+	if (get_vma_counter(vma, VMA_PGFAULT) >
+			(vma->vm_end - vma->vm_start) >> (PAGE_SHIFT - 1))
+		return true;
+
+	if (get_vma_counter(vma, VMA_PGMAJFAULT) >
+			(vma->vm_end - vma->vm_start) >> (PAGE_SHIFT + 1))
+		return true;
+
+	return false;
+}
+
 /*
  * per-process(per-mm_struct) statistics.
  */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5e5cf457a236..863c54f2e165 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -585,6 +585,7 @@ struct vm_area_struct {
 	struct vma_numab_state *numab_state;	/* NUMA Balancing state */
 #endif
 	struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
+	struct vma_events_stat events_stat;
 } __randomize_layout;
 
 #ifdef CONFIG_SCHED_MM_CID
diff --git a/include/linux/mm_types_task.h b/include/linux/mm_types_task.h
index 5414b5c6a103..604107222306 100644
--- a/include/linux/mm_types_task.h
+++ b/include/linux/mm_types_task.h
@@ -36,6 +36,16 @@ enum {
 	NR_MM_COUNTERS
 };
 
+enum {
+	VMA_PGFAULT,
+	VMA_PGMAJFAULT,
+	NR_VMA_EVENTS
+};
+
+struct vma_events_stat {
+	atomic_long_t count[NR_VMA_EVENTS];
+};
+
 struct page_frag {
 	struct page *page;
 #if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 70678dbd9a3a..a4c764eca0cc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -477,12 +477,12 @@ static inline long get_nr_swap_pages(void)
 }
 
 extern void si_swapinfo(struct sysinfo *);
-swp_entry_t folio_alloc_swap(struct folio *folio);
+swp_entry_t folio_alloc_swap(struct folio *folio, bool hotness);
 bool folio_free_swap(struct folio *folio);
 void put_swap_folio(struct folio *folio, swp_entry_t entry);
 extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size, bool hot);
-bool swap_folio_hot(struct folio *folio);
+bool swap_folio_hot(struct folio *folio, bool hotness);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t);
@@ -583,7 +583,7 @@ static inline int swp_swapcount(swp_entry_t entry)
 	return 0;
 }
 
-static inline swp_entry_t folio_alloc_swap(struct folio *folio)
+static inline swp_entry_t folio_alloc_swap(struct folio *folio, bool hotness)
 {
 	swp_entry_t entry;
 	entry.val = 0;
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 2bbef50d80a8..f266f92c41c6 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -179,6 +179,7 @@ IF_HAVE_PG_ARCH_X(arch_3)
 	{VM_UFFD_MISSING,		"uffd_missing"	},		\
 IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR,	"uffd_minor"	)		\
 	{VM_PFNMAP,			"pfnmap"	},		\
+	{VM_HOT,			"hot"		},		\
 	{VM_UFFD_WP,			"uffd_wp"	},		\
 	{VM_LOCKED,			"locked"	},		\
 	{VM_IO,				"io"		},		\
diff --git a/mm/memory.c b/mm/memory.c
index ef41e65b1075..907e13933567 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3836,6 +3836,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 		/* Had to read the page from swap area: Major fault */
 		ret = VM_FAULT_MAJOR;
+		count_vma_counter(vma, VMA_PGMAJFAULT);
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
 	} else if (PageHWPoison(page)) {
@@ -5288,6 +5289,7 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 			mem_cgroup_oom_synchronize(false);
 	}
 out:
+	count_vma_counter(vma, VMA_PGFAULT);
 	mm_account_fault(mm, regs, address, flags, ret);
 
 	return ret;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0c0d8857dfce..9b439280f430 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -857,6 +857,9 @@ static bool folio_referenced_one(struct folio *folio,
 		pra->vm_flags |= vma->vm_flags & ~VM_LOCKED;
 	}
 
+	if (vma_hot(vma))
+		pra->vm_flags |= VM_HOT;
+
 	if (!pra->mapcount)
 		return false; /* To break the loop */
 
diff --git a/mm/shmem.c b/mm/shmem.c
index d963c747dabc..28350a38d169 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1412,7 +1412,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc)
 		folio_mark_uptodate(folio);
 	}
 
-	swap = folio_alloc_swap(folio);
+	swap = folio_alloc_swap(folio, true);
 	if (!swap.val)
 		goto redirty;
 
diff --git a/mm/swap.h b/mm/swap.h
index 7c033d793f15..470b8084b036 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -29,7 +29,7 @@ extern struct address_space *swapper_spaces[];
 		>> SWAP_ADDRESS_SPACE_SHIFT])
 
 void show_swap_cache_info(void);
-bool add_to_swap(struct folio *folio);
+bool add_to_swap(struct folio *folio, bool hotness);
 void *get_shadow_from_swap_cache(swp_entry_t entry);
 int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
 		      gfp_t gfp, void **shadowp);
@@ -110,7 +110,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
 	return filemap_get_folio(mapping, index);
 }
 
-static inline bool add_to_swap(struct folio *folio)
+static inline bool add_to_swap(struct folio *folio, bool hotness)
 {
 	return false;
 }
diff --git a/mm/swap_slots.c b/mm/swap_slots.c
index dff98cf4d2e5..4707347599ce 100644
--- a/mm/swap_slots.c
+++ b/mm/swap_slots.c
@@ -357,14 +357,14 @@ static int __folio_alloc_swap(struct swap_slots_cache *cache, swp_entry_t *entry
 	return !!entry->val;
 }
 
-swp_entry_t folio_alloc_swap(struct folio *folio)
+swp_entry_t folio_alloc_swap(struct folio *folio, bool hotness)
 {
 	swp_entry_t entry;
 	struct swap_slots_cache *cache;
 	bool hot;
 
 	entry.val = 0;
-	hot = swap_folio_hot(folio);
+	hot = swap_folio_hot(folio, hotness);
 
 	if (folio_test_large(folio)) {
 		if (IS_ENABLED(CONFIG_THP_SWAP) && arch_thp_swp_supported())
diff --git a/mm/swap_state.c b/mm/swap_state.c
index f8ea7015bad4..5766fa5ac21e 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -174,7 +174,7 @@ void __delete_from_swap_cache(struct folio *folio,
  * Context: Caller needs to hold the folio lock.
  * Return: Whether the folio was added to the swap cache.
  */
-bool add_to_swap(struct folio *folio)
+bool add_to_swap(struct folio *folio, bool hotness)
 {
 	swp_entry_t entry;
 	int err;
@@ -182,7 +182,7 @@ bool add_to_swap(struct folio *folio)
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
 
-	entry = folio_alloc_swap(folio);
+	entry = folio_alloc_swap(folio, hotness);
 	if (!entry.val)
 		return false;
 
diff --git a/mm/swapfile.c b/mm/swapfile.c
index ada28c0ce569..5378f70d330d 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -118,8 +118,11 @@ static inline bool swap_info_hot(struct swap_info_struct *si)
 	return !!(si->flags & SWP_HOT);
 }
 
-bool swap_folio_hot(struct folio *folio)
+bool swap_folio_hot(struct folio *folio, bool hotness)
 {
+	if (hotness)
+		return true;
+
 	if (folio_test_swapbacked(folio) && folio_test_hot(folio))
 		return true;
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index f496e1abea76..11d175d9fe0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -155,6 +155,8 @@ struct scan_control {
 	/* Number of pages freed so far during a call to shrink_zones() */
 	unsigned long nr_reclaimed;
 
+	bool hotness;
+
 	struct {
 		unsigned int dirty;
 		unsigned int unqueued_dirty;
@@ -1532,6 +1534,8 @@ static enum folio_references folio_check_references(struct folio *folio,
 					   &vm_flags);
 	referenced_folio = folio_test_clear_referenced(folio);
 
+	sc->hotness = !!(vm_flags & VM_HOT);
+
 	/*
 	 * The supposedly reclaimable folio was found to be in a VM_LOCKED vma.
 	 * Let the folio, now marked Mlocked, be moved to the unevictable list.
@@ -1745,6 +1749,8 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 		/* Account the number of base pages */
 		sc->nr_scanned += nr_pages;
 
+		sc->hotness = !ignore_references;
+
 		if (unlikely(!folio_evictable(folio)))
 			goto activate_locked;
 
@@ -1916,7 +1922,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 								folio_list))
 						goto activate_locked;
 				}
-				if (!add_to_swap(folio)) {
+				if (!add_to_swap(folio, sc->hotness)) {
 					if (!folio_test_large(folio))
 						goto activate_locked_split;
 					/* Fallback to swap normal pages */
@@ -1926,7 +1932,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 					count_vm_event(THP_SWPOUT_FALLBACK);
 #endif
-					if (!add_to_swap(folio))
+					if (!add_to_swap(folio, sc->hotness))
 						goto activate_locked_split;
 				}
 			}
-- 
2.34.1



* [RFC PATCH 4/5] mm: add page implyreclaim flag
  2023-10-08  9:59 [RFC PATCH 0/5] hot page swap to zram, cold page swap to swapfile directly Lincheng Yang
                   ` (2 preceding siblings ...)
  2023-10-08  9:59 ` [RFC PATCH 3/5] mm: add VMA hot flag Lincheng Yang
@ 2023-10-08  9:59 ` Lincheng Yang
  2023-10-08 11:07   ` Matthew Wilcox
  2023-10-08  9:59 ` [RFC PATCH 5/5] mm/swapfile: add swapfile_write_enable interface Lincheng Yang
  4 siblings, 1 reply; 8+ messages in thread
From: Lincheng Yang @ 2023-10-08  9:59 UTC (permalink / raw)
  To: akpm, rostedt, mhiramat, willy, hughd, peterx, mike.kravetz, jgg,
	surenb, steven.price, pasha.tatashin, kirill.shutemov, yuanchu,
	david, mathieu.desnoyers, dhowells, shakeelb, pcc, tytso,
	42.hyeyoo, vbabka, catalin.marinas, lrh2000, ying.huang, mhocko,
	vishal.moola, yosryahmed, findns94, neilb
  Cc: linux-kernel, linux-mm, wanbin.wang, chunlei.zhuang,
	jinsheng.zhao, jiajun.ling, dongyun.liu, Lincheng Yang

Add an implyreclaim flag, which marks a page that was reclaimed on the
user's advice (madvise/process_madvise). An implyreclaim page is treated
as hot only when the number of anon workingset restores since the last
reclaim cycle exceeds workingset_restore_limit, i.e. when it is
demonstrably used frequently. For any other page, any restore activity
since the last reclaim cycle already indicates frequent use, so it is
treated as a hot page.
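
The decision can be summarized by the following user-space model with
illustrative names; the real check is the swap_folio_hot() change below,
where the delta is taken against the per-lruvec snapshot updated in
snapshot_refaults():

#include <stdbool.h>
#include <stdio.h>

/*
 * Model of the hotness decision added to swap_folio_hot(): restore_delta
 * stands for the growth of WORKINGSET_RESTORE_ANON since the last reclaim
 * snapshot.
 */
static bool hot_from_restores(bool implyreclaim, long restore_delta,
			      long workingset_restore_limit)
{
	if (implyreclaim)	/* reclaimed on user advice: need the limit */
		return restore_delta > workingset_restore_limit;
	return restore_delta != 0;	/* otherwise any restore marks it hot */
}

int main(void)
{
	printf("%d\n", hot_from_restores(true, 10, 100));	/* 0: below limit */
	printf("%d\n", hot_from_restores(true, 200, 100));	/* 1: above limit */
	printf("%d\n", hot_from_restores(false, 1, 100));	/* 1: any restore */
	return 0;
}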

Signed-off-by: Lincheng Yang <lincheng.yang@transsion.com>
---
 include/linux/mmzone.h         |  1 +
 include/linux/page-flags.h     |  3 ++
 include/trace/events/mmflags.h |  3 +-
 mm/madvise.c                   |  1 +
 mm/migrate.c                   |  2 ++
 mm/swapfile.c                  | 64 +++++++++++++++++++++++++++++++++-
 mm/vmscan.c                    |  3 ++
 7 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 5e50b78d58ea..b280e6b0015a 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -626,6 +626,7 @@ struct lruvec {
 	atomic_long_t			nonresident_age;
 	/* Refaults at the time of last reclaim cycle */
 	unsigned long			refaults[ANON_AND_FILE];
+	unsigned long			restores[ANON_AND_FILE];
 	/* Various lruvec state flags (enum lruvec_flags) */
 	unsigned long			flags;
 #ifdef CONFIG_LRU_GEN
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a2c83c0100aa..4a1278851d4b 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -138,6 +138,7 @@ enum pageflags {
 #endif
 	PG_hot,
 	PG_cold,
+	PG_implyreclaim,
 	__NR_PAGEFLAGS,
 
 	PG_readahead = PG_reclaim,
@@ -482,6 +483,8 @@ PAGEFLAG(Hot, hot, PF_HEAD)
 	TESTCLEARFLAG(Hot, hot, PF_HEAD)
 PAGEFLAG(Cold, cold, PF_HEAD)
 	TESTCLEARFLAG(Cold, cold, PF_HEAD)
+PAGEFLAG(Implyreclaim, implyreclaim, PF_HEAD)
+	TESTCLEARFLAG(Implyreclaim, implyreclaim, PF_HEAD)
 __PAGEFLAG(Slab, slab, PF_NO_TAIL)
 PAGEFLAG(Checked, checked, PF_NO_COMPOUND)	   /* Used by some filesystems */
 
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index f266f92c41c6..ee014f955aef 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -120,7 +120,8 @@
 	DEF_PAGEFLAG_NAME(swapbacked),					\
 	DEF_PAGEFLAG_NAME(unevictable),					\
 	DEF_PAGEFLAG_NAME(hot),						\
-	DEF_PAGEFLAG_NAME(cold)						\
+	DEF_PAGEFLAG_NAME(cold),					\
+	DEF_PAGEFLAG_NAME(implyreclaim)					\
 IF_HAVE_PG_MLOCK(mlocked)						\
 IF_HAVE_PG_UNCACHED(uncached)						\
 IF_HAVE_PG_HWPOISON(hwpoison)						\
diff --git a/mm/madvise.c b/mm/madvise.c
index a5c19bb3f392..199b48dfa8c5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -518,6 +518,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 				} else {
 					list_add(&folio->lru, &folio_list);
 					folio_set_cold(folio);
+					folio_set_implyreclaim(folio);
 				}
 			}
 		} else
diff --git a/mm/migrate.c b/mm/migrate.c
index 9f97744bb0a8..691b4f7bf1ae 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -565,6 +565,8 @@ void folio_migrate_flags(struct folio *newfolio, struct folio *folio)
 		folio_set_hot(newfolio);
 	if (folio_test_cold(folio))
 		folio_set_cold(newfolio);
+	if (folio_test_implyreclaim(folio))
+		folio_set_implyreclaim(newfolio);
 	if (folio_test_checked(folio))
 		folio_set_checked(newfolio);
 	/*
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 5378f70d330d..629e6a291e9b 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -105,6 +105,8 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
 
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
+static unsigned int workingset_restore_limit;
+
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
 	if (type >= MAX_SWAPFILES)
@@ -120,11 +122,33 @@ static inline bool swap_info_hot(struct swap_info_struct *si)
 
 bool swap_folio_hot(struct folio *folio, bool hotness)
 {
+	struct lruvec *lruvec;
+	struct mem_cgroup *memcg;
+	unsigned long restores;
+	int delta;
+
 	if (hotness)
 		return true;
 
-	if (folio_test_swapbacked(folio) && folio_test_hot(folio))
+	if (folio_test_swapbacked(folio) && folio_test_hot(folio)) {
+		folio_clear_implyreclaim(folio);
+		return true;
+	}
+
+	rcu_read_lock(); // prevent writing from delaying reading
+	memcg = folio_memcg_rcu(folio);
+	rcu_read_unlock();
+
+	lruvec = mem_cgroup_lruvec(memcg, folio_pgdat(folio));
+	restores = lruvec_page_state(lruvec, WORKINGSET_RESTORE_ANON);
+	delta = restores - lruvec->restores[WORKINGSET_ANON];
+
+	if (folio_test_clear_implyreclaim(folio)) {
+		if (delta > workingset_restore_limit)
+			return true;
+	} else if (delta) {
 		return true;
+	}
 
 	if (folio_test_cold(folio))
 		folio_clear_cold(folio);
@@ -2715,9 +2739,47 @@ static const struct proc_ops swaps_proc_ops = {
 	.proc_poll	= swaps_poll,
 };
 
+static ssize_t workingset_restore_limit_write(struct file *file,
+					      const char __user *ubuf,
+					      size_t count, loff_t *pos)
+{
+	unsigned int val;
+	int ret;
+
+	ret = kstrtouint_from_user(ubuf, count, 10, &val);
+	if (ret)
+		return ret;
+
+	workingset_restore_limit = val;
+
+	return count;
+}
+
+static int workingset_restore_limit_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", workingset_restore_limit);
+
+	return 0;
+}
+
+static int workingset_restore_limit_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, workingset_restore_limit_show, NULL);
+}
+
+const struct proc_ops workingset_restore_limit_fops = {
+	.proc_open = workingset_restore_limit_open,
+	.proc_read = seq_read,
+	.proc_lseek = seq_lseek,
+	.proc_release = seq_release,
+	.proc_write = workingset_restore_limit_write,
+};
+
 static int __init procswaps_init(void)
 {
 	proc_create("swaps", 0, NULL, &swaps_proc_ops);
+	proc_create("workingset_restore_limit", S_IALLUGO, NULL, &workingset_restore_limit_fops);
+
 	return 0;
 }
 __initcall(procswaps_init);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 11d175d9fe0c..8107f8d86d7f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6805,6 +6805,9 @@ static void snapshot_refaults(struct mem_cgroup *target_memcg, pg_data_t *pgdat)
 	target_lruvec->refaults[WORKINGSET_ANON] = refaults;
 	refaults = lruvec_page_state(target_lruvec, WORKINGSET_ACTIVATE_FILE);
 	target_lruvec->refaults[WORKINGSET_FILE] = refaults;
+
+	refaults = lruvec_page_state(target_lruvec, WORKINGSET_RESTORE_ANON);
+	target_lruvec->restores[WORKINGSET_ANON] = refaults;
 }
 
 /*
-- 
2.34.1



* [RFC PATCH 5/5] mm/swapfile: add swapfile_write_enable interface
  2023-10-08  9:59 [RFC PATCH 0/5] hot page swap to zram, cold page swap to swapfile directly Lincheng Yang
                   ` (3 preceding siblings ...)
  2023-10-08  9:59 ` [RFC PATCH 4/5] mm: add page implyreclaim flag Lincheng Yang
@ 2023-10-08  9:59 ` Lincheng Yang
  4 siblings, 0 replies; 8+ messages in thread
From: Lincheng Yang @ 2023-10-08  9:59 UTC (permalink / raw)
  To: akpm, rostedt, mhiramat, willy, hughd, peterx, mike.kravetz, jgg,
	surenb, steven.price, pasha.tatashin, kirill.shutemov, yuanchu,
	david, mathieu.desnoyers, dhowells, shakeelb, pcc, tytso,
	42.hyeyoo, vbabka, catalin.marinas, lrh2000, ying.huang, mhocko,
	vishal.moola, yosryahmed, findns94, neilb
  Cc: linux-kernel, linux-mm, wanbin.wang, chunlei.zhuang,
	jinsheng.zhao, jiajun.ling, dongyun.liu, Lincheng Yang

When the user does not want cold pages written back to the swapfile,
/proc/swapfile_write_enable can be set to 0. In that case all anon pages,
regardless of their hot or cold status, are written back to the zram
device.
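
A minimal user-space example of flipping the switch (the proc file exists
only with this series applied):

#include <stdio.h>

/*
 * Disable direct writeback of cold pages to the swapfile; everything will
 * then be swapped out to zram.
 */
int main(void)
{
	FILE *fp = fopen("/proc/swapfile_write_enable", "w");

	if (!fp) {
		perror("/proc/swapfile_write_enable");
		return 1;
	}
	fputs("0\n", fp);	/* 0: never write cold pages to the swapfile */
	return fclose(fp) ? 1 : 0;
}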

Signed-off-by: Lincheng Yang <lincheng.yang@transsion.com>
---
 mm/swapfile.c | 39 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 629e6a291e9b..557d1c29be77 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -106,6 +106,7 @@ static atomic_t proc_poll_event = ATOMIC_INIT(0);
 atomic_t nr_rotate_swap = ATOMIC_INIT(0);
 
 static unsigned int workingset_restore_limit;
+static unsigned int swapfile_write_enable;
 
 static struct swap_info_struct *swap_type_to_swap_info(int type)
 {
@@ -127,7 +128,7 @@ bool swap_folio_hot(struct folio *folio, bool hotness)
 	unsigned long restores;
 	int delta;
 
-	if (hotness)
+	if (!swapfile_write_enable || hotness)
 		return true;
 
 	if (folio_test_swapbacked(folio) && folio_test_hot(folio)) {
@@ -2775,10 +2776,46 @@ const struct proc_ops workingset_restore_limit_fops = {
 	.proc_write = workingset_restore_limit_write,
 };
 
+static ssize_t swapfile_write_enable_write(struct file *file,
+					   const char __user *ubuf,
+					   size_t count, loff_t *pos)
+{
+	unsigned int val;
+	int ret;
+
+	ret = kstrtouint_from_user(ubuf, count, 10, &val);
+	if (ret)
+		return ret;
+
+	swapfile_write_enable = val;
+
+	return count;
+}
+
+static int swapfile_write_enable_show(struct seq_file *m, void *v)
+{
+	seq_printf(m, "%d\n", swapfile_write_enable);
+	return 0;
+}
+
+static int swapfile_write_enable_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, swapfile_write_enable_show, inode->i_private);
+}
+
+const struct proc_ops swapfile_write_enable_fops = {
+	.proc_open	= swapfile_write_enable_open,
+	.proc_read	= seq_read,
+	.proc_lseek	= seq_lseek,
+	.proc_release	= seq_release,
+	.proc_write	= swapfile_write_enable_write,
+};
+
 static int __init procswaps_init(void)
 {
 	proc_create("swaps", 0, NULL, &swaps_proc_ops);
 	proc_create("workingset_restore_limit", S_IALLUGO, NULL, &workingset_restore_limit_fops);
+	proc_create("swapfile_write_enable", S_IALLUGO, NULL, &swapfile_write_enable_fops);
 
 	return 0;
 }
-- 
2.34.1



* Re: [RFC PATCH 4/5] mm: add page implyreclaim flag
  2023-10-08  9:59 ` [RFC PATCH 4/5] mm: add page implyreclaim flag Lincheng Yang
@ 2023-10-08 11:07   ` Matthew Wilcox
  2023-10-10  3:27     ` Lincheng Yang
  0 siblings, 1 reply; 8+ messages in thread
From: Matthew Wilcox @ 2023-10-08 11:07 UTC (permalink / raw)
  To: Lincheng Yang
  Cc: akpm, rostedt, mhiramat, hughd, peterx, mike.kravetz, jgg,
	surenb, steven.price, pasha.tatashin, kirill.shutemov, yuanchu,
	david, mathieu.desnoyers, dhowells, shakeelb, pcc, tytso,
	42.hyeyoo, vbabka, catalin.marinas, lrh2000, ying.huang, mhocko,
	vishal.moola, yosryahmed, findns94, neilb, linux-kernel,
	linux-mm, wanbin.wang, chunlei.zhuang, jinsheng.zhao,
	jiajun.ling, dongyun.liu, Lincheng Yang

On Sun, Oct 08, 2023 at 05:59:23PM +0800, Lincheng Yang wrote:
> Add an implyreclaim flag, which marks a page that was reclaimed on the
> user's advice (madvise/process_madvise).
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index a2c83c0100aa..4a1278851d4b 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -138,6 +138,7 @@ enum pageflags {
>  #endif
>  	PG_hot,
>  	PG_cold,
> +	PG_implyreclaim,
>  	__NR_PAGEFLAGS,

Can you do all of this without adding three page flags?  We're really
tight on page flags.  At a minimum, this is going to have to go behind
an ifdef that depends on 64BIT, but it'd be better if we can derive
hot/cold/implyreclaim from existing page flags.  Look at the antics
we go through for PG_young and PG_idle; if we had two page flags free,
we'd spend them on removing the special cases there.



* Re: [RFC PATCH 4/5] mm: add page implyreclaim flag
  2023-10-08 11:07   ` Matthew Wilcox
@ 2023-10-10  3:27     ` Lincheng Yang
  0 siblings, 0 replies; 8+ messages in thread
From: Lincheng Yang @ 2023-10-10  3:27 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: akpm, rostedt, mhiramat, hughd, peterx, mike.kravetz, jgg,
	surenb, steven.price, pasha.tatashin, kirill.shutemov, yuanchu,
	david, mathieu.desnoyers, dhowells, shakeelb, pcc, tytso,
	42.hyeyoo, vbabka, catalin.marinas, lrh2000, ying.huang, mhocko,
	vishal.moola, yosryahmed, findns94, neilb, linux-kernel,
	linux-mm, wanbin.wang, chunlei.zhuang, jinsheng.zhao,
	jiajun.ling, dongyun.liu, Lincheng Yang

On Sun, Oct 08, 2023 at 12:07:49PM +0100, Matthew Wilcox wrote:
> On Sun, Oct 08, 2023 at 05:59:23PM +0800, Lincheng Yang wrote:
> > Add an implyreclaim flag, which marks a page that was reclaimed on the
> > user's advice (madvise/process_madvise).
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index a2c83c0100aa..4a1278851d4b 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -138,6 +138,7 @@ enum pageflags {
> >  #endif
> >  	PG_hot,
> >  	PG_cold,
> > +	PG_implyreclaim,
> >  	__NR_PAGEFLAGS,
>
> Can you do all of this without adding three page flags?  We're really
> tight on page flags.  At a minimum, this is going to have to go behind
> an ifdef that depends on 64BIT, but it'd be better if we can derive
> hot/cold/implyreclaim from existing page flags.  Look at the antics

Sounds good. We will try to do it with the existing page flags; if the
existing page flags cannot achieve our goal, we will put the new flags
behind an ifdef that depends on 64BIT.

> we go through for PG_young and PG_idle; if we had two page flags free,
> we'd spend them on removing the special cases there.
>


