linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece
@ 2018-10-10  7:19 Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 01/21] swap: Enable PMD swap operations for CONFIG_THP_SWAP Huang Ying
                   ` (22 more replies)
  0 siblings, 23 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

Hi, Andrew, could you help me to check whether the overall design is
reasonable?

Hi, Hugh, Shaohua, Minchan and Rik, could you help me to review the
swap part of the patchset?  Especially [02/21], [03/21], [04/21],
[05/21], [06/21], [07/21], [08/21], [09/21], [10/21], [11/21],
[12/21], [20/21], [21/21].

Hi, Andrea and Kirill, could you help me to review the THP part of the
patchset?  Especially [01/21], [07/21], [09/21], [11/21], [13/21],
[15/21], [16/21], [17/21], [18/21], [19/21], [20/21].

Hi, Johannes and Michal, could you help me to review the cgroup part
of the patchset?  Especially [14/21].

And for all, Any comment is welcome!

This patchset is based on the 2018-10-3 head of mmotm/master.

This is the final step of THP (Transparent Huge Page) swap
optimization.  After the first and second step, the splitting huge
page is delayed from almost the first step of swapout to after swapout
has been finished.  In this step, we avoid splitting THP for swapout
and swapout/swapin the THP in one piece.

We tested the patchset with vm-scalability benchmark swap-w-seq test
case, with 16 processes.  The test case forks 16 processes.  Each
process allocates large anonymous memory range, and writes it from
begin to end for 8 rounds.  The first round will swapout, while the
remaining rounds will swapin and swapout.  The test is done on a Xeon
E5 v3 system, the swap device used is a RAM simulated PMEM (persistent
memory) device.  The test result is as follow,

            base                  optimized
---------------- -------------------------- 
         %stddev     %change         %stddev
             \          |                \  
   1417897 ±  2%    +992.8%   15494673        vm-scalability.throughput
   1020489 ±  4%   +1091.2%   12156349        vmstat.swap.si
   1255093 ±  3%    +940.3%   13056114        vmstat.swap.so
   1259769 ±  7%   +1818.3%   24166779        meminfo.AnonHugePages
  28021761           -10.7%   25018848 ±  2%  meminfo.AnonPages
  64080064 ±  4%     -95.6%    2787565 ± 33%  interrupts.CAL:Function_call_interrupts
     13.91 ±  5%     -13.8        0.10 ± 27%  perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath

Where, the score of benchmark (bytes written per second) improved
992.8%.  The swapout/swapin throughput improved 1008% (from about
2.17GB/s to 24.04GB/s).  The performance difference is huge.  In base
kernel, for the first round of writing, the THP is swapout and split,
so in the remaining rounds, there is only normal page swapin and
swapout.  While in optimized kernel, the THP is kept after first
swapout, so THP swapin and swapout is used in the remaining rounds.
This shows the key benefit to swapout/swapin THP in one piece, the THP
will be kept instead of being split.  meminfo information verified
this, in base kernel only 4.5% of anonymous page are THP during the
test, while in optimized kernel, that is 96.6%.  The TLB flushing IPI
(represented as interrupts.CAL:Function_call_interrupts) reduced
95.6%, while cycles for spinlock reduced from 13.9% to 0.1%.  These
are performance benefit of THP swapout/swapin too.

Below is the description for all steps of THP swap optimization.

Recently, the performance of the storage devices improved so fast that
we cannot saturate the disk bandwidth with single logical CPU when do
page swapping even on a high-end server machine.  Because the
performance of the storage device improved faster than that of single
logical CPU.  And it seems that the trend will not change in the near
future.  On the other hand, the THP becomes more and more popular
because of increased memory size.  So it becomes necessary to optimize
THP swap performance.

The advantages to swapout/swapin a THP in one piece include:

- Batch various swap operations for the THP.  Many operations need to
  be done once per THP instead of per normal page, for example,
  allocating/freeing the swap space, writing/reading the swap space,
  flushing TLB, page fault, etc.  This will improve the performance of
  the THP swap greatly.

- The THP swap space read/write will be large sequential IO (2M on
  x86_64).  It is particularly helpful for the swapin, which are
  usually 4k random IO.  This will improve the performance of the THP
  swap too.

- It will help the memory fragmentation, especially when the THP is
  heavily used by the applications.  The THP order pages will be free
  up after THP swapout.

- It will improve the THP utilization on the system with the swap
  turned on.  Because the speed for khugepaged to collapse the normal
  pages into the THP is quite slow.  After the THP is split during the
  swapout, it will take quite long time for the normal pages to
  collapse back into the THP after being swapin.  The high THP
  utilization helps the efficiency of the page based memory management
  too.

There are some concerns regarding THP swapin, mainly because possible
enlarged read/write IO size (for swapout/swapin) may put more overhead
on the storage device.  To deal with that, the THP swapin is turned on
only when necessary.  A new sysfs interface:
/sys/kernel/mm/transparent_hugepage/swapin_enabled is added to
configure it.  It uses "always/never/madvise" logic, to be turned on
globally, turned off globally, or turned on only for VMA with
MADV_HUGEPAGE, etc.
GE, etc.

Changelog
---------

V6:

- Rebased on 10/3 HEAD of mmotm/master

- Added return value checking in swap_duplicate() per Daniel's comments

v5:

- Rebased on 9/20 HEAD of mmotm/master

- Merged the swap operations implementation for the huge and the
  normal swap entries when possible

- Added more code comments to improve code readability

- Changed function parameter style to avoid to use Boolean parameter
  as much as possible

- Fixed a deadlock issue in do_huge_pmd_swap_page(), thanks 0-Day and sparse

v4:

- Rebased on 6/14 HEAD of mmotm/master

- Fixed one build bug and several coding style issues, Thanks Daniel Jordon

v3:

- Rebased on 5/18 HEAD of mmotm/master

- Fixed a build bug, Thanks 0-Day!

v2:

- Fixed several build bugs, Thanks 0-Day!

- Improved documentation as suggested by Randy Dunlap.

- Fixed several bugs in reading huge swap cluster

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 01/21] swap: Enable PMD swap operations for CONFIG_THP_SWAP
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 02/21] swap: Add __swap_duplicate_locked() Huang Ying
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

Currently, "the swap entry" in the page tables is used for a number of
things outside of actual swap, like page migration, etc.  We support
the THP/PMD "swap entry" for page migration currently and the
functions behind this are tied to page migration's config
option (CONFIG_ARCH_ENABLE_THP_MIGRATION).

But, we also need them for THP swap optimization.  So a new config
option (CONFIG_HAVE_PMD_SWAP_ENTRY) is added.  It is enabled when
either CONFIG_ARCH_ENABLE_THP_MIGRATION or CONFIG_THP_SWAP is enabled.
And PMD swap entry functions are tied to this new config option
instead.  Some functions enabled by CONFIG_ARCH_ENABLE_THP_MIGRATION
are for page migration only, they are still enabled only for that.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 arch/x86/include/asm/pgtable.h |  2 +-
 include/asm-generic/pgtable.h  |  2 +-
 include/linux/swapops.h        | 44 ++++++++++++++++++++++--------------------
 mm/Kconfig                     |  8 ++++++++
 4 files changed, 33 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 40616e805292..e830ab345551 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1333,7 +1333,7 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 	return pte_clear_flags(pte, _PAGE_SWP_SOFT_DIRTY);
 }
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifdef CONFIG_HAVE_PMD_SWAP_ENTRY
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_SWP_SOFT_DIRTY);
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 5657a20e0c59..eb1e9d17371b 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -675,7 +675,7 @@ static inline void ptep_modify_prot_commit(struct mm_struct *mm,
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
-#ifndef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#ifndef CONFIG_HAVE_PMD_SWAP_ENTRY
 static inline pmd_t pmd_swp_mksoft_dirty(pmd_t pmd)
 {
 	return pmd;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4d961668e5fc..905ddc65caa3 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -254,17 +254,7 @@ static inline int is_write_migration_entry(swp_entry_t entry)
 
 #endif
 
-struct page_vma_mapped_walk;
-
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
-		struct page *page);
-
-extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
-		struct page *new);
-
-extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
-
+#ifdef CONFIG_HAVE_PMD_SWAP_ENTRY
 static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
 {
 	swp_entry_t arch_entry;
@@ -282,6 +272,28 @@ static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
 	arch_entry = __swp_entry(swp_type(entry), swp_offset(entry));
 	return __swp_entry_to_pmd(arch_entry);
 }
+#else
+static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
+{
+	return swp_entry(0, 0);
+}
+
+static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
+{
+	return __pmd(0);
+}
+#endif
+
+struct page_vma_mapped_walk;
+
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+extern void set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
+		struct page *page);
+
+extern void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
+		struct page *new);
+
+extern void pmd_migration_entry_wait(struct mm_struct *mm, pmd_t *pmd);
 
 static inline int is_pmd_migration_entry(pmd_t pmd)
 {
@@ -302,16 +314,6 @@ static inline void remove_migration_pmd(struct page_vma_mapped_walk *pvmw,
 
 static inline void pmd_migration_entry_wait(struct mm_struct *m, pmd_t *p) { }
 
-static inline swp_entry_t pmd_to_swp_entry(pmd_t pmd)
-{
-	return swp_entry(0, 0);
-}
-
-static inline pmd_t swp_entry_to_pmd(swp_entry_t entry)
-{
-	return __pmd(0);
-}
-
 static inline int is_pmd_migration_entry(pmd_t pmd)
 {
 	return 0;
diff --git a/mm/Kconfig b/mm/Kconfig
index b1006cdf3aff..44f7d72010fd 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -424,6 +424,14 @@ config THP_SWAP
 
 	  For selection by architectures with reasonable THP sizes.
 
+#
+# "PMD swap entry" in the page table is used both for migration and
+# actual swap.
+#
+config HAVE_PMD_SWAP_ENTRY
+	def_bool y
+	depends on THP_SWAP || ARCH_ENABLE_THP_MIGRATION
+
 config	TRANSPARENT_HUGE_PAGECACHE
 	def_bool y
 	depends on TRANSPARENT_HUGEPAGE
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 02/21] swap: Add __swap_duplicate_locked()
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 01/21] swap: Enable PMD swap operations for CONFIG_THP_SWAP Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 03/21] swap: Support PMD swap mapping in swap_duplicate() Huang Ying
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

The part of __swap_duplicate() with lock held is separated into a new
function __swap_duplicate_locked().  Because we will add more logic
about the PMD swap mapping into __swap_duplicate() and keep the most
PTE swap mapping related logic in __swap_duplicate_locked().

Just mechanical code refactoring, there is no any functional change in
this patch.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/swapfile.c | 63 +++++++++++++++++++++++++++++++++--------------------------
 1 file changed, 35 insertions(+), 28 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index 97a1bd1a7c9a..6a570ef00fa7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -3436,32 +3436,12 @@ void si_swapinfo(struct sysinfo *val)
 	spin_unlock(&swap_lock);
 }
 
-/*
- * Verify that a swap entry is valid and increment its swap map count.
- *
- * Returns error code in following case.
- * - success -> 0
- * - swp_entry is invalid -> EINVAL
- * - swp_entry is migration entry -> EINVAL
- * - swap-cache reference is requested but there is already one. -> EEXIST
- * - swap-cache reference is requested but the entry is not used. -> ENOENT
- * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
- */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate_locked(struct swap_info_struct *p,
+				   unsigned long offset, unsigned char usage)
 {
-	struct swap_info_struct *p;
-	struct swap_cluster_info *ci;
-	unsigned long offset;
 	unsigned char count;
 	unsigned char has_cache;
-	int err = -EINVAL;
-
-	p = get_swap_device(entry);
-	if (!p)
-		goto out;
-
-	offset = swp_offset(entry);
-	ci = lock_cluster_or_swap_info(p, offset);
+	int err = 0;
 
 	count = p->swap_map[offset];
 
@@ -3471,12 +3451,11 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 	 */
 	if (unlikely(swap_count(count) == SWAP_MAP_BAD)) {
 		err = -ENOENT;
-		goto unlock_out;
+		goto out;
 	}
 
 	has_cache = count & SWAP_HAS_CACHE;
 	count &= ~SWAP_HAS_CACHE;
-	err = 0;
 
 	if (usage == SWAP_HAS_CACHE) {
 
@@ -3503,11 +3482,39 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
 
 	p->swap_map[offset] = count | has_cache;
 
-unlock_out:
+out:
+	return err;
+}
+
+/*
+ * Verify that a swap entry is valid and increment its swap map count.
+ *
+ * Returns error code in following case.
+ * - success -> 0
+ * - swp_entry is invalid -> EINVAL
+ * - swp_entry is migration entry -> EINVAL
+ * - swap-cache reference is requested but there is already one. -> EEXIST
+ * - swap-cache reference is requested but the entry is not used. -> ENOENT
+ * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
+ */
+static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+{
+	struct swap_info_struct *p;
+	struct swap_cluster_info *ci;
+	unsigned long offset;
+	int err = -EINVAL;
+
+	p = get_swap_device(entry);
+	if (!p)
+		goto out;
+
+	offset = swp_offset(entry);
+	ci = lock_cluster_or_swap_info(p, offset);
+	err = __swap_duplicate_locked(p, offset, usage);
 	unlock_cluster_or_swap_info(p, ci);
+
+	put_swap_device(p);
 out:
-	if (p)
-		put_swap_device(p);
 	return err;
 }
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 03/21] swap: Support PMD swap mapping in swap_duplicate()
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 01/21] swap: Enable PMD swap operations for CONFIG_THP_SWAP Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 02/21] swap: Add __swap_duplicate_locked() Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 04/21] swap: Support PMD swap mapping in put_swap_page() Huang Ying
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

To support to swapin the THP in one piece, we need to create PMD swap
mapping during swapout, and maintain PMD swap mapping count.  This
patch implements the support to increase the PMD swap mapping
count (for swapout, fork, etc.)  and set SWAP_HAS_CACHE flag (for
swapin, etc.) for a huge swap cluster in swap_duplicate() function
family.  Although it only implements a part of the design of the swap
reference count with PMD swap mapping, the whole design is described
as follow to make it easy to understand the patch and the whole
picture.

A huge swap cluster is used to hold the contents of a swapouted THP.
After swapout, a PMD page mapping to the THP will become a PMD
swap mapping to the huge swap cluster via a swap entry in PMD.  While
a PTE page mapping to a subpage of the THP will become the PTE swap
mapping to a swap slot in the huge swap cluster via a swap entry in
PTE.

If there is no PMD swap mapping and the corresponding THP is removed
from the page cache (reclaimed), the huge swap cluster will be split
and become a normal swap cluster.

The count (cluster_count()) of the huge swap cluster is
SWAPFILE_CLUSTER (= HPAGE_PMD_NR) + PMD swap mapping count.  Because
all swap slots in the huge swap cluster are mapped by PTE or PMD, or
has SWAP_HAS_CACHE bit set, the usage count of the swap cluster is
HPAGE_PMD_NR.  And the PMD swap mapping count is recorded too to make
it easy to determine whether there are remaining PMD swap mappings.

The count in swap_map[offset] is the sum of PTE and PMD swap mapping
count.  This means when we increase the PMD swap mapping count, we
need to increase swap_map[offset] for all swap slots inside the swap
cluster.  An alternative choice is to make swap_map[offset] to record
PTE swap map count only, given we have recorded PMD swap mapping count
in the count of the huge swap cluster.  But this need to increase
swap_map[offset] when splitting the PMD swap mapping, that may fail
because of memory allocation for swap count continuation.  That is
hard to dealt with.  So we choose current solution.

The PMD swap mapping to a huge swap cluster may be split when unmap a
part of PMD mapping etc.  That is easy because only the count of the
huge swap cluster need to be changed.  When the last PMD swap mapping
is gone and SWAP_HAS_CACHE is unset, we will split the huge swap
cluster (clear the huge flag).  This makes it easy to reason the
cluster state.

A huge swap cluster will be split when splitting the THP in swap
cache, or failing to allocate THP during swapin, etc.  But when
splitting the huge swap cluster, we will not try to split all PMD swap
mappings, because we haven't enough information available for that
sometimes.  Later, when the PMD swap mapping is duplicated or swapin,
etc, the PMD swap mapping will be split and fallback to the PTE
operation.

When a THP is added into swap cache, the SWAP_HAS_CACHE flag will be
set in the swap_map[offset] of all swap slots inside the huge swap
cluster backing the THP.  This huge swap cluster will not be split
unless the THP is split even if its PMD swap mapping count dropped to
0.  Later, when the THP is removed from swap cache, the SWAP_HAS_CACHE
flag will be cleared in the swap_map[offset] of all swap slots inside
the huge swap cluster.  And this huge swap cluster will be split if
its PMD swap mapping count is 0.

The first parameter of swap_duplicate() is changed to return the swap
entry to call add_swap_count_continuation() for.  Because we may need
to call it for a swap entry in the middle of a huge swap cluster.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/swap.h |   9 +++--
 mm/memory.c          |   2 +-
 mm/rmap.c            |   2 +-
 mm/swap_state.c      |   2 +-
 mm/swapfile.c        | 109 ++++++++++++++++++++++++++++++++++++++++++---------
 5 files changed, 99 insertions(+), 25 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 34de0d8bf4fa..984a652b9925 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -446,8 +446,8 @@ extern swp_entry_t get_swap_page_of_type(int);
 extern int get_swap_pages(int n, swp_entry_t swp_entries[], int entry_size);
 extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
-extern int swap_duplicate(swp_entry_t);
-extern int swapcache_prepare(swp_entry_t);
+extern int swap_duplicate(swp_entry_t *entry, int entry_size);
+extern int swapcache_prepare(swp_entry_t entry, int entry_size);
 extern void swap_free(swp_entry_t);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
 extern int free_swap_and_cache(swp_entry_t);
@@ -505,7 +505,8 @@ static inline void show_swap_cache_info(void)
 }
 
 #define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
-#define swapcache_prepare(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
+#define swapcache_prepare(e, s)						\
+	({(is_migration_entry(e) || is_device_private_entry(e)); })
 
 static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask)
 {
@@ -516,7 +517,7 @@ static inline void swap_shmem_alloc(swp_entry_t swp)
 {
 }
 
-static inline int swap_duplicate(swp_entry_t swp)
+static inline int swap_duplicate(swp_entry_t *swp, int entry_size)
 {
 	return 0;
 }
diff --git a/mm/memory.c b/mm/memory.c
index fa8894c70575..207e90717305 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -709,7 +709,7 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		swp_entry_t entry = pte_to_swp_entry(pte);
 
 		if (likely(!non_swap_entry(entry))) {
-			if (swap_duplicate(entry) < 0)
+			if (swap_duplicate(&entry, 1) < 0)
 				return entry.val;
 
 			/* make sure dst_mm is on swapoff's mmlist. */
diff --git a/mm/rmap.c b/mm/rmap.c
index 1e79fac3186b..3bb4be720bc0 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1598,7 +1598,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				break;
 			}
 
-			if (swap_duplicate(entry) < 0) {
+			if (swap_duplicate(&entry, 1) < 0) {
 				set_pte_at(mm, address, pvmw.pte, pteval);
 				ret = false;
 				page_vma_mapped_walk_done(&pvmw);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 8f53d2a80253..bca34fc7a5e5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -402,7 +402,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/*
 		 * Swap entry may have been freed since our caller observed it.
 		 */
-		err = swapcache_prepare(entry);
+		err = swapcache_prepare(entry, 1);
 		if (err == -EEXIST) {
 			/*
 			 * We might race against get_swap_page() and stumble
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 6a570ef00fa7..a5a1ab46dab7 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -534,6 +534,40 @@ static void dec_cluster_info_page(struct swap_info_struct *p,
 		free_cluster(p, idx);
 }
 
+/*
+ * When swapout a THP in one piece, PMD page mappings to THP are
+ * replaced by PMD swap mappings to the corresponding swap cluster.
+ * cluster_swapcount() returns the PMD swap mapping count.
+ *
+ * cluster_count() = PMD swap mapping count + count of allocated swap
+ * entries in cluster.  If a cluster is mapped by PMD, all swap
+ * entries inside is used, so here cluster_count() = PMD swap mapping
+ * count + SWAPFILE_CLUSTER.
+ */
+static inline int cluster_swapcount(struct swap_cluster_info *ci)
+{
+	VM_BUG_ON(!cluster_is_huge(ci) || cluster_count(ci) < SWAPFILE_CLUSTER);
+	return cluster_count(ci) - SWAPFILE_CLUSTER;
+}
+
+/*
+ * Set PMD swap mapping count for the huge cluster
+ */
+static inline void cluster_set_swapcount(struct swap_cluster_info *ci,
+					 unsigned int count)
+{
+	VM_BUG_ON(!cluster_is_huge(ci) || cluster_count(ci) < SWAPFILE_CLUSTER);
+	cluster_set_count(ci, SWAPFILE_CLUSTER + count);
+}
+
+static inline void cluster_add_swapcount(struct swap_cluster_info *ci, int add)
+{
+	int count = cluster_swapcount(ci) + add;
+
+	VM_BUG_ON(count < 0);
+	cluster_set_swapcount(ci, count);
+}
+
 /*
  * It's possible scan_swap_map() uses a free cluster in the middle of free
  * cluster list. Avoiding such abuse to avoid list corruption.
@@ -3487,35 +3521,66 @@ static int __swap_duplicate_locked(struct swap_info_struct *p,
 }
 
 /*
- * Verify that a swap entry is valid and increment its swap map count.
+ * Verify that the swap entries from *entry is valid and increment their
+ * PMD/PTE swap mapping count.
  *
  * Returns error code in following case.
  * - success -> 0
  * - swp_entry is invalid -> EINVAL
- * - swp_entry is migration entry -> EINVAL
  * - swap-cache reference is requested but there is already one. -> EEXIST
  * - swap-cache reference is requested but the entry is not used. -> ENOENT
  * - swap-mapped reference requested but needs continued swap count. -> ENOMEM
+ * - the huge swap cluster has been split. -> ENOTDIR
  */
-static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
+static int __swap_duplicate(swp_entry_t *entry, int entry_size,
+			    unsigned char usage)
 {
 	struct swap_info_struct *p;
 	struct swap_cluster_info *ci;
 	unsigned long offset;
 	int err = -EINVAL;
+	int i, size = swap_entry_size(entry_size);
 
-	p = get_swap_device(entry);
+	p = get_swap_device(*entry);
 	if (!p)
 		goto out;
 
-	offset = swp_offset(entry);
+	offset = swp_offset(*entry);
 	ci = lock_cluster_or_swap_info(p, offset);
-	err = __swap_duplicate_locked(p, offset, usage);
+	if (size == SWAPFILE_CLUSTER) {
+		/*
+		 * The huge swap cluster has been split, for example, failed to
+		 * allocate huge page during swapin, the caller should split
+		 * the PMD swap mapping and operate on normal swap entries.
+		 */
+		if (!cluster_is_huge(ci)) {
+			err = -ENOTDIR;
+			goto unlock;
+		}
+		VM_BUG_ON(!IS_ALIGNED(offset, size));
+		/* If cluster is huge, all swap entries inside is in-use */
+		VM_BUG_ON(cluster_count(ci) < SWAPFILE_CLUSTER);
+	}
+	/* p->swap_map[] = PMD swap map count + PTE swap map count */
+	for (i = 0; i < size; i++) {
+		err = __swap_duplicate_locked(p, offset + i, usage);
+		if (err && size != 1) {
+			*entry = swp_entry(p->type, offset + i);
+			goto undup;
+		}
+	}
+	if (size == SWAPFILE_CLUSTER && usage == 1)
+		cluster_add_swapcount(ci, usage);
+unlock:
 	unlock_cluster_or_swap_info(p, ci);
 
 	put_swap_device(p);
 out:
 	return err;
+undup:
+	for (i--; i >= 0; i--)
+		__swap_entry_free_locked(p, offset + i, usage);
+	goto unlock;
 }
 
 /*
@@ -3524,36 +3589,44 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage)
  */
 void swap_shmem_alloc(swp_entry_t entry)
 {
-	__swap_duplicate(entry, SWAP_MAP_SHMEM);
+	__swap_duplicate(&entry, 1, SWAP_MAP_SHMEM);
 }
 
 /*
  * Increase reference count of swap entry by 1.
- * Returns 0 for success, or -ENOMEM if a swap_count_continuation is required
- * but could not be atomically allocated.  Returns 0, just as if it succeeded,
- * if __swap_duplicate() fails for another reason (-EINVAL or -ENOENT), which
- * might occur if a page table entry has got corrupted.
+ *
+ * Return error code in following case.
+ * - success -> 0
+ * - swap_count_continuation is required but could not be atomically allocated.
+ *   *entry is used to return swap entry to call add_swap_count_continuation().
+ *								      -> ENOMEM
+ * - otherwise same as __swap_duplicate()
  */
-int swap_duplicate(swp_entry_t entry)
+int swap_duplicate(swp_entry_t *entry, int entry_size)
 {
 	int err = 0;
 
-	while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
-		err = add_swap_count_continuation(entry, GFP_ATOMIC);
+	while (!err &&
+	       (err = __swap_duplicate(entry, entry_size, 1)) == -ENOMEM)
+		err = add_swap_count_continuation(*entry, GFP_ATOMIC);
+	/* If kernel works correctly, other errno is impossible */
+	VM_BUG_ON(err && err != -ENOMEM && err != -ENOTDIR);
 	return err;
 }
 
 /*
  * @entry: swap entry for which we allocate swap cache.
+ * @entry_size: size of the swap entry, 1 or SWAPFILE_CLUSTER
  *
  * Called when allocating swap cache for existing swap entry,
  * This can return error codes. Returns 0 at success.
- * -EBUSY means there is a swap cache.
- * Note: return code is different from swap_duplicate().
+ * -EINVAL means the swap device has been swapoff.
+ * -EEXIST means there is a swap cache.
+ * Otherwise same as __swap_duplicate()
  */
-int swapcache_prepare(swp_entry_t entry)
+int swapcache_prepare(swp_entry_t entry, int entry_size)
 {
-	return __swap_duplicate(entry, SWAP_HAS_CACHE);
+	return __swap_duplicate(&entry, entry_size, SWAP_HAS_CACHE);
 }
 
 struct swap_info_struct *swp_swap_info(swp_entry_t entry)
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 04/21] swap: Support PMD swap mapping in put_swap_page()
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (2 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 03/21] swap: Support PMD swap mapping in swap_duplicate() Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 05/21] swap: Support PMD swap mapping in free_swap_and_cache()/swap_free() Huang Ying
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

Previously, during swapout, all PMD page mapping will be split and
replaced with PTE swap mapping.  And when clearing the SWAP_HAS_CACHE
flag for the huge swap cluster in put_swap_page(), the huge swap
cluster will be split.  Now, during swapout, the PMD page mappings to
the THP will be changed to PMD swap mappings to the corresponding swap
cluster.  So when clearing the SWAP_HAS_CACHE flag, the huge swap
cluster will only be split if the PMD swap mapping count is 0.
Otherwise, we will keep it as the huge swap cluster.  So that we can
swapin a THP in one piece later.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/swapfile.c | 31 ++++++++++++++++++++++++-------
 1 file changed, 24 insertions(+), 7 deletions(-)

diff --git a/mm/swapfile.c b/mm/swapfile.c
index a5a1ab46dab7..45c12abcb467 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1314,6 +1314,15 @@ void swap_free(swp_entry_t entry)
 
 /*
  * Called after dropping swapcache to decrease refcnt to swap entries.
+ *
+ * When a THP is added into swap cache, the SWAP_HAS_CACHE flag will
+ * be set in the swap_map[] of all swap entries in the huge swap
+ * cluster backing the THP.  This huge swap cluster will not be split
+ * unless the THP is split even if its PMD swap mapping count dropped
+ * to 0.  Later, when the THP is removed from swap cache, the
+ * SWAP_HAS_CACHE flag will be cleared in the swap_map[] of all swap
+ * entries in the huge swap cluster.  And this huge swap cluster will
+ * be split if its PMD swap mapping count is 0.
  */
 void put_swap_page(struct page *page, swp_entry_t entry)
 {
@@ -1332,15 +1341,23 @@ void put_swap_page(struct page *page, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	if (size == SWAPFILE_CLUSTER) {
-		VM_BUG_ON(!cluster_is_huge(ci));
+		VM_BUG_ON(!IS_ALIGNED(offset, size));
 		map = si->swap_map + offset;
-		for (i = 0; i < SWAPFILE_CLUSTER; i++) {
-			val = map[i];
-			VM_BUG_ON(!(val & SWAP_HAS_CACHE));
-			if (val == SWAP_HAS_CACHE)
-				free_entries++;
+		/*
+		 * No PMD swap mapping, the swap cluster will be freed
+		 * if all swap entries becoming free, otherwise the
+		 * huge swap cluster will be split.
+		 */
+		if (!cluster_swapcount(ci)) {
+			for (i = 0; i < SWAPFILE_CLUSTER; i++) {
+				val = map[i];
+				VM_BUG_ON(!(val & SWAP_HAS_CACHE));
+				if (val == SWAP_HAS_CACHE)
+					free_entries++;
+			}
+			if (free_entries != SWAPFILE_CLUSTER)
+				cluster_clear_huge(ci);
 		}
-		cluster_clear_huge(ci);
 		if (free_entries == SWAPFILE_CLUSTER) {
 			unlock_cluster_or_swap_info(si, ci);
 			spin_lock(&si->lock);
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 05/21] swap: Support PMD swap mapping in free_swap_and_cache()/swap_free()
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (3 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 04/21] swap: Support PMD swap mapping in put_swap_page() Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 06/21] swap: Support PMD swap mapping when splitting huge PMD Huang Ying
                   ` (17 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

When a PMD swap mapping is removed from a huge swap cluster, for
example, unmap a memory range mapped with PMD swap mapping, etc,
free_swap_and_cache() will be called to decrease the reference count
to the huge swap cluster.  free_swap_and_cache() may also free or
split the huge swap cluster, and free the corresponding THP in swap
cache if necessary.  swap_free() is similar, and shares most
implementation with free_swap_and_cache().  This patch revises
free_swap_and_cache() and swap_free() to implement this.

If the swap cluster has been split already, for example, because of
failing to allocate a THP during swapin, we just decrease one from the
reference count of all swap slots.

Otherwise, we will decrease one from the reference count of all swap
slots and the PMD swap mapping count in cluster_count().  When the
corresponding THP isn't in swap cache, if PMD swap mapping count
becomes 0, the huge swap cluster will be split, and if all swap count
becomes 0, the huge swap cluster will be freed.  When the corresponding
THP is in swap cache, if every swap_map[offset] == SWAP_HAS_CACHE, we
will try to delete the THP from swap cache.  Which will cause the THP
and the huge swap cluster be freed.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 arch/s390/mm/pgtable.c |   2 +-
 include/linux/swap.h   |   9 +--
 kernel/power/swap.c    |   4 +-
 mm/madvise.c           |   2 +-
 mm/memory.c            |   4 +-
 mm/shmem.c             |   6 +-
 mm/swapfile.c          | 171 ++++++++++++++++++++++++++++++++++++++-----------
 7 files changed, 149 insertions(+), 49 deletions(-)

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index f2cc7da473e4..ffd4b68adbb3 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -675,7 +675,7 @@ static void ptep_zap_swap_entry(struct mm_struct *mm, swp_entry_t entry)
 
 		dec_mm_counter(mm, mm_counter(page));
 	}
-	free_swap_and_cache(entry);
+	free_swap_and_cache(entry, 1);
 }
 
 void ptep_zap_unused(struct mm_struct *mm, unsigned long addr,
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 984a652b9925..e79d7aead142 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -448,9 +448,9 @@ extern int add_swap_count_continuation(swp_entry_t, gfp_t);
 extern void swap_shmem_alloc(swp_entry_t);
 extern int swap_duplicate(swp_entry_t *entry, int entry_size);
 extern int swapcache_prepare(swp_entry_t entry, int entry_size);
-extern void swap_free(swp_entry_t);
+extern void swap_free(swp_entry_t entry, int entry_size);
 extern void swapcache_free_entries(swp_entry_t *entries, int n);
-extern int free_swap_and_cache(swp_entry_t);
+extern int free_swap_and_cache(swp_entry_t entry, int entry_size);
 extern int swap_type_of(dev_t, sector_t, struct block_device **);
 extern unsigned int count_swap_pages(int, int);
 extern sector_t map_swap_page(struct page *, struct block_device **);
@@ -504,7 +504,8 @@ static inline void show_swap_cache_info(void)
 {
 }
 
-#define free_swap_and_cache(e) ({(is_migration_entry(e) || is_device_private_entry(e));})
+#define free_swap_and_cache(e, s)					\
+	({(is_migration_entry(e) || is_device_private_entry(e)); })
 #define swapcache_prepare(e, s)						\
 	({(is_migration_entry(e) || is_device_private_entry(e)); })
 
@@ -522,7 +523,7 @@ static inline int swap_duplicate(swp_entry_t *swp, int entry_size)
 	return 0;
 }
 
-static inline void swap_free(swp_entry_t swp)
+static inline void swap_free(swp_entry_t swp, int entry_size)
 {
 }
 
diff --git a/kernel/power/swap.c b/kernel/power/swap.c
index d7f6c1a288d3..0275df84ed3d 100644
--- a/kernel/power/swap.c
+++ b/kernel/power/swap.c
@@ -182,7 +182,7 @@ sector_t alloc_swapdev_block(int swap)
 	offset = swp_offset(get_swap_page_of_type(swap));
 	if (offset) {
 		if (swsusp_extents_insert(offset))
-			swap_free(swp_entry(swap, offset));
+			swap_free(swp_entry(swap, offset), 1);
 		else
 			return swapdev_block(swap, offset);
 	}
@@ -206,7 +206,7 @@ void free_all_swap_pages(int swap)
 		ext = rb_entry(node, struct swsusp_extent, node);
 		rb_erase(node, &swsusp_extents);
 		for (offset = ext->start; offset <= ext->end; offset++)
-			swap_free(swp_entry(swap, offset));
+			swap_free(swp_entry(swap, offset), 1);
 
 		kfree(ext);
 	}
diff --git a/mm/madvise.c b/mm/madvise.c
index 9d802566c494..50282ba862e2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -349,7 +349,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			if (non_swap_entry(entry))
 				continue;
 			nr_swap--;
-			free_swap_and_cache(entry);
+			free_swap_and_cache(entry, 1);
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			continue;
 		}
diff --git a/mm/memory.c b/mm/memory.c
index 207e90717305..17895a347056 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1134,7 +1134,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 			page = migration_entry_to_page(entry);
 			rss[mm_counter(page)]--;
 		}
-		if (unlikely(!free_swap_and_cache(entry)))
+		if (unlikely(!free_swap_and_cache(entry, 1)))
 			print_bad_pte(vma, addr, ptent, NULL);
 		pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 	} while (pte++, addr += PAGE_SIZE, addr != end);
@@ -2823,7 +2823,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
 
-	swap_free(entry);
+	swap_free(entry, 1);
 	if (mem_cgroup_swap_full(page) ||
 	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
 		try_to_free_swap(page);
diff --git a/mm/shmem.c b/mm/shmem.c
index a6964ba74d50..3cc1d58a534f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -662,7 +662,7 @@ static int shmem_free_swap(struct address_space *mapping,
 	xa_unlock_irq(&mapping->i_pages);
 	if (old != radswap)
 		return -ENOENT;
-	free_swap_and_cache(radix_to_swp_entry(radswap));
+	free_swap_and_cache(radix_to_swp_entry(radswap), 1);
 	return 0;
 }
 
@@ -1180,7 +1180,7 @@ static int shmem_unuse_inode(struct shmem_inode_info *info,
 			spin_lock_irq(&info->lock);
 			info->swapped--;
 			spin_unlock_irq(&info->lock);
-			swap_free(swap);
+			swap_free(swap, 1);
 		}
 	}
 	return error;
@@ -1712,7 +1712,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 
 		delete_from_swap_cache(page);
 		set_page_dirty(page);
-		swap_free(swap);
+		swap_free(swap, 1);
 
 	} else {
 		if (vma && userfaultfd_missing(vma)) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 45c12abcb467..8d8803103543 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -49,6 +49,9 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t,
 				 unsigned char);
 static void free_swap_count_continuations(struct swap_info_struct *);
 static sector_t map_swap_entry(swp_entry_t, struct block_device**);
+static bool __swap_page_trans_huge_swapped(struct swap_info_struct *si,
+					   struct swap_cluster_info *ci,
+					   unsigned long offset);
 
 DEFINE_SPINLOCK(swap_lock);
 static unsigned int nr_swapfiles;
@@ -1267,19 +1270,106 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	return NULL;
 }
 
-static unsigned char __swap_entry_free(struct swap_info_struct *p,
-				       swp_entry_t entry, unsigned char usage)
+#define SF_FREE_CACHE		0x1
+
+static void __swap_free(struct swap_info_struct *p, swp_entry_t entry,
+			      int entry_size, unsigned long flags)
 {
 	struct swap_cluster_info *ci;
 	unsigned long offset = swp_offset(entry);
+	int i, free_entries = 0, cache_only = 0;
+	int size = swap_entry_size(entry_size);
+	unsigned char *map, count;
 
 	ci = lock_cluster_or_swap_info(p, offset);
-	usage = __swap_entry_free_locked(p, offset, usage);
+	VM_BUG_ON(!IS_ALIGNED(offset, size));
+	/*
+	 * Normal swap entry or huge swap cluster has been split, free
+	 * each swap entry
+	 */
+	if (size == 1 || !cluster_is_huge(ci)) {
+		for (i = 0; i < size; i++, entry.val++) {
+			count = __swap_entry_free_locked(p, offset + i, 1);
+			if (!count ||
+			    (flags & SF_FREE_CACHE &&
+			     count == SWAP_HAS_CACHE &&
+			     !__swap_page_trans_huge_swapped(p, ci,
+							     offset + i))) {
+				unlock_cluster_or_swap_info(p, ci);
+				if (!count)
+					free_swap_slot(entry);
+				else
+					__try_to_reclaim_swap(p, offset + i,
+						TTRS_UNMAPPED | TTRS_FULL);
+				if (i == size - 1)
+					return;
+				lock_cluster_or_swap_info(p, offset);
+			}
+		}
+		unlock_cluster_or_swap_info(p, ci);
+		return;
+	}
+	/*
+	 * Return for normal swap entry above, the following code is
+	 * for huge swap cluster only.
+	 */
+	cluster_add_swapcount(ci, -1);
+	/*
+	 * Decrease mapping count for each swap entry in cluster.
+	 * Because PMD swap mapping is counted in p->swap_map[] too.
+	 */
+	map = p->swap_map + offset;
+	for (i = 0; i < size; i++) {
+		/*
+		 * Mark swap entries to become free as SWAP_MAP_BAD
+		 * temporarily.
+		 */
+		if (map[i] == 1) {
+			map[i] = SWAP_MAP_BAD;
+			free_entries++;
+		} else if (__swap_entry_free_locked(p, offset + i, 1) ==
+			   SWAP_HAS_CACHE)
+			cache_only++;
+	}
+	/*
+	 * If there are PMD swap mapping or the THP is in swap cache,
+	 * it's impossible for some swap entries to become free.
+	 */
+	VM_BUG_ON(free_entries &&
+		  (cluster_swapcount(ci) || (map[0] & SWAP_HAS_CACHE)));
+	if (free_entries == SWAPFILE_CLUSTER)
+		memset(map, SWAP_HAS_CACHE, SWAPFILE_CLUSTER);
+	/*
+	 * If there are no PMD swap mappings remain and the THP isn't
+	 * in swap cache, split the huge swap cluster.
+	 */
+	else if (!cluster_swapcount(ci) && !(map[0] & SWAP_HAS_CACHE))
+		cluster_clear_huge(ci);
 	unlock_cluster_or_swap_info(p, ci);
-	if (!usage)
-		free_swap_slot(entry);
-
-	return usage;
+	if (free_entries == SWAPFILE_CLUSTER) {
+		spin_lock(&p->lock);
+		mem_cgroup_uncharge_swap(entry, SWAPFILE_CLUSTER);
+		swap_free_cluster(p, offset / SWAPFILE_CLUSTER);
+		spin_unlock(&p->lock);
+	} else if (free_entries) {
+		ci = lock_cluster(p, offset);
+		for (i = 0; i < size; i++, entry.val++) {
+			/*
+			 * To be freed swap entries are marked as SWAP_MAP_BAD
+			 * temporarily as above
+			 */
+			if (map[i] == SWAP_MAP_BAD) {
+				map[i] = SWAP_HAS_CACHE;
+				unlock_cluster(ci);
+				free_swap_slot(entry);
+				if (i == size - 1)
+					return;
+				ci = lock_cluster(p, offset);
+			}
+		}
+		unlock_cluster(ci);
+	} else if (cache_only == SWAPFILE_CLUSTER && flags & SF_FREE_CACHE)
+		__try_to_reclaim_swap(p, offset, TTRS_UNMAPPED | TTRS_FULL);
 }
 
 static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
@@ -1303,13 +1393,13 @@ static void swap_entry_free(struct swap_info_struct *p, swp_entry_t entry)
  * Caller has made sure that the swap device corresponding to entry
  * is still around or has not been recycled.
  */
-void swap_free(swp_entry_t entry)
+void swap_free(swp_entry_t entry, int entry_size)
 {
 	struct swap_info_struct *p;
 
 	p = _swap_info_get(entry);
 	if (p)
-		__swap_entry_free(p, entry, 1);
+		__swap_free(p, entry, entry_size, 0);
 }
 
 /*
@@ -1545,29 +1635,33 @@ int swp_swapcount(swp_entry_t entry)
 	return count;
 }
 
-static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
-					 swp_entry_t entry)
+/* si->lock or ci->lock must be held before calling this function */
+static bool __swap_page_trans_huge_swapped(struct swap_info_struct *si,
+					   struct swap_cluster_info *ci,
+					   unsigned long offset)
 {
-	struct swap_cluster_info *ci;
 	unsigned char *map = si->swap_map;
-	unsigned long roffset = swp_offset(entry);
-	unsigned long offset = round_down(roffset, SWAPFILE_CLUSTER);
+	unsigned long hoffset = round_down(offset, SWAPFILE_CLUSTER);
 	int i;
-	bool ret = false;
 
-	ci = lock_cluster_or_swap_info(si, offset);
-	if (!ci || !cluster_is_huge(ci)) {
-		if (swap_count(map[roffset]))
-			ret = true;
-		goto unlock_out;
-	}
+	if (!ci || !cluster_is_huge(ci))
+		return !!swap_count(map[offset]);
 	for (i = 0; i < SWAPFILE_CLUSTER; i++) {
-		if (swap_count(map[offset + i])) {
-			ret = true;
-			break;
-		}
+		if (swap_count(map[hoffset + i]))
+			return true;
 	}
-unlock_out:
+	return false;
+}
+
+static bool swap_page_trans_huge_swapped(struct swap_info_struct *si,
+					 swp_entry_t entry)
+{
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+	bool ret;
+
+	ci = lock_cluster_or_swap_info(si, offset);
+	ret = __swap_page_trans_huge_swapped(si, ci, offset);
 	unlock_cluster_or_swap_info(si, ci);
 	return ret;
 }
@@ -1739,22 +1833,17 @@ int try_to_free_swap(struct page *page)
  * Free the swap entry like above, but also try to
  * free the page cache entry if it is the last user.
  */
-int free_swap_and_cache(swp_entry_t entry)
+int free_swap_and_cache(swp_entry_t entry, int entry_size)
 {
 	struct swap_info_struct *p;
-	unsigned char count;
 
 	if (non_swap_entry(entry))
 		return 1;
 
 	p = _swap_info_get(entry);
-	if (p) {
-		count = __swap_entry_free(p, entry, 1);
-		if (count == SWAP_HAS_CACHE &&
-		    !swap_page_trans_huge_swapped(p, entry))
-			__try_to_reclaim_swap(p, swp_offset(entry),
-					      TTRS_UNMAPPED | TTRS_FULL);
-	}
+	if (p)
+		__swap_free(p, entry, entry_size, SF_FREE_CACHE);
+
 	return p != NULL;
 }
 
@@ -1901,7 +1990,7 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 	set_pte_at(vma->vm_mm, addr, pte,
 		   pte_mkold(mk_pte(page, vma->vm_page_prot)));
-	swap_free(entry);
+	swap_free(entry, 1);
 	/*
 	 * Move the page to the active list so it is not
 	 * immediately swapped out again after swapon.
@@ -2340,6 +2429,16 @@ int try_to_unuse(unsigned int type, bool frontswap,
 	}
 
 	mmput(start_mm);
+
+	/*
+	 * Swap entries may be marked as SWAP_MAP_BAD temporarily in
+	 * __swap_free() before being freed really.
+	 * find_next_to_unuse() will skip these swap entries, that is
+	 * OK.  But we need to wait until they are freed really.
+	 */
+	while (!retval && READ_ONCE(si->inuse_pages))
+		schedule_timeout_uninterruptible(1);
+
 	return retval;
 }
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 06/21] swap: Support PMD swap mapping when splitting huge PMD
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (4 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 05/21] swap: Support PMD swap mapping in free_swap_and_cache()/swap_free() Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-24 17:25   ` Daniel Jordan
  2018-10-10  7:19 ` [PATCH -V6 07/21] swap: Support PMD swap mapping in split_swap_cluster() Huang Ying
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

A huge PMD need to be split when zap a part of the PMD mapping etc.
If the PMD mapping is a swap mapping, we need to split it too.  This
patch implemented the support for this.  This is similar as splitting
the PMD page mapping, except we need to decrease the PMD swap mapping
count for the huge swap cluster too.  If the PMD swap mapping count
becomes 0, the huge swap cluster will be split.

Notice: is_huge_zero_pmd() and pmd_page() doesn't work well with swap
PMD, so pmd_present() check is called before them.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/huge_mm.h |  4 ++++
 include/linux/swap.h    |  6 ++++++
 mm/huge_memory.c        | 48 +++++++++++++++++++++++++++++++++++++++++++-----
 mm/swapfile.c           | 32 ++++++++++++++++++++++++++++++++
 4 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 99c19b06d9a4..0f3e1739986f 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -226,6 +226,10 @@ static inline bool is_huge_zero_page(struct page *page)
 	return READ_ONCE(huge_zero_page) == page;
 }
 
+/*
+ * is_huge_zero_pmd() must be called after checking pmd_present(),
+ * otherwise, it may report false positive for PMD swap entry.
+ */
 static inline bool is_huge_zero_pmd(pmd_t pmd)
 {
 	return is_huge_zero_page(pmd_page(pmd));
diff --git a/include/linux/swap.h b/include/linux/swap.h
index e79d7aead142..9bb3f73b5d68 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -614,11 +614,17 @@ static inline swp_entry_t get_swap_page(struct page *page)
 
 #ifdef CONFIG_THP_SWAP
 extern int split_swap_cluster(swp_entry_t entry);
+extern int split_swap_cluster_map(swp_entry_t entry);
 #else
 static inline int split_swap_cluster(swp_entry_t entry)
 {
 	return 0;
 }
+
+static inline int split_swap_cluster_map(swp_entry_t entry)
+{
+	return 0;
+}
 #endif
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bae21d3e88cf..9f1c74487576 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1624,6 +1624,40 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	return 0;
 }
 
+/* Convert a PMD swap mapping to a set of PTE swap mappings */
+static void __split_huge_swap_pmd(struct vm_area_struct *vma,
+				  unsigned long haddr,
+				  pmd_t *pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgtable_t pgtable;
+	pmd_t _pmd;
+	swp_entry_t entry;
+	int i, soft_dirty;
+
+	entry = pmd_to_swp_entry(*pmd);
+	soft_dirty = pmd_soft_dirty(*pmd);
+
+	split_swap_cluster_map(entry);
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pmd_populate(mm, &_pmd, pgtable);
+
+	for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE, entry.val++) {
+		pte_t *pte, ptent;
+
+		pte = pte_offset_map(&_pmd, haddr);
+		VM_BUG_ON(!pte_none(*pte));
+		ptent = swp_entry_to_pte(entry);
+		if (soft_dirty)
+			ptent = pte_swp_mksoft_dirty(ptent);
+		set_pte_at(mm, haddr, pte, ptent);
+		pte_unmap(pte);
+	}
+	smp_wmb(); /* make pte visible before pmd */
+	pmd_populate(mm, pmd, pgtable);
+}
+
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
@@ -2090,7 +2124,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+	VM_BUG_ON(!is_swap_pmd(*pmd) && !pmd_trans_huge(*pmd)
 				&& !pmd_devmap(*pmd));
 
 	count_vm_event(THP_SPLIT_PMD);
@@ -2114,7 +2148,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		put_page(page);
 		add_mm_counter(mm, mm_counter_file(page), -HPAGE_PMD_NR);
 		return;
-	} else if (is_huge_zero_pmd(*pmd)) {
+	} else if (pmd_present(*pmd) && is_huge_zero_pmd(*pmd)) {
 		/*
 		 * FIXME: Do we want to invalidate secondary mmu by calling
 		 * mmu_notifier_invalidate_range() see comments below inside
@@ -2158,6 +2192,9 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		page = pfn_to_page(swp_offset(entry));
 	} else
 #endif
+	if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(old_pmd))
+		return __split_huge_swap_pmd(vma, haddr, pmd);
+	else
 		page = pmd_page(old_pmd);
 	VM_BUG_ON_PAGE(!page_count(page), page);
 	page_ref_add(page, HPAGE_PMD_NR - 1);
@@ -2249,14 +2286,15 @@ void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	 * pmd against. Otherwise we can end up replacing wrong page.
 	 */
 	VM_BUG_ON(freeze && !page);
-	if (page && page != pmd_page(*pmd))
-	        goto out;
+	/* pmd_page() should be called only if pmd_present() */
+	if (page && (!pmd_present(*pmd) || page != pmd_page(*pmd)))
+		goto out;
 
 	if (pmd_trans_huge(*pmd)) {
 		page = pmd_page(*pmd);
 		if (PageMlocked(page))
 			clear_page_mlock(page);
-	} else if (!(pmd_devmap(*pmd) || is_pmd_migration_entry(*pmd)))
+	} else if (!(pmd_devmap(*pmd) || is_swap_pmd(*pmd)))
 		goto out;
 	__split_huge_pmd_locked(vma, pmd, haddr, freeze);
 out:
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8d8803103543..fa6b81b4e185 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -4036,6 +4036,38 @@ void mem_cgroup_throttle_swaprate(struct mem_cgroup *memcg, int node,
 }
 #endif
 
+#ifdef CONFIG_THP_SWAP
+/*
+ * The corresponding page table shouldn't be changed under us, that
+ * is, the page table lock should be held.
+ */
+int split_swap_cluster_map(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+
+	VM_BUG_ON(!IS_ALIGNED(offset, SWAPFILE_CLUSTER));
+	si = _swap_info_get(entry);
+	if (!si)
+		return -EBUSY;
+	ci = lock_cluster(si, offset);
+	/* The swap cluster has been split by someone else, we are done */
+	if (!cluster_is_huge(ci))
+		goto out;
+	cluster_add_swapcount(ci, -1);
+	/*
+	 * If the last PMD swap mapping has gone and the THP isn't in
+	 * swap cache, the huge swap cluster will be split.
+	 */
+	if (!cluster_swapcount(ci) && !(si->swap_map[offset] & SWAP_HAS_CACHE))
+		cluster_clear_huge(ci);
+out:
+	unlock_cluster(ci);
+	return 0;
+}
+#endif
+
 static int __init swapfile_init(void)
 {
 	int nid;
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 07/21] swap: Support PMD swap mapping in split_swap_cluster()
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (5 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 06/21] swap: Support PMD swap mapping when splitting huge PMD Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 08/21] swap: Support to read a huge swap cluster for swapin a THP Huang Ying
                   ` (15 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

When splitting a THP in swap cache or failing to allocate a THP when
swapin a huge swap cluster, the huge swap cluster will be split.  In
addition to clear the huge flag of the swap cluster, the PMD swap
mapping count recorded in cluster_count() will be set to 0.  But we
will not touch PMD swap mappings themselves, because it is hard to
find them all sometimes.  When the PMD swap mappings are operated
later, it will be found that the huge swap cluster has been split and
the PMD swap mappings will be split at that time.

Unless splitting a THP in swap cache (specified via "force"
parameter), split_swap_cluster() will return -EEXIST if there is
SWAP_HAS_CACHE flag in swap_map[offset].  Because this indicates there
is a THP corresponds to this huge swap cluster, and it isn't desired
to split the THP.

When splitting a THP in swap cache, the position to call
split_swap_cluster() is changed to before unlocking sub-pages.  So
that all sub-pages will be kept locked from the THP has been split to
the huge swap cluster is split.  This makes the code much easier to be
reasoned.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/swap.h |  6 ++++--
 mm/huge_memory.c     | 18 ++++++++++------
 mm/swapfile.c        | 58 +++++++++++++++++++++++++++++++++++++---------------
 3 files changed, 57 insertions(+), 25 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 9bb3f73b5d68..60fd5189fde9 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -612,11 +612,13 @@ static inline swp_entry_t get_swap_page(struct page *page)
 
 #endif /* CONFIG_SWAP */
 
+#define SSC_SPLIT_CACHED	0x1
+
 #ifdef CONFIG_THP_SWAP
-extern int split_swap_cluster(swp_entry_t entry);
+extern int split_swap_cluster(swp_entry_t entry, unsigned long flags);
 extern int split_swap_cluster_map(swp_entry_t entry);
 #else
-static inline int split_swap_cluster(swp_entry_t entry)
+static inline int split_swap_cluster(swp_entry_t entry, unsigned long flags)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9f1c74487576..92e0cdb99c5a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2517,6 +2517,17 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 
 	unfreeze_page(head);
 
+	/*
+	 * Split swap cluster before unlocking sub-pages.  So all
+	 * sub-pages will be kept locked from THP has been split to
+	 * swap cluster is split.
+	 */
+	if (PageSwapCache(head)) {
+		swp_entry_t entry = { .val = page_private(head) };
+
+		split_swap_cluster(entry, SSC_SPLIT_CACHED);
+	}
+
 	for (i = 0; i < HPAGE_PMD_NR; i++) {
 		struct page *subpage = head + i;
 		if (subpage == page)
@@ -2740,12 +2751,7 @@ int split_huge_page_to_list(struct page *page, struct list_head *list)
 			__dec_node_page_state(page, NR_SHMEM_THPS);
 		spin_unlock(&pgdata->split_queue_lock);
 		__split_huge_page(page, list, flags);
-		if (PageSwapCache(head)) {
-			swp_entry_t entry = { .val = page_private(head) };
-
-			ret = split_swap_cluster(entry);
-		} else
-			ret = 0;
+		ret = 0;
 	} else {
 		if (IS_ENABLED(CONFIG_DEBUG_VM) && mapcount) {
 			pr_alert("total_mapcount: %u, page_count(): %u\n",
diff --git a/mm/swapfile.c b/mm/swapfile.c
index fa6b81b4e185..2020bd494419 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1469,23 +1469,6 @@ void put_swap_page(struct page *page, swp_entry_t entry)
 	unlock_cluster_or_swap_info(si, ci);
 }
 
-#ifdef CONFIG_THP_SWAP
-int split_swap_cluster(swp_entry_t entry)
-{
-	struct swap_info_struct *si;
-	struct swap_cluster_info *ci;
-	unsigned long offset = swp_offset(entry);
-
-	si = _swap_info_get(entry);
-	if (!si)
-		return -EBUSY;
-	ci = lock_cluster(si, offset);
-	cluster_clear_huge(ci);
-	unlock_cluster(ci);
-	return 0;
-}
-#endif
-
 static int swp_entry_cmp(const void *ent1, const void *ent2)
 {
 	const swp_entry_t *e1 = ent1, *e2 = ent2;
@@ -4066,6 +4049,47 @@ int split_swap_cluster_map(swp_entry_t entry)
 	unlock_cluster(ci);
 	return 0;
 }
+
+/*
+ * We will not try to split all PMD swap mappings to the swap cluster,
+ * because we haven't enough information available for that.  Later,
+ * when the PMD swap mapping is duplicated or swapin, etc, the PMD
+ * swap mapping will be split and fallback to the PTE operations.
+ */
+int split_swap_cluster(swp_entry_t entry, unsigned long flags)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+	unsigned long offset = swp_offset(entry);
+	int ret = 0;
+
+	si = get_swap_device(entry);
+	if (!si)
+		return -EINVAL;
+	ci = lock_cluster(si, offset);
+	/* The swap cluster has been split by someone else, we are done */
+	if (!cluster_is_huge(ci))
+		goto out;
+	VM_BUG_ON(!IS_ALIGNED(offset, SWAPFILE_CLUSTER));
+	VM_BUG_ON(cluster_count(ci) < SWAPFILE_CLUSTER);
+	/*
+	 * If not requested, don't split swap cluster that has SWAP_HAS_CACHE
+	 * flag.  When the flag is cleared later, the huge swap cluster will
+	 * be split if there is no PMD swap mapping.
+	 */
+	if (!(flags & SSC_SPLIT_CACHED) &&
+	    si->swap_map[offset] & SWAP_HAS_CACHE) {
+		ret = -EEXIST;
+		goto out;
+	}
+	cluster_set_swapcount(ci, 0);
+	cluster_clear_huge(ci);
+
+out:
+	unlock_cluster(ci);
+	put_swap_device(si);
+	return ret;
+}
 #endif
 
 static int __init swapfile_init(void)
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 08/21] swap: Support to read a huge swap cluster for swapin a THP
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (6 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 07/21] swap: Support PMD swap mapping in split_swap_cluster() Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 09/21] swap: Swapin a THP in one piece Huang Ying
                   ` (14 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

To swapin a THP in one piece, we need to read a huge swap cluster from
the swap device.  This patch revised the __read_swap_cache_async() and
its callers and callees to support this.  If __read_swap_cache_async()
find the swap cluster of the specified swap entry is huge, it will try
to allocate a THP, add it into the swap cache.  So later the contents
of the huge swap cluster can be read into the THP.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/huge_mm.h |  8 +++++++
 include/linux/swap.h    |  4 ++--
 mm/huge_memory.c        |  3 ++-
 mm/swap_state.c         | 59 ++++++++++++++++++++++++++++++++++++++++---------
 mm/swapfile.c           |  9 +++++---
 5 files changed, 66 insertions(+), 17 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 0f3e1739986f..a0e7f4f9c12b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -250,6 +250,8 @@ static inline bool thp_migration_supported(void)
 	return IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION);
 }
 
+gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
+				    unsigned long addr);
 #else /* CONFIG_TRANSPARENT_HUGEPAGE */
 #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
 #define HPAGE_PMD_MASK ({ BUILD_BUG(); 0; })
@@ -363,6 +365,12 @@ static inline bool thp_migration_supported(void)
 {
 	return false;
 }
+
+static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
+						  unsigned long addr)
+{
+	return 0;
+}
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 60fd5189fde9..f2daf3fbdd4b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -457,7 +457,7 @@ extern sector_t map_swap_page(struct page *, struct block_device **);
 extern sector_t swapdev_block(int, pgoff_t);
 extern int page_swapcount(struct page *);
 extern int __swap_count(swp_entry_t entry);
-extern int __swp_swapcount(swp_entry_t entry);
+extern int __swp_swapcount(swp_entry_t entry, int *entry_size);
 extern int swp_swapcount(swp_entry_t entry);
 extern struct swap_info_struct *page_swap_info(struct page *);
 extern struct swap_info_struct *swp_swap_info(swp_entry_t entry);
@@ -585,7 +585,7 @@ static inline int __swap_count(swp_entry_t entry)
 	return 0;
 }
 
-static inline int __swp_swapcount(swp_entry_t entry)
+static inline int __swp_swapcount(swp_entry_t entry, int *entry_size)
 {
 	return 0;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 92e0cdb99c5a..a025494dd828 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -629,7 +629,8 @@ static vm_fault_t __do_huge_pmd_anonymous_page(struct vm_fault *vmf,
  *	    available
  * never: never stall for any thp allocation
  */
-static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
+gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
+				    unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 	gfp_t this_node = 0;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index bca34fc7a5e5..784ad6388da0 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -361,7 +361,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 {
 	struct page *found_page = NULL, *new_page = NULL;
 	struct swap_info_struct *si;
-	int err;
+	int err, entry_size = 1;
+	swp_entry_t hentry;
+
 	*new_page_allocated = false;
 
 	do {
@@ -387,14 +389,42 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * as SWAP_HAS_CACHE.  That's done in later part of code or
 		 * else swap_off will be aborted if we return NULL.
 		 */
-		if (!__swp_swapcount(entry) && swap_slot_cache_enabled)
+		if (!__swp_swapcount(entry, &entry_size) &&
+		    swap_slot_cache_enabled)
 			break;
 
 		/*
 		 * Get a new page to read into from swap.
 		 */
-		if (!new_page) {
-			new_page = alloc_page_vma(gfp_mask, vma, addr);
+		if (!new_page ||
+		    (IS_ENABLED(CONFIG_THP_SWAP) &&
+		     hpage_nr_pages(new_page) != entry_size)) {
+			if (new_page)
+				put_page(new_page);
+			if (IS_ENABLED(CONFIG_THP_SWAP) &&
+			    entry_size == HPAGE_PMD_NR) {
+				gfp_t gfp;
+
+				gfp = alloc_hugepage_direct_gfpmask(vma, addr);
+				/*
+				 * Make sure huge page allocation flags are
+				 * compatible with that of normal page
+				 */
+				VM_WARN_ONCE(gfp_mask & ~(gfp | __GFP_RECLAIM),
+					     "ignoring gfp_mask bits: %x",
+					     gfp_mask & ~(gfp | __GFP_RECLAIM));
+				new_page = alloc_pages_vma(gfp, HPAGE_PMD_ORDER,
+							   vma, addr,
+							   numa_node_id());
+				if (new_page)
+					prep_transhuge_page(new_page);
+				hentry = swp_entry(swp_type(entry),
+						   round_down(swp_offset(entry),
+							      HPAGE_PMD_NR));
+			} else {
+				new_page = alloc_page_vma(gfp_mask, vma, addr);
+				hentry = entry;
+			}
 			if (!new_page)
 				break;		/* Out of memory */
 		}
@@ -402,7 +432,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/*
 		 * Swap entry may have been freed since our caller observed it.
 		 */
-		err = swapcache_prepare(entry, 1);
+		err = swapcache_prepare(hentry, entry_size);
 		if (err == -EEXIST) {
 			/*
 			 * We might race against get_swap_page() and stumble
@@ -411,6 +441,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			 */
 			cond_resched();
 			continue;
+		} else if (err == -ENOTDIR) {
+			/* huge swap cluster has been split under us */
+			continue;
 		} else if (err) {	/* swp entry is obsolete ? */
 			break;
 		}
@@ -424,6 +457,9 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			SetPageWorkingset(new_page);
 			lru_cache_add_anon(new_page);
 			*new_page_allocated = true;
+			if (IS_ENABLED(CONFIG_THP_SWAP))
+				new_page += swp_offset(entry) &
+					(entry_size - 1);
 			return new_page;
 		}
 		__ClearPageLocked(new_page);
@@ -431,7 +467,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		 * add_to_swap_cache() doesn't return -EEXIST, so we can safely
 		 * clear SWAP_HAS_CACHE flag.
 		 */
-		put_swap_page(new_page, entry);
+		put_swap_page(new_page, hentry);
 	} while (err != -ENOMEM);
 
 	if (new_page)
@@ -453,7 +489,7 @@ struct page *read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 			vma, addr, &page_was_allocated);
 
 	if (page_was_allocated)
-		swap_readpage(retpage, do_poll);
+		swap_readpage(compound_head(retpage), do_poll);
 
 	return retpage;
 }
@@ -572,8 +608,9 @@ struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
-			if (offset != entry_offset) {
+			swap_readpage(compound_head(page), false);
+			if (offset != entry_offset &&
+			    !PageTransCompound(page)) {
 				SetPageReadahead(page);
 				count_vm_event(SWAP_RA);
 			}
@@ -734,8 +771,8 @@ static struct page *swap_vma_readahead(swp_entry_t fentry, gfp_t gfp_mask,
 		if (!page)
 			continue;
 		if (page_allocated) {
-			swap_readpage(page, false);
-			if (i != ra_info.offset) {
+			swap_readpage(compound_head(page), false);
+			if (i != ra_info.offset && !PageTransCompound(page)) {
 				SetPageReadahead(page);
 				count_vm_event(SWAP_RA);
 			}
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2020bd494419..2ca013df35e1 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1542,7 +1542,8 @@ int __swap_count(swp_entry_t entry)
 	return count;
 }
 
-static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
+static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry,
+			  int *entry_size)
 {
 	int count = 0;
 	pgoff_t offset = swp_offset(entry);
@@ -1550,6 +1551,8 @@ static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
 
 	ci = lock_cluster_or_swap_info(si, offset);
 	count = swap_count(si->swap_map[offset]);
+	if (entry_size)
+		*entry_size = ci && cluster_is_huge(ci) ? SWAPFILE_CLUSTER : 1;
 	unlock_cluster_or_swap_info(si, ci);
 	return count;
 }
@@ -1559,14 +1562,14 @@ static int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry)
  * This does not give an exact answer when swap count is continued,
  * but does include the high COUNT_CONTINUED flag to allow for that.
  */
-int __swp_swapcount(swp_entry_t entry)
+int __swp_swapcount(swp_entry_t entry, int *entry_size)
 {
 	int count = 0;
 	struct swap_info_struct *si;
 
 	si = get_swap_device(entry);
 	if (si) {
-		count = swap_swapcount(si, entry);
+		count = swap_swapcount(si, entry, entry_size);
 		put_swap_device(si);
 	}
 	return count;
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 09/21] swap: Swapin a THP in one piece
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (7 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 08/21] swap: Support to read a huge swap cluster for swapin a THP Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 10/21] swap: Support to count THP swapin and its fallback Huang Ying
                   ` (13 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

With this patch, when page fault handler find a PMD swap mapping, it
will swap in a THP in one piece.  This avoids the overhead of
splitting/collapsing before/after the THP swapping.  And improves the
swap performance greatly for reduced page fault count etc.

do_huge_pmd_swap_page() is added in the patch to implement this.  It
is similar to do_swap_page() for normal page swapin.

If failing to allocate a THP, the huge swap cluster and the PMD swap
mapping will be split to fallback to normal page swapin.

If the huge swap cluster has been split already, the PMD swap mapping
will be split to fallback to normal page swapin.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/huge_mm.h |   9 +++
 mm/huge_memory.c        | 174 ++++++++++++++++++++++++++++++++++++++++++++++++
 mm/memory.c             |  16 +++--
 3 files changed, 193 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a0e7f4f9c12b..d88579cb059a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -373,4 +373,13 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+#ifdef CONFIG_THP_SWAP
+extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+#else /* CONFIG_THP_SWAP */
+static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+	return 0;
+}
+#endif /* CONFIG_THP_SWAP */
+
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a025494dd828..fbc9c9e30992 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -33,6 +33,8 @@
 #include <linux/page_idle.h>
 #include <linux/shmem_fs.h>
 #include <linux/oom.h>
+#include <linux/delayacct.h>
+#include <linux/swap.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -1659,6 +1661,178 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma,
 	pmd_populate(mm, pmd, pgtable);
 }
 
+#ifdef CONFIG_THP_SWAP
+static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+			       unsigned long address, pmd_t orig_pmd)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	ptl = pmd_lock(mm, pmd);
+	if (pmd_same(*pmd, orig_pmd))
+		__split_huge_swap_pmd(vma, address & HPAGE_PMD_MASK, pmd);
+	else
+		ret = -ENOENT;
+	spin_unlock(ptl);
+
+	return ret;
+}
+
+int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
+{
+	struct page *page;
+	struct mem_cgroup *memcg;
+	struct vm_area_struct *vma = vmf->vma;
+	unsigned long haddr = vmf->address & HPAGE_PMD_MASK;
+	swp_entry_t entry;
+	pmd_t pmd;
+	int i, locked, exclusive = 0, ret = 0;
+
+	entry = pmd_to_swp_entry(orig_pmd);
+	VM_BUG_ON(non_swap_entry(entry));
+	delayacct_set_flag(DELAYACCT_PF_SWAPIN);
+retry:
+	page = lookup_swap_cache(entry, NULL, vmf->address);
+	if (!page) {
+		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, vma,
+					     haddr, false);
+		if (!page) {
+			/*
+			 * Back out if somebody else faulted in this pmd
+			 * while we released the pmd lock.
+			 */
+			if (likely(pmd_same(*vmf->pmd, orig_pmd))) {
+				/*
+				 * Failed to allocate huge page, split huge swap
+				 * cluster, and fallback to swapin normal page
+				 */
+				ret = split_swap_cluster(entry, 0);
+				/* Somebody else swapin the swap entry, retry */
+				if (ret == -EEXIST) {
+					ret = 0;
+					goto retry;
+				/* swapoff occurs under us */
+				} else if (ret == -EINVAL)
+					ret = 0;
+				else
+					goto fallback;
+			}
+			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+			goto out;
+		}
+
+		/* Had to read the page from swap area: Major fault */
+		ret = VM_FAULT_MAJOR;
+		count_vm_event(PGMAJFAULT);
+		count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
+	} else if (!PageTransCompound(page))
+		goto fallback;
+
+	locked = lock_page_or_retry(page, vma->vm_mm, vmf->flags);
+
+	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!locked) {
+		ret |= VM_FAULT_RETRY;
+		goto out_release;
+	}
+
+	/*
+	 * Make sure try_to_free_swap or reuse_swap_page or swapoff did not
+	 * release the swapcache from under us.  The page pin, and pmd_same
+	 * test below, are not enough to exclude that.  Even if it is still
+	 * swapcache, we need to check that the page's swap has not changed.
+	 */
+	if (unlikely(!PageSwapCache(page) || page_private(page) != entry.val))
+		goto out_page;
+
+	if (mem_cgroup_try_charge_delay(page, vma->vm_mm, GFP_KERNEL,
+					&memcg, true)) {
+		ret = VM_FAULT_OOM;
+		goto out_page;
+	}
+
+	/*
+	 * Back out if somebody else already faulted in this pmd.
+	 */
+	vmf->ptl = pmd_lockptr(vma->vm_mm, vmf->pmd);
+	spin_lock(vmf->ptl);
+	if (unlikely(!pmd_same(*vmf->pmd, orig_pmd)))
+		goto out_nomap;
+
+	if (unlikely(!PageUptodate(page))) {
+		ret = VM_FAULT_SIGBUS;
+		goto out_nomap;
+	}
+
+	/*
+	 * The page isn't present yet, go ahead with the fault.
+	 *
+	 * Be careful about the sequence of operations here.
+	 * To get its accounting right, reuse_swap_page() must be called
+	 * while the page is counted on swap but not yet in mapcount i.e.
+	 * before page_add_anon_rmap() and swap_free(); try_to_free_swap()
+	 * must be called after the swap_free(), or it will never succeed.
+	 */
+
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+	pmd = mk_huge_pmd(page, vma->vm_page_prot);
+	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
+		pmd = maybe_pmd_mkwrite(pmd_mkdirty(pmd), vma);
+		vmf->flags &= ~FAULT_FLAG_WRITE;
+		ret |= VM_FAULT_WRITE;
+		exclusive = RMAP_EXCLUSIVE;
+	}
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		flush_icache_page(vma, page + i);
+	if (pmd_swp_soft_dirty(orig_pmd))
+		pmd = pmd_mksoft_dirty(pmd);
+	do_page_add_anon_rmap(page, vma, haddr,
+			      exclusive | RMAP_COMPOUND);
+	mem_cgroup_commit_charge(page, memcg, true, true);
+	activate_page(page);
+	set_pmd_at(vma->vm_mm, haddr, vmf->pmd, pmd);
+
+	swap_free(entry, HPAGE_PMD_NR);
+	if (mem_cgroup_swap_full(page) ||
+	    (vma->vm_flags & VM_LOCKED) || PageMlocked(page))
+		try_to_free_swap(page);
+	unlock_page(page);
+
+	if (vmf->flags & FAULT_FLAG_WRITE) {
+		spin_unlock(vmf->ptl);
+		ret |= do_huge_pmd_wp_page(vmf, pmd);
+		if (ret & VM_FAULT_ERROR)
+			ret &= VM_FAULT_ERROR;
+		goto out;
+	}
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
+	spin_unlock(vmf->ptl);
+out:
+	return ret;
+out_nomap:
+	mem_cgroup_cancel_charge(page, memcg, true);
+	spin_unlock(vmf->ptl);
+out_page:
+	unlock_page(page);
+out_release:
+	put_page(page);
+	return ret;
+fallback:
+	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+	if (!split_huge_swap_pmd(vmf->vma, vmf->pmd, vmf->address, orig_pmd))
+		ret = VM_FAULT_FALLBACK;
+	else
+		ret = 0;
+	if (page)
+		put_page(page);
+	return ret;
+}
+#endif
+
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
diff --git a/mm/memory.c b/mm/memory.c
index 17895a347056..6970bb10cf5a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3862,13 +3862,17 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 		barrier();
 		if (unlikely(is_swap_pmd(orig_pmd))) {
-			VM_BUG_ON(thp_migration_supported() &&
-					  !is_pmd_migration_entry(orig_pmd));
-			if (is_pmd_migration_entry(orig_pmd))
+			if (thp_migration_supported() &&
+			    is_pmd_migration_entry(orig_pmd)) {
 				pmd_migration_entry_wait(mm, vmf.pmd);
-			return 0;
-		}
-		if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
+				return 0;
+			} else if (IS_ENABLED(CONFIG_THP_SWAP)) {
+				ret = do_huge_pmd_swap_page(&vmf, orig_pmd);
+				if (!(ret & VM_FAULT_FALLBACK))
+					return ret;
+			} else
+				VM_BUG_ON(1);
+		} else if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
 			if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf, orig_pmd);
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 10/21] swap: Support to count THP swapin and its fallback
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (8 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 09/21] swap: Swapin a THP in one piece Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 11/21] swap: Add sysfs interface to configure THP swapin Huang Ying
                   ` (12 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

2 new /proc/vmstat fields are added, "thp_swapin" and
"thp_swapin_fallback" to count swapin a THP from swap device in one
piece and fallback to normal page swapin.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 Documentation/admin-guide/mm/transhuge.rst |  8 ++++++++
 include/linux/vm_event_item.h              |  2 ++
 mm/huge_memory.c                           |  4 +++-
 mm/page_io.c                               | 15 ++++++++++++---
 mm/vmstat.c                                |  2 ++
 5 files changed, 27 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 7ab93a8404b9..85e33f785fd7 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -364,6 +364,14 @@ thp_swpout_fallback
 	Usually because failed to allocate some continuous swap space
 	for the huge page.
 
+thp_swpin
+	is incremented every time a huge page is swapin in one piece
+	without splitting.
+
+thp_swpin_fallback
+	is incremented if a huge page has to be split during swapin.
+	Usually because failed to allocate a huge page.
+
 As the system ages, allocating huge pages may be expensive as the
 system uses memory compaction to copy data around memory to free a
 huge page for use. There are some counters in ``/proc/vmstat`` to help
diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h
index 47a3441cf4c4..c20b655cfdcc 100644
--- a/include/linux/vm_event_item.h
+++ b/include/linux/vm_event_item.h
@@ -88,6 +88,8 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT,
 		THP_ZERO_PAGE_ALLOC_FAILED,
 		THP_SWPOUT,
 		THP_SWPOUT_FALLBACK,
+		THP_SWPIN,
+		THP_SWPIN_FALLBACK,
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
 		BALLOON_INFLATE,
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index fbc9c9e30992..8efcc84fb4b0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1715,8 +1715,10 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 				/* swapoff occurs under us */
 				} else if (ret == -EINVAL)
 					ret = 0;
-				else
+				else {
+					count_vm_event(THP_SWPIN_FALLBACK);
 					goto fallback;
+				}
 			}
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 			goto out;
diff --git a/mm/page_io.c b/mm/page_io.c
index 573d3663d846..bcc2da750590 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -348,6 +348,15 @@ int __swap_writepage(struct page *page, struct writeback_control *wbc,
 	return ret;
 }
 
+static inline void count_swpin_vm_event(struct page *page)
+{
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (unlikely(PageTransHuge(page)))
+		count_vm_event(THP_SWPIN);
+#endif
+	count_vm_events(PSWPIN, hpage_nr_pages(page));
+}
+
 int swap_readpage(struct page *page, bool synchronous)
 {
 	struct bio *bio;
@@ -371,7 +380,7 @@ int swap_readpage(struct page *page, bool synchronous)
 
 		ret = mapping->a_ops->readpage(swap_file, page);
 		if (!ret)
-			count_vm_event(PSWPIN);
+			count_swpin_vm_event(page);
 		return ret;
 	}
 
@@ -382,7 +391,7 @@ int swap_readpage(struct page *page, bool synchronous)
 			unlock_page(page);
 		}
 
-		count_vm_event(PSWPIN);
+		count_swpin_vm_event(page);
 		return 0;
 	}
 
@@ -401,7 +410,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	get_task_struct(current);
 	bio->bi_private = current;
 	bio_set_op_attrs(bio, REQ_OP_READ, 0);
-	count_vm_event(PSWPIN);
+	count_swpin_vm_event(page);
 	bio_get(bio);
 	qc = submit_bio(bio);
 	while (synchronous) {
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d08ed044759d..823856fae136 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1264,6 +1264,8 @@ const char * const vmstat_text[] = {
 	"thp_zero_page_alloc_failed",
 	"thp_swpout",
 	"thp_swpout_fallback",
+	"thp_swpin",
+	"thp_swpin_fallback",
 #endif
 #ifdef CONFIG_MEMORY_BALLOON
 	"balloon_inflate",
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 11/21] swap: Add sysfs interface to configure THP swapin
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (9 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 10/21] swap: Support to count THP swapin and its fallback Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 12/21] swap: Support PMD swap mapping in swapoff Huang Ying
                   ` (11 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

Swapin a THP as a whole isn't desirable in some situations.  For
example, for completely random access pattern, swapin a THP in one
piece will inflate the reading greatly.  So a sysfs interface:
/sys/kernel/mm/transparent_hugepage/swapin_enabled is added to
configure it.  Three options as follow are provided,

- always: THP swapin will be enabled always

- madvise: THP swapin will be enabled only for VMA with VM_HUGEPAGE
  flag set.

- never: THP swapin will be disabled always

The default configuration is: madvise.

During page fault, if a PMD swap mapping is found and THP swapin is
disabled, the huge swap cluster and the PMD swap mapping will be split
and fallback to normal page swapin.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 Documentation/admin-guide/mm/transhuge.rst | 21 +++++++
 include/linux/huge_mm.h                    | 31 ++++++++++
 mm/huge_memory.c                           | 94 ++++++++++++++++++++++++------
 3 files changed, 127 insertions(+), 19 deletions(-)

diff --git a/Documentation/admin-guide/mm/transhuge.rst b/Documentation/admin-guide/mm/transhuge.rst
index 85e33f785fd7..23aefb17101c 100644
--- a/Documentation/admin-guide/mm/transhuge.rst
+++ b/Documentation/admin-guide/mm/transhuge.rst
@@ -160,6 +160,27 @@ Some userspace (such as a test program, or an optimized memory allocation
 
 	cat /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
 
+Transparent hugepage may be swapout and swapin in one piece without
+splitting.  This will improve the utility of transparent hugepage but
+may inflate the read/write too.  So whether to enable swapin
+transparent hugepage in one piece can be configured as follow.
+
+	echo always >/sys/kernel/mm/transparent_hugepage/swapin_enabled
+	echo madvise >/sys/kernel/mm/transparent_hugepage/swapin_enabled
+	echo never >/sys/kernel/mm/transparent_hugepage/swapin_enabled
+
+always
+	Attempt to allocate a transparent huge page and read it from
+	swap space in one piece every time.
+
+never
+	Always split the swap space and PMD swap mapping and swapin
+	the fault normal page during swapin.
+
+madvise
+	Only swapin the transparent huge page in one piece for
+	MADV_HUGEPAGE madvise regions.
+
 khugepaged will be automatically started when
 transparent_hugepage/enabled is set to "always" or "madvise, and it'll
 be automatically shutdown if it's set to "never".
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index d88579cb059a..a13cd19b6047 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -63,6 +63,8 @@ enum transparent_hugepage_flag {
 #ifdef CONFIG_DEBUG_VM
 	TRANSPARENT_HUGEPAGE_DEBUG_COW_FLAG,
 #endif
+	TRANSPARENT_HUGEPAGE_SWAPIN_FLAG,
+	TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG,
 };
 
 struct kobject;
@@ -375,11 +377,40 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
 
 #ifdef CONFIG_THP_SWAP
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+
+static inline bool transparent_hugepage_swapin_enabled(
+	struct vm_area_struct *vma)
+{
+	if (vma->vm_flags & VM_NOHUGEPAGE)
+		return false;
+
+	if (is_vma_temporary_stack(vma))
+		return false;
+
+	if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
+		return false;
+
+	if (transparent_hugepage_flags &
+			(1 << TRANSPARENT_HUGEPAGE_SWAPIN_FLAG))
+		return true;
+
+	if (transparent_hugepage_flags &
+			(1 << TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG))
+		return !!(vma->vm_flags & VM_HUGEPAGE);
+
+	return false;
+}
 #else /* CONFIG_THP_SWAP */
 static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	return 0;
 }
+
+static inline bool transparent_hugepage_swapin_enabled(
+	struct vm_area_struct *vma)
+{
+	return false;
+}
 #endif /* CONFIG_THP_SWAP */
 
 #endif /* _LINUX_HUGE_MM_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8efcc84fb4b0..0ccb1b78d661 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -57,7 +57,8 @@ unsigned long transparent_hugepage_flags __read_mostly =
 #endif
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
 	(1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
-	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
+	(1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG)|
+	(1<<TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG);
 
 static struct shrinker deferred_split_shrinker;
 
@@ -316,6 +317,53 @@ static struct kobj_attribute debug_cow_attr =
 	__ATTR(debug_cow, 0644, debug_cow_show, debug_cow_store);
 #endif /* CONFIG_DEBUG_VM */
 
+#ifdef CONFIG_THP_SWAP
+static ssize_t swapin_enabled_show(struct kobject *kobj,
+				   struct kobj_attribute *attr, char *buf)
+{
+	if (test_bit(TRANSPARENT_HUGEPAGE_SWAPIN_FLAG,
+		     &transparent_hugepage_flags))
+		return sprintf(buf, "[always] madvise never\n");
+	else if (test_bit(TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG,
+			  &transparent_hugepage_flags))
+		return sprintf(buf, "always [madvise] never\n");
+	else
+		return sprintf(buf, "always madvise [never]\n");
+}
+
+static ssize_t swapin_enabled_store(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    const char *buf, size_t count)
+{
+	ssize_t ret = count;
+
+	if (!memcmp("always", buf,
+		    min(sizeof("always")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG,
+			  &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_SWAPIN_FLAG,
+			&transparent_hugepage_flags);
+	} else if (!memcmp("madvise", buf,
+			   min(sizeof("madvise")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_SWAPIN_FLAG,
+			  &transparent_hugepage_flags);
+		set_bit(TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG,
+			&transparent_hugepage_flags);
+	} else if (!memcmp("never", buf,
+			   min(sizeof("never")-1, count))) {
+		clear_bit(TRANSPARENT_HUGEPAGE_SWAPIN_FLAG,
+			  &transparent_hugepage_flags);
+		clear_bit(TRANSPARENT_HUGEPAGE_SWAPIN_REQ_MADV_FLAG,
+			  &transparent_hugepage_flags);
+	} else
+		ret = -EINVAL;
+
+	return ret;
+}
+static struct kobj_attribute swapin_enabled_attr =
+	__ATTR(swapin_enabled, 0644, swapin_enabled_show, swapin_enabled_store);
+#endif /* CONFIG_THP_SWAP */
+
 static struct attribute *hugepage_attr[] = {
 	&enabled_attr.attr,
 	&defrag_attr.attr,
@@ -326,6 +374,9 @@ static struct attribute *hugepage_attr[] = {
 #endif
 #ifdef CONFIG_DEBUG_VM
 	&debug_cow_attr.attr,
+#endif
+#ifdef CONFIG_THP_SWAP
+	&swapin_enabled_attr.attr,
 #endif
 	NULL,
 };
@@ -1695,6 +1746,9 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 retry:
 	page = lookup_swap_cache(entry, NULL, vmf->address);
 	if (!page) {
+		if (!transparent_hugepage_swapin_enabled(vma))
+			goto split;
+
 		page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE, vma,
 					     haddr, false);
 		if (!page) {
@@ -1702,24 +1756,8 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 			 * Back out if somebody else faulted in this pmd
 			 * while we released the pmd lock.
 			 */
-			if (likely(pmd_same(*vmf->pmd, orig_pmd))) {
-				/*
-				 * Failed to allocate huge page, split huge swap
-				 * cluster, and fallback to swapin normal page
-				 */
-				ret = split_swap_cluster(entry, 0);
-				/* Somebody else swapin the swap entry, retry */
-				if (ret == -EEXIST) {
-					ret = 0;
-					goto retry;
-				/* swapoff occurs under us */
-				} else if (ret == -EINVAL)
-					ret = 0;
-				else {
-					count_vm_event(THP_SWPIN_FALLBACK);
-					goto fallback;
-				}
-			}
+			if (likely(pmd_same(*vmf->pmd, orig_pmd)))
+				goto split;
 			delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 			goto out;
 		}
@@ -1832,6 +1870,24 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	if (page)
 		put_page(page);
 	return ret;
+split:
+	/*
+	 * Failed to allocate huge page, split huge swap cluster, and
+	 * fallback to swapin normal page
+	 */
+	ret = split_swap_cluster(entry, 0);
+	/* Somebody else swapin the swap entry, retry */
+	if (ret == -EEXIST) {
+		ret = 0;
+		goto retry;
+	}
+	/* swapoff occurs under us */
+	if (ret == -EINVAL) {
+		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+		return 0;
+	}
+	count_vm_event(THP_SWPIN_FALLBACK);
+	goto fallback;
 }
 #endif
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 12/21] swap: Support PMD swap mapping in swapoff
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (10 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 11/21] swap: Add sysfs interface to configure THP swapin Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 13/21] swap: Support PMD swap mapping in madvise_free() Huang Ying
                   ` (10 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

During swapoff, for a huge swap cluster, we need to allocate a THP,
read its contents into the THP and unuse the PMD and PTE swap mappings
to it.  If failed to allocate a THP, the huge swap cluster will be
split.

During unuse, if it is found that the swap cluster mapped by a PMD
swap mapping is split already, we will split the PMD swap mapping and
unuse the PTEs.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/asm-generic/pgtable.h | 14 +------
 include/linux/huge_mm.h       |  8 ++++
 mm/huge_memory.c              |  4 +-
 mm/swapfile.c                 | 86 ++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 97 insertions(+), 15 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index eb1e9d17371b..d64cef2bff04 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -931,22 +931,12 @@ static inline int pmd_none_or_trans_huge_or_clear_bad(pmd_t *pmd)
 	barrier();
 #endif
 	/*
-	 * !pmd_present() checks for pmd migration entries
-	 *
-	 * The complete check uses is_pmd_migration_entry() in linux/swapops.h
-	 * But using that requires moving current function and pmd_trans_unstable()
-	 * to linux/swapops.h to resovle dependency, which is too much code move.
-	 *
-	 * !pmd_present() is equivalent to is_pmd_migration_entry() currently,
-	 * because !pmd_present() pages can only be under migration not swapped
-	 * out.
-	 *
-	 * pmd_none() is preseved for future condition checks on pmd migration
+	 * pmd_none() is preseved for future condition checks on pmd swap
 	 * entries and not confusing with this function name, although it is
 	 * redundant with !pmd_present().
 	 */
 	if (pmd_none(pmdval) || pmd_trans_huge(pmdval) ||
-		(IS_ENABLED(CONFIG_ARCH_ENABLE_THP_MIGRATION) && !pmd_present(pmdval)))
+	    (IS_ENABLED(CONFIG_HAVE_PMD_SWAP_ENTRY) && !pmd_present(pmdval)))
 		return 1;
 	if (unlikely(pmd_bad(pmdval))) {
 		pmd_clear_bad(pmd);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a13cd19b6047..1927b2edb74a 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -376,6 +376,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_THP_SWAP
+extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+			       unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
 
 static inline bool transparent_hugepage_swapin_enabled(
@@ -401,6 +403,12 @@ static inline bool transparent_hugepage_swapin_enabled(
 	return false;
 }
 #else /* CONFIG_THP_SWAP */
+static inline int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+				      unsigned long address, pmd_t orig_pmd)
+{
+	return 0;
+}
+
 static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 {
 	return 0;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0ccb1b78d661..0ec71f907fa5 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1713,8 +1713,8 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma,
 }
 
 #ifdef CONFIG_THP_SWAP
-static int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
-			       unsigned long address, pmd_t orig_pmd)
+int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+			unsigned long address, pmd_t orig_pmd)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 2ca013df35e1..93b6a5d4e44a 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1931,6 +1931,11 @@ static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
 	return pte_same(pte_swp_clear_soft_dirty(pte), swp_pte);
 }
 
+static inline int pmd_same_as_swp(pmd_t pmd, pmd_t swp_pmd)
+{
+	return pmd_same(pmd_swp_clear_soft_dirty(pmd), swp_pmd);
+}
+
 /*
  * No need to decide whether this PTE shares the swap entry with others,
  * just let do_wp_page work it out if a write is requested later - to
@@ -1992,6 +1997,53 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_t *pmd,
 	return ret;
 }
 
+#ifdef CONFIG_THP_SWAP
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		     unsigned long addr, swp_entry_t entry, struct page *page)
+{
+	struct mem_cgroup *memcg;
+	spinlock_t *ptl;
+	int ret = 1;
+
+	if (mem_cgroup_try_charge(page, vma->vm_mm, GFP_KERNEL,
+				  &memcg, true)) {
+		ret = -ENOMEM;
+		goto out_nolock;
+	}
+
+	ptl = pmd_lock(vma->vm_mm, pmd);
+	if (unlikely(!pmd_same_as_swp(*pmd, swp_entry_to_pmd(entry)))) {
+		mem_cgroup_cancel_charge(page, memcg, true);
+		ret = 0;
+		goto out;
+	}
+
+	add_mm_counter(vma->vm_mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+	add_mm_counter(vma->vm_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+	get_page(page);
+	set_pmd_at(vma->vm_mm, addr, pmd,
+		   pmd_mkold(mk_huge_pmd(page, vma->vm_page_prot)));
+	page_add_anon_rmap(page, vma, addr, true);
+	mem_cgroup_commit_charge(page, memcg, true, true);
+	swap_free(entry, HPAGE_PMD_NR);
+	/*
+	 * Move the page to the active list so it is not
+	 * immediately swapped out again after swapon.
+	 */
+	activate_page(page);
+out:
+	spin_unlock(ptl);
+out_nolock:
+	return ret;
+}
+#else
+static int unuse_pmd(struct vm_area_struct *vma, pmd_t *pmd,
+		     unsigned long addr, swp_entry_t entry, struct page *page)
+{
+	return 0;
+}
+#endif
+
 static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				unsigned long addr, unsigned long end,
 				swp_entry_t entry, struct page *page)
@@ -2032,7 +2084,7 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 				unsigned long addr, unsigned long end,
 				swp_entry_t entry, struct page *page)
 {
-	pmd_t *pmd;
+	pmd_t swp_pmd = swp_entry_to_pmd(entry), *pmd, orig_pmd;
 	unsigned long next;
 	int ret;
 
@@ -2040,6 +2092,27 @@ static inline int unuse_pmd_range(struct vm_area_struct *vma, pud_t *pud,
 	do {
 		cond_resched();
 		next = pmd_addr_end(addr, end);
+		orig_pmd = *pmd;
+		if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(orig_pmd)) {
+			if (likely(!pmd_same_as_swp(orig_pmd, swp_pmd)))
+				continue;
+			/*
+			 * Huge cluster has been split already, split
+			 * PMD swap mapping and fallback to unuse PTE
+			 */
+			if (!PageTransCompound(page)) {
+				ret = split_huge_swap_pmd(vma, pmd,
+							  addr, orig_pmd);
+				if (ret)
+					return ret;
+				ret = unuse_pte_range(vma, pmd, addr,
+						      next, entry, page);
+			} else
+				ret = unuse_pmd(vma, pmd, addr, entry, page);
+			if (ret)
+				return ret;
+			continue;
+		}
 		if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 			continue;
 		ret = unuse_pte_range(vma, pmd, addr, next, entry, page);
@@ -2233,6 +2306,7 @@ int try_to_unuse(unsigned int type, bool frontswap,
 	 * there are races when an instance of an entry might be missed.
 	 */
 	while ((i = find_next_to_unuse(si, i, frontswap)) != 0) {
+retry:
 		if (signal_pending(current)) {
 			retval = -EINTR;
 			break;
@@ -2248,6 +2322,8 @@ int try_to_unuse(unsigned int type, bool frontswap,
 		page = read_swap_cache_async(entry,
 					GFP_HIGHUSER_MOVABLE, NULL, 0, false);
 		if (!page) {
+			struct swap_cluster_info *ci = NULL;
+
 			/*
 			 * Either swap_duplicate() failed because entry
 			 * has been freed independently, and will not be
@@ -2264,6 +2340,14 @@ int try_to_unuse(unsigned int type, bool frontswap,
 			 */
 			if (!swcount || swcount == SWAP_MAP_BAD)
 				continue;
+			if (si->cluster_info)
+				ci = si->cluster_info + i / SWAPFILE_CLUSTER;
+			/* Split huge cluster if failed to allocate huge page */
+			if (cluster_is_huge(ci)) {
+				retval = split_swap_cluster(entry, 0);
+				if (!retval || retval == -EEXIST)
+					goto retry;
+			}
 			retval = -ENOMEM;
 			break;
 		}
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 13/21] swap: Support PMD swap mapping in madvise_free()
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (11 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 12/21] swap: Support PMD swap mapping in swapoff Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 14/21] swap: Support to move swap account for PMD swap mapping Huang Ying
                   ` (9 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

When madvise_free() found a PMD swap mapping, if only part of the huge
swap cluster is operated on, the PMD swap mapping will be split and
fallback to PTE swap mapping processing.  Otherwise, if all huge swap
cluster is operated on, free_swap_and_cache() will be called to
decrease the PMD swap mapping count and probably free the swap space
and the THP in swap cache too.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/huge_memory.c | 54 +++++++++++++++++++++++++++++++++++++++---------------
 mm/madvise.c     |  2 +-
 2 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0ec71f907fa5..60b4105734b1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1891,6 +1891,15 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 }
 #endif
 
+static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
+{
+	pgtable_t pgtable;
+
+	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
+	pte_free(mm, pgtable);
+	mm_dec_nr_ptes(mm);
+}
+
 /*
  * Return true if we do MADV_FREE successfully on entire pmd page.
  * Otherwise, return false.
@@ -1911,15 +1920,39 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		goto out_unlocked;
 
 	orig_pmd = *pmd;
-	if (is_huge_zero_pmd(orig_pmd))
-		goto out;
-
 	if (unlikely(!pmd_present(orig_pmd))) {
-		VM_BUG_ON(thp_migration_supported() &&
-				  !is_pmd_migration_entry(orig_pmd));
-		goto out;
+		swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
+
+		if (is_migration_entry(entry)) {
+			VM_BUG_ON(!thp_migration_supported());
+			goto out;
+		} else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+			   !non_swap_entry(entry)) {
+			/*
+			 * If part of THP is discarded, split the PMD
+			 * swap mapping and operate on the PTEs
+			 */
+			if (next - addr != HPAGE_PMD_SIZE) {
+				unsigned long haddr = addr & HPAGE_PMD_MASK;
+
+				__split_huge_swap_pmd(vma, haddr, pmd);
+				goto out;
+			}
+			free_swap_and_cache(entry, HPAGE_PMD_NR);
+			pmd_clear(pmd);
+			zap_deposited_table(mm, pmd);
+			if (current->mm == mm)
+				sync_mm_rss(mm);
+			add_mm_counter(mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+			ret = true;
+			goto out;
+		} else
+			VM_BUG_ON(1);
 	}
 
+	if (is_huge_zero_pmd(orig_pmd))
+		goto out;
+
 	page = pmd_page(orig_pmd);
 	/*
 	 * If other processes are mapping this page, we couldn't discard
@@ -1965,15 +1998,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 	return ret;
 }
 
-static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
-{
-	pgtable_t pgtable;
-
-	pgtable = pgtable_trans_huge_withdraw(mm, pmd);
-	pte_free(mm, pgtable);
-	mm_dec_nr_ptes(mm);
-}
-
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/madvise.c b/mm/madvise.c
index 50282ba862e2..20101ff125d0 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -321,7 +321,7 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 	unsigned long next;
 
 	next = pmd_addr_end(addr, end);
-	if (pmd_trans_huge(*pmd))
+	if (pmd_trans_huge(*pmd) || is_swap_pmd(*pmd))
 		if (madvise_free_huge_pmd(tlb, vma, pmd, addr, next))
 			goto next;
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 14/21] swap: Support to move swap account for PMD swap mapping
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (12 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 13/21] swap: Support PMD swap mapping in madvise_free() Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-24 17:27   ` Daniel Jordan
  2018-10-10  7:19 ` [PATCH -V6 15/21] swap: Support to copy PMD swap mapping when fork() Huang Ying
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

Previously the huge swap cluster will be split after the THP is
swapout.  Now, to support to swapin the THP in one piece, the huge
swap cluster will not be split after the THP is reclaimed.  So in
memcg, we need to move the swap account for PMD swap mappings in the
process's page table.

When the page table is scanned during moving memcg charge, the PMD
swap mapping will be identified.  And mem_cgroup_move_swap_account()
and its callee is revised to move account for the whole huge swap
cluster.  If the swap cluster mapped by PMD has been split, the PMD
swap mapping will be split and fallback to PTE processing.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/huge_mm.h     |   9 ++++
 include/linux/swap.h        |   6 +++
 include/linux/swap_cgroup.h |   3 +-
 mm/huge_memory.c            |   8 +--
 mm/memcontrol.c             | 129 ++++++++++++++++++++++++++++++++++----------
 mm/swap_cgroup.c            |  45 +++++++++++++---
 mm/swapfile.c               |  14 +++++
 7 files changed, 174 insertions(+), 40 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 1927b2edb74a..e573774f9014 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -376,6 +376,9 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
 #ifdef CONFIG_THP_SWAP
+extern void __split_huge_swap_pmd(struct vm_area_struct *vma,
+				  unsigned long haddr,
+				  pmd_t *pmd);
 extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			       unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
@@ -403,6 +406,12 @@ static inline bool transparent_hugepage_swapin_enabled(
 	return false;
 }
 #else /* CONFIG_THP_SWAP */
+static inline void __split_huge_swap_pmd(struct vm_area_struct *vma,
+					 unsigned long haddr,
+					 pmd_t *pmd)
+{
+}
+
 static inline int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 				      unsigned long address, pmd_t orig_pmd)
 {
diff --git a/include/linux/swap.h b/include/linux/swap.h
index f2daf3fbdd4b..1210f70f72bc 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -617,6 +617,7 @@ static inline swp_entry_t get_swap_page(struct page *page)
 #ifdef CONFIG_THP_SWAP
 extern int split_swap_cluster(swp_entry_t entry, unsigned long flags);
 extern int split_swap_cluster_map(swp_entry_t entry);
+extern int get_swap_entry_size(swp_entry_t entry);
 #else
 static inline int split_swap_cluster(swp_entry_t entry, unsigned long flags)
 {
@@ -627,6 +628,11 @@ static inline int split_swap_cluster_map(swp_entry_t entry)
 {
 	return 0;
 }
+
+static inline int get_swap_entry_size(swp_entry_t entry)
+{
+	return 1;
+}
 #endif
 
 #ifdef CONFIG_MEMCG
diff --git a/include/linux/swap_cgroup.h b/include/linux/swap_cgroup.h
index a12dd1c3966c..c40fb52b0563 100644
--- a/include/linux/swap_cgroup.h
+++ b/include/linux/swap_cgroup.h
@@ -7,7 +7,8 @@
 #ifdef CONFIG_MEMCG_SWAP
 
 extern unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-					unsigned short old, unsigned short new);
+					unsigned short old, unsigned short new,
+					unsigned int nr_ents);
 extern unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
 					 unsigned int nr_ents);
 extern unsigned short lookup_swap_cgroup_id(swp_entry_t ent);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 60b4105734b1..ebd043528309 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1678,10 +1678,11 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 	return 0;
 }
 
+#ifdef CONFIG_THP_SWAP
 /* Convert a PMD swap mapping to a set of PTE swap mappings */
-static void __split_huge_swap_pmd(struct vm_area_struct *vma,
-				  unsigned long haddr,
-				  pmd_t *pmd)
+void __split_huge_swap_pmd(struct vm_area_struct *vma,
+			   unsigned long haddr,
+			   pmd_t *pmd)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgtable_t pgtable;
@@ -1712,7 +1713,6 @@ static void __split_huge_swap_pmd(struct vm_area_struct *vma,
 	pmd_populate(mm, pmd, pgtable);
 }
 
-#ifdef CONFIG_THP_SWAP
 int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			unsigned long address, pmd_t orig_pmd)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 7bebe2ddec05..b6504b81ae0c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2660,9 +2660,10 @@ void mem_cgroup_split_huge_fixup(struct page *head)
 #ifdef CONFIG_MEMCG_SWAP
 /**
  * mem_cgroup_move_swap_account - move swap charge and swap_cgroup's record.
- * @entry: swap entry to be moved
+ * @entry: the first swap entry to be moved
  * @from:  mem_cgroup which the entry is moved from
  * @to:  mem_cgroup which the entry is moved to
+ * @nr_ents: number of swap entries
  *
  * It succeeds only when the swap_cgroup's record for this entry is the same
  * as the mem_cgroup's id of @from.
@@ -2673,23 +2674,27 @@ void mem_cgroup_split_huge_fixup(struct page *head)
  * both res and memsw, and called css_get().
  */
 static int mem_cgroup_move_swap_account(swp_entry_t entry,
-				struct mem_cgroup *from, struct mem_cgroup *to)
+					struct mem_cgroup *from,
+					struct mem_cgroup *to,
+					unsigned int nr_ents)
 {
 	unsigned short old_id, new_id;
 
 	old_id = mem_cgroup_id(from);
 	new_id = mem_cgroup_id(to);
 
-	if (swap_cgroup_cmpxchg(entry, old_id, new_id) == old_id) {
-		mod_memcg_state(from, MEMCG_SWAP, -1);
-		mod_memcg_state(to, MEMCG_SWAP, 1);
+	if (swap_cgroup_cmpxchg(entry, old_id, new_id, nr_ents) == old_id) {
+		mod_memcg_state(from, MEMCG_SWAP, -nr_ents);
+		mod_memcg_state(to, MEMCG_SWAP, nr_ents);
 		return 0;
 	}
 	return -EINVAL;
 }
 #else
 static inline int mem_cgroup_move_swap_account(swp_entry_t entry,
-				struct mem_cgroup *from, struct mem_cgroup *to)
+					       struct mem_cgroup *from,
+					       struct mem_cgroup *to,
+					       unsigned int nr_ents)
 {
 	return -EINVAL;
 }
@@ -4644,6 +4649,7 @@ enum mc_target_type {
 	MC_TARGET_PAGE,
 	MC_TARGET_SWAP,
 	MC_TARGET_DEVICE,
+	MC_TARGET_FALLBACK,
 };
 
 static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
@@ -4710,6 +4716,26 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
 }
 #endif
 
+static struct page *mc_handle_swap_pmd(struct vm_area_struct *vma,
+			pmd_t pmd, swp_entry_t *entry)
+{
+	struct page *page = NULL;
+	swp_entry_t ent = pmd_to_swp_entry(pmd);
+
+	if (!(mc.flags & MOVE_ANON) || non_swap_entry(ent))
+		return NULL;
+
+	/*
+	 * Because lookup_swap_cache() updates some statistics counter,
+	 * we call find_get_page() with swapper_space directly.
+	 */
+	page = find_get_page(swap_address_space(ent), swp_offset(ent));
+	if (do_memsw_account())
+		entry->val = ent.val;
+
+	return page;
+}
+
 static struct page *mc_handle_file_pte(struct vm_area_struct *vma,
 			unsigned long addr, pte_t ptent, swp_entry_t *entry)
 {
@@ -4898,7 +4924,9 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 	 * There is a swap entry and a page doesn't exist or isn't charged.
 	 * But we cannot move a tail-page in a THP.
 	 */
-	if (ent.val && !ret && (!page || !PageTransCompound(page)) &&
+	if (ent.val && !ret &&
+	    ((page && !PageTransCompound(page)) ||
+	     (!page && get_swap_entry_size(ent) == 1)) &&
 	    mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
 		ret = MC_TARGET_SWAP;
 		if (target)
@@ -4909,37 +4937,64 @@ static enum mc_target_type get_mctgt_type(struct vm_area_struct *vma,
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
- * We don't consider PMD mapped swapping or file mapped pages because THP does
- * not support them for now.
- * Caller should make sure that pmd_trans_huge(pmd) is true.
+ * We don't consider file mapped pages because THP does not support
+ * them for now.
  */
 static enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t pmd, union mc_target *target)
+		unsigned long addr, pmd_t *pmdp, union mc_target *target)
 {
+	pmd_t pmd = *pmdp;
 	struct page *page = NULL;
 	enum mc_target_type ret = MC_TARGET_NONE;
+	swp_entry_t ent = { .val = 0 };
 
 	if (unlikely(is_swap_pmd(pmd))) {
-		VM_BUG_ON(thp_migration_supported() &&
-				  !is_pmd_migration_entry(pmd));
-		return ret;
+		if (is_pmd_migration_entry(pmd)) {
+			VM_BUG_ON(!thp_migration_supported());
+			return ret;
+		}
+		if (!IS_ENABLED(CONFIG_THP_SWAP)) {
+			VM_BUG_ON(1);
+			return ret;
+		}
+		page = mc_handle_swap_pmd(vma, pmd, &ent);
+		/* The swap cluster has been split under us */
+		if ((page && !PageTransHuge(page)) ||
+		    (!page && ent.val && get_swap_entry_size(ent) == 1)) {
+			__split_huge_swap_pmd(vma, addr, pmdp);
+			ret = MC_TARGET_FALLBACK;
+			goto out;
+		}
+	} else {
+		page = pmd_page(pmd);
+		get_page(page);
 	}
-	page = pmd_page(pmd);
-	VM_BUG_ON_PAGE(!page || !PageHead(page), page);
+	VM_BUG_ON_PAGE(page && !PageHead(page), page);
 	if (!(mc.flags & MOVE_ANON))
-		return ret;
-	if (page->mem_cgroup == mc.from) {
+		goto out;
+	if (!page && !ent.val)
+		goto out;
+	if (page && page->mem_cgroup == mc.from) {
 		ret = MC_TARGET_PAGE;
 		if (target) {
 			get_page(page);
 			target->page = page;
 		}
 	}
+	if (ent.val && !ret && !page &&
+	    mem_cgroup_id(mc.from) == lookup_swap_cgroup_id(ent)) {
+		ret = MC_TARGET_SWAP;
+		if (target)
+			target->ent = ent;
+	}
+out:
+	if (page)
+		put_page(page);
 	return ret;
 }
 #else
 static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
-		unsigned long addr, pmd_t pmd, union mc_target *target)
+		unsigned long addr, pmd_t *pmdp, union mc_target *target)
 {
 	return MC_TARGET_NONE;
 }
@@ -4952,6 +5007,7 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 	struct vm_area_struct *vma = walk->vma;
 	pte_t *pte;
 	spinlock_t *ptl;
+	int ret;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -4960,12 +5016,16 @@ static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
 		 * support transparent huge page with MEMORY_DEVICE_PUBLIC or
 		 * MEMORY_DEVICE_PRIVATE but this might change.
 		 */
-		if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
-			mc.precharge += HPAGE_PMD_NR;
+		ret = get_mctgt_type_thp(vma, addr, pmd, NULL);
 		spin_unlock(ptl);
+		if (ret == MC_TARGET_FALLBACK)
+			goto fallback;
+		if (ret)
+			mc.precharge += HPAGE_PMD_NR;
 		return 0;
 	}
 
+fallback:
 	if (pmd_trans_unstable(pmd))
 		return 0;
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
@@ -5156,6 +5216,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	enum mc_target_type target_type;
 	union mc_target target;
 	struct page *page;
+	swp_entry_t ent;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
@@ -5163,8 +5224,9 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			spin_unlock(ptl);
 			return 0;
 		}
-		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
-		if (target_type == MC_TARGET_PAGE) {
+		target_type = get_mctgt_type_thp(vma, addr, pmd, &target);
+		switch (target_type) {
+		case MC_TARGET_PAGE:
 			page = target.page;
 			if (!isolate_lru_page(page)) {
 				if (!mem_cgroup_move_account(page, true,
@@ -5175,7 +5237,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				putback_lru_page(page);
 			}
 			put_page(page);
-		} else if (target_type == MC_TARGET_DEVICE) {
+			break;
+		case MC_TARGET_DEVICE:
 			page = target.page;
 			if (!mem_cgroup_move_account(page, true,
 						     mc.from, mc.to)) {
@@ -5183,9 +5246,21 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 				mc.moved_charge += HPAGE_PMD_NR;
 			}
 			put_page(page);
+			break;
+		case MC_TARGET_SWAP:
+			ent = target.ent;
+			if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to,
+							  HPAGE_PMD_NR)) {
+				mc.precharge -= HPAGE_PMD_NR;
+				mc.moved_swap += HPAGE_PMD_NR;
+			}
+			break;
+		default:
+			break;
 		}
 		spin_unlock(ptl);
-		return 0;
+		if (target_type != MC_TARGET_FALLBACK)
+			return 0;
 	}
 
 	if (pmd_trans_unstable(pmd))
@@ -5195,7 +5270,6 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 	for (; addr != end; addr += PAGE_SIZE) {
 		pte_t ptent = *(pte++);
 		bool device = false;
-		swp_entry_t ent;
 
 		if (!mc.precharge)
 			break;
@@ -5229,7 +5303,8 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			break;
 		case MC_TARGET_SWAP:
 			ent = target.ent;
-			if (!mem_cgroup_move_swap_account(ent, mc.from, mc.to)) {
+			if (!mem_cgroup_move_swap_account(ent, mc.from,
+							  mc.to, 1)) {
 				mc.precharge--;
 				/* we fixup refcnts and charges later. */
 				mc.moved_swap++;
diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c
index 45affaef3bc6..ccc08e88962a 100644
--- a/mm/swap_cgroup.c
+++ b/mm/swap_cgroup.c
@@ -87,29 +87,58 @@ static struct swap_cgroup *lookup_swap_cgroup(swp_entry_t ent,
 
 /**
  * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
- * @ent: swap entry to be cmpxchged
+ * @ent: the first swap entry to be cmpxchged
  * @old: old id
  * @new: new id
+ * @nr_ents: number of swap entries
  *
  * Returns old id at success, 0 at failure.
  * (There is no mem_cgroup using 0 as its id)
  */
 unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
-					unsigned short old, unsigned short new)
+				   unsigned short old, unsigned short new,
+				   unsigned int nr_ents)
 {
 	struct swap_cgroup_ctrl *ctrl;
-	struct swap_cgroup *sc;
+	struct swap_cgroup *sc_start, *sc;
 	unsigned long flags;
 	unsigned short retval;
+	pgoff_t offset_start = swp_offset(ent), offset;
+	pgoff_t end = offset_start + nr_ents;
 
-	sc = lookup_swap_cgroup(ent, &ctrl);
+	sc_start = lookup_swap_cgroup(ent, &ctrl);
 
 	spin_lock_irqsave(&ctrl->lock, flags);
-	retval = sc->id;
-	if (retval == old)
+	sc = sc_start;
+	offset = offset_start;
+	for (;;) {
+		if (sc->id != old) {
+			retval = 0;
+			goto out;
+		}
+		offset++;
+		if (offset == end)
+			break;
+		if (offset % SC_PER_PAGE)
+			sc++;
+		else
+			sc = __lookup_swap_cgroup(ctrl, offset);
+	}
+
+	sc = sc_start;
+	offset = offset_start;
+	for (;;) {
 		sc->id = new;
-	else
-		retval = 0;
+		offset++;
+		if (offset == end)
+			break;
+		if (offset % SC_PER_PAGE)
+			sc++;
+		else
+			sc = __lookup_swap_cgroup(ctrl, offset);
+	}
+	retval = old;
+out:
 	spin_unlock_irqrestore(&ctrl->lock, flags);
 	return retval;
 }
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 93b6a5d4e44a..938a800aee12 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1730,6 +1730,20 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
 	return map_swapcount;
 }
 
+#ifdef CONFIG_THP_SWAP
+int get_swap_entry_size(swp_entry_t entry)
+{
+	struct swap_info_struct *si;
+	struct swap_cluster_info *ci;
+
+	si = _swap_info_get(entry);
+	if (!si || !si->cluster_info)
+		return 1;
+	ci = si->cluster_info + swp_offset(entry) / SWAPFILE_CLUSTER;
+	return cluster_is_huge(ci) ? SWAPFILE_CLUSTER : 1;
+}
+#endif
+
 /*
  * We can write to an anon page without COW if there are no other references
  * to it.  And as a side-effect, free up its swap: because the old content
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 15/21] swap: Support to copy PMD swap mapping when fork()
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (13 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 14/21] swap: Support to move swap account for PMD swap mapping Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 16/21] swap: Free PMD swap mapping when zap_huge_pmd() Huang Ying
                   ` (7 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

During fork, the page table need to be copied from parent to child.  A
PMD swap mapping need to be copied too and the swap reference count
need to be increased.

When the huge swap cluster has been split already, we need to split
the PMD swap mapping and fallback to PTE copying.

When swap count continuation failed to allocate a page with
GFP_ATOMIC, we need to unlock the spinlock and try again with
GFP_KERNEL.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/huge_memory.c | 72 ++++++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 57 insertions(+), 15 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ebd043528309..74c8621619cb 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -987,6 +987,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(!pgtable))
 		goto out;
 
+retry:
 	dst_ptl = pmd_lock(dst_mm, dst_pmd);
 	src_ptl = pmd_lockptr(src_mm, src_pmd);
 	spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
@@ -994,26 +995,67 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	ret = -EAGAIN;
 	pmd = *src_pmd;
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
 	if (unlikely(is_swap_pmd(pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
 
-		VM_BUG_ON(!is_pmd_migration_entry(pmd));
-		if (is_write_migration_entry(entry)) {
-			make_migration_entry_read(&entry);
-			pmd = swp_entry_to_pmd(entry);
-			if (pmd_swp_soft_dirty(*src_pmd))
-				pmd = pmd_swp_mksoft_dirty(pmd);
-			set_pmd_at(src_mm, addr, src_pmd, pmd);
+#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+		if (is_migration_entry(entry)) {
+			if (is_write_migration_entry(entry)) {
+				make_migration_entry_read(&entry);
+				pmd = swp_entry_to_pmd(entry);
+				if (pmd_swp_soft_dirty(*src_pmd))
+					pmd = pmd_swp_mksoft_dirty(pmd);
+				set_pmd_at(src_mm, addr, src_pmd, pmd);
+			}
+			add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
+			mm_inc_nr_ptes(dst_mm);
+			pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
+			set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+			ret = 0;
+			goto out_unlock;
 		}
-		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
-		mm_inc_nr_ptes(dst_mm);
-		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
-		set_pmd_at(dst_mm, addr, dst_pmd, pmd);
-		ret = 0;
-		goto out_unlock;
-	}
 #endif
+		if (IS_ENABLED(CONFIG_THP_SWAP) && !non_swap_entry(entry)) {
+			ret = swap_duplicate(&entry, HPAGE_PMD_NR);
+			if (!ret) {
+				add_mm_counter(dst_mm, MM_SWAPENTS,
+					       HPAGE_PMD_NR);
+				mm_inc_nr_ptes(dst_mm);
+				pgtable_trans_huge_deposit(dst_mm, dst_pmd,
+							   pgtable);
+				set_pmd_at(dst_mm, addr, dst_pmd, pmd);
+				/* make sure dst_mm is on swapoff's mmlist. */
+				if (unlikely(list_empty(&dst_mm->mmlist))) {
+					spin_lock(&mmlist_lock);
+					if (list_empty(&dst_mm->mmlist))
+						list_add(&dst_mm->mmlist,
+							 &src_mm->mmlist);
+					spin_unlock(&mmlist_lock);
+				}
+			} else if (ret == -ENOTDIR) {
+				/*
+				 * The huge swap cluster has been split, split
+				 * the PMD swap mapping and fallback to PTE
+				 */
+				__split_huge_swap_pmd(vma, addr, src_pmd);
+				pte_free(dst_mm, pgtable);
+			} else if (ret == -ENOMEM) {
+				spin_unlock(src_ptl);
+				spin_unlock(dst_ptl);
+				ret = add_swap_count_continuation(entry,
+								  GFP_KERNEL);
+				if (ret < 0) {
+					ret = -ENOMEM;
+					pte_free(dst_mm, pgtable);
+					goto out;
+				}
+				goto retry;
+			} else
+				VM_BUG_ON(1);
+			goto out_unlock;
+		}
+		VM_BUG_ON(1);
+	}
 
 	if (unlikely(!pmd_trans_huge(pmd))) {
 		pte_free(dst_mm, pgtable);
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 16/21] swap: Free PMD swap mapping when zap_huge_pmd()
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (14 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 15/21] swap: Support to copy PMD swap mapping when fork() Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 17/21] swap: Support PMD swap mapping for MADV_WILLNEED Huang Ying
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

For a PMD swap mapping, zap_huge_pmd() will clear the PMD and call
free_swap_and_cache() to decrease the swap reference count and maybe
free or split the huge swap cluster and the THP in swap cache.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/huge_memory.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 74c8621619cb..b9c766683ee1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2066,7 +2066,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		spin_unlock(ptl);
 		if (is_huge_zero_pmd(orig_pmd))
 			tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
-	} else if (is_huge_zero_pmd(orig_pmd)) {
+	} else if (pmd_present(orig_pmd) && is_huge_zero_pmd(orig_pmd)) {
 		zap_deposited_table(tlb->mm, pmd);
 		spin_unlock(ptl);
 		tlb_remove_page_size(tlb, pmd_page(orig_pmd), HPAGE_PMD_SIZE);
@@ -2079,17 +2079,27 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			page_remove_rmap(page, true);
 			VM_BUG_ON_PAGE(page_mapcount(page) < 0, page);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-		} else if (thp_migration_supported()) {
-			swp_entry_t entry;
-
-			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
-			entry = pmd_to_swp_entry(orig_pmd);
-			page = pfn_to_page(swp_offset(entry));
+		} else {
+			swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
+
+			if (thp_migration_supported() &&
+			    is_migration_entry(entry))
+				page = pfn_to_page(swp_offset(entry));
+			else if (IS_ENABLED(CONFIG_THP_SWAP) &&
+				 !non_swap_entry(entry))
+				free_swap_and_cache(entry, HPAGE_PMD_NR);
+			else {
+				WARN_ONCE(1,
+"Non present huge pmd without pmd migration or swap enabled!");
+				goto unlock;
+			}
 			flush_needed = 0;
-		} else
-			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+		}
 
-		if (PageAnon(page)) {
+		if (!page) {
+			zap_deposited_table(tlb->mm, pmd);
+			add_mm_counter(tlb->mm, MM_SWAPENTS, -HPAGE_PMD_NR);
+		} else if (PageAnon(page)) {
 			zap_deposited_table(tlb->mm, pmd);
 			add_mm_counter(tlb->mm, MM_ANONPAGES, -HPAGE_PMD_NR);
 		} else {
@@ -2097,7 +2107,7 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				zap_deposited_table(tlb->mm, pmd);
 			add_mm_counter(tlb->mm, mm_counter_file(page), -HPAGE_PMD_NR);
 		}
-
+unlock:
 		spin_unlock(ptl);
 		if (flush_needed)
 			tlb_remove_page_size(tlb, page, HPAGE_PMD_SIZE);
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 17/21] swap: Support PMD swap mapping for MADV_WILLNEED
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (15 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 16/21] swap: Free PMD swap mapping when zap_huge_pmd() Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 18/21] swap: Support PMD swap mapping in mincore() Huang Ying
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

During MADV_WILLNEED, for a PMD swap mapping, if THP swapin is enabled
for the VMA, the whole swap cluster will be swapin.  Otherwise, the
huge swap cluster and the PMD swap mapping will be split and fallback
to PTE swap mapping.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/madvise.c | 26 ++++++++++++++++++++++++--
 1 file changed, 24 insertions(+), 2 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 20101ff125d0..0413659ff6ba 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -196,14 +196,36 @@ static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
 	pte_t *orig_pte;
 	struct vm_area_struct *vma = walk->private;
 	unsigned long index;
+	swp_entry_t entry;
+	struct page *page;
+	pmd_t pmdval;
+
+	pmdval = *pmd;
+	if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(pmdval) &&
+	    !is_pmd_migration_entry(pmdval)) {
+		entry = pmd_to_swp_entry(pmdval);
+		if (!transparent_hugepage_swapin_enabled(vma)) {
+			if (!split_swap_cluster(entry, 0))
+				split_huge_swap_pmd(vma, pmd, start, pmdval);
+		} else {
+			page = read_swap_cache_async(entry,
+						     GFP_HIGHUSER_MOVABLE,
+						     vma, start, false);
+			if (page) {
+				/* The swap cluster has been split under us */
+				if (!PageTransHuge(page))
+					split_huge_swap_pmd(vma, pmd, start,
+							    pmdval);
+				put_page(page);
+			}
+		}
+	}
 
 	if (pmd_none_or_trans_huge_or_clear_bad(pmd))
 		return 0;
 
 	for (index = start; index != end; index += PAGE_SIZE) {
 		pte_t pte;
-		swp_entry_t entry;
-		struct page *page;
 		spinlock_t *ptl;
 
 		orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 18/21] swap: Support PMD swap mapping in mincore()
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (16 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 17/21] swap: Support PMD swap mapping for MADV_WILLNEED Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 19/21] swap: Support PMD swap mapping in common path Huang Ying
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

During mincore(), for PMD swap mapping, swap cache will be looked up.
If the resulting page isn't compound page, the PMD swap mapping will
be split and fallback to PTE swap mapping processing.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/mincore.c | 37 +++++++++++++++++++++++++++++++------
 1 file changed, 31 insertions(+), 6 deletions(-)

diff --git a/mm/mincore.c b/mm/mincore.c
index aa0e542569f9..1d861fac82ee 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -48,7 +48,8 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
  * and is up to date; i.e. that no page-in operation would be required
  * at this time if an application were to map and access this page.
  */
-static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
+static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff,
+				  bool *compound)
 {
 	unsigned char present = 0;
 	struct page *page;
@@ -86,6 +87,8 @@ static unsigned char mincore_page(struct address_space *mapping, pgoff_t pgoff)
 #endif
 	if (page) {
 		present = PageUptodate(page);
+		if (compound)
+			*compound = PageCompound(page);
 		put_page(page);
 	}
 
@@ -103,7 +106,8 @@ static int __mincore_unmapped_range(unsigned long addr, unsigned long end,
 
 		pgoff = linear_page_index(vma, addr);
 		for (i = 0; i < nr; i++, pgoff++)
-			vec[i] = mincore_page(vma->vm_file->f_mapping, pgoff);
+			vec[i] = mincore_page(vma->vm_file->f_mapping,
+					      pgoff, NULL);
 	} else {
 		for (i = 0; i < nr; i++)
 			vec[i] = 0;
@@ -127,14 +131,36 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	pte_t *ptep;
 	unsigned char *vec = walk->private;
 	int nr = (end - addr) >> PAGE_SHIFT;
+	swp_entry_t entry;
 
 	ptl = pmd_trans_huge_lock(pmd, vma);
 	if (ptl) {
-		memset(vec, 1, nr);
+		unsigned char val = 1;
+		bool compound;
+
+		if (IS_ENABLED(CONFIG_THP_SWAP) && is_swap_pmd(*pmd)) {
+			entry = pmd_to_swp_entry(*pmd);
+			if (!non_swap_entry(entry)) {
+				val = mincore_page(swap_address_space(entry),
+						   swp_offset(entry),
+						   &compound);
+				/*
+				 * The huge swap cluster has been
+				 * split under us
+				 */
+				if (!compound) {
+					__split_huge_swap_pmd(vma, addr, pmd);
+					spin_unlock(ptl);
+					goto fallback;
+				}
+			}
+		}
+		memset(vec, val, nr);
 		spin_unlock(ptl);
 		goto out;
 	}
 
+fallback:
 	if (pmd_trans_unstable(pmd)) {
 		__mincore_unmapped_range(addr, end, vma, vec);
 		goto out;
@@ -150,8 +176,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		else if (pte_present(pte))
 			*vec = 1;
 		else { /* pte is a swap entry */
-			swp_entry_t entry = pte_to_swp_entry(pte);
-
+			entry = pte_to_swp_entry(pte);
 			if (non_swap_entry(entry)) {
 				/*
 				 * migration or hwpoison entries are always
@@ -161,7 +186,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 			} else {
 #ifdef CONFIG_SWAP
 				*vec = mincore_page(swap_address_space(entry),
-						    swp_offset(entry));
+						    swp_offset(entry), NULL);
 #else
 				WARN_ON(1);
 				*vec = 1;
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 19/21] swap: Support PMD swap mapping in common path
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (17 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 18/21] swap: Support PMD swap mapping in mincore() Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 20/21] swap: create PMD swap mapping when unmap the THP Huang Ying
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

Original code is only for PMD migration entry, it is revised to
support PMD swap mapping.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 fs/proc/task_mmu.c | 12 +++++-------
 mm/gup.c           | 36 ++++++++++++++++++++++++------------
 mm/huge_memory.c   |  7 ++++---
 mm/mempolicy.c     |  2 +-
 4 files changed, 34 insertions(+), 23 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 0995a84c78dc..befac96b42d9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -984,7 +984,7 @@ static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
 		pmd = pmd_clear_soft_dirty(pmd);
 
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
-	} else if (is_migration_entry(pmd_to_swp_entry(pmd))) {
+	} else if (is_swap_pmd(pmd)) {
 		pmd = pmd_swp_clear_soft_dirty(pmd);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
 	}
@@ -1314,9 +1314,8 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			if (pm->show_pfn)
 				frame = pmd_pfn(pmd) +
 					((addr & ~PMD_MASK) >> PAGE_SHIFT);
-		}
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-		else if (is_swap_pmd(pmd)) {
+		} else if (IS_ENABLED(CONFIG_HAVE_PMD_SWAP_ENTRY) &&
+			   is_swap_pmd(pmd)) {
 			swp_entry_t entry = pmd_to_swp_entry(pmd);
 			unsigned long offset;
 
@@ -1329,10 +1328,9 @@ static int pagemap_pmd_range(pmd_t *pmdp, unsigned long addr, unsigned long end,
 			flags |= PM_SWAP;
 			if (pmd_swp_soft_dirty(pmd))
 				flags |= PM_SOFT_DIRTY;
-			VM_BUG_ON(!is_pmd_migration_entry(pmd));
-			page = migration_entry_to_page(entry);
+			if (is_pmd_migration_entry(pmd))
+				page = migration_entry_to_page(entry);
 		}
-#endif
 
 		if (page && page_mapcount(page) == 1)
 			flags |= PM_MMAP_EXCLUSIVE;
diff --git a/mm/gup.c b/mm/gup.c
index 08eb350e0f35..17fd850aa2cc 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -216,6 +216,7 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	struct page *page;
 	struct mm_struct *mm = vma->vm_mm;
+	swp_entry_t entry;
 
 	pmd = pmd_offset(pudp, address);
 	/*
@@ -243,18 +244,22 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 	if (!pmd_present(pmdval)) {
 		if (likely(!(flags & FOLL_MIGRATION)))
 			return no_page_table(vma, flags);
-		VM_BUG_ON(thp_migration_supported() &&
-				  !is_pmd_migration_entry(pmdval));
-		if (is_pmd_migration_entry(pmdval))
+		entry = pmd_to_swp_entry(pmdval);
+		if (thp_migration_supported() && is_migration_entry(entry)) {
 			pmd_migration_entry_wait(mm, pmd);
-		pmdval = READ_ONCE(*pmd);
-		/*
-		 * MADV_DONTNEED may convert the pmd to null because
-		 * mmap_sem is held in read mode
-		 */
-		if (pmd_none(pmdval))
+			pmdval = READ_ONCE(*pmd);
+			/*
+			 * MADV_DONTNEED may convert the pmd to null because
+			 * mmap_sem is held in read mode
+			 */
+			if (pmd_none(pmdval))
+				return no_page_table(vma, flags);
+			goto retry;
+		}
+		if (IS_ENABLED(CONFIG_THP_SWAP) && !non_swap_entry(entry))
 			return no_page_table(vma, flags);
-		goto retry;
+		WARN_ON(1);
+		return no_page_table(vma, flags);
 	}
 	if (pmd_devmap(pmdval)) {
 		ptl = pmd_lock(mm, pmd);
@@ -276,11 +281,18 @@ static struct page *follow_pmd_mask(struct vm_area_struct *vma,
 		return no_page_table(vma, flags);
 	}
 	if (unlikely(!pmd_present(*pmd))) {
+		entry = pmd_to_swp_entry(*pmd);
 		spin_unlock(ptl);
 		if (likely(!(flags & FOLL_MIGRATION)))
 			return no_page_table(vma, flags);
-		pmd_migration_entry_wait(mm, pmd);
-		goto retry_locked;
+		if (thp_migration_supported() && is_migration_entry(entry)) {
+			pmd_migration_entry_wait(mm, pmd);
+			goto retry_locked;
+		}
+		if (IS_ENABLED(CONFIG_THP_SWAP) && !non_swap_entry(entry))
+			return no_page_table(vma, flags);
+		WARN_ON(1);
+		return no_page_table(vma, flags);
 	}
 	if (unlikely(!pmd_trans_huge(*pmd))) {
 		spin_unlock(ptl);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b9c766683ee1..abefc50b08b7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2133,7 +2133,7 @@ static inline int pmd_move_must_withdraw(spinlock_t *new_pmd_ptl,
 static pmd_t move_soft_dirty_pmd(pmd_t pmd)
 {
 #ifdef CONFIG_MEM_SOFT_DIRTY
-	if (unlikely(is_pmd_migration_entry(pmd)))
+	if (unlikely(is_swap_pmd(pmd)))
 		pmd = pmd_swp_mksoft_dirty(pmd);
 	else if (pmd_present(pmd))
 		pmd = pmd_mksoft_dirty(pmd);
@@ -2219,11 +2219,12 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 	preserve_write = prot_numa && pmd_write(*pmd);
 	ret = 1;
 
-#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
+#if defined(CONFIG_ARCH_ENABLE_THP_MIGRATION) || defined(CONFIG_THP_SWAP)
 	if (is_swap_pmd(*pmd)) {
 		swp_entry_t entry = pmd_to_swp_entry(*pmd);
 
-		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+		VM_BUG_ON(!IS_ENABLED(CONFIG_THP_SWAP) &&
+			  !is_migration_entry(entry));
 		if (is_write_migration_entry(entry)) {
 			pmd_t newpmd;
 			/*
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5837a067124d..7a5c1d2faea2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -436,7 +436,7 @@ static int queue_pages_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr,
 	struct queue_pages *qp = walk->private;
 	unsigned long flags;
 
-	if (unlikely(is_pmd_migration_entry(*pmd))) {
+	if (unlikely(is_swap_pmd(*pmd))) {
 		ret = 1;
 		goto unlock;
 	}
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 20/21] swap: create PMD swap mapping when unmap the THP
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (18 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 19/21] swap: Support PMD swap mapping in common path Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-10  7:19 ` [PATCH -V6 21/21] swap: Update help of CONFIG_THP_SWAP Huang Ying
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

This is the final step of the THP swapin support.  When reclaiming a
anonymous THP, after allocating the huge swap cluster and add the THP
into swap cache, the PMD page mapping will be changed to the mapping
to the swap space.  Previously, the PMD page mapping will be split
before being changed.  In this patch, the unmap code is enhanced not
to split the PMD mapping, but create a PMD swap mapping to replace it
instead.  So later when clear the SWAP_HAS_CACHE flag in the last step
of swapout, the huge swap cluster will be kept instead of being split,
and when swapin, the huge swap cluster will be read in one piece into a
THP.  That is, the THP will not be split during swapout/swapin.  This
can eliminate the overhead of splitting/collapsing, and reduce the
page fault count, etc.  But more important, the utilization of THP is
improved greatly, that is, much more THP will be kept when swapping is
used, so that we can take full advantage of THP including its high
performance for swapout/swapin.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/huge_mm.h | 11 +++++++++++
 mm/huge_memory.c        | 30 ++++++++++++++++++++++++++++++
 mm/rmap.c               | 43 ++++++++++++++++++++++++++++++++++++++++++-
 mm/vmscan.c             |  6 +-----
 4 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index e573774f9014..f6370e8c7742 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -375,6 +375,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+struct page_vma_mapped_walk;
+
 #ifdef CONFIG_THP_SWAP
 extern void __split_huge_swap_pmd(struct vm_area_struct *vma,
 				  unsigned long haddr,
@@ -382,6 +384,8 @@ extern void __split_huge_swap_pmd(struct vm_area_struct *vma,
 extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			       unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
+extern bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+	struct page *page, unsigned long address, pmd_t pmdval);
 
 static inline bool transparent_hugepage_swapin_enabled(
 	struct vm_area_struct *vma)
@@ -423,6 +427,13 @@ static inline int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	return 0;
 }
 
+static inline bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw,
+				      struct page *page, unsigned long address,
+				      pmd_t pmdval)
+{
+	return false;
+}
+
 static inline bool transparent_hugepage_swapin_enabled(
 	struct vm_area_struct *vma)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index abefc50b08b7..87795529c547 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1931,6 +1931,36 @@ int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd)
 	count_vm_event(THP_SWPIN_FALLBACK);
 	goto fallback;
 }
+
+bool set_pmd_swap_entry(struct page_vma_mapped_walk *pvmw, struct page *page,
+			unsigned long address, pmd_t pmdval)
+{
+	struct vm_area_struct *vma = pvmw->vma;
+	struct mm_struct *mm = vma->vm_mm;
+	pmd_t swp_pmd;
+	swp_entry_t entry = { .val = page_private(page) };
+
+	if (swap_duplicate(&entry, HPAGE_PMD_NR) < 0) {
+		set_pmd_at(mm, address, pvmw->pmd, pmdval);
+		return false;
+	}
+	if (list_empty(&mm->mmlist)) {
+		spin_lock(&mmlist_lock);
+		if (list_empty(&mm->mmlist))
+			list_add(&mm->mmlist, &init_mm.mmlist);
+		spin_unlock(&mmlist_lock);
+	}
+	add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+	add_mm_counter(mm, MM_SWAPENTS, HPAGE_PMD_NR);
+	swp_pmd = swp_entry_to_pmd(entry);
+	if (pmd_soft_dirty(pmdval))
+		swp_pmd = pmd_swp_mksoft_dirty(swp_pmd);
+	set_pmd_at(mm, address, pvmw->pmd, swp_pmd);
+
+	page_remove_rmap(page, true);
+	put_page(page);
+	return true;
+}
 #endif
 
 static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
diff --git a/mm/rmap.c b/mm/rmap.c
index 3bb4be720bc0..a180cb1fe2db 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1413,11 +1413,52 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 				continue;
 		}
 
+		address = pvmw.address;
+
+#ifdef CONFIG_THP_SWAP
+		/* PMD-mapped THP swap entry */
+		if (IS_ENABLED(CONFIG_THP_SWAP) &&
+		    !pvmw.pte && PageAnon(page)) {
+			pmd_t pmdval;
+
+			VM_BUG_ON_PAGE(PageHuge(page) ||
+				       !PageTransCompound(page), page);
+
+			flush_cache_range(vma, address,
+					  address + HPAGE_PMD_SIZE);
+			mmu_notifier_invalidate_range_start(mm, address,
+					address + HPAGE_PMD_SIZE);
+			if (should_defer_flush(mm, flags)) {
+				/* check comments for PTE below */
+				pmdval = pmdp_huge_get_and_clear(mm, address,
+								 pvmw.pmd);
+				set_tlb_ubc_flush_pending(mm,
+							  pmd_dirty(pmdval));
+			} else
+				pmdval = pmdp_huge_clear_flush(vma, address,
+							       pvmw.pmd);
+
+			/*
+			 * Move the dirty bit to the page. Now the pmd
+			 * is gone.
+			 */
+			if (pmd_dirty(pmdval))
+				set_page_dirty(page);
+
+			/* Update high watermark before we lower rss */
+			update_hiwater_rss(mm);
+
+			ret = set_pmd_swap_entry(&pvmw, page, address, pmdval);
+			mmu_notifier_invalidate_range_end(mm, address,
+					address + HPAGE_PMD_SIZE);
+			continue;
+		}
+#endif
+
 		/* Unexpected PMD-mapped THP? */
 		VM_BUG_ON_PAGE(!pvmw.pte, page);
 
 		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
-		address = pvmw.address;
 
 		if (PageHuge(page)) {
 			if (huge_pmd_unshare(mm, &address, pvmw.pte)) {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a859f64a2166..017e9060082f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1318,11 +1318,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		 * processes. Try to unmap it here.
 		 */
 		if (page_mapped(page)) {
-			enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
-
-			if (unlikely(PageTransHuge(page)))
-				flags |= TTU_SPLIT_HUGE_PMD;
-			if (!try_to_unmap(page, flags)) {
+			if (!try_to_unmap(page, ttu_flags | TTU_BATCH_FLUSH)) {
 				nr_unmap_fail++;
 				goto activate_locked;
 			}
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH -V6 21/21] swap: Update help of CONFIG_THP_SWAP
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (19 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 20/21] swap: create PMD swap mapping when unmap the THP Huang Ying
@ 2018-10-10  7:19 ` Huang Ying
  2018-10-23 12:27 ` [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Daniel Jordan
  2018-11-09  1:12 ` Daniel Jordan
  22 siblings, 0 replies; 32+ messages in thread
From: Huang Ying @ 2018-10-10  7:19 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Huang Ying, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan, Dan Williams

The help of CONFIG_THP_SWAP is updated to reflect the latest progress
of THP (Tranparent Huge Page) swap optimization.

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/Kconfig | 2 --
 1 file changed, 2 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index 44f7d72010fd..63caae29ae2b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -419,8 +419,6 @@ config THP_SWAP
 	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP
 	help
 	  Swap transparent huge pages in one piece, without splitting.
-	  XXX: For now, swap cluster backing transparent huge page
-	  will be split after swapout.
 
 	  For selection by architectures with reasonable THP sizes.
 
-- 
2.16.4


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (20 preceding siblings ...)
  2018-10-10  7:19 ` [PATCH -V6 21/21] swap: Update help of CONFIG_THP_SWAP Huang Ying
@ 2018-10-23 12:27 ` Daniel Jordan
  2018-10-24  3:31   ` Huang, Ying
  2018-11-09  1:12 ` Daniel Jordan
  22 siblings, 1 reply; 32+ messages in thread
From: Daniel Jordan @ 2018-10-23 12:27 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

On Wed, Oct 10, 2018 at 03:19:03PM +0800, Huang Ying wrote:
> And for all, Any comment is welcome!
> 
> This patchset is based on the 2018-10-3 head of mmotm/master.

There seems to be some infrequent memory corruption with THPs that have been
swapped out: page contents differ after swapin.

Reproducer at the bottom.  Part of some tests I'm writing, had to separate it a
little hack-ily.  Basically it writes the word offset _at_ each word offset in
a memory blob, tries to push it to swap, and verifies the offset is the same
after swapin.

I ran with THP enabled=always.  THP swapin_enabled could be always or never, it
happened with both.  Every time swapping occurred, a single THP-sized chunk in
the middle of the blob had different offsets.  Example:

** > word corruption gap
** corruption detected 14929920 bytes in (got 15179776, expected 14929920) **
** corruption detected 14929928 bytes in (got 15179784, expected 14929928) **
** corruption detected 14929936 bytes in (got 15179792, expected 14929936) **
...pattern continues...
** corruption detected 17027048 bytes in (got 15179752, expected 17027048) **
** corruption detected 17027056 bytes in (got 15179760, expected 17027056) **
** corruption detected 17027064 bytes in (got 15179768, expected 17027064) **
100.0% of memory was swapped out at mincore time
0.00305% of pages were corrupted (first corrupt word 14929920, last corrupt word 17027064)

The problem goes away with THP enabled=never, and I don't see it on 2018-10-3
mmotm/master with THP enabled=always.

The server had an NVMe swap device and ~760G memory over two nodes, and the
program was always run like this:  swap-verify -s $((64 * 2**30))

The kernels had one extra patch, Alexander Duyck's
"dma-direct: Fix return value of dma_direct_supported", which was required to
get them to build.

---------------------------------------8<---------------------------------------

/*
 * swap-verify.c - helper to verify contents of swapped out pages
 *
 * Daniel Jordan <daniel.m.jordan@oracle.com>
 */

#define _GNU_SOURCE
#include <getopt.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define TEST_SUCCESS	0
#define TEST_FAIL	1
#define TEST_SKIP	2

static void usagedie(int exitcode)
{
	fprintf(stderr, "usage: swap-verify\n"
			"    -h                show this message\n"
			"    -s bytes\n");
	exit(exitcode);
}

int main(int argc, char **argv)
{
	int c, pgsize;
	char *pages;
	unsigned char *mincore_vec;
	size_t i, j, nr_pages_swapped, nr_pages_corrupted;
	size_t bytes = 1ul << 30;	/* default 1G */
	ssize_t bytes_read;
	size_t first_corrupt_word, last_corrupt_word, prev_corrupt_word;

	while ((c = getopt(argc, argv, "hs:")) != -1) {
		switch (c) {
		case 'h':
			usagedie(0);
			break;
		case 's':
			bytes = strtoul(optarg, NULL, 10);
			break;
		default:
			fprintf(stderr, "unrecognized option %c\n", c);
			exit(TEST_SKIP);
		}
	}

	pgsize = getpagesize();

	if ((mincore_vec = calloc(bytes / pgsize, 1)) == NULL) {
		perror("calloc");
		exit(TEST_SKIP);
	}

	if ((pages = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0)) == MAP_FAILED) {
		perror("mmap");
		exit(TEST_SKIP);
	}

	/* Fill pages with a "random" pattern. */
	for (i = 0; i < bytes; i += sizeof(unsigned long))
		*(unsigned long *)(pages + i) = i;

	/* Now fill memory, trying to push the pages just allocated to swap. */
	system("./use-mem-total");

	/* Is the memory swapped out? */
	if (mincore(pages, bytes, mincore_vec) == -1) {
		perror("mincore");
		exit(TEST_SKIP);
	}

	nr_pages_swapped = 0;
	nr_pages_corrupted = 0;
	first_corrupt_word = bytes;
	last_corrupt_word = 0;
	prev_corrupt_word = bytes;
	for (i = 0; i < bytes; i += pgsize) {
		bool page_corrupt = false;
		if (mincore_vec[i / pgsize] & 1) {
			/* Resident, don't bother checking. */
			continue;
		}
		++nr_pages_swapped;
		for (j = i; j < i + pgsize; j += sizeof(unsigned long)) {
			unsigned long val = *(unsigned long *)(pages + j);
			if (val != j) {
				if (!page_corrupt)
					++nr_pages_corrupted;
				page_corrupt = true;
				if (j - prev_corrupt_word != sizeof(unsigned long))
					fprintf(stderr, "** > word corruption gap\n");
				if (j % (1ul << 21) == 0)
					fprintf(stderr, "-- THP boundary\n");
				if (j < first_corrupt_word)
					first_corrupt_word = j;
				if (j > last_corrupt_word)
					last_corrupt_word = j;
				fprintf(stderr, "** corruption detected %lu "
					"bytes in (got %lu, expected %lu) **\n",
					j, val, j);
				prev_corrupt_word = j;
			}
		}
	}
	fprintf(stderr, "%.1f%% of memory was swapped out at mincore time\n",
	       ((double)nr_pages_swapped / (bytes / pgsize)) * 100);

	if (nr_pages_corrupted) {
		fprintf(stderr, "%.5f%% of pages were corrupted (first corrupt "
			"word %lu, last corrupt word %lu)\n",
			((double)nr_pages_corrupted / (bytes / pgsize)) * 100,
			first_corrupt_word, last_corrupt_word);
	} else {
		fprintf(stderr, "no memory corruption detected\n");
	}
	return (nr_pages_corrupted) ? TEST_FAIL : TEST_SUCCESS;
}

---------------------------------------8<---------------------------------------

#!/usr/bin/env bash
#
# use-mem-total
#
# Helper that allocates MemTotal and exits immediately.  Useful for causing
# swapping of previously allocated memory.

# XXX fix paths
source /path/to/vm-scalability/hw_vars

/path/to/usemem --thread $nr_task --step $pagesize -q --repeat 4 \
	$(( mem * 11 / 10 / nr_task )) > /dev/null

---------------------------------------8<---------------------------------------

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece
  2018-10-23 12:27 ` [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Daniel Jordan
@ 2018-10-24  3:31   ` Huang, Ying
  2018-10-24 17:24     ` Daniel Jordan
  0 siblings, 1 reply; 32+ messages in thread
From: Huang, Ying @ 2018-10-24  3:31 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, linux-mm, linux-kernel, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan

Hi, Daniel,

Daniel Jordan <daniel.m.jordan@oracle.com> writes:

> On Wed, Oct 10, 2018 at 03:19:03PM +0800, Huang Ying wrote:
>> And for all, Any comment is welcome!
>> 
>> This patchset is based on the 2018-10-3 head of mmotm/master.
>
> There seems to be some infrequent memory corruption with THPs that have been
> swapped out: page contents differ after swapin.

Thanks a lot for testing this!  I know there were big effort behind this
and it definitely will improve the quality of the patchset greatly!

> Reproducer at the bottom.  Part of some tests I'm writing, had to separate it a
> little hack-ily.  Basically it writes the word offset _at_ each word offset in
> a memory blob, tries to push it to swap, and verifies the offset is the same
> after swapin.
>
> I ran with THP enabled=always.  THP swapin_enabled could be always or never, it
> happened with both.  Every time swapping occurred, a single THP-sized chunk in
> the middle of the blob had different offsets.  Example:
>
> ** > word corruption gap
> ** corruption detected 14929920 bytes in (got 15179776, expected 14929920) **
> ** corruption detected 14929928 bytes in (got 15179784, expected 14929928) **
> ** corruption detected 14929936 bytes in (got 15179792, expected 14929936) **
> ...pattern continues...
> ** corruption detected 17027048 bytes in (got 15179752, expected 17027048) **
> ** corruption detected 17027056 bytes in (got 15179760, expected 17027056) **
> ** corruption detected 17027064 bytes in (got 15179768, expected 17027064) **

15179776 < 15179xxx <= 17027064

15179776 % 4096 = 0

And 15179776 = 15179768 + 8

So I guess we have some alignment bug.  Could you try the patches
attached?  It deal with some alignment issue.

> 100.0% of memory was swapped out at mincore time
> 0.00305% of pages were corrupted (first corrupt word 14929920, last corrupt word 17027064)
>
> The problem goes away with THP enabled=never, and I don't see it on 2018-10-3
> mmotm/master with THP enabled=always.
>
> The server had an NVMe swap device and ~760G memory over two nodes, and the
> program was always run like this:  swap-verify -s $((64 * 2**30))
>
> The kernels had one extra patch, Alexander Duyck's
> "dma-direct: Fix return value of dma_direct_supported", which was required to
> get them to build.
>

Thanks again!

Best Regards,
Huang, Ying

---------------------------------->8-----------------------------
From e1c3e4f565deeb8245bdc4ee53a1f1e4188b6d4a Mon Sep 17 00:00:00 2001
From: Huang Ying <ying.huang@intel.com>
Date: Wed, 24 Oct 2018 11:24:15 +0800
Subject: [PATCH] Fix alignment bug

---
 include/linux/huge_mm.h | 6 ++----
 mm/huge_memory.c        | 9 ++++-----
 mm/swap_state.c         | 2 +-
 3 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 96baae08f47c..e7b3527bc493 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -379,8 +379,7 @@ struct page_vma_mapped_walk;
 
 #ifdef CONFIG_THP_SWAP
 extern void __split_huge_swap_pmd(struct vm_area_struct *vma,
-				  unsigned long haddr,
-				  pmd_t *pmd);
+				  unsigned long addr, pmd_t *pmd);
 extern int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 			       unsigned long address, pmd_t orig_pmd);
 extern int do_huge_pmd_swap_page(struct vm_fault *vmf, pmd_t orig_pmd);
@@ -411,8 +410,7 @@ static inline bool transparent_hugepage_swapin_enabled(
 }
 #else /* CONFIG_THP_SWAP */
 static inline void __split_huge_swap_pmd(struct vm_area_struct *vma,
-					 unsigned long haddr,
-					 pmd_t *pmd)
+					 unsigned long addr, pmd_t *pmd)
 {
 }
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index ed64266b63dc..b2af3bff7624 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1731,10 +1731,11 @@ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf, pmd_t pmd)
 #ifdef CONFIG_THP_SWAP
 /* Convert a PMD swap mapping to a set of PTE swap mappings */
 void __split_huge_swap_pmd(struct vm_area_struct *vma,
-			   unsigned long haddr,
+			   unsigned long addr,
 			   pmd_t *pmd)
 {
 	struct mm_struct *mm = vma->vm_mm;
+	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	pgtable_t pgtable;
 	pmd_t _pmd;
 	swp_entry_t entry;
@@ -1772,7 +1773,7 @@ int split_huge_swap_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, orig_pmd))
-		__split_huge_swap_pmd(vma, address & HPAGE_PMD_MASK, pmd);
+		__split_huge_swap_pmd(vma, address, pmd);
 	else
 		ret = -ENOENT;
 	spin_unlock(ptl);
@@ -2013,9 +2014,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			 * swap mapping and operate on the PTEs
 			 */
 			if (next - addr != HPAGE_PMD_SIZE) {
-				unsigned long haddr = addr & HPAGE_PMD_MASK;
-
-				__split_huge_swap_pmd(vma, haddr, pmd);
+				__split_huge_swap_pmd(vma, addr, pmd);
 				goto out;
 			}
 			free_swap_and_cache(entry, HPAGE_PMD_NR);
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 784ad6388da0..fd143ef82351 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -451,7 +451,7 @@ struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
 		/* May fail (-ENOMEM) if XArray node allocation failed. */
 		__SetPageLocked(new_page);
 		__SetPageSwapBacked(new_page);
-		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
+		err = add_to_swap_cache(new_page, hentry, gfp_mask & GFP_KERNEL);
 		if (likely(!err)) {
 			/* Initiate read into locked page */
 			SetPageWorkingset(new_page);
-- 
2.18.1


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece
  2018-10-24  3:31   ` Huang, Ying
@ 2018-10-24 17:24     ` Daniel Jordan
  2018-10-25  0:42       ` Huang, Ying
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Jordan @ 2018-10-24 17:24 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Daniel Jordan, Andrew Morton, linux-mm, linux-kernel,
	Kirill A. Shutemov, Andrea Arcangeli, Michal Hocko,
	Johannes Weiner, Shaohua Li, Hugh Dickins, Minchan Kim,
	Rik van Riel, Dave Hansen, Naoya Horiguchi, Zi Yan

On Wed, Oct 24, 2018 at 11:31:42AM +0800, Huang, Ying wrote:
> Hi, Daniel,
> 
> Daniel Jordan <daniel.m.jordan@oracle.com> writes:
> 
> > On Wed, Oct 10, 2018 at 03:19:03PM +0800, Huang Ying wrote:
> >> And for all, Any comment is welcome!
> >> 
> >> This patchset is based on the 2018-10-3 head of mmotm/master.
> >
> > There seems to be some infrequent memory corruption with THPs that have been
> > swapped out: page contents differ after swapin.
> 
> Thanks a lot for testing this!  I know there were big effort behind this
> and it definitely will improve the quality of the patchset greatly!

You're welcome!  Hopefully I'll have more results and tests to share in the
next two weeks.

> 
> > Reproducer at the bottom.  Part of some tests I'm writing, had to separate it a
> > little hack-ily.  Basically it writes the word offset _at_ each word offset in
> > a memory blob, tries to push it to swap, and verifies the offset is the same
> > after swapin.
> >
> > I ran with THP enabled=always.  THP swapin_enabled could be always or never, it
> > happened with both.  Every time swapping occurred, a single THP-sized chunk in
> > the middle of the blob had different offsets.  Example:
> >
> > ** > word corruption gap
> > ** corruption detected 14929920 bytes in (got 15179776, expected 14929920) **
> > ** corruption detected 14929928 bytes in (got 15179784, expected 14929928) **
> > ** corruption detected 14929936 bytes in (got 15179792, expected 14929936) **
> > ...pattern continues...
> > ** corruption detected 17027048 bytes in (got 15179752, expected 17027048) **
> > ** corruption detected 17027056 bytes in (got 15179760, expected 17027056) **
> > ** corruption detected 17027064 bytes in (got 15179768, expected 17027064) **
> 
> 15179776 < 15179xxx <= 17027064
> 
> 15179776 % 4096 = 0
> 
> And 15179776 = 15179768 + 8
> 
> So I guess we have some alignment bug.  Could you try the patches
> attached?  It deal with some alignment issue.

That fixed it.  And removed three lines of code.  Nice :)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 06/21] swap: Support PMD swap mapping when splitting huge PMD
  2018-10-10  7:19 ` [PATCH -V6 06/21] swap: Support PMD swap mapping when splitting huge PMD Huang Ying
@ 2018-10-24 17:25   ` Daniel Jordan
  2018-10-25  0:54     ` Huang, Ying
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Jordan @ 2018-10-24 17:25 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

On Wed, Oct 10, 2018 at 03:19:09PM +0800, Huang Ying wrote:
> +#ifdef CONFIG_THP_SWAP
> +/*
> + * The corresponding page table shouldn't be changed under us, that
> + * is, the page table lock should be held.
> + */
> +int split_swap_cluster_map(swp_entry_t entry)
> +{
> +	struct swap_info_struct *si;
> +	struct swap_cluster_info *ci;
> +	unsigned long offset = swp_offset(entry);
> +
> +	VM_BUG_ON(!IS_ALIGNED(offset, SWAPFILE_CLUSTER));
> +	si = _swap_info_get(entry);
> +	if (!si)
> +		return -EBUSY;

I think this return value doesn't get used anywhere?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 14/21] swap: Support to move swap account for PMD swap mapping
  2018-10-10  7:19 ` [PATCH -V6 14/21] swap: Support to move swap account for PMD swap mapping Huang Ying
@ 2018-10-24 17:27   ` Daniel Jordan
  2018-10-25  1:06     ` Huang, Ying
  0 siblings, 1 reply; 32+ messages in thread
From: Daniel Jordan @ 2018-10-24 17:27 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

On Wed, Oct 10, 2018 at 03:19:17PM +0800, Huang Ying wrote:
> +static struct page *mc_handle_swap_pmd(struct vm_area_struct *vma,
> +			pmd_t pmd, swp_entry_t *entry)
> +{

Got
/home/dbbench/linux/mm/memcontrol.c:4719:21: warning: ‘mc_handle_swap_pmd’ defined but not used [-Wunused-function]
 static struct page *mc_handle_swap_pmd(struct vm_area_struct *vma,
when
# CONFIG_TRANSPARENT_HUGEPAGE is not set


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece
  2018-10-24 17:24     ` Daniel Jordan
@ 2018-10-25  0:42       ` Huang, Ying
  0 siblings, 0 replies; 32+ messages in thread
From: Huang, Ying @ 2018-10-25  0:42 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, linux-mm, linux-kernel, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan

Daniel Jordan <daniel.m.jordan@oracle.com> writes:

> On Wed, Oct 24, 2018 at 11:31:42AM +0800, Huang, Ying wrote:
>> Hi, Daniel,
>> 
>> Daniel Jordan <daniel.m.jordan@oracle.com> writes:
>> 
>> > On Wed, Oct 10, 2018 at 03:19:03PM +0800, Huang Ying wrote:
>> >> And for all, Any comment is welcome!
>> >> 
>> >> This patchset is based on the 2018-10-3 head of mmotm/master.
>> >
>> > There seems to be some infrequent memory corruption with THPs that have been
>> > swapped out: page contents differ after swapin.
>> 
>> Thanks a lot for testing this!  I know there were big effort behind this
>> and it definitely will improve the quality of the patchset greatly!
>
> You're welcome!  Hopefully I'll have more results and tests to share in the
> next two weeks.
>
>> 
>> > Reproducer at the bottom.  Part of some tests I'm writing, had to separate it a
>> > little hack-ily.  Basically it writes the word offset _at_ each word offset in
>> > a memory blob, tries to push it to swap, and verifies the offset is the same
>> > after swapin.
>> >
>> > I ran with THP enabled=always.  THP swapin_enabled could be always or never, it
>> > happened with both.  Every time swapping occurred, a single THP-sized chunk in
>> > the middle of the blob had different offsets.  Example:
>> >
>> > ** > word corruption gap
>> > ** corruption detected 14929920 bytes in (got 15179776, expected 14929920) **
>> > ** corruption detected 14929928 bytes in (got 15179784, expected 14929928) **
>> > ** corruption detected 14929936 bytes in (got 15179792, expected 14929936) **
>> > ...pattern continues...
>> > ** corruption detected 17027048 bytes in (got 15179752, expected 17027048) **
>> > ** corruption detected 17027056 bytes in (got 15179760, expected 17027056) **
>> > ** corruption detected 17027064 bytes in (got 15179768, expected 17027064) **
>> 
>> 15179776 < 15179xxx <= 17027064
>> 
>> 15179776 % 4096 = 0
>> 
>> And 15179776 = 15179768 + 8
>> 
>> So I guess we have some alignment bug.  Could you try the patches
>> attached?  It deal with some alignment issue.
>
> That fixed it.  And removed three lines of code.  Nice :)

Thanks!  I will merge the fixes into the patchset.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 06/21] swap: Support PMD swap mapping when splitting huge PMD
  2018-10-24 17:25   ` Daniel Jordan
@ 2018-10-25  0:54     ` Huang, Ying
  2018-10-25 15:00       ` Daniel Jordan
  0 siblings, 1 reply; 32+ messages in thread
From: Huang, Ying @ 2018-10-25  0:54 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, linux-mm, linux-kernel, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan

Daniel Jordan <daniel.m.jordan@oracle.com> writes:

> On Wed, Oct 10, 2018 at 03:19:09PM +0800, Huang Ying wrote:
>> +#ifdef CONFIG_THP_SWAP
>> +/*
>> + * The corresponding page table shouldn't be changed under us, that
>> + * is, the page table lock should be held.
>> + */
>> +int split_swap_cluster_map(swp_entry_t entry)
>> +{
>> +	struct swap_info_struct *si;
>> +	struct swap_cluster_info *ci;
>> +	unsigned long offset = swp_offset(entry);
>> +
>> +	VM_BUG_ON(!IS_ALIGNED(offset, SWAPFILE_CLUSTER));
>> +	si = _swap_info_get(entry);
>> +	if (!si)
>> +		return -EBUSY;
>
> I think this return value doesn't get used anywhere?

Yes.  And the error is only possible if page table is corrupted.  So
maybe add a VM_BUG_ON() in it caller __split_huge_swap_pmd()?

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 14/21] swap: Support to move swap account for PMD swap mapping
  2018-10-24 17:27   ` Daniel Jordan
@ 2018-10-25  1:06     ` Huang, Ying
  0 siblings, 0 replies; 32+ messages in thread
From: Huang, Ying @ 2018-10-25  1:06 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: Andrew Morton, linux-mm, linux-kernel, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan

Daniel Jordan <daniel.m.jordan@oracle.com> writes:

> On Wed, Oct 10, 2018 at 03:19:17PM +0800, Huang Ying wrote:
>> +static struct page *mc_handle_swap_pmd(struct vm_area_struct *vma,
>> +			pmd_t pmd, swp_entry_t *entry)
>> +{
>
> Got
> /home/dbbench/linux/mm/memcontrol.c:4719:21: warning: ‘mc_handle_swap_pmd’ defined but not used [-Wunused-function]
>  static struct page *mc_handle_swap_pmd(struct vm_area_struct *vma,
> when
> # CONFIG_TRANSPARENT_HUGEPAGE is not set

Thanks for pointing this out.  Will fix it in the next version.

Best Regards,
Huang, Ying

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 06/21] swap: Support PMD swap mapping when splitting huge PMD
  2018-10-25  0:54     ` Huang, Ying
@ 2018-10-25 15:00       ` Daniel Jordan
  0 siblings, 0 replies; 32+ messages in thread
From: Daniel Jordan @ 2018-10-25 15:00 UTC (permalink / raw)
  To: Huang, Ying
  Cc: Daniel Jordan, Andrew Morton, linux-mm, linux-kernel,
	Kirill A. Shutemov, Andrea Arcangeli, Michal Hocko,
	Johannes Weiner, Shaohua Li, Hugh Dickins, Minchan Kim,
	Rik van Riel, Dave Hansen, Naoya Horiguchi, Zi Yan

On Thu, Oct 25, 2018 at 08:54:16AM +0800, Huang, Ying wrote:
> Daniel Jordan <daniel.m.jordan@oracle.com> writes:
> 
> > On Wed, Oct 10, 2018 at 03:19:09PM +0800, Huang Ying wrote:
> >> +#ifdef CONFIG_THP_SWAP
> >> +/*
> >> + * The corresponding page table shouldn't be changed under us, that
> >> + * is, the page table lock should be held.
> >> + */
> >> +int split_swap_cluster_map(swp_entry_t entry)
> >> +{
> >> +	struct swap_info_struct *si;
> >> +	struct swap_cluster_info *ci;
> >> +	unsigned long offset = swp_offset(entry);
> >> +
> >> +	VM_BUG_ON(!IS_ALIGNED(offset, SWAPFILE_CLUSTER));
> >> +	si = _swap_info_get(entry);
> >> +	if (!si)
> >> +		return -EBUSY;
> >
> > I think this return value doesn't get used anywhere?
> 
> Yes.  And the error is only possible if page table is corrupted.  So
> maybe add a VM_BUG_ON() in it caller __split_huge_swap_pmd()?

Taking a second look at this, I see we'd get some nice pr_err message in this
case, so VM_BUG_ON doesn't seem necessary.

Still odd there's an unchecked return value, but it could be useful to future
callers.  Just my nitpick, feel free to leave as is.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece
  2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
                   ` (21 preceding siblings ...)
  2018-10-23 12:27 ` [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Daniel Jordan
@ 2018-11-09  1:12 ` Daniel Jordan
  22 siblings, 0 replies; 32+ messages in thread
From: Daniel Jordan @ 2018-11-09  1:12 UTC (permalink / raw)
  To: Huang Ying
  Cc: Andrew Morton, linux-mm, linux-kernel, Kirill A. Shutemov,
	Andrea Arcangeli, Michal Hocko, Johannes Weiner, Shaohua Li,
	Hugh Dickins, Minchan Kim, Rik van Riel, Dave Hansen,
	Naoya Horiguchi, Zi Yan, Daniel Jordan

On Wed, Oct 10, 2018 at 03:19:03PM +0800, Huang Ying wrote:
> And for all, Any comment is welcome!

Hi Ying, 

Looks like an edge case.  I'd run the program at the bottom like
    ./stress-usage-counts -l 4 -s 4 -U 780g
where the 780g was big enough to cause swapping on the machine.  This allocates
a bunch of THPs in the parent and then forks children that either unmap pieces
of the THPs and then do random reads in the pieces still mapped, or just
randomly read in the whole range without unmapping anything.

I had your patch from the other thread, fyi.

Thanks,
Daniel

[15384.814483] ------------[ cut here ]------------
[15384.820622] kernel BUG at /home/dbbench/src/linux/mm/swapfile.c:4134!
[15384.828793] invalid opcode: 0000 [#1] SMP PTI
[15384.834604] CPU: 15 PID: 27456 Comm: stress-usage-co Kdump: loaded Not tainted 4.19.0-rc6-mm1-thp-swap-v6-gcov+ #3
[15384.847096] Hardware name: Oracle Corporation ORACLE SERVER X7-2/ASM, MB, X7-2, BIOS 41017600 10/06/2017
[15384.858637] RIP: 0010:split_swap_cluster_map+0x172/0x1d0
[15384.865493] Code: 89 4c 01 01 e9 2a ff ff ff 5b 5d 31 c0 48 83 05 1b 89 4c 01 01 41 5c c3 b8 f0 ff ff ff e9 37 ff ff ff 48 83 05 0e 88 4c 01 01 <0f> 0b 48 83 05 14 88 4c 01 01 48 83 05 14 88 4c 01 01 48 83 05 14
[15384.888329] RSP: 0018:ffffaca85fb9bc88 EFLAGS: 00010202
[15384.895075] RAX: 0000000000000000 RBX: 00007f0463800000 RCX: 0000000000000000
[15384.903964] RDX: ffff9154229e28e0 RSI: 00007f0463800000 RDI: 0000000000194ff8
[15384.912834] RBP: ffff9154229e28e0 R08: 0000000000000000 R09: 000fffffffe00000
[15384.921694] R10: 0000000000000000 R11: ffff90f8000008e0 R12: 0000000000194ff8
[15384.930533] R13: ffff9156a8bb1100 R14: ffff915168646c00 R15: ffff9156a8bb1100
[15384.939363] FS:  00007fc763ff5740(0000) GS:ffff9156c07c0000(0000) knlGS:0000000000000000
[15384.949272] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[15384.956567] CR2: 0000000000603090 CR3: 0000005a9a2f8003 CR4: 00000000007606e0
[15384.965373] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[15384.974164] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[15384.982943] PKRU: 55555554
[15384.986756] Call Trace:
[15384.990268]  __split_huge_swap_pmd+0x48/0x170
[15384.995928]  __split_huge_pmd_locked+0x8be/0x1590
[15385.001933]  ? flush_tlb_mm_range+0xa1/0x120
[15385.007439]  ? memcg_check_events+0x2f/0x2e0
[15385.012949]  __split_huge_pmd+0x2d6/0x3e0
[15385.018135]  split_huge_pmd_address+0xbd/0x100
[15385.023794]  vma_adjust_trans_huge+0xe0/0x150
[15385.029344]  __vma_adjust+0xb8/0x770
[15385.034004]  __split_vma+0x182/0x1a0
[15385.038647]  __do_munmap+0xfd/0x340
[15385.043182]  __vm_munmap+0x6d/0xc0
[15385.047600]  __x64_sys_munmap+0x27/0x30
[15385.052509]  do_syscall_64+0x49/0x100
[15385.057206]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[15385.063430] RIP: 0033:0x7fc7638fa087
[15385.068018] Code: 64 89 02 48 83 c8 ff eb 9c 48 8b 15 03 ce 2c 00 f7 d8 64 89 02 e9 6a ff ff ff 66 0f 1f 84 00 00 00 00 00 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d d9 cd 2c 00 f7 d8 64 89 01 48
[15385.090123] RSP: 002b:00007ffec5d40198 EFLAGS: 00000202 ORIG_RAX: 000000000000000b
[15385.099142] RAX: ffffffffffffffda RBX: 000000005be4d221 RCX: 00007fc7638fa087
[15385.107700] RDX: 00007f0463960000 RSI: 00000000000216d6 RDI: 00007f0463960000
[15385.116223] RBP: 00007ffec5d401f0 R08: 0000000000006ab7 R09: 00007ffec5d401f0
[15385.124738] R10: 00007ffec5d3f5a0 R11: 0000000000000202 R12: 0000000000400ca0
[15385.133217] R13: 00007ffec5d40340 R14: 0000000000000000 R15: 0000000000000000
[15385.141684] Modules linked in: sunrpc vfat fat coretemp x86_pkg_temp_thermal crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 ext4 crypto_simd cryptd glue_helper jbd2 ext2 mbcache ipmi_ssif ipmi_si ioatdma ipmi_devintf sg iTCO_wdt lpc_ich pcspkr wmi mfd_core i2c_i801 ipmi_msghandler ip_tables xfs libcrc32c sd_mod mgag200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops igb ttm nvme hwmon xhci_pci drm dca megaraid_sas xhci_hcd crc32c_intel nvme_core i2c_algo_bit ahci i2c_core libahci dm_mirror dm_region_hash dm_log dm_mod dax efivarfs ipv6 crc_ccitt autofs4

--------------------------------------8<---------------------------------------

/*
 * stress-usage-counts.c
 *
 * gcc -o stress-usage-counts stress-usage-counts.c -pthread
 *
 * Daniel Jordan <daniel.m.jordan@oracle.com>
 */

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <getopt.h>
#include <errno.h>
#include <time.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <time.h>

#define ALIGN(x, a)	((x) & ~((a) - 1))
#define DEBUG 0
#define dprintf		if (DEBUG) printf
#define THP_PGSZ_SYSFS	"/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"

/* Taken from include/linux/kernel.h */
#define __round_mask(x, y) ((__typeof__(x))((y)-1))
#define round_up(x, y) ((((x)-1) | __round_mask(x, y))+1)

typedef void * (*start_routine)(void *);

char *ourname;
size_t bytes;
char *memory;
char *memory_unaligned;
char *munmap_arg;
unsigned long munmap_size;
unsigned long munmap_offset;
int random_munmap;
size_t pagesize;
unsigned long thp_size;
size_t nr_thread;

static void usage(int ok)
{
	fprintf(stderr,
	"Usage: %s [options] size[k|m|g|t]\n"
	"    -h                  show this message\n"
	"    -l N                start N large-page processes, default 1\n"
	"    -s N                start N small-page processes, default 1\n"
	"    -u offset,size      munmap in small-page procs at each offset of size\n"
	"    -U                  munmap in small-page procs at random offset and size\n"
	,		ourname);

	exit(ok);
}

/**
 *      [copied from usemem.c - dmj]
 *	memparse - parse a string with mem suffixes into a number
 *	@ptr: Where parse begins
 *	@retptr: (output) Optional pointer to next char after parse completes
 *
 *	Parses a string into a number.	The number stored at @ptr is
 *	potentially suffixed with %K (for kilobytes, or 1024 bytes),
 *	%M (for megabytes, or 1048576 bytes), or %G (for gigabytes, or
 *	1073741824).  If the number is suffixed with K, M, or G, then
 *	the return value is the number multiplied by one kilobyte, one
 *	megabyte, or one gigabyte, respectively.
 */
static unsigned long long memparse(const char *ptr, char **retptr)
{
	char *endptr;	/* local pointer to end of parsed string */

	unsigned long long ret = strtoull(ptr, &endptr, 0);

	switch (*endptr) {
	case 'T':
	case 't':
		ret <<= 10;
	case 'G':
	case 'g':
		ret <<= 10;
	case 'M':
	case 'm':
		ret <<= 10;
	case 'K':
	case 'k':
		ret <<= 10;
		endptr++;
	default:
		break;
	}

	if (retptr)
		*retptr = endptr;

	return ret;
}

static unsigned long read_sysfs_ul(const char *fname)
{
	int fd;
	ssize_t len;
	char buf[64];

	fd = open(fname, O_RDONLY);
	if (fd == -1) {
		perror("sysfs open");
		exit(1);
	}

	len = read(fd, buf, sizeof(buf) - 1);
	if (len == -1) {
		perror("sysfs read");
		exit(1);
	}

	return strtoul(buf, NULL, 10);
}

static inline void os_random_seed(unsigned long seed, struct drand48_data *rs)
{
	srand48_r(seed, rs);
}

static inline long os_random_long(unsigned long max, struct drand48_data *rs)
{
	long val;

	lrand48_r(rs, &val);
	return (unsigned long)((double)max * val / (RAND_MAX + 1.0));
}

struct fault_range {
	size_t start, end;
};

static long fault_thread(void *arg)
{
	size_t i;
	struct fault_range *range = (struct fault_range *)arg;

	for (i = range->start; i < range->end; i += pagesize)
		memory[i] = 'b';
}

static void fault_all(char *memory, size_t bytes)
{
	int ret;
	size_t i;
	long thread_ret;
	pthread_t threads[nr_thread];
	struct fault_range ranges[nr_thread];

	if (nr_thread > bytes) {
		ranges[0].start = 0;
		ranges[0].end   = bytes;
		fault_thread(&ranges[0]);
		return;
	}

	for (i = 0; i < nr_thread; i++) {
		ranges[i].start = bytes * i / nr_thread;
		ranges[i].end   = bytes * (i + 1) / nr_thread;
		ret = pthread_create(&threads[i], NULL, (start_routine)fault_thread, &ranges[i]);
		if (ret) {
			perror("pthread_create");
			exit(1);
		}
	}

	for (i = 0; i < nr_thread; i++) {
		ret = pthread_join(threads[i], (void *)&thread_ret);
		if (ret) {
			perror("pthread_join");
			exit(1);
		}
	}

	dprintf("done faulting\n");
}

static void read_memory(size_t idx)
{
	volatile char c = (volatile char)memory[idx];
}

int do_small_page_task(void)
{
	size_t i;
	struct drand48_data rand_data;

	os_random_seed(time(0) ^ syscall(SYS_gettid), &rand_data);

	/* Unmap parts of the range? */
	if (munmap_size) {
		assert(munmap_offset % pagesize == 0);
		for (i = munmap_offset; i < bytes; i += thp_size) {
			dprintf("munmap(%lx, %lx)\n", memory + i, munmap_size);
			if (munmap(memory + i, munmap_size) == -1) {
				fprintf(stderr, "munmap failed: %s\n",
					strerror(errno));
				exit(1);
			}
		}
	}

	while (1) {
		struct timespec ts;
		size_t thp_offset;

		i = ALIGN(os_random_long(bytes, &rand_data), pagesize);
		thp_offset = i % thp_size;

		if (thp_offset >= munmap_offset &&
		    thp_offset <= munmap_offset + munmap_size)
			continue;

		read_memory(i);

		ts.tv_sec  = 0;
		ts.tv_nsec = 1000;
		if (nanosleep(&ts, NULL) == -1) {
			fprintf(stderr, "nanosleep failed: %s\n", strerror(errno));
			exit(1);
		}
	}
}

int do_large_page_task(void)
{
	size_t i;
	size_t pmd_aligned_start, pmd_aligned_end;
	struct drand48_data rand_data;
	struct timespec ts;

	os_random_seed(time(0) ^ syscall(SYS_gettid), &rand_data);

	while (1) {
		volatile char *c;

		i = ALIGN(os_random_long(bytes, &rand_data), thp_size);

		read_memory(i);

		ts.tv_sec  = 0;
		ts.tv_nsec = 1000;
		if (nanosleep(&ts, NULL) == -1) {
			fprintf(stderr, "nanosleep failed: %s\n", strerror(errno));
			exit(1);
		}
	}
}

int main(int argc, char **argv)
{
	int i, c, child_pid, status;
	struct drand48_data rand_data;
	size_t nr_smallpg_procs = 1;
	size_t nr_largepg_procs = 1;

	ourname = argv[0];

	pagesize = sysconf(_SC_PAGESIZE);
	dprintf("pagesize = %lu\n", pagesize);
	thp_size = read_sysfs_ul(THP_PGSZ_SYSFS);
	dprintf("thp_size = %lu\n", thp_size);

	nr_thread = sysconf(_SC_NPROCESSORS_ONLN);
	dprintf("nr_thread = %lu\n", nr_thread);

	while ((c = getopt(argc, argv, "hl:s:u:U")) != -1) {
		switch (c) {
		case 'h':
			usage(0);
		case 'l':
			nr_largepg_procs = strtol(optarg, NULL, 10);
			break;
		case 's':
			nr_smallpg_procs = strtol(optarg, NULL, 10);
			break;
		case 'u':
                        if ((munmap_arg = strtok(optarg, ",")) == NULL)
                                usage(1);
                        munmap_offset = memparse(munmap_arg, NULL);
                        if ((munmap_arg = strtok(NULL, ",")) == NULL)
                                usage(1);
                        munmap_size = memparse(munmap_arg, NULL);
			break;
		case 'U':
			random_munmap = 1;
			break;
		default:
			usage(1);
		}
	}

	if (optind != argc - 1)
		usage(0);

	bytes = memparse(argv[optind], NULL);

	if (random_munmap) {
		os_random_seed(time(0) ^ syscall(SYS_gettid), &rand_data);
		munmap_offset = ALIGN(os_random_long(thp_size - 1, &rand_data),
				      pagesize);
		printf("random munmap offset = %lu\n", munmap_offset);
		munmap_size = os_random_long(thp_size - munmap_offset,
					     &rand_data);
		printf("random munmap size   = %lu\n", munmap_size);
	}

	memory_unaligned = mmap(NULL, bytes + thp_size, PROT_READ | PROT_WRITE,
				MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
	if (memory_unaligned == MAP_FAILED) {
		fprintf(stderr, "mmap failed: %s\n", strerror(errno));
		exit(1);
	}
	if (madvise(memory_unaligned, bytes, MADV_HUGEPAGE) == -1) {
		fprintf(stderr, "madvise failed: %s\n", strerror(errno));
		exit(1);
	}
	memory = (char *)round_up((unsigned long)memory_unaligned, thp_size);

	printf("mmap(%ld, %ld)\n", memory, bytes);

	/* fault it all in */
	fault_all(memory_unaligned, bytes);

	for (i = 0; i < nr_smallpg_procs; i++) {
		if ((child_pid = fork()) == 0)
			return do_small_page_task();
		else if (child_pid < 0)
			fprintf(stderr, "failed to fork: %s\n",
				strerror(errno));
	}

	for (i = 0; i < nr_largepg_procs; i++) {
		if ((child_pid = fork()) == 0)
			return do_large_page_task();
		else if (child_pid < 0)
			fprintf(stderr, "failed to fork: %s\n",
				strerror(errno));
	}

	for (i = 0; i < nr_smallpg_procs + nr_largepg_procs; i++) {
		if (wait3(&status, 0, 0) < 0) {
			if (errno != EINTR) {
				printf("wait3 error on %dth child\n", i);
				perror("wait3");
				return 1;
			}
		}
	}

	dprintf("finished\n");

	return 0;
}

--------------------------------------8<---------------------------------------

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-11-09  1:12 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-10  7:19 [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Huang Ying
2018-10-10  7:19 ` [PATCH -V6 01/21] swap: Enable PMD swap operations for CONFIG_THP_SWAP Huang Ying
2018-10-10  7:19 ` [PATCH -V6 02/21] swap: Add __swap_duplicate_locked() Huang Ying
2018-10-10  7:19 ` [PATCH -V6 03/21] swap: Support PMD swap mapping in swap_duplicate() Huang Ying
2018-10-10  7:19 ` [PATCH -V6 04/21] swap: Support PMD swap mapping in put_swap_page() Huang Ying
2018-10-10  7:19 ` [PATCH -V6 05/21] swap: Support PMD swap mapping in free_swap_and_cache()/swap_free() Huang Ying
2018-10-10  7:19 ` [PATCH -V6 06/21] swap: Support PMD swap mapping when splitting huge PMD Huang Ying
2018-10-24 17:25   ` Daniel Jordan
2018-10-25  0:54     ` Huang, Ying
2018-10-25 15:00       ` Daniel Jordan
2018-10-10  7:19 ` [PATCH -V6 07/21] swap: Support PMD swap mapping in split_swap_cluster() Huang Ying
2018-10-10  7:19 ` [PATCH -V6 08/21] swap: Support to read a huge swap cluster for swapin a THP Huang Ying
2018-10-10  7:19 ` [PATCH -V6 09/21] swap: Swapin a THP in one piece Huang Ying
2018-10-10  7:19 ` [PATCH -V6 10/21] swap: Support to count THP swapin and its fallback Huang Ying
2018-10-10  7:19 ` [PATCH -V6 11/21] swap: Add sysfs interface to configure THP swapin Huang Ying
2018-10-10  7:19 ` [PATCH -V6 12/21] swap: Support PMD swap mapping in swapoff Huang Ying
2018-10-10  7:19 ` [PATCH -V6 13/21] swap: Support PMD swap mapping in madvise_free() Huang Ying
2018-10-10  7:19 ` [PATCH -V6 14/21] swap: Support to move swap account for PMD swap mapping Huang Ying
2018-10-24 17:27   ` Daniel Jordan
2018-10-25  1:06     ` Huang, Ying
2018-10-10  7:19 ` [PATCH -V6 15/21] swap: Support to copy PMD swap mapping when fork() Huang Ying
2018-10-10  7:19 ` [PATCH -V6 16/21] swap: Free PMD swap mapping when zap_huge_pmd() Huang Ying
2018-10-10  7:19 ` [PATCH -V6 17/21] swap: Support PMD swap mapping for MADV_WILLNEED Huang Ying
2018-10-10  7:19 ` [PATCH -V6 18/21] swap: Support PMD swap mapping in mincore() Huang Ying
2018-10-10  7:19 ` [PATCH -V6 19/21] swap: Support PMD swap mapping in common path Huang Ying
2018-10-10  7:19 ` [PATCH -V6 20/21] swap: create PMD swap mapping when unmap the THP Huang Ying
2018-10-10  7:19 ` [PATCH -V6 21/21] swap: Update help of CONFIG_THP_SWAP Huang Ying
2018-10-23 12:27 ` [PATCH -V6 00/21] swap: Swapout/swapin THP in one piece Daniel Jordan
2018-10-24  3:31   ` Huang, Ying
2018-10-24 17:24     ` Daniel Jordan
2018-10-25  0:42       ` Huang, Ying
2018-11-09  1:12 ` Daniel Jordan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).