* [PATCH 0/4] Bugfixes for THP refcounting
@ 2015-11-03 15:26 ` Kirill A. Shutemov
  0 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-03 15:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Hugh Dickins, Naoya Horiguchi, Sasha Levin,
	Minchan Kim, linux-kernel, linux-mm, Kirill A. Shutemov

Hi,

Here are a few bugfixes for the THP refcounting patchset. They should
address most of the reported bugs.

I need to track down one more bug: rss-counter mismatch on exit.

Kirill A. Shutemov (4):
  mm: do not crash on PageDoubleMap() for non-head pages
  mm: duplicate rmap reference for hugetlb pages as compound
  thp: fix split vs. unmap race
  mm: prepare page_referenced() and page_idle to new THP refcounting

 include/linux/huge_mm.h    |   4 --
 include/linux/mm.h         |  19 +++++++
 include/linux/page-flags.h |   3 +-
 mm/huge_memory.c           |  76 ++++++-------------------
 mm/migrate.c               |   2 +-
 mm/page_idle.c             |  64 ++++++++++++++++++---
 mm/rmap.c                  | 137 ++++++++++++++++++++++++++++-----------------
 7 files changed, 180 insertions(+), 125 deletions(-)

-- 
2.6.1


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/4] mm: do not crash on PageDoubleMap() for non-head pages
  2015-11-03 15:26 ` Kirill A. Shutemov
@ 2015-11-03 15:26   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-03 15:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Hugh Dickins, Naoya Horiguchi, Sasha Levin,
	Minchan Kim, linux-kernel, linux-mm, Kirill A. Shutemov

We usually don't call PageDoubleMap() on small or tail pages, but during a
read from /proc/kpageflags we don't protect the page from being freed under
us, and that can lead to the VM_BUG_ON_PAGE() in PageDoubleMap():

 page:ffffea00033e0000 count:0 mapcount:0 mapping:          (null) index:0x700000200
 flags: 0x4000000000000000()
 page dumped because: VM_BUG_ON_PAGE(!PageHead(page))
 page->mem_cgroup:ffff88021588cc00
 ------------[ cut here ]------------
 kernel BUG at /src/linux-dev/include/linux/page-flags.h:552!
 invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
 Modules linked in: cfg80211 rfkill crc32c_intel virtio_balloon serio_raw i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi
 CPU: 0 PID: 1183 Comm: page-types Not tainted 4.2.0-mmotm-2015-10-21-14-41-151027-1418-00014-41+ #179
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 task: ffff880214a08bc0 ti: ffff880213e2c000 task.ti: ffff880213e2c000
 RIP: 0010:[<ffffffff812434b6>]  [<ffffffff812434b6>] stable_page_flags+0x336/0x340
 RSP: 0018:ffff880213e2fda8  EFLAGS: 00010292
 RAX: 0000000000000021 RBX: ffff8802150a39c0 RCX: 0000000000000000
 RDX: ffff88021ec0ff38 RSI: ffff88021ec0d658 RDI: ffff88021ec0d658
 RBP: ffff880213e2fdc8 R08: 000000000000000a R09: 000000000000132f
 R10: 0000000000000000 R11: 000000000000132f R12: 4000000000000000
 R13: ffffea00033e6340 R14: 00007fff8449e430 R15: ffffea00033e6340
 FS:  00007ff7f9525700(0000) GS:ffff88021ec00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000063b800 CR3: 00000000d9e71000 CR4: 00000000000006f0
 Stack:
  ffff8800db82df80 ffff8802150a39c0 0000000000000008 00000000000cf98d
  ffff880213e2fe18 ffffffff81243588 00007fff8449e430 ffff880213e2ff20
  000000000063b800 ffff8802150a39c0 fffffffffffffffb ffff880213e2ff20
 Call Trace:
  [<ffffffff81243588>] kpageflags_read+0xc8/0x130
  [<ffffffff81235848>] proc_reg_read+0x48/0x70
  [<ffffffff811d6b08>] __vfs_read+0x28/0xd0
  [<ffffffff812ee43e>] ? security_file_permission+0xae/0xc0
  [<ffffffff811d6f53>] ? rw_verify_area+0x53/0xf0
  [<ffffffff811d707a>] vfs_read+0x8a/0x130
  [<ffffffff811d7bf7>] SyS_pread64+0x77/0x90
  [<ffffffff81648117>] entry_SYSCALL_64_fastpath+0x12/0x6a
 Code: ca 00 00 40 01 48 39 c1 48 0f 44 da e9 a2 fd ff ff 48 c7 c6 50 a6 a1 8 1 e8 58 ab f4 ff 0f 0b 48 c7 c6 90 a2 a1 81 e8 4a ab f4 ff <0f> 0b 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57
 RIP  [<ffffffff812434b6>] stable_page_flags+0x336/0x340
  RSP <ffff880213e2fda8>
 ---[ end trace e5d18553088c026a ]---

Let's drop the VM_BUG_ON_PAGE() from PageDoubleMap() and return false for
non-head pages.
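
For illustration, here is a minimal userspace sketch of the kind of scan
that hits this path (roughly what the page-types tool does; the code below
is illustrative only and not part of this patch). /proc/kpageflags exposes
one 64-bit flags word per pfn, so the kernel ends up inspecting pages it
holds no reference on:

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/proc/kpageflags", O_RDONLY);
		uint64_t flags;
		unsigned long pfn;

		if (fd < 0)
			return 1;
		/* one u64 of flags per pfn, at offset pfn * 8 */
		for (pfn = 0; pread(fd, &flags, sizeof(flags),
				    pfn * sizeof(flags)) == sizeof(flags); pfn++)
			printf("pfn %lu flags %#llx\n", pfn,
			       (unsigned long long)flags);
		close(fd);
		return 0;
	}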

The patch can be folded into
	"mm: rework mapcount accounting to enable 4k mapping of THPs"

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/page-flags.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 72356fbc3f2d..26cc7a068126 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -549,8 +549,7 @@ static inline int PageTransTail(struct page *page)
  */
 static inline int PageDoubleMap(struct page *page)
 {
-	VM_BUG_ON_PAGE(!PageHead(page), page);
-	return test_bit(PG_double_map, &page[1].flags);
+	return PageHead(page) && test_bit(PG_double_map, &page[1].flags);
 }
 
 static inline int TestSetPageDoubleMap(struct page *page)
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/4] mm: duplicate rmap reference for hugetlb pages as compound
  2015-11-03 15:26 ` Kirill A. Shutemov
@ 2015-11-03 15:26   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-03 15:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Hugh Dickins, Naoya Horiguchi, Sasha Levin,
	Minchan Kim, linux-kernel, linux-mm, Kirill A. Shutemov

Naoya noticed that I wrongly duplicate the rmap reference for hugetlb pages
in remove_migration_pte() as non-compound. Let's fix this.
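
For context, page_dup_rmap() in the reworked refcounting takes a "compound"
flag that selects which counter gets the extra reference. Roughly (a sketch
from memory; see the patchset for the real definition):

	static inline void page_dup_rmap(struct page *page, bool compound)
	{
		/*
		 * A compound mapping (hugetlb or PMD-mapped THP) is
		 * accounted on the compound mapcount; a pte mapping is
		 * accounted on the per-page _mapcount.
		 */
		atomic_inc(compound ?
			   compound_mapcount_ptr(page) : &page->_mapcount);
	}

hugetlb pages are always mapped as a whole, so remove_migration_pte() must
pass compound == true for them.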

The patch can be folded into
	"mm: rework mapcount accounting to enable 4k mapping of THPs"

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/migrate.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 1ae0113559c9..b1034f9c77e7 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -165,7 +165,7 @@ static int remove_migration_pte(struct page *new, struct vm_area_struct *vma,
 		if (PageAnon(new))
 			hugepage_add_anon_rmap(new, vma, addr);
 		else
-			page_dup_rmap(new, false);
+			page_dup_rmap(new, true);
 	} else if (PageAnon(new))
 		page_add_anon_rmap(new, vma, addr, false);
 	else
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 3/4] thp: fix split vs. unmap race
  2015-11-03 15:26 ` Kirill A. Shutemov
@ 2015-11-03 15:26   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-03 15:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Hugh Dickins, Naoya Horiguchi, Sasha Levin,
	Minchan Kim, linux-kernel, linux-mm, Kirill A. Shutemov

To stabilize a compound page during split we use migration entries.
The code implementing this is buggy: I wrongly assumed that the kernel
would wait for migration to finish before zapping ptes.

But it turns out that's not true.

As a result, if zap_pte_range() races with split_huge_page(), we can end up
with a page which is not mapped anymore but has _count and _mapcount
elevated. The page is on the LRU too, so it's still reachable by vmscan and
by pfn scanners. It's likely that page->mapping in this case would point to
a freed anon_vma.

BOOM!

The patch modifies the freeze/unfreeze_page() code to match the normal
migration entry logic: on setup we remove the page from rmap and drop the
pin; on removal we take the pin back and put the page back on rmap. This
way, even if the migration entry is removed under us, we don't corrupt the
page's state.
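
In outline, the symmetry this patch establishes looks like this (simplified
from the diff below; pte manipulation details omitted):

	/* freeze_page_vma(): per pte replaced with a migration entry */
	set_pte_at(vma->vm_mm, address, pte + i, swp_pte);
	page_remove_rmap(page, false);	/* drop the rmap reference ... */
	put_page(page);			/* ... and the pin */

	/* unfreeze_page_vma(): per migration entry restored */
	get_page(page);			/* take the pin back ... */
	page_add_anon_rmap(page, vma, address, false); /* ... re-add rmap */
	set_pte_at(vma->vm_mm, address, pte + i, entry);

So if the migration entry disappears under us (e.g. zap_pte_range() removes
it), there is no extra pin or mapcount left behind for unfreeze_page() to
clean up -- we already dropped them when the entry was installed.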

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Minchan Kim <minchan@kernel.org>
Reported-by: Sasha Levin <sasha.levin@oracle.com>
---
 mm/huge_memory.c | 22 ++++++++++++++++++----
 mm/rmap.c        | 19 +++++--------------
 2 files changed, 23 insertions(+), 18 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5009f68786d0..3700981f8035 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2934,6 +2934,13 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 
 	smp_wmb(); /* make pte visible before pmd */
 	pmd_populate(mm, pmd, pgtable);
+
+	if (freeze) {
+		for (i = 0; i < HPAGE_PMD_NR; i++, haddr += PAGE_SIZE) {
+			page_remove_rmap(page + i, false);
+			put_page(page + i);
+		}
+	}
 }
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
@@ -3079,6 +3086,8 @@ static void freeze_page_vma(struct vm_area_struct *vma, struct page *page,
 		if (pte_soft_dirty(entry))
 			swp_pte = pte_swp_mksoft_dirty(swp_pte);
 		set_pte_at(vma->vm_mm, address, pte + i, swp_pte);
+		page_remove_rmap(page, false);
+		put_page(page);
 	}
 	pte_unmap_unlock(pte, ptl);
 }
@@ -3117,8 +3126,6 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
 		return;
 	pte = pte_offset_map_lock(vma->vm_mm, pmd, address, &ptl);
 	for (i = 0; i < HPAGE_PMD_NR; i++, address += PAGE_SIZE, page++) {
-		if (!page_mapped(page))
-			continue;
 		if (!is_swap_pte(pte[i]))
 			continue;
 
@@ -3128,6 +3135,9 @@ static void unfreeze_page_vma(struct vm_area_struct *vma, struct page *page,
 		if (migration_entry_to_page(swp_entry) != page)
 			continue;
 
+		get_page(page);
+		page_add_anon_rmap(page, vma, address, false);
+
 		entry = pte_mkold(mk_pte(page, vma->vm_page_prot));
 		entry = pte_mkdirty(entry);
 		if (is_write_migration_entry(swp_entry))
@@ -3195,8 +3205,6 @@ static int __split_huge_page_tail(struct page *head, int tail,
 	 */
 	atomic_add(mapcount + 1, &page_tail->_count);
 
-	/* after clearing PageTail the gup refcount can be released */
-	smp_mb__after_atomic();
 
 	page_tail->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 	page_tail->flags |= (head->flags &
@@ -3209,6 +3217,12 @@ static int __split_huge_page_tail(struct page *head, int tail,
 			 (1L << PG_unevictable)));
 	page_tail->flags |= (1L << PG_dirty);
 
+	/*
+	 * After clearing PageTail the gup refcount can be released.
+	 * Page flags also must be visible before we make the page non-compound.
+	 */
+	smp_wmb();
+
 	clear_compound_head(page_tail);
 
 	if (page_is_young(head))
diff --git a/mm/rmap.c b/mm/rmap.c
index 288622f5f34d..ad9af8b3a381 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1135,20 +1135,12 @@ void do_page_add_anon_rmap(struct page *page,
 	bool compound = flags & RMAP_COMPOUND;
 	bool first;
 
-	if (PageTransCompound(page)) {
+	if (compound) {
+		atomic_t *mapcount;
 		VM_BUG_ON_PAGE(!PageLocked(page), page);
-		if (compound) {
-			atomic_t *mapcount;
-
-			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
-			mapcount = compound_mapcount_ptr(page);
-			first = atomic_inc_and_test(mapcount);
-		} else {
-			/* Anon THP always mapped first with PMD */
-			first = 0;
-			VM_BUG_ON_PAGE(!page_mapcount(page), page);
-			atomic_inc(&page->_mapcount);
-		}
+		VM_BUG_ON_PAGE(!PageTransHuge(page), page);
+		mapcount = compound_mapcount_ptr(page);
+		first = atomic_inc_and_test(mapcount);
 	} else {
 		VM_BUG_ON_PAGE(compound, page);
 		first = atomic_inc_and_test(&page->_mapcount);
@@ -1163,7 +1155,6 @@ void do_page_add_anon_rmap(struct page *page,
 		 * disabled.
 		 */
 		if (compound) {
-			VM_BUG_ON_PAGE(!PageTransHuge(page), page);
 			__inc_zone_page_state(page,
 					      NR_ANON_TRANSPARENT_HUGEPAGES);
 		}
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-03 15:26 ` Kirill A. Shutemov
@ 2015-11-03 15:26   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-03 15:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andrea Arcangeli, Hugh Dickins, Naoya Horiguchi, Sasha Levin,
	Minchan Kim, linux-kernel, linux-mm, Kirill A. Shutemov,
	Vladimir Davydov

I've missed two similar codepaths which need some preparation to work well
with the reworked THP refcounting.

Both page_referenced() and page_idle_clear_pte_refs_one() assume that a THP
can only be mapped with a PMD, so there's no reason to look at PTEs for
PageTransHuge() pages. That's no longer true: a THP can be mapped with PTEs
too.

The patch removes the PageTransHuge() test from these functions and
open-codes the page table check.
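
In both places the open-coded check follows the same outline (an
illustrative sketch; see the diff below for the full version with locking
and recheck details):

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		return SWAP_AGAIN;
	pud = pud_offset(pgd, address);
	if (!pud_present(*pud))
		return SWAP_AGAIN;
	pmd = pmd_offset(pud, address);

	if (pmd_trans_huge(*pmd)) {
		/* take the pmd lock, recheck, handle a PMD-mapped THP */
	} else {
		/*
		 * Not huge: fall through to mapping and checking the pte,
		 * which may point at any subpage of a pte-mapped THP.
		 */
	}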

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Vladimir Davydov <vdavydov@parallels.com>
---
 include/linux/huge_mm.h |   4 --
 include/linux/mm.h      |  19 ++++++++
 mm/huge_memory.c        |  54 ----------------------
 mm/page_idle.c          |  64 ++++++++++++++++++++++----
 mm/rmap.c               | 118 +++++++++++++++++++++++++++++++++---------------
 5 files changed, 155 insertions(+), 104 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f7c3f13f3a9c..5c7b00e88236 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -51,10 +51,6 @@ enum transparent_hugepage_flag {
 #endif
 };
 
-extern pmd_t *page_check_address_pmd(struct page *page,
-				     struct mm_struct *mm,
-				     unsigned long address,
-				     spinlock_t **ptl);
 extern int pmd_freeable(pmd_t pmd);
 
 #define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b4cd988a794a..a36f9fa4e4cd 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -432,6 +432,25 @@ static inline int page_mapcount(struct page *page)
 	return ret;
 }
 
+static inline int total_mapcount(struct page *page)
+{
+	int i, ret;
+
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+
+	ret = compound_mapcount(page);
+	if (PageHuge(page))
+		return ret;
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		ret += atomic_read(&page[i]._mapcount) + 1;
+	if (PageDoubleMap(page))
+		ret -= HPAGE_PMD_NR;
+	return ret;
+}
+
 static inline int page_count(struct page *page)
 {
 	return atomic_read(&compound_head(page)->_count);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3700981f8035..14cbbad54a3e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1713,46 +1713,6 @@ bool __pmd_trans_huge_lock(pmd_t *pmd, struct vm_area_struct *vma,
 	return false;
 }
 
-/*
- * This function returns whether a given @page is mapped onto the @address
- * in the virtual space of @mm.
- *
- * When it's true, this function returns *pmd with holding the page table lock
- * and passing it back to the caller via @ptl.
- * If it's false, returns NULL without holding the page table lock.
- */
-pmd_t *page_check_address_pmd(struct page *page,
-			      struct mm_struct *mm,
-			      unsigned long address,
-			      spinlock_t **ptl)
-{
-	pgd_t *pgd;
-	pud_t *pud;
-	pmd_t *pmd;
-
-	if (address & ~HPAGE_PMD_MASK)
-		return NULL;
-
-	pgd = pgd_offset(mm, address);
-	if (!pgd_present(*pgd))
-		return NULL;
-	pud = pud_offset(pgd, address);
-	if (!pud_present(*pud))
-		return NULL;
-	pmd = pmd_offset(pud, address);
-
-	*ptl = pmd_lock(mm, pmd);
-	if (!pmd_present(*pmd))
-		goto unlock;
-	if (pmd_page(*pmd) != page)
-		goto unlock;
-	if (pmd_trans_huge(*pmd))
-		return pmd;
-unlock:
-	spin_unlock(*ptl);
-	return NULL;
-}
-
 #define VM_NO_THP (VM_SPECIAL | VM_HUGETLB | VM_SHARED | VM_MAYSHARE)
 
 int hugepage_madvise(struct vm_area_struct *vma,
@@ -3169,20 +3129,6 @@ static void unfreeze_page(struct anon_vma *anon_vma, struct page *page)
 	}
 }
 
-static int total_mapcount(struct page *page)
-{
-	int i, ret;
-
-	ret = compound_mapcount(page);
-	for (i = 0; i < HPAGE_PMD_NR; i++)
-		ret += atomic_read(&page[i]._mapcount) + 1;
-
-	if (PageDoubleMap(page))
-		ret -= HPAGE_PMD_NR;
-
-	return ret;
-}
-
 static int __split_huge_page_tail(struct page *head, int tail,
 		struct lruvec *lruvec, struct list_head *list)
 {
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 1c245d9027e3..2c9ebe12b40d 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -56,23 +56,69 @@ static int page_idle_clear_pte_refs_one(struct page *page,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
+	pgd_t *pgd;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	bool referenced = false;
 
-	if (unlikely(PageTransHuge(page))) {
-		pmd = page_check_address_pmd(page, mm, addr, &ptl);
-		if (pmd) {
-			referenced = pmdp_clear_young_notify(vma, addr, pmd);
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		return SWAP_AGAIN;
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		return SWAP_AGAIN;
+	pmd = pmd_offset(pud, addr);
+
+	if (pmd_trans_huge(*pmd)) {
+		ptl = pmd_lock(mm, pmd);
+                if (!pmd_present(*pmd))
+			goto unlock_pmd;
+		if (unlikely(!pmd_trans_huge(*pmd))) {
 			spin_unlock(ptl);
+			goto map_pte;
 		}
+
+		if (pmd_page(*pmd) != page)
+			goto unlock_pmd;
+
+		referenced = pmdp_clear_young_notify(vma, addr, pmd);
+		spin_unlock(ptl);
+		goto found;
+unlock_pmd:
+		spin_unlock(ptl);
+		return SWAP_AGAIN;
 	} else {
-		pte = page_check_address(page, mm, addr, &ptl, 0);
-		if (pte) {
-			referenced = ptep_clear_young_notify(vma, addr, pte);
-			pte_unmap_unlock(pte, ptl);
-		}
+		pmd_t pmde = *pmd;
+		barrier();
+		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
+			return SWAP_AGAIN;
+
+	}
+map_pte:
+	pte = pte_offset_map(pmd, addr);
+	if (!pte_present(*pte)) {
+		pte_unmap(pte);
+		return SWAP_AGAIN;
 	}
+
+	ptl = pte_lockptr(mm, pmd);
+	spin_lock(ptl);
+
+	if (!pte_present(*pte)) {
+		pte_unmap_unlock(pte, ptl);
+		return SWAP_AGAIN;
+	}
+
+	/* THP can be referenced by any subpage */
+	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
+		pte_unmap_unlock(pte, ptl);
+		return SWAP_AGAIN;
+	}
+
+	referenced = ptep_clear_young_notify(vma, addr, pte);
+	pte_unmap_unlock(pte, ptl);
+found:
 	if (referenced) {
 		clear_page_idle(page);
 		/*
diff --git a/mm/rmap.c b/mm/rmap.c
index ad9af8b3a381..0837487d3737 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -812,60 +812,104 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 	spinlock_t *ptl;
 	int referenced = 0;
 	struct page_referenced_arg *pra = arg;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
 
-	if (unlikely(PageTransHuge(page))) {
-		pmd_t *pmd;
-
-		/*
-		 * rmap might return false positives; we must filter
-		 * these out using page_check_address_pmd().
-		 */
-		pmd = page_check_address_pmd(page, mm, address, &ptl);
-		if (!pmd)
+	if (unlikely(PageHuge(page))) {
+		/* when pud is not present, pte will be NULL */
+		pte = huge_pte_offset(mm, address);
+		if (!pte)
 			return SWAP_AGAIN;
 
-		if (vma->vm_flags & VM_LOCKED) {
+		ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
+		goto check_pte;
+	}
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		return SWAP_AGAIN;
+	pud = pud_offset(pgd, address);
+	if (!pud_present(*pud))
+		return SWAP_AGAIN;
+	pmd = pmd_offset(pud, address);
+
+	if (pmd_trans_huge(*pmd)) {
+		int ret = SWAP_AGAIN;
+
+		ptl = pmd_lock(mm, pmd);
+		if (!pmd_present(*pmd))
+			goto unlock_pmd;
+		if (unlikely(!pmd_trans_huge(*pmd))) {
 			spin_unlock(ptl);
+			goto map_pte;
+		}
+
+		if (pmd_page(*pmd) != page)
+			goto unlock_pmd;
+
+		if (vma->vm_flags & VM_LOCKED) {
 			pra->vm_flags |= VM_LOCKED;
-			return SWAP_FAIL; /* To break the loop */
+			ret = SWAP_FAIL; /* To break the loop */
+			goto unlock_pmd;
 		}
 
 		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
-
 		spin_unlock(ptl);
+		goto found;
+unlock_pmd:
+		spin_unlock(ptl);
+		return ret;
 	} else {
-		pte_t *pte;
-
-		/*
-		 * rmap might return false positives; we must filter
-		 * these out using page_check_address().
-		 */
-		pte = page_check_address(page, mm, address, &ptl, 0);
-		if (!pte)
+		pmd_t pmde = *pmd;
+		barrier();
+		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
 			return SWAP_AGAIN;
+	}
+map_pte:
+	pte = pte_offset_map(pmd, address);
+	if (!pte_present(*pte)) {
+		pte_unmap(pte);
+		return SWAP_AGAIN;
+	}
 
-		if (vma->vm_flags & VM_LOCKED) {
-			pte_unmap_unlock(pte, ptl);
-			pra->vm_flags |= VM_LOCKED;
-			return SWAP_FAIL; /* To break the loop */
-		}
+	ptl = pte_lockptr(mm, pmd);
+check_pte:
+	spin_lock(ptl);
 
-		if (ptep_clear_flush_young_notify(vma, address, pte)) {
-			/*
-			 * Don't treat a reference through a sequentially read
-			 * mapping as such.  If the page has been used in
-			 * another mapping, we will catch it; if this other
-			 * mapping is already gone, the unmap path will have
-			 * set PG_referenced or activated the page.
-			 */
-			if (likely(!(vma->vm_flags & VM_SEQ_READ)))
-				referenced++;
-		}
+	if (!pte_present(*pte)) {
+		pte_unmap_unlock(pte, ptl);
+		return SWAP_AGAIN;
+	}
+
+	/* THP can be referenced by any subpage */
+	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
+		pte_unmap_unlock(pte, ptl);
+		return SWAP_AGAIN;
+	}
 
+	if (vma->vm_flags & VM_LOCKED) {
 		pte_unmap_unlock(pte, ptl);
+		pra->vm_flags |= VM_LOCKED;
+		return SWAP_FAIL; /* To break the loop */
 	}
 
+	if (ptep_clear_flush_young_notify(vma, address, pte)) {
+		/*
+		 * Don't treat a reference through a sequentially read
+		 * mapping as such.  If the page has been used in
+		 * another mapping, we will catch it; if this other
+		 * mapping is already gone, the unmap path will have
+		 * set PG_referenced or activated the page.
+		 */
+		if (likely(!(vma->vm_flags & VM_SEQ_READ)))
+			referenced++;
+	}
+	pte_unmap_unlock(pte, ptl);
+
+found:
 	if (referenced)
 		clear_page_idle(page);
 	if (test_and_clear_page_young(page))
@@ -912,7 +956,7 @@ int page_referenced(struct page *page,
 	int ret;
 	int we_locked = 0;
 	struct page_referenced_arg pra = {
-		.mapcount = page_mapcount(page),
+		.mapcount = total_mapcount(page),
 		.memcg = memcg,
 	};
 	struct rmap_walk_control rwc = {
-- 
2.6.1


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-03 15:26   ` Kirill A. Shutemov
@ 2015-11-05  9:10     ` Vladimir Davydov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-11-05  9:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Hugh Dickins, Naoya Horiguchi,
	Sasha Levin, Minchan Kim, linux-kernel, linux-mm

On Tue, Nov 03, 2015 at 05:26:15PM +0200, Kirill A. Shutemov wrote:
...
> @@ -56,23 +56,69 @@ static int page_idle_clear_pte_refs_one(struct page *page,
>  {
>  	struct mm_struct *mm = vma->vm_mm;
>  	spinlock_t *ptl;
> +	pgd_t *pgd;
> +	pud_t *pud;
>  	pmd_t *pmd;
>  	pte_t *pte;
>  	bool referenced = false;
>  
> -	if (unlikely(PageTransHuge(page))) {
> -		pmd = page_check_address_pmd(page, mm, addr, &ptl);
> -		if (pmd) {
> -			referenced = pmdp_clear_young_notify(vma, addr, pmd);
> +	pgd = pgd_offset(mm, addr);
> +	if (!pgd_present(*pgd))
> +		return SWAP_AGAIN;
> +	pud = pud_offset(pgd, addr);
> +	if (!pud_present(*pud))
> +		return SWAP_AGAIN;
> +	pmd = pmd_offset(pud, addr);
> +
> +	if (pmd_trans_huge(*pmd)) {
> +		ptl = pmd_lock(mm, pmd);
> +                if (!pmd_present(*pmd))
> +			goto unlock_pmd;
> +		if (unlikely(!pmd_trans_huge(*pmd))) {
>  			spin_unlock(ptl);
> +			goto map_pte;
>  		}
> +
> +		if (pmd_page(*pmd) != page)
> +			goto unlock_pmd;
> +
> +		referenced = pmdp_clear_young_notify(vma, addr, pmd);
> +		spin_unlock(ptl);
> +		goto found;
> +unlock_pmd:
> +		spin_unlock(ptl);
> +		return SWAP_AGAIN;
>  	} else {
> -		pte = page_check_address(page, mm, addr, &ptl, 0);
> -		if (pte) {
> -			referenced = ptep_clear_young_notify(vma, addr, pte);
> -			pte_unmap_unlock(pte, ptl);
> -		}
> +		pmd_t pmde = *pmd;
> +		barrier();
> +		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> +			return SWAP_AGAIN;
> +
> +	}
> +map_pte:
> +	pte = pte_offset_map(pmd, addr);
> +	if (!pte_present(*pte)) {
> +		pte_unmap(pte);
> +		return SWAP_AGAIN;
>  	}
> +
> +	ptl = pte_lockptr(mm, pmd);
> +	spin_lock(ptl);
> +
> +	if (!pte_present(*pte)) {
> +		pte_unmap_unlock(pte, ptl);
> +		return SWAP_AGAIN;
> +	}
> +
> +	/* THP can be referenced by any subpage */
> +	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
> +		pte_unmap_unlock(pte, ptl);
> +		return SWAP_AGAIN;
> +	}
> +
> +	referenced = ptep_clear_young_notify(vma, addr, pte);
> +	pte_unmap_unlock(pte, ptl);
> +found:

Can't we hide this stuff in a helper function, which would be used by
both page_referenced_one and page_idle_clear_pte_refs_one, instead of
duplicating page_referenced_one code here?

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-05  9:10     ` Vladimir Davydov
@ 2015-11-05  9:24       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-05  9:24 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Naoya Horiguchi, Sasha Levin, Minchan Kim,
	linux-kernel, linux-mm

On Thu, Nov 05, 2015 at 12:10:13PM +0300, Vladimir Davydov wrote:
> On Tue, Nov 03, 2015 at 05:26:15PM +0200, Kirill A. Shutemov wrote:
> ...
> > @@ -56,23 +56,69 @@ static int page_idle_clear_pte_refs_one(struct page *page,
> >  {
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	spinlock_t *ptl;
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> >  	pmd_t *pmd;
> >  	pte_t *pte;
> >  	bool referenced = false;
> >  
> > -	if (unlikely(PageTransHuge(page))) {
> > -		pmd = page_check_address_pmd(page, mm, addr, &ptl);
> > -		if (pmd) {
> > -			referenced = pmdp_clear_young_notify(vma, addr, pmd);
> > +	pgd = pgd_offset(mm, addr);
> > +	if (!pgd_present(*pgd))
> > +		return SWAP_AGAIN;
> > +	pud = pud_offset(pgd, addr);
> > +	if (!pud_present(*pud))
> > +		return SWAP_AGAIN;
> > +	pmd = pmd_offset(pud, addr);
> > +
> > +	if (pmd_trans_huge(*pmd)) {
> > +		ptl = pmd_lock(mm, pmd);
> > +                if (!pmd_present(*pmd))
> > +			goto unlock_pmd;
> > +		if (unlikely(!pmd_trans_huge(*pmd))) {
> >  			spin_unlock(ptl);
> > +			goto map_pte;
> >  		}
> > +
> > +		if (pmd_page(*pmd) != page)
> > +			goto unlock_pmd;
> > +
> > +		referenced = pmdp_clear_young_notify(vma, addr, pmd);
> > +		spin_unlock(ptl);
> > +		goto found;
> > +unlock_pmd:
> > +		spin_unlock(ptl);
> > +		return SWAP_AGAIN;
> >  	} else {
> > -		pte = page_check_address(page, mm, addr, &ptl, 0);
> > -		if (pte) {
> > -			referenced = ptep_clear_young_notify(vma, addr, pte);
> > -			pte_unmap_unlock(pte, ptl);
> > -		}
> > +		pmd_t pmde = *pmd;
> > +		barrier();
> > +		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> > +			return SWAP_AGAIN;
> > +
> > +	}
> > +map_pte:
> > +	pte = pte_offset_map(pmd, addr);
> > +	if (!pte_present(*pte)) {
> > +		pte_unmap(pte);
> > +		return SWAP_AGAIN;
> >  	}
> > +
> > +	ptl = pte_lockptr(mm, pmd);
> > +	spin_lock(ptl);
> > +
> > +	if (!pte_present(*pte)) {
> > +		pte_unmap_unlock(pte, ptl);
> > +		return SWAP_AGAIN;
> > +	}
> > +
> > +	/* THP can be referenced by any subpage */
> > +	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
> > +		pte_unmap_unlock(pte, ptl);
> > +		return SWAP_AGAIN;
> > +	}
> > +
> > +	referenced = ptep_clear_young_notify(vma, addr, pte);
> > +	pte_unmap_unlock(pte, ptl);
> > +found:
> 
> Can't we hide this stuff in a helper function, which would be used by
> both page_referenced_one and page_idle_clear_pte_refs_one, instead of
> duplicating page_referenced_one code here?

I would like to, but there's no obvious way to do that: PMDs and PTEs
require different handling.

Any ideas?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-05  9:24       ` Kirill A. Shutemov
@ 2015-11-05 12:07         ` Vladimir Davydov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-11-05 12:07 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Naoya Horiguchi, Sasha Levin, Minchan Kim,
	linux-kernel, linux-mm

On Thu, Nov 05, 2015 at 11:24:59AM +0200, Kirill A. Shutemov wrote:
> On Thu, Nov 05, 2015 at 12:10:13PM +0300, Vladimir Davydov wrote:
> > On Tue, Nov 03, 2015 at 05:26:15PM +0200, Kirill A. Shutemov wrote:
> > ...
> > > @@ -56,23 +56,69 @@ static int page_idle_clear_pte_refs_one(struct page *page,
> > >  {
> > >  	struct mm_struct *mm = vma->vm_mm;
> > >  	spinlock_t *ptl;
> > > +	pgd_t *pgd;
> > > +	pud_t *pud;
> > >  	pmd_t *pmd;
> > >  	pte_t *pte;
> > >  	bool referenced = false;
> > >  
> > > -	if (unlikely(PageTransHuge(page))) {
> > > -		pmd = page_check_address_pmd(page, mm, addr, &ptl);
> > > -		if (pmd) {
> > > -			referenced = pmdp_clear_young_notify(vma, addr, pmd);
> > > +	pgd = pgd_offset(mm, addr);
> > > +	if (!pgd_present(*pgd))
> > > +		return SWAP_AGAIN;
> > > +	pud = pud_offset(pgd, addr);
> > > +	if (!pud_present(*pud))
> > > +		return SWAP_AGAIN;
> > > +	pmd = pmd_offset(pud, addr);
> > > +
> > > +	if (pmd_trans_huge(*pmd)) {
> > > +		ptl = pmd_lock(mm, pmd);
> > > +                if (!pmd_present(*pmd))
> > > +			goto unlock_pmd;
> > > +		if (unlikely(!pmd_trans_huge(*pmd))) {
> > >  			spin_unlock(ptl);
> > > +			goto map_pte;
> > >  		}
> > > +
> > > +		if (pmd_page(*pmd) != page)
> > > +			goto unlock_pmd;
> > > +
> > > +		referenced = pmdp_clear_young_notify(vma, addr, pmd);
> > > +		spin_unlock(ptl);
> > > +		goto found;
> > > +unlock_pmd:
> > > +		spin_unlock(ptl);
> > > +		return SWAP_AGAIN;
> > >  	} else {
> > > -		pte = page_check_address(page, mm, addr, &ptl, 0);
> > > -		if (pte) {
> > > -			referenced = ptep_clear_young_notify(vma, addr, pte);
> > > -			pte_unmap_unlock(pte, ptl);
> > > -		}
> > > +		pmd_t pmde = *pmd;
> > > +		barrier();
> > > +		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> > > +			return SWAP_AGAIN;
> > > +
> > > +	}
> > > +map_pte:
> > > +	pte = pte_offset_map(pmd, addr);
> > > +	if (!pte_present(*pte)) {
> > > +		pte_unmap(pte);
> > > +		return SWAP_AGAIN;
> > >  	}
> > > +
> > > +	ptl = pte_lockptr(mm, pmd);
> > > +	spin_lock(ptl);
> > > +
> > > +	if (!pte_present(*pte)) {
> > > +		pte_unmap_unlock(pte, ptl);
> > > +		return SWAP_AGAIN;
> > > +	}
> > > +
> > > +	/* THP can be referenced by any subpage */
> > > +	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
> > > +		pte_unmap_unlock(pte, ptl);
> > > +		return SWAP_AGAIN;
> > > +	}
> > > +
> > > +	referenced = ptep_clear_young_notify(vma, addr, pte);
> > > +	pte_unmap_unlock(pte, ptl);
> > > +found:
> > 
> > Can't we hide this stuff in a helper function, which would be used by
> > both page_referenced_one and page_idle_clear_pte_refs_one, instead of
> > duplicating page_referenced_one code here?
> 
> I would like to, but there's no obvious way to do that: PMDs and PTEs
> require different handling.
> 
> Any ideas?

Something like this? [COMPLETELY UNTESTED]
---
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 853f4f3c6742..bb9169d07c2b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -216,6 +216,10 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 	return ptep;
 }
 
+pte_t *page_check_address_transhuge(struct page *page, struct mm_struct *mm,
+				    unsigned long address,
+				    pmd_t **pmdp, spinlock_t **ptlp);
+
 /*
  * Used by swapoff to help locate where page is expected in vma.
  */
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 2c9ebe12b40d..6574ef6a1a96 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -56,69 +56,21 @@ static int page_idle_clear_pte_refs_one(struct page *page,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	spinlock_t *ptl;
-	pgd_t *pgd;
-	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	bool referenced = false;
 
-	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
+	pte = page_check_address_transhuge(page, mm, address, &pmd, &ptl);
+	if (!pte)
 		return SWAP_AGAIN;
-	pud = pud_offset(pgd, addr);
-	if (!pud_present(*pud))
-		return SWAP_AGAIN;
-	pmd = pmd_offset(pud, addr);
-
-	if (pmd_trans_huge(*pmd)) {
-		ptl = pmd_lock(mm, pmd);
-                if (!pmd_present(*pmd))
-			goto unlock_pmd;
-		if (unlikely(!pmd_trans_huge(*pmd))) {
-			spin_unlock(ptl);
-			goto map_pte;
-		}
 
-		if (pmd_page(*pmd) != page)
-			goto unlock_pmd;
-
-		referenced = pmdp_clear_young_notify(vma, addr, pmd);
-		spin_unlock(ptl);
-		goto found;
-unlock_pmd:
-		spin_unlock(ptl);
-		return SWAP_AGAIN;
-	} else {
-		pmd_t pmde = *pmd;
-		barrier();
-		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
-			return SWAP_AGAIN;
-
-	}
-map_pte:
-	pte = pte_offset_map(pmd, addr);
-	if (!pte_present(*pte)) {
-		pte_unmap(pte);
-		return SWAP_AGAIN;
-	}
+	if (pte == pmd) /* trans huge */
+		referenced = pmdp_clear_young_notify(vma, address, pmd);
+	else
+		referenced = ptep_clear_young_notify(vma, addr, pte);
 
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
-
-	if (!pte_present(*pte)) {
-		pte_unmap_unlock(pte, ptl);
-		return SWAP_AGAIN;
-	}
-
-	/* THP can be referenced by any subpage */
-	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
-		pte_unmap_unlock(pte, ptl);
-		return SWAP_AGAIN;
-	}
-
-	referenced = ptep_clear_young_notify(vma, addr, pte);
 	pte_unmap_unlock(pte, ptl);
-found:
+
 	if (referenced) {
 		clear_page_idle(page);
 		/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 1f90bda685b6..3638190cf7bc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -796,48 +796,35 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 	return 1;
 }
 
-struct page_referenced_arg {
-	int mapcount;
-	int referenced;
-	unsigned long vm_flags;
-	struct mem_cgroup *memcg;
-};
-/*
- * arg: page_referenced_arg will be passed
- */
-static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
-			unsigned long address, void *arg)
+pte_t *page_check_address_transhuge(struct page *page, struct mm_struct *mm,
+				    unsigned long address,
+				    pmd_t **pmdp, spinlock_t **ptlp)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	int referenced = 0;
-	struct page_referenced_arg *pra = arg;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (unlikely(PageHuge(page))) {
 		/* when pud is not present, pte will be NULL */
 		pte = huge_pte_offset(mm, address);
 		if (!pte)
-			return SWAP_AGAIN;
+			return NULL;
 
 		ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
+		pmd = NULL;
 		goto check_pte;
 	}
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
-		return SWAP_AGAIN;
-	pud = pud_offset(pgd, address);
+		return NULL;
 	if (!pud_present(*pud))
-		return SWAP_AGAIN;
+		return NULL;
 	pmd = pmd_offset(pud, address);
 
 	if (pmd_trans_huge(*pmd)) {
-		int ret = SWAP_AGAIN;
-
 		ptl = pmd_lock(mm, pmd);
 		if (!pmd_present(*pmd))
 			goto unlock_pmd;
@@ -849,30 +836,23 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		if (pmd_page(*pmd) != page)
 			goto unlock_pmd;
 
-		if (vma->vm_flags & VM_LOCKED) {
-			pra->vm_flags |= VM_LOCKED;
-			ret = SWAP_FAIL; /* To break the loop */
-			goto unlock_pmd;
-		}
-
-		if (pmdp_clear_flush_young_notify(vma, address, pmd))
-			referenced++;
-		spin_unlock(ptl);
+		pte = (pte_t *)pmd;
 		goto found;
 unlock_pmd:
 		spin_unlock(ptl);
-		return ret;
+		return NULL;
 	} else {
 		pmd_t pmde = *pmd;
 		barrier();
 		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
-			return SWAP_AGAIN;
+			return NULL;
 	}
+
 map_pte:
 	pte = pte_offset_map(pmd, address);
 	if (!pte_present(*pte)) {
 		pte_unmap(pte);
-		return SWAP_AGAIN;
+		return NULL;
 	}
 
 	ptl = pte_lockptr(mm, pmd);
@@ -881,35 +861,66 @@ check_pte:
 
 	if (!pte_present(*pte)) {
 		pte_unmap_unlock(pte, ptl);
-		return SWAP_AGAIN;
+		return NULL;
 	}
 
 	/* THP can be referenced by any subpage */
 	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
 		pte_unmap_unlock(pte, ptl);
-		return SWAP_AGAIN;
+		return NULL;
 	}
+found:
+	*ptlp = ptl;
+	*pmdp = pmd;
+	return pte;
+}
+
+struct page_referenced_arg {
+	int mapcount;
+	int referenced;
+	unsigned long vm_flags;
+	struct mem_cgroup *memcg;
+};
+/*
+ * arg: page_referenced_arg will be passed
+ */
+static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
+			unsigned long address, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	int referenced = 0;
+	struct page_referenced_arg *pra = arg;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+
+	pte = page_check_address_transhuge(page, mm, address, &pmd, &ptl);
+	if (!pte)
+		return SWAP_AGAIN;
 
 	if (vma->vm_flags & VM_LOCKED) {
 		pte_unmap_unlock(pte, ptl);
-		pra->vm_flags |= VM_LOCKED;
 		return SWAP_FAIL; /* To break the loop */
 	}
 
-	if (ptep_clear_flush_young_notify(vma, address, pte)) {
-		/*
-		 * Don't treat a reference through a sequentially read
-		 * mapping as such.  If the page has been used in
-		 * another mapping, we will catch it; if this other
-		 * mapping is already gone, the unmap path will have
-		 * set PG_referenced or activated the page.
-		 */
-		if (likely(!(vma->vm_flags & VM_SEQ_READ)))
+	if (pte == pmd) { /* trans huge */
+		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
+	} else {
+		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+			/*
+			 * Don't treat a reference through a sequentially read
+			 * mapping as such.  If the page has been used in
+			 * another mapping, we will catch it; if this other
+			 * mapping is already gone, the unmap path will have
+			 * set PG_referenced or activated the page.
+			 */
+			if (likely(!(vma->vm_flags & VM_SEQ_READ)))
+				referenced++;
+		}
 	}
 	pte_unmap_unlock(pte, ptl);
 
-found:
 	if (referenced)
 		clear_page_idle(page);
 	if (test_and_clear_page_young(page))

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-05 12:07         ` Vladimir Davydov
@ 2015-11-05 12:36           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-05 12:36 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Naoya Horiguchi, Sasha Levin, Minchan Kim,
	linux-kernel, linux-mm

On Thu, Nov 05, 2015 at 03:07:26PM +0300, Vladimir Davydov wrote:
> @@ -849,30 +836,23 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  		if (pmd_page(*pmd) != page)
>  			goto unlock_pmd;
>  
> -		if (vma->vm_flags & VM_LOCKED) {
> -			pra->vm_flags |= VM_LOCKED;
> -			ret = SWAP_FAIL; /* To break the loop */
> -			goto unlock_pmd;
> -		}
> -
> -		if (pmdp_clear_flush_young_notify(vma, address, pmd))
> -			referenced++;
> -		spin_unlock(ptl);
> +		pte = (pte_t *)pmd;

pmd_t and pte_t are not always compatible. We shouldn't pretend they are.
And we shouldn't use pte_unmap_unlock() to unlock pmd table.

What about interface like this (I'm not sure about helper's name):

void page_check_address_transhuge(struct page *page, struct mm_struct *mm,
                                   unsigned long address,
                                   pmd_t **pmdp, pte_t **ptep,
				   spinlock_t **ptlp);

page_check_address_transhuge(page, mm, address, &pmd, &pte, &ptl);
if (pmd) {
	/* handle pmd... */
} else if (pte) {
	/* handle pte... */
} else {
	return SWAP_AGAIN;
}

/* common stuff */

if (pmd)
	spin_unlock(ptl);
else 
	pte_unmap_unlock(pte, ptl);

/* ... */

The helper shouldn't set pmd if the page is mapped by a pte.
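
Roughly, a caller like page_idle_clear_pte_refs_one() would then become
something like this (untested sketch, only to show the intent of the
interface):

	struct mm_struct *mm = vma->vm_mm;
	spinlock_t *ptl;
	pmd_t *pmd;
	pte_t *pte;
	bool referenced = false;

	page_check_address_transhuge(page, mm, addr, &pmd, &pte, &ptl);
	if (pmd)
		referenced = pmdp_clear_young_notify(vma, addr, pmd);
	else if (pte)
		referenced = ptep_clear_young_notify(vma, addr, pte);
	else
		return SWAP_AGAIN;

	if (pmd)
		spin_unlock(ptl);
	else
		pte_unmap_unlock(pte, ptl);

	/* the common "if (referenced) ..." part stays as it is now */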

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-05 12:36           ` Kirill A. Shutemov
@ 2015-11-05 12:53             ` Vladimir Davydov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-11-05 12:53 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Naoya Horiguchi, Sasha Levin, Minchan Kim,
	linux-kernel, linux-mm

On Thu, Nov 05, 2015 at 02:36:06PM +0200, Kirill A. Shutemov wrote:
> On Thu, Nov 05, 2015 at 03:07:26PM +0300, Vladimir Davydov wrote:
> > @@ -849,30 +836,23 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> >  		if (pmd_page(*pmd) != page)
> >  			goto unlock_pmd;
> >  
> > -		if (vma->vm_flags & VM_LOCKED) {
> > -			pra->vm_flags |= VM_LOCKED;
> > -			ret = SWAP_FAIL; /* To break the loop */
> > -			goto unlock_pmd;
> > -		}
> > -
> > -		if (pmdp_clear_flush_young_notify(vma, address, pmd))
> > -			referenced++;
> > -		spin_unlock(ptl);
> > +		pte = (pte_t *)pmd;
> 
> pmd_t and pte_t are not always compatible. We shouldn't pretend they are.
> And we shouldn't use pte_unmap_unlock() to unlock pmd table.

Out of curiosity, is it OK that __page_check_address can call
pte_unmap_unlock on a pte returned by huge_pte_offset, which isn't really
a pte, but a pmd or pud?

> 
> What about interface like this (I'm not sure about helper's name):
> 
> void page_check_address_transhuge(struct page *page, struct mm_struct *mm,
>                                    unsigned long address,
>                                    pmd_t **pmdp, pte_t **ptep,
> 				   spinlock_t **ptlp);
> 
> page_check_address_transhuge(page, mm, address, &pmd, &pte, &ptl);
> if (pmd) {
> 	/* handle pmd... */
> } else if (pte) {
> 	/* handle pte... */
> } else {
> 	return SWAP_AGAIN;
> }
> 
> /* common stuff */
> 
> if (pmd)
> 	spin_unlock(ptl);
> else 
> 	pte_unmap_unlock(pte, ptl);

spin_unlock(ptl);
if (pte)
	pte_unmap(pte);

would look neater IMO. Other than that, I think it's OK. At least, it
looks better and less error-prone than duplicating such a huge chunk of
code IMO.

Thanks,
Vladimir

> 
> /* ... */
> 
> The helper shouldn't set pmd if the page is tracked to pte.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-05 12:53             ` Vladimir Davydov
@ 2015-11-05 12:58               ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-05 12:58 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Naoya Horiguchi, Sasha Levin, Minchan Kim,
	linux-kernel, linux-mm

On Thu, Nov 05, 2015 at 03:53:54PM +0300, Vladimir Davydov wrote:
> On Thu, Nov 05, 2015 at 02:36:06PM +0200, Kirill A. Shutemov wrote:
> > On Thu, Nov 05, 2015 at 03:07:26PM +0300, Vladimir Davydov wrote:
> > > @@ -849,30 +836,23 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> > >  		if (pmd_page(*pmd) != page)
> > >  			goto unlock_pmd;
> > >  
> > > -		if (vma->vm_flags & VM_LOCKED) {
> > > -			pra->vm_flags |= VM_LOCKED;
> > > -			ret = SWAP_FAIL; /* To break the loop */
> > > -			goto unlock_pmd;
> > > -		}
> > > -
> > > -		if (pmdp_clear_flush_young_notify(vma, address, pmd))
> > > -			referenced++;
> > > -		spin_unlock(ptl);
> > > +		pte = (pte_t *)pmd;
> > 
> > pmd_t and pte_t are not always compatible. We shouldn't pretend they are.
> > And we shouldn't use pte_unmap_unlock() to unlock pmd table.
> 
> Out of curiosity, is it OK that __page_check_address can call
> pte_unmap_unlock on pte returned by huge_pte_offset, which isn't really
> pte, but pmd or pud?

hugetlb is usually implemented on architectures where you can expect some
level of compatibility between page table entries on different levels.

> > What about interface like this (I'm not sure about helper's name):
> > 
> > void page_check_address_transhuge(struct page *page, struct mm_struct *mm,
> >                                    unsigned long address,
> >                                    pmd_t **pmdp, pte_t **ptep,
> > 				   spinlock_t **ptlp);
> > 
> > page_check_address_transhuge(page, mm, address, &pmd, &pte, &ptl);
> > if (pmd) {
> > 	/* handle pmd... */
> > } else if (pte) {
> > 	/* handle pte... */
> > } else {
> > 	return SWAP_AGAIN;
> > }
> > 
> > /* common stuff */
> > 
> > if (pmd)
> > 	spin_unlock(ptl);
> > else 
> > 	pte_unmap_unlock(pte, ptl);
> 
> spin_unlock(ptl);
> if (pte)
> 	pte_unmap(pte);
> 
> would look neater IMO. Other than that, I think it's OK. At least, it
> looks better and less error-prone than duplicating such a huge chunk of
> code IMO.

Okay. Could you prepare the patch?

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-03 15:26   ` Kirill A. Shutemov
@ 2015-11-05 16:03     ` Vladimir Davydov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-11-05 16:03 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, Andrea Arcangeli, Hugh Dickins, Naoya Horiguchi,
	Sasha Levin, Minchan Kim, linux-kernel, linux-mm

On Tue, Nov 03, 2015 at 05:26:15PM +0200, Kirill A. Shutemov wrote:
...
> @@ -812,60 +812,104 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
>  	spinlock_t *ptl;
>  	int referenced = 0;
>  	struct page_referenced_arg *pra = arg;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
>  
> -	if (unlikely(PageTransHuge(page))) {
> -		pmd_t *pmd;
> -
> -		/*
> -		 * rmap might return false positives; we must filter
> -		 * these out using page_check_address_pmd().
> -		 */
> -		pmd = page_check_address_pmd(page, mm, address, &ptl);
> -		if (!pmd)
> +	if (unlikely(PageHuge(page))) {
> +		/* when pud is not present, pte will be NULL */
> +		pte = huge_pte_offset(mm, address);
> +		if (!pte)
>  			return SWAP_AGAIN;
>  
> -		if (vma->vm_flags & VM_LOCKED) {
> +		ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
> +		goto check_pte;
> +	}
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		return SWAP_AGAIN;
> +	pud = pud_offset(pgd, address);
> +	if (!pud_present(*pud))
> +		return SWAP_AGAIN;
> +	pmd = pmd_offset(pud, address);
> +
> +	if (pmd_trans_huge(*pmd)) {
> +		int ret = SWAP_AGAIN;
> +
> +		ptl = pmd_lock(mm, pmd);
> +		if (!pmd_present(*pmd))
> +			goto unlock_pmd;
> +		if (unlikely(!pmd_trans_huge(*pmd))) {
>  			spin_unlock(ptl);
> +			goto map_pte;
> +		}
> +
> +		if (pmd_page(*pmd) != page)
> +			goto unlock_pmd;
> +
> +		if (vma->vm_flags & VM_LOCKED) {
>  			pra->vm_flags |= VM_LOCKED;
> -			return SWAP_FAIL; /* To break the loop */
> +			ret = SWAP_FAIL; /* To break the loop */
> +			goto unlock_pmd;
>  		}
>  
>  		if (pmdp_clear_flush_young_notify(vma, address, pmd))
>  			referenced++;
> -
>  		spin_unlock(ptl);
> +		goto found;
> +unlock_pmd:
> +		spin_unlock(ptl);
> +		return ret;
>  	} else {
> -		pte_t *pte;
> -
> -		/*
> -		 * rmap might return false positives; we must filter
> -		 * these out using page_check_address().
> -		 */
> -		pte = page_check_address(page, mm, address, &ptl, 0);
> -		if (!pte)
> +		pmd_t pmde = *pmd;
> +		barrier();

This is supposed to be

		pmd_t pmde = READ_ONCE(*pmd);

Right?

I don't understand why we need a barrier here. Why can't we just do

	} else if (!pmd_present(*pmd))
		return SWAP_AGAIN;

?

Thanks,
Vladimir

> +		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
>  			return SWAP_AGAIN;
> +	}

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-05 12:58               ` Kirill A. Shutemov
@ 2015-11-05 16:31                 ` Vladimir Davydov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-11-05 16:31 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Naoya Horiguchi, Sasha Levin, Minchan Kim,
	linux-kernel, linux-mm

On Thu, Nov 05, 2015 at 02:58:38PM +0200, Kirill A. Shutemov wrote:

> Okay. Could you prepare the patch?

OK, give me some time.

Thanks,
Vladimir

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-05 16:03     ` Vladimir Davydov
@ 2015-11-05 17:27       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-05 17:27 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Naoya Horiguchi, Sasha Levin, Minchan Kim,
	linux-kernel, linux-mm

On Thu, Nov 05, 2015 at 07:03:24PM +0300, Vladimir Davydov wrote:
> On Tue, Nov 03, 2015 at 05:26:15PM +0200, Kirill A. Shutemov wrote:
> ...
> > @@ -812,60 +812,104 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
> >  	spinlock_t *ptl;
> >  	int referenced = 0;
> >  	struct page_referenced_arg *pra = arg;
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> > +	pmd_t *pmd;
> > +	pte_t *pte;
> >  
> > -	if (unlikely(PageTransHuge(page))) {
> > -		pmd_t *pmd;
> > -
> > -		/*
> > -		 * rmap might return false positives; we must filter
> > -		 * these out using page_check_address_pmd().
> > -		 */
> > -		pmd = page_check_address_pmd(page, mm, address, &ptl);
> > -		if (!pmd)
> > +	if (unlikely(PageHuge(page))) {
> > +		/* when pud is not present, pte will be NULL */
> > +		pte = huge_pte_offset(mm, address);
> > +		if (!pte)
> >  			return SWAP_AGAIN;
> >  
> > -		if (vma->vm_flags & VM_LOCKED) {
> > +		ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
> > +		goto check_pte;
> > +	}
> > +
> > +	pgd = pgd_offset(mm, address);
> > +	if (!pgd_present(*pgd))
> > +		return SWAP_AGAIN;
> > +	pud = pud_offset(pgd, address);
> > +	if (!pud_present(*pud))
> > +		return SWAP_AGAIN;
> > +	pmd = pmd_offset(pud, address);
> > +
> > +	if (pmd_trans_huge(*pmd)) {
> > +		int ret = SWAP_AGAIN;
> > +
> > +		ptl = pmd_lock(mm, pmd);
> > +		if (!pmd_present(*pmd))
> > +			goto unlock_pmd;
> > +		if (unlikely(!pmd_trans_huge(*pmd))) {
> >  			spin_unlock(ptl);
> > +			goto map_pte;
> > +		}
> > +
> > +		if (pmd_page(*pmd) != page)
> > +			goto unlock_pmd;
> > +
> > +		if (vma->vm_flags & VM_LOCKED) {
> >  			pra->vm_flags |= VM_LOCKED;
> > -			return SWAP_FAIL; /* To break the loop */
> > +			ret = SWAP_FAIL; /* To break the loop */
> > +			goto unlock_pmd;
> >  		}
> >  
> >  		if (pmdp_clear_flush_young_notify(vma, address, pmd))
> >  			referenced++;
> > -
> >  		spin_unlock(ptl);
> > +		goto found;
> > +unlock_pmd:
> > +		spin_unlock(ptl);
> > +		return ret;
> >  	} else {
> > -		pte_t *pte;
> > -
> > -		/*
> > -		 * rmap might return false positives; we must filter
> > -		 * these out using page_check_address().
> > -		 */
> > -		pte = page_check_address(page, mm, address, &ptl, 0);
> > -		if (!pte)
> > +		pmd_t pmde = *pmd;
> > +		barrier();
> 
> This is supposed to be
> 
> 		pmd_t pmde = READ_ONCE(*pmd);
> 
> Right?

See e37c69827063. If I read this correctly, barrier() has less overhead on
some archs.

> 
> I don't understand why we need a barrier here. Why can't we just do
> 
> 	} else if (!pmd_present(*pmd))
> 		return SWAP_AGAIN;
> 
> ?

See f72e7dcdd252 too.
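
The point, as I understand those commits, is to do both checks on a single
snapshot of the entry (the annotation below is mine, not from the patch):

	pmd_t pmde = *pmd;	/* one local copy of the pmd entry */

	barrier();		/* don't let the compiler re-read *pmd below */

	/*
	 * Both checks must see the same value: a concurrent THP fault or
	 * collapse can rewrite *pmd between two separate dereferences.
	 * barrier() is used rather than READ_ONCE() because pmd_t may be
	 * wider than one machine word on some configurations.
	 */
	if (!pmd_present(pmde) || pmd_trans_huge(pmde))
		return SWAP_AGAIN;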

> > +		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
> >  			return SWAP_AGAIN;
> > +	}
-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-03 15:26   ` Kirill A. Shutemov
@ 2015-11-06  0:32     ` Andrew Morton
  -1 siblings, 0 replies; 40+ messages in thread
From: Andrew Morton @ 2015-11-06  0:32 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrea Arcangeli, Hugh Dickins, Naoya Horiguchi, Sasha Levin,
	Minchan Kim, linux-kernel, linux-mm, Vladimir Davydov

On Tue,  3 Nov 2015 17:26:15 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:

> > I've missed two similar codepaths which need some preparation to work well
> with reworked THP refcounting.
> 
> Both page_referenced() and page_idle_clear_pte_refs_one() assume that
> THP can only be mapped with PMD, so there's no reason to look on PTEs
> > for PageTransHuge() pages. That's not true anymore: THP can be mapped
> with PTEs too.
> 
> > The patch removes the PageTransHuge() test from the functions and open-codes
> > the page table check.

x86_64 allnoconfig:

In file included from mm/rmap.c:47:
include/linux/mm.h: In function 'page_referenced':
include/linux/mm.h:448: error: call to '__compiletime_assert_448' declared with attribute error: BUILD_BUG failed
make[1]: *** [mm/rmap.o] Error 1
make: *** [mm/rmap.o] Error 2

because

#else /* CONFIG_TRANSPARENT_HUGEPAGE */
#define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
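
i.e. (macro chain abbreviated, quoted from memory rather than verbatim):

#define HPAGE_PMD_ORDER (HPAGE_PMD_SHIFT-PAGE_SHIFT)
#define HPAGE_PMD_NR (1<<HPAGE_PMD_ORDER)

so the HPAGE_PMD_NR loop in the inline total_mapcount() isn't provably dead
code with THP disabled, and the assertion fires wherever the helper actually
gets inlined, here in page_referenced().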


btw, total_mapcount() is far too large to be inlined and
page_mapcount() is getting pretty bad too.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-06  0:32     ` Andrew Morton
@ 2015-11-06 10:29       ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-06 10:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Hugh Dickins,
	Naoya Horiguchi, Sasha Levin, Minchan Kim, linux-kernel,
	linux-mm, Vladimir Davydov

On Thu, Nov 05, 2015 at 04:32:11PM -0800, Andrew Morton wrote:
> On Tue,  3 Nov 2015 17:26:15 +0200 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> wrote:
> 
> > > I've missed two similar codepaths which need some preparation to work well
> > with reworked THP refcounting.
> > 
> > Both page_referenced() and page_idle_clear_pte_refs_one() assume that
> > THP can only be mapped with PMD, so there's no reason to look on PTEs
> > > for PageTransHuge() pages. That's not true anymore: THP can be mapped
> > with PTEs too.
> > 
> > > The patch removes the PageTransHuge() test from the functions and open-codes
> > > the page table check.
> 
> x86_64 allnoconfig:
> 
> In file included from mm/rmap.c:47:
> include/linux/mm.h: In function 'page_referenced':
> include/linux/mm.h:448: error: call to '__compiletime_assert_448' declared with attribute error: BUILD_BUG failed
> make[1]: *** [mm/rmap.o] Error 1
> make: *** [mm/rmap.o] Error 2
> 
> because
> 
> #else /* CONFIG_TRANSPARENT_HUGEPAGE */
> #define HPAGE_PMD_SHIFT ({ BUILD_BUG(); 0; })
> 
> 
> btw, total_mapcount() is far too large to be inlined and

The patch below is my proposal to fix this.

> page_mapcount() is getting pretty bad too.

Do you want me to uninline the slow path (PageCompound())?

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a36f9fa4e4cd..f874d2a1d1a6 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -432,24 +432,14 @@ static inline int page_mapcount(struct page *page)
 	return ret;
 }
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+int total_mapcount(struct page *page);
+#else
 static inline int total_mapcount(struct page *page)
 {
-	int i, ret;
-
-	VM_BUG_ON_PAGE(PageTail(page), page);
-
-	if (likely(!PageCompound(page)))
-		return atomic_read(&page->_mapcount) + 1;
-
-	ret = compound_mapcount(page);
-	if (PageHuge(page))
-		return ret;
-	for (i = 0; i < HPAGE_PMD_NR; i++)
-		ret += atomic_read(&page[i]._mapcount) + 1;
-	if (PageDoubleMap(page))
-		ret -= HPAGE_PMD_NR;
-	return ret;
+	return page_mapcount(page);
 }
+#endif
 
 static inline int page_count(struct page *page)
 {
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 14cbbad54a3e..287bc009bc10 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3236,6 +3236,25 @@ static void __split_huge_page(struct page *page, struct list_head *list)
 	}
 }
 
+int total_mapcount(struct page *page)
+{
+	int i, ret;
+
+	VM_BUG_ON_PAGE(PageTail(page), page);
+
+	if (likely(!PageCompound(page)))
+		return atomic_read(&page->_mapcount) + 1;
+
+	ret = compound_mapcount(page);
+	if (PageHuge(page))
+		return ret;
+	for (i = 0; i < HPAGE_PMD_NR; i++)
+		ret += atomic_read(&page[i]._mapcount) + 1;
+	if (PageDoubleMap(page))
+		ret -= HPAGE_PMD_NR;
+	return ret;
+}
+
 /*
  * This function splits huge page into normal pages. @page can point to any
  * subpage of huge page to split. Split doesn't change the position of @page.
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH] mm: add page_check_address_transhuge helper
  2015-11-05 12:58               ` Kirill A. Shutemov
@ 2015-11-06 14:37                 ` Vladimir Davydov
  -1 siblings, 0 replies; 40+ messages in thread
From: Vladimir Davydov @ 2015-11-06 14:37 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Naoya Horiguchi, Sasha Levin, Minchan Kim,
	linux-kernel, linux-mm

page_referenced_one() and page_idle_clear_pte_refs_one() duplicate the
code for looking up the pte of a (possibly transhuge) page. Move this code
to a new helper function, page_check_address_transhuge(), and make the
above-mentioned functions use it.

This is just a cleanup, no functional changes are intended.

Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
---
 include/linux/rmap.h |   8 ++++
 mm/page_idle.c       |  62 ++++-------------------------
 mm/rmap.c            | 110 ++++++++++++++++++++++++++++++---------------------
 3 files changed, 81 insertions(+), 99 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 853f4f3c6742..7b7ce0282c2d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -217,6 +217,14 @@ static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
 }
 
 /*
+ * Used by idle page tracking to check if a page was referenced via page
+ * tables.
+ */
+bool page_check_address_transhuge(struct page *page, struct mm_struct *mm,
+				  unsigned long address, pmd_t **pmdp,
+				  pte_t **ptep, spinlock_t **ptlp);
+
+/*
  * Used by swapoff to help locate where page is expected in vma.
  */
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
diff --git a/mm/page_idle.c b/mm/page_idle.c
index 2c9ebe12b40d..374931f32ebc 100644
--- a/mm/page_idle.c
+++ b/mm/page_idle.c
@@ -55,70 +55,22 @@ static int page_idle_clear_pte_refs_one(struct page *page,
 					unsigned long addr, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	pgd_t *pgd;
-	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	spinlock_t *ptl;
 	bool referenced = false;
 
-	pgd = pgd_offset(mm, addr);
-	if (!pgd_present(*pgd))
-		return SWAP_AGAIN;
-	pud = pud_offset(pgd, addr);
-	if (!pud_present(*pud))
-		return SWAP_AGAIN;
-	pmd = pmd_offset(pud, addr);
-
-	if (pmd_trans_huge(*pmd)) {
-		ptl = pmd_lock(mm, pmd);
-                if (!pmd_present(*pmd))
-			goto unlock_pmd;
-		if (unlikely(!pmd_trans_huge(*pmd))) {
-			spin_unlock(ptl);
-			goto map_pte;
-		}
-
-		if (pmd_page(*pmd) != page)
-			goto unlock_pmd;
-
-		referenced = pmdp_clear_young_notify(vma, addr, pmd);
-		spin_unlock(ptl);
-		goto found;
-unlock_pmd:
-		spin_unlock(ptl);
+	if (!page_check_address_transhuge(page, mm, addr, &pmd, &pte, &ptl))
 		return SWAP_AGAIN;
-	} else {
-		pmd_t pmde = *pmd;
-		barrier();
-		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
-			return SWAP_AGAIN;
 
-	}
-map_pte:
-	pte = pte_offset_map(pmd, addr);
-	if (!pte_present(*pte)) {
+	if (pte) {
+		referenced = ptep_clear_young_notify(vma, addr, pte);
 		pte_unmap(pte);
-		return SWAP_AGAIN;
-	}
-
-	ptl = pte_lockptr(mm, pmd);
-	spin_lock(ptl);
-
-	if (!pte_present(*pte)) {
-		pte_unmap_unlock(pte, ptl);
-		return SWAP_AGAIN;
-	}
+	} else
+		referenced = pmdp_clear_young_notify(vma, addr, pmd);
 
-	/* THP can be referenced by any subpage */
-	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
-		pte_unmap_unlock(pte, ptl);
-		return SWAP_AGAIN;
-	}
+	spin_unlock(ptl);
 
-	referenced = ptep_clear_young_notify(vma, addr, pte);
-	pte_unmap_unlock(pte, ptl);
-found:
 	if (referenced) {
 		clear_page_idle(page);
 		/*
diff --git a/mm/rmap.c b/mm/rmap.c
index 0837487d3737..7ac775e41820 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -796,48 +796,43 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
 	return 1;
 }
 
-struct page_referenced_arg {
-	int mapcount;
-	int referenced;
-	unsigned long vm_flags;
-	struct mem_cgroup *memcg;
-};
 /*
- * arg: page_referenced_arg will be passed
+ * Check that @page is mapped at @address into @mm. In contrast to
+ * page_check_address(), this function can handle transparent huge pages.
+ *
+ * On success returns true with pte mapped and locked. For transparent huge
+ * pages *@ptep is set to NULL.
  */
-static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
-			unsigned long address, void *arg)
+bool page_check_address_transhuge(struct page *page, struct mm_struct *mm,
+				  unsigned long address, pmd_t **pmdp,
+				  pte_t **ptep, spinlock_t **ptlp)
 {
-	struct mm_struct *mm = vma->vm_mm;
-	spinlock_t *ptl;
-	int referenced = 0;
-	struct page_referenced_arg *pra = arg;
 	pgd_t *pgd;
 	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
+	spinlock_t *ptl;
 
 	if (unlikely(PageHuge(page))) {
 		/* when pud is not present, pte will be NULL */
 		pte = huge_pte_offset(mm, address);
 		if (!pte)
-			return SWAP_AGAIN;
+			return false;
 
 		ptl = huge_pte_lockptr(page_hstate(page), mm, pte);
+		pmd = NULL;
 		goto check_pte;
 	}
 
 	pgd = pgd_offset(mm, address);
 	if (!pgd_present(*pgd))
-		return SWAP_AGAIN;
+		return false;
 	pud = pud_offset(pgd, address);
 	if (!pud_present(*pud))
-		return SWAP_AGAIN;
+		return false;
 	pmd = pmd_offset(pud, address);
 
 	if (pmd_trans_huge(*pmd)) {
-		int ret = SWAP_AGAIN;
-
 		ptl = pmd_lock(mm, pmd);
 		if (!pmd_present(*pmd))
 			goto unlock_pmd;
@@ -849,30 +844,22 @@ static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
 		if (pmd_page(*pmd) != page)
 			goto unlock_pmd;
 
-		if (vma->vm_flags & VM_LOCKED) {
-			pra->vm_flags |= VM_LOCKED;
-			ret = SWAP_FAIL; /* To break the loop */
-			goto unlock_pmd;
-		}
-
-		if (pmdp_clear_flush_young_notify(vma, address, pmd))
-			referenced++;
-		spin_unlock(ptl);
+		pte = NULL;
 		goto found;
 unlock_pmd:
 		spin_unlock(ptl);
-		return ret;
+		return false;
 	} else {
 		pmd_t pmde = *pmd;
 		barrier();
 		if (!pmd_present(pmde) || pmd_trans_huge(pmde))
-			return SWAP_AGAIN;
+			return false;
 	}
 map_pte:
 	pte = pte_offset_map(pmd, address);
 	if (!pte_present(*pte)) {
 		pte_unmap(pte);
-		return SWAP_AGAIN;
+		return false;
 	}
 
 	ptl = pte_lockptr(mm, pmd);
@@ -881,35 +868,70 @@ check_pte:
 
 	if (!pte_present(*pte)) {
 		pte_unmap_unlock(pte, ptl);
-		return SWAP_AGAIN;
+		return false;
 	}
 
 	/* THP can be referenced by any subpage */
 	if (pte_pfn(*pte) - page_to_pfn(page) >= hpage_nr_pages(page)) {
 		pte_unmap_unlock(pte, ptl);
-		return SWAP_AGAIN;
+		return false;
 	}
+found:
+	*ptep = pte;
+	*pmdp = pmd;
+	*ptlp = ptl;
+	return true;
+}
+
+struct page_referenced_arg {
+	int mapcount;
+	int referenced;
+	unsigned long vm_flags;
+	struct mem_cgroup *memcg;
+};
+/*
+ * arg: page_referenced_arg will be passed
+ */
+static int page_referenced_one(struct page *page, struct vm_area_struct *vma,
+			unsigned long address, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page_referenced_arg *pra = arg;
+	pmd_t *pmd;
+	pte_t *pte;
+	spinlock_t *ptl;
+	int referenced = 0;
+
+	if (!page_check_address_transhuge(page, mm, address, &pmd, &pte, &ptl))
+		return SWAP_AGAIN;
 
 	if (vma->vm_flags & VM_LOCKED) {
-		pte_unmap_unlock(pte, ptl);
+		if (pte)
+			pte_unmap(pte);
+		spin_unlock(ptl);
 		pra->vm_flags |= VM_LOCKED;
 		return SWAP_FAIL; /* To break the loop */
 	}
 
-	if (ptep_clear_flush_young_notify(vma, address, pte)) {
-		/*
-		 * Don't treat a reference through a sequentially read
-		 * mapping as such.  If the page has been used in
-		 * another mapping, we will catch it; if this other
-		 * mapping is already gone, the unmap path will have
-		 * set PG_referenced or activated the page.
-		 */
-		if (likely(!(vma->vm_flags & VM_SEQ_READ)))
+	if (pte) {
+		if (ptep_clear_flush_young_notify(vma, address, pte)) {
+			/*
+			 * Don't treat a reference through a sequentially read
+			 * mapping as such.  If the page has been used in
+			 * another mapping, we will catch it; if this other
+			 * mapping is already gone, the unmap path will have
+			 * set PG_referenced or activated the page.
+			 */
+			if (likely(!(vma->vm_flags & VM_SEQ_READ)))
+				referenced++;
+		}
+		pte_unmap(pte);
+	} else {
+		if (pmdp_clear_flush_young_notify(vma, address, pmd))
 			referenced++;
 	}
-	pte_unmap_unlock(pte, ptl);
+	spin_unlock(ptl);
 
-found:
 	if (referenced)
 		clear_page_idle(page);
 	if (test_and_clear_page_young(page))
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: add page_check_address_transhuge helper
  2015-11-06 14:37                 ` Vladimir Davydov
@ 2015-11-06 15:24                   ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-06 15:24 UTC (permalink / raw)
  To: Vladimir Davydov
  Cc: Kirill A. Shutemov, Andrew Morton, Andrea Arcangeli,
	Hugh Dickins, Naoya Horiguchi, Sasha Levin, Minchan Kim,
	linux-kernel, linux-mm

On Fri, Nov 06, 2015 at 05:37:07PM +0300, Vladimir Davydov wrote:
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 0837487d3737..7ac775e41820 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -796,48 +796,43 @@ int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma)
>  	return 1;
>  }
>  
> -struct page_referenced_arg {
> -	int mapcount;
> -	int referenced;
> -	unsigned long vm_flags;
> -	struct mem_cgroup *memcg;
> -};
>  /*
> - * arg: page_referenced_arg will be passed
> + * Check that @page is mapped at @address into @mm. In contrast to
> + * page_check_address(), this function can handle transparent huge pages.
> + *
> + * On success returns true with pte mapped and locked. For transparent huge
> + * pages *@ptep is set to NULL.

I think

	"For PMD-mapped transparent huge pages"...

would be more correct.
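
That is, something like this for the comment (just spelling the suggested
wording out):

 * On success returns true with pte mapped and locked. For PMD-mapped
 * transparent huge pages *@ptep is set to NULL.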

Otherwise looks great!

Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-06 10:29       ` Kirill A. Shutemov
@ 2015-11-06 22:39         ` Andrew Morton
  -1 siblings, 0 replies; 40+ messages in thread
From: Andrew Morton @ 2015-11-06 22:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Hugh Dickins,
	Naoya Horiguchi, Sasha Levin, Minchan Kim, linux-kernel,
	linux-mm, Vladimir Davydov

On Fri, 6 Nov 2015 12:29:21 +0200 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:

> > page_mapcount() is getting pretty bad too.
> 
> Do you want me to uninline slow path (PageCompound())?

I guess so.  Uninlining all of page_mapcount() does this:

gcc-4.4.4:

   text    data     bss     dec     hex filename
 973702  273954  831512 2079168  1fb9c0 mm/built-in.o-before
 970148  273954  831000 2075102  1fa9de mm/built-in.o-after

That's quite a bit of bloat.

I don't know why bss changed; this usually (always?) happens.  Seems
bogus.
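
(The numbers are plain size(1) output on the two saved object files, i.e.
something like

	size mm/built-in.o-before mm/built-in.o-after

for anyone who wants to reproduce the comparison.)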

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting
  2015-11-06 22:39         ` Andrew Morton
@ 2015-11-08 23:40           ` Kirill A. Shutemov
  -1 siblings, 0 replies; 40+ messages in thread
From: Kirill A. Shutemov @ 2015-11-08 23:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Kirill A. Shutemov, Andrea Arcangeli, Hugh Dickins,
	Naoya Horiguchi, Sasha Levin, Minchan Kim, linux-kernel,
	linux-mm, Vladimir Davydov

On Fri, Nov 06, 2015 at 02:39:00PM -0800, Andrew Morton wrote:
> On Fri, 6 Nov 2015 12:29:21 +0200 "Kirill A. Shutemov" <kirill@shutemov.name> wrote:
> 
> > > page_mapcount() is getting pretty bad too.
> > 
> > Do you want me to uninline slow path (PageCompound())?
> 
> I guess so.  Uninlining all of page_mapcount() does this:
> 
> gcc-4.4.4:
> 
>    text    data     bss     dec     hex filename
>  973702  273954  831512 2079168  1fb9c0 mm/built-in.o-before
>  970148  273954  831000 2075102  1fa9de mm/built-in.o-after
> 
> That's quite a bit of bloat.
> 
> I don't know why bss changed; this usually (always?) happens.  Seems
> bogus.

Here it is.

From 4bd3af3b6b9498254bd71e8288721dcff641156c Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Mon, 9 Nov 2015 01:34:08 +0200
Subject: [PATCH] mm: uninline slowpath of page_mapcount()

Let's move the compound-page part of page_mapcount() into mm/util.c.

make allyesconfig:

  text	   data	    bss	    dec	    hex	filename
188515051	153360535	85458720	427334306	19789aa2	vmlinux.o.before
188512917	153356439	85458720	427328076	1978824c	vmlinux.o.after

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 include/linux/mm.h | 14 +++++---------
 mm/util.c          | 14 ++++++++++++++
 2 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f874d2a1d1a6..72edbbec7b91 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -417,19 +417,15 @@ static inline void page_mapcount_reset(struct page *page)
 	atomic_set(&(page)->_mapcount, -1);
 }
 
+int __page_mapcount(struct page *page);
+
 static inline int page_mapcount(struct page *page)
 {
-	int ret;
 	VM_BUG_ON_PAGE(PageSlab(page), page);
 
-	ret = atomic_read(&page->_mapcount) + 1;
-	if (PageCompound(page)) {
-		page = compound_head(page);
-		ret += atomic_read(compound_mapcount_ptr(page)) + 1;
-		if (PageDoubleMap(page))
-			ret--;
-	}
-	return ret;
+	if (unlikely(PageCompound(page)))
+		return __page_mapcount(page);
+	return atomic_read(&page->_mapcount) + 1;
 }
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
diff --git a/mm/util.c b/mm/util.c
index 902b65a43899..68535c0bb9da 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -376,6 +376,20 @@ struct address_space *page_mapping(struct page *page)
 	return mapping;
 }
 
+/* Slow path of page_mapcount() for compound pages */
+int __page_mapcount(struct page *page)
+{
+	int ret;
+
+	page = compound_head(page);
+	ret = atomic_read(&page->_mapcount) + 1;
+	ret += atomic_read(compound_mapcount_ptr(page)) + 1;
+	if (PageDoubleMap(page))
+		ret--;
+	return ret;
+}
+EXPORT_SYMBOL_GPL(__page_mapcount);
+
 int overcommit_ratio_handler(struct ctl_table *table, int write,
 			     void __user *buffer, size_t *lenp,
 			     loff_t *ppos)
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2015-11-08 23:40 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-11-03 15:26 [PATCH 0/4] Bugfixes for THP refcounting Kirill A. Shutemov
2015-11-03 15:26 ` Kirill A. Shutemov
2015-11-03 15:26 ` [PATCH 1/4] mm: do not crash on PageDoubleMap() for non-head pages Kirill A. Shutemov
2015-11-03 15:26   ` Kirill A. Shutemov
2015-11-03 15:26 ` [PATCH 2/4] mm: duplicate rmap reference for hugetlb pages as compound Kirill A. Shutemov
2015-11-03 15:26   ` Kirill A. Shutemov
2015-11-03 15:26 ` [PATCH 3/4] thp: fix split vs. unmap race Kirill A. Shutemov
2015-11-03 15:26   ` Kirill A. Shutemov
2015-11-03 15:26 ` [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting Kirill A. Shutemov
2015-11-03 15:26   ` Kirill A. Shutemov
2015-11-05  9:10   ` Vladimir Davydov
2015-11-05  9:10     ` Vladimir Davydov
2015-11-05  9:24     ` Kirill A. Shutemov
2015-11-05  9:24       ` Kirill A. Shutemov
2015-11-05 12:07       ` Vladimir Davydov
2015-11-05 12:07         ` Vladimir Davydov
2015-11-05 12:36         ` Kirill A. Shutemov
2015-11-05 12:36           ` Kirill A. Shutemov
2015-11-05 12:53           ` Vladimir Davydov
2015-11-05 12:53             ` Vladimir Davydov
2015-11-05 12:58             ` Kirill A. Shutemov
2015-11-05 12:58               ` Kirill A. Shutemov
2015-11-05 16:31               ` Vladimir Davydov
2015-11-05 16:31                 ` Vladimir Davydov
2015-11-06 14:37               ` [PATCH] mm: add page_check_address_transhuge helper Vladimir Davydov
2015-11-06 14:37                 ` Vladimir Davydov
2015-11-06 15:24                 ` Kirill A. Shutemov
2015-11-06 15:24                   ` Kirill A. Shutemov
2015-11-05 16:03   ` [PATCH 4/4] mm: prepare page_referenced() and page_idle to new THP refcounting Vladimir Davydov
2015-11-05 16:03     ` Vladimir Davydov
2015-11-05 17:27     ` Kirill A. Shutemov
2015-11-05 17:27       ` Kirill A. Shutemov
2015-11-06  0:32   ` Andrew Morton
2015-11-06  0:32     ` Andrew Morton
2015-11-06 10:29     ` Kirill A. Shutemov
2015-11-06 10:29       ` Kirill A. Shutemov
2015-11-06 22:39       ` Andrew Morton
2015-11-06 22:39         ` Andrew Morton
2015-11-08 23:40         ` Kirill A. Shutemov
2015-11-08 23:40           ` Kirill A. Shutemov
