* [PATCH v3 00/16] Support non-lru page migration
@ 2016-03-30  7:11 Minchan Kim
  2016-03-30  7:12 ` [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page Minchan Kim
                   ` (17 more replies)
  0 siblings, 18 replies; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

Recently, I have received many reports about performance degradation
on embedded systems (Android mobile phones, webOS TVs and so on) and
about fork failing easily.

The problem was fragmentation caused by zram and GPU driver
pages. Those pages cannot be migrated, so compaction cannot
work well either, and the reclaimer ends up shrinking the entire
working set. That made the system very slow and even made fork
fail easily.

Another pain point is that such pages cannot work with CMA.
Most of the CMA memory space could be idle (i.e., it could be used
for movable pages while the driver is not using it), but if a driver
(e.g., zram) cannot migrate its pages, that memory space is wasted.
In our product, which has a big CMA area, zones are reclaimed too
aggressively although there is lots of free space in CMA, so the
system easily became very slow.

To solve these problems, this patchset adds a facility to
migrate non-lru pages by introducing new companion functions
of migratepage in address_space_operations and new page flags.

	(isolate_page, putback_page)
	(PG_movable, PG_isolated)

For details, please read description in
"mm/compaction: support non-lru movable page migration".

Originally, Gioh Kim tried to support this feature, but he moved on,
so I took over the work. I took much of the code from his work and
changed it a little.
Thanks, Gioh!

I should also mention Konstantin Khlebnikov. He really helped Gioh
at that time, so he deserves much of the credit, too.
Thanks, Konstantin!

This patchset consists of five parts.

1. clean up migration
  mm: use put_page to free page instead of putback_lru_page

2. add non-lru page migration feature
  mm/compaction: support non-lru movable page migration
  mm: add non-lru movable page support document

3. rework KVM memory-ballooning
  mm/balloon: use general movable page feature into balloon

4. zsmalloc clean-up for preparing page migration
  zsmalloc: keep max_object in size_class
  zsmalloc: squeeze inuse into page->mapping
  zsmalloc: remove page_mapcount_reset
  zsmalloc: squeeze freelist into page->mapping
  zsmalloc: move struct zs_meta from mapping to freelist
  zsmalloc: factor page chain functionality out
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: zs_compact refactoring

5. add zsmalloc page migration
  zsmalloc: migrate head page of zspage
  zsmalloc: use single linked list for page chain
  zsmalloc: migrate tail pages in zspage
  zram: use __GFP_MOVABLE for memory allocation

* From v2
  * rebase on mmotm-2016-03-29-15-54-16
  * check PageMovable before lock_page - Joonsoo
  * check PageMovable before PageIsolated checking - Joonsoo
  * add more description about rule

* From v1
  * rebase on v4.5-mmotm-2016-03-17-15-04
  * reordering patches to merge clean-up patches first
  * add Acked-by/Reviewed-by from Vlastimil and Sergey
  * use its own mount model instead of reusing anon_inode_fs - Al Viro
  * small changes - YiPing, Gioh


Minchan Kim (16):
  mm: use put_page to free page instead of putback_lru_page
  mm/compaction: support non-lru movable page migration
  mm: add non-lru movable page support document
  mm/balloon: use general movable page feature into balloon
  zsmalloc: keep max_object in size_class
  zsmalloc: squeeze inuse into page->mapping
  zsmalloc: remove page_mapcount_reset
  zsmalloc: squeeze freelist into page->mapping
  zsmalloc: move struct zs_meta from mapping to freelist
  zsmalloc: factor page chain functionality out
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: zs_compact refactoring
  zsmalloc: migrate head page of zspage
  zsmalloc: use single linked list for page chain
  zsmalloc: migrate tail pages in zspage
  zram: use __GFP_MOVABLE for memory allocation

 Documentation/filesystems/Locking      |    4 +
 Documentation/filesystems/vfs.txt      |   16 +-
 Documentation/vm/page_migration        |   69 +-
 drivers/block/zram/zram_drv.c          |    3 +-
 drivers/virtio/virtio_balloon.c        |   53 +-
 fs/proc/page.c                         |    3 +
 include/linux/balloon_compaction.h     |   49 +-
 include/linux/fs.h                     |    2 +
 include/linux/migrate.h                |    2 +
 include/linux/page-flags.h             |   47 +-
 include/uapi/linux/kernel-page-flags.h |    1 +
 include/uapi/linux/magic.h             |    2 +
 mm/balloon_compaction.c                |  101 +--
 mm/compaction.c                        |   15 +-
 mm/migrate.c                           |  238 ++++--
 mm/vmscan.c                            |    2 +-
 mm/zsmalloc.c                          | 1253 ++++++++++++++++++++++++--------
 17 files changed, 1368 insertions(+), 492 deletions(-)

-- 
1.9.1


* [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-01 12:58   ` Vlastimil Babka
  2016-04-04  5:53   ` Balbir Singh
  2016-03-30  7:12 ` [PATCH v3 02/16] mm/compaction: support non-lru movable page migration Minchan Kim
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim,
	Naoya Horiguchi

The procedure of page migration is as follows:

First, it should isolate a page from the LRU and try to
migrate the page. If that is successful, it releases the page
for freeing. Otherwise, it should put the page back on the LRU
list.

For LRU pages, we have used putback_lru_page for both freeing
and putting back to the LRU list. That works because put_page is
aware of the LRU list, so if it releases the last refcount of the
page, it removes the page from the LRU list. However, it performs
unnecessary operations (e.g., lru_cache_add, pagevec and flag
operations; not significant, but not worth doing) and makes it harder
to support new non-lru page migration because put_page isn't aware of
a non-lru page's data structure.

To solve the problem, we could add a new hook in put_page with a
PageMovable flag check, but that would increase overhead in a
hot path and need a new locking scheme to stabilize the flag check
against put_page.

So, this patch cleans it up by dividing the two semantics (i.e., put
and putback). If migration is successful, use put_page instead of
putback_lru_page, and use putback_lru_page only on failure. That makes
the code more readable and doesn't add overhead to put_page.

Comment from Vlastimil
"Yeah, and compaction (perhaps also other migration users) has to drain
the lru pvec... Getting rid of this stuff is worth even by itself."
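
Condensed, the success/failure paths after this patch look roughly
like the following (a sketch, not the literal diff below):

	if (rc == MIGRATEPAGE_SUCCESS) {
		/* success: drop the reference grabbed during isolation */
		put_page(page);
	} else if (rc != -EAGAIN) {
		/* permanent failure: restore the page to the LRU list */
		putback_lru_page(page);
	}
	/* the freeing callback (or put_page) handles newpage on failure */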

Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/migrate.c | 50 +++++++++++++++++++++++++++++++-------------------
 1 file changed, 31 insertions(+), 19 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6c822a7b27e0..53529c805752 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -913,6 +913,14 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		put_anon_vma(anon_vma);
 	unlock_page(page);
 out:
+	/* If migration is successful, move newpage to right list */
+	if (rc == MIGRATEPAGE_SUCCESS) {
+		if (unlikely(__is_movable_balloon_page(newpage)))
+			put_page(newpage);
+		else
+			putback_lru_page(newpage);
+	}
+
 	return rc;
 }
 
@@ -946,6 +954,12 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 
 	if (page_count(page) == 1) {
 		/* page was freed from under us. So we are done. */
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+		if (put_new_page)
+			put_new_page(newpage, private);
+		else
+			put_page(newpage);
 		goto out;
 	}
 
@@ -958,10 +972,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 	}
 
 	rc = __unmap_and_move(page, newpage, force, mode);
-	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_new_page = NULL;
+	if (rc == MIGRATEPAGE_SUCCESS)
 		set_page_owner_migrate_reason(newpage, reason);
-	}
 
 out:
 	if (rc != -EAGAIN) {
@@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		/* Soft-offlined page shouldn't go through lru cache list */
+	}
+
+	/*
+	 * If migration is successful, drop the reference grabbed during
+	 * isolation. Otherwise, restore the page to LRU list unless we
+	 * want to retry.
+	 */
+	if (rc == MIGRATEPAGE_SUCCESS) {
+		put_page(page);
 		if (reason == MR_MEMORY_FAILURE) {
-			put_page(page);
 			if (!test_set_page_hwpoison(page))
 				num_poisoned_pages_inc();
-		} else
+		}
+	} else {
+		if (rc != -EAGAIN)
 			putback_lru_page(page);
+		if (put_new_page)
+			put_new_page(newpage, private);
+		else
+			put_page(newpage);
 	}
 
-	/*
-	 * If migration was not successful and there's a freeing callback, use
-	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
-	 * during isolation.
-	 */
-	if (put_new_page)
-		put_new_page(newpage, private);
-	else if (unlikely(__is_movable_balloon_page(newpage))) {
-		/* drop our reference, page already in the balloon */
-		put_page(newpage);
-	} else
-		putback_lru_page(newpage);
-
 	if (result) {
 		if (rc)
 			*result = rc;
-- 
1.9.1


* [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
  2016-03-30  7:12 ` [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-01 21:29   ` Vlastimil Babka
  2016-04-12  8:00   ` Chulmin Kim
  2016-03-30  7:12 ` [PATCH v3 03/16] mm: add non-lru movable page support document Minchan Kim
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim,
	dri-devel, Gioh Kim

We have allowed migration for only LRU pages until now, and that was
enough to make high-order pages. But recently, embedded systems (e.g.,
webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
so we have seen several reports about trouble with small high-order
allocations. To fix the problem, there were several efforts (e.g.,
enhancing the compaction algorithm, SLUB fallback to 0-order pages,
reserved memory, vmalloc and so on), but if there are lots of
non-movable pages in the system, those solutions are void in the long run.

So, this patch adds a facility to turn non-movable pages into movable
ones. For the feature, this patch introduces migration-related
functions in address_space_operations as well as some page flags.

Basically, this patch supports two page flags and two functions related
to page migration. The stability of the flags and of page->mapping is
protected by PG_lock.

	PG_movable
	PG_isolated

	bool (*isolate_page) (struct page *, isolate_mode_t);
	void (*putback_page) (struct page *);

The duties of a subsystem that wants to make its pages migratable are
as follows (a minimal sketch follows the list):

1. It should register an address_space in page->mapping and then mark
the page as PG_movable via __SetPageMovable.

2. It should mark the page as PG_isolated via SetPageIsolated if
isolation is successful, and return true.

3. If migration is successful, it should clear PG_isolated and
PG_movable of the page in preparation for freeing, and then release
the reference of the page to free it.

4. If migration fails, the subsystem's putback function should
clear PG_isolated via ClearPageIsolated.

5. If a subsystem wants to release an isolated page, it should
clear PG_isolated but not PG_movable. Instead, the VM will clear
PG_movable.
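
A minimal sketch of such subsystem callbacks (my_detach_page and
my_attach_page are hypothetical helpers, not part of this patch):

	static bool my_isolate_page(struct page *page, isolate_mode_t mode)
	{
		/*
		 * Called under PG_lock: take the page off subsystem-private
		 * lists so nobody else uses it, then mark it isolated (rule 2).
		 */
		if (!my_detach_page(page))
			return false;
		SetPageIsolated(page);
		return true;
	}

	static void my_putback_page(struct page *page)
	{
		/*
		 * Migration failed: put the page back and clear PG_isolated
		 * (rule 4). PG_movable stays set because the page is still
		 * movable.
		 */
		my_attach_page(page);
		ClearPageIsolated(page);
	}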

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: dri-devel@lists.freedesktop.org
Cc: virtualization@lists.linux-foundation.org
Signed-off-by: Gioh Kim <gurugio@hanmail.net>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/filesystems/Locking      |   4 +
 Documentation/filesystems/vfs.txt      |   5 +
 fs/proc/page.c                         |   3 +
 include/linux/fs.h                     |   2 +
 include/linux/migrate.h                |   2 +
 include/linux/page-flags.h             |  31 ++++++
 include/uapi/linux/kernel-page-flags.h |   1 +
 mm/compaction.c                        |  14 ++-
 mm/migrate.c                           | 174 +++++++++++++++++++++++++++++----
 9 files changed, 217 insertions(+), 19 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 619af9bfdcb3..0bb79560abb3 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -195,7 +195,9 @@ unlocks and drops the reference.
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
+	void (*putback_page) (struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
@@ -219,7 +221,9 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
+isolate_page:		yes
 migratepage:		yes (both)
+putback_page:		yes
 launder_page:		yes
 is_partially_uptodate:	yes
 error_remove_page:	yes
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index b02a7d598258..4c1b6c3b4bc8 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -592,9 +592,14 @@ struct address_space_operations {
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
+	/* isolate a page for migration */
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct page *, struct page *);
+	/* put the page back to right list */
+	void (*putback_page) (struct page *);
 	int (*launder_page) (struct page *);
+
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
 	void (*is_dirty_writeback) (struct page *, bool *, bool *);
diff --git a/fs/proc/page.c b/fs/proc/page.c
index 3ecd445e830d..ce3d08a4ad8d 100644
--- a/fs/proc/page.c
+++ b/fs/proc/page.c
@@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
 	if (page_is_idle(page))
 		u |= 1 << KPF_IDLE;
 
+	if (PageMovable(page))
+		u |= 1 << KPF_MOVABLE;
+
 	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
 
 	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index da9e67d937e5..36f2d610e7a8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -401,6 +401,8 @@ struct address_space_operations {
 	 */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
+	bool (*isolate_page)(struct page *, isolate_mode_t);
+	void (*putback_page)(struct page *);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9b50325e4ddf..404fbfefeb33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
+extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
+extern void putback_movable_page(struct page *page);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f4ed4f1b0c77..77ebf8fdbc6e 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -129,6 +129,10 @@ enum pageflags {
 
 	/* Compound pages. Stored in first tail page's flags */
 	PG_double_map = PG_private_2,
+
+	/* non-lru movable pages */
+	PG_movable = PG_reclaim,
+	PG_isolated = PG_owner_priv_1,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -614,6 +618,33 @@ static inline void __ClearPageBalloon(struct page *page)
 	atomic_set(&page->_mapcount, -1);
 }
 
+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
+
+static inline int PageMovable(struct page *page)
+{
+	return ((test_bit(PG_movable, &(page)->flags) &&
+		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
+		|| PageBalloon(page));
+}
+
+/* Caller should hold a PG_lock */
+static inline void __SetPageMovable(struct page *page,
+				struct address_space *mapping)
+{
+	page->mapping = mapping;
+	__set_bit(PG_movable, &page->flags);
+	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
+}
+
+static inline void __ClearPageMovable(struct page *page)
+{
+	atomic_set(&page->_mapcount, -1);
+	__clear_bit(PG_movable, &(page)->flags);
+	page->mapping = NULL;
+}
+
+PAGEFLAG(Isolated, isolated, PF_ANY);
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
index 5da5f8751ce7..a184fd2434fa 100644
--- a/include/uapi/linux/kernel-page-flags.h
+++ b/include/uapi/linux/kernel-page-flags.h
@@ -34,6 +34,7 @@
 #define KPF_BALLOON		23
 #define KPF_ZERO_PAGE		24
 #define KPF_IDLE		25
+#define KPF_MOVABLE		26
 
 
 #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
diff --git a/mm/compaction.c b/mm/compaction.c
index ccf97b02b85f..7557aedddaee 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 
 		/*
 		 * Check may be lockless but that's ok as we recheck later.
-		 * It's possible to migrate LRU pages and balloon pages
+		 * It's possible to migrate LRU and movable kernel pages.
 		 * Skip any other type of page
 		 */
 		is_lru = PageLRU(page);
@@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 					goto isolate_success;
 				}
 			}
+
+			if (unlikely(PageMovable(page)) &&
+					!PageIsolated(page)) {
+				if (locked) {
+					spin_unlock_irqrestore(&zone->lru_lock,
+									flags);
+					locked = false;
+				}
+
+				if (isolate_movable_page(page, isolate_mode))
+					goto isolate_success;
+			}
 		}
 
 		/*
diff --git a/mm/migrate.c b/mm/migrate.c
index 53529c805752..b56bf2b3fe8c 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -73,6 +73,85 @@ int migrate_prep_local(void)
 	return 0;
 }
 
+bool isolate_movable_page(struct page *page, isolate_mode_t mode)
+{
+	bool ret = false;
+
+	/*
+	 * Avoid burning cycles with pages that are yet under __free_pages(),
+	 * or just got freed under us.
+	 *
+	 * In case we 'win' a race for a movable page being freed under us and
+	 * raise its refcount preventing __free_pages() from doing its job
+	 * the put_page() at the end of this block will take care of
+	 * release this page, thus avoiding a nasty leakage.
+	 */
+	if (unlikely(!get_page_unless_zero(page)))
+		goto out;
+
+	/*
+	 * Check PG_movable before holding a PG_lock because page's owner
+	 * assumes anybody doesn't touch PG_lock of newly allocated page.
+	 */
+	if (unlikely(!PageMovable(page)))
+		goto out_putpage;
+	/*
+	 * As movable pages are not isolated from LRU lists, concurrent
+	 * compaction threads can race against page migration functions
+	 * as well as race against the releasing a page.
+	 *
+	 * In order to avoid having an already isolated movable page
+	 * being (wrongly) re-isolated while it is under migration,
+	 * or to avoid attempting to isolate pages being released,
+	 * lets be sure we have the page lock
+	 * before proceeding with the movable page isolation steps.
+	 */
+	if (unlikely(!trylock_page(page)))
+		goto out_putpage;
+
+	if (!PageMovable(page) || PageIsolated(page))
+		goto out_no_isolated;
+
+	ret = page->mapping->a_ops->isolate_page(page, mode);
+	if (!ret)
+		goto out_no_isolated;
+
+	WARN_ON_ONCE(!PageIsolated(page));
+	unlock_page(page);
+	return ret;
+
+out_no_isolated:
+	unlock_page(page);
+out_putpage:
+	put_page(page);
+out:
+	return ret;
+}
+
+/* It should be called on page which is PG_movable */
+void putback_movable_page(struct page *page)
+{
+	/*
+	 * 'lock_page()' stabilizes the page and prevents races against
+	 * concurrent isolation threads attempting to re-isolate it.
+	 */
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+
+	lock_page(page);
+	if (PageIsolated(page)) {
+		struct address_space *mapping;
+
+		mapping = page_mapping(page);
+		mapping->a_ops->putback_page(page);
+		WARN_ON_ONCE(PageIsolated(page));
+	} else {
+		__ClearPageMovable(page);
+	}
+	unlock_page(page);
+	/* drop the extra ref count taken for movable page isolation */
+	put_page(page);
+}
+
 /*
  * Put previously isolated pages back onto the appropriate lists
  * from where they were once taken off for compaction/migration.
@@ -94,10 +173,18 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page)))
+		if (unlikely(isolated_balloon_page(page))) {
 			balloon_page_putback(page);
-		else
+		} else if (unlikely(PageMovable(page))) {
+			if (PageIsolated(page)) {
+				putback_movable_page(page);
+			} else {
+				__ClearPageMovable(page);
+				put_page(page);
+			}
+		} else {
 			putback_lru_page(page);
+		}
 	}
 }
 
@@ -592,7 +679,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
  ***********************************************************/
 
 /*
- * Common logic to directly migrate a single page suitable for
+ * Common logic to directly migrate a single LRU page suitable for
  * pages that do not use PagePrivate/PagePrivate2.
  *
  * Pages are locked upon entry and exit.
@@ -755,24 +842,54 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 				enum migrate_mode mode)
 {
 	struct address_space *mapping;
-	int rc;
+	int rc = -EAGAIN;
+	bool lru_movable = true;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
 	mapping = page_mapping(page);
-	if (!mapping)
-		rc = migrate_page(mapping, newpage, page, mode);
-	else if (mapping->a_ops->migratepage)
-		/*
-		 * Most pages have a mapping and most filesystems provide a
-		 * migratepage callback. Anonymous pages are part of swap
-		 * space which also has its own migratepage callback. This
-		 * is the most common path for page migration.
-		 */
-		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
-	else
-		rc = fallback_migrate_page(mapping, newpage, page, mode);
+	/*
+	 * In case of non-lru page, it could be released after
+	 * isolation step. In that case, we shouldn't try
+	 * fallback migration which was designed for LRU pages.
+	 *
+	 * The rule for such case is that subsystem should clear
+	 * PG_isolated but remains PG_movable so VM should catch
+	 * it and clear PG_movable for it.
+	 */
+	if (unlikely(PageMovable(page))) {
+		lru_movable = false;
+		VM_BUG_ON_PAGE(!mapping, page);
+		if (!PageIsolated(page)) {
+			rc = MIGRATEPAGE_SUCCESS;
+			__ClearPageMovable(page);
+			goto out;
+		}
+	}
+
+	if (likely(lru_movable)) {
+		if (!mapping)
+			rc = migrate_page(mapping, newpage, page, mode);
+		else if (mapping->a_ops->migratepage)
+			/*
+			 * Most pages have a mapping and most filesystems
+			 * provide a migratepage callback. Anonymous pages
+			 * are part of swap space which also has its own
+			 * migratepage callback. This is the most common path
+			 * for page migration.
+			 */
+			rc = mapping->a_ops->migratepage(mapping, newpage,
+							page, mode);
+		else
+			rc = fallback_migrate_page(mapping, newpage,
+							page, mode);
+	} else {
+		rc = mapping->a_ops->migratepage(mapping, newpage,
+						page, mode);
+		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
+			PageIsolated(page));
+	}
 
 	/*
 	 * When successful, old pagecache page->mapping must be cleared before
@@ -782,6 +899,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		if (!PageAnon(page))
 			page->mapping = NULL;
 	}
+out:
 	return rc;
 }
 
@@ -960,6 +1078,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 			put_new_page(newpage, private);
 		else
 			put_page(newpage);
+		if (PageMovable(page))
+			__ClearPageMovable(page);
 		goto out;
 	}
 
@@ -1000,8 +1120,26 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 				num_poisoned_pages_inc();
 		}
 	} else {
-		if (rc != -EAGAIN)
-			putback_lru_page(page);
+		if (rc != -EAGAIN) {
+			/*
+			 * subsystem couldn't remove PG_movable since page is
+			 * isolated so PageMovable check is not racy in here.
+			 * But PageIsolated check can be racy but it's okay
+			 * because putback_movable_page checks it under PG_lock
+			 * again.
+			 */
+			if (unlikely(PageMovable(page))) {
+				if (PageIsolated(page))
+					putback_movable_page(page);
+				else {
+					__ClearPageMovable(page);
+					put_page(page);
+				}
+			} else {
+				putback_lru_page(page);
+			}
+		}
+
 		if (put_new_page)
 			put_new_page(newpage, private);
 		else
-- 
1.9.1


* [PATCH v3 03/16] mm: add non-lru movable page support document
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
  2016-03-30  7:12 ` [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page Minchan Kim
  2016-03-30  7:12 ` [PATCH v3 02/16] mm/compaction: support non-lru movable page migration Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-01 14:38   ` Vlastimil Babka
  2016-03-30  7:12 ` [PATCH v3 04/16] mm/balloon: use general movable page feature into balloon Minchan Kim
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim,
	Jonathan Corbet

This patch documents what a subsystem should do to support non-lru
movable pages.

Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/filesystems/vfs.txt | 11 ++++++-
 Documentation/vm/page_migration   | 69 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 4c1b6c3b4bc8..d63142f8ed7b 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -752,12 +752,21 @@ struct address_space_operations {
         and transfer data directly between the storage and the
         application's address space.
 
+  isolate_page: Called by the VM when isolating a movable non-lru page.
+	If the page is successfully isolated, the subsystem should mark the
+	page as PG_isolated via __SetPageIsolated.
+
   migrate_page:  This is used to compact the physical memory usage.
         If the VM wants to relocate a page (maybe off a memory card
         that is signalling imminent failure) it will pass a new page
 	and an old page to this function.  migrate_page should
 	transfer any private data across and update any references
-        that it has to the page.
+	that it has to the page. If the migrated page is a non-lru page,
+	the subsystem should clear PG_isolated and PG_movable via
+	__ClearPageIsolated and __ClearPageMovable.
+
+  putback_page: Called by the VM when an isolated page's migration fails.
+	The subsystem should clear PG_isolated set in the isolate_page function.
 
   launder_page: Called before freeing a page - it writes back the dirty page. To
   	prevent redirtying the page, it is kept locked during the whole
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index fea5c0864170..c4e7551a414e 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -142,5 +142,72 @@ is increased so that the page cannot be freed while page migration occurs.
 20. The new page is moved to the LRU and can be scanned by the swapper
     etc again.
 
-Christoph Lameter, May 8, 2006.
+C. Non-LRU Page migration
+-------------------------
+
+Although page migration originally aimed at reducing memory access latency
+for NUMA, compaction, which creates high-order pages, is also a main customer.
+
+Page migration's disadvantage is that it was designed to migrate only
+*LRU* pages. However, there are potential non-lru movable pages which can be
+migrated in the system, for example, zsmalloc and virtio-balloon pages.
+For virtio-balloon pages, some parts of the migration code path were hooked
+up and virtio-balloon specific functions were added to intercept the logic.
+That is too specific to one subsystem, so any other subsystem that wants to
+make its pages movable would have to add its own hooks in the migration path.
+
+To solve such problems, the VM supports non-LRU page migration, which
+provides generic functions for non-LRU movable pages without needing
+subsystem-specific hooks in mm/{migrate|compact}.c.
+
+If a subsystem wants to make its own pages movable, it should mark them as
+PG_movable via __SetPageMovable. __SetPageMovable takes an address_space
+argument, used to register the functions which will be called by the VM.
+
+Three address_space_operations functions are related to non-lru movable pages:
+
+	bool (*isolate_page) (struct page *, isolate_mode_t);
+	int (*migratepage) (struct address_space *,
+		struct page *, struct page *, enum migrate_mode);
+	void (*putback_page)(struct page *);
+
+1. Isolation
+
+What the VM expects from a subsystem's isolate_page is that it sets the
+PG_isolated flag of the page if isolation was successful. With that,
+concurrent isolation among CPUs skips a page already isolated by another
+CPU. The VM calls isolate_page under the page's PG_lock. If a subsystem
+cannot isolate the page, it should return false.
 
+2. Migration
+
+After successful isolation, the VM calls migratepage. The goal of migratepage
+is to move the content of the old page to the new page and to set up the
+struct page fields of the new page. If migration is successful, the subsystem
+should release the old page's refcount to free it. Keep in mind that the
+subsystem should clear PG_movable and PG_isolated before releasing the
+refcount. If everything is done, it should return MIGRATEPAGE_SUCCESS. If
+the subsystem cannot migrate the page at the moment, migratepage can return
+-EAGAIN; the VM treats that as a temporary failure and retries the migration.
+
+3. Putback
+
+If migration was unsuccessful, the VM calls putback_page. The subsystem
+should insert the isolated page back into its own data structure, if it has
+one, and should clear PG_isolated, which was set in the isolation step.
+
+Note about releasing page:
+
+A subsystem can release pages whenever it wants, but if it releases a page
+which is already isolated, it should clear PG_isolated but not touch
+PG_movable, under PG_lock. Instead, the VM will clear PG_movable after
+its job is done. Otherwise, the subsystem should clear both page flags
+before releasing the page.
+
+Note about PG_isolated:
+
+A PG_isolated check on a page is valid only if the page is already
+marked PG_movable.
+
+Christoph Lameter, May 8, 2006.
+Minchan Kim, Mar 28, 2016.
-- 
1.9.1
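
To complement the migration contract documented above, a minimal sketch
of a subsystem's migratepage callback (my_trylock_state and
my_move_contents are hypothetical helpers, not from this series):

	static int my_migratepage(struct address_space *mapping,
			struct page *newpage, struct page *page,
			enum migrate_mode mode)
	{
		/* if internal state is busy, ask the VM to retry later */
		if (!my_trylock_state(page))
			return -EAGAIN;

		/* move content and set up newpage's struct page fields */
		my_move_contents(newpage, page);

		/*
		 * The old page is done: clear both flags, then drop the
		 * subsystem's own reference so the page can be freed.
		 */
		ClearPageIsolated(page);
		__ClearPageMovable(page);
		put_page(page);
		return MIGRATEPAGE_SUCCESS;
	}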


* [PATCH v3 04/16] mm/balloon: use general movable page feature into balloon
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (2 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 03/16] mm: add non-lru movable page support document Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-05 12:03   ` Vlastimil Babka
  2016-03-30  7:12 ` [PATCH v3 05/16] zsmalloc: keep max_object in size_class Minchan Kim
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim,
	Gioh Kim

Now that the VM has a feature to migrate non-lru movable pages, the
balloon doesn't need custom migration hooks in migrate.c and
compaction.c. Instead, this patch implements the
page->mapping->{isolate|migrate|putback} functions.

With that, we can remove the ballooning hooks from the generic
migration functions and make balloon compaction simple.

Cc: virtualization@lists.linux-foundation.org
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Signed-off-by: Gioh Kim <gurugio@hanmail.net>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/virtio/virtio_balloon.c    |  53 ++++++++++++++++---
 include/linux/balloon_compaction.h |  49 ++++--------------
 include/linux/page-flags.h         |  56 +++++++++++---------
 include/uapi/linux/magic.h         |   1 +
 mm/balloon_compaction.c            | 101 ++++++++-----------------------------
 mm/compaction.c                    |   7 ---
 mm/migrate.c                       |  22 ++------
 mm/vmscan.c                        |   2 +-
 8 files changed, 119 insertions(+), 172 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 7b6d74f0c72f..0c16192d2684 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include <linux/oom.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/mount.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,10 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+#ifdef CONFIG_BALLOON_COMPACTION
+static struct vfsmount *balloon_mnt;
+#endif
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -482,10 +487,29 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 
 	mutex_unlock(&vb->balloon_lock);
 
+	ClearPageIsolated(page);
 	put_page(page); /* balloon reference */
 
 	return MIGRATEPAGE_SUCCESS;
 }
+
+static struct dentry *balloon_mount(struct file_system_type *fs_type,
+		int flags, const char *dev_name, void *data)
+{
+	static const struct dentry_operations ops = {
+		.d_dname = simple_dname,
+	};
+
+	return mount_pseudo(fs_type, "balloon-kvm:", NULL, &ops,
+				BALLOON_KVM_MAGIC);
+}
+
+static struct file_system_type balloon_fs = {
+	.name           = "balloon-kvm",
+	.mount          = balloon_mount,
+	.kill_sb        = kill_anon_super,
+};
+
 #endif /* CONFIG_BALLOON_COMPACTION */
 
 static int virtballoon_probe(struct virtio_device *vdev)
@@ -515,10 +539,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->vdev = vdev;
 
 	balloon_devinfo_init(&vb->vb_dev_info);
-#ifdef CONFIG_BALLOON_COMPACTION
-	vb->vb_dev_info.migratepage = virtballoon_migratepage;
-#endif
-
 	err = init_vqs(vb);
 	if (err)
 		goto out_free_vb;
@@ -527,13 +547,32 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
 	if (err < 0)
-		goto out_oom_notify;
+		goto out_del_vqs;
+
+#ifdef CONFIG_BALLOON_COMPACTION
+	balloon_mnt = kern_mount(&balloon_fs);
+	if (IS_ERR(balloon_mnt)) {
+		err = PTR_ERR(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		goto out_del_vqs;
+	}
 
+	vb->vb_dev_info.migratepage = virtballoon_migratepage;
+	vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
+	if (IS_ERR(vb->vb_dev_info.inode)) {
+		err = PTR_ERR(vb->vb_dev_info.inode);
+		kern_unmount(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		vb->vb_dev_info.inode = NULL;
+		goto out_del_vqs;
+	}
+	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
+#endif
 	virtio_device_ready(vdev);
 
 	return 0;
 
-out_oom_notify:
+out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
@@ -567,6 +606,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	if (vb->vb_dev_info.inode)
+		iput(vb->vb_dev_info.inode);
 	kfree(vb);
 }
 
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 9b0a15d06a4f..4c693bf3abdf 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -48,6 +48,7 @@
 #include <linux/migrate.h>
 #include <linux/gfp.h>
 #include <linux/err.h>
+#include <linux/fs.h>
 
 /*
  * Balloon device information descriptor.
@@ -62,6 +63,7 @@ struct balloon_dev_info {
 	struct list_head pages;		/* Pages enqueued & handled to Host */
 	int (*migratepage)(struct balloon_dev_info *, struct page *newpage,
 			struct page *page, enum migrate_mode mode);
+	struct inode *inode;
 };
 
 extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
@@ -73,45 +75,19 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 	spin_lock_init(&balloon->pages_lock);
 	INIT_LIST_HEAD(&balloon->pages);
 	balloon->migratepage = NULL;
+	balloon->inode = NULL;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
-extern bool balloon_page_isolate(struct page *page);
+extern const struct address_space_operations balloon_aops;
+extern bool balloon_page_isolate(struct page *page,
+				isolate_mode_t mode);
 extern void balloon_page_putback(struct page *page);
-extern int balloon_page_migrate(struct page *newpage,
+extern int balloon_page_migrate(struct address_space *mapping,
+				struct page *newpage,
 				struct page *page, enum migrate_mode mode);
 
 /*
- * __is_movable_balloon_page - helper to perform @page PageBalloon tests
- */
-static inline bool __is_movable_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
- * balloon_page_movable - test PageBalloon to identify balloon pages
- *			  and PagePrivate to check that the page is not
- *			  isolated and can be moved by compaction/migration.
- *
- * As we might return false positives in the case of a balloon page being just
- * released under us, this need to be re-tested later, under the page lock.
- */
-static inline bool balloon_page_movable(struct page *page)
-{
-	return PageBalloon(page) && PagePrivate(page);
-}
-
-/*
- * isolated_balloon_page - identify an isolated balloon page on private
- *			   compaction/migration page lists.
- */
-static inline bool isolated_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
  * balloon_page_insert - insert a page into the balloon's page list and make
  *			 the page->private assignment accordingly.
  * @balloon : pointer to balloon device
@@ -123,8 +99,7 @@ static inline bool isolated_balloon_page(struct page *page)
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 				       struct page *page)
 {
-	__SetPageBalloon(page);
-	SetPagePrivate(page);
+	__SetPageBalloon(page, balloon->inode->i_mapping);
 	set_page_private(page, (unsigned long)balloon);
 	list_add(&page->lru, &balloon->pages);
 }
@@ -141,10 +116,8 @@ static inline void balloon_page_delete(struct page *page)
 {
 	__ClearPageBalloon(page);
 	set_page_private(page, 0);
-	if (PagePrivate(page)) {
-		ClearPagePrivate(page);
+	if (!PageIsolated(page))
 		list_del(&page->lru);
-	}
 }
 
 /*
@@ -166,7 +139,7 @@ static inline gfp_t balloon_mapping_gfp_mask(void)
 static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 				       struct page *page)
 {
-	__SetPageBalloon(page);
+	__SetPageBalloon(page, NULL);
 	list_add(&page->lru, &balloon->pages);
 }
 
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 77ebf8fdbc6e..603c47752126 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -599,32 +599,13 @@ static inline void __ClearPageBuddy(struct page *page)
 
 extern bool is_free_buddy_page(struct page *page);
 
-#define PAGE_BALLOON_MAPCOUNT_VALUE (-256)
-
-static inline int PageBalloon(struct page *page)
-{
-	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE;
-}
-
-static inline void __SetPageBalloon(struct page *page)
-{
-	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
-	atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
-}
-
-static inline void __ClearPageBalloon(struct page *page)
-{
-	VM_BUG_ON_PAGE(!PageBalloon(page), page);
-	atomic_set(&page->_mapcount, -1);
-}
-
-#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-256)
+#define PAGE_BALLOON_MAPCOUNT_VALUE PAGE_MOVABLE_MAPCOUNT_VALUE
 
 static inline int PageMovable(struct page *page)
 {
-	return ((test_bit(PG_movable, &(page)->flags) &&
-		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
-		|| PageBalloon(page));
+	return (test_bit(PG_movable, &(page)->flags) &&
+		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE);
 }
 
 /* Caller should hold a PG_lock */
@@ -645,6 +626,35 @@ static inline void __ClearPageMovable(struct page *page)
 
 PAGEFLAG(Isolated, isolated, PF_ANY);
 
+static inline int PageBalloon(struct page *page)
+{
+	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE
+		&& PagePrivate2(page);
+}
+
+static inline void __SetPageBalloon(struct page *page,
+				struct address_space *mapping)
+{
+	VM_BUG_ON_PAGE(atomic_read(&page->_mapcount) != -1, page);
+#ifdef CONFIG_BALLOON_COMPACTION
+	__SetPageMovable(page, mapping);
+#else
+	atomic_set(&page->_mapcount, PAGE_BALLOON_MAPCOUNT_VALUE);
+#endif
+	SetPagePrivate2(page);
+}
+
+static inline void __ClearPageBalloon(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageBalloon(page), page);
+#ifdef CONFIG_BALLOON_COMPACTION
+	__ClearPageMovable(page);
+#else
+	atomic_set(&page->_mapcount, -1);
+#endif
+	ClearPagePrivate2(page);
+}
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 0de181ad73d5..e1fbe72c39c0 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -78,5 +78,6 @@
 #define BTRFS_TEST_MAGIC	0x73727279
 #define NSFS_MAGIC		0x6e736673
 #define BPF_FS_MAGIC		0xcafe4a11
+#define BALLOON_KVM_MAGIC	0x13661366
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 57b3e9bd6bc5..1fbc7fb387bb 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -70,7 +70,7 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
 		 */
 		if (trylock_page(page)) {
 #ifdef CONFIG_BALLOON_COMPACTION
-			if (!PagePrivate(page)) {
+			if (PageIsolated(page)) {
 				/* raced with isolation */
 				unlock_page(page);
 				continue;
@@ -106,110 +106,53 @@ EXPORT_SYMBOL_GPL(balloon_page_dequeue);
 
 #ifdef CONFIG_BALLOON_COMPACTION
 
-static inline void __isolate_balloon_page(struct page *page)
+/* __isolate_lru_page() counterpart for a ballooned page */
+bool balloon_page_isolate(struct page *page, isolate_mode_t mode)
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	ClearPagePrivate(page);
 	list_del(&page->lru);
 	b_dev_info->isolated_pages++;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+	SetPageIsolated(page);
+
+	return true;
 }
 
-static inline void __putback_balloon_page(struct page *page)
+/* putback_lru_page() counterpart for a ballooned page */
+void balloon_page_putback(struct page *page)
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
+	ClearPageIsolated(page);
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	SetPagePrivate(page);
 	list_add(&page->lru, &b_dev_info->pages);
 	b_dev_info->isolated_pages--;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
 }
 
-/* __isolate_lru_page() counterpart for a ballooned page */
-bool balloon_page_isolate(struct page *page)
-{
-	/*
-	 * Avoid burning cycles with pages that are yet under __free_pages(),
-	 * or just got freed under us.
-	 *
-	 * In case we 'win' a race for a balloon page being freed under us and
-	 * raise its refcount preventing __free_pages() from doing its job
-	 * the put_page() at the end of this block will take care of
-	 * release this page, thus avoiding a nasty leakage.
-	 */
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * As balloon pages are not isolated from LRU lists, concurrent
-		 * compaction threads can race against page migration functions
-		 * as well as race against the balloon driver releasing a page.
-		 *
-		 * In order to avoid having an already isolated balloon page
-		 * being (wrongly) re-isolated while it is under migration,
-		 * or to avoid attempting to isolate pages being released by
-		 * the balloon driver, lets be sure we have the page lock
-		 * before proceeding with the balloon page isolation steps.
-		 */
-		if (likely(trylock_page(page))) {
-			/*
-			 * A ballooned page, by default, has PagePrivate set.
-			 * Prevent concurrent compaction threads from isolating
-			 * an already isolated balloon page by clearing it.
-			 */
-			if (balloon_page_movable(page)) {
-				__isolate_balloon_page(page);
-				unlock_page(page);
-				return true;
-			}
-			unlock_page(page);
-		}
-		put_page(page);
-	}
-	return false;
-}
-
-/* putback_lru_page() counterpart for a ballooned page */
-void balloon_page_putback(struct page *page)
-{
-	/*
-	 * 'lock_page()' stabilizes the page and prevents races against
-	 * concurrent isolation threads attempting to re-isolate it.
-	 */
-	lock_page(page);
-
-	if (__is_movable_balloon_page(page)) {
-		__putback_balloon_page(page);
-		/* drop the extra ref count taken for page isolation */
-		put_page(page);
-	} else {
-		WARN_ON(1);
-		dump_page(page, "not movable balloon page");
-	}
-	unlock_page(page);
-}
-
 /* move_to_new_page() counterpart for a ballooned page */
-int balloon_page_migrate(struct page *newpage,
-			 struct page *page, enum migrate_mode mode)
+int balloon_page_migrate(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
-	int rc = -EAGAIN;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-	if (WARN_ON(!__is_movable_balloon_page(page))) {
-		dump_page(page, "not movable balloon page");
-		return rc;
-	}
-
-	if (balloon && balloon->migratepage)
-		rc = balloon->migratepage(balloon, newpage, page, mode);
-
-	return rc;
+	return balloon->migratepage(balloon, newpage, page, mode);
 }
+
+const struct address_space_operations balloon_aops = {
+	.migratepage = balloon_page_migrate,
+	.isolate_page = balloon_page_isolate,
+	.putback_page = balloon_page_putback,
+};
+EXPORT_SYMBOL_GPL(balloon_aops);
 #endif /* CONFIG_BALLOON_COMPACTION */
diff --git a/mm/compaction.c b/mm/compaction.c
index 7557aedddaee..e336c620fd7b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -708,13 +708,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		is_lru = PageLRU(page);
 		if (!is_lru) {
-			if (unlikely(balloon_page_movable(page))) {
-				if (balloon_page_isolate(page)) {
-					/* Successfully isolated */
-					goto isolate_success;
-				}
-			}
-
 			if (unlikely(PageMovable(page)) &&
 					!PageIsolated(page)) {
 				if (locked) {
diff --git a/mm/migrate.c b/mm/migrate.c
index b56bf2b3fe8c..028814625eea 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -157,8 +157,8 @@ void putback_movable_page(struct page *page)
  * from where they were once taken off for compaction/migration.
  *
  * This function shall be used whenever the isolated pageset has been
- * built from lru, balloon, hugetlbfs page. See isolate_migratepages_range()
- * and isolate_huge_page().
+ * built from lru, movable, hugetlbfs page.
+ * See isolate_migratepages_range() and isolate_huge_page().
  */
 void putback_movable_pages(struct list_head *l)
 {
@@ -173,9 +173,7 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page))) {
-			balloon_page_putback(page);
-		} else if (unlikely(PageMovable(page))) {
+		if (unlikely(PageMovable(page))) {
 			if (PageIsolated(page)) {
 				putback_movable_page(page);
 			} else {
@@ -977,18 +975,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	if (unlikely(!trylock_page(newpage)))
 		goto out_unlock;
 
-	if (unlikely(isolated_balloon_page(page))) {
-		/*
-		 * A ballooned page does not need any special attention from
-		 * physical to virtual reverse mapping procedures.
-		 * Skip any attempt to unmap PTEs or to remap swap cache,
-		 * in order to avoid burning cycles at rmap level, and perform
-		 * the page migration right away (proteced by page lock).
-		 */
-		rc = balloon_page_migrate(newpage, page, mode);
-		goto out_unlock_both;
-	}
-
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -1033,7 +1019,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 out:
 	/* If migration is successful, move newpage to right list */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(__is_movable_balloon_page(newpage)))
+		if (unlikely(PageMovable(newpage)))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index d82196244340..c7696a2e11c7 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1254,7 +1254,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
 		if (page_is_file_cache(page) && !PageDirty(page) &&
-		    !isolated_balloon_page(page)) {
+		    !PageIsolated(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
1.9.1


* [PATCH v3 05/16] zsmalloc: keep max_object in size_class
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (3 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 04/16] mm/balloon: use general movable page feature into balloon Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-17 15:08   ` Sergey Senozhatsky
  2016-03-30  7:12 ` [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping Minchan Kim
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

Every zspage in a size_class has the same maximum number of objects,
so we can move that count into the size_class.
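
In other words, the value being hoisted is just the per-class constant
computed below in the patch; for example, with 4K pages:

	/*
	 * e.g. class->size = 32, pages_per_zspage = 1:
	 *   objs_per_zspage = 1 * 4096 / 32 = 128 objects per zspage
	 */
	class->objs_per_zspage = class->pages_per_zspage * PAGE_SIZE / class->size;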

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index a0890e9003e2..8649d0243e6c 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -32,8 +32,6 @@
  *	page->freelist: points to the first free object in zspage.
  *		Free objects are linked together using in-place
  *		metadata.
- *	page->objects: maximum number of objects we can store in this
- *		zspage (class->zspage_order * PAGE_SIZE / class->size)
  *	page->lru: links together first pages of various zspages.
  *		Basically forming list of zspages in a fullness group.
  *	page->mapping: class index and fullness group of the zspage
@@ -211,6 +209,7 @@ struct size_class {
 	 * of ZS_ALIGN.
 	 */
 	int size;
+	int objs_per_zspage;
 	unsigned int index;
 
 	struct zs_size_stat stats;
@@ -627,21 +626,22 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
  * the pool (not yet implemented). This function returns fullness
  * status of the given page.
  */
-static enum fullness_group get_fullness_group(struct page *first_page)
+static enum fullness_group get_fullness_group(struct size_class *class,
+						struct page *first_page)
 {
-	int inuse, max_objects;
+	int inuse, objs_per_zspage;
 	enum fullness_group fg;
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
 	inuse = first_page->inuse;
-	max_objects = first_page->objects;
+	objs_per_zspage = class->objs_per_zspage;
 
 	if (inuse == 0)
 		fg = ZS_EMPTY;
-	else if (inuse == max_objects)
+	else if (inuse == objs_per_zspage)
 		fg = ZS_FULL;
-	else if (inuse <= 3 * max_objects / fullness_threshold_frac)
+	else if (inuse <= 3 * objs_per_zspage / fullness_threshold_frac)
 		fg = ZS_ALMOST_EMPTY;
 	else
 		fg = ZS_ALMOST_FULL;
@@ -728,7 +728,7 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
 	enum fullness_group currfg, newfg;
 
 	get_zspage_mapping(first_page, &class_idx, &currfg);
-	newfg = get_fullness_group(first_page);
+	newfg = get_fullness_group(class, first_page);
 	if (newfg == currfg)
 		goto out;
 
@@ -1008,9 +1008,6 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 	init_zspage(class, first_page);
 
 	first_page->freelist = location_to_obj(first_page, 0);
-	/* Maximum number of objects we can store in this zspage */
-	first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
-
 	error = 0; /* Success */
 
 cleanup:
@@ -1238,11 +1235,11 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage)
 	return true;
 }
 
-static bool zspage_full(struct page *first_page)
+static bool zspage_full(struct size_class *class, struct page *first_page)
 {
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	return first_page->inuse == first_page->objects;
+	return first_page->inuse == class->objs_per_zspage;
 }
 
 unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1628,7 +1625,7 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
 		}
 
 		/* Stop if there is no more space */
-		if (zspage_full(d_page)) {
+		if (zspage_full(class, d_page)) {
 			unpin_tag(handle);
 			ret = -ENOMEM;
 			break;
@@ -1687,7 +1684,7 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
 {
 	enum fullness_group fullness;
 
-	fullness = get_fullness_group(first_page);
+	fullness = get_fullness_group(class, first_page);
 	insert_zspage(class, fullness, first_page);
 	set_zspage_mapping(first_page, class->index, fullness);
 
@@ -1936,8 +1933,9 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
 		class->size = size;
 		class->index = i;
 		class->pages_per_zspage = pages_per_zspage;
-		if (pages_per_zspage == 1 &&
-			get_maxobj_per_zspage(size, pages_per_zspage) == 1)
+		class->objs_per_zspage = class->pages_per_zspage *
+						PAGE_SIZE / class->size;
+		if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
 			class->huge = true;
 		spin_lock_init(&class->lock);
 		pool->size_class[i] = class;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (4 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 05/16] zsmalloc: keep max_object in size_class Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-17 15:08   ` Sergey Senozhatsky
  2016-03-30  7:12 ` [PATCH v3 07/16] zsmalloc: remove page_mapcount_reset Minchan Kim
                   ` (11 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

Currently, we store class:fullness in page->mapping.
The number of size classes we can support is 255 and there are 4
fullness groups, so 10 bits (8 + 2) are enough to represent them.
Meanwhile, 11 bits are enough to store the number of in-use objects
in a zspage.

For example, if we assume a 64K PAGE_SIZE and the worst-case
class_size of 32, class->pages_per_zspage becomes 1, so the number
of objects in a zspage is 2048 and 11 bits are enough. The next
class is 32 + 256 (i.e., ZS_SIZE_CLASS_DELTA). With the worst case
of ZS_MAX_PAGES_PER_ZSPAGE, 64K * 4 / (32 + 256) = 910, so 11 bits
are still enough.

So we can squeeze the in-use object count into page->mapping as well.
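
For illustration only (not part of the patch), the packing can be
modelled as a C bitfield whose field widths mirror the CLASS_BITS,
FULLNESS_BITS and INUSE_BITS constants introduced below; the struct
name is made up for the sketch:

	/* Sketch: 8 + 2 + 11 = 21 bits, well within one unsigned long. */
	struct zs_meta_sketch {
		unsigned long class:8;		/* size class index */
		unsigned long fullness:2;	/* fullness group */
		unsigned long inuse:11;		/* in-use object count */
	};

	/* The whole struct must fit into the single page->mapping word. */
	_Static_assert(sizeof(struct zs_meta_sketch) <= sizeof(unsigned long),
		       "zs_meta must fit in one word");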

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 103 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 71 insertions(+), 32 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 8649d0243e6c..4dd72a803568 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -34,8 +34,7 @@
  *		metadata.
  *	page->lru: links together first pages of various zspages.
  *		Basically forming list of zspages in a fullness group.
- *	page->mapping: class index and fullness group of the zspage
- *	page->inuse: the number of objects that are used in this zspage
+ *	page->mapping: override by struct zs_meta
  *
  * Usage of struct page flags:
  *	PG_private: identifies the first component page
@@ -132,6 +131,13 @@
 /* each chunk includes extra space to keep handle */
 #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
 
+#define CLASS_BITS	8
+#define CLASS_MASK	((1 << CLASS_BITS) - 1)
+#define FULLNESS_BITS	2
+#define FULLNESS_MASK	((1 << FULLNESS_BITS) - 1)
+#define INUSE_BITS	11
+#define INUSE_MASK	((1 << INUSE_BITS) - 1)
+
 /*
  * On systems with 4K page size, this gives 255 size classes! There is a
  *  trade-off here:
@@ -145,7 +151,7 @@
  *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
  *  (reason above)
  */
-#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> 8)
+#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
 
 /*
  * We do not maintain any list for completely empty or full pages
@@ -155,7 +161,7 @@ enum fullness_group {
 	ZS_ALMOST_EMPTY,
 	_ZS_NR_FULLNESS_GROUPS,
 
-	ZS_EMPTY,
+	ZS_EMPTY = _ZS_NR_FULLNESS_GROUPS,
 	ZS_FULL
 };
 
@@ -263,14 +269,11 @@ struct zs_pool {
 #endif
 };
 
-/*
- * A zspage's class index and fullness group
- * are encoded in its (first)page->mapping
- */
-#define CLASS_IDX_BITS	28
-#define FULLNESS_BITS	4
-#define CLASS_IDX_MASK	((1 << CLASS_IDX_BITS) - 1)
-#define FULLNESS_MASK	((1 << FULLNESS_BITS) - 1)
+struct zs_meta {
+	unsigned long class:CLASS_BITS;
+	unsigned long fullness:FULLNESS_BITS;
+	unsigned long inuse:INUSE_BITS;
+};
 
 struct mapping_area {
 #ifdef CONFIG_PGTABLE_MAPPING
@@ -412,28 +415,61 @@ static int is_last_page(struct page *page)
 	return PagePrivate2(page);
 }
 
+static int get_zspage_inuse(struct page *first_page)
+{
+	struct zs_meta *m;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	m = (struct zs_meta *)&first_page->mapping;
+
+	return m->inuse;
+}
+
+static void set_zspage_inuse(struct page *first_page, int val)
+{
+	struct zs_meta *m;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	m = (struct zs_meta *)&first_page->mapping;
+	m->inuse = val;
+}
+
+static void mod_zspage_inuse(struct page *first_page, int val)
+{
+	struct zs_meta *m;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	m = (struct zs_meta *)&first_page->mapping;
+	m->inuse += val;
+}
+
 static void get_zspage_mapping(struct page *first_page,
 				unsigned int *class_idx,
 				enum fullness_group *fullness)
 {
-	unsigned long m;
+	struct zs_meta *m;
+
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = (unsigned long)first_page->mapping;
-	*fullness = m & FULLNESS_MASK;
-	*class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK;
+	m = (struct zs_meta *)&first_page->mapping;
+	*fullness = m->fullness;
+	*class_idx = m->class;
 }
 
 static void set_zspage_mapping(struct page *first_page,
 				unsigned int class_idx,
 				enum fullness_group fullness)
 {
-	unsigned long m;
+	struct zs_meta *m;
+
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) |
-			(fullness & FULLNESS_MASK);
-	first_page->mapping = (struct address_space *)m;
+	m = (struct zs_meta *)&first_page->mapping;
+	m->fullness = fullness;
+	m->class = class_idx;
 }
 
 /*
@@ -632,9 +668,7 @@ static enum fullness_group get_fullness_group(struct size_class *class,
 	int inuse, objs_per_zspage;
 	enum fullness_group fg;
 
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-
-	inuse = first_page->inuse;
+	inuse = get_zspage_inuse(first_page);
 	objs_per_zspage = class->objs_per_zspage;
 
 	if (inuse == 0)
@@ -677,10 +711,10 @@ static void insert_zspage(struct size_class *class,
 
 	/*
 	 * We want to see more ZS_FULL pages and less almost
-	 * empty/full. Put pages with higher ->inuse first.
+	 * empty/full. Put pages with higher inuse first.
 	 */
 	list_add_tail(&first_page->lru, &(*head)->lru);
-	if (first_page->inuse >= (*head)->inuse)
+	if (get_zspage_inuse(first_page) >= get_zspage_inuse(*head))
 		*head = first_page;
 }
 
@@ -896,7 +930,7 @@ static void free_zspage(struct page *first_page)
 	struct page *nextp, *tmp, *head_extra;
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-	VM_BUG_ON_PAGE(first_page->inuse, first_page);
+	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
 
 	head_extra = (struct page *)page_private(first_page);
 
@@ -992,7 +1026,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 			SetPagePrivate(page);
 			set_page_private(page, 0);
 			first_page = page;
-			first_page->inuse = 0;
+			set_zspage_inuse(page, 0);
 		}
 		if (i == 1)
 			set_page_private(first_page, (unsigned long)page);
@@ -1237,9 +1271,7 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage)
 
 static bool zspage_full(struct size_class *class, struct page *first_page)
 {
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-
-	return first_page->inuse == class->objs_per_zspage;
+	return get_zspage_inuse(first_page) == class->objs_per_zspage;
 }
 
 unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1372,7 +1404,7 @@ static unsigned long obj_malloc(struct size_class *class,
 		/* record handle in first_page->private */
 		set_page_private(first_page, handle);
 	kunmap_atomic(vaddr);
-	first_page->inuse++;
+	mod_zspage_inuse(first_page, 1);
 	zs_stat_inc(class, OBJ_USED, 1);
 
 	return obj;
@@ -1457,7 +1489,7 @@ static void obj_free(struct size_class *class, unsigned long obj)
 		set_page_private(first_page, 0);
 	kunmap_atomic(vaddr);
 	first_page->freelist = (void *)obj;
-	first_page->inuse--;
+	mod_zspage_inuse(first_page, -1);
 	zs_stat_dec(class, OBJ_USED, 1);
 }
 
@@ -2002,6 +2034,13 @@ static int __init zs_init(void)
 	if (ret)
 		goto notifier_fail;
 
+	/*
+	 * A zspage's class index, fullness group, inuse object count are
+	 * encoded in its (first)page->mapping so sizeof(struct zs_meta)
+	 * should be less than sizeof(page->mapping(i.e., unsigned long)).
+	 */
+	BUILD_BUG_ON(sizeof(struct zs_meta) > sizeof(unsigned long));
+
 	init_zs_size_classes();
 
 #ifdef CONFIG_ZPOOL
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 07/16] zsmalloc: remove page_mapcount_reset
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (5 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-17 15:11   ` Sergey Senozhatsky
  2016-03-30  7:12 ` [PATCH v3 08/16] zsmalloc: squeeze freelist into page->mapping Minchan Kim
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

We don't use page->_mapcount any more, so there is no need to reset it.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 4dd72a803568..0f6cce9b9119 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -922,7 +922,6 @@ static void reset_page(struct page *page)
 	set_page_private(page, 0);
 	page->mapping = NULL;
 	page->freelist = NULL;
-	page_mapcount_reset(page);
 }
 
 static void free_zspage(struct page *first_page)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 08/16] zsmalloc: squeeze freelist into page->mapping
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (6 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 07/16] zsmalloc: remove page_mapcount_reset Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-17 15:56   ` Sergey Senozhatsky
  2016-03-30  7:12 ` [PATCH v3 09/16] zsmalloc: move struct zs_meta from mapping to freelist Minchan Kim
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

Zsmalloc stores the position of the first free object in
first_page->freelist for each zspage. If we store an object index
relative to first_page instead of an encoded location, we can squeeze
it into page->mapping because the index needs at most 11 bits.
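
For illustration only, here is a small userspace toy model (not the
kernel code) of the technique: every free slot stores the index of the
next free slot, so only a small head index needs to live in the
per-zspage metadata. All names below are made up for the sketch:

	#include <stdio.h>

	#define NR_OBJS 8

	static unsigned int slot[NR_OBJS];	/* stands in for object memory */
	static unsigned int free_head;		/* stands in for zs_meta.freeobj */

	static void init_freelist(void)
	{
		unsigned int i;

		for (i = 0; i < NR_OBJS - 1; i++)
			slot[i] = i + 1;	/* link to the next free slot */
		slot[NR_OBJS - 1] = NR_OBJS;	/* end-of-list marker */
		free_head = 0;
	}

	static unsigned int alloc_obj(void)
	{
		unsigned int idx = free_head;

		free_head = slot[idx];		/* pop the head index */
		return idx;
	}

	static void free_obj(unsigned int idx)
	{
		slot[idx] = free_head;		/* push the index back */
		free_head = idx;
	}

	int main(void)
	{
		unsigned int a, b;

		init_freelist();
		a = alloc_obj();
		b = alloc_obj();
		printf("allocated %u and %u\n", a, b);
		free_obj(a);
		printf("free list head is now %u\n", free_head);
		return 0;
	}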

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 158 +++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 96 insertions(+), 62 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 0f6cce9b9119..807998462539 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -18,9 +18,7 @@
  * Usage of struct page fields:
  *	page->private: points to the first component (0-order) page
  *	page->index (union with page->freelist): offset of the first object
- *		starting in this page. For the first page, this is
- *		always 0, so we use this field (aka freelist) to point
- *		to the first free object in zspage.
+ *		starting in this page.
  *	page->lru: links together all component pages (except the first page)
  *		of a zspage
  *
@@ -29,9 +27,6 @@
  *	page->private: refers to the component page after the first page
  *		If the page is first_page for huge object, it stores handle.
  *		Look at size_class->huge.
- *	page->freelist: points to the first free object in zspage.
- *		Free objects are linked together using in-place
- *		metadata.
  *	page->lru: links together first pages of various zspages.
  *		Basically forming list of zspages in a fullness group.
  *	page->mapping: override by struct zs_meta
@@ -131,6 +126,7 @@
 /* each chunk includes extra space to keep handle */
 #define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
 
+#define FREEOBJ_BITS 11
 #define CLASS_BITS	8
 #define CLASS_MASK	((1 << CLASS_BITS) - 1)
 #define FULLNESS_BITS	2
@@ -228,17 +224,17 @@ struct size_class {
 
 /*
  * Placed within free objects to form a singly linked list.
- * For every zspage, first_page->freelist gives head of this list.
+ * For every zspage, first_page->freeobj gives head of this list.
  *
  * This must be power of 2 and less than or equal to ZS_ALIGN
  */
 struct link_free {
 	union {
 		/*
-		 * Position of next free chunk (encodes <PFN, obj_idx>)
+		 * free object list
 		 * It's valid for non-allocated object
 		 */
-		void *next;
+		unsigned long next;
 		/*
 		 * Handle of allocated object.
 		 */
@@ -270,6 +266,7 @@ struct zs_pool {
 };
 
 struct zs_meta {
+	unsigned long freeobj:FREEOBJ_BITS;
 	unsigned long class:CLASS_BITS;
 	unsigned long fullness:FULLNESS_BITS;
 	unsigned long inuse:INUSE_BITS;
@@ -446,6 +443,26 @@ static void mod_zspage_inuse(struct page *first_page, int val)
 	m->inuse += val;
 }
 
+static void set_freeobj(struct page *first_page, int idx)
+{
+	struct zs_meta *m;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	m = (struct zs_meta *)&first_page->mapping;
+	m->freeobj = idx;
+}
+
+static unsigned long get_freeobj(struct page *first_page)
+{
+	struct zs_meta *m;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	m = (struct zs_meta *)&first_page->mapping;
+	return m->freeobj;
+}
+
 static void get_zspage_mapping(struct page *first_page,
 				unsigned int *class_idx,
 				enum fullness_group *fullness)
@@ -837,30 +854,33 @@ static struct page *get_next_page(struct page *page)
 	return next;
 }
 
-/*
- * Encode <page, obj_idx> as a single handle value.
- * We use the least bit of handle for tagging.
- */
-static void *location_to_obj(struct page *page, unsigned long obj_idx)
+static void objidx_to_page_and_offset(struct size_class *class,
+				struct page *first_page,
+				unsigned long obj_idx,
+				struct page **obj_page,
+				unsigned long *offset_in_page)
 {
-	unsigned long obj;
+	int i;
+	unsigned long offset;
+	struct page *cursor;
+	int nr_page;
 
-	if (!page) {
-		VM_BUG_ON(obj_idx);
-		return NULL;
-	}
+	offset = obj_idx * class->size;
+	cursor = first_page;
+	nr_page = offset >> PAGE_SHIFT;
 
-	obj = page_to_pfn(page) << OBJ_INDEX_BITS;
-	obj |= ((obj_idx) & OBJ_INDEX_MASK);
-	obj <<= OBJ_TAG_BITS;
+	*offset_in_page = offset & ~PAGE_MASK;
+
+	for (i = 0; i < nr_page; i++)
+		cursor = get_next_page(cursor);
 
-	return (void *)obj;
+	*obj_page = cursor;
 }
 
-/*
- * Decode <page, obj_idx> pair from the given object handle. We adjust the
- * decoded obj_idx back to its original value since it was adjusted in
- * location_to_obj().
+/**
+ * obj_to_location - get (<page>, <obj_idx>) from encoded object value
+ * @page: page object resides in zspage
+ * @obj_idx: object index
  */
 static void obj_to_location(unsigned long obj, struct page **page,
 				unsigned long *obj_idx)
@@ -870,6 +890,23 @@ static void obj_to_location(unsigned long obj, struct page **page,
 	*obj_idx = (obj & OBJ_INDEX_MASK);
 }
 
+/**
+ * location_to_obj - get obj value encoded from (<page>, <obj_idx>)
+ * @page: page object resides in zspage
+ * @obj_idx: object index
+ */
+static unsigned long location_to_obj(struct page *page,
+				unsigned long obj_idx)
+{
+	unsigned long obj;
+
+	obj = page_to_pfn(page) << OBJ_INDEX_BITS;
+	obj |= obj_idx & OBJ_INDEX_MASK;
+	obj <<= OBJ_TAG_BITS;
+
+	return obj;
+}
+
 static unsigned long handle_to_obj(unsigned long handle)
 {
 	return *(unsigned long *)handle;
@@ -885,17 +922,6 @@ static unsigned long obj_to_head(struct size_class *class, struct page *page,
 		return *(unsigned long *)obj;
 }
 
-static unsigned long obj_idx_to_offset(struct page *page,
-				unsigned long obj_idx, int class_size)
-{
-	unsigned long off = 0;
-
-	if (!is_first_page(page))
-		off = page->index;
-
-	return off + obj_idx * class_size;
-}
-
 static inline int trypin_tag(unsigned long handle)
 {
 	unsigned long *ptr = (unsigned long *)handle;
@@ -952,6 +978,7 @@ static void free_zspage(struct page *first_page)
 /* Initialize a newly allocated zspage */
 static void init_zspage(struct size_class *class, struct page *first_page)
 {
+	int freeobj = 1;
 	unsigned long off = 0;
 	struct page *page = first_page;
 
@@ -960,14 +987,11 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 	while (page) {
 		struct page *next_page;
 		struct link_free *link;
-		unsigned int i = 1;
 		void *vaddr;
 
 		/*
 		 * page->index stores offset of first object starting
-		 * in the page. For the first page, this is always 0,
-		 * so we use first_page->index (aka ->freelist) to store
-		 * head of corresponding zspage's freelist.
+		 * in the page.
 		 */
 		if (page != first_page)
 			page->index = off;
@@ -976,7 +1000,7 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 		link = (struct link_free *)vaddr + off / sizeof(*link);
 
 		while ((off += class->size) < PAGE_SIZE) {
-			link->next = location_to_obj(page, i++);
+			link->next = freeobj++ << OBJ_ALLOCATED_TAG;
 			link += class->size / sizeof(*link);
 		}
 
@@ -986,11 +1010,21 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 		 * page (if present)
 		 */
 		next_page = get_next_page(page);
-		link->next = location_to_obj(next_page, 0);
+		if (next_page) {
+			link->next = freeobj++ << OBJ_ALLOCATED_TAG;
+		} else {
+			/*
+			 * Clear the OBJ_ALLOCATED_TAG bit in the last link so
+			 * migration can tell whether an object is allocated.
+			 */
+			link->next = -1 << OBJ_ALLOCATED_TAG;
+		}
 		kunmap_atomic(vaddr);
 		page = next_page;
 		off %= PAGE_SIZE;
 	}
+
+	set_freeobj(first_page, 0);
 }
 
 /*
@@ -1040,7 +1074,6 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 
 	init_zspage(class, first_page);
 
-	first_page->freelist = location_to_obj(first_page, 0);
 	error = 0; /* Success */
 
 cleanup:
@@ -1320,7 +1353,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	obj_to_location(obj, &page, &obj_idx);
 	get_zspage_mapping(get_first_page(page), &class_idx, &fg);
 	class = pool->size_class[class_idx];
-	off = obj_idx_to_offset(page, obj_idx, class->size);
+	off = (class->size * obj_idx) & ~PAGE_MASK;
 
 	area = &get_cpu_var(zs_map_area);
 	area->vm_mm = mm;
@@ -1359,7 +1392,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 	obj_to_location(obj, &page, &obj_idx);
 	get_zspage_mapping(get_first_page(page), &class_idx, &fg);
 	class = pool->size_class[class_idx];
-	off = obj_idx_to_offset(page, obj_idx, class->size);
+	off = (class->size * obj_idx) & ~PAGE_MASK;
 
 	area = this_cpu_ptr(&zs_map_area);
 	if (off + class->size <= PAGE_SIZE)
@@ -1385,17 +1418,17 @@ static unsigned long obj_malloc(struct size_class *class,
 	struct link_free *link;
 
 	struct page *m_page;
-	unsigned long m_objidx, m_offset;
+	unsigned long m_offset;
 	void *vaddr;
 
 	handle |= OBJ_ALLOCATED_TAG;
-	obj = (unsigned long)first_page->freelist;
-	obj_to_location(obj, &m_page, &m_objidx);
-	m_offset = obj_idx_to_offset(m_page, m_objidx, class->size);
+	obj = get_freeobj(first_page);
+	objidx_to_page_and_offset(class, first_page, obj,
+				&m_page, &m_offset);
 
 	vaddr = kmap_atomic(m_page);
 	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
-	first_page->freelist = link->next;
+	set_freeobj(first_page, link->next >> OBJ_ALLOCATED_TAG);
 	if (!class->huge)
 		/* record handle in the header of allocated chunk */
 		link->handle = handle;
@@ -1406,6 +1439,8 @@ static unsigned long obj_malloc(struct size_class *class,
 	mod_zspage_inuse(first_page, 1);
 	zs_stat_inc(class, OBJ_USED, 1);
 
+	obj = location_to_obj(m_page, obj);
+
 	return obj;
 }
 
@@ -1475,19 +1510,17 @@ static void obj_free(struct size_class *class, unsigned long obj)
 
 	obj &= ~OBJ_ALLOCATED_TAG;
 	obj_to_location(obj, &f_page, &f_objidx);
+	f_offset = (class->size * f_objidx) & ~PAGE_MASK;
 	first_page = get_first_page(f_page);
-
-	f_offset = obj_idx_to_offset(f_page, f_objidx, class->size);
-
 	vaddr = kmap_atomic(f_page);
 
 	/* Insert this object in containing zspage's freelist */
 	link = (struct link_free *)(vaddr + f_offset);
-	link->next = first_page->freelist;
+	link->next = get_freeobj(first_page) << OBJ_ALLOCATED_TAG;
 	if (class->huge)
 		set_page_private(first_page, 0);
 	kunmap_atomic(vaddr);
-	first_page->freelist = (void *)obj;
+	set_freeobj(first_page, f_objidx);
 	mod_zspage_inuse(first_page, -1);
 	zs_stat_dec(class, OBJ_USED, 1);
 }
@@ -1543,8 +1576,8 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 	obj_to_location(src, &s_page, &s_objidx);
 	obj_to_location(dst, &d_page, &d_objidx);
 
-	s_off = obj_idx_to_offset(s_page, s_objidx, class->size);
-	d_off = obj_idx_to_offset(d_page, d_objidx, class->size);
+	s_off = (class->size * s_objidx) & ~PAGE_MASK;
+	d_off = (class->size * d_objidx) & ~PAGE_MASK;
 
 	if (s_off + class->size > PAGE_SIZE)
 		s_size = PAGE_SIZE - s_off;
@@ -2034,9 +2067,10 @@ static int __init zs_init(void)
 		goto notifier_fail;
 
 	/*
-	 * A zspage's class index, fullness group, inuse object count are
-	 * encoded in its (first)page->mapping so sizeof(struct zs_meta)
-	 * should be less than sizeof(page->mapping(i.e., unsigned long)).
+	 * A zspage's a free object index, class index, fullness group,
+	 * inuse object count are encoded in its (first)page->mapping
+	 * so sizeof(struct zs_meta) should be less than
+	 * sizeof(page->mapping(i.e., unsigned long)).
 	 */
 	BUILD_BUG_ON(sizeof(struct zs_meta) > sizeof(unsigned long));
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 09/16] zsmalloc: move struct zs_meta from mapping to freelist
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (7 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 08/16] zsmalloc: squeeze freelist into page->mapping Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-17 15:22   ` Sergey Senozhatsky
  2016-03-30  7:12 ` [PATCH v3 10/16] zsmalloc: factor page chain functionality out Minchan Kim
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

To support migration by the VM, we need an address_space on every
page, so zsmalloc should not use page->mapping for its own metadata.
This patch therefore moves zs_meta from page->mapping to
page->freelist.
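
For illustration only (not part of the patch), after this change every
access to the per-zspage metadata goes through the freelist word of
the head page; a helper doing that cast could look like the sketch
below (the helper name is made up):

	static inline struct zs_meta *zspage_meta(struct page *first_page)
	{
		return (struct zs_meta *)&first_page->freelist;
	}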

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 807998462539..d4d33a819832 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -29,7 +29,7 @@
  *		Look at size_class->huge.
  *	page->lru: links together first pages of various zspages.
  *		Basically forming list of zspages in a fullness group.
- *	page->mapping: override by struct zs_meta
+ *	page->freelist: override by struct zs_meta
  *
  * Usage of struct page flags:
  *	PG_private: identifies the first component page
@@ -418,7 +418,7 @@ static int get_zspage_inuse(struct page *first_page)
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = (struct zs_meta *)&first_page->mapping;
+	m = (struct zs_meta *)&first_page->freelist;
 
 	return m->inuse;
 }
@@ -429,7 +429,7 @@ static void set_zspage_inuse(struct page *first_page, int val)
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = (struct zs_meta *)&first_page->mapping;
+	m = (struct zs_meta *)&first_page->freelist;
 	m->inuse = val;
 }
 
@@ -439,7 +439,7 @@ static void mod_zspage_inuse(struct page *first_page, int val)
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = (struct zs_meta *)&first_page->mapping;
+	m = (struct zs_meta *)&first_page->freelist;
 	m->inuse += val;
 }
 
@@ -449,7 +449,7 @@ static void set_freeobj(struct page *first_page, int idx)
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = (struct zs_meta *)&first_page->mapping;
+	m = (struct zs_meta *)&first_page->freelist;
 	m->freeobj = idx;
 }
 
@@ -459,7 +459,7 @@ static unsigned long get_freeobj(struct page *first_page)
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = (struct zs_meta *)&first_page->mapping;
+	m = (struct zs_meta *)&first_page->freelist;
 	return m->freeobj;
 }
 
@@ -471,7 +471,7 @@ static void get_zspage_mapping(struct page *first_page,
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = (struct zs_meta *)&first_page->mapping;
+	m = (struct zs_meta *)&first_page->freelist;
 	*fullness = m->fullness;
 	*class_idx = m->class;
 }
@@ -484,7 +484,7 @@ static void set_zspage_mapping(struct page *first_page,
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = (struct zs_meta *)&first_page->mapping;
+	m = (struct zs_meta *)&first_page->freelist;
 	m->fullness = fullness;
 	m->class = class_idx;
 }
@@ -946,7 +946,6 @@ static void reset_page(struct page *page)
 	clear_bit(PG_private, &page->flags);
 	clear_bit(PG_private_2, &page->flags);
 	set_page_private(page, 0);
-	page->mapping = NULL;
 	page->freelist = NULL;
 }
 
@@ -1056,6 +1055,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 
 		INIT_LIST_HEAD(&page->lru);
 		if (i == 0) {	/* first page */
+			page->freelist = NULL;
 			SetPagePrivate(page);
 			set_page_private(page, 0);
 			first_page = page;
@@ -2068,9 +2068,9 @@ static int __init zs_init(void)
 
 	/*
 	 * A zspage's a free object index, class index, fullness group,
-	 * inuse object count are encoded in its (first)page->mapping
+	 * inuse object count are encoded in its (first)page->freelist
 	 * so sizeof(struct zs_meta) should be less than
-	 * sizeof(page->mapping(i.e., unsigned long)).
+	 * sizeof(page->freelist(i.e., void *)).
 	 */
 	BUILD_BUG_ON(sizeof(struct zs_meta) > sizeof(unsigned long));
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 10/16] zsmalloc: factor page chain functionality out
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (8 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 09/16] zsmalloc: move struct zs_meta from mapping to freelist Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-18  0:33   ` Sergey Senozhatsky
  2016-03-30  7:12 ` [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage Minchan Kim
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

For migration, we need to create the sub-page chain of a zspage
dynamically, so this patch factors that logic out of alloc_zspage.

As a minor refactoring, it also makes the OBJ_ALLOCATED_TAG
assignment clearer in obj_malloc (it could be a separate patch,
but it is trivial, so I put it together in this patch).
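
For illustration only (not part of the patch), having create_page_chain
as a standalone helper lets a later migration step rebuild the chain
with one sub-page swapped out; a sketch of such a caller (the function
name is made up, error handling omitted):

	static void rebuild_chain_with(struct page *pages[], int nr_pages,
				       struct page *oldpage, struct page *newpage)
	{
		int i;

		for (i = 0; i < nr_pages; i++)
			if (pages[i] == oldpage)
				pages[i] = newpage;

		create_page_chain(pages, nr_pages);
	}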

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 80 ++++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 46 insertions(+), 34 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index d4d33a819832..14bcc741eead 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -981,7 +981,9 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 	unsigned long off = 0;
 	struct page *page = first_page;
 
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+	first_page->freelist = NULL;
+	INIT_LIST_HEAD(&first_page->lru);
+	set_zspage_inuse(first_page, 0);
 
 	while (page) {
 		struct page *next_page;
@@ -1026,13 +1028,44 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 	set_freeobj(first_page, 0);
 }
 
+static void create_page_chain(struct page *pages[], int nr_pages)
+{
+	int i;
+	struct page *page;
+	struct page *prev_page = NULL;
+	struct page *first_page = NULL;
+
+	for (i = 0; i < nr_pages; i++) {
+		page = pages[i];
+
+		INIT_LIST_HEAD(&page->lru);
+		if (i == 0) {
+			SetPagePrivate(page);
+			set_page_private(page, 0);
+			first_page = page;
+		}
+
+		if (i == 1)
+			set_page_private(first_page, (unsigned long)page);
+		if (i >= 1)
+			set_page_private(page, (unsigned long)first_page);
+		if (i >= 2)
+			list_add(&page->lru, &prev_page->lru);
+		if (i == nr_pages - 1)
+			SetPagePrivate2(page);
+
+		prev_page = page;
+	}
+}
+
 /*
  * Allocate a zspage for the given size class
  */
 static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 {
-	int i, error;
-	struct page *first_page = NULL, *uninitialized_var(prev_page);
+	int i;
+	struct page *first_page = NULL;
+	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
 
 	/*
 	 * Allocate individual pages and link them together as:
@@ -1045,43 +1078,23 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 	 * (i.e. no other sub-page has this flag set) and PG_private_2 to
 	 * identify the last page.
 	 */
-	error = -ENOMEM;
 	for (i = 0; i < class->pages_per_zspage; i++) {
 		struct page *page;
 
 		page = alloc_page(flags);
-		if (!page)
-			goto cleanup;
-
-		INIT_LIST_HEAD(&page->lru);
-		if (i == 0) {	/* first page */
-			page->freelist = NULL;
-			SetPagePrivate(page);
-			set_page_private(page, 0);
-			first_page = page;
-			set_zspage_inuse(page, 0);
+		if (!page) {
+			while (--i >= 0)
+				__free_page(pages[i]);
+			return NULL;
 		}
-		if (i == 1)
-			set_page_private(first_page, (unsigned long)page);
-		if (i >= 1)
-			set_page_private(page, (unsigned long)first_page);
-		if (i >= 2)
-			list_add(&page->lru, &prev_page->lru);
-		if (i == class->pages_per_zspage - 1)	/* last page */
-			SetPagePrivate2(page);
-		prev_page = page;
+
+		pages[i] = page;
 	}
 
+	create_page_chain(pages, class->pages_per_zspage);
+	first_page = pages[0];
 	init_zspage(class, first_page);
 
-	error = 0; /* Success */
-
-cleanup:
-	if (unlikely(error) && first_page) {
-		free_zspage(first_page);
-		first_page = NULL;
-	}
-
 	return first_page;
 }
 
@@ -1421,7 +1434,6 @@ static unsigned long obj_malloc(struct size_class *class,
 	unsigned long m_offset;
 	void *vaddr;
 
-	handle |= OBJ_ALLOCATED_TAG;
 	obj = get_freeobj(first_page);
 	objidx_to_page_and_offset(class, first_page, obj,
 				&m_page, &m_offset);
@@ -1431,10 +1443,10 @@ static unsigned long obj_malloc(struct size_class *class,
 	set_freeobj(first_page, link->next >> OBJ_ALLOCATED_TAG);
 	if (!class->huge)
 		/* record handle in the header of allocated chunk */
-		link->handle = handle;
+		link->handle = handle | OBJ_ALLOCATED_TAG;
 	else
 		/* record handle in first_page->private */
-		set_page_private(first_page, handle);
+		set_page_private(first_page, handle | OBJ_ALLOCATED_TAG);
 	kunmap_atomic(vaddr);
 	mod_zspage_inuse(first_page, 1);
 	zs_stat_inc(class, OBJ_USED, 1);
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (9 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 10/16] zsmalloc: factor page chain functionality out Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-18  1:04   ` Sergey Senozhatsky
  2016-03-30  7:12 ` [PATCH v3 12/16] zsmalloc: zs_compact refactoring Minchan Kim
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

Currently, putback_zspage frees the zspage under class->lock if its
fullness becomes ZS_EMPTY, but that makes it hard to implement the
locking scheme for the new zspage migration.
So this patch separates free_zspage from putback_zspage and frees
the zspage outside class->lock, as preparation for zspage migration.
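
For illustration only (not part of the patch), the split means the
caller decides when to free, so a compaction-style caller can drop
class->lock before calling free_zspage; a simplified sketch (the
function name is made up, stats updates omitted):

	static void example_put_and_maybe_free(struct zs_pool *pool,
					       struct size_class *class,
					       struct page *first_page)
	{
		spin_lock(&class->lock);
		if (putback_zspage(pool, class, first_page) == ZS_EMPTY) {
			spin_unlock(&class->lock);
			free_zspage(pool, class, first_page);	/* outside class->lock */
		} else {
			spin_unlock(&class->lock);
		}
	}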

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 14bcc741eead..b11dcd718502 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -949,7 +949,8 @@ static void reset_page(struct page *page)
 	page->freelist = NULL;
 }
 
-static void free_zspage(struct page *first_page)
+static void free_zspage(struct zs_pool *pool, struct size_class *class,
+			struct page *first_page)
 {
 	struct page *nextp, *tmp, *head_extra;
 
@@ -972,6 +973,11 @@ static void free_zspage(struct page *first_page)
 	}
 	reset_page(head_extra);
 	__free_page(head_extra);
+
+	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
+			class->size, class->pages_per_zspage));
+	atomic_long_sub(class->pages_per_zspage,
+				&pool->pages_allocated);
 }
 
 /* Initialize a newly allocated zspage */
@@ -1559,13 +1565,8 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 	spin_lock(&class->lock);
 	obj_free(class, obj);
 	fullness = fix_fullness_group(class, first_page);
-	if (fullness == ZS_EMPTY) {
-		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
-				class->size, class->pages_per_zspage));
-		atomic_long_sub(class->pages_per_zspage,
-				&pool->pages_allocated);
-		free_zspage(first_page);
-	}
+	if (fullness == ZS_EMPTY)
+		free_zspage(pool, class, first_page);
 	spin_unlock(&class->lock);
 	unpin_tag(handle);
 
@@ -1752,7 +1753,7 @@ static struct page *isolate_target_page(struct size_class *class)
  * @class: destination class
  * @first_page: target page
  *
- * Return @fist_page's fullness_group
+ * Return @first_page's updated fullness_group
  */
 static enum fullness_group putback_zspage(struct zs_pool *pool,
 			struct size_class *class,
@@ -1764,15 +1765,6 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
 	insert_zspage(class, fullness, first_page);
 	set_zspage_mapping(first_page, class->index, fullness);
 
-	if (fullness == ZS_EMPTY) {
-		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
-			class->size, class->pages_per_zspage));
-		atomic_long_sub(class->pages_per_zspage,
-				&pool->pages_allocated);
-
-		free_zspage(first_page);
-	}
-
 	return fullness;
 }
 
@@ -1835,23 +1827,31 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 			if (!migrate_zspage(pool, class, &cc))
 				break;
 
-			putback_zspage(pool, class, dst_page);
+			VM_BUG_ON_PAGE(putback_zspage(pool, class,
+				dst_page) == ZS_EMPTY, dst_page);
 		}
 
 		/* Stop if we couldn't find slot */
 		if (dst_page == NULL)
 			break;
 
-		putback_zspage(pool, class, dst_page);
-		if (putback_zspage(pool, class, src_page) == ZS_EMPTY)
+		VM_BUG_ON_PAGE(putback_zspage(pool, class,
+				dst_page) == ZS_EMPTY, dst_page);
+		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
 			pool->stats.pages_compacted += class->pages_per_zspage;
-		spin_unlock(&class->lock);
+			spin_unlock(&class->lock);
+			free_zspage(pool, class, src_page);
+		} else {
+			spin_unlock(&class->lock);
+		}
+
 		cond_resched();
 		spin_lock(&class->lock);
 	}
 
 	if (src_page)
-		putback_zspage(pool, class, src_page);
+		VM_BUG_ON_PAGE(putback_zspage(pool, class,
+				src_page) == ZS_EMPTY, src_page);
 
 	spin_unlock(&class->lock);
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 12/16] zsmalloc: zs_compact refactoring
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (10 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-04  8:04   ` Chulmin Kim
  2016-03-30  7:12 ` [PATCH v3 13/16] zsmalloc: migrate head page of zspage Minchan Kim
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

Currently, we rely on class->lock to prevent zspage destruction.
That was okay until now because the critical section is short, but
with run-time migration it could become long, so class->lock is no
longer a good approach.

So this patch introduces [un]freeze_zspage functions, which freeze
the allocated objects in a zspage by pinning their tags so users
cannot free an in-use object. With those functions, this patch
redesigns compaction.

Those functions will also be used to implement zspage runtime
migration.
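
For illustration only (not part of the patch), the resulting usage
pattern for one isolated zspage looks roughly like the sketch below
(example_frozen_section and do_slow_work are made up; error paths and
the empty-zspage free path are omitted):

	static void example_frozen_section(struct size_class *class)
	{
		struct page *zspage;

		spin_lock(&class->lock);
		zspage = isolate_source_page(class);	/* off the list, objects pinned */
		spin_unlock(&class->lock);

		if (!zspage)
			return;

		/* class->lock is not held, yet zs_free() of a pinned object spins */
		do_slow_work(zspage);

		spin_lock(&class->lock);
		putback_zspage(class, zspage);
		unfreeze_zspage(class, zspage, class->objs_per_zspage);
		spin_unlock(&class->lock);
	}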

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 393 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 257 insertions(+), 136 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b11dcd718502..ac8ca7b10720 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -922,6 +922,13 @@ static unsigned long obj_to_head(struct size_class *class, struct page *page,
 		return *(unsigned long *)obj;
 }
 
+static inline int testpin_tag(unsigned long handle)
+{
+	unsigned long *ptr = (unsigned long *)handle;
+
+	return test_bit(HANDLE_PIN_BIT, ptr);
+}
+
 static inline int trypin_tag(unsigned long handle)
 {
 	unsigned long *ptr = (unsigned long *)handle;
@@ -949,8 +956,7 @@ static void reset_page(struct page *page)
 	page->freelist = NULL;
 }
 
-static void free_zspage(struct zs_pool *pool, struct size_class *class,
-			struct page *first_page)
+static void free_zspage(struct zs_pool *pool, struct page *first_page)
 {
 	struct page *nextp, *tmp, *head_extra;
 
@@ -973,11 +979,6 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
 	}
 	reset_page(head_extra);
 	__free_page(head_extra);
-
-	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
-			class->size, class->pages_per_zspage));
-	atomic_long_sub(class->pages_per_zspage,
-				&pool->pages_allocated);
 }
 
 /* Initialize a newly allocated zspage */
@@ -1325,6 +1326,11 @@ static bool zspage_full(struct size_class *class, struct page *first_page)
 	return get_zspage_inuse(first_page) == class->objs_per_zspage;
 }
 
+static bool zspage_empty(struct size_class *class, struct page *first_page)
+{
+	return get_zspage_inuse(first_page) == 0;
+}
+
 unsigned long zs_get_total_pages(struct zs_pool *pool)
 {
 	return atomic_long_read(&pool->pages_allocated);
@@ -1455,7 +1461,6 @@ static unsigned long obj_malloc(struct size_class *class,
 		set_page_private(first_page, handle | OBJ_ALLOCATED_TAG);
 	kunmap_atomic(vaddr);
 	mod_zspage_inuse(first_page, 1);
-	zs_stat_inc(class, OBJ_USED, 1);
 
 	obj = location_to_obj(m_page, obj);
 
@@ -1510,6 +1515,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
 	}
 
 	obj = obj_malloc(class, first_page, handle);
+	zs_stat_inc(class, OBJ_USED, 1);
 	/* Now move the zspage to another fullness group, if required */
 	fix_fullness_group(class, first_page);
 	record_obj(handle, obj);
@@ -1540,7 +1546,6 @@ static void obj_free(struct size_class *class, unsigned long obj)
 	kunmap_atomic(vaddr);
 	set_freeobj(first_page, f_objidx);
 	mod_zspage_inuse(first_page, -1);
-	zs_stat_dec(class, OBJ_USED, 1);
 }
 
 void zs_free(struct zs_pool *pool, unsigned long handle)
@@ -1564,10 +1569,19 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 
 	spin_lock(&class->lock);
 	obj_free(class, obj);
+	zs_stat_dec(class, OBJ_USED, 1);
 	fullness = fix_fullness_group(class, first_page);
-	if (fullness == ZS_EMPTY)
-		free_zspage(pool, class, first_page);
+	if (fullness == ZS_EMPTY) {
+		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
+				class->size, class->pages_per_zspage));
+		spin_unlock(&class->lock);
+		atomic_long_sub(class->pages_per_zspage,
+					&pool->pages_allocated);
+		free_zspage(pool, first_page);
+		goto out;
+	}
 	spin_unlock(&class->lock);
+out:
 	unpin_tag(handle);
 
 	free_handle(pool, handle);
@@ -1637,127 +1651,66 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 	kunmap_atomic(s_addr);
 }
 
-/*
- * Find alloced object in zspage from index object and
- * return handle.
- */
-static unsigned long find_alloced_obj(struct size_class *class,
-					struct page *page, int index)
+static unsigned long handle_from_obj(struct size_class *class,
+				struct page *first_page, int obj_idx)
 {
-	unsigned long head;
-	int offset = 0;
-	unsigned long handle = 0;
-	void *addr = kmap_atomic(page);
-
-	if (!is_first_page(page))
-		offset = page->index;
-	offset += class->size * index;
-
-	while (offset < PAGE_SIZE) {
-		head = obj_to_head(class, page, addr + offset);
-		if (head & OBJ_ALLOCATED_TAG) {
-			handle = head & ~OBJ_ALLOCATED_TAG;
-			if (trypin_tag(handle))
-				break;
-			handle = 0;
-		}
+	struct page *page;
+	unsigned long offset_in_page;
+	void *addr;
+	unsigned long head, handle = 0;
 
-		offset += class->size;
-		index++;
-	}
+	objidx_to_page_and_offset(class, first_page, obj_idx,
+			&page, &offset_in_page);
 
+	addr = kmap_atomic(page);
+	head = obj_to_head(class, page, addr + offset_in_page);
+	if (head & OBJ_ALLOCATED_TAG)
+		handle = head & ~OBJ_ALLOCATED_TAG;
 	kunmap_atomic(addr);
+
 	return handle;
 }
 
-struct zs_compact_control {
-	/* Source page for migration which could be a subpage of zspage. */
-	struct page *s_page;
-	/* Destination page for migration which should be a first page
-	 * of zspage. */
-	struct page *d_page;
-	 /* Starting object index within @s_page which used for live object
-	  * in the subpage. */
-	int index;
-};
-
-static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
-				struct zs_compact_control *cc)
+static int migrate_zspage(struct size_class *class, struct page *dst_page,
+				struct page *src_page)
 {
-	unsigned long used_obj, free_obj;
 	unsigned long handle;
-	struct page *s_page = cc->s_page;
-	struct page *d_page = cc->d_page;
-	unsigned long index = cc->index;
-	int ret = 0;
+	unsigned long old_obj, new_obj;
+	int i;
+	int nr_migrated = 0;
 
-	while (1) {
-		handle = find_alloced_obj(class, s_page, index);
-		if (!handle) {
-			s_page = get_next_page(s_page);
-			if (!s_page)
-				break;
-			index = 0;
+	for (i = 0; i < class->objs_per_zspage; i++) {
+		handle = handle_from_obj(class, src_page, i);
+		if (!handle)
 			continue;
-		}
-
-		/* Stop if there is no more space */
-		if (zspage_full(class, d_page)) {
-			unpin_tag(handle);
-			ret = -ENOMEM;
+		if (zspage_full(class, dst_page))
 			break;
-		}
-
-		used_obj = handle_to_obj(handle);
-		free_obj = obj_malloc(class, d_page, handle);
-		zs_object_copy(class, free_obj, used_obj);
-		index++;
+		old_obj = handle_to_obj(handle);
+		new_obj = obj_malloc(class, dst_page, handle);
+		zs_object_copy(class, new_obj, old_obj);
+		nr_migrated++;
 		/*
 		 * record_obj updates handle's value to free_obj and it will
 		 * invalidate lock bit(ie, HANDLE_PIN_BIT) of handle, which
 		 * breaks synchronization using pin_tag(e,g, zs_free) so
 		 * let's keep the lock bit.
 		 */
-		free_obj |= BIT(HANDLE_PIN_BIT);
-		record_obj(handle, free_obj);
-		unpin_tag(handle);
-		obj_free(class, used_obj);
+		new_obj |= BIT(HANDLE_PIN_BIT);
+		record_obj(handle, new_obj);
+		obj_free(class, old_obj);
 	}
-
-	/* Remember last position in this iteration */
-	cc->s_page = s_page;
-	cc->index = index;
-
-	return ret;
-}
-
-static struct page *isolate_target_page(struct size_class *class)
-{
-	int i;
-	struct page *page;
-
-	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
-		page = class->fullness_list[i];
-		if (page) {
-			remove_zspage(class, i, page);
-			break;
-		}
-	}
-
-	return page;
+	return nr_migrated;
 }
 
 /*
  * putback_zspage - add @first_page into right class's fullness list
- * @pool: target pool
  * @class: destination class
  * @first_page: target page
  *
  * Return @first_page's updated fullness_group
  */
-static enum fullness_group putback_zspage(struct zs_pool *pool,
-			struct size_class *class,
-			struct page *first_page)
+static enum fullness_group putback_zspage(struct size_class *class,
+					struct page *first_page)
 {
 	enum fullness_group fullness;
 
@@ -1768,17 +1721,155 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
 	return fullness;
 }
 
+/*
+ * freeze_zspage - freeze all objects in a zspage
+ * @class: size class of the page
+ * @first_page: first page of zspage
+ *
+ * Freeze all allocated objects in a zspage so they cannot be freed
+ * until they are unfrozen. It should be called under class->lock.
+ *
+ * RETURNS:
+ * the number of pinned objects
+ */
+static int freeze_zspage(struct size_class *class, struct page *first_page)
+{
+	unsigned long obj_idx;
+	struct page *obj_page;
+	unsigned long offset;
+	void *addr;
+	int nr_freeze = 0;
+
+	for (obj_idx = 0; obj_idx < class->objs_per_zspage; obj_idx++) {
+		unsigned long head;
+
+		objidx_to_page_and_offset(class, first_page, obj_idx,
+					&obj_page, &offset);
+		addr = kmap_atomic(obj_page);
+		head = obj_to_head(class, obj_page, addr + offset);
+		if (head & OBJ_ALLOCATED_TAG) {
+			unsigned long handle = head & ~OBJ_ALLOCATED_TAG;
+
+			if (!trypin_tag(handle)) {
+				kunmap_atomic(addr);
+				break;
+			}
+			nr_freeze++;
+		}
+		kunmap_atomic(addr);
+	}
+
+	return nr_freeze;
+}
+
+/*
+ * unfreeze_zspage - unfreeze objects frozen by freeze_zspage in a zspage
+ * @class: size class of the page
+ * @first_page: frozen zspage to unfreeze
+ * @nr_obj: the number of objects to unfreeze
+ *
+ * Unfreeze objects in a zspage.
+ */
+static void unfreeze_zspage(struct size_class *class, struct page *first_page,
+			int nr_obj)
+{
+	unsigned long obj_idx;
+	struct page *obj_page;
+	unsigned long offset;
+	void *addr;
+	int nr_unfreeze = 0;
+
+	for (obj_idx = 0; obj_idx < class->objs_per_zspage &&
+			nr_unfreeze < nr_obj; obj_idx++) {
+		unsigned long head;
+
+		objidx_to_page_and_offset(class, first_page, obj_idx,
+					&obj_page, &offset);
+		addr = kmap_atomic(obj_page);
+		head = obj_to_head(class, obj_page, addr + offset);
+		if (head & OBJ_ALLOCATED_TAG) {
+			unsigned long handle = head & ~OBJ_ALLOCATED_TAG;
+
+			VM_BUG_ON(!testpin_tag(handle));
+			unpin_tag(handle);
+			nr_unfreeze++;
+		}
+		kunmap_atomic(addr);
+	}
+}
+
+/*
+ * isolate_source_page - isolate a zspage for migration source
+ * @class: size class of zspage for isolation
+ *
+ * Returns a zspage which has been isolated from its fullness list so
+ * nobody can allocate a new object from it. It also freezes all objects
+ * allocated in the zspage so nobody can access those objects
+ * (e.g., via zs_map_object or zs_free).
+ */
 static struct page *isolate_source_page(struct size_class *class)
 {
 	int i;
 	struct page *page = NULL;
 
 	for (i = ZS_ALMOST_EMPTY; i >= ZS_ALMOST_FULL; i--) {
+		int inuse, freezed;
+
 		page = class->fullness_list[i];
 		if (!page)
 			continue;
 
 		remove_zspage(class, i, page);
+
+		inuse = get_zspage_inuse(page);
+		freezed = freeze_zspage(class, page);
+
+		if (inuse != freezed) {
+			unfreeze_zspage(class, page, freezed);
+			putback_zspage(class, page);
+			page = NULL;
+			continue;
+		}
+
+		break;
+	}
+
+	return page;
+}
+
+/*
+ * isolate_target_page - isolate a zspage for migration target
+ * @class: size class of zspage for isolation
+ *
+ * Returns a zspage which has been isolated from its fullness list so
+ * nobody can allocate a new object from it. It also freezes all objects
+ * allocated in the zspage so nobody can access those objects
+ * (e.g., via zs_map_object or zs_free).
+ */
+static struct page *isolate_target_page(struct size_class *class)
+{
+	int i;
+	struct page *page;
+
+	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
+		int inuse, freezed;
+
+		page = class->fullness_list[i];
+		if (!page)
+			continue;
+
+		remove_zspage(class, i, page);
+
+		inuse = get_zspage_inuse(page);
+		freezed = freeze_zspage(class, page);
+
+		if (inuse != freezed) {
+			unfreeze_zspage(class, page, freezed);
+			putback_zspage(class, page);
+			page = NULL;
+			continue;
+		}
+
 		break;
 	}
 
@@ -1793,9 +1884,11 @@ static struct page *isolate_source_page(struct size_class *class)
 static unsigned long zs_can_compact(struct size_class *class)
 {
 	unsigned long obj_wasted;
+	unsigned long obj_allocated, obj_used;
 
-	obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
-		zs_stat_get(class, OBJ_USED);
+	obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
+	obj_used = zs_stat_get(class, OBJ_USED);
+	obj_wasted = obj_allocated - obj_used;
 
 	obj_wasted /= get_maxobj_per_zspage(class->size,
 			class->pages_per_zspage);
@@ -1805,53 +1898,81 @@ static unsigned long zs_can_compact(struct size_class *class)
 
 static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 {
-	struct zs_compact_control cc;
-	struct page *src_page;
+	struct page *src_page = NULL;
 	struct page *dst_page = NULL;
 
-	spin_lock(&class->lock);
-	while ((src_page = isolate_source_page(class))) {
+	while (1) {
+		int nr_migrated;
 
-		if (!zs_can_compact(class))
+		spin_lock(&class->lock);
+		if (!zs_can_compact(class)) {
+			spin_unlock(&class->lock);
 			break;
+		}
 
-		cc.index = 0;
-		cc.s_page = src_page;
+		/*
+		 * Isolate source page and freeze all objects in a zspage
+		 * to prevent zspage destroying.
+		 */
+		if (!src_page) {
+			src_page = isolate_source_page(class);
+			if (!src_page) {
+				spin_unlock(&class->lock);
+				break;
+			}
+		}
 
-		while ((dst_page = isolate_target_page(class))) {
-			cc.d_page = dst_page;
-			/*
-			 * If there is no more space in dst_page, resched
-			 * and see if anyone had allocated another zspage.
-			 */
-			if (!migrate_zspage(pool, class, &cc))
+		/* Isolate target page and freeze all objects in the zspage */
+		if (!dst_page) {
+			dst_page = isolate_target_page(class);
+			if (!dst_page) {
+				spin_unlock(&class->lock);
 				break;
+			}
+		}
+		spin_unlock(&class->lock);
+
+		nr_migrated = migrate_zspage(class, dst_page, src_page);
 
-			VM_BUG_ON_PAGE(putback_zspage(pool, class,
-				dst_page) == ZS_EMPTY, dst_page);
+		if (zspage_full(class, dst_page)) {
+			spin_lock(&class->lock);
+			putback_zspage(class, dst_page);
+			unfreeze_zspage(class, dst_page,
+				class->objs_per_zspage);
+			spin_unlock(&class->lock);
+			dst_page = NULL;
 		}
 
-		/* Stop if we couldn't find slot */
-		if (dst_page == NULL)
-			break;
+		if (zspage_empty(class, src_page)) {
+			free_zspage(pool, src_page);
+			spin_lock(&class->lock);
+			zs_stat_dec(class, OBJ_ALLOCATED,
+				get_maxobj_per_zspage(
+				class->size, class->pages_per_zspage));
+			atomic_long_sub(class->pages_per_zspage,
+					&pool->pages_allocated);
 
-		VM_BUG_ON_PAGE(putback_zspage(pool, class,
-				dst_page) == ZS_EMPTY, dst_page);
-		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
 			pool->stats.pages_compacted += class->pages_per_zspage;
 			spin_unlock(&class->lock);
-			free_zspage(pool, class, src_page);
-		} else {
-			spin_unlock(&class->lock);
+			src_page = NULL;
 		}
+	}
 
-		cond_resched();
-		spin_lock(&class->lock);
+	if (!src_page && !dst_page)
+		return;
+
+	spin_lock(&class->lock);
+	if (src_page) {
+		putback_zspage(class, src_page);
+		unfreeze_zspage(class, src_page,
+				class->objs_per_zspage);
 	}
 
-	if (src_page)
-		VM_BUG_ON_PAGE(putback_zspage(pool, class,
-				src_page) == ZS_EMPTY, src_page);
+	if (dst_page) {
+		putback_zspage(class, dst_page);
+		unfreeze_zspage(class, dst_page,
+				class->objs_per_zspage);
+	}
 
 	spin_unlock(&class->lock);
 }
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 13/16] zsmalloc: migrate head page of zspage
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (11 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 12/16] zsmalloc: zs_compact refactoring Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-04-06 13:01   ` Chulmin Kim
  2016-04-19  6:08   ` Chulmin Kim
  2016-03-30  7:12 ` [PATCH v3 14/16] zsmalloc: use single linked list for page chain Minchan Kim
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

This patch introduces a run-time migration feature for zspages.
To begin with, it supports only head page migration for easy
review (later patches will support tail page migration).

For migration, it provides three functions:

* zs_page_isolate

It isolates from its class a zspage which includes a subpage the
VM wants to migrate, so nobody can allocate a new object from the
zspage. IOW, an allocation freeze.

* zs_page_migrate

First of all, it freezes the zspage to prevent zspage destruction
so nobody can free an object. Then it copies content from the old
page to the new page and creates a new page chain with the new
page. If that was successful, it drops the refcount of the old
page to free it and puts the new zspage back into the right data
structure of zsmalloc. Lastly, it unfreezes the zspage so object
allocation/free is allowed again.

* zs_page_putback

It returns an isolated zspage to the right fullness_group list
if migrating a page fails.

NOTE: One hurdle for supporting migration is that a zspage can be
destroyed while migration is going on. Once a zspage is isolated,
nobody can allocate an object from it, but objects can still be
deallocated freely, so the zspage could be destroyed before all of
its objects are frozen to prevent deallocation. The problem is the
large window between zs_page_isolate and freeze_zspage in
zs_page_migrate, during which the zspage could be destroyed.

An easy approach would be to freeze objects in zs_page_isolate,
but it has the drawback that no object can be deallocated from
isolation until migration fails. Because there is a large time gap
between isolation and migration, any object freeing on another CPU
would have to spin on pin_tag, which would cause big latency.
So this patch introduces lock_zspage, which takes the page lock of
every page in a zspage right before freeing the zspage. VM
migration locks the page too, right before calling ->migratepage,
so that race no longer exists.
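
For illustration only: the three callbacks above are exposed to the VM
through address_space_operations; assuming the isolate_page and
putback_page hooks this series adds alongside the existing migratepage
hook, the wiring would look roughly like the sketch below (the struct
name is made up):

	static const struct address_space_operations zsmalloc_aops = {
		.isolate_page	= zs_page_isolate,
		.migratepage	= zs_page_migrate,
		.putback_page	= zs_page_putback,
	};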

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/uapi/linux/magic.h |   1 +
 mm/zsmalloc.c              | 332 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 318 insertions(+), 15 deletions(-)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index e1fbe72c39c0..93b1affe4801 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -79,5 +79,6 @@
 #define NSFS_MAGIC		0x6e736673
 #define BPF_FS_MAGIC		0xcafe4a11
 #define BALLOON_KVM_MAGIC	0x13661366
+#define ZSMALLOC_MAGIC		0x58295829
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index ac8ca7b10720..f6c9138c3be0 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -56,6 +56,8 @@
 #include <linux/debugfs.h>
 #include <linux/zsmalloc.h>
 #include <linux/zpool.h>
+#include <linux/mount.h>
+#include <linux/migrate.h>
 
 /*
  * This must be power of 2 and greater than of equal to sizeof(link_free).
@@ -182,6 +184,8 @@ struct zs_size_stat {
 static struct dentry *zs_stat_root;
 #endif
 
+static struct vfsmount *zsmalloc_mnt;
+
 /*
  * number of size_classes
  */
@@ -263,6 +267,7 @@ struct zs_pool {
 #ifdef CONFIG_ZSMALLOC_STAT
 	struct dentry *stat_dentry;
 #endif
+	struct inode *inode;
 };
 
 struct zs_meta {
@@ -412,6 +417,29 @@ static int is_last_page(struct page *page)
 	return PagePrivate2(page);
 }
 
+/*
+ * Indicates whether the zspage is isolated for page migration.
+ * Protected by the size_class lock.
+ */
+static void SetZsPageIsolate(struct page *first_page)
+{
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+	SetPageUptodate(first_page);
+}
+
+static int ZsPageIsolate(struct page *first_page)
+{
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	return PageUptodate(first_page);
+}
+
+static void ClearZsPageIsolate(struct page *first_page)
+{
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+	ClearPageUptodate(first_page);
+}
+
 static int get_zspage_inuse(struct page *first_page)
 {
 	struct zs_meta *m;
@@ -783,8 +811,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
 	if (newfg == currfg)
 		goto out;
 
-	remove_zspage(class, currfg, first_page);
-	insert_zspage(class, newfg, first_page);
+	/* Later, putback will insert page to right list */
+	if (!ZsPageIsolate(first_page)) {
+		remove_zspage(class, currfg, first_page);
+		insert_zspage(class, newfg, first_page);
+	}
 	set_zspage_mapping(first_page, class_idx, newfg);
 
 out:
@@ -950,12 +981,31 @@ static void unpin_tag(unsigned long handle)
 
 static void reset_page(struct page *page)
 {
+	if (!PageIsolated(page))
+		__ClearPageMovable(page);
+	ClearPageIsolated(page);
 	clear_bit(PG_private, &page->flags);
 	clear_bit(PG_private_2, &page->flags);
 	set_page_private(page, 0);
 	page->freelist = NULL;
 }
 
+/**
+ * lock_zspage - lock all pages in the zspage
+ * @first_page: head page of the zspage
+ *
+ * To prevent destruction during migration, zspage freeing should
+ * hold the locks of all pages in the zspage
+ */
+void lock_zspage(struct page *first_page)
+{
+	struct page *cursor = first_page;
+
+	do {
+		while (!trylock_page(cursor));
+	} while ((cursor = get_next_page(cursor)) != NULL);
+}
+
 static void free_zspage(struct zs_pool *pool, struct page *first_page)
 {
 	struct page *nextp, *tmp, *head_extra;
@@ -963,26 +1013,31 @@ static void free_zspage(struct zs_pool *pool, struct page *first_page)
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
 
+	lock_zspage(first_page);
 	head_extra = (struct page *)page_private(first_page);
 
-	reset_page(first_page);
-	__free_page(first_page);
-
 	/* zspage with only 1 system page */
 	if (!head_extra)
-		return;
+		goto out;
 
 	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
 		list_del(&nextp->lru);
 		reset_page(nextp);
-		__free_page(nextp);
+		unlock_page(nextp);
+		put_page(nextp);
 	}
 	reset_page(head_extra);
-	__free_page(head_extra);
+	unlock_page(head_extra);
+	put_page(head_extra);
+out:
+	reset_page(first_page);
+	unlock_page(first_page);
+	put_page(first_page);
 }
 
 /* Initialize a newly allocated zspage */
-static void init_zspage(struct size_class *class, struct page *first_page)
+static void init_zspage(struct size_class *class, struct page *first_page,
+			struct address_space *mapping)
 {
 	int freeobj = 1;
 	unsigned long off = 0;
@@ -991,6 +1046,9 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 	first_page->freelist = NULL;
 	INIT_LIST_HEAD(&first_page->lru);
 	set_zspage_inuse(first_page, 0);
+	BUG_ON(!trylock_page(first_page));
+	__SetPageMovable(first_page, mapping);
+	unlock_page(first_page);
 
 	while (page) {
 		struct page *next_page;
@@ -1065,10 +1123,45 @@ static void create_page_chain(struct page *pages[], int nr_pages)
 	}
 }
 
+static void replace_sub_page(struct size_class *class, struct page *first_page,
+		struct page *newpage, struct page *oldpage)
+{
+	struct page *page;
+	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL,};
+	int idx = 0;
+
+	page = first_page;
+	do {
+		if (page == oldpage)
+			pages[idx] = newpage;
+		else
+			pages[idx] = page;
+		idx++;
+	} while ((page = get_next_page(page)) != NULL);
+
+	create_page_chain(pages, class->pages_per_zspage);
+
+	if (is_first_page(oldpage)) {
+		enum fullness_group fg;
+		int class_idx;
+
+		SetZsPageIsolate(newpage);
+		get_zspage_mapping(oldpage, &class_idx, &fg);
+		set_zspage_mapping(newpage, class_idx, fg);
+		set_freeobj(newpage, get_freeobj(oldpage));
+		set_zspage_inuse(newpage, get_zspage_inuse(oldpage));
+		if (class->huge)
+			set_page_private(newpage,  page_private(oldpage));
+	}
+
+	__SetPageMovable(newpage, oldpage->mapping);
+}
+
 /*
  * Allocate a zspage for the given size class
  */
-static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
+static struct page *alloc_zspage(struct zs_pool *pool,
+				struct size_class *class)
 {
 	int i;
 	struct page *first_page = NULL;
@@ -1088,7 +1181,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 	for (i = 0; i < class->pages_per_zspage; i++) {
 		struct page *page;
 
-		page = alloc_page(flags);
+		page = alloc_page(pool->flags);
 		if (!page) {
 			while (--i >= 0)
 				__free_page(pages[i]);
@@ -1100,7 +1193,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 
 	create_page_chain(pages, class->pages_per_zspage);
 	first_page = pages[0];
-	init_zspage(class, first_page);
+	init_zspage(class, first_page, pool->inode->i_mapping);
 
 	return first_page;
 }
@@ -1499,7 +1592,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
 
 	if (!first_page) {
 		spin_unlock(&class->lock);
-		first_page = alloc_zspage(class, pool->flags);
+		first_page = alloc_zspage(pool, class);
 		if (unlikely(!first_page)) {
 			free_handle(pool, handle);
 			return 0;
@@ -1559,6 +1652,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 	if (unlikely(!handle))
 		return;
 
+	/* Once handle is pinned, page|object migration cannot work */
 	pin_tag(handle);
 	obj = handle_to_obj(handle);
 	obj_to_location(obj, &f_page, &f_objidx);
@@ -1714,6 +1808,9 @@ static enum fullness_group putback_zspage(struct size_class *class,
 {
 	enum fullness_group fullness;
 
+	VM_BUG_ON_PAGE(!list_empty(&first_page->lru), first_page);
+	VM_BUG_ON_PAGE(ZsPageIsolate(first_page), first_page);
+
 	fullness = get_fullness_group(class, first_page);
 	insert_zspage(class, fullness, first_page);
 	set_zspage_mapping(first_page, class->index, fullness);
@@ -2059,6 +2156,173 @@ static int zs_register_shrinker(struct zs_pool *pool)
 	return register_shrinker(&pool->shrinker);
 }
 
+bool zs_page_isolate(struct page *page, isolate_mode_t mode)
+{
+	struct zs_pool *pool;
+	struct size_class *class;
+	int class_idx;
+	enum fullness_group fullness;
+	struct page *first_page;
+
+	/*
+	 * The page is locked so it couldn't be destroyed.
+	 * For detail, look at lock_zspage in free_zspage.
+	 */
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(PageIsolated(page), page);
+	/*
+	 * In this implementation, it allows only first page migration.
+	 */
+	VM_BUG_ON_PAGE(!is_first_page(page), page);
+	first_page = page;
+
+	/*
+	 * Without class lock, fullness is meaningless while constant
+	 * class_idx is okay. We will get it under class lock at below,
+	 * again.
+	 */
+	get_zspage_mapping(first_page, &class_idx, &fullness);
+	pool = page->mapping->private_data;
+	class = pool->size_class[class_idx];
+
+	if (!spin_trylock(&class->lock))
+		return false;
+
+	get_zspage_mapping(first_page, &class_idx, &fullness);
+	remove_zspage(class, fullness, first_page);
+	SetZsPageIsolate(first_page);
+	SetPageIsolated(page);
+	spin_unlock(&class->lock);
+
+	return true;
+}
+
+int zs_page_migrate(struct address_space *mapping, struct page *newpage,
+		struct page *page, enum migrate_mode mode)
+{
+	struct zs_pool *pool;
+	struct size_class *class;
+	int class_idx;
+	enum fullness_group fullness;
+	struct page *first_page;
+	void *s_addr, *d_addr, *addr;
+	int ret = -EBUSY;
+	int offset = 0;
+	int freezed = 0;
+
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+	first_page = page;
+	get_zspage_mapping(first_page, &class_idx, &fullness);
+	pool = page->mapping->private_data;
+	class = pool->size_class[class_idx];
+
+	/*
+	 * Get stable fullness under class->lock
+	 */
+	if (!spin_trylock(&class->lock))
+		return ret;
+
+	get_zspage_mapping(first_page, &class_idx, &fullness);
+	if (get_zspage_inuse(first_page) == 0)
+		goto out_class_unlock;
+
+	freezed = freeze_zspage(class, first_page);
+	if (freezed != get_zspage_inuse(first_page))
+		goto out_unfreeze;
+
+	/* copy contents from page to newpage */
+	s_addr = kmap_atomic(page);
+	d_addr = kmap_atomic(newpage);
+	memcpy(d_addr, s_addr, PAGE_SIZE);
+	kunmap_atomic(d_addr);
+	kunmap_atomic(s_addr);
+
+	if (!is_first_page(page))
+		offset = page->index;
+
+	addr = kmap_atomic(page);
+	do {
+		unsigned long handle;
+		unsigned long head;
+		unsigned long new_obj, old_obj;
+		unsigned long obj_idx;
+		struct page *dummy;
+
+		head = obj_to_head(class, page, addr + offset);
+		if (head & OBJ_ALLOCATED_TAG) {
+			handle = head & ~OBJ_ALLOCATED_TAG;
+			if (!testpin_tag(handle))
+				BUG();
+
+			old_obj = handle_to_obj(handle);
+			obj_to_location(old_obj, &dummy, &obj_idx);
+			new_obj = location_to_obj(newpage, obj_idx);
+			new_obj |= BIT(HANDLE_PIN_BIT);
+			record_obj(handle, new_obj);
+		}
+		offset += class->size;
+	} while (offset < PAGE_SIZE);
+	kunmap_atomic(addr);
+
+	replace_sub_page(class, first_page, newpage, page);
+	first_page = newpage;
+	get_page(newpage);
+	VM_BUG_ON_PAGE(get_fullness_group(class, first_page) ==
+			ZS_EMPTY, first_page);
+	ClearZsPageIsolate(first_page);
+	putback_zspage(class, first_page);
+
+	/* Migration complete. Free old page */
+	ClearPageIsolated(page);
+	reset_page(page);
+	put_page(page);
+	ret = MIGRATEPAGE_SUCCESS;
+
+out_unfreeze:
+	unfreeze_zspage(class, first_page, freezed);
+out_class_unlock:
+	spin_unlock(&class->lock);
+
+	return ret;
+}
+
+void zs_page_putback(struct page *page)
+{
+	struct zs_pool *pool;
+	struct size_class *class;
+	int class_idx;
+	enum fullness_group fullness;
+	struct page *first_page;
+
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+	first_page = page;
+	get_zspage_mapping(first_page, &class_idx, &fullness);
+	pool = page->mapping->private_data;
+	class = pool->size_class[class_idx];
+
+	/*
+	 * If there is a race between zs_free and here, free_zspage
+	 * in zs_free will wait the page lock of @page without
+	 * destroying of zspage.
+	 */
+	INIT_LIST_HEAD(&first_page->lru);
+	spin_lock(&class->lock);
+	ClearPageIsolated(page);
+	ClearZsPageIsolate(first_page);
+	putback_zspage(class, first_page);
+	spin_unlock(&class->lock);
+}
+
+const struct address_space_operations zsmalloc_aops = {
+	.isolate_page = zs_page_isolate,
+	.migratepage = zs_page_migrate,
+	.putback_page = zs_page_putback,
+};
+
 /**
  * zs_create_pool - Creates an allocation pool to work from.
  * @flags: allocation flags used to allocate pool metadata
@@ -2145,6 +2409,15 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
 	if (zs_pool_stat_create(pool, name))
 		goto err;
 
+	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
+	if (IS_ERR(pool->inode)) {
+		pool->inode = NULL;
+		goto err;
+	}
+
+	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
+	pool->inode->i_mapping->private_data = pool;
+
 	/*
 	 * Not critical, we still can use the pool
 	 * and user can trigger compaction manually.
@@ -2164,6 +2437,8 @@ void zs_destroy_pool(struct zs_pool *pool)
 	int i;
 
 	zs_unregister_shrinker(pool);
+	if (pool->inode)
+		iput(pool->inode);
 	zs_pool_stat_destroy(pool);
 
 	for (i = 0; i < zs_size_classes; i++) {
@@ -2192,10 +2467,33 @@ void zs_destroy_pool(struct zs_pool *pool)
 }
 EXPORT_SYMBOL_GPL(zs_destroy_pool);
 
+static struct dentry *zs_mount(struct file_system_type *fs_type,
+				int flags, const char *dev_name, void *data)
+{
+	static const struct dentry_operations ops = {
+		.d_dname = simple_dname,
+	};
+
+	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
+}
+
+static struct file_system_type zsmalloc_fs = {
+	.name		= "zsmalloc",
+	.mount		= zs_mount,
+	.kill_sb	= kill_anon_super,
+};
+
 static int __init zs_init(void)
 {
-	int ret = zs_register_cpu_notifier();
+	int ret;
+
+	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
+	if (IS_ERR(zsmalloc_mnt)) {
+		ret = PTR_ERR(zsmalloc_mnt);
+		goto out;
+	}
 
+	ret = zs_register_cpu_notifier();
 	if (ret)
 		goto notifier_fail;
 
@@ -2218,6 +2516,7 @@ static int __init zs_init(void)
 		pr_err("zs stat initialization failed\n");
 		goto stat_fail;
 	}
+
 	return 0;
 
 stat_fail:
@@ -2226,7 +2525,8 @@ static int __init zs_init(void)
 #endif
 notifier_fail:
 	zs_unregister_cpu_notifier();
-
+	kern_unmount(zsmalloc_mnt);
+out:
 	return ret;
 }
 
@@ -2237,6 +2537,8 @@ static void __exit zs_exit(void)
 #endif
 	zs_unregister_cpu_notifier();
 
+	kern_unmount(zsmalloc_mnt);
+
 	zs_stat_exit();
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 14/16] zsmalloc: use single linked list for page chain
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (12 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 13/16] zsmalloc: migrate head page of zspage Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-03-30  7:12 ` [PATCH v3 15/16] zsmalloc: migrate tail pages in zspage Minchan Kim
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

For tail page migration, we should not use page->lru for page
chaining because the VM will use it for its own purposes, so we
need another field for chaining. A singly linked list is enough
for chaining, and page->index of a tail page, which points to the
first object offset in the page, can be replaced by a run-time
calculation.

So, this patch replaces the page->lru chaining with a singly linked
list squeezed into page->freelist and introduces get_first_obj_ofs
to compute the first object offset in a page.

With that, zsmalloc can maintain the page chain without using
page->lru.
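
A worked example of the get_first_obj_ofs arithmetic (illustration
only; the class parameters below are made up, not taken from this
patch):

	/* Standalone userspace illustration of get_first_obj_ofs().
	 * Assume a size class with 720-byte objects spanning 4 pages,
	 * so objs_per_zspage = 4 * 4096 / 720 = 22.
	 */
	#include <stdio.h>

	#define PAGE_SIZE 4096

	int main(void)
	{
		int size = 720, pages_per_zspage = 4;
		int objs_per_zspage = pages_per_zspage * PAGE_SIZE / size;
		int page_idx = 1;	/* first tail page */

		/* roughly: start of the last object that begins in the
		 * previous page */
		int pos = (((objs_per_zspage * size) * page_idx /
				pages_per_zspage) / size) * size;
		/* offset of the first object starting in this page */
		int ofs = (pos + size) % PAGE_SIZE;

		/* object 5 starts at byte 3600 of the zspage and crosses
		 * into page 1, so the first object starting in page 1 is
		 * at offset 224 */
		printf("pos=%d ofs=%d\n", pos, ofs);	/* pos=3600 ofs=224 */
		return 0;
	}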

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 119 ++++++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 78 insertions(+), 41 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index f6c9138c3be0..a41cf3ef2077 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -17,10 +17,7 @@
  *
  * Usage of struct page fields:
  *	page->private: points to the first component (0-order) page
- *	page->index (union with page->freelist): offset of the first object
- *		starting in this page.
- *	page->lru: links together all component pages (except the first page)
- *		of a zspage
+ *	page->index (union with page->freelist): override by struct zs_meta
  *
  *	For _first_ page only:
  *
@@ -271,10 +268,19 @@ struct zs_pool {
 };
 
 struct zs_meta {
-	unsigned long freeobj:FREEOBJ_BITS;
-	unsigned long class:CLASS_BITS;
-	unsigned long fullness:FULLNESS_BITS;
-	unsigned long inuse:INUSE_BITS;
+	union {
+		/* first page */
+		struct {
+			unsigned long freeobj:FREEOBJ_BITS;
+			unsigned long class:CLASS_BITS;
+			unsigned long fullness:FULLNESS_BITS;
+			unsigned long inuse:INUSE_BITS;
+		};
+		/* tail pages */
+		struct {
+			struct page *next;
+		};
+	};
 };
 
 struct mapping_area {
@@ -491,6 +497,34 @@ static unsigned long get_freeobj(struct page *first_page)
 	return m->freeobj;
 }
 
+static void set_next_page(struct page *page, struct page *next)
+{
+	struct zs_meta *m;
+
+	VM_BUG_ON_PAGE(is_first_page(page), page);
+
+	m = (struct zs_meta *)&page->index;
+	m->next = next;
+}
+
+static struct page *get_next_page(struct page *page)
+{
+	struct page *next;
+
+	if (is_last_page(page))
+		next = NULL;
+	else if (is_first_page(page))
+		next = (struct page *)page_private(page);
+	else {
+		struct zs_meta *m = (struct zs_meta *)&page->index;
+
+		VM_BUG_ON(!m->next);
+		next = m->next;
+	}
+
+	return next;
+}
+
 static void get_zspage_mapping(struct page *first_page,
 				unsigned int *class_idx,
 				enum fullness_group *fullness)
@@ -871,18 +905,30 @@ static struct page *get_first_page(struct page *page)
 		return (struct page *)page_private(page);
 }
 
-static struct page *get_next_page(struct page *page)
+int get_first_obj_ofs(struct size_class *class, struct page *first_page,
+			struct page *page)
 {
-	struct page *next;
+	int pos, bound;
+	int page_idx = 0;
+	int ofs = 0;
+	struct page *cursor = first_page;
 
-	if (is_last_page(page))
-		next = NULL;
-	else if (is_first_page(page))
-		next = (struct page *)page_private(page);
-	else
-		next = list_entry(page->lru.next, struct page, lru);
+	if (first_page == page)
+		goto out;
 
-	return next;
+	while (page != cursor) {
+		page_idx++;
+		cursor = get_next_page(cursor);
+	}
+
+	bound = PAGE_SIZE * page_idx;
+	pos = (((class->objs_per_zspage * class->size) *
+		page_idx / class->pages_per_zspage) / class->size
+		) * class->size;
+
+	ofs = (pos + class->size) % PAGE_SIZE;
+out:
+	return ofs;
 }
 
 static void objidx_to_page_and_offset(struct size_class *class,
@@ -1008,27 +1054,25 @@ void lock_zspage(struct page *first_page)
 
 static void free_zspage(struct zs_pool *pool, struct page *first_page)
 {
-	struct page *nextp, *tmp, *head_extra;
+	struct page *nextp, *tmp;
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
 
 	lock_zspage(first_page);
-	head_extra = (struct page *)page_private(first_page);
+	nextp = (struct page *)page_private(first_page);
 
 	/* zspage with only 1 system page */
-	if (!head_extra)
+	if (!nextp)
 		goto out;
 
-	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
-		list_del(&nextp->lru);
-		reset_page(nextp);
-		unlock_page(nextp);
-		put_page(nextp);
-	}
-	reset_page(head_extra);
-	unlock_page(head_extra);
-	put_page(head_extra);
+	do {
+		tmp = nextp;
+		nextp = get_next_page(nextp);
+		reset_page(tmp);
+		unlock_page(tmp);
+		put_page(tmp);
+	} while (nextp);
 out:
 	reset_page(first_page);
 	unlock_page(first_page);
@@ -1055,13 +1099,6 @@ static void init_zspage(struct size_class *class, struct page *first_page,
 		struct link_free *link;
 		void *vaddr;
 
-		/*
-		 * page->index stores offset of first object starting
-		 * in the page.
-		 */
-		if (page != first_page)
-			page->index = off;
-
 		vaddr = kmap_atomic(page);
 		link = (struct link_free *)vaddr + off / sizeof(*link);
 
@@ -1103,7 +1140,6 @@ static void create_page_chain(struct page *pages[], int nr_pages)
 	for (i = 0; i < nr_pages; i++) {
 		page = pages[i];
 
-		INIT_LIST_HEAD(&page->lru);
 		if (i == 0) {
 			SetPagePrivate(page);
 			set_page_private(page, 0);
@@ -1112,10 +1148,12 @@ static void create_page_chain(struct page *pages[], int nr_pages)
 
 		if (i == 1)
 			set_page_private(first_page, (unsigned long)page);
-		if (i >= 1)
+		if (i >= 1) {
+			set_next_page(page, NULL);
 			set_page_private(page, (unsigned long)first_page);
+		}
 		if (i >= 2)
-			list_add(&page->lru, &prev_page->lru);
+			set_next_page(prev_page, page);
 		if (i == nr_pages - 1)
 			SetPagePrivate2(page);
 
@@ -2239,8 +2277,7 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	kunmap_atomic(d_addr);
 	kunmap_atomic(s_addr);
 
-	if (!is_first_page(page))
-		offset = page->index;
+	offset = get_first_obj_ofs(class, first_page, page);
 
 	addr = kmap_atomic(page);
 	do {
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 15/16] zsmalloc: migrate tail pages in zspage
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (13 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 14/16] zsmalloc: use single linked list for page chain Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-03-30  7:12 ` [PATCH v3 16/16] zram: use __GFP_MOVABLE for memory allocation Minchan Kim
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

This patch enables tail page migration of zspage.

At this point, I tested for zsmalloc regression with a
micro-benchmark which does zs_malloc/map/unmap/zs_free for all
size classes on every CPU (my system has 12) for 20 seconds.

It shows a 1% regression, which is really small when we consider
the benefit of this feature and the real-workload overhead (i.e.,
most of the overhead comes from compression).
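
The benchmark loop was roughly of the following shape (illustration
only; this is not the actual test module, and the per-CPU threads
and error handling are omitted):

	#include <linux/module.h>
	#include <linux/jiffies.h>
	#include <linux/sched.h>
	#include <linux/string.h>
	#include <linux/zsmalloc.h>

	static int __init zs_bench_init(void)
	{
		struct zs_pool *pool = zs_create_pool("zs_bench", GFP_KERNEL);
		unsigned long handle, end = jiffies + 20 * HZ;
		size_t size;
		void *obj;

		if (!pool)
			return -ENOMEM;

		while (time_before(jiffies, end)) {
			/* step through rough approximations of the classes */
			for (size = 32; size <= PAGE_SIZE; size += 256) {
				handle = zs_malloc(pool, size);
				if (!handle)
					continue;
				obj = zs_map_object(pool, handle, ZS_MM_WO);
				memset(obj, 0, size);
				zs_unmap_object(pool, handle);
				zs_free(pool, handle);
			}
			cond_resched();
		}
		zs_destroy_pool(pool);
		return 0;
	}
	module_init(zs_bench_init);
	MODULE_LICENSE("GPL");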

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 129 +++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 114 insertions(+), 15 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index a41cf3ef2077..e24f4a160892 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -551,6 +551,19 @@ static void set_zspage_mapping(struct page *first_page,
 	m->class = class_idx;
 }
 
+static bool check_isolated_page(struct page *first_page)
+{
+	struct page *cursor;
+
+	for (cursor = first_page; cursor != NULL; cursor =
+					get_next_page(cursor)) {
+		if (PageIsolated(cursor))
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * zsmalloc divides the pool into various size classes where each
  * class maintains a list of zspages where each zspage is divided
@@ -1052,6 +1065,44 @@ void lock_zspage(struct page *first_page)
 	} while ((cursor = get_next_page(cursor)) != NULL);
 }
 
+int trylock_zspage(struct page *first_page, struct page *locked_page)
+{
+	struct page *cursor, *fail;
+
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+
+	for (cursor = first_page; cursor != NULL; cursor =
+			get_next_page(cursor)) {
+		if (cursor != locked_page) {
+			if (!trylock_page(cursor)) {
+				fail = cursor;
+				goto unlock;
+			}
+		}
+	}
+
+	return 1;
+unlock:
+	for (cursor = first_page; cursor != fail; cursor =
+			get_next_page(cursor)) {
+		if (cursor != locked_page)
+			unlock_page(cursor);
+	}
+
+	return 0;
+}
+
+void unlock_zspage(struct page *first_page, struct page *locked_page)
+{
+	struct page *cursor = first_page;
+
+	for (; cursor != NULL; cursor = get_next_page(cursor)) {
+		VM_BUG_ON_PAGE(!PageLocked(cursor), cursor);
+		if (cursor != locked_page)
+			unlock_page(cursor);
+	}
+}
+
 static void free_zspage(struct zs_pool *pool, struct page *first_page)
 {
 	struct page *nextp, *tmp;
@@ -1090,15 +1141,16 @@ static void init_zspage(struct size_class *class, struct page *first_page,
 	first_page->freelist = NULL;
 	INIT_LIST_HEAD(&first_page->lru);
 	set_zspage_inuse(first_page, 0);
-	BUG_ON(!trylock_page(first_page));
-	__SetPageMovable(first_page, mapping);
-	unlock_page(first_page);
 
 	while (page) {
 		struct page *next_page;
 		struct link_free *link;
 		void *vaddr;
 
+		BUG_ON(!trylock_page(page));
+		__SetPageMovable(page, mapping);
+		unlock_page(page);
+
 		vaddr = kmap_atomic(page);
 		link = (struct link_free *)vaddr + off / sizeof(*link);
 
@@ -1848,6 +1900,7 @@ static enum fullness_group putback_zspage(struct size_class *class,
 
 	VM_BUG_ON_PAGE(!list_empty(&first_page->lru), first_page);
 	VM_BUG_ON_PAGE(ZsPageIsolate(first_page), first_page);
+	VM_BUG_ON_PAGE(check_isolated_page(first_page), first_page);
 
 	fullness = get_fullness_group(class, first_page);
 	insert_zspage(class, fullness, first_page);
@@ -1954,6 +2007,12 @@ static struct page *isolate_source_page(struct size_class *class)
 		if (!page)
 			continue;
 
+		/* To prevent race between object and page migration */
+		if (!trylock_zspage(page, NULL)) {
+			page = NULL;
+			continue;
+		}
+
 		remove_zspage(class, i, page);
 
 		inuse = get_zspage_inuse(page);
@@ -1962,6 +2021,7 @@ static struct page *isolate_source_page(struct size_class *class)
 		if (inuse != freezed) {
 			unfreeze_zspage(class, page, freezed);
 			putback_zspage(class, page);
+			unlock_zspage(page, NULL);
 			page = NULL;
 			continue;
 		}
@@ -1993,6 +2053,12 @@ static struct page *isolate_target_page(struct size_class *class)
 		if (!page)
 			continue;
 
+		/* To prevent race between object and page migration */
+		if (!trylock_zspage(page, NULL)) {
+			page = NULL;
+			continue;
+		}
+
 		remove_zspage(class, i, page);
 
 		inuse = get_zspage_inuse(page);
@@ -2001,6 +2067,7 @@ static struct page *isolate_target_page(struct size_class *class)
 		if (inuse != freezed) {
 			unfreeze_zspage(class, page, freezed);
 			putback_zspage(class, page);
+			unlock_zspage(page, NULL);
 			page = NULL;
 			continue;
 		}
@@ -2074,11 +2141,13 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 			putback_zspage(class, dst_page);
 			unfreeze_zspage(class, dst_page,
 				class->objs_per_zspage);
+			unlock_zspage(dst_page, NULL);
 			spin_unlock(&class->lock);
 			dst_page = NULL;
 		}
 
 		if (zspage_empty(class, src_page)) {
+			unlock_zspage(src_page, NULL);
 			free_zspage(pool, src_page);
 			spin_lock(&class->lock);
 			zs_stat_dec(class, OBJ_ALLOCATED,
@@ -2101,12 +2170,14 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 		putback_zspage(class, src_page);
 		unfreeze_zspage(class, src_page,
 				class->objs_per_zspage);
+		unlock_zspage(src_page, NULL);
 	}
 
 	if (dst_page) {
 		putback_zspage(class, dst_page);
 		unfreeze_zspage(class, dst_page,
 				class->objs_per_zspage);
+		unlock_zspage(dst_page, NULL);
 	}
 
 	spin_unlock(&class->lock);
@@ -2209,10 +2280,11 @@ bool zs_page_isolate(struct page *page, isolate_mode_t mode)
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(PageIsolated(page), page);
 	/*
-	 * In this implementation, it allows only first page migration.
+	 * first_page will not be destroyed while we hold PG_lock of @page,
+	 * but it could be migrated out. To prevent that, zs_page_migrate
+	 * calls trylock_zspage, which closes the race.
 	 */
-	VM_BUG_ON_PAGE(!is_first_page(page), page);
-	first_page = page;
+	first_page = get_first_page(page);
 
 	/*
 	 * Without class lock, fullness is meaningless while constant
@@ -2226,9 +2298,18 @@ bool zs_page_isolate(struct page *page, isolate_mode_t mode)
 	if (!spin_trylock(&class->lock))
 		return false;
 
+	if (check_isolated_page(first_page))
+		goto skip_isolate;
+
+	/*
+	 * If this is the first isolation of the zspage, remove the zspage
+	 * from its size_class to prevent further allocations from it.
+	 */
 	get_zspage_mapping(first_page, &class_idx, &fullness);
 	remove_zspage(class, fullness, first_page);
 	SetZsPageIsolate(first_page);
+
+skip_isolate:
 	SetPageIsolated(page);
 	spin_unlock(&class->lock);
 
@@ -2251,7 +2332,7 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-	first_page = page;
+	first_page = get_first_page(page);
 	get_zspage_mapping(first_page, &class_idx, &fullness);
 	pool = page->mapping->private_data;
 	class = pool->size_class[class_idx];
@@ -2266,6 +2347,13 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	if (get_zspage_inuse(first_page) == 0)
 		goto out_class_unlock;
 
+	/*
+	 * Prevent first_page migration during the tail page operation so
+	 * that get_first_page remains stable.
+	 */
+	if (!trylock_zspage(first_page, page))
+		goto out_class_unlock;
+
 	freezed = freeze_zspage(class, first_page);
 	if (freezed != get_zspage_inuse(first_page))
 		goto out_unfreeze;
@@ -2304,21 +2392,26 @@ int zs_page_migrate(struct address_space *mapping, struct page *newpage,
 	kunmap_atomic(addr);
 
 	replace_sub_page(class, first_page, newpage, page);
-	first_page = newpage;
+	first_page = get_first_page(newpage);
 	get_page(newpage);
 	VM_BUG_ON_PAGE(get_fullness_group(class, first_page) ==
 			ZS_EMPTY, first_page);
-	ClearZsPageIsolate(first_page);
-	putback_zspage(class, first_page);
+	if (!check_isolated_page(first_page)) {
+		INIT_LIST_HEAD(&first_page->lru);
+		ClearZsPageIsolate(first_page);
+		putback_zspage(class, first_page);
+	}
+
 
 	/* Migration complete. Free old page */
 	ClearPageIsolated(page);
 	reset_page(page);
 	put_page(page);
 	ret = MIGRATEPAGE_SUCCESS;
-
+	page = newpage;
 out_unfreeze:
 	unfreeze_zspage(class, first_page, freezed);
+	unlock_zspage(first_page, page);
 out_class_unlock:
 	spin_unlock(&class->lock);
 
@@ -2336,7 +2429,7 @@ void zs_page_putback(struct page *page)
 	VM_BUG_ON_PAGE(!PageMovable(page), page);
 	VM_BUG_ON_PAGE(!PageIsolated(page), page);
 
-	first_page = page;
+	first_page = get_first_page(page);
 	get_zspage_mapping(first_page, &class_idx, &fullness);
 	pool = page->mapping->private_data;
 	class = pool->size_class[class_idx];
@@ -2346,11 +2439,17 @@ void zs_page_putback(struct page *page)
 	 * in zs_free will wait the page lock of @page without
 	 * destroying of zspage.
 	 */
-	INIT_LIST_HEAD(&first_page->lru);
 	spin_lock(&class->lock);
 	ClearPageIsolated(page);
-	ClearZsPageIsolate(first_page);
-	putback_zspage(class, first_page);
+	/*
+	 * Put the zspage back to the right list if this is the last
+	 * isolated page being put back in the zspage.
+	 */
+	if (!check_isolated_page(first_page)) {
+		INIT_LIST_HEAD(&first_page->lru);
+		ClearZsPageIsolate(first_page);
+		putback_zspage(class, first_page);
+	}
 	spin_unlock(&class->lock);
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* [PATCH v3 16/16] zram: use __GFP_MOVABLE for memory allocation
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (14 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 15/16] zsmalloc: migrate tail pages in zspage Minchan Kim
@ 2016-03-30  7:12 ` Minchan Kim
  2016-03-30 23:11 ` [PATCH v3 00/16] Support non-lru page migration Andrew Morton
  2016-04-04 13:17 ` John Einar Reitan
  17 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-03-30  7:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Minchan Kim

Zsmalloc is ready for page migration so zram can use __GFP_MOVABLE
from now on.

I ran a test to see how it helps to create higher-order pages.
The test scenario is as follows.

KVM guest, 1G memory, ext4-formatted zram block device:

for i in `seq 1 8`;
do
        dd if=/dev/vda1 of=mnt/test$i.txt bs=128M count=1 &
done

wait `pidof dd`

for i in `seq 1 2 8`;
do
        rm -rf mnt/test$i.txt
done
fstrim -v mnt

echo "init"
cat /proc/buddyinfo

echo "compaction"
echo 1 > /proc/sys/vm/compact_memory
cat /proc/buddyinfo

old:

init
Node 0, zone      DMA    208    120     51     41     11      0      0      0      0      0      0
Node 0, zone    DMA32  16380  13777   9184   3805    789     54      3      0      0      0      0
compaction
Node 0, zone      DMA    132     82     40     39     16      2      1      0      0      0      0
Node 0, zone    DMA32   5219   5526   4969   3455   1831    677    139     15      0      0      0

new:

init
Node 0, zone      DMA    379    115     97     19      2      0      0      0      0      0      0
Node 0, zone    DMA32  18891  16774  10862   3947    637     21      0      0      0      0      0
compaction  1
Node 0, zone      DMA    214     66     87     29     10      3      0      0      0      0      0
Node 0, zone    DMA32   1612   3139   3154   2469   1745    990    384     94      7      0      0

As you can see, compaction made so many high-order pages. Yay!

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 3 ++-
 mm/zsmalloc.c                 | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 370c2f76016d..10f6ff1cf6a0 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -514,7 +514,8 @@ static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize)
 		goto out_error;
 	}
 
-	meta->mem_pool = zs_create_pool(pool_name, GFP_NOIO | __GFP_HIGHMEM);
+	meta->mem_pool = zs_create_pool(pool_name, GFP_NOIO|__GFP_HIGHMEM
+						|__GFP_MOVABLE);
 	if (!meta->mem_pool) {
 		pr_err("Error creating memory pool\n");
 		goto out_error;
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index e24f4a160892..4b1ccb68f2ee 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -308,7 +308,7 @@ static void destroy_handle_cache(struct zs_pool *pool)
 static unsigned long alloc_handle(struct zs_pool *pool)
 {
 	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-		pool->flags & ~__GFP_HIGHMEM);
+		pool->flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
 }
 
 static void free_handle(struct zs_pool *pool, unsigned long handle)
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 00/16] Support non-lru page migration
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (15 preceding siblings ...)
  2016-03-30  7:12 ` [PATCH v3 16/16] zram: use __GFP_MOVABLE for memory allocation Minchan Kim
@ 2016-03-30 23:11 ` Andrew Morton
  2016-03-31  0:29   ` Sergey Senozhatsky
  2016-03-31  0:57   ` Minchan Kim
  2016-04-04 13:17 ` John Einar Reitan
  17 siblings, 2 replies; 65+ messages in thread
From: Andrew Morton @ 2016-03-30 23:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On Wed, 30 Mar 2016 16:11:59 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Recently, I got many reports about perfermance degradation
> in embedded system(Android mobile phone, webOS TV and so on)
> and failed to fork easily.
> 
> The problem was fragmentation caused by zram and GPU driver
> pages. Their pages cannot be migrated so compaction cannot
> work well, either so reclaimer ends up shrinking all of working
> set pages. It made system very slow and even to fail to fork
> easily.
> 
> Other pain point is that they cannot work with CMA.
> Most of CMA memory space could be idle(ie, it could be used
> for movable pages unless driver is using) but if driver(i.e.,
> zram) cannot migrate his page, that memory space could be
> wasted. In our product which has big CMA memory, it reclaims
> zones too exccessively although there are lots of free space
> in CMA so system was very slow easily.
> 
> To solve these problem, this patch try to add facility to
> migrate non-lru pages via introducing new friend functions
> of migratepage in address_space_operation and new page flags.
> 
> 	(isolate_page, putback_page)
> 	(PG_movable, PG_isolated)
> 
> For details, please read description in
> "mm/compaction: support non-lru movable page migration".

OK, I grabbed all these.

I wonder about testing coverage during the -next period.  How many
people are likely to exercise these code paths in a serious way before
it all hits mainline?

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 00/16] Support non-lru page migration
  2016-03-30 23:11 ` [PATCH v3 00/16] Support non-lru page migration Andrew Morton
@ 2016-03-31  0:29   ` Sergey Senozhatsky
  2016-03-31  0:57     ` Minchan Kim
  2016-03-31  0:57   ` Minchan Kim
  1 sibling, 1 reply; 65+ messages in thread
From: Sergey Senozhatsky @ 2016-03-31  0:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

On (03/30/16 16:11), Andrew Morton wrote:
[..]
> > For details, please read description in
> > "mm/compaction: support non-lru movable page migration".
> 
> OK, I grabbed all these.
> 
> I wonder about testing coverage during the -next period.  How many
> people are likely to exercise these code paths in a serious way before
> it all hits mainline?

I'm hammering the zsmalloc part.

	-ss

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 00/16] Support non-lru page migration
  2016-03-30 23:11 ` [PATCH v3 00/16] Support non-lru page migration Andrew Morton
  2016-03-31  0:29   ` Sergey Senozhatsky
@ 2016-03-31  0:57   ` Minchan Kim
  1 sibling, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-03-31  0:57 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On Wed, Mar 30, 2016 at 04:11:41PM -0700, Andrew Morton wrote:
> On Wed, 30 Mar 2016 16:11:59 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Recently, I got many reports about perfermance degradation
> > in embedded system(Android mobile phone, webOS TV and so on)
> > and failed to fork easily.
> > 
> > The problem was fragmentation caused by zram and GPU driver
> > pages. Their pages cannot be migrated so compaction cannot
> > work well, either so reclaimer ends up shrinking all of working
> > set pages. It made system very slow and even to fail to fork
> > easily.
> > 
> > Other pain point is that they cannot work with CMA.
> > Most of CMA memory space could be idle(ie, it could be used
> > for movable pages unless driver is using) but if driver(i.e.,
> > zram) cannot migrate his page, that memory space could be
> > wasted. In our product which has big CMA memory, it reclaims
> > zones too exccessively although there are lots of free space
> > in CMA so system was very slow easily.
> > 
> > To solve these problem, this patch try to add facility to
> > migrate non-lru pages via introducing new friend functions
> > of migratepage in address_space_operation and new page flags.
> > 
> > 	(isolate_page, putback_page)
> > 	(PG_movable, PG_isolated)
> > 
> > For details, please read description in
> > "mm/compaction: support non-lru movable page migration".
> 
> OK, I grabbed all these.
> 
> I wonder about testing coverage during the -next period.  How many
> people are likely to exercise these code paths in a serious way before
> it all hits mainline?

I asked the production team in my company to stress-test this
patchset. They always catch zram/zsmalloc bugs I have missed, so
I hope they will help me here, too.

As for the ballooning part, I hope Rafael Aquini gets time to review
and test it.

Other than that, IOW, linux-next will have enough time to
test the common migration part modifications, I guess. :)

Thanks.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 00/16] Support non-lru page migration
  2016-03-31  0:29   ` Sergey Senozhatsky
@ 2016-03-31  0:57     ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-03-31  0:57 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

On Thu, Mar 31, 2016 at 09:29:32AM +0900, Sergey Senozhatsky wrote:
> On (03/30/16 16:11), Andrew Morton wrote:
> [..]
> > > For details, please read description in
> > > "mm/compaction: support non-lru movable page migration".
> > 
> > OK, I grabbed all these.
> > 
> > I wonder about testing coverage during the -next period.  How many
> > people are likely to exercise these code paths in a serious way before
> > it all hits mainline?
> 
> I'm hammering the zsmalloc part.

Thanks, Sergey!

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-03-30  7:12 ` [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page Minchan Kim
@ 2016-04-01 12:58   ` Vlastimil Babka
  2016-04-04  1:39     ` Minchan Kim
  2016-04-04  5:53   ` Balbir Singh
  1 sibling, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2016-04-01 12:58 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Joonsoo Kim, koct9i,
	aquini, virtualization, Mel Gorman, Hugh Dickins,
	Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim, Sangseok Lee,
	Chan Gyun Jeong, Al Viro, YiPing Xu, Naoya Horiguchi

On 03/30/2016 09:12 AM, Minchan Kim wrote:
> Procedure of page migration is as follows:
>
> First of all, it should isolate a page from LRU and try to
> migrate the page. If it is successful, it releases the page
> for freeing. Otherwise, it should put the page back to LRU
> list.
>
> For LRU pages, we have used putback_lru_page for both freeing
> and putback to LRU list. It's okay because put_page is aware of
> LRU list so if it releases last refcount of the page, it removes
> the page from LRU list. However, It makes unnecessary operations
> (e.g., lru_cache_add, pagevec and flags operations. It would be
> not significant but no worth to do) and harder to support new
> non-lru page migration because put_page isn't aware of non-lru
> page's data structure.
>
> To solve the problem, we can add new hook in put_page with
> PageMovable flags check but it can increase overhead in
> hot path and needs new locking scheme to stabilize the flag check
> with put_page.
>
> So, this patch cleans it up to divide two semantic(ie, put and putback).
> If migration is successful, use put_page instead of putback_lru_page and
> use putback_lru_page only on failure. That makes code more readable
> and doesn't add overhead in put_page.
>
> Comment from Vlastimil
> "Yeah, and compaction (perhaps also other migration users) has to drain
> the lru pvec... Getting rid of this stuff is worth even by itself."
>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

[...]

> @@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>   		list_del(&page->lru);
>   		dec_zone_page_state(page, NR_ISOLATED_ANON +
>   				page_is_file_cache(page));
> -		/* Soft-offlined page shouldn't go through lru cache list */
> +	}
> +
> +	/*
> +	 * If migration is successful, drop the reference grabbed during
> +	 * isolation. Otherwise, restore the page to LRU list unless we
> +	 * want to retry.
> +	 */
> +	if (rc == MIGRATEPAGE_SUCCESS) {
> +		put_page(page);
>   		if (reason == MR_MEMORY_FAILURE) {
> -			put_page(page);
>   			if (!test_set_page_hwpoison(page))
>   				num_poisoned_pages_inc();
> -		} else
> +		}

Hmm, I didn't notice it previously, or it's due to rebasing, but it seems that 
you restricted the memory failure handling (i.e. setting hwpoison) to 
MIGRATE_SUCCESS, while previously it was done for all non-EAGAIN results. I 
think that goes against the intention of hwpoison, which is IIRC to catch and 
kill the poor process that still uses the page?

Also (but not your fault), the put_page() preceding test_set_page_hwpoison(page)
IMHO deserves a comment saying which pin we are releasing and which one we still
have (hopefully? if I read the description of da1b13ccfbebe right); otherwise it
looks like we are doing something with a page that we just potentially freed.

> +	} else {
> +		if (rc != -EAGAIN)
>   			putback_lru_page(page);
> +		if (put_new_page)
> +			put_new_page(newpage, private);
> +		else
> +			put_page(newpage);
>   	}
>
> -	/*
> -	 * If migration was not successful and there's a freeing callback, use
> -	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
> -	 * during isolation.
> -	 */
> -	if (put_new_page)
> -		put_new_page(newpage, private);
> -	else if (unlikely(__is_movable_balloon_page(newpage))) {
> -		/* drop our reference, page already in the balloon */
> -		put_page(newpage);
> -	} else
> -		putback_lru_page(newpage);
> -
>   	if (result) {
>   		if (rc)
>   			*result = rc;
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 03/16] mm: add non-lru movable page support document
  2016-03-30  7:12 ` [PATCH v3 03/16] mm: add non-lru movable page support document Minchan Kim
@ 2016-04-01 14:38   ` Vlastimil Babka
  2016-04-04  2:25     ` Minchan Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2016-04-01 14:38 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Joonsoo Kim, koct9i,
	aquini, virtualization, Mel Gorman, Hugh Dickins,
	Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim, Sangseok Lee,
	Chan Gyun Jeong, Al Viro, YiPing Xu, Jonathan Corbet

On 03/30/2016 09:12 AM, Minchan Kim wrote:
> This patch describes what a subsystem should do for non-lru movable
> page supporting.

Intentionally reading this first without studying the code to better catch 
things that would seem obvious otherwise.

> Cc: Jonathan Corbet <corbet@lwn.net>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   Documentation/filesystems/vfs.txt | 11 ++++++-
>   Documentation/vm/page_migration   | 69 ++++++++++++++++++++++++++++++++++++++-
>   2 files changed, 78 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index 4c1b6c3b4bc8..d63142f8ed7b 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -752,12 +752,21 @@ struct address_space_operations {
>           and transfer data directly between the storage and the
>           application's address space.
>
> +  isolate_page: Called by the VM when isolating a movable non-lru page.
> +	If page is successfully isolated, we should mark the page as
> +	PG_isolated via __SetPageIsolated.

Patch 02 changelog suggests SetPageIsolated, so this is confusing. I guess the 
main point is that there might be parallel attempts and only one is allowed to 
succeed, right? Whether it's done by atomic ops or otherwise doesn't matter to 
e.g. compaction.

>     migrate_page:  This is used to compact the physical memory usage.
>           If the VM wants to relocate a page (maybe off a memory card
>           that is signalling imminent failure) it will pass a new page
>   	and an old page to this function.  migrate_page should
>   	transfer any private data across and update any references
> -        that it has to the page.
> +	that it has to the page. If migrated page is non-lru page,
> +	we should clear PG_isolated and PG_movable via __ClearPageIsolated
> +	and __ClearPageMovable.

Similar concern as __SetPageIsolated.

> +
> +  putback_page: Called by the VM when isolated page's migration fails.
> +	We should clear PG_isolated marked in isolated_page function.

Note this kind of wording is less confusing and could be used above wrt my concerns.

>
>     launder_page: Called before freeing a page - it writes back the dirty page. To
>     	prevent redirtying the page, it is kept locked during the whole
> diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
> index fea5c0864170..c4e7551a414e 100644
> --- a/Documentation/vm/page_migration
> +++ b/Documentation/vm/page_migration
> @@ -142,5 +142,72 @@ is increased so that the page cannot be freed while page migration occurs.
>   20. The new page is moved to the LRU and can be scanned by the swapper
>       etc again.
>
> -Christoph Lameter, May 8, 2006.
> +C. Non-LRU Page migration
> +-------------------------
> +
> +Although original migration aimed for reducing the latency of memory access
> +for NUMA, compaction who want to create high-order page is also main customer.
> +
> +Ppage migration's disadvantage is that it was designed to migrate only
> +*LRU* pages. However, there are potential non-lru movable pages which can be
> +migrated in system, for example, zsmalloc, virtio-balloon pages.
> +For virtio-balloon pages, some parts of migration code path was hooked up
> +and added virtio-balloon specific functions to intercept logi.

logi -> logic?

> +It's too specific to one subsystem so other subsystem who want to make
> +their pages movable should add own specific hooks in migration path.

s/should/would have to/ I guess?

> +To solve such problem, VM supports non-LRU page migration which provides
> +generic functions for non-LRU movable pages without needing subsystem
> +specific hook in mm/{migrate|compact}.c.
> +
> +If a subsystem want to make own pages movable, it should mark pages as
> +PG_movable via __SetPageMovable. __SetPageMovable needs address_space for
> +argument for register functions which will be called by VM.
> +
> +Three functions in address_space_operation related to non-lru movable page:
> +
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
> +	int (*migratepage) (struct address_space *,
> +		struct page *, struct page *, enum migrate_mode);
> +	void (*putback_page)(struct page *);
> +
> +1. Isolation
> +
> +What VM expected on isolate_page of subsystem is to set PG_isolated flags
> +of the page if it was successful. With that, concurrent isolation among
> +CPUs skips the isolated page by other CPU earlier. VM calls isolate_page
> +under PG_lock of page. If a subsystem cannot isolate the page, it should
> +return false.

Ah, I see, so it's designed with page lock to handle the concurrent isolations etc.

In http://marc.info/?l=linux-mm&m=143816716511904&w=2 Mel has warned about doing 
this in general under page_lock and suggested that each user handles concurrent 
calls to isolate_page() internally. Might be more generic that way, even if all 
current implementers will actually use the page lock.

Also it's worth reading that mail in full and incorporating here, as there are 
more concerns related to concurrency that should be documented, e.g. with pages 
that can be mapped to userspace. Not a case with zram and balloon pages I guess, 
but one of Gioh's original use cases was a driver which IIRC could map pages. So 
the design and documentation should keep that in mind.

> +2. Migration
> +
> +After successful isolation, VM calls migratepage. The migratepage's goal is
> +to move content of the old page to new page and set up struct page fields
> +of new page. If migration is successful, subsystem should release old page's
> +refcount to free. Keep in mind that subsystem should clear PG_movable and
> +PG_isolated before releasing the refcount.  If everything are done, user
> +should return MIGRATEPAGE_SUCCESS. If subsystem cannot migrate the page
> +at the moment, migratepage can return -EAGAIN. On -EAGAIN, VM will retry page
> +migration because VM interprets -EAGAIN as "temporal migration failure".
> +
> +3. Putback
> +
> +If migration was unsuccessful, VM calls putback_page. The subsystem should
> +insert isolated page to own data structure again if it has. And subsystem
> +should clear PG_isolated which was marked in isolation step.
> +
> +Note about releasing page:
> +
> +Subsystem can release pages whenever it want but if it releses the page
> +which is already isolated, it should clear PG_isolated but doesn't touch
> +PG_movable under PG_lock. Instead of it, VM will clear PG_movable after
> +his job done. Otherweise, subsystem should clear both page flags before
> +releasing the page.

I don't understand this right now. But maybe I will get it after reading the 
patches and suggest some improved wording here.

> +
> +Note about PG_isolated:
> +
> +PG_isolated check on a page is valid only if the page's flag is already
> +set to PG_movable.

But it's not possible to check both atomically, so I guess it implies checking 
under page lock? If that's true, should be explicit.

Thanks!

> +Christoph Lameter, May 8, 2006.
> +Minchan Kim, Mar 28, 2016.
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
  2016-03-30  7:12 ` [PATCH v3 02/16] mm/compaction: support non-lru movable page migration Minchan Kim
@ 2016-04-01 21:29   ` Vlastimil Babka
  2016-04-04  5:12     ` Minchan Kim
  2016-04-12  8:00   ` Chulmin Kim
  1 sibling, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2016-04-01 21:29 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Joonsoo Kim, koct9i,
	aquini, virtualization, Mel Gorman, Hugh Dickins,
	Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim, Sangseok Lee,
	Chan Gyun Jeong, Al Viro, YiPing Xu, dri-devel, Gioh Kim

Might have been better as a separate migration patch and then a compaction 
patch. It's prefixed mm/compaction, but most changes are in mm/migrate.c

On 03/30/2016 09:12 AM, Minchan Kim wrote:
> We have allowed migration for only LRU pages until now and it was
> enough to make high-order pages. But recently, embedded system(e.g.,
> webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
> so we have seen several reports about troubles of small high-order
> allocation. For fixing the problem, there were several efforts
> (e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
> reserved memory, vmalloc and so on) but if there are lots of
> non-movable pages in system, their solutions are void in the long run.
>
> So, this patch is to support facility to change non-movable pages
> with movable. For the feature, this patch introduces functions related
> to migration to address_space_operations as well as some page flags.
>
> Basically, this patch supports two page-flags and two functions related
> to page migration. The flag and page->mapping stability are protected
> by PG_lock.
>
> 	PG_movable
> 	PG_isolated
>
> 	bool (*isolate_page) (struct page *, isolate_mode_t);
> 	void (*putback_page) (struct page *);
>
> Duty of subsystem want to make their pages as migratable are
> as follows:
>
> 1. It should register address_space to page->mapping then mark
> the page as PG_movable via __SetPageMovable.
>
> 2. It should mark the page as PG_isolated via SetPageIsolated
> if isolation is sucessful and return true.

Ah another thing to document (especially in the comments/Doc) is that the 
subsystem must not expect anything to survive in page.lru (or fields that union 
it) after having isolated successfully.

> 3. If migration is successful, it should clear PG_isolated and
> PG_movable of the page for free preparation then release the
> reference of the page to free.
>
> 4. If migration fails, putback function of subsystem should
> clear PG_isolated via ClearPageIsolated.
>
> 5. If a subsystem want to release isolated page, it should
> clear PG_isolated but not PG_movable. Instead, VM will do it.

Under lock? Or just with ClearPageIsolated?

> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   Documentation/filesystems/Locking      |   4 +
>   Documentation/filesystems/vfs.txt      |   5 +
>   fs/proc/page.c                         |   3 +
>   include/linux/fs.h                     |   2 +
>   include/linux/migrate.h                |   2 +
>   include/linux/page-flags.h             |  31 ++++++
>   include/uapi/linux/kernel-page-flags.h |   1 +
>   mm/compaction.c                        |  14 ++-
>   mm/migrate.c                           | 174 +++++++++++++++++++++++++++++----
>   9 files changed, 217 insertions(+), 19 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 619af9bfdcb3..0bb79560abb3 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -195,7 +195,9 @@ unlocks and drops the reference.
>   	int (*releasepage) (struct page *, int);
>   	void (*freepage)(struct page *);
>   	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
>   	int (*migratepage)(struct address_space *, struct page *, struct page *);
> +	void (*putback_page) (struct page *);
>   	int (*launder_page)(struct page *);
>   	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
>   	int (*error_remove_page)(struct address_space *, struct page *);
> @@ -219,7 +221,9 @@ invalidatepage:		yes
>   releasepage:		yes
>   freepage:		yes
>   direct_IO:
> +isolate_page:		yes
>   migratepage:		yes (both)
> +putback_page:		yes
>   launder_page:		yes
>   is_partially_uptodate:	yes
>   error_remove_page:	yes
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index b02a7d598258..4c1b6c3b4bc8 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -592,9 +592,14 @@ struct address_space_operations {
>   	int (*releasepage) (struct page *, int);
>   	void (*freepage)(struct page *);
>   	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> +	/* isolate a page for migration */
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
>   	/* migrate the contents of a page to the specified target */
>   	int (*migratepage) (struct page *, struct page *);
> +	/* put the page back to right list */

... "after a failed migration" ?

> +	void (*putback_page) (struct page *);
>   	int (*launder_page) (struct page *);
> +
>   	int (*is_partially_uptodate) (struct page *, unsigned long,
>   					unsigned long);
>   	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 3ecd445e830d..ce3d08a4ad8d 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
>   	if (page_is_idle(page))
>   		u |= 1 << KPF_IDLE;
>
> +	if (PageMovable(page))
> +		u |= 1 << KPF_MOVABLE;
> +
>   	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
>
>   	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index da9e67d937e5..36f2d610e7a8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -401,6 +401,8 @@ struct address_space_operations {
>   	 */
>   	int (*migratepage) (struct address_space *,
>   			struct page *, struct page *, enum migrate_mode);
> +	bool (*isolate_page)(struct page *, isolate_mode_t);
> +	void (*putback_page)(struct page *);
>   	int (*launder_page) (struct page *);
>   	int (*is_partially_uptodate) (struct page *, unsigned long,
>   					unsigned long);
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 9b50325e4ddf..404fbfefeb33 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
>   			struct page *, struct page *, enum migrate_mode);
>   extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
>   		unsigned long private, enum migrate_mode mode, int reason);
> +extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> +extern void putback_movable_page(struct page *page);
>
>   extern int migrate_prep(void);
>   extern int migrate_prep_local(void);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f4ed4f1b0c77..77ebf8fdbc6e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -129,6 +129,10 @@ enum pageflags {
>
>   	/* Compound pages. Stored in first tail page's flags */
>   	PG_double_map = PG_private_2,
> +
> +	/* non-lru movable pages */
> +	PG_movable = PG_reclaim,
> +	PG_isolated = PG_owner_priv_1,

Documentation should probably state that these fields alias and subsystem 
supporting the movable pages shouldn't use them elsewhere.

Also I'm a bit uncomfortable how isolate_movable_page() blindly expects that
page->mapping->a_ops->isolate_page exists for PageMovable() pages. What if it's 
a false positive on a PG_reclaim page? Can we rely on PG_reclaim always (and 
without races) implying PageLRU() so that we don't even attempt 
isolate_movable_page()?

>   };
>
>   #ifndef __GENERATING_BOUNDS_H
> @@ -614,6 +618,33 @@ static inline void __ClearPageBalloon(struct page *page)
>   	atomic_set(&page->_mapcount, -1);
>   }
>
> +#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)

IIRC this was what Gioh's previous attempts used instead of PG_movable? Is it 
still needed? Doesn't it prevent a driver providing movable *and* mapped pages?
If it's to distinguish the PG_reclaim alias that I mention above, it seems like 
an overkill to me. Why would we need both a special mapcount value and a flag? 
Checking that page->mapping->a_ops->isolate_page exists before calling it should 
be enough to resolve the ambiguity?

> +
> +static inline int PageMovable(struct page *page)
> +{
> +	return ((test_bit(PG_movable, &(page)->flags) &&
> +		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> +		|| PageBalloon(page));
> +}
> +
> +/* Caller should hold a PG_lock */
> +static inline void __SetPageMovable(struct page *page,
> +				struct address_space *mapping)
> +{
> +	page->mapping = mapping;
> +	__set_bit(PG_movable, &page->flags);
> +	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> +}
> +
> +static inline void __ClearPageMovable(struct page *page)
> +{
> +	atomic_set(&page->_mapcount, -1);
> +	__clear_bit(PG_movable, &(page)->flags);
> +	page->mapping = NULL;
> +}
> +
> +PAGEFLAG(Isolated, isolated, PF_ANY);
> +
>   /*
>    * If network-based swap is enabled, sl*b must keep track of whether pages
>    * were allocated from pfmemalloc reserves.
> diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> index 5da5f8751ce7..a184fd2434fa 100644
> --- a/include/uapi/linux/kernel-page-flags.h
> +++ b/include/uapi/linux/kernel-page-flags.h
> @@ -34,6 +34,7 @@
>   #define KPF_BALLOON		23
>   #define KPF_ZERO_PAGE		24
>   #define KPF_IDLE		25
> +#define KPF_MOVABLE		26
>
>
>   #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index ccf97b02b85f..7557aedddaee 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>
>   		/*
>   		 * Check may be lockless but that's ok as we recheck later.
> -		 * It's possible to migrate LRU pages and balloon pages
> +		 * It's possible to migrate LRU and movable kernel pages.
>   		 * Skip any other type of page
>   		 */
>   		is_lru = PageLRU(page);
> @@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   					goto isolate_success;
>   				}
>   			}
> +
> +			if (unlikely(PageMovable(page)) &&
> +					!PageIsolated(page)) {
> +				if (locked) {
> +					spin_unlock_irqrestore(&zone->lru_lock,
> +									flags);
> +					locked = false;
> +				}
> +
> +				if (isolate_movable_page(page, isolate_mode))
> +					goto isolate_success;
> +			}
>   		}
>
>   		/*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 53529c805752..b56bf2b3fe8c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -73,6 +73,85 @@ int migrate_prep_local(void)
>   	return 0;
>   }
>
> +bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> +{
> +	bool ret = false;

Maintaining "ret" seems useless here. All the "goto out*" statements are 
executed only when ret is false, and ret == true is returned by a different return.
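
Something like this (a rough, compile-untested sketch of the simplification I
mean, not a replacement patch) would do:

bool isolate_movable_page(struct page *page, isolate_mode_t mode)
{
        if (unlikely(!get_page_unless_zero(page)))
                return false;

        /* lockless check first; re-checked under the page lock below */
        if (unlikely(!PageMovable(page)))
                goto out_putpage;

        if (unlikely(!trylock_page(page)))
                goto out_putpage;

        if (!PageMovable(page) || PageIsolated(page))
                goto out_unlock;

        if (!page->mapping->a_ops->isolate_page(page, mode))
                goto out_unlock;

        WARN_ON_ONCE(!PageIsolated(page));
        unlock_page(page);
        return true;

out_unlock:
        unlock_page(page);
out_putpage:
        put_page(page);
        return false;
}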

> +
> +	/*
> +	 * Avoid burning cycles with pages that are yet under __free_pages(),
> +	 * or just got freed under us.
> +	 *
> +	 * In case we 'win' a race for a movable page being freed under us and
> +	 * raise its refcount preventing __free_pages() from doing its job
> +	 * the put_page() at the end of this block will take care of
> +	 * release this page, thus avoiding a nasty leakage.
> +	 */
> +	if (unlikely(!get_page_unless_zero(page)))
> +		goto out;
> +
> +	/*
> +	 * Check PG_movable before holding a PG_lock because page's owner
> +	 * assumes anybody doesn't touch PG_lock of newly allocated page.
> +	 */
> +	if (unlikely(!PageMovable(page)))
> +		goto out_putpage;
> +	/*
> +	 * As movable pages are not isolated from LRU lists, concurrent
> +	 * compaction threads can race against page migration functions
> +	 * as well as race against the releasing a page.
> +	 *
> +	 * In order to avoid having an already isolated movable page
> +	 * being (wrongly) re-isolated while it is under migration,
> +	 * or to avoid attempting to isolate pages being released,
> +	 * lets be sure we have the page lock
> +	 * before proceeding with the movable page isolation steps.
> +	 */
> +	if (unlikely(!trylock_page(page)))
> +		goto out_putpage;
> +
> +	if (!PageMovable(page) || PageIsolated(page))
> +		goto out_no_isolated;
> +
> +	ret = page->mapping->a_ops->isolate_page(page, mode);
> +	if (!ret)
> +		goto out_no_isolated;
> +
> +	WARN_ON_ONCE(!PageIsolated(page));
> +	unlock_page(page);
> +	return ret;
> +
> +out_no_isolated:
> +	unlock_page(page);
> +out_putpage:
> +	put_page(page);
> +out:
> +	return ret;
> +}
> +
> +/* It should be called on page which is PG_movable */
> +void putback_movable_page(struct page *page)
> +{
> +	/*
> +	 * 'lock_page()' stabilizes the page and prevents races against
> +	 * concurrent isolation threads attempting to re-isolate it.
> +	 */
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +
> +	lock_page(page);
> +	if (PageIsolated(page)) {
> +		struct address_space *mapping;
> +
> +		mapping = page_mapping(page);
> +		mapping->a_ops->putback_page(page);
> +		WARN_ON_ONCE(PageIsolated(page));
> +	} else {
> +		__ClearPageMovable(page);
> +	}
> +	unlock_page(page);
> +	/* drop the extra ref count taken for movable page isolation */
> +	put_page(page);
> +}
> +
>   /*
>    * Put previously isolated pages back onto the appropriate lists
>    * from where they were once taken off for compaction/migration.
> @@ -94,10 +173,18 @@ void putback_movable_pages(struct list_head *l)
>   		list_del(&page->lru);
>   		dec_zone_page_state(page, NR_ISOLATED_ANON +
>   				page_is_file_cache(page));
> -		if (unlikely(isolated_balloon_page(page)))
> +		if (unlikely(isolated_balloon_page(page))) {
>   			balloon_page_putback(page);
> -		else
> +		} else if (unlikely(PageMovable(page))) {
> +			if (PageIsolated(page)) {
> +				putback_movable_page(page);
> +			} else {
> +				__ClearPageMovable(page);

We don't do lock_page() here, so what prevents parallel compaction isolating the 
same page?

> +				put_page(page);
> +			}
> +		} else {
>   			putback_lru_page(page);
> +		}
>   	}
>   }
>
> @@ -592,7 +679,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
>    ***********************************************************/
>
>   /*
> - * Common logic to directly migrate a single page suitable for
> + * Common logic to directly migrate a single LRU page suitable for
>    * pages that do not use PagePrivate/PagePrivate2.
>    *
>    * Pages are locked upon entry and exit.
> @@ -755,24 +842,54 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>   				enum migrate_mode mode)
>   {
>   	struct address_space *mapping;
> -	int rc;
> +	int rc = -EAGAIN;
> +	bool lru_movable = true;
>
>   	VM_BUG_ON_PAGE(!PageLocked(page), page);
>   	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
>
>   	mapping = page_mapping(page);
> -	if (!mapping)
> -		rc = migrate_page(mapping, newpage, page, mode);
> -	else if (mapping->a_ops->migratepage)
> -		/*
> -		 * Most pages have a mapping and most filesystems provide a
> -		 * migratepage callback. Anonymous pages are part of swap
> -		 * space which also has its own migratepage callback. This
> -		 * is the most common path for page migration.
> -		 */
> -		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
> -	else
> -		rc = fallback_migrate_page(mapping, newpage, page, mode);
> +	/*
> +	 * In case of non-lru page, it could be released after
> +	 * isolation step. In that case, we shouldn't try
> +	 * fallback migration which was designed for LRU pages.
> +	 *
> +	 * The rule for such case is that subsystem should clear
> +	 * PG_isolated but remains PG_movable so VM should catch
> +	 * it and clear PG_movable for it.
> +	 */
> +	if (unlikely(PageMovable(page))) {

Can false positive from PG_reclaim occur here?

> +		lru_movable = false;
> +		VM_BUG_ON_PAGE(!mapping, page);
> +		if (!PageIsolated(page)) {
> +			rc = MIGRATEPAGE_SUCCESS;
> +			__ClearPageMovable(page);
> +			goto out;
> +		}
> +	}
> +
> +	if (likely(lru_movable)) {
> +		if (!mapping)
> +			rc = migrate_page(mapping, newpage, page, mode);
> +		else if (mapping->a_ops->migratepage)
> +			/*
> +			 * Most pages have a mapping and most filesystems
> +			 * provide a migratepage callback. Anonymous pages
> +			 * are part of swap space which also has its own
> +			 * migratepage callback. This is the most common path
> +			 * for page migration.
> +			 */
> +			rc = mapping->a_ops->migratepage(mapping, newpage,
> +							page, mode);
> +		else
> +			rc = fallback_migrate_page(mapping, newpage,
> +							page, mode);
> +	} else {
> +		rc = mapping->a_ops->migratepage(mapping, newpage,
> +						page, mode);
> +		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
> +			PageIsolated(page));
> +	}
>
>   	/*
>   	 * When successful, old pagecache page->mapping must be cleared before
> @@ -782,6 +899,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>   		if (!PageAnon(page))
>   			page->mapping = NULL;
>   	}
> +out:
>   	return rc;
>   }
>
> @@ -960,6 +1078,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>   			put_new_page(newpage, private);
>   		else
>   			put_page(newpage);
> +		if (PageMovable(page))
> +			__ClearPageMovable(page);
>   		goto out;
>   	}
>
> @@ -1000,8 +1120,26 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>   				num_poisoned_pages_inc();
>   		}
>   	} else {
> -		if (rc != -EAGAIN)
> -			putback_lru_page(page);
> +		if (rc != -EAGAIN) {
> +			/*
> +			 * subsystem couldn't remove PG_movable since page is
> +			 * isolated so PageMovable check is not racy in here.
> +			 * But PageIsolated check can be racy but it's okay
> +			 * because putback_movable_page checks it under PG_lock
> +			 * again.
> +			 */
> +			if (unlikely(PageMovable(page))) {
> +				if (PageIsolated(page))
> +					putback_movable_page(page);
> +				else {
> +					__ClearPageMovable(page);

Again, we don't do lock_page() here, so what prevents parallel compaction 
isolating the same page?

Sorry for so many questions, hope they all have good answers and this series is 
a success :) Thanks for picking it up.

> +					put_page(page);
> +				}
> +			} else {
> +				putback_lru_page(page);
> +			}
> +		}
> +
>   		if (put_new_page)
>   			put_new_page(newpage, private);
>   		else
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-04-01 12:58   ` Vlastimil Babka
@ 2016-04-04  1:39     ` Minchan Kim
  2016-04-04  4:45       ` Naoya Horiguchi
  0 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-04-04  1:39 UTC (permalink / raw)
  To: Vlastimil Babka, Naoya Horiguchi
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu,
	Naoya Horiguchi

On Fri, Apr 01, 2016 at 02:58:21PM +0200, Vlastimil Babka wrote:
> On 03/30/2016 09:12 AM, Minchan Kim wrote:
> >Procedure of page migration is as follows:
> >
> >First of all, it should isolate a page from LRU and try to
> >migrate the page. If it is successful, it releases the page
> >for freeing. Otherwise, it should put the page back to LRU
> >list.
> >
> >For LRU pages, we have used putback_lru_page for both freeing
> >and putback to LRU list. It's okay because put_page is aware of
> >LRU list so if it releases last refcount of the page, it removes
> >the page from LRU list. However, It makes unnecessary operations
> >(e.g., lru_cache_add, pagevec and flags operations. It would be
> >not significant but no worth to do) and harder to support new
> >non-lru page migration because put_page isn't aware of non-lru
> >page's data structure.
> >
> >To solve the problem, we can add new hook in put_page with
> >PageMovable flags check but it can increase overhead in
> >hot path and needs new locking scheme to stabilize the flag check
> >with put_page.
> >
> >So, this patch cleans it up to divide two semantic(ie, put and putback).
> >If migration is successful, use put_page instead of putback_lru_page and
> >use putback_lru_page only on failure. That makes code more readable
> >and doesn't add overhead in put_page.
> >
> >Comment from Vlastimil
> >"Yeah, and compaction (perhaps also other migration users) has to drain
> >the lru pvec... Getting rid of this stuff is worth even by itself."
> >
> >Cc: Mel Gorman <mgorman@suse.de>
> >Cc: Hugh Dickins <hughd@google.com>
> >Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> [...]
> 
> >@@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> >  		list_del(&page->lru);
> >  		dec_zone_page_state(page, NR_ISOLATED_ANON +
> >  				page_is_file_cache(page));
> >-		/* Soft-offlined page shouldn't go through lru cache list */
> >+	}
> >+
> >+	/*
> >+	 * If migration is successful, drop the reference grabbed during
> >+	 * isolation. Otherwise, restore the page to LRU list unless we
> >+	 * want to retry.
> >+	 */
> >+	if (rc == MIGRATEPAGE_SUCCESS) {
> >+		put_page(page);
> >  		if (reason == MR_MEMORY_FAILURE) {
> >-			put_page(page);
> >  			if (!test_set_page_hwpoison(page))
> >  				num_poisoned_pages_inc();
> >-		} else
> >+		}
> 
> Hmm, I didn't notice it previously, or it's due to rebasing, but it
> seems that you restricted the memory failure handling (i.e. setting
> hwpoison) to MIGRATE_SUCCESS, while previously it was done for all
> non-EAGAIN results. I think that goes against the intention of
> hwpoison, which is IIRC to catch and kill the poor process that
> still uses the page?

That's why I Cc'ed Naoya Horiguchi, to catch things I might get wrong.

Thanks for catching it, Vlastimil.
It was my mistake. But taking this chance, I looked over the hwpoison code
and saw that the other places which increase num_poisoned_pages are a
successful migration, an already freed page and a successfully invalidated
page. IOW, they are all successfully isolated pages, so I guess the count
should only be increased when the migration itself succeeds?
And when I read memory_failure, it bails out without killing if it
encounters an HWPoisoned page, so I think it's not for catching and
killing the poor process.

> 
> Also (but not your fault) the put_page() preceding
> test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> pin we are releasing and which one we still have (hopefully? if I
> read description of da1b13ccfbebe right) otherwise it looks like
> doing something with a page that we just potentially freed.

Yes, while reading the code I had the same question. I think the refcount
being released is the one from get_any_page.

Naoya, could you answer above two questions?

Thanks.

> 
> >+	} else {
> >+		if (rc != -EAGAIN)
> >  			putback_lru_page(page);
> >+		if (put_new_page)
> >+			put_new_page(newpage, private);
> >+		else
> >+			put_page(newpage);
> >  	}
> >
> >-	/*
> >-	 * If migration was not successful and there's a freeing callback, use
> >-	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
> >-	 * during isolation.
> >-	 */
> >-	if (put_new_page)
> >-		put_new_page(newpage, private);
> >-	else if (unlikely(__is_movable_balloon_page(newpage))) {
> >-		/* drop our reference, page already in the balloon */
> >-		put_page(newpage);
> >-	} else
> >-		putback_lru_page(newpage);
> >-
> >  	if (result) {
> >  		if (rc)
> >  			*result = rc;
> >
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 03/16] mm: add non-lru movable page support document
  2016-04-01 14:38   ` Vlastimil Babka
@ 2016-04-04  2:25     ` Minchan Kim
  2016-04-04 13:09       ` Vlastimil Babka
  0 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-04-04  2:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu,
	Jonathan Corbet

On Fri, Apr 01, 2016 at 04:38:34PM +0200, Vlastimil Babka wrote:
> On 03/30/2016 09:12 AM, Minchan Kim wrote:
> >This patch describes what a subsystem should do for non-lru movable
> >page supporting.
> 
> Intentionally reading this first without studying the code to better
> catch things that would seem obvious otherwise.
> 
> >Cc: Jonathan Corbet <corbet@lwn.net>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  Documentation/filesystems/vfs.txt | 11 ++++++-
> >  Documentation/vm/page_migration   | 69 ++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 78 insertions(+), 2 deletions(-)
> >
> >diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> >index 4c1b6c3b4bc8..d63142f8ed7b 100644
> >--- a/Documentation/filesystems/vfs.txt
> >+++ b/Documentation/filesystems/vfs.txt
> >@@ -752,12 +752,21 @@ struct address_space_operations {
> >          and transfer data directly between the storage and the
> >          application's address space.
> >
> >+  isolate_page: Called by the VM when isolating a movable non-lru page.
> >+	If page is successfully isolated, we should mark the page as
> >+	PG_isolated via __SetPageIsolated.
> 
> Patch 02 changelog suggests SetPageIsolated, so this is confusing. I
> guess the main point is that there might be parallel attempts and
> only one is allowed to succeed, right? Whether it's done by atomic

Right.

> ops or otherwise doesn't matter to e.g. compaction.

It should be atomic under PG_lock, so it would be better to change it to
__SetPageIsolated in patch 02 to show the intention: "the operation itself
is not atomic, so you need some lock (e.g., PG_lock) to make sure of
atomicity".


> 
> >    migrate_page:  This is used to compact the physical memory usage.
> >          If the VM wants to relocate a page (maybe off a memory card
> >          that is signalling imminent failure) it will pass a new page
> >  	and an old page to this function.  migrate_page should
> >  	transfer any private data across and update any references
> >-        that it has to the page.
> >+	that it has to the page. If migrated page is non-lru page,
> >+	we should clear PG_isolated and PG_movable via __ClearPageIsolated
> >+	and __ClearPageMovable.
> 
> Similar concern as __SetPageIsolated.
> 
> >+
> >+  putback_page: Called by the VM when isolated page's migration fails.
> >+	We should clear PG_isolated marked in isolated_page function.
> 
> Note this kind of wording is less confusing and could be used above wrt my concerns.
> 
> >
> >    launder_page: Called before freeing a page - it writes back the dirty page. To
> >    	prevent redirtying the page, it is kept locked during the whole
> >diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
> >index fea5c0864170..c4e7551a414e 100644
> >--- a/Documentation/vm/page_migration
> >+++ b/Documentation/vm/page_migration
> >@@ -142,5 +142,72 @@ is increased so that the page cannot be freed while page migration occurs.
> >  20. The new page is moved to the LRU and can be scanned by the swapper
> >      etc again.
> >
> >-Christoph Lameter, May 8, 2006.
> >+C. Non-LRU Page migration
> >+-------------------------
> >+
> >+Although original migration aimed for reducing the latency of memory access
> >+for NUMA, compaction who want to create high-order page is also main customer.
> >+
> >+Ppage migration's disadvantage is that it was designed to migrate only
> >+*LRU* pages. However, there are potential non-lru movable pages which can be
> >+migrated in system, for example, zsmalloc, virtio-balloon pages.
> >+For virtio-balloon pages, some parts of migration code path was hooked up
> >+and added virtio-balloon specific functions to intercept logi.
> 
> logi -> logic?

-_-;;

> 
> >+It's too specific to one subsystem so other subsystem who want to make
> >+their pages movable should add own specific hooks in migration path.
> 
> s/should/would have to/ I guess?

Better.


> 
> >+To solve such problem, VM supports non-LRU page migration which provides
> >+generic functions for non-LRU movable pages without needing subsystem
> >+specific hook in mm/{migrate|compact}.c.
> >+
> >+If a subsystem want to make own pages movable, it should mark pages as
> >+PG_movable via __SetPageMovable. __SetPageMovable needs address_space for
> >+argument for register functions which will be called by VM.
> >+
> >+Three functions in address_space_operation related to non-lru movable page:
> >+
> >+	bool (*isolate_page) (struct page *, isolate_mode_t);
> >+	int (*migratepage) (struct address_space *,
> >+		struct page *, struct page *, enum migrate_mode);
> >+	void (*putback_page)(struct page *);
> >+
> >+1. Isolation
> >+
> >+What VM expected on isolate_page of subsystem is to set PG_isolated flags
> >+of the page if it was successful. With that, concurrent isolation among
> >+CPUs skips the isolated page by other CPU earlier. VM calls isolate_page
> >+under PG_lock of page. If a subsystem cannot isolate the page, it should
> >+return false.
> 
> Ah, I see, so it's designed with page lock to handle the concurrent isolations etc.
> 
> In http://marc.info/?l=linux-mm&m=143816716511904&w=2 Mel has warned
> about doing this in general under page_lock and suggested that each
> user handles concurrent calls to isolate_page() internally. Might be
> more generic that way, even if all current implementers will
> actually use the page lock.

We need PG_lock for two reasons.

Firstly, it guarantees atomicity of the page flag operations (i.e., PG_movable,
PG_isolated). Secondly, it provides stability for page->mapping->a_ops.

For example,

isolate_migratepages_block
        if (PageMovable(page))
                isolate_movable_page
                        get_page_unless_zero <--- 1
                        trylock_page
                        page->mapping->a_ops->isolate_page <--- 2

Between 1 and 2, the driver can nullify page->mapping, so we need PG_lock
to cooperate with the driver in the end. IOW, the owner should call
__ClearPageMovable, which resets page->mapping to NULL, under PG_lock.
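
So, on the driver's releasing side, the tear-down has to look roughly like
this (hypothetical sketch, not from the patch):

        lock_page(page);
        __ClearPageMovable(page);       /* also resets page->mapping to NULL */
        unlock_page(page);
        put_page(page);         /* drop the driver's own reference */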

> 
> Also it's worth reading that mail in full and incorporating here, as
> there are more concerns related to concurrency that should be
> documented, e.g. with pages that can be mapped to userspace. Not a
> case with zram and balloon pages I guess, but one of Gioh's original
> use cases was a driver which IIRC could map pages. So the design and
> documentation should keep that in mind.

Hmm, I didn't consider driver pages mapped to userspace.
It's really worth considering. I will think about it.

> 
> >+2. Migration
> >+
> >+After successful isolation, VM calls migratepage. The migratepage's goal is
> >+to move content of the old page to new page and set up struct page fields
> >+of new page. If migration is successful, subsystem should release old page's
> >+refcount to free. Keep in mind that subsystem should clear PG_movable and
> >+PG_isolated before releasing the refcount.  If everything are done, user
> >+should return MIGRATEPAGE_SUCCESS. If subsystem cannot migrate the page
> >+at the moment, migratepage can return -EAGAIN. On -EAGAIN, VM will retry page
> >+migration because VM interprets -EAGAIN as "temporal migration failure".
> >+
> >+3. Putback
> >+
> >+If migration was unsuccessful, VM calls putback_page. The subsystem should
> >+insert isolated page to own data structure again if it has. And subsystem
> >+should clear PG_isolated which was marked in isolation step.
> >+
> >+Note about releasing page:
> >+
> >+Subsystem can release pages whenever it want but if it releses the page
> >+which is already isolated, it should clear PG_isolated but doesn't touch
> >+PG_movable under PG_lock. Instead of it, VM will clear PG_movable after
> >+his job done. Otherweise, subsystem should clear both page flags before
> >+releasing the page.
> 
> I don't understand this right now. But maybe I will get it after
> reading the patches and suggest some improved wording here.

I will try to explain why such a rule exists there.

The problem is that put_page is aware of PageLRU. So, if someone releases
the last refcount of an LRU page, __put_page checks PageLRU and then clears
the flags and detaches the page from the LRU list (i.e., the data structure).
But in the case of driver pages, there is no single LRU-like data structure
shared among drivers. IOW, we would have to add something like the following
to put_page to handle the various requirements of driver pages.

void __put_page(struct page *page)
{
        if (PageMovable(page)) {
                /*
                 * A hypothetical per-driver release hook: it would tidy
                 * up the driver's data structure (its equivalent of the
                 * LRU) and reset the page's flags. It would have to be
                 * atomic and always successful.
                 */
                page->put(page);
                __ClearPageMovable(page);
        } else if (PageCompound(page))
                __put_compound_page(page);
        else
                __put_single_page(page);
}

I'd like to avoid adding a new branch to put_page, which is a hot path, for
such an uncommon job. (That might change in the future, but it's not common
at the moment.) So, the rule for the driver is as follows.

When the driver releases the page and finds it marked PG_isolated, it should
clear only PG_isolated, not PG_movable, so the migration side of the VM can
catch it: "Hmm, the isolated non-lru page doesn't have PG_isolated any more.
That means the driver released the page, so let's put the page instead of
doing the putback operation."

When the driver releases the page and doesn't see the PG_isolated mark on it,
the driver should reset both PG_isolated and PG_movable.
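
To make the rule concrete, a hypothetical driver release path (untested
sketch; driver_release_page is a made-up name, the flags are the ones this
series introduces) could look like:

static void driver_release_page(struct page *page)
{
        lock_page(page);
        if (PageIsolated(page)) {
                /*
                 * Migration still holds its extra reference; clear only
                 * PG_isolated and let the VM clear PG_movable when it
                 * notices the page was released.
                 */
                ClearPageIsolated(page);
        } else {
                /*
                 * Nobody else is touching the page; PG_isolated is already
                 * clear, so reset PG_movable (and page->mapping) here.
                 */
                __ClearPageMovable(page);
        }
        unlock_page(page);
        /* drop the driver's own reference */
        put_page(page);
}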

> 
> >+
> >+Note about PG_isolated:
> >+
> >+PG_isolated check on a page is valid only if the page's flag is already
> >+set to PG_movable.
> 
> But it's not possible to check both atomically, so I guess it
> implies checking under page lock? If that's true, should be
> explicit.

Sure.

Thanks for the review, Vlastimil. :)

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-04-04  1:39     ` Minchan Kim
@ 2016-04-04  4:45       ` Naoya Horiguchi
  2016-04-04 14:46         ` Vlastimil Babka
  0 siblings, 1 reply; 65+ messages in thread
From: Naoya Horiguchi @ 2016-04-04  4:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Vlastimil Babka, Andrew Morton, linux-kernel, linux-mm, jlayton,
	bfields, Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
> On Fri, Apr 01, 2016 at 02:58:21PM +0200, Vlastimil Babka wrote:
> > On 03/30/2016 09:12 AM, Minchan Kim wrote:
> > >Procedure of page migration is as follows:
> > >
> > >First of all, it should isolate a page from LRU and try to
> > >migrate the page. If it is successful, it releases the page
> > >for freeing. Otherwise, it should put the page back to LRU
> > >list.
> > >
> > >For LRU pages, we have used putback_lru_page for both freeing
> > >and putback to LRU list. It's okay because put_page is aware of
> > >LRU list so if it releases last refcount of the page, it removes
> > >the page from LRU list. However, It makes unnecessary operations
> > >(e.g., lru_cache_add, pagevec and flags operations. It would be
> > >not significant but no worth to do) and harder to support new
> > >non-lru page migration because put_page isn't aware of non-lru
> > >page's data structure.
> > >
> > >To solve the problem, we can add new hook in put_page with
> > >PageMovable flags check but it can increase overhead in
> > >hot path and needs new locking scheme to stabilize the flag check
> > >with put_page.
> > >
> > >So, this patch cleans it up to divide two semantic(ie, put and putback).
> > >If migration is successful, use put_page instead of putback_lru_page and
> > >use putback_lru_page only on failure. That makes code more readable
> > >and doesn't add overhead in put_page.
> > >
> > >Comment from Vlastimil
> > >"Yeah, and compaction (perhaps also other migration users) has to drain
> > >the lru pvec... Getting rid of this stuff is worth even by itself."
> > >
> > >Cc: Mel Gorman <mgorman@suse.de>
> > >Cc: Hugh Dickins <hughd@google.com>
> > >Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > >Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > >Signed-off-by: Minchan Kim <minchan@kernel.org>
> > 
> > [...]
> > 
> > >@@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> > >  		list_del(&page->lru);
> > >  		dec_zone_page_state(page, NR_ISOLATED_ANON +
> > >  				page_is_file_cache(page));
> > >-		/* Soft-offlined page shouldn't go through lru cache list */
> > >+	}
> > >+
> > >+	/*
> > >+	 * If migration is successful, drop the reference grabbed during
> > >+	 * isolation. Otherwise, restore the page to LRU list unless we
> > >+	 * want to retry.
> > >+	 */
> > >+	if (rc == MIGRATEPAGE_SUCCESS) {
> > >+		put_page(page);
> > >  		if (reason == MR_MEMORY_FAILURE) {
> > >-			put_page(page);
> > >  			if (!test_set_page_hwpoison(page))
> > >  				num_poisoned_pages_inc();
> > >-		} else
> > >+		}
> > 
> > Hmm, I didn't notice it previously, or it's due to rebasing, but it
> > seems that you restricted the memory failure handling (i.e. setting
> > hwpoison) to MIGRATE_SUCCESS, while previously it was done for all
> > non-EAGAIN results. I think that goes against the intention of
> > hwpoison, which is IIRC to catch and kill the poor process that
> > still uses the page?
> 
> That's why I Cc'ed Naoya Horiguchi, to catch things I might get wrong.
> 
> Thanks for catching it, Vlastimil.
> It was my mistake. But taking this chance, I looked over the hwpoison code
> and saw that the other places which increase num_poisoned_pages are a
> successful migration, an already freed page and a successfully invalidated
> page. IOW, they are all successfully isolated pages, so I guess the count
> should only be increased when the migration itself succeeds?

Yes, that's right. When exiting on migration failure, we shouldn't call
test_set_page_hwpoison or num_poisoned_pages_inc, so the current code checking
(rc != -EAGAIN) is simply incorrect. Your change fixes the bug in memory
error handling. Great!

> And when I read memory_failure, it bails out without killing if it
> encounters an HWPoisoned page, so I think it's not for catching and
> killing the poor process.
>
> > 
> > Also (but not your fault) the put_page() preceding
> > test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> > pin we are releasing and which one we still have (hopefully? if I
> > read description of da1b13ccfbebe right) otherwise it looks like
> > doing something with a page that we just potentially freed.
>
> Yes, while reading the code I had the same question. I think the refcount
> being released is the one from get_any_page.

As the other callers of page migration do, soft_offline_page expects the
migration source page to be freed at this put_page() (no pin remains.)
The refcount released here is from isolate_lru_page() in __soft_offline_page().
(the pin by get_any_page is released by put_hwpoison_page just after it.)

.. yes, doing something just after freeing page looks weird, but that's
how PageHWPoison flag works. IOW, many other page flags are maintained
only during one "allocate-free" life span, but PageHWPoison still does
its job beyond it.

As for commenting, this put_page() is called in any MIGRATEPAGE_SUCCESS
case (regardless of callers), so what we can say here is "we free the
source page here, bypassing LRU list" or something?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
  2016-04-01 21:29   ` Vlastimil Babka
@ 2016-04-04  5:12     ` Minchan Kim
  2016-04-04 13:24       ` Vlastimil Babka
  0 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-04-04  5:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, dri-devel,
	Gioh Kim

On Fri, Apr 01, 2016 at 11:29:14PM +0200, Vlastimil Babka wrote:
> Might have been better as a separate migration patch and then a
> compaction patch. It's prefixed mm/compaction, but most changed are
> in mm/migrate.c

Indeed. The title is rather misleading, but I'm not sure it's a good idea
to separate the compaction and migration parts.
I will just resend with the title changed from "mm/compaction" to
"mm/migration".

> 
> On 03/30/2016 09:12 AM, Minchan Kim wrote:
> >We have allowed migration for only LRU pages until now and it was
> >enough to make high-order pages. But recently, embedded system(e.g.,
> >webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
> >so we have seen several reports about troubles of small high-order
> >allocation. For fixing the problem, there were several efforts
> >(e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
> >reserved memory, vmalloc and so on) but if there are lots of
> >non-movable pages in system, their solutions are void in the long run.
> >
> >So, this patch is to support facility to change non-movable pages
> >with movable. For the feature, this patch introduces functions related
> >to migration to address_space_operations as well as some page flags.
> >
> >Basically, this patch supports two page-flags and two functions related
> >to page migration. The flag and page->mapping stability are protected
> >by PG_lock.
> >
> >	PG_movable
> >	PG_isolated
> >
> >	bool (*isolate_page) (struct page *, isolate_mode_t);
> >	void (*putback_page) (struct page *);
> >
> >Duty of subsystem want to make their pages as migratable are
> >as follows:
> >
> >1. It should register address_space to page->mapping then mark
> >the page as PG_movable via __SetPageMovable.
> >
> >2. It should mark the page as PG_isolated via SetPageIsolated
> >if isolation is sucessful and return true.
> 
> Ah another thing to document (especially in the comments/Doc) is
> that the subsystem must not expect anything to survive in page.lru
> (or fields that union it) after having isolated successfully.

Indeed. I'm surprised I missed it, because I wrote it down somewhere,
but it might have been lost during rebase.
I will fix it.

> 
> >3. If migration is successful, it should clear PG_isolated and
> >PG_movable of the page for free preparation then release the
> >reference of the page to free.
> >
> >4. If migration fails, putback function of subsystem should
> >clear PG_isolated via ClearPageIsolated.
> >
> >5. If a subsystem want to release isolated page, it should
> >clear PG_isolated but not PG_movable. Instead, VM will do it.
> 
> Under lock? Or just with ClearPageIsolated?

Both:
ClearPageIsolated under PG_lock.

Yes, it's better to change ClearPageIsolated to __ClearPageIsolated.

> 
> >Cc: Vlastimil Babka <vbabka@suse.cz>
> >Cc: Mel Gorman <mgorman@suse.de>
> >Cc: Hugh Dickins <hughd@google.com>
> >Cc: dri-devel@lists.freedesktop.org
> >Cc: virtualization@lists.linux-foundation.org
> >Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  Documentation/filesystems/Locking      |   4 +
> >  Documentation/filesystems/vfs.txt      |   5 +
> >  fs/proc/page.c                         |   3 +
> >  include/linux/fs.h                     |   2 +
> >  include/linux/migrate.h                |   2 +
> >  include/linux/page-flags.h             |  31 ++++++
> >  include/uapi/linux/kernel-page-flags.h |   1 +
> >  mm/compaction.c                        |  14 ++-
> >  mm/migrate.c                           | 174 +++++++++++++++++++++++++++++----
> >  9 files changed, 217 insertions(+), 19 deletions(-)
> >
> >diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> >index 619af9bfdcb3..0bb79560abb3 100644
> >--- a/Documentation/filesystems/Locking
> >+++ b/Documentation/filesystems/Locking
> >@@ -195,7 +195,9 @@ unlocks and drops the reference.
> >  	int (*releasepage) (struct page *, int);
> >  	void (*freepage)(struct page *);
> >  	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> >+	bool (*isolate_page) (struct page *, isolate_mode_t);
> >  	int (*migratepage)(struct address_space *, struct page *, struct page *);
> >+	void (*putback_page) (struct page *);
> >  	int (*launder_page)(struct page *);
> >  	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
> >  	int (*error_remove_page)(struct address_space *, struct page *);
> >@@ -219,7 +221,9 @@ invalidatepage:		yes
> >  releasepage:		yes
> >  freepage:		yes
> >  direct_IO:
> >+isolate_page:		yes
> >  migratepage:		yes (both)
> >+putback_page:		yes
> >  launder_page:		yes
> >  is_partially_uptodate:	yes
> >  error_remove_page:	yes
> >diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> >index b02a7d598258..4c1b6c3b4bc8 100644
> >--- a/Documentation/filesystems/vfs.txt
> >+++ b/Documentation/filesystems/vfs.txt
> >@@ -592,9 +592,14 @@ struct address_space_operations {
> >  	int (*releasepage) (struct page *, int);
> >  	void (*freepage)(struct page *);
> >  	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> >+	/* isolate a page for migration */
> >+	bool (*isolate_page) (struct page *, isolate_mode_t);
> >  	/* migrate the contents of a page to the specified target */
> >  	int (*migratepage) (struct page *, struct page *);
> >+	/* put the page back to right list */
> 
> ... "after a failed migration" ?

Better.

> 
> >+	void (*putback_page) (struct page *);
> >  	int (*launder_page) (struct page *);
> >+
> >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> >  					unsigned long);
> >  	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> >diff --git a/fs/proc/page.c b/fs/proc/page.c
> >index 3ecd445e830d..ce3d08a4ad8d 100644
> >--- a/fs/proc/page.c
> >+++ b/fs/proc/page.c
> >@@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
> >  	if (page_is_idle(page))
> >  		u |= 1 << KPF_IDLE;
> >
> >+	if (PageMovable(page))
> >+		u |= 1 << KPF_MOVABLE;
> >+
> >  	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
> >
> >  	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> >diff --git a/include/linux/fs.h b/include/linux/fs.h
> >index da9e67d937e5..36f2d610e7a8 100644
> >--- a/include/linux/fs.h
> >+++ b/include/linux/fs.h
> >@@ -401,6 +401,8 @@ struct address_space_operations {
> >  	 */
> >  	int (*migratepage) (struct address_space *,
> >  			struct page *, struct page *, enum migrate_mode);
> >+	bool (*isolate_page)(struct page *, isolate_mode_t);
> >+	void (*putback_page)(struct page *);
> >  	int (*launder_page) (struct page *);
> >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> >  					unsigned long);
> >diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >index 9b50325e4ddf..404fbfefeb33 100644
> >--- a/include/linux/migrate.h
> >+++ b/include/linux/migrate.h
> >@@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
> >  			struct page *, struct page *, enum migrate_mode);
> >  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> >  		unsigned long private, enum migrate_mode mode, int reason);
> >+extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> >+extern void putback_movable_page(struct page *page);
> >
> >  extern int migrate_prep(void);
> >  extern int migrate_prep_local(void);
> >diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >index f4ed4f1b0c77..77ebf8fdbc6e 100644
> >--- a/include/linux/page-flags.h
> >+++ b/include/linux/page-flags.h
> >@@ -129,6 +129,10 @@ enum pageflags {
> >
> >  	/* Compound pages. Stored in first tail page's flags */
> >  	PG_double_map = PG_private_2,
> >+
> >+	/* non-lru movable pages */
> >+	PG_movable = PG_reclaim,
> >+	PG_isolated = PG_owner_priv_1,
> 
> Documentation should probably state that these fields alias and
> subsystem supporting the movable pages shouldn't use them elsewhere.

Yeb.

> 
> Also I'm a bit uncomfortable how isolate_movable_page() blindly expects that
> page->mapping->a_ops->isolate_page exists for PageMovable() pages.
> What if it's a false positive on a PG_reclaim page? Can we rely on
> PG_reclaim always (and without races) implying PageLRU() so that we
> don't even attempt isolate_movable_page()?

For now, we shouldn't have such a false positive because PageMovable
checks page->_mapcount == PAGE_MOVABLE_MAPCOUNT_VALUE as well as PG_movable
under PG_lock.

But given your question about user-mapped driver pages, we cannot use
_mapcount anymore, so I will find another approach. An option is this.

static inline int PageMovable(struct page *page)
{
        int ret = 0;
        struct address_space *mapping;
        const struct address_space_operations *a_ops;

        if (!test_bit(PG_movable, &page->flags))
                goto out;

        mapping = page->mapping;
        if (!mapping)
                goto out;

        a_ops = mapping->a_ops;
        if (!a_ops)
                goto out;
        if (a_ops->isolate_page)
                ret = 1;
out:
        return ret;
}

It works under PG_lock, but with this we would need trylock_page just to peek
whether a page is a movable non-lru page while scanning pfns.
To avoid that, we need another variant for peeking which checks only the
PG_movable bit instead of all of the above.


/*
 * If @page_locked is false, we cannot guarantee page->mapping's stability,
 * so the function only checks PG_movable, which can be a false positive.
 * The caller must then re-check under PG_lock, where it is safe to look at
 * a_ops->isolate_page.
 */
static inline int PageMovable(struct page *page, bool page_locked)
{
        int ret = 0;
        struct address_space *mapping;
        const struct address_space_operations *a_ops;

        if (!test_bit(PG_movable, &page->flags))
                goto out;

        if (!page_locked) {
                ret = 1;
                goto out;
        }

        mapping = page->mapping;
        if (!mapping)
                goto out;

        a_ops = mapping->a_ops;
        if (!a_ops)
                goto out;
        if (a_ops->isolate_page)
                ret = 1;
out:
        return ret;
}
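
Usage would then look roughly like this (untested sketch against the
two-argument variant above, not the posted patch):

        /* pfn scanner: lockless peek, may be a false positive */
        if (PageMovable(page, false) && !PageIsolated(page)) {
                if (isolate_movable_page(page, isolate_mode))
                        goto isolate_success;
        }

        /* isolate_movable_page: authoritative check under PG_lock */
        if (unlikely(!trylock_page(page)))
                goto out_putpage;
        if (!PageMovable(page, true) || PageIsolated(page))
                goto out_no_isolated;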

> 
> >  };
> >
> >  #ifndef __GENERATING_BOUNDS_H
> >@@ -614,6 +618,33 @@ static inline void __ClearPageBalloon(struct page *page)
> >  	atomic_set(&page->_mapcount, -1);
> >  }
> >
> >+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> 
> IIRC this was what Gioh's previous attempts used instead of
> PG_movable? Is it still needed? Doesn't it prevent a driver

It's needed to avoid the false positive, as I said.

> providing movable *and* mapped pages?

Absolutely true. I will rethink it.

> If it's to distinguish the PG_reclaim alias that I mention above, it
> seems like an overkill to me. Why would we need both a special
> mapcount value and a flag? Checking that
> page->mapping->a_ops->isolate_page exists before calling it should
> be enough to resolve the ambiguity?

As I mentioned, using a_ops->isolate_page needs to be done under PG_lock.
And the idea I suggested above will work, I guess.
I will try it.

> 
> >+
> >+static inline int PageMovable(struct page *page)
> >+{
> >+	return ((test_bit(PG_movable, &(page)->flags) &&
> >+		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> >+		|| PageBalloon(page));
> >+}
> >+
> >+/* Caller should hold a PG_lock */
> >+static inline void __SetPageMovable(struct page *page,
> >+				struct address_space *mapping)
> >+{
> >+	page->mapping = mapping;
> >+	__set_bit(PG_movable, &page->flags);
> >+	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> >+}
> >+
> >+static inline void __ClearPageMovable(struct page *page)
> >+{
> >+	atomic_set(&page->_mapcount, -1);
> >+	__clear_bit(PG_movable, &(page)->flags);
> >+	page->mapping = NULL;
> >+}
> >+
> >+PAGEFLAG(Isolated, isolated, PF_ANY);
> >+
> >  /*
> >   * If network-based swap is enabled, sl*b must keep track of whether pages
> >   * were allocated from pfmemalloc reserves.
> >diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> >index 5da5f8751ce7..a184fd2434fa 100644
> >--- a/include/uapi/linux/kernel-page-flags.h
> >+++ b/include/uapi/linux/kernel-page-flags.h
> >@@ -34,6 +34,7 @@
> >  #define KPF_BALLOON		23
> >  #define KPF_ZERO_PAGE		24
> >  #define KPF_IDLE		25
> >+#define KPF_MOVABLE		26
> >
> >
> >  #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index ccf97b02b85f..7557aedddaee 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >
> >  		/*
> >  		 * Check may be lockless but that's ok as we recheck later.
> >-		 * It's possible to migrate LRU pages and balloon pages
> >+		 * It's possible to migrate LRU and movable kernel pages.
> >  		 * Skip any other type of page
> >  		 */
> >  		is_lru = PageLRU(page);
> >@@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >  					goto isolate_success;
> >  				}
> >  			}
> >+
> >+			if (unlikely(PageMovable(page)) &&
> >+					!PageIsolated(page)) {
> >+				if (locked) {
> >+					spin_unlock_irqrestore(&zone->lru_lock,
> >+									flags);
> >+					locked = false;
> >+				}
> >+
> >+				if (isolate_movable_page(page, isolate_mode))
> >+					goto isolate_success;
> >+			}
> >  		}
> >
> >  		/*
> >diff --git a/mm/migrate.c b/mm/migrate.c
> >index 53529c805752..b56bf2b3fe8c 100644
> >--- a/mm/migrate.c
> >+++ b/mm/migrate.c
> >@@ -73,6 +73,85 @@ int migrate_prep_local(void)
> >  	return 0;
> >  }
> >
> >+bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> >+{
> >+	bool ret = false;
> 
> Maintaining "ret" seems useless here. All the "goto out*" statements
> are executed only when ret is false, and ret == true is returned by
> a different return.

Yeb. Will change.

> 
> >+
> >+	/*
> >+	 * Avoid burning cycles with pages that are yet under __free_pages(),
> >+	 * or just got freed under us.
> >+	 *
> >+	 * In case we 'win' a race for a movable page being freed under us and
> >+	 * raise its refcount preventing __free_pages() from doing its job
> >+	 * the put_page() at the end of this block will take care of
> >+	 * release this page, thus avoiding a nasty leakage.
> >+	 */
> >+	if (unlikely(!get_page_unless_zero(page)))
> >+		goto out;
> >+
> >+	/*
> >+	 * Check PG_movable before holding a PG_lock because page's owner
> >+	 * assumes anybody doesn't touch PG_lock of newly allocated page.
> >+	 */
> >+	if (unlikely(!PageMovable(page)))
> >+		goto out_putpage;
> >+	/*
> >+	 * As movable pages are not isolated from LRU lists, concurrent
> >+	 * compaction threads can race against page migration functions
> >+	 * as well as race against the releasing a page.
> >+	 *
> >+	 * In order to avoid having an already isolated movable page
> >+	 * being (wrongly) re-isolated while it is under migration,
> >+	 * or to avoid attempting to isolate pages being released,
> >+	 * lets be sure we have the page lock
> >+	 * before proceeding with the movable page isolation steps.
> >+	 */
> >+	if (unlikely(!trylock_page(page)))
> >+		goto out_putpage;
> >+
> >+	if (!PageMovable(page) || PageIsolated(page))
> >+		goto out_no_isolated;
> >+
> >+	ret = page->mapping->a_ops->isolate_page(page, mode);
> >+	if (!ret)
> >+		goto out_no_isolated;
> >+
> >+	WARN_ON_ONCE(!PageIsolated(page));
> >+	unlock_page(page);
> >+	return ret;
> >+
> >+out_no_isolated:
> >+	unlock_page(page);
> >+out_putpage:
> >+	put_page(page);
> >+out:
> >+	return ret;
> >+}
> >+
> >+/* It should be called on page which is PG_movable */
> >+void putback_movable_page(struct page *page)
> >+{
> >+	/*
> >+	 * 'lock_page()' stabilizes the page and prevents races against
> >+	 * concurrent isolation threads attempting to re-isolate it.
> >+	 */
> >+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >+
> >+	lock_page(page);
> >+	if (PageIsolated(page)) {
> >+		struct address_space *mapping;
> >+
> >+		mapping = page_mapping(page);
> >+		mapping->a_ops->putback_page(page);
> >+		WARN_ON_ONCE(PageIsolated(page));
> >+	} else {
> >+		__ClearPageMovable(page);
> >+	}
> >+	unlock_page(page);
> >+	/* drop the extra ref count taken for movable page isolation */
> >+	put_page(page);
> >+}
> >+
> >  /*
> >   * Put previously isolated pages back onto the appropriate lists
> >   * from where they were once taken off for compaction/migration.
> >@@ -94,10 +173,18 @@ void putback_movable_pages(struct list_head *l)
> >  		list_del(&page->lru);
> >  		dec_zone_page_state(page, NR_ISOLATED_ANON +
> >  				page_is_file_cache(page));
> >-		if (unlikely(isolated_balloon_page(page)))
> >+		if (unlikely(isolated_balloon_page(page))) {
> >  			balloon_page_putback(page);
> >-		else
> >+		} else if (unlikely(PageMovable(page))) {
> >+			if (PageIsolated(page)) {
> >+				putback_movable_page(page);
> >+			} else {
> >+				__ClearPageMovable(page);
> 
> We don't do lock_page() here, so what prevents parallel compaction
> isolating the same page?

Need PG_lock.

> 
> >+				put_page(page);
> >+			}
> >+		} else {
> >  			putback_lru_page(page);
> >+		}
> >  	}
> >  }
> >
> >@@ -592,7 +679,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
> >   ***********************************************************/
> >
> >  /*
> >- * Common logic to directly migrate a single page suitable for
> >+ * Common logic to directly migrate a single LRU page suitable for
> >   * pages that do not use PagePrivate/PagePrivate2.
> >   *
> >   * Pages are locked upon entry and exit.
> >@@ -755,24 +842,54 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> >  				enum migrate_mode mode)
> >  {
> >  	struct address_space *mapping;
> >-	int rc;
> >+	int rc = -EAGAIN;
> >+	bool lru_movable = true;
> >
> >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >  	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> >
> >  	mapping = page_mapping(page);
> >-	if (!mapping)
> >-		rc = migrate_page(mapping, newpage, page, mode);
> >-	else if (mapping->a_ops->migratepage)
> >-		/*
> >-		 * Most pages have a mapping and most filesystems provide a
> >-		 * migratepage callback. Anonymous pages are part of swap
> >-		 * space which also has its own migratepage callback. This
> >-		 * is the most common path for page migration.
> >-		 */
> >-		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
> >-	else
> >-		rc = fallback_migrate_page(mapping, newpage, page, mode);
> >+	/*
> >+	 * In case of non-lru page, it could be released after
> >+	 * isolation step. In that case, we shouldn't try
> >+	 * fallback migration which was designed for LRU pages.
> >+	 *
> >+	 * The rule for such case is that subsystem should clear
> >+	 * PG_isolated but remains PG_movable so VM should catch
> >+	 * it and clear PG_movable for it.
> >+	 */
> >+	if (unlikely(PageMovable(page))) {
> 
> Can false positive from PG_reclaim occur here?

PageMovable includes the _mapcount == PAGE_MOVABLE_MAPCOUNT_VALUE check.

> 
> >+		lru_movable = false;
> >+		VM_BUG_ON_PAGE(!mapping, page);
> >+		if (!PageIsolated(page)) {
> >+			rc = MIGRATEPAGE_SUCCESS;
> >+			__ClearPageMovable(page);
> >+			goto out;
> >+		}
> >+	}
> >+
> >+	if (likely(lru_movable)) {
> >+		if (!mapping)
> >+			rc = migrate_page(mapping, newpage, page, mode);
> >+		else if (mapping->a_ops->migratepage)
> >+			/*
> >+			 * Most pages have a mapping and most filesystems
> >+			 * provide a migratepage callback. Anonymous pages
> >+			 * are part of swap space which also has its own
> >+			 * migratepage callback. This is the most common path
> >+			 * for page migration.
> >+			 */
> >+			rc = mapping->a_ops->migratepage(mapping, newpage,
> >+							page, mode);
> >+		else
> >+			rc = fallback_migrate_page(mapping, newpage,
> >+							page, mode);
> >+	} else {
> >+		rc = mapping->a_ops->migratepage(mapping, newpage,
> >+						page, mode);
> >+		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
> >+			PageIsolated(page));
> >+	}
> >
> >  	/*
> >  	 * When successful, old pagecache page->mapping must be cleared before
> >@@ -782,6 +899,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> >  		if (!PageAnon(page))
> >  			page->mapping = NULL;
> >  	}
> >+out:
> >  	return rc;
> >  }
> >
> >@@ -960,6 +1078,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> >  			put_new_page(newpage, private);
> >  		else
> >  			put_page(newpage);
> >+		if (PageMovable(page))
> >+			__ClearPageMovable(page);
> >  		goto out;
> >  	}
> >
> >@@ -1000,8 +1120,26 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> >  				num_poisoned_pages_inc();
> >  		}
> >  	} else {
> >-		if (rc != -EAGAIN)
> >-			putback_lru_page(page);
> >+		if (rc != -EAGAIN) {
> >+			/*
> >+			 * subsystem couldn't remove PG_movable since page is
> >+			 * isolated so PageMovable check is not racy in here.
> >+			 * But PageIsolated check can be racy but it's okay
> >+			 * because putback_movable_page checks it under PG_lock
> >+			 * again.
> >+			 */
> >+			if (unlikely(PageMovable(page))) {
> >+				if (PageIsolated(page))
> >+					putback_movable_page(page);
> >+				else {
> >+					__ClearPageMovable(page);
> 
> Again, we don't do lock_page() here, so what prevents parallel
> compaction isolating the same page?

It seems we need PG_lock there, too.
Thanks for catching that.
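
E.g. something like the below, just to sketch the idea (not the final fix;
the exact refcount handling needs more care):

	} else if (unlikely(PageMovable(page))) {
		lock_page(page);
		if (PageIsolated(page)) {
			unlock_page(page);
			putback_movable_page(page);
		} else {
			__ClearPageMovable(page);
			unlock_page(page);
			put_page(page);
		}
	}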

> 
> Sorry for so many questions, hope they all have good answers and
> this series is a success :) Thanks for picking it up.

No problem at all. Many questions mean the code and/or documentation are
not clear and still need to be improved.

Thanks for the detailed review, Vlastimil!
I will resend a new version after my vacation this week.


* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-03-30  7:12 ` [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page Minchan Kim
  2016-04-01 12:58   ` Vlastimil Babka
@ 2016-04-04  5:53   ` Balbir Singh
  2016-04-04  6:01     ` Minchan Kim
  1 sibling, 1 reply; 65+ messages in thread
From: Balbir Singh @ 2016-04-04  5:53 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Vlastimil Babka,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu,
	Naoya Horiguchi



On 30/03/16 18:12, Minchan Kim wrote:
> Procedure of page migration is as follows:
>
> First of all, it should isolate a page from LRU and try to
> migrate the page. If it is successful, it releases the page
> for freeing. Otherwise, it should put the page back to LRU
> list.
>
> For LRU pages, we have used putback_lru_page for both freeing
> and putback to LRU list. It's okay because put_page is aware of
> LRU list so if it releases last refcount of the page, it removes
> the page from LRU list. However, It makes unnecessary operations
> (e.g., lru_cache_add, pagevec and flags operations. It would be
> not significant but no worth to do) and harder to support new
> non-lru page migration because put_page isn't aware of non-lru
> page's data structure.
>
> To solve the problem, we can add new hook in put_page with
> PageMovable flags check but it can increase overhead in
> hot path and needs new locking scheme to stabilize the flag check
> with put_page.
>
> So, this patch cleans it up to divide two semantic(ie, put and putback).
> If migration is successful, use put_page instead of putback_lru_page and
> use putback_lru_page only on failure. That makes code more readable
> and doesn't add overhead in put_page.
So effectively when we return from unmap_and_move() the page is either
put_page or putback_lru_page() and the page is gone from under us.
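
For reference, my reading of the resulting tail of unmap_and_move() is
roughly this (a sketch of the intent, not the exact hunk):

	if (rc == MIGRATEPAGE_SUCCESS)
		put_page(page);		/* free the source page, bypassing the LRU */
	else if (rc != -EAGAIN)
		putback_lru_page(page);	/* real failure: back to the LRU */
	/* on -EAGAIN the caller retries and the page stays isolated */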


* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-04-04  5:53   ` Balbir Singh
@ 2016-04-04  6:01     ` Minchan Kim
  2016-04-05  3:10       ` Balbir Singh
  0 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-04-04  6:01 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu, Naoya Horiguchi

On Mon, Apr 04, 2016 at 03:53:59PM +1000, Balbir Singh wrote:
> 
> 
> On 30/03/16 18:12, Minchan Kim wrote:
> > Procedure of page migration is as follows:
> >
> > First of all, it should isolate a page from LRU and try to
> > migrate the page. If it is successful, it releases the page
> > for freeing. Otherwise, it should put the page back to LRU
> > list.
> >
> > For LRU pages, we have used putback_lru_page for both freeing
> > and putback to LRU list. It's okay because put_page is aware of
> > LRU list so if it releases last refcount of the page, it removes
> > the page from LRU list. However, It makes unnecessary operations
> > (e.g., lru_cache_add, pagevec and flags operations. It would be
> > not significant but no worth to do) and harder to support new
> > non-lru page migration because put_page isn't aware of non-lru
> > page's data structure.
> >
> > To solve the problem, we can add new hook in put_page with
> > PageMovable flags check but it can increase overhead in
> > hot path and needs new locking scheme to stabilize the flag check
> > with put_page.
> >
> > So, this patch cleans it up to divide two semantic(ie, put and putback).
> > If migration is successful, use put_page instead of putback_lru_page and
> > use putback_lru_page only on failure. That makes code more readable
> > and doesn't add overhead in put_page.
> So effectively when we return from unmap_and_move() the page is either
> put_page or putback_lru_page() and the page is gone from under us.

I didn't get your point.
Could you elaborate on what you want to say about this patch?


* Re: [PATCH v3 12/16] zsmalloc: zs_compact refactoring
  2016-03-30  7:12 ` [PATCH v3 12/16] zsmalloc: zs_compact refactoring Minchan Kim
@ 2016-04-04  8:04   ` Chulmin Kim
  2016-04-04  9:01     ` Minchan Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Chulmin Kim @ 2016-04-04  8:04 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton; +Cc: linux-kernel, linux-mm

On 2016-03-30 16:12, Minchan Kim wrote:
> Currently, we rely on class->lock to prevent zspage destruction.
> It was okay until now because the critical section is short but
> with run-time migration, it could be long so class->lock is not
> a good apporach any more.
>
> So, this patch introduces [un]freeze_zspage functions which
> freeze allocated objects in the zspage with pinning tag so
> user cannot free using object. With those functions, this patch
> redesign compaction.
>
> Those functions will be used for implementing zspage runtime
> migrations, too.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   mm/zsmalloc.c | 393 ++++++++++++++++++++++++++++++++++++++--------------------
>   1 file changed, 257 insertions(+), 136 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index b11dcd718502..ac8ca7b10720 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -922,6 +922,13 @@ static unsigned long obj_to_head(struct size_class *class, struct page *page,
>   		return *(unsigned long *)obj;
>   }
>
> +static inline int testpin_tag(unsigned long handle)
> +{
> +	unsigned long *ptr = (unsigned long *)handle;
> +
> +	return test_bit(HANDLE_PIN_BIT, ptr);
> +}
> +
>   static inline int trypin_tag(unsigned long handle)
>   {
>   	unsigned long *ptr = (unsigned long *)handle;
> @@ -949,8 +956,7 @@ static void reset_page(struct page *page)
>   	page->freelist = NULL;
>   }
>
> -static void free_zspage(struct zs_pool *pool, struct size_class *class,
> -			struct page *first_page)
> +static void free_zspage(struct zs_pool *pool, struct page *first_page)
>   {
>   	struct page *nextp, *tmp, *head_extra;
>
> @@ -973,11 +979,6 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
>   	}
>   	reset_page(head_extra);
>   	__free_page(head_extra);
> -
> -	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> -			class->size, class->pages_per_zspage));
> -	atomic_long_sub(class->pages_per_zspage,
> -				&pool->pages_allocated);
>   }
>
>   /* Initialize a newly allocated zspage */
> @@ -1325,6 +1326,11 @@ static bool zspage_full(struct size_class *class, struct page *first_page)
>   	return get_zspage_inuse(first_page) == class->objs_per_zspage;
>   }
>
> +static bool zspage_empty(struct size_class *class, struct page *first_page)
> +{
> +	return get_zspage_inuse(first_page) == 0;
> +}
> +
>   unsigned long zs_get_total_pages(struct zs_pool *pool)
>   {
>   	return atomic_long_read(&pool->pages_allocated);
> @@ -1455,7 +1461,6 @@ static unsigned long obj_malloc(struct size_class *class,
>   		set_page_private(first_page, handle | OBJ_ALLOCATED_TAG);
>   	kunmap_atomic(vaddr);
>   	mod_zspage_inuse(first_page, 1);
> -	zs_stat_inc(class, OBJ_USED, 1);
>
>   	obj = location_to_obj(m_page, obj);
>
> @@ -1510,6 +1515,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
>   	}
>
>   	obj = obj_malloc(class, first_page, handle);
> +	zs_stat_inc(class, OBJ_USED, 1);
>   	/* Now move the zspage to another fullness group, if required */
>   	fix_fullness_group(class, first_page);
>   	record_obj(handle, obj);
> @@ -1540,7 +1546,6 @@ static void obj_free(struct size_class *class, unsigned long obj)
>   	kunmap_atomic(vaddr);
>   	set_freeobj(first_page, f_objidx);
>   	mod_zspage_inuse(first_page, -1);
> -	zs_stat_dec(class, OBJ_USED, 1);
>   }
>
>   void zs_free(struct zs_pool *pool, unsigned long handle)
> @@ -1564,10 +1569,19 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>
>   	spin_lock(&class->lock);
>   	obj_free(class, obj);
> +	zs_stat_dec(class, OBJ_USED, 1);
>   	fullness = fix_fullness_group(class, first_page);
> -	if (fullness == ZS_EMPTY)
> -		free_zspage(pool, class, first_page);
> +	if (fullness == ZS_EMPTY) {
> +		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> +				class->size, class->pages_per_zspage));
> +		spin_unlock(&class->lock);
> +		atomic_long_sub(class->pages_per_zspage,
> +					&pool->pages_allocated);
> +		free_zspage(pool, first_page);
> +		goto out;
> +	}
>   	spin_unlock(&class->lock);
> +out:
>   	unpin_tag(handle);
>
>   	free_handle(pool, handle);
> @@ -1637,127 +1651,66 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
>   	kunmap_atomic(s_addr);
>   }
>
> -/*
> - * Find alloced object in zspage from index object and
> - * return handle.
> - */
> -static unsigned long find_alloced_obj(struct size_class *class,
> -					struct page *page, int index)
> +static unsigned long handle_from_obj(struct size_class *class,
> +				struct page *first_page, int obj_idx)
>   {
> -	unsigned long head;
> -	int offset = 0;
> -	unsigned long handle = 0;
> -	void *addr = kmap_atomic(page);
> -
> -	if (!is_first_page(page))
> -		offset = page->index;
> -	offset += class->size * index;
> -
> -	while (offset < PAGE_SIZE) {
> -		head = obj_to_head(class, page, addr + offset);
> -		if (head & OBJ_ALLOCATED_TAG) {
> -			handle = head & ~OBJ_ALLOCATED_TAG;
> -			if (trypin_tag(handle))
> -				break;
> -			handle = 0;
> -		}
> +	struct page *page;
> +	unsigned long offset_in_page;
> +	void *addr;
> +	unsigned long head, handle = 0;
>
> -		offset += class->size;
> -		index++;
> -	}
> +	objidx_to_page_and_offset(class, first_page, obj_idx,
> +			&page, &offset_in_page);
>
> +	addr = kmap_atomic(page);
> +	head = obj_to_head(class, page, addr + offset_in_page);
> +	if (head & OBJ_ALLOCATED_TAG)
> +		handle = head & ~OBJ_ALLOCATED_TAG;
>   	kunmap_atomic(addr);
> +
>   	return handle;
>   }
>
> -struct zs_compact_control {
> -	/* Source page for migration which could be a subpage of zspage. */
> -	struct page *s_page;
> -	/* Destination page for migration which should be a first page
> -	 * of zspage. */
> -	struct page *d_page;
> -	 /* Starting object index within @s_page which used for live object
> -	  * in the subpage. */
> -	int index;
> -};
> -
> -static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> -				struct zs_compact_control *cc)
> +static int migrate_zspage(struct size_class *class, struct page *dst_page,
> +				struct page *src_page)
>   {
> -	unsigned long used_obj, free_obj;
>   	unsigned long handle;
> -	struct page *s_page = cc->s_page;
> -	struct page *d_page = cc->d_page;
> -	unsigned long index = cc->index;
> -	int ret = 0;
> +	unsigned long old_obj, new_obj;
> +	int i;
> +	int nr_migrated = 0;
>
> -	while (1) {
> -		handle = find_alloced_obj(class, s_page, index);
> -		if (!handle) {
> -			s_page = get_next_page(s_page);
> -			if (!s_page)
> -				break;
> -			index = 0;
> +	for (i = 0; i < class->objs_per_zspage; i++) {
> +		handle = handle_from_obj(class, src_page, i);
> +		if (!handle)
>   			continue;
> -		}
> -
> -		/* Stop if there is no more space */
> -		if (zspage_full(class, d_page)) {
> -			unpin_tag(handle);
> -			ret = -ENOMEM;
> +		if (zspage_full(class, dst_page))
>   			break;
> -		}
> -
> -		used_obj = handle_to_obj(handle);
> -		free_obj = obj_malloc(class, d_page, handle);
> -		zs_object_copy(class, free_obj, used_obj);
> -		index++;
> +		old_obj = handle_to_obj(handle);
> +		new_obj = obj_malloc(class, dst_page, handle);
> +		zs_object_copy(class, new_obj, old_obj);
> +		nr_migrated++;
>   		/*
>   		 * record_obj updates handle's value to free_obj and it will
>   		 * invalidate lock bit(ie, HANDLE_PIN_BIT) of handle, which
>   		 * breaks synchronization using pin_tag(e,g, zs_free) so
>   		 * let's keep the lock bit.
>   		 */
> -		free_obj |= BIT(HANDLE_PIN_BIT);
> -		record_obj(handle, free_obj);
> -		unpin_tag(handle);
> -		obj_free(class, used_obj);
> +		new_obj |= BIT(HANDLE_PIN_BIT);
> +		record_obj(handle, new_obj);
> +		obj_free(class, old_obj);
>   	}
> -
> -	/* Remember last position in this iteration */
> -	cc->s_page = s_page;
> -	cc->index = index;
> -
> -	return ret;
> -}
> -
> -static struct page *isolate_target_page(struct size_class *class)
> -{
> -	int i;
> -	struct page *page;
> -
> -	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
> -		page = class->fullness_list[i];
> -		if (page) {
> -			remove_zspage(class, i, page);
> -			break;
> -		}
> -	}
> -
> -	return page;
> +	return nr_migrated;
>   }
>
>   /*
>    * putback_zspage - add @first_page into right class's fullness list
> - * @pool: target pool
>    * @class: destination class
>    * @first_page: target page
>    *
>    * Return @first_page's updated fullness_group
>    */
> -static enum fullness_group putback_zspage(struct zs_pool *pool,
> -			struct size_class *class,
> -			struct page *first_page)
> +static enum fullness_group putback_zspage(struct size_class *class,
> +					struct page *first_page)
>   {
>   	enum fullness_group fullness;
>
> @@ -1768,17 +1721,155 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
>   	return fullness;
>   }
>
> +/*
> + * freeze_zspage - freeze all objects in a zspage
> + * @class: size class of the page
> + * @first_page: first page of zspage
> + *
> + * Freeze all allocated objects in a zspage so objects couldn't be
> + * freed until unfreeze objects. It should be called under class->lock.
> + *
> + * RETURNS:
> + * the number of pinned objects
> + */
> +static int freeze_zspage(struct size_class *class, struct page *first_page)
> +{
> +	unsigned long obj_idx;
> +	struct page *obj_page;
> +	unsigned long offset;
> +	void *addr;
> +	int nr_freeze = 0;
> +
> +	for (obj_idx = 0; obj_idx < class->objs_per_zspage; obj_idx++) {
> +		unsigned long head;
> +
> +		objidx_to_page_and_offset(class, first_page, obj_idx,
> +					&obj_page, &offset);
> +		addr = kmap_atomic(obj_page);
> +		head = obj_to_head(class, obj_page, addr + offset);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			unsigned long handle = head & ~OBJ_ALLOCATED_TAG;
> +
> +			if (!trypin_tag(handle)) {
> +				kunmap_atomic(addr);
> +				break;
> +			}
> +			nr_freeze++;
> +		}
> +		kunmap_atomic(addr);
> +	}
> +
> +	return nr_freeze;
> +}
> +
> +/*
> + * unfreeze_page - unfreeze objects freezed by freeze_zspage in a zspage
> + * @class: size class of the page
> + * @first_page: freezed zspage to unfreeze
> + * @nr_obj: the number of objects to unfreeze
> + *
> + * unfreeze objects in a zspage.
> + */
> +static void unfreeze_zspage(struct size_class *class, struct page *first_page,
> +			int nr_obj)
> +{
> +	unsigned long obj_idx;
> +	struct page *obj_page;
> +	unsigned long offset;
> +	void *addr;
> +	int nr_unfreeze = 0;
> +
> +	for (obj_idx = 0; obj_idx < class->objs_per_zspage &&
> +			nr_unfreeze < nr_obj; obj_idx++) {
> +		unsigned long head;
> +
> +		objidx_to_page_and_offset(class, first_page, obj_idx,
> +					&obj_page, &offset);
> +		addr = kmap_atomic(obj_page);
> +		head = obj_to_head(class, obj_page, addr + offset);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			unsigned long handle = head & ~OBJ_ALLOCATED_TAG;
> +
> +			VM_BUG_ON(!testpin_tag(handle));
> +			unpin_tag(handle);
> +			nr_unfreeze++;
> +		}
> +		kunmap_atomic(addr);
> +	}
> +}
> +
> +/*
> + * isolate_source_page - isolate a zspage for migration source
> + * @class: size class of zspage for isolation
> + *
> + * Returns a zspage which are isolated from list so anyone can
> + * allocate a object from that page. As well, freeze all objects
> + * allocated in the zspage so anyone cannot access that objects
> + * (e.g., zs_map_object, zs_free).
> + */
>   static struct page *isolate_source_page(struct size_class *class)
>   {
>   	int i;
>   	struct page *page = NULL;
>
>   	for (i = ZS_ALMOST_EMPTY; i >= ZS_ALMOST_FULL; i--) {
> +		int inuse, freezed;
> +
>   		page = class->fullness_list[i];
>   		if (!page)
>   			continue;
>
>   		remove_zspage(class, i, page);
> +
> +		inuse = get_zspage_inuse(page);
> +		freezed = freeze_zspage(class, page);
> +
> +		if (inuse != freezed) {
> +			unfreeze_zspage(class, page, freezed);
> +			putback_zspage(class, page);
> +			page = NULL;
> +			continue;
> +		}
> +
> +		break;
> +	}
> +
> +	return page;
> +}
> +
> +/*
> + * isolate_target_page - isolate a zspage for migration target
> + * @class: size class of zspage for isolation
> + *
> + * Returns a zspage which are isolated from list so anyone can
> + * allocate a object from that page. As well, freeze all objects
> + * allocated in the zspage so anyone cannot access that objects
> + * (e.g., zs_map_object, zs_free).
> + */
> +static struct page *isolate_target_page(struct size_class *class)
> +{
> +	int i;
> +	struct page *page;
> +
> +	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
> +		int inuse, freezed;
> +
> +		page = class->fullness_list[i];
> +		if (!page)
> +			continue;
> +
> +		remove_zspage(class, i, page);
> +
> +		inuse = get_zspage_inuse(page);
> +		freezed = freeze_zspage(class, page);
> +
> +		if (inuse != freezed) {
> +			unfreeze_zspage(class, page, freezed);
> +			putback_zspage(class, page);
> +			page = NULL;
> +			continue;
> +		}
> +
>   		break;
>   	}
>
> @@ -1793,9 +1884,11 @@ static struct page *isolate_source_page(struct size_class *class)
>   static unsigned long zs_can_compact(struct size_class *class)
>   {
>   	unsigned long obj_wasted;
> +	unsigned long obj_allocated, obj_used;
>
> -	obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
> -		zs_stat_get(class, OBJ_USED);
> +	obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
> +	obj_used = zs_stat_get(class, OBJ_USED);
> +	obj_wasted = obj_allocated - obj_used;
>
>   	obj_wasted /= get_maxobj_per_zspage(class->size,
>   			class->pages_per_zspage);
> @@ -1805,53 +1898,81 @@ static unsigned long zs_can_compact(struct size_class *class)
>
>   static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>   {
> -	struct zs_compact_control cc;
> -	struct page *src_page;
> +	struct page *src_page = NULL;
>   	struct page *dst_page = NULL;
>
> -	spin_lock(&class->lock);
> -	while ((src_page = isolate_source_page(class))) {
> +	while (1) {
> +		int nr_migrated;
>
> -		if (!zs_can_compact(class))
> +		spin_lock(&class->lock);
> +		if (!zs_can_compact(class)) {
> +			spin_unlock(&class->lock);
>   			break;
> +		}
>
> -		cc.index = 0;
> -		cc.s_page = src_page;
> +		/*
> +		 * Isolate source page and freeze all objects in a zspage
> +		 * to prevent zspage destroying.
> +		 */
> +		if (!src_page) {
> +			src_page = isolate_source_page(class);
> +			if (!src_page) {
> +				spin_unlock(&class->lock);
> +				break;
> +			}
> +		}
>
> -		while ((dst_page = isolate_target_page(class))) {
> -			cc.d_page = dst_page;
> -			/*
> -			 * If there is no more space in dst_page, resched
> -			 * and see if anyone had allocated another zspage.
> -			 */
> -			if (!migrate_zspage(pool, class, &cc))
> +		/* Isolate target page and freeze all objects in the zspage */
> +		if (!dst_page) {
> +			dst_page = isolate_target_page(class);
> +			if (!dst_page) {
> +				spin_unlock(&class->lock);
>   				break;
> +			}
> +		}
> +		spin_unlock(&class->lock);

(Sorry to delete individual recipients due to my compliance issues.)

Hello, Minchan.


Is it safe to unlock?


(I assume that the system has 2 cores
and a swap device is using zsmalloc pool.)
If the zs compact context is scheduled out after this "spin_unlock" line,


    CPU A (Swap In)                CPU B (zs_free by process killed)
---------------------           -------------------------
                                 ...
                                 spin_lock(&si->lock)
                                 ...
                                # assume it is pinned by zs_compact context.
                                 pin_tag(handle) --> block

...
spin_lock(&si->lock) --> block


I think CPU A and CPU B may be blocked forever.
Am I missing something?

Thanks.
Chulmin


> +
> +		nr_migrated = migrate_zspage(class, dst_page, src_page);
>
> -			VM_BUG_ON_PAGE(putback_zspage(pool, class,
> -				dst_page) == ZS_EMPTY, dst_page);
> +		if (zspage_full(class, dst_page)) {
> +			spin_lock(&class->lock);
> +			putback_zspage(class, dst_page);
> +			unfreeze_zspage(class, dst_page,
> +				class->objs_per_zspage);
> +			spin_unlock(&class->lock);
> +			dst_page = NULL;
>   		}
>
> -		/* Stop if we couldn't find slot */
> -		if (dst_page == NULL)
> -			break;
> +		if (zspage_empty(class, src_page)) {
> +			free_zspage(pool, src_page);
> +			spin_lock(&class->lock);
> +			zs_stat_dec(class, OBJ_ALLOCATED,
> +				get_maxobj_per_zspage(
> +				class->size, class->pages_per_zspage));
> +			atomic_long_sub(class->pages_per_zspage,
> +					&pool->pages_allocated);
>
> -		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> -				dst_page) == ZS_EMPTY, dst_page);
> -		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
>   			pool->stats.pages_compacted += class->pages_per_zspage;
>   			spin_unlock(&class->lock);
> -			free_zspage(pool, class, src_page);
> -		} else {
> -			spin_unlock(&class->lock);
> +			src_page = NULL;
>   		}
> +	}
>
> -		cond_resched();
> -		spin_lock(&class->lock);
> +	if (!src_page && !dst_page)
> +		return;
> +
> +	spin_lock(&class->lock);
> +	if (src_page) {
> +		putback_zspage(class, src_page);
> +		unfreeze_zspage(class, src_page,
> +				class->objs_per_zspage);
>   	}
>
> -	if (src_page)
> -		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> -				src_page) == ZS_EMPTY, src_page);
> +	if (dst_page) {
> +		putback_zspage(class, dst_page);
> +		unfreeze_zspage(class, dst_page,
> +				class->objs_per_zspage);
> +	}
>
>   	spin_unlock(&class->lock);
>   }
>


* Re: [PATCH v3 12/16] zsmalloc: zs_compact refactoring
  2016-04-04  8:04   ` Chulmin Kim
@ 2016-04-04  9:01     ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-04  9:01 UTC (permalink / raw)
  To: Chulmin Kim; +Cc: Andrew Morton, linux-kernel, linux-mm

Hello Chulmin,

On Mon, Apr 04, 2016 at 05:04:16PM +0900, Chulmin Kim wrote:
> On 2016-03-30 16:12, Minchan Kim wrote:
> >Currently, we rely on class->lock to prevent zspage destruction.
> >It was okay until now because the critical section is short but
> >with run-time migration, it could be long so class->lock is not
> >a good apporach any more.
> >
> >So, this patch introduces [un]freeze_zspage functions which
> >freeze allocated objects in the zspage with pinning tag so
> >user cannot free using object. With those functions, this patch
> >redesign compaction.
> >
> >Those functions will be used for implementing zspage runtime
> >migrations, too.
> >
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  mm/zsmalloc.c | 393 ++++++++++++++++++++++++++++++++++++++--------------------
> >  1 file changed, 257 insertions(+), 136 deletions(-)
> >
> >diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> >index b11dcd718502..ac8ca7b10720 100644
> >--- a/mm/zsmalloc.c
> >+++ b/mm/zsmalloc.c
> >@@ -922,6 +922,13 @@ static unsigned long obj_to_head(struct size_class *class, struct page *page,
> >  		return *(unsigned long *)obj;
> >  }
> >
> >+static inline int testpin_tag(unsigned long handle)
> >+{
> >+	unsigned long *ptr = (unsigned long *)handle;
> >+
> >+	return test_bit(HANDLE_PIN_BIT, ptr);
> >+}
> >+
> >  static inline int trypin_tag(unsigned long handle)
> >  {
> >  	unsigned long *ptr = (unsigned long *)handle;
> >@@ -949,8 +956,7 @@ static void reset_page(struct page *page)
> >  	page->freelist = NULL;
> >  }
> >
> >-static void free_zspage(struct zs_pool *pool, struct size_class *class,
> >-			struct page *first_page)
> >+static void free_zspage(struct zs_pool *pool, struct page *first_page)
> >  {
> >  	struct page *nextp, *tmp, *head_extra;
> >
> >@@ -973,11 +979,6 @@ static void free_zspage(struct zs_pool *pool, struct size_class *class,
> >  	}
> >  	reset_page(head_extra);
> >  	__free_page(head_extra);
> >-
> >-	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >-			class->size, class->pages_per_zspage));
> >-	atomic_long_sub(class->pages_per_zspage,
> >-				&pool->pages_allocated);
> >  }
> >
> >  /* Initialize a newly allocated zspage */
> >@@ -1325,6 +1326,11 @@ static bool zspage_full(struct size_class *class, struct page *first_page)
> >  	return get_zspage_inuse(first_page) == class->objs_per_zspage;
> >  }
> >
> >+static bool zspage_empty(struct size_class *class, struct page *first_page)
> >+{
> >+	return get_zspage_inuse(first_page) == 0;
> >+}
> >+
> >  unsigned long zs_get_total_pages(struct zs_pool *pool)
> >  {
> >  	return atomic_long_read(&pool->pages_allocated);
> >@@ -1455,7 +1461,6 @@ static unsigned long obj_malloc(struct size_class *class,
> >  		set_page_private(first_page, handle | OBJ_ALLOCATED_TAG);
> >  	kunmap_atomic(vaddr);
> >  	mod_zspage_inuse(first_page, 1);
> >-	zs_stat_inc(class, OBJ_USED, 1);
> >
> >  	obj = location_to_obj(m_page, obj);
> >
> >@@ -1510,6 +1515,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
> >  	}
> >
> >  	obj = obj_malloc(class, first_page, handle);
> >+	zs_stat_inc(class, OBJ_USED, 1);
> >  	/* Now move the zspage to another fullness group, if required */
> >  	fix_fullness_group(class, first_page);
> >  	record_obj(handle, obj);
> >@@ -1540,7 +1546,6 @@ static void obj_free(struct size_class *class, unsigned long obj)
> >  	kunmap_atomic(vaddr);
> >  	set_freeobj(first_page, f_objidx);
> >  	mod_zspage_inuse(first_page, -1);
> >-	zs_stat_dec(class, OBJ_USED, 1);
> >  }
> >
> >  void zs_free(struct zs_pool *pool, unsigned long handle)
> >@@ -1564,10 +1569,19 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> >
> >  	spin_lock(&class->lock);
> >  	obj_free(class, obj);
> >+	zs_stat_dec(class, OBJ_USED, 1);
> >  	fullness = fix_fullness_group(class, first_page);
> >-	if (fullness == ZS_EMPTY)
> >-		free_zspage(pool, class, first_page);
> >+	if (fullness == ZS_EMPTY) {
> >+		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >+				class->size, class->pages_per_zspage));
> >+		spin_unlock(&class->lock);
> >+		atomic_long_sub(class->pages_per_zspage,
> >+					&pool->pages_allocated);
> >+		free_zspage(pool, first_page);
> >+		goto out;
> >+	}
> >  	spin_unlock(&class->lock);
> >+out:
> >  	unpin_tag(handle);
> >
> >  	free_handle(pool, handle);
> >@@ -1637,127 +1651,66 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
> >  	kunmap_atomic(s_addr);
> >  }
> >
> >-/*
> >- * Find alloced object in zspage from index object and
> >- * return handle.
> >- */
> >-static unsigned long find_alloced_obj(struct size_class *class,
> >-					struct page *page, int index)
> >+static unsigned long handle_from_obj(struct size_class *class,
> >+				struct page *first_page, int obj_idx)
> >  {
> >-	unsigned long head;
> >-	int offset = 0;
> >-	unsigned long handle = 0;
> >-	void *addr = kmap_atomic(page);
> >-
> >-	if (!is_first_page(page))
> >-		offset = page->index;
> >-	offset += class->size * index;
> >-
> >-	while (offset < PAGE_SIZE) {
> >-		head = obj_to_head(class, page, addr + offset);
> >-		if (head & OBJ_ALLOCATED_TAG) {
> >-			handle = head & ~OBJ_ALLOCATED_TAG;
> >-			if (trypin_tag(handle))
> >-				break;
> >-			handle = 0;
> >-		}
> >+	struct page *page;
> >+	unsigned long offset_in_page;
> >+	void *addr;
> >+	unsigned long head, handle = 0;
> >
> >-		offset += class->size;
> >-		index++;
> >-	}
> >+	objidx_to_page_and_offset(class, first_page, obj_idx,
> >+			&page, &offset_in_page);
> >
> >+	addr = kmap_atomic(page);
> >+	head = obj_to_head(class, page, addr + offset_in_page);
> >+	if (head & OBJ_ALLOCATED_TAG)
> >+		handle = head & ~OBJ_ALLOCATED_TAG;
> >  	kunmap_atomic(addr);
> >+
> >  	return handle;
> >  }
> >
> >-struct zs_compact_control {
> >-	/* Source page for migration which could be a subpage of zspage. */
> >-	struct page *s_page;
> >-	/* Destination page for migration which should be a first page
> >-	 * of zspage. */
> >-	struct page *d_page;
> >-	 /* Starting object index within @s_page which used for live object
> >-	  * in the subpage. */
> >-	int index;
> >-};
> >-
> >-static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
> >-				struct zs_compact_control *cc)
> >+static int migrate_zspage(struct size_class *class, struct page *dst_page,
> >+				struct page *src_page)
> >  {
> >-	unsigned long used_obj, free_obj;
> >  	unsigned long handle;
> >-	struct page *s_page = cc->s_page;
> >-	struct page *d_page = cc->d_page;
> >-	unsigned long index = cc->index;
> >-	int ret = 0;
> >+	unsigned long old_obj, new_obj;
> >+	int i;
> >+	int nr_migrated = 0;
> >
> >-	while (1) {
> >-		handle = find_alloced_obj(class, s_page, index);
> >-		if (!handle) {
> >-			s_page = get_next_page(s_page);
> >-			if (!s_page)
> >-				break;
> >-			index = 0;
> >+	for (i = 0; i < class->objs_per_zspage; i++) {
> >+		handle = handle_from_obj(class, src_page, i);
> >+		if (!handle)
> >  			continue;
> >-		}
> >-
> >-		/* Stop if there is no more space */
> >-		if (zspage_full(class, d_page)) {
> >-			unpin_tag(handle);
> >-			ret = -ENOMEM;
> >+		if (zspage_full(class, dst_page))
> >  			break;
> >-		}
> >-
> >-		used_obj = handle_to_obj(handle);
> >-		free_obj = obj_malloc(class, d_page, handle);
> >-		zs_object_copy(class, free_obj, used_obj);
> >-		index++;
> >+		old_obj = handle_to_obj(handle);
> >+		new_obj = obj_malloc(class, dst_page, handle);
> >+		zs_object_copy(class, new_obj, old_obj);
> >+		nr_migrated++;
> >  		/*
> >  		 * record_obj updates handle's value to free_obj and it will
> >  		 * invalidate lock bit(ie, HANDLE_PIN_BIT) of handle, which
> >  		 * breaks synchronization using pin_tag(e,g, zs_free) so
> >  		 * let's keep the lock bit.
> >  		 */
> >-		free_obj |= BIT(HANDLE_PIN_BIT);
> >-		record_obj(handle, free_obj);
> >-		unpin_tag(handle);
> >-		obj_free(class, used_obj);
> >+		new_obj |= BIT(HANDLE_PIN_BIT);
> >+		record_obj(handle, new_obj);
> >+		obj_free(class, old_obj);
> >  	}
> >-
> >-	/* Remember last position in this iteration */
> >-	cc->s_page = s_page;
> >-	cc->index = index;
> >-
> >-	return ret;
> >-}
> >-
> >-static struct page *isolate_target_page(struct size_class *class)
> >-{
> >-	int i;
> >-	struct page *page;
> >-
> >-	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
> >-		page = class->fullness_list[i];
> >-		if (page) {
> >-			remove_zspage(class, i, page);
> >-			break;
> >-		}
> >-	}
> >-
> >-	return page;
> >+	return nr_migrated;
> >  }
> >
> >  /*
> >   * putback_zspage - add @first_page into right class's fullness list
> >- * @pool: target pool
> >   * @class: destination class
> >   * @first_page: target page
> >   *
> >   * Return @first_page's updated fullness_group
> >   */
> >-static enum fullness_group putback_zspage(struct zs_pool *pool,
> >-			struct size_class *class,
> >-			struct page *first_page)
> >+static enum fullness_group putback_zspage(struct size_class *class,
> >+					struct page *first_page)
> >  {
> >  	enum fullness_group fullness;
> >
> >@@ -1768,17 +1721,155 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
> >  	return fullness;
> >  }
> >
> >+/*
> >+ * freeze_zspage - freeze all objects in a zspage
> >+ * @class: size class of the page
> >+ * @first_page: first page of zspage
> >+ *
> >+ * Freeze all allocated objects in a zspage so objects couldn't be
> >+ * freed until unfreeze objects. It should be called under class->lock.
> >+ *
> >+ * RETURNS:
> >+ * the number of pinned objects
> >+ */
> >+static int freeze_zspage(struct size_class *class, struct page *first_page)
> >+{
> >+	unsigned long obj_idx;
> >+	struct page *obj_page;
> >+	unsigned long offset;
> >+	void *addr;
> >+	int nr_freeze = 0;
> >+
> >+	for (obj_idx = 0; obj_idx < class->objs_per_zspage; obj_idx++) {
> >+		unsigned long head;
> >+
> >+		objidx_to_page_and_offset(class, first_page, obj_idx,
> >+					&obj_page, &offset);
> >+		addr = kmap_atomic(obj_page);
> >+		head = obj_to_head(class, obj_page, addr + offset);
> >+		if (head & OBJ_ALLOCATED_TAG) {
> >+			unsigned long handle = head & ~OBJ_ALLOCATED_TAG;
> >+
> >+			if (!trypin_tag(handle)) {
> >+				kunmap_atomic(addr);
> >+				break;
> >+			}
> >+			nr_freeze++;
> >+		}
> >+		kunmap_atomic(addr);
> >+	}
> >+
> >+	return nr_freeze;
> >+}
> >+
> >+/*
> >+ * unfreeze_page - unfreeze objects freezed by freeze_zspage in a zspage
> >+ * @class: size class of the page
> >+ * @first_page: freezed zspage to unfreeze
> >+ * @nr_obj: the number of objects to unfreeze
> >+ *
> >+ * unfreeze objects in a zspage.
> >+ */
> >+static void unfreeze_zspage(struct size_class *class, struct page *first_page,
> >+			int nr_obj)
> >+{
> >+	unsigned long obj_idx;
> >+	struct page *obj_page;
> >+	unsigned long offset;
> >+	void *addr;
> >+	int nr_unfreeze = 0;
> >+
> >+	for (obj_idx = 0; obj_idx < class->objs_per_zspage &&
> >+			nr_unfreeze < nr_obj; obj_idx++) {
> >+		unsigned long head;
> >+
> >+		objidx_to_page_and_offset(class, first_page, obj_idx,
> >+					&obj_page, &offset);
> >+		addr = kmap_atomic(obj_page);
> >+		head = obj_to_head(class, obj_page, addr + offset);
> >+		if (head & OBJ_ALLOCATED_TAG) {
> >+			unsigned long handle = head & ~OBJ_ALLOCATED_TAG;
> >+
> >+			VM_BUG_ON(!testpin_tag(handle));
> >+			unpin_tag(handle);
> >+			nr_unfreeze++;
> >+		}
> >+		kunmap_atomic(addr);
> >+	}
> >+}
> >+
> >+/*
> >+ * isolate_source_page - isolate a zspage for migration source
> >+ * @class: size class of zspage for isolation
> >+ *
> >+ * Returns a zspage which are isolated from list so anyone can
> >+ * allocate a object from that page. As well, freeze all objects
> >+ * allocated in the zspage so anyone cannot access that objects
> >+ * (e.g., zs_map_object, zs_free).
> >+ */
> >  static struct page *isolate_source_page(struct size_class *class)
> >  {
> >  	int i;
> >  	struct page *page = NULL;
> >
> >  	for (i = ZS_ALMOST_EMPTY; i >= ZS_ALMOST_FULL; i--) {
> >+		int inuse, freezed;
> >+
> >  		page = class->fullness_list[i];
> >  		if (!page)
> >  			continue;
> >
> >  		remove_zspage(class, i, page);
> >+
> >+		inuse = get_zspage_inuse(page);
> >+		freezed = freeze_zspage(class, page);
> >+
> >+		if (inuse != freezed) {
> >+			unfreeze_zspage(class, page, freezed);
> >+			putback_zspage(class, page);
> >+			page = NULL;
> >+			continue;
> >+		}
> >+
> >+		break;
> >+	}
> >+
> >+	return page;
> >+}
> >+
> >+/*
> >+ * isolate_target_page - isolate a zspage for migration target
> >+ * @class: size class of zspage for isolation
> >+ *
> >+ * Returns a zspage which are isolated from list so anyone can
> >+ * allocate a object from that page. As well, freeze all objects
> >+ * allocated in the zspage so anyone cannot access that objects
> >+ * (e.g., zs_map_object, zs_free).
> >+ */
> >+static struct page *isolate_target_page(struct size_class *class)
> >+{
> >+	int i;
> >+	struct page *page;
> >+
> >+	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
> >+		int inuse, freezed;
> >+
> >+		page = class->fullness_list[i];
> >+		if (!page)
> >+			continue;
> >+
> >+		remove_zspage(class, i, page);
> >+
> >+		inuse = get_zspage_inuse(page);
> >+		freezed = freeze_zspage(class, page);
> >+
> >+		if (inuse != freezed) {
> >+			unfreeze_zspage(class, page, freezed);
> >+			putback_zspage(class, page);
> >+			page = NULL;
> >+			continue;
> >+		}
> >+
> >  		break;
> >  	}
> >
> >@@ -1793,9 +1884,11 @@ static struct page *isolate_source_page(struct size_class *class)
> >  static unsigned long zs_can_compact(struct size_class *class)
> >  {
> >  	unsigned long obj_wasted;
> >+	unsigned long obj_allocated, obj_used;
> >
> >-	obj_wasted = zs_stat_get(class, OBJ_ALLOCATED) -
> >-		zs_stat_get(class, OBJ_USED);
> >+	obj_allocated = zs_stat_get(class, OBJ_ALLOCATED);
> >+	obj_used = zs_stat_get(class, OBJ_USED);
> >+	obj_wasted = obj_allocated - obj_used;
> >
> >  	obj_wasted /= get_maxobj_per_zspage(class->size,
> >  			class->pages_per_zspage);
> >@@ -1805,53 +1898,81 @@ static unsigned long zs_can_compact(struct size_class *class)
> >
> >  static void __zs_compact(struct zs_pool *pool, struct size_class *class)
> >  {
> >-	struct zs_compact_control cc;
> >-	struct page *src_page;
> >+	struct page *src_page = NULL;
> >  	struct page *dst_page = NULL;
> >
> >-	spin_lock(&class->lock);
> >-	while ((src_page = isolate_source_page(class))) {
> >+	while (1) {
> >+		int nr_migrated;
> >
> >-		if (!zs_can_compact(class))
> >+		spin_lock(&class->lock);
> >+		if (!zs_can_compact(class)) {
> >+			spin_unlock(&class->lock);
> >  			break;
> >+		}
> >
> >-		cc.index = 0;
> >-		cc.s_page = src_page;
> >+		/*
> >+		 * Isolate source page and freeze all objects in a zspage
> >+		 * to prevent zspage destroying.
> >+		 */
> >+		if (!src_page) {
> >+			src_page = isolate_source_page(class);
> >+			if (!src_page) {
> >+				spin_unlock(&class->lock);
> >+				break;
> >+			}
> >+		}
> >
> >-		while ((dst_page = isolate_target_page(class))) {
> >-			cc.d_page = dst_page;
> >-			/*
> >-			 * If there is no more space in dst_page, resched
> >-			 * and see if anyone had allocated another zspage.
> >-			 */
> >-			if (!migrate_zspage(pool, class, &cc))
> >+		/* Isolate target page and freeze all objects in the zspage */
> >+		if (!dst_page) {
> >+			dst_page = isolate_target_page(class);
> >+			if (!dst_page) {
> >+				spin_unlock(&class->lock);
> >  				break;
> >+			}
> >+		}
> >+		spin_unlock(&class->lock);
> 
> (Sorry to delete individual recipients due to my compliance issues.)
> 
> Hello, Minchan.
> 
> 
> Is it safe to unlock?
> 
> 
> (I assume that the system has 2 cores
> and a swap device is using zsmalloc pool.)
> If the zs compact context is scheduled out after this "spin_unlock" line,
> 
> 
>    CPU A (Swap In)                CPU B (zs_free by process killed)
> ---------------------           -------------------------
>                                 ...
>                                 spin_lock(&si->lock)
>                                 ...
>                                # assume it is pinned by zs_compact context.
>                                 pin_tag(handle) --> block
> 
> ...
> spin_lock(&si->lock) --> block
> 
> 
> I think CPU A and CPU B may be blocked forever.
> Am I missing something?

You didn't miss anything. It could deadlock.
swap_slot_free_notify has always been a real problem. :(
That's why I want to remove it.
I will think it over and send a fix in the next revision.
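
For the record, the chain I worry about is roughly (from memory, so the
exact call sites may differ):

	swap slot free (si->lock held)
	  -> swap_slot_free_notify
	    -> zram_slot_free_notify
	      -> zram_free_page
	        -> zs_free
	          -> pin_tag(handle)	/* waits while compaction holds the pin */

so anyone waiting on a pinned handle while holding si->lock can block the
swap-in path on the other CPU, exactly as in your diagram.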

Thanks for the review!


* Re: [PATCH v3 03/16] mm: add non-lru movable page support document
  2016-04-04  2:25     ` Minchan Kim
@ 2016-04-04 13:09       ` Vlastimil Babka
  2016-04-07  2:27         ` Minchan Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2016-04-04 13:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu,
	Jonathan Corbet

On 04/04/2016 04:25 AM, Minchan Kim wrote:
>>
>> Ah, I see, so it's designed with page lock to handle the concurrent isolations etc.
>>
>> In http://marc.info/?l=linux-mm&m=143816716511904&w=2 Mel has warned
>> about doing this in general under page_lock and suggested that each
>> user handles concurrent calls to isolate_page() internally. Might be
>> more generic that way, even if all current implementers will
>> actually use the page lock.
>
> We need PG_lock for two reasons.
>
> Firstly, it guarantees page's flags operation(i.e., PG_movable, PG_isolated)
> atomicity. Another thing is for stability for page->mapping->a_ops.
>
> For example,
>
> isolate_migratepages_block
>          if (PageMovable(page))
>                  isolate_movable_page
>                          get_page_unless_zero <--- 1
>                          trylock_page
>                          page->mapping->a_ops->isolate_page <--- 2
>
> Between 1 and 2, driver can nullify page->mapping so we need PG_lock

Hmm I see, that really doesn't seem easily solvable without page_lock.
My idea is that the compaction code would just check PageMovable() and
PageIsolated() to find a candidate. page->mapping->a_ops->isolate_page
would do the necessary driver-specific locking, revalidate the page state
and either succeed the isolation or fail. It would need to handle the
possibility that the page no longer belongs to the mapping, which is
probably not a problem. But what if the driver is a module that was
already unloaded? Even though we NULL-checked every step from the page to
isolate_page, the pointer could lead to a function that's already gone.
That would need some extra handling to prevent it, hm...
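
Something like this on the driver side is what I'm imagining (a completely
hypothetical driver with made-up field names, and the callback prototype
may differ from what the series actually defines):

static bool my_drv_isolate_page(struct page *page, isolate_mode_t mode)
{
	struct my_drv_pool *pool = page->mapping->private_data;	/* hypothetical */
	bool ret = false;

	spin_lock(&pool->lock);
	/* revalidate under the driver's lock: the page may have been freed/moved */
	if (page->mapping == pool->mapping && !PageIsolated(page)) {
		__SetPageIsolated(page);
		ret = true;
	}
	spin_unlock(&pool->lock);
	return ret;
}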

>>
>>> +2. Migration
>>> +
>>> +After successful isolation, VM calls migratepage. The migratepage's goal is
>>> +to move content of the old page to new page and set up struct page fields
>>> +of new page. If migration is successful, subsystem should release old page's
>>> +refcount to free. Keep in mind that subsystem should clear PG_movable and
>>> +PG_isolated before releasing the refcount.  If everything are done, user
>>> +should return MIGRATEPAGE_SUCCESS. If subsystem cannot migrate the page
>>> +at the moment, migratepage can return -EAGAIN. On -EAGAIN, VM will retry page
>>> +migration because VM interprets -EAGAIN as "temporal migration failure".
>>> +
>>> +3. Putback
>>> +
>>> +If migration was unsuccessful, VM calls putback_page. The subsystem should
>>> +insert isolated page to own data structure again if it has. And subsystem
>>> +should clear PG_isolated which was marked in isolation step.
>>> +
>>> +Note about releasing page:
>>> +
>>> +Subsystem can release pages whenever it want but if it releses the page
>>> +which is already isolated, it should clear PG_isolated but doesn't touch
>>> +PG_movable under PG_lock. Instead of it, VM will clear PG_movable after
>>> +his job done. Otherweise, subsystem should clear both page flags before
>>> +releasing the page.
>>
>> I don't understand this right now. But maybe I will get it after
>> reading the patches and suggest some improved wording here.
>
> I will try to explain why such rule happens in there.
>
> The problem is that put_page is aware of PageLRU. So, if someone releases
> last refcount of LRU page, __put_page checks PageLRU and then, clear the
> flags and detatch the page in LRU list(i.e., data structure).
> But in case of driver page, data structure like LRU among drivers is not only one.
> IOW, we should add following code in put_page to handle various requirements
> of driver page.
>
> void __put_page(struct page *page)
> {
>          if (PageMovable(page)) {
>                  /*
>                   * It will tity up driver's data structure like LRU
>                   * and reset page's flags. And it should be atomic
>                   * and always successful
>                   */
>                  page->put(page);
>                  __ClearPageMovable(page);
>          } else if (PageCompound(page))
>                  __put_compound_page(page);
>          else
>                  __put_single_page(page);
>
> }
>
> I'd like to avoid add new branch for not popular job in put_page which is hot.
> (Might change in future but not popular at the moment)
> So, rule of driver is as follows.
>
> When the driver releases the page and he found the page is PG_isolated,
> he should unmark only PG_isolated, not PG_movable so migration side of
> VM can catch it up "Hmm, the isolated non-lru page doesn't have PG_isolated
> any more. It means drivers releases the page. So, let's put the page
> instead of putback operation".
>
> When the driver releases the page and he doesn't see PG_isolated mark
> of the page, driver should reset both PG_isolated and PG_movable.

Yeah, I think I understand now, thanks for the explanation. But since I
found the "freeing isolated page" part to be racy in the 02/16 subthread,
it might be premature to improve the wording now :/
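
Just to check my understanding of the rule, the driver's release path would
then look roughly like this (my sketch, assuming it runs under lock_page()):

	lock_page(page);
	if (PageIsolated(page)) {
		/* migration has it isolated; the VM will clear PG_movable */
		__ClearPageIsolated(page);
	} else {
		/* not isolated: the driver resets both flags itself */
		__ClearPageIsolated(page);
		__ClearPageMovable(page);
	}
	unlock_page(page);
	put_page(page);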


* Re: [PATCH v3 00/16] Support non-lru page migration
  2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
                   ` (16 preceding siblings ...)
  2016-03-30 23:11 ` [PATCH v3 00/16] Support non-lru page migration Andrew Morton
@ 2016-04-04 13:17 ` John Einar Reitan
  2016-04-11  4:35   ` Minchan Kim
  17 siblings, 1 reply; 65+ messages in thread
From: John Einar Reitan @ 2016-04-04 13:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu


On Wed, Mar 30, 2016 at 04:11:59PM +0900, Minchan Kim wrote:
> Recently, I got many reports about perfermance degradation
> in embedded system(Android mobile phone, webOS TV and so on)
> and failed to fork easily.
> 
> The problem was fragmentation caused by zram and GPU driver
> pages. Their pages cannot be migrated so compaction cannot
> work well, either so reclaimer ends up shrinking all of working
> set pages. It made system very slow and even to fail to fork
> easily.
> 
> Other pain point is that they cannot work with CMA.
> Most of CMA memory space could be idle(ie, it could be used
> for movable pages unless driver is using) but if driver(i.e.,
> zram) cannot migrate his page, that memory space could be
> wasted. In our product which has big CMA memory, it reclaims
> zones too exccessively although there are lots of free space
> in CMA so system was very slow easily.
> 
> To solve these problem, this patch try to add facility to
> migrate non-lru pages via introducing new friend functions
> of migratepage in address_space_operation and new page flags.
> 
> 	(isolate_page, putback_page)
> 	(PG_movable, PG_isolated)
> 
> For details, please read description in
> "mm/compaction: support non-lru movable page migration".

Thanks, this mirrors what we see with the ARM Mali GPU drivers too.

One thing about the current design which worries me is the potential
for many separate calls when a lot of individual pages are migrated.
On GPUs (or any other device) that have an IOMMU and an L2 cache which
isn't coherent with the CPU, we must do an L2 cache flush & invalidation
per page. I guess batching pages isn't easily possible?
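
To make the per-page cost concrete, a driver's migratepage callback would
end up doing something like the following for every single page (a purely
hypothetical driver; the gpu_* helpers are made up):

static int gpu_migratepage(struct address_space *mapping,
			   struct page *newpage, struct page *page,
			   enum migrate_mode mode)
{
	/* per-page maintenance, since the GPU L2 is not coherent with the CPU */
	gpu_l2_flush_and_invalidate(page);	/* made-up helper */
	copy_highpage(newpage, page);
	gpu_iommu_remap(page, newpage);		/* made-up helper */
	return MIGRATEPAGE_SUCCESS;
}

With thousands of pages being compacted, that is thousands of individual
flush & invalidate operations, which is why some form of batching would
help a lot.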




* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
  2016-04-04  5:12     ` Minchan Kim
@ 2016-04-04 13:24       ` Vlastimil Babka
  2016-04-07  2:35         ` Minchan Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2016-04-04 13:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, dri-devel,
	Gioh Kim

On 04/04/2016 07:12 AM, Minchan Kim wrote:
> On Fri, Apr 01, 2016 at 11:29:14PM +0200, Vlastimil Babka wrote:
>> Might have been better as a separate migration patch and then a
>> compaction patch. It's prefixed mm/compaction, but most changed are
>> in mm/migrate.c
>
> Indeed. The title is rather misleading, but I'm not sure it's a good idea
> to separate the compaction and migration parts.

I guess it's better to see the new functions together with their user
after all, OK.

> I will just resend it with the title changed from "mm/compaction" to
> "mm/migration".

OK!

>> Also I'm a bit uncomfortable how isolate_movable_page() blindly expects that
>> page->mapping->a_ops->isolate_page exists for PageMovable() pages.
>> What if it's a false positive on a PG_reclaim page? Can we rely on
>> PG_reclaim always (and without races) implying PageLRU() so that we
>> don't even attempt isolate_movable_page()?
>
> For now, we shouldn't have such a false positive because PageMovable
> checks page->_mapcount == PAGE_MOVABLE_MAPCOUNT_VALUE as well as PG_movable
> under PG_lock.
>
> But I read your question about user-mapped driver pages, so we cannot
> use _mapcount anymore and I will find another approach. An option is this.
>
> static inline int PageMovable(struct page *page)
> {
>          int ret = 0;
>          struct address_space *mapping;
>          struct address_space_operations *a_op;
>
>          if (!test_bit(PG_movable, &(page->flags))
>                  goto out;
>
>          mapping = page->mapping;
>          if (!mapping)
>                  goto out;
>
>          a_op = mapping->a_op;
>          if (!aop)
>                  goto out;
>          if (a_op->isolate_page)
>                  ret = 1;
> out:
>          return ret;
>
> }
>
> It works under PG_lock but with this, we need trylock_page to peek
> whether it's movable non-lru or not for scanning pfn.

Hm I hoped that with READ_ONCE() we could do the peek safely without 
trylock_page, if we use it only as a heuristic. But I guess it would 
require at least RCU-level protection of the 
page->mapping->a_op->isolate_page chain.

> For avoiding that, we need another function to peek which just checks
> PG_movable bit instead of all things.
>
>
> /*
>   * If @page_locked is false, we cannot guarantee page->mapping's stability
>   * so just the function checks with PG_movable which could be false positive
>   * so caller should check it again under PG_lock to check a_ops->isolate_page.
>   */
> static inline int PageMovable(struct page *page, bool page_locked)
> {
>          int ret = 0;
>          struct address_space *mapping;
>          struct address_space_operations *a_op;
>
>          if (!test_bit(PG_movable, &(page->flags))
>                  goto out;
>
>          if (!page_locked) {
>                  ret = 1;
>                  goto out;
>          }
>
>          mapping = page->mapping;
>          if (!mapping)
>                  goto out;
>
>          a_op = mapping->a_op;
>          if (!aop)
>                  goto out;
>          if (a_op->isolate_page)
>                  ret = 1;
> out:
>          return ret;
> }

I wouldn't put everything into a single function, but create something
like __PageMovable() just for the unlocked peek. Unlike zone->lru_lock,
we don't keep page_lock() across iterations in
isolate_migratepages_block(), as obviously each page has a different lock.
So the page_locked parameter would always be passed as a constant, and at
that point it's better to have separate functions.
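
For the unlocked peek I mean something as simple as this (sketch):

static inline bool __PageMovable(struct page *page)
{
	/*
	 * Unlocked peek only; this may be a false positive from PG_reclaim,
	 * so the caller has to recheck under lock_page().
	 */
	return test_bit(PG_movable, &page->flags);
}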

So I guess the question is how many false positives from the overlap with
PG_reclaim the scanner will hit if we give up on
PAGE_MOVABLE_MAPCOUNT_VALUE, as that will increase the number of page locks
taken just to realize that it's not an actual PageMovable() page...

> Thanks for the detailed review, Vlastimil!
> I will resend a new version after my vacation this week.

You're welcome, great!


* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-04-04  4:45       ` Naoya Horiguchi
@ 2016-04-04 14:46         ` Vlastimil Babka
  2016-04-05  1:54           ` Naoya Horiguchi
  0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2016-04-04 14:46 UTC (permalink / raw)
  To: Naoya Horiguchi, Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On 04/04/2016 06:45 AM, Naoya Horiguchi wrote:
> On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
>> Thanks for catching it, Vlastimil.
>> It was my mistake. But in this chance, I looked over hwpoison code and
>> I saw other places which increases num_poisoned_pages are successful
>> migration, already freed page and successful invalidated page.
>> IOW, they are already successful isolated page so I guess it should
>> increase the count when only successful migration is done?
> 
> Yes, that's right. When exiting with migration's failure, we shouldn't call
> test_set_page_hwpoison or num_poisoned_pages_inc, so current code checking
> (rc != -EAGAIN) is simply incorrect. Your change fixes the bug in memory
> error handling. Great!

Ah, I see, soft offlining works differently than I thought.

>> And when I read memory_failure, it bails out without killing if it
>> encounters HWPoisoned page so I think it's not for catching and
>> kill the poor proces.
>>
>>>
>>> Also (but not your fault) the put_page() preceding
>>> test_set_page_hwpoison(page)) IMHO deserves a comment saying which
>>> pin we are releasing and which one we still have (hopefully? if I
>>> read description of da1b13ccfbebe right) otherwise it looks like
>>> doing something with a page that we just potentially freed.
>>
>> Yes, while I read the code, I had same question. I think the releasing
>> refcount is for get_any_page.
> 
> As the other callers of page migration do, soft_offline_page expects the
> migration source page to be freed at this put_page() (no pin remains.)
> The refcount released here is from isolate_lru_page() in __soft_offline_page().
> (the pin by get_any_page is released by put_hwpoison_page just after it.)
> 
> .. yes, doing something just after freeing page looks weird, but that's
> how PageHWPoison flag works. IOW, many other page flags are maintained
> only during one "allocate-free" life span, but PageHWPoison still does
> its job beyond it.

But what prevents the page from being allocated again between put_page()
and test_set_page_hwpoison()? In that case we would be marking the page
poisoned while it is still in use, which is the same as marking it while
still in use after a failed migration?

(Also, which part prevents pages with PageHWPoison from being allocated
again, anyway? I can't find it, and test_set_page_hwpoison() doesn't
remove the page from the buddy freelists.)

Thanks.

> As for commenting, this put_page() is called in any MIGRATEPAGE_SUCCESS
> case (regardless of callers), so what we can say here is "we free the
> source page here, bypassing LRU list" or something?
> 
> Thanks,
> Naoya Horiguchi
> 


* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-04-04 14:46         ` Vlastimil Babka
@ 2016-04-05  1:54           ` Naoya Horiguchi
  2016-04-05  8:20             ` Vlastimil Babka
  0 siblings, 1 reply; 65+ messages in thread
From: Naoya Horiguchi @ 2016-04-05  1:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm, jlayton,
	bfields, Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On Mon, Apr 04, 2016 at 04:46:31PM +0200, Vlastimil Babka wrote:
> On 04/04/2016 06:45 AM, Naoya Horiguchi wrote:
> > On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
...
> >>>
> >>> Also (but not your fault) the put_page() preceding
> >>> test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> >>> pin we are releasing and which one we still have (hopefully? if I
> >>> read description of da1b13ccfbebe right) otherwise it looks like
> >>> doing something with a page that we just potentially freed.
> >>
> >> Yes, while I read the code, I had same question. I think the releasing
> >> refcount is for get_any_page.
> > 
> > As the other callers of page migration do, soft_offline_page expects the
> > migration source page to be freed at this put_page() (no pin remains.)
> > The refcount released here is from isolate_lru_page() in __soft_offline_page().
> > (the pin by get_any_page is released by put_hwpoison_page just after it.)
> > 
> > .. yes, doing something just after freeing page looks weird, but that's
> > how PageHWPoison flag works. IOW, many other page flags are maintained
> > only during one "allocate-free" life span, but PageHWPoison still does
> > its job beyond it.
> 
> But what prevents the page from being allocated again between put_page()
> and test_set_page_hwpoison()? In that case we would be marking page
> poisoned while still in use, which is the same as marking it while still
> in use after a failed migration?

Actually nothing prevents that race. But I think that the result of the race
is that the error page can be reused for allocation, which results in killing
processes at page fault time. Soft offline is a kind of mild/precautionary
thing (for correctable errors that don't require immediate handling), so
killing processes looks like overkill to me. And marking hwpoison means that
we can no longer retry from userspace.

And another practical thing is the race with unpoison_memory() as described
in commit da1b13ccfbebe. unpoison_memory() properly works only for properly
poisoned pages, so doing unpoison for in-use hwpoisoned pages is fragile.
That's why I'd like to avoid setting PageHWPoison for in-use pages if possible.

> (Also, which part prevents pages with PageHWPoison to be allocated
> again, anyway? I can't find it and test_set_page_hwpoison() doesn't
> remove from buddy freelists).

check_new_page() in mm/page_alloc.c should prevent reallocation of PageHWPoison.
As you pointed out, the memory error handler doesn't remove it from the buddy
freelists.
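
For reference, a simplified sketch of the check being referred to (condensed
from mm/page_alloc.c of roughly this vintage; not the verbatim kernel source)
looks something like this:

static int check_new_page(struct page *page)
{
	const char *bad_reason = NULL;
	unsigned long bad_flags = 0;

	/* Catch a hwpoisoned page that is still sitting on a freelist. */
	if (unlikely(page->flags & __PG_HWPOISON)) {
		bad_reason = "HWPoisoned (hardware-corrupted)";
		bad_flags = __PG_HWPOISON;
	}
	/* ... other sanity checks (mapcount, refcount, flags) elided ... */

	if (unlikely(bad_reason)) {
		bad_page(page, bad_reason, bad_flags);
		return 1;	/* the caller discards the candidate block */
	}
	return 0;
}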


BTW, it might be a bit off-topic, but recently I have felt that check_new_page()
might be improvable, because when check_new_page() returns 1, the whole buddy
block (not only the bad page) seems to be leaked from the buddy freelist.
For example, if a thp (order 9) is requested, and a PageHWPoison page (or any
other type of bad page) is found in an order-9 block, all 512 pages are
discarded. Unpoison can't bring them back to buddy.
So, some code to split the buddy block containing the bad page (and recovery
code for unpoison) might be helpful, although that's another story ...

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-04-04  6:01     ` Minchan Kim
@ 2016-04-05  3:10       ` Balbir Singh
  0 siblings, 0 replies; 65+ messages in thread
From: Balbir Singh @ 2016-04-05  3:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu, Naoya Horiguchi



On 04/04/16 16:01, Minchan Kim wrote:
> On Mon, Apr 04, 2016 at 03:53:59PM +1000, Balbir Singh wrote:
>>
>> On 30/03/16 18:12, Minchan Kim wrote:
>>> Procedure of page migration is as follows:
>>>
>>> First of all, it should isolate a page from LRU and try to
>>> migrate the page. If it is successful, it releases the page
>>> for freeing. Otherwise, it should put the page back to LRU
>>> list.
>>>
>>> For LRU pages, we have used putback_lru_page for both freeing
>>> and putback to LRU list. It's okay because put_page is aware of
>>> LRU list, so if it releases the last refcount of the page, it removes
>>> the page from the LRU list. However, it performs unnecessary operations
>>> (e.g., lru_cache_add, pagevec and flags operations; not significant,
>>> but not worth doing) and makes it harder to support new
>>> non-lru page migration because put_page isn't aware of a non-lru
>>> page's data structure.
>>>
>>> To solve the problem, we can add new hook in put_page with
>>> PageMovable flags check but it can increase overhead in
>>> hot path and needs new locking scheme to stabilize the flag check
>>> with put_page.
>>>
>>> So, this patch cleans it up to divide the two semantics (ie, put and putback).
>>> If migration is successful, use put_page instead of putback_lru_page and
>>> use putback_lru_page only on failure. That makes the code more readable
>>> and doesn't add overhead in put_page.
>> So effectively, when we return from unmap_and_move(), the page has either
>> been freed via put_page() or returned via putback_lru_page(), and it is
>> gone from under us.
> I didn't get your point.
> Could you elaborate more on what you want to say about this patch?

I was just adding to my understanding of this change based on your changelog.
My understanding is that we take the extra reference in isolate_lru_page(),
but by the time we return from unmap_and_move() we have dropped that extra
reference.
Balbir Singh

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-04-05  1:54           ` Naoya Horiguchi
@ 2016-04-05  8:20             ` Vlastimil Babka
  2016-04-06  0:54               ` Naoya Horiguchi
  0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2016-04-05  8:20 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm, jlayton,
	bfields, Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On 04/05/2016 03:54 AM, Naoya Horiguchi wrote:
> On Mon, Apr 04, 2016 at 04:46:31PM +0200, Vlastimil Babka wrote:
>> On 04/04/2016 06:45 AM, Naoya Horiguchi wrote:
>>> On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
> ...
>>>>>
>>>>> Also (but not your fault) the put_page() preceding
>>>>> test_set_page_hwpoison(page)) IMHO deserves a comment saying which
>>>>> pin we are releasing and which one we still have (hopefully? if I
>>>>> read description of da1b13ccfbebe right) otherwise it looks like
>>>>> doing something with a page that we just potentially freed.
>>>>
>>>> Yes, while I read the code, I had same question. I think the releasing
>>>> refcount is for get_any_page.
>>>
>>> As the other callers of page migration do, soft_offline_page expects the
>>> migration source page to be freed at this put_page() (no pin remains.)
>>> The refcount released here is from isolate_lru_page() in __soft_offline_page().
>>> (the pin by get_any_page is released by put_hwpoison_page just after it.)
>>>
>>> .. yes, doing something just after freeing page looks weird, but that's
>>> how PageHWPoison flag works. IOW, many other page flags are maintained
>>> only during one "allocate-free" life span, but PageHWPoison still does
>>> its job beyond it.
>>
>> But what prevents the page from being allocated again between put_page()
>> and test_set_page_hwpoison()? In that case we would be marking page
>> poisoned while still in use, which is the same as marking it while still
>> in use after a failed migration?
> 
> Actually nothing prevents that race. But I think that the result of the race
> is that the error page can be reused for allocation, which results in killing
> processes at page fault time. Soft offline is kind of mild/precautious thing
> (for correctable errors that don't require immediate handling), so killing
> processes looks to me an overkill. And marking hwpoison means that we can no
> longer do retry from userspace.

So you agree that this race is a bug? It may turn a soft-offline attempt
into a killed process. In that case we should fix it the same as we are
fixing the failed migration case. Maybe it will be just enough to switch
the test_set_page_hwpoison() and put_page() calls?
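
For concreteness, the swap would look roughly like this in the
MIGRATEPAGE_SUCCESS branch of unmap_and_move() (a sketch of the suggestion,
not committed code):

	if (rc == MIGRATEPAGE_SUCCESS) {
		if (reason == MR_MEMORY_FAILURE) {
			/* Poison while we still hold our reference ... */
			if (!test_set_page_hwpoison(page))
				num_poisoned_pages_inc();
		}
		/* ... and only then release the page. */
		put_page(page);
	}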

> And another practical thing is the race with unpoison_memory() as described
> in commit da1b13ccfbebe. unpoison_memory() properly works only for properly
> poisoned pages, so doing unpoison for in-use hwpoisoned pages is fragile.
> That's why I'd like to avoid setting PageHWPoison for in-use pages if possible.
> 
>> (Also, which part prevents pages with PageHWPoison to be allocated
>> again, anyway? I can't find it and test_set_page_hwpoison() doesn't
>> remove from buddy freelists).
> 
> check_new_page() in mm/page_alloc.c should prevent reallocation of PageHWPoison.
> As you pointed out, memory error handler doens't remove it from buddy freelists.

Oh, I see. It's using __PG_HWPOISON wrapper, so I didn't notice it when
searching. In any case that results in a bad_page() warning, right? Is
it desirable for a soft-offlined page? If we didn't free poisoned pages
to buddy system, they wouldn't trigger this warning.

> BTW, it might be a bit off-topic, but recently I felt that check_new_page()
> might be improvable, because when check_new_page() returns 1, the whole buddy
> block (not only the bad page) seems to be leaked from buddy freelist.
> For example, if thp (order 9) is requested, and PageHWPoison (or any other
> types of bad pages) is found in an order 9 block, all 512 page are discarded.
> Unpoison can't bring it back to buddy.
> So, some code to split buddy block including bad page (and recovering code from
> unpoison) might be helpful, although that's another story ...

Hm sounds like another argument for not freeing the page to buddy lists
in the first place. Maybe a hook in free_pages_check()?
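
Such a hook might look something like the helper below; this is purely a
hypothetical sketch (the helper name is made up), meant to show the idea of
leaking hwpoisoned pages instead of returning them to the buddy allocator:

/*
 * Hypothetical helper, illustration only: called on the free path
 * before a page would go back onto a buddy freelist.
 */
static inline bool skip_free_for_hwpoison(struct page *page)
{
	/* Leak the poisoned page quietly instead of freeing it, so
	 * check_new_page() never sees it and no bad_page() warning fires. */
	return unlikely(PageHWPoison(page));
}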

> Thanks,
> Naoya Horiguchi
> 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 04/16] mm/balloon: use general movable page feature into balloon
  2016-03-30  7:12 ` [PATCH v3 04/16] mm/balloon: use general movable page feature into balloon Minchan Kim
@ 2016-04-05 12:03   ` Vlastimil Babka
  2016-04-11  4:29     ` Minchan Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Vlastimil Babka @ 2016-04-05 12:03 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: linux-kernel, linux-mm, jlayton, bfields, Joonsoo Kim, koct9i,
	aquini, virtualization, Mel Gorman, Hugh Dickins,
	Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim, Sangseok Lee,
	Chan Gyun Jeong, Al Viro, YiPing Xu, Gioh Kim

On 03/30/2016 09:12 AM, Minchan Kim wrote:
> Now, VM has a feature to migrate non-lru movable pages so
> balloon doesn't need custom migration hooks in migrate.c
> and compact.c. Instead, this patch implements page->mapping
> ->{isolate|migrate|putback} functions.
>
> With that, we could remove hooks for ballooning in general
> migration functions and make balloon compaction simple.
>
> Cc: virtualization@lists.linux-foundation.org
> Cc: Rafael Aquini <aquini@redhat.com>
> Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

I'm not familiar with the inode and pseudofs stuff, so just some things 
I noticed:

> -#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> +#define PAGE_MOVABLE_MAPCOUNT_VALUE (-256)
> +#define PAGE_BALLOON_MAPCOUNT_VALUE PAGE_MOVABLE_MAPCOUNT_VALUE
>
>   static inline int PageMovable(struct page *page)
>   {
> -	return ((test_bit(PG_movable, &(page)->flags) &&
> -		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> -		|| PageBalloon(page));
> +	return (test_bit(PG_movable, &(page)->flags) &&
> +		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE);
>   }
>
>   /* Caller should hold a PG_lock */
> @@ -645,6 +626,35 @@ static inline void __ClearPageMovable(struct page *page)
>
>   PAGEFLAG(Isolated, isolated, PF_ANY);
>
> +static inline int PageBalloon(struct page *page)
> +{
> +	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE
> +		&& PagePrivate2(page);
> +}

Hmm, so you are now using the PG_private_2 flag here, but it's not 
documented. Also, the only caller of PageBalloon() seems to be 
stable_page_flags(), which will now report all movable pages with 
PG_private_2 as KPF_BALLOON. Seems like overkill and also not 
reliable. Could it test e.g. page->mapping instead?

Or maybe if we manage to get rid of PAGE_MOVABLE_MAPCOUNT_VALUE, we can 
keep PAGE_BALLOON_MAPCOUNT_VALUE to simply distinguish balloon pages for 
stable_page_flags().
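
As an illustration of the page->mapping idea above (balloon_aops is an
assumed symbol name here, not something from this patch), the test could be
something like:

/* Illustrative only: identify balloon pages via their address_space
 * rather than a _mapcount value plus PG_private_2. */
static inline int PageBalloon(struct page *page)
{
	return PageMovable(page) && page->mapping &&
	       page->mapping->a_ops == &balloon_aops;
}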

> @@ -1033,7 +1019,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
>   out:
>   	/* If migration is successful, move newpage to right list */
>   	if (rc == MIGRATEPAGE_SUCCESS) {
> -		if (unlikely(__is_movable_balloon_page(newpage)))
> +		if (unlikely(PageMovable(newpage)))
>   			put_page(newpage);
>   		else
>   			putback_lru_page(newpage);

Hmm shouldn't the condition have been changed to

if (unlikely(__is_movable_balloon_page(newpage) || PageMovable(newpage)))

by patch 02/16? And this patch should be just removing the 
balloon-specific check? Otherwise it seems like between patches 02 and 
04, other kinds of PageMovable pages were unnecessarily/wrongly routed 
through putback_lru_page()?
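
That is, the intermediate code after patch 02/16 would presumably have needed
something like the following, with this patch then dropping only the
balloon-specific half of the test:

	if (rc == MIGRATEPAGE_SUCCESS) {
		if (unlikely(__is_movable_balloon_page(newpage) ||
			     PageMovable(newpage)))
			put_page(newpage);
		else
			putback_lru_page(newpage);
	}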

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index d82196244340..c7696a2e11c7 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1254,7 +1254,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
>
>   	list_for_each_entry_safe(page, next, page_list, lru) {
>   		if (page_is_file_cache(page) && !PageDirty(page) &&
> -		    !isolated_balloon_page(page)) {
> +		    !PageIsolated(page)) {
>   			ClearPageActive(page);
>   			list_move(&page->lru, &clean_pages);
>   		}

This looks like the same comment as above at first glance. But looking 
closer, it's even weirder. isolated_balloon_page() was simply 
PageBalloon() after d6d86c0a7f8dd... weird already. You replace it with a 
check for !PageIsolated(), which looks like a more correct check, so OK. 
Except for the potential false positive with PG_owner_priv_1.

Thanks.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-04-05  8:20             ` Vlastimil Babka
@ 2016-04-06  0:54               ` Naoya Horiguchi
  2016-04-06  7:57                 ` Vlastimil Babka
  0 siblings, 1 reply; 65+ messages in thread
From: Naoya Horiguchi @ 2016-04-06  0:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm, jlayton,
	bfields, Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On Tue, Apr 05, 2016 at 10:20:50AM +0200, Vlastimil Babka wrote:
> On 04/05/2016 03:54 AM, Naoya Horiguchi wrote:
> > On Mon, Apr 04, 2016 at 04:46:31PM +0200, Vlastimil Babka wrote:
> >> On 04/04/2016 06:45 AM, Naoya Horiguchi wrote:
> >>> On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
> > ...
> >>>>>
> >>>>> Also (but not your fault) the put_page() preceding
> >>>>> test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> >>>>> pin we are releasing and which one we still have (hopefully? if I
> >>>>> read description of da1b13ccfbebe right) otherwise it looks like
> >>>>> doing something with a page that we just potentially freed.
> >>>>
> >>>> Yes, while I read the code, I had same question. I think the releasing
> >>>> refcount is for get_any_page.
> >>>
> >>> As the other callers of page migration do, soft_offline_page expects the
> >>> migration source page to be freed at this put_page() (no pin remains.)
> >>> The refcount released here is from isolate_lru_page() in __soft_offline_page().
> >>> (the pin by get_any_page is released by put_hwpoison_page just after it.)
> >>>
> >>> .. yes, doing something just after freeing page looks weird, but that's
> >>> how PageHWPoison flag works. IOW, many other page flags are maintained
> >>> only during one "allocate-free" life span, but PageHWPoison still does
> >>> its job beyond it.
> >>
> >> But what prevents the page from being allocated again between put_page()
> >> and test_set_page_hwpoison()? In that case we would be marking page
> >> poisoned while still in use, which is the same as marking it while still
> >> in use after a failed migration?
> > 
> > Actually nothing prevents that race. But I think that the result of the race
> > is that the error page can be reused for allocation, which results in killing
> > processes at page fault time. Soft offline is kind of mild/precautious thing
> > (for correctable errors that don't require immediate handling), so killing
> > processes looks to me an overkill. And marking hwpoison means that we can no
> > longer do retry from userspace.
> 
> So you agree that this race is a bug? It may turn a soft-offline attempt
> into a killed process. In that case we should fix it the same as we are
> fixing the failed migration case.

I agree, it's a bug, although rare and non-critical.

> Maybe it will be just enough to switch
> the test_set_page_hwpoison() and put_page() calls?

Unfortunately that restores the other race with unpoison (described below).
Sorry for my bad/unclear statements; these races seem mutually exclusive and
no solution compatible with both has been found, so I prioritized fixing the
latter one based on severity (the latter causes a kernel crash), which led to
the current code.

> > And another practical thing is the race with unpoison_memory() as described
> > in commit da1b13ccfbebe. unpoison_memory() properly works only for properly
> > poisoned pages, so doing unpoison for in-use hwpoisoned pages is fragile.
> > That's why I'd like to avoid setting PageHWPoison for in-use pages if possible.
> > 
> >> (Also, which part prevents pages with PageHWPoison to be allocated
> >> again, anyway? I can't find it and test_set_page_hwpoison() doesn't
> >> remove from buddy freelists).
> > 
> > check_new_page() in mm/page_alloc.c should prevent reallocation of PageHWPoison.
> > As you pointed out, memory error handler doens't remove it from buddy freelists.
> 
> Oh, I see. It's using __PG_HWPOISON wrapper, so I didn't notice it when
> searching. In any case that results in a bad_page() warning, right? Is
> it desirable for a soft-offlined page?

That's right, and the bad_page warning might be too strong for soft offlining.
We can't tell which of memory_failure/soft_offline_page a PageHWPoison came
from, but users can find other lines in dmesg which should tell them that.
And memory error events can hit buddy pages directly, in which case we still
need the check in check_new_page().

> If we didn't free poisoned pages
> to buddy system, they wouldn't trigger this warning.

Actually, we didn't free it as of commit add05cecef80 ("mm: soft-offline: don't free
target page in successful page migration"), but that was reverted in
commit f4c18e6f7b5b ("mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_*").
Now I'm starting to think the revert was a bad decision, so I'll dig into this
problem again.

> > BTW, it might be a bit off-topic, but recently I felt that check_new_page()
> > might be improvable, because when check_new_page() returns 1, the whole buddy
> > block (not only the bad page) seems to be leaked from buddy freelist.
> > For example, if thp (order 9) is requested, and PageHWPoison (or any other
> > types of bad pages) is found in an order 9 block, all 512 page are discarded.
> > Unpoison can't bring it back to buddy.
> > So, some code to split buddy block including bad page (and recovering code from
> > unpoison) might be helpful, although that's another story ...
> 
> Hm sounds like another argument for not freeing the page to buddy lists
> in the first place. Maybe a hook in free_pages_check()?

Sounds like a good idea. I'll try it, too.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
  2016-04-06  0:54               ` Naoya Horiguchi
@ 2016-04-06  7:57                 ` Vlastimil Babka
  0 siblings, 0 replies; 65+ messages in thread
From: Vlastimil Babka @ 2016-04-06  7:57 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm, jlayton,
	bfields, Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On 04/06/2016 02:54 AM, Naoya Horiguchi wrote:
> On Tue, Apr 05, 2016 at 10:20:50AM +0200, Vlastimil Babka wrote:
>>
>> So you agree that this race is a bug? It may turn a soft-offline attempt
>> into a killed process. In that case we should fix it the same as we are
>> fixing the failed migration case.
> 
> I agree, it's a bug, although rare and non-critical.
> 
>> Maybe it will be just enough to switch
>> the test_set_page_hwpoison() and put_page() calls?
> 
> Unfortunately that restores the other race with unpoison (described below.)
> Sorry for my bad/unclear statements, these races seems exclusive and a compatible
> solution is not found, so I prioritized fixing the latter one by comparing
> severity (the latter causes kernel crash,) which led to the current code.

Ah, I see. However, unpoison is functionality just for stress-testing,
and is not expected to be used in production, right? So it's a somewhat
unfortunate trade-off, with the danger of soft-offlining killing an
unrelated process.

>>> And another practical thing is the race with unpoison_memory() as described
>>> in commit da1b13ccfbebe. unpoison_memory() properly works only for properly
>>> poisoned pages, so doing unpoison for in-use hwpoisoned pages is fragile.
>>> That's why I'd like to avoid setting PageHWPoison for in-use pages if possible.
>>>
>>>> (Also, which part prevents pages with PageHWPoison to be allocated
>>>> again, anyway? I can't find it and test_set_page_hwpoison() doesn't
>>>> remove from buddy freelists).
>>>
>>> check_new_page() in mm/page_alloc.c should prevent reallocation of PageHWPoison.
>>> As you pointed out, memory error handler doens't remove it from buddy freelists.
>>
>> Oh, I see. It's using __PG_HWPOISON wrapper, so I didn't notice it when
>> searching. In any case that results in a bad_page() warning, right? Is
>> it desirable for a soft-offlined page?
> 
> That's right, and the bad_page warning might be too strong for soft offlining.
> We can't tell which of memory_failure/soft_offline_page a PageHWPoison came
> from, but users can find other lines in dmesg which should tell that.
> And memory error events can hit buddy pages directly, in that case we still
> need the check in check_new_page().

Ah, ok.

>> If we didn't free poisoned pages
>> to buddy system, they wouldn't trigger this warning.
> 
> Actually, we didn't free at commit add05cecef80 ("mm: soft-offline: don't free
> target page in successful page migration"), but that's was reverted in
> commit f4c18e6f7b5b ("mm: check __PG_HWPOISON separately from PAGE_FLAGS_CHECK_AT_*").
> Now I start thinking the revert was a bad decision, so I'll dig this problem again.

Good.

>>> BTW, it might be a bit off-topic, but recently I felt that check_new_page()
>>> might be improvable, because when check_new_page() returns 1, the whole buddy
>>> block (not only the bad page) seems to be leaked from buddy freelist.
>>> For example, if thp (order 9) is requested, and PageHWPoison (or any other
>>> types of bad pages) is found in an order 9 block, all 512 page are discarded.
>>> Unpoison can't bring it back to buddy.
>>> So, some code to split buddy block including bad page (and recovering code from
>>> unpoison) might be helpful, although that's another story ...
>>
>> Hm sounds like another argument for not freeing the page to buddy lists
>> in the first place. Maybe a hook in free_pages_check()?
> 
> Sounds a good idea. I'll try it, too.

So what I think could hopefully work is to replace the put_page() after
migration with a hwpoison-specific construct that does something like:

if (put_page_testzero(page)) {
     if (test_set_page_hwpoison(page))
          ...
     __put_page(page);
}

With some more thought about what other parts of put_page() apply - how
to handle compound pages and zone-device pages.

That should hopefully be the safest course. When put_page_testzero()
succeeds, there should be no other (current or near-future) users of the
page, and we can still do whatever we need before releasing to
__put_page(). I.e. set the HWPoison flag, and maybe combine this with a
modification to free_pages_check() to divert it from becoming a buddy page.

It should be even safer than the current "put_page();
test_set_page_hwpoison();" approach in that we are currently not
guaranteed that the put_page() is indeed releasing the last pin, yet we
set HWPoison in any case. Although we have just migrated the page away,
there might be a pfn scanner holding its pin and checking the page.
Hopefully no such scanner has a path that would break on the HWPoison flag,
but I don't know. By not setting HWPoison when put_page_testzero() doesn't
succeed, we are safer. It's true the page might stay unpoisoned due to a
temporary pin, but the process data was migrated away, which is the
important part, and userspace can retry anyway?
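
Fleshed out a little (the helper name is hypothetical, and compound/zone-device
handling is deliberately left out as noted above), the construct would be
roughly:

static void put_hwpoison_migration_source(struct page *page)
{
	if (put_page_testzero(page)) {
		/* We provably dropped the last reference, so nobody can be
		 * using the page; safe to poison before freeing it. */
		if (!test_set_page_hwpoison(page))
			num_poisoned_pages_inc();
		__put_page(page);
	}
	/* Otherwise a temporary pin (e.g. a pfn scanner) still holds the
	 * page: skip poisoning, the data was already migrated away and
	 * userspace can retry the soft-offline later. */
}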

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 13/16] zsmalloc: migrate head page of zspage
  2016-03-30  7:12 ` [PATCH v3 13/16] zsmalloc: migrate head page of zspage Minchan Kim
@ 2016-04-06 13:01   ` Chulmin Kim
  2016-04-07  0:34     ` Chulmin Kim
  2016-04-07  0:43     ` Minchan Kim
  2016-04-19  6:08   ` Chulmin Kim
  1 sibling, 2 replies; 65+ messages in thread
From: Chulmin Kim @ 2016-04-06 13:01 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton; +Cc: linux-kernel, linux-mm

On 2016-03-30 16:12, Minchan Kim wrote:
> This patch introduces run-time migration feature for zspage.
> To begin with, it supports only head page migration for
> easy review(later patches will support tail page migration).
>
> For migration, it supports three functions
>
> * zs_page_isolate
>
> It isolates a zspage which includes a subpage the VM wants to migrate
> from its class so nobody can allocate a new object from the zspage.
> IOW, allocation freeze
>
> * zs_page_migrate
>
> First of all, it freezes the zspage to prevent zspage destruction
> so nobody can free an object. Then, it copies content from the old page
> to the new page and creates a new page chain with the new page.
> If that was successful, drop the refcount of the old page to free it
> and put the new zspage back into the right zsmalloc data structure.
> Lastly, unfreeze the zspage so we allow object allocation/free
> from now on.
>
> * zs_page_putback
>
> It returns isolated zspage to right fullness_group list
> if it fails to migrate a page.
>
> NOTE: A hurdle for supporting migration is destroying a zspage
> while migration is going on. Once a zspage is isolated,
> nobody can allocate an object from the zspage but can deallocate
> objects freely, so a zspage could be destroyed before all of the objects
> in the zspage are frozen to prevent deallocation. The problem is the
> large window between zs_page_isolate and freeze_zspage
> in zs_page_migrate, so the zspage could be destroyed.
>
> An easy approach to solve the problem is object freezing
> in zs_page_isolate, but it has the drawback that no object can
> be deallocated until migration fails after isolation. However,
> there is a large time gap between isolation and migration, so
> any object freeing on another CPU would have to spin on pin_tag, which
> would cause big latency. So, this patch introduces lock_zspage,
> which holds the PG_lock of all pages in a zspage right before
> freeing the zspage. VM migration locks the page, too, right
> before calling ->migratepage so such a race doesn't exist any more.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   include/uapi/linux/magic.h |   1 +
>   mm/zsmalloc.c              | 332 +++++++++++++++++++++++++++++++++++++++++++--
>   2 files changed, 318 insertions(+), 15 deletions(-)
>
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index e1fbe72c39c0..93b1affe4801 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -79,5 +79,6 @@
>   #define NSFS_MAGIC		0x6e736673
>   #define BPF_FS_MAGIC		0xcafe4a11
>   #define BALLOON_KVM_MAGIC	0x13661366
> +#define ZSMALLOC_MAGIC		0x58295829
>
>   #endif /* __LINUX_MAGIC_H__ */
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index ac8ca7b10720..f6c9138c3be0 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -56,6 +56,8 @@
>   #include <linux/debugfs.h>
>   #include <linux/zsmalloc.h>
>   #include <linux/zpool.h>
> +#include <linux/mount.h>
> +#include <linux/migrate.h>
>
>   /*
>    * This must be power of 2 and greater than of equal to sizeof(link_free).
> @@ -182,6 +184,8 @@ struct zs_size_stat {
>   static struct dentry *zs_stat_root;
>   #endif
>
> +static struct vfsmount *zsmalloc_mnt;
> +
>   /*
>    * number of size_classes
>    */
> @@ -263,6 +267,7 @@ struct zs_pool {
>   #ifdef CONFIG_ZSMALLOC_STAT
>   	struct dentry *stat_dentry;
>   #endif
> +	struct inode *inode;
>   };
>
>   struct zs_meta {
> @@ -412,6 +417,29 @@ static int is_last_page(struct page *page)
>   	return PagePrivate2(page);
>   }
>
> +/*
> + * Indicate that whether zspage is isolated for page migration.
> + * Protected by size_class lock
> + */
> +static void SetZsPageIsolate(struct page *first_page)
> +{
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +	SetPageUptodate(first_page);
> +}
> +
> +static int ZsPageIsolate(struct page *first_page)
> +{
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +
> +	return PageUptodate(first_page);
> +}
> +
> +static void ClearZsPageIsolate(struct page *first_page)
> +{
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +	ClearPageUptodate(first_page);
> +}
> +
>   static int get_zspage_inuse(struct page *first_page)
>   {
>   	struct zs_meta *m;
> @@ -783,8 +811,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
>   	if (newfg == currfg)
>   		goto out;
>
> -	remove_zspage(class, currfg, first_page);
> -	insert_zspage(class, newfg, first_page);
> +	/* Later, putback will insert page to right list */
> +	if (!ZsPageIsolate(first_page)) {
> +		remove_zspage(class, currfg, first_page);
> +		insert_zspage(class, newfg, first_page);
> +	}

Hello, Minchan.

I am running a serious stress test using this patchset.
(By the way, this may be a false alarm, as I am working on kernel v3.18.)

I got a bug as depicted in the below.

<0>[47821.493416]  [3:      dumpstate:16261] page:ffffffbdc44aaac0 
count:0 mapcount:0 mapping:          (null) index:0x2
<0>[47821.493524]  [3:      dumpstate:16261] flags: 0x4000000000000000()
<1>[47821.493592]  [3:      dumpstate:16261] page dumped because: 
VM_BUG_ON_PAGE(!is_first_page(first_page))
<4>[47821.493684]  [3:      dumpstate:16261] ------------[ cut here 
]------------
...
<4>[47821.507309]  [3:      dumpstate:16261] [<ffffffc0002515f4>] 
get_zspage_inuse+0x1c/0x30
<4>[47821.507381]  [3:      dumpstate:16261] [<ffffffc0002517a8>] 
insert_zspage+0x94/0xb0
<4>[47821.507452]  [3:      dumpstate:16261] [<ffffffc000252290>] 
putback_zspage+0xac/0xd4
<4>[47821.507522]  [3:      dumpstate:16261] [<ffffffc0002539a0>] 
zs_page_migrate+0x3d8/0x464
<4>[47821.507595]  [3:      dumpstate:16261] [<ffffffc000250294>] 
migrate_pages+0x5dc/0x88


When calling get_zspage_inuse(*head) in insert_zspage(),
VM_BUG_ON_PAGE occurred as *head was not the first page of a zspage.


During the debugging,
I came to think that the *head page could be a free page in the pcp list:
- count and mapcount were reset.
- page->freelist = MIGRATE_MOVABLE (0x2)
- *head page had multiple pages in the same state.


Here is my theory.

Circumstances
(1) A certain page in a zspage is isolated and about to be migrated
(but not yet migrated).
(2) zs_free() simultaneously occurs for a zs object in the above
zspage.

What may happen:
(1) Assume that the above zs_free() moved the zspage's fullness group
to ZS_EMPTY.
(2) However, as the zspage is isolated, the zspage is not removed from
the fullness list (e.g. it stays in fullness_list[ZS_ALMOST_EMPTY]),
according to this patch's code line just before my greeting.
(3) The zspage is reset by free_zspage() in zs_free().
(4) ... and freed (maybe after zs_page_putback()).
(5) The freed zspage becomes a free page and is inserted into the pcp
freelist.


If my theory is correct,
we need some change in this patch
(e.g. allow remove_zspage in fix_fullness_group()).


Please check it out.

Thanks.


>   	set_zspage_mapping(first_page, class_idx, newfg);
>
>   out:
> @@ -950,12 +981,31 @@ static void unpin_tag(unsigned long handle)
>
>   static void reset_page(struct page *page)
>   {
> +	if (!PageIsolated(page))
> +		__ClearPageMovable(page);
> +	ClearPageIsolated(page);
>   	clear_bit(PG_private, &page->flags);
>   	clear_bit(PG_private_2, &page->flags);
>   	set_page_private(page, 0);
>   	page->freelist = NULL;
>   }
>
> +/**
> + * lock_zspage - lock all pages in the zspage
> + * @first_page: head page of the zspage
> + *
> + * To prevent destroy during migration, zspage freeing should
> + * hold locks of all pages in a zspage
> + */
> +void lock_zspage(struct page *first_page)
> +{
> +	struct page *cursor = first_page;
> +
> +	do {
> +		while (!trylock_page(cursor));
> +	} while ((cursor = get_next_page(cursor)) != NULL);
> +}
> +
>   static void free_zspage(struct zs_pool *pool, struct page *first_page)
>   {
>   	struct page *nextp, *tmp, *head_extra;
> @@ -963,26 +1013,31 @@ static void free_zspage(struct zs_pool *pool, struct page *first_page)
>   	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>   	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
>
> +	lock_zspage(first_page);
>   	head_extra = (struct page *)page_private(first_page);
>
> -	reset_page(first_page);
> -	__free_page(first_page);
> -
>   	/* zspage with only 1 system page */
>   	if (!head_extra)
> -		return;
> +		goto out;
>
>   	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
>   		list_del(&nextp->lru);
>   		reset_page(nextp);
> -		__free_page(nextp);
> +		unlock_page(nextp);
> +		put_page(nextp);
>   	}
>   	reset_page(head_extra);
> -	__free_page(head_extra);
> +	unlock_page(head_extra);
> +	put_page(head_extra);
> +out:
> +	reset_page(first_page);
> +	unlock_page(first_page);
> +	put_page(first_page);
>   }
>
>   /* Initialize a newly allocated zspage */
> -static void init_zspage(struct size_class *class, struct page *first_page)
> +static void init_zspage(struct size_class *class, struct page *first_page,
> +			struct address_space *mapping)
>   {
>   	int freeobj = 1;
>   	unsigned long off = 0;
> @@ -991,6 +1046,9 @@ static void init_zspage(struct size_class *class, struct page *first_page)
>   	first_page->freelist = NULL;
>   	INIT_LIST_HEAD(&first_page->lru);
>   	set_zspage_inuse(first_page, 0);
> +	BUG_ON(!trylock_page(first_page));
> +	__SetPageMovable(first_page, mapping);
> +	unlock_page(first_page);
>
>   	while (page) {
>   		struct page *next_page;
> @@ -1065,10 +1123,45 @@ static void create_page_chain(struct page *pages[], int nr_pages)
>   	}
>   }
>
> +static void replace_sub_page(struct size_class *class, struct page *first_page,
> +		struct page *newpage, struct page *oldpage)
> +{
> +	struct page *page;
> +	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL,};
> +	int idx = 0;
> +
> +	page = first_page;
> +	do {
> +		if (page == oldpage)
> +			pages[idx] = newpage;
> +		else
> +			pages[idx] = page;
> +		idx++;
> +	} while ((page = get_next_page(page)) != NULL);
> +
> +	create_page_chain(pages, class->pages_per_zspage);
> +
> +	if (is_first_page(oldpage)) {
> +		enum fullness_group fg;
> +		int class_idx;
> +
> +		SetZsPageIsolate(newpage);
> +		get_zspage_mapping(oldpage, &class_idx, &fg);
> +		set_zspage_mapping(newpage, class_idx, fg);
> +		set_freeobj(newpage, get_freeobj(oldpage));
> +		set_zspage_inuse(newpage, get_zspage_inuse(oldpage));
> +		if (class->huge)
> +			set_page_private(newpage,  page_private(oldpage));
> +	}
> +
> +	__SetPageMovable(newpage, oldpage->mapping);
> +}
> +
>   /*
>    * Allocate a zspage for the given size class
>    */
> -static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> +static struct page *alloc_zspage(struct zs_pool *pool,
> +				struct size_class *class)
>   {
>   	int i;
>   	struct page *first_page = NULL;
> @@ -1088,7 +1181,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>   	for (i = 0; i < class->pages_per_zspage; i++) {
>   		struct page *page;
>
> -		page = alloc_page(flags);
> +		page = alloc_page(pool->flags);
>   		if (!page) {
>   			while (--i >= 0)
>   				__free_page(pages[i]);
> @@ -1100,7 +1193,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>
>   	create_page_chain(pages, class->pages_per_zspage);
>   	first_page = pages[0];
> -	init_zspage(class, first_page);
> +	init_zspage(class, first_page, pool->inode->i_mapping);
>
>   	return first_page;
>   }
> @@ -1499,7 +1592,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
>
>   	if (!first_page) {
>   		spin_unlock(&class->lock);
> -		first_page = alloc_zspage(class, pool->flags);
> +		first_page = alloc_zspage(pool, class);
>   		if (unlikely(!first_page)) {
>   			free_handle(pool, handle);
>   			return 0;
> @@ -1559,6 +1652,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>   	if (unlikely(!handle))
>   		return;
>
> +	/* Once handle is pinned, page|object migration cannot work */
>   	pin_tag(handle);
>   	obj = handle_to_obj(handle);
>   	obj_to_location(obj, &f_page, &f_objidx);
> @@ -1714,6 +1808,9 @@ static enum fullness_group putback_zspage(struct size_class *class,
>   {
>   	enum fullness_group fullness;
>
> +	VM_BUG_ON_PAGE(!list_empty(&first_page->lru), first_page);
> +	VM_BUG_ON_PAGE(ZsPageIsolate(first_page), first_page);
> +
>   	fullness = get_fullness_group(class, first_page);
>   	insert_zspage(class, fullness, first_page);
>   	set_zspage_mapping(first_page, class->index, fullness);
> @@ -2059,6 +2156,173 @@ static int zs_register_shrinker(struct zs_pool *pool)
>   	return register_shrinker(&pool->shrinker);
>   }
>
> +bool zs_page_isolate(struct page *page, isolate_mode_t mode)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fullness;
> +	struct page *first_page;
> +
> +	/*
> +	 * The page is locked so it couldn't be destroyed.
> +	 * For detail, look at lock_zspage in free_zspage.
> +	 */
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	VM_BUG_ON_PAGE(PageIsolated(page), page);
> +	/*
> +	 * In this implementation, it allows only first page migration.
> +	 */
> +	VM_BUG_ON_PAGE(!is_first_page(page), page);
> +	first_page = page;
> +
> +	/*
> +	 * Without class lock, fullness is meaningless while constant
> +	 * class_idx is okay. We will get it under class lock at below,
> +	 * again.
> +	 */
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	pool = page->mapping->private_data;
> +	class = pool->size_class[class_idx];
> +
> +	if (!spin_trylock(&class->lock))
> +		return false;
> +
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	remove_zspage(class, fullness, first_page);
> +	SetZsPageIsolate(first_page);
> +	SetPageIsolated(page);
> +	spin_unlock(&class->lock);
> +
> +	return true;
> +}
> +
> +int zs_page_migrate(struct address_space *mapping, struct page *newpage,
> +		struct page *page, enum migrate_mode mode)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fullness;
> +	struct page *first_page;
> +	void *s_addr, *d_addr, *addr;
> +	int ret = -EBUSY;
> +	int offset = 0;
> +	int freezed = 0;
> +
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> +
> +	first_page = page;
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	pool = page->mapping->private_data;
> +	class = pool->size_class[class_idx];
> +
> +	/*
> +	 * Get stable fullness under class->lock
> +	 */
> +	if (!spin_trylock(&class->lock))
> +		return ret;
> +
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	if (get_zspage_inuse(first_page) == 0)
> +		goto out_class_unlock;
> +
> +	freezed = freeze_zspage(class, first_page);
> +	if (freezed != get_zspage_inuse(first_page))
> +		goto out_unfreeze;
> +
> +	/* copy contents from page to newpage */
> +	s_addr = kmap_atomic(page);
> +	d_addr = kmap_atomic(newpage);
> +	memcpy(d_addr, s_addr, PAGE_SIZE);
> +	kunmap_atomic(d_addr);
> +	kunmap_atomic(s_addr);
> +
> +	if (!is_first_page(page))
> +		offset = page->index;
> +
> +	addr = kmap_atomic(page);
> +	do {
> +		unsigned long handle;
> +		unsigned long head;
> +		unsigned long new_obj, old_obj;
> +		unsigned long obj_idx;
> +		struct page *dummy;
> +
> +		head = obj_to_head(class, page, addr + offset);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			handle = head & ~OBJ_ALLOCATED_TAG;
> +			if (!testpin_tag(handle))
> +				BUG();
> +
> +			old_obj = handle_to_obj(handle);
> +			obj_to_location(old_obj, &dummy, &obj_idx);
> +			new_obj = location_to_obj(newpage, obj_idx);
> +			new_obj |= BIT(HANDLE_PIN_BIT);
> +			record_obj(handle, new_obj);
> +		}
> +		offset += class->size;
> +	} while (offset < PAGE_SIZE);
> +	kunmap_atomic(addr);
> +
> +	replace_sub_page(class, first_page, newpage, page);
> +	first_page = newpage;
> +	get_page(newpage);
> +	VM_BUG_ON_PAGE(get_fullness_group(class, first_page) ==
> +			ZS_EMPTY, first_page);
> +	ClearZsPageIsolate(first_page);
> +	putback_zspage(class, first_page);
> +
> +	/* Migration complete. Free old page */
> +	ClearPageIsolated(page);
> +	reset_page(page);
> +	put_page(page);
> +	ret = MIGRATEPAGE_SUCCESS;
> +
> +out_unfreeze:
> +	unfreeze_zspage(class, first_page, freezed);
> +out_class_unlock:
> +	spin_unlock(&class->lock);
> +
> +	return ret;
> +}
> +
> +void zs_page_putback(struct page *page)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fullness;
> +	struct page *first_page;
> +
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> +
> +	first_page = page;
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	pool = page->mapping->private_data;
> +	class = pool->size_class[class_idx];
> +
> +	/*
> +	 * If there is race betwwen zs_free and here, free_zspage
> +	 * in zs_free will wait the page lock of @page without
> +	 * destroying of zspage.
> +	 */
> +	INIT_LIST_HEAD(&first_page->lru);
> +	spin_lock(&class->lock);
> +	ClearPageIsolated(page);
> +	ClearZsPageIsolate(first_page);
> +	putback_zspage(class, first_page);
> +	spin_unlock(&class->lock);
> +}
> +
> +const struct address_space_operations zsmalloc_aops = {
> +	.isolate_page = zs_page_isolate,
> +	.migratepage = zs_page_migrate,
> +	.putback_page = zs_page_putback,
> +};
> +
>   /**
>    * zs_create_pool - Creates an allocation pool to work from.
>    * @flags: allocation flags used to allocate pool metadata
> @@ -2145,6 +2409,15 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
>   	if (zs_pool_stat_create(pool, name))
>   		goto err;
>
> +	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
> +	if (IS_ERR(pool->inode)) {
> +		pool->inode = NULL;
> +		goto err;
> +	}
> +
> +	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
> +	pool->inode->i_mapping->private_data = pool;
> +
>   	/*
>   	 * Not critical, we still can use the pool
>   	 * and user can trigger compaction manually.
> @@ -2164,6 +2437,8 @@ void zs_destroy_pool(struct zs_pool *pool)
>   	int i;
>
>   	zs_unregister_shrinker(pool);
> +	if (pool->inode)
> +		iput(pool->inode);
>   	zs_pool_stat_destroy(pool);
>
>   	for (i = 0; i < zs_size_classes; i++) {
> @@ -2192,10 +2467,33 @@ void zs_destroy_pool(struct zs_pool *pool)
>   }
>   EXPORT_SYMBOL_GPL(zs_destroy_pool);
>
> +static struct dentry *zs_mount(struct file_system_type *fs_type,
> +				int flags, const char *dev_name, void *data)
> +{
> +	static const struct dentry_operations ops = {
> +		.d_dname = simple_dname,
> +	};
> +
> +	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
> +}
> +
> +static struct file_system_type zsmalloc_fs = {
> +	.name		= "zsmalloc",
> +	.mount		= zs_mount,
> +	.kill_sb	= kill_anon_super,
> +};
> +
>   static int __init zs_init(void)
>   {
> -	int ret = zs_register_cpu_notifier();
> +	int ret;
> +
> +	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
> +	if (IS_ERR(zsmalloc_mnt)) {
> +		ret = PTR_ERR(zsmalloc_mnt);
> +		goto out;
> +	}
>
> +	ret = zs_register_cpu_notifier();
>   	if (ret)
>   		goto notifier_fail;
>
> @@ -2218,6 +2516,7 @@ static int __init zs_init(void)
>   		pr_err("zs stat initialization failed\n");
>   		goto stat_fail;
>   	}
> +
>   	return 0;
>
>   stat_fail:
> @@ -2226,7 +2525,8 @@ static int __init zs_init(void)
>   #endif
>   notifier_fail:
>   	zs_unregister_cpu_notifier();
> -
> +	kern_unmount(zsmalloc_mnt);
> +out:
>   	return ret;
>   }
>
> @@ -2237,6 +2537,8 @@ static void __exit zs_exit(void)
>   #endif
>   	zs_unregister_cpu_notifier();
>
> +	kern_unmount(zsmalloc_mnt);
> +
>   	zs_stat_exit();
>   }
>
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 13/16] zsmalloc: migrate head page of zspage
  2016-04-06 13:01   ` Chulmin Kim
@ 2016-04-07  0:34     ` Chulmin Kim
  2016-04-07  0:43     ` Minchan Kim
  1 sibling, 0 replies; 65+ messages in thread
From: Chulmin Kim @ 2016-04-07  0:34 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton; +Cc: linux-kernel, linux-mm

On 2016-04-06 22:01, Chulmin Kim wrote:
> On 2016-03-30 16:12, Minchan Kim wrote:
>> This patch introduces run-time migration feature for zspage.
>> To begin with, it supports only head page migration for
>> easy review(later patches will support tail page migration).
>>
>> For migration, it supports three functions
>>
>> * zs_page_isolate
>>
>> It isolates a zspage which includes a subpage VM want to migrate
>> from class so anyone cannot allocate new object from the zspage.
>> IOW, allocation freeze
>>
>> * zs_page_migrate
>>
>> First of all, it freezes zspage to prevent zspage destrunction
>> so anyone cannot free object. Then, It copies content from oldpage
>> to newpage and create new page-chain with new page.
>> If it was successful, drop the refcount of old page to free
>> and putback new zspage to right data structure of zsmalloc.
>> Lastly, unfreeze zspages so we allows object allocation/free
>> from now on.
>>
>> * zs_page_putback
>>
>> It returns isolated zspage to right fullness_group list
>> if it fails to migrate a page.
>>
>> NOTE: A hurdle to support migration is that destroying zspage
>> while migration is going on. Once a zspage is isolated,
>> anyone cannot allocate object from the zspage but can deallocate
>> object freely so a zspage could be destroyed until all of objects
>> in zspage are freezed to prevent deallocation. The problem is
>> large window betwwen zs_page_isolate and freeze_zspage
>> in zs_page_migrate so the zspage could be destroyed.
>>
>> A easy approach to solve the problem is that object freezing
>> in zs_page_isolate but it has a drawback that any object cannot
>> be deallocated until migration fails after isolation. However,
>> There is large time gab between isolation and migration so
>> any object freeing in other CPU should spin by pin_tag which
>> would cause big latency. So, this patch introduces lock_zspage
>> which holds PG_lock of all pages in a zspage right before
>> freeing the zspage. VM migration locks the page, too right
>> before calling ->migratepage so such race doesn't exist any more.
>>
>> Signed-off-by: Minchan Kim <minchan@kernel.org>
>> ---
>>   include/uapi/linux/magic.h |   1 +
>>   mm/zsmalloc.c              | 332
>> +++++++++++++++++++++++++++++++++++++++++++--
>>   2 files changed, 318 insertions(+), 15 deletions(-)
>>
>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>> index e1fbe72c39c0..93b1affe4801 100644
>> --- a/include/uapi/linux/magic.h
>> +++ b/include/uapi/linux/magic.h
>> @@ -79,5 +79,6 @@
>>   #define NSFS_MAGIC        0x6e736673
>>   #define BPF_FS_MAGIC        0xcafe4a11
>>   #define BALLOON_KVM_MAGIC    0x13661366
>> +#define ZSMALLOC_MAGIC        0x58295829
>>
>>   #endif /* __LINUX_MAGIC_H__ */
>> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>> index ac8ca7b10720..f6c9138c3be0 100644
>> --- a/mm/zsmalloc.c
>> +++ b/mm/zsmalloc.c
>> @@ -56,6 +56,8 @@
>>   #include <linux/debugfs.h>
>>   #include <linux/zsmalloc.h>
>>   #include <linux/zpool.h>
>> +#include <linux/mount.h>
>> +#include <linux/migrate.h>
>>
>>   /*
>>    * This must be power of 2 and greater than of equal to
>> sizeof(link_free).
>> @@ -182,6 +184,8 @@ struct zs_size_stat {
>>   static struct dentry *zs_stat_root;
>>   #endif
>>
>> +static struct vfsmount *zsmalloc_mnt;
>> +
>>   /*
>>    * number of size_classes
>>    */
>> @@ -263,6 +267,7 @@ struct zs_pool {
>>   #ifdef CONFIG_ZSMALLOC_STAT
>>       struct dentry *stat_dentry;
>>   #endif
>> +    struct inode *inode;
>>   };
>>
>>   struct zs_meta {
>> @@ -412,6 +417,29 @@ static int is_last_page(struct page *page)
>>       return PagePrivate2(page);
>>   }
>>
>> +/*
>> + * Indicate that whether zspage is isolated for page migration.
>> + * Protected by size_class lock
>> + */
>> +static void SetZsPageIsolate(struct page *first_page)
>> +{
>> +    VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>> +    SetPageUptodate(first_page);
>> +}
>> +
>> +static int ZsPageIsolate(struct page *first_page)
>> +{
>> +    VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>> +
>> +    return PageUptodate(first_page);
>> +}
>> +
>> +static void ClearZsPageIsolate(struct page *first_page)
>> +{
>> +    VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>> +    ClearPageUptodate(first_page);
>> +}
>> +
>>   static int get_zspage_inuse(struct page *first_page)
>>   {
>>       struct zs_meta *m;
>> @@ -783,8 +811,11 @@ static enum fullness_group
>> fix_fullness_group(struct size_class *class,
>>       if (newfg == currfg)
>>           goto out;
>>
>> -    remove_zspage(class, currfg, first_page);
>> -    insert_zspage(class, newfg, first_page);
>> +    /* Later, putback will insert page to right list */
>> +    if (!ZsPageIsolate(first_page)) {
>> +        remove_zspage(class, currfg, first_page);
>> +        insert_zspage(class, newfg, first_page);
>> +    }
>
> Hello, Minchan.
>
> I am running a serious stress test using this patchset.
> (By the way, there can be a false alarm as I am working on Kernel v3.18.)
>
> I got a bug as depicted in the below.
>
> <0>[47821.493416]  [3:      dumpstate:16261] page:ffffffbdc44aaac0
> count:0 mapcount:0 mapping:          (null) index:0x2
> <0>[47821.493524]  [3:      dumpstate:16261] flags: 0x4000000000000000()
> <1>[47821.493592]  [3:      dumpstate:16261] page dumped because:
> VM_BUG_ON_PAGE(!is_first_page(first_page))
> <4>[47821.493684]  [3:      dumpstate:16261] ------------[ cut here
> ]------------
> ...
> <4>[47821.507309]  [3:      dumpstate:16261] [<ffffffc0002515f4>]
> get_zspage_inuse+0x1c/0x30
> <4>[47821.507381]  [3:      dumpstate:16261] [<ffffffc0002517a8>]
> insert_zspage+0x94/0xb0
> <4>[47821.507452]  [3:      dumpstate:16261] [<ffffffc000252290>]
> putback_zspage+0xac/0xd4
> <4>[47821.507522]  [3:      dumpstate:16261] [<ffffffc0002539a0>]
> zs_page_migrate+0x3d8/0x464
> <4>[47821.507595]  [3:      dumpstate:16261] [<ffffffc000250294>]
> migrate_pages+0x5dc/0x88
>
>
> When calling get_zspage_inuse(*head) in insert_zspage(),
> VM_BUG_ON_PAGE occurred as *head was not the first page of a zspage.
>
>
> During the debugging,
> I thought that *head page could be a free page in pcp list.
> - count, mapcount were reset.
> - page->freelist = MIGRATE_MOVABLE (0x2)
> - *head page had the multiple pages in the same state.
>
>

Please ignore the below part.
Seems weird even to me now. :)

> Here is my theory.
>
> Circumstances
> (1) A certain page in a zs page is isolated and about to be migrated.
> (not being migrated)
> (2) zs_free() simultaneously occurred for the zs object in the above zs
> page.
>
> What may happen.
> (1) Assume that the above zs_free() made the zspage's FG to ZS_EMPTY.
> (2) However, as the zspage is isolated, the zspage is not removed from
> the fullness list (e.g. reside in fullness_list[ZS_ALMOST_EMPTY]).
> (according to this patch's code line just before my greeting.)
> (3) The zspage is reset by free_zspage() in zs_free().
> (4) and freed (maybe after zs_putback_page()).
> (5) Freed zspage becomes a free page and is inserted into pcp freelist.
>
>
> If my theory is correct,
> we need some change in this patch.
> (e.g. allow remove_zspage in fix_fullness_group())
>
>
> Please check it out.
>
> Thanks.
>
>
>>       set_zspage_mapping(first_page, class_idx, newfg);
>>
>>   out:
>> @@ -950,12 +981,31 @@ static void unpin_tag(unsigned long handle)
>>
>>   static void reset_page(struct page *page)
>>   {
>> +    if (!PageIsolated(page))
>> +        __ClearPageMovable(page);
>> +    ClearPageIsolated(page);
>>       clear_bit(PG_private, &page->flags);
>>       clear_bit(PG_private_2, &page->flags);
>>       set_page_private(page, 0);
>>       page->freelist = NULL;
>>   }
>>
>> +/**
>> + * lock_zspage - lock all pages in the zspage
>> + * @first_page: head page of the zspage
>> + *
>> + * To prevent destroy during migration, zspage freeing should
>> + * hold locks of all pages in a zspage
>> + */
>> +void lock_zspage(struct page *first_page)
>> +{
>> +    struct page *cursor = first_page;
>> +
>> +    do {
>> +        while (!trylock_page(cursor));
>> +    } while ((cursor = get_next_page(cursor)) != NULL);
>> +}
>> +
>>   static void free_zspage(struct zs_pool *pool, struct page *first_page)
>>   {
>>       struct page *nextp, *tmp, *head_extra;
>> @@ -963,26 +1013,31 @@ static void free_zspage(struct zs_pool *pool,
>> struct page *first_page)
>>       VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>>       VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
>>
>> +    lock_zspage(first_page);
>>       head_extra = (struct page *)page_private(first_page);
>>
>> -    reset_page(first_page);
>> -    __free_page(first_page);
>> -
>>       /* zspage with only 1 system page */
>>       if (!head_extra)
>> -        return;
>> +        goto out;
>>
>>       list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
>>           list_del(&nextp->lru);
>>           reset_page(nextp);
>> -        __free_page(nextp);
>> +        unlock_page(nextp);
>> +        put_page(nextp);
>>       }
>>       reset_page(head_extra);
>> -    __free_page(head_extra);
>> +    unlock_page(head_extra);
>> +    put_page(head_extra);
>> +out:
>> +    reset_page(first_page);
>> +    unlock_page(first_page);
>> +    put_page(first_page);
>>   }
>>
>>   /* Initialize a newly allocated zspage */
>> -static void init_zspage(struct size_class *class, struct page
>> *first_page)
>> +static void init_zspage(struct size_class *class, struct page
>> *first_page,
>> +            struct address_space *mapping)
>>   {
>>       int freeobj = 1;
>>       unsigned long off = 0;
>> @@ -991,6 +1046,9 @@ static void init_zspage(struct size_class *class,
>> struct page *first_page)
>>       first_page->freelist = NULL;
>>       INIT_LIST_HEAD(&first_page->lru);
>>       set_zspage_inuse(first_page, 0);
>> +    BUG_ON(!trylock_page(first_page));
>> +    __SetPageMovable(first_page, mapping);
>> +    unlock_page(first_page);
>>
>>       while (page) {
>>           struct page *next_page;
>> @@ -1065,10 +1123,45 @@ static void create_page_chain(struct page
>> *pages[], int nr_pages)
>>       }
>>   }
>>
>> +static void replace_sub_page(struct size_class *class, struct page
>> *first_page,
>> +        struct page *newpage, struct page *oldpage)
>> +{
>> +    struct page *page;
>> +    struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL,};
>> +    int idx = 0;
>> +
>> +    page = first_page;
>> +    do {
>> +        if (page == oldpage)
>> +            pages[idx] = newpage;
>> +        else
>> +            pages[idx] = page;
>> +        idx++;
>> +    } while ((page = get_next_page(page)) != NULL);
>> +
>> +    create_page_chain(pages, class->pages_per_zspage);
>> +
>> +    if (is_first_page(oldpage)) {
>> +        enum fullness_group fg;
>> +        int class_idx;
>> +
>> +        SetZsPageIsolate(newpage);
>> +        get_zspage_mapping(oldpage, &class_idx, &fg);
>> +        set_zspage_mapping(newpage, class_idx, fg);
>> +        set_freeobj(newpage, get_freeobj(oldpage));
>> +        set_zspage_inuse(newpage, get_zspage_inuse(oldpage));
>> +        if (class->huge)
>> +            set_page_private(newpage,  page_private(oldpage));
>> +    }
>> +
>> +    __SetPageMovable(newpage, oldpage->mapping);
>> +}
>> +
>>   /*
>>    * Allocate a zspage for the given size class
>>    */
>> -static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>> +static struct page *alloc_zspage(struct zs_pool *pool,
>> +                struct size_class *class)
>>   {
>>       int i;
>>       struct page *first_page = NULL;
>> @@ -1088,7 +1181,7 @@ static struct page *alloc_zspage(struct
>> size_class *class, gfp_t flags)
>>       for (i = 0; i < class->pages_per_zspage; i++) {
>>           struct page *page;
>>
>> -        page = alloc_page(flags);
>> +        page = alloc_page(pool->flags);
>>           if (!page) {
>>               while (--i >= 0)
>>                   __free_page(pages[i]);
>> @@ -1100,7 +1193,7 @@ static struct page *alloc_zspage(struct
>> size_class *class, gfp_t flags)
>>
>>       create_page_chain(pages, class->pages_per_zspage);
>>       first_page = pages[0];
>> -    init_zspage(class, first_page);
>> +    init_zspage(class, first_page, pool->inode->i_mapping);
>>
>>       return first_page;
>>   }
>> @@ -1499,7 +1592,7 @@ unsigned long zs_malloc(struct zs_pool *pool,
>> size_t size)
>>
>>       if (!first_page) {
>>           spin_unlock(&class->lock);
>> -        first_page = alloc_zspage(class, pool->flags);
>> +        first_page = alloc_zspage(pool, class);
>>           if (unlikely(!first_page)) {
>>               free_handle(pool, handle);
>>               return 0;
>> @@ -1559,6 +1652,7 @@ void zs_free(struct zs_pool *pool, unsigned long
>> handle)
>>       if (unlikely(!handle))
>>           return;
>>
>> +    /* Once handle is pinned, page|object migration cannot work */
>>       pin_tag(handle);
>>       obj = handle_to_obj(handle);
>>       obj_to_location(obj, &f_page, &f_objidx);
>> @@ -1714,6 +1808,9 @@ static enum fullness_group putback_zspage(struct
>> size_class *class,
>>   {
>>       enum fullness_group fullness;
>>
>> +    VM_BUG_ON_PAGE(!list_empty(&first_page->lru), first_page);
>> +    VM_BUG_ON_PAGE(ZsPageIsolate(first_page), first_page);
>> +
>>       fullness = get_fullness_group(class, first_page);
>>       insert_zspage(class, fullness, first_page);
>>       set_zspage_mapping(first_page, class->index, fullness);
>> @@ -2059,6 +2156,173 @@ static int zs_register_shrinker(struct zs_pool
>> *pool)
>>       return register_shrinker(&pool->shrinker);
>>   }
>>
>> +bool zs_page_isolate(struct page *page, isolate_mode_t mode)
>> +{
>> +    struct zs_pool *pool;
>> +    struct size_class *class;
>> +    int class_idx;
>> +    enum fullness_group fullness;
>> +    struct page *first_page;
>> +
>> +    /*
>> +     * The page is locked so it couldn't be destroyed.
>> +     * For detail, look at lock_zspage in free_zspage.
>> +     */
>> +    VM_BUG_ON_PAGE(!PageLocked(page), page);
>> +    VM_BUG_ON_PAGE(PageIsolated(page), page);
>> +    /*
>> +     * In this implementation, it allows only first page migration.
>> +     */
>> +    VM_BUG_ON_PAGE(!is_first_page(page), page);
>> +    first_page = page;
>> +
>> +    /*
>> +     * Without class lock, fullness is meaningless while constant
>> +     * class_idx is okay. We will get it under class lock at below,
>> +     * again.
>> +     */
>> +    get_zspage_mapping(first_page, &class_idx, &fullness);
>> +    pool = page->mapping->private_data;
>> +    class = pool->size_class[class_idx];
>> +
>> +    if (!spin_trylock(&class->lock))
>> +        return false;
>> +
>> +    get_zspage_mapping(first_page, &class_idx, &fullness);
>> +    remove_zspage(class, fullness, first_page);
>> +    SetZsPageIsolate(first_page);
>> +    SetPageIsolated(page);
>> +    spin_unlock(&class->lock);
>> +
>> +    return true;
>> +}
>> +
>> +int zs_page_migrate(struct address_space *mapping, struct page *newpage,
>> +        struct page *page, enum migrate_mode mode)
>> +{
>> +    struct zs_pool *pool;
>> +    struct size_class *class;
>> +    int class_idx;
>> +    enum fullness_group fullness;
>> +    struct page *first_page;
>> +    void *s_addr, *d_addr, *addr;
>> +    int ret = -EBUSY;
>> +    int offset = 0;
>> +    int freezed = 0;
>> +
>> +    VM_BUG_ON_PAGE(!PageMovable(page), page);
>> +    VM_BUG_ON_PAGE(!PageIsolated(page), page);
>> +
>> +    first_page = page;
>> +    get_zspage_mapping(first_page, &class_idx, &fullness);
>> +    pool = page->mapping->private_data;
>> +    class = pool->size_class[class_idx];
>> +
>> +    /*
>> +     * Get stable fullness under class->lock
>> +     */
>> +    if (!spin_trylock(&class->lock))
>> +        return ret;
>> +
>> +    get_zspage_mapping(first_page, &class_idx, &fullness);
>> +    if (get_zspage_inuse(first_page) == 0)
>> +        goto out_class_unlock;
>> +
>> +    freezed = freeze_zspage(class, first_page);
>> +    if (freezed != get_zspage_inuse(first_page))
>> +        goto out_unfreeze;
>> +
>> +    /* copy contents from page to newpage */
>> +    s_addr = kmap_atomic(page);
>> +    d_addr = kmap_atomic(newpage);
>> +    memcpy(d_addr, s_addr, PAGE_SIZE);
>> +    kunmap_atomic(d_addr);
>> +    kunmap_atomic(s_addr);
>> +
>> +    if (!is_first_page(page))
>> +        offset = page->index;
>> +
>> +    addr = kmap_atomic(page);
>> +    do {
>> +        unsigned long handle;
>> +        unsigned long head;
>> +        unsigned long new_obj, old_obj;
>> +        unsigned long obj_idx;
>> +        struct page *dummy;
>> +
>> +        head = obj_to_head(class, page, addr + offset);
>> +        if (head & OBJ_ALLOCATED_TAG) {
>> +            handle = head & ~OBJ_ALLOCATED_TAG;
>> +            if (!testpin_tag(handle))
>> +                BUG();
>> +
>> +            old_obj = handle_to_obj(handle);
>> +            obj_to_location(old_obj, &dummy, &obj_idx);
>> +            new_obj = location_to_obj(newpage, obj_idx);
>> +            new_obj |= BIT(HANDLE_PIN_BIT);
>> +            record_obj(handle, new_obj);
>> +        }
>> +        offset += class->size;
>> +    } while (offset < PAGE_SIZE);
>> +    kunmap_atomic(addr);
>> +
>> +    replace_sub_page(class, first_page, newpage, page);
>> +    first_page = newpage;
>> +    get_page(newpage);
>> +    VM_BUG_ON_PAGE(get_fullness_group(class, first_page) ==
>> +            ZS_EMPTY, first_page);
>> +    ClearZsPageIsolate(first_page);
>> +    putback_zspage(class, first_page);
>> +
>> +    /* Migration complete. Free old page */
>> +    ClearPageIsolated(page);
>> +    reset_page(page);
>> +    put_page(page);
>> +    ret = MIGRATEPAGE_SUCCESS;
>> +
>> +out_unfreeze:
>> +    unfreeze_zspage(class, first_page, freezed);
>> +out_class_unlock:
>> +    spin_unlock(&class->lock);
>> +
>> +    return ret;
>> +}
>> +
>> +void zs_page_putback(struct page *page)
>> +{
>> +    struct zs_pool *pool;
>> +    struct size_class *class;
>> +    int class_idx;
>> +    enum fullness_group fullness;
>> +    struct page *first_page;
>> +
>> +    VM_BUG_ON_PAGE(!PageMovable(page), page);
>> +    VM_BUG_ON_PAGE(!PageIsolated(page), page);
>> +
>> +    first_page = page;
>> +    get_zspage_mapping(first_page, &class_idx, &fullness);
>> +    pool = page->mapping->private_data;
>> +    class = pool->size_class[class_idx];
>> +
>> +    /*
>> +     * If there is race betwwen zs_free and here, free_zspage
>> +     * in zs_free will wait the page lock of @page without
>> +     * destroying of zspage.
>> +     */
>> +    INIT_LIST_HEAD(&first_page->lru);
>> +    spin_lock(&class->lock);
>> +    ClearPageIsolated(page);
>> +    ClearZsPageIsolate(first_page);
>> +    putback_zspage(class, first_page);
>> +    spin_unlock(&class->lock);
>> +}
>> +
>> +const struct address_space_operations zsmalloc_aops = {
>> +    .isolate_page = zs_page_isolate,
>> +    .migratepage = zs_page_migrate,
>> +    .putback_page = zs_page_putback,
>> +};
>> +
>>   /**
>>    * zs_create_pool - Creates an allocation pool to work from.
>>    * @flags: allocation flags used to allocate pool metadata
>> @@ -2145,6 +2409,15 @@ struct zs_pool *zs_create_pool(const char
>> *name, gfp_t flags)
>>       if (zs_pool_stat_create(pool, name))
>>           goto err;
>>
>> +    pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
>> +    if (IS_ERR(pool->inode)) {
>> +        pool->inode = NULL;
>> +        goto err;
>> +    }
>> +
>> +    pool->inode->i_mapping->a_ops = &zsmalloc_aops;
>> +    pool->inode->i_mapping->private_data = pool;
>> +
>>       /*
>>        * Not critical, we still can use the pool
>>        * and user can trigger compaction manually.
>> @@ -2164,6 +2437,8 @@ void zs_destroy_pool(struct zs_pool *pool)
>>       int i;
>>
>>       zs_unregister_shrinker(pool);
>> +    if (pool->inode)
>> +        iput(pool->inode);
>>       zs_pool_stat_destroy(pool);
>>
>>       for (i = 0; i < zs_size_classes; i++) {
>> @@ -2192,10 +2467,33 @@ void zs_destroy_pool(struct zs_pool *pool)
>>   }
>>   EXPORT_SYMBOL_GPL(zs_destroy_pool);
>>
>> +static struct dentry *zs_mount(struct file_system_type *fs_type,
>> +                int flags, const char *dev_name, void *data)
>> +{
>> +    static const struct dentry_operations ops = {
>> +        .d_dname = simple_dname,
>> +    };
>> +
>> +    return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops,
>> ZSMALLOC_MAGIC);
>> +}
>> +
>> +static struct file_system_type zsmalloc_fs = {
>> +    .name        = "zsmalloc",
>> +    .mount        = zs_mount,
>> +    .kill_sb    = kill_anon_super,
>> +};
>> +
>>   static int __init zs_init(void)
>>   {
>> -    int ret = zs_register_cpu_notifier();
>> +    int ret;
>> +
>> +    zsmalloc_mnt = kern_mount(&zsmalloc_fs);
>> +    if (IS_ERR(zsmalloc_mnt)) {
>> +        ret = PTR_ERR(zsmalloc_mnt);
>> +        goto out;
>> +    }
>>
>> +    ret = zs_register_cpu_notifier();
>>       if (ret)
>>           goto notifier_fail;
>>
>> @@ -2218,6 +2516,7 @@ static int __init zs_init(void)
>>           pr_err("zs stat initialization failed\n");
>>           goto stat_fail;
>>       }
>> +
>>       return 0;
>>
>>   stat_fail:
>> @@ -2226,7 +2525,8 @@ static int __init zs_init(void)
>>   #endif
>>   notifier_fail:
>>       zs_unregister_cpu_notifier();
>> -
>> +    kern_unmount(zsmalloc_mnt);
>> +out:
>>       return ret;
>>   }
>>
>> @@ -2237,6 +2537,8 @@ static void __exit zs_exit(void)
>>   #endif
>>       zs_unregister_cpu_notifier();
>>
>> +    kern_unmount(zsmalloc_mnt);
>> +
>>       zs_stat_exit();
>>   }
>>
>>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 13/16] zsmalloc: migrate head page of zspage
  2016-04-06 13:01   ` Chulmin Kim
  2016-04-07  0:34     ` Chulmin Kim
@ 2016-04-07  0:43     ` Minchan Kim
  1 sibling, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-07  0:43 UTC (permalink / raw)
  To: Chulmin Kim; +Cc: Andrew Morton, linux-kernel, linux-mm

Hello Chulmin,

On Wed, Apr 06, 2016 at 10:01:14PM +0900, Chulmin Kim wrote:
> On 2016-03-30 16:12, Minchan Kim wrote:
> >This patch introduces run-time migration feature for zspage.
> >To begin with, it supports only head page migration for
> >easy review(later patches will support tail page migration).
> >
> >For migration, it supports three functions
> >
> >* zs_page_isolate
> >
> >It isolates a zspage which includes a subpage VM want to migrate
> >from class so anyone cannot allocate new object from the zspage.
> >IOW, allocation freeze
> >
> >* zs_page_migrate
> >
> >First of all, it freezes zspage to prevent zspage destrunction
> >so anyone cannot free object. Then, It copies content from oldpage
> >to newpage and create new page-chain with new page.
> >If it was successful, drop the refcount of old page to free
> >and putback new zspage to right data structure of zsmalloc.
> >Lastly, unfreeze zspages so we allows object allocation/free
> >from now on.
> >
> >* zs_page_putback
> >
> >It returns isolated zspage to right fullness_group list
> >if it fails to migrate a page.
> >
> >NOTE: A hurdle to support migration is that destroying zspage
> >while migration is going on. Once a zspage is isolated,
> >anyone cannot allocate object from the zspage but can deallocate
> >object freely so a zspage could be destroyed until all of objects
> >in zspage are freezed to prevent deallocation. The problem is
> >large window betwwen zs_page_isolate and freeze_zspage
> >in zs_page_migrate so the zspage could be destroyed.
> >
> >A easy approach to solve the problem is that object freezing
> >in zs_page_isolate but it has a drawback that any object cannot
> >be deallocated until migration fails after isolation. However,
> >There is large time gab between isolation and migration so
> >any object freeing in other CPU should spin by pin_tag which
> >would cause big latency. So, this patch introduces lock_zspage
> >which holds PG_lock of all pages in a zspage right before
> >freeing the zspage. VM migration locks the page, too right
> >before calling ->migratepage so such race doesn't exist any more.
> >
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  include/uapi/linux/magic.h |   1 +
> >  mm/zsmalloc.c              | 332 +++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 318 insertions(+), 15 deletions(-)
> >
> >diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> >index e1fbe72c39c0..93b1affe4801 100644
> >--- a/include/uapi/linux/magic.h
> >+++ b/include/uapi/linux/magic.h
> >@@ -79,5 +79,6 @@
> >  #define NSFS_MAGIC		0x6e736673
> >  #define BPF_FS_MAGIC		0xcafe4a11
> >  #define BALLOON_KVM_MAGIC	0x13661366
> >+#define ZSMALLOC_MAGIC		0x58295829
> >
> >  #endif /* __LINUX_MAGIC_H__ */
> >diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> >index ac8ca7b10720..f6c9138c3be0 100644
> >--- a/mm/zsmalloc.c
> >+++ b/mm/zsmalloc.c
> >@@ -56,6 +56,8 @@
> >  #include <linux/debugfs.h>
> >  #include <linux/zsmalloc.h>
> >  #include <linux/zpool.h>
> >+#include <linux/mount.h>
> >+#include <linux/migrate.h>
> >
> >  /*
> >   * This must be power of 2 and greater than of equal to sizeof(link_free).
> >@@ -182,6 +184,8 @@ struct zs_size_stat {
> >  static struct dentry *zs_stat_root;
> >  #endif
> >
> >+static struct vfsmount *zsmalloc_mnt;
> >+
> >  /*
> >   * number of size_classes
> >   */
> >@@ -263,6 +267,7 @@ struct zs_pool {
> >  #ifdef CONFIG_ZSMALLOC_STAT
> >  	struct dentry *stat_dentry;
> >  #endif
> >+	struct inode *inode;
> >  };
> >
> >  struct zs_meta {
> >@@ -412,6 +417,29 @@ static int is_last_page(struct page *page)
> >  	return PagePrivate2(page);
> >  }
> >
> >+/*
> >+ * Indicate that whether zspage is isolated for page migration.
> >+ * Protected by size_class lock
> >+ */
> >+static void SetZsPageIsolate(struct page *first_page)
> >+{
> >+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >+	SetPageUptodate(first_page);
> >+}
> >+
> >+static int ZsPageIsolate(struct page *first_page)
> >+{
> >+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >+
> >+	return PageUptodate(first_page);
> >+}
> >+
> >+static void ClearZsPageIsolate(struct page *first_page)
> >+{
> >+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >+	ClearPageUptodate(first_page);
> >+}
> >+
> >  static int get_zspage_inuse(struct page *first_page)
> >  {
> >  	struct zs_meta *m;
> >@@ -783,8 +811,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
> >  	if (newfg == currfg)
> >  		goto out;
> >
> >-	remove_zspage(class, currfg, first_page);
> >-	insert_zspage(class, newfg, first_page);
> >+	/* Later, putback will insert page to right list */
> >+	if (!ZsPageIsolate(first_page)) {
> >+		remove_zspage(class, currfg, first_page);
> >+		insert_zspage(class, newfg, first_page);
> >+	}
> 
> Hello, Minchan.
> 
> I am running a serious stress test using this patchset.
> (By the way, there can be a false alarm as I am working on Kernel v3.18.)
> 
> I got a bug as depicted in the below.
> 
> <0>[47821.493416]  [3:      dumpstate:16261] page:ffffffbdc44aaac0
> count:0 mapcount:0 mapping:          (null) index:0x2
> <0>[47821.493524]  [3:      dumpstate:16261] flags: 0x4000000000000000()
> <1>[47821.493592]  [3:      dumpstate:16261] page dumped because:
> VM_BUG_ON_PAGE(!is_first_page(first_page))
> <4>[47821.493684]  [3:      dumpstate:16261] ------------[ cut here
> ]------------
> ...
> <4>[47821.507309]  [3:      dumpstate:16261] [<ffffffc0002515f4>]
> get_zspage_inuse+0x1c/0x30
> <4>[47821.507381]  [3:      dumpstate:16261] [<ffffffc0002517a8>]
> insert_zspage+0x94/0xb0
> <4>[47821.507452]  [3:      dumpstate:16261] [<ffffffc000252290>]
> putback_zspage+0xac/0xd4
> <4>[47821.507522]  [3:      dumpstate:16261] [<ffffffc0002539a0>]
> zs_page_migrate+0x3d8/0x464
> <4>[47821.507595]  [3:      dumpstate:16261] [<ffffffc000250294>]
> migrate_pages+0x5dc/0x88
> 
> 
> When calling get_zspage_inuse(*head) in insert_zspage(),
> VM_BUG_ON_PAGE occurred as *head was not the first page of a zspage.
> 
> 
> During the debugging,
> I thought that *head page could be a free page in pcp list.
> - count, mapcount were reset.
> - page->freelist = MIGRATE_MOVABLE (0x2)
> - *head page had the multiple pages in the same state.
> 
> 
> Here is my theory.
> 
> Circumstances
> (1) A certain page in a zs page is isolated and about to be migrated.
> (not being migrated)
> (2) zs_free() simultaneously occurred for the zs object in the above zs
> page.
> 
> What may happen.
> (1) Assume that the above zs_free() made the zspage's FG to ZS_EMPTY.
> (2) However, as the zspage is isolated, the zspage is not removed from
> the fullness list (e.g. reside in fullness_list[ZS_ALMOST_EMPTY]).
> (according to this patch's code line just before my greeting.)
> (3) The zspage is reset by free_zspage() in zs_free().
> (4) and freed (maybe after zs_putback_page()).
> (5) Freed zspage becomes a free page and is inserted into pcp freelist.

free_zspage holds the PG_lock of all sub-pages in a zspage before
resetting them. When the reset happens, if a page is isolated, zsmalloc
clears PG_isolated but keeps PG_movable.

And move_to_new_page checks !PageIsolated under PG_lock; if the page is
no longer isolated, it doesn't call migratepage and just frees the page.

        if (unlikely(PageMovable(page))) {
                lru_movable = false;
                VM_BUG_ON_PAGE(!mapping, page);
                if (!PageIsolated(page)) {
                        rc = MIGRATEPAGE_SUCCESS;
                        __ClearPageMovable(page);
                        goto out; 
                }    
        }    


However, as Vlastimil pointed out, there was another race in the
migration path. I'm not sure whether it's related to the bug you are
seeing. I will handle it in the next revision.

If my answer was not clear or you have another theory, please ping me
anytime.

Thanks for the review and hard testing, Chulmin!

> 
> 
> If my theory is correct,
> we need some change in this patch.
> (e.g. allow remove_zspage in fix_fullness_group())
> 
> 
> Please check it out.
> 
> Thanks.
> 
> 
> >  	set_zspage_mapping(first_page, class_idx, newfg);
> >
> >  out:
> >@@ -950,12 +981,31 @@ static void unpin_tag(unsigned long handle)
> >
> >  static void reset_page(struct page *page)
> >  {
> >+	if (!PageIsolated(page))
> >+		__ClearPageMovable(page);
> >+	ClearPageIsolated(page);
> >  	clear_bit(PG_private, &page->flags);
> >  	clear_bit(PG_private_2, &page->flags);
> >  	set_page_private(page, 0);
> >  	page->freelist = NULL;
> >  }
> >
> >+/**
> >+ * lock_zspage - lock all pages in the zspage
> >+ * @first_page: head page of the zspage
> >+ *
> >+ * To prevent destroy during migration, zspage freeing should
> >+ * hold locks of all pages in a zspage
> >+ */
> >+void lock_zspage(struct page *first_page)
> >+{
> >+	struct page *cursor = first_page;
> >+
> >+	do {
> >+		while (!trylock_page(cursor));
> >+	} while ((cursor = get_next_page(cursor)) != NULL);
> >+}
> >+
> >  static void free_zspage(struct zs_pool *pool, struct page *first_page)
> >  {
> >  	struct page *nextp, *tmp, *head_extra;
> >@@ -963,26 +1013,31 @@ static void free_zspage(struct zs_pool *pool, struct page *first_page)
> >  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >  	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
> >
> >+	lock_zspage(first_page);
> >  	head_extra = (struct page *)page_private(first_page);
> >
> >-	reset_page(first_page);
> >-	__free_page(first_page);
> >-
> >  	/* zspage with only 1 system page */
> >  	if (!head_extra)
> >-		return;
> >+		goto out;
> >
> >  	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
> >  		list_del(&nextp->lru);
> >  		reset_page(nextp);
> >-		__free_page(nextp);
> >+		unlock_page(nextp);
> >+		put_page(nextp);
> >  	}
> >  	reset_page(head_extra);
> >-	__free_page(head_extra);
> >+	unlock_page(head_extra);
> >+	put_page(head_extra);
> >+out:
> >+	reset_page(first_page);
> >+	unlock_page(first_page);
> >+	put_page(first_page);
> >  }
> >
> >  /* Initialize a newly allocated zspage */
> >-static void init_zspage(struct size_class *class, struct page *first_page)
> >+static void init_zspage(struct size_class *class, struct page *first_page,
> >+			struct address_space *mapping)
> >  {
> >  	int freeobj = 1;
> >  	unsigned long off = 0;
> >@@ -991,6 +1046,9 @@ static void init_zspage(struct size_class *class, struct page *first_page)
> >  	first_page->freelist = NULL;
> >  	INIT_LIST_HEAD(&first_page->lru);
> >  	set_zspage_inuse(first_page, 0);
> >+	BUG_ON(!trylock_page(first_page));
> >+	__SetPageMovable(first_page, mapping);
> >+	unlock_page(first_page);
> >
> >  	while (page) {
> >  		struct page *next_page;
> >@@ -1065,10 +1123,45 @@ static void create_page_chain(struct page *pages[], int nr_pages)
> >  	}
> >  }
> >
> >+static void replace_sub_page(struct size_class *class, struct page *first_page,
> >+		struct page *newpage, struct page *oldpage)
> >+{
> >+	struct page *page;
> >+	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL,};
> >+	int idx = 0;
> >+
> >+	page = first_page;
> >+	do {
> >+		if (page == oldpage)
> >+			pages[idx] = newpage;
> >+		else
> >+			pages[idx] = page;
> >+		idx++;
> >+	} while ((page = get_next_page(page)) != NULL);
> >+
> >+	create_page_chain(pages, class->pages_per_zspage);
> >+
> >+	if (is_first_page(oldpage)) {
> >+		enum fullness_group fg;
> >+		int class_idx;
> >+
> >+		SetZsPageIsolate(newpage);
> >+		get_zspage_mapping(oldpage, &class_idx, &fg);
> >+		set_zspage_mapping(newpage, class_idx, fg);
> >+		set_freeobj(newpage, get_freeobj(oldpage));
> >+		set_zspage_inuse(newpage, get_zspage_inuse(oldpage));
> >+		if (class->huge)
> >+			set_page_private(newpage,  page_private(oldpage));
> >+	}
> >+
> >+	__SetPageMovable(newpage, oldpage->mapping);
> >+}
> >+
> >  /*
> >   * Allocate a zspage for the given size class
> >   */
> >-static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> >+static struct page *alloc_zspage(struct zs_pool *pool,
> >+				struct size_class *class)
> >  {
> >  	int i;
> >  	struct page *first_page = NULL;
> >@@ -1088,7 +1181,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> >  	for (i = 0; i < class->pages_per_zspage; i++) {
> >  		struct page *page;
> >
> >-		page = alloc_page(flags);
> >+		page = alloc_page(pool->flags);
> >  		if (!page) {
> >  			while (--i >= 0)
> >  				__free_page(pages[i]);
> >@@ -1100,7 +1193,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> >
> >  	create_page_chain(pages, class->pages_per_zspage);
> >  	first_page = pages[0];
> >-	init_zspage(class, first_page);
> >+	init_zspage(class, first_page, pool->inode->i_mapping);
> >
> >  	return first_page;
> >  }
> >@@ -1499,7 +1592,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
> >
> >  	if (!first_page) {
> >  		spin_unlock(&class->lock);
> >-		first_page = alloc_zspage(class, pool->flags);
> >+		first_page = alloc_zspage(pool, class);
> >  		if (unlikely(!first_page)) {
> >  			free_handle(pool, handle);
> >  			return 0;
> >@@ -1559,6 +1652,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> >  	if (unlikely(!handle))
> >  		return;
> >
> >+	/* Once handle is pinned, page|object migration cannot work */
> >  	pin_tag(handle);
> >  	obj = handle_to_obj(handle);
> >  	obj_to_location(obj, &f_page, &f_objidx);
> >@@ -1714,6 +1808,9 @@ static enum fullness_group putback_zspage(struct size_class *class,
> >  {
> >  	enum fullness_group fullness;
> >
> >+	VM_BUG_ON_PAGE(!list_empty(&first_page->lru), first_page);
> >+	VM_BUG_ON_PAGE(ZsPageIsolate(first_page), first_page);
> >+
> >  	fullness = get_fullness_group(class, first_page);
> >  	insert_zspage(class, fullness, first_page);
> >  	set_zspage_mapping(first_page, class->index, fullness);
> >@@ -2059,6 +2156,173 @@ static int zs_register_shrinker(struct zs_pool *pool)
> >  	return register_shrinker(&pool->shrinker);
> >  }
> >
> >+bool zs_page_isolate(struct page *page, isolate_mode_t mode)
> >+{
> >+	struct zs_pool *pool;
> >+	struct size_class *class;
> >+	int class_idx;
> >+	enum fullness_group fullness;
> >+	struct page *first_page;
> >+
> >+	/*
> >+	 * The page is locked so it couldn't be destroyed.
> >+	 * For detail, look at lock_zspage in free_zspage.
> >+	 */
> >+	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >+	VM_BUG_ON_PAGE(PageIsolated(page), page);
> >+	/*
> >+	 * In this implementation, it allows only first page migration.
> >+	 */
> >+	VM_BUG_ON_PAGE(!is_first_page(page), page);
> >+	first_page = page;
> >+
> >+	/*
> >+	 * Without class lock, fullness is meaningless while constant
> >+	 * class_idx is okay. We will get it under class lock at below,
> >+	 * again.
> >+	 */
> >+	get_zspage_mapping(first_page, &class_idx, &fullness);
> >+	pool = page->mapping->private_data;
> >+	class = pool->size_class[class_idx];
> >+
> >+	if (!spin_trylock(&class->lock))
> >+		return false;
> >+
> >+	get_zspage_mapping(first_page, &class_idx, &fullness);
> >+	remove_zspage(class, fullness, first_page);
> >+	SetZsPageIsolate(first_page);
> >+	SetPageIsolated(page);
> >+	spin_unlock(&class->lock);
> >+
> >+	return true;
> >+}
> >+
> >+int zs_page_migrate(struct address_space *mapping, struct page *newpage,
> >+		struct page *page, enum migrate_mode mode)
> >+{
> >+	struct zs_pool *pool;
> >+	struct size_class *class;
> >+	int class_idx;
> >+	enum fullness_group fullness;
> >+	struct page *first_page;
> >+	void *s_addr, *d_addr, *addr;
> >+	int ret = -EBUSY;
> >+	int offset = 0;
> >+	int freezed = 0;
> >+
> >+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> >+
> >+	first_page = page;
> >+	get_zspage_mapping(first_page, &class_idx, &fullness);
> >+	pool = page->mapping->private_data;
> >+	class = pool->size_class[class_idx];
> >+
> >+	/*
> >+	 * Get stable fullness under class->lock
> >+	 */
> >+	if (!spin_trylock(&class->lock))
> >+		return ret;
> >+
> >+	get_zspage_mapping(first_page, &class_idx, &fullness);
> >+	if (get_zspage_inuse(first_page) == 0)
> >+		goto out_class_unlock;
> >+
> >+	freezed = freeze_zspage(class, first_page);
> >+	if (freezed != get_zspage_inuse(first_page))
> >+		goto out_unfreeze;
> >+
> >+	/* copy contents from page to newpage */
> >+	s_addr = kmap_atomic(page);
> >+	d_addr = kmap_atomic(newpage);
> >+	memcpy(d_addr, s_addr, PAGE_SIZE);
> >+	kunmap_atomic(d_addr);
> >+	kunmap_atomic(s_addr);
> >+
> >+	if (!is_first_page(page))
> >+		offset = page->index;
> >+
> >+	addr = kmap_atomic(page);
> >+	do {
> >+		unsigned long handle;
> >+		unsigned long head;
> >+		unsigned long new_obj, old_obj;
> >+		unsigned long obj_idx;
> >+		struct page *dummy;
> >+
> >+		head = obj_to_head(class, page, addr + offset);
> >+		if (head & OBJ_ALLOCATED_TAG) {
> >+			handle = head & ~OBJ_ALLOCATED_TAG;
> >+			if (!testpin_tag(handle))
> >+				BUG();
> >+
> >+			old_obj = handle_to_obj(handle);
> >+			obj_to_location(old_obj, &dummy, &obj_idx);
> >+			new_obj = location_to_obj(newpage, obj_idx);
> >+			new_obj |= BIT(HANDLE_PIN_BIT);
> >+			record_obj(handle, new_obj);
> >+		}
> >+		offset += class->size;
> >+	} while (offset < PAGE_SIZE);
> >+	kunmap_atomic(addr);
> >+
> >+	replace_sub_page(class, first_page, newpage, page);
> >+	first_page = newpage;
> >+	get_page(newpage);
> >+	VM_BUG_ON_PAGE(get_fullness_group(class, first_page) ==
> >+			ZS_EMPTY, first_page);
> >+	ClearZsPageIsolate(first_page);
> >+	putback_zspage(class, first_page);
> >+
> >+	/* Migration complete. Free old page */
> >+	ClearPageIsolated(page);
> >+	reset_page(page);
> >+	put_page(page);
> >+	ret = MIGRATEPAGE_SUCCESS;
> >+
> >+out_unfreeze:
> >+	unfreeze_zspage(class, first_page, freezed);
> >+out_class_unlock:
> >+	spin_unlock(&class->lock);
> >+
> >+	return ret;
> >+}
> >+
> >+void zs_page_putback(struct page *page)
> >+{
> >+	struct zs_pool *pool;
> >+	struct size_class *class;
> >+	int class_idx;
> >+	enum fullness_group fullness;
> >+	struct page *first_page;
> >+
> >+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> >+
> >+	first_page = page;
> >+	get_zspage_mapping(first_page, &class_idx, &fullness);
> >+	pool = page->mapping->private_data;
> >+	class = pool->size_class[class_idx];
> >+
> >+	/*
> >+	 * If there is race betwwen zs_free and here, free_zspage
> >+	 * in zs_free will wait the page lock of @page without
> >+	 * destroying of zspage.
> >+	 */
> >+	INIT_LIST_HEAD(&first_page->lru);
> >+	spin_lock(&class->lock);
> >+	ClearPageIsolated(page);
> >+	ClearZsPageIsolate(first_page);
> >+	putback_zspage(class, first_page);
> >+	spin_unlock(&class->lock);
> >+}
> >+
> >+const struct address_space_operations zsmalloc_aops = {
> >+	.isolate_page = zs_page_isolate,
> >+	.migratepage = zs_page_migrate,
> >+	.putback_page = zs_page_putback,
> >+};
> >+
> >  /**
> >   * zs_create_pool - Creates an allocation pool to work from.
> >   * @flags: allocation flags used to allocate pool metadata
> >@@ -2145,6 +2409,15 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
> >  	if (zs_pool_stat_create(pool, name))
> >  		goto err;
> >
> >+	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
> >+	if (IS_ERR(pool->inode)) {
> >+		pool->inode = NULL;
> >+		goto err;
> >+	}
> >+
> >+	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
> >+	pool->inode->i_mapping->private_data = pool;
> >+
> >  	/*
> >  	 * Not critical, we still can use the pool
> >  	 * and user can trigger compaction manually.
> >@@ -2164,6 +2437,8 @@ void zs_destroy_pool(struct zs_pool *pool)
> >  	int i;
> >
> >  	zs_unregister_shrinker(pool);
> >+	if (pool->inode)
> >+		iput(pool->inode);
> >  	zs_pool_stat_destroy(pool);
> >
> >  	for (i = 0; i < zs_size_classes; i++) {
> >@@ -2192,10 +2467,33 @@ void zs_destroy_pool(struct zs_pool *pool)
> >  }
> >  EXPORT_SYMBOL_GPL(zs_destroy_pool);
> >
> >+static struct dentry *zs_mount(struct file_system_type *fs_type,
> >+				int flags, const char *dev_name, void *data)
> >+{
> >+	static const struct dentry_operations ops = {
> >+		.d_dname = simple_dname,
> >+	};
> >+
> >+	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
> >+}
> >+
> >+static struct file_system_type zsmalloc_fs = {
> >+	.name		= "zsmalloc",
> >+	.mount		= zs_mount,
> >+	.kill_sb	= kill_anon_super,
> >+};
> >+
> >  static int __init zs_init(void)
> >  {
> >-	int ret = zs_register_cpu_notifier();
> >+	int ret;
> >+
> >+	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
> >+	if (IS_ERR(zsmalloc_mnt)) {
> >+		ret = PTR_ERR(zsmalloc_mnt);
> >+		goto out;
> >+	}
> >
> >+	ret = zs_register_cpu_notifier();
> >  	if (ret)
> >  		goto notifier_fail;
> >
> >@@ -2218,6 +2516,7 @@ static int __init zs_init(void)
> >  		pr_err("zs stat initialization failed\n");
> >  		goto stat_fail;
> >  	}
> >+
> >  	return 0;
> >
> >  stat_fail:
> >@@ -2226,7 +2525,8 @@ static int __init zs_init(void)
> >  #endif
> >  notifier_fail:
> >  	zs_unregister_cpu_notifier();
> >-
> >+	kern_unmount(zsmalloc_mnt);
> >+out:
> >  	return ret;
> >  }
> >
> >@@ -2237,6 +2537,8 @@ static void __exit zs_exit(void)
> >  #endif
> >  	zs_unregister_cpu_notifier();
> >
> >+	kern_unmount(zsmalloc_mnt);
> >+
> >  	zs_stat_exit();
> >  }
> >
> >
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 03/16] mm: add non-lru movable page support document
  2016-04-04 13:09       ` Vlastimil Babka
@ 2016-04-07  2:27         ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-07  2:27 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu,
	Jonathan Corbet

On Mon, Apr 04, 2016 at 03:09:22PM +0200, Vlastimil Babka wrote:
> On 04/04/2016 04:25 AM, Minchan Kim wrote:
> >>
> >>Ah, I see, so it's designed with page lock to handle the concurrent isolations etc.
> >>
> >>In http://marc.info/?l=linux-mm&m=143816716511904&w=2 Mel has warned
> >>about doing this in general under page_lock and suggested that each
> >>user handles concurrent calls to isolate_page() internally. Might be
> >>more generic that way, even if all current implementers will
> >>actually use the page lock.
> >
> >We need PG_lock for two reasons.
> >
> >Firstly, it guarantees page's flags operation(i.e., PG_movable, PG_isolated)
> >atomicity. Another thing is for stability for page->mapping->a_ops.
> >
> >For example,
> >
> >isolate_migratepages_block
> >         if (PageMovable(page))
> >                 isolate_movable_page
> >                         get_page_unless_zero <--- 1
> >                         trylock_page
> >                         page->mapping->a_ops->isolate_page <--- 2
> >
> >Between 1 and 2, driver can nullify page->mapping so we need PG_lock
> 
> Hmm I see, that really doesn't seem easily solvable without page_lock.
> My idea is that compaction code would just check PageMovable() and
> PageIsolated() to find a candidate.
> page->mapping->a_ops->isolate_page would do the driver-specific
> necessary locking, revalidate if the page state and succeed
> isolation, or fail. It would need to handle the possibility that the

So you mean that the VM could try to isolate a false-positive page of
the driver? I don't think it's a good idea. To handle that, every
driver would need its own logic for such false positives, which means
its own data structure or something to remember whether a page passed
in from the VM is still valid. It makes the driver's logic more
complicated and requires more code to handle it. It's not a good deal.
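
To make the burden concrete, here is a rough sketch of the per-driver
bookkeeping such a scheme would force on every owner of movable pages;
drv_pages, drv_lock, drv_page_is_live() and drv_isolate_page() are
hypothetical names used only for illustration, not code from this
series or from any real driver.

#include <linux/mm.h>
#include <linux/list.h>
#include <linux/spinlock.h>

static LIST_HEAD(drv_pages);            /* hypothetical: pages the driver owns */
static DEFINE_SPINLOCK(drv_lock);       /* hypothetical: protects drv_pages */

/* hypothetical: is @page still owned by the driver? Caller holds drv_lock. */
static bool drv_page_is_live(struct page *page)
{
        struct page *cursor;

        list_for_each_entry(cursor, &drv_pages, lru)
                if (cursor == page)
                        return true;
        return false;
}

/*
 * Hypothetical ->isolate_page() that must revalidate a possibly stale
 * page handed in by compaction, instead of relying on PG_lock the way
 * this patchset does.
 */
static bool drv_isolate_page(struct page *page, isolate_mode_t mode)
{
        bool ret = false;

        spin_lock(&drv_lock);
        /*
         * The VM only peeked at the page flags, so the page may already
         * have been freed or reused by the driver; every driver would
         * have to repeat this kind of revalidation itself.
         */
        if (drv_page_is_live(page) && !PageIsolated(page)) {
                SetPageIsolated(page);
                ret = true;
        }
        spin_unlock(&drv_lock);

        return ret;
}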

> page already doesn't belong to the mapping, which is probably not a
> problem. But what if the driver is a module that was already
> unloaded, and even though we did NULL-check each part from page to
> isolate_page, it points to a function that's already gone? That
> would need some extra handling to prevent that, hm...

Yes, the driver should clean up the pages it is using, and for that we
need some lock. I think page_lock is good for it because we are
migrating a *page*, and page_lock has been used in the migration path
for a long time.
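
As a reminder of the rule being argued for, here is a minimal sketch of
a driver release path paired with the page lock taken by
isolate_movable_page(); it is modeled on lock_zspage()/reset_page()
from this series, and drv_release_page() is a hypothetical helper, not
code from the patchset.

/*
 * Sketch of the teardown ordering: the driver takes the page lock
 * before tearing down its page state, so a racing isolate_movable_page()
 * either runs before the release (and isolates a still-valid page) or
 * after it (and sees PG_movable already cleared).
 */
static void drv_release_page(struct page *page)        /* hypothetical */
{
        lock_page(page);
        /*
         * Mirrors reset_page() in this series: an already-isolated page
         * keeps PG_movable so that move_to_new_page() can notice the
         * owner released it and simply free it instead of migrating.
         */
        if (!PageIsolated(page))
                __ClearPageMovable(page);
        ClearPageIsolated(page);
        unlock_page(page);
        put_page(page);
}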

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
  2016-04-04 13:24       ` Vlastimil Babka
@ 2016-04-07  2:35         ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-07  2:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, dri-devel,
	Gioh Kim

On Mon, Apr 04, 2016 at 03:24:34PM +0200, Vlastimil Babka wrote:
> On 04/04/2016 07:12 AM, Minchan Kim wrote:
> >On Fri, Apr 01, 2016 at 11:29:14PM +0200, Vlastimil Babka wrote:
> >>Might have been better as a separate migration patch and then a
> >>compaction patch. It's prefixed mm/compaction, but most changed are
> >>in mm/migrate.c
> >
> >Indeed. The title is rather misleading but not sure it's a good idea
> >to separate compaction and migration part.
> 
> Guess it's better to see the new functions together with its user
> after all, OK.
> 
> >I will just resend to change the tile from "mm/compaction" to
> >"mm/migration".
> 
> OK!
> 
> >>Also I'm a bit uncomfortable how isolate_movable_page() blindly expects that
> >>page->mapping->a_ops->isolate_page exists for PageMovable() pages.
> >>What if it's a false positive on a PG_reclaim page? Can we rely on
> >>PG_reclaim always (and without races) implying PageLRU() so that we
> >>don't even attempt isolate_movable_page()?
> >
> >For now, we shouldn't have such a false positive because PageMovable
> >checks page->_mapcount == PAGE_MOVABLE_MAPCOUNT_VALUE as well as PG_movable
> >under PG_lock.
> >
> >But I read your question about user-mapped drvier pages so we cannot
> >use _mapcount anymore so I will find another thing. A option is this.
> >
> >static inline int PageMovable(struct page *page)
> >{
> >         int ret = 0;
> >         struct address_space *mapping;
> >         struct address_space_operations *a_op;
> >
> >         if (!test_bit(PG_movable, &(page->flags))
> >                 goto out;
> >
> >         mapping = page->mapping;
> >         if (!mapping)
> >                 goto out;
> >
> >         a_op = mapping->a_op;
> >         if (!aop)
> >                 goto out;
> >         if (a_op->isolate_page)
> >                 ret = 1;
> >out:
> >         return ret;
> >
> >}
> >
> >It works under PG_lock but with this, we need trylock_page to peek
> >whether it's movable non-lru or not for scanning pfn.
> 
> Hm I hoped that with READ_ONCE() we could do the peek safely without
> trylock_page, if we use it only as a heuristic. But I guess it would
> require at least RCU-level protection of the
> page->mapping->a_op->isolate_page chain.
> 
> >For avoiding that, we need another function to peek which just checks
> >PG_movable bit instead of all things.
> >
> >
> >/*
> >  * If @page_locked is false, we cannot guarantee page->mapping's stability
> >  * so just the function checks with PG_movable which could be false positive
> >  * so caller should check it again under PG_lock to check a_ops->isolate_page.
> >  */
> >static inline int PageMovable(struct page *page, bool page_locked)
> >{
> >         int ret = 0;
> >         struct address_space *mapping;
> >         struct address_space_operations *a_op;
> >
> >         if (!test_bit(PG_movable, &(page->flags))
> >                 goto out;
> >
> >         if (!page_locked) {
> >                 ret = 1;
> >                 goto out;
> >         }
> >
> >         mapping = page->mapping;
> >         if (!mapping)
> >                 goto out;
> >
> >         a_op = mapping->a_op;
> >         if (!aop)
> >                 goto out;
> >         if (a_op->isolate_page)
> >                 ret = 1;
> >out:
> >         return ret;
> >}
> 
> I wouldn't put everything into single function, but create something
> like __PageMovable() just for the unlocked peek. Unlike the
> zone->lru_lock, we don't keep page_lock() across iterations in
> isolate_migratepages_block(), as obviously each page has different
> lock.
> So the page_locked parameter would be always passed as constant, and
> at that point it's better to have separate functions.

Agree.

> 
> So I guess the question is how many false positives from overlap
> with PG_reclaim the scanner will hit if we give up on
> PAGE_MOVABLE_MAPCOUNT_VALUE, as that will increase number of page
> locks just to realize that it's not actual PageMovable() page...

I don't think it's too many because the PG_reclaim bit is set only on
LRU pages at the moment, and we can check PageMovable after the
!PageLRU check.
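
For concreteness, here is a minimal sketch of the scan-side ordering
being discussed; __PageMovable() stands for the lock-less peek
suggested above, which does not exist in v3 as posted, and
try_isolate_nonlru() is only an illustrative wrapper.

/*
 * Sketch of the peek-then-revalidate ordering for the compaction
 * scanner.  A false positive from PG_reclaim is tolerated here because
 * isolate_movable_page() rechecks page->mapping->a_ops->isolate_page
 * under PG_lock before actually isolating anything.
 */
static bool try_isolate_nonlru(struct page *page, isolate_mode_t mode)
{
        if (PageLRU(page))
                return false;   /* LRU pages take the normal isolation path */

        /* unlocked peek: cheap filter, may be a false positive */
        if (!__PageMovable(page) || PageIsolated(page))
                return false;

        /* full revalidation and the driver callback run under the page lock */
        return isolate_movable_page(page, mode);
}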

Thanks.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 04/16] mm/balloon: use general movable page feature into balloon
  2016-04-05 12:03   ` Vlastimil Babka
@ 2016-04-11  4:29     ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-11  4:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Joonsoo Kim, koct9i, aquini, virtualization, Mel Gorman,
	Hugh Dickins, Sergey Senozhatsky, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu, Gioh Kim

On Tue, Apr 05, 2016 at 02:03:05PM +0200, Vlastimil Babka wrote:
> On 03/30/2016 09:12 AM, Minchan Kim wrote:
> >Now, VM has a feature to migrate non-lru movable pages so
> >balloon doesn't need custom migration hooks in migrate.c
> >and compact.c. Instead, this patch implements page->mapping
> >->{isolate|migrate|putback} functions.
> >
> >With that, we could remove hooks for ballooning in general
> >migration functions and make balloon compaction simple.
> >
> >Cc: virtualization@lists.linux-foundation.org
> >Cc: Rafael Aquini <aquini@redhat.com>
> >Cc: Konstantin Khlebnikov <koct9i@gmail.com>
> >Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> I'm not familiar with the inode and pseudofs stuff, so just some
> things I noticed:
> 
> >-#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> >+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-256)
> >+#define PAGE_BALLOON_MAPCOUNT_VALUE PAGE_MOVABLE_MAPCOUNT_VALUE
> >
> >  static inline int PageMovable(struct page *page)
> >  {
> >-	return ((test_bit(PG_movable, &(page)->flags) &&
> >-		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> >-		|| PageBalloon(page));
> >+	return (test_bit(PG_movable, &(page)->flags) &&
> >+		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE);
> >  }
> >
> >  /* Caller should hold a PG_lock */
> >@@ -645,6 +626,35 @@ static inline void __ClearPageMovable(struct page *page)
> >
> >  PAGEFLAG(Isolated, isolated, PF_ANY);
> >
> >+static inline int PageBalloon(struct page *page)
> >+{
> >+	return atomic_read(&page->_mapcount) == PAGE_BALLOON_MAPCOUNT_VALUE
> >+		&& PagePrivate2(page);
> >+}
> 
> Hmm so you are now using PG_private_2 flag here, but it's not
> documented. Also the only caller of PageBalloon() seems to be
> stable_page_flags(). Which will now report all movable pages with
> PG_private_2 as KPF_BALOON. Seems like an overkill and also not
> reliable. Could it test e.g. page->mapping instead?

Thanks for pointing that out.
I will not use page->_mapcount in the next version, so it should be okay.

> 
> Or maybe if we manage to get rid of PAGE_MOVABLE_MAPCOUNT_VALUE, we
> can keep PAGE_BALLOON_MAPCOUNT_VALUE to simply distinguish balloon
> pages for stable_page_flags().

Yeb.

> 
> >@@ -1033,7 +1019,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
> >  out:
> >  	/* If migration is successful, move newpage to right list */
> >  	if (rc == MIGRATEPAGE_SUCCESS) {
> >-		if (unlikely(__is_movable_balloon_page(newpage)))
> >+		if (unlikely(PageMovable(newpage)))
> >  			put_page(newpage);
> >  		else
> >  			putback_lru_page(newpage);
> 
> Hmm shouldn't the condition have been changed to
> 
> if (unlikely(__is_movable_balloon_page(newpage)) || PageMovable(newpage)
> 
> by patch 02/16? And this patch should be just removing the
> balloon-specific check? Otherwise it seems like between patches 02
> and 04, other kinds of PageMovable pages were unnecessarily/wrongly
> routed through putback_lru_page()?

Fixed.

> 
> >diff --git a/mm/vmscan.c b/mm/vmscan.c
> >index d82196244340..c7696a2e11c7 100644
> >--- a/mm/vmscan.c
> >+++ b/mm/vmscan.c
> >@@ -1254,7 +1254,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
> >
> >  	list_for_each_entry_safe(page, next, page_list, lru) {
> >  		if (page_is_file_cache(page) && !PageDirty(page) &&
> >-		    !isolated_balloon_page(page)) {
> >+		    !PageIsolated(page)) {
> >  			ClearPageActive(page);
> >  			list_move(&page->lru, &clean_pages);
> >  		}
> 
> This looks like the same comment as above at first glance. But
> looking closer, it's even weirder. isolated_balloon_page() was
> simply PageBalloon() after d6d86c0a7f8dd... weird already. You
> replace it with check for !PageIsolated() which looks like a more
> correct check, so ok. Except the potential false positive with
> PG_owner_priv_1.

I will change it in the next version, so it shouldn't be a problem.
Thanks for the review!

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 00/16] Support non-lru page migration
  2016-04-04 13:17 ` John Einar Reitan
@ 2016-04-11  4:35   ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-11  4:35 UTC (permalink / raw)
  To: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

On Mon, Apr 04, 2016 at 03:17:18PM +0200, John Einar Reitan wrote:
> On Wed, Mar 30, 2016 at 04:11:59PM +0900, Minchan Kim wrote:
> > Recently, I got many reports about perfermance degradation
> > in embedded system(Android mobile phone, webOS TV and so on)
> > and failed to fork easily.
> > 
> > The problem was fragmentation caused by zram and GPU driver
> > pages. Their pages cannot be migrated so compaction cannot
> > work well, either so reclaimer ends up shrinking all of working
> > set pages. It made system very slow and even to fail to fork
> > easily.
> > 
> > Other pain point is that they cannot work with CMA.
> > Most of CMA memory space could be idle(ie, it could be used
> > for movable pages unless driver is using) but if driver(i.e.,
> > zram) cannot migrate his page, that memory space could be
> > wasted. In our product which has big CMA memory, it reclaims
> > zones too exccessively although there are lots of free space
> > in CMA so system was very slow easily.
> > 
> > To solve these problem, this patch try to add facility to
> > migrate non-lru pages via introducing new friend functions
> > of migratepage in address_space_operation and new page flags.
> > 
> > 	(isolate_page, putback_page)
> > 	(PG_movable, PG_isolated)
> > 
> > For details, please read description in
> > "mm/compaction: support non-lru movable page migration".
> 
> Thanks, this mirrors what we see with the ARM Mali GPU drivers too.
> 
> One thing with the current design which worries me is the potential
> for multiple calls due to many separated pages being migrated.
> On GPUs (or any other device) which has an IOMMU and L2 cache, which
> isn't coherent with the CPU, we must do L2 cache flush & invalidation
> per page. I guess batching pages isn't easily possible?
> 

Hmm, I think it would cause a lot of code churn, but it is surely worth
doing. So, IMO, it would be better to add such a feature after the
current work has landed.
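
Purely to illustrate the kind of interface change being deferred here
(nothing like this exists in this series or in mainline), a batched
hook might look roughly like the following.

/*
 * Hypothetical batched migration hook, for illustration only: a driver
 * with a non-coherent IOMMU/L2 could then flush or invalidate once per
 * batch instead of once per migrated page.
 */
struct movable_page_batch {                    /* hypothetical */
        struct page **old_pages;
        struct page **new_pages;
        int nr;
};

/* hypothetical extension of address_space_operations */
struct movable_aops_ext {
        int (*migratepages)(struct address_space *mapping,
                            struct movable_page_batch *batch,
                            enum migrate_mode mode);
};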

Anyway, I will Cc you in the next revision.

Thanks.

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
  2016-03-30  7:12 ` [PATCH v3 02/16] mm/compaction: support non-lru movable page migration Minchan Kim
  2016-04-01 21:29   ` Vlastimil Babka
@ 2016-04-12  8:00   ` Chulmin Kim
  2016-04-12 14:25     ` Minchan Kim
  1 sibling, 1 reply; 65+ messages in thread
From: Chulmin Kim @ 2016-04-12  8:00 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton; +Cc: linux-kernel, linux-mm

On 2016-03-30 16:12, Minchan Kim wrote:
> We have allowed migration for only LRU pages until now and it was
> enough to make high-order pages. But recently, embedded system(e.g.,
> webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
> so we have seen several reports about troubles of small high-order
> allocation. For fixing the problem, there were several efforts
> (e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
> reserved memory, vmalloc and so on) but if there are lots of
> non-movable pages in system, their solutions are void in the long run.
>
> So, this patch is to support facility to change non-movable pages
> with movable. For the feature, this patch introduces functions related
> to migration to address_space_operations as well as some page flags.
>
> Basically, this patch supports two page-flags and two functions related
> to page migration. The flag and page->mapping stability are protected
> by PG_lock.
>
> 	PG_movable
> 	PG_isolated
>
> 	bool (*isolate_page) (struct page *, isolate_mode_t);
> 	void (*putback_page) (struct page *);
>
> Duty of subsystem want to make their pages as migratable are
> as follows:
>
> 1. It should register address_space to page->mapping then mark
> the page as PG_movable via __SetPageMovable.
>
> 2. It should mark the page as PG_isolated via SetPageIsolated
> if isolation is sucessful and return true.
>
> 3. If migration is successful, it should clear PG_isolated and
> PG_movable of the page for free preparation then release the
> reference of the page to free.
>
> 4. If migration fails, putback function of subsystem should
> clear PG_isolated via ClearPageIsolated.
>
> 5. If a subsystem want to release isolated page, it should
> clear PG_isolated but not PG_movable. Instead, VM will do it.
>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: dri-devel@lists.freedesktop.org
> Cc: virtualization@lists.linux-foundation.org
> Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   Documentation/filesystems/Locking      |   4 +
>   Documentation/filesystems/vfs.txt      |   5 +
>   fs/proc/page.c                         |   3 +
>   include/linux/fs.h                     |   2 +
>   include/linux/migrate.h                |   2 +
>   include/linux/page-flags.h             |  31 ++++++
>   include/uapi/linux/kernel-page-flags.h |   1 +
>   mm/compaction.c                        |  14 ++-
>   mm/migrate.c                           | 174 +++++++++++++++++++++++++++++----
>   9 files changed, 217 insertions(+), 19 deletions(-)
>
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 619af9bfdcb3..0bb79560abb3 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -195,7 +195,9 @@ unlocks and drops the reference.
>   	int (*releasepage) (struct page *, int);
>   	void (*freepage)(struct page *);
>   	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
>   	int (*migratepage)(struct address_space *, struct page *, struct page *);
> +	void (*putback_page) (struct page *);
>   	int (*launder_page)(struct page *);
>   	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
>   	int (*error_remove_page)(struct address_space *, struct page *);
> @@ -219,7 +221,9 @@ invalidatepage:		yes
>   releasepage:		yes
>   freepage:		yes
>   direct_IO:
> +isolate_page:		yes
>   migratepage:		yes (both)
> +putback_page:		yes
>   launder_page:		yes
>   is_partially_uptodate:	yes
>   error_remove_page:	yes
> diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> index b02a7d598258..4c1b6c3b4bc8 100644
> --- a/Documentation/filesystems/vfs.txt
> +++ b/Documentation/filesystems/vfs.txt
> @@ -592,9 +592,14 @@ struct address_space_operations {
>   	int (*releasepage) (struct page *, int);
>   	void (*freepage)(struct page *);
>   	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> +	/* isolate a page for migration */
> +	bool (*isolate_page) (struct page *, isolate_mode_t);
>   	/* migrate the contents of a page to the specified target */
>   	int (*migratepage) (struct page *, struct page *);
> +	/* put the page back to right list */
> +	void (*putback_page) (struct page *);
>   	int (*launder_page) (struct page *);
> +
>   	int (*is_partially_uptodate) (struct page *, unsigned long,
>   					unsigned long);
>   	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> diff --git a/fs/proc/page.c b/fs/proc/page.c
> index 3ecd445e830d..ce3d08a4ad8d 100644
> --- a/fs/proc/page.c
> +++ b/fs/proc/page.c
> @@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
>   	if (page_is_idle(page))
>   		u |= 1 << KPF_IDLE;
>
> +	if (PageMovable(page))
> +		u |= 1 << KPF_MOVABLE;
> +
>   	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
>
>   	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index da9e67d937e5..36f2d610e7a8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -401,6 +401,8 @@ struct address_space_operations {
>   	 */
>   	int (*migratepage) (struct address_space *,
>   			struct page *, struct page *, enum migrate_mode);
> +	bool (*isolate_page)(struct page *, isolate_mode_t);
> +	void (*putback_page)(struct page *);
>   	int (*launder_page) (struct page *);
>   	int (*is_partially_uptodate) (struct page *, unsigned long,
>   					unsigned long);
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 9b50325e4ddf..404fbfefeb33 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
>   			struct page *, struct page *, enum migrate_mode);
>   extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
>   		unsigned long private, enum migrate_mode mode, int reason);
> +extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> +extern void putback_movable_page(struct page *page);
>
>   extern int migrate_prep(void);
>   extern int migrate_prep_local(void);
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index f4ed4f1b0c77..77ebf8fdbc6e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -129,6 +129,10 @@ enum pageflags {
>
>   	/* Compound pages. Stored in first tail page's flags */
>   	PG_double_map = PG_private_2,
> +
> +	/* non-lru movable pages */
> +	PG_movable = PG_reclaim,
> +	PG_isolated = PG_owner_priv_1,
>   };
>
>   #ifndef __GENERATING_BOUNDS_H
> @@ -614,6 +618,33 @@ static inline void __ClearPageBalloon(struct page *page)
>   	atomic_set(&page->_mapcount, -1);
>   }
>
> +#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> +
> +static inline int PageMovable(struct page *page)
> +{
> +	return ((test_bit(PG_movable, &(page)->flags) &&
> +		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> +		|| PageBalloon(page));
> +}
> +
> +/* Caller should hold a PG_lock */
> +static inline void __SetPageMovable(struct page *page,
> +				struct address_space *mapping)
> +{
> +	page->mapping = mapping;
> +	__set_bit(PG_movable, &page->flags);
> +	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> +}
> +
> +static inline void __ClearPageMovable(struct page *page)
> +{
> +	atomic_set(&page->_mapcount, -1);
> +	__clear_bit(PG_movable, &(page)->flags);
> +	page->mapping = NULL;
> +}
> +
> +PAGEFLAG(Isolated, isolated, PF_ANY);
> +
>   /*
>    * If network-based swap is enabled, sl*b must keep track of whether pages
>    * were allocated from pfmemalloc reserves.
> diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> index 5da5f8751ce7..a184fd2434fa 100644
> --- a/include/uapi/linux/kernel-page-flags.h
> +++ b/include/uapi/linux/kernel-page-flags.h
> @@ -34,6 +34,7 @@
>   #define KPF_BALLOON		23
>   #define KPF_ZERO_PAGE		24
>   #define KPF_IDLE		25
> +#define KPF_MOVABLE		26
>
>
>   #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> diff --git a/mm/compaction.c b/mm/compaction.c
> index ccf97b02b85f..7557aedddaee 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>
>   		/*
>   		 * Check may be lockless but that's ok as we recheck later.
> -		 * It's possible to migrate LRU pages and balloon pages
> +		 * It's possible to migrate LRU and movable kernel pages.
>   		 * Skip any other type of page
>   		 */
>   		is_lru = PageLRU(page);
> @@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
>   					goto isolate_success;
>   				}
>   			}
> +
> +			if (unlikely(PageMovable(page)) &&
> +					!PageIsolated(page)) {
> +				if (locked) {
> +					spin_unlock_irqrestore(&zone->lru_lock,
> +									flags);
> +					locked = false;
> +				}
> +
> +				if (isolate_movable_page(page, isolate_mode))
> +					goto isolate_success;
> +			}
>   		}
>
>   		/*
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 53529c805752..b56bf2b3fe8c 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -73,6 +73,85 @@ int migrate_prep_local(void)
>   	return 0;
>   }
>
> +bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> +{
> +	bool ret = false;
> +
> +	/*
> +	 * Avoid burning cycles with pages that are yet under __free_pages(),
> +	 * or just got freed under us.
> +	 *
> +	 * In case we 'win' a race for a movable page being freed under us and
> +	 * raise its refcount preventing __free_pages() from doing its job,
> +	 * the put_page() at the end of this block will take care of
> +	 * releasing this page, thus avoiding a nasty leak.
> +	 */
> +	if (unlikely(!get_page_unless_zero(page)))
> +		goto out;
> +
> +	/*
> +	 * Check PG_movable before holding the PG_lock because the page's owner
> +	 * assumes nobody touches the PG_lock of a newly allocated page.
> +	 */
> +	if (unlikely(!PageMovable(page)))
> +		goto out_putpage;
> +	/*
> +	 * As movable pages are not isolated from LRU lists, concurrent
> +	 * compaction threads can race against page migration functions
> +	 * as well as against the release of a page.
> +	 *
> +	 * In order to avoid having an already isolated movable page
> +	 * being (wrongly) re-isolated while it is under migration,
> +	 * or to avoid attempting to isolate pages being released,
> +	 * let's be sure we have the page lock
> +	 * before proceeding with the movable page isolation steps.
> +	 */
> +	if (unlikely(!trylock_page(page)))
> +		goto out_putpage;
> +
> +	if (!PageMovable(page) || PageIsolated(page))
> +		goto out_no_isolated;


Hello Minchan.

We captured a problem case.
We suspect that a zs subpage (T) can be isolated twice in the
scenario below, which seems wrong.

  migrate_ctx_A         migrate_ctx_B         C (proc being killed)
  -------------         -------------         ---------------------
  lock_page(T)
  isolate(T)
  unlock_page(T)

                                              zs_free()
                                                (making zspage ZS_EMPTY)
                                                free_zspage()
                                                  lock_page(T)
                                                  reset_page(T)
                                                    (keeps T "PageMovable",
                                                     clears "PageIsolated")
                                                  unlock_page(T)

                        lock_page(T)
                        isolate(T)

In our case, during the second isolation (migrate_ctx_B), there was a
NULL pointer dereference (without DEBUG_VM; T's first page had been
set to NULL).

I am not sure whether this is the case you and Vlastimil discussed
(I think it is a bit different).

Thanks.
Chulmin

> +
> +	ret = page->mapping->a_ops->isolate_page(page, mode);
> +	if (!ret)
> +		goto out_no_isolated;
> +
> +	WARN_ON_ONCE(!PageIsolated(page));
> +	unlock_page(page);
> +	return ret;
> +
> +out_no_isolated:
> +	unlock_page(page);
> +out_putpage:
> +	put_page(page);
> +out:
> +	return ret;
> +}
> +
> +/* It should be called on page which is PG_movable */
> +void putback_movable_page(struct page *page)
> +{
> +	/*
> +	 * 'lock_page()' stabilizes the page and prevents races against
> +	 * concurrent isolation threads attempting to re-isolate it.
> +	 */
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +
> +	lock_page(page);
> +	if (PageIsolated(page)) {
> +		struct address_space *mapping;
> +
> +		mapping = page_mapping(page);
> +		mapping->a_ops->putback_page(page);
> +		WARN_ON_ONCE(PageIsolated(page));
> +	} else {
> +		__ClearPageMovable(page);
> +	}
> +	unlock_page(page);
> +	/* drop the extra ref count taken for movable page isolation */
> +	put_page(page);
> +}
> +
>   /*
>    * Put previously isolated pages back onto the appropriate lists
>    * from where they were once taken off for compaction/migration.
> @@ -94,10 +173,18 @@ void putback_movable_pages(struct list_head *l)
>   		list_del(&page->lru);
>   		dec_zone_page_state(page, NR_ISOLATED_ANON +
>   				page_is_file_cache(page));
> -		if (unlikely(isolated_balloon_page(page)))
> +		if (unlikely(isolated_balloon_page(page))) {
>   			balloon_page_putback(page);
> -		else
> +		} else if (unlikely(PageMovable(page))) {
> +			if (PageIsolated(page)) {
> +				putback_movable_page(page);
> +			} else {
> +				__ClearPageMovable(page);
> +				put_page(page);
> +			}
> +		} else {
>   			putback_lru_page(page);
> +		}
>   	}
>   }
>
> @@ -592,7 +679,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
>    ***********************************************************/
>
>   /*
> - * Common logic to directly migrate a single page suitable for
> + * Common logic to directly migrate a single LRU page suitable for
>    * pages that do not use PagePrivate/PagePrivate2.
>    *
>    * Pages are locked upon entry and exit.
> @@ -755,24 +842,54 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>   				enum migrate_mode mode)
>   {
>   	struct address_space *mapping;
> -	int rc;
> +	int rc = -EAGAIN;
> +	bool lru_movable = true;
>
>   	VM_BUG_ON_PAGE(!PageLocked(page), page);
>   	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
>
>   	mapping = page_mapping(page);
> -	if (!mapping)
> -		rc = migrate_page(mapping, newpage, page, mode);
> -	else if (mapping->a_ops->migratepage)
> -		/*
> -		 * Most pages have a mapping and most filesystems provide a
> -		 * migratepage callback. Anonymous pages are part of swap
> -		 * space which also has its own migratepage callback. This
> -		 * is the most common path for page migration.
> -		 */
> -		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
> -	else
> -		rc = fallback_migrate_page(mapping, newpage, page, mode);
> +	/*
> +	 * In the case of a non-lru page, it could be released after
> +	 * the isolation step. In that case, we shouldn't try
> +	 * fallback migration, which was designed for LRU pages.
> +	 *
> +	 * The rule for such a case is that the subsystem should clear
> +	 * PG_isolated but keep PG_movable, so the VM can catch
> +	 * it and clear PG_movable for it.
> +	 */
> +	if (unlikely(PageMovable(page))) {
> +		lru_movable = false;
> +		VM_BUG_ON_PAGE(!mapping, page);
> +		if (!PageIsolated(page)) {
> +			rc = MIGRATEPAGE_SUCCESS;
> +			__ClearPageMovable(page);
> +			goto out;
> +		}
> +	}
> +
> +	if (likely(lru_movable)) {
> +		if (!mapping)
> +			rc = migrate_page(mapping, newpage, page, mode);
> +		else if (mapping->a_ops->migratepage)
> +			/*
> +			 * Most pages have a mapping and most filesystems
> +			 * provide a migratepage callback. Anonymous pages
> +			 * are part of swap space which also has its own
> +			 * migratepage callback. This is the most common path
> +			 * for page migration.
> +			 */
> +			rc = mapping->a_ops->migratepage(mapping, newpage,
> +							page, mode);
> +		else
> +			rc = fallback_migrate_page(mapping, newpage,
> +							page, mode);
> +	} else {
> +		rc = mapping->a_ops->migratepage(mapping, newpage,
> +						page, mode);
> +		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
> +			PageIsolated(page));
> +	}
>
>   	/*
>   	 * When successful, old pagecache page->mapping must be cleared before
> @@ -782,6 +899,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>   		if (!PageAnon(page))
>   			page->mapping = NULL;
>   	}
> +out:
>   	return rc;
>   }
>
> @@ -960,6 +1078,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>   			put_new_page(newpage, private);
>   		else
>   			put_page(newpage);
> +		if (PageMovable(page))
> +			__ClearPageMovable(page);
>   		goto out;
>   	}
>
> @@ -1000,8 +1120,26 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>   				num_poisoned_pages_inc();
>   		}
>   	} else {
> -		if (rc != -EAGAIN)
> -			putback_lru_page(page);
> +		if (rc != -EAGAIN) {
> +			/*
> +			 * The subsystem couldn't remove PG_movable since the page
> +			 * is isolated, so the PageMovable check is not racy here.
> +			 * The PageIsolated check can be racy, but that's okay
> +			 * because putback_movable_page checks it again under
> +			 * PG_lock.
> +			 */
> +			if (unlikely(PageMovable(page))) {
> +				if (PageIsolated(page))
> +					putback_movable_page(page);
> +				else {
> +					__ClearPageMovable(page);
> +					put_page(page);
> +				}
> +			} else {
> +				putback_lru_page(page);
> +			}
> +		}
> +
>   		if (put_new_page)
>   			put_new_page(newpage, private);
>   		else
>


* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
  2016-04-12  8:00   ` Chulmin Kim
@ 2016-04-12 14:25     ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-12 14:25 UTC (permalink / raw)
  To: Chulmin Kim; +Cc: Minchan Kim, Andrew Morton, linux-kernel, linux-mm

Hello Chulmin,

On Tue, Apr 12, 2016 at 05:00:18PM +0900, Chulmin Kim wrote:
> On 2016년 03월 30일 16:12, Minchan Kim wrote:
> >Until now we have allowed migration for LRU pages only, and that was
> >enough to make high-order pages. But recently, embedded systems (e.g.,
> >webOS, Android) use lots of non-movable pages (e.g., zram, GPU memory),
> >so we have seen several reports about trouble with small high-order
> >allocations. To fix the problem there have been several efforts
> >(e.g., enhancing the compaction algorithm, SLUB fallback to 0-order
> >pages, reserved memory, vmalloc and so on), but if there are lots of
> >non-movable pages in the system, those solutions are void in the long run.
> >
> >So, this patch adds a facility to turn non-movable pages into movable
> >ones. For the feature, this patch introduces migration-related
> >functions in address_space_operations as well as some page flags.
> >
> >Basically, this patch adds two page flags and two functions related
> >to page migration. The flags and page->mapping stability are protected
> >by PG_lock.
> >
> >	PG_movable
> >	PG_isolated
> >
> >	bool (*isolate_page) (struct page *, isolate_mode_t);
> >	void (*putback_page) (struct page *);
> >
> >The duties of a subsystem that wants to make its pages migratable
> >are as follows:
> >
> >1. It should register an address_space in page->mapping and then mark
> >the page as PG_movable via __SetPageMovable.
> >
> >2. It should mark the page as PG_isolated via SetPageIsolated
> >if isolation is successful, and return true.
> >
> >3. If migration is successful, it should clear PG_isolated and
> >PG_movable of the page to prepare it for freeing, then release the
> >reference of the page so it can be freed.
> >
> >4. If migration fails, the putback function of the subsystem should
> >clear PG_isolated via ClearPageIsolated.
> >
> >5. If a subsystem wants to release an isolated page, it should
> >clear PG_isolated but not PG_movable; the VM will clear PG_movable.
> >
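
For illustration only (this sketch is not part of the patch): a driver
following the five rules above might wire things up roughly as below.
The foo_* names are hypothetical and error handling is omitted.

/* sketch only; relies on <linux/fs.h>, <linux/migrate.h> and the new helpers */
static bool foo_isolate_page(struct page *page, isolate_mode_t mode)
{
	/* rule 2: the VM calls this with the page locked */
	SetPageIsolated(page);
	return true;
}

static int foo_migratepage(struct address_space *mapping,
			   struct page *newpage, struct page *page,
			   enum migrate_mode mode)
{
	/* ... copy contents and fix up driver metadata here ... */

	__SetPageMovable(newpage, mapping);	/* new page stays migratable */

	/* rule 3: clear both flags, then drop the driver's own reference */
	ClearPageIsolated(page);
	__ClearPageMovable(page);
	put_page(page);
	return MIGRATEPAGE_SUCCESS;
}

static void foo_putback_page(struct page *page)
{
	/*
	 * rule 4: migration failed; clear PG_isolated and re-add the page
	 * to the driver's own lists (not shown).
	 */
	ClearPageIsolated(page);
}

static const struct address_space_operations foo_aops = {
	.isolate_page	= foo_isolate_page,
	.migratepage	= foo_migratepage,
	.putback_page	= foo_putback_page,
};

/* rule 1: mapping->a_ops is assumed to point at foo_aops */
static void foo_make_page_movable(struct page *page,
				  struct address_space *mapping)
{
	__SetPageMovable(page, mapping);
}
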
> >Cc: Vlastimil Babka <vbabka@suse.cz>
> >Cc: Mel Gorman <mgorman@suse.de>
> >Cc: Hugh Dickins <hughd@google.com>
> >Cc: dri-devel@lists.freedesktop.org
> >Cc: virtualization@lists.linux-foundation.org
> >Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  Documentation/filesystems/Locking      |   4 +
> >  Documentation/filesystems/vfs.txt      |   5 +
> >  fs/proc/page.c                         |   3 +
> >  include/linux/fs.h                     |   2 +
> >  include/linux/migrate.h                |   2 +
> >  include/linux/page-flags.h             |  31 ++++++
> >  include/uapi/linux/kernel-page-flags.h |   1 +
> >  mm/compaction.c                        |  14 ++-
> >  mm/migrate.c                           | 174 +++++++++++++++++++++++++++++----
> >  9 files changed, 217 insertions(+), 19 deletions(-)
> >
> >diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> >index 619af9bfdcb3..0bb79560abb3 100644
> >--- a/Documentation/filesystems/Locking
> >+++ b/Documentation/filesystems/Locking
> >@@ -195,7 +195,9 @@ unlocks and drops the reference.
> >  	int (*releasepage) (struct page *, int);
> >  	void (*freepage)(struct page *);
> >  	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> >+	bool (*isolate_page) (struct page *, isolate_mode_t);
> >  	int (*migratepage)(struct address_space *, struct page *, struct page *);
> >+	void (*putback_page) (struct page *);
> >  	int (*launder_page)(struct page *);
> >  	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
> >  	int (*error_remove_page)(struct address_space *, struct page *);
> >@@ -219,7 +221,9 @@ invalidatepage:		yes
> >  releasepage:		yes
> >  freepage:		yes
> >  direct_IO:
> >+isolate_page:		yes
> >  migratepage:		yes (both)
> >+putback_page:		yes
> >  launder_page:		yes
> >  is_partially_uptodate:	yes
> >  error_remove_page:	yes
> >diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> >index b02a7d598258..4c1b6c3b4bc8 100644
> >--- a/Documentation/filesystems/vfs.txt
> >+++ b/Documentation/filesystems/vfs.txt
> >@@ -592,9 +592,14 @@ struct address_space_operations {
> >  	int (*releasepage) (struct page *, int);
> >  	void (*freepage)(struct page *);
> >  	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> >+	/* isolate a page for migration */
> >+	bool (*isolate_page) (struct page *, isolate_mode_t);
> >  	/* migrate the contents of a page to the specified target */
> >  	int (*migratepage) (struct page *, struct page *);
> >+	/* put the page back to right list */
> >+	void (*putback_page) (struct page *);
> >  	int (*launder_page) (struct page *);
> >+
> >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> >  					unsigned long);
> >  	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> >diff --git a/fs/proc/page.c b/fs/proc/page.c
> >index 3ecd445e830d..ce3d08a4ad8d 100644
> >--- a/fs/proc/page.c
> >+++ b/fs/proc/page.c
> >@@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
> >  	if (page_is_idle(page))
> >  		u |= 1 << KPF_IDLE;
> >
> >+	if (PageMovable(page))
> >+		u |= 1 << KPF_MOVABLE;
> >+
> >  	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
> >
> >  	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> >diff --git a/include/linux/fs.h b/include/linux/fs.h
> >index da9e67d937e5..36f2d610e7a8 100644
> >--- a/include/linux/fs.h
> >+++ b/include/linux/fs.h
> >@@ -401,6 +401,8 @@ struct address_space_operations {
> >  	 */
> >  	int (*migratepage) (struct address_space *,
> >  			struct page *, struct page *, enum migrate_mode);
> >+	bool (*isolate_page)(struct page *, isolate_mode_t);
> >+	void (*putback_page)(struct page *);
> >  	int (*launder_page) (struct page *);
> >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> >  					unsigned long);
> >diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >index 9b50325e4ddf..404fbfefeb33 100644
> >--- a/include/linux/migrate.h
> >+++ b/include/linux/migrate.h
> >@@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
> >  			struct page *, struct page *, enum migrate_mode);
> >  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> >  		unsigned long private, enum migrate_mode mode, int reason);
> >+extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> >+extern void putback_movable_page(struct page *page);
> >
> >  extern int migrate_prep(void);
> >  extern int migrate_prep_local(void);
> >diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >index f4ed4f1b0c77..77ebf8fdbc6e 100644
> >--- a/include/linux/page-flags.h
> >+++ b/include/linux/page-flags.h
> >@@ -129,6 +129,10 @@ enum pageflags {
> >
> >  	/* Compound pages. Stored in first tail page's flags */
> >  	PG_double_map = PG_private_2,
> >+
> >+	/* non-lru movable pages */
> >+	PG_movable = PG_reclaim,
> >+	PG_isolated = PG_owner_priv_1,
> >  };
> >
> >  #ifndef __GENERATING_BOUNDS_H
> >@@ -614,6 +618,33 @@ static inline void __ClearPageBalloon(struct page *page)
> >  	atomic_set(&page->_mapcount, -1);
> >  }
> >
> >+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> >+
> >+static inline int PageMovable(struct page *page)
> >+{
> >+	return ((test_bit(PG_movable, &(page)->flags) &&
> >+		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> >+		|| PageBalloon(page));
> >+}
> >+
> >+/* Caller should hold a PG_lock */
> >+static inline void __SetPageMovable(struct page *page,
> >+				struct address_space *mapping)
> >+{
> >+	page->mapping = mapping;
> >+	__set_bit(PG_movable, &page->flags);
> >+	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> >+}
> >+
> >+static inline void __ClearPageMovable(struct page *page)
> >+{
> >+	atomic_set(&page->_mapcount, -1);
> >+	__clear_bit(PG_movable, &(page)->flags);
> >+	page->mapping = NULL;
> >+}
> >+
> >+PAGEFLAG(Isolated, isolated, PF_ANY);
> >+
> >  /*
> >   * If network-based swap is enabled, sl*b must keep track of whether pages
> >   * were allocated from pfmemalloc reserves.
> >diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> >index 5da5f8751ce7..a184fd2434fa 100644
> >--- a/include/uapi/linux/kernel-page-flags.h
> >+++ b/include/uapi/linux/kernel-page-flags.h
> >@@ -34,6 +34,7 @@
> >  #define KPF_BALLOON		23
> >  #define KPF_ZERO_PAGE		24
> >  #define KPF_IDLE		25
> >+#define KPF_MOVABLE		26
> >
> >
> >  #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index ccf97b02b85f..7557aedddaee 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >
> >  		/*
> >  		 * Check may be lockless but that's ok as we recheck later.
> >-		 * It's possible to migrate LRU pages and balloon pages
> >+		 * It's possible to migrate LRU and movable kernel pages.
> >  		 * Skip any other type of page
> >  		 */
> >  		is_lru = PageLRU(page);
> >@@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >  					goto isolate_success;
> >  				}
> >  			}
> >+
> >+			if (unlikely(PageMovable(page)) &&
> >+					!PageIsolated(page)) {
> >+				if (locked) {
> >+					spin_unlock_irqrestore(&zone->lru_lock,
> >+									flags);
> >+					locked = false;
> >+				}
> >+
> >+				if (isolate_movable_page(page, isolate_mode))
> >+					goto isolate_success;
> >+			}
> >  		}
> >
> >  		/*
> >diff --git a/mm/migrate.c b/mm/migrate.c
> >index 53529c805752..b56bf2b3fe8c 100644
> >--- a/mm/migrate.c
> >+++ b/mm/migrate.c
> >@@ -73,6 +73,85 @@ int migrate_prep_local(void)
> >  	return 0;
> >  }
> >
> >+bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> >+{
> >+	bool ret = false;
> >+
> >+	/*
> >+	 * Avoid burning cycles with pages that are yet under __free_pages(),
> >+	 * or just got freed under us.
> >+	 *
> >+	 * In case we 'win' a race for a movable page being freed under us and
> >+	 * raise its refcount preventing __free_pages() from doing its job,
> >+	 * the put_page() at the end of this block will take care of
> >+	 * releasing this page, thus avoiding a nasty leak.
> >+	 */
> >+	if (unlikely(!get_page_unless_zero(page)))
> >+		goto out;
> >+
> >+	/*
> >+	 * Check PG_movable before holding the PG_lock because the page's owner
> >+	 * assumes nobody touches the PG_lock of a newly allocated page.
> >+	 */
> >+	if (unlikely(!PageMovable(page)))
> >+		goto out_putpage;
> >+	/*
> >+	 * As movable pages are not isolated from LRU lists, concurrent
> >+	 * compaction threads can race against page migration functions
> >+	 * as well as against the release of a page.
> >+	 *
> >+	 * In order to avoid having an already isolated movable page
> >+	 * being (wrongly) re-isolated while it is under migration,
> >+	 * or to avoid attempting to isolate pages being released,
> >+	 * let's be sure we have the page lock
> >+	 * before proceeding with the movable page isolation steps.
> >+	 */
> >+	if (unlikely(!trylock_page(page)))
> >+		goto out_putpage;
> >+
> >+	if (!PageMovable(page) || PageIsolated(page))
> >+		goto out_no_isolated;
> 
> 
> Hello Minchan.
> 
> We captured a problem case.
> We suspect that a zs subpage (T) can be isolated twice in the
> scenario below, which seems wrong.
> 
>   migrate_ctx_A         migrate_ctx_B         C (proc being killed)
>   -------------         -------------         ---------------------
>   lock_page(T)
>   isolate(T)
>   unlock_page(T)
> 
>                                               zs_free()
>                                                 (making zspage ZS_EMPTY)
>                                                 free_zspage()
>                                                   lock_page(T)
>                                                   reset_page(T)
>                                                     (keeps T "PageMovable",
>                                                      clears "PageIsolated")
>                                                   unlock_page(T)
> 
>                         lock_page(T)
>                         isolate(T)
> 
> In our case, during the second isolation (migrate_ctx_B), there was a
> NULL pointer dereference (without DEBUG_VM; T's first page had been
> set to NULL).
> 
> I am not sure whether this is the case you and Vlastimil discussed
> (I think it is a bit different).

That's exactly the bug we discussed.
I will fix it in the next revision.

Thanks.


* Re: [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping
  2016-03-30  7:12 ` [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping Minchan Kim
@ 2016-04-17 15:08   ` Sergey Senozhatsky
  2016-04-19  7:40     ` Minchan Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Sergey Senozhatsky @ 2016-04-17 15:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
[..]
> +static int get_zspage_inuse(struct page *first_page)
> +{
> +	struct zs_meta *m;
> +
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +
> +	m = (struct zs_meta *)&first_page->mapping;
..
> +static void set_zspage_inuse(struct page *first_page, int val)
> +{
> +	struct zs_meta *m;
> +
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +
> +	m = (struct zs_meta *)&first_page->mapping;
..
> +static void mod_zspage_inuse(struct page *first_page, int val)
> +{
> +	struct zs_meta *m;
> +
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +
> +	m = (struct zs_meta *)&first_page->mapping;
..
>  static void get_zspage_mapping(struct page *first_page,
>  				unsigned int *class_idx,
>  				enum fullness_group *fullness)
>  {
> -	unsigned long m;
> +	struct zs_meta *m;
> +
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +	m = (struct zs_meta *)&first_page->mapping;
..
>  static void set_zspage_mapping(struct page *first_page,
>  				unsigned int class_idx,
>  				enum fullness_group fullness)
>  {
> +	struct zs_meta *m;
> +
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> +	m = (struct zs_meta *)&first_page->mapping;
> +	m->fullness = fullness;
> +	m->class = class_idx;
>  }


a nitpick: this

	struct zs_meta *m;
	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
	m = (struct zs_meta *)&first_page->mapping;


seems to be common in several places; maybe it makes sense to
factor it out and turn it into a macro or a static inline helper?
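
for example, something along these lines (an untested sketch; the
helper name is made up):

static struct zs_meta *get_zspage_meta(struct page *first_page)
{
	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
	return (struct zs_meta *)&first_page->mapping;
}

static int get_zspage_inuse(struct page *first_page)
{
	return get_zspage_meta(first_page)->inuse;
}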

other than that, looks good to me

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss


* Re: [PATCH v3 05/16] zsmalloc: keep max_object in size_class
  2016-03-30  7:12 ` [PATCH v3 05/16] zsmalloc: keep max_object in size_class Minchan Kim
@ 2016-04-17 15:08   ` Sergey Senozhatsky
  0 siblings, 0 replies; 65+ messages in thread
From: Sergey Senozhatsky @ 2016-04-17 15:08 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
> 
> Every zspage in a size_class has the same number of max objects, so
> we can move it to the size_class.
> 

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss


* Re: [PATCH v3 07/16] zsmalloc: remove page_mapcount_reset
  2016-03-30  7:12 ` [PATCH v3 07/16] zsmalloc: remove page_mapcount_reset Minchan Kim
@ 2016-04-17 15:11   ` Sergey Senozhatsky
  0 siblings, 0 replies; 65+ messages in thread
From: Sergey Senozhatsky @ 2016-04-17 15:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
> We don't use page->_mapcount any more so no need to reset.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

> ---
>  mm/zsmalloc.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 4dd72a803568..0f6cce9b9119 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -922,7 +922,6 @@ static void reset_page(struct page *page)
>  	set_page_private(page, 0);
>  	page->mapping = NULL;
>  	page->freelist = NULL;
> -	page_mapcount_reset(page);
>  }
>  
>  static void free_zspage(struct page *first_page)
> -- 
> 1.9.1
> 


* Re: [PATCH v3 09/16] zsmalloc: move struct zs_meta from mapping to freelist
  2016-03-30  7:12 ` [PATCH v3 09/16] zsmalloc: move struct zs_meta from mapping to freelist Minchan Kim
@ 2016-04-17 15:22   ` Sergey Senozhatsky
  0 siblings, 0 replies; 65+ messages in thread
From: Sergey Senozhatsky @ 2016-04-17 15:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
> For supporting migration from VM, we need to have address_space
> on every page so zsmalloc shouldn't use page->mapping. So,
> this patch moves zs_meta from mapping to freelist.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

a small get_zspage_meta() helper would make this patch shorter :)

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

> ---
>  mm/zsmalloc.c | 22 +++++++++++-----------
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 807998462539..d4d33a819832 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -29,7 +29,7 @@
>   *		Look at size_class->huge.
>   *	page->lru: links together first pages of various zspages.
>   *		Basically forming list of zspages in a fullness group.
> - *	page->mapping: override by struct zs_meta
> + *	page->freelist: override by struct zs_meta
>   *
>   * Usage of struct page flags:
>   *	PG_private: identifies the first component page
> @@ -418,7 +418,7 @@ static int get_zspage_inuse(struct page *first_page)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  
>  	return m->inuse;
>  }
> @@ -429,7 +429,7 @@ static void set_zspage_inuse(struct page *first_page, int val)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	m->inuse = val;
>  }
>  
> @@ -439,7 +439,7 @@ static void mod_zspage_inuse(struct page *first_page, int val)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	m->inuse += val;
>  }
>  
> @@ -449,7 +449,7 @@ static void set_freeobj(struct page *first_page, int idx)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	m->freeobj = idx;
>  }
>  
> @@ -459,7 +459,7 @@ static unsigned long get_freeobj(struct page *first_page)
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	return m->freeobj;
>  }
>  
> @@ -471,7 +471,7 @@ static void get_zspage_mapping(struct page *first_page,
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	*fullness = m->fullness;
>  	*class_idx = m->class;
>  }
> @@ -484,7 +484,7 @@ static void set_zspage_mapping(struct page *first_page,
>  
>  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>  
> -	m = (struct zs_meta *)&first_page->mapping;
> +	m = (struct zs_meta *)&first_page->freelist;
>  	m->fullness = fullness;
>  	m->class = class_idx;
>  }
> @@ -946,7 +946,6 @@ static void reset_page(struct page *page)
>  	clear_bit(PG_private, &page->flags);
>  	clear_bit(PG_private_2, &page->flags);
>  	set_page_private(page, 0);
> -	page->mapping = NULL;
>  	page->freelist = NULL;
>  }
>  
> @@ -1056,6 +1055,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>  
>  		INIT_LIST_HEAD(&page->lru);
>  		if (i == 0) {	/* first page */
> +			page->freelist = NULL;
>  			SetPagePrivate(page);
>  			set_page_private(page, 0);
>  			first_page = page;
> @@ -2068,9 +2068,9 @@ static int __init zs_init(void)
>  
>  	/*
>  	 * A zspage's a free object index, class index, fullness group,
> -	 * inuse object count are encoded in its (first)page->mapping
> +	 * inuse object count are encoded in its (first)page->freelist
>  	 * so sizeof(struct zs_meta) should be less than
> -	 * sizeof(page->mapping(i.e., unsigned long)).
> +	 * sizeof(page->freelist(i.e., void *)).
>  	 */
>  	BUILD_BUG_ON(sizeof(struct zs_meta) > sizeof(unsigned long));
>  
> -- 
> 1.9.1
> 


* Re: [PATCH v3 08/16] zsmalloc: squeeze freelist into page->mapping
  2016-03-30  7:12 ` [PATCH v3 08/16] zsmalloc: squeeze freelist into page->mapping Minchan Kim
@ 2016-04-17 15:56   ` Sergey Senozhatsky
  2016-04-19  7:42     ` Minchan Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Sergey Senozhatsky @ 2016-04-17 15:56 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
[..]
> +static void objidx_to_page_and_offset(struct size_class *class,
> +				struct page *first_page,
> +				unsigned long obj_idx,
> +				struct page **obj_page,
> +				unsigned long *offset_in_page)
>  {
> -	unsigned long obj;
> +	int i;
> +	unsigned long offset;
> +	struct page *cursor;
> +	int nr_page;
>  
> -	if (!page) {
> -		VM_BUG_ON(obj_idx);
> -		return NULL;
> -	}
> +	offset = obj_idx * class->size;

so we already know the `offset' before we call objidx_to_page_and_offset(),
thus we can drop `struct size_class *class' and `obj_idx', and pass
`long obj_offset'  (which is `obj_idx * class->size') instead, right?

we can also _maybe_ return `cursor' from the function.

static struct page *objidx_to_page_and_offset(struct page *first_page,
					unsigned long obj_offset,
					unsigned long *offset_in_page);

this can save ~20 instructions, which is not so terrible for a hot path
like obj_malloc(). what do you think?

well, seems that `unsigned long *offset_in_page' can be calculated
outside of this function too, it's basically

	*offset_in_page = (obj_idx * class->size) & ~PAGE_MASK;

so we don't need to supply it to this function, nor modify it there.
which can save ~40 instructions on my system. does this sound silly?
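
putting the two ideas together, the helper could shrink to something
like this (an untested sketch; the name is made up, and the caller
computes offset_in_page = (obj_idx * class->size) & ~PAGE_MASK itself):

static struct page *obj_offset_to_page(struct page *first_page,
				unsigned long obj_offset)
{
	struct page *cursor = first_page;
	int nr_page = obj_offset >> PAGE_SHIFT;
	int i;

	/* walk the page chain up to the page that holds the object */
	for (i = 0; i < nr_page; i++)
		cursor = get_next_page(cursor);

	return cursor;
}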

	-ss

> +	cursor = first_page;
> +	nr_page = offset >> PAGE_SHIFT;
>  
> -	obj = page_to_pfn(page) << OBJ_INDEX_BITS;
> -	obj |= ((obj_idx) & OBJ_INDEX_MASK);
> -	obj <<= OBJ_TAG_BITS;
> +	*offset_in_page = offset & ~PAGE_MASK;
> +
> +	for (i = 0; i < nr_page; i++)
> +		cursor = get_next_page(cursor);
>  
> -	return (void *)obj;
> +	*obj_page = cursor;
>  }

	-ss


* Re: [PATCH v3 10/16] zsmalloc: factor page chain functionality out
  2016-03-30  7:12 ` [PATCH v3 10/16] zsmalloc: factor page chain functionality out Minchan Kim
@ 2016-04-18  0:33   ` Sergey Senozhatsky
  2016-04-19  7:46     ` Minchan Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Sergey Senozhatsky @ 2016-04-18  0:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

Hello,

On (03/30/16 16:12), Minchan Kim wrote:
> @@ -1421,7 +1434,6 @@ static unsigned long obj_malloc(struct size_class *class,
>  	unsigned long m_offset;
>  	void *vaddr;
>  
> -	handle |= OBJ_ALLOCATED_TAG;

a nitpick: why did you replace this OBJ_ALLOCATED_TAG assignment
with two 'handle | OBJ_ALLOCATED_TAG' expressions?

	-ss

>  	obj = get_freeobj(first_page);
>  	objidx_to_page_and_offset(class, first_page, obj,
>  				&m_page, &m_offset);
> @@ -1431,10 +1443,10 @@ static unsigned long obj_malloc(struct size_class *class,
>  	set_freeobj(first_page, link->next >> OBJ_ALLOCATED_TAG);
>  	if (!class->huge)
>  		/* record handle in the header of allocated chunk */
> -		link->handle = handle;
> +		link->handle = handle | OBJ_ALLOCATED_TAG;
>  	else
>  		/* record handle in first_page->private */
> -		set_page_private(first_page, handle);
> +		set_page_private(first_page, handle | OBJ_ALLOCATED_TAG);
>  	kunmap_atomic(vaddr);
>  	mod_zspage_inuse(first_page, 1);
>  	zs_stat_inc(class, OBJ_USED, 1);


* Re: [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage
  2016-03-30  7:12 ` [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage Minchan Kim
@ 2016-04-18  1:04   ` Sergey Senozhatsky
  2016-04-19  7:51     ` Minchan Kim
  0 siblings, 1 reply; 65+ messages in thread
From: Sergey Senozhatsky @ 2016-04-18  1:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

Hello Minchan,

On (03/30/16 16:12), Minchan Kim wrote:
[..]
> @@ -1835,23 +1827,31 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>  			if (!migrate_zspage(pool, class, &cc))
>  				break;
>  
> -			putback_zspage(pool, class, dst_page);
> +			VM_BUG_ON_PAGE(putback_zspage(pool, class,
> +				dst_page) == ZS_EMPTY, dst_page);

can this VM_BUG_ON_PAGE() condition ever be true?

>  		}
>  		/* Stop if we couldn't find slot */
>  		if (dst_page == NULL)
>  			break;
> -		putback_zspage(pool, class, dst_page);
> -		if (putback_zspage(pool, class, src_page) == ZS_EMPTY)
> +		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> +				dst_page) == ZS_EMPTY, dst_page);

hm... this VM_BUG_ON_PAGE(dst_page) is sort of confusing. under what
circumstances can it be true?

a minor nit, it took me some time (need some coffee I guess) to
correctly parse this macro wrapper

		VM_BUG_ON_PAGE(putback_zspage(pool, class,
			dst_page) == ZS_EMPTY, dst_page);

may be do it like:
		fullness = putback_zspage(pool, class, dst_page);
		VM_BUG_ON_PAGE(fullness == ZS_EMPTY, dst_page);


well, that is if we want to VM_BUG_ON_PAGE() at all. there haven't been
any problems with compaction; is there any specific reason these macros
were added?



> +		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
>  			pool->stats.pages_compacted += class->pages_per_zspage;
> -		spin_unlock(&class->lock);
> +			spin_unlock(&class->lock);
> +			free_zspage(pool, class, src_page);

do we really need to free_zspage() out of class->lock?
wouldn't something like this

		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
			pool->stats.pages_compacted += class->pages_per_zspage;
			free_zspage(pool, class, src_page);
		}
		spin_unlock(&class->lock);

be simpler?

besides, free_zspage() now updates class stats outside of the class
lock; not critical, but still.

	-ss

> +		} else {
> +			spin_unlock(&class->lock);
> +		}
> +
>  		cond_resched();
>  		spin_lock(&class->lock);
>  	}
>  
>  	if (src_page)
> -		putback_zspage(pool, class, src_page);
> +		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> +				src_page) == ZS_EMPTY, src_page);
>  
>  	spin_unlock(&class->lock);
>  }


* Re: [PATCH v3 13/16] zsmalloc: migrate head page of zspage
  2016-03-30  7:12 ` [PATCH v3 13/16] zsmalloc: migrate head page of zspage Minchan Kim
  2016-04-06 13:01   ` Chulmin Kim
@ 2016-04-19  6:08   ` Chulmin Kim
  2016-04-19  6:15     ` Minchan Kim
  1 sibling, 1 reply; 65+ messages in thread
From: Chulmin Kim @ 2016-04-19  6:08 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton; +Cc: linux-kernel, linux-mm, s.suk, sunae.seo

On 2016년 03월 30일 16:12, Minchan Kim wrote:
> This patch introduces a run-time migration feature for zspage.
> To begin with, it supports only head page migration for
> easy review (later patches will support tail page migration).
>
> For migration, it supports three functions:
>
> * zs_page_isolate
>
> It isolates a zspage which includes a subpage the VM wants to migrate
> from its class, so no one can allocate a new object from the zspage.
> IOW, allocation freeze.
>
> * zs_page_migrate
>
> First of all, it freezes the zspage to prevent zspage destruction
> so no one can free objects. Then, it copies content from oldpage
> to newpage and creates a new page chain with the new page.
> If that was successful, it drops the refcount of the old page to free it
> and puts the new zspage back into the right data structure of zsmalloc.
> Lastly, it unfreezes the zspage so object allocation/free is allowed
> again from now on.
>
> * zs_page_putback
>
> It returns an isolated zspage to the right fullness_group list
> if migration of a page fails.
>
> NOTE: A hurdle for supporting migration is destruction of a zspage
> while migration is going on. Once a zspage is isolated,
> no one can allocate objects from the zspage but objects can still be
> deallocated freely, so the zspage could be destroyed before all of its
> objects are frozen to prevent deallocation. The problem is the
> large window between zs_page_isolate and freeze_zspage
> in zs_page_migrate, during which the zspage could be destroyed.
>
> An easy approach to solving the problem is freezing objects
> in zs_page_isolate, but it has the drawback that no object can
> be deallocated until migration fails after isolation. However,
> there is a large time gap between isolation and migration, so
> any object freeing on another CPU would spin on pin_tag, which
> would cause big latency. So, this patch introduces lock_zspage,
> which holds the PG_lock of all pages in a zspage right before
> freeing the zspage. VM migration locks the page, too, right
> before calling ->migratepage, so such a race doesn't exist any more.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>   include/uapi/linux/magic.h |   1 +
>   mm/zsmalloc.c              | 332 +++++++++++++++++++++++++++++++++++++++++++--
>   2 files changed, 318 insertions(+), 15 deletions(-)
>
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index e1fbe72c39c0..93b1affe4801 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -79,5 +79,6 @@
>   #define NSFS_MAGIC		0x6e736673
>   #define BPF_FS_MAGIC		0xcafe4a11
>   #define BALLOON_KVM_MAGIC	0x13661366
> +#define ZSMALLOC_MAGIC		0x58295829
>
>   #endif /* __LINUX_MAGIC_H__ */
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index ac8ca7b10720..f6c9138c3be0 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -56,6 +56,8 @@
>   #include <linux/debugfs.h>
>   #include <linux/zsmalloc.h>
>   #include <linux/zpool.h>
> +#include <linux/mount.h>
> +#include <linux/migrate.h>
>
>   /*
>    * This must be power of 2 and greater than of equal to sizeof(link_free).
> @@ -182,6 +184,8 @@ struct zs_size_stat {
>   static struct dentry *zs_stat_root;
>   #endif
>
> +static struct vfsmount *zsmalloc_mnt;
> +
>   /*
>    * number of size_classes
>    */
> @@ -263,6 +267,7 @@ struct zs_pool {
>   #ifdef CONFIG_ZSMALLOC_STAT
>   	struct dentry *stat_dentry;
>   #endif
> +	struct inode *inode;
>   };
>
>   struct zs_meta {
> @@ -412,6 +417,29 @@ static int is_last_page(struct page *page)
>   	return PagePrivate2(page);
>   }
>
> +/*
> + * Indicate whether a zspage is isolated for page migration.
> + * Protected by size_class lock
> + */
> +static void SetZsPageIsolate(struct page *first_page)
> +{
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +	SetPageUptodate(first_page);
> +}
> +
> +static int ZsPageIsolate(struct page *first_page)
> +{
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +
> +	return PageUptodate(first_page);
> +}
> +
> +static void ClearZsPageIsolate(struct page *first_page)
> +{
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +	ClearPageUptodate(first_page);
> +}
> +
>   static int get_zspage_inuse(struct page *first_page)
>   {
>   	struct zs_meta *m;
> @@ -783,8 +811,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
>   	if (newfg == currfg)
>   		goto out;
>
> -	remove_zspage(class, currfg, first_page);
> -	insert_zspage(class, newfg, first_page);
> +	/* Later, putback will insert page to right list */
> +	if (!ZsPageIsolate(first_page)) {
> +		remove_zspage(class, currfg, first_page);
> +		insert_zspage(class, newfg, first_page);
> +	}
>   	set_zspage_mapping(first_page, class_idx, newfg);
>
>   out:
> @@ -950,12 +981,31 @@ static void unpin_tag(unsigned long handle)
>
>   static void reset_page(struct page *page)
>   {
> +	if (!PageIsolated(page))
> +		__ClearPageMovable(page);
> +	ClearPageIsolated(page);
>   	clear_bit(PG_private, &page->flags);
>   	clear_bit(PG_private_2, &page->flags);
>   	set_page_private(page, 0);
>   	page->freelist = NULL;
>   }
>
> +/**
> + * lock_zspage - lock all pages in the zspage
> + * @first_page: head page of the zspage
> + *
> + * To prevent destruction during migration, zspage freeing should
> + * hold the locks of all pages in the zspage.
> + */
> +void lock_zspage(struct page *first_page)
> +{
> +	struct page *cursor = first_page;
> +
> +	do {
> +		while (!trylock_page(cursor));
> +	} while ((cursor = get_next_page(cursor)) != NULL);
> +}
> +
>   static void free_zspage(struct zs_pool *pool, struct page *first_page)
>   {
>   	struct page *nextp, *tmp, *head_extra;
> @@ -963,26 +1013,31 @@ static void free_zspage(struct zs_pool *pool, struct page *first_page)
>   	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>   	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
>
> +	lock_zspage(first_page);
>   	head_extra = (struct page *)page_private(first_page);
>
> -	reset_page(first_page);
> -	__free_page(first_page);
> -
>   	/* zspage with only 1 system page */
>   	if (!head_extra)
> -		return;
> +		goto out;
>
>   	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
>   		list_del(&nextp->lru);
>   		reset_page(nextp);
> -		__free_page(nextp);
> +		unlock_page(nextp);
> +		put_page(nextp);
>   	}
>   	reset_page(head_extra);
> -	__free_page(head_extra);
> +	unlock_page(head_extra);
> +	put_page(head_extra);
> +out:
> +	reset_page(first_page);
> +	unlock_page(first_page);
> +	put_page(first_page);
>   }
>
>   /* Initialize a newly allocated zspage */
> -static void init_zspage(struct size_class *class, struct page *first_page)
> +static void init_zspage(struct size_class *class, struct page *first_page,
> +			struct address_space *mapping)
>   {
>   	int freeobj = 1;
>   	unsigned long off = 0;
> @@ -991,6 +1046,9 @@ static void init_zspage(struct size_class *class, struct page *first_page)
>   	first_page->freelist = NULL;
>   	INIT_LIST_HEAD(&first_page->lru);
>   	set_zspage_inuse(first_page, 0);
> +	BUG_ON(!trylock_page(first_page));
> +	__SetPageMovable(first_page, mapping);
> +	unlock_page(first_page);
>
>   	while (page) {
>   		struct page *next_page;
> @@ -1065,10 +1123,45 @@ static void create_page_chain(struct page *pages[], int nr_pages)
>   	}
>   }
>
> +static void replace_sub_page(struct size_class *class, struct page *first_page,
> +		struct page *newpage, struct page *oldpage)
> +{
> +	struct page *page;
> +	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL,};
> +	int idx = 0;
> +
> +	page = first_page;
> +	do {
> +		if (page == oldpage)
> +			pages[idx] = newpage;
> +		else
> +			pages[idx] = page;
> +		idx++;
> +	} while ((page = get_next_page(page)) != NULL);
> +
> +	create_page_chain(pages, class->pages_per_zspage);
> +
> +	if (is_first_page(oldpage)) {
> +		enum fullness_group fg;
> +		int class_idx;
> +
> +		SetZsPageIsolate(newpage);
> +		get_zspage_mapping(oldpage, &class_idx, &fg);
> +		set_zspage_mapping(newpage, class_idx, fg);
> +		set_freeobj(newpage, get_freeobj(oldpage));
> +		set_zspage_inuse(newpage, get_zspage_inuse(oldpage));
> +		if (class->huge)
> +			set_page_private(newpage,  page_private(oldpage));
> +	}
> +
> +	__SetPageMovable(newpage, oldpage->mapping);
> +}
> +
>   /*
>    * Allocate a zspage for the given size class
>    */
> -static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> +static struct page *alloc_zspage(struct zs_pool *pool,
> +				struct size_class *class)
>   {
>   	int i;
>   	struct page *first_page = NULL;
> @@ -1088,7 +1181,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>   	for (i = 0; i < class->pages_per_zspage; i++) {
>   		struct page *page;
>
> -		page = alloc_page(flags);
> +		page = alloc_page(pool->flags);
>   		if (!page) {
>   			while (--i >= 0)
>   				__free_page(pages[i]);
> @@ -1100,7 +1193,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
>
>   	create_page_chain(pages, class->pages_per_zspage);
>   	first_page = pages[0];
> -	init_zspage(class, first_page);
> +	init_zspage(class, first_page, pool->inode->i_mapping);
>
>   	return first_page;
>   }
> @@ -1499,7 +1592,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
>
>   	if (!first_page) {
>   		spin_unlock(&class->lock);
> -		first_page = alloc_zspage(class, pool->flags);
> +		first_page = alloc_zspage(pool, class);
>   		if (unlikely(!first_page)) {
>   			free_handle(pool, handle);
>   			return 0;
> @@ -1559,6 +1652,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>   	if (unlikely(!handle))
>   		return;
>
> +	/* Once handle is pinned, page|object migration cannot work */
>   	pin_tag(handle);
>   	obj = handle_to_obj(handle);
>   	obj_to_location(obj, &f_page, &f_objidx);
> @@ -1714,6 +1808,9 @@ static enum fullness_group putback_zspage(struct size_class *class,
>   {
>   	enum fullness_group fullness;
>
> +	VM_BUG_ON_PAGE(!list_empty(&first_page->lru), first_page);
> +	VM_BUG_ON_PAGE(ZsPageIsolate(first_page), first_page);
> +
>   	fullness = get_fullness_group(class, first_page);
>   	insert_zspage(class, fullness, first_page);
>   	set_zspage_mapping(first_page, class->index, fullness);
> @@ -2059,6 +2156,173 @@ static int zs_register_shrinker(struct zs_pool *pool)
>   	return register_shrinker(&pool->shrinker);
>   }
>
> +bool zs_page_isolate(struct page *page, isolate_mode_t mode)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fullness;
> +	struct page *first_page;
> +
> +	/*
> +	 * The page is locked so it cannot be destroyed.
> +	 * For details, look at lock_zspage in free_zspage.
> +	 */
> +	VM_BUG_ON_PAGE(!PageLocked(page), page);
> +	VM_BUG_ON_PAGE(PageIsolated(page), page);
> +	/*
> +	 * In this implementation, it allows only first page migration.
> +	 */
> +	VM_BUG_ON_PAGE(!is_first_page(page), page);
> +	first_page = page;
> +
> +	/*
> +	 * Without class lock, fullness is meaningless while constant
> +	 * class_idx is okay. We will get it under class lock at below,
> +	 * again.
> +	 */
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	pool = page->mapping->private_data;
> +	class = pool->size_class[class_idx];
> +
> +	if (!spin_trylock(&class->lock))
> +		return false;
> +
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	remove_zspage(class, fullness, first_page);
> +	SetZsPageIsolate(first_page);
> +	SetPageIsolated(page);
> +	spin_unlock(&class->lock);
> +
> +	return true;
> +}

Hello, Minchan.

We found another race condition.

While an alloc_zspage(), which is not protected by any lock, is in
flight, a migrate context can isolate the zs subpage that is still
being initialized by alloc_zspage().

We hit the VM_BUG_ON in remove_zspage() above because "page->index"
was wrongly NULL (it seems it had not been initialized yet).

Though it is a real problem, since this race is somewhat similar to
the one we reported last time, it will hopefully be fixed in the next
version as well.

I am reporting this just for the record.

Thanks.
Chulmin

> +
> +int zs_page_migrate(struct address_space *mapping, struct page *newpage,
> +		struct page *page, enum migrate_mode mode)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fullness;
> +	struct page *first_page;
> +	void *s_addr, *d_addr, *addr;
> +	int ret = -EBUSY;
> +	int offset = 0;
> +	int freezed = 0;
> +
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> +
> +	first_page = page;
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	pool = page->mapping->private_data;
> +	class = pool->size_class[class_idx];
> +
> +	/*
> +	 * Get stable fullness under class->lock
> +	 */
> +	if (!spin_trylock(&class->lock))
> +		return ret;
> +
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	if (get_zspage_inuse(first_page) == 0)
> +		goto out_class_unlock;
> +
> +	freezed = freeze_zspage(class, first_page);
> +	if (freezed != get_zspage_inuse(first_page))
> +		goto out_unfreeze;
> +
> +	/* copy contents from page to newpage */
> +	s_addr = kmap_atomic(page);
> +	d_addr = kmap_atomic(newpage);
> +	memcpy(d_addr, s_addr, PAGE_SIZE);
> +	kunmap_atomic(d_addr);
> +	kunmap_atomic(s_addr);
> +
> +	if (!is_first_page(page))
> +		offset = page->index;
> +
> +	addr = kmap_atomic(page);
> +	do {
> +		unsigned long handle;
> +		unsigned long head;
> +		unsigned long new_obj, old_obj;
> +		unsigned long obj_idx;
> +		struct page *dummy;
> +
> +		head = obj_to_head(class, page, addr + offset);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			handle = head & ~OBJ_ALLOCATED_TAG;
> +			if (!testpin_tag(handle))
> +				BUG();
> +
> +			old_obj = handle_to_obj(handle);
> +			obj_to_location(old_obj, &dummy, &obj_idx);
> +			new_obj = location_to_obj(newpage, obj_idx);
> +			new_obj |= BIT(HANDLE_PIN_BIT);
> +			record_obj(handle, new_obj);
> +		}
> +		offset += class->size;
> +	} while (offset < PAGE_SIZE);
> +	kunmap_atomic(addr);
> +
> +	replace_sub_page(class, first_page, newpage, page);
> +	first_page = newpage;
> +	get_page(newpage);
> +	VM_BUG_ON_PAGE(get_fullness_group(class, first_page) ==
> +			ZS_EMPTY, first_page);
> +	ClearZsPageIsolate(first_page);
> +	putback_zspage(class, first_page);
> +
> +	/* Migration complete. Free old page */
> +	ClearPageIsolated(page);
> +	reset_page(page);
> +	put_page(page);
> +	ret = MIGRATEPAGE_SUCCESS;
> +
> +out_unfreeze:
> +	unfreeze_zspage(class, first_page, freezed);
> +out_class_unlock:
> +	spin_unlock(&class->lock);
> +
> +	return ret;
> +}
> +
> +void zs_page_putback(struct page *page)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fullness;
> +	struct page *first_page;
> +
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> +
> +	first_page = page;
> +	get_zspage_mapping(first_page, &class_idx, &fullness);
> +	pool = page->mapping->private_data;
> +	class = pool->size_class[class_idx];
> +
> +	/*
> +	 * If there is a race between zs_free and here, free_zspage
> +	 * in zs_free will wait on the page lock of @page without
> +	 * destroying the zspage.
> +	 */
> +	INIT_LIST_HEAD(&first_page->lru);
> +	spin_lock(&class->lock);
> +	ClearPageIsolated(page);
> +	ClearZsPageIsolate(first_page);
> +	putback_zspage(class, first_page);
> +	spin_unlock(&class->lock);
> +}
> +
> +const struct address_space_operations zsmalloc_aops = {
> +	.isolate_page = zs_page_isolate,
> +	.migratepage = zs_page_migrate,
> +	.putback_page = zs_page_putback,
> +};
> +
>   /**
>    * zs_create_pool - Creates an allocation pool to work from.
>    * @flags: allocation flags used to allocate pool metadata
> @@ -2145,6 +2409,15 @@ struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
>   	if (zs_pool_stat_create(pool, name))
>   		goto err;
>
> +	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
> +	if (IS_ERR(pool->inode)) {
> +		pool->inode = NULL;
> +		goto err;
> +	}
> +
> +	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
> +	pool->inode->i_mapping->private_data = pool;
> +
>   	/*
>   	 * Not critical, we still can use the pool
>   	 * and user can trigger compaction manually.
> @@ -2164,6 +2437,8 @@ void zs_destroy_pool(struct zs_pool *pool)
>   	int i;
>
>   	zs_unregister_shrinker(pool);
> +	if (pool->inode)
> +		iput(pool->inode);
>   	zs_pool_stat_destroy(pool);
>
>   	for (i = 0; i < zs_size_classes; i++) {
> @@ -2192,10 +2467,33 @@ void zs_destroy_pool(struct zs_pool *pool)
>   }
>   EXPORT_SYMBOL_GPL(zs_destroy_pool);
>
> +static struct dentry *zs_mount(struct file_system_type *fs_type,
> +				int flags, const char *dev_name, void *data)
> +{
> +	static const struct dentry_operations ops = {
> +		.d_dname = simple_dname,
> +	};
> +
> +	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
> +}
> +
> +static struct file_system_type zsmalloc_fs = {
> +	.name		= "zsmalloc",
> +	.mount		= zs_mount,
> +	.kill_sb	= kill_anon_super,
> +};
> +
>   static int __init zs_init(void)
>   {
> -	int ret = zs_register_cpu_notifier();
> +	int ret;
> +
> +	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
> +	if (IS_ERR(zsmalloc_mnt)) {
> +		ret = PTR_ERR(zsmalloc_mnt);
> +		goto out;
> +	}
>
> +	ret = zs_register_cpu_notifier();
>   	if (ret)
>   		goto notifier_fail;
>
> @@ -2218,6 +2516,7 @@ static int __init zs_init(void)
>   		pr_err("zs stat initialization failed\n");
>   		goto stat_fail;
>   	}
> +
>   	return 0;
>
>   stat_fail:
> @@ -2226,7 +2525,8 @@ static int __init zs_init(void)
>   #endif
>   notifier_fail:
>   	zs_unregister_cpu_notifier();
> -
> +	kern_unmount(zsmalloc_mnt);
> +out:
>   	return ret;
>   }
>
> @@ -2237,6 +2537,8 @@ static void __exit zs_exit(void)
>   #endif
>   	zs_unregister_cpu_notifier();
>
> +	kern_unmount(zsmalloc_mnt);
> +
>   	zs_stat_exit();
>   }
>
>


* Re: [PATCH v3 13/16] zsmalloc: migrate head page of zspage
  2016-04-19  6:08   ` Chulmin Kim
@ 2016-04-19  6:15     ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-19  6:15 UTC (permalink / raw)
  To: Chulmin Kim; +Cc: Andrew Morton, linux-kernel, linux-mm, s.suk, sunae.seo

Hello Chulmin,

On Tue, Apr 19, 2016 at 03:08:48PM +0900, Chulmin Kim wrote:
> On 2016년 03월 30일 16:12, Minchan Kim wrote:
> >This patch introduces a run-time migration feature for zspage.
> >To begin with, it supports only head page migration for easy
> >review (later patches will support tail page migration).
> >
> >For migration, it supports three functions
> >
> >* zs_page_isolate
> >
> >It isolates a zspage which includes the subpage the VM wants to
> >migrate from its class so no one can allocate a new object from
> >the zspage. IOW, allocation freeze.
> >
> >* zs_page_migrate
> >
> >First of all, it freezes the zspage to prevent zspage destruction
> >so no one can free an object. Then, it copies the content from
> >oldpage to newpage and creates a new page chain with the new page.
> >If that was successful, it drops the refcount of the old page to
> >free it and puts the new zspage back into the right data structure
> >of zsmalloc. Lastly, it unfreezes the zspage so object
> >allocation/free is allowed from then on.
> >
> >* zs_page_putback
> >
> >It returns an isolated zspage to the right fullness_group list
> >if migrating a page fails.
> >
> >NOTE: A hurdle in supporting migration is that the zspage may be
> >destroyed while migration is going on. Once a zspage is isolated,
> >no one can allocate an object from the zspage but objects can
> >still be deallocated freely, so the zspage could be destroyed
> >before all of its objects are frozen to prevent deallocation.
> >The problem is the large window between zs_page_isolate and
> >freeze_zspage in zs_page_migrate, during which the zspage could
> >be destroyed.
> >
> >An easy approach to the problem is to freeze objects in
> >zs_page_isolate, but that has the drawback that no object can be
> >deallocated between isolation and a failed migration. Since there
> >is a large time gap between isolation and migration, any object
> >freeing on another CPU would have to spin on pin_tag, which would
> >cause big latency. So this patch introduces lock_zspage, which
> >takes the page lock of all pages in a zspage right before freeing
> >the zspage. VM migration locks the page, too, right before calling
> >->migratepage, so such a race no longer exists.
> >
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  include/uapi/linux/magic.h |   1 +
> >  mm/zsmalloc.c              | 332 +++++++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 318 insertions(+), 15 deletions(-)
> >
> >diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> >index e1fbe72c39c0..93b1affe4801 100644
> >--- a/include/uapi/linux/magic.h
> >+++ b/include/uapi/linux/magic.h
> >@@ -79,5 +79,6 @@
> >  #define NSFS_MAGIC		0x6e736673
> >  #define BPF_FS_MAGIC		0xcafe4a11
> >  #define BALLOON_KVM_MAGIC	0x13661366
> >+#define ZSMALLOC_MAGIC		0x58295829
> >
> >  #endif /* __LINUX_MAGIC_H__ */
> >diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> >index ac8ca7b10720..f6c9138c3be0 100644
> >--- a/mm/zsmalloc.c
> >+++ b/mm/zsmalloc.c
> >@@ -56,6 +56,8 @@
> >  #include <linux/debugfs.h>
> >  #include <linux/zsmalloc.h>
> >  #include <linux/zpool.h>
> >+#include <linux/mount.h>
> >+#include <linux/migrate.h>
> >
> >  /*
> >   * This must be power of 2 and greater than of equal to sizeof(link_free).
> >@@ -182,6 +184,8 @@ struct zs_size_stat {
> >  static struct dentry *zs_stat_root;
> >  #endif
> >
> >+static struct vfsmount *zsmalloc_mnt;
> >+
> >  /*
> >   * number of size_classes
> >   */
> >@@ -263,6 +267,7 @@ struct zs_pool {
> >  #ifdef CONFIG_ZSMALLOC_STAT
> >  	struct dentry *stat_dentry;
> >  #endif
> >+	struct inode *inode;
> >  };
> >
> >  struct zs_meta {
> >@@ -412,6 +417,29 @@ static int is_last_page(struct page *page)
> >  	return PagePrivate2(page);
> >  }
> >
> >+/*
> >+ * Indicate whether the zspage is isolated for page migration.
> >+ * Protected by the size_class lock.
> >+ */
> >+static void SetZsPageIsolate(struct page *first_page)
> >+{
> >+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >+	SetPageUptodate(first_page);
> >+}
> >+
> >+static int ZsPageIsolate(struct page *first_page)
> >+{
> >+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >+
> >+	return PageUptodate(first_page);
> >+}
> >+
> >+static void ClearZsPageIsolate(struct page *first_page)
> >+{
> >+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >+	ClearPageUptodate(first_page);
> >+}
> >+
> >  static int get_zspage_inuse(struct page *first_page)
> >  {
> >  	struct zs_meta *m;
> >@@ -783,8 +811,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
> >  	if (newfg == currfg)
> >  		goto out;
> >
> >-	remove_zspage(class, currfg, first_page);
> >-	insert_zspage(class, newfg, first_page);
> >+	/* Later, putback will insert the page into the right list */
> >+	if (!ZsPageIsolate(first_page)) {
> >+		remove_zspage(class, currfg, first_page);
> >+		insert_zspage(class, newfg, first_page);
> >+	}
> >  	set_zspage_mapping(first_page, class_idx, newfg);
> >
> >  out:
> >@@ -950,12 +981,31 @@ static void unpin_tag(unsigned long handle)
> >
> >  static void reset_page(struct page *page)
> >  {
> >+	if (!PageIsolated(page))
> >+		__ClearPageMovable(page);
> >+	ClearPageIsolated(page);
> >  	clear_bit(PG_private, &page->flags);
> >  	clear_bit(PG_private_2, &page->flags);
> >  	set_page_private(page, 0);
> >  	page->freelist = NULL;
> >  }
> >
> >+/**
> >+ * lock_zspage - lock all pages in the zspage
> >+ * @first_page: head page of the zspage
> >+ *
> >+ * To prevent destruction during migration, zspage freeing should
> >+ * hold the locks of all pages in the zspage.
> >+ */
> >+void lock_zspage(struct page *first_page)
> >+{
> >+	struct page *cursor = first_page;
> >+
> >+	do {
> >+		while (!trylock_page(cursor));
> >+	} while ((cursor = get_next_page(cursor)) != NULL);
> >+}
> >+
> >  static void free_zspage(struct zs_pool *pool, struct page *first_page)
> >  {
> >  	struct page *nextp, *tmp, *head_extra;
> >@@ -963,26 +1013,31 @@ static void free_zspage(struct zs_pool *pool, struct page *first_page)
> >  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >  	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
> >
> >+	lock_zspage(first_page);
> >  	head_extra = (struct page *)page_private(first_page);
> >
> >-	reset_page(first_page);
> >-	__free_page(first_page);
> >-
> >  	/* zspage with only 1 system page */
> >  	if (!head_extra)
> >-		return;
> >+		goto out;
> >
> >  	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
> >  		list_del(&nextp->lru);
> >  		reset_page(nextp);
> >-		__free_page(nextp);
> >+		unlock_page(nextp);
> >+		put_page(nextp);
> >  	}
> >  	reset_page(head_extra);
> >-	__free_page(head_extra);
> >+	unlock_page(head_extra);
> >+	put_page(head_extra);
> >+out:
> >+	reset_page(first_page);
> >+	unlock_page(first_page);
> >+	put_page(first_page);
> >  }
> >
> >  /* Initialize a newly allocated zspage */
> >-static void init_zspage(struct size_class *class, struct page *first_page)
> >+static void init_zspage(struct size_class *class, struct page *first_page,
> >+			struct address_space *mapping)
> >  {
> >  	int freeobj = 1;
> >  	unsigned long off = 0;
> >@@ -991,6 +1046,9 @@ static void init_zspage(struct size_class *class, struct page *first_page)
> >  	first_page->freelist = NULL;
> >  	INIT_LIST_HEAD(&first_page->lru);
> >  	set_zspage_inuse(first_page, 0);
> >+	BUG_ON(!trylock_page(first_page));
> >+	__SetPageMovable(first_page, mapping);
> >+	unlock_page(first_page);
> >
> >  	while (page) {
> >  		struct page *next_page;
> >@@ -1065,10 +1123,45 @@ static void create_page_chain(struct page *pages[], int nr_pages)
> >  	}
> >  }
> >
> >+static void replace_sub_page(struct size_class *class, struct page *first_page,
> >+		struct page *newpage, struct page *oldpage)
> >+{
> >+	struct page *page;
> >+	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL,};
> >+	int idx = 0;
> >+
> >+	page = first_page;
> >+	do {
> >+		if (page == oldpage)
> >+			pages[idx] = newpage;
> >+		else
> >+			pages[idx] = page;
> >+		idx++;
> >+	} while ((page = get_next_page(page)) != NULL);
> >+
> >+	create_page_chain(pages, class->pages_per_zspage);
> >+
> >+	if (is_first_page(oldpage)) {
> >+		enum fullness_group fg;
> >+		int class_idx;
> >+
> >+		SetZsPageIsolate(newpage);
> >+		get_zspage_mapping(oldpage, &class_idx, &fg);
> >+		set_zspage_mapping(newpage, class_idx, fg);
> >+		set_freeobj(newpage, get_freeobj(oldpage));
> >+		set_zspage_inuse(newpage, get_zspage_inuse(oldpage));
> >+		if (class->huge)
> >+			set_page_private(newpage,  page_private(oldpage));
> >+	}
> >+
> >+	__SetPageMovable(newpage, oldpage->mapping);
> >+}
> >+
> >  /*
> >   * Allocate a zspage for the given size class
> >   */
> >-static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> >+static struct page *alloc_zspage(struct zs_pool *pool,
> >+				struct size_class *class)
> >  {
> >  	int i;
> >  	struct page *first_page = NULL;
> >@@ -1088,7 +1181,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> >  	for (i = 0; i < class->pages_per_zspage; i++) {
> >  		struct page *page;
> >
> >-		page = alloc_page(flags);
> >+		page = alloc_page(pool->flags);
> >  		if (!page) {
> >  			while (--i >= 0)
> >  				__free_page(pages[i]);
> >@@ -1100,7 +1193,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
> >
> >  	create_page_chain(pages, class->pages_per_zspage);
> >  	first_page = pages[0];
> >-	init_zspage(class, first_page);
> >+	init_zspage(class, first_page, pool->inode->i_mapping);
> >
> >  	return first_page;
> >  }
> >@@ -1499,7 +1592,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size)
> >
> >  	if (!first_page) {
> >  		spin_unlock(&class->lock);
> >-		first_page = alloc_zspage(class, pool->flags);
> >+		first_page = alloc_zspage(pool, class);
> >  		if (unlikely(!first_page)) {
> >  			free_handle(pool, handle);
> >  			return 0;
> >@@ -1559,6 +1652,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> >  	if (unlikely(!handle))
> >  		return;
> >
> >+	/* Once handle is pinned, page|object migration cannot work */
> >  	pin_tag(handle);
> >  	obj = handle_to_obj(handle);
> >  	obj_to_location(obj, &f_page, &f_objidx);
> >@@ -1714,6 +1808,9 @@ static enum fullness_group putback_zspage(struct size_class *class,
> >  {
> >  	enum fullness_group fullness;
> >
> >+	VM_BUG_ON_PAGE(!list_empty(&first_page->lru), first_page);
> >+	VM_BUG_ON_PAGE(ZsPageIsolate(first_page), first_page);
> >+
> >  	fullness = get_fullness_group(class, first_page);
> >  	insert_zspage(class, fullness, first_page);
> >  	set_zspage_mapping(first_page, class->index, fullness);
> >@@ -2059,6 +2156,173 @@ static int zs_register_shrinker(struct zs_pool *pool)
> >  	return register_shrinker(&pool->shrinker);
> >  }
> >
> >+bool zs_page_isolate(struct page *page, isolate_mode_t mode)
> >+{
> >+	struct zs_pool *pool;
> >+	struct size_class *class;
> >+	int class_idx;
> >+	enum fullness_group fullness;
> >+	struct page *first_page;
> >+
> >+	/*
> >+	 * The page is locked so it cannot be destroyed.
> >+	 * For details, look at lock_zspage in free_zspage.
> >+	 */
> >+	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >+	VM_BUG_ON_PAGE(PageIsolated(page), page);
> >+	/*
> >+	 * In this implementation, only first-page migration is allowed.
> >+	 */
> >+	VM_BUG_ON_PAGE(!is_first_page(page), page);
> >+	first_page = page;
> >+
> >+	/*
> >+	 * Without the class lock, fullness is meaningless while the
> >+	 * constant class_idx is okay. We will get it again under the
> >+	 * class lock below.
> >+	 */
> >+	get_zspage_mapping(first_page, &class_idx, &fullness);
> >+	pool = page->mapping->private_data;
> >+	class = pool->size_class[class_idx];
> >+
> >+	if (!spin_trylock(&class->lock))
> >+		return false;
> >+
> >+	get_zspage_mapping(first_page, &class_idx, &fullness);
> >+	remove_zspage(class, fullness, first_page);
> >+	SetZsPageIsolate(first_page);
> >+	SetPageIsolated(page);
> >+	spin_unlock(&class->lock);
> >+
> >+	return true;
> >+}
> 
> Hello, Minchan.
> 
> We found another race condition.
> 
> When an alloc_zspage(), which is not protected by any lock, is
> in flight, a migration context can isolate the zs subpage that is
> still being initialized by alloc_zspage().
> 
> We hit the VM_BUG_ON in remove_zspage() above as a consequence of
> "page->index" wrongly being NULL (it seems not yet initialized).
> 
> Though it is a real problem, this race is somewhat similar to the
> one we detected last time, so hopefully it will be fixed in the
> next version anyway.
> 
> I am reporting this just for the record.


I found the problem you reported and have already fixed it in my WIP version.
Your report convinces me that my analysis was right, too. :)

Thanks for the analysis and the report.
I really appreciate your help, Chulmin!
 

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping
  2016-04-17 15:08   ` Sergey Senozhatsky
@ 2016-04-19  7:40     ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-19  7:40 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On Mon, Apr 18, 2016 at 12:08:04AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (03/30/16 16:12), Minchan Kim wrote:
> [..]
> > +static int get_zspage_inuse(struct page *first_page)
> > +{
> > +	struct zs_meta *m;
> > +
> > +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> > +
> > +	m = (struct zs_meta *)&first_page->mapping;
> ..
> > +static void set_zspage_inuse(struct page *first_page, int val)
> > +{
> > +	struct zs_meta *m;
> > +
> > +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> > +
> > +	m = (struct zs_meta *)&first_page->mapping;
> ..
> > +static void mod_zspage_inuse(struct page *first_page, int val)
> > +{
> > +	struct zs_meta *m;
> > +
> > +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> > +
> > +	m = (struct zs_meta *)&first_page->mapping;
> ..
> >  static void get_zspage_mapping(struct page *first_page,
> >  				unsigned int *class_idx,
> >  				enum fullness_group *fullness)
> >  {
> > -	unsigned long m;
> > +	struct zs_meta *m;
> > +
> >  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> > +	m = (struct zs_meta *)&first_page->mapping;
> ..
> >  static void set_zspage_mapping(struct page *first_page,
> >  				unsigned int class_idx,
> >  				enum fullness_group fullness)
> >  {
> > +	struct zs_meta *m;
> > +
> >  	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >  
> > +	m = (struct zs_meta *)&first_page->mapping;
> > +	m->fullness = fullness;
> > +	m->class = class_idx;
> >  }
> 
> 
> a nitpick: this
> 
> 	struct zs_meta *m;
> 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> 	m = (struct zs_meta *)&first_page->mapping;
> 
> 
> seems to be common in several places; maybe it makes sense to
> factor it out and turn it into a macro or a static inline helper?
> 
> other than that, looks good to me

Yeb.

> 
> Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

Thanks for the review!
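
For illustration only, the factored-out accessor Sergey suggests might
look roughly like the sketch below; the helper name get_zs_meta is an
assumption, not something taken from the patch:

	static inline struct zs_meta *get_zs_meta(struct page *first_page)
	{
		VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);

		return (struct zs_meta *)&first_page->mapping;
	}

With such a helper, set_zspage_mapping() would reduce to something like:

	struct zs_meta *m = get_zs_meta(first_page);

	m->fullness = fullness;
	m->class = class_idx;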

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 08/16] zsmalloc: squeeze freelist into page->mapping
  2016-04-17 15:56   ` Sergey Senozhatsky
@ 2016-04-19  7:42     ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-19  7:42 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Rik van Riel, rknize, Gioh Kim,
	Sangseok Lee, Chan Gyun Jeong, Al Viro, YiPing Xu

On Mon, Apr 18, 2016 at 12:56:21AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (03/30/16 16:12), Minchan Kim wrote:
> [..]
> > +static void objidx_to_page_and_offset(struct size_class *class,
> > +				struct page *first_page,
> > +				unsigned long obj_idx,
> > +				struct page **obj_page,
> > +				unsigned long *offset_in_page)
> >  {
> > -	unsigned long obj;
> > +	int i;
> > +	unsigned long offset;
> > +	struct page *cursor;
> > +	int nr_page;
> >  
> > -	if (!page) {
> > -		VM_BUG_ON(obj_idx);
> > -		return NULL;
> > -	}
> > +	offset = obj_idx * class->size;
> 
> so we already know the `offset' before we call objidx_to_page_and_offset(),
> thus we can drop `struct size_class *class' and `obj_idx', and pass
> `long obj_offset'  (which is `obj_idx * class->size') instead, right?
> 
> we also may be able to return `cursor' from the function.
> 
> static struct page *objidx_to_page_and_offset(struct page *first_page,
> 					unsigned long obj_offset,
> 					unsigned long *offset_in_page);
> 
> this can save ~20 instructions, which is not so terrible for a hot path
> like obj_malloc(). what do you think?
> 
> well, it seems that `unsigned long *offset_in_page' can be calculated
> outside of this function too; it's basically
> 
> 	*offset_in_page = (obj_idx * class->size) & ~PAGE_MASK;
> 
> so we don't need to supply it to this function, nor modify it there,
> which can save ~40 instructions on my system. does this sound silly?

Sounds smart. :)
At the least, we will use it in the hot path.

Thanks!
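
For illustration only, the slimmed-down lookup Sergey describes might
end up looking roughly like the sketch below; the name objidx_to_page
and the exact signature are assumptions, and the final code may differ:

	static struct page *objidx_to_page(struct page *first_page,
					unsigned long obj_offset)
	{
		struct page *cursor = first_page;
		int nr_page = obj_offset >> PAGE_SHIFT;
		int i;

		for (i = 0; i < nr_page; i++)
			cursor = get_next_page(cursor);

		return cursor;
	}

with the caller computing the in-page offset itself, as in Sergey's
snippet above:

	offset_in_page = (obj_idx * class->size) & ~PAGE_MASK;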

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 10/16] zsmalloc: factor page chain functionality out
  2016-04-18  0:33   ` Sergey Senozhatsky
@ 2016-04-19  7:46     ` Minchan Kim
  0 siblings, 0 replies; 65+ messages in thread
From: Minchan Kim @ 2016-04-19  7:46 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

On Mon, Apr 18, 2016 at 09:33:05AM +0900, Sergey Senozhatsky wrote:
> Hello,
> 
> On (03/30/16 16:12), Minchan Kim wrote:
> > @@ -1421,7 +1434,6 @@ static unsigned long obj_malloc(struct size_class *class,
> >  	unsigned long m_offset;
> >  	void *vaddr;
> >  
> > -	handle |= OBJ_ALLOCATED_TAG;
> 
> a nitpick, why did you replace this ALLOCATED_TAG assignment
> with 2 'handle | OBJ_ALLOCATED_TAG'?

I thought the handle variable here should be the pure handle; OBJ_ALLOCATED_TAG
should live in the header (i.e., link->handle), not in the pure handle itself.

> 
> 	-ss
> 
> >  	obj = get_freeobj(first_page);
> >  	objidx_to_page_and_offset(class, first_page, obj,
> >  				&m_page, &m_offset);
> > @@ -1431,10 +1443,10 @@ static unsigned long obj_malloc(struct size_class *class,
> >  	set_freeobj(first_page, link->next >> OBJ_ALLOCATED_TAG);
> >  	if (!class->huge)
> >  		/* record handle in the header of allocated chunk */
> > -		link->handle = handle;
> > +		link->handle = handle | OBJ_ALLOCATED_TAG;
> >  	else
> >  		/* record handle in first_page->private */
> > -		set_page_private(first_page, handle);
> > +		set_page_private(first_page, handle | OBJ_ALLOCATED_TAG);
> >  	kunmap_atomic(vaddr);
> >  	mod_zspage_inuse(first_page, 1);
> >  	zs_stat_inc(class, OBJ_USED, 1);
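
For illustration only, the tagging convention described above amounts
to roughly the sketch below; the read side is shown schematically here
as an assumption and is not part of the quoted diff:

	/* store side: the tag is applied only where the value is recorded */
	link->handle = handle | OBJ_ALLOCATED_TAG;

	/* read side: strip the tag to recover the pure handle */
	if (link->handle & OBJ_ALLOCATED_TAG)
		handle = link->handle & ~OBJ_ALLOCATED_TAG;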

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage
  2016-04-18  1:04   ` Sergey Senozhatsky
@ 2016-04-19  7:51     ` Minchan Kim
  2016-04-19  7:53       ` Sergey Senozhatsky
  0 siblings, 1 reply; 65+ messages in thread
From: Minchan Kim @ 2016-04-19  7:51 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, linux-mm, jlayton, bfields,
	Vlastimil Babka, Joonsoo Kim, koct9i, aquini, virtualization,
	Mel Gorman, Hugh Dickins, Sergey Senozhatsky, Rik van Riel,
	rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong, Al Viro,
	YiPing Xu

Hi Sergey,

On Mon, Apr 18, 2016 at 10:04:08AM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (03/30/16 16:12), Minchan Kim wrote:
> [..]
> > @@ -1835,23 +1827,31 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
> >  			if (!migrate_zspage(pool, class, &cc))
> >  				break;
> >  
> > -			putback_zspage(pool, class, dst_page);
> > +			VM_BUG_ON_PAGE(putback_zspage(pool, class,
> > +				dst_page) == ZS_EMPTY, dst_page);
> 
> can this VM_BUG_ON_PAGE() condition ever be true?

I guess it is a leftover from the rebase, kept to catch any mistake.
But I'm heavily changing this part.
Please review the next version instead of this one in a few days. :)

> 
> >  		}
> >  		/* Stop if we couldn't find slot */
> >  		if (dst_page == NULL)
> >  			break;
> > -		putback_zspage(pool, class, dst_page);
> > -		if (putback_zspage(pool, class, src_page) == ZS_EMPTY)
> > +		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> > +				dst_page) == ZS_EMPTY, dst_page);
> 
> hm... this VM_BUG_ON_PAGE(dst_page) is sort of confusing. under what
> circumstances can it be true?
> 
> a minor nit, it took me some time (need some coffee I guess) to
> correctly parse this macro wrapper
> 
> 		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> 			dst_page) == ZS_EMPTY, dst_page);
> 
> may be do it like:
> 		fullness = putback_zspage(pool, class, dst_page);
> 		VM_BUG_ON_PAGE(fullness == ZS_EMPTY, dst_page);
> 
> 
> well, if we want to VM_BUG_ON_PAGE() at all: there haven't been any
> problems with compaction, so is there any specific reason these macros
> were added?
> 
> 
> 
> > +		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
> >  			pool->stats.pages_compacted += class->pages_per_zspage;
> > -		spin_unlock(&class->lock);
> > +			spin_unlock(&class->lock);
> > +			free_zspage(pool, class, src_page);
> 
> do we really need to free_zspage() out of class->lock?
> wouldn't something like this
> 
> 		if (putback_zspage(pool, class, src_page) == ZS_EMPTY) {
> 			pool->stats.pages_compacted += class->pages_per_zspage;
> 			free_zspage(pool, class, src_page);
> 		}
> 		spin_unlock(&class->lock);
> 
> be simpler?

The reason I did it outside class->lock is a deadlock between the page
lock and class->lock with the upcoming page migration.
However, as I said, I'm now heavily changing this part. :)
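
A minimal sketch of that ordering concern, assuming the migration path
of patch 13 (not actual code from the series):

	/* migration side: VM holds the page lock, zsmalloc then takes class->lock */
	lock_page(page);		/* done by the VM before ->migratepage */
	spin_lock(&class->lock);

	/* freeing side: free_zspage() takes the page locks via lock_zspage() */
	spin_lock(&class->lock);
	lock_zspage(first_page);	/* reverse order against the path above */

so free_zspage() has to run after class->lock is dropped.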

> 
> besides, free_zspage() now updates class stats outside the class lock;
> not critical, but still.
> 
> 	-ss
> 
> > +		} else {
> > +			spin_unlock(&class->lock);
> > +		}
> > +
> >  		cond_resched();
> >  		spin_lock(&class->lock);
> >  	}
> >  
> >  	if (src_page)
> > -		putback_zspage(pool, class, src_page);
> > +		VM_BUG_ON_PAGE(putback_zspage(pool, class,
> > +				src_page) == ZS_EMPTY, src_page);
> >  
> >  	spin_unlock(&class->lock);
> >  }

^ permalink raw reply	[flat|nested] 65+ messages in thread

* Re: [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage
  2016-04-19  7:51     ` Minchan Kim
@ 2016-04-19  7:53       ` Sergey Senozhatsky
  0 siblings, 0 replies; 65+ messages in thread
From: Sergey Senozhatsky @ 2016-04-19  7:53 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, linux-mm,
	jlayton, bfields, Vlastimil Babka, Joonsoo Kim, koct9i, aquini,
	virtualization, Mel Gorman, Hugh Dickins, Sergey Senozhatsky,
	Rik van Riel, rknize, Gioh Kim, Sangseok Lee, Chan Gyun Jeong,
	Al Viro, YiPing Xu

Hello Minchan,

On (04/19/16 16:51), Minchan Kim wrote:
[..]
> 
> I guess it is a leftover from the rebase, kept to catch any mistake.
> But I'm heavily changing this part.
> Please review the next version instead of this one in a few days. :)

ah, got it. thanks!

	-ss

^ permalink raw reply	[flat|nested] 65+ messages in thread

end of thread, other threads:[~2016-04-19  7:52 UTC | newest]

Thread overview: 65+ messages
2016-03-30  7:11 [PATCH v3 00/16] Support non-lru page migration Minchan Kim
2016-03-30  7:12 ` [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page Minchan Kim
2016-04-01 12:58   ` Vlastimil Babka
2016-04-04  1:39     ` Minchan Kim
2016-04-04  4:45       ` Naoya Horiguchi
2016-04-04 14:46         ` Vlastimil Babka
2016-04-05  1:54           ` Naoya Horiguchi
2016-04-05  8:20             ` Vlastimil Babka
2016-04-06  0:54               ` Naoya Horiguchi
2016-04-06  7:57                 ` Vlastimil Babka
2016-04-04  5:53   ` Balbir Singh
2016-04-04  6:01     ` Minchan Kim
2016-04-05  3:10       ` Balbir Singh
2016-03-30  7:12 ` [PATCH v3 02/16] mm/compaction: support non-lru movable page migration Minchan Kim
2016-04-01 21:29   ` Vlastimil Babka
2016-04-04  5:12     ` Minchan Kim
2016-04-04 13:24       ` Vlastimil Babka
2016-04-07  2:35         ` Minchan Kim
2016-04-12  8:00   ` Chulmin Kim
2016-04-12 14:25     ` Minchan Kim
2016-03-30  7:12 ` [PATCH v3 03/16] mm: add non-lru movable page support document Minchan Kim
2016-04-01 14:38   ` Vlastimil Babka
2016-04-04  2:25     ` Minchan Kim
2016-04-04 13:09       ` Vlastimil Babka
2016-04-07  2:27         ` Minchan Kim
2016-03-30  7:12 ` [PATCH v3 04/16] mm/balloon: use general movable page feature into balloon Minchan Kim
2016-04-05 12:03   ` Vlastimil Babka
2016-04-11  4:29     ` Minchan Kim
2016-03-30  7:12 ` [PATCH v3 05/16] zsmalloc: keep max_object in size_class Minchan Kim
2016-04-17 15:08   ` Sergey Senozhatsky
2016-03-30  7:12 ` [PATCH v3 06/16] zsmalloc: squeeze inuse into page->mapping Minchan Kim
2016-04-17 15:08   ` Sergey Senozhatsky
2016-04-19  7:40     ` Minchan Kim
2016-03-30  7:12 ` [PATCH v3 07/16] zsmalloc: remove page_mapcount_reset Minchan Kim
2016-04-17 15:11   ` Sergey Senozhatsky
2016-03-30  7:12 ` [PATCH v3 08/16] zsmalloc: squeeze freelist into page->mapping Minchan Kim
2016-04-17 15:56   ` Sergey Senozhatsky
2016-04-19  7:42     ` Minchan Kim
2016-03-30  7:12 ` [PATCH v3 09/16] zsmalloc: move struct zs_meta from mapping to freelist Minchan Kim
2016-04-17 15:22   ` Sergey Senozhatsky
2016-03-30  7:12 ` [PATCH v3 10/16] zsmalloc: factor page chain functionality out Minchan Kim
2016-04-18  0:33   ` Sergey Senozhatsky
2016-04-19  7:46     ` Minchan Kim
2016-03-30  7:12 ` [PATCH v3 11/16] zsmalloc: separate free_zspage from putback_zspage Minchan Kim
2016-04-18  1:04   ` Sergey Senozhatsky
2016-04-19  7:51     ` Minchan Kim
2016-04-19  7:53       ` Sergey Senozhatsky
2016-03-30  7:12 ` [PATCH v3 12/16] zsmalloc: zs_compact refactoring Minchan Kim
2016-04-04  8:04   ` Chulmin Kim
2016-04-04  9:01     ` Minchan Kim
2016-03-30  7:12 ` [PATCH v3 13/16] zsmalloc: migrate head page of zspage Minchan Kim
2016-04-06 13:01   ` Chulmin Kim
2016-04-07  0:34     ` Chulmin Kim
2016-04-07  0:43     ` Minchan Kim
2016-04-19  6:08   ` Chulmin Kim
2016-04-19  6:15     ` Minchan Kim
2016-03-30  7:12 ` [PATCH v3 14/16] zsmalloc: use single linked list for page chain Minchan Kim
2016-03-30  7:12 ` [PATCH v3 15/16] zsmalloc: migrate tail pages in zspage Minchan Kim
2016-03-30  7:12 ` [PATCH v3 16/16] zram: use __GFP_MOVABLE for memory allocation Minchan Kim
2016-03-30 23:11 ` [PATCH v3 00/16] Support non-lru page migration Andrew Morton
2016-03-31  0:29   ` Sergey Senozhatsky
2016-03-31  0:57     ` Minchan Kim
2016-03-31  0:57   ` Minchan Kim
2016-04-04 13:17 ` John Einar Reitan
2016-04-11  4:35   ` Minchan Kim
