* [PATCH v7 00/12] Support non-lru page migration
@ 2016-05-31 23:21 Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 01/12] mm: use put_page to free page instead of putback_lru_page Minchan Kim
                   ` (13 more replies)
  0 siblings, 14 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Vlastimil Babka, dri-devel,
	Hugh Dickins, John Einar Reitan, Jonathan Corbet, Joonsoo Kim,
	Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

Recently, I have received many reports about performance degradation in
embedded systems (Android mobile phones, webOS TVs and so on) and frequent
fork failures.

The problem was fragmentation, caused mainly by zram and the GPU driver.
Under memory pressure, their pages are spread across all pageblocks and
cannot be migrated by the current compaction algorithm, which supports
only LRU pages. In the end, compaction cannot work well, so the reclaimer
shrinks all of the working set pages. That makes the system very slow and
even causes fork, which requires order-2 or order-3 allocations, to fail easily.

Another pain point is that those pages cannot use CMA memory space, so when
an OOM kill happens, I can see many free pages in the CMA area, which is not
memory efficient. In our product, which has a big CMA area, zones are
reclaimed too excessively to allocate GPU and zram pages although there is
lots of free space in CMA, so the system easily becomes very slow.

To solve these problems, this patchset adds a facility to migrate non-lru
pages by introducing new functions and page flags to help migration.

struct address_space_operations {
	..
	..
	bool (*isolate_page)(struct page *, isolate_mode_t);
	void (*putback_page)(struct page *);
	..
}

new page flags

	PG_movable
	PG_isolated

For details, please read description in "mm: migrate: support non-lru
movable page migration".

Originally, Gioh Kim had tried to support this feature but he moved on, so
I took over the work. I took much code from his work and changed it a
little, and Konstantin Khlebnikov helped Gioh a lot, so he deserves much
credit, too.

And I should mention Chulmin, who has tested this patchset heavily so that
I could find many bugs. :)

Thanks, Gioh, Konstantin and Chulmin!

This patchset consists of five parts.

1. clean up migration
  mm: use put_page to free page instead of putback_lru_page

2. add non-lru page migration feature
  mm: migrate: support non-lru movable page migration

3. rework KVM memory-ballooning
  mm: balloon: use general non-lru movable page feature

4. zsmalloc refactoring for preparing page migration
  zsmalloc: keep max_object in size_class
  zsmalloc: use bit_spin_lock
  zsmalloc: use accessor
  zsmalloc: factor page chain functionality out
  zsmalloc: introduce zspage structure
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: use freeobj for index

5. zsmalloc page migration
  zsmalloc: page migration support
  zram: use __GFP_MOVABLE for memory allocation

* From v6
  * rebase on mmotm-2016-05-27-15-19
  * clean up zsmalloc - Sergey
  * clean up non-lru page migration - Vlastimil

* From v5
  * rebase on next-20160520
  * move utility functions to compaction.c and export - Sergey
  * zsmalloc double free fix - Sergey
  * add additional Reviewed-by for zsmalloc - Sergey

* From v4
  * rebase on mmotm-2016-05-05-17-19
  * fix huge object migration - Chulmin
  * !CONFIG_COMPACTION support for zsmalloc

* From v3
  * rebase on mmotm-2016-04-06-20-40
  * fix swap_info deadlock - Chulmin
  * race without page_lock - Vlastimil
  * no use page._mapcount for potential user-mapped page driver - Vlastimil
  * fix and enhance doc/description - Vlastimil
  * use page->mapping lower bits to represent PG_movable
  * make driver side's rule simple.

* From v2
  * rebase on mmotm-2016-03-29-15-54-16
  * check PageMovable before lock_page - Joonsoo
  * check PageMovable before PageIsolated checking - Joonsoo
  * add more description about rule

* From v1
  * rebase on v4.5-mmotm-2016-03-17-15-04
  * reordering patches to merge clean-up patches first
  * add Acked-by/Reviewed-by from Vlastimil and Sergey
  * use each own mount model instead of reusing anon_inode_fs - Al Viro
  * small changes - YiPing, Gioh

Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: dri-devel@lists.freedesktop.org
Cc: Hugh Dickins <hughd@google.com>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Gioh Kim <gi-oh.kim@profitbricks.com>
Cc: Chan Gyun Jeong <chan.jeong@lge.com>
Cc: Sangseok Lee <sangseok.lee@lge.com>
Cc: Kyeongdon Kim <kyeongdon.kim@lge.com>
Cc: Chulmin Kim <cmlaika.kim@samsung.com>

Minchan Kim (12):
  mm: use put_page to free page instead of putback_lru_page
  mm: migrate: support non-lru movable page migration
  mm: balloon: use general non-lru movable page feature
  zsmalloc: keep max_object in size_class
  zsmalloc: use bit_spin_lock
  zsmalloc: use accessor
  zsmalloc: factor page chain functionality out
  zsmalloc: introduce zspage structure
  zsmalloc: separate free_zspage from putback_zspage
  zsmalloc: use freeobj for index
  zsmalloc: page migration support
  zram: use __GFP_MOVABLE for memory allocation

 Documentation/filesystems/Locking  |    4 +
 Documentation/filesystems/vfs.txt  |   11 +
 Documentation/vm/page_migration    |  107 ++-
 drivers/block/zram/zram_drv.c      |    6 +-
 drivers/virtio/virtio_balloon.c    |   54 +-
 include/linux/balloon_compaction.h |   53 +-
 include/linux/compaction.h         |   17 +
 include/linux/fs.h                 |    2 +
 include/linux/ksm.h                |    3 +-
 include/linux/migrate.h            |    2 +
 include/linux/mm.h                 |    1 +
 include/linux/page-flags.h         |   33 +-
 include/uapi/linux/magic.h         |    2 +
 mm/balloon_compaction.c            |   94 +--
 mm/compaction.c                    |   79 ++-
 mm/ksm.c                           |    4 +-
 mm/migrate.c                       |  257 +++++--
 mm/page_alloc.c                    |    2 +-
 mm/util.c                          |    6 +-
 mm/vmscan.c                        |    2 +-
 mm/zsmalloc.c                      | 1349 +++++++++++++++++++++++++-----------
 21 files changed, 1479 insertions(+), 609 deletions(-)

-- 
1.9.1

* [PATCH v7 01/12] mm: use put_page to free page instead of putback_lru_page
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 02/12] mm: migrate: support non-lru movable page migration Minchan Kim
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Rik van Riel, Mel Gorman,
	Hugh Dickins, Naoya Horiguchi, Vlastimil Babka

The procedure of page migration is as follows:

First of all, it should isolate a page from the LRU and try to
migrate the page. If that is successful, it releases the page
for freeing. Otherwise, it should put the page back on the LRU
list.

For LRU pages, we have used putback_lru_page for both freeing
and putting the page back on the LRU list. That is okay because
put_page is aware of the LRU list, so if it releases the last
refcount of the page, it removes the page from the LRU list.
However, it performs unnecessary operations (e.g., lru_cache_add,
pagevec and flag operations; not significant, but not worth doing)
and makes it harder to support new non-lru page migration because
put_page isn't aware of a non-lru page's data structure.

To solve the problem, we could add a new hook in put_page with a
PageMovable flag check, but that would increase overhead in a hot
path and need a new locking scheme to stabilize the flag check
against put_page.

So, this patch cleans it up by dividing the two semantics (i.e.,
put and putback). If migration is successful, use put_page instead
of putback_lru_page, and use putback_lru_page only on failure.
That makes the code more readable and doesn't add overhead to put_page.

Comment from Vlastimil
"Yeah, and compaction (perhaps also other migration users) has to drain
the lru pvec... Getting rid of this stuff is worth even by itself."
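
In other words, after this patch the tail of unmap_and_move() boils down
to the following shape (a condensed sketch of the diff below; the -EAGAIN
retry path, statistics updates and newpage handling are omitted):

	rc = __unmap_and_move(page, newpage, force, mode);

	if (rc == MIGRATEPAGE_SUCCESS)
		put_page(page);		/* drop the reference taken at isolation */
	else if (rc != -EAGAIN)
		putback_lru_page(page);	/* migration failed: back to the LRU */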

Cc: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/migrate.c | 64 +++++++++++++++++++++++++++++++++++++-----------------------
 1 file changed, 40 insertions(+), 24 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 9baf41c877ff..2666f28b5236 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -913,6 +913,19 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		put_anon_vma(anon_vma);
 	unlock_page(page);
 out:
+	/*
+	 * If migration is successful, decrease refcount of the newpage
+	 * which will not free the page because new page owner increased
+	 * refcounter. As well, if it is LRU page, add the page to LRU
+	 * list in here.
+	 */
+	if (rc == MIGRATEPAGE_SUCCESS) {
+		if (unlikely(__is_movable_balloon_page(newpage)))
+			put_page(newpage);
+		else
+			putback_lru_page(newpage);
+	}
+
 	return rc;
 }
 
@@ -946,6 +959,12 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 
 	if (page_count(page) == 1) {
 		/* page was freed from under us. So we are done. */
+		ClearPageActive(page);
+		ClearPageUnevictable(page);
+		if (put_new_page)
+			put_new_page(newpage, private);
+		else
+			put_page(newpage);
 		goto out;
 	}
 
@@ -958,10 +977,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 	}
 
 	rc = __unmap_and_move(page, newpage, force, mode);
-	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_new_page = NULL;
+	if (rc == MIGRATEPAGE_SUCCESS)
 		set_page_owner_migrate_reason(newpage, reason);
-	}
 
 out:
 	if (rc != -EAGAIN) {
@@ -974,34 +991,33 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		/* Soft-offlined page shouldn't go through lru cache list */
-		if (reason == MR_MEMORY_FAILURE && rc == MIGRATEPAGE_SUCCESS) {
+	}
+
+	/*
+	 * If migration is successful, releases reference grabbed during
+	 * isolation. Otherwise, restore the page to right list unless
+	 * we want to retry.
+	 */
+	if (rc == MIGRATEPAGE_SUCCESS) {
+		put_page(page);
+		if (reason == MR_MEMORY_FAILURE) {
 			/*
-			 * With this release, we free successfully migrated
-			 * page and set PG_HWPoison on just freed page
-			 * intentionally. Although it's rather weird, it's how
-			 * HWPoison flag works at the moment.
+			 * Set PG_HWPoison on just freed page
+			 * intentionally. Although it's rather weird,
+			 * it's how HWPoison flag works at the moment.
 			 */
-			put_page(page);
 			if (!test_set_page_hwpoison(page))
 				num_poisoned_pages_inc();
-		} else
+		}
+	} else {
+		if (rc != -EAGAIN)
 			putback_lru_page(page);
+		if (put_new_page)
+			put_new_page(newpage, private);
+		else
+			put_page(newpage);
 	}
 
-	/*
-	 * If migration was not successful and there's a freeing callback, use
-	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
-	 * during isolation.
-	 */
-	if (put_new_page)
-		put_new_page(newpage, private);
-	else if (unlikely(__is_movable_balloon_page(newpage))) {
-		/* drop our reference, page already in the balloon */
-		put_page(newpage);
-	} else
-		putback_lru_page(newpage);
-
 	if (result) {
 		if (rc)
 			*result = rc;
-- 
1.9.1

* [PATCH v7 02/12] mm: migrate: support non-lru movable page migration
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 01/12] mm: use put_page to free page instead of putback_lru_page Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 03/12] mm: balloon: use general non-lru movable page feature Minchan Kim
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, Rik van Riel, Joonsoo Kim,
	Mel Gorman, Hugh Dickins, Rafael Aquini, virtualization,
	Jonathan Corbet, John Einar Reitan, dri-devel,
	Sergey Senozhatsky, Gioh Kim, Vlastimil Babka

Until now we have allowed migration for only LRU pages, and that was enough
to make high-order pages. But recently, embedded systems (e.g., webOS,
Android) use lots of non-movable pages (e.g., zram, GPU memory), so we have
seen several reports about trouble with small high-order allocations. To
fix the problem, there have been several efforts (e.g., enhancing the
compaction algorithm, SLUB fallback to 0-order pages, reserved memory,
vmalloc and so on), but if there are lots of non-movable pages in the
system, those solutions are void in the long run.

So, this patch adds a facility to turn non-movable pages into movable
ones. For the feature, it introduces migration-related functions in
address_space_operations as well as some page flags.

If a driver wants to make its own pages movable, it should define three
functions, which are function pointers of struct address_space_operations.

1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);

What the VM expects from the driver's isolate_page function is to return
*true* if the driver isolates the page successfully. On returning true, the
VM marks the page as PG_isolated so that concurrent isolation on several
CPUs skips the page. If the driver cannot isolate the page, it should
return *false*.

Once the page is successfully isolated, the VM uses the page.lru fields, so
the driver shouldn't expect those fields to be preserved.

2. int (*migratepage) (struct address_space *mapping,
		struct page *newpage, struct page *oldpage, enum migrate_mode);

After isolation, the VM calls the driver's migratepage with the isolated
page. The job of migratepage is to move the content of the old page to the
new page and to set up the struct page fields of newpage. Keep in mind that
you should indicate to the VM that the old page is no longer movable via
__ClearPageMovable() under page_lock if you migrated the old page
successfully, and return 0. If the driver cannot migrate the page at the
moment, it can return -EAGAIN. On -EAGAIN, the VM will retry page migration
after a short time because the VM interprets -EAGAIN as "temporary
migration failure". On returning any error other than -EAGAIN, the VM gives
up the page migration this time without retrying.

The driver shouldn't touch the page.lru field, which the VM is using, in
these functions.

3. void (*putback_page)(struct page *);

If migration fails on an isolated page, the VM should return the isolated
page to the driver, so the VM calls the driver's putback_page with the
migration-failed page. In this function, the driver should put the isolated
page back into its own data structure.

4. non-lru movable page flags

There are two page flags for supporting non-lru movable pages.

* PG_movable

The driver should use the function below to make a page movable under page_lock.

	void __SetPageMovable(struct page *page, struct address_space *mapping)

It takes an address_space argument for registering the migration family of
functions that will be called by the VM. Strictly speaking, PG_movable is
not a real flag of struct page. Rather, the VM reuses the lower bits of
page->mapping to represent it.

	#define PAGE_MAPPING_MOVABLE 0x2
	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;

so the driver shouldn't access page->mapping directly. Instead, the driver
should use page_mapping, which masks off the low two bits of page->mapping,
so it can get the right struct address_space.

For testing whether a page is a non-lru movable page, the VM provides the
__PageMovable function. However, it doesn't guarantee identifying a non-lru
movable page because the page->mapping field is unified with other
variables in struct page. Also, if the driver releases the page after
isolation by the VM, page->mapping doesn't have a stable value although it
has PAGE_MAPPING_MOVABLE set (look at __ClearPageMovable). But __PageMovable
is a cheap way to tell whether a page is LRU or non-lru movable once the
page has been isolated, because LRU pages can never have
PAGE_MAPPING_MOVABLE in page->mapping. It is also good for just peeking to
test for non-lru movable pages before the more expensive check with
lock_page during pfn scanning to select a victim.

To guarantee that a page is a non-lru movable page, the VM provides the
PageMovable function. Unlike __PageMovable, PageMovable validates
page->mapping and mapping->a_ops->isolate_page under lock_page. The
lock_page prevents sudden destruction of page->mapping.

A driver using __SetPageMovable should clear the flag via __ClearPageMovable
under page_lock before releasing the page.

* PG_isolated

To prevent concurrent isolation among several CPUs, the VM marks an
isolated page as PG_isolated under lock_page. So if a CPU encounters a
PG_isolated non-lru movable page, it can skip it. The driver doesn't need
to manipulate the flag because the VM will set/clear it automatically. Keep
in mind that if the driver sees a PG_isolated page, it means the page has
been isolated by the VM, so it shouldn't touch the page.lru field.
PG_isolated is aliased with the PG_reclaim flag, so the driver shouldn't
use the flag for its own purposes.
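
Putting the rules above together, the lifetime of a driver page looks
roughly like the sketch below (hedged pseudocode against this patch's API;
foo_mapping is a hypothetical address_space whose a_ops provide the
isolate_page/migratepage/putback_page callbacks):

	/* when the driver allocates a page it wants the VM to migrate */
	page = alloc_page(GFP_HIGHUSER_MOVABLE);
	lock_page(page);
	__SetPageMovable(page, foo_mapping);
	unlock_page(page);

	/* ... the VM may isolate and migrate the page at any time ... */

	/* before the driver releases the page */
	lock_page(page);
	__ClearPageMovable(page);
	unlock_page(page);
	put_page(page);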

Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Rafael Aquini <aquini@redhat.com>
Cc: virtualization@lists.linux-foundation.org
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: John Einar Reitan <john.reitan@foss.arm.com>
Cc: dri-devel@lists.freedesktop.org
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/filesystems/Locking |   4 +
 Documentation/filesystems/vfs.txt |  11 +++
 Documentation/vm/page_migration   | 107 ++++++++++++++++++++-
 include/linux/compaction.h        |  17 ++++
 include/linux/fs.h                |   2 +
 include/linux/ksm.h               |   3 +-
 include/linux/migrate.h           |   2 +
 include/linux/mm.h                |   1 +
 include/linux/page-flags.h        |  33 +++++--
 mm/compaction.c                   |  85 +++++++++++++----
 mm/ksm.c                          |   4 +-
 mm/migrate.c                      | 192 ++++++++++++++++++++++++++++++++++----
 mm/page_alloc.c                   |   2 +-
 mm/util.c                         |   6 +-
 14 files changed, 417 insertions(+), 52 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index af7c030a0368..3991a976cf43 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -195,7 +195,9 @@ unlocks and drops the reference.
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	int (*migratepage)(struct address_space *, struct page *, struct page *);
+	void (*putback_page) (struct page *);
 	int (*launder_page)(struct page *);
 	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
 	int (*error_remove_page)(struct address_space *, struct page *);
@@ -219,7 +221,9 @@ invalidatepage:		yes
 releasepage:		yes
 freepage:		yes
 direct_IO:
+isolate_page:		yes
 migratepage:		yes (both)
+putback_page:		yes
 launder_page:		yes
 is_partially_uptodate:	yes
 error_remove_page:	yes
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 19366fef2652..9d4ae317fdcb 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -591,9 +591,14 @@ struct address_space_operations {
 	int (*releasepage) (struct page *, int);
 	void (*freepage)(struct page *);
 	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
+	/* isolate a page for migration */
+	bool (*isolate_page) (struct page *, isolate_mode_t);
 	/* migrate the contents of a page to the specified target */
 	int (*migratepage) (struct page *, struct page *);
+	/* put migration-failed page back to right list */
+	void (*putback_page) (struct page *);
 	int (*launder_page) (struct page *);
+
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
 	void (*is_dirty_writeback) (struct page *, bool *, bool *);
@@ -739,6 +744,10 @@ struct address_space_operations {
         and transfer data directly between the storage and the
         application's address space.
 
+  isolate_page: Called by the VM when isolating a movable non-lru page.
+	If page is successfully isolated, VM marks the page as PG_isolated
+	via __SetPageIsolated.
+
   migrate_page:  This is used to compact the physical memory usage.
         If the VM wants to relocate a page (maybe off a memory card
         that is signalling imminent failure) it will pass a new page
@@ -746,6 +755,8 @@ struct address_space_operations {
 	transfer any private data across and update any references
         that it has to the page.
 
+  putback_page: Called by the VM when isolated page's migration fails.
+
   launder_page: Called before freeing a page - it writes back the dirty page. To
   	prevent redirtying the page, it is kept locked during the whole
 	operation.
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
index fea5c0864170..18d37c7ac50b 100644
--- a/Documentation/vm/page_migration
+++ b/Documentation/vm/page_migration
@@ -142,5 +142,110 @@ is increased so that the page cannot be freed while page migration occurs.
 20. The new page is moved to the LRU and can be scanned by the swapper
     etc again.
 
-Christoph Lameter, May 8, 2006.
+C. Non-LRU page migration
+-------------------------
+
+Although original page migration aimed at reducing the latency of memory access
+for NUMA, compaction, which wants to create high-order pages, is also a main customer.
+
+Current problem of the implementation is that it is designed to migrate only
+*LRU* pages. However, there are potential non-lru pages which can be migrated
+in drivers, for example, zsmalloc, virtio-balloon pages.
+
+For virtio-balloon pages, some parts of migration code path have been hooked
+up and added virtio-balloon specific functions to intercept migration logics.
+It's too specific to a driver so other drivers who want to make their pages
+movable would have to add own specific hooks in migration path.
+
+To overcome the problem, VM supports non-LRU page migration which provides
+generic functions for non-LRU movable pages without driver-specific hooks
+in the migration path.
+
+If a driver want to make own pages movable, it should define three functions
+which are function pointers of struct address_space_operations.
+
+1. bool (*isolate_page) (struct page *page, isolate_mode_t mode);
+
+What VM expects on isolate_page function of driver is to return *true*
+if driver isolates page successfully. On returning true, VM marks the page
+as PG_isolated so concurrent isolation in several CPUs skip the page
+for isolation. If a driver cannot isolate the page, it should return *false*.
+
+Once page is successfully isolated, VM uses page.lru fields so driver
+shouldn't expect to preserve values in those fields.
+
+2. int (*migratepage) (struct address_space *mapping,
+		struct page *newpage, struct page *oldpage, enum migrate_mode);
+
+After isolation, VM calls migratepage of driver with isolated page.
+The function of migratepage is to move content of the old page to new page
+and set up fields of struct page newpage. Keep in mind that you should
+indicate to the VM the oldpage is no longer movable via __ClearPageMovable()
+under page_lock if you migrated the oldpage successfully and return 0.
+If driver cannot migrate the page at the moment, driver can return -EAGAIN.
+On -EAGAIN, VM will retry page migration in a short time because VM interprets
+-EAGAIN as "temporal migration failure". On returning any error except -EAGAIN,
+VM will give up the page migration without retrying in this time.
+
+Driver shouldn't touch the page.lru field, which VM is using, in these functions.
+
+3. void (*putback_page)(struct page *);
+
+If migration fails on isolated page, VM should return the isolated page
+to the driver so VM calls driver's putback_page with migration failed page.
+In this function, driver should put the isolated page back to the own data
+structure.
 
+4. non-lru movable page flags
+
+There are two page flags for supporting non-lru movable page.
+
+* PG_movable
+
+Driver should use the below function to make page movable under page_lock.
+
+	void __SetPageMovable(struct page *page, struct address_space *mapping)
+
+It needs argument of address_space for registering migration family functions
+which will be called by VM. Exactly speaking, PG_movable is not a real flag of
+struct page. Rather, VM reuses page->mapping's lower bits to represent it.
+
+	#define PAGE_MAPPING_MOVABLE 0x2
+	page->mapping = page->mapping | PAGE_MAPPING_MOVABLE;
+
+so driver shouldn't access page->mapping directly. Instead, driver should
+use page_mapping which mask off the low two bits of page->mapping under
+page lock so it can get right struct address_space.
+
+For testing of non-lru movable page, VM supports __PageMovable function.
+However, it doesn't guarantee to identify non-lru movable page because
+page->mapping field is unified with other variables in struct page.
+As well, if driver releases the page after isolation by VM, page->mapping
+doesn't have stable value although it has PAGE_MAPPING_MOVABLE
+(Look at __ClearPageMovable). But __PageMovable is cheap to catch whether
+page is LRU or non-lru movable once the page has been isolated. Because
+LRU pages never can have PAGE_MAPPING_MOVABLE in page->mapping. It is also
+good for just peeking to test non-lru movable pages before more expensive
+checking with lock_page in pfn scanning to select victim.
+
+For guaranteeing non-lru movable page, VM provides PageMovable function.
+Unlike __PageMovable, PageMovable functions validates page->mapping and
+mapping->a_ops->isolate_page under lock_page. The lock_page prevents sudden
+destroying of page->mapping.
+
+Driver using __SetPageMovable should clear the flag via __ClearPageMovable
+under page_lock before releasing the page.
+
+* PG_isolated
+
+To prevent concurrent isolation among several CPUs, VM marks isolated page
+as PG_isolated under lock_page. So if a CPU encounters PG_isolated non-lru
+movable page, it can skip it. Driver doesn't need to manipulate the flag
+because VM will set/clear it automatically. Keep in mind that if driver
+sees PG_isolated page, it means the page have been isolated by VM so it
+shouldn't touch page.lru field.
+PG_isolated is alias with PG_reclaim flag so driver shouldn't use the flag
+for own purpose.
+
+Christoph Lameter, May 8, 2006.
+Minchan Kim, Mar 28, 2016.
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a58c852a268f..c6b47c861cea 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -54,6 +54,9 @@ enum compact_result {
 struct alloc_context; /* in mm/internal.h */
 
 #ifdef CONFIG_COMPACTION
+extern int PageMovable(struct page *page);
+extern void __SetPageMovable(struct page *page, struct address_space *mapping);
+extern void __ClearPageMovable(struct page *page);
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
 			void __user *buffer, size_t *length, loff_t *ppos);
@@ -151,6 +154,19 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
+static inline int PageMovable(struct page *page)
+{
+	return 0;
+}
+static inline void __SetPageMovable(struct page *page,
+			struct address_space *mapping)
+{
+}
+
+static inline void __ClearPageMovable(struct page *page)
+{
+}
+
 static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order, int alloc_flags,
 			const struct alloc_context *ac,
@@ -212,6 +228,7 @@ static inline void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_i
 #endif /* CONFIG_COMPACTION */
 
 #if defined(CONFIG_COMPACTION) && defined(CONFIG_SYSFS) && defined(CONFIG_NUMA)
+struct node;
 extern int compaction_register_node(struct node *node);
 extern void compaction_unregister_node(struct node *node);
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0cfdf2aec8f7..39ef97414033 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -402,6 +402,8 @@ struct address_space_operations {
 	 */
 	int (*migratepage) (struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
+	bool (*isolate_page)(struct page *, isolate_mode_t);
+	void (*putback_page)(struct page *);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, unsigned long,
 					unsigned long);
diff --git a/include/linux/ksm.h b/include/linux/ksm.h
index 7ae216a39c9e..481c8c4627ca 100644
--- a/include/linux/ksm.h
+++ b/include/linux/ksm.h
@@ -43,8 +43,7 @@ static inline struct stable_node *page_stable_node(struct page *page)
 static inline void set_page_stable_node(struct page *page,
 					struct stable_node *stable_node)
 {
-	page->mapping = (void *)stable_node +
-				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+	page->mapping = (void *)((unsigned long)stable_node | PAGE_MAPPING_KSM);
 }
 
 /*
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 9b50325e4ddf..404fbfefeb33 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *, enum migrate_mode);
 extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
 		unsigned long private, enum migrate_mode mode, int reason);
+extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
+extern void putback_movable_page(struct page *page);
 
 extern int migrate_prep(void);
 extern int migrate_prep_local(void);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a00ec816233a..33eaec57e997 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1035,6 +1035,7 @@ static inline pgoff_t page_file_index(struct page *page)
 }
 
 bool page_mapped(struct page *page);
+struct address_space *page_mapping(struct page *page);
 
 /*
  * Return true only if the page has been allocated with
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index e5a32445f930..f36dbb3a3060 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -129,6 +129,9 @@ enum pageflags {
 
 	/* Compound pages. Stored in first tail page's flags */
 	PG_double_map = PG_private_2,
+
+	/* non-lru isolated movable page */
+	PG_isolated = PG_reclaim,
 };
 
 #ifndef __GENERATING_BOUNDS_H
@@ -357,29 +360,37 @@ PAGEFLAG(Idle, idle, PF_ANY)
  * with the PAGE_MAPPING_ANON bit set to distinguish it.  See rmap.h.
  *
  * On an anonymous page in a VM_MERGEABLE area, if CONFIG_KSM is enabled,
- * the PAGE_MAPPING_KSM bit may be set along with the PAGE_MAPPING_ANON bit;
- * and then page->mapping points, not to an anon_vma, but to a private
+ * the PAGE_MAPPING_MOVABLE bit may be set along with the PAGE_MAPPING_ANON
+ * bit; and then page->mapping points, not to an anon_vma, but to a private
  * structure which KSM associates with that merged page.  See ksm.h.
  *
- * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is currently never used.
+ * PAGE_MAPPING_KSM without PAGE_MAPPING_ANON is used for non-lru movable
+ * page and then page->mapping points a struct address_space.
  *
  * Please note that, confusingly, "page_mapping" refers to the inode
  * address_space which maps the page from disk; whereas "page_mapped"
  * refers to user virtual address space into which the page is mapped.
  */
-#define PAGE_MAPPING_ANON	1
-#define PAGE_MAPPING_KSM	2
-#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
+#define PAGE_MAPPING_ANON	0x1
+#define PAGE_MAPPING_MOVABLE	0x2
+#define PAGE_MAPPING_KSM	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
+#define PAGE_MAPPING_FLAGS	(PAGE_MAPPING_ANON | PAGE_MAPPING_MOVABLE)
 
-static __always_inline int PageAnonHead(struct page *page)
+static __always_inline int PageMappingFlags(struct page *page)
 {
-	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
+	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) != 0;
 }
 
 static __always_inline int PageAnon(struct page *page)
 {
 	page = compound_head(page);
-	return PageAnonHead(page);
+	return ((unsigned long)page->mapping & PAGE_MAPPING_ANON) != 0;
+}
+
+static __always_inline int __PageMovable(struct page *page)
+{
+	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
+				PAGE_MAPPING_MOVABLE;
 }
 
 #ifdef CONFIG_KSM
@@ -393,7 +404,7 @@ static __always_inline int PageKsm(struct page *page)
 {
 	page = compound_head(page);
 	return ((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) ==
-				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+				PAGE_MAPPING_KSM;
 }
 #else
 TESTPAGEFLAG_FALSE(Ksm)
@@ -641,6 +652,8 @@ static inline void __ClearPageBalloon(struct page *page)
 	atomic_set(&page->_mapcount, -1);
 }
 
+__PAGEFLAG(Isolated, isolated, PF_ANY);
+
 /*
  * If network-based swap is enabled, sl*b must keep track of whether pages
  * were allocated from pfmemalloc reserves.
diff --git a/mm/compaction.c b/mm/compaction.c
index 1427366ad673..a680b52e190b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -81,6 +81,44 @@ static inline bool migrate_async_suitable(int migratetype)
 
 #ifdef CONFIG_COMPACTION
 
+int PageMovable(struct page *page)
+{
+	struct address_space *mapping;
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	if (!__PageMovable(page))
+		return 0;
+
+	mapping = page_mapping(page);
+	if (mapping && mapping->a_ops && mapping->a_ops->isolate_page)
+		return 1;
+
+	return 0;
+}
+EXPORT_SYMBOL(PageMovable);
+
+void __SetPageMovable(struct page *page, struct address_space *mapping)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE((unsigned long)mapping & PAGE_MAPPING_MOVABLE, page);
+	page->mapping = (void *)((unsigned long)mapping | PAGE_MAPPING_MOVABLE);
+}
+EXPORT_SYMBOL(__SetPageMovable);
+
+void __ClearPageMovable(struct page *page)
+{
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	/*
+	 * Clear registered address_space val with keeping PAGE_MAPPING_MOVABLE
+	 * flag so that VM can catch up released page by driver after isolation.
+	 * With it, VM migration doesn't try to put it back.
+	 */
+	page->mapping = (void *)((unsigned long)page->mapping &
+				PAGE_MAPPING_MOVABLE);
+}
+EXPORT_SYMBOL(__ClearPageMovable);
+
 /* Do not skip compaction more than 64 times */
 #define COMPACT_MAX_DEFER_SHIFT 6
 
@@ -735,21 +773,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		}
 
 		/*
-		 * Check may be lockless but that's ok as we recheck later.
-		 * It's possible to migrate LRU pages and balloon pages
-		 * Skip any other type of page
-		 */
-		is_lru = PageLRU(page);
-		if (!is_lru) {
-			if (unlikely(balloon_page_movable(page))) {
-				if (balloon_page_isolate(page)) {
-					/* Successfully isolated */
-					goto isolate_success;
-				}
-			}
-		}
-
-		/*
 		 * Regardless of being on LRU, compound pages such as THP and
 		 * hugetlbfs are not to be compacted. We can potentially save
 		 * a lot of iterations if we skip them at once. The check is
@@ -765,8 +788,38 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 			goto isolate_fail;
 		}
 
-		if (!is_lru)
+		/*
+		 * Check may be lockless but that's ok as we recheck later.
+		 * It's possible to migrate LRU and non-lru movable pages.
+		 * Skip any other type of page
+		 */
+		is_lru = PageLRU(page);
+		if (!is_lru) {
+			if (unlikely(balloon_page_movable(page))) {
+				if (balloon_page_isolate(page)) {
+					/* Successfully isolated */
+					goto isolate_success;
+				}
+			}
+
+			/*
+			 * __PageMovable can return false positive so we need
+			 * to verify it under page_lock.
+			 */
+			if (unlikely(__PageMovable(page)) &&
+					!PageIsolated(page)) {
+				if (locked) {
+					spin_unlock_irqrestore(&zone->lru_lock,
+									flags);
+					locked = false;
+				}
+
+				if (isolate_movable_page(page, isolate_mode))
+					goto isolate_success;
+			}
+
 			goto isolate_fail;
+		}
 
 		/*
 		 * Migration will fail if an anonymous page is pinned in memory,
diff --git a/mm/ksm.c b/mm/ksm.c
index 4786b4150f62..35b8aef867a9 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -532,8 +532,8 @@ static struct page *get_ksm_page(struct stable_node *stable_node, bool lock_it)
 	void *expected_mapping;
 	unsigned long kpfn;
 
-	expected_mapping = (void *)stable_node +
-				(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
+	expected_mapping = (void *)((unsigned long)stable_node |
+					PAGE_MAPPING_KSM);
 again:
 	kpfn = READ_ONCE(stable_node->kpfn);
 	page = pfn_to_page(kpfn);
diff --git a/mm/migrate.c b/mm/migrate.c
index 2666f28b5236..60abcf379b51 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -31,6 +31,7 @@
 #include <linux/vmalloc.h>
 #include <linux/security.h>
 #include <linux/backing-dev.h>
+#include <linux/compaction.h>
 #include <linux/syscalls.h>
 #include <linux/hugetlb.h>
 #include <linux/hugetlb_cgroup.h>
@@ -73,6 +74,81 @@ int migrate_prep_local(void)
 	return 0;
 }
 
+bool isolate_movable_page(struct page *page, isolate_mode_t mode)
+{
+	struct address_space *mapping;
+
+	/*
+	 * Avoid burning cycles with pages that are yet under __free_pages(),
+	 * or just got freed under us.
+	 *
+	 * In case we 'win' a race for a movable page being freed under us and
+	 * raise its refcount preventing __free_pages() from doing its job
+	 * the put_page() at the end of this block will take care of
+	 * release this page, thus avoiding a nasty leakage.
+	 */
+	if (unlikely(!get_page_unless_zero(page)))
+		goto out;
+
+	/*
+	 * Check PageMovable before holding a PG_lock because page's owner
+	 * assumes anybody doesn't touch PG_lock of newly allocated page
+	 * so unconditionally grabbing the lock ruins page's owner side.
+	 */
+	if (unlikely(!__PageMovable(page)))
+		goto out_putpage;
+	/*
+	 * As movable pages are not isolated from LRU lists, concurrent
+	 * compaction threads can race against page migration functions
+	 * as well as race against the releasing a page.
+	 *
+	 * In order to avoid having an already isolated movable page
+	 * being (wrongly) re-isolated while it is under migration,
+	 * or to avoid attempting to isolate pages being released,
+	 * lets be sure we have the page lock
+	 * before proceeding with the movable page isolation steps.
+	 */
+	if (unlikely(!trylock_page(page)))
+		goto out_putpage;
+
+	if (!PageMovable(page) || PageIsolated(page))
+		goto out_no_isolated;
+
+	mapping = page_mapping(page);
+	VM_BUG_ON_PAGE(!mapping, page);
+
+	if (!mapping->a_ops->isolate_page(page, mode))
+		goto out_no_isolated;
+
+	/* Driver shouldn't use PG_isolated bit of page->flags */
+	WARN_ON_ONCE(PageIsolated(page));
+	__SetPageIsolated(page);
+	unlock_page(page);
+
+	return true;
+
+out_no_isolated:
+	unlock_page(page);
+out_putpage:
+	put_page(page);
+out:
+	return false;
+}
+
+/* It should be called on page which is PG_movable */
+void putback_movable_page(struct page *page)
+{
+	struct address_space *mapping;
+
+	VM_BUG_ON_PAGE(!PageLocked(page), page);
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+	mapping = page_mapping(page);
+	mapping->a_ops->putback_page(page);
+	__ClearPageIsolated(page);
+}
+
 /*
  * Put previously isolated pages back onto the appropriate lists
  * from where they were once taken off for compaction/migration.
@@ -94,10 +170,25 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page)))
+		if (unlikely(isolated_balloon_page(page))) {
 			balloon_page_putback(page);
-		else
+		/*
+		 * We isolated non-lru movable page so here we can use
+		 * __PageMovable because LRU page's mapping cannot have
+		 * PAGE_MAPPING_MOVABLE.
+		 */
+		} else if (unlikely(__PageMovable(page))) {
+			VM_BUG_ON_PAGE(!PageIsolated(page), page);
+			lock_page(page);
+			if (PageMovable(page))
+				putback_movable_page(page);
+			else
+				__ClearPageIsolated(page);
+			unlock_page(page);
+			put_page(page);
+		} else {
 			putback_lru_page(page);
+		}
 	}
 }
 
@@ -592,7 +683,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
  ***********************************************************/
 
 /*
- * Common logic to directly migrate a single page suitable for
+ * Common logic to directly migrate a single LRU page suitable for
  * pages that do not use PagePrivate/PagePrivate2.
  *
  * Pages are locked upon entry and exit.
@@ -755,33 +846,72 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 				enum migrate_mode mode)
 {
 	struct address_space *mapping;
-	int rc;
+	int rc = -EAGAIN;
+	bool is_lru = !__PageMovable(page);
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
 	mapping = page_mapping(page);
-	if (!mapping)
-		rc = migrate_page(mapping, newpage, page, mode);
-	else if (mapping->a_ops->migratepage)
+
+	if (likely(is_lru)) {
+		if (!mapping)
+			rc = migrate_page(mapping, newpage, page, mode);
+		else if (mapping->a_ops->migratepage)
+			/*
+			 * Most pages have a mapping and most filesystems
+			 * provide a migratepage callback. Anonymous pages
+			 * are part of swap space which also has its own
+			 * migratepage callback. This is the most common path
+			 * for page migration.
+			 */
+			rc = mapping->a_ops->migratepage(mapping, newpage,
+							page, mode);
+		else
+			rc = fallback_migrate_page(mapping, newpage,
+							page, mode);
+	} else {
 		/*
-		 * Most pages have a mapping and most filesystems provide a
-		 * migratepage callback. Anonymous pages are part of swap
-		 * space which also has its own migratepage callback. This
-		 * is the most common path for page migration.
+		 * In case of non-lru page, it could be released after
+		 * isolation step. In that case, we shouldn't try migration.
 		 */
-		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
-	else
-		rc = fallback_migrate_page(mapping, newpage, page, mode);
+		VM_BUG_ON_PAGE(!PageIsolated(page), page);
+		if (!PageMovable(page)) {
+			rc = MIGRATEPAGE_SUCCESS;
+			__ClearPageIsolated(page);
+			goto out;
+		}
+
+		rc = mapping->a_ops->migratepage(mapping, newpage,
+						page, mode);
+		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
+			!PageIsolated(page));
+	}
 
 	/*
 	 * When successful, old pagecache page->mapping must be cleared before
 	 * page is freed; but stats require that PageAnon be left as PageAnon.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (!PageAnon(page))
+		if (__PageMovable(page)) {
+			VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+			/*
+			 * We clear PG_movable under page_lock so any compactor
+			 * cannot try to migrate this page.
+			 */
+			__ClearPageIsolated(page);
+		}
+
+		/*
+		 * Anonymous and movable page->mapping will be cleared by
+		 * free_pages_prepare so don't reset it here for keeping
+		 * the type to work PageAnon, for example.
+		 */
+		if (!PageMappingFlags(page))
 			page->mapping = NULL;
 	}
+out:
 	return rc;
 }
 
@@ -791,6 +921,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	int rc = -EAGAIN;
 	int page_was_mapped = 0;
 	struct anon_vma *anon_vma = NULL;
+	bool is_lru = !__PageMovable(page);
 
 	if (!trylock_page(page)) {
 		if (!force || mode == MIGRATE_ASYNC)
@@ -871,6 +1002,11 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 		goto out_unlock_both;
 	}
 
+	if (unlikely(!is_lru)) {
+		rc = move_to_new_page(newpage, page, mode);
+		goto out_unlock_both;
+	}
+
 	/*
 	 * Corner case handling:
 	 * 1. When a new swap-cache page is read into, it is added to the LRU
@@ -920,7 +1056,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	 * list in here.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(__is_movable_balloon_page(newpage)))
+		if (unlikely(__is_movable_balloon_page(newpage) ||
+				__PageMovable(newpage)))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
@@ -961,6 +1098,12 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		/* page was freed from under us. So we are done. */
 		ClearPageActive(page);
 		ClearPageUnevictable(page);
+		if (unlikely(__PageMovable(page))) {
+			lock_page(page);
+			if (!PageMovable(page))
+				__ClearPageIsolated(page);
+			unlock_page(page);
+		}
 		if (put_new_page)
 			put_new_page(newpage, private);
 		else
@@ -1010,8 +1153,21 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 				num_poisoned_pages_inc();
 		}
 	} else {
-		if (rc != -EAGAIN)
-			putback_lru_page(page);
+		if (rc != -EAGAIN) {
+			if (likely(!__PageMovable(page))) {
+				putback_lru_page(page);
+				goto put_new;
+			}
+
+			lock_page(page);
+			if (PageMovable(page))
+				putback_movable_page(page);
+			else
+				__ClearPageIsolated(page);
+			unlock_page(page);
+			put_page(page);
+		}
+put_new:
 		if (put_new_page)
 			put_new_page(newpage, private);
 		else
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7da8310b86e9..4b3a07ce824d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1014,7 +1014,7 @@ static __always_inline bool free_pages_prepare(struct page *page,
 			(page + i)->flags &= ~PAGE_FLAGS_CHECK_AT_PREP;
 		}
 	}
-	if (PageAnonHead(page))
+	if (PageMappingFlags(page))
 		page->mapping = NULL;
 	if (check_free)
 		bad += free_pages_check(page);
diff --git a/mm/util.c b/mm/util.c
index 917e0e3d0f8e..b756ee36f7f0 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -399,10 +399,12 @@ struct address_space *page_mapping(struct page *page)
 	}
 
 	mapping = page->mapping;
-	if ((unsigned long)mapping & PAGE_MAPPING_FLAGS)
+	if ((unsigned long)mapping & PAGE_MAPPING_ANON)
 		return NULL;
-	return mapping;
+
+	return (void *)((unsigned long)mapping & ~PAGE_MAPPING_FLAGS);
 }
+EXPORT_SYMBOL(page_mapping);
 
 /* Slow path of page_mapcount() for compound pages */
 int __page_mapcount(struct page *page)
-- 
1.9.1

* [PATCH v7 03/12] mm: balloon: use general non-lru movable page feature
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 01/12] mm: use put_page to free page instead of putback_lru_page Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 02/12] mm: migrate: support non-lru movable page migration Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 04/12] zsmalloc: keep max_object in size_class Minchan Kim
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Minchan Kim, virtualization,
	Rafael Aquini, Konstantin Khlebnikov, Gioh Kim, Vlastimil Babka

Now that the VM has a feature to migrate non-lru movable pages, the
balloon doesn't need custom migration hooks in migrate.c and
compaction.c. Instead, this patch implements the
page->mapping->a_ops->{isolate|migrate|putback} functions.

With that, we can remove the ballooning hooks from the general
migration functions and make balloon compaction simple.
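
Condensed from the diff below (error handling omitted in this sketch), the
driver-side wiring amounts to mounting a tiny pseudo filesystem and
pointing the balloon's inode mapping at the new address_space_operations:

	balloon_mnt = kern_mount(&balloon_fs);
	vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;

so that balloon_page_insert() can register each balloon page via
__SetPageMovable(page, balloon->inode->i_mapping).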

Cc: virtualization@lists.linux-foundation.org
Cc: Rafael Aquini <aquini@redhat.com>
Cc: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Gioh Kim <gi-oh.kim@profitbricks.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/virtio/virtio_balloon.c    | 54 +++++++++++++++++++---
 include/linux/balloon_compaction.h | 53 +++++++--------------
 include/uapi/linux/magic.h         |  1 +
 mm/balloon_compaction.c            | 94 +++++++-------------------------------
 mm/compaction.c                    |  7 ---
 mm/migrate.c                       | 19 +-------
 mm/vmscan.c                        |  2 +-
 7 files changed, 85 insertions(+), 145 deletions(-)

diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
index 476c0e3a7150..88d5609375de 100644
--- a/drivers/virtio/virtio_balloon.c
+++ b/drivers/virtio/virtio_balloon.c
@@ -30,6 +30,7 @@
 #include <linux/oom.h>
 #include <linux/wait.h>
 #include <linux/mm.h>
+#include <linux/mount.h>
 
 /*
  * Balloon device works in 4K page units.  So each page is pointed to by
@@ -45,6 +46,10 @@ static int oom_pages = OOM_VBALLOON_DEFAULT_PAGES;
 module_param(oom_pages, int, S_IRUSR | S_IWUSR);
 MODULE_PARM_DESC(oom_pages, "pages to free on OOM");
 
+#ifdef CONFIG_BALLOON_COMPACTION
+static struct vfsmount *balloon_mnt;
+#endif
+
 struct virtio_balloon {
 	struct virtio_device *vdev;
 	struct virtqueue *inflate_vq, *deflate_vq, *stats_vq;
@@ -488,8 +493,26 @@ static int virtballoon_migratepage(struct balloon_dev_info *vb_dev_info,
 
 	put_page(page); /* balloon reference */
 
-	return MIGRATEPAGE_SUCCESS;
+	return 0;
 }
+
+static struct dentry *balloon_mount(struct file_system_type *fs_type,
+		int flags, const char *dev_name, void *data)
+{
+	static const struct dentry_operations ops = {
+		.d_dname = simple_dname,
+	};
+
+	return mount_pseudo(fs_type, "balloon-kvm:", NULL, &ops,
+				BALLOON_KVM_MAGIC);
+}
+
+static struct file_system_type balloon_fs = {
+	.name           = "balloon-kvm",
+	.mount          = balloon_mount,
+	.kill_sb        = kill_anon_super,
+};
+
 #endif /* CONFIG_BALLOON_COMPACTION */
 
 static int virtballoon_probe(struct virtio_device *vdev)
@@ -519,9 +542,6 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->vdev = vdev;
 
 	balloon_devinfo_init(&vb->vb_dev_info);
-#ifdef CONFIG_BALLOON_COMPACTION
-	vb->vb_dev_info.migratepage = virtballoon_migratepage;
-#endif
 
 	err = init_vqs(vb);
 	if (err)
@@ -531,13 +551,33 @@ static int virtballoon_probe(struct virtio_device *vdev)
 	vb->nb.priority = VIRTBALLOON_OOM_NOTIFY_PRIORITY;
 	err = register_oom_notifier(&vb->nb);
 	if (err < 0)
-		goto out_oom_notify;
+		goto out_del_vqs;
+
+#ifdef CONFIG_BALLOON_COMPACTION
+	balloon_mnt = kern_mount(&balloon_fs);
+	if (IS_ERR(balloon_mnt)) {
+		err = PTR_ERR(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		goto out_del_vqs;
+	}
+
+	vb->vb_dev_info.migratepage = virtballoon_migratepage;
+	vb->vb_dev_info.inode = alloc_anon_inode(balloon_mnt->mnt_sb);
+	if (IS_ERR(vb->vb_dev_info.inode)) {
+		err = PTR_ERR(vb->vb_dev_info.inode);
+		kern_unmount(balloon_mnt);
+		unregister_oom_notifier(&vb->nb);
+		vb->vb_dev_info.inode = NULL;
+		goto out_del_vqs;
+	}
+	vb->vb_dev_info.inode->i_mapping->a_ops = &balloon_aops;
+#endif
 
 	virtio_device_ready(vdev);
 
 	return 0;
 
-out_oom_notify:
+out_del_vqs:
 	vdev->config->del_vqs(vdev);
 out_free_vb:
 	kfree(vb);
@@ -571,6 +611,8 @@ static void virtballoon_remove(struct virtio_device *vdev)
 	cancel_work_sync(&vb->update_balloon_stats_work);
 
 	remove_common(vb);
+	if (vb->vb_dev_info.inode)
+		iput(vb->vb_dev_info.inode);
 	kfree(vb);
 }
 
diff --git a/include/linux/balloon_compaction.h b/include/linux/balloon_compaction.h
index 9b0a15d06a4f..c0c430d06a9b 100644
--- a/include/linux/balloon_compaction.h
+++ b/include/linux/balloon_compaction.h
@@ -45,9 +45,10 @@
 #define _LINUX_BALLOON_COMPACTION_H
 #include <linux/pagemap.h>
 #include <linux/page-flags.h>
-#include <linux/migrate.h>
+#include <linux/compaction.h>
 #include <linux/gfp.h>
 #include <linux/err.h>
+#include <linux/fs.h>
 
 /*
  * Balloon device information descriptor.
@@ -62,6 +63,7 @@ struct balloon_dev_info {
 	struct list_head pages;		/* Pages enqueued & handled to Host */
 	int (*migratepage)(struct balloon_dev_info *, struct page *newpage,
 			struct page *page, enum migrate_mode mode);
+	struct inode *inode;
 };
 
 extern struct page *balloon_page_enqueue(struct balloon_dev_info *b_dev_info);
@@ -73,45 +75,19 @@ static inline void balloon_devinfo_init(struct balloon_dev_info *balloon)
 	spin_lock_init(&balloon->pages_lock);
 	INIT_LIST_HEAD(&balloon->pages);
 	balloon->migratepage = NULL;
+	balloon->inode = NULL;
 }
 
 #ifdef CONFIG_BALLOON_COMPACTION
-extern bool balloon_page_isolate(struct page *page);
+extern const struct address_space_operations balloon_aops;
+extern bool balloon_page_isolate(struct page *page,
+				isolate_mode_t mode);
 extern void balloon_page_putback(struct page *page);
-extern int balloon_page_migrate(struct page *newpage,
+extern int balloon_page_migrate(struct address_space *mapping,
+				struct page *newpage,
 				struct page *page, enum migrate_mode mode);
 
 /*
- * __is_movable_balloon_page - helper to perform @page PageBalloon tests
- */
-static inline bool __is_movable_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
- * balloon_page_movable - test PageBalloon to identify balloon pages
- *			  and PagePrivate to check that the page is not
- *			  isolated and can be moved by compaction/migration.
- *
- * As we might return false positives in the case of a balloon page being just
- * released under us, this need to be re-tested later, under the page lock.
- */
-static inline bool balloon_page_movable(struct page *page)
-{
-	return PageBalloon(page) && PagePrivate(page);
-}
-
-/*
- * isolated_balloon_page - identify an isolated balloon page on private
- *			   compaction/migration page lists.
- */
-static inline bool isolated_balloon_page(struct page *page)
-{
-	return PageBalloon(page);
-}
-
-/*
  * balloon_page_insert - insert a page into the balloon's page list and make
  *			 the page->private assignment accordingly.
  * @balloon : pointer to balloon device
@@ -124,7 +100,7 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 				       struct page *page)
 {
 	__SetPageBalloon(page);
-	SetPagePrivate(page);
+	__SetPageMovable(page, balloon->inode->i_mapping);
 	set_page_private(page, (unsigned long)balloon);
 	list_add(&page->lru, &balloon->pages);
 }
@@ -140,11 +116,14 @@ static inline void balloon_page_insert(struct balloon_dev_info *balloon,
 static inline void balloon_page_delete(struct page *page)
 {
 	__ClearPageBalloon(page);
+	__ClearPageMovable(page);
 	set_page_private(page, 0);
-	if (PagePrivate(page)) {
-		ClearPagePrivate(page);
+	/*
+	 * Do not touch the page.lru field once @page has been isolated
+	 * because VM is using the field.
+	 */
+	if (!PageIsolated(page))
 		list_del(&page->lru);
-	}
 }
 
 /*
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index 546b38886e11..d829ce63529d 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -80,5 +80,6 @@
 #define BPF_FS_MAGIC		0xcafe4a11
 /* Since UDF 2.01 is ISO 13346 based... */
 #define UDF_SUPER_MAGIC		0x15013346
+#define BALLOON_KVM_MAGIC	0x13661366
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/balloon_compaction.c b/mm/balloon_compaction.c
index 57b3e9bd6bc5..da91df50ba31 100644
--- a/mm/balloon_compaction.c
+++ b/mm/balloon_compaction.c
@@ -70,7 +70,7 @@ struct page *balloon_page_dequeue(struct balloon_dev_info *b_dev_info)
 		 */
 		if (trylock_page(page)) {
 #ifdef CONFIG_BALLOON_COMPACTION
-			if (!PagePrivate(page)) {
+			if (PageIsolated(page)) {
 				/* raced with isolation */
 				unlock_page(page);
 				continue;
@@ -106,110 +106,50 @@ EXPORT_SYMBOL_GPL(balloon_page_dequeue);
 
 #ifdef CONFIG_BALLOON_COMPACTION
 
-static inline void __isolate_balloon_page(struct page *page)
+bool balloon_page_isolate(struct page *page, isolate_mode_t mode)
+
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	ClearPagePrivate(page);
 	list_del(&page->lru);
 	b_dev_info->isolated_pages++;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
+
+	return true;
 }
 
-static inline void __putback_balloon_page(struct page *page)
+void balloon_page_putback(struct page *page)
 {
 	struct balloon_dev_info *b_dev_info = balloon_page_device(page);
 	unsigned long flags;
 
 	spin_lock_irqsave(&b_dev_info->pages_lock, flags);
-	SetPagePrivate(page);
 	list_add(&page->lru, &b_dev_info->pages);
 	b_dev_info->isolated_pages--;
 	spin_unlock_irqrestore(&b_dev_info->pages_lock, flags);
 }
 
-/* __isolate_lru_page() counterpart for a ballooned page */
-bool balloon_page_isolate(struct page *page)
-{
-	/*
-	 * Avoid burning cycles with pages that are yet under __free_pages(),
-	 * or just got freed under us.
-	 *
-	 * In case we 'win' a race for a balloon page being freed under us and
-	 * raise its refcount preventing __free_pages() from doing its job
-	 * the put_page() at the end of this block will take care of
-	 * release this page, thus avoiding a nasty leakage.
-	 */
-	if (likely(get_page_unless_zero(page))) {
-		/*
-		 * As balloon pages are not isolated from LRU lists, concurrent
-		 * compaction threads can race against page migration functions
-		 * as well as race against the balloon driver releasing a page.
-		 *
-		 * In order to avoid having an already isolated balloon page
-		 * being (wrongly) re-isolated while it is under migration,
-		 * or to avoid attempting to isolate pages being released by
-		 * the balloon driver, lets be sure we have the page lock
-		 * before proceeding with the balloon page isolation steps.
-		 */
-		if (likely(trylock_page(page))) {
-			/*
-			 * A ballooned page, by default, has PagePrivate set.
-			 * Prevent concurrent compaction threads from isolating
-			 * an already isolated balloon page by clearing it.
-			 */
-			if (balloon_page_movable(page)) {
-				__isolate_balloon_page(page);
-				unlock_page(page);
-				return true;
-			}
-			unlock_page(page);
-		}
-		put_page(page);
-	}
-	return false;
-}
-
-/* putback_lru_page() counterpart for a ballooned page */
-void balloon_page_putback(struct page *page)
-{
-	/*
-	 * 'lock_page()' stabilizes the page and prevents races against
-	 * concurrent isolation threads attempting to re-isolate it.
-	 */
-	lock_page(page);
-
-	if (__is_movable_balloon_page(page)) {
-		__putback_balloon_page(page);
-		/* drop the extra ref count taken for page isolation */
-		put_page(page);
-	} else {
-		WARN_ON(1);
-		dump_page(page, "not movable balloon page");
-	}
-	unlock_page(page);
-}
 
 /* move_to_new_page() counterpart for a ballooned page */
-int balloon_page_migrate(struct page *newpage,
-			 struct page *page, enum migrate_mode mode)
+int balloon_page_migrate(struct address_space *mapping,
+		struct page *newpage, struct page *page,
+		enum migrate_mode mode)
 {
 	struct balloon_dev_info *balloon = balloon_page_device(page);
-	int rc = -EAGAIN;
 
 	VM_BUG_ON_PAGE(!PageLocked(page), page);
 	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
 
-	if (WARN_ON(!__is_movable_balloon_page(page))) {
-		dump_page(page, "not movable balloon page");
-		return rc;
-	}
+	return balloon->migratepage(balloon, newpage, page, mode);
+}
 
-	if (balloon && balloon->migratepage)
-		rc = balloon->migratepage(balloon, newpage, page, mode);
+const struct address_space_operations balloon_aops = {
+	.migratepage = balloon_page_migrate,
+	.isolate_page = balloon_page_isolate,
+	.putback_page = balloon_page_putback,
+};
+EXPORT_SYMBOL_GPL(balloon_aops);
 
-	return rc;
-}
 #endif /* CONFIG_BALLOON_COMPACTION */
diff --git a/mm/compaction.c b/mm/compaction.c
index a680b52e190b..b7bfdf94b545 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -795,13 +795,6 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
 		 */
 		is_lru = PageLRU(page);
 		if (!is_lru) {
-			if (unlikely(balloon_page_movable(page))) {
-				if (balloon_page_isolate(page)) {
-					/* Successfully isolated */
-					goto isolate_success;
-				}
-			}
-
 			/*
 			 * __PageMovable can return false positive so we need
 			 * to verify it under page_lock.
diff --git a/mm/migrate.c b/mm/migrate.c
index 60abcf379b51..e6daf49e224f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -170,14 +170,12 @@ void putback_movable_pages(struct list_head *l)
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		if (unlikely(isolated_balloon_page(page))) {
-			balloon_page_putback(page);
 		/*
 		 * We isolated non-lru movable page so here we can use
 		 * __PageMovable because LRU page's mapping cannot have
 		 * PAGE_MAPPING_MOVABLE.
 		 */
-		} else if (unlikely(__PageMovable(page))) {
+		if (unlikely(__PageMovable(page))) {
 			VM_BUG_ON_PAGE(!PageIsolated(page), page);
 			lock_page(page);
 			if (PageMovable(page))
@@ -990,18 +988,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	if (unlikely(!trylock_page(newpage)))
 		goto out_unlock;
 
-	if (unlikely(isolated_balloon_page(page))) {
-		/*
-		 * A ballooned page does not need any special attention from
-		 * physical to virtual reverse mapping procedures.
-		 * Skip any attempt to unmap PTEs or to remap swap cache,
-		 * in order to avoid burning cycles at rmap level, and perform
-		 * the page migration right away (proteced by page lock).
-		 */
-		rc = balloon_page_migrate(newpage, page, mode);
-		goto out_unlock_both;
-	}
-
 	if (unlikely(!is_lru)) {
 		rc = move_to_new_page(newpage, page, mode);
 		goto out_unlock_both;
@@ -1056,8 +1042,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	 * list in here.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		if (unlikely(__is_movable_balloon_page(newpage) ||
-				__PageMovable(newpage)))
+		if (unlikely(__PageMovable(newpage)))
 			put_page(newpage);
 		else
 			putback_lru_page(newpage);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..93ba33789ac6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1254,7 +1254,7 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 
 	list_for_each_entry_safe(page, next, page_list, lru) {
 		if (page_is_file_cache(page) && !PageDirty(page) &&
-		    !isolated_balloon_page(page)) {
+		    !__PageMovable(page)) {
 			ClearPageActive(page);
 			list_move(&page->lru, &clean_pages);
 		}
-- 
1.9.1
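
For orientation, a minimal sketch of what a driver must provide to plug into
the non-lru migration path that the hunks above switch the balloon driver to;
the mydrv_* names are hypothetical and not code from this series:

#include <linux/fs.h>
#include <linux/migrate.h>
#include <linux/mm.h>

static bool mydrv_isolate_page(struct page *page, isolate_mode_t mode)
{
	/* called under the page lock; detach the page from the driver's list */
	return true;
}

static void mydrv_putback_page(struct page *page)
{
	/* migration failed or was aborted; put the page back on the list */
}

static int mydrv_migratepage(struct address_space *mapping,
		struct page *newpage, struct page *page,
		enum migrate_mode mode)
{
	/* copy the driver's data from @page to @newpage, then retire @page */
	return MIGRATEPAGE_SUCCESS;
}

static const struct address_space_operations mydrv_aops = {
	.migratepage	= mydrv_migratepage,
	.isolate_page	= mydrv_isolate_page,
	.putback_page	= mydrv_putback_page,
};

Each page the driver owns is marked with __SetPageMovable(page,
inode->i_mapping) when it is inserted and __ClearPageMovable(page) when it is
removed, exactly as balloon_page_insert() and balloon_page_delete() do above.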


* [PATCH v7 04/12] zsmalloc: keep max_object in size_class
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (2 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 03/12] mm: balloon: use general non-lru movable page feature Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 05/12] zsmalloc: use bit_spin_lock Minchan Kim
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim, Sergey Senozhatsky

Every zspage in a size_class holds the same maximum number of objects, so we
can move that value into the size_class instead of keeping it in each first
page.
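
For illustration (a restatement of the assignment added to zs_create_pool()
below, not new code), the per-class constant that replaces page->objects is:

/*
 * Example: a class of 208-byte objects whose zspages span two pages would
 * store 2 * 4096 / 208 = 39 objects in every one of its zspages.
 */
static int objs_per_zspage(int pages_per_zspage, int class_size)
{
	return pages_per_zspage * PAGE_SIZE / class_size;
}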

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 32 +++++++++++++++-----------------
 1 file changed, 15 insertions(+), 17 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b6d4f258cb53..79295c73dc9f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -32,8 +32,6 @@
  *	page->freelist: points to the first free object in zspage.
  *		Free objects are linked together using in-place
  *		metadata.
- *	page->objects: maximum number of objects we can store in this
- *		zspage (class->zspage_order * PAGE_SIZE / class->size)
  *	page->lru: links together first pages of various zspages.
  *		Basically forming list of zspages in a fullness group.
  *	page->mapping: class index and fullness group of the zspage
@@ -213,6 +211,7 @@ struct size_class {
 	 * of ZS_ALIGN.
 	 */
 	int size;
+	int objs_per_zspage;
 	unsigned int index;
 
 	struct zs_size_stat stats;
@@ -631,21 +630,22 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
  * the pool (not yet implemented). This function returns fullness
  * status of the given page.
  */
-static enum fullness_group get_fullness_group(struct page *first_page)
+static enum fullness_group get_fullness_group(struct size_class *class,
+						struct page *first_page)
 {
-	int inuse, max_objects;
+	int inuse, objs_per_zspage;
 	enum fullness_group fg;
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
 	inuse = first_page->inuse;
-	max_objects = first_page->objects;
+	objs_per_zspage = class->objs_per_zspage;
 
 	if (inuse == 0)
 		fg = ZS_EMPTY;
-	else if (inuse == max_objects)
+	else if (inuse == objs_per_zspage)
 		fg = ZS_FULL;
-	else if (inuse <= 3 * max_objects / fullness_threshold_frac)
+	else if (inuse <= 3 * objs_per_zspage / fullness_threshold_frac)
 		fg = ZS_ALMOST_EMPTY;
 	else
 		fg = ZS_ALMOST_FULL;
@@ -732,7 +732,7 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
 	enum fullness_group currfg, newfg;
 
 	get_zspage_mapping(first_page, &class_idx, &currfg);
-	newfg = get_fullness_group(first_page);
+	newfg = get_fullness_group(class, first_page);
 	if (newfg == currfg)
 		goto out;
 
@@ -1012,9 +1012,6 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 	init_zspage(class, first_page);
 
 	first_page->freelist = location_to_obj(first_page, 0);
-	/* Maximum number of objects we can store in this zspage */
-	first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
-
 	error = 0; /* Success */
 
 cleanup:
@@ -1242,11 +1239,11 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage)
 	return true;
 }
 
-static bool zspage_full(struct page *first_page)
+static bool zspage_full(struct size_class *class, struct page *first_page)
 {
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	return first_page->inuse == first_page->objects;
+	return first_page->inuse == class->objs_per_zspage;
 }
 
 unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1632,7 +1629,7 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
 		}
 
 		/* Stop if there is no more space */
-		if (zspage_full(d_page)) {
+		if (zspage_full(class, d_page)) {
 			unpin_tag(handle);
 			ret = -ENOMEM;
 			break;
@@ -1691,7 +1688,7 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
 {
 	enum fullness_group fullness;
 
-	fullness = get_fullness_group(first_page);
+	fullness = get_fullness_group(class, first_page);
 	insert_zspage(class, fullness, first_page);
 	set_zspage_mapping(first_page, class->index, fullness);
 
@@ -1943,8 +1940,9 @@ struct zs_pool *zs_create_pool(const char *name)
 		class->size = size;
 		class->index = i;
 		class->pages_per_zspage = pages_per_zspage;
-		if (pages_per_zspage == 1 &&
-			get_maxobj_per_zspage(size, pages_per_zspage) == 1)
+		class->objs_per_zspage = class->pages_per_zspage *
+						PAGE_SIZE / class->size;
+		if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
 			class->huge = true;
 		spin_lock_init(&class->lock);
 		pool->size_class[i] = class;
-- 
1.9.1


* [PATCH v7 05/12] zsmalloc: use bit_spin_lock
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (3 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 04/12] zsmalloc: keep max_object in size_class Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 06/12] zsmalloc: use accessor Minchan Kim
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim, Sergey Senozhatsky

Use the kernel's standard bit spin-lock instead of the custom mess. Worse,
the custom lock has a bug: it never disables preemption. The only reason we
have not hit a problem is that it is always taken inside a preemption-disabled
section guarded by the class->lock spinlock, so there is no need to send this
to stable.
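
Roughly (a simplified sketch, not the in-tree bit_spinlock implementation),
the difference being fixed is:

/* what the open-coded pin_tag() amounted to: spin, preemption untouched */
static void pin_tag_open_coded(unsigned long *word)
{
	while (test_and_set_bit_lock(HANDLE_PIN_BIT, word))
		;	/* no preempt_disable() here - the bug noted above */
}

/* what bit_spin_lock()/bit_spin_unlock() provide, in simplified form */
static void pin_tag_fixed(unsigned long *word)
{
	preempt_disable();
	while (test_and_set_bit_lock(HANDLE_PIN_BIT, word))
		cpu_relax();
}

static void unpin_tag_fixed(unsigned long *word)
{
	clear_bit_unlock(HANDLE_PIN_BIT, word);
	preempt_enable();
}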

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 79295c73dc9f..39f29aedd5d6 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -868,21 +868,17 @@ static unsigned long obj_idx_to_offset(struct page *page,
 
 static inline int trypin_tag(unsigned long handle)
 {
-	unsigned long *ptr = (unsigned long *)handle;
-
-	return !test_and_set_bit_lock(HANDLE_PIN_BIT, ptr);
+	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void pin_tag(unsigned long handle)
 {
-	while (!trypin_tag(handle));
+	bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void unpin_tag(unsigned long handle)
 {
-	unsigned long *ptr = (unsigned long *)handle;
-
-	clear_bit_unlock(HANDLE_PIN_BIT, ptr);
+	bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
 }
 
 static void reset_page(struct page *page)
-- 
1.9.1


* [PATCH v7 06/12] zsmalloc: use accessor
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (4 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 05/12] zsmalloc: use bit_spin_lock Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 07/12] zsmalloc: factor page chain functionality out Minchan Kim
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim, Sergey Senozhatsky

An upcoming patch will change how the zspage metadata is encoded, so to make
that change easier to review, this patch wraps the metadata accesses in
accessor functions.
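
For reference, the encoding these accessors hide is unchanged here; restating
the shifts and masks introduced below, the class index and fullness group
share the word stored in first_page->mapping:

/*
 *   bits 31..4 : class index    (CLASS_BITS = 28)
 *   bits  3..0 : fullness group (FULLNESS_BITS = 4)
 */
static unsigned long encode_mapping(unsigned int class_idx,
				    enum fullness_group fullness)
{
	return (class_idx << CLASS_SHIFT) | (fullness << FULLNESS_SHIFT);
}

static void decode_mapping(unsigned long m, unsigned int *class_idx,
			   enum fullness_group *fullness)
{
	*fullness = (m >> FULLNESS_SHIFT) & FULLNESS_MASK;
	*class_idx = (m >> CLASS_SHIFT) & CLASS_MASK;
}

A later patch in this series moves these fields into a separate struct zspage
and shrinks them to CLASS_BITS = 8 and FULLNESS_BITS = 2.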

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 82 +++++++++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 60 insertions(+), 22 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 39f29aedd5d6..5da80961ff3e 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -268,10 +268,14 @@ struct zs_pool {
  * A zspage's class index and fullness group
  * are encoded in its (first)page->mapping
  */
-#define CLASS_IDX_BITS	28
 #define FULLNESS_BITS	4
-#define CLASS_IDX_MASK	((1 << CLASS_IDX_BITS) - 1)
-#define FULLNESS_MASK	((1 << FULLNESS_BITS) - 1)
+#define CLASS_BITS	28
+
+#define FULLNESS_SHIFT	0
+#define CLASS_SHIFT	(FULLNESS_SHIFT + FULLNESS_BITS)
+
+#define FULLNESS_MASK	((1UL << FULLNESS_BITS) - 1)
+#define CLASS_MASK	((1UL << CLASS_BITS) - 1)
 
 struct mapping_area {
 #ifdef CONFIG_PGTABLE_MAPPING
@@ -418,6 +422,41 @@ static int is_last_page(struct page *page)
 	return PagePrivate2(page);
 }
 
+static inline int get_zspage_inuse(struct page *first_page)
+{
+	return first_page->inuse;
+}
+
+static inline void set_zspage_inuse(struct page *first_page, int val)
+{
+	first_page->inuse = val;
+}
+
+static inline void mod_zspage_inuse(struct page *first_page, int val)
+{
+	first_page->inuse += val;
+}
+
+static inline int get_first_obj_offset(struct page *page)
+{
+	return page->index;
+}
+
+static inline void set_first_obj_offset(struct page *page, int offset)
+{
+	page->index = offset;
+}
+
+static inline unsigned long get_freeobj(struct page *first_page)
+{
+	return (unsigned long)first_page->freelist;
+}
+
+static inline void set_freeobj(struct page *first_page, unsigned long obj)
+{
+	first_page->freelist = (void *)obj;
+}
+
 static void get_zspage_mapping(struct page *first_page,
 				unsigned int *class_idx,
 				enum fullness_group *fullness)
@@ -426,8 +465,8 @@ static void get_zspage_mapping(struct page *first_page,
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
 	m = (unsigned long)first_page->mapping;
-	*fullness = m & FULLNESS_MASK;
-	*class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK;
+	*fullness = (m >> FULLNESS_SHIFT) & FULLNESS_MASK;
+	*class_idx = (m >> CLASS_SHIFT) & CLASS_MASK;
 }
 
 static void set_zspage_mapping(struct page *first_page,
@@ -437,8 +476,7 @@ static void set_zspage_mapping(struct page *first_page,
 	unsigned long m;
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) |
-			(fullness & FULLNESS_MASK);
+	m = (class_idx << CLASS_SHIFT) | (fullness << FULLNESS_SHIFT);
 	first_page->mapping = (struct address_space *)m;
 }
 
@@ -638,7 +676,7 @@ static enum fullness_group get_fullness_group(struct size_class *class,
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	inuse = first_page->inuse;
+	inuse = get_zspage_inuse(first_page);
 	objs_per_zspage = class->objs_per_zspage;
 
 	if (inuse == 0)
@@ -684,7 +722,7 @@ static void insert_zspage(struct size_class *class,
 	 * empty/full. Put pages with higher ->inuse first.
 	 */
 	list_add_tail(&first_page->lru, &(*head)->lru);
-	if (first_page->inuse >= (*head)->inuse)
+	if (get_zspage_inuse(first_page) >= get_zspage_inuse(*head))
 		*head = first_page;
 }
 
@@ -861,7 +899,7 @@ static unsigned long obj_idx_to_offset(struct page *page,
 	unsigned long off = 0;
 
 	if (!is_first_page(page))
-		off = page->index;
+		off = get_first_obj_offset(page);
 
 	return off + obj_idx * class_size;
 }
@@ -896,7 +934,7 @@ static void free_zspage(struct page *first_page)
 	struct page *nextp, *tmp, *head_extra;
 
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-	VM_BUG_ON_PAGE(first_page->inuse, first_page);
+	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
 
 	head_extra = (struct page *)page_private(first_page);
 
@@ -937,7 +975,7 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 		 * head of corresponding zspage's freelist.
 		 */
 		if (page != first_page)
-			page->index = off;
+			set_first_obj_offset(page, off);
 
 		vaddr = kmap_atomic(page);
 		link = (struct link_free *)vaddr + off / sizeof(*link);
@@ -992,7 +1030,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 			SetPagePrivate(page);
 			set_page_private(page, 0);
 			first_page = page;
-			first_page->inuse = 0;
+			set_zspage_inuse(first_page, 0);
 		}
 		if (i == 1)
 			set_page_private(first_page, (unsigned long)page);
@@ -1007,7 +1045,7 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 
 	init_zspage(class, first_page);
 
-	first_page->freelist = location_to_obj(first_page, 0);
+	set_freeobj(first_page,	(unsigned long)location_to_obj(first_page, 0));
 	error = 0; /* Success */
 
 cleanup:
@@ -1239,7 +1277,7 @@ static bool zspage_full(struct size_class *class, struct page *first_page)
 {
 	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
 
-	return first_page->inuse == class->objs_per_zspage;
+	return get_zspage_inuse(first_page) == class->objs_per_zspage;
 }
 
 unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1358,13 +1396,13 @@ static unsigned long obj_malloc(struct size_class *class,
 	void *vaddr;
 
 	handle |= OBJ_ALLOCATED_TAG;
-	obj = (unsigned long)first_page->freelist;
+	obj = get_freeobj(first_page);
 	obj_to_location(obj, &m_page, &m_objidx);
 	m_offset = obj_idx_to_offset(m_page, m_objidx, class->size);
 
 	vaddr = kmap_atomic(m_page);
 	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
-	first_page->freelist = link->next;
+	set_freeobj(first_page, (unsigned long)link->next);
 	if (!class->huge)
 		/* record handle in the header of allocated chunk */
 		link->handle = handle;
@@ -1372,7 +1410,7 @@ static unsigned long obj_malloc(struct size_class *class,
 		/* record handle in first_page->private */
 		set_page_private(first_page, handle);
 	kunmap_atomic(vaddr);
-	first_page->inuse++;
+	mod_zspage_inuse(first_page, 1);
 	zs_stat_inc(class, OBJ_USED, 1);
 
 	return obj;
@@ -1452,12 +1490,12 @@ static void obj_free(struct size_class *class, unsigned long obj)
 
 	/* Insert this object in containing zspage's freelist */
 	link = (struct link_free *)(vaddr + f_offset);
-	link->next = first_page->freelist;
+	link->next = (void *)get_freeobj(first_page);
 	if (class->huge)
 		set_page_private(first_page, 0);
 	kunmap_atomic(vaddr);
-	first_page->freelist = (void *)obj;
-	first_page->inuse--;
+	set_freeobj(first_page, obj);
+	mod_zspage_inuse(first_page, -1);
 	zs_stat_dec(class, OBJ_USED, 1);
 }
 
@@ -1573,7 +1611,7 @@ static unsigned long find_alloced_obj(struct size_class *class,
 	void *addr = kmap_atomic(page);
 
 	if (!is_first_page(page))
-		offset = page->index;
+		offset = get_first_obj_offset(page);
 	offset += class->size * index;
 
 	while (offset < PAGE_SIZE) {
-- 
1.9.1


* [PATCH v7 07/12] zsmalloc: factor page chain functionality out
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (5 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 06/12] zsmalloc: use accessor Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 08/12] zsmalloc: introduce zspage structure Minchan Kim
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim, Sergey Senozhatsky

For page migration we need to build the page chain of a zspage dynamically,
so this patch factors that logic out of alloc_zspage.
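
A hypothetical sketch of why this helps (based on the motivation above, not
code from this series): once chain construction takes an array of pages, a
migration path can rebuild a zspage with a replacement page dropped in:

static void rebuild_chain_sketch(struct page *first_page,
				 struct page *oldpage, struct page *newpage)
{
	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
	struct page *page = first_page;
	int idx = 0;

	do {
		pages[idx++] = (page == oldpage) ? newpage : page;
	} while ((page = get_next_page(page)) != NULL);

	create_page_chain(pages, idx);
}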

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 59 +++++++++++++++++++++++++++++++++++------------------------
 1 file changed, 35 insertions(+), 24 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 5da80961ff3e..07485a2e5b96 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -960,7 +960,8 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 	unsigned long off = 0;
 	struct page *page = first_page;
 
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+	first_page->freelist = NULL;
+	set_zspage_inuse(first_page, 0);
 
 	while (page) {
 		struct page *next_page;
@@ -996,15 +997,16 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 		page = next_page;
 		off %= PAGE_SIZE;
 	}
+
+	set_freeobj(first_page, (unsigned long)location_to_obj(first_page, 0));
 }
 
-/*
- * Allocate a zspage for the given size class
- */
-static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
+static void create_page_chain(struct page *pages[], int nr_pages)
 {
-	int i, error;
-	struct page *first_page = NULL, *uninitialized_var(prev_page);
+	int i;
+	struct page *page;
+	struct page *prev_page = NULL;
+	struct page *first_page = NULL;
 
 	/*
 	 * Allocate individual pages and link them together as:
@@ -1017,20 +1019,14 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 	 * (i.e. no other sub-page has this flag set) and PG_private_2 to
 	 * identify the last page.
 	 */
-	error = -ENOMEM;
-	for (i = 0; i < class->pages_per_zspage; i++) {
-		struct page *page;
-
-		page = alloc_page(flags);
-		if (!page)
-			goto cleanup;
+	for (i = 0; i < nr_pages; i++) {
+		page = pages[i];
 
 		INIT_LIST_HEAD(&page->lru);
-		if (i == 0) {	/* first page */
+		if (i == 0) {
 			SetPagePrivate(page);
 			set_page_private(page, 0);
 			first_page = page;
-			set_zspage_inuse(first_page, 0);
 		}
 		if (i == 1)
 			set_page_private(first_page, (unsigned long)page);
@@ -1038,22 +1034,37 @@ static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
 			set_page_private(page, (unsigned long)first_page);
 		if (i >= 2)
 			list_add(&page->lru, &prev_page->lru);
-		if (i == class->pages_per_zspage - 1)	/* last page */
+		if (i == nr_pages - 1)
 			SetPagePrivate2(page);
 		prev_page = page;
 	}
+}
 
-	init_zspage(class, first_page);
+/*
+ * Allocate a zspage for the given size class
+ */
+static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
+{
+	int i;
+	struct page *first_page = NULL;
+	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
 
-	set_freeobj(first_page,	(unsigned long)location_to_obj(first_page, 0));
-	error = 0; /* Success */
+	for (i = 0; i < class->pages_per_zspage; i++) {
+		struct page *page;
 
-cleanup:
-	if (unlikely(error) && first_page) {
-		free_zspage(first_page);
-		first_page = NULL;
+		page = alloc_page(flags);
+		if (!page) {
+			while (--i >= 0)
+				__free_page(pages[i]);
+			return NULL;
+		}
+		pages[i] = page;
 	}
 
+	create_page_chain(pages, class->pages_per_zspage);
+	first_page = pages[0];
+	init_zspage(class, first_page);
+
 	return first_page;
 }
 
-- 
1.9.1


* [PATCH v7 08/12] zsmalloc: introduce zspage structure
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (6 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 07/12] zsmalloc: factor page chain functionality out Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 09/12] zsmalloc: separate free_zspage from putback_zspage Minchan Kim
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim, Sergey Senozhatsky

We have squeezed the zspage metadata into the first page's descriptor, so to
get metadata from a subpage we must look up the first page first. That makes
it hard to implement page migration for zsmalloc, because every place that
derives the first page from a subpage can race with migration of that first
page; in other words, the first page it finds could be stale. I tried several
approaches to prevent this, but they complicated the code, so I finally
concluded to separate the metadata from the first page. Of course, this
consumes more memory: 16 bytes per zspage on 32bit at the moment. That means
we lose about 1% in the *worst case* (40B/4096B), which I think is not a bad
price for maintainability.
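
To make the change concrete, a sketch of the metadata lookup before and after
(not code from the patch); the worst-case figure presumably weighs the roughly
40-byte 64bit struct zspage against a zspage backed by a single 4096-byte
page:

/* before: a subpage chased page_private() to the first page, whose struct
 * page held the metadata - and that first page may itself be migrated */
static struct page *get_first_page_old(struct page *page)
{
	return is_first_page(page) ? page : (struct page *)page_private(page);
}

/* after: every component page points straight at the separately allocated
 * struct zspage, so subpages never need to find the first page at all */
static struct zspage *get_zspage_new(struct page *page)
{
	return (struct zspage *)page_private(page);
}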

Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/compaction.c |   1 -
 mm/zsmalloc.c   | 531 ++++++++++++++++++++++++++------------------------------
 2 files changed, 242 insertions(+), 290 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index b7bfdf94b545..d1d2063b4fd9 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -15,7 +15,6 @@
 #include <linux/backing-dev.h>
 #include <linux/sysctl.h>
 #include <linux/sysfs.h>
-#include <linux/balloon_compaction.h>
 #include <linux/page-isolation.h>
 #include <linux/kasan.h>
 #include <linux/kthread.h>
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 07485a2e5b96..c6d2cbe0f19f 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -16,26 +16,11 @@
  * struct page(s) to form a zspage.
  *
  * Usage of struct page fields:
- *	page->private: points to the first component (0-order) page
- *	page->index (union with page->freelist): offset of the first object
- *		starting in this page. For the first page, this is
- *		always 0, so we use this field (aka freelist) to point
- *		to the first free object in zspage.
- *	page->lru: links together all component pages (except the first page)
- *		of a zspage
- *
- *	For _first_ page only:
- *
- *	page->private: refers to the component page after the first page
- *		If the page is first_page for huge object, it stores handle.
- *		Look at size_class->huge.
- *	page->freelist: points to the first free object in zspage.
- *		Free objects are linked together using in-place
- *		metadata.
- *	page->lru: links together first pages of various zspages.
- *		Basically forming list of zspages in a fullness group.
- *	page->mapping: class index and fullness group of the zspage
- *	page->inuse: the number of objects that are used in this zspage
+ *	page->private: points to zspage
+ *	page->index: offset of the first object starting in this page.
+ *		For the first page, this is always 0, so we use this field
+ *		to store handle for huge object.
+ *	page->next: links together all component pages of a zspage
  *
  * Usage of struct page flags:
  *	PG_private: identifies the first component page
@@ -147,7 +132,7 @@
  *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
  *  (reason above)
  */
-#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> 8)
+#define ZS_SIZE_CLASS_DELTA	(PAGE_SIZE >> CLASS_BITS)
 
 /*
  * We do not maintain any list for completely empty or full pages
@@ -155,8 +140,6 @@
 enum fullness_group {
 	ZS_ALMOST_FULL,
 	ZS_ALMOST_EMPTY,
-	_ZS_NR_FULLNESS_GROUPS,
-
 	ZS_EMPTY,
 	ZS_FULL
 };
@@ -205,7 +188,7 @@ static const int fullness_threshold_frac = 4;
 
 struct size_class {
 	spinlock_t lock;
-	struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS];
+	struct list_head fullness_list[2];
 	/*
 	 * Size of objects stored in this class. Must be multiple
 	 * of ZS_ALIGN.
@@ -224,7 +207,7 @@ struct size_class {
 
 /*
  * Placed within free objects to form a singly linked list.
- * For every zspage, first_page->freelist gives head of this list.
+ * For every zspage, zspage->freeobj gives head of this list.
  *
  * This must be power of 2 and less than or equal to ZS_ALIGN
  */
@@ -247,6 +230,7 @@ struct zs_pool {
 
 	struct size_class **size_class;
 	struct kmem_cache *handle_cachep;
+	struct kmem_cache *zspage_cachep;
 
 	atomic_long_t pages_allocated;
 
@@ -268,14 +252,19 @@ struct zs_pool {
  * A zspage's class index and fullness group
  * are encoded in its (first)page->mapping
  */
-#define FULLNESS_BITS	4
-#define CLASS_BITS	28
+#define FULLNESS_BITS	2
+#define CLASS_BITS	8
 
-#define FULLNESS_SHIFT	0
-#define CLASS_SHIFT	(FULLNESS_SHIFT + FULLNESS_BITS)
-
-#define FULLNESS_MASK	((1UL << FULLNESS_BITS) - 1)
-#define CLASS_MASK	((1UL << CLASS_BITS) - 1)
+struct zspage {
+	struct {
+		unsigned int fullness:FULLNESS_BITS;
+		unsigned int class:CLASS_BITS;
+	};
+	unsigned int inuse;
+	void *freeobj;
+	struct page *first_page;
+	struct list_head list; /* fullness list */
+};
 
 struct mapping_area {
 #ifdef CONFIG_PGTABLE_MAPPING
@@ -287,29 +276,51 @@ struct mapping_area {
 	enum zs_mapmode vm_mm; /* mapping mode */
 };
 
-static int create_handle_cache(struct zs_pool *pool)
+static int create_cache(struct zs_pool *pool)
 {
 	pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
 					0, 0, NULL);
-	return pool->handle_cachep ? 0 : 1;
+	if (!pool->handle_cachep)
+		return 1;
+
+	pool->zspage_cachep = kmem_cache_create("zspage", sizeof(struct zspage),
+					0, 0, NULL);
+	if (!pool->zspage_cachep) {
+		kmem_cache_destroy(pool->handle_cachep);
+		pool->handle_cachep = NULL;
+		return 1;
+	}
+
+	return 0;
 }
 
-static void destroy_handle_cache(struct zs_pool *pool)
+static void destroy_cache(struct zs_pool *pool)
 {
 	kmem_cache_destroy(pool->handle_cachep);
+	kmem_cache_destroy(pool->zspage_cachep);
 }
 
-static unsigned long alloc_handle(struct zs_pool *pool, gfp_t gfp)
+static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
 {
 	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
 			gfp & ~__GFP_HIGHMEM);
 }
 
-static void free_handle(struct zs_pool *pool, unsigned long handle)
+static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
 {
 	kmem_cache_free(pool->handle_cachep, (void *)handle);
 }
 
+static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
+{
+	return kmem_cache_alloc(pool->zspage_cachep, flags & ~__GFP_HIGHMEM);
+};
+
+static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
+{
+	kmem_cache_free(pool->zspage_cachep, zspage);
+}
+
 static void record_obj(unsigned long handle, unsigned long obj)
 {
 	/*
@@ -417,67 +428,61 @@ static int is_first_page(struct page *page)
 	return PagePrivate(page);
 }
 
-static int is_last_page(struct page *page)
-{
-	return PagePrivate2(page);
-}
-
-static inline int get_zspage_inuse(struct page *first_page)
+static inline int get_zspage_inuse(struct zspage *zspage)
 {
-	return first_page->inuse;
+	return zspage->inuse;
 }
 
-static inline void set_zspage_inuse(struct page *first_page, int val)
+static inline void set_zspage_inuse(struct zspage *zspage, int val)
 {
-	first_page->inuse = val;
+	zspage->inuse = val;
 }
 
-static inline void mod_zspage_inuse(struct page *first_page, int val)
+static inline void mod_zspage_inuse(struct zspage *zspage, int val)
 {
-	first_page->inuse += val;
+	zspage->inuse += val;
 }
 
 static inline int get_first_obj_offset(struct page *page)
 {
+	if (is_first_page(page))
+		return 0;
+
 	return page->index;
 }
 
 static inline void set_first_obj_offset(struct page *page, int offset)
 {
+	if (is_first_page(page))
+		return;
+
 	page->index = offset;
 }
 
-static inline unsigned long get_freeobj(struct page *first_page)
+static inline unsigned long get_freeobj(struct zspage *zspage)
 {
-	return (unsigned long)first_page->freelist;
+	return (unsigned long)zspage->freeobj;
 }
 
-static inline void set_freeobj(struct page *first_page, unsigned long obj)
+static inline void set_freeobj(struct zspage *zspage, unsigned long obj)
 {
-	first_page->freelist = (void *)obj;
+	zspage->freeobj = (void *)obj;
 }
 
-static void get_zspage_mapping(struct page *first_page,
+static void get_zspage_mapping(struct zspage *zspage,
 				unsigned int *class_idx,
 				enum fullness_group *fullness)
 {
-	unsigned long m;
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-
-	m = (unsigned long)first_page->mapping;
-	*fullness = (m >> FULLNESS_SHIFT) & FULLNESS_MASK;
-	*class_idx = (m >> CLASS_SHIFT) & CLASS_MASK;
+	*fullness = zspage->fullness;
+	*class_idx = zspage->class;
 }
 
-static void set_zspage_mapping(struct page *first_page,
+static void set_zspage_mapping(struct zspage *zspage,
 				unsigned int class_idx,
 				enum fullness_group fullness)
 {
-	unsigned long m;
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-
-	m = (class_idx << CLASS_SHIFT) | (fullness << FULLNESS_SHIFT);
-	first_page->mapping = (struct address_space *)m;
+	zspage->class = class_idx;
+	zspage->fullness = fullness;
 }
 
 /*
@@ -669,14 +674,12 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
  * status of the given page.
  */
 static enum fullness_group get_fullness_group(struct size_class *class,
-						struct page *first_page)
+						struct zspage *zspage)
 {
 	int inuse, objs_per_zspage;
 	enum fullness_group fg;
 
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-
-	inuse = get_zspage_inuse(first_page);
+	inuse = get_zspage_inuse(zspage);
 	objs_per_zspage = class->objs_per_zspage;
 
 	if (inuse == 0)
@@ -698,32 +701,31 @@ static enum fullness_group get_fullness_group(struct size_class *class,
  * identified by <class, fullness_group>.
  */
 static void insert_zspage(struct size_class *class,
-				enum fullness_group fullness,
-				struct page *first_page)
+				struct zspage *zspage,
+				enum fullness_group fullness)
 {
-	struct page **head;
-
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+	struct zspage *head;
 
-	if (fullness >= _ZS_NR_FULLNESS_GROUPS)
+	if (fullness >= ZS_EMPTY)
 		return;
 
+	head = list_first_entry_or_null(&class->fullness_list[fullness],
+					struct zspage, list);
+
 	zs_stat_inc(class, fullness == ZS_ALMOST_EMPTY ?
 			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
 
-	head = &class->fullness_list[fullness];
-	if (!*head) {
-		*head = first_page;
-		return;
-	}
-
 	/*
-	 * We want to see more ZS_FULL pages and less almost
-	 * empty/full. Put pages with higher ->inuse first.
+	 * We want to see more ZS_FULL pages and less almost empty/full.
+	 * Put pages with higher ->inuse first.
 	 */
-	list_add_tail(&first_page->lru, &(*head)->lru);
-	if (get_zspage_inuse(first_page) >= get_zspage_inuse(*head))
-		*head = first_page;
+	if (head) {
+		if (get_zspage_inuse(zspage) < get_zspage_inuse(head)) {
+			list_add(&zspage->list, &head->list);
+			return;
+		}
+	}
+	list_add(&zspage->list, &class->fullness_list[fullness]);
 }
 
 /*
@@ -731,25 +733,15 @@ static void insert_zspage(struct size_class *class,
  * by <class, fullness_group>.
  */
 static void remove_zspage(struct size_class *class,
-				enum fullness_group fullness,
-				struct page *first_page)
+				struct zspage *zspage,
+				enum fullness_group fullness)
 {
-	struct page **head;
-
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-
-	if (fullness >= _ZS_NR_FULLNESS_GROUPS)
+	if (fullness >= ZS_EMPTY)
 		return;
 
-	head = &class->fullness_list[fullness];
-	VM_BUG_ON_PAGE(!*head, first_page);
-	if (list_empty(&(*head)->lru))
-		*head = NULL;
-	else if (*head == first_page)
-		*head = (struct page *)list_entry((*head)->lru.next,
-					struct page, lru);
+	VM_BUG_ON(list_empty(&class->fullness_list[fullness]));
 
-	list_del_init(&first_page->lru);
+	list_del_init(&zspage->list);
 	zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ?
 			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
 }
@@ -764,19 +756,19 @@ static void remove_zspage(struct size_class *class,
  * fullness group.
  */
 static enum fullness_group fix_fullness_group(struct size_class *class,
-						struct page *first_page)
+						struct zspage *zspage)
 {
 	int class_idx;
 	enum fullness_group currfg, newfg;
 
-	get_zspage_mapping(first_page, &class_idx, &currfg);
-	newfg = get_fullness_group(class, first_page);
+	get_zspage_mapping(zspage, &class_idx, &currfg);
+	newfg = get_fullness_group(class, zspage);
 	if (newfg == currfg)
 		goto out;
 
-	remove_zspage(class, currfg, first_page);
-	insert_zspage(class, newfg, first_page);
-	set_zspage_mapping(first_page, class_idx, newfg);
+	remove_zspage(class, zspage, currfg);
+	insert_zspage(class, zspage, newfg);
+	set_zspage_mapping(zspage, class_idx, newfg);
 
 out:
 	return newfg;
@@ -818,31 +810,15 @@ static int get_pages_per_zspage(int class_size)
 	return max_usedpc_order;
 }
 
-/*
- * A single 'zspage' is composed of many system pages which are
- * linked together using fields in struct page. This function finds
- * the first/head page, given any component page of a zspage.
- */
-static struct page *get_first_page(struct page *page)
+
+static struct zspage *get_zspage(struct page *page)
 {
-	if (is_first_page(page))
-		return page;
-	else
-		return (struct page *)page_private(page);
+	return (struct zspage *)page->private;
 }
 
 static struct page *get_next_page(struct page *page)
 {
-	struct page *next;
-
-	if (is_last_page(page))
-		next = NULL;
-	else if (is_first_page(page))
-		next = (struct page *)page_private(page);
-	else
-		next = list_entry(page->lru.next, struct page, lru);
-
-	return next;
+	return page->next;
 }
 
 /*
@@ -888,7 +864,7 @@ static unsigned long obj_to_head(struct size_class *class, struct page *page,
 {
 	if (class->huge) {
 		VM_BUG_ON_PAGE(!is_first_page(page), page);
-		return page_private(page);
+		return page->index;
 	} else
 		return *(unsigned long *)obj;
 }
@@ -896,10 +872,9 @@ static unsigned long obj_to_head(struct size_class *class, struct page *page,
 static unsigned long obj_idx_to_offset(struct page *page,
 				unsigned long obj_idx, int class_size)
 {
-	unsigned long off = 0;
+	unsigned long off;
 
-	if (!is_first_page(page))
-		off = get_first_obj_offset(page);
+	off = get_first_obj_offset(page);
 
 	return off + obj_idx * class_size;
 }
@@ -924,44 +899,31 @@ static void reset_page(struct page *page)
 	clear_bit(PG_private, &page->flags);
 	clear_bit(PG_private_2, &page->flags);
 	set_page_private(page, 0);
-	page->mapping = NULL;
-	page->freelist = NULL;
-	page_mapcount_reset(page);
+	page->index = 0;
 }
 
-static void free_zspage(struct page *first_page)
+static void free_zspage(struct zs_pool *pool, struct zspage *zspage)
 {
-	struct page *nextp, *tmp, *head_extra;
+	struct page *page, *next;
 
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-	VM_BUG_ON_PAGE(get_zspage_inuse(first_page), first_page);
+	VM_BUG_ON(get_zspage_inuse(zspage));
 
-	head_extra = (struct page *)page_private(first_page);
+	next = page = zspage->first_page;
+	do {
+		next = page->next;
+		reset_page(page);
+		put_page(page);
+		page = next;
+	} while (page != NULL);
 
-	reset_page(first_page);
-	__free_page(first_page);
-
-	/* zspage with only 1 system page */
-	if (!head_extra)
-		return;
-
-	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
-		list_del(&nextp->lru);
-		reset_page(nextp);
-		__free_page(nextp);
-	}
-	reset_page(head_extra);
-	__free_page(head_extra);
+	cache_free_zspage(pool, zspage);
 }
 
 /* Initialize a newly allocated zspage */
-static void init_zspage(struct size_class *class, struct page *first_page)
+static void init_zspage(struct size_class *class, struct zspage *zspage)
 {
 	unsigned long off = 0;
-	struct page *page = first_page;
-
-	first_page->freelist = NULL;
-	set_zspage_inuse(first_page, 0);
+	struct page *page = zspage->first_page;
 
 	while (page) {
 		struct page *next_page;
@@ -969,14 +931,7 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 		unsigned int i = 1;
 		void *vaddr;
 
-		/*
-		 * page->index stores offset of first object starting
-		 * in the page. For the first page, this is always 0,
-		 * so we use first_page->index (aka ->freelist) to store
-		 * head of corresponding zspage's freelist.
-		 */
-		if (page != first_page)
-			set_first_obj_offset(page, off);
+		set_first_obj_offset(page, off);
 
 		vaddr = kmap_atomic(page);
 		link = (struct link_free *)vaddr + off / sizeof(*link);
@@ -998,44 +953,38 @@ static void init_zspage(struct size_class *class, struct page *first_page)
 		off %= PAGE_SIZE;
 	}
 
-	set_freeobj(first_page, (unsigned long)location_to_obj(first_page, 0));
+	set_freeobj(zspage,
+		(unsigned long)location_to_obj(zspage->first_page, 0));
 }
 
-static void create_page_chain(struct page *pages[], int nr_pages)
+static void create_page_chain(struct zspage *zspage, struct page *pages[],
+				int nr_pages)
 {
 	int i;
 	struct page *page;
 	struct page *prev_page = NULL;
-	struct page *first_page = NULL;
 
 	/*
 	 * Allocate individual pages and link them together as:
-	 * 1. first page->private = first sub-page
-	 * 2. all sub-pages are linked together using page->lru
-	 * 3. each sub-page is linked to the first page using page->private
+	 * 1. all pages are linked together using page->next
+	 * 2. each sub-page point to zspage using page->private
 	 *
-	 * For each size class, First/Head pages are linked together using
-	 * page->lru. Also, we set PG_private to identify the first page
-	 * (i.e. no other sub-page has this flag set) and PG_private_2 to
-	 * identify the last page.
+	 * we set PG_private to identify the first page (i.e. no other sub-page
+	 * has this flag set) and PG_private_2 to identify the last page.
 	 */
 	for (i = 0; i < nr_pages; i++) {
 		page = pages[i];
-
-		INIT_LIST_HEAD(&page->lru);
+		set_page_private(page, (unsigned long)zspage);
 		if (i == 0) {
+			zspage->first_page = page;
 			SetPagePrivate(page);
-			set_page_private(page, 0);
-			first_page = page;
+		} else {
+			prev_page->next = page;
 		}
-		if (i == 1)
-			set_page_private(first_page, (unsigned long)page);
-		if (i >= 1)
-			set_page_private(page, (unsigned long)first_page);
-		if (i >= 2)
-			list_add(&page->lru, &prev_page->lru);
-		if (i == nr_pages - 1)
+		if (i == nr_pages - 1) {
 			SetPagePrivate2(page);
+			page->next = NULL;
+		}
 		prev_page = page;
 	}
 }
@@ -1043,43 +992,51 @@ static void create_page_chain(struct page *pages[], int nr_pages)
 /*
  * Allocate a zspage for the given size class
  */
-static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
+static struct zspage *alloc_zspage(struct zs_pool *pool,
+					struct size_class *class,
+					gfp_t gfp)
 {
 	int i;
-	struct page *first_page = NULL;
 	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE];
+	struct zspage *zspage = cache_alloc_zspage(pool, gfp);
+
+	if (!zspage)
+		return NULL;
+
+	memset(zspage, 0, sizeof(struct zspage));
 
 	for (i = 0; i < class->pages_per_zspage; i++) {
 		struct page *page;
 
-		page = alloc_page(flags);
+		page = alloc_page(gfp);
 		if (!page) {
 			while (--i >= 0)
 				__free_page(pages[i]);
+			cache_free_zspage(pool, zspage);
 			return NULL;
 		}
 		pages[i] = page;
 	}
 
-	create_page_chain(pages, class->pages_per_zspage);
-	first_page = pages[0];
-	init_zspage(class, first_page);
+	create_page_chain(zspage, pages, class->pages_per_zspage);
+	init_zspage(class, zspage);
 
-	return first_page;
+	return zspage;
 }
 
-static struct page *find_get_zspage(struct size_class *class)
+static struct zspage *find_get_zspage(struct size_class *class)
 {
 	int i;
-	struct page *page;
+	struct zspage *zspage;
 
-	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
-		page = class->fullness_list[i];
-		if (page)
+	for (i = ZS_ALMOST_FULL; i <= ZS_ALMOST_EMPTY; i++) {
+		zspage = list_first_entry_or_null(&class->fullness_list[i],
+				struct zspage, list);
+		if (zspage)
 			break;
 	}
 
-	return page;
+	return zspage;
 }
 
 #ifdef CONFIG_PGTABLE_MAPPING
@@ -1284,11 +1241,9 @@ static bool can_merge(struct size_class *prev, int size, int pages_per_zspage)
 	return true;
 }
 
-static bool zspage_full(struct size_class *class, struct page *first_page)
+static bool zspage_full(struct size_class *class, struct zspage *zspage)
 {
-	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
-
-	return get_zspage_inuse(first_page) == class->objs_per_zspage;
+	return get_zspage_inuse(zspage) == class->objs_per_zspage;
 }
 
 unsigned long zs_get_total_pages(struct zs_pool *pool)
@@ -1314,6 +1269,7 @@ EXPORT_SYMBOL_GPL(zs_get_total_pages);
 void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 			enum zs_mapmode mm)
 {
+	struct zspage *zspage;
 	struct page *page;
 	unsigned long obj, obj_idx, off;
 
@@ -1336,7 +1292,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 
 	obj = handle_to_obj(handle);
 	obj_to_location(obj, &page, &obj_idx);
-	get_zspage_mapping(get_first_page(page), &class_idx, &fg);
+	zspage = get_zspage(page);
+	get_zspage_mapping(zspage, &class_idx, &fg);
 	class = pool->size_class[class_idx];
 	off = obj_idx_to_offset(page, obj_idx, class->size);
 
@@ -1365,6 +1322,7 @@ EXPORT_SYMBOL_GPL(zs_map_object);
 
 void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 {
+	struct zspage *zspage;
 	struct page *page;
 	unsigned long obj, obj_idx, off;
 
@@ -1375,7 +1333,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 
 	obj = handle_to_obj(handle);
 	obj_to_location(obj, &page, &obj_idx);
-	get_zspage_mapping(get_first_page(page), &class_idx, &fg);
+	zspage = get_zspage(page);
+	get_zspage_mapping(zspage, &class_idx, &fg);
 	class = pool->size_class[class_idx];
 	off = obj_idx_to_offset(page, obj_idx, class->size);
 
@@ -1397,7 +1356,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 EXPORT_SYMBOL_GPL(zs_unmap_object);
 
 static unsigned long obj_malloc(struct size_class *class,
-				struct page *first_page, unsigned long handle)
+				struct zspage *zspage, unsigned long handle)
 {
 	unsigned long obj;
 	struct link_free *link;
@@ -1407,21 +1366,22 @@ static unsigned long obj_malloc(struct size_class *class,
 	void *vaddr;
 
 	handle |= OBJ_ALLOCATED_TAG;
-	obj = get_freeobj(first_page);
+	obj = get_freeobj(zspage);
 	obj_to_location(obj, &m_page, &m_objidx);
 	m_offset = obj_idx_to_offset(m_page, m_objidx, class->size);
 
 	vaddr = kmap_atomic(m_page);
 	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
-	set_freeobj(first_page, (unsigned long)link->next);
+	set_freeobj(zspage, (unsigned long)link->next);
 	if (!class->huge)
 		/* record handle in the header of allocated chunk */
 		link->handle = handle;
 	else
-		/* record handle in first_page->private */
-		set_page_private(first_page, handle);
+		/* record handle to page->index */
+		zspage->first_page->index = handle;
+
 	kunmap_atomic(vaddr);
-	mod_zspage_inuse(first_page, 1);
+	mod_zspage_inuse(zspage, 1);
 	zs_stat_inc(class, OBJ_USED, 1);
 
 	return obj;
@@ -1441,12 +1401,12 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 {
 	unsigned long handle, obj;
 	struct size_class *class;
-	struct page *first_page;
+	struct zspage *zspage;
 
 	if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
 		return 0;
 
-	handle = alloc_handle(pool, gfp);
+	handle = cache_alloc_handle(pool, gfp);
 	if (!handle)
 		return 0;
 
@@ -1455,17 +1415,17 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 	class = pool->size_class[get_size_class_index(size)];
 
 	spin_lock(&class->lock);
-	first_page = find_get_zspage(class);
+	zspage = find_get_zspage(class);
 
-	if (!first_page) {
+	if (!zspage) {
 		spin_unlock(&class->lock);
-		first_page = alloc_zspage(class, gfp);
-		if (unlikely(!first_page)) {
-			free_handle(pool, handle);
+		zspage = alloc_zspage(pool, class, gfp);
+		if (unlikely(!zspage)) {
+			cache_free_handle(pool, handle);
 			return 0;
 		}
 
-		set_zspage_mapping(first_page, class->index, ZS_EMPTY);
+		set_zspage_mapping(zspage, class->index, ZS_EMPTY);
 		atomic_long_add(class->pages_per_zspage,
 					&pool->pages_allocated);
 
@@ -1474,9 +1434,9 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 				class->size, class->pages_per_zspage));
 	}
 
-	obj = obj_malloc(class, first_page, handle);
+	obj = obj_malloc(class, zspage, handle);
 	/* Now move the zspage to another fullness group, if required */
-	fix_fullness_group(class, first_page);
+	fix_fullness_group(class, zspage);
 	record_obj(handle, obj);
 	spin_unlock(&class->lock);
 
@@ -1487,13 +1447,14 @@ EXPORT_SYMBOL_GPL(zs_malloc);
 static void obj_free(struct size_class *class, unsigned long obj)
 {
 	struct link_free *link;
-	struct page *first_page, *f_page;
+	struct zspage *zspage;
+	struct page *f_page;
 	unsigned long f_objidx, f_offset;
 	void *vaddr;
 
 	obj &= ~OBJ_ALLOCATED_TAG;
 	obj_to_location(obj, &f_page, &f_objidx);
-	first_page = get_first_page(f_page);
+	zspage = get_zspage(f_page);
 
 	f_offset = obj_idx_to_offset(f_page, f_objidx, class->size);
 
@@ -1501,18 +1462,17 @@ static void obj_free(struct size_class *class, unsigned long obj)
 
 	/* Insert this object in containing zspage's freelist */
 	link = (struct link_free *)(vaddr + f_offset);
-	link->next = (void *)get_freeobj(first_page);
-	if (class->huge)
-		set_page_private(first_page, 0);
+	link->next = (void *)get_freeobj(zspage);
 	kunmap_atomic(vaddr);
-	set_freeobj(first_page, obj);
-	mod_zspage_inuse(first_page, -1);
+	set_freeobj(zspage, obj);
+	mod_zspage_inuse(zspage, -1);
 	zs_stat_dec(class, OBJ_USED, 1);
 }
 
 void zs_free(struct zs_pool *pool, unsigned long handle)
 {
-	struct page *first_page, *f_page;
+	struct zspage *zspage;
+	struct page *f_page;
 	unsigned long obj, f_objidx;
 	int class_idx;
 	struct size_class *class;
@@ -1524,25 +1484,25 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 	pin_tag(handle);
 	obj = handle_to_obj(handle);
 	obj_to_location(obj, &f_page, &f_objidx);
-	first_page = get_first_page(f_page);
+	zspage = get_zspage(f_page);
 
-	get_zspage_mapping(first_page, &class_idx, &fullness);
+	get_zspage_mapping(zspage, &class_idx, &fullness);
 	class = pool->size_class[class_idx];
 
 	spin_lock(&class->lock);
 	obj_free(class, obj);
-	fullness = fix_fullness_group(class, first_page);
+	fullness = fix_fullness_group(class, zspage);
 	if (fullness == ZS_EMPTY) {
 		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
 				class->size, class->pages_per_zspage));
 		atomic_long_sub(class->pages_per_zspage,
 				&pool->pages_allocated);
-		free_zspage(first_page);
+		free_zspage(pool, zspage);
 	}
 	spin_unlock(&class->lock);
 	unpin_tag(handle);
 
-	free_handle(pool, handle);
+	cache_free_handle(pool, handle);
 }
 EXPORT_SYMBOL_GPL(zs_free);
 
@@ -1621,8 +1581,7 @@ static unsigned long find_alloced_obj(struct size_class *class,
 	unsigned long handle = 0;
 	void *addr = kmap_atomic(page);
 
-	if (!is_first_page(page))
-		offset = get_first_obj_offset(page);
+	offset = get_first_obj_offset(page);
 	offset += class->size * index;
 
 	while (offset < PAGE_SIZE) {
@@ -1643,7 +1602,7 @@ static unsigned long find_alloced_obj(struct size_class *class,
 }
 
 struct zs_compact_control {
-	/* Source page for migration which could be a subpage of zspage. */
+	/* Source spage for migration which could be a subpage of zspage */
 	struct page *s_page;
 	/* Destination page for migration which should be a first page
 	 * of zspage. */
@@ -1674,14 +1633,14 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
 		}
 
 		/* Stop if there is no more space */
-		if (zspage_full(class, d_page)) {
+		if (zspage_full(class, get_zspage(d_page))) {
 			unpin_tag(handle);
 			ret = -ENOMEM;
 			break;
 		}
 
 		used_obj = handle_to_obj(handle);
-		free_obj = obj_malloc(class, d_page, handle);
+		free_obj = obj_malloc(class, get_zspage(d_page), handle);
 		zs_object_copy(class, free_obj, used_obj);
 		index++;
 		/*
@@ -1703,39 +1662,46 @@ static int migrate_zspage(struct zs_pool *pool, struct size_class *class,
 	return ret;
 }
 
-static struct page *isolate_target_page(struct size_class *class)
+static struct zspage *isolate_zspage(struct size_class *class, bool source)
 {
 	int i;
-	struct page *page;
+	struct zspage *zspage;
+	enum fullness_group fg[2] = {ZS_ALMOST_EMPTY, ZS_ALMOST_FULL};
 
-	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
-		page = class->fullness_list[i];
-		if (page) {
-			remove_zspage(class, i, page);
-			break;
+	if (!source) {
+		fg[0] = ZS_ALMOST_FULL;
+		fg[1] = ZS_ALMOST_EMPTY;
+	}
+
+	for (i = 0; i < 2; i++) {
+		zspage = list_first_entry_or_null(&class->fullness_list[fg[i]],
+							struct zspage, list);
+		if (zspage) {
+			remove_zspage(class, zspage, fg[i]);
+			return zspage;
 		}
 	}
 
-	return page;
+	return zspage;
 }
 
 /*
- * putback_zspage - add @first_page into right class's fullness list
+ * putback_zspage - add @zspage into right class's fullness list
  * @pool: target pool
  * @class: destination class
- * @first_page: target page
+ * @zspage: target page
  *
- * Return @fist_page's fullness_group
+ * Return @zspage's fullness_group
  */
 static enum fullness_group putback_zspage(struct zs_pool *pool,
 			struct size_class *class,
-			struct page *first_page)
+			struct zspage *zspage)
 {
 	enum fullness_group fullness;
 
-	fullness = get_fullness_group(class, first_page);
-	insert_zspage(class, fullness, first_page);
-	set_zspage_mapping(first_page, class->index, fullness);
+	fullness = get_fullness_group(class, zspage);
+	insert_zspage(class, zspage, fullness);
+	set_zspage_mapping(zspage, class->index, fullness);
 
 	if (fullness == ZS_EMPTY) {
 		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
@@ -1743,29 +1709,12 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
 		atomic_long_sub(class->pages_per_zspage,
 				&pool->pages_allocated);
 
-		free_zspage(first_page);
+		free_zspage(pool, zspage);
 	}
 
 	return fullness;
 }
 
-static struct page *isolate_source_page(struct size_class *class)
-{
-	int i;
-	struct page *page = NULL;
-
-	for (i = ZS_ALMOST_EMPTY; i >= ZS_ALMOST_FULL; i--) {
-		page = class->fullness_list[i];
-		if (!page)
-			continue;
-
-		remove_zspage(class, i, page);
-		break;
-	}
-
-	return page;
-}
-
 /*
  *
  * Based on the number of unused allocated objects calculate
@@ -1790,20 +1739,20 @@ static unsigned long zs_can_compact(struct size_class *class)
 static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 {
 	struct zs_compact_control cc;
-	struct page *src_page;
-	struct page *dst_page = NULL;
+	struct zspage *src_zspage;
+	struct zspage *dst_zspage = NULL;
 
 	spin_lock(&class->lock);
-	while ((src_page = isolate_source_page(class))) {
+	while ((src_zspage = isolate_zspage(class, true))) {
 
 		if (!zs_can_compact(class))
 			break;
 
 		cc.index = 0;
-		cc.s_page = src_page;
+		cc.s_page = src_zspage->first_page;
 
-		while ((dst_page = isolate_target_page(class))) {
-			cc.d_page = dst_page;
+		while ((dst_zspage = isolate_zspage(class, false))) {
+			cc.d_page = dst_zspage->first_page;
 			/*
 			 * If there is no more space in dst_page, resched
 			 * and see if anyone had allocated another zspage.
@@ -1811,23 +1760,23 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 			if (!migrate_zspage(pool, class, &cc))
 				break;
 
-			putback_zspage(pool, class, dst_page);
+			putback_zspage(pool, class, dst_zspage);
 		}
 
 		/* Stop if we couldn't find slot */
-		if (dst_page == NULL)
+		if (dst_zspage == NULL)
 			break;
 
-		putback_zspage(pool, class, dst_page);
-		if (putback_zspage(pool, class, src_page) == ZS_EMPTY)
+		putback_zspage(pool, class, dst_zspage);
+		if (putback_zspage(pool, class, src_zspage) == ZS_EMPTY)
 			pool->stats.pages_compacted += class->pages_per_zspage;
 		spin_unlock(&class->lock);
 		cond_resched();
 		spin_lock(&class->lock);
 	}
 
-	if (src_page)
-		putback_zspage(pool, class, src_page);
+	if (src_zspage)
+		putback_zspage(pool, class, src_zspage);
 
 	spin_unlock(&class->lock);
 }
@@ -1945,7 +1894,7 @@ struct zs_pool *zs_create_pool(const char *name)
 	if (!pool->name)
 		goto err;
 
-	if (create_handle_cache(pool))
+	if (create_cache(pool))
 		goto err;
 
 	/*
@@ -1956,6 +1905,7 @@ struct zs_pool *zs_create_pool(const char *name)
 		int size;
 		int pages_per_zspage;
 		struct size_class *class;
+		int fullness = 0;
 
 		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
 		if (size > ZS_MAX_ALLOC_SIZE)
@@ -1991,6 +1941,9 @@ struct zs_pool *zs_create_pool(const char *name)
 			class->huge = true;
 		spin_lock_init(&class->lock);
 		pool->size_class[i] = class;
+		for (fullness = ZS_ALMOST_FULL; fullness <= ZS_ALMOST_EMPTY;
+								fullness++)
+			INIT_LIST_HEAD(&class->fullness_list[fullness]);
 
 		prev_class = class;
 	}
@@ -2029,8 +1982,8 @@ void zs_destroy_pool(struct zs_pool *pool)
 		if (class->index != i)
 			continue;
 
-		for (fg = 0; fg < _ZS_NR_FULLNESS_GROUPS; fg++) {
-			if (class->fullness_list[fg]) {
+		for (fg = ZS_ALMOST_FULL; fg <= ZS_ALMOST_EMPTY; fg++) {
+			if (!list_empty(&class->fullness_list[fg])) {
 				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
 					class->size, fg);
 			}
@@ -2038,7 +1991,7 @@ void zs_destroy_pool(struct zs_pool *pool)
 		kfree(class);
 	}
 
-	destroy_handle_cache(pool);
+	destroy_cache(pool);
 	kfree(pool->size_class);
 	kfree(pool->name);
 	kfree(pool);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v7 09/12] zsmalloc: separate free_zspage from putback_zspage
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (7 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 08/12] zsmalloc: introduce zspage structure Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 10/12] zsmalloc: use freeobj for index Minchan Kim
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim, Sergey Senozhatsky

Currently, putback_zspage frees the zspage under class->lock if its
fullness becomes ZS_EMPTY, but that makes it hard to implement the
locking scheme for the new zspage migration. So this patch separates
free_zspage from putback_zspage and frees the zspage out of
class->lock, as preparation for zspage migration.
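
After this change the compaction path does the accounting and calls
free_zspage itself; roughly, it ends up like this (simplified from the
hunk below):

	putback_zspage(class, dst_zspage);
	if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
		/* freeing is now the caller's job, not putback_zspage's */
		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
				class->size, class->pages_per_zspage));
		atomic_long_sub(class->pages_per_zspage,
				&pool->pages_allocated);
		free_zspage(pool, src_zspage);
		pool->stats.pages_compacted += class->pages_per_zspage;
	}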

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index c6d2cbe0f19f..dd3708611f65 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -1687,14 +1687,12 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
 
 /*
  * putback_zspage - add @zspage into right class's fullness list
- * @pool: target pool
  * @class: destination class
  * @zspage: target page
  *
  * Return @zspage's fullness_group
  */
-static enum fullness_group putback_zspage(struct zs_pool *pool,
-			struct size_class *class,
+static enum fullness_group putback_zspage(struct size_class *class,
 			struct zspage *zspage)
 {
 	enum fullness_group fullness;
@@ -1703,15 +1701,6 @@ static enum fullness_group putback_zspage(struct zs_pool *pool,
 	insert_zspage(class, zspage, fullness);
 	set_zspage_mapping(zspage, class->index, fullness);
 
-	if (fullness == ZS_EMPTY) {
-		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
-			class->size, class->pages_per_zspage));
-		atomic_long_sub(class->pages_per_zspage,
-				&pool->pages_allocated);
-
-		free_zspage(pool, zspage);
-	}
-
 	return fullness;
 }
 
@@ -1760,23 +1749,29 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 			if (!migrate_zspage(pool, class, &cc))
 				break;
 
-			putback_zspage(pool, class, dst_zspage);
+			putback_zspage(class, dst_zspage);
 		}
 
 		/* Stop if we couldn't find slot */
 		if (dst_zspage == NULL)
 			break;
 
-		putback_zspage(pool, class, dst_zspage);
-		if (putback_zspage(pool, class, src_zspage) == ZS_EMPTY)
+		putback_zspage(class, dst_zspage);
+		if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
+			zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
+					class->size, class->pages_per_zspage));
+			atomic_long_sub(class->pages_per_zspage,
+					&pool->pages_allocated);
+			free_zspage(pool, src_zspage);
 			pool->stats.pages_compacted += class->pages_per_zspage;
+		}
 		spin_unlock(&class->lock);
 		cond_resched();
 		spin_lock(&class->lock);
 	}
 
 	if (src_zspage)
-		putback_zspage(pool, class, src_zspage);
+		putback_zspage(class, src_zspage);
 
 	spin_unlock(&class->lock);
 }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v7 10/12] zsmalloc: use freeobj for index
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (8 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 09/12] zsmalloc: separate free_zspage from putback_zspage Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-05-31 23:21 ` [PATCH v7 11/12] zsmalloc: page migration support Minchan Kim
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim, Sergey Senozhatsky

Zsmalloc stores the first free object's <PFN, obj_idx> position in
each zspage's freeobj field. If we change it to an object index
counted from first_page instead of a position, page migration becomes
simpler because we don't need to fix up the free-list entries in the
other pages when a page is migrated out.
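
With a plain index, mapping a free object to its component page and
in-page offset is simple arithmetic, roughly as obj_malloc does in the
hunk below (declarations omitted):

	/* obj_idx -> (component page, offset) for class->size sized objects */
	offset = obj_idx * class->size;
	nr_page = offset >> PAGE_SHIFT;		/* which page of the zspage */
	m_offset = offset & ~PAGE_MASK;		/* offset inside that page */

	m_page = get_first_page(zspage);
	for (i = 0; i < nr_page; i++)
		m_page = get_next_page(m_page);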

Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/zsmalloc.c | 139 ++++++++++++++++++++++++++++++----------------------------
 1 file changed, 73 insertions(+), 66 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index dd3708611f65..c6fb543cfb98 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -71,9 +71,7 @@
  * Object location (<PFN>, <obj_idx>) is encoded as
  * as single (unsigned long) handle value.
  *
- * Note that object index <obj_idx> is relative to system
- * page <PFN> it is stored in, so for each sub-page belonging
- * to a zspage, obj_idx starts with 0.
+ * Note that object index <obj_idx> starts from 0.
  *
  * This is made more complicated by various memory models and PAE.
  */
@@ -214,10 +212,10 @@ struct size_class {
 struct link_free {
 	union {
 		/*
-		 * Position of next free chunk (encodes <PFN, obj_idx>)
+		 * Free object index;
 		 * It's valid for non-allocated object
 		 */
-		void *next;
+		unsigned long next;
 		/*
 		 * Handle of allocated object.
 		 */
@@ -261,7 +259,7 @@ struct zspage {
 		unsigned int class:CLASS_BITS;
 	};
 	unsigned int inuse;
-	void *freeobj;
+	unsigned int freeobj;
 	struct page *first_page;
 	struct list_head list; /* fullness list */
 };
@@ -459,14 +457,14 @@ static inline void set_first_obj_offset(struct page *page, int offset)
 	page->index = offset;
 }
 
-static inline unsigned long get_freeobj(struct zspage *zspage)
+static inline unsigned int get_freeobj(struct zspage *zspage)
 {
-	return (unsigned long)zspage->freeobj;
+	return zspage->freeobj;
 }
 
-static inline void set_freeobj(struct zspage *zspage, unsigned long obj)
+static inline void set_freeobj(struct zspage *zspage, unsigned int obj)
 {
-	zspage->freeobj = (void *)obj;
+	zspage->freeobj = obj;
 }
 
 static void get_zspage_mapping(struct zspage *zspage,
@@ -810,6 +808,10 @@ static int get_pages_per_zspage(int class_size)
 	return max_usedpc_order;
 }
 
+static struct page *get_first_page(struct zspage *zspage)
+{
+	return zspage->first_page;
+}
 
 static struct zspage *get_zspage(struct page *page)
 {
@@ -821,37 +823,33 @@ static struct page *get_next_page(struct page *page)
 	return page->next;
 }
 
-/*
- * Encode <page, obj_idx> as a single handle value.
- * We use the least bit of handle for tagging.
+/**
+ * obj_to_location - get (<page>, <obj_idx>) from encoded object value
+ * @page: page object resides in zspage
+ * @obj_idx: object index
  */
-static void *location_to_obj(struct page *page, unsigned long obj_idx)
+static void obj_to_location(unsigned long obj, struct page **page,
+				unsigned int *obj_idx)
 {
-	unsigned long obj;
+	obj >>= OBJ_TAG_BITS;
+	*page = pfn_to_page(obj >> OBJ_INDEX_BITS);
+	*obj_idx = (obj & OBJ_INDEX_MASK);
+}
 
-	if (!page) {
-		VM_BUG_ON(obj_idx);
-		return NULL;
-	}
+/**
+ * location_to_obj - get obj value encoded from (<page>, <obj_idx>)
+ * @page: page object resides in zspage
+ * @obj_idx: object index
+ */
+static unsigned long location_to_obj(struct page *page, unsigned int obj_idx)
+{
+	unsigned long obj;
 
 	obj = page_to_pfn(page) << OBJ_INDEX_BITS;
-	obj |= ((obj_idx) & OBJ_INDEX_MASK);
+	obj |= obj_idx & OBJ_INDEX_MASK;
 	obj <<= OBJ_TAG_BITS;
 
-	return (void *)obj;
-}
-
-/*
- * Decode <page, obj_idx> pair from the given object handle. We adjust the
- * decoded obj_idx back to its original value since it was adjusted in
- * location_to_obj().
- */
-static void obj_to_location(unsigned long obj, struct page **page,
-				unsigned long *obj_idx)
-{
-	obj >>= OBJ_TAG_BITS;
-	*page = pfn_to_page(obj >> OBJ_INDEX_BITS);
-	*obj_idx = (obj & OBJ_INDEX_MASK);
+	return obj;
 }
 
 static unsigned long handle_to_obj(unsigned long handle)
@@ -869,16 +867,6 @@ static unsigned long obj_to_head(struct size_class *class, struct page *page,
 		return *(unsigned long *)obj;
 }
 
-static unsigned long obj_idx_to_offset(struct page *page,
-				unsigned long obj_idx, int class_size)
-{
-	unsigned long off;
-
-	off = get_first_obj_offset(page);
-
-	return off + obj_idx * class_size;
-}
-
 static inline int trypin_tag(unsigned long handle)
 {
 	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
@@ -922,13 +910,13 @@ static void free_zspage(struct zs_pool *pool, struct zspage *zspage)
 /* Initialize a newly allocated zspage */
 static void init_zspage(struct size_class *class, struct zspage *zspage)
 {
+	unsigned int freeobj = 1;
 	unsigned long off = 0;
 	struct page *page = zspage->first_page;
 
 	while (page) {
 		struct page *next_page;
 		struct link_free *link;
-		unsigned int i = 1;
 		void *vaddr;
 
 		set_first_obj_offset(page, off);
@@ -937,7 +925,7 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 		link = (struct link_free *)vaddr + off / sizeof(*link);
 
 		while ((off += class->size) < PAGE_SIZE) {
-			link->next = location_to_obj(page, i++);
+			link->next = freeobj++ << OBJ_ALLOCATED_TAG;
 			link += class->size / sizeof(*link);
 		}
 
@@ -947,14 +935,21 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 		 * page (if present)
 		 */
 		next_page = get_next_page(page);
-		link->next = location_to_obj(next_page, 0);
+		if (next_page) {
+			link->next = freeobj++ << OBJ_ALLOCATED_TAG;
+		} else {
+			/*
+			 * Reset OBJ_ALLOCATED_TAG bit to last link to tell
+			 * whether it's allocated object or not.
+			 */
+			link->next = -1 << OBJ_ALLOCATED_TAG;
+		}
 		kunmap_atomic(vaddr);
 		page = next_page;
 		off %= PAGE_SIZE;
 	}
 
-	set_freeobj(zspage,
-		(unsigned long)location_to_obj(zspage->first_page, 0));
+	set_freeobj(zspage, 0);
 }
 
 static void create_page_chain(struct zspage *zspage, struct page *pages[],
@@ -1271,7 +1266,8 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 {
 	struct zspage *zspage;
 	struct page *page;
-	unsigned long obj, obj_idx, off;
+	unsigned long obj, off;
+	unsigned int obj_idx;
 
 	unsigned int class_idx;
 	enum fullness_group fg;
@@ -1295,7 +1291,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	zspage = get_zspage(page);
 	get_zspage_mapping(zspage, &class_idx, &fg);
 	class = pool->size_class[class_idx];
-	off = obj_idx_to_offset(page, obj_idx, class->size);
+	off = (class->size * obj_idx) & ~PAGE_MASK;
 
 	area = &get_cpu_var(zs_map_area);
 	area->vm_mm = mm;
@@ -1324,7 +1320,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 {
 	struct zspage *zspage;
 	struct page *page;
-	unsigned long obj, obj_idx, off;
+	unsigned long obj, off;
+	unsigned int obj_idx;
 
 	unsigned int class_idx;
 	enum fullness_group fg;
@@ -1336,7 +1333,7 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 	zspage = get_zspage(page);
 	get_zspage_mapping(zspage, &class_idx, &fg);
 	class = pool->size_class[class_idx];
-	off = obj_idx_to_offset(page, obj_idx, class->size);
+	off = (class->size * obj_idx) & ~PAGE_MASK;
 
 	area = this_cpu_ptr(&zs_map_area);
 	if (off + class->size <= PAGE_SIZE)
@@ -1358,21 +1355,28 @@ EXPORT_SYMBOL_GPL(zs_unmap_object);
 static unsigned long obj_malloc(struct size_class *class,
 				struct zspage *zspage, unsigned long handle)
 {
+	int i, nr_page, offset;
 	unsigned long obj;
 	struct link_free *link;
 
 	struct page *m_page;
-	unsigned long m_objidx, m_offset;
+	unsigned long m_offset;
 	void *vaddr;
 
 	handle |= OBJ_ALLOCATED_TAG;
 	obj = get_freeobj(zspage);
-	obj_to_location(obj, &m_page, &m_objidx);
-	m_offset = obj_idx_to_offset(m_page, m_objidx, class->size);
+
+	offset = obj * class->size;
+	nr_page = offset >> PAGE_SHIFT;
+	m_offset = offset & ~PAGE_MASK;
+	m_page = get_first_page(zspage);
+
+	for (i = 0; i < nr_page; i++)
+		m_page = get_next_page(m_page);
 
 	vaddr = kmap_atomic(m_page);
 	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
-	set_freeobj(zspage, (unsigned long)link->next);
+	set_freeobj(zspage, link->next >> OBJ_ALLOCATED_TAG);
 	if (!class->huge)
 		/* record handle in the header of allocated chunk */
 		link->handle = handle;
@@ -1384,6 +1388,8 @@ static unsigned long obj_malloc(struct size_class *class,
 	mod_zspage_inuse(zspage, 1);
 	zs_stat_inc(class, OBJ_USED, 1);
 
+	obj = location_to_obj(m_page, obj);
+
 	return obj;
 }
 
@@ -1449,22 +1455,22 @@ static void obj_free(struct size_class *class, unsigned long obj)
 	struct link_free *link;
 	struct zspage *zspage;
 	struct page *f_page;
-	unsigned long f_objidx, f_offset;
+	unsigned long f_offset;
+	unsigned int f_objidx;
 	void *vaddr;
 
 	obj &= ~OBJ_ALLOCATED_TAG;
 	obj_to_location(obj, &f_page, &f_objidx);
+	f_offset = (class->size * f_objidx) & ~PAGE_MASK;
 	zspage = get_zspage(f_page);
 
-	f_offset = obj_idx_to_offset(f_page, f_objidx, class->size);
-
 	vaddr = kmap_atomic(f_page);
 
 	/* Insert this object in containing zspage's freelist */
 	link = (struct link_free *)(vaddr + f_offset);
-	link->next = (void *)get_freeobj(zspage);
+	link->next = get_freeobj(zspage) << OBJ_ALLOCATED_TAG;
 	kunmap_atomic(vaddr);
-	set_freeobj(zspage, obj);
+	set_freeobj(zspage, f_objidx);
 	mod_zspage_inuse(zspage, -1);
 	zs_stat_dec(class, OBJ_USED, 1);
 }
@@ -1473,7 +1479,8 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 {
 	struct zspage *zspage;
 	struct page *f_page;
-	unsigned long obj, f_objidx;
+	unsigned long obj;
+	unsigned int f_objidx;
 	int class_idx;
 	struct size_class *class;
 	enum fullness_group fullness;
@@ -1510,7 +1517,7 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 				unsigned long src)
 {
 	struct page *s_page, *d_page;
-	unsigned long s_objidx, d_objidx;
+	unsigned int s_objidx, d_objidx;
 	unsigned long s_off, d_off;
 	void *s_addr, *d_addr;
 	int s_size, d_size, size;
@@ -1521,8 +1528,8 @@ static void zs_object_copy(struct size_class *class, unsigned long dst,
 	obj_to_location(src, &s_page, &s_objidx);
 	obj_to_location(dst, &d_page, &d_objidx);
 
-	s_off = obj_idx_to_offset(s_page, s_objidx, class->size);
-	d_off = obj_idx_to_offset(d_page, d_objidx, class->size);
+	s_off = (class->size * s_objidx) & ~PAGE_MASK;
+	d_off = (class->size * d_objidx) & ~PAGE_MASK;
 
 	if (s_off + class->size > PAGE_SIZE)
 		s_size = PAGE_SIZE - s_off;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v7 11/12] zsmalloc: page migration support
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (9 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 10/12] zsmalloc: use freeobj for index Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-06-01 14:09   ` Vlastimil Babka
                     ` (2 more replies)
  2016-05-31 23:21 ` [PATCH v7 12/12] zram: use __GFP_MOVABLE for memory allocation Minchan Kim
                   ` (2 subsequent siblings)
  13 siblings, 3 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim, Sergey Senozhatsky

This patch introduces a run-time migration feature for zspage.

For migration, the VM uses the page.lru field, so it is better not to
use the page.next field, which is unioned with page.lru, for our own
purpose. For that, we first compute the first object offset of a page
at runtime instead of storing it in page.index, which frees page.index
to act as the link for page chaining in place of page.next.

In the case of a huge object, we store the handle in page.index
instead of the next link because a huge object does not need page
chaining. So get_next_page needs to identify a huge object and return
NULL for it; for that, this patch uses the PG_owner_priv_1 page flag.
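
So the page chain walk boils down to this (as in the diff below):

	static struct page *get_next_page(struct page *page)
	{
		/* a huge object keeps its handle in page->index; no chain */
		if (unlikely(PageHugeObject(page)))
			return NULL;

		return page->freelist;	/* next component page of the zspage */
	}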

For migration, it supports three functions

* zs_page_isolate

It isolates from its size class the zspage that contains the subpage
the VM wants to migrate, so that nobody can allocate a new object from
that zspage.

A zspage can be isolated once per subpage, so a subsequent isolation
attempt on another subpage of the same zspage should not fail. For
that, we introduce a zspage.isolated count. With it, zs_page_isolate
can tell whether the zspage is already isolated for migration; if it
is, the subsequent isolation attempt succeeds without detaching the
zspage again.
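
The isolation path is therefore roughly the following (condensed from
zs_page_isolate in the diff; declarations and error paths omitted):

	spin_lock(&class->lock);
	/* first isolation of this zspage: detach it from its fullness list */
	if (!list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
		get_zspage_mapping(zspage, &class_idx, &fullness);
		remove_zspage(class, zspage, fullness);
	}

	inc_zspage_isolation(zspage);
	spin_unlock(&class->lock);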

* zs_page_migrate

First of all, it takes the write side of zspage->lock to prevent
migration of other subpages in the zspage. Then it locks all objects
in the page the VM wants to migrate. The reason we must lock all
objects in the page is a race between zs_map_object and
zs_page_migrate:

zs_map_object				zs_page_migrate

pin_tag(handle)
obj = handle_to_obj(handle)
obj_to_location(obj, &page, &obj_idx);

					write_lock(&zspage->lock)
					if (!trypin_tag(handle))
						goto unpin_object

zspage = get_zspage(page);
read_lock(&zspage->lock);

If zs_page_migrate didn't do the trypin_tag, the page zs_map_object
looked up could become stale due to migration and it would crash.

If it locks all objects successfully, it copies the content from the
old page to the new one and finally creates a new zspage chain with
the new page. And if this was the last isolated subpage in the zspage,
it puts the zspage back into its class.
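
The core of the copy step looks roughly like this (condensed from
zs_page_migrate in the diff; declarations and error handling omitted):

	s_addr = kmap_atomic(page);
	/* every allocated object in @page is pinned at this point */
	d_addr = kmap_atomic(newpage);
	memcpy(d_addr, s_addr, PAGE_SIZE);
	kunmap_atomic(d_addr);

	/* redirect each pinned handle to its new location in newpage */
	for (addr = s_addr + offset; addr < s_addr + pos; addr += class->size) {
		head = obj_to_head(page, addr);
		if (head & OBJ_ALLOCATED_TAG) {
			handle = head & ~OBJ_ALLOCATED_TAG;
			old_obj = handle_to_obj(handle);
			obj_to_location(old_obj, &dummy, &obj_idx);
			new_obj = (unsigned long)location_to_obj(newpage, obj_idx);
			record_obj(handle, new_obj | BIT(HANDLE_PIN_BIT));
		}
	}

	replace_sub_page(class, zspage, newpage, page);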

* zs_page_putback

It returns an isolated zspage to the right fullness_group list when
it fails to migrate a page. If it finds the zspage has become
ZS_EMPTY, it queues the zspage freeing to a workqueue; see below about
async zspage freeing.
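
In short (see zs_page_putback below):

	spin_lock(&class->lock);
	dec_zspage_isolation(zspage);
	if (!is_zspage_isolated(zspage)) {
		fg = putback_zspage(class, zspage);
		/* we hold page_lock here, so defer the actual free */
		if (fg == ZS_EMPTY)
			schedule_work(&pool->free_work);
	}
	spin_unlock(&class->lock);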

This patch introduces asynchronous zspage freeing. We need it because
clearing PG_movable requires the page lock but, unfortunately, the
zs_free path must be atomic, so the approach is to try to grab the
page lock. If it gets the page lock of all pages successfully, it can
free the zspage immediately. Otherwise, it queues a free request and
frees the zspage via a workqueue in process context.
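
So the free path becomes a trylock-or-defer pattern (see free_zspage
and kick_deferred_free in the diff below):

	if (!trylock_zspage(zspage)) {
		/* zs_free must stay atomic: hand the work to the workqueue */
		kick_deferred_free(pool);
		return;
	}

	remove_zspage(class, zspage, ZS_EMPTY);
	__free_zspage(pool, class, zspage);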

If zs_free finds the zspage isolated when it tries to free it, it
delays the freeing until zs_page_putback notices and finally frees the
zspage.

In this patch, we expand fullness_list to cover ZS_EMPTY through
ZS_FULL. The ZS_EMPTY list is used for deferred freeing, and with the
added ZS_FULL list we can identify whether a zspage is isolated or not
via a list_empty(&zspage->list) test.

Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 include/uapi/linux/magic.h |   1 +
 mm/zsmalloc.c              | 793 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 672 insertions(+), 122 deletions(-)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index d829ce63529d..e398beac67b8 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -81,5 +81,6 @@
 /* Since UDF 2.01 is ISO 13346 based... */
 #define UDF_SUPER_MAGIC		0x15013346
 #define BALLOON_KVM_MAGIC	0x13661366
+#define ZSMALLOC_MAGIC		0x58295829
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index c6fb543cfb98..a80100db16d6 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -17,14 +17,14 @@
  *
  * Usage of struct page fields:
  *	page->private: points to zspage
- *	page->index: offset of the first object starting in this page.
- *		For the first page, this is always 0, so we use this field
- *		to store handle for huge object.
- *	page->next: links together all component pages of a zspage
+ *	page->freelist(index): links together all component pages of a zspage
+ *		For the huge page, this is always 0, so we use this field
+ *		to store handle.
  *
  * Usage of struct page flags:
  *	PG_private: identifies the first component page
  *	PG_private2: identifies the last component page
+ *	PG_owner_priv_1: indentifies the huge component page
  *
  */
 
@@ -49,6 +49,11 @@
 #include <linux/debugfs.h>
 #include <linux/zsmalloc.h>
 #include <linux/zpool.h>
+#include <linux/mount.h>
+#include <linux/compaction.h>
+#include <linux/pagemap.h>
+
+#define ZSPAGE_MAGIC	0x58
 
 /*
  * This must be power of 2 and greater than of equal to sizeof(link_free).
@@ -136,25 +141,23 @@
  * We do not maintain any list for completely empty or full pages
  */
 enum fullness_group {
-	ZS_ALMOST_FULL,
-	ZS_ALMOST_EMPTY,
 	ZS_EMPTY,
-	ZS_FULL
+	ZS_ALMOST_EMPTY,
+	ZS_ALMOST_FULL,
+	ZS_FULL,
+	NR_ZS_FULLNESS,
 };
 
 enum zs_stat_type {
+	CLASS_EMPTY,
+	CLASS_ALMOST_EMPTY,
+	CLASS_ALMOST_FULL,
+	CLASS_FULL,
 	OBJ_ALLOCATED,
 	OBJ_USED,
-	CLASS_ALMOST_FULL,
-	CLASS_ALMOST_EMPTY,
+	NR_ZS_STAT_TYPE,
 };
 
-#ifdef CONFIG_ZSMALLOC_STAT
-#define NR_ZS_STAT_TYPE	(CLASS_ALMOST_EMPTY + 1)
-#else
-#define NR_ZS_STAT_TYPE	(OBJ_USED + 1)
-#endif
-
 struct zs_size_stat {
 	unsigned long objs[NR_ZS_STAT_TYPE];
 };
@@ -163,6 +166,10 @@ struct zs_size_stat {
 static struct dentry *zs_stat_root;
 #endif
 
+#ifdef CONFIG_COMPACTION
+static struct vfsmount *zsmalloc_mnt;
+#endif
+
 /*
  * number of size_classes
  */
@@ -186,23 +193,36 @@ static const int fullness_threshold_frac = 4;
 
 struct size_class {
 	spinlock_t lock;
-	struct list_head fullness_list[2];
+	struct list_head fullness_list[NR_ZS_FULLNESS];
 	/*
 	 * Size of objects stored in this class. Must be multiple
 	 * of ZS_ALIGN.
 	 */
 	int size;
 	int objs_per_zspage;
-	unsigned int index;
-
-	struct zs_size_stat stats;
-
 	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
 	int pages_per_zspage;
-	/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
-	bool huge;
+
+	unsigned int index;
+	struct zs_size_stat stats;
 };
 
+/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
+static void SetPageHugeObject(struct page *page)
+{
+	SetPageOwnerPriv1(page);
+}
+
+static void ClearPageHugeObject(struct page *page)
+{
+	ClearPageOwnerPriv1(page);
+}
+
+static int PageHugeObject(struct page *page)
+{
+	return PageOwnerPriv1(page);
+}
+
 /*
  * Placed within free objects to form a singly linked list.
  * For every zspage, zspage->freeobj gives head of this list.
@@ -244,6 +264,10 @@ struct zs_pool {
 #ifdef CONFIG_ZSMALLOC_STAT
 	struct dentry *stat_dentry;
 #endif
+#ifdef CONFIG_COMPACTION
+	struct inode *inode;
+	struct work_struct free_work;
+#endif
 };
 
 /*
@@ -252,16 +276,23 @@ struct zs_pool {
  */
 #define FULLNESS_BITS	2
 #define CLASS_BITS	8
+#define ISOLATED_BITS	3
+#define MAGIC_VAL_BITS	8
 
 struct zspage {
 	struct {
 		unsigned int fullness:FULLNESS_BITS;
 		unsigned int class:CLASS_BITS;
+		unsigned int isolated:ISOLATED_BITS;
+		unsigned int magic:MAGIC_VAL_BITS;
 	};
 	unsigned int inuse;
 	unsigned int freeobj;
 	struct page *first_page;
 	struct list_head list; /* fullness list */
+#ifdef CONFIG_COMPACTION
+	rwlock_t lock;
+#endif
 };
 
 struct mapping_area {
@@ -274,6 +305,28 @@ struct mapping_area {
 	enum zs_mapmode vm_mm; /* mapping mode */
 };
 
+#ifdef CONFIG_COMPACTION
+static int zs_register_migration(struct zs_pool *pool);
+static void zs_unregister_migration(struct zs_pool *pool);
+static void migrate_lock_init(struct zspage *zspage);
+static void migrate_read_lock(struct zspage *zspage);
+static void migrate_read_unlock(struct zspage *zspage);
+static void kick_deferred_free(struct zs_pool *pool);
+static void init_deferred_free(struct zs_pool *pool);
+static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage);
+#else
+static int zsmalloc_mount(void) { return 0; }
+static void zsmalloc_unmount(void) {}
+static int zs_register_migration(struct zs_pool *pool) { return 0; }
+static void zs_unregister_migration(struct zs_pool *pool) {}
+static void migrate_lock_init(struct zspage *zspage) {}
+static void migrate_read_lock(struct zspage *zspage) {}
+static void migrate_read_unlock(struct zspage *zspage) {}
+static void kick_deferred_free(struct zs_pool *pool) {}
+static void init_deferred_free(struct zs_pool *pool) {}
+static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {}
+#endif
+
 static int create_cache(struct zs_pool *pool)
 {
 	pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
@@ -301,7 +354,7 @@ static void destroy_cache(struct zs_pool *pool)
 static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
 {
 	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-			gfp & ~__GFP_HIGHMEM);
+			gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
 }
 
 static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
@@ -311,7 +364,8 @@ static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
 
 static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
 {
-	return kmem_cache_alloc(pool->zspage_cachep, flags & ~__GFP_HIGHMEM);
+	return kmem_cache_alloc(pool->zspage_cachep,
+			flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
 };
 
 static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
@@ -421,11 +475,17 @@ static unsigned int get_maxobj_per_zspage(int size, int pages_per_zspage)
 /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
 static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
 
+static bool is_zspage_isolated(struct zspage *zspage)
+{
+	return zspage->isolated;
+}
+
 static int is_first_page(struct page *page)
 {
 	return PagePrivate(page);
 }
 
+/* Protected by class->lock */
 static inline int get_zspage_inuse(struct zspage *zspage)
 {
 	return zspage->inuse;
@@ -441,20 +501,12 @@ static inline void mod_zspage_inuse(struct zspage *zspage, int val)
 	zspage->inuse += val;
 }
 
-static inline int get_first_obj_offset(struct page *page)
+static inline struct page *get_first_page(struct zspage *zspage)
 {
-	if (is_first_page(page))
-		return 0;
+	struct page *first_page = zspage->first_page;
 
-	return page->index;
-}
-
-static inline void set_first_obj_offset(struct page *page, int offset)
-{
-	if (is_first_page(page))
-		return;
-
-	page->index = offset;
+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
+	return first_page;
 }
 
 static inline unsigned int get_freeobj(struct zspage *zspage)
@@ -471,6 +523,8 @@ static void get_zspage_mapping(struct zspage *zspage,
 				unsigned int *class_idx,
 				enum fullness_group *fullness)
 {
+	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
+
 	*fullness = zspage->fullness;
 	*class_idx = zspage->class;
 }
@@ -504,23 +558,19 @@ static int get_size_class_index(int size)
 static inline void zs_stat_inc(struct size_class *class,
 				enum zs_stat_type type, unsigned long cnt)
 {
-	if (type < NR_ZS_STAT_TYPE)
-		class->stats.objs[type] += cnt;
+	class->stats.objs[type] += cnt;
 }
 
 static inline void zs_stat_dec(struct size_class *class,
 				enum zs_stat_type type, unsigned long cnt)
 {
-	if (type < NR_ZS_STAT_TYPE)
-		class->stats.objs[type] -= cnt;
+	class->stats.objs[type] -= cnt;
 }
 
 static inline unsigned long zs_stat_get(struct size_class *class,
 				enum zs_stat_type type)
 {
-	if (type < NR_ZS_STAT_TYPE)
-		return class->stats.objs[type];
-	return 0;
+	return class->stats.objs[type];
 }
 
 #ifdef CONFIG_ZSMALLOC_STAT
@@ -664,6 +714,7 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
 }
 #endif
 
+
 /*
  * For each size class, zspages are divided into different groups
  * depending on how "full" they are. This was done so that we could
@@ -704,15 +755,9 @@ static void insert_zspage(struct size_class *class,
 {
 	struct zspage *head;
 
-	if (fullness >= ZS_EMPTY)
-		return;
-
+	zs_stat_inc(class, fullness, 1);
 	head = list_first_entry_or_null(&class->fullness_list[fullness],
 					struct zspage, list);
-
-	zs_stat_inc(class, fullness == ZS_ALMOST_EMPTY ?
-			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
-
 	/*
 	 * We want to see more ZS_FULL pages and less almost empty/full.
 	 * Put pages with higher ->inuse first.
@@ -734,14 +779,11 @@ static void remove_zspage(struct size_class *class,
 				struct zspage *zspage,
 				enum fullness_group fullness)
 {
-	if (fullness >= ZS_EMPTY)
-		return;
-
 	VM_BUG_ON(list_empty(&class->fullness_list[fullness]));
+	VM_BUG_ON(is_zspage_isolated(zspage));
 
 	list_del_init(&zspage->list);
-	zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ?
-			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
+	zs_stat_dec(class, fullness, 1);
 }
 
 /*
@@ -764,8 +806,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
 	if (newfg == currfg)
 		goto out;
 
-	remove_zspage(class, zspage, currfg);
-	insert_zspage(class, zspage, newfg);
+	if (!is_zspage_isolated(zspage)) {
+		remove_zspage(class, zspage, currfg);
+		insert_zspage(class, zspage, newfg);
+	}
+
 	set_zspage_mapping(zspage, class_idx, newfg);
 
 out:
@@ -808,19 +853,45 @@ static int get_pages_per_zspage(int class_size)
 	return max_usedpc_order;
 }
 
-static struct page *get_first_page(struct zspage *zspage)
+static struct zspage *get_zspage(struct page *page)
 {
-	return zspage->first_page;
+	struct zspage *zspage = (struct zspage *)page->private;
+
+	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
+	return zspage;
 }
 
-static struct zspage *get_zspage(struct page *page)
+static struct page *get_next_page(struct page *page)
 {
-	return (struct zspage *)page->private;
+	if (unlikely(PageHugeObject(page)))
+		return NULL;
+
+	return page->freelist;
 }
 
-static struct page *get_next_page(struct page *page)
+/* Get byte offset of first object in the @page */
+static int get_first_obj_offset(struct size_class *class,
+				struct page *first_page, struct page *page)
 {
-	return page->next;
+	int pos;
+	int page_idx = 0;
+	int ofs = 0;
+	struct page *cursor = first_page;
+
+	if (first_page == page)
+		goto out;
+
+	while (page != cursor) {
+		page_idx++;
+		cursor = get_next_page(cursor);
+	}
+
+	pos = class->objs_per_zspage * class->size *
+		page_idx / class->pages_per_zspage;
+
+	ofs = (pos + class->size) % PAGE_SIZE;
+out:
+	return ofs;
 }
 
 /**
@@ -857,16 +928,20 @@ static unsigned long handle_to_obj(unsigned long handle)
 	return *(unsigned long *)handle;
 }
 
-static unsigned long obj_to_head(struct size_class *class, struct page *page,
-			void *obj)
+static unsigned long obj_to_head(struct page *page, void *obj)
 {
-	if (class->huge) {
+	if (unlikely(PageHugeObject(page))) {
 		VM_BUG_ON_PAGE(!is_first_page(page), page);
 		return page->index;
 	} else
 		return *(unsigned long *)obj;
 }
 
+static inline int testpin_tag(unsigned long handle)
+{
+	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
+}
+
 static inline int trypin_tag(unsigned long handle)
 {
 	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
@@ -884,27 +959,93 @@ static void unpin_tag(unsigned long handle)
 
 static void reset_page(struct page *page)
 {
+	__ClearPageMovable(page);
 	clear_bit(PG_private, &page->flags);
 	clear_bit(PG_private_2, &page->flags);
 	set_page_private(page, 0);
-	page->index = 0;
+	ClearPageHugeObject(page);
+	page->freelist = NULL;
 }
 
-static void free_zspage(struct zs_pool *pool, struct zspage *zspage)
+/*
+ * To prevent zspage destroy during migration, zspage freeing should
+ * hold locks of all pages in the zspage.
+ */
+void lock_zspage(struct zspage *zspage)
+{
+	struct page *page = get_first_page(zspage);
+
+	do {
+		lock_page(page);
+	} while ((page = get_next_page(page)) != NULL);
+}
+
+int trylock_zspage(struct zspage *zspage)
+{
+	struct page *cursor, *fail;
+
+	for (cursor = get_first_page(zspage); cursor != NULL; cursor =
+					get_next_page(cursor)) {
+		if (!trylock_page(cursor)) {
+			fail = cursor;
+			goto unlock;
+		}
+	}
+
+	return 1;
+unlock:
+	for (cursor = get_first_page(zspage); cursor != fail; cursor =
+					get_next_page(cursor))
+		unlock_page(cursor);
+
+	return 0;
+}
+
+static void __free_zspage(struct zs_pool *pool, struct size_class *class,
+				struct zspage *zspage)
 {
 	struct page *page, *next;
+	enum fullness_group fg;
+	unsigned int class_idx;
+
+	get_zspage_mapping(zspage, &class_idx, &fg);
+
+	assert_spin_locked(&class->lock);
 
 	VM_BUG_ON(get_zspage_inuse(zspage));
+	VM_BUG_ON(fg != ZS_EMPTY);
 
-	next = page = zspage->first_page;
+	next = page = get_first_page(zspage);
 	do {
-		next = page->next;
+		VM_BUG_ON_PAGE(!PageLocked(page), page);
+		next = get_next_page(page);
 		reset_page(page);
+		unlock_page(page);
 		put_page(page);
 		page = next;
 	} while (page != NULL);
 
 	cache_free_zspage(pool, zspage);
+
+	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
+			class->size, class->pages_per_zspage));
+	atomic_long_sub(class->pages_per_zspage,
+					&pool->pages_allocated);
+}
+
+static void free_zspage(struct zs_pool *pool, struct size_class *class,
+				struct zspage *zspage)
+{
+	VM_BUG_ON(get_zspage_inuse(zspage));
+	VM_BUG_ON(list_empty(&zspage->list));
+
+	if (!trylock_zspage(zspage)) {
+		kick_deferred_free(pool);
+		return;
+	}
+
+	remove_zspage(class, zspage, ZS_EMPTY);
+	__free_zspage(pool, class, zspage);
 }
 
 /* Initialize a newly allocated zspage */
@@ -912,15 +1053,13 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 {
 	unsigned int freeobj = 1;
 	unsigned long off = 0;
-	struct page *page = zspage->first_page;
+	struct page *page = get_first_page(zspage);
 
 	while (page) {
 		struct page *next_page;
 		struct link_free *link;
 		void *vaddr;
 
-		set_first_obj_offset(page, off);
-
 		vaddr = kmap_atomic(page);
 		link = (struct link_free *)vaddr + off / sizeof(*link);
 
@@ -952,16 +1091,17 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
 	set_freeobj(zspage, 0);
 }
 
-static void create_page_chain(struct zspage *zspage, struct page *pages[],
-				int nr_pages)
+static void create_page_chain(struct size_class *class, struct zspage *zspage,
+				struct page *pages[])
 {
 	int i;
 	struct page *page;
 	struct page *prev_page = NULL;
+	int nr_pages = class->pages_per_zspage;
 
 	/*
 	 * Allocate individual pages and link them together as:
-	 * 1. all pages are linked together using page->next
+	 * 1. all pages are linked together using page->freelist
 	 * 2. each sub-page point to zspage using page->private
 	 *
 	 * we set PG_private to identify the first page (i.e. no other sub-page
@@ -970,16 +1110,18 @@ static void create_page_chain(struct zspage *zspage, struct page *pages[],
 	for (i = 0; i < nr_pages; i++) {
 		page = pages[i];
 		set_page_private(page, (unsigned long)zspage);
+		page->freelist = NULL;
 		if (i == 0) {
 			zspage->first_page = page;
 			SetPagePrivate(page);
+			if (unlikely(class->objs_per_zspage == 1 &&
+					class->pages_per_zspage == 1))
+				SetPageHugeObject(page);
 		} else {
-			prev_page->next = page;
+			prev_page->freelist = page;
 		}
-		if (i == nr_pages - 1) {
+		if (i == nr_pages - 1)
 			SetPagePrivate2(page);
-			page->next = NULL;
-		}
 		prev_page = page;
 	}
 }
@@ -999,6 +1141,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 		return NULL;
 
 	memset(zspage, 0, sizeof(struct zspage));
+	zspage->magic = ZSPAGE_MAGIC;
+	migrate_lock_init(zspage);
 
 	for (i = 0; i < class->pages_per_zspage; i++) {
 		struct page *page;
@@ -1013,7 +1157,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
 		pages[i] = page;
 	}
 
-	create_page_chain(zspage, pages, class->pages_per_zspage);
+	create_page_chain(class, zspage, pages);
 	init_zspage(class, zspage);
 
 	return zspage;
@@ -1024,7 +1168,7 @@ static struct zspage *find_get_zspage(struct size_class *class)
 	int i;
 	struct zspage *zspage;
 
-	for (i = ZS_ALMOST_FULL; i <= ZS_ALMOST_EMPTY; i++) {
+	for (i = ZS_ALMOST_FULL; i >= ZS_EMPTY; i--) {
 		zspage = list_first_entry_or_null(&class->fullness_list[i],
 				struct zspage, list);
 		if (zspage)
@@ -1289,6 +1433,10 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 	obj = handle_to_obj(handle);
 	obj_to_location(obj, &page, &obj_idx);
 	zspage = get_zspage(page);
+
+	/* migration cannot move any subpage in this zspage */
+	migrate_read_lock(zspage);
+
 	get_zspage_mapping(zspage, &class_idx, &fg);
 	class = pool->size_class[class_idx];
 	off = (class->size * obj_idx) & ~PAGE_MASK;
@@ -1309,7 +1457,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
 
 	ret = __zs_map_object(area, pages, off, class->size);
 out:
-	if (!class->huge)
+	if (likely(!PageHugeObject(page)))
 		ret += ZS_HANDLE_SIZE;
 
 	return ret;
@@ -1348,6 +1496,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
 		__zs_unmap_object(area, pages, off, class->size);
 	}
 	put_cpu_var(zs_map_area);
+
+	migrate_read_unlock(zspage);
 	unpin_tag(handle);
 }
 EXPORT_SYMBOL_GPL(zs_unmap_object);
@@ -1377,7 +1527,7 @@ static unsigned long obj_malloc(struct size_class *class,
 	vaddr = kmap_atomic(m_page);
 	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
 	set_freeobj(zspage, link->next >> OBJ_ALLOCATED_TAG);
-	if (!class->huge)
+	if (likely(!PageHugeObject(m_page)))
 		/* record handle in the header of allocated chunk */
 		link->handle = handle;
 	else
@@ -1407,6 +1557,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 {
 	unsigned long handle, obj;
 	struct size_class *class;
+	enum fullness_group newfg;
 	struct zspage *zspage;
 
 	if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
@@ -1422,28 +1573,37 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
 
 	spin_lock(&class->lock);
 	zspage = find_get_zspage(class);
-
-	if (!zspage) {
+	if (likely(zspage)) {
+		obj = obj_malloc(class, zspage, handle);
+		/* Now move the zspage to another fullness group, if required */
+		fix_fullness_group(class, zspage);
+		record_obj(handle, obj);
 		spin_unlock(&class->lock);
-		zspage = alloc_zspage(pool, class, gfp);
-		if (unlikely(!zspage)) {
-			cache_free_handle(pool, handle);
-			return 0;
-		}
 
-		set_zspage_mapping(zspage, class->index, ZS_EMPTY);
-		atomic_long_add(class->pages_per_zspage,
-					&pool->pages_allocated);
+		return handle;
+	}
 
-		spin_lock(&class->lock);
-		zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
-				class->size, class->pages_per_zspage));
+	spin_unlock(&class->lock);
+
+	zspage = alloc_zspage(pool, class, gfp);
+	if (!zspage) {
+		cache_free_handle(pool, handle);
+		return 0;
 	}
 
+	spin_lock(&class->lock);
 	obj = obj_malloc(class, zspage, handle);
-	/* Now move the zspage to another fullness group, if required */
-	fix_fullness_group(class, zspage);
+	newfg = get_fullness_group(class, zspage);
+	insert_zspage(class, zspage, newfg);
+	set_zspage_mapping(zspage, class->index, newfg);
 	record_obj(handle, obj);
+	atomic_long_add(class->pages_per_zspage,
+				&pool->pages_allocated);
+	zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
+			class->size, class->pages_per_zspage));
+
+	/* We completely set up zspage so mark them as movable */
+	SetZsPageMovable(pool, zspage);
 	spin_unlock(&class->lock);
 
 	return handle;
@@ -1484,6 +1644,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 	int class_idx;
 	struct size_class *class;
 	enum fullness_group fullness;
+	bool isolated;
 
 	if (unlikely(!handle))
 		return;
@@ -1493,22 +1654,28 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
 	obj_to_location(obj, &f_page, &f_objidx);
 	zspage = get_zspage(f_page);
 
+	migrate_read_lock(zspage);
+
 	get_zspage_mapping(zspage, &class_idx, &fullness);
 	class = pool->size_class[class_idx];
 
 	spin_lock(&class->lock);
 	obj_free(class, obj);
 	fullness = fix_fullness_group(class, zspage);
-	if (fullness == ZS_EMPTY) {
-		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
-				class->size, class->pages_per_zspage));
-		atomic_long_sub(class->pages_per_zspage,
-				&pool->pages_allocated);
-		free_zspage(pool, zspage);
+	if (fullness != ZS_EMPTY) {
+		migrate_read_unlock(zspage);
+		goto out;
 	}
+
+	isolated = is_zspage_isolated(zspage);
+	migrate_read_unlock(zspage);
+	/* If zspage is isolated, zs_page_putback will free the zspage */
+	if (likely(!isolated))
+		free_zspage(pool, class, zspage);
+out:
+
 	spin_unlock(&class->lock);
 	unpin_tag(handle);
-
 	cache_free_handle(pool, handle);
 }
 EXPORT_SYMBOL_GPL(zs_free);
@@ -1587,12 +1754,13 @@ static unsigned long find_alloced_obj(struct size_class *class,
 	int offset = 0;
 	unsigned long handle = 0;
 	void *addr = kmap_atomic(page);
+	struct zspage *zspage = get_zspage(page);
 
-	offset = get_first_obj_offset(page);
+	offset = get_first_obj_offset(class, get_first_page(zspage), page);
 	offset += class->size * index;
 
 	while (offset < PAGE_SIZE) {
-		head = obj_to_head(class, page, addr + offset);
+		head = obj_to_head(page, addr + offset);
 		if (head & OBJ_ALLOCATED_TAG) {
 			handle = head & ~OBJ_ALLOCATED_TAG;
 			if (trypin_tag(handle))
@@ -1684,6 +1852,7 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
 		zspage = list_first_entry_or_null(&class->fullness_list[fg[i]],
 							struct zspage, list);
 		if (zspage) {
+			VM_BUG_ON(is_zspage_isolated(zspage));
 			remove_zspage(class, zspage, fg[i]);
 			return zspage;
 		}
@@ -1704,6 +1873,8 @@ static enum fullness_group putback_zspage(struct size_class *class,
 {
 	enum fullness_group fullness;
 
+	VM_BUG_ON(is_zspage_isolated(zspage));
+
 	fullness = get_fullness_group(class, zspage);
 	insert_zspage(class, zspage, fullness);
 	set_zspage_mapping(zspage, class->index, fullness);
@@ -1711,6 +1882,377 @@ static enum fullness_group putback_zspage(struct size_class *class,
 	return fullness;
 }
 
+#ifdef CONFIG_COMPACTION
+static struct dentry *zs_mount(struct file_system_type *fs_type,
+				int flags, const char *dev_name, void *data)
+{
+	static const struct dentry_operations ops = {
+		.d_dname = simple_dname,
+	};
+
+	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
+}
+
+static struct file_system_type zsmalloc_fs = {
+	.name		= "zsmalloc",
+	.mount		= zs_mount,
+	.kill_sb	= kill_anon_super,
+};
+
+static int zsmalloc_mount(void)
+{
+	int ret = 0;
+
+	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
+	if (IS_ERR(zsmalloc_mnt))
+		ret = PTR_ERR(zsmalloc_mnt);
+
+	return ret;
+}
+
+static void zsmalloc_unmount(void)
+{
+	kern_unmount(zsmalloc_mnt);
+}
+
+static void migrate_lock_init(struct zspage *zspage)
+{
+	rwlock_init(&zspage->lock);
+}
+
+static void migrate_read_lock(struct zspage *zspage)
+{
+	read_lock(&zspage->lock);
+}
+
+static void migrate_read_unlock(struct zspage *zspage)
+{
+	read_unlock(&zspage->lock);
+}
+
+static void migrate_write_lock(struct zspage *zspage)
+{
+	write_lock(&zspage->lock);
+}
+
+static void migrate_write_unlock(struct zspage *zspage)
+{
+	write_unlock(&zspage->lock);
+}
+
+/* Number of isolated subpage for *page migration* in this zspage */
+static void inc_zspage_isolation(struct zspage *zspage)
+{
+	zspage->isolated++;
+}
+
+static void dec_zspage_isolation(struct zspage *zspage)
+{
+	zspage->isolated--;
+}
+
+static void replace_sub_page(struct size_class *class, struct zspage *zspage,
+				struct page *newpage, struct page *oldpage)
+{
+	struct page *page;
+	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL, };
+	int idx = 0;
+
+	page = get_first_page(zspage);
+	do {
+		if (page == oldpage)
+			pages[idx] = newpage;
+		else
+			pages[idx] = page;
+		idx++;
+	} while ((page = get_next_page(page)) != NULL);
+
+	create_page_chain(class, zspage, pages);
+	if (unlikely(PageHugeObject(oldpage)))
+		newpage->index = oldpage->index;
+	__SetPageMovable(newpage, page_mapping(oldpage));
+}
+
+bool zs_page_isolate(struct page *page, isolate_mode_t mode)
+{
+	struct zs_pool *pool;
+	struct size_class *class;
+	int class_idx;
+	enum fullness_group fullness;
+	struct zspage *zspage;
+	struct address_space *mapping;
+
+	/*
+	 * Page is locked so zspage couldn't be destroyed. For detail, look at
+	 * lock_zspage in free_zspage.
+	 */
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(PageIsolated(page), page);
+
+	zspage = get_zspage(page);
+
+	/*
+	 * Without class lock, fullness could be stale while class_idx is okay
+	 * because class_idx is constant unless page is freed so we should get
+	 * fullness again under class lock.
+	 */
+	get_zspage_mapping(zspage, &class_idx, &fullness);
+	mapping = page_mapping(page);
+	pool = mapping->private_data;
+	class = pool->size_class[class_idx];
+
+	spin_lock(&class->lock);
+	if (get_zspage_inuse(zspage) == 0) {
+		spin_unlock(&class->lock);
+		return false;
+	}
+
+	/* zspage is isolated for object migration */
+	if (list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
+		spin_unlock(&class->lock);
+		return false;
+	}
+
+	/*
+	 * If this is first time isolation for the zspage, isolate zspage from
+	 * size_class to prevent further object allocation from the zspage.
+	 */
+	if (!list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
+		get_zspage_mapping(zspage, &class_idx, &fullness);
+		remove_zspage(class, zspage, fullness);
+	}
+
+	inc_zspage_isolation(zspage);
+	spin_unlock(&class->lock);
+
+	return true;
+}
+
+int zs_page_migrate(struct address_space *mapping, struct page *newpage,
+		struct page *page, enum migrate_mode mode)
+{
+	struct zs_pool *pool;
+	struct size_class *class;
+	int class_idx;
+	enum fullness_group fullness;
+	struct zspage *zspage;
+	struct page *dummy;
+	void *s_addr, *d_addr, *addr;
+	int offset, pos;
+	unsigned long handle, head;
+	unsigned long old_obj, new_obj;
+	unsigned int obj_idx;
+	int ret = -EAGAIN;
+
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+	zspage = get_zspage(page);
+
+	/* Concurrent compactor cannot migrate any subpage in zspage */
+	migrate_write_lock(zspage);
+	get_zspage_mapping(zspage, &class_idx, &fullness);
+	pool = mapping->private_data;
+	class = pool->size_class[class_idx];
+	offset = get_first_obj_offset(class, get_first_page(zspage), page);
+
+	spin_lock(&class->lock);
+	if (!get_zspage_inuse(zspage)) {
+		ret = -EBUSY;
+		goto unlock_class;
+	}
+
+	pos = offset;
+	s_addr = kmap_atomic(page);
+	while (pos < PAGE_SIZE) {
+		head = obj_to_head(page, s_addr + pos);
+		if (head & OBJ_ALLOCATED_TAG) {
+			handle = head & ~OBJ_ALLOCATED_TAG;
+			if (!trypin_tag(handle))
+				goto unpin_objects;
+		}
+		pos += class->size;
+	}
+
+	/*
+	 * Here, any user cannot access all objects in the zspage so let's move.
+	 */
+	d_addr = kmap_atomic(newpage);
+	memcpy(d_addr, s_addr, PAGE_SIZE);
+	kunmap_atomic(d_addr);
+
+	for (addr = s_addr + offset; addr < s_addr + pos;
+					addr += class->size) {
+		head = obj_to_head(page, addr);
+		if (head & OBJ_ALLOCATED_TAG) {
+			handle = head & ~OBJ_ALLOCATED_TAG;
+			if (!testpin_tag(handle))
+				BUG();
+
+			old_obj = handle_to_obj(handle);
+			obj_to_location(old_obj, &dummy, &obj_idx);
+			new_obj = (unsigned long)location_to_obj(newpage,
+								obj_idx);
+			new_obj |= BIT(HANDLE_PIN_BIT);
+			record_obj(handle, new_obj);
+		}
+	}
+
+	replace_sub_page(class, zspage, newpage, page);
+	get_page(newpage);
+
+	dec_zspage_isolation(zspage);
+
+	/*
+	 * Page migration is done so let's putback isolated zspage to
+	 * the list if @page is final isolated subpage in the zspage.
+	 */
+	if (!is_zspage_isolated(zspage))
+		putback_zspage(class, zspage);
+
+	reset_page(page);
+	put_page(page);
+	page = newpage;
+
+	ret = 0;
+unpin_objects:
+	for (addr = s_addr + offset; addr < s_addr + pos;
+						addr += class->size) {
+		head = obj_to_head(page, addr);
+		if (head & OBJ_ALLOCATED_TAG) {
+			handle = head & ~OBJ_ALLOCATED_TAG;
+			if (!testpin_tag(handle))
+				BUG();
+			unpin_tag(handle);
+		}
+	}
+	kunmap_atomic(s_addr);
+unlock_class:
+	spin_unlock(&class->lock);
+	migrate_write_unlock(zspage);
+
+	return ret;
+}
+
+void zs_page_putback(struct page *page)
+{
+	struct zs_pool *pool;
+	struct size_class *class;
+	int class_idx;
+	enum fullness_group fg;
+	struct address_space *mapping;
+	struct zspage *zspage;
+
+	VM_BUG_ON_PAGE(!PageMovable(page), page);
+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
+
+	zspage = get_zspage(page);
+	get_zspage_mapping(zspage, &class_idx, &fg);
+	mapping = page_mapping(page);
+	pool = mapping->private_data;
+	class = pool->size_class[class_idx];
+
+	spin_lock(&class->lock);
+	dec_zspage_isolation(zspage);
+	if (!is_zspage_isolated(zspage)) {
+		fg = putback_zspage(class, zspage);
+		/*
+		 * Due to page_lock, we cannot free zspage immediately
+		 * so let's defer.
+		 */
+		if (fg == ZS_EMPTY)
+			schedule_work(&pool->free_work);
+	}
+	spin_unlock(&class->lock);
+}
+
+const struct address_space_operations zsmalloc_aops = {
+	.isolate_page = zs_page_isolate,
+	.migratepage = zs_page_migrate,
+	.putback_page = zs_page_putback,
+};
+
+static int zs_register_migration(struct zs_pool *pool)
+{
+	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
+	if (IS_ERR(pool->inode)) {
+		pool->inode = NULL;
+		return 1;
+	}
+
+	pool->inode->i_mapping->private_data = pool;
+	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
+	return 0;
+}
+
+static void zs_unregister_migration(struct zs_pool *pool)
+{
+	flush_work(&pool->free_work);
+	if (pool->inode)
+		iput(pool->inode);
+}
+
+/*
+ * Caller should hold page_lock of all pages in the zspage
+ * In here, we cannot use zspage meta data.
+ */
+static void async_free_zspage(struct work_struct *work)
+{
+	int i;
+	struct size_class *class;
+	unsigned int class_idx;
+	enum fullness_group fullness;
+	struct zspage *zspage, *tmp;
+	LIST_HEAD(free_pages);
+	struct zs_pool *pool = container_of(work, struct zs_pool,
+					free_work);
+
+	for (i = 0; i < zs_size_classes; i++) {
+		class = pool->size_class[i];
+		if (class->index != i)
+			continue;
+
+		spin_lock(&class->lock);
+		list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages);
+		spin_unlock(&class->lock);
+	}
+
+
+	list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
+		list_del(&zspage->list);
+		lock_zspage(zspage);
+
+		get_zspage_mapping(zspage, &class_idx, &fullness);
+		VM_BUG_ON(fullness != ZS_EMPTY);
+		class = pool->size_class[class_idx];
+		spin_lock(&class->lock);
+		__free_zspage(pool, pool->size_class[class_idx], zspage);
+		spin_unlock(&class->lock);
+	}
+};
+
+static void kick_deferred_free(struct zs_pool *pool)
+{
+	schedule_work(&pool->free_work);
+}
+
+static void init_deferred_free(struct zs_pool *pool)
+{
+	INIT_WORK(&pool->free_work, async_free_zspage);
+}
+
+static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage)
+{
+	struct page *page = get_first_page(zspage);
+
+	do {
+		WARN_ON(!trylock_page(page));
+		__SetPageMovable(page, pool->inode->i_mapping);
+		unlock_page(page);
+	} while ((page = get_next_page(page)) != NULL);
+}
+#endif
+
 /*
  *
  * Based on the number of unused allocated objects calculate
@@ -1745,10 +2287,10 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 			break;
 
 		cc.index = 0;
-		cc.s_page = src_zspage->first_page;
+		cc.s_page = get_first_page(src_zspage);
 
 		while ((dst_zspage = isolate_zspage(class, false))) {
-			cc.d_page = dst_zspage->first_page;
+			cc.d_page = get_first_page(dst_zspage);
 			/*
 			 * If there is no more space in dst_page, resched
 			 * and see if anyone had allocated another zspage.
@@ -1765,11 +2307,7 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
 
 		putback_zspage(class, dst_zspage);
 		if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
-			zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
-					class->size, class->pages_per_zspage));
-			atomic_long_sub(class->pages_per_zspage,
-					&pool->pages_allocated);
-			free_zspage(pool, src_zspage);
+			free_zspage(pool, class, src_zspage);
 			pool->stats.pages_compacted += class->pages_per_zspage;
 		}
 		spin_unlock(&class->lock);
@@ -1885,6 +2423,7 @@ struct zs_pool *zs_create_pool(const char *name)
 	if (!pool)
 		return NULL;
 
+	init_deferred_free(pool);
 	pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *),
 			GFP_KERNEL);
 	if (!pool->size_class) {
@@ -1939,12 +2478,10 @@ struct zs_pool *zs_create_pool(const char *name)
 		class->pages_per_zspage = pages_per_zspage;
 		class->objs_per_zspage = class->pages_per_zspage *
 						PAGE_SIZE / class->size;
-		if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
-			class->huge = true;
 		spin_lock_init(&class->lock);
 		pool->size_class[i] = class;
-		for (fullness = ZS_ALMOST_FULL; fullness <= ZS_ALMOST_EMPTY;
-								fullness++)
+		for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS;
+							fullness++)
 			INIT_LIST_HEAD(&class->fullness_list[fullness]);
 
 		prev_class = class;
@@ -1953,6 +2490,9 @@ struct zs_pool *zs_create_pool(const char *name)
 	/* debug only, don't abort if it fails */
 	zs_pool_stat_create(pool, name);
 
+	if (zs_register_migration(pool))
+		goto err;
+
 	/*
 	 * Not critical, we still can use the pool
 	 * and user can trigger compaction manually.
@@ -1972,6 +2512,7 @@ void zs_destroy_pool(struct zs_pool *pool)
 	int i;
 
 	zs_unregister_shrinker(pool);
+	zs_unregister_migration(pool);
 	zs_pool_stat_destroy(pool);
 
 	for (i = 0; i < zs_size_classes; i++) {
@@ -1984,7 +2525,7 @@ void zs_destroy_pool(struct zs_pool *pool)
 		if (class->index != i)
 			continue;
 
-		for (fg = ZS_ALMOST_FULL; fg <= ZS_ALMOST_EMPTY; fg++) {
+		for (fg = ZS_EMPTY; fg < NR_ZS_FULLNESS; fg++) {
 			if (!list_empty(&class->fullness_list[fg])) {
 				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
 					class->size, fg);
@@ -2002,7 +2543,13 @@ EXPORT_SYMBOL_GPL(zs_destroy_pool);
 
 static int __init zs_init(void)
 {
-	int ret = zs_register_cpu_notifier();
+	int ret;
+
+	ret = zsmalloc_mount();
+	if (ret)
+		goto out;
+
+	ret = zs_register_cpu_notifier();
 
 	if (ret)
 		goto notifier_fail;
@@ -2019,7 +2566,8 @@ static int __init zs_init(void)
 
 notifier_fail:
 	zs_unregister_cpu_notifier();
-
+	zsmalloc_unmount();
+out:
 	return ret;
 }
 
@@ -2028,6 +2576,7 @@ static void __exit zs_exit(void)
 #ifdef CONFIG_ZPOOL
 	zpool_unregister_driver(&zs_zpool_driver);
 #endif
+	zsmalloc_unmount();
 	zs_unregister_cpu_notifier();
 
 	zs_stat_exit();
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v7 12/12] zram: use __GFP_MOVABLE for memory allocation
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (10 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 11/12] zsmalloc: page migration support Minchan Kim
@ 2016-05-31 23:21 ` Minchan Kim
  2016-06-01 21:41 ` [PATCH v7 00/12] Support non-lru page migration Andrew Morton
  2016-06-15  7:59 ` Sergey Senozhatsky
  13 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-05-31 23:21 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Minchan Kim, Sergey Senozhatsky

Zsmalloc is ready for page migration so zram can use __GFP_MOVABLE
from now on.

I ran a test to see how much it helps to create higher-order pages.
The test scenario is as follows.

KVM guest, 1G memory, ext4-formatted zram block device,

for i in `seq 1 8`;
do
        dd if=/dev/vda1 of=mnt/test$i.txt bs=128M count=1 &
done

wait `pidof dd`

for i in `seq 1 2 8`;
do
        rm -rf mnt/test$i.txt
done
fstrim -v mnt

echo "init"
cat /proc/buddyinfo

echo "compaction"
echo 1 > /proc/sys/vm/compact_memory
cat /proc/buddyinfo

old:

init
Node 0, zone      DMA    208    120     51     41     11      0      0      0      0      0      0
Node 0, zone    DMA32  16380  13777   9184   3805    789     54      3      0      0      0      0
compaction
Node 0, zone      DMA    132     82     40     39     16      2      1      0      0      0      0
Node 0, zone    DMA32   5219   5526   4969   3455   1831    677    139     15      0      0      0

new:

init
Node 0, zone      DMA    379    115     97     19      2      0      0      0      0      0      0
Node 0, zone    DMA32  18891  16774  10862   3947    637     21      0      0      0      0      0
compaction  1
Node 0, zone      DMA    214     66     87     29     10      3      0      0      0      0      0
Node 0, zone    DMA32   1612   3139   3154   2469   1745    990    384     94      7      0      0

As you can see, compaction now produces many more high-order pages: in
the DMA32 zone the number of free order-7 blocks grows from 15 to 94
after compaction, and 7 order-8 blocks appear where there were none
before. Yay!

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 8fcad8b761f1..ccf1bddd09ca 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -732,7 +732,8 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 		handle = zs_malloc(meta->mem_pool, clen,
 				__GFP_KSWAPD_RECLAIM |
 				__GFP_NOWARN |
-				__GFP_HIGHMEM);
+				__GFP_HIGHMEM |
+				__GFP_MOVABLE);
 	if (!handle) {
 		zcomp_strm_release(zram->comp, zstrm);
 		zstrm = NULL;
@@ -740,7 +741,8 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 		atomic64_inc(&zram->stats.writestall);
 
 		handle = zs_malloc(meta->mem_pool, clen,
-				GFP_NOIO | __GFP_HIGHMEM);
+				GFP_NOIO | __GFP_HIGHMEM |
+				__GFP_MOVABLE);
 		if (handle)
 			goto compress_again;
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2016-05-31 23:21 ` [PATCH v7 11/12] zsmalloc: page migration support Minchan Kim
@ 2016-06-01 14:09   ` Vlastimil Babka
  2016-06-02  0:25     ` Minchan Kim
  2016-06-01 21:39   ` Andrew Morton
       [not found]   ` <CGME20170119001317epcas1p188357c77e1f4ff08b6d3dcb76dedca06@epcas1p1.samsung.com>
  2 siblings, 1 reply; 49+ messages in thread
From: Vlastimil Babka @ 2016-06-01 14:09 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky

On 06/01/2016 01:21 AM, Minchan Kim wrote:

[...]

> 
> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

I'm not that familiar with zsmalloc, so this is not a full review. I was
just curious how it's handling the movable migration API, and stumbled
upon some things pointed out below.

> @@ -252,16 +276,23 @@ struct zs_pool {
>   */
>  #define FULLNESS_BITS	2
>  #define CLASS_BITS	8
> +#define ISOLATED_BITS	3
> +#define MAGIC_VAL_BITS	8
>  
>  struct zspage {
>  	struct {
>  		unsigned int fullness:FULLNESS_BITS;
>  		unsigned int class:CLASS_BITS;
> +		unsigned int isolated:ISOLATED_BITS;
> +		unsigned int magic:MAGIC_VAL_BITS;

This magic seems to be tested only via VM_BUG_ON, so its presence
should also be guarded by #ifdef CONFIG_DEBUG_VM, no?

> @@ -999,6 +1141,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>  		return NULL;
>  
>  	memset(zspage, 0, sizeof(struct zspage));
> +	zspage->magic = ZSPAGE_MAGIC;

Same here.

> +int zs_page_migrate(struct address_space *mapping, struct page *newpage,
> +		struct page *page, enum migrate_mode mode)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fullness;
> +	struct zspage *zspage;
> +	struct page *dummy;
> +	void *s_addr, *d_addr, *addr;
> +	int offset, pos;
> +	unsigned long handle, head;
> +	unsigned long old_obj, new_obj;
> +	unsigned int obj_idx;
> +	int ret = -EAGAIN;
> +
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> +
> +	zspage = get_zspage(page);
> +
> +	/* Concurrent compactor cannot migrate any subpage in zspage */
> +	migrate_write_lock(zspage);
> +	get_zspage_mapping(zspage, &class_idx, &fullness);
> +	pool = mapping->private_data;
> +	class = pool->size_class[class_idx];
> +	offset = get_first_obj_offset(class, get_first_page(zspage), page);
> +
> +	spin_lock(&class->lock);
> +	if (!get_zspage_inuse(zspage)) {
> +		ret = -EBUSY;
> +		goto unlock_class;
> +	}
> +
> +	pos = offset;
> +	s_addr = kmap_atomic(page);
> +	while (pos < PAGE_SIZE) {
> +		head = obj_to_head(page, s_addr + pos);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			handle = head & ~OBJ_ALLOCATED_TAG;
> +			if (!trypin_tag(handle))
> +				goto unpin_objects;
> +		}
> +		pos += class->size;
> +	}
> +
> +	/*
> +	 * Here, any user cannot access all objects in the zspage so let's move.
> +	 */
> +	d_addr = kmap_atomic(newpage);
> +	memcpy(d_addr, s_addr, PAGE_SIZE);
> +	kunmap_atomic(d_addr);
> +
> +	for (addr = s_addr + offset; addr < s_addr + pos;
> +					addr += class->size) {
> +		head = obj_to_head(page, addr);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			handle = head & ~OBJ_ALLOCATED_TAG;
> +			if (!testpin_tag(handle))
> +				BUG();
> +
> +			old_obj = handle_to_obj(handle);
> +			obj_to_location(old_obj, &dummy, &obj_idx);
> +			new_obj = (unsigned long)location_to_obj(newpage,
> +								obj_idx);
> +			new_obj |= BIT(HANDLE_PIN_BIT);
> +			record_obj(handle, new_obj);
> +		}
> +	}
> +
> +	replace_sub_page(class, zspage, newpage, page);
> +	get_page(newpage);
> +
> +	dec_zspage_isolation(zspage);
> +
> +	/*
> +	 * Page migration is done so let's putback isolated zspage to
> +	 * the list if @page is final isolated subpage in the zspage.
> +	 */
> +	if (!is_zspage_isolated(zspage))
> +		putback_zspage(class, zspage);
> +
> +	reset_page(page);
> +	put_page(page);
> +	page = newpage;
> +
> +	ret = 0;
> +unpin_objects:
> +	for (addr = s_addr + offset; addr < s_addr + pos;
> +						addr += class->size) {
> +		head = obj_to_head(page, addr);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			handle = head & ~OBJ_ALLOCATED_TAG;
> +			if (!testpin_tag(handle))
> +				BUG();
> +			unpin_tag(handle);
> +		}
> +	}
> +	kunmap_atomic(s_addr);

The above seems suspicious to me. In the success case, page points to
newpage, but s_addr is still the original one?

Vlastimil

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2016-05-31 23:21 ` [PATCH v7 11/12] zsmalloc: page migration support Minchan Kim
  2016-06-01 14:09   ` Vlastimil Babka
@ 2016-06-01 21:39   ` Andrew Morton
  2016-06-02  0:15     ` Minchan Kim
       [not found]   ` <CGME20170119001317epcas1p188357c77e1f4ff08b6d3dcb76dedca06@epcas1p1.samsung.com>
  2 siblings, 1 reply; 49+ messages in thread
From: Andrew Morton @ 2016-06-01 21:39 UTC (permalink / raw)
  To: Minchan Kim; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky

On Wed,  1 Jun 2016 08:21:20 +0900 Minchan Kim <minchan@kernel.org> wrote:

> This patch introduces run-time migration feature for zspage.
> 
> ...
>
> +static void kick_deferred_free(struct zs_pool *pool)
> +{
> +	schedule_work(&pool->free_work);
> +}

When CONFIG_ZSMALLOC=m, what keeps all the data structures in place
during a concurrent rmmod?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (11 preceding siblings ...)
  2016-05-31 23:21 ` [PATCH v7 12/12] zram: use __GFP_MOVABLE for memory allocation Minchan Kim
@ 2016-06-01 21:41 ` Andrew Morton
  2016-06-01 22:40   ` Daniel Vetter
  2016-06-02  0:36   ` Minchan Kim
  2016-06-15  7:59 ` Sergey Senozhatsky
  13 siblings, 2 replies; 49+ messages in thread
From: Andrew Morton @ 2016-06-01 21:41 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-mm, linux-kernel, Vlastimil Babka, dri-devel, Hugh Dickins,
	John Einar Reitan, Jonathan Corbet, Joonsoo Kim,
	Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

On Wed,  1 Jun 2016 08:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Recently, I got many reports about perfermance degradation in embedded
> system(Android mobile phone, webOS TV and so on) and easy fork fail.
> 
> The problem was fragmentation caused by zram and GPU driver mainly.
> With memory pressure, their pages were spread out all of pageblock and
> it cannot be migrated with current compaction algorithm which supports
> only LRU pages. In the end, compaction cannot work well so reclaimer
> shrinks all of working set pages. It made system very slow and even to
> fail to fork easily which requires order-[2 or 3] allocations.
> 
> Other pain point is that they cannot use CMA memory space so when OOM
> kill happens, I can see many free pages in CMA area, which is not
> memory efficient. In our product which has big CMA memory, it reclaims
> zones too exccessively to allocate GPU and zram page although there are
> lots of free space in CMA so system becomes very slow easily.

But this isn't presently implemented for GPU drivers or for CMA, yes?

What's the story there?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-01 21:41 ` [PATCH v7 00/12] Support non-lru page migration Andrew Morton
@ 2016-06-01 22:40   ` Daniel Vetter
  2016-06-02  0:36   ` Minchan Kim
  1 sibling, 0 replies; 49+ messages in thread
From: Daniel Vetter @ 2016-06-01 22:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Minchan Kim, linux-mm, linux-kernel, Vlastimil Babka, dri-devel,
	Hugh Dickins, John Einar Reitan, Jonathan Corbet, Joonsoo Kim,
	Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

On Wed, Jun 01, 2016 at 02:41:51PM -0700, Andrew Morton wrote:
> On Wed,  1 Jun 2016 08:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Recently, I got many reports about perfermance degradation in embedded
> > system(Android mobile phone, webOS TV and so on) and easy fork fail.
> > 
> > The problem was fragmentation caused by zram and GPU driver mainly.
> > With memory pressure, their pages were spread out all of pageblock and
> > it cannot be migrated with current compaction algorithm which supports
> > only LRU pages. In the end, compaction cannot work well so reclaimer
> > shrinks all of working set pages. It made system very slow and even to
> > fail to fork easily which requires order-[2 or 3] allocations.
> > 
> > Other pain point is that they cannot use CMA memory space so when OOM
> > kill happens, I can see many free pages in CMA area, which is not
> > memory efficient. In our product which has big CMA memory, it reclaims
> > zones too exccessively to allocate GPU and zram page although there are
> > lots of free space in CMA so system becomes very slow easily.
> 
> But this isn't presently implemented for GPU drivers or for CMA, yes?
> 
> What's the story there?

Broken (out-of-tree) drivers that don't allocate their GPU buffers
correctly.  There are piles of drivers that call get_user_pages() all
over the place but then fail to release those pages in a timely
fashion.  The fix is to get off those pages again (either by unpinning
them promptly, or by registering an mmu_notifier if the driver wants
to keep the pages pinned indefinitely as a caching optimization).
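
For reference, the shape of that pattern is roughly the following (a
minimal sketch against the current mmu_notifier API; my_gpu_ctx and
my_gpu_unpin_range() are made-up names, not taken from any real
driver):

#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/mmu_notifier.h>

struct my_gpu_ctx {
        struct mmu_notifier mn;
        struct mm_struct *mm;
};

/* driver-specific: put_page() everything it cached in [start, end) */
static void my_gpu_unpin_range(struct my_gpu_ctx *ctx,
                               unsigned long start, unsigned long end);

static void my_gpu_invalidate_range_start(struct mmu_notifier *mn,
                                          struct mm_struct *mm,
                                          unsigned long start,
                                          unsigned long end)
{
        struct my_gpu_ctx *ctx = container_of(mn, struct my_gpu_ctx, mn);

        /* drop the cached pins so the pages become migratable again */
        my_gpu_unpin_range(ctx, start, end);
}

static const struct mmu_notifier_ops my_gpu_mn_ops = {
        .invalidate_range_start = my_gpu_invalidate_range_start,
};

static int my_gpu_track_current_mm(struct my_gpu_ctx *ctx)
{
        ctx->mm = current->mm;
        ctx->mn.ops = &my_gpu_mn_ops;
        return mmu_notifier_register(&ctx->mn, ctx->mm);
}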

At least that's my guess, and IIRC it was confirmed the first time
this series showed up.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2016-06-01 21:39   ` Andrew Morton
@ 2016-06-02  0:15     ` Minchan Kim
  0 siblings, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-06-02  0:15 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Sergey Senozhatsky

On Wed, Jun 01, 2016 at 02:39:36PM -0700, Andrew Morton wrote:
> On Wed,  1 Jun 2016 08:21:20 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > This patch introduces run-time migration feature for zspage.
> > 
> > ...
> >
> > +static void kick_deferred_free(struct zs_pool *pool)
> > +{
> > +	schedule_work(&pool->free_work);
> > +}
> 
> When CONFIG_ZSMALLOC=m, what keeps all the data structures in place
> during a concurrent rmmod?
> 

Most of the data structures start to work once the zsmalloc user (e.g.
zram) calls zs_create_pool, and that user must call zs_destroy_pool
before trying rmmod; zs_destroy_pool calls zs_unregister_migration,
which does flush_work(&pool->free_work).
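
In other words, roughly this (a sketch of just that teardown step; the
real zs_unregister_migration() may do more, such as releasing the
inode set up by zs_register_migration):

static void zs_unregister_migration(struct zs_pool *pool)
{
        /*
         * Drain any async_free_zspage() work queued by
         * kick_deferred_free(), so nothing can still be running out of
         * module text once the pool is destroyed and the module removed.
         */
        flush_work(&pool->free_work);
}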

If I missed something, please let me know.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2016-06-01 14:09   ` Vlastimil Babka
@ 2016-06-02  0:25     ` Minchan Kim
  2016-06-02 11:44       ` Vlastimil Babka
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2016-06-02  0:25 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Andrew Morton, linux-mm, linux-kernel, Sergey Senozhatsky

On Wed, Jun 01, 2016 at 04:09:26PM +0200, Vlastimil Babka wrote:
> On 06/01/2016 01:21 AM, Minchan Kim wrote:
> 
> [...]
> 
> > 
> > Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> > Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> I'm not that familiar with zsmalloc, so this is not a full review. I was
> just curious how it's handling the movable migration API, and stumbled
> upon some things pointed out below.
> 
> > @@ -252,16 +276,23 @@ struct zs_pool {
> >   */
> >  #define FULLNESS_BITS	2
> >  #define CLASS_BITS	8
> > +#define ISOLATED_BITS	3
> > +#define MAGIC_VAL_BITS	8
> >  
> >  struct zspage {
> >  	struct {
> >  		unsigned int fullness:FULLNESS_BITS;
> >  		unsigned int class:CLASS_BITS;
> > +		unsigned int isolated:ISOLATED_BITS;
> > +		unsigned int magic:MAGIC_VAL_BITS;
> 
> This magic seems to be only tested via VM_BUG_ON, so it's presence
> should be also guarded by #ifdef DEBUG_VM, no?

Thanks for the point.

Then I want to change it to BUG_ON, because struct zspage corruption
is really risky for correct operation, and I want to catch it on real
products (which disable CONFIG_DEBUG_VM) for a while until the feature
becomes stable.
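
I.e. something like the following (a sketch of the unconditional
check, assuming get_zspage() is where the magic gets validated):

static struct zspage *get_zspage(struct page *page)
{
        struct zspage *zspage = (struct zspage *)page->private;

        /* catch struct zspage corruption even with CONFIG_DEBUG_VM=n */
        BUG_ON(zspage->magic != ZSPAGE_MAGIC);
        return zspage;
}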

> 
> > @@ -999,6 +1141,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> >  		return NULL;
> >  
> >  	memset(zspage, 0, sizeof(struct zspage));
> > +	zspage->magic = ZSPAGE_MAGIC;
> 
> Same here.
> 
> > +int zs_page_migrate(struct address_space *mapping, struct page *newpage,
> > +		struct page *page, enum migrate_mode mode)
> > +{
> > +	struct zs_pool *pool;
> > +	struct size_class *class;
> > +	int class_idx;
> > +	enum fullness_group fullness;
> > +	struct zspage *zspage;
> > +	struct page *dummy;
> > +	void *s_addr, *d_addr, *addr;
> > +	int offset, pos;
> > +	unsigned long handle, head;
> > +	unsigned long old_obj, new_obj;
> > +	unsigned int obj_idx;
> > +	int ret = -EAGAIN;
> > +
> > +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> > +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> > +
> > +	zspage = get_zspage(page);
> > +
> > +	/* Concurrent compactor cannot migrate any subpage in zspage */
> > +	migrate_write_lock(zspage);
> > +	get_zspage_mapping(zspage, &class_idx, &fullness);
> > +	pool = mapping->private_data;
> > +	class = pool->size_class[class_idx];
> > +	offset = get_first_obj_offset(class, get_first_page(zspage), page);
> > +
> > +	spin_lock(&class->lock);
> > +	if (!get_zspage_inuse(zspage)) {
> > +		ret = -EBUSY;
> > +		goto unlock_class;
> > +	}
> > +
> > +	pos = offset;
> > +	s_addr = kmap_atomic(page);
> > +	while (pos < PAGE_SIZE) {
> > +		head = obj_to_head(page, s_addr + pos);
> > +		if (head & OBJ_ALLOCATED_TAG) {
> > +			handle = head & ~OBJ_ALLOCATED_TAG;
> > +			if (!trypin_tag(handle))
> > +				goto unpin_objects;
> > +		}
> > +		pos += class->size;
> > +	}
> > +
> > +	/*
> > +	 * Here, any user cannot access all objects in the zspage so let's move.
> > +	 */
> > +	d_addr = kmap_atomic(newpage);
> > +	memcpy(d_addr, s_addr, PAGE_SIZE);
> > +	kunmap_atomic(d_addr);
> > +
> > +	for (addr = s_addr + offset; addr < s_addr + pos;
> > +					addr += class->size) {
> > +		head = obj_to_head(page, addr);
> > +		if (head & OBJ_ALLOCATED_TAG) {
> > +			handle = head & ~OBJ_ALLOCATED_TAG;
> > +			if (!testpin_tag(handle))
> > +				BUG();
> > +
> > +			old_obj = handle_to_obj(handle);
> > +			obj_to_location(old_obj, &dummy, &obj_idx);
> > +			new_obj = (unsigned long)location_to_obj(newpage,
> > +								obj_idx);
> > +			new_obj |= BIT(HANDLE_PIN_BIT);
> > +			record_obj(handle, new_obj);
> > +		}
> > +	}
> > +
> > +	replace_sub_page(class, zspage, newpage, page);
> > +	get_page(newpage);
> > +
> > +	dec_zspage_isolation(zspage);
> > +
> > +	/*
> > +	 * Page migration is done so let's putback isolated zspage to
> > +	 * the list if @page is final isolated subpage in the zspage.
> > +	 */
> > +	if (!is_zspage_isolated(zspage))
> > +		putback_zspage(class, zspage);
> > +
> > +	reset_page(page);
> > +	put_page(page);
> > +	page = newpage;
> > +
> > +	ret = 0;
> > +unpin_objects:
> > +	for (addr = s_addr + offset; addr < s_addr + pos;
> > +						addr += class->size) {
> > +		head = obj_to_head(page, addr);
> > +		if (head & OBJ_ALLOCATED_TAG) {
> > +			handle = head & ~OBJ_ALLOCATED_TAG;
> > +			if (!testpin_tag(handle))
> > +				BUG();
> > +			unpin_tag(handle);
> > +		}
> > +	}
> > +	kunmap_atomic(s_addr);
> 
> The above seems suspicious to me. In the success case, page points to
> newpage, but s_addr is still the original one?

s_addr is the virtual address of the old page from kmap_atomic, so the
page pointer of the new page doesn't matter.

> 
> Vlastimil
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-01 21:41 ` [PATCH v7 00/12] Support non-lru page migration Andrew Morton
  2016-06-01 22:40   ` Daniel Vetter
@ 2016-06-02  0:36   ` Minchan Kim
  1 sibling, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2016-06-02  0:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, Vlastimil Babka, dri-devel, Hugh Dickins,
	John Einar Reitan, Jonathan Corbet, Joonsoo Kim,
	Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

On Wed, Jun 01, 2016 at 02:41:51PM -0700, Andrew Morton wrote:
> On Wed,  1 Jun 2016 08:21:09 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Recently, I got many reports about perfermance degradation in embedded
> > system(Android mobile phone, webOS TV and so on) and easy fork fail.
> > 
> > The problem was fragmentation caused by zram and GPU driver mainly.
> > With memory pressure, their pages were spread out all of pageblock and
> > it cannot be migrated with current compaction algorithm which supports
> > only LRU pages. In the end, compaction cannot work well so reclaimer
> > shrinks all of working set pages. It made system very slow and even to
> > fail to fork easily which requires order-[2 or 3] allocations.
> > 
> > Other pain point is that they cannot use CMA memory space so when OOM
> > kill happens, I can see many free pages in CMA area, which is not
> > memory efficient. In our product which has big CMA memory, it reclaims
> > zones too exccessively to allocate GPU and zram page although there are
> > lots of free space in CMA so system becomes very slow easily.
> 
> But this isn't presently implemented for GPU drivers or for CMA, yes?

For the GPU driver, Gioh implemented it, but it was proprietary so it
couldn't be contributed.

For CMA, [zram: use __GFP_MOVABLE for memory allocation] adds
__GFP_MOVABLE to zsmalloc page allocations, so they can use the CMA
area automatically now.
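
To be clear about why __GFP_MOVABLE is what makes the CMA area usable:
the page allocator only lets MIGRATE_MOVABLE requests fall back into
MIGRATE_CMA pageblocks.  Roughly (a sketch of that fallback path, not
the exact mm/page_alloc.c code; rmqueue_sketch is not a real function):

static struct page *rmqueue_sketch(struct zone *zone, unsigned int order,
                                   int migratetype)
{
        struct page *page;

        page = __rmqueue_smallest(zone, order, migratetype);
        if (unlikely(!page)) {
                /* only movable requests may dip into MIGRATE_CMA pageblocks */
                if (migratetype == MIGRATE_MOVABLE)
                        page = __rmqueue_cma_fallback(zone, order);

                if (!page)
                        page = __rmqueue_fallback(zone, order, migratetype);
        }

        return page;
}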

> 
> What's the story there?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2016-06-02  0:25     ` Minchan Kim
@ 2016-06-02 11:44       ` Vlastimil Babka
  0 siblings, 0 replies; 49+ messages in thread
From: Vlastimil Babka @ 2016-06-02 11:44 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, linux-mm, linux-kernel, Sergey Senozhatsky

On 06/02/2016 02:25 AM, Minchan Kim wrote:
> On Wed, Jun 01, 2016 at 04:09:26PM +0200, Vlastimil Babka wrote:
>> On 06/01/2016 01:21 AM, Minchan Kim wrote:
>>> +	reset_page(page);
>>> +	put_page(page);
>>> +	page = newpage;
>>> +
>>> +	ret = 0;
>>> +unpin_objects:
>>> +	for (addr = s_addr + offset; addr < s_addr + pos;
>>> +						addr += class->size) {
>>> +		head = obj_to_head(page, addr);
>>> +		if (head & OBJ_ALLOCATED_TAG) {
>>> +			handle = head & ~OBJ_ALLOCATED_TAG;
>>> +			if (!testpin_tag(handle))
>>> +				BUG();
>>> +			unpin_tag(handle);
>>> +		}
>>> +	}
>>> +	kunmap_atomic(s_addr);
>>
>> The above seems suspicious to me. In the success case, page points to
>> newpage, but s_addr is still the original one?
>
> s_addr is virtual adress of old page by kmap_atomic so page pointer of
> new page doesn't matter.

Hmm, I see. The value (head address/handle) it reads from the old page 
should be the same as the one in the newpage. And this value doesn't get 
changed in the process. So it works, it's just subtle :)
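
For completeness, a sketch of why either copy gives the same value,
assuming obj_to_head() in this series looks roughly like this:

static unsigned long obj_to_head(struct page *page, void *obj)
{
        if (unlikely(PageHugeObject(page)))
                /* huge object: the handle lives in page->index */
                return page->index;
        else
                /*
                 * normal object: the handle header lives in the object
                 * memory itself, which the migration memcpy() duplicated
                 * into newpage, so the old kmap still shows the same value
                 */
                return *(unsigned long *)obj;
}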

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
                   ` (12 preceding siblings ...)
  2016-06-01 21:41 ` [PATCH v7 00/12] Support non-lru page migration Andrew Morton
@ 2016-06-15  7:59 ` Sergey Senozhatsky
  2016-06-15 23:12   ` Minchan Kim
  13 siblings, 1 reply; 49+ messages in thread
From: Sergey Senozhatsky @ 2016-06-15  7:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-mm, linux-kernel, Vlastimil Babka,
	dri-devel, Hugh Dickins, John Einar Reitan, Jonathan Corbet,
	Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

Hello Minchan,

-next 4.7.0-rc3-next-20160614


[  315.146533] kasan: CONFIG_KASAN_INLINE enabled
[  315.146538] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  315.146546] general protection fault: 0000 [#1] PREEMPT SMP KASAN
[  315.146576] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
[  315.146785] CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
[  315.146841] task: ffff8800bfaf2900 ti: ffff880112468000 task.ti: ffff880112468000
[  315.146859] RIP: 0010:[<ffffffffa02c413d>]  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
[  315.146892] RSP: 0000:ffff88011246f138  EFLAGS: 00010293
[  315.146906] RAX: 736761742d6f6e2c RBX: ffff880017ad9a80 RCX: 0000000000000000
[  315.146924] RDX: 1ffffffff064d704 RSI: ffff88000511469a RDI: ffffffff8326ba20
[  315.146942] RBP: ffff88011246f328 R08: 0000000000000001 R09: 0000000000000000
[  315.146959] R10: ffff88011246f0a8 R11: ffff8800bfc07fff R12: ffff88011246f300
[  315.146977] R13: ffffed0015523e6f R14: ffff8800aa91f378 R15: ffffea0000144500
[  315.146995] FS:  0000000000000000(0000) GS:ffff880113780000(0000) knlGS:0000000000000000
[  315.147015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  315.147030] CR2: 00007f3f97911000 CR3: 0000000002209000 CR4: 00000000000006e0
[  315.147046] Stack:
[  315.147052]  1ffff10015523e0f ffff88011246f240 ffff880005116800 00017f80e0000000
[  315.147083]  ffff880017ad9aa8 736761742d6f6e2c 1ffff1002248de34 ffff880017ad9a90
[  315.147113]  0000069a1246f660 000000000000069a ffff880005114000 ffffea0002ff0180
[  315.147143] Call Trace:
[  315.147154]  [<ffffffffa02c3de8>] ? obj_to_head+0x9d/0x9d [zsmalloc]
[  315.147175]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
[  315.147195]  [<ffffffff812275b1>] ? isolate_freepages_block+0x2f9/0x5a6
[  315.147213]  [<ffffffff8127f15c>] ? kasan_poison_shadow+0x2f/0x31
[  315.147230]  [<ffffffff8127f66a>] ? kasan_alloc_pages+0x39/0x3b
[  315.147246]  [<ffffffff812267e6>] ? map_pages+0x1f3/0x3ad
[  315.147262]  [<ffffffff812265f3>] ? update_pageblock_skip+0x18d/0x18d
[  315.147280]  [<ffffffff81115972>] ? up_read+0x1a/0x30
[  315.147296]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
[  315.147315]  [<ffffffff812842d1>] move_to_new_page+0x4dd/0x615
[  315.147332]  [<ffffffff81283df4>] ? migrate_page+0x75/0x75
[  315.147347]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
[  315.147366]  [<ffffffff812851c1>] migrate_pages+0xadd/0x131a
[  315.147382]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
[  315.147399]  [<ffffffff81226375>] ? kzfree+0x2b/0x2b
[  315.147414]  [<ffffffff812846e4>] ? buffer_migrate_page+0x2db/0x2db
[  315.147431]  [<ffffffff8122a6cf>] compact_zone+0xcdb/0x1155
[  315.147448]  [<ffffffff812299f4>] ? compaction_suitable+0x76/0x76
[  315.147465]  [<ffffffff8122ac29>] compact_zone_order+0xe0/0x167
[  315.147481]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  315.147499]  [<ffffffff8122ab49>] ? compact_zone+0x1155/0x1155
[  315.147515]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  315.147533]  [<ffffffff8122bcff>] try_to_compact_pages+0x2f1/0x648
[  315.147550]  [<ffffffff8122bcff>] ? try_to_compact_pages+0x2f1/0x648
[  315.147568]  [<ffffffff8122ba0e>] ? compaction_zonelist_suitable+0x3a6/0x3a6
[  315.147589]  [<ffffffff811ee129>] ? get_page_from_freelist+0x2c0/0x129a
[  315.147608]  [<ffffffff811ef1ed>] __alloc_pages_direct_compact+0xea/0x30d
[  315.147626]  [<ffffffff811ef103>] ? get_page_from_freelist+0x129a/0x129a
[  315.147645]  [<ffffffff811f0422>] __alloc_pages_nodemask+0x840/0x16b6
[  315.147663]  [<ffffffff810dba27>] ? try_to_wake_up+0x696/0x6c8
[  315.149147]  [<ffffffff811efbe2>] ? warn_alloc_failed+0x226/0x226
[  315.150615]  [<ffffffff810dba69>] ? wake_up_process+0x10/0x12
[  315.152078]  [<ffffffff810dbaf4>] ? wake_up_q+0x89/0xa7
[  315.153539]  [<ffffffff81128b6f>] ? rwsem_wake+0x131/0x15c
[  315.155007]  [<ffffffff812922e7>] ? khugepaged+0x4072/0x484f
[  315.156471]  [<ffffffff8128e449>] khugepaged+0x1d4/0x484f
[  315.157940]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
[  315.159402]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  315.160870]  [<ffffffff81d31df8>] ? _raw_spin_unlock_irq+0x27/0x45
[  315.162341]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  315.163814]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
[  315.165295]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
[  315.166763]  [<ffffffff810ccde3>] kthread+0x252/0x261
[  315.168214]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
[  315.169646]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  315.171056]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
[  315.172462]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  315.173869] Code: 03 b5 60 fe ff ff e8 2e fc ff ff a8 01 74 4c 48 83 e0 fe bf 01 00 00 00 48 89 85 38 fe ff ff e8 41 18 e1 e0 48 8b 85 38 fe ff ff <f0> 0f ba 28 00 73 29 bf 01 00 00 00 41 bc f5 ff ff ff e8 ea 27 
[  315.175573] RIP  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
[  315.177084]  RSP <ffff88011246f138>
[  315.186572] ---[ end trace 0962b8ee48c98bbc ]---




[  315.186577] BUG: sleeping function called from invalid context at include/linux/sched.h:2960
[  315.186580] in_atomic(): 1, irqs_disabled(): 0, pid: 38, name: khugepaged
[  315.186581] INFO: lockdep is turned off.
[  315.186583] Preemption disabled at:[<ffffffffa02c3f1d>] zs_page_migrate+0x135/0xaa0 [zsmalloc]

[  315.186594] CPU: 3 PID: 38 Comm: khugepaged Tainted: G      D         4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
[  315.186599]  0000000000000000 ffff88011246ed58 ffffffff814d56bf ffff8800bfaf2900
[  315.186604]  0000000000000004 ffff88011246ed98 ffffffff810d5e6a 0000000000000000
[  315.186609]  ffff8800bfaf2900 ffffffff81e39820 0000000000000b90 0000000000000000
[  315.186614] Call Trace:
[  315.186618]  [<ffffffff814d56bf>] dump_stack+0x68/0x92
[  315.186622]  [<ffffffff810d5e6a>] ___might_sleep+0x3bd/0x3c9
[  315.186625]  [<ffffffff810d5fd1>] __might_sleep+0x15b/0x167
[  315.186630]  [<ffffffff810ac4c1>] exit_signals+0x7a/0x34f
[  315.186633]  [<ffffffff810ac447>] ? get_signal+0xd9b/0xd9b
[  315.186636]  [<ffffffff811aee21>] ? irq_work_queue+0x101/0x11c
[  315.186640]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  315.186645]  [<ffffffff81096357>] do_exit+0x34d/0x1b4e
[  315.186648]  [<ffffffff81130e16>] ? vprintk_emit+0x4b1/0x4d3
[  315.186652]  [<ffffffff8109600a>] ? is_current_pgrp_orphaned+0x8c/0x8c
[  315.186655]  [<ffffffff81122c56>] ? lock_acquire+0xec/0x147
[  315.186658]  [<ffffffff811321ef>] ? kmsg_dump+0x12/0x27a
[  315.186662]  [<ffffffff81132448>] ? kmsg_dump+0x26b/0x27a
[  315.186666]  [<ffffffff81036507>] oops_end+0x9d/0xa4
[  315.186669]  [<ffffffff8103662c>] die+0x55/0x5e
[  315.186672]  [<ffffffff81032aa0>] do_general_protection+0x16c/0x337
[  315.186676]  [<ffffffff81d33abf>] general_protection+0x1f/0x30
[  315.186681]  [<ffffffffa02c413d>] ? zs_page_migrate+0x355/0xaa0 [zsmalloc]
[  315.186686]  [<ffffffffa02c4136>] ? zs_page_migrate+0x34e/0xaa0 [zsmalloc]
[  315.186691]  [<ffffffffa02c3de8>] ? obj_to_head+0x9d/0x9d [zsmalloc]
[  315.186695]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
[  315.186699]  [<ffffffff812275b1>] ? isolate_freepages_block+0x2f9/0x5a6
[  315.186702]  [<ffffffff8127f15c>] ? kasan_poison_shadow+0x2f/0x31
[  315.186706]  [<ffffffff8127f66a>] ? kasan_alloc_pages+0x39/0x3b
[  315.186709]  [<ffffffff812267e6>] ? map_pages+0x1f3/0x3ad
[  315.186712]  [<ffffffff812265f3>] ? update_pageblock_skip+0x18d/0x18d
[  315.186716]  [<ffffffff81115972>] ? up_read+0x1a/0x30
[  315.186719]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
[  315.186723]  [<ffffffff812842d1>] move_to_new_page+0x4dd/0x615
[  315.186726]  [<ffffffff81283df4>] ? migrate_page+0x75/0x75
[  315.186730]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
[  315.186733]  [<ffffffff812851c1>] migrate_pages+0xadd/0x131a
[  315.186737]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
[  315.186740]  [<ffffffff81226375>] ? kzfree+0x2b/0x2b
[  315.186743]  [<ffffffff812846e4>] ? buffer_migrate_page+0x2db/0x2db
[  315.186747]  [<ffffffff8122a6cf>] compact_zone+0xcdb/0x1155
[  315.186751]  [<ffffffff812299f4>] ? compaction_suitable+0x76/0x76
[  315.186754]  [<ffffffff8122ac29>] compact_zone_order+0xe0/0x167
[  315.186757]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  315.186761]  [<ffffffff8122ab49>] ? compact_zone+0x1155/0x1155
[  315.186764]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  315.186768]  [<ffffffff8122bcff>] try_to_compact_pages+0x2f1/0x648
[  315.186771]  [<ffffffff8122bcff>] ? try_to_compact_pages+0x2f1/0x648
[  315.186775]  [<ffffffff8122ba0e>] ? compaction_zonelist_suitable+0x3a6/0x3a6
[  315.186780]  [<ffffffff811ee129>] ? get_page_from_freelist+0x2c0/0x129a
[  315.186783]  [<ffffffff811ef1ed>] __alloc_pages_direct_compact+0xea/0x30d
[  315.186787]  [<ffffffff811ef103>] ? get_page_from_freelist+0x129a/0x129a
[  315.186791]  [<ffffffff811f0422>] __alloc_pages_nodemask+0x840/0x16b6
[  315.186794]  [<ffffffff810dba27>] ? try_to_wake_up+0x696/0x6c8
[  315.186798]  [<ffffffff811efbe2>] ? warn_alloc_failed+0x226/0x226
[  315.186801]  [<ffffffff810dba69>] ? wake_up_process+0x10/0x12
[  315.186804]  [<ffffffff810dbaf4>] ? wake_up_q+0x89/0xa7
[  315.186807]  [<ffffffff81128b6f>] ? rwsem_wake+0x131/0x15c
[  315.186811]  [<ffffffff812922e7>] ? khugepaged+0x4072/0x484f
[  315.186815]  [<ffffffff8128e449>] khugepaged+0x1d4/0x484f
[  315.186819]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
[  315.186822]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  315.186826]  [<ffffffff81d31df8>] ? _raw_spin_unlock_irq+0x27/0x45
[  315.186829]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  315.186832]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
[  315.186836]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
[  315.186840]  [<ffffffff810ccde3>] kthread+0x252/0x261
[  315.186843]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
[  315.186846]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  315.186851]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
[  315.186854]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  315.186869] note: khugepaged[38] exited with preempt_count 4



[  340.319852] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [jbd2/zram0-8:405]
[  340.319856] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
[  340.319900] irq event stamp: 834296
[  340.319902] hardirqs last  enabled at (834295): [<ffffffff81280b07>] quarantine_put+0xa1/0xe6
[  340.319911] hardirqs last disabled at (834296): [<ffffffff81d31e68>] _raw_write_lock_irqsave+0x13/0x4c
[  340.319917] softirqs last  enabled at (833836): [<ffffffff81d3455e>] __do_softirq+0x406/0x48f
[  340.319922] softirqs last disabled at (833831): [<ffffffff8109914a>] irq_exit+0x6a/0x113
[  340.319929] CPU: 2 PID: 405 Comm: jbd2/zram0-8 Tainted: G      D         4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
[  340.319935] task: ffff8800bb512900 ti: ffff8800a69c0000 task.ti: ffff8800a69c0000
[  340.319937] RIP: 0010:[<ffffffff814ed772>]  [<ffffffff814ed772>] delay_tsc+0x0/0xa4
[  340.319943] RSP: 0018:ffff8800a69c70f8  EFLAGS: 00000206
[  340.319945] RAX: 0000000000000001 RBX: ffff8800aa91f300 RCX: 0000000000000000
[  340.319947] RDX: 0000000000000003 RSI: ffffffff81ed2840 RDI: 0000000000000001
[  340.319949] RBP: ffff8800a69c7100 R08: 0000000000000001 R09: 0000000000000000
[  340.319951] R10: ffff8800a69c70e8 R11: 000000007e7516b9 R12: ffff8800aa91f310
[  340.319954] R13: ffff8800aa91f308 R14: 000000001f3306fa R15: 0000000000000000
[  340.319956] FS:  0000000000000000(0000) GS:ffff880113700000(0000) knlGS:0000000000000000
[  340.319959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  340.319961] CR2: 00007fc99caba080 CR3: 00000000b9796000 CR4: 00000000000006e0
[  340.319963] Stack:
[  340.319964]  ffffffff814ed89c ffff8800a69c7148 ffffffff8112795d ffffed0015523e60
[  340.319970]  000000009e857390 ffff8800aa91f300 ffff8800bbe21cc0 ffff8800047d6f80
[  340.319975]  ffff8800a69c72b0 ffff8800aa91f300 ffff8800a69c7168 ffffffff81d31bed
[  340.319980] Call Trace:
[  340.319983]  [<ffffffff814ed89c>] ? __delay+0xa/0xc
[  340.319988]  [<ffffffff8112795d>] do_raw_spin_lock+0x197/0x257
[  340.319991]  [<ffffffff81d31bed>] _raw_spin_lock+0x35/0x3c
[  340.319998]  [<ffffffffa02c6062>] ? zs_free+0x191/0x27a [zsmalloc]
[  340.320003]  [<ffffffffa02c6062>] zs_free+0x191/0x27a [zsmalloc]
[  340.320008]  [<ffffffffa02c5ed1>] ? free_zspage+0xe8/0xe8 [zsmalloc]
[  340.320012]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
[  340.320015]  [<ffffffff810d58a6>] ? finish_task_switch+0x3b3/0x484
[  340.320021]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
[  340.320024]  [<ffffffff81d28086>] ? preempt_schedule+0x1f/0x21
[  340.320028]  [<ffffffff81d27ff9>] ? preempt_schedule_common+0xb7/0xe8
[  340.320034]  [<ffffffffa02d3f0e>] zram_free_page+0x112/0x1f6 [zram]
[  340.320039]  [<ffffffffa02d5e6c>] zram_make_request+0x45d/0x89f [zram]
[  340.320045]  [<ffffffffa02d5a0f>] ? zram_rw_page+0x21d/0x21d [zram]
[  340.320048]  [<ffffffff81493657>] ? blk_exit_rl+0x39/0x39
[  340.320053]  [<ffffffff8148fe3f>] ? handle_bad_sector+0x192/0x192
[  340.320056]  [<ffffffff8127f83e>] ? kasan_slab_alloc+0x12/0x14
[  340.320059]  [<ffffffff8127ca68>] ? kmem_cache_alloc+0xf3/0x101
[  340.320062]  [<ffffffff81494e37>] generic_make_request+0x2bc/0x496
[  340.320066]  [<ffffffff81494b7b>] ? blk_plug_queued_count+0x103/0x103
[  340.320069]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
[  340.320072]  [<ffffffff81495309>] submit_bio+0x2f8/0x324
[  340.320075]  [<ffffffff81495011>] ? generic_make_request+0x496/0x496
[  340.320078]  [<ffffffff811190fc>] ? lockdep_init_map+0x1ef/0x4b0
[  340.320082]  [<ffffffff814880a4>] submit_bio_wait+0xff/0x138
[  340.320085]  [<ffffffff81487fa5>] ? bio_add_page+0x292/0x292
[  340.320090]  [<ffffffff814ab82c>] blkdev_issue_discard+0xee/0x148
[  340.320093]  [<ffffffff814ab73e>] ? __blkdev_issue_discard+0x399/0x399
[  340.320097]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  340.320101]  [<ffffffff81404de8>] ext4_free_data_callback+0x2cc/0x8bc
[  340.320104]  [<ffffffff81404de8>] ? ext4_free_data_callback+0x2cc/0x8bc
[  340.320107]  [<ffffffff81404b1c>] ? ext4_mb_release_context+0x10aa/0x10aa
[  340.320111]  [<ffffffff81122c56>] ? lock_acquire+0xec/0x147
[  340.320115]  [<ffffffff813c8a6a>] ? ext4_journal_commit_callback+0x203/0x220
[  340.320119]  [<ffffffff813c8a61>] ext4_journal_commit_callback+0x1fa/0x220
[  340.320124]  [<ffffffff81438bf5>] jbd2_journal_commit_transaction+0x3753/0x3c20
[  340.320128]  [<ffffffff814354a2>] ? journal_submit_commit_record+0x777/0x777
[  340.320132]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
[  340.320135]  [<ffffffff811205a5>] ? __lock_acquire+0x14f9/0x33b8
[  340.320139]  [<ffffffff81d31db0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
[  340.320143]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  340.320146]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
[  340.320151]  [<ffffffff81156945>] ? try_to_del_timer_sync+0xa5/0xce
[  340.320154]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  340.320157]  [<ffffffff8143febd>] kjournald2+0x246/0x6e1
[  340.320160]  [<ffffffff8143febd>] ? kjournald2+0x246/0x6e1
[  340.320163]  [<ffffffff8143fc77>] ? commit_timeout+0xb/0xb
[  340.320167]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
[  340.320171]  [<ffffffff810ccde3>] kthread+0x252/0x261
[  340.320174]  [<ffffffff8143fc77>] ? commit_timeout+0xb/0xb
[  340.320177]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  340.320181]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
[  340.320185]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
[  340.320186] Code: 5c 5d c3 55 48 8d 04 bd 00 00 00 00 65 48 8b 15 8d 59 b2 7e 48 69 d2 fa 00 00 00 48 89 e5 f7 e2 48 8d 7a 01 e8 22 01 00 00 5d c3 <55> 48 89 e5 41 56 41 55 41 54 53 49 89 fd bf 01 00 00 00 e8 ed 

	-ss

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-15  7:59 ` Sergey Senozhatsky
@ 2016-06-15 23:12   ` Minchan Kim
  2016-06-16  2:48     ` Sergey Senozhatsky
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2016-06-15 23:12 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-mm, linux-kernel, Vlastimil Babka,
	dri-devel, Hugh Dickins, John Einar Reitan, Jonathan Corbet,
	Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

Hi Sergey,

On Wed, Jun 15, 2016 at 04:59:09PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> -next 4.7.0-rc3-next-20160614
> 
> 
> [  315.146533] kasan: CONFIG_KASAN_INLINE enabled
> [  315.146538] kasan: GPF could be caused by NULL-ptr deref or user memory access
> [  315.146546] general protection fault: 0000 [#1] PREEMPT SMP KASAN
> [  315.146576] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
> [  315.146785] CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> [  315.146841] task: ffff8800bfaf2900 ti: ffff880112468000 task.ti: ffff880112468000
> [  315.146859] RIP: 0010:[<ffffffffa02c413d>]  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]

Thanks for the report!

zs_page_migrate+0x355? Could you tell me which line that is?

It seems to be related to obj_to_head.

Could you test with [zsmalloc: keep first object offset in struct page]
in mmotm?


> [  315.146892] RSP: 0000:ffff88011246f138  EFLAGS: 00010293
> [  315.146906] RAX: 736761742d6f6e2c RBX: ffff880017ad9a80 RCX: 0000000000000000
> [  315.146924] RDX: 1ffffffff064d704 RSI: ffff88000511469a RDI: ffffffff8326ba20
> [  315.146942] RBP: ffff88011246f328 R08: 0000000000000001 R09: 0000000000000000
> [  315.146959] R10: ffff88011246f0a8 R11: ffff8800bfc07fff R12: ffff88011246f300
> [  315.146977] R13: ffffed0015523e6f R14: ffff8800aa91f378 R15: ffffea0000144500
> [  315.146995] FS:  0000000000000000(0000) GS:ffff880113780000(0000) knlGS:0000000000000000
> [  315.147015] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  315.147030] CR2: 00007f3f97911000 CR3: 0000000002209000 CR4: 00000000000006e0
> [  315.147046] Stack:
> [  315.147052]  1ffff10015523e0f ffff88011246f240 ffff880005116800 00017f80e0000000
> [  315.147083]  ffff880017ad9aa8 736761742d6f6e2c 1ffff1002248de34 ffff880017ad9a90
> [  315.147113]  0000069a1246f660 000000000000069a ffff880005114000 ffffea0002ff0180
> [  315.147143] Call Trace:
> [  315.147154]  [<ffffffffa02c3de8>] ? obj_to_head+0x9d/0x9d [zsmalloc]
> [  315.147175]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> [  315.147195]  [<ffffffff812275b1>] ? isolate_freepages_block+0x2f9/0x5a6
> [  315.147213]  [<ffffffff8127f15c>] ? kasan_poison_shadow+0x2f/0x31
> [  315.147230]  [<ffffffff8127f66a>] ? kasan_alloc_pages+0x39/0x3b
> [  315.147246]  [<ffffffff812267e6>] ? map_pages+0x1f3/0x3ad
> [  315.147262]  [<ffffffff812265f3>] ? update_pageblock_skip+0x18d/0x18d
> [  315.147280]  [<ffffffff81115972>] ? up_read+0x1a/0x30
> [  315.147296]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
> [  315.147315]  [<ffffffff812842d1>] move_to_new_page+0x4dd/0x615
> [  315.147332]  [<ffffffff81283df4>] ? migrate_page+0x75/0x75
> [  315.147347]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
> [  315.147366]  [<ffffffff812851c1>] migrate_pages+0xadd/0x131a
> [  315.147382]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
> [  315.147399]  [<ffffffff81226375>] ? kzfree+0x2b/0x2b
> [  315.147414]  [<ffffffff812846e4>] ? buffer_migrate_page+0x2db/0x2db
> [  315.147431]  [<ffffffff8122a6cf>] compact_zone+0xcdb/0x1155
> [  315.147448]  [<ffffffff812299f4>] ? compaction_suitable+0x76/0x76
> [  315.147465]  [<ffffffff8122ac29>] compact_zone_order+0xe0/0x167
> [  315.147481]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  315.147499]  [<ffffffff8122ab49>] ? compact_zone+0x1155/0x1155
> [  315.147515]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  315.147533]  [<ffffffff8122bcff>] try_to_compact_pages+0x2f1/0x648
> [  315.147550]  [<ffffffff8122bcff>] ? try_to_compact_pages+0x2f1/0x648
> [  315.147568]  [<ffffffff8122ba0e>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> [  315.147589]  [<ffffffff811ee129>] ? get_page_from_freelist+0x2c0/0x129a
> [  315.147608]  [<ffffffff811ef1ed>] __alloc_pages_direct_compact+0xea/0x30d
> [  315.147626]  [<ffffffff811ef103>] ? get_page_from_freelist+0x129a/0x129a
> [  315.147645]  [<ffffffff811f0422>] __alloc_pages_nodemask+0x840/0x16b6
> [  315.147663]  [<ffffffff810dba27>] ? try_to_wake_up+0x696/0x6c8
> [  315.149147]  [<ffffffff811efbe2>] ? warn_alloc_failed+0x226/0x226
> [  315.150615]  [<ffffffff810dba69>] ? wake_up_process+0x10/0x12
> [  315.152078]  [<ffffffff810dbaf4>] ? wake_up_q+0x89/0xa7
> [  315.153539]  [<ffffffff81128b6f>] ? rwsem_wake+0x131/0x15c
> [  315.155007]  [<ffffffff812922e7>] ? khugepaged+0x4072/0x484f
> [  315.156471]  [<ffffffff8128e449>] khugepaged+0x1d4/0x484f
> [  315.157940]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
> [  315.159402]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  315.160870]  [<ffffffff81d31df8>] ? _raw_spin_unlock_irq+0x27/0x45
> [  315.162341]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
> [  315.163814]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
> [  315.165295]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
> [  315.166763]  [<ffffffff810ccde3>] kthread+0x252/0x261
> [  315.168214]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
> [  315.169646]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  315.171056]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
> [  315.172462]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  315.173869] Code: 03 b5 60 fe ff ff e8 2e fc ff ff a8 01 74 4c 48 83 e0 fe bf 01 00 00 00 48 89 85 38 fe ff ff e8 41 18 e1 e0 48 8b 85 38 fe ff ff <f0> 0f ba 28 00 73 29 bf 01 00 00 00 41 bc f5 ff ff ff e8 ea 27 
> [  315.175573] RIP  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
> [  315.177084]  RSP <ffff88011246f138>
> [  315.186572] ---[ end trace 0962b8ee48c98bbc ]---
> 
> 
> 
> 
> [  315.186577] BUG: sleeping function called from invalid context at include/linux/sched.h:2960
> [  315.186580] in_atomic(): 1, irqs_disabled(): 0, pid: 38, name: khugepaged
> [  315.186581] INFO: lockdep is turned off.
> [  315.186583] Preemption disabled at:[<ffffffffa02c3f1d>] zs_page_migrate+0x135/0xaa0 [zsmalloc]
> 
> [  315.186594] CPU: 3 PID: 38 Comm: khugepaged Tainted: G      D         4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> [  315.186599]  0000000000000000 ffff88011246ed58 ffffffff814d56bf ffff8800bfaf2900
> [  315.186604]  0000000000000004 ffff88011246ed98 ffffffff810d5e6a 0000000000000000
> [  315.186609]  ffff8800bfaf2900 ffffffff81e39820 0000000000000b90 0000000000000000
> [  315.186614] Call Trace:
> [  315.186618]  [<ffffffff814d56bf>] dump_stack+0x68/0x92
> [  315.186622]  [<ffffffff810d5e6a>] ___might_sleep+0x3bd/0x3c9
> [  315.186625]  [<ffffffff810d5fd1>] __might_sleep+0x15b/0x167
> [  315.186630]  [<ffffffff810ac4c1>] exit_signals+0x7a/0x34f
> [  315.186633]  [<ffffffff810ac447>] ? get_signal+0xd9b/0xd9b
> [  315.186636]  [<ffffffff811aee21>] ? irq_work_queue+0x101/0x11c
> [  315.186640]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  315.186645]  [<ffffffff81096357>] do_exit+0x34d/0x1b4e
> [  315.186648]  [<ffffffff81130e16>] ? vprintk_emit+0x4b1/0x4d3
> [  315.186652]  [<ffffffff8109600a>] ? is_current_pgrp_orphaned+0x8c/0x8c
> [  315.186655]  [<ffffffff81122c56>] ? lock_acquire+0xec/0x147
> [  315.186658]  [<ffffffff811321ef>] ? kmsg_dump+0x12/0x27a
> [  315.186662]  [<ffffffff81132448>] ? kmsg_dump+0x26b/0x27a
> [  315.186666]  [<ffffffff81036507>] oops_end+0x9d/0xa4
> [  315.186669]  [<ffffffff8103662c>] die+0x55/0x5e
> [  315.186672]  [<ffffffff81032aa0>] do_general_protection+0x16c/0x337
> [  315.186676]  [<ffffffff81d33abf>] general_protection+0x1f/0x30
> [  315.186681]  [<ffffffffa02c413d>] ? zs_page_migrate+0x355/0xaa0 [zsmalloc]
> [  315.186686]  [<ffffffffa02c4136>] ? zs_page_migrate+0x34e/0xaa0 [zsmalloc]
> [  315.186691]  [<ffffffffa02c3de8>] ? obj_to_head+0x9d/0x9d [zsmalloc]
> [  315.186695]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> [  315.186699]  [<ffffffff812275b1>] ? isolate_freepages_block+0x2f9/0x5a6
> [  315.186702]  [<ffffffff8127f15c>] ? kasan_poison_shadow+0x2f/0x31
> [  315.186706]  [<ffffffff8127f66a>] ? kasan_alloc_pages+0x39/0x3b
> [  315.186709]  [<ffffffff812267e6>] ? map_pages+0x1f3/0x3ad
> [  315.186712]  [<ffffffff812265f3>] ? update_pageblock_skip+0x18d/0x18d
> [  315.186716]  [<ffffffff81115972>] ? up_read+0x1a/0x30
> [  315.186719]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
> [  315.186723]  [<ffffffff812842d1>] move_to_new_page+0x4dd/0x615
> [  315.186726]  [<ffffffff81283df4>] ? migrate_page+0x75/0x75
> [  315.186730]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
> [  315.186733]  [<ffffffff812851c1>] migrate_pages+0xadd/0x131a
> [  315.186737]  [<ffffffff8122785e>] ? isolate_freepages_block+0x5a6/0x5a6
> [  315.186740]  [<ffffffff81226375>] ? kzfree+0x2b/0x2b
> [  315.186743]  [<ffffffff812846e4>] ? buffer_migrate_page+0x2db/0x2db
> [  315.186747]  [<ffffffff8122a6cf>] compact_zone+0xcdb/0x1155
> [  315.186751]  [<ffffffff812299f4>] ? compaction_suitable+0x76/0x76
> [  315.186754]  [<ffffffff8122ac29>] compact_zone_order+0xe0/0x167
> [  315.186757]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  315.186761]  [<ffffffff8122ab49>] ? compact_zone+0x1155/0x1155
> [  315.186764]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  315.186768]  [<ffffffff8122bcff>] try_to_compact_pages+0x2f1/0x648
> [  315.186771]  [<ffffffff8122bcff>] ? try_to_compact_pages+0x2f1/0x648
> [  315.186775]  [<ffffffff8122ba0e>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> [  315.186780]  [<ffffffff811ee129>] ? get_page_from_freelist+0x2c0/0x129a
> [  315.186783]  [<ffffffff811ef1ed>] __alloc_pages_direct_compact+0xea/0x30d
> [  315.186787]  [<ffffffff811ef103>] ? get_page_from_freelist+0x129a/0x129a
> [  315.186791]  [<ffffffff811f0422>] __alloc_pages_nodemask+0x840/0x16b6
> [  315.186794]  [<ffffffff810dba27>] ? try_to_wake_up+0x696/0x6c8
> [  315.186798]  [<ffffffff811efbe2>] ? warn_alloc_failed+0x226/0x226
> [  315.186801]  [<ffffffff810dba69>] ? wake_up_process+0x10/0x12
> [  315.186804]  [<ffffffff810dbaf4>] ? wake_up_q+0x89/0xa7
> [  315.186807]  [<ffffffff81128b6f>] ? rwsem_wake+0x131/0x15c
> [  315.186811]  [<ffffffff812922e7>] ? khugepaged+0x4072/0x484f
> [  315.186815]  [<ffffffff8128e449>] khugepaged+0x1d4/0x484f
> [  315.186819]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
> [  315.186822]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  315.186826]  [<ffffffff81d31df8>] ? _raw_spin_unlock_irq+0x27/0x45
> [  315.186829]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
> [  315.186832]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
> [  315.186836]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
> [  315.186840]  [<ffffffff810ccde3>] kthread+0x252/0x261
> [  315.186843]  [<ffffffff8128e275>] ? hugepage_vma_revalidate+0xef/0xef
> [  315.186846]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  315.186851]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
> [  315.186854]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  315.186869] note: khugepaged[38] exited with preempt_count 4
> 
> 
> 
> [  340.319852] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [jbd2/zram0-8:405]
> [  340.319856] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
> [  340.319900] irq event stamp: 834296
> [  340.319902] hardirqs last  enabled at (834295): [<ffffffff81280b07>] quarantine_put+0xa1/0xe6
> [  340.319911] hardirqs last disabled at (834296): [<ffffffff81d31e68>] _raw_write_lock_irqsave+0x13/0x4c
> [  340.319917] softirqs last  enabled at (833836): [<ffffffff81d3455e>] __do_softirq+0x406/0x48f
> [  340.319922] softirqs last disabled at (833831): [<ffffffff8109914a>] irq_exit+0x6a/0x113
> [  340.319929] CPU: 2 PID: 405 Comm: jbd2/zram0-8 Tainted: G      D         4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> [  340.319935] task: ffff8800bb512900 ti: ffff8800a69c0000 task.ti: ffff8800a69c0000
> [  340.319937] RIP: 0010:[<ffffffff814ed772>]  [<ffffffff814ed772>] delay_tsc+0x0/0xa4
> [  340.319943] RSP: 0018:ffff8800a69c70f8  EFLAGS: 00000206
> [  340.319945] RAX: 0000000000000001 RBX: ffff8800aa91f300 RCX: 0000000000000000
> [  340.319947] RDX: 0000000000000003 RSI: ffffffff81ed2840 RDI: 0000000000000001
> [  340.319949] RBP: ffff8800a69c7100 R08: 0000000000000001 R09: 0000000000000000
> [  340.319951] R10: ffff8800a69c70e8 R11: 000000007e7516b9 R12: ffff8800aa91f310
> [  340.319954] R13: ffff8800aa91f308 R14: 000000001f3306fa R15: 0000000000000000
> [  340.319956] FS:  0000000000000000(0000) GS:ffff880113700000(0000) knlGS:0000000000000000
> [  340.319959] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  340.319961] CR2: 00007fc99caba080 CR3: 00000000b9796000 CR4: 00000000000006e0
> [  340.319963] Stack:
> [  340.319964]  ffffffff814ed89c ffff8800a69c7148 ffffffff8112795d ffffed0015523e60
> [  340.319970]  000000009e857390 ffff8800aa91f300 ffff8800bbe21cc0 ffff8800047d6f80
> [  340.319975]  ffff8800a69c72b0 ffff8800aa91f300 ffff8800a69c7168 ffffffff81d31bed
> [  340.319980] Call Trace:
> [  340.319983]  [<ffffffff814ed89c>] ? __delay+0xa/0xc
> [  340.319988]  [<ffffffff8112795d>] do_raw_spin_lock+0x197/0x257
> [  340.319991]  [<ffffffff81d31bed>] _raw_spin_lock+0x35/0x3c
> [  340.319998]  [<ffffffffa02c6062>] ? zs_free+0x191/0x27a [zsmalloc]
> [  340.320003]  [<ffffffffa02c6062>] zs_free+0x191/0x27a [zsmalloc]
> [  340.320008]  [<ffffffffa02c5ed1>] ? free_zspage+0xe8/0xe8 [zsmalloc]
> [  340.320012]  [<ffffffff810d58d1>] ? finish_task_switch+0x3de/0x484
> [  340.320015]  [<ffffffff810d58a6>] ? finish_task_switch+0x3b3/0x484
> [  340.320021]  [<ffffffff81d27ad5>] ? __schedule+0xa4d/0xd16
> [  340.320024]  [<ffffffff81d28086>] ? preempt_schedule+0x1f/0x21
> [  340.320028]  [<ffffffff81d27ff9>] ? preempt_schedule_common+0xb7/0xe8
> [  340.320034]  [<ffffffffa02d3f0e>] zram_free_page+0x112/0x1f6 [zram]
> [  340.320039]  [<ffffffffa02d5e6c>] zram_make_request+0x45d/0x89f [zram]
> [  340.320045]  [<ffffffffa02d5a0f>] ? zram_rw_page+0x21d/0x21d [zram]
> [  340.320048]  [<ffffffff81493657>] ? blk_exit_rl+0x39/0x39
> [  340.320053]  [<ffffffff8148fe3f>] ? handle_bad_sector+0x192/0x192
> [  340.320056]  [<ffffffff8127f83e>] ? kasan_slab_alloc+0x12/0x14
> [  340.320059]  [<ffffffff8127ca68>] ? kmem_cache_alloc+0xf3/0x101
> [  340.320062]  [<ffffffff81494e37>] generic_make_request+0x2bc/0x496
> [  340.320066]  [<ffffffff81494b7b>] ? blk_plug_queued_count+0x103/0x103
> [  340.320069]  [<ffffffff8111ec7e>] ? debug_check_no_locks_freed+0x150/0x22b
> [  340.320072]  [<ffffffff81495309>] submit_bio+0x2f8/0x324
> [  340.320075]  [<ffffffff81495011>] ? generic_make_request+0x496/0x496
> [  340.320078]  [<ffffffff811190fc>] ? lockdep_init_map+0x1ef/0x4b0
> [  340.320082]  [<ffffffff814880a4>] submit_bio_wait+0xff/0x138
> [  340.320085]  [<ffffffff81487fa5>] ? bio_add_page+0x292/0x292
> [  340.320090]  [<ffffffff814ab82c>] blkdev_issue_discard+0xee/0x148
> [  340.320093]  [<ffffffff814ab73e>] ? __blkdev_issue_discard+0x399/0x399
> [  340.320097]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  340.320101]  [<ffffffff81404de8>] ext4_free_data_callback+0x2cc/0x8bc
> [  340.320104]  [<ffffffff81404de8>] ? ext4_free_data_callback+0x2cc/0x8bc
> [  340.320107]  [<ffffffff81404b1c>] ? ext4_mb_release_context+0x10aa/0x10aa
> [  340.320111]  [<ffffffff81122c56>] ? lock_acquire+0xec/0x147
> [  340.320115]  [<ffffffff813c8a6a>] ? ext4_journal_commit_callback+0x203/0x220
> [  340.320119]  [<ffffffff813c8a61>] ext4_journal_commit_callback+0x1fa/0x220
> [  340.320124]  [<ffffffff81438bf5>] jbd2_journal_commit_transaction+0x3753/0x3c20
> [  340.320128]  [<ffffffff814354a2>] ? journal_submit_commit_record+0x777/0x777
> [  340.320132]  [<ffffffff8111f0ac>] ? debug_show_all_locks+0x226/0x226
> [  340.320135]  [<ffffffff811205a5>] ? __lock_acquire+0x14f9/0x33b8
> [  340.320139]  [<ffffffff81d31db0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
> [  340.320143]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
> [  340.320146]  [<ffffffff81d31dbc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> [  340.320151]  [<ffffffff81156945>] ? try_to_del_timer_sync+0xa5/0xce
> [  340.320154]  [<ffffffff8111cde6>] ? trace_hardirqs_on_caller+0x3d2/0x492
> [  340.320157]  [<ffffffff8143febd>] kjournald2+0x246/0x6e1
> [  340.320160]  [<ffffffff8143febd>] ? kjournald2+0x246/0x6e1
> [  340.320163]  [<ffffffff8143fc77>] ? commit_timeout+0xb/0xb
> [  340.320167]  [<ffffffff8111112e>] ? prepare_to_wait_event+0x3f7/0x3f7
> [  340.320171]  [<ffffffff810ccde3>] kthread+0x252/0x261
> [  340.320174]  [<ffffffff8143fc77>] ? commit_timeout+0xb/0xb
> [  340.320177]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  340.320181]  [<ffffffff81d3277f>] ret_from_fork+0x1f/0x40
> [  340.320185]  [<ffffffff810ccb91>] ? kthread_create_on_node+0x377/0x377
> [  340.320186] Code: 5c 5d c3 55 48 8d 04 bd 00 00 00 00 65 48 8b 15 8d 59 b2 7e 48 69 d2 fa 00 00 00 48 89 e5 f7 e2 48 8d 7a 01 e8 22 01 00 00 5d c3 <55> 48 89 e5 41 56 41 55 41 54 53 49 89 fd bf 01 00 00 00 e8 ed 
> 
> 	-ss


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-15 23:12   ` Minchan Kim
@ 2016-06-16  2:48     ` Sergey Senozhatsky
  2016-06-16  2:58       ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Sergey Senozhatsky @ 2016-06-16  2:48 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-mm, linux-kernel,
	Vlastimil Babka, dri-devel, Hugh Dickins, John Einar Reitan,
	Jonathan Corbet, Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman,
	Naoya Horiguchi, Rafael Aquini, Rik van Riel, Sergey Senozhatsky,
	virtualization, Gioh Kim, Chan Gyun Jeong, Sangseok Lee,
	Kyeongdon Kim, Chulmin Kim

Hi,

On (06/16/16 08:12), Minchan Kim wrote:
> > [  315.146533] kasan: CONFIG_KASAN_INLINE enabled
> > [  315.146538] kasan: GPF could be caused by NULL-ptr deref or user memory access
> > [  315.146546] general protection fault: 0000 [#1] PREEMPT SMP KASAN
> > [  315.146576] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
> > [  315.146785] CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> > [  315.146841] task: ffff8800bfaf2900 ti: ffff880112468000 task.ti: ffff880112468000
> > [  315.146859] RIP: 0010:[<ffffffffa02c413d>]  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
> 
> Thanks for the report!
> 
> zs_page_migrate+0x355? Could you tell me what line is it?
> 
> It seems to be related to obj_to_head.

reproduced. a bit different call stack this time, but the problem is
still the same.

zs_compact()
...
    6371:       e8 00 00 00 00          callq  6376 <zs_compact+0x22b>
    6376:       0f 0b                   ud2    
    6378:       48 8b 95 a8 fe ff ff    mov    -0x158(%rbp),%rdx
    637f:       4d 8d 74 24 78          lea    0x78(%r12),%r14
    6384:       4c 89 ee                mov    %r13,%rsi
    6387:       4c 89 e7                mov    %r12,%rdi
    638a:       e8 86 c7 ff ff          callq  2b15 <get_first_obj_offset>
    638f:       41 89 c5                mov    %eax,%r13d
    6392:       4c 89 f0                mov    %r14,%rax
    6395:       48 c1 e8 03             shr    $0x3,%rax
    6399:       8a 04 18                mov    (%rax,%rbx,1),%al
    639c:       84 c0                   test   %al,%al
    639e:       0f 85 f2 02 00 00       jne    6696 <zs_compact+0x54b>
    63a4:       41 8b 44 24 78          mov    0x78(%r12),%eax
    63a9:       41 0f af c7             imul   %r15d,%eax
    63ad:       41 01 c5                add    %eax,%r13d
    63b0:       4c 89 f0                mov    %r14,%rax
    63b3:       48 c1 e8 03             shr    $0x3,%rax
    63b7:       48 01 d8                add    %rbx,%rax
    63ba:       48 89 85 88 fe ff ff    mov    %rax,-0x178(%rbp)
    63c1:       41 81 fd ff 0f 00 00    cmp    $0xfff,%r13d
    63c8:       0f 87 1a 03 00 00       ja     66e8 <zs_compact+0x59d>
    63ce:       49 63 f5                movslq %r13d,%rsi
    63d1:       48 03 b5 98 fe ff ff    add    -0x168(%rbp),%rsi
    63d8:       48 8b bd a8 fe ff ff    mov    -0x158(%rbp),%rdi
    63df:       e8 67 d9 ff ff          callq  3d4b <obj_to_head>
    63e4:       a8 01                   test   $0x1,%al
    63e6:       0f 84 d9 02 00 00       je     66c5 <zs_compact+0x57a>
    63ec:       48 83 e0 fe             and    $0xfffffffffffffffe,%rax
    63f0:       bf 01 00 00 00          mov    $0x1,%edi
    63f5:       48 89 85 b0 fe ff ff    mov    %rax,-0x150(%rbp)
    63fc:       e8 00 00 00 00          callq  6401 <zs_compact+0x2b6>
    6401:       48 8b 85 b0 fe ff ff    mov    -0x150(%rbp),%rax
					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    6408:       f0 0f ba 28 00          lock btsl $0x0,(%rax)
    640d:       0f 82 98 02 00 00       jb     66ab <zs_compact+0x560>
    6413:       48 8b 85 10 fe ff ff    mov    -0x1f0(%rbp),%rax
    641a:       48 8d b8 48 10 00 00    lea    0x1048(%rax),%rdi
    6421:       48 89 f8                mov    %rdi,%rax
    6424:       48 c1 e8 03             shr    $0x3,%rax
    6428:       8a 04 18                mov    (%rax,%rbx,1),%al
    642b:       84 c0                   test   %al,%al
    642d:       0f 85 c5 02 00 00       jne    66f8 <zs_compact+0x5ad>
    6433:       48 8b 85 10 fe ff ff    mov    -0x1f0(%rbp),%rax
    643a:       65 4c 8b 2c 25 00 00    mov    %gs:0x0,%r13
    6441:       00 00 
    6443:       49 8d bd 48 10 00 00    lea    0x1048(%r13),%rdi
    644a:       ff 88 48 10 00 00       decl   0x1048(%rax)
    6450:       48 89 f8                mov    %rdi,%rax
    6453:       48 c1 e8 03             shr    $0x3,%rax
    6457:       8a 04 18                mov    (%rax,%rbx,1),%al
    645a:       84 c0                   test   %al,%al
    645c:       0f 85 a8 02 00 00       jne    670a <zs_compact+0x5bf>
    6462:       41 83 bd 48 10 00 00    cmpl   $0x0,0x1048(%r13)


which is

_next/./arch/x86/include/asm/bitops.h:206
_next/./arch/x86/include/asm/bitops.h:219
_next/include/linux/bit_spinlock.h:44
_next/mm/zsmalloc.c:950
_next/mm/zsmalloc.c:1774
_next/mm/zsmalloc.c:1809
_next/mm/zsmalloc.c:2306
_next/mm/zsmalloc.c:2346


smells like a race condition.
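
for reference, the bit_spinlock.h lines above resolve to the lock-acquire
loop, which is roughly the following (trimmed, may differ slightly from
this tree):

static inline void bit_spin_lock(int bitnum, unsigned long *addr)
{
	preempt_disable();
	/* test_and_set_bit_lock() is the "lock btsl" in the disassembly */
	while (unlikely(test_and_set_bit_lock(bitnum, addr))) {
		preempt_enable();
		do {
			cpu_relax();
		} while (test_bit(bitnum, addr));
		preempt_disable();
	}
}

here addr is the handle value that obj_to_head() returned (bit 0 is the
pin bit), so if obj_to_head() hands back garbage, the very first
test_and_set_bit_lock() dereferences that garbage -- which matches the
GPF at "lock btsl $0x0,(%rax)" with a bogus RAX in the backtrace below.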



backtraces:

[  319.363646] kasan: CONFIG_KASAN_INLINE enabled
[  319.363650] kasan: GPF could be caused by NULL-ptr deref or user memory access
[  319.363658] general protection fault: 0000 [#1] PREEMPT SMP KASAN
[  319.363688] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek snd_hda_codec_generic r8169 mii i2c_i801 snd_hda_intel snd_hda_codec snd_hda_core snd_pcm snd_timer acpi_cpufreq snd lpc_ich soundcore mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci ehci_pci libata ehci_hcd usbcore scsi_mod usb_common
[  319.363895] CPU: 0 PID: 45 Comm: kswapd0 Not tainted 4.7.0-rc3-next-20160615-dbg-00004-g550dc8a-dirty #490
[  319.363950] task: ffff8800bfb93d80 ti: ffff880112200000 task.ti: ffff880112200000
[  319.363968] RIP: 0010:[<ffffffffa03ce408>]  [<ffffffffa03ce408>] zs_compact+0x2bd/0xf22 [zsmalloc]
[  319.364000] RSP: 0018:ffff8801122077f8  EFLAGS: 00010293
[  319.364014] RAX: 2065676162726166 RBX: dffffc0000000000 RCX: 0000000000000000
[  319.364032] RDX: 1ffffffff064c504 RSI: ffff88003217c770 RDI: ffffffff83262ae0
[  319.364049] RBP: ffff880112207a18 R08: 0000000000000001 R09: 0000000000000000
[  319.364067] R10: ffff880112207768 R11: 00000000a19f2c26 R12: ffff8800a7caab00
[  319.364085] R13: 0000000000000770 R14: ffff8800a7caab78 R15: 0000000000000000
[  319.364103] FS:  0000000000000000(0000) GS:ffff880113600000(0000) knlGS:0000000000000000
[  319.364123] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  319.364138] CR2: 00007fa154633d70 CR3: 00000000b183d000 CR4: 00000000000006f0
[  319.364154] Stack:
[  319.364160]  ffffed00163d6a81 1ffff10017f729b9 ffff8800bfb944a0 ffffed0017f729b9
[  319.364191]  ffff8800bfb93d80 ffff8800b1eb5408 ffff8800bfb93d80 ffff8800bfb94dc8
[  319.364222]  ffff8800bfb944f8 ffff880000000001 1ffff10022440f1a 0000000041b58ab3
[  319.364252] Call Trace:
[  319.364264]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  319.364284]  [<ffffffffa03ce14b>] ? zs_free+0x27a/0x27a [zsmalloc]
[  319.364303]  [<ffffffff812303e3>] ? list_lru_count_one+0x65/0x6d
[  319.364320]  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
[  319.364336]  [<ffffffff812303b7>] ? list_lru_count_one+0x39/0x6d
[  319.364353]  [<ffffffff81d32e4f>] ? _raw_spin_unlock+0x2c/0x3f
[  319.364371]  [<ffffffffa03cf0a8>] zs_shrinker_scan+0x3b/0x4e [zsmalloc]
[  319.364391]  [<ffffffff81204eef>] shrink_slab.part.5.constprop.17+0x2e4/0x432
[  319.364411]  [<ffffffff81204c0b>] ? cpu_callback+0xb0/0xb0
[  319.364426]  [<ffffffff8120bfbc>] shrink_zone+0x19b/0x416
[  319.364442]  [<ffffffff8120be21>] ? shrink_zone_memcg.isra.14+0xd08/0xd08
[  319.364461]  [<ffffffff811f0b10>] ? zone_watermark_ok_safe+0x1e9/0x1f8
[  319.364478]  [<ffffffff81205fd7>] ? zone_reclaimable+0x14b/0x170
[  319.364495]  [<ffffffff8120d2fb>] kswapd+0xaad/0xcee
[  319.364510]  [<ffffffff8120c84e>] ? try_to_free_pages+0x617/0x617
[  319.364527]  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  319.364545]  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
[  319.364564]  [<ffffffff810cd0de>] kthread+0x252/0x261
[  319.364578]  [<ffffffff8120c84e>] ? try_to_free_pages+0x617/0x617
[  319.364595]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  319.364614]  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
[  319.364629]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  319.364645] Code: ff ff e8 67 d9 ff ff a8 01 0f 84 d9 02 00 00 48 83 e0 fe bf 01 00 00 00 48 89 85 b0 fe ff ff e8 71 78 d0 e0 48 8b 85 b0 fe ff ff <f0> 0f ba 28 00 0f 82 98 02 00 00 48 8b 85 10 fe ff ff 48 8d b8 
[  319.364913] RIP  [<ffffffffa03ce408>] zs_compact+0x2bd/0xf22 [zsmalloc]
[  319.364937]  RSP <ffff8801122077f8>
[  319.372870] ---[ end trace bcefd5a456f6b462 ]---



[  319.372875] BUG: sleeping function called from invalid context at include/linux/sched.h:2960
[  319.372877] in_atomic(): 1, irqs_disabled(): 0, pid: 45, name: kswapd0
[  319.372879] INFO: lockdep is turned off.
[  319.372880] Preemption disabled at:[<ffffffffa03ce2c3>] zs_compact+0x178/0xf22 [zsmalloc]

[  319.372891] CPU: 0 PID: 45 Comm: kswapd0 Tainted: G      D         4.7.0-rc3-next-20160615-dbg-00004-g550dc8a-dirty #490
[  319.372895]  0000000000000000 ffff880112207418 ffffffff814d69b0 ffff8800bfb93d80
[  319.372901]  0000000000000003 ffff880112207458 ffffffff810d6165 0000000000000000
[  319.372906]  ffff8800bfb93d80 ffffffff81e39860 0000000000000b90 0000000000000000
[  319.372911] Call Trace:
[  319.372915]  [<ffffffff814d69b0>] dump_stack+0x68/0x92
[  319.372919]  [<ffffffff810d6165>] ___might_sleep+0x3bd/0x3c9
[  319.372922]  [<ffffffff810d62cc>] __might_sleep+0x15b/0x167
[  319.372927]  [<ffffffff810ac7bf>] exit_signals+0x7a/0x34f
[  319.372931]  [<ffffffff810ac745>] ? get_signal+0xd9b/0xd9b
[  319.372934]  [<ffffffff811af758>] ? irq_work_queue+0x101/0x11c
[  319.372938]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  319.372943]  [<ffffffff81096655>] do_exit+0x34d/0x1b4e
[  319.372947]  [<ffffffff8113119f>] ? vprintk_emit+0x4b1/0x4d3
[  319.372951]  [<ffffffff81096308>] ? is_current_pgrp_orphaned+0x8c/0x8c
[  319.372954]  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
[  319.372957]  [<ffffffff81132578>] ? kmsg_dump+0x12/0x27a
[  319.372961]  [<ffffffff811327d1>] ? kmsg_dump+0x26b/0x27a
[  319.372965]  [<ffffffff81036507>] oops_end+0x9d/0xa4
[  319.372968]  [<ffffffff81036641>] die+0x55/0x5e
[  319.372971]  [<ffffffff81032aa0>] do_general_protection+0x16c/0x337
[  319.372975]  [<ffffffff81d34bbf>] general_protection+0x1f/0x30
[  319.372981]  [<ffffffffa03ce408>] ? zs_compact+0x2bd/0xf22 [zsmalloc]
[  319.372986]  [<ffffffffa03ce401>] ? zs_compact+0x2b6/0xf22 [zsmalloc]
[  319.372989]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  319.372995]  [<ffffffffa03ce14b>] ? zs_free+0x27a/0x27a [zsmalloc]
[  319.372999]  [<ffffffff812303e3>] ? list_lru_count_one+0x65/0x6d
[  319.373002]  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
[  319.373005]  [<ffffffff812303b7>] ? list_lru_count_one+0x39/0x6d
[  319.373009]  [<ffffffff81d32e4f>] ? _raw_spin_unlock+0x2c/0x3f
[  319.373014]  [<ffffffffa03cf0a8>] zs_shrinker_scan+0x3b/0x4e [zsmalloc]
[  319.373018]  [<ffffffff81204eef>] shrink_slab.part.5.constprop.17+0x2e4/0x432
[  319.373022]  [<ffffffff81204c0b>] ? cpu_callback+0xb0/0xb0
[  319.373025]  [<ffffffff8120bfbc>] shrink_zone+0x19b/0x416
[  319.373029]  [<ffffffff8120be21>] ? shrink_zone_memcg.isra.14+0xd08/0xd08
[  319.373032]  [<ffffffff811f0b10>] ? zone_watermark_ok_safe+0x1e9/0x1f8
[  319.373036]  [<ffffffff81205fd7>] ? zone_reclaimable+0x14b/0x170
[  319.373039]  [<ffffffff8120d2fb>] kswapd+0xaad/0xcee
[  319.373043]  [<ffffffff8120c84e>] ? try_to_free_pages+0x617/0x617
[  319.373046]  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
[  319.373050]  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
[  319.373054]  [<ffffffff810cd0de>] kthread+0x252/0x261
[  319.373057]  [<ffffffff8120c84e>] ? try_to_free_pages+0x617/0x617
[  319.373060]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  319.373064]  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
[  319.373068]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377


[  319.373071] note: kswapd0[45] exited with preempt_count 3
[  322.891083] kmemleak: Cannot allocate a kmemleak_object structure


[  322.891091] kmemleak: Kernel memory leak detector disabled
[  322.891194] kmemleak: Automatic memory scanning thread ended


[  344.264076] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/u8:3:108]
[  344.264080] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek snd_hda_codec_generic r8169 mii i2c_i801 snd_hda_intel snd_hda_codec snd_hda_core snd_pcm snd_timer acpi_cpufreq snd lpc_ich soundcore mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci ehci_pci libata ehci_hcd usbcore scsi_mod usb_common
[  344.264118] irq event stamp: 13848655
[  344.264119] hardirqs last  enabled at (13848655): [<ffffffff8127dbd8>] __slab_alloc.isra.18.constprop.23+0x53/0x61
[  344.264127] hardirqs last disabled at (13848654): [<ffffffff8127db9e>] __slab_alloc.isra.18.constprop.23+0x19/0x61
[  344.264131] softirqs last  enabled at (13848614): [<ffffffff81d3565e>] __do_softirq+0x406/0x48f
[  344.264136] softirqs last disabled at (13848593): [<ffffffff81099448>] irq_exit+0x6a/0x113
[  344.264143] CPU: 1 PID: 108 Comm: kworker/u8:3 Tainted: G      D         4.7.0-rc3-next-20160615-dbg-00004-g550dc8a-dirty #490
[  344.264151] Workqueue: writeback wb_workfn (flush-254:0)
[  344.264155] task: ffff8800ba1c2900 ti: ffff8801122a0000 task.ti: ffff8801122a0000
[  344.264157] RIP: 0010:[<ffffffff814eeae3>]  [<ffffffff814eeae3>] delay_tsc+0x81/0xa4
[  344.264162] RSP: 0018:ffff8801122a70d0  EFLAGS: 00000206
[  344.264164] RAX: 000000000000001c RBX: 000000dc3a548e47 RCX: 0000000000000000
[  344.264166] RDX: 000000dc3a548e63 RSI: ffffffff81ed2e80 RDI: ffffffff81ed2ec0
[  344.264168] RBP: ffff8801122a70f0 R08: 0000000000000001 R09: 0000000000000000
[  344.264170] R10: ffff8801122a70e8 R11: 0000000045cb5d4f R12: 000000dc3a548e63
[  344.264172] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
[  344.264175] FS:  0000000000000000(0000) GS:ffff880113680000(0000) knlGS:0000000000000000
[  344.264177] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  344.264179] CR2: 00007fa26a978978 CR3: 0000000002209000 CR4: 00000000000006e0
[  344.264180] Stack:
[  344.264181]  ffff8800a7caab00 ffff8800a7caab10 ffff8800a7caab08 0000000022af534e
[  344.264186]  ffff8801122a7100 ffffffff814eeb8c ffff8801122a7148 ffffffff81127ce6
[  344.264191]  ffffed0014f95560 000000009e85cd68 ffff8800a7caab00 ffff8800a7caab58
[  344.264196] Call Trace:
[  344.264199]  [<ffffffff814eeb8c>] __delay+0xa/0xc
[  344.264203]  [<ffffffff81127ce6>] do_raw_spin_lock+0x197/0x257
[  344.264206]  [<ffffffff81d32d0d>] _raw_spin_lock+0x35/0x3c
[  344.264212]  [<ffffffffa03ccd78>] ? zs_malloc+0x17e/0xb71 [zsmalloc]
[  344.264217]  [<ffffffffa03ccd78>] zs_malloc+0x17e/0xb71 [zsmalloc]
[  344.264220]  [<ffffffffa0190204>] ? lzo_decompress+0x11d/0x11d [lzo]
[  344.264223]  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
[  344.264228]  [<ffffffffa03ccbfa>] ? obj_malloc+0x372/0x372 [zsmalloc]
[  344.264233]  [<ffffffff81472ff9>] ? crypto_compress+0x87/0x93
[  344.264238]  [<ffffffffa041522d>] zram_bvec_rw+0x1073/0x1638 [zram]
[  344.264243]  [<ffffffffa04141ba>] ? zram_slot_free_notify+0x1c8/0x1c8 [zram]
[  344.264247]  [<ffffffff812fc37b>] ? wb_writeback+0x316/0x44c
[  344.264251]  [<ffffffffa0416104>] zram_make_request+0x6f5/0x89f [zram]
[  344.264255]  [<ffffffff81111ef0>] ? woken_wake_function+0x51/0x51
[  344.264260]  [<ffffffffa0415a0f>] ? zram_rw_page+0x21d/0x21d [zram]
[  344.264263]  [<ffffffff81494948>] ? blk_exit_rl+0x39/0x39
[  344.264267]  [<ffffffff81491130>] ? handle_bad_sector+0x192/0x192
[  344.264271]  [<ffffffff811506a1>] ? call_rcu+0x12/0x14
[  344.264274]  [<ffffffff8129a684>] ? put_object+0x58/0x5b
[  344.264277]  [<ffffffff81496128>] generic_make_request+0x2bc/0x496
[  344.264280]  [<ffffffff81495e6c>] ? blk_plug_queued_count+0x103/0x103
[  344.264283]  [<ffffffff814965fa>] submit_bio+0x2f8/0x324
[  344.264286]  [<ffffffff81496302>] ? generic_make_request+0x496/0x496
[  344.264289]  [<ffffffff813aa993>] ? ext4_reserve_inode_write+0x101/0x101
[  344.264292]  [<ffffffff813b44e8>] ext4_io_submit+0x12d/0x15d
[  344.264295]  [<ffffffff813ac54d>] ext4_writepages+0x15f9/0x1660
[  344.264298]  [<ffffffff813aaf54>] ? ext4_mark_inode_dirty+0x5c1/0x5c1
[  344.264301]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  344.264304]  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
[  344.264307]  [<ffffffff8111f9a4>] ? __lock_acquire+0x59f/0x33b8
[  344.264311]  [<ffffffff811fa6ea>] do_writepages+0x93/0xa1
[  344.264315]  [<ffffffff812fb7a0>] ? writeback_sb_inodes+0x270/0x85e
[  344.264317]  [<ffffffff811fa6ea>] ? do_writepages+0x93/0xa1
[  344.264321]  [<ffffffff812fb287>] __writeback_single_inode+0x8b/0x334
[  344.264324]  [<ffffffff812fb9c9>] writeback_sb_inodes+0x499/0x85e
[  344.264327]  [<ffffffff812fb530>] ? __writeback_single_inode+0x334/0x334
[  344.264331]  [<ffffffff81115e1c>] ? down_read_trylock+0x53/0xaf
[  344.264335]  [<ffffffff812a7398>] ? trylock_super+0x16/0xaf
[  344.264338]  [<ffffffff812fbe95>] __writeback_inodes_wb+0x107/0x17d
[  344.264341]  [<ffffffff812fc37b>] wb_writeback+0x316/0x44c
[  344.264345]  [<ffffffff812fc065>] ? writeback_inodes_wb.constprop.10+0x15a/0x15a
[  344.264348]  [<ffffffff811f837f>] ? wb_over_bg_thresh+0x110/0x194
[  344.264351]  [<ffffffff811f826f>] ? balance_dirty_pages_ratelimited+0x14f5/0x14f5
[  344.264354]  [<ffffffff812fce5d>] ? wb_workfn+0x296/0x6d6
[  344.264357]  [<ffffffff812fced4>] wb_workfn+0x30d/0x6d6
[  344.264360]  [<ffffffff812fced4>] ? wb_workfn+0x30d/0x6d6
[  344.264364]  [<ffffffff812fcbc7>] ? inode_wait_for_writeback+0x2e/0x2e
[  344.264368]  [<ffffffff810be6d0>] process_one_work+0x6f4/0xb2c
[  344.264371]  [<ffffffff810bdfdc>] ? pwq_dec_nr_in_flight+0x22b/0x22b
[  344.264375]  [<ffffffff810c0de0>] worker_thread+0x5bb/0x88e
[  344.264378]  [<ffffffff810cd0de>] kthread+0x252/0x261
[  344.264381]  [<ffffffff810c0825>] ? rescuer_thread+0x879/0x879
[  344.264383]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  344.264387]  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
[  344.264390]  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
[  344.264392] Code: 14 6a b2 7e 85 c0 75 05 e8 8b 35 b1 ff f3 90 bf 01 00 00 00 e8 a1 71 be ff e8 e6 f3 01 00 44 39 f0 74 b6 4c 29 e3 49 01 dd eb 97 <bf> 01 00 00 00 e8 4c 81 be ff 65 8b 05 dc 69 b2 7e 85 c0 75 05 


> Could you test with [zsmalloc: keep first object offset in struct page]
> in mmotm?

sure, I can.  will it help, tho? we have a race condition here I think.

	-ss


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-16  2:48     ` Sergey Senozhatsky
@ 2016-06-16  2:58       ` Minchan Kim
  2016-06-16  4:23         ` Sergey Senozhatsky
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2016-06-16  2:58 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-mm, linux-kernel, Vlastimil Babka,
	dri-devel, Hugh Dickins, John Einar Reitan, Jonathan Corbet,
	Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

On Thu, Jun 16, 2016 at 11:48:27AM +0900, Sergey Senozhatsky wrote:
> Hi,
> 
> On (06/16/16 08:12), Minchan Kim wrote:
> > > [  315.146533] kasan: CONFIG_KASAN_INLINE enabled
> > > [  315.146538] kasan: GPF could be caused by NULL-ptr deref or user memory access
> > > [  315.146546] general protection fault: 0000 [#1] PREEMPT SMP KASAN
> > > [  315.146576] Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel r8169 i2c_i801 mii snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich mfd_core processor sch_fq_codel sd_mod hid_generic usbhid hid ahci libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common
> > > [  315.146785] CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160614-dbg-00004-ga1c2cbc-dirty #488
> > > [  315.146841] task: ffff8800bfaf2900 ti: ffff880112468000 task.ti: ffff880112468000
> > > [  315.146859] RIP: 0010:[<ffffffffa02c413d>]  [<ffffffffa02c413d>] zs_page_migrate+0x355/0xaa0 [zsmalloc]
> > 
> > Thanks for the report!
> > 
> > zs_page_migrate+0x355? Could you tell me what line is it?
> > 
> > It seems to be related to obj_to_head.
> 
> reproduced. a bit different call stack this time. but the problem is
> still the same.
> 
> zs_compact()
> ...
>     6371:       e8 00 00 00 00          callq  6376 <zs_compact+0x22b>
>     6376:       0f 0b                   ud2    
>     6378:       48 8b 95 a8 fe ff ff    mov    -0x158(%rbp),%rdx
>     637f:       4d 8d 74 24 78          lea    0x78(%r12),%r14
>     6384:       4c 89 ee                mov    %r13,%rsi
>     6387:       4c 89 e7                mov    %r12,%rdi
>     638a:       e8 86 c7 ff ff          callq  2b15 <get_first_obj_offset>
>     638f:       41 89 c5                mov    %eax,%r13d
>     6392:       4c 89 f0                mov    %r14,%rax
>     6395:       48 c1 e8 03             shr    $0x3,%rax
>     6399:       8a 04 18                mov    (%rax,%rbx,1),%al
>     639c:       84 c0                   test   %al,%al
>     639e:       0f 85 f2 02 00 00       jne    6696 <zs_compact+0x54b>
>     63a4:       41 8b 44 24 78          mov    0x78(%r12),%eax
>     63a9:       41 0f af c7             imul   %r15d,%eax
>     63ad:       41 01 c5                add    %eax,%r13d
>     63b0:       4c 89 f0                mov    %r14,%rax
>     63b3:       48 c1 e8 03             shr    $0x3,%rax
>     63b7:       48 01 d8                add    %rbx,%rax
>     63ba:       48 89 85 88 fe ff ff    mov    %rax,-0x178(%rbp)
>     63c1:       41 81 fd ff 0f 00 00    cmp    $0xfff,%r13d
>     63c8:       0f 87 1a 03 00 00       ja     66e8 <zs_compact+0x59d>
>     63ce:       49 63 f5                movslq %r13d,%rsi
>     63d1:       48 03 b5 98 fe ff ff    add    -0x168(%rbp),%rsi
>     63d8:       48 8b bd a8 fe ff ff    mov    -0x158(%rbp),%rdi
>     63df:       e8 67 d9 ff ff          callq  3d4b <obj_to_head>
>     63e4:       a8 01                   test   $0x1,%al
>     63e6:       0f 84 d9 02 00 00       je     66c5 <zs_compact+0x57a>
>     63ec:       48 83 e0 fe             and    $0xfffffffffffffffe,%rax
>     63f0:       bf 01 00 00 00          mov    $0x1,%edi
>     63f5:       48 89 85 b0 fe ff ff    mov    %rax,-0x150(%rbp)
>     63fc:       e8 00 00 00 00          callq  6401 <zs_compact+0x2b6>
>     6401:       48 8b 85 b0 fe ff ff    mov    -0x150(%rbp),%rax

RAX: 2065676162726166, so rax is totally garbage, I think.
It means obj_to_head returns garbage because get_first_obj_offset is
utterly broken: (page_idx / class->pages_per_zspage) was totally
wrong.

> 					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>     6408:       f0 0f ba 28 00          lock btsl $0x0,(%rax)
 
<snip>

> > Could you test with [zsmalloc: keep first object offset in struct page]
> > in mmotm?
> 
> sure, I can.  will it help, tho? we have a race condition here I think.

I guess the root cause is get_first_obj_offset.
Please test with it.
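
To make the arithmetic concrete, here is a minimal userspace model of how
the first object offset of a subpage falls out of the class size. It is
only an illustration -- the names and the exact formula are not the kernel
code -- but it shows why computing the offset against a wrong page index
(or a wrong pages_per_zspage) lands in the middle of an object:

#include <stdio.h>

#define PAGE_SIZE	4096UL

/*
 * Model: objects of 'size' bytes are packed back to back across the
 * pages of a zspage.  The first object that *starts* in page 'page_idx'
 * is object k = roundup(page_idx * PAGE_SIZE / size), and its offset
 * inside that page is k * size - page_idx * PAGE_SIZE.
 */
static unsigned long first_obj_offset(unsigned long size,
				      unsigned long page_idx)
{
	unsigned long start = page_idx * PAGE_SIZE;
	unsigned long k = (start + size - 1) / size;

	return k * size - start;
}

int main(void)
{
	unsigned long size = 3264;	/* some size that straddles pages */
	unsigned long i;

	for (i = 0; i < 3; i++)
		printf("page %lu: first object at offset %lu\n",
		       i, first_obj_offset(size, i));
	/*
	 * With a wrong page_idx the scan starts mid-object, obj_to_head()
	 * reads payload bytes instead of a handle, and pinning then
	 * bit_spin_locks on that garbage -- hence RAX holding ASCII text
	 * in the oops above.
	 */
	return 0;
}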

Thanks!


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-16  2:58       ` Minchan Kim
@ 2016-06-16  4:23         ` Sergey Senozhatsky
  2016-06-16  4:47           ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Sergey Senozhatsky @ 2016-06-16  4:23 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-mm, linux-kernel,
	Vlastimil Babka, dri-devel, Hugh Dickins, John Einar Reitan,
	Jonathan Corbet, Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman,
	Naoya Horiguchi, Rafael Aquini, Rik van Riel, Sergey Senozhatsky,
	virtualization, Gioh Kim, Chan Gyun Jeong, Sangseok Lee,
	Kyeongdon Kim, Chulmin Kim

On (06/16/16 11:58), Minchan Kim wrote:
[..]
> RAX: 2065676162726166 so rax is totally garbage, I think.
> It means obj_to_head returns garbage because get_first_obj_offset is
> utter crab because (page_idx / class->pages_per_zspage) was totally
> wrong.
> 
> > 					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >     6408:       f0 0f ba 28 00          lock btsl $0x0,(%rax)
>  
> <snip>
> 
> > > Could you test with [zsmalloc: keep first object offset in struct page]
> > > in mmotm?
> > 
> > sure, I can.  will it help, tho? we have a race condition here I think.
> 
> I guess root cause is caused by get_first_obj_offset.

sounds reasonable.

> Please test with it.


this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
applied: "count:0 mapcount:-127", which may not be related to zsmalloc at this point.

kernel: BUG: Bad page state in process khugepaged  pfn:101db8
kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
kernel: flags: 0x8000000000000000()
kernel: page dumped because: nonzero mapcount
kernel: Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek i2c_i801 snd_hda_codec_generic r8169 mii snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich processor mfd_core sch_fq_codel sd_mod hid_generic usb
kernel: CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160615-dbg-00005-gfd11984-dirty #491
kernel:  0000000000000000 ffff8801124c73f8 ffffffff814d69b0 ffffea0004076e00
kernel:  ffffffff81e658a0 ffff8801124c7420 ffffffff811e9b63 0000000000000000
kernel:  ffffea0004076e00 ffffffff81e658a0 ffff8801124c7440 ffffffff811e9ca9
kernel: Call Trace:
kernel:  [<ffffffff814d69b0>] dump_stack+0x68/0x92
kernel:  [<ffffffff811e9b63>] bad_page+0x158/0x1a2
kernel:  [<ffffffff811e9ca9>] free_pages_check_bad+0xfc/0x101
kernel:  [<ffffffff811ee516>] free_hot_cold_page+0x135/0x5de
kernel:  [<ffffffff811eea26>] __free_pages+0x67/0x72
kernel:  [<ffffffff81227c63>] release_freepages+0x13a/0x191
kernel:  [<ffffffff8122b3c2>] compact_zone+0x845/0x1155
kernel:  [<ffffffff8122ab7d>] ? compaction_suitable+0x76/0x76
kernel:  [<ffffffff8122bdb2>] compact_zone_order+0xe0/0x167
kernel:  [<ffffffff8122bcd2>] ? compact_zone+0x1155/0x1155
kernel:  [<ffffffff8122ce88>] try_to_compact_pages+0x2f1/0x648
kernel:  [<ffffffff8122ce88>] ? try_to_compact_pages+0x2f1/0x648
kernel:  [<ffffffff8122cb97>] ? compaction_zonelist_suitable+0x3a6/0x3a6
kernel:  [<ffffffff811ef1ea>] ? get_page_from_freelist+0x2c0/0x133c
kernel:  [<ffffffff811f0350>] __alloc_pages_direct_compact+0xea/0x30d
kernel:  [<ffffffff811f0266>] ? get_page_from_freelist+0x133c/0x133c
kernel:  [<ffffffff811ee3b2>] ? drain_all_pages+0x1d6/0x205
kernel:  [<ffffffff811f21a8>] __alloc_pages_nodemask+0x143d/0x16b6
kernel:  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
kernel:  [<ffffffff811f0d6b>] ? warn_alloc_failed+0x24c/0x24c
kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
kernel:  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
kernel:  [<ffffffff81d32ed0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
kernel:  [<ffffffff81d32edc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
kernel:  [<ffffffff8128f73a>] khugepaged+0x1d4/0x484f
kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
kernel:  [<ffffffff810d5bcc>] ? finish_task_switch+0x3de/0x484
kernel:  [<ffffffff81d32f18>] ? _raw_spin_unlock_irq+0x27/0x45
kernel:  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
kernel:  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
kernel:  [<ffffffff81d28bf5>] ? __schedule+0xa4d/0xd16
kernel:  [<ffffffff810cd0de>] kthread+0x252/0x261
kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
kernel:  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
-- Reboot --

	-ss


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-16  4:23         ` Sergey Senozhatsky
@ 2016-06-16  4:47           ` Minchan Kim
  2016-06-16  5:22             ` Sergey Senozhatsky
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2016-06-16  4:47 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-mm, linux-kernel, Vlastimil Babka,
	dri-devel, Hugh Dickins, John Einar Reitan, Jonathan Corbet,
	Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

On Thu, Jun 16, 2016 at 01:23:43PM +0900, Sergey Senozhatsky wrote:
> On (06/16/16 11:58), Minchan Kim wrote:
> [..]
> > RAX: 2065676162726166 so rax is totally garbage, I think.
> > It means obj_to_head returns garbage because get_first_obj_offset is
> > utter crab because (page_idx / class->pages_per_zspage) was totally
> > wrong.
> > 
> > > 					^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >     6408:       f0 0f ba 28 00          lock btsl $0x0,(%rax)
> >  
> > <snip>
> > 
> > > > Could you test with [zsmalloc: keep first object offset in struct page]
> > > > in mmotm?
> > > 
> > > sure, I can.  will it help, tho? we have a race condition here I think.
> > 
> > I guess root cause is caused by get_first_obj_offset.
> 
> sounds reasonable.
> 
> > Please test with it.
> 
> 
> this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> 
> kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1

Hm, it seems like a double free.

It doesn't happen if you disable zram? IOW, it seems to be related to
zsmalloc migration?

How easily can you reproduce it? Could you bisect it?

> kernel: flags: 0x8000000000000000()
> kernel: page dumped because: nonzero mapcount
> kernel: Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek i2c_i801 snd_hda_codec_generic r8169 mii snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich processor mfd_core sch_fq_codel sd_mod hid_generic usb
> kernel: CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160615-dbg-00005-gfd11984-dirty #491
> kernel:  0000000000000000 ffff8801124c73f8 ffffffff814d69b0 ffffea0004076e00
> kernel:  ffffffff81e658a0 ffff8801124c7420 ffffffff811e9b63 0000000000000000
> kernel:  ffffea0004076e00 ffffffff81e658a0 ffff8801124c7440 ffffffff811e9ca9
> kernel: Call Trace:
> kernel:  [<ffffffff814d69b0>] dump_stack+0x68/0x92
> kernel:  [<ffffffff811e9b63>] bad_page+0x158/0x1a2
> kernel:  [<ffffffff811e9ca9>] free_pages_check_bad+0xfc/0x101
> kernel:  [<ffffffff811ee516>] free_hot_cold_page+0x135/0x5de
> kernel:  [<ffffffff811eea26>] __free_pages+0x67/0x72
> kernel:  [<ffffffff81227c63>] release_freepages+0x13a/0x191
> kernel:  [<ffffffff8122b3c2>] compact_zone+0x845/0x1155
> kernel:  [<ffffffff8122ab7d>] ? compaction_suitable+0x76/0x76
> kernel:  [<ffffffff8122bdb2>] compact_zone_order+0xe0/0x167
> kernel:  [<ffffffff8122bcd2>] ? compact_zone+0x1155/0x1155
> kernel:  [<ffffffff8122ce88>] try_to_compact_pages+0x2f1/0x648
> kernel:  [<ffffffff8122ce88>] ? try_to_compact_pages+0x2f1/0x648
> kernel:  [<ffffffff8122cb97>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> kernel:  [<ffffffff811ef1ea>] ? get_page_from_freelist+0x2c0/0x133c
> kernel:  [<ffffffff811f0350>] __alloc_pages_direct_compact+0xea/0x30d
> kernel:  [<ffffffff811f0266>] ? get_page_from_freelist+0x133c/0x133c
> kernel:  [<ffffffff811ee3b2>] ? drain_all_pages+0x1d6/0x205
> kernel:  [<ffffffff811f21a8>] __alloc_pages_nodemask+0x143d/0x16b6
> kernel:  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
> kernel:  [<ffffffff811f0d6b>] ? warn_alloc_failed+0x24c/0x24c
> kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> kernel:  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
> kernel:  [<ffffffff81d32ed0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
> kernel:  [<ffffffff81d32edc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> kernel:  [<ffffffff8128f73a>] khugepaged+0x1d4/0x484f
> kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> kernel:  [<ffffffff810d5bcc>] ? finish_task_switch+0x3de/0x484
> kernel:  [<ffffffff81d32f18>] ? _raw_spin_unlock_irq+0x27/0x45
> kernel:  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
> kernel:  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
> kernel:  [<ffffffff81d28bf5>] ? __schedule+0xa4d/0xd16
> kernel:  [<ffffffff810cd0de>] kthread+0x252/0x261
> kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> kernel:  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
> kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> -- Reboot --
> 
> 	-ss


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-16  4:47           ` Minchan Kim
@ 2016-06-16  5:22             ` Sergey Senozhatsky
  2016-06-16  6:47               ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Sergey Senozhatsky @ 2016-06-16  5:22 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-mm, linux-kernel,
	Vlastimil Babka, dri-devel, Hugh Dickins, John Einar Reitan,
	Jonathan Corbet, Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman,
	Naoya Horiguchi, Rafael Aquini, Rik van Riel, Sergey Senozhatsky,
	virtualization, Gioh Kim, Chan Gyun Jeong, Sangseok Lee,
	Kyeongdon Kim, Chulmin Kim

On (06/16/16 13:47), Minchan Kim wrote:
[..]
> > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > 
> > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> 
> Hm, it seems double free.
> 
> It doen't happen if you disable zram? IOW, it seems to be related
> zsmalloc migration?

need to test more, can't confidently answer now.

> How easy can you reprodcue it? Could you bisect it?

it takes some (um.. random) time to trigger the bug.
I'll try to come up with more details.

	-ss

> > kernel: flags: 0x8000000000000000()
> > kernel: page dumped because: nonzero mapcount
> > kernel: Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek i2c_i801 snd_hda_codec_generic r8169 mii snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich processor mfd_core sch_fq_codel sd_mod hid_generic usb
> > kernel: CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160615-dbg-00005-gfd11984-dirty #491
> > kernel:  0000000000000000 ffff8801124c73f8 ffffffff814d69b0 ffffea0004076e00
> > kernel:  ffffffff81e658a0 ffff8801124c7420 ffffffff811e9b63 0000000000000000
> > kernel:  ffffea0004076e00 ffffffff81e658a0 ffff8801124c7440 ffffffff811e9ca9
> > kernel: Call Trace:
> > kernel:  [<ffffffff814d69b0>] dump_stack+0x68/0x92
> > kernel:  [<ffffffff811e9b63>] bad_page+0x158/0x1a2
> > kernel:  [<ffffffff811e9ca9>] free_pages_check_bad+0xfc/0x101
> > kernel:  [<ffffffff811ee516>] free_hot_cold_page+0x135/0x5de
> > kernel:  [<ffffffff811eea26>] __free_pages+0x67/0x72
> > kernel:  [<ffffffff81227c63>] release_freepages+0x13a/0x191
> > kernel:  [<ffffffff8122b3c2>] compact_zone+0x845/0x1155
> > kernel:  [<ffffffff8122ab7d>] ? compaction_suitable+0x76/0x76
> > kernel:  [<ffffffff8122bdb2>] compact_zone_order+0xe0/0x167
> > kernel:  [<ffffffff8122bcd2>] ? compact_zone+0x1155/0x1155
> > kernel:  [<ffffffff8122ce88>] try_to_compact_pages+0x2f1/0x648
> > kernel:  [<ffffffff8122ce88>] ? try_to_compact_pages+0x2f1/0x648
> > kernel:  [<ffffffff8122cb97>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> > kernel:  [<ffffffff811ef1ea>] ? get_page_from_freelist+0x2c0/0x133c
> > kernel:  [<ffffffff811f0350>] __alloc_pages_direct_compact+0xea/0x30d
> > kernel:  [<ffffffff811f0266>] ? get_page_from_freelist+0x133c/0x133c
> > kernel:  [<ffffffff811ee3b2>] ? drain_all_pages+0x1d6/0x205
> > kernel:  [<ffffffff811f21a8>] __alloc_pages_nodemask+0x143d/0x16b6
> > kernel:  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
> > kernel:  [<ffffffff811f0d6b>] ? warn_alloc_failed+0x24c/0x24c
> > kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> > kernel:  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
> > kernel:  [<ffffffff81d32ed0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
> > kernel:  [<ffffffff81d32edc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> > kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> > kernel:  [<ffffffff8128f73a>] khugepaged+0x1d4/0x484f
> > kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> > kernel:  [<ffffffff810d5bcc>] ? finish_task_switch+0x3de/0x484
> > kernel:  [<ffffffff81d32f18>] ? _raw_spin_unlock_irq+0x27/0x45
> > kernel:  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
> > kernel:  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
> > kernel:  [<ffffffff81d28bf5>] ? __schedule+0xa4d/0xd16
> > kernel:  [<ffffffff810cd0de>] kthread+0x252/0x261
> > kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> > kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> > kernel:  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
> > kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> > -- Reboot --


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-16  5:22             ` Sergey Senozhatsky
@ 2016-06-16  6:47               ` Minchan Kim
  2016-06-16  8:42                 ` Sergey Senozhatsky
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2016-06-16  6:47 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-mm, linux-kernel, Vlastimil Babka,
	dri-devel, Hugh Dickins, John Einar Reitan, Jonathan Corbet,
	Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

On Thu, Jun 16, 2016 at 02:22:09PM +0900, Sergey Senozhatsky wrote:
> On (06/16/16 13:47), Minchan Kim wrote:
> [..]
> > > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > > 
> > > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> > 
> > Hm, it seems double free.
> > 
> > It doen't happen if you disable zram? IOW, it seems to be related
> > zsmalloc migration?
> 
> need to test more, can't confidently answer now.
> 
> > How easy can you reprodcue it? Could you bisect it?
> 
> it takes some (um.. random) time to trigger the bug.
> I'll try to come up with more details.

Could you revert [1] and retest?

[1] mm/compaction: split freepages without holding the zone lock

> 
> 	-ss
> 
> > > kernel: flags: 0x8000000000000000()
> > > kernel: page dumped because: nonzero mapcount
> > > kernel: Modules linked in: lzo zram zsmalloc mousedev coretemp hwmon crc32c_intel snd_hda_codec_realtek i2c_i801 snd_hda_codec_generic r8169 mii snd_hda_intel snd_hda_codec snd_hda_core acpi_cpufreq snd_pcm snd_timer snd soundcore lpc_ich processor mfd_core sch_fq_codel sd_mod hid_generic usb
> > > kernel: CPU: 3 PID: 38 Comm: khugepaged Not tainted 4.7.0-rc3-next-20160615-dbg-00005-gfd11984-dirty #491
> > > kernel:  0000000000000000 ffff8801124c73f8 ffffffff814d69b0 ffffea0004076e00
> > > kernel:  ffffffff81e658a0 ffff8801124c7420 ffffffff811e9b63 0000000000000000
> > > kernel:  ffffea0004076e00 ffffffff81e658a0 ffff8801124c7440 ffffffff811e9ca9
> > > kernel: Call Trace:
> > > kernel:  [<ffffffff814d69b0>] dump_stack+0x68/0x92
> > > kernel:  [<ffffffff811e9b63>] bad_page+0x158/0x1a2
> > > kernel:  [<ffffffff811e9ca9>] free_pages_check_bad+0xfc/0x101
> > > kernel:  [<ffffffff811ee516>] free_hot_cold_page+0x135/0x5de
> > > kernel:  [<ffffffff811eea26>] __free_pages+0x67/0x72
> > > kernel:  [<ffffffff81227c63>] release_freepages+0x13a/0x191
> > > kernel:  [<ffffffff8122b3c2>] compact_zone+0x845/0x1155
> > > kernel:  [<ffffffff8122ab7d>] ? compaction_suitable+0x76/0x76
> > > kernel:  [<ffffffff8122bdb2>] compact_zone_order+0xe0/0x167
> > > kernel:  [<ffffffff8122bcd2>] ? compact_zone+0x1155/0x1155
> > > kernel:  [<ffffffff8122ce88>] try_to_compact_pages+0x2f1/0x648
> > > kernel:  [<ffffffff8122ce88>] ? try_to_compact_pages+0x2f1/0x648
> > > kernel:  [<ffffffff8122cb97>] ? compaction_zonelist_suitable+0x3a6/0x3a6
> > > kernel:  [<ffffffff811ef1ea>] ? get_page_from_freelist+0x2c0/0x133c
> > > kernel:  [<ffffffff811f0350>] __alloc_pages_direct_compact+0xea/0x30d
> > > kernel:  [<ffffffff811f0266>] ? get_page_from_freelist+0x133c/0x133c
> > > kernel:  [<ffffffff811ee3b2>] ? drain_all_pages+0x1d6/0x205
> > > kernel:  [<ffffffff811f21a8>] __alloc_pages_nodemask+0x143d/0x16b6
> > > kernel:  [<ffffffff8111f405>] ? debug_show_all_locks+0x226/0x226
> > > kernel:  [<ffffffff811f0d6b>] ? warn_alloc_failed+0x24c/0x24c
> > > kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> > > kernel:  [<ffffffff81122faf>] ? lock_acquire+0xec/0x147
> > > kernel:  [<ffffffff81d32ed0>] ? _raw_spin_unlock_irqrestore+0x3b/0x5c
> > > kernel:  [<ffffffff81d32edc>] ? _raw_spin_unlock_irqrestore+0x47/0x5c
> > > kernel:  [<ffffffff81110ffc>] ? finish_wait+0x1a4/0x1b0
> > > kernel:  [<ffffffff8128f73a>] khugepaged+0x1d4/0x484f
> > > kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> > > kernel:  [<ffffffff810d5bcc>] ? finish_task_switch+0x3de/0x484
> > > kernel:  [<ffffffff81d32f18>] ? _raw_spin_unlock_irq+0x27/0x45
> > > kernel:  [<ffffffff8111d13f>] ? trace_hardirqs_on_caller+0x3d2/0x492
> > > kernel:  [<ffffffff81111487>] ? prepare_to_wait_event+0x3f7/0x3f7
> > > kernel:  [<ffffffff81d28bf5>] ? __schedule+0xa4d/0xd16
> > > kernel:  [<ffffffff810cd0de>] kthread+0x252/0x261
> > > kernel:  [<ffffffff8128f566>] ? hugepage_vma_revalidate+0xef/0xef
> > > kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> > > kernel:  [<ffffffff81d3387f>] ret_from_fork+0x1f/0x40
> > > kernel:  [<ffffffff810cce8c>] ? kthread_create_on_node+0x377/0x377
> > > -- Reboot --


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-16  6:47               ` Minchan Kim
@ 2016-06-16  8:42                 ` Sergey Senozhatsky
  2016-06-16 10:09                   ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Sergey Senozhatsky @ 2016-06-16  8:42 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-mm, linux-kernel,
	Vlastimil Babka, dri-devel, Hugh Dickins, John Einar Reitan,
	Jonathan Corbet, Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman,
	Naoya Horiguchi, Rafael Aquini, Rik van Riel, Sergey Senozhatsky,
	virtualization, Gioh Kim, Chan Gyun Jeong, Sangseok Lee,
	Kyeongdon Kim, Chulmin Kim

On (06/16/16 15:47), Minchan Kim wrote:
> > [..]
> > > > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > > > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > > > 
> > > > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > > > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> > > 
> > > Hm, it seems double free.
> > > 
> > > It doen't happen if you disable zram? IOW, it seems to be related
> > > zsmalloc migration?
> > 
> > need to test more, can't confidently answer now.
> > 
> > > How easy can you reprodcue it? Could you bisect it?
> > 
> > it takes some (um.. random) time to trigger the bug.
> > I'll try to come up with more details.
> 
> Could you revert [1] and retest?
> 
> [1] mm/compaction: split freepages without holding the zone lock

ok, so this is not related to zsmalloc. finally managed to reproduce
it. will fork a separate thread.

	-ss


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-16  8:42                 ` Sergey Senozhatsky
@ 2016-06-16 10:09                   ` Minchan Kim
  2016-06-17  7:28                     ` Joonsoo Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2016-06-16 10:09 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-mm, linux-kernel, Vlastimil Babka,
	dri-devel, Hugh Dickins, John Einar Reitan, Jonathan Corbet,
	Joonsoo Kim, Konstantin Khlebnikov, Mel Gorman, Naoya Horiguchi,
	Rafael Aquini, Rik van Riel, Sergey Senozhatsky, virtualization,
	Gioh Kim, Chan Gyun Jeong, Sangseok Lee, Kyeongdon Kim,
	Chulmin Kim

On Thu, Jun 16, 2016 at 05:42:11PM +0900, Sergey Senozhatsky wrote:
> On (06/16/16 15:47), Minchan Kim wrote:
> > > [..]
> > > > > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > > > > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > > > > 
> > > > > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > > > > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> > > > 
> > > > Hm, it seems double free.
> > > > 
> > > > It doen't happen if you disable zram? IOW, it seems to be related
> > > > zsmalloc migration?
> > > 
> > > need to test more, can't confidently answer now.
> > > 
> > > > How easy can you reprodcue it? Could you bisect it?
> > > 
> > > it takes some (um.. random) time to trigger the bug.
> > > I'll try to come up with more details.
> > 
> > Could you revert [1] and retest?
> > 
> > [1] mm/compaction: split freepages without holding the zone lock
> 
> ok, so this is not related to zsmalloc. finally manged to reproduce
> it. will fork a separate thread.

The reason I mentioned [1] is that it seems to have a bug.

isolate_freepages_block
  __isolate_free_page
    if(!zone_watermark_ok())
      return 0;
  list_add_tail(&page->lru, freelist);

However, in that case the page was never actually isolated from the
buddy free list, yet it is still added to the freelist.
Joonsoo?
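
A sketch of what I mean (paraphrased, not a verbatim quote of
mm/compaction.c -- the helper names are the real ones, the surrounding
code is simplified):

	isolated = __isolate_free_page(page, order);
	if (!isolated) {
		/*
		 * Watermark check failed: the page was left on the buddy
		 * free list, so it must not be added to the compaction
		 * freelist here.
		 */
		break;
	}
	list_add_tail(&page->lru, freelist);

Without such a check the same page ends up both on the buddy free list
and on cc->freepages, and release_freepages() later frees it a second
time, which would explain the "count:0 mapcount:-127" bad page report.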


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 00/12] Support non-lru page migration
  2016-06-16 10:09                   ` Minchan Kim
@ 2016-06-17  7:28                     ` Joonsoo Kim
  0 siblings, 0 replies; 49+ messages in thread
From: Joonsoo Kim @ 2016-06-17  7:28 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-mm, linux-kernel,
	Vlastimil Babka, dri-devel, Hugh Dickins, John Einar Reitan,
	Jonathan Corbet, Konstantin Khlebnikov, Mel Gorman,
	Naoya Horiguchi, Rafael Aquini, Rik van Riel, Sergey Senozhatsky,
	virtualization, Gioh Kim, Chan Gyun Jeong, Sangseok Lee,
	Kyeongdon Kim, Chulmin Kim

On Thu, Jun 16, 2016 at 07:09:32PM +0900, Minchan Kim wrote:
> On Thu, Jun 16, 2016 at 05:42:11PM +0900, Sergey Senozhatsky wrote:
> > On (06/16/16 15:47), Minchan Kim wrote:
> > > > [..]
> > > > > > this is what I'm getting with the [zsmalloc: keep first object offset in struct page]
> > > > > > applied:  "count:0 mapcount:-127". which may be not related to zsmalloc at this point.
> > > > > > 
> > > > > > kernel: BUG: Bad page state in process khugepaged  pfn:101db8
> > > > > > kernel: page:ffffea0004076e00 count:0 mapcount:-127 mapping:          (null) index:0x1
> > > > > 
> > > > > Hm, it seems double free.
> > > > > 
> > > > > It doen't happen if you disable zram? IOW, it seems to be related
> > > > > zsmalloc migration?
> > > > 
> > > > need to test more, can't confidently answer now.
> > > > 
> > > > > How easy can you reprodcue it? Could you bisect it?
> > > > 
> > > > it takes some (um.. random) time to trigger the bug.
> > > > I'll try to come up with more details.
> > > 
> > > Could you revert [1] and retest?
> > > 
> > > [1] mm/compaction: split freepages without holding the zone lock
> > 
> > ok, so this is not related to zsmalloc. finally managed to reproduce
> > it. will fork a separate thread.
> 
> The reason I mentioned [1] is that it seems to have a bug.
> 
> isolate_freepages_block
>   __isolate_free_page
>     if(!zone_watermark_ok())
>       return 0;
>   list_add_tail(&page->lru, freelist);
> 
> However, the page is still added to the freelist even though it was not isolated.
> Joonsoo?

Good job!
I will fix it soon.

Thanks.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
       [not found]   ` <CGME20170119001317epcas1p188357c77e1f4ff08b6d3dcb76dedca06@epcas1p1.samsung.com>
@ 2017-01-19  0:13     ` Chulmin Kim
  2017-01-19  2:44       ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Chulmin Kim @ 2017-01-19  0:13 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton; +Cc: linux-mm, Sergey Senozhatsky

Hello. Minchan, and all zsmalloc guys.

I have a quick question.
Is zsmalloc considering memory barrier things correctly?

AFAIK, on ARM64,
zsmalloc relies only on the dmb operation in bit_spin_unlock.
(It seems that dmb operations in the spinlock functions are being prepared,
but let us set that aside as it is not merged yet.)

If I am correct,
migrating a page in a zspage filled with free objs
may cause corruption, because bit_spin_unlock will not be executed at all.

I am not sure this provides enough of a memory barrier for zsmalloc operations.
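
To make the question concrete, the interleaving I have in mind is roughly
the following (simplified, and the barrier notes are just my reading, which
may well be wrong):

    CPU 0: zs_free(object in the page)    CPU 1: zs_page_migrate(that page)

    spin_lock(&class->lock)
    obj_free: write link_free into
      the page via kmap_atomic
    spin_unlock(&class->lock)
    unpin_tag(handle)
      /* bit_spin_unlock -> dmb */
                                          migrate_write_lock(zspage)
                                          spin_lock(&class->lock)
                                          /* no object in this page is
                                           * allocated any more, so the
                                           * trypin_tag() loop never runs
                                           * bit_spin_lock/unlock here */
                                          memcpy(d_addr, s_addr, PAGE_SIZE)

So on CPU 1 the only ordering against CPU 0's stores comes from the
spinlock/rwlock acquires, and that is the part I am unsure about on ARM64.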

Can you enlighten me?


Thanks!
Chulmin Kim



On 05/31/2016 07:21 PM, Minchan Kim wrote:
> This patch introduces run-time migration feature for zspage.
>
> For migration, VM uses page.lru field so it would be better to not use
> page.next field which is unified with page.lru for own purpose.
> For that, firstly, we can get first object offset of the page via
> runtime calculation instead of using page.index so we can use
> page.index as link for page chaining instead of page.next.
> 	
> In case of a huge object, it stores the handle in page.index instead of
> the next link of the page chain, because a huge object doesn't need a
> next link for page chaining. So get_next_page needs to identify a huge
> object to return NULL. For that, this patch uses the PG_owner_priv_1
> page flag.
>
> For migration, it supports three functions
>
> * zs_page_isolate
>
> It isolates a zspage which includes a subpage VM want to migrate
> from class so anyone cannot allocate new object from the zspage.
>
> We could try to isolate a zspage by the number of subpage so
> subsequent isolation trials of other subpages of the zspage shouldn't
> fail. For that, we introduce zspage.isolated count. With that,
> zs_page_isolate can know whether zspage is already isolated or not
> for migration so if it is isolated for migration, subsequent
> isolation trial can be successful without trying further isolation.
>
> * zs_page_migrate
>
> First of all, it holds write-side zspage->lock to prevent migrate other
> subpage in zspage. Then, lock all objects in the page VM want to migrate.
> The reason we should lock all objects in the page is due to race between
> zs_map_object and zs_page_migrate.
>
> zs_map_object				zs_page_migrate
>
> pin_tag(handle)
> obj = handle_to_obj(handle)
> obj_to_location(obj, &page, &obj_idx);
>
> 					write_lock(&zspage->lock)
> 					if (!trypin_tag(handle))
> 						goto unpin_object
>
> zspage = get_zspage(page);
> read_lock(&zspage->lock);
>
> If zs_page_migrate doesn't do trypin_tag, zs_map_object's page can
> be stale by migration so it goes crash.
>
> If it locks all of objects successfully, it copies content from
> old page to new one, finally, create new zspage chain with new page.
> And if it's last isolated subpage in the zspage, put the zspage back
> to class.
>
> * zs_page_putback
>
> It returns isolated zspage to right fullness_group list if it fails to
> migrate a page. If it finds a zspage is ZS_EMPTY, it queues zspage
> freeing to workqueue. See below about async zspage freeing.
>
> This patch introduces asynchronous zspage free. The reason to need it
> is we need page_lock to clear PG_movable but unfortunately,
> zs_free path should be atomic so the approach is to try to grab page_lock.
> If it got page_lock of all of pages successfully, it can free zspage
> immediately. Otherwise, it queues free request and free zspage via
> workqueue in process context.
>
> If zs_free finds the zspage is isolated when it tries to free the zspage,
> it delays the freeing until zs_page_putback finds it, so the zspage will
> finally be freed.
>
> In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL.
> First of all, it will use ZS_EMPTY list for delay freeing.
> And with the added ZS_FULL list, it is possible to identify whether a zspage is
> isolated or not via list_empty(&zspage->list) test.
>
> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> Signed-off-by: Minchan Kim <minchan@kernel.org>
> ---
>  include/uapi/linux/magic.h |   1 +
>  mm/zsmalloc.c              | 793 ++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 672 insertions(+), 122 deletions(-)
>
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index d829ce63529d..e398beac67b8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -81,5 +81,6 @@
>  /* Since UDF 2.01 is ISO 13346 based... */
>  #define UDF_SUPER_MAGIC		0x15013346
>  #define BALLOON_KVM_MAGIC	0x13661366
> +#define ZSMALLOC_MAGIC		0x58295829
>
>  #endif /* __LINUX_MAGIC_H__ */
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index c6fb543cfb98..a80100db16d6 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -17,14 +17,14 @@
>   *
>   * Usage of struct page fields:
>   *	page->private: points to zspage
> - *	page->index: offset of the first object starting in this page.
> - *		For the first page, this is always 0, so we use this field
> - *		to store handle for huge object.
> - *	page->next: links together all component pages of a zspage
> + *	page->freelist(index): links together all component pages of a zspage
> + *		For the huge page, this is always 0, so we use this field
> + *		to store handle.
>   *
>   * Usage of struct page flags:
>   *	PG_private: identifies the first component page
>   *	PG_private2: identifies the last component page
> + *	PG_owner_priv_1: indentifies the huge component page
>   *
>   */
>
> @@ -49,6 +49,11 @@
>  #include <linux/debugfs.h>
>  #include <linux/zsmalloc.h>
>  #include <linux/zpool.h>
> +#include <linux/mount.h>
> +#include <linux/compaction.h>
> +#include <linux/pagemap.h>
> +
> +#define ZSPAGE_MAGIC	0x58
>
>  /*
>   * This must be power of 2 and greater than of equal to sizeof(link_free).
> @@ -136,25 +141,23 @@
>   * We do not maintain any list for completely empty or full pages
>   */
>  enum fullness_group {
> -	ZS_ALMOST_FULL,
> -	ZS_ALMOST_EMPTY,
>  	ZS_EMPTY,
> -	ZS_FULL
> +	ZS_ALMOST_EMPTY,
> +	ZS_ALMOST_FULL,
> +	ZS_FULL,
> +	NR_ZS_FULLNESS,
>  };
>
>  enum zs_stat_type {
> +	CLASS_EMPTY,
> +	CLASS_ALMOST_EMPTY,
> +	CLASS_ALMOST_FULL,
> +	CLASS_FULL,
>  	OBJ_ALLOCATED,
>  	OBJ_USED,
> -	CLASS_ALMOST_FULL,
> -	CLASS_ALMOST_EMPTY,
> +	NR_ZS_STAT_TYPE,
>  };
>
> -#ifdef CONFIG_ZSMALLOC_STAT
> -#define NR_ZS_STAT_TYPE	(CLASS_ALMOST_EMPTY + 1)
> -#else
> -#define NR_ZS_STAT_TYPE	(OBJ_USED + 1)
> -#endif
> -
>  struct zs_size_stat {
>  	unsigned long objs[NR_ZS_STAT_TYPE];
>  };
> @@ -163,6 +166,10 @@ struct zs_size_stat {
>  static struct dentry *zs_stat_root;
>  #endif
>
> +#ifdef CONFIG_COMPACTION
> +static struct vfsmount *zsmalloc_mnt;
> +#endif
> +
>  /*
>   * number of size_classes
>   */
> @@ -186,23 +193,36 @@ static const int fullness_threshold_frac = 4;
>
>  struct size_class {
>  	spinlock_t lock;
> -	struct list_head fullness_list[2];
> +	struct list_head fullness_list[NR_ZS_FULLNESS];
>  	/*
>  	 * Size of objects stored in this class. Must be multiple
>  	 * of ZS_ALIGN.
>  	 */
>  	int size;
>  	int objs_per_zspage;
> -	unsigned int index;
> -
> -	struct zs_size_stat stats;
> -
>  	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
>  	int pages_per_zspage;
> -	/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
> -	bool huge;
> +
> +	unsigned int index;
> +	struct zs_size_stat stats;
>  };
>
> +/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
> +static void SetPageHugeObject(struct page *page)
> +{
> +	SetPageOwnerPriv1(page);
> +}
> +
> +static void ClearPageHugeObject(struct page *page)
> +{
> +	ClearPageOwnerPriv1(page);
> +}
> +
> +static int PageHugeObject(struct page *page)
> +{
> +	return PageOwnerPriv1(page);
> +}
> +
>  /*
>   * Placed within free objects to form a singly linked list.
>   * For every zspage, zspage->freeobj gives head of this list.
> @@ -244,6 +264,10 @@ struct zs_pool {
>  #ifdef CONFIG_ZSMALLOC_STAT
>  	struct dentry *stat_dentry;
>  #endif
> +#ifdef CONFIG_COMPACTION
> +	struct inode *inode;
> +	struct work_struct free_work;
> +#endif
>  };
>
>  /*
> @@ -252,16 +276,23 @@ struct zs_pool {
>   */
>  #define FULLNESS_BITS	2
>  #define CLASS_BITS	8
> +#define ISOLATED_BITS	3
> +#define MAGIC_VAL_BITS	8
>
>  struct zspage {
>  	struct {
>  		unsigned int fullness:FULLNESS_BITS;
>  		unsigned int class:CLASS_BITS;
> +		unsigned int isolated:ISOLATED_BITS;
> +		unsigned int magic:MAGIC_VAL_BITS;
>  	};
>  	unsigned int inuse;
>  	unsigned int freeobj;
>  	struct page *first_page;
>  	struct list_head list; /* fullness list */
> +#ifdef CONFIG_COMPACTION
> +	rwlock_t lock;
> +#endif
>  };
>
>  struct mapping_area {
> @@ -274,6 +305,28 @@ struct mapping_area {
>  	enum zs_mapmode vm_mm; /* mapping mode */
>  };
>
> +#ifdef CONFIG_COMPACTION
> +static int zs_register_migration(struct zs_pool *pool);
> +static void zs_unregister_migration(struct zs_pool *pool);
> +static void migrate_lock_init(struct zspage *zspage);
> +static void migrate_read_lock(struct zspage *zspage);
> +static void migrate_read_unlock(struct zspage *zspage);
> +static void kick_deferred_free(struct zs_pool *pool);
> +static void init_deferred_free(struct zs_pool *pool);
> +static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage);
> +#else
> +static int zsmalloc_mount(void) { return 0; }
> +static void zsmalloc_unmount(void) {}
> +static int zs_register_migration(struct zs_pool *pool) { return 0; }
> +static void zs_unregister_migration(struct zs_pool *pool) {}
> +static void migrate_lock_init(struct zspage *zspage) {}
> +static void migrate_read_lock(struct zspage *zspage) {}
> +static void migrate_read_unlock(struct zspage *zspage) {}
> +static void kick_deferred_free(struct zs_pool *pool) {}
> +static void init_deferred_free(struct zs_pool *pool) {}
> +static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {}
> +#endif
> +
>  static int create_cache(struct zs_pool *pool)
>  {
>  	pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
> @@ -301,7 +354,7 @@ static void destroy_cache(struct zs_pool *pool)
>  static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
>  {
>  	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
> -			gfp & ~__GFP_HIGHMEM);
> +			gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
>  }
>
>  static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
> @@ -311,7 +364,8 @@ static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
>
>  static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
>  {
> -	return kmem_cache_alloc(pool->zspage_cachep, flags & ~__GFP_HIGHMEM);
> +	return kmem_cache_alloc(pool->zspage_cachep,
> +			flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
>  };
>
>  static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
> @@ -421,11 +475,17 @@ static unsigned int get_maxobj_per_zspage(int size, int pages_per_zspage)
>  /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
>  static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
>
> +static bool is_zspage_isolated(struct zspage *zspage)
> +{
> +	return zspage->isolated;
> +}
> +
>  static int is_first_page(struct page *page)
>  {
>  	return PagePrivate(page);
>  }
>
> +/* Protected by class->lock */
>  static inline int get_zspage_inuse(struct zspage *zspage)
>  {
>  	return zspage->inuse;
> @@ -441,20 +501,12 @@ static inline void mod_zspage_inuse(struct zspage *zspage, int val)
>  	zspage->inuse += val;
>  }
>
> -static inline int get_first_obj_offset(struct page *page)
> +static inline struct page *get_first_page(struct zspage *zspage)
>  {
> -	if (is_first_page(page))
> -		return 0;
> +	struct page *first_page = zspage->first_page;
>
> -	return page->index;
> -}
> -
> -static inline void set_first_obj_offset(struct page *page, int offset)
> -{
> -	if (is_first_page(page))
> -		return;
> -
> -	page->index = offset;
> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> +	return first_page;
>  }
>
>  static inline unsigned int get_freeobj(struct zspage *zspage)
> @@ -471,6 +523,8 @@ static void get_zspage_mapping(struct zspage *zspage,
>  				unsigned int *class_idx,
>  				enum fullness_group *fullness)
>  {
> +	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
> +
>  	*fullness = zspage->fullness;
>  	*class_idx = zspage->class;
>  }
> @@ -504,23 +558,19 @@ static int get_size_class_index(int size)
>  static inline void zs_stat_inc(struct size_class *class,
>  				enum zs_stat_type type, unsigned long cnt)
>  {
> -	if (type < NR_ZS_STAT_TYPE)
> -		class->stats.objs[type] += cnt;
> +	class->stats.objs[type] += cnt;
>  }
>
>  static inline void zs_stat_dec(struct size_class *class,
>  				enum zs_stat_type type, unsigned long cnt)
>  {
> -	if (type < NR_ZS_STAT_TYPE)
> -		class->stats.objs[type] -= cnt;
> +	class->stats.objs[type] -= cnt;
>  }
>
>  static inline unsigned long zs_stat_get(struct size_class *class,
>  				enum zs_stat_type type)
>  {
> -	if (type < NR_ZS_STAT_TYPE)
> -		return class->stats.objs[type];
> -	return 0;
> +	return class->stats.objs[type];
>  }
>
>  #ifdef CONFIG_ZSMALLOC_STAT
> @@ -664,6 +714,7 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
>  }
>  #endif
>
> +
>  /*
>   * For each size class, zspages are divided into different groups
>   * depending on how "full" they are. This was done so that we could
> @@ -704,15 +755,9 @@ static void insert_zspage(struct size_class *class,
>  {
>  	struct zspage *head;
>
> -	if (fullness >= ZS_EMPTY)
> -		return;
> -
> +	zs_stat_inc(class, fullness, 1);
>  	head = list_first_entry_or_null(&class->fullness_list[fullness],
>  					struct zspage, list);
> -
> -	zs_stat_inc(class, fullness == ZS_ALMOST_EMPTY ?
> -			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
> -
>  	/*
>  	 * We want to see more ZS_FULL pages and less almost empty/full.
>  	 * Put pages with higher ->inuse first.
> @@ -734,14 +779,11 @@ static void remove_zspage(struct size_class *class,
>  				struct zspage *zspage,
>  				enum fullness_group fullness)
>  {
> -	if (fullness >= ZS_EMPTY)
> -		return;
> -
>  	VM_BUG_ON(list_empty(&class->fullness_list[fullness]));
> +	VM_BUG_ON(is_zspage_isolated(zspage));
>
>  	list_del_init(&zspage->list);
> -	zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ?
> -			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
> +	zs_stat_dec(class, fullness, 1);
>  }
>
>  /*
> @@ -764,8 +806,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
>  	if (newfg == currfg)
>  		goto out;
>
> -	remove_zspage(class, zspage, currfg);
> -	insert_zspage(class, zspage, newfg);
> +	if (!is_zspage_isolated(zspage)) {
> +		remove_zspage(class, zspage, currfg);
> +		insert_zspage(class, zspage, newfg);
> +	}
> +
>  	set_zspage_mapping(zspage, class_idx, newfg);
>
>  out:
> @@ -808,19 +853,45 @@ static int get_pages_per_zspage(int class_size)
>  	return max_usedpc_order;
>  }
>
> -static struct page *get_first_page(struct zspage *zspage)
> +static struct zspage *get_zspage(struct page *page)
>  {
> -	return zspage->first_page;
> +	struct zspage *zspage = (struct zspage *)page->private;
> +
> +	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
> +	return zspage;
>  }
>
> -static struct zspage *get_zspage(struct page *page)
> +static struct page *get_next_page(struct page *page)
>  {
> -	return (struct zspage *)page->private;
> +	if (unlikely(PageHugeObject(page)))
> +		return NULL;
> +
> +	return page->freelist;
>  }
>
> -static struct page *get_next_page(struct page *page)
> +/* Get byte offset of first object in the @page */
> +static int get_first_obj_offset(struct size_class *class,
> +				struct page *first_page, struct page *page)
>  {
> -	return page->next;
> +	int pos;
> +	int page_idx = 0;
> +	int ofs = 0;
> +	struct page *cursor = first_page;
> +
> +	if (first_page == page)
> +		goto out;
> +
> +	while (page != cursor) {
> +		page_idx++;
> +		cursor = get_next_page(cursor);
> +	}
> +
> +	pos = class->objs_per_zspage * class->size *
> +		page_idx / class->pages_per_zspage;
> +
> +	ofs = (pos + class->size) % PAGE_SIZE;
> +out:
> +	return ofs;
>  }
>
>  /**
> @@ -857,16 +928,20 @@ static unsigned long handle_to_obj(unsigned long handle)
>  	return *(unsigned long *)handle;
>  }
>
> -static unsigned long obj_to_head(struct size_class *class, struct page *page,
> -			void *obj)
> +static unsigned long obj_to_head(struct page *page, void *obj)
>  {
> -	if (class->huge) {
> +	if (unlikely(PageHugeObject(page))) {
>  		VM_BUG_ON_PAGE(!is_first_page(page), page);
>  		return page->index;
>  	} else
>  		return *(unsigned long *)obj;
>  }
>
> +static inline int testpin_tag(unsigned long handle)
> +{
> +	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
> +}
> +
>  static inline int trypin_tag(unsigned long handle)
>  {
>  	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
> @@ -884,27 +959,93 @@ static void unpin_tag(unsigned long handle)
>
>  static void reset_page(struct page *page)
>  {
> +	__ClearPageMovable(page);
>  	clear_bit(PG_private, &page->flags);
>  	clear_bit(PG_private_2, &page->flags);
>  	set_page_private(page, 0);
> -	page->index = 0;
> +	ClearPageHugeObject(page);
> +	page->freelist = NULL;
>  }
>
> -static void free_zspage(struct zs_pool *pool, struct zspage *zspage)
> +/*
> + * To prevent zspage destroy during migration, zspage freeing should
> + * hold locks of all pages in the zspage.
> + */
> +void lock_zspage(struct zspage *zspage)
> +{
> +	struct page *page = get_first_page(zspage);
> +
> +	do {
> +		lock_page(page);
> +	} while ((page = get_next_page(page)) != NULL);
> +}
> +
> +int trylock_zspage(struct zspage *zspage)
> +{
> +	struct page *cursor, *fail;
> +
> +	for (cursor = get_first_page(zspage); cursor != NULL; cursor =
> +					get_next_page(cursor)) {
> +		if (!trylock_page(cursor)) {
> +			fail = cursor;
> +			goto unlock;
> +		}
> +	}
> +
> +	return 1;
> +unlock:
> +	for (cursor = get_first_page(zspage); cursor != fail; cursor =
> +					get_next_page(cursor))
> +		unlock_page(cursor);
> +
> +	return 0;
> +}
> +
> +static void __free_zspage(struct zs_pool *pool, struct size_class *class,
> +				struct zspage *zspage)
>  {
>  	struct page *page, *next;
> +	enum fullness_group fg;
> +	unsigned int class_idx;
> +
> +	get_zspage_mapping(zspage, &class_idx, &fg);
> +
> +	assert_spin_locked(&class->lock);
>
>  	VM_BUG_ON(get_zspage_inuse(zspage));
> +	VM_BUG_ON(fg != ZS_EMPTY);
>
> -	next = page = zspage->first_page;
> +	next = page = get_first_page(zspage);
>  	do {
> -		next = page->next;
> +		VM_BUG_ON_PAGE(!PageLocked(page), page);
> +		next = get_next_page(page);
>  		reset_page(page);
> +		unlock_page(page);
>  		put_page(page);
>  		page = next;
>  	} while (page != NULL);
>
>  	cache_free_zspage(pool, zspage);
> +
> +	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> +			class->size, class->pages_per_zspage));
> +	atomic_long_sub(class->pages_per_zspage,
> +					&pool->pages_allocated);
> +}
> +
> +static void free_zspage(struct zs_pool *pool, struct size_class *class,
> +				struct zspage *zspage)
> +{
> +	VM_BUG_ON(get_zspage_inuse(zspage));
> +	VM_BUG_ON(list_empty(&zspage->list));
> +
> +	if (!trylock_zspage(zspage)) {
> +		kick_deferred_free(pool);
> +		return;
> +	}
> +
> +	remove_zspage(class, zspage, ZS_EMPTY);
> +	__free_zspage(pool, class, zspage);
>  }
>
>  /* Initialize a newly allocated zspage */
> @@ -912,15 +1053,13 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
>  {
>  	unsigned int freeobj = 1;
>  	unsigned long off = 0;
> -	struct page *page = zspage->first_page;
> +	struct page *page = get_first_page(zspage);
>
>  	while (page) {
>  		struct page *next_page;
>  		struct link_free *link;
>  		void *vaddr;
>
> -		set_first_obj_offset(page, off);
> -
>  		vaddr = kmap_atomic(page);
>  		link = (struct link_free *)vaddr + off / sizeof(*link);
>
> @@ -952,16 +1091,17 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
>  	set_freeobj(zspage, 0);
>  }
>
> -static void create_page_chain(struct zspage *zspage, struct page *pages[],
> -				int nr_pages)
> +static void create_page_chain(struct size_class *class, struct zspage *zspage,
> +				struct page *pages[])
>  {
>  	int i;
>  	struct page *page;
>  	struct page *prev_page = NULL;
> +	int nr_pages = class->pages_per_zspage;
>
>  	/*
>  	 * Allocate individual pages and link them together as:
> -	 * 1. all pages are linked together using page->next
> +	 * 1. all pages are linked together using page->freelist
>  	 * 2. each sub-page point to zspage using page->private
>  	 *
>  	 * we set PG_private to identify the first page (i.e. no other sub-page
> @@ -970,16 +1110,18 @@ static void create_page_chain(struct zspage *zspage, struct page *pages[],
>  	for (i = 0; i < nr_pages; i++) {
>  		page = pages[i];
>  		set_page_private(page, (unsigned long)zspage);
> +		page->freelist = NULL;
>  		if (i == 0) {
>  			zspage->first_page = page;
>  			SetPagePrivate(page);
> +			if (unlikely(class->objs_per_zspage == 1 &&
> +					class->pages_per_zspage == 1))
> +				SetPageHugeObject(page);
>  		} else {
> -			prev_page->next = page;
> +			prev_page->freelist = page;
>  		}
> -		if (i == nr_pages - 1) {
> +		if (i == nr_pages - 1)
>  			SetPagePrivate2(page);
> -			page->next = NULL;
> -		}
>  		prev_page = page;
>  	}
>  }
> @@ -999,6 +1141,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>  		return NULL;
>
>  	memset(zspage, 0, sizeof(struct zspage));
> +	zspage->magic = ZSPAGE_MAGIC;
> +	migrate_lock_init(zspage);
>
>  	for (i = 0; i < class->pages_per_zspage; i++) {
>  		struct page *page;
> @@ -1013,7 +1157,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>  		pages[i] = page;
>  	}
>
> -	create_page_chain(zspage, pages, class->pages_per_zspage);
> +	create_page_chain(class, zspage, pages);
>  	init_zspage(class, zspage);
>
>  	return zspage;
> @@ -1024,7 +1168,7 @@ static struct zspage *find_get_zspage(struct size_class *class)
>  	int i;
>  	struct zspage *zspage;
>
> -	for (i = ZS_ALMOST_FULL; i <= ZS_ALMOST_EMPTY; i++) {
> +	for (i = ZS_ALMOST_FULL; i >= ZS_EMPTY; i--) {
>  		zspage = list_first_entry_or_null(&class->fullness_list[i],
>  				struct zspage, list);
>  		if (zspage)
> @@ -1289,6 +1433,10 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>  	obj = handle_to_obj(handle);
>  	obj_to_location(obj, &page, &obj_idx);
>  	zspage = get_zspage(page);
> +
> +	/* migration cannot move any subpage in this zspage */
> +	migrate_read_lock(zspage);
> +
>  	get_zspage_mapping(zspage, &class_idx, &fg);
>  	class = pool->size_class[class_idx];
>  	off = (class->size * obj_idx) & ~PAGE_MASK;
> @@ -1309,7 +1457,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>
>  	ret = __zs_map_object(area, pages, off, class->size);
>  out:
> -	if (!class->huge)
> +	if (likely(!PageHugeObject(page)))
>  		ret += ZS_HANDLE_SIZE;
>
>  	return ret;
> @@ -1348,6 +1496,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
>  		__zs_unmap_object(area, pages, off, class->size);
>  	}
>  	put_cpu_var(zs_map_area);
> +
> +	migrate_read_unlock(zspage);
>  	unpin_tag(handle);
>  }
>  EXPORT_SYMBOL_GPL(zs_unmap_object);
> @@ -1377,7 +1527,7 @@ static unsigned long obj_malloc(struct size_class *class,
>  	vaddr = kmap_atomic(m_page);
>  	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
>  	set_freeobj(zspage, link->next >> OBJ_ALLOCATED_TAG);
> -	if (!class->huge)
> +	if (likely(!PageHugeObject(m_page)))
>  		/* record handle in the header of allocated chunk */
>  		link->handle = handle;
>  	else
> @@ -1407,6 +1557,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>  {
>  	unsigned long handle, obj;
>  	struct size_class *class;
> +	enum fullness_group newfg;
>  	struct zspage *zspage;
>
>  	if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
> @@ -1422,28 +1573,37 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>
>  	spin_lock(&class->lock);
>  	zspage = find_get_zspage(class);
> -
> -	if (!zspage) {
> +	if (likely(zspage)) {
> +		obj = obj_malloc(class, zspage, handle);
> +		/* Now move the zspage to another fullness group, if required */
> +		fix_fullness_group(class, zspage);
> +		record_obj(handle, obj);
>  		spin_unlock(&class->lock);
> -		zspage = alloc_zspage(pool, class, gfp);
> -		if (unlikely(!zspage)) {
> -			cache_free_handle(pool, handle);
> -			return 0;
> -		}
>
> -		set_zspage_mapping(zspage, class->index, ZS_EMPTY);
> -		atomic_long_add(class->pages_per_zspage,
> -					&pool->pages_allocated);
> +		return handle;
> +	}
>
> -		spin_lock(&class->lock);
> -		zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> -				class->size, class->pages_per_zspage));
> +	spin_unlock(&class->lock);
> +
> +	zspage = alloc_zspage(pool, class, gfp);
> +	if (!zspage) {
> +		cache_free_handle(pool, handle);
> +		return 0;
>  	}
>
> +	spin_lock(&class->lock);
>  	obj = obj_malloc(class, zspage, handle);
> -	/* Now move the zspage to another fullness group, if required */
> -	fix_fullness_group(class, zspage);
> +	newfg = get_fullness_group(class, zspage);
> +	insert_zspage(class, zspage, newfg);
> +	set_zspage_mapping(zspage, class->index, newfg);
>  	record_obj(handle, obj);
> +	atomic_long_add(class->pages_per_zspage,
> +				&pool->pages_allocated);
> +	zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> +			class->size, class->pages_per_zspage));
> +
> +	/* We completely set up zspage so mark them as movable */
> +	SetZsPageMovable(pool, zspage);
>  	spin_unlock(&class->lock);
>
>  	return handle;
> @@ -1484,6 +1644,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>  	int class_idx;
>  	struct size_class *class;
>  	enum fullness_group fullness;
> +	bool isolated;
>
>  	if (unlikely(!handle))
>  		return;
> @@ -1493,22 +1654,28 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>  	obj_to_location(obj, &f_page, &f_objidx);
>  	zspage = get_zspage(f_page);
>
> +	migrate_read_lock(zspage);
> +
>  	get_zspage_mapping(zspage, &class_idx, &fullness);
>  	class = pool->size_class[class_idx];
>
>  	spin_lock(&class->lock);
>  	obj_free(class, obj);
>  	fullness = fix_fullness_group(class, zspage);
> -	if (fullness == ZS_EMPTY) {
> -		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> -				class->size, class->pages_per_zspage));
> -		atomic_long_sub(class->pages_per_zspage,
> -				&pool->pages_allocated);
> -		free_zspage(pool, zspage);
> +	if (fullness != ZS_EMPTY) {
> +		migrate_read_unlock(zspage);
> +		goto out;
>  	}
> +
> +	isolated = is_zspage_isolated(zspage);
> +	migrate_read_unlock(zspage);
> +	/* If zspage is isolated, zs_page_putback will free the zspage */
> +	if (likely(!isolated))
> +		free_zspage(pool, class, zspage);
> +out:
> +
>  	spin_unlock(&class->lock);
>  	unpin_tag(handle);
> -
>  	cache_free_handle(pool, handle);
>  }
>  EXPORT_SYMBOL_GPL(zs_free);
> @@ -1587,12 +1754,13 @@ static unsigned long find_alloced_obj(struct size_class *class,
>  	int offset = 0;
>  	unsigned long handle = 0;
>  	void *addr = kmap_atomic(page);
> +	struct zspage *zspage = get_zspage(page);
>
> -	offset = get_first_obj_offset(page);
> +	offset = get_first_obj_offset(class, get_first_page(zspage), page);
>  	offset += class->size * index;
>
>  	while (offset < PAGE_SIZE) {
> -		head = obj_to_head(class, page, addr + offset);
> +		head = obj_to_head(page, addr + offset);
>  		if (head & OBJ_ALLOCATED_TAG) {
>  			handle = head & ~OBJ_ALLOCATED_TAG;
>  			if (trypin_tag(handle))
> @@ -1684,6 +1852,7 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
>  		zspage = list_first_entry_or_null(&class->fullness_list[fg[i]],
>  							struct zspage, list);
>  		if (zspage) {
> +			VM_BUG_ON(is_zspage_isolated(zspage));
>  			remove_zspage(class, zspage, fg[i]);
>  			return zspage;
>  		}
> @@ -1704,6 +1873,8 @@ static enum fullness_group putback_zspage(struct size_class *class,
>  {
>  	enum fullness_group fullness;
>
> +	VM_BUG_ON(is_zspage_isolated(zspage));
> +
>  	fullness = get_fullness_group(class, zspage);
>  	insert_zspage(class, zspage, fullness);
>  	set_zspage_mapping(zspage, class->index, fullness);
> @@ -1711,6 +1882,377 @@ static enum fullness_group putback_zspage(struct size_class *class,
>  	return fullness;
>  }
>
> +#ifdef CONFIG_COMPACTION
> +static struct dentry *zs_mount(struct file_system_type *fs_type,
> +				int flags, const char *dev_name, void *data)
> +{
> +	static const struct dentry_operations ops = {
> +		.d_dname = simple_dname,
> +	};
> +
> +	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
> +}
> +
> +static struct file_system_type zsmalloc_fs = {
> +	.name		= "zsmalloc",
> +	.mount		= zs_mount,
> +	.kill_sb	= kill_anon_super,
> +};
> +
> +static int zsmalloc_mount(void)
> +{
> +	int ret = 0;
> +
> +	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
> +	if (IS_ERR(zsmalloc_mnt))
> +		ret = PTR_ERR(zsmalloc_mnt);
> +
> +	return ret;
> +}
> +
> +static void zsmalloc_unmount(void)
> +{
> +	kern_unmount(zsmalloc_mnt);
> +}
> +
> +static void migrate_lock_init(struct zspage *zspage)
> +{
> +	rwlock_init(&zspage->lock);
> +}
> +
> +static void migrate_read_lock(struct zspage *zspage)
> +{
> +	read_lock(&zspage->lock);
> +}
> +
> +static void migrate_read_unlock(struct zspage *zspage)
> +{
> +	read_unlock(&zspage->lock);
> +}
> +
> +static void migrate_write_lock(struct zspage *zspage)
> +{
> +	write_lock(&zspage->lock);
> +}
> +
> +static void migrate_write_unlock(struct zspage *zspage)
> +{
> +	write_unlock(&zspage->lock);
> +}
> +
> +/* Number of isolated subpage for *page migration* in this zspage */
> +static void inc_zspage_isolation(struct zspage *zspage)
> +{
> +	zspage->isolated++;
> +}
> +
> +static void dec_zspage_isolation(struct zspage *zspage)
> +{
> +	zspage->isolated--;
> +}
> +
> +static void replace_sub_page(struct size_class *class, struct zspage *zspage,
> +				struct page *newpage, struct page *oldpage)
> +{
> +	struct page *page;
> +	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL, };
> +	int idx = 0;
> +
> +	page = get_first_page(zspage);
> +	do {
> +		if (page == oldpage)
> +			pages[idx] = newpage;
> +		else
> +			pages[idx] = page;
> +		idx++;
> +	} while ((page = get_next_page(page)) != NULL);
> +
> +	create_page_chain(class, zspage, pages);
> +	if (unlikely(PageHugeObject(oldpage)))
> +		newpage->index = oldpage->index;
> +	__SetPageMovable(newpage, page_mapping(oldpage));
> +}
> +
> +bool zs_page_isolate(struct page *page, isolate_mode_t mode)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fullness;
> +	struct zspage *zspage;
> +	struct address_space *mapping;
> +
> +	/*
> +	 * Page is locked so zspage couldn't be destroyed. For detail, look at
> +	 * lock_zspage in free_zspage.
> +	 */
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +	VM_BUG_ON_PAGE(PageIsolated(page), page);
> +
> +	zspage = get_zspage(page);
> +
> +	/*
> +	 * Without class lock, fullness could be stale while class_idx is okay
> +	 * because class_idx is constant unless page is freed so we should get
> +	 * fullness again under class lock.
> +	 */
> +	get_zspage_mapping(zspage, &class_idx, &fullness);
> +	mapping = page_mapping(page);
> +	pool = mapping->private_data;
> +	class = pool->size_class[class_idx];
> +
> +	spin_lock(&class->lock);
> +	if (get_zspage_inuse(zspage) == 0) {
> +		spin_unlock(&class->lock);
> +		return false;
> +	}
> +
> +	/* zspage is isolated for object migration */
> +	if (list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
> +		spin_unlock(&class->lock);
> +		return false;
> +	}
> +
> +	/*
> +	 * If this is first time isolation for the zspage, isolate zspage from
> +	 * size_class to prevent further object allocation from the zspage.
> +	 */
> +	if (!list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
> +		get_zspage_mapping(zspage, &class_idx, &fullness);
> +		remove_zspage(class, zspage, fullness);
> +	}
> +
> +	inc_zspage_isolation(zspage);
> +	spin_unlock(&class->lock);
> +
> +	return true;
> +}
> +
> +int zs_page_migrate(struct address_space *mapping, struct page *newpage,
> +		struct page *page, enum migrate_mode mode)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fullness;
> +	struct zspage *zspage;
> +	struct page *dummy;
> +	void *s_addr, *d_addr, *addr;
> +	int offset, pos;
> +	unsigned long handle, head;
> +	unsigned long old_obj, new_obj;
> +	unsigned int obj_idx;
> +	int ret = -EAGAIN;
> +
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> +
> +	zspage = get_zspage(page);
> +
> +	/* Concurrent compactor cannot migrate any subpage in zspage */
> +	migrate_write_lock(zspage);
> +	get_zspage_mapping(zspage, &class_idx, &fullness);
> +	pool = mapping->private_data;
> +	class = pool->size_class[class_idx];
> +	offset = get_first_obj_offset(class, get_first_page(zspage), page);
> +
> +	spin_lock(&class->lock);
> +	if (!get_zspage_inuse(zspage)) {
> +		ret = -EBUSY;
> +		goto unlock_class;
> +	}
> +
> +	pos = offset;
> +	s_addr = kmap_atomic(page);
> +	while (pos < PAGE_SIZE) {
> +		head = obj_to_head(page, s_addr + pos);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			handle = head & ~OBJ_ALLOCATED_TAG;
> +			if (!trypin_tag(handle))
> +				goto unpin_objects;
> +		}
> +		pos += class->size;
> +	}
> +
> +	/*
> +	 * Here, any user cannot access all objects in the zspage so let's move.
> +	 */
> +	d_addr = kmap_atomic(newpage);
> +	memcpy(d_addr, s_addr, PAGE_SIZE);
> +	kunmap_atomic(d_addr);
> +
> +	for (addr = s_addr + offset; addr < s_addr + pos;
> +					addr += class->size) {
> +		head = obj_to_head(page, addr);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			handle = head & ~OBJ_ALLOCATED_TAG;
> +			if (!testpin_tag(handle))
> +				BUG();
> +
> +			old_obj = handle_to_obj(handle);
> +			obj_to_location(old_obj, &dummy, &obj_idx);
> +			new_obj = (unsigned long)location_to_obj(newpage,
> +								obj_idx);
> +			new_obj |= BIT(HANDLE_PIN_BIT);
> +			record_obj(handle, new_obj);
> +		}
> +	}
> +
> +	replace_sub_page(class, zspage, newpage, page);
> +	get_page(newpage);
> +
> +	dec_zspage_isolation(zspage);
> +
> +	/*
> +	 * Page migration is done so let's putback isolated zspage to
> +	 * the list if @page is final isolated subpage in the zspage.
> +	 */
> +	if (!is_zspage_isolated(zspage))
> +		putback_zspage(class, zspage);
> +
> +	reset_page(page);
> +	put_page(page);
> +	page = newpage;
> +
> +	ret = 0;
> +unpin_objects:
> +	for (addr = s_addr + offset; addr < s_addr + pos;
> +						addr += class->size) {
> +		head = obj_to_head(page, addr);
> +		if (head & OBJ_ALLOCATED_TAG) {
> +			handle = head & ~OBJ_ALLOCATED_TAG;
> +			if (!testpin_tag(handle))
> +				BUG();
> +			unpin_tag(handle);
> +		}
> +	}
> +	kunmap_atomic(s_addr);
> +unlock_class:
> +	spin_unlock(&class->lock);
> +	migrate_write_unlock(zspage);
> +
> +	return ret;
> +}
> +
> +void zs_page_putback(struct page *page)
> +{
> +	struct zs_pool *pool;
> +	struct size_class *class;
> +	int class_idx;
> +	enum fullness_group fg;
> +	struct address_space *mapping;
> +	struct zspage *zspage;
> +
> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> +
> +	zspage = get_zspage(page);
> +	get_zspage_mapping(zspage, &class_idx, &fg);
> +	mapping = page_mapping(page);
> +	pool = mapping->private_data;
> +	class = pool->size_class[class_idx];
> +
> +	spin_lock(&class->lock);
> +	dec_zspage_isolation(zspage);
> +	if (!is_zspage_isolated(zspage)) {
> +		fg = putback_zspage(class, zspage);
> +		/*
> +		 * Due to page_lock, we cannot free zspage immediately
> +		 * so let's defer.
> +		 */
> +		if (fg == ZS_EMPTY)
> +			schedule_work(&pool->free_work);
> +	}
> +	spin_unlock(&class->lock);
> +}
> +
> +const struct address_space_operations zsmalloc_aops = {
> +	.isolate_page = zs_page_isolate,
> +	.migratepage = zs_page_migrate,
> +	.putback_page = zs_page_putback,
> +};
> +
> +static int zs_register_migration(struct zs_pool *pool)
> +{
> +	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
> +	if (IS_ERR(pool->inode)) {
> +		pool->inode = NULL;
> +		return 1;
> +	}
> +
> +	pool->inode->i_mapping->private_data = pool;
> +	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
> +	return 0;
> +}
> +
> +static void zs_unregister_migration(struct zs_pool *pool)
> +{
> +	flush_work(&pool->free_work);
> +	if (pool->inode)
> +		iput(pool->inode);
> +}
> +
> +/*
> + * Caller should hold page_lock of all pages in the zspage
> + * In here, we cannot use zspage meta data.
> + */
> +static void async_free_zspage(struct work_struct *work)
> +{
> +	int i;
> +	struct size_class *class;
> +	unsigned int class_idx;
> +	enum fullness_group fullness;
> +	struct zspage *zspage, *tmp;
> +	LIST_HEAD(free_pages);
> +	struct zs_pool *pool = container_of(work, struct zs_pool,
> +					free_work);
> +
> +	for (i = 0; i < zs_size_classes; i++) {
> +		class = pool->size_class[i];
> +		if (class->index != i)
> +			continue;
> +
> +		spin_lock(&class->lock);
> +		list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages);
> +		spin_unlock(&class->lock);
> +	}
> +
> +
> +	list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
> +		list_del(&zspage->list);
> +		lock_zspage(zspage);
> +
> +		get_zspage_mapping(zspage, &class_idx, &fullness);
> +		VM_BUG_ON(fullness != ZS_EMPTY);
> +		class = pool->size_class[class_idx];
> +		spin_lock(&class->lock);
> +		__free_zspage(pool, pool->size_class[class_idx], zspage);
> +		spin_unlock(&class->lock);
> +	}
> +};
> +
> +static void kick_deferred_free(struct zs_pool *pool)
> +{
> +	schedule_work(&pool->free_work);
> +}
> +
> +static void init_deferred_free(struct zs_pool *pool)
> +{
> +	INIT_WORK(&pool->free_work, async_free_zspage);
> +}
> +
> +static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage)
> +{
> +	struct page *page = get_first_page(zspage);
> +
> +	do {
> +		WARN_ON(!trylock_page(page));
> +		__SetPageMovable(page, pool->inode->i_mapping);
> +		unlock_page(page);
> +	} while ((page = get_next_page(page)) != NULL);
> +}
> +#endif
> +
>  /*
>   *
>   * Based on the number of unused allocated objects calculate
> @@ -1745,10 +2287,10 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>  			break;
>
>  		cc.index = 0;
> -		cc.s_page = src_zspage->first_page;
> +		cc.s_page = get_first_page(src_zspage);
>
>  		while ((dst_zspage = isolate_zspage(class, false))) {
> -			cc.d_page = dst_zspage->first_page;
> +			cc.d_page = get_first_page(dst_zspage);
>  			/*
>  			 * If there is no more space in dst_page, resched
>  			 * and see if anyone had allocated another zspage.
> @@ -1765,11 +2307,7 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>
>  		putback_zspage(class, dst_zspage);
>  		if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
> -			zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> -					class->size, class->pages_per_zspage));
> -			atomic_long_sub(class->pages_per_zspage,
> -					&pool->pages_allocated);
> -			free_zspage(pool, src_zspage);
> +			free_zspage(pool, class, src_zspage);
>  			pool->stats.pages_compacted += class->pages_per_zspage;
>  		}
>  		spin_unlock(&class->lock);
> @@ -1885,6 +2423,7 @@ struct zs_pool *zs_create_pool(const char *name)
>  	if (!pool)
>  		return NULL;
>
> +	init_deferred_free(pool);
>  	pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *),
>  			GFP_KERNEL);
>  	if (!pool->size_class) {
> @@ -1939,12 +2478,10 @@ struct zs_pool *zs_create_pool(const char *name)
>  		class->pages_per_zspage = pages_per_zspage;
>  		class->objs_per_zspage = class->pages_per_zspage *
>  						PAGE_SIZE / class->size;
> -		if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
> -			class->huge = true;
>  		spin_lock_init(&class->lock);
>  		pool->size_class[i] = class;
> -		for (fullness = ZS_ALMOST_FULL; fullness <= ZS_ALMOST_EMPTY;
> -								fullness++)
> +		for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS;
> +							fullness++)
>  			INIT_LIST_HEAD(&class->fullness_list[fullness]);
>
>  		prev_class = class;
> @@ -1953,6 +2490,9 @@ struct zs_pool *zs_create_pool(const char *name)
>  	/* debug only, don't abort if it fails */
>  	zs_pool_stat_create(pool, name);
>
> +	if (zs_register_migration(pool))
> +		goto err;
> +
>  	/*
>  	 * Not critical, we still can use the pool
>  	 * and user can trigger compaction manually.
> @@ -1972,6 +2512,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>  	int i;
>
>  	zs_unregister_shrinker(pool);
> +	zs_unregister_migration(pool);
>  	zs_pool_stat_destroy(pool);
>
>  	for (i = 0; i < zs_size_classes; i++) {
> @@ -1984,7 +2525,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>  		if (class->index != i)
>  			continue;
>
> -		for (fg = ZS_ALMOST_FULL; fg <= ZS_ALMOST_EMPTY; fg++) {
> +		for (fg = ZS_EMPTY; fg < NR_ZS_FULLNESS; fg++) {
>  			if (!list_empty(&class->fullness_list[fg])) {
>  				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
>  					class->size, fg);
> @@ -2002,7 +2543,13 @@ EXPORT_SYMBOL_GPL(zs_destroy_pool);
>
>  static int __init zs_init(void)
>  {
> -	int ret = zs_register_cpu_notifier();
> +	int ret;
> +
> +	ret = zsmalloc_mount();
> +	if (ret)
> +		goto out;
> +
> +	ret = zs_register_cpu_notifier();
>
>  	if (ret)
>  		goto notifier_fail;
> @@ -2019,7 +2566,8 @@ static int __init zs_init(void)
>
>  notifier_fail:
>  	zs_unregister_cpu_notifier();
> -
> +	zsmalloc_unmount();
> +out:
>  	return ret;
>  }
>
> @@ -2028,6 +2576,7 @@ static void __exit zs_exit(void)
>  #ifdef CONFIG_ZPOOL
>  	zpool_unregister_driver(&zs_zpool_driver);
>  #endif
> +	zsmalloc_unmount();
>  	zs_unregister_cpu_notifier();
>
>  	zs_stat_exit();
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-19  0:13     ` Chulmin Kim
@ 2017-01-19  2:44       ` Minchan Kim
  2017-01-19  3:39         ` Chulmin Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2017-01-19  2:44 UTC (permalink / raw)
  To: Chulmin Kim; +Cc: Andrew Morton, linux-mm, Sergey Senozhatsky

Hello Chulmin,

On Wed, Jan 18, 2017 at 07:13:21PM -0500, Chulmin Kim wrote:
> Hello. Minchan, and all zsmalloc guys.
> 
> I have a quick question.
> Is zsmalloc considering memory barrier things correctly?
> 
> AFAIK, on ARM64,
> zsmalloc relies only on the dmb operation in bit_spin_unlock.
> (It seems that dmb operations in the spinlock functions are being prepared,
> but let us set that aside as it is not merged yet.)
> 
> If I am correct,
> migrating a page in a zspage filled with free objs
> may cause corruption, because bit_spin_unlock will not be executed at all.
> 
> I am not sure this provides enough of a memory barrier for zsmalloc operations.
> 
> Can you enlighten me?

Do you mean bit_spin_unlock is broken, or that the zsmalloc locking scheme is broken?
Could you please describe what you are concerned about in detail?
It would be very helpful if you could explain it with an example!

Thanks.

> 
> 
> Thanks!
> Chulmin Kim
> 
> 
> 
> On 05/31/2016 07:21 PM, Minchan Kim wrote:
> >This patch introduces run-time migration feature for zspage.
> >
> >For migration, VM uses page.lru field so it would be better to not use
> >page.next field which is unified with page.lru for own purpose.
> >For that, firstly, we can get first object offset of the page via
> >runtime calculation instead of using page.index so we can use
> >page.index as link for page chaining instead of page.next.
> >	
> >In case of a huge object, it stores the handle in page.index instead of
> >the next link of the page chain, because a huge object doesn't need a
> >next link for page chaining. So get_next_page needs to identify a huge
> >object to return NULL. For that, this patch uses the PG_owner_priv_1
> >page flag.
> >
> >For migration, it supports three functions
> >
> >* zs_page_isolate
> >
> >It isolates a zspage which includes a subpage VM want to migrate
> >from class so anyone cannot allocate new object from the zspage.
> >
> >We could try to isolate a zspage by the number of subpage so
> >subsequent isolation trials of other subpages of the zspage shouldn't
> >fail. For that, we introduce zspage.isolated count. With that,
> >zs_page_isolate can know whether zspage is already isolated or not
> >for migration so if it is isolated for migration, subsequent
> >isolation trial can be successful without trying further isolation.
> >
> >* zs_page_migrate
> >
> >First of all, it holds write-side zspage->lock to prevent migrate other
> >subpage in zspage. Then, lock all objects in the page VM want to migrate.
> >The reason we should lock all objects in the page is due to race between
> >zs_map_object and zs_page_migrate.
> >
> >zs_map_object				zs_page_migrate
> >
> >pin_tag(handle)
> >obj = handle_to_obj(handle)
> >obj_to_location(obj, &page, &obj_idx);
> >
> >					write_lock(&zspage->lock)
> >					if (!trypin_tag(handle))
> >						goto unpin_object
> >
> >zspage = get_zspage(page);
> >read_lock(&zspage->lock);
> >
> >If zs_page_migrate doesn't do trypin_tag, zs_map_object's page can
> >be stale by migration so it goes crash.
> >
> >If it locks all of objects successfully, it copies content from
> >old page to new one, finally, create new zspage chain with new page.
> >And if it's last isolated subpage in the zspage, put the zspage back
> >to class.
> >
> >* zs_page_putback
> >
> >It returns isolated zspage to right fullness_group list if it fails to
> >migrate a page. If it finds a zspage is ZS_EMPTY, it queues zspage
> >freeing to workqueue. See below about async zspage freeing.
> >
> >This patch introduces asynchronous zspage free. The reason to need it
> >is we need page_lock to clear PG_movable but unfortunately,
> >zs_free path should be atomic so the approach is to try to grab page_lock.
> >If it got page_lock of all of pages successfully, it can free zspage
> >immediately. Otherwise, it queues free request and free zspage via
> >workqueue in process context.
> >
> >If zs_free finds the zspage is isolated when it tries to free the zspage,
> >it delays the freeing until zs_page_putback finds it, so the zspage will
> >finally be freed.
> >
> >In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL.
> >First of all, it will use ZS_EMPTY list for delay freeing.
> >And with the added ZS_FULL list, it is possible to identify whether a zspage is
> >isolated or not via list_empty(&zspage->list) test.
> >
> >Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> > include/uapi/linux/magic.h |   1 +
> > mm/zsmalloc.c              | 793 ++++++++++++++++++++++++++++++++++++++-------
> > 2 files changed, 672 insertions(+), 122 deletions(-)
> >
> >diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> >index d829ce63529d..e398beac67b8 100644
> >--- a/include/uapi/linux/magic.h
> >+++ b/include/uapi/linux/magic.h
> >@@ -81,5 +81,6 @@
> > /* Since UDF 2.01 is ISO 13346 based... */
> > #define UDF_SUPER_MAGIC		0x15013346
> > #define BALLOON_KVM_MAGIC	0x13661366
> >+#define ZSMALLOC_MAGIC		0x58295829
> >
> > #endif /* __LINUX_MAGIC_H__ */
> >diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> >index c6fb543cfb98..a80100db16d6 100644
> >--- a/mm/zsmalloc.c
> >+++ b/mm/zsmalloc.c
> >@@ -17,14 +17,14 @@
> >  *
> >  * Usage of struct page fields:
> >  *	page->private: points to zspage
> >- *	page->index: offset of the first object starting in this page.
> >- *		For the first page, this is always 0, so we use this field
> >- *		to store handle for huge object.
> >- *	page->next: links together all component pages of a zspage
> >+ *	page->freelist(index): links together all component pages of a zspage
> >+ *		For the huge page, this is always 0, so we use this field
> >+ *		to store handle.
> >  *
> >  * Usage of struct page flags:
> >  *	PG_private: identifies the first component page
> >  *	PG_private2: identifies the last component page
> >+ *	PG_owner_priv_1: indentifies the huge component page
> >  *
> >  */
> >
> >@@ -49,6 +49,11 @@
> > #include <linux/debugfs.h>
> > #include <linux/zsmalloc.h>
> > #include <linux/zpool.h>
> >+#include <linux/mount.h>
> >+#include <linux/compaction.h>
> >+#include <linux/pagemap.h>
> >+
> >+#define ZSPAGE_MAGIC	0x58
> >
> > /*
> >  * This must be power of 2 and greater than of equal to sizeof(link_free).
> >@@ -136,25 +141,23 @@
> >  * We do not maintain any list for completely empty or full pages
> >  */
> > enum fullness_group {
> >-	ZS_ALMOST_FULL,
> >-	ZS_ALMOST_EMPTY,
> > 	ZS_EMPTY,
> >-	ZS_FULL
> >+	ZS_ALMOST_EMPTY,
> >+	ZS_ALMOST_FULL,
> >+	ZS_FULL,
> >+	NR_ZS_FULLNESS,
> > };
> >
> > enum zs_stat_type {
> >+	CLASS_EMPTY,
> >+	CLASS_ALMOST_EMPTY,
> >+	CLASS_ALMOST_FULL,
> >+	CLASS_FULL,
> > 	OBJ_ALLOCATED,
> > 	OBJ_USED,
> >-	CLASS_ALMOST_FULL,
> >-	CLASS_ALMOST_EMPTY,
> >+	NR_ZS_STAT_TYPE,
> > };
> >
> >-#ifdef CONFIG_ZSMALLOC_STAT
> >-#define NR_ZS_STAT_TYPE	(CLASS_ALMOST_EMPTY + 1)
> >-#else
> >-#define NR_ZS_STAT_TYPE	(OBJ_USED + 1)
> >-#endif
> >-
> > struct zs_size_stat {
> > 	unsigned long objs[NR_ZS_STAT_TYPE];
> > };
> >@@ -163,6 +166,10 @@ struct zs_size_stat {
> > static struct dentry *zs_stat_root;
> > #endif
> >
> >+#ifdef CONFIG_COMPACTION
> >+static struct vfsmount *zsmalloc_mnt;
> >+#endif
> >+
> > /*
> >  * number of size_classes
> >  */
> >@@ -186,23 +193,36 @@ static const int fullness_threshold_frac = 4;
> >
> > struct size_class {
> > 	spinlock_t lock;
> >-	struct list_head fullness_list[2];
> >+	struct list_head fullness_list[NR_ZS_FULLNESS];
> > 	/*
> > 	 * Size of objects stored in this class. Must be multiple
> > 	 * of ZS_ALIGN.
> > 	 */
> > 	int size;
> > 	int objs_per_zspage;
> >-	unsigned int index;
> >-
> >-	struct zs_size_stat stats;
> >-
> > 	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
> > 	int pages_per_zspage;
> >-	/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
> >-	bool huge;
> >+
> >+	unsigned int index;
> >+	struct zs_size_stat stats;
> > };
> >
> >+/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
> >+static void SetPageHugeObject(struct page *page)
> >+{
> >+	SetPageOwnerPriv1(page);
> >+}
> >+
> >+static void ClearPageHugeObject(struct page *page)
> >+{
> >+	ClearPageOwnerPriv1(page);
> >+}
> >+
> >+static int PageHugeObject(struct page *page)
> >+{
> >+	return PageOwnerPriv1(page);
> >+}
> >+
> > /*
> >  * Placed within free objects to form a singly linked list.
> >  * For every zspage, zspage->freeobj gives head of this list.
> >@@ -244,6 +264,10 @@ struct zs_pool {
> > #ifdef CONFIG_ZSMALLOC_STAT
> > 	struct dentry *stat_dentry;
> > #endif
> >+#ifdef CONFIG_COMPACTION
> >+	struct inode *inode;
> >+	struct work_struct free_work;
> >+#endif
> > };
> >
> > /*
> >@@ -252,16 +276,23 @@ struct zs_pool {
> >  */
> > #define FULLNESS_BITS	2
> > #define CLASS_BITS	8
> >+#define ISOLATED_BITS	3
> >+#define MAGIC_VAL_BITS	8
> >
> > struct zspage {
> > 	struct {
> > 		unsigned int fullness:FULLNESS_BITS;
> > 		unsigned int class:CLASS_BITS;
> >+		unsigned int isolated:ISOLATED_BITS;
> >+		unsigned int magic:MAGIC_VAL_BITS;
> > 	};
> > 	unsigned int inuse;
> > 	unsigned int freeobj;
> > 	struct page *first_page;
> > 	struct list_head list; /* fullness list */
> >+#ifdef CONFIG_COMPACTION
> >+	rwlock_t lock;
> >+#endif
> > };
> >
> > struct mapping_area {
> >@@ -274,6 +305,28 @@ struct mapping_area {
> > 	enum zs_mapmode vm_mm; /* mapping mode */
> > };
> >
> >+#ifdef CONFIG_COMPACTION
> >+static int zs_register_migration(struct zs_pool *pool);
> >+static void zs_unregister_migration(struct zs_pool *pool);
> >+static void migrate_lock_init(struct zspage *zspage);
> >+static void migrate_read_lock(struct zspage *zspage);
> >+static void migrate_read_unlock(struct zspage *zspage);
> >+static void kick_deferred_free(struct zs_pool *pool);
> >+static void init_deferred_free(struct zs_pool *pool);
> >+static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage);
> >+#else
> >+static int zsmalloc_mount(void) { return 0; }
> >+static void zsmalloc_unmount(void) {}
> >+static int zs_register_migration(struct zs_pool *pool) { return 0; }
> >+static void zs_unregister_migration(struct zs_pool *pool) {}
> >+static void migrate_lock_init(struct zspage *zspage) {}
> >+static void migrate_read_lock(struct zspage *zspage) {}
> >+static void migrate_read_unlock(struct zspage *zspage) {}
> >+static void kick_deferred_free(struct zs_pool *pool) {}
> >+static void init_deferred_free(struct zs_pool *pool) {}
> >+static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {}
> >+#endif
> >+
> > static int create_cache(struct zs_pool *pool)
> > {
> > 	pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
> >@@ -301,7 +354,7 @@ static void destroy_cache(struct zs_pool *pool)
> > static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
> > {
> > 	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
> >-			gfp & ~__GFP_HIGHMEM);
> >+			gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
> > }
> >
> > static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
> >@@ -311,7 +364,8 @@ static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
> >
> > static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
> > {
> >-	return kmem_cache_alloc(pool->zspage_cachep, flags & ~__GFP_HIGHMEM);
> >+	return kmem_cache_alloc(pool->zspage_cachep,
> >+			flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
> > };
> >
> > static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
> >@@ -421,11 +475,17 @@ static unsigned int get_maxobj_per_zspage(int size, int pages_per_zspage)
> > /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
> > static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
> >
> >+static bool is_zspage_isolated(struct zspage *zspage)
> >+{
> >+	return zspage->isolated;
> >+}
> >+
> > static int is_first_page(struct page *page)
> > {
> > 	return PagePrivate(page);
> > }
> >
> >+/* Protected by class->lock */
> > static inline int get_zspage_inuse(struct zspage *zspage)
> > {
> > 	return zspage->inuse;
> >@@ -441,20 +501,12 @@ static inline void mod_zspage_inuse(struct zspage *zspage, int val)
> > 	zspage->inuse += val;
> > }
> >
> >-static inline int get_first_obj_offset(struct page *page)
> >+static inline struct page *get_first_page(struct zspage *zspage)
> > {
> >-	if (is_first_page(page))
> >-		return 0;
> >+	struct page *first_page = zspage->first_page;
> >
> >-	return page->index;
> >-}
> >-
> >-static inline void set_first_obj_offset(struct page *page, int offset)
> >-{
> >-	if (is_first_page(page))
> >-		return;
> >-
> >-	page->index = offset;
> >+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >+	return first_page;
> > }
> >
> > static inline unsigned int get_freeobj(struct zspage *zspage)
> >@@ -471,6 +523,8 @@ static void get_zspage_mapping(struct zspage *zspage,
> > 				unsigned int *class_idx,
> > 				enum fullness_group *fullness)
> > {
> >+	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
> >+
> > 	*fullness = zspage->fullness;
> > 	*class_idx = zspage->class;
> > }
> >@@ -504,23 +558,19 @@ static int get_size_class_index(int size)
> > static inline void zs_stat_inc(struct size_class *class,
> > 				enum zs_stat_type type, unsigned long cnt)
> > {
> >-	if (type < NR_ZS_STAT_TYPE)
> >-		class->stats.objs[type] += cnt;
> >+	class->stats.objs[type] += cnt;
> > }
> >
> > static inline void zs_stat_dec(struct size_class *class,
> > 				enum zs_stat_type type, unsigned long cnt)
> > {
> >-	if (type < NR_ZS_STAT_TYPE)
> >-		class->stats.objs[type] -= cnt;
> >+	class->stats.objs[type] -= cnt;
> > }
> >
> > static inline unsigned long zs_stat_get(struct size_class *class,
> > 				enum zs_stat_type type)
> > {
> >-	if (type < NR_ZS_STAT_TYPE)
> >-		return class->stats.objs[type];
> >-	return 0;
> >+	return class->stats.objs[type];
> > }
> >
> > #ifdef CONFIG_ZSMALLOC_STAT
> >@@ -664,6 +714,7 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
> > }
> > #endif
> >
> >+
> > /*
> >  * For each size class, zspages are divided into different groups
> >  * depending on how "full" they are. This was done so that we could
> >@@ -704,15 +755,9 @@ static void insert_zspage(struct size_class *class,
> > {
> > 	struct zspage *head;
> >
> >-	if (fullness >= ZS_EMPTY)
> >-		return;
> >-
> >+	zs_stat_inc(class, fullness, 1);
> > 	head = list_first_entry_or_null(&class->fullness_list[fullness],
> > 					struct zspage, list);
> >-
> >-	zs_stat_inc(class, fullness == ZS_ALMOST_EMPTY ?
> >-			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
> >-
> > 	/*
> > 	 * We want to see more ZS_FULL pages and less almost empty/full.
> > 	 * Put pages with higher ->inuse first.
> >@@ -734,14 +779,11 @@ static void remove_zspage(struct size_class *class,
> > 				struct zspage *zspage,
> > 				enum fullness_group fullness)
> > {
> >-	if (fullness >= ZS_EMPTY)
> >-		return;
> >-
> > 	VM_BUG_ON(list_empty(&class->fullness_list[fullness]));
> >+	VM_BUG_ON(is_zspage_isolated(zspage));
> >
> > 	list_del_init(&zspage->list);
> >-	zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ?
> >-			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
> >+	zs_stat_dec(class, fullness, 1);
> > }
> >
> > /*
> >@@ -764,8 +806,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
> > 	if (newfg == currfg)
> > 		goto out;
> >
> >-	remove_zspage(class, zspage, currfg);
> >-	insert_zspage(class, zspage, newfg);
> >+	if (!is_zspage_isolated(zspage)) {
> >+		remove_zspage(class, zspage, currfg);
> >+		insert_zspage(class, zspage, newfg);
> >+	}
> >+
> > 	set_zspage_mapping(zspage, class_idx, newfg);
> >
> > out:
> >@@ -808,19 +853,45 @@ static int get_pages_per_zspage(int class_size)
> > 	return max_usedpc_order;
> > }
> >
> >-static struct page *get_first_page(struct zspage *zspage)
> >+static struct zspage *get_zspage(struct page *page)
> > {
> >-	return zspage->first_page;
> >+	struct zspage *zspage = (struct zspage *)page->private;
> >+
> >+	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
> >+	return zspage;
> > }
> >
> >-static struct zspage *get_zspage(struct page *page)
> >+static struct page *get_next_page(struct page *page)
> > {
> >-	return (struct zspage *)page->private;
> >+	if (unlikely(PageHugeObject(page)))
> >+		return NULL;
> >+
> >+	return page->freelist;
> > }
> >
> >-static struct page *get_next_page(struct page *page)
> >+/* Get byte offset of first object in the @page */
> >+static int get_first_obj_offset(struct size_class *class,
> >+				struct page *first_page, struct page *page)
> > {
> >-	return page->next;
> >+	int pos;
> >+	int page_idx = 0;
> >+	int ofs = 0;
> >+	struct page *cursor = first_page;
> >+
> >+	if (first_page == page)
> >+		goto out;
> >+
> >+	while (page != cursor) {
> >+		page_idx++;
> >+		cursor = get_next_page(cursor);
> >+	}
> >+
> >+	pos = class->objs_per_zspage * class->size *
> >+		page_idx / class->pages_per_zspage;
> >+
> >+	ofs = (pos + class->size) % PAGE_SIZE;
> >+out:
> >+	return ofs;
> > }
> >
> > /**
> >@@ -857,16 +928,20 @@ static unsigned long handle_to_obj(unsigned long handle)
> > 	return *(unsigned long *)handle;
> > }
> >
> >-static unsigned long obj_to_head(struct size_class *class, struct page *page,
> >-			void *obj)
> >+static unsigned long obj_to_head(struct page *page, void *obj)
> > {
> >-	if (class->huge) {
> >+	if (unlikely(PageHugeObject(page))) {
> > 		VM_BUG_ON_PAGE(!is_first_page(page), page);
> > 		return page->index;
> > 	} else
> > 		return *(unsigned long *)obj;
> > }
> >
> >+static inline int testpin_tag(unsigned long handle)
> >+{
> >+	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
> >+}
> >+
> > static inline int trypin_tag(unsigned long handle)
> > {
> > 	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
> >@@ -884,27 +959,93 @@ static void unpin_tag(unsigned long handle)
> >
> > static void reset_page(struct page *page)
> > {
> >+	__ClearPageMovable(page);
> > 	clear_bit(PG_private, &page->flags);
> > 	clear_bit(PG_private_2, &page->flags);
> > 	set_page_private(page, 0);
> >-	page->index = 0;
> >+	ClearPageHugeObject(page);
> >+	page->freelist = NULL;
> > }
> >
> >-static void free_zspage(struct zs_pool *pool, struct zspage *zspage)
> >+/*
> >+ * To prevent zspage destroy during migration, zspage freeing should
> >+ * hold locks of all pages in the zspage.
> >+ */
> >+void lock_zspage(struct zspage *zspage)
> >+{
> >+	struct page *page = get_first_page(zspage);
> >+
> >+	do {
> >+		lock_page(page);
> >+	} while ((page = get_next_page(page)) != NULL);
> >+}
> >+
> >+int trylock_zspage(struct zspage *zspage)
> >+{
> >+	struct page *cursor, *fail;
> >+
> >+	for (cursor = get_first_page(zspage); cursor != NULL; cursor =
> >+					get_next_page(cursor)) {
> >+		if (!trylock_page(cursor)) {
> >+			fail = cursor;
> >+			goto unlock;
> >+		}
> >+	}
> >+
> >+	return 1;
> >+unlock:
> >+	for (cursor = get_first_page(zspage); cursor != fail; cursor =
> >+					get_next_page(cursor))
> >+		unlock_page(cursor);
> >+
> >+	return 0;
> >+}
> >+
> >+static void __free_zspage(struct zs_pool *pool, struct size_class *class,
> >+				struct zspage *zspage)
> > {
> > 	struct page *page, *next;
> >+	enum fullness_group fg;
> >+	unsigned int class_idx;
> >+
> >+	get_zspage_mapping(zspage, &class_idx, &fg);
> >+
> >+	assert_spin_locked(&class->lock);
> >
> > 	VM_BUG_ON(get_zspage_inuse(zspage));
> >+	VM_BUG_ON(fg != ZS_EMPTY);
> >
> >-	next = page = zspage->first_page;
> >+	next = page = get_first_page(zspage);
> > 	do {
> >-		next = page->next;
> >+		VM_BUG_ON_PAGE(!PageLocked(page), page);
> >+		next = get_next_page(page);
> > 		reset_page(page);
> >+		unlock_page(page);
> > 		put_page(page);
> > 		page = next;
> > 	} while (page != NULL);
> >
> > 	cache_free_zspage(pool, zspage);
> >+
> >+	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >+			class->size, class->pages_per_zspage));
> >+	atomic_long_sub(class->pages_per_zspage,
> >+					&pool->pages_allocated);
> >+}
> >+
> >+static void free_zspage(struct zs_pool *pool, struct size_class *class,
> >+				struct zspage *zspage)
> >+{
> >+	VM_BUG_ON(get_zspage_inuse(zspage));
> >+	VM_BUG_ON(list_empty(&zspage->list));
> >+
> >+	if (!trylock_zspage(zspage)) {
> >+		kick_deferred_free(pool);
> >+		return;
> >+	}
> >+
> >+	remove_zspage(class, zspage, ZS_EMPTY);
> >+	__free_zspage(pool, class, zspage);
> > }
> >
> > /* Initialize a newly allocated zspage */
> >@@ -912,15 +1053,13 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
> > {
> > 	unsigned int freeobj = 1;
> > 	unsigned long off = 0;
> >-	struct page *page = zspage->first_page;
> >+	struct page *page = get_first_page(zspage);
> >
> > 	while (page) {
> > 		struct page *next_page;
> > 		struct link_free *link;
> > 		void *vaddr;
> >
> >-		set_first_obj_offset(page, off);
> >-
> > 		vaddr = kmap_atomic(page);
> > 		link = (struct link_free *)vaddr + off / sizeof(*link);
> >
> >@@ -952,16 +1091,17 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
> > 	set_freeobj(zspage, 0);
> > }
> >
> >-static void create_page_chain(struct zspage *zspage, struct page *pages[],
> >-				int nr_pages)
> >+static void create_page_chain(struct size_class *class, struct zspage *zspage,
> >+				struct page *pages[])
> > {
> > 	int i;
> > 	struct page *page;
> > 	struct page *prev_page = NULL;
> >+	int nr_pages = class->pages_per_zspage;
> >
> > 	/*
> > 	 * Allocate individual pages and link them together as:
> >-	 * 1. all pages are linked together using page->next
> >+	 * 1. all pages are linked together using page->freelist
> > 	 * 2. each sub-page point to zspage using page->private
> > 	 *
> > 	 * we set PG_private to identify the first page (i.e. no other sub-page
> >@@ -970,16 +1110,18 @@ static void create_page_chain(struct zspage *zspage, struct page *pages[],
> > 	for (i = 0; i < nr_pages; i++) {
> > 		page = pages[i];
> > 		set_page_private(page, (unsigned long)zspage);
> >+		page->freelist = NULL;
> > 		if (i == 0) {
> > 			zspage->first_page = page;
> > 			SetPagePrivate(page);
> >+			if (unlikely(class->objs_per_zspage == 1 &&
> >+					class->pages_per_zspage == 1))
> >+				SetPageHugeObject(page);
> > 		} else {
> >-			prev_page->next = page;
> >+			prev_page->freelist = page;
> > 		}
> >-		if (i == nr_pages - 1) {
> >+		if (i == nr_pages - 1)
> > 			SetPagePrivate2(page);
> >-			page->next = NULL;
> >-		}
> > 		prev_page = page;
> > 	}
> > }
> >@@ -999,6 +1141,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> > 		return NULL;
> >
> > 	memset(zspage, 0, sizeof(struct zspage));
> >+	zspage->magic = ZSPAGE_MAGIC;
> >+	migrate_lock_init(zspage);
> >
> > 	for (i = 0; i < class->pages_per_zspage; i++) {
> > 		struct page *page;
> >@@ -1013,7 +1157,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> > 		pages[i] = page;
> > 	}
> >
> >-	create_page_chain(zspage, pages, class->pages_per_zspage);
> >+	create_page_chain(class, zspage, pages);
> > 	init_zspage(class, zspage);
> >
> > 	return zspage;
> >@@ -1024,7 +1168,7 @@ static struct zspage *find_get_zspage(struct size_class *class)
> > 	int i;
> > 	struct zspage *zspage;
> >
> >-	for (i = ZS_ALMOST_FULL; i <= ZS_ALMOST_EMPTY; i++) {
> >+	for (i = ZS_ALMOST_FULL; i >= ZS_EMPTY; i--) {
> > 		zspage = list_first_entry_or_null(&class->fullness_list[i],
> > 				struct zspage, list);
> > 		if (zspage)
> >@@ -1289,6 +1433,10 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> > 	obj = handle_to_obj(handle);
> > 	obj_to_location(obj, &page, &obj_idx);
> > 	zspage = get_zspage(page);
> >+
> >+	/* migration cannot move any subpage in this zspage */
> >+	migrate_read_lock(zspage);
> >+
> > 	get_zspage_mapping(zspage, &class_idx, &fg);
> > 	class = pool->size_class[class_idx];
> > 	off = (class->size * obj_idx) & ~PAGE_MASK;
> >@@ -1309,7 +1457,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> >
> > 	ret = __zs_map_object(area, pages, off, class->size);
> > out:
> >-	if (!class->huge)
> >+	if (likely(!PageHugeObject(page)))
> > 		ret += ZS_HANDLE_SIZE;
> >
> > 	return ret;
> >@@ -1348,6 +1496,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
> > 		__zs_unmap_object(area, pages, off, class->size);
> > 	}
> > 	put_cpu_var(zs_map_area);
> >+
> >+	migrate_read_unlock(zspage);
> > 	unpin_tag(handle);
> > }
> > EXPORT_SYMBOL_GPL(zs_unmap_object);
> >@@ -1377,7 +1527,7 @@ static unsigned long obj_malloc(struct size_class *class,
> > 	vaddr = kmap_atomic(m_page);
> > 	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
> > 	set_freeobj(zspage, link->next >> OBJ_ALLOCATED_TAG);
> >-	if (!class->huge)
> >+	if (likely(!PageHugeObject(m_page)))
> > 		/* record handle in the header of allocated chunk */
> > 		link->handle = handle;
> > 	else
> >@@ -1407,6 +1557,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> > {
> > 	unsigned long handle, obj;
> > 	struct size_class *class;
> >+	enum fullness_group newfg;
> > 	struct zspage *zspage;
> >
> > 	if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
> >@@ -1422,28 +1573,37 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> >
> > 	spin_lock(&class->lock);
> > 	zspage = find_get_zspage(class);
> >-
> >-	if (!zspage) {
> >+	if (likely(zspage)) {
> >+		obj = obj_malloc(class, zspage, handle);
> >+		/* Now move the zspage to another fullness group, if required */
> >+		fix_fullness_group(class, zspage);
> >+		record_obj(handle, obj);
> > 		spin_unlock(&class->lock);
> >-		zspage = alloc_zspage(pool, class, gfp);
> >-		if (unlikely(!zspage)) {
> >-			cache_free_handle(pool, handle);
> >-			return 0;
> >-		}
> >
> >-		set_zspage_mapping(zspage, class->index, ZS_EMPTY);
> >-		atomic_long_add(class->pages_per_zspage,
> >-					&pool->pages_allocated);
> >+		return handle;
> >+	}
> >
> >-		spin_lock(&class->lock);
> >-		zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >-				class->size, class->pages_per_zspage));
> >+	spin_unlock(&class->lock);
> >+
> >+	zspage = alloc_zspage(pool, class, gfp);
> >+	if (!zspage) {
> >+		cache_free_handle(pool, handle);
> >+		return 0;
> > 	}
> >
> >+	spin_lock(&class->lock);
> > 	obj = obj_malloc(class, zspage, handle);
> >-	/* Now move the zspage to another fullness group, if required */
> >-	fix_fullness_group(class, zspage);
> >+	newfg = get_fullness_group(class, zspage);
> >+	insert_zspage(class, zspage, newfg);
> >+	set_zspage_mapping(zspage, class->index, newfg);
> > 	record_obj(handle, obj);
> >+	atomic_long_add(class->pages_per_zspage,
> >+				&pool->pages_allocated);
> >+	zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >+			class->size, class->pages_per_zspage));
> >+
> >+	/* We completely set up zspage so mark them as movable */
> >+	SetZsPageMovable(pool, zspage);
> > 	spin_unlock(&class->lock);
> >
> > 	return handle;
> >@@ -1484,6 +1644,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> > 	int class_idx;
> > 	struct size_class *class;
> > 	enum fullness_group fullness;
> >+	bool isolated;
> >
> > 	if (unlikely(!handle))
> > 		return;
> >@@ -1493,22 +1654,28 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> > 	obj_to_location(obj, &f_page, &f_objidx);
> > 	zspage = get_zspage(f_page);
> >
> >+	migrate_read_lock(zspage);
> >+
> > 	get_zspage_mapping(zspage, &class_idx, &fullness);
> > 	class = pool->size_class[class_idx];
> >
> > 	spin_lock(&class->lock);
> > 	obj_free(class, obj);
> > 	fullness = fix_fullness_group(class, zspage);
> >-	if (fullness == ZS_EMPTY) {
> >-		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >-				class->size, class->pages_per_zspage));
> >-		atomic_long_sub(class->pages_per_zspage,
> >-				&pool->pages_allocated);
> >-		free_zspage(pool, zspage);
> >+	if (fullness != ZS_EMPTY) {
> >+		migrate_read_unlock(zspage);
> >+		goto out;
> > 	}
> >+
> >+	isolated = is_zspage_isolated(zspage);
> >+	migrate_read_unlock(zspage);
> >+	/* If zspage is isolated, zs_page_putback will free the zspage */
> >+	if (likely(!isolated))
> >+		free_zspage(pool, class, zspage);
> >+out:
> >+
> > 	spin_unlock(&class->lock);
> > 	unpin_tag(handle);
> >-
> > 	cache_free_handle(pool, handle);
> > }
> > EXPORT_SYMBOL_GPL(zs_free);
> >@@ -1587,12 +1754,13 @@ static unsigned long find_alloced_obj(struct size_class *class,
> > 	int offset = 0;
> > 	unsigned long handle = 0;
> > 	void *addr = kmap_atomic(page);
> >+	struct zspage *zspage = get_zspage(page);
> >
> >-	offset = get_first_obj_offset(page);
> >+	offset = get_first_obj_offset(class, get_first_page(zspage), page);
> > 	offset += class->size * index;
> >
> > 	while (offset < PAGE_SIZE) {
> >-		head = obj_to_head(class, page, addr + offset);
> >+		head = obj_to_head(page, addr + offset);
> > 		if (head & OBJ_ALLOCATED_TAG) {
> > 			handle = head & ~OBJ_ALLOCATED_TAG;
> > 			if (trypin_tag(handle))
> >@@ -1684,6 +1852,7 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
> > 		zspage = list_first_entry_or_null(&class->fullness_list[fg[i]],
> > 							struct zspage, list);
> > 		if (zspage) {
> >+			VM_BUG_ON(is_zspage_isolated(zspage));
> > 			remove_zspage(class, zspage, fg[i]);
> > 			return zspage;
> > 		}
> >@@ -1704,6 +1873,8 @@ static enum fullness_group putback_zspage(struct size_class *class,
> > {
> > 	enum fullness_group fullness;
> >
> >+	VM_BUG_ON(is_zspage_isolated(zspage));
> >+
> > 	fullness = get_fullness_group(class, zspage);
> > 	insert_zspage(class, zspage, fullness);
> > 	set_zspage_mapping(zspage, class->index, fullness);
> >@@ -1711,6 +1882,377 @@ static enum fullness_group putback_zspage(struct size_class *class,
> > 	return fullness;
> > }
> >
> >+#ifdef CONFIG_COMPACTION
> >+static struct dentry *zs_mount(struct file_system_type *fs_type,
> >+				int flags, const char *dev_name, void *data)
> >+{
> >+	static const struct dentry_operations ops = {
> >+		.d_dname = simple_dname,
> >+	};
> >+
> >+	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
> >+}
> >+
> >+static struct file_system_type zsmalloc_fs = {
> >+	.name		= "zsmalloc",
> >+	.mount		= zs_mount,
> >+	.kill_sb	= kill_anon_super,
> >+};
> >+
> >+static int zsmalloc_mount(void)
> >+{
> >+	int ret = 0;
> >+
> >+	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
> >+	if (IS_ERR(zsmalloc_mnt))
> >+		ret = PTR_ERR(zsmalloc_mnt);
> >+
> >+	return ret;
> >+}
> >+
> >+static void zsmalloc_unmount(void)
> >+{
> >+	kern_unmount(zsmalloc_mnt);
> >+}
> >+
> >+static void migrate_lock_init(struct zspage *zspage)
> >+{
> >+	rwlock_init(&zspage->lock);
> >+}
> >+
> >+static void migrate_read_lock(struct zspage *zspage)
> >+{
> >+	read_lock(&zspage->lock);
> >+}
> >+
> >+static void migrate_read_unlock(struct zspage *zspage)
> >+{
> >+	read_unlock(&zspage->lock);
> >+}
> >+
> >+static void migrate_write_lock(struct zspage *zspage)
> >+{
> >+	write_lock(&zspage->lock);
> >+}
> >+
> >+static void migrate_write_unlock(struct zspage *zspage)
> >+{
> >+	write_unlock(&zspage->lock);
> >+}
> >+
> >+/* Number of isolated subpage for *page migration* in this zspage */
> >+static void inc_zspage_isolation(struct zspage *zspage)
> >+{
> >+	zspage->isolated++;
> >+}
> >+
> >+static void dec_zspage_isolation(struct zspage *zspage)
> >+{
> >+	zspage->isolated--;
> >+}
> >+
> >+static void replace_sub_page(struct size_class *class, struct zspage *zspage,
> >+				struct page *newpage, struct page *oldpage)
> >+{
> >+	struct page *page;
> >+	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL, };
> >+	int idx = 0;
> >+
> >+	page = get_first_page(zspage);
> >+	do {
> >+		if (page == oldpage)
> >+			pages[idx] = newpage;
> >+		else
> >+			pages[idx] = page;
> >+		idx++;
> >+	} while ((page = get_next_page(page)) != NULL);
> >+
> >+	create_page_chain(class, zspage, pages);
> >+	if (unlikely(PageHugeObject(oldpage)))
> >+		newpage->index = oldpage->index;
> >+	__SetPageMovable(newpage, page_mapping(oldpage));
> >+}
> >+
> >+bool zs_page_isolate(struct page *page, isolate_mode_t mode)
> >+{
> >+	struct zs_pool *pool;
> >+	struct size_class *class;
> >+	int class_idx;
> >+	enum fullness_group fullness;
> >+	struct zspage *zspage;
> >+	struct address_space *mapping;
> >+
> >+	/*
> >+	 * Page is locked so zspage couldn't be destroyed. For detail, look at
> >+	 * lock_zspage in free_zspage.
> >+	 */
> >+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >+	VM_BUG_ON_PAGE(PageIsolated(page), page);
> >+
> >+	zspage = get_zspage(page);
> >+
> >+	/*
> >+	 * Without class lock, fullness could be stale while class_idx is okay
> >+	 * because class_idx is constant unless page is freed so we should get
> >+	 * fullness again under class lock.
> >+	 */
> >+	get_zspage_mapping(zspage, &class_idx, &fullness);
> >+	mapping = page_mapping(page);
> >+	pool = mapping->private_data;
> >+	class = pool->size_class[class_idx];
> >+
> >+	spin_lock(&class->lock);
> >+	if (get_zspage_inuse(zspage) == 0) {
> >+		spin_unlock(&class->lock);
> >+		return false;
> >+	}
> >+
> >+	/* zspage is isolated for object migration */
> >+	if (list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
> >+		spin_unlock(&class->lock);
> >+		return false;
> >+	}
> >+
> >+	/*
> >+	 * If this is first time isolation for the zspage, isolate zspage from
> >+	 * size_class to prevent further object allocation from the zspage.
> >+	 */
> >+	if (!list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
> >+		get_zspage_mapping(zspage, &class_idx, &fullness);
> >+		remove_zspage(class, zspage, fullness);
> >+	}
> >+
> >+	inc_zspage_isolation(zspage);
> >+	spin_unlock(&class->lock);
> >+
> >+	return true;
> >+}
> >+
> >+int zs_page_migrate(struct address_space *mapping, struct page *newpage,
> >+		struct page *page, enum migrate_mode mode)
> >+{
> >+	struct zs_pool *pool;
> >+	struct size_class *class;
> >+	int class_idx;
> >+	enum fullness_group fullness;
> >+	struct zspage *zspage;
> >+	struct page *dummy;
> >+	void *s_addr, *d_addr, *addr;
> >+	int offset, pos;
> >+	unsigned long handle, head;
> >+	unsigned long old_obj, new_obj;
> >+	unsigned int obj_idx;
> >+	int ret = -EAGAIN;
> >+
> >+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> >+
> >+	zspage = get_zspage(page);
> >+
> >+	/* Concurrent compactor cannot migrate any subpage in zspage */
> >+	migrate_write_lock(zspage);
> >+	get_zspage_mapping(zspage, &class_idx, &fullness);
> >+	pool = mapping->private_data;
> >+	class = pool->size_class[class_idx];
> >+	offset = get_first_obj_offset(class, get_first_page(zspage), page);
> >+
> >+	spin_lock(&class->lock);
> >+	if (!get_zspage_inuse(zspage)) {
> >+		ret = -EBUSY;
> >+		goto unlock_class;
> >+	}
> >+
> >+	pos = offset;
> >+	s_addr = kmap_atomic(page);
> >+	while (pos < PAGE_SIZE) {
> >+		head = obj_to_head(page, s_addr + pos);
> >+		if (head & OBJ_ALLOCATED_TAG) {
> >+			handle = head & ~OBJ_ALLOCATED_TAG;
> >+			if (!trypin_tag(handle))
> >+				goto unpin_objects;
> >+		}
> >+		pos += class->size;
> >+	}
> >+
> >+	/*
> >+	 * Here, any user cannot access all objects in the zspage so let's move.
> >+	 */
> >+	d_addr = kmap_atomic(newpage);
> >+	memcpy(d_addr, s_addr, PAGE_SIZE);
> >+	kunmap_atomic(d_addr);
> >+
> >+	for (addr = s_addr + offset; addr < s_addr + pos;
> >+					addr += class->size) {
> >+		head = obj_to_head(page, addr);
> >+		if (head & OBJ_ALLOCATED_TAG) {
> >+			handle = head & ~OBJ_ALLOCATED_TAG;
> >+			if (!testpin_tag(handle))
> >+				BUG();
> >+
> >+			old_obj = handle_to_obj(handle);
> >+			obj_to_location(old_obj, &dummy, &obj_idx);
> >+			new_obj = (unsigned long)location_to_obj(newpage,
> >+								obj_idx);
> >+			new_obj |= BIT(HANDLE_PIN_BIT);
> >+			record_obj(handle, new_obj);
> >+		}
> >+	}
> >+
> >+	replace_sub_page(class, zspage, newpage, page);
> >+	get_page(newpage);
> >+
> >+	dec_zspage_isolation(zspage);
> >+
> >+	/*
> >+	 * Page migration is done so let's putback isolated zspage to
> >+	 * the list if @page is final isolated subpage in the zspage.
> >+	 */
> >+	if (!is_zspage_isolated(zspage))
> >+		putback_zspage(class, zspage);
> >+
> >+	reset_page(page);
> >+	put_page(page);
> >+	page = newpage;
> >+
> >+	ret = 0;
> >+unpin_objects:
> >+	for (addr = s_addr + offset; addr < s_addr + pos;
> >+						addr += class->size) {
> >+		head = obj_to_head(page, addr);
> >+		if (head & OBJ_ALLOCATED_TAG) {
> >+			handle = head & ~OBJ_ALLOCATED_TAG;
> >+			if (!testpin_tag(handle))
> >+				BUG();
> >+			unpin_tag(handle);
> >+		}
> >+	}
> >+	kunmap_atomic(s_addr);
> >+unlock_class:
> >+	spin_unlock(&class->lock);
> >+	migrate_write_unlock(zspage);
> >+
> >+	return ret;
> >+}
> >+
> >+void zs_page_putback(struct page *page)
> >+{
> >+	struct zs_pool *pool;
> >+	struct size_class *class;
> >+	int class_idx;
> >+	enum fullness_group fg;
> >+	struct address_space *mapping;
> >+	struct zspage *zspage;
> >+
> >+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> >+
> >+	zspage = get_zspage(page);
> >+	get_zspage_mapping(zspage, &class_idx, &fg);
> >+	mapping = page_mapping(page);
> >+	pool = mapping->private_data;
> >+	class = pool->size_class[class_idx];
> >+
> >+	spin_lock(&class->lock);
> >+	dec_zspage_isolation(zspage);
> >+	if (!is_zspage_isolated(zspage)) {
> >+		fg = putback_zspage(class, zspage);
> >+		/*
> >+		 * Due to page_lock, we cannot free zspage immediately
> >+		 * so let's defer.
> >+		 */
> >+		if (fg == ZS_EMPTY)
> >+			schedule_work(&pool->free_work);
> >+	}
> >+	spin_unlock(&class->lock);
> >+}
> >+
> >+const struct address_space_operations zsmalloc_aops = {
> >+	.isolate_page = zs_page_isolate,
> >+	.migratepage = zs_page_migrate,
> >+	.putback_page = zs_page_putback,
> >+};
> >+
> >+static int zs_register_migration(struct zs_pool *pool)
> >+{
> >+	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
> >+	if (IS_ERR(pool->inode)) {
> >+		pool->inode = NULL;
> >+		return 1;
> >+	}
> >+
> >+	pool->inode->i_mapping->private_data = pool;
> >+	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
> >+	return 0;
> >+}
> >+
> >+static void zs_unregister_migration(struct zs_pool *pool)
> >+{
> >+	flush_work(&pool->free_work);
> >+	if (pool->inode)
> >+		iput(pool->inode);
> >+}
> >+
> >+/*
> >+ * Caller should hold page_lock of all pages in the zspage
> >+ * In here, we cannot use zspage meta data.
> >+ */
> >+static void async_free_zspage(struct work_struct *work)
> >+{
> >+	int i;
> >+	struct size_class *class;
> >+	unsigned int class_idx;
> >+	enum fullness_group fullness;
> >+	struct zspage *zspage, *tmp;
> >+	LIST_HEAD(free_pages);
> >+	struct zs_pool *pool = container_of(work, struct zs_pool,
> >+					free_work);
> >+
> >+	for (i = 0; i < zs_size_classes; i++) {
> >+		class = pool->size_class[i];
> >+		if (class->index != i)
> >+			continue;
> >+
> >+		spin_lock(&class->lock);
> >+		list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages);
> >+		spin_unlock(&class->lock);
> >+	}
> >+
> >+
> >+	list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
> >+		list_del(&zspage->list);
> >+		lock_zspage(zspage);
> >+
> >+		get_zspage_mapping(zspage, &class_idx, &fullness);
> >+		VM_BUG_ON(fullness != ZS_EMPTY);
> >+		class = pool->size_class[class_idx];
> >+		spin_lock(&class->lock);
> >+		__free_zspage(pool, pool->size_class[class_idx], zspage);
> >+		spin_unlock(&class->lock);
> >+	}
> >+};
> >+
> >+static void kick_deferred_free(struct zs_pool *pool)
> >+{
> >+	schedule_work(&pool->free_work);
> >+}
> >+
> >+static void init_deferred_free(struct zs_pool *pool)
> >+{
> >+	INIT_WORK(&pool->free_work, async_free_zspage);
> >+}
> >+
> >+static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage)
> >+{
> >+	struct page *page = get_first_page(zspage);
> >+
> >+	do {
> >+		WARN_ON(!trylock_page(page));
> >+		__SetPageMovable(page, pool->inode->i_mapping);
> >+		unlock_page(page);
> >+	} while ((page = get_next_page(page)) != NULL);
> >+}
> >+#endif
> >+
> > /*
> >  *
> >  * Based on the number of unused allocated objects calculate
> >@@ -1745,10 +2287,10 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
> > 			break;
> >
> > 		cc.index = 0;
> >-		cc.s_page = src_zspage->first_page;
> >+		cc.s_page = get_first_page(src_zspage);
> >
> > 		while ((dst_zspage = isolate_zspage(class, false))) {
> >-			cc.d_page = dst_zspage->first_page;
> >+			cc.d_page = get_first_page(dst_zspage);
> > 			/*
> > 			 * If there is no more space in dst_page, resched
> > 			 * and see if anyone had allocated another zspage.
> >@@ -1765,11 +2307,7 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
> >
> > 		putback_zspage(class, dst_zspage);
> > 		if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
> >-			zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >-					class->size, class->pages_per_zspage));
> >-			atomic_long_sub(class->pages_per_zspage,
> >-					&pool->pages_allocated);
> >-			free_zspage(pool, src_zspage);
> >+			free_zspage(pool, class, src_zspage);
> > 			pool->stats.pages_compacted += class->pages_per_zspage;
> > 		}
> > 		spin_unlock(&class->lock);
> >@@ -1885,6 +2423,7 @@ struct zs_pool *zs_create_pool(const char *name)
> > 	if (!pool)
> > 		return NULL;
> >
> >+	init_deferred_free(pool);
> > 	pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *),
> > 			GFP_KERNEL);
> > 	if (!pool->size_class) {
> >@@ -1939,12 +2478,10 @@ struct zs_pool *zs_create_pool(const char *name)
> > 		class->pages_per_zspage = pages_per_zspage;
> > 		class->objs_per_zspage = class->pages_per_zspage *
> > 						PAGE_SIZE / class->size;
> >-		if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
> >-			class->huge = true;
> > 		spin_lock_init(&class->lock);
> > 		pool->size_class[i] = class;
> >-		for (fullness = ZS_ALMOST_FULL; fullness <= ZS_ALMOST_EMPTY;
> >-								fullness++)
> >+		for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS;
> >+							fullness++)
> > 			INIT_LIST_HEAD(&class->fullness_list[fullness]);
> >
> > 		prev_class = class;
> >@@ -1953,6 +2490,9 @@ struct zs_pool *zs_create_pool(const char *name)
> > 	/* debug only, don't abort if it fails */
> > 	zs_pool_stat_create(pool, name);
> >
> >+	if (zs_register_migration(pool))
> >+		goto err;
> >+
> > 	/*
> > 	 * Not critical, we still can use the pool
> > 	 * and user can trigger compaction manually.
> >@@ -1972,6 +2512,7 @@ void zs_destroy_pool(struct zs_pool *pool)
> > 	int i;
> >
> > 	zs_unregister_shrinker(pool);
> >+	zs_unregister_migration(pool);
> > 	zs_pool_stat_destroy(pool);
> >
> > 	for (i = 0; i < zs_size_classes; i++) {
> >@@ -1984,7 +2525,7 @@ void zs_destroy_pool(struct zs_pool *pool)
> > 		if (class->index != i)
> > 			continue;
> >
> >-		for (fg = ZS_ALMOST_FULL; fg <= ZS_ALMOST_EMPTY; fg++) {
> >+		for (fg = ZS_EMPTY; fg < NR_ZS_FULLNESS; fg++) {
> > 			if (!list_empty(&class->fullness_list[fg])) {
> > 				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
> > 					class->size, fg);
> >@@ -2002,7 +2543,13 @@ EXPORT_SYMBOL_GPL(zs_destroy_pool);
> >
> > static int __init zs_init(void)
> > {
> >-	int ret = zs_register_cpu_notifier();
> >+	int ret;
> >+
> >+	ret = zsmalloc_mount();
> >+	if (ret)
> >+		goto out;
> >+
> >+	ret = zs_register_cpu_notifier();
> >
> > 	if (ret)
> > 		goto notifier_fail;
> >@@ -2019,7 +2566,8 @@ static int __init zs_init(void)
> >
> > notifier_fail:
> > 	zs_unregister_cpu_notifier();
> >-
> >+	zsmalloc_unmount();
> >+out:
> > 	return ret;
> > }
> >
> >@@ -2028,6 +2576,7 @@ static void __exit zs_exit(void)
> > #ifdef CONFIG_ZPOOL
> > 	zpool_unregister_driver(&zs_zpool_driver);
> > #endif
> >+	zsmalloc_unmount();
> > 	zs_unregister_cpu_notifier();
> >
> > 	zs_stat_exit();
> >
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-19  2:44       ` Minchan Kim
@ 2017-01-19  3:39         ` Chulmin Kim
  2017-01-19  6:21           ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Chulmin Kim @ 2017-01-19  3:39 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, linux-mm, Sergey Senozhatsky

On 01/18/2017 09:44 PM, Minchan Kim wrote:
> Hello Chulmin,
>
> On Wed, Jan 18, 2017 at 07:13:21PM -0500, Chulmin Kim wrote:
>> Hello, Minchan, and all zsmalloc guys.
>>
>> I have a quick question.
>> Is zsmalloc considering memory barrier things correctly?
>>
>> AFAIK, in ARM64,
>> zsmalloc relies on the dmb operation in bit_spin_unlock only.
>> (It seems that dmb operations in the spinlock functions are being prepared,
>> but let us set that aside as they are not merged yet.)
>>
>> If I am correct,
>> migrating a page in a zspage filled with free objs
>> may cause corruption, because bit_spin_unlock will not be executed at all.
>>
>> I am not sure this is enough of a memory barrier for zsmalloc operations.
>>
>> Can you enlighten me?
>
> Do you mean bit_spin_unlock is broken, or that the zsmalloc locking scheme is broken?
> Could you please describe your concern in detail?
> It would be very helpful if you explained it with an example!

Sorry for the ambiguous wording. :)

Recently,
I found multiple zsmalloc corruption cases with garbage idx values
in zspage->freeobj (not the ffffffff (-1) value).

Honestly, I have no clue yet.

I suspect the case where migration moves a zs subpage filled with free
objects (so it never calls unpin_tag(), which has the memory barrier).


Assume the page (zs subpage) being migrated has no allocated zs object.

S : zs subpage (source)
D : free page (destination)


CPU A : zs_page_migrate()		CPU B : zs_malloc()
---------------------			-----------------------------


migrate_write_lock()
spin_lock()

memcpy(D, S, PAGE_SIZE)   -> (1)
replace_sub_page()

putback_zspage()
spin_unlock()
migrate_write_unlock()
					
					spin_lock()
					obj_malloc()
					--> (2-a) allocate obj in D
					--> (2-b) set freeobj using
					    the first 8 bytes of
					    the allocated obj
					record_obj()
					spin_unlock()
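
For reference, step (2-b) corresponds to this code (a simplified excerpt of
obj_malloc() as it looks after this patch; m_page can be the newly migrated
page, and the surrounding declarations are omitted):

===
	vaddr = kmap_atomic(m_page);
	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
	/* (2-b): link->next here must be the data CPU A stored in (1) */
	set_freeobj(zspage, link->next >> OBJ_ALLOCATED_TAG);
===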



I think the locking itself is fine; my concern is memory ordering.
I doubt whether (2-b) on CPU B really observes the data stored by (1).

If it doesn't, set_freeobj() in (2-b) will corrupt zspage->freeobj,
and we will see a corrupted object sooner or later.


According to the link below
(https://patchwork.kernel.org/patch/9313493/),
spin_lock() on a specific arch (arm64, maybe) does not seem to guarantee
full memory ordering.

===
+/*
+ * Accesses appearing in program order before a spin_lock() operation
+ * can be reordered with accesses inside the critical section, by virtue
+ * of arch_spin_lock being constructed using acquire semantics.
+ *
+ * In cases where this is problematic (e.g. try_to_wake_up), an
+ * smp_mb__before_spinlock() can restore the required ordering.
+ */
+#define smp_mb__before_spinlock()	smp_mb()
===
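
If explicit ordering is really needed here (which is exactly my question),
the kind of pairing I would naively expect looks like the sketch below.
This is only an illustration of my question, not a proposal; the barrier
placement and the simplified calls are my assumption, not code from the
patch:

===
/* CPU A: zs_page_migrate() -- illustration only */
memcpy(d_addr, s_addr, PAGE_SIZE);		/* (1) copy objects incl. free-list links */
smp_wmb();					/* make the copy visible before publishing */
replace_sub_page(class, zspage, newpage, page);	/* publish newpage in the page chain */

/* CPU B: obj_malloc() -- illustration only */
m_page = get_first_page(zspage);		/* may observe the migrated page */
smp_rmb();					/* pairs with the smp_wmb() above */
vaddr = kmap_atomic(m_page);
link = (struct link_free *)vaddr + m_offset / sizeof(*link);
set_freeobj(zspage, link->next >> OBJ_ALLOCATED_TAG);	/* (2-b) must see (1) */
===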



Thanks.
Chulmin Kim





>
> Thanks.
>
>>
>>
>> Thanks!
>> Chulmin Kim
>>
>>
>>
>> On 05/31/2016 07:21 PM, Minchan Kim wrote:
>>> This patch introduces run-time migration feature for zspage.
>>>
>>> For migration, VM uses the page.lru field, so it would be better not to use
>>> the page.next field, which is unified with page.lru, for our own purpose.
>>> For that, firstly, we can get the first object offset of the page via
>>> runtime calculation instead of using page.index, so we can use
>>> page.index as the link for page chaining instead of page.next.
>>> 	
>>> In the case of a huge object, it stores the handle in page.index instead of
>>> the next link of the page chain, because a huge object doesn't need a next
>>> link for page chaining. So get_next_page needs to identify a huge
>>> object and return NULL for it. For that, this patch uses the PG_owner_priv_1
>>> page flag.
>>>
>>> For migration, it supports three functions
>>>
>>> * zs_page_isolate
>>>
>>> It isolates a zspage which includes a subpage VM wants to migrate
>>> from the class so nobody can allocate a new object from the zspage.
>>>
>>> We may try to isolate a zspage once per subpage, so a
>>> subsequent isolation attempt for another subpage of the zspage shouldn't
>>> fail. For that, we introduce the zspage.isolated count. With that,
>>> zs_page_isolate can know whether the zspage is already isolated
>>> for migration, so if it is, a subsequent
>>> isolation attempt can succeed without doing further isolation work.
>>>
>>> * zs_page_migrate
>>>
>>> First of all, it holds the write side of zspage->lock to prevent migration of
>>> other subpages in the zspage. Then, it locks all objects in the page VM wants
>>> to migrate. The reason we should lock all objects in the page is the race
>>> between zs_map_object and zs_page_migrate.
>>>
>>> zs_map_object				zs_page_migrate
>>>
>>> pin_tag(handle)
>>> obj = handle_to_obj(handle)
>>> obj_to_location(obj, &page, &obj_idx);
>>>
>>> 					write_lock(&zspage->lock)
>>> 					if (!trypin_tag(handle))
>>> 						goto unpin_object
>>>
>>> zspage = get_zspage(page);
>>> read_lock(&zspage->lock);
>>>
>>> If zs_page_migrate doesn't do trypin_tag, zs_map_object's page can
>>> become stale due to migration, so it would crash.
>>>
>>> If it locks all of the objects successfully, it copies the content from the
>>> old page to the new one and, finally, creates a new zspage chain with the new
>>> page. And if it's the last isolated subpage in the zspage, it puts the zspage
>>> back to the class.
>>>
>>> * zs_page_putback
>>>
>>> It returns an isolated zspage to the right fullness_group list if it fails to
>>> migrate a page. If it finds a zspage is ZS_EMPTY, it queues zspage
>>> freeing to a workqueue. See below about async zspage freeing.
>>>
>>> This patch introduces asynchronous zspage freeing. The reason we need it
>>> is that we need the page_lock to clear PG_movable, but unfortunately the
>>> zs_free path should be atomic, so the approach is to try to grab the page_lock.
>>> If it gets the page_lock of all pages successfully, it can free the zspage
>>> immediately. Otherwise, it queues a free request and frees the zspage via a
>>> workqueue in process context.
>>>
>>> If zs_free finds the zspage is isolated when it tries to free it,
>>> it delays the freeing until zs_page_putback finds it and finally
>>> frees the zspage.
>>>
>>> In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL.
>>> First of all, the ZS_EMPTY list is used for delayed freeing.
>>> And with the added ZS_FULL list, we can identify whether a zspage is
>>> isolated or not via the list_empty(&zspage->list) test.
>>>
>>> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
>>> Signed-off-by: Minchan Kim <minchan@kernel.org>
>>> ---
>>> include/uapi/linux/magic.h |   1 +
>>> mm/zsmalloc.c              | 793 ++++++++++++++++++++++++++++++++++++++-------
>>> 2 files changed, 672 insertions(+), 122 deletions(-)
>>>
>>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>>> index d829ce63529d..e398beac67b8 100644
>>> --- a/include/uapi/linux/magic.h
>>> +++ b/include/uapi/linux/magic.h
>>> @@ -81,5 +81,6 @@
>>> /* Since UDF 2.01 is ISO 13346 based... */
>>> #define UDF_SUPER_MAGIC		0x15013346
>>> #define BALLOON_KVM_MAGIC	0x13661366
>>> +#define ZSMALLOC_MAGIC		0x58295829
>>>
>>> #endif /* __LINUX_MAGIC_H__ */
>>> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>>> index c6fb543cfb98..a80100db16d6 100644
>>> --- a/mm/zsmalloc.c
>>> +++ b/mm/zsmalloc.c
>>> @@ -17,14 +17,14 @@
>>>  *
>>>  * Usage of struct page fields:
>>>  *	page->private: points to zspage
>>> - *	page->index: offset of the first object starting in this page.
>>> - *		For the first page, this is always 0, so we use this field
>>> - *		to store handle for huge object.
>>> - *	page->next: links together all component pages of a zspage
>>> + *	page->freelist(index): links together all component pages of a zspage
>>> + *		For the huge page, this is always 0, so we use this field
>>> + *		to store handle.
>>>  *
>>>  * Usage of struct page flags:
>>>  *	PG_private: identifies the first component page
>>>  *	PG_private2: identifies the last component page
>>> + *	PG_owner_priv_1: indentifies the huge component page
>>>  *
>>>  */
>>>
>>> @@ -49,6 +49,11 @@
>>> #include <linux/debugfs.h>
>>> #include <linux/zsmalloc.h>
>>> #include <linux/zpool.h>
>>> +#include <linux/mount.h>
>>> +#include <linux/compaction.h>
>>> +#include <linux/pagemap.h>
>>> +
>>> +#define ZSPAGE_MAGIC	0x58
>>>
>>> /*
>>>  * This must be power of 2 and greater than of equal to sizeof(link_free).
>>> @@ -136,25 +141,23 @@
>>>  * We do not maintain any list for completely empty or full pages
>>>  */
>>> enum fullness_group {
>>> -	ZS_ALMOST_FULL,
>>> -	ZS_ALMOST_EMPTY,
>>> 	ZS_EMPTY,
>>> -	ZS_FULL
>>> +	ZS_ALMOST_EMPTY,
>>> +	ZS_ALMOST_FULL,
>>> +	ZS_FULL,
>>> +	NR_ZS_FULLNESS,
>>> };
>>>
>>> enum zs_stat_type {
>>> +	CLASS_EMPTY,
>>> +	CLASS_ALMOST_EMPTY,
>>> +	CLASS_ALMOST_FULL,
>>> +	CLASS_FULL,
>>> 	OBJ_ALLOCATED,
>>> 	OBJ_USED,
>>> -	CLASS_ALMOST_FULL,
>>> -	CLASS_ALMOST_EMPTY,
>>> +	NR_ZS_STAT_TYPE,
>>> };
>>>
>>> -#ifdef CONFIG_ZSMALLOC_STAT
>>> -#define NR_ZS_STAT_TYPE	(CLASS_ALMOST_EMPTY + 1)
>>> -#else
>>> -#define NR_ZS_STAT_TYPE	(OBJ_USED + 1)
>>> -#endif
>>> -
>>> struct zs_size_stat {
>>> 	unsigned long objs[NR_ZS_STAT_TYPE];
>>> };
>>> @@ -163,6 +166,10 @@ struct zs_size_stat {
>>> static struct dentry *zs_stat_root;
>>> #endif
>>>
>>> +#ifdef CONFIG_COMPACTION
>>> +static struct vfsmount *zsmalloc_mnt;
>>> +#endif
>>> +
>>> /*
>>>  * number of size_classes
>>>  */
>>> @@ -186,23 +193,36 @@ static const int fullness_threshold_frac = 4;
>>>
>>> struct size_class {
>>> 	spinlock_t lock;
>>> -	struct list_head fullness_list[2];
>>> +	struct list_head fullness_list[NR_ZS_FULLNESS];
>>> 	/*
>>> 	 * Size of objects stored in this class. Must be multiple
>>> 	 * of ZS_ALIGN.
>>> 	 */
>>> 	int size;
>>> 	int objs_per_zspage;
>>> -	unsigned int index;
>>> -
>>> -	struct zs_size_stat stats;
>>> -
>>> 	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
>>> 	int pages_per_zspage;
>>> -	/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
>>> -	bool huge;
>>> +
>>> +	unsigned int index;
>>> +	struct zs_size_stat stats;
>>> };
>>>
>>> +/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
>>> +static void SetPageHugeObject(struct page *page)
>>> +{
>>> +	SetPageOwnerPriv1(page);
>>> +}
>>> +
>>> +static void ClearPageHugeObject(struct page *page)
>>> +{
>>> +	ClearPageOwnerPriv1(page);
>>> +}
>>> +
>>> +static int PageHugeObject(struct page *page)
>>> +{
>>> +	return PageOwnerPriv1(page);
>>> +}
>>> +
>>> /*
>>>  * Placed within free objects to form a singly linked list.
>>>  * For every zspage, zspage->freeobj gives head of this list.
>>> @@ -244,6 +264,10 @@ struct zs_pool {
>>> #ifdef CONFIG_ZSMALLOC_STAT
>>> 	struct dentry *stat_dentry;
>>> #endif
>>> +#ifdef CONFIG_COMPACTION
>>> +	struct inode *inode;
>>> +	struct work_struct free_work;
>>> +#endif
>>> };
>>>
>>> /*
>>> @@ -252,16 +276,23 @@ struct zs_pool {
>>>  */
>>> #define FULLNESS_BITS	2
>>> #define CLASS_BITS	8
>>> +#define ISOLATED_BITS	3
>>> +#define MAGIC_VAL_BITS	8
>>>
>>> struct zspage {
>>> 	struct {
>>> 		unsigned int fullness:FULLNESS_BITS;
>>> 		unsigned int class:CLASS_BITS;
>>> +		unsigned int isolated:ISOLATED_BITS;
>>> +		unsigned int magic:MAGIC_VAL_BITS;
>>> 	};
>>> 	unsigned int inuse;
>>> 	unsigned int freeobj;
>>> 	struct page *first_page;
>>> 	struct list_head list; /* fullness list */
>>> +#ifdef CONFIG_COMPACTION
>>> +	rwlock_t lock;
>>> +#endif
>>> };
>>>
>>> struct mapping_area {
>>> @@ -274,6 +305,28 @@ struct mapping_area {
>>> 	enum zs_mapmode vm_mm; /* mapping mode */
>>> };
>>>
>>> +#ifdef CONFIG_COMPACTION
>>> +static int zs_register_migration(struct zs_pool *pool);
>>> +static void zs_unregister_migration(struct zs_pool *pool);
>>> +static void migrate_lock_init(struct zspage *zspage);
>>> +static void migrate_read_lock(struct zspage *zspage);
>>> +static void migrate_read_unlock(struct zspage *zspage);
>>> +static void kick_deferred_free(struct zs_pool *pool);
>>> +static void init_deferred_free(struct zs_pool *pool);
>>> +static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage);
>>> +#else
>>> +static int zsmalloc_mount(void) { return 0; }
>>> +static void zsmalloc_unmount(void) {}
>>> +static int zs_register_migration(struct zs_pool *pool) { return 0; }
>>> +static void zs_unregister_migration(struct zs_pool *pool) {}
>>> +static void migrate_lock_init(struct zspage *zspage) {}
>>> +static void migrate_read_lock(struct zspage *zspage) {}
>>> +static void migrate_read_unlock(struct zspage *zspage) {}
>>> +static void kick_deferred_free(struct zs_pool *pool) {}
>>> +static void init_deferred_free(struct zs_pool *pool) {}
>>> +static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {}
>>> +#endif
>>> +
>>> static int create_cache(struct zs_pool *pool)
>>> {
>>> 	pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
>>> @@ -301,7 +354,7 @@ static void destroy_cache(struct zs_pool *pool)
>>> static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
>>> {
>>> 	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
>>> -			gfp & ~__GFP_HIGHMEM);
>>> +			gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
>>> }
>>>
>>> static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
>>> @@ -311,7 +364,8 @@ static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
>>>
>>> static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
>>> {
>>> -	return kmem_cache_alloc(pool->zspage_cachep, flags & ~__GFP_HIGHMEM);
>>> +	return kmem_cache_alloc(pool->zspage_cachep,
>>> +			flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
>>> };
>>>
>>> static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
>>> @@ -421,11 +475,17 @@ static unsigned int get_maxobj_per_zspage(int size, int pages_per_zspage)
>>> /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
>>> static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
>>>
>>> +static bool is_zspage_isolated(struct zspage *zspage)
>>> +{
>>> +	return zspage->isolated;
>>> +}
>>> +
>>> static int is_first_page(struct page *page)
>>> {
>>> 	return PagePrivate(page);
>>> }
>>>
>>> +/* Protected by class->lock */
>>> static inline int get_zspage_inuse(struct zspage *zspage)
>>> {
>>> 	return zspage->inuse;
>>> @@ -441,20 +501,12 @@ static inline void mod_zspage_inuse(struct zspage *zspage, int val)
>>> 	zspage->inuse += val;
>>> }
>>>
>>> -static inline int get_first_obj_offset(struct page *page)
>>> +static inline struct page *get_first_page(struct zspage *zspage)
>>> {
>>> -	if (is_first_page(page))
>>> -		return 0;
>>> +	struct page *first_page = zspage->first_page;
>>>
>>> -	return page->index;
>>> -}
>>> -
>>> -static inline void set_first_obj_offset(struct page *page, int offset)
>>> -{
>>> -	if (is_first_page(page))
>>> -		return;
>>> -
>>> -	page->index = offset;
>>> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>>> +	return first_page;
>>> }
>>>
>>> static inline unsigned int get_freeobj(struct zspage *zspage)
>>> @@ -471,6 +523,8 @@ static void get_zspage_mapping(struct zspage *zspage,
>>> 				unsigned int *class_idx,
>>> 				enum fullness_group *fullness)
>>> {
>>> +	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
>>> +
>>> 	*fullness = zspage->fullness;
>>> 	*class_idx = zspage->class;
>>> }
>>> @@ -504,23 +558,19 @@ static int get_size_class_index(int size)
>>> static inline void zs_stat_inc(struct size_class *class,
>>> 				enum zs_stat_type type, unsigned long cnt)
>>> {
>>> -	if (type < NR_ZS_STAT_TYPE)
>>> -		class->stats.objs[type] += cnt;
>>> +	class->stats.objs[type] += cnt;
>>> }
>>>
>>> static inline void zs_stat_dec(struct size_class *class,
>>> 				enum zs_stat_type type, unsigned long cnt)
>>> {
>>> -	if (type < NR_ZS_STAT_TYPE)
>>> -		class->stats.objs[type] -= cnt;
>>> +	class->stats.objs[type] -= cnt;
>>> }
>>>
>>> static inline unsigned long zs_stat_get(struct size_class *class,
>>> 				enum zs_stat_type type)
>>> {
>>> -	if (type < NR_ZS_STAT_TYPE)
>>> -		return class->stats.objs[type];
>>> -	return 0;
>>> +	return class->stats.objs[type];
>>> }
>>>
>>> #ifdef CONFIG_ZSMALLOC_STAT
>>> @@ -664,6 +714,7 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
>>> }
>>> #endif
>>>
>>> +
>>> /*
>>>  * For each size class, zspages are divided into different groups
>>>  * depending on how "full" they are. This was done so that we could
>>> @@ -704,15 +755,9 @@ static void insert_zspage(struct size_class *class,
>>> {
>>> 	struct zspage *head;
>>>
>>> -	if (fullness >= ZS_EMPTY)
>>> -		return;
>>> -
>>> +	zs_stat_inc(class, fullness, 1);
>>> 	head = list_first_entry_or_null(&class->fullness_list[fullness],
>>> 					struct zspage, list);
>>> -
>>> -	zs_stat_inc(class, fullness == ZS_ALMOST_EMPTY ?
>>> -			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
>>> -
>>> 	/*
>>> 	 * We want to see more ZS_FULL pages and less almost empty/full.
>>> 	 * Put pages with higher ->inuse first.
>>> @@ -734,14 +779,11 @@ static void remove_zspage(struct size_class *class,
>>> 				struct zspage *zspage,
>>> 				enum fullness_group fullness)
>>> {
>>> -	if (fullness >= ZS_EMPTY)
>>> -		return;
>>> -
>>> 	VM_BUG_ON(list_empty(&class->fullness_list[fullness]));
>>> +	VM_BUG_ON(is_zspage_isolated(zspage));
>>>
>>> 	list_del_init(&zspage->list);
>>> -	zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ?
>>> -			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
>>> +	zs_stat_dec(class, fullness, 1);
>>> }
>>>
>>> /*
>>> @@ -764,8 +806,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
>>> 	if (newfg == currfg)
>>> 		goto out;
>>>
>>> -	remove_zspage(class, zspage, currfg);
>>> -	insert_zspage(class, zspage, newfg);
>>> +	if (!is_zspage_isolated(zspage)) {
>>> +		remove_zspage(class, zspage, currfg);
>>> +		insert_zspage(class, zspage, newfg);
>>> +	}
>>> +
>>> 	set_zspage_mapping(zspage, class_idx, newfg);
>>>
>>> out:
>>> @@ -808,19 +853,45 @@ static int get_pages_per_zspage(int class_size)
>>> 	return max_usedpc_order;
>>> }
>>>
>>> -static struct page *get_first_page(struct zspage *zspage)
>>> +static struct zspage *get_zspage(struct page *page)
>>> {
>>> -	return zspage->first_page;
>>> +	struct zspage *zspage = (struct zspage *)page->private;
>>> +
>>> +	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
>>> +	return zspage;
>>> }
>>>
>>> -static struct zspage *get_zspage(struct page *page)
>>> +static struct page *get_next_page(struct page *page)
>>> {
>>> -	return (struct zspage *)page->private;
>>> +	if (unlikely(PageHugeObject(page)))
>>> +		return NULL;
>>> +
>>> +	return page->freelist;
>>> }
>>>
>>> -static struct page *get_next_page(struct page *page)
>>> +/* Get byte offset of first object in the @page */
>>> +static int get_first_obj_offset(struct size_class *class,
>>> +				struct page *first_page, struct page *page)
>>> {
>>> -	return page->next;
>>> +	int pos;
>>> +	int page_idx = 0;
>>> +	int ofs = 0;
>>> +	struct page *cursor = first_page;
>>> +
>>> +	if (first_page == page)
>>> +		goto out;
>>> +
>>> +	while (page != cursor) {
>>> +		page_idx++;
>>> +		cursor = get_next_page(cursor);
>>> +	}
>>> +
>>> +	pos = class->objs_per_zspage * class->size *
>>> +		page_idx / class->pages_per_zspage;
>>> +
>>> +	ofs = (pos + class->size) % PAGE_SIZE;
>>> +out:
>>> +	return ofs;
>>> }
>>>
>>> /**
>>> @@ -857,16 +928,20 @@ static unsigned long handle_to_obj(unsigned long handle)
>>> 	return *(unsigned long *)handle;
>>> }
>>>
>>> -static unsigned long obj_to_head(struct size_class *class, struct page *page,
>>> -			void *obj)
>>> +static unsigned long obj_to_head(struct page *page, void *obj)
>>> {
>>> -	if (class->huge) {
>>> +	if (unlikely(PageHugeObject(page))) {
>>> 		VM_BUG_ON_PAGE(!is_first_page(page), page);
>>> 		return page->index;
>>> 	} else
>>> 		return *(unsigned long *)obj;
>>> }
>>>
>>> +static inline int testpin_tag(unsigned long handle)
>>> +{
>>> +	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
>>> +}
>>> +
>>> static inline int trypin_tag(unsigned long handle)
>>> {
>>> 	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
>>> @@ -884,27 +959,93 @@ static void unpin_tag(unsigned long handle)
>>>
>>> static void reset_page(struct page *page)
>>> {
>>> +	__ClearPageMovable(page);
>>> 	clear_bit(PG_private, &page->flags);
>>> 	clear_bit(PG_private_2, &page->flags);
>>> 	set_page_private(page, 0);
>>> -	page->index = 0;
>>> +	ClearPageHugeObject(page);
>>> +	page->freelist = NULL;
>>> }
>>>
>>> -static void free_zspage(struct zs_pool *pool, struct zspage *zspage)
>>> +/*
>>> + * To prevent zspage destroy during migration, zspage freeing should
>>> + * hold locks of all pages in the zspage.
>>> + */
>>> +void lock_zspage(struct zspage *zspage)
>>> +{
>>> +	struct page *page = get_first_page(zspage);
>>> +
>>> +	do {
>>> +		lock_page(page);
>>> +	} while ((page = get_next_page(page)) != NULL);
>>> +}
>>> +
>>> +int trylock_zspage(struct zspage *zspage)
>>> +{
>>> +	struct page *cursor, *fail;
>>> +
>>> +	for (cursor = get_first_page(zspage); cursor != NULL; cursor =
>>> +					get_next_page(cursor)) {
>>> +		if (!trylock_page(cursor)) {
>>> +			fail = cursor;
>>> +			goto unlock;
>>> +		}
>>> +	}
>>> +
>>> +	return 1;
>>> +unlock:
>>> +	for (cursor = get_first_page(zspage); cursor != fail; cursor =
>>> +					get_next_page(cursor))
>>> +		unlock_page(cursor);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>>> +				struct zspage *zspage)
>>> {
>>> 	struct page *page, *next;
>>> +	enum fullness_group fg;
>>> +	unsigned int class_idx;
>>> +
>>> +	get_zspage_mapping(zspage, &class_idx, &fg);
>>> +
>>> +	assert_spin_locked(&class->lock);
>>>
>>> 	VM_BUG_ON(get_zspage_inuse(zspage));
>>> +	VM_BUG_ON(fg != ZS_EMPTY);
>>>
>>> -	next = page = zspage->first_page;
>>> +	next = page = get_first_page(zspage);
>>> 	do {
>>> -		next = page->next;
>>> +		VM_BUG_ON_PAGE(!PageLocked(page), page);
>>> +		next = get_next_page(page);
>>> 		reset_page(page);
>>> +		unlock_page(page);
>>> 		put_page(page);
>>> 		page = next;
>>> 	} while (page != NULL);
>>>
>>> 	cache_free_zspage(pool, zspage);
>>> +
>>> +	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>> +			class->size, class->pages_per_zspage));
>>> +	atomic_long_sub(class->pages_per_zspage,
>>> +					&pool->pages_allocated);
>>> +}
>>> +
>>> +static void free_zspage(struct zs_pool *pool, struct size_class *class,
>>> +				struct zspage *zspage)
>>> +{
>>> +	VM_BUG_ON(get_zspage_inuse(zspage));
>>> +	VM_BUG_ON(list_empty(&zspage->list));
>>> +
>>> +	if (!trylock_zspage(zspage)) {
>>> +		kick_deferred_free(pool);
>>> +		return;
>>> +	}
>>> +
>>> +	remove_zspage(class, zspage, ZS_EMPTY);
>>> +	__free_zspage(pool, class, zspage);
>>> }
>>>
>>> /* Initialize a newly allocated zspage */
>>> @@ -912,15 +1053,13 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
>>> {
>>> 	unsigned int freeobj = 1;
>>> 	unsigned long off = 0;
>>> -	struct page *page = zspage->first_page;
>>> +	struct page *page = get_first_page(zspage);
>>>
>>> 	while (page) {
>>> 		struct page *next_page;
>>> 		struct link_free *link;
>>> 		void *vaddr;
>>>
>>> -		set_first_obj_offset(page, off);
>>> -
>>> 		vaddr = kmap_atomic(page);
>>> 		link = (struct link_free *)vaddr + off / sizeof(*link);
>>>
>>> @@ -952,16 +1091,17 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
>>> 	set_freeobj(zspage, 0);
>>> }
>>>
>>> -static void create_page_chain(struct zspage *zspage, struct page *pages[],
>>> -				int nr_pages)
>>> +static void create_page_chain(struct size_class *class, struct zspage *zspage,
>>> +				struct page *pages[])
>>> {
>>> 	int i;
>>> 	struct page *page;
>>> 	struct page *prev_page = NULL;
>>> +	int nr_pages = class->pages_per_zspage;
>>>
>>> 	/*
>>> 	 * Allocate individual pages and link them together as:
>>> -	 * 1. all pages are linked together using page->next
>>> +	 * 1. all pages are linked together using page->freelist
>>> 	 * 2. each sub-page point to zspage using page->private
>>> 	 *
>>> 	 * we set PG_private to identify the first page (i.e. no other sub-page
>>> @@ -970,16 +1110,18 @@ static void create_page_chain(struct zspage *zspage, struct page *pages[],
>>> 	for (i = 0; i < nr_pages; i++) {
>>> 		page = pages[i];
>>> 		set_page_private(page, (unsigned long)zspage);
>>> +		page->freelist = NULL;
>>> 		if (i == 0) {
>>> 			zspage->first_page = page;
>>> 			SetPagePrivate(page);
>>> +			if (unlikely(class->objs_per_zspage == 1 &&
>>> +					class->pages_per_zspage == 1))
>>> +				SetPageHugeObject(page);
>>> 		} else {
>>> -			prev_page->next = page;
>>> +			prev_page->freelist = page;
>>> 		}
>>> -		if (i == nr_pages - 1) {
>>> +		if (i == nr_pages - 1)
>>> 			SetPagePrivate2(page);
>>> -			page->next = NULL;
>>> -		}
>>> 		prev_page = page;
>>> 	}
>>> }
>>> @@ -999,6 +1141,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>>> 		return NULL;
>>>
>>> 	memset(zspage, 0, sizeof(struct zspage));
>>> +	zspage->magic = ZSPAGE_MAGIC;
>>> +	migrate_lock_init(zspage);
>>>
>>> 	for (i = 0; i < class->pages_per_zspage; i++) {
>>> 		struct page *page;
>>> @@ -1013,7 +1157,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>>> 		pages[i] = page;
>>> 	}
>>>
>>> -	create_page_chain(zspage, pages, class->pages_per_zspage);
>>> +	create_page_chain(class, zspage, pages);
>>> 	init_zspage(class, zspage);
>>>
>>> 	return zspage;
>>> @@ -1024,7 +1168,7 @@ static struct zspage *find_get_zspage(struct size_class *class)
>>> 	int i;
>>> 	struct zspage *zspage;
>>>
>>> -	for (i = ZS_ALMOST_FULL; i <= ZS_ALMOST_EMPTY; i++) {
>>> +	for (i = ZS_ALMOST_FULL; i >= ZS_EMPTY; i--) {
>>> 		zspage = list_first_entry_or_null(&class->fullness_list[i],
>>> 				struct zspage, list);
>>> 		if (zspage)
>>> @@ -1289,6 +1433,10 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>>> 	obj = handle_to_obj(handle);
>>> 	obj_to_location(obj, &page, &obj_idx);
>>> 	zspage = get_zspage(page);
>>> +
>>> +	/* migration cannot move any subpage in this zspage */
>>> +	migrate_read_lock(zspage);
>>> +
>>> 	get_zspage_mapping(zspage, &class_idx, &fg);
>>> 	class = pool->size_class[class_idx];
>>> 	off = (class->size * obj_idx) & ~PAGE_MASK;
>>> @@ -1309,7 +1457,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>>>
>>> 	ret = __zs_map_object(area, pages, off, class->size);
>>> out:
>>> -	if (!class->huge)
>>> +	if (likely(!PageHugeObject(page)))
>>> 		ret += ZS_HANDLE_SIZE;
>>>
>>> 	return ret;
>>> @@ -1348,6 +1496,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
>>> 		__zs_unmap_object(area, pages, off, class->size);
>>> 	}
>>> 	put_cpu_var(zs_map_area);
>>> +
>>> +	migrate_read_unlock(zspage);
>>> 	unpin_tag(handle);
>>> }
>>> EXPORT_SYMBOL_GPL(zs_unmap_object);
>>> @@ -1377,7 +1527,7 @@ static unsigned long obj_malloc(struct size_class *class,
>>> 	vaddr = kmap_atomic(m_page);
>>> 	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
>>> 	set_freeobj(zspage, link->next >> OBJ_ALLOCATED_TAG);
>>> -	if (!class->huge)
>>> +	if (likely(!PageHugeObject(m_page)))
>>> 		/* record handle in the header of allocated chunk */
>>> 		link->handle = handle;
>>> 	else
>>> @@ -1407,6 +1557,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>>> {
>>> 	unsigned long handle, obj;
>>> 	struct size_class *class;
>>> +	enum fullness_group newfg;
>>> 	struct zspage *zspage;
>>>
>>> 	if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
>>> @@ -1422,28 +1573,37 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>>>
>>> 	spin_lock(&class->lock);
>>> 	zspage = find_get_zspage(class);
>>> -
>>> -	if (!zspage) {
>>> +	if (likely(zspage)) {
>>> +		obj = obj_malloc(class, zspage, handle);
>>> +		/* Now move the zspage to another fullness group, if required */
>>> +		fix_fullness_group(class, zspage);
>>> +		record_obj(handle, obj);
>>> 		spin_unlock(&class->lock);
>>> -		zspage = alloc_zspage(pool, class, gfp);
>>> -		if (unlikely(!zspage)) {
>>> -			cache_free_handle(pool, handle);
>>> -			return 0;
>>> -		}
>>>
>>> -		set_zspage_mapping(zspage, class->index, ZS_EMPTY);
>>> -		atomic_long_add(class->pages_per_zspage,
>>> -					&pool->pages_allocated);
>>> +		return handle;
>>> +	}
>>>
>>> -		spin_lock(&class->lock);
>>> -		zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>> -				class->size, class->pages_per_zspage));
>>> +	spin_unlock(&class->lock);
>>> +
>>> +	zspage = alloc_zspage(pool, class, gfp);
>>> +	if (!zspage) {
>>> +		cache_free_handle(pool, handle);
>>> +		return 0;
>>> 	}
>>>
>>> +	spin_lock(&class->lock);
>>> 	obj = obj_malloc(class, zspage, handle);
>>> -	/* Now move the zspage to another fullness group, if required */
>>> -	fix_fullness_group(class, zspage);
>>> +	newfg = get_fullness_group(class, zspage);
>>> +	insert_zspage(class, zspage, newfg);
>>> +	set_zspage_mapping(zspage, class->index, newfg);
>>> 	record_obj(handle, obj);
>>> +	atomic_long_add(class->pages_per_zspage,
>>> +				&pool->pages_allocated);
>>> +	zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>> +			class->size, class->pages_per_zspage));
>>> +
>>> +	/* We completely set up zspage so mark them as movable */
>>> +	SetZsPageMovable(pool, zspage);
>>> 	spin_unlock(&class->lock);
>>>
>>> 	return handle;
>>> @@ -1484,6 +1644,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>>> 	int class_idx;
>>> 	struct size_class *class;
>>> 	enum fullness_group fullness;
>>> +	bool isolated;
>>>
>>> 	if (unlikely(!handle))
>>> 		return;
>>> @@ -1493,22 +1654,28 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>>> 	obj_to_location(obj, &f_page, &f_objidx);
>>> 	zspage = get_zspage(f_page);
>>>
>>> +	migrate_read_lock(zspage);
>>> +
>>> 	get_zspage_mapping(zspage, &class_idx, &fullness);
>>> 	class = pool->size_class[class_idx];
>>>
>>> 	spin_lock(&class->lock);
>>> 	obj_free(class, obj);
>>> 	fullness = fix_fullness_group(class, zspage);
>>> -	if (fullness == ZS_EMPTY) {
>>> -		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>> -				class->size, class->pages_per_zspage));
>>> -		atomic_long_sub(class->pages_per_zspage,
>>> -				&pool->pages_allocated);
>>> -		free_zspage(pool, zspage);
>>> +	if (fullness != ZS_EMPTY) {
>>> +		migrate_read_unlock(zspage);
>>> +		goto out;
>>> 	}
>>> +
>>> +	isolated = is_zspage_isolated(zspage);
>>> +	migrate_read_unlock(zspage);
>>> +	/* If zspage is isolated, zs_page_putback will free the zspage */
>>> +	if (likely(!isolated))
>>> +		free_zspage(pool, class, zspage);
>>> +out:
>>> +
>>> 	spin_unlock(&class->lock);
>>> 	unpin_tag(handle);
>>> -
>>> 	cache_free_handle(pool, handle);
>>> }
>>> EXPORT_SYMBOL_GPL(zs_free);
>>> @@ -1587,12 +1754,13 @@ static unsigned long find_alloced_obj(struct size_class *class,
>>> 	int offset = 0;
>>> 	unsigned long handle = 0;
>>> 	void *addr = kmap_atomic(page);
>>> +	struct zspage *zspage = get_zspage(page);
>>>
>>> -	offset = get_first_obj_offset(page);
>>> +	offset = get_first_obj_offset(class, get_first_page(zspage), page);
>>> 	offset += class->size * index;
>>>
>>> 	while (offset < PAGE_SIZE) {
>>> -		head = obj_to_head(class, page, addr + offset);
>>> +		head = obj_to_head(page, addr + offset);
>>> 		if (head & OBJ_ALLOCATED_TAG) {
>>> 			handle = head & ~OBJ_ALLOCATED_TAG;
>>> 			if (trypin_tag(handle))
>>> @@ -1684,6 +1852,7 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
>>> 		zspage = list_first_entry_or_null(&class->fullness_list[fg[i]],
>>> 							struct zspage, list);
>>> 		if (zspage) {
>>> +			VM_BUG_ON(is_zspage_isolated(zspage));
>>> 			remove_zspage(class, zspage, fg[i]);
>>> 			return zspage;
>>> 		}
>>> @@ -1704,6 +1873,8 @@ static enum fullness_group putback_zspage(struct size_class *class,
>>> {
>>> 	enum fullness_group fullness;
>>>
>>> +	VM_BUG_ON(is_zspage_isolated(zspage));
>>> +
>>> 	fullness = get_fullness_group(class, zspage);
>>> 	insert_zspage(class, zspage, fullness);
>>> 	set_zspage_mapping(zspage, class->index, fullness);
>>> @@ -1711,6 +1882,377 @@ static enum fullness_group putback_zspage(struct size_class *class,
>>> 	return fullness;
>>> }
>>>
>>> +#ifdef CONFIG_COMPACTION
>>> +static struct dentry *zs_mount(struct file_system_type *fs_type,
>>> +				int flags, const char *dev_name, void *data)
>>> +{
>>> +	static const struct dentry_operations ops = {
>>> +		.d_dname = simple_dname,
>>> +	};
>>> +
>>> +	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
>>> +}
>>> +
>>> +static struct file_system_type zsmalloc_fs = {
>>> +	.name		= "zsmalloc",
>>> +	.mount		= zs_mount,
>>> +	.kill_sb	= kill_anon_super,
>>> +};
>>> +
>>> +static int zsmalloc_mount(void)
>>> +{
>>> +	int ret = 0;
>>> +
>>> +	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
>>> +	if (IS_ERR(zsmalloc_mnt))
>>> +		ret = PTR_ERR(zsmalloc_mnt);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +static void zsmalloc_unmount(void)
>>> +{
>>> +	kern_unmount(zsmalloc_mnt);
>>> +}
>>> +
>>> +static void migrate_lock_init(struct zspage *zspage)
>>> +{
>>> +	rwlock_init(&zspage->lock);
>>> +}
>>> +
>>> +static void migrate_read_lock(struct zspage *zspage)
>>> +{
>>> +	read_lock(&zspage->lock);
>>> +}
>>> +
>>> +static void migrate_read_unlock(struct zspage *zspage)
>>> +{
>>> +	read_unlock(&zspage->lock);
>>> +}
>>> +
>>> +static void migrate_write_lock(struct zspage *zspage)
>>> +{
>>> +	write_lock(&zspage->lock);
>>> +}
>>> +
>>> +static void migrate_write_unlock(struct zspage *zspage)
>>> +{
>>> +	write_unlock(&zspage->lock);
>>> +}
>>> +
>>> +/* Number of isolated subpage for *page migration* in this zspage */
>>> +static void inc_zspage_isolation(struct zspage *zspage)
>>> +{
>>> +	zspage->isolated++;
>>> +}
>>> +
>>> +static void dec_zspage_isolation(struct zspage *zspage)
>>> +{
>>> +	zspage->isolated--;
>>> +}
>>> +
>>> +static void replace_sub_page(struct size_class *class, struct zspage *zspage,
>>> +				struct page *newpage, struct page *oldpage)
>>> +{
>>> +	struct page *page;
>>> +	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL, };
>>> +	int idx = 0;
>>> +
>>> +	page = get_first_page(zspage);
>>> +	do {
>>> +		if (page == oldpage)
>>> +			pages[idx] = newpage;
>>> +		else
>>> +			pages[idx] = page;
>>> +		idx++;
>>> +	} while ((page = get_next_page(page)) != NULL);
>>> +
>>> +	create_page_chain(class, zspage, pages);
>>> +	if (unlikely(PageHugeObject(oldpage)))
>>> +		newpage->index = oldpage->index;
>>> +	__SetPageMovable(newpage, page_mapping(oldpage));
>>> +}
>>> +
>>> +bool zs_page_isolate(struct page *page, isolate_mode_t mode)
>>> +{
>>> +	struct zs_pool *pool;
>>> +	struct size_class *class;
>>> +	int class_idx;
>>> +	enum fullness_group fullness;
>>> +	struct zspage *zspage;
>>> +	struct address_space *mapping;
>>> +
>>> +	/*
>>> +	 * Page is locked so zspage couldn't be destroyed. For detail, look at
>>> +	 * lock_zspage in free_zspage.
>>> +	 */
>>> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
>>> +	VM_BUG_ON_PAGE(PageIsolated(page), page);
>>> +
>>> +	zspage = get_zspage(page);
>>> +
>>> +	/*
>>> +	 * Without class lock, fullness could be stale while class_idx is okay
>>> +	 * because class_idx is constant unless page is freed so we should get
>>> +	 * fullness again under class lock.
>>> +	 */
>>> +	get_zspage_mapping(zspage, &class_idx, &fullness);
>>> +	mapping = page_mapping(page);
>>> +	pool = mapping->private_data;
>>> +	class = pool->size_class[class_idx];
>>> +
>>> +	spin_lock(&class->lock);
>>> +	if (get_zspage_inuse(zspage) == 0) {
>>> +		spin_unlock(&class->lock);
>>> +		return false;
>>> +	}
>>> +
>>> +	/* zspage is isolated for object migration */
>>> +	if (list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
>>> +		spin_unlock(&class->lock);
>>> +		return false;
>>> +	}
>>> +
>>> +	/*
>>> +	 * If this is first time isolation for the zspage, isolate zspage from
>>> +	 * size_class to prevent further object allocation from the zspage.
>>> +	 */
>>> +	if (!list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
>>> +		get_zspage_mapping(zspage, &class_idx, &fullness);
>>> +		remove_zspage(class, zspage, fullness);
>>> +	}
>>> +
>>> +	inc_zspage_isolation(zspage);
>>> +	spin_unlock(&class->lock);
>>> +
>>> +	return true;
>>> +}
>>> +
>>> +int zs_page_migrate(struct address_space *mapping, struct page *newpage,
>>> +		struct page *page, enum migrate_mode mode)
>>> +{
>>> +	struct zs_pool *pool;
>>> +	struct size_class *class;
>>> +	int class_idx;
>>> +	enum fullness_group fullness;
>>> +	struct zspage *zspage;
>>> +	struct page *dummy;
>>> +	void *s_addr, *d_addr, *addr;
>>> +	int offset, pos;
>>> +	unsigned long handle, head;
>>> +	unsigned long old_obj, new_obj;
>>> +	unsigned int obj_idx;
>>> +	int ret = -EAGAIN;
>>> +
>>> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
>>> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
>>> +
>>> +	zspage = get_zspage(page);
>>> +
>>> +	/* Concurrent compactor cannot migrate any subpage in zspage */
>>> +	migrate_write_lock(zspage);
>>> +	get_zspage_mapping(zspage, &class_idx, &fullness);
>>> +	pool = mapping->private_data;
>>> +	class = pool->size_class[class_idx];
>>> +	offset = get_first_obj_offset(class, get_first_page(zspage), page);
>>> +
>>> +	spin_lock(&class->lock);
>>> +	if (!get_zspage_inuse(zspage)) {
>>> +		ret = -EBUSY;
>>> +		goto unlock_class;
>>> +	}
>>> +
>>> +	pos = offset;
>>> +	s_addr = kmap_atomic(page);
>>> +	while (pos < PAGE_SIZE) {
>>> +		head = obj_to_head(page, s_addr + pos);
>>> +		if (head & OBJ_ALLOCATED_TAG) {
>>> +			handle = head & ~OBJ_ALLOCATED_TAG;
>>> +			if (!trypin_tag(handle))
>>> +				goto unpin_objects;
>>> +		}
>>> +		pos += class->size;
>>> +	}
>>> +
>>> +	/*
>>> +	 * Here, any user cannot access all objects in the zspage so let's move.
>>> +	 */
>>> +	d_addr = kmap_atomic(newpage);
>>> +	memcpy(d_addr, s_addr, PAGE_SIZE);
>>> +	kunmap_atomic(d_addr);
>>> +
>>> +	for (addr = s_addr + offset; addr < s_addr + pos;
>>> +					addr += class->size) {
>>> +		head = obj_to_head(page, addr);
>>> +		if (head & OBJ_ALLOCATED_TAG) {
>>> +			handle = head & ~OBJ_ALLOCATED_TAG;
>>> +			if (!testpin_tag(handle))
>>> +				BUG();
>>> +
>>> +			old_obj = handle_to_obj(handle);
>>> +			obj_to_location(old_obj, &dummy, &obj_idx);
>>> +			new_obj = (unsigned long)location_to_obj(newpage,
>>> +								obj_idx);
>>> +			new_obj |= BIT(HANDLE_PIN_BIT);
>>> +			record_obj(handle, new_obj);
>>> +		}
>>> +	}
>>> +
>>> +	replace_sub_page(class, zspage, newpage, page);
>>> +	get_page(newpage);
>>> +
>>> +	dec_zspage_isolation(zspage);
>>> +
>>> +	/*
>>> +	 * Page migration is done so let's putback isolated zspage to
>>> +	 * the list if @page is final isolated subpage in the zspage.
>>> +	 */
>>> +	if (!is_zspage_isolated(zspage))
>>> +		putback_zspage(class, zspage);
>>> +
>>> +	reset_page(page);
>>> +	put_page(page);
>>> +	page = newpage;
>>> +
>>> +	ret = 0;
>>> +unpin_objects:
>>> +	for (addr = s_addr + offset; addr < s_addr + pos;
>>> +						addr += class->size) {
>>> +		head = obj_to_head(page, addr);
>>> +		if (head & OBJ_ALLOCATED_TAG) {
>>> +			handle = head & ~OBJ_ALLOCATED_TAG;
>>> +			if (!testpin_tag(handle))
>>> +				BUG();
>>> +			unpin_tag(handle);
>>> +		}
>>> +	}
>>> +	kunmap_atomic(s_addr);
>>> +unlock_class:
>>> +	spin_unlock(&class->lock);
>>> +	migrate_write_unlock(zspage);
>>> +
>>> +	return ret;
>>> +}
>>> +
>>> +void zs_page_putback(struct page *page)
>>> +{
>>> +	struct zs_pool *pool;
>>> +	struct size_class *class;
>>> +	int class_idx;
>>> +	enum fullness_group fg;
>>> +	struct address_space *mapping;
>>> +	struct zspage *zspage;
>>> +
>>> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
>>> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
>>> +
>>> +	zspage = get_zspage(page);
>>> +	get_zspage_mapping(zspage, &class_idx, &fg);
>>> +	mapping = page_mapping(page);
>>> +	pool = mapping->private_data;
>>> +	class = pool->size_class[class_idx];
>>> +
>>> +	spin_lock(&class->lock);
>>> +	dec_zspage_isolation(zspage);
>>> +	if (!is_zspage_isolated(zspage)) {
>>> +		fg = putback_zspage(class, zspage);
>>> +		/*
>>> +		 * Due to page_lock, we cannot free zspage immediately
>>> +		 * so let's defer.
>>> +		 */
>>> +		if (fg == ZS_EMPTY)
>>> +			schedule_work(&pool->free_work);
>>> +	}
>>> +	spin_unlock(&class->lock);
>>> +}
>>> +
>>> +const struct address_space_operations zsmalloc_aops = {
>>> +	.isolate_page = zs_page_isolate,
>>> +	.migratepage = zs_page_migrate,
>>> +	.putback_page = zs_page_putback,
>>> +};
>>> +
>>> +static int zs_register_migration(struct zs_pool *pool)
>>> +{
>>> +	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
>>> +	if (IS_ERR(pool->inode)) {
>>> +		pool->inode = NULL;
>>> +		return 1;
>>> +	}
>>> +
>>> +	pool->inode->i_mapping->private_data = pool;
>>> +	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
>>> +	return 0;
>>> +}
>>> +
>>> +static void zs_unregister_migration(struct zs_pool *pool)
>>> +{
>>> +	flush_work(&pool->free_work);
>>> +	if (pool->inode)
>>> +		iput(pool->inode);
>>> +}
>>> +
>>> +/*
>>> + * Caller should hold page_lock of all pages in the zspage
>>> + * In here, we cannot use zspage meta data.
>>> + */
>>> +static void async_free_zspage(struct work_struct *work)
>>> +{
>>> +	int i;
>>> +	struct size_class *class;
>>> +	unsigned int class_idx;
>>> +	enum fullness_group fullness;
>>> +	struct zspage *zspage, *tmp;
>>> +	LIST_HEAD(free_pages);
>>> +	struct zs_pool *pool = container_of(work, struct zs_pool,
>>> +					free_work);
>>> +
>>> +	for (i = 0; i < zs_size_classes; i++) {
>>> +		class = pool->size_class[i];
>>> +		if (class->index != i)
>>> +			continue;
>>> +
>>> +		spin_lock(&class->lock);
>>> +		list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages);
>>> +		spin_unlock(&class->lock);
>>> +	}
>>> +
>>> +
>>> +	list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
>>> +		list_del(&zspage->list);
>>> +		lock_zspage(zspage);
>>> +
>>> +		get_zspage_mapping(zspage, &class_idx, &fullness);
>>> +		VM_BUG_ON(fullness != ZS_EMPTY);
>>> +		class = pool->size_class[class_idx];
>>> +		spin_lock(&class->lock);
>>> +		__free_zspage(pool, pool->size_class[class_idx], zspage);
>>> +		spin_unlock(&class->lock);
>>> +	}
>>> +};
>>> +
>>> +static void kick_deferred_free(struct zs_pool *pool)
>>> +{
>>> +	schedule_work(&pool->free_work);
>>> +}
>>> +
>>> +static void init_deferred_free(struct zs_pool *pool)
>>> +{
>>> +	INIT_WORK(&pool->free_work, async_free_zspage);
>>> +}
>>> +
>>> +static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage)
>>> +{
>>> +	struct page *page = get_first_page(zspage);
>>> +
>>> +	do {
>>> +		WARN_ON(!trylock_page(page));
>>> +		__SetPageMovable(page, pool->inode->i_mapping);
>>> +		unlock_page(page);
>>> +	} while ((page = get_next_page(page)) != NULL);
>>> +}
>>> +#endif
>>> +
>>> /*
>>>  *
>>>  * Based on the number of unused allocated objects calculate
>>> @@ -1745,10 +2287,10 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>>> 			break;
>>>
>>> 		cc.index = 0;
>>> -		cc.s_page = src_zspage->first_page;
>>> +		cc.s_page = get_first_page(src_zspage);
>>>
>>> 		while ((dst_zspage = isolate_zspage(class, false))) {
>>> -			cc.d_page = dst_zspage->first_page;
>>> +			cc.d_page = get_first_page(dst_zspage);
>>> 			/*
>>> 			 * If there is no more space in dst_page, resched
>>> 			 * and see if anyone had allocated another zspage.
>>> @@ -1765,11 +2307,7 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>>>
>>> 		putback_zspage(class, dst_zspage);
>>> 		if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
>>> -			zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>> -					class->size, class->pages_per_zspage));
>>> -			atomic_long_sub(class->pages_per_zspage,
>>> -					&pool->pages_allocated);
>>> -			free_zspage(pool, src_zspage);
>>> +			free_zspage(pool, class, src_zspage);
>>> 			pool->stats.pages_compacted += class->pages_per_zspage;
>>> 		}
>>> 		spin_unlock(&class->lock);
>>> @@ -1885,6 +2423,7 @@ struct zs_pool *zs_create_pool(const char *name)
>>> 	if (!pool)
>>> 		return NULL;
>>>
>>> +	init_deferred_free(pool);
>>> 	pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *),
>>> 			GFP_KERNEL);
>>> 	if (!pool->size_class) {
>>> @@ -1939,12 +2478,10 @@ struct zs_pool *zs_create_pool(const char *name)
>>> 		class->pages_per_zspage = pages_per_zspage;
>>> 		class->objs_per_zspage = class->pages_per_zspage *
>>> 						PAGE_SIZE / class->size;
>>> -		if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
>>> -			class->huge = true;
>>> 		spin_lock_init(&class->lock);
>>> 		pool->size_class[i] = class;
>>> -		for (fullness = ZS_ALMOST_FULL; fullness <= ZS_ALMOST_EMPTY;
>>> -								fullness++)
>>> +		for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS;
>>> +							fullness++)
>>> 			INIT_LIST_HEAD(&class->fullness_list[fullness]);
>>>
>>> 		prev_class = class;
>>> @@ -1953,6 +2490,9 @@ struct zs_pool *zs_create_pool(const char *name)
>>> 	/* debug only, don't abort if it fails */
>>> 	zs_pool_stat_create(pool, name);
>>>
>>> +	if (zs_register_migration(pool))
>>> +		goto err;
>>> +
>>> 	/*
>>> 	 * Not critical, we still can use the pool
>>> 	 * and user can trigger compaction manually.
>>> @@ -1972,6 +2512,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>>> 	int i;
>>>
>>> 	zs_unregister_shrinker(pool);
>>> +	zs_unregister_migration(pool);
>>> 	zs_pool_stat_destroy(pool);
>>>
>>> 	for (i = 0; i < zs_size_classes; i++) {
>>> @@ -1984,7 +2525,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>>> 		if (class->index != i)
>>> 			continue;
>>>
>>> -		for (fg = ZS_ALMOST_FULL; fg <= ZS_ALMOST_EMPTY; fg++) {
>>> +		for (fg = ZS_EMPTY; fg < NR_ZS_FULLNESS; fg++) {
>>> 			if (!list_empty(&class->fullness_list[fg])) {
>>> 				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
>>> 					class->size, fg);
>>> @@ -2002,7 +2543,13 @@ EXPORT_SYMBOL_GPL(zs_destroy_pool);
>>>
>>> static int __init zs_init(void)
>>> {
>>> -	int ret = zs_register_cpu_notifier();
>>> +	int ret;
>>> +
>>> +	ret = zsmalloc_mount();
>>> +	if (ret)
>>> +		goto out;
>>> +
>>> +	ret = zs_register_cpu_notifier();
>>>
>>> 	if (ret)
>>> 		goto notifier_fail;
>>> @@ -2019,7 +2566,8 @@ static int __init zs_init(void)
>>>
>>> notifier_fail:
>>> 	zs_unregister_cpu_notifier();
>>> -
>>> +	zsmalloc_unmount();
>>> +out:
>>> 	return ret;
>>> }
>>>
>>> @@ -2028,6 +2576,7 @@ static void __exit zs_exit(void)
>>> #ifdef CONFIG_ZPOOL
>>> 	zpool_unregister_driver(&zs_zpool_driver);
>>> #endif
>>> +	zsmalloc_unmount();
>>> 	zs_unregister_cpu_notifier();
>>>
>>> 	zs_stat_exit();
>>>
>>
>
>


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-19  3:39         ` Chulmin Kim
@ 2017-01-19  6:21           ` Minchan Kim
  2017-01-19  8:16             ` Chulmin Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2017-01-19  6:21 UTC (permalink / raw)
  To: Chulmin Kim; +Cc: Andrew Morton, linux-mm, Sergey Senozhatsky

On Wed, Jan 18, 2017 at 10:39:15PM -0500, Chulmin Kim wrote:
> On 01/18/2017 09:44 PM, Minchan Kim wrote:
> >Hello Chulmin,
> >
> >On Wed, Jan 18, 2017 at 07:13:21PM -0500, Chulmin Kim wrote:
> >>Hello, Minchan and all zsmalloc guys.
> >>
> >>I have a quick question.
> >>Is zsmalloc considering memory barrier things correctly?
> >>
> >>AFAIK, in ARM64,
> >>zsmalloc relies on dmb operation in bit_spin_unlock only.
> >>(It seems that dmb operations in spinlock functions are being prepared,
> >>but let us set that aside as it is not merged yet.)
> >>
> >>If I am correct,
> >>migrating a page in a zspage filled with free objs
> >>may cause corruption because bit_spin_unlock will not be executed at all.
> >>
> >>I am not sure this is a sufficient memory barrier for zsmalloc operations.
> >>
> >>Can you enlighten me?
> >
> >Do you mean bit_spin_unlock is broken, or that the zsmalloc locking scheme is broken?
> >Could you please describe what you are concerning in detail?
> >It would be very helpful if you could explain it with an example!
> 
> Sorry for ambiguous expressions. :)
> 
> Recently,
> I found multiple zsmalloc corruption cases which have garbage idx values
> in zspage->freeobj. (not the ffffffff (-1) value.)
> 
> Honestly, I have no clue yet.
> 
> I suspect the case where migration moves a zs subpage filled with free
> objects (so it never calls unpin_tag(), which has the memory barrier).
> 
> 
> Assume the page (zs subpage) being migrated has no allocated zs object.
> 
> S : zs subpage
> D : free page
> 
> 
> CPU A : zs_page_migrate()		CPU B : zs_malloc()
> ---------------------			-----------------------------
> 
> 
> migrate_write_lock()
> spin_lock()
> 
> memcpy(D, S, PAGE_SIZE)   -> (1)
> replace_sub_page()
> 
> putback_zspage()
> spin_unlock()
> migrate_write_unlock()
> 					
> 					spin_lock()
> 					obj_malloc()
> 					--> (2-a) allocate obj in D
> 					--> (2-b) set freeobj using
>      						the first 8 bytes of
>  						the allocated obj
> 					record_obj()
> 					spin_unlock
> 
> 
> 
> I think the locking itself has no problem, but the memory ordering might.
> I doubt whether (2-b) in CPU B really loads the data stored by (1).
> 
> If it doesn't, set_freeobj in (2-b) will corrupt zspage->freeobj.
> After then, we will see corrupted object sooner or later.

Thanks for the example.
However, I cannot understand what you are pointing out.

In the above example, the two CPUs use the same class spin_lock, so the store
done by memcpy in the critical section should be visible to CPU B.
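
To spell out what I rely on, here is a minimal sketch (not code from the
patch; demo_lock, shared, cpu_a_store and cpu_b_load are made-up names):
spin_unlock() is a RELEASE and spin_lock() is an ACQUIRE, so a store done
inside the critical section on one CPU must be visible to any CPU that
takes the same lock afterwards.

static DEFINE_SPINLOCK(demo_lock);
static unsigned long shared;		/* stands in for the zspage payload */

/* CPU A: the memcpy in zs_page_migrate, reduced to a single store */
static void cpu_a_store(void)
{
	spin_lock(&demo_lock);
	shared = 42;			/* store inside the critical section */
	spin_unlock(&demo_lock);	/* RELEASE: the store cannot pass this */
}

/* CPU B: the freelist read in obj_malloc, reduced to a single load */
static unsigned long cpu_b_load(void)
{
	unsigned long val;

	spin_lock(&demo_lock);		/* ACQUIRE: pairs with CPU A's unlock */
	val = shared;			/* sees 42 if this acquire came after A's unlock */
	spin_unlock(&demo_lock);

	return val;
}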

Am I missing your point?

> 
> 
> According to the below link,
> (https://patchwork.kernel.org/patch/9313493/)
> the spin lock on a specific arch (arm64, maybe) seems not to guarantee memory
> ordering.

IMHO, it's not related to this issue.
It only matters when data is updated without a formal locking scheme.
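
For contrast, a rough sketch of the pattern the quoted comment below is
about (again with made-up names: q_lock, cond, sleeper_waiting,
sleeper_task): a store issued *before* spin_lock() that must not be
reordered against a load done inside the critical section, as in the
try_to_wake_up() style wait/wake handshake.

static DEFINE_SPINLOCK(q_lock);
static int cond;
static int sleeper_waiting;
static struct task_struct *sleeper_task;	/* hypothetical sleeper */

/* waker: the sleeper sets sleeper_waiting and then checks cond in the
 * reverse order under its own path
 */
static void waker(void)
{
	WRITE_ONCE(cond, 1);		/* store issued outside the lock */
	smp_mb__before_spinlock();	/* keep it ordered before ... */
	spin_lock(&q_lock);
	if (READ_ONCE(sleeper_waiting))	/* ... this load inside the section */
		wake_up_process(sleeper_task);
	spin_unlock(&q_lock);
}

In zsmalloc, the memcpy store is done inside the class->lock critical
section, not before spin_lock, so that case doesn't apply as far as I can
see.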

Thanks.

> 
> ===
> +/*
> + * Accesses appearing in program order before a spin_lock() operation
> + * can be reordered with accesses inside the critical section, by virtue
> + * of arch_spin_lock being constructed using acquire semantics.
> + *
> + * In cases where this is problematic (e.g. try_to_wake_up), an
> + * smp_mb__before_spinlock() can restore the required ordering.
> + */
> +#define smp_mb__before_spinlock()	smp_mb()
> ===
> 
> 
> 
> Thanks.
> Chulmin Kim
> 
> 
> 
> 
> 
> >
> >Thanks.
> >
> >>
> >>
> >>Thanks!
> >>Chulmin Kim
> >>
> >>
> >>
> >>On 05/31/2016 07:21 PM, Minchan Kim wrote:
> >>>This patch introduces run-time migration feature for zspage.
> >>>
> >>>For migration, VM uses the page.lru field so it would be better not to use
> >>>the page.next field, which is unified with page.lru, for our own purpose.
> >>>For that, firstly, we can get the first object offset of the page via
> >>>runtime calculation instead of using page.index, so we can use
> >>>page.index as the link for page chaining instead of page.next.
> >>>	
> >>>In the case of a huge object, it stores the handle in page.index instead of
> >>>the next link of the page chain, because a huge object doesn't need the next
> >>>link for page chaining. So get_next_page needs to identify a huge
> >>>object and return NULL. For that, this patch uses the PG_owner_priv_1
> >>>page flag.
> >>>
> >>>For migration, it supports three functions
> >>>
> >>>* zs_page_isolate
> >>>
> >>>It isolates a zspage which includes a subpage VM wants to migrate
> >>>from the class so nobody can allocate a new object from the zspage.
> >>>
> >>>We could try to isolate a zspage multiple times, once per subpage, so a
> >>>subsequent isolation trial for another subpage of the zspage shouldn't
> >>>fail. For that, we introduce the zspage.isolated count. With that,
> >>>zs_page_isolate can know whether the zspage is already isolated or not
> >>>for migration, so if it is already isolated for migration, a subsequent
> >>>isolation trial can succeed without trying further isolation.
> >>>
> >>>* zs_page_migrate
> >>>
> >>>First of all, it holds the write-side zspage->lock to prevent migration of
> >>>other subpages in the zspage. Then, it locks all objects in the page VM
> >>>wants to migrate. The reason we should lock all objects in the page is the
> >>>race between zs_map_object and zs_page_migrate.
> >>>
> >>>zs_map_object				zs_page_migrate
> >>>
> >>>pin_tag(handle)
> >>>obj = handle_to_obj(handle)
> >>>obj_to_location(obj, &page, &obj_idx);
> >>>
> >>>					write_lock(&zspage->lock)
> >>>					if (!trypin_tag(handle))
> >>>						goto unpin_object
> >>>
> >>>zspage = get_zspage(page);
> >>>read_lock(&zspage->lock);
> >>>
> >>>If zs_page_migrate doesn't do trypin_tag, zs_map_object's page can
> >>>become stale due to migration, so it crashes.
> >>>
> >>>If it locks all of the objects successfully, it copies the content from the
> >>>old page to the new one and, finally, creates a new zspage chain with the
> >>>new page. And if it is the last isolated subpage in the zspage, it puts the
> >>>zspage back to the class.
> >>>
> >>>* zs_page_putback
> >>>
> >>>It returns an isolated zspage to the right fullness_group list if it fails to
> >>>migrate a page. If it finds a zspage is ZS_EMPTY, it queues the zspage
> >>>freeing to a workqueue. See below about async zspage freeing.
> >>>
> >>>This patch introduces asynchronous zspage freeing. The reason we need it
> >>>is that we need page_lock to clear PG_movable but, unfortunately, the
> >>>zs_free path should be atomic, so the approach is to try to grab page_lock.
> >>>If it gets page_lock of all of the pages successfully, it can free the zspage
> >>>immediately. Otherwise, it queues a free request and frees the zspage via a
> >>>workqueue in process context.
> >>>
> >>>If zs_free finds the zspage is isolated when it tries to free the zspage,
> >>>it delays the freeing until zs_page_putback finds it, so the zspage will
> >>>finally be freed.
> >>>
> >>>In this patch, we expand fullness_list from ZS_EMPTY to ZS_FULL.
> >>>First of all, it will use the ZS_EMPTY list for delayed freeing.
> >>>And with the ZS_FULL list added, it becomes possible to identify whether a
> >>>zspage is isolated or not via the list_empty(&zspage->list) test.
> >>>
> >>>Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
> >>>Signed-off-by: Minchan Kim <minchan@kernel.org>
> >>>---
> >>>include/uapi/linux/magic.h |   1 +
> >>>mm/zsmalloc.c              | 793 ++++++++++++++++++++++++++++++++++++++-------
> >>>2 files changed, 672 insertions(+), 122 deletions(-)
> >>>
> >>>diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> >>>index d829ce63529d..e398beac67b8 100644
> >>>--- a/include/uapi/linux/magic.h
> >>>+++ b/include/uapi/linux/magic.h
> >>>@@ -81,5 +81,6 @@
> >>>/* Since UDF 2.01 is ISO 13346 based... */
> >>>#define UDF_SUPER_MAGIC		0x15013346
> >>>#define BALLOON_KVM_MAGIC	0x13661366
> >>>+#define ZSMALLOC_MAGIC		0x58295829
> >>>
> >>>#endif /* __LINUX_MAGIC_H__ */
> >>>diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> >>>index c6fb543cfb98..a80100db16d6 100644
> >>>--- a/mm/zsmalloc.c
> >>>+++ b/mm/zsmalloc.c
> >>>@@ -17,14 +17,14 @@
> >>> *
> >>> * Usage of struct page fields:
> >>> *	page->private: points to zspage
> >>>- *	page->index: offset of the first object starting in this page.
> >>>- *		For the first page, this is always 0, so we use this field
> >>>- *		to store handle for huge object.
> >>>- *	page->next: links together all component pages of a zspage
> >>>+ *	page->freelist(index): links together all component pages of a zspage
> >>>+ *		For the huge page, this is always 0, so we use this field
> >>>+ *		to store handle.
> >>> *
> >>> * Usage of struct page flags:
> >>> *	PG_private: identifies the first component page
> >>> *	PG_private2: identifies the last component page
> >>>+ *	PG_owner_priv_1: indentifies the huge component page
> >>> *
> >>> */
> >>>
> >>>@@ -49,6 +49,11 @@
> >>>#include <linux/debugfs.h>
> >>>#include <linux/zsmalloc.h>
> >>>#include <linux/zpool.h>
> >>>+#include <linux/mount.h>
> >>>+#include <linux/compaction.h>
> >>>+#include <linux/pagemap.h>
> >>>+
> >>>+#define ZSPAGE_MAGIC	0x58
> >>>
> >>>/*
> >>> * This must be power of 2 and greater than of equal to sizeof(link_free).
> >>>@@ -136,25 +141,23 @@
> >>> * We do not maintain any list for completely empty or full pages
> >>> */
> >>>enum fullness_group {
> >>>-	ZS_ALMOST_FULL,
> >>>-	ZS_ALMOST_EMPTY,
> >>>	ZS_EMPTY,
> >>>-	ZS_FULL
> >>>+	ZS_ALMOST_EMPTY,
> >>>+	ZS_ALMOST_FULL,
> >>>+	ZS_FULL,
> >>>+	NR_ZS_FULLNESS,
> >>>};
> >>>
> >>>enum zs_stat_type {
> >>>+	CLASS_EMPTY,
> >>>+	CLASS_ALMOST_EMPTY,
> >>>+	CLASS_ALMOST_FULL,
> >>>+	CLASS_FULL,
> >>>	OBJ_ALLOCATED,
> >>>	OBJ_USED,
> >>>-	CLASS_ALMOST_FULL,
> >>>-	CLASS_ALMOST_EMPTY,
> >>>+	NR_ZS_STAT_TYPE,
> >>>};
> >>>
> >>>-#ifdef CONFIG_ZSMALLOC_STAT
> >>>-#define NR_ZS_STAT_TYPE	(CLASS_ALMOST_EMPTY + 1)
> >>>-#else
> >>>-#define NR_ZS_STAT_TYPE	(OBJ_USED + 1)
> >>>-#endif
> >>>-
> >>>struct zs_size_stat {
> >>>	unsigned long objs[NR_ZS_STAT_TYPE];
> >>>};
> >>>@@ -163,6 +166,10 @@ struct zs_size_stat {
> >>>static struct dentry *zs_stat_root;
> >>>#endif
> >>>
> >>>+#ifdef CONFIG_COMPACTION
> >>>+static struct vfsmount *zsmalloc_mnt;
> >>>+#endif
> >>>+
> >>>/*
> >>> * number of size_classes
> >>> */
> >>>@@ -186,23 +193,36 @@ static const int fullness_threshold_frac = 4;
> >>>
> >>>struct size_class {
> >>>	spinlock_t lock;
> >>>-	struct list_head fullness_list[2];
> >>>+	struct list_head fullness_list[NR_ZS_FULLNESS];
> >>>	/*
> >>>	 * Size of objects stored in this class. Must be multiple
> >>>	 * of ZS_ALIGN.
> >>>	 */
> >>>	int size;
> >>>	int objs_per_zspage;
> >>>-	unsigned int index;
> >>>-
> >>>-	struct zs_size_stat stats;
> >>>-
> >>>	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
> >>>	int pages_per_zspage;
> >>>-	/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
> >>>-	bool huge;
> >>>+
> >>>+	unsigned int index;
> >>>+	struct zs_size_stat stats;
> >>>};
> >>>
> >>>+/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
> >>>+static void SetPageHugeObject(struct page *page)
> >>>+{
> >>>+	SetPageOwnerPriv1(page);
> >>>+}
> >>>+
> >>>+static void ClearPageHugeObject(struct page *page)
> >>>+{
> >>>+	ClearPageOwnerPriv1(page);
> >>>+}
> >>>+
> >>>+static int PageHugeObject(struct page *page)
> >>>+{
> >>>+	return PageOwnerPriv1(page);
> >>>+}
> >>>+
> >>>/*
> >>> * Placed within free objects to form a singly linked list.
> >>> * For every zspage, zspage->freeobj gives head of this list.
> >>>@@ -244,6 +264,10 @@ struct zs_pool {
> >>>#ifdef CONFIG_ZSMALLOC_STAT
> >>>	struct dentry *stat_dentry;
> >>>#endif
> >>>+#ifdef CONFIG_COMPACTION
> >>>+	struct inode *inode;
> >>>+	struct work_struct free_work;
> >>>+#endif
> >>>};
> >>>
> >>>/*
> >>>@@ -252,16 +276,23 @@ struct zs_pool {
> >>> */
> >>>#define FULLNESS_BITS	2
> >>>#define CLASS_BITS	8
> >>>+#define ISOLATED_BITS	3
> >>>+#define MAGIC_VAL_BITS	8
> >>>
> >>>struct zspage {
> >>>	struct {
> >>>		unsigned int fullness:FULLNESS_BITS;
> >>>		unsigned int class:CLASS_BITS;
> >>>+		unsigned int isolated:ISOLATED_BITS;
> >>>+		unsigned int magic:MAGIC_VAL_BITS;
> >>>	};
> >>>	unsigned int inuse;
> >>>	unsigned int freeobj;
> >>>	struct page *first_page;
> >>>	struct list_head list; /* fullness list */
> >>>+#ifdef CONFIG_COMPACTION
> >>>+	rwlock_t lock;
> >>>+#endif
> >>>};
> >>>
> >>>struct mapping_area {
> >>>@@ -274,6 +305,28 @@ struct mapping_area {
> >>>	enum zs_mapmode vm_mm; /* mapping mode */
> >>>};
> >>>
> >>>+#ifdef CONFIG_COMPACTION
> >>>+static int zs_register_migration(struct zs_pool *pool);
> >>>+static void zs_unregister_migration(struct zs_pool *pool);
> >>>+static void migrate_lock_init(struct zspage *zspage);
> >>>+static void migrate_read_lock(struct zspage *zspage);
> >>>+static void migrate_read_unlock(struct zspage *zspage);
> >>>+static void kick_deferred_free(struct zs_pool *pool);
> >>>+static void init_deferred_free(struct zs_pool *pool);
> >>>+static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage);
> >>>+#else
> >>>+static int zsmalloc_mount(void) { return 0; }
> >>>+static void zsmalloc_unmount(void) {}
> >>>+static int zs_register_migration(struct zs_pool *pool) { return 0; }
> >>>+static void zs_unregister_migration(struct zs_pool *pool) {}
> >>>+static void migrate_lock_init(struct zspage *zspage) {}
> >>>+static void migrate_read_lock(struct zspage *zspage) {}
> >>>+static void migrate_read_unlock(struct zspage *zspage) {}
> >>>+static void kick_deferred_free(struct zs_pool *pool) {}
> >>>+static void init_deferred_free(struct zs_pool *pool) {}
> >>>+static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {}
> >>>+#endif
> >>>+
> >>>static int create_cache(struct zs_pool *pool)
> >>>{
> >>>	pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
> >>>@@ -301,7 +354,7 @@ static void destroy_cache(struct zs_pool *pool)
> >>>static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
> >>>{
> >>>	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
> >>>-			gfp & ~__GFP_HIGHMEM);
> >>>+			gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
> >>>}
> >>>
> >>>static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
> >>>@@ -311,7 +364,8 @@ static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
> >>>
> >>>static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
> >>>{
> >>>-	return kmem_cache_alloc(pool->zspage_cachep, flags & ~__GFP_HIGHMEM);
> >>>+	return kmem_cache_alloc(pool->zspage_cachep,
> >>>+			flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
> >>>};
> >>>
> >>>static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
> >>>@@ -421,11 +475,17 @@ static unsigned int get_maxobj_per_zspage(int size, int pages_per_zspage)
> >>>/* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
> >>>static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
> >>>
> >>>+static bool is_zspage_isolated(struct zspage *zspage)
> >>>+{
> >>>+	return zspage->isolated;
> >>>+}
> >>>+
> >>>static int is_first_page(struct page *page)
> >>>{
> >>>	return PagePrivate(page);
> >>>}
> >>>
> >>>+/* Protected by class->lock */
> >>>static inline int get_zspage_inuse(struct zspage *zspage)
> >>>{
> >>>	return zspage->inuse;
> >>>@@ -441,20 +501,12 @@ static inline void mod_zspage_inuse(struct zspage *zspage, int val)
> >>>	zspage->inuse += val;
> >>>}
> >>>
> >>>-static inline int get_first_obj_offset(struct page *page)
> >>>+static inline struct page *get_first_page(struct zspage *zspage)
> >>>{
> >>>-	if (is_first_page(page))
> >>>-		return 0;
> >>>+	struct page *first_page = zspage->first_page;
> >>>
> >>>-	return page->index;
> >>>-}
> >>>-
> >>>-static inline void set_first_obj_offset(struct page *page, int offset)
> >>>-{
> >>>-	if (is_first_page(page))
> >>>-		return;
> >>>-
> >>>-	page->index = offset;
> >>>+	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
> >>>+	return first_page;
> >>>}
> >>>
> >>>static inline unsigned int get_freeobj(struct zspage *zspage)
> >>>@@ -471,6 +523,8 @@ static void get_zspage_mapping(struct zspage *zspage,
> >>>				unsigned int *class_idx,
> >>>				enum fullness_group *fullness)
> >>>{
> >>>+	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
> >>>+
> >>>	*fullness = zspage->fullness;
> >>>	*class_idx = zspage->class;
> >>>}
> >>>@@ -504,23 +558,19 @@ static int get_size_class_index(int size)
> >>>static inline void zs_stat_inc(struct size_class *class,
> >>>				enum zs_stat_type type, unsigned long cnt)
> >>>{
> >>>-	if (type < NR_ZS_STAT_TYPE)
> >>>-		class->stats.objs[type] += cnt;
> >>>+	class->stats.objs[type] += cnt;
> >>>}
> >>>
> >>>static inline void zs_stat_dec(struct size_class *class,
> >>>				enum zs_stat_type type, unsigned long cnt)
> >>>{
> >>>-	if (type < NR_ZS_STAT_TYPE)
> >>>-		class->stats.objs[type] -= cnt;
> >>>+	class->stats.objs[type] -= cnt;
> >>>}
> >>>
> >>>static inline unsigned long zs_stat_get(struct size_class *class,
> >>>				enum zs_stat_type type)
> >>>{
> >>>-	if (type < NR_ZS_STAT_TYPE)
> >>>-		return class->stats.objs[type];
> >>>-	return 0;
> >>>+	return class->stats.objs[type];
> >>>}
> >>>
> >>>#ifdef CONFIG_ZSMALLOC_STAT
> >>>@@ -664,6 +714,7 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
> >>>}
> >>>#endif
> >>>
> >>>+
> >>>/*
> >>> * For each size class, zspages are divided into different groups
> >>> * depending on how "full" they are. This was done so that we could
> >>>@@ -704,15 +755,9 @@ static void insert_zspage(struct size_class *class,
> >>>{
> >>>	struct zspage *head;
> >>>
> >>>-	if (fullness >= ZS_EMPTY)
> >>>-		return;
> >>>-
> >>>+	zs_stat_inc(class, fullness, 1);
> >>>	head = list_first_entry_or_null(&class->fullness_list[fullness],
> >>>					struct zspage, list);
> >>>-
> >>>-	zs_stat_inc(class, fullness == ZS_ALMOST_EMPTY ?
> >>>-			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
> >>>-
> >>>	/*
> >>>	 * We want to see more ZS_FULL pages and less almost empty/full.
> >>>	 * Put pages with higher ->inuse first.
> >>>@@ -734,14 +779,11 @@ static void remove_zspage(struct size_class *class,
> >>>				struct zspage *zspage,
> >>>				enum fullness_group fullness)
> >>>{
> >>>-	if (fullness >= ZS_EMPTY)
> >>>-		return;
> >>>-
> >>>	VM_BUG_ON(list_empty(&class->fullness_list[fullness]));
> >>>+	VM_BUG_ON(is_zspage_isolated(zspage));
> >>>
> >>>	list_del_init(&zspage->list);
> >>>-	zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ?
> >>>-			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
> >>>+	zs_stat_dec(class, fullness, 1);
> >>>}
> >>>
> >>>/*
> >>>@@ -764,8 +806,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
> >>>	if (newfg == currfg)
> >>>		goto out;
> >>>
> >>>-	remove_zspage(class, zspage, currfg);
> >>>-	insert_zspage(class, zspage, newfg);
> >>>+	if (!is_zspage_isolated(zspage)) {
> >>>+		remove_zspage(class, zspage, currfg);
> >>>+		insert_zspage(class, zspage, newfg);
> >>>+	}
> >>>+
> >>>	set_zspage_mapping(zspage, class_idx, newfg);
> >>>
> >>>out:
> >>>@@ -808,19 +853,45 @@ static int get_pages_per_zspage(int class_size)
> >>>	return max_usedpc_order;
> >>>}
> >>>
> >>>-static struct page *get_first_page(struct zspage *zspage)
> >>>+static struct zspage *get_zspage(struct page *page)
> >>>{
> >>>-	return zspage->first_page;
> >>>+	struct zspage *zspage = (struct zspage *)page->private;
> >>>+
> >>>+	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
> >>>+	return zspage;
> >>>}
> >>>
> >>>-static struct zspage *get_zspage(struct page *page)
> >>>+static struct page *get_next_page(struct page *page)
> >>>{
> >>>-	return (struct zspage *)page->private;
> >>>+	if (unlikely(PageHugeObject(page)))
> >>>+		return NULL;
> >>>+
> >>>+	return page->freelist;
> >>>}
> >>>
> >>>-static struct page *get_next_page(struct page *page)
> >>>+/* Get byte offset of first object in the @page */
> >>>+static int get_first_obj_offset(struct size_class *class,
> >>>+				struct page *first_page, struct page *page)
> >>>{
> >>>-	return page->next;
> >>>+	int pos;
> >>>+	int page_idx = 0;
> >>>+	int ofs = 0;
> >>>+	struct page *cursor = first_page;
> >>>+
> >>>+	if (first_page == page)
> >>>+		goto out;
> >>>+
> >>>+	while (page != cursor) {
> >>>+		page_idx++;
> >>>+		cursor = get_next_page(cursor);
> >>>+	}
> >>>+
> >>>+	pos = class->objs_per_zspage * class->size *
> >>>+		page_idx / class->pages_per_zspage;
> >>>+
> >>>+	ofs = (pos + class->size) % PAGE_SIZE;
> >>>+out:
> >>>+	return ofs;
> >>>}
> >>>
> >>>/**
> >>>@@ -857,16 +928,20 @@ static unsigned long handle_to_obj(unsigned long handle)
> >>>	return *(unsigned long *)handle;
> >>>}
> >>>
> >>>-static unsigned long obj_to_head(struct size_class *class, struct page *page,
> >>>-			void *obj)
> >>>+static unsigned long obj_to_head(struct page *page, void *obj)
> >>>{
> >>>-	if (class->huge) {
> >>>+	if (unlikely(PageHugeObject(page))) {
> >>>		VM_BUG_ON_PAGE(!is_first_page(page), page);
> >>>		return page->index;
> >>>	} else
> >>>		return *(unsigned long *)obj;
> >>>}
> >>>
> >>>+static inline int testpin_tag(unsigned long handle)
> >>>+{
> >>>+	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
> >>>+}
> >>>+
> >>>static inline int trypin_tag(unsigned long handle)
> >>>{
> >>>	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
> >>>@@ -884,27 +959,93 @@ static void unpin_tag(unsigned long handle)
> >>>
> >>>static void reset_page(struct page *page)
> >>>{
> >>>+	__ClearPageMovable(page);
> >>>	clear_bit(PG_private, &page->flags);
> >>>	clear_bit(PG_private_2, &page->flags);
> >>>	set_page_private(page, 0);
> >>>-	page->index = 0;
> >>>+	ClearPageHugeObject(page);
> >>>+	page->freelist = NULL;
> >>>}
> >>>
> >>>-static void free_zspage(struct zs_pool *pool, struct zspage *zspage)
> >>>+/*
> >>>+ * To prevent zspage destroy during migration, zspage freeing should
> >>>+ * hold locks of all pages in the zspage.
> >>>+ */
> >>>+void lock_zspage(struct zspage *zspage)
> >>>+{
> >>>+	struct page *page = get_first_page(zspage);
> >>>+
> >>>+	do {
> >>>+		lock_page(page);
> >>>+	} while ((page = get_next_page(page)) != NULL);
> >>>+}
> >>>+
> >>>+int trylock_zspage(struct zspage *zspage)
> >>>+{
> >>>+	struct page *cursor, *fail;
> >>>+
> >>>+	for (cursor = get_first_page(zspage); cursor != NULL; cursor =
> >>>+					get_next_page(cursor)) {
> >>>+		if (!trylock_page(cursor)) {
> >>>+			fail = cursor;
> >>>+			goto unlock;
> >>>+		}
> >>>+	}
> >>>+
> >>>+	return 1;
> >>>+unlock:
> >>>+	for (cursor = get_first_page(zspage); cursor != fail; cursor =
> >>>+					get_next_page(cursor))
> >>>+		unlock_page(cursor);
> >>>+
> >>>+	return 0;
> >>>+}
> >>>+
> >>>+static void __free_zspage(struct zs_pool *pool, struct size_class *class,
> >>>+				struct zspage *zspage)
> >>>{
> >>>	struct page *page, *next;
> >>>+	enum fullness_group fg;
> >>>+	unsigned int class_idx;
> >>>+
> >>>+	get_zspage_mapping(zspage, &class_idx, &fg);
> >>>+
> >>>+	assert_spin_locked(&class->lock);
> >>>
> >>>	VM_BUG_ON(get_zspage_inuse(zspage));
> >>>+	VM_BUG_ON(fg != ZS_EMPTY);
> >>>
> >>>-	next = page = zspage->first_page;
> >>>+	next = page = get_first_page(zspage);
> >>>	do {
> >>>-		next = page->next;
> >>>+		VM_BUG_ON_PAGE(!PageLocked(page), page);
> >>>+		next = get_next_page(page);
> >>>		reset_page(page);
> >>>+		unlock_page(page);
> >>>		put_page(page);
> >>>		page = next;
> >>>	} while (page != NULL);
> >>>
> >>>	cache_free_zspage(pool, zspage);
> >>>+
> >>>+	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >>>+			class->size, class->pages_per_zspage));
> >>>+	atomic_long_sub(class->pages_per_zspage,
> >>>+					&pool->pages_allocated);
> >>>+}
> >>>+
> >>>+static void free_zspage(struct zs_pool *pool, struct size_class *class,
> >>>+				struct zspage *zspage)
> >>>+{
> >>>+	VM_BUG_ON(get_zspage_inuse(zspage));
> >>>+	VM_BUG_ON(list_empty(&zspage->list));
> >>>+
> >>>+	if (!trylock_zspage(zspage)) {
> >>>+		kick_deferred_free(pool);
> >>>+		return;
> >>>+	}
> >>>+
> >>>+	remove_zspage(class, zspage, ZS_EMPTY);
> >>>+	__free_zspage(pool, class, zspage);
> >>>}
> >>>
> >>>/* Initialize a newly allocated zspage */
> >>>@@ -912,15 +1053,13 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
> >>>{
> >>>	unsigned int freeobj = 1;
> >>>	unsigned long off = 0;
> >>>-	struct page *page = zspage->first_page;
> >>>+	struct page *page = get_first_page(zspage);
> >>>
> >>>	while (page) {
> >>>		struct page *next_page;
> >>>		struct link_free *link;
> >>>		void *vaddr;
> >>>
> >>>-		set_first_obj_offset(page, off);
> >>>-
> >>>		vaddr = kmap_atomic(page);
> >>>		link = (struct link_free *)vaddr + off / sizeof(*link);
> >>>
> >>>@@ -952,16 +1091,17 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
> >>>	set_freeobj(zspage, 0);
> >>>}
> >>>
> >>>-static void create_page_chain(struct zspage *zspage, struct page *pages[],
> >>>-				int nr_pages)
> >>>+static void create_page_chain(struct size_class *class, struct zspage *zspage,
> >>>+				struct page *pages[])
> >>>{
> >>>	int i;
> >>>	struct page *page;
> >>>	struct page *prev_page = NULL;
> >>>+	int nr_pages = class->pages_per_zspage;
> >>>
> >>>	/*
> >>>	 * Allocate individual pages and link them together as:
> >>>-	 * 1. all pages are linked together using page->next
> >>>+	 * 1. all pages are linked together using page->freelist
> >>>	 * 2. each sub-page point to zspage using page->private
> >>>	 *
> >>>	 * we set PG_private to identify the first page (i.e. no other sub-page
> >>>@@ -970,16 +1110,18 @@ static void create_page_chain(struct zspage *zspage, struct page *pages[],
> >>>	for (i = 0; i < nr_pages; i++) {
> >>>		page = pages[i];
> >>>		set_page_private(page, (unsigned long)zspage);
> >>>+		page->freelist = NULL;
> >>>		if (i == 0) {
> >>>			zspage->first_page = page;
> >>>			SetPagePrivate(page);
> >>>+			if (unlikely(class->objs_per_zspage == 1 &&
> >>>+					class->pages_per_zspage == 1))
> >>>+				SetPageHugeObject(page);
> >>>		} else {
> >>>-			prev_page->next = page;
> >>>+			prev_page->freelist = page;
> >>>		}
> >>>-		if (i == nr_pages - 1) {
> >>>+		if (i == nr_pages - 1)
> >>>			SetPagePrivate2(page);
> >>>-			page->next = NULL;
> >>>-		}
> >>>		prev_page = page;
> >>>	}
> >>>}
> >>>@@ -999,6 +1141,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> >>>		return NULL;
> >>>
> >>>	memset(zspage, 0, sizeof(struct zspage));
> >>>+	zspage->magic = ZSPAGE_MAGIC;
> >>>+	migrate_lock_init(zspage);
> >>>
> >>>	for (i = 0; i < class->pages_per_zspage; i++) {
> >>>		struct page *page;
> >>>@@ -1013,7 +1157,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
> >>>		pages[i] = page;
> >>>	}
> >>>
> >>>-	create_page_chain(zspage, pages, class->pages_per_zspage);
> >>>+	create_page_chain(class, zspage, pages);
> >>>	init_zspage(class, zspage);
> >>>
> >>>	return zspage;
> >>>@@ -1024,7 +1168,7 @@ static struct zspage *find_get_zspage(struct size_class *class)
> >>>	int i;
> >>>	struct zspage *zspage;
> >>>
> >>>-	for (i = ZS_ALMOST_FULL; i <= ZS_ALMOST_EMPTY; i++) {
> >>>+	for (i = ZS_ALMOST_FULL; i >= ZS_EMPTY; i--) {
> >>>		zspage = list_first_entry_or_null(&class->fullness_list[i],
> >>>				struct zspage, list);
> >>>		if (zspage)
> >>>@@ -1289,6 +1433,10 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> >>>	obj = handle_to_obj(handle);
> >>>	obj_to_location(obj, &page, &obj_idx);
> >>>	zspage = get_zspage(page);
> >>>+
> >>>+	/* migration cannot move any subpage in this zspage */
> >>>+	migrate_read_lock(zspage);
> >>>+
> >>>	get_zspage_mapping(zspage, &class_idx, &fg);
> >>>	class = pool->size_class[class_idx];
> >>>	off = (class->size * obj_idx) & ~PAGE_MASK;
> >>>@@ -1309,7 +1457,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
> >>>
> >>>	ret = __zs_map_object(area, pages, off, class->size);
> >>>out:
> >>>-	if (!class->huge)
> >>>+	if (likely(!PageHugeObject(page)))
> >>>		ret += ZS_HANDLE_SIZE;
> >>>
> >>>	return ret;
> >>>@@ -1348,6 +1496,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
> >>>		__zs_unmap_object(area, pages, off, class->size);
> >>>	}
> >>>	put_cpu_var(zs_map_area);
> >>>+
> >>>+	migrate_read_unlock(zspage);
> >>>	unpin_tag(handle);
> >>>}
> >>>EXPORT_SYMBOL_GPL(zs_unmap_object);
> >>>@@ -1377,7 +1527,7 @@ static unsigned long obj_malloc(struct size_class *class,
> >>>	vaddr = kmap_atomic(m_page);
> >>>	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
> >>>	set_freeobj(zspage, link->next >> OBJ_ALLOCATED_TAG);
> >>>-	if (!class->huge)
> >>>+	if (likely(!PageHugeObject(m_page)))
> >>>		/* record handle in the header of allocated chunk */
> >>>		link->handle = handle;
> >>>	else
> >>>@@ -1407,6 +1557,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> >>>{
> >>>	unsigned long handle, obj;
> >>>	struct size_class *class;
> >>>+	enum fullness_group newfg;
> >>>	struct zspage *zspage;
> >>>
> >>>	if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
> >>>@@ -1422,28 +1573,37 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
> >>>
> >>>	spin_lock(&class->lock);
> >>>	zspage = find_get_zspage(class);
> >>>-
> >>>-	if (!zspage) {
> >>>+	if (likely(zspage)) {
> >>>+		obj = obj_malloc(class, zspage, handle);
> >>>+		/* Now move the zspage to another fullness group, if required */
> >>>+		fix_fullness_group(class, zspage);
> >>>+		record_obj(handle, obj);
> >>>		spin_unlock(&class->lock);
> >>>-		zspage = alloc_zspage(pool, class, gfp);
> >>>-		if (unlikely(!zspage)) {
> >>>-			cache_free_handle(pool, handle);
> >>>-			return 0;
> >>>-		}
> >>>
> >>>-		set_zspage_mapping(zspage, class->index, ZS_EMPTY);
> >>>-		atomic_long_add(class->pages_per_zspage,
> >>>-					&pool->pages_allocated);
> >>>+		return handle;
> >>>+	}
> >>>
> >>>-		spin_lock(&class->lock);
> >>>-		zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >>>-				class->size, class->pages_per_zspage));
> >>>+	spin_unlock(&class->lock);
> >>>+
> >>>+	zspage = alloc_zspage(pool, class, gfp);
> >>>+	if (!zspage) {
> >>>+		cache_free_handle(pool, handle);
> >>>+		return 0;
> >>>	}
> >>>
> >>>+	spin_lock(&class->lock);
> >>>	obj = obj_malloc(class, zspage, handle);
> >>>-	/* Now move the zspage to another fullness group, if required */
> >>>-	fix_fullness_group(class, zspage);
> >>>+	newfg = get_fullness_group(class, zspage);
> >>>+	insert_zspage(class, zspage, newfg);
> >>>+	set_zspage_mapping(zspage, class->index, newfg);
> >>>	record_obj(handle, obj);
> >>>+	atomic_long_add(class->pages_per_zspage,
> >>>+				&pool->pages_allocated);
> >>>+	zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >>>+			class->size, class->pages_per_zspage));
> >>>+
> >>>+	/* We completely set up zspage so mark them as movable */
> >>>+	SetZsPageMovable(pool, zspage);
> >>>	spin_unlock(&class->lock);
> >>>
> >>>	return handle;
> >>>@@ -1484,6 +1644,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> >>>	int class_idx;
> >>>	struct size_class *class;
> >>>	enum fullness_group fullness;
> >>>+	bool isolated;
> >>>
> >>>	if (unlikely(!handle))
> >>>		return;
> >>>@@ -1493,22 +1654,28 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
> >>>	obj_to_location(obj, &f_page, &f_objidx);
> >>>	zspage = get_zspage(f_page);
> >>>
> >>>+	migrate_read_lock(zspage);
> >>>+
> >>>	get_zspage_mapping(zspage, &class_idx, &fullness);
> >>>	class = pool->size_class[class_idx];
> >>>
> >>>	spin_lock(&class->lock);
> >>>	obj_free(class, obj);
> >>>	fullness = fix_fullness_group(class, zspage);
> >>>-	if (fullness == ZS_EMPTY) {
> >>>-		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >>>-				class->size, class->pages_per_zspage));
> >>>-		atomic_long_sub(class->pages_per_zspage,
> >>>-				&pool->pages_allocated);
> >>>-		free_zspage(pool, zspage);
> >>>+	if (fullness != ZS_EMPTY) {
> >>>+		migrate_read_unlock(zspage);
> >>>+		goto out;
> >>>	}
> >>>+
> >>>+	isolated = is_zspage_isolated(zspage);
> >>>+	migrate_read_unlock(zspage);
> >>>+	/* If zspage is isolated, zs_page_putback will free the zspage */
> >>>+	if (likely(!isolated))
> >>>+		free_zspage(pool, class, zspage);
> >>>+out:
> >>>+
> >>>	spin_unlock(&class->lock);
> >>>	unpin_tag(handle);
> >>>-
> >>>	cache_free_handle(pool, handle);
> >>>}
> >>>EXPORT_SYMBOL_GPL(zs_free);
> >>>@@ -1587,12 +1754,13 @@ static unsigned long find_alloced_obj(struct size_class *class,
> >>>	int offset = 0;
> >>>	unsigned long handle = 0;
> >>>	void *addr = kmap_atomic(page);
> >>>+	struct zspage *zspage = get_zspage(page);
> >>>
> >>>-	offset = get_first_obj_offset(page);
> >>>+	offset = get_first_obj_offset(class, get_first_page(zspage), page);
> >>>	offset += class->size * index;
> >>>
> >>>	while (offset < PAGE_SIZE) {
> >>>-		head = obj_to_head(class, page, addr + offset);
> >>>+		head = obj_to_head(page, addr + offset);
> >>>		if (head & OBJ_ALLOCATED_TAG) {
> >>>			handle = head & ~OBJ_ALLOCATED_TAG;
> >>>			if (trypin_tag(handle))
> >>>@@ -1684,6 +1852,7 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
> >>>		zspage = list_first_entry_or_null(&class->fullness_list[fg[i]],
> >>>							struct zspage, list);
> >>>		if (zspage) {
> >>>+			VM_BUG_ON(is_zspage_isolated(zspage));
> >>>			remove_zspage(class, zspage, fg[i]);
> >>>			return zspage;
> >>>		}
> >>>@@ -1704,6 +1873,8 @@ static enum fullness_group putback_zspage(struct size_class *class,
> >>>{
> >>>	enum fullness_group fullness;
> >>>
> >>>+	VM_BUG_ON(is_zspage_isolated(zspage));
> >>>+
> >>>	fullness = get_fullness_group(class, zspage);
> >>>	insert_zspage(class, zspage, fullness);
> >>>	set_zspage_mapping(zspage, class->index, fullness);
> >>>@@ -1711,6 +1882,377 @@ static enum fullness_group putback_zspage(struct size_class *class,
> >>>	return fullness;
> >>>}
> >>>
> >>>+#ifdef CONFIG_COMPACTION
> >>>+static struct dentry *zs_mount(struct file_system_type *fs_type,
> >>>+				int flags, const char *dev_name, void *data)
> >>>+{
> >>>+	static const struct dentry_operations ops = {
> >>>+		.d_dname = simple_dname,
> >>>+	};
> >>>+
> >>>+	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
> >>>+}
> >>>+
> >>>+static struct file_system_type zsmalloc_fs = {
> >>>+	.name		= "zsmalloc",
> >>>+	.mount		= zs_mount,
> >>>+	.kill_sb	= kill_anon_super,
> >>>+};
> >>>+
> >>>+static int zsmalloc_mount(void)
> >>>+{
> >>>+	int ret = 0;
> >>>+
> >>>+	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
> >>>+	if (IS_ERR(zsmalloc_mnt))
> >>>+		ret = PTR_ERR(zsmalloc_mnt);
> >>>+
> >>>+	return ret;
> >>>+}
> >>>+
> >>>+static void zsmalloc_unmount(void)
> >>>+{
> >>>+	kern_unmount(zsmalloc_mnt);
> >>>+}
> >>>+
> >>>+static void migrate_lock_init(struct zspage *zspage)
> >>>+{
> >>>+	rwlock_init(&zspage->lock);
> >>>+}
> >>>+
> >>>+static void migrate_read_lock(struct zspage *zspage)
> >>>+{
> >>>+	read_lock(&zspage->lock);
> >>>+}
> >>>+
> >>>+static void migrate_read_unlock(struct zspage *zspage)
> >>>+{
> >>>+	read_unlock(&zspage->lock);
> >>>+}
> >>>+
> >>>+static void migrate_write_lock(struct zspage *zspage)
> >>>+{
> >>>+	write_lock(&zspage->lock);
> >>>+}
> >>>+
> >>>+static void migrate_write_unlock(struct zspage *zspage)
> >>>+{
> >>>+	write_unlock(&zspage->lock);
> >>>+}
> >>>+
> >>>+/* Number of isolated subpage for *page migration* in this zspage */
> >>>+static void inc_zspage_isolation(struct zspage *zspage)
> >>>+{
> >>>+	zspage->isolated++;
> >>>+}
> >>>+
> >>>+static void dec_zspage_isolation(struct zspage *zspage)
> >>>+{
> >>>+	zspage->isolated--;
> >>>+}
> >>>+
> >>>+static void replace_sub_page(struct size_class *class, struct zspage *zspage,
> >>>+				struct page *newpage, struct page *oldpage)
> >>>+{
> >>>+	struct page *page;
> >>>+	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL, };
> >>>+	int idx = 0;
> >>>+
> >>>+	page = get_first_page(zspage);
> >>>+	do {
> >>>+		if (page == oldpage)
> >>>+			pages[idx] = newpage;
> >>>+		else
> >>>+			pages[idx] = page;
> >>>+		idx++;
> >>>+	} while ((page = get_next_page(page)) != NULL);
> >>>+
> >>>+	create_page_chain(class, zspage, pages);
> >>>+	if (unlikely(PageHugeObject(oldpage)))
> >>>+		newpage->index = oldpage->index;
> >>>+	__SetPageMovable(newpage, page_mapping(oldpage));
> >>>+}
> >>>+
> >>>+bool zs_page_isolate(struct page *page, isolate_mode_t mode)
> >>>+{
> >>>+	struct zs_pool *pool;
> >>>+	struct size_class *class;
> >>>+	int class_idx;
> >>>+	enum fullness_group fullness;
> >>>+	struct zspage *zspage;
> >>>+	struct address_space *mapping;
> >>>+
> >>>+	/*
> >>>+	 * Page is locked so zspage couldn't be destroyed. For detail, look at
> >>>+	 * lock_zspage in free_zspage.
> >>>+	 */
> >>>+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >>>+	VM_BUG_ON_PAGE(PageIsolated(page), page);
> >>>+
> >>>+	zspage = get_zspage(page);
> >>>+
> >>>+	/*
> >>>+	 * Without class lock, fullness could be stale while class_idx is okay
> >>>+	 * because class_idx is constant unless page is freed so we should get
> >>>+	 * fullness again under class lock.
> >>>+	 */
> >>>+	get_zspage_mapping(zspage, &class_idx, &fullness);
> >>>+	mapping = page_mapping(page);
> >>>+	pool = mapping->private_data;
> >>>+	class = pool->size_class[class_idx];
> >>>+
> >>>+	spin_lock(&class->lock);
> >>>+	if (get_zspage_inuse(zspage) == 0) {
> >>>+		spin_unlock(&class->lock);
> >>>+		return false;
> >>>+	}
> >>>+
> >>>+	/* zspage is isolated for object migration */
> >>>+	if (list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
> >>>+		spin_unlock(&class->lock);
> >>>+		return false;
> >>>+	}
> >>>+
> >>>+	/*
> >>>+	 * If this is first time isolation for the zspage, isolate zspage from
> >>>+	 * size_class to prevent further object allocation from the zspage.
> >>>+	 */
> >>>+	if (!list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
> >>>+		get_zspage_mapping(zspage, &class_idx, &fullness);
> >>>+		remove_zspage(class, zspage, fullness);
> >>>+	}
> >>>+
> >>>+	inc_zspage_isolation(zspage);
> >>>+	spin_unlock(&class->lock);
> >>>+
> >>>+	return true;
> >>>+}
> >>>+
> >>>+int zs_page_migrate(struct address_space *mapping, struct page *newpage,
> >>>+		struct page *page, enum migrate_mode mode)
> >>>+{
> >>>+	struct zs_pool *pool;
> >>>+	struct size_class *class;
> >>>+	int class_idx;
> >>>+	enum fullness_group fullness;
> >>>+	struct zspage *zspage;
> >>>+	struct page *dummy;
> >>>+	void *s_addr, *d_addr, *addr;
> >>>+	int offset, pos;
> >>>+	unsigned long handle, head;
> >>>+	unsigned long old_obj, new_obj;
> >>>+	unsigned int obj_idx;
> >>>+	int ret = -EAGAIN;
> >>>+
> >>>+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >>>+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> >>>+
> >>>+	zspage = get_zspage(page);
> >>>+
> >>>+	/* Concurrent compactor cannot migrate any subpage in zspage */
> >>>+	migrate_write_lock(zspage);
> >>>+	get_zspage_mapping(zspage, &class_idx, &fullness);
> >>>+	pool = mapping->private_data;
> >>>+	class = pool->size_class[class_idx];
> >>>+	offset = get_first_obj_offset(class, get_first_page(zspage), page);
> >>>+
> >>>+	spin_lock(&class->lock);
> >>>+	if (!get_zspage_inuse(zspage)) {
> >>>+		ret = -EBUSY;
> >>>+		goto unlock_class;
> >>>+	}
> >>>+
> >>>+	pos = offset;
> >>>+	s_addr = kmap_atomic(page);
> >>>+	while (pos < PAGE_SIZE) {
> >>>+		head = obj_to_head(page, s_addr + pos);
> >>>+		if (head & OBJ_ALLOCATED_TAG) {
> >>>+			handle = head & ~OBJ_ALLOCATED_TAG;
> >>>+			if (!trypin_tag(handle))
> >>>+				goto unpin_objects;
> >>>+		}
> >>>+		pos += class->size;
> >>>+	}
> >>>+
> >>>+	/*
> >>>+	 * Here, any user cannot access all objects in the zspage so let's move.
> >>>+	 */
> >>>+	d_addr = kmap_atomic(newpage);
> >>>+	memcpy(d_addr, s_addr, PAGE_SIZE);
> >>>+	kunmap_atomic(d_addr);
> >>>+
> >>>+	for (addr = s_addr + offset; addr < s_addr + pos;
> >>>+					addr += class->size) {
> >>>+		head = obj_to_head(page, addr);
> >>>+		if (head & OBJ_ALLOCATED_TAG) {
> >>>+			handle = head & ~OBJ_ALLOCATED_TAG;
> >>>+			if (!testpin_tag(handle))
> >>>+				BUG();
> >>>+
> >>>+			old_obj = handle_to_obj(handle);
> >>>+			obj_to_location(old_obj, &dummy, &obj_idx);
> >>>+			new_obj = (unsigned long)location_to_obj(newpage,
> >>>+								obj_idx);
> >>>+			new_obj |= BIT(HANDLE_PIN_BIT);
> >>>+			record_obj(handle, new_obj);
> >>>+		}
> >>>+	}
> >>>+
> >>>+	replace_sub_page(class, zspage, newpage, page);
> >>>+	get_page(newpage);
> >>>+
> >>>+	dec_zspage_isolation(zspage);
> >>>+
> >>>+	/*
> >>>+	 * Page migration is done so let's putback isolated zspage to
> >>>+	 * the list if @page is final isolated subpage in the zspage.
> >>>+	 */
> >>>+	if (!is_zspage_isolated(zspage))
> >>>+		putback_zspage(class, zspage);
> >>>+
> >>>+	reset_page(page);
> >>>+	put_page(page);
> >>>+	page = newpage;
> >>>+
> >>>+	ret = 0;
> >>>+unpin_objects:
> >>>+	for (addr = s_addr + offset; addr < s_addr + pos;
> >>>+						addr += class->size) {
> >>>+		head = obj_to_head(page, addr);
> >>>+		if (head & OBJ_ALLOCATED_TAG) {
> >>>+			handle = head & ~OBJ_ALLOCATED_TAG;
> >>>+			if (!testpin_tag(handle))
> >>>+				BUG();
> >>>+			unpin_tag(handle);
> >>>+		}
> >>>+	}
> >>>+	kunmap_atomic(s_addr);
> >>>+unlock_class:
> >>>+	spin_unlock(&class->lock);
> >>>+	migrate_write_unlock(zspage);
> >>>+
> >>>+	return ret;
> >>>+}
> >>>+
> >>>+void zs_page_putback(struct page *page)
> >>>+{
> >>>+	struct zs_pool *pool;
> >>>+	struct size_class *class;
> >>>+	int class_idx;
> >>>+	enum fullness_group fg;
> >>>+	struct address_space *mapping;
> >>>+	struct zspage *zspage;
> >>>+
> >>>+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >>>+	VM_BUG_ON_PAGE(!PageIsolated(page), page);
> >>>+
> >>>+	zspage = get_zspage(page);
> >>>+	get_zspage_mapping(zspage, &class_idx, &fg);
> >>>+	mapping = page_mapping(page);
> >>>+	pool = mapping->private_data;
> >>>+	class = pool->size_class[class_idx];
> >>>+
> >>>+	spin_lock(&class->lock);
> >>>+	dec_zspage_isolation(zspage);
> >>>+	if (!is_zspage_isolated(zspage)) {
> >>>+		fg = putback_zspage(class, zspage);
> >>>+		/*
> >>>+		 * Due to page_lock, we cannot free zspage immediately
> >>>+		 * so let's defer.
> >>>+		 */
> >>>+		if (fg == ZS_EMPTY)
> >>>+			schedule_work(&pool->free_work);
> >>>+	}
> >>>+	spin_unlock(&class->lock);
> >>>+}
> >>>+
> >>>+const struct address_space_operations zsmalloc_aops = {
> >>>+	.isolate_page = zs_page_isolate,
> >>>+	.migratepage = zs_page_migrate,
> >>>+	.putback_page = zs_page_putback,
> >>>+};
> >>>+
> >>>+static int zs_register_migration(struct zs_pool *pool)
> >>>+{
> >>>+	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
> >>>+	if (IS_ERR(pool->inode)) {
> >>>+		pool->inode = NULL;
> >>>+		return 1;
> >>>+	}
> >>>+
> >>>+	pool->inode->i_mapping->private_data = pool;
> >>>+	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
> >>>+	return 0;
> >>>+}
> >>>+
> >>>+static void zs_unregister_migration(struct zs_pool *pool)
> >>>+{
> >>>+	flush_work(&pool->free_work);
> >>>+	if (pool->inode)
> >>>+		iput(pool->inode);
> >>>+}
> >>>+
> >>>+/*
> >>>+ * Caller should hold page_lock of all pages in the zspage
> >>>+ * In here, we cannot use zspage meta data.
> >>>+ */
> >>>+static void async_free_zspage(struct work_struct *work)
> >>>+{
> >>>+	int i;
> >>>+	struct size_class *class;
> >>>+	unsigned int class_idx;
> >>>+	enum fullness_group fullness;
> >>>+	struct zspage *zspage, *tmp;
> >>>+	LIST_HEAD(free_pages);
> >>>+	struct zs_pool *pool = container_of(work, struct zs_pool,
> >>>+					free_work);
> >>>+
> >>>+	for (i = 0; i < zs_size_classes; i++) {
> >>>+		class = pool->size_class[i];
> >>>+		if (class->index != i)
> >>>+			continue;
> >>>+
> >>>+		spin_lock(&class->lock);
> >>>+		list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages);
> >>>+		spin_unlock(&class->lock);
> >>>+	}
> >>>+
> >>>+
> >>>+	list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
> >>>+		list_del(&zspage->list);
> >>>+		lock_zspage(zspage);
> >>>+
> >>>+		get_zspage_mapping(zspage, &class_idx, &fullness);
> >>>+		VM_BUG_ON(fullness != ZS_EMPTY);
> >>>+		class = pool->size_class[class_idx];
> >>>+		spin_lock(&class->lock);
> >>>+		__free_zspage(pool, pool->size_class[class_idx], zspage);
> >>>+		spin_unlock(&class->lock);
> >>>+	}
> >>>+};
> >>>+
> >>>+static void kick_deferred_free(struct zs_pool *pool)
> >>>+{
> >>>+	schedule_work(&pool->free_work);
> >>>+}
> >>>+
> >>>+static void init_deferred_free(struct zs_pool *pool)
> >>>+{
> >>>+	INIT_WORK(&pool->free_work, async_free_zspage);
> >>>+}
> >>>+
> >>>+static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage)
> >>>+{
> >>>+	struct page *page = get_first_page(zspage);
> >>>+
> >>>+	do {
> >>>+		WARN_ON(!trylock_page(page));
> >>>+		__SetPageMovable(page, pool->inode->i_mapping);
> >>>+		unlock_page(page);
> >>>+	} while ((page = get_next_page(page)) != NULL);
> >>>+}
> >>>+#endif
> >>>+
> >>>/*
> >>> *
> >>> * Based on the number of unused allocated objects calculate
> >>>@@ -1745,10 +2287,10 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
> >>>			break;
> >>>
> >>>		cc.index = 0;
> >>>-		cc.s_page = src_zspage->first_page;
> >>>+		cc.s_page = get_first_page(src_zspage);
> >>>
> >>>		while ((dst_zspage = isolate_zspage(class, false))) {
> >>>-			cc.d_page = dst_zspage->first_page;
> >>>+			cc.d_page = get_first_page(dst_zspage);
> >>>			/*
> >>>			 * If there is no more space in dst_page, resched
> >>>			 * and see if anyone had allocated another zspage.
> >>>@@ -1765,11 +2307,7 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
> >>>
> >>>		putback_zspage(class, dst_zspage);
> >>>		if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
> >>>-			zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
> >>>-					class->size, class->pages_per_zspage));
> >>>-			atomic_long_sub(class->pages_per_zspage,
> >>>-					&pool->pages_allocated);
> >>>-			free_zspage(pool, src_zspage);
> >>>+			free_zspage(pool, class, src_zspage);
> >>>			pool->stats.pages_compacted += class->pages_per_zspage;
> >>>		}
> >>>		spin_unlock(&class->lock);
> >>>@@ -1885,6 +2423,7 @@ struct zs_pool *zs_create_pool(const char *name)
> >>>	if (!pool)
> >>>		return NULL;
> >>>
> >>>+	init_deferred_free(pool);
> >>>	pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *),
> >>>			GFP_KERNEL);
> >>>	if (!pool->size_class) {
> >>>@@ -1939,12 +2478,10 @@ struct zs_pool *zs_create_pool(const char *name)
> >>>		class->pages_per_zspage = pages_per_zspage;
> >>>		class->objs_per_zspage = class->pages_per_zspage *
> >>>						PAGE_SIZE / class->size;
> >>>-		if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
> >>>-			class->huge = true;
> >>>		spin_lock_init(&class->lock);
> >>>		pool->size_class[i] = class;
> >>>-		for (fullness = ZS_ALMOST_FULL; fullness <= ZS_ALMOST_EMPTY;
> >>>-								fullness++)
> >>>+		for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS;
> >>>+							fullness++)
> >>>			INIT_LIST_HEAD(&class->fullness_list[fullness]);
> >>>
> >>>		prev_class = class;
> >>>@@ -1953,6 +2490,9 @@ struct zs_pool *zs_create_pool(const char *name)
> >>>	/* debug only, don't abort if it fails */
> >>>	zs_pool_stat_create(pool, name);
> >>>
> >>>+	if (zs_register_migration(pool))
> >>>+		goto err;
> >>>+
> >>>	/*
> >>>	 * Not critical, we still can use the pool
> >>>	 * and user can trigger compaction manually.
> >>>@@ -1972,6 +2512,7 @@ void zs_destroy_pool(struct zs_pool *pool)
> >>>	int i;
> >>>
> >>>	zs_unregister_shrinker(pool);
> >>>+	zs_unregister_migration(pool);
> >>>	zs_pool_stat_destroy(pool);
> >>>
> >>>	for (i = 0; i < zs_size_classes; i++) {
> >>>@@ -1984,7 +2525,7 @@ void zs_destroy_pool(struct zs_pool *pool)
> >>>		if (class->index != i)
> >>>			continue;
> >>>
> >>>-		for (fg = ZS_ALMOST_FULL; fg <= ZS_ALMOST_EMPTY; fg++) {
> >>>+		for (fg = ZS_EMPTY; fg < NR_ZS_FULLNESS; fg++) {
> >>>			if (!list_empty(&class->fullness_list[fg])) {
> >>>				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
> >>>					class->size, fg);
> >>>@@ -2002,7 +2543,13 @@ EXPORT_SYMBOL_GPL(zs_destroy_pool);
> >>>
> >>>static int __init zs_init(void)
> >>>{
> >>>-	int ret = zs_register_cpu_notifier();
> >>>+	int ret;
> >>>+
> >>>+	ret = zsmalloc_mount();
> >>>+	if (ret)
> >>>+		goto out;
> >>>+
> >>>+	ret = zs_register_cpu_notifier();
> >>>
> >>>	if (ret)
> >>>		goto notifier_fail;
> >>>@@ -2019,7 +2566,8 @@ static int __init zs_init(void)
> >>>
> >>>notifier_fail:
> >>>	zs_unregister_cpu_notifier();
> >>>-
> >>>+	zsmalloc_unmount();
> >>>+out:
> >>>	return ret;
> >>>}
> >>>
> >>>@@ -2028,6 +2576,7 @@ static void __exit zs_exit(void)
> >>>#ifdef CONFIG_ZPOOL
> >>>	zpool_unregister_driver(&zs_zpool_driver);
> >>>#endif
> >>>+	zsmalloc_unmount();
> >>>	zs_unregister_cpu_notifier();
> >>>
> >>>	zs_stat_exit();
> >>>
> >>

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-19  6:21           ` Minchan Kim
@ 2017-01-19  8:16             ` Chulmin Kim
  2017-01-23  5:22               ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Chulmin Kim @ 2017-01-19  8:16 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Andrew Morton, linux-mm, Sergey Senozhatsky

On 01/19/2017 01:21 AM, Minchan Kim wrote:
> On Wed, Jan 18, 2017 at 10:39:15PM -0500, Chulmin Kim wrote:
>> On 01/18/2017 09:44 PM, Minchan Kim wrote:
>>> Hello Chulmin,
>>>
>>> On Wed, Jan 18, 2017 at 07:13:21PM -0500, Chulmin Kim wrote:
>>>> Hello Minchan, and all zsmalloc guys.
>>>>
>>>> I have a quick question.
>>>> Is zsmalloc handling memory barriers correctly?
>>>>
>>>> AFAIK, on ARM64,
>>>> zsmalloc relies only on the dmb operation in bit_spin_unlock.
>>>> (It seems that dmb operations in the spinlock functions are being prepared,
>>>> but let us set that aside since it is not merged yet.)
>>>>
>>>> If I am correct,
>>>> migrating a page in a zspage filled with free objs
>>>> may cause corruption because bit_spin_unlock will not be executed at all.
>>>>
>>>> I am not sure this is a sufficient memory barrier for zsmalloc operations.
>>>>
>>>> Can you enlighten me?
>>>
>>> Do you mean bit_spin_unlock is broken, or that the zsmalloc locking scheme is broken?
>>> Could you please describe what you are concerned about in detail?
>>> It would be very helpful if you explained it with an example!
>>
>> Sorry for ambiguous expressions. :)
>>
>> Recently,
>> I found multiple zsmalloc corruption cases which have garbage idx values
>> in zspage->freeobj (not the ffffffff (-1) value).
>>
>> Honestly, I have no clue yet.
>>
>> I suspect the case where we migrate a zs subpage filled with free
>> objects (so that unpin_tag(), which has a memory barrier, is never called).
>>
>>
>> Assume the page (zs subpage) being migrated has no allocated zs object.
>>
>> S : zs subpage
>> D : free page
>>
>>
>> CPU A : zs_page_migrate()		CPU B : zs_malloc()
>> ---------------------			-----------------------------
>>
>>
>> migrate_write_lock()
>> spin_lock()
>>
>> memcpy(D, S, PAGE_SIZE)   -> (1)
>> replace_sub_page()
>>
>> putback_zspage()
>> spin_unlock()
>> migrate_write_unlock()
>> 					
>> 					spin_lock()
>> 					obj_malloc()
>> 					--> (2-a) allocate obj in D
>> 					--> (2-b) set freeobj using
>>      						the first 8 bytes of
>>  						the allocated obj
>> 					record_obj()
>> 					spin_unlock
>>
>>
>>
>> I think the locking itself has no problem, but I am unsure about the memory ordering.
>> I doubt whether (2-b) on CPU B really loads the data stored by (1).
>>
>> If it doesn't, set_freeobj in (2-b) will corrupt zspage->freeobj.
>> After that, we will see a corrupted object sooner or later.
>
> Thanks for the example.
> However, I cannot understand what you are pointing out.
>
> In the above example, the two CPUs use the same class spin_lock, so the store
> done by memcpy in the critical section should be visible to CPU B.
>
> Am I missing your point?


No, you are right.
I pointed this out prematurely after only checking that arm64's spinlock
does not seem to issue a "dmb" operation explicitly.
I am the one who missed the basics.
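
For reference, here is a minimal userspace sketch of that guarantee (an
illustration only, assuming generic acquire/release lock semantics; the
pthread mutex and the names below are hypothetical stand-ins for
class->lock and the zspage payload, not zsmalloc code):

/*
 * Sketch: a pthread mutex stands in for class->lock; the byte buffer
 * stands in for the page that zs_page_migrate() memcpy()s. The point:
 * a store made inside A's critical section is visible to B after B
 * acquires the same lock, because unlock is a release and lock is an
 * acquire.
 */
#include <assert.h>
#include <pthread.h>
#include <string.h>

static pthread_mutex_t class_lock = PTHREAD_MUTEX_INITIALIZER;
static char dst_page[4096];		/* stands in for the new page D */

static void *cpu_a_migrate(void *arg)	/* zs_page_migrate() analogue */
{
	char src_page[4096];

	(void)arg;
	memset(src_page, 0x5a, sizeof(src_page));

	pthread_mutex_lock(&class_lock);
	memcpy(dst_page, src_page, sizeof(dst_page));	/* step (1) */
	pthread_mutex_unlock(&class_lock);		/* release */
	return NULL;
}

static void *cpu_b_malloc(void *arg)	/* zs_malloc()/obj_malloc() analogue */
{
	(void)arg;
	pthread_mutex_lock(&class_lock);		/* acquire */
	/*
	 * If A's critical section already ran, its memcpy() is fully
	 * visible here: the buffer is either still all-zero or all 0x5a,
	 * never a torn/garbage mix.
	 */
	if (dst_page[0] != 0)
		assert(dst_page[0] == 0x5a);
	pthread_mutex_unlock(&class_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, cpu_a_migrate, NULL);
	pthread_create(&b, NULL, cpu_b_malloc, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	return 0;
}

The kernel case follows the same reasoning: spin_unlock() is a RELEASE and
spin_lock() is an ACQUIRE, so the memcpy() done under class->lock in
zs_page_migrate() is ordered before any load obj_malloc() performs after
taking the same class->lock.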

Anyway, I will let you know the situation when it becomes clearer.

Thanks!






>
>>
>>
>> According to the below link,
>> (https://patchwork.kernel.org/patch/9313493/)
>> the spin lock on a specific arch (maybe arm64) does not seem to guarantee
>> memory ordering.
>
> IMHO, it's not related to this issue.
> It would only matter when data is updated without a formal locking scheme.
>
> Thanks.
>
>>
>> ===
>> +/*
>> + * Accesses appearing in program order before a spin_lock() operation
>> + * can be reordered with accesses inside the critical section, by virtue
>> + * of arch_spin_lock being constructed using acquire semantics.
>> + *
>> + * In cases where this is problematic (e.g. try_to_wake_up), an
>> + * smp_mb__before_spinlock() can restore the required ordering.
>> + */
>> +#define smp_mb__before_spinlock()	smp_mb()
>> ===
>>
>>
>>
>> Thanks.
>> Chulmin Kim
>>
>>
>>
>>
>>
>>>
>>> Thanks.
>>>
>>>>
>>>>
>>>> Thanks!
>>>> Chulmin Kim
>>>>
>>>>
>>>>
>>>> On 05/31/2016 07:21 PM, Minchan Kim wrote:
>>>>> This patch introduces run-time migration feature for zspage.
>>>>>
>>>>> For migration, the VM uses the page.lru field, so it would be better not to
>>>>> use the page.next field, which is unified with page.lru, for our own purpose.
>>>>> For that, first, we can get the first object offset of a page via a runtime
>>>>> calculation instead of using page.index, so we can use page.index as the
>>>>> link for page chaining instead of page.next.
>>>>>
>>>>> In the case of a huge object, the handle is stored in page.index instead of
>>>>> the next link of the page chain, because a huge object doesn't need a next
>>>>> link for page chaining. So get_next_page needs to identify a huge
>>>>> object and return NULL for it. For that, this patch uses the PG_owner_priv_1
>>>>> page flag.
>>>>>
>>>>> For migration, it supports three functions
>>>>>
>>>>> * zs_page_isolate
>>>>>
>>>>> It isolates a zspage which includes a subpage the VM wants to migrate
>>>>> from the class so that no one can allocate a new object from the zspage.
>>>>>
>>>>> We could try to isolate a zspage as many times as it has subpages, so a
>>>>> subsequent isolation attempt for another subpage of the zspage shouldn't
>>>>> fail. For that, we introduce a zspage.isolated count. With that,
>>>>> zs_page_isolate can know whether the zspage is already isolated for
>>>>> migration; if it is, a subsequent isolation attempt can succeed without
>>>>> trying further isolation.
>>>>>
>>>>> * zs_page_migrate
>>>>>
>>>>> First of all, it holds the write-side zspage->lock to prevent migration of
>>>>> other subpages in the zspage. Then, it locks all objects in the page the VM
>>>>> wants to migrate. The reason we should lock all objects in the page is the
>>>>> race between zs_map_object and zs_page_migrate.
>>>>>
>>>>> zs_map_object				zs_page_migrate
>>>>>
>>>>> pin_tag(handle)
>>>>> obj = handle_to_obj(handle)
>>>>> obj_to_location(obj, &page, &obj_idx);
>>>>>
>>>>> 					write_lock(&zspage->lock)
>>>>> 					if (!trypin_tag(handle))
>>>>> 						goto unpin_object
>>>>>
>>>>> zspage = get_zspage(page);
>>>>> read_lock(&zspage->lock);
>>>>>
>>>>> If zs_page_migrate didn't do trypin_tag, zs_map_object's page could
>>>>> become stale due to migration, and it would crash.
>>>>>
>>>>> If it locks all of the objects successfully, it copies the content from the
>>>>> old page to the new one and, finally, creates a new zspage chain with the
>>>>> new page. And if it's the last isolated subpage in the zspage, it puts the
>>>>> zspage back into the class.
>>>>>
>>>>> * zs_page_putback
>>>>>
>>>>> It returns an isolated zspage to the right fullness_group list if it fails to
>>>>> migrate a page. If it finds that a zspage is ZS_EMPTY, it queues the zspage
>>>>> freeing to a workqueue. See below about async zspage freeing.
>>>>>
>>>>> This patch introduces asynchronous zspage freeing. The reason we need it
>>>>> is that we need page_lock to clear PG_movable but, unfortunately, the
>>>>> zs_free path should be atomic, so the approach is to try to grab the page_lock.
>>>>> If it gets the page_lock of all pages successfully, it can free the zspage
>>>>> immediately. Otherwise, it queues a free request and frees the zspage via a
>>>>> workqueue in process context.
>>>>>
>>>>> If zs_free finds that the zspage is isolated when it tries to free it,
>>>>> it delays the freeing until zs_page_putback finds it and finally frees
>>>>> the zspage.
>>>>>
>>>>> In this patch, we expand the fullness_list from ZS_EMPTY to ZS_FULL.
>>>>> First of all, the ZS_EMPTY list will be used for delayed freeing.
>>>>> And with the added ZS_FULL list, we can identify whether a zspage is
>>>>> isolated or not via the list_empty(&zspage->list) test.
>>>>>
>>>>> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
>>>>> Signed-off-by: Minchan Kim <minchan@kernel.org>
>>>>> ---
>>>>> include/uapi/linux/magic.h |   1 +
>>>>> mm/zsmalloc.c              | 793 ++++++++++++++++++++++++++++++++++++++-------
>>>>> 2 files changed, 672 insertions(+), 122 deletions(-)
>>>>>
>>>>> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
>>>>> index d829ce63529d..e398beac67b8 100644
>>>>> --- a/include/uapi/linux/magic.h
>>>>> +++ b/include/uapi/linux/magic.h
>>>>> @@ -81,5 +81,6 @@
>>>>> /* Since UDF 2.01 is ISO 13346 based... */
>>>>> #define UDF_SUPER_MAGIC		0x15013346
>>>>> #define BALLOON_KVM_MAGIC	0x13661366
>>>>> +#define ZSMALLOC_MAGIC		0x58295829
>>>>>
>>>>> #endif /* __LINUX_MAGIC_H__ */
>>>>> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
>>>>> index c6fb543cfb98..a80100db16d6 100644
>>>>> --- a/mm/zsmalloc.c
>>>>> +++ b/mm/zsmalloc.c
>>>>> @@ -17,14 +17,14 @@
>>>>> *
>>>>> * Usage of struct page fields:
>>>>> *	page->private: points to zspage
>>>>> - *	page->index: offset of the first object starting in this page.
>>>>> - *		For the first page, this is always 0, so we use this field
>>>>> - *		to store handle for huge object.
>>>>> - *	page->next: links together all component pages of a zspage
>>>>> + *	page->freelist(index): links together all component pages of a zspage
>>>>> + *		For the huge page, this is always 0, so we use this field
>>>>> + *		to store handle.
>>>>> *
>>>>> * Usage of struct page flags:
>>>>> *	PG_private: identifies the first component page
>>>>> *	PG_private2: identifies the last component page
>>>>> + *	PG_owner_priv_1: indentifies the huge component page
>>>>> *
>>>>> */
>>>>>
>>>>> @@ -49,6 +49,11 @@
>>>>> #include <linux/debugfs.h>
>>>>> #include <linux/zsmalloc.h>
>>>>> #include <linux/zpool.h>
>>>>> +#include <linux/mount.h>
>>>>> +#include <linux/compaction.h>
>>>>> +#include <linux/pagemap.h>
>>>>> +
>>>>> +#define ZSPAGE_MAGIC	0x58
>>>>>
>>>>> /*
>>>>> * This must be power of 2 and greater than of equal to sizeof(link_free).
>>>>> @@ -136,25 +141,23 @@
>>>>> * We do not maintain any list for completely empty or full pages
>>>>> */
>>>>> enum fullness_group {
>>>>> -	ZS_ALMOST_FULL,
>>>>> -	ZS_ALMOST_EMPTY,
>>>>> 	ZS_EMPTY,
>>>>> -	ZS_FULL
>>>>> +	ZS_ALMOST_EMPTY,
>>>>> +	ZS_ALMOST_FULL,
>>>>> +	ZS_FULL,
>>>>> +	NR_ZS_FULLNESS,
>>>>> };
>>>>>
>>>>> enum zs_stat_type {
>>>>> +	CLASS_EMPTY,
>>>>> +	CLASS_ALMOST_EMPTY,
>>>>> +	CLASS_ALMOST_FULL,
>>>>> +	CLASS_FULL,
>>>>> 	OBJ_ALLOCATED,
>>>>> 	OBJ_USED,
>>>>> -	CLASS_ALMOST_FULL,
>>>>> -	CLASS_ALMOST_EMPTY,
>>>>> +	NR_ZS_STAT_TYPE,
>>>>> };
>>>>>
>>>>> -#ifdef CONFIG_ZSMALLOC_STAT
>>>>> -#define NR_ZS_STAT_TYPE	(CLASS_ALMOST_EMPTY + 1)
>>>>> -#else
>>>>> -#define NR_ZS_STAT_TYPE	(OBJ_USED + 1)
>>>>> -#endif
>>>>> -
>>>>> struct zs_size_stat {
>>>>> 	unsigned long objs[NR_ZS_STAT_TYPE];
>>>>> };
>>>>> @@ -163,6 +166,10 @@ struct zs_size_stat {
>>>>> static struct dentry *zs_stat_root;
>>>>> #endif
>>>>>
>>>>> +#ifdef CONFIG_COMPACTION
>>>>> +static struct vfsmount *zsmalloc_mnt;
>>>>> +#endif
>>>>> +
>>>>> /*
>>>>> * number of size_classes
>>>>> */
>>>>> @@ -186,23 +193,36 @@ static const int fullness_threshold_frac = 4;
>>>>>
>>>>> struct size_class {
>>>>> 	spinlock_t lock;
>>>>> -	struct list_head fullness_list[2];
>>>>> +	struct list_head fullness_list[NR_ZS_FULLNESS];
>>>>> 	/*
>>>>> 	 * Size of objects stored in this class. Must be multiple
>>>>> 	 * of ZS_ALIGN.
>>>>> 	 */
>>>>> 	int size;
>>>>> 	int objs_per_zspage;
>>>>> -	unsigned int index;
>>>>> -
>>>>> -	struct zs_size_stat stats;
>>>>> -
>>>>> 	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
>>>>> 	int pages_per_zspage;
>>>>> -	/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
>>>>> -	bool huge;
>>>>> +
>>>>> +	unsigned int index;
>>>>> +	struct zs_size_stat stats;
>>>>> };
>>>>>
>>>>> +/* huge object: pages_per_zspage == 1 && maxobj_per_zspage == 1 */
>>>>> +static void SetPageHugeObject(struct page *page)
>>>>> +{
>>>>> +	SetPageOwnerPriv1(page);
>>>>> +}
>>>>> +
>>>>> +static void ClearPageHugeObject(struct page *page)
>>>>> +{
>>>>> +	ClearPageOwnerPriv1(page);
>>>>> +}
>>>>> +
>>>>> +static int PageHugeObject(struct page *page)
>>>>> +{
>>>>> +	return PageOwnerPriv1(page);
>>>>> +}
>>>>> +
>>>>> /*
>>>>> * Placed within free objects to form a singly linked list.
>>>>> * For every zspage, zspage->freeobj gives head of this list.
>>>>> @@ -244,6 +264,10 @@ struct zs_pool {
>>>>> #ifdef CONFIG_ZSMALLOC_STAT
>>>>> 	struct dentry *stat_dentry;
>>>>> #endif
>>>>> +#ifdef CONFIG_COMPACTION
>>>>> +	struct inode *inode;
>>>>> +	struct work_struct free_work;
>>>>> +#endif
>>>>> };
>>>>>
>>>>> /*
>>>>> @@ -252,16 +276,23 @@ struct zs_pool {
>>>>> */
>>>>> #define FULLNESS_BITS	2
>>>>> #define CLASS_BITS	8
>>>>> +#define ISOLATED_BITS	3
>>>>> +#define MAGIC_VAL_BITS	8
>>>>>
>>>>> struct zspage {
>>>>> 	struct {
>>>>> 		unsigned int fullness:FULLNESS_BITS;
>>>>> 		unsigned int class:CLASS_BITS;
>>>>> +		unsigned int isolated:ISOLATED_BITS;
>>>>> +		unsigned int magic:MAGIC_VAL_BITS;
>>>>> 	};
>>>>> 	unsigned int inuse;
>>>>> 	unsigned int freeobj;
>>>>> 	struct page *first_page;
>>>>> 	struct list_head list; /* fullness list */
>>>>> +#ifdef CONFIG_COMPACTION
>>>>> +	rwlock_t lock;
>>>>> +#endif
>>>>> };
>>>>>
>>>>> struct mapping_area {
>>>>> @@ -274,6 +305,28 @@ struct mapping_area {
>>>>> 	enum zs_mapmode vm_mm; /* mapping mode */
>>>>> };
>>>>>
>>>>> +#ifdef CONFIG_COMPACTION
>>>>> +static int zs_register_migration(struct zs_pool *pool);
>>>>> +static void zs_unregister_migration(struct zs_pool *pool);
>>>>> +static void migrate_lock_init(struct zspage *zspage);
>>>>> +static void migrate_read_lock(struct zspage *zspage);
>>>>> +static void migrate_read_unlock(struct zspage *zspage);
>>>>> +static void kick_deferred_free(struct zs_pool *pool);
>>>>> +static void init_deferred_free(struct zs_pool *pool);
>>>>> +static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage);
>>>>> +#else
>>>>> +static int zsmalloc_mount(void) { return 0; }
>>>>> +static void zsmalloc_unmount(void) {}
>>>>> +static int zs_register_migration(struct zs_pool *pool) { return 0; }
>>>>> +static void zs_unregister_migration(struct zs_pool *pool) {}
>>>>> +static void migrate_lock_init(struct zspage *zspage) {}
>>>>> +static void migrate_read_lock(struct zspage *zspage) {}
>>>>> +static void migrate_read_unlock(struct zspage *zspage) {}
>>>>> +static void kick_deferred_free(struct zs_pool *pool) {}
>>>>> +static void init_deferred_free(struct zs_pool *pool) {}
>>>>> +static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage) {}
>>>>> +#endif
>>>>> +
>>>>> static int create_cache(struct zs_pool *pool)
>>>>> {
>>>>> 	pool->handle_cachep = kmem_cache_create("zs_handle", ZS_HANDLE_SIZE,
>>>>> @@ -301,7 +354,7 @@ static void destroy_cache(struct zs_pool *pool)
>>>>> static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
>>>>> {
>>>>> 	return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
>>>>> -			gfp & ~__GFP_HIGHMEM);
>>>>> +			gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
>>>>> }
>>>>>
>>>>> static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
>>>>> @@ -311,7 +364,8 @@ static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
>>>>>
>>>>> static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
>>>>> {
>>>>> -	return kmem_cache_alloc(pool->zspage_cachep, flags & ~__GFP_HIGHMEM);
>>>>> +	return kmem_cache_alloc(pool->zspage_cachep,
>>>>> +			flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
>>>>> };
>>>>>
>>>>> static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
>>>>> @@ -421,11 +475,17 @@ static unsigned int get_maxobj_per_zspage(int size, int pages_per_zspage)
>>>>> /* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
>>>>> static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
>>>>>
>>>>> +static bool is_zspage_isolated(struct zspage *zspage)
>>>>> +{
>>>>> +	return zspage->isolated;
>>>>> +}
>>>>> +
>>>>> static int is_first_page(struct page *page)
>>>>> {
>>>>> 	return PagePrivate(page);
>>>>> }
>>>>>
>>>>> +/* Protected by class->lock */
>>>>> static inline int get_zspage_inuse(struct zspage *zspage)
>>>>> {
>>>>> 	return zspage->inuse;
>>>>> @@ -441,20 +501,12 @@ static inline void mod_zspage_inuse(struct zspage *zspage, int val)
>>>>> 	zspage->inuse += val;
>>>>> }
>>>>>
>>>>> -static inline int get_first_obj_offset(struct page *page)
>>>>> +static inline struct page *get_first_page(struct zspage *zspage)
>>>>> {
>>>>> -	if (is_first_page(page))
>>>>> -		return 0;
>>>>> +	struct page *first_page = zspage->first_page;
>>>>>
>>>>> -	return page->index;
>>>>> -}
>>>>> -
>>>>> -static inline void set_first_obj_offset(struct page *page, int offset)
>>>>> -{
>>>>> -	if (is_first_page(page))
>>>>> -		return;
>>>>> -
>>>>> -	page->index = offset;
>>>>> +	VM_BUG_ON_PAGE(!is_first_page(first_page), first_page);
>>>>> +	return first_page;
>>>>> }
>>>>>
>>>>> static inline unsigned int get_freeobj(struct zspage *zspage)
>>>>> @@ -471,6 +523,8 @@ static void get_zspage_mapping(struct zspage *zspage,
>>>>> 				unsigned int *class_idx,
>>>>> 				enum fullness_group *fullness)
>>>>> {
>>>>> +	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
>>>>> +
>>>>> 	*fullness = zspage->fullness;
>>>>> 	*class_idx = zspage->class;
>>>>> }
>>>>> @@ -504,23 +558,19 @@ static int get_size_class_index(int size)
>>>>> static inline void zs_stat_inc(struct size_class *class,
>>>>> 				enum zs_stat_type type, unsigned long cnt)
>>>>> {
>>>>> -	if (type < NR_ZS_STAT_TYPE)
>>>>> -		class->stats.objs[type] += cnt;
>>>>> +	class->stats.objs[type] += cnt;
>>>>> }
>>>>>
>>>>> static inline void zs_stat_dec(struct size_class *class,
>>>>> 				enum zs_stat_type type, unsigned long cnt)
>>>>> {
>>>>> -	if (type < NR_ZS_STAT_TYPE)
>>>>> -		class->stats.objs[type] -= cnt;
>>>>> +	class->stats.objs[type] -= cnt;
>>>>> }
>>>>>
>>>>> static inline unsigned long zs_stat_get(struct size_class *class,
>>>>> 				enum zs_stat_type type)
>>>>> {
>>>>> -	if (type < NR_ZS_STAT_TYPE)
>>>>> -		return class->stats.objs[type];
>>>>> -	return 0;
>>>>> +	return class->stats.objs[type];
>>>>> }
>>>>>
>>>>> #ifdef CONFIG_ZSMALLOC_STAT
>>>>> @@ -664,6 +714,7 @@ static inline void zs_pool_stat_destroy(struct zs_pool *pool)
>>>>> }
>>>>> #endif
>>>>>
>>>>> +
>>>>> /*
>>>>> * For each size class, zspages are divided into different groups
>>>>> * depending on how "full" they are. This was done so that we could
>>>>> @@ -704,15 +755,9 @@ static void insert_zspage(struct size_class *class,
>>>>> {
>>>>> 	struct zspage *head;
>>>>>
>>>>> -	if (fullness >= ZS_EMPTY)
>>>>> -		return;
>>>>> -
>>>>> +	zs_stat_inc(class, fullness, 1);
>>>>> 	head = list_first_entry_or_null(&class->fullness_list[fullness],
>>>>> 					struct zspage, list);
>>>>> -
>>>>> -	zs_stat_inc(class, fullness == ZS_ALMOST_EMPTY ?
>>>>> -			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
>>>>> -
>>>>> 	/*
>>>>> 	 * We want to see more ZS_FULL pages and less almost empty/full.
>>>>> 	 * Put pages with higher ->inuse first.
>>>>> @@ -734,14 +779,11 @@ static void remove_zspage(struct size_class *class,
>>>>> 				struct zspage *zspage,
>>>>> 				enum fullness_group fullness)
>>>>> {
>>>>> -	if (fullness >= ZS_EMPTY)
>>>>> -		return;
>>>>> -
>>>>> 	VM_BUG_ON(list_empty(&class->fullness_list[fullness]));
>>>>> +	VM_BUG_ON(is_zspage_isolated(zspage));
>>>>>
>>>>> 	list_del_init(&zspage->list);
>>>>> -	zs_stat_dec(class, fullness == ZS_ALMOST_EMPTY ?
>>>>> -			CLASS_ALMOST_EMPTY : CLASS_ALMOST_FULL, 1);
>>>>> +	zs_stat_dec(class, fullness, 1);
>>>>> }
>>>>>
>>>>> /*
>>>>> @@ -764,8 +806,11 @@ static enum fullness_group fix_fullness_group(struct size_class *class,
>>>>> 	if (newfg == currfg)
>>>>> 		goto out;
>>>>>
>>>>> -	remove_zspage(class, zspage, currfg);
>>>>> -	insert_zspage(class, zspage, newfg);
>>>>> +	if (!is_zspage_isolated(zspage)) {
>>>>> +		remove_zspage(class, zspage, currfg);
>>>>> +		insert_zspage(class, zspage, newfg);
>>>>> +	}
>>>>> +
>>>>> 	set_zspage_mapping(zspage, class_idx, newfg);
>>>>>
>>>>> out:
>>>>> @@ -808,19 +853,45 @@ static int get_pages_per_zspage(int class_size)
>>>>> 	return max_usedpc_order;
>>>>> }
>>>>>
>>>>> -static struct page *get_first_page(struct zspage *zspage)
>>>>> +static struct zspage *get_zspage(struct page *page)
>>>>> {
>>>>> -	return zspage->first_page;
>>>>> +	struct zspage *zspage = (struct zspage *)page->private;
>>>>> +
>>>>> +	VM_BUG_ON(zspage->magic != ZSPAGE_MAGIC);
>>>>> +	return zspage;
>>>>> }
>>>>>
>>>>> -static struct zspage *get_zspage(struct page *page)
>>>>> +static struct page *get_next_page(struct page *page)
>>>>> {
>>>>> -	return (struct zspage *)page->private;
>>>>> +	if (unlikely(PageHugeObject(page)))
>>>>> +		return NULL;
>>>>> +
>>>>> +	return page->freelist;
>>>>> }
>>>>>
>>>>> -static struct page *get_next_page(struct page *page)
>>>>> +/* Get byte offset of first object in the @page */
>>>>> +static int get_first_obj_offset(struct size_class *class,
>>>>> +				struct page *first_page, struct page *page)
>>>>> {
>>>>> -	return page->next;
>>>>> +	int pos;
>>>>> +	int page_idx = 0;
>>>>> +	int ofs = 0;
>>>>> +	struct page *cursor = first_page;
>>>>> +
>>>>> +	if (first_page == page)
>>>>> +		goto out;
>>>>> +
>>>>> +	while (page != cursor) {
>>>>> +		page_idx++;
>>>>> +		cursor = get_next_page(cursor);
>>>>> +	}
>>>>> +
>>>>> +	pos = class->objs_per_zspage * class->size *
>>>>> +		page_idx / class->pages_per_zspage;
>>>>> +
>>>>> +	ofs = (pos + class->size) % PAGE_SIZE;
>>>>> +out:
>>>>> +	return ofs;
>>>>> }
>>>>>
>>>>> /**
>>>>> @@ -857,16 +928,20 @@ static unsigned long handle_to_obj(unsigned long handle)
>>>>> 	return *(unsigned long *)handle;
>>>>> }
>>>>>
>>>>> -static unsigned long obj_to_head(struct size_class *class, struct page *page,
>>>>> -			void *obj)
>>>>> +static unsigned long obj_to_head(struct page *page, void *obj)
>>>>> {
>>>>> -	if (class->huge) {
>>>>> +	if (unlikely(PageHugeObject(page))) {
>>>>> 		VM_BUG_ON_PAGE(!is_first_page(page), page);
>>>>> 		return page->index;
>>>>> 	} else
>>>>> 		return *(unsigned long *)obj;
>>>>> }
>>>>>
>>>>> +static inline int testpin_tag(unsigned long handle)
>>>>> +{
>>>>> +	return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
>>>>> +}
>>>>> +
>>>>> static inline int trypin_tag(unsigned long handle)
>>>>> {
>>>>> 	return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
>>>>> @@ -884,27 +959,93 @@ static void unpin_tag(unsigned long handle)
>>>>>
>>>>> static void reset_page(struct page *page)
>>>>> {
>>>>> +	__ClearPageMovable(page);
>>>>> 	clear_bit(PG_private, &page->flags);
>>>>> 	clear_bit(PG_private_2, &page->flags);
>>>>> 	set_page_private(page, 0);
>>>>> -	page->index = 0;
>>>>> +	ClearPageHugeObject(page);
>>>>> +	page->freelist = NULL;
>>>>> }
>>>>>
>>>>> -static void free_zspage(struct zs_pool *pool, struct zspage *zspage)
>>>>> +/*
>>>>> + * To prevent zspage destroy during migration, zspage freeing should
>>>>> + * hold locks of all pages in the zspage.
>>>>> + */
>>>>> +void lock_zspage(struct zspage *zspage)
>>>>> +{
>>>>> +	struct page *page = get_first_page(zspage);
>>>>> +
>>>>> +	do {
>>>>> +		lock_page(page);
>>>>> +	} while ((page = get_next_page(page)) != NULL);
>>>>> +}
>>>>> +
>>>>> +int trylock_zspage(struct zspage *zspage)
>>>>> +{
>>>>> +	struct page *cursor, *fail;
>>>>> +
>>>>> +	for (cursor = get_first_page(zspage); cursor != NULL; cursor =
>>>>> +					get_next_page(cursor)) {
>>>>> +		if (!trylock_page(cursor)) {
>>>>> +			fail = cursor;
>>>>> +			goto unlock;
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	return 1;
>>>>> +unlock:
>>>>> +	for (cursor = get_first_page(zspage); cursor != fail; cursor =
>>>>> +					get_next_page(cursor))
>>>>> +		unlock_page(cursor);
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +static void __free_zspage(struct zs_pool *pool, struct size_class *class,
>>>>> +				struct zspage *zspage)
>>>>> {
>>>>> 	struct page *page, *next;
>>>>> +	enum fullness_group fg;
>>>>> +	unsigned int class_idx;
>>>>> +
>>>>> +	get_zspage_mapping(zspage, &class_idx, &fg);
>>>>> +
>>>>> +	assert_spin_locked(&class->lock);
>>>>>
>>>>> 	VM_BUG_ON(get_zspage_inuse(zspage));
>>>>> +	VM_BUG_ON(fg != ZS_EMPTY);
>>>>>
>>>>> -	next = page = zspage->first_page;
>>>>> +	next = page = get_first_page(zspage);
>>>>> 	do {
>>>>> -		next = page->next;
>>>>> +		VM_BUG_ON_PAGE(!PageLocked(page), page);
>>>>> +		next = get_next_page(page);
>>>>> 		reset_page(page);
>>>>> +		unlock_page(page);
>>>>> 		put_page(page);
>>>>> 		page = next;
>>>>> 	} while (page != NULL);
>>>>>
>>>>> 	cache_free_zspage(pool, zspage);
>>>>> +
>>>>> +	zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>>>> +			class->size, class->pages_per_zspage));
>>>>> +	atomic_long_sub(class->pages_per_zspage,
>>>>> +					&pool->pages_allocated);
>>>>> +}
>>>>> +
>>>>> +static void free_zspage(struct zs_pool *pool, struct size_class *class,
>>>>> +				struct zspage *zspage)
>>>>> +{
>>>>> +	VM_BUG_ON(get_zspage_inuse(zspage));
>>>>> +	VM_BUG_ON(list_empty(&zspage->list));
>>>>> +
>>>>> +	if (!trylock_zspage(zspage)) {
>>>>> +		kick_deferred_free(pool);
>>>>> +		return;
>>>>> +	}
>>>>> +
>>>>> +	remove_zspage(class, zspage, ZS_EMPTY);
>>>>> +	__free_zspage(pool, class, zspage);
>>>>> }
>>>>>
>>>>> /* Initialize a newly allocated zspage */
>>>>> @@ -912,15 +1053,13 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
>>>>> {
>>>>> 	unsigned int freeobj = 1;
>>>>> 	unsigned long off = 0;
>>>>> -	struct page *page = zspage->first_page;
>>>>> +	struct page *page = get_first_page(zspage);
>>>>>
>>>>> 	while (page) {
>>>>> 		struct page *next_page;
>>>>> 		struct link_free *link;
>>>>> 		void *vaddr;
>>>>>
>>>>> -		set_first_obj_offset(page, off);
>>>>> -
>>>>> 		vaddr = kmap_atomic(page);
>>>>> 		link = (struct link_free *)vaddr + off / sizeof(*link);
>>>>>
>>>>> @@ -952,16 +1091,17 @@ static void init_zspage(struct size_class *class, struct zspage *zspage)
>>>>> 	set_freeobj(zspage, 0);
>>>>> }
>>>>>
>>>>> -static void create_page_chain(struct zspage *zspage, struct page *pages[],
>>>>> -				int nr_pages)
>>>>> +static void create_page_chain(struct size_class *class, struct zspage *zspage,
>>>>> +				struct page *pages[])
>>>>> {
>>>>> 	int i;
>>>>> 	struct page *page;
>>>>> 	struct page *prev_page = NULL;
>>>>> +	int nr_pages = class->pages_per_zspage;
>>>>>
>>>>> 	/*
>>>>> 	 * Allocate individual pages and link them together as:
>>>>> -	 * 1. all pages are linked together using page->next
>>>>> +	 * 1. all pages are linked together using page->freelist
>>>>> 	 * 2. each sub-page point to zspage using page->private
>>>>> 	 *
>>>>> 	 * we set PG_private to identify the first page (i.e. no other sub-page
>>>>> @@ -970,16 +1110,18 @@ static void create_page_chain(struct zspage *zspage, struct page *pages[],
>>>>> 	for (i = 0; i < nr_pages; i++) {
>>>>> 		page = pages[i];
>>>>> 		set_page_private(page, (unsigned long)zspage);
>>>>> +		page->freelist = NULL;
>>>>> 		if (i == 0) {
>>>>> 			zspage->first_page = page;
>>>>> 			SetPagePrivate(page);
>>>>> +			if (unlikely(class->objs_per_zspage == 1 &&
>>>>> +					class->pages_per_zspage == 1))
>>>>> +				SetPageHugeObject(page);
>>>>> 		} else {
>>>>> -			prev_page->next = page;
>>>>> +			prev_page->freelist = page;
>>>>> 		}
>>>>> -		if (i == nr_pages - 1) {
>>>>> +		if (i == nr_pages - 1)
>>>>> 			SetPagePrivate2(page);
>>>>> -			page->next = NULL;
>>>>> -		}
>>>>> 		prev_page = page;
>>>>> 	}
>>>>> }
>>>>> @@ -999,6 +1141,8 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>>>>> 		return NULL;
>>>>>
>>>>> 	memset(zspage, 0, sizeof(struct zspage));
>>>>> +	zspage->magic = ZSPAGE_MAGIC;
>>>>> +	migrate_lock_init(zspage);
>>>>>
>>>>> 	for (i = 0; i < class->pages_per_zspage; i++) {
>>>>> 		struct page *page;
>>>>> @@ -1013,7 +1157,7 @@ static struct zspage *alloc_zspage(struct zs_pool *pool,
>>>>> 		pages[i] = page;
>>>>> 	}
>>>>>
>>>>> -	create_page_chain(zspage, pages, class->pages_per_zspage);
>>>>> +	create_page_chain(class, zspage, pages);
>>>>> 	init_zspage(class, zspage);
>>>>>
>>>>> 	return zspage;
>>>>> @@ -1024,7 +1168,7 @@ static struct zspage *find_get_zspage(struct size_class *class)
>>>>> 	int i;
>>>>> 	struct zspage *zspage;
>>>>>
>>>>> -	for (i = ZS_ALMOST_FULL; i <= ZS_ALMOST_EMPTY; i++) {
>>>>> +	for (i = ZS_ALMOST_FULL; i >= ZS_EMPTY; i--) {
>>>>> 		zspage = list_first_entry_or_null(&class->fullness_list[i],
>>>>> 				struct zspage, list);
>>>>> 		if (zspage)
>>>>> @@ -1289,6 +1433,10 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>>>>> 	obj = handle_to_obj(handle);
>>>>> 	obj_to_location(obj, &page, &obj_idx);
>>>>> 	zspage = get_zspage(page);
>>>>> +
>>>>> +	/* migration cannot move any subpage in this zspage */
>>>>> +	migrate_read_lock(zspage);
>>>>> +
>>>>> 	get_zspage_mapping(zspage, &class_idx, &fg);
>>>>> 	class = pool->size_class[class_idx];
>>>>> 	off = (class->size * obj_idx) & ~PAGE_MASK;
>>>>> @@ -1309,7 +1457,7 @@ void *zs_map_object(struct zs_pool *pool, unsigned long handle,
>>>>>
>>>>> 	ret = __zs_map_object(area, pages, off, class->size);
>>>>> out:
>>>>> -	if (!class->huge)
>>>>> +	if (likely(!PageHugeObject(page)))
>>>>> 		ret += ZS_HANDLE_SIZE;
>>>>>
>>>>> 	return ret;
>>>>> @@ -1348,6 +1496,8 @@ void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
>>>>> 		__zs_unmap_object(area, pages, off, class->size);
>>>>> 	}
>>>>> 	put_cpu_var(zs_map_area);
>>>>> +
>>>>> +	migrate_read_unlock(zspage);
>>>>> 	unpin_tag(handle);
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(zs_unmap_object);
>>>>> @@ -1377,7 +1527,7 @@ static unsigned long obj_malloc(struct size_class *class,
>>>>> 	vaddr = kmap_atomic(m_page);
>>>>> 	link = (struct link_free *)vaddr + m_offset / sizeof(*link);
>>>>> 	set_freeobj(zspage, link->next >> OBJ_ALLOCATED_TAG);
>>>>> -	if (!class->huge)
>>>>> +	if (likely(!PageHugeObject(m_page)))
>>>>> 		/* record handle in the header of allocated chunk */
>>>>> 		link->handle = handle;
>>>>> 	else
>>>>> @@ -1407,6 +1557,7 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>>>>> {
>>>>> 	unsigned long handle, obj;
>>>>> 	struct size_class *class;
>>>>> +	enum fullness_group newfg;
>>>>> 	struct zspage *zspage;
>>>>>
>>>>> 	if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
>>>>> @@ -1422,28 +1573,37 @@ unsigned long zs_malloc(struct zs_pool *pool, size_t size, gfp_t gfp)
>>>>>
>>>>> 	spin_lock(&class->lock);
>>>>> 	zspage = find_get_zspage(class);
>>>>> -
>>>>> -	if (!zspage) {
>>>>> +	if (likely(zspage)) {
>>>>> +		obj = obj_malloc(class, zspage, handle);
>>>>> +		/* Now move the zspage to another fullness group, if required */
>>>>> +		fix_fullness_group(class, zspage);
>>>>> +		record_obj(handle, obj);
>>>>> 		spin_unlock(&class->lock);
>>>>> -		zspage = alloc_zspage(pool, class, gfp);
>>>>> -		if (unlikely(!zspage)) {
>>>>> -			cache_free_handle(pool, handle);
>>>>> -			return 0;
>>>>> -		}
>>>>>
>>>>> -		set_zspage_mapping(zspage, class->index, ZS_EMPTY);
>>>>> -		atomic_long_add(class->pages_per_zspage,
>>>>> -					&pool->pages_allocated);
>>>>> +		return handle;
>>>>> +	}
>>>>>
>>>>> -		spin_lock(&class->lock);
>>>>> -		zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>>>> -				class->size, class->pages_per_zspage));
>>>>> +	spin_unlock(&class->lock);
>>>>> +
>>>>> +	zspage = alloc_zspage(pool, class, gfp);
>>>>> +	if (!zspage) {
>>>>> +		cache_free_handle(pool, handle);
>>>>> +		return 0;
>>>>> 	}
>>>>>
>>>>> +	spin_lock(&class->lock);
>>>>> 	obj = obj_malloc(class, zspage, handle);
>>>>> -	/* Now move the zspage to another fullness group, if required */
>>>>> -	fix_fullness_group(class, zspage);
>>>>> +	newfg = get_fullness_group(class, zspage);
>>>>> +	insert_zspage(class, zspage, newfg);
>>>>> +	set_zspage_mapping(zspage, class->index, newfg);
>>>>> 	record_obj(handle, obj);
>>>>> +	atomic_long_add(class->pages_per_zspage,
>>>>> +				&pool->pages_allocated);
>>>>> +	zs_stat_inc(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>>>> +			class->size, class->pages_per_zspage));
>>>>> +
>>>>> +	/* We completely set up zspage so mark them as movable */
>>>>> +	SetZsPageMovable(pool, zspage);
>>>>> 	spin_unlock(&class->lock);
>>>>>
>>>>> 	return handle;
>>>>> @@ -1484,6 +1644,7 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>>>>> 	int class_idx;
>>>>> 	struct size_class *class;
>>>>> 	enum fullness_group fullness;
>>>>> +	bool isolated;
>>>>>
>>>>> 	if (unlikely(!handle))
>>>>> 		return;
>>>>> @@ -1493,22 +1654,28 @@ void zs_free(struct zs_pool *pool, unsigned long handle)
>>>>> 	obj_to_location(obj, &f_page, &f_objidx);
>>>>> 	zspage = get_zspage(f_page);
>>>>>
>>>>> +	migrate_read_lock(zspage);
>>>>> +
>>>>> 	get_zspage_mapping(zspage, &class_idx, &fullness);
>>>>> 	class = pool->size_class[class_idx];
>>>>>
>>>>> 	spin_lock(&class->lock);
>>>>> 	obj_free(class, obj);
>>>>> 	fullness = fix_fullness_group(class, zspage);
>>>>> -	if (fullness == ZS_EMPTY) {
>>>>> -		zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>>>> -				class->size, class->pages_per_zspage));
>>>>> -		atomic_long_sub(class->pages_per_zspage,
>>>>> -				&pool->pages_allocated);
>>>>> -		free_zspage(pool, zspage);
>>>>> +	if (fullness != ZS_EMPTY) {
>>>>> +		migrate_read_unlock(zspage);
>>>>> +		goto out;
>>>>> 	}
>>>>> +
>>>>> +	isolated = is_zspage_isolated(zspage);
>>>>> +	migrate_read_unlock(zspage);
>>>>> +	/* If zspage is isolated, zs_page_putback will free the zspage */
>>>>> +	if (likely(!isolated))
>>>>> +		free_zspage(pool, class, zspage);
>>>>> +out:
>>>>> +
>>>>> 	spin_unlock(&class->lock);
>>>>> 	unpin_tag(handle);
>>>>> -
>>>>> 	cache_free_handle(pool, handle);
>>>>> }
>>>>> EXPORT_SYMBOL_GPL(zs_free);
>>>>> @@ -1587,12 +1754,13 @@ static unsigned long find_alloced_obj(struct size_class *class,
>>>>> 	int offset = 0;
>>>>> 	unsigned long handle = 0;
>>>>> 	void *addr = kmap_atomic(page);
>>>>> +	struct zspage *zspage = get_zspage(page);
>>>>>
>>>>> -	offset = get_first_obj_offset(page);
>>>>> +	offset = get_first_obj_offset(class, get_first_page(zspage), page);
>>>>> 	offset += class->size * index;
>>>>>
>>>>> 	while (offset < PAGE_SIZE) {
>>>>> -		head = obj_to_head(class, page, addr + offset);
>>>>> +		head = obj_to_head(page, addr + offset);
>>>>> 		if (head & OBJ_ALLOCATED_TAG) {
>>>>> 			handle = head & ~OBJ_ALLOCATED_TAG;
>>>>> 			if (trypin_tag(handle))
>>>>> @@ -1684,6 +1852,7 @@ static struct zspage *isolate_zspage(struct size_class *class, bool source)
>>>>> 		zspage = list_first_entry_or_null(&class->fullness_list[fg[i]],
>>>>> 							struct zspage, list);
>>>>> 		if (zspage) {
>>>>> +			VM_BUG_ON(is_zspage_isolated(zspage));
>>>>> 			remove_zspage(class, zspage, fg[i]);
>>>>> 			return zspage;
>>>>> 		}
>>>>> @@ -1704,6 +1873,8 @@ static enum fullness_group putback_zspage(struct size_class *class,
>>>>> {
>>>>> 	enum fullness_group fullness;
>>>>>
>>>>> +	VM_BUG_ON(is_zspage_isolated(zspage));
>>>>> +
>>>>> 	fullness = get_fullness_group(class, zspage);
>>>>> 	insert_zspage(class, zspage, fullness);
>>>>> 	set_zspage_mapping(zspage, class->index, fullness);
>>>>> @@ -1711,6 +1882,377 @@ static enum fullness_group putback_zspage(struct size_class *class,
>>>>> 	return fullness;
>>>>> }
>>>>>
>>>>> +#ifdef CONFIG_COMPACTION
>>>>> +static struct dentry *zs_mount(struct file_system_type *fs_type,
>>>>> +				int flags, const char *dev_name, void *data)
>>>>> +{
>>>>> +	static const struct dentry_operations ops = {
>>>>> +		.d_dname = simple_dname,
>>>>> +	};
>>>>> +
>>>>> +	return mount_pseudo(fs_type, "zsmalloc:", NULL, &ops, ZSMALLOC_MAGIC);
>>>>> +}
>>>>> +
>>>>> +static struct file_system_type zsmalloc_fs = {
>>>>> +	.name		= "zsmalloc",
>>>>> +	.mount		= zs_mount,
>>>>> +	.kill_sb	= kill_anon_super,
>>>>> +};
>>>>> +
>>>>> +static int zsmalloc_mount(void)
>>>>> +{
>>>>> +	int ret = 0;
>>>>> +
>>>>> +	zsmalloc_mnt = kern_mount(&zsmalloc_fs);
>>>>> +	if (IS_ERR(zsmalloc_mnt))
>>>>> +		ret = PTR_ERR(zsmalloc_mnt);
>>>>> +
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> +static void zsmalloc_unmount(void)
>>>>> +{
>>>>> +	kern_unmount(zsmalloc_mnt);
>>>>> +}
>>>>> +
>>>>> +static void migrate_lock_init(struct zspage *zspage)
>>>>> +{
>>>>> +	rwlock_init(&zspage->lock);
>>>>> +}
>>>>> +
>>>>> +static void migrate_read_lock(struct zspage *zspage)
>>>>> +{
>>>>> +	read_lock(&zspage->lock);
>>>>> +}
>>>>> +
>>>>> +static void migrate_read_unlock(struct zspage *zspage)
>>>>> +{
>>>>> +	read_unlock(&zspage->lock);
>>>>> +}
>>>>> +
>>>>> +static void migrate_write_lock(struct zspage *zspage)
>>>>> +{
>>>>> +	write_lock(&zspage->lock);
>>>>> +}
>>>>> +
>>>>> +static void migrate_write_unlock(struct zspage *zspage)
>>>>> +{
>>>>> +	write_unlock(&zspage->lock);
>>>>> +}
>>>>> +
>>>>> +/* Number of isolated subpage for *page migration* in this zspage */
>>>>> +static void inc_zspage_isolation(struct zspage *zspage)
>>>>> +{
>>>>> +	zspage->isolated++;
>>>>> +}
>>>>> +
>>>>> +static void dec_zspage_isolation(struct zspage *zspage)
>>>>> +{
>>>>> +	zspage->isolated--;
>>>>> +}
>>>>> +
>>>>> +static void replace_sub_page(struct size_class *class, struct zspage *zspage,
>>>>> +				struct page *newpage, struct page *oldpage)
>>>>> +{
>>>>> +	struct page *page;
>>>>> +	struct page *pages[ZS_MAX_PAGES_PER_ZSPAGE] = {NULL, };
>>>>> +	int idx = 0;
>>>>> +
>>>>> +	page = get_first_page(zspage);
>>>>> +	do {
>>>>> +		if (page == oldpage)
>>>>> +			pages[idx] = newpage;
>>>>> +		else
>>>>> +			pages[idx] = page;
>>>>> +		idx++;
>>>>> +	} while ((page = get_next_page(page)) != NULL);
>>>>> +
>>>>> +	create_page_chain(class, zspage, pages);
>>>>> +	if (unlikely(PageHugeObject(oldpage)))
>>>>> +		newpage->index = oldpage->index;
>>>>> +	__SetPageMovable(newpage, page_mapping(oldpage));
>>>>> +}
>>>>> +
>>>>> +bool zs_page_isolate(struct page *page, isolate_mode_t mode)
>>>>> +{
>>>>> +	struct zs_pool *pool;
>>>>> +	struct size_class *class;
>>>>> +	int class_idx;
>>>>> +	enum fullness_group fullness;
>>>>> +	struct zspage *zspage;
>>>>> +	struct address_space *mapping;
>>>>> +
>>>>> +	/*
>>>>> +	 * Page is locked so zspage couldn't be destroyed. For detail, look at
>>>>> +	 * lock_zspage in free_zspage.
>>>>> +	 */
>>>>> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
>>>>> +	VM_BUG_ON_PAGE(PageIsolated(page), page);
>>>>> +
>>>>> +	zspage = get_zspage(page);
>>>>> +
>>>>> +	/*
>>>>> +	 * Without class lock, fullness could be stale while class_idx is okay
>>>>> +	 * because class_idx is constant unless page is freed so we should get
>>>>> +	 * fullness again under class lock.
>>>>> +	 */
>>>>> +	get_zspage_mapping(zspage, &class_idx, &fullness);
>>>>> +	mapping = page_mapping(page);
>>>>> +	pool = mapping->private_data;
>>>>> +	class = pool->size_class[class_idx];
>>>>> +
>>>>> +	spin_lock(&class->lock);
>>>>> +	if (get_zspage_inuse(zspage) == 0) {
>>>>> +		spin_unlock(&class->lock);
>>>>> +		return false;
>>>>> +	}
>>>>> +
>>>>> +	/* zspage is isolated for object migration */
>>>>> +	if (list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
>>>>> +		spin_unlock(&class->lock);
>>>>> +		return false;
>>>>> +	}
>>>>> +
>>>>> +	/*
>>>>> +	 * If this is first time isolation for the zspage, isolate zspage from
>>>>> +	 * size_class to prevent further object allocation from the zspage.
>>>>> +	 */
>>>>> +	if (!list_empty(&zspage->list) && !is_zspage_isolated(zspage)) {
>>>>> +		get_zspage_mapping(zspage, &class_idx, &fullness);
>>>>> +		remove_zspage(class, zspage, fullness);
>>>>> +	}
>>>>> +
>>>>> +	inc_zspage_isolation(zspage);
>>>>> +	spin_unlock(&class->lock);
>>>>> +
>>>>> +	return true;
>>>>> +}
>>>>> +
>>>>> +int zs_page_migrate(struct address_space *mapping, struct page *newpage,
>>>>> +		struct page *page, enum migrate_mode mode)
>>>>> +{
>>>>> +	struct zs_pool *pool;
>>>>> +	struct size_class *class;
>>>>> +	int class_idx;
>>>>> +	enum fullness_group fullness;
>>>>> +	struct zspage *zspage;
>>>>> +	struct page *dummy;
>>>>> +	void *s_addr, *d_addr, *addr;
>>>>> +	int offset, pos;
>>>>> +	unsigned long handle, head;
>>>>> +	unsigned long old_obj, new_obj;
>>>>> +	unsigned int obj_idx;
>>>>> +	int ret = -EAGAIN;
>>>>> +
>>>>> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
>>>>> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
>>>>> +
>>>>> +	zspage = get_zspage(page);
>>>>> +
>>>>> +	/* Concurrent compactor cannot migrate any subpage in zspage */
>>>>> +	migrate_write_lock(zspage);
>>>>> +	get_zspage_mapping(zspage, &class_idx, &fullness);
>>>>> +	pool = mapping->private_data;
>>>>> +	class = pool->size_class[class_idx];
>>>>> +	offset = get_first_obj_offset(class, get_first_page(zspage), page);
>>>>> +
>>>>> +	spin_lock(&class->lock);
>>>>> +	if (!get_zspage_inuse(zspage)) {
>>>>> +		ret = -EBUSY;
>>>>> +		goto unlock_class;
>>>>> +	}
>>>>> +
>>>>> +	pos = offset;
>>>>> +	s_addr = kmap_atomic(page);
>>>>> +	while (pos < PAGE_SIZE) {
>>>>> +		head = obj_to_head(page, s_addr + pos);
>>>>> +		if (head & OBJ_ALLOCATED_TAG) {
>>>>> +			handle = head & ~OBJ_ALLOCATED_TAG;
>>>>> +			if (!trypin_tag(handle))
>>>>> +				goto unpin_objects;
>>>>> +		}
>>>>> +		pos += class->size;
>>>>> +	}
>>>>> +
>>>>> +	/*
>>>>> +	 * Here, any user cannot access all objects in the zspage so let's move.
>>>>> +	 */
>>>>> +	d_addr = kmap_atomic(newpage);
>>>>> +	memcpy(d_addr, s_addr, PAGE_SIZE);
>>>>> +	kunmap_atomic(d_addr);
>>>>> +
>>>>> +	for (addr = s_addr + offset; addr < s_addr + pos;
>>>>> +					addr += class->size) {
>>>>> +		head = obj_to_head(page, addr);
>>>>> +		if (head & OBJ_ALLOCATED_TAG) {
>>>>> +			handle = head & ~OBJ_ALLOCATED_TAG;
>>>>> +			if (!testpin_tag(handle))
>>>>> +				BUG();
>>>>> +
>>>>> +			old_obj = handle_to_obj(handle);
>>>>> +			obj_to_location(old_obj, &dummy, &obj_idx);
>>>>> +			new_obj = (unsigned long)location_to_obj(newpage,
>>>>> +								obj_idx);
>>>>> +			new_obj |= BIT(HANDLE_PIN_BIT);
>>>>> +			record_obj(handle, new_obj);
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	replace_sub_page(class, zspage, newpage, page);
>>>>> +	get_page(newpage);
>>>>> +
>>>>> +	dec_zspage_isolation(zspage);
>>>>> +
>>>>> +	/*
>>>>> +	 * Page migration is done so let's putback isolated zspage to
>>>>> +	 * the list if @page is final isolated subpage in the zspage.
>>>>> +	 */
>>>>> +	if (!is_zspage_isolated(zspage))
>>>>> +		putback_zspage(class, zspage);
>>>>> +
>>>>> +	reset_page(page);
>>>>> +	put_page(page);
>>>>> +	page = newpage;
>>>>> +
>>>>> +	ret = 0;
>>>>> +unpin_objects:
>>>>> +	for (addr = s_addr + offset; addr < s_addr + pos;
>>>>> +						addr += class->size) {
>>>>> +		head = obj_to_head(page, addr);
>>>>> +		if (head & OBJ_ALLOCATED_TAG) {
>>>>> +			handle = head & ~OBJ_ALLOCATED_TAG;
>>>>> +			if (!testpin_tag(handle))
>>>>> +				BUG();
>>>>> +			unpin_tag(handle);
>>>>> +		}
>>>>> +	}
>>>>> +	kunmap_atomic(s_addr);
>>>>> +unlock_class:
>>>>> +	spin_unlock(&class->lock);
>>>>> +	migrate_write_unlock(zspage);
>>>>> +
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> +void zs_page_putback(struct page *page)
>>>>> +{
>>>>> +	struct zs_pool *pool;
>>>>> +	struct size_class *class;
>>>>> +	int class_idx;
>>>>> +	enum fullness_group fg;
>>>>> +	struct address_space *mapping;
>>>>> +	struct zspage *zspage;
>>>>> +
>>>>> +	VM_BUG_ON_PAGE(!PageMovable(page), page);
>>>>> +	VM_BUG_ON_PAGE(!PageIsolated(page), page);
>>>>> +
>>>>> +	zspage = get_zspage(page);
>>>>> +	get_zspage_mapping(zspage, &class_idx, &fg);
>>>>> +	mapping = page_mapping(page);
>>>>> +	pool = mapping->private_data;
>>>>> +	class = pool->size_class[class_idx];
>>>>> +
>>>>> +	spin_lock(&class->lock);
>>>>> +	dec_zspage_isolation(zspage);
>>>>> +	if (!is_zspage_isolated(zspage)) {
>>>>> +		fg = putback_zspage(class, zspage);
>>>>> +		/*
>>>>> +		 * Due to page_lock, we cannot free zspage immediately
>>>>> +		 * so let's defer.
>>>>> +		 */
>>>>> +		if (fg == ZS_EMPTY)
>>>>> +			schedule_work(&pool->free_work);
>>>>> +	}
>>>>> +	spin_unlock(&class->lock);
>>>>> +}
>>>>> +
>>>>> +const struct address_space_operations zsmalloc_aops = {
>>>>> +	.isolate_page = zs_page_isolate,
>>>>> +	.migratepage = zs_page_migrate,
>>>>> +	.putback_page = zs_page_putback,
>>>>> +};
>>>>> +
>>>>> +static int zs_register_migration(struct zs_pool *pool)
>>>>> +{
>>>>> +	pool->inode = alloc_anon_inode(zsmalloc_mnt->mnt_sb);
>>>>> +	if (IS_ERR(pool->inode)) {
>>>>> +		pool->inode = NULL;
>>>>> +		return 1;
>>>>> +	}
>>>>> +
>>>>> +	pool->inode->i_mapping->private_data = pool;
>>>>> +	pool->inode->i_mapping->a_ops = &zsmalloc_aops;
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +static void zs_unregister_migration(struct zs_pool *pool)
>>>>> +{
>>>>> +	flush_work(&pool->free_work);
>>>>> +	if (pool->inode)
>>>>> +		iput(pool->inode);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Caller should hold page_lock of all pages in the zspage
>>>>> + * In here, we cannot use zspage meta data.
>>>>> + */
>>>>> +static void async_free_zspage(struct work_struct *work)
>>>>> +{
>>>>> +	int i;
>>>>> +	struct size_class *class;
>>>>> +	unsigned int class_idx;
>>>>> +	enum fullness_group fullness;
>>>>> +	struct zspage *zspage, *tmp;
>>>>> +	LIST_HEAD(free_pages);
>>>>> +	struct zs_pool *pool = container_of(work, struct zs_pool,
>>>>> +					free_work);
>>>>> +
>>>>> +	for (i = 0; i < zs_size_classes; i++) {
>>>>> +		class = pool->size_class[i];
>>>>> +		if (class->index != i)
>>>>> +			continue;
>>>>> +
>>>>> +		spin_lock(&class->lock);
>>>>> +		list_splice_init(&class->fullness_list[ZS_EMPTY], &free_pages);
>>>>> +		spin_unlock(&class->lock);
>>>>> +	}
>>>>> +
>>>>> +
>>>>> +	list_for_each_entry_safe(zspage, tmp, &free_pages, list) {
>>>>> +		list_del(&zspage->list);
>>>>> +		lock_zspage(zspage);
>>>>> +
>>>>> +		get_zspage_mapping(zspage, &class_idx, &fullness);
>>>>> +		VM_BUG_ON(fullness != ZS_EMPTY);
>>>>> +		class = pool->size_class[class_idx];
>>>>> +		spin_lock(&class->lock);
>>>>> +		__free_zspage(pool, pool->size_class[class_idx], zspage);
>>>>> +		spin_unlock(&class->lock);
>>>>> +	}
>>>>> +};
>>>>> +
>>>>> +static void kick_deferred_free(struct zs_pool *pool)
>>>>> +{
>>>>> +	schedule_work(&pool->free_work);
>>>>> +}
>>>>> +
>>>>> +static void init_deferred_free(struct zs_pool *pool)
>>>>> +{
>>>>> +	INIT_WORK(&pool->free_work, async_free_zspage);
>>>>> +}
>>>>> +
>>>>> +static void SetZsPageMovable(struct zs_pool *pool, struct zspage *zspage)
>>>>> +{
>>>>> +	struct page *page = get_first_page(zspage);
>>>>> +
>>>>> +	do {
>>>>> +		WARN_ON(!trylock_page(page));
>>>>> +		__SetPageMovable(page, pool->inode->i_mapping);
>>>>> +		unlock_page(page);
>>>>> +	} while ((page = get_next_page(page)) != NULL);
>>>>> +}
>>>>> +#endif
>>>>> +
>>>>> /*
>>>>> *
>>>>> * Based on the number of unused allocated objects calculate
>>>>> @@ -1745,10 +2287,10 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>>>>> 			break;
>>>>>
>>>>> 		cc.index = 0;
>>>>> -		cc.s_page = src_zspage->first_page;
>>>>> +		cc.s_page = get_first_page(src_zspage);
>>>>>
>>>>> 		while ((dst_zspage = isolate_zspage(class, false))) {
>>>>> -			cc.d_page = dst_zspage->first_page;
>>>>> +			cc.d_page = get_first_page(dst_zspage);
>>>>> 			/*
>>>>> 			 * If there is no more space in dst_page, resched
>>>>> 			 * and see if anyone had allocated another zspage.
>>>>> @@ -1765,11 +2307,7 @@ static void __zs_compact(struct zs_pool *pool, struct size_class *class)
>>>>>
>>>>> 		putback_zspage(class, dst_zspage);
>>>>> 		if (putback_zspage(class, src_zspage) == ZS_EMPTY) {
>>>>> -			zs_stat_dec(class, OBJ_ALLOCATED, get_maxobj_per_zspage(
>>>>> -					class->size, class->pages_per_zspage));
>>>>> -			atomic_long_sub(class->pages_per_zspage,
>>>>> -					&pool->pages_allocated);
>>>>> -			free_zspage(pool, src_zspage);
>>>>> +			free_zspage(pool, class, src_zspage);
>>>>> 			pool->stats.pages_compacted += class->pages_per_zspage;
>>>>> 		}
>>>>> 		spin_unlock(&class->lock);
>>>>> @@ -1885,6 +2423,7 @@ struct zs_pool *zs_create_pool(const char *name)
>>>>> 	if (!pool)
>>>>> 		return NULL;
>>>>>
>>>>> +	init_deferred_free(pool);
>>>>> 	pool->size_class = kcalloc(zs_size_classes, sizeof(struct size_class *),
>>>>> 			GFP_KERNEL);
>>>>> 	if (!pool->size_class) {
>>>>> @@ -1939,12 +2478,10 @@ struct zs_pool *zs_create_pool(const char *name)
>>>>> 		class->pages_per_zspage = pages_per_zspage;
>>>>> 		class->objs_per_zspage = class->pages_per_zspage *
>>>>> 						PAGE_SIZE / class->size;
>>>>> -		if (pages_per_zspage == 1 && class->objs_per_zspage == 1)
>>>>> -			class->huge = true;
>>>>> 		spin_lock_init(&class->lock);
>>>>> 		pool->size_class[i] = class;
>>>>> -		for (fullness = ZS_ALMOST_FULL; fullness <= ZS_ALMOST_EMPTY;
>>>>> -								fullness++)
>>>>> +		for (fullness = ZS_EMPTY; fullness < NR_ZS_FULLNESS;
>>>>> +							fullness++)
>>>>> 			INIT_LIST_HEAD(&class->fullness_list[fullness]);
>>>>>
>>>>> 		prev_class = class;
>>>>> @@ -1953,6 +2490,9 @@ struct zs_pool *zs_create_pool(const char *name)
>>>>> 	/* debug only, don't abort if it fails */
>>>>> 	zs_pool_stat_create(pool, name);
>>>>>
>>>>> +	if (zs_register_migration(pool))
>>>>> +		goto err;
>>>>> +
>>>>> 	/*
>>>>> 	 * Not critical, we still can use the pool
>>>>> 	 * and user can trigger compaction manually.
>>>>> @@ -1972,6 +2512,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>>>>> 	int i;
>>>>>
>>>>> 	zs_unregister_shrinker(pool);
>>>>> +	zs_unregister_migration(pool);
>>>>> 	zs_pool_stat_destroy(pool);
>>>>>
>>>>> 	for (i = 0; i < zs_size_classes; i++) {
>>>>> @@ -1984,7 +2525,7 @@ void zs_destroy_pool(struct zs_pool *pool)
>>>>> 		if (class->index != i)
>>>>> 			continue;
>>>>>
>>>>> -		for (fg = ZS_ALMOST_FULL; fg <= ZS_ALMOST_EMPTY; fg++) {
>>>>> +		for (fg = ZS_EMPTY; fg < NR_ZS_FULLNESS; fg++) {
>>>>> 			if (!list_empty(&class->fullness_list[fg])) {
>>>>> 				pr_info("Freeing non-empty class with size %db, fullness group %d\n",
>>>>> 					class->size, fg);
>>>>> @@ -2002,7 +2543,13 @@ EXPORT_SYMBOL_GPL(zs_destroy_pool);
>>>>>
>>>>> static int __init zs_init(void)
>>>>> {
>>>>> -	int ret = zs_register_cpu_notifier();
>>>>> +	int ret;
>>>>> +
>>>>> +	ret = zsmalloc_mount();
>>>>> +	if (ret)
>>>>> +		goto out;
>>>>> +
>>>>> +	ret = zs_register_cpu_notifier();
>>>>>
>>>>> 	if (ret)
>>>>> 		goto notifier_fail;
>>>>> @@ -2019,7 +2566,8 @@ static int __init zs_init(void)
>>>>>
>>>>> notifier_fail:
>>>>> 	zs_unregister_cpu_notifier();
>>>>> -
>>>>> +	zsmalloc_unmount();
>>>>> +out:
>>>>> 	return ret;
>>>>> }
>>>>>
>>>>> @@ -2028,6 +2576,7 @@ static void __exit zs_exit(void)
>>>>> #ifdef CONFIG_ZPOOL
>>>>> 	zpool_unregister_driver(&zs_zpool_driver);
>>>>> #endif
>>>>> +	zsmalloc_unmount();
>>>>> 	zs_unregister_cpu_notifier();
>>>>>
>>>>> 	zs_stat_exit();
>>>>>

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-19  8:16             ` Chulmin Kim
@ 2017-01-23  5:22               ` Minchan Kim
  2017-01-23  5:30                 ` Sergey Senozhatsky
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2017-01-23  5:22 UTC (permalink / raw)
  To: Chulmin Kim; +Cc: Andrew Morton, linux-mm, Sergey Senozhatsky

Hi Chulmin,

On Thu, Jan 19, 2017 at 03:16:11AM -0500, Chulmin Kim wrote:
> On 01/19/2017 01:21 AM, Minchan Kim wrote:
> >On Wed, Jan 18, 2017 at 10:39:15PM -0500, Chulmin Kim wrote:
> >>On 01/18/2017 09:44 PM, Minchan Kim wrote:
> >>>Hello Chulmin,
> >>>
> >>>On Wed, Jan 18, 2017 at 07:13:21PM -0500, Chulmin Kim wrote:
> >>>>Hello. Minchan, and all zsmalloc guys.
> >>>>
> >>>>I have a quick question.
> >>>>Is zsmalloc considering memory barrier things correctly?
> >>>>
> >>>>AFAIK, in ARM64,
> >>>>zsmalloc relies on dmb operation in bit_spin_unlock only.
> >>>>(It seems that dmb operations in spinlock functions are being prepared,
> >>>>but let is be aside as it is not merged yet.)
> >>>>
> >>>>If I am correct,
> >>>>migrating a page in a zspage filled with free objs
> >>>>may cause the corruption cause bit_spin_unlock will not be executed at all.
> >>>>
> >>>>I am not sure this is enough memory barrier for zsmalloc operations.
> >>>>
> >>>>Can you enlighten me?
> >>>
> >>>Do you mean bit_spin_unlock is broken or zsmalloc locking scheme broken?
> >>>Could you please describe what you are concerning in detail?
> >>>It would be very helpful if you say it with a example!
> >>
> >>Sorry for ambiguous expressions. :)
> >>
> >>Recently,
> >>I found multiple zsmalloc corruption cases which have garbage idx values in
> >>in zspage->freeobj. (not ffffffff (-1) value.)
> >>
> >>Honestly, I have no clue yet.
> >>
> >>I suspect the case when zspage migrate a zs sub page filled with free
> >>objects (so that never calls unpin_tag() which has memory barrier).
> >>
> >>
> >>Assume the page (zs subpage) being migrated has no allocated zs object.
> >>
> >>S : zs subpage
> >>D : free page
> >>
> >>
> >>CPU A : zs_page_migrate()		CPU B : zs_malloc()
> >>---------------------			-----------------------------
> >>
> >>
> >>migrate_write_lock()
> >>spin_lock()
> >>
> >>memcpy(D, S, PAGE_SIZE)   -> (1)
> >>replace_sub_page()
> >>
> >>putback_zspage()
> >>spin_unlock()
> >>migrate_write_unlock()
> >>					
> >>					spin_lock()
> >>					obj_malloc()
> >>					--> (2-a) allocate obj in D
> >>					--> (2-b) set freeobj using
> >>     						the first 8 bytes of
> >> 						the allocated obj
> >>					record_obj()
> >>					spin_unlock
> >>
> >>
> >>
> >>I think the locking has no problem, but memory ordering.
> >>I doubt whether (2-b) in CPU B really loads the data stored by (1).
> >>
> >>If it doesn't, set_freeobj in (2-b) will corrupt zspage->freeobj.
> >>After then, we will see corrupted object sooner or later.
> >
> >Thanks for the example.
> >When I cannot understand what you are pointing out.
> >
> >In above example, two CPU use same spin_lock of a class so store op
> >by memcpy in the critical section should be visible by CPU B.
> >
> >Am I missing your point?
> 
> 
> No, you are right.
> I just pointed it prematurely after only checking that arm64's spinlock
> seems not issue "dmb" operation explicitly.
> I am the one missed the basics.
> 
> Anyway, I will let you know the situation when it gets more clear.

Yeb, Thanks.

Perhaps, did you try flushing the page before the write?
I think arm64 has no d-cache aliasing problem, but it is worth a try.
Who knows :)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 46da1c4..a3a5520 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -612,6 +612,8 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 	unsigned long element;
 
 	page = bvec->bv_page;
+	flush_dcache_page(page);
+
 	if (is_partial_io(bvec)) {
 		/*
 		 * This is a partial IO. We need to read the full page


* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-23  5:22               ` Minchan Kim
@ 2017-01-23  5:30                 ` Sergey Senozhatsky
  2017-01-23  5:40                   ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Sergey Senozhatsky @ 2017-01-23  5:30 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Chulmin Kim, Andrew Morton, linux-mm, Sergey Senozhatsky

On (01/23/17 14:22), Minchan Kim wrote:
[..]
> > Anyway, I will let you know the situation when it gets more clear.
> 
> Yeb, Thanks.
> 
> Perhaps, did you tried flush page before the writing?
> I think arm64 have no d-cache alising problem but worth to try it.
> Who knows :)

I thought that flush_dcache_page() is only for cases when we write
to a page (a store that makes the page dirty), isn't it?

	-ss

> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 46da1c4..a3a5520 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -612,6 +612,8 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
>  	unsigned long element;
>  
>  	page = bvec->bv_page;
> +	flush_dcache_page(page);
> +
>  	if (is_partial_io(bvec)) {
>  		/*
>  		 * This is a partial IO. We need to read the full page


* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-23  5:30                 ` Sergey Senozhatsky
@ 2017-01-23  5:40                   ` Minchan Kim
  2017-01-25  4:06                     ` Chulmin Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2017-01-23  5:40 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Chulmin Kim, Andrew Morton, linux-mm, Sergey Senozhatsky

On Mon, Jan 23, 2017 at 02:30:56PM +0900, Sergey Senozhatsky wrote:
> On (01/23/17 14:22), Minchan Kim wrote:
> [..]
> > > Anyway, I will let you know the situation when it gets more clear.
> > 
> > Yeb, Thanks.
> > 
> > Perhaps, did you tried flush page before the writing?
> > I think arm64 have no d-cache alising problem but worth to try it.
> > Who knows :)
> 
> I thought that flush_dcache_page() is only for cases when we write
> to page (store that makes pages dirty), isn't it?

I think we need both, so that we see recent stores done by the user.
I'm not sure whether it should be done by the block device driver
rather than by the page cache. Anyway, brd added it, so I thought it
was worth a try. :)

Thanks.

http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=c2572f2b4ffc27ba79211aceee3bef53a59bb5cd



* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-23  5:40                   ` Minchan Kim
@ 2017-01-25  4:06                     ` Chulmin Kim
  2017-01-25  4:25                       ` Sergey Senozhatsky
  2017-01-25  5:26                       ` Minchan Kim
  0 siblings, 2 replies; 49+ messages in thread
From: Chulmin Kim @ 2017-01-25  4:06 UTC (permalink / raw)
  To: Minchan Kim, Sergey Senozhatsky
  Cc: Andrew Morton, linux-mm, Sergey Senozhatsky

On 01/23/2017 12:40 AM, Minchan Kim wrote:
> On Mon, Jan 23, 2017 at 02:30:56PM +0900, Sergey Senozhatsky wrote:
>> On (01/23/17 14:22), Minchan Kim wrote:
>> [..]
>>>> Anyway, I will let you know the situation when it gets more clear.
>>>
>>> Yeb, Thanks.
>>>
>>> Perhaps, did you tried flush page before the writing?
>>> I think arm64 have no d-cache alising problem but worth to try it.
>>> Who knows :)
>>
>> I thought that flush_dcache_page() is only for cases when we write
>> to page (store that makes pages dirty), isn't it?
>
> I think we need both because to see recent stores done by the user.
> I'm not sure it should be done by block device driver rather than
> page cache. Anyway, brd added it so worth to try it, I thought. :)
>

Thanks for the suggestion!
It might be helpful, though proving it is not easy as the problem
appears rarely.

Have you thought about zram swap or zswap dealing with self-modifying
code pages (e.g. JIT)? (arm64 may have an i-cache aliasing problem.)

If it is problematic, zswap especially (without flush_dcache_page() in
zswap_frontswap_load()) may return corrupted data, and even swap-out
(compressing) may see the corrupted data sooner or later, I guess.

Thanks!





> Thanks.
>
> http://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?id=c2572f2b4ffc27ba79211aceee3bef53a59bb5cd
>
>
>
>


* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-25  4:06                     ` Chulmin Kim
@ 2017-01-25  4:25                       ` Sergey Senozhatsky
  2017-01-25  5:26                       ` Minchan Kim
  1 sibling, 0 replies; 49+ messages in thread
From: Sergey Senozhatsky @ 2017-01-25  4:25 UTC (permalink / raw)
  To: Chulmin Kim
  Cc: Minchan Kim, Andrew Morton, linux-mm, Dan Streetman,
	Seth Jennings, Sergey Senozhatsky, Sergey Senozhatsky

On (01/24/17 23:06), Chulmin Kim wrote:
[..]
> > > > Yeb, Thanks.
> > > > 
> > > > Perhaps, did you tried flush page before the writing?
> > > > I think arm64 have no d-cache alising problem but worth to try it.
> > > > Who knows :)
> > > 
> > > I thought that flush_dcache_page() is only for cases when we write
> > > to page (store that makes pages dirty), isn't it?
> > 
> > I think we need both because to see recent stores done by the user.
> > I'm not sure it should be done by block device driver rather than
> > page cache. Anyway, brd added it so worth to try it, I thought. :)

Cc Dan, Seth

(https://marc.info/?l=linux-mm&m=148514896820940)


> Thanks for the suggestion!
> It might be helpful
> though proving it is not easy as the problem appears rarely.
> 
> Have you thought about
> zram swap or zswap dealing with self modifying code pages (ex. JIT)?
> (arm64 may have i-cache aliasing problem)
> 
> If it is problematic,
> especiallly zswap (without flush_dcache_page in zswap_frontswap_load()) may
> provide the corrupted data
> and even swap out (compressing) may see the corrupted data sooner or later,
> i guess.

hm, interesting. there is a report of zswap_frontswap_load() failing to
decompress the page: https://marc.info/?l=linux-mm&m=148468457306971

	-ss


* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-25  4:06                     ` Chulmin Kim
  2017-01-25  4:25                       ` Sergey Senozhatsky
@ 2017-01-25  5:26                       ` Minchan Kim
  2017-01-26 17:04                         ` Dan Streetman
  1 sibling, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2017-01-25  5:26 UTC (permalink / raw)
  To: Chulmin Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-mm, Sergey Senozhatsky

On Tue, Jan 24, 2017 at 11:06:51PM -0500, Chulmin Kim wrote:
> On 01/23/2017 12:40 AM, Minchan Kim wrote:
> >On Mon, Jan 23, 2017 at 02:30:56PM +0900, Sergey Senozhatsky wrote:
> >>On (01/23/17 14:22), Minchan Kim wrote:
> >>[..]
> >>>>Anyway, I will let you know the situation when it gets more clear.
> >>>
> >>>Yeb, Thanks.
> >>>
> >>>Perhaps, did you tried flush page before the writing?
> >>>I think arm64 have no d-cache alising problem but worth to try it.
> >>>Who knows :)
> >>
> >>I thought that flush_dcache_page() is only for cases when we write
> >>to page (store that makes pages dirty), isn't it?
> >
> >I think we need both because to see recent stores done by the user.
> >I'm not sure it should be done by block device driver rather than
> >page cache. Anyway, brd added it so worth to try it, I thought. :)
> >
> 
> Thanks for the suggestion!
> It might be helpful
> though proving it is not easy as the problem appears rarely.
> 
> Have you thought about
> zram swap or zswap dealing with self modifying code pages (ex. JIT)?
> (arm64 may have i-cache aliasing problem)

It can happen, I think, although I don't know how arm64 handles it.

> 
> If it is problematic,
> especiallly zswap (without flush_dcache_page in zswap_frontswap_load()) may
> provide the corrupted data
> and even swap out (compressing) may see the corrupted data sooner or later,
> i guess.

try_to_unmap_one calls flush_cache_page, which I expect handles the
swap-out side, but for swap-in I think zswap needs flushing logic:
it is the first to touch the user buffer, so that is its responsibility.
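
As a rough sketch of where that flushing could sit (an illustration
only, not a patch posted in this thread; the zswap_* helpers below are
hypothetical stand-ins for zswap's real lookup/decompress/refcount
code, while kmap_atomic(), kunmap_atomic() and flush_dcache_page() are
the real kernel APIs):

/*
 * Illustrative sketch of a zswap-style swap-in path.  Error handling
 * and tree locking are elided; only the decompress-then-flush ordering
 * matters here.
 */
static int zswap_load_sketch(unsigned type, pgoff_t offset,
			     struct page *page)
{
	struct zswap_entry *entry;
	unsigned int dlen = PAGE_SIZE;
	u8 *dst;
	int ret;

	entry = zswap_find_entry(type, offset);		/* hypothetical */
	if (!entry)
		return -1;

	dst = kmap_atomic(page);			/* kernel alias of the page */
	ret = zswap_decompress(entry, dst, &dlen);	/* hypothetical */
	kunmap_atomic(dst);

	/*
	 * The kernel just stored to the page through its own mapping and
	 * user space will read it through a different virtual address,
	 * so flush before the page goes back to the user.
	 */
	flush_dcache_page(page);

	zswap_entry_release(entry);			/* hypothetical */
	return ret;
}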

Thanks.


* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-25  5:26                       ` Minchan Kim
@ 2017-01-26 17:04                         ` Dan Streetman
  2017-01-31  0:10                           ` Minchan Kim
  0 siblings, 1 reply; 49+ messages in thread
From: Dan Streetman @ 2017-01-26 17:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Chulmin Kim, Sergey Senozhatsky, Andrew Morton, Linux-MM,
	Sergey Senozhatsky

On Wed, Jan 25, 2017 at 12:26 AM, Minchan Kim <minchan@kernel.org> wrote:
> On Tue, Jan 24, 2017 at 11:06:51PM -0500, Chulmin Kim wrote:
>> On 01/23/2017 12:40 AM, Minchan Kim wrote:
>> >On Mon, Jan 23, 2017 at 02:30:56PM +0900, Sergey Senozhatsky wrote:
>> >>On (01/23/17 14:22), Minchan Kim wrote:
>> >>[..]
>> >>>>Anyway, I will let you know the situation when it gets more clear.
>> >>>
>> >>>Yeb, Thanks.
>> >>>
>> >>>Perhaps, did you tried flush page before the writing?
>> >>>I think arm64 have no d-cache alising problem but worth to try it.
>> >>>Who knows :)
>> >>
>> >>I thought that flush_dcache_page() is only for cases when we write
>> >>to page (store that makes pages dirty), isn't it?
>> >
>> >I think we need both because to see recent stores done by the user.
>> >I'm not sure it should be done by block device driver rather than
>> >page cache. Anyway, brd added it so worth to try it, I thought. :)
>> >
>>
>> Thanks for the suggestion!
>> It might be helpful
>> though proving it is not easy as the problem appears rarely.
>>
>> Have you thought about
>> zram swap or zswap dealing with self modifying code pages (ex. JIT)?
>> (arm64 may have i-cache aliasing problem)
>
> It can happen, I think, although I don't know how arm64 handles it.
>
>>
>> If it is problematic,
>> especiallly zswap (without flush_dcache_page in zswap_frontswap_load()) may
>> provide the corrupted data
>> and even swap out (compressing) may see the corrupted data sooner or later,
>> i guess.
>
> try_to_unmap_one calls flush_cache_page which I hope to handle swap-out side
> but for swap-in, I think zswap need flushing logic because it's first
> touch of the user buffer so it's his resposibility.

Hmm, I don't think zswap needs to, because all the cache aliases were
flushed when the page was written out.  After that, any access to the
page will cause a fault, and the fault will cause the page to be read
back in (via zswap).  I don't see how the page could be cached at any
time between the swap write-out and swap read-in, so there should be
no need to flush any caches when it's read back in; am I missing
something?


>
> Thanks.
>

* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-26 17:04                         ` Dan Streetman
@ 2017-01-31  0:10                           ` Minchan Kim
  2017-01-31 13:09                             ` Dan Streetman
  0 siblings, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2017-01-31  0:10 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Chulmin Kim, Sergey Senozhatsky, Andrew Morton, Linux-MM,
	Sergey Senozhatsky

Hi Dan,

On Thu, Jan 26, 2017 at 12:04:03PM -0500, Dan Streetman wrote:
> On Wed, Jan 25, 2017 at 12:26 AM, Minchan Kim <minchan@kernel.org> wrote:
> > On Tue, Jan 24, 2017 at 11:06:51PM -0500, Chulmin Kim wrote:
> >> On 01/23/2017 12:40 AM, Minchan Kim wrote:
> >> >On Mon, Jan 23, 2017 at 02:30:56PM +0900, Sergey Senozhatsky wrote:
> >> >>On (01/23/17 14:22), Minchan Kim wrote:
> >> >>[..]
> >> >>>>Anyway, I will let you know the situation when it gets more clear.
> >> >>>
> >> >>>Yeb, Thanks.
> >> >>>
> >> >>>Perhaps, did you tried flush page before the writing?
> >> >>>I think arm64 have no d-cache alising problem but worth to try it.
> >> >>>Who knows :)
> >> >>
> >> >>I thought that flush_dcache_page() is only for cases when we write
> >> >>to page (store that makes pages dirty), isn't it?
> >> >
> >> >I think we need both because to see recent stores done by the user.
> >> >I'm not sure it should be done by block device driver rather than
> >> >page cache. Anyway, brd added it so worth to try it, I thought. :)
> >> >
> >>
> >> Thanks for the suggestion!
> >> It might be helpful
> >> though proving it is not easy as the problem appears rarely.
> >>
> >> Have you thought about
> >> zram swap or zswap dealing with self modifying code pages (ex. JIT)?
> >> (arm64 may have i-cache aliasing problem)
> >
> > It can happen, I think, although I don't know how arm64 handles it.
> >
> >>
> >> If it is problematic,
> >> especiallly zswap (without flush_dcache_page in zswap_frontswap_load()) may
> >> provide the corrupted data
> >> and even swap out (compressing) may see the corrupted data sooner or later,
> >> i guess.
> >
> > try_to_unmap_one calls flush_cache_page which I hope to handle swap-out side
> > but for swap-in, I think zswap need flushing logic because it's first
> > touch of the user buffer so it's his resposibility.
> 
> Hmm, I don't think zswap needs to, because all the cache aliases were
> flushed when the page was written out.  After that, any access to the
> page will cause a fault, and the fault will cause the page to be read
> back in (via zswap).  I don't see how the page could be cached at any
> time between the swap write-out and swap read-in, so there should be
> no need to flush any caches when it's read back in; am I missing
> something?

Documentation/cachetlb.txt says

  void flush_dcache_page(struct page *page)

        Any time the kernel writes to a page cache page, _OR_
        the kernel is about to read from a page cache page and
        user space shared/writable mappings of this page potentially
        exist, this routine is called.

For the swap-in side, I don't see any logic to prevent the aliasing
problem. Consider other examples like cow_user_page->
copy_user_highpage: architectures which can alias provide
arch-specific functions that include the flushing.

IOW, if the kernel stores to a page which will be mapped at a user
space address, the kernel should call the flush function. Otherwise,
user space will miss the recent update from the kernel side.
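
For comparison, the shape of such an arch-specific helper is roughly
the following (an illustrative sketch only, not any architecture's
real copy_user_highpage(); copy_page(), kmap_atomic(), kunmap_atomic()
and flush_dcache_page() are the real APIs):

/*
 * Illustrative sketch: what an aliasing-aware copy_user_highpage()
 * boils down to.  Real implementations are per-architecture and use
 * @vaddr/@vma to decide exactly what to flush; the point here is only
 * "store through the kernel aliases, then flush for the user alias".
 */
static void sketch_copy_user_highpage(struct page *to, struct page *from,
				      unsigned long vaddr,
				      struct vm_area_struct *vma)
{
	void *kto = kmap_atomic(to);
	void *kfrom = kmap_atomic(from);

	copy_page(kto, kfrom);		/* stores go through the kernel aliases */
	kunmap_atomic(kfrom);
	kunmap_atomic(kto);

	flush_dcache_page(to);		/* make the data visible at the user alias */
}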

Thanks.


* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-31  0:10                           ` Minchan Kim
@ 2017-01-31 13:09                             ` Dan Streetman
  2017-02-01  6:51                               ` Minchan Kim
  2017-02-02  8:48                               ` Minchan Kim
  0 siblings, 2 replies; 49+ messages in thread
From: Dan Streetman @ 2017-01-31 13:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Chulmin Kim, Sergey Senozhatsky, Andrew Morton, Linux-MM,
	Sergey Senozhatsky

On Mon, Jan 30, 2017 at 7:10 PM, Minchan Kim <minchan@kernel.org> wrote:
> Hi Dan,
>
> On Thu, Jan 26, 2017 at 12:04:03PM -0500, Dan Streetman wrote:
>> On Wed, Jan 25, 2017 at 12:26 AM, Minchan Kim <minchan@kernel.org> wrote:
>> > On Tue, Jan 24, 2017 at 11:06:51PM -0500, Chulmin Kim wrote:
>> >> On 01/23/2017 12:40 AM, Minchan Kim wrote:
>> >> >On Mon, Jan 23, 2017 at 02:30:56PM +0900, Sergey Senozhatsky wrote:
>> >> >>On (01/23/17 14:22), Minchan Kim wrote:
>> >> >>[..]
>> >> >>>>Anyway, I will let you know the situation when it gets more clear.
>> >> >>>
>> >> >>>Yeb, Thanks.
>> >> >>>
>> >> >>>Perhaps, did you tried flush page before the writing?
>> >> >>>I think arm64 have no d-cache alising problem but worth to try it.
>> >> >>>Who knows :)
>> >> >>
>> >> >>I thought that flush_dcache_page() is only for cases when we write
>> >> >>to page (store that makes pages dirty), isn't it?
>> >> >
>> >> >I think we need both because to see recent stores done by the user.
>> >> >I'm not sure it should be done by block device driver rather than
>> >> >page cache. Anyway, brd added it so worth to try it, I thought. :)
>> >> >
>> >>
>> >> Thanks for the suggestion!
>> >> It might be helpful
>> >> though proving it is not easy as the problem appears rarely.
>> >>
>> >> Have you thought about
>> >> zram swap or zswap dealing with self modifying code pages (ex. JIT)?
>> >> (arm64 may have i-cache aliasing problem)
>> >
>> > It can happen, I think, although I don't know how arm64 handles it.
>> >
>> >>
>> >> If it is problematic,
>> >> especiallly zswap (without flush_dcache_page in zswap_frontswap_load()) may
>> >> provide the corrupted data
>> >> and even swap out (compressing) may see the corrupted data sooner or later,
>> >> i guess.
>> >
>> > try_to_unmap_one calls flush_cache_page which I hope to handle swap-out side
>> > but for swap-in, I think zswap need flushing logic because it's first
>> > touch of the user buffer so it's his resposibility.
>>
>> Hmm, I don't think zswap needs to, because all the cache aliases were
>> flushed when the page was written out.  After that, any access to the
>> page will cause a fault, and the fault will cause the page to be read
>> back in (via zswap).  I don't see how the page could be cached at any
>> time between the swap write-out and swap read-in, so there should be
>> no need to flush any caches when it's read back in; am I missing
>> something?
>
> Documentation/cachetlb.txt says
>
>   void flush_dcache_page(struct page *page)
>
>         Any time the kernel writes to a page cache page, _OR_
>         the kernel is about to read from a page cache page and
>         user space shared/writable mappings of this page potentially
>         exist, this routine is called.
>
> For swap-in side, I don't see any logic to prevent the aliasing
> problem. Let's consider other examples like cow_user_page->
> copy_user_highpage. For architectures which can make aliasing,
> it has arch specific functions which has flushing function.

COW works with a page that has a physical backing.  swap-in does not.
COW pages can be accessed normally; swapped out pages cannot.

>
> IOW, if a kernel makes store operation to the page which will
> be mapped to user space address, kernel should call flush function.
> Otherwise, user space will miss recent update from kernel side.

as I said before, when it's swapped out caches are flushed, and the
page mapping invalidated, so it will cause a fault on any access, and
thus cause swap to re-load the page from disk (or zswap).  So how
would a cache of the page be created after swap-out, but before
swap-in?  It's not possible for user space to have any caches to the
page, unless (as I said) I'm missing something?


>
> Thanks.


* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-31 13:09                             ` Dan Streetman
@ 2017-02-01  6:51                               ` Minchan Kim
  2017-02-01 19:38                                 ` Dan Streetman
  2017-02-02  8:48                               ` Minchan Kim
  1 sibling, 1 reply; 49+ messages in thread
From: Minchan Kim @ 2017-02-01  6:51 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Chulmin Kim, Sergey Senozhatsky, Andrew Morton, Linux-MM,
	Sergey Senozhatsky

Hi Dan,

On Tue, Jan 31, 2017 at 08:09:53AM -0500, Dan Streetman wrote:
> On Mon, Jan 30, 2017 at 7:10 PM, Minchan Kim <minchan@kernel.org> wrote:
> > Hi Dan,
> >
> > On Thu, Jan 26, 2017 at 12:04:03PM -0500, Dan Streetman wrote:
> >> On Wed, Jan 25, 2017 at 12:26 AM, Minchan Kim <minchan@kernel.org> wrote:
> >> > On Tue, Jan 24, 2017 at 11:06:51PM -0500, Chulmin Kim wrote:
> >> >> On 01/23/2017 12:40 AM, Minchan Kim wrote:
> >> >> >On Mon, Jan 23, 2017 at 02:30:56PM +0900, Sergey Senozhatsky wrote:
> >> >> >>On (01/23/17 14:22), Minchan Kim wrote:
> >> >> >>[..]
> >> >> >>>>Anyway, I will let you know the situation when it gets more clear.
> >> >> >>>
> >> >> >>>Yeb, Thanks.
> >> >> >>>
> >> >> >>>Perhaps, did you tried flush page before the writing?
> >> >> >>>I think arm64 have no d-cache alising problem but worth to try it.
> >> >> >>>Who knows :)
> >> >> >>
> >> >> >>I thought that flush_dcache_page() is only for cases when we write
> >> >> >>to page (store that makes pages dirty), isn't it?
> >> >> >
> >> >> >I think we need both because to see recent stores done by the user.
> >> >> >I'm not sure it should be done by block device driver rather than
> >> >> >page cache. Anyway, brd added it so worth to try it, I thought. :)
> >> >> >
> >> >>
> >> >> Thanks for the suggestion!
> >> >> It might be helpful
> >> >> though proving it is not easy as the problem appears rarely.
> >> >>
> >> >> Have you thought about
> >> >> zram swap or zswap dealing with self modifying code pages (ex. JIT)?
> >> >> (arm64 may have i-cache aliasing problem)
> >> >
> >> > It can happen, I think, although I don't know how arm64 handles it.
> >> >
> >> >>
> >> >> If it is problematic,
> >> >> especiallly zswap (without flush_dcache_page in zswap_frontswap_load()) may
> >> >> provide the corrupted data
> >> >> and even swap out (compressing) may see the corrupted data sooner or later,
> >> >> i guess.
> >> >
> >> > try_to_unmap_one calls flush_cache_page which I hope to handle swap-out side
> >> > but for swap-in, I think zswap need flushing logic because it's first
> >> > touch of the user buffer so it's his resposibility.
> >>
> >> Hmm, I don't think zswap needs to, because all the cache aliases were
> >> flushed when the page was written out.  After that, any access to the
> >> page will cause a fault, and the fault will cause the page to be read
> >> back in (via zswap).  I don't see how the page could be cached at any
> >> time between the swap write-out and swap read-in, so there should be
> >> no need to flush any caches when it's read back in; am I missing
> >> something?
> >
> > Documentation/cachetlb.txt says
> >
> >   void flush_dcache_page(struct page *page)
> >
> >         Any time the kernel writes to a page cache page, _OR_
> >         the kernel is about to read from a page cache page and
> >         user space shared/writable mappings of this page potentially
> >         exist, this routine is called.
> >
> > For swap-in side, I don't see any logic to prevent the aliasing
> > problem. Let's consider other examples like cow_user_page->
> > copy_user_highpage. For architectures which can make aliasing,
> > it has arch specific functions which has flushing function.
> 
> COW works with a page that has a physical backing.  swap-in does not.
> COW pages can be accessed normally; swapped out pages cannot.
> 
> >
> > IOW, if a kernel makes store operation to the page which will
> > be mapped to user space address, kernel should call flush function.
> > Otherwise, user space will miss recent update from kernel side.
> 
> as I said before, when it's swapped out caches are flushed, and the
> page mapping invalidated, so it will cause a fault on any access, and
> thus cause swap to re-load the page from disk (or zswap).  So how
> would a cache of the page be created after swap-out, but before
> swap-in?  It's not possible for user space to have any caches to the
> page, unless (as I said) I'm missing something?

I'm talking about the H/W cache, not a S/W cache.
Think of a VIVT architecture: the virtual address the kernel uses for
the store is different from the one user space will use, so it is a
cache-aliasing candidate.

Thanks.


* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-02-01  6:51                               ` Minchan Kim
@ 2017-02-01 19:38                                 ` Dan Streetman
  0 siblings, 0 replies; 49+ messages in thread
From: Dan Streetman @ 2017-02-01 19:38 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Chulmin Kim, Sergey Senozhatsky, Andrew Morton, Linux-MM,
	Sergey Senozhatsky

On Wed, Feb 1, 2017 at 1:51 AM, Minchan Kim <minchan@kernel.org> wrote:
> Hi Dan,
>
> On Tue, Jan 31, 2017 at 08:09:53AM -0500, Dan Streetman wrote:
>> On Mon, Jan 30, 2017 at 7:10 PM, Minchan Kim <minchan@kernel.org> wrote:
>> > Hi Dan,
>> >
>> > On Thu, Jan 26, 2017 at 12:04:03PM -0500, Dan Streetman wrote:
>> >> On Wed, Jan 25, 2017 at 12:26 AM, Minchan Kim <minchan@kernel.org> wrote:
>> >> > On Tue, Jan 24, 2017 at 11:06:51PM -0500, Chulmin Kim wrote:
>> >> >> On 01/23/2017 12:40 AM, Minchan Kim wrote:
>> >> >> >On Mon, Jan 23, 2017 at 02:30:56PM +0900, Sergey Senozhatsky wrote:
>> >> >> >>On (01/23/17 14:22), Minchan Kim wrote:
>> >> >> >>[..]
>> >> >> >>>>Anyway, I will let you know the situation when it gets more clear.
>> >> >> >>>
>> >> >> >>>Yeb, Thanks.
>> >> >> >>>
>> >> >> >>>Perhaps, did you tried flush page before the writing?
>> >> >> >>>I think arm64 have no d-cache alising problem but worth to try it.
>> >> >> >>>Who knows :)
>> >> >> >>
>> >> >> >>I thought that flush_dcache_page() is only for cases when we write
>> >> >> >>to page (store that makes pages dirty), isn't it?
>> >> >> >
>> >> >> >I think we need both because to see recent stores done by the user.
>> >> >> >I'm not sure it should be done by block device driver rather than
>> >> >> >page cache. Anyway, brd added it so worth to try it, I thought. :)
>> >> >> >
>> >> >>
>> >> >> Thanks for the suggestion!
>> >> >> It might be helpful
>> >> >> though proving it is not easy as the problem appears rarely.
>> >> >>
>> >> >> Have you thought about
>> >> >> zram swap or zswap dealing with self modifying code pages (ex. JIT)?
>> >> >> (arm64 may have i-cache aliasing problem)
>> >> >
>> >> > It can happen, I think, although I don't know how arm64 handles it.
>> >> >
>> >> >>
>> >> >> If it is problematic,
>> >> >> especiallly zswap (without flush_dcache_page in zswap_frontswap_load()) may
>> >> >> provide the corrupted data
>> >> >> and even swap out (compressing) may see the corrupted data sooner or later,
>> >> >> i guess.
>> >> >
>> >> > try_to_unmap_one calls flush_cache_page which I hope to handle swap-out side
>> >> > but for swap-in, I think zswap need flushing logic because it's first
>> >> > touch of the user buffer so it's his resposibility.
>> >>
>> >> Hmm, I don't think zswap needs to, because all the cache aliases were
>> >> flushed when the page was written out.  After that, any access to the
>> >> page will cause a fault, and the fault will cause the page to be read
>> >> back in (via zswap).  I don't see how the page could be cached at any
>> >> time between the swap write-out and swap read-in, so there should be
>> >> no need to flush any caches when it's read back in; am I missing
>> >> something?
>> >
>> > Documentation/cachetlb.txt says
>> >
>> >   void flush_dcache_page(struct page *page)
>> >
>> >         Any time the kernel writes to a page cache page, _OR_
>> >         the kernel is about to read from a page cache page and
>> >         user space shared/writable mappings of this page potentially
>> >         exist, this routine is called.
>> >
>> > For swap-in side, I don't see any logic to prevent the aliasing
>> > problem. Let's consider other examples like cow_user_page->
>> > copy_user_highpage. For architectures which can make aliasing,
>> > it has arch specific functions which has flushing function.
>>
>> COW works with a page that has a physical backing.  swap-in does not.
>> COW pages can be accessed normally; swapped out pages cannot.
>>
>> >
>> > IOW, if a kernel makes store operation to the page which will
>> > be mapped to user space address, kernel should call flush function.
>> > Otherwise, user space will miss recent update from kernel side.
>>
>> as I said before, when it's swapped out caches are flushed, and the
>> page mapping invalidated, so it will cause a fault on any access, and
>> thus cause swap to re-load the page from disk (or zswap).  So how
>> would a cache of the page be created after swap-out, but before
>> swap-in?  It's not possible for user space to have any caches to the
>> page, unless (as I said) I'm missing something?
>
> I'm saying H/W cache, not S/W cache.
> Please think over VIVT architecture. The virtual address kernel is using
> for store is different with the one user will use so it's cache-aliasing
> candidate.

Sorry, I'm still not seeing how it's possible; maybe I'm missing
details of what you are thinking.  Can you give a specific example of
how a user-space H/W cache entry for the page can be created while the
page is swapped out?  Chulmin, do you already have an example of how
it could happen?


>
> Thanks.


* Re: [PATCH v7 11/12] zsmalloc: page migration support
  2017-01-31 13:09                             ` Dan Streetman
  2017-02-01  6:51                               ` Minchan Kim
@ 2017-02-02  8:48                               ` Minchan Kim
  1 sibling, 0 replies; 49+ messages in thread
From: Minchan Kim @ 2017-02-02  8:48 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Chulmin Kim, Sergey Senozhatsky, Andrew Morton, Linux-MM,
	Sergey Senozhatsky

Hi Dan,

On Tue, Jan 31, 2017 at 08:09:53AM -0500, Dan Streetman wrote:
> On Mon, Jan 30, 2017 at 7:10 PM, Minchan Kim <minchan@kernel.org> wrote:
> > Hi Dan,
> >
> > On Thu, Jan 26, 2017 at 12:04:03PM -0500, Dan Streetman wrote:
> >> On Wed, Jan 25, 2017 at 12:26 AM, Minchan Kim <minchan@kernel.org> wrote:
> >> > On Tue, Jan 24, 2017 at 11:06:51PM -0500, Chulmin Kim wrote:
> >> >> On 01/23/2017 12:40 AM, Minchan Kim wrote:
> >> >> >On Mon, Jan 23, 2017 at 02:30:56PM +0900, Sergey Senozhatsky wrote:
> >> >> >>On (01/23/17 14:22), Minchan Kim wrote:
> >> >> >>[..]
> >> >> >>>>Anyway, I will let you know the situation when it gets more clear.
> >> >> >>>
> >> >> >>>Yeb, Thanks.
> >> >> >>>
> >> >> >>>Perhaps, did you tried flush page before the writing?
> >> >> >>>I think arm64 have no d-cache alising problem but worth to try it.
> >> >> >>>Who knows :)
> >> >> >>
> >> >> >>I thought that flush_dcache_page() is only for cases when we write
> >> >> >>to page (store that makes pages dirty), isn't it?
> >> >> >
> >> >> >I think we need both because to see recent stores done by the user.
> >> >> >I'm not sure it should be done by block device driver rather than
> >> >> >page cache. Anyway, brd added it so worth to try it, I thought. :)
> >> >> >
> >> >>
> >> >> Thanks for the suggestion!
> >> >> It might be helpful
> >> >> though proving it is not easy as the problem appears rarely.
> >> >>
> >> >> Have you thought about
> >> >> zram swap or zswap dealing with self modifying code pages (ex. JIT)?
> >> >> (arm64 may have i-cache aliasing problem)
> >> >
> >> > It can happen, I think, although I don't know how arm64 handles it.
> >> >
> >> >>
> >> >> If it is problematic,
> >> >> especiallly zswap (without flush_dcache_page in zswap_frontswap_load()) may
> >> >> provide the corrupted data
> >> >> and even swap out (compressing) may see the corrupted data sooner or later,
> >> >> i guess.
> >> >
> >> > try_to_unmap_one calls flush_cache_page which I hope to handle swap-out side
> >> > but for swap-in, I think zswap need flushing logic because it's first
> >> > touch of the user buffer so it's his resposibility.
> >>
> >> Hmm, I don't think zswap needs to, because all the cache aliases were
> >> flushed when the page was written out.  After that, any access to the
> >> page will cause a fault, and the fault will cause the page to be read
> >> back in (via zswap).  I don't see how the page could be cached at any
> >> time between the swap write-out and swap read-in, so there should be
> >> no need to flush any caches when it's read back in; am I missing
> >> something?
> >
> > Documentation/cachetlb.txt says
> >
> >   void flush_dcache_page(struct page *page)
> >
> >         Any time the kernel writes to a page cache page, _OR_
> >         the kernel is about to read from a page cache page and
> >         user space shared/writable mappings of this page potentially
> >         exist, this routine is called.
> >
> > For swap-in side, I don't see any logic to prevent the aliasing
> > problem. Let's consider other examples like cow_user_page->
> > copy_user_highpage. For architectures which can make aliasing,
> > it has arch specific functions which has flushing function.
> 
> COW works with a page that has a physical backing.  swap-in does not.
> COW pages can be accessed normally; swapped out pages cannot.
> 
> >
> > IOW, if a kernel makes store operation to the page which will
> > be mapped to user space address, kernel should call flush function.
> > Otherwise, user space will miss recent update from kernel side.
> 
> as I said before, when it's swapped out caches are flushed, and the
> page mapping invalidated, so it will cause a fault on any access, and
> thus cause swap to re-load the page from disk (or zswap).  So how
> would a cache of the page be created after swap-out, but before
> swap-in?  It's not possible for user space to have any caches to the
> page, unless (as I said) I'm missing something?
> 

Let's assume a VIVT architecture, which takes both index and tag from
the virtual address.

In zswap_frontswap_load, assume dst is the kernel virtual address
0xc0002000, so the cache line for 0xc0002000 holds the recent
up-to-date data. Now the VM maps the page into the user address space
at, say, 0x80003000. In that case, the userland application tries to
read the data through 0x80003000, but that address is associated with
another cache line, so it cannot see the up-to-date data just written
by zswap.

flush_dcache_page will handle both the kernel side and the user side.
I'm not sure I'm explaining it well. :-(
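
If it helps, the index mismatch in this example can be shown with a
toy calculation (a stand-alone user-space illustration, not kernel
code; the cache geometry of 32-byte lines and 256 sets per way is an
arbitrary assumption chosen so that the set index uses one more bit
than the 4KB page offset):

#include <stdio.h>

#define LINE_SIZE	32UL	/* assumed cache line size        */
#define NUM_SETS	256UL	/* assumed sets per way (8KB/way) */

/* Set index of a virtually indexed cache: address bits [12:5]. */
static unsigned long set_index(unsigned long vaddr)
{
	return (vaddr / LINE_SIZE) % NUM_SETS;
}

int main(void)
{
	unsigned long kaddr = 0xc0002000UL;	/* kernel alias zswap wrote through */
	unsigned long uaddr = 0x80003000UL;	/* user alias of the same page */

	/* Prints set 0 for the kernel alias and set 128 for the user
	 * alias: the freshly written data sits in cache lines the
	 * user-space read never looks at until a flush. */
	printf("kernel alias 0x%lx -> set %lu\n", kaddr, set_index(kaddr));
	printf("user   alias 0x%lx -> set %lu\n", uaddr, set_index(uaddr));
	return 0;
}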

Thanks.



Thread overview: 49+ messages
2016-05-31 23:21 [PATCH v7 00/12] Support non-lru page migration Minchan Kim
2016-05-31 23:21 ` [PATCH v7 01/12] mm: use put_page to free page instead of putback_lru_page Minchan Kim
2016-05-31 23:21 ` [PATCH v7 02/12] mm: migrate: support non-lru movable page migration Minchan Kim
2016-05-31 23:21 ` [PATCH v7 03/12] mm: balloon: use general non-lru movable page feature Minchan Kim
2016-05-31 23:21 ` [PATCH v7 04/12] zsmalloc: keep max_object in size_class Minchan Kim
2016-05-31 23:21 ` [PATCH v7 05/12] zsmalloc: use bit_spin_lock Minchan Kim
2016-05-31 23:21 ` [PATCH v7 06/12] zsmalloc: use accessor Minchan Kim
2016-05-31 23:21 ` [PATCH v7 07/12] zsmalloc: factor page chain functionality out Minchan Kim
2016-05-31 23:21 ` [PATCH v7 08/12] zsmalloc: introduce zspage structure Minchan Kim
2016-05-31 23:21 ` [PATCH v7 09/12] zsmalloc: separate free_zspage from putback_zspage Minchan Kim
2016-05-31 23:21 ` [PATCH v7 10/12] zsmalloc: use freeobj for index Minchan Kim
2016-05-31 23:21 ` [PATCH v7 11/12] zsmalloc: page migration support Minchan Kim
2016-06-01 14:09   ` Vlastimil Babka
2016-06-02  0:25     ` Minchan Kim
2016-06-02 11:44       ` Vlastimil Babka
2016-06-01 21:39   ` Andrew Morton
2016-06-02  0:15     ` Minchan Kim
     [not found]   ` <CGME20170119001317epcas1p188357c77e1f4ff08b6d3dcb76dedca06@epcas1p1.samsung.com>
2017-01-19  0:13     ` Chulmin Kim
2017-01-19  2:44       ` Minchan Kim
2017-01-19  3:39         ` Chulmin Kim
2017-01-19  6:21           ` Minchan Kim
2017-01-19  8:16             ` Chulmin Kim
2017-01-23  5:22               ` Minchan Kim
2017-01-23  5:30                 ` Sergey Senozhatsky
2017-01-23  5:40                   ` Minchan Kim
2017-01-25  4:06                     ` Chulmin Kim
2017-01-25  4:25                       ` Sergey Senozhatsky
2017-01-25  5:26                       ` Minchan Kim
2017-01-26 17:04                         ` Dan Streetman
2017-01-31  0:10                           ` Minchan Kim
2017-01-31 13:09                             ` Dan Streetman
2017-02-01  6:51                               ` Minchan Kim
2017-02-01 19:38                                 ` Dan Streetman
2017-02-02  8:48                               ` Minchan Kim
2016-05-31 23:21 ` [PATCH v7 12/12] zram: use __GFP_MOVABLE for memory allocation Minchan Kim
2016-06-01 21:41 ` [PATCH v7 00/12] Support non-lru page migration Andrew Morton
2016-06-01 22:40   ` Daniel Vetter
2016-06-02  0:36   ` Minchan Kim
2016-06-15  7:59 ` Sergey Senozhatsky
2016-06-15 23:12   ` Minchan Kim
2016-06-16  2:48     ` Sergey Senozhatsky
2016-06-16  2:58       ` Minchan Kim
2016-06-16  4:23         ` Sergey Senozhatsky
2016-06-16  4:47           ` Minchan Kim
2016-06-16  5:22             ` Sergey Senozhatsky
2016-06-16  6:47               ` Minchan Kim
2016-06-16  8:42                 ` Sergey Senozhatsky
2016-06-16 10:09                   ` Minchan Kim
2016-06-17  7:28                     ` Joonsoo Kim
