[RFC] mm: add support for zsmalloc and zcache
diff mbox series

Message ID 1346794486-12107-1-git-send-email-sjenning@linux.vnet.ibm.com
State New, archived
Headers show
Series
  • [RFC] mm: add support for zsmalloc and zcache
Related show

Commit Message

Seth Jennings Sept. 4, 2012, 9:34 p.m. UTC
zcache is the remaining piece of code required to support in-kernel
memory compression.  The other two features, cleancache and frontswap,
have been promoted to mainline in 3.0 and 3.5 respectively.  This
patchset promotes zcache from the staging tree to mainline.

Based on the level of activity and contributions we're seeing from a
diverse set of people and interests, I think zcache has matured to the
point where it makes sense to promote this out of staging.

Overview

Comments

Konrad Rzeszutek Wilk Sept. 7, 2012, 2:37 p.m. UTC | #1
> significant design challenges exist, many of which are already resolved in
> the new codebase ("zcache2").  These design issues include:
.. snip..
> Before other key mm maintainers read and comment on zcache, I think
> it would be most wise to move to a codebase which resolves the known design
> problems or, at least to thoroughly discuss and debunk the design issues
> described above.  OR... it may be possible to identify and pursue some
> compromise plan.  In any case, I believe the promotion proposal is premature.

Thank you for the feedback!

I took your comments and pasted them in this patch.

Seth, Robert, Minchan, Nitin, can you guys provide some comments pls,
so we can put them as a TODO pls or modify the patch below.

Oh, I think I forgot Andrew's comment which was:

 - Explain which workloads this benefits and provide some benchmark data.
   This should help in narrowing down in which case we know zcache works
   well and in which it does not.

My TODO's were:

 - Figure out (this could be - and perhaps should be in frontswap) a
   determination whether this swap is quite fast and the CPU is slow
   (or taxed quite heavily now), so as to not slow the currently executing
   workloads.
 - Work out automatic benchmarks in three categories: database (I am going to use
   swing for that), compile (that one is easy), and firefox tab browsers
   overloading.


>From bd85d5fa0cc231f2779f3209ee62b755caf3aa9b Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 7 Sep 2012 10:21:01 -0400
Subject: [PATCH] zsmalloc/zcache: TODO list.

Adding in comments by Dan.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/staging/zcache/TODO   |   21 +++++++++++++++++++++
 drivers/staging/zsmalloc/TODO |   17 +++++++++++++++++
 2 files changed, 38 insertions(+), 0 deletions(-)
 create mode 100644 drivers/staging/zcache/TODO
 create mode 100644 drivers/staging/zsmalloc/TODO

diff --git a/drivers/staging/zcache/TODO b/drivers/staging/zcache/TODO
new file mode 100644
index 0000000..bf19a01
--- /dev/null
+++ b/drivers/staging/zcache/TODO
@@ -0,0 +1,21 @@
+
+A) Andrea Arcangeli pointed out and, after some deep thinking, I came
+   to agree that zcache _must_ have some "backdoor exit" for frontswap
+   pages [2], else bad things will eventually happen in many workloads.
+   This requires some kind of reaper of frontswap'ed zpages[1] which "evicts"
+   the data to the actual swap disk.  This reaper must ensure it can reclaim
+   _full_ pageframes (not just zpages) or it has little value.  Further the
+   reaper should determine which pageframes to reap based on an LRU-ish
+   (not random) approach.
+
+B) Zcache uses zbud(v1) for cleancache pages and includes a shrinker which
+   reclaims pairs of zpages to release whole pageframes, but there is
+   no attempt to shrink/reclaim cleanache pageframes in LRU order.
+   It would also be nice if single-cleancache-pageframe reclaim could
+   be implemented.
+
+C) Offer a mechanism to select whether zbud or zsmalloc should be used.
+   This should be for either cleancache or frontswap pages. Meaning there
+   are four choices: cleancache and frontswap using zbud; cleancache and
+   frontswap using zsmalloc; cleancache using zsmalloc, frontswap using zbud;
+   cleacache using zbud, and frontswap using zsmalloc.
diff --git a/drivers/staging/zsmalloc/TODO b/drivers/staging/zsmalloc/TODO
new file mode 100644
index 0000000..b1debad
--- /dev/null
+++ b/drivers/staging/zsmalloc/TODO
@@ -0,0 +1,17 @@
+
+A) Zsmalloc has potentially far superior density vs zbud because zsmalloc can
+   pack more zpages into each pageframe and allows for zpages that cross pageframe
+   boundaries.  But, (i) this is very data dependent... the average compression
+   for LZO is about 2x.  The frontswap'ed pages in the kernel compile benchmark
+   compress to about 4x, which is impressive but probably not representative of
+   a wide range of zpages and workloads.  And (ii) there are many historical
+   discussions going back to Knuth and mainframes about tight packing of data...
+   high density has some advantages but also brings many disadvantages related to
+   fragmentation and compaction.  Zbud is much less aggressive (max two zpages
+   per pageframe) but has a similar density on average data, without the
+   disadvantages of high density.
+
+   So zsmalloc may blow zbud away on a kernel compile benchmark but, if both were
+   runners, zsmalloc is a sprinter and zbud is a marathoner.  Perhaps the best
+   solution is to offer both?
+
Nitin Gupta Sept. 9, 2012, 3:46 a.m. UTC | #2
On 09/07/2012 07:37 AM, Konrad Rzeszutek Wilk wrote:
>> significant design challenges exist, many of which are already resolved in
>> the new codebase ("zcache2").  These design issues include:
> .. snip..
>> Before other key mm maintainers read and comment on zcache, I think
>> it would be most wise to move to a codebase which resolves the known design
>> problems or, at least to thoroughly discuss and debunk the design issues
>> described above.  OR... it may be possible to identify and pursue some
>> compromise plan.  In any case, I believe the promotion proposal is premature.
>
> Thank you for the feedback!
>
> I took your comments and pasted them in this patch.
>
> Seth, Robert, Minchan, Nitin, can you guys provide some comments pls,
> so we can put them as a TODO pls or modify the patch below.
>
> Oh, I think I forgot Andrew's comment which was:
>
>   - Explain which workloads this benefits and provide some benchmark data.
>     This should help in narrowing down in which case we know zcache works
>     well and in which it does not.
>
> My TODO's were:
>
>   - Figure out (this could be - and perhaps should be in frontswap) a
>     determination whether this swap is quite fast and the CPU is slow
>     (or taxed quite heavily now), so as to not slow the currently executing
>     workloads.
>   - Work out automatic benchmarks in three categories: database (I am going to use
>     swing for that), compile (that one is easy), and firefox tab browsers
>     overloading.
>
>
>  From bd85d5fa0cc231f2779f3209ee62b755caf3aa9b Mon Sep 17 00:00:00 2001
> From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> Date: Fri, 7 Sep 2012 10:21:01 -0400
> Subject: [PATCH] zsmalloc/zcache: TODO list.
>
> Adding in comments by Dan.
>
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>   drivers/staging/zcache/TODO   |   21 +++++++++++++++++++++
>   drivers/staging/zsmalloc/TODO |   17 +++++++++++++++++
>   2 files changed, 38 insertions(+), 0 deletions(-)
>   create mode 100644 drivers/staging/zcache/TODO
>   create mode 100644 drivers/staging/zsmalloc/TODO
>
> diff --git a/drivers/staging/zcache/TODO b/drivers/staging/zcache/TODO
> new file mode 100644
> index 0000000..bf19a01
> --- /dev/null
> +++ b/drivers/staging/zcache/TODO
> @@ -0,0 +1,21 @@
> +
> +A) Andrea Arcangeli pointed out and, after some deep thinking, I came
> +   to agree that zcache _must_ have some "backdoor exit" for frontswap
> +   pages [2], else bad things will eventually happen in many workloads.
> +   This requires some kind of reaper of frontswap'ed zpages[1] which "evicts"
> +   the data to the actual swap disk.  This reaper must ensure it can reclaim
> +   _full_ pageframes (not just zpages) or it has little value.  Further the
> +   reaper should determine which pageframes to reap based on an LRU-ish
> +   (not random) approach.
> +
> +B) Zcache uses zbud(v1) for cleancache pages and includes a shrinker which
> +   reclaims pairs of zpages to release whole pageframes, but there is
> +   no attempt to shrink/reclaim cleanache pageframes in LRU order.
> +   It would also be nice if single-cleancache-pageframe reclaim could
> +   be implemented.
> +
> +C) Offer a mechanism to select whether zbud or zsmalloc should be used.
> +   This should be for either cleancache or frontswap pages. Meaning there
> +   are four choices: cleancache and frontswap using zbud; cleancache and
> +   frontswap using zsmalloc; cleancache using zsmalloc, frontswap using zbud;
> +   cleacache using zbud, and frontswap using zsmalloc.
> diff --git a/drivers/staging/zsmalloc/TODO b/drivers/staging/zsmalloc/TODO
> new file mode 100644
> index 0000000..b1debad
> --- /dev/null
> +++ b/drivers/staging/zsmalloc/TODO
> @@ -0,0 +1,17 @@
> +
> +A) Zsmalloc has potentially far superior density vs zbud because zsmalloc can
> +   pack more zpages into each pageframe and allows for zpages that cross pageframe
> +   boundaries.  But, (i) this is very data dependent... the average compression
> +   for LZO is about 2x.  The frontswap'ed pages in the kernel compile benchmark
> +   compress to about 4x, which is impressive but probably not representative of
> +   a wide range of zpages and workloads.  And (ii) there are many historical
> +   discussions going back to Knuth and mainframes about tight packing of data...
> +   high density has some advantages but also brings many disadvantages related to
> +   fragmentation and compaction.  Zbud is much less aggressive (max two zpages
> +   per pageframe) but has a similar density on average data, without the
> +   disadvantages of high density.
> +
> +   So zsmalloc may blow zbud away on a kernel compile benchmark but, if both were
> +   runners, zsmalloc is a sprinter and zbud is a marathoner.  Perhaps the best
> +   solution is to offer both?
> +
>

The problem is that zbud performs well only when a (compressed) page is 
either PAGE_SIZE/2 - e or PAGE_SIZE - e, where e is small. So, even if 
the average compression ratio is 2x (which is hard to believe), a 
majority of sizes can actually end up in PAGE_SIZE/2 + e bucket and zbud 
will still give bad performance.  For instance, consider these histograms:

# Created tar of /usr/lib (2GB) on a fairly loaded Linux system and 
compressed page-by-page using LZO:

# first two fields: bin start, end.  Third field: compressed size
32 286 7644
286 540 4226
540 794 11868
794 1048 20356
1048 1302 43443
1302 1556 39374
1556 1810 32903
1810 2064 37631
2064 2318 42400
2318 2572 51921
2572 2826 56255
2826 3080 59346
3080 3334 36545
3334 3588 12872
3588 3842 6513
3842 4096 3482

The only (approx) sweetspots for zbud are 1810-2064 and 3842-4096 which 
covers only a small fraction of pages.

# same page-by-page compression for 220MB ISO from project Gutenberg:
32 286 70
286 540 68
540 794 43
794 1048 36
1048 1302 46
1302 1556 78
1556 1810 142
1810 2064 244
2064 2318 561
2318 2572 1272
2572 2826 3889
2826 3080 17826
3080 3334 3896
3334 3588 358
3588 3842 465
3842 4096 804

Again very few pages in zbud favoring bins.

So, we really need zsmalloc style allocator which handles sizes all over 
the spectrum. But yes, compaction remains far easier to implement on zbud.

Thanks,
Nitin

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Sept. 17, 2012, 8:42 p.m. UTC | #3
> From: Nitin Gupta [mailto:ngupta@vflare.org]
> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> 
> The problem is that zbud performs well only when a (compressed) page is
> either PAGE_SIZE/2 - e or PAGE_SIZE - e, where e is small. So, even if
> the average compression ratio is 2x (which is hard to believe), a
> majority of sizes can actually end up in PAGE_SIZE/2 + e bucket and zbud
> will still give bad performance.  For instance, consider these histograms:

Whoa whoa whoa.  This is very wrong.  Zbud handles compressed pages
of any range that fits in a pageframe (same, almost, as zsmalloc).
Unless there is some horrible bug you found...

Zbud _does_ require the _distribution_ of zsize to be roughly
centered around PAGE_SIZE/2 (or less).  Is that what you meant?
If so, the following numbers you posted don't make sense to me.
Could you be more explicit on what the numbers mean?

Also, as you know, unlike zram, the architecture of tmem/frontswap
allows zcache to reject any page, so if the distribution of zsize
exceeds PAGE_SIZE/2, some pages can be rejected (and thus passed
through to swap).  This safety valve already exists in zcache (and zcache2)
to avoid situations where zpages would otherwise significantly
exceed half of total pageframes allocated.  IMHO this is a
better policy than accepting a large number of poorly-compressed pages,
i.e. if every data page compresses down from 4096 bytes to 4032
bytes, zsmalloc stores them all (thus using very nearly one pageframe
per zpage), whereas zbud avoids the anomalous page sequence altogether.
 
> # Created tar of /usr/lib (2GB) on a fairly loaded Linux system and
> compressed page-by-page using LZO:
> 
> # first two fields: bin start, end.  Third field: compressed size
> 32 286 7644
> :
> 3842 4096 3482
> 
> The only (approx) sweetspots for zbud are 1810-2064 and 3842-4096 which
> covers only a small fraction of pages.
> 
> # same page-by-page compression for 220MB ISO from project Gutenberg:
> 32 286 70
> :
> 3842 4096 804
> 
> Again very few pages in zbud favoring bins.
> 
> So, we really need zsmalloc style allocator which handles sizes all over
> the spectrum. But yes, compaction remains far easier to implement on zbud.

So it remains to be seen if a third choice exists (which might be either
an enhanced zbud or an enhanced zsmalloc), right?

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Nitin Gupta Sept. 17, 2012, 11:46 p.m. UTC | #4
On Mon, Sep 17, 2012 at 1:42 PM, Dan Magenheimer
<dan.magenheimer@oracle.com> wrote:
>> From: Nitin Gupta [mailto:ngupta@vflare.org]
>> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
>>
>> The problem is that zbud performs well only when a (compressed) page is
>> either PAGE_SIZE/2 - e or PAGE_SIZE - e, where e is small. So, even if
>> the average compression ratio is 2x (which is hard to believe), a
>> majority of sizes can actually end up in PAGE_SIZE/2 + e bucket and zbud
>> will still give bad performance.  For instance, consider these histograms:
>
> Whoa whoa whoa.  This is very wrong.  Zbud handles compressed pages
> of any range that fits in a pageframe (same, almost, as zsmalloc).
> Unless there is some horrible bug you found...
>
> Zbud _does_ require the _distribution_ of zsize to be roughly
> centered around PAGE_SIZE/2 (or less).  Is that what you meant?

Yes, I meant this only: though zbud can handle any size, it isn't
efficient for any size not centered around PAGESIZE/2.

> If so, the following numbers you posted don't make sense to me.
> Could you be more explicit on what the numbers mean?
>

This is a histogram of the compressed sizes when files were
compressed in 4K chunks. The first number is the lower limit of
bin size, second number of upper limit and third number is the
number of pages that fall in that bin.

> Also, as you know, unlike zram, the architecture of tmem/frontswap
> allows zcache to reject any page, so if the distribution of zsize
> exceeds PAGE_SIZE/2, some pages can be rejected (and thus passed
> through to swap).  This safety valve already exists in zcache (and zcache2)
> to avoid situations where zpages would otherwise significantly
> exceed half of total pageframes allocated.  IMHO this is a
> better policy than accepting a large number of poorly-compressed pages,

Long time back zram had the ability of forwarding poorly compressed
pages to a backing swap device but that was removed to cleanup the
code and help with upstream promotion.  Once zram goes out of staging,
I will try getting that functionality back if there is enough demand.


> i.e. if every data page compresses down from 4096 bytes to 4032
> bytes, zsmalloc stores them all (thus using very nearly one pageframe
> per zpage), whereas zbud avoids the anomalous page sequence altogether.
>

This ability to letting pages go to physical device is not really
highlighting anything
of zbud vs zsmalloc.  That ability is really zram vs frontswap stuff
which is a different
thing.


>> # Created tar of /usr/lib (2GB) on a fairly loaded Linux system and
>> compressed page-by-page using LZO:
>>
>> # first two fields: bin start, end.  Third field: compressed size
>> 32 286 7644
>> :
>> 3842 4096 3482
>>
>> The only (approx) sweetspots for zbud are 1810-2064 and 3842-4096 which
>> covers only a small fraction of pages.
>>
>> # same page-by-page compression for 220MB ISO from project Gutenberg:
>> 32 286 70
>> :
>> 3842 4096 804
>>
>> Again very few pages in zbud favoring bins.
>>
>> So, we really need zsmalloc style allocator which handles sizes all over
>> the spectrum. But yes, compaction remains far easier to implement on zbud.
>
> So it remains to be seen if a third choice exists (which might be either
> an enhanced zbud or an enhanced zsmalloc), right?
>

Yes, definitely. At least for non-ephemeral pages (zram), zsmalloc seems to be
a better choice even without compaction. As for zcache, I don't understand its
codebase anyways so not sure how exactly compaction would interact with it,
so I think zcache should stay with zbud.

Thanks,
Nitin
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Mel Gorman Sept. 21, 2012, 4:12 p.m. UTC | #5
On Tue, Sep 04, 2012 at 04:34:46PM -0500, Seth Jennings wrote:
> zcache is the remaining piece of code required to support in-kernel
> memory compression.  The other two features, cleancache and frontswap,
> have been promoted to mainline in 3.0 and 3.5 respectively.  This
> patchset promotes zcache from the staging tree to mainline.
> 

This is a very rough review of the code simply because I was asked to
look at it. I'm barely aware of the history and I'm not a user of this
code myself so take all of this with a grain of salt.

Very broadly speaking my initial reaction before I reviewed anything was
that *some* sort of usable backend for cleancache or frontswap should exist
at this point. My understanding is that Xen is the primary user of both
those frontends and ramster, while interesting, is not something that a
typical user will benefit from.

That said, I worry that this has bounced around a lot and as Dan (the
original author) has a rewrite. I'm wary of spending too much time on this
at all. Is Dan's new code going to replace this or what? It'd be nice to
find a definitive answer on that.

Anyway, here goes

> Based on the level of activity and contributions we're seeing from a
> diverse set of people and interests, I think zcache has matured to the
> point where it makes sense to promote this out of staging.
> 
> Overview
> ========
> zcache is a backend to frontswap and cleancache that accepts pages from
> those mechanisms and compresses them, leading to reduced I/O caused by
> swap and file re-reads.  This is very valuable in shared storage situations
> to reduce load on things like SANs.  Also, in the case of slow backing/swap
> devices, zcache can also yield a performance gain.
> 
> In-Kernel Memory Compression Overview:
> 
>  swap subsystem            page cache
>         +                      +
>     frontswap              cleancache
>         +                      +
> zcache frontswap glue  zcache cleancache glue
>         +                      +
>         +---------+------------+
>                   +
>             zcache/tmem core
>                   +
>         +---------+------------+
>         +                      +
>      zsmalloc                 zbud
> 
> Everything below the frontswap/cleancache layer is current inside the
> zcache driver expect for zsmalloc which is a shared between zcache and
> another memory compression driver, zram.
> 
> Since zcache is dependent on zsmalloc, it is also being promoted by this
> patchset.
> 
> For information on zsmalloc and the rationale behind it's design and use
> cases verses already existing allocators in the kernel:
> 
> https://lkml.org/lkml/2012/1/9/386
> 
> zsmalloc is the allocator used by zcache to store persistent pages that
> comes from frontswap, as opposed to zbud which is the (internal) allocator
> used for ephemeral pages from cleancache.
> 
> zsmalloc uses many fields of the page struct to create it's conceptual
> high-order page called a zspage.  Exactly which fields are used and for
> what purpose is documented at the top of the zsmalloc .c file.  Because
> zsmalloc uses struct page extensively, Andrew advised that the
> promotion location be mm/:
> 
> https://lkml.org/lkml/2012/1/20/308
> 
> Zcache is added in a new driver class under drivers/ named mm for
> memory management related drivers.  This driver class would be for
> drivers that don't actually enabled a hardware device, but rather
> augment the memory manager in some way.  Other in-tree candidates
> for this driver class are zram and lowmemorykiller, both in staging.
> 
> Some benchmarking numbers demonstrating the I/O saving that can be had
> with zcache:
> 
> https://lkml.org/lkml/2012/3/22/383
> 
> Dan's presentation at LSF/MM this year on zcache:
> 
> http://oss.oracle.com/projects/tmem/dist/documentation/presentations/LSFMM12-zcache-final.pdf
> 
> There was a recent thread about cleancache memory corruption that is
> resolved by this patch that should be making it into linux-next via
> Greg very soon and is included in this patch:
> 
> https://lkml.org/lkml/2012/8/29/253
> 
> Based on next-20120904
> 
> Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
> ---
>  drivers/Kconfig                 |    2 +
>  drivers/Makefile                |    1 +
>  drivers/mm/Kconfig              |   13 +
>  drivers/mm/Makefile             |    1 +
>  drivers/mm/zcache/Makefile      |    3 +
>  drivers/mm/zcache/tmem.c        |  773 +++++++++++++++
>  drivers/mm/zcache/tmem.h        |  206 ++++
>  drivers/mm/zcache/zcache-main.c | 2077 +++++++++++++++++++++++++++++++++++++++
>  include/linux/zsmalloc.h        |   43 +
>  mm/Kconfig                      |   18 +
>  mm/Makefile                     |    1 +
>  mm/zsmalloc.c                   | 1063 ++++++++++++++++++++
>  12 files changed, 4201 insertions(+)
>  create mode 100644 drivers/mm/Kconfig
>  create mode 100644 drivers/mm/Makefile
>  create mode 100644 drivers/mm/zcache/Makefile
>  create mode 100644 drivers/mm/zcache/tmem.c
>  create mode 100644 drivers/mm/zcache/tmem.h
>  create mode 100644 drivers/mm/zcache/zcache-main.c
>  create mode 100644 include/linux/zsmalloc.h
>  create mode 100644 mm/zsmalloc.c
> 
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index 324e958..d126132 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -154,4 +154,6 @@ source "drivers/vme/Kconfig"
>  
>  source "drivers/pwm/Kconfig"
>  
> +source "drivers/mm/Kconfig"
> +
>  endmenu
> diff --git a/drivers/Makefile b/drivers/Makefile
> index d64a0f7..aa69e1c 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -140,3 +140,4 @@ obj-$(CONFIG_EXTCON)		+= extcon/
>  obj-$(CONFIG_MEMORY)		+= memory/
>  obj-$(CONFIG_IIO)		+= iio/
>  obj-$(CONFIG_VME_BUS)		+= vme/
> +obj-$(CONFIG_MM_DRIVERS)	+= mm/
> diff --git a/drivers/mm/Kconfig b/drivers/mm/Kconfig
> new file mode 100644
> index 0000000..22289c6
> --- /dev/null
> +++ b/drivers/mm/Kconfig
> @@ -0,0 +1,13 @@
> +menu "Memory management drivers"
> +
> +config ZCACHE
> +	bool "Dynamic compression of swap pages and clean pagecache pages"
> +	depends on (CLEANCACHE || FRONTSWAP) && CRYPTO=y && ZSMALLOC=y
> +	select CRYPTO_LZO
> +	default n
> +	help
> +	  Zcache uses compression and an in-kernel implementation of
> +	  transcendent memory to store clean page cache pages and swap
> +	  in RAM, providing a noticeable reduction in disk I/O.
> +
> +endmenu
> diff --git a/drivers/mm/Makefile b/drivers/mm/Makefile
> new file mode 100644
> index 0000000..f36f509
> --- /dev/null
> +++ b/drivers/mm/Makefile
> @@ -0,0 +1 @@
> +obj-$(CONFIG_ZCACHE)	+= zcache/
> diff --git a/drivers/mm/zcache/Makefile b/drivers/mm/zcache/Makefile
> new file mode 100644
> index 0000000..60daa27
> --- /dev/null
> +++ b/drivers/mm/zcache/Makefile
> @@ -0,0 +1,3 @@
> +zcache-y	:=	zcache-main.o tmem.o
> +
> +obj-$(CONFIG_ZCACHE)	+=	zcache.o
> diff --git a/drivers/mm/zcache/tmem.c b/drivers/mm/zcache/tmem.c
> new file mode 100644
> index 0000000..eaa9021
> --- /dev/null
> +++ b/drivers/mm/zcache/tmem.c
> @@ -0,0 +1,773 @@
> +/*
> + * In-kernel transcendent memory (generic implementation)
> + *
> + * Copyright (c) 2009-2011, Dan Magenheimer, Oracle Corp.
> + *
> + * The primary purpose of Transcedent Memory ("tmem") is to map object-oriented
> + * "handles" (triples containing a pool id, and object id, and an index), to
> + * pages in a page-accessible memory (PAM).  Tmem references the PAM pages via
> + * an abstract "pampd" (PAM page-descriptor), which can be operated on by a
> + * set of functions (pamops).  Each pampd contains some representation of
> + * PAGE_SIZE bytes worth of data. Tmem must support potentially millions of
> + * pages and must be able to insert, find, and delete these pages at a
> + * potential frequency of thousands per second concurrently across many CPUs,
> + * (and, if used with KVM, across many vcpus across many guests).
> + * Tmem is tracked with a hierarchy of data structures, organized by
> + * the elements in a handle-tuple: pool_id, object_id, and page index.
> + * One or more "clients" (e.g. guests) each provide one or more tmem_pools.
> + * Each pool, contains a hash table of rb_trees of tmem_objs.  Each
> + * tmem_obj contains a radix-tree-like tree of pointers, with intermediate
> + * nodes called tmem_objnodes.  Each leaf pointer in this tree points to
> + * a pampd, which is accessible only through a small set of callbacks
> + * registered by the PAM implementation (see tmem_register_pamops). Tmem
> + * does all memory allocation via a set of callbacks registered by the tmem
> + * host implementation (e.g. see tmem_register_hostops).
> + */
> +
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/atomic.h>
> +
> +#include "tmem.h"
> +
> +/* data structure sentinels used for debugging... see tmem.h */
> +#define POOL_SENTINEL 0x87658765
> +#define OBJ_SENTINEL 0x12345678
> +#define OBJNODE_SENTINEL 0xfedcba09
> +

Nit, the typical phrase for such debugging is POISON.

> +/*
> + * A tmem host implementation must use this function to register callbacks
> + * for memory allocation.
> + */
> +static struct tmem_hostops tmem_hostops;
> +
> +static void tmem_objnode_tree_init(void);
> +
> +void tmem_register_hostops(struct tmem_hostops *m)
> +{
> +	tmem_objnode_tree_init();
> +	tmem_hostops = *m;
> +}
> +
> +/*
> + * A tmem host implementation must use this function to register
> + * callbacks for a page-accessible memory (PAM) implementation
> + */
> +static struct tmem_pamops tmem_pamops;
> +
> +void tmem_register_pamops(struct tmem_pamops *m)
> +{
> +	tmem_pamops = *m;
> +}
> +

This implies that this can only host one client  at a time. I suppose
that's ok to start with but is there ever an expectation that zcache +
something else would be enabled at the same time?

> +/*
> + * Oid's are potentially very sparse and tmem_objs may have an indeterminately
> + * short life, being added and deleted at a relatively high frequency.
> + * So an rb_tree is an ideal data structure to manage tmem_objs.  But because
> + * of the potentially huge number of tmem_objs, each pool manages a hashtable
> + * of rb_trees to reduce search, insert, delete, and rebalancing time.
> + * Each hashbucket also has a lock to manage concurrent access.
> + *
> + * The following routines manage tmem_objs.  When any tmem_obj is accessed,
> + * the hashbucket lock must be held.
> + */
> +
> +static struct tmem_obj
> +*__tmem_obj_find(struct tmem_hashbucket *hb, struct tmem_oid *oidp,
> +		 struct rb_node **parent, struct rb_node ***link)
> +{
> +	struct rb_node *_parent = NULL, **rbnode;
> +	struct tmem_obj *obj = NULL;
> +
> +	rbnode = &hb->obj_rb_root.rb_node;
> +	while (*rbnode) {
> +		BUG_ON(RB_EMPTY_NODE(*rbnode));
> +		_parent = *rbnode;
> +		obj = rb_entry(*rbnode, struct tmem_obj,
> +			       rb_tree_node);
> +		switch (tmem_oid_compare(oidp, &obj->oid)) {
> +		case 0: /* equal */
> +			goto out;
> +		case -1:
> +			rbnode = &(*rbnode)->rb_left;
> +			break;
> +		case 1:
> +			rbnode = &(*rbnode)->rb_right;
> +			break;
> +		}
> +	}
> +
> +	if (parent)
> +		*parent = _parent;
> +	if (link)
> +		*link = rbnode;
> +
> +	obj = NULL;
> +out:
> +	return obj;
> +}
> +
> +
> +/* searches for object==oid in pool, returns locked object if found */
> +static struct tmem_obj *tmem_obj_find(struct tmem_hashbucket *hb,
> +					struct tmem_oid *oidp)
> +{
> +	return __tmem_obj_find(hb, oidp, NULL, NULL);
> +}
> +

Ok. It's a pity that the caller is responsible for looking up the hashbucket
and the locking. The pool can be found from the tmem_obj structure and the hash is
not that expensive to calculate.

> +static void tmem_pampd_destroy_all_in_obj(struct tmem_obj *);
> +
> +/* free an object that has no more pampds in it */
> +static void tmem_obj_free(struct tmem_obj *obj, struct tmem_hashbucket *hb)
> +{
> +	struct tmem_pool *pool;
> +
> +	BUG_ON(obj == NULL);
> +	ASSERT_SENTINEL(obj, OBJ);
> +	BUG_ON(obj->pampd_count > 0);
> +	pool = obj->pool;
> +	BUG_ON(pool == NULL);
> +	if (obj->objnode_tree_root != NULL) /* may be "stump" with no leaves */
> +		tmem_pampd_destroy_all_in_obj(obj);
> +	BUG_ON(obj->objnode_tree_root != NULL);
> +	BUG_ON((long)obj->objnode_count != 0);
> +	atomic_dec(&pool->obj_count);
> +	BUG_ON(atomic_read(&pool->obj_count) < 0);
> +	INVERT_SENTINEL(obj, OBJ);
> +	obj->pool = NULL;
> +	tmem_oid_set_invalid(&obj->oid);
> +	rb_erase(&obj->rb_tree_node, &hb->obj_rb_root);
> +}
> +

By and large this looks ok but one thing jumped out at me and it was the
use of atomics. Why is obj_count an atomic? Within this file it is only
accessed under hb->lock. zcache on top of it appears to be only reading
this count (actually without a lock which looks suspicious in itself).

> +/*
> + * initialize, and insert an tmem_object_root (called only if find failed)
> + */
> +static void tmem_obj_init(struct tmem_obj *obj, struct tmem_hashbucket *hb,
> +					struct tmem_pool *pool,
> +					struct tmem_oid *oidp)
> +{
> +	struct rb_root *root = &hb->obj_rb_root;
> +	struct rb_node **new = NULL, *parent = NULL;
> +
> +	BUG_ON(pool == NULL);
> +	atomic_inc(&pool->obj_count);
> +	obj->objnode_tree_height = 0;
> +	obj->objnode_tree_root = NULL;
> +	obj->pool = pool;
> +	obj->oid = *oidp;
> +	obj->objnode_count = 0;
> +	obj->pampd_count = 0;
> +	(*tmem_pamops.new_obj)(obj);
> +	SET_SENTINEL(obj, OBJ);
> +
> +	if (__tmem_obj_find(hb, oidp, &parent, &new))
> +		BUG();
> +
> +	rb_link_node(&obj->rb_tree_node, parent, new);
> +	rb_insert_color(&obj->rb_tree_node, root);
> +}
> +
> +/*
> + * Tmem is managed as a set of tmem_pools with certain attributes, such as
> + * "ephemeral" vs "persistent".  These attributes apply to all tmem_objs
> + * and all pampds that belong to a tmem_pool.  A tmem_pool is created
> + * or deleted relatively rarely (for example, when a filesystem is
> + * mounted or unmounted.
> + */
> +
> +/* flush all data from a pool and, optionally, free it */
> +static void tmem_pool_flush(struct tmem_pool *pool, bool destroy)
> +{
> +	struct rb_node *rbnode;
> +	struct tmem_obj *obj;
> +	struct tmem_hashbucket *hb = &pool->hashbucket[0];
> +	int i;
> +
> +	BUG_ON(pool == NULL);
> +	for (i = 0; i < TMEM_HASH_BUCKETS; i++, hb++) {
> +		spin_lock(&hb->lock);
> +		rbnode = rb_first(&hb->obj_rb_root);
> +		while (rbnode != NULL) {
> +			obj = rb_entry(rbnode, struct tmem_obj, rb_tree_node);
> +			rbnode = rb_next(rbnode);
> +			tmem_pampd_destroy_all_in_obj(obj);
> +			tmem_obj_free(obj, hb);
> +			(*tmem_hostops.obj_free)(obj, pool);
> +		}
> +		spin_unlock(&hb->lock);
> +	}
> +	if (destroy)
> +		list_del(&pool->pool_list);
> +}
> +
> +/*
> + * A tmem_obj contains a radix-tree-like tree in which the intermediate
> + * nodes are called tmem_objnodes.  (The kernel lib/radix-tree.c implementation
> + * is very specialized and tuned for specific uses and is not particularly
> + * suited for use from this code, though some code from the core algorithms has
> + * been reused, thus the copyright notices below).  Each tmem_objnode contains
> + * a set of pointers which point to either a set of intermediate tmem_objnodes
> + * or a set of of pampds.
> + *
> + * Portions Copyright (C) 2001 Momchil Velikov
> + * Portions Copyright (C) 2001 Christoph Hellwig
> + * Portions Copyright (C) 2005 SGI, Christoph Lameter <clameter@sgi.com>
> + */
> +

This is a bit vague. It asserts that lib/radix-tree is unsuitable but
not why. I skipped over most of the implementation to be honest.

> +struct tmem_objnode_tree_path {
> +	struct tmem_objnode *objnode;
> +	int offset;
> +};
> +
> +/* objnode height_to_maxindex translation */
> +static unsigned long tmem_objnode_tree_h2max[OBJNODE_TREE_MAX_PATH + 1];
> +
> +static void tmem_objnode_tree_init(void)
> +{
> +	unsigned int ht, tmp;
> +
> +	for (ht = 0; ht < ARRAY_SIZE(tmem_objnode_tree_h2max); ht++) {
> +		tmp = ht * OBJNODE_TREE_MAP_SHIFT;
> +		if (tmp >= OBJNODE_TREE_INDEX_BITS)
> +			tmem_objnode_tree_h2max[ht] = ~0UL;
> +		else
> +			tmem_objnode_tree_h2max[ht] =
> +			    (~0UL >> (OBJNODE_TREE_INDEX_BITS - tmp - 1)) >> 1;
> +	}
> +}
> +
> +static struct tmem_objnode *tmem_objnode_alloc(struct tmem_obj *obj)
> +{
> +	struct tmem_objnode *objnode;
> +
> +	ASSERT_SENTINEL(obj, OBJ);
> +	BUG_ON(obj->pool == NULL);
> +	ASSERT_SENTINEL(obj->pool, POOL);
> +	objnode = (*tmem_hostops.objnode_alloc)(obj->pool);
> +	if (unlikely(objnode == NULL))
> +		goto out;
> +	objnode->obj = obj;
> +	SET_SENTINEL(objnode, OBJNODE);
> +	memset(&objnode->slots, 0, sizeof(objnode->slots));
> +	objnode->slots_in_use = 0;
> +	obj->objnode_count++;
> +out:
> +	return objnode;
> +}
> +
> +static void tmem_objnode_free(struct tmem_objnode *objnode)
> +{
> +	struct tmem_pool *pool;
> +	int i;
> +
> +	BUG_ON(objnode == NULL);
> +	for (i = 0; i < OBJNODE_TREE_MAP_SIZE; i++)
> +		BUG_ON(objnode->slots[i] != NULL);
> +	ASSERT_SENTINEL(objnode, OBJNODE);
> +	INVERT_SENTINEL(objnode, OBJNODE);
> +	BUG_ON(objnode->obj == NULL);
> +	ASSERT_SENTINEL(objnode->obj, OBJ);
> +	pool = objnode->obj->pool;
> +	BUG_ON(pool == NULL);
> +	ASSERT_SENTINEL(pool, POOL);
> +	objnode->obj->objnode_count--;
> +	objnode->obj = NULL;
> +	(*tmem_hostops.objnode_free)(objnode, pool);
> +}
> +
> +/*
> + * lookup index in object and return associated pampd (or NULL if not found)
> + */
> +static void **__tmem_pampd_lookup_in_obj(struct tmem_obj *obj, uint32_t index)
> +{
> +	unsigned int height, shift;
> +	struct tmem_objnode **slot = NULL;
> +
> +	BUG_ON(obj == NULL);
> +	ASSERT_SENTINEL(obj, OBJ);
> +	BUG_ON(obj->pool == NULL);
> +	ASSERT_SENTINEL(obj->pool, POOL);
> +
> +	height = obj->objnode_tree_height;
> +	if (index > tmem_objnode_tree_h2max[obj->objnode_tree_height])
> +		goto out;
> +	if (height == 0 && obj->objnode_tree_root) {
> +		slot = &obj->objnode_tree_root;
> +		goto out;
> +	}
> +	shift = (height-1) * OBJNODE_TREE_MAP_SHIFT;
> +	slot = &obj->objnode_tree_root;
> +	while (height > 0) {
> +		if (*slot == NULL)
> +			goto out;
> +		slot = (struct tmem_objnode **)
> +			((*slot)->slots +
> +			 ((index >> shift) & OBJNODE_TREE_MAP_MASK));
> +		shift -= OBJNODE_TREE_MAP_SHIFT;
> +		height--;
> +	}
> +out:
> +	return slot != NULL ? (void **)slot : NULL;
> +}
> +
> +static void *tmem_pampd_lookup_in_obj(struct tmem_obj *obj, uint32_t index)
> +{
> +	struct tmem_objnode **slot;
> +
> +	slot = (struct tmem_objnode **)__tmem_pampd_lookup_in_obj(obj, index);
> +	return slot != NULL ? *slot : NULL;
> +}
> +
> +static void *tmem_pampd_replace_in_obj(struct tmem_obj *obj, uint32_t index,
> +					void *new_pampd)
> +{
> +	struct tmem_objnode **slot;
> +	void *ret = NULL;
> +
> +	slot = (struct tmem_objnode **)__tmem_pampd_lookup_in_obj(obj, index);
> +	if ((slot != NULL) && (*slot != NULL)) {
> +		void *old_pampd = *(void **)slot;
> +		*(void **)slot = new_pampd;
> +		(*tmem_pamops.free)(old_pampd, obj->pool, NULL, 0);
> +		ret = new_pampd;
> +	}
> +	return ret;
> +}
> +
> +static int tmem_pampd_add_to_obj(struct tmem_obj *obj, uint32_t index,
> +					void *pampd)
> +{
> +	int ret = 0;
> +	struct tmem_objnode *objnode = NULL, *newnode, *slot;
> +	unsigned int height, shift;
> +	int offset = 0;
> +
> +	/* if necessary, extend the tree to be higher  */
> +	if (index > tmem_objnode_tree_h2max[obj->objnode_tree_height]) {
> +		height = obj->objnode_tree_height + 1;
> +		if (index > tmem_objnode_tree_h2max[height])
> +			while (index > tmem_objnode_tree_h2max[height])
> +				height++;
> +		if (obj->objnode_tree_root == NULL) {
> +			obj->objnode_tree_height = height;
> +			goto insert;
> +		}
> +		do {
> +			newnode = tmem_objnode_alloc(obj);
> +			if (!newnode) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +			newnode->slots[0] = obj->objnode_tree_root;
> +			newnode->slots_in_use = 1;
> +			obj->objnode_tree_root = newnode;
> +			obj->objnode_tree_height++;
> +		} while (height > obj->objnode_tree_height);
> +	}
> +insert:
> +	slot = obj->objnode_tree_root;
> +	height = obj->objnode_tree_height;
> +	shift = (height-1) * OBJNODE_TREE_MAP_SHIFT;
> +	while (height > 0) {
> +		if (slot == NULL) {
> +			/* add a child objnode.  */
> +			slot = tmem_objnode_alloc(obj);
> +			if (!slot) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
> +			if (objnode) {
> +
> +				objnode->slots[offset] = slot;
> +				objnode->slots_in_use++;
> +			} else
> +				obj->objnode_tree_root = slot;
> +		}
> +		/* go down a level */
> +		offset = (index >> shift) & OBJNODE_TREE_MAP_MASK;
> +		objnode = slot;
> +		slot = objnode->slots[offset];
> +		shift -= OBJNODE_TREE_MAP_SHIFT;
> +		height--;
> +	}
> +	BUG_ON(slot != NULL);
> +	if (objnode) {
> +		objnode->slots_in_use++;
> +		objnode->slots[offset] = pampd;
> +	} else
> +		obj->objnode_tree_root = pampd;
> +	obj->pampd_count++;
> +out:
> +	return ret;
> +}
> +
> +static void *tmem_pampd_delete_from_obj(struct tmem_obj *obj, uint32_t index)
> +{
> +	struct tmem_objnode_tree_path path[OBJNODE_TREE_MAX_PATH + 1];
> +	struct tmem_objnode_tree_path *pathp = path;
> +	struct tmem_objnode *slot = NULL;
> +	unsigned int height, shift;
> +	int offset;
> +
> +	BUG_ON(obj == NULL);
> +	ASSERT_SENTINEL(obj, OBJ);
> +	BUG_ON(obj->pool == NULL);
> +	ASSERT_SENTINEL(obj->pool, POOL);
> +	height = obj->objnode_tree_height;
> +	if (index > tmem_objnode_tree_h2max[height])
> +		goto out;
> +	slot = obj->objnode_tree_root;
> +	if (height == 0 && obj->objnode_tree_root) {
> +		obj->objnode_tree_root = NULL;
> +		goto out;
> +	}
> +	shift = (height - 1) * OBJNODE_TREE_MAP_SHIFT;
> +	pathp->objnode = NULL;
> +	do {
> +		if (slot == NULL)
> +			goto out;
> +		pathp++;
> +		offset = (index >> shift) & OBJNODE_TREE_MAP_MASK;
> +		pathp->offset = offset;
> +		pathp->objnode = slot;
> +		slot = slot->slots[offset];
> +		shift -= OBJNODE_TREE_MAP_SHIFT;
> +		height--;
> +	} while (height > 0);
> +	if (slot == NULL)
> +		goto out;
> +	while (pathp->objnode) {
> +		pathp->objnode->slots[pathp->offset] = NULL;
> +		pathp->objnode->slots_in_use--;
> +		if (pathp->objnode->slots_in_use) {
> +			if (pathp->objnode == obj->objnode_tree_root) {
> +				while (obj->objnode_tree_height > 0 &&
> +				  obj->objnode_tree_root->slots_in_use == 1 &&
> +				  obj->objnode_tree_root->slots[0]) {
> +					struct tmem_objnode *to_free =
> +						obj->objnode_tree_root;
> +
> +					obj->objnode_tree_root =
> +							to_free->slots[0];
> +					obj->objnode_tree_height--;
> +					to_free->slots[0] = NULL;
> +					to_free->slots_in_use = 0;
> +					tmem_objnode_free(to_free);
> +				}
> +			}
> +			goto out;
> +		}
> +		tmem_objnode_free(pathp->objnode); /* 0 slots used, free it */
> +		pathp--;
> +	}
> +	obj->objnode_tree_height = 0;
> +	obj->objnode_tree_root = NULL;
> +
> +out:
> +	if (slot != NULL)
> +		obj->pampd_count--;
> +	BUG_ON(obj->pampd_count < 0);
> +	return slot;
> +}
> +
> +/* recursively walk the objnode_tree destroying pampds and objnodes */
> +static void tmem_objnode_node_destroy(struct tmem_obj *obj,
> +					struct tmem_objnode *objnode,
> +					unsigned int ht)
> +{
> +	int i;
> +
> +	if (ht == 0)
> +		return;
> +	for (i = 0; i < OBJNODE_TREE_MAP_SIZE; i++) {
> +		if (objnode->slots[i]) {
> +			if (ht == 1) {
> +				obj->pampd_count--;
> +				(*tmem_pamops.free)(objnode->slots[i],
> +						obj->pool, NULL, 0);
> +				objnode->slots[i] = NULL;
> +				continue;
> +			}
> +			tmem_objnode_node_destroy(obj, objnode->slots[i], ht-1);
> +			tmem_objnode_free(objnode->slots[i]);
> +			objnode->slots[i] = NULL;
> +		}
> +	}
> +}
> +
> +static void tmem_pampd_destroy_all_in_obj(struct tmem_obj *obj)
> +{
> +	if (obj->objnode_tree_root == NULL)
> +		return;
> +	if (obj->objnode_tree_height == 0) {
> +		obj->pampd_count--;
> +		(*tmem_pamops.free)(obj->objnode_tree_root, obj->pool, NULL, 0);
> +	} else {
> +		tmem_objnode_node_destroy(obj, obj->objnode_tree_root,
> +					obj->objnode_tree_height);
> +		tmem_objnode_free(obj->objnode_tree_root);
> +		obj->objnode_tree_height = 0;
> +	}
> +	obj->objnode_tree_root = NULL;
> +	(*tmem_pamops.free_obj)(obj->pool, obj);
> +}
> +
> +/*
> + * Tmem is operated on by a set of well-defined actions:
> + * "put", "get", "flush", "flush_object", "new pool" and "destroy pool".
> + * (The tmem ABI allows for subpages and exchanges but these operations
> + * are not included in this implementation.)
> + *
> + * These "tmem core" operations are implemented in the following functions.
> + */
> +

More nits. As this defines a boundary between two major components it
probably should have its own Documentation/ entry and the APIs should have
kernel doc comments.

> +/*
> + * "Put" a page, e.g. copy a page from the kernel into newly allocated
> + * PAM space (if such space is available). Tmem_put is complicated by

That's an awful name! put in every other context means drop a reference
count. I suppose it must be taken from a spec somewhere that set the name
in stone but it's a pity because it's misleading. I'm going to keep seeing
put and get as reference counts.

> + * a corner case: What if a page with matching handle already exists in
> + * tmem?  To guarantee coherency, one of two actions is necessary: Either
> + * the data for the page must be overwritten, or the page must be
> + * "flushed" so that the data is not accessible to a subsequent "get".
> + * Since these "duplicate puts" are relatively rare, this implementation
> + * always flushes for simplicity.
> + */

At first glance that sounds really dangerous. If two different users can have
the same oid for different data, what prevents the wrong data being fetched?
>From this level I expect that it's something the layers above it have to
manage and in practice they must be preventing duplicates ever happening
but I'm guessing. At some point it would be nice if there was an example
included here explaining why duplicates are not a bug.

> +int tmem_put(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
> +		char *data, size_t size, bool raw, bool ephemeral)
> +{
> +	struct tmem_obj *obj = NULL, *objfound = NULL, *objnew = NULL;
> +	void *pampd = NULL, *pampd_del = NULL;
> +	int ret = -ENOMEM;
> +	struct tmem_hashbucket *hb;
> +
> +	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
> +	spin_lock(&hb->lock);
> +	obj = objfound = tmem_obj_find(hb, oidp);
> +	if (obj != NULL) {
> +		pampd = tmem_pampd_lookup_in_obj(objfound, index);
> +		if (pampd != NULL) {
> +			/* if found, is a dup put, flush the old one */
> +			pampd_del = tmem_pampd_delete_from_obj(obj, index);
> +			BUG_ON(pampd_del != pampd);
> +			(*tmem_pamops.free)(pampd, pool, oidp, index);
> +			if (obj->pampd_count == 0) {
> +				objnew = obj;
> +				objfound = NULL;
> +			}
> +			pampd = NULL;
> +		}
> +	} else {
> +		obj = objnew = (*tmem_hostops.obj_alloc)(pool);
> +		if (unlikely(obj == NULL)) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		tmem_obj_init(obj, hb, pool, oidp);
> +	}
> +	BUG_ON(obj == NULL);
> +	BUG_ON(((objnew != obj) && (objfound != obj)) || (objnew == objfound));
> +	pampd = (*tmem_pamops.create)(data, size, raw, ephemeral,
> +					obj->pool, &obj->oid, index);
> +	if (unlikely(pampd == NULL))
> +		goto free;
> +	ret = tmem_pampd_add_to_obj(obj, index, pampd);
> +	if (unlikely(ret == -ENOMEM))
> +		/* may have partially built objnode tree ("stump") */
> +		goto delete_and_free;
> +	goto out;
> +
> +delete_and_free:
> +	(void)tmem_pampd_delete_from_obj(obj, index);
> +free:
> +	if (pampd)
> +		(*tmem_pamops.free)(pampd, pool, NULL, 0);
> +	if (objnew) {
> +		tmem_obj_free(objnew, hb);
> +		(*tmem_hostops.obj_free)(objnew, pool);
> +	}
> +out:
> +	spin_unlock(&hb->lock);
> +	return ret;
> +}
> +
> +/*
> + * "Get" a page, e.g. if one can be found, copy the tmem page with the
> + * matching handle from PAM space to the kernel.  By tmem definition,
> + * when a "get" is successful on an ephemeral page, the page is "flushed",
> + * and when a "get" is successful on a persistent page, the page is retained
> + * in tmem.  Note that to preserve
> + * coherency, "get" can never be skipped if tmem contains the data.
> + * That is, if a get is done with a certain handle and fails, any
> + * subsequent "get" must also fail (unless of course there is a
> + * "put" done with the same handle).
> +
> + */
> +int tmem_get(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
> +		char *data, size_t *size, bool raw, int get_and_free)
> +{
> +	struct tmem_obj *obj;
> +	void *pampd;
> +	bool ephemeral = is_ephemeral(pool);
> +	int ret = -1;
> +	struct tmem_hashbucket *hb;
> +	bool free = (get_and_free == 1) || ((get_and_free == 0) && ephemeral);
> +	bool lock_held = false;
> +
> +	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
> +	spin_lock(&hb->lock);
> +	lock_held = true;

Nit: It might have been a bit more straight-forward to have an out and
out_locked

> +	obj = tmem_obj_find(hb, oidp);
> +	if (obj == NULL)
> +		goto out;
> +	if (free)
> +		pampd = tmem_pampd_delete_from_obj(obj, index);
> +	else
> +		pampd = tmem_pampd_lookup_in_obj(obj, index);
> +	if (pampd == NULL)
> +		goto out;
> +	if (free) {
> +		if (obj->pampd_count == 0) {
> +			tmem_obj_free(obj, hb);
> +			(*tmem_hostops.obj_free)(obj, pool);
> +			obj = NULL;
> +		}
> +	}
> +	if (tmem_pamops.is_remote(pampd)) {
> +		lock_held = false;
> +		spin_unlock(&hb->lock);
> +	}
> +	if (free)
> +		ret = (*tmem_pamops.get_data_and_free)(
> +				data, size, raw, pampd, pool, oidp, index);
> +	else
> +		ret = (*tmem_pamops.get_data)(
> +				data, size, raw, pampd, pool, oidp, index);
> +	if (ret < 0)
> +		goto out;
> +	ret = 0;
> +out:
> +	if (lock_held)
> +		spin_unlock(&hb->lock);
> +	return ret;
> +}
> +
> +/*
> + * If a page in tmem matches the handle, "flush" this page from tmem such
> + * that any subsequent "get" does not succeed (unless, of course, there
> + * was another "put" with the same handle).
> + */

As with the other names, the term "flush" is ambiguous. evict would
have been clearer. flush, particularly in filesystem contexts might be
interpreted as cleaning the page.

> +int tmem_flush_page(struct tmem_pool *pool,
> +				struct tmem_oid *oidp, uint32_t index)
> +{
> +	struct tmem_obj *obj;
> +	void *pampd;
> +	int ret = -1;
> +	struct tmem_hashbucket *hb;
> +
> +	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
> +	spin_lock(&hb->lock);
> +	obj = tmem_obj_find(hb, oidp);
> +	if (obj == NULL)
> +		goto out;
> +	pampd = tmem_pampd_delete_from_obj(obj, index);
> +	if (pampd == NULL)
> +		goto out;
> +	(*tmem_pamops.free)(pampd, pool, oidp, index);
> +	if (obj->pampd_count == 0) {
> +		tmem_obj_free(obj, hb);
> +		(*tmem_hostops.obj_free)(obj, pool);
> +	}
> +	ret = 0;
> +
> +out:
> +	spin_unlock(&hb->lock);
> +	return ret;
> +}
> +
> +/*
> + * If a page in tmem matches the handle, replace the page so that any
> + * subsequent "get" gets the new page.  Returns 0 if
> + * there was a page to replace, else returns -1.
> + */
> +int tmem_replace(struct tmem_pool *pool, struct tmem_oid *oidp,
> +			uint32_t index, void *new_pampd)
> +{
> +	struct tmem_obj *obj;
> +	int ret = -1;
> +	struct tmem_hashbucket *hb;
> +
> +	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
> +	spin_lock(&hb->lock);
> +	obj = tmem_obj_find(hb, oidp);
> +	if (obj == NULL)
> +		goto out;
> +	new_pampd = tmem_pampd_replace_in_obj(obj, index, new_pampd);
> +	ret = (*tmem_pamops.replace_in_obj)(new_pampd, obj);
> +out:
> +	spin_unlock(&hb->lock);
> +	return ret;
> +}
> +

Nothin in this patch uses this. It looks like ramster would depend on it
but at a glance, ramster seems to have its own copy of the code. I guess
this is what Dan was referring to as the fork and at some point that needs
to be resolved. Here, it looks like dead code.

> +/*
> + * "Flush" all pages in tmem matching this oid.
> + */
> +int tmem_flush_object(struct tmem_pool *pool, struct tmem_oid *oidp)
> +{
> +	struct tmem_obj *obj;
> +	struct tmem_hashbucket *hb;
> +	int ret = -1;
> +
> +	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
> +	spin_lock(&hb->lock);
> +	obj = tmem_obj_find(hb, oidp);
> +	if (obj == NULL)
> +		goto out;
> +	tmem_pampd_destroy_all_in_obj(obj);
> +	tmem_obj_free(obj, hb);
> +	(*tmem_hostops.obj_free)(obj, pool);
> +	ret = 0;
> +
> +out:
> +	spin_unlock(&hb->lock);
> +	return ret;
> +}
> +
> +/*
> + * "Flush" all pages (and tmem_objs) from this tmem_pool and disable
> + * all subsequent access to this tmem_pool.
> + */
> +int tmem_destroy_pool(struct tmem_pool *pool)
> +{
> +	int ret = -1;
> +
> +	if (pool == NULL)
> +		goto out;
> +	tmem_pool_flush(pool, 1);
> +	ret = 0;
> +out:
> +	return ret;
> +}

I'm worried about the locking. Glancing through it looks like most users
of the tmem API have interrupts disabled. So while it looks like just
the hb->lock is necessary, there is actually an implicit assumption that
interrupts are also disabled. However, when tmem_destroy_pool is called
only the bh is disabled. Now because of when the pool is destroyed, I doubt
you're going to have a problem with interrupts but I wonder ... has this
been heavily tested with lockdep?

It's not clear at this point why interrupts even had to be disabled.

> +
> +static LIST_HEAD(tmem_global_pool_list);
> +
> +/*
> + * Create a new tmem_pool with the provided flag and return
> + * a pool id provided by the tmem host implementation.
> + */
> +void tmem_new_pool(struct tmem_pool *pool, uint32_t flags)
> +{
> +	int persistent = flags & TMEM_POOL_PERSIST;
> +	int shared = flags & TMEM_POOL_SHARED;
> +	struct tmem_hashbucket *hb = &pool->hashbucket[0];
> +	int i;
> +
> +	for (i = 0; i < TMEM_HASH_BUCKETS; i++, hb++) {
> +		hb->obj_rb_root = RB_ROOT;
> +		spin_lock_init(&hb->lock);
> +	}
> +	INIT_LIST_HEAD(&pool->pool_list);
> +	atomic_set(&pool->obj_count, 0);
> +	SET_SENTINEL(pool, POOL);
> +	list_add_tail(&pool->pool_list, &tmem_global_pool_list);
> +	pool->persistent = persistent;
> +	pool->shared = shared;
> +}
> diff --git a/drivers/mm/zcache/tmem.h b/drivers/mm/zcache/tmem.h
> new file mode 100644
> index 0000000..0d4aa82
> --- /dev/null
> +++ b/drivers/mm/zcache/tmem.h
> @@ -0,0 +1,206 @@
> +/*
> + * tmem.h
> + *
> + * Transcendent memory
> + *
> + * Copyright (c) 2009-2011, Dan Magenheimer, Oracle Corp.
> + */
> +
> +#ifndef _TMEM_H_
> +#define _TMEM_H_
> +
> +#include <linux/types.h>
> +#include <linux/highmem.h>
> +#include <linux/hash.h>
> +#include <linux/atomic.h>
> +
> +/*
> + * These are pre-defined by the Xen<->Linux ABI
> + */

So it does look like the names are fixed already. Pity.

> +#define TMEM_PUT_PAGE			4
> +#define TMEM_GET_PAGE			5
> +#define TMEM_FLUSH_PAGE			6
> +#define TMEM_FLUSH_OBJECT		7
> +#define TMEM_POOL_PERSIST		1
> +#define TMEM_POOL_SHARED		2
> +#define TMEM_POOL_PRECOMPRESSED		4
> +#define TMEM_POOL_PAGESIZE_SHIFT	4
> +#define TMEM_POOL_PAGESIZE_MASK		0xf
> +#define TMEM_POOL_RESERVED_BITS		0x00ffff00
> +
> +/*
> + * sentinels have proven very useful for debugging but can be removed
> + * or disabled before final merge.
> + */
> +#define SENTINELS
> +#ifdef SENTINELS
> +#define DECL_SENTINEL uint32_t sentinel;
> +#define SET_SENTINEL(_x, _y) (_x->sentinel = _y##_SENTINEL)
> +#define INVERT_SENTINEL(_x, _y) (_x->sentinel = ~_y##_SENTINEL)
> +#define ASSERT_SENTINEL(_x, _y) WARN_ON(_x->sentinel != _y##_SENTINEL)
> +#define ASSERT_INVERTED_SENTINEL(_x, _y) WARN_ON(_x->sentinel != ~_y##_SENTINEL)
> +#else
> +#define DECL_SENTINEL
> +#define SET_SENTINEL(_x, _y) do { } while (0)
> +#define INVERT_SENTINEL(_x, _y) do { } while (0)
> +#define ASSERT_SENTINEL(_x, _y) do { } while (0)
> +#define ASSERT_INVERTED_SENTINEL(_x, _y) do { } while (0)
> +#endif
> +

This should have been enabled/disabled via Kconfig.

> +#define ASSERT_SPINLOCK(_l)	lockdep_assert_held(_l)
> +
> +/*
> + * A pool is the highest-level data structure managed by tmem and
> + * usually corresponds to a large independent set of pages such as
> + * a filesystem.  Each pool has an id, and certain attributes and counters.
> + * It also contains a set of hash buckets, each of which contains an rbtree
> + * of objects and a lock to manage concurrency within the pool.
> + */
> +
> +#define TMEM_HASH_BUCKET_BITS	8
> +#define TMEM_HASH_BUCKETS	(1<<TMEM_HASH_BUCKET_BITS)
> +
> +struct tmem_hashbucket {
> +	struct rb_root obj_rb_root;
> +	spinlock_t lock;
> +};
> +
> +struct tmem_pool {
> +	void *client; /* "up" for some clients, avoids table lookup */
> +	struct list_head pool_list;
> +	uint32_t pool_id;
> +	bool persistent;
> +	bool shared;
> +	atomic_t obj_count;
> +	atomic_t refcount;
> +	struct tmem_hashbucket hashbucket[TMEM_HASH_BUCKETS];
> +	DECL_SENTINEL
> +};
> +
> +#define is_persistent(_p)  (_p->persistent)
> +#define is_ephemeral(_p)   (!(_p->persistent))
> +
> +/*
> + * An object id ("oid") is large: 192-bits (to ensure, for example, files
> + * in a modern filesystem can be uniquely identified).
> + */
> +
> +struct tmem_oid {
> +	uint64_t oid[3];
> +};
> +
> +static inline void tmem_oid_set_invalid(struct tmem_oid *oidp)
> +{
> +	oidp->oid[0] = oidp->oid[1] = oidp->oid[2] = -1UL;
> +}
> +
> +static inline bool tmem_oid_valid(struct tmem_oid *oidp)
> +{
> +	return oidp->oid[0] != -1UL || oidp->oid[1] != -1UL ||
> +		oidp->oid[2] != -1UL;
> +}
> +
> +static inline int tmem_oid_compare(struct tmem_oid *left,
> +					struct tmem_oid *right)
> +{
> +	int ret;
> +
> +	if (left->oid[2] == right->oid[2]) {
> +		if (left->oid[1] == right->oid[1]) {
> +			if (left->oid[0] == right->oid[0])
> +				ret = 0;
> +			else if (left->oid[0] < right->oid[0])
> +				ret = -1;
> +			else
> +				return 1;
> +		} else if (left->oid[1] < right->oid[1])
> +			ret = -1;
> +		else
> +			ret = 1;
> +	} else if (left->oid[2] < right->oid[2])
> +		ret = -1;
> +	else
> +		ret = 1;
> +	return ret;
> +}

Holy Branches Batman!

Bit of a jumble but works at least. Nits: mixes ret = and returns
mid-way. Could have been implemented with a while loop. Only has one
caller and should have been in the C file that uses it. There was no need
to explicitely mark it inline either with just one caller.

> +
> +static inline unsigned tmem_oid_hash(struct tmem_oid *oidp)
> +{
> +	return hash_long(oidp->oid[0] ^ oidp->oid[1] ^ oidp->oid[2],
> +				TMEM_HASH_BUCKET_BITS);
> +}
> +
> +/*
> + * A tmem_obj contains an identifier (oid), pointers to the parent
> + * pool and the rb_tree to which it belongs, counters, and an ordered
> + * set of pampds, structured in a radix-tree-like tree.  The intermediate
> + * nodes of the tree are called tmem_objnodes.
> + */
> +
> +struct tmem_objnode;
> +
> +struct tmem_obj {
> +	struct tmem_oid oid;
> +	struct tmem_pool *pool;
> +	struct rb_node rb_tree_node;
> +	struct tmem_objnode *objnode_tree_root;
> +	unsigned int objnode_tree_height;
> +	unsigned long objnode_count;
> +	long pampd_count;
> +	void *extra; /* for private use by pampd implementation */
> +	DECL_SENTINEL
> +};
> +
> +#define OBJNODE_TREE_MAP_SHIFT 6
> +#define OBJNODE_TREE_MAP_SIZE (1UL << OBJNODE_TREE_MAP_SHIFT)
> +#define OBJNODE_TREE_MAP_MASK (OBJNODE_TREE_MAP_SIZE-1)
> +#define OBJNODE_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
> +#define OBJNODE_TREE_MAX_PATH \
> +		(OBJNODE_TREE_INDEX_BITS/OBJNODE_TREE_MAP_SHIFT + 2)
> +
> +struct tmem_objnode {
> +	struct tmem_obj *obj;
> +	DECL_SENTINEL
> +	void *slots[OBJNODE_TREE_MAP_SIZE];
> +	unsigned int slots_in_use;
> +};

Strikes me as odd that the debugging field is near the start of the
structure.

> +
> +/* pampd abstract datatype methods provided by the PAM implementation */
> +struct tmem_pamops {
> +	void *(*create)(char *, size_t, bool, int,
> +			struct tmem_pool *, struct tmem_oid *, uint32_t);
> +	int (*get_data)(char *, size_t *, bool, void *, struct tmem_pool *,
> +				struct tmem_oid *, uint32_t);
> +	int (*get_data_and_free)(char *, size_t *, bool, void *,
> +				struct tmem_pool *, struct tmem_oid *,
> +				uint32_t);
> +	void (*free)(void *, struct tmem_pool *, struct tmem_oid *, uint32_t);
> +	void (*free_obj)(struct tmem_pool *, struct tmem_obj *);
> +	bool (*is_remote)(void *);
> +	void (*new_obj)(struct tmem_obj *);
> +	int (*replace_in_obj)(void *, struct tmem_obj *);
> +};
> +extern void tmem_register_pamops(struct tmem_pamops *m);
> +
> +/* memory allocation methods provided by the host implementation */
> +struct tmem_hostops {
> +	struct tmem_obj *(*obj_alloc)(struct tmem_pool *);
> +	void (*obj_free)(struct tmem_obj *, struct tmem_pool *);
> +	struct tmem_objnode *(*objnode_alloc)(struct tmem_pool *);
> +	void (*objnode_free)(struct tmem_objnode *, struct tmem_pool *);
> +};
> +extern void tmem_register_hostops(struct tmem_hostops *m);
> +
> +/* core tmem accessor functions */
> +extern int tmem_put(struct tmem_pool *, struct tmem_oid *, uint32_t index,
> +			char *, size_t, bool, bool);
> +extern int tmem_get(struct tmem_pool *, struct tmem_oid *, uint32_t index,
> +			char *, size_t *, bool, int);
> +extern int tmem_replace(struct tmem_pool *, struct tmem_oid *, uint32_t index,
> +			void *);
> +extern int tmem_flush_page(struct tmem_pool *, struct tmem_oid *,
> +			uint32_t index);
> +extern int tmem_flush_object(struct tmem_pool *, struct tmem_oid *);
> +extern int tmem_destroy_pool(struct tmem_pool *);
> +extern void tmem_new_pool(struct tmem_pool *, uint32_t);
> +#endif /* _TMEM_H */
> diff --git a/drivers/mm/zcache/zcache-main.c b/drivers/mm/zcache/zcache-main.c
> new file mode 100644
> index 0000000..34b2c5c
> --- /dev/null
> +++ b/drivers/mm/zcache/zcache-main.c
> @@ -0,0 +1,2077 @@
> +/*
> + * zcache.c
> + *
> + * Copyright (c) 2010,2011, Dan Magenheimer, Oracle Corp.
> + * Copyright (c) 2010,2011, Nitin Gupta
> + *
> + * Zcache provides an in-kernel "host implementation" for transcendent memory
> + * and, thus indirectly, for cleancache and frontswap.  Zcache includes two
> + * page-accessible memory [1] interfaces, both utilizing the crypto compression
> + * API:
> + * 1) "compression buddies" ("zbud") is used for ephemeral pages
> + * 2) zsmalloc is used for persistent pages.
> + * Xvmalloc (based on the TLSF allocator) has very low fragmentation
> + * so maximizes space efficiency, while zbud allows pairs (and potentially,
> + * in the future, more than a pair of) compressed pages to be closely linked
> + * so that reclaiming can be done via the kernel's physical-page-oriented
> + * "shrinker" interface.
> + *

Doesn't actually explain why zbud is good for one and zsmalloc good for the other.

> + * [1] For a definition of page-accessible memory (aka PAM), see:
> + *   http://marc.info/?l=linux-mm&m=127811271605009
> + */

Stick this in Documentation/

> +
> +#include <linux/module.h>
> +#include <linux/cpu.h>
> +#include <linux/highmem.h>
> +#include <linux/list.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +#include <linux/types.h>
> +#include <linux/atomic.h>
> +#include <linux/math64.h>
> +#include <linux/crypto.h>
> +#include <linux/string.h>
> +#include <linux/idr.h>
> +#include <linux/zsmalloc.h>
> +
> +#include "tmem.h"
> +
> +#ifdef CONFIG_CLEANCACHE
> +#include <linux/cleancache.h>
> +#endif
> +#ifdef CONFIG_FRONTSWAP
> +#include <linux/frontswap.h>
> +#endif
> +
> +#if 0
> +/* this is more aggressive but may cause other problems? */
> +#define ZCACHE_GFP_MASK	(GFP_ATOMIC | __GFP_NORETRY | __GFP_NOWARN)

Why is this "more agressive"? If anything it's less aggressive because it'll
bail if there is no memory available. Get rid of this.

> +#else
> +#define ZCACHE_GFP_MASK \
> +	(__GFP_FS | __GFP_NORETRY | __GFP_NOWARN | __GFP_NOMEMALLOC)
> +#endif
> +
> +#define MAX_CLIENTS 16

Seems a bit arbitrary. Why 16?

> +#define LOCAL_CLIENT ((uint16_t)-1)
> +
> +MODULE_LICENSE("GPL");
> +
> +struct zcache_client {
> +	struct idr tmem_pools;
> +	struct zs_pool *zspool;
> +	bool allocated;
> +	atomic_t refcount;
> +};

why is "allocated" needed. Is the refcount not enough to determine if this
client is in use or not?

> +
> +static struct zcache_client zcache_host;
> +static struct zcache_client zcache_clients[MAX_CLIENTS];
> +
> +static inline uint16_t get_client_id_from_client(struct zcache_client *cli)
> +{
> +	BUG_ON(cli == NULL);
> +	if (cli == &zcache_host)
> +		return LOCAL_CLIENT;
> +	return cli - &zcache_clients[0];
> +}
> +
> +static struct zcache_client *get_zcache_client(uint16_t cli_id)
> +{
> +	if (cli_id == LOCAL_CLIENT)
> +		return &zcache_host;
> +
> +	if ((unsigned int)cli_id < MAX_CLIENTS)
> +		return &zcache_clients[cli_id];
> +
> +	return NULL;
> +}
> +
> +static inline bool is_local_client(struct zcache_client *cli)
> +{
> +	return cli == &zcache_host;
> +}
> +
> +/* crypto API for zcache  */
> +#define ZCACHE_COMP_NAME_SZ CRYPTO_MAX_ALG_NAME
> +static char zcache_comp_name[ZCACHE_COMP_NAME_SZ];
> +static struct crypto_comp * __percpu *zcache_comp_pcpu_tfms;
> +
> +enum comp_op {
> +	ZCACHE_COMPOP_COMPRESS,
> +	ZCACHE_COMPOP_DECOMPRESS
> +};
> +
> +static inline int zcache_comp_op(enum comp_op op,
> +				const u8 *src, unsigned int slen,
> +				u8 *dst, unsigned int *dlen)
> +{
> +	struct crypto_comp *tfm;
> +	int ret;
> +
> +	BUG_ON(!zcache_comp_pcpu_tfms);

Unnecessary check, it'll blow up on the next line if NULL anyway.

> +	tfm = *per_cpu_ptr(zcache_comp_pcpu_tfms, get_cpu());
> +	BUG_ON(!tfm);

If this BUG_ON triggers, it'll exit with preempt disabled and cause more
problems. Warn-on and recover.

> +	switch (op) {
> +	case ZCACHE_COMPOP_COMPRESS:
> +		ret = crypto_comp_compress(tfm, src, slen, dst, dlen);
> +		break;
> +	case ZCACHE_COMPOP_DECOMPRESS:
> +		ret = crypto_comp_decompress(tfm, src, slen, dst, dlen);
> +		break;
> +	default:
> +		ret = -EINVAL;
> +	}
> +	put_cpu();
> +	return ret;
> +}
> +
> +/**********
> + * Compression buddies ("zbud") provides for packing two (or, possibly
> + * in the future, more) compressed ephemeral pages into a single "raw"
> + * (physical) page and tracking them with data structures so that
> + * the raw pages can be easily reclaimed.
> + *

Ok, if I'm reading this right it implies that a page must at least compress
by 50% before zcache even accepts the page.  It would be interesting if
there were statistics available at runtime that recorded how often a page
was rejected because it did not compress well enough.

Oh... you do, but there is no obvious way to figure out whether compression
is failing more often than succeeding. You'd need a success counter too.

> + * A zbud page ("zbpg") is an aligned page containing a list_head,
> + * a lock, and two "zbud headers".  The remainder of the physical
> + * page is divided up into aligned 64-byte "chunks" which contain
> + * the compressed data for zero, one, or two zbuds.  Each zbpg
> + * resides on: (1) an "unused list" if it has no zbuds; (2) a
> + * "buddied" list if it is fully populated  with two zbuds; or
> + * (3) one of PAGE_SIZE/64 "unbuddied" lists indexed by how many chunks
> + * the one unbuddied zbud uses.  The data inside a zbpg cannot be
> + * read or written unless the zbpg's lock is held.
> + */
> +
> +#define ZBH_SENTINEL  0x43214321
> +#define ZBPG_SENTINEL  0xdeadbeef
> +
> +#define ZBUD_MAX_BUDS 2
> +
> +struct zbud_hdr {
> +	uint16_t client_id;
> +	uint16_t pool_id;
> +	struct tmem_oid oid;
> +	uint32_t index;
> +	uint16_t size; /* compressed size in bytes, zero means unused */
> +	DECL_SENTINEL
> +};
> +
> +struct zbud_page {
> +	struct list_head bud_list;
> +	spinlock_t lock;
> +	struct zbud_hdr buddy[ZBUD_MAX_BUDS];
> +	DECL_SENTINEL
> +	/* followed by NUM_CHUNK aligned CHUNK_SIZE-byte chunks */
> +};

how much chunk could a chunker chunk if a chunk could chunk chunks?

s/NUM_CHUNK/NCHUNKS/

The earlier comment mentions that the chunks are aligned but it's not
obvious that they are aligned here.

> +
> +#define CHUNK_SHIFT	6
> +#define CHUNK_SIZE	(1 << CHUNK_SHIFT)
> +#define CHUNK_MASK	(~(CHUNK_SIZE-1))
> +#define NCHUNKS		(((PAGE_SIZE - sizeof(struct zbud_page)) & \
> +				CHUNK_MASK) >> CHUNK_SHIFT)
> +#define MAX_CHUNK	(NCHUNKS-1)
> +
> +static struct {
> +	struct list_head list;
> +	unsigned count;
> +} zbud_unbuddied[NCHUNKS];
> +/* list N contains pages with N chunks USED and NCHUNKS-N unused */

As zbud_pages can only contain two buddies, it's not very clear why this
is even necessary. I'm missing something obvious.

> +/* element 0 is never used but optimizing that isn't worth it */
> +static unsigned long zbud_cumul_chunk_counts[NCHUNKS];
> +
> +struct list_head zbud_buddied_list;
> +static unsigned long zcache_zbud_buddied_count;
> +

nr_free_zbuds?

> +/* protects the buddied list and all unbuddied lists */
> +static DEFINE_SPINLOCK(zbud_budlists_spinlock);
> +
> +static LIST_HEAD(zbpg_unused_list);
> +static unsigned long zcache_zbpg_unused_list_count;
> +

nr_free_zpages ?

In general I find the naming a bit confusing to be honest

> +/* protects the unused page list */
> +static DEFINE_SPINLOCK(zbpg_unused_list_spinlock);
> +
> +static atomic_t zcache_zbud_curr_raw_pages;
> +static atomic_t zcache_zbud_curr_zpages;

Should not have been necessary to make these atomics. Probably protected
by zbpg_unused_list_spinlock or something similar.

> +static unsigned long zcache_zbud_curr_zbytes;

Overkill, this is just

zcache_zbud_curr_raw_pages << PAGE_SHIFT

> +static unsigned long zcache_zbud_cumul_zpages;
> +static unsigned long zcache_zbud_cumul_zbytes;
> +static unsigned long zcache_compress_poor;
> +static unsigned long zcache_mean_compress_poor;
> +

In general the stats keeping is going to suck on larger machines as these
are all shared writable cache lines. You might be able to mitigate the
impact in the future by moving these to vmstat. Maybe it doesn't matter
as such - it all depends on what velocity pages enter and leave zcache.
If that velocity is high, maybe the performance is shot anyway.

> +/* forward references */
> +static void *zcache_get_free_page(void);
> +static void zcache_free_page(void *p);
> +
> +/*
> + * zbud helper functions
> + */
> +
> +static inline unsigned zbud_max_buddy_size(void)
> +{
> +	return MAX_CHUNK << CHUNK_SHIFT;
> +}
> +

Is the max size not half of MAX_CHUNK as the page is split into two buddies?

> +static inline unsigned zbud_size_to_chunks(unsigned size)
> +{
> +	BUG_ON(size == 0 || size > zbud_max_buddy_size());
> +	return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT;
> +}
> +
> +static inline int zbud_budnum(struct zbud_hdr *zh)
> +{
> +	unsigned offset = (unsigned long)zh & (PAGE_SIZE - 1);
> +	struct zbud_page *zbpg = NULL;
> +	unsigned budnum = -1U;
> +	int i;
> +
> +	for (i = 0; i < ZBUD_MAX_BUDS; i++)
> +		if (offset == offsetof(typeof(*zbpg), buddy[i])) {
> +			budnum = i;
> +			break;
> +		}
> +	BUG_ON(budnum == -1U);
> +	return budnum;
> +}
> +
> +static char *zbud_data(struct zbud_hdr *zh, unsigned size)
> +{
> +	struct zbud_page *zbpg;
> +	char *p;
> +	unsigned budnum;
> +
> +	ASSERT_SENTINEL(zh, ZBH);
> +	budnum = zbud_budnum(zh);
> +	BUG_ON(size == 0 || size > zbud_max_buddy_size());
> +	zbpg = container_of(zh, struct zbud_page, buddy[budnum]);
> +	ASSERT_SPINLOCK(&zbpg->lock);
> +	p = (char *)zbpg;
> +	if (budnum == 0)
> +		p += ((sizeof(struct zbud_page) + CHUNK_SIZE - 1) &
> +							CHUNK_MASK);
> +	else if (budnum == 1)
> +		p += PAGE_SIZE - ((size + CHUNK_SIZE - 1) & CHUNK_MASK);
> +	return p;
> +}
> +
> +/*
> + * zbud raw page management
> + */
> +
> +static struct zbud_page *zbud_alloc_raw_page(void)
> +{
> +	struct zbud_page *zbpg = NULL;
> +	struct zbud_hdr *zh0, *zh1;
> +	bool recycled = 0;
> +

type mismatching

bool recycled = false

This mismatch in a few places.

recycled would have been completely unnecessary if zcache_get_free_page()
managed the initialisation

> +	/* if any pages on the zbpg list, use one */
> +	spin_lock(&zbpg_unused_list_spinlock);
> +	if (!list_empty(&zbpg_unused_list)) {
> +		zbpg = list_first_entry(&zbpg_unused_list,
> +				struct zbud_page, bud_list);
> +		list_del_init(&zbpg->bud_list);
> +		zcache_zbpg_unused_list_count--;
> +		recycled = 1;
> +	}
> +	spin_unlock(&zbpg_unused_list_spinlock);
> +	if (zbpg == NULL)
> +		/* none on zbpg list, try to get a kernel page */
> +		zbpg = zcache_get_free_page();

So zcache_get_free_page() is getting a preloaded page from a per-cpu magazine
and that thing blows up if there is no page available. This implies that
preemption must be disabled for the entire putting of a page into zcache!

> +	if (likely(zbpg != NULL)) {

It's not just likely, it's impossible because if it's NULL,
zcache_get_free_page() will already have BUG().

If it's the case that preemption is *not* disabled and the process gets
scheduled to a CPU that has its magazine consumed then this will blow up
in some cases.

Scary.

> +		INIT_LIST_HEAD(&zbpg->bud_list);
> +		zh0 = &zbpg->buddy[0]; zh1 = &zbpg->buddy[1];
> +		spin_lock_init(&zbpg->lock);
> +		if (recycled) {
> +			ASSERT_INVERTED_SENTINEL(zbpg, ZBPG);
> +			SET_SENTINEL(zbpg, ZBPG);
> +			BUG_ON(zh0->size != 0 || tmem_oid_valid(&zh0->oid));
> +			BUG_ON(zh1->size != 0 || tmem_oid_valid(&zh1->oid));
> +		} else {
> +			atomic_inc(&zcache_zbud_curr_raw_pages);
> +			INIT_LIST_HEAD(&zbpg->bud_list);
> +			SET_SENTINEL(zbpg, ZBPG);
> +			zh0->size = 0; zh1->size = 0;
> +			tmem_oid_set_invalid(&zh0->oid);
> +			tmem_oid_set_invalid(&zh1->oid);
> +		}
> +	}
> +	return zbpg;
> +}
> +
> +static void zbud_free_raw_page(struct zbud_page *zbpg)
> +{
> +	struct zbud_hdr *zh0 = &zbpg->buddy[0], *zh1 = &zbpg->buddy[1];
> +
> +	ASSERT_SENTINEL(zbpg, ZBPG);
> +	BUG_ON(!list_empty(&zbpg->bud_list));
> +	ASSERT_SPINLOCK(&zbpg->lock);
> +	BUG_ON(zh0->size != 0 || tmem_oid_valid(&zh0->oid));
> +	BUG_ON(zh1->size != 0 || tmem_oid_valid(&zh1->oid));
> +	INVERT_SENTINEL(zbpg, ZBPG);
> +	spin_unlock(&zbpg->lock);
> +	spin_lock(&zbpg_unused_list_spinlock);
> +	list_add(&zbpg->bud_list, &zbpg_unused_list);
> +	zcache_zbpg_unused_list_count++;
> +	spin_unlock(&zbpg_unused_list_spinlock);
> +}
> +
> +/*
> + * core zbud handling routines
> + */
> +
> +static unsigned zbud_free(struct zbud_hdr *zh)
> +{
> +	unsigned size;
> +
> +	ASSERT_SENTINEL(zh, ZBH);
> +	BUG_ON(!tmem_oid_valid(&zh->oid));
> +	size = zh->size;
> +	BUG_ON(zh->size == 0 || zh->size > zbud_max_buddy_size());
> +	zh->size = 0;
> +	tmem_oid_set_invalid(&zh->oid);
> +	INVERT_SENTINEL(zh, ZBH);
> +	zcache_zbud_curr_zbytes -= size;
> +	atomic_dec(&zcache_zbud_curr_zpages);
> +	return size;
> +}
> +
> +static void zbud_free_and_delist(struct zbud_hdr *zh)
> +{
> +	unsigned chunks;
> +	struct zbud_hdr *zh_other;
> +	unsigned budnum = zbud_budnum(zh), size;
> +	struct zbud_page *zbpg =
> +		container_of(zh, struct zbud_page, buddy[budnum]);
> +
> +	spin_lock(&zbud_budlists_spinlock);
> +	spin_lock(&zbpg->lock);
> +	if (list_empty(&zbpg->bud_list)) {
> +		/* ignore zombie page... see zbud_evict_pages() */
> +		spin_unlock(&zbpg->lock);
> +		spin_unlock(&zbud_budlists_spinlock);
> +		return;
> +	}
> +	size = zbud_free(zh);
> +	ASSERT_SPINLOCK(&zbpg->lock);
> +	zh_other = &zbpg->buddy[(budnum == 0) ? 1 : 0];
> +	if (zh_other->size == 0) { /* was unbuddied: unlist and free */
> +		chunks = zbud_size_to_chunks(size) ;
> +		BUG_ON(list_empty(&zbud_unbuddied[chunks].list));
> +		list_del_init(&zbpg->bud_list);
> +		zbud_unbuddied[chunks].count--;
> +		spin_unlock(&zbud_budlists_spinlock);
> +		zbud_free_raw_page(zbpg);
> +	} else { /* was buddied: move remaining buddy to unbuddied list */
> +		chunks = zbud_size_to_chunks(zh_other->size) ;
> +		list_del_init(&zbpg->bud_list);
> +		zcache_zbud_buddied_count--;
> +		list_add_tail(&zbpg->bud_list, &zbud_unbuddied[chunks].list);
> +		zbud_unbuddied[chunks].count++;
> +		spin_unlock(&zbud_budlists_spinlock);
> +		spin_unlock(&zbpg->lock);
> +	}
> +}
> +
> +static struct zbud_hdr *zbud_create(uint16_t client_id, uint16_t pool_id,
> +					struct tmem_oid *oid,
> +					uint32_t index, struct page *page,
> +					void *cdata, unsigned size)
> +{
> +	struct zbud_hdr *zh0, *zh1, *zh = NULL;
> +	struct zbud_page *zbpg = NULL, *ztmp;
> +	unsigned nchunks;
> +	char *to;
> +	int i, found_good_buddy = 0;
> +
> +	nchunks = zbud_size_to_chunks(size) ;
> +	for (i = MAX_CHUNK - nchunks + 1; i > 0; i--) {
> +		spin_lock(&zbud_budlists_spinlock);
> +		if (!list_empty(&zbud_unbuddied[i].list)) {
> +			list_for_each_entry_safe(zbpg, ztmp,
> +				    &zbud_unbuddied[i].list, bud_list) {
> +				if (spin_trylock(&zbpg->lock)) {
> +					found_good_buddy = i;
> +					goto found_unbuddied;
> +				}
> +			}
> +		}
> +		spin_unlock(&zbud_budlists_spinlock);
> +	}
> +	/* didn't find a good buddy, try allocating a new page */

It's not just try, it will have blown up if it failed the allocation.

> +	zbpg = zbud_alloc_raw_page();
> +	if (unlikely(zbpg == NULL))
> +		goto out;
> +	/* ok, have a page, now compress the data before taking locks */

This comment talks about compressing the data but I see no sign of the
compression taking place here. It happened earlier and got passed in 
as cdata.

> +	spin_lock(&zbud_budlists_spinlock);
> +	spin_lock(&zbpg->lock);
> +	list_add_tail(&zbpg->bud_list, &zbud_unbuddied[nchunks].list);
> +	zbud_unbuddied[nchunks].count++;
> +	zh = &zbpg->buddy[0];
> +	goto init_zh;
> +
> +found_unbuddied:
> +	ASSERT_SPINLOCK(&zbpg->lock);
> +	zh0 = &zbpg->buddy[0]; zh1 = &zbpg->buddy[1];

Multiple lines on single line :/

> +	BUG_ON(!((zh0->size == 0) ^ (zh1->size == 0)));
> +	if (zh0->size != 0) { /* buddy0 in use, buddy1 is vacant */
> +		ASSERT_SENTINEL(zh0, ZBH);
> +		zh = zh1;
> +	} else if (zh1->size != 0) { /* buddy1 in use, buddy0 is vacant */
> +		ASSERT_SENTINEL(zh1, ZBH);
> +		zh = zh0;
> +	} else
> +		BUG();
> +	list_del_init(&zbpg->bud_list);
> +	zbud_unbuddied[found_good_buddy].count--;
> +	list_add_tail(&zbpg->bud_list, &zbud_buddied_list);
> +	zcache_zbud_buddied_count++;
> +
> +init_zh:
> +	SET_SENTINEL(zh, ZBH);
> +	zh->size = size;
> +	zh->index = index;
> +	zh->oid = *oid;
> +	zh->pool_id = pool_id;
> +	zh->client_id = client_id;
> +	to = zbud_data(zh, size);
> +	memcpy(to, cdata, size);
> +	spin_unlock(&zbpg->lock);
> +	spin_unlock(&zbud_budlists_spinlock);
> +
> +	zbud_cumul_chunk_counts[nchunks]++;
> +	atomic_inc(&zcache_zbud_curr_zpages);
> +	zcache_zbud_cumul_zpages++;
> +	zcache_zbud_curr_zbytes += size;
> +	zcache_zbud_cumul_zbytes += size;
> +out:
> +	return zh;
> +}
> +
> +static int zbud_decompress(struct page *page, struct zbud_hdr *zh)
> +{
> +	struct zbud_page *zbpg;
> +	unsigned budnum = zbud_budnum(zh);
> +	unsigned int out_len = PAGE_SIZE;
> +	char *to_va, *from_va;
> +	unsigned size;
> +	int ret = 0;
> +
> +	zbpg = container_of(zh, struct zbud_page, buddy[budnum]);
> +	spin_lock(&zbpg->lock);
> +	if (list_empty(&zbpg->bud_list)) {
> +		/* ignore zombie page... see zbud_evict_pages() */
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +	ASSERT_SENTINEL(zh, ZBH);
> +	BUG_ON(zh->size == 0 || zh->size > zbud_max_buddy_size());
> +	to_va = kmap_atomic(page);
> +	size = zh->size;
> +	from_va = zbud_data(zh, size);
> +	ret = zcache_comp_op(ZCACHE_COMPOP_DECOMPRESS, from_va, size,
> +				to_va, &out_len);
> +	BUG_ON(ret);
> +	BUG_ON(out_len != PAGE_SIZE);
> +	kunmap_atomic(to_va);
> +out:
> +	spin_unlock(&zbpg->lock);
> +	return ret;
> +}
> +
> +/*
> + * The following routines handle shrinking of ephemeral pages by evicting
> + * pages "least valuable" first.
> + */
> +
> +static unsigned long zcache_evicted_raw_pages;
> +static unsigned long zcache_evicted_buddied_pages;
> +static unsigned long zcache_evicted_unbuddied_pages;
> +
> +static struct tmem_pool *zcache_get_pool_by_id(uint16_t cli_id,
> +						uint16_t poolid);
> +static void zcache_put_pool(struct tmem_pool *pool);
> +
> +/*
> + * Flush and free all zbuds in a zbpg, then free the pageframe
> + */
> +static void zbud_evict_zbpg(struct zbud_page *zbpg)
> +{
> +	struct zbud_hdr *zh;
> +	int i, j;
> +	uint32_t pool_id[ZBUD_MAX_BUDS], client_id[ZBUD_MAX_BUDS];
> +	uint32_t index[ZBUD_MAX_BUDS];
> +	struct tmem_oid oid[ZBUD_MAX_BUDS];
> +	struct tmem_pool *pool;
> +
> +	ASSERT_SPINLOCK(&zbpg->lock);
> +	BUG_ON(!list_empty(&zbpg->bud_list));
> +	for (i = 0, j = 0; i < ZBUD_MAX_BUDS; i++) {
> +		zh = &zbpg->buddy[i];
> +		if (zh->size) {
> +			client_id[j] = zh->client_id;
> +			pool_id[j] = zh->pool_id;
> +			oid[j] = zh->oid;
> +			index[j] = zh->index;
> +			j++;
> +			zbud_free(zh);
> +		}
> +	}
> +	spin_unlock(&zbpg->lock);
> +	for (i = 0; i < j; i++) {
> +		pool = zcache_get_pool_by_id(client_id[i], pool_id[i]);
> +		if (pool != NULL) {
> +			tmem_flush_page(pool, &oid[i], index[i]);
> +			zcache_put_pool(pool);
> +		}
> +	}
> +	ASSERT_SENTINEL(zbpg, ZBPG);
> +	spin_lock(&zbpg->lock);
> +	zbud_free_raw_page(zbpg);
> +}
> +
> +/*
> + * Free nr pages.  This code is funky because we want to hold the locks
> + * protecting various lists for as short a time as possible, and in some
> + * circumstances the list may change asynchronously when the list lock is
> + * not held.  In some cases we also trylock not only to avoid waiting on a
> + * page in use by another cpu, but also to avoid potential deadlock due to
> + * lock inversion.
> + */
> +static void zbud_evict_pages(int nr)
> +{
> +	struct zbud_page *zbpg;
> +	int i;
> +
> +	/* first try freeing any pages on unused list */
> +retry_unused_list:
> +	spin_lock_bh(&zbpg_unused_list_spinlock);
> +	if (!list_empty(&zbpg_unused_list)) {
> +		/* can't walk list here, since it may change when unlocked */
> +		zbpg = list_first_entry(&zbpg_unused_list,
> +				struct zbud_page, bud_list);
> +		list_del_init(&zbpg->bud_list);
> +		zcache_zbpg_unused_list_count--;
> +		atomic_dec(&zcache_zbud_curr_raw_pages);
> +		spin_unlock_bh(&zbpg_unused_list_spinlock);
> +		zcache_free_page(zbpg);
> +		zcache_evicted_raw_pages++;
> +		if (--nr <= 0)
> +			goto out;
> +		goto retry_unused_list;
> +	}
> +	spin_unlock_bh(&zbpg_unused_list_spinlock);
> +
> +	/* now try freeing unbuddied pages, starting with least space avail */
> +	for (i = 0; i < MAX_CHUNK; i++) {
> +retry_unbud_list_i:
> +		spin_lock_bh(&zbud_budlists_spinlock);
> +		if (list_empty(&zbud_unbuddied[i].list)) {
> +			spin_unlock_bh(&zbud_budlists_spinlock);
> +			continue;
> +		}
> +		list_for_each_entry(zbpg, &zbud_unbuddied[i].list, bud_list) {
> +			if (unlikely(!spin_trylock(&zbpg->lock)))
> +				continue;
> +			list_del_init(&zbpg->bud_list);
> +			zbud_unbuddied[i].count--;
> +			spin_unlock(&zbud_budlists_spinlock);
> +			zcache_evicted_unbuddied_pages++;
> +			/* want budlists unlocked when doing zbpg eviction */
> +			zbud_evict_zbpg(zbpg);
> +			local_bh_enable();
> +			if (--nr <= 0)
> +				goto out;
> +			goto retry_unbud_list_i;
> +		}
> +		spin_unlock_bh(&zbud_budlists_spinlock);
> +	}
> +
> +	/* as a last resort, free buddied pages */
> +retry_bud_list:
> +	spin_lock_bh(&zbud_budlists_spinlock);
> +	if (list_empty(&zbud_buddied_list)) {
> +		spin_unlock_bh(&zbud_budlists_spinlock);
> +		goto out;
> +	}
> +	list_for_each_entry(zbpg, &zbud_buddied_list, bud_list) {
> +		if (unlikely(!spin_trylock(&zbpg->lock)))
> +			continue;
> +		list_del_init(&zbpg->bud_list);
> +		zcache_zbud_buddied_count--;
> +		spin_unlock(&zbud_budlists_spinlock);
> +		zcache_evicted_buddied_pages++;
> +		/* want budlists unlocked when doing zbpg eviction */
> +		zbud_evict_zbpg(zbpg);
> +		local_bh_enable();
> +		if (--nr <= 0)
> +			goto out;
> +		goto retry_bud_list;
> +	}
> +	spin_unlock_bh(&zbud_budlists_spinlock);
> +out:
> +	return;
> +}
> +
> +static void __init zbud_init(void)
> +{
> +	int i;
> +
> +	INIT_LIST_HEAD(&zbud_buddied_list);
> +
> +	for (i = 0; i < NCHUNKS; i++)
> +		INIT_LIST_HEAD(&zbud_unbuddied[i].list);
> +}
> +
> +#ifdef CONFIG_SYSFS
> +/*
> + * These sysfs routines show a nice distribution of how many zbpg's are
> + * currently (and have ever been placed) in each unbuddied list.  It's fun
> + * to watch but can probably go away before final merge.
> + */
> +static int zbud_show_unbuddied_list_counts(char *buf)
> +{
> +	int i;
> +	char *p = buf;
> +
> +	for (i = 0; i < NCHUNKS; i++)
> +		p += sprintf(p, "%u ", zbud_unbuddied[i].count);
> +	return p - buf;
> +}
> +
> +static int zbud_show_cumul_chunk_counts(char *buf)
> +{
> +	unsigned long i, chunks = 0, total_chunks = 0, sum_total_chunks = 0;
> +	unsigned long total_chunks_lte_21 = 0, total_chunks_lte_32 = 0;
> +	unsigned long total_chunks_lte_42 = 0;
> +	char *p = buf;
> +
> +	for (i = 0; i < NCHUNKS; i++) {
> +		p += sprintf(p, "%lu ", zbud_cumul_chunk_counts[i]);
> +		chunks += zbud_cumul_chunk_counts[i];
> +		total_chunks += zbud_cumul_chunk_counts[i];
> +		sum_total_chunks += i * zbud_cumul_chunk_counts[i];
> +		if (i == 21)
> +			total_chunks_lte_21 = total_chunks;
> +		if (i == 32)
> +			total_chunks_lte_32 = total_chunks;
> +		if (i == 42)
> +			total_chunks_lte_42 = total_chunks;
> +	}
> +	p += sprintf(p, "<=21:%lu <=32:%lu <=42:%lu, mean:%lu\n",
> +		total_chunks_lte_21, total_chunks_lte_32, total_chunks_lte_42,
> +		chunks == 0 ? 0 : sum_total_chunks / chunks);
> +	return p - buf;
> +}
> +#endif
> +
> +/**********
> + * This "zv" PAM implementation combines the slab-based zsmalloc
> + * with the crypto compression API to maximize the amount of data that can
> + * be packed into a physical page.
> + *
> + * Zv represents a PAM page with the index and object (plus a "size" value
> + * necessary for decompression) immediately preceding the compressed data.
> + */
> +
> +#define ZVH_SENTINEL  0x43214321
> +
> +struct zv_hdr {
> +	uint32_t pool_id;
> +	struct tmem_oid oid;
> +	uint32_t index;
> +	size_t size;
> +	DECL_SENTINEL
> +};
> +
> +/* rudimentary policy limits */
> +/* total number of persistent pages may not exceed this percentage */
> +static unsigned int zv_page_count_policy_percent = 75;
> +/*
> + * byte count defining poor compression; pages with greater zsize will be
> + * rejected
> + */
> +static unsigned int zv_max_zsize = (PAGE_SIZE / 8) * 7;
> +/*
> + * byte count defining poor *mean* compression; pages with greater zsize
> + * will be rejected until sufficient better-compressed pages are accepted
> + * driving the mean below this threshold
> + */
> +static unsigned int zv_max_mean_zsize = (PAGE_SIZE / 8) * 5;
> +
> +static atomic_t zv_curr_dist_counts[NCHUNKS];
> +static atomic_t zv_cumul_dist_counts[NCHUNKS];
> +
> +static unsigned long zv_create(struct zs_pool *pool, uint32_t pool_id,
> +				struct tmem_oid *oid, uint32_t index,
> +				void *cdata, unsigned clen)
> +{
> +	struct zv_hdr *zv;
> +	u32 size = clen + sizeof(struct zv_hdr);
> +	int chunks = (size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
> +	unsigned long handle = 0;
> +
> +	BUG_ON(!irqs_disabled());
> +	BUG_ON(chunks >= NCHUNKS);
> +	handle = zs_malloc(pool, size);
> +	if (!handle)
> +		goto out;
> +	atomic_inc(&zv_curr_dist_counts[chunks]);
> +	atomic_inc(&zv_cumul_dist_counts[chunks]);
> +	zv = zs_map_object(pool, handle, ZS_MM_WO);
> +	zv->index = index;
> +	zv->oid = *oid;
> +	zv->pool_id = pool_id;
> +	zv->size = clen;
> +	SET_SENTINEL(zv, ZVH);
> +	memcpy((char *)zv + sizeof(struct zv_hdr), cdata, clen);
> +	zs_unmap_object(pool, handle);
> +out:
> +	return handle;
> +}
> +
> +static void zv_free(struct zs_pool *pool, unsigned long handle)
> +{
> +	unsigned long flags;
> +	struct zv_hdr *zv;
> +	uint16_t size;
> +	int chunks;
> +
> +	zv = zs_map_object(pool, handle, ZS_MM_RW);
> +	ASSERT_SENTINEL(zv, ZVH);
> +	size = zv->size + sizeof(struct zv_hdr);
> +	INVERT_SENTINEL(zv, ZVH);
> +	zs_unmap_object(pool, handle);
> +
> +	chunks = (size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
> +	BUG_ON(chunks >= NCHUNKS);
> +	atomic_dec(&zv_curr_dist_counts[chunks]);
> +
> +	local_irq_save(flags);
> +	zs_free(pool, handle);
> +	local_irq_restore(flags);
> +}
> +
> +static void zv_decompress(struct page *page, unsigned long handle)
> +{
> +	unsigned int clen = PAGE_SIZE;
> +	char *to_va;
> +	int ret;
> +	struct zv_hdr *zv;
> +
> +	zv = zs_map_object(zcache_host.zspool, handle, ZS_MM_RO);
> +	BUG_ON(zv->size == 0);
> +	ASSERT_SENTINEL(zv, ZVH);
> +	to_va = kmap_atomic(page);
> +	ret = zcache_comp_op(ZCACHE_COMPOP_DECOMPRESS, (char *)zv + sizeof(*zv),
> +				zv->size, to_va, &clen);
> +	kunmap_atomic(to_va);
> +	zs_unmap_object(zcache_host.zspool, handle);
> +	BUG_ON(ret);
> +	BUG_ON(clen != PAGE_SIZE);
> +}
> +
> +#ifdef CONFIG_SYSFS
> +/*
> + * show a distribution of compression stats for zv pages.
> + */
> +
> +static int zv_curr_dist_counts_show(char *buf)
> +{
> +	unsigned long i, n, chunks = 0, sum_total_chunks = 0;
> +	char *p = buf;
> +
> +	for (i = 0; i < NCHUNKS; i++) {
> +		n = atomic_read(&zv_curr_dist_counts[i]);
> +		p += sprintf(p, "%lu ", n);
> +		chunks += n;
> +		sum_total_chunks += i * n;
> +	}
> +	p += sprintf(p, "mean:%lu\n",
> +		chunks == 0 ? 0 : sum_total_chunks / chunks);
> +	return p - buf;
> +}
> +
> +static int zv_cumul_dist_counts_show(char *buf)
> +{
> +	unsigned long i, n, chunks = 0, sum_total_chunks = 0;
> +	char *p = buf;
> +
> +	for (i = 0; i < NCHUNKS; i++) {
> +		n = atomic_read(&zv_cumul_dist_counts[i]);
> +		p += sprintf(p, "%lu ", n);
> +		chunks += n;
> +		sum_total_chunks += i * n;
> +	}
> +	p += sprintf(p, "mean:%lu\n",
> +		chunks == 0 ? 0 : sum_total_chunks / chunks);
> +	return p - buf;
> +}
> +
> +/*
> + * setting zv_max_zsize via sysfs causes all persistent (e.g. swap)
> + * pages that don't compress to less than this value (including metadata
> + * overhead) to be rejected.  We don't allow the value to get too close
> + * to PAGE_SIZE.
> + */
> +static ssize_t zv_max_zsize_show(struct kobject *kobj,
> +				    struct kobj_attribute *attr,
> +				    char *buf)
> +{
> +	return sprintf(buf, "%u\n", zv_max_zsize);
> +}
> +
> +static ssize_t zv_max_zsize_store(struct kobject *kobj,
> +				    struct kobj_attribute *attr,
> +				    const char *buf, size_t count)
> +{
> +	unsigned long val;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	err = kstrtoul(buf, 10, &val);
> +	if (err || (val == 0) || (val > (PAGE_SIZE / 8) * 7))
> +		return -EINVAL;
> +	zv_max_zsize = val;
> +	return count;
> +}
> +
> +/*
> + * setting zv_max_mean_zsize via sysfs causes all persistent (e.g. swap)
> + * pages that don't compress to less than this value (including metadata
> + * overhead) to be rejected UNLESS the mean compression is also smaller
> + * than this value.  In other words, we are load-balancing-by-zsize the
> + * accepted pages.  Again, we don't allow the value to get too close
> + * to PAGE_SIZE.
> + */
> +static ssize_t zv_max_mean_zsize_show(struct kobject *kobj,
> +				    struct kobj_attribute *attr,
> +				    char *buf)
> +{
> +	return sprintf(buf, "%u\n", zv_max_mean_zsize);
> +}
> +
> +static ssize_t zv_max_mean_zsize_store(struct kobject *kobj,
> +				    struct kobj_attribute *attr,
> +				    const char *buf, size_t count)
> +{
> +	unsigned long val;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	err = kstrtoul(buf, 10, &val);
> +	if (err || (val == 0) || (val > (PAGE_SIZE / 8) * 7))
> +		return -EINVAL;
> +	zv_max_mean_zsize = val;
> +	return count;
> +}
> +
> +/*
> + * setting zv_page_count_policy_percent via sysfs sets an upper bound of
> + * persistent (e.g. swap) pages that will be retained according to:
> + *     (zv_page_count_policy_percent * totalram_pages) / 100)
> + * when that limit is reached, further puts will be rejected (until
> + * some pages have been flushed).  Note that, due to compression,
> + * this number may exceed 100; it defaults to 75 and we set an
> + * arbitary limit of 150.  A poor choice will almost certainly result
> + * in OOM's, so this value should only be changed prudently.
> + */
> +static ssize_t zv_page_count_policy_percent_show(struct kobject *kobj,
> +						 struct kobj_attribute *attr,
> +						 char *buf)
> +{
> +	return sprintf(buf, "%u\n", zv_page_count_policy_percent);
> +}
> +
> +static ssize_t zv_page_count_policy_percent_store(struct kobject *kobj,
> +						  struct kobj_attribute *attr,
> +						  const char *buf, size_t count)
> +{
> +	unsigned long val;
> +	int err;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	err = kstrtoul(buf, 10, &val);
> +	if (err || (val == 0) || (val > 150))
> +		return -EINVAL;
> +	zv_page_count_policy_percent = val;
> +	return count;
> +}
> +
> +static struct kobj_attribute zcache_zv_max_zsize_attr = {
> +		.attr = { .name = "zv_max_zsize", .mode = 0644 },
> +		.show = zv_max_zsize_show,
> +		.store = zv_max_zsize_store,
> +};
> +
> +static struct kobj_attribute zcache_zv_max_mean_zsize_attr = {
> +		.attr = { .name = "zv_max_mean_zsize", .mode = 0644 },
> +		.show = zv_max_mean_zsize_show,
> +		.store = zv_max_mean_zsize_store,
> +};
> +
> +static struct kobj_attribute zcache_zv_page_count_policy_percent_attr = {
> +		.attr = { .name = "zv_page_count_policy_percent",
> +			  .mode = 0644 },
> +		.show = zv_page_count_policy_percent_show,
> +		.store = zv_page_count_policy_percent_store,
> +};
> +#endif
> +
> +/*
> + * zcache core code starts here
> + */
> +
> +/* useful stats not collected by cleancache or frontswap */
> +static unsigned long zcache_flush_total;
> +static unsigned long zcache_flush_found;
> +static unsigned long zcache_flobj_total;
> +static unsigned long zcache_flobj_found;
> +static unsigned long zcache_failed_eph_puts;
> +static unsigned long zcache_failed_pers_puts;
> +
> +/*
> + * Tmem operations assume the poolid implies the invoking client.
> + * Zcache only has one client (the kernel itself): LOCAL_CLIENT.
> + * RAMster has each client numbered by cluster node, and a KVM version
> + * of zcache would have one client per guest and each client might
> + * have a poolid==N.
> + */
> +static struct tmem_pool *zcache_get_pool_by_id(uint16_t cli_id, uint16_t poolid)
> +{
> +	struct tmem_pool *pool = NULL;
> +	struct zcache_client *cli = NULL;
> +
> +	cli = get_zcache_client(cli_id);
> +	if (!cli)
> +		goto out;
> +
> +	atomic_inc(&cli->refcount);
> +	pool = idr_find(&cli->tmem_pools, poolid);
> +	if (pool != NULL)
> +		atomic_inc(&pool->refcount);
> +out:
> +	return pool;
> +}
> +
> +static void zcache_put_pool(struct tmem_pool *pool)
> +{
> +	struct zcache_client *cli = NULL;
> +
> +	if (pool == NULL)
> +		BUG();
> +	cli = pool->client;
> +	atomic_dec(&pool->refcount);
> +	atomic_dec(&cli->refcount);
> +}
> +
> +int zcache_new_client(uint16_t cli_id)
> +{
> +	struct zcache_client *cli;
> +	int ret = -1;
> +
> +	cli = get_zcache_client(cli_id);
> +
> +	if (cli == NULL)
> +		goto out;
> +	if (cli->allocated)
> +		goto out;
> +	cli->allocated = 1;
> +#ifdef CONFIG_FRONTSWAP
> +	cli->zspool = zs_create_pool("zcache", ZCACHE_GFP_MASK);
> +	if (cli->zspool == NULL)
> +		goto out;
> +	idr_init(&cli->tmem_pools);
> +#endif
> +	ret = 0;
> +out:
> +	return ret;
> +}
> +
> +/* counters for debugging */
> +static unsigned long zcache_failed_get_free_pages;
> +static unsigned long zcache_failed_alloc;
> +static unsigned long zcache_put_to_flush;
> +
> +/*
> + * for now, used named slabs so can easily track usage; later can
> + * either just use kmalloc, or perhaps add a slab-like allocator
> + * to more carefully manage total memory utilization
> + */
> +static struct kmem_cache *zcache_objnode_cache;
> +static struct kmem_cache *zcache_obj_cache;
> +static atomic_t zcache_curr_obj_count = ATOMIC_INIT(0);
> +static unsigned long zcache_curr_obj_count_max;
> +static atomic_t zcache_curr_objnode_count = ATOMIC_INIT(0);
> +static unsigned long zcache_curr_objnode_count_max;
> +
> +/*
> + * to avoid memory allocation recursion (e.g. due to direct reclaim), we
> + * preload all necessary data structures so the hostops callbacks never
> + * actually do a malloc
> + */
> +struct zcache_preload {
> +	void *page;
> +	struct tmem_obj *obj;
> +	int nr;
> +	struct tmem_objnode *objnodes[OBJNODE_TREE_MAX_PATH];
> +};
> +static DEFINE_PER_CPU(struct zcache_preload, zcache_preloads) = { 0, };
> +
> +static int zcache_do_preload(struct tmem_pool *pool)
> +{
> +	struct zcache_preload *kp;
> +	struct tmem_objnode *objnode;
> +	struct tmem_obj *obj;
> +	void *page;
> +	int ret = -ENOMEM;
> +
> +	if (unlikely(zcache_objnode_cache == NULL))
> +		goto out;
> +	if (unlikely(zcache_obj_cache == NULL))
> +		goto out;
> +
> +	/* IRQ has already been disabled. */
> +	kp = &__get_cpu_var(zcache_preloads);
> +	while (kp->nr < ARRAY_SIZE(kp->objnodes)) {
> +		objnode = kmem_cache_alloc(zcache_objnode_cache,
> +				ZCACHE_GFP_MASK);
> +		if (unlikely(objnode == NULL)) {
> +			zcache_failed_alloc++;
> +			goto out;
> +		}
> +
> +		kp->objnodes[kp->nr++] = objnode;
> +	}
> +
> +	if (!kp->obj) {
> +		obj = kmem_cache_alloc(zcache_obj_cache, ZCACHE_GFP_MASK);
> +		if (unlikely(obj == NULL)) {
> +			zcache_failed_alloc++;
> +			goto out;
> +		}
> +		kp->obj = obj;
> +	}
> +
> +	if (!kp->page) {
> +		page = (void *)__get_free_page(ZCACHE_GFP_MASK);
> +		if (unlikely(page == NULL)) {
> +			zcache_failed_get_free_pages++;
> +			goto out;
> +		}
> +		kp->page =  page;
> +	}
> +
> +	ret = 0;
> +out:
> +	return ret;
> +}

Ok, so if this thing fails to allocate a page then what prevents us getting into
a situation where the zcache grows to a large size and we cannot take decompress
anything in it because we cannot allocate a page here?

It looks like this could potentially deadlock the system unless it was possible
to either discard zcache data and reconstruct it from information on disk.
It feels like something like a mempool needs to exist that is used to forcibly
shrink the zcache somehow but I can't seem to find where something like that happens.

Where is it or is there a risk of deadlock here?

> +
> +static void *zcache_get_free_page(void)
> +{
> +	struct zcache_preload *kp;
> +	void *page;
> +
> +	kp = &__get_cpu_var(zcache_preloads);
> +	page = kp->page;
> +	BUG_ON(page == NULL);
> +	kp->page = NULL;
> +	return page;
> +}
> +
> +static void zcache_free_page(void *p)
> +{
> +	free_page((unsigned long)p);
> +}
> +
> +/*
> + * zcache implementation for tmem host ops
> + */
> +
> +static struct tmem_objnode *zcache_objnode_alloc(struct tmem_pool *pool)
> +{
> +	struct tmem_objnode *objnode = NULL;
> +	unsigned long count;
> +	struct zcache_preload *kp;
> +
> +	kp = &__get_cpu_var(zcache_preloads);
> +	if (kp->nr <= 0)
> +		goto out;
> +	objnode = kp->objnodes[kp->nr - 1];
> +	BUG_ON(objnode == NULL);
> +	kp->objnodes[kp->nr - 1] = NULL;
> +	kp->nr--;
> +	count = atomic_inc_return(&zcache_curr_objnode_count);
> +	if (count > zcache_curr_objnode_count_max)
> +		zcache_curr_objnode_count_max = count;
> +out:
> +	return objnode;
> +}
> +
> +static void zcache_objnode_free(struct tmem_objnode *objnode,
> +					struct tmem_pool *pool)
> +{
> +	atomic_dec(&zcache_curr_objnode_count);
> +	BUG_ON(atomic_read(&zcache_curr_objnode_count) < 0);
> +	kmem_cache_free(zcache_objnode_cache, objnode);
> +}
> +
> +static struct tmem_obj *zcache_obj_alloc(struct tmem_pool *pool)
> +{
> +	struct tmem_obj *obj = NULL;
> +	unsigned long count;
> +	struct zcache_preload *kp;
> +
> +	kp = &__get_cpu_var(zcache_preloads);
> +	obj = kp->obj;
> +	BUG_ON(obj == NULL);
> +	kp->obj = NULL;
> +	count = atomic_inc_return(&zcache_curr_obj_count);
> +	if (count > zcache_curr_obj_count_max)
> +		zcache_curr_obj_count_max = count;
> +	return obj;
> +}
> +
> +static void zcache_obj_free(struct tmem_obj *obj, struct tmem_pool *pool)
> +{
> +	atomic_dec(&zcache_curr_obj_count);
> +	BUG_ON(atomic_read(&zcache_curr_obj_count) < 0);
> +	kmem_cache_free(zcache_obj_cache, obj);
> +}
> +
> +static struct tmem_hostops zcache_hostops = {
> +	.obj_alloc = zcache_obj_alloc,
> +	.obj_free = zcache_obj_free,
> +	.objnode_alloc = zcache_objnode_alloc,
> +	.objnode_free = zcache_objnode_free,
> +};
> +
> +/*
> + * zcache implementations for PAM page descriptor ops
> + */
> +
> +static atomic_t zcache_curr_eph_pampd_count = ATOMIC_INIT(0);
> +static unsigned long zcache_curr_eph_pampd_count_max;
> +static atomic_t zcache_curr_pers_pampd_count = ATOMIC_INIT(0);
> +static unsigned long zcache_curr_pers_pampd_count_max;
> +
> +/* forward reference */
> +static int zcache_compress(struct page *from, void **out_va, unsigned *out_len);
> +
> +static void *zcache_pampd_create(char *data, size_t size, bool raw, int eph,
> +				struct tmem_pool *pool, struct tmem_oid *oid,
> +				 uint32_t index)
> +{
> +	void *pampd = NULL, *cdata;
> +	unsigned clen;
> +	int ret;
> +	unsigned long count;
> +	struct page *page = (struct page *)(data);
> +	struct zcache_client *cli = pool->client;
> +	uint16_t client_id = get_client_id_from_client(cli);
> +	unsigned long zv_mean_zsize;
> +	unsigned long curr_pers_pampd_count;
> +	u64 total_zsize;
> +
> +	if (eph) {
> +		ret = zcache_compress(page, &cdata, &clen);
> +		if (ret == 0)
> +			goto out;
> +		if (clen == 0 || clen > zbud_max_buddy_size()) {
> +			zcache_compress_poor++;
> +			goto out;
> +		}
> +		pampd = (void *)zbud_create(client_id, pool->pool_id, oid,
> +						index, page, cdata, clen);
> +		if (pampd != NULL) {
> +			count = atomic_inc_return(&zcache_curr_eph_pampd_count);
> +			if (count > zcache_curr_eph_pampd_count_max)
> +				zcache_curr_eph_pampd_count_max = count;
> +		}
> +	} else {
> +		curr_pers_pampd_count =
> +			atomic_read(&zcache_curr_pers_pampd_count);
> +		if (curr_pers_pampd_count >
> +		    (zv_page_count_policy_percent * totalram_pages) / 100)
> +			goto out;
> +		ret = zcache_compress(page, &cdata, &clen);
> +		if (ret == 0)
> +			goto out;
> +		/* reject if compression is too poor */
> +		if (clen > zv_max_zsize) {
> +			zcache_compress_poor++;
> +			goto out;
> +		}

Here is where some sort of success count is needed too so we can figure
out what percentage of pages are failing to compress.

> +		/* reject if mean compression is too poor */
> +		if ((clen > zv_max_mean_zsize) && (curr_pers_pampd_count > 0)) {
> +			total_zsize = zs_get_total_size_bytes(cli->zspool);
> +			zv_mean_zsize = div_u64(total_zsize,
> +						curr_pers_pampd_count);
> +			if (zv_mean_zsize > zv_max_mean_zsize) {
> +				zcache_mean_compress_poor++;
> +				goto out;
> +			}
> +		}

hmmmm, feels like this would be difficult to tune properly but cannot
exactly put my finger on it.

> +		pampd = (void *)zv_create(cli->zspool, pool->pool_id,
> +						oid, index, cdata, clen);
> +		if (pampd == NULL)
> +			goto out;
> +		count = atomic_inc_return(&zcache_curr_pers_pampd_count);
> +		if (count > zcache_curr_pers_pampd_count_max)
> +			zcache_curr_pers_pampd_count_max = count;
> +	}
> +out:
> +	return pampd;
> +}
> +
> +/*
> + * fill the pageframe corresponding to the struct page with the data
> + * from the passed pampd
> + */
> +static int zcache_pampd_get_data(char *data, size_t *bufsize, bool raw,
> +					void *pampd, struct tmem_pool *pool,
> +					struct tmem_oid *oid, uint32_t index)
> +{
> +	int ret = 0;
> +
> +	BUG_ON(is_ephemeral(pool));
> +	zv_decompress((struct page *)(data), (unsigned long)pampd);
> +	return ret;
> +}
> +
> +/*
> + * fill the pageframe corresponding to the struct page with the data
> + * from the passed pampd
> + */
> +static int zcache_pampd_get_data_and_free(char *data, size_t *bufsize, bool raw,
> +					void *pampd, struct tmem_pool *pool,
> +					struct tmem_oid *oid, uint32_t index)
> +{
> +	BUG_ON(!is_ephemeral(pool));
> +	if (zbud_decompress((struct page *)(data), pampd) < 0)
> +		return -EINVAL;
> +	zbud_free_and_delist((struct zbud_hdr *)pampd);
> +	atomic_dec(&zcache_curr_eph_pampd_count);
> +	return 0;
> +}
> +
> +/*
> + * free the pampd and remove it from any zcache lists
> + * pampd must no longer be pointed to from any tmem data structures!
> + */
> +static void zcache_pampd_free(void *pampd, struct tmem_pool *pool,
> +				struct tmem_oid *oid, uint32_t index)
> +{
> +	struct zcache_client *cli = pool->client;
> +
> +	if (is_ephemeral(pool)) {
> +		zbud_free_and_delist((struct zbud_hdr *)pampd);
> +		atomic_dec(&zcache_curr_eph_pampd_count);
> +		BUG_ON(atomic_read(&zcache_curr_eph_pampd_count) < 0);
> +	} else {
> +		zv_free(cli->zspool, (unsigned long)pampd);
> +		atomic_dec(&zcache_curr_pers_pampd_count);
> +		BUG_ON(atomic_read(&zcache_curr_pers_pampd_count) < 0);
> +	}
> +}
> +
> +static void zcache_pampd_free_obj(struct tmem_pool *pool, struct tmem_obj *obj)
> +{
> +}
> +
> +static void zcache_pampd_new_obj(struct tmem_obj *obj)
> +{
> +}
> +
> +static int zcache_pampd_replace_in_obj(void *pampd, struct tmem_obj *obj)
> +{
> +	return -1;
> +}
> +
> +static bool zcache_pampd_is_remote(void *pampd)
> +{
> +	return 0;
> +}
> +
> +static struct tmem_pamops zcache_pamops = {
> +	.create = zcache_pampd_create,
> +	.get_data = zcache_pampd_get_data,
> +	.get_data_and_free = zcache_pampd_get_data_and_free,
> +	.free = zcache_pampd_free,
> +	.free_obj = zcache_pampd_free_obj,
> +	.new_obj = zcache_pampd_new_obj,
> +	.replace_in_obj = zcache_pampd_replace_in_obj,
> +	.is_remote = zcache_pampd_is_remote,
> +};
> +
> +/*
> + * zcache compression/decompression and related per-cpu stuff
> + */
> +
> +static DEFINE_PER_CPU(unsigned char *, zcache_dstmem);
> +#define ZCACHE_DSTMEM_ORDER 1
> +
> +static int zcache_compress(struct page *from, void **out_va, unsigned *out_len)
> +{
> +	int ret = 0;
> +	unsigned char *dmem = __get_cpu_var(zcache_dstmem);
> +	char *from_va;
> +
> +	BUG_ON(!irqs_disabled());
> +	if (unlikely(dmem == NULL))
> +		goto out;  /* no buffer or no compressor so can't compress */
> +	*out_len = PAGE_SIZE << ZCACHE_DSTMEM_ORDER;
> +	from_va = kmap_atomic(from);

Ok, so I am running out of beans here but this triggered alarm bells. Is
zcache stored in lowmem? If so, then it might be a total no-go on 32-bit
systems if pages from highmem cause increased low memory pressure to put
the page into zcache.

> +	mb();

.... Why?

> +	ret = zcache_comp_op(ZCACHE_COMPOP_COMPRESS, from_va, PAGE_SIZE, dmem,
> +				out_len);
> +	BUG_ON(ret);
> +	*out_va = dmem;
> +	kunmap_atomic(from_va);
> +	ret = 1;
> +out:
> +	return ret;
> +}
> +
> +static int zcache_comp_cpu_up(int cpu)
> +{
> +	struct crypto_comp *tfm;
> +
> +	tfm = crypto_alloc_comp(zcache_comp_name, 0, 0);
> +	if (IS_ERR(tfm))
> +		return NOTIFY_BAD;
> +	*per_cpu_ptr(zcache_comp_pcpu_tfms, cpu) = tfm;
> +	return NOTIFY_OK;
> +}
> +
> +static void zcache_comp_cpu_down(int cpu)
> +{
> +	struct crypto_comp *tfm;
> +
> +	tfm = *per_cpu_ptr(zcache_comp_pcpu_tfms, cpu);
> +	crypto_free_comp(tfm);
> +	*per_cpu_ptr(zcache_comp_pcpu_tfms, cpu) = NULL;
> +}
> +
> +static int zcache_cpu_notifier(struct notifier_block *nb,
> +				unsigned long action, void *pcpu)
> +{
> +	int ret, cpu = (long)pcpu;
> +	struct zcache_preload *kp;
> +
> +	switch (action) {
> +	case CPU_UP_PREPARE:
> +		ret = zcache_comp_cpu_up(cpu);
> +		if (ret != NOTIFY_OK) {
> +			pr_err("zcache: can't allocate compressor transform\n");
> +			return ret;
> +		}
> +		per_cpu(zcache_dstmem, cpu) = (void *)__get_free_pages(
> +			GFP_KERNEL | __GFP_REPEAT, ZCACHE_DSTMEM_ORDER);
> +		break;
> +	case CPU_DEAD:
> +	case CPU_UP_CANCELED:
> +		zcache_comp_cpu_down(cpu);
> +		free_pages((unsigned long)per_cpu(zcache_dstmem, cpu),
> +			ZCACHE_DSTMEM_ORDER);
> +		per_cpu(zcache_dstmem, cpu) = NULL;
> +		kp = &per_cpu(zcache_preloads, cpu);
> +		while (kp->nr) {
> +			kmem_cache_free(zcache_objnode_cache,
> +					kp->objnodes[kp->nr - 1]);
> +			kp->objnodes[kp->nr - 1] = NULL;
> +			kp->nr--;
> +		}
> +		if (kp->obj) {
> +			kmem_cache_free(zcache_obj_cache, kp->obj);
> +			kp->obj = NULL;
> +		}
> +		if (kp->page) {
> +			free_page((unsigned long)kp->page);
> +			kp->page = NULL;
> +		}
> +		break;
> +	default:
> +		break;
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block zcache_cpu_notifier_block = {
> +	.notifier_call = zcache_cpu_notifier
> +};
> +
> +#ifdef CONFIG_SYSFS
> +#define ZCACHE_SYSFS_RO(_name) \
> +	static ssize_t zcache_##_name##_show(struct kobject *kobj, \
> +				struct kobj_attribute *attr, char *buf) \
> +	{ \
> +		return sprintf(buf, "%lu\n", zcache_##_name); \
> +	} \
> +	static struct kobj_attribute zcache_##_name##_attr = { \
> +		.attr = { .name = __stringify(_name), .mode = 0444 }, \
> +		.show = zcache_##_name##_show, \
> +	}
> +
> +#define ZCACHE_SYSFS_RO_ATOMIC(_name) \
> +	static ssize_t zcache_##_name##_show(struct kobject *kobj, \
> +				struct kobj_attribute *attr, char *buf) \
> +	{ \
> +	    return sprintf(buf, "%d\n", atomic_read(&zcache_##_name)); \
> +	} \
> +	static struct kobj_attribute zcache_##_name##_attr = { \
> +		.attr = { .name = __stringify(_name), .mode = 0444 }, \
> +		.show = zcache_##_name##_show, \
> +	}
> +
> +#define ZCACHE_SYSFS_RO_CUSTOM(_name, _func) \
> +	static ssize_t zcache_##_name##_show(struct kobject *kobj, \
> +				struct kobj_attribute *attr, char *buf) \
> +	{ \
> +	    return _func(buf); \
> +	} \
> +	static struct kobj_attribute zcache_##_name##_attr = { \
> +		.attr = { .name = __stringify(_name), .mode = 0444 }, \
> +		.show = zcache_##_name##_show, \
> +	}
> +
> +ZCACHE_SYSFS_RO(curr_obj_count_max);
> +ZCACHE_SYSFS_RO(curr_objnode_count_max);
> +ZCACHE_SYSFS_RO(flush_total);
> +ZCACHE_SYSFS_RO(flush_found);
> +ZCACHE_SYSFS_RO(flobj_total);
> +ZCACHE_SYSFS_RO(flobj_found);
> +ZCACHE_SYSFS_RO(failed_eph_puts);
> +ZCACHE_SYSFS_RO(failed_pers_puts);
> +ZCACHE_SYSFS_RO(zbud_curr_zbytes);
> +ZCACHE_SYSFS_RO(zbud_cumul_zpages);
> +ZCACHE_SYSFS_RO(zbud_cumul_zbytes);
> +ZCACHE_SYSFS_RO(zbud_buddied_count);
> +ZCACHE_SYSFS_RO(zbpg_unused_list_count);
> +ZCACHE_SYSFS_RO(evicted_raw_pages);
> +ZCACHE_SYSFS_RO(evicted_unbuddied_pages);
> +ZCACHE_SYSFS_RO(evicted_buddied_pages);
> +ZCACHE_SYSFS_RO(failed_get_free_pages);
> +ZCACHE_SYSFS_RO(failed_alloc);
> +ZCACHE_SYSFS_RO(put_to_flush);
> +ZCACHE_SYSFS_RO(compress_poor);
> +ZCACHE_SYSFS_RO(mean_compress_poor);
> +ZCACHE_SYSFS_RO_ATOMIC(zbud_curr_raw_pages);
> +ZCACHE_SYSFS_RO_ATOMIC(zbud_curr_zpages);
> +ZCACHE_SYSFS_RO_ATOMIC(curr_obj_count);
> +ZCACHE_SYSFS_RO_ATOMIC(curr_objnode_count);
> +ZCACHE_SYSFS_RO_CUSTOM(zbud_unbuddied_list_counts,
> +			zbud_show_unbuddied_list_counts);
> +ZCACHE_SYSFS_RO_CUSTOM(zbud_cumul_chunk_counts,
> +			zbud_show_cumul_chunk_counts);
> +ZCACHE_SYSFS_RO_CUSTOM(zv_curr_dist_counts,
> +			zv_curr_dist_counts_show);
> +ZCACHE_SYSFS_RO_CUSTOM(zv_cumul_dist_counts,
> +			zv_cumul_dist_counts_show);
> +
> +static struct attribute *zcache_attrs[] = {
> +	&zcache_curr_obj_count_attr.attr,
> +	&zcache_curr_obj_count_max_attr.attr,
> +	&zcache_curr_objnode_count_attr.attr,
> +	&zcache_curr_objnode_count_max_attr.attr,
> +	&zcache_flush_total_attr.attr,
> +	&zcache_flobj_total_attr.attr,
> +	&zcache_flush_found_attr.attr,
> +	&zcache_flobj_found_attr.attr,
> +	&zcache_failed_eph_puts_attr.attr,
> +	&zcache_failed_pers_puts_attr.attr,
> +	&zcache_compress_poor_attr.attr,
> +	&zcache_mean_compress_poor_attr.attr,
> +	&zcache_zbud_curr_raw_pages_attr.attr,
> +	&zcache_zbud_curr_zpages_attr.attr,
> +	&zcache_zbud_curr_zbytes_attr.attr,
> +	&zcache_zbud_cumul_zpages_attr.attr,
> +	&zcache_zbud_cumul_zbytes_attr.attr,
> +	&zcache_zbud_buddied_count_attr.attr,
> +	&zcache_zbpg_unused_list_count_attr.attr,
> +	&zcache_evicted_raw_pages_attr.attr,
> +	&zcache_evicted_unbuddied_pages_attr.attr,
> +	&zcache_evicted_buddied_pages_attr.attr,
> +	&zcache_failed_get_free_pages_attr.attr,
> +	&zcache_failed_alloc_attr.attr,
> +	&zcache_put_to_flush_attr.attr,
> +	&zcache_zbud_unbuddied_list_counts_attr.attr,
> +	&zcache_zbud_cumul_chunk_counts_attr.attr,
> +	&zcache_zv_curr_dist_counts_attr.attr,
> +	&zcache_zv_cumul_dist_counts_attr.attr,
> +	&zcache_zv_max_zsize_attr.attr,
> +	&zcache_zv_max_mean_zsize_attr.attr,
> +	&zcache_zv_page_count_policy_percent_attr.attr,
> +	NULL,
> +};
> +
> +static struct attribute_group zcache_attr_group = {
> +	.attrs = zcache_attrs,
> +	.name = "zcache",
> +};
> +
> +#endif /* CONFIG_SYSFS */
> +/*
> + * When zcache is disabled ("frozen"), pools can be created and destroyed,
> + * but all puts (and thus all other operations that require memory allocation)
> + * must fail.  If zcache is unfrozen, accepts puts, then frozen again,
> + * data consistency requires all puts while frozen to be converted into
> + * flushes.
> + */
> +static bool zcache_freeze;
> +
> +/*
> + * zcache shrinker interface (only useful for ephemeral pages, so zbud only)
> + */
> +static int shrink_zcache_memory(struct shrinker *shrink,
> +				struct shrink_control *sc)
> +{
> +	int ret = -1;
> +	int nr = sc->nr_to_scan;
> +	gfp_t gfp_mask = sc->gfp_mask;
> +
> +	if (nr >= 0) {
> +		if (!(gfp_mask & __GFP_FS))
> +			/* does this case really need to be skipped? */
> +			goto out;

Answer that question. It's not obvious at all why zcache cannot handle
!__GFP_FS. You're not obviously recursing into a filesystem.

> +		zbud_evict_pages(nr);
> +	}
> +	ret = (int)atomic_read(&zcache_zbud_curr_raw_pages);
> +out:
> +	return ret;
> +}
> +
> +static struct shrinker zcache_shrinker = {
> +	.shrink = shrink_zcache_memory,
> +	.seeks = DEFAULT_SEEKS,
> +};
> +
> +/*
> + * zcache shims between cleancache/frontswap ops and tmem
> + */
> +
> +static int zcache_put_page(int cli_id, int pool_id, struct tmem_oid *oidp,
> +				uint32_t index, struct page *page)
> +{
> +	struct tmem_pool *pool;
> +	int ret = -1;
> +
> +	BUG_ON(!irqs_disabled());
> +	pool = zcache_get_pool_by_id(cli_id, pool_id);
> +	if (unlikely(pool == NULL))
> +		goto out;
> +	if (!zcache_freeze && zcache_do_preload(pool) == 0) {
> +		/* preload does preempt_disable on success */
> +		ret = tmem_put(pool, oidp, index, (char *)(page),
> +				PAGE_SIZE, 0, is_ephemeral(pool));
> +		if (ret < 0) {
> +			if (is_ephemeral(pool))
> +				zcache_failed_eph_puts++;
> +			else
> +				zcache_failed_pers_puts++;
> +		}
> +	} else {
> +		zcache_put_to_flush++;
> +		if (atomic_read(&pool->obj_count) > 0)
> +			/* the put fails whether the flush succeeds or not */
> +			(void)tmem_flush_page(pool, oidp, index);
> +	}
> +
> +	zcache_put_pool(pool);
> +out:
> +	return ret;
> +}
> +
> +static int zcache_get_page(int cli_id, int pool_id, struct tmem_oid *oidp,
> +				uint32_t index, struct page *page)
> +{
> +	struct tmem_pool *pool;
> +	int ret = -1;
> +	unsigned long flags;
> +	size_t size = PAGE_SIZE;
> +
> +	local_irq_save(flags);

Why do interrupts have to be disabled?

This makes the locking between tmem and zcache very confusing unfortunately
because I cannot decide if tmem indirectly depends on disabled interrupts
or not. It's also not clear why an interrupt handler would be trying to
get/put pages in tmem.

> +	pool = zcache_get_pool_by_id(cli_id, pool_id);
> +	if (likely(pool != NULL)) {
> +		if (atomic_read(&pool->obj_count) > 0)
> +			ret = tmem_get(pool, oidp, index, (char *)(page),
> +					&size, 0, is_ephemeral(pool));

It looks like you are disabling interrupts to avoid racing on that atomic
update. 

This feels very shaky and the layering is being violated. You should
unconditionally call into tmem_get and not worry about the pool count at
all. tmem_get should then check the count under the pool lock and make
obj_count a normal counter instead of an atomic.

The same comment applies to all the other obj_count locations.

> +		zcache_put_pool(pool);
> +	}
> +	local_irq_restore(flags);
> +	return ret;
> +}
> +
> +static int zcache_flush_page(int cli_id, int pool_id,
> +				struct tmem_oid *oidp, uint32_t index)
> +{
> +	struct tmem_pool *pool;
> +	int ret = -1;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	zcache_flush_total++;
> +	pool = zcache_get_pool_by_id(cli_id, pool_id);
> +	if (likely(pool != NULL)) {
> +		if (atomic_read(&pool->obj_count) > 0)
> +			ret = tmem_flush_page(pool, oidp, index);
> +		zcache_put_pool(pool);
> +	}
> +	if (ret >= 0)
> +		zcache_flush_found++;
> +	local_irq_restore(flags);
> +	return ret;
> +}
> +
> +static int zcache_flush_object(int cli_id, int pool_id,
> +				struct tmem_oid *oidp)
> +{
> +	struct tmem_pool *pool;
> +	int ret = -1;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	zcache_flobj_total++;
> +	pool = zcache_get_pool_by_id(cli_id, pool_id);
> +	if (likely(pool != NULL)) {
> +		if (atomic_read(&pool->obj_count) > 0)
> +			ret = tmem_flush_object(pool, oidp);
> +		zcache_put_pool(pool);
> +	}
> +	if (ret >= 0)
> +		zcache_flobj_found++;
> +	local_irq_restore(flags);
> +	return ret;
> +}
> +
> +static int zcache_destroy_pool(int cli_id, int pool_id)
> +{
> +	struct tmem_pool *pool = NULL;
> +	struct zcache_client *cli;
> +	int ret = -1;
> +
> +	if (pool_id < 0)
> +		goto out;
> +
> +	cli = get_zcache_client(cli_id);
> +	if (cli == NULL)
> +		goto out;
> +
> +	atomic_inc(&cli->refcount);
> +	pool = idr_find(&cli->tmem_pools, pool_id);
> +	if (pool == NULL)
> +		goto out;
> +	idr_remove(&cli->tmem_pools, pool_id);
> +	/* wait for pool activity on other cpus to quiesce */
> +	while (atomic_read(&pool->refcount) != 0)
> +		;

There *HAS* to be a better way of waiting before destroying the pool
than than a busy wait.

> +	atomic_dec(&cli->refcount);
> +	local_bh_disable();
> +	ret = tmem_destroy_pool(pool);
> +	local_bh_enable();

Again I'm missing something about how interrupt handlers even end up in
any of the paths.

> +	kfree(pool);
> +	pr_info("zcache: destroyed pool id=%d, cli_id=%d\n",
> +			pool_id, cli_id);
> +out:
> +	return ret;
> +}
> +
> +static int zcache_new_pool(uint16_t cli_id, uint32_t flags)
> +{
> +	int poolid = -1;
> +	struct tmem_pool *pool;
> +	struct zcache_client *cli = NULL;
> +	int r;
> +
> +	cli = get_zcache_client(cli_id);
> +	if (cli == NULL)
> +		goto out;
> +
> +	atomic_inc(&cli->refcount);
> +	pool = kmalloc(sizeof(struct tmem_pool), GFP_ATOMIC);
> +	if (pool == NULL) {
> +		pr_info("zcache: pool creation failed: out of memory\n");
> +		goto out;
> +	}
> +
> +	do {
> +		r = idr_pre_get(&cli->tmem_pools, GFP_ATOMIC);
> +		if (r != 1) {
> +			kfree(pool);
> +			pr_info("zcache: pool creation failed: out of memory\n");
> +			goto out;
> +		}
> +		r = idr_get_new(&cli->tmem_pools, pool, &poolid);
> +	} while (r == -EAGAIN);
> +	if (r) {
> +		pr_info("zcache: pool creation failed: error %d\n", r);
> +		kfree(pool);
> +		goto out;
> +	}
> +
> +	atomic_set(&pool->refcount, 0);
> +	pool->client = cli;
> +	pool->pool_id = poolid;
> +	tmem_new_pool(pool, flags);
> +	pr_info("zcache: created %s tmem pool, id=%d, client=%d\n",
> +		flags & TMEM_POOL_PERSIST ? "persistent" : "ephemeral",
> +		poolid, cli_id);
> +out:
> +	if (cli != NULL)
> +		atomic_dec(&cli->refcount);
> +	return poolid;
> +}
> +
> +/**********
> + * Two kernel functionalities currently can be layered on top of tmem.
> + * These are "cleancache" which is used as a second-chance cache for clean
> + * page cache pages; and "frontswap" which is used for swap pages
> + * to avoid writes to disk.  A generic "shim" is provided here for each
> + * to translate in-kernel semantics to zcache semantics.
> + */
> +
> +#ifdef CONFIG_CLEANCACHE

Feels like this should be in its own file with a clear interface to
zcache-main.c . Minor point, at this point I'm fatigued reading the code
and cranky.

> +static void zcache_cleancache_put_page(int pool_id,
> +					struct cleancache_filekey key,
> +					pgoff_t index, struct page *page)
> +{
> +	u32 ind = (u32) index;

This looks like an interesting limitation. How sure are you that index
will never be larger than u32 and this start behaving badly? I guess it's
because the index is going to be related to PFN and there are not that
many 16TB machines lying around but this looks like something that could
bite us on the ass one day.


> +	struct tmem_oid oid = *(struct tmem_oid *)&key;
> +
> +	if (likely(ind == index))
> +		(void)zcache_put_page(LOCAL_CLIENT, pool_id, &oid, index, page);
> +}
> +
> +static int zcache_cleancache_get_page(int pool_id,
> +					struct cleancache_filekey key,
> +					pgoff_t index, struct page *page)
> +{
> +	u32 ind = (u32) index;
> +	struct tmem_oid oid = *(struct tmem_oid *)&key;
> +	int ret = -1;
> +
> +	if (likely(ind == index))
> +		ret = zcache_get_page(LOCAL_CLIENT, pool_id, &oid, index, page);
> +	return ret;
> +}
> +
> +static void zcache_cleancache_flush_page(int pool_id,
> +					struct cleancache_filekey key,
> +					pgoff_t index)
> +{
> +	u32 ind = (u32) index;
> +	struct tmem_oid oid = *(struct tmem_oid *)&key;
> +
> +	if (likely(ind == index))
> +		(void)zcache_flush_page(LOCAL_CLIENT, pool_id, &oid, ind);
> +}
> +
> +static void zcache_cleancache_flush_inode(int pool_id,
> +					struct cleancache_filekey key)
> +{
> +	struct tmem_oid oid = *(struct tmem_oid *)&key;
> +
> +	(void)zcache_flush_object(LOCAL_CLIENT, pool_id, &oid);
> +}
> +
> +static void zcache_cleancache_flush_fs(int pool_id)
> +{
> +	if (pool_id >= 0)
> +		(void)zcache_destroy_pool(LOCAL_CLIENT, pool_id);
> +}
> +
> +static int zcache_cleancache_init_fs(size_t pagesize)
> +{
> +	BUG_ON(sizeof(struct cleancache_filekey) !=
> +				sizeof(struct tmem_oid));
> +	BUG_ON(pagesize != PAGE_SIZE);
> +	return zcache_new_pool(LOCAL_CLIENT, 0);
> +}
> +
> +static int zcache_cleancache_init_shared_fs(char *uuid, size_t pagesize)
> +{
> +	/* shared pools are unsupported and map to private */
> +	BUG_ON(sizeof(struct cleancache_filekey) !=
> +				sizeof(struct tmem_oid));
> +	BUG_ON(pagesize != PAGE_SIZE);
> +	return zcache_new_pool(LOCAL_CLIENT, 0);
> +}
> +
> +static struct cleancache_ops zcache_cleancache_ops = {
> +	.put_page = zcache_cleancache_put_page,
> +	.get_page = zcache_cleancache_get_page,
> +	.invalidate_page = zcache_cleancache_flush_page,
> +	.invalidate_inode = zcache_cleancache_flush_inode,
> +	.invalidate_fs = zcache_cleancache_flush_fs,
> +	.init_shared_fs = zcache_cleancache_init_shared_fs,
> +	.init_fs = zcache_cleancache_init_fs
> +};
> +
> +struct cleancache_ops zcache_cleancache_register_ops(void)
> +{
> +	struct cleancache_ops old_ops =
> +		cleancache_register_ops(&zcache_cleancache_ops);
> +
> +	return old_ops;
> +}
> +#endif
> +
> +#ifdef CONFIG_FRONTSWAP
> +/* a single tmem poolid is used for all frontswap "types" (swapfiles) */
> +static int zcache_frontswap_poolid = -1;
> +
> +/*
> + * Swizzling increases objects per swaptype, increasing tmem concurrency
> + * for heavy swaploads.  Later, larger nr_cpus -> larger SWIZ_BITS
> + * Setting SWIZ_BITS to 27 basically reconstructs the swap entry from
> + * frontswap_load(), but has side-effects. Hence using 8.
> + */

Ok, I don't get this but honestly, I didn't try either. I'll take your word for it.

> +#define SWIZ_BITS		8
> +#define SWIZ_MASK		((1 << SWIZ_BITS) - 1)
> +#define _oswiz(_type, _ind)	((_type << SWIZ_BITS) | (_ind & SWIZ_MASK))
> +#define iswiz(_ind)		(_ind >> SWIZ_BITS)
> +
> +static inline struct tmem_oid oswiz(unsigned type, u32 ind)
> +{
> +	struct tmem_oid oid = { .oid = { 0 } };
> +	oid.oid[0] = _oswiz(type, ind);
> +	return oid;
> +}
> +
> +static int zcache_frontswap_store(unsigned type, pgoff_t offset,
> +				   struct page *page)
> +{
> +	u64 ind64 = (u64)offset;
> +	u32 ind = (u32)offset;
> +	struct tmem_oid oid = oswiz(type, ind);
> +	int ret = -1;
> +	unsigned long flags;
> +
> +	BUG_ON(!PageLocked(page));
> +	if (likely(ind64 == ind)) {
> +		local_irq_save(flags);
> +		ret = zcache_put_page(LOCAL_CLIENT, zcache_frontswap_poolid,
> +					&oid, iswiz(ind), page);
> +		local_irq_restore(flags);
> +	}

Again those interrupt disabling reaches right out and pokes me in the
eye. It seems completely unnecessary to depend on interrupts being disabled.

> +	return ret;
> +}
> +
> +/* returns 0 if the page was successfully gotten from frontswap, -1 if
> + * was not present (should never happen!) */
> +static int zcache_frontswap_load(unsigned type, pgoff_t offset,
> +				   struct page *page)
> +{
> +	u64 ind64 = (u64)offset;
> +	u32 ind = (u32)offset;
> +	struct tmem_oid oid = oswiz(type, ind);
> +	int ret = -1;
> +
> +	BUG_ON(!PageLocked(page));
> +	if (likely(ind64 == ind))
> +		ret = zcache_get_page(LOCAL_CLIENT, zcache_frontswap_poolid,
> +					&oid, iswiz(ind), page);
> +	return ret;
> +}
> +
> +/* flush a single page from frontswap */
> +static void zcache_frontswap_flush_page(unsigned type, pgoff_t offset)
> +{
> +	u64 ind64 = (u64)offset;
> +	u32 ind = (u32)offset;
> +	struct tmem_oid oid = oswiz(type, ind);
> +
> +	if (likely(ind64 == ind))
> +		(void)zcache_flush_page(LOCAL_CLIENT, zcache_frontswap_poolid,
> +					&oid, iswiz(ind));
> +}
> +
> +/* flush all pages from the passed swaptype */
> +static void zcache_frontswap_flush_area(unsigned type)
> +{
> +	struct tmem_oid oid;
> +	int ind;
> +
> +	for (ind = SWIZ_MASK; ind >= 0; ind--) {
> +		oid = oswiz(type, ind);
> +		(void)zcache_flush_object(LOCAL_CLIENT,
> +						zcache_frontswap_poolid, &oid);
> +	}
> +}
> +
> +static void zcache_frontswap_init(unsigned ignored)
> +{
> +	/* a single tmem poolid is used for all frontswap "types" (swapfiles) */
> +	if (zcache_frontswap_poolid < 0)
> +		zcache_frontswap_poolid =
> +			zcache_new_pool(LOCAL_CLIENT, TMEM_POOL_PERSIST);
> +}
> +
> +static struct frontswap_ops zcache_frontswap_ops = {
> +	.store = zcache_frontswap_store,
> +	.load = zcache_frontswap_load,
> +	.invalidate_page = zcache_frontswap_flush_page,
> +	.invalidate_area = zcache_frontswap_flush_area,
> +	.init = zcache_frontswap_init
> +};
> +
> +struct frontswap_ops zcache_frontswap_register_ops(void)
> +{
> +	struct frontswap_ops old_ops =
> +		frontswap_register_ops(&zcache_frontswap_ops);
> +
> +	return old_ops;
> +}
> +#endif
> +
> +/*
> + * zcache initialization
> + * NOTE FOR NOW zcache MUST BE PROVIDED AS A KERNEL BOOT PARAMETER OR
> + * NOTHING HAPPENS!
> + */
> +

ok..... why?

superficially there does not appear to be anything obvious that stops it
being turned on at runtime. Hardly a blocked, just odd.

> +static int zcache_enabled;
> +
> +static int __init enable_zcache(char *s)
> +{
> +	zcache_enabled = 1;
> +	return 1;
> +}
> +__setup("zcache", enable_zcache);
> +
> +/* allow independent dynamic disabling of cleancache and frontswap */
> +
> +static int use_cleancache = 1;
> +
> +static int __init no_cleancache(char *s)
> +{
> +	use_cleancache = 0;
> +	return 1;
> +}
> +
> +__setup("nocleancache", no_cleancache);
> +
> +static int use_frontswap = 1;
> +
> +static int __init no_frontswap(char *s)
> +{
> +	use_frontswap = 0;
> +	return 1;
> +}
> +
> +__setup("nofrontswap", no_frontswap);
> +
> +static int __init enable_zcache_compressor(char *s)
> +{
> +	strncpy(zcache_comp_name, s, ZCACHE_COMP_NAME_SZ);
> +	zcache_enabled = 1;
> +	return 1;
> +}
> +__setup("zcache=", enable_zcache_compressor);
> +
> +
> +static int __init zcache_comp_init(void)
> +{
> +	int ret = 0;
> +
> +	/* check crypto algorithm */
> +	if (*zcache_comp_name != '\0') {
> +		ret = crypto_has_comp(zcache_comp_name, 0, 0);
> +		if (!ret)
> +			pr_info("zcache: %s not supported\n",
> +					zcache_comp_name);
> +	}
> +	if (!ret)
> +		strcpy(zcache_comp_name, "lzo");
> +	ret = crypto_has_comp(zcache_comp_name, 0, 0);
> +	if (!ret) {
> +		ret = 1;
> +		goto out;
> +	}
> +	pr_info("zcache: using %s compressor\n", zcache_comp_name);
> +
> +	/* alloc percpu transforms */
> +	ret = 0;
> +	zcache_comp_pcpu_tfms = alloc_percpu(struct crypto_comp *);
> +	if (!zcache_comp_pcpu_tfms)
> +		ret = 1;
> +out:
> +	return ret;
> +}
> +
> +static int __init zcache_init(void)
> +{
> +	int ret = 0;
> +
> +#ifdef CONFIG_SYSFS
> +	ret = sysfs_create_group(mm_kobj, &zcache_attr_group);
> +	if (ret) {
> +		pr_err("zcache: can't create sysfs\n");
> +		goto out;
> +	}
> +#endif /* CONFIG_SYSFS */
> +
> +	if (zcache_enabled) {
> +		unsigned int cpu;
> +
> +		tmem_register_hostops(&zcache_hostops);
> +		tmem_register_pamops(&zcache_pamops);
> +		ret = register_cpu_notifier(&zcache_cpu_notifier_block);
> +		if (ret) {
> +			pr_err("zcache: can't register cpu notifier\n");
> +			goto out;
> +		}
> +		ret = zcache_comp_init();
> +		if (ret) {
> +			pr_err("zcache: compressor initialization failed\n");
> +			goto out;
> +		}
> +		for_each_online_cpu(cpu) {
> +			void *pcpu = (void *)(long)cpu;
> +			zcache_cpu_notifier(&zcache_cpu_notifier_block,
> +				CPU_UP_PREPARE, pcpu);
> +		}
> +	}
> +	zcache_objnode_cache = kmem_cache_create("zcache_objnode",
> +				sizeof(struct tmem_objnode), 0, 0, NULL);
> +	zcache_obj_cache = kmem_cache_create("zcache_obj",
> +				sizeof(struct tmem_obj), 0, 0, NULL);
> +	ret = zcache_new_client(LOCAL_CLIENT);
> +	if (ret) {
> +		pr_err("zcache: can't create client\n");
> +		goto out;
> +	}
> +
> +#ifdef CONFIG_CLEANCACHE
> +	if (zcache_enabled && use_cleancache) {
> +		struct cleancache_ops old_ops;
> +
> +		zbud_init();
> +		register_shrinker(&zcache_shrinker);
> +		old_ops = zcache_cleancache_register_ops();
> +		pr_info("zcache: cleancache enabled using kernel "
> +			"transcendent memory and compression buddies\n");
> +		if (old_ops.init_fs != NULL)
> +			pr_warning("zcache: cleancache_ops overridden");
> +	}
> +#endif
> +#ifdef CONFIG_FRONTSWAP
> +	if (zcache_enabled && use_frontswap) {
> +		struct frontswap_ops old_ops;
> +
> +		old_ops = zcache_frontswap_register_ops();
> +		pr_info("zcache: frontswap enabled using kernel "
> +			"transcendent memory and zsmalloc\n");
> +		if (old_ops.init != NULL)
> +			pr_warning("zcache: frontswap_ops overridden");
> +	}
> +#endif
> +out:
> +	return ret;
> +}
> +
> +module_init(zcache_init)
> diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
> new file mode 100644
> index 0000000..de2e8bf
> --- /dev/null
> +++ b/include/linux/zsmalloc.h
> @@ -0,0 +1,43 @@
> +/*
> + * zsmalloc memory allocator
> + *
> + * Copyright (C) 2011  Nitin Gupta
> + *
> + * This code is released using a dual license strategy: BSD/GPL
> + * You can choose the license that better fits your requirements.
> + *
> + * Released under the terms of 3-clause BSD License
> + * Released under the terms of GNU General Public License Version 2.0
> + */
> +

Ok, I didn't read anything after this point.  It's another allocator that
may or may not pack compressed pages better. The usual concerns about
internal fragmentation and the like apply but I'm not going to mull over them
now. The really interesting part was deciding if zcache was ready or not.

So, on zcache, zbud and the underlying tmem thing;

The locking is convulated, the interrupt disabling suspicious and there is at
least one place where it looks like we are depending on not being scheduled
on another CPU during a long operation. It may actually be that you are
disabling interrupts to prevent that happening but it's not documented. Even
if it's the case, disabling interrupts to avoid CPU migration is overkill.

I'm also worried that there appears to be no control over how large
the zcache can get and am suspicious it can increase lowmem pressure on
32-bit machines.  If the lowmem pressure is real then zcache should not
be available on machines with highmem at all. I'm *really* worried that
it can deadlock if a page allocation fails before decompressing a page.

That said, my initial feeling still stands. I think that this needs to move
out of staging because it's in limbo where it is but Andrew may disagree
because of the reservations. If my reservations are accurate then they
should at least be *clearly* documented with a note saying that using
this in production is ill-advised for now. If zcache is activated via the
kernel parameter, it should print a big dirty warning that the feature is
still experiemental and leave that warning there until all the issues are
addressed. Right now I'm not convinced this is production ready but that
the  issues could be fixed incrementally.
Konrad Rzeszutek Wilk Sept. 21, 2012, 6:02 p.m. UTC | #6
On Fri, Sep 21, 2012 at 05:12:52PM +0100, Mel Gorman wrote:
> On Tue, Sep 04, 2012 at 04:34:46PM -0500, Seth Jennings wrote:
> > zcache is the remaining piece of code required to support in-kernel
> > memory compression.  The other two features, cleancache and frontswap,
> > have been promoted to mainline in 3.0 and 3.5 respectively.  This
> > patchset promotes zcache from the staging tree to mainline.
> > 
> 
> This is a very rough review of the code simply because I was asked to
> look at it. I'm barely aware of the history and I'm not a user of this
> code myself so take all of this with a grain of salt.

Ah fresh set of eyes! Yeey!
> 
> Very broadly speaking my initial reaction before I reviewed anything was
> that *some* sort of usable backend for cleancache or frontswap should exist
> at this point. My understanding is that Xen is the primary user of both
> those frontends and ramster, while interesting, is not something that a
> typical user will benefit from.

Right, the majority of users do not use virtualization. Thought embedded
wise .. well, there are a lot of Android users - thought I am not 100%
sure they are using it right now (I recall seeing changelogs for the clones
of Android mentioning zcache).
> 
> That said, I worry that this has bounced around a lot and as Dan (the
> original author) has a rewrite. I'm wary of spending too much time on this
> at all. Is Dan's new code going to replace this or what? It'd be nice to
> find a definitive answer on that.

The idea is to take parts of zcache2 as seperate patches and stick it
in the code you just reviewed (those that make sense as part of unstaging).
The end result will be that zcache1 == zcache2 in functionality. Right
now we are assembling a list of TODOs for zcache that should be done as part
of 'unstaging'.

> 
> Anyway, here goes

.. and your responses will fill the TODO with many extra line-items.

Its going to take a bit of time to mull over your questions, so it will
take me some time. Also Dan will probably beat me in providing the answers.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Seth Jennings Sept. 21, 2012, 7:02 p.m. UTC | #7
On 09/21/2012 01:02 PM, Konrad Rzeszutek Wilk wrote:
> On Fri, Sep 21, 2012 at 05:12:52PM +0100, Mel Gorman wrote:
>> On Tue, Sep 04, 2012 at 04:34:46PM -0500, Seth Jennings wrote:
>>> zcache is the remaining piece of code required to support in-kernel
>>> memory compression.  The other two features, cleancache and frontswap,
>>> have been promoted to mainline in 3.0 and 3.5 respectively.  This
>>> patchset promotes zcache from the staging tree to mainline.
>>>
>>
>> This is a very rough review of the code simply because I was asked to
>> look at it. I'm barely aware of the history and I'm not a user of this
>> code myself so take all of this with a grain of salt.
> 
> Ah fresh set of eyes! Yeey!

Agreed! Thanks so much!

>>
>> Very broadly speaking my initial reaction before I reviewed anything was
>> that *some* sort of usable backend for cleancache or frontswap should exist
>> at this point. My understanding is that Xen is the primary user of both
>> those frontends and ramster, while interesting, is not something that a
>> typical user will benefit from.
> 
> Right, the majority of users do not use virtualization. Thought embedded
> wise .. well, there are a lot of Android users - thought I am not 100%
> sure they are using it right now (I recall seeing changelogs for the clones
> of Android mentioning zcache).
>>
>> That said, I worry that this has bounced around a lot and as Dan (the
>> original author) has a rewrite. I'm wary of spending too much time on this
>> at all. Is Dan's new code going to replace this or what? It'd be nice to
>> find a definitive answer on that.
> 
> The idea is to take parts of zcache2 as seperate patches and stick it
> in the code you just reviewed (those that make sense as part of unstaging).

I agree with this.  Only the changes from zcache2 (Dan's
rewrite) that are necessary for promotion should be
considered right now.  Afaict, none of the concerns raised
in these comments are addressed by the changes in zcache2.

> The end result will be that zcache1 == zcache2 in functionality. Right
> now we are assembling a list of TODOs for zcache that should be done as part
> of 'unstaging'.
> 
>>
>> Anyway, here goes
> 
> .. and your responses will fill the TODO with many extra line-items.

Great, thanks Konrad.

> 
> Its going to take a bit of time to mull over your questions, so it will
> take me some time.

Same here. I'll respond asap. Thanks again, Mel!

--
Seth

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Sept. 21, 2012, 7:14 p.m. UTC | #8
Hi Mel --

Wow!  An incredibly wonderfully detailed response!  Thank you very
much for taking the time to read through all of zcache!

Your comments run the gamut from nit and code style, to design,
architecture and broad naming.  Until the choice-of-codebase issue
is resolved, I'll avoid the nits and codestyle comments and respond
to the higher level strategic and design questions.  Since a couple
of your questions are repeated and the specific code which provoked
your question is not isolated, I hope it is OK if I answer those
first out-of-context from your original comments in the code.
(This should also make this easier to read and to extract optimal
meaning, for you and for posterity.)

> That said, I worry that this has bounced around a lot and as Dan (the
> original author) has a rewrite. I'm wary of spending too much time on this
> at all. Is Dan's new code going to replace this or what? It'd be nice to
> find a definitive answer on that.

Replacing this code was my intent, but that was blocked.  IMHO zcache2
is _much_ better than the "demo version" of zcache (aka zcache1).
Hopefully a middle ground can be reached.  I've proposed one privately
offlist.

Seth, please feel free to augment or correct anything below, or
respond to anything I haven't commented on.

> Anyway, here goes

Repeated comments answered first out-of-context:

1) The interrupt context for zcache (and any tmem backend) is imposed
   by the frontend callers.  Cleancache_put [see naming comment below]
   is always called with interrupts disabled.  Cleancache_flush is
   sometimes called with interrupts disabled and sometimes not.
   Cleancache_get is never called in an atomic context.  (I think)
   frontswap_get/put/flush are never called in an atomic context but
   sometimes with the swap_lock held. Because it is dangerous (true?)
   for code to sometimes/not be called in atomic context, much of the
   code in zcache and tmem is forced into atomic context.  BUT Andrea
   observed that there are situations where asynchronicity would be
   preferable and, it turns out that cleancache_get and frontswap_get
   are never called in atomic context.  Zcache2/ramster takes advantage of
   that, and a future KVM backend may want to do so as well.  However,
   the interrupt/atomicity model and assumptions certainly does deserve
   better documentation.

2) The naming of the core tmem functions (put, get, flush) has been
   discussed endlessly, everyone has a different opinion, and the
   current state is a mess: cleancache, frontswap, and the various
   backends are horribly inconsistent.   IMHO, the use of "put"
   and "get" for reference counting is a historical accident, and
   the tmem ABI names were chosen well before I understood the historical
   precedence and the potential for confusion by kernel developers.
   So I don't have a good answer... I'd prefer the ABI-documented
   names, but if they are unacceptable, at least we need to agree
   on a consistent set of names and fix all references in all
   the various tmem parts (and possibly Xen and the kernel<->Xen
   ABI as well).

The rest of my comments/replies are in context.

> > +/*
> > + * A tmem host implementation must use this function to register
> > + * callbacks for a page-accessible memory (PAM) implementation
> > + */
> > +static struct tmem_pamops tmem_pamops;
> > +
> > +void tmem_register_pamops(struct tmem_pamops *m)
> > +{
> > +	tmem_pamops = *m;
> > +}
> > +
> 
> This implies that this can only host one client  at a time. I suppose
> that's ok to start with but is there ever an expectation that zcache +
> something else would be enabled at the same time?

There was some thought that zcache and Xen (or KVM) might somehow "chain"
the implementations.
 
> > +/*
> > + * A tmem_obj contains a radix-tree-like tree in which the intermediate
> > + * nodes are called tmem_objnodes.  (The kernel lib/radix-tree.c implementation
> > + * is very specialized and tuned for specific uses and is not particularly
> > + * suited for use from this code, though some code from the core algorithms has
> 
> This is a bit vague. It asserts that lib/radix-tree is unsuitable but
> not why. I skipped over most of the implementation to be honest.

IIRC, lib/radix-tree is highly tuned for mm's needs.  Things like
tagging and rcu weren't a good fit for tmem, and new things like calling
a different allocator needed to be added.  In the long run it might
be possible for the lib version to serve both needs, but the impediment
and aggravation of merging all necessary changes into lib seemed a high price
to pay for a hundred lines of code implementing a variation of a widely
documented tree algorithm.

> > + * These "tmem core" operations are implemented in the following functions.
> 
> More nits. As this defines a boundary between two major components it
> probably should have its own Documentation/ entry and the APIs should have
> kernel doc comments.

Agreed.

> > + * a corner case: What if a page with matching handle already exists in
> > + * tmem?  To guarantee coherency, one of two actions is necessary: Either
> > + * the data for the page must be overwritten, or the page must be
> > + * "flushed" so that the data is not accessible to a subsequent "get".
> > + * Since these "duplicate puts" are relatively rare, this implementation
> > + * always flushes for simplicity.
> > + */
> 
> At first glance that sounds really dangerous. If two different users can have
> the same oid for different data, what prevents the wrong data being fetched?
> From this level I expect that it's something the layers above it have to
> manage and in practice they must be preventing duplicates ever happening
> but I'm guessing. At some point it would be nice if there was an example
> included here explaining why duplicates are not a bug.

VFS decides when to call cleancache and dups do happen.  Honestly, I don't
know why they happen (though Chris Mason, who wrote the cleancache hooks,
may know) they happen, but the above coherency rules for backend implementation
always work.  The same is true of frontswap.

> > +int tmem_replace(struct tmem_pool *pool, struct tmem_oid *oidp,
> > +			uint32_t index, void *new_pampd)
> > +{
> > +	struct tmem_obj *obj;
> > +	int ret = -1;
> > +	struct tmem_hashbucket *hb;
> > +
> > +	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
> > +	spin_lock(&hb->lock);
> > +	obj = tmem_obj_find(hb, oidp);
> > +	if (obj == NULL)
> > +		goto out;
> > +	new_pampd = tmem_pampd_replace_in_obj(obj, index, new_pampd);
> > +	ret = (*tmem_pamops.replace_in_obj)(new_pampd, obj);
> > +out:
> > +	spin_unlock(&hb->lock);
> > +	return ret;
> > +}
> > +
> 
> Nothin in this patch uses this. It looks like ramster would depend on it
> but at a glance, ramster seems to have its own copy of the code. I guess
> this is what Dan was referring to as the fork and at some point that needs
> to be resolved. Here, it looks like dead code.

Yep, this was a first step toward supporting ramster (and any other
future asynchronous-get tmem backends).

> > +static inline void tmem_oid_set_invalid(struct tmem_oid *oidp)
> > +
> > +static inline bool tmem_oid_valid(struct tmem_oid *oidp)
> > +
> > +static inline int tmem_oid_compare(struct tmem_oid *left,
> > +					struct tmem_oid *right)
> > +{
> > +}
> 
> Holy Branches Batman!
> 
> Bit of a jumble but works at least. Nits: mixes ret = and returns
> mid-way. Could have been implemented with a while loop. Only has one
> caller and should have been in the C file that uses it. There was no need
> to explicitely mark it inline either with just one caller.

It was put here to group object operations together sort
of as if it is an abstract datatype.  No objections
to moving it.

> > +++ b/drivers/mm/zcache/zcache-main.c
> > + *
> > + * Zcache provides an in-kernel "host implementation" for transcendent memory
> > + * and, thus indirectly, for cleancache and frontswap.  Zcache includes two
> > + * page-accessible memory [1] interfaces, both utilizing the crypto compression
> > + * API:
> > + * 1) "compression buddies" ("zbud") is used for ephemeral pages
> > + * 2) zsmalloc is used for persistent pages.
> > + * Xvmalloc (based on the TLSF allocator) has very low fragmentation
> > + * so maximizes space efficiency, while zbud allows pairs (and potentially,
> > + * in the future, more than a pair of) compressed pages to be closely linked
> > + * so that reclaiming can be done via the kernel's physical-page-oriented
> > + * "shrinker" interface.
> > + *
> 
> Doesn't actually explain why zbud is good for one and zsmalloc good for the other.

There's been extensive discussion of that elsewhere and the
equivalent description in zcache2 is better, but I agree this
needs to be in Documentation/, once the zcache1/zcache2 discussion settles.

> > +#if 0
> > +/* this is more aggressive but may cause other problems? */
> > +#define ZCACHE_GFP_MASK	(GFP_ATOMIC | __GFP_NORETRY | __GFP_NOWARN)
> 
> Why is this "more agressive"? If anything it's less aggressive because it'll
> bail if there is no memory available. Get rid of this.

My understanding (from Jeremy Fitzhardinge I think) was that GFP_ATOMIC
would use a special reserve of pages which might lead to OOMs.
More experimentation may be warranted.

> > +#else
> > +#define ZCACHE_GFP_MASK \
> > +	(__GFP_FS | __GFP_NORETRY | __GFP_NOWARN | __GFP_NOMEMALLOC)
> > +#endif
> > +
> > +#define MAX_CLIENTS 16
> 
> Seems a bit arbitrary. Why 16?

Sasha Levin posted a patch to fix this but it was tied in to
the proposed KVM implementation, so was never merged.

> > +#define LOCAL_CLIENT ((uint16_t)-1)
> > +
> > +MODULE_LICENSE("GPL");
> > +
> > +struct zcache_client {
> > +	struct idr tmem_pools;
> > +	struct zs_pool *zspool;
> > +	bool allocated;
> > +	atomic_t refcount;
> > +};
> 
> why is "allocated" needed. Is the refcount not enough to determine if this
> client is in use or not?

May be a historical accident.  Deserves a second look.

> > + * Compression buddies ("zbud") provides for packing two (or, possibly
> > + * in the future, more) compressed ephemeral pages into a single "raw"
> > + * (physical) page and tracking them with data structures so that
> > + * the raw pages can be easily reclaimed.
> > + *
> 
> Ok, if I'm reading this right it implies that a page must at least compress
> by 50% before zcache even accepts the page.

NO! Zbud matches up pages that compress well with those that don't.
There's a lot more detailed description of this in zcache2.

> > +static atomic_t zcache_zbud_curr_raw_pages;
> > +static atomic_t zcache_zbud_curr_zpages;
> 
> Should not have been necessary to make these atomics. Probably protected
> by zbpg_unused_list_spinlock or something similar.

Agreed, but it gets confusing when monitoring zcache
if certain key counters go negative.  Ideally this
should all be eventually tied to some runtime debug flag
but it's not clear yet what counters might be used
by future userland software.
 
> > +static unsigned long zcache_zbud_curr_zbytes;
> 
> Overkill, this is just
> 
> zcache_zbud_curr_raw_pages << PAGE_SHIFT

No, it allows a measure of the average compression,
irrelevant of the number of pageframes required.
 
> > +static unsigned long zcache_zbud_cumul_zpages;
> > +static unsigned long zcache_zbud_cumul_zbytes;
> > +static unsigned long zcache_compress_poor;
> > +static unsigned long zcache_mean_compress_poor;
> 
> In general the stats keeping is going to suck on larger machines as these
> are all shared writable cache lines. You might be able to mitigate the
> impact in the future by moving these to vmstat. Maybe it doesn't matter
> as such - it all depends on what velocity pages enter and leave zcache.
> If that velocity is high, maybe the performance is shot anyway.

Agreed.  Velocity is on the order of the number of disk
pages read per second plus pswpin+pswpout per second.
It's not clear yet if that is high enough for the
stat counters to affect performance but it seems unlikely
except possibly on huge NUMA machines.

> > +static inline unsigned zbud_max_buddy_size(void)
> > +{
> > +	return MAX_CHUNK << CHUNK_SHIFT;
> > +}
> > +
> 
> Is the max size not half of MAX_CHUNK as the page is split into two buddies?

No, see above.

> > +	if (zbpg == NULL)
> > +		/* none on zbpg list, try to get a kernel page */
> > +		zbpg = zcache_get_free_page();
> 
> So zcache_get_free_page() is getting a preloaded page from a per-cpu magazine
> and that thing blows up if there is no page available. This implies that
> preemption must be disabled for the entire putting of a page into zcache!
>
> > +	if (likely(zbpg != NULL)) {
> 
> It's not just likely, it's impossible because if it's NULL,
> zcache_get_free_page() will already have BUG().
> 
> If it's the case that preemption is *not* disabled and the process gets
> scheduled to a CPU that has its magazine consumed then this will blow up
> in some cases.
> 
> Scary.

This code is all redesigned/rewritten in zcache2.

> Ok, so if this thing fails to allocate a page then what prevents us getting into
> a situation where the zcache grows to a large size and we cannot take decompress
> anything in it because we cannot allocate a page here?
> 
> It looks like this could potentially deadlock the system unless it was possible
> to either discard zcache data and reconstruct it from information on disk.
> It feels like something like a mempool needs to exist that is used to forcibly
> shrink the zcache somehow but I can't seem to find where something like that happens.
> 
> Where is it or is there a risk of deadlock here?

I am fairly sure there is no risk of deadlock here.  The callers
to cleancache_get and frontswap_get always provide a struct page
for the decompression.  Cleancache pages in zcache can always
be discarded whenever required.

The risk for OOMs does exist when we start trying to force
frontswap-zcache zpages out to the swap disk.  This work
is currently in progress and I hope to have a patch for
review soon.

> > +	BUG_ON(!irqs_disabled());
> > +	if (unlikely(dmem == NULL))
> > +		goto out;  /* no buffer or no compressor so can't compress */
> > +	*out_len = PAGE_SIZE << ZCACHE_DSTMEM_ORDER;
> > +	from_va = kmap_atomic(from);
> 
> Ok, so I am running out of beans here but this triggered alarm bells. Is
> zcache stored in lowmem? If so, then it might be a total no-go on 32-bit
> systems if pages from highmem cause increased low memory pressure to put
> the page into zcache.

Personally, I'm neither an expert nor an advocate of lowmem systems
but Seth said he has tested zcache ("demo version") there.

> > +	mb();
> 
> .... Why?

Historical accident...  I think this was required in the Xen version.
 
> > +	if (nr >= 0) {
> > +		if (!(gfp_mask & __GFP_FS))
> > +			/* does this case really need to be skipped? */
> > +			goto out;
> 
> Answer that question. It's not obvious at all why zcache cannot handle
> !__GFP_FS. You're not obviously recursing into a filesystem.

Yep, this is a remaining loose end.  The documentation
of this (in the shrinker code) was pretty vague so this
is "safety" code that probably should be removed after
a decent test proves it can be.

> > +static int zcache_get_page(int cli_id, int pool_id, struct tmem_oid *oidp,
> > +				uint32_t index, struct page *page)
> > +{
> > +	struct tmem_pool *pool;
> > +	int ret = -1;
> > +	unsigned long flags;
> > +	size_t size = PAGE_SIZE;
> > +
> > +	local_irq_save(flags);
> 
> Why do interrupts have to be disabled?
> 
> This makes the locking between tmem and zcache very confusing unfortunately
> because I cannot decide if tmem indirectly depends on disabled interrupts
> or not. It's also not clear why an interrupt handler would be trying to
> get/put pages in tmem.

Yes, irq disablement goes away for gets in zcache2.

> > +	pool = zcache_get_pool_by_id(cli_id, pool_id);
> > +	if (likely(pool != NULL)) {
> > +		if (atomic_read(&pool->obj_count) > 0)
> > +			ret = tmem_get(pool, oidp, index, (char *)(page),
> > +					&size, 0, is_ephemeral(pool));
> 
> It looks like you are disabling interrupts to avoid racing on that atomic
> update.
> 
> This feels very shaky and the layering is being violated. You should
> unconditionally call into tmem_get and not worry about the pool count at
> all. tmem_get should then check the count under the pool lock and make
> obj_count a normal counter instead of an atomic.
> 
> The same comment applies to all the other obj_count locations.

This isn't the reason for irq disabling, see previous.
It's possible atomic obj_count can go away as it may
have only been necessary in a previous tmem locking design.

> > +	/* wait for pool activity on other cpus to quiesce */
> > +	while (atomic_read(&pool->refcount) != 0)
> > +		;
> 
> There *HAS* to be a better way of waiting before destroying the pool
> than than a busy wait.

Most probably.  Pool destruction is relatively very rare (umount and
swapoff), so fixing/testing this has never bubbled up to the top
of the list.

> Feels like this should be in its own file with a clear interface to
> zcache-main.c . Minor point, at this point I'm fatigued reading the code
> and cranky.

Perhaps.  In zcache2, all the zbud code is moved to a separate
code module, so zcache-main.c is much shorter.

> > +static void zcache_cleancache_put_page(int pool_id,
> > +					struct cleancache_filekey key,
> > +					pgoff_t index, struct page *page)
> > +{
> > +	u32 ind = (u32) index;
> 
> This looks like an interesting limitation. How sure are you that index
> will never be larger than u32 and this start behaving badly? I guess it's
> because the index is going to be related to PFN and there are not that
> many 16TB machines lying around but this looks like something that could
> bite us on the ass one day.

The limitation is for a >16TB _file_ on a cleancache-aware filesystem.
And it's not a hard limitation:  Since the definition of tmem/cleancache
allows for it to ignore any put, pages above 16TB in a single file
can be rejected.  So, yes, it will still eventually bite us on
the ass, but not before huge parts of the kernel need to be rewritten too.

> > +/*
> > + * zcache initialization
> > + * NOTE FOR NOW zcache MUST BE PROVIDED AS A KERNEL BOOT PARAMETER OR
> > + * NOTHING HAPPENS!
> > + */
> > +
> 
> ok..... why?
> 
> superficially there does not appear to be anything obvious that stops it
> being turned on at runtime. Hardly a blocked, just odd.

The issue is that zcache must be active when a filesystem is mounted
(and at swapon time) or the filesystem will be ignored.

A patch has been posted by a University team to fix this but
it hasn't been merged yet.  I agree it should before zcache
should be widely used.

> > + * zsmalloc memory allocator
> 
> Ok, I didn't read anything after this point.  It's another allocator that
> may or may not pack compressed pages better. The usual concerns about
> internal fragmentation and the like apply but I'm not going to mull over them
> now.
> The really interesting part was deciding if zcache was ready or not.
> 
> So, on zcache, zbud and the underlying tmem thing;
> 
> The locking is convulated, the interrupt disabling suspicious and there is at
> least one place where it looks like we are depending on not being scheduled
> on another CPU during a long operation. It may actually be that you are
> disabling interrupts to prevent that happening but it's not documented. Even
> if it's the case, disabling interrupts to avoid CPU migration is overkill.

Explained above, but more work may be possible here.

> I'm also worried that there appears to be no control over how large
> the zcache can get

There is limited control in zcache1.  The policy is handled much better
in zcache2.  More work definitely remains.

> and am suspicious it can increase lowmem pressure on
> 32-bit machines.  If the lowmem pressure is real then zcache should not
> be available on machines with highmem at all. I'm *really* worried that
> it can deadlock if a page allocation fails before decompressing a page.

I've explicitly tested cases where page allocation fails in both versions
of zcache so I know it works, though I obviously can't guarantee it _always_
works.  In zcache2, when an alloc_page fails, a cleancache_put will
"eat its own tail" (i.e. reclaim and immediately reuse the LRU zpageframe)
and a frontswap_put will eat the LRU cleancache pageframe.  Zcache1
doesn't fail or deadlock, but just rejects all new frontswap puts when
zsmalloc becomes full.

> That said, my initial feeling still stands. I think that this needs to move
> out of staging because it's in limbo where it is but Andrew may disagree
> because of the reservations. If my reservations are accurate then they
> should at least be *clearly* documented with a note saying that using
> this in production is ill-advised for now. If zcache is activated via the
> kernel parameter, it should print a big dirty warning that the feature is
> still experiemental and leave that warning there until all the issues are
> addressed. Right now I'm not convinced this is production ready but that
> the  issues could be fixed incrementally.

Sounds good... but begs the question whether to promote zcache1
or zcache2.  Or some compromise.

Thanks again, Mel, for taking the (obviously tons of) time to go
through the code and ask intelligent questions and point out the
many nits and minor issues due to my (and others) kernel newbieness!

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Seth Jennings Sept. 21, 2012, 7:16 p.m. UTC | #9
On 09/21/2012 11:12 AM, Mel Gorman wrote:
> That said, my initial feeling still stands. I think that this needs to move
> out of staging because it's in limbo where it is but Andrew may disagree
> because of the reservations. If my reservations are accurate then they
> should at least be *clearly* documented with a note saying that using
> this in production is ill-advised for now. If zcache is activated via the
> kernel parameter, it should print a big dirty warning that the feature is
> still experiemental and leave that warning there until all the issues are
> addressed. Right now I'm not convinced this is production ready but that
> the  issues could be fixed incrementally.

Thank you _so_ much for the review!  Your comments have
provided one of the few glimpses I've had into any other
thoughts on the code save Dan and my own.

I'm in the process of going through the comments you provided.

I am _very_ glad to hear you believe that zcache should be
promoted out of the staging limbo where it currently
resides.  I am fine with providing a warning against use in
production environments until we can address everyone's
concerns.

Once zcache is promoted, I think it will give the code more
opportunity to be used/improved/extended in an incremental
and stable way.

--
Seth

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Sept. 21, 2012, 8:35 p.m. UTC | #10
> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> 
> On 09/21/2012 01:02 PM, Konrad Rzeszutek Wilk wrote:
> > On Fri, Sep 21, 2012 at 05:12:52PM +0100, Mel Gorman wrote:
> >> On Tue, Sep 04, 2012 at 04:34:46PM -0500, Seth Jennings wrote:
> >>> zcache is the remaining piece of code required to support in-kernel
> >>> memory compression.  The other two features, cleancache and frontswap,
> >>> have been promoted to mainline in 3.0 and 3.5 respectively.  This
> >>> patchset promotes zcache from the staging tree to mainline.
> 
> >>
> >> Very broadly speaking my initial reaction before I reviewed anything was
> >> that *some* sort of usable backend for cleancache or frontswap should exist
> >> at this point. My understanding is that Xen is the primary user of both
> >> those frontends and ramster, while interesting, is not something that a
> >> typical user will benefit from.
> >
> > Right, the majority of users do not use virtualization. Thought embedded
> > wise .. well, there are a lot of Android users - thought I am not 100%
> > sure they are using it right now (I recall seeing changelogs for the clones
> > of Android mentioning zcache).
> >>
> >> That said, I worry that this has bounced around a lot and as Dan (the
> >> original author) has a rewrite. I'm wary of spending too much time on this
> >> at all. Is Dan's new code going to replace this or what? It'd be nice to
> >> find a definitive answer on that.
> >
> > The idea is to take parts of zcache2 as seperate patches and stick it
> > in the code you just reviewed (those that make sense as part of unstaging).
> 
> I agree with this.  Only the changes from zcache2 (Dan's
> rewrite) that are necessary for promotion should be
> considered right now.  Afaict, none of the concerns raised
> in these comments are addressed by the changes in zcache2.

While I may agree with the proposed end result, this proposal
is a _very_ long way away from a solution.  To me, it sounds like
a "split the baby in half" proposal (cf. wisdom of Solomon)
which may sound reasonable to some but, in the end, everyone loses.

I have proposed a reasonable compromise offlist to Seth, but
it appears that it has been silently rejected; I guess it is
now time to take the proposal public.  I apologize in advance
for my characteristic bluntness...

So let's consider two proposals and the pros and cons of them,
before we waste any further mm developer time.  (Fortunately,
most of Mel's insightful comments apply to both versions, though
he did identify some of the design issues that led to zcache2!)

The two proposals:
A) Recreate all the work done for zcache2 as a proper sequence of
   independent patches and apply them to zcache1. (Seth/Konrad)
B) Add zsmalloc back in to zcache2 as an alternative allocator
   for frontswap pages. (Dan)

Pros for (A):
1. It better preserves the history of the handful of (non-zsmalloc)
   commits in the original zcache code.
2. Seth[1] can incrementally learn the new designs by reading
   normal kernel patches.
3. For kernel purists, it is the _right_ way dammit (and Dan
   should be shot for redesigning code non-incrementally, even
   if it was in staging, etc.)
4. Seth believes that zcache will be promoted out of staging sooner
   because, except for a few nits, it is ready today.

Cons for (A):
1. Nobody has signed up to do the work, including testing.  It
   took the author (and sole expert on all the components
   except zsmalloc) between two and three months essentially
   fulltime to move zcache1->zcache2.  So forward progress on
   zcache will likely be essentially frozen until at least the
   end of 2012, possibly a lot longer.
2. The end result (if we reach one) is almost certainly a
   _third_ implementation of zcache: "zcache 1.5".  So
   we may not be leveraging much of the history/testing
   from zcache1 anyway!
3. Many of the zcache2 changes are closely interwoven so
   a sequence of patches may not be much more incrementally
   readable than zcache2.
4. The merge with ramster will likely be very low priority
   so the fork between the two will continue.
5. Dan believes that, if zcache1 does indeed get promoted with
   few or none of the zcache2 redesigns, zcache will never
   get properly finished.

Pros for (B):
1. Many of the design issues/constraints of zcache are resolved
   in code that has already been tested approximately as well
   as the original. All of the redesign (zcache1->zcache2) has
   been extensively discussed on-list; only the code itself is
   "non-incremental".
2. Both allocators (which AFAIK is the only technical area
   of controversy) will be supported in the same codebase.
3. Dan (especially with help from Seth) can do the work in a
   week or two, and then we can immediately move forward
   doing useful work and adding features on a solid codebase.
4. Zcache2 already has the foundation in place for "reclaim
   frontswap zpages", which mm experts have noted is a critical
   requirement for broader zcache acceptance (e.g. KVM).
5. Ramster is already a small incremental addition to core zcache2 code
   rather than a fork.  While many may ignore ramster as "not valuable",
   it is the foundation for future related work so there's a reasonable
   chance that some form of ramster will need to be merged in the future.

Cons for (B):
1. Seth [1] has to relearn some of the zcache2 code via diffs and
   code reading instead of incremental patches.
2. Dan doesn't get properly punished for not doing incremental patches.

[1] With all due respect, at this time, there are really only
two people in the world that have a reasonably deep understanding
of zcache and the technologies it's built on: Dan and Seth.
Seth admits less than thorough understanding of some of the
components (e.g. cleancache, zbud, tmem).  Dan admits poor
understanding of zsmalloc internals.

P.S.
For history on how the "fork" between zcache1 and zcache2 happened, see:
https://lkml.org/lkml/2012/8/16/617 
For a high-level list of the redesign in zcache2, see:
https://lkml.org/lkml/2012/7/31/573 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Mel Gorman Sept. 22, 2012, 12:25 a.m. UTC | #11
On Fri, Sep 21, 2012 at 12:14:39PM -0700, Dan Magenheimer wrote:
> Hi Mel --
> 
> Wow!  An incredibly wonderfully detailed response!  Thank you very
> much for taking the time to read through all of zcache!
> 

My pleasure.

> Your comments run the gamut from nit and code style, to design,
> architecture and broad naming.  Until the choice-of-codebase issue
> is resolved, I'll avoid the nits and codestyle comments and respond
> to the higher level strategic and design questions. 

That's fair enough. FWIW, I would never consider the nits to be
blockers. If all the complaints I had were nits then there would be no
real issue to merging it to the core.

> Since a couple
> of your questions are repeated and the specific code which provoked
> your question is not isolated, I hope it is OK if I answer those
> first out-of-context from your original comments in the code.
> (This should also make this easier to read and to extract optimal
> meaning, for you and for posterity.)

Sure. I recognise that I was repeating myself at parts.

> > That said, I worry that this has bounced around a lot and as Dan (the
> > original author) has a rewrite. I'm wary of spending too much time on this
> > at all. Is Dan's new code going to replace this or what? It'd be nice to
> > find a definitive answer on that.
> 
> Replacing this code was my intent, but that was blocked.  IMHO zcache2
> is _much_ better than the "demo version" of zcache (aka zcache1).
> Hopefully a middle ground can be reached.  I've proposed one privately
> offlist.
> 

Ok. Unfortunately I cannot help resolve that issue but I'll mention it
again later.

> Seth, please feel free to augment or correct anything below, or
> respond to anything I haven't commented on.
> 
> > Anyway, here goes
> 
> Repeated comments answered first out-of-context:
> 
> 1) The interrupt context for zcache (and any tmem backend) is imposed
>    by the frontend callers.  Cleancache_put [see naming comment below]
>    is always called with interrupts disabled. 

Ok, I sortof see. It's always called within the irq-safe mapping tree_lock
and that infects the lower layers in a sense. It still feels like a layering
violation and minimally I would expect this is propagated down by making
locks like the hb->lock IRQ-safe and document the locking accordingly.

> Cleancache_flush is
>    sometimes called with interrupts disabled and sometimes not.
>    Cleancache_get is never called in an atomic context.  (I think)
>    frontswap_get/put/flush are never called in an atomic context but
>    sometimes with the swap_lock held. Because it is dangerous (true?)
>    for code to sometimes/not be called in atomic context, much of the
>    code in zcache and tmem is forced into atomic context. 

FWIW, if it can be called from a context with IRQs disabled then it must
be consistent throughout or it's unsafe. At the very least lockdep will
throw a fit if it is inconsistent.

> BUT Andrea
>    observed that there are situations where asynchronicity would be
>    preferable and, it turns out that cleancache_get and frontswap_get
>    are never called in atomic context.  Zcache2/ramster takes advantage of
>    that, and a future KVM backend may want to do so as well.  However,
>    the interrupt/atomicity model and assumptions certainly does deserve
>    better documentation.
> 

Minimally, move the locking to use the irq-safe with spin_lock_irqsave
rather than the current arrangement of calling local_irq_save() in
places. That alone would make it a bit easier to follow.

> 2) The naming of the core tmem functions (put, get, flush) has been
>    discussed endlessly, everyone has a different opinion, and the
>    current state is a mess: cleancache, frontswap, and the various
>    backends are horribly inconsistent.   IMHO, the use of "put"
>    and "get" for reference counting is a historical accident, and
>    the tmem ABI names were chosen well before I understood the historical
>    precedence and the potential for confusion by kernel developers.
>    So I don't have a good answer... I'd prefer the ABI-documented
>    names, but if they are unacceptable, at least we need to agree
>    on a consistent set of names and fix all references in all
>    the various tmem parts (and possibly Xen and the kernel<->Xen
>    ABI as well).
> 

Ok, I see. Well, it's unfortunate but I'm not going to throw the toys out
of the pram over it either. Changing the names at this stage might just
confuse the people who are already familiar with the code. I'm the newbie
here so the confusion about terminology is my problem.

> The rest of my comments/replies are in context.
> 
> > > +/*
> > > + * A tmem host implementation must use this function to register
> > > + * callbacks for a page-accessible memory (PAM) implementation
> > > + */
> > > +static struct tmem_pamops tmem_pamops;
> > > +
> > > +void tmem_register_pamops(struct tmem_pamops *m)
> > > +{
> > > +	tmem_pamops = *m;
> > > +}
> > > +
> > 
> > This implies that this can only host one client  at a time. I suppose
> > that's ok to start with but is there ever an expectation that zcache +
> > something else would be enabled at the same time?
> 
> There was some thought that zcache and Xen (or KVM) might somehow "chain"
> the implementations.
>  

Ok, in that case it should at least detect if an attempt is ever made to
chain and bail out.

> > > +/*
> > > + * A tmem_obj contains a radix-tree-like tree in which the intermediate
> > > + * nodes are called tmem_objnodes.  (The kernel lib/radix-tree.c implementation
> > > + * is very specialized and tuned for specific uses and is not particularly
> > > + * suited for use from this code, though some code from the core algorithms has
> > 
> > This is a bit vague. It asserts that lib/radix-tree is unsuitable but
> > not why. I skipped over most of the implementation to be honest.
> 
> IIRC, lib/radix-tree is highly tuned for mm's needs.  Things like
> tagging and rcu weren't a good fit for tmem, and new things like calling
> a different allocator needed to be added.  In the long run it might
> be possible for the lib version to serve both needs, but the impediment
> and aggravation of merging all necessary changes into lib seemed a high price
> to pay for a hundred lines of code implementing a variation of a widely
> documented tree algorithm.
> 

Ok, thanks for the explanation. I think in that case it just needs to be
in a file of it's own and maybe clearly name in case there ever is the
case that another subsystem can reuse the same data structure. I suspect
in the future there might be people who want to create RAM-like devices
backed by SSD and they may benefit from similar data structures.  I do
not have a suggestion on good names unfortunately.

> > > + * These "tmem core" operations are implemented in the following functions.
> > 
> > More nits. As this defines a boundary between two major components it
> > probably should have its own Documentation/ entry and the APIs should have
> > kernel doc comments.
> 
> Agreed.
> 
> > > + * a corner case: What if a page with matching handle already exists in
> > > + * tmem?  To guarantee coherency, one of two actions is necessary: Either
> > > + * the data for the page must be overwritten, or the page must be
> > > + * "flushed" so that the data is not accessible to a subsequent "get".
> > > + * Since these "duplicate puts" are relatively rare, this implementation
> > > + * always flushes for simplicity.
> > > + */
> > 
> > At first glance that sounds really dangerous. If two different users can have
> > the same oid for different data, what prevents the wrong data being fetched?
> > From this level I expect that it's something the layers above it have to
> > manage and in practice they must be preventing duplicates ever happening
> > but I'm guessing. At some point it would be nice if there was an example
> > included here explaining why duplicates are not a bug.
> 
> VFS decides when to call cleancache and dups do happen.  Honestly, I don't
> know why they happen (though Chris Mason, who wrote the cleancache hooks,
> may know) they happen,

Because you mentioned Chris Mason it might be specific to btrfs and snapshots
i.e. a page at a given offset in an inode but in two snapshots might alias
in zcache. This would be legal but rare. If this is accurate it should be
commented on.

> but the above coherency rules for backend implementation
> always work.  The same is true of frontswap.
> 

I'm less sure the situation can even happen with frontswap but that is a
complete guess as I simply am not familiar enough with this code.

> > > +int tmem_replace(struct tmem_pool *pool, struct tmem_oid *oidp,
> > > +			uint32_t index, void *new_pampd)
> > > +{
> > > +	struct tmem_obj *obj;
> > > +	int ret = -1;
> > > +	struct tmem_hashbucket *hb;
> > > +
> > > +	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
> > > +	spin_lock(&hb->lock);
> > > +	obj = tmem_obj_find(hb, oidp);
> > > +	if (obj == NULL)
> > > +		goto out;
> > > +	new_pampd = tmem_pampd_replace_in_obj(obj, index, new_pampd);
> > > +	ret = (*tmem_pamops.replace_in_obj)(new_pampd, obj);
> > > +out:
> > > +	spin_unlock(&hb->lock);
> > > +	return ret;
> > > +}
> > > +
> > 
> > Nothin in this patch uses this. It looks like ramster would depend on it
> > but at a glance, ramster seems to have its own copy of the code. I guess
> > this is what Dan was referring to as the fork and at some point that needs
> > to be resolved. Here, it looks like dead code.
> 
> Yep, this was a first step toward supporting ramster (and any other
> future asynchronous-get tmem backends).
> 

Ok. I don't really see why it's connected to asynchronous get. I was
reading it as a convenient helper.

> > > +static inline void tmem_oid_set_invalid(struct tmem_oid *oidp)
> > > +
> > > +static inline bool tmem_oid_valid(struct tmem_oid *oidp)
> > > +
> > > +static inline int tmem_oid_compare(struct tmem_oid *left,
> > > +					struct tmem_oid *right)
> > > +{
> > > +}
> > 
> > Holy Branches Batman!
> > 
> > Bit of a jumble but works at least. Nits: mixes ret = and returns
> > mid-way. Could have been implemented with a while loop. Only has one
> > caller and should have been in the C file that uses it. There was no need
> > to explicitely mark it inline either with just one caller.
> 
> It was put here to group object operations together sort
> of as if it is an abstract datatype.  No objections
> to moving it.
> 

Ok. I am not pushed either way to be honest.

> > > +++ b/drivers/mm/zcache/zcache-main.c
> > > + *
> > > + * Zcache provides an in-kernel "host implementation" for transcendent memory
> > > + * and, thus indirectly, for cleancache and frontswap.  Zcache includes two
> > > + * page-accessible memory [1] interfaces, both utilizing the crypto compression
> > > + * API:
> > > + * 1) "compression buddies" ("zbud") is used for ephemeral pages
> > > + * 2) zsmalloc is used for persistent pages.
> > > + * Xvmalloc (based on the TLSF allocator) has very low fragmentation
> > > + * so maximizes space efficiency, while zbud allows pairs (and potentially,
> > > + * in the future, more than a pair of) compressed pages to be closely linked
> > > + * so that reclaiming can be done via the kernel's physical-page-oriented
> > > + * "shrinker" interface.
> > > + *
> > 
> > Doesn't actually explain why zbud is good for one and zsmalloc good for the other.
> 
> There's been extensive discussion of that elsewhere and the
> equivalent description in zcache2 is better, but I agree this
> needs to be in Documentation/, once the zcache1/zcache2 discussion settles.
> 

Ok, that really does need to be settled in some fashion but I have no
recommendations on how to do it. Ordinarily there is a hatred of having
two implementations of the same functionality in-tree. I know the virtio
people have been fighting about something recently but it's not unheard
of either. jbd and jbd2 exist for example.

> > > +#if 0
> > > +/* this is more aggressive but may cause other problems? */
> > > +#define ZCACHE_GFP_MASK	(GFP_ATOMIC | __GFP_NORETRY | __GFP_NOWARN)
> > 
> > Why is this "more agressive"? If anything it's less aggressive because it'll
> > bail if there is no memory available. Get rid of this.
> 
> My understanding (from Jeremy Fitzhardinge I think) was that GFP_ATOMIC
> would use a special reserve of pages which might lead to OOMs.

It might, but it's a stretch. The greater concern to me is that using
GFP_ATOMIC means that zcache expansions will not enter direct page
reclaim and instead depend on kswapd to do the necessary work. It would
make adding pages to zcache under memory pressure a hit and miss affair.
Considering that frontswap is a possible frontend and swapping happens in the
presense of memory pressure it would imply to me that using GFP_ATOMIC is
the worst possible choice for zcache and the aging simply feels "wrong". I
much prefer the gfp mask it is currently using for this reason.

Again, this is based on a lot of guesswork so take with a grain of salt.

> More experimentation may be warranted.
> 

Personally I wouldn't bother and instead stick with the current
ZCACHE_GFP_MASK.

> > > +#else
> > > +#define ZCACHE_GFP_MASK \
> > > +	(__GFP_FS | __GFP_NORETRY | __GFP_NOWARN | __GFP_NOMEMALLOC)
> > > +#endif
> > > +
> > > +#define MAX_CLIENTS 16
> > 
> > Seems a bit arbitrary. Why 16?
> 
> Sasha Levin posted a patch to fix this but it was tied in to
> the proposed KVM implementation, so was never merged.
> 

Ok, so it really is just an arbitrary choice. It's probably not an
issue, just looked odd.

> > > +#define LOCAL_CLIENT ((uint16_t)-1)
> > > +
> > > +MODULE_LICENSE("GPL");
> > > +
> > > +struct zcache_client {
> > > +	struct idr tmem_pools;
> > > +	struct zs_pool *zspool;
> > > +	bool allocated;
> > > +	atomic_t refcount;
> > > +};
> > 
> > why is "allocated" needed. Is the refcount not enough to determine if this
> > client is in use or not?
> 
> May be a historical accident.  Deserves a second look.
> 

Ok. Again, it's not a major deal, it just looks weird.

> > > + * Compression buddies ("zbud") provides for packing two (or, possibly
> > > + * in the future, more) compressed ephemeral pages into a single "raw"
> > > + * (physical) page and tracking them with data structures so that
> > > + * the raw pages can be easily reclaimed.
> > > + *
> > 
> > Ok, if I'm reading this right it implies that a page must at least compress
> > by 50% before zcache even accepts the page.
> 
> NO! Zbud matches up pages that compress well with those that don't.
> There's a lot more detailed description of this in zcache2.
> 

Oh.... ok. I thought the buddy arrangement would require at least 50%
compression. To be honest, I'm happier with that limitation than trying
to figure out the buckets sizes to deal with varying compressions but
that's me being lazy :)

> > > +static atomic_t zcache_zbud_curr_raw_pages;
> > > +static atomic_t zcache_zbud_curr_zpages;
> > 
> > Should not have been necessary to make these atomics. Probably protected
> > by zbpg_unused_list_spinlock or something similar.
> 
> Agreed, but it gets confusing when monitoring zcache
> if certain key counters go negative.

Do they really go negative? It's not obvious why they should but even if
they can it could be bodged to print 0 if the value is negative. I didn't
double check it but I think we already do something like that for vmstat
when per-cpu counter drift can make a counter appear negative.

Bodging it would be preferable to incurring the cost of atomic updates.
Atomics also make new reviewers start worrying that the locking is
flawed somehow! An atomic_read > 0 followed by data deletion just looks
like a problem waiting to happen. The expected pattern for atomics in a
situation like this involves atomic_dec_and_test() to atomically catch
when a reference count reaches 0.

>  Ideally this
> should all be eventually tied to some runtime debug flag
> but it's not clear yet what counters might be used
> by future userland software.
>  

Move to debugfs maybe? With or without that move, it seems to me that the
counters are for monitoring and debugging similar to what /proc/vmstat
is fot. I would very much hope that monitor tools would be tolerant to
the available statistics changing and it wouldn't be part of the ABI. For
example, some of the vmstat names changed recently and no one threw a fit.

Ok... I threw a fit because they broke MMTests but it took all of 10
minutes to handle it and MMTests only broke because I was lazy in the
first place.

> > > +static unsigned long zcache_zbud_curr_zbytes;
> > 
> > Overkill, this is just
> > 
> > zcache_zbud_curr_raw_pages << PAGE_SHIFT
> 
> No, it allows a measure of the average compression,
> irrelevant of the number of pageframes required.
>  

Ah, ok, that makes more sense actually.

> > > +static unsigned long zcache_zbud_cumul_zpages;
> > > +static unsigned long zcache_zbud_cumul_zbytes;
> > > +static unsigned long zcache_compress_poor;
> > > +static unsigned long zcache_mean_compress_poor;
> > 
> > In general the stats keeping is going to suck on larger machines as these
> > are all shared writable cache lines. You might be able to mitigate the
> > impact in the future by moving these to vmstat. Maybe it doesn't matter
> > as such - it all depends on what velocity pages enter and leave zcache.
> > If that velocity is high, maybe the performance is shot anyway.
> 
> Agreed.  Velocity is on the order of the number of disk
> pages read per second plus pswpin+pswpout per second.

I see.

> It's not clear yet if that is high enough for the
> stat counters to affect performance but it seems unlikely
> except possibly on huge NUMA machines.
> 

Meaning the KVM people would want this fixed eventually particularly if
they back swap with very fast storage. I know they are not a current user
but it seems like they *should* be eventually. It's not a blocker but some
of the statistics gathering should eventually move to something like vmstat.

Obviously, it would be a lot easier to do that if zcache[1|2] was part of
the core vm :)

> > > +static inline unsigned zbud_max_buddy_size(void)
> > > +{
> > > +	return MAX_CHUNK << CHUNK_SHIFT;
> > > +}
> > > +
> > 
> > Is the max size not half of MAX_CHUNK as the page is split into two buddies?
> 
> No, see above.
> 

My bad, it's actually a bit tricky at first reading to see how all this
hangs together. That's fine, I'm ok with having things explained to me.

> > > +	if (zbpg == NULL)
> > > +		/* none on zbpg list, try to get a kernel page */
> > > +		zbpg = zcache_get_free_page();
> > 
> > So zcache_get_free_page() is getting a preloaded page from a per-cpu magazine
> > and that thing blows up if there is no page available. This implies that
> > preemption must be disabled for the entire putting of a page into zcache!
> >
> > > +	if (likely(zbpg != NULL)) {
> > 
> > It's not just likely, it's impossible because if it's NULL,
> > zcache_get_free_page() will already have BUG().
> > 
> > If it's the case that preemption is *not* disabled and the process gets
> > scheduled to a CPU that has its magazine consumed then this will blow up
> > in some cases.
> > 
> > Scary.
> 
> This code is all redesigned/rewritten in zcache2.
> 

So the problem is sortof real and if it is avoided it is because this
interrupts disabled limitation that is being enforced. It appears that the
interrupts disabling is just a co-incidence and it would be best to not
depend on it for zcache to be "correct". Does zcache2 deal with this problem?

> > Ok, so if this thing fails to allocate a page then what prevents us getting into
> > a situation where the zcache grows to a large size and we cannot take decompress
> > anything in it because we cannot allocate a page here?
> > 
> > It looks like this could potentially deadlock the system unless it was possible
> > to either discard zcache data and reconstruct it from information on disk.
> > It feels like something like a mempool needs to exist that is used to forcibly
> > shrink the zcache somehow but I can't seem to find where something like that happens.
> > 
> > Where is it or is there a risk of deadlock here?
> 
> I am fairly sure there is no risk of deadlock here.  The callers
> to cleancache_get and frontswap_get always provide a struct page
> for the decompression.

What happens if they cannot allocate a page?

>  Cleancache pages in zcache can always
> be discarded whenever required.
> 

What about frontswap?

> The risk for OOMs does exist when we start trying to force
> frontswap-zcache zpages out to the swap disk.  This work
> is currently in progress and I hope to have a patch for
> review soon.
> 

Good news. It would be a big job but my initial reaction is that you need
a mempool to emergency evict pages. Not exactly sure how it would all hang
together unfortunately but it should be coped with if zcache is to be used
in production.

> > > +	BUG_ON(!irqs_disabled());
> > > +	if (unlikely(dmem == NULL))
> > > +		goto out;  /* no buffer or no compressor so can't compress */
> > > +	*out_len = PAGE_SIZE << ZCACHE_DSTMEM_ORDER;
> > > +	from_va = kmap_atomic(from);
> > 
> > Ok, so I am running out of beans here but this triggered alarm bells. Is
> > zcache stored in lowmem? If so, then it might be a total no-go on 32-bit
> > systems if pages from highmem cause increased low memory pressure to put
> > the page into zcache.
> 
> Personally, I'm neither an expert nor an advocate of lowmem systems
> but Seth said he has tested zcache ("demo version") there.
> 

Ok, that's not exactly a ringing endorsement. highmem/lowmem issues
completely suck. It looks like the ideal would be that zcache supports
storing of compressed pages in highmem but that probably means that the
pages have to be kmap()ed before passing them to the compression
algorithm. Due to the interrupt-disabled issue it would have to be
kmap_atomic and then it all goes completely to crap.

I for one would be ok with making zcache 64-bit only.

> > > +	mb();
> > 
> > .... Why?
> 
> Historical accident...  I think this was required in the Xen version.
>  

Ok, it would be really nice to have a comment explaining why the barrier
is there or get rid of it completely.

> > > +	if (nr >= 0) {
> > > +		if (!(gfp_mask & __GFP_FS))
> > > +			/* does this case really need to be skipped? */
> > > +			goto out;
> > 
> > Answer that question. It's not obvious at all why zcache cannot handle
> > !__GFP_FS. You're not obviously recursing into a filesystem.
> 
> Yep, this is a remaining loose end.  The documentation
> of this (in the shrinker code) was pretty vague so this
> is "safety" code that probably should be removed after
> a decent test proves it can be.
> 

Not sure what documentation that is but I bet you a shiny penny it's
worried about icache/dcache shrinking and that's why there are worries
about filesystem recursion.

> > > +static int zcache_get_page(int cli_id, int pool_id, struct tmem_oid *oidp,
> > > +				uint32_t index, struct page *page)
> > > +{
> > > +	struct tmem_pool *pool;
> > > +	int ret = -1;
> > > +	unsigned long flags;
> > > +	size_t size = PAGE_SIZE;
> > > +
> > > +	local_irq_save(flags);
> > 
> > Why do interrupts have to be disabled?
> > 
> > This makes the locking between tmem and zcache very confusing unfortunately
> > because I cannot decide if tmem indirectly depends on disabled interrupts
> > or not. It's also not clear why an interrupt handler would be trying to
> > get/put pages in tmem.
> 
> Yes, irq disablement goes away for gets in zcache2.
> 

Great.

> > > +	pool = zcache_get_pool_by_id(cli_id, pool_id);
> > > +	if (likely(pool != NULL)) {
> > > +		if (atomic_read(&pool->obj_count) > 0)
> > > +			ret = tmem_get(pool, oidp, index, (char *)(page),
> > > +					&size, 0, is_ephemeral(pool));
> > 
> > It looks like you are disabling interrupts to avoid racing on that atomic
> > update.
> > 
> > This feels very shaky and the layering is being violated. You should
> > unconditionally call into tmem_get and not worry about the pool count at
> > all. tmem_get should then check the count under the pool lock and make
> > obj_count a normal counter instead of an atomic.
> > 
> > The same comment applies to all the other obj_count locations.
> 
> This isn't the reason for irq disabling, see previous.
> It's possible atomic obj_count can go away as it may
> have only been necessary in a previous tmem locking design.
> 

Ok, then in principal I would like to see the obj_count check go away
and pass responsibility down to the lower layer.

> > > +	/* wait for pool activity on other cpus to quiesce */
> > > +	while (atomic_read(&pool->refcount) != 0)
> > > +		;
> > 
> > There *HAS* to be a better way of waiting before destroying the pool
> > than than a busy wait.
> 
> Most probably.  Pool destruction is relatively very rare (umount and
> swapoff), so fixing/testing this has never bubbled up to the top
> of the list.
> 

Yeah, I guessed that might be the case but I could not let a busy wait
slide by without comment. If Peter saw this and thought I missed it he
would be laughing at me for months. It's wrong and needs to go away at
some point.

> > Feels like this should be in its own file with a clear interface to
> > zcache-main.c . Minor point, at this point I'm fatigued reading the code
> > and cranky.
> 
> Perhaps.  In zcache2, all the zbud code is moved to a separate
> code module, so zcache-main.c is much shorter.
> 

Ok, so at the very least the zcache1 and zcache2 implementations can
move closer together by doing the same split.

> > > +static void zcache_cleancache_put_page(int pool_id,
> > > +					struct cleancache_filekey key,
> > > +					pgoff_t index, struct page *page)
> > > +{
> > > +	u32 ind = (u32) index;
> > 
> > This looks like an interesting limitation. How sure are you that index
> > will never be larger than u32 and this start behaving badly? I guess it's
> > because the index is going to be related to PFN and there are not that
> > many 16TB machines lying around but this looks like something that could
> > bite us on the ass one day.
> 
> The limitation is for a >16TB _file_ on a cleancache-aware filesystem.

I see. That makes sense now that you say it. The UUID is not going be based
on the block device, it's going to be based on an inode + some offset with
some swizzling to handle snapshots. I was thinking of frontswap backing
the physical address space at the time and think it might still be a
problem, but not a blocking one.

This doesn't need to be documented because a sufficiently motivated
person can figure it out. I was not sufficiently motivated :)

> And it's not a hard limitation:  Since the definition of tmem/cleancache
> allows for it to ignore any put, pages above 16TB in a single file
> can be rejected.  So, yes, it will still eventually bite us on
> the ass, but not before huge parts of the kernel need to be rewritten too.
> 

That's fair enough. The situation should be at least detected though.

> > > +/*
> > > + * zcache initialization
> > > + * NOTE FOR NOW zcache MUST BE PROVIDED AS A KERNEL BOOT PARAMETER OR
> > > + * NOTHING HAPPENS!
> > > + */
> > > +
> > 
> > ok..... why?
> > 
> > superficially there does not appear to be anything obvious that stops it
> > being turned on at runtime. Hardly a blocked, just odd.
> 
> The issue is that zcache must be active when a filesystem is mounted
> (and at swapon time) or the filesystem will be ignored.
> 

Ok so ultimately it should be possible to remount a filesystem with
zcache enabled. Not a blocking issue, just would be nice.

> A patch has been posted by a University team to fix this but
> it hasn't been merged yet.  I agree it should before zcache
> should be widely used.
> 

Meh, actually I did not view this as a blocker. It's clumsy but the
other concerns were more important.

> > > + * zsmalloc memory allocator
> > 
> > Ok, I didn't read anything after this point.  It's another allocator that
> > may or may not pack compressed pages better. The usual concerns about
> > internal fragmentation and the like apply but I'm not going to mull over them
> > now.
> > The really interesting part was deciding if zcache was ready or not.
> > 
> > So, on zcache, zbud and the underlying tmem thing;
> > 
> > The locking is convulated, the interrupt disabling suspicious and there is at
> > least one place where it looks like we are depending on not being scheduled
> > on another CPU during a long operation. It may actually be that you are
> > disabling interrupts to prevent that happening but it's not documented. Even
> > if it's the case, disabling interrupts to avoid CPU migration is overkill.
> 
> Explained above, but more work may be possible here.
> 

Agreed.

> > I'm also worried that there appears to be no control over how large
> > the zcache can get
> 
> There is limited control in zcache1.  The policy is handled much better
> in zcache2.  More work definitely remains.
> 

Ok.

> > and am suspicious it can increase lowmem pressure on
> > 32-bit machines.  If the lowmem pressure is real then zcache should not
> > be available on machines with highmem at all. I'm *really* worried that
> > it can deadlock if a page allocation fails before decompressing a page.
> 
> I've explicitly tested cases where page allocation fails in both versions
> of zcache so I know it works, though I obviously can't guarantee it _always_
> works. 

Yeah, I get your point. An allocation failure might be handled but I'm
worried about the case where the allocation fails *and* the system
cannot do anything about it. Similar situations happen if the page
allocator gets broken by a patch and does not enforce watermarks which
is why the alarm bell triggered for me.

> In zcache2, when an alloc_page fails, a cleancache_put will
> "eat its own tail" (i.e. reclaim and immediately reuse the LRU zpageframe)

Conceptually, I *really* like that idea. It seems that it would be much
more robust in general.

> and a frontswap_put will eat the LRU cleancache pageframe.  Zcache1
> doesn't fail or deadlock, but just rejects all new frontswap puts when
> zsmalloc becomes full.
> 

Which could get awkward. The contents of zcache in that case could be
completely inappropriate and lead to a type of priority inversion
problem. It would be hard to debug.

> > That said, my initial feeling still stands. I think that this needs to move
> > out of staging because it's in limbo where it is but Andrew may disagree
> > because of the reservations. If my reservations are accurate then they
> > should at least be *clearly* documented with a note saying that using
> > this in production is ill-advised for now. If zcache is activated via the
> > kernel parameter, it should print a big dirty warning that the feature is
> > still experiemental and leave that warning there until all the issues are
> > addressed. Right now I'm not convinced this is production ready but that
> > the  issues could be fixed incrementally.
> 
> Sounds good... but begs the question whether to promote zcache1
> or zcache2.  Or some compromise.
> 

I don't have a good suggestion on how to resolve that. The people who
handled the jbd vs jbd2 issue might. Andrew, Ted or Jan might know the
history there. I was blissfully ignorant.

> Thanks again, Mel, for taking the (obviously tons of) time to go
> through the code and ask intelligent questions and point out the
> many nits and minor issues due to my (and others) kernel newbieness!
> 

I'm the newbie here, thanks for taking the time to answer my questions
:)

Bottom line for me remains the same. I think something like this should be
promoted if frontcache/cleancache are going to be used in generally available
systems properly but  the limitations need to be clearly documented and
dirty warnings printed on activation until it's production ready. The
zcache1 vs zcache2 problem must be resolved but I would suggest that it
be done by either mering all the features of zcache2 into zcache1 until
it disappears *or* the exact opposite -- both get merged, maintain API
compatibility but zcache1 get critical bug fixes only and all development
take place on zcache2. Having never looked at zcache2 I cannot be sure
which is the better idea. If zcache2 is shown to handle some fundamental
problems though then merging both but forcing a freeze on new features in
zcache1 seems like a reasonable compromise to me.
Mel Gorman Sept. 22, 2012, 1:07 a.m. UTC | #12
On Fri, Sep 21, 2012 at 01:35:15PM -0700, Dan Magenheimer wrote:
> > From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> > Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> > 
> > On 09/21/2012 01:02 PM, Konrad Rzeszutek Wilk wrote:
> > > On Fri, Sep 21, 2012 at 05:12:52PM +0100, Mel Gorman wrote:
> > >> On Tue, Sep 04, 2012 at 04:34:46PM -0500, Seth Jennings wrote:
> > >>> zcache is the remaining piece of code required to support in-kernel
> > >>> memory compression.  The other two features, cleancache and frontswap,
> > >>> have been promoted to mainline in 3.0 and 3.5 respectively.  This
> > >>> patchset promotes zcache from the staging tree to mainline.
> > 
> > >>
> > >> Very broadly speaking my initial reaction before I reviewed anything was
> > >> that *some* sort of usable backend for cleancache or frontswap should exist
> > >> at this point. My understanding is that Xen is the primary user of both
> > >> those frontends and ramster, while interesting, is not something that a
> > >> typical user will benefit from.
> > >
> > > Right, the majority of users do not use virtualization. Thought embedded
> > > wise .. well, there are a lot of Android users - thought I am not 100%
> > > sure they are using it right now (I recall seeing changelogs for the clones
> > > of Android mentioning zcache).
> > >>
> > >> That said, I worry that this has bounced around a lot and as Dan (the
> > >> original author) has a rewrite. I'm wary of spending too much time on this
> > >> at all. Is Dan's new code going to replace this or what? It'd be nice to
> > >> find a definitive answer on that.
> > >
> > > The idea is to take parts of zcache2 as seperate patches and stick it
> > > in the code you just reviewed (those that make sense as part of unstaging).
> > 
> > I agree with this.  Only the changes from zcache2 (Dan's
> > rewrite) that are necessary for promotion should be
> > considered right now.  Afaict, none of the concerns raised
> > in these comments are addressed by the changes in zcache2.
> 
> While I may agree with the proposed end result, this proposal
> is a _very_ long way away from a solution.  To me, it sounds like
> a "split the baby in half" proposal (cf. wisdom of Solomon)
> which may sound reasonable to some but, in the end, everyone loses.
> 

I tend to agree but this really is an unhappy situation that should be
resolved in the coming weeks instead of months if it's going to move
forward.

> I have proposed a reasonable compromise offlist to Seth, but
> it appears that it has been silently rejected; I guess it is
> now time to take the proposal public.  I apologize in advance
> for my characteristic bluntness...
> 

Meh, I'm ok with blunt.

> So let's consider two proposals and the pros and cons of them,
> before we waste any further mm developer time.  (Fortunately,
> most of Mel's insightful comments apply to both versions, though
> he did identify some of the design issues that led to zcache2!)
> 
> The two proposals:
> A) Recreate all the work done for zcache2 as a proper sequence of
>    independent patches and apply them to zcache1. (Seth/Konrad)
> B) Add zsmalloc back in to zcache2 as an alternative allocator
>    for frontswap pages. (Dan)

Throwing it out there but ....

C) Merge both, but freeze zcache1 except for critical fixes. Only allow
   future work on zcache2. Document limitations of zcache1 and
   workarounds until zcache2 is fully production ready.

> 
> Pros for (A):
> 1. It better preserves the history of the handful of (non-zsmalloc)
>    commits in the original zcache code.

Marginal benefit.

> 2. Seth[1] can incrementally learn the new designs by reading
>    normal kernel patches.

Which would be nice but that is not exactly compelling.

> 3. For kernel purists, it is the _right_ way dammit (and Dan
>    should be shot for redesigning code non-incrementally, even
>    if it was in staging, etc.)

Yes, but there are historical examples of ditching something completely
too. USB has been ditched a few times. Andrea shot a large chunk of the
VM out the window in 2.6.10. jbd vs jbd2 is still there.

> 4. Seth believes that zcache will be promoted out of staging sooner
>    because, except for a few nits, it is ready today.
> 

I wouldn't call them minor but it's probably better understood by more
people. It's why I'd be sortof ok with promoting zcache1 as long as
the limitations were clearly understood and there was a migration path
to zcache2.

> Cons for (A):
> 1. Nobody has signed up to do the work, including testing.  It
>    took the author (and sole expert on all the components
>    except zsmalloc) between two and three months essentially
>    fulltime to move zcache1->zcache2.  So forward progress on
>    zcache will likely be essentially frozen until at least the
>    end of 2012, possibly a lot longer.

This to me is a big issue. It's one reason why I think it would be ok for
zcache1 + zcache2 to exist in parallel but zcache1 would have to freeze for
this to be sensible. If zcache1 gained capabilities that zcache2 did *not*
have, it would be very problematic.

> 2. The end result (if we reach one) is almost certainly a
>    _third_ implementation of zcache: "zcache 1.5".  So
>    we may not be leveraging much of the history/testing
>    from zcache1 anyway!

Sod that.

> 3. Many of the zcache2 changes are closely interwoven so
>    a sequence of patches may not be much more incrementally
>    readable than zcache2.

Impossible for me to tell unfortunately. I'm too much of a newbie.

> 4. The merge with ramster will likely be very low priority
>    so the fork between the two will continue.

If zcache1 froze and ramaster supported only zcache2, it would be a path
to promotion for ramster, right?

> 5. Dan believes that, if zcache1 does indeed get promoted with
>    few or none of the zcache2 redesigns, zcache will never
>    get properly finished.
> 

This is the tricky part. If zcache1 gets promoted then zcache2 still needs
to go somewhere. My feeling is that we should promote both once testing
indicates that zcache2 does not regress in comparison to zcache1. It would
be nice to agree on what that testing would look like. I would like to
suggest MMTests with some configuration files because it should only take
a few hours to implement some zcache support. Other than the kernel
parameter this should not be a major problem. If it is, it actually
indicates that the feature is basically unusable for mere mortals :)

> Pros for (B):
> 1. Many of the design issues/constraints of zcache are resolved
>    in code that has already been tested approximately as well
>    as the original. All of the redesign (zcache1->zcache2) has
>    been extensively discussed on-list; only the code itself is
>    "non-incremental".

If zcache2 resolves some of the fundamental problems of zcache1 then it
cannot be ignored.

> 2. Both allocators (which AFAIK is the only technical area
>    of controversy) will be supported in the same codebase.
> 3. Dan (especially with help from Seth) can do the work in a
>    week or two, and then we can immediately move forward
>    doing useful work and adding features on a solid codebase.
> 4. Zcache2 already has the foundation in place for "reclaim
>    frontswap zpages", which mm experts have noted is a critical
>    requirement for broader zcache acceptance (e.g. KVM).

I, for one, am really concerned about the reclaim frontswap zpages
problem. I think it potentially leads to deadlock and if zcache2 deals
with it, that's great.

> 5. Ramster is already a small incremental addition to core zcache2 code
>    rather than a fork.  While many may ignore ramster as "not valuable",
>    it is the foundation for future related work so there's a reasonable
>    chance that some form of ramster will need to be merged in the future.
> 
> Cons for (B):
> 1. Seth [1] has to relearn some of the zcache2 code via diffs and
>    code reading instead of incremental patches.
> 2. Dan doesn't get properly punished for not doing incremental patches.
> 

Neither of those cons are compelling to me. zcache2 may require a full
review from scratch which is annoying but hardly insurmountable. Minimally
it should be possible to batter both with blackbox testing and at least
confirm that zcache2 does not regress in comparison to zcache1. If both
pass the same testing, promote both but freeze zcache1 and document the
limitations and do all future development on zcache2. People that are
currently supporting zcache1 can continue to do so and merge critical
fixes while migrating to zcache2 over time.

> [1] With all due respect, at this time, there are really only
> two people in the world that have a reasonably deep understanding
> of zcache and the technologies it's built on: Dan and Seth.

Which may be correct but I would expect that this would change once
something gets promoted out of staging.
Sasha Levin Sept. 22, 2012, 1:31 p.m. UTC | #13
On 09/21/2012 09:14 PM, Dan Magenheimer wrote:
>>> +#define MAX_CLIENTS 16
>> > 
>> > Seems a bit arbitrary. Why 16?
> Sasha Levin posted a patch to fix this but it was tied in to
> the proposed KVM implementation, so was never merged.
> 

My patch changed the max pools per client, not the maximum amount of clients.
That patch has already found it's way in.

(MAX_CLIENTS does look like an arbitrary number though).


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Sasha Levin Sept. 22, 2012, 1:38 p.m. UTC | #14
On 09/22/2012 03:31 PM, Sasha Levin wrote:
> On 09/21/2012 09:14 PM, Dan Magenheimer wrote:
>>>> +#define MAX_CLIENTS 16
>>>>
>>>> Seems a bit arbitrary. Why 16?
>> Sasha Levin posted a patch to fix this but it was tied in to
>> the proposed KVM implementation, so was never merged.
>>
> 
> My patch changed the max pools per client, not the maximum amount of clients.
> That patch has already found it's way in.
> 
> (MAX_CLIENTS does look like an arbitrary number though).

btw, while we're on the subject of KVM, the implementation of tmem/kvm was
blocked due to insufficient performance caused by the lack of multi-page
ops/batching.

Are there any plans to make it better in the future?


Thanks,
Sasha

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Sept. 22, 2012, 9:18 p.m. UTC | #15
> From: Mel Gorman [mailto:mgorman@suse.de]
> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> 
> On Fri, Sep 21, 2012 at 01:35:15PM -0700, Dan Magenheimer wrote:
> > > From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> > > Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> > The two proposals:
> > A) Recreate all the work done for zcache2 as a proper sequence of
> >    independent patches and apply them to zcache1. (Seth/Konrad)
> > B) Add zsmalloc back in to zcache2 as an alternative allocator
> >    for frontswap pages. (Dan)
> 
> Throwing it out there but ....
> 
> C) Merge both, but freeze zcache1 except for critical fixes. Only allow
>    future work on zcache2. Document limitations of zcache1 and
>    workarounds until zcache2 is fully production ready.

Hi Mel (with request for Seth below) --

(C) may be the politically-expedient solution but, personally,
I think it is a bit insane and I suspect that any mm developer
who were to deeply review both codebases side-by-side would come to
the same conclusion.  The cost in developer/maintainer time,
and the confusion presented to the user/distro base if both
are promoted/merged would be way too high, and IMHO completely
unwarranted.  Let me try to explain...

I use the terms "zcache1" and "zcache2" only to clarify which
codebase, not because they are dramatically different. I estimate
that 85%-90% of the code in zcache1 and zcache2 is identical, not
counting the allocator or comments/whitespace/janitorial!

Zcache2 _is_ zcache1 with some good stuff added and with zsmalloc
dropped.  I think after careful study, there would be wide agreement
among mm developers that the stuff added is all moving in the direction
of making zcache "production-ready".  IMHO, zcache1 has _never_
been production-ready, and zcache2 is merely a big step in the right
direction.

(Quick logistical aside: zcache2 is in staging-next and linux-next,
currently housed under the drivers/staging/ramster directory...
with !CONFIG_RAMSTER, ramster _is_ zcache2.)

Seth (and IBM) seems to have a bee in his bonnet that the existing
zcache1 code _must_ be promoted _soon_ with as little change as possible.
Other than the fact that he didn't like my patching approach [1],
the only technical objection Seth has raised to zcache2 is that he
thinks zsmalloc is the best choice of allocator [2] for his limited
benchmarking [3].

I've offered to put zsmalloc back in to zcache2 as an optional
(even default) allocator, but that doesn't seem to be good enough
for Seth.  Any other technical objections to zcache2, or explanation
for his urgent desire to promote zcache1, Seth (and IBM) is keeping
close to his vest, which I find to be a bit disingenuous.

So, I'd like to challenge Seth with a simple question:

If zcache2 offers zsmalloc as an alternative (even default) allocator,
what remaining _technical_ objections do you (Seth) have to merging
zcache2 _instead_ of zcache1?

If Mel agrees that your objections are worth the costs of bifurcating
zcache and will still endorse merging both into core mm, I agree to move
forward with Mel's alternative (C) (and will then repost
https://lkml.org/lkml/2012/7/31/573).

Personally, I would _really_ like to get back to writing code to make
zcacheN more suitable for production so would really like to see this
resolved!

Dan

[1] Monolithic, because GregKH seemed to be unwilling to take further
patches to zcache before it was promoted, and because I thought
a number of things had to be fixed before I would feel comfortable
presenting zcache to be reviewed by mm developers
[2] Note, zsmalloc is used in zcache1 only for frontswap pages...
zbud is used in both zcache1 and zcache2 for cleancache pages.
[3] I've never seen any benchmark results posted for zcache other
than some variation of kernbench.  IMHO that's an issue all in itself.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
James Bottomley Sept. 23, 2012, 7:34 a.m. UTC | #16
On Sat, 2012-09-22 at 02:07 +0100, Mel Gorman wrote:
> > The two proposals:
> > A) Recreate all the work done for zcache2 as a proper sequence of
> >    independent patches and apply them to zcache1. (Seth/Konrad)
> > B) Add zsmalloc back in to zcache2 as an alternative allocator
> >    for frontswap pages. (Dan)
> 
> Throwing it out there but ....
> 
> C) Merge both, but freeze zcache1 except for critical fixes. Only
> allow
>    future work on zcache2. Document limitations of zcache1 and
>    workarounds until zcache2 is fully production ready.
> 
Actually, there is a fourth option, which is the one we'd have usually
used when staging wasn't around:  Throw the old code out as a successful
prototype which showed the author how to do it better (i.e. flush it
from staging) and start again from the new code which has all the
benefits learned from the old code.

Staging isn't supposed to be some magical set of history that we have to
adhere to no matter what (unlike the rest of the tree). It's supposed to
be an accelerator to get stuff into the kernel and not become a
hindrance to it.

There also seem to be a couple of process issues here that could do with
sorting:  Firstly that rewrites on better reflection, while not common,
are also not unusual so we need a mechanism for coping with them.  This
is actually a serious process problem: everyone becomes so attached to
the code they helped clean up that they're hugely unwilling to
countenance a rewrite which would in their (probably correct) opinion
have the cleanups start from ground zero again. Secondly, we've got a
set of use cases and add ons which grew up around code in staging that
act as a bit of a barrier to ABI/API evolution, even as they help to
demonstrate the problems.

I think the first process issue really crystallises the problem we're
having in staging:  we need to get the design approximately right before
we start on the code cleanups.  What I think this means is that we start
on the list where the people who understand the design issues reside
then, when they're happy with the design, we can begin cleaning it up
afterwards if necessary.  I don't think this is hard and fast: there is,
of course, code so bad that even the experts can't penetrate it to see
the design without having their eyes bleed but we should at least always
try to begin with design.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Mel Gorman Sept. 24, 2012, 10:31 a.m. UTC | #17
On Sat, Sep 22, 2012 at 02:18:44PM -0700, Dan Magenheimer wrote:
> > From: Mel Gorman [mailto:mgorman@suse.de]
> > Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> > 
> > On Fri, Sep 21, 2012 at 01:35:15PM -0700, Dan Magenheimer wrote:
> > > > From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> > > > Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> > > The two proposals:
> > > A) Recreate all the work done for zcache2 as a proper sequence of
> > >    independent patches and apply them to zcache1. (Seth/Konrad)
> > > B) Add zsmalloc back in to zcache2 as an alternative allocator
> > >    for frontswap pages. (Dan)
> > 
> > Throwing it out there but ....
> > 
> > C) Merge both, but freeze zcache1 except for critical fixes. Only allow
> >    future work on zcache2. Document limitations of zcache1 and
> >    workarounds until zcache2 is fully production ready.
> 
> Hi Mel (with request for Seth below) --
> 
> (C) may be the politically-expedient solution but, personally,
> I think it is a bit insane and I suspect that any mm developer
> who were to deeply review both codebases side-by-side would come to
> the same conclusion. 

I have not read zcache2 and maybe it is the case that no one in their
right mind would use zcache1 if zcache2 was available but the discussion
keeps going in circles.

> The cost in developer/maintainer time,
> and the confusion presented to the user/distro base if both
> are promoted/merged would be way too high, and IMHO completely
> unwarranted.  Let me try to explain...
> 

What would the impact be if zcache2 and zcache1 were mutually exclusive
in Kconfig and the naming was as follows?

CONFIG_ZCACHE_DEPRECATED	(zcache1)
CONFIG_ZCACHE			(zcache2)

That would make it absolutely clear to distributions which one they should
be enabling and also make it clear that all future development happen
on zcache2.

I know it looks insane to promote something that is instantly deprecated
but none of the other alternatives seem to be gaining traction either.
This would at least allow the people who are currently heavily behind
zcache1 to continue supporting it and applying critical fixes until they
move to zcache2.

> I use the terms "zcache1" and "zcache2" only to clarify which
> codebase, not because they are dramatically different. I estimate
> that 85%-90% of the code in zcache1 and zcache2 is identical, not
> counting the allocator or comments/whitespace/janitorial!
> 

If 85-90% of the code is identicial then they really should be sharing
the code rather than making copies. That will result in some monolithic
patches but it's unavoidable. I expect it would end up looking like

Patch 1		promote zcache1
Patch 2		promote zcache2
Patch 3		move shared code for zcache1,zcache2 to common files

If the shared code is really shared and not copied it may reduce some of
the friction between the camps.

> Zcache2 _is_ zcache1 with some good stuff added and with zsmalloc
> dropped.  I think after careful study, there would be wide agreement
> among mm developers that the stuff added is all moving in the direction
> of making zcache "production-ready".  IMHO, zcache1 has _never_
> been production-ready, and zcache2 is merely a big step in the right
> direction.
> 

zcache1 does appear to have a few snarls that would make me wary of having
to support it. I don't know if zcache2 suffers the same problems or not
as I have not read it.

> (Quick logistical aside: zcache2 is in staging-next and linux-next,
> currently housed under the drivers/staging/ramster directory...
> with !CONFIG_RAMSTER, ramster _is_ zcache2.)
> 

Unfortunately, I'm not going to get the chance to review it in the
short-term. However, if zcache1 and zcache2 shared code in common files
it would at least reduce the amount of new code I have to read :)

> Seth (and IBM) seems to have a bee in his bonnet that the existing
> zcache1 code _must_ be promoted _soon_ with as little change as possible.
> Other than the fact that he didn't like my patching approach [1],
> the only technical objection Seth has raised to zcache2 is that he
> thinks zsmalloc is the best choice of allocator [2] for his limited
> benchmarking [3].
> 

FWIW, I would fear that kernbench is not that interesting a benchmark for
something like zcache. From an MM perspective, I would be wary that the
data compresses too well and fits too neatly in the different buckets and
make zsmalloc appear to behave much better than it would for a more general
workload.  Of greater concern is that the allocations for zcache would be
too short lived to measure if external fragmentation was a real problem
or not. This is pure guesswork as I didn't read zsmalloc but this is the
sort of problem I'd be looking out for if I did review it. In practice,
I would probably prefer to depend on zbud because it avoids the external
fragmentation problem even if it wasted memory but that's just me being
cautious.

> I've offered to put zsmalloc back in to zcache2 as an optional
> (even default) allocator, but that doesn't seem to be good enough
> for Seth.  Any other technical objections to zcache2, or explanation
> for his urgent desire to promote zcache1, Seth (and IBM) is keeping
> close to his vest, which I find to be a bit disingenuous.
> 

I can only guess what the reasons might be for this and none of the
guesses will help resolve this problem.

> So, I'd like to challenge Seth with a simple question:
> 
> If zcache2 offers zsmalloc as an alternative (even default) allocator,
> what remaining _technical_ objections do you (Seth) have to merging
> zcache2 _instead_ of zcache1?
> 
> If Mel agrees that your objections are worth the costs of bifurcating
> zcache and will still endorse merging both into core mm, I agree to move
> forward with Mel's alternative (C) (and will then repost
> https://lkml.org/lkml/2012/7/31/573).
> 

If you go with C), please also add another patch on top *if possible*
that actually shares any common code between zcache1 and zcache2.

> Personally, I would _really_ like to get back to writing code to make
> zcacheN more suitable for production so would really like to see this
> resolved!
> 
> Dan
> 
> [1] Monolithic, because GregKH seemed to be unwilling to take further
> patches to zcache before it was promoted, and because I thought
> a number of things had to be fixed before I would feel comfortable
> presenting zcache to be reviewed by mm developers
> [2] Note, zsmalloc is used in zcache1 only for frontswap pages...
> zbud is used in both zcache1 and zcache2 for cleancache pages.
> [3] I've never seen any benchmark results posted for zcache other
> than some variation of kernbench.  IMHO that's an issue all in itself.
Seth Jennings Sept. 24, 2012, 5:25 p.m. UTC | #18
On 09/21/2012 03:35 PM, Dan Magenheimer wrote:
>> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
>> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
>>
>> On 09/21/2012 01:02 PM, Konrad Rzeszutek Wilk wrote:
>>> On Fri, Sep 21, 2012 at 05:12:52PM +0100, Mel Gorman wrote:
>>>> On Tue, Sep 04, 2012 at 04:34:46PM -0500, Seth Jennings wrote:
>>>>> zcache is the remaining piece of code required to support in-kernel
>>>>> memory compression.  The other two features, cleancache and frontswap,
>>>>> have been promoted to mainline in 3.0 and 3.5 respectively.  This
>>>>> patchset promotes zcache from the staging tree to mainline.
>>
>>>>
>>>> Very broadly speaking my initial reaction before I reviewed anything was
>>>> that *some* sort of usable backend for cleancache or frontswap should exist
>>>> at this point. My understanding is that Xen is the primary user of both
>>>> those frontends and ramster, while interesting, is not something that a
>>>> typical user will benefit from.
>>>
>>> Right, the majority of users do not use virtualization. Thought embedded
>>> wise .. well, there are a lot of Android users - thought I am not 100%
>>> sure they are using it right now (I recall seeing changelogs for the clones
>>> of Android mentioning zcache).
>>>>
>>>> That said, I worry that this has bounced around a lot and as Dan (the
>>>> original author) has a rewrite. I'm wary of spending too much time on this
>>>> at all. Is Dan's new code going to replace this or what? It'd be nice to
>>>> find a definitive answer on that.
>>>
>>> The idea is to take parts of zcache2 as seperate patches and stick it
>>> in the code you just reviewed (those that make sense as part of unstaging).
>>
>> I agree with this.  Only the changes from zcache2 (Dan's
>> rewrite) that are necessary for promotion should be
>> considered right now.  Afaict, none of the concerns raised
>> in these comments are addressed by the changes in zcache2.
> 
> While I may agree with the proposed end result, this proposal
> is a _very_ long way away from a solution.  To me, it sounds like
> a "split the baby in half" proposal (cf. wisdom of Solomon)
> which may sound reasonable to some but, in the end, everyone loses.
> 
> I have proposed a reasonable compromise offlist to Seth, but
> it appears that it has been silently rejected; I guess it is
> now time to take the proposal public. I apologize in advance
> for my characteristic bluntness...
> 
> So let's consider two proposals and the pros and cons of them,
> before we waste any further mm developer time.  (Fortunately,
> most of Mel's insightful comments apply to both versions, though
> he did identify some of the design issues that led to zcache2!)
> 
> The two proposals:
> A) Recreate all the work done for zcache2 as a proper sequence of
>    independent patches and apply them to zcache1. (Seth/Konrad)
> B) Add zsmalloc back in to zcache2 as an alternative allocator
>    for frontswap pages. (Dan)
> 
> Pros for (A):
> 1. It better preserves the history of the handful of (non-zsmalloc)
>    commits in the original zcache code.
> 2. Seth[1] can incrementally learn the new designs by reading
>    normal kernel patches.

It's not a matter of breaking the patches up so that I can
understand them.  I understand them just fine as indicated
by my responses to the attempt to overwrite zcache/remove
zsmalloc:

https://lkml.org/lkml/2012/8/14/347
https://lkml.org/lkml/2012/8/17/498

zcache2 also crashes on PPC64, which uses 64k pages, because
a 4k maximum page size is hard coded into the new zbudpage
struct.

The point is to discuss and adopt each change on it's own
merits instead of this "take a 10k line patch or leave it"
approach.

> 3. For kernel purists, it is the _right_ way dammit (and Dan
>    should be shot for redesigning code non-incrementally, even
>    if it was in staging, etc.)

Dan says "dammit" to add a comic element to this point,
however, it is a valid point (minus the firing squad).

Lets be clear about what zcache2 is.  It is not a rewrite in
the way most people think: a refactored codebase the caries
out the same functional set as an original codebase.  It is
an _overwrite_ to accommodate an entirely new set of
functionally whose code doubles the size of the origin
codebase and regresses performance on the original
functionality.

> 4. Seth believes that zcache will be promoted out of staging sooner
>    because, except for a few nits, it is ready today.
> 
> Cons for (A):
> 1. Nobody has signed up to do the work, including testing.  It
>    took the author (and sole expert on all the components
>    except zsmalloc) between two and three months essentially
>    fulltime to move zcache1->zcache2.  So forward progress on
>    zcache will likely be essentially frozen until at least the
>    end of 2012, possibly a lot longer.

This is not true.  I have agreed to do the work necessary to
make zcache1 acceptable for mainline, which can include
merging changes from zcache2 if people agree it is a blocker.

> 2. The end result (if we reach one) is almost certainly a
>    _third_ implementation of zcache: "zcache 1.5".  So
>    we may not be leveraging much of the history/testing
>    from zcache1 anyway!
> 3. Many of the zcache2 changes are closely interwoven so
>    a sequence of patches may not be much more incrementally
>    readable than zcache2.
> 4. The merge with ramster will likely be very low priority
>    so the fork between the two will continue.
> 5. Dan believes that, if zcache1 does indeed get promoted with
>    few or none of the zcache2 redesigns, zcache will never
>    get properly finished.

What is "properly finished"?

> Pros for (B):
> 1. Many of the design issues/constraints of zcache are resolved
>    in code that has already been tested approximately as well
>    as the original. All of the redesign (zcache1->zcache2) has
>    been extensively discussed on-list; only the code itself is
>    "non-incremental".
> 2. Both allocators (which AFAIK is the only technical area
>    of controversy) will be supported in the same codebase.
> 3. Dan (especially with help from Seth) can do the work in a
>    week or two, and then we can immediately move forward
>    doing useful work and adding features on a solid codebase.

The continuous degradation of zcache as "demo" and the
assertion that zcache2 is the "solid codebase" is tedious.
zcache is actually being worked on by others and has been in
staging for years.  By definition, _it_ is the more
hardended codebase.

If there are results showing that zcache2 has superior
performance and stability on the existing use cases please
share them.  Otherwise this characterization is just propaganda.

> 4. Zcache2 already has the foundation in place for "reclaim
>    frontswap zpages", which mm experts have noted is a critical
>    requirement for broader zcache acceptance (e.g. KVM).

This is dead code in zcache2 right now and relies on
yet-to-be-posted changes to the core mm to work.

My impression is that folks are ok with adding this
functionality to zcache if/when a good way to do it is
presented, and it's absence is not a blocker for acceptance.

> 5. Ramster is already a small incremental addition to core zcache2 code
>    rather than a fork.

According to Greg's staging-next, ramster adds 6000 lines of
new code to zcache.

In summary, I really don't understand the objection to
promoting zcache and integrating zcache2 improvements and
features incrementally.  It seems very natural and
straightforward to me.  Rewrites can even happen in
mainline, as James pointed out.  Adoption in mainline just
provides a more stable environment for more people to use
and contribute to zcache.

--
Seth

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Sept. 24, 2012, 7:17 p.m. UTC | #19
> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> Subject: Re: [RFC] mm: add support for zsmalloc and zcache

Once again, you have completely ignored a reasonable
compromise proposal.  Why?

> According to Greg's staging-next, ramster adds 6000 lines of
> new code to zcache.
>   :
> functionally whose code doubles the size of the origin

Indeed, and the 6K lines is all in the ramster-specific directory.
I am not asking that ramster be promoted, only that the small
handful of hooks that enable ramster should exist in zcache
(and tmem) if/when zcache is promoted.  And zcache1+zsmalloc
does not have that.
 
> Lets be clear about what zcache2 is.  It is not a rewrite in
> the way most people think: a refactored codebase the caries
> out the same functional set as an original codebase.  It is
> an _overwrite_ to accommodate an entirely new set of
> functionally whose code doubles the size of the origin
> codebase and regresses performance on the original
> functionality.

There were some design deficiencies necessary to support a
range of workloads (other than just kernbench) and that
required some redesign.  Those have been clearly documented
in the post of zcache2 and discussed in other threads.  Other
than janitorial work (much of which was proposed by other people).
zcache2 is actually _less_ of  rewrite than most people think.

By "performance regression", you mean it doesn't use zsmalloc
because zbud has to make more conservative assumptions than
"works really well on kernbench".  Mel identified his preference
for conservative assumptions.  The compromise I have
proposed will give you back zsmalloc for your use kernbench
use case.  Why is that not good enough?

Overwrite was simply a mechanism to avoid a patch post that
nobody (other than you) would be able to read.  Anyone
can do a diff. Focusing on the patch mechanism is a red herring.

> > 4. Seth believes that zcache will be promoted out of staging sooner
> >    because, except for a few nits, it is ready today.
> >
> > Cons for (A):
> > 1. Nobody has signed up to do the work, including testing.  It
> >    took the author (and sole expert on all the components
> >    except zsmalloc) between two and three months essentially
> >    fulltime to move zcache1->zcache2.  So forward progress on
> >    zcache will likely be essentially frozen until at least the
> >    end of 2012, possibly a lot longer.
> 
> This is not true.  I have agreed to do the work necessary to
> make zcache1 acceptable for mainline, which can include
> merging changes from zcache2 if people agree it is a blocker.
>  :
> What is "properly finished"?

In the compromise I have proposed, the work is already done.

You have claimed that that work is not necessary, because it
doesn't help zsmalloc or kernbench.  You have refused to
adapt zsmalloc to meet the needs I have described.  Further
(and sorry to be so horribly blunt in public but, by claiming
you are going to do the work, you are asking for it), you have
NOT designed or written any significant code in the kernel,
just patched and bugfixed and tested and run kernbench on
zcache.  (Zsmalloc, which you have championed, was written
by Nitin and adapted by you.)

And you've continued with (IMHO) disingenuous behavior.
While I understand all too well why that may be necessary
when working for a big company, it makes it very hard to
identify an acceptable compromise.

So, no I don't really trust that you have either the intent
or ability to do the redesigns that I feel (and echoed by
Andrea and Mel) are necessary for zcache to be more than
toy "demo" code.

> The continuous degradation of zcache as "demo" and the

I call it demo code because I wrote it as a demo to
show that in-kernel compression could be a user of
cleancache and frontswap.

I'm not criticizing your code or anyone else's,
I am criticizing MY OWN code.  I had no illusion
that zcache (aka zcache1) was ready for promotion.
It sucked in a number of ways.  MM developers with
real experience in the complexity of managing memory,
Mel and Andrea, without digging very hard, identified
those same ways it sucks.  I'm trying to fix those.
Are you?

> assertion that zcache2 is the "solid codebase" is tedious.
> zcache is actually being worked on by others and has been in
> staging for years.  By definition, _it_ is the more
> hardended codebase.

Please be more specific (and I don't mean a meaningless count
of patches).  Other than your replacement of xvmalloc with
zsmalloc and a bug fix or three, can you point to anything
that was more than cleanup?  Can you point to any broad
workload testing?  And for those two Android distros that have
included zcache (despite the fact that anything in staging
taints the kernel), can you demonstrate that those distros 
have enabled it or even documented to their users _how_ to
enable it?

> If there are results showing that zcache2 has superior
> performance and stability on the existing use cases please
> share them.  Otherwise this characterization is just propaganda.

Neither of us can demonstrate superior performance on
anything other than kernbench, nor stability on use
cases other than kernbench.  You have repeatedly stated
that performance and stability on kernbench is sufficient
for promotion.

But I agree that it is propaganda regardless of who states
it, so if you stop claiming zcache1 has had enough exposure
to warrant promotion, I won't say that zcache2 is
more stable.

> > 4. Zcache2 already has the foundation in place for "reclaim
> >    frontswap zpages", which mm experts have noted is a critical
> >    requirement for broader zcache acceptance (e.g. KVM).
> 
> This is dead code in zcache2 right now and relies on
> yet-to-be-posted changes to the core mm to work.
> 
> My impression is that folks are ok with adding this
> functionality to zcache if/when a good way to do it is
> presented, and it's absence is not a blocker for acceptance.

Andrea and Mel have both stated they think it is necessary.
Much of the redesign in zcache2 is required to provide
it.  And it is yet-to-be-posted because I'm wasting so
much time quibbling with you so that the foundation design
changes and code necessary don't get thrown away.

> > 5. Ramster is already a small incremental addition to core zcache2 code
> >    rather than a fork.
> 
> In summary, I really don't understand the objection to
> promoting zcache and integrating zcache2 improvements and
> features incrementally.  It seems very natural and
> straightforward to me.  Rewrites can even happen in
> mainline, as James pointed out.  Adoption in mainline just
> provides a more stable environment for more people to use
> and contribute to zcache.

And I, as I have stated repeatedly, don't understand why
anyone would argue to throw away (or even re-do) months of
useful work when a reasonable compromise has been proposed.

James pointed out that the design should best be evolved
until it is right _while_ in staging and, _if_ _necessary_
redesigns can be done after promotion.  You have repeatedly
failed to identify why you think it is necessary to do
it bass-ackwards.

> zcache2 also crashes on PPC64, which uses 64k pages, because
> a 4k maximum page size is hard coded into the new zbudpage
> struct.

OK, that sounds like a bug on a machine few developers have
access to.  So let's fix it (on zcache2).  It doesn't sound
to me like a reason to throw away all the forward progress
and work put into zcache2.  But with the compromise
I proposed, zcache2+zsmalloc wouldn't use zbud on
PPC64 anyway, right?

I simply do NOT understand why you are fighting so hard to
promote old code that works on toy benchmarks.  I'm fighting
for the integrity of a signficiant memory management feature
that _I_ wrote, and _I_ understand thoroughly enough to know
its design flaws, and have demonstrated the desire
and ability to continue to develop/evolve/finish.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Sept. 24, 2012, 8:05 p.m. UTC | #20
> From: James Bottomley [mailto:James.Bottomley@HansenPartnership.com]
> Subject: Re: [RFC] mm: add support for zsmalloc and zcache

> On Sat, 2012-09-22 at 02:07 +0100, Mel Gorman wrote:
> > > The two proposals:
> > > A) Recreate all the work done for zcache2 as a proper sequence of
> > >    independent patches and apply them to zcache1. (Seth/Konrad)
> > > B) Add zsmalloc back in to zcache2 as an alternative allocator
> > >    for frontswap pages. (Dan)
> >
> > Throwing it out there but ....
> >
> > C) Merge both, but freeze zcache1 except for critical fixes. Only
> > allow
> >    future work on zcache2. Document limitations of zcache1 and
> >    workarounds until zcache2 is fully production ready.
> >
> Actually, there is a fourth option, which is the one we'd have usually
> used when staging wasn't around:  Throw the old code out as a successful
> prototype which showed the author how to do it better (i.e. flush it
> from staging) and start again from the new code which has all the
> benefits learned from the old code.
> 
> Staging isn't supposed to be some magical set of history that we have to
> adhere to no matter what (unlike the rest of the tree). It's supposed to
> be an accelerator to get stuff into the kernel and not become a
> hindrance to it.
> 
> There also seem to be a couple of process issues here that could do with
> sorting:  Firstly that rewrites on better reflection, while not common,
> are also not unusual so we need a mechanism for coping with them.  This
> is actually a serious process problem: everyone becomes so attached to
> the code they helped clean up that they're hugely unwilling to
> countenance a rewrite which would in their (probably correct) opinion
> have the cleanups start from ground zero again. Secondly, we've got a
> set of use cases and add ons which grew up around code in staging that
> act as a bit of a barrier to ABI/API evolution, even as they help to
> demonstrate the problems.
> 
> I think the first process issue really crystallises the problem we're
> having in staging:  we need to get the design approximately right before
> we start on the code cleanups.  What I think this means is that we start
> on the list where the people who understand the design issues reside
> then, when they're happy with the design, we can begin cleaning it up
> afterwards if necessary.  I don't think this is hard and fast: there is,
> of course, code so bad that even the experts can't penetrate it to see
> the design without having their eyes bleed but we should at least always
> try to begin with design.


Hi James --

I think you've hit the nail on the head, generalizing this interminable
debate into a process problem that needs to be solved more generally.
Thanks for your insight!

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Sept. 24, 2012, 8:36 p.m. UTC | #21
> From: Mel Gorman [mailto:mgorman@suse.de]
> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> 
> On Sat, Sep 22, 2012 at 02:18:44PM -0700, Dan Magenheimer wrote:
> > > From: Mel Gorman [mailto:mgorman@suse.de]
> > > Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> > >
> > > On Fri, Sep 21, 2012 at 01:35:15PM -0700, Dan Magenheimer wrote:
> > > > > From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> > > > > Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> > > > The two proposals:
> > > > A) Recreate all the work done for zcache2 as a proper sequence of
> > > >    independent patches and apply them to zcache1. (Seth/Konrad)
> > > > B) Add zsmalloc back in to zcache2 as an alternative allocator
> > > >    for frontswap pages. (Dan)
> > >
> > > Throwing it out there but ....
> > >
> > > C) Merge both, but freeze zcache1 except for critical fixes. Only allow
> > >    future work on zcache2. Document limitations of zcache1 and
> > >    workarounds until zcache2 is fully production ready.
> >
> What would the impact be if zcache2 and zcache1 were mutually exclusive
> in Kconfig and the naming was as follows?
> 
> CONFIG_ZCACHE_DEPRECATED	(zcache1)
> CONFIG_ZCACHE			(zcache2)
> 
> That would make it absolutely clear to distributions which one they should
> be enabling and also make it clear that all future development happen
> on zcache2.
> 
> I know it looks insane to promote something that is instantly deprecated
> but none of the other alternatives seem to be gaining traction either.
> This would at least allow the people who are currently heavily behind
> zcache1 to continue supporting it and applying critical fixes until they
> move to zcache2.

Just wondering... how, in your opinion, is this different from
leaving zcache1 (or even both) in staging?  "Tainting" occurs
either way, it's just a matter of whether or not there is a message
logged by the kernel that it is officially tainted, right?

However, it _is_ another attempt at compromise and, if this
is the only solution that allows the debate to end, and it
is agreed on by whatever maintainer is committed to pull
both (be it you, or Andrew, or Konrad, or Linux), I would
agree to your "C-prime" proposal.
 
> > I use the terms "zcache1" and "zcache2" only to clarify which
> > codebase, not because they are dramatically different. I estimate
> > that 85%-90% of the code in zcache1 and zcache2 is identical, not
> > counting the allocator or comments/whitespace/janitorial!
> 
> If 85-90% of the code is identicial then they really should be sharing
> the code rather than making copies. That will result in some monolithic
> patches but it's unavoidable. I expect it would end up looking like
> 
> Patch 1		promote zcache1
> Patch 2		promote zcache2
> Patch 3		move shared code for zcache1,zcache2 to common files
> 
> If the shared code is really shared and not copied it may reduce some of
> the friction between the camps.

This part I would object to... at least I would object to signing
up to do Patch 3 myself.  Seems like a lot of busywork if zcache1
is truly deprecated.

> zcache1 does appear to have a few snarls that would make me wary of having
> to support it. I don't know if zcache2 suffers the same problems or not
> as I have not read it.
> 
> Unfortunately, I'm not going to get the chance to review [zcache2] in the
> short-term. However, if zcache1 and zcache2 shared code in common files
> it would at least reduce the amount of new code I have to read :)

Understood, which re-emphasizes my point about how the presence
of both reduces the (to date, very limited) MM developer time available
for either.

> > Seth (and IBM) seems to have a bee in his bonnet that the existing
> > zcache1 code _must_ be promoted _soon_ with as little change as possible.
> > Other than the fact that he didn't like my patching approach [1],
> > the only technical objection Seth has raised to zcache2 is that he
> > thinks zsmalloc is the best choice of allocator [2] for his limited
> > benchmarking [3].
> 
> FWIW, I would fear that kernbench is not that interesting a benchmark for
> something like zcache. From an MM perspective, I would be wary that the
> data compresses too well and fits too neatly in the different buckets and
> make zsmalloc appear to behave much better than it would for a more general
> workload.  Of greater concern is that the allocations for zcache would be
> too short lived to measure if external fragmentation was a real problem
> or not. This is pure guesswork as I didn't read zsmalloc but this is the
> sort of problem I'd be looking out for if I did review it. In practice,
> I would probably prefer to depend on zbud because it avoids the external
> fragmentation problem even if it wasted memory but that's just me being
> cautious.

Your well-honed intuition is IMHO exactly right.

But my compromise proposal would allow the allocator decision to be delayed
until a broader set of workloads are brought to bear.

> > I've offered to put zsmalloc back in to zcache2 as an optional
> > (even default) allocator, but that doesn't seem to be good enough
> > for Seth.  Any other technical objections to zcache2, or explanation
> > for his urgent desire to promote zcache1, Seth (and IBM) is keeping
> > close to his vest, which I find to be a bit disingenuous.
> 
> I can only guess what the reasons might be for this and none of the
> guesses will help resolve this problem.

Me too.  Given the amount of time already spent on this discussion
(and your time reviewing, IMHO, old code), I sure hope the reasons
are compelling.

It's awfully hard to determine a compromise when one side
refuses to budge for unspecified reasons.   And the difference
between deprecated and in-staging seems minor enough that
it's hard to believe your modified proposal will make that
side happy... but we are both shooting in the dark.

> > So, I'd like to challenge Seth with a simple question:
> >
> > If zcache2 offers zsmalloc as an alternative (even default) allocator,
> > what remaining _technical_ objections do you (Seth) have to merging
> > zcache2 _instead_ of zcache1?
> >
> > If Mel agrees that your objections are worth the costs of bifurcating
> > zcache and will still endorse merging both into core mm, I agree to move
> > forward with Mel's alternative (C) (and will then repost
> > https://lkml.org/lkml/2012/7/31/573).
> 
> If you go with C), please also add another patch on top *if possible*
> that actually shares any common code between zcache1 and zcache2.

Let's hear Seth's technical objections first, and discuss post-merge
followon steps later?

Thanks again, Mel, for wading into this.  Hopefully the disagreement
can be resolved and I will value your input on some of the zcache next
steps currently blocked by this unfortunate logjam.

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Mel Gorman Sept. 25, 2012, 10:20 a.m. UTC | #22
On Mon, Sep 24, 2012 at 01:36:48PM -0700, Dan Magenheimer wrote:
> > From: Mel Gorman [mailto:mgorman@suse.de]
> > Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> > 
> > On Sat, Sep 22, 2012 at 02:18:44PM -0700, Dan Magenheimer wrote:
> > > > From: Mel Gorman [mailto:mgorman@suse.de]
> > > > Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> > > >
> > > > On Fri, Sep 21, 2012 at 01:35:15PM -0700, Dan Magenheimer wrote:
> > > > > > From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> > > > > > Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> > > > > The two proposals:
> > > > > A) Recreate all the work done for zcache2 as a proper sequence of
> > > > >    independent patches and apply them to zcache1. (Seth/Konrad)
> > > > > B) Add zsmalloc back in to zcache2 as an alternative allocator
> > > > >    for frontswap pages. (Dan)
> > > >
> > > > Throwing it out there but ....
> > > >
> > > > C) Merge both, but freeze zcache1 except for critical fixes. Only allow
> > > >    future work on zcache2. Document limitations of zcache1 and
> > > >    workarounds until zcache2 is fully production ready.
> > >
> > What would the impact be if zcache2 and zcache1 were mutually exclusive
> > in Kconfig and the naming was as follows?
> > 
> > CONFIG_ZCACHE_DEPRECATED	(zcache1)
> > CONFIG_ZCACHE			(zcache2)
> > 
> > That would make it absolutely clear to distributions which one they should
> > be enabling and also make it clear that all future development happen
> > on zcache2.
> > 
> > I know it looks insane to promote something that is instantly deprecated
> > but none of the other alternatives seem to be gaining traction either.
> > This would at least allow the people who are currently heavily behind
> > zcache1 to continue supporting it and applying critical fixes until they
> > move to zcache2.
> 
> Just wondering... how, in your opinion, is this different from
> leaving zcache1 (or even both) in staging? 

Because leaving it in staging implies it is not supported. What I'm
suggesting is that zcache1 be promoted but marked deprecated. Seth and the
embedded people that use it should continue to support it as it currently
stands and fix any critical bugs that are reported but avoid writing new
features for it. The limitations of it should be documented.

> "Tainting" occurs
> either way, it's just a matter of whether or not there is a message
> logged by the kernel that it is officially tainted, right?
> 

Using a deprecated interface does not necessarily taint the kernel.

> However, it _is_ another attempt at compromise and, if this
> is the only solution that allows the debate to end, and it
> is agreed on by whatever maintainer is committed to pull
> both (be it you, or Andrew, or Konrad, or Linux), I would
> agree to your "C-prime" proposal.
>  

And bear in mind that I do not any sort of say in what happens
ultimately. I'm just suggesting alternatives here that may potentially
keep everyone happy (or at least stop it going in circles).

> > > I use the terms "zcache1" and "zcache2" only to clarify which
> > > codebase, not because they are dramatically different. I estimate
> > > that 85%-90% of the code in zcache1 and zcache2 is identical, not
> > > counting the allocator or comments/whitespace/janitorial!
> > 
> > If 85-90% of the code is identicial then they really should be sharing
> > the code rather than making copies. That will result in some monolithic
> > patches but it's unavoidable. I expect it would end up looking like
> > 
> > Patch 1		promote zcache1
> > Patch 2		promote zcache2
> > Patch 3		move shared code for zcache1,zcache2 to common files
> > 
> > If the shared code is really shared and not copied it may reduce some of
> > the friction between the camps.
> 
> This part I would object to... at least I would object to signing
> up to do Patch 3 myself.  Seems like a lot of busywork if zcache1
> is truly deprecated.
> 

It'd help the path to truly deprecating it.

1. Fixes in common code only have to be applied once. This avoids a
   situation where zcache1 gets a fix and zcache2 misses it and vice-versa.
   In a related note it makes it a bit more obvious is a new feature is
   attempted to be merged to zcache1

2. It forces the zcache2 and zcache1 people to keep more or less in sync
   with each other and limit API breakage between components.

3. It makes it absolutely clear what the differences between zcache1 and
   zcache2 are at any given time.

My expectation is that the zcache1-specific components would shrink over
time with zcache2 taking over responsibility. Ideally the end result
would be that zcache1 is just an alias for the zcache2 code.

I recognise that this is a lot of busy work and time-consuming but it's
at least *a* path that allows zcache1 to migrate to zcache2. Of course
if the zcache1 people do not support the idea in principal then it goes
back to square one.

> > zcache1 does appear to have a few snarls that would make me wary of having
> > to support it. I don't know if zcache2 suffers the same problems or not
> > as I have not read it.
> > 
> > Unfortunately, I'm not going to get the chance to review [zcache2] in the
> > short-term. However, if zcache1 and zcache2 shared code in common files
> > it would at least reduce the amount of new code I have to read :)
> 
> Understood, which re-emphasizes my point about how the presence
> of both reduces the (to date, very limited) MM developer time available
> for either.
> 

While that may be true, it's not looking like that one side will accept the
complete deletion of zcache1 on day 1. On the flip-side, they have a point
that zcache1 has been tested by more people even if there are some serious
limitations in the code.

> > > Seth (and IBM) seems to have a bee in his bonnet that the existing
> > > zcache1 code _must_ be promoted _soon_ with as little change as possible.
> > > Other than the fact that he didn't like my patching approach [1],
> > > the only technical objection Seth has raised to zcache2 is that he
> > > thinks zsmalloc is the best choice of allocator [2] for his limited
> > > benchmarking [3].
> > 
> > FWIW, I would fear that kernbench is not that interesting a benchmark for
> > something like zcache. From an MM perspective, I would be wary that the
> > data compresses too well and fits too neatly in the different buckets and
> > make zsmalloc appear to behave much better than it would for a more general
> > workload.  Of greater concern is that the allocations for zcache would be
> > too short lived to measure if external fragmentation was a real problem
> > or not. This is pure guesswork as I didn't read zsmalloc but this is the
> > sort of problem I'd be looking out for if I did review it. In practice,
> > I would probably prefer to depend on zbud because it avoids the external
> > fragmentation problem even if it wasted memory but that's just me being
> > cautious.
> 
> Your well-honed intuition is IMHO exactly right.
> 
> But my compromise proposal would allow the allocator decision to be delayed
> until a broader set of workloads are brought to bear.
> 

If the API to the underlying allocator is fixed it should be at least
possible to load either. It does not feel like an issue that should
completely hold up everything.

It may be the case that on day 1 that zcache2 cannot use zsmalloc but then
I'd expect that at least the zsmalloc allocator would be the first block
of code shared by both zcache1 and zcache2.

> > > I've offered to put zsmalloc back in to zcache2 as an optional
> > > (even default) allocator, but that doesn't seem to be good enough
> > > for Seth.  Any other technical objections to zcache2, or explanation
> > > for his urgent desire to promote zcache1, Seth (and IBM) is keeping
> > > close to his vest, which I find to be a bit disingenuous.
> > 
> > I can only guess what the reasons might be for this and none of the
> > guesses will help resolve this problem.
> 
> Me too.  Given the amount of time already spent on this discussion
> (and your time reviewing, IMHO, old code), I sure hope the reasons
> are compelling.
> 
> It's awfully hard to determine a compromise when one side
> refuses to budge for unspecified reasons.   And the difference
> between deprecated and in-staging seems minor enough that
> it's hard to believe your modified proposal will make that
> side happy... but we are both shooting in the dark.
> 

This is why I think the compromise is going to be promoting both,
marking deprecated and then share as much code as possible. Without the
sharing the split may remain permanent and just cause more problems in
the future.

> > > So, I'd like to challenge Seth with a simple question:
> > >
> > > If zcache2 offers zsmalloc as an alternative (even default) allocator,
> > > what remaining _technical_ objections do you (Seth) have to merging
> > > zcache2 _instead_ of zcache1?
> > >
> > > If Mel agrees that your objections are worth the costs of bifurcating
> > > zcache and will still endorse merging both into core mm, I agree to move
> > > forward with Mel's alternative (C) (and will then repost
> > > https://lkml.org/lkml/2012/7/31/573).
> > 
> > If you go with C), please also add another patch on top *if possible*
> > that actually shares any common code between zcache1 and zcache2.
> 
> Let's hear Seth's technical objections first, and discuss post-merge
> followon steps later?
> 

Sure, but bear in mind I do not have the final say in this, I'm just making
suggestions on how this logjam could potentially be cleared.
James Bottomley Sept. 25, 2012, 10:33 a.m. UTC | #23
On Mon, 2012-09-24 at 12:25 -0500, Seth Jennings wrote:
> In summary, I really don't understand the objection to
> promoting zcache and integrating zcache2 improvements and
> features incrementally.  It seems very natural and
> straightforward to me.  Rewrites can even happen in
> mainline, as James pointed out.  Adoption in mainline just
> provides a more stable environment for more people to use
> and contribute to zcache.

This is slightly disingenuous.  Acceptance into mainline commits us to
the interface.  Promotion from staging with simultaneous deprecation
seems like a reasonable (if inelegant) compromise, but the problem is
it's not necessarily a workable solution: as long as we have users of
the interface in mainline, we can't really deprecate stuff however many
feature deprecation files we fill in (I've had a deprecated SCSI ioctl
set that's been deprecated for ten years and counting).  What worries me
looking at this fight is that since there's a use case for the old
interface it will never really get removed.

Conversely, rewrites do tend to vastly increase the acceptance cycle
mainly because of reviewer fatigue (and reviews are our most precious
commodity in the kernel).  I'm saying rewrites should be possible in
staging because it was always possible on plain patch submissions; I'm
not saying they're desirable.  Every time I've seen a rewrite done, it
has added ~6mo-1yr to the acceptance cycle.  I sense that the fatigue
factor with transcendent memory is particularly high, so we're probably
looking at the outside edge of the estimate, so the author needs
seriously to consider if the rewrite is worth this.

Oh, and while this spat goes on, the stalemate is basically assured and
external goodwill eroding.  So, for god's sake find a mutually
acceptable compromise, because we're not going to find one for you.

James


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Sept. 25, 2012, 7:22 p.m. UTC | #24
> From: Sasha Levin [mailto:levinsasha928@gmail.com]
> Subject: Re: [RFC] mm: add support for zsmalloc and zcache

Sorry for delayed response!
 
> On 09/22/2012 03:31 PM, Sasha Levin wrote:
> > On 09/21/2012 09:14 PM, Dan Magenheimer wrote:
> >>>> +#define MAX_CLIENTS 16
> >>>>
> >>>> Seems a bit arbitrary. Why 16?
> >> Sasha Levin posted a patch to fix this but it was tied in to
> >> the proposed KVM implementation, so was never merged.
> >>
> >
> > My patch changed the max pools per client, not the maximum amount of clients.
> > That patch has already found it's way in.
> >
> > (MAX_CLIENTS does look like an arbitrary number though).
> 
> btw, while we're on the subject of KVM, the implementation of tmem/kvm was
> blocked due to insufficient performance caused by the lack of multi-page
> ops/batching.

Hmmm... I recall that was an unproven assertion.  The tmem/kvm
implementation was not exposed to any wide range of workloads
IIRC?  Also, the WasActive patch is intended to reduce the problem
that multi-guest high volume reads would provoke, so any testing
without that patch may be moot.
 
> Are there any plans to make it better in the future?

If it indeed proves to be a problem, the ramster-merged zcache
(aka zcache2) should be capable of managing a "split" zcache
implementation, i.e. zcache executing in the guest and "overflowing"
page cache pages to the zcache in the host, which should at least
ameliorate most of Avi's concern.  I personally have no plans
to implement that, but would be willing to assist if others
attempt to implement it.

The other main concern expressed by the KVM community, by
Andrea, was zcache's lack of ability to "overflow" frontswap
pages in the host to a real swap device.  The foundation
for that was one of the objectives of the zcache2 redesign;
I am working on a "yet-to-be-posted" patch built on top of zcache2
that will require some insight and review from MM experts.

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Seth Jennings Sept. 27, 2012, 8:25 p.m. UTC | #25
On 09/24/2012 02:17 PM, Dan Magenheimer wrote:
>> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
>> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> 
> Once again, you have completely ignored a reasonable
> compromise proposal.  Why?

We have users who are interested in zcache and we had hoped for a path
that didn't introduce an additional 6-12 month delay.  I am talking
with our team to determine a compromise that resolves this, but also
gets this feature into the hands of users that they can work with.
I'll be away from email until next week, but I wanted to get something
out to the mailing list before I left.  I need a couple days to give a
more definite answer.

Seth


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Sept. 27, 2012, 10:07 p.m. UTC | #26
> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> 
> On 09/24/2012 02:17 PM, Dan Magenheimer wrote:
> >> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> >> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> >
> > Once again, you have completely ignored a reasonable
> > compromise proposal.  Why?
> 
> We have users who are interested in zcache and we had hoped for a path
> that didn't introduce an additional 6-12 month delay.  I am talking
> with our team to determine a compromise that resolves this, but also
> gets this feature into the hands of users that they can work with.
> I'll be away from email until next week, but I wanted to get something
> out to the mailing list before I left.  I need a couple days to give a
> more definite answer.

Hi Seth --

James Bottomley's estimate of the additional 6-12 month
addition to the acceptance cycle was (quote) "every time I've
seen a rewrite done".  Especially with zsmalloc available
as an option in zcache2 (see separately-posted patch),
zcache2 is _really_ _not_ a rewrite, certainly not for
frontswap-centric workloads, which is I think where your
efforts have always been focused (and, I assume, your
future users).  I suspect if you walk through the code
paths in zcache2+zsmalloc, you'll find they are nearly
identical to zcache1, other than some very minor cleanups,
and some changes where Mel gave some feedback which would
need to be cleaned up in zcache1 before promotion anyway
(and happen to already have been cleaned up in zcache2).
The more invasive design changes are all on the zbud paths.

Of course, I'm of the opinion that neither zcache1 nor
zcache2 would be likely to be promoted for at least another
cycle or two, so if you go with zcache2+zsmalloc as the compromise
and it still takes six months for promotion, I hope you don't
blame that on the "rewrite". ;-)

Anyway, looking forward (hopefully) to working with you on
a good compromise.  It would be nice to get back to coding
and working together on a single path forward for zcache
as there is a lot of work to do!

Have a great weekend!

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Seth Jennings Oct. 2, 2012, 6:02 p.m. UTC | #27
On 09/27/2012 05:07 PM, Dan Magenheimer wrote:
> Of course, I'm of the opinion that neither zcache1 nor
> zcache2 would be likely to be promoted for at least another
> cycle or two, so if you go with zcache2+zsmalloc as the compromise
> and it still takes six months for promotion, I hope you don't
> blame that on the "rewrite". ;-)
> 
> Anyway, looking forward (hopefully) to working with you on
> a good compromise.  It would be nice to get back to coding
> and working together on a single path forward for zcache
> as there is a lot of work to do!

We want to see zcache moving forward so that it can get out of staging
and into the hands of end users.  From the direction the discussion
has taken, replacing zcache with the new code appears to be the right
compromise for the situation.  Moving to the new zcache code resets
the clock so I would like to know that we're all on the same track...

1- Promotion must be the top priority, focus needs to be on making the
code production ready rather than adding more features.

2- The code is in the community and development must be done in
public, no further large private rewrites.

3- Benchmarks need to be agreed on, Mel has suggested some of the
MMTests. We need a way to talk about performance so we can make
comparisions, avoid regressions, and talk about promotion criteria.
They should be something any developer can run.

4- Let's investigate breaking ramster out of zcache so that zcache
remains a separately testable building block; Konrad was looking at
this I believe.  RAMSTer adds another functional mode for zcache and
adds to the difficulty of validating patches.  Not every developer
has a cluster of machines to validate RAMSter.

Seth

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Dan Magenheimer Oct. 2, 2012, 6:17 p.m. UTC | #28
> From: Seth Jennings [mailto:sjenning@linux.vnet.ibm.com]
> Subject: Re: [RFC] mm: add support for zsmalloc and zcache
> 
> On 09/27/2012 05:07 PM, Dan Magenheimer wrote:
> > Of course, I'm of the opinion that neither zcache1 nor
> > zcache2 would be likely to be promoted for at least another
> > cycle or two, so if you go with zcache2+zsmalloc as the compromise
> > and it still takes six months for promotion, I hope you don't
> > blame that on the "rewrite". ;-)
> >
> > Anyway, looking forward (hopefully) to working with you on
> > a good compromise.  It would be nice to get back to coding
> > and working together on a single path forward for zcache
> > as there is a lot of work to do!
> 
> We want to see zcache moving forward so that it can get out of staging
> and into the hands of end users.  From the direction the discussion
> has taken, replacing zcache with the new code appears to be the right
> compromise for the situation.  Moving to the new zcache code resets
> the clock so I would like to know that we're all on the same track...
> 
> 1- Promotion must be the top priority, focus needs to be on making the
> code production ready rather than adding more features.

Agreed.

> 2- The code is in the community and development must be done in
> public, no further large private rewrites.

Agreed.

> 3- Benchmarks need to be agreed on, Mel has suggested some of the
> MMTests. We need a way to talk about performance so we can make
> comparisions, avoid regressions, and talk about promotion criteria.
> They should be something any developer can run.

Agreed.

> 4- Let's investigate breaking ramster out of zcache so that zcache
> remains a separately testable building block; Konrad was looking at
> this I believe.  RAMSTer adds another functional mode for zcache and
> adds to the difficulty of validating patches.  Not every developer
> has a cluster of machines to validate RAMSter.

In zcache2 (which is now in Linus' 3.7-rc0 tree in the ramster directory),
ramster is already broken out.  It can be disabled either at compile-time
(simply by not specifying CONFIG_RAMSTER) or at run-time (by using
"zcache" as the kernel boot parameter instead of "ramster").

So... also agreed.  RAMster will not be allowed to get in the
way of promotion or performance as long as any reasonable attempt
is made to avoid breaking the existing hooks to RAMster.
(This only because I expect future functionality to also
use these hooks so would like to avoid breaking them, if possible.)

Does this last clarification work for you, Seth?

If so, <shake hands> and move forward?  What do you see as next steps?

Dan
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Seth Jennings Oct. 4, 2012, 2:36 p.m. UTC | #29
On 10/02/2012 01:17 PM, Dan Magenheimer wrote:
> If so, <shake hands> and move forward?  What do you see as next steps?

I'll need to get up to speed on the new codebase before I can answer
this.  I should be able to answer by early next week.

Seth

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Seth Jennings Oct. 26, 2012, 9:45 p.m. UTC | #30
On 10/02/2012 01:17 PM, Dan Magenheimer wrote:
> If so, <shake hands> and move forward?  What do you see as next steps?

I've been reviewing the changes between zcache and zcache2 and getting
a feel for the scope and direction of those changes.

- Getting the community engaged to review zcache1 at ~2300SLOC was
  difficult.
- Adding RAMSter has meant adding RAMSter-specific code broadly across
  zcache and increases the size of code to review to ~7600SLOC.
- The changes have blurred zcache's internal layering and increased
  complexity beyond what a simple SLOC metric can reflect.
- Getting the community engaged in reviewing zcache2 will be difficult
  and will require an exceptional amount of effort for maintainer and
  reviewer.

It is difficult for me to know when it could be ready for mainline and
production use.  While zcache2 isn't getting broad code reviews yet,
how do suggest managing that complexity to make the code maintainable
and get it reviewed?

Seth

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
Konrad Rzeszutek Wilk Nov. 2, 2012, 4:14 p.m. UTC | #31
On Fri, Oct 26, 2012 at 04:45:14PM -0500, Seth Jennings wrote:
> On 10/02/2012 01:17 PM, Dan Magenheimer wrote:
> > If so, <shake hands> and move forward?  What do you see as next steps?
> 
> I've been reviewing the changes between zcache and zcache2 and getting
> a feel for the scope and direction of those changes.
> 
> - Getting the community engaged to review zcache1 at ~2300SLOC was
>   difficult.
> - Adding RAMSter has meant adding RAMSter-specific code broadly across
>   zcache and increases the size of code to review to ~7600SLOC.

One can ignore the drivers/staging/ramster/ramster* directory.

> - The changes have blurred zcache's internal layering and increased
>   complexity beyond what a simple SLOC metric can reflect.

Not sure I see a problem.
> - Getting the community engaged in reviewing zcache2 will be difficult
>   and will require an exceptional amount of effort for maintainer and
>   reviewer.

Exceptional? I think if we start trimming the code down and moving it
around - and moving the 'ramster' specific calls to header files to
not be compiled - that should make it easier to read.

I mean the goal of any review is to address all of the concern you saw
when you were looking over the code. You probably have a page of
questions you asked yourself - and in all likehood the other reviewers
would ask the same questions. So if you address them - either by
giving comments or making the code easier to read - that would do it.

> 
> It is difficult for me to know when it could be ready for mainline and
> production use.  While zcache2 isn't getting broad code reviews yet,
> how do suggest managing that complexity to make the code maintainable
> and get it reviewed?

There are Mel's feedback that is also applicable to zcache2.

Thanks for looking at the code!
> 
> Seth
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Patch
diff mbox series

========
zcache is a backend to frontswap and cleancache that accepts pages from
those mechanisms and compresses them, leading to reduced I/O caused by
swap and file re-reads.  This is very valuable in shared storage situations
to reduce load on things like SANs.  Also, in the case of slow backing/swap
devices, zcache can also yield a performance gain.

In-Kernel Memory Compression Overview:

 swap subsystem            page cache
        +                      +
    frontswap              cleancache
        +                      +
zcache frontswap glue  zcache cleancache glue
        +                      +
        +---------+------------+
                  +
            zcache/tmem core
                  +
        +---------+------------+
        +                      +
     zsmalloc                 zbud

Everything below the frontswap/cleancache layer is current inside the
zcache driver expect for zsmalloc which is a shared between zcache and
another memory compression driver, zram.

Since zcache is dependent on zsmalloc, it is also being promoted by this
patchset.

For information on zsmalloc and the rationale behind it's design and use
cases verses already existing allocators in the kernel:

https://lkml.org/lkml/2012/1/9/386

zsmalloc is the allocator used by zcache to store persistent pages that
comes from frontswap, as opposed to zbud which is the (internal) allocator
used for ephemeral pages from cleancache.

zsmalloc uses many fields of the page struct to create it's conceptual
high-order page called a zspage.  Exactly which fields are used and for
what purpose is documented at the top of the zsmalloc .c file.  Because
zsmalloc uses struct page extensively, Andrew advised that the
promotion location be mm/:

https://lkml.org/lkml/2012/1/20/308

Zcache is added in a new driver class under drivers/ named mm for
memory management related drivers.  This driver class would be for
drivers that don't actually enabled a hardware device, but rather
augment the memory manager in some way.  Other in-tree candidates
for this driver class are zram and lowmemorykiller, both in staging.

Some benchmarking numbers demonstrating the I/O saving that can be had
with zcache:

https://lkml.org/lkml/2012/3/22/383

Dan's presentation at LSF/MM this year on zcache:

http://oss.oracle.com/projects/tmem/dist/documentation/presentations/LSFMM12-zcache-final.pdf

There was a recent thread about cleancache memory corruption that is
resolved by this patch that should be making it into linux-next via
Greg very soon and is included in this patch:

https://lkml.org/lkml/2012/8/29/253

Based on next-20120904

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
---
 drivers/Kconfig                 |    2 +
 drivers/Makefile                |    1 +
 drivers/mm/Kconfig              |   13 +
 drivers/mm/Makefile             |    1 +
 drivers/mm/zcache/Makefile      |    3 +
 drivers/mm/zcache/tmem.c        |  773 +++++++++++++++
 drivers/mm/zcache/tmem.h        |  206 ++++
 drivers/mm/zcache/zcache-main.c | 2077 +++++++++++++++++++++++++++++++++++++++
 include/linux/zsmalloc.h        |   43 +
 mm/Kconfig                      |   18 +
 mm/Makefile                     |    1 +
 mm/zsmalloc.c                   | 1063 ++++++++++++++++++++
 12 files changed, 4201 insertions(+)
 create mode 100644 drivers/mm/Kconfig
 create mode 100644 drivers/mm/Makefile
 create mode 100644 drivers/mm/zcache/Makefile
 create mode 100644 drivers/mm/zcache/tmem.c
 create mode 100644 drivers/mm/zcache/tmem.h
 create mode 100644 drivers/mm/zcache/zcache-main.c
 create mode 100644 include/linux/zsmalloc.h
 create mode 100644 mm/zsmalloc.c

diff --git a/drivers/Kconfig b/drivers/Kconfig
index 324e958..d126132 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -154,4 +154,6 @@  source "drivers/vme/Kconfig"
 
 source "drivers/pwm/Kconfig"
 
+source "drivers/mm/Kconfig"
+
 endmenu
diff --git a/drivers/Makefile b/drivers/Makefile
index d64a0f7..aa69e1c 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -140,3 +140,4 @@  obj-$(CONFIG_EXTCON)		+= extcon/
 obj-$(CONFIG_MEMORY)		+= memory/
 obj-$(CONFIG_IIO)		+= iio/
 obj-$(CONFIG_VME_BUS)		+= vme/
+obj-$(CONFIG_MM_DRIVERS)	+= mm/
diff --git a/drivers/mm/Kconfig b/drivers/mm/Kconfig
new file mode 100644
index 0000000..22289c6
--- /dev/null
+++ b/drivers/mm/Kconfig
@@ -0,0 +1,13 @@ 
+menu "Memory management drivers"
+
+config ZCACHE
+	bool "Dynamic compression of swap pages and clean pagecache pages"
+	depends on (CLEANCACHE || FRONTSWAP) && CRYPTO=y && ZSMALLOC=y
+	select CRYPTO_LZO
+	default n
+	help
+	  Zcache uses compression and an in-kernel implementation of
+	  transcendent memory to store clean page cache pages and swap
+	  in RAM, providing a noticeable reduction in disk I/O.
+
+endmenu
diff --git a/drivers/mm/Makefile b/drivers/mm/Makefile
new file mode 100644
index 0000000..f36f509
--- /dev/null
+++ b/drivers/mm/Makefile
@@ -0,0 +1 @@ 
+obj-$(CONFIG_ZCACHE)	+= zcache/
diff --git a/drivers/mm/zcache/Makefile b/drivers/mm/zcache/Makefile
new file mode 100644
index 0000000..60daa27
--- /dev/null
+++ b/drivers/mm/zcache/Makefile
@@ -0,0 +1,3 @@ 
+zcache-y	:=	zcache-main.o tmem.o
+
+obj-$(CONFIG_ZCACHE)	+=	zcache.o
diff --git a/drivers/mm/zcache/tmem.c b/drivers/mm/zcache/tmem.c
new file mode 100644
index 0000000..eaa9021
--- /dev/null
+++ b/drivers/mm/zcache/tmem.c
@@ -0,0 +1,773 @@ 
+/*
+ * In-kernel transcendent memory (generic implementation)
+ *
+ * Copyright (c) 2009-2011, Dan Magenheimer, Oracle Corp.
+ *
+ * The primary purpose of Transcedent Memory ("tmem") is to map object-oriented
+ * "handles" (triples containing a pool id, and object id, and an index), to
+ * pages in a page-accessible memory (PAM).  Tmem references the PAM pages via
+ * an abstract "pampd" (PAM page-descriptor), which can be operated on by a
+ * set of functions (pamops).  Each pampd contains some representation of
+ * PAGE_SIZE bytes worth of data. Tmem must support potentially millions of
+ * pages and must be able to insert, find, and delete these pages at a
+ * potential frequency of thousands per second concurrently across many CPUs,
+ * (and, if used with KVM, across many vcpus across many guests).
+ * Tmem is tracked with a hierarchy of data structures, organized by
+ * the elements in a handle-tuple: pool_id, object_id, and page index.
+ * One or more "clients" (e.g. guests) each provide one or more tmem_pools.
+ * Each pool, contains a hash table of rb_trees of tmem_objs.  Each
+ * tmem_obj contains a radix-tree-like tree of pointers, with intermediate
+ * nodes called tmem_objnodes.  Each leaf pointer in this tree points to
+ * a pampd, which is accessible only through a small set of callbacks
+ * registered by the PAM implementation (see tmem_register_pamops). Tmem
+ * does all memory allocation via a set of callbacks registered by the tmem
+ * host implementation (e.g. see tmem_register_hostops).
+ */
+
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/atomic.h>
+
+#include "tmem.h"
+
+/* data structure sentinels used for debugging... see tmem.h */
+#define POOL_SENTINEL 0x87658765
+#define OBJ_SENTINEL 0x12345678
+#define OBJNODE_SENTINEL 0xfedcba09
+
+/*
+ * A tmem host implementation must use this function to register callbacks
+ * for memory allocation.
+ */
+static struct tmem_hostops tmem_hostops;
+
+static void tmem_objnode_tree_init(void);
+
+void tmem_register_hostops(struct tmem_hostops *m)
+{
+	tmem_objnode_tree_init();
+	tmem_hostops = *m;
+}
+
+/*
+ * A tmem host implementation must use this function to register
+ * callbacks for a page-accessible memory (PAM) implementation
+ */
+static struct tmem_pamops tmem_pamops;
+
+void tmem_register_pamops(struct tmem_pamops *m)
+{
+	tmem_pamops = *m;
+}
+
+/*
+ * Oid's are potentially very sparse and tmem_objs may have an indeterminately
+ * short life, being added and deleted at a relatively high frequency.
+ * So an rb_tree is an ideal data structure to manage tmem_objs.  But because
+ * of the potentially huge number of tmem_objs, each pool manages a hashtable
+ * of rb_trees to reduce search, insert, delete, and rebalancing time.
+ * Each hashbucket also has a lock to manage concurrent access.
+ *
+ * The following routines manage tmem_objs.  When any tmem_obj is accessed,
+ * the hashbucket lock must be held.
+ */
+
+static struct tmem_obj
+*__tmem_obj_find(struct tmem_hashbucket *hb, struct tmem_oid *oidp,
+		 struct rb_node **parent, struct rb_node ***link)
+{
+	struct rb_node *_parent = NULL, **rbnode;
+	struct tmem_obj *obj = NULL;
+
+	rbnode = &hb->obj_rb_root.rb_node;
+	while (*rbnode) {
+		BUG_ON(RB_EMPTY_NODE(*rbnode));
+		_parent = *rbnode;
+		obj = rb_entry(*rbnode, struct tmem_obj,
+			       rb_tree_node);
+		switch (tmem_oid_compare(oidp, &obj->oid)) {
+		case 0: /* equal */
+			goto out;
+		case -1:
+			rbnode = &(*rbnode)->rb_left;
+			break;
+		case 1:
+			rbnode = &(*rbnode)->rb_right;
+			break;
+		}
+	}
+
+	if (parent)
+		*parent = _parent;
+	if (link)
+		*link = rbnode;
+
+	obj = NULL;
+out:
+	return obj;
+}
+
+
+/* searches for object==oid in pool, returns locked object if found */
+static struct tmem_obj *tmem_obj_find(struct tmem_hashbucket *hb,
+					struct tmem_oid *oidp)
+{
+	return __tmem_obj_find(hb, oidp, NULL, NULL);
+}
+
+static void tmem_pampd_destroy_all_in_obj(struct tmem_obj *);
+
+/* free an object that has no more pampds in it */
+static void tmem_obj_free(struct tmem_obj *obj, struct tmem_hashbucket *hb)
+{
+	struct tmem_pool *pool;
+
+	BUG_ON(obj == NULL);
+	ASSERT_SENTINEL(obj, OBJ);
+	BUG_ON(obj->pampd_count > 0);
+	pool = obj->pool;
+	BUG_ON(pool == NULL);
+	if (obj->objnode_tree_root != NULL) /* may be "stump" with no leaves */
+		tmem_pampd_destroy_all_in_obj(obj);
+	BUG_ON(obj->objnode_tree_root != NULL);
+	BUG_ON((long)obj->objnode_count != 0);
+	atomic_dec(&pool->obj_count);
+	BUG_ON(atomic_read(&pool->obj_count) < 0);
+	INVERT_SENTINEL(obj, OBJ);
+	obj->pool = NULL;
+	tmem_oid_set_invalid(&obj->oid);
+	rb_erase(&obj->rb_tree_node, &hb->obj_rb_root);
+}
+
+/*
+ * initialize, and insert an tmem_object_root (called only if find failed)
+ */
+static void tmem_obj_init(struct tmem_obj *obj, struct tmem_hashbucket *hb,
+					struct tmem_pool *pool,
+					struct tmem_oid *oidp)
+{
+	struct rb_root *root = &hb->obj_rb_root;
+	struct rb_node **new = NULL, *parent = NULL;
+
+	BUG_ON(pool == NULL);
+	atomic_inc(&pool->obj_count);
+	obj->objnode_tree_height = 0;
+	obj->objnode_tree_root = NULL;
+	obj->pool = pool;
+	obj->oid = *oidp;
+	obj->objnode_count = 0;
+	obj->pampd_count = 0;
+	(*tmem_pamops.new_obj)(obj);
+	SET_SENTINEL(obj, OBJ);
+
+	if (__tmem_obj_find(hb, oidp, &parent, &new))
+		BUG();
+
+	rb_link_node(&obj->rb_tree_node, parent, new);
+	rb_insert_color(&obj->rb_tree_node, root);
+}
+
+/*
+ * Tmem is managed as a set of tmem_pools with certain attributes, such as
+ * "ephemeral" vs "persistent".  These attributes apply to all tmem_objs
+ * and all pampds that belong to a tmem_pool.  A tmem_pool is created
+ * or deleted relatively rarely (for example, when a filesystem is
+ * mounted or unmounted.
+ */
+
+/* flush all data from a pool and, optionally, free it */
+static void tmem_pool_flush(struct tmem_pool *pool, bool destroy)
+{
+	struct rb_node *rbnode;
+	struct tmem_obj *obj;
+	struct tmem_hashbucket *hb = &pool->hashbucket[0];
+	int i;
+
+	BUG_ON(pool == NULL);
+	for (i = 0; i < TMEM_HASH_BUCKETS; i++, hb++) {
+		spin_lock(&hb->lock);
+		rbnode = rb_first(&hb->obj_rb_root);
+		while (rbnode != NULL) {
+			obj = rb_entry(rbnode, struct tmem_obj, rb_tree_node);
+			rbnode = rb_next(rbnode);
+			tmem_pampd_destroy_all_in_obj(obj);
+			tmem_obj_free(obj, hb);
+			(*tmem_hostops.obj_free)(obj, pool);
+		}
+		spin_unlock(&hb->lock);
+	}
+	if (destroy)
+		list_del(&pool->pool_list);
+}
+
+/*
+ * A tmem_obj contains a radix-tree-like tree in which the intermediate
+ * nodes are called tmem_objnodes.  (The kernel lib/radix-tree.c implementation
+ * is very specialized and tuned for specific uses and is not particularly
+ * suited for use from this code, though some code from the core algorithms has
+ * been reused, thus the copyright notices below).  Each tmem_objnode contains
+ * a set of pointers which point to either a set of intermediate tmem_objnodes
+ * or a set of of pampds.
+ *
+ * Portions Copyright (C) 2001 Momchil Velikov
+ * Portions Copyright (C) 2001 Christoph Hellwig
+ * Portions Copyright (C) 2005 SGI, Christoph Lameter <clameter@sgi.com>
+ */
+
+struct tmem_objnode_tree_path {
+	struct tmem_objnode *objnode;
+	int offset;
+};
+
+/* objnode height_to_maxindex translation */
+static unsigned long tmem_objnode_tree_h2max[OBJNODE_TREE_MAX_PATH + 1];
+
+static void tmem_objnode_tree_init(void)
+{
+	unsigned int ht, tmp;
+
+	for (ht = 0; ht < ARRAY_SIZE(tmem_objnode_tree_h2max); ht++) {
+		tmp = ht * OBJNODE_TREE_MAP_SHIFT;
+		if (tmp >= OBJNODE_TREE_INDEX_BITS)
+			tmem_objnode_tree_h2max[ht] = ~0UL;
+		else
+			tmem_objnode_tree_h2max[ht] =
+			    (~0UL >> (OBJNODE_TREE_INDEX_BITS - tmp - 1)) >> 1;
+	}
+}
+
+static struct tmem_objnode *tmem_objnode_alloc(struct tmem_obj *obj)
+{
+	struct tmem_objnode *objnode;
+
+	ASSERT_SENTINEL(obj, OBJ);
+	BUG_ON(obj->pool == NULL);
+	ASSERT_SENTINEL(obj->pool, POOL);
+	objnode = (*tmem_hostops.objnode_alloc)(obj->pool);
+	if (unlikely(objnode == NULL))
+		goto out;
+	objnode->obj = obj;
+	SET_SENTINEL(objnode, OBJNODE);
+	memset(&objnode->slots, 0, sizeof(objnode->slots));
+	objnode->slots_in_use = 0;
+	obj->objnode_count++;
+out:
+	return objnode;
+}
+
+static void tmem_objnode_free(struct tmem_objnode *objnode)
+{
+	struct tmem_pool *pool;
+	int i;
+
+	BUG_ON(objnode == NULL);
+	for (i = 0; i < OBJNODE_TREE_MAP_SIZE; i++)
+		BUG_ON(objnode->slots[i] != NULL);
+	ASSERT_SENTINEL(objnode, OBJNODE);
+	INVERT_SENTINEL(objnode, OBJNODE);
+	BUG_ON(objnode->obj == NULL);
+	ASSERT_SENTINEL(objnode->obj, OBJ);
+	pool = objnode->obj->pool;
+	BUG_ON(pool == NULL);
+	ASSERT_SENTINEL(pool, POOL);
+	objnode->obj->objnode_count--;
+	objnode->obj = NULL;
+	(*tmem_hostops.objnode_free)(objnode, pool);
+}
+
+/*
+ * lookup index in object and return associated pampd (or NULL if not found)
+ */
+static void **__tmem_pampd_lookup_in_obj(struct tmem_obj *obj, uint32_t index)
+{
+	unsigned int height, shift;
+	struct tmem_objnode **slot = NULL;
+
+	BUG_ON(obj == NULL);
+	ASSERT_SENTINEL(obj, OBJ);
+	BUG_ON(obj->pool == NULL);
+	ASSERT_SENTINEL(obj->pool, POOL);
+
+	height = obj->objnode_tree_height;
+	if (index > tmem_objnode_tree_h2max[obj->objnode_tree_height])
+		goto out;
+	if (height == 0 && obj->objnode_tree_root) {
+		slot = &obj->objnode_tree_root;
+		goto out;
+	}
+	shift = (height-1) * OBJNODE_TREE_MAP_SHIFT;
+	slot = &obj->objnode_tree_root;
+	while (height > 0) {
+		if (*slot == NULL)
+			goto out;
+		slot = (struct tmem_objnode **)
+			((*slot)->slots +
+			 ((index >> shift) & OBJNODE_TREE_MAP_MASK));
+		shift -= OBJNODE_TREE_MAP_SHIFT;
+		height--;
+	}
+out:
+	return slot != NULL ? (void **)slot : NULL;
+}
+
+static void *tmem_pampd_lookup_in_obj(struct tmem_obj *obj, uint32_t index)
+{
+	struct tmem_objnode **slot;
+
+	slot = (struct tmem_objnode **)__tmem_pampd_lookup_in_obj(obj, index);
+	return slot != NULL ? *slot : NULL;
+}
+
+static void *tmem_pampd_replace_in_obj(struct tmem_obj *obj, uint32_t index,
+					void *new_pampd)
+{
+	struct tmem_objnode **slot;
+	void *ret = NULL;
+
+	slot = (struct tmem_objnode **)__tmem_pampd_lookup_in_obj(obj, index);
+	if ((slot != NULL) && (*slot != NULL)) {
+		void *old_pampd = *(void **)slot;
+		*(void **)slot = new_pampd;
+		(*tmem_pamops.free)(old_pampd, obj->pool, NULL, 0);
+		ret = new_pampd;
+	}
+	return ret;
+}
+
+static int tmem_pampd_add_to_obj(struct tmem_obj *obj, uint32_t index,
+					void *pampd)
+{
+	int ret = 0;
+	struct tmem_objnode *objnode = NULL, *newnode, *slot;
+	unsigned int height, shift;
+	int offset = 0;
+
+	/* if necessary, extend the tree to be higher  */
+	if (index > tmem_objnode_tree_h2max[obj->objnode_tree_height]) {
+		height = obj->objnode_tree_height + 1;
+		if (index > tmem_objnode_tree_h2max[height])
+			while (index > tmem_objnode_tree_h2max[height])
+				height++;
+		if (obj->objnode_tree_root == NULL) {
+			obj->objnode_tree_height = height;
+			goto insert;
+		}
+		do {
+			newnode = tmem_objnode_alloc(obj);
+			if (!newnode) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			newnode->slots[0] = obj->objnode_tree_root;
+			newnode->slots_in_use = 1;
+			obj->objnode_tree_root = newnode;
+			obj->objnode_tree_height++;
+		} while (height > obj->objnode_tree_height);
+	}
+insert:
+	slot = obj->objnode_tree_root;
+	height = obj->objnode_tree_height;
+	shift = (height-1) * OBJNODE_TREE_MAP_SHIFT;
+	while (height > 0) {
+		if (slot == NULL) {
+			/* add a child objnode.  */
+			slot = tmem_objnode_alloc(obj);
+			if (!slot) {
+				ret = -ENOMEM;
+				goto out;
+			}
+			if (objnode) {
+
+				objnode->slots[offset] = slot;
+				objnode->slots_in_use++;
+			} else
+				obj->objnode_tree_root = slot;
+		}
+		/* go down a level */
+		offset = (index >> shift) & OBJNODE_TREE_MAP_MASK;
+		objnode = slot;
+		slot = objnode->slots[offset];
+		shift -= OBJNODE_TREE_MAP_SHIFT;
+		height--;
+	}
+	BUG_ON(slot != NULL);
+	if (objnode) {
+		objnode->slots_in_use++;
+		objnode->slots[offset] = pampd;
+	} else
+		obj->objnode_tree_root = pampd;
+	obj->pampd_count++;
+out:
+	return ret;
+}
+
+static void *tmem_pampd_delete_from_obj(struct tmem_obj *obj, uint32_t index)
+{
+	struct tmem_objnode_tree_path path[OBJNODE_TREE_MAX_PATH + 1];
+	struct tmem_objnode_tree_path *pathp = path;
+	struct tmem_objnode *slot = NULL;
+	unsigned int height, shift;
+	int offset;
+
+	BUG_ON(obj == NULL);
+	ASSERT_SENTINEL(obj, OBJ);
+	BUG_ON(obj->pool == NULL);
+	ASSERT_SENTINEL(obj->pool, POOL);
+	height = obj->objnode_tree_height;
+	if (index > tmem_objnode_tree_h2max[height])
+		goto out;
+	slot = obj->objnode_tree_root;
+	if (height == 0 && obj->objnode_tree_root) {
+		obj->objnode_tree_root = NULL;
+		goto out;
+	}
+	shift = (height - 1) * OBJNODE_TREE_MAP_SHIFT;
+	pathp->objnode = NULL;
+	do {
+		if (slot == NULL)
+			goto out;
+		pathp++;
+		offset = (index >> shift) & OBJNODE_TREE_MAP_MASK;
+		pathp->offset = offset;
+		pathp->objnode = slot;
+		slot = slot->slots[offset];
+		shift -= OBJNODE_TREE_MAP_SHIFT;
+		height--;
+	} while (height > 0);
+	if (slot == NULL)
+		goto out;
+	while (pathp->objnode) {
+		pathp->objnode->slots[pathp->offset] = NULL;
+		pathp->objnode->slots_in_use--;
+		if (pathp->objnode->slots_in_use) {
+			if (pathp->objnode == obj->objnode_tree_root) {
+				while (obj->objnode_tree_height > 0 &&
+				  obj->objnode_tree_root->slots_in_use == 1 &&
+				  obj->objnode_tree_root->slots[0]) {
+					struct tmem_objnode *to_free =
+						obj->objnode_tree_root;
+
+					obj->objnode_tree_root =
+							to_free->slots[0];
+					obj->objnode_tree_height--;
+					to_free->slots[0] = NULL;
+					to_free->slots_in_use = 0;
+					tmem_objnode_free(to_free);
+				}
+			}
+			goto out;
+		}
+		tmem_objnode_free(pathp->objnode); /* 0 slots used, free it */
+		pathp--;
+	}
+	obj->objnode_tree_height = 0;
+	obj->objnode_tree_root = NULL;
+
+out:
+	if (slot != NULL)
+		obj->pampd_count--;
+	BUG_ON(obj->pampd_count < 0);
+	return slot;
+}
+
+/* recursively walk the objnode_tree destroying pampds and objnodes */
+static void tmem_objnode_node_destroy(struct tmem_obj *obj,
+					struct tmem_objnode *objnode,
+					unsigned int ht)
+{
+	int i;
+
+	if (ht == 0)
+		return;
+	for (i = 0; i < OBJNODE_TREE_MAP_SIZE; i++) {
+		if (objnode->slots[i]) {
+			if (ht == 1) {
+				obj->pampd_count--;
+				(*tmem_pamops.free)(objnode->slots[i],
+						obj->pool, NULL, 0);
+				objnode->slots[i] = NULL;
+				continue;
+			}
+			tmem_objnode_node_destroy(obj, objnode->slots[i], ht-1);
+			tmem_objnode_free(objnode->slots[i]);
+			objnode->slots[i] = NULL;
+		}
+	}
+}
+
+static void tmem_pampd_destroy_all_in_obj(struct tmem_obj *obj)
+{
+	if (obj->objnode_tree_root == NULL)
+		return;
+	if (obj->objnode_tree_height == 0) {
+		obj->pampd_count--;
+		(*tmem_pamops.free)(obj->objnode_tree_root, obj->pool, NULL, 0);
+	} else {
+		tmem_objnode_node_destroy(obj, obj->objnode_tree_root,
+					obj->objnode_tree_height);
+		tmem_objnode_free(obj->objnode_tree_root);
+		obj->objnode_tree_height = 0;
+	}
+	obj->objnode_tree_root = NULL;
+	(*tmem_pamops.free_obj)(obj->pool, obj);
+}
+
+/*
+ * Tmem is operated on by a set of well-defined actions:
+ * "put", "get", "flush", "flush_object", "new pool" and "destroy pool".
+ * (The tmem ABI allows for subpages and exchanges but these operations
+ * are not included in this implementation.)
+ *
+ * These "tmem core" operations are implemented in the following functions.
+ */
+
+/*
+ * "Put" a page, e.g. copy a page from the kernel into newly allocated
+ * PAM space (if such space is available).  Tmem_put is complicated by
+ * a corner case: What if a page with matching handle already exists in
+ * tmem?  To guarantee coherency, one of two actions is necessary: Either
+ * the data for the page must be overwritten, or the page must be
+ * "flushed" so that the data is not accessible to a subsequent "get".
+ * Since these "duplicate puts" are relatively rare, this implementation
+ * always flushes for simplicity.
+ */
+int tmem_put(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
+		char *data, size_t size, bool raw, bool ephemeral)
+{
+	struct tmem_obj *obj = NULL, *objfound = NULL, *objnew = NULL;
+	void *pampd = NULL, *pampd_del = NULL;
+	int ret = -ENOMEM;
+	struct tmem_hashbucket *hb;
+
+	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
+	spin_lock(&hb->lock);
+	obj = objfound = tmem_obj_find(hb, oidp);
+	if (obj != NULL) {
+		pampd = tmem_pampd_lookup_in_obj(objfound, index);
+		if (pampd != NULL) {
+			/* if found, is a dup put, flush the old one */
+			pampd_del = tmem_pampd_delete_from_obj(obj, index);
+			BUG_ON(pampd_del != pampd);
+			(*tmem_pamops.free)(pampd, pool, oidp, index);
+			if (obj->pampd_count == 0) {
+				objnew = obj;
+				objfound = NULL;
+			}
+			pampd = NULL;
+		}
+	} else {
+		obj = objnew = (*tmem_hostops.obj_alloc)(pool);
+		if (unlikely(obj == NULL)) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		tmem_obj_init(obj, hb, pool, oidp);
+	}
+	BUG_ON(obj == NULL);
+	BUG_ON(((objnew != obj) && (objfound != obj)) || (objnew == objfound));
+	pampd = (*tmem_pamops.create)(data, size, raw, ephemeral,
+					obj->pool, &obj->oid, index);
+	if (unlikely(pampd == NULL))
+		goto free;
+	ret = tmem_pampd_add_to_obj(obj, index, pampd);
+	if (unlikely(ret == -ENOMEM))
+		/* may have partially built objnode tree ("stump") */
+		goto delete_and_free;
+	goto out;
+
+delete_and_free:
+	(void)tmem_pampd_delete_from_obj(obj, index);
+free:
+	if (pampd)
+		(*tmem_pamops.free)(pampd, pool, NULL, 0);
+	if (objnew) {
+		tmem_obj_free(objnew, hb);
+		(*tmem_hostops.obj_free)(objnew, pool);
+	}
+out:
+	spin_unlock(&hb->lock);
+	return ret;
+}
+
+/*
+ * "Get" a page, e.g. if one can be found, copy the tmem page with the
+ * matching handle from PAM space to the kernel.  By tmem definition,
+ * when a "get" is successful on an ephemeral page, the page is "flushed",
+ * and when a "get" is successful on a persistent page, the page is retained
+ * in tmem.  Note that to preserve
+ * coherency, "get" can never be skipped if tmem contains the data.
+ * That is, if a get is done with a certain handle and fails, any
+ * subsequent "get" must also fail (unless of course there is a
+ * "put" done with the same handle).
+
+ */
+int tmem_get(struct tmem_pool *pool, struct tmem_oid *oidp, uint32_t index,
+		char *data, size_t *size, bool raw, int get_and_free)
+{
+	struct tmem_obj *obj;
+	void *pampd;
+	bool ephemeral = is_ephemeral(pool);
+	int ret = -1;
+	struct tmem_hashbucket *hb;
+	bool free = (get_and_free == 1) || ((get_and_free == 0) && ephemeral);
+	bool lock_held = false;
+
+	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
+	spin_lock(&hb->lock);
+	lock_held = true;
+	obj = tmem_obj_find(hb, oidp);
+	if (obj == NULL)
+		goto out;
+	if (free)
+		pampd = tmem_pampd_delete_from_obj(obj, index);
+	else
+		pampd = tmem_pampd_lookup_in_obj(obj, index);
+	if (pampd == NULL)
+		goto out;
+	if (free) {
+		if (obj->pampd_count == 0) {
+			tmem_obj_free(obj, hb);
+			(*tmem_hostops.obj_free)(obj, pool);
+			obj = NULL;
+		}
+	}
+	if (tmem_pamops.is_remote(pampd)) {
+		lock_held = false;
+		spin_unlock(&hb->lock);
+	}
+	if (free)
+		ret = (*tmem_pamops.get_data_and_free)(
+				data, size, raw, pampd, pool, oidp, index);
+	else
+		ret = (*tmem_pamops.get_data)(
+				data, size, raw, pampd, pool, oidp, index);
+	if (ret < 0)
+		goto out;
+	ret = 0;
+out:
+	if (lock_held)
+		spin_unlock(&hb->lock);
+	return ret;
+}
+
+/*
+ * If a page in tmem matches the handle, "flush" this page from tmem such
+ * that any subsequent "get" does not succeed (unless, of course, there
+ * was another "put" with the same handle).
+ */
+int tmem_flush_page(struct tmem_pool *pool,
+				struct tmem_oid *oidp, uint32_t index)
+{
+	struct tmem_obj *obj;
+	void *pampd;
+	int ret = -1;
+	struct tmem_hashbucket *hb;
+
+	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
+	spin_lock(&hb->lock);
+	obj = tmem_obj_find(hb, oidp);
+	if (obj == NULL)
+		goto out;
+	pampd = tmem_pampd_delete_from_obj(obj, index);
+	if (pampd == NULL)
+		goto out;
+	(*tmem_pamops.free)(pampd, pool, oidp, index);
+	if (obj->pampd_count == 0) {
+		tmem_obj_free(obj, hb);
+		(*tmem_hostops.obj_free)(obj, pool);
+	}
+	ret = 0;
+
+out:
+	spin_unlock(&hb->lock);
+	return ret;
+}
+
+/*
+ * If a page in tmem matches the handle, replace the page so that any
+ * subsequent "get" gets the new page.  Returns 0 if
+ * there was a page to replace, else returns -1.
+ */
+int tmem_replace(struct tmem_pool *pool, struct tmem_oid *oidp,
+			uint32_t index, void *new_pampd)
+{
+	struct tmem_obj *obj;
+	int ret = -1;
+	struct tmem_hashbucket *hb;
+
+	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
+	spin_lock(&hb->lock);
+	obj = tmem_obj_find(hb, oidp);
+	if (obj == NULL)
+		goto out;
+	new_pampd = tmem_pampd_replace_in_obj(obj, index, new_pampd);
+	ret = (*tmem_pamops.replace_in_obj)(new_pampd, obj);
+out:
+	spin_unlock(&hb->lock);
+	return ret;
+}
+
+/*
+ * "Flush" all pages in tmem matching this oid.
+ */
+int tmem_flush_object(struct tmem_pool *pool, struct tmem_oid *oidp)
+{
+	struct tmem_obj *obj;
+	struct tmem_hashbucket *hb;
+	int ret = -1;
+
+	hb = &pool->hashbucket[tmem_oid_hash(oidp)];
+	spin_lock(&hb->lock);
+	obj = tmem_obj_find(hb, oidp);
+	if (obj == NULL)
+		goto out;
+	tmem_pampd_destroy_all_in_obj(obj);
+	tmem_obj_free(obj, hb);
+	(*tmem_hostops.obj_free)(obj, pool);
+	ret = 0;
+
+out:
+	spin_unlock(&hb->lock);
+	return ret;
+}
+
+/*
+ * "Flush" all pages (and tmem_objs) from this tmem_pool and disable
+ * all subsequent access to this tmem_pool.
+ */
+int tmem_destroy_pool(struct tmem_pool *pool)
+{
+	int ret = -1;
+
+	if (pool == NULL)
+		goto out;
+	tmem_pool_flush(pool, 1);
+	ret = 0;
+out:
+	return ret;
+}
+
+static LIST_HEAD(tmem_global_pool_list);
+
+/*
+ * Create a new tmem_pool with the provided flag and return
+ * a pool id provided by the tmem host implementation.
+ */
+void tmem_new_pool(struct tmem_pool *pool, uint32_t flags)
+{
+	int persistent = flags & TMEM_POOL_PERSIST;
+	int shared = flags & TMEM_POOL_SHARED;
+	struct tmem_hashbucket *hb = &pool->hashbucket[0];
+	int i;
+
+	for (i = 0; i < TMEM_HASH_BUCKETS; i++, hb++) {
+		hb->obj_rb_root = RB_ROOT;
+		spin_lock_init(&hb->lock);
+	}
+	INIT_LIST_HEAD(&pool->pool_list);
+	atomic_set(&pool->obj_count, 0);
+	SET_SENTINEL(pool, POOL);
+	list_add_tail(&pool->pool_list, &tmem_global_pool_list);
+	pool->persistent = persistent;
+	pool->shared = shared;
+}
diff --git a/drivers/mm/zcache/tmem.h b/drivers/mm/zcache/tmem.h
new file mode 100644
index 0000000..0d4aa82
--- /dev/null
+++ b/drivers/mm/zcache/tmem.h
@@ -0,0 +1,206 @@ 
+/*
+ * tmem.h
+ *
+ * Transcendent memory
+ *
+ * Copyright (c) 2009-2011, Dan Magenheimer, Oracle Corp.
+ */
+
+#ifndef _TMEM_H_
+#define _TMEM_H_
+
+#include <linux/types.h>
+#include <linux/highmem.h>
+#include <linux/hash.h>
+#include <linux/atomic.h>
+
+/*
+ * These are pre-defined by the Xen<->Linux ABI
+ */
+#define TMEM_PUT_PAGE			4
+#define TMEM_GET_PAGE			5
+#define TMEM_FLUSH_PAGE			6
+#define TMEM_FLUSH_OBJECT		7
+#define TMEM_POOL_PERSIST		1
+#define TMEM_POOL_SHARED		2
+#define TMEM_POOL_PRECOMPRESSED		4
+#define TMEM_POOL_PAGESIZE_SHIFT	4
+#define TMEM_POOL_PAGESIZE_MASK		0xf
+#define TMEM_POOL_RESERVED_BITS		0x00ffff00
+
+/*
+ * sentinels have proven very useful for debugging but can be removed
+ * or disabled before final merge.
+ */
+#define SENTINELS
+#ifdef SENTINELS
+#define DECL_SENTINEL uint32_t sentinel;
+#define SET_SENTINEL(_x, _y) (_x->sentinel = _y##_SENTINEL)
+#define INVERT_SENTINEL(_x, _y) (_x->sentinel = ~_y##_SENTINEL)
+#define ASSERT_SENTINEL(_x, _y) WARN_ON(_x->sentinel != _y##_SENTINEL)
+#define ASSERT_INVERTED_SENTINEL(_x, _y) WARN_ON(_x->sentinel != ~_y##_SENTINEL)
+#else
+#define DECL_SENTINEL
+#define SET_SENTINEL(_x, _y) do { } while (0)
+#define INVERT_SENTINEL(_x, _y) do { } while (0)
+#define ASSERT_SENTINEL(_x, _y) do { } while (0)
+#define ASSERT_INVERTED_SENTINEL(_x, _y) do { } while (0)
+#endif
+
+#define ASSERT_SPINLOCK(_l)	lockdep_assert_held(_l)
+
+/*
+ * A pool is the highest-level data structure managed by tmem and
+ * usually corresponds to a large independent set of pages such as
+ * a filesystem.  Each pool has an id, and certain attributes and counters.
+ * It also contains a set of hash buckets, each of which contains an rbtree
+ * of objects and a lock to manage concurrency within the pool.
+ */
+
+#define TMEM_HASH_BUCKET_BITS	8
+#define TMEM_HASH_BUCKETS	(1<<TMEM_HASH_BUCKET_BITS)
+
+struct tmem_hashbucket {
+	struct rb_root obj_rb_root;
+	spinlock_t lock;
+};
+
+struct tmem_pool {
+	void *client; /* "up" for some clients, avoids table lookup */
+	struct list_head pool_list;
+	uint32_t pool_id;
+	bool persistent;
+	bool shared;
+	atomic_t obj_count;
+	atomic_t refcount;
+	struct tmem_hashbucket hashbucket[TMEM_HASH_BUCKETS];
+	DECL_SENTINEL
+};
+
+#define is_persistent(_p)  (_p->persistent)
+#define is_ephemeral(_p)   (!(_p->persistent))
+
+/*
+ * An object id ("oid") is large: 192-bits (to ensure, for example, files
+ * in a modern filesystem can be uniquely identified).
+ */
+
+struct tmem_oid {
+	uint64_t oid[3];
+};
+
+static inline void tmem_oid_set_invalid(struct tmem_oid *oidp)
+{
+	oidp->oid[0] = oidp->oid[1] = oidp->oid[2] = -1UL;
+}
+
+static inline bool tmem_oid_valid(struct tmem_oid *oidp)
+{
+	return oidp->oid[0] != -1UL || oidp->oid[1] != -1UL ||
+		oidp->oid[2] != -1UL;
+}
+
+static inline int tmem_oid_compare(struct tmem_oid *left,
+					struct tmem_oid *right)
+{
+	int ret;
+
+	if (left->oid[2] == right->oid[2]) {
+		if (left->oid[1] == right->oid[1]) {
+			if (left->oid[0] == right->oid[0])
+				ret = 0;
+			else if (left->oid[0] < right->oid[0])
+				ret = -1;
+			else
+				return 1;
+		} else if (left->oid[1] < right->oid[1])
+			ret = -1;
+		else
+			ret = 1;
+	} else if (left->oid[2] < right->oid[2])
+		ret = -1;
+	else
+		ret = 1;
+	return ret;
+}
+
+static inline unsigned tmem_oid_hash(struct tmem_oid *oidp)
+{
+	return hash_long(oidp->oid[0] ^ oidp->oid[1] ^ oidp->oid[2],
+				TMEM_HASH_BUCKET_BITS);
+}
+
+/*
+ * A tmem_obj contains an identifier (oid), pointers to the parent
+ * pool and the rb_tree to which it belongs, counters, and an ordered
+ * set of pampds, structured in a radix-tree-like tree.  The intermediate
+ * nodes of the tree are called tmem_objnodes.
+ */
+
+struct tmem_objnode;
+
+struct tmem_obj {
+	struct tmem_oid oid;
+	struct tmem_pool *pool;
+	struct rb_node rb_tree_node;
+	struct tmem_objnode *objnode_tree_root;
+	unsigned int objnode_tree_height;
+	unsigned long objnode_count;
+	long pampd_count;
+	void *extra; /* for private use by pampd implementation */
+	DECL_SENTINEL
+};
+
+#define OBJNODE_TREE_MAP_SHIFT 6
+#define OBJNODE_TREE_MAP_SIZE (1UL << OBJNODE_TREE_MAP_SHIFT)
+#define OBJNODE_TREE_MAP_MASK (OBJNODE_TREE_MAP_SIZE-1)
+#define OBJNODE_TREE_INDEX_BITS (8 /* CHAR_BIT */ * sizeof(unsigned long))
+#define OBJNODE_TREE_MAX_PATH \
+		(OBJNODE_TREE_INDEX_BITS/OBJNODE_TREE_MAP_SHIFT + 2)
+
+struct tmem_objnode {
+	struct tmem_obj *obj;
+	DECL_SENTINEL
+	void *slots[OBJNODE_TREE_MAP_SIZE];
+	unsigned int slots_in_use;
+};
+
+/* pampd abstract datatype methods provided by the PAM implementation */
+struct tmem_pamops {
+	void *(*create)(char *, size_t, bool, int,
+			struct tmem_pool *, struct tmem_oid *, uint32_t);
+	int (*get_data)(char *, size_t *, bool, void *, struct tmem_pool *,
+				struct tmem_oid *, uint32_t);
+	int (*get_data_and_free)(char *, size_t *, bool, void *,
+				struct tmem_pool *, struct tmem_oid *,
+				uint32_t);
+	void (*free)(void *, struct tmem_pool *, struct tmem_oid *, uint32_t);
+	void (*free_obj)(struct tmem_pool *, struct tmem_obj *);
+	bool (*is_remote)(void *);
+	void (*new_obj)(struct tmem_obj *);
+	int (*replace_in_obj)(void *, struct tmem_obj *);
+};
+extern void tmem_register_pamops(struct tmem_pamops *m);
+
+/* memory allocation methods provided by the host implementation */
+struct tmem_hostops {
+	struct tmem_obj *(*obj_alloc)(struct tmem_pool *);
+	void (*obj_free)(struct tmem_obj *, struct tmem_pool *);
+	struct tmem_objnode *(*objnode_alloc)(struct tmem_pool *);
+	void (*objnode_free)(struct tmem_objnode *, struct tmem_pool *);
+};
+extern void tmem_register_hostops(struct tmem_hostops *m);
+
+/* core tmem accessor functions */
+extern int tmem_put(struct tmem_pool *, struct tmem_oid *, uint32_t index,
+			char *, size_t, bool, bool);
+extern int tmem_get(struct tmem_pool *, struct tmem_oid *, uint32_t index,
+			char *, size_t *, bool, int);
+extern int tmem_replace(struct tmem_pool *, struct tmem_oid *, uint32_t index,
+			void *);
+extern int tmem_flush_page(struct tmem_pool *, struct tmem_oid *,
+			uint32_t index);
+extern int tmem_flush_object(struct tmem_pool *, struct tmem_oid *);
+extern int tmem_destroy_pool(struct tmem_pool *);
+extern void tmem_new_pool(struct tmem_pool *, uint32_t);
+#endif /* _TMEM_H */
diff --git a/drivers/mm/zcache/zcache-main.c b/drivers/mm/zcache/zcache-main.c
new file mode 100644
index 0000000..34b2c5c
--- /dev/null
+++ b/drivers/mm/zcache/zcache-main.c
@@ -0,0 +1,2077 @@ 
+/*
+ * zcache.c
+ *
+ * Copyright (c) 2010,2011, Dan Magenheimer, Oracle Corp.
+ * Copyright (c) 2010,2011, Nitin Gupta
+ *
+ * Zcache provides an in-kernel "host implementation" for transcendent memory
+ * and, thus indirectly, for cleancache and frontswap.  Zcache includes two
+ * page-accessible memory [1] interfaces, both utilizing the crypto compression
+ * API:
+ * 1) "compression buddies" ("zbud") is used for ephemeral pages
+ * 2) zsmalloc is used for persistent pages.
+ * Xvmalloc (based on the TLSF allocator) has very low fragmentation
+ * so maximizes space efficiency, while zbud allows pairs (and potentially,
+ * in the future, more than a pair of) compressed pages to be closely linked
+ * so that reclaiming can be done via the kernel's physical-page-oriented
+ * "shrinker" interface.
+ *
+ * [1] For a definition of page-accessible memory (aka PAM), see:
+ *   http://marc.info/?l=linux-mm&m=127811271605009
+ */
+
+#include <linux/module.h>
+#include <linux/cpu.h>
+#include <linux/highmem.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+#include <linux/atomic.h>
+#include <linux/math64.h>
+#include <linux/crypto.h>
+#include <linux/string.h>
+#include <linux/idr.h>
+#include <linux/zsmalloc.h>
+
+#include "tmem.h"
+
+#ifdef CONFIG_CLEANCACHE
+#include <linux/cleancache.h>
+#endif
+#ifdef CONFIG_FRONTSWAP
+#include <linux/frontswap.h>
+#endif
+
+#if 0
+/* this is more aggressive but may cause other problems? */
+#define ZCACHE_GFP_MASK	(GFP_ATOMIC | __GFP_NORETRY | __GFP_NOWARN)
+#else
+#define ZCACHE_GFP_MASK \
+	(__GFP_FS | __GFP_NORETRY | __GFP_NOWARN | __GFP_NOMEMALLOC)
+#endif
+
+#define MAX_CLIENTS 16
+#define LOCAL_CLIENT ((uint16_t)-1)
+
+MODULE_LICENSE("GPL");
+
+struct zcache_client {
+	struct idr tmem_pools;
+	struct zs_pool *zspool;
+	bool allocated;
+	atomic_t refcount;
+};
+
+static struct zcache_client zcache_host;
+static struct zcache_client zcache_clients[MAX_CLIENTS];
+
+static inline uint16_t get_client_id_from_client(struct zcache_client *cli)
+{
+	BUG_ON(cli == NULL);
+	if (cli == &zcache_host)
+		return LOCAL_CLIENT;
+	return cli - &zcache_clients[0];
+}
+
+static struct zcache_client *get_zcache_client(uint16_t cli_id)
+{
+	if (cli_id == LOCAL_CLIENT)
+		return &zcache_host;
+
+	if ((unsigned int)cli_id < MAX_CLIENTS)
+		return &zcache_clients[cli_id];
+
+	return NULL;
+}
+
+static inline bool is_local_client(struct zcache_client *cli)
+{
+	return cli == &zcache_host;
+}
+
+/* crypto API for zcache  */
+#define ZCACHE_COMP_NAME_SZ CRYPTO_MAX_ALG_NAME
+static char zcache_comp_name[ZCACHE_COMP_NAME_SZ];
+static struct crypto_comp * __percpu *zcache_comp_pcpu_tfms;
+
+enum comp_op {
+	ZCACHE_COMPOP_COMPRESS,
+	ZCACHE_COMPOP_DECOMPRESS
+};
+
+static inline int zcache_comp_op(enum comp_op op,
+				const u8 *src, unsigned int slen,
+				u8 *dst, unsigned int *dlen)
+{
+	struct crypto_comp *tfm;
+	int ret;
+
+	BUG_ON(!zcache_comp_pcpu_tfms);
+	tfm = *per_cpu_ptr(zcache_comp_pcpu_tfms, get_cpu());
+	BUG_ON(!tfm);
+	switch (op) {
+	case ZCACHE_COMPOP_COMPRESS:
+		ret = crypto_comp_compress(tfm, src, slen, dst, dlen);
+		break;
+	case ZCACHE_COMPOP_DECOMPRESS:
+		ret = crypto_comp_decompress(tfm, src, slen, dst, dlen);
+		break;
+	default:
+		ret = -EINVAL;
+	}
+	put_cpu();
+	return ret;
+}
+
+/**********
+ * Compression buddies ("zbud") provides for packing two (or, possibly
+ * in the future, more) compressed ephemeral pages into a single "raw"
+ * (physical) page and tracking them with data structures so that
+ * the raw pages can be easily reclaimed.
+ *
+ * A zbud page ("zbpg") is an aligned page containing a list_head,
+ * a lock, and two "zbud headers".  The remainder of the physical
+ * page is divided up into aligned 64-byte "chunks" which contain
+ * the compressed data for zero, one, or two zbuds.  Each zbpg
+ * resides on: (1) an "unused list" if it has no zbuds; (2) a
+ * "buddied" list if it is fully populated  with two zbuds; or
+ * (3) one of PAGE_SIZE/64 "unbuddied" lists indexed by how many chunks
+ * the one unbuddied zbud uses.  The data inside a zbpg cannot be
+ * read or written unless the zbpg's lock is held.
+ */
+
+#define ZBH_SENTINEL  0x43214321
+#define ZBPG_SENTINEL  0xdeadbeef
+
+#define ZBUD_MAX_BUDS 2
+
+struct zbud_hdr {
+	uint16_t client_id;
+	uint16_t pool_id;
+	struct tmem_oid oid;
+	uint32_t index;
+	uint16_t size; /* compressed size in bytes, zero means unused */
+	DECL_SENTINEL
+};
+
+struct zbud_page {
+	struct list_head bud_list;
+	spinlock_t lock;
+	struct zbud_hdr buddy[ZBUD_MAX_BUDS];
+	DECL_SENTINEL
+	/* followed by NUM_CHUNK aligned CHUNK_SIZE-byte chunks */
+};
+
+#define CHUNK_SHIFT	6
+#define CHUNK_SIZE	(1 << CHUNK_SHIFT)
+#define CHUNK_MASK	(~(CHUNK_SIZE-1))
+#define NCHUNKS		(((PAGE_SIZE - sizeof(struct zbud_page)) & \
+				CHUNK_MASK) >> CHUNK_SHIFT)
+#define MAX_CHUNK	(NCHUNKS-1)
+
+static struct {
+	struct list_head list;
+	unsigned count;
+} zbud_unbuddied[NCHUNKS];
+/* list N contains pages with N chunks USED and NCHUNKS-N unused */
+/* element 0 is never used but optimizing that isn't worth it */
+static unsigned long zbud_cumul_chunk_counts[NCHUNKS];
+
+struct list_head zbud_buddied_list;
+static unsigned long zcache_zbud_buddied_count;
+
+/* protects the buddied list and all unbuddied lists */
+static DEFINE_SPINLOCK(zbud_budlists_spinlock);
+
+static LIST_HEAD(zbpg_unused_list);
+static unsigned long zcache_zbpg_unused_list_count;
+
+/* protects the unused page list */
+static DEFINE_SPINLOCK(zbpg_unused_list_spinlock);
+
+static atomic_t zcache_zbud_curr_raw_pages;
+static atomic_t zcache_zbud_curr_zpages;
+static unsigned long zcache_zbud_curr_zbytes;
+static unsigned long zcache_zbud_cumul_zpages;
+static unsigned long zcache_zbud_cumul_zbytes;
+static unsigned long zcache_compress_poor;
+static unsigned long zcache_mean_compress_poor;
+
+/* forward references */
+static void *zcache_get_free_page(void);
+static void zcache_free_page(void *p);
+
+/*
+ * zbud helper functions
+ */
+
+static inline unsigned zbud_max_buddy_size(void)
+{
+	return MAX_CHUNK << CHUNK_SHIFT;
+}
+
+static inline unsigned zbud_size_to_chunks(unsigned size)
+{
+	BUG_ON(size == 0 || size > zbud_max_buddy_size());
+	return (size + CHUNK_SIZE - 1) >> CHUNK_SHIFT;
+}
+
+static inline int zbud_budnum(struct zbud_hdr *zh)
+{
+	unsigned offset = (unsigned long)zh & (PAGE_SIZE - 1);
+	struct zbud_page *zbpg = NULL;
+	unsigned budnum = -1U;
+	int i;
+
+	for (i = 0; i < ZBUD_MAX_BUDS; i++)
+		if (offset == offsetof(typeof(*zbpg), buddy[i])) {
+			budnum = i;
+			break;
+		}
+	BUG_ON(budnum == -1U);
+	return budnum;
+}
+
+static char *zbud_data(struct zbud_hdr *zh, unsigned size)
+{
+	struct zbud_page *zbpg;
+	char *p;
+	unsigned budnum;
+
+	ASSERT_SENTINEL(zh, ZBH);
+	budnum = zbud_budnum(zh);
+	BUG_ON(size == 0 || size > zbud_max_buddy_size());
+	zbpg = container_of(zh, struct zbud_page, buddy[budnum]);
+	ASSERT_SPINLOCK(&zbpg->lock);
+	p = (char *)zbpg;
+	if (budnum == 0)
+		p += ((sizeof(struct zbud_page) + CHUNK_SIZE - 1) &
+							CHUNK_MASK);
+	else if (budnum == 1)
+		p += PAGE_SIZE - ((size + CHUNK_SIZE - 1) & CHUNK_MASK);
+	return p;
+}
+
+/*
+ * zbud raw page management
+ */
+
+static struct zbud_page *zbud_alloc_raw_page(void)
+{
+	struct zbud_page *zbpg = NULL;
+	struct zbud_hdr *zh0, *zh1;
+	bool recycled = 0;
+
+	/* if any pages on the zbpg list, use one */
+	spin_lock(&zbpg_unused_list_spinlock);
+	if (!list_empty(&zbpg_unused_list)) {
+		zbpg = list_first_entry(&zbpg_unused_list,
+				struct zbud_page, bud_list);
+		list_del_init(&zbpg->bud_list);
+		zcache_zbpg_unused_list_count--;
+		recycled = 1;
+	}
+	spin_unlock(&zbpg_unused_list_spinlock);
+	if (zbpg == NULL)
+		/* none on zbpg list, try to get a kernel page */
+		zbpg = zcache_get_free_page();
+	if (likely(zbpg != NULL)) {
+		INIT_LIST_HEAD(&zbpg->bud_list);
+		zh0 = &zbpg->buddy[0]; zh1 = &zbpg->buddy[1];
+		spin_lock_init(&zbpg->lock);
+		if (recycled) {
+			ASSERT_INVERTED_SENTINEL(zbpg, ZBPG);
+			SET_SENTINEL(zbpg, ZBPG);
+			BUG_ON(zh0->size != 0 || tmem_oid_valid(&zh0->oid));
+			BUG_ON(zh1->size != 0 || tmem_oid_valid(&zh1->oid));
+		} else {
+			atomic_inc(&zcache_zbud_curr_raw_pages);
+			INIT_LIST_HEAD(&zbpg->bud_list);
+			SET_SENTINEL(zbpg, ZBPG);
+			zh0->size = 0; zh1->size = 0;
+			tmem_oid_set_invalid(&zh0->oid);
+			tmem_oid_set_invalid(&zh1->oid);
+		}
+	}
+	return zbpg;
+}
+
+static void zbud_free_raw_page(struct zbud_page *zbpg)
+{
+	struct zbud_hdr *zh0 = &zbpg->buddy[0], *zh1 = &zbpg->buddy[1];
+
+	ASSERT_SENTINEL(zbpg, ZBPG);
+	BUG_ON(!list_empty(&zbpg->bud_list));
+	ASSERT_SPINLOCK(&zbpg->lock);
+	BUG_ON(zh0->size != 0 || tmem_oid_valid(&zh0->oid));
+	BUG_ON(zh1->size != 0 || tmem_oid_valid(&zh1->oid));
+	INVERT_SENTINEL(zbpg, ZBPG);
+	spin_unlock(&zbpg->lock);
+	spin_lock(&zbpg_unused_list_spinlock);
+	list_add(&zbpg->bud_list, &zbpg_unused_list);
+	zcache_zbpg_unused_list_count++;
+	spin_unlock(&zbpg_unused_list_spinlock);
+}
+
+/*
+ * core zbud handling routines
+ */
+
+static unsigned zbud_free(struct zbud_hdr *zh)
+{
+	unsigned size;
+
+	ASSERT_SENTINEL(zh, ZBH);
+	BUG_ON(!tmem_oid_valid(&zh->oid));
+	size = zh->size;
+	BUG_ON(zh->size == 0 || zh->size > zbud_max_buddy_size());
+	zh->size = 0;
+	tmem_oid_set_invalid(&zh->oid);
+	INVERT_SENTINEL(zh, ZBH);
+	zcache_zbud_curr_zbytes -= size;
+	atomic_dec(&zcache_zbud_curr_zpages);
+	return size;
+}
+
+static void zbud_free_and_delist(struct zbud_hdr *zh)
+{
+	unsigned chunks;
+	struct zbud_hdr *zh_other;
+	unsigned budnum = zbud_budnum(zh), size;
+	struct zbud_page *zbpg =
+		container_of(zh, struct zbud_page, buddy[budnum]);
+
+	spin_lock(&zbud_budlists_spinlock);
+	spin_lock(&zbpg->lock);
+	if (list_empty(&zbpg->bud_list)) {
+		/* ignore zombie page... see zbud_evict_pages() */
+		spin_unlock(&zbpg->lock);
+		spin_unlock(&zbud_budlists_spinlock);
+		return;
+	}
+	size = zbud_free(zh);
+	ASSERT_SPINLOCK(&zbpg->lock);
+	zh_other = &zbpg->buddy[(budnum == 0) ? 1 : 0];
+	if (zh_other->size == 0) { /* was unbuddied: unlist and free */
+		chunks = zbud_size_to_chunks(size) ;
+		BUG_ON(list_empty(&zbud_unbuddied[chunks].list));
+		list_del_init(&zbpg->bud_list);
+		zbud_unbuddied[chunks].count--;
+		spin_unlock(&zbud_budlists_spinlock);
+		zbud_free_raw_page(zbpg);
+	} else { /* was buddied: move remaining buddy to unbuddied list */
+		chunks = zbud_size_to_chunks(zh_other->size) ;
+		list_del_init(&zbpg->bud_list);
+		zcache_zbud_buddied_count--;
+		list_add_tail(&zbpg->bud_list, &zbud_unbuddied[chunks].list);
+		zbud_unbuddied[chunks].count++;
+		spin_unlock(&zbud_budlists_spinlock);
+		spin_unlock(&zbpg->lock);
+	}
+}
+
+static struct zbud_hdr *zbud_create(uint16_t client_id, uint16_t pool_id,
+					struct tmem_oid *oid,
+					uint32_t index, struct page *page,
+					void *cdata, unsigned size)
+{
+	struct zbud_hdr *zh0, *zh1, *zh = NULL;
+	struct zbud_page *zbpg = NULL, *ztmp;
+	unsigned nchunks;
+	char *to;
+	int i, found_good_buddy = 0;
+
+	nchunks = zbud_size_to_chunks(size) ;
+	for (i = MAX_CHUNK - nchunks + 1; i > 0; i--) {
+		spin_lock(&zbud_budlists_spinlock);
+		if (!list_empty(&zbud_unbuddied[i].list)) {
+			list_for_each_entry_safe(zbpg, ztmp,
+				    &zbud_unbuddied[i].list, bud_list) {
+				if (spin_trylock(&zbpg->lock)) {
+					found_good_buddy = i;
+					goto found_unbuddied;
+				}
+			}
+		}
+		spin_unlock(&zbud_budlists_spinlock);
+	}
+	/* didn't find a good buddy, try allocating a new page */
+	zbpg = zbud_alloc_raw_page();
+	if (unlikely(zbpg == NULL))
+		goto out;
+	/* ok, have a page, now compress the data before taking locks */
+	spin_lock(&zbud_budlists_spinlock);
+	spin_lock(&zbpg->lock);
+	list_add_tail(&zbpg->bud_list, &zbud_unbuddied[nchunks].list);
+	zbud_unbuddied[nchunks].count++;
+	zh = &zbpg->buddy[0];
+	goto init_zh;
+
+found_unbuddied:
+	ASSERT_SPINLOCK(&zbpg->lock);
+	zh0 = &zbpg->buddy[0]; zh1 = &zbpg->buddy[1];
+	BUG_ON(!((zh0->size == 0) ^ (zh1->size == 0)));
+	if (zh0->size != 0) { /* buddy0 in use, buddy1 is vacant */
+		ASSERT_SENTINEL(zh0, ZBH);
+		zh = zh1;
+	} else if (zh1->size != 0) { /* buddy1 in use, buddy0 is vacant */
+		ASSERT_SENTINEL(zh1, ZBH);
+		zh = zh0;
+	} else
+		BUG();
+	list_del_init(&zbpg->bud_list);
+	zbud_unbuddied[found_good_buddy].count--;
+	list_add_tail(&zbpg->bud_list, &zbud_buddied_list);
+	zcache_zbud_buddied_count++;
+
+init_zh:
+	SET_SENTINEL(zh, ZBH);
+	zh->size = size;
+	zh->index = index;
+	zh->oid = *oid;
+	zh->pool_id = pool_id;
+	zh->client_id = client_id;
+	to = zbud_data(zh, size);
+	memcpy(to, cdata, size);
+	spin_unlock(&zbpg->lock);
+	spin_unlock(&zbud_budlists_spinlock);
+
+	zbud_cumul_chunk_counts[nchunks]++;
+	atomic_inc(&zcache_zbud_curr_zpages);
+	zcache_zbud_cumul_zpages++;
+	zcache_zbud_curr_zbytes += size;
+	zcache_zbud_cumul_zbytes += size;
+out:
+	return zh;
+}
+
+static int zbud_decompress(struct page *page, struct zbud_hdr *zh)
+{
+	struct zbud_page *zbpg;
+	unsigned budnum = zbud_budnum(zh);
+	unsigned int out_len = PAGE_SIZE;
+	char *to_va, *from_va;
+	unsigned size;
+	int ret = 0;
+
+	zbpg = container_of(zh, struct zbud_page, buddy[budnum]);
+	spin_lock(&zbpg->lock);
+	if (list_empty(&zbpg->bud_list)) {
+		/* ignore zombie page... see zbud_evict_pages() */
+		ret = -EINVAL;
+		goto out;
+	}
+	ASSERT_SENTINEL(zh, ZBH);
+	BUG_ON(zh->size == 0 || zh->size > zbud_max_buddy_size());
+	to_va = kmap_atomic(page);
+	size = zh->size;
+	from_va = zbud_data(zh, size);
+	ret = zcache_comp_op(ZCACHE_COMPOP_DECOMPRESS, from_va, size,
+				to_va, &out_len);
+	BUG_ON(ret);
+	BUG_ON(out_len != PAGE_SIZE);
+	kunmap_atomic(to_va);
+out:
+	spin_unlock(&zbpg->lock);
+	return ret;
+}
+
+/*
+ * The following routines handle shrinking of ephemeral pages by evicting
+ * pages "least valuable" first.
+ */
+
+static unsigned long zcache_evicted_raw_pages;
+static unsigned long zcache_evicted_buddied_pages;
+static unsigned long zcache_evicted_unbuddied_pages;
+
+static struct tmem_pool *zcache_get_pool_by_id(uint16_t cli_id,
+						uint16_t poolid);
+static void zcache_put_pool(struct tmem_pool *pool);
+
+/*
+ * Flush and free all zbuds in a zbpg, then free the pageframe
+ */
+static void zbud_evict_zbpg(struct zbud_page *zbpg)
+{
+	struct zbud_hdr *zh;
+	int i, j;
+	uint32_t pool_id[ZBUD_MAX_BUDS], client_id[ZBUD_MAX_BUDS];
+	uint32_t index[ZBUD_MAX_BUDS];
+	struct tmem_oid oid[ZBUD_MAX_BUDS];
+	struct tmem_pool *pool;
+
+	ASSERT_SPINLOCK(&zbpg->lock);
+	BUG_ON(!list_empty(&zbpg->bud_list));
+	for (i = 0, j = 0; i < ZBUD_MAX_BUDS; i++) {
+		zh = &zbpg->buddy[i];
+		if (zh->size) {
+			client_id[j] = zh->client_id;
+			pool_id[j] = zh->pool_id;
+			oid[j] = zh->oid;
+			index[j] = zh->index;
+			j++;
+			zbud_free(zh);
+		}
+	}
+	spin_unlock(&zbpg->lock);
+	for (i = 0; i < j; i++) {
+		pool = zcache_get_pool_by_id(client_id[i], pool_id[i]);
+		if (pool != NULL) {
+			tmem_flush_page(pool, &oid[i], index[i]);
+			zcache_put_pool(pool);
+		}
+	}
+	ASSERT_SENTINEL(zbpg, ZBPG);
+	spin_lock(&zbpg->lock);
+	zbud_free_raw_page(zbpg);
+}
+
+/*
+ * Free nr pages.  This code is funky because we want to hold the locks
+ * protecting various lists for as short a time as possible, and in some
+ * circumstances the list may change asynchronously when the list lock is
+ * not held.  In some cases we also trylock not only to avoid waiting on a
+ * page in use by another cpu, but also to avoid potential deadlock due to
+ * lock inversion.
+ */
+static void zbud_evict_pages(int nr)
+{
+	struct zbud_page *zbpg;
+	int i;
+
+	/* first try freeing any pages on unused list */
+retry_unused_list:
+	spin_lock_bh(&zbpg_unused_list_spinlock);
+	if (!list_empty(&zbpg_unused_list)) {
+		/* can't walk list here, since it may change when unlocked */
+		zbpg = list_first_entry(&zbpg_unused_list,
+				struct zbud_page, bud_list);
+		list_del_init(&zbpg->bud_list);
+		zcache_zbpg_unused_list_count--;
+		atomic_dec(&zcache_zbud_curr_raw_pages);
+		spin_unlock_bh(&zbpg_unused_list_spinlock);
+		zcache_free_page(zbpg);
+		zcache_evicted_raw_pages++;
+		if (--nr <= 0)
+			goto out;
+		goto retry_unused_list;
+	}
+	spin_unlock_bh(&zbpg_unused_list_spinlock);
+
+	/* now try freeing unbuddied pages, starting with least space avail */
+	for (i = 0; i < MAX_CHUNK; i++) {
+retry_unbud_list_i:
+		spin_lock_bh(&zbud_budlists_spinlock);
+		if (list_empty(&zbud_unbuddied[i].list)) {
+			spin_unlock_bh(&zbud_budlists_spinlock);
+			continue;
+		}
+		list_for_each_entry(zbpg, &zbud_unbuddied[i].list, bud_list) {
+			if (unlikely(!spin_trylock(&zbpg->lock)))
+				continue;
+			list_del_init(&zbpg->bud_list);
+			zbud_unbuddied[i].count--;
+			spin_unlock(&zbud_budlists_spinlock);
+			zcache_evicted_unbuddied_pages++;
+			/* want budlists unlocked when doing zbpg eviction */
+			zbud_evict_zbpg(zbpg);
+			local_bh_enable();
+			if (--nr <= 0)
+				goto out;
+			goto retry_unbud_list_i;
+		}
+		spin_unlock_bh(&zbud_budlists_spinlock);
+	}
+
+	/* as a last resort, free buddied pages */
+retry_bud_list:
+	spin_lock_bh(&zbud_budlists_spinlock);
+	if (list_empty(&zbud_buddied_list)) {
+		spin_unlock_bh(&zbud_budlists_spinlock);
+		goto out;
+	}
+	list_for_each_entry(zbpg, &zbud_buddied_list, bud_list) {
+		if (unlikely(!spin_trylock(&zbpg->lock)))
+			continue;
+		list_del_init(&zbpg->bud_list);
+		zcache_zbud_buddied_count--;
+		spin_unlock(&zbud_budlists_spinlock);
+		zcache_evicted_buddied_pages++;
+		/* want budlists unlocked when doing zbpg eviction */
+		zbud_evict_zbpg(zbpg);
+		local_bh_enable();
+		if (--nr <= 0)
+			goto out;
+		goto retry_bud_list;
+	}
+	spin_unlock_bh(&zbud_budlists_spinlock);
+out:
+	return;
+}
+
+static void __init zbud_init(void)
+{
+	int i;
+
+	INIT_LIST_HEAD(&zbud_buddied_list);
+
+	for (i = 0; i < NCHUNKS; i++)
+		INIT_LIST_HEAD(&zbud_unbuddied[i].list);
+}
+
+#ifdef CONFIG_SYSFS
+/*
+ * These sysfs routines show a nice distribution of how many zbpg's are
+ * currently (and have ever been placed) in each unbuddied list.  It's fun
+ * to watch but can probably go away before final merge.
+ */
+static int zbud_show_unbuddied_list_counts(char *buf)
+{
+	int i;
+	char *p = buf;
+
+	for (i = 0; i < NCHUNKS; i++)
+		p += sprintf(p, "%u ", zbud_unbuddied[i].count);
+	return p - buf;
+}
+
+static int zbud_show_cumul_chunk_counts(char *buf)
+{
+	unsigned long i, chunks = 0, total_chunks = 0, sum_total_chunks = 0;
+	unsigned long total_chunks_lte_21 = 0, total_chunks_lte_32 = 0;
+	unsigned long total_chunks_lte_42 = 0;
+	char *p = buf;
+
+	for (i = 0; i < NCHUNKS; i++) {
+		p += sprintf(p, "%lu ", zbud_cumul_chunk_counts[i]);
+		chunks += zbud_cumul_chunk_counts[i];
+		total_chunks += zbud_cumul_chunk_counts[i];
+		sum_total_chunks += i * zbud_cumul_chunk_counts[i];
+		if (i == 21)
+			total_chunks_lte_21 = total_chunks;
+		if (i == 32)
+			total_chunks_lte_32 = total_chunks;
+		if (i == 42)
+			total_chunks_lte_42 = total_chunks;
+	}
+	p += sprintf(p, "<=21:%lu <=32:%lu <=42:%lu, mean:%lu\n",
+		total_chunks_lte_21, total_chunks_lte_32, total_chunks_lte_42,
+		chunks == 0 ? 0 : sum_total_chunks / chunks);
+	return p - buf;
+}
+#endif
+
+/**********
+ * This "zv" PAM implementation combines the slab-based zsmalloc
+ * with the crypto compression API to maximize the amount of data that can
+ * be packed into a physical page.
+ *
+ * Zv represents a PAM page with the index and object (plus a "size" value
+ * necessary for decompression) immediately preceding the compressed data.
+ */
+
+#define ZVH_SENTINEL  0x43214321
+
+struct zv_hdr {
+	uint32_t pool_id;
+	struct tmem_oid oid;
+	uint32_t index;
+	size_t size;
+	DECL_SENTINEL
+};
+
+/* rudimentary policy limits */
+/* total number of persistent pages may not exceed this percentage */
+static unsigned int zv_page_count_policy_percent = 75;
+/*
+ * byte count defining poor compression; pages with greater zsize will be
+ * rejected
+ */
+static unsigned int zv_max_zsize = (PAGE_SIZE / 8) * 7;
+/*
+ * byte count defining poor *mean* compression; pages with greater zsize
+ * will be rejected until sufficient better-compressed pages are accepted
+ * driving the mean below this threshold
+ */
+static unsigned int zv_max_mean_zsize = (PAGE_SIZE / 8) * 5;
+
+static atomic_t zv_curr_dist_counts[NCHUNKS];
+static atomic_t zv_cumul_dist_counts[NCHUNKS];
+
+static unsigned long zv_create(struct zs_pool *pool, uint32_t pool_id,
+				struct tmem_oid *oid, uint32_t index,
+				void *cdata, unsigned clen)
+{
+	struct zv_hdr *zv;
+	u32 size = clen + sizeof(struct zv_hdr);
+	int chunks = (size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
+	unsigned long handle = 0;
+
+	BUG_ON(!irqs_disabled());
+	BUG_ON(chunks >= NCHUNKS);
+	handle = zs_malloc(pool, size);
+	if (!handle)
+		goto out;
+	atomic_inc(&zv_curr_dist_counts[chunks]);
+	atomic_inc(&zv_cumul_dist_counts[chunks]);
+	zv = zs_map_object(pool, handle, ZS_MM_WO);
+	zv->index = index;
+	zv->oid = *oid;
+	zv->pool_id = pool_id;
+	zv->size = clen;
+	SET_SENTINEL(zv, ZVH);
+	memcpy((char *)zv + sizeof(struct zv_hdr), cdata, clen);
+	zs_unmap_object(pool, handle);
+out:
+	return handle;
+}
+
+static void zv_free(struct zs_pool *pool, unsigned long handle)
+{
+	unsigned long flags;
+	struct zv_hdr *zv;
+	uint16_t size;
+	int chunks;
+
+	zv = zs_map_object(pool, handle, ZS_MM_RW);
+	ASSERT_SENTINEL(zv, ZVH);
+	size = zv->size + sizeof(struct zv_hdr);
+	INVERT_SENTINEL(zv, ZVH);
+	zs_unmap_object(pool, handle);
+
+	chunks = (size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
+	BUG_ON(chunks >= NCHUNKS);
+	atomic_dec(&zv_curr_dist_counts[chunks]);
+
+	local_irq_save(flags);
+	zs_free(pool, handle);
+	local_irq_restore(flags);
+}
+
+static void zv_decompress(struct page *page, unsigned long handle)
+{
+	unsigned int clen = PAGE_SIZE;
+	char *to_va;
+	int ret;
+	struct zv_hdr *zv;
+
+	zv = zs_map_object(zcache_host.zspool, handle, ZS_MM_RO);
+	BUG_ON(zv->size == 0);
+	ASSERT_SENTINEL(zv, ZVH);
+	to_va = kmap_atomic(page);
+	ret = zcache_comp_op(ZCACHE_COMPOP_DECOMPRESS, (char *)zv + sizeof(*zv),
+				zv->size, to_va, &clen);
+	kunmap_atomic(to_va);
+	zs_unmap_object(zcache_host.zspool, handle);
+	BUG_ON(ret);
+	BUG_ON(clen != PAGE_SIZE);
+}
+
+#ifdef CONFIG_SYSFS
+/*
+ * show a distribution of compression stats for zv pages.
+ */
+
+static int zv_curr_dist_counts_show(char *buf)
+{
+	unsigned long i, n, chunks = 0, sum_total_chunks = 0;
+	char *p = buf;
+
+	for (i = 0; i < NCHUNKS; i++) {
+		n = atomic_read(&zv_curr_dist_counts[i]);
+		p += sprintf(p, "%lu ", n);
+		chunks += n;
+		sum_total_chunks += i * n;
+	}
+	p += sprintf(p, "mean:%lu\n",
+		chunks == 0 ? 0 : sum_total_chunks / chunks);
+	return p - buf;
+}
+
+static int zv_cumul_dist_counts_show(char *buf)
+{
+	unsigned long i, n, chunks = 0, sum_total_chunks = 0;
+	char *p = buf;
+
+	for (i = 0; i < NCHUNKS; i++) {
+		n = atomic_read(&zv_cumul_dist_counts[i]);
+		p += sprintf(p, "%lu ", n);
+		chunks += n;
+		sum_total_chunks += i * n;
+	}
+	p += sprintf(p, "mean:%lu\n",
+		chunks == 0 ? 0 : sum_total_chunks / chunks);
+	return p - buf;
+}
+
+/*
+ * setting zv_max_zsize via sysfs causes all persistent (e.g. swap)
+ * pages that don't compress to less than this value (including metadata
+ * overhead) to be rejected.  We don't allow the value to get too close
+ * to PAGE_SIZE.
+ */
+static ssize_t zv_max_zsize_show(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    char *buf)
+{
+	return sprintf(buf, "%u\n", zv_max_zsize);
+}
+
+static ssize_t zv_max_zsize_store(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    const char *buf, size_t count)
+{
+	unsigned long val;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	err = kstrtoul(buf, 10, &val);
+	if (err || (val == 0) || (val > (PAGE_SIZE / 8) * 7))
+		return -EINVAL;
+	zv_max_zsize = val;
+	return count;
+}
+
+/*
+ * setting zv_max_mean_zsize via sysfs causes all persistent (e.g. swap)
+ * pages that don't compress to less than this value (including metadata
+ * overhead) to be rejected UNLESS the mean compression is also smaller
+ * than this value.  In other words, we are load-balancing-by-zsize the
+ * accepted pages.  Again, we don't allow the value to get too close
+ * to PAGE_SIZE.
+ */
+static ssize_t zv_max_mean_zsize_show(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    char *buf)
+{
+	return sprintf(buf, "%u\n", zv_max_mean_zsize);
+}
+
+static ssize_t zv_max_mean_zsize_store(struct kobject *kobj,
+				    struct kobj_attribute *attr,
+				    const char *buf, size_t count)
+{
+	unsigned long val;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	err = kstrtoul(buf, 10, &val);
+	if (err || (val == 0) || (val > (PAGE_SIZE / 8) * 7))
+		return -EINVAL;
+	zv_max_mean_zsize = val;
+	return count;
+}
+
+/*
+ * setting zv_page_count_policy_percent via sysfs sets an upper bound of
+ * persistent (e.g. swap) pages that will be retained according to:
+ *     (zv_page_count_policy_percent * totalram_pages) / 100)
+ * when that limit is reached, further puts will be rejected (until
+ * some pages have been flushed).  Note that, due to compression,
+ * this number may exceed 100; it defaults to 75 and we set an
+ * arbitary limit of 150.  A poor choice will almost certainly result
+ * in OOM's, so this value should only be changed prudently.
+ */
+static ssize_t zv_page_count_policy_percent_show(struct kobject *kobj,
+						 struct kobj_attribute *attr,
+						 char *buf)
+{
+	return sprintf(buf, "%u\n", zv_page_count_policy_percent);
+}
+
+static ssize_t zv_page_count_policy_percent_store(struct kobject *kobj,
+						  struct kobj_attribute *attr,
+						  const char *buf, size_t count)
+{
+	unsigned long val;
+	int err;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	err = kstrtoul(buf, 10, &val);
+	if (err || (val == 0) || (val > 150))
+		return -EINVAL;
+	zv_page_count_policy_percent = val;
+	return count;
+}
+
+static struct kobj_attribute zcache_zv_max_zsize_attr = {
+		.attr = { .name = "zv_max_zsize", .mode = 0644 },
+		.show = zv_max_zsize_show,
+		.store = zv_max_zsize_store,
+};
+
+static struct kobj_attribute zcache_zv_max_mean_zsize_attr = {
+		.attr = { .name = "zv_max_mean_zsize", .mode = 0644 },
+		.show = zv_max_mean_zsize_show,
+		.store = zv_max_mean_zsize_store,
+};
+
+static struct kobj_attribute zcache_zv_page_count_policy_percent_attr = {
+		.attr = { .name = "zv_page_count_policy_percent",
+			  .mode = 0644 },
+		.show = zv_page_count_policy_percent_show,
+		.store = zv_page_count_policy_percent_store,
+};
+#endif
+
+/*
+ * zcache core code starts here
+ */
+
+/* useful stats not collected by cleancache or frontswap */
+static unsigned long zcache_flush_total;
+static unsigned long zcache_flush_found;
+static unsigned long zcache_flobj_total;
+static unsigned long zcache_flobj_found;
+static unsigned long zcache_failed_eph_puts;
+static unsigned long zcache_failed_pers_puts;
+
+/*
+ * Tmem operations assume the poolid implies the invoking client.
+ * Zcache only has one client (the kernel itself): LOCAL_CLIENT.
+ * RAMster has each client numbered by cluster node, and a KVM version
+ * of zcache would have one client per guest and each client might
+ * have a poolid==N.
+ */
+static struct tmem_pool *zcache_get_pool_by_id(uint16_t cli_id, uint16_t poolid)
+{
+	struct tmem_pool *pool = NULL;
+	struct zcache_client *cli = NULL;
+
+	cli = get_zcache_client(cli_id);
+	if (!cli)
+		goto out;
+
+	atomic_inc(&cli->refcount);
+	pool = idr_find(&cli->tmem_pools, poolid);
+	if (pool != NULL)
+		atomic_inc(&pool->refcount);
+out:
+	return pool;
+}
+
+static void zcache_put_pool(struct tmem_pool *pool)
+{
+	struct zcache_client *cli = NULL;
+
+	if (pool == NULL)
+		BUG();
+	cli = pool->client;
+	atomic_dec(&pool->refcount);
+	atomic_dec(&cli->refcount);
+}
+
+int zcache_new_client(uint16_t cli_id)
+{
+	struct zcache_client *cli;
+	int ret = -1;
+
+	cli = get_zcache_client(cli_id);
+
+	if (cli == NULL)
+		goto out;
+	if (cli->allocated)
+		goto out;
+	cli->allocated = 1;
+#ifdef CONFIG_FRONTSWAP
+	cli->zspool = zs_create_pool("zcache", ZCACHE_GFP_MASK);
+	if (cli->zspool == NULL)
+		goto out;
+	idr_init(&cli->tmem_pools);
+#endif
+	ret = 0;
+out:
+	return ret;
+}
+
+/* counters for debugging */
+static unsigned long zcache_failed_get_free_pages;
+static unsigned long zcache_failed_alloc;
+static unsigned long zcache_put_to_flush;
+
+/*
+ * for now, used named slabs so can easily track usage; later can
+ * either just use kmalloc, or perhaps add a slab-like allocator
+ * to more carefully manage total memory utilization
+ */
+static struct kmem_cache *zcache_objnode_cache;
+static struct kmem_cache *zcache_obj_cache;
+static atomic_t zcache_curr_obj_count = ATOMIC_INIT(0);
+static unsigned long zcache_curr_obj_count_max;
+static atomic_t zcache_curr_objnode_count = ATOMIC_INIT(0);
+static unsigned long zcache_curr_objnode_count_max;
+
+/*
+ * to avoid memory allocation recursion (e.g. due to direct reclaim), we
+ * preload all necessary data structures so the hostops callbacks never
+ * actually do a malloc
+ */
+struct zcache_preload {
+	void *page;
+	struct tmem_obj *obj;
+	int nr;
+	struct tmem_objnode *objnodes[OBJNODE_TREE_MAX_PATH];
+};
+static DEFINE_PER_CPU(struct zcache_preload, zcache_preloads) = { 0, };
+
+static int zcache_do_preload(struct tmem_pool *pool)
+{
+	struct zcache_preload *kp;
+	struct tmem_objnode *objnode;
+	struct tmem_obj *obj;
+	void *page;
+	int ret = -ENOMEM;
+
+	if (unlikely(zcache_objnode_cache == NULL))
+		goto out;
+	if (unlikely(zcache_obj_cache == NULL))
+		goto out;
+
+	/* IRQ has already been disabled. */
+	kp = &__get_cpu_var(zcache_preloads);
+	while (kp->nr < ARRAY_SIZE(kp->objnodes)) {
+		objnode = kmem_cache_alloc(zcache_objnode_cache,
+				ZCACHE_GFP_MASK);
+		if (unlikely(objnode == NULL)) {
+			zcache_failed_alloc++;
+			goto out;
+		}
+
+		kp->objnodes[kp->nr++] = objnode;
+	}
+
+	if (!kp->obj) {
+		obj = kmem_cache_alloc(zcache_obj_cache, ZCACHE_GFP_MASK);
+		if (unlikely(obj == NULL)) {
+			zcache_failed_alloc++;
+			goto out;
+		}
+		kp->obj = obj;
+	}
+
+	if (!kp->page) {
+		page = (void *)__get_free_page(ZCACHE_GFP_MASK);
+		if (unlikely(page == NULL)) {
+			zcache_failed_get_free_pages++;
+			goto out;
+		}
+		kp->page =  page;
+	}
+
+	ret = 0;
+out:
+	return ret;
+}
+
+static void *zcache_get_free_page(void)
+{
+	struct zcache_preload *kp;
+	void *page;
+
+	kp = &__get_cpu_var(zcache_preloads);
+	page = kp->page;
+	BUG_ON(page == NULL);
+	kp->page = NULL;
+	return page;
+}
+
+static void zcache_free_page(void *p)
+{
+	free_page((unsigned long)p);
+}
+
+/*
+ * zcache implementation for tmem host ops
+ */
+
+static struct tmem_objnode *zcache_objnode_alloc(struct tmem_pool *pool)
+{
+	struct tmem_objnode *objnode = NULL;
+	unsigned long count;
+	struct zcache_preload *kp;
+
+	kp = &__get_cpu_var(zcache_preloads);
+	if (kp->nr <= 0)
+		goto out;
+	objnode = kp->objnodes[kp->nr - 1];
+	BUG_ON(objnode == NULL);
+	kp->objnodes[kp->nr - 1] = NULL;
+	kp->nr--;
+	count = atomic_inc_return(&zcache_curr_objnode_count);
+	if (count > zcache_curr_objnode_count_max)
+		zcache_curr_objnode_count_max = count;
+out:
+	return objnode;
+}
+
+static void zcache_objnode_free(struct tmem_objnode *objnode,
+					struct tmem_pool *pool)
+{
+	atomic_dec(&zcache_curr_objnode_count);
+	BUG_ON(atomic_read(&zcache_curr_objnode_count) < 0);
+	kmem_cache_free(zcache_objnode_cache, objnode);
+}
+
+static struct tmem_obj *zcache_obj_alloc(struct tmem_pool *pool)
+{
+	struct tmem_obj *obj = NULL;
+	unsigned long count;
+	struct zcache_preload *kp;
+
+	kp = &__get_cpu_var(zcache_preloads);
+	obj = kp->obj;
+	BUG_ON(obj == NULL);
+	kp->obj = NULL;
+	count = atomic_inc_return(&zcache_curr_obj_count);
+	if (count > zcache_curr_obj_count_max)
+		zcache_curr_obj_count_max = count;
+	return obj;
+}
+
+static void zcache_obj_free(struct tmem_obj *obj, struct tmem_pool *pool)
+{
+	atomic_dec(&zcache_curr_obj_count);
+	BUG_ON(atomic_read(&zcache_curr_obj_count) < 0);
+	kmem_cache_free(zcache_obj_cache, obj);
+}
+
+static struct tmem_hostops zcache_hostops = {
+	.obj_alloc = zcache_obj_alloc,
+	.obj_free = zcache_obj_free,
+	.objnode_alloc = zcache_objnode_alloc,
+	.objnode_free = zcache_objnode_free,
+};
+
+/*
+ * zcache implementations for PAM page descriptor ops
+ */
+
+static atomic_t zcache_curr_eph_pampd_count = ATOMIC_INIT(0);
+static unsigned long zcache_curr_eph_pampd_count_max;
+static atomic_t zcache_curr_pers_pampd_count = ATOMIC_INIT(0);
+static unsigned long zcache_curr_pers_pampd_count_max;
+
+/* forward reference */
+static int zcache_compress(struct page *from, void **out_va, unsigned *out_len);
+
+static void *zcache_pampd_create(char *data, size_t size, bool raw, int eph,
+				struct tmem_pool *pool, struct tmem_oid *oid,
+				 uint32_t index)
+{
+	void *pampd = NULL, *cdata;
+	unsigned clen;
+	int ret;
+	unsigned long count;
+	struct page *page = (struct page *)(data);
+	struct zcache_client *cli = pool->client;
+	uint16_t client_id = get_client_id_from_client(cli);
+	unsigned long zv_mean_zsize;
+	unsigned long curr_pers_pampd_count;
+	u64 total_zsize;
+
+	if (eph) {
+		ret = zcache_compress(page, &cdata, &clen);
+		if (ret == 0)
+			goto out;
+		if (clen == 0 || clen > zbud_max_buddy_size()) {
+			zcache_compress_poor++;
+			goto out;
+		}
+		pampd = (void *)zbud_create(client_id, pool->pool_id, oid,
+						index, page, cdata, clen);
+		if (pampd != NULL) {
+			count = atomic_inc_return(&zcache_curr_eph_pampd_count);
+			if (count > zcache_curr_eph_pampd_count_max)
+				zcache_curr_eph_pampd_count_max = count;
+		}
+	} else {
+		curr_pers_pampd_count =
+			atomic_read(&zcache_curr_pers_pampd_count);
+		if (curr_pers_pampd_count >
+		    (zv_page_count_policy_percent * totalram_pages) / 100)
+			goto out;
+		ret = zcache_compress(page, &cdata, &clen);
+		if (ret == 0)
+			goto out;
+		/* reject if compression is too poor */
+		if (clen > zv_max_zsize) {
+			zcache_compress_poor++;
+			goto out;
+		}
+		/* reject if mean compression is too poor */
+		if ((clen > zv_max_mean_zsize) && (curr_pers_pampd_count > 0)) {
+			total_zsize = zs_get_total_size_bytes(cli->zspool);
+			zv_mean_zsize = div_u64(total_zsize,
+						curr_pers_pampd_count);
+			if (zv_mean_zsize > zv_max_mean_zsize) {
+				zcache_mean_compress_poor++;
+				goto out;
+			}
+		}
+		pampd = (void *)zv_create(cli->zspool, pool->pool_id,
+						oid, index, cdata, clen);
+		if (pampd == NULL)
+			goto out;
+		count = atomic_inc_return(&zcache_curr_pers_pampd_count);
+		if (count > zcache_curr_pers_pampd_count_max)
+			zcache_curr_pers_pampd_count_max = count;
+	}
+out:
+	return pampd;
+}
+
+/*
+ * fill the pageframe corresponding to the struct page with the data
+ * from the passed pampd
+ */
+static int zcache_pampd_get_data(char *data, size_t *bufsize, bool raw,
+					void *pampd, struct tmem_pool *pool,
+					struct tmem_oid *oid, uint32_t index)
+{
+	int ret = 0;
+
+	BUG_ON(is_ephemeral(pool));
+	zv_decompress((struct page *)(data), (unsigned long)pampd);
+	return ret;
+}
+
+/*
+ * fill the pageframe corresponding to the struct page with the data
+ * from the passed pampd
+ */
+static int zcache_pampd_get_data_and_free(char *data, size_t *bufsize, bool raw,
+					void *pampd, struct tmem_pool *pool,
+					struct tmem_oid *oid, uint32_t index)
+{
+	BUG_ON(!is_ephemeral(pool));
+	if (zbud_decompress((struct page *)(data), pampd) < 0)
+		return -EINVAL;
+	zbud_free_and_delist((struct zbud_hdr *)pampd);
+	atomic_dec(&zcache_curr_eph_pampd_count);
+	return 0;
+}
+
+/*
+ * free the pampd and remove it from any zcache lists
+ * pampd must no longer be pointed to from any tmem data structures!
+ */
+static void zcache_pampd_free(void *pampd, struct tmem_pool *pool,
+				struct tmem_oid *oid, uint32_t index)
+{
+	struct zcache_client *cli = pool->client;
+
+	if (is_ephemeral(pool)) {
+		zbud_free_and_delist((struct zbud_hdr *)pampd);
+		atomic_dec(&zcache_curr_eph_pampd_count);
+		BUG_ON(atomic_read(&zcache_curr_eph_pampd_count) < 0);
+	} else {
+		zv_free(cli->zspool, (unsigned long)pampd);
+		atomic_dec(&zcache_curr_pers_pampd_count);
+		BUG_ON(atomic_read(&zcache_curr_pers_pampd_count) < 0);
+	}
+}
+
+static void zcache_pampd_free_obj(struct tmem_pool *pool, struct tmem_obj *obj)
+{
+}
+
+static void zcache_pampd_new_obj(struct tmem_obj *obj)
+{
+}
+
+static int zcache_pampd_replace_in_obj(void *pampd, struct tmem_obj *obj)
+{
+	return -1;
+}
+
+static bool zcache_pampd_is_remote(void *pampd)
+{
+	return 0;
+}
+
+static struct tmem_pamops zcache_pamops = {
+	.create = zcache_pampd_create,
+	.get_data = zcache_pampd_get_data,
+	.get_data_and_free = zcache_pampd_get_data_and_free,
+	.free = zcache_pampd_free,
+	.free_obj = zcache_pampd_free_obj,
+	.new_obj = zcache_pampd_new_obj,
+	.replace_in_obj = zcache_pampd_replace_in_obj,
+	.is_remote = zcache_pampd_is_remote,
+};
+
+/*
+ * zcache compression/decompression and related per-cpu stuff
+ */
+
+static DEFINE_PER_CPU(unsigned char *, zcache_dstmem);
+#define ZCACHE_DSTMEM_ORDER 1
+
+static int zcache_compress(struct page *from, void **out_va, unsigned *out_len)
+{
+	int ret = 0;
+	unsigned char *dmem = __get_cpu_var(zcache_dstmem);
+	char *from_va;
+
+	BUG_ON(!irqs_disabled());
+	if (unlikely(dmem == NULL))
+		goto out;  /* no buffer or no compressor so can't compress */
+	*out_len = PAGE_SIZE << ZCACHE_DSTMEM_ORDER;
+	from_va = kmap_atomic(from);
+	mb();
+	ret = zcache_comp_op(ZCACHE_COMPOP_COMPRESS, from_va, PAGE_SIZE, dmem,
+				out_len);
+	BUG_ON(ret);
+	*out_va = dmem;
+	kunmap_atomic(from_va);
+	ret = 1;
+out:
+	return ret;
+}
+
+static int zcache_comp_cpu_up(int cpu)
+{
+	struct crypto_comp *tfm;
+
+	tfm = crypto_alloc_comp(zcache_comp_name, 0, 0);
+	if (IS_ERR(tfm))
+		return NOTIFY_BAD;
+	*per_cpu_ptr(zcache_comp_pcpu_tfms, cpu) = tfm;
+	return NOTIFY_OK;
+}
+
+static void zcache_comp_cpu_down(int cpu)
+{
+	struct crypto_comp *tfm;
+
+	tfm = *per_cpu_ptr(zcache_comp_pcpu_tfms, cpu);
+	crypto_free_comp(tfm);
+	*per_cpu_ptr(zcache_comp_pcpu_tfms, cpu) = NULL;
+}
+
+static int zcache_cpu_notifier(struct notifier_block *nb,
+				unsigned long action, void *pcpu)
+{
+	int ret, cpu = (long)pcpu;
+	struct zcache_preload *kp;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+		ret = zcache_comp_cpu_up(cpu);
+		if (ret != NOTIFY_OK) {
+			pr_err("zcache: can't allocate compressor transform\n");
+			return ret;
+		}
+		per_cpu(zcache_dstmem, cpu) = (void *)__get_free_pages(
+			GFP_KERNEL | __GFP_REPEAT, ZCACHE_DSTMEM_ORDER);
+		break;
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		zcache_comp_cpu_down(cpu);
+		free_pages((unsigned long)per_cpu(zcache_dstmem, cpu),
+			ZCACHE_DSTMEM_ORDER);
+		per_cpu(zcache_dstmem, cpu) = NULL;
+		kp = &per_cpu(zcache_preloads, cpu);
+		while (kp->nr) {
+			kmem_cache_free(zcache_objnode_cache,
+					kp->objnodes[kp->nr - 1]);
+			kp->objnodes[kp->nr - 1] = NULL;
+			kp->nr--;
+		}
+		if (kp->obj) {
+			kmem_cache_free(zcache_obj_cache, kp->obj);
+			kp->obj = NULL;
+		}
+		if (kp->page) {
+			free_page((unsigned long)kp->page);
+			kp->page = NULL;
+		}
+		break;
+	default:
+		break;
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block zcache_cpu_notifier_block = {
+	.notifier_call = zcache_cpu_notifier
+};
+
+#ifdef CONFIG_SYSFS
+#define ZCACHE_SYSFS_RO(_name) \
+	static ssize_t zcache_##_name##_show(struct kobject *kobj, \
+				struct kobj_attribute *attr, char *buf) \
+	{ \
+		return sprintf(buf, "%lu\n", zcache_##_name); \
+	} \
+	static struct kobj_attribute zcache_##_name##_attr = { \
+		.attr = { .name = __stringify(_name), .mode = 0444 }, \
+		.show = zcache_##_name##_show, \
+	}
+
+#define ZCACHE_SYSFS_RO_ATOMIC(_name) \
+	static ssize_t zcache_##_name##_show(struct kobject *kobj, \
+				struct kobj_attribute *attr, char *buf) \
+	{ \
+	    return sprintf(buf, "%d\n", atomic_read(&zcache_##_name)); \
+	} \
+	static struct kobj_attribute zcache_##_name##_attr = { \
+		.attr = { .name = __stringify(_name), .mode = 0444 }, \
+		.show = zcache_##_name##_show, \
+	}
+
+#define ZCACHE_SYSFS_RO_CUSTOM(_name, _func) \
+	static ssize_t zcache_##_name##_show(struct kobject *kobj, \
+				struct kobj_attribute *attr, char *buf) \
+	{ \
+	    return _func(buf); \
+	} \
+	static struct kobj_attribute zcache_##_name##_attr = { \
+		.attr = { .name = __stringify(_name), .mode = 0444 }, \
+		.show = zcache_##_name##_show, \
+	}
+
+ZCACHE_SYSFS_RO(curr_obj_count_max);
+ZCACHE_SYSFS_RO(curr_objnode_count_max);
+ZCACHE_SYSFS_RO(flush_total);
+ZCACHE_SYSFS_RO(flush_found);
+ZCACHE_SYSFS_RO(flobj_total);
+ZCACHE_SYSFS_RO(flobj_found);
+ZCACHE_SYSFS_RO(failed_eph_puts);
+ZCACHE_SYSFS_RO(failed_pers_puts);
+ZCACHE_SYSFS_RO(zbud_curr_zbytes);
+ZCACHE_SYSFS_RO(zbud_cumul_zpages);
+ZCACHE_SYSFS_RO(zbud_cumul_zbytes);
+ZCACHE_SYSFS_RO(zbud_buddied_count);
+ZCACHE_SYSFS_RO(zbpg_unused_list_count);
+ZCACHE_SYSFS_RO(evicted_raw_pages);
+ZCACHE_SYSFS_RO(evicted_unbuddied_pages);
+ZCACHE_SYSFS_RO(evicted_buddied_pages);
+ZCACHE_SYSFS_RO(failed_get_free_pages);
+ZCACHE_SYSFS_RO(failed_alloc);
+ZCACHE_SYSFS_RO(put_to_flush);
+ZCACHE_SYSFS_RO(compress_poor);
+ZCACHE_SYSFS_RO(mean_compress_poor);
+ZCACHE_SYSFS_RO_ATOMIC(zbud_curr_raw_pages);
+ZCACHE_SYSFS_RO_ATOMIC(zbud_curr_zpages);
+ZCACHE_SYSFS_RO_ATOMIC(curr_obj_count);
+ZCACHE_SYSFS_RO_ATOMIC(curr_objnode_count);
+ZCACHE_SYSFS_RO_CUSTOM(zbud_unbuddied_list_counts,
+			zbud_show_unbuddied_list_counts);
+ZCACHE_SYSFS_RO_CUSTOM(zbud_cumul_chunk_counts,
+			zbud_show_cumul_chunk_counts);
+ZCACHE_SYSFS_RO_CUSTOM(zv_curr_dist_counts,
+			zv_curr_dist_counts_show);
+ZCACHE_SYSFS_RO_CUSTOM(zv_cumul_dist_counts,
+			zv_cumul_dist_counts_show);
+
+static struct attribute *zcache_attrs[] = {
+	&zcache_curr_obj_count_attr.attr,
+	&zcache_curr_obj_count_max_attr.attr,
+	&zcache_curr_objnode_count_attr.attr,
+	&zcache_curr_objnode_count_max_attr.attr,
+	&zcache_flush_total_attr.attr,
+	&zcache_flobj_total_attr.attr,
+	&zcache_flush_found_attr.attr,
+	&zcache_flobj_found_attr.attr,
+	&zcache_failed_eph_puts_attr.attr,
+	&zcache_failed_pers_puts_attr.attr,
+	&zcache_compress_poor_attr.attr,
+	&zcache_mean_compress_poor_attr.attr,
+	&zcache_zbud_curr_raw_pages_attr.attr,
+	&zcache_zbud_curr_zpages_attr.attr,
+	&zcache_zbud_curr_zbytes_attr.attr,
+	&zcache_zbud_cumul_zpages_attr.attr,
+	&zcache_zbud_cumul_zbytes_attr.attr,
+	&zcache_zbud_buddied_count_attr.attr,
+	&zcache_zbpg_unused_list_count_attr.attr,
+	&zcache_evicted_raw_pages_attr.attr,
+	&zcache_evicted_unbuddied_pages_attr.attr,
+	&zcache_evicted_buddied_pages_attr.attr,
+	&zcache_failed_get_free_pages_attr.attr,
+	&zcache_failed_alloc_attr.attr,
+	&zcache_put_to_flush_attr.attr,
+	&zcache_zbud_unbuddied_list_counts_attr.attr,
+	&zcache_zbud_cumul_chunk_counts_attr.attr,
+	&zcache_zv_curr_dist_counts_attr.attr,
+	&zcache_zv_cumul_dist_counts_attr.attr,
+	&zcache_zv_max_zsize_attr.attr,
+	&zcache_zv_max_mean_zsize_attr.attr,
+	&zcache_zv_page_count_policy_percent_attr.attr,
+	NULL,
+};
+
+static struct attribute_group zcache_attr_group = {
+	.attrs = zcache_attrs,
+	.name = "zcache",
+};
+
+#endif /* CONFIG_SYSFS */
+/*
+ * When zcache is disabled ("frozen"), pools can be created and destroyed,
+ * but all puts (and thus all other operations that require memory allocation)
+ * must fail.  If zcache is unfrozen, accepts puts, then frozen again,
+ * data consistency requires all puts while frozen to be converted into
+ * flushes.
+ */
+static bool zcache_freeze;
+
+/*
+ * zcache shrinker interface (only useful for ephemeral pages, so zbud only)
+ */
+static int shrink_zcache_memory(struct shrinker *shrink,
+				struct shrink_control *sc)
+{
+	int ret = -1;
+	int nr = sc->nr_to_scan;
+	gfp_t gfp_mask = sc->gfp_mask;
+
+	if (nr >= 0) {
+		if (!(gfp_mask & __GFP_FS))
+			/* does this case really need to be skipped? */
+			goto out;
+		zbud_evict_pages(nr);
+	}
+	ret = (int)atomic_read(&zcache_zbud_curr_raw_pages);
+out:
+	return ret;
+}
+
+static struct shrinker zcache_shrinker = {
+	.shrink = shrink_zcache_memory,
+	.seeks = DEFAULT_SEEKS,
+};
+
+/*
+ * zcache shims between cleancache/frontswap ops and tmem
+ */
+
+static int zcache_put_page(int cli_id, int pool_id, struct tmem_oid *oidp,
+				uint32_t index, struct page *page)
+{
+	struct tmem_pool *pool;
+	int ret = -1;
+
+	BUG_ON(!irqs_disabled());
+	pool = zcache_get_pool_by_id(cli_id, pool_id);
+	if (unlikely(pool == NULL))
+		goto out;
+	if (!zcache_freeze && zcache_do_preload(pool) == 0) {
+		/* preload does preempt_disable on success */
+		ret = tmem_put(pool, oidp, index, (char *)(page),
+				PAGE_SIZE, 0, is_ephemeral(pool));
+		if (ret < 0) {
+			if (is_ephemeral(pool))
+				zcache_failed_eph_puts++;
+			else
+				zcache_failed_pers_puts++;
+		}
+	} else {
+		zcache_put_to_flush++;
+		if (atomic_read(&pool->obj_count) > 0)
+			/* the put fails whether the flush succeeds or not */
+			(void)tmem_flush_page(pool, oidp, index);
+	}
+
+	zcache_put_pool(pool);
+out:
+	return ret;
+}
+
+static int zcache_get_page(int cli_id, int pool_id, struct tmem_oid *oidp,
+				uint32_t index, struct page *page)
+{
+	struct tmem_pool *pool;
+	int ret = -1;
+	unsigned long flags;
+	size_t size = PAGE_SIZE;
+
+	local_irq_save(flags);
+	pool = zcache_get_pool_by_id(cli_id, pool_id);
+	if (likely(pool != NULL)) {
+		if (atomic_read(&pool->obj_count) > 0)
+			ret = tmem_get(pool, oidp, index, (char *)(page),
+					&size, 0, is_ephemeral(pool));
+		zcache_put_pool(pool);
+	}
+	local_irq_restore(flags);
+	return ret;
+}
+
+static int zcache_flush_page(int cli_id, int pool_id,
+				struct tmem_oid *oidp, uint32_t index)
+{
+	struct tmem_pool *pool;
+	int ret = -1;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	zcache_flush_total++;
+	pool = zcache_get_pool_by_id(cli_id, pool_id);
+	if (likely(pool != NULL)) {
+		if (atomic_read(&pool->obj_count) > 0)
+			ret = tmem_flush_page(pool, oidp, index);
+		zcache_put_pool(pool);
+	}
+	if (ret >= 0)
+		zcache_flush_found++;
+	local_irq_restore(flags);
+	return ret;
+}
+
+static int zcache_flush_object(int cli_id, int pool_id,
+				struct tmem_oid *oidp)
+{
+	struct tmem_pool *pool;
+	int ret = -1;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	zcache_flobj_total++;
+	pool = zcache_get_pool_by_id(cli_id, pool_id);
+	if (likely(pool != NULL)) {
+		if (atomic_read(&pool->obj_count) > 0)
+			ret = tmem_flush_object(pool, oidp);
+		zcache_put_pool(pool);
+	}
+	if (ret >= 0)
+		zcache_flobj_found++;
+	local_irq_restore(flags);
+	return ret;
+}
+
+static int zcache_destroy_pool(int cli_id, int pool_id)
+{
+	struct tmem_pool *pool = NULL;
+	struct zcache_client *cli;
+	int ret = -1;
+
+	if (pool_id < 0)
+		goto out;
+
+	cli = get_zcache_client(cli_id);
+	if (cli == NULL)
+		goto out;
+
+	atomic_inc(&cli->refcount);
+	pool = idr_find(&cli->tmem_pools, pool_id);
+	if (pool == NULL)
+		goto out;
+	idr_remove(&cli->tmem_pools, pool_id);
+	/* wait for pool activity on other cpus to quiesce */
+	while (atomic_read(&pool->refcount) != 0)
+		;
+	atomic_dec(&cli->refcount);
+	local_bh_disable();
+	ret = tmem_destroy_pool(pool);
+	local_bh_enable();
+	kfree(pool);
+	pr_info("zcache: destroyed pool id=%d, cli_id=%d\n",
+			pool_id, cli_id);
+out:
+	return ret;
+}
+
+static int zcache_new_pool(uint16_t cli_id, uint32_t flags)
+{
+	int poolid = -1;
+	struct tmem_pool *pool;
+	struct zcache_client *cli = NULL;
+	int r;
+
+	cli = get_zcache_client(cli_id);
+	if (cli == NULL)
+		goto out;
+
+	atomic_inc(&cli->refcount);
+	pool = kmalloc(sizeof(struct tmem_pool), GFP_ATOMIC);
+	if (pool == NULL) {
+		pr_info("zcache: pool creation failed: out of memory\n");
+		goto out;
+	}
+
+	do {
+		r = idr_pre_get(&cli->tmem_pools, GFP_ATOMIC);
+		if (r != 1) {
+			kfree(pool);
+			pr_info("zcache: pool creation failed: out of memory\n");
+			goto out;
+		}
+		r = idr_get_new(&cli->tmem_pools, pool, &poolid);
+	} while (r == -EAGAIN);
+	if (r) {
+		pr_info("zcache: pool creation failed: error %d\n", r);
+		kfree(pool);
+		goto out;
+	}
+
+	atomic_set(&pool->refcount, 0);
+	pool->client = cli;
+	pool->pool_id = poolid;
+	tmem_new_pool(pool, flags);
+	pr_info("zcache: created %s tmem pool, id=%d, client=%d\n",
+		flags & TMEM_POOL_PERSIST ? "persistent" : "ephemeral",
+		poolid, cli_id);
+out:
+	if (cli != NULL)
+		atomic_dec(&cli->refcount);
+	return poolid;
+}
+
+/**********
+ * Two kernel functionalities currently can be layered on top of tmem.
+ * These are "cleancache" which is used as a second-chance cache for clean
+ * page cache pages; and "frontswap" which is used for swap pages
+ * to avoid writes to disk.  A generic "shim" is provided here for each
+ * to translate in-kernel semantics to zcache semantics.
+ */
+
+#ifdef CONFIG_CLEANCACHE
+static void zcache_cleancache_put_page(int pool_id,
+					struct cleancache_filekey key,
+					pgoff_t index, struct page *page)
+{
+	u32 ind = (u32) index;
+	struct tmem_oid oid = *(struct tmem_oid *)&key;
+
+	if (likely(ind == index))
+		(void)zcache_put_page(LOCAL_CLIENT, pool_id, &oid, index, page);
+}
+
+static int zcache_cleancache_get_page(int pool_id,
+					struct cleancache_filekey key,
+					pgoff_t index, struct page *page)
+{
+	u32 ind = (u32) index;
+	struct tmem_oid oid = *(struct tmem_oid *)&key;
+	int ret = -1;
+
+	if (likely(ind == index))
+		ret = zcache_get_page(LOCAL_CLIENT, pool_id, &oid, index, page);
+	return ret;
+}
+
+static void zcache_cleancache_flush_page(int pool_id,
+					struct cleancache_filekey key,
+					pgoff_t index)
+{
+	u32 ind = (u32) index;
+	struct tmem_oid oid = *(struct tmem_oid *)&key;
+
+	if (likely(ind == index))
+		(void)zcache_flush_page(LOCAL_CLIENT, pool_id, &oid, ind);
+}
+
+static void zcache_cleancache_flush_inode(int pool_id,
+					struct cleancache_filekey key)
+{
+	struct tmem_oid oid = *(struct tmem_oid *)&key;
+
+	(void)zcache_flush_object(LOCAL_CLIENT, pool_id, &oid);
+}
+
+static void zcache_cleancache_flush_fs(int pool_id)
+{
+	if (pool_id >= 0)
+		(void)zcache_destroy_pool(LOCAL_CLIENT, pool_id);
+}
+
+static int zcache_cleancache_init_fs(size_t pagesize)
+{
+	BUG_ON(sizeof(struct cleancache_filekey) !=
+				sizeof(struct tmem_oid));
+	BUG_ON(pagesize != PAGE_SIZE);
+	return zcache_new_pool(LOCAL_CLIENT, 0);
+}
+
+static int zcache_cleancache_init_shared_fs(char *uuid, size_t pagesize)
+{
+	/* shared pools are unsupported and map to private */
+	BUG_ON(sizeof(struct cleancache_filekey) !=
+				sizeof(struct tmem_oid));
+	BUG_ON(pagesize != PAGE_SIZE);
+	return zcache_new_pool(LOCAL_CLIENT, 0);
+}
+
+static struct cleancache_ops zcache_cleancache_ops = {
+	.put_page = zcache_cleancache_put_page,
+	.get_page = zcache_cleancache_get_page,
+	.invalidate_page = zcache_cleancache_flush_page,
+	.invalidate_inode = zcache_cleancache_flush_inode,
+	.invalidate_fs = zcache_cleancache_flush_fs,
+	.init_shared_fs = zcache_cleancache_init_shared_fs,
+	.init_fs = zcache_cleancache_init_fs
+};
+
+struct cleancache_ops zcache_cleancache_register_ops(void)
+{
+	struct cleancache_ops old_ops =
+		cleancache_register_ops(&zcache_cleancache_ops);
+
+	return old_ops;
+}
+#endif
+
+#ifdef CONFIG_FRONTSWAP
+/* a single tmem poolid is used for all frontswap "types" (swapfiles) */
+static int zcache_frontswap_poolid = -1;
+
+/*
+ * Swizzling increases objects per swaptype, increasing tmem concurrency
+ * for heavy swaploads.  Later, larger nr_cpus -> larger SWIZ_BITS
+ * Setting SWIZ_BITS to 27 basically reconstructs the swap entry from
+ * frontswap_load(), but has side-effects. Hence using 8.
+ */
+#define SWIZ_BITS		8
+#define SWIZ_MASK		((1 << SWIZ_BITS) - 1)
+#define _oswiz(_type, _ind)	((_type << SWIZ_BITS) | (_ind & SWIZ_MASK))
+#define iswiz(_ind)		(_ind >> SWIZ_BITS)
+
+static inline struct tmem_oid oswiz(unsigned type, u32 ind)
+{
+	struct tmem_oid oid = { .oid = { 0 } };
+	oid.oid[0] = _oswiz(type, ind);
+	return oid;
+}
+
+static int zcache_frontswap_store(unsigned type, pgoff_t offset,
+				   struct page *page)
+{
+	u64 ind64 = (u64)offset;
+	u32 ind = (u32)offset;
+	struct tmem_oid oid = oswiz(type, ind);
+	int ret = -1;
+	unsigned long flags;
+
+	BUG_ON(!PageLocked(page));
+	if (likely(ind64 == ind)) {
+		local_irq_save(flags);
+		ret = zcache_put_page(LOCAL_CLIENT, zcache_frontswap_poolid,
+					&oid, iswiz(ind), page);
+		local_irq_restore(flags);
+	}
+	return ret;
+}
+
+/* returns 0 if the page was successfully gotten from frontswap, -1 if
+ * was not present (should never happen!) */
+static int zcache_frontswap_load(unsigned type, pgoff_t offset,
+				   struct page *page)
+{
+	u64 ind64 = (u64)offset;
+	u32 ind = (u32)offset;
+	struct tmem_oid oid = oswiz(type, ind);
+	int ret = -1;
+
+	BUG_ON(!PageLocked(page));
+	if (likely(ind64 == ind))
+		ret = zcache_get_page(LOCAL_CLIENT, zcache_frontswap_poolid,
+					&oid, iswiz(ind), page);
+	return ret;
+}
+
+/* flush a single page from frontswap */
+static void zcache_frontswap_flush_page(unsigned type, pgoff_t offset)
+{
+	u64 ind64 = (u64)offset;
+	u32 ind = (u32)offset;
+	struct tmem_oid oid = oswiz(type, ind);
+
+	if (likely(ind64 == ind))
+		(void)zcache_flush_page(LOCAL_CLIENT, zcache_frontswap_poolid,
+					&oid, iswiz(ind));
+}
+
+/* flush all pages from the passed swaptype */
+static void zcache_frontswap_flush_area(unsigned type)
+{
+	struct tmem_oid oid;
+	int ind;
+
+	for (ind = SWIZ_MASK; ind >= 0; ind--) {
+		oid = oswiz(type, ind);
+		(void)zcache_flush_object(LOCAL_CLIENT,
+						zcache_frontswap_poolid, &oid);
+	}
+}
+
+static void zcache_frontswap_init(unsigned ignored)
+{
+	/* a single tmem poolid is used for all frontswap "types" (swapfiles) */
+	if (zcache_frontswap_poolid < 0)
+		zcache_frontswap_poolid =
+			zcache_new_pool(LOCAL_CLIENT, TMEM_POOL_PERSIST);
+}
+
+static struct frontswap_ops zcache_frontswap_ops = {
+	.store = zcache_frontswap_store,
+	.load = zcache_frontswap_load,
+	.invalidate_page = zcache_frontswap_flush_page,
+	.invalidate_area = zcache_frontswap_flush_area,
+	.init = zcache_frontswap_init
+};
+
+struct frontswap_ops zcache_frontswap_register_ops(void)
+{
+	struct frontswap_ops old_ops =
+		frontswap_register_ops(&zcache_frontswap_ops);
+
+	return old_ops;
+}
+#endif
+
+/*
+ * zcache initialization
+ * NOTE FOR NOW zcache MUST BE PROVIDED AS A KERNEL BOOT PARAMETER OR
+ * NOTHING HAPPENS!
+ */
+
+static int zcache_enabled;
+
+static int __init enable_zcache(char *s)
+{
+	zcache_enabled = 1;
+	return 1;
+}
+__setup("zcache", enable_zcache);
+
+/* allow independent dynamic disabling of cleancache and frontswap */
+
+static int use_cleancache = 1;
+
+static int __init no_cleancache(char *s)
+{
+	use_cleancache = 0;
+	return 1;
+}
+
+__setup("nocleancache", no_cleancache);
+
+static int use_frontswap = 1;
+
+static int __init no_frontswap(char *s)
+{
+	use_frontswap = 0;
+	return 1;
+}
+
+__setup("nofrontswap", no_frontswap);
+
+static int __init enable_zcache_compressor(char *s)
+{
+	strncpy(zcache_comp_name, s, ZCACHE_COMP_NAME_SZ);
+	zcache_enabled = 1;
+	return 1;
+}
+__setup("zcache=", enable_zcache_compressor);
+
+
+static int __init zcache_comp_init(void)
+{
+	int ret = 0;
+
+	/* check crypto algorithm */
+	if (*zcache_comp_name != '\0') {
+		ret = crypto_has_comp(zcache_comp_name, 0, 0);
+		if (!ret)
+			pr_info("zcache: %s not supported\n",
+					zcache_comp_name);
+	}
+	if (!ret)
+		strcpy(zcache_comp_name, "lzo");
+	ret = crypto_has_comp(zcache_comp_name, 0, 0);
+	if (!ret) {
+		ret = 1;
+		goto out;
+	}
+	pr_info("zcache: using %s compressor\n", zcache_comp_name);
+
+	/* alloc percpu transforms */
+	ret = 0;
+	zcache_comp_pcpu_tfms = alloc_percpu(struct crypto_comp *);
+	if (!zcache_comp_pcpu_tfms)
+		ret = 1;
+out:
+	return ret;
+}
+
+static int __init zcache_init(void)
+{
+	int ret = 0;
+
+#ifdef CONFIG_SYSFS
+	ret = sysfs_create_group(mm_kobj, &zcache_attr_group);
+	if (ret) {
+		pr_err("zcache: can't create sysfs\n");
+		goto out;
+	}
+#endif /* CONFIG_SYSFS */
+
+	if (zcache_enabled) {
+		unsigned int cpu;
+
+		tmem_register_hostops(&zcache_hostops);
+		tmem_register_pamops(&zcache_pamops);
+		ret = register_cpu_notifier(&zcache_cpu_notifier_block);
+		if (ret) {
+			pr_err("zcache: can't register cpu notifier\n");
+			goto out;
+		}
+		ret = zcache_comp_init();
+		if (ret) {
+			pr_err("zcache: compressor initialization failed\n");
+			goto out;
+		}
+		for_each_online_cpu(cpu) {
+			void *pcpu = (void *)(long)cpu;
+			zcache_cpu_notifier(&zcache_cpu_notifier_block,
+				CPU_UP_PREPARE, pcpu);
+		}
+	}
+	zcache_objnode_cache = kmem_cache_create("zcache_objnode",
+				sizeof(struct tmem_objnode), 0, 0, NULL);
+	zcache_obj_cache = kmem_cache_create("zcache_obj",
+				sizeof(struct tmem_obj), 0, 0, NULL);
+	ret = zcache_new_client(LOCAL_CLIENT);
+	if (ret) {
+		pr_err("zcache: can't create client\n");
+		goto out;
+	}
+
+#ifdef CONFIG_CLEANCACHE
+	if (zcache_enabled && use_cleancache) {
+		struct cleancache_ops old_ops;
+
+		zbud_init();
+		register_shrinker(&zcache_shrinker);
+		old_ops = zcache_cleancache_register_ops();
+		pr_info("zcache: cleancache enabled using kernel "
+			"transcendent memory and compression buddies\n");
+		if (old_ops.init_fs != NULL)
+			pr_warning("zcache: cleancache_ops overridden");
+	}
+#endif
+#ifdef CONFIG_FRONTSWAP
+	if (zcache_enabled && use_frontswap) {
+		struct frontswap_ops old_ops;
+
+		old_ops = zcache_frontswap_register_ops();
+		pr_info("zcache: frontswap enabled using kernel "
+			"transcendent memory and zsmalloc\n");
+		if (old_ops.init != NULL)
+			pr_warning("zcache: frontswap_ops overridden");
+	}
+#endif
+out:
+	return ret;
+}
+
+module_init(zcache_init)
diff --git a/include/linux/zsmalloc.h b/include/linux/zsmalloc.h
new file mode 100644
index 0000000..de2e8bf
--- /dev/null
+++ b/include/linux/zsmalloc.h
@@ -0,0 +1,43 @@ 
+/*
+ * zsmalloc memory allocator
+ *
+ * Copyright (C) 2011  Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the license that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ */
+
+#ifndef _ZS_MALLOC_H_
+#define _ZS_MALLOC_H_
+
+#include <linux/types.h>
+
+/*
+ * zsmalloc mapping modes
+ *
+ * NOTE: These only make a difference when a mapped object spans pages
+*/
+enum zs_mapmode {
+	ZS_MM_RW, /* normal read-write mapping */
+	ZS_MM_RO, /* read-only (no copy-out at unmap time) */
+	ZS_MM_WO /* write-only (no copy-in at map time) */
+};
+
+struct zs_pool;
+
+struct zs_pool *zs_create_pool(const char *name, gfp_t flags);
+void zs_destroy_pool(struct zs_pool *pool);
+
+unsigned long zs_malloc(struct zs_pool *pool, size_t size);
+void zs_free(struct zs_pool *pool, unsigned long obj);
+
+void *zs_map_object(struct zs_pool *pool, unsigned long handle,
+			enum zs_mapmode mm);
+void zs_unmap_object(struct zs_pool *pool, unsigned long handle);
+
+u64 zs_get_total_size_bytes(struct zs_pool *pool);
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index d5c8019..2586b66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -411,3 +411,21 @@  config FRONTSWAP
 	  and swap data is stored as normal on the matching swap device.
 
 	  If unsure, say Y to enable frontswap.
+
+config ZSMALLOC
+	tristate "Memory allocator for compressed pages"
+	default n
+	help
+	  zsmalloc is a slab-based memory allocator designed to store
+	  compressed RAM pages.  zsmalloc uses a memory pool that combines
+	  single pages into higher order pages by linking them together
+	  using the fields of the struct page. Allocations are then
+	  mapped through copy buffers or VM mapping, in order to reduce
+	  memory pool fragmentation and increase allocation success rate under
+	  memory pressure.
+
+	  This results in a non-standard allocator interface where
+	  a handle, not a pointer, is returned by the allocation function.
+	  This handle must be mapped in order to access the allocated space.
+
+	  If unsure, say N.
diff --git a/mm/Makefile b/mm/Makefile
index 92753e2..8a3d7bea 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -57,3 +57,4 @@  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
 obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
 obj-$(CONFIG_CLEANCACHE) += cleancache.o
 obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
+obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
new file mode 100644
index 0000000..6b20429
--- /dev/null
+++ b/mm/zsmalloc.c
@@ -0,0 +1,1063 @@ 
+/*
+ * zsmalloc memory allocator
+ *
+ * Copyright (C) 2011  Nitin Gupta
+ *
+ * This code is released using a dual license strategy: BSD/GPL
+ * You can choose the license that better fits your requirements.
+ *
+ * Released under the terms of 3-clause BSD License
+ * Released under the terms of GNU General Public License Version 2.0
+ */
+
+
+/*
+ * This allocator is designed for use with zcache and zram. Thus, the
+ * allocator is supposed to work well under low memory conditions. In
+ * particular, it never attempts higher order page allocation which is
+ * very likely to fail under memory pressure. On the other hand, if we
+ * just use single (0-order) pages, it would suffer from very high
+ * fragmentation -- any object of size PAGE_SIZE/2 or larger would occupy
+ * an entire page. This was one of the major issues with its predecessor
+ * (xvmalloc).
+ *
+ * To overcome these issues, zsmalloc allocates a bunch of 0-order pages
+ * and links them together using various 'struct page' fields. These linked
+ * pages act as a single higher-order page i.e. an object can span 0-order
+ * page boundaries. The code refers to these linked pages as a single entity
+ * called zspage.
+ *
+ * Following is how we use various fields and flags of underlying
+ * struct page(s) to form a zspage.
+ *
+ * Usage of struct page fields:
+ *	page->first_page: points to the first component (0-order) page
+ *	page->index (union with page->freelist): offset of the first object
+ *		starting in this page. For the first page, this is
+ *		always 0, so we use this field (aka freelist) to point
+ *		to the first free object in zspage.
+ *	page->lru: links together all component pages (except the first page)
+ *		of a zspage
+ *
+ *	For _first_ page only:
+ *
+ *	page->private (union with page->first_page): refers to the
+ *		component page after the first page
+ *	page->freelist: points to the first free object in zspage.
+ *		Free objects are linked together using in-place
+ *		metadata.
+ *	page->objects: maximum number of objects we can store in this
+ *		zspage (class->zspage_order * PAGE_SIZE / class->size)
+ *	page->lru: links together first pages of various zspages.
+ *		Basically forming list of zspages in a fullness group.
+ *	page->mapping: class index and fullness group of the zspage
+ *
+ * Usage of struct page flags:
+ *	PG_private: identifies the first component page
+ *	PG_private2: identifies the last component page
+ *
+ */
+
+#ifdef CONFIG_ZSMALLOC_DEBUG
+#define DEBUG
+#endif
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+#include <asm/tlbflush.h>
+#include <asm/pgtable.h>
+#include <linux/cpumask.h>
+#include <linux/cpu.h>
+#include <linux/vmalloc.h>
+#include <linux/hardirq.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+#include <linux/zsmalloc.h>
+
+/*
+ * This must be power of 2 and greater than of equal to sizeof(link_free).
+ * These two conditions ensure that any 'struct link_free' itself doesn't
+ * span more than 1 page which avoids complex case of mapping 2 pages simply
+ * to restore link_free pointer values.
+ */
+#define ZS_ALIGN		8
+
+/*
+ * A single 'zspage' is composed of up to 2^N discontiguous 0-order (single)
+ * pages. ZS_MAX_ZSPAGE_ORDER defines upper limit on N.
+ */
+#define ZS_MAX_ZSPAGE_ORDER 2
+#define ZS_MAX_PAGES_PER_ZSPAGE (_AC(1, UL) << ZS_MAX_ZSPAGE_ORDER)
+
+/*
+ * Object location (<PFN>, <obj_idx>) is encoded as
+ * as single (void *) handle value.
+ *
+ * Note that object index <obj_idx> is relative to system
+ * page <PFN> it is stored in, so for each sub-page belonging
+ * to a zspage, obj_idx starts with 0.
+ *
+ * This is made more complicated by various memory models and PAE.
+ */
+
+#ifndef MAX_PHYSMEM_BITS
+#ifdef CONFIG_HIGHMEM64G
+#define MAX_PHYSMEM_BITS 36
+#else /* !CONFIG_HIGHMEM64G */
+/*
+ * If this definition of MAX_PHYSMEM_BITS is used, OBJ_INDEX_BITS will just
+ * be PAGE_SHIFT
+ */
+#define MAX_PHYSMEM_BITS BITS_PER_LONG
+#endif
+#endif
+#define _PFN_BITS		(MAX_PHYSMEM_BITS - PAGE_SHIFT)
+#define OBJ_INDEX_BITS	(BITS_PER_LONG - _PFN_BITS)
+#define OBJ_INDEX_MASK	((_AC(1, UL) << OBJ_INDEX_BITS) - 1)
+
+#define MAX(a, b) ((a) >= (b) ? (a) : (b))
+/* ZS_MIN_ALLOC_SIZE must be multiple of ZS_ALIGN */
+#define ZS_MIN_ALLOC_SIZE \
+	MAX(32, (ZS_MAX_PAGES_PER_ZSPAGE << PAGE_SHIFT >> OBJ_INDEX_BITS))
+#define ZS_MAX_ALLOC_SIZE	PAGE_SIZE
+
+/*
+ * On systems with 4K page size, this gives 254 size classes! There is a
+ * trader-off here:
+ *  - Large number of size classes is potentially wasteful as free page are
+ *    spread across these classes
+ *  - Small number of size classes causes large internal fragmentation
+ *  - Probably its better to use specific size classes (empirically
+ *    determined). NOTE: all those class sizes must be set as multiple of
+ *    ZS_ALIGN to make sure link_free itself never has to span 2 pages.
+ *
+ *  ZS_MIN_ALLOC_SIZE and ZS_SIZE_CLASS_DELTA must be multiple of ZS_ALIGN
+ *  (reason above)
+ */
+#define ZS_SIZE_CLASS_DELTA	16
+#define ZS_SIZE_CLASSES		((ZS_MAX_ALLOC_SIZE - ZS_MIN_ALLOC_SIZE) / \
+					ZS_SIZE_CLASS_DELTA + 1)
+
+/*
+ * We do not maintain any list for completely empty or full pages
+ */
+enum fullness_group {
+	ZS_ALMOST_FULL,
+	ZS_ALMOST_EMPTY,
+	_ZS_NR_FULLNESS_GROUPS,
+
+	ZS_EMPTY,
+	ZS_FULL
+};
+
+/*
+ * We assign a page to ZS_ALMOST_EMPTY fullness group when:
+ *	n <= N / f, where
+ * n = number of allocated objects
+ * N = total number of objects zspage can store
+ * f = 1/fullness_threshold_frac
+ *
+ * Similarly, we assign zspage to:
+ *	ZS_ALMOST_FULL	when n > N / f
+ *	ZS_EMPTY	when n == 0
+ *	ZS_FULL		when n == N
+ *
+ * (see: fix_fullness_group())
+ */
+static const int fullness_threshold_frac = 4;
+
+struct size_class {
+	/*
+	 * Size of objects stored in this class. Must be multiple
+	 * of ZS_ALIGN.
+	 */
+	int size;
+	unsigned int index;
+
+	/* Number of PAGE_SIZE sized pages to combine to form a 'zspage' */
+	int pages_per_zspage;
+
+	spinlock_t lock;
+
+	/* stats */
+	u64 pages_allocated;
+
+	struct page *fullness_list[_ZS_NR_FULLNESS_GROUPS];
+};
+
+/*
+ * Placed within free objects to form a singly linked list.
+ * For every zspage, first_page->freelist gives head of this list.
+ *
+ * This must be power of 2 and less than or equal to ZS_ALIGN
+ */
+struct link_free {
+	/* Handle of next free chunk (encodes <PFN, obj_idx>) */
+	void *next;
+};
+
+struct zs_pool {
+	struct size_class size_class[ZS_SIZE_CLASSES];
+
+	gfp_t flags;	/* allocation flags used when growing pool */
+	const char *name;
+};
+
+/*
+ * A zspage's class index and fullness group
+ * are encoded in its (first)page->mapping
+ */
+#define CLASS_IDX_BITS	28
+#define FULLNESS_BITS	4
+#define CLASS_IDX_MASK	((1 << CLASS_IDX_BITS) - 1)
+#define FULLNESS_MASK	((1 << FULLNESS_BITS) - 1)
+
+/*
+ * By default, zsmalloc uses a copy-based object mapping method to access
+ * allocations that span two pages. However, if a particular architecture
+ * 1) Implements local_flush_tlb_kernel_range() and 2) Performs VM mapping
+ * faster than copying, then it should be added here so that
+ * USE_PGTABLE_MAPPING is defined. This causes zsmalloc to use page table
+ * mapping rather than copying
+ * for object mapping.
+*/
+#if defined(CONFIG_ARM)
+#define USE_PGTABLE_MAPPING
+#endif
+
+struct mapping_area {
+#ifdef USE_PGTABLE_MAPPING
+	struct vm_struct *vm; /* vm area for mapping object that span pages */
+#else
+	char *vm_buf; /* copy buffer for objects that span pages */
+#endif
+	char *vm_addr; /* address of kmap_atomic()'ed pages */
+	enum zs_mapmode vm_mm; /* mapping mode */
+};
+
+
+/* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
+static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
+
+static int is_first_page(struct page *page)
+{
+	return PagePrivate(page);
+}
+
+static int is_last_page(struct page *page)
+{
+	return PagePrivate2(page);
+}
+
+static void get_zspage_mapping(struct page *page, unsigned int *class_idx,
+				enum fullness_group *fullness)
+{
+	unsigned long m;
+	BUG_ON(!is_first_page(page));
+
+	m = (unsigned long)page->mapping;
+	*fullness = m & FULLNESS_MASK;
+	*class_idx = (m >> FULLNESS_BITS) & CLASS_IDX_MASK;
+}
+
+static void set_zspage_mapping(struct page *page, unsigned int class_idx,
+				enum fullness_group fullness)
+{
+	unsigned long m;
+	BUG_ON(!is_first_page(page));
+
+	m = ((class_idx & CLASS_IDX_MASK) << FULLNESS_BITS) |
+			(fullness & FULLNESS_MASK);
+	page->mapping = (struct address_space *)m;
+}
+
+static int get_size_class_index(int size)
+{
+	int idx = 0;
+
+	if (likely(size > ZS_MIN_ALLOC_SIZE))
+		idx = DIV_ROUND_UP(size - ZS_MIN_ALLOC_SIZE,
+				ZS_SIZE_CLASS_DELTA);
+
+	return idx;
+}
+
+static enum fullness_group get_fullness_group(struct page *page)
+{
+	int inuse, max_objects;
+	enum fullness_group fg;
+	BUG_ON(!is_first_page(page));
+
+	inuse = page->inuse;
+	max_objects = page->objects;
+
+	if (inuse == 0)
+		fg = ZS_EMPTY;
+	else if (inuse == max_objects)
+		fg = ZS_FULL;
+	else if (inuse <= max_objects / fullness_threshold_frac)
+		fg = ZS_ALMOST_EMPTY;
+	else
+		fg = ZS_ALMOST_FULL;
+
+	return fg;
+}
+
+static void insert_zspage(struct page *page, struct size_class *class,
+				enum fullness_group fullness)
+{
+	struct page **head;
+
+	BUG_ON(!is_first_page(page));
+
+	if (fullness >= _ZS_NR_FULLNESS_GROUPS)
+		return;
+
+	head = &class->fullness_list[fullness];
+	if (*head)
+		list_add_tail(&page->lru, &(*head)->lru);
+
+	*head = page;
+}
+
+static void remove_zspage(struct page *page, struct size_class *class,
+				enum fullness_group fullness)
+{
+	struct page **head;
+
+	BUG_ON(!is_first_page(page));
+
+	if (fullness >= _ZS_NR_FULLNESS_GROUPS)
+		return;
+
+	head = &class->fullness_list[fullness];
+	BUG_ON(!*head);
+	if (list_empty(&(*head)->lru))
+		*head = NULL;
+	else if (*head == page)
+		*head = (struct page *)list_entry((*head)->lru.next,
+					struct page, lru);
+
+	list_del_init(&page->lru);
+}
+
+static enum fullness_group fix_fullness_group(struct zs_pool *pool,
+						struct page *page)
+{
+	int class_idx;
+	struct size_class *class;
+	enum fullness_group currfg, newfg;
+
+	BUG_ON(!is_first_page(page));
+
+	get_zspage_mapping(page, &class_idx, &currfg);
+	newfg = get_fullness_group(page);
+	if (newfg == currfg)
+		goto out;
+
+	class = &pool->size_class[class_idx];
+	remove_zspage(page, class, currfg);
+	insert_zspage(page, class, newfg);
+	set_zspage_mapping(page, class_idx, newfg);
+
+out:
+	return newfg;
+}
+
+/*
+ * We have to decide on how many pages to link together
+ * to form a zspage for each size class. This is important
+ * to reduce wastage due to unusable space left at end of
+ * each zspage which is given as:
+ *	wastage = Zp - Zp % size_class
+ * where Zp = zspage size = k * PAGE_SIZE where k = 1, 2, ...
+ *
+ * For example, for size class of 3/8 * PAGE_SIZE, we should
+ * link together 3 PAGE_SIZE sized pages to form a zspage
+ * since then we can perfectly fit in 8 such objects.
+ */
+static int get_pages_per_zspage(int class_size)
+{
+	int i, max_usedpc = 0;
+	/* zspage order which gives maximum used size per KB */
+	int max_usedpc_order = 1;
+
+	for (i = 1; i <= ZS_MAX_PAGES_PER_ZSPAGE; i++) {
+		int zspage_size;
+		int waste, usedpc;
+
+		zspage_size = i * PAGE_SIZE;
+		waste = zspage_size % class_size;
+		usedpc = (zspage_size - waste) * 100 / zspage_size;
+
+		if (usedpc > max_usedpc) {
+			max_usedpc = usedpc;
+			max_usedpc_order = i;
+		}
+	}
+
+	return max_usedpc_order;
+}
+
+/*
+ * A single 'zspage' is composed of many system pages which are
+ * linked together using fields in struct page. This function finds
+ * the first/head page, given any component page of a zspage.
+ */
+static struct page *get_first_page(struct page *page)
+{
+	if (is_first_page(page))
+		return page;
+	else
+		return page->first_page;
+}
+
+static struct page *get_next_page(struct page *page)
+{
+	struct page *next;
+
+	if (is_last_page(page))
+		next = NULL;
+	else if (is_first_page(page))
+		next = (struct page *)page->private;
+	else
+		next = list_entry(page->lru.next, struct page, lru);
+
+	return next;
+}
+
+/* Encode <page, obj_idx> as a single handle value */
+static void *obj_location_to_handle(struct page *page, unsigned long obj_idx)
+{
+	unsigned long handle;
+
+	if (!page) {
+		BUG_ON(obj_idx);
+		return NULL;
+	}
+
+	handle = page_to_pfn(page) << OBJ_INDEX_BITS;
+	handle |= (obj_idx & OBJ_INDEX_MASK);
+
+	return (void *)handle;
+}
+
+/* Decode <page, obj_idx> pair from the given object handle */
+static void obj_handle_to_location(unsigned long handle, struct page **page,
+				unsigned long *obj_idx)
+{
+	*page = pfn_to_page(handle >> OBJ_INDEX_BITS);
+	*obj_idx = handle & OBJ_INDEX_MASK;
+}
+
+static unsigned long obj_idx_to_offset(struct page *page,
+				unsigned long obj_idx, int class_size)
+{
+	unsigned long off = 0;
+
+	if (!is_first_page(page))
+		off = page->index;
+
+	return off + obj_idx * class_size;
+}
+
+static void reset_page(struct page *page)
+{
+	clear_bit(PG_private, &page->flags);
+	clear_bit(PG_private_2, &page->flags);
+	set_page_private(page, 0);
+	page->mapping = NULL;
+	page->freelist = NULL;
+	reset_page_mapcount(page);
+}
+
+static void free_zspage(struct page *first_page)
+{
+	struct page *nextp, *tmp, *head_extra;
+
+	BUG_ON(!is_first_page(first_page));
+	BUG_ON(first_page->inuse);
+
+	head_extra = (struct page *)page_private(first_page);
+
+	reset_page(first_page);
+	__free_page(first_page);
+
+	/* zspage with only 1 system page */
+	if (!head_extra)
+		return;
+
+	list_for_each_entry_safe(nextp, tmp, &head_extra->lru, lru) {
+		list_del(&nextp->lru);
+		reset_page(nextp);
+		__free_page(nextp);
+	}
+	reset_page(head_extra);
+	__free_page(head_extra);
+}
+
+/* Initialize a newly allocated zspage */
+static void init_zspage(struct page *first_page, struct size_class *class)
+{
+	unsigned long off = 0;
+	struct page *page = first_page;
+
+	BUG_ON(!is_first_page(first_page));
+	while (page) {
+		struct page *next_page;
+		struct link_free *link;
+		unsigned int i, objs_on_page;
+
+		/*
+		 * page->index stores offset of first object starting
+		 * in the page. For the first page, this is always 0,
+		 * so we use first_page->index (aka ->freelist) to store
+		 * head of corresponding zspage's freelist.
+		 */
+		if (page != first_page)
+			page->index = off;
+
+		link = (struct link_free *)kmap_atomic(page) +
+						off / sizeof(*link);
+		objs_on_page = (PAGE_SIZE - off) / class->size;
+
+		for (i = 1; i <= objs_on_page; i++) {
+			off += class->size;
+			if (off < PAGE_SIZE) {
+				link->next = obj_location_to_handle(page, i);
+				link += class->size / sizeof(*link);
+			}
+		}
+
+		/*
+		 * We now come to the last (full or partial) object on this
+		 * page, which must point to the first object on the next
+		 * page (if present)
+		 */
+		next_page = get_next_page(page);
+		link->next = obj_location_to_handle(next_page, 0);
+		kunmap_atomic(link);
+		page = next_page;
+		off = (off + class->size) % PAGE_SIZE;
+	}
+}
+
+/*
+ * Allocate a zspage for the given size class
+ */
+static struct page *alloc_zspage(struct size_class *class, gfp_t flags)
+{
+	int i, error;
+	struct page *first_page = NULL, *uninitialized_var(prev_page);
+
+	/*
+	 * Allocate individual pages and link them together as:
+	 * 1. first page->private = first sub-page
+	 * 2. all sub-pages are linked together using page->lru
+	 * 3. each sub-page is linked to the first page using page->first_page
+	 *
+	 * For each size class, First/Head pages are linked together using
+	 * page->lru. Also, we set PG_private to identify the first page
+	 * (i.e. no other sub-page has this flag set) and PG_private_2 to
+	 * identify the last page.
+	 */
+	error = -ENOMEM;
+	for (i = 0; i < class->pages_per_zspage; i++) {
+		struct page *page;
+
+		page = alloc_page(flags);
+		if (!page)
+			goto cleanup;
+
+		INIT_LIST_HEAD(&page->lru);
+		if (i == 0) {	/* first page */
+			SetPagePrivate(page);
+			set_page_private(page, 0);
+			first_page = page;
+			first_page->inuse = 0;
+		}
+		if (i == 1)
+			first_page->private = (unsigned long)page;
+		if (i >= 1)
+			page->first_page = first_page;
+		if (i >= 2)
+			list_add(&page->lru, &prev_page->lru);
+		if (i == class->pages_per_zspage - 1)	/* last page */
+			SetPagePrivate2(page);
+		prev_page = page;
+	}
+
+	init_zspage(first_page, class);
+
+	first_page->freelist = obj_location_to_handle(first_page, 0);
+	/* Maximum number of objects we can store in this zspage */
+	first_page->objects = class->pages_per_zspage * PAGE_SIZE / class->size;
+
+	error = 0; /* Success */
+
+cleanup:
+	if (unlikely(error) && first_page) {
+		free_zspage(first_page);
+		first_page = NULL;
+	}
+
+	return first_page;
+}
+
+static struct page *find_get_zspage(struct size_class *class)
+{
+	int i;
+	struct page *page;
+
+	for (i = 0; i < _ZS_NR_FULLNESS_GROUPS; i++) {
+		page = class->fullness_list[i];
+		if (page)
+			break;
+	}
+
+	return page;
+}
+
+#ifdef USE_PGTABLE_MAPPING
+static inline int __zs_cpu_up(struct mapping_area *area)
+{
+	/*
+	 * Make sure we don't leak memory if a cpu UP notification
+	 * and zs_init() race and both call zs_cpu_up() on the same cpu
+	 */
+	if (area->vm)
+		return 0;
+	area->vm = alloc_vm_area(PAGE_SIZE * 2, NULL);
+	if (!area->vm)
+		return -ENOMEM;
+	return 0;
+}
+
+static inline void __zs_cpu_down(struct mapping_area *area)
+{
+	if (area->vm)
+		free_vm_area(area->vm);
+	area->vm = NULL;
+}
+
+static inline void *__zs_map_object(struct mapping_area *area,
+				struct page *pages[2], int off, int size)
+{
+	BUG_ON(map_vm_area(area->vm, PAGE_KERNEL, &pages));
+	area->vm_addr = area->vm->addr;
+	return area->vm_addr + off;
+}
+
+static inline void __zs_unmap_object(struct mapping_area *area,
+				struct page *pages[2], int off, int size)
+{
+	unsigned long addr = (unsigned long)area->vm_addr;
+	unsigned long end = addr + (PAGE_SIZE * 2);
+
+	flush_cache_vunmap(addr, end);
+	unmap_kernel_range_noflush(addr, PAGE_SIZE * 2);
+	local_flush_tlb_kernel_range(addr, end);
+}
+
+#else /* USE_PGTABLE_MAPPING */
+
+static inline int __zs_cpu_up(struct mapping_area *area)
+{
+	/*
+	 * Make sure we don't leak memory if a cpu UP notification
+	 * and zs_init() race and both call zs_cpu_up() on the same cpu
+	 */
+	if (area->vm_buf)
+		return 0;
+	area->vm_buf = (char *)__get_free_page(GFP_KERNEL);
+	if (!area->vm_buf)
+		return -ENOMEM;
+	return 0;
+}
+
+static inline void __zs_cpu_down(struct mapping_area *area)
+{
+	if (area->vm_buf)
+		free_page((unsigned long)area->vm_buf);
+	area->vm_buf = NULL;
+}
+
+static void *__zs_map_object(struct mapping_area *area,
+			struct page *pages[2], int off, int size)
+{
+	int sizes[2];
+	void *addr;
+	char *buf = area->vm_buf;
+
+	/* disable page faults to match kmap_atomic() return conditions */
+	pagefault_disable();
+
+	/* no read fastpath */
+	if (area->vm_mm == ZS_MM_WO)
+		goto out;
+
+	sizes[0] = PAGE_SIZE - off;
+	sizes[1] = size - sizes[0];
+
+	/* copy object to per-cpu buffer */
+	addr = kmap_atomic(pages[0]);
+	memcpy(buf, addr + off, sizes[0]);
+	kunmap_atomic(addr);
+	addr = kmap_atomic(pages[1]);
+	memcpy(buf + sizes[0], addr, sizes[1]);
+	kunmap_atomic(addr);
+out:
+	return area->vm_buf;
+}
+
+static void __zs_unmap_object(struct mapping_area *area,
+			struct page *pages[2], int off, int size)
+{
+	int sizes[2];
+	void *addr;
+	char *buf = area->vm_buf;
+
+	/* no write fastpath */
+	if (area->vm_mm == ZS_MM_RO)
+		goto out;
+
+	sizes[0] = PAGE_SIZE - off;
+	sizes[1] = size - sizes[0];
+
+	/* copy per-cpu buffer to object */
+	addr = kmap_atomic(pages[0]);
+	memcpy(addr + off, buf, sizes[0]);
+	kunmap_atomic(addr);
+	addr = kmap_atomic(pages[1]);
+	memcpy(addr, buf + sizes[0], sizes[1]);
+	kunmap_atomic(addr);
+
+out:
+	/* enable page faults to match kunmap_atomic() return conditions */
+	pagefault_enable();
+}
+
+#endif /* USE_PGTABLE_MAPPING */
+
+static int zs_cpu_notifier(struct notifier_block *nb, unsigned long action,
+				void *pcpu)
+{
+	int ret, cpu = (long)pcpu;
+	struct mapping_area *area;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+		area = &per_cpu(zs_map_area, cpu);
+		ret = __zs_cpu_up(area);
+		if (ret)
+			return notifier_from_errno(ret);
+		break;
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		area = &per_cpu(zs_map_area, cpu);
+		__zs_cpu_down(area);
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+static struct notifier_block zs_cpu_nb = {
+	.notifier_call = zs_cpu_notifier
+};
+
+static void zs_exit(void)
+{
+	int cpu;
+
+	for_each_online_cpu(cpu)
+		zs_cpu_notifier(NULL, CPU_DEAD, (void *)(long)cpu);
+	unregister_cpu_notifier(&zs_cpu_nb);
+}
+
+static int zs_init(void)
+{
+	int cpu, ret;
+
+	register_cpu_notifier(&zs_cpu_nb);
+	for_each_online_cpu(cpu) {
+		ret = zs_cpu_notifier(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
+		if (notifier_to_errno(ret))
+			goto fail;
+	}
+	return 0;
+fail:
+	zs_exit();
+	return notifier_to_errno(ret);
+}
+
+struct zs_pool *zs_create_pool(const char *name, gfp_t flags)
+{
+	int i, ovhd_size;
+	struct zs_pool *pool;
+
+	if (!name)
+		return NULL;
+
+	ovhd_size = roundup(sizeof(*pool), PAGE_SIZE);
+	pool = kzalloc(ovhd_size, GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
+		int size;
+		struct size_class *class;
+
+		size = ZS_MIN_ALLOC_SIZE + i * ZS_SIZE_CLASS_DELTA;
+		if (size > ZS_MAX_ALLOC_SIZE)
+			size = ZS_MAX_ALLOC_SIZE;
+
+		class = &pool->size_class[i];
+		class->size = size;
+		class->index = i;
+		spin_lock_init(&class->lock);
+		class->pages_per_zspage = get_pages_per_zspage(size);
+
+	}
+
+	pool->flags = flags;
+	pool->name = name;
+
+	return pool;
+}
+EXPORT_SYMBOL_GPL(zs_create_pool);
+
+void zs_destroy_pool(struct zs_pool *pool)
+{
+	int i;
+
+	for (i = 0; i < ZS_SIZE_CLASSES; i++) {
+		int fg;
+		struct size_class *class = &pool->size_class[i];
+
+		for (fg = 0; fg < _ZS_NR_FULLNESS_GROUPS; fg++) {
+			if (class->fullness_list[fg]) {
+				pr_info("Freeing non-empty class with size "
+					"%db, fullness group %d\n",
+					class->size, fg);
+			}
+		}
+	}
+	kfree(pool);
+}
+EXPORT_SYMBOL_GPL(zs_destroy_pool);
+
+/**
+ * zs_malloc - Allocate block of given size from pool.
+ * @pool: pool to allocate from
+ * @size: size of block to allocate
+ *
+ * On success, handle to the allocated object is returned,
+ * otherwise 0.
+ * Allocation requests with size > ZS_MAX_ALLOC_SIZE will fail.
+ */
+unsigned long zs_malloc(struct zs_pool *pool, size_t size)
+{
+	unsigned long obj;
+	struct link_free *link;
+	int class_idx;
+	struct size_class *class;
+
+	struct page *first_page, *m_page;
+	unsigned long m_objidx, m_offset;
+
+	if (unlikely(!size || size > ZS_MAX_ALLOC_SIZE))
+		return 0;
+
+	class_idx = get_size_class_index(size);
+	class = &pool->size_class[class_idx];
+	BUG_ON(class_idx != class->index);
+
+	spin_lock(&class->lock);
+	first_page = find_get_zspage(class);
+
+	if (!first_page) {
+		spin_unlock(&class->lock);
+		first_page = alloc_zspage(class, pool->flags);
+		if (unlikely(!first_page))
+			return 0;
+
+		set_zspage_mapping(first_page, class->index, ZS_EMPTY);
+		spin_lock(&class->lock);
+		class->pages_allocated += class->pages_per_zspage;
+	}
+
+	obj = (unsigned long)first_page->freelist;
+	obj_handle_to_location(obj, &m_page, &m_objidx);
+	m_offset = obj_idx_to_offset(m_page, m_objidx, class->size);
+
+	link = (struct link_free *)kmap_atomic(m_page) +
+					m_offset / sizeof(*link);
+	first_page->freelist = link->next;
+	memset(link, POISON_INUSE, sizeof(*link));
+	kunmap_atomic(link);
+
+	first_page->inuse++;
+	/* Now move the zspage to another fullness group, if required */
+	fix_fullness_group(pool, first_page);
+	spin_unlock(&class->lock);
+
+	return obj;
+}
+EXPORT_SYMBOL_GPL(zs_malloc);
+
+void zs_free(struct zs_pool *pool, unsigned long obj)
+{
+	struct link_free *link;
+	struct page *first_page, *f_page;
+	unsigned long f_objidx, f_offset;
+
+	int class_idx;
+	struct size_class *class;
+	enum fullness_group fullness;
+
+	if (unlikely(!obj))
+		return;
+
+	obj_handle_to_location(obj, &f_page, &f_objidx);
+	first_page = get_first_page(f_page);
+
+	get_zspage_mapping(first_page, &class_idx, &fullness);
+	class = &pool->size_class[class_idx];
+	f_offset = obj_idx_to_offset(f_page, f_objidx, class->size);
+
+	spin_lock(&class->lock);
+
+	/* Insert this object in containing zspage's freelist */
+	link = (struct link_free *)((unsigned char *)kmap_atomic(f_page)
+							+ f_offset);
+	link->next = first_page->freelist;
+	kunmap_atomic(link);
+	first_page->freelist = (void *)obj;
+
+	first_page->inuse--;
+	fullness = fix_fullness_group(pool, first_page);
+
+	if (fullness == ZS_EMPTY)
+		class->pages_allocated -= class->pages_per_zspage;
+
+	spin_unlock(&class->lock);
+
+	if (fullness == ZS_EMPTY)
+		free_zspage(first_page);
+}
+EXPORT_SYMBOL_GPL(zs_free);
+
+/**
+ * zs_map_object - get address of allocated object from handle.
+ * @pool: pool from which the object was allocated
+ * @handle: handle returned from zs_malloc
+ *
+ * Before using an object allocated from zs_malloc, it must be mapped using
+ * this function. When done with the object, it must be unmapped using
+ * zs_unmap_object.
+ *
+ * Only one object can be mapped per cpu at a time. There is no protection
+ * against nested mappings.
+ *
+ * This function returns with preemption and page faults disabled.
+*/
+void *zs_map_object(struct zs_pool *pool, unsigned long handle,
+			enum zs_mapmode mm)
+{
+	struct page *page;
+	unsigned long obj_idx, off;
+
+	unsigned int class_idx;
+	enum fullness_group fg;
+	struct size_class *class;
+	struct mapping_area *area;
+	struct page *pages[2];
+
+	BUG_ON(!handle);
+
+	/*
+	 * Because we use per-cpu mapping areas shared among the
+	 * pools/users, we can't allow mapping in interrupt context
+	 * because it can corrupt another users mappings.
+	 */
+	BUG_ON(in_interrupt());
+
+	obj_handle_to_location(handle, &page, &obj_idx);
+	get_zspage_mapping(get_first_page(page), &class_idx, &fg);
+	class = &pool->size_class[class_idx];
+	off = obj_idx_to_offset(page, obj_idx, class->size);
+
+	area = &get_cpu_var(zs_map_area);
+	area->vm_mm = mm;
+	if (off + class->size <= PAGE_SIZE) {
+		/* this object is contained entirely within a page */
+		area->vm_addr = kmap_atomic(page);
+		return area->vm_addr + off;
+	}
+
+	/* this object spans two pages */
+	pages[0] = page;
+	pages[1] = get_next_page(page);
+	BUG_ON(!pages[1]);
+
+	return __zs_map_object(area, pages, off, class->size);
+}
+EXPORT_SYMBOL_GPL(zs_map_object);
+
+void zs_unmap_object(struct zs_pool *pool, unsigned long handle)
+{
+	struct page *page;
+	unsigned long obj_idx, off;
+
+	unsigned int class_idx;
+	enum fullness_group fg;
+	struct size_class *class;
+	struct mapping_area *area;
+
+	BUG_ON(!handle);
+
+	obj_handle_to_location(handle, &page, &obj_idx);
+	get_zspage_mapping(get_first_page(page), &class_idx, &fg);
+	class = &pool->size_class[class_idx];
+	off = obj_idx_to_offset(page, obj_idx, class->size);
+
+	area = &__get_cpu_var(zs_map_area);
+	if (off + class->size <= PAGE_SIZE)
+		kunmap_atomic(area->vm_addr);
+	else {
+		struct page *pages[2];
+
+		pages[0] = page;
+		pages[1] = get_next_page(page);
+		BUG_ON(!pages[1]);
+
+		__zs_unmap_object(area, pages, off, class->size);
+	}
+	put_cpu_var(zs_map_area);
+}
+EXPORT_SYMBOL_GPL(zs_unmap_object);
+
+u64 zs_get_total_size_bytes(struct zs_pool *pool)
+{
+	int i;
+	u64 npages = 0;
+
+	for (i = 0; i < ZS_SIZE_CLASSES; i++)
+		npages += pool->size_class[i].pages_allocated;
+
+	return npages << PAGE_SHIFT;
+}
+EXPORT_SYMBOL_GPL(zs_get_total_size_bytes);
+
+module_init(zs_init);
+module_exit(zs_exit);
+
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_AUTHOR("Nitin Gupta <ngupta@vflare.org>");