* [PATCH 0/3] staging: zcache: xcfmalloc support
@ 2011-08-31 14:40 Seth Jennings
  2011-08-31 14:40 ` [PATCH 1/3] staging: zcache: xcfmalloc memory allocator for zcache Seth Jennings
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: Seth Jennings @ 2011-08-31 14:40 UTC (permalink / raw)
  To: gregkh
  Cc: devel, dan.magenheimer, ngupta, cascardo, rdunlap, linux-kernel,
	Seth Jennings

This patchset introduces a new memory allocator for persistent
pages for zcache.  The current allocator is xvmalloc.  xvmalloc
has two notable limitations:
* High (up to 50%) external fragmentation on allocation sets > PAGE_SIZE/2
* No compaction support, which reduces page reclamation

xcfmalloc seeks to fix these issues by using a scatter-gather model that
allows for cross-page allocations and relocatable data blocks.

In tests, with pages that only compress to 75% of their original
size, xvmalloc had an effective compression (pages used by the
compressed memory pool / pages stored) of ~95% (~20% lost to
fragmentation). Almost nothing was gained by the compression in this
case. xcfmalloc had an effective compression of ~77% (~2% lost to
fragmentation and metadata overhead).
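
To make that concrete with illustrative numbers (not measured data):
storing 1000 pages that each compress to 75% of PAGE_SIZE needs ~750
pages worth of compressed data; if fragmentation wastes another ~20% of
the original footprint, the pool consumes ~950 pages (~95% effective
compression, almost no savings), while ~2% total overhead would mean
~770 pages (~77%).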

xcfmalloc uses the same locking scheme as xvmalloc: a single pool-level
spinlock.  This can lead to some contention.  However, in my tests on a
4-way SMP system, the contention was minimal (200 contentions out of 600k
acquisitions).  The locking scheme may be improved in the future.
In tests, xcfmalloc and xvmalloc had identical throughputs.

While the xcfmalloc design lends itself to compaction, this is not yet
implemented.  Support will be added in a follow-on patch.

Based on 3.1-rc4.

Seth Jennings (3):
  staging: zcache: xcfmalloc memory allocator for zcache
  staging: zcache: replace xvmalloc with xcfmalloc
  staging: zcache: add zv_page_count and zv_desc_count

 drivers/staging/zcache/Makefile      |    2 +-
 drivers/staging/zcache/xcfmalloc.c   |  653 ++++++++++++++++++++++++++++++++++
 drivers/staging/zcache/xcfmalloc.h   |   29 ++
 drivers/staging/zcache/zcache-main.c |  154 ++++++---
 4 files changed, 791 insertions(+), 47 deletions(-)
 create mode 100644 drivers/staging/zcache/xcfmalloc.c
 create mode 100644 drivers/staging/zcache/xcfmalloc.h

-- 
1.7.4.1



* [PATCH 1/3] staging: zcache: xcfmalloc memory allocator for zcache
  2011-08-31 14:40 [PATCH 0/3] staging: zcache: xcfmalloc support Seth Jennings
@ 2011-08-31 14:40 ` Seth Jennings
  2011-09-01 15:43   ` Seth Jennings
  2011-08-31 14:40 ` [PATCH 2/3] staging: zcache: replace xvmalloc with xcfmalloc Seth Jennings
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 14+ messages in thread
From: Seth Jennings @ 2011-08-31 14:40 UTC (permalink / raw)
  To: gregkh
  Cc: devel, dan.magenheimer, ngupta, cascardo, rdunlap, linux-kernel,
	Seth Jennings

xcfmalloc is a memory allocator based on a scatter-gather model that
allows for cross-page storage and relocatable data blocks.  This is
achieved through a simple, though non-standard, memory allocation API.

xcfmalloc is a replacement for the xvmalloc allocator used by zcache
for persistent compressed page storage.

xcfmalloc uses a scatter-gather block model in order to reduce external
fragmentation (at the expense of an extra memcpy).  Low fragmentation
is especially important for a zcache allocator since memory will already
be under pressure.

It also uses relocatable data blocks in order to allow compaction to take
place.  Compacting seeks to move blocks to other, less sparsely used,
pages such that the maximum number of pages can be returned to the kernel
buddy allocator. Compaction is not yet implemented in this patch.
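
As a quick illustration of the (non-standard) API this patch adds, a
minimal caller might look like the sketch below.  The wrapper function,
buffer, length, and GFP flags are illustrative only, not part of the
patch, and error handling is trimmed:

	/* illustrative only: store len bytes, read them back, then free */
	static int example_roundtrip(struct xcf_pool *pool, void *buf, int len)
	{
		void *handle;

		handle = xcf_malloc(pool, len, GFP_NOWAIT);
		if (!handle)
			return -ENOMEM;
		/* data is always copied in/out through the handle; callers
		 * never map the underlying blocks directly */
		xcf_write(handle, buf);
		xcf_read(handle, buf);
		xcf_free(pool, handle);
		return 0;
	}

The copy through xcf_write()/xcf_read() is the "extra memcpy" mentioned
above; it is what allows a single allocation to span blocks in
different pages.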

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
---
 drivers/staging/zcache/Makefile    |    2 +-
 drivers/staging/zcache/xcfmalloc.c |  653 ++++++++++++++++++++++++++++++++++++
 drivers/staging/zcache/xcfmalloc.h |   29 ++
 3 files changed, 683 insertions(+), 1 deletions(-)
 create mode 100644 drivers/staging/zcache/xcfmalloc.c
 create mode 100644 drivers/staging/zcache/xcfmalloc.h

diff --git a/drivers/staging/zcache/Makefile b/drivers/staging/zcache/Makefile
index 60daa27..c7aa772 100644
--- a/drivers/staging/zcache/Makefile
+++ b/drivers/staging/zcache/Makefile
@@ -1,3 +1,3 @@
-zcache-y	:=	zcache-main.o tmem.o
+zcache-y	:=	zcache-main.o tmem.o xcfmalloc.o
 
 obj-$(CONFIG_ZCACHE)	+=	zcache.o
diff --git a/drivers/staging/zcache/xcfmalloc.c b/drivers/staging/zcache/xcfmalloc.c
new file mode 100644
index 0000000..0c6bf49
--- /dev/null
+++ b/drivers/staging/zcache/xcfmalloc.c
@@ -0,0 +1,653 @@
+/* xcfmalloc.c */
+
+/*
+ * xcfmalloc is a memory allocator based on a scatter-gather model that
+ * allows for cross-page storage and relocatable data blocks.  This is
+ * achieved through a simple, though non-standard, memory allocation API.
+ *
+ * xcfmalloc is a replacement for the xvmalloc allocator used by zcache
+ * for persistent compressed page storage.
+ *
+ * xcfmalloc uses a scatter-gather block model in order to reduce external
+ * fragmentation (at the expense of an extra memcpy).  Low fragmentation
+ * is especially important for a zcache allocator since memory will already
+ * be under pressure.
+ *
+ * It also uses relocatable data blocks in order to allow compaction to take
+ * place.  Compacting seeks to move blocks to other, less sparsely used,
+ * pages such that the maximum number of pages can be returned to the kernel
+ * buddy allocator.  NOTE: Compaction is not yet implemented.
+*/
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/bitops.h>
+#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/init.h>
+#include <linux/string.h>
+#include <linux/slab.h>
+
+#include "xcfmalloc.h"
+
+/* turn debugging on */
+#define XCFM_DEBUG
+
+/* tunables */
+#define XCF_BLOCK_SHIFT 6
+#define XCF_MAX_RESERVE_PAGES 32
+#define XCF_MAX_BLOCKS_PER_ALLOC 2
+
+#define XCF_BLOCK_ALIGN (1 << XCF_BLOCK_SHIFT)
+#define XCF_NUM_FREELISTS (PAGE_SIZE >> XCF_BLOCK_SHIFT)
+#define XCF_RESERVED_INDEX (XCF_NUM_FREELISTS - 1)
+
+static inline int xcf_align_to_block(int size)
+{
+	return ALIGN(size, XCF_BLOCK_ALIGN);
+}
+
+static inline int xcf_size_to_flindex(int size)
+{
+
+	return (xcf_align_to_block(size) >> XCF_BLOCK_SHIFT) - 1;
+}
+
+/* pool structure */
+struct xcf_pool {
+	struct xcf_blkdesc *freelists[XCF_NUM_FREELISTS];
+	spinlock_t lock;
+	gfp_t flags; /* flags used for page allocations */
+	int reserve_count; /* number of reserved (completely unused) pages */
+	int page_count; /* number of pages used and reserved */
+	int desc_count; /* number of block descriptors */
+};
+
+/* block descriptor structure and helper functions */
+struct xcf_blkdesc {
+	struct page *page;
+	struct xcf_blkdesc *next;
+	struct xcf_blkdesc *prev;
+	int offxcf_set_flags;
+#define XCF_FLAG_MASK (XCF_BLOCK_ALIGN - 1)
+#define XCF_OFFSET_MASK (~XCF_FLAG_MASK)
+#define XCF_BLOCK_FREE (1<<0)
+#define XCF_BLOCK_USED (1<<1)
+	/* size means different things depending on whether the block is
+	 * in use or not. If the block is free, the size is the aligned size
+	 * of the block and the block can contain an allocation of up to
+	 * (size - sizeof(struct xcf_blkhdr)). If the block is in use,
+	 * size is the actual number of bytes used by the allocation plus
+	 * sizeof(struct xcf_blkhdr) */
+	int size;
+};
+
+static inline void xcf_set_offset(struct xcf_blkdesc *desc, int offset)
+{
+	BUG_ON(offset != (offset & XCF_OFFSET_MASK));
+	desc->offxcf_set_flags = offset |
+		(desc->offxcf_set_flags & XCF_FLAG_MASK);
+}
+
+static inline int xcf_get_offset(struct xcf_blkdesc *desc)
+{
+	return desc->offxcf_set_flags & XCF_OFFSET_MASK;
+}
+
+static inline void xcf_set_flags(struct xcf_blkdesc *desc, int flags)
+{
+	BUG_ON(flags != (flags & XCF_FLAG_MASK));
+	desc->offxcf_set_flags = flags |
+		(desc->offxcf_set_flags & XCF_OFFSET_MASK);
+}
+
+static inline bool xcf_is_free(struct xcf_blkdesc *desc)
+{
+	return desc->offxcf_set_flags & XCF_BLOCK_FREE;
+}
+
+static inline bool xcf_is_used(struct xcf_blkdesc *desc)
+{
+	return desc->offxcf_set_flags & XCF_BLOCK_USED;
+}
+
+/* block header structure and helper functions */
+#ifdef XCFM_DEBUG
+#define DECL_SENTINEL int sentinel
+#define SENTINEL 0xdeadbeef
+#define SET_SENTINEL(_obj) ((_obj)->sentinel = SENTINEL)
+#define ASSERT_SENTINEL(_obj) BUG_ON((_obj)->sentinel != SENTINEL)
+#define CLEAR_SENTINEL(_obj) ((_obj)->sentinel = 0)
+#else
+#define DECL_SENTINEL
+#define SET_SENTINEL(_obj) do { } while (0)
+#define ASSERT_SENTINEL(_obj) do { } while (0)
+#define CLEAR_SENTINEL(_obj) do { } while (0)
+#endif
+
+struct xcf_blkhdr {
+	struct xcf_blkdesc *desc;
+	int prevoffset;
+	DECL_SENTINEL;
+};
+
+#define MAX_ALLOC_SIZE (PAGE_SIZE - sizeof(struct xcf_blkhdr))
+
+static inline void *xcf_page_start(struct xcf_blkhdr *block)
+{
+	return (void *)((unsigned long)block & PAGE_MASK);
+}
+
+static inline struct xcf_blkhdr *xcf_block_offset(struct xcf_blkhdr *block,
+						int offset)
+{
+	return (struct xcf_blkhdr *)((char *)block + offset);
+}
+
+static inline bool xcf_same_page(struct xcf_blkhdr *block1,
+				struct xcf_blkhdr *block2)
+{
+	return (xcf_page_start(block1) == xcf_page_start(block2));
+}
+
+static inline void *__xcf_map_block(struct xcf_blkdesc *desc,
+					enum km_type type)
+{
+	return kmap_atomic(desc->page, type) + xcf_get_offset(desc);
+}
+
+static inline void *xcf_map_block(struct xcf_blkdesc *desc, enum km_type type)
+{
+	struct xcf_blkhdr *block;
+	block = __xcf_map_block(desc, type);
+	ASSERT_SENTINEL(block);
+	return block;
+}
+
+static inline void xcf_unmap_block(struct xcf_blkhdr *block, enum km_type type)
+{
+	kunmap_atomic(xcf_page_start(block), type);
+}
+
+/* descriptor memory cache */
+static struct kmem_cache *xcf_desc_cache;
+
+/* descriptor management */
+
+struct xcf_blkdesc *xcf_alloc_desc(struct xcf_pool *pool, gfp_t flags)
+{
+	struct xcf_blkdesc *desc;
+	desc = kmem_cache_zalloc(xcf_desc_cache, flags);
+	if (desc)
+		pool->desc_count++;
+	return desc;
+}
+
+void xcf_free_desc(struct xcf_pool *pool, struct xcf_blkdesc *desc)
+{
+	kmem_cache_free(xcf_desc_cache, desc);
+	pool->desc_count--;
+}
+
+/* block management */
+
+static void xcf_remove_block(struct xcf_pool *pool, struct xcf_blkdesc *desc)
+{
+	u32 flindex;
+
+	BUG_ON(!xcf_is_free(desc));
+	flindex = xcf_size_to_flindex(desc->size);
+	if (flindex == XCF_RESERVED_INDEX)
+		pool->reserve_count--;
+	if (pool->freelists[flindex] == desc)
+		pool->freelists[flindex] = desc->next;
+	if (desc->prev)
+		desc->prev->next = desc->next;
+	if (desc->next)
+		desc->next->prev = desc->prev;
+	desc->next = NULL;
+	desc->prev = NULL;
+}
+
+static void xcf_insert_block(struct xcf_pool *pool, struct xcf_blkdesc *desc)
+{
+	u32 flindex;
+
+	flindex = xcf_size_to_flindex(desc->size);
+	if (flindex == XCF_RESERVED_INDEX)
+		pool->reserve_count++;
+	/* the block size needs to be realigned since it was previously
+	 * set to the actual allocation size */
+	desc->size = xcf_align_to_block(desc->size);
+	desc->next = pool->freelists[flindex];
+	desc->prev = NULL;
+	xcf_set_flags(desc, XCF_BLOCK_FREE);
+	pool->freelists[flindex] = desc;
+	if (desc->next)
+		desc->next->prev = desc;
+}
+
+/*
+ * The xcf_find_remove_block function contains the block selection logic for
+ * this allocator.
+ *
+ * This selection pattern has the advantage of utilizing blocks in partially
+ * used pages before considering a completely unused page.  This results
+ * in high utilization of the pool pages at the expense of higher allocation
+ * fragmentation (i.e. more blocks per allocation).  This is acceptable
+ * though since, at this point, memory is the resource under pressure.
+*/
+static struct xcf_blkdesc *xcf_find_remove_block(struct xcf_pool *pool,
+	int size, int blocknum)
+{
+	int flindex, i;
+	struct xcf_blkdesc *desc = NULL;
+
+	flindex = xcf_size_to_flindex(size + sizeof(struct xcf_blkhdr));
+
+	/* look for best fit */
+	if (pool->freelists[flindex])
+		goto remove;
+
+	/* if this is the last block allowed in the allocation, we shouldn't
+	 * consider smaller blocks.  it's all or nothing now */
+	if (blocknum == XCF_MAX_BLOCKS_PER_ALLOC) {
+		/* look for largest smaller block */
+		for (i = flindex; i > 0; i--) {
+			if (pool->freelists[i]) {
+				flindex = i;
+				goto remove;
+			}
+		}
+	}
+
+	/* look for smallest larger block */
+	for (i = flindex + 1; i < XCF_NUM_FREELISTS; i++) {
+		if (pool->freelists[i]) {
+			flindex = i;
+			goto remove;
+		}
+	}
+
+	/* if we get here, there are no blocks that satisfy the request */
+	return NULL;
+
+remove:
+	desc = pool->freelists[flindex];
+	xcf_remove_block(pool, desc);
+	return desc;
+}
+
+struct xcf_blkdesc *xcf_merge_free_block(struct xcf_pool *pool,
+					struct xcf_blkdesc *desc)
+{
+	struct xcf_blkdesc *next, *prev;
+	struct xcf_blkhdr *block, *nextblock, *prevblock;
+
+	block = xcf_map_block(desc, KM_USER0);
+
+	prevblock = xcf_block_offset(xcf_page_start(block), block->prevoffset);
+	if (xcf_get_offset(desc) == 0) {
+		prev = NULL;
+	} else {
+		ASSERT_SENTINEL(prevblock);
+		prev = prevblock->desc;
+	}
+
+	nextblock = xcf_block_offset(block, desc->size);
+	if (!xcf_same_page(block, nextblock)) {
+		next = NULL;
+	} else {
+		ASSERT_SENTINEL(nextblock);
+		next = nextblock->desc;
+	}
+
+	/* merge adjacent free blocks in page */
+	if (prev && xcf_is_free(prev)) {
+		/* merge with previous block */
+		xcf_remove_block(pool, prev);
+		prev->size += desc->size;
+		xcf_free_desc(pool, desc);
+		CLEAR_SENTINEL(block);
+		desc = prev;
+		block = prevblock;
+	}
+
+	if (next && xcf_is_free(next)) {
+		/* merge with next block */
+		xcf_remove_block(pool, next);
+		desc->size += next->size;
+		CLEAR_SENTINEL(nextblock);
+		nextblock = xcf_block_offset(nextblock, next->size);
+		xcf_free_desc(pool, next);
+		if (!xcf_same_page(nextblock, block)) {
+			next = NULL;
+		} else {
+			ASSERT_SENTINEL(nextblock);
+			next = nextblock->desc;
+		}
+	}
+
+	if (next)
+		nextblock->prevoffset = xcf_get_offset(desc);
+
+	xcf_unmap_block(block, KM_USER0);
+
+	return desc;
+}
+
+/* pool management */
+
+/* try to get pool->reserve_count up to XCF_MAX_RESERVE_PAGES */
+static void xcf_increase_pool(struct xcf_pool *pool, gfp_t flags)
+{
+	struct xcf_blkdesc *desc;
+	int deficit_pages;
+	struct xcf_blkhdr *block;
+	struct page *page;
+
+	/* if we are at or above our desired number of
+	 * reserved pages, nothing to do */
+	if (pool->reserve_count >= XCF_MAX_RESERVE_PAGES)
+		return;
+
+	/* alloc some pages (if we can) */
+	deficit_pages = XCF_MAX_RESERVE_PAGES - pool->reserve_count;
+	while (deficit_pages--) {
+		desc = xcf_alloc_desc(pool, GFP_NOWAIT);
+		if (!desc)
+			break;
+		/* we use our own scatter-gather mechanism that maps high
+		 * memory pages under the covers, so we add HIGHMEM
+		 * even if the caller did not explicitly request it */
+		page = alloc_page(flags | __GFP_HIGHMEM);
+		if (!page) {
+			xcf_free_desc(pool, desc);
+			break;
+		}
+		pool->page_count++;
+		desc->page = page;
+		xcf_set_offset(desc, 0);
+		desc->size = PAGE_SIZE;
+		xcf_insert_block(pool, desc);
+		block = __xcf_map_block(desc, KM_USER0);
+		block->desc = desc;
+		block->prevoffset = 0;
+		SET_SENTINEL(block);
+		xcf_unmap_block(block, KM_USER0);
+	}
+}
+
+/* tries to get pool->reserve_count down to XCF_MAX_RESERVE_PAGES */
+static void xcf_decrease_pool(struct xcf_pool *pool)
+{
+	struct xcf_blkdesc *desc;
+	int surplus_pages;
+
+	/* if we are at or below our desired number of
+	 * reserved pages, nothing to do */
+	if (pool->reserve_count <= XCF_MAX_RESERVE_PAGES)
+		return;
+
+	/* free some pages */
+	surplus_pages = pool->reserve_count - XCF_MAX_RESERVE_PAGES;
+	while (surplus_pages--) {
+		desc = pool->freelists[XCF_RESERVED_INDEX];
+		BUG_ON(!desc);
+		xcf_remove_block(pool, desc);
+		__free_page(desc->page);
+		pool->page_count--;
+		xcf_free_desc(pool, desc);
+	}
+}
+
+/* public functions */
+
+/* return handle to a new object of the specified size */
+void *xcf_malloc(struct xcf_pool *pool, int size, gfp_t flags)
+{
+	struct xcf_blkdesc *first, *prev, *next, *desc, *splitdesc;
+	int i, sizeleft, alignedleft;
+	struct xcf_blkhdr *block, *nextblock, *splitblock;
+
+	if (size > MAX_ALLOC_SIZE)
+		return NULL;
+
+	spin_lock(&pool->lock);
+
+	/* check and possibly increase (alloc) pool's page reserves */
+	xcf_increase_pool(pool, flags);
+
+	sizeleft = size;
+	first = NULL;
+	prev = NULL;
+
+	/* find block(s) to contain the allocation */
+	for (i = 1; i <= XCF_MAX_BLOCKS_PER_ALLOC; i++) {
+		desc = xcf_find_remove_block(pool, sizeleft, i);
+		if (!desc)
+			goto return_blocks;
+		if (!first)
+			first = desc;
+		desc->prev = prev;
+		if (prev)
+			prev->next = desc;
+		xcf_set_flags(desc, XCF_BLOCK_USED);
+		if (desc->size >= sizeleft + sizeof(struct xcf_blkhdr))
+			break;
+		sizeleft -= desc->size - sizeof(struct xcf_blkhdr);
+		prev = desc;
+	}
+
+	/* split is only possible on the last block */
+	alignedleft = xcf_align_to_block(sizeleft + sizeof(struct xcf_blkhdr));
+	if (desc->size > alignedleft) {
+		splitdesc = xcf_alloc_desc(pool, GFP_NOWAIT);
+		if (!splitdesc)
+			goto return_blocks;
+		splitdesc->size = desc->size - alignedleft;
+		splitdesc->page = desc->page;
+		xcf_set_offset(splitdesc, xcf_get_offset(desc) + alignedleft);
+
+		block = xcf_map_block(desc, KM_USER0);
+		ASSERT_SENTINEL(block);
+		nextblock = xcf_block_offset(block, desc->size);
+		if (xcf_same_page(nextblock, block))
+			nextblock->prevoffset = xcf_get_offset(splitdesc);
+		splitblock = xcf_block_offset(block, alignedleft);
+		splitblock->desc = splitdesc;
+		splitblock->prevoffset = xcf_get_offset(desc);
+		SET_SENTINEL(splitblock);
+		xcf_unmap_block(block, KM_USER0);
+
+		xcf_insert_block(pool, splitdesc);
+	}
+
+	desc->size = sizeleft + sizeof(struct xcf_blkhdr);
+
+	/* ensure the changes to desc are written before a potential
+	 * read in xcf_get_alloc_size() since it does not acquire a
+	 * lock */
+	wmb();
+
+	spin_unlock(&pool->lock);
+	return first;
+
+return_blocks:
+	desc = first;
+	while (desc) {
+		next = desc->next;
+		xcf_insert_block(pool, desc);
+		desc = next;
+	}
+	spin_unlock(&pool->lock);
+	return NULL;
+}
+
+enum xcf_op {
+	XCFMOP_READ,
+	XCFMOP_WRITE
+};
+
+static int access_allocation(void *handle, char *buf, enum xcf_op op)
+{
+	struct xcf_blkdesc *desc;
+	int count, offset;
+	char *block;
+
+	desc = handle;
+	count = XCF_MAX_BLOCKS_PER_ALLOC;
+	offset = 0;
+	while (desc && count--) {
+		BUG_ON(!xcf_is_used(desc));
+		BUG_ON(offset > MAX_ALLOC_SIZE);
+
+		block = xcf_map_block(desc, KM_USER0);
+		if (op == XCFMOP_WRITE)
+			memcpy(block + sizeof(struct xcf_blkhdr),
+				buf + offset,
+				desc->size - sizeof(struct xcf_blkhdr));
+		else /* XCFMOP_READ */
+			memcpy(buf + offset,
+				block + sizeof(struct xcf_blkhdr),
+				desc->size - sizeof(struct xcf_blkhdr));
+		xcf_unmap_block((struct xcf_blkhdr *)block, KM_USER0);
+
+		offset += desc->size - sizeof(struct xcf_blkhdr);
+		desc = desc->next;
+	}
+
+	BUG_ON(desc);
+
+	return 0;
+}
+
+/* write data from buf into allocation */
+int xcf_write(struct xcf_handle *handle, void *buf)
+{
+	return access_allocation(handle, buf, XCFMOP_WRITE);
+}
+
+/* read data from allocation into buf */
+int xcf_read(struct xcf_handle *handle, void *buf)
+{
+	return access_allocation(handle, buf, XCFMOP_READ);
+}
+
+/* free an allocation */
+void xcf_free(struct xcf_pool *pool, struct xcf_handle *handle)
+{
+	struct xcf_blkdesc *desc, *nextdesc;
+	int count;
+
+	spin_lock(&pool->lock);
+
+	desc = (struct xcf_blkdesc *)handle;
+	count = XCF_MAX_BLOCKS_PER_ALLOC;
+	while (desc && count--) {
+		nextdesc = desc->next;
+		BUG_ON(xcf_is_free(desc));
+		xcf_set_flags(desc, XCF_BLOCK_FREE);
+		desc->size = xcf_align_to_block(desc->size);
+		/* xcf_merge_free_block may merge with a previous block which
+		 * will cause desc to change */
+		desc = xcf_merge_free_block(pool, desc);
+		xcf_insert_block(pool, desc);
+		desc = nextdesc;
+	}
+
+	BUG_ON(desc);
+
+	/* check and possibly decrease (free) pool's page reserves */
+	xcf_decrease_pool(pool);
+
+	spin_unlock(&pool->lock);
+}
+
+/* create an xcfmalloc memory pool */
+struct xcf_pool *xcf_create_pool(gfp_t flags)
+{
+	struct xcf_pool *pool = NULL;
+
+	if (!xcf_desc_cache)
+		xcf_desc_cache = kmem_cache_create("xcf_desc_cache",
+			sizeof(struct xcf_blkdesc), 0, 0, NULL);
+
+	if (!xcf_desc_cache)
+		goto out;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		goto out;
+
+	spin_lock_init(&pool->lock);
+
+	xcf_increase_pool(pool, flags);
+
+out:
+	return pool;
+}
+
+/* destroy an xcfmalloc memory pool */
+void xcf_destroy_pool(struct xcf_pool *pool)
+{
+	struct xcf_blkdesc *desc, *nextdesc;
+
+	/* free reserved pages */
+	desc = pool->freelists[XCF_RESERVED_INDEX];
+	while (desc) {
+		nextdesc = desc->next;
+		__free_page(desc->page);
+		xcf_free_desc(pool, desc);
+		desc = nextdesc;
+	}
+
+	kfree(pool);
+}
+
+/* get size of allocation associated with handle */
+int xcf_get_alloc_size(struct xcf_handle *handle)
+{
+	struct xcf_blkdesc *desc;
+	int count, size = 0;
+
+	if (!handle)
+		goto out;
+
+	/* ensure the changes to desc by xcf_malloc() are written before
+	 * we access here since we don't acquire a lock */
+	rmb();
+
+	desc = (struct xcf_blkdesc *)handle;
+	count = XCF_MAX_BLOCKS_PER_ALLOC;
+	while (desc && count--) {
+		BUG_ON(!xcf_is_used(desc));
+		size += desc->size - sizeof(struct xcf_blkhdr);
+		desc = desc->next;
+	}
+
+out:
+	return size;
+}
+
+/* get total number of pages in use by pool not including the reserved pages */
+u64 xcf_get_total_size_bytes(struct xcf_pool *pool)
+{
+	u64 ret;
+	spin_lock(&pool->lock);
+	ret = pool->page_count - pool->reserve_count;
+	spin_unlock(&pool->lock);
+	return ret << PAGE_SHIFT;
+}
+
+/* get number of descriptors in use by pool not including the descriptors for
+ * reserved pages */
+int xcf_get_desc_count(struct xcf_pool *pool)
+{
+	int ret;
+	spin_lock(&pool->lock);
+	ret = pool->desc_count - pool->reserve_count;
+	spin_unlock(&pool->lock);
+	return ret;
+}
+
diff --git a/drivers/staging/zcache/xcfmalloc.h b/drivers/staging/zcache/xcfmalloc.h
new file mode 100644
index 0000000..38e2df9
--- /dev/null
+++ b/drivers/staging/zcache/xcfmalloc.h
@@ -0,0 +1,29 @@
+/* xcfmalloc.h */
+
+#ifndef _XCFMALLOC_H_
+#define _XCFMALLOC_H_
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+struct xcf_pool;
+struct xcf_handle {
+	void *desc;
+};
+
+void *xcf_malloc(struct xcf_pool *pool, int size, gfp_t flags);
+void xcf_free(struct xcf_pool *pool, struct xcf_handle *handle);
+
+int xcf_write(struct xcf_handle *handle, void *buf);
+int xcf_read(struct xcf_handle *handle, void *buf);
+
+struct xcf_pool *xcf_create_pool(gfp_t flags);
+void xcf_destroy_pool(struct xcf_pool *pool);
+
+int xcf_get_alloc_size(struct xcf_handle *handle);
+
+u64 xcf_get_total_size_bytes(struct xcf_pool *pool);
+int xcf_get_desc_count(struct xcf_pool *pool);
+
+#endif /* _XCFMALLOC_H_ */
+
-- 
1.7.4.1



* [PATCH 2/3] staging: zcache: replace xvmalloc with xcfmalloc
  2011-08-31 14:40 [PATCH 0/3] staging: zcache: xcfmalloc support Seth Jennings
  2011-08-31 14:40 ` [PATCH 1/3] staging: zcache: xcfmalloc memory allocator for zcache Seth Jennings
@ 2011-08-31 14:40 ` Seth Jennings
  2011-08-31 14:40 ` [PATCH 3/3] staging: zcache: add zv_page_count and zv_desc_count Seth Jennings
  2011-08-31 19:46 ` [PATCH 0/3] staging: zcache: xcfmalloc support Dan Magenheimer
  3 siblings, 0 replies; 14+ messages in thread
From: Seth Jennings @ 2011-08-31 14:40 UTC (permalink / raw)
  To: gregkh
  Cc: devel, dan.magenheimer, ngupta, cascardo, rdunlap, linux-kernel,
	Seth Jennings

This patch replaces xvmalloc with xcfmalloc as the persistent page
allocator for zcache.

Because the API is not the same between xvmalloc and xcfmalloc, the
changes are not a simple find/replace on the function names.
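
In rough terms (a simplified view of the diff below, with error
handling and header setup omitted), storing a compressed page changes
from getting a page/offset pair and mapping it directly:

	ret = xv_malloc(xvpool, alloc_size, &page, &offset, ZCACHE_GFP_MASK);
	zv = kmap_atomic(page, KM_USER0) + offset;
	/* fill in the zv header, memcpy the compressed data after it */
	kunmap_atomic(zv, KM_USER0);

to building the object in a per-cpu buffer and copying it through an
opaque handle:

	handle = xcf_malloc(xcfmpool, size, ZCACHE_GFP_MASK);
	/* zv header and compressed data were built in the per-cpu zv_cbuf */
	xcf_write(handle, zv);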

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
---
 drivers/staging/zcache/zcache-main.c |  130 ++++++++++++++++++++++------------
 1 files changed, 84 insertions(+), 46 deletions(-)

diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
index a3f5162..b07377b 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -8,7 +8,7 @@
  * and, thus indirectly, for cleancache and frontswap.  Zcache includes two
  * page-accessible memory [1] interfaces, both utilizing lzo1x compression:
  * 1) "compression buddies" ("zbud") is used for ephemeral pages
- * 2) xvmalloc is used for persistent pages.
+ * 2) xcfmalloc is used for persistent pages.
  * Xvmalloc (based on the TLSF allocator) has very low fragmentation
  * so maximizes space efficiency, while zbud allows pairs (and potentially,
  * in the future, more than a pair of) compressed pages to be closely linked
@@ -31,7 +31,7 @@
 #include <linux/math64.h>
 #include "tmem.h"
 
-#include "../zram/xvmalloc.h" /* if built in drivers/staging */
+#include "xcfmalloc.h"
 
 #if (!defined(CONFIG_CLEANCACHE) && !defined(CONFIG_FRONTSWAP))
 #error "zcache is useless without CONFIG_CLEANCACHE or CONFIG_FRONTSWAP"
@@ -60,7 +60,7 @@ MODULE_LICENSE("GPL");
 
 struct zcache_client {
 	struct tmem_pool *tmem_pools[MAX_POOLS_PER_CLIENT];
-	struct xv_pool *xvpool;
+	struct xcf_pool *xcfmpool;
 	bool allocated;
 	atomic_t refcount;
 };
@@ -623,9 +623,8 @@ static int zbud_show_cumul_chunk_counts(char *buf)
 #endif
 
 /**********
- * This "zv" PAM implementation combines the TLSF-based xvMalloc
- * with lzo1x compression to maximize the amount of data that can
- * be packed into a physical page.
+ * This "zv" PAM implementation combines xcfmalloc with lzo1x compression
+ * to maximize the amount of data that can be packed into a physical page.
  *
  * Zv represents a PAM page with the index and object (plus a "size" value
  * necessary for decompression) immediately preceding the compressed data.
@@ -658,71 +657,97 @@ static unsigned int zv_max_mean_zsize = (PAGE_SIZE / 8) * 5;
 static unsigned long zv_curr_dist_counts[NCHUNKS];
 static unsigned long zv_cumul_dist_counts[NCHUNKS];
 
-static struct zv_hdr *zv_create(struct xv_pool *xvpool, uint32_t pool_id,
+static DEFINE_PER_CPU(unsigned char *, zv_cbuf); /* zv create buffer */
+static DEFINE_PER_CPU(unsigned char *, zv_dbuf); /* zv decompress buffer */
+
+static int zv_cpu_notifier(struct notifier_block *nb,
+				unsigned long action, void *pcpu)
+{
+	int cpu = (long)pcpu;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+		per_cpu(zv_cbuf, cpu) = (void *)__get_free_page(
+			GFP_KERNEL | __GFP_REPEAT);
+		per_cpu(zv_dbuf, cpu) = (void *)__get_free_page(
+			GFP_KERNEL | __GFP_REPEAT);
+		break;
+	case CPU_DEAD:
+	case CPU_UP_CANCELED:
+		free_page((unsigned long)per_cpu(zv_cbuf, cpu));
+		per_cpu(zv_cbuf, cpu) = NULL;
+		free_page((unsigned long)per_cpu(zv_dbuf, cpu));
+		per_cpu(zv_dbuf, cpu) = NULL;
+		break;
+	default:
+		break;
+	}
+
+	return NOTIFY_OK;
+}
+
+
+static void **zv_create(struct xcf_pool *xcfmpool, uint32_t pool_id,
 				struct tmem_oid *oid, uint32_t index,
 				void *cdata, unsigned clen)
 {
-	struct page *page;
-	struct zv_hdr *zv = NULL;
-	uint32_t offset;
-	int alloc_size = clen + sizeof(struct zv_hdr);
-	int chunks = (alloc_size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
-	int ret;
+	struct zv_hdr *zv;
+	u32 size = clen + sizeof(struct zv_hdr);
+	int chunks = (size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
+	void *handle;
 
 	BUG_ON(!irqs_disabled());
 	BUG_ON(chunks >= NCHUNKS);
-	ret = xv_malloc(xvpool, alloc_size,
-			&page, &offset, ZCACHE_GFP_MASK);
-	if (unlikely(ret))
-		goto out;
+
+	handle = xcf_malloc(xcfmpool, size, ZCACHE_GFP_MASK);
+	if (!handle)
+		return NULL;
+
 	zv_curr_dist_counts[chunks]++;
 	zv_cumul_dist_counts[chunks]++;
-	zv = kmap_atomic(page, KM_USER0) + offset;
+
+	zv = (struct zv_hdr *)((char *)cdata - sizeof(*zv));
 	zv->index = index;
 	zv->oid = *oid;
 	zv->pool_id = pool_id;
 	SET_SENTINEL(zv, ZVH);
-	memcpy((char *)zv + sizeof(struct zv_hdr), cdata, clen);
-	kunmap_atomic(zv, KM_USER0);
-out:
-	return zv;
+	xcf_write(handle, zv);
+
+	return handle;
 }
 
-static void zv_free(struct xv_pool *xvpool, struct zv_hdr *zv)
+static void zv_free(struct xcf_pool *xcfmpool, void *handle)
 {
 	unsigned long flags;
-	struct page *page;
-	uint32_t offset;
-	uint16_t size = xv_get_object_size(zv);
+	u32 size = xcf_get_alloc_size(handle);
 	int chunks = (size + (CHUNK_SIZE - 1)) >> CHUNK_SHIFT;
 
-	ASSERT_SENTINEL(zv, ZVH);
 	BUG_ON(chunks >= NCHUNKS);
 	zv_curr_dist_counts[chunks]--;
-	size -= sizeof(*zv);
-	BUG_ON(size == 0);
-	INVERT_SENTINEL(zv, ZVH);
-	page = virt_to_page(zv);
-	offset = (unsigned long)zv & ~PAGE_MASK;
+
 	local_irq_save(flags);
-	xv_free(xvpool, page, offset);
+	xcf_free(xcfmpool, handle);
 	local_irq_restore(flags);
 }
 
-static void zv_decompress(struct page *page, struct zv_hdr *zv)
+static void zv_decompress(struct page *page, void *handle)
 {
 	size_t clen = PAGE_SIZE;
 	char *to_va;
 	unsigned size;
 	int ret;
+	struct zv_hdr *zv;
 
-	ASSERT_SENTINEL(zv, ZVH);
-	size = xv_get_object_size(zv) - sizeof(*zv);
+	size = xcf_get_alloc_size(handle) - sizeof(*zv);
 	BUG_ON(size == 0);
+	zv = (struct zv_hdr *)(get_cpu_var(zv_dbuf));
+	xcf_read(handle, zv);
+	ASSERT_SENTINEL(zv, ZVH);
 	to_va = kmap_atomic(page, KM_USER0);
 	ret = lzo1x_decompress_safe((char *)zv + sizeof(*zv),
 					size, to_va, &clen);
 	kunmap_atomic(to_va, KM_USER0);
+	put_cpu_var(zv_dbuf);
 	BUG_ON(ret != LZO_E_OK);
 	BUG_ON(clen != PAGE_SIZE);
 }
@@ -949,8 +974,9 @@ int zcache_new_client(uint16_t cli_id)
 		goto out;
 	cli->allocated = 1;
 #ifdef CONFIG_FRONTSWAP
-	cli->xvpool = xv_create_pool();
-	if (cli->xvpool == NULL)
+	cli->xcfmpool =
+		xcf_create_pool(ZCACHE_GFP_MASK);
+	if (cli->xcfmpool == NULL)
 		goto out;
 #endif
 	ret = 0;
@@ -1154,7 +1180,7 @@ static void *zcache_pampd_create(char *data, size_t size, bool raw, int eph,
 				struct tmem_pool *pool, struct tmem_oid *oid,
 				 uint32_t index)
 {
-	void *pampd = NULL, *cdata;
+	void *pampd = NULL, *cdata = NULL;
 	size_t clen;
 	int ret;
 	unsigned long count;
@@ -1186,26 +1212,30 @@ static void *zcache_pampd_create(char *data, size_t size, bool raw, int eph,
 		if (curr_pers_pampd_count >
 		    (zv_page_count_policy_percent * totalram_pages) / 100)
 			goto out;
+		cdata = get_cpu_var(zv_cbuf) + sizeof(struct zv_hdr);
 		ret = zcache_compress(page, &cdata, &clen);
 		if (ret == 0)
 			goto out;
 		/* reject if compression is too poor */
 		if (clen > zv_max_zsize) {
 			zcache_compress_poor++;
+			put_cpu_var(zv_cbuf);
 			goto out;
 		}
 		/* reject if mean compression is too poor */
 		if ((clen > zv_max_mean_zsize) && (curr_pers_pampd_count > 0)) {
-			total_zsize = xv_get_total_size_bytes(cli->xvpool);
+			total_zsize = xcf_get_total_size_bytes(cli->xcfmpool);
 			zv_mean_zsize = div_u64(total_zsize,
 						curr_pers_pampd_count);
 			if (zv_mean_zsize > zv_max_mean_zsize) {
 				zcache_mean_compress_poor++;
+				put_cpu_var(zv_cbuf);
 				goto out;
 			}
 		}
-		pampd = (void *)zv_create(cli->xvpool, pool->pool_id,
+		pampd = (void *)zv_create(cli->xcfmpool, pool->pool_id,
 						oid, index, cdata, clen);
+		put_cpu_var(zv_cbuf);
 		if (pampd == NULL)
 			goto out;
 		count = atomic_inc_return(&zcache_curr_pers_pampd_count);
@@ -1262,7 +1292,7 @@ static void zcache_pampd_free(void *pampd, struct tmem_pool *pool,
 		atomic_dec(&zcache_curr_eph_pampd_count);
 		BUG_ON(atomic_read(&zcache_curr_eph_pampd_count) < 0);
 	} else {
-		zv_free(cli->xvpool, (struct zv_hdr *)pampd);
+		zv_free(cli->xcfmpool, pampd);
 		atomic_dec(&zcache_curr_pers_pampd_count);
 		BUG_ON(atomic_read(&zcache_curr_pers_pampd_count) < 0);
 	}
@@ -1309,11 +1339,16 @@ static DEFINE_PER_CPU(unsigned char *, zcache_dstmem);
 static int zcache_compress(struct page *from, void **out_va, size_t *out_len)
 {
 	int ret = 0;
-	unsigned char *dmem = __get_cpu_var(zcache_dstmem);
+	unsigned char *dmem;
 	unsigned char *wmem = __get_cpu_var(zcache_workmem);
 	char *from_va;
 
 	BUG_ON(!irqs_disabled());
+	if (out_va && *out_va)
+		dmem = *out_va;
+	else
+		dmem = __get_cpu_var(zcache_dstmem);
+
 	if (unlikely(dmem == NULL || wmem == NULL))
 		goto out;  /* no buffer, so can't compress */
 	from_va = kmap_atomic(from, KM_USER0);
@@ -1331,7 +1366,7 @@ out:
 static int zcache_cpu_notifier(struct notifier_block *nb,
 				unsigned long action, void *pcpu)
 {
-	int cpu = (long)pcpu;
+	int ret, cpu = (long)pcpu;
 	struct zcache_preload *kp;
 
 	switch (action) {
@@ -1363,7 +1398,10 @@ static int zcache_cpu_notifier(struct notifier_block *nb,
 	default:
 		break;
 	}
-	return NOTIFY_OK;
+
+	ret = zv_cpu_notifier(nb, action, pcpu);
+
+	return ret;
 }
 
 static struct notifier_block zcache_cpu_notifier_block = {
@@ -1991,7 +2029,7 @@ static int __init zcache_init(void)
 
 		old_ops = zcache_frontswap_register_ops();
 		pr_info("zcache: frontswap enabled using kernel "
-			"transcendent memory and xvmalloc\n");
+			"transcendent memory and xcfmalloc\n");
 		if (old_ops.init != NULL)
 			pr_warning("ktmem: frontswap_ops overridden");
 	}
-- 
1.7.4.1



* [PATCH 3/3] staging: zcache: add zv_page_count and zv_desc_count
  2011-08-31 14:40 [PATCH 0/3] staging: zcache: xcfmalloc support Seth Jennings
  2011-08-31 14:40 ` [PATCH 1/3] staging: zcache: xcfmalloc memory allocator for zcache Seth Jennings
  2011-08-31 14:40 ` [PATCH 2/3] staging: zcache: replace xvmalloc with xcfmalloc Seth Jennings
@ 2011-08-31 14:40 ` Seth Jennings
  2011-08-31 19:46 ` [PATCH 0/3] staging: zcache: xcfmalloc support Dan Magenheimer
  3 siblings, 0 replies; 14+ messages in thread
From: Seth Jennings @ 2011-08-31 14:40 UTC (permalink / raw)
  To: gregkh
  Cc: devel, dan.magenheimer, ngupta, cascardo, rdunlap, linux-kernel,
	Seth Jennings

This patch adds the zv_page_count and zv_desc_count attributes
to the zcache sysfs.  They are read-only attributes and return
the number of pages and the number of block descriptors in use
by the pool respectively.

These statistics can be used to calculate effective compression
and block descriptor overhead for the xcfmalloc allocator.

Using the frontswap curr_pages attribute, effective compression
is: zv_page_count / curr_pages

Using /proc/slabinfo to get the objsize for an xcf_desc_cache
object, descriptor overhead is: zv_desc_count * objsize
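
For example, with hypothetical numbers just to show the arithmetic: if
curr_pages is 1000 and zv_page_count is 780, effective compression is
780 / 1000 = 78%; if zv_desc_count is 1700 with a 24-byte objsize,
descriptor overhead is 1700 * 24 bytes, i.e. roughly 40KB or about 10
pages.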

Signed-off-by: Seth Jennings <sjenning@linux.vnet.ibm.com>
---
 drivers/staging/zcache/zcache-main.c |   24 ++++++++++++++++++++++++
 1 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/drivers/staging/zcache/zcache-main.c b/drivers/staging/zcache/zcache-main.c
index b07377b..6adbbbe 100644
--- a/drivers/staging/zcache/zcache-main.c
+++ b/drivers/staging/zcache/zcache-main.c
@@ -789,6 +789,24 @@ static int zv_cumul_dist_counts_show(char *buf)
 	return p - buf;
 }
 
+static int zv_page_count_show(char *buf)
+{
+	char *p = buf;
+	unsigned long count;
+	count = xcf_get_total_size_bytes(zcache_host.xcfmpool) >> PAGE_SHIFT;
+	p += sprintf(p, "%lu\n", count);
+	return p - buf;
+}
+
+static int zv_desc_count_show(char *buf)
+{
+	char *p = buf;
+	unsigned long count;
+	count = xcf_get_desc_count(zcache_host.xcfmpool);
+	p += sprintf(p, "%lu\n", count);
+	return p - buf;
+}
+
 /*
  * setting zv_max_zsize via sysfs causes all persistent (e.g. swap)
  * pages that don't compress to less than this value (including metadata
@@ -1477,6 +1495,10 @@ ZCACHE_SYSFS_RO_CUSTOM(zv_curr_dist_counts,
 			zv_curr_dist_counts_show);
 ZCACHE_SYSFS_RO_CUSTOM(zv_cumul_dist_counts,
 			zv_cumul_dist_counts_show);
+ZCACHE_SYSFS_RO_CUSTOM(zv_page_count,
+			zv_page_count_show);
+ZCACHE_SYSFS_RO_CUSTOM(zv_desc_count,
+			zv_desc_count_show);
 
 static struct attribute *zcache_attrs[] = {
 	&zcache_curr_obj_count_attr.attr,
@@ -1513,6 +1535,8 @@ static struct attribute *zcache_attrs[] = {
 	&zcache_zv_max_zsize_attr.attr,
 	&zcache_zv_max_mean_zsize_attr.attr,
 	&zcache_zv_page_count_policy_percent_attr.attr,
+	&zcache_zv_page_count_attr.attr,
+	&zcache_zv_desc_count_attr.attr,
 	NULL,
 };
 
-- 
1.7.4.1



* RE: [PATCH 0/3] staging: zcache: xcfmalloc support
  2011-08-31 14:40 [PATCH 0/3] staging: zcache: xcfmalloc support Seth Jennings
                   ` (2 preceding siblings ...)
  2011-08-31 14:40 ` [PATCH 3/3] staging: zcache: add zv_page_count and zv_desc_count Seth Jennings
@ 2011-08-31 19:46 ` Dan Magenheimer
  2011-08-31 22:06   ` Seth Jennings
  2011-09-01 22:42   ` Seth Jennings
  3 siblings, 2 replies; 14+ messages in thread
From: Dan Magenheimer @ 2011-08-31 19:46 UTC (permalink / raw)
  To: Seth Jennings, gregkh; +Cc: devel, ngupta, cascardo, rdunlap, linux-kernel

> This patchset introduces a new memory allocator for persistent
> pages for zcache.  The current allocator is xvmalloc.  xvmalloc
> has two notable limitations:
> * High (up to 50%) external fragmentation on allocation sets > PAGE_SIZE/2
> * No compaction support, which reduces page reclamation
> 
> xcfmalloc seeks to fix these issues by using a scatter-gather model that
> allows for cross-page allocations and relocatable data blocks.
> 
> In tests, with pages that only compress to 75% of their original
> size, xvmalloc had an effective compression (pages used by the
> compressed memory pool / pages stored) of ~95% (~20% lost to
> fragmentation). Almost nothing was gained by the compression in this
> case. xcfmalloc had an effective compression of ~77% (~2% lost to
> fragmentation and metadata overhead).

Hi Seth --

Do you have any data comparing xcfmalloc vs xvmalloc for
compression ratio and/or performance (cycles to compress
or decompress different pages) on a wide(r) range of data?
Assuming xcfmalloc isn't "always better", maybe it would
be best to allow the algorithm to be selectable?  (And
then we would also need to decide the default.)

(Hopefully Nitin will have a chance to comment, since he
has much more expertise in compression than I do.)

Thanks,
Dan


* Re: [PATCH 0/3] staging: zcache: xcfmalloc support
  2011-08-31 19:46 ` [PATCH 0/3] staging: zcache: xcfmalloc support Dan Magenheimer
@ 2011-08-31 22:06   ` Seth Jennings
  2011-09-01 15:17     ` Dan Magenheimer
  2011-09-01 22:42   ` Seth Jennings
  1 sibling, 1 reply; 14+ messages in thread
From: Seth Jennings @ 2011-08-31 22:06 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: gregkh, devel, ngupta, cascardo, rdunlap, linux-kernel

On 08/31/2011 02:46 PM, Dan Magenheimer wrote:
>> This patchset introduces a new memory allocator for persistent
>> pages for zcache.  The current allocator is xvmalloc.  xvmalloc
>> has two notable limitations:
>> * High (up to 50%) external fragmentation on allocation sets > PAGE_SIZE/2
>> * No compaction support, which reduces page reclamation
>>
>> xcfmalloc seeks to fix these issues by using a scatter-gather model that
>> allows for cross-page allocations and relocatable data blocks.
>>
>> In tests, with pages that only compress to 75% of their original
>> size, xvmalloc had an effective compression (pages used by the
>> compressed memory pool / pages stored) of ~95% (~20% lost to
>> fragmentation). Almost nothing was gained by the compression in this
>> case. xcfmalloc had an effective compression of ~77% (~2% lost to
>> fragmentation and metadata overhead).
> 
> Hi Seth --
> 
> Do you have any data comparing xcfmalloc vs xvmalloc for
> compression ratio and/or performance (cycles to compress
> or decompress different pages) on a wide(r) range of data?
> Assuming xcfmalloc isn't "always better", maybe it would
> be best to allow the algorithm to be selectable?  (And
> then we would also need to decide the default.)
> 

I can get you some results comparing the two tomorrow.

You have to make the distinction between the
"compression ratio" and the "effective compression".
The compression ratio is the same since the compression
algorithm, LZO, was changed.  The effective compression,
the ratio of allocator pool pages to stored compressed
pages, is different between the two, especially for
allocation sets > PAGE_SIZE/2.

What the numbers will tell you is that for allocation sets
< PAGE_SIZE/2 xcfmalloc is a little worse (~2% greater
overhead).  But for allocation sets > PAGE_SIZE/2,
xcfmalloc has up to a 50% advantage over xvmalloc.

As far as performance numbers, all I can see is that
the throughput is the same between the two.  I'm not
sure how to get, for example, a cycles delta
between the two.

It would be difficult to make it selectable because the
function signatures (and some concepts) aren't the same.
You can see the changes that were required in patch
2/3.

> (Hopefully Nitin will have a chance to comment, since he
> has much more expertise in compression than I do.)
> 
> Thanks,
> Dan

Thanks,
Seth



* RE: [PATCH 0/3] staging: zcache: xcfmalloc support
  2011-08-31 22:06   ` Seth Jennings
@ 2011-09-01 15:17     ` Dan Magenheimer
  2011-09-01 16:33       ` Seth Jennings
  0 siblings, 1 reply; 14+ messages in thread
From: Dan Magenheimer @ 2011-09-01 15:17 UTC (permalink / raw)
  To: Seth Jennings; +Cc: gregkh, devel, ngupta, cascardo, rdunlap, linux-kernel

> > Do you have any data comparing xcfmalloc vs xvmalloc for
> > compression ratio and/or performance (cycles to compress
> > or decompress different pages) on a wide(r) range of data?
> > Assuming xcfmalloc isn't "always better", maybe it would
> > be best to allow the algorithm to be selectable?  (And
> > then we would also need to decide the default.)
> >
> 
> I can get you some results comparing the two tomorrow.
> 
> You have to make the distinction between the
> "compression ratio" and the "effective compression".
> The compression ratio is the same since the compression
> algorithm, LZO, was changed.  The effective compression,

changed -> NOT changed, correct? LZO is used in both?
I had forgotten that, so the only issue might be the
overhead.

> the ratio of allocator pool pages to stored compressed
> pages, is different between the two, especially for
> allocation sets > PAGE_SIZE/2.
 
> What the numbers will tell you is that for allocation sets
> < PAGE_SIZE/2 xcfmalloc is a little worse (~2% greater
> overhead).  But for allocation sets > PAGE_SIZE/2,
> xcfmalloc has up to a 50% advantage over xvmalloc.
> 
> As far as performance numbers, all I can see is that
> the throughput is the same between the two.  I'm not
> sure how to get, for example, a cycles delta
> between the two.

IIRC, xvmalloc has O(1) overhead regardless of the number
of chunks of data stored.  Some algorithms are O(N) or
even O(N**2), i.e. might walk a potentially increasingly
very long list of allocations/descriptors
to find a slot, which would not be acceptable for
zcache as, for a large data set, the overhead might
be much worse than the cycles-to-compress.  Which is
xcfmalloc, O(1) or O(N) or O(N**2)?

(Also, since I think interrupts are still disabled,
reading the tsc before/after should be safe to
get the cycles delta.)

> It would be difficult to make it selectable because the
> function signatures (and some concepts) aren't the same.
> You can see the changes that were required in patch
> 2/3.

Then I definitely would like to see some review
and discussion from Nitin.  Clearly xcfmalloc is better
for poorly-compressible data; I would like to be confident
that it is not "worse" for another large common set of
data.

An even better implementation could be if both are
active and the selection is made at runtime depending
on the compressibility of the data, i.e. poorly-compressible
data gets stored in xcfmalloc and other data in xvmalloc?
Probably not worth the effort, but food for thought.

Thanks,
Dan


* Re: [PATCH 1/3] staging: zcache: xcfmalloc memory allocator for zcache
  2011-08-31 14:40 ` [PATCH 1/3] staging: zcache: xcfmalloc memory allocator for zcache Seth Jennings
@ 2011-09-01 15:43   ` Seth Jennings
  2011-09-06 23:51     ` Greg KH
  0 siblings, 1 reply; 14+ messages in thread
From: Seth Jennings @ 2011-09-01 15:43 UTC (permalink / raw)
  To: Seth Jennings
  Cc: gregkh, devel, dan.magenheimer, ngupta, cascardo, rdunlap, linux-kernel

On 08/31/2011 09:40 AM, Seth Jennings wrote:

> +static struct xcf_blkdesc *xcf_find_remove_block(struct xcf_pool *pool,
> +	int size, int blocknum)
> +{
> +	int flindex, i;
> +	struct xcf_blkdesc *desc = NULL;
> +
> +	flindex = xcf_size_to_flindex(size + sizeof(struct xcf_blkhdr));
> +
> +	/* look for best fit */
> +	if (pool->freelists[flindex])
> +		goto remove;
> +
> +	/* if this is the last block allowed in the allocation, we shouldn't
> +	 * consider smaller blocks.  it's all or nothing now */
> +	if (blocknum == XCF_MAX_BLOCKS_PER_ALLOC) {

In gathering my performance numbers for Dan, I discovered I introduced
a regression by making a late change in my development.

This line should be:
	if (blocknum != XCF_MAX_BLOCKS_PER_ALLOC) {

This regression actually causes xcfmalloc to have the same fragmentation
issue as xvmalloc.

> +		/* look for largest smaller block */
> +		for (i = flindex; i > 0; i--) {
> +			if (pool->freelists[i]) {
> +				flindex = i;
> +				goto remove;
> +			}
> +		}
> +	}

--
Seth


* Re: [PATCH 0/3] staging: zcache: xcfmalloc support
  2011-09-01 15:17     ` Dan Magenheimer
@ 2011-09-01 16:33       ` Seth Jennings
  2011-09-01 16:54         ` Dave Hansen
  0 siblings, 1 reply; 14+ messages in thread
From: Seth Jennings @ 2011-09-01 16:33 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: gregkh, devel, ngupta, cascardo, rdunlap, linux-kernel

On 09/01/2011 10:17 AM, Dan Magenheimer wrote:
>>> Do you have any data comparing xcfmalloc vs xvmalloc for
>>> compression ratio and/or performance (cycles to compress
>>> or decompress different pages) on a wide(r) range of data?
>>> Assuming xcfmalloc isn't "always better", maybe it would
>>> be best to allow the algorithm to be selectable?  (And
>>> then we would also need to decide the default.)
>>>
>>
>> I can get you some results comparing the two tomorrow.
>>
>> You have to make the distinction between the
>> "compression ratio" and the "effective compression".
>> The compression ratio is the same since the compression
>> algorithm, LZO, was changed.  The effective compression,
> 
> changed -> NOT changed, correct? LZO is used in both?
> I had forgotten that, so the only issue might be the
> overhead.
> 

Yes, LZO is NOT changed. Typo on my part.

>> the ratio of allocator pool pages to stored compressed
>> pages, is different between the two, especially for
>> allocation sets > PAGE_SIZE/2.
>  
>> What the numbers will tell you is that for allocation sets
>> < PAGE_SIZE/2 xcfmalloc is a little worse (~2% greater
>> overhead).  But for allocation sets > PAGE_SIZE/2,
>> xcfmalloc has up to a 50% advantage over xvmalloc.
>>
>> As far as performance numbers, all I can see is that
>> the throughput is the same between the two.  I'm not
>> sure how to get, for example, a cycles delta
>> between the two.
> 
> IIRC, xvmalloc has O(1) overhead regardless of the number
> of chunks of data stored.  Some algorithms are O(N) or
> even O(N**2), i.e. might walk a potentially increasingly
> very long list of allocations/descriptors
> to find a slot, which would not be acceptable for
> zcache as, for a large data set, the overhead might
> be much worse than the cycles-to-compress.  Which is
> xcfmalloc, O(1) or O(N) or O(N**2)?

xcfmalloc is also O(1) in that the number of freelists
that have to be checked is constant and not increasing
with the number of allocations.  The constant hidden
in the O(1) for finding a suitable block is NUM_FREELISTS.

The performance on the block lookup could be improved by
using a freelist bitmap similar to xvmalloc.  However,
I don't think there would be a measurable change in
performance, because this is not where the
zcache_put_page() codepath is spending all its time.
The time eater in this codepath is the LZO compression,
as indicated by the large difference in throughput
depending on the compressibility of the page.
This is supported by the metrics I am about to send you.
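
If the lookup ever did become worth optimizing, the bitmap idea would
look roughly like this (untested sketch only; fl_bitmap would be a new
field in struct xcf_pool, kept in sync by xcf_insert_block() and
xcf_remove_block()):

	/* in struct xcf_pool */
	DECLARE_BITMAP(fl_bitmap, XCF_NUM_FREELISTS);

	/* xcf_insert_block(): list went from empty to non-empty */
	set_bit(flindex, pool->fl_bitmap);
	/* xcf_remove_block(): last block removed from the list */
	clear_bit(flindex, pool->fl_bitmap);

	/* xcf_find_remove_block(): smallest non-empty freelist >= flindex */
	i = find_next_bit(pool->fl_bitmap, XCF_NUM_FREELISTS, flindex);
	if (i < XCF_NUM_FREELISTS)
		desc = pool->freelists[i];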

> 
> (Also, since I think interrupts are still disabled,
> reading the tsc before/after should be safe to
> get the cycles delta.)
> 
>> It would be difficult to make it selectable because the
>> function signatures (and some concepts) aren't the same.
>> You can see the changes that were required in patch
>> 2/3.
> 
> Then I definitely would like to see some review
> and discussion from Nitin.  Clearly xcfmalloc is better
> for poorly-compressible data; I would like to be confident
> that it is not "worse" for another large common set of
> data.
> 
> An even better implementation could be if both are
> active and the selection is made at runtime depending
> on the compressibility of the data, i.e. poorly-compressible
> data gets stored in xcfmalloc and other data in xvmalloc?
> Probably not worth the effort, but food for thought.
> 

In the metrics I am about to send, you'll see that xcfmalloc
only does <2% worse on zeroed pages (the worst case for xcfmalloc),
but does almost 22% better on pages that compress to 75% of their
original size.

> Thanks,
> Dan



* Re: [PATCH 0/3] staging: zcache: xcfmalloc support
  2011-09-01 16:33       ` Seth Jennings
@ 2011-09-01 16:54         ` Dave Hansen
  2011-09-01 22:01           ` Seth Jennings
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Hansen @ 2011-09-01 16:54 UTC (permalink / raw)
  To: Seth Jennings
  Cc: Dan Magenheimer, gregkh, devel, ngupta, cascardo, rdunlap, linux-kernel

On Thu, 2011-09-01 at 11:33 -0500, Seth Jennings wrote:
> xcfmalloc is also O(1) in that the number of freelists
> that have to be checked is constant and not increasing
> with the number of allocations.  The constant hidden
> in the O(1) for finding a suitable block is NUM_FREELISTS.

The algorithm is technically O(n^2) since there are
XCF_MAX_BLOCKS_PER_ALLOC searches through XCF_NUM_FREELISTS.  There's
also the reserved pages refill loop, which is linear too.

xcfmalloc's big compromise is that it doesn't do any searching or
fitting.  It might needlessly split larger blocks when two small ones
would have worked, for instance.

-- Dave



* Re: [PATCH 0/3] staging: zcache: xcfmalloc support
  2011-09-01 16:54         ` Dave Hansen
@ 2011-09-01 22:01           ` Seth Jennings
  2011-09-01 23:44             ` Dave Hansen
  0 siblings, 1 reply; 14+ messages in thread
From: Seth Jennings @ 2011-09-01 22:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Dan Magenheimer, gregkh, devel, ngupta, cascardo, rdunlap, linux-kernel

On 09/01/2011 11:54 AM, Dave Hansen wrote:
> On Thu, 2011-09-01 at 11:33 -0500, Seth Jennings wrote:
>> xcfmalloc is also O(1) in that the number of freelists
>> that have to be checked is constant and not increasing
>> with the number of allocations.  The constant hidden
>> in the O(1) for finding a suitable block is NUM_FREELISTS.
> 
> The algorithm is technically O(n^2) since there are
> XCF_MAX_BLOCKS_PER_ALLOC searches through XCF_NUM_FREELISTS.  There's
> also the reserved pages refill loop, which is linear too.
> 

I was seeing n as the number of allocations.  Since 
XCF_MAX_BLOCKS_PER_ALLOC and XCF_NUM_FREELISTS are constant (i.e.
not increasing with the number of allocations) wouldn't it be
O(1)?

I see it like this:

for (i=0; i<2; i++) {
	do_something();
}

vs.

do_something();
do_something();

Is one O(n) and the other O(1)?  They do the same thing because the
loop iterates a constant number of times.

For it to be O(n) it would have to be:

for (i=0; i<n; i++) {
	do_something();
}

Right?

> xcfmalloc's big compromise is that it doesn't do any searching or
> fitting.  It might needlessly split larger blocks when two small ones
> would have worked, for instance.

Splitting a larger block is the last option.  I might not
be understanding you correctly, but find_remove_block() does try to
find the optimal block to use, which is "searching and fitting" in my
mind.

> 
> -- Dave
> 



* Re: [PATCH 0/3] staging: zcache: xcfmalloc support
  2011-08-31 19:46 ` [PATCH 0/3] staging: zcache: xcfmalloc support Dan Magenheimer
  2011-08-31 22:06   ` Seth Jennings
@ 2011-09-01 22:42   ` Seth Jennings
  1 sibling, 0 replies; 14+ messages in thread
From: Seth Jennings @ 2011-09-01 22:42 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: gregkh, devel, ngupta, cascardo, rdunlap, linux-kernel

On 08/31/2011 02:46 PM, Dan Magenheimer wrote:
>> This patchset introduces a new memory allocator for persistent
>> pages for zcache.  The current allocator is xvmalloc.  xvmalloc
>> has two notable limitations:
>> * High (up to 50%) external fragmentation on allocation sets > PAGE_SIZE/2
>> * No compaction support, which reduces page reclamation
>>
>> xcfmalloc seeks to fix these issues by using a scatter-gather model that
>> allows for cross-page allocations and relocatable data blocks.
>>
>> In tests, with pages that only compress to 75% of their original
>> size, xvmalloc had an effective compression (pages used by the
>> compressed memory pool / pages stored) of ~95% (~20% lost to
>> fragmentation). Almost nothing was gained by the compression in this
>> case. xcfmalloc had an effective compression of ~77% (~2% lost to
>> fragmentation and metadata overhead).
> 
> Hi Seth --
> 
> Do you have any data comparing xcfmalloc vs xvmalloc for
> compression ratio and/or performance (cycles to compress
> or decompress different pages) on a wide(r) range of data?
> Assuming xcfmalloc isn't "always better", maybe it would
> be best to allow the algorithm to be selectable?  (And
> then we would also need to decide the default.)
> 

Ok, so I was able to get some numbers and I said I'd send them out
today, so... here they are. 32-bit system.  I create a cgroup with a 
memory.limit_in_bytes of 256MB.  Then I ran a program that allocates
512MB, one 4k page a time.  The pages can be filled with zeros
or text depending on the command arguments.  The text is english
text that has an average compression ratio, using lzo1x, of 75%
with little deviation.  zv_max_mean_zsize and zv_max_zsize are
both set to 3584 (7 * PAGE_SIZE / 8).

xvmalloc
		curr_pages	zv_page_count	effective compression
zero filled	65859		1269		1.93%
text (75%)	65925		65892		99.95%

xcfmalloc (descriptors are 24 bytes, 170 per 4k page)
		curr_pages	zv_page_count	zv_desc_count	effective compression
zero filled	65845		2068		65858		3.72% (+1.79)
text (75%)	65965		50848		114980		78.11% (-21.84)

This shows that xvmalloc is 1.79 points better on zero filled pages.
This is because xcfmalloc has higher internal fragmentation, since
its block sizes aren't as granular as xvmalloc's; that accounts
for 1.21 points of the delta.  xcfmalloc also has block descriptors,
which account for the remaining 0.58 points.

It also shows that xcfmalloc is 21.84 points better on text filled
pages.  This is because xcfmalloc allocations can span different
pages, which greatly reduces external fragmentation compared to xvmalloc.

I did some quick tests with "time" using the same program and the
timings are very close (3-run average, little deviation):

xvmalloc:
zero filled	0m0.852s
text (75%)	0m14.415s

xcfmalloc:
zero filled	0m0.870s
text (75%)	0m15.089s

I suspect that the small decrease in throughput is due to the
extra memcpy in xcfmalloc.  However, these timings, more than
anything, demonstrate that the throughput is GREATLY affected
by the compressibility of the data.

In all cases, all swapped pages were captured by frontswap with
no put failures.

> (Hopefully Nitin will have a chance to comment, since he
> has much more expertise in compression than I do.)
> 
> Thanks,
> Dan



* Re: [PATCH 0/3] staging: zcache: xcfmalloc support
  2011-09-01 22:01           ` Seth Jennings
@ 2011-09-01 23:44             ` Dave Hansen
  0 siblings, 0 replies; 14+ messages in thread
From: Dave Hansen @ 2011-09-01 23:44 UTC (permalink / raw)
  To: Seth Jennings
  Cc: Dan Magenheimer, gregkh, devel, ngupta, cascardo, rdunlap, linux-kernel

On Thu, 2011-09-01 at 17:01 -0500, Seth Jennings wrote:
> I was seeing n as the number of allocations.  Since 
> XCF_MAX_BLOCKS_PER_ALLOC and XCF_NUM_FREELISTS are constant (i.e.
> not increasing with the number of allocations) wouldn't it be
> O(1)?

It's the difference between your implementation and the _algorithm_
you've chosen.  If someone doubled XCF_MAX_BLOCKS_PER_ALLOC and
XCF_NUM_FREELISTS, you'd see the time quadruple, not stay constant.
That's a property of the _algorithm_.

> > xcfmalloc's big compromise is that it doesn't do any searching or
> > fitting.  It might needlessly split larger blocks when two small ones
> > would have worked, for instance.
> 
> Splitting a larger block is the last option.  I might not
> be understanding you correctly, but find_remove_block() does try to
> find the optimal block to use, which is "searching and fitting" in my
> mind.

I don't want to split hairs on the wording.  It's obvious, though, that
xcfmalloc does not find _optimal_ fits.  It also doesn't use the
smallest-possible blocks to fit the allocation.  Consider if you wanted a
1000-byte allocation (with 10 100-byte buckets and no metadata for
simplicity), and had 4 blocks:

        900
        500,500,500

I think it would split a 500 into 100,400, and leave the 400:
        
        500,500
        400

It took the largest (most valuable) block, and split a 500 block when it
didn't have to.  The reason it doesn't do this is that it doesn't
_search_.  It just indexes and guesses.  That's *fast*, but it errs on
the side of speed rather than being optimal.  That's OK, we do it all
the time, but it *is* a compromise.  We should at least be thinking of
the cases when this doesn't perform well.

-- Dave



* Re: [PATCH 1/3] staging: zcache: xcfmalloc memory allocator for zcache
  2011-09-01 15:43   ` Seth Jennings
@ 2011-09-06 23:51     ` Greg KH
  0 siblings, 0 replies; 14+ messages in thread
From: Greg KH @ 2011-09-06 23:51 UTC (permalink / raw)
  To: Seth Jennings
  Cc: gregkh, devel, dan.magenheimer, ngupta, cascardo, rdunlap, linux-kernel

On Thu, Sep 01, 2011 at 10:43:27AM -0500, Seth Jennings wrote:
> On 08/31/2011 09:40 AM, Seth Jennings wrote:
> 
> > +static struct xcf_blkdesc *xcf_find_remove_block(struct xcf_pool *pool,
> > +	int size, int blocknum)
> > +{
> > +	int flindex, i;
> > +	struct xcf_blkdesc *desc = NULL;
> > +
> > +	flindex = xcf_size_to_flindex(size + sizeof(struct xcf_blkhdr));
> > +
> > +	/* look for best fit */
> > +	if (pool->freelists[flindex])
> > +		goto remove;
> > +
> > +	/* if this is the last block allowed in the allocation, we shouldn't
> > +	 * consider smaller blocks.  it's all or nothing now */
> > +	if (blocknum == XCF_MAX_BLOCKS_PER_ALLOC) {
> 
> In gathering my performance numbers for Dan, I discovered I introduced
> a regression by making a late change in my development.
> 
> This line should be:
> 	if (blocknum != XCF_MAX_BLOCKS_PER_ALLOC) {
> 
> This regression actually causes xcfmalloc to have the same fragmentation
> issue as xvmalloc.

Ok, care to resend a tested and correct patch series then?

greg k-h

