* [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
@ 2014-09-26  6:53 ` Joonsoo Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2014-09-26  6:53 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-mm, linux-kernel, Jerome Marchand,
	Sergey Senozhatsky, Dan Streetman, Luigi Semenzato, Mel Gorman,
	Hugh Dickins, Joonsoo Kim

WARNING: This is just an RFC patchset, and patch 2/2 is only for testing.
If you know of a useful place to use this allocator, please let me know.

This is a brand-new allocator, called the anti-fragmentation memory
allocator (aka afmalloc), designed to deal with arbitrary sized object
allocation efficiently. zram and zswap store compressed data as arbitrary
sized objects, so they can use this allocator. If there are any other use
cases, they can use it, too.

This work is motivated by observing fragmentation in zsmalloc, which is
intended for storing arbitrary sized objects with low fragmentation.
Although it works well under allocation-intensive workloads, memory can
become highly fragmented after many frees. In some cases, memory left unused
due to fragmentation amounts to 20% ~ 50% of the memory actually in use. The
other problem is that other subsystems cannot use this unused memory: the
fragmented memory is zsmalloc specific, so most other subsystems cannot use
it until the zspage is freed back to the page allocator.

I guess that zbud has a similar fragmentation problem, but I didn't
investigate it deeply.

This new allocator uses the SLAB allocator to solve the above problems. When
a request comes in, it returns a handle, which is a pointer to metadata that
points to many small chunks. These small chunks are power of 2 sized and
together build up the whole requested memory. We can easily obtain such
chunks from the SLAB allocator. The following is a conceptual representation
of the metadata used by this allocator, to help in understanding it.

Handle A for 400 bytes
{
	Pointer for 256 bytes chunk
	Pointer for 128 bytes chunk
	Pointer for 16 bytes chunk

	(256 + 128 + 16 = 400)
}

As you can see, the 400 bytes of memory are not contiguous in afmalloc, so
allocator-specific store/load functions are needed. These add some
computation overhead, and I guess that this is the only drawback this
allocator has.
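
For illustration, here is a minimal usage sketch based only on the API
declared in include/linux/afmalloc.h below; error handling is trimmed, and
the pool limit (1MB) and GFP_KERNEL flags are just example values:

	#include <linux/afmalloc.h>
	#include <linux/sizes.h>
	#include <linux/slab.h>

	static int afmalloc_example(void *src, void *dst, size_t len)
	{
		struct afmalloc_pool *pool;
		unsigned long handle;
		int ret = -ENOMEM;

		/* pool capped at 1MB of total allocations; no HIGHMEM flag allowed */
		pool = afmalloc_create_pool(AFMALLOC_MAX_LEVEL, SZ_1M, GFP_KERNEL);
		if (!pool)
			return ret;

		/* returns an opaque handle, 0 on failure */
		handle = afmalloc_alloc(pool, len);
		if (handle) {
			/* chunks aren't contiguous, so go through store/load */
			afmalloc_store(pool, handle, src, len);
			afmalloc_load(pool, handle, dst, len);
			afmalloc_free(pool, handle);
			ret = 0;
		}

		afmalloc_destroy_pool(pool);
		return ret;
	}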

As an optimization, it uses a different approach for power of 2 sized
requests. Instead of returning a handle for metadata, it adds a tag to the
pointer returned by the SLAB allocator and returns that value directly as
the handle. With this tag, afmalloc can recognize whether a handle refers to
metadata or not and process it accordingly. This optimization saves some
memory.
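
Concretely, the tag is just the low bit of the pointer, which is always
clear for kmalloc'ed memory because of its alignment. The sketch below is
only a conceptual illustration; the real helpers are DIRECT_ENTRY,
mem_to_direct_entry() and friends in mm/afmalloc.c further down:

	#define DIRECT_ENTRY	(0x1UL)

	/* kmalloc() memory is at least pointer aligned, so bit 0 is free */
	static unsigned long mem_to_direct_handle(void *mem)
	{
		return (unsigned long)mem | DIRECT_ENTRY;
	}

	static bool handle_is_direct(unsigned long handle)
	{
		return handle & DIRECT_ENTRY;
	}

	static void *direct_handle_to_mem(unsigned long handle)
	{
		return (void *)(handle & ~DIRECT_ENTRY);
	}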

Although afmalloc uses some memory for metadata, overall memory utilization
is really good because the power of 2 sized objects have zero internal
fragmentation. Although zsmalloc has many size classes, it still has
considerable internal fragmentation.

In a workload with many frees, memory can become fragmented as with
zsmalloc, but there is a big difference: the unused portions of memory are
ordinary SLAB memory, so other subsystems can use them. Therefore,
fragmented memory is not a big problem for this allocator.

An extra benefit of this design is NUMA awareness. This allocator gets its
real memory from the SLAB allocator, which considers the client's NUMA
affinity, so the allocated memory is NUMA-friendly. Currently, zsmalloc and
zbud, which are the backends of zram and zswap respectively, are not NUMA
aware, so memory from a remote node can be handed back to the requestor. I
think that could be solved easily if NUMA awareness turns out to be a real
problem, but it may increase fragmentation depending on the number of nodes.
In any case, there is no NUMA awareness issue in this allocator.

Although I'd like to replace zsmalloc with this allocator, that isn't
possible, because zsmalloc supports HIGHMEM. In the 32-bit world, SLAB
memory would be very limited, so HIGHMEM support is a real advantage of
zsmalloc. Since there is no HIGHMEM on 32-bit low-memory devices or in the
64-bit world, this allocator may be a good option for such systems. I
haven't deeply considered whether this allocator can replace zbud or not.

Below are the results of my simple test.
(The zsmalloc used in these experiments is patched with my previous patch:
zsmalloc: merge size_class to reduce fragmentation.)

TEST ENV: EXT4 on zram, mount with discard option
WORKLOAD: untar the kernel source, then remove directories in descending
order of size (drivers arch fs sound include).

Each line shows orig_data_size, compr_data_size, mem_used_total, the
fragmentation overhead (mem_used - compr_data_size) and the overhead ratio
(overhead to compr_data_size), respectively, after the untar and remove
operations are executed. For example, in the first zsmalloc line below, the
overhead is 209.81MB - 199.18MB, about 10.6MB, which is roughly 5.3% of
compr_data_size. In the afmalloc case, overhead is calculated from the
before/after difference of 'SUnreclaim' in /proc/meminfo. The afmalloc
output also has two more columns: real_overhead, which counts metadata usage
plus internal fragmentation, and the ratio of real_overhead to
compr_data_size. Unlike with zsmalloc, only this metadata and internally
fragmented memory is unusable by other subsystems, so comparing
real_overhead in afmalloc with overhead in zsmalloc seems to be the proper
comparison.

* untar-merge.out

orig_size compr_size used_size overhead overhead_ratio
526.23MB 199.18MB 209.81MB  10.64MB 5.34%
288.68MB  97.45MB 104.08MB   6.63MB 6.80%
177.68MB  61.14MB  66.93MB   5.79MB 9.47%
146.83MB  47.34MB  52.79MB   5.45MB 11.51%
124.52MB  38.87MB  44.30MB   5.43MB 13.96%
104.29MB  31.70MB  36.83MB   5.13MB 16.19%

* untar-afmalloc.out

orig_size compr_size used_size overhead overhead_ratio real real-ratio
526.27MB 199.18MB 206.37MB   8.00MB 4.02%   7.19MB 3.61%
288.71MB  97.45MB 101.25MB   5.86MB 6.01%   3.80MB 3.90%
177.71MB  61.14MB  63.44MB   4.39MB 7.19%   2.30MB 3.76%
146.86MB  47.34MB  49.20MB   3.97MB 8.39%   1.86MB 3.93%
124.55MB  38.88MB  40.41MB   3.71MB 9.54%   1.53MB 3.95%
104.32MB  31.70MB  32.96MB   3.43MB 10.81%   1.26MB 3.96%

As you can see in the results above, the real_overhead_ratio in afmalloc is
just 3% ~ 4%, while the overhead_ratio of zsmalloc varies from 5% to 17%.

And the 4% ~ 11% overhead_ratio of afmalloc is also slightly better than
zsmalloc's 5% ~ 17% overhead_ratio.

Below is another simple test to check the effect of fragmentation under an
alloc/free repetition workload.

TEST ENV: EXT4 on zram, mount with discard option
WORKLOAD: untar the kernel source, then remove directories in descending
order of size (drivers arch fs sound include). Repeat this untar and remove
cycle 10 times.

* untar-merge.out

orig_size compr_size used_size overhead overhead_ratio
526.24MB 199.18MB 209.79MB  10.61MB 5.33%
288.69MB  97.45MB 104.09MB   6.64MB 6.81%
177.69MB  61.14MB  66.89MB   5.75MB 9.40%
146.84MB  47.34MB  52.77MB   5.43MB 11.46%
124.53MB  38.88MB  44.28MB   5.40MB 13.90%
104.29MB  31.71MB  36.87MB   5.17MB 16.29%
535.59MB 200.30MB 211.77MB  11.47MB 5.73%
294.84MB  98.28MB 106.24MB   7.97MB 8.11%
179.99MB  61.58MB  69.34MB   7.76MB 12.60%
148.67MB  47.75MB  55.19MB   7.43MB 15.57%
125.98MB  39.26MB  46.62MB   7.36MB 18.75%
105.05MB  32.03MB  39.18MB   7.15MB 22.32%
(snip...)
535.59MB 200.31MB 211.88MB  11.57MB 5.77%
294.84MB  98.28MB 106.62MB   8.34MB 8.49%
179.99MB  61.59MB  73.83MB  12.24MB 19.88%
148.67MB  47.76MB  59.58MB  11.82MB 24.76%
125.98MB  39.27MB  51.10MB  11.84MB 30.14%
105.05MB  32.04MB  43.68MB  11.64MB 36.31%
535.59MB 200.31MB 211.89MB  11.58MB 5.78%
294.84MB  98.28MB 106.68MB   8.40MB 8.55%
179.99MB  61.59MB  74.14MB  12.55MB 20.37%
148.67MB  47.76MB  59.94MB  12.18MB 25.50%
125.98MB  39.27MB  51.46MB  12.19MB 31.04%
105.05MB  32.04MB  44.01MB  11.97MB 37.35%

* untar-afmalloc.out

orig_size compr_size used_size overhead overhead_ratio real real-ratio
526.23MB 199.17MB 206.36MB   8.02MB 4.03%   7.19MB 3.61%
288.68MB  97.45MB 101.25MB   5.42MB 5.56%   3.80MB 3.90%
177.68MB  61.14MB  63.43MB   4.00MB 6.54%   2.30MB 3.76%
146.83MB  47.34MB  49.20MB   3.66MB 7.74%   1.86MB 3.93%
124.52MB  38.87MB  40.41MB   3.33MB 8.57%   1.54MB 3.96%
104.29MB  31.70MB  32.95MB   3.23MB 10.19%   1.26MB 3.97%
535.59MB 200.30MB 207.59MB   9.21MB 4.60%   7.29MB 3.64%
294.84MB  98.27MB 102.14MB   6.23MB 6.34%   3.87MB 3.94%
179.99MB  61.58MB  63.91MB   4.98MB 8.09%   2.33MB 3.78%
148.67MB  47.75MB  49.64MB   4.48MB 9.37%   1.89MB 3.95%
125.98MB  39.26MB  40.82MB   4.23MB 10.78%   1.56MB 3.97%
105.05MB  32.03MB  33.30MB   4.10MB 12.81%   1.27MB 3.98%
(snip...)
535.59MB 200.30MB 207.60MB   8.94MB 4.46%   7.29MB 3.64%
294.84MB  98.27MB 102.14MB   6.19MB 6.29%   3.87MB 3.94%
179.99MB  61.58MB  63.91MB   8.25MB 13.39%   2.33MB 3.79%
148.67MB  47.75MB  49.64MB   7.98MB 16.71%   1.89MB 3.96%
125.98MB  39.26MB  40.82MB   7.52MB 19.15%   1.56MB 3.98%
105.05MB  32.03MB  33.31MB   7.04MB 21.97%   1.28MB 3.98%
535.59MB 200.31MB 207.60MB   9.26MB 4.62%   7.30MB 3.64%
294.84MB  98.28MB 102.15MB   6.85MB 6.97%   3.87MB 3.94%
179.99MB  61.58MB  63.91MB   9.08MB 14.74%   2.33MB 3.79%
148.67MB  47.75MB  49.64MB   8.77MB 18.36%   1.89MB 3.96%
125.98MB  39.26MB  40.82MB   8.35MB 21.28%   1.56MB 3.98%
105.05MB  32.03MB  33.31MB   8.24MB 25.71%   1.28MB 3.98%

As you can see in the results above, fragmentation grows continuously with
each run. But the real_overhead_ratio in afmalloc is always just 3% ~ 4%,
while the overhead_ratio of zsmalloc varies from 5% up to 38%.
Fragmented slab memory can be used by the rest of the system, so we don't
have to worry much about the overhead metric in afmalloc. In any case, the
overhead metric is also better in afmalloc, at 4% ~ 26%.

As a result, I think that afmalloc is better than zsmalloc in terms of
memory efficiency. But I could be wrong, so any comments are welcome. :)

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 include/linux/afmalloc.h |   21 ++
 mm/Kconfig               |    7 +
 mm/Makefile              |    1 +
 mm/afmalloc.c            |  590 ++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 619 insertions(+)
 create mode 100644 include/linux/afmalloc.h
 create mode 100644 mm/afmalloc.c

diff --git a/include/linux/afmalloc.h b/include/linux/afmalloc.h
new file mode 100644
index 0000000..751ae56
--- /dev/null
+++ b/include/linux/afmalloc.h
@@ -0,0 +1,21 @@
+#define AFMALLOC_MIN_LEVEL (1)
+#ifdef CONFIG_64BIT
+#define AFMALLOC_MAX_LEVEL (7)	/* 4 + 4 + 8 * 7 = 64 */
+#else
+#define AFMALLOC_MAX_LEVEL (6)	/* 4 + 4 + 4 * 6 = 32 */
+#endif
+
+extern struct afmalloc_pool *afmalloc_create_pool(int max_level,
+			size_t max_size, gfp_t flags);
+extern void afmalloc_destroy_pool(struct afmalloc_pool *pool);
+extern size_t afmalloc_get_used_pages(struct afmalloc_pool *pool);
+extern unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len);
+extern void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle);
+extern size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
+			void *src, size_t len);
+extern size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
+			void *dst, size_t len);
+extern void *afmalloc_map_handle(struct afmalloc_pool *pool,
+			unsigned long handle, size_t len, bool read_only);
+extern void afmalloc_unmap_handle(struct afmalloc_pool *pool,
+			unsigned long handle);
diff --git a/mm/Kconfig b/mm/Kconfig
index e09cf0a..7869768 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -585,6 +585,13 @@ config ZSMALLOC
 	  returned by an alloc().  This handle must be mapped in order to
 	  access the allocated space.
 
+config ANTI_FRAGMENTATION_MALLOC
+	boolean "Anti-fragmentation memory allocator"
+	help
+	  Select this to store data into anti-fragmentation memory
+	  allocator. This helps to reduce internal/external
+	  fragmentation caused by storing arbitrary sized data.
+
 config PGTABLE_MAPPING
 	bool "Use page table mapping to access object in zsmalloc"
 	depends on ZSMALLOC
diff --git a/mm/Makefile b/mm/Makefile
index b2f18dc..d47b147 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -62,6 +62,7 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
 obj-$(CONFIG_ZPOOL)	+= zpool.o
 obj-$(CONFIG_ZBUD)	+= zbud.o
 obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
+obj-$(CONFIG_ANTI_FRAGMENTATION_MALLOC) += afmalloc.o
 obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
 obj-$(CONFIG_CMA)	+= cma.o
 obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
diff --git a/mm/afmalloc.c b/mm/afmalloc.c
new file mode 100644
index 0000000..83a5c61
--- /dev/null
+++ b/mm/afmalloc.c
@@ -0,0 +1,590 @@
+/*
+ * Anti Fragmentation Memory allocator
+ *
+ * Copyright (C) 2014 Joonsoo Kim
+ *
+ * Anti Fragmentation Memory allocator (aka afmalloc) is a special purpose
+ * allocator designed to deal with arbitrary sized object allocation
+ * efficiently in terms of memory utilization.
+ *
+ * The overall design is quite simple.
+ *
+ * If the request is for a power of 2 sized object, afmalloc allocates the
+ * object from the SLAB, adds a tag to it and returns it to the requestor.
+ * This tag is used to determine whether a handle refers to metadata or not.
+ *
+ * If the request isn't for a power of 2 sized object, afmalloc divides the
+ * size into power of 2 sized elements. For example, for a 400 byte request,
+ * chunks of 256, 128 and 16 bytes build up the 400 bytes. afmalloc allocates
+ * these chunks from the SLAB and allocates memory for metadata to keep the
+ * pointers to them. A conceptual representation of the metadata is below.
+ *
+ * Metadata for 400 bytes
+ * - Pointer for 256 bytes chunk
+ * - Pointer for 128 bytes chunk
+ * - Pointer for 16 bytes chunk
+ *
+ * After allocating all of them, afmalloc returns a handle for this metadata
+ * to the requestor, who can then load/store from/into the memory via it.
+ *
+ * Memory returned by afmalloc isn't contiguous, so using it requires
+ * special APIs. afmalloc_(load/store) handle load/store requests according
+ * to afmalloc's internal structure, so they can be used without any worry.
+ *
+ * If you want to use this memory like normal memory, you need to call
+ * afmalloc_map_handle before using it. This returns contiguous memory for
+ * the handle so that you can use it with normal memory operations.
+ * Unfortunately, only one object can be mapped per cpu at a time, and
+ * constructing this mapping has some overhead.
+ */
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <linux/afmalloc.h>
+#include <linux/highmem.h>
+#include <linux/sizes.h>
+#include <linux/module.h>
+
+#define afmalloc_OBJ_MIN_SIZE (32)
+
+#define DIRECT_ENTRY (0x1)
+
+struct afmalloc_pool {
+	spinlock_t lock;
+	gfp_t flags;
+	int max_level;
+	size_t max_size;
+	size_t size;
+};
+
+struct afmalloc_entry {
+	int level;
+	int alloced;
+	void *mem[];
+};
+
+struct afmalloc_mapped_info {
+	struct page *page;
+	size_t len;
+	bool read_only;
+};
+
+static struct afmalloc_mapped_info __percpu *mapped_info;
+
+static struct afmalloc_entry *mem_to_direct_entry(void *mem)
+{
+	return (struct afmalloc_entry *)((unsigned long)mem | DIRECT_ENTRY);
+}
+
+static void *direct_entry_to_mem(struct afmalloc_entry *entry)
+{
+	return (void *)((unsigned long)entry & ~DIRECT_ENTRY);
+}
+
+static bool is_direct_entry(struct afmalloc_entry *entry)
+{
+	return (unsigned long)entry & DIRECT_ENTRY;
+}
+
+static unsigned long entry_to_handle(struct afmalloc_entry *entry)
+{
+	return (unsigned long)entry;
+}
+
+static struct afmalloc_entry *handle_to_entry(unsigned long handle)
+{
+	return (struct afmalloc_entry *)handle;
+}
+
+static bool valid_level(int max_level)
+{
+	if (max_level < AFMALLOC_MIN_LEVEL)
+		return false;
+
+	if (max_level > AFMALLOC_MAX_LEVEL)
+		return false;
+
+	return true;
+}
+
+static bool valid_flags(gfp_t flags)
+{
+	if (flags & __GFP_HIGHMEM)
+		return false;
+
+	return true;
+}
+
+/**
+ * afmalloc_create_pool - Creates an allocation pool to work from.
+ * @max_level: limit on the number of chunks a request may be split into
+ * @max_size: limit on total allocation size from this pool
+ * @flags: allocation flags used to allocate memory
+ *
+ * This function must be called before anything else when using
+ * the afmalloc allocator.
+ *
+ * On success, a pointer to the newly created pool is returned,
+ * otherwise NULL.
+ */
+struct afmalloc_pool *afmalloc_create_pool(int max_level, size_t max_size,
+					gfp_t flags)
+{
+	struct afmalloc_pool *pool;
+
+	if (!valid_level(max_level))
+		return NULL;
+
+	if (!valid_flags(flags))
+		return NULL;
+
+	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
+	if (!pool)
+		return NULL;
+
+	spin_lock_init(&pool->lock);
+	pool->flags = flags;
+	pool->max_level = max_level;
+	pool->max_size = max_size;
+	pool->size = 0;
+
+	return pool;
+}
+EXPORT_SYMBOL(afmalloc_create_pool);
+
+void afmalloc_destroy_pool(struct afmalloc_pool *pool)
+{
+	kfree(pool);
+}
+EXPORT_SYMBOL(afmalloc_destroy_pool);
+
+size_t afmalloc_get_used_pages(struct afmalloc_pool *pool)
+{
+	size_t size;
+
+	spin_lock(&pool->lock);
+	size = pool->size >> PAGE_SHIFT;
+	spin_unlock(&pool->lock);
+
+	return size;
+}
+EXPORT_SYMBOL(afmalloc_get_used_pages);
+
+static void free_entry(struct afmalloc_pool *pool, struct afmalloc_entry *entry,
+			bool calc_size)
+{
+	int i;
+	int level;
+	int alloced;
+
+	if (is_direct_entry(entry)) {
+		void *mem = direct_entry_to_mem(entry);
+
+		alloced = ksize(mem);
+		kfree(mem);
+		goto out;
+	}
+
+	level = entry->level;
+	alloced = entry->alloced;
+	for (i = 0; i < level; i++)
+		kfree(entry->mem[i]);
+
+	kfree(entry);
+
+out:
+	if (calc_size && alloced) {
+		spin_lock(&pool->lock);
+		pool->size -= alloced;
+		spin_unlock(&pool->lock);
+	}
+}
+
+static int calculate_level(struct afmalloc_pool *pool, size_t len)
+{
+	int level = 0;
+	size_t down_size, up_size;
+
+	if (len <= afmalloc_OBJ_MIN_SIZE)
+		goto out;
+
+	while (1) {
+		down_size = rounddown_pow_of_two(len);
+		if (down_size >= len)
+			break;
+
+		up_size = roundup_pow_of_two(len);
+		if (up_size - len <= afmalloc_OBJ_MIN_SIZE)
+			break;
+
+		len -= down_size;
+		level++;
+	}
+
+out:
+	level++;
+	return min(level, pool->max_level);
+}
+
+static int estimate_alloced(struct afmalloc_pool *pool, int level, size_t len)
+{
+	int i, alloced = 0;
+	size_t size;
+
+	for (i = 0; i < level - 1; i++) {
+		size = rounddown_pow_of_two(len);
+		alloced += size;
+		len -= size;
+	}
+
+	if (len < afmalloc_OBJ_MIN_SIZE)
+		size = afmalloc_OBJ_MIN_SIZE;
+	else
+		size = roundup_pow_of_two(len);
+	alloced += size;
+
+	return alloced;
+}
+
+static void *alloc_entry(struct afmalloc_pool *pool, size_t len)
+{
+	int i, level;
+	size_t size;
+	int alloced = 0;
+	size_t remain = len;
+	struct afmalloc_entry *entry;
+	void *mem;
+
+	/*
+	 * Determine whether the requested size is a power of 2 or not.
+	 * If not, determine how many chunks are needed.
+	 */
+	level = calculate_level(pool, len);
+	if (level == 1)
+		goto alloc_direct_entry;
+
+	size = sizeof(void *) * level + sizeof(struct afmalloc_entry);
+	entry = kzalloc(size, pool->flags);	/* zeroed for error path */
+	if (!entry)
+		return NULL;
+
+	size = ksize(entry);
+	alloced += size;
+
+	/*
+	 * Even though the request isn't power of 2 sized, it is sometimes
+	 * better to allocate a single power of 2 chunk due to metadata waste.
+	 */
+	if (size + estimate_alloced(pool, level, len)
+				>= roundup_pow_of_two(len)) {
+		kfree(entry);
+		goto alloc_direct_entry;
+	}
+
+	entry->level = level;
+	for (i = 0; i < level - 1; i++) {
+		size = rounddown_pow_of_two(remain);
+		entry->mem[i] = kmalloc(size, pool->flags);
+		if (!entry->mem[i])
+			goto err;
+
+		alloced += size;
+		remain -= size;
+	}
+
+	if (remain < afmalloc_OBJ_MIN_SIZE)
+		size = afmalloc_OBJ_MIN_SIZE;
+	else
+		size = roundup_pow_of_two(remain);
+	entry->mem[i] = kmalloc(size, pool->flags);
+	if (!entry->mem[i])
+		goto err;
+
+	alloced += size;
+	entry->alloced = alloced;
+	goto alloc_complete;
+
+alloc_direct_entry:
+	mem = kmalloc(len, pool->flags);
+	if (!mem)
+		return NULL;
+
+	alloced = ksize(mem);
+	entry = mem_to_direct_entry(mem);
+
+alloc_complete:
+	spin_lock(&pool->lock);
+	if (pool->size + alloced > pool->max_size) {
+		spin_unlock(&pool->lock);
+		goto err;
+	}
+
+	pool->size += alloced;
+	spin_unlock(&pool->lock);
+
+	return entry;
+
+err:
+	free_entry(pool, entry, false);
+
+	return NULL;
+}
+
+static bool valid_alloc_arg(size_t len)
+{
+	if (!len)
+		return false;
+
+	return true;
+}
+
+/**
+ * afmalloc_alloc - Allocate block of given length from pool
+ * @pool: pool to allocate the object from
+ * @len: length of block to allocate
+ *
+ * On success, handle to the allocated object is returned,
+ * otherwise 0.
+ */
+unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len)
+{
+	struct afmalloc_entry *entry;
+
+	if (!valid_alloc_arg(len))
+		return 0;
+
+	entry = alloc_entry(pool, len);
+	if (!entry)
+		return 0;
+
+	return entry_to_handle(entry);
+}
+EXPORT_SYMBOL(afmalloc_alloc);
+
+static void __afmalloc_free(struct afmalloc_pool *pool,
+			struct afmalloc_entry *entry)
+{
+	free_entry(pool, entry, true);
+}
+
+void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle)
+{
+	struct afmalloc_entry *entry;
+
+	entry = handle_to_entry(handle);
+	if (!entry)
+		return;
+
+	__afmalloc_free(pool, entry);
+}
+EXPORT_SYMBOL(afmalloc_free);
+
+static void __afmalloc_store(struct afmalloc_pool *pool,
+			struct afmalloc_entry *entry, void *src, size_t len)
+{
+	int i, level = entry->level;
+	size_t size;
+	size_t offset = 0;
+
+	if (is_direct_entry(entry)) {
+		memcpy(direct_entry_to_mem(entry), src, len);
+		return;
+	}
+
+	for (i = 0; i < level - 1; i++) {
+		size = rounddown_pow_of_two(len);
+		memcpy(entry->mem[i], src + offset, size);
+		offset += size;
+		len -= size;
+	}
+	memcpy(entry->mem[i], src + offset, len);
+}
+
+static bool valid_store_arg(struct afmalloc_entry *entry, void *src, size_t len)
+{
+	if (!entry)
+		return false;
+
+	if (!src || !len)
+		return false;
+
+	return true;
+}
+
+/**
+ * afmalloc_store - store data into allocated object from handle.
+ * @pool: pool from which the object was allocated
+ * @handle: handle returned from afmalloc
+ * @src: memory address of source data
+ * @len: length in bytes of desired store
+ *
+ * To store data into an object allocated from afmalloc, the object must
+ * either be mapped first or be accessed through the afmalloc-specific
+ * load/store functions. These functions handle load/store requests
+ * according to afmalloc's internal structure.
+ */
+size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
+			void *src, size_t len)
+{
+	struct afmalloc_entry *entry;
+
+	entry = handle_to_entry(handle);
+	if (!valid_store_arg(entry, src, len))
+		return 0;
+
+	__afmalloc_store(pool, entry, src, len);
+
+	return len;
+}
+EXPORT_SYMBOL(afmalloc_store);
+
+static void __afmalloc_load(struct afmalloc_pool *pool,
+			struct afmalloc_entry *entry, void *dst, size_t len)
+{
+	int i, level = entry->level;
+	size_t size;
+	size_t offset = 0;
+
+	if (is_direct_entry(entry)) {
+		memcpy(dst, direct_entry_to_mem(entry), len);
+		return;
+	}
+
+	for (i = 0; i < level - 1; i++) {
+		size = rounddown_pow_of_two(len);
+		memcpy(dst + offset, entry->mem[i], size);
+		offset += size;
+		len -= size;
+	}
+	memcpy(dst + offset, entry->mem[i], len);
+}
+
+static bool valid_load_arg(struct afmalloc_entry *entry, void *dst, size_t len)
+{
+	if (!entry)
+		return false;
+
+	if (!dst || !len)
+		return false;
+
+	return true;
+}
+
+size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
+		void *dst, size_t len)
+{
+	struct afmalloc_entry *entry;
+
+	entry = handle_to_entry(handle);
+	if (!valid_load_arg(entry, dst, len))
+		return 0;
+
+	__afmalloc_load(pool, entry, dst, len);
+
+	return len;
+}
+EXPORT_SYMBOL(afmalloc_load);
+
+/**
+ * afmalloc_map_handle - get address of allocated object from handle.
+ * @pool: pool from which the object was allocated
+ * @handle: handle returned from afmalloc
+ * @len: length in bytes of desired mapping
+ * @read_only: if true, the mapped region is not written back into the
+ *	object when it is unmapped
+ *
+ * Before using an object allocated from afmalloc, it must be mapped using
+ * this function. When done with the object, it must be unmapped using
+ * afmalloc_unmap_handle.
+ *
+ * Only one object can be mapped per cpu at a time. There is no protection
+ * against nested mappings.
+ *
+ * This function returns with preemption disabled.
+ */
+void *afmalloc_map_handle(struct afmalloc_pool *pool, unsigned long handle,
+			size_t len, bool read_only)
+{
+	int cpu;
+	struct afmalloc_entry *entry;
+	struct afmalloc_mapped_info *info;
+	void *addr;
+
+	entry = handle_to_entry(handle);
+	if (!entry)
+		return NULL;
+
+	cpu = get_cpu();
+	if (is_direct_entry(entry))
+		return direct_entry_to_mem(entry);
+
+	info = per_cpu_ptr(mapped_info, cpu);
+	addr = page_address(info->page);
+	info->len = len;
+	info->read_only = read_only;
+	__afmalloc_load(pool, entry, addr, len);
+	return addr;
+}
+EXPORT_SYMBOL(afmalloc_map_handle);
+
+void afmalloc_unmap_handle(struct afmalloc_pool *pool, unsigned long handle)
+{
+	struct afmalloc_entry *entry;
+	struct afmalloc_mapped_info *info;
+	void *addr;
+
+	entry = handle_to_entry(handle);
+	if (!entry)
+		return;
+
+	if (is_direct_entry(entry))
+		goto out;
+
+	info = this_cpu_ptr(mapped_info);
+	if (info->read_only)
+		goto out;
+
+	addr = page_address(info->page);
+	__afmalloc_store(pool, entry, addr, info->len);
+
+out:
+	put_cpu();
+}
+EXPORT_SYMBOL(afmalloc_unmap_handle);
+
+static int __init afmalloc_init(void)
+{
+	int cpu;
+
+	mapped_info = alloc_percpu(struct afmalloc_mapped_info);
+	if (!mapped_info)
+		return -ENOMEM;
+
+	for_each_possible_cpu(cpu) {
+		struct page *page;
+
+		page = alloc_pages(GFP_KERNEL, 0);
+		if (!page)
+			goto err;
+
+		per_cpu_ptr(mapped_info, cpu)->page = page;
+	}
+
+	return 0;
+
+err:
+	for_each_possible_cpu(cpu) {
+		struct page *page;
+
+		page = per_cpu_ptr(mapped_info, cpu)->page;
+		if (page)
+			__free_pages(page, 0);
+	}
+	free_percpu(mapped_info);
+	return -ENOMEM;
+}
+module_init(afmalloc_init);
+
+MODULE_AUTHOR("Joonsoo Kim <iamjoonsoo.kim@lge.com>");
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread


* [RFC PATCH 2/2] zram: make afmalloc as zram's backend memory allocator
  2014-09-26  6:53 ` Joonsoo Kim
@ 2014-09-26  6:53   ` Joonsoo Kim
  -1 siblings, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2014-09-26  6:53 UTC (permalink / raw)
  To: Andrew Morton, Minchan Kim
  Cc: Nitin Gupta, linux-mm, linux-kernel, Jerome Marchand,
	Sergey Senozhatsky, Dan Streetman, Luigi Semenzato, Mel Gorman,
	Hugh Dickins, Joonsoo Kim

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 drivers/block/zram/Kconfig    |    2 +-
 drivers/block/zram/zram_drv.c |   40 ++++++++++++----------------------------
 drivers/block/zram/zram_drv.h |    4 ++--
 3 files changed, 15 insertions(+), 31 deletions(-)

diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index 6489c0f..1c09a11 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -1,6 +1,6 @@
 config ZRAM
 	tristate "Compressed RAM block device support"
-	depends on BLOCK && SYSFS && ZSMALLOC
+	depends on BLOCK && SYSFS && ANTI_FRAGMENTATION_MALLOC
 	select LZO_COMPRESS
 	select LZO_DECOMPRESS
 	default n
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index bc20fe1..545e43f 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -103,7 +103,7 @@ static ssize_t mem_used_total_show(struct device *dev,
 
 	down_read(&zram->init_lock);
 	if (init_done(zram))
-		val = zs_get_total_pages(meta->mem_pool);
+		val = afmalloc_get_used_pages(meta->mem_pool);
 	up_read(&zram->init_lock);
 
 	return scnprintf(buf, PAGE_SIZE, "%llu\n", val << PAGE_SHIFT);
@@ -173,16 +173,12 @@ static ssize_t mem_used_max_store(struct device *dev,
 	int err;
 	unsigned long val;
 	struct zram *zram = dev_to_zram(dev);
-	struct zram_meta *meta = zram->meta;
 
 	err = kstrtoul(buf, 10, &val);
 	if (err || val != 0)
 		return -EINVAL;
 
 	down_read(&zram->init_lock);
-	if (init_done(zram))
-		atomic_long_set(&zram->stats.max_used_pages,
-				zs_get_total_pages(meta->mem_pool));
 	up_read(&zram->init_lock);
 
 	return len;
@@ -309,7 +305,7 @@ static inline int valid_io_request(struct zram *zram, struct bio *bio)
 
 static void zram_meta_free(struct zram_meta *meta)
 {
-	zs_destroy_pool(meta->mem_pool);
+	afmalloc_destroy_pool(meta->mem_pool);
 	vfree(meta->table);
 	kfree(meta);
 }
@@ -328,7 +324,8 @@ static struct zram_meta *zram_meta_alloc(u64 disksize)
 		goto free_meta;
 	}
 
-	meta->mem_pool = zs_create_pool(GFP_NOIO | __GFP_HIGHMEM);
+	meta->mem_pool = afmalloc_create_pool(AFMALLOC_MAX_LEVEL,
+						disksize, GFP_NOIO);
 	if (!meta->mem_pool) {
 		pr_err("Error creating memory pool\n");
 		goto free_table;
@@ -405,7 +402,7 @@ static void zram_free_page(struct zram *zram, size_t index)
 		return;
 	}
 
-	zs_free(meta->mem_pool, handle);
+	afmalloc_free(meta->mem_pool, handle);
 
 	atomic64_sub(zram_get_obj_size(meta, index),
 			&zram->stats.compr_data_size);
@@ -434,12 +431,12 @@ static int zram_decompress_page(struct zram *zram, char *mem, u32 index)
 		return 0;
 	}
 
-	cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_RO);
+	cmem = afmalloc_map_handle(meta->mem_pool, handle, size, true);
 	if (size == PAGE_SIZE)
 		copy_page(mem, cmem);
 	else
 		ret = zcomp_decompress(zram->comp, cmem, size, mem);
-	zs_unmap_object(meta->mem_pool, handle);
+	afmalloc_unmap_handle(meta->mem_pool, handle);
 	bit_spin_unlock(ZRAM_ACCESS, &meta->table[index].value);
 
 	/* Should NEVER happen. Return bio error if it does. */
@@ -523,11 +520,10 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 	size_t clen;
 	unsigned long handle;
 	struct page *page;
-	unsigned char *user_mem, *cmem, *src, *uncmem = NULL;
+	unsigned char *user_mem, *src, *uncmem = NULL;
 	struct zram_meta *meta = zram->meta;
 	struct zcomp_strm *zstrm;
 	bool locked = false;
-	unsigned long alloced_pages;
 
 	page = bvec->bv_page;
 	if (is_partial_io(bvec)) {
@@ -589,7 +585,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 			src = uncmem;
 	}
 
-	handle = zs_malloc(meta->mem_pool, clen);
+	handle = afmalloc_alloc(meta->mem_pool, clen);
 	if (!handle) {
 		pr_info("Error allocating memory for compressed page: %u, size=%zu\n",
 			index, clen);
@@ -597,28 +593,16 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index,
 		goto out;
 	}
 
-	alloced_pages = zs_get_total_pages(meta->mem_pool);
-	if (zram->limit_pages && alloced_pages > zram->limit_pages) {
-		zs_free(meta->mem_pool, handle);
-		ret = -ENOMEM;
-		goto out;
-	}
-
-	update_used_max(zram, alloced_pages);
-
-	cmem = zs_map_object(meta->mem_pool, handle, ZS_MM_WO);
-
 	if ((clen == PAGE_SIZE) && !is_partial_io(bvec)) {
 		src = kmap_atomic(page);
-		copy_page(cmem, src);
+		afmalloc_store(meta->mem_pool, handle, src, clen);
 		kunmap_atomic(src);
 	} else {
-		memcpy(cmem, src, clen);
+		afmalloc_store(meta->mem_pool, handle, src, clen);
 	}
 
 	zcomp_strm_release(zram->comp, zstrm);
 	locked = false;
-	zs_unmap_object(meta->mem_pool, handle);
 
 	/*
 	 * Free memory associated with this sector
@@ -725,7 +709,7 @@ static void zram_reset_device(struct zram *zram, bool reset_capacity)
 		if (!handle)
 			continue;
 
-		zs_free(meta->mem_pool, handle);
+		afmalloc_free(meta->mem_pool, handle);
 	}
 
 	zcomp_destroy(zram->comp);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index c6ee271..1a116c0 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -16,7 +16,7 @@
 #define _ZRAM_DRV_H_
 
 #include <linux/spinlock.h>
-#include <linux/zsmalloc.h>
+#include <linux/afmalloc.h>
 
 #include "zcomp.h"
 
@@ -95,7 +95,7 @@ struct zram_stats {
 
 struct zram_meta {
 	struct zram_table_entry *table;
-	struct zs_pool *mem_pool;
+	struct afmalloc_pool *mem_pool;
 };
 
 struct zram {
-- 
1.7.9.5


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
  2014-09-26  6:53 ` Joonsoo Kim
@ 2014-09-29 15:41   ` Dan Streetman
  -1 siblings, 0 replies; 16+ messages in thread
From: Dan Streetman @ 2014-09-29 15:41 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Minchan Kim, Nitin Gupta, Linux-MM, linux-kernel,
	Jerome Marchand, Sergey Senozhatsky, Luigi Semenzato, Mel Gorman,
	Hugh Dickins

On Fri, Sep 26, 2014 at 2:53 AM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> If you know useful place to use this allocator, please let me know.
>
> This is brand-new allocator, called anti-fragmentation memory allocator
> (aka afmalloc), in order to deal with arbitrary sized object allocation
> efficiently. zram and zswap uses arbitrary sized object to store
> compressed data so they can use this allocator. If there are any other
> use cases, they can use it, too.
>
> This work is motivated by observation of fragmentation on zsmalloc which
> intended for storing arbitrary sized object with low fragmentation.
> Although it works well on allocation-intensive workload, memory could be
> highly fragmented after many free occurs. In some cases, unused memory due
> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> problem is that other subsystem cannot use these unused memory. These
> fragmented memory are zsmalloc specific, so most of other subsystem cannot
> use it until zspage is freed to page allocator.
>
> I guess that there are similar fragmentation problem in zbud, but, I
> didn't deeply investigate it.
>
> This new allocator uses SLAB allocator to solve above problems. When
> request comes, it returns handle that is pointer of metatdata to point
> many small chunks. These small chunks are in power of 2 size and
> build up whole requested memory. We can easily acquire these chunks
> using SLAB allocator. Following is conceptual represetation of metadata
> used in this allocator to help understanding of this allocator.
>
> Handle A for 400 bytes
> {
>         Pointer for 256 bytes chunk
>         Pointer for 128 bytes chunk
>         Pointer for 16 bytes chunk
>
>         (256 + 128 + 16 = 400)
> }
>
> As you can see, 400 bytes memory are not contiguous in afmalloc so that
> allocator specific store/load functions are needed. These require some
> computation overhead and I guess that this is the only drawback this
> allocator has.

This also requires additional memory copying, for each map/unmap, no?
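
For reference, a minimal caller-side sketch of where those copies happen,
based only on the API declared in the afmalloc.h hunk quoted further down
(it assumes <linux/afmalloc.h> and <linux/sizes.h>; the pool limit, GFP
flags and buffer size are illustrative, and most error handling is omitted):

static int afmalloc_example(void)
{
	struct afmalloc_pool *pool;
	unsigned long handle;
	char buf[400] = { 0 };
	void *mem;

	pool = afmalloc_create_pool(AFMALLOC_MAX_LEVEL, SZ_16M, GFP_NOIO);
	if (!pool)
		return -ENOMEM;

	/* 400 bytes is not a power of 2, so this builds a multi-chunk
	 * handle (conceptually 256 + 128 + 16 as in the description). */
	handle = afmalloc_alloc(pool, sizeof(buf));
	if (!handle)
		goto out;

	/* Both of these memcpy() between buf and the scattered chunks. */
	afmalloc_store(pool, handle, buf, sizeof(buf));
	afmalloc_load(pool, handle, buf, sizeof(buf));

	/* map/unmap adds one more copy each way for such handles: the map
	 * loads the chunks into a per-cpu bounce page, and the unmap writes
	 * the page back unless the mapping was read-only. */
	mem = afmalloc_map_handle(pool, handle, sizeof(buf), true);
	if (mem) {
		/* ... use mem like ordinary contiguous memory ... */
		afmalloc_unmap_handle(pool, handle);
	}

	afmalloc_free(pool, handle);
out:
	afmalloc_destroy_pool(pool);
	return 0;
}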

>
> For optimization, it uses another approach for power of 2 sized request.
> Instead of returning handle for metadata, it adds tag on pointer from
> SLAB allocator and directly returns this value as handle. With this tag,
> afmalloc can recognize whether handle is for metadata or not and do proper
> processing on it. This optimization can save some memory.
>
> Although afmalloc use some memory for metadata, overall utilization of
> memory is really good due to zero internal fragmentation by using power
> of 2 sized object. Although zsmalloc has many size class, there is
> considerable internal fragmentation in zsmalloc.
>
> In workload that needs many free, memory could be fragmented like
> zsmalloc, but, there is big difference. These unused portion of memory
> are SLAB specific memory so that other subsystem can use it. Therefore,
> fragmented memory could not be a big problem in this allocator.
>
> Extra benefit of this allocator design is NUMA awareness. This allocator
> allocates real memory from SLAB allocator. SLAB considers client's NUMA
> affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
> and zbud which are backend of zram and zswap, respectively, are not NUMA
> awareness so that remote node's memory could be returned to requestor.
> I think that this could be solved easily if NUMA awareness turns out to be
> a real problem, but it may enlarge fragmentation depending on the number of
> nodes. Anyway, there is no NUMA awareness issue in this allocator.
>
> Although I'd like to replace zsmalloc with this allocator, that isn't
> possible, because zsmalloc supports HIGHMEM. In the 32-bit world, SLAB
> memory would be very limited, so supporting HIGHMEM is a real advantage of
> zsmalloc. Because there is no HIGHMEM on 32-bit low-memory devices or in
> the 64-bit world, this allocator may be a good option for such systems. I
> didn't deeply consider whether this allocator can replace zbud or not.

While it looks like there may be some situations that benefit from
this, this won't work for all cases (as you mention), so maybe zpool
can allow zram to choose between zsmalloc and afmalloc.
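
If it did go that way, the glue might look roughly like this -- a sketch
only, not part of this patchset: the zpool_driver callbacks below follow my
reading of the 3.17-era include/linux/zpool.h and may not match exactly, and
all of the afmalloc_zpool_* wrapper names are made up here:

#include <linux/zpool.h>
#include <linux/afmalloc.h>

static void *afmalloc_zpool_create(gfp_t gfp, struct zpool_ops *ops)
{
	/* zpool has no max_size parameter, so a limit would have to come
	 * from somewhere else (or be effectively unlimited, as here). */
	return afmalloc_create_pool(AFMALLOC_MAX_LEVEL, SIZE_MAX, gfp);
}

static void afmalloc_zpool_destroy(void *pool)
{
	afmalloc_destroy_pool(pool);
}

static int afmalloc_zpool_malloc(void *pool, size_t size, gfp_t gfp,
				 unsigned long *handle)
{
	*handle = afmalloc_alloc(pool, size);
	return *handle ? 0 : -ENOMEM;
}

static void afmalloc_zpool_free(void *pool, unsigned long handle)
{
	afmalloc_free(pool, handle);
}

static u64 afmalloc_zpool_total_size(void *pool)
{
	return (u64)afmalloc_get_used_pages(pool) << PAGE_SHIFT;
}

static struct zpool_driver afmalloc_zpool_driver = {
	.type		= "afmalloc",
	.owner		= THIS_MODULE,
	.create		= afmalloc_zpool_create,
	.destroy	= afmalloc_zpool_destroy,
	.malloc		= afmalloc_zpool_malloc,
	.free		= afmalloc_zpool_free,
	.total_size	= afmalloc_zpool_total_size,
};

The driver would be registered with zpool_register_driver() at init time.
The awkward part is the map/unmap pair: zpool's map callback does not pass a
length, while afmalloc_map_handle() needs one, so either zpool would have to
grow a length argument or afmalloc would have to remember the allocation
length per handle. (There is also no obvious way to implement the shrink
callback, since afmalloc cannot evict.)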

>
> Below is the result of my simple test.
> (zsmalloc used in experiments is patched with my previous patch:
> zsmalloc: merge size_class to reduce fragmentation)
>
> TEST ENV: EXT4 on zram, mount with discard option
> WORKLOAD: untar kernel source, remove dir in descending order in size.
> (drivers arch fs sound include)
>
> Each line reports orig_data_size, compr_data_size, mem_used_total,
> fragmentation overhead (mem_used - compr_data_size) and overhead ratio
> (overhead relative to compr_data_size), respectively, after the untar and
> remove operations are executed. In the afmalloc case, overhead is
> calculated from the before/after difference of 'SUnreclaim' in
> /proc/meminfo. There are two more columns in the afmalloc case:
> real_overhead, which represents metadata usage plus internal
> fragmentation, and its ratio, real_overhead to compr_data_size. Unlike
> with zsmalloc, only the metadata and internally fragmented memory cannot
> be used by other subsystems. So, comparing real_overhead in afmalloc with
> overhead in zsmalloc seems to be the proper comparison.
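
As a quick worked check of those columns, using the first row of
untar-merge.out below: overhead = 209.81MB - 199.18MB ~= 10.64MB (the last
0.01MB is rounding of the underlying byte counts), and overhead_ratio =
10.64MB / 199.18MB ~= 5.34%.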
>
> * untar-merge.out
>
> orig_size compr_size used_size overhead overhead_ratio
> 526.23MB 199.18MB 209.81MB  10.64MB 5.34%
> 288.68MB  97.45MB 104.08MB   6.63MB 6.80%
> 177.68MB  61.14MB  66.93MB   5.79MB 9.47%
> 146.83MB  47.34MB  52.79MB   5.45MB 11.51%
> 124.52MB  38.87MB  44.30MB   5.43MB 13.96%
> 104.29MB  31.70MB  36.83MB   5.13MB 16.19%
>
> * untar-afmalloc.out
>
> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> 526.27MB 199.18MB 206.37MB   8.00MB 4.02%   7.19MB 3.61%
> 288.71MB  97.45MB 101.25MB   5.86MB 6.01%   3.80MB 3.90%
> 177.71MB  61.14MB  63.44MB   4.39MB 7.19%   2.30MB 3.76%
> 146.86MB  47.34MB  49.20MB   3.97MB 8.39%   1.86MB 3.93%
> 124.55MB  38.88MB  40.41MB   3.71MB 9.54%   1.53MB 3.95%
> 104.32MB  31.70MB  32.96MB   3.43MB 10.81%   1.26MB 3.96%
>
> As you can see from the above results, real_overhead_ratio in afmalloc is
> just 3% ~ 4%, while overhead_ratio in zsmalloc varies between 5% and 17%.
>
> The 4% ~ 11% overhead_ratio in afmalloc is also slightly better
> than the 5% ~ 17% overhead_ratio in zsmalloc.

I think the key will be scaling up this test more.  What does it look
like when using 20G or more?

It certainly looks better when using (relatively) small amounts of data, though.

>
> Below is another simple test to check the fragmentation effect in an
> alloc/free repetition workload.
>
> TEST ENV: EXT4 on zram, mount with discard option
> WORKLOAD: untar kernel source, remove dir in descending order in size
> (drivers arch fs sound include). Repeat this untar and remove 10 times.
>
> * untar-merge.out
>
> orig_size compr_size used_size overhead overhead_ratio
> 526.24MB 199.18MB 209.79MB  10.61MB 5.33%
> 288.69MB  97.45MB 104.09MB   6.64MB 6.81%
> 177.69MB  61.14MB  66.89MB   5.75MB 9.40%
> 146.84MB  47.34MB  52.77MB   5.43MB 11.46%
> 124.53MB  38.88MB  44.28MB   5.40MB 13.90%
> 104.29MB  31.71MB  36.87MB   5.17MB 16.29%
> 535.59MB 200.30MB 211.77MB  11.47MB 5.73%
> 294.84MB  98.28MB 106.24MB   7.97MB 8.11%
> 179.99MB  61.58MB  69.34MB   7.76MB 12.60%
> 148.67MB  47.75MB  55.19MB   7.43MB 15.57%
> 125.98MB  39.26MB  46.62MB   7.36MB 18.75%
> 105.05MB  32.03MB  39.18MB   7.15MB 22.32%
> (snip...)
> 535.59MB 200.31MB 211.88MB  11.57MB 5.77%
> 294.84MB  98.28MB 106.62MB   8.34MB 8.49%
> 179.99MB  61.59MB  73.83MB  12.24MB 19.88%
> 148.67MB  47.76MB  59.58MB  11.82MB 24.76%
> 125.98MB  39.27MB  51.10MB  11.84MB 30.14%
> 105.05MB  32.04MB  43.68MB  11.64MB 36.31%
> 535.59MB 200.31MB 211.89MB  11.58MB 5.78%
> 294.84MB  98.28MB 106.68MB   8.40MB 8.55%
> 179.99MB  61.59MB  74.14MB  12.55MB 20.37%
> 148.67MB  47.76MB  59.94MB  12.18MB 25.50%
> 125.98MB  39.27MB  51.46MB  12.19MB 31.04%
> 105.05MB  32.04MB  44.01MB  11.97MB 37.35%
>
> * untar-afmalloc.out
>
> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> 526.23MB 199.17MB 206.36MB   8.02MB 4.03%   7.19MB 3.61%
> 288.68MB  97.45MB 101.25MB   5.42MB 5.56%   3.80MB 3.90%
> 177.68MB  61.14MB  63.43MB   4.00MB 6.54%   2.30MB 3.76%
> 146.83MB  47.34MB  49.20MB   3.66MB 7.74%   1.86MB 3.93%
> 124.52MB  38.87MB  40.41MB   3.33MB 8.57%   1.54MB 3.96%
> 104.29MB  31.70MB  32.95MB   3.23MB 10.19%   1.26MB 3.97%
> 535.59MB 200.30MB 207.59MB   9.21MB 4.60%   7.29MB 3.64%
> 294.84MB  98.27MB 102.14MB   6.23MB 6.34%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   4.98MB 8.09%   2.33MB 3.78%
> 148.67MB  47.75MB  49.64MB   4.48MB 9.37%   1.89MB 3.95%
> 125.98MB  39.26MB  40.82MB   4.23MB 10.78%   1.56MB 3.97%
> 105.05MB  32.03MB  33.30MB   4.10MB 12.81%   1.27MB 3.98%
> (snip...)
> 535.59MB 200.30MB 207.60MB   8.94MB 4.46%   7.29MB 3.64%
> 294.84MB  98.27MB 102.14MB   6.19MB 6.29%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   8.25MB 13.39%   2.33MB 3.79%
> 148.67MB  47.75MB  49.64MB   7.98MB 16.71%   1.89MB 3.96%
> 125.98MB  39.26MB  40.82MB   7.52MB 19.15%   1.56MB 3.98%
> 105.05MB  32.03MB  33.31MB   7.04MB 21.97%   1.28MB 3.98%
> 535.59MB 200.31MB 207.60MB   9.26MB 4.62%   7.30MB 3.64%
> 294.84MB  98.28MB 102.15MB   6.85MB 6.97%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   9.08MB 14.74%   2.33MB 3.79%
> 148.67MB  47.75MB  49.64MB   8.77MB 18.36%   1.89MB 3.96%
> 125.98MB  39.26MB  40.82MB   8.35MB 21.28%   1.56MB 3.98%
> 105.05MB  32.03MB  33.31MB   8.24MB 25.71%   1.28MB 3.98%
>
> As you can see from the above results, fragmentation grows continuously
> with each run. But real_overhead_ratio in afmalloc always stays at just
> 3% ~ 4%, while overhead_ratio in zsmalloc varies between 5% and 38%.
> Fragmented slab memory can be used by other subsystems, so we don't
> have to worry much about the overhead metric in afmalloc. In any case, the
> overhead metric is also better in afmalloc, at 4% ~ 26%.
>
> As a result, I think that afmalloc is better than zsmalloc in terms of
> memory efficiency. But I could be wrong, so any comments are welcome. :)
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  include/linux/afmalloc.h |   21 ++
>  mm/Kconfig               |    7 +
>  mm/Makefile              |    1 +
>  mm/afmalloc.c            |  590 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 619 insertions(+)
>  create mode 100644 include/linux/afmalloc.h
>  create mode 100644 mm/afmalloc.c
>
> diff --git a/include/linux/afmalloc.h b/include/linux/afmalloc.h
> new file mode 100644
> index 0000000..751ae56
> --- /dev/null
> +++ b/include/linux/afmalloc.h
> @@ -0,0 +1,21 @@
> +#define AFMALLOC_MIN_LEVEL (1)
> +#ifdef CONFIG_64BIT
> +#define AFMALLOC_MAX_LEVEL (7) /* 4 + 4 + 8 * 7 = 64 */
> +#else
> +#define AFMALLOC_MAX_LEVEL (6) /* 4 + 4 + 4 * 6 = 32 */
> +#endif
> +
> +extern struct afmalloc_pool *afmalloc_create_pool(int max_level,
> +                       size_t max_size, gfp_t flags);
> +extern void afmalloc_destroy_pool(struct afmalloc_pool *pool);
> +extern size_t afmalloc_get_used_pages(struct afmalloc_pool *pool);
> +extern unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len);
> +extern void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle);
> +extern size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
> +                       void *src, size_t len);
> +extern size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
> +                       void *dst, size_t len);
> +extern void *afmalloc_map_handle(struct afmalloc_pool *pool,
> +                       unsigned long handle, size_t len, bool read_only);
> +extern void afmalloc_unmap_handle(struct afmalloc_pool *pool,
> +                       unsigned long handle);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e09cf0a..7869768 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -585,6 +585,13 @@ config ZSMALLOC
>           returned by an alloc().  This handle must be mapped in order to
>           access the allocated space.
>
> +config ANTI_FRAGMENTATION_MALLOC
> +       boolean "Anti-fragmentation memory allocator"
> +       help
> +         Select this to store data into anti-fragmentation memory
> +         allocator. This helps to reduce internal/external
> +         fragmentation caused by storing arbitrary sized data.
> +
>  config PGTABLE_MAPPING
>         bool "Use page table mapping to access object in zsmalloc"
>         depends on ZSMALLOC
> diff --git a/mm/Makefile b/mm/Makefile
> index b2f18dc..d47b147 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -62,6 +62,7 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>  obj-$(CONFIG_ZPOOL)    += zpool.o
>  obj-$(CONFIG_ZBUD)     += zbud.o
>  obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
> +obj-$(CONFIG_ANTI_FRAGMENTATION_MALLOC) += afmalloc.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)      += cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> diff --git a/mm/afmalloc.c b/mm/afmalloc.c
> new file mode 100644
> index 0000000..83a5c61
> --- /dev/null
> +++ b/mm/afmalloc.c
> @@ -0,0 +1,590 @@
> +/*
> + * Anti Fragmentation Memory allocator
> + *
> + * Copyright (C) 2014 Joonsoo Kim
> + *
> + * Anti Fragmentation Memory allocator(aka afmalloc) is special purpose
> + * allocator in order to deal with arbitrary sized object allocation
> + * efficiently in terms of memory utilization.
> + *
> + * The overall design is quite simple.
> + *
> + * If the request is for a power of 2 sized object, afmalloc allocates the
> + * object from the SLAB, adds a tag to it and returns it to the requestor.
> + * This tag is used to determine whether a handle refers to metadata or not.
> + *
> + * If the request isn't for a power of 2 sized object, afmalloc divides the
> + * size into power of 2 sized elements. For example, for a 400 byte request,
> + * 256, 128 and 16 byte chunks build up 400 bytes. afmalloc allocates memory
> + * of these sizes from the SLAB and allocates metadata to keep the pointers
> + * to these chunks. A conceptual representation of the metadata is below.
> + *
> + * Metadata for 400 bytes
> + * - Pointer for 256 bytes chunk
> + * - Pointer for 128 bytes chunk
> + * - Pointer for 16 bytes chunk
> + *
> + * After allocating all of them, afmalloc returns a handle for this metadata
> + * to the requestor, who can then load/store from/into this memory via it.
> + *
> + * Memory returned from afmalloc isn't contiguous, so using it requires
> + * special APIs. afmalloc_load()/afmalloc_store() handle load/store requests
> + * according to afmalloc's internal structure, so they can be used safely.
> + *
> + * If you want to use this memory like normal memory, you need to call
> + * afmalloc_map_handle() before using it. This returns contiguous memory
> + * for the handle so that you can use it with normal memory operations.
> + * Unfortunately, only one object can be mapped per cpu at a time, and
> + * constructing this mapping has some overhead.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/spinlock.h>
> +#include <linux/slab.h>
> +#include <linux/afmalloc.h>
> +#include <linux/highmem.h>
> +#include <linux/sizes.h>
> +#include <linux/module.h>
> +
> +#define afmalloc_OBJ_MIN_SIZE (32)
> +
> +#define DIRECT_ENTRY (0x1)
> +
> +struct afmalloc_pool {
> +       spinlock_t lock;
> +       gfp_t flags;
> +       int max_level;
> +       size_t max_size;
> +       size_t size;
> +};
> +
> +struct afmalloc_entry {
> +       int level;
> +       int alloced;
> +       void *mem[];
> +};
> +
> +struct afmalloc_mapped_info {
> +       struct page *page;
> +       size_t len;
> +       bool read_only;
> +};
> +
> +static struct afmalloc_mapped_info __percpu *mapped_info;
> +
> +static struct afmalloc_entry *mem_to_direct_entry(void *mem)
> +{
> +       return (struct afmalloc_entry *)((unsigned long)mem | DIRECT_ENTRY);
> +}
> +
> +static void *direct_entry_to_mem(struct afmalloc_entry *entry)
> +{
> +       return (void *)((unsigned long)entry & ~DIRECT_ENTRY);
> +}
> +
> +static bool is_direct_entry(struct afmalloc_entry *entry)
> +{
> +       return (unsigned long)entry & DIRECT_ENTRY;
> +}
> +
> +static unsigned long entry_to_handle(struct afmalloc_entry *entry)
> +{
> +       return (unsigned long)entry;
> +}
> +
> +static struct afmalloc_entry *handle_to_entry(unsigned long handle)
> +{
> +       return (struct afmalloc_entry *)handle;
> +}
> +
> +static bool valid_level(int max_level)
> +{
> +       if (max_level < AFMALLOC_MIN_LEVEL)
> +               return false;
> +
> +       if (max_level > AFMALLOC_MAX_LEVEL)
> +               return false;
> +
> +       return true;
> +}
> +
> +static bool valid_flags(gfp_t flags)
> +{
> +       if (flags & __GFP_HIGHMEM)
> +               return false;
> +
> +       return true;
> +}
> +
> +/**
> + * afmalloc_create_pool - Creates an allocation pool to work from.
> + * @max_level: limit on number of chunks that is part of requested memory
> + * @max_size: limit on total allocation size from this pool
> + * @flags: allocation flags used to allocate memory
> + *
> + * This function must be called before anything when using
> + * the afmalloc allocator.
> + *
> + * On success, a pointer to the newly created pool is returned,
> + * otherwise NULL.
> + */
> +struct afmalloc_pool *afmalloc_create_pool(int max_level, size_t max_size,
> +                                       gfp_t flags)
> +{
> +       struct afmalloc_pool *pool;
> +
> +       if (!valid_level(max_level))
> +               return NULL;
> +
> +       if (!valid_flags(flags))
> +               return NULL;
> +
> +       pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> +       if (!pool)
> +               return NULL;
> +
> +       spin_lock_init(&pool->lock);
> +       pool->flags = flags;
> +       pool->max_level = max_level;
> +       pool->max_size = max_size;
> +       pool->size = 0;
> +
> +       return pool;
> +}
> +EXPORT_SYMBOL(afmalloc_create_pool);
> +
> +void afmalloc_destroy_pool(struct afmalloc_pool *pool)
> +{
> +       kfree(pool);
> +}
> +EXPORT_SYMBOL(afmalloc_destroy_pool);
> +
> +size_t afmalloc_get_used_pages(struct afmalloc_pool *pool)
> +{
> +       size_t size;
> +
> +       spin_lock(&pool->lock);
> +       size = pool->size >> PAGE_SHIFT;
> +       spin_unlock(&pool->lock);
> +
> +       return size;
> +}
> +EXPORT_SYMBOL(afmalloc_get_used_pages);
> +
> +static void free_entry(struct afmalloc_pool *pool, struct afmalloc_entry *entry,
> +                       bool calc_size)
> +{
> +       int i;
> +       int level;
> +       int alloced;
> +
> +       if (is_direct_entry(entry)) {
> +               void *mem = direct_entry_to_mem(entry);
> +
> +               alloced = ksize(mem);
> +               kfree(mem);
> +               goto out;
> +       }
> +
> +       level = entry->level;
> +       alloced = entry->alloced;
> +       for (i = 0; i < level; i++)
> +               kfree(entry->mem[i]);
> +
> +       kfree(entry);
> +
> +out:
> +       if (calc_size && alloced) {
> +               spin_lock(&pool->lock);
> +               pool->size -= alloced;
> +               spin_unlock(&pool->lock);
> +       }
> +}
> +
> +static int calculate_level(struct afmalloc_pool *pool, size_t len)
> +{
> +       int level = 0;
> +       size_t down_size, up_size;
> +
> +       if (len <= afmalloc_OBJ_MIN_SIZE)
> +               goto out;
> +
> +       while (1) {
> +               down_size = rounddown_pow_of_two(len);
> +               if (down_size >= len)
> +                       break;
> +
> +               up_size = roundup_pow_of_two(len);
> +               if (up_size - len <= afmalloc_OBJ_MIN_SIZE)
> +                       break;
> +
> +               len -= down_size;
> +               level++;
> +       }
> +
> +out:
> +       level++;
> +       return min(level, pool->max_level);
> +}
> +
> +static int estimate_alloced(struct afmalloc_pool *pool, int level, size_t len)
> +{
> +       int i, alloced = 0;
> +       size_t size;
> +
> +       for (i = 0; i < level - 1; i++) {
> +               size = rounddown_pow_of_two(len);
> +               alloced += size;
> +               len -= size;
> +       }
> +
> +       if (len < afmalloc_OBJ_MIN_SIZE)
> +               size = afmalloc_OBJ_MIN_SIZE;
> +       else
> +               size = roundup_pow_of_two(len);
> +       alloced += size;
> +
> +       return alloced;
> +}
> +
> +static void *alloc_entry(struct afmalloc_pool *pool, size_t len)
> +{
> +       int i, level;
> +       size_t size;
> +       int alloced = 0;
> +       size_t remain = len;
> +       struct afmalloc_entry *entry;
> +       void *mem;
> +
> +       /*
> +        * Determine whether memory is power of 2 or not. If not,
> +        * determine how many chunks are needed.
> +        */
> +       level = calculate_level(pool, len);
> +       if (level == 1)
> +               goto alloc_direct_entry;
> +
> +       size = sizeof(void *) * level + sizeof(struct afmalloc_entry);
> +       entry = kmalloc(size, pool->flags);
> +       if (!entry)
> +               return NULL;
> +
> +       size = ksize(entry);
> +       alloced += size;
> +
> +       /*
> +        * Although the request isn't for a power of 2 sized object, it can be
> +        * better to allocate a single power of 2 chunk due to metadata overhead.
> +        */
> +       if (size + estimate_alloced(pool, level, len)
> +                               >= roundup_pow_of_two(len)) {
> +               kfree(entry);
> +               goto alloc_direct_entry;
> +       }
> +
> +       entry->level = level;
> +       for (i = 0; i < level - 1; i++) {
> +               size = rounddown_pow_of_two(remain);
> +               entry->mem[i] = kmalloc(size, pool->flags);
> +               if (!entry->mem[i])
> +                       goto err;
> +
> +               alloced += size;
> +               remain -= size;
> +       }
> +
> +       if (remain < afmalloc_OBJ_MIN_SIZE)
> +               size = afmalloc_OBJ_MIN_SIZE;
> +       else
> +               size = roundup_pow_of_two(remain);
> +       entry->mem[i] = kmalloc(size, pool->flags);
> +       if (!entry->mem[i])
> +               goto err;
> +
> +       alloced += size;
> +       entry->alloced = alloced;
> +       goto alloc_complete;
> +
> +alloc_direct_entry:
> +       mem = kmalloc(len, pool->flags);
> +       if (!mem)
> +               return NULL;
> +
> +       alloced = ksize(mem);
> +       entry = mem_to_direct_entry(mem);
> +
> +alloc_complete:
> +       spin_lock(&pool->lock);
> +       if (pool->size + alloced > pool->max_size) {
> +               spin_unlock(&pool->lock);
> +               goto err;
> +       }
> +
> +       pool->size += alloced;
> +       spin_unlock(&pool->lock);
> +
> +       return entry;
> +
> +err:
> +       free_entry(pool, entry, false);
> +
> +       return NULL;
> +}
> +
> +static bool valid_alloc_arg(size_t len)
> +{
> +       if (!len)
> +               return false;
> +
> +       return true;
> +}
> +
> +/**
> + * afmalloc_alloc - Allocate block of given length from pool
> + * @pool: pool from which the object was allocated
> + * @len: length of block to allocate
> + *
> + * On success, handle to the allocated object is returned,
> + * otherwise 0.
> + */
> +unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len)
> +{
> +       struct afmalloc_entry *entry;
> +
> +       if (!valid_alloc_arg(len))
> +               return 0;
> +
> +       entry = alloc_entry(pool, len);
> +       if (!entry)
> +               return 0;
> +
> +       return entry_to_handle(entry);
> +}
> +EXPORT_SYMBOL(afmalloc_alloc);
> +
> +static void __afmalloc_free(struct afmalloc_pool *pool,
> +                       struct afmalloc_entry *entry)
> +{
> +       free_entry(pool, entry, true);
> +}
> +
> +void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle)
> +{
> +       struct afmalloc_entry *entry;
> +
> +       entry = handle_to_entry(handle);
> +       if (!entry)
> +               return;
> +
> +       __afmalloc_free(pool, entry);
> +}
> +EXPORT_SYMBOL(afmalloc_free);
> +
> +static void __afmalloc_store(struct afmalloc_pool *pool,
> +                       struct afmalloc_entry *entry, void *src, size_t len)
> +{
> +       int i, level = entry->level;
> +       size_t size;
> +       size_t offset = 0;
> +
> +       if (is_direct_entry(entry)) {
> +               memcpy(direct_entry_to_mem(entry), src, len);
> +               return;
> +       }
> +
> +       for (i = 0; i < level - 1; i++) {
> +               size = rounddown_pow_of_two(len);
> +               memcpy(entry->mem[i], src + offset, size);
> +               offset += size;
> +               len -= size;
> +       }
> +       memcpy(entry->mem[i], src + offset, len);
> +}
> +
> +static bool valid_store_arg(struct afmalloc_entry *entry, void *src, size_t len)
> +{
> +       if (!entry)
> +               return false;
> +
> +       if (!src || !len)
> +               return false;
> +
> +       return true;
> +}
> +
> +/**
> + * afmalloc_store - store data into allocated object from handle.
> + * @pool: pool from which the object was allocated
> + * @handle: handle returned from afmalloc
> + * @src: memory address of source data
> + * @len: length in bytes of desired store
> + *
> + * To store data into an object allocated from afmalloc, it must be
> + * mapped before using it or accessed through afmalloc-specific
> + * load/store functions. These functions properly handle load/store
> + * request according to afmalloc's internal structure.
> + */
> +size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
> +                       void *src, size_t len)
> +{
> +       struct afmalloc_entry *entry;
> +
> +       entry = handle_to_entry(handle);
> +       if (!valid_store_arg(entry, src, len))
> +               return 0;
> +
> +       __afmalloc_store(pool, entry, src, len);
> +
> +       return len;
> +}
> +EXPORT_SYMBOL(afmalloc_store);
> +
> +static void __afmalloc_load(struct afmalloc_pool *pool,
> +                       struct afmalloc_entry *entry, void *dst, size_t len)
> +{
> +       int i, level = entry->level;
> +       size_t size;
> +       size_t offset = 0;
> +
> +       if (is_direct_entry(entry)) {
> +               memcpy(dst, direct_entry_to_mem(entry), len);
> +               return;
> +       }
> +
> +       for (i = 0; i < level - 1; i++) {
> +               size = rounddown_pow_of_two(len);
> +               memcpy(dst + offset, entry->mem[i], size);
> +               offset += size;
> +               len -= size;
> +       }
> +       memcpy(dst + offset, entry->mem[i], len);
> +}
> +
> +static bool valid_load_arg(struct afmalloc_entry *entry, void *dst, size_t len)
> +{
> +       if (!entry)
> +               return false;
> +
> +       if (!dst || !len)
> +               return false;
> +
> +       return true;
> +}
> +
> +size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
> +               void *dst, size_t len)
> +{
> +       struct afmalloc_entry *entry;
> +
> +       entry = handle_to_entry(handle);
> +       if (!valid_load_arg(entry, dst, len))
> +               return 0;
> +
> +       __afmalloc_load(pool, entry, dst, len);
> +
> +       return len;
> +}
> +EXPORT_SYMBOL(afmalloc_load);
> +
> +/**
> + * afmalloc_map_handle - get address of allocated object from handle.
> + * @pool: pool from which the object was allocated
> + * @handle: handle returned from afmalloc
> + * @len: length in bytes of desired mapping
> + * @read_only: flag that represents whether data on mapped region is
> + *     written back into an object or not
> + *
> + * Before using an object allocated from afmalloc, it must be mapped using
> + * this function. When done with the object, it must be unmapped using
> + * afmalloc_unmap_handle.
> + *
> + * Only one object can be mapped per cpu at a time. There is no protection
> + * against nested mappings.
> + *
> + * This function returns with preemption and page faults disabled.
> + */
> +void *afmalloc_map_handle(struct afmalloc_pool *pool, unsigned long handle,
> +                       size_t len, bool read_only)
> +{
> +       int cpu;
> +       struct afmalloc_entry *entry;
> +       struct afmalloc_mapped_info *info;
> +       void *addr;
> +
> +       entry = handle_to_entry(handle);
> +       if (!entry)
> +               return NULL;
> +
> +       cpu = get_cpu();
> +       if (is_direct_entry(entry))
> +               return direct_entry_to_mem(entry);
> +
> +       info = per_cpu_ptr(mapped_info, cpu);
> +       addr = page_address(info->page);
> +       info->len = len;
> +       info->read_only = read_only;
> +       __afmalloc_load(pool, entry, addr, len);
> +       return addr;
> +}
> +EXPORT_SYMBOL(afmalloc_map_handle);
> +
> +void afmalloc_unmap_handle(struct afmalloc_pool *pool, unsigned long handle)
> +{
> +       struct afmalloc_entry *entry;
> +       struct afmalloc_mapped_info *info;
> +       void *addr;
> +
> +       entry = handle_to_entry(handle);
> +       if (!entry)
> +               return;
> +
> +       if (is_direct_entry(entry))
> +               goto out;
> +
> +       info = this_cpu_ptr(mapped_info);
> +       if (info->read_only)
> +               goto out;
> +
> +       addr = page_address(info->page);
> +       __afmalloc_store(pool, entry, addr, info->len);
> +
> +out:
> +       put_cpu();
> +}
> +EXPORT_SYMBOL(afmalloc_unmap_handle);
> +
> +static int __init afmalloc_init(void)
> +{
> +       int cpu;
> +
> +       mapped_info = alloc_percpu(struct afmalloc_mapped_info);
> +       if (!mapped_info)
> +               return -ENOMEM;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct page *page;
> +
> +               page = alloc_pages(GFP_KERNEL, 0);
> +               if (!page)
> +                       goto err;
> +
> +               per_cpu_ptr(mapped_info, cpu)->page = page;
> +       }
> +
> +       return 0;
> +
> +err:
> +       for_each_possible_cpu(cpu) {
> +               struct page *page;
> +
> +               page = per_cpu_ptr(mapped_info, cpu)->page;
> +               if (page)
> +                       __free_pages(page, 0);
> +       }
> +       free_percpu(mapped_info);
> +       return -ENOMEM;
> +}
> +module_init(afmalloc_init);
> +
> +MODULE_AUTHOR("Joonsoo Kim <iamjoonsoo.kim@lge.com>");
> --
> 1.7.9.5
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
@ 2014-09-29 15:41   ` Dan Streetman
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Streetman @ 2014-09-29 15:41 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Minchan Kim, Nitin Gupta, Linux-MM, linux-kernel,
	Jerome Marchand, Sergey Senozhatsky, Luigi Semenzato, Mel Gorman,
	Hugh Dickins

On Fri, Sep 26, 2014 at 2:53 AM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> If you know useful place to use this allocator, please let me know.
>
> This is brand-new allocator, called anti-fragmentation memory allocator
> (aka afmalloc), in order to deal with arbitrary sized object allocation
> efficiently. zram and zswap uses arbitrary sized object to store
> compressed data so they can use this allocator. If there are any other
> use cases, they can use it, too.
>
> This work is motivated by observation of fragmentation on zsmalloc which
> intended for storing arbitrary sized object with low fragmentation.
> Although it works well on allocation-intensive workload, memory could be
> highly fragmented after many free occurs. In some cases, unused memory due
> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> problem is that other subsystem cannot use these unused memory. These
> fragmented memory are zsmalloc specific, so most of other subsystem cannot
> use it until zspage is freed to page allocator.
>
> I guess that there are similar fragmentation problem in zbud, but, I
> didn't deeply investigate it.
>
> This new allocator uses SLAB allocator to solve above problems. When
> request comes, it returns handle that is pointer of metatdata to point
> many small chunks. These small chunks are in power of 2 size and
> build up whole requested memory. We can easily acquire these chunks
> using SLAB allocator. Following is conceptual represetation of metadata
> used in this allocator to help understanding of this allocator.
>
> Handle A for 400 bytes
> {
>         Pointer for 256 bytes chunk
>         Pointer for 128 bytes chunk
>         Pointer for 16 bytes chunk
>
>         (256 + 128 + 16 = 400)
> }
>
> As you can see, 400 bytes memory are not contiguous in afmalloc so that
> allocator specific store/load functions are needed. These require some
> computation overhead and I guess that this is the only drawback this
> allocator has.

This also requires additional memory copying, for each map/unmap, no?

>
> For optimization, it uses another approach for power of 2 sized request.
> Instead of returning handle for metadata, it adds tag on pointer from
> SLAB allocator and directly returns this value as handle. With this tag,
> afmalloc can recognize whether handle is for metadata or not and do proper
> processing on it. This optimization can save some memory.
>
> Although afmalloc use some memory for metadata, overall utilization of
> memory is really good due to zero internal fragmentation by using power
> of 2 sized object. Although zsmalloc has many size class, there is
> considerable internal fragmentation in zsmalloc.
>
> In workload that needs many free, memory could be fragmented like
> zsmalloc, but, there is big difference. These unused portion of memory
> are SLAB specific memory so that other subsystem can use it. Therefore,
> fragmented memory could not be a big problem in this allocator.
>
> Extra benefit of this allocator design is NUMA awareness. This allocator
> allocates real memory from SLAB allocator. SLAB considers client's NUMA
> affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
> and zbud which are backend of zram and zswap, respectively, are not NUMA
> awareness so that remote node's memory could be returned to requestor.
> I think that it could be solved easily if NUMA awareness turns out to be
> real problem. But, it may enlarge fragmentation depending on number of
> nodes. Anyway, there is no NUMA awareness issue in this allocator.
>
> Although I'd like to replace zsmalloc with this allocator, it cannot be
> possible, because zsmalloc supports HIGHMEM. In 32-bits world, SLAB memory
> would be very limited so supporting HIGHMEM would be really good advantage
> of zsmalloc. Because there is no HIGHMEM in 32-bits low memory device or
> 64-bits world, this allocator may be good option for this system. I
> didn't deeply consider whether this allocator can replace zbud or not.

While it looks like there may be some situations that benefit from
this, this won't work for all cases (as you mention), so maybe zpool
can allow zram to choose between zsmalloc and afmalloc.

>
> Below is the result of my simple test.
> (zsmalloc used in experiments is patched with my previous patch:
> zsmalloc: merge size_class to reduce fragmentation)
>
> TEST ENV: EXT4 on zram, mount with discard option
> WORKLOAD: untar kernel source, remove dir in descending order in size.
> (drivers arch fs sound include)
>
> Each line represents orig_data_size, compr_data_size, mem_used_total,
> fragmentation overhead (mem_used - compr_data_size) and overhead ratio
> (overhead to compr_data_size), respectively, after untar and remove
> operation is executed. In afmalloc case, overhead is calculated by
> before/after 'SUnreclaim' on /proc/meminfo.
> And there are two more columns
> in afmalloc, one is real_overhead which represents metadata usage and
> overhead of internal fragmentation, and the other is a ratio,
> real_overhead to compr_data_size. Unlike zsmalloc, only metadata and
> internal fragmented memory cannot be used by other subsystem. So,
> comparing real_overhead in afmalloc with overhead on zsmalloc seems to
> be proper comparison.
>
> * untar-merge.out
>
> orig_size compr_size used_size overhead overhead_ratio
> 526.23MB 199.18MB 209.81MB  10.64MB 5.34%
> 288.68MB  97.45MB 104.08MB   6.63MB 6.80%
> 177.68MB  61.14MB  66.93MB   5.79MB 9.47%
> 146.83MB  47.34MB  52.79MB   5.45MB 11.51%
> 124.52MB  38.87MB  44.30MB   5.43MB 13.96%
> 104.29MB  31.70MB  36.83MB   5.13MB 16.19%
>
> * untar-afmalloc.out
>
> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> 526.27MB 199.18MB 206.37MB   8.00MB 4.02%   7.19MB 3.61%
> 288.71MB  97.45MB 101.25MB   5.86MB 6.01%   3.80MB 3.90%
> 177.71MB  61.14MB  63.44MB   4.39MB 7.19%   2.30MB 3.76%
> 146.86MB  47.34MB  49.20MB   3.97MB 8.39%   1.86MB 3.93%
> 124.55MB  38.88MB  40.41MB   3.71MB 9.54%   1.53MB 3.95%
> 104.32MB  31.70MB  32.96MB   3.43MB 10.81%   1.26MB 3.96%
>
> As you can see above result, real_overhead_ratio in afmalloc is
> just 3% ~ 4% while overhead_ratio on zsmalloc varies 5% ~ 17%.
>
> And, 4% ~ 11% overhead_ratio in afmalloc is also slightly better
> than overhead_ratio in zsmalloc which is 5% ~ 17%.

I think the key will be scaling up this test more.  What does it look
like when using 20G or more?

It certainly looks better when using (relatively) small amounts of data, though.

>
> Below is another simple test to check fragmentation effect in alloc/free
> repetition workload.
>
> TEST ENV: EXT4 on zram, mount with discard option
> WORKLOAD: untar kernel source, remove dir in descending order in size
> (drivers arch fs sound include). Repeat this untar and remove 10 times.
>
> * untar-merge.out
>
> orig_size compr_size used_size overhead overhead_ratio
> 526.24MB 199.18MB 209.79MB  10.61MB 5.33%
> 288.69MB  97.45MB 104.09MB   6.64MB 6.81%
> 177.69MB  61.14MB  66.89MB   5.75MB 9.40%
> 146.84MB  47.34MB  52.77MB   5.43MB 11.46%
> 124.53MB  38.88MB  44.28MB   5.40MB 13.90%
> 104.29MB  31.71MB  36.87MB   5.17MB 16.29%
> 535.59MB 200.30MB 211.77MB  11.47MB 5.73%
> 294.84MB  98.28MB 106.24MB   7.97MB 8.11%
> 179.99MB  61.58MB  69.34MB   7.76MB 12.60%
> 148.67MB  47.75MB  55.19MB   7.43MB 15.57%
> 125.98MB  39.26MB  46.62MB   7.36MB 18.75%
> 105.05MB  32.03MB  39.18MB   7.15MB 22.32%
> (snip...)
> 535.59MB 200.31MB 211.88MB  11.57MB 5.77%
> 294.84MB  98.28MB 106.62MB   8.34MB 8.49%
> 179.99MB  61.59MB  73.83MB  12.24MB 19.88%
> 148.67MB  47.76MB  59.58MB  11.82MB 24.76%
> 125.98MB  39.27MB  51.10MB  11.84MB 30.14%
> 105.05MB  32.04MB  43.68MB  11.64MB 36.31%
> 535.59MB 200.31MB 211.89MB  11.58MB 5.78%
> 294.84MB  98.28MB 106.68MB   8.40MB 8.55%
> 179.99MB  61.59MB  74.14MB  12.55MB 20.37%
> 148.67MB  47.76MB  59.94MB  12.18MB 25.50%
> 125.98MB  39.27MB  51.46MB  12.19MB 31.04%
> 105.05MB  32.04MB  44.01MB  11.97MB 37.35%
>
> * untar-afmalloc.out
>
> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> 526.23MB 199.17MB 206.36MB   8.02MB 4.03%   7.19MB 3.61%
> 288.68MB  97.45MB 101.25MB   5.42MB 5.56%   3.80MB 3.90%
> 177.68MB  61.14MB  63.43MB   4.00MB 6.54%   2.30MB 3.76%
> 146.83MB  47.34MB  49.20MB   3.66MB 7.74%   1.86MB 3.93%
> 124.52MB  38.87MB  40.41MB   3.33MB 8.57%   1.54MB 3.96%
> 104.29MB  31.70MB  32.95MB   3.23MB 10.19%   1.26MB 3.97%
> 535.59MB 200.30MB 207.59MB   9.21MB 4.60%   7.29MB 3.64%
> 294.84MB  98.27MB 102.14MB   6.23MB 6.34%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   4.98MB 8.09%   2.33MB 3.78%
> 148.67MB  47.75MB  49.64MB   4.48MB 9.37%   1.89MB 3.95%
> 125.98MB  39.26MB  40.82MB   4.23MB 10.78%   1.56MB 3.97%
> 105.05MB  32.03MB  33.30MB   4.10MB 12.81%   1.27MB 3.98%
> (snip...)
> 535.59MB 200.30MB 207.60MB   8.94MB 4.46%   7.29MB 3.64%
> 294.84MB  98.27MB 102.14MB   6.19MB 6.29%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   8.25MB 13.39%   2.33MB 3.79%
> 148.67MB  47.75MB  49.64MB   7.98MB 16.71%   1.89MB 3.96%
> 125.98MB  39.26MB  40.82MB   7.52MB 19.15%   1.56MB 3.98%
> 105.05MB  32.03MB  33.31MB   7.04MB 21.97%   1.28MB 3.98%
> 535.59MB 200.31MB 207.60MB   9.26MB 4.62%   7.30MB 3.64%
> 294.84MB  98.28MB 102.15MB   6.85MB 6.97%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   9.08MB 14.74%   2.33MB 3.79%
> 148.67MB  47.75MB  49.64MB   8.77MB 18.36%   1.89MB 3.96%
> 125.98MB  39.26MB  40.82MB   8.35MB 21.28%   1.56MB 3.98%
> 105.05MB  32.03MB  33.31MB   8.24MB 25.71%   1.28MB 3.98%
>
> As you can see above result, fragmentation grows continuously at each run.
> But, real_overhead_ratio in afmalloc is always just 3% ~ 4%,
> while overhead_ratio on zsmalloc varies 5% ~ 38%.
> Fragmented slab memory can be used for other system, so we don't
> have to much worry about overhead metric in afmalloc. Anyway, overhead
> metric is also better in afmalloc, 4% ~ 26%.
>
> As a result, I think that afmalloc is better than zsmalloc in terms of
> memory efficiency. But, I could be wrong so any comments are welcome. :)
>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  include/linux/afmalloc.h |   21 ++
>  mm/Kconfig               |    7 +
>  mm/Makefile              |    1 +
>  mm/afmalloc.c            |  590 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 619 insertions(+)
>  create mode 100644 include/linux/afmalloc.h
>  create mode 100644 mm/afmalloc.c
>
> diff --git a/include/linux/afmalloc.h b/include/linux/afmalloc.h
> new file mode 100644
> index 0000000..751ae56
> --- /dev/null
> +++ b/include/linux/afmalloc.h
> @@ -0,0 +1,21 @@
> +#define AFMALLOC_MIN_LEVEL (1)
> +#ifdef CONFIG_64BIT
> +#define AFMALLOC_MAX_LEVEL (7) /* 4 + 4 + 8 * 7 = 64 */
> +#else
> +#define AFMALLOC_MAX_LEVEL (6) /* 4 + 4 + 4 * 6 = 32 */
> +#endif
> +
> +extern struct afmalloc_pool *afmalloc_create_pool(int max_level,
> +                       size_t max_size, gfp_t flags);
> +extern void afmalloc_destroy_pool(struct afmalloc_pool *pool);
> +extern size_t afmalloc_get_used_pages(struct afmalloc_pool *pool);
> +extern unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len);
> +extern void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle);
> +extern size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
> +                       void *src, size_t len);
> +extern size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
> +                       void *dst, size_t len);
> +extern void *afmalloc_map_handle(struct afmalloc_pool *pool,
> +                       unsigned long handle, size_t len, bool read_only);
> +extern void afmalloc_unmap_handle(struct afmalloc_pool *pool,
> +                       unsigned long handle);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e09cf0a..7869768 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -585,6 +585,13 @@ config ZSMALLOC
>           returned by an alloc().  This handle must be mapped in order to
>           access the allocated space.
>
> +config ANTI_FRAGMENTATION_MALLOC
> +       boolean "Anti-fragmentation memory allocator"
> +       help
> +         Select this to store data into anti-fragmentation memory
> +         allocator. This helps to reduce internal/external
> +         fragmentation caused by storing arbitrary sized data.
> +
>  config PGTABLE_MAPPING
>         bool "Use page table mapping to access object in zsmalloc"
>         depends on ZSMALLOC
> diff --git a/mm/Makefile b/mm/Makefile
> index b2f18dc..d47b147 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -62,6 +62,7 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>  obj-$(CONFIG_ZPOOL)    += zpool.o
>  obj-$(CONFIG_ZBUD)     += zbud.o
>  obj-$(CONFIG_ZSMALLOC) += zsmalloc.o
> +obj-$(CONFIG_ANTI_FRAGMENTATION_MALLOC) += afmalloc.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)      += cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> diff --git a/mm/afmalloc.c b/mm/afmalloc.c
> new file mode 100644
> index 0000000..83a5c61
> --- /dev/null
> +++ b/mm/afmalloc.c
> @@ -0,0 +1,590 @@
> +/*
> + * Anti Fragmentation Memory allocator
> + *
> + * Copyright (C) 2014 Joonsoo Kim
> + *
> + * Anti Fragmentation Memory allocator(aka afmalloc) is special purpose
> + * allocator in order to deal with arbitrary sized object allocation
> + * efficiently in terms of memory utilization.
> + *
> + * Overall design is too simple.
> + *
> + * If request is for power of 2 sized object, afmalloc allocate object
> + * from the SLAB, add tag on it and return it to requestor. This tag will be
> + * used for determining whether it is a handle for metadata or not.
> + *
> + * If request isn't for power of 2 sized object, afmalloc divides size
> + * into elements in power of 2 size. For example, 400 byte request, 256,
> + * 128, 16 bytes build up 400 bytes. afmalloc allocates these size memory
> + * from the SLAB and allocates memory for metadata to keep the pointer of
> + * these chunks. Conceptual representation of metadata structure is below.
> + *
> + * Metadata for 400 bytes
> + * - Pointer for 256 bytes chunk
> + * - Pointer for 128 bytes chunk
> + * - Pointer for 16 bytes chunk
> + *
> + * After allocation all of them, afmalloc returns handle for this metadata to
> + * requestor. Requestor can load/store from/into this memory via this handle.
> + *
> + * Returned memory from afmalloc isn't contiguous so using this memory needs
> + * special APIs. afmalloc_(load/store) handles load/store requests according
> + * to afmalloc's internal structure, so you can use it without any anxiety.
> + *
> + * If you may want to use this memory like as normal memory, you need to call
> + * afmalloc_map_object before using it. This returns contiguous memory for
> + * this handle so that you could use it with normal memory operation.
> + * Unfortunately, only one object can be mapped per cpu at a time and to
> + * contruct this mapping has some overhead.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/spinlock.h>
> +#include <linux/slab.h>
> +#include <linux/afmalloc.h>
> +#include <linux/highmem.h>
> +#include <linux/sizes.h>
> +#include <linux/module.h>
> +
> +#define afmalloc_OBJ_MIN_SIZE (32)
> +
> +#define DIRECT_ENTRY (0x1)
> +
> +struct afmalloc_pool {
> +       spinlock_t lock;
> +       gfp_t flags;
> +       int max_level;
> +       size_t max_size;
> +       size_t size;
> +};
> +
> +struct afmalloc_entry {
> +       int level;
> +       int alloced;
> +       void *mem[];
> +};
> +
> +struct afmalloc_mapped_info {
> +       struct page *page;
> +       size_t len;
> +       bool read_only;
> +};
> +
> +static struct afmalloc_mapped_info __percpu *mapped_info;
> +
> +static struct afmalloc_entry *mem_to_direct_entry(void *mem)
> +{
> +       return (struct afmalloc_entry *)((unsigned long)mem | DIRECT_ENTRY);
> +}
> +
> +static void *direct_entry_to_mem(struct afmalloc_entry *entry)
> +{
> +       return (void *)((unsigned long)entry & ~DIRECT_ENTRY);
> +}
> +
> +static bool is_direct_entry(struct afmalloc_entry *entry)
> +{
> +       return (unsigned long)entry & DIRECT_ENTRY;
> +}
> +
> +static unsigned long entry_to_handle(struct afmalloc_entry *entry)
> +{
> +       return (unsigned long)entry;
> +}
> +
> +static struct afmalloc_entry *handle_to_entry(unsigned long handle)
> +{
> +       return (struct afmalloc_entry *)handle;
> +}
> +
> +static bool valid_level(int max_level)
> +{
> +       if (max_level < AFMALLOC_MIN_LEVEL)
> +               return false;
> +
> +       if (max_level > AFMALLOC_MAX_LEVEL)
> +               return false;
> +
> +       return true;
> +}
> +
> +static bool valid_flags(gfp_t flags)
> +{
> +       if (flags & __GFP_HIGHMEM)
> +               return false;
> +
> +       return true;
> +}
> +
> +/**
> + * afmalloc_create_pool - Creates an allocation pool to work from.
> + * @max_level: limit on number of chunks that is part of requested memory
> + * @max_size: limit on total allocation size from this pool
> + * @flags: allocation flags used to allocate memory
> + *
> + * This function must be called before anything when using
> + * the afmalloc allocator.
> + *
> + * On success, a pointer to the newly created pool is returned,
> + * otherwise NULL.
> + */
> +struct afmalloc_pool *afmalloc_create_pool(int max_level, size_t max_size,
> +                                       gfp_t flags)
> +{
> +       struct afmalloc_pool *pool;
> +
> +       if (!valid_level(max_level))
> +               return NULL;
> +
> +       if (!valid_flags(flags))
> +               return NULL;
> +
> +       pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> +       if (!pool)
> +               return NULL;
> +
> +       spin_lock_init(&pool->lock);
> +       pool->flags = flags;
> +       pool->max_level = max_level;
> +       pool->max_size = max_size;
> +       pool->size = 0;
> +
> +       return pool;
> +}
> +EXPORT_SYMBOL(afmalloc_create_pool);
> +
> +void afmalloc_destroy_pool(struct afmalloc_pool *pool)
> +{
> +       kfree(pool);
> +}
> +EXPORT_SYMBOL(afmalloc_destroy_pool);
> +
> +size_t afmalloc_get_used_pages(struct afmalloc_pool *pool)
> +{
> +       size_t size;
> +
> +       spin_lock(&pool->lock);
> +       size = pool->size >> PAGE_SHIFT;
> +       spin_unlock(&pool->lock);
> +
> +       return size;
> +}
> +EXPORT_SYMBOL(afmalloc_get_used_pages);
> +
> +static void free_entry(struct afmalloc_pool *pool, struct afmalloc_entry *entry,
> +                       bool calc_size)
> +{
> +       int i;
> +       int level;
> +       int alloced;
> +
> +       if (is_direct_entry(entry)) {
> +               void *mem = direct_entry_to_mem(entry);
> +
> +               alloced = ksize(mem);
> +               kfree(mem);
> +               goto out;
> +       }
> +
> +       level = entry->level;
> +       alloced = entry->alloced;
> +       for (i = 0; i < level; i++)
> +               kfree(entry->mem[i]);
> +
> +       kfree(entry);
> +
> +out:
> +       if (calc_size && alloced) {
> +               spin_lock(&pool->lock);
> +               pool->size -= alloced;
> +               spin_unlock(&pool->lock);
> +       }
> +}
> +
> +static int calculate_level(struct afmalloc_pool *pool, size_t len)
> +{
> +       int level = 0;
> +       size_t down_size, up_size;
> +
> +       if (len <= afmalloc_OBJ_MIN_SIZE)
> +               goto out;
> +
> +       while (1) {
> +               down_size = rounddown_pow_of_two(len);
> +               if (down_size >= len)
> +                       break;
> +
> +               up_size = roundup_pow_of_two(len);
> +               if (up_size - len <= afmalloc_OBJ_MIN_SIZE)
> +                       break;
> +
> +               len -= down_size;
> +               level++;
> +       }
> +
> +out:
> +       level++;
> +       return min(level, pool->max_level);
> +}
> +
> +static int estimate_alloced(struct afmalloc_pool *pool, int level, size_t len)
> +{
> +       int i, alloced = 0;
> +       size_t size;
> +
> +       for (i = 0; i < level - 1; i++) {
> +               size = rounddown_pow_of_two(len);
> +               alloced += size;
> +               len -= size;
> +       }
> +
> +       if (len < afmalloc_OBJ_MIN_SIZE)
> +               size = afmalloc_OBJ_MIN_SIZE;
> +       else
> +               size = roundup_pow_of_two(len);
> +       alloced += size;
> +
> +       return alloced;
> +}
> +
> +static void *alloc_entry(struct afmalloc_pool *pool, size_t len)
> +{
> +       int i, level;
> +       size_t size;
> +       int alloced = 0;
> +       size_t remain = len;
> +       struct afmalloc_entry *entry;
> +       void *mem;
> +
> +       /*
> +        * Determine whether memory is power of 2 or not. If not,
> +        * determine how many chunks are needed.
> +        */
> +       level = calculate_level(pool, len);
> +       if (level == 1)
> +               goto alloc_direct_entry;
> +
> +       size = sizeof(void *) * level + sizeof(struct afmalloc_entry);
> +       /* zeroed so unallocated mem[] slots stay NULL on the error path */
> +       entry = kzalloc(size, pool->flags);
> +       if (!entry)
> +               return NULL;
> +
> +       size = ksize(entry);
> +       alloced += size;
> +
> +       /*
> +        * Although the request isn't for a power-of-2 sized object, it is
> +        * sometimes better to allocate a single power-of-2 chunk, because
> +        * the metadata overhead would outweigh the savings.
> +        */
> +       if (size + estimate_alloced(pool, level, len)
> +                               >= roundup_pow_of_two(len)) {
> +               kfree(entry);
> +               goto alloc_direct_entry;
> +       }
> +
> +       entry->level = level;
> +       for (i = 0; i < level - 1; i++) {
> +               size = rounddown_pow_of_two(remain);
> +               entry->mem[i] = kmalloc(size, pool->flags);
> +               if (!entry->mem[i])
> +                       goto err;
> +
> +               alloced += size;
> +               remain -= size;
> +       }
> +
> +       if (remain < afmalloc_OBJ_MIN_SIZE)
> +               size = afmalloc_OBJ_MIN_SIZE;
> +       else
> +               size = roundup_pow_of_two(remain);
> +       entry->mem[i] = kmalloc(size, pool->flags);
> +       if (!entry->mem[i])
> +               goto err;
> +
> +       alloced += size;
> +       entry->alloced = alloced;
> +       goto alloc_complete;
> +
> +alloc_direct_entry:
> +       mem = kmalloc(len, pool->flags);
> +       if (!mem)
> +               return NULL;
> +
> +       alloced = ksize(mem);
> +       entry = mem_to_direct_entry(mem);
> +
> +alloc_complete:
> +       spin_lock(&pool->lock);
> +       if (pool->size + alloced > pool->max_size) {
> +               spin_unlock(&pool->lock);
> +               goto err;
> +       }
> +
> +       pool->size += alloced;
> +       spin_unlock(&pool->lock);
> +
> +       return entry;
> +
> +err:
> +       free_entry(pool, entry, false);
> +
> +       return NULL;
> +}
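To make the decomposition above concrete, an illustrative walk-through for the
400-byte case from the header comment (numbers assume a 64-bit build and the
afmalloc_OBJ_MIN_SIZE of 32 defined above):

	calculate_level(400): 400 -> 256 + 144 -> 256 + 128 + 16, i.e. 3 chunks
	final 16 byte remainder < afmalloc_OBJ_MIN_SIZE, padded to a 32 byte chunk
	data chunks:  256 + 128 + 32 = 416 bytes
	metadata:     8 byte header + 3 pointers = 32 bytes (kmalloc-32)
	total 448 bytes < roundup_pow_of_two(400) = 512, so the chunked form is
	kept rather than falling back to a single direct kmalloc(400)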
> +
> +static bool valid_alloc_arg(size_t len)
> +{
> +       if (!len)
> +               return false;
> +
> +       return true;
> +}
> +
> +/**
> + * afmalloc_alloc - Allocate block of given length from pool
> + * @pool: pool from which the object was allocated
> + * @len: length of block to allocate
> + *
> + * On success, handle to the allocated object is returned,
> + * otherwise 0.
> + */
> +unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len)
> +{
> +       struct afmalloc_entry *entry;
> +
> +       if (!valid_alloc_arg(len))
> +               return 0;
> +
> +       entry = alloc_entry(pool, len);
> +       if (!entry)
> +               return 0;
> +
> +       return entry_to_handle(entry);
> +}
> +EXPORT_SYMBOL(afmalloc_alloc);
> +
> +static void __afmalloc_free(struct afmalloc_pool *pool,
> +                       struct afmalloc_entry *entry)
> +{
> +       free_entry(pool, entry, true);
> +}
> +
> +void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle)
> +{
> +       struct afmalloc_entry *entry;
> +
> +       entry = handle_to_entry(handle);
> +       if (!entry)
> +               return;
> +
> +       __afmalloc_free(pool, entry);
> +}
> +EXPORT_SYMBOL(afmalloc_free);
> +
> +static void __afmalloc_store(struct afmalloc_pool *pool,
> +                       struct afmalloc_entry *entry, void *src, size_t len)
> +{
> +       int i, level = entry->level;
> +       size_t size;
> +       size_t offset = 0;
> +
> +       if (is_direct_entry(entry)) {
> +               memcpy(direct_entry_to_mem(entry), src, len);
> +               return;
> +       }
> +
> +       for (i = 0; i < level - 1; i++) {
> +               size = rounddown_pow_of_two(len);
> +               memcpy(entry->mem[i], src + offset, size);
> +               offset += size;
> +               len -= size;
> +       }
> +       memcpy(entry->mem[i], src + offset, len);
> +}
> +
> +static bool valid_store_arg(struct afmalloc_entry *entry, void *src, size_t len)
> +{
> +       if (!entry)
> +               return false;
> +
> +       if (!src || !len)
> +               return false;
> +
> +       return true;
> +}
> +
> +/**
> + * afmalloc_store - store data into allocated object from handle.
> + * @pool: pool from which the object was allocated
> + * @handle: handle returned from afmalloc
> + * @src: memory address of source data
> + * @len: length in bytes of desired store
> + *
> + * To store data into an object allocated from afmalloc, it must be
> + * mapped before using it or accessed through afmalloc-specific
> + * load/store functions. These functions properly handle load/store
> + * request according to afmalloc's internal structure.
> + */
> +size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
> +                       void *src, size_t len)
> +{
> +       struct afmalloc_entry *entry;
> +
> +       entry = handle_to_entry(handle);
> +       if (!valid_store_arg(entry, src, len))
> +               return 0;
> +
> +       __afmalloc_store(pool, entry, src, len);
> +
> +       return len;
> +}
> +EXPORT_SYMBOL(afmalloc_store);
> +
> +static void __afmalloc_load(struct afmalloc_pool *pool,
> +                       struct afmalloc_entry *entry, void *dst, size_t len)
> +{
> +       int i, level = entry->level;
> +       size_t size;
> +       size_t offset = 0;
> +
> +       if (is_direct_entry(entry)) {
> +               memcpy(dst, direct_entry_to_mem(entry), len);
> +               return;
> +       }
> +
> +       for (i = 0; i < level - 1; i++) {
> +               size = rounddown_pow_of_two(len);
> +               memcpy(dst + offset, entry->mem[i], size);
> +               offset += size;
> +               len -= size;
> +       }
> +       memcpy(dst + offset, entry->mem[i], len);
> +}
> +
> +static bool valid_load_arg(struct afmalloc_entry *entry, void *dst, size_t len)
> +{
> +       if (!entry)
> +               return false;
> +
> +       if (!dst || !len)
> +               return false;
> +
> +       return true;
> +}
> +
> +size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
> +               void *dst, size_t len)
> +{
> +       struct afmalloc_entry *entry;
> +
> +       entry = handle_to_entry(handle);
> +       if (!valid_load_arg(entry, dst, len))
> +               return 0;
> +
> +       __afmalloc_load(pool, entry, dst, len);
> +
> +       return len;
> +}
> +EXPORT_SYMBOL(afmalloc_load);
> +
> +/**
> + * afmalloc_map_handle - get address of allocated object from handle.
> + * @pool: pool from which the object was allocated
> + * @handle: handle returned from afmalloc
> + * @len: length in bytes of desired mapping
> + * @read_only: if true, data in the mapped region is not written back
> + *     into the object on unmap
> + *
> + * Before using an object allocated from afmalloc as ordinary, contiguous
> + * memory, it must be mapped using this function. When done with the object,
> + * it must be unmapped using afmalloc_unmap_handle.
> + *
> + * Only one object can be mapped per cpu at a time. There is no protection
> + * against nested mappings.
> + *
> + * This function returns with preemption disabled.
> + */
> +void *afmalloc_map_handle(struct afmalloc_pool *pool, unsigned long handle,
> +                       size_t len, bool read_only)
> +{
> +       int cpu;
> +       struct afmalloc_entry *entry;
> +       struct afmalloc_mapped_info *info;
> +       void *addr;
> +
> +       entry = handle_to_entry(handle);
> +       if (!entry)
> +               return NULL;
> +
> +       cpu = get_cpu();
> +       if (is_direct_entry(entry))
> +               return direct_entry_to_mem(entry);
> +
> +       info = per_cpu_ptr(mapped_info, cpu);
> +       addr = page_address(info->page);
> +       info->len = len;
> +       info->read_only = read_only;
> +       __afmalloc_load(pool, entry, addr, len);
> +       return addr;
> +}
> +EXPORT_SYMBOL(afmalloc_map_handle);
> +
> +void afmalloc_unmap_handle(struct afmalloc_pool *pool, unsigned long handle)
> +{
> +       struct afmalloc_entry *entry;
> +       struct afmalloc_mapped_info *info;
> +       void *addr;
> +
> +       entry = handle_to_entry(handle);
> +       if (!entry)
> +               return;
> +
> +       if (is_direct_entry(entry))
> +               goto out;
> +
> +       info = this_cpu_ptr(mapped_info);
> +       if (info->read_only)
> +               goto out;
> +
> +       addr = page_address(info->page);
> +       __afmalloc_store(pool, entry, addr, info->len);
> +
> +out:
> +       put_cpu();
> +}
> +EXPORT_SYMBOL(afmalloc_unmap_handle);
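A short, illustrative-only sketch of how the mapping pair above is meant to be
used; the pool and handle are assumed to come from afmalloc_create_pool() and
afmalloc_alloc() as in the earlier example, and the function name is made up:

	#include <linux/afmalloc.h>
	#include <linux/string.h>
	#include <linux/types.h>

	static void afmalloc_map_example(struct afmalloc_pool *pool,
					 unsigned long handle, size_t len)
	{
		void *buf;

		/* read_only == false: modifications become visible in the object after unmap */
		buf = afmalloc_map_handle(pool, handle, len, false);
		if (buf) {
			memset(buf, 0, len);	/* modify the mapped copy */
			afmalloc_unmap_handle(pool, handle);
		}

		/* read_only == true: treat the mapping as read-only; nothing is copied back */
		buf = afmalloc_map_handle(pool, handle, len, true);
		if (buf) {
			/* only read here; preemption is disabled until unmap */
			afmalloc_unmap_handle(pool, handle);
		}
	}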
> +
> +static int __init afmalloc_init(void)
> +{
> +       int cpu;
> +
> +       mapped_info = alloc_percpu(struct afmalloc_mapped_info);
> +       if (!mapped_info)
> +               return -ENOMEM;
> +
> +       for_each_possible_cpu(cpu) {
> +               struct page *page;
> +
> +               page = alloc_pages(GFP_KERNEL, 0);
> +               if (!page)
> +                       goto err;
> +
> +               per_cpu_ptr(mapped_info, cpu)->page = page;
> +       }
> +
> +       return 0;
> +
> +err:
> +       for_each_possible_cpu(cpu) {
> +               struct page *page;
> +
> +               page = per_cpu_ptr(mapped_info, cpu)->page;
> +               if (page)
> +                       __free_pages(page, 0);
> +       }
> +       free_percpu(mapped_info);
> +       return -ENOMEM;
> +}
> +module_init(afmalloc_init);
> +
> +MODULE_AUTHOR("Joonsoo Kim <iamjoonsoo.kim@lge.com>");
> --
> 1.7.9.5
>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
  2014-09-26  6:53 ` Joonsoo Kim
@ 2014-09-29 19:53   ` Seth Jennings
  -1 siblings, 0 replies; 16+ messages in thread
From: Seth Jennings @ 2014-09-29 19:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Minchan Kim, Nitin Gupta, linux-mm, linux-kernel,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman,
	Luigi Semenzato, Mel Gorman, Hugh Dickins

On Fri, Sep 26, 2014 at 03:53:14PM +0900, Joonsoo Kim wrote:
> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> If you know useful place to use this allocator, please let me know.
> 
> This is brand-new allocator, called anti-fragmentation memory allocator
> (aka afmalloc), in order to deal with arbitrary sized object allocation
> efficiently. zram and zswap uses arbitrary sized object to store
> compressed data so they can use this allocator. If there are any other
> use cases, they can use it, too.
> 
> This work is motivated by observation of fragmentation on zsmalloc which
> intended for storing arbitrary sized object with low fragmentation.
> Although it works well on allocation-intensive workload, memory could be
> highly fragmented after many free occurs. In some cases, unused memory due
> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> problem is that other subsystem cannot use these unused memory. These
> fragmented memory are zsmalloc specific, so most of other subsystem cannot
> use it until zspage is freed to page allocator.

Yes, zsmalloc has a fragmentation issue.  This has been a topic lately.
I and others are looking at putting compaction logic into zsmalloc to
help with this.

> 
> I guess that there are similar fragmentation problem in zbud, but, I
> didn't deeply investigate it.
> 
> This new allocator uses SLAB allocator to solve above problems. When
> request comes, it returns handle that is pointer of metatdata to point
> many small chunks. These small chunks are in power of 2 size and
> build up whole requested memory. We can easily acquire these chunks
> using SLAB allocator. Following is conceptual represetation of metadata
> used in this allocator to help understanding of this allocator.
> 
> Handle A for 400 bytes
> {
> 	Pointer for 256 bytes chunk
> 	Pointer for 128 bytes chunk
> 	Pointer for 16 bytes chunk
> 
> 	(256 + 128 + 16 = 400)
> }
> 
> As you can see, 400 bytes memory are not contiguous in afmalloc so that
> allocator specific store/load functions are needed. These require some
> computation overhead and I guess that this is the only drawback this
> allocator has.

One problem with using the SLAB allocator is that kmalloc caches greater
than 256 bytes, at least on my x86_64 machine, have slabs that require
high order page allocations, which are going to be really hard to come
by in the memory stressed environment in which zswap/zram are expected
to operate.  I guess you could max out at 256 byte chunks to overcome
this.  However, if you have a 3k object, that would require copying 12
chunks from potentially 12 different pages into a contiguous area at
mapping time and a larger metadata size.
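To put rough numbers on that (illustrative only, assuming a 64-bit build and a
hypothetical 256-byte chunk cap):

	3072 byte object / 256 byte chunks = 12 chunks
	metadata: 8 byte header + 12 pointers * 8 bytes = 104 bytes -> kmalloc-128
	mapping:  12 memcpy()s from up to 12 different slab pages into the per-cpu page
	(12 chunks would also exceed the current AFMALLOC_MAX_LEVEL of 7 on 64-bit)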

> 
> For optimization, it uses another approach for power of 2 sized request.
> Instead of returning handle for metadata, it adds tag on pointer from
> SLAB allocator and directly returns this value as handle. With this tag,
> afmalloc can recognize whether handle is for metadata or not and do proper
> processing on it. This optimization can save some memory.
> 
> Although afmalloc use some memory for metadata, overall utilization of
> memory is really good due to zero internal fragmentation by using power

Smallest kmalloc cache is 8 bytes so up to 7 bytes of internal
fragmentation per object right?  If so, "near zero".

> of 2 sized object. Although zsmalloc has many size class, there is
> considerable internal fragmentation in zsmalloc.

Let's put a number on it. Internal fragmentation on objects with size >
ZS_MIN_ALLOC_SIZE is at most ZS_SIZE_CLASS_DELTA-1, which is 15 bytes with
PAGE_SIZE of 4k.  If the allocation is less than ZS_MIN_ALLOC_SIZE,
fragmentation could be as high as ZS_MIN_ALLOC_SIZE-1 which is 31 on a
64-bit system with 4k pages.  (Note: I don't think it is possible to
compress a 4k page to less than 32 bytes, so for zswap, there will be no
allocations in this size range).

So we are looking at up to 7 vs 15 bytes of internal fragmentation per
object in the case when allocations are > ZS_MIN_ALLOC_SIZE.  Once you
take into account the per-object metadata overhead of afmalloc, I think
zsmalloc comes out ahead here.

> 
> In workload that needs many free, memory could be fragmented like
> zsmalloc, but, there is big difference. These unused portion of memory
> are SLAB specific memory so that other subsystem can use it. Therefore,
> fragmented memory could not be a big problem in this allocator.

While freeing chunks back to the slab allocator does make that memory
available to other _kernel_ users, the fragmentation problem is just
moved one level down.  The fragmentation will exist in the slabs and
those fragmented slabs won't be freed to the page allocator, which would
make them available to _any_ user, not just the kernel.  Additionally,
there is little visibility into how chunks are organized in the slab,
making compaction at the afmalloc level nearly impossible.  (The only
visibility being the address returned by kmalloc())

> 
> Extra benefit of this allocator design is NUMA awareness. This allocator
> allocates real memory from SLAB allocator. SLAB considers client's NUMA
> affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
> and zbud which are backend of zram and zswap, respectively, are not NUMA
> awareness so that remote node's memory could be returned to requestor.
> I think that it could be solved easily if NUMA awareness turns out to be
> real problem. But, it may enlarge fragmentation depending on number of
> nodes. Anyway, there is no NUMA awareness issue in this allocator.
> 
> Although I'd like to replace zsmalloc with this allocator, it cannot be
> possible, because zsmalloc supports HIGHMEM. In 32-bits world, SLAB memory
> would be very limited so supporting HIGHMEM would be really good advantage
> of zsmalloc. Because there is no HIGHMEM in 32-bits low memory device or
> 64-bits world, this allocator may be good option for this system. I
> didn't deeply consider whether this allocator can replace zbud or not.
> 
> Below is the result of my simple test.
> (zsmalloc used in experiments is patched with my previous patch:
> zsmalloc: merge size_class to reduce fragmentation)
> 
> TEST ENV: EXT4 on zram, mount with discard option
> WORKLOAD: untar kernel source, remove dir in descending order in size.
> (drivers arch fs sound include)
> 
> Each line represents orig_data_size, compr_data_size, mem_used_total,
> fragmentation overhead (mem_used - compr_data_size) and overhead ratio
> (overhead to compr_data_size), respectively, after untar and remove
> operation is executed. In afmalloc case, overhead is calculated by
> before/after 'SUnreclaim' on /proc/meminfo. And there are two more columns
> in afmalloc, one is real_overhead which represents metadata usage and
> overhead of internal fragmentation, and the other is a ratio,
> real_overhead to compr_data_size. Unlike zsmalloc, only metadata and
> internal fragmented memory cannot be used by other subsystem. So,
> comparing real_overhead in afmalloc with overhead on zsmalloc seems to
> be proper comparison.

See last comment about why the real measure of memory usage should be
total pages not returned to the page allocator.  I don't consider chunks
freed to the slab allocator to be truly freed unless the slab containing
the chunks is also freed to the page allocator.

The closest thing I can think of to measure the memory utilization of
this allocator is, for each kmalloc cache, to do a before/after count of
how many slabs are in the cache, then multiply that delta by pagesperslab
and sum the results.  This would give a rough measure of the number of pages
utilized in the slab allocator either by or as a result of afmalloc.
Of course, there will be noise from other components doing allocations
during the time between the before and after measurement.
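A rough sketch of that measurement (untested; assumes the kmalloc caches show
up as "kmalloc-*" lines in /proc/slabinfo and that the file is readable,
typically as root) -- run it before and after the workload and diff the totals:

	/* sums num_slabs * pagesperslab over the kmalloc-* caches */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		FILE *f = fopen("/proc/slabinfo", "r");
		char line[512], name[64], *p;
		unsigned long active_objs, num_objs, objsize, objperslab, pagesperslab;
		unsigned long active_slabs, num_slabs, total_pages = 0;

		if (!f) {
			perror("fopen /proc/slabinfo");
			return 1;
		}

		while (fgets(line, sizeof(line), f)) {
			/* <name> <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> ... */
			if (sscanf(line, "%63s %lu %lu %lu %lu %lu", name, &active_objs,
				   &num_objs, &objsize, &objperslab, &pagesperslab) != 6)
				continue;
			if (strncmp(name, "kmalloc-", 8))
				continue;
			/* "... : slabdata <active_slabs> <num_slabs> <sharedavail>" */
			p = strstr(line, "slabdata");
			if (!p || sscanf(p, "slabdata %lu %lu", &active_slabs, &num_slabs) != 2)
				continue;
			total_pages += num_slabs * pagesperslab;
		}
		fclose(f);

		printf("%lu pages in kmalloc-* slabs\n", total_pages);
		return 0;
	}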

Seth

> 
> * untar-merge.out
> 
> orig_size compr_size used_size overhead overhead_ratio
> 526.23MB 199.18MB 209.81MB  10.64MB 5.34%
> 288.68MB  97.45MB 104.08MB   6.63MB 6.80%
> 177.68MB  61.14MB  66.93MB   5.79MB 9.47%
> 146.83MB  47.34MB  52.79MB   5.45MB 11.51%
> 124.52MB  38.87MB  44.30MB   5.43MB 13.96%
> 104.29MB  31.70MB  36.83MB   5.13MB 16.19%
> 
> * untar-afmalloc.out
> 
> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> 526.27MB 199.18MB 206.37MB   8.00MB 4.02%   7.19MB 3.61%
> 288.71MB  97.45MB 101.25MB   5.86MB 6.01%   3.80MB 3.90%
> 177.71MB  61.14MB  63.44MB   4.39MB 7.19%   2.30MB 3.76%
> 146.86MB  47.34MB  49.20MB   3.97MB 8.39%   1.86MB 3.93%
> 124.55MB  38.88MB  40.41MB   3.71MB 9.54%   1.53MB 3.95%
> 104.32MB  31.70MB  32.96MB   3.43MB 10.81%   1.26MB 3.96%
> 
> As you can see above result, real_overhead_ratio in afmalloc is
> just 3% ~ 4% while overhead_ratio on zsmalloc varies 5% ~ 17%.
> 
> And, 4% ~ 11% overhead_ratio in afmalloc is also slightly better
> than overhead_ratio in zsmalloc which is 5% ~ 17%.
> 
> Below is another simple test to check fragmentation effect in alloc/free
> repetition workload.
> 
> TEST ENV: EXT4 on zram, mount with discard option
> WORKLOAD: untar kernel source, remove dir in descending order in size
> (drivers arch fs sound include). Repeat this untar and remove 10 times.
> 
> * untar-merge.out
> 
> orig_size compr_size used_size overhead overhead_ratio
> 526.24MB 199.18MB 209.79MB  10.61MB 5.33%
> 288.69MB  97.45MB 104.09MB   6.64MB 6.81%
> 177.69MB  61.14MB  66.89MB   5.75MB 9.40%
> 146.84MB  47.34MB  52.77MB   5.43MB 11.46%
> 124.53MB  38.88MB  44.28MB   5.40MB 13.90%
> 104.29MB  31.71MB  36.87MB   5.17MB 16.29%
> 535.59MB 200.30MB 211.77MB  11.47MB 5.73%
> 294.84MB  98.28MB 106.24MB   7.97MB 8.11%
> 179.99MB  61.58MB  69.34MB   7.76MB 12.60%
> 148.67MB  47.75MB  55.19MB   7.43MB 15.57%
> 125.98MB  39.26MB  46.62MB   7.36MB 18.75%
> 105.05MB  32.03MB  39.18MB   7.15MB 22.32%
> (snip...)
> 535.59MB 200.31MB 211.88MB  11.57MB 5.77%
> 294.84MB  98.28MB 106.62MB   8.34MB 8.49%
> 179.99MB  61.59MB  73.83MB  12.24MB 19.88%
> 148.67MB  47.76MB  59.58MB  11.82MB 24.76%
> 125.98MB  39.27MB  51.10MB  11.84MB 30.14%
> 105.05MB  32.04MB  43.68MB  11.64MB 36.31%
> 535.59MB 200.31MB 211.89MB  11.58MB 5.78%
> 294.84MB  98.28MB 106.68MB   8.40MB 8.55%
> 179.99MB  61.59MB  74.14MB  12.55MB 20.37%
> 148.67MB  47.76MB  59.94MB  12.18MB 25.50%
> 125.98MB  39.27MB  51.46MB  12.19MB 31.04%
> 105.05MB  32.04MB  44.01MB  11.97MB 37.35%
> 
> * untar-afmalloc.out
> 
> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> 526.23MB 199.17MB 206.36MB   8.02MB 4.03%   7.19MB 3.61%
> 288.68MB  97.45MB 101.25MB   5.42MB 5.56%   3.80MB 3.90%
> 177.68MB  61.14MB  63.43MB   4.00MB 6.54%   2.30MB 3.76%
> 146.83MB  47.34MB  49.20MB   3.66MB 7.74%   1.86MB 3.93%
> 124.52MB  38.87MB  40.41MB   3.33MB 8.57%   1.54MB 3.96%
> 104.29MB  31.70MB  32.95MB   3.23MB 10.19%   1.26MB 3.97%
> 535.59MB 200.30MB 207.59MB   9.21MB 4.60%   7.29MB 3.64%
> 294.84MB  98.27MB 102.14MB   6.23MB 6.34%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   4.98MB 8.09%   2.33MB 3.78%
> 148.67MB  47.75MB  49.64MB   4.48MB 9.37%   1.89MB 3.95%
> 125.98MB  39.26MB  40.82MB   4.23MB 10.78%   1.56MB 3.97%
> 105.05MB  32.03MB  33.30MB   4.10MB 12.81%   1.27MB 3.98%
> (snip...)
> 535.59MB 200.30MB 207.60MB   8.94MB 4.46%   7.29MB 3.64%
> 294.84MB  98.27MB 102.14MB   6.19MB 6.29%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   8.25MB 13.39%   2.33MB 3.79%
> 148.67MB  47.75MB  49.64MB   7.98MB 16.71%   1.89MB 3.96%
> 125.98MB  39.26MB  40.82MB   7.52MB 19.15%   1.56MB 3.98%
> 105.05MB  32.03MB  33.31MB   7.04MB 21.97%   1.28MB 3.98%
> 535.59MB 200.31MB 207.60MB   9.26MB 4.62%   7.30MB 3.64%
> 294.84MB  98.28MB 102.15MB   6.85MB 6.97%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   9.08MB 14.74%   2.33MB 3.79%
> 148.67MB  47.75MB  49.64MB   8.77MB 18.36%   1.89MB 3.96%
> 125.98MB  39.26MB  40.82MB   8.35MB 21.28%   1.56MB 3.98%
> 105.05MB  32.03MB  33.31MB   8.24MB 25.71%   1.28MB 3.98%
> 
> As you can see above result, fragmentation grows continuously at each run.
> But, real_overhead_ratio in afmalloc is always just 3% ~ 4%,
> while overhead_ratio on zsmalloc varies 5% ~ 38%.
> Fragmented slab memory can be used for other system, so we don't
> have to much worry about overhead metric in afmalloc. Anyway, overhead
> metric is also better in afmalloc, 4% ~ 26%.
> 
> As a result, I think that afmalloc is better than zsmalloc in terms of
> memory efficiency. But, I could be wrong so any comments are welcome. :)
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  include/linux/afmalloc.h |   21 ++
>  mm/Kconfig               |    7 +
>  mm/Makefile              |    1 +
>  mm/afmalloc.c            |  590 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 619 insertions(+)
>  create mode 100644 include/linux/afmalloc.h
>  create mode 100644 mm/afmalloc.c
> 
> diff --git a/include/linux/afmalloc.h b/include/linux/afmalloc.h
> new file mode 100644
> index 0000000..751ae56
> --- /dev/null
> +++ b/include/linux/afmalloc.h
> @@ -0,0 +1,21 @@
> +#define AFMALLOC_MIN_LEVEL (1)
> +#ifdef CONFIG_64BIT
> +#define AFMALLOC_MAX_LEVEL (7)	/* 4 + 4 + 8 * 7 = 64 */
> +#else
> +#define AFMALLOC_MAX_LEVEL (6)	/* 4 + 4 + 4 * 6 = 32 */
> +#endif
> +
> +extern struct afmalloc_pool *afmalloc_create_pool(int max_level,
> +			size_t max_size, gfp_t flags);
> +extern void afmalloc_destroy_pool(struct afmalloc_pool *pool);
> +extern size_t afmalloc_get_used_pages(struct afmalloc_pool *pool);
> +extern unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len);
> +extern void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle);
> +extern size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
> +			void *src, size_t len);
> +extern size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
> +			void *dst, size_t len);
> +extern void *afmalloc_map_handle(struct afmalloc_pool *pool,
> +			unsigned long handle, size_t len, bool read_only);
> +extern void afmalloc_unmap_handle(struct afmalloc_pool *pool,
> +			unsigned long handle);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e09cf0a..7869768 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -585,6 +585,13 @@ config ZSMALLOC
>  	  returned by an alloc().  This handle must be mapped in order to
>  	  access the allocated space.
>  
> +config ANTI_FRAGMENTATION_MALLOC
> +	boolean "Anti-fragmentation memory allocator"
> +	help
> +	  Select this to store data into anti-fragmentation memory
> +	  allocator. This helps to reduce internal/external
> +	  fragmentation caused by storing arbitrary sized data.
> +
>  config PGTABLE_MAPPING
>  	bool "Use page table mapping to access object in zsmalloc"
>  	depends on ZSMALLOC
> diff --git a/mm/Makefile b/mm/Makefile
> index b2f18dc..d47b147 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -62,6 +62,7 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>  obj-$(CONFIG_ZPOOL)	+= zpool.o
>  obj-$(CONFIG_ZBUD)	+= zbud.o
>  obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
> +obj-$(CONFIG_ANTI_FRAGMENTATION_MALLOC) += afmalloc.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)	+= cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> diff --git a/mm/afmalloc.c b/mm/afmalloc.c
> new file mode 100644
> index 0000000..83a5c61
> --- /dev/null
> +++ b/mm/afmalloc.c
> @@ -0,0 +1,590 @@
> +/*
> + * Anti Fragmentation Memory allocator
> + *
> + * Copyright (C) 2014 Joonsoo Kim
> + *
> + * Anti Fragmentation Memory allocator (aka afmalloc) is a special purpose
> + * allocator designed to handle arbitrary sized object allocation
> + * efficiently in terms of memory utilization.
> + *
> + * The overall design is quite simple.
> + *
> + * If the request is for a power-of-2 sized object, afmalloc allocates the
> + * object from the SLAB, adds a tag to the pointer and returns it to the
> + * requestor. This tag is used to determine whether a handle refers to
> + * metadata or not.
> + *
> + * If the request isn't for a power-of-2 sized object, afmalloc splits the
> + * size into power-of-2 sized elements. For example, a 400 byte request is
> + * built up from 256, 128 and 16 byte chunks. afmalloc allocates these
> + * chunks from the SLAB and allocates additional memory for metadata that
> + * keeps the pointers to these chunks. A conceptual representation of the
> + * metadata structure is below.
> + *
> + * Metadata for 400 bytes
> + * - Pointer for 256 bytes chunk
> + * - Pointer for 128 bytes chunk
> + * - Pointer for 16 bytes chunk
> + *
> + * After allocating all of them, afmalloc returns a handle for this metadata
> + * to the requestor. The requestor can then load/store data from/into this
> + * memory via the handle.
> + *
> + * Memory returned from afmalloc isn't contiguous, so using it requires
> + * special APIs. afmalloc_load()/afmalloc_store() handle load/store requests
> + * according to afmalloc's internal structure, so callers don't need to know
> + * the chunk layout.
> + *
> + * If you want to use this memory like normal, contiguous memory, call
> + * afmalloc_map_handle() before using it. This returns a contiguous mapping
> + * for the handle so that it can be used with normal memory operations.
> + * Unfortunately, only one object can be mapped per cpu at a time, and
> + * constructing this mapping has some overhead.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/spinlock.h>
> +#include <linux/slab.h>
> +#include <linux/afmalloc.h>
> +#include <linux/highmem.h>
> +#include <linux/sizes.h>
> +#include <linux/module.h>
> +
> +#define afmalloc_OBJ_MIN_SIZE (32)
> +
> +#define DIRECT_ENTRY (0x1)
> +
> +struct afmalloc_pool {
> +	spinlock_t lock;
> +	gfp_t flags;
> +	int max_level;
> +	size_t max_size;
> +	size_t size;
> +};
> +
> +struct afmalloc_entry {
> +	int level;
> +	int alloced;
> +	void *mem[];
> +};
> +
> +struct afmalloc_mapped_info {
> +	struct page *page;
> +	size_t len;
> +	bool read_only;
> +};
> +
> +static struct afmalloc_mapped_info __percpu *mapped_info;
> +
> +static struct afmalloc_entry *mem_to_direct_entry(void *mem)
> +{
> +	return (struct afmalloc_entry *)((unsigned long)mem | DIRECT_ENTRY);
> +}
> +
> +static void *direct_entry_to_mem(struct afmalloc_entry *entry)
> +{
> +	return (void *)((unsigned long)entry & ~DIRECT_ENTRY);
> +}
> +
> +static bool is_direct_entry(struct afmalloc_entry *entry)
> +{
> +	return (unsigned long)entry & DIRECT_ENTRY;
> +}
> +
> +static unsigned long entry_to_handle(struct afmalloc_entry *entry)
> +{
> +	return (unsigned long)entry;
> +}
> +
> +static struct afmalloc_entry *handle_to_entry(unsigned long handle)
> +{
> +	return (struct afmalloc_entry *)handle;
> +}
> +
> +static bool valid_level(int max_level)
> +{
> +	if (max_level < AFMALLOC_MIN_LEVEL)
> +		return false;
> +
> +	if (max_level > AFMALLOC_MAX_LEVEL)
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool valid_flags(gfp_t flags)
> +{
> +	if (flags & __GFP_HIGHMEM)
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * afmalloc_create_pool - Creates an allocation pool to work from.
> + * @max_level: limit on the number of chunks that can make up one allocation
> + * @max_size: limit on total allocation size from this pool
> + * @flags: allocation flags used to allocate memory
> + *
> + * This function must be called before anything when using
> + * the afmalloc allocator.
> + *
> + * On success, a pointer to the newly created pool is returned,
> + * otherwise NULL.
> + */
> +struct afmalloc_pool *afmalloc_create_pool(int max_level, size_t max_size,
> +					gfp_t flags)
> +{
> +	struct afmalloc_pool *pool;
> +
> +	if (!valid_level(max_level))
> +		return NULL;
> +
> +	if (!valid_flags(flags))
> +		return NULL;
> +
> +	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> +	if (!pool)
> +		return NULL;
> +
> +	spin_lock_init(&pool->lock);
> +	pool->flags = flags;
> +	pool->max_level = max_level;
> +	pool->max_size = max_size;
> +	pool->size = 0;
> +
> +	return pool;
> +}
> +EXPORT_SYMBOL(afmalloc_create_pool);
> +
> +void afmalloc_destroy_pool(struct afmalloc_pool *pool)
> +{
> +	kfree(pool);
> +}
> +EXPORT_SYMBOL(afmalloc_destroy_pool);
> +
> +size_t afmalloc_get_used_pages(struct afmalloc_pool *pool)
> +{
> +	size_t size;
> +
> +	spin_lock(&pool->lock);
> +	size = pool->size >> PAGE_SHIFT;
> +	spin_unlock(&pool->lock);
> +
> +	return size;
> +}
> +EXPORT_SYMBOL(afmalloc_get_used_pages);
> +
> +static void free_entry(struct afmalloc_pool *pool, struct afmalloc_entry *entry,
> +			bool calc_size)
> +{
> +	int i;
> +	int level;
> +	int alloced;
> +
> +	if (is_direct_entry(entry)) {
> +		void *mem = direct_entry_to_mem(entry);
> +
> +		alloced = ksize(mem);
> +		kfree(mem);
> +		goto out;
> +	}
> +
> +	level = entry->level;
> +	alloced = entry->alloced;
> +	for (i = 0; i < level; i++)
> +		kfree(entry->mem[i]);
> +
> +	kfree(entry);
> +
> +out:
> +	if (calc_size && alloced) {
> +		spin_lock(&pool->lock);
> +		pool->size -= alloced;
> +		spin_unlock(&pool->lock);
> +	}
> +}
> +
> +static int calculate_level(struct afmalloc_pool *pool, size_t len)
> +{
> +	int level = 0;
> +	size_t down_size, up_size;
> +
> +	if (len <= afmalloc_OBJ_MIN_SIZE)
> +		goto out;
> +
> +	while (1) {
> +		down_size = rounddown_pow_of_two(len);
> +		if (down_size >= len)
> +			break;
> +
> +		up_size = roundup_pow_of_two(len);
> +		if (up_size - len <= afmalloc_OBJ_MIN_SIZE)
> +			break;
> +
> +		len -= down_size;
> +		level++;
> +	}
> +
> +out:
> +	level++;
> +	return min(level, pool->max_level);
> +}
> +
> +static int estimate_alloced(struct afmalloc_pool *pool, int level, size_t len)
> +{
> +	int i, alloced = 0;
> +	size_t size;
> +
> +	for (i = 0; i < level - 1; i++) {
> +		size = rounddown_pow_of_two(len);
> +		alloced += size;
> +		len -= size;
> +	}
> +
> +	if (len < afmalloc_OBJ_MIN_SIZE)
> +		size = afmalloc_OBJ_MIN_SIZE;
> +	else
> +		size = roundup_pow_of_two(len);
> +	alloced += size;
> +
> +	return alloced;
> +}
> +
> +static void *alloc_entry(struct afmalloc_pool *pool, size_t len)
> +{
> +	int i, level;
> +	size_t size;
> +	int alloced = 0;
> +	size_t remain = len;
> +	struct afmalloc_entry *entry;
> +	void *mem;
> +
> +	/*
> +	 * Determine whether memory is power of 2 or not. If not,
> +	 * determine how many chunks are needed.
> +	 */
> +	level = calculate_level(pool, len);
> +	if (level == 1)
> +		goto alloc_direct_entry;
> +
> +	size = sizeof(void *) * level + sizeof(struct afmalloc_entry);
> +	/* zeroed so unallocated mem[] slots stay NULL on the error path */
> +	entry = kzalloc(size, pool->flags);
> +	if (!entry)
> +		return NULL;
> +
> +	size = ksize(entry);
> +	alloced += size;
> +
> +	/*
> +	 * Although the request isn't for a power-of-2 sized object, it is
> +	 * sometimes better to allocate a single power-of-2 chunk, because
> +	 * the metadata overhead would outweigh the savings.
> +	 */
> +	if (size + estimate_alloced(pool, level, len)
> +				>= roundup_pow_of_two(len)) {
> +		kfree(entry);
> +		goto alloc_direct_entry;
> +	}
> +
> +	entry->level = level;
> +	for (i = 0; i < level - 1; i++) {
> +		size = rounddown_pow_of_two(remain);
> +		entry->mem[i] = kmalloc(size, pool->flags);
> +		if (!entry->mem[i])
> +			goto err;
> +
> +		alloced += size;
> +		remain -= size;
> +	}
> +
> +	if (remain < afmalloc_OBJ_MIN_SIZE)
> +		size = afmalloc_OBJ_MIN_SIZE;
> +	else
> +		size = roundup_pow_of_two(remain);
> +	entry->mem[i] = kmalloc(size, pool->flags);
> +	if (!entry->mem[i])
> +		goto err;
> +
> +	alloced += size;
> +	entry->alloced = alloced;
> +	goto alloc_complete;
> +
> +alloc_direct_entry:
> +	mem = kmalloc(len, pool->flags);
> +	if (!mem)
> +		return NULL;
> +
> +	alloced = ksize(mem);
> +	entry = mem_to_direct_entry(mem);
> +
> +alloc_complete:
> +	spin_lock(&pool->lock);
> +	if (pool->size + alloced > pool->max_size) {
> +		spin_unlock(&pool->lock);
> +		goto err;
> +	}
> +
> +	pool->size += alloced;
> +	spin_unlock(&pool->lock);
> +
> +	return entry;
> +
> +err:
> +	free_entry(pool, entry, false);
> +
> +	return NULL;
> +}
> +
> +static bool valid_alloc_arg(size_t len)
> +{
> +	if (!len)
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * afmalloc_alloc - Allocate block of given length from pool
> + * @pool: pool from which the object was allocated
> + * @len: length of block to allocate
> + *
> + * On success, handle to the allocated object is returned,
> + * otherwise 0.
> + */
> +unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len)
> +{
> +	struct afmalloc_entry *entry;
> +
> +	if (!valid_alloc_arg(len))
> +		return 0;
> +
> +	entry = alloc_entry(pool, len);
> +	if (!entry)
> +		return 0;
> +
> +	return entry_to_handle(entry);
> +}
> +EXPORT_SYMBOL(afmalloc_alloc);
> +
> +static void __afmalloc_free(struct afmalloc_pool *pool,
> +			struct afmalloc_entry *entry)
> +{
> +	free_entry(pool, entry, true);
> +}
> +
> +void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle)
> +{
> +	struct afmalloc_entry *entry;
> +
> +	entry = handle_to_entry(handle);
> +	if (!entry)
> +		return;
> +
> +	__afmalloc_free(pool, entry);
> +}
> +EXPORT_SYMBOL(afmalloc_free);
> +
> +static void __afmalloc_store(struct afmalloc_pool *pool,
> +			struct afmalloc_entry *entry, void *src, size_t len)
> +{
> +	int i, level = entry->level;
> +	size_t size;
> +	size_t offset = 0;
> +
> +	if (is_direct_entry(entry)) {
> +		memcpy(direct_entry_to_mem(entry), src, len);
> +		return;
> +	}
> +
> +	for (i = 0; i < level - 1; i++) {
> +		size = rounddown_pow_of_two(len);
> +		memcpy(entry->mem[i], src + offset, size);
> +		offset += size;
> +		len -= size;
> +	}
> +	memcpy(entry->mem[i], src + offset, len);
> +}
> +
> +static bool valid_store_arg(struct afmalloc_entry *entry, void *src, size_t len)
> +{
> +	if (!entry)
> +		return false;
> +
> +	if (!src || !len)
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * afmalloc_store - store data into allocated object from handle.
> + * @pool: pool from which the object was allocated
> + * @handle: handle returned from afmalloc
> + * @src: memory address of source data
> + * @len: length in bytes of desired store
> + *
> + * To store data into an object allocated from afmalloc, it must be
> + * mapped before using it or accessed through afmalloc-specific
> + * load/store functions. These functions properly handle load/store
> + * request according to afmalloc's internal structure.
> + */
> +size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
> +			void *src, size_t len)
> +{
> +	struct afmalloc_entry *entry;
> +
> +	entry = handle_to_entry(handle);
> +	if (!valid_store_arg(entry, src, len))
> +		return 0;
> +
> +	__afmalloc_store(pool, entry, src, len);
> +
> +	return len;
> +}
> +EXPORT_SYMBOL(afmalloc_store);
> +
> +static void __afmalloc_load(struct afmalloc_pool *pool,
> +			struct afmalloc_entry *entry, void *dst, size_t len)
> +{
> +	int i, level = entry->level;
> +	size_t size;
> +	size_t offset = 0;
> +
> +	if (is_direct_entry(entry)) {
> +		memcpy(dst, direct_entry_to_mem(entry), len);
> +		return;
> +	}
> +
> +	for (i = 0; i < level - 1; i++) {
> +		size = rounddown_pow_of_two(len);
> +		memcpy(dst + offset, entry->mem[i], size);
> +		offset += size;
> +		len -= size;
> +	}
> +	memcpy(dst + offset, entry->mem[i], len);
> +}
> +
> +static bool valid_load_arg(struct afmalloc_entry *entry, void *dst, size_t len)
> +{
> +	if (!entry)
> +		return false;
> +
> +	if (!dst || !len)
> +		return false;
> +
> +	return true;
> +}
> +
> +size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
> +		void *dst, size_t len)
> +{
> +	struct afmalloc_entry *entry;
> +
> +	entry = handle_to_entry(handle);
> +	if (!valid_load_arg(entry, dst, len))
> +		return 0;
> +
> +	__afmalloc_load(pool, entry, dst, len);
> +
> +	return len;
> +}
> +EXPORT_SYMBOL(afmalloc_load);
> +
> +/**
> + * afmalloc_map_handle - get address of allocated object from handle.
> + * @pool: pool from which the object was allocated
> + * @handle: handle returned from afmalloc
> + * @len: length in bytes of desired mapping
> + * @read_only: if true, data in the mapped region is not written back
> + *	into the object on unmap
> + *
> + * Before using an object allocated from afmalloc as ordinary, contiguous
> + * memory, it must be mapped using this function. When done with the object,
> + * it must be unmapped using afmalloc_unmap_handle.
> + *
> + * Only one object can be mapped per cpu at a time. There is no protection
> + * against nested mappings.
> + *
> + * This function returns with preemption disabled.
> + */
> +void *afmalloc_map_handle(struct afmalloc_pool *pool, unsigned long handle,
> +			size_t len, bool read_only)
> +{
> +	int cpu;
> +	struct afmalloc_entry *entry;
> +	struct afmalloc_mapped_info *info;
> +	void *addr;
> +
> +	entry = handle_to_entry(handle);
> +	if (!entry)
> +		return NULL;
> +
> +	cpu = get_cpu();
> +	if (is_direct_entry(entry))
> +		return direct_entry_to_mem(entry);
> +
> +	info = per_cpu_ptr(mapped_info, cpu);
> +	addr = page_address(info->page);
> +	info->len = len;
> +	info->read_only = read_only;
> +	__afmalloc_load(pool, entry, addr, len);
> +	return addr;
> +}
> +EXPORT_SYMBOL(afmalloc_map_handle);
> +
> +void afmalloc_unmap_handle(struct afmalloc_pool *pool, unsigned long handle)
> +{
> +	struct afmalloc_entry *entry;
> +	struct afmalloc_mapped_info *info;
> +	void *addr;
> +
> +	entry = handle_to_entry(handle);
> +	if (!entry)
> +		return;
> +
> +	if (is_direct_entry(entry))
> +		goto out;
> +
> +	info = this_cpu_ptr(mapped_info);
> +	if (info->read_only)
> +		goto out;
> +
> +	addr = page_address(info->page);
> +	__afmalloc_store(pool, entry, addr, info->len);
> +
> +out:
> +	put_cpu();
> +}
> +EXPORT_SYMBOL(afmalloc_unmap_handle);
> +
> +static int __init afmalloc_init(void)
> +{
> +	int cpu;
> +
> +	mapped_info = alloc_percpu(struct afmalloc_mapped_info);
> +	if (!mapped_info)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct page *page;
> +
> +		page = alloc_pages(GFP_KERNEL, 0);
> +		if (!page)
> +			goto err;
> +
> +		per_cpu_ptr(mapped_info, cpu)->page = page;
> +	}
> +
> +	return 0;
> +
> +err:
> +	for_each_possible_cpu(cpu) {
> +		struct page *page;
> +
> +		page = per_cpu_ptr(mapped_info, cpu)->page;
> +		if (page)
> +			__free_pages(page, 0);
> +	}
> +	free_percpu(mapped_info);
> +	return -ENOMEM;
> +}
> +module_init(afmalloc_init);
> +
> +MODULE_AUTHOR("Joonsoo Kim <iamjoonsoo.kim@lge.com>");
> -- 
> 1.7.9.5
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
@ 2014-09-29 19:53   ` Seth Jennings
  0 siblings, 0 replies; 16+ messages in thread
From: Seth Jennings @ 2014-09-29 19:53 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Minchan Kim, Nitin Gupta, linux-mm, linux-kernel,
	Jerome Marchand, Sergey Senozhatsky, Dan Streetman,
	Luigi Semenzato, Mel Gorman, Hugh Dickins

On Fri, Sep 26, 2014 at 03:53:14PM +0900, Joonsoo Kim wrote:
> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> If you know useful place to use this allocator, please let me know.
> 
> This is brand-new allocator, called anti-fragmentation memory allocator
> (aka afmalloc), in order to deal with arbitrary sized object allocation
> efficiently. zram and zswap uses arbitrary sized object to store
> compressed data so they can use this allocator. If there are any other
> use cases, they can use it, too.
> 
> This work is motivated by observation of fragmentation on zsmalloc which
> intended for storing arbitrary sized object with low fragmentation.
> Although it works well on allocation-intensive workload, memory could be
> highly fragmented after many free occurs. In some cases, unused memory due
> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> problem is that other subsystem cannot use these unused memory. These
> fragmented memory are zsmalloc specific, so most of other subsystem cannot
> use it until zspage is freed to page allocator.

Yes, zsmalloc has a fragmentation issue.  This has been a topic lately.
I and others are looking at putting compaction logic into zsmalloc to
help with this.

> 
> I guess that there are similar fragmentation problem in zbud, but, I
> didn't deeply investigate it.
> 
> This new allocator uses SLAB allocator to solve above problems. When
> request comes, it returns handle that is pointer of metatdata to point
> many small chunks. These small chunks are in power of 2 size and
> build up whole requested memory. We can easily acquire these chunks
> using SLAB allocator. Following is conceptual represetation of metadata
> used in this allocator to help understanding of this allocator.
> 
> Handle A for 400 bytes
> {
> 	Pointer for 256 bytes chunk
> 	Pointer for 128 bytes chunk
> 	Pointer for 16 bytes chunk
> 
> 	(256 + 128 + 16 = 400)
> }
> 
> As you can see, 400 bytes memory are not contiguous in afmalloc so that
> allocator specific store/load functions are needed. These require some
> computation overhead and I guess that this is the only drawback this
> allocator has.

One problem with using the SLAB allocator is that kmalloc caches greater
than 256 bytes, at least on my x86_64 machine, have slabs that require
high order page allocations, which are going to be really hard to come
by in the memory stressed environment in which zswap/zram are expected
to operate.  I guess you could max out at 256 byte chunks to overcome
this.  However, if you have a 3k object, that would require copying 12
chunks from potentially 12 different pages into a contiguous area at
mapping time and a larger metadata size.

> 
> For optimization, it uses another approach for power of 2 sized request.
> Instead of returning handle for metadata, it adds tag on pointer from
> SLAB allocator and directly returns this value as handle. With this tag,
> afmalloc can recognize whether handle is for metadata or not and do proper
> processing on it. This optimization can save some memory.
> 
> Although afmalloc use some memory for metadata, overall utilization of
> memory is really good due to zero internal fragmentation by using power

Smallest kmalloc cache is 8 bytes so up to 7 bytes of internal
fragmentation per object right?  If so, "near zero".

> of 2 sized object. Although zsmalloc has many size class, there is
> considerable internal fragmentation in zsmalloc.

Lets put a number on it. Internal fragmentation on objects with size >
ZS_MIN_ALLOC_SIZE is ZS_SIZE_CLASS_DELTA-1, which is 15 bytes with
PAGE_SIZE of 4k.  If the allocation is less than ZS_MIN_ALLOC_SIZE,
fragmentation could be as high as ZS_MIN_ALLOC_SIZE-1 which is 31 on a
64-bit system with 4k pages.  (Note: I don't think that is it possible to
compress a 4k page to less than 32 bytes, so for zswap, there will be no
allocations in this size range).

So we are looking at up to 7 vs 15 bytes of internal fragmentation per
object in the case when allocations are > ZS_MIN_ALLOC_SIZE.  Once you
take into account the per-object metadata overhead of afmalloc, I think
zsmalloc comes out ahead here.

> 
> In workload that needs many free, memory could be fragmented like
> zsmalloc, but, there is big difference. These unused portion of memory
> are SLAB specific memory so that other subsystem can use it. Therefore,
> fragmented memory could not be a big problem in this allocator.

While freeing chunks back to the slab allocator does make that memory
available to other _kernel_ users, the fragmentation problem is just
moved one level down.  The fragmentation will exist in the slabs and
those fragmented slabs won't be freed to the page allocator, which would
make them available to _any_ user, not just the kernel.  Additionally,
there is little visibility into how chunks are organized in the slab,
making compaction at the afmalloc level nearly impossible.  (The only
visibility being the address returned by kmalloc())

> 
> Extra benefit of this allocator design is NUMA awareness. This allocator
> allocates real memory from SLAB allocator. SLAB considers client's NUMA
> affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
> and zbud which are backend of zram and zswap, respectively, are not NUMA
> awareness so that remote node's memory could be returned to requestor.
> I think that it could be solved easily if NUMA awareness turns out to be
> real problem. But, it may enlarge fragmentation depending on number of
> nodes. Anyway, there is no NUMA awareness issue in this allocator.
> 
> Although I'd like to replace zsmalloc with this allocator, it cannot be
> possible, because zsmalloc supports HIGHMEM. In 32-bits world, SLAB memory
> would be very limited so supporting HIGHMEM would be really good advantage
> of zsmalloc. Because there is no HIGHMEM in 32-bits low memory device or
> 64-bits world, this allocator may be good option for this system. I
> didn't deeply consider whether this allocator can replace zbud or not.
> 
> Below is the result of my simple test.
> (zsmalloc used in experiments is patched with my previous patch:
> zsmalloc: merge size_class to reduce fragmentation)
> 
> TEST ENV: EXT4 on zram, mount with discard option
> WORKLOAD: untar kernel source, remove dir in descending order in size.
> (drivers arch fs sound include)
> 
> Each line represents orig_data_size, compr_data_size, mem_used_total,
> fragmentation overhead (mem_used - compr_data_size) and overhead ratio
> (overhead to compr_data_size), respectively, after untar and remove
> operation is executed. In afmalloc case, overhead is calculated by
> before/after 'SUnreclaim' on /proc/meminfo. And there are two more columns
> in afmalloc, one is real_overhead which represents metadata usage and
> overhead of internal fragmentation, and the other is a ratio,
> real_overhead to compr_data_size. Unlike zsmalloc, only metadata and
> internal fragmented memory cannot be used by other subsystem. So,
> comparing real_overhead in afmalloc with overhead on zsmalloc seems to
> be proper comparison.

See last comment about why the real measure of memory usage should be
total pages not returned to the page allocator.  I don't consider chunks
freed to the slab allocator to be truly freed unless the slab containing
the chunks is also freed to the page allocator.

The closest thing I can think of to measure the memory utilization of
this allocator is, for each kmalloc cache, do a before/after of how many
slabs are in the cache, then multiply that delta by pagesperslab and sum
the results.  This would give a rough measure of the number of pages
utilized in the slab allocator either by or as a result of afmalloc.
Of course, there will be noise from other components doing allocations
during the time between the before and after measurement.

Seth

> 
> * untar-merge.out
> 
> orig_size compr_size used_size overhead overhead_ratio
> 526.23MB 199.18MB 209.81MB  10.64MB 5.34%
> 288.68MB  97.45MB 104.08MB   6.63MB 6.80%
> 177.68MB  61.14MB  66.93MB   5.79MB 9.47%
> 146.83MB  47.34MB  52.79MB   5.45MB 11.51%
> 124.52MB  38.87MB  44.30MB   5.43MB 13.96%
> 104.29MB  31.70MB  36.83MB   5.13MB 16.19%
> 
> * untar-afmalloc.out
> 
> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> 526.27MB 199.18MB 206.37MB   8.00MB 4.02%   7.19MB 3.61%
> 288.71MB  97.45MB 101.25MB   5.86MB 6.01%   3.80MB 3.90%
> 177.71MB  61.14MB  63.44MB   4.39MB 7.19%   2.30MB 3.76%
> 146.86MB  47.34MB  49.20MB   3.97MB 8.39%   1.86MB 3.93%
> 124.55MB  38.88MB  40.41MB   3.71MB 9.54%   1.53MB 3.95%
> 104.32MB  31.70MB  32.96MB   3.43MB 10.81%   1.26MB 3.96%
> 
> As you can see above result, real_overhead_ratio in afmalloc is
> just 3% ~ 4% while overhead_ratio on zsmalloc varies 5% ~ 17%.
> 
> And, 4% ~ 11% overhead_ratio in afmalloc is also slightly better
> than overhead_ratio in zsmalloc which is 5% ~ 17%.
> 
> Below is another simple test to check fragmentation effect in alloc/free
> repetition workload.
> 
> TEST ENV: EXT4 on zram, mount with discard option
> WORKLOAD: untar kernel source, remove dir in descending order in size
> (drivers arch fs sound include). Repeat this untar and remove 10 times.
> 
> * untar-merge.out
> 
> orig_size compr_size used_size overhead overhead_ratio
> 526.24MB 199.18MB 209.79MB  10.61MB 5.33%
> 288.69MB  97.45MB 104.09MB   6.64MB 6.81%
> 177.69MB  61.14MB  66.89MB   5.75MB 9.40%
> 146.84MB  47.34MB  52.77MB   5.43MB 11.46%
> 124.53MB  38.88MB  44.28MB   5.40MB 13.90%
> 104.29MB  31.71MB  36.87MB   5.17MB 16.29%
> 535.59MB 200.30MB 211.77MB  11.47MB 5.73%
> 294.84MB  98.28MB 106.24MB   7.97MB 8.11%
> 179.99MB  61.58MB  69.34MB   7.76MB 12.60%
> 148.67MB  47.75MB  55.19MB   7.43MB 15.57%
> 125.98MB  39.26MB  46.62MB   7.36MB 18.75%
> 105.05MB  32.03MB  39.18MB   7.15MB 22.32%
> (snip...)
> 535.59MB 200.31MB 211.88MB  11.57MB 5.77%
> 294.84MB  98.28MB 106.62MB   8.34MB 8.49%
> 179.99MB  61.59MB  73.83MB  12.24MB 19.88%
> 148.67MB  47.76MB  59.58MB  11.82MB 24.76%
> 125.98MB  39.27MB  51.10MB  11.84MB 30.14%
> 105.05MB  32.04MB  43.68MB  11.64MB 36.31%
> 535.59MB 200.31MB 211.89MB  11.58MB 5.78%
> 294.84MB  98.28MB 106.68MB   8.40MB 8.55%
> 179.99MB  61.59MB  74.14MB  12.55MB 20.37%
> 148.67MB  47.76MB  59.94MB  12.18MB 25.50%
> 125.98MB  39.27MB  51.46MB  12.19MB 31.04%
> 105.05MB  32.04MB  44.01MB  11.97MB 37.35%
> 
> * untar-afmalloc.out
> 
> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> 526.23MB 199.17MB 206.36MB   8.02MB 4.03%   7.19MB 3.61%
> 288.68MB  97.45MB 101.25MB   5.42MB 5.56%   3.80MB 3.90%
> 177.68MB  61.14MB  63.43MB   4.00MB 6.54%   2.30MB 3.76%
> 146.83MB  47.34MB  49.20MB   3.66MB 7.74%   1.86MB 3.93%
> 124.52MB  38.87MB  40.41MB   3.33MB 8.57%   1.54MB 3.96%
> 104.29MB  31.70MB  32.95MB   3.23MB 10.19%   1.26MB 3.97%
> 535.59MB 200.30MB 207.59MB   9.21MB 4.60%   7.29MB 3.64%
> 294.84MB  98.27MB 102.14MB   6.23MB 6.34%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   4.98MB 8.09%   2.33MB 3.78%
> 148.67MB  47.75MB  49.64MB   4.48MB 9.37%   1.89MB 3.95%
> 125.98MB  39.26MB  40.82MB   4.23MB 10.78%   1.56MB 3.97%
> 105.05MB  32.03MB  33.30MB   4.10MB 12.81%   1.27MB 3.98%
> (snip...)
> 535.59MB 200.30MB 207.60MB   8.94MB 4.46%   7.29MB 3.64%
> 294.84MB  98.27MB 102.14MB   6.19MB 6.29%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   8.25MB 13.39%   2.33MB 3.79%
> 148.67MB  47.75MB  49.64MB   7.98MB 16.71%   1.89MB 3.96%
> 125.98MB  39.26MB  40.82MB   7.52MB 19.15%   1.56MB 3.98%
> 105.05MB  32.03MB  33.31MB   7.04MB 21.97%   1.28MB 3.98%
> 535.59MB 200.31MB 207.60MB   9.26MB 4.62%   7.30MB 3.64%
> 294.84MB  98.28MB 102.15MB   6.85MB 6.97%   3.87MB 3.94%
> 179.99MB  61.58MB  63.91MB   9.08MB 14.74%   2.33MB 3.79%
> 148.67MB  47.75MB  49.64MB   8.77MB 18.36%   1.89MB 3.96%
> 125.98MB  39.26MB  40.82MB   8.35MB 21.28%   1.56MB 3.98%
> 105.05MB  32.03MB  33.31MB   8.24MB 25.71%   1.28MB 3.98%
> 
> As you can see above result, fragmentation grows continuously at each run.
> But, real_overhead_ratio in afmalloc is always just 3% ~ 4%,
> while overhead_ratio on zsmalloc varies 5% ~ 38%.
> Fragmented slab memory can be used for other system, so we don't
> have to much worry about overhead metric in afmalloc. Anyway, overhead
> metric is also better in afmalloc, 4% ~ 26%.
> 
> As a result, I think that afmalloc is better than zsmalloc in terms of
> memory efficiency. But, I could be wrong so any comments are welcome. :)
> 
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  include/linux/afmalloc.h |   21 ++
>  mm/Kconfig               |    7 +
>  mm/Makefile              |    1 +
>  mm/afmalloc.c            |  590 ++++++++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 619 insertions(+)
>  create mode 100644 include/linux/afmalloc.h
>  create mode 100644 mm/afmalloc.c
> 
> diff --git a/include/linux/afmalloc.h b/include/linux/afmalloc.h
> new file mode 100644
> index 0000000..751ae56
> --- /dev/null
> +++ b/include/linux/afmalloc.h
> @@ -0,0 +1,21 @@
> +#define AFMALLOC_MIN_LEVEL (1)
> +#ifdef CONFIG_64BIT
> +#define AFMALLOC_MAX_LEVEL (7)	/* 4 + 4 + 8 * 7 = 64 */
> +#else
> +#define AFMALLOC_MAX_LEVEL (6)	/* 4 + 4 + 4 * 6 = 32 */
> +#endif
> +
> +extern struct afmalloc_pool *afmalloc_create_pool(int max_level,
> +			size_t max_size, gfp_t flags);
> +extern void afmalloc_destroy_pool(struct afmalloc_pool *pool);
> +extern size_t afmalloc_get_used_pages(struct afmalloc_pool *pool);
> +extern unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len);
> +extern void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle);
> +extern size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
> +			void *src, size_t len);
> +extern size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
> +			void *dst, size_t len);
> +extern void *afmalloc_map_handle(struct afmalloc_pool *pool,
> +			unsigned long handle, size_t len, bool read_only);
> +extern void afmalloc_unmap_handle(struct afmalloc_pool *pool,
> +			unsigned long handle);
> diff --git a/mm/Kconfig b/mm/Kconfig
> index e09cf0a..7869768 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -585,6 +585,13 @@ config ZSMALLOC
>  	  returned by an alloc().  This handle must be mapped in order to
>  	  access the allocated space.
>  
> +config ANTI_FRAGMENTATION_MALLOC
> +	bool "Anti-fragmentation memory allocator"
> +	help
> +	  Select this to store data with the anti-fragmentation memory
> +	  allocator. It helps to reduce the internal/external
> +	  fragmentation caused by storing arbitrary sized data.
> +
>  config PGTABLE_MAPPING
>  	bool "Use page table mapping to access object in zsmalloc"
>  	depends on ZSMALLOC
> diff --git a/mm/Makefile b/mm/Makefile
> index b2f18dc..d47b147 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -62,6 +62,7 @@ obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>  obj-$(CONFIG_ZPOOL)	+= zpool.o
>  obj-$(CONFIG_ZBUD)	+= zbud.o
>  obj-$(CONFIG_ZSMALLOC)	+= zsmalloc.o
> +obj-$(CONFIG_ANTI_FRAGMENTATION_MALLOC) += afmalloc.o
>  obj-$(CONFIG_GENERIC_EARLY_IOREMAP) += early_ioremap.o
>  obj-$(CONFIG_CMA)	+= cma.o
>  obj-$(CONFIG_MEMORY_BALLOON) += balloon_compaction.o
> diff --git a/mm/afmalloc.c b/mm/afmalloc.c
> new file mode 100644
> index 0000000..83a5c61
> --- /dev/null
> +++ b/mm/afmalloc.c
> @@ -0,0 +1,590 @@
> +/*
> + * Anti Fragmentation Memory allocator
> + *
> + * Copyright (C) 2014 Joonsoo Kim
> + *
> + * Anti Fragmentation Memory allocator (aka afmalloc) is a special purpose
> + * allocator designed to handle arbitrary sized object allocation
> + * efficiently in terms of memory utilization.
> + *
> + * The overall design is quite simple.
> + *
> + * If the request is for a power of 2 sized object, afmalloc allocates the
> + * object from the SLAB, adds a tag to it and returns it to the requestor.
> + * This tag tells whether a handle refers to metadata or not.
> + *
> + * If the request isn't for a power of 2 sized object, afmalloc divides the
> + * size into power of 2 sized elements. For example, a 400 byte request is
> + * built up from 256, 128 and 16 byte chunks. afmalloc allocates memory of
> + * these sizes from the SLAB and allocates metadata to keep the pointers to
> + * these chunks. A conceptual representation of the metadata is below.
> + *
> + * Metadata for 400 bytes
> + * - Pointer for 256 bytes chunk
> + * - Pointer for 128 bytes chunk
> + * - Pointer for 16 bytes chunk
> + *
> + * After allocating all of them, afmalloc returns a handle for this metadata
> + * to the requestor, who can load/store from/into this memory via the handle.
> + *
> + * Memory returned from afmalloc isn't contiguous, so using it requires
> + * special APIs. afmalloc_(load/store) handle load/store requests according
> + * to afmalloc's internal structure, so callers need not know the layout.
> + *
> + * If you want to use this memory like normal memory, you need to call
> + * afmalloc_map_handle before using it. This returns contiguous memory for
> + * the handle so that you can use it with normal memory operations.
> + * Unfortunately, only one object can be mapped per cpu at a time, and
> + * constructing this mapping has some overhead.
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/types.h>
> +#include <linux/spinlock.h>
> +#include <linux/slab.h>
> +#include <linux/afmalloc.h>
> +#include <linux/highmem.h>
> +#include <linux/sizes.h>
> +#include <linux/module.h>
> +
> +#define afmalloc_OBJ_MIN_SIZE (32)
> +
> +#define DIRECT_ENTRY (0x1)
> +
> +struct afmalloc_pool {
> +	spinlock_t lock;
> +	gfp_t flags;
> +	int max_level;
> +	size_t max_size;
> +	size_t size;
> +};
> +
> +struct afmalloc_entry {
> +	int level;
> +	int alloced;
> +	void *mem[];
> +};
> +
> +struct afmalloc_mapped_info {
> +	struct page *page;
> +	size_t len;
> +	bool read_only;
> +};
> +
> +static struct afmalloc_mapped_info __percpu *mapped_info;
> +
> +static struct afmalloc_entry *mem_to_direct_entry(void *mem)
> +{
> +	return (struct afmalloc_entry *)((unsigned long)mem | DIRECT_ENTRY);
> +}
> +
> +static void *direct_entry_to_mem(struct afmalloc_entry *entry)
> +{
> +	return (void *)((unsigned long)entry & ~DIRECT_ENTRY);
> +}
> +
> +static bool is_direct_entry(struct afmalloc_entry *entry)
> +{
> +	return (unsigned long)entry & DIRECT_ENTRY;
> +}
> +
> +static unsigned long entry_to_handle(struct afmalloc_entry *entry)
> +{
> +	return (unsigned long)entry;
> +}
> +
> +static struct afmalloc_entry *handle_to_entry(unsigned long handle)
> +{
> +	return (struct afmalloc_entry *)handle;
> +}
> +
> +static bool valid_level(int max_level)
> +{
> +	if (max_level < AFMALLOC_MIN_LEVEL)
> +		return false;
> +
> +	if (max_level > AFMALLOC_MAX_LEVEL)
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool valid_flags(gfp_t flags)
> +{
> +	if (flags & __GFP_HIGHMEM)
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * afmalloc_create_pool - Creates an allocation pool to work from.
> + * @max_level: limit on the number of chunks that can make up one allocation
> + * @max_size: limit on total allocation size from this pool
> + * @flags: allocation flags used to allocate memory
> + *
> + * This function must be called before anything when using
> + * the afmalloc allocator.
> + *
> + * On success, a pointer to the newly created pool is returned,
> + * otherwise NULL.
> + */
> +struct afmalloc_pool *afmalloc_create_pool(int max_level, size_t max_size,
> +					gfp_t flags)
> +{
> +	struct afmalloc_pool *pool;
> +
> +	if (!valid_level(max_level))
> +		return NULL;
> +
> +	if (!valid_flags(flags))
> +		return NULL;
> +
> +	pool = kzalloc(sizeof(*pool), GFP_KERNEL);
> +	if (!pool)
> +		return NULL;
> +
> +	spin_lock_init(&pool->lock);
> +	pool->flags = flags;
> +	pool->max_level = max_level;
> +	pool->max_size = max_size;
> +	pool->size = 0;
> +
> +	return pool;
> +}
> +EXPORT_SYMBOL(afmalloc_create_pool);
> +
> +void afmalloc_destroy_pool(struct afmalloc_pool *pool)
> +{
> +	kfree(pool);
> +}
> +EXPORT_SYMBOL(afmalloc_destroy_pool);
> +
> +size_t afmalloc_get_used_pages(struct afmalloc_pool *pool)
> +{
> +	size_t size;
> +
> +	spin_lock(&pool->lock);
> +	size = pool->size >> PAGE_SHIFT;
> +	spin_unlock(&pool->lock);
> +
> +	return size;
> +}
> +EXPORT_SYMBOL(afmalloc_get_used_pages);
> +
> +static void free_entry(struct afmalloc_pool *pool, struct afmalloc_entry *entry,
> +			bool calc_size)
> +{
> +	int i;
> +	int level;
> +	int alloced;
> +
> +	if (is_direct_entry(entry)) {
> +		void *mem = direct_entry_to_mem(entry);
> +
> +		alloced = ksize(mem);
> +		kfree(mem);
> +		goto out;
> +	}
> +
> +	level = entry->level;
> +	alloced = entry->alloced;
> +	for (i = 0; i < level; i++)
> +		kfree(entry->mem[i]);
> +
> +	kfree(entry);
> +
> +out:
> +	if (calc_size && alloced) {
> +		spin_lock(&pool->lock);
> +		pool->size -= alloced;
> +		spin_unlock(&pool->lock);
> +	}
> +}
> +
> +static int calculate_level(struct afmalloc_pool *pool, size_t len)
> +{
> +	int level = 0;
> +	size_t down_size, up_size;
> +
> +	if (len <= afmalloc_OBJ_MIN_SIZE)
> +		goto out;
> +
> +	while (1) {
> +		down_size = rounddown_pow_of_two(len);
> +		if (down_size >= len)
> +			break;
> +
> +		up_size = roundup_pow_of_two(len);
> +		if (up_size - len <= afmalloc_OBJ_MIN_SIZE)
> +			break;
> +
> +		len -= down_size;
> +		level++;
> +	}
> +
> +out:
> +	level++;
> +	return min(level, pool->max_level);
> +}
> +
> +static int estimate_alloced(struct afmalloc_pool *pool, int level, size_t len)
> +{
> +	int i, alloced = 0;
> +	size_t size;
> +
> +	for (i = 0; i < level - 1; i++) {
> +		size = rounddown_pow_of_two(len);
> +		alloced += size;
> +		len -= size;
> +	}
> +
> +	if (len < afmalloc_OBJ_MIN_SIZE)
> +		size = afmalloc_OBJ_MIN_SIZE;
> +	else
> +		size = roundup_pow_of_two(len);
> +	alloced += size;
> +
> +	return alloced;
> +}
> +
> +static void *alloc_entry(struct afmalloc_pool *pool, size_t len)
> +{
> +	int i, level;
> +	size_t size;
> +	int alloced = 0;
> +	size_t remain = len;
> +	struct afmalloc_entry *entry;
> +	void *mem;
> +
> +	/*
> +	 * Determine whether the requested size is a power of 2 or not.
> +	 * If not, determine how many chunks are needed.
> +	 */
> +	level = calculate_level(pool, len);
> +	if (level == 1)
> +		goto alloc_direct_entry;
> +
> +	size = sizeof(void *) * level + sizeof(struct afmalloc_entry);
> +	entry = kzalloc(size, pool->flags);	/* zeroed: safe to free on error */
> +	if (!entry)
> +		return NULL;
> +
> +	size = ksize(entry);
> +	alloced += size;
> +
> +	/*
> +	 * Even if the request isn't for a power of 2 sized object, it can be
> +	 * cheaper to allocate a single power of 2 chunk than to pay for metadata.
> +	 */
> +	if (size + estimate_alloced(pool, level, len)
> +				>= roundup_pow_of_two(len)) {
> +		kfree(entry);
> +		goto alloc_direct_entry;
> +	}
> +
> +	entry->level = level;
> +	for (i = 0; i < level - 1; i++) {
> +		size = rounddown_pow_of_two(remain);
> +		entry->mem[i] = kmalloc(size, pool->flags);
> +		if (!entry->mem[i])
> +			goto err;
> +
> +		alloced += size;
> +		remain -= size;
> +	}
> +
> +	if (remain < afmalloc_OBJ_MIN_SIZE)
> +		size = afmalloc_OBJ_MIN_SIZE;
> +	else
> +		size = roundup_pow_of_two(remain);
> +	entry->mem[i] = kmalloc(size, pool->flags);
> +	if (!entry->mem[i])
> +		goto err;
> +
> +	alloced += size;
> +	entry->alloced = alloced;
> +	goto alloc_complete;
> +
> +alloc_direct_entry:
> +	mem = kmalloc(len, pool->flags);
> +	if (!mem)
> +		return NULL;
> +
> +	alloced = ksize(mem);
> +	entry = mem_to_direct_entry(mem);
> +
> +alloc_complete:
> +	spin_lock(&pool->lock);
> +	if (pool->size + alloced > pool->max_size) {
> +		spin_unlock(&pool->lock);
> +		goto err;
> +	}
> +
> +	pool->size += alloced;
> +	spin_unlock(&pool->lock);
> +
> +	return entry;
> +
> +err:
> +	free_entry(pool, entry, false);
> +
> +	return NULL;
> +}
> +
> +static bool valid_alloc_arg(size_t len)
> +{
> +	if (!len)
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * afmalloc_alloc - Allocate block of given length from pool
> + * @pool: pool to allocate the object from
> + * @len: length of block to allocate
> + *
> + * On success, handle to the allocated object is returned,
> + * otherwise 0.
> + */
> +unsigned long afmalloc_alloc(struct afmalloc_pool *pool, size_t len)
> +{
> +	struct afmalloc_entry *entry;
> +
> +	if (!valid_alloc_arg(len))
> +		return 0;
> +
> +	entry = alloc_entry(pool, len);
> +	if (!entry)
> +		return 0;
> +
> +	return entry_to_handle(entry);
> +}
> +EXPORT_SYMBOL(afmalloc_alloc);
> +
> +static void __afmalloc_free(struct afmalloc_pool *pool,
> +			struct afmalloc_entry *entry)
> +{
> +	free_entry(pool, entry, true);
> +}
> +
> +void afmalloc_free(struct afmalloc_pool *pool, unsigned long handle)
> +{
> +	struct afmalloc_entry *entry;
> +
> +	entry = handle_to_entry(handle);
> +	if (!entry)
> +		return;
> +
> +	__afmalloc_free(pool, entry);
> +}
> +EXPORT_SYMBOL(afmalloc_free);
> +
> +static void __afmalloc_store(struct afmalloc_pool *pool,
> +			struct afmalloc_entry *entry, void *src, size_t len)
> +{
> +	int i, level = entry->level;
> +	size_t size;
> +	size_t offset = 0;
> +
> +	if (is_direct_entry(entry)) {
> +		memcpy(direct_entry_to_mem(entry), src, len);
> +		return;
> +	}
> +
> +	for (i = 0; i < level - 1; i++) {
> +		size = rounddown_pow_of_two(len);
> +		memcpy(entry->mem[i], src + offset, size);
> +		offset += size;
> +		len -= size;
> +	}
> +	memcpy(entry->mem[i], src + offset, len);
> +}
> +
> +static bool valid_store_arg(struct afmalloc_entry *entry, void *src, size_t len)
> +{
> +	if (!entry)
> +		return false;
> +
> +	if (!src || !len)
> +		return false;
> +
> +	return true;
> +}
> +
> +/**
> + * afmalloc_store - store data into allocated object from handle.
> + * @pool: pool from which the object was allocated
> + * @handle: handle returned from afmalloc
> + * @src: memory address of source data
> + * @len: length in bytes of desired store
> + *
> + * To store data into an object allocated from afmalloc, it must be
> + * mapped before using it or accessed through afmalloc-specific
> + * load/store functions. These functions properly handle load/store
> + * requests according to afmalloc's internal structure.
> + */
> +size_t afmalloc_store(struct afmalloc_pool *pool, unsigned long handle,
> +			void *src, size_t len)
> +{
> +	struct afmalloc_entry *entry;
> +
> +	entry = handle_to_entry(handle);
> +	if (!valid_store_arg(entry, src, len))
> +		return 0;
> +
> +	__afmalloc_store(pool, entry, src, len);
> +
> +	return len;
> +}
> +EXPORT_SYMBOL(afmalloc_store);
> +
> +static void __afmalloc_load(struct afmalloc_pool *pool,
> +			struct afmalloc_entry *entry, void *dst, size_t len)
> +{
> +	int i, level = entry->level;
> +	size_t size;
> +	size_t offset = 0;
> +
> +	if (is_direct_entry(entry)) {
> +		memcpy(dst, direct_entry_to_mem(entry), len);
> +		return;
> +	}
> +
> +	for (i = 0; i < level - 1; i++) {
> +		size = rounddown_pow_of_two(len);
> +		memcpy(dst + offset, entry->mem[i], size);
> +		offset += size;
> +		len -= size;
> +	}
> +	memcpy(dst + offset, entry->mem[i], len);
> +}
> +
> +static bool valid_load_arg(struct afmalloc_entry *entry, void *dst, size_t len)
> +{
> +	if (!entry)
> +		return false;
> +
> +	if (!dst || !len)
> +		return false;
> +
> +	return true;
> +}
> +
> +size_t afmalloc_load(struct afmalloc_pool *pool, unsigned long handle,
> +		void *dst, size_t len)
> +{
> +	struct afmalloc_entry *entry;
> +
> +	entry = handle_to_entry(handle);
> +	if (!valid_load_arg(entry, dst, len))
> +		return 0;
> +
> +	__afmalloc_load(pool, entry, dst, len);
> +
> +	return len;
> +}
> +EXPORT_SYMBOL(afmalloc_load);
> +
> +/**
> + * afmalloc_map_handle - get address of allocated object from handle.
> + * @pool: pool from which the object was allocated
> + * @handle: handle returned from afmalloc
> + * @len: length in bytes of desired mapping
> + * @read_only: if true, data in the mapped region is not written back
> + *	into the object when it is unmapped
> + *
> + * Before using an object allocated from afmalloc, it must be mapped using
> + * this function. When done with the object, it must be unmapped using
> + * afmalloc_unmap_handle.
> + *
> + * Only one object can be mapped per cpu at a time. There is no protection
> + * against nested mappings.
> + *
> + * This function returns with preemption disabled.
> + */
> +void *afmalloc_map_handle(struct afmalloc_pool *pool, unsigned long handle,
> +			size_t len, bool read_only)
> +{
> +	int cpu;
> +	struct afmalloc_entry *entry;
> +	struct afmalloc_mapped_info *info;
> +	void *addr;
> +
> +	entry = handle_to_entry(handle);
> +	if (!entry)
> +		return NULL;
> +
> +	cpu = get_cpu();
> +	if (is_direct_entry(entry))
> +		return direct_entry_to_mem(entry);
> +
> +	info = per_cpu_ptr(mapped_info, cpu);
> +	addr = page_address(info->page);
> +	info->len = len;
> +	info->read_only = read_only;
> +	__afmalloc_load(pool, entry, addr, len);
> +	return addr;
> +}
> +EXPORT_SYMBOL(afmalloc_map_handle);
> +
> +void afmalloc_unmap_handle(struct afmalloc_pool *pool, unsigned long handle)
> +{
> +	struct afmalloc_entry *entry;
> +	struct afmalloc_mapped_info *info;
> +	void *addr;
> +
> +	entry = handle_to_entry(handle);
> +	if (!entry)
> +		return;
> +
> +	if (is_direct_entry(entry))
> +		goto out;
> +
> +	info = this_cpu_ptr(mapped_info);
> +	if (info->read_only)
> +		goto out;
> +
> +	addr = page_address(info->page);
> +	__afmalloc_store(pool, entry, addr, info->len);
> +
> +out:
> +	put_cpu();
> +}
> +EXPORT_SYMBOL(afmalloc_unmap_handle);
> +
> +static int __init afmalloc_init(void)
> +{
> +	int cpu;
> +
> +	mapped_info = alloc_percpu(struct afmalloc_mapped_info);
> +	if (!mapped_info)
> +		return -ENOMEM;
> +
> +	for_each_possible_cpu(cpu) {
> +		struct page *page;
> +
> +		page = alloc_pages(GFP_KERNEL, 0);
> +		if (!page)
> +			goto err;
> +
> +		per_cpu_ptr(mapped_info, cpu)->page = page;
> +	}
> +
> +	return 0;
> +
> +err:
> +	for_each_possible_cpu(cpu) {
> +		struct page *page;
> +
> +		page = per_cpu_ptr(mapped_info, cpu)->page;
> +		if (page)
> +			__free_pages(page, 0);
> +	}
> +	free_percpu(mapped_info);
> +	return -ENOMEM;
> +}
> +module_init(afmalloc_init);
> +
> +MODULE_AUTHOR("Joonsoo Kim <iamjoonsoo.kim@lge.com>");
> -- 
> 1.7.9.5
> 
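For reference, a minimal usage sketch of the API declared in
include/linux/afmalloc.h above. The function name, buffer sizes, pool limit
and GFP flags below are illustrative assumptions, not taken from the patch:

static int afmalloc_usage_example(void)
{
	struct afmalloc_pool *pool;
	unsigned long handle;
	char buf[400] = { 0 }, out[400];
	void *p;

	/* pool capped at 256MB of chunk memory; HIGHMEM flags are rejected */
	pool = afmalloc_create_pool(AFMALLOC_MAX_LEVEL, SZ_256M, GFP_NOIO);
	if (!pool)
		return -ENOMEM;

	/* 400 bytes ends up as 256 + 128 + 32 byte chunks behind one handle */
	handle = afmalloc_alloc(pool, sizeof(buf));
	if (!handle) {
		afmalloc_destroy_pool(pool);
		return -ENOMEM;
	}

	afmalloc_store(pool, handle, buf, sizeof(buf));	/* scatter into chunks */
	afmalloc_load(pool, handle, out, sizeof(out));	/* gather them back */

	/* or map the handle to get a contiguous, per-cpu view (read-only) */
	p = afmalloc_map_handle(pool, handle, sizeof(out), true);
	if (p)
		memcpy(out, p, sizeof(out));
	afmalloc_unmap_handle(pool, handle);

	afmalloc_free(pool, handle);
	afmalloc_destroy_pool(pool);
	return 0;
}

Note that afmalloc_map_handle() returns with preemption disabled, so the
mapped window should only be used for a short copy like the one above.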


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
  2014-09-29 15:41   ` Dan Streetman
@ 2014-10-02  5:47     ` Joonsoo Kim
  -1 siblings, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2014-10-02  5:47 UTC (permalink / raw)
  To: Dan Streetman
  Cc: Andrew Morton, Minchan Kim, Nitin Gupta, Linux-MM, linux-kernel,
	Jerome Marchand, Sergey Senozhatsky, Luigi Semenzato, Mel Gorman,
	Hugh Dickins

On Mon, Sep 29, 2014 at 11:41:45AM -0400, Dan Streetman wrote:
> On Fri, Sep 26, 2014 at 2:53 AM, Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> > WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> > If you know useful place to use this allocator, please let me know.
> >
> > This is brand-new allocator, called anti-fragmentation memory allocator
> > (aka afmalloc), in order to deal with arbitrary sized object allocation
> > efficiently. zram and zswap uses arbitrary sized object to store
> > compressed data so they can use this allocator. If there are any other
> > use cases, they can use it, too.
> >
> > This work is motivated by observation of fragmentation on zsmalloc which
> > intended for storing arbitrary sized object with low fragmentation.
> > Although it works well on allocation-intensive workload, memory could be
> > highly fragmented after many free occurs. In some cases, unused memory due
> > to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> > problem is that other subsystem cannot use these unused memory. These
> > fragmented memory are zsmalloc specific, so most of other subsystem cannot
> > use it until zspage is freed to page allocator.
> >
> > I guess that there are similar fragmentation problem in zbud, but, I
> > didn't deeply investigate it.
> >
> > This new allocator uses SLAB allocator to solve above problems. When
> > request comes, it returns handle that is pointer of metatdata to point
> > many small chunks. These small chunks are in power of 2 size and
> > build up whole requested memory. We can easily acquire these chunks
> > using SLAB allocator. Following is conceptual represetation of metadata
> > used in this allocator to help understanding of this allocator.
> >
> > Handle A for 400 bytes
> > {
> >         Pointer for 256 bytes chunk
> >         Pointer for 128 bytes chunk
> >         Pointer for 16 bytes chunk
> >
> >         (256 + 128 + 16 = 400)
> > }
> >
> > As you can see, 400 bytes memory are not contiguous in afmalloc so that
> > allocator specific store/load functions are needed. These require some
> > computation overhead and I guess that this is the only drawback this
> > allocator has.
> 
> This also requires additional memory copying, for each map/unmap, no?

Indeed.

> 
> >
> > For optimization, it uses another approach for power of 2 sized request.
> > Instead of returning handle for metadata, it adds tag on pointer from
> > SLAB allocator and directly returns this value as handle. With this tag,
> > afmalloc can recognize whether handle is for metadata or not and do proper
> > processing on it. This optimization can save some memory.
> >
> > Although afmalloc use some memory for metadata, overall utilization of
> > memory is really good due to zero internal fragmentation by using power
> > of 2 sized object. Although zsmalloc has many size class, there is
> > considerable internal fragmentation in zsmalloc.
> >
> > In workload that needs many free, memory could be fragmented like
> > zsmalloc, but, there is big difference. These unused portion of memory
> > are SLAB specific memory so that other subsystem can use it. Therefore,
> > fragmented memory could not be a big problem in this allocator.
> >
> > Extra benefit of this allocator design is NUMA awareness. This allocator
> > allocates real memory from SLAB allocator. SLAB considers client's NUMA
> > affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
> > and zbud which are backend of zram and zswap, respectively, are not NUMA
> > awareness so that remote node's memory could be returned to requestor.
> > I think that it could be solved easily if NUMA awareness turns out to be
> > real problem. But, it may enlarge fragmentation depending on number of
> > nodes. Anyway, there is no NUMA awareness issue in this allocator.
> >
> > Although I'd like to replace zsmalloc with this allocator, it cannot be
> > possible, because zsmalloc supports HIGHMEM. In 32-bits world, SLAB memory
> > would be very limited so supporting HIGHMEM would be really good advantage
> > of zsmalloc. Because there is no HIGHMEM in 32-bits low memory device or
> > 64-bits world, this allocator may be good option for this system. I
> > didn't deeply consider whether this allocator can replace zbud or not.
> 
> While it looks like there may be some situations that benefit from
> this, this won't work for all cases (as you mention), so maybe zpool
> can allow zram to choose between zsmalloc and afmalloc.

Yes. :)

> >
> > Below is the result of my simple test.
> > (zsmalloc used in experiments is patched with my previous patch:
> > zsmalloc: merge size_class to reduce fragmentation)
> >
> > TEST ENV: EXT4 on zram, mount with discard option
> > WORKLOAD: untar kernel source, remove dir in descending order in size.
> > (drivers arch fs sound include)
> >
> > Each line represents orig_data_size, compr_data_size, mem_used_total,
> > fragmentation overhead (mem_used - compr_data_size) and overhead ratio
> > (overhead to compr_data_size), respectively, after untar and remove
> > operation is executed. In afmalloc case, overhead is calculated by
> > before/after 'SUnreclaim' on /proc/meminfo.
> > And there are two more columns
> > in afmalloc, one is real_overhead which represents metadata usage and
> > overhead of internal fragmentation, and the other is a ratio,
> > real_overhead to compr_data_size. Unlike zsmalloc, only metadata and
> > internal fragmented memory cannot be used by other subsystem. So,
> > comparing real_overhead in afmalloc with overhead on zsmalloc seems to
> > be proper comparison.
> >
> > * untar-merge.out
> >
> > orig_size compr_size used_size overhead overhead_ratio
> > 526.23MB 199.18MB 209.81MB  10.64MB 5.34%
> > 288.68MB  97.45MB 104.08MB   6.63MB 6.80%
> > 177.68MB  61.14MB  66.93MB   5.79MB 9.47%
> > 146.83MB  47.34MB  52.79MB   5.45MB 11.51%
> > 124.52MB  38.87MB  44.30MB   5.43MB 13.96%
> > 104.29MB  31.70MB  36.83MB   5.13MB 16.19%
> >
> > * untar-afmalloc.out
> >
> > orig_size compr_size used_size overhead overhead_ratio real real-ratio
> > 526.27MB 199.18MB 206.37MB   8.00MB 4.02%   7.19MB 3.61%
> > 288.71MB  97.45MB 101.25MB   5.86MB 6.01%   3.80MB 3.90%
> > 177.71MB  61.14MB  63.44MB   4.39MB 7.19%   2.30MB 3.76%
> > 146.86MB  47.34MB  49.20MB   3.97MB 8.39%   1.86MB 3.93%
> > 124.55MB  38.88MB  40.41MB   3.71MB 9.54%   1.53MB 3.95%
> > 104.32MB  31.70MB  32.96MB   3.43MB 10.81%   1.26MB 3.96%
> >
> > As you can see above result, real_overhead_ratio in afmalloc is
> > just 3% ~ 4% while overhead_ratio on zsmalloc varies 5% ~ 17%.
> >
> > And, 4% ~ 11% overhead_ratio in afmalloc is also slightly better
> > than overhead_ratio in zsmalloc which is 5% ~ 17%.
> 
> I think the key will be scaling up this test more.  What does it look
> like when using 20G or more?

In fact, the main use of zram, that is, zram-swap, doesn't use 20G of
memory in the normal case. But I'd also like to know how well it scales,
so I will do that kind of testing if possible.

> 
> It certainly looks better when using (relatively) small amounts of data, though.

Yes.

Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
  2014-09-29 19:53   ` Seth Jennings
@ 2014-10-07  7:42     ` Joonsoo Kim
  -1 siblings, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2014-10-07  7:42 UTC (permalink / raw)
  To: Seth Jennings
  Cc: Joonsoo Kim, Andrew Morton, Minchan Kim, Nitin Gupta,
	Linux Memory Management List, LKML, Jerome Marchand,
	Sergey Senozhatsky, Dan Streetman, Luigi Semenzato, Mel Gorman,
	Hugh Dickins

Hello, Seth.
Sorry for the late response. :)

2014-09-30 4:53 GMT+09:00 Seth Jennings <sjennings@variantweb.net>:
> On Fri, Sep 26, 2014 at 03:53:14PM +0900, Joonsoo Kim wrote:
>> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
>> If you know useful place to use this allocator, please let me know.
>>
>> This is brand-new allocator, called anti-fragmentation memory allocator
>> (aka afmalloc), in order to deal with arbitrary sized object allocation
>> efficiently. zram and zswap uses arbitrary sized object to store
>> compressed data so they can use this allocator. If there are any other
>> use cases, they can use it, too.
>>
>> This work is motivated by observation of fragmentation on zsmalloc which
>> intended for storing arbitrary sized object with low fragmentation.
>> Although it works well on allocation-intensive workload, memory could be
>> highly fragmented after many free occurs. In some cases, unused memory due
>> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
>> problem is that other subsystem cannot use these unused memory. These
>> fragmented memory are zsmalloc specific, so most of other subsystem cannot
>> use it until zspage is freed to page allocator.
>
> Yes, zsmalloc has a fragmentation issue.  This has been a topic lately.
> I and others are looking at putting compaction logic into zsmalloc to
> help with this.
>
>>
>> I guess that there are similar fragmentation problem in zbud, but, I
>> didn't deeply investigate it.
>>
>> This new allocator uses SLAB allocator to solve above problems. When
>> request comes, it returns handle that is pointer of metatdata to point
>> many small chunks. These small chunks are in power of 2 size and
>> build up whole requested memory. We can easily acquire these chunks
>> using SLAB allocator. Following is conceptual represetation of metadata
>> used in this allocator to help understanding of this allocator.
>>
>> Handle A for 400 bytes
>> {
>>       Pointer for 256 bytes chunk
>>       Pointer for 128 bytes chunk
>>       Pointer for 16 bytes chunk
>>
>>       (256 + 128 + 16 = 400)
>> }
>>
>> As you can see, 400 bytes memory are not contiguous in afmalloc so that
>> allocator specific store/load functions are needed. These require some
>> computation overhead and I guess that this is the only drawback this
>> allocator has.
>
> One problem with using the SLAB allocator is that kmalloc caches greater
> than 256 bytes, at least on my x86_64 machine, have slabs that require
> high order page allocations, which are going to be really hard to come
> by in the memory stressed environment in which zswap/zram are expected
> to operate.  I guess you could max out at 256 byte chunks to overcome
> this.  However, if you have a 3k object, that would require copying 12
> chunks from potentially 12 different pages into a contiguous area at
> mapping time and a larger metadata size.

SLUB uses high-order allocations by default, but it has a fallback method:
it falls back to a low-order allocation if the high-order allocation fails.
So we don't need to worry about high-order allocations.

>>
>> For optimization, it uses another approach for power of 2 sized request.
>> Instead of returning handle for metadata, it adds tag on pointer from
>> SLAB allocator and directly returns this value as handle. With this tag,
>> afmalloc can recognize whether handle is for metadata or not and do proper
>> processing on it. This optimization can save some memory.
>>
>> Although afmalloc use some memory for metadata, overall utilization of
>> memory is really good due to zero internal fragmentation by using power
>
> Smallest kmalloc cache is 8 bytes so up to 7 bytes of internal
> fragmentation per object right?  If so, "near zero".
>
>> of 2 sized object. Although zsmalloc has many size class, there is
>> considerable internal fragmentation in zsmalloc.
>
> Lets put a number on it. Internal fragmentation on objects with size >
> ZS_MIN_ALLOC_SIZE is ZS_SIZE_CLASS_DELTA-1, which is 15 bytes with
> PAGE_SIZE of 4k.  If the allocation is less than ZS_MIN_ALLOC_SIZE,
> fragmentation could be as high as ZS_MIN_ALLOC_SIZE-1 which is 31 on a
> 64-bit system with 4k pages.  (Note: I don't think that is it possible to
> compress a 4k page to less than 32 bytes, so for zswap, there will be no
> allocations in this size range).
>
> So we are looking at up to 7 vs 15 bytes of internal fragmentation per
> object in the case when allocations are > ZS_MIN_ALLOC_SIZE.  Once you
> take into account the per-object metadata overhead of afmalloc, I think
> zsmalloc comes out ahead here.

Sorry for the misleading word usage.
What I meant is the unused space at the end of a zspage when the zspage
isn't perfectly divided. For example, consider the 2064 byte size_class.
Its zspage would be 4 pages and can hold only 7 objects at maximum. The
remainder is 1936 bytes and we can't use this space; that is about 11% of
the total space in the zspage. If we only use power of 2 sizes, there is no
remainder and no unused space of this type.
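
A minimal userspace sketch of that arithmetic (the 4KB page size, 4-page
zspage and 2064 byte class are just the numbers from the example above;
this is not zsmalloc code):

#include <stdio.h>

int main(void)
{
	const unsigned long page_size = 4096;
	const unsigned long zspage_pages = 4;
	const unsigned long obj_size = 2064;	/* the size_class in question */

	unsigned long zspage_size = page_size * zspage_pages;	/* 16384 */
	unsigned long objs = zspage_size / obj_size;		/* 7 objects */
	unsigned long waste = zspage_size - objs * obj_size;	/* 1936 bytes */

	printf("%lu objects per zspage, %lu bytes (%.1f%%) unusable\n",
	       objs, waste, 100.0 * waste / zspage_size);
	return 0;
}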

>>
>> In workload that needs many free, memory could be fragmented like
>> zsmalloc, but, there is big difference. These unused portion of memory
>> are SLAB specific memory so that other subsystem can use it. Therefore,
>> fragmented memory could not be a big problem in this allocator.
>
> While freeing chunks back to the slab allocator does make that memory
> available to other _kernel_ users, the fragmentation problem is just
> moved one level down.  The fragmentation will exist in the slabs and
> those fragmented slabs won't be freed to the page allocator, which would
> make them available to _any_ user, not just the kernel.  Additionally,
> there is little visibility into how chunks are organized in the slab,
> making compaction at the afmalloc level nearly impossible.  (The only
> visibility being the address returned by kmalloc())

Okay. Freeing objects back to the slab subsystem isn't a perfect solution,
but it is better than the current situation.

And I think that afmalloc could be compacted using just the returned
addresses. My idea is to sort the chunks by memory address and copy their
contents into a temporary buffer in ascending order. After the copy is
complete, the chunks can be freed. These freed objects would be in a
contiguous range, so the SLAB would free their slabs back to the page
allocator. After some frees are done, we allocate chunks from the SLAB again
and copy the contents from the temporary buffers into the newly allocated
chunks. The new chunks would be placed into already fragmented slabs, so
fragmentation would be reduced. A rough sketch of the per-entry step is
below.
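
A rough, untested sketch of that per-entry step, assuming the caller knows
the originally stored length and supplies a temporary buffer (locking,
pool->size accounting and error recovery are left out; none of this is in
the posted patch):

static int afmalloc_compact_entry(struct afmalloc_pool *pool,
				  struct afmalloc_entry *entry,
				  void *tmp, size_t len)
{
	size_t remain = len;
	size_t size;
	int i;

	if (is_direct_entry(entry))
		return 0;

	/* 1. stash the contents of the old chunks */
	__afmalloc_load(pool, entry, tmp, len);

	/* 2. free the old chunks; slabs that become empty can go back to
	 *    the page allocator */
	for (i = 0; i < entry->level; i++)
		kfree(entry->mem[i]);

	/* 3. reallocate; SLAB fills partially used slabs first, so the new
	 *    chunks land in already fragmented slabs */
	for (i = 0; i < entry->level - 1; i++) {
		size = rounddown_pow_of_two(remain);
		entry->mem[i] = kmalloc(size, pool->flags);
		if (!entry->mem[i])
			return -ENOMEM;
		remain -= size;
	}
	size = max_t(size_t, roundup_pow_of_two(remain), afmalloc_OBJ_MIN_SIZE);
	entry->mem[i] = kmalloc(size, pool->flags);
	if (!entry->mem[i])
		return -ENOMEM;

	/* 4. copy the contents back into the new chunks */
	__afmalloc_store(pool, entry, tmp, len);
	return 0;
}

An outer loop that walks all live entries (which the posted pool does not
track yet) and processes them in chunk-address order would complete the
picture.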

>>
>> Extra benefit of this allocator design is NUMA awareness. This allocator
>> allocates real memory from SLAB allocator. SLAB considers client's NUMA
>> affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
>> and zbud which are backend of zram and zswap, respectively, are not NUMA
>> awareness so that remote node's memory could be returned to requestor.
>> I think that it could be solved easily if NUMA awareness turns out to be
>> real problem. But, it may enlarge fragmentation depending on number of
>> nodes. Anyway, there is no NUMA awareness issue in this allocator.
>>
>> Although I'd like to replace zsmalloc with this allocator, it cannot be
>> possible, because zsmalloc supports HIGHMEM. In 32-bits world, SLAB memory
>> would be very limited so supporting HIGHMEM would be really good advantage
>> of zsmalloc. Because there is no HIGHMEM in 32-bits low memory device or
>> 64-bits world, this allocator may be good option for this system. I
>> didn't deeply consider whether this allocator can replace zbud or not.
>>
>> Below is the result of my simple test.
>> (zsmalloc used in experiments is patched with my previous patch:
>> zsmalloc: merge size_class to reduce fragmentation)
>>
>> TEST ENV: EXT4 on zram, mount with discard option
>> WORKLOAD: untar kernel source, remove dir in descending order in size.
>> (drivers arch fs sound include)
>>
>> Each line represents orig_data_size, compr_data_size, mem_used_total,
>> fragmentation overhead (mem_used - compr_data_size) and overhead ratio
>> (overhead to compr_data_size), respectively, after untar and remove
>> operation is executed. In afmalloc case, overhead is calculated by
>> before/after 'SUnreclaim' on /proc/meminfo. And there are two more columns
>> in afmalloc, one is real_overhead which represents metadata usage and
>> overhead of internal fragmentation, and the other is a ratio,
>> real_overhead to compr_data_size. Unlike zsmalloc, only metadata and
>> internal fragmented memory cannot be used by other subsystem. So,
>> comparing real_overhead in afmalloc with overhead on zsmalloc seems to
>> be proper comparison.
>
> See last comment about why the real measure of memory usage should be
> total pages not returned to the page allocator.  I don't consider chunks
> freed to the slab allocator to be truly freed unless the slab containing
> the chunks is also freed to the page allocator.
>
> The closest thing I can think of to measure the memory utilization of
> this allocator is, for each kmalloc cache, do a before/after of how many
> slabs are in the cache, then multiply that delta by pagesperslab and sum
> the results.  This would give a rough measure of the number of pages
> utilized in the slab allocator either by or as a result of afmalloc.
> Of course, there will be noise from other components doing allocations
> during the time between the before and after measurement.

That is already in the benchmark result below: the overhead and overhead
ratio in the untar-afmalloc.out result are measured from the number of pages
allocated by the SLAB. You can see that the overhead and overhead ratio of
afmalloc are lower than zsmalloc's even by this metric. A rough userspace
helper for this kind of measurement is sketched below.
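
For what it's worth, a rough userspace helper for that kind of measurement,
summing pagesperslab * num_slabs over the kmalloc-* caches in /proc/slabinfo
(format 2.1, usually needs root); run it before and after the test and take
the difference:

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/slabinfo", "r");
	char line[512];
	unsigned long total_pages = 0;

	if (!f)
		return 1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long pagesperslab, num_slabs;
		char *p = strstr(line, "slabdata");

		if (strncmp(line, "kmalloc-", 8))
			continue;
		/* name active_objs num_objs objsize objperslab pagesperslab */
		if (sscanf(line, "%*s %*lu %*lu %*lu %*lu %lu",
			   &pagesperslab) != 1)
			continue;
		/* ... : slabdata <active_slabs> <num_slabs> <sharedavail> */
		if (!p || sscanf(p, "slabdata %*lu %lu", &num_slabs) != 1)
			continue;

		total_pages += pagesperslab * num_slabs;
	}
	fclose(f);

	printf("kmalloc-* caches: %lu pages\n", total_pages);
	return 0;
}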

Thanks.

> Seth
>
>>
>> * untar-merge.out
>>
>> orig_size compr_size used_size overhead overhead_ratio
>> 526.23MB 199.18MB 209.81MB  10.64MB 5.34%
>> 288.68MB  97.45MB 104.08MB   6.63MB 6.80%
>> 177.68MB  61.14MB  66.93MB   5.79MB 9.47%
>> 146.83MB  47.34MB  52.79MB   5.45MB 11.51%
>> 124.52MB  38.87MB  44.30MB   5.43MB 13.96%
>> 104.29MB  31.70MB  36.83MB   5.13MB 16.19%
>>
>> * untar-afmalloc.out
>>
>> orig_size compr_size used_size overhead overhead_ratio real real-ratio
>> 526.27MB 199.18MB 206.37MB   8.00MB 4.02%   7.19MB 3.61%
>> 288.71MB  97.45MB 101.25MB   5.86MB 6.01%   3.80MB 3.90%
>> 177.71MB  61.14MB  63.44MB   4.39MB 7.19%   2.30MB 3.76%
>> 146.86MB  47.34MB  49.20MB   3.97MB 8.39%   1.86MB 3.93%
>> 124.55MB  38.88MB  40.41MB   3.71MB 9.54%   1.53MB 3.95%
>> 104.32MB  31.70MB  32.96MB   3.43MB 10.81%   1.26MB 3.96%
>>
>> As you can see above result, real_overhead_ratio in afmalloc is
>> just 3% ~ 4% while overhead_ratio on zsmalloc varies 5% ~ 17%.
>>
>> And, 4% ~ 11% overhead_ratio in afmalloc is also slightly better
>> than overhead_ratio in zsmalloc which is 5% ~ 17%.
>>
>> Below is another simple test to check fragmentation effect in alloc/free
>> repetition workload.
>>
>> TEST ENV: EXT4 on zram, mount with discard option
>> WORKLOAD: untar kernel source, remove dir in descending order in size
>> (drivers arch fs sound include). Repeat this untar and remove 10 times.
>>
>> * untar-merge.out
>>
>> orig_size compr_size used_size overhead overhead_ratio
>> 526.24MB 199.18MB 209.79MB  10.61MB 5.33%
>> 288.69MB  97.45MB 104.09MB   6.64MB 6.81%
>> 177.69MB  61.14MB  66.89MB   5.75MB 9.40%
>> 146.84MB  47.34MB  52.77MB   5.43MB 11.46%
>> 124.53MB  38.88MB  44.28MB   5.40MB 13.90%
>> 104.29MB  31.71MB  36.87MB   5.17MB 16.29%
>> 535.59MB 200.30MB 211.77MB  11.47MB 5.73%
>> 294.84MB  98.28MB 106.24MB   7.97MB 8.11%
>> 179.99MB  61.58MB  69.34MB   7.76MB 12.60%
>> 148.67MB  47.75MB  55.19MB   7.43MB 15.57%
>> 125.98MB  39.26MB  46.62MB   7.36MB 18.75%
>> 105.05MB  32.03MB  39.18MB   7.15MB 22.32%
>> (snip...)
>> 535.59MB 200.31MB 211.88MB  11.57MB 5.77%
>> 294.84MB  98.28MB 106.62MB   8.34MB 8.49%
>> 179.99MB  61.59MB  73.83MB  12.24MB 19.88%
>> 148.67MB  47.76MB  59.58MB  11.82MB 24.76%
>> 125.98MB  39.27MB  51.10MB  11.84MB 30.14%
>> 105.05MB  32.04MB  43.68MB  11.64MB 36.31%
>> 535.59MB 200.31MB 211.89MB  11.58MB 5.78%
>> 294.84MB  98.28MB 106.68MB   8.40MB 8.55%
>> 179.99MB  61.59MB  74.14MB  12.55MB 20.37%
>> 148.67MB  47.76MB  59.94MB  12.18MB 25.50%
>> 125.98MB  39.27MB  51.46MB  12.19MB 31.04%
>> 105.05MB  32.04MB  44.01MB  11.97MB 37.35%
>>
>> * untar-afmalloc.out
>>
>> orig_size compr_size used_size overhead overhead_ratio real real-ratio
>> 526.23MB 199.17MB 206.36MB   8.02MB 4.03%   7.19MB 3.61%
>> 288.68MB  97.45MB 101.25MB   5.42MB 5.56%   3.80MB 3.90%
>> 177.68MB  61.14MB  63.43MB   4.00MB 6.54%   2.30MB 3.76%
>> 146.83MB  47.34MB  49.20MB   3.66MB 7.74%   1.86MB 3.93%
>> 124.52MB  38.87MB  40.41MB   3.33MB 8.57%   1.54MB 3.96%
>> 104.29MB  31.70MB  32.95MB   3.23MB 10.19%   1.26MB 3.97%
>> 535.59MB 200.30MB 207.59MB   9.21MB 4.60%   7.29MB 3.64%
>> 294.84MB  98.27MB 102.14MB   6.23MB 6.34%   3.87MB 3.94%
>> 179.99MB  61.58MB  63.91MB   4.98MB 8.09%   2.33MB 3.78%
>> 148.67MB  47.75MB  49.64MB   4.48MB 9.37%   1.89MB 3.95%
>> 125.98MB  39.26MB  40.82MB   4.23MB 10.78%   1.56MB 3.97%
>> 105.05MB  32.03MB  33.30MB   4.10MB 12.81%   1.27MB 3.98%
>> (snip...)
>> 535.59MB 200.30MB 207.60MB   8.94MB 4.46%   7.29MB 3.64%
>> 294.84MB  98.27MB 102.14MB   6.19MB 6.29%   3.87MB 3.94%
>> 179.99MB  61.58MB  63.91MB   8.25MB 13.39%   2.33MB 3.79%
>> 148.67MB  47.75MB  49.64MB   7.98MB 16.71%   1.89MB 3.96%
>> 125.98MB  39.26MB  40.82MB   7.52MB 19.15%   1.56MB 3.98%
>> 105.05MB  32.03MB  33.31MB   7.04MB 21.97%   1.28MB 3.98%
>> 535.59MB 200.31MB 207.60MB   9.26MB 4.62%   7.30MB 3.64%
>> 294.84MB  98.28MB 102.15MB   6.85MB 6.97%   3.87MB 3.94%
>> 179.99MB  61.58MB  63.91MB   9.08MB 14.74%   2.33MB 3.79%
>> 148.67MB  47.75MB  49.64MB   8.77MB 18.36%   1.89MB 3.96%
>> 125.98MB  39.26MB  40.82MB   8.35MB 21.28%   1.56MB 3.98%
>> 105.05MB  32.03MB  33.31MB   8.24MB 25.71%   1.28MB 3.98%
>>
>> As you can see above result, fragmentation grows continuously at each run.
>> But, real_overhead_ratio in afmalloc is always just 3% ~ 4%,
>> while overhead_ratio on zsmalloc varies 5% ~ 38%.
>> Fragmented slab memory can be used for other system, so we don't
>> have to much worry about overhead metric in afmalloc. Anyway, overhead
>> metric is also better in afmalloc, 4% ~ 26%.
>>
>> As a result, I think that afmalloc is better than zsmalloc in terms of
>> memory efficiency. But, I could be wrong so any comments are welcome. :)
>>
>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> ---
>>  include/linux/afmalloc.h |   21 ++
>>  mm/Kconfig               |    7 +
>>  mm/Makefile              |    1 +
>>  mm/afmalloc.c            |  590 ++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 619 insertions(+)
>>  create mode 100644 include/linux/afmalloc.h
>>  create mode 100644 mm/afmalloc.c

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
@ 2014-10-07  7:42     ` Joonsoo Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2014-10-07  7:42 UTC (permalink / raw)
  To: Seth Jennings
  Cc: Joonsoo Kim, Andrew Morton, Minchan Kim, Nitin Gupta,
	Linux Memory Management List, LKML, Jerome Marchand,
	Sergey Senozhatsky, Dan Streetman, Luigi Semenzato, Mel Gorman,
	Hugh Dickins

Hello, Seth.
Sorry for late response. :)

2014-09-30 4:53 GMT+09:00 Seth Jennings <sjennings@variantweb.net>:
> On Fri, Sep 26, 2014 at 03:53:14PM +0900, Joonsoo Kim wrote:
>> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
>> If you know useful place to use this allocator, please let me know.
>>
>> This is brand-new allocator, called anti-fragmentation memory allocator
>> (aka afmalloc), in order to deal with arbitrary sized object allocation
>> efficiently. zram and zswap uses arbitrary sized object to store
>> compressed data so they can use this allocator. If there are any other
>> use cases, they can use it, too.
>>
>> This work is motivated by observation of fragmentation on zsmalloc which
>> intended for storing arbitrary sized object with low fragmentation.
>> Although it works well on allocation-intensive workload, memory could be
>> highly fragmented after many free occurs. In some cases, unused memory due
>> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
>> problem is that other subsystem cannot use these unused memory. These
>> fragmented memory are zsmalloc specific, so most of other subsystem cannot
>> use it until zspage is freed to page allocator.
>
> Yes, zsmalloc has a fragmentation issue.  This has been a topic lately.
> I and others are looking at putting compaction logic into zsmalloc to
> help with this.
>
>>
>> I guess that there are similar fragmentation problem in zbud, but, I
>> didn't deeply investigate it.
>>
>> This new allocator uses SLAB allocator to solve above problems. When
>> request comes, it returns handle that is pointer of metatdata to point
>> many small chunks. These small chunks are in power of 2 size and
>> build up whole requested memory. We can easily acquire these chunks
>> using SLAB allocator. Following is conceptual represetation of metadata
>> used in this allocator to help understanding of this allocator.
>>
>> Handle A for 400 bytes
>> {
>>       Pointer for 256 bytes chunk
>>       Pointer for 128 bytes chunk
>>       Pointer for 16 bytes chunk
>>
>>       (256 + 128 + 16 = 400)
>> }
>>
>> As you can see, 400 bytes memory are not contiguous in afmalloc so that
>> allocator specific store/load functions are needed. These require some
>> computation overhead and I guess that this is the only drawback this
>> allocator has.
>
> One problem with using the SLAB allocator is that kmalloc caches greater
> than 256 bytes, at least on my x86_64 machine, have slabs that require
> high order page allocations, which are going to be really hard to come
> by in the memory stressed environment in which zswap/zram are expected
> to operate.  I guess you could max out at 256 byte chunks to overcome
> this.  However, if you have a 3k object, that would require copying 12
> chunks from potentially 12 different pages into a contiguous area at
> mapping time and a larger metadata size.

SLUB uses high order allocation by default, but, it has fallback method. It
uses low order allocation if failed with high order allocation. So, we don't
need to worry about high order allocation.

>>
>> For optimization, it uses another approach for power of 2 sized request.
>> Instead of returning handle for metadata, it adds tag on pointer from
>> SLAB allocator and directly returns this value as handle. With this tag,
>> afmalloc can recognize whether handle is for metadata or not and do proper
>> processing on it. This optimization can save some memory.
>>
>> Although afmalloc use some memory for metadata, overall utilization of
>> memory is really good due to zero internal fragmentation by using power
>
> Smallest kmalloc cache is 8 bytes so up to 7 bytes of internal
> fragmentation per object right?  If so, "near zero".
>
>> of 2 sized object. Although zsmalloc has many size class, there is
>> considerable internal fragmentation in zsmalloc.
>
> Lets put a number on it. Internal fragmentation on objects with size >
> ZS_MIN_ALLOC_SIZE is ZS_SIZE_CLASS_DELTA-1, which is 15 bytes with
> PAGE_SIZE of 4k.  If the allocation is less than ZS_MIN_ALLOC_SIZE,
> fragmentation could be as high as ZS_MIN_ALLOC_SIZE-1 which is 31 on a
> 64-bit system with 4k pages.  (Note: I don't think that is it possible to
> compress a 4k page to less than 32 bytes, so for zswap, there will be no
> allocations in this size range).
>
> So we are looking at up to 7 vs 15 bytes of internal fragmentation per
> object in the case when allocations are > ZS_MIN_ALLOC_SIZE.  Once you
> take into account the per-object metadata overhead of afmalloc, I think
> zsmalloc comes out ahead here.

Sorry for misleading word usage.
What I want to tell is that the unused space at the end of zspage when
zspage isn't perfectly divided. For example, think about 2064 bytes size_class.
It's zspage would be 4 pages and it can have only 7 objects at maximum.
Remainder is 1936 bytes and we can't use this space. This is 11% of total
space on zspage. If we only use power of 2 size, there is no remainder and
no this type of unused space.

>>
>> In workload that needs many free, memory could be fragmented like
>> zsmalloc, but, there is big difference. These unused portion of memory
>> are SLAB specific memory so that other subsystem can use it. Therefore,
>> fragmented memory could not be a big problem in this allocator.
>
> While freeing chunks back to the slab allocator does make that memory
> available to other _kernel_ users, the fragmentation problem is just
> moved one level down.  The fragmentation will exist in the slabs and
> those fragmented slabs won't be freed to the page allocator, which would
> make them available to _any_ user, not just the kernel.  Additionally,
> there is little visibility into how chunks are organized in the slab,
> making compaction at the afmalloc level nearly impossible.  (The only
> visibility being the address returned by kmalloc())

Okay. Free objects in slab subsystem isn't perfect solution, but, it is better
than current situation.

And, I think that afmalloc could be compacted just with returned address.
My idea is sorting chunks by memory address and copying their contents
to temporary buffer in ascending order. After copy is complete, chunks could
be freed. These freed objects would be in contiguous range so SLAB would
free the slab to the page allocator. After some free are done, we allocate
chunks from SLAB again and copy contents in temporary buffers to these
newly allocated chunks. These chunks would be positioned in fragmented
slab so that fragmentation would be reduced.

>>
>> Extra benefit of this allocator design is NUMA awareness. This allocator
>> allocates real memory from SLAB allocator. SLAB considers client's NUMA
>> affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
>> and zbud which are backend of zram and zswap, respectively, are not NUMA
>> awareness so that remote node's memory could be returned to requestor.
>> I think that it could be solved easily if NUMA awareness turns out to be
>> real problem. But, it may enlarge fragmentation depending on number of
>> nodes. Anyway, there is no NUMA awareness issue in this allocator.
>>
>> Although I'd like to replace zsmalloc with this allocator, it cannot be
>> possible, because zsmalloc supports HIGHMEM. In 32-bits world, SLAB memory
>> would be very limited so supporting HIGHMEM would be really good advantage
>> of zsmalloc. Because there is no HIGHMEM in 32-bits low memory device or
>> 64-bits world, this allocator may be good option for this system. I
>> didn't deeply consider whether this allocator can replace zbud or not.
>>
>> Below is the result of my simple test.
>> (zsmalloc used in experiments is patched with my previous patch:
>> zsmalloc: merge size_class to reduce fragmentation)
>>
>> TEST ENV: EXT4 on zram, mount with discard option
>> WORKLOAD: untar kernel source, remove dir in descending order in size.
>> (drivers arch fs sound include)
>>
>> Each line represents orig_data_size, compr_data_size, mem_used_total,
>> fragmentation overhead (mem_used - compr_data_size) and overhead ratio
>> (overhead to compr_data_size), respectively, after untar and remove
>> operation is executed. In afmalloc case, overhead is calculated by
>> before/after 'SUnreclaim' on /proc/meminfo. And there are two more columns
>> in afmalloc, one is real_overhead which represents metadata usage and
>> overhead of internal fragmentation, and the other is a ratio,
>> real_overhead to compr_data_size. Unlike zsmalloc, only metadata and
>> internal fragmented memory cannot be used by other subsystem. So,
>> comparing real_overhead in afmalloc with overhead on zsmalloc seems to
>> be proper comparison.
>
> See last comment about why the real measure of memory usage should be
> total pages not returned to the page allocator.  I don't consider chunks
> freed to the slab allocator to be truly freed unless the slab containing
> the chunks is also freed to the page allocator.
>
> The closest thing I can think of to measure the memory utilization of
> this allocator is, for each kmalloc cache, do a before/after of how many
> slabs are in the cache, then multiply that delta by pagesperslab and sum
> the results.  This would give a rough measure of the number of pages
> utilized in the slab allocator either by or as a result of afmalloc.
> Of course, there will be noise from other components doing allocations
> during the time between the before and after measurement.

It is already in the benchmark results below. The overhead and overhead
ratio in the untar-afmalloc.out results are measured by the number of
pages allocated by SLAB. You can see that the overhead and overhead ratio
of afmalloc are lower than zsmalloc's even by this metric.
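
For reference, the before/after 'SUnreclaim' value can be read with
something like the following stand-alone sketch (only an illustration of
the methodology, not the actual test script):

/* read SUnreclaim from /proc/meminfo; run before and after the test
 * and subtract the two values to get the slab overhead */
#include <stdio.h>

static long sunreclaim_kb(void)
{
	char line[256];
	long kb = -1;
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "SUnreclaim: %ld kB", &kb) == 1)
			break;
	fclose(f);
	return kb;
}

int main(void)
{
	printf("SUnreclaim: %ld kB\n", sunreclaim_kb());
	return 0;
}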

Thanks.

> Seth
>
>>
>> * untar-merge.out
>>
>> orig_size compr_size used_size overhead overhead_ratio
>> 526.23MB 199.18MB 209.81MB  10.64MB 5.34%
>> 288.68MB  97.45MB 104.08MB   6.63MB 6.80%
>> 177.68MB  61.14MB  66.93MB   5.79MB 9.47%
>> 146.83MB  47.34MB  52.79MB   5.45MB 11.51%
>> 124.52MB  38.87MB  44.30MB   5.43MB 13.96%
>> 104.29MB  31.70MB  36.83MB   5.13MB 16.19%
>>
>> * untar-afmalloc.out
>>
>> orig_size compr_size used_size overhead overhead_ratio real real-ratio
>> 526.27MB 199.18MB 206.37MB   8.00MB 4.02%   7.19MB 3.61%
>> 288.71MB  97.45MB 101.25MB   5.86MB 6.01%   3.80MB 3.90%
>> 177.71MB  61.14MB  63.44MB   4.39MB 7.19%   2.30MB 3.76%
>> 146.86MB  47.34MB  49.20MB   3.97MB 8.39%   1.86MB 3.93%
>> 124.55MB  38.88MB  40.41MB   3.71MB 9.54%   1.53MB 3.95%
>> 104.32MB  31.70MB  32.96MB   3.43MB 10.81%   1.26MB 3.96%
>>
>> As you can see above result, real_overhead_ratio in afmalloc is
>> just 3% ~ 4% while overhead_ratio on zsmalloc varies 5% ~ 17%.
>>
>> And, 4% ~ 11% overhead_ratio in afmalloc is also slightly better
>> than overhead_ratio in zsmalloc which is 5% ~ 17%.
>>
>> Below is another simple test to check fragmentation effect in alloc/free
>> repetition workload.
>>
>> TEST ENV: EXT4 on zram, mount with discard option
>> WORKLOAD: untar kernel source, remove dir in descending order in size
>> (drivers arch fs sound include). Repeat this untar and remove 10 times.
>>
>> * untar-merge.out
>>
>> orig_size compr_size used_size overhead overhead_ratio
>> 526.24MB 199.18MB 209.79MB  10.61MB 5.33%
>> 288.69MB  97.45MB 104.09MB   6.64MB 6.81%
>> 177.69MB  61.14MB  66.89MB   5.75MB 9.40%
>> 146.84MB  47.34MB  52.77MB   5.43MB 11.46%
>> 124.53MB  38.88MB  44.28MB   5.40MB 13.90%
>> 104.29MB  31.71MB  36.87MB   5.17MB 16.29%
>> 535.59MB 200.30MB 211.77MB  11.47MB 5.73%
>> 294.84MB  98.28MB 106.24MB   7.97MB 8.11%
>> 179.99MB  61.58MB  69.34MB   7.76MB 12.60%
>> 148.67MB  47.75MB  55.19MB   7.43MB 15.57%
>> 125.98MB  39.26MB  46.62MB   7.36MB 18.75%
>> 105.05MB  32.03MB  39.18MB   7.15MB 22.32%
>> (snip...)
>> 535.59MB 200.31MB 211.88MB  11.57MB 5.77%
>> 294.84MB  98.28MB 106.62MB   8.34MB 8.49%
>> 179.99MB  61.59MB  73.83MB  12.24MB 19.88%
>> 148.67MB  47.76MB  59.58MB  11.82MB 24.76%
>> 125.98MB  39.27MB  51.10MB  11.84MB 30.14%
>> 105.05MB  32.04MB  43.68MB  11.64MB 36.31%
>> 535.59MB 200.31MB 211.89MB  11.58MB 5.78%
>> 294.84MB  98.28MB 106.68MB   8.40MB 8.55%
>> 179.99MB  61.59MB  74.14MB  12.55MB 20.37%
>> 148.67MB  47.76MB  59.94MB  12.18MB 25.50%
>> 125.98MB  39.27MB  51.46MB  12.19MB 31.04%
>> 105.05MB  32.04MB  44.01MB  11.97MB 37.35%
>>
>> * untar-afmalloc.out
>>
>> orig_size compr_size used_size overhead overhead_ratio real real-ratio
>> 526.23MB 199.17MB 206.36MB   8.02MB 4.03%   7.19MB 3.61%
>> 288.68MB  97.45MB 101.25MB   5.42MB 5.56%   3.80MB 3.90%
>> 177.68MB  61.14MB  63.43MB   4.00MB 6.54%   2.30MB 3.76%
>> 146.83MB  47.34MB  49.20MB   3.66MB 7.74%   1.86MB 3.93%
>> 124.52MB  38.87MB  40.41MB   3.33MB 8.57%   1.54MB 3.96%
>> 104.29MB  31.70MB  32.95MB   3.23MB 10.19%   1.26MB 3.97%
>> 535.59MB 200.30MB 207.59MB   9.21MB 4.60%   7.29MB 3.64%
>> 294.84MB  98.27MB 102.14MB   6.23MB 6.34%   3.87MB 3.94%
>> 179.99MB  61.58MB  63.91MB   4.98MB 8.09%   2.33MB 3.78%
>> 148.67MB  47.75MB  49.64MB   4.48MB 9.37%   1.89MB 3.95%
>> 125.98MB  39.26MB  40.82MB   4.23MB 10.78%   1.56MB 3.97%
>> 105.05MB  32.03MB  33.30MB   4.10MB 12.81%   1.27MB 3.98%
>> (snip...)
>> 535.59MB 200.30MB 207.60MB   8.94MB 4.46%   7.29MB 3.64%
>> 294.84MB  98.27MB 102.14MB   6.19MB 6.29%   3.87MB 3.94%
>> 179.99MB  61.58MB  63.91MB   8.25MB 13.39%   2.33MB 3.79%
>> 148.67MB  47.75MB  49.64MB   7.98MB 16.71%   1.89MB 3.96%
>> 125.98MB  39.26MB  40.82MB   7.52MB 19.15%   1.56MB 3.98%
>> 105.05MB  32.03MB  33.31MB   7.04MB 21.97%   1.28MB 3.98%
>> 535.59MB 200.31MB 207.60MB   9.26MB 4.62%   7.30MB 3.64%
>> 294.84MB  98.28MB 102.15MB   6.85MB 6.97%   3.87MB 3.94%
>> 179.99MB  61.58MB  63.91MB   9.08MB 14.74%   2.33MB 3.79%
>> 148.67MB  47.75MB  49.64MB   8.77MB 18.36%   1.89MB 3.96%
>> 125.98MB  39.26MB  40.82MB   8.35MB 21.28%   1.56MB 3.98%
>> 105.05MB  32.03MB  33.31MB   8.24MB 25.71%   1.28MB 3.98%
>>
>> As you can see above result, fragmentation grows continuously at each run.
>> But, real_overhead_ratio in afmalloc is always just 3% ~ 4%,
>> while overhead_ratio on zsmalloc varies 5% ~ 38%.
>> Fragmented slab memory can be used for other system, so we don't
>> have to much worry about overhead metric in afmalloc. Anyway, overhead
>> metric is also better in afmalloc, 4% ~ 26%.
>>
>> As a result, I think that afmalloc is better than zsmalloc in terms of
>> memory efficiency. But, I could be wrong so any comments are welcome. :)
>>
>> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> ---
>>  include/linux/afmalloc.h |   21 ++
>>  mm/Kconfig               |    7 +
>>  mm/Makefile              |    1 +
>>  mm/afmalloc.c            |  590 ++++++++++++++++++++++++++++++++++++++++++++++
>>  4 files changed, 619 insertions(+)
>>  create mode 100644 include/linux/afmalloc.h
>>  create mode 100644 mm/afmalloc.c


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
  2014-10-07  7:42     ` Joonsoo Kim
@ 2014-10-07 20:26       ` Seth Jennings
  -1 siblings, 0 replies; 16+ messages in thread
From: Seth Jennings @ 2014-10-07 20:26 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Joonsoo Kim, Andrew Morton, Minchan Kim, Nitin Gupta,
	Linux Memory Management List, LKML, Jerome Marchand,
	Sergey Senozhatsky, Dan Streetman, Luigi Semenzato, Mel Gorman,
	Hugh Dickins

On Tue, Oct 07, 2014 at 04:42:33PM +0900, Joonsoo Kim wrote:
> Hello, Seth.
> Sorry for late response. :)
> 
> 2014-09-30 4:53 GMT+09:00 Seth Jennings <sjennings@variantweb.net>:
> > On Fri, Sep 26, 2014 at 03:53:14PM +0900, Joonsoo Kim wrote:
> >> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> >> If you know useful place to use this allocator, please let me know.
> >>
> >> This is brand-new allocator, called anti-fragmentation memory allocator
> >> (aka afmalloc), in order to deal with arbitrary sized object allocation
> >> efficiently. zram and zswap uses arbitrary sized object to store
> >> compressed data so they can use this allocator. If there are any other
> >> use cases, they can use it, too.
> >>
> >> This work is motivated by observation of fragmentation on zsmalloc which
> >> intended for storing arbitrary sized object with low fragmentation.
> >> Although it works well on allocation-intensive workload, memory could be
> >> highly fragmented after many free occurs. In some cases, unused memory due
> >> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> >> problem is that other subsystem cannot use these unused memory. These
> >> fragmented memory are zsmalloc specific, so most of other subsystem cannot
> >> use it until zspage is freed to page allocator.
> >
> > Yes, zsmalloc has a fragmentation issue.  This has been a topic lately.
> > I and others are looking at putting compaction logic into zsmalloc to
> > help with this.
> >
> >>
> >> I guess that there are similar fragmentation problem in zbud, but, I
> >> didn't deeply investigate it.
> >>
> >> This new allocator uses SLAB allocator to solve above problems. When
> >> request comes, it returns handle that is pointer of metatdata to point
> >> many small chunks. These small chunks are in power of 2 size and
> >> build up whole requested memory. We can easily acquire these chunks
> >> using SLAB allocator. Following is conceptual represetation of metadata
> >> used in this allocator to help understanding of this allocator.
> >>
> >> Handle A for 400 bytes
> >> {
> >>       Pointer for 256 bytes chunk
> >>       Pointer for 128 bytes chunk
> >>       Pointer for 16 bytes chunk
> >>
> >>       (256 + 128 + 16 = 400)
> >> }
> >>
> >> As you can see, 400 bytes memory are not contiguous in afmalloc so that
> >> allocator specific store/load functions are needed. These require some
> >> computation overhead and I guess that this is the only drawback this
> >> allocator has.
> >
> > One problem with using the SLAB allocator is that kmalloc caches greater
> > than 256 bytes, at least on my x86_64 machine, have slabs that require
> > high order page allocations, which are going to be really hard to come
> > by in the memory stressed environment in which zswap/zram are expected
> > to operate.  I guess you could max out at 256 byte chunks to overcome
> > this.  However, if you have a 3k object, that would require copying 12
> > chunks from potentially 12 different pages into a contiguous area at
> > mapping time and a larger metadata size.
> 
> SLUB uses high order allocation by default, but, it has fallback method. It
> uses low order allocation if failed with high order allocation. So, we don't
> need to worry about high order allocation.

Didn't know about the fallback method :)

> 
> >>
> >> For optimization, it uses another approach for power of 2 sized request.
> >> Instead of returning handle for metadata, it adds tag on pointer from
> >> SLAB allocator and directly returns this value as handle. With this tag,
> >> afmalloc can recognize whether handle is for metadata or not and do proper
> >> processing on it. This optimization can save some memory.
> >>
> >> Although afmalloc use some memory for metadata, overall utilization of
> >> memory is really good due to zero internal fragmentation by using power
> >
> > Smallest kmalloc cache is 8 bytes so up to 7 bytes of internal
> > fragmentation per object right?  If so, "near zero".
> >
> >> of 2 sized object. Although zsmalloc has many size class, there is
> >> considerable internal fragmentation in zsmalloc.
> >
> > Lets put a number on it. Internal fragmentation on objects with size >
> > ZS_MIN_ALLOC_SIZE is ZS_SIZE_CLASS_DELTA-1, which is 15 bytes with
> > PAGE_SIZE of 4k.  If the allocation is less than ZS_MIN_ALLOC_SIZE,
> > fragmentation could be as high as ZS_MIN_ALLOC_SIZE-1 which is 31 on a
> > 64-bit system with 4k pages.  (Note: I don't think that is it possible to
> > compress a 4k page to less than 32 bytes, so for zswap, there will be no
> > allocations in this size range).
> >
> > So we are looking at up to 7 vs 15 bytes of internal fragmentation per
> > object in the case when allocations are > ZS_MIN_ALLOC_SIZE.  Once you
> > take into account the per-object metadata overhead of afmalloc, I think
> > zsmalloc comes out ahead here.
> 
> Sorry for misleading word usage.
> What I want to tell is that the unused space at the end of zspage when
> zspage isn't perfectly divided. For example, think about 2064 bytes size_class.
> It's zspage would be 4 pages and it can have only 7 objects at maximum.
> Remainder is 1936 bytes and we can't use this space. This is 11% of total
> space on zspage. If we only use power of 2 size, there is no remainder and
> no this type of unused space.

Ah, ok.  That's true.

> 
> >>
> >> In workload that needs many free, memory could be fragmented like
> >> zsmalloc, but, there is big difference. These unused portion of memory
> >> are SLAB specific memory so that other subsystem can use it. Therefore,
> >> fragmented memory could not be a big problem in this allocator.
> >
> > While freeing chunks back to the slab allocator does make that memory
> > available to other _kernel_ users, the fragmentation problem is just
> > moved one level down.  The fragmentation will exist in the slabs and
> > those fragmented slabs won't be freed to the page allocator, which would
> > make them available to _any_ user, not just the kernel.  Additionally,
> > there is little visibility into how chunks are organized in the slab,
> > making compaction at the afmalloc level nearly impossible.  (The only
> > visibility being the address returned by kmalloc())
> 
> Okay. Free objects in slab subsystem isn't perfect solution, but, it is better
> than current situation.
> 
> And, I think that afmalloc could be compacted just with returned address.
> My idea is sorting chunks by memory address and copying their contents
> to temporary buffer in ascending order. After copy is complete, chunks could
> be freed. These freed objects would be in contiguous range so SLAB would
> free the slab to the page allocator. After some free are done, we allocate
> chunks from SLAB again and copy contents in temporary buffers to these
> newly allocated chunks. These chunks would be positioned in fragmented
> slab so that fragmentation would be reduced.

I guess something that could be a problem is that the slabs might not
contain only afmalloc allocations.  If another kernel process is
allocating objects from the same slabs, then afmalloc might not be able
to evacuate entire slabs.  A side effect of not having unilateral
control of the memory pool at the page level.

> 
> >>
> >> Extra benefit of this allocator design is NUMA awareness. This allocator
> >> allocates real memory from SLAB allocator. SLAB considers client's NUMA
> >> affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
> >> and zbud which are backend of zram and zswap, respectively, are not NUMA
> >> awareness so that remote node's memory could be returned to requestor.
> >> I think that it could be solved easily if NUMA awareness turns out to be
> >> real problem. But, it may enlarge fragmentation depending on number of
> >> nodes. Anyway, there is no NUMA awareness issue in this allocator.
> >>
> >> Although I'd like to replace zsmalloc with this allocator, it cannot be
> >> possible, because zsmalloc supports HIGHMEM. In 32-bits world, SLAB memory
> >> would be very limited so supporting HIGHMEM would be really good advantage
> >> of zsmalloc. Because there is no HIGHMEM in 32-bits low memory device or
> >> 64-bits world, this allocator may be good option for this system. I
> >> didn't deeply consider whether this allocator can replace zbud or not.
> >>
> >> Below is the result of my simple test.
> >> (zsmalloc used in experiments is patched with my previous patch:
> >> zsmalloc: merge size_class to reduce fragmentation)
> >>
> >> TEST ENV: EXT4 on zram, mount with discard option
> >> WORKLOAD: untar kernel source, remove dir in descending order in size.
> >> (drivers arch fs sound include)
> >>
> >> Each line represents orig_data_size, compr_data_size, mem_used_total,
> >> fragmentation overhead (mem_used - compr_data_size) and overhead ratio
> >> (overhead to compr_data_size), respectively, after untar and remove
> >> operation is executed. In afmalloc case, overhead is calculated by
> >> before/after 'SUnreclaim' on /proc/meminfo. And there are two more columns
> >> in afmalloc, one is real_overhead which represents metadata usage and
> >> overhead of internal fragmentation, and the other is a ratio,
> >> real_overhead to compr_data_size. Unlike zsmalloc, only metadata and
> >> internal fragmented memory cannot be used by other subsystem. So,
> >> comparing real_overhead in afmalloc with overhead on zsmalloc seems to
> >> be proper comparison.
> >
> > See last comment about why the real measure of memory usage should be
> > total pages not returned to the page allocator.  I don't consider chunks
> > freed to the slab allocator to be truly freed unless the slab containing
> > the chunks is also freed to the page allocator.
> >
> > The closest thing I can think of to measure the memory utilization of
> > this allocator is, for each kmalloc cache, do a before/after of how many
> > slabs are in the cache, then multiply that delta by pagesperslab and sum
> > the results.  This would give a rough measure of the number of pages
> > utilized in the slab allocator either by or as a result of afmalloc.
> > Of course, there will be noise from other components doing allocations
> > during the time between the before and after measurement.
> 
> It was already in below benchmark result. overhead and overhead ratio on
> intar-afmalloc.out result are measured by number of allocated page in SLAB.
> You can see that overhead and overhead ratio of afmalloc is less than
> zsmalloc even in this metric.

Ah yes, I didn't equate SUnreclaim with "slab usage".

It does look interesting.  I like the simplicity vs zsmalloc.

There would be more memcpy() calls in the map/unmap process. I guess you
can tune the worst-case number of memcpy()s by adjusting
afmalloc_OBJ_MIN_SIZE in exchange for added fragmentation.  There is
also the impact of many kmalloc() calls per allocated object: 1 for the
metadata and up to 7 for chunks on 64-bit.  Compression efficiency is
important, but so is speed.  Performance under memory pressure is also a
factor that isn't accounted for in these results.
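
Something like the following invented sketch shows what I mean (the
struct and names are not the actual afmalloc code, just an illustration
of the one-memcpy()-per-chunk cost of a load):

/* hypothetical load/map path: gather scattered chunks into one buffer */
#include <linux/slab.h>
#include <linux/string.h>

struct af_handle {
	int nr_chunks;
	size_t chunk_size[8];	/* power of 2 sizes, largest first */
	void *chunk[8];		/* pointers returned by kmalloc() */
};

static size_t af_load(struct af_handle *h, void *buf)
{
	char *dst = buf;
	size_t off = 0;
	int i;

	/* one memcpy() per chunk, so mapping cost grows with nr_chunks */
	for (i = 0; i < h->nr_chunks; i++) {
		memcpy(dst + off, h->chunk[i], h->chunk_size[i]);
		off += h->chunk_size[i];
	}
	return off;
}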

I'll try to build it and kick the tires soon.  Thanks!

Seth

> 
> Thanks.
> 
> > Seth
> >
> >>
> >> * untar-merge.out
> >>
> >> orig_size compr_size used_size overhead overhead_ratio
> >> 526.23MB 199.18MB 209.81MB  10.64MB 5.34%
> >> 288.68MB  97.45MB 104.08MB   6.63MB 6.80%
> >> 177.68MB  61.14MB  66.93MB   5.79MB 9.47%
> >> 146.83MB  47.34MB  52.79MB   5.45MB 11.51%
> >> 124.52MB  38.87MB  44.30MB   5.43MB 13.96%
> >> 104.29MB  31.70MB  36.83MB   5.13MB 16.19%
> >>
> >> * untar-afmalloc.out
> >>
> >> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> >> 526.27MB 199.18MB 206.37MB   8.00MB 4.02%   7.19MB 3.61%
> >> 288.71MB  97.45MB 101.25MB   5.86MB 6.01%   3.80MB 3.90%
> >> 177.71MB  61.14MB  63.44MB   4.39MB 7.19%   2.30MB 3.76%
> >> 146.86MB  47.34MB  49.20MB   3.97MB 8.39%   1.86MB 3.93%
> >> 124.55MB  38.88MB  40.41MB   3.71MB 9.54%   1.53MB 3.95%
> >> 104.32MB  31.70MB  32.96MB   3.43MB 10.81%   1.26MB 3.96%
> >>
> >> As you can see above result, real_overhead_ratio in afmalloc is
> >> just 3% ~ 4% while overhead_ratio on zsmalloc varies 5% ~ 17%.
> >>
> >> And, 4% ~ 11% overhead_ratio in afmalloc is also slightly better
> >> than overhead_ratio in zsmalloc which is 5% ~ 17%.
> >>
> >> Below is another simple test to check fragmentation effect in alloc/free
> >> repetition workload.
> >>
> >> TEST ENV: EXT4 on zram, mount with discard option
> >> WORKLOAD: untar kernel source, remove dir in descending order in size
> >> (drivers arch fs sound include). Repeat this untar and remove 10 times.
> >>
> >> * untar-merge.out
> >>
> >> orig_size compr_size used_size overhead overhead_ratio
> >> 526.24MB 199.18MB 209.79MB  10.61MB 5.33%
> >> 288.69MB  97.45MB 104.09MB   6.64MB 6.81%
> >> 177.69MB  61.14MB  66.89MB   5.75MB 9.40%
> >> 146.84MB  47.34MB  52.77MB   5.43MB 11.46%
> >> 124.53MB  38.88MB  44.28MB   5.40MB 13.90%
> >> 104.29MB  31.71MB  36.87MB   5.17MB 16.29%
> >> 535.59MB 200.30MB 211.77MB  11.47MB 5.73%
> >> 294.84MB  98.28MB 106.24MB   7.97MB 8.11%
> >> 179.99MB  61.58MB  69.34MB   7.76MB 12.60%
> >> 148.67MB  47.75MB  55.19MB   7.43MB 15.57%
> >> 125.98MB  39.26MB  46.62MB   7.36MB 18.75%
> >> 105.05MB  32.03MB  39.18MB   7.15MB 22.32%
> >> (snip...)
> >> 535.59MB 200.31MB 211.88MB  11.57MB 5.77%
> >> 294.84MB  98.28MB 106.62MB   8.34MB 8.49%
> >> 179.99MB  61.59MB  73.83MB  12.24MB 19.88%
> >> 148.67MB  47.76MB  59.58MB  11.82MB 24.76%
> >> 125.98MB  39.27MB  51.10MB  11.84MB 30.14%
> >> 105.05MB  32.04MB  43.68MB  11.64MB 36.31%
> >> 535.59MB 200.31MB 211.89MB  11.58MB 5.78%
> >> 294.84MB  98.28MB 106.68MB   8.40MB 8.55%
> >> 179.99MB  61.59MB  74.14MB  12.55MB 20.37%
> >> 148.67MB  47.76MB  59.94MB  12.18MB 25.50%
> >> 125.98MB  39.27MB  51.46MB  12.19MB 31.04%
> >> 105.05MB  32.04MB  44.01MB  11.97MB 37.35%
> >>
> >> * untar-afmalloc.out
> >>
> >> orig_size compr_size used_size overhead overhead_ratio real real-ratio
> >> 526.23MB 199.17MB 206.36MB   8.02MB 4.03%   7.19MB 3.61%
> >> 288.68MB  97.45MB 101.25MB   5.42MB 5.56%   3.80MB 3.90%
> >> 177.68MB  61.14MB  63.43MB   4.00MB 6.54%   2.30MB 3.76%
> >> 146.83MB  47.34MB  49.20MB   3.66MB 7.74%   1.86MB 3.93%
> >> 124.52MB  38.87MB  40.41MB   3.33MB 8.57%   1.54MB 3.96%
> >> 104.29MB  31.70MB  32.95MB   3.23MB 10.19%   1.26MB 3.97%
> >> 535.59MB 200.30MB 207.59MB   9.21MB 4.60%   7.29MB 3.64%
> >> 294.84MB  98.27MB 102.14MB   6.23MB 6.34%   3.87MB 3.94%
> >> 179.99MB  61.58MB  63.91MB   4.98MB 8.09%   2.33MB 3.78%
> >> 148.67MB  47.75MB  49.64MB   4.48MB 9.37%   1.89MB 3.95%
> >> 125.98MB  39.26MB  40.82MB   4.23MB 10.78%   1.56MB 3.97%
> >> 105.05MB  32.03MB  33.30MB   4.10MB 12.81%   1.27MB 3.98%
> >> (snip...)
> >> 535.59MB 200.30MB 207.60MB   8.94MB 4.46%   7.29MB 3.64%
> >> 294.84MB  98.27MB 102.14MB   6.19MB 6.29%   3.87MB 3.94%
> >> 179.99MB  61.58MB  63.91MB   8.25MB 13.39%   2.33MB 3.79%
> >> 148.67MB  47.75MB  49.64MB   7.98MB 16.71%   1.89MB 3.96%
> >> 125.98MB  39.26MB  40.82MB   7.52MB 19.15%   1.56MB 3.98%
> >> 105.05MB  32.03MB  33.31MB   7.04MB 21.97%   1.28MB 3.98%
> >> 535.59MB 200.31MB 207.60MB   9.26MB 4.62%   7.30MB 3.64%
> >> 294.84MB  98.28MB 102.15MB   6.85MB 6.97%   3.87MB 3.94%
> >> 179.99MB  61.58MB  63.91MB   9.08MB 14.74%   2.33MB 3.79%
> >> 148.67MB  47.75MB  49.64MB   8.77MB 18.36%   1.89MB 3.96%
> >> 125.98MB  39.26MB  40.82MB   8.35MB 21.28%   1.56MB 3.98%
> >> 105.05MB  32.03MB  33.31MB   8.24MB 25.71%   1.28MB 3.98%
> >>
> >> As you can see above result, fragmentation grows continuously at each run.
> >> But, real_overhead_ratio in afmalloc is always just 3% ~ 4%,
> >> while overhead_ratio on zsmalloc varies 5% ~ 38%.
> >> Fragmented slab memory can be used for other system, so we don't
> >> have to much worry about overhead metric in afmalloc. Anyway, overhead
> >> metric is also better in afmalloc, 4% ~ 26%.
> >>
> >> As a result, I think that afmalloc is better than zsmalloc in terms of
> >> memory efficiency. But, I could be wrong so any comments are welcome. :)
> >>
> >> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> >> ---
> >>  include/linux/afmalloc.h |   21 ++
> >>  mm/Kconfig               |    7 +
> >>  mm/Makefile              |    1 +
> >>  mm/afmalloc.c            |  590 ++++++++++++++++++++++++++++++++++++++++++++++
> >>  4 files changed, 619 insertions(+)
> >>  create mode 100644 include/linux/afmalloc.h
> >>  create mode 100644 mm/afmalloc.c

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
  2014-10-07 20:26       ` Seth Jennings
@ 2014-10-08  2:31         ` Joonsoo Kim
  -1 siblings, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2014-10-08  2:31 UTC (permalink / raw)
  To: Seth Jennings
  Cc: Andrew Morton, Minchan Kim, Nitin Gupta,
	Linux Memory Management List, LKML, Jerome Marchand,
	Sergey Senozhatsky, Dan Streetman, Luigi Semenzato, Mel Gorman,
	Hugh Dickins

On Tue, Oct 07, 2014 at 03:26:35PM -0500, Seth Jennings wrote:
> On Tue, Oct 07, 2014 at 04:42:33PM +0900, Joonsoo Kim wrote:
> > Hello, Seth.
> > Sorry for late response. :)
> > 
> > 2014-09-30 4:53 GMT+09:00 Seth Jennings <sjennings@variantweb.net>:
> > > On Fri, Sep 26, 2014 at 03:53:14PM +0900, Joonsoo Kim wrote:
> > >> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> > >> If you know useful place to use this allocator, please let me know.
> > >>
> > >> This is brand-new allocator, called anti-fragmentation memory allocator
> > >> (aka afmalloc), in order to deal with arbitrary sized object allocation
> > >> efficiently. zram and zswap uses arbitrary sized object to store
> > >> compressed data so they can use this allocator. If there are any other
> > >> use cases, they can use it, too.
> > >>
> > >> This work is motivated by observation of fragmentation on zsmalloc which
> > >> intended for storing arbitrary sized object with low fragmentation.
> > >> Although it works well on allocation-intensive workload, memory could be
> > >> highly fragmented after many free occurs. In some cases, unused memory due
> > >> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> > >> problem is that other subsystem cannot use these unused memory. These
> > >> fragmented memory are zsmalloc specific, so most of other subsystem cannot
> > >> use it until zspage is freed to page allocator.
> > >
> > > Yes, zsmalloc has a fragmentation issue.  This has been a topic lately.
> > > I and others are looking at putting compaction logic into zsmalloc to
> > > help with this.
> > >
> > >>
> > >> I guess that there are similar fragmentation problem in zbud, but, I
> > >> didn't deeply investigate it.
> > >>
> > >> This new allocator uses SLAB allocator to solve above problems. When
> > >> request comes, it returns handle that is pointer of metatdata to point
> > >> many small chunks. These small chunks are in power of 2 size and
> > >> build up whole requested memory. We can easily acquire these chunks
> > >> using SLAB allocator. Following is conceptual represetation of metadata
> > >> used in this allocator to help understanding of this allocator.
> > >>
> > >> Handle A for 400 bytes
> > >> {
> > >>       Pointer for 256 bytes chunk
> > >>       Pointer for 128 bytes chunk
> > >>       Pointer for 16 bytes chunk
> > >>
> > >>       (256 + 128 + 16 = 400)
> > >> }
> > >>
> > >> As you can see, 400 bytes memory are not contiguous in afmalloc so that
> > >> allocator specific store/load functions are needed. These require some
> > >> computation overhead and I guess that this is the only drawback this
> > >> allocator has.
> > >
> > > One problem with using the SLAB allocator is that kmalloc caches greater
> > > than 256 bytes, at least on my x86_64 machine, have slabs that require
> > > high order page allocations, which are going to be really hard to come
> > > by in the memory stressed environment in which zswap/zram are expected
> > > to operate.  I guess you could max out at 256 byte chunks to overcome
> > > this.  However, if you have a 3k object, that would require copying 12
> > > chunks from potentially 12 different pages into a contiguous area at
> > > mapping time and a larger metadata size.
> > 
> > SLUB uses high order allocation by default, but, it has fallback method. It
> > uses low order allocation if failed with high order allocation. So, we don't
> > need to worry about high order allocation.
> 
> Didn't know about the fallback method :)
> 
> > 
> > >>
> > >> For optimization, it uses another approach for power of 2 sized request.
> > >> Instead of returning handle for metadata, it adds tag on pointer from
> > >> SLAB allocator and directly returns this value as handle. With this tag,
> > >> afmalloc can recognize whether handle is for metadata or not and do proper
> > >> processing on it. This optimization can save some memory.
> > >>
> > >> Although afmalloc use some memory for metadata, overall utilization of
> > >> memory is really good due to zero internal fragmentation by using power
> > >
> > > Smallest kmalloc cache is 8 bytes so up to 7 bytes of internal
> > > fragmentation per object right?  If so, "near zero".
> > >
> > >> of 2 sized object. Although zsmalloc has many size class, there is
> > >> considerable internal fragmentation in zsmalloc.
> > >
> > > Lets put a number on it. Internal fragmentation on objects with size >
> > > ZS_MIN_ALLOC_SIZE is ZS_SIZE_CLASS_DELTA-1, which is 15 bytes with
> > > PAGE_SIZE of 4k.  If the allocation is less than ZS_MIN_ALLOC_SIZE,
> > > fragmentation could be as high as ZS_MIN_ALLOC_SIZE-1 which is 31 on a
> > > 64-bit system with 4k pages.  (Note: I don't think that is it possible to
> > > compress a 4k page to less than 32 bytes, so for zswap, there will be no
> > > allocations in this size range).
> > >
> > > So we are looking at up to 7 vs 15 bytes of internal fragmentation per
> > > object in the case when allocations are > ZS_MIN_ALLOC_SIZE.  Once you
> > > take into account the per-object metadata overhead of afmalloc, I think
> > > zsmalloc comes out ahead here.
> > 
> > Sorry for misleading word usage.
> > What I want to tell is that the unused space at the end of zspage when
> > zspage isn't perfectly divided. For example, think about 2064 bytes size_class.
> > It's zspage would be 4 pages and it can have only 7 objects at maximum.
> > Remainder is 1936 bytes and we can't use this space. This is 11% of total
> > space on zspage. If we only use power of 2 size, there is no remainder and
> > no this type of unused space.
> 
> Ah, ok.  That's true.
> 
> > 
> > >>
> > >> In workload that needs many free, memory could be fragmented like
> > >> zsmalloc, but, there is big difference. These unused portion of memory
> > >> are SLAB specific memory so that other subsystem can use it. Therefore,
> > >> fragmented memory could not be a big problem in this allocator.
> > >
> > > While freeing chunks back to the slab allocator does make that memory
> > > available to other _kernel_ users, the fragmentation problem is just
> > > moved one level down.  The fragmentation will exist in the slabs and
> > > those fragmented slabs won't be freed to the page allocator, which would
> > > make them available to _any_ user, not just the kernel.  Additionally,
> > > there is little visibility into how chunks are organized in the slab,
> > > making compaction at the afmalloc level nearly impossible.  (The only
> > > visibility being the address returned by kmalloc())
> > 
> > Okay. Free objects in slab subsystem isn't perfect solution, but, it is better
> > than current situation.
> > 
> > And, I think that afmalloc could be compacted just with returned address.
> > My idea is sorting chunks by memory address and copying their contents
> > to temporary buffer in ascending order. After copy is complete, chunks could
> > be freed. These freed objects would be in contiguous range so SLAB would
> > free the slab to the page allocator. After some free are done, we allocate
> > chunks from SLAB again and copy contents in temporary buffers to these
> > newly allocated chunks. These chunks would be positioned in fragmented
> > slab so that fragmentation would be reduced.
> 
> I guess something that could be a problem is that the slabs might not
> contain only afmalloc allocations.  If another kernel process is
> allocating objects from the same slabs, then afmalloc might not be able
> to evacuate entire slabs.  A side effect of not having unilateral
> control of the memory pool at the page level.

Yes, it could be a problem. But, if we decide to implement compaction
in afmalloc, we don't need to stick to the general kmem_caches, because
fragmentation would not be an issue with compaction. In this case, afmalloc
can use its own private kmem_caches for power of 2 sized objects, and this
would solve the above concern. A minimal sketch of that follows below.
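
Something like the following (the names are only illustrative, this is
not taken from the patch):

/* private power of 2 caches dedicated to afmalloc chunks */
#include <linux/errno.h>
#include <linux/kernel.h>
#include <linux/slab.h>

#define AF_MIN_SHIFT	3	/* 8 byte chunks */
#define AF_MAX_SHIFT	12	/* 4096 byte chunks */

static struct kmem_cache *af_caches[AF_MAX_SHIFT + 1];

static int __init af_create_caches(void)
{
	int shift;

	for (shift = AF_MIN_SHIFT; shift <= AF_MAX_SHIFT; shift++) {
		char *name = kasprintf(GFP_KERNEL, "afmalloc-%lu",
				       1UL << shift);

		if (!name)
			return -ENOMEM;
		af_caches[shift] = kmem_cache_create(name, 1UL << shift,
						     0, 0, NULL);
		if (!af_caches[shift])
			return -ENOMEM;
	}
	return 0;
}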

> 
> > 
> > >>
> > >> Extra benefit of this allocator design is NUMA awareness. This allocator
> > >> allocates real memory from SLAB allocator. SLAB considers client's NUMA
> > >> affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
> > >> and zbud which are backend of zram and zswap, respectively, are not NUMA
> > >> awareness so that remote node's memory could be returned to requestor.
> > >> I think that it could be solved easily if NUMA awareness turns out to be
> > >> real problem. But, it may enlarge fragmentation depending on number of
> > >> nodes. Anyway, there is no NUMA awareness issue in this allocator.
> > >>
> > >> Although I'd like to replace zsmalloc with this allocator, it cannot be
> > >> possible, because zsmalloc supports HIGHMEM. In 32-bits world, SLAB memory
> > >> would be very limited so supporting HIGHMEM would be really good advantage
> > >> of zsmalloc. Because there is no HIGHMEM in 32-bits low memory device or
> > >> 64-bits world, this allocator may be good option for this system. I
> > >> didn't deeply consider whether this allocator can replace zbud or not.
> > >>
> > >> Below is the result of my simple test.
> > >> (zsmalloc used in experiments is patched with my previous patch:
> > >> zsmalloc: merge size_class to reduce fragmentation)
> > >>
> > >> TEST ENV: EXT4 on zram, mount with discard option
> > >> WORKLOAD: untar kernel source, remove dir in descending order in size.
> > >> (drivers arch fs sound include)
> > >>
> > >> Each line represents orig_data_size, compr_data_size, mem_used_total,
> > >> fragmentation overhead (mem_used - compr_data_size) and overhead ratio
> > >> (overhead to compr_data_size), respectively, after untar and remove
> > >> operation is executed. In afmalloc case, overhead is calculated by
> > >> before/after 'SUnreclaim' on /proc/meminfo. And there are two more columns
> > >> in afmalloc, one is real_overhead which represents metadata usage and
> > >> overhead of internal fragmentation, and the other is a ratio,
> > >> real_overhead to compr_data_size. Unlike zsmalloc, only metadata and
> > >> internal fragmented memory cannot be used by other subsystem. So,
> > >> comparing real_overhead in afmalloc with overhead on zsmalloc seems to
> > >> be proper comparison.
> > >
> > > See last comment about why the real measure of memory usage should be
> > > total pages not returned to the page allocator.  I don't consider chunks
> > > freed to the slab allocator to be truly freed unless the slab containing
> > > the chunks is also freed to the page allocator.
> > >
> > > The closest thing I can think of to measure the memory utilization of
> > > this allocator is, for each kmalloc cache, do a before/after of how many
> > > slabs are in the cache, then multiply that delta by pagesperslab and sum
> > > the results.  This would give a rough measure of the number of pages
> > > utilized in the slab allocator either by or as a result of afmalloc.
> > > Of course, there will be noise from other components doing allocations
> > > during the time between the before and after measurement.
> > 
> > It was already in below benchmark result. overhead and overhead ratio on
> > intar-afmalloc.out result are measured by number of allocated page in SLAB.
> > You can see that overhead and overhead ratio of afmalloc is less than
> > zsmalloc even in this metric.
> 
> Ah yes, I didn't equate SUnreclaim with "slab usage".
> 
> It does look interesting.  I like the simplicity vs zsmalloc.
> 
> There would be more memcpy() calls in the map/unmap process. I guess you
> can tune the worse case number of memcpy()s by adjusting
> afmalloc_OBJ_MIN_SIZE in exchange for added fragmentation.  There is
> also the impact of many kmalloc() calls per allocated object: 1 for
> metadata and up to 7 for chunks on 64-bit.  Compression efficiency is
> important but so is speed.  Performance under memory pressure is also a
> factor that isn't accounted for in these results.

Yes... As I already mentioned in the patch description, performance is
the problem this allocator has.

In fact, I guess that many kmalloc() calls would be no problem, because
SLAB allocation is really fast. The real factor in performance would be how
many pages we allocate from the page allocator. If we allocate too many
pages from the page allocator, kswapd would be invoked more or direct
reclaim would occur more often. These would be the dominant factors in
allocation performance. afmalloc has good memory utilization, so page
allocation happens less frequently than with zsmalloc. Therefore, it would
perform well in this case.

But, it does more memcpy() than zsmalloc and that really affects
performance. IIUC, an iozone test with afmalloc on ext4 on zram showed
worse performance than zsmalloc, roughly 5 ~ 10%. This workload has no
diverse memory contents, so there is no factor that affects performance
except memcpy(). Therefore, I guess that this is the upper bound of
afmalloc's performance loss compared to zsmalloc. Meanwhile, a simple swap
test using a kernel build didn't show any noticeable difference in
performance.

> I'll try to build it and kick the tires soon.  Thanks!

I really appreciate your interest and your trying it out. :)
Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator
@ 2014-10-08  2:31         ` Joonsoo Kim
  0 siblings, 0 replies; 16+ messages in thread
From: Joonsoo Kim @ 2014-10-08  2:31 UTC (permalink / raw)
  To: Seth Jennings
  Cc: Andrew Morton, Minchan Kim, Nitin Gupta,
	Linux Memory Management List, LKML, Jerome Marchand,
	Sergey Senozhatsky, Dan Streetman, Luigi Semenzato, Mel Gorman,
	Hugh Dickins

On Tue, Oct 07, 2014 at 03:26:35PM -0500, Seth Jennings wrote:
> On Tue, Oct 07, 2014 at 04:42:33PM +0900, Joonsoo Kim wrote:
> > Hello, Seth.
> > Sorry for late response. :)
> > 
> > 2014-09-30 4:53 GMT+09:00 Seth Jennings <sjennings@variantweb.net>:
> > > On Fri, Sep 26, 2014 at 03:53:14PM +0900, Joonsoo Kim wrote:
> > >> WARNING: This is just RFC patchset. patch 2/2 is only for testing.
> > >> If you know useful place to use this allocator, please let me know.
> > >>
> > >> This is brand-new allocator, called anti-fragmentation memory allocator
> > >> (aka afmalloc), in order to deal with arbitrary sized object allocation
> > >> efficiently. zram and zswap uses arbitrary sized object to store
> > >> compressed data so they can use this allocator. If there are any other
> > >> use cases, they can use it, too.
> > >>
> > >> This work is motivated by observation of fragmentation on zsmalloc which
> > >> intended for storing arbitrary sized object with low fragmentation.
> > >> Although it works well on allocation-intensive workload, memory could be
> > >> highly fragmented after many free occurs. In some cases, unused memory due
> > >> to fragmentation occupy 20% ~ 50% amount of real used memory. The other
> > >> problem is that other subsystem cannot use these unused memory. These
> > >> fragmented memory are zsmalloc specific, so most of other subsystem cannot
> > >> use it until zspage is freed to page allocator.
> > >
> > > Yes, zsmalloc has a fragmentation issue.  This has been a topic lately.
> > > I and others are looking at putting compaction logic into zsmalloc to
> > > help with this.
> > >
> > >>
> > >> I guess that there are similar fragmentation problem in zbud, but, I
> > >> didn't deeply investigate it.
> > >>
> > >> This new allocator uses SLAB allocator to solve above problems. When
> > >> request comes, it returns handle that is pointer of metatdata to point
> > >> many small chunks. These small chunks are in power of 2 size and
> > >> build up whole requested memory. We can easily acquire these chunks
> > >> using SLAB allocator. Following is conceptual represetation of metadata
> > >> used in this allocator to help understanding of this allocator.
> > >>
> > >> Handle A for 400 bytes
> > >> {
> > >>       Pointer for 256 bytes chunk
> > >>       Pointer for 128 bytes chunk
> > >>       Pointer for 16 bytes chunk
> > >>
> > >>       (256 + 128 + 16 = 400)
> > >> }
> > >>
> > >> As you can see, 400 bytes memory are not contiguous in afmalloc so that
> > >> allocator specific store/load functions are needed. These require some
> > >> computation overhead and I guess that this is the only drawback this
> > >> allocator has.
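
To make the chunk splitting described above concrete, here is a minimal
sketch of the idea. This is not the actual afmalloc code: the names
(struct af_handle, AF_MIN_CHUNK, AF_MAX_CHUNKS, af_alloc_sketch), the 8 byte
minimum and the error handling are assumptions for illustration only.

#include <linux/bitops.h>
#include <linux/kernel.h>
#include <linux/slab.h>

#define AF_MIN_CHUNK	8	/* smallest kmalloc cache, assumed minimum chunk */
#define AF_MAX_CHUNKS	16	/* more than enough for a PAGE_SIZE request */

struct af_handle {
        int nr_chunks;
        void *chunk[AF_MAX_CHUNKS];	/* power-of-2 sized kmalloc objects */
};

static struct af_handle *af_alloc_sketch(size_t size, gfp_t gfp)
{
        struct af_handle *h = kmalloc(sizeof(*h), gfp);
        int i = 0;

        if (!h)
                return NULL;

        while (size) {
                /* largest power of 2 that is not bigger than what remains */
                size_t chunk = max_t(size_t, AF_MIN_CHUNK,
                                     1UL << (fls_long(size) - 1));

                h->chunk[i] = kmalloc(chunk, gfp);
                if (!h->chunk[i])
                        goto fail;
                i++;
                size -= min(size, chunk);
        }
        h->nr_chunks = i;
        return h;

fail:
        while (i--)
                kfree(h->chunk[i]);
        kfree(h);
        return NULL;
}

With these assumptions a 400 byte request ends up as 256 + 128 + 16 byte
chunks, matching the example above.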
> > >
> > > One problem with using the SLAB allocator is that kmalloc caches greater
> > > than 256 bytes, at least on my x86_64 machine, have slabs that require
> > > high order page allocations, which are going to be really hard to come
> > > by in the memory stressed environment in which zswap/zram are expected
> > > to operate.  I guess you could max out at 256 byte chunks to overcome
> > > this.  However, if you have a 3k object, that would require copying 12
> > > chunks from potentially 12 different pages into a contiguous area at
> > > mapping time and a larger metadata size.
> > 
> > SLUB uses high order allocations by default, but it has a fallback method:
> > it falls back to a low order allocation if the high order one fails. So we
> > don't need to worry about high order allocations.
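
For reference, the fallback behaves roughly like the sketch below. This is a
heavily simplified illustration and not the actual mm/slub.c code; the
function name and parameters are invented.

#include <linux/gfp.h>

/*
 * Try the preferred (possibly high) order first, without retries or
 * allocation-failure warnings, then fall back to the minimum order that
 * still holds at least one object.
 */
static struct page *alloc_slab_fallback_sketch(gfp_t flags, int node,
                                               int pref_order, int min_order)
{
        struct page *page;

        page = alloc_pages_node(node,
                                (flags | __GFP_NOWARN | __GFP_NORETRY) &
                                ~__GFP_NOFAIL,
                                pref_order);
        if (!page)	/* high order failed: take the low order path */
                page = alloc_pages_node(node, flags, min_order);

        return page;
}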
> 
> Didn't know about the fallback method :)
> 
> > 
> > >>
> > >> For optimization, it uses another approach for power of 2 sized request.
> > >> Instead of returning handle for metadata, it adds tag on pointer from
> > >> SLAB allocator and directly returns this value as handle. With this tag,
> > >> afmalloc can recognize whether handle is for metadata or not and do proper
> > >> processing on it. This optimization can save some memory.
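
A hypothetical way to picture that tag (the names below are invented for
this sketch; the real patch may encode it differently): since kmalloc()
objects are at least word aligned, the lowest bit of a chunk pointer is
always zero and can mark a handle as a direct chunk pointer rather than a
pointer to metadata.

#include <linux/types.h>

#define AF_DIRECT_TAG	0x1UL

static inline unsigned long af_direct_handle(void *chunk)
{
        return (unsigned long)chunk | AF_DIRECT_TAG;
}

static inline bool af_handle_is_direct(unsigned long handle)
{
        return handle & AF_DIRECT_TAG;
}

static inline void *af_handle_ptr(unsigned long handle)
{
        /* strip the tag for direct handles; no-op for metadata handles */
        return (void *)(handle & ~AF_DIRECT_TAG);
}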
> > >>
> > >> Although afmalloc use some memory for metadata, overall utilization of
> > >> memory is really good due to zero internal fragmentation by using power
> > >
> > > Smallest kmalloc cache is 8 bytes so up to 7 bytes of internal
> > > fragmentation per object right?  If so, "near zero".
> > >
> > >> of 2 sized object. Although zsmalloc has many size class, there is
> > >> considerable internal fragmentation in zsmalloc.
> > >
> > > Let's put a number on it. Internal fragmentation on objects with size >
> > > ZS_MIN_ALLOC_SIZE is ZS_SIZE_CLASS_DELTA-1, which is 15 bytes with
> > > PAGE_SIZE of 4k.  If the allocation is less than ZS_MIN_ALLOC_SIZE,
> > > fragmentation could be as high as ZS_MIN_ALLOC_SIZE-1 which is 31 on a
> > > 64-bit system with 4k pages.  (Note: I don't think it is possible to
> > > compress a 4k page to less than 32 bytes, so for zswap, there will be no
> > > allocations in this size range).
> > >
> > > So we are looking at up to 7 vs 15 bytes of internal fragmentation per
> > > object in the case when allocations are > ZS_MIN_ALLOC_SIZE.  Once you
> > > take into account the per-object metadata overhead of afmalloc, I think
> > > zsmalloc comes out ahead here.
> > 
> > Sorry for the misleading word usage.
> > What I meant is the unused space at the end of a zspage when the zspage
> > isn't divided perfectly. For example, consider the 2064 bytes size_class.
> > Its zspage would be 4 pages and it can hold only 7 objects at maximum. The
> > remainder is 1936 bytes and we can't use this space. That is about 11.8% of
> > the total space of the zspage. If we only use power of 2 sizes, there is no
> > remainder and no unused space of this kind.
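
Spelling out the arithmetic behind that figure (assuming 4 KiB pages):

        zspage size  : 4 pages * 4096 bytes = 16384 bytes
        objects held : 16384 / 2064         = 7 (rounded down)
        space used   : 7 * 2064             = 14448 bytes
        unused tail  : 16384 - 14448        = 1936 bytes, about 11.8% of the zspage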
> 
> Ah, ok.  That's true.
> 
> > 
> > >>
> > >> In workload that needs many free, memory could be fragmented like
> > >> zsmalloc, but, there is big difference. These unused portion of memory
> > >> are SLAB specific memory so that other subsystem can use it. Therefore,
> > >> fragmented memory could not be a big problem in this allocator.
> > >
> > > While freeing chunks back to the slab allocator does make that memory
> > > available to other _kernel_ users, the fragmentation problem is just
> > > moved one level down.  The fragmentation will exist in the slabs and
> > > those fragmented slabs won't be freed to the page allocator, which would
> > > make them available to _any_ user, not just the kernel.  Additionally,
> > > there is little visibility into how chunks are organized in the slab,
> > > making compaction at the afmalloc level nearly impossible.  (The only
> > > visibility being the address returned by kmalloc())
> > 
> > Okay. Freeing objects back to the slab subsystem isn't a perfect solution,
> > but it is better than the current situation.
> > 
> > And I think that afmalloc could be compacted with just the returned
> > addresses. My idea is to sort the chunks by memory address and copy their
> > contents to a temporary buffer in ascending order. After the copy is
> > complete, the chunks can be freed. These freed objects would be in a
> > contiguous range, so SLAB would free the slab back to the page allocator.
> > After some frees are done, we allocate chunks from SLAB again and copy the
> > contents of the temporary buffers into these newly allocated chunks. The
> > new chunks would be placed in fragmented slabs, so fragmentation would be
> > reduced.
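
A rough sketch of that compaction idea is below. The names are invented,
locking and error handling are elided, and the sort of all chunks by address
across the whole pool is left out; in practice the temporary buffer would
also have to come from somewhere other than the kmalloc caches being
compacted (vmalloc, for example), otherwise it competes with them.

#include <linux/slab.h>
#include <linux/string.h>

/*
 * Evacuate one object's chunks: buffer the data, free the old chunks so
 * that fully emptied slabs can go back to the page allocator, then
 * reallocate and copy back so the new chunks fill partially used slabs.
 */
static int af_compact_one_sketch(void **chunk, int nr, gfp_t gfp)
{
        void *buf[16];
        size_t size[16];
        int i;

        for (i = 0; i < nr; i++) {
                size[i] = ksize(chunk[i]);
                buf[i] = kmalloc(size[i], gfp);
                if (!buf[i])
                        return -ENOMEM;		/* cleanup elided */
                memcpy(buf[i], chunk[i], size[i]);
        }

        /* free in one batch so whole slabs get a chance to become empty */
        for (i = 0; i < nr; i++)
                kfree(chunk[i]);

        /* reallocate; SLAB packs these into the remaining partial slabs */
        for (i = 0; i < nr; i++) {
                chunk[i] = kmalloc(size[i], gfp);	/* error handling elided */
                memcpy(chunk[i], buf[i], size[i]);
                kfree(buf[i]);
        }

        return 0;
}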
> 
> I guess something that could be a problem is that the slabs might not
> contain only afmalloc allocations.  If another kernel process is
> allocating objects from the same slabs, then afmalloc might not be able
> to evacuate entire slabs.  A side effect of not having unilateral
> control of the memory pool at the page level.

Yes, it could be a problem. But if we decide to implement compaction in
afmalloc, we don't need to stick with the general kmem_caches, because
fragmentation would no longer be an issue once compaction works. In that
case, afmalloc can use its own private kmem_caches for the power of 2 sized
objects, and that would solve the above concern.
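
A minimal sketch of that direction, assuming one private cache per power of
2 chunk size (the names here are invented; only kmem_cache_create() and
kmem_cache_alloc() are the standard slab API):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/slab.h>

#define AF_MIN_SHIFT	3	/* assumed 8 byte minimum chunk */
#define AF_MAX_SHIFT	12	/* up to PAGE_SIZE with 4 KiB pages */

static struct kmem_cache *af_chunk_cache[AF_MAX_SHIFT + 1];

static int __init af_create_chunk_caches(void)
{
        int shift;

        for (shift = AF_MIN_SHIFT; shift <= AF_MAX_SHIFT; shift++) {
                char *name = kasprintf(GFP_KERNEL, "afmalloc-%lu",
                                       1UL << shift);

                if (!name)
                        return -ENOMEM;
                af_chunk_cache[shift] = kmem_cache_create(name, 1UL << shift,
                                                          0, 0, NULL);
                if (!af_chunk_cache[shift])
                        return -ENOMEM;	/* cleanup of earlier caches elided */
        }
        return 0;
}

Chunks would then come from kmem_cache_alloc(af_chunk_cache[shift], gfp),
and compaction could assume that every object in those slabs belongs to
afmalloc, which addresses the shared-slab concern above.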

> 
> > 
> > >>
> > >> Extra benefit of this allocator design is NUMA awareness. This allocator
> > >> allocates real memory from SLAB allocator. SLAB considers client's NUMA
> > >> affinity, so these allocated memory is NUMA-friendly. Currently, zsmalloc
> > >> and zbud which are backend of zram and zswap, respectively, are not NUMA
> > >> awareness so that remote node's memory could be returned to requestor.
> > >> I think that it could be solved easily if NUMA awareness turns out to be a
> > >> real problem, but it may enlarge fragmentation depending on the number of
> > >> nodes. Anyway, there is no NUMA awareness issue in this allocator.
> > >>
> > >> Although I'd like to replace zsmalloc with this allocator, that isn't
> > >> possible, because zsmalloc supports HIGHMEM. In the 32-bit world, SLAB
> > >> memory would be very limited, so supporting HIGHMEM is a really good
> > >> advantage of zsmalloc. Because there is no HIGHMEM on 32-bit low memory
> > >> devices or in the 64-bit world, this allocator may be a good option for
> > >> such systems. I didn't deeply consider whether this allocator can replace
> > >> zbud or not.
> > >>
> > >> Below is the result of my simple test.
> > >> (zsmalloc used in experiments is patched with my previous patch:
> > >> zsmalloc: merge size_class to reduce fragmentation)
> > >>
> > >> TEST ENV: EXT4 on zram, mount with discard option
> > >> WORKLOAD: untar kernel source, remove dir in descending order in size.
> > >> (drivers arch fs sound include)
> > >>
> > >> Each line shows orig_data_size, compr_data_size, mem_used_total,
> > >> fragmentation overhead (mem_used - compr_data_size) and overhead ratio
> > >> (overhead to compr_data_size), respectively, after the untar and remove
> > >> operations are executed. In the afmalloc case, overhead is calculated from
> > >> the before/after difference of 'SUnreclaim' in /proc/meminfo. And there
> > >> are two more columns for afmalloc: one is real_overhead, which represents
> > >> metadata usage plus internal fragmentation overhead, and the other is the
> > >> ratio of real_overhead to compr_data_size. Unlike zsmalloc, only the
> > >> metadata and internally fragmented memory cannot be used by other
> > >> subsystems. So comparing real_overhead of afmalloc with the overhead of
> > >> zsmalloc seems to be the proper comparison.
> > >
> > > See last comment about why the real measure of memory usage should be
> > > total pages not returned to the page allocator.  I don't consider chunks
> > > freed to the slab allocator to be truly freed unless the slab containing
> > > the chunks is also freed to the page allocator.
> > >
> > > The closest thing I can think of to measure the memory utilization of
> > > this allocator is, for each kmalloc cache, do a before/after of how many
> > > slabs are in the cache, then multiply that delta by pagesperslab and sum
> > > the results.  This would give a rough measure of the number of pages
> > > utilized in the slab allocator either by or as a result of afmalloc.
> > > Of course, there will be noise from other components doing allocations
> > > during the time between the before and after measurement.
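
Written as a formula (field names as in /proc/slabinfo), the suggested
measurement is:

        slab_pages_used ~= sum over each kmalloc cache of
                           (num_slabs_after - num_slabs_before) * pagesperslab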
> > 
> > It is already in the benchmark result below: the overhead and overhead
> > ratio in the intar-afmalloc.out result are measured by the number of pages
> > allocated by SLAB. You can see that the overhead and overhead ratio of
> > afmalloc are lower than zsmalloc's even by this metric.
> 
> Ah yes, I didn't equate SUnreclaim with "slab usage".
> 
> It does look interesting.  I like the simplicity vs zsmalloc.
> 
> There would be more memcpy() calls in the map/unmap process. I guess you
> can tune the worst case number of memcpy()s by adjusting
> afmalloc_OBJ_MIN_SIZE in exchange for added fragmentation.  There is
> also the impact of many kmalloc() calls per allocated object: 1 for
> metadata and up to 7 for chunks on 64-bit.  Compression efficiency is
> important but so is speed.  Performance under memory pressure is also a
> factor that isn't accounted for in these results.

Yes... As I already mentioned in the patch description, performance is
the weak point of this allocator.

In fact, I guess that the many kmalloc() calls would not be a problem,
because SLAB allocation is really fast. The real factor for performance
would be how many pages we allocate from the page allocator. If we allocate
too many pages from the page allocator, kswapd would be invoked more often
or direct reclaim would occur more often. These would be the dominant
factors for allocation performance. afmalloc has good memory utilization,
so page allocation happens less frequently than with zsmalloc. Therefore,
it should perform well in this respect.

But it does more memcpy() than zsmalloc, and that really affects
performance. IIUC, an iozone test with afmalloc on ext4 on zram showed
roughly 5 ~ 10% worse performance than zsmalloc. This workload has no
diverse memory contents, so there is no factor affecting performance except
the memcpy()s. Therefore, I guess that this is the upper bound of
afmalloc's performance loss relative to zsmalloc. Meanwhile, a simple swap
test using a kernel build didn't show any noticeable difference in
performance.

> I'll try to build it and kick the tires soon.  Thanks!

I really appreciate your interest and your trying it out. :)
Thanks.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2014-10-08  2:31 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-26  6:53 [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation memory allocator Joonsoo Kim
2014-09-26  6:53 ` Joonsoo Kim
2014-09-26  6:53 ` [RFC PATCH 2/2] zram: make afmalloc as zram's backend " Joonsoo Kim
2014-09-26  6:53   ` Joonsoo Kim
2014-09-29 15:41 ` [RFC PATCH 1/2] mm/afmalloc: introduce anti-fragmentation " Dan Streetman
2014-09-29 15:41   ` Dan Streetman
2014-10-02  5:47   ` Joonsoo Kim
2014-10-02  5:47     ` Joonsoo Kim
2014-09-29 19:53 ` Seth Jennings
2014-09-29 19:53   ` Seth Jennings
2014-10-07  7:42   ` Joonsoo Kim
2014-10-07  7:42     ` Joonsoo Kim
2014-10-07 20:26     ` Seth Jennings
2014-10-07 20:26       ` Seth Jennings
2014-10-08  2:31       ` Joonsoo Kim
2014-10-08  2:31         ` Joonsoo Kim
