Date: Thu, 23 Mar 2017 12:04:23 +0900
From: Joonsoo Kim
To: Minchan Kim
CC: Andrew Morton, Sergey Senozhatsky
Subject: Re: [PATCH 3/4] zram: implement deduplication in zram
Message-ID: <20170323030423.GB17486@js1304-P5Q-DELUXE>
References: <1489632398-31501-1-git-send-email-iamjoonsoo.kim@lge.com> <1489632398-31501-4-git-send-email-iamjoonsoo.kim@lge.com> <20170321234121.GA30149@bbox>
In-Reply-To: <20170321234121.GA30149@bbox>

On Wed, Mar 22, 2017 at 08:41:21AM +0900, Minchan Kim wrote:
> Hi Joonsoo,
>
> On Thu, Mar 16, 2017 at 11:46:37AM +0900, js1304@gmail.com wrote:
> > From: Joonsoo Kim
> >
> > This patch implements a deduplication feature in zram. The purpose
> > of this work is, naturally, to reduce the amount of memory used by zram.
> >
> > Android is one of the biggest users of zram as swap, and reducing
> > the amount of memory used there is really important.
There is a paper
> > reporting that the duplication ratio of Android's memory content is
> > rather high [1]. And there is similar work on zswap reporting that
> > experiments have shown that around 10-15% of the pages stored in
> > zswap are duplicates, and that deduplicating them provides some
> > benefit [2].
> >
> > Also, there is a different kind of workload that uses zram as a blockdev
> > and stores build outputs on it to reduce the wear-out problem of a real
> > blockdev. In this workload, the deduplication hit ratio is very high,
> > although I don't know the exact reason for it.
>
> Hmm, isn't it because the binary is composed of object files, so that
> much of the content is shared between the binary and the objects?
>
> I'd like to clear this up, because dedup was not useful for the swap
> workload in our Android testing, unlike the papers you mentioned.
> Of course, it depends on the workload, so someone might find it
> hugely useful for his swap workload. However, I want to focus on a clear
> winning-case scenario rather than "might be better".
>
> That's why I want to know the clear reason for the saving. If my assumption
> is right (i.e., object file vs. binary file), it would be enough
> justification to merge this feature, because users can turn on the feature
> with a reasonable expectation.

Okay. I checked the reason for the benefit on the kernel build now.
There are some cases where deduplication happens.

1) *.cmd
The build command is usually similar within one directory, so the contents
of these files are very similar. Please check fs/ext4/.namei.o.cmd and
fs/ext4/.inode.o.cmd. In my system, more than 789 lines are identical
out of the 944 and 939 lines of those files, respectively.

2) object files
built-in.o and temporary object files have similar content. More than 50%
of fs/ext4/ext4.o is the same as fs/ext4/built-in.o.

I think that we can optimize this case. ext4.o is a temporary object file
and isn't strictly needed. We could change fs/ext4/Makefile as follows.
orig>
obj-$(CONFIG_EXT4_FS) += ext4.o
ext4-y := XXX YYY ZZZ

optimized>
obj-$(CONFIG_EXT4_FS) += XXX YYY ZZZ

3) vmlinux
.tmp_vmlinux1, .tmp_vmlinux2 and arch/x86/boot/compressed/vmlinux.bin
have similar content.

I did a similar check for Android and it also has similar cases. Some
object files (.class and .so) are similar to other object files.
(./host/linux-x86/lib/libartd.so and ./host/linux-x86/lib/libartd-compiler.so)

> >
> > Anyway, if we can detect duplicated content and avoid storing the
> > duplicated content at a different memory location, we can save memory.
> > This patch tries to do that.
> >
> > The implementation is mostly simple and intuitive, but I should note
> > one implementation detail.
> >
> > To check for duplication, this patch uses a checksum of the page, and
> > collisions of this checksum are possible. There would be
> > many ways to handle this situation, but this patch chooses
> > to allow entries with a duplicated checksum to be added to the hash,
> > but not to compare all entries with a duplicated checksum
> > when checking for duplication. I guess that checksum collision is quite

> Hmm, if there are many entries with the same checksum, which specific
> one is compared among them?

I implemented it to check just the first one.

> > rare event and we don't need to pay any attention to such a case.

> If it's a rare event, why can't we iterate over all of the entries?

It's possible. If you prefer it, I can do it.

> > Therefore, I decided on the simplest way to implement the feature.
> > If there is a different opinion, I can accept it and go that way.
> >
> > Following is the result of this patch.
> >
> > Test result #1 (Swap):
> > Android Marshmallow, emulator, x86_64, backported to kernel v3.18
> >
> > orig_data_size: 145297408
> > compr_data_size: 32408125
> > mem_used_total: 32276480
> > dup_data_size: 3188134
> > meta_data_size: 1444272
> >
> > The last two metrics added to mm_stat are related to this work.
> > The first one, dup_data_size, is the amount of memory saved by avoiding
> > storing duplicated pages. The latter, meta_data_size, is the amount of
> > memory used by the data structures that support deduplication. If
> > dup > meta, we can judge that the patch improves memory usage.
> >
> > On Android, we can save 5% of memory usage with this work.
> >
> > Test result #2 (Blockdev):
> > build the kernel and store the output to an ext4 FS on zram
> >
> > (without dedup)
> > Elapsed time: 249 s
> > mm_stat: 430845952 191014886 196898816 0 196898816 28320 0 0 0
> >
> > (with dedup)
> > Elapsed time: 250 s
> > mm_stat: 430505984 190971334 148365312 0 148365312 28404 0 47287038 3945792
> >
> > There is no performance degradation, and it saves 23% of memory.
> >
> > Test result #3 (Blockdev):
> > copy the Android build output dir (out/host) to an ext4 FS on zram
> >
> > (without dedup)
> > Elapsed time: out/host: 88 s
> > mm_stat: 8834420736 3658184579 3834208256 0 3834208256 32889 0 0 0
> >
> > (with dedup)
> > Elapsed time: out/host: 100 s
> > mm_stat: 8832929792 3657329322 2832015360 0 2832015360 32609 0 952568877 80880336
> >
> > It shows roughly 13% performance degradation while saving 24% of memory.
> > Maybe this is due to the overhead of checksum calculation and comparison.
> >
> > Test result #4 (Blockdev):
> > copy the Android build output dir (out/target/common) to an ext4 FS on zram
> >
> > (without dedup)
> > Elapsed time: out/host: 203 s
> > mm_stat: 4041678848 2310355010 2346577920 0 2346582016 500 4 0 0
> >
> > (with dedup)
> > Elapsed time: out/host: 201 s
> > mm_stat: 4041666560 2310488276 1338150912 0 1338150912 476 0 989088794 24564336
> >
> > Memory is saved by 42% and performance is the same. Even though there is
> > overhead from checksum calculation and comparison, the high hit ratio
> > compensates for it, since a hit avoids a compression attempt.

> Cool! It would help a lot for users who have used zram for build output
> directories.

> >
> > Anyway, the benefit seems to be largely dependent on the workload, so a
> > following patch will make this feature optional.
However, this feature
> > can help some usecases, so it deserves to be merged.
> >
> > [1]: MemScope: Analyzing Memory Duplication on Android Systems,
> > dl.acm.org/citation.cfm?id=2797023
> > [2]: zswap: Optimize compressed pool memory utilization,
> > lkml.kernel.org/r/1341407574.7551.1471584870761.JavaMail.weblogic@epwas3p2

> Overall, it looks good.

Thanks!

> However, I want to change the function naming/structuring to my preferred
> style for maintainability, because it's a non-trivial feature. So, please
> look below.

Okay. I understand everything you commented on. I will fix them and I won't
reply to each one. Thanks.

> >
> > Signed-off-by: Joonsoo Kim
> > ---
> >  drivers/block/zram/zram_drv.c | 212 ++++++++++++++++++++++++++++++++++++++----
> >  drivers/block/zram/zram_drv.h |  18 ++++
> >  2 files changed, 214 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index a3d9cbca..012425f 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -32,6 +32,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >
> >  #include "zram_drv.h"
> >
> > @@ -385,14 +386,16 @@ static ssize_t mm_stat_show(struct device *dev,
> >  	max_used = atomic_long_read(&zram->stats.max_used_pages);
> >
> >  	ret = scnprintf(buf, PAGE_SIZE,
> > -			"%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n",
> > +			"%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu %8llu\n",
> >  			orig_size << PAGE_SHIFT,
> >  			(u64)atomic64_read(&zram->stats.compr_data_size),
> >  			mem_used << PAGE_SHIFT,
> >  			zram->limit_pages << PAGE_SHIFT,
> >  			max_used << PAGE_SHIFT,
> >  			(u64)atomic64_read(&zram->stats.same_pages),
> > -			pool_stats.pages_compacted);
> > +			pool_stats.pages_compacted,
> > +			(u64)atomic64_read(&zram->stats.dup_data_size),

> zram_dedup_stat_read?
>
> The reason to use a wrapper function is that I really want to remove any
> dedup-related code unless the user turns on the feature in Kconfig.
> Wrapper function would be more clean rather than ifdef. > > > + (u64)atomic64_read(&zram->stats.meta_data_size)); > > > > up_read(&zram->init_lock); > > > > return ret; > > @@ -419,29 +422,165 @@ static ssize_t debug_stat_show(struct device *dev, > > static DEVICE_ATTR_RO(mm_stat); > > static DEVICE_ATTR_RO(debug_stat); > > > > -static struct zram_entry *zram_entry_alloc(struct zram_meta *meta, > > +static u32 zram_calc_checksum(unsigned char *mem) > > zram_dedup_checksum. > > I want to use 'dedup' term all of dedup related to functions. > > > > +{ > > + return jhash(mem, PAGE_SIZE, 0); > > +} > > + > > +static struct zram_entry *zram_entry_alloc(struct zram *zram, > > unsigned int len, gfp_t flags) > > { > > + struct zram_meta *meta = zram->meta; > > struct zram_entry *entry; > > + unsigned long handle; > > > > - entry = kzalloc(sizeof(*entry), flags); > > - if (!entry) > > + handle = zs_malloc(meta->mem_pool, len, flags); > > + if (!handle) > > return NULL; > > Separate allocate zram_entry and dedeup. > IOW, > > struct zram_entry *entry = zram_entry_alloc(zram, xxx); > if (!zram_dedup_init(zram, entry, xxx)) > zram_entry_free(entry); > > > > > - entry->handle = zs_malloc(meta->mem_pool, len, flags); > > - if (!entry->handle) { > > - kfree(entry); > > + entry = kzalloc(sizeof(*entry), flags); > > + if (!entry) { > > + zs_free(meta->mem_pool, handle); > > return NULL; > > } > > > > + entry->handle = handle; > > + RB_CLEAR_NODE(&entry->rb_node); > > + entry->refcount = 1; > > + entry->len = len; > > + atomic64_add(sizeof(*entry), &zram->stats.meta_data_size); > > + > > return entry; > > } > > > > -static inline void zram_entry_free(struct zram_meta *meta, > > - struct zram_entry *entry) > > +static void zram_entry_insert(struct zram *zram, struct zram_entry *new, > > zram_dedup_insert or add? 
> > > + u32 checksum) > > +{ > > + struct zram_meta *meta = zram->meta; > > + struct zram_hash *hash; > > + struct rb_root *rb_root; > > + struct rb_node **rb_node, *parent = NULL; > > + struct zram_entry *entry; > > + > > + new->checksum = checksum; > > + hash = &meta->hash[checksum % meta->hash_size]; > > + rb_root = &hash->rb_root; > > + > > + spin_lock(&hash->lock); > > + rb_node = &rb_root->rb_node; > > + while (*rb_node) { > > + parent = *rb_node; > > + entry = rb_entry(parent, struct zram_entry, rb_node); > > + if (checksum < entry->checksum) > > + rb_node = &parent->rb_left; > > + else if (checksum > entry->checksum) > > + rb_node = &parent->rb_right; > > + else > > + rb_node = &parent->rb_left; > > + } > > + > > + rb_link_node(&new->rb_node, parent, rb_node); > > + rb_insert_color(&new->rb_node, rb_root); > > + spin_unlock(&hash->lock); > > +} > > + > > +static bool zram_entry_match(struct zram *zram, struct zram_entry *entry, > > Ditto. > > zram_dedup_match? > > > + unsigned char *mem) > > +{ > > + bool match = false; > > + unsigned char *cmem; > > + struct zram_meta *meta = zram->meta; > > + struct zcomp_strm *zstrm; > > + > > + cmem = zs_map_object(meta->mem_pool, entry->handle, ZS_MM_RO); > > + if (entry->len == PAGE_SIZE) { > > + match = !memcmp(mem, cmem, PAGE_SIZE); > > + } else { > > + zstrm = zcomp_stream_get(zram->comp); > > + if (!zcomp_decompress(zstrm, cmem, entry->len, zstrm->buffer)) > > + match = !memcmp(mem, zstrm->buffer, PAGE_SIZE); > > + zcomp_stream_put(zram->comp); > > + } > > + zs_unmap_object(meta->mem_pool, entry->handle); > > + > > + return match; > > +} > > + > > +static inline void zram_entry_free(struct zram *zram, struct zram_meta *meta, > > + struct zram_entry *entry) > > { > > zs_free(meta->mem_pool, entry->handle); > > kfree(entry); > > + if (zram) > > + atomic64_sub(sizeof(*entry), &zram->stats.meta_data_size); > > +} > > + > > +static bool zram_entry_put(struct zram *zram, struct zram_meta *meta, > > + struct zram_entry 
*entry, bool populated) > > Please, separate entry and dedup part like zram_entry_alloc. > Then, Can't we remove 'populated'? > > > +{ > > + struct zram_hash *hash; > > + u32 checksum; > > + > > + if (!populated) > > + goto free; > > + > > + checksum = entry->checksum; > > + hash = &meta->hash[checksum % meta->hash_size]; > > + > > + spin_lock(&hash->lock); > > + entry->refcount--; > > + if (!entry->refcount) { > > + populated = false; > > + rb_erase(&entry->rb_node, &hash->rb_root); > > + RB_CLEAR_NODE(&entry->rb_node); > > + } > > + spin_unlock(&hash->lock); > > + > > +free: > > + if (!populated) > > + zram_entry_free(zram, meta, entry); > > + > > + return populated; > > +} > > + > > +static struct zram_entry *zram_entry_get(struct zram *zram, > > We can rename it to zram_entry_match. > > > + unsigned char *mem, u32 checksum) > > +{ > > + struct zram_meta *meta = zram->meta; > > + struct zram_hash *hash; > > + struct zram_entry *entry; > > + struct rb_node *rb_node; > > + > > + hash = &meta->hash[checksum % meta->hash_size]; > > + > > + spin_lock(&hash->lock); > > + rb_node = hash->rb_root.rb_node; > > + while (rb_node) { > > + entry = rb_entry(rb_node, struct zram_entry, rb_node); > > + if (checksum == entry->checksum) { > > + entry->refcount++; > > + atomic64_add(entry->len, &zram->stats.dup_data_size); > > + spin_unlock(&hash->lock); > > + > > + if (zram_entry_match(zram, entry, mem)) > > __zram_entry_match? Or just open-code. 
> > > + return entry; > > + > > + if (zram_entry_put(zram, meta, entry, true)) { > > + atomic64_sub(entry->len, > > + &zram->stats.dup_data_size); > > + } > > + > > + return NULL; > > + } > > + > > + if (checksum < entry->checksum) > > + rb_node = rb_node->rb_left; > > + else > > + rb_node = rb_node->rb_right; > > + } > > + spin_unlock(&hash->lock); > > + > > + return NULL; > > } > > > > static void zram_meta_free(struct zram_meta *meta, u64 disksize) > > @@ -459,18 +598,31 @@ static void zram_meta_free(struct zram_meta *meta, u64 disksize) > > if (!entry || zram_test_flag(meta, index, ZRAM_SAME)) > > continue; > > > > - zram_entry_free(meta, entry); > > + zram_entry_put(NULL, meta, entry, true); > > } > > > > zs_destroy_pool(meta->mem_pool); > > + vfree(meta->hash); > > vfree(meta->table); > > kfree(meta); > > } > > > > +static void zram_meta_init(struct zram_meta *meta) > > +{ > > + int i; > > + struct zram_hash *hash; > > + > > + for (i = 0; i < meta->hash_size; i++) { > > + hash = &meta->hash[i]; > > + spin_lock_init(&hash->lock); > > + hash->rb_root = RB_ROOT; > > + } > > +} > > + > > static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize) > > { > > size_t num_pages; > > - struct zram_meta *meta = kmalloc(sizeof(*meta), GFP_KERNEL); > > + struct zram_meta *meta = kzalloc(sizeof(*meta), GFP_KERNEL); > > > > if (!meta) > > return NULL; > > @@ -482,15 +634,27 @@ static struct zram_meta *zram_meta_alloc(char *pool_name, u64 disksize) > > goto out_error; > > } > > > > + meta->hash_size = num_pages >> ZRAM_HASH_SHIFT; > > + meta->hash_size = min_t(size_t, ZRAM_HASH_SIZE_MAX, meta->hash_size); > > + meta->hash_size = max_t(size_t, ZRAM_HASH_SIZE_MIN, meta->hash_size); > > + meta->hash = vzalloc(meta->hash_size * sizeof(struct zram_hash)); > > + if (!meta->hash) { > > + pr_err("Error allocating zram entry hash\n"); > > + goto out_error; > > + } > > + > > meta->mem_pool = zs_create_pool(pool_name); > > if (!meta->mem_pool) { > > pr_err("Error creating 
memory pool\n"); > > goto out_error; > > } > > > > + zram_meta_init(meta); > > zram_dedup_init? > > Can't we put above hash initialization routines to zram_meta_init? > > > + > > return meta; > > > > out_error: > > + vfree(meta->hash); > > vfree(meta->table); > > kfree(meta); > > return NULL; > > @@ -520,7 +684,8 @@ static void zram_free_page(struct zram *zram, size_t index) > > if (!entry) > > return; > > > > - zram_entry_free(meta, entry); > > + if (zram_entry_put(zram, meta, entry, true)) > > + atomic64_sub(entry->len, &zram->stats.dup_data_size); > > Hmm, > > I want to put dedup stat logic in dedup functions without exporting > higher level functions like zram_free_page. > > So, > > zram_dedup_put: > ... > ... > atomic64_sub(entry->len, &zram->stats.dup_data_size); > > zram_free_page: > ... > if (zram_dedup_put()) > zram_entry_free > > > > > atomic64_sub(zram_get_obj_size(meta, index), > > &zram->stats.compr_data_size); > > @@ -631,6 +796,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, > > struct zcomp_strm *zstrm = NULL; > > unsigned long alloced_pages; > > unsigned long element; > > + u32 checksum; > > > > page = bvec->bv_page; > > if (is_partial_io(bvec)) { > > @@ -674,6 +840,18 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, > > goto out; > > } > > > > + checksum = zram_calc_checksum(uncmem); > > + if (!entry) { > > + entry = zram_entry_get(zram, uncmem, checksum); > > entry = zram_dedup_get > > > + if (entry) { > > + if (!is_partial_io(bvec)) > > + kunmap_atomic(user_mem); > > + > > + clen = entry->len; > > + goto found_duplication; > > + } > > + } > > + > > zstrm = zcomp_stream_get(zram->comp); > > ret = zcomp_compress(zstrm, uncmem, &clen); > > if (!is_partial_io(bvec)) { > > @@ -708,7 +886,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, > > * from the slow path and entry has already been allocated. 
> > */ > > if (!entry) { > > - entry = zram_entry_alloc(meta, clen, > > + entry = zram_entry_alloc(zram, clen, > > __GFP_KSWAPD_RECLAIM | __GFP_NOWARN); > > } > > > > @@ -718,7 +896,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, > > > > atomic64_inc(&zram->stats.writestall); > > > > - entry = zram_entry_alloc(meta, clen, GFP_NOIO); > > + entry = zram_entry_alloc(zram, clen, GFP_NOIO); > > if (entry) > > goto compress_again; > > > > @@ -732,7 +910,7 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, > > update_used_max(zram, alloced_pages); > > > > if (zram->limit_pages && alloced_pages > zram->limit_pages) { > > - zram_entry_free(meta, entry); > > + zram_entry_put(zram, meta, entry, false); > > ret = -ENOMEM; > > goto out; > > } > > @@ -750,7 +928,9 @@ static int zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index, > > zcomp_stream_put(zram->comp); > > zstrm = NULL; > > zs_unmap_object(meta->mem_pool, entry->handle); > > + zram_entry_insert(zram, entry, checksum); > > > > +found_duplication: > > Nit: > > I prefer found_dup. > > > /* > > * Free memory associated with this sector > > * before overwriting unused sectors. 
> > diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
> > index a7ae46c..07d1f8d 100644
> > --- a/drivers/block/zram/zram_drv.h
> > +++ b/drivers/block/zram/zram_drv.h
> > @@ -18,6 +18,7 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >
> >  #include "zcomp.h"
> >
> > @@ -45,6 +46,10 @@
> >  #define ZRAM_SECTOR_PER_LOGICAL_BLOCK	\
> >  	(1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT))
> >
> > +/* One slot will contain 128 pages theoretically */
> > +#define ZRAM_HASH_SHIFT		7
> > +#define ZRAM_HASH_SIZE_MIN	(1 << 10)
> > +#define ZRAM_HASH_SIZE_MAX	(1 << 31)
> >
> >  /*
> >   * The lower ZRAM_FLAG_SHIFT bits of table.value is for
> > @@ -70,6 +75,10 @@ enum zram_pageflags {
> >  /*-- Data structures */
> >
> >  struct zram_entry {
> > +	struct rb_node rb_node;
> > +	u32 len;
> > +	u32 checksum;
> > +	unsigned long refcount;

> It should be compiled out if the user doesn't turn on the feature in Kconfig.
> I hope you make the Kconfig option in [4/4].