* [PATCH 1/2] zram: count same page write as page_stored
From: Minchan Kim @ 2017-05-15  7:41 UTC
To: Andrew Morton
Cc: linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team, Minchan Kim

Regardless of whether it is a same-filled page or not, it is still a write
that is stored in zram, so we should increase the pages_stored stat.
Otherwise, users can see a zero value via mm_stat even though they have
written many pages to zram.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 372602c7da49..b885356551e9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -516,6 +516,7 @@ static bool zram_same_page_write(struct zram *zram, u32 index,
 		zram_slot_unlock(zram, index);

 		atomic64_inc(&zram->stats.same_pages);
+		atomic64_inc(&zram->stats.pages_stored);
 		return true;
 	}
 	kunmap_atomic(mem);
@@ -619,6 +620,7 @@ static void zram_free_page(struct zram *zram, size_t index)
 		zram_clear_flag(zram, index, ZRAM_SAME);
 		zram_set_element(zram, index, 0);
 		atomic64_dec(&zram->stats.same_pages);
+		atomic64_dec(&zram->stats.pages_stored);
 		return;
 	}
--
2.7.4
* [PATCH 2/2] zram: do not count duplicated pages as compressed
From: Minchan Kim @ 2017-05-15  7:41 UTC
To: Andrew Morton
Cc: linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team, Minchan Kim

Compressed pages and deduplicated pages are not the same thing, so we
should not count duplicated pages as compressed pages.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_dedup.c |  4 ----
 drivers/block/zram/zram_drv.c   | 24 +++++++++++++++++++-----
 drivers/block/zram/zram_drv.h   |  1 +
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c
index 14c4988f8ff7..c15848cc1b31 100644
--- a/drivers/block/zram/zram_dedup.c
+++ b/drivers/block/zram/zram_dedup.c
@@ -101,9 +101,6 @@ static unsigned long zram_dedup_put(struct zram *zram,
 	entry->refcount--;
 	if (!entry->refcount)
 		rb_erase(&entry->rb_node, &hash->rb_root);
-	else
-		atomic64_sub(entry->len, &zram->stats.dup_data_size);
-
 	spin_unlock(&hash->lock);

 	return entry->refcount;
@@ -127,7 +124,6 @@ static struct zram_entry *__zram_dedup_get(struct zram *zram,

 again:
 	entry->refcount++;
-	atomic64_add(entry->len, &zram->stats.dup_data_size);
 	spin_unlock(&hash->lock);

 	if (prev)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b885356551e9..8152e405117b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -624,15 +624,22 @@ static void zram_free_page(struct zram *zram, size_t index)
 		return;
 	}

+	if (zram_dedup_enabled(zram) &&
+			zram_test_flag(zram, index, ZRAM_DUP)) {
+		zram_clear_flag(zram, index, ZRAM_DUP);
+		atomic64_sub(entry->len, &zram->stats.dup_data_size);
+		goto out;
+	}
+
 	if (!entry)
 		return;

-	zram_entry_free(zram, entry);
-
 	atomic64_sub(zram_get_obj_size(zram, index),
 			&zram->stats.compr_data_size);
-	atomic64_dec(&zram->stats.pages_stored);
+out:
+	zram_entry_free(zram, entry);
+	atomic64_dec(&zram->stats.pages_stored);

 	zram_set_entry(zram, index, NULL);
 	zram_set_obj_size(zram, index, 0);
 }
@@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
 	entry = zram_dedup_find(zram, page, &checksum);
 	if (entry) {
 		comp_len = entry->len;
-		goto found_dup;
+		zram_slot_lock(zram, index);
+		zram_free_page(zram, index);
+		zram_set_flag(zram, index, ZRAM_DUP);
+		zram_set_entry(zram, index, entry);
+		zram_set_obj_size(zram, index, comp_len);
+		zram_slot_unlock(zram, index);
+		atomic64_add(comp_len, &zram->stats.dup_data_size);
+		atomic64_inc(&zram->stats.pages_stored);
+		return 0;
 	}

 	zstrm = zcomp_stream_get(zram->comp);
@@ -818,7 +833,6 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
 	zs_unmap_object(zram->mem_pool, zram_entry_handle(zram, entry));
 	zram_dedup_insert(zram, entry, checksum);

-found_dup:
 	/*
 	 * Free memory associated with this sector
 	 * before overwriting unused sectors.
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 0091e23873c1..8ccfdcd8f674 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -64,6 +64,7 @@ static const size_t max_zpage_size = PAGE_SIZE / 4 * 3;
 enum zram_pageflags {
 	/* Page consists entirely of zeros */
 	ZRAM_SAME = ZRAM_FLAG_SHIFT,
+	ZRAM_DUP,
 	ZRAM_ACCESS,	/* page is now accessed */

 	__NR_ZRAM_PAGEFLAGS,
--
2.7.4
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
From: Sergey Senozhatsky @ 2017-05-16  1:30 UTC
To: Minchan Kim
Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team

On (05/15/17 16:41), Minchan Kim wrote:
[..]
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index b885356551e9..8152e405117b 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -624,15 +624,22 @@ static void zram_free_page(struct zram *zram, size_t index)
> 		return;
> 	}
>
> +	if (zram_dedup_enabled(zram) &&
> +			zram_test_flag(zram, index, ZRAM_DUP)) {
> +		zram_clear_flag(zram, index, ZRAM_DUP);
> +		atomic64_sub(entry->len, &zram->stats.dup_data_size);
> +		goto out;
> +	}

so that `goto' there is to just jump over ->stats.compr_data_size?
can you sub ->stats.compr_data_size before the `if' and avoid labels?

> 	if (!entry)
> 		return;

shouldn't this `if' be moved before `if (zram_dedup_enabled(zram)`?

[..]
> @@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
> 	entry = zram_dedup_find(zram, page, &checksum);
> 	if (entry) {
> 		comp_len = entry->len;
> -		goto found_dup;
> +		zram_slot_lock(zram, index);
> +		zram_free_page(zram, index);
> +		zram_set_flag(zram, index, ZRAM_DUP);
> +		zram_set_entry(zram, index, entry);
> +		zram_set_obj_size(zram, index, comp_len);
> +		zram_slot_unlock(zram, index);
> +		atomic64_add(comp_len, &zram->stats.dup_data_size);
> +		atomic64_inc(&zram->stats.pages_stored);
> +		return 0;

hm, that's somewhat big code duplication, isn't it?

	-ss
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
From: Minchan Kim @ 2017-05-16  1:59 UTC
To: Sergey Senozhatsky
Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team

Hi Sergey,

On Tue, May 16, 2017 at 10:30:22AM +0900, Sergey Senozhatsky wrote:
> On (05/15/17 16:41), Minchan Kim wrote:
> [..]
> > +	if (zram_dedup_enabled(zram) &&
> > +			zram_test_flag(zram, index, ZRAM_DUP)) {
> > +		zram_clear_flag(zram, index, ZRAM_DUP);
> > +		atomic64_sub(entry->len, &zram->stats.dup_data_size);
> > +		goto out;
> > +	}
>
> so that `goto' there is to just jump over ->stats.compr_data_size?

Yub.

> can you sub ->stats.compr_data_size before the `if' and avoid labels?
>
> > 	if (!entry)
> > 		return;
>
> shouldn't this `if' be moved before `if (zram_dedup_enabled(zram)`?

You mean this?

static void zram_free_page(..)
{
	if (zram_test_flag(zram, index, ZRAM_SAME))
		...

	if (!entry)
		return;

	if (zram_dedup_enabled(zram) && xxxx) {
		zram_clear_flag(ZRAM_DUP);
		atomic64_sub(entry->len, &zram->stats.dup_data_size);
	} else {
		atomic64_sub(zram_get_obj_size(zram, index),
				&zram->stats.compr_data_size);
	}

	zram_entry_free
	zram_set_entry
	zram_set_obj_size
}

> [..]
> > @@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
> > 	entry = zram_dedup_find(zram, page, &checksum);
> > 	if (entry) {
> > 		comp_len = entry->len;
> > -		goto found_dup;
> > +		zram_slot_lock(zram, index);
> > +		zram_free_page(zram, index);
> > +		zram_set_flag(zram, index, ZRAM_DUP);
> > +		zram_set_entry(zram, index, entry);
> > +		zram_set_obj_size(zram, index, comp_len);
> > +		zram_slot_unlock(zram, index);
> > +		atomic64_add(comp_len, &zram->stats.dup_data_size);
> > +		atomic64_inc(&zram->stats.pages_stored);
> > +		return 0;
>
> hm. that's a somewhat big code duplication. isn't it?

Yub, three parts: the part above, zram_same_page_write, and the tail of
__zram_bvec_write. Do you have any ideas? Feel free to suggest. :)

Thanks.
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
From: Sergey Senozhatsky @ 2017-05-16  2:36 UTC
To: Minchan Kim
Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim, kernel-team

Hello Minchan,

On (05/16/17 10:59), Minchan Kim wrote:
> Hi Sergey,
[..]
> You mean this?
>
> static void zram_free_page(..)
> {
> 	if (zram_test_flag(zram, index, ZRAM_SAME))
> 		...
>
> 	if (!entry)
> 		return;
>
> 	if (zram_dedup_enabled(zram) && xxxx) {
> 		zram_clear_flag(ZRAM_DUP);
> 		atomic64_sub(entry->len, &zram->stats.dup_data_size);
> 	} else {
> 		atomic64_sub(zram_get_obj_size(zram, index),
> 				&zram->stats.compr_data_size);
> 	}
>
> 	zram_entry_free
> 	zram_set_entry
> 	zram_set_obj_size
> }

yeah, something like this.

> > > @@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
> > > 	entry = zram_dedup_find(zram, page, &checksum);
> > > 	if (entry) {
> > > 		comp_len = entry->len;
> > > -		goto found_dup;
> > > +		zram_slot_lock(zram, index);
> > > +		zram_free_page(zram, index);
> > > +		zram_set_flag(zram, index, ZRAM_DUP);
> > > +		zram_set_entry(zram, index, entry);
> > > +		zram_set_obj_size(zram, index, comp_len);
> > > +		zram_slot_unlock(zram, index);
> > > +		atomic64_add(comp_len, &zram->stats.dup_data_size);
> > > +		atomic64_inc(&zram->stats.pages_stored);
> > > +		return 0;
> >
> > hm. that's a somewhat big code duplication. isn't it?
>
> Yub, three parts: the part above, zram_same_page_write, and the tail of
> __zram_bvec_write.

hmm... good question... I can hardly think of anything significantly
better; zram object handling is now a mix of flags, entries, ref_counters,
etc. maybe we can merge some of those ops if we keep the slot locked
through the entire __zram_bvec_write(), but that does not look attractive.

something ABSOLUTELY untested and incomplete. not even compile tested (!).
99% broken and stupid (!). but there is one thing that it has revealed,
and that is why I include it. see below:

---

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 372602c7da49..b31543c40d54 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -509,11 +509,8 @@ static bool zram_same_page_write(struct zram *zram, u32 index,
 	if (page_same_filled(mem, &element)) {
 		kunmap_atomic(mem);
 		/* Free memory associated with this sector now. */
-		zram_slot_lock(zram, index);
-		zram_free_page(zram, index);
 		zram_set_flag(zram, index, ZRAM_SAME);
 		zram_set_element(zram, index, element);
-		zram_slot_unlock(zram, index);

 		atomic64_inc(&zram->stats.same_pages);
 		return true;
@@ -778,7 +775,7 @@ static int zram_compress(struct zram *zram, struct zcomp_strm **zstrm,

 static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
 {
-	int ret;
+	int ret = 0;
 	struct zram_entry *entry;
 	unsigned int comp_len;
 	void *src, *dst;
@@ -786,12 +783,20 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
 	struct page *page = bvec->bv_page;
 	u32 checksum;

+	/*
+	 * Free memory associated with this sector
+	 * before overwriting unused sectors.
+	 */
+	zram_slot_lock(zram, index);
+	zram_free_page(zram, index);
+
 	if (zram_same_page_write(zram, index, page))
-		return 0;
+		goto out_unlock;

 	entry = zram_dedup_find(zram, page, &checksum);
 	if (entry) {
 		comp_len = entry->len;
+		zram_set_flag(zram, index, ZRAM_DUP);
 		goto found_dup;
 	}

@@ -799,7 +804,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
 	ret = zram_compress(zram, &zstrm, page, &entry, &comp_len);
 	if (ret) {
 		zcomp_stream_put(zram->comp);
-		return ret;
+		goto out_unlock;
 	}

 	dst = zs_map_object(zram->mem_pool,
@@ -817,20 +822,16 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
 	zram_dedup_insert(zram, entry, checksum);

 found_dup:
-	/*
-	 * Free memory associated with this sector
-	 * before overwriting unused sectors.
-	 */
-	zram_slot_lock(zram, index);
-	zram_free_page(zram, index);
 	zram_set_entry(zram, index, entry);
 	zram_set_obj_size(zram, index, comp_len);
-	zram_slot_unlock(zram, index);

 	/* Update stats */
 	atomic64_add(comp_len, &zram->stats.compr_data_size);
 	atomic64_inc(&zram->stats.pages_stored);
-	return 0;
+
+out_unlock:
+	zram_slot_unlock(zram, index);
+	return ret;
 }

---

namely, the zram_compress() error return path from __zram_bvec_write().

currently, we touch the existing compressed object and overwrite it only
when we have successfully compressed a new object. when zram_compress()
fails we propagate the error, but never touch the old object. so all reads
that hit that index later will read stale data. and probably it would make
more sense to fail those reads as well; IOW, to free the old page
regardless of zram_compress() progress.

what do you think?

	-ss
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
From: Minchan Kim @ 2017-05-16  5:26 UTC
To: Sergey Senozhatsky
Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team

On Tue, May 16, 2017 at 11:36:15AM +0900, Sergey Senozhatsky wrote:
> > > hm. that's a somewhat big code duplication. isn't it?
> >
> > Yub, three parts: the part above, zram_same_page_write, and the tail of
> > __zram_bvec_write.
>
> hmm... good question... I can hardly think of anything significantly
> better; zram object handling is now a mix of flags, entries, ref_counters,
> etc. maybe we can merge some of those ops if we keep the slot locked
> through the entire __zram_bvec_write(), but that does not look attractive.
>
> something ABSOLUTELY untested and incomplete. not even compile tested (!).
> 99% broken and stupid (!). but there is one thing that it has revealed,
> and that is why I include it. see below:
>
> ---
>
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 372602c7da49..b31543c40d54 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -509,11 +509,8 @@ static bool zram_same_page_write(struct zram *zram, u32 index,
> 	if (page_same_filled(mem, &element)) {
> 		kunmap_atomic(mem);
> 		/* Free memory associated with this sector now. */
> -		zram_slot_lock(zram, index);
> -		zram_free_page(zram, index);
> 		zram_set_flag(zram, index, ZRAM_SAME);
> 		zram_set_element(zram, index, element);
> -		zram_slot_unlock(zram, index);
>
> 		atomic64_inc(&zram->stats.same_pages);
> 		return true;
> @@ -778,7 +775,7 @@ static int zram_compress(struct zram *zram, struct zcomp_strm **zstrm,
>
>  static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
>  {
> -	int ret;
> +	int ret = 0;
> 	struct zram_entry *entry;
> 	unsigned int comp_len;
> 	void *src, *dst;
> @@ -786,12 +783,20 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
> 	struct page *page = bvec->bv_page;
> 	u32 checksum;
>
> +	/*
> +	 * Free memory associated with this sector
> +	 * before overwriting unused sectors.
> +	 */
> +	zram_slot_lock(zram, index);
> +	zram_free_page(zram, index);

Hmm, zram_free should happen only if the write is done successfully.
Otherwise, we lose the valid data even though the write IO failed.

> +
> 	if (zram_same_page_write(zram, index, page))
> -		return 0;
> +		goto out_unlock;
>
> 	entry = zram_dedup_find(zram, page, &checksum);
> 	if (entry) {
> 		comp_len = entry->len;
> +		zram_set_flag(zram, index, ZRAM_DUP);

In case of hitting dedup, we shouldn't increase compr_data_size.

If we fix the above two problems, do you think it's still cleaner?
(I don't mean to be reluctant about your suggestion. It is a real
question, to know your thoughts. :)

> 		goto found_dup;
> 	}
>
> @@ -799,7 +804,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
> 	ret = zram_compress(zram, &zstrm, page, &entry, &comp_len);
> 	if (ret) {
> 		zcomp_stream_put(zram->comp);
> -		return ret;
> +		goto out_unlock;
> 	}
>
> 	dst = zs_map_object(zram->mem_pool,
> @@ -817,20 +822,16 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
> 	zram_dedup_insert(zram, entry, checksum);
>
>  found_dup:
> -	/*
> -	 * Free memory associated with this sector
> -	 * before overwriting unused sectors.
> -	 */
> -	zram_slot_lock(zram, index);
> -	zram_free_page(zram, index);
> 	zram_set_entry(zram, index, entry);
> 	zram_set_obj_size(zram, index, comp_len);
> -	zram_slot_unlock(zram, index);
>
> 	/* Update stats */
> 	atomic64_add(comp_len, &zram->stats.compr_data_size);
> 	atomic64_inc(&zram->stats.pages_stored);
> -	return 0;
> +
> +out_unlock:
> +	zram_slot_unlock(zram, index);
> +	return ret;
>  }
>
> ---
>
> namely, the zram_compress() error return path from __zram_bvec_write().
>
> currently, we touch the existing compressed object and overwrite it only
> when we have successfully compressed a new object. when zram_compress()
> fails we propagate the error, but never touch the old object. so all reads
> that hit that index later will read stale data. and probably it would make
> more sense to fail those reads as well; IOW, to free the old page
> regardless of zram_compress() progress.
>
> what do you think?
>
> -ss
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
From: Sergey Senozhatsky @ 2017-05-16  5:45 UTC
To: Minchan Kim
Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim, kernel-team

On (05/16/17 14:26), Minchan Kim wrote:
[..]
> > +	/*
> > +	 * Free memory associated with this sector
> > +	 * before overwriting unused sectors.
> > +	 */
> > +	zram_slot_lock(zram, index);
> > +	zram_free_page(zram, index);
>
> Hmm, zram_free should happen only if the write is done successfully.
> Otherwise, we lose the valid data even though the write IO failed.

but would this be correct? the data is not valid - we failed to store
the valid one. but instead we assure the application that the
read()/swapin/etc., depending on the usage scenario, is successful
(even though the data is not what the application really expects to see);
the application tries to use the data from that page and probably crashes
(dunno, for example the page contained hash tables with pointers that are
not valid anymore, etc.).

I'm not optimistic about stale data reads; it will basically look like
data corruption to the application.

> > +	if (zram_same_page_write(zram, index, page))
> > -		return 0;
> > +		goto out_unlock;
> >
> > 	entry = zram_dedup_find(zram, page, &checksum);
> > 	if (entry) {
> > 		comp_len = entry->len;
> > +		zram_set_flag(zram, index, ZRAM_DUP);
>
> In case of hitting dedup, we shouldn't increase compr_data_size.

no, we should not. you are right. my "... patch" is incomplete and
wrong. please don't pay too much attention to it.

> If we fix the above two problems, do you think it's still cleaner?
> (I don't mean to be reluctant about your suggestion. It is a real
> question, to know your thoughts. :)

do you mean code duplication and stale data reads?

I'd probably prefer to address stale data reads separately.
but it seems that a stale-reads fix will re-do parts of your
0002 patch and, at least potentially, reduce code duplication.

so we can go with your 0002, and then the stale-reads fix will try
to reduce code duplication (unless we want to have 4 places doing
the same thing :) )

	-ss
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
From: Minchan Kim @ 2017-05-16  7:16 UTC
To: Sergey Senozhatsky
Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team

On Tue, May 16, 2017 at 02:45:33PM +0900, Sergey Senozhatsky wrote:
> On (05/16/17 14:26), Minchan Kim wrote:
> [..]
> > > +	zram_slot_lock(zram, index);
> > > +	zram_free_page(zram, index);
> >
> > Hmm, zram_free should happen only if the write is done successfully.
> > Otherwise, we lose the valid data even though the write IO failed.
>
> but would this be correct? the data is not valid - we failed to store
> the valid one. but instead we assure the application that the
> read()/swapin/etc., depending on the usage scenario, is successful
> (even though the data is not what the application really expects to see);
> the application tries to use the data from that page and probably crashes
> (dunno, for example the page contained hash tables with pointers that are
> not valid anymore, etc.).
>
> I'm not optimistic about stale data reads; it will basically look like
> data corruption to the application.

Hmm, I don't understand what you are saying.
My point is that zram_free_page should be done only if the whole write
operation is successful. With your change, the following situation can
happen:

	write block 4, 'all A' -> success
	read block 4, 'all A' verified -> good
	write block 4, 'all B' -> but failed with ENOMEM
	read block 4, expected 'all A' but got 'all 0' -> oops

That is the problem I pointed out.
If I am missing something, could you elaborate a bit?

Thanks!

> > > +	if (zram_same_page_write(zram, index, page))
> > > -		return 0;
> > > +		goto out_unlock;
> > >
> > > 	entry = zram_dedup_find(zram, page, &checksum);
> > > 	if (entry) {
> > > 		comp_len = entry->len;
> > > +		zram_set_flag(zram, index, ZRAM_DUP);
> >
> > In case of hitting dedup, we shouldn't increase compr_data_size.
>
> no, we should not. you are right. my "... patch" is incomplete and
> wrong. please don't pay too much attention to it.
>
> > If we fix the above two problems, do you think it's still cleaner?
> > (I don't mean to be reluctant about your suggestion. It is a real
> > question, to know your thoughts. :)
>
> do you mean code duplication and stale data reads?
>
> I'd probably prefer to address stale data reads separately.
> but it seems that a stale-reads fix will re-do parts of your
> 0002 patch and, at least potentially, reduce code duplication.
>
> so we can go with your 0002, and then the stale-reads fix will try
> to reduce code duplication (unless we want to have 4 places doing
> the same thing :) )
>
> -ss
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
From: Sergey Senozhatsky @ 2017-05-16  7:36 UTC
To: Minchan Kim
Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim, kernel-team

On (05/16/17 16:16), Minchan Kim wrote:
> > but would this be correct? the data is not valid - we failed to store
> > the valid one. but instead we assure the application that the
> > read()/swapin/etc., depending on the usage scenario, is successful
> > (even though the data is not what the application really expects to see);
> > the application tries to use the data from that page and probably crashes.
> >
> > I'm not optimistic about stale data reads; it will basically look like
> > data corruption to the application.
>
> Hmm, I don't understand what you are saying.
> My point is that zram_free_page should be done only if the whole write
> operation is successful. With your change, the following situation can
> happen:
>
> 	write block 4, 'all A' -> success
> 	read block 4, 'all A' verified -> good
> 	write block 4, 'all B' -> but failed with ENOMEM
> 	read block 4, expected 'all A' but got 'all 0' -> oops

yes. 'all A' in #4 can be incorrect. zram can be used as a block device
with a file system, and the pid that does the write op does not necessarily
do the read op later. it can be a completely different application, e.g.
compilation, or anything else.

suppose PID A does

	wr block 1	all a
	wr block 2	all a + 1
	wr block 3	all a + 2
	wr block 4	all a + 3

now PID A does

	wr block 1	all m
	wr block 2	all m + 1
	wr block 3	all m + 2
	wr block 4	failed. block still has 'all a + 3'.
	exit

another application, PID C, reads in the file and tries to do
something sane with it

	rd block 1	all m
	rd block 2	all m + 1
	rd block 3	all m + 2
	rd block 4	all a + 3	<< this is dangerous. we should
					   return an error from read()
					   here; not stale data.

what we can return now is 'partially updated' data, with some new and
some stale pages. this is quite unlikely to end up anywhere good.
am I wrong?

why does `rd block 4' in your case cause an Oops? as a worst-case
scenario? the application does not expect the page to be 'all A' at this
point. the pages are likely to belong to some mappings/files/etc., and
there is likely a data dependency between them; dunno, C++ objects that
span across pages, or JPEG images, etc. so returning "new data, new data,
stale data" is a bit fishy.

	-ss
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
From: Minchan Kim @ 2017-05-17  8:32 UTC
To: Sergey Senozhatsky
Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block, hch, viro, axboe, jlayton, tytso

Hi Sergey,

On Tue, May 16, 2017 at 04:36:17PM +0900, Sergey Senozhatsky wrote:
> On (05/16/17 16:16), Minchan Kim wrote:
> > Hmm, I don't understand what you are saying.
> > My point is that zram_free_page should be done only if the whole write
> > operation is successful. With your change, the following situation can
> > happen:
> >
> > 	write block 4, 'all A' -> success
> > 	read block 4, 'all A' verified -> good
> > 	write block 4, 'all B' -> but failed with ENOMEM
> > 	read block 4, expected 'all A' but got 'all 0' -> oops
>
> yes. 'all A' in #4 can be incorrect. zram can be used as a block device
> with a file system, and the pid that does the write op does not
> necessarily do the read op later. it can be a completely different
> application, e.g. compilation, or anything else.
>
> suppose PID A does
>
> 	wr block 1	all a
> 	wr block 2	all a + 1
> 	wr block 3	all a + 2
> 	wr block 4	all a + 3
>
> now PID A does
>
> 	wr block 1	all m
> 	wr block 2	all m + 1
> 	wr block 3	all m + 2
> 	wr block 4	failed. block still has 'all a + 3'.
> 	exit
>
> another application, PID C, reads in the file and tries to do
> something sane with it
>
> 	rd block 1	all m
> 	rd block 2	all m + 1
> 	rd block 3	all m + 2
> 	rd block 4	all a + 3	<< this is dangerous. we should
> 					   return an error from read()
> 					   here; not stale data.
>
> what we can return now is 'partially updated' data, with some new and
> some stale pages. this is quite unlikely to end up anywhere good.
> am I wrong?
>
> why does `rd block 4' in your case cause an Oops? as a worst-case
> scenario? the application does not expect the page to be 'all A' at this
> point. the pages are likely to belong to some mappings/files/etc., and
> there is likely a data dependency between them; dunno, C++ objects that
> span across pages, or JPEG images, etc. so returning "new data, new data,
> stale data" is a bit fishy.

I thought more about it and am starting to get confused. :/
So let's Cc linux-block and the fs people.

The question is: is it okay for a block device (especially zram, which is
a compressed RAM block device) to return garbage when an ongoing overwrite
IO fails?

	O_DIRECT write block 4, 'aaa..' -> success
	read block 4, 'aaa..' -> success
	O_DIRECT write block 4, 'bbb..' -> fail
	read block 4, '000..' -> is that okay?

Hope to get an answer from the experts. :)
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed 2017-05-17 8:32 ` Minchan Kim @ 2017-05-17 9:14 ` Sergey Senozhatsky 2017-05-18 4:53 ` Minchan Kim 2017-05-21 7:04 ` Christoph Hellwig 1 sibling, 1 reply; 15+ messages in thread From: Sergey Senozhatsky @ 2017-05-17 9:14 UTC (permalink / raw) To: Minchan Kim Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block, hch, viro, axboe, jlayton, tytso Hello Minchan, On (05/17/17 17:32), Minchan Kim wrote: [..] > > what we can return now is a `partially updated' data, with some new > > and some stale pages. this is quite unlikely to end up anywhere good. > > am I wrong? > > > > why does `rd block 4' in your case causes Oops? as a worst case scenario? > > application does not expect page to be 'all A' at this point. pages are > > likely to belong to some mappings/files/etc., and there is likely a data > > dependency between them, dunno C++ objects that span across pages or > > JPEG images, etc. so returning "new data new data stale data" is a bit > > fishy. > > I thought more about it and start to confuse. :/ sorry, I'm not sure I see what's the source of your confusion :) my point is - we should not let READ succeed if we know that WRITE failed. assume JPEG image example, over-write block 1 aaa->xxx OK over-write block 2 bbb->yyy OK over-write block 3 ccc->zzz error reading that JPEG file read block 1 xxx OK read block 2 yyy OK read block 3 ccc OK << we should not return OK here. because "xxxyyyccc" is not the correct JPEG file anyway. do you agree that telling application that read() succeeded and at the same returning corrupted "xxxyyyccc" instead of "xxxyyyzzz" is not correct? so how about this, - if we fail to compress page (S/W or H/W compressor error, depending on particular setup) let's store it uncompressed (page_size-d zspool object). ? this should do the trick. 
at least we will have correct data:

xxx - compressed
yyy - compressed
zzz - uncompressed, because the compressing back-end returned an error.

thoughts?

	-ss
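The fallback being proposed could look roughly like the sketch below. All names here are made up for illustration: the stub `compress()` stands in for a real compressor back-end, and real zram stores objects in zsmalloc rather than flat per-slot buffers. The point is only the shape of the logic: on compressor error (or no size win), keep the raw page so a later read returns correct data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096

/* Stub compressor back-end: pretends a 2:1 ratio, fails on demand. */
static int fake_compress_fail;

static int compress(const char *src, char *dst, size_t *out_len)
{
	if (fake_compress_fail)
		return -1;
	memcpy(dst, src, PAGE_SIZE / 2);	/* fake "compressed" output */
	*out_len = PAGE_SIZE / 2;
	return 0;
}

struct slot {
	char data[PAGE_SIZE];
	size_t len;	/* == PAGE_SIZE means "stored uncompressed" */
};

/* Store a page; on compressor failure fall back to an uncompressed copy
 * so reads never have to report stale or missing data. */
static void store_page(struct slot *s, const char *page)
{
	size_t clen;

	if (compress(page, s->data, &clen) == 0 && clen < PAGE_SIZE) {
		s->len = clen;
		return;
	}
	memcpy(s->data, page, PAGE_SIZE);
	s->len = PAGE_SIZE;
}
```

The read side would then check `len` to decide whether to decompress or to copy the raw page out directly, which is also how zram already handles incompressible pages.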
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-17  9:14 ` Sergey Senozhatsky
@ 2017-05-18  4:53   ` Minchan Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2017-05-18 4:53 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
      kernel-team, linux-fsdevel, linux-block, hch, viro, axboe,
      jlayton, tytso

Hi Sergey,

On Wed, May 17, 2017 at 06:14:23PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
>
> On (05/17/17 17:32), Minchan Kim wrote:
> [..]
> > > what we can return now is a `partially updated' data, with some new
> > > and some stale pages. this is quite unlikely to end up anywhere good.
> > > am I wrong?
> > >
> > > why does `rd block 4' in your case cause an Oops? as a worst case scenario?
> > > application does not expect page to be 'all A' at this point. pages are
> > > likely to belong to some mappings/files/etc., and there is likely a data
> > > dependency between them, dunno C++ objects that span across pages or
> > > JPEG images, etc. so returning "new data new data stale data" is a bit
> > > fishy.
> >
> > I thought more about it and started to get confused. :/
>
> sorry, I'm not sure I see what the source of your confusion is :)
>
> my point is - we should not let READ succeed if we know that WRITE
> failed. assume the JPEG image example,

I don't think we should do that. I will write down my thoughts below. :)

> over-write block 1   aaa->xxx   OK
> over-write block 2   bbb->yyy   OK
> over-write block 3   ccc->zzz   error
>
> reading that JPEG file
>
> read block 1   xxx   OK
> read block 2   yyy   OK
> read block 3   ccc   OK  << we should not return OK here, because
>                             "xxxyyyccc" is not the correct JPEG file
>                             anyway.
>
> do you agree that telling the application that read() succeeded and at
> the same time returning corrupted "xxxyyyccc" instead of "xxxyyyzzz" is
> not correct?

I don't agree.

I *think* a block device is just a dumb device, so zram doesn't need to
know about any objects from the upper layer. What zram should consider
is basically the read/write success or failure of an IO unit (maybe a
BIO). So if we assume each step in the above example is a bio unit, I
think there is no problem returning "xxxyyyccc".

What I meant by "started to get confused" was about atomicity, not the
above. I think it's okay to return "ccc" instead of "zzz", but is it
okay for zram to return "000", which is neither "ccc" nor "zzz"? My
conclusion, after a discussion with one of my FS friends, is that it is
okay.

Let's think about it. The FS requests a write of "aaa" to block 4 and it
fails for some reason (a H/W failure, or a S/W failure like ENOMEM). The
interface to catch the failure is the function registered by bio_endio,
which is normally handled via AS_EIO set by mapping_set_error, as well
as the PG_error flag of the page. In this case, the FS assumes block 4
can contain stale data (neither 'zzz' nor 'ccc'), because the device may
have failed in the middle of writing data to a block if the block device
doesn't support atomic writes (which I guess is the more common case).
So it is safe to consider the block as holding garbage now, rather than
either the old or the new value. (I hope I explained my thought well :/)

Having said that, I think everyone would like block devices to support
atomicity (i.e., return either the old or the new data), so I am
reluctant to change the behavior for a simple refactoring.

Thanks.
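The error-reporting path described above can be approximated with a small userspace model. The `_model` names are hypothetical stand-ins; in the kernel, the bio completion handler calls mapping_set_error(), which sets AS_EIO on the mapping, and a later fsync()/close() consumes it via filemap_check_errors().

```c
#include <assert.h>
#include <errno.h>

#define AS_EIO_FLAG 0x1u

/* Toy stand-in for an address_space's error state. */
struct mapping_model {
	unsigned int flags;
};

/* Called from the write-completion (bio_endio-like) path when the
 * device reports a failure; models mapping_set_error(). */
static void mapping_set_error_model(struct mapping_model *m, int err)
{
	if (err)
		m->flags |= AS_EIO_FLAG;
}

/* Models filemap_check_errors(): fsync()/close() see -EIO once, and
 * the FS must then treat the affected blocks as holding garbage. */
static int check_errors_model(struct mapping_model *m)
{
	if (m->flags & AS_EIO_FLAG) {
		m->flags &= ~AS_EIO_FLAG;	/* test-and-clear */
		return -EIO;
	}
	return 0;
}
```

This is why the FS cannot assume either the old or the new contents after an error: it learns only that some write to the mapping failed, not which data survived.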
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-17  8:32 ` Minchan Kim
  2017-05-17  9:14   ` Sergey Senozhatsky
@ 2017-05-21  7:04   ` Christoph Hellwig
  2017-05-21  7:15     ` Minchan Kim
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2017-05-21 7:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
      Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block, hch,
      viro, axboe, jlayton, tytso

On Wed, May 17, 2017 at 05:32:12PM +0900, Minchan Kim wrote:
> Is a block device (esp. zram, which is a compressed ram block device)
> okay to return garbage when an ongoing overwrite IO fails?
>
> O_DIRECT write 4 block "aaa.." -> success
> read 4 block "aaa.." -> success
> O_DIRECT write 4 block "bbb.." -> fail
> read 4 block "000.." -> is it okay?
>
> Hope to get an answer from experts. :)

It's "okay" as it's what existing real block devices do (at least on a
sector boundary). It's not "nice" though, so if you can avoid it,
please do.
* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-21  7:04 ` Christoph Hellwig
@ 2017-05-21  7:15   ` Minchan Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2017-05-21 7:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
      Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block,
      viro, axboe, jlayton, tytso

On Sun, May 21, 2017 at 12:04:27AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2017 at 05:32:12PM +0900, Minchan Kim wrote:
> > Is a block device (esp. zram, which is a compressed ram block device)
> > okay to return garbage when an ongoing overwrite IO fails?
> >
> > O_DIRECT write 4 block "aaa.." -> success
> > read 4 block "aaa.." -> success
> > O_DIRECT write 4 block "bbb.." -> fail
> > read 4 block "000.." -> is it okay?
> >
> > Hope to get an answer from experts. :)
>
> It's "okay" as it's what existing real block devices do (at least on a
> sector boundary). It's not "nice" though, so if you can avoid it,
> please do.

That was my understanding, so I wanted to avoid it for a simple code
refactoring. Your comment helps to confirm that thought.

Thanks, Christoph!
* Re: [PATCH 1/2] zram: count same page write as page_stored
  2017-05-15  7:41 [PATCH 1/2] zram: count same page write as page_stored Minchan Kim
  2017-05-15  7:41 ` [PATCH 2/2] zram: do not count duplicated pages as compressed Minchan Kim
@ 2017-05-16  1:11 ` Sergey Senozhatsky
  1 sibling, 0 replies; 15+ messages in thread
From: Sergey Senozhatsky @ 2017-05-16 1:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
      kernel-team

On (05/15/17 16:41), Minchan Kim wrote:
> Regardless of whether it is same page or not, it's surely write
> and stored to zram so we should increase pages_stored stat.
> Otherwise, user can see zero value via mm_stats although he
> writes a lot of pages to zram.
>
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss
end of thread, other threads:[~2017-05-21  7:15 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-05-15  7:41 [PATCH 1/2] zram: count same page write as page_stored Minchan Kim
2017-05-15  7:41 ` [PATCH 2/2] zram: do not count duplicated pages as compressed Minchan Kim
2017-05-16  1:30   ` Sergey Senozhatsky
2017-05-16  1:59     ` Minchan Kim
2017-05-16  2:36       ` Sergey Senozhatsky
2017-05-16  5:26         ` Minchan Kim
2017-05-16  5:45           ` Sergey Senozhatsky
2017-05-16  7:16             ` Minchan Kim
2017-05-16  7:36               ` Sergey Senozhatsky
2017-05-17  8:32                 ` Minchan Kim
2017-05-17  9:14                   ` Sergey Senozhatsky
2017-05-18  4:53                     ` Minchan Kim
2017-05-21  7:04                   ` Christoph Hellwig
2017-05-21  7:15                     ` Minchan Kim
2017-05-16  1:11 ` [PATCH 1/2] zram: count same page write as page_stored Sergey Senozhatsky