* [PATCH 1/2] zram: count same page write as page_stored
@ 2017-05-15  7:41 Minchan Kim
  2017-05-15  7:41 ` [PATCH 2/2] zram: do not count duplicated pages as compressed Minchan Kim
  2017-05-16  1:11 ` [PATCH 1/2] zram: count same page write as page_stored Sergey Senozhatsky
  0 siblings, 2 replies; 15+ messages in thread
From: Minchan Kim @ 2017-05-15  7:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team, Minchan Kim

Regardless of whether the page is a same-filled page or not, it is
still a write stored to zram, so we should increase the pages_stored
stat. Otherwise, a user can see a zero value via mm_stat even though
they have written many pages to zram.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 372602c7da49..b885356551e9 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -516,6 +516,7 @@ static bool zram_same_page_write(struct zram *zram, u32 index,
 		zram_slot_unlock(zram, index);
 
 		atomic64_inc(&zram->stats.same_pages);
+		atomic64_inc(&zram->stats.pages_stored);
 		return true;
 	}
 	kunmap_atomic(mem);
@@ -619,6 +620,7 @@ static void zram_free_page(struct zram *zram, size_t index)
 		zram_clear_flag(zram, index, ZRAM_SAME);
 		zram_set_element(zram, index, 0);
 		atomic64_dec(&zram->stats.same_pages);
+		atomic64_dec(&zram->stats.pages_stored);
 		return;
 	}
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 15+ messages in thread
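The invariant this patch enforces can be illustrated with a small userspace model (a hypothetical sketch using C11 atomics, not the actual driver code): a same-filled write bumps both same_pages and pages_stored, and the free path decrements both symmetrically.

```c
#include <stdatomic.h>

/* Hypothetical userspace model of the two zram counters this patch
 * touches; the real counters are atomic64_t fields in struct zram. */
struct zram_stats {
    atomic_llong same_pages;
    atomic_llong pages_stored;
};

/* Store path for a same-filled page: with this patch it now counts
 * toward pages_stored as well, so mm_stat reflects the write. */
void store_same_page(struct zram_stats *s)
{
    atomic_fetch_add(&s->same_pages, 1);
    atomic_fetch_add(&s->pages_stored, 1);
}

/* Free path mirrors the store path, keeping pages_stored balanced. */
void free_same_page(struct zram_stats *s)
{
    atomic_fetch_sub(&s->same_pages, 1);
    atomic_fetch_sub(&s->pages_stored, 1);
}
```

Reading pages_stored after a store/free pair returns to the starting value, which is the symmetry the two hunks above establish.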

* [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-15  7:41 [PATCH 1/2] zram: count same page write as page_stored Minchan Kim
@ 2017-05-15  7:41 ` Minchan Kim
  2017-05-16  1:30   ` Sergey Senozhatsky
  2017-05-16  1:11 ` [PATCH 1/2] zram: count same page write as page_stored Sergey Senozhatsky
  1 sibling, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2017-05-15  7:41 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, Joonsoo Kim, Sergey Senozhatsky, kernel-team, Minchan Kim

Compressed pages and deduplicated pages are not the same thing, so we
should not count duplicated pages as compressed pages.

Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_dedup.c |  4 ----
 drivers/block/zram/zram_drv.c   | 24 +++++++++++++++++++-----
 drivers/block/zram/zram_drv.h   |  1 +
 3 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/block/zram/zram_dedup.c b/drivers/block/zram/zram_dedup.c
index 14c4988f8ff7..c15848cc1b31 100644
--- a/drivers/block/zram/zram_dedup.c
+++ b/drivers/block/zram/zram_dedup.c
@@ -101,9 +101,6 @@ static unsigned long zram_dedup_put(struct zram *zram,
 	entry->refcount--;
 	if (!entry->refcount)
 		rb_erase(&entry->rb_node, &hash->rb_root);
-	else
-		atomic64_sub(entry->len, &zram->stats.dup_data_size);
-
 	spin_unlock(&hash->lock);
 
 	return entry->refcount;
@@ -127,7 +124,6 @@ static struct zram_entry *__zram_dedup_get(struct zram *zram,
 
 again:
 	entry->refcount++;
-	atomic64_add(entry->len, &zram->stats.dup_data_size);
 	spin_unlock(&hash->lock);
 
 	if (prev)
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index b885356551e9..8152e405117b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -624,15 +624,22 @@ static void zram_free_page(struct zram *zram, size_t index)
 		return;
 	}
 
+	if (zram_dedup_enabled(zram) &&
+			zram_test_flag(zram, index, ZRAM_DUP)) {
+		zram_clear_flag(zram, index, ZRAM_DUP);
+		atomic64_sub(entry->len, &zram->stats.dup_data_size);
+		goto out;
+	}
+
 	if (!entry)
 		return;
 
-	zram_entry_free(zram, entry);
-
 	atomic64_sub(zram_get_obj_size(zram, index),
 			&zram->stats.compr_data_size);
-	atomic64_dec(&zram->stats.pages_stored);
+out:
+	zram_entry_free(zram, entry);
 
+	atomic64_dec(&zram->stats.pages_stored);
 	zram_set_entry(zram, index, NULL);
 	zram_set_obj_size(zram, index, 0);
 }
@@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
 	entry = zram_dedup_find(zram, page, &checksum);
 	if (entry) {
 		comp_len = entry->len;
-		goto found_dup;
+		zram_slot_lock(zram, index);
+		zram_free_page(zram, index);
+		zram_set_flag(zram, index, ZRAM_DUP);
+		zram_set_entry(zram, index, entry);
+		zram_set_obj_size(zram, index, comp_len);
+		zram_slot_unlock(zram, index);
+		atomic64_add(comp_len, &zram->stats.dup_data_size);
+		atomic64_inc(&zram->stats.pages_stored);
+		return 0;
 	}
 
 	zstrm = zcomp_stream_get(zram->comp);
@@ -818,7 +833,6 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
 	zs_unmap_object(zram->mem_pool, zram_entry_handle(zram, entry));
 	zram_dedup_insert(zram, entry, checksum);
 
-found_dup:
 	/*
 	 * Free memory associated with this sector
 	 * before overwriting unused sectors.
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 0091e23873c1..8ccfdcd8f674 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -64,6 +64,7 @@ static const size_t max_zpage_size = PAGE_SIZE / 4 * 3;
 enum zram_pageflags {
 	/* Page consists entirely of zeros */
 	ZRAM_SAME = ZRAM_FLAG_SHIFT,
+	ZRAM_DUP,
 	ZRAM_ACCESS,	/* page is now accessed */
 
 	__NR_ZRAM_PAGEFLAGS,
-- 
2.7.4
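The accounting split this patch introduces can be modeled in a few lines of plain C (a hypothetical sketch, not the driver code): a deduplicated store charges dup_data_size, a fresh store charges compr_data_size, and both count as pages_stored.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the stats split: duplicated pages are no
 * longer counted as compressed data. */
struct zram_model_stats {
    long long pages_stored;
    long long compr_data_size;
    long long dup_data_size;
};

void model_store_page(struct zram_model_stats *st, size_t comp_len, bool is_dup)
{
    if (is_dup)
        st->dup_data_size += comp_len;    /* reused an existing entry */
    else
        st->compr_data_size += comp_len;  /* new compressed object */
    st->pages_stored++;                   /* every stored page counts */
}

void model_free_page(struct zram_model_stats *st, size_t comp_len, bool is_dup)
{
    if (is_dup)
        st->dup_data_size -= comp_len;
    else
        st->compr_data_size -= comp_len;
    st->pages_stored--;
}
```

Storing one fresh page and one duplicate then gives compr_data_size == dup_data_size while pages_stored counts both, mirroring the intent of the patch.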


* Re: [PATCH 1/2] zram: count same page write as page_stored
  2017-05-15  7:41 [PATCH 1/2] zram: count same page write as page_stored Minchan Kim
  2017-05-15  7:41 ` [PATCH 2/2] zram: do not count duplicated pages as compressed Minchan Kim
@ 2017-05-16  1:11 ` Sergey Senozhatsky
  1 sibling, 0 replies; 15+ messages in thread
From: Sergey Senozhatsky @ 2017-05-16  1:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
	kernel-team

On (05/15/17 16:41), Minchan Kim wrote:
> Regardless of whether the page is a same-filled page or not, it is
> still a write stored to zram, so we should increase the pages_stored
> stat. Otherwise, a user can see a zero value via mm_stat even though
> they have written many pages to zram.
> 
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-15  7:41 ` [PATCH 2/2] zram: do not count duplicated pages as compressed Minchan Kim
@ 2017-05-16  1:30   ` Sergey Senozhatsky
  2017-05-16  1:59     ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2017-05-16  1:30 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
	kernel-team

On (05/15/17 16:41), Minchan Kim wrote:
[..]
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index b885356551e9..8152e405117b 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -624,15 +624,22 @@ static void zram_free_page(struct zram *zram, size_t index)
>  		return;
>  	}
>  
> +	if (zram_dedup_enabled(zram) &&
> +			zram_test_flag(zram, index, ZRAM_DUP)) {
> +		zram_clear_flag(zram, index, ZRAM_DUP);
> +		atomic64_sub(entry->len, &zram->stats.dup_data_size);
> +		goto out;
> +	}

so that `goto' there is to just jump over ->stats.compr_data_size?
can you sub ->stats.compr_data_size before the `if' and avoid labels?

>  	if (!entry)
>  		return;

shouldn't this `if' be moved before `if (zram_dedup_enabled(zram)`?


[..]
> @@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
>  	entry = zram_dedup_find(zram, page, &checksum);
>  	if (entry) {
>  		comp_len = entry->len;
> -		goto found_dup;
> +		zram_slot_lock(zram, index);
> +		zram_free_page(zram, index);
> +		zram_set_flag(zram, index, ZRAM_DUP);
> +		zram_set_entry(zram, index, entry);
> +		zram_set_obj_size(zram, index, comp_len);
> +		zram_slot_unlock(zram, index);
> +		atomic64_add(comp_len, &zram->stats.dup_data_size);
> +		atomic64_inc(&zram->stats.pages_stored);
> +		return 0;

hm. that's a somewhat big code duplication. isn't it?

	-ss


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-16  1:30   ` Sergey Senozhatsky
@ 2017-05-16  1:59     ` Minchan Kim
  2017-05-16  2:36       ` Sergey Senozhatsky
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2017-05-16  1:59 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
	kernel-team

Hi Sergey,

On Tue, May 16, 2017 at 10:30:22AM +0900, Sergey Senozhatsky wrote:
> On (05/15/17 16:41), Minchan Kim wrote:
> [..]
> > diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> > index b885356551e9..8152e405117b 100644
> > --- a/drivers/block/zram/zram_drv.c
> > +++ b/drivers/block/zram/zram_drv.c
> > @@ -624,15 +624,22 @@ static void zram_free_page(struct zram *zram, size_t index)
> >  		return;
> >  	}
> >  
> > +	if (zram_dedup_enabled(zram) &&
> > +			zram_test_flag(zram, index, ZRAM_DUP)) {
> > +		zram_clear_flag(zram, index, ZRAM_DUP);
> > +		atomic64_sub(entry->len, &zram->stats.dup_data_size);
> > +		goto out;
> > +	}
> 
> so that `goto' there is to just jump over ->stats.compr_data_size?

Yub.

> can you sub ->stats.compr_data_size before the `if' and avoid labels?


> 
> >  	if (!entry)
> >  		return;
> 
> shouldn't this `if' be moved before `if (zram_dedup_enabled(zram)`?

You mean this?

static void zram_free_page(..) {
        if (zram_test_flag(zram, index, ZRAM_SAME))
                ...

        if (!entry)
                return;

        if (zram_dedup_enabled(zram) && xxxx)) {
                zram_clear_flag(ZRAM_DUP);
                atomic64_sub(entry->len, &zram->stats.dup_data_size);
        } else {
                atomic64_sub(zram_get_obj_size(zram, index),
                                        &zram->stats.compr_data_size);
        }

        zram_entry_free
        zram_set_entry
        zram_set_obj_size
}
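Filled out with stubbed state, the sketch above would look roughly like this (a hypothetical model, with plain fields standing in for the real flags, entry, and atomic counters):

```c
#include <stdbool.h>
#include <stddef.h>

enum { MODEL_SAME = 1, MODEL_DUP = 2 };  /* stand-ins for ZRAM_SAME/ZRAM_DUP */

struct model_entry { size_t len; };

struct model_slot {
    int flags;
    struct model_entry *entry;
    size_t obj_size;
};

struct model_stats {
    long long same_pages, pages_stored;
    long long dup_data_size, compr_data_size;
};

void model_free_slot(struct model_slot *sl, struct model_stats *st,
                     bool dedup_enabled)
{
    if (sl->flags & MODEL_SAME) {
        sl->flags &= ~MODEL_SAME;
        st->same_pages--;
        st->pages_stored--;
        return;
    }

    if (!sl->entry)
        return;

    /* One branch per accounting class; no goto label required. */
    if (dedup_enabled && (sl->flags & MODEL_DUP)) {
        sl->flags &= ~MODEL_DUP;
        st->dup_data_size -= sl->entry->len;
    } else {
        st->compr_data_size -= sl->obj_size;
    }

    st->pages_stored--;
    sl->entry = NULL;   /* stands in for zram_entry_free()/zram_set_entry() */
    sl->obj_size = 0;   /* zram_set_obj_size(zram, index, 0) */
}
```

The point of the restructure is visible here: the !entry check happens before the dedup test, and each branch adjusts exactly one size counter.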

> 
> 
> [..]
> > @@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
> >  	entry = zram_dedup_find(zram, page, &checksum);
> >  	if (entry) {
> >  		comp_len = entry->len;
> > -		goto found_dup;
> > +		zram_slot_lock(zram, index);
> > +		zram_free_page(zram, index);
> > +		zram_set_flag(zram, index, ZRAM_DUP);
> > +		zram_set_entry(zram, index, entry);
> > +		zram_set_obj_size(zram, index, comp_len);
> > +		zram_slot_unlock(zram, index);
> > +		atomic64_add(comp_len, &zram->stats.dup_data_size);
> > +		atomic64_inc(&zram->stats.pages_stored);
> > +		return 0;
> 
> hm. that's a somewhat big code duplication. isn't it?

Yub. Three parts: the part above, zram_same_page_write, and the tail of __zram_bvec_write.

Do you have any ideas? Feel free to suggest. :)
Thanks.


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-16  1:59     ` Minchan Kim
@ 2017-05-16  2:36       ` Sergey Senozhatsky
  2017-05-16  5:26         ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2017-05-16  2:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
	Sergey Senozhatsky, kernel-team

Hello Minchan,

On (05/16/17 10:59), Minchan Kim wrote:
> Hi Sergey,
>

[..]
> You mean this?
> 
> static void zram_free_page(..) {
>         if (zram_test_flag(zram, index, ZRAM_SAME))
>                 ...
> 
>         if (!entry)
>                 return;
> 
>         if (zram_dedup_enabled(zram) && xxxx)) {
>                 zram_clear_flag(ZRAM_DUP);
>                 atomic64_sub(entry->len, &zram->stats.dup_data_size);
>         } else {
>                 atomic64_sub(zram_get_obj_size(zram, index),
>                                         &zram->stats.compr_dat_size);
>         }
> 
>         zram_entry_free
>         zram_set_entry
>         zram_set_obj_size
> }

yeah, something like this.

> > > @@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
> > >  	entry = zram_dedup_find(zram, page, &checksum);
> > >  	if (entry) {
> > >  		comp_len = entry->len;
> > > -		goto found_dup;
> > > +		zram_slot_lock(zram, index);
> > > +		zram_free_page(zram, index);
> > > +		zram_set_flag(zram, index, ZRAM_DUP);
> > > +		zram_set_entry(zram, index, entry);
> > > +		zram_set_obj_size(zram, index, comp_len);
> > > +		zram_slot_unlock(zram, index);
> > > +		atomic64_add(comp_len, &zram->stats.dup_data_size);
> > > +		atomic64_inc(&zram->stats.pages_stored);
> > > +		return 0;
> > 
> > hm. that's a somewhat big code duplication. isn't it?
> 
> Yub. 3 parts. above part,  zram_same_page_write and tail of __zram_bvec_write.

hmm... good question... I can hardly think of anything significantly
better; zram object handling is now a mix of flags, entries,
ref_counters, etc. maybe we can merge some of those ops if we
keep the slot locked through the entire __zram_bvec_write(), but
that does not look attractive.

something ABSOLUTELY untested and incomplete. not even compile tested (!).
99% broken and stupid (!). but it has revealed one thing, which is why
I am posting it anyway. see below:

---

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 372602c7da49..b31543c40d54 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -509,11 +509,8 @@ static bool zram_same_page_write(struct zram *zram, u32 index,
        if (page_same_filled(mem, &element)) {
                kunmap_atomic(mem);
                /* Free memory associated with this sector now. */
-               zram_slot_lock(zram, index);
-               zram_free_page(zram, index);
                zram_set_flag(zram, index, ZRAM_SAME);
                zram_set_element(zram, index, element);
-               zram_slot_unlock(zram, index);
 
                atomic64_inc(&zram->stats.same_pages);
                return true;
@@ -778,7 +775,7 @@ static int zram_compress(struct zram *zram, struct zcomp_strm **zstrm,
 
 static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
 {
-       int ret;
+       int ret = 0;
        struct zram_entry *entry;
        unsigned int comp_len;
        void *src, *dst;
@@ -786,12 +783,20 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
        struct page *page = bvec->bv_page;
        u32 checksum;
 
+       /*
+        * Free memory associated with this sector
+        * before overwriting unused sectors.
+        */
+       zram_slot_lock(zram, index);
+       zram_free_page(zram, index);
+
        if (zram_same_page_write(zram, index, page))
-               return 0;
+               goto out_unlock;
 
        entry = zram_dedup_find(zram, page, &checksum);
        if (entry) {
                comp_len = entry->len;
+               zram_set_flag(zram, index, ZRAM_DUP);
                goto found_dup;
        }
 
@@ -799,7 +804,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
        ret = zram_compress(zram, &zstrm, page, &entry, &comp_len);
        if (ret) {
                zcomp_stream_put(zram->comp);
-               return ret;
+               goto out_unlock;
        }
 
        dst = zs_map_object(zram->mem_pool,
@@ -817,20 +822,16 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
        zram_dedup_insert(zram, entry, checksum);
 
 found_dup:
-       /*
-        * Free memory associated with this sector
-        * before overwriting unused sectors.
-        */
-       zram_slot_lock(zram, index);
-       zram_free_page(zram, index);
        zram_set_entry(zram, index, entry);
        zram_set_obj_size(zram, index, comp_len);
-       zram_slot_unlock(zram, index);
 
        /* Update stats */
        atomic64_add(comp_len, &zram->stats.compr_data_size);
        atomic64_inc(&zram->stats.pages_stored);
-       return 0;
+
+out_unlock:
+       zram_slot_unlock(zram, index);
+       return ret;
 }

---


namely,
the zram_compress() error return path from __zram_bvec_write().

currently, we touch the existing compressed object and overwrite it only
when we successfully compress a new object. when zram_compress() fails
we propagate the error, but never touch the old object. so all reads that
hit that index later will read stale data. it would probably make more
sense to fail those reads as well; IOW, to free the old page regardless
of zram_compress() progress.

what do you think?

	-ss


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-16  2:36       ` Sergey Senozhatsky
@ 2017-05-16  5:26         ` Minchan Kim
  2017-05-16  5:45           ` Sergey Senozhatsky
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2017-05-16  5:26 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
	kernel-team

On Tue, May 16, 2017 at 11:36:15AM +0900, Sergey Senozhatsky wrote:
> > > > @@ -794,7 +801,15 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
> > > >  	entry = zram_dedup_find(zram, page, &checksum);
> > > >  	if (entry) {
> > > >  		comp_len = entry->len;
> > > > -		goto found_dup;
> > > > +		zram_slot_lock(zram, index);
> > > > +		zram_free_page(zram, index);
> > > > +		zram_set_flag(zram, index, ZRAM_DUP);
> > > > +		zram_set_entry(zram, index, entry);
> > > > +		zram_set_obj_size(zram, index, comp_len);
> > > > +		zram_slot_unlock(zram, index);
> > > > +		atomic64_add(comp_len, &zram->stats.dup_data_size);
> > > > +		atomic64_inc(&zram->stats.pages_stored);
> > > > +		return 0;
> > > 
> > > hm. that's a somewhat big code duplication. isn't it?
> > 
> > Yub. 3 parts. above part,  zram_same_page_write and tail of __zram_bvec_write.
> 
> hmm... good question... hardly can think of anything significantly
> better, zram object handling is now a mix of flags, entries,
> ref_counters, etc. etc. may be we can merge some of those ops, if we
> would keep slot locked through the entire __zram_bvec_write(), but
> that does not look attractive.
> 
> something ABSOLUTELY untested and incomplete. not even compile tested (!).
> 99% broken and stupid (!). but there is one thing that it has revealed, so
> thus incomplete. see below:
> 
> ---
> 
> diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
> index 372602c7da49..b31543c40d54 100644
> --- a/drivers/block/zram/zram_drv.c
> +++ b/drivers/block/zram/zram_drv.c
> @@ -509,11 +509,8 @@ static bool zram_same_page_write(struct zram *zram, u32 index,
>         if (page_same_filled(mem, &element)) {
>                 kunmap_atomic(mem);
>                 /* Free memory associated with this sector now. */
> -               zram_slot_lock(zram, index);
> -               zram_free_page(zram, index);
>                 zram_set_flag(zram, index, ZRAM_SAME);
>                 zram_set_element(zram, index, element);
> -               zram_slot_unlock(zram, index);
>  
>                 atomic64_inc(&zram->stats.same_pages);
>                 return true;
> @@ -778,7 +775,7 @@ static int zram_compress(struct zram *zram, struct zcomp_strm **zstrm,
>  
>  static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
>  {
> -       int ret;
> +       int ret = 0;
>         struct zram_entry *entry;
>         unsigned int comp_len;
>         void *src, *dst;
> @@ -786,12 +783,20 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
>         struct page *page = bvec->bv_page;
>         u32 checksum;
>  
> +       /*
> +        * Free memory associated with this sector
> +        * before overwriting unused sectors.
> +        */
> +       zram_slot_lock(zram, index);
> +       zram_free_page(zram, index);

Hmm, zram_free should happen only if the write is done successfully.
Otherwise, we lose the valid data if the write IO fails.

> +
>         if (zram_same_page_write(zram, index, page))
> -               return 0;
> +               goto out_unlock;
>  
>         entry = zram_dedup_find(zram, page, &checksum);
>         if (entry) {
>                 comp_len = entry->len;
> +               zram_set_flag(zram, index, ZRAM_DUP);

In case of hitting dedup, we shouldn't increase compr_data_size.
If we fix the above two problems, do you think it's still cleaner?
(I don't mean to be dismissive of your suggestion; this is a genuine
 question to hear your thoughts. :)


>                 goto found_dup;
>         }
>  
> @@ -799,7 +804,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
>         ret = zram_compress(zram, &zstrm, page, &entry, &comp_len);
>         if (ret) {
>                 zcomp_stream_put(zram->comp);
> -               return ret;
> +               goto out_unlock;
>         }
>  
>         dst = zs_map_object(zram->mem_pool,
> @@ -817,20 +822,16 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, u32 index)
>         zram_dedup_insert(zram, entry, checksum);
>  
>  found_dup:
> -       /*
> -        * Free memory associated with this sector
> -        * before overwriting unused sectors.
> -        */
> -       zram_slot_lock(zram, index);
> -       zram_free_page(zram, index);
>         zram_set_entry(zram, index, entry);
>         zram_set_obj_size(zram, index, comp_len);
> -       zram_slot_unlock(zram, index);
>  
>         /* Update stats */
>         atomic64_add(comp_len, &zram->stats.compr_data_size);
>         atomic64_inc(&zram->stats.pages_stored);
> -       return 0;
> +
> +out_unlock:
> +       zram_slot_unlock(zram, index);
> +       return ret;
>  }
> 
> ---
> 
> 
> namely,
> that zram_compress() error return path from __zram_bvec_write().
> 
> currently, we touch the existing compressed object and overwrite it only
> when we successfully compressed a new object. when zram_compress() fails
> we propagate the error, but never touch the old object. so all reads that
> could hit that index later will read stale data. and probably it would
> make more sense to fail those reads as well; IOW to free the old page
> regardless zram_compress() progress.
> 
> what do you think?
> 
> 	-ss


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-16  5:26         ` Minchan Kim
@ 2017-05-16  5:45           ` Sergey Senozhatsky
  2017-05-16  7:16             ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2017-05-16  5:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
	Sergey Senozhatsky, kernel-team

On (05/16/17 14:26), Minchan Kim wrote:
[..]
> > +       /*
> > +        * Free memory associated with this sector
> > +        * before overwriting unused sectors.
> > +        */
> > +       zram_slot_lock(zram, index);
> > +       zram_free_page(zram, index);
> 
> Hmm, zram_free should happen only if the write is done successfully.
> Otherwise, we lose the valid data although write IO was fail.

but would this be correct? the data is not valid - we failed to store
the valid one. but instead we assure the application that the
read()/swapin/etc., depending on the usage scenario, is successful
(even though the data is not what the application really expects to
see); the application tries to use the data from that page and probably
crashes (dunno, for example the page contained hash tables with pointers
that are no longer valid, etc.).

I'm not optimistic about stale data reads; it will basically look like
data corruption to the application.

> > +
> >         if (zram_same_page_write(zram, index, page))
> > -               return 0;
> > +               goto out_unlock;
> >  
> >         entry = zram_dedup_find(zram, page, &checksum);
> >         if (entry) {
> >                 comp_len = entry->len;
> > +               zram_set_flag(zram, index, ZRAM_DUP);
>
> In case of hitting dedup, we shouldn't increase compr_data_size.

no, we should not. you are right. my "... patch" is incomplete and
wrong. please don't pay too much attention to it.


> If we fix above two problems, do you think it's still cleaner?
> (I don't mean to be reluctant with your suggestion. Just a
>  real question to know your thought.:)

do you mean code duplication and stale data read?

I'd probably prefer to address stale data reads separately,
but it seems that the stale reads fix will re-do parts of your
0002 patch and, at least potentially, reduce code duplication.

so we can go with your 0002, and then the stale reads fix will try
to reduce code duplication (unless we want to have 4 places doing
the same thing :) )

	-ss


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-16  5:45           ` Sergey Senozhatsky
@ 2017-05-16  7:16             ` Minchan Kim
  2017-05-16  7:36               ` Sergey Senozhatsky
  0 siblings, 1 reply; 15+ messages in thread
From: Minchan Kim @ 2017-05-16  7:16 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
	kernel-team

On Tue, May 16, 2017 at 02:45:33PM +0900, Sergey Senozhatsky wrote:
> On (05/16/17 14:26), Minchan Kim wrote:
> [..]
> > > +       /*
> > > +        * Free memory associated with this sector
> > > +        * before overwriting unused sectors.
> > > +        */
> > > +       zram_slot_lock(zram, index);
> > > +       zram_free_page(zram, index);
> > 
> > Hmm, zram_free should happen only if the write is done successfully.
> > Otherwise, we lose the valid data although write IO was fail.
> 
> but would this be correct? the data is not valid - we failed to store
> the valid one. but instead we assure application that read()/swapin/etc.,
> depending on the usage scenario, is successful (even though the data is
> not what application really expects to see), application tries to use the
> data from that page and probably crashes (dunno, for example page contained
> hash tables with pointers that are not valid anymore, etc. etc.).
> 
> I'm not optimistic about stale data reads; it basically will look like
> data corruption to the application.

Hmm, I don't understand what you are saying.
My point is that zram_free_page should be done only if the whole write
operation is successful.
With your change, the following situation can happen:

write block 4, 'all A' -> success
read  block 4, 'all A' verified -> Good
write block 4, 'all B' -> but failed with ENOMEM
read  block 4  expected 'all A' but 'all 0' -> Oops

That is the problem I pointed out.
If I am missing something, could you elaborate a bit?

Thanks!

> 
> > > +
> > >         if (zram_same_page_write(zram, index, page))
> > > -               return 0;
> > > +               goto out_unlock;
> > >  
> > >         entry = zram_dedup_find(zram, page, &checksum);
> > >         if (entry) {
> > >                 comp_len = entry->len;
> > > +               zram_set_flag(zram, index, ZRAM_DUP);
> >
> > In case of hitting dedup, we shouldn't increase compr_data_size.
> 
> no, we should not. you are right. my "... patch" is incomplete and
> wrong. please don't pay too much attention to it.
> 
> 
> > If we fix above two problems, do you think it's still cleaner?
> > (I don't mean to be reluctant with your suggestion. Just a
> >  real question to know your thought.:)
> 
> do you mean code duplication and stale data read?
> 
> I'd probably prefer to address stale data reads separately.
> but it seems that stale reads fix will re-do parts of your
> 0002 patch and, at least potentially, reduce code duplication.
> 
> so we can go with your 0002 and then stale reads fix will try
> to reduce code duplication (unless we want to have 4 places doing
> the same thing :) )
> 
> 	-ss


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-16  7:16             ` Minchan Kim
@ 2017-05-16  7:36               ` Sergey Senozhatsky
  2017-05-17  8:32                 ` Minchan Kim
  0 siblings, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2017-05-16  7:36 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
	Sergey Senozhatsky, kernel-team

On (05/16/17 16:16), Minchan Kim wrote:
> > but would this be correct? the data is not valid - we failed to store
> > the valid one. but instead we assure application that read()/swapin/etc.,
> > depending on the usage scenario, is successful (even though the data is
> > not what application really expects to see), application tries to use the
> > data from that page and probably crashes (dunno, for example page contained
> > hash tables with pointers that are not valid anymore, etc. etc.).
> > 
> > I'm not optimistic about stale data reads; it basically will look like
> > data corruption to the application.
> 
> Hmm, I don't understand what you say.
> My point is zram_free_page should be done only if whoe write operation
> is successful.
> With you change, following situation can happens.
> 
> write block 4, 'all A' -> success
> read  block 4, 'all A' verified -> Good
> write block 4, 'all B' -> but failed with ENOMEM
> read  block 4  expected 'all A' but 'all 0' -> Oops

yes. 'all A' in #4 can be incorrect. zram can be used as a block device
with a file system, and the pid that does the write op does not
necessarily do the read op later; it can be a completely different
application, e.g. a compilation, or anything else.

suppose PID A does

wr block 1       all a
wr block 2       all a + 1
wr block 3       all a + 2
wr block 4       all a + 3

now PID A does

wr block 1       all m
wr block 2       all m + 1
wr block 3       all m + 2
wr block 4       failed. block still has 'all a + 3'.
exit

another application, PID C, reads in the file and tries to do
something sane with it

rd block 1       all m
rd block 2       all m + 1
rd block 3       all m + 2
rd block 4       all a + 3      << this is dangerous. we should return
                                   error from read() here; not stale data.


what we can return now is `partially updated' data, with some new
and some stale pages. this is quite unlikely to end up anywhere good.
am I wrong?

why does `rd block 4' in your case cause an Oops? as a worst case scenario?
the application does not expect the page to be 'all A' at this point. pages
are likely to belong to some mappings/files/etc., and there is likely a data
dependency between them (dunno, C++ objects that span pages, JPEG images,
etc.), so returning "new data   new data   stale data" is a bit fishy.
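The two failure semantics under discussion can be contrasted in a toy single-block model (hypothetical, not driver code): keeping the old object on a failed overwrite makes the next read return stale data, while freeing it up front makes the next read fail with an error.

```c
#include <stdbool.h>
#include <string.h>

/* Toy single-block store contrasting the two failure policies. */
struct toy_blk {
    char data[8];
    bool valid;
};

/* Current behavior: commit only on success, so old data survives a
 * failed write and later reads return it (possibly stale). */
int write_keep_old(struct toy_blk *b, const char *src, bool io_ok)
{
    if (!io_ok)
        return -1;
    strncpy(b->data, src, sizeof(b->data) - 1);
    b->data[sizeof(b->data) - 1] = '\0';
    b->valid = true;
    return 0;
}

/* Proposed behavior: free first (models zram_free_page() before
 * compression), so a failed write poisons the block instead. */
int write_free_first(struct toy_blk *b, const char *src, bool io_ok)
{
    b->valid = false;
    if (!io_ok)
        return -1;
    strncpy(b->data, src, sizeof(b->data) - 1);
    b->data[sizeof(b->data) - 1] = '\0';
    b->valid = true;
    return 0;
}

int read_blk(const struct toy_blk *b, char *dst, size_t n)
{
    if (!b->valid)
        return -1;
    strncpy(dst, b->data, n);
    return 0;
}
```

Neither policy returns the data the application just tried to write; the question above is whether the follow-up read should see the old contents or an I/O error.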

	-ss


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-16  7:36               ` Sergey Senozhatsky
@ 2017-05-17  8:32                 ` Minchan Kim
  2017-05-17  9:14                   ` Sergey Senozhatsky
  2017-05-21  7:04                   ` Christoph Hellwig
  0 siblings, 2 replies; 15+ messages in thread
From: Minchan Kim @ 2017-05-17  8:32 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
	kernel-team, linux-fsdevel, linux-block, hch, viro, axboe,
	jlayton, tytso

Hi Sergey,

On Tue, May 16, 2017 at 04:36:17PM +0900, Sergey Senozhatsky wrote:
> On (05/16/17 16:16), Minchan Kim wrote:
> > > but would this be correct? the data is not valid - we failed to store
> > > the valid one. but instead we assure application that read()/swapin/etc.,
> > > depending on the usage scenario, is successful (even though the data is
> > > not what application really expects to see), application tries to use the
> > > data from that page and probably crashes (dunno, for example page contained
> > > hash tables with pointers that are not valid anymore, etc. etc.).
> > > 
> > > I'm not optimistic about stale data reads; it basically will look like
> > > data corruption to the application.
> > 
> > Hmm, I don't understand what you are saying.
> > My point is that zram_free_page should be done only if the whole
> > write operation is successful.
> > With your change, the following situation can happen.
> > 
> > write block 4, 'all A' -> success
> > read  block 4, 'all A' verified -> Good
> > write block 4, 'all B' -> but failed with ENOMEM
> > read  block 4  expected 'all A' but 'all 0' -> Oops
> 
> yes. 'all A' in #4 can be incorrect. zram can be used as a block device
> with a file system, and pid that does write op not necessarily does read
> op later. it can be a completely different application. e.g. compilation,
> or anything else.
> 
> suppose PID A does
> 
> wr block 1       all a
> wr block 2       all a + 1
> wr block 3       all a + 2
> wr block 4       all a + 3
> 
> now PID A does
> 
> wr block 1       all m
> wr block 2       all m + 1
> wr block 3       all m + 2
> wr block 4       failed. block still has 'all a + 3'.
> exit
> 
> another application, PID C, reads in the file and tries to do
> something sane with it
> 
> rd block 1       all m
> rd block 2       all m + 1
> rd block 3       all m + 2
> rd block 4       all a + 3      << this is dangerous. we should return
>                                    error from read() here; not stale data.
> 
> 
> what we can return now is a `partially updated' data, with some new
> and some stale pages. this is quite unlikely to end up anywhere good.
> am I wrong?
> 
> why does `rd block 4' in your case causes Oops? as a worst case scenario?
> application does not expect page to be 'all A' at this point. pages are
> likely to belong to some mappings/files/etc., and there is likely a data
> dependency between them, dunno C++ objects that span across pages or
> JPEG images, etc. so returning "new data   new data   stale data" is a bit
> fishy.

I thought about it more and started to get confused. :/
So, let's Cc linux-block, fs peoples.

The question is:

Is it okay for a block device (especially zram, which is a compressed
RAM block device) to return garbage when an ongoing overwrite IO fails?

O_DIRECT write 4 block "aaa.." -> success
read  4 block "aaa.." -> success
O_DIRECT write 4 block "bbb.." -> fail
read  4 block "000..' -> it is okay?

Hope to get an answer from the experts. :)


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-17  8:32                 ` Minchan Kim
@ 2017-05-17  9:14                   ` Sergey Senozhatsky
  2017-05-18  4:53                     ` Minchan Kim
  2017-05-21  7:04                   ` Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Sergey Senozhatsky @ 2017-05-17  9:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
	Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block, hch,
	viro, axboe, jlayton, tytso

Hello Minchan,

On (05/17/17 17:32), Minchan Kim wrote:
[..]
> > what we can return now is a `partially updated' data, with some new
> > and some stale pages. this is quite unlikely to end up anywhere good.
> > am I wrong?
> > 
> > why does `rd block 4' in your case causes Oops? as a worst case scenario?
> > application does not expect page to be 'all A' at this point. pages are
> > likely to belong to some mappings/files/etc., and there is likely a data
> > dependency between them, dunno C++ objects that span across pages or
> > JPEG images, etc. so returning "new data   new data   stale data" is a bit
> > fishy.
> 
> I thought more about it and start to confuse. :/

sorry, I'm not sure I see what the source of your confusion is :)

my point is - we should not let READ succeed if we know that WRITE
failed. assume JPEG image example,


over-write block 1 aaa->xxx OK
over-write block 2 bbb->yyy OK
over-write block 3 ccc->zzz error

reading that JPEG file

read block 1 xxx OK
read block 2 yyy OK
read block 3 ccc OK   << we should not return OK here. because
                         "xxxyyyccc" is not the correct JPEG file
                         anyway.

do you agree that telling the application that read() succeeded and
at the same time returning corrupted "xxxyyyccc" instead of "xxxyyyzzz"
is not correct?



so how about this,

- if we fail to compress page (S/W or H/W compressor error, depending
  on particular setup) let's store it uncompressed (page_size-d zspool
  object).

?

this should do the trick. at least we will have correct data:
	xxx  - compressed
	yyy  - compressed
	zzz  - uncompressed, because compressing back-end returned an error.

thoughts?

	-ss


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-17  9:14                   ` Sergey Senozhatsky
@ 2017-05-18  4:53                     ` Minchan Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2017-05-18  4:53 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, linux-kernel, Joonsoo Kim, Sergey Senozhatsky,
	kernel-team, linux-fsdevel, linux-block, hch, viro, axboe,
	jlayton, tytso

Hi Sergey,

On Wed, May 17, 2017 at 06:14:23PM +0900, Sergey Senozhatsky wrote:
> Hello Minchan,
> 
> On (05/17/17 17:32), Minchan Kim wrote:
> [..]
> > > what we can return now is a `partially updated' data, with some new
> > > and some stale pages. this is quite unlikely to end up anywhere good.
> > > am I wrong?
> > > 
> > > why does `rd block 4' in your case causes Oops? as a worst case scenario?
> > > application does not expect page to be 'all A' at this point. pages are
> > > likely to belong to some mappings/files/etc., and there is likely a data
> > > dependency between them, dunno C++ objects that span across pages or
> > > JPEG images, etc. so returning "new data   new data   stale data" is a bit
> > > fishy.
> > 
> > I thought more about it and start to confuse. :/
> 
> sorry, I'm not sure I see what's the source of your confusion :)
> 
> my point is - we should not let READ succeed if we know that WRITE
> failed. assume JPEG image example,

I don't think we should do it. I will write down my thoughts below. :)

> 
> 
> over-write block 1 aaa->xxx OK
> over-write block 2 bbb->yyy OK
> over-write block 3 ccc->zzz error
> 
> reading that JPEG file
> 
> read block 1 xxx OK
> read block 2 yyy OK
> read block 3 ccc OK   << we should not return OK here. because
>                          "xxxyyyccc" is not the correct JPEG file
>                          anyway.
> 
> do you agree that telling application that read() succeeded and at
> the same returning corrupted "xxxyyyccc" instead of "xxxyyyzzz" is
> not correct?

I don't agree. I *think* a block device is just a dumb device, so
zram doesn't need to know about any objects from the upper layer.
What zram should consider is basically the read/write success or
failure of an IO unit (maybe a BIO).

So if we assume each step in the above example is a bio unit, I think
there is no problem in returning "xxxyyyccc".

What I meant by "started to get confused" was about atomicity, not
the above.

I think it's okay to return ccc instead of zzz, but is it okay for
zram to return "000", i.e. neither "ccc" nor "zzz"?
My conclusion, after a discussion with one of my FS friends, is
that it's okay.

Let's think about it.

The FS requests a write of "aaa" to block 4 and it fails for some
reason (an H/W failure, or an S/W failure like ENOMEM). The interface
for catching the failure is the function registered via bio_endio,
which normally records the error with AS_EIO via mapping_set_error,
as well as the PG_error flag of the page. In this case, the FS
assumes block 4 can contain stale data, neither 'zzz' nor 'ccc',
because the device may have broken in the middle of writing data to
the block. If the block device doesn't support atomic writes (which
I guess is the more common case), it is safe to consider the block
to contain garbage now rather than either the old or the new value.
(I hope I explained my thought well :/)

Having said that, I think everyone would like block devices to
support atomicity (i.e., return either the old or the new value), so
I am reluctant to change the behavior for a simple refactoring.

Thanks.


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-17  8:32                 ` Minchan Kim
  2017-05-17  9:14                   ` Sergey Senozhatsky
@ 2017-05-21  7:04                   ` Christoph Hellwig
  2017-05-21  7:15                     ` Minchan Kim
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2017-05-21  7:04 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
	Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block, hch,
	viro, axboe, jlayton, tytso

On Wed, May 17, 2017 at 05:32:12PM +0900, Minchan Kim wrote:
> Is block device(esp, zram which is compressed ram block device) okay to
> return garbage when ongoing overwrite IO fails?
> 
> O_DIRECT write 4 block "aaa.." -> success
> read  4 block "aaa.." -> success
> O_DIRECT write 4 block "bbb.." -> fail
> read  4 block "000..' -> it is okay?
> 
> Hope to get an answer form experts. :)

It's "okay" as it's what existing real block devices do (at least on a
sector boundary).  It's not "nice" though, so if you can avoid it,
please do.


* Re: [PATCH 2/2] zram: do not count duplicated pages as compressed
  2017-05-21  7:04                   ` Christoph Hellwig
@ 2017-05-21  7:15                     ` Minchan Kim
  0 siblings, 0 replies; 15+ messages in thread
From: Minchan Kim @ 2017-05-21  7:15 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sergey Senozhatsky, Andrew Morton, linux-kernel, Joonsoo Kim,
	Sergey Senozhatsky, kernel-team, linux-fsdevel, linux-block,
	viro, axboe, jlayton, tytso

On Sun, May 21, 2017 at 12:04:27AM -0700, Christoph Hellwig wrote:
> On Wed, May 17, 2017 at 05:32:12PM +0900, Minchan Kim wrote:
> > Is block device(esp, zram which is compressed ram block device) okay to
> > return garbage when ongoing overwrite IO fails?
> > 
> > O_DIRECT write 4 block "aaa.." -> success
> > read  4 block "aaa.." -> success
> > O_DIRECT write 4 block "bbb.." -> fail
> > read  4 block "000..' -> it is okay?
> > 
> > Hope to get an answer form experts. :)
> 
> It's "okay" as it's what existing real block devices do (at least on a
> sector boundary).  It's not "nice" though, so if you can avoid it,
> please do.

That was my understanding, so I wanted to avoid it for just a simple
code refactoring. Your comment helps to confirm that thought.

Thanks, Christoph!

