* [PATCH v5 0/4] zram memory tracking
@ 2018-04-16  9:09 Minchan Kim
  2018-04-16  9:09 ` [PATCH v5 1/4] zram: correct flag name of ZRAM_ACCESS Minchan Kim
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Minchan Kim @ 2018-04-16  9:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim

zram as swap is useful for small-memory devices. However, being swap
means the pages on zram are mostly cold due to the VM's LRU algorithm.
In particular, once an application's init data has been touched during
launch, it tends never to be accessed again and is finally swapped out.
zram can keep such cold pages in compressed form, but it is pointless
to keep them in memory at all. Likewise, it is pointless to store
incompressible pages in zram, so a better idea is for app developers
to manage such memory directly, e.g. with free or mlock, rather than
leaving it on the heap.

This patch provides a debugfs file, /sys/kernel/debug/zram/zram0/block_state,
that represents each block's state so an admin can investigate which
memory is cold, incompressible, or same-element-filled with the help of
pagemap once the pages are swapped out.


The output is as follows,
      300    75.033841 .wh
      301    63.806904 s..
      302    63.806919 ..h

The first column is zram's block index and the third one encodes the
block state as symbols (s: same-element page, w: page written to the
backing store, h: huge page). The second column is the time, in seconds
with microsecond resolution, at which the block was last accessed. So
the example above says the 300th block was last accessed at 75.033841
seconds and it was huge, so it was written to the backing store.

* from v4:
  * Fix typos - Randy
  * Add reviewed-by from Sergey

* from v3:
  * use 'depends on DEBUG_FS' instead of selecting it - Greg KH
  * Add acked-by from Greg
  * Fix null ptr access at module unload - Sergey
  * Fix warning from copy_to_user - Sergey

* From v2:
  * debugfs and Kconfig cleanup - Greg KH
  * Remove unnecessary buffer - Sergey
  * Change timestamp from sec to usec

* From v1:
  * Do not propagate error number for debugfs fail - Greg KH
  * Add writeback and hugepage information - Sergey

Minchan Kim (4):
  zram: correct flag name of ZRAM_ACCESS
  zram: mark incompressible page as ZRAM_HUGE
  zram: record accessed second
  zram: introduce zram memory tracking

 Documentation/blockdev/zram.txt |  25 +++++
 drivers/block/zram/Kconfig      |  14 ++-
 drivers/block/zram/zram_drv.c   | 173 +++++++++++++++++++++++++++++---
 drivers/block/zram/zram_drv.h   |  14 ++-
 4 files changed, 207 insertions(+), 19 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog


* [PATCH v5 1/4] zram: correct flag name of ZRAM_ACCESS
  2018-04-16  9:09 [PATCH v5 0/4] zram memory tracking Minchan Kim
@ 2018-04-16  9:09 ` Minchan Kim
  2018-04-16  9:09 ` [PATCH v5 2/4] zram: mark incompressible page as ZRAM_HUGE Minchan Kim
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2018-04-16  9:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim, Sergey Senozhatsky

ZRAM_ACCESS is used to lock a zram slot, so rename it accordingly.
It is also not a common flag indicating the status of a block, so
move its declaration to the top of the flag enum.
Lastly, move the lock helpers to the top of the source file so they
can be used without forward declarations.

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 20 ++++++++++----------
 drivers/block/zram/zram_drv.h |  6 +++---
 2 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 0f3fadd71230..18dadeab775b 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -52,6 +52,16 @@ static size_t huge_class_size;
 
 static void zram_free_page(struct zram *zram, size_t index);
 
+static void zram_slot_lock(struct zram *zram, u32 index)
+{
+	bit_spin_lock(ZRAM_LOCK, &zram->table[index].value);
+}
+
+static void zram_slot_unlock(struct zram *zram, u32 index)
+{
+	bit_spin_unlock(ZRAM_LOCK, &zram->table[index].value);
+}
+
 static inline bool init_done(struct zram *zram)
 {
 	return zram->disksize;
@@ -753,16 +763,6 @@ static DEVICE_ATTR_RO(io_stat);
 static DEVICE_ATTR_RO(mm_stat);
 static DEVICE_ATTR_RO(debug_stat);
 
-static void zram_slot_lock(struct zram *zram, u32 index)
-{
-	bit_spin_lock(ZRAM_ACCESS, &zram->table[index].value);
-}
-
-static void zram_slot_unlock(struct zram *zram, u32 index)
-{
-	bit_spin_unlock(ZRAM_ACCESS, &zram->table[index].value);
-}
-
 static void zram_meta_free(struct zram *zram, u64 disksize)
 {
 	size_t num_pages = disksize >> PAGE_SHIFT;
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 008861220723..8d8959ceabd1 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -43,9 +43,9 @@
 
 /* Flags for zram pages (table[page_no].value) */
 enum zram_pageflags {
-	/* Page consists the same element */
-	ZRAM_SAME = ZRAM_FLAG_SHIFT,
-	ZRAM_ACCESS,	/* page is now accessed */
+	/* zram slot is locked */
+	ZRAM_LOCK = ZRAM_FLAG_SHIFT,
+	ZRAM_SAME,	/* Page consists the same element */
 	ZRAM_WB,	/* page is stored on backing_device */
 
 	__NR_ZRAM_PAGEFLAGS,
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH v5 2/4] zram: mark incompressible page as ZRAM_HUGE
  2018-04-16  9:09 [PATCH v5 0/4] zram memory tracking Minchan Kim
  2018-04-16  9:09 ` [PATCH v5 1/4] zram: correct flag name of ZRAM_ACCESS Minchan Kim
@ 2018-04-16  9:09 ` Minchan Kim
  2018-04-16  9:09 ` [PATCH v5 3/4] zram: record accessed second Minchan Kim
  2018-04-16  9:09 ` [PATCH v5 4/4] zram: introduce zram memory tracking Minchan Kim
  3 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2018-04-16  9:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim, Sergey Senozhatsky

Mark incompressible pages so that, once a page is swapped out, we can
investigate who owns it via the upcoming zram memory tracker feature.

With that information, we can either prevent such pages from being
swapped out by using mlock, or simply free them.

This patch exposes new stat for huge pages via mm_stat.

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/blockdev/zram.txt |  1 +
 drivers/block/zram/zram_drv.c   | 17 ++++++++++++++---
 drivers/block/zram/zram_drv.h   |  2 ++
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 257e65714c6a..78db38d02bc9 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -218,6 +218,7 @@ The stat file represents device's mm statistics. It consists of a single
  same_pages       the number of same element filled pages written to this disk.
                   No memory is allocated for such pages.
  pages_compacted  the number of pages freed during compaction
+ huge_pages	  the number of incompressible pages
 
 9) Deactivate:
 	swapoff /dev/zram0
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 18dadeab775b..777fb3339f59 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -729,14 +729,15 @@ static ssize_t mm_stat_show(struct device *dev,
 	max_used = atomic_long_read(&zram->stats.max_used_pages);
 
 	ret = scnprintf(buf, PAGE_SIZE,
-			"%8llu %8llu %8llu %8lu %8ld %8llu %8lu\n",
+			"%8llu %8llu %8llu %8lu %8ld %8llu %8lu %8llu\n",
 			orig_size << PAGE_SHIFT,
 			(u64)atomic64_read(&zram->stats.compr_data_size),
 			mem_used << PAGE_SHIFT,
 			zram->limit_pages << PAGE_SHIFT,
 			max_used << PAGE_SHIFT,
 			(u64)atomic64_read(&zram->stats.same_pages),
-			pool_stats.pages_compacted);
+			pool_stats.pages_compacted,
+			(u64)atomic64_read(&zram->stats.huge_pages));
 	up_read(&zram->init_lock);
 
 	return ret;
@@ -805,6 +806,11 @@ static void zram_free_page(struct zram *zram, size_t index)
 {
 	unsigned long handle;
 
+	if (zram_test_flag(zram, index, ZRAM_HUGE)) {
+		zram_clear_flag(zram, index, ZRAM_HUGE);
+		atomic64_dec(&zram->stats.huge_pages);
+	}
+
 	if (zram_wb_enabled(zram) && zram_test_flag(zram, index, ZRAM_WB)) {
 		zram_wb_clear(zram, index);
 		atomic64_dec(&zram->stats.pages_stored);
@@ -973,6 +979,7 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
 	}
 
 	if (unlikely(comp_len >= huge_class_size)) {
+		comp_len = PAGE_SIZE;
 		if (zram_wb_enabled(zram) && allow_wb) {
 			zcomp_stream_put(zram->comp);
 			ret = write_to_bdev(zram, bvec, index, bio, &element);
@@ -984,7 +991,6 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
 			allow_wb = false;
 			goto compress_again;
 		}
-		comp_len = PAGE_SIZE;
 	}
 
 	/*
@@ -1046,6 +1052,11 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
 	zram_slot_lock(zram, index);
 	zram_free_page(zram, index);
 
+	if (comp_len == PAGE_SIZE) {
+		zram_set_flag(zram, index, ZRAM_HUGE);
+		atomic64_inc(&zram->stats.huge_pages);
+	}
+
 	if (flags) {
 		zram_set_flag(zram, index, flags);
 		zram_set_element(zram, index, element);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 8d8959ceabd1..ff0547bdb586 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -47,6 +47,7 @@ enum zram_pageflags {
 	ZRAM_LOCK = ZRAM_FLAG_SHIFT,
 	ZRAM_SAME,	/* Page consists the same element */
 	ZRAM_WB,	/* page is stored on backing_device */
+	ZRAM_HUGE,	/* Incompressible page */
 
 	__NR_ZRAM_PAGEFLAGS,
 };
@@ -71,6 +72,7 @@ struct zram_stats {
 	atomic64_t invalid_io;	/* non-page-aligned I/O requests */
 	atomic64_t notify_free;	/* no. of swap slot free notifications */
 	atomic64_t same_pages;		/* no. of same element filled pages */
+	atomic64_t huge_pages;		/* no. of huge pages */
 	atomic64_t pages_stored;	/* no. of pages currently stored */
 	atomic_long_t max_used_pages;	/* no. of maximum pages stored */
 	atomic64_t writestall;		/* no. of write slow paths */
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH v5 3/4] zram: record accessed second
  2018-04-16  9:09 [PATCH v5 0/4] zram memory tracking Minchan Kim
  2018-04-16  9:09 ` [PATCH v5 1/4] zram: correct flag name of ZRAM_ACCESS Minchan Kim
  2018-04-16  9:09 ` [PATCH v5 2/4] zram: mark incompressible page as ZRAM_HUGE Minchan Kim
@ 2018-04-16  9:09 ` Minchan Kim
  2018-04-16  9:09 ` [PATCH v5 4/4] zram: introduce zram memory tracking Minchan Kim
  3 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2018-04-16  9:09 UTC (permalink / raw)
  To: Andrew Morton; +Cc: LKML, Sergey Senozhatsky, Minchan Kim, Sergey Senozhatsky

zram as swap is useful for small-memory devices. However, being swap
means the pages on zram are mostly cold due to the VM's LRU algorithm.
In particular, once an application's init data has been touched during
launch, it tends never to be accessed again and is finally swapped out.
zram can keep such cold pages in compressed form, but it is pointless
to keep them in memory at all; a better idea is for app developers to
free them directly rather than leaving them on the heap.

This patch records the last access time of each zram block so that,
with the upcoming zram memory tracking, it can help userspace
developers reduce memory footprint.

Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 drivers/block/zram/zram_drv.c | 16 ++++++++++++++++
 drivers/block/zram/zram_drv.h |  1 +
 2 files changed, 17 insertions(+)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 777fb3339f59..7fc10e2ad734 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -107,6 +107,16 @@ static inline void zram_set_element(struct zram *zram, u32 index,
 	zram->table[index].element = element;
 }
 
+static void zram_accessed(struct zram *zram, u32 index)
+{
+	zram->table[index].ac_time = sched_clock();
+}
+
+static void zram_reset_access(struct zram *zram, u32 index)
+{
+	zram->table[index].ac_time = 0;
+}
+
 static unsigned long zram_get_element(struct zram *zram, u32 index)
 {
 	return zram->table[index].element;
@@ -806,6 +816,8 @@ static void zram_free_page(struct zram *zram, size_t index)
 {
 	unsigned long handle;
 
+	zram_reset_access(zram, index);
+
 	if (zram_test_flag(zram, index, ZRAM_HUGE)) {
 		zram_clear_flag(zram, index, ZRAM_HUGE);
 		atomic64_dec(&zram->stats.huge_pages);
@@ -1177,6 +1189,10 @@ static int zram_bvec_rw(struct zram *zram, struct bio_vec *bvec, u32 index,
 
 	generic_end_io_acct(q, rw_acct, &zram->disk->part0, start_time);
 
+	zram_slot_lock(zram, index);
+	zram_accessed(zram, index);
+	zram_slot_unlock(zram, index);
+
 	if (unlikely(ret < 0)) {
 		if (!is_write)
 			atomic64_inc(&zram->stats.failed_reads);
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index ff0547bdb586..1075218e88b2 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -61,6 +61,7 @@ struct zram_table_entry {
 		unsigned long element;
 	};
 	unsigned long value;
+	u64 ac_time;
 };
 
 struct zram_stats {
-- 
2.17.0.484.g0c8726318c-goog


* [PATCH v5 4/4] zram: introduce zram memory tracking
  2018-04-16  9:09 [PATCH v5 0/4] zram memory tracking Minchan Kim
                   ` (2 preceding siblings ...)
  2018-04-16  9:09 ` [PATCH v5 3/4] zram: record accessed second Minchan Kim
@ 2018-04-16  9:09 ` Minchan Kim
  2018-04-17 21:59   ` Andrew Morton
  3 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2018-04-16  9:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Sergey Senozhatsky, Minchan Kim, Randy Dunlap,
	Greg Kroah-Hartman, Sergey Senozhatsky

zram as swap is useful for small-memory devices. However, being swap
means the pages on zram are mostly cold due to the VM's LRU algorithm.
In particular, once an application's init data has been touched during
launch, it tends never to be accessed again and is finally swapped out.
zram can keep such cold pages in compressed form, but it is pointless
to keep them in memory at all; a better idea is for app developers to
free them directly rather than leaving them on the heap.

This patch exposes the last access time of each zram block via
"cat /sys/kernel/debug/zram/zram0/block_state".

The output is as follows,
      300    75.033841 .wh
      301    63.806904 s..
      302    63.806919 ..h

The first column is zram's block index and the third one encodes the
block state as symbols (s: same-element page, w: page written to the
backing store, h: huge page). The second column is the time, in seconds
with microsecond resolution, at which the block was last accessed. So
the example above says the 300th block was last accessed at 75.033841
seconds and it was huge, so it was written to the backing store.

An admin can leverage this information, together with *pagemap*, to
catch a process's cold or incompressible pages once part of its heap
has been swapped out.

Cc: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 Documentation/blockdev/zram.txt |  24 ++++++
 drivers/block/zram/Kconfig      |  14 +++-
 drivers/block/zram/zram_drv.c   | 140 +++++++++++++++++++++++++++++---
 drivers/block/zram/zram_drv.h   |   5 ++
 4 files changed, 170 insertions(+), 13 deletions(-)

diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 78db38d02bc9..49015b51ef1e 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -243,5 +243,29 @@ to backing storage rather than keeping it in memory.
 User should set up backing device via /sys/block/zramX/backing_dev
 before disksize setting.
 
+= memory tracking
+
+With CONFIG_ZRAM_MEMORY_TRACKING, a user can get information about
+zram blocks. It can be useful to catch cold or incompressible
+pages of a process with *pagemap*.
+If you enable the feature, you can see block state via
+/sys/kernel/debug/zram/zram0/block_state. The output is as follows:
+
+	  300    75.033841 .wh
+	  301    63.806904 s..
+	  302    63.806919 ..h
+
+The first column is zram's block index.
+The second column is the access time.
+The third column is the state of the block:
+(s: same page
+w: page written to backing store
+h: huge page)
+
+The first line of the example above says the 300th block was accessed
+at 75.033841 sec and it was huge, so it was written back to the backing
+storage. This is a debugging feature, so nobody should rely on it to
+work properly.
+
 Nitin Gupta
 ngupta@vflare.org
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index ac3a31d433b2..635235759a0a 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -13,7 +13,7 @@ config ZRAM
 	  It has several use cases, for example: /tmp storage, use as swap
 	  disks and maybe many more.
 
-	  See zram.txt for more information.
+	  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
        bool "Write back incompressible page to backing device"
@@ -25,4 +25,14 @@ config ZRAM_WRITEBACK
 	 For this feature, admin should set up backing device via
 	 /sys/block/zramX/backing_dev.
 
-	 See zram.txt for more infomration.
+	 See Documentation/blockdev/zram.txt for more information.
+
+config ZRAM_MEMORY_TRACKING
+	bool "Track zRam block status"
+	depends on ZRAM && DEBUG_FS
+	help
+	  With this feature, admin can track the state of allocated blocks
+	  of zRAM. Admin could see the information via
+	  /sys/kernel/debug/zram/zramX/block_state.
+
+	  See Documentation/blockdev/zram.txt for more information.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7fc10e2ad734..a999d9996a13 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -31,6 +31,7 @@
 #include <linux/err.h>
 #include <linux/idr.h>
 #include <linux/sysfs.h>
+#include <linux/debugfs.h>
 #include <linux/cpuhotplug.h>
 
 #include "zram_drv.h"
@@ -67,6 +68,13 @@ static inline bool init_done(struct zram *zram)
 	return zram->disksize;
 }
 
+static inline bool zram_allocated(struct zram *zram, u32 index)
+{
+
+	return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+					zram->table[index].handle;
+}
+
 static inline struct zram *dev_to_zram(struct device *dev)
 {
 	return (struct zram *)dev_to_disk(dev)->private_data;
@@ -83,7 +91,7 @@ static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
 }
 
 /* flag operations require table entry bit_spin_lock() being held */
-static int zram_test_flag(struct zram *zram, u32 index,
+static bool zram_test_flag(struct zram *zram, u32 index,
 			enum zram_pageflags flag)
 {
 	return zram->table[index].value & BIT(flag);
@@ -107,16 +115,6 @@ static inline void zram_set_element(struct zram *zram, u32 index,
 	zram->table[index].element = element;
 }
 
-static void zram_accessed(struct zram *zram, u32 index)
-{
-	zram->table[index].ac_time = sched_clock();
-}
-
-static void zram_reset_access(struct zram *zram, u32 index)
-{
-	zram->table[index].ac_time = 0;
-}
-
 static unsigned long zram_get_element(struct zram *zram, u32 index)
 {
 	return zram->table[index].element;
@@ -620,6 +618,122 @@ static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
 static void zram_wb_clear(struct zram *zram, u32 index) {}
 #endif
 
+#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+
+static struct dentry *zram_debugfs_root;
+
+static void zram_debugfs_create(void)
+{
+	zram_debugfs_root = debugfs_create_dir("zram", NULL);
+}
+
+static void zram_debugfs_destroy(void)
+{
+	debugfs_remove_recursive(zram_debugfs_root);
+}
+
+static void zram_accessed(struct zram *zram, u32 index)
+{
+	zram->table[index].ac_time = sched_clock();
+}
+
+static void zram_reset_access(struct zram *zram, u32 index)
+{
+	zram->table[index].ac_time = 0;
+}
+
+static long long ns2usecs(u64 nsec)
+{
+	nsec += 500;
+	do_div(nsec, 1000);
+	return nsec;
+}
+
+static ssize_t read_block_state(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	char *kbuf;
+	ssize_t index, written = 0;
+	struct zram *zram = file->private_data;
+	u64 last_access;
+	unsigned long usec_rem;
+	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+
+	kbuf = kvmalloc(count, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	down_read(&zram->init_lock);
+	if (!init_done(zram)) {
+		up_read(&zram->init_lock);
+		kvfree(kbuf);
+		return -EINVAL;
+	}
+
+	for (index = *ppos; index < nr_pages; index++) {
+		int copied;
+
+		zram_slot_lock(zram, index);
+		if (!zram_allocated(zram, index))
+			goto next;
+
+		last_access = ns2usecs(zram->table[index].ac_time);
+		usec_rem = do_div(last_access, USEC_PER_SEC);
+		copied = snprintf(kbuf + written, count,
+			"%12lu %5lu.%06lu %c%c%c\n",
+			index, (unsigned long)last_access, usec_rem,
+			zram_test_flag(zram, index, ZRAM_SAME) ? 's' : '.',
+			zram_test_flag(zram, index, ZRAM_WB) ? 'w' : '.',
+			zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.');
+
+		if (count < copied) {
+			zram_slot_unlock(zram, index);
+			break;
+		}
+		written += copied;
+		count -= copied;
+next:
+		zram_slot_unlock(zram, index);
+		*ppos += 1;
+	}
+
+	up_read(&zram->init_lock);
+	if (copy_to_user(buf, kbuf, written))
+		written = -EFAULT;
+	kvfree(kbuf);
+
+	return written;
+}
+
+static const struct file_operations proc_zram_block_state_op = {
+	.open = simple_open,
+	.read = read_block_state,
+	.llseek = default_llseek,
+};
+
+static void zram_debugfs_register(struct zram *zram)
+{
+	if (!zram_debugfs_root)
+		return;
+
+	zram->debugfs_dir = debugfs_create_dir(zram->disk->disk_name,
+						zram_debugfs_root);
+	debugfs_create_file("block_state", 0400, zram->debugfs_dir,
+				zram, &proc_zram_block_state_op);
+}
+
+static void zram_debugfs_unregister(struct zram *zram)
+{
+	debugfs_remove_recursive(zram->debugfs_dir);
+}
+#else
+static void zram_debugfs_create(void) {};
+static void zram_debugfs_destroy(void) {};
+static void zram_accessed(struct zram *zram, u32 index) {};
+static void zram_reset_access(struct zram *zram, u32 index) {};
+static void zram_debugfs_register(struct zram *zram) {};
+static void zram_debugfs_unregister(struct zram *zram) {};
+#endif
 
 /*
  * We switched to per-cpu streams and this attr is not needed anymore.
@@ -1604,6 +1718,7 @@ static int zram_add(void)
 	}
 	strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
+	zram_debugfs_register(zram);
 	pr_info("Added device: %s\n", zram->disk->disk_name);
 	return device_id;
 
@@ -1637,6 +1752,7 @@ static int zram_remove(struct zram *zram)
 	zram->claim = true;
 	mutex_unlock(&bdev->bd_mutex);
 
+	zram_debugfs_unregister(zram);
 	/*
 	 * Remove sysfs first, so no one will perform a disksize
 	 * store while we destroy the devices. This also helps during
@@ -1739,6 +1855,7 @@ static void destroy_devices(void)
 {
 	class_unregister(&zram_control_class);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
+	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");
 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
@@ -1760,6 +1877,7 @@ static int __init zram_init(void)
 		return ret;
 	}
 
+	zram_debugfs_create();
 	zram_major = register_blkdev(0, "zram");
 	if (zram_major <= 0) {
 		pr_err("Unable to get major number\n");
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 1075218e88b2..dbe459b300c0 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -61,7 +61,9 @@ struct zram_table_entry {
 		unsigned long element;
 	};
 	unsigned long value;
+#ifdef CONFIG_ZRAM_MEMORY_TRACKING
 	u64 ac_time;
+#endif
 };
 
 struct zram_stats {
@@ -110,5 +112,8 @@ struct zram {
 	unsigned long nr_pages;
 	spinlock_t bitmap_lock;
 #endif
+#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+	struct dentry *debugfs_dir;
+#endif
 };
 #endif
-- 
2.17.0.484.g0c8726318c-goog


* Re: [PATCH v5 4/4] zram: introduce zram memory tracking
  2018-04-16  9:09 ` [PATCH v5 4/4] zram: introduce zram memory tracking Minchan Kim
@ 2018-04-17 21:59   ` Andrew Morton
  2018-04-18  1:26     ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2018-04-17 21:59 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Sergey Senozhatsky, Randy Dunlap, Greg Kroah-Hartman,
	Sergey Senozhatsky

On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim <minchan@kernel.org> wrote:

> zRam as swap is useful for small memory device. However, swap means
> those pages on zram are mostly cold pages due to VM's LRU algorithm.
> Especially, once init data for application are touched for launching,
> they tend to be not accessed any more and finally swapped out.
> zRAM can store such cold pages as compressed form but it's pointless
> to keep in memory. Better idea is app developers free them directly
> rather than remaining them on heap.
> 
> This patch tell us last access time of each block of zram via
> "cat /sys/kernel/debug/zram/zram0/block_state".
> 
> The output is as follows,
>       300    75.033841 .wh
>       301    63.806904 s..
>       302    63.806919 ..h
> 
> First column is zram's block index and 3rh one represents symbol
> (s: same page w: written page to backing store h: huge page) of the
> block state. Second column represents usec time unit of the block
> was last accessed. So above example means the 300th block is accessed
> at 75.033851 second and it was huge so it was written to the backing
> store.
> 
> Admin can leverage this information to catch cold|incompressible pages
> of process with *pagemap* once part of heaps are swapped out.

A few things..

- Terms like "Admin can" and "Admin could" are worrisome.  How do we
  know that admins *will* use this?  How do we know that we aren't
  adding a bunch of stuff which nobody will find to be (sufficiently)
  useful?  For example, is there some userspace tool to which you are
  contributing which will be updated to use this feature?

- block_state's second column is in microseconds since some
  undocumented time.  But how is userspace to know how much time has
  elapsed since the access?  ie, "current time".

- Is the sched_clock() return value suitable for exporting to
  userspace?  Is it monotonic?  Is it consistent across CPUs, across
  CPU hotadd/remove, across suspend/resume, etc?  Does it run all the
  way up to 2^64 on all CPU types, or will some processors wrap it at
  (say) 32 bits?  etcetera.  Documentation/timers/timekeeping.txt
  points out that suspend/resume can mess it up and that the counter
  can drift between cpus.


* Re: [PATCH v5 4/4] zram: introduce zram memory tracking
  2018-04-17 21:59   ` Andrew Morton
@ 2018-04-18  1:26     ` Minchan Kim
  2018-04-18 21:07       ` Andrew Morton
  0 siblings, 1 reply; 12+ messages in thread
From: Minchan Kim @ 2018-04-18  1:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Sergey Senozhatsky, Randy Dunlap, Greg Kroah-Hartman,
	Sergey Senozhatsky

Hi Andrew,

On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > zRam as swap is useful for small memory device. However, swap means
> > those pages on zram are mostly cold pages due to VM's LRU algorithm.
> > Especially, once init data for application are touched for launching,
> > they tend to be not accessed any more and finally swapped out.
> > zRAM can store such cold pages as compressed form but it's pointless
> > to keep in memory. Better idea is app developers free them directly
> > rather than remaining them on heap.
> > 
> > This patch tell us last access time of each block of zram via
> > "cat /sys/kernel/debug/zram/zram0/block_state".
> > 
> > The output is as follows,
> >       300    75.033841 .wh
> >       301    63.806904 s..
> >       302    63.806919 ..h
> > 
> > First column is zram's block index and 3rh one represents symbol
> > (s: same page w: written page to backing store h: huge page) of the
> > block state. Second column represents usec time unit of the block
> > was last accessed. So above example means the 300th block is accessed
> > at 75.033851 second and it was huge so it was written to the backing
> > store.
> > 
> > Admin can leverage this information to catch cold|incompressible pages
> > of process with *pagemap* once part of heaps are swapped out.
> 
> A few things..
> 
> - Terms like "Admin can" and "Admin could" are worrisome.  How do we
>   know that admins *will* use this?  How do we know that we aren't
>   adding a bunch of stuff which nobody will find to be (sufficiently)
>   useful?  For example, is there some userspace tool to which you are
>   contributing which will be updated to use this feature?

Actually, I used this feature two years ago to find memory hoggers,
although that version was a quick prototype. It was very useful for
reducing memory cost in the embedded space.

The reason I am trying to upstream the feature is that I need it
again. :)

Yes, I have a userspace tool that uses the feature, although it is
not compatible with this new version. It needs to be updated for the
new format. I will find time to submit the tool.

> 
> - block_state's second column is in microseconds since some
>   undocumented time.  But how is userspace to know how much time has
>   elapsed since the access?  ie, "current time".

It's sched_clock(), so it is the elapsed time since system boot.
I should have written that explicitly; I will fix it.

> 
> - Is the sched_clock() return value suitable for exporting to
>   userspace?  Is it monotonic?  Is it consistent across CPUs, across
>   CPU hotadd/remove, across suspend/resume, etc?  Does it run all the
>   way up to 2^64 on all CPU types, or will some processors wrap it at
>   (say) 32 bits?  etcetera.  Documentation/timers/timekeeping.txt
>   points out that suspend/resume can mess it up and that the counter
>   can drift between cpus.

Good point!

I just took it from ftrace because I thought the goal is similar:
no need to be exact unless the drift is frequent, but it should be fast.

AFAIK, ftrace/printk are active users of the function, so if the
problem happened frequently, it would be serious. :)


* Re: [PATCH v5 4/4] zram: introduce zram memory tracking
  2018-04-18  1:26     ` Minchan Kim
@ 2018-04-18 21:07       ` Andrew Morton
  2018-04-20  2:09         ` Minchan Kim
  0 siblings, 1 reply; 12+ messages in thread
From: Andrew Morton @ 2018-04-18 21:07 UTC (permalink / raw)
  To: Minchan Kim
  Cc: LKML, Sergey Senozhatsky, Randy Dunlap, Greg Kroah-Hartman,
	Sergey Senozhatsky

On Wed, 18 Apr 2018 10:26:36 +0900 Minchan Kim <minchan@kernel.org> wrote:

> Hi Andrew,
> 
> On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> > On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim <minchan@kernel.org> wrote:
> > 
> > > zRAM as swap is useful for small-memory devices. However, swap means
> > > those pages on zram are mostly cold pages, due to the VM's LRU algorithm.
> > > In particular, once an application's init data has been touched during
> > > launch, it tends not to be accessed any more and is finally swapped out.
> > > zRAM can store such cold pages in compressed form, but it's pointless
> > > to keep them in memory. A better idea is for app developers to free
> > > them directly rather than leaving them on the heap.
> > > 
> > > This patch tells us the last access time of each block of zram via
> > > "cat /sys/kernel/debug/zram/zram0/block_state".
> > > 
> > > The output is as follows,
> > >       300    75.033841 .wh
> > >       301    63.806904 s..
> > >       302    63.806919 ..h
> > > 
> > > The first column is zram's block index and the third one represents
> > > the block state as symbols (s: same page, w: page written to backing
> > > store, h: huge page). The second column is the time, in
> > > seconds.microseconds, at which the block was last accessed. So the
> > > above example means the 300th block was accessed at 75.033841 seconds
> > > and it was huge, so it was written to the backing store.
> > > 
> > > An admin can leverage this information together with *pagemap* to catch
> > > cold or incompressible pages of a process once part of its heap has
> > > been swapped out.
> > 
> > A few things..
> > 
> > - Terms like "Admin can" and "Admin could" are worrisome.  How do we
> >   know that admins *will* use this?  How do we know that we aren't
> >   adding a bunch of stuff which nobody will find to be (sufficiently)
> >   useful?  For example, is there some userspace tool to which you are
> >   contributing which will be updated to use this feature?
> 
> Actually, I used this feature two years ago to find a memory hogger,
> although the feature was just fast prototyping. It was very useful
> for reducing memory cost in the embedded space.
> 
> The reason I am trying to upstream the feature is that I need the
> feature again. :)
> 
> Yup, I have a userspace tool that uses the feature, although it is
> not compatible with this new version. It should be updated for the
> new format. I will find time to submit the tool.

hm, OK, can we get this info into the changelog?  

> > 
> > - block_state's second column is in microseconds since some
> >   undocumented time.  But how is userspace to know how much time has
> >   elapsed since the access?  ie, "current time".
> 
> It's sched_clock(), so it should be the elapsed time since system boot.
> I should have written that explicitly.
> I will fix it.
> 
> > 
> > - Is the sched_clock() return value suitable for exporting to
> >   userspace?  Is it monotonic?  Is it consistent across CPUs, across
> >   CPU hotadd/remove, across suspend/resume, etc?  Does it run all the
> >   way up to 2^64 on all CPU types, or will some processors wrap it at
> >   (say) 32 bits?  etcetera.  Documentation/timers/timekeeping.txt
> >   points out that suspend/resume can mess it up and that the counter
> >   can drift between cpus.
> 
> Good point!
> 
> I just referenced it from ftrace because I thought the goal is similar:
> "no need to be exact unless the drift is frequent, but it should be fast".
> 
> AFAIK, ftrace/printk are active users of the function, so if the problem
> happened frequently, it would be serious. :)

It could be that ktime_get() is a better fit here - especially if
sched_clock() goes nuts after resume.  Unfortunately ktime_get()
appears to be totally undocumented :(

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 4/4] zram: introduce zram memory tracking
  2018-04-18 21:07       ` Andrew Morton
@ 2018-04-20  2:09         ` Minchan Kim
  2018-04-20  2:18           ` Sergey Senozhatsky
  2018-04-20  6:35           ` Minchan Kim
  0 siblings, 2 replies; 12+ messages in thread
From: Minchan Kim @ 2018-04-20  2:09 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Sergey Senozhatsky, Randy Dunlap, Greg Kroah-Hartman,
	Sergey Senozhatsky

On Wed, Apr 18, 2018 at 02:07:15PM -0700, Andrew Morton wrote:
> On Wed, 18 Apr 2018 10:26:36 +0900 Minchan Kim <minchan@kernel.org> wrote:
> 
> > Hi Andrew,
> > 
> > On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> > > On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim <minchan@kernel.org> wrote:
> > > 
> > > > zRAM as swap is useful for small-memory devices. However, swap means
> > > > those pages on zram are mostly cold pages, due to the VM's LRU algorithm.
> > > > In particular, once an application's init data has been touched during
> > > > launch, it tends not to be accessed any more and is finally swapped out.
> > > > zRAM can store such cold pages in compressed form, but it's pointless
> > > > to keep them in memory. A better idea is for app developers to free
> > > > them directly rather than leaving them on the heap.
> > > > 
> > > > This patch tells us the last access time of each block of zram via
> > > > "cat /sys/kernel/debug/zram/zram0/block_state".
> > > > 
> > > > The output is as follows,
> > > >       300    75.033841 .wh
> > > >       301    63.806904 s..
> > > >       302    63.806919 ..h
> > > > 
> > > > The first column is zram's block index and the third one represents
> > > > the block state as symbols (s: same page, w: page written to backing
> > > > store, h: huge page). The second column is the time, in
> > > > seconds.microseconds, at which the block was last accessed. So the
> > > > above example means the 300th block was accessed at 75.033841 seconds
> > > > and it was huge, so it was written to the backing store.
> > > > 
> > > > An admin can leverage this information together with *pagemap* to catch
> > > > cold or incompressible pages of a process once part of its heap has
> > > > been swapped out.
> > > 
> > > A few things..
> > > 
> > > - Terms like "Admin can" and "Admin could" are worrisome.  How do we
> > >   know that admins *will* use this?  How do we know that we aren't
> > >   adding a bunch of stuff which nobody will find to be (sufficiently)
> > >   useful?  For example, is there some userspace tool to which you are
> > >   contributing which will be updated to use this feature?
> > 
> > Actually, I used this feature two years ago to find a memory hogger,
> > although the feature was just fast prototyping. It was very useful
> > for reducing memory cost in the embedded space.
> > 
> > The reason I am trying to upstream the feature is that I need the
> > feature again. :)
> > 
> > Yup, I have a userspace tool that uses the feature, although it is
> > not compatible with this new version. It should be updated for the
> > new format. I will find time to submit the tool.
> 
> hm, OK, can we get this info into the changelog?  

No problem. I will add as follows,

"I used the feature a few years ago to find memory hoggers in userspace
and to show them what memory they had wasted without touching it for a
long time. With it, they could reduce unnecessary memory usage. However,
at that time I hacked up zram for the feature, but now I need the
feature again, so I decided it would be better to upstream it rather
than keeping it private. I hope to submit the userspace tool that uses
the feature soon."

> 
> > > 
> > > - block_state's second column is in microseconds since some
> > >   undocumented time.  But how is userspace to know how much time has
> > >   elapsed since the access?  ie, "current time".
> > 
> > It's sched_clock(), so it should be the elapsed time since system boot.
> > I should have written that explicitly.
> > I will fix it.
> > 
> > > 
> > > - Is the sched_clock() return value suitable for exporting to
> > >   userspace?  Is it monotonic?  Is it consistent across CPUs, across
> > >   CPU hotadd/remove, across suspend/resume, etc?  Does it run all the
> > >   way up to 2^64 on all CPU types, or will some processors wrap it at
> > >   (say) 32 bits?  etcetera.  Documentation/timers/timekeeping.txt
> > >   points out that suspend/resume can mess it up and that the counter
> > >   can drift between cpus.
> > 
> > Good point!
> > 
> > I just referenced it from ftrace because I thought the goal is similar:
> > "no need to be exact unless the drift is frequent, but it should be fast".
> > 
> > AFAIK, ftrace/printk are active users of the function, so if the problem
> > happened frequently, it would be serious. :)
> 
> It could be that ktime_get() is a better fit here - especially if
> sched_clock() goes nuts after resume.  Unfortunately ktime_get()
> appears to be totally undocumented :(
> 

I will use ktime_get_boottime(). With it, zram is not damaged by
suspend/resume and the code will be simpler/clearer. For users, it
will be more straightforward to parse the time.

Thanks for the good suggestion, Andrew!

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 4/4] zram: introduce zram memory tracking
  2018-04-20  2:09         ` Minchan Kim
@ 2018-04-20  2:18           ` Sergey Senozhatsky
  2018-04-20  2:32             ` Minchan Kim
  2018-04-20  6:35           ` Minchan Kim
  1 sibling, 1 reply; 12+ messages in thread
From: Sergey Senozhatsky @ 2018-04-20  2:18 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Andrew Morton, LKML, Sergey Senozhatsky, Randy Dunlap,
	Greg Kroah-Hartman, Sergey Senozhatsky

On (04/20/18 11:09), Minchan Kim wrote:
[..]
> > hm, OK, can we get this info into the changelog?  
> 
> No problem. I will add as follows,
> 
> "I used the feature a few years ago to find memory hoggers in userspace
> and to show them what memory they had wasted without touching it for a
> long time. With it, they could reduce unnecessary memory usage. However,
> at that time I hacked up zram for the feature, but now I need the
> feature again, so I decided it would be better to upstream it rather
> than keeping it private. I hope to submit the userspace tool that uses
> the feature soon."

Shall we then just wait until you resubmit the "complete" patch set: zram
tracking + the user space tool which would parse the tracking output?

	-ss

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 4/4] zram: introduce zram memory tracking
  2018-04-20  2:18           ` Sergey Senozhatsky
@ 2018-04-20  2:32             ` Minchan Kim
  0 siblings, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2018-04-20  2:32 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, LKML, Randy Dunlap, Greg Kroah-Hartman,
	Sergey Senozhatsky

On Fri, Apr 20, 2018 at 11:18:34AM +0900, Sergey Senozhatsky wrote:
> On (04/20/18 11:09), Minchan Kim wrote:
> [..]
> > > hm, OK, can we get this info into the changelog?  
> > 
> > No problem. I will add as follows,
> > 
> > "I used the feature a few years ago to find memory hoggers in userspace
> > and to show them what memory they had wasted without touching it for a
> > long time. With it, they could reduce unnecessary memory usage. However,
> > at that time I hacked up zram for the feature, but now I need the
> > feature again, so I decided it would be better to upstream it rather
> > than keeping it private. I hope to submit the userspace tool that uses
> > the feature soon."
> 
> Shall we then just wait until you resubmit the "complete" patch set: zram
> tracking + the user space tool which would parse the tracking output?

tl;dr: I think the userspace tool is just ancillary, not a must.

Although my main purpose is to find idle memory hoggers, I don't think
a userspace tool is a must for merging this feature, because someone
might want to do other things regardless of the tool.

Examples off the top of my head are seeing how the swap write pattern
is going, how sparse the swap writes are, and so on. :)

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v5 4/4] zram: introduce zram memory tracking
  2018-04-20  2:09         ` Minchan Kim
  2018-04-20  2:18           ` Sergey Senozhatsky
@ 2018-04-20  6:35           ` Minchan Kim
  1 sibling, 0 replies; 12+ messages in thread
From: Minchan Kim @ 2018-04-20  6:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, Sergey Senozhatsky, Randy Dunlap, Greg Kroah-Hartman,
	Sergey Senozhatsky

On Fri, Apr 20, 2018 at 11:09:21AM +0900, Minchan Kim wrote:
> On Wed, Apr 18, 2018 at 02:07:15PM -0700, Andrew Morton wrote:
> > On Wed, 18 Apr 2018 10:26:36 +0900 Minchan Kim <minchan@kernel.org> wrote:
> > 
> > > Hi Andrew,
> > > 
> > > On Tue, Apr 17, 2018 at 02:59:21PM -0700, Andrew Morton wrote:
> > > > On Mon, 16 Apr 2018 18:09:46 +0900 Minchan Kim <minchan@kernel.org> wrote:
> > > > 
> > > > > zRAM as swap is useful for small-memory devices. However, swap means
> > > > > those pages on zram are mostly cold pages, due to the VM's LRU algorithm.
> > > > > In particular, once an application's init data has been touched during
> > > > > launch, it tends not to be accessed any more and is finally swapped out.
> > > > > zRAM can store such cold pages in compressed form, but it's pointless
> > > > > to keep them in memory. A better idea is for app developers to free
> > > > > them directly rather than leaving them on the heap.
> > > > > 
> > > > > This patch tells us the last access time of each block of zram via
> > > > > "cat /sys/kernel/debug/zram/zram0/block_state".
> > > > > 
> > > > > The output is as follows,
> > > > >       300    75.033841 .wh
> > > > >       301    63.806904 s..
> > > > >       302    63.806919 ..h
> > > > > 
> > > > > The first column is zram's block index and the third one represents
> > > > > the block state as symbols (s: same page, w: page written to backing
> > > > > store, h: huge page). The second column is the time, in
> > > > > seconds.microseconds, at which the block was last accessed. So the
> > > > > above example means the 300th block was accessed at 75.033841 seconds
> > > > > and it was huge, so it was written to the backing store.
> > > > > 
> > > > > An admin can leverage this information together with *pagemap* to catch
> > > > > cold or incompressible pages of a process once part of its heap has
> > > > > been swapped out.
> > > > 
> > > > A few things..
> > > > 
> > > > - Terms like "Admin can" and "Admin could" are worrisome.  How do we
> > > >   know that admins *will* use this?  How do we know that we aren't
> > > >   adding a bunch of stuff which nobody will find to be (sufficiently)
> > > >   useful?  For example, is there some userspace tool to which you are
> > > >   contributing which will be updated to use this feature?
> > > 
> > > Actually, I used this feature two years ago to find a memory hogger,
> > > although the feature was just fast prototyping. It was very useful
> > > for reducing memory cost in the embedded space.
> > > 
> > > The reason I am trying to upstream the feature is that I need the
> > > feature again. :)
> > > 
> > > Yup, I have a userspace tool that uses the feature, although it is
> > > not compatible with this new version. It should be updated for the
> > > new format. I will find time to submit the tool.
> > 
> > hm, OK, can we get this info into the changelog?  
> 
> No problem. I will add as follows,
> 
> "I used the feature a few years ago to find memory hoggers in userspace
> and to show them what memory they had wasted without touching it for a
> long time. With it, they could reduce unnecessary memory usage. However,
> at that time I hacked up zram for the feature, but now I need the
> feature again, so I decided it would be better to upstream it rather
> than keeping it private. I hope to submit the userspace tool that uses
> the feature soon."
> 
> > 
> > > > 
> > > > - block_state's second column is in microseconds since some
> > > >   undocumented time.  But how is userspace to know how much time has
> > > >   elapsed since the access?  ie, "current time".
> > > 
> > > It's sched_clock(), so it should be the elapsed time since system boot.
> > > I should have written that explicitly.
> > > I will fix it.
> > > 
> > > > 
> > > > - Is the sched_clock() return value suitable for exporting to
> > > >   userspace?  Is it monotonic?  Is it consistent across CPUs, across
> > > >   CPU hotadd/remove, across suspend/resume, etc?  Does it run all the
> > > >   way up to 2^64 on all CPU types, or will some processors wrap it at
> > > >   (say) 32 bits?  etcetera.  Documentation/timers/timekeeping.txt
> > > >   points out that suspend/resume can mess it up and that the counter
> > > >   can drift between cpus.
> > > 
> > > Good point!
> > > 
> > > I just referenced it from ftrace because I thought the goal is similar:
> > > "no need to be exact unless the drift is frequent, but it should be fast".
> > > 
> > > AFAIK, ftrace/printk are active users of the function, so if the problem
> > > happened frequently, it would be serious. :)
> > 
> > It could be that ktime_get() is a better fit here - especially if
> > sched_clock() goes nuts after resume.  Unfortunately ktime_get()
> > appears to be totally undocumented :(
> > 
> 
> I will use ktime_get_boottime(). With it, zram is not damaged by
> suspend/resume and the code will be simpler/clearer. For users, it
> will be more straightforward to parse the time.
> 
> Thanks for the good suggestion, Andrew!
> 

Hey Andrew,

This is the updated patch for 4/4.
If you want me to resend the full patchset, please tell me.

>From 2ac685c32ffd3fba42d5eea6347f924c6e89bec0 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Mon, 9 Apr 2018 14:34:43 +0900
Subject: [PATCH v5 4/4] zram: introduce zram memory tracking

zRAM as swap is useful for small-memory devices. However, swap means
those pages on zram are mostly cold pages, due to the VM's LRU algorithm.
In particular, once an application's init data has been touched during
launch, it tends not to be accessed any more and is finally swapped out.
zRAM can store such cold pages in compressed form, but it's pointless
to keep them in memory. A better idea is for app developers to free
them directly rather than leaving them on the heap.

This patch tells us the last access time of each block of zram via
"cat /sys/kernel/debug/zram/zram0/block_state".

The output is as follows,
      300    75.033841 .wh
      301    63.806904 s..
      302    63.806919 ..h

The first column is zram's block index and the third one represents the
block state as symbols (s: same page, w: page written to backing store,
h: huge page). The second column is the time, in seconds.microseconds,
at which the block was last accessed. So the above example means the
300th block was accessed at 75.033841 seconds and it was huge, so it
was written to the backing store.

An admin can leverage this information together with *pagemap* to catch
cold or incompressible pages of a process once part of its heap has
been swapped out.

I used the feature a few years ago to find memory hoggers in userspace
and to show them what memory they had wasted without touching it for a
long time. With it, they could reduce unnecessary memory usage. However,
at that time I hacked up zram for the feature, but now I need the
feature again, so I decided it would be better to upstream it rather
than keeping it private. I hope to submit the userspace tool that uses
the feature soon.

Cc: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
* from v4
  * use ktime_get_boottime - Andrew
  * add feature use case in changelog - Andrew

 Documentation/blockdev/zram.txt |  24 ++++++
 drivers/block/zram/Kconfig      |  14 +++-
 drivers/block/zram/zram_drv.c   | 131 +++++++++++++++++++++++++++++---
 drivers/block/zram/zram_drv.h   |   7 +-
 4 files changed, 162 insertions(+), 14 deletions(-)

diff --git a/Documentation/blockdev/zram.txt b/Documentation/blockdev/zram.txt
index 78db38d02bc9..6cb804b709cf 100644
--- a/Documentation/blockdev/zram.txt
+++ b/Documentation/blockdev/zram.txt
@@ -243,5 +243,29 @@ to backing storage rather than keeping it in memory.
 User should set up backing device via /sys/block/zramX/backing_dev
 before disksize setting.
 
+= memory tracking
+
+With CONFIG_ZRAM_MEMORY_TRACKING, the user can get information about a
+zram block. It could be useful to catch cold or incompressible
+pages of a process with *pagemap*.
+If you enable the feature, you can see the block state via
+/sys/kernel/debug/zram/zram0/block_state. The output is as follows,
+
+	  300    75.033841 .wh
+	  301    63.806904 s..
+	  302    63.806919 ..h
+
+First column is zram's block index.
+Second column is the access time since system boot.
+Third column is the state of the block.
+(s: same page
+w: written page to backing store
+h: huge page)
+
+The first line of the above example says the 300th block was accessed
+at 75.033841 sec and the block is huge, so it was written back to the
+backing storage. It's a debugging feature, so nobody should rely on it
+to work properly.
+
 Nitin Gupta
 ngupta@vflare.org
diff --git a/drivers/block/zram/Kconfig b/drivers/block/zram/Kconfig
index ac3a31d433b2..635235759a0a 100644
--- a/drivers/block/zram/Kconfig
+++ b/drivers/block/zram/Kconfig
@@ -13,7 +13,7 @@ config ZRAM
 	  It has several use cases, for example: /tmp storage, use as swap
 	  disks and maybe many more.
 
-	  See zram.txt for more information.
+	  See Documentation/blockdev/zram.txt for more information.
 
 config ZRAM_WRITEBACK
        bool "Write back incompressible page to backing device"
@@ -25,4 +25,14 @@ config ZRAM_WRITEBACK
 	 For this feature, admin should set up backing device via
 	 /sys/block/zramX/backing_dev.
 
-	 See zram.txt for more infomration.
+	 See Documentation/blockdev/zram.txt for more information.
+
+config ZRAM_MEMORY_TRACKING
+	bool "Track zRam block status"
+	depends on ZRAM && DEBUG_FS
+	help
+	  With this feature, admin can track the state of allocated blocks
+	  of zRAM. Admin could see the information via
+	  /sys/kernel/debug/zram/zramX/block_state.
+
+	  See Documentation/blockdev/zram.txt for more information.
diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 7fc10e2ad734..68d727d89d38 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -31,6 +31,7 @@
 #include <linux/err.h>
 #include <linux/idr.h>
 #include <linux/sysfs.h>
+#include <linux/debugfs.h>
 #include <linux/cpuhotplug.h>
 
 #include "zram_drv.h"
@@ -67,6 +68,13 @@ static inline bool init_done(struct zram *zram)
 	return zram->disksize;
 }
 
+static inline bool zram_allocated(struct zram *zram, u32 index)
+{
+
+	return (zram->table[index].value >> (ZRAM_FLAG_SHIFT + 1)) ||
+					zram->table[index].handle;
+}
+
 static inline struct zram *dev_to_zram(struct device *dev)
 {
 	return (struct zram *)dev_to_disk(dev)->private_data;
@@ -83,7 +91,7 @@ static void zram_set_handle(struct zram *zram, u32 index, unsigned long handle)
 }
 
 /* flag operations require table entry bit_spin_lock() being held */
-static int zram_test_flag(struct zram *zram, u32 index,
+static bool zram_test_flag(struct zram *zram, u32 index,
 			enum zram_pageflags flag)
 {
 	return zram->table[index].value & BIT(flag);
@@ -107,16 +115,6 @@ static inline void zram_set_element(struct zram *zram, u32 index,
 	zram->table[index].element = element;
 }
 
-static void zram_accessed(struct zram *zram, u32 index)
-{
-	zram->table[index].ac_time = sched_clock();
-}
-
-static void zram_reset_access(struct zram *zram, u32 index)
-{
-	zram->table[index].ac_time = 0;
-}
-
 static unsigned long zram_get_element(struct zram *zram, u32 index)
 {
 	return zram->table[index].element;
@@ -620,6 +618,113 @@ static int read_from_bdev(struct zram *zram, struct bio_vec *bvec,
 static void zram_wb_clear(struct zram *zram, u32 index) {}
 #endif
 
+#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+
+static struct dentry *zram_debugfs_root;
+
+static void zram_debugfs_create(void)
+{
+	zram_debugfs_root = debugfs_create_dir("zram", NULL);
+}
+
+static void zram_debugfs_destroy(void)
+{
+	debugfs_remove_recursive(zram_debugfs_root);
+}
+
+static void zram_accessed(struct zram *zram, u32 index)
+{
+	zram->table[index].ac_time = ktime_get_boottime();
+}
+
+static void zram_reset_access(struct zram *zram, u32 index)
+{
+	zram->table[index].ac_time = 0;
+}
+
+static ssize_t read_block_state(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	char *kbuf;
+	ssize_t index, written = 0;
+	struct zram *zram = file->private_data;
+	unsigned long nr_pages = zram->disksize >> PAGE_SHIFT;
+	struct timespec64 ts;
+
+	kbuf = kvmalloc(count, GFP_KERNEL);
+	if (!kbuf)
+		return -ENOMEM;
+
+	down_read(&zram->init_lock);
+	if (!init_done(zram)) {
+		up_read(&zram->init_lock);
+		kvfree(kbuf);
+		return -EINVAL;
+	}
+
+	for (index = *ppos; index < nr_pages; index++) {
+		int copied;
+
+		zram_slot_lock(zram, index);
+		if (!zram_allocated(zram, index))
+			goto next;
+
+		ts = ktime_to_timespec64(zram->table[index].ac_time);
+		copied = snprintf(kbuf + written, count,
+			"%12lu %12lu.%06lu %c%c%c\n",
+			index, ts.tv_sec, ts.tv_nsec / NSEC_PER_USEC,
+			zram_test_flag(zram, index, ZRAM_SAME) ? 's' : '.',
+			zram_test_flag(zram, index, ZRAM_WB) ? 'w' : '.',
+			zram_test_flag(zram, index, ZRAM_HUGE) ? 'h' : '.');
+
+		if (count < copied) {
+			zram_slot_unlock(zram, index);
+			break;
+		}
+		written += copied;
+		count -= copied;
+next:
+		zram_slot_unlock(zram, index);
+		*ppos += 1;
+	}
+
+	up_read(&zram->init_lock);
+	if (copy_to_user(buf, kbuf, written))
+		written = -EFAULT;
+	kvfree(kbuf);
+
+	return written;
+}
+
+static const struct file_operations proc_zram_block_state_op = {
+	.open = simple_open,
+	.read = read_block_state,
+	.llseek = default_llseek,
+};
+
+static void zram_debugfs_register(struct zram *zram)
+{
+	if (!zram_debugfs_root)
+		return;
+
+	zram->debugfs_dir = debugfs_create_dir(zram->disk->disk_name,
+						zram_debugfs_root);
+	debugfs_create_file("block_state", 0400, zram->debugfs_dir,
+				zram, &proc_zram_block_state_op);
+}
+
+static void zram_debugfs_unregister(struct zram *zram)
+{
+	debugfs_remove_recursive(zram->debugfs_dir);
+}
+#else
+static void zram_debugfs_create(void) {};
+static void zram_debugfs_destroy(void) {};
+static void zram_accessed(struct zram *zram, u32 index) {};
+static void zram_reset_access(struct zram *zram, u32 index) {};
+static void zram_debugfs_register(struct zram *zram) {};
+static void zram_debugfs_unregister(struct zram *zram) {};
+#endif
 
 /*
  * We switched to per-cpu streams and this attr is not needed anymore.
@@ -1604,6 +1709,7 @@ static int zram_add(void)
 	}
 	strlcpy(zram->compressor, default_compressor, sizeof(zram->compressor));
 
+	zram_debugfs_register(zram);
 	pr_info("Added device: %s\n", zram->disk->disk_name);
 	return device_id;
 
@@ -1637,6 +1743,7 @@ static int zram_remove(struct zram *zram)
 	zram->claim = true;
 	mutex_unlock(&bdev->bd_mutex);
 
+	zram_debugfs_unregister(zram);
 	/*
 	 * Remove sysfs first, so no one will perform a disksize
 	 * store while we destroy the devices. This also helps during
@@ -1739,6 +1846,7 @@ static void destroy_devices(void)
 {
 	class_unregister(&zram_control_class);
 	idr_for_each(&zram_index_idr, &zram_remove_cb, NULL);
+	zram_debugfs_destroy();
 	idr_destroy(&zram_index_idr);
 	unregister_blkdev(zram_major, "zram");
 	cpuhp_remove_multi_state(CPUHP_ZCOMP_PREPARE);
@@ -1760,6 +1868,7 @@ static int __init zram_init(void)
 		return ret;
 	}
 
+	zram_debugfs_create();
 	zram_major = register_blkdev(0, "zram");
 	if (zram_major <= 0) {
 		pr_err("Unable to get major number\n");
diff --git a/drivers/block/zram/zram_drv.h b/drivers/block/zram/zram_drv.h
index 1075218e88b2..72c8584b6dff 100644
--- a/drivers/block/zram/zram_drv.h
+++ b/drivers/block/zram/zram_drv.h
@@ -61,7 +61,9 @@ struct zram_table_entry {
 		unsigned long element;
 	};
 	unsigned long value;
-	u64 ac_time;
+#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+	ktime_t ac_time;
+#endif
 };
 
 struct zram_stats {
@@ -110,5 +112,8 @@ struct zram {
 	unsigned long nr_pages;
 	spinlock_t bitmap_lock;
 #endif
+#ifdef CONFIG_ZRAM_MEMORY_TRACKING
+	struct dentry *debugfs_dir;
+#endif
 };
 #endif
-- 
2.17.0.484.g0c8726318c-goog

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-04-20  6:35 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-16  9:09 [PATCH v5 0/4] zram memory tracking Minchan Kim
2018-04-16  9:09 ` [PATCH v5 1/4] zram: correct flag name of ZRAM_ACCESS Minchan Kim
2018-04-16  9:09 ` [PATCH v5 2/4] zram: mark incompressible page as ZRAM_HUGE Minchan Kim
2018-04-16  9:09 ` [PATCH v5 3/4] zram: record accessed second Minchan Kim
2018-04-16  9:09 ` [PATCH v5 4/4] zram: introduce zram memory tracking Minchan Kim
2018-04-17 21:59   ` Andrew Morton
2018-04-18  1:26     ` Minchan Kim
2018-04-18 21:07       ` Andrew Morton
2018-04-20  2:09         ` Minchan Kim
2018-04-20  2:18           ` Sergey Senozhatsky
2018-04-20  2:32             ` Minchan Kim
2018-04-20  6:35           ` Minchan Kim
