* [PATCH v2 0/9] nilfs2: implementation of cost-benefit GC policy
@ 2015-05-03 10:05 Andreas Rohner
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hello,

This is an updated version based on the review by Ryusuke Konishi. It
is a complete rewrite of the first version, and the implementation is
much simpler and cleaner.

I include here a copy of my cover letter from the first version:

One of the biggest performance problems of NILFS is its
inefficient Timestamp GC policy. This patch set introduces two new GC
policies, namely Cost-Benefit and Greedy.

The Cost-Benefit policy is nothing new. It has been around for a long
time with log-structured file systems [1]. But it relies on accurate
information about the number of live blocks in a segment, which NILFS
currently does not provide. So this patch set extends the entries in the
SUFILE to include a counter for the number of live blocks. This counter
is decremented whenever a file is deleted or overwritten.

Except for some tricky parts, the counting of live blocks is quite
trivial. The problem is snapshots. At any time, a checkpoint can be
turned into a snapshot or vice versa. So blocks that are reclaimable at
one point in time can be protected by a snapshot a moment later.

This patch set does not try to track snapshots at all. Instead it uses a
heuristic approach to prevent the worst-case scenario. The performance
is still significantly better than the Timestamp policy in my benchmarks.

The worst-case scenario is the following:

1. Segment 1 is written
2. Snapshot is created
3. GC tries to reclaim Segment 1, but all blocks are protected
   by the Snapshot. The GC has to set the number of live blocks
   to maximum to avoid reclaiming this Segment again in the near future.
4. Snapshot is deleted
5. Segment 1 is reclaimable, but its counter is so high, that the GC
   will never try to reclaim it again.

To prevent this kind of starvation I use another field in the SUFILE
entry to store the number of blocks that are protected by a snapshot.
This value is just a heuristic and is usually set to 0. It is only
written to the SUFILE entry when the GC reclaims a segment. The GC has
to check for snapshots anyway, so we get this information for free. By
storing this information in the SUFILE we can avoid starvation in the
following way:

1. Segment 1 is written
2. Snapshot is created
3. GC tries to reclaim Segment 1, but all blocks are protected
   by the Snapshot. The GC has to set the number of live blocks
   to maximum to avoid reclaiming this Segment again in the near future.
4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
   entry
5. Snapshot is deleted
6. On snapshot deletion we walk through every entry in the SUFILE and
   halve the number of live blocks, if the number of snapshot blocks
   is bigger than half of the maximum.
7. Segment 1 is reclaimable and the number of live blocks entry is at
   half the maximum. The GC will try to reclaim this segment as soon as
   there are no other better choices.

BENCHMARKS:
-----------

My benchmark is quite simple. It consists of a process that replays
real NFS traces at an accelerated speed, thereby creating relatively
realistic patterns of file creation and deletion. At the same time,
multiple snapshots are created and deleted in parallel. I use a 100GB
partition of a Samsung SSD:

WITH SNAPSHOTS EVERY 5 MINUTES:
--------------------------------------------------------------------
                Execution time       Wear (Data written to disk)
Timestamp:      100%                 100%
Cost-Benefit:   80%                  43%

NO SNAPSHOTS:
---------------------------------------------------------------------
                Execution time       Wear (Data written to disk)
Timestamp:      100%                 100%
Cost-Benefit:   70%                  45%

I plan on adding more benchmark results soon.

Best regards,
Andreas Rohner

[1] Mendel Rosenblum and John K. Ousterhout. The design and implementation
    of a log-structured file system. ACM Trans. Comput. Syst.,
    10(1):26–52, February 1992.

Changes since v1:

 - Complete rewrite
 - Use a radix_tree to store the cache
 - Cache is stored in struct nilfs_sufile_info and does not have to be 
   passed around.
 - No new lock classes are needed, because the cache is flushed only at 
   segment creation
 - Dead blocks for the DAT file are tracked in nilfs_btree_propagate_p()

Andreas Rohner (9):
  nilfs2: copy file system feature flags to the nilfs object
  nilfs2: extend SUFILE on-disk format to enable tracking of live blocks
  nilfs2: introduce new feature flag for tracking live blocks
  nilfs2: add kmem_cache for SUFILE cache nodes
  nilfs2: add SUFILE cache for changes to su_nlive_blks field
  nilfs2: add tracking of block deletions and updates
  nilfs2: ensure that all dirty blocks are written out
  nilfs2: correct live block tracking for GC protection period
  nilfs2: prevent starvation of segments protected by snapshots

 fs/nilfs2/btree.c         |  33 ++-
 fs/nilfs2/dat.c           |  81 +++++++-
 fs/nilfs2/dat.h           |   1 +
 fs/nilfs2/direct.c        |  20 +-
 fs/nilfs2/ioctl.c         |  69 ++++++-
 fs/nilfs2/page.c          |   6 +-
 fs/nilfs2/page.h          |   9 +
 fs/nilfs2/segbuf.c        |   3 +
 fs/nilfs2/segbuf.h        |   5 +
 fs/nilfs2/segment.c       | 162 +++++++++++++--
 fs/nilfs2/segment.h       |   3 +-
 fs/nilfs2/sufile.c        | 516 +++++++++++++++++++++++++++++++++++++++++++++-
 fs/nilfs2/sufile.h        |  31 ++-
 fs/nilfs2/super.c         |  14 ++
 fs/nilfs2/the_nilfs.c     |   4 +
 fs/nilfs2/the_nilfs.h     |  16 ++
 include/linux/nilfs2_fs.h | 103 ++++++++-
 17 files changed, 1033 insertions(+), 43 deletions(-)

-- 
2.3.7

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [PATCH v2 1/9] nilfs2: copy file system feature flags to the nilfs object
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-03 10:05   ` Andreas Rohner
       [not found]     ` <1430647522-14304-2-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-03 10:05   ` [PATCH v2 2/9] nilfs2: extend SUFILE on-disk format to enable tracking of live blocks Andreas Rohner
                     ` (13 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch adds three new attributes to the nilfs object, which contain
a copy of the feature flags from the super block. They can be used to
efficiently test whether file system feature flags are set.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/the_nilfs.c | 4 ++++
 fs/nilfs2/the_nilfs.h | 8 ++++++++
 2 files changed, 12 insertions(+)

diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
index 69bd801..606fdfc 100644
--- a/fs/nilfs2/the_nilfs.c
+++ b/fs/nilfs2/the_nilfs.c
@@ -630,6 +630,10 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data)
 	get_random_bytes(&nilfs->ns_next_generation,
 			 sizeof(nilfs->ns_next_generation));
 
+	nilfs->ns_feature_compat = le64_to_cpu(sbp->s_feature_compat);
+	nilfs->ns_feature_compat_ro = le64_to_cpu(sbp->s_feature_compat_ro);
+	nilfs->ns_feature_incompat = le64_to_cpu(sbp->s_feature_incompat);
+
 	err = nilfs_store_disk_layout(nilfs, sbp);
 	if (err)
 		goto failed_sbh;
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index 23778d3..12cd91d 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -101,6 +101,9 @@ enum {
  * @ns_dev_kobj: /sys/fs/<nilfs>/<device>
  * @ns_dev_kobj_unregister: completion state
  * @ns_dev_subgroups: <device> subgroups pointer
+ * @ns_feature_compat: Compatible feature set
+ * @ns_feature_compat_ro: Read-only compatible feature set
+ * @ns_feature_incompat: Incompatible feature set
  */
 struct the_nilfs {
 	unsigned long		ns_flags;
@@ -201,6 +204,11 @@ struct the_nilfs {
 	struct kobject ns_dev_kobj;
 	struct completion ns_dev_kobj_unregister;
 	struct nilfs_sysfs_dev_subgroups *ns_dev_subgroups;
+
+	/* Features */
+	__u64                   ns_feature_compat;
+	__u64                   ns_feature_compat_ro;
+	__u64                   ns_feature_incompat;
 };
 
 #define THE_NILFS_FNS(bit, name)					\
-- 
2.3.7



* [PATCH v2 2/9] nilfs2: extend SUFILE on-disk format to enable tracking of live blocks
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-03 10:05   ` [PATCH v2 1/9] nilfs2: copy file system feature flags to the nilfs object Andreas Rohner
@ 2015-05-03 10:05   ` Andreas Rohner
       [not found]     ` <1430647522-14304-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-03 10:05   ` [PATCH v2 3/9] nilfs2: introduce new feature flag for tracking " Andreas Rohner
                     ` (12 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch extends the nilfs_segment_usage structure with three extra
fields. This changes the on-disk format of the SUFILE, but the NILFS2
metadata files are flexible enough that there are no compatibility
issues. The extension is fully backwards compatible. Nevertheless, a
feature compatibility flag is added to indicate the on-disk format
change.

The new field su_nlive_blks is used to track the number of live blocks
in the corresponding segment. Its value should always be less than or
equal to su_nblocks, which contains the total number of blocks in the
segment.

The field su_nlive_lastmod is necessary because of the protection period
used by the GC. It is a timestamp containing the last time
su_nlive_blks was modified. For example, if a file is deleted, its
blocks are subtracted from su_nlive_blks and are therefore considered to
be reclaimable by the kernel. But the GC additionally protects them with
the protection period. So while su_nlive_blks contains the number of
potentially reclaimable blocks, the actual number depends on the
protection period. To enable GC policies to effectively choose or prefer
segments with unprotected blocks, the timestamp in su_nlive_lastmod is
necessary.

The new field su_nsnapshot_blks contains the number of blocks in a
segment that are protected by a snapshot. The value is meant to be a
heuristic for the GC and is not necessarily always accurate.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/ioctl.c         |  4 +--
 fs/nilfs2/sufile.c        | 45 +++++++++++++++++++++++++++++++--
 fs/nilfs2/sufile.h        |  6 +++++
 include/linux/nilfs2_fs.h | 63 +++++++++++++++++++++++++++++++++++++++++------
 4 files changed, 106 insertions(+), 12 deletions(-)

diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 9a20e51..f6ee54e 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1250,7 +1250,7 @@ static int nilfs_ioctl_set_suinfo(struct inode *inode, struct file *filp,
 		goto out;
 
 	ret = -EINVAL;
-	if (argv.v_size < sizeof(struct nilfs_suinfo_update))
+	if (argv.v_size < NILFS_MIN_SUINFO_UPDATE_SIZE)
 		goto out;
 
 	if (argv.v_nmembs > nilfs->ns_nsegments)
@@ -1316,7 +1316,7 @@ long nilfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		return nilfs_ioctl_get_cpstat(inode, filp, cmd, argp);
 	case NILFS_IOCTL_GET_SUINFO:
 		return nilfs_ioctl_get_info(inode, filp, cmd, argp,
-					    sizeof(struct nilfs_suinfo),
+					    NILFS_MIN_SEGMENT_USAGE_SIZE,
 					    nilfs_ioctl_do_get_suinfo);
 	case NILFS_IOCTL_SET_SUINFO:
 		return nilfs_ioctl_set_suinfo(inode, filp, cmd, argp);
diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 2a869c3..1cce358 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -453,6 +453,11 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 segnum,
 	su->su_lastmod = cpu_to_le64(0);
 	su->su_nblocks = cpu_to_le32(0);
 	su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
+	if (nilfs_sufile_live_blks_ext_supported(sufile)) {
+		su->su_nlive_blks = cpu_to_le32(0);
+		su->su_nsnapshot_blks = cpu_to_le32(0);
+		su->su_nlive_lastmod = cpu_to_le64(0);
+	}
 	kunmap_atomic(kaddr);
 
 	nilfs_sufile_mod_counter(header_bh, clean ? (u64)-1 : 0, dirty ? 0 : 1);
@@ -482,7 +487,7 @@ void nilfs_sufile_do_free(struct inode *sufile, __u64 segnum,
 	WARN_ON(!nilfs_segment_usage_dirty(su));
 
 	sudirty = nilfs_segment_usage_dirty(su);
-	nilfs_segment_usage_set_clean(su);
+	nilfs_segment_usage_set_clean(su, NILFS_MDT(sufile)->mi_entry_size);
 	kunmap_atomic(kaddr);
 	mark_buffer_dirty(su_bh);
 
@@ -698,7 +703,7 @@ static int nilfs_sufile_truncate_range(struct inode *sufile,
 		nc = 0;
 		for (su = su2, j = 0; j < n; j++, su = (void *)su + susz) {
 			if (nilfs_segment_usage_error(su)) {
-				nilfs_segment_usage_set_clean(su);
+				nilfs_segment_usage_set_clean(su, susz);
 				nc++;
 			}
 		}
@@ -821,6 +826,8 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
 	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
 	void *kaddr;
 	unsigned long nsegs, segusages_per_block;
+	__u64 lm = 0;
+	__u32 nlb = 0, nsb = 0;
 	ssize_t n;
 	int ret, i, j;
 
@@ -858,6 +865,18 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
 			if (nilfs_segment_is_active(nilfs, segnum + j))
 				si->sui_flags |=
 					(1UL << NILFS_SEGMENT_USAGE_ACTIVE);
+
+			if (susz >= NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE) {
+				nlb = le32_to_cpu(su->su_nlive_blks);
+				nsb = le32_to_cpu(su->su_nsnapshot_blks);
+				lm = le64_to_cpu(su->su_nlive_lastmod);
+			}
+
+			if (sisz >= NILFS_LIVE_BLKS_EXT_SUINFO_SIZE) {
+				si->sui_nlive_blks = nlb;
+				si->sui_nsnapshot_blks = nsb;
+				si->sui_nlive_lastmod = lm;
+			}
 		}
 		kunmap_atomic(kaddr);
 		brelse(su_bh);
@@ -901,6 +920,9 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 	int cleansi, cleansu, dirtysi, dirtysu;
 	long ncleaned = 0, ndirtied = 0;
 	int ret = 0;
+	bool sup_ext = (supsz >= NILFS_LIVE_BLKS_EXT_SUINFO_UPDATE_SIZE);
+	bool su_ext = nilfs_sufile_live_blks_ext_supported(sufile);
+	bool supsu_ext = sup_ext && su_ext;
 
 	if (unlikely(nsup == 0))
 		return ret;
@@ -911,6 +933,13 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 				(~0UL << __NR_NILFS_SUINFO_UPDATE_FIELDS))
 			|| (nilfs_suinfo_update_nblocks(sup) &&
 				sup->sup_sui.sui_nblocks >
+				nilfs->ns_blocks_per_segment)
+			|| (nilfs_suinfo_update_nlive_blks(sup) && sup_ext &&
+				sup->sup_sui.sui_nlive_blks >
+				nilfs->ns_blocks_per_segment)
+			|| (nilfs_suinfo_update_nsnapshot_blks(sup) &&
+				sup_ext &&
+				sup->sup_sui.sui_nsnapshot_blks >
 				nilfs->ns_blocks_per_segment))
 			return -EINVAL;
 	}
@@ -938,6 +967,18 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 		if (nilfs_suinfo_update_nblocks(sup))
 			su->su_nblocks = cpu_to_le32(sup->sup_sui.sui_nblocks);
 
+		if (nilfs_suinfo_update_nlive_blks(sup) && supsu_ext)
+			su->su_nlive_blks =
+				cpu_to_le32(sup->sup_sui.sui_nlive_blks);
+
+		if (nilfs_suinfo_update_nsnapshot_blks(sup) && supsu_ext)
+			su->su_nsnapshot_blks =
+				cpu_to_le32(sup->sup_sui.sui_nsnapshot_blks);
+
+		if (nilfs_suinfo_update_nlive_lastmod(sup) && supsu_ext)
+			su->su_nlive_lastmod =
+				cpu_to_le64(sup->sup_sui.sui_nlive_lastmod);
+
 		if (nilfs_suinfo_update_flags(sup)) {
 			/*
 			 * Active flag is a virtual flag projected by running
diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index b8afd72..da78edf 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -28,6 +28,12 @@
 #include <linux/nilfs2_fs.h>
 #include "mdt.h"
 
+static inline int
+nilfs_sufile_live_blks_ext_supported(const struct inode *sufile)
+{
+	return NILFS_MDT(sufile)->mi_entry_size >=
+			NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE;
+}
 
 static inline unsigned long nilfs_sufile_get_nsegments(struct inode *sufile)
 {
diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
index ff3fea3..4800daa 100644
--- a/include/linux/nilfs2_fs.h
+++ b/include/linux/nilfs2_fs.h
@@ -220,9 +220,12 @@ struct nilfs_super_block {
  * If there is a bit set in the incompatible feature set that the kernel
  * doesn't know about, it should refuse to mount the filesystem.
  */
-#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT	0x00000001ULL
+#define NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT	BIT(0)
 
-#define NILFS_FEATURE_COMPAT_SUPP	0ULL
+#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		BIT(0)
+
+#define NILFS_FEATURE_COMPAT_SUPP					\
+			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT)
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
@@ -609,19 +612,34 @@ struct nilfs_cpfile_header {
 	  sizeof(struct nilfs_checkpoint) - 1) /			\
 			sizeof(struct nilfs_checkpoint))
 
+#ifndef offsetofend
+#define offsetofend(TYPE, MEMBER) \
+		(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
+#endif
+
 /**
  * struct nilfs_segment_usage - segment usage
  * @su_lastmod: last modified timestamp
  * @su_nblocks: number of blocks in segment
  * @su_flags: flags
+ * @su_nlive_blks: number of live blocks in the segment
+ * @su_nsnapshot_blks: number of blocks belonging to a snapshot in the segment
+ * @su_nlive_lastmod: timestamp nlive_blks was last modified
  */
 struct nilfs_segment_usage {
 	__le64 su_lastmod;
 	__le32 su_nblocks;
 	__le32 su_flags;
+	__le32 su_nlive_blks;
+	__le32 su_nsnapshot_blks;
+	__le64 su_nlive_lastmod;
 };
 
-#define NILFS_MIN_SEGMENT_USAGE_SIZE	16
+#define NILFS_MIN_SEGMENT_USAGE_SIZE	\
+	offsetofend(struct nilfs_segment_usage, su_flags)
+
+#define NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE	\
+	offsetofend(struct nilfs_segment_usage, su_nlive_lastmod)
 
 /* segment usage flag */
 enum {
@@ -658,11 +676,16 @@ NILFS_SEGMENT_USAGE_FNS(DIRTY, dirty)
 NILFS_SEGMENT_USAGE_FNS(ERROR, error)
 
 static inline void
-nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su)
+nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
 {
 	su->su_lastmod = cpu_to_le64(0);
 	su->su_nblocks = cpu_to_le32(0);
 	su->su_flags = cpu_to_le32(0);
+	if (susz >= NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE) {
+		su->su_nlive_blks = cpu_to_le32(0);
+		su->su_nsnapshot_blks = cpu_to_le32(0);
+		su->su_nlive_lastmod = cpu_to_le64(0);
+	}
 }
 
 static inline int
@@ -684,23 +707,33 @@ struct nilfs_sufile_header {
 	/* ... */
 };
 
-#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET	\
-	((sizeof(struct nilfs_sufile_header) +				\
-	  sizeof(struct nilfs_segment_usage) - 1) /			\
-			 sizeof(struct nilfs_segment_usage))
+#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET(susz)	\
+	((sizeof(struct nilfs_sufile_header) + (susz) - 1) / (susz))
 
 /**
  * nilfs_suinfo - segment usage information
  * @sui_lastmod: timestamp of last modification
  * @sui_nblocks: number of written blocks in segment
  * @sui_flags: segment usage flags
+ * @sui_nlive_blks: number of live blocks in the segment
+ * @sui_nsnapshot_blks: number of blocks belonging to a snapshot in the segment
+ * @sui_nlive_lastmod: timestamp nlive_blks was last modified
  */
 struct nilfs_suinfo {
 	__u64 sui_lastmod;
 	__u32 sui_nblocks;
 	__u32 sui_flags;
+	__u32 sui_nlive_blks;
+	__u32 sui_nsnapshot_blks;
+	__u64 sui_nlive_lastmod;
 };
 
+#define NILFS_MIN_SUINFO_SIZE	\
+	offsetofend(struct nilfs_suinfo, sui_flags)
+
+#define NILFS_LIVE_BLKS_EXT_SUINFO_SIZE	\
+	offsetofend(struct nilfs_suinfo, sui_nlive_lastmod)
+
 #define NILFS_SUINFO_FNS(flag, name)					\
 static inline int							\
 nilfs_suinfo_##name(const struct nilfs_suinfo *si)			\
@@ -736,6 +769,9 @@ enum {
 	NILFS_SUINFO_UPDATE_LASTMOD,
 	NILFS_SUINFO_UPDATE_NBLOCKS,
 	NILFS_SUINFO_UPDATE_FLAGS,
+	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
+	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
+	NILFS_SUINFO_UPDATE_NSNAPSHOT_BLKS,
 	__NR_NILFS_SUINFO_UPDATE_FIELDS,
 };
 
@@ -759,6 +795,17 @@ nilfs_suinfo_update_##name(const struct nilfs_suinfo_update *sup)	\
 NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
 NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
 NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
+NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
+NILFS_SUINFO_UPDATE_FNS(NSNAPSHOT_BLKS, nsnapshot_blks)
+NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
+
+#define NILFS_MIN_SUINFO_UPDATE_SIZE	\
+	(offsetofend(struct nilfs_suinfo_update, sup_reserved) + \
+	NILFS_MIN_SUINFO_SIZE)
+
+#define NILFS_LIVE_BLKS_EXT_SUINFO_UPDATE_SIZE	\
+	(offsetofend(struct nilfs_suinfo_update, sup_reserved) + \
+	NILFS_LIVE_BLKS_EXT_SUINFO_SIZE)
 
 enum {
 	NILFS_CHECKPOINT,
-- 
2.3.7



* [PATCH v2 3/9] nilfs2: introduce new feature flag for tracking live blocks
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-03 10:05   ` [PATCH v2 1/9] nilfs2: copy file system feature flags to the nilfs object Andreas Rohner
  2015-05-03 10:05   ` [PATCH v2 2/9] nilfs2: extend SUFILE on-disk format to enable tracking of live blocks Andreas Rohner
@ 2015-05-03 10:05   ` Andreas Rohner
       [not found]     ` <1430647522-14304-4-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-03 10:05   ` [PATCH v2 4/9] nilfs2: add kmem_cache for SUFILE cache nodes Andreas Rohner
                     ` (11 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch introduces a new file system feature flag
NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS. If it is enabled, the file system
will keep track of the number of live blocks per segment. This
information can be used by the GC to select segments for cleaning more
efficiently.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/the_nilfs.h     | 8 ++++++++
 include/linux/nilfs2_fs.h | 4 +++-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index 12cd91d..d755b6b 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -401,4 +401,12 @@ static inline int nilfs_flush_device(struct the_nilfs *nilfs)
 	return err;
 }
 
+static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
+{
+	const __u64 required_bits = NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
+				    NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT;
+
+	return ((nilfs->ns_feature_compat & required_bits) == required_bits);
+}
+
 #endif /* _THE_NILFS_H */
diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
index 4800daa..5f05bbf 100644
--- a/include/linux/nilfs2_fs.h
+++ b/include/linux/nilfs2_fs.h
@@ -221,11 +221,13 @@ struct nilfs_super_block {
  * doesn't know about, it should refuse to mount the filesystem.
  */
 #define NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT	BIT(0)
+#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		BIT(1)
 
 #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		BIT(0)
 
 #define NILFS_FEATURE_COMPAT_SUPP					\
-			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT)
+			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT |	\
+			 NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
-- 
2.3.7



* [PATCH v2 4/9] nilfs2: add kmem_cache for SUFILE cache nodes
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-05-03 10:05   ` [PATCH v2 3/9] nilfs2: introduce new feature flag for tracking " Andreas Rohner
@ 2015-05-03 10:05   ` Andreas Rohner
       [not found]     ` <1430647522-14304-5-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-03 10:05   ` [PATCH v2 5/9] nilfs2: add SUFILE cache for changes to su_nlive_blks field Andreas Rohner
                     ` (10 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch adds a kmem_cache to efficiently allocate SUFILE cache nodes.
Each cache node contains a fixed number of unsigned 32-bit values and
either a list_head, to chain a number of nodes together into a linked
list, or an rcu_head, to be able to free the node with an RCU
callback.

These cache nodes can be used to cache small changes to the SUFILE and
apply them later at segment construction.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/sufile.h | 14 ++++++++++++++
 fs/nilfs2/super.c  | 14 ++++++++++++++
 2 files changed, 28 insertions(+)

diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index da78edf..520614f 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -28,6 +28,20 @@
 #include <linux/nilfs2_fs.h>
 #include "mdt.h"
 
+#define NILFS_SUFILE_CACHE_NODE_SHIFT	6
+#define NILFS_SUFILE_CACHE_NODE_COUNT	(1 << NILFS_SUFILE_CACHE_NODE_SHIFT)
+
+struct nilfs_sufile_cache_node {
+	__u32 values[NILFS_SUFILE_CACHE_NODE_COUNT];
+	union {
+		struct rcu_head rcu_head;
+		struct list_head list_head;
+	};
+	unsigned long index;
+};
+
+extern struct kmem_cache *nilfs_sufile_node_cachep;
+
 static inline int
 nilfs_sufile_live_blks_ext_supported(const struct inode *sufile)
 {
diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
index f47585b..97a30db 100644
--- a/fs/nilfs2/super.c
+++ b/fs/nilfs2/super.c
@@ -71,6 +71,7 @@ static struct kmem_cache *nilfs_inode_cachep;
 struct kmem_cache *nilfs_transaction_cachep;
 struct kmem_cache *nilfs_segbuf_cachep;
 struct kmem_cache *nilfs_btree_path_cache;
+struct kmem_cache *nilfs_sufile_node_cachep;
 
 static int nilfs_setup_super(struct super_block *sb, int is_mount);
 static int nilfs_remount(struct super_block *sb, int *flags, char *data);
@@ -1397,6 +1398,11 @@ static void nilfs_segbuf_init_once(void *obj)
 	memset(obj, 0, sizeof(struct nilfs_segment_buffer));
 }
 
+static void nilfs_sufile_cache_node_init_once(void *obj)
+{
+	memset(obj, 0, sizeof(struct nilfs_sufile_cache_node));
+}
+
 static void nilfs_destroy_cachep(void)
 {
 	/*
@@ -1413,6 +1419,8 @@ static void nilfs_destroy_cachep(void)
 		kmem_cache_destroy(nilfs_segbuf_cachep);
 	if (nilfs_btree_path_cache)
 		kmem_cache_destroy(nilfs_btree_path_cache);
+	if (nilfs_sufile_node_cachep)
+		kmem_cache_destroy(nilfs_sufile_node_cachep);
 }
 
 static int __init nilfs_init_cachep(void)
@@ -1441,6 +1449,12 @@ static int __init nilfs_init_cachep(void)
 	if (!nilfs_btree_path_cache)
 		goto fail;
 
+	nilfs_sufile_node_cachep = kmem_cache_create("nilfs_sufile_node_cache",
+			sizeof(struct nilfs_sufile_cache_node), 0, 0,
+			nilfs_sufile_cache_node_init_once);
+	if (!nilfs_sufile_node_cachep)
+		goto fail;
+
 	return 0;
 
 fail:
-- 
2.3.7



* [PATCH v2 5/9] nilfs2: add SUFILE cache for changes to su_nlive_blks field
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-05-03 10:05   ` [PATCH v2 4/9] nilfs2: add kmem_cache for SUFILE cache nodes Andreas Rohner
@ 2015-05-03 10:05   ` Andreas Rohner
       [not found]     ` <1430647522-14304-6-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-03 10:05   ` [PATCH v2 6/9] nilfs2: add tracking of block deletions and updates Andreas Rohner
                     ` (9 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch adds a cache for the SUFILE to efficiently store lots of
small changes to su_nlive_blks in memory and apply the accumulated
results later at segment construction. This improves the performance of
these operations and reduces lock contention in the SUFILE.

The implementation uses a radix_tree to store cache nodes, each of
which contains a fixed number of values. Every value corresponds to
exactly one SUFILE entry. When the cache is flushed, the values are
subtracted from the su_nlive_blks field of the corresponding SUFILE
entry.

If the parameter only_mark of the function nilfs_sufile_flush_cache() is
set, then the blocks that would have been dirtied by the flush are
marked as dirty, but nothing is actually written to them. This mode is
useful during segment construction, when blocks need to be marked dirty
in advance.

New nodes are allocated on demand. The lookup of nodes is protected by
rcu_read_lock() and the modification of values by a blockgroup lock.
This should allow concurrent updates to the cache.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/sufile.c | 369 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nilfs2/sufile.h |   5 +
 2 files changed, 374 insertions(+)

diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 1cce358..80bbd87 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -26,6 +26,7 @@
 #include <linux/string.h>
 #include <linux/buffer_head.h>
 #include <linux/errno.h>
+#include <linux/radix-tree.h>
 #include <linux/nilfs2_fs.h>
 #include "mdt.h"
 #include "sufile.h"
@@ -42,6 +43,11 @@ struct nilfs_sufile_info {
 	unsigned long ncleansegs;/* number of clean segments */
 	__u64 allocmin;		/* lower limit of allocatable segment range */
 	__u64 allocmax;		/* upper limit of allocatable segment range */
+
+	struct blockgroup_lock nlive_blks_cache_bgl;
+	spinlock_t nlive_blks_cache_lock;
+	int nlive_blks_cache_dirty;
+	struct radix_tree_root nlive_blks_cache;
 };
 
 static inline struct nilfs_sufile_info *NILFS_SUI(struct inode *sufile)
@@ -1194,6 +1200,362 @@ out_sem:
 }
 
 /**
+ * nilfs_sufile_alloc_cache_node - allocate and insert a new cache node
+ * @sufile: inode of segment usage file
+ * @group: group to allocate a node for
+ *
+ * Description: Allocates a new cache node and inserts it into the cache. If
+ * there is an error, nothing will be allocated. If there already exists
+ * a node for @group, no new node will be allocated.
+ *
+ * Return Value: On success, 0 is returned, on error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+static int nilfs_sufile_alloc_cache_node(struct inode *sufile,
+					 unsigned long group)
+{
+	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
+	struct nilfs_sufile_cache_node *node;
+	int ret;
+
+	node = kmem_cache_alloc(nilfs_sufile_node_cachep, GFP_NOFS);
+	if (!node)
+		return -ENOMEM;
+
+	ret = radix_tree_preload(GFP_NOFS);
+	if (ret)
+		goto free_node;
+
+	spin_lock(&sui->nlive_blks_cache_lock);
+	ret = radix_tree_insert(&sui->nlive_blks_cache, group, node);
+	spin_unlock(&sui->nlive_blks_cache_lock);
+
+	radix_tree_preload_end();
+
+	if (ret == -EEXIST) {
+		ret = 0;
+		goto free_node;
+	} else if (ret)
+		goto free_node;
+
+	return 0;
+free_node:
+	kmem_cache_free(nilfs_sufile_node_cachep, node);
+	return ret;
+}
+
+/**
+ * nilfs_sufile_dec_nlive_blks - decrements nlive_blks in the cache
+ * @sufile: inode of segment usage file
+ * @segnum: segnum for which nlive_blks will be decremented
+ *
+ * Description: Decrements the number of live blocks for @segnum in the cache.
+ * This function only affects the cache. If the cache is not flushed at a
+ * later time, the changes are lost. It tries to look up the group node to
+ * which @segnum belongs in a lock-free manner and uses a blockgroup lock
+ * to do the actual modification on the node.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum)
+{
+	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
+	struct nilfs_sufile_cache_node *node;
+	spinlock_t *lock;
+	unsigned long group;
+	int ret;
+
+	group = (unsigned long)(segnum >> NILFS_SUFILE_CACHE_NODE_SHIFT);
+
+try_again:
+	rcu_read_lock();
+	node = radix_tree_lookup(&sui->nlive_blks_cache, group);
+	if (!node) {
+		rcu_read_unlock();
+
+		ret = nilfs_sufile_alloc_cache_node(sufile, group);
+		if (ret)
+			return ret;
+
+		/*
+		 * It is important to acquire the rcu_read_lock() before using
+		 * the node pointer
+		 */
+		goto try_again;
+	}
+
+	lock = bgl_lock_ptr(&sui->nlive_blks_cache_bgl, (unsigned int)group);
+	spin_lock(lock);
+	node->values[segnum & ((1 << NILFS_SUFILE_CACHE_NODE_SHIFT) - 1)] += 1;
+	sui->nlive_blks_cache_dirty = 1;
+	spin_unlock(lock);
+	rcu_read_unlock();
+
+	return 0;
+}
+
+/**
+ * nilfs_sufile_flush_cache_node - flushes one cache node to the SUFILE
+ * @sufile: inode of segment usage file
+ * @node: cache node to flush
+ * @only_mark: do not write anything, but mark the blocks as dirty
+ * @pndirty_blks: pointer to return number of dirtied blocks
+ *
+ * Description: Flushes one cache node to the SUFILE and also clears the cache
+ * node at the same time. If @only_mark is 1, nothing is written to the
+ * SUFILE, but the blocks are still marked as dirty. This is useful to mark
+ * the blocks in one phase of the segment creation and write them in another.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOMEM - Insufficient memory available.
+ *
+ * %-EIO - I/O error
+ *
+ * %-EROFS - Read only filesystem (for create mode)
+ */
+static int nilfs_sufile_flush_cache_node(struct inode *sufile,
+					 struct nilfs_sufile_cache_node *node,
+					 int only_mark,
+					 unsigned long *pndirty_blks)
+{
+	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
+	struct buffer_head *su_bh;
+	struct nilfs_segment_usage *su;
+	spinlock_t *lock;
+	void *kaddr;
+	size_t n, i, j;
+	size_t susz = NILFS_MDT(sufile)->mi_entry_size;
+	__u64 segnum, seg_start, nsegs;
+	__u32 nlive_blocks, value;
+	unsigned long secs = get_seconds(), ndirty_blks = 0;
+	int ret, dirty;
+
+	nsegs = nilfs_sufile_get_nsegments(sufile);
+	seg_start = node->index << NILFS_SUFILE_CACHE_NODE_SHIFT;
+	lock = bgl_lock_ptr(&sui->nlive_blks_cache_bgl, node->index);
+
+	for (i = 0; i < NILFS_SUFILE_CACHE_NODE_COUNT;) {
+		segnum = seg_start + i;
+		if (segnum >= nsegs)
+			break;
+
+		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
+				seg_start + NILFS_SUFILE_CACHE_NODE_COUNT - 1);
+
+		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
+							   0, &su_bh);
+		if (ret < 0) {
+			if (ret != -ENOENT)
+				return ret;
+			/* hole */
+			i += n;
+			continue;
+		}
+
+		if (only_mark && buffer_dirty(su_bh)) {
+			/* buffer already dirty */
+			put_bh(su_bh);
+			i += n;
+			continue;
+		}
+
+		spin_lock(lock);
+		kaddr = kmap_atomic(su_bh->b_page);
+
+		dirty = 0;
+		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
+							  su_bh, kaddr);
+		for (j = 0; j < n; ++j, ++i, su = (void *)su + susz) {
+			value = node->values[i];
+			if (!value)
+				continue;
+			if (!only_mark)
+				node->values[i] = 0;
+
+			WARN_ON(nilfs_segment_usage_error(su));
+
+			nlive_blocks = le32_to_cpu(su->su_nlive_blks);
+			if (!nlive_blocks)
+				continue;
+
+			dirty = 1;
+			if (only_mark) {
+				i += n - j;
+				break;
+			}
+
+			if (nlive_blocks <= value)
+				nlive_blocks = 0;
+			else
+				nlive_blocks -= value;
+
+			su->su_nlive_blks = cpu_to_le32(nlive_blocks);
+			su->su_nlive_lastmod = cpu_to_le64(secs);
+		}
+
+		kunmap_atomic(kaddr);
+		spin_unlock(lock);
+
+		if (dirty && !buffer_dirty(su_bh)) {
+			mark_buffer_dirty(su_bh);
+			nilfs_mdt_mark_dirty(sufile);
+			++ndirty_blks;
+		}
+
+		put_bh(su_bh);
+	}
+
+	*pndirty_blks += ndirty_blks;
+	return 0;
+}
+
+/**
+ * nilfs_sufile_flush_cache - flushes cache to the SUFILE
+ * @sufile: inode of segment usage file
+ * @only_mark: do not write anything, but mark the blocks as dirty
+ * @pndirty_blks: pointer to return number of dirtied blocks
+ *
+ * Description: Flushes the whole cache to the SUFILE and also clears it
+ * at the same time. If @only_mark is 1, nothing is written to the
+ * SUFILE, but the blocks are still marked as dirty. This is useful to mark
+ * the blocks in one phase of the segment creation and write them in another.
+ * If there are concurrent inserts into the cache, it cannot be guaranteed
+ * that everything is flushed when the function returns.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOMEM - Insufficient memory available.
+ *
+ * %-EIO - I/O error
+ *
+ * %-EROFS - Read only filesystem (for create mode)
+ */
+int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
+			     unsigned long *pndirty_blks)
+{
+	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
+	struct nilfs_sufile_cache_node *node;
+	LIST_HEAD(nodes);
+	struct radix_tree_iter iter;
+	void **slot;
+	unsigned long ndirty_blks = 0;
+	int ret = 0;
+
+	if (!sui->nlive_blks_cache_dirty)
+		goto out;
+
+	down_write(&NILFS_MDT(sufile)->mi_sem);
+
+	/* prevent concurrent inserts */
+	spin_lock(&sui->nlive_blks_cache_lock);
+	radix_tree_for_each_slot(slot, &sui->nlive_blks_cache, &iter, 0) {
+		node = radix_tree_deref_slot_protected(slot,
+				&sui->nlive_blks_cache_lock);
+		if (!node)
+			continue;
+		if (radix_tree_exception(node))
+			continue;
+
+		list_add(&node->list_head, &nodes);
+		node->index = iter.index;
+	}
+	if (!only_mark)
+		sui->nlive_blks_cache_dirty = 0;
+	spin_unlock(&sui->nlive_blks_cache_lock);
+
+	list_for_each_entry(node, &nodes, list_head) {
+		ret = nilfs_sufile_flush_cache_node(sufile, node, only_mark,
+						    &ndirty_blks);
+		if (ret)
+			goto out_sem;
+	}
+
+out_sem:
+	up_write(&NILFS_MDT(sufile)->mi_sem);
+out:
+	if (pndirty_blks)
+		*pndirty_blks = ndirty_blks;
+	return ret;
+}
+
+/**
+ * nilfs_sufile_cache_dirty - is the sufile cache dirty
+ * @sufile: inode of segment usage file
+ *
+ * Description: Returns whether the sufile cache is dirty. If this flag is
+ * true, the cache contains unflushed content.
+ *
+ * Return Value: If the cache is not dirty, 0 is returned; otherwise,
+ * 1 is returned.
+ */
+int nilfs_sufile_cache_dirty(struct inode *sufile)
+{
+	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
+
+	return sui->nlive_blks_cache_dirty;
+}
+
+/**
+ * nilfs_sufile_cache_node_release_rcu - RCU callback to free a cache node
+ * @head: rcu head
+ *
+ * Description: RCU callback function to free a cache node.
+ */
+static void nilfs_sufile_cache_node_release_rcu(struct rcu_head *head)
+{
+	struct nilfs_sufile_cache_node *node;
+
+	node = container_of(head, struct nilfs_sufile_cache_node, rcu_head);
+
+	kmem_cache_free(nilfs_sufile_node_cachep, node);
+}
+
+/**
+ * nilfs_sufile_shrink_cache - free all cache nodes
+ * @sufile: inode of segment usage file
+ *
+ * Description: Frees all cache nodes in the cache regardless of their
+ * content. The content will not be flushed and may be lost. This function
+ * is intended to free up memory after the cache was flushed by
+ * nilfs_sufile_flush_cache().
+ */
+void nilfs_sufile_shrink_cache(struct inode *sufile)
+{
+	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
+	struct nilfs_sufile_cache_node *node;
+	struct radix_tree_iter iter;
+	void **slot;
+
+	/* prevent flush from running at the same time */
+	down_read(&NILFS_MDT(sufile)->mi_sem);
+	/* prevent concurrent inserts */
+	spin_lock(&sui->nlive_blks_cache_lock);
+
+	radix_tree_for_each_slot(slot, &sui->nlive_blks_cache, &iter, 0) {
+		node = radix_tree_deref_slot_protected(slot,
+				&sui->nlive_blks_cache_lock);
+		if (!node)
+			continue;
+		if (radix_tree_exception(node))
+			continue;
+
+		radix_tree_delete(&sui->nlive_blks_cache, iter.index);
+		call_rcu(&node->rcu_head, nilfs_sufile_cache_node_release_rcu);
+	}
+
+	spin_unlock(&sui->nlive_blks_cache_lock);
+	up_read(&NILFS_MDT(sufile)->mi_sem);
+}
+
+/**
  * nilfs_sufile_read - read or get sufile inode
  * @sb: super block instance
  * @susize: size of a segment usage entry
@@ -1253,6 +1615,13 @@ int nilfs_sufile_read(struct super_block *sb, size_t susize,
 	sui->allocmax = nilfs_sufile_get_nsegments(sufile) - 1;
 	sui->allocmin = 0;
 
+	if (nilfs_feature_track_live_blks(sb->s_fs_info)) {
+		bgl_lock_init(&sui->nlive_blks_cache_bgl);
+		spin_lock_init(&sui->nlive_blks_cache_lock);
+		INIT_RADIX_TREE(&sui->nlive_blks_cache, GFP_ATOMIC);
+	}
+	sui->nlive_blks_cache_dirty = 0;
+
 	unlock_new_inode(sufile);
  out:
 	*inodep = sufile;
diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index 520614f..662ab56 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -87,6 +87,11 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
 int nilfs_sufile_read(struct super_block *sb, size_t susize,
 		      struct nilfs_inode *raw_inode, struct inode **inodep);
 int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
+int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum);
+void nilfs_sufile_shrink_cache(struct inode *sufile);
+int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
+			     unsigned long *pndirty_blks);
+int nilfs_sufile_cache_dirty(struct inode *sufile);
 
 /**
  * nilfs_sufile_scrap - make a segment garbage
-- 
2.3.7

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 6/9] nilfs2: add tracking of block deletions and updates
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch adds tracking of block deletions and updates for all files.
It uses the fact that NILFS2 keeps an entry for every block in the DAT
file, which records the checkpoints at which the block was created and
deleted or overwritten. So whenever a block is deleted or overwritten,
nilfs_dat_commit_end() is called to update the DAT entry. At this
point the patch simply decrements the su_nlive_blks field of the
corresponding segment. The value of su_nlive_blks is set at segment
creation time.

The DAT file itself has of course no DAT entries for its own blocks, but
it still has to propagate deletions and updates to its btree. When this
happens this patch again decrements the su_nlive_blks field of the
corresponding segment.

The new feature compatibility flag NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS
can be used to enable or disable the block tracking at any time.
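
When the cached decrements are applied, the patch clamps the result (see
the nilfs_sufile_set_segment_usage() hunk in this patch): the signed delta
is added to the stored counter and saturated to the range [0, nblocks], so
repeated decrements for snapshot-protected blocks cannot underflow the
counter. A minimal sketch of that clamp, with `apply_nlive_delta` as a
hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Add a signed delta to a stored 32-bit live-block counter and clamp
 * the result to [0, nblocks], mirroring the min_t/max_t clamp used in
 * nilfs_sufile_set_segment_usage().
 */
static uint32_t apply_nlive_delta(uint32_t su_nlive_blks, int64_t delta,
				  uint32_t nblocks)
{
	int64_t v = (int64_t)su_nlive_blks + delta;

	if (v < 0)
		v = 0;
	if (v > (int64_t)nblocks)
		v = nblocks;
	return (uint32_t)v;
}
```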

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/btree.c   | 33 ++++++++++++++++++++++++++++++---
 fs/nilfs2/dat.c     | 15 +++++++++++++--
 fs/nilfs2/direct.c  | 20 +++++++++++++++-----
 fs/nilfs2/page.c    |  6 ++++--
 fs/nilfs2/page.h    |  3 +++
 fs/nilfs2/segbuf.c  |  3 +++
 fs/nilfs2/segbuf.h  |  5 +++++
 fs/nilfs2/segment.c | 48 +++++++++++++++++++++++++++++++++++++-----------
 fs/nilfs2/sufile.c  | 17 ++++++++++++++++-
 fs/nilfs2/sufile.h  |  3 ++-
 10 files changed, 128 insertions(+), 25 deletions(-)

diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
index 059f371..d3b2763 100644
--- a/fs/nilfs2/btree.c
+++ b/fs/nilfs2/btree.c
@@ -30,6 +30,7 @@
 #include "btree.h"
 #include "alloc.h"
 #include "dat.h"
+#include "sufile.h"
 
 static void __nilfs_btree_init(struct nilfs_bmap *bmap);
 
@@ -1889,9 +1890,35 @@ static int nilfs_btree_propagate_p(struct nilfs_bmap *btree,
 				   int level,
 				   struct buffer_head *bh)
 {
-	while ((++level < nilfs_btree_height(btree) - 1) &&
-	       !buffer_dirty(path[level].bp_bh))
-		mark_buffer_dirty(path[level].bp_bh);
+	struct the_nilfs *nilfs = btree->b_inode->i_sb->s_fs_info;
+	struct nilfs_btree_node *node;
+	__u64 ptr, segnum;
+	int ncmax, vol, counted;
+
+	vol = buffer_nilfs_volatile(bh);
+	counted = buffer_nilfs_counted(bh);
+	set_buffer_nilfs_counted(bh);
+
+	while (++level < nilfs_btree_height(btree)) {
+		if (!vol && !counted && nilfs_feature_track_live_blks(nilfs)) {
+			node = nilfs_btree_get_node(btree, path, level, &ncmax);
+			ptr = nilfs_btree_node_get_ptr(node,
+						       path[level].bp_index,
+						       ncmax);
+			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
+			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
+		}
+
+		if (path[level].bp_bh) {
+			if (buffer_dirty(path[level].bp_bh))
+				break;
+
+			mark_buffer_dirty(path[level].bp_bh);
+			vol = buffer_nilfs_volatile(path[level].bp_bh);
+			counted = buffer_nilfs_counted(path[level].bp_bh);
+			set_buffer_nilfs_counted(path[level].bp_bh);
+		}
+	}
 
 	return 0;
 }
diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index 0d5fada..9c2fc32 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -28,6 +28,7 @@
 #include "mdt.h"
 #include "alloc.h"
 #include "dat.h"
+#include "sufile.h"
 
 
 #define NILFS_CNO_MIN	((__u64)1)
@@ -188,9 +189,10 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
 			  int dead)
 {
 	struct nilfs_dat_entry *entry;
-	__u64 start, end;
+	__u64 start, end, segnum;
 	sector_t blocknr;
 	void *kaddr;
+	struct the_nilfs *nilfs;
 
 	kaddr = kmap_atomic(req->pr_entry_bh->b_page);
 	entry = nilfs_palloc_block_get_entry(dat, req->pr_entry_nr,
@@ -206,8 +208,17 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
 
 	if (blocknr == 0)
 		nilfs_dat_commit_free(dat, req);
-	else
+	else {
 		nilfs_dat_commit_entry(dat, req);
+
+		nilfs = dat->i_sb->s_fs_info;
+
+		if (nilfs_feature_track_live_blks(nilfs)) {
+			segnum = nilfs_get_segnum_of_block(nilfs, blocknr);
+			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
+		}
+	}
+
 }
 
 void nilfs_dat_abort_end(struct inode *dat, struct nilfs_palloc_req *req)
diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c
index ebf89fd..42704eb 100644
--- a/fs/nilfs2/direct.c
+++ b/fs/nilfs2/direct.c
@@ -26,6 +26,7 @@
 #include "direct.h"
 #include "alloc.h"
 #include "dat.h"
+#include "sufile.h"
 
 static inline __le64 *nilfs_direct_dptrs(const struct nilfs_bmap *direct)
 {
@@ -268,18 +269,27 @@ int nilfs_direct_delete_and_convert(struct nilfs_bmap *bmap,
 static int nilfs_direct_propagate(struct nilfs_bmap *bmap,
 				  struct buffer_head *bh)
 {
+	struct the_nilfs *nilfs = bmap->b_inode->i_sb->s_fs_info;
 	struct nilfs_palloc_req oldreq, newreq;
 	struct inode *dat;
-	__u64 key;
-	__u64 ptr;
+	__u64 key, ptr, segnum;
 	int ret;
 
-	if (!NILFS_BMAP_USE_VBN(bmap))
-		return 0;
-
 	dat = nilfs_bmap_get_dat(bmap);
 	key = nilfs_bmap_data_get_key(bmap, bh);
 	ptr = nilfs_direct_get_ptr(bmap, key);
+
+	if (unlikely(!NILFS_BMAP_USE_VBN(bmap))) {
+		if (!buffer_nilfs_volatile(bh) && !buffer_nilfs_counted(bh) &&
+				nilfs_feature_track_live_blks(nilfs)) {
+			set_buffer_nilfs_counted(bh);
+			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
+
+			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
+		}
+		return 0;
+	}
+
 	if (!buffer_nilfs_volatile(bh)) {
 		oldreq.pr_entry_nr = ptr;
 		newreq.pr_entry_nr = ptr;
diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
index 45d650a..fd21b43 100644
--- a/fs/nilfs2/page.c
+++ b/fs/nilfs2/page.c
@@ -92,7 +92,8 @@ void nilfs_forget_buffer(struct buffer_head *bh)
 	const unsigned long clear_bits =
 		(1 << BH_Uptodate | 1 << BH_Dirty | 1 << BH_Mapped |
 		 1 << BH_Async_Write | 1 << BH_NILFS_Volatile |
-		 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected);
+		 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected |
+		 1 << BH_NILFS_Counted);
 
 	lock_buffer(bh);
 	set_mask_bits(&bh->b_state, clear_bits, 0);
@@ -422,7 +423,8 @@ void nilfs_clear_dirty_page(struct page *page, bool silent)
 		const unsigned long clear_bits =
 			(1 << BH_Uptodate | 1 << BH_Dirty | 1 << BH_Mapped |
 			 1 << BH_Async_Write | 1 << BH_NILFS_Volatile |
-			 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected);
+			 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected |
+			 1 << BH_NILFS_Counted);
 
 		bh = head = page_buffers(page);
 		do {
diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
index a43b828..4e35814 100644
--- a/fs/nilfs2/page.h
+++ b/fs/nilfs2/page.h
@@ -36,12 +36,15 @@ enum {
 	BH_NILFS_Volatile,
 	BH_NILFS_Checked,
 	BH_NILFS_Redirected,
+	BH_NILFS_Counted,
 };
 
 BUFFER_FNS(NILFS_Node, nilfs_node)		/* nilfs node buffers */
 BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
 BUFFER_FNS(NILFS_Checked, nilfs_checked)	/* buffer is verified */
 BUFFER_FNS(NILFS_Redirected, nilfs_redirected)	/* redirected to a copy */
+/* counted by propagate_p for segment usage */
+BUFFER_FNS(NILFS_Counted, nilfs_counted)
 
 
 int __nilfs_clear_page_dirty(struct page *);
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index dc3a9efd..dabb65b 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -57,6 +57,9 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
 	INIT_LIST_HEAD(&segbuf->sb_segsum_buffers);
 	INIT_LIST_HEAD(&segbuf->sb_payload_buffers);
 	segbuf->sb_super_root = NULL;
+	segbuf->sb_flags = 0;
+	segbuf->sb_nlive_blks = 0;
+	segbuf->sb_nsnapshot_blks = 0;
 
 	init_completion(&segbuf->sb_bio_event);
 	atomic_set(&segbuf->sb_err, 0);
diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
index b04f08c..a802f61 100644
--- a/fs/nilfs2/segbuf.h
+++ b/fs/nilfs2/segbuf.h
@@ -83,6 +83,9 @@ struct nilfs_segment_buffer {
 	sector_t		sb_fseg_start, sb_fseg_end;
 	sector_t		sb_pseg_start;
 	unsigned		sb_rest_blocks;
+	int			sb_flags;
+	__u32			sb_nlive_blks;
+	__u32			sb_nsnapshot_blks;
 
 	/* Buffers */
 	struct list_head	sb_segsum_buffers;
@@ -95,6 +98,8 @@ struct nilfs_segment_buffer {
 	struct completion	sb_bio_event;
 };
 
+#define NILFS_SEGBUF_SUSET	BIT(0)	/* segment usage has been set */
+
 #define NILFS_LIST_SEGBUF(head)  \
 	list_entry((head), struct nilfs_segment_buffer, sb_list)
 #define NILFS_NEXT_SEGBUF(segbuf)  NILFS_LIST_SEGBUF((segbuf)->sb_list.next)
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index c6abbad9..14e76c3 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -762,7 +762,8 @@ static int nilfs_test_metadata_dirty(struct the_nilfs *nilfs,
 		ret++;
 	if (nilfs_mdt_fetch_dirty(nilfs->ns_cpfile))
 		ret++;
-	if (nilfs_mdt_fetch_dirty(nilfs->ns_sufile))
+	if (nilfs_mdt_fetch_dirty(nilfs->ns_sufile) ||
+	    nilfs_sufile_cache_dirty(nilfs->ns_sufile))
 		ret++;
 	if ((ret || nilfs_doing_gc()) && nilfs_mdt_fetch_dirty(nilfs->ns_dat))
 		ret++;
@@ -1368,36 +1369,49 @@ static void nilfs_free_incomplete_logs(struct list_head *logs,
 }
 
 static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
-					  struct inode *sufile)
+					  struct the_nilfs *nilfs)
 {
 	struct nilfs_segment_buffer *segbuf;
-	unsigned long live_blocks;
+	struct inode *sufile = nilfs->ns_sufile;
+	unsigned long nblocks;
 	int ret;
 
 	list_for_each_entry(segbuf, &sci->sc_segbufs, sb_list) {
-		live_blocks = segbuf->sb_sum.nblocks +
+		nblocks = segbuf->sb_sum.nblocks +
 			(segbuf->sb_pseg_start - segbuf->sb_fseg_start);
 		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
-						     live_blocks,
+						     nblocks,
+						     segbuf->sb_nlive_blks,
+						     segbuf->sb_nsnapshot_blks,
 						     sci->sc_seg_ctime);
 		WARN_ON(ret); /* always succeed because the segusage is dirty */
+
+		segbuf->sb_flags |= NILFS_SEGBUF_SUSET;
 	}
 }
 
-static void nilfs_cancel_segusage(struct list_head *logs, struct inode *sufile)
+static void nilfs_cancel_segusage(struct list_head *logs,
+				  struct the_nilfs *nilfs)
 {
 	struct nilfs_segment_buffer *segbuf;
+	struct inode *sufile = nilfs->ns_sufile;
+	__s64 nlive_blks = 0, nsnapshot_blks = 0;
 	int ret;
 
 	segbuf = NILFS_FIRST_SEGBUF(logs);
+	if (segbuf->sb_flags & NILFS_SEGBUF_SUSET) {
+		nlive_blks = -(__s64)segbuf->sb_nlive_blks;
+		nsnapshot_blks = -(__s64)segbuf->sb_nsnapshot_blks;
+	}
 	ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
 					     segbuf->sb_pseg_start -
-					     segbuf->sb_fseg_start, 0);
+					     segbuf->sb_fseg_start,
+					     nlive_blks, nsnapshot_blks, 0);
 	WARN_ON(ret); /* always succeed because the segusage is dirty */
 
 	list_for_each_entry_continue(segbuf, logs, sb_list) {
 		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
-						     0, 0);
+						     0, 0, 0, 0);
 		WARN_ON(ret); /* always succeed */
 	}
 }
@@ -1499,6 +1513,7 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
 	if (!nfinfo)
 		goto out;
 
+	segbuf->sb_nlive_blks = segbuf->sb_sum.nfileblk;
 	blocknr = segbuf->sb_pseg_start + segbuf->sb_sum.nsumblk;
 	ssp.bh = NILFS_SEGBUF_FIRST_BH(&segbuf->sb_segsum_buffers);
 	ssp.offset = sizeof(struct nilfs_segment_summary);
@@ -1728,7 +1743,7 @@ static void nilfs_segctor_abort_construction(struct nilfs_sc_info *sci,
 	nilfs_abort_logs(&logs, ret ? : err);
 
 	list_splice_tail_init(&sci->sc_segbufs, &logs);
-	nilfs_cancel_segusage(&logs, nilfs->ns_sufile);
+	nilfs_cancel_segusage(&logs, nilfs);
 	nilfs_free_incomplete_logs(&logs, nilfs);
 
 	if (sci->sc_stage.flags & NILFS_CF_SUFREED) {
@@ -1790,7 +1805,8 @@ static void nilfs_segctor_complete_write(struct nilfs_sc_info *sci)
 			const unsigned long clear_bits =
 				(1 << BH_Dirty | 1 << BH_Async_Write |
 				 1 << BH_Delay | 1 << BH_NILFS_Volatile |
-				 1 << BH_NILFS_Redirected);
+				 1 << BH_NILFS_Redirected |
+				 1 << BH_NILFS_Counted);
 
 			set_mask_bits(&bh->b_state, clear_bits, set_bits);
 			if (bh == segbuf->sb_super_root) {
@@ -1995,7 +2011,14 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
 
 			nilfs_segctor_fill_in_super_root(sci, nilfs);
 		}
-		nilfs_segctor_update_segusage(sci, nilfs->ns_sufile);
+
+		if (nilfs_feature_track_live_blks(nilfs)) {
+			err = nilfs_sufile_flush_cache(nilfs->ns_sufile, 0,
+						       NULL);
+			if (unlikely(err))
+				goto failed_to_write;
+		}
+		nilfs_segctor_update_segusage(sci, nilfs);
 
 		/* Write partial segments */
 		nilfs_segctor_prepare_write(sci);
@@ -2022,6 +2045,9 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
 		}
 	} while (sci->sc_stage.scnt != NILFS_ST_DONE);
 
+	if (nilfs_feature_track_live_blks(nilfs))
+		nilfs_sufile_shrink_cache(nilfs->ns_sufile);
+
  out:
 	nilfs_segctor_drop_written_files(sci, nilfs);
 	return err;
diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 80bbd87..9cd8820d 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -527,10 +527,13 @@ int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum)
  * @sufile: inode of segment usage file
  * @segnum: segment number
  * @nblocks: number of live blocks in the segment
+ * @nlive_blks: number of live blocks to add to the su_nlive_blks field
+ * @nsnapshot_blks: number of snapshot blocks to add to su_nsnapshot_blks
  * @modtime: modification time (option)
  */
 int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
-				   unsigned long nblocks, time_t modtime)
+				   unsigned long nblocks, __s64 nlive_blks,
+				   __s64 nsnapshot_blks, time_t modtime)
 {
 	struct buffer_head *bh;
 	struct nilfs_segment_usage *su;
@@ -548,6 +551,18 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
 	if (modtime)
 		su->su_lastmod = cpu_to_le64(modtime);
 	su->su_nblocks = cpu_to_le32(nblocks);
+
+	if (nilfs_sufile_live_blks_ext_supported(sufile)) {
+		nsnapshot_blks += le32_to_cpu(su->su_nsnapshot_blks);
+		nsnapshot_blks = min_t(__s64, max_t(__s64, nsnapshot_blks, 0),
+				       nblocks);
+		su->su_nsnapshot_blks = cpu_to_le32(nsnapshot_blks);
+
+		nlive_blks += le32_to_cpu(su->su_nlive_blks);
+		nlive_blks = min_t(__s64, max_t(__s64, nlive_blks, 0), nblocks);
+		su->su_nlive_blks = cpu_to_le32(nlive_blks);
+	}
+
 	kunmap_atomic(kaddr);
 
 	mark_buffer_dirty(bh);
diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index 662ab56..3466abb 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -60,7 +60,8 @@ int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
 int nilfs_sufile_alloc(struct inode *, __u64 *);
 int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
 int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
-				   unsigned long nblocks, time_t modtime);
+				   unsigned long nblocks, __s64 nlive_blks,
+				   __s64 nsnapshot_blks, time_t modtime);
 int nilfs_sufile_get_stat(struct inode *, struct nilfs_sustat *);
 ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
 				size_t);
-- 
2.3.7


* [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch ensures that all dirty blocks are written out if the segment
construction mode is SC_LSEG_SR. Scanning the DAT file can dirty blocks
in the SUFILE, and newly dirtied blocks in the SUFILE can in turn dirty
more blocks in the DAT file. Since one of these stages has to happen
before the other during segment construction, we can end up with
unwritten dirty blocks that are lost in case of a file system unmount.

This patch introduces a new set of file scanning operations that
only propagate the changes to the bmap and do not add anything to the
segment buffer. The DAT file and SUFILE are scanned with these
operations. The function nilfs_sufile_flush_cache() is called in between
these scans with the parameter only_mark set. That way it can be called
repeatedly without actually writing anything to the SUFILE. If no new
blocks are dirtied by the flush, the normal segment construction stages
can safely continue.
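
The flush-and-rescan cycle described above is essentially a fixed-point
iteration with a retry limit. A minimal userspace sketch, where
`propagate_dat` and `mark_sufile` are hypothetical stand-ins for the DAT
scan with the propagate-only operations and for nilfs_sufile_flush_cache()
in only_mark mode (the SUFILE rescan step is folded into the model):

```c
#include <assert.h>

#define PROP_RETRY 10	/* retry limit, mirrors NILFS_SC_SUFILE_PROP_RETRY */

/*
 * Run the propagate/mark cycle until no new blocks are dirtied or the
 * retry limit is reached. propagate_dat() propagates pending DAT changes;
 * mark_sufile() marks the SUFILE blocks needed by a later flush as dirty
 * and returns how many blocks it newly dirtied.
 */
static int propagate_until_stable(int (*propagate_dat)(void),
				  unsigned long (*mark_sufile)(void))
{
	int retry = PROP_RETRY;
	unsigned long ndirty;

	do {
		if (propagate_dat())
			return -1;
		ndirty = mark_sufile();
	} while (ndirty && retry-- > 0);

	return 0;
}

/*
 * Toy model: each round of marking dirties fewer blocks, reaching a
 * fixed point after three rounds.
 */
static unsigned long pending = 3;
static unsigned long toy_mark_sufile(void) { return pending ? pending-- : 0; }
static int toy_propagate_dat(void) { return 0; }
```

The retry limit bounds the loop in the (unexpected) case where the two
files keep dirtying each other's blocks.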

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/segment.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/nilfs2/segment.h |  3 ++-
 2 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 14e76c3..ab8df33 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -579,6 +579,12 @@ static int nilfs_collect_dat_data(struct nilfs_sc_info *sci,
 	return err;
 }
 
+static int nilfs_collect_prop_data(struct nilfs_sc_info *sci,
+				  struct buffer_head *bh, struct inode *inode)
+{
+	return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
+}
+
 static int nilfs_collect_dat_bmap(struct nilfs_sc_info *sci,
 				  struct buffer_head *bh, struct inode *inode)
 {
@@ -613,6 +619,14 @@ static struct nilfs_sc_operations nilfs_sc_dat_ops = {
 	.write_node_binfo = nilfs_write_dat_node_binfo,
 };
 
+static struct nilfs_sc_operations nilfs_sc_prop_ops = {
+	.collect_data = nilfs_collect_prop_data,
+	.collect_node = nilfs_collect_file_node,
+	.collect_bmap = NULL,
+	.write_data_binfo = NULL,
+	.write_node_binfo = NULL,
+};
+
 static struct nilfs_sc_operations nilfs_sc_dsync_ops = {
 	.collect_data = nilfs_collect_file_data,
 	.collect_node = NULL,
@@ -998,7 +1012,8 @@ static int nilfs_segctor_scan_file(struct nilfs_sc_info *sci,
 			err = nilfs_segctor_apply_buffers(
 				sci, inode, &data_buffers,
 				sc_ops->collect_data);
-			BUG_ON(!err); /* always receive -E2BIG or true error */
+			/* always receive -E2BIG or true error (NOT ANYMORE?) */
+			/* BUG_ON(!err); */
 			goto break_or_fail;
 		}
 	}
@@ -1055,6 +1070,55 @@ static int nilfs_segctor_scan_file_dsync(struct nilfs_sc_info *sci,
 	return err;
 }
 
+/**
+ * nilfs_segctor_propagate_sufile - dirties all needed SUFILE blocks
+ * @sci: nilfs_sc_info
+ *
+ * Description: Dirties and propagates all SUFILE blocks that need to be
+ * available later in the segment construction process, when the SUFILE cache
+ * is flushed. Here the SUFILE cache is not actually flushed, but the blocks
+ * that are needed for a later flush are marked as dirty. Since the propagation
+ * of the SUFILE can dirty DAT entries and vice versa, the functions
+ * are executed in a loop until no new blocks are dirtied.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOMEM - Insufficient memory available.
+ *
+ * %-EIO - I/O error
+ *
+ * %-EROFS - Read only filesystem (for create mode)
+ */
+static int nilfs_segctor_propagate_sufile(struct nilfs_sc_info *sci)
+{
+	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
+	unsigned long ndirty_blks;
+	int ret, retrycount = NILFS_SC_SUFILE_PROP_RETRY;
+
+	do {
+		/* count changes to DAT file before flush */
+		ret = nilfs_segctor_scan_file(sci, nilfs->ns_dat,
+					      &nilfs_sc_prop_ops);
+		if (unlikely(ret))
+			return ret;
+
+		ret = nilfs_sufile_flush_cache(nilfs->ns_sufile, 1,
+					       &ndirty_blks);
+		if (unlikely(ret))
+			return ret;
+		if (!ndirty_blks)
+			break;
+
+		ret = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
+					      &nilfs_sc_prop_ops);
+		if (unlikely(ret))
+			return ret;
+	} while (ndirty_blks && retrycount-- > 0);
+
+	return 0;
+}
+
 static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
 {
 	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
@@ -1160,6 +1224,13 @@ static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
 		}
 		sci->sc_stage.flags |= NILFS_CF_SUFREED;
 
+		if (mode == SC_LSEG_SR &&
+		    nilfs_feature_track_live_blks(nilfs)) {
+			err = nilfs_segctor_propagate_sufile(sci);
+			if (unlikely(err))
+				break;
+		}
+
 		err = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
 					      &nilfs_sc_file_ops);
 		if (unlikely(err))
diff --git a/fs/nilfs2/segment.h b/fs/nilfs2/segment.h
index a48d6de..5aa7f91 100644
--- a/fs/nilfs2/segment.h
+++ b/fs/nilfs2/segment.h
@@ -208,7 +208,8 @@ enum {
  */
 #define NILFS_SC_CLEANUP_RETRY	    3  /* Retry count of construction when
 					  destroying segctord */
-
+#define NILFS_SC_SUFILE_PROP_RETRY  10 /* Retry count of the propagate
+					  sufile loop */
 /*
  * Default values of timeout, in seconds.
  */
-- 
2.3.7

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-05-03 10:05   ` [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out Andreas Rohner
@ 2015-05-03 10:05   ` Andreas Rohner
       [not found]     ` <1430647522-14304-9-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-03 10:05   ` [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots Andreas Rohner
                     ` (6 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

The userspace GC uses the concept of a so-called protection period,
which is a period of time during which actually reclaimable blocks are
protected. If a segment is cleaned and it contains blocks that are
protected by this period, they have to be treated as if they were live
blocks.

This is a problem for the live block tracking on the kernel side,
because the kernel knows nothing about the protection period. This patch
introduces new flags for the nilfs_vdesc data structure to mark blocks
that need to be treated as if they were alive, but must be counted as if
they were reclaimable. There are two reasons for this to happen: either
a block was deleted within the protection period, or it is part of a
snapshot.
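The set/clear/test helper pattern used for these flags can be sketched in user space as follows (a simplified mirror of the pattern the patch adds; the struct and names here are illustrative stand-ins, not the kernel definitions):

```c
#include <stdint.h>

/* Simplified vdesc with a bit-flags field, as introduced by the patch */
struct vdesc_mirror {
	uint32_t vd_blk_flags;
};

/* Flag bit positions */
enum {
	VDESC_SNAPSHOT_PROTECTED,
	VDESC_PERIOD_PROTECTED,
};

/* Generates set/clear/test helpers for one flag bit */
#define VDESC_FNS(flag, name)						\
static inline void vdesc_set_##name(struct vdesc_mirror *v)		\
{									\
	v->vd_blk_flags |= (1UL << VDESC_##flag);			\
}									\
static inline void vdesc_clear_##name(struct vdesc_mirror *v)		\
{									\
	v->vd_blk_flags &= ~(1UL << VDESC_##flag);			\
}									\
static inline int vdesc_##name(const struct vdesc_mirror *v)		\
{									\
	return !!(v->vd_blk_flags & (1UL << VDESC_##flag));		\
}

VDESC_FNS(SNAPSHOT_PROTECTED, snapshot_protected)
VDESC_FNS(PERIOD_PROTECTED, period_protected)
```

Each flag gets three generated accessors, so callers never manipulate the raw bit mask directly.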

After the blocks described by the nilfs_vdesc structures are read in,
the flags are passed on to the buffer_heads to get the information to
the segment construction phase. During segment construction, the live
block tracking is adjusted accordingly.

Additionally, the blocks are rechecked to determine whether they are
reclaimable, since the last check was done in userspace without the
proper locking.
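The per-block accounting decision described above can be sketched as a small standalone function (a sketch only; the helper name, the bool parameters, and the counter struct are hypothetical, not part of the patch):

```c
#include <stdbool.h>

/* Hypothetical per-segment counters, mirroring the SUFILE fields */
struct seg_counters {
	long nlive_blks;      /* live block counter of the segment */
	long nsnapshot_blks;  /* blocks protected by snapshots */
};

/*
 * Sketch of the decision made for each block of a GC-Inode during
 * segment construction: a block counts as reclaimable if it is dead
 * or protected only by the protection period, unless a snapshot
 * protects it, in which case it is counted as a snapshot block.
 */
static void account_gc_block(struct seg_counters *c,
			     bool snapshot_protected,
			     bool period_protected,
			     bool dat_says_live)
{
	/* period-protected or dead blocks are reclaimable... */
	bool reclaimable = period_protected || !dat_says_live;

	/* ...unless a snapshot protects them */
	if (!snapshot_protected && reclaimable)
		c->nlive_blks--;
	if (snapshot_protected)
		c->nsnapshot_blks++;
}
```

Note how a period-protected block decrements the live counter even though the GC still treats it as alive for the copy itself.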

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/dat.c           | 66 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/nilfs2/dat.h           |  1 +
 fs/nilfs2/ioctl.c         | 15 +++++++++++
 fs/nilfs2/page.h          |  6 +++++
 fs/nilfs2/segment.c       | 41 ++++++++++++++++++++++++++++-
 include/linux/nilfs2_fs.h | 38 +++++++++++++++++++++++++--
 6 files changed, 164 insertions(+), 3 deletions(-)

diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index 9c2fc32..80a1905 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -35,6 +35,17 @@
 #define NILFS_CNO_MAX	(~(__u64)0)
 
 /**
+ * nilfs_dat_entry_is_live - check if @entry is alive
+ * @entry: DAT-Entry
+ *
+ * Description: Simple check if @entry is alive in the current checkpoint.
+ */
+static int nilfs_dat_entry_is_live(struct nilfs_dat_entry *entry)
+{
+	return entry->de_end == cpu_to_le64(NILFS_CNO_MAX);
+}
+
+/**
  * struct nilfs_dat_info - on-memory private data of DAT file
  * @mi: on-memory private data of metadata file
  * @palloc_cache: persistent object allocator cache of DAT file
@@ -387,6 +398,61 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, sector_t blocknr)
 }
 
 /**
+ * nilfs_dat_is_live - checks if the virtual block number is alive
+ * @dat: DAT file inode
+ * @vblocknr: virtual block number
+ *
+ * Description: nilfs_dat_is_live() looks up the DAT-Entry for
+ * @vblocknr and determines if the corresponding block is alive in the current
+ * checkpoint or not. This check ignores snapshots and protection periods.
+ *
+ * Return Value: 1 if @vblocknr is alive and 0 otherwise. On error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - A block number associated with @vblocknr does not exist.
+ */
+int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr)
+{
+	struct buffer_head *entry_bh, *bh;
+	struct nilfs_dat_entry *entry;
+	sector_t blocknr;
+	void *kaddr;
+	int ret;
+
+	ret = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
+	if (ret < 0)
+		return ret;
+
+	if (!nilfs_doing_gc() && buffer_nilfs_redirected(entry_bh)) {
+		bh = nilfs_mdt_get_frozen_buffer(dat, entry_bh);
+		if (bh) {
+			WARN_ON(!buffer_uptodate(bh));
+			put_bh(entry_bh);
+			entry_bh = bh;
+		}
+	}
+
+	kaddr = kmap_atomic(entry_bh->b_page);
+	entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
+	blocknr = le64_to_cpu(entry->de_blocknr);
+	if (blocknr == 0) {
+		ret = -ENOENT;
+		goto out_unmap;
+	}
+
+	ret = nilfs_dat_entry_is_live(entry);
+
+out_unmap:
+	kunmap_atomic(kaddr);
+	put_bh(entry_bh);
+	return ret;
+}
+
+/**
  * nilfs_dat_translate - translate a virtual block number to a block number
  * @dat: DAT file inode
  * @vblocknr: virtual block number
diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
index cbd8e97..a95547c 100644
--- a/fs/nilfs2/dat.h
+++ b/fs/nilfs2/dat.h
@@ -47,6 +47,7 @@ void nilfs_dat_commit_update(struct inode *, struct nilfs_palloc_req *,
 			     struct nilfs_palloc_req *, int);
 void nilfs_dat_abort_update(struct inode *, struct nilfs_palloc_req *,
 			    struct nilfs_palloc_req *);
+int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr);
 
 int nilfs_dat_mark_dirty(struct inode *, __u64);
 int nilfs_dat_freev(struct inode *, __u64 *, size_t);
diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index f6ee54e..40bf74a 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -612,6 +612,12 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
 		brelse(bh);
 		return -EEXIST;
 	}
+
+	if (nilfs_vdesc_snapshot_protected(vdesc))
+		set_buffer_nilfs_snapshot_protected(bh);
+	if (nilfs_vdesc_period_protected(vdesc))
+		set_buffer_nilfs_period_protected(bh);
+
 	list_add_tail(&bh->b_assoc_buffers, buffers);
 	return 0;
 }
@@ -662,6 +668,15 @@ static int nilfs_ioctl_move_blocks(struct super_block *sb,
 		}
 
 		do {
+			/*
+			 * Old user space tools do not initialize vd_blk_flags.
+			 * If vd_period.p_start > 0, then vd_blk_flags was
+			 * not initialized properly and may contain invalid
+			 * flags.
+			 */
+			if (vdesc->vd_period.p_start > 0)
+				vdesc->vd_blk_flags = 0;
+
 			ret = nilfs_ioctl_move_inode_block(inode, vdesc,
 							   &buffers);
 			if (unlikely(ret < 0)) {
diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
index 4e35814..4835e37 100644
--- a/fs/nilfs2/page.h
+++ b/fs/nilfs2/page.h
@@ -36,6 +36,8 @@ enum {
 	BH_NILFS_Volatile,
 	BH_NILFS_Checked,
 	BH_NILFS_Redirected,
+	BH_NILFS_Snapshot_Protected,
+	BH_NILFS_Period_Protected,
 	BH_NILFS_Counted,
 };
 
@@ -43,6 +45,10 @@ BUFFER_FNS(NILFS_Node, nilfs_node)		/* nilfs node buffers */
 BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
 BUFFER_FNS(NILFS_Checked, nilfs_checked)	/* buffer is verified */
 BUFFER_FNS(NILFS_Redirected, nilfs_redirected)	/* redirected to a copy */
+/* buffer belongs to a snapshot and is protected by it */
+BUFFER_FNS(NILFS_Snapshot_Protected, nilfs_snapshot_protected)
+/* protected by protection period */
+BUFFER_FNS(NILFS_Period_Protected, nilfs_period_protected)
 /* counted by propagate_p for segment usage */
 BUFFER_FNS(NILFS_Counted, nilfs_counted)
 
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index ab8df33..b476ce7 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1564,12 +1564,41 @@ static void nilfs_list_replace_buffer(struct buffer_head *old_bh,
 	/* The caller must release old_bh */
 }
 
+/**
+ * nilfs_segctor_dec_nlive_blks_gc - dec. nlive_blks for blocks of GC-Inodes
+ * @dat: dat inode
+ * @segbuf: current segment buffer
+ * @bh: current buffer head
+ *
+ * Description: nilfs_segctor_dec_nlive_blks_gc() is called if the inode to
+ * which @bh belongs is a GC-Inode. In that case it is not necessary to
+ * decrement the live block counter of the previous segment, because at the
+ * end of the GC process it will be freed anyway. It is however necessary to
+ * recheck whether the blocks are alive, because the last check was done in
+ * userspace without the proper locking. Additionally, blocks protected only
+ * by the protection period should be considered reclaimable. It is assumed
+ * that @bh->b_blocknr is a virtual block number (only true for GC-Inodes).
+ */
+static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
+					    struct nilfs_segment_buffer *segbuf,
+					    struct buffer_head *bh) {
+	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
+				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
+
+	if (!buffer_nilfs_snapshot_protected(bh) && isreclaimable)
+		segbuf->sb_nlive_blks--;
+	if (buffer_nilfs_snapshot_protected(bh))
+		segbuf->sb_nsnapshot_blks++;
+}
+
 static int
 nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
 				     struct nilfs_segment_buffer *segbuf,
 				     int mode)
 {
+	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
 	struct inode *inode = NULL;
+	struct nilfs_inode_info *ii;
 	sector_t blocknr;
 	unsigned long nfinfo = segbuf->sb_sum.nfinfo;
 	unsigned long nblocks = 0, ndatablk = 0;
@@ -1579,7 +1608,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
 	union nilfs_binfo binfo;
 	struct buffer_head *bh, *bh_org;
 	ino_t ino = 0;
-	int err = 0;
+	int err = 0, gc_inode = 0, track_live_blks;
+
+	track_live_blks = nilfs_feature_track_live_blks(nilfs);
 
 	if (!nfinfo)
 		goto out;
@@ -1601,6 +1632,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
 
 			inode = bh->b_page->mapping->host;
 
+			ii = NILFS_I(inode);
+			gc_inode = test_bit(NILFS_I_GCINODE, &ii->i_state);
+
 			if (mode == SC_LSEG_DSYNC)
 				sc_op = &nilfs_sc_dsync_ops;
 			else if (ino == NILFS_DAT_INO)
@@ -1608,6 +1642,11 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
 			else /* file blocks */
 				sc_op = &nilfs_sc_file_ops;
 		}
+
+		if (track_live_blks && gc_inode)
+			nilfs_segctor_dec_nlive_blks_gc(nilfs->ns_dat,
+							segbuf, bh);
+
 		bh_org = bh;
 		get_bh(bh_org);
 		err = nilfs_bmap_assign(NILFS_I(inode)->i_bmap, &bh, blocknr,
diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
index 5f05bbf..ddc98e8 100644
--- a/include/linux/nilfs2_fs.h
+++ b/include/linux/nilfs2_fs.h
@@ -905,7 +905,7 @@ struct nilfs_vinfo {
  * @vd_blocknr: disk block number
  * @vd_offset: logical block offset inside a file
  * @vd_flags: flags (data or node block)
- * @vd_pad: padding
+ * @vd_blk_flags: additional flags
  */
 struct nilfs_vdesc {
 	__u64 vd_ino;
@@ -915,9 +915,43 @@ struct nilfs_vdesc {
 	__u64 vd_blocknr;
 	__u64 vd_offset;
 	__u32 vd_flags;
-	__u32 vd_pad;
+	/*
+	 * vd_blk_flags is needed because vd_flags cannot be used
+	 * for bit-flags due to backwards compatibility
+	 */
+	__u32 vd_blk_flags;
 };
 
+/* vdesc flags */
+enum {
+	NILFS_VDESC_SNAPSHOT_PROTECTED,
+	NILFS_VDESC_PERIOD_PROTECTED,
+
+	/* ... */
+
+	__NR_NILFS_VDESC_FIELDS,
+};
+
+#define NILFS_VDESC_FNS(flag, name)					\
+static inline void							\
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_blk_flags |= (1UL << NILFS_VDESC_##flag);		\
+}									\
+static inline void							\
+nilfs_vdesc_clear_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_blk_flags &= ~(1UL << NILFS_VDESC_##flag);		\
+}									\
+static inline int							\
+nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
+{									\
+	return !!(vdesc->vd_blk_flags & (1UL << NILFS_VDESC_##flag));	\
+}
+
+NILFS_VDESC_FNS(SNAPSHOT_PROTECTED, snapshot_protected)
+NILFS_VDESC_FNS(PERIOD_PROTECTED, period_protected)
+
 /**
  * struct nilfs_bdesc - descriptor of disk block number
  * @bd_ino: inode number
-- 
2.3.7


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (7 preceding siblings ...)
  2015-05-03 10:05   ` [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period Andreas Rohner
@ 2015-05-03 10:05   ` Andreas Rohner
       [not found]     ` <1430647522-14304-10-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-03 10:07   ` [PATCH v2 1/5] nilfs-utils: extend SUFILE on-disk format to enable track live blocks Andreas Rohner
                     ` (5 subsequent siblings)
  14 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:05 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

It doesn't really matter if the number of reclaimable blocks for a
segment is inaccurate, as long as the overall performance is better than
the simple timestamp algorithm and starvation is prevented.

The following steps will lead to starvation of a segment:

1. The segment is written
2. A snapshot is created
3. The files in the segment are deleted and the number of live
   blocks for the segment is decremented to a very low value
4. The GC tries to free the segment, but there are no reclaimable
   blocks, because they are all protected by the snapshot. To prevent an
   infinite loop the GC has to adjust the number of live blocks to the
   correct value.
5. The snapshot is converted to a checkpoint and the blocks in the
   segment are now reclaimable.
6. The GC will never attempt to clean the segment again, because it
   looks as if it had a high number of live blocks.

To prevent this, the already existing padding field of the SUFILE entry
is used to track the number of snapshot blocks in the segment. This
number is only set by the GC, since it collects the necessary
information anyway, so there is no need to track which block belongs to
which segment. In step 4 of the list above, the GC sets the new field
su_nsnapshot_blks. In step 5, all entries in the SUFILE are checked and
entries with a large su_nsnapshot_blks field get their su_nlive_blks
field reduced.
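The clamping rule applied in step 5 can be illustrated with a minimal user space sketch (the struct and function here are simplified stand-ins for the SUFILE entry and nilfs_sufile_fix_starving_segs(), not the kernel code):

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-in for the relevant SUFILE entry fields */
struct su_entry {
	uint32_t nlive_blks;
	uint32_t nsnapshot_blks;
};

/*
 * A segment is potentially starving if more than half of its blocks
 * are protected by snapshots AND its live block counter is also above
 * half. Clamping both counters to half the segment size makes the
 * segment eligible for the GC again. Returns true if modified.
 */
static bool fix_starving_entry(struct su_entry *su,
			       uint32_t blocks_per_segment)
{
	uint32_t max_segblks = blocks_per_segment >> 1;

	if (su->nsnapshot_blks <= max_segblks)
		return false;	/* few snapshot blocks: not starving */
	if (su->nlive_blks <= max_segblks)
		return false;	/* already attractive to the GC */

	su->nlive_blks = max_segblks;
	su->nsnapshot_blks = max_segblks;
	return true;
}
```

An inaccurate (too low) counter is acceptable here: the worst case is that the GC cleans the segment once more than strictly necessary, which is exactly the trade-off the commit message describes.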

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/ioctl.c  | 50 +++++++++++++++++++++++++++++++-
 fs/nilfs2/sufile.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nilfs2/sufile.h |  3 ++
 3 files changed, 137 insertions(+), 1 deletion(-)

diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 40bf74a..431725f 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -200,6 +200,49 @@ static int nilfs_ioctl_getversion(struct inode *inode, void __user *argp)
 }
 
 /**
+ * nilfs_ioctl_fix_starving_segs - fix potentially starving segments
+ * @nilfs: nilfs object
+ * @inode: inode object
+ *
+ * Description: Scans for segments that are potentially starving and
+ * reduces their number of live blocks to less than half of the maximum
+ * number of blocks in a segment. This requires a scan of the whole SUFILE,
+ * which can take a long time on certain devices and under certain conditions.
+ * To avoid blocking other file system operations for too long the SUFILE is
+ * scanned in steps of NILFS_SUFILE_STARVING_SEGS_STEP. After each step the
+ * locks are released and cond_resched() is called.
+ *
+ * Return Value: On success, 0 is returned and on error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+static int nilfs_ioctl_fix_starving_segs(struct the_nilfs *nilfs,
+					 struct inode *inode) {
+	struct nilfs_transaction_info ti;
+	unsigned long i, nsegs = nilfs_sufile_get_nsegments(nilfs->ns_sufile);
+	int ret = 0;
+
+	for (i = 0; i < nsegs; i += NILFS_SUFILE_STARVING_SEGS_STEP) {
+		nilfs_transaction_begin(inode->i_sb, &ti, 0);
+
+		ret = nilfs_sufile_fix_starving_segs(nilfs->ns_sufile, i,
+				NILFS_SUFILE_STARVING_SEGS_STEP);
+		if (unlikely(ret < 0)) {
+			nilfs_transaction_abort(inode->i_sb);
+			break;
+		}
+
+		nilfs_transaction_commit(inode->i_sb); /* never fails */
+		cond_resched();
+	}
+
+	return ret;
+}
+
+/**
  * nilfs_ioctl_change_cpmode - change checkpoint mode (checkpoint/snapshot)
  * @inode: inode object
  * @filp: file object
@@ -224,7 +267,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
 	struct the_nilfs *nilfs = inode->i_sb->s_fs_info;
 	struct nilfs_transaction_info ti;
 	struct nilfs_cpmode cpmode;
-	int ret;
+	int ret, is_snapshot;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
@@ -240,6 +283,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
 	mutex_lock(&nilfs->ns_snapshot_mount_mutex);
 
 	nilfs_transaction_begin(inode->i_sb, &ti, 0);
+	is_snapshot = nilfs_cpfile_is_snapshot(nilfs->ns_cpfile, cpmode.cm_cno);
 	ret = nilfs_cpfile_change_cpmode(
 		nilfs->ns_cpfile, cpmode.cm_cno, cpmode.cm_mode);
 	if (unlikely(ret < 0))
@@ -248,6 +292,10 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
 		nilfs_transaction_commit(inode->i_sb); /* never fails */
 
 	mutex_unlock(&nilfs->ns_snapshot_mount_mutex);
+
+	if (is_snapshot > 0 && cpmode.cm_mode == NILFS_CHECKPOINT &&
+			nilfs_feature_track_live_blks(nilfs))
+		ret = nilfs_ioctl_fix_starving_segs(nilfs, inode);
 out:
 	mnt_drop_write_file(filp);
 	return ret;
diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 9cd8820d..47e2c05 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -1215,6 +1215,91 @@ out_sem:
 }
 
 /**
+ * nilfs_sufile_fix_starving_segs - fix potentially starving segments
+ * @sufile: inode of segment usage file
+ * @segnum: segnum to start
+ * @nsegs: number of segments to check
+ *
+ * Description: Scans for segments that are potentially starving and
+ * reduces their number of live blocks to less than half of the maximum
+ * number of blocks in a segment. This way the segment is more likely to be
+ * chosen by the GC. A segment is marked as potentially starving if more
+ * than half of the blocks it contains are protected by snapshots.
+ *
+ * Return Value: On success, 0 is returned and on error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
+				   __u64 nsegs)
+{
+	struct buffer_head *su_bh;
+	struct nilfs_segment_usage *su;
+	size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
+	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
+	void *kaddr;
+	unsigned long maxnsegs, segusages_per_block;
+	__u32 max_segblks = nilfs->ns_blocks_per_segment >> 1;
+	int ret = 0, blkdirty, dirty = 0;
+
+	down_write(&NILFS_MDT(sufile)->mi_sem);
+
+	maxnsegs = nilfs_sufile_get_nsegments(sufile);
+	segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
+	nsegs += segnum;
+	if (nsegs > maxnsegs)
+		nsegs = maxnsegs;
+
+	while (segnum < nsegs) {
+		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
+							 nsegs - 1);
+
+		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
+							   0, &su_bh);
+		if (ret < 0) {
+			if (ret != -ENOENT)
+				goto out;
+			/* hole */
+			segnum += n;
+			continue;
+		}
+
+		kaddr = kmap_atomic(su_bh->b_page);
+		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
+							  su_bh, kaddr);
+		blkdirty = 0;
+		for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
+			if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
+				continue;
+			if (le32_to_cpu(su->su_nlive_blks) <= max_segblks)
+				continue;
+
+			su->su_nlive_blks = cpu_to_le32(max_segblks);
+			su->su_nsnapshot_blks = cpu_to_le32(max_segblks);
+			blkdirty = 1;
+		}
+
+		kunmap_atomic(kaddr);
+		if (blkdirty) {
+			mark_buffer_dirty(su_bh);
+			dirty = 1;
+		}
+		put_bh(su_bh);
+		cond_resched();
+	}
+
+out:
+	if (dirty)
+		nilfs_mdt_mark_dirty(sufile);
+
+	up_write(&NILFS_MDT(sufile)->mi_sem);
+	return ret;
+}
+
+/**
  * nilfs_sufile_alloc_cache_node - allocate and insert a new cache node
  * @sufile: inode of segment usage file
  * @group: group to allocate a node for
diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index 3466abb..f11e3e6 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -30,6 +30,7 @@
 
 #define NILFS_SUFILE_CACHE_NODE_SHIFT	6
 #define NILFS_SUFILE_CACHE_NODE_COUNT	(1 << NILFS_SUFILE_CACHE_NODE_SHIFT)
+#define NILFS_SUFILE_STARVING_SEGS_STEP (1 << 15)
 
 struct nilfs_sufile_cache_node {
 	__u32 values[NILFS_SUFILE_CACHE_NODE_COUNT];
@@ -88,6 +89,8 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
 int nilfs_sufile_read(struct super_block *sb, size_t susize,
 		      struct nilfs_inode *raw_inode, struct inode **inodep);
 int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
+int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
+				   __u64 nsegs);
 int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum);
 void nilfs_sufile_shrink_cache(struct inode *sufile);
 int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
-- 
2.3.7


^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 1/5] nilfs-utils: extend SUFILE on-disk format to enable track live blocks
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (8 preceding siblings ...)
  2015-05-03 10:05   ` [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots Andreas Rohner
@ 2015-05-03 10:07   ` Andreas Rohner
  2015-05-03 10:07   ` [PATCH v2 2/5] nilfs-utils: add additional flags for nilfs_vdesc Andreas Rohner
                     ` (4 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:07 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch extends the nilfs_segment_usage structure with two extra
fields. This changes the on-disk format of the SUFILE, but the NILFS2
metadata files are flexible enough, so that there are no compatibility
issues. The extension is fully backwards compatible. Nevertheless a
feature compatibility flag was added to indicate the on-disk format
change.

The new field su_nlive_blks is used to track the number of live blocks
in the corresponding segment. Its value should always be smaller than
su_nblocks, which contains the total number of blocks in the segment.

The field su_nlive_lastmod is necessary because of the protection period
used by the GC. It is a timestamp, which contains the last time
su_nlive_blks was modified. For example, if a file is deleted, its
blocks are subtracted from su_nlive_blks and are therefore
considered to be reclaimable by the kernel. But the GC additionally
protects them with the protection period. So while su_nlive_blks
contains the number of potentially reclaimable blocks, the actual number
depends on the protection period. To enable GC policies to
effectively choose or prefer segments with unprotected blocks, the
timestamp in su_nlive_lastmod is necessary.

Since the changes to the disk layout are fully backwards compatible and
the feature flag cannot be set after file system creation time,
NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT is set by default. It can
however be disabled by mkfs.nilfs2 -O ^sufile_live_blks_ext
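The entry sizes implied by the two on-disk layouts can be checked with a small user space sketch. The mirror struct below uses plain fixed-width types in place of the __le types (an assumption for illustration; the layout matches because both are fixed-size little-endian fields):

```c
#include <stddef.h>
#include <stdint.h>

/* User space mirror of struct nilfs_segment_usage with plain types */
struct segment_usage_mirror {
	uint64_t su_lastmod;
	uint32_t su_nblocks;
	uint32_t su_flags;
	/* extension fields added by this patch */
	uint32_t su_nlive_blks;
	uint32_t su_nsnapshot_blks;
	uint64_t su_nlive_lastmod;
};

#ifndef offsetofend
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
#endif

/* Old entries end after su_flags, extended ones after su_nlive_lastmod */
#define MIN_SEGMENT_USAGE_SIZE \
	offsetofend(struct segment_usage_mirror, su_flags)
#define LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE \
	offsetofend(struct segment_usage_mirror, su_nlive_lastmod)
```

Because the SUFILE stores its entry size in the superblock, old and new entry sizes can coexist across file systems, which is what makes the extension backwards compatible.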

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 bin/lssu.c          | 14 +++++++----
 include/nilfs2_fs.h | 52 ++++++++++++++++++++++++++++++++++------
 lib/feature.c       |  2 ++
 man/mkfs.nilfs2.8   |  7 ++++++
 sbin/mkfs/mkfs.c    | 69 +++++++++++++++++++++++++++++++++++++++--------------
 5 files changed, 114 insertions(+), 30 deletions(-)

diff --git a/bin/lssu.c b/bin/lssu.c
index 09ed973..e50e628 100644
--- a/bin/lssu.c
+++ b/bin/lssu.c
@@ -104,8 +104,8 @@ static const struct lssu_format lssu_format[] = {
 	},
 	{
 		"           SEGNUM        DATE     TIME STAT     NBLOCKS" \
-		"       NLIVEBLOCKS",
-		"%17llu  %s %c%c%c%c  %10u %10u (%3u%%)\n"
+		"       NLIVEBLOCKS   NPREDLIVEBLOCKS",
+		"%17llu  %s %c%c%c%c  %10u %10u (%3u%%) %10u (%3u%%)\n"
 	}
 };
 
@@ -164,9 +164,9 @@ static ssize_t lssu_print_suinfo(struct nilfs *nilfs, __u64 segnum,
 	time_t t;
 	char timebuf[LSSU_BUFSIZE];
 	ssize_t i, n = 0, ret;
-	int ratio;
+	int ratio, predratio;
 	int protected;
-	size_t nliveblks;
+	size_t nliveblks, npredliveblks;
 
 	for (i = 0; i < nsi; i++, segnum++) {
 		if (!all && nilfs_suinfo_clean(&suinfos[i]))
@@ -192,7 +192,10 @@ static ssize_t lssu_print_suinfo(struct nilfs *nilfs, __u64 segnum,
 			break;
 		case LSSU_MODE_LATEST_USAGE:
 			nliveblks = 0;
+			npredliveblks = suinfos[i].sui_nlive_blks;
 			ratio = 0;
+			predratio = (npredliveblks * 100 + 99) /
+					blocks_per_segment;
 			protected = suinfos[i].sui_lastmod >= prottime;
 
 			if (!nilfs_suinfo_dirty(&suinfos[i]) ||
@@ -223,7 +226,8 @@ skip_scan:
 			       nilfs_suinfo_dirty(&suinfos[i]) ? 'd' : '-',
 			       nilfs_suinfo_error(&suinfos[i]) ? 'e' : '-',
 			       protected ? 'p' : '-',
-			       suinfos[i].sui_nblocks, nliveblks, ratio);
+			       suinfos[i].sui_nblocks, nliveblks, ratio,
+			       npredliveblks, predratio);
 			break;
 		}
 		n++;
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index a16ad4c..6f0a27e 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -219,9 +219,12 @@ struct nilfs_super_block {
  * If there is a bit set in the incompatible feature set that the kernel
  * doesn't know about, it should refuse to mount the filesystem.
  */
-#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT	0x00000001ULL
+#define NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT	(1ULL << 0)
 
-#define NILFS_FEATURE_COMPAT_SUPP	0ULL
+#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
+
+#define NILFS_FEATURE_COMPAT_SUPP					\
+			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT)
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
@@ -607,18 +610,38 @@ struct nilfs_cpfile_header {
 	  sizeof(struct nilfs_checkpoint) - 1) /			\
 			sizeof(struct nilfs_checkpoint))
 
+#undef offsetof
+#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
+
+#ifndef offsetofend
+#define offsetofend(TYPE, MEMBER) \
+		(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
+#endif
+
 /**
  * struct nilfs_segment_usage - segment usage
  * @su_lastmod: last modified timestamp
  * @su_nblocks: number of blocks in segment
  * @su_flags: flags
+ * @su_nlive_blks: number of live blocks in the segment
+ * @su_nsnapshot_blks: number of blocks belonging to a snapshot in the segment
+ * @su_nlive_lastmod: timestamp nlive_blks was last modified
  */
 struct nilfs_segment_usage {
 	__le64 su_lastmod;
 	__le32 su_nblocks;
 	__le32 su_flags;
+	__le32 su_nlive_blks;
+	__le32 su_nsnapshot_blks;
+	__le64 su_nlive_lastmod;
 };
 
+#define NILFS_MIN_SEGMENT_USAGE_SIZE	\
+	offsetofend(struct nilfs_segment_usage, su_flags)
+
+#define NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE	\
+	offsetofend(struct nilfs_segment_usage, su_nlive_lastmod)
+
 /* segment usage flag */
 enum {
 	NILFS_SEGMENT_USAGE_ACTIVE,
@@ -654,11 +677,16 @@ NILFS_SEGMENT_USAGE_FNS(DIRTY, dirty)
 NILFS_SEGMENT_USAGE_FNS(ERROR, error)
 
 static inline void
-nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su)
+nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
 {
 	su->su_lastmod = cpu_to_le64(0);
 	su->su_nblocks = cpu_to_le32(0);
 	su->su_flags = cpu_to_le32(0);
+	if (susz >= NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE) {
+		su->su_nlive_blks = cpu_to_le32(0);
+		su->su_nsnapshot_blks = cpu_to_le32(0);
+		su->su_nlive_lastmod = cpu_to_le64(0);
+	}
 }
 
 static inline int
@@ -680,21 +708,25 @@ struct nilfs_sufile_header {
 	/* ... */
 };
 
-#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET	\
-	((sizeof(struct nilfs_sufile_header) +				\
-	  sizeof(struct nilfs_segment_usage) - 1) /			\
-			 sizeof(struct nilfs_segment_usage))
+#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET(susz)	\
+	((sizeof(struct nilfs_sufile_header) + (susz) - 1) / (susz))
 
 /**
  * nilfs_suinfo - segment usage information
  * @sui_lastmod: timestamp of last modification
  * @sui_nblocks: number of written blocks in segment
  * @sui_flags: segment usage flags
+ * @sui_nlive_blks: number of live blocks in the segment
+ * @sui_nsnapshot_blks: number of blocks belonging to a snapshot in the segment
+ * @sui_nlive_lastmod: timestamp nlive_blks was last modified
  */
 struct nilfs_suinfo {
 	__u64 sui_lastmod;
 	__u32 sui_nblocks;
 	__u32 sui_flags;
+	__u32 sui_nlive_blks;
+	__u32 sui_nsnapshot_blks;
+	__u64 sui_nlive_lastmod;
 };
 
 #define NILFS_SUINFO_FNS(flag, name)					\
@@ -732,6 +764,9 @@ enum {
 	NILFS_SUINFO_UPDATE_LASTMOD,
 	NILFS_SUINFO_UPDATE_NBLOCKS,
 	NILFS_SUINFO_UPDATE_FLAGS,
+	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
+	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
+	NILFS_SUINFO_UPDATE_NSNAPSHOT_BLKS,
 	__NR_NILFS_SUINFO_UPDATE_FIELDS,
 };
 
@@ -755,6 +790,9 @@ nilfs_suinfo_update_##name(const struct nilfs_suinfo_update *sup)	\
 NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
 NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
 NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
+NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
+NILFS_SUINFO_UPDATE_FNS(NSNAPSHOT_BLKS, nsnapshot_blks)
+NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
 
 enum {
 	NILFS_CHECKPOINT,
diff --git a/lib/feature.c b/lib/feature.c
index b3317b7..ea3cb3d 100644
--- a/lib/feature.c
+++ b/lib/feature.c
@@ -55,6 +55,8 @@ struct nilfs_feature {
 
 static const struct nilfs_feature features[] = {
 	/* Compat features */
+	{ NILFS_FEATURE_TYPE_COMPAT,
+	  NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT, "sufile_live_blks_ext" },
 	/* Read-only compat features */
 	{ NILFS_FEATURE_TYPE_COMPAT_RO,
 	  NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT, "block_count" },
diff --git a/man/mkfs.nilfs2.8 b/man/mkfs.nilfs2.8
index 0ff2fbe..f04d6be 100644
--- a/man/mkfs.nilfs2.8
+++ b/man/mkfs.nilfs2.8
@@ -168,6 +168,13 @@ pseudo-filesystem feature "none" will clear all filesystem features.
 .TP
 .B block_count
 Enable block count per checkpoint.
+.TP
+.B sufile_live_blks_ext
+Enable SUFILE extension with extra fields. This is necessary for the
+track_live_blks feature to work. Once enabled, it cannot be disabled, because
+it changes the on-disk format. Nevertheless it is fully compatible with older
+versions of the file system. This feature is on by default, because it is fully
+backwards compatible and can only be set at file system creation time.
 .RE
 .TP
 .B \-q
diff --git a/sbin/mkfs/mkfs.c b/sbin/mkfs/mkfs.c
index f5f7dbb..96b944c 100644
--- a/sbin/mkfs/mkfs.c
+++ b/sbin/mkfs/mkfs.c
@@ -116,7 +116,12 @@ static time_t creation_time;
 static char volume_label[80];
 static __u64 compat_array[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	0,
+	/*
+	 * SUFILE_LIVE_BLKS_EXT is set by default, because
+	 * it is fully compatible with previous versions and it
+	 * cannot be enabled later with nilfs-tune
+	 */
+	NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT,
 	/* Read-only compat */
 	0,
 	/* Incompat */
@@ -375,12 +380,33 @@ static unsigned count_ifile_blocks(void)
 	return nblocks;
 }
 
+static inline int sufile_live_blks_ext_enabled(void)
+{
+	return compat_array[NILFS_FEATURE_TYPE_COMPAT] &
+			NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT;
+}
+
+static unsigned get_sufile_entry_size(void)
+{
+	if (sufile_live_blks_ext_enabled())
+		return NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE;
+	else
+		return NILFS_MIN_SEGMENT_USAGE_SIZE;
+}
+
+static unsigned get_sufile_first_entry_offset(void)
+{
+	unsigned susz = get_sufile_entry_size();
+
+	return NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET(susz);
+}
+
 static unsigned count_sufile_blocks(void)
 {
 	unsigned long sufile_segment_usages_per_block
-		= blocksize / sizeof(struct nilfs_segment_usage);
+		= blocksize / get_sufile_entry_size();
 	return DIV_ROUND_UP(nr_initial_segments +
-			   NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET,
+			   get_sufile_first_entry_offset(),
 			   sufile_segment_usages_per_block);
 }
 
@@ -1056,7 +1082,7 @@ static inline void check_ctime(time_t ctime)
 
 static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	0,
+	NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
@@ -1499,8 +1525,8 @@ static void commit_cpfile(void)
 static void prepare_sufile(void)
 {
 	struct nilfs_file_info *fi = nilfs.files[NILFS_SUFILE_INO];
-	const unsigned entries_per_block
-		= blocksize / sizeof(struct nilfs_segment_usage);
+	const size_t susz = get_sufile_entry_size();
+	const unsigned entries_per_block = blocksize / susz;
 	blocknr_t blocknr = fi->start;
 	blocknr_t entry_block = blocknr;
 	struct nilfs_sufile_header *header;
@@ -1516,10 +1542,10 @@ static void prepare_sufile(void)
 	for (entry_block = blocknr;
 	     entry_block < blocknr + fi->nblocks; entry_block++) {
 		i = (entry_block == blocknr) ?
-			NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET : 0;
-		su = (struct nilfs_segment_usage *)
-			map_disk_buffer(entry_block, 1) + i;
-		for (; i < entries_per_block; i++, su++, segnum++) {
+			get_sufile_first_entry_offset() : 0;
+		su = map_disk_buffer(entry_block, 1) + i * susz;
+		for (; i < entries_per_block; i++, su = (void *)su + susz,
+		     segnum++) {
 #if 0 /* these fields are cleared when mapped first */
 			su->su_lastmod = 0;
 			su->su_nblocks = 0;
@@ -1529,7 +1555,7 @@ static void prepare_sufile(void)
 				nilfs_segment_usage_set_active(su);
 				nilfs_segment_usage_set_dirty(su);
 			} else
-				nilfs_segment_usage_set_clean(su);
+				nilfs_segment_usage_set_clean(su, susz);
 		}
 	}
 	init_inode(NILFS_SUFILE_INO, DT_REG, 0, 0);
@@ -1538,19 +1564,26 @@ static void prepare_sufile(void)
 static void commit_sufile(void)
 {
 	struct nilfs_file_info *fi = nilfs.files[NILFS_SUFILE_INO];
-	const unsigned entries_per_block
-		= blocksize / sizeof(struct nilfs_segment_usage);
+	const size_t susz = get_sufile_entry_size();
+	const unsigned entries_per_block = blocksize / susz;
 	struct nilfs_segment_usage *su;
 	unsigned segnum = fi->start / nilfs.diskinfo->blocks_per_segment;
 	blocknr_t blocknr = fi->start +
-		(segnum + NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET) /
+		(segnum + get_sufile_first_entry_offset()) /
 		entries_per_block;
-
-	su = map_disk_buffer(blocknr, 1);
-	su += (segnum + NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET) %
+	size_t entry_off = (segnum + get_sufile_first_entry_offset()) %
 		entries_per_block;
+
+	su = map_disk_buffer(blocknr, 1) + entry_off * susz;
+
 	su->su_lastmod = cpu_to_le64(nilfs.diskinfo->ctime);
 	su->su_nblocks = cpu_to_le32(nilfs.current_segment->nblocks);
+	if (sufile_live_blks_ext_enabled()) {
+		/* nlive_blks = nblocks - (nsummary_blks + nsuperroot_blks) */
+		su->su_nlive_blks = cpu_to_le32(nilfs.current_segment->nblocks -
+				(nilfs.current_segment->nblk_sum + 1));
+		su->su_nlive_lastmod = su->su_lastmod;
+	}
 }
 
 static void prepare_dat(void)
@@ -1756,7 +1789,7 @@ static void prepare_super_block(struct nilfs_disk_info *di)
 	raw_sb->s_checkpoint_size =
 		cpu_to_le16(sizeof(struct nilfs_checkpoint));
 	raw_sb->s_segment_usage_size =
-		cpu_to_le16(sizeof(struct nilfs_segment_usage));
+		cpu_to_le16(get_sufile_entry_size());
 
 	raw_sb->s_feature_compat =
 		cpu_to_le64(compat_array[NILFS_FEATURE_TYPE_COMPAT]);
-- 
2.3.7

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [PATCH v2 2/5] nilfs-utils: add additional flags for nilfs_vdesc
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (9 preceding siblings ...)
  2015-05-03 10:07   ` [PATCH v2 1/5] nilfs-utils: extend SUFILE on-disk format to enable track live blocks Andreas Rohner
@ 2015-05-03 10:07   ` Andreas Rohner
  2015-05-03 10:07   ` [PATCH v2 3/5] nilfs-utils: add support for tracking live blocks Andreas Rohner
                     ` (3 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:07 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch adds support for additional bit-flags to the nilfs_vdesc
structure used by the GC to communicate block information to the
kernel.

The field vd_flags cannot be used for this purpose, because it does
not support bit-flags, and changing that would break backwards
compatibility. Therefore the padding field is renamed to vd_blk_flags
and used to hold the new bit-flags.

Unfortunately, older versions of nilfs-utils do not initialize the
padding field to zero, so it is necessary to signal to the kernel
whether the new vd_blk_flags field contains usable flags or just
random data. Since the vd_period field is only used in userspace and
is guaranteed to contain a value > 0 (NILFS_CNO_MIN == 1), it can be
used to give the kernel a hint: if vd_period.p_start is set to 0, the
kernel will interpret the vd_blk_flags field.
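
The hinting scheme described above can be sketched as follows (a
hypothetical illustration, not code from the patch; the structs are
simplified stand-ins for the real nilfs_vdesc):

```c
#include <assert.h>

/* simplified stand-ins for the on-disk structures (illustration only) */
struct period { unsigned long long p_start, p_end; };
struct vdesc  { struct period vd_period; unsigned int vd_blk_flags; };

/*
 * Decide whether vd_blk_flags is trustworthy: old userspace tools left
 * the padding field uninitialized, but always sent
 * vd_period.p_start >= 1 (NILFS_CNO_MIN == 1).  New tools zero
 * vd_period before the ioctl, so p_start == 0 marks valid flags.
 */
static int vdesc_blk_flags_valid(const struct vdesc *v)
{
	return v->vd_period.p_start == 0;
}
```

Old tools therefore never trip the check by accident, because a valid
checkpoint number can never be zero.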

The following new flags are added:

NILFS_VDESC_SNAPSHOT_PROTECTED:
    The block corresponding to the vdesc structure is protected by a
    snapshot. This information is used in the kernel as well as in
    nilfs-utils to calculate the number of live blocks in a given
    segment. A block with this flag is counted as live regardless of
    other indicators.

NILFS_VDESC_PERIOD_PROTECTED:
    The block corresponding to the vdesc structure is protected by the
    protection period of the userspace GC. The block is actually
    reclaimable, but protected for the moment, so it has to be
    treated as if it were alive and moved to a new free segment.
    However, it must not be counted as live: this flag indicates to
    the kernel that the block should be counted as reclaimable.

The nilfs_vdesc_is_live() function is modified to store the
corresponding flags in the vdesc structure. However, the algorithm it
uses is not modified, so it should return exactly the same results.

After nilfs_vdesc_is_live() is called, the vd_period field is no
longer needed and is set to 0, indicating to the kernel that the
vd_blk_flags field should be interpreted. This ensures full backward
compatibility:

Old nilfs2 and new nilfs-utils:
    vd_blk_flags is ignored

New nilfs2 and old nilfs-utils:
    vd_period.p_start > 0 so vd_blk_flags is ignored

New nilfs2 and new nilfs-utils:
    vd_period.p_start == 0 so vd_blk_flags is interpreted

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 include/nilfs2_fs.h | 38 ++++++++++++++++++++++++++++++++++++--
 lib/gc.c            | 31 ++++++++++++++++++++++++-------
 2 files changed, 60 insertions(+), 9 deletions(-)

diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index 6f0a27e..efa861c 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -890,7 +890,7 @@ struct nilfs_vinfo {
  * @vd_blocknr: disk block number
  * @vd_offset: logical block offset inside a file
  * @vd_flags: flags (data or node block)
- * @vd_pad: padding
+ * @vd_blk_flags: additional flags
  */
 struct nilfs_vdesc {
 	__u64 vd_ino;
@@ -900,9 +900,43 @@ struct nilfs_vdesc {
 	__u64 vd_blocknr;
 	__u64 vd_offset;
 	__u32 vd_flags;
-	__u32 vd_pad;
+	/*
+	 * vd_blk_flags is needed because vd_flags cannot hold
+	 * bit-flags without breaking backwards compatibility
+	 */
+	__u32 vd_blk_flags;
 };
 
+/* vdesc flags */
+enum {
+	NILFS_VDESC_SNAPSHOT_PROTECTED,
+	NILFS_VDESC_PERIOD_PROTECTED,
+
+	/* ... */
+
+	__NR_NILFS_VDESC_FIELDS,
+};
+
+#define NILFS_VDESC_FNS(flag, name)					\
+static inline void							\
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_blk_flags |= (1UL << NILFS_VDESC_##flag);		\
+}									\
+static inline void							\
+nilfs_vdesc_clear_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_blk_flags &= ~(1UL << NILFS_VDESC_##flag);		\
+}									\
+static inline int							\
+nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
+{									\
+	return !!(vdesc->vd_blk_flags & (1UL << NILFS_VDESC_##flag));	\
+}
+
+NILFS_VDESC_FNS(SNAPSHOT_PROTECTED, snapshot_protected)
+NILFS_VDESC_FNS(PERIOD_PROTECTED, period_protected)
+
 /**
  * struct nilfs_bdesc - descriptor of disk block number
  * @bd_ino: inode number
diff --git a/lib/gc.c b/lib/gc.c
index 48c295a..a830e45 100644
--- a/lib/gc.c
+++ b/lib/gc.c
@@ -128,6 +128,7 @@ static int nilfs_acc_blocks_file(struct nilfs_file *file,
 				return -1;
 			bdesc->bd_ino = ino;
 			bdesc->bd_oblocknr = blk.b_blocknr;
+			bdesc->bd_pad = 0;
 			if (nilfs_block_is_data(&blk)) {
 				bdesc->bd_offset =
 					le64_to_cpu(*(__le64 *)blk.b_binfo);
@@ -148,6 +149,7 @@ static int nilfs_acc_blocks_file(struct nilfs_file *file,
 			vdesc->vd_ino = ino;
 			vdesc->vd_cno = cno;
 			vdesc->vd_blocknr = blk.b_blocknr;
+			vdesc->vd_blk_flags = 0;
 			if (nilfs_block_is_data(&blk)) {
 				binfo = blk.b_binfo;
 				vdesc->vd_vblocknr =
@@ -392,7 +394,7 @@ static ssize_t nilfs_get_snapshot(struct nilfs *nilfs, nilfs_cno_t **ssp)
  * @n: size of @ss array
  * @last_hit: the last snapshot number hit
  */
-static int nilfs_vdesc_is_live(const struct nilfs_vdesc *vdesc,
+static int nilfs_vdesc_is_live(struct nilfs_vdesc *vdesc,
 			       nilfs_cno_t protect, const nilfs_cno_t *ss,
 			       size_t n, nilfs_cno_t *last_hit)
 {
@@ -408,18 +410,22 @@ static int nilfs_vdesc_is_live(const struct nilfs_vdesc *vdesc,
 		return vdesc->vd_period.p_end == NILFS_CNO_MAX;
 	}
 
-	if (vdesc->vd_period.p_end == NILFS_CNO_MAX ||
-	    vdesc->vd_period.p_end > protect)
+	if (vdesc->vd_period.p_end == NILFS_CNO_MAX)
 		return 1;
 
+	if (vdesc->vd_period.p_end > protect)
+		nilfs_vdesc_set_period_protected(vdesc);
+
 	if (n == 0 || vdesc->vd_period.p_start > ss[n - 1] ||
 	    vdesc->vd_period.p_end <= ss[0])
-		return 0;
+		return nilfs_vdesc_period_protected(vdesc);
 
 	/* Try the last hit snapshot number */
 	if (*last_hit >= vdesc->vd_period.p_start &&
-	    *last_hit < vdesc->vd_period.p_end)
+	    *last_hit < vdesc->vd_period.p_end) {
+		nilfs_vdesc_set_snapshot_protected(vdesc);
 		return 1;
+	}
 
 	low = 0;
 	high = n - 1;
@@ -435,10 +441,11 @@ static int nilfs_vdesc_is_live(const struct nilfs_vdesc *vdesc,
 		} else {
 			/* ss[index] is in the range [p_start, p_end) */
 			*last_hit = ss[index];
+			nilfs_vdesc_set_snapshot_protected(vdesc);
 			return 1;
 		}
 	}
-	return 0;
+	return nilfs_vdesc_period_protected(vdesc);
 }
 
 /**
@@ -476,8 +483,18 @@ static int nilfs_toss_vdescs(struct nilfs *nilfs,
 			vdesc = nilfs_vector_get_element(vdescv, j);
 			assert(vdesc != NULL);
 			if (nilfs_vdesc_is_live(vdesc, protcno, ss, n,
-						&last_hit))
+						&last_hit)) {
+				/*
+				 * vd_period is not used any more after this,
+				 * but by setting it to 0 it can be used
+				 * as a flag to the kernel that vd_blk_flags
+				 * is used (old userspace tools didn't
+				 * initialize vd_pad to 0)
+				 */
+				vdesc->vd_period.p_start = 0;
+				vdesc->vd_period.p_end = 0;
 				break;
+			}
 
 			/*
 			 * Add the virtual block number to the candidate
-- 
2.3.7



* [PATCH v2 3/5] nilfs-utils: add support for tracking live blocks
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (10 preceding siblings ...)
  2015-05-03 10:07   ` [PATCH v2 2/5] nilfs-utils: add additional flags for nilfs_vdesc Andreas Rohner
@ 2015-05-03 10:07   ` Andreas Rohner
  2015-05-03 10:07   ` [PATCH v2 4/5] nilfs-utils: implement the tracking of live blocks for set_suinfo Andreas Rohner
                     ` (2 subsequent siblings)
  14 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:07 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch adds a new feature flag NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
which allows the user to enable and disable the tracking of live
blocks. The flag can be set at file system creation time with mkfs or
at any later time with nilfs-tune.

Additionally a new option NILFS_OPT_TRACK_LIVE_BLKS is added to be
used by the GC. It is set to the same value as
NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS at startup. It is mainly used to
easily and efficiently check for the feature at runtime and to disable
it if the kernel doesn't support it.

It is fully backwards compatible, because the underlying
NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT extension is itself
backwards compatible and basically only tells the kernel to update a
counter for every segment in the SUFILE. If the kernel doesn't support
it, the counter won't be updated and the GC policies depending on that
information will work less efficiently, but they will still work.
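
The dependency between the two compat bits can be sketched like this (a
minimal illustration mirroring nilfs_feature_track_live_blks() from the
patch; the macro values match the on-disk flag definitions):

```c
#include <assert.h>

#define COMPAT_SUFILE_LIVE_BLKS_EXT (1ULL << 0)
#define COMPAT_TRACK_LIVE_BLKS      (1ULL << 1)

/*
 * track_live_blks only works when the SUFILE on-disk extension is also
 * present, so both compat bits must be set in s_feature_compat.
 */
static int track_live_blks_usable(unsigned long long feature_compat)
{
	const unsigned long long required =
		COMPAT_SUFILE_LIVE_BLKS_EXT | COMPAT_TRACK_LIVE_BLKS;

	return (feature_compat & required) == required;
}
```

This is why sufile_live_blks_ext is enabled by default at mkfs time:
track_live_blks can be toggled later with nilfs-tune, but only takes
effect if the extension bit is already present.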

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 include/nilfs.h              | 23 ++++++++++++++++------
 include/nilfs2_fs.h          |  4 +++-
 lib/feature.c                |  2 ++
 lib/nilfs.c                  | 45 +++++++++++++++++++-------------------------
 man/mkfs.nilfs2.8            |  6 ++++++
 sbin/mkfs/mkfs.c             |  3 ++-
 sbin/nilfs-tune/nilfs-tune.c |  4 ++--
 7 files changed, 51 insertions(+), 36 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index f695f48..dbcb76e 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -130,6 +130,7 @@ struct nilfs {
 
 #define NILFS_OPT_MMAP		0x01
 #define NILFS_OPT_SET_SUINFO	0x02
+#define NILFS_OPT_TRACK_LIVE_BLKS	0x04
 
 
 struct nilfs *nilfs_open(const char *, const char *, int);
@@ -137,13 +138,14 @@ void nilfs_close(struct nilfs *);
 
 const char *nilfs_get_dev(const struct nilfs *);
 
-void nilfs_opt_clear_mmap(struct nilfs *);
-int nilfs_opt_set_mmap(struct nilfs *);
-int nilfs_opt_test_mmap(struct nilfs *);
+#define NILFS_DEF_OPT_FLAG(name)					\
+int nilfs_opt_set_##name(struct nilfs *nilfs);				\
+void nilfs_opt_clear_##name(struct nilfs *nilfs);			\
+int nilfs_opt_test_##name(const struct nilfs *nilfs);
 
-void nilfs_opt_clear_set_suinfo(struct nilfs *);
-int nilfs_opt_set_set_suinfo(struct nilfs *);
-int nilfs_opt_test_set_suinfo(struct nilfs *);
+NILFS_DEF_OPT_FLAG(mmap);
+NILFS_DEF_OPT_FLAG(set_suinfo);
+NILFS_DEF_OPT_FLAG(track_live_blks);
 
 nilfs_cno_t nilfs_get_oldest_cno(struct nilfs *);
 
@@ -326,4 +328,13 @@ static inline __u32 nilfs_get_blocks_per_segment(const struct nilfs *nilfs)
 	return le32_to_cpu(nilfs->n_sb->s_blocks_per_segment);
 }
 
+static inline int nilfs_feature_track_live_blks(const struct nilfs *nilfs)
+{
+	const __u64 required_bits = NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
+				    NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT;
+	__u64 fc = le64_to_cpu(nilfs->n_sb->s_feature_compat);
+
+	return ((fc & required_bits) == required_bits);
+}
+
 #endif	/* NILFS_H */
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index efa861c..a5dad2a 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -220,11 +220,13 @@ struct nilfs_super_block {
  * doesn't know about, it should refuse to mount the filesystem.
  */
 #define NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT	(1ULL << 0)
+#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
 
 #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
 
 #define NILFS_FEATURE_COMPAT_SUPP					\
-			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT)
+			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT |	\
+			 NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
diff --git a/lib/feature.c b/lib/feature.c
index ea3cb3d..8154e16 100644
--- a/lib/feature.c
+++ b/lib/feature.c
@@ -57,6 +57,8 @@ static const struct nilfs_feature features[] = {
 	/* Compat features */
 	{ NILFS_FEATURE_TYPE_COMPAT,
 	  NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT, "sufile_live_blks_ext" },
+	{ NILFS_FEATURE_TYPE_COMPAT,
+	  NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS, "track_live_blks" },
 	/* Read-only compat features */
 	{ NILFS_FEATURE_TYPE_COMPAT_RO,
 	  NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT, "block_count" },
diff --git a/lib/nilfs.c b/lib/nilfs.c
index 1d18ffc..23ca381 100644
--- a/lib/nilfs.c
+++ b/lib/nilfs.c
@@ -285,38 +285,28 @@ void nilfs_opt_clear_mmap(struct nilfs *nilfs)
  * nilfs_opt_test_mmap - test whether mmap option is set or not
  * @nilfs: nilfs object
  */
-int nilfs_opt_test_mmap(struct nilfs *nilfs)
+int nilfs_opt_test_mmap(const struct nilfs *nilfs)
 {
 	return !!(nilfs->n_opts & NILFS_OPT_MMAP);
 }
 
-/**
- * nilfs_opt_set_set_suinfo - set set_suinfo option
- * @nilfs: nilfs object
- */
-int nilfs_opt_set_set_suinfo(struct nilfs *nilfs)
-{
-	nilfs->n_opts |= NILFS_OPT_SET_SUINFO;
-	return 0;
+#define NILFS_OPT_FLAG(flag, name)					\
+int nilfs_opt_set_##name(struct nilfs *nilfs)				\
+{									\
+	nilfs->n_opts |= NILFS_OPT_##flag;				\
+	return 0;							\
+}									\
+void nilfs_opt_clear_##name(struct nilfs *nilfs)			\
+{									\
+	nilfs->n_opts &= ~NILFS_OPT_##flag;				\
+}									\
+int nilfs_opt_test_##name(const struct nilfs *nilfs)			\
+{									\
+	return !!(nilfs->n_opts & NILFS_OPT_##flag);			\
 }
 
-/**
- * nilfs_opt_clear_set_suinfo - clear set_suinfo option
- * @nilfs: nilfs object
- */
-void nilfs_opt_clear_set_suinfo(struct nilfs *nilfs)
-{
-	nilfs->n_opts &= ~NILFS_OPT_SET_SUINFO;
-}
-
-/**
- * nilfs_opt_test_set_suinfo - test whether set_suinfo option is set or not
- * @nilfs: nilfs object
- */
-int nilfs_opt_test_set_suinfo(struct nilfs *nilfs)
-{
-	return !!(nilfs->n_opts & NILFS_OPT_SET_SUINFO);
-}
+NILFS_OPT_FLAG(SET_SUINFO, set_suinfo);
+NILFS_OPT_FLAG(TRACK_LIVE_BLKS, track_live_blks);
 
 static int nilfs_open_sem(struct nilfs *nilfs)
 {
@@ -406,6 +396,9 @@ struct nilfs *nilfs_open(const char *dev, const char *dir, int flags)
 			errno = ENOTSUP;
 			goto out_fd;
 		}
+
+		if (nilfs_feature_track_live_blks(nilfs))
+			nilfs_opt_set_track_live_blks(nilfs);
 	}
 
 	if (flags &
diff --git a/man/mkfs.nilfs2.8 b/man/mkfs.nilfs2.8
index f04d6be..e5a3976 100644
--- a/man/mkfs.nilfs2.8
+++ b/man/mkfs.nilfs2.8
@@ -175,6 +175,12 @@ track_live_blks feature to work. Once enabled it cannot be disabled, because
 it changes the on-disk format. Nevertheless it is fully compatible with older
 versions of the file system. This feature is on by default, because it is fully
 backwards compatible and can only be set at file system creation time.
+.TP
+.B track_live_blks
+Enables the tracking of live blocks, which might improve the effectiveness of
+garbage collection, but entails a small runtime overhead. It is important to
+note that this feature depends on sufile_live_blks_ext, which can only be set
+at file system creation time.
 .RE
 .TP
 .B \-q
diff --git a/sbin/mkfs/mkfs.c b/sbin/mkfs/mkfs.c
index 96b944c..bd6182c 100644
--- a/sbin/mkfs/mkfs.c
+++ b/sbin/mkfs/mkfs.c
@@ -1082,7 +1082,8 @@ static inline void check_ctime(time_t ctime)
 
 static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT,
+	NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT |
+	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
diff --git a/sbin/nilfs-tune/nilfs-tune.c b/sbin/nilfs-tune/nilfs-tune.c
index 60f1d39..7889310 100644
--- a/sbin/nilfs-tune/nilfs-tune.c
+++ b/sbin/nilfs-tune/nilfs-tune.c
@@ -84,7 +84,7 @@ static void nilfs_tune_usage(void)
 
 static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	0,
+	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
@@ -93,7 +93,7 @@ static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 
 static const __u64 clear_ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	0,
+	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
-- 
2.3.7



* [PATCH v2 4/5] nilfs-utils: implement the tracking of live blocks for set_suinfo
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (11 preceding siblings ...)
  2015-05-03 10:07   ` [PATCH v2 3/5] nilfs-utils: add support for tracking live blocks Andreas Rohner
@ 2015-05-03 10:07   ` Andreas Rohner
  2015-05-03 10:07   ` [PATCH v2 5/5] nilfs-utils: add support for greedy/cost-benefit policies Andreas Rohner
  2015-05-05  3:09   ` [PATCH v2 0/9] nilfs2: implementation of cost-benefit GC policy Ryusuke Konishi
  14 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:07 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

If the tracking of live blocks is enabled, the information passed to
the kernel with the set_suinfo ioctl must also be modified. To this
end the nilfs_count_nlive_blks() function is introduced. It simply
loops through the vdescv and bdescv vectors and counts the live
blocks belonging to a certain segment. Here the new vdesc flags
introduced earlier come in handy: if the NILFS_VDESC_SNAPSHOT_PROTECTED
flag is set, the block is always counted as live. If it is not set and
NILFS_VDESC_PERIOD_PROTECTED is set instead, the block is counted as
reclaimable.
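
The per-block counting rule can be condensed into a small predicate (a
hedged sketch, not the patch's code; the enum values stand in for the
real vd_blk_flags bits):

```c
#include <assert.h>

/* stand-ins for the vdesc bit-flags introduced in patch 2/5 */
enum {
	SNAPSHOT_PROTECTED = 1 << 0,
	PERIOD_PROTECTED   = 1 << 1,
};

/*
 * A block counts as live if a snapshot protects it, or if it is not
 * merely held back by the GC protection period.  A block that is only
 * period-protected must still be moved, but is counted as reclaimable.
 */
static int counts_as_live(unsigned int flags)
{
	return (flags & SNAPSHOT_PROTECTED) || !(flags & PERIOD_PROTECTED);
}
```

nilfs_count_nlive_blks() effectively sums this predicate over all
blocks of a segment, tallying snapshot-protected blocks separately.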

Additionally the nilfs_xreclaim_segment() function is refactored so
that the set_suinfo part is extracted into its own function,
nilfs_try_set_suinfo(). This is useful because the code gets more
complicated with the new additions.

If the kernel either doesn't support the set_suinfo ioctl or doesn't
support the set_nlive_blks flag, it returns ENOTTY or EINVAL
respectively and the corresponding options are disabled and not used
again.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 include/nilfs.h |   6 ++
 lib/gc.c        | 177 ++++++++++++++++++++++++++++++++++++++++++++------------
 2 files changed, 145 insertions(+), 38 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index dbcb76e..5fd01fc 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -328,6 +328,12 @@ static inline __u32 nilfs_get_blocks_per_segment(const struct nilfs *nilfs)
 	return le32_to_cpu(nilfs->n_sb->s_blocks_per_segment);
 }
 
+static inline __u64
+nilfs_get_segnum_of_block(const struct nilfs *nilfs, sector_t blocknr)
+{
+	return blocknr / nilfs_get_blocks_per_segment(nilfs);
+}
+
 static inline int nilfs_feature_track_live_blks(const struct nilfs *nilfs)
 {
 	const __u64 required_bits = NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
diff --git a/lib/gc.c b/lib/gc.c
index a830e45..3e3bb2c 100644
--- a/lib/gc.c
+++ b/lib/gc.c
@@ -619,6 +619,130 @@ static int nilfs_toss_bdescs(struct nilfs_vector *bdescv)
 }
 
 /**
+ * nilfs_count_nlive_blks - returns the number of live blocks in segnum
+ * @nilfs: nilfs object
+ * @segnum: segment number
+ * @bdescv: vector object storing (descriptors of) disk block numbers
+ * @vdescv: vector object storing (descriptors of) virtual block numbers
+ */
+static size_t nilfs_count_nlive_blks(const struct nilfs *nilfs,
+				     __u64 segnum,
+				     struct nilfs_vector *vdescv,
+				     struct nilfs_vector *bdescv,
+				     size_t *pnss)
+{
+	struct nilfs_vdesc *vdesc;
+	struct nilfs_bdesc *bdesc;
+	int i;
+	size_t res = 0, nss = 0;
+
+	for (i = 0; i < nilfs_vector_get_size(bdescv); i++) {
+		bdesc = nilfs_vector_get_element(bdescv, i);
+		assert(bdesc != NULL);
+
+		if (nilfs_get_segnum_of_block(nilfs, bdesc->bd_blocknr) ==
+		    segnum && nilfs_bdesc_is_live(bdesc))
+			++res;
+	}
+
+	for (i = 0; i < nilfs_vector_get_size(vdescv); i++) {
+		vdesc = nilfs_vector_get_element(vdescv, i);
+		assert(vdesc != NULL);
+
+		if (nilfs_get_segnum_of_block(nilfs, vdesc->vd_blocknr) ==
+		    segnum && (nilfs_vdesc_snapshot_protected(vdesc) ||
+		    !nilfs_vdesc_period_protected(vdesc))) {
+			++res;
+			if (nilfs_vdesc_snapshot_protected(vdesc))
+				++nss;
+		}
+	}
+
+	if (pnss)
+		*pnss = nss;
+
+	return res;
+}
+
+/**
+ * nilfs_try_set_suinfo - wrapper for nilfs_set_suinfo
+ * @nilfs: nilfs object
+ * @segnums: array of segment numbers storing selected segments
+ * @nsegs: size of the @segnums array
+ * @vdescv: vector object storing (descriptors of) virtual block numbers
+ * @bdescv: vector object storing (descriptors of) disk block numbers
+ *
+ * Description: nilfs_try_set_suinfo() prepares the input data structure
+ * for nilfs_set_suinfo(). If the kernel doesn't support the
+ * NILFS_IOCTL_SET_SUINFO ioctl, errno is set to ENOTTY and the set_suinfo
+ * option is cleared to prevent future calls to nilfs_try_set_suinfo().
+ * Similarly if the SUFILE extension is not supported by the kernel,
+ * errno is set to EINVAL and the track_live_blks option is disabled.
+ *
+ * Return Value: On success, zero is returned.  On error, a negative value
+ * is returned. If errno is set to ENOTTY or EINVAL, the kernel doesn't support
+ * the current configuration for nilfs_set_suinfo().
+ */
+static int nilfs_try_set_suinfo(struct nilfs *nilfs, __u64 *segnums,
+		size_t nsegs, struct nilfs_vector *vdescv,
+		struct nilfs_vector *bdescv)
+{
+	struct nilfs_vector *supv;
+	struct nilfs_suinfo_update *sup;
+	struct timeval tv;
+	int ret = -1;
+	size_t i, nblocks, nss;
+
+	supv = nilfs_vector_create(sizeof(struct nilfs_suinfo_update));
+	if (!supv)
+		goto out;
+
+	ret = gettimeofday(&tv, NULL);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < nsegs; ++i) {
+		sup = nilfs_vector_get_new_element(supv);
+		if (!sup) {
+			ret = -1;
+			goto out;
+		}
+
+		sup->sup_segnum = segnums[i];
+		sup->sup_flags = 0;
+		nilfs_suinfo_update_set_lastmod(sup);
+		sup->sup_sui.sui_lastmod = tv.tv_sec;
+
+		if (nilfs_opt_test_track_live_blks(nilfs)) {
+			nilfs_suinfo_update_set_nlive_blks(sup);
+			nilfs_suinfo_update_set_nsnapshot_blks(sup);
+
+			nblocks = nilfs_count_nlive_blks(nilfs,
+					segnums[i], vdescv, bdescv, &nss);
+			sup->sup_sui.sui_nlive_blks = nblocks;
+			sup->sup_sui.sui_nsnapshot_blks = nss;
+		}
+	}
+
+	ret = nilfs_set_suinfo(nilfs, nilfs_vector_get_data(supv), nsegs);
+	if (ret < 0) {
+		if (errno == ENOTTY) {
+			nilfs_gc_logger(LOG_WARNING,
+					"set_suinfo ioctl is not supported");
+			nilfs_opt_clear_set_suinfo(nilfs);
+		} else if (errno == EINVAL) {
+			nilfs_gc_logger(LOG_WARNING,
+					"sufile extension is not supported");
+			nilfs_opt_clear_track_live_blks(nilfs);
+		}
+	}
+
+out:
+	nilfs_vector_destroy(supv);
+	return ret;
+}
+
+/**
  * nilfs_xreclaim_segment - reclaim segments (enhanced API)
  * @nilfs: nilfs object
  * @segnums: array of segment numbers storing selected segments
@@ -632,14 +756,12 @@ int nilfs_xreclaim_segment(struct nilfs *nilfs,
 			   const struct nilfs_reclaim_params *params,
 			   struct nilfs_reclaim_stat *stat)
 {
-	struct nilfs_vector *vdescv, *bdescv, *periodv, *vblocknrv, *supv;
+	struct nilfs_vector *vdescv, *bdescv, *periodv, *vblocknrv;
 	sigset_t sigset, oldset, waitset;
 	nilfs_cno_t protcno;
-	ssize_t n, i, ret = -1;
+	ssize_t n, ret = -1;
 	size_t nblocks;
 	__u32 reclaimable_blocks;
-	struct nilfs_suinfo_update *sup;
-	struct timeval tv;
 
 	if (!(params->flags & NILFS_RECLAIM_PARAM_PROTSEQ) ||
 	    (params->flags & (~0UL << __NR_NILFS_RECLAIM_PARAMS))) {
@@ -658,8 +780,7 @@ int nilfs_xreclaim_segment(struct nilfs *nilfs,
 	bdescv = nilfs_vector_create(sizeof(struct nilfs_bdesc));
 	periodv = nilfs_vector_create(sizeof(struct nilfs_period));
 	vblocknrv = nilfs_vector_create(sizeof(__u64));
-	supv = nilfs_vector_create(sizeof(struct nilfs_suinfo_update));
-	if (!vdescv || !bdescv || !periodv || !vblocknrv || !supv)
+	if (!vdescv || !bdescv || !periodv || !vblocknrv)
 		goto out_vec;
 
 	sigemptyset(&sigset);
@@ -757,46 +878,27 @@ int nilfs_xreclaim_segment(struct nilfs *nilfs,
 	if ((params->flags & NILFS_RECLAIM_PARAM_MIN_RECLAIMABLE_BLKS) &&
 			nilfs_opt_test_set_suinfo(nilfs) &&
 			reclaimable_blocks < params->min_reclaimable_blks * n) {
-		if (stat) {
-			stat->deferred_segs = n;
-			stat->cleaned_segs = 0;
-		}
 
-		ret = gettimeofday(&tv, NULL);
-		if (ret < 0)
+		ret = nilfs_try_set_suinfo(nilfs, segnums, n, vdescv, bdescv);
+		if (ret == 0) {
+			if (stat) {
+				stat->deferred_segs = n;
+				stat->cleaned_segs = 0;
+			}
 			goto out_lock;
-
-		for (i = 0; i < n; ++i) {
-			sup = nilfs_vector_get_new_element(supv);
-			if (!sup)
-				goto out_lock;
-
-			sup->sup_segnum = segnums[i];
-			sup->sup_flags = 0;
-			nilfs_suinfo_update_set_lastmod(sup);
-			sup->sup_sui.sui_lastmod = tv.tv_sec;
 		}
 
-		ret = nilfs_set_suinfo(nilfs, nilfs_vector_get_data(supv), n);
-
-		if (ret == 0)
-			goto out_lock;
-
-		if (ret < 0 && errno != ENOTTY) {
+		if (ret < 0 && errno != ENOTTY && errno != EINVAL) {
 			nilfs_gc_logger(LOG_ERR, "cannot set suinfo: %s",
 					strerror(errno));
 			goto out_lock;
 		}
 
-		/* errno == ENOTTY */
-		nilfs_gc_logger(LOG_WARNING,
-				"set_suinfo ioctl is not supported");
-		nilfs_opt_clear_set_suinfo(nilfs);
-		if (stat) {
-			stat->deferred_segs = 0;
-			stat->cleaned_segs = n;
-		}
-		/* Try nilfs_clean_segments */
+		/*
+		 * errno == ENOTTY || errno == EINVAL
+		 * nilfs_try_set_suinfo() failed because it is not supported
+		 * so try nilfs_clean_segments() instead
+		 */
 	}
 
 	ret = nilfs_clean_segments(nilfs,
@@ -829,7 +931,6 @@ out_vec:
 	nilfs_vector_destroy(bdescv);
 	nilfs_vector_destroy(periodv);
 	nilfs_vector_destroy(vblocknrv);
-	nilfs_vector_destroy(supv);
 	/*
 	 * Flags of valid fields in stat->exflags must be unset.
 	 */
-- 
2.3.7



* [PATCH v2 5/5] nilfs-utils: add support for greedy/cost-benefit policies
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (12 preceding siblings ...)
  2015-05-03 10:07   ` [PATCH v2 4/5] nilfs-utils: implement the tracking of live blocks for set_suinfo Andreas Rohner
@ 2015-05-03 10:07   ` Andreas Rohner
  2015-05-05  3:09   ` [PATCH v2 0/9] nilfs2: implementation of cost-benefit GC policy Ryusuke Konishi
  14 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-03 10:07 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

This patch implements the cost-benefit and greedy GC policies, which are
well-known policies for log-structured file systems [1].

* Greedy:
  Select the segments with the most reclaimable space.
* Cost-Benefit [1]:
  Perform a cost-benefit analysis, whereby the reclaimable space
  gained is weighed against the cost of collecting the segment.

Since cost-benefit in particular needs more information than is available
in nilfs_suinfo, a few extra parameters are added to the policy callback
function prototype. The flag p_comparison is added to indicate how the
importance values should be interpreted. For example, for the timestamp
policy smaller values mean older timestamps, which is better. For greedy
and cost-benefit, on the other hand, higher values are better.
nilfs_cleanerd_select_segments() was updated accordingly.

The threshold in nilfs_cleanerd_select_segments() can no longer default
to sustat->ss_nongc_ctime, because the greedy and cost-benefit policies
do not return a timestamp, so their importance values cannot be compared
to it. Instead, segments that are younger than sustat->ss_nongc_ctime
are always excluded.

[1] Mendel Rosenblum and John K. Ousterhout. The design and implementation
of a log-structured file system. ACM Trans. Comput. Syst., 10(1):26–52,
February 1992.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 sbin/cleanerd/cldconfig.c | 79 +++++++++++++++++++++++++++++++++++++++++++++--
 sbin/cleanerd/cldconfig.h | 22 +++++++++----
 sbin/cleanerd/cleanerd.c  | 43 ++++++++++++++++++++------
 3 files changed, 126 insertions(+), 18 deletions(-)

diff --git a/sbin/cleanerd/cldconfig.c b/sbin/cleanerd/cldconfig.c
index c8b197b..68090e9 100644
--- a/sbin/cleanerd/cldconfig.c
+++ b/sbin/cleanerd/cldconfig.c
@@ -380,7 +380,9 @@ nilfs_cldconfig_handle_clean_check_interval(struct nilfs_cldconfig *config,
 }
 
 static unsigned long long
-nilfs_cldconfig_selection_policy_timestamp(const struct nilfs_suinfo *si)
+nilfs_cldconfig_selection_policy_timestamp(const struct nilfs_suinfo *si,
+					   const struct nilfs_sustat *sustat,
+					   __u64 prottime)
 {
 	return si->sui_lastmod;
 }
@@ -392,13 +394,84 @@ nilfs_cldconfig_handle_selection_policy_timestamp(struct nilfs_cldconfig *config
 	config->cf_selection_policy.p_importance =
 		NILFS_CLDCONFIG_SELECTION_POLICY_IMPORTANCE;
 	config->cf_selection_policy.p_threshold =
-		NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD;
+		NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD;
+	config->cf_selection_policy.p_comparison =
+		NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER;
+	return 0;
+}
+
+static unsigned long long
+nilfs_cldconfig_selection_policy_greedy(const struct nilfs_suinfo *si,
+					const struct nilfs_sustat *sustat,
+					__u64 prottime)
+{
+	if (si->sui_nblocks < si->sui_nlive_blks ||
+	    si->sui_nlive_lastmod >= prottime)
+		return 0;
+
+	return si->sui_nblocks - si->sui_nlive_blks;
+}
+
+static int
+nilfs_cldconfig_handle_selection_policy_greedy(struct nilfs_cldconfig *config,
+					       char **tokens, size_t ntoks)
+{
+	config->cf_selection_policy.p_importance =
+		nilfs_cldconfig_selection_policy_greedy;
+	config->cf_selection_policy.p_threshold =
+		NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD;
+	config->cf_selection_policy.p_comparison =
+		NILFS_CLDCONFIG_SELECTION_POLICY_BIGGER_IS_BETTER;
+	return 0;
+}
+
+static unsigned long long
+nilfs_cldconfig_selection_policy_cost_benefit(const struct nilfs_suinfo *si,
+					      const struct nilfs_sustat *sustat,
+					      __u64 prottime)
+{
+	__u32 free_blocks, cleaning_cost;
+	unsigned long long age;
+
+	if (si->sui_nblocks < si->sui_nlive_blks ||
+	    sustat->ss_nongc_ctime < si->sui_lastmod ||
+	    si->sui_nlive_lastmod >= prottime)
+		return 0;
+
+	free_blocks = si->sui_nblocks - si->sui_nlive_blks;
+	/* read the whole segment + write the live blocks */
+	cleaning_cost = 2 * si->sui_nlive_blks;
+	/*
+	 * multiply by 1000 to convert age to milliseconds
+	 * (higher precision for division)
+	 */
+	age = (sustat->ss_nongc_ctime - si->sui_lastmod) * 1000;
+
+	if (cleaning_cost == 0)
+		cleaning_cost = 1;
+
+	return (age * free_blocks) / cleaning_cost;
+}
+
+static int
+nilfs_cldconfig_handle_selection_policy_cost_benefit(
+						struct nilfs_cldconfig *config,
+						char **tokens, size_t ntoks)
+{
+	config->cf_selection_policy.p_importance =
+		nilfs_cldconfig_selection_policy_cost_benefit;
+	config->cf_selection_policy.p_threshold =
+		NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD;
+	config->cf_selection_policy.p_comparison =
+		NILFS_CLDCONFIG_SELECTION_POLICY_BIGGER_IS_BETTER;
 	return 0;
 }
 
 static const struct nilfs_cldconfig_polhandle
 nilfs_cldconfig_polhandle_table[] = {
 	{"timestamp",	nilfs_cldconfig_handle_selection_policy_timestamp},
+	{"greedy",	nilfs_cldconfig_handle_selection_policy_greedy},
+	{"cost-benefit", nilfs_cldconfig_handle_selection_policy_cost_benefit},
 };
 
 #define NILFS_CLDCONFIG_NPOLHANDLES			\
@@ -690,6 +763,8 @@ static void nilfs_cldconfig_set_default(struct nilfs_cldconfig *config,
 		NILFS_CLDCONFIG_SELECTION_POLICY_IMPORTANCE;
 	config->cf_selection_policy.p_threshold =
 		NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD;
+	config->cf_selection_policy.p_comparison =
+		NILFS_CLDCONFIG_SELECTION_POLICY_COMPARISON;
 	config->cf_protection_period.tv_sec = NILFS_CLDCONFIG_PROTECTION_PERIOD;
 	config->cf_protection_period.tv_usec = 0;
 
diff --git a/sbin/cleanerd/cldconfig.h b/sbin/cleanerd/cldconfig.h
index 2a0af5f..3c9f5e6 100644
--- a/sbin/cleanerd/cldconfig.h
+++ b/sbin/cleanerd/cldconfig.h
@@ -30,16 +30,22 @@
 #include <sys/time.h>
 #include <syslog.h>
 
+struct nilfs;
+struct nilfs_sustat;
 struct nilfs_suinfo;
 
 /**
  * struct nilfs_selection_policy -
- * @p_importance:
- * @p_threshold:
+ * @p_importance: function to calculate the importance for the policy
+ * @p_threshold: segments with lower/higher importance are ignored
+ * @p_comparison: flag that indicates how to sort the importance
  */
 struct nilfs_selection_policy {
-	unsigned long long (*p_importance)(const struct nilfs_suinfo *);
+	unsigned long long (*p_importance)(const struct nilfs_suinfo *,
+					   const struct nilfs_sustat *,
+					   __u64);
 	unsigned long long p_threshold;
+	int p_comparison;
 };
 
 /**
@@ -111,9 +117,15 @@ struct nilfs_cldconfig {
 	unsigned long cf_mc_min_reclaimable_blocks;
 };
 
+#define NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER	0
+#define NILFS_CLDCONFIG_SELECTION_POLICY_BIGGER_IS_BETTER	1
+#define NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD	0
 #define NILFS_CLDCONFIG_SELECTION_POLICY_IMPORTANCE	\
 			nilfs_cldconfig_selection_policy_timestamp
-#define NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD	0
+#define NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD	\
+			NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD
+#define NILFS_CLDCONFIG_SELECTION_POLICY_COMPARISON	\
+			NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER
 #define NILFS_CLDCONFIG_PROTECTION_PERIOD		3600
 #define NILFS_CLDCONFIG_MIN_CLEAN_SEGMENTS		10
 #define NILFS_CLDCONFIG_MIN_CLEAN_SEGMENTS_UNIT		NILFS_SIZE_UNIT_PERCENT
@@ -135,8 +147,6 @@ struct nilfs_cldconfig {
 
 #define NILFS_CLDCONFIG_NSEGMENTS_PER_CLEAN_MAX	32
 
-struct nilfs;
-
 int nilfs_cldconfig_read(struct nilfs_cldconfig *config, const char *path,
 			 struct nilfs *nilfs);
 
diff --git a/sbin/cleanerd/cleanerd.c b/sbin/cleanerd/cleanerd.c
index d37bd5c..e0741f1 100644
--- a/sbin/cleanerd/cleanerd.c
+++ b/sbin/cleanerd/cleanerd.c
@@ -417,7 +417,7 @@ static void nilfs_cleanerd_destroy(struct nilfs_cleanerd *cleanerd)
 	free(cleanerd);
 }
 
-static int nilfs_comp_segimp(const void *elem1, const void *elem2)
+static int nilfs_comp_segimp_asc(const void *elem1, const void *elem2)
 {
 	const struct nilfs_segimp *segimp1 = elem1, *segimp2 = elem2;
 
@@ -429,6 +429,18 @@ static int nilfs_comp_segimp(const void *elem1, const void *elem2)
 	return (segimp1->si_segnum < segimp2->si_segnum) ? -1 : 1;
 }
 
+static int nilfs_comp_segimp_desc(const void *elem1, const void *elem2)
+{
+	const struct nilfs_segimp *segimp1 = elem1, *segimp2 = elem2;
+
+	if (segimp1->si_importance > segimp2->si_importance)
+		return -1;
+	else if (segimp1->si_importance < segimp2->si_importance)
+		return 1;
+
+	return (segimp1->si_segnum < segimp2->si_segnum) ? -1 : 1;
+}
+
 static int nilfs_cleanerd_automatic_suspend(struct nilfs_cleanerd *cleanerd)
 {
 	return cleanerd->config.cf_min_clean_segments > 0;
@@ -580,7 +592,7 @@ nilfs_cleanerd_select_segments(struct nilfs_cleanerd *cleanerd,
 	size_t count, nsegs;
 	ssize_t nssegs, n;
 	unsigned long long imp, thr;
-	int i;
+	int i, sib;
 
 	nsegs = nilfs_cleanerd_ncleansegs(cleanerd);
 	nilfs = cleanerd->nilfs;
@@ -600,11 +612,17 @@ nilfs_cleanerd_select_segments(struct nilfs_cleanerd *cleanerd,
 	prottime = tv2.tv_sec;
 	oldest = tv.tv_sec;
 
-	/* The segments that have larger importance than thr are not
+	/*
+	 * sufile extension fields may not be initialized by
+	 * nilfs_get_suinfo()
+	 */
+	memset(si, 0, sizeof(si));
+
+	/* The segments that have larger/smaller importance than thr are not
 	 * selected. */
-	thr = (config->cf_selection_policy.p_threshold != 0) ?
-		config->cf_selection_policy.p_threshold :
-		sustat->ss_nongc_ctime;
+	thr = config->cf_selection_policy.p_threshold;
+	sib = config->cf_selection_policy.p_comparison ==
+			NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER;
 
 	for (segnum = 0; segnum < sustat->ss_nsegs; segnum += n) {
 		count = min_t(__u64, sustat->ss_nsegs - segnum,
@@ -615,11 +633,13 @@ nilfs_cleanerd_select_segments(struct nilfs_cleanerd *cleanerd,
 			goto out;
 		}
 		for (i = 0; i < n; i++) {
-			if (!nilfs_suinfo_reclaimable(&si[i]))
+			if (!nilfs_suinfo_reclaimable(&si[i]) ||
+				si[i].sui_lastmod >= sustat->ss_nongc_ctime)
 				continue;
 
-			imp = config->cf_selection_policy.p_importance(&si[i]);
-			if (imp < thr) {
+			imp = config->cf_selection_policy.p_importance(&si[i],
+					sustat, prottime);
+			if (!thr || (sib && imp < thr) || (!sib && imp > thr)) {
 				if (si[i].sui_lastmod < oldest)
 					oldest = si[i].sui_lastmod;
 				if (si[i].sui_lastmod < prottime) {
@@ -642,7 +662,10 @@ nilfs_cleanerd_select_segments(struct nilfs_cleanerd *cleanerd,
 			break;
 		}
 	}
-	nilfs_vector_sort(smv, nilfs_comp_segimp);
+	if (sib)
+		nilfs_vector_sort(smv, nilfs_comp_segimp_asc);
+	else
+		nilfs_vector_sort(smv, nilfs_comp_segimp_desc);
 
 	nssegs = (nilfs_vector_get_size(smv) < nsegs) ?
 		nilfs_vector_get_size(smv) : nsegs;
-- 
2.3.7



* Re: [PATCH v2 0/9] nilfs2: implementation of cost-benefit GC policy
       [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (13 preceding siblings ...)
  2015-05-03 10:07   ` [PATCH v2 5/5] nilfs-utils: add support for greedy/cost-benefit policies Andreas Rohner
@ 2015-05-05  3:09   ` Ryusuke Konishi
  14 siblings, 0 replies; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-05  3:09 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi Andreas,
On Sun,  3 May 2015 12:05:13 +0200, Andreas Rohner wrote:
> Hello,
> 
> This is an updated version based on the review of Ryusuke Konishi. It is
> a complete rewrite of the first version and the implementation is much 
> simpler and cleaner.
> 
> I include here a copy of my cover letter from the first version:
> 
> One of the biggest performance problems of NILFS is its
> inefficient Timestamp GC policy. This patch set introduces two new GC
> policies, namely Cost-Benefit and Greedy.
> 
> The Cost-Benefit policy is nothing new. It has been around for a long
> time with log-structured file systems [1]. But it relies on accurate
> information about the number of live blocks in a segment. NILFS
> currently does not provide the necessary information. So this patch set
> extends the entries in the SUFILE to include a counter for the number of
> live blocks. This counter is decremented whenever a file is deleted or
> overwritten.
> 
> Except for some tricky parts, the counting of live blocks is quite
> trivial. The problem is snapshots. At any time, a checkpoint can be
> turned into a snapshot or vice versa. So blocks that are reclaimable at
> one point in time are protected by a snapshot a moment later.
> 
> This patch set does not try to track snapshots at all. Instead it uses a
> heuristic approach to prevent the worst case scenario. The performance
> is still significantly better than timestamp for my benchmarks.
> 
> The worst case scenario is the following:
> 
> 1. Segment 1 is written
> 2. Snapshot is created
> 3. GC tries to reclaim Segment 1, but all blocks are protected
>    by the Snapshot. The GC has to set the number of live blocks
>    to maximum to avoid reclaiming this Segment again in the near future.
> 4. Snapshot is deleted
> 5. Segment 1 is reclaimable, but its counter is so high, that the GC
>    will never try to reclaim it again.
> 
> To prevent this kind of starvation I use another field in the SUFILE
> entry, to store the number of blocks that are protected by a snapshot.
> This value is just a heuristic and it is usually set to 0. Only if the
> GC reclaims a segment, it is written to the SUFILE entry. The GC has to
> check for snapshots anyway, so we get this information for free. By
> storing this information in the SUFILE we can avoid starvation in the
> following way:
> 
> 1. Segment 1 is written
> 2. Snapshot is created
> 3. GC tries to reclaim Segment 1, but all blocks are protected
>    by the Snapshot. The GC has to set the number of live blocks
>    to maximum to avoid reclaiming this Segment again in the near future.
> 4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
>    entry
> 5. Snapshot is deleted
> 6. On Snapshot deletion we walk through every entry in the SUFILE and
>    reduce the number of live blocks to half, if the number of snapshot
>    blocks is bigger than half of the maximum.
> 7. Segment 1 is reclaimable and the number of live blocks entry is at
>    half the maximum. The GC will try to reclaim this segment as soon as
>    there are no other better choices.
> 
> BENCHMARKS:
> -----------
> 
> My benchmark is quite simple. It consists of a process that replays
> real NFS traces at a faster speed. It thereby creates relatively
> realistic patterns of file creation and deletions. At the same time
> multiple snapshots are created and deleted in parallel. I use a 100GB
> partition of a Samsung SSD:
> 
> WITH SNAPSHOTS EVERY 5 MINUTES:
> --------------------------------------------------------------------
>                 Execution time       Wear (Data written to disk)
> Timestamp:      100%                 100%
> Cost-Benefit:   80%                  43%
> 
> NO SNAPSHOTS:
> ---------------------------------------------------------------------
>                 Execution time       Wear (Data written to disk)
> Timestamp:      100%                 100%
> Cost-Benefit:   70%                  45%
> 
> I plan on adding more benchmark results soon.
> 
> Best regards,
> Andreas Rohner

Thanks for your effort.

I'm now reviewing the kernel patches.  Please wait for a while.

Regards,
Ryusuke Konishi


* Re: [PATCH v2 1/9] nilfs2: copy file system feature flags to the nilfs object
       [not found]     ` <1430647522-14304-2-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-09  1:54       ` Ryusuke Konishi
       [not found]         ` <20150509.105445.1816655707671265145.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-09  1:54 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun,  3 May 2015 12:05:14 +0200, Andreas Rohner wrote:
> This patch adds three new attributes to the nilfs object, which contain
> a copy of the feature flags from the super block. This can be used, to
> efficiently test whether file system feature flags are set or not.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/the_nilfs.c | 4 ++++
>  fs/nilfs2/the_nilfs.h | 8 ++++++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
> index 69bd801..606fdfc 100644
> --- a/fs/nilfs2/the_nilfs.c
> +++ b/fs/nilfs2/the_nilfs.c
> @@ -630,6 +630,10 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data)
>  	get_random_bytes(&nilfs->ns_next_generation,
>  			 sizeof(nilfs->ns_next_generation));
>  
> +	nilfs->ns_feature_compat = le64_to_cpu(sbp->s_feature_compat);
> +	nilfs->ns_feature_compat_ro = le64_to_cpu(sbp->s_feature_compat_ro);
> +	nilfs->ns_feature_incompat = le64_to_cpu(sbp->s_feature_incompat);

Consider moving these initializations to just before the call to
nilfs_check_feature_compatibility().

It uses compat flags, and I'd like to unfold the function using these
internal variables sometime.

> +
>  	err = nilfs_store_disk_layout(nilfs, sbp);
>  	if (err)
>  		goto failed_sbh;
> diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
> index 23778d3..12cd91d 100644
> --- a/fs/nilfs2/the_nilfs.h
> +++ b/fs/nilfs2/the_nilfs.h
> @@ -101,6 +101,9 @@ enum {
>   * @ns_dev_kobj: /sys/fs/<nilfs>/<device>
>   * @ns_dev_kobj_unregister: completion state
>   * @ns_dev_subgroups: <device> subgroups pointer
> + * @ns_feature_compat: Compatible feature set
> + * @ns_feature_compat_ro: Read-only compatible feature set
> + * @ns_feature_incompat: Incompatible feature set
>   */
>  struct the_nilfs {
>  	unsigned long		ns_flags;
> @@ -201,6 +204,11 @@ struct the_nilfs {
>  	struct kobject ns_dev_kobj;
>  	struct completion ns_dev_kobj_unregister;
>  	struct nilfs_sysfs_dev_subgroups *ns_dev_subgroups;
> +
> +	/* Features */
> +	__u64                   ns_feature_compat;
> +	__u64                   ns_feature_compat_ro;
> +	__u64                   ns_feature_incompat;
>  };
>  
>  #define THE_NILFS_FNS(bit, name)					\
> -- 
> 2.3.7
> 


* Re: [PATCH v2 2/9] nilfs2: extend SUFILE on-disk format to enable tracking of live blocks
       [not found]     ` <1430647522-14304-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-09  2:24       ` Ryusuke Konishi
       [not found]         ` <20150509.112403.380867861504859109.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-09  2:24 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun,  3 May 2015 12:05:15 +0200, Andreas Rohner wrote:
> This patch extends the nilfs_segment_usage structure with two extra
> fields. This changes the on-disk format of the SUFILE, but the NILFS2
> metadata files are flexible enough, so that there are no compatibility
> issues. The extension is fully backwards compatible. Nevertheless a
> feature compatibility flag was added to indicate the on-disk format
> change.
> 
> The new field su_nlive_blks is used to track the number of live blocks
> in the corresponding segment. Its value should always be smaller than
> su_nblocks, which contains the total number of blocks in the segment.
> 
> The field su_nlive_lastmod is necessary because of the protection period
> used by the GC. It is a timestamp, which contains the last time
> su_nlive_blks was modified. For example if a file is deleted, its
> blocks are subtracted from su_nlive_blks and are therefore considered to
> be reclaimable by the kernel. But the GC additionally protects them with
> the protection period. So while su_nlive_blks contains the number of
> potentially reclaimable blocks, the actual number depends on the
> protection period. To enable GC policies to effectively choose or prefer
> segments with unprotected blocks, the timestamp in su_nlive_lastmod is
> necessary.
> 
> The new field su_nsnapshot_blks contains the number of blocks in a
> segment that are protected by a snapshot. The value is meant to be a
> heuristic for the GC and is not necessarily always accurate.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/ioctl.c         |  4 +--
>  fs/nilfs2/sufile.c        | 45 +++++++++++++++++++++++++++++++--
>  fs/nilfs2/sufile.h        |  6 +++++
>  include/linux/nilfs2_fs.h | 63 +++++++++++++++++++++++++++++++++++++++++------
>  4 files changed, 106 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
> index 9a20e51..f6ee54e 100644
> --- a/fs/nilfs2/ioctl.c
> +++ b/fs/nilfs2/ioctl.c
> @@ -1250,7 +1250,7 @@ static int nilfs_ioctl_set_suinfo(struct inode *inode, struct file *filp,
>  		goto out;
>  
>  	ret = -EINVAL;
> -	if (argv.v_size < sizeof(struct nilfs_suinfo_update))
> +	if (argv.v_size < NILFS_MIN_SUINFO_UPDATE_SIZE)
>  		goto out;
>  
>  	if (argv.v_nmembs > nilfs->ns_nsegments)
> @@ -1316,7 +1316,7 @@ long nilfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  		return nilfs_ioctl_get_cpstat(inode, filp, cmd, argp);
>  	case NILFS_IOCTL_GET_SUINFO:
>  		return nilfs_ioctl_get_info(inode, filp, cmd, argp,
> -					    sizeof(struct nilfs_suinfo),
> +					    NILFS_MIN_SEGMENT_USAGE_SIZE,
>  					    nilfs_ioctl_do_get_suinfo);
>  	case NILFS_IOCTL_SET_SUINFO:
>  		return nilfs_ioctl_set_suinfo(inode, filp, cmd, argp);
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index 2a869c3..1cce358 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -453,6 +453,11 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 segnum,
>  	su->su_lastmod = cpu_to_le64(0);
>  	su->su_nblocks = cpu_to_le32(0);
>  	su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
> +	if (nilfs_sufile_live_blks_ext_supported(sufile)) {
> +		su->su_nlive_blks = cpu_to_le32(0);
> +		su->su_nsnapshot_blks = cpu_to_le32(0);
> +		su->su_nlive_lastmod = cpu_to_le64(0);
> +	}
>  	kunmap_atomic(kaddr);
>  
>  	nilfs_sufile_mod_counter(header_bh, clean ? (u64)-1 : 0, dirty ? 0 : 1);
> @@ -482,7 +487,7 @@ void nilfs_sufile_do_free(struct inode *sufile, __u64 segnum,
>  	WARN_ON(!nilfs_segment_usage_dirty(su));
>  
>  	sudirty = nilfs_segment_usage_dirty(su);
> -	nilfs_segment_usage_set_clean(su);
> +	nilfs_segment_usage_set_clean(su, NILFS_MDT(sufile)->mi_entry_size);
>  	kunmap_atomic(kaddr);
>  	mark_buffer_dirty(su_bh);
>  
> @@ -698,7 +703,7 @@ static int nilfs_sufile_truncate_range(struct inode *sufile,
>  		nc = 0;
>  		for (su = su2, j = 0; j < n; j++, su = (void *)su + susz) {
>  			if (nilfs_segment_usage_error(su)) {
> -				nilfs_segment_usage_set_clean(su);
> +				nilfs_segment_usage_set_clean(su, susz);
>  				nc++;
>  			}
>  		}
> @@ -821,6 +826,8 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
>  	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
>  	void *kaddr;
>  	unsigned long nsegs, segusages_per_block;
> +	__u64 lm = 0;
> +	__u32 nlb = 0, nsb = 0;
>  	ssize_t n;
>  	int ret, i, j;
>  
> @@ -858,6 +865,18 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
>  			if (nilfs_segment_is_active(nilfs, segnum + j))
>  				si->sui_flags |=
>  					(1UL << NILFS_SEGMENT_USAGE_ACTIVE);
> +
> +			if (susz >= NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE) {
> +				nlb = le32_to_cpu(su->su_nlive_blks);
> +				nsb = le32_to_cpu(su->su_nsnapshot_blks);
> +				lm = le64_to_cpu(su->su_nlive_lastmod);
> +			}
> +
> +			if (sisz >= NILFS_LIVE_BLKS_EXT_SUINFO_SIZE) {
> +				si->sui_nlive_blks = nlb;
> +				si->sui_nsnapshot_blks = nsb;
> +				si->sui_nlive_lastmod = lm;
> +			}
>  		}
>  		kunmap_atomic(kaddr);
>  		brelse(su_bh);
> @@ -901,6 +920,9 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>  	int cleansi, cleansu, dirtysi, dirtysu;
>  	long ncleaned = 0, ndirtied = 0;
>  	int ret = 0;
> +	bool sup_ext = (supsz >= NILFS_LIVE_BLKS_EXT_SUINFO_UPDATE_SIZE);
> +	bool su_ext = nilfs_sufile_live_blks_ext_supported(sufile);
> +	bool supsu_ext = sup_ext && su_ext;

These boolean variables determine the control flow.  For these, more
intuitive names are preferable.  For instance:

  - sup_ext -> suinfo_extended
  - su_ext -> su_extended
  - supsu_ext -> both_extended

>  
>  	if (unlikely(nsup == 0))
>  		return ret;
> @@ -911,6 +933,13 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>  				(~0UL << __NR_NILFS_SUINFO_UPDATE_FIELDS))
>  			|| (nilfs_suinfo_update_nblocks(sup) &&
>  				sup->sup_sui.sui_nblocks >
> +				nilfs->ns_blocks_per_segment)
> +			|| (nilfs_suinfo_update_nlive_blks(sup) && sup_ext &&
> +				sup->sup_sui.sui_nlive_blks >
> +				nilfs->ns_blocks_per_segment)
> +			|| (nilfs_suinfo_update_nsnapshot_blks(sup) &&
> +				sup_ext &&
> +				sup->sup_sui.sui_nsnapshot_blks >
>  				nilfs->ns_blocks_per_segment))
>  			return -EINVAL;
>  	}

Testing sup_ext repeatedly is pointless since it increases branches.
Consider moving it forward as follows:

        for (sup = buf; sup < supend; sup = (void *)sup + supsz) {
                if (sup->sup_segnum >= nilfs->ns_nsegments
                    || (sup->sup_flags &
                        (~0UL << __NR_NILFS_SUINFO_UPDATE_FIELDS))
                    || (nilfs_suinfo_update_nblocks(sup) &&
                        sup->sup_sui.sui_nblocks >
                        nilfs->ns_blocks_per_segment))
                        return -EINVAL;
                if (!sup_extended)
                        continue;
                if (nilfs_suinfo_update_nlive_blks(sup) &&
                    (sup->sup_sui.sui_nlive_blks >
                     nilfs->ns_blocks_per_segment)
                    || (nilfs_suinfo_update_nsnapshot_blks(sup) &&
                        sup->sup_sui.sui_nsnapshot_blks >
                        nilfs->ns_blocks_per_segment))
                        return -EINVAL;
        }

> @@ -938,6 +967,18 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>  		if (nilfs_suinfo_update_nblocks(sup))
>  			su->su_nblocks = cpu_to_le32(sup->sup_sui.sui_nblocks);
>  
> +		if (nilfs_suinfo_update_nlive_blks(sup) && supsu_ext)
> +			su->su_nlive_blks =
> +				cpu_to_le32(sup->sup_sui.sui_nlive_blks);
> +
> +		if (nilfs_suinfo_update_nsnapshot_blks(sup) && supsu_ext)
> +			su->su_nsnapshot_blks =
> +				cpu_to_le32(sup->sup_sui.sui_nsnapshot_blks);
> +
> +		if (nilfs_suinfo_update_nlive_lastmod(sup) && supsu_ext)
> +			su->su_nlive_lastmod =
> +				cpu_to_le64(sup->sup_sui.sui_nlive_lastmod);
> +

Ditto.

Consider defining pointer to suinfo structure

        for (;;) {
                struct nilfs_suinfo *sui = &sup->sup_sui;

and simplifying the above part as follows:

                if (both_extended) {
                        if (nilfs_suinfo_update_nlive_blks(sup))
                                su->su_nlive_blks =
                                        cpu_to_le32(sui->sui_nlive_blks);
                        if (nilfs_suinfo_update_nsnapshot_blks(sup))
                                su->su_nsnapshot_blks =
                                        cpu_to_le32(sui->sui_nsnapshot_blks);
                        if (nilfs_suinfo_update_nlive_lastmod(sup))
                                su->su_nlive_lastmod =
                                        cpu_to_le64(sui->sui_nlive_lastmod);
                }


>  		if (nilfs_suinfo_update_flags(sup)) {
>  			/*
>  			 * Active flag is a virtual flag projected by running
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index b8afd72..da78edf 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -28,6 +28,12 @@
>  #include <linux/nilfs2_fs.h>
>  #include "mdt.h"
>  
> +static inline int
> +nilfs_sufile_live_blks_ext_supported(const struct inode *sufile)
> +{
> +	return NILFS_MDT(sufile)->mi_entry_size >=
> +			NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE;
> +}
>  
>  static inline unsigned long nilfs_sufile_get_nsegments(struct inode *sufile)
>  {
> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
> index ff3fea3..4800daa 100644
> --- a/include/linux/nilfs2_fs.h
> +++ b/include/linux/nilfs2_fs.h
> @@ -220,9 +220,12 @@ struct nilfs_super_block {
>   * If there is a bit set in the incompatible feature set that the kernel
>   * doesn't know about, it should refuse to mount the filesystem.
>   */
> -#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT	0x00000001ULL

> +#define NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT	BIT(0)

You should not use the BIT() macro and its variants for now because they
are only available in kernel space (the __KERNEL__ macro is required).

"nilfs2_fs.h" must be usable from both kernel space and user space.
Consider defining the flag as "(1ULL << 0)" instead.

>  
> -#define NILFS_FEATURE_COMPAT_SUPP	0ULL

> +#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		BIT(0)

Ditto.

> +
> +#define NILFS_FEATURE_COMPAT_SUPP					\
> +			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT)
>  #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>  #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
>  
> @@ -609,19 +612,34 @@ struct nilfs_cpfile_header {
>  	  sizeof(struct nilfs_checkpoint) - 1) /			\
>  			sizeof(struct nilfs_checkpoint))
>  
> +#ifndef offsetofend
> +#define offsetofend(TYPE, MEMBER) \
> +		(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
> +#endif
> +
>  /**
>   * struct nilfs_segment_usage - segment usage
>   * @su_lastmod: last modified timestamp
>   * @su_nblocks: number of blocks in segment
>   * @su_flags: flags
> + * @su_nlive_blks: number of live blocks in the segment
> + * @su_nsnapshot_blks: number of blocks belonging to a snapshot in the segment
> + * @su_nlive_lastmod: timestamp nlive_blks was last modified
>   */
>  struct nilfs_segment_usage {
>  	__le64 su_lastmod;
>  	__le32 su_nblocks;
>  	__le32 su_flags;
> +	__le32 su_nlive_blks;
> +	__le32 su_nsnapshot_blks;
> +	__le64 su_nlive_lastmod;
>  };
>  
> -#define NILFS_MIN_SEGMENT_USAGE_SIZE	16
> +#define NILFS_MIN_SEGMENT_USAGE_SIZE	\
> +	offsetofend(struct nilfs_segment_usage, su_flags)
> +
> +#define NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE	\
> +	offsetofend(struct nilfs_segment_usage, su_nlive_lastmod)
>  
>  /* segment usage flag */
>  enum {
> @@ -658,11 +676,16 @@ NILFS_SEGMENT_USAGE_FNS(DIRTY, dirty)
>  NILFS_SEGMENT_USAGE_FNS(ERROR, error)
>  

>  static inline void
> -nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su)
> +nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
>  {
>  	su->su_lastmod = cpu_to_le64(0);
>  	su->su_nblocks = cpu_to_le32(0);
>  	su->su_flags = cpu_to_le32(0);
> +	if (susz >= NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE) {
> +		su->su_nlive_blks = cpu_to_le32(0);
> +		su->su_nsnapshot_blks = cpu_to_le32(0);
> +		su->su_nlive_lastmod = cpu_to_le64(0);
> +	}
>  }

The nilfs_sufile_do_scrap() function does almost the same thing.
Consider defining a common inline function and using it in both
nilfs_segment_usage_set_clean() and nilfs_sufile_do_scrap():

static inline void
nilfs_segment_usage_format(struct nilfs_segment_usage *su, size_t susz,
			   __u32 flags)
{
	su->su_lastmod = cpu_to_le64(0);
	su->su_nblocks = cpu_to_le32(0);
	su->su_flags = cpu_to_le32(flags);
	if (susz >= NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE) {
		su->su_nlive_blks = cpu_to_le32(0);
		su->su_nsnapshot_blks = cpu_to_le32(0);
		su->su_nlive_lastmod = cpu_to_le64(0);
	}
}

static inline void
nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
{
	nilfs_segment_usage_format(su, susz, 0);
}

Regards,
Ryusuke Konishi

>  
>  static inline int
> @@ -684,23 +707,33 @@ struct nilfs_sufile_header {
>  	/* ... */
>  };
>  
> -#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET	\
> -	((sizeof(struct nilfs_sufile_header) +				\
> -	  sizeof(struct nilfs_segment_usage) - 1) /			\
> -			 sizeof(struct nilfs_segment_usage))
> +#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET(susz)	\
> +	((sizeof(struct nilfs_sufile_header) + (susz) - 1) / (susz))
>  
>  /**
>   * nilfs_suinfo - segment usage information
>   * @sui_lastmod: timestamp of last modification
>   * @sui_nblocks: number of written blocks in segment
>   * @sui_flags: segment usage flags
> + * @sui_nlive_blks: number of live blocks in the segment
> + * @sui_nsnapshot_blks: number of blocks belonging to a snapshot in the segment
> + * @sui_nlive_lastmod: timestamp nlive_blks was last modified
>   */
>  struct nilfs_suinfo {
>  	__u64 sui_lastmod;
>  	__u32 sui_nblocks;
>  	__u32 sui_flags;
> +	__u32 sui_nlive_blks;
> +	__u32 sui_nsnapshot_blks;
> +	__u64 sui_nlive_lastmod;
>  };
>  
> +#define NILFS_MIN_SUINFO_SIZE	\
> +	offsetofend(struct nilfs_suinfo, sui_flags)
> +
> +#define NILFS_LIVE_BLKS_EXT_SUINFO_SIZE	\
> +	offsetofend(struct nilfs_suinfo, sui_nlive_lastmod)
> +
>  #define NILFS_SUINFO_FNS(flag, name)					\
>  static inline int							\
>  nilfs_suinfo_##name(const struct nilfs_suinfo *si)			\
> @@ -736,6 +769,9 @@ enum {
>  	NILFS_SUINFO_UPDATE_LASTMOD,
>  	NILFS_SUINFO_UPDATE_NBLOCKS,
>  	NILFS_SUINFO_UPDATE_FLAGS,
> +	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
> +	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
> +	NILFS_SUINFO_UPDATE_NSNAPSHOT_BLKS,
>  	__NR_NILFS_SUINFO_UPDATE_FIELDS,
>  };
>  
> @@ -759,6 +795,17 @@ nilfs_suinfo_update_##name(const struct nilfs_suinfo_update *sup)	\
>  NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
>  NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
>  NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
> +NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
> +NILFS_SUINFO_UPDATE_FNS(NSNAPSHOT_BLKS, nsnapshot_blks)
> +NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
> +
> +#define NILFS_MIN_SUINFO_UPDATE_SIZE	\
> +	(offsetofend(struct nilfs_suinfo_update, sup_reserved) + \
> +	NILFS_MIN_SUINFO_SIZE)
> +
> +#define NILFS_LIVE_BLKS_EXT_SUINFO_UPDATE_SIZE	\
> +	(offsetofend(struct nilfs_suinfo_update, sup_reserved) + \
> +	NILFS_LIVE_BLKS_EXT_SUINFO_SIZE)
>  
>  enum {
>  	NILFS_CHECKPOINT,
> -- 
> 2.3.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 3/9] nilfs2: introduce new feature flag for tracking live blocks
       [not found]     ` <1430647522-14304-4-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-09  2:28       ` Ryusuke Konishi
       [not found]         ` <20150509.112814.2026089040966346261.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-09  2:28 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun,  3 May 2015 12:05:16 +0200, Andreas Rohner wrote:
> This patch introduces a new file system feature flag
> NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS. If it is enabled, the file system
> will keep track of the number of live blocks per segment. This
> information can be used by the GC to select segments for cleaning more
> efficiently.

Please describe in the commit log the reason why you separated
NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS and
NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT, which you mentioned
before.

> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/the_nilfs.h     | 8 ++++++++
>  include/linux/nilfs2_fs.h | 4 +++-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
> index 12cd91d..d755b6b 100644
> --- a/fs/nilfs2/the_nilfs.h
> +++ b/fs/nilfs2/the_nilfs.h
> @@ -401,4 +401,12 @@ static inline int nilfs_flush_device(struct the_nilfs *nilfs)
>  	return err;
>  }
>  
> +static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
> +{
> +	const __u64 required_bits = NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
> +				    NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT;
> +
> +	return ((nilfs->ns_feature_compat & required_bits) == required_bits);
> +}
> +
>  #endif /* _THE_NILFS_H */
> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
> index 4800daa..5f05bbf 100644
> --- a/include/linux/nilfs2_fs.h
> +++ b/include/linux/nilfs2_fs.h
> @@ -221,11 +221,13 @@ struct nilfs_super_block {
>   * doesn't know about, it should refuse to mount the filesystem.
>   */
>  #define NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT	BIT(0)
> +#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		BIT(1)

Ditto.  Avoid using BIT macro in nilfs2_fs.h for now.

Regards,
Ryusuke Konishi

>  #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		BIT(0)
>  
>  #define NILFS_FEATURE_COMPAT_SUPP					\
> -			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT)
> +			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT |	\
> +			 NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
>  #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>  #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
>  
> -- 
> 2.3.7
> 

* Re: [PATCH v2 4/9] nilfs2: add kmem_cache for SUFILE cache nodes
       [not found]     ` <1430647522-14304-5-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-09  2:41       ` Ryusuke Konishi
       [not found]         ` <20150509.114149.1643183669812667339.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-09  2:41 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun,  3 May 2015 12:05:17 +0200, Andreas Rohner wrote:
> This patch adds a kmem_cache to efficiently allocate SUFILE cache nodes.
> One cache node contains a certain number of unsigned 32 bit values and
> either a list_head, to string a number of nodes together into a linked
> list, or an rcu_head to be able to use the node with an rcu
> callback.
> 
> These cache nodes can be used to cache small changes to the SUFILE and
> apply them later at segment construction.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/sufile.h | 14 ++++++++++++++
>  fs/nilfs2/super.c  | 14 ++++++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index da78edf..520614f 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -28,6 +28,20 @@
>  #include <linux/nilfs2_fs.h>
>  #include "mdt.h"
>  
> +#define NILFS_SUFILE_CACHE_NODE_SHIFT	6
> +#define NILFS_SUFILE_CACHE_NODE_COUNT	(1 << NILFS_SUFILE_CACHE_NODE_SHIFT)
> +
> +struct nilfs_sufile_cache_node {
> +	__u32 values[NILFS_SUFILE_CACHE_NODE_COUNT];
> +	union {
> +		struct rcu_head rcu_head;
> +		struct list_head list_head;
> +	};
> +	unsigned long index;
> +};
> +
> +extern struct kmem_cache *nilfs_sufile_node_cachep;
> +
>  static inline int
>  nilfs_sufile_live_blks_ext_supported(const struct inode *sufile)
>  {
> diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
> index f47585b..97a30db 100644
> --- a/fs/nilfs2/super.c
> +++ b/fs/nilfs2/super.c
> @@ -71,6 +71,7 @@ static struct kmem_cache *nilfs_inode_cachep;
>  struct kmem_cache *nilfs_transaction_cachep;
>  struct kmem_cache *nilfs_segbuf_cachep;
>  struct kmem_cache *nilfs_btree_path_cache;
> +struct kmem_cache *nilfs_sufile_node_cachep;
>  
>  static int nilfs_setup_super(struct super_block *sb, int is_mount);
>  static int nilfs_remount(struct super_block *sb, int *flags, char *data);
> @@ -1397,6 +1398,11 @@ static void nilfs_segbuf_init_once(void *obj)
>  	memset(obj, 0, sizeof(struct nilfs_segment_buffer));
>  }
>  
> +static void nilfs_sufile_cache_node_init_once(void *obj)
> +{
> +	memset(obj, 0, sizeof(struct nilfs_sufile_cache_node));
> +}
> +

Note that nilfs_sufile_cache_node_init_once() is only called when each
cache entry is allocated for the first time.  It does not ensure that a
cache entry is clean when it is allocated with kmem_cache_alloc() a
second time or later.

Regards,
Ryusuke Konishi

>  static void nilfs_destroy_cachep(void)
>  {
>  	/*
> @@ -1413,6 +1419,8 @@ static void nilfs_destroy_cachep(void)
>  		kmem_cache_destroy(nilfs_segbuf_cachep);
>  	if (nilfs_btree_path_cache)
>  		kmem_cache_destroy(nilfs_btree_path_cache);
> +	if (nilfs_sufile_node_cachep)
> +		kmem_cache_destroy(nilfs_sufile_node_cachep);
>  }
>  
>  static int __init nilfs_init_cachep(void)
> @@ -1441,6 +1449,12 @@ static int __init nilfs_init_cachep(void)
>  	if (!nilfs_btree_path_cache)
>  		goto fail;
>  
> +	nilfs_sufile_node_cachep = kmem_cache_create("nilfs_sufile_node_cache",
> +			sizeof(struct nilfs_sufile_cache_node), 0, 0,
> +			nilfs_sufile_cache_node_init_once);
> +	if (!nilfs_sufile_node_cachep)
> +		goto fail;
> +
>  	return 0;
>  
>  fail:
> -- 
> 2.3.7
> 

* Re: [PATCH v2 5/9] nilfs2: add SUFILE cache for changes to su_nlive_blks field
       [not found]     ` <1430647522-14304-6-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-09  4:09       ` Ryusuke Konishi
       [not found]         ` <20150509.130900.223492430584220355.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-09  4:09 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun,  3 May 2015 12:05:18 +0200, Andreas Rohner wrote:
> This patch adds a cache for the SUFILE to efficiently store lots of
> small changes to su_nlive_blks in memory and apply the accumulated
> results later at segment construction. This improves performance of
> these operations and reduces lock contention in the SUFILE.
> 
> The implementation uses a radix_tree to store cache nodes, which
> contain a certain number of values. Every value corresponds to
> exactly one SUFILE entry. If the cache is flushed the values are
> subtracted from the su_nlive_blks field of the corresponding SUFILE
> entry.
> 
> If the parameter only_mark of the function nilfs_sufile_flush_cache() is
> set, then the blocks that would have been dirtied by the flush are
> marked as dirty, but nothing is actually written to them. This mode is
> useful during segment construction, when blocks need to be marked dirty
> in advance.
> 
> New nodes are allocated on demand. The lookup of nodes is protected by
> rcu_read_lock() and the modification of values is protected by a block
> group lock. This should allow for concurrent updates to the cache.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/sufile.c | 369 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nilfs2/sufile.h |   5 +
>  2 files changed, 374 insertions(+)
> 
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index 1cce358..80bbd87 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -26,6 +26,7 @@
>  #include <linux/string.h>
>  #include <linux/buffer_head.h>
>  #include <linux/errno.h>
> +#include <linux/radix-tree.h>
>  #include <linux/nilfs2_fs.h>
>  #include "mdt.h"
>  #include "sufile.h"
> @@ -42,6 +43,11 @@ struct nilfs_sufile_info {
>  	unsigned long ncleansegs;/* number of clean segments */
>  	__u64 allocmin;		/* lower limit of allocatable segment range */
>  	__u64 allocmax;		/* upper limit of allocatable segment range */
> +

> +	struct blockgroup_lock nlive_blks_cache_bgl;
> +	spinlock_t nlive_blks_cache_lock;
> +	int nlive_blks_cache_dirty;
> +	struct radix_tree_root nlive_blks_cache;

The blockgroup_lock is not needed.  For the counter operations in this
patch, using cmpxchg() or atomic_xxx() is more efficient, as I mention
later.

Also, I prefer to treat this cache as caching updates of segment usage
in general rather than updates of nlive_blks only.  In that sense, it's
preferable to define the array element like:

struct nilfs_segusage_update {
	__u32 nlive_blks_adj;
};

and define the variable names like update_cache (instead of
nlive_blks_cache), update_cache_lock, update_cache_dirty, etc.


>  };
>  
>  static inline struct nilfs_sufile_info *NILFS_SUI(struct inode *sufile)
> @@ -1194,6 +1200,362 @@ out_sem:
>  }
>  
>  /**
> + * nilfs_sufile_alloc_cache_node - allocate and insert a new cache node
> + * @sufile: inode of segment usage file
> + * @group: group to allocate a node for
> + *
> + * Description: Allocates a new cache node and inserts it into the cache. If
> + * there is an error, nothing will be allocated. If there already exists
> + * a node for @group, no new node will be allocated.
> + *
> + * Return Value: On success, 0 is returned.  On error, one of the following
> + * negative error codes is returned.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + */
> +static int nilfs_sufile_alloc_cache_node(struct inode *sufile,
> +					 unsigned long group)
> +{
> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
> +	struct nilfs_sufile_cache_node *node;
> +	int ret;
> +
> +	node = kmem_cache_alloc(nilfs_sufile_node_cachep, GFP_NOFS);
> +	if (!node)
> +		return -ENOMEM;
> +
> +	ret = radix_tree_preload(GFP_NOFS);
> +	if (ret)
> +		goto free_node;
> +
> +	spin_lock(&sui->nlive_blks_cache_lock);
> +	ret = radix_tree_insert(&sui->nlive_blks_cache, group, node);
> +	spin_unlock(&sui->nlive_blks_cache_lock);
> +
> +	radix_tree_preload_end();
> +

> +	if (ret == -EEXIST) {
> +		ret = 0;
> +		goto free_node;
> +	} else if (ret)
> +		goto free_node;
> +
> +	return 0;
> +free_node:
> +	kmem_cache_free(nilfs_sufile_node_cachep, node);
> +	return ret;

The above error check introduces two branches in the regular path.
Consider rewriting it as follows:

	if (!ret)
		return 0;

	if (ret == -EEXIST)
		ret = 0;
free_node:
	kmem_cache_free(nilfs_sufile_node_cachep, node);
	return ret;

By the way, you should use braces in both branches if one of them has
multiple statements in an "if else" conditional statement.  This
exception is described in Chapter 3 of Documentation/CodingStyle.

    e.g.

        if (condition) {
                do_this();
                do_that();
        } else {
                otherwise();
        }

> +}
> +
> +/**
> + * nilfs_sufile_dec_nlive_blks - decrements nlive_blks in the cache
> + * @sufile: inode of segment usage file
> + * @segnum: segnum for which nlive_blks will be decremented
> + *
> + * Description: Decrements the number of live blocks for @segnum in the cache.
> + * This function only affects the cache. If the cache is not flushed at a
> + * later time the changes are lost. It tries to lookup the group node to
> + * which the @segnum belongs in a lock free manner and uses a blockgroup lock
> + * to do the actual modification on the node.
> + *
> + * Return Value: On success, 0 is returned.  On error, one of the following
> + * negative error codes is returned.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + */
> +int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum)
> +{
> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
> +	struct nilfs_sufile_cache_node *node;
> +	spinlock_t *lock;
> +	unsigned long group;
> +	int ret;
> +
> +	group = (unsigned long)(segnum >> NILFS_SUFILE_CACHE_NODE_SHIFT);
> +
> +try_again:
> +	rcu_read_lock();
> +	node = radix_tree_lookup(&sui->nlive_blks_cache, group);
> +	if (!node) {
> +		rcu_read_unlock();
> +
> +		ret = nilfs_sufile_alloc_cache_node(sufile, group);
> +		if (ret)
> +			return ret;
> +
> +		/*
> +		 * It is important to acquire the rcu_read_lock() before using
> +		 * the node pointer
> +		 */
> +		goto try_again;
> +	}
> +

> +	lock = bgl_lock_ptr(&sui->nlive_blks_cache_bgl, (unsigned int)group);
> +	spin_lock(lock);
> +	node->values[segnum & ((1 << NILFS_SUFILE_CACHE_NODE_SHIFT) - 1)] += 1;
> +	sui->nlive_blks_cache_dirty = 1;
> +	spin_unlock(lock);
> +	rcu_read_unlock();
> +
> +	return 0;
> +}

Consider using cmpxchg() or atomic_inc(), and using
NILFS_SUFILE_CACHE_NODE_MASK to mask segnum.  The following is an
example in the case of using cmpxchg():

	__u32 old, new, *valuep;
	...
	valuep = &node->values[segnum & (NILFS_SUFILE_CACHE_NODE_COUNT - 1)];
	do {
		old = ACCESS_ONCE(*valuep);
		new = old + 1;
	} while (cmpxchg(valuep, old, new) != old);

	sui->nlive_blks_cache_dirty = 1;

	rcu_read_unlock();
	return 0;
}

The current atomic_xxxx() macros are actually defined in the same way,
to reduce overheads in SMP environments.

Using atomic_xxxx() is preferable, but formally it requires
initialization with "atomic_set(&counter, 0)" or "ATOMIC_INIT(0)" for
every element.  I don't know whether initialization with the memset()
function is allowed for atomic_t type variables.

> +
> +/**
> + * nilfs_sufile_flush_cache_node - flushes one cache node to the SUFILE
> + * @sufile: inode of segment usage file
> + * @node: cache node to flush
> + * @only_mark: do not write anything, but mark the blocks as dirty
> + * @pndirty_blks: pointer to return number of dirtied blocks
> + *
> + * Description: Flushes one cache node to the SUFILE and also clears the cache
> + * node at the same time. If @only_mark is 1, nothing is written to the
> + * SUFILE, but the blocks are still marked as dirty. This is useful to mark
> + * the blocks in one phase of the segment creation and write them in another.
> + *
> + * Return Value: On success, 0 is returned.  On error, one of the following
> + * negative error codes is returned.
> + *
> + * %-ENOMEM - Insufficient memory available.
> + *
> + * %-EIO - I/O error
> + *
> + * %-EROFS - Read only filesystem (for create mode)
> + */
> +static int nilfs_sufile_flush_cache_node(struct inode *sufile,
> +					 struct nilfs_sufile_cache_node *node,
> +					 int only_mark,
> +					 unsigned long *pndirty_blks)
> +{
> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
> +	struct buffer_head *su_bh;
> +	struct nilfs_segment_usage *su;
> +	spinlock_t *lock;
> +	void *kaddr;
> +	size_t n, i, j;
> +	size_t susz = NILFS_MDT(sufile)->mi_entry_size;
> +	__u64 segnum, seg_start, nsegs;
> +	__u32 nlive_blocks, value;
> +	unsigned long secs = get_seconds(), ndirty_blks = 0;
> +	int ret, dirty;
> +
> +	nsegs = nilfs_sufile_get_nsegments(sufile);
> +	seg_start = node->index << NILFS_SUFILE_CACHE_NODE_SHIFT;
> +	lock = bgl_lock_ptr(&sui->nlive_blks_cache_bgl, node->index);
> +
> +	for (i = 0; i < NILFS_SUFILE_CACHE_NODE_COUNT;) {
> +		segnum = seg_start + i;
> +		if (segnum >= nsegs)
> +			break;
> +
> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
> +				seg_start + NILFS_SUFILE_CACHE_NODE_COUNT - 1);
> +
> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
> +							   0, &su_bh);
> +		if (ret < 0) {
> +			if (ret != -ENOENT)
> +				return ret;
> +			/* hole */
> +			i += n;
> +			continue;
> +		}
> +
> +		if (only_mark && buffer_dirty(su_bh)) {
> +			/* buffer already dirty */
> +			put_bh(su_bh);
> +			i += n;
> +			continue;
> +		}
> +
> +		spin_lock(lock);
> +		kaddr = kmap_atomic(su_bh->b_page);
> +
> +		dirty = 0;
> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
> +							  su_bh, kaddr);
> +		for (j = 0; j < n; ++j, ++i, su = (void *)su + susz) {
> +			value = node->values[i];
> +			if (!value)
> +				continue;
> +			if (!only_mark)
> +				node->values[i] = 0;
> +
> +			WARN_ON(nilfs_segment_usage_error(su));
> +
> +			nlive_blocks = le32_to_cpu(su->su_nlive_blks);
> +			if (!nlive_blocks)
> +				continue;
> +
> +			dirty = 1;
> +			if (only_mark) {
> +				i += n - j;
> +				break;
> +			}
> +

> +			if (nlive_blocks <= value)
> +				nlive_blocks = 0;
> +			else
> +				nlive_blocks -= value;

This can be simplified as below:

			nlive_blocks -= min_t(__u32, nlive_blocks, value);

> +
> +			su->su_nlive_blks = cpu_to_le32(nlive_blocks);
> +			su->su_nlive_lastmod = cpu_to_le64(secs);
> +		}
> +
> +		kunmap_atomic(kaddr);
> +		spin_unlock(lock);
> +
> +		if (dirty && !buffer_dirty(su_bh)) {
> +			mark_buffer_dirty(su_bh);

> +			nilfs_mdt_mark_dirty(sufile);

nilfs_mdt_mark_dirty() should be called only once if ndirty_blks is
larger than zero.  We can move it to nilfs_sufile_flush_cache() side
(to the position just before calling up_write()).

> +			++ndirty_blks;
> +		}
> +
> +		put_bh(su_bh);
> +	}
> +
> +	*pndirty_blks += ndirty_blks;
> +	return 0;
> +}
> +
> +/**
> + * nilfs_sufile_flush_cache - flushes cache to the SUFILE
> + * @sufile: inode of segment usage file
> + * @only_mark: do not write anything, but mark the blocks as dirty
> + * @pndirty_blks: pointer to return number of dirtied blocks
> + *
> + * Description: Flushes the whole cache to the SUFILE and also clears it
> + * at the same time. If @only_mark is 1, nothing is written to the
> + * SUFILE, but the blocks are still marked as dirty. This is useful to mark
> + * the blocks in one phase of the segment creation and write them in another.
> + * If there are concurrent inserts into the cache, it cannot be guaranteed,
> + * that everything is flushed when the function returns.
> + *
> + * Return Value: On success, 0 is returned.  On error, one of the following
> + * negative error codes is returned.
> + *
> + * %-ENOMEM - Insufficient memory available.
> + *
> + * %-EIO - I/O error
> + *
> + * %-EROFS - Read only filesystem (for create mode)
> + */
> +int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
> +			     unsigned long *pndirty_blks)
> +{
> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
> +	struct nilfs_sufile_cache_node *node;
> +	LIST_HEAD(nodes);
> +	struct radix_tree_iter iter;
> +	void **slot;
> +	unsigned long ndirty_blks = 0;
> +	int ret = 0;
> +
> +	if (!sui->nlive_blks_cache_dirty)
> +		goto out;
> +
> +	down_write(&NILFS_MDT(sufile)->mi_sem);
> +
> +	/* prevent concurrent inserts */
> +	spin_lock(&sui->nlive_blks_cache_lock);
> +	radix_tree_for_each_slot(slot, &sui->nlive_blks_cache, &iter, 0) {
> +		node = radix_tree_deref_slot_protected(slot,
> +				&sui->nlive_blks_cache_lock);
> +		if (!node)
> +			continue;
> +		if (radix_tree_exception(node))
> +			continue;
> +
> +		list_add(&node->list_head, &nodes);
> +		node->index = iter.index;
> +	}
> +	if (!only_mark)
> +		sui->nlive_blks_cache_dirty = 0;
> +	spin_unlock(&sui->nlive_blks_cache_lock);
> +
> +	list_for_each_entry(node, &nodes, list_head) {
> +		ret = nilfs_sufile_flush_cache_node(sufile, node, only_mark,
> +						    &ndirty_blks);
> +		if (ret)
> +			goto out_sem;
> +	}
> +
> +out_sem:
> +	up_write(&NILFS_MDT(sufile)->mi_sem);
> +out:
> +	if (pndirty_blks)
> +		*pndirty_blks = ndirty_blks;
> +	return ret;
> +}
> +
> +/**
> + * nilfs_sufile_cache_dirty - is the sufile cache dirty
> + * @sufile: inode of segment usage file
> + *
> + * Description: Returns whether the sufile cache is dirty. If this flag is
> + * true, the cache contains unflushed content.
> + *
> + * Return Value: If the cache is not dirty, 0 is returned, otherwise
> + * 1 is returned
> + */
> +int nilfs_sufile_cache_dirty(struct inode *sufile)
> +{
> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
> +
> +	return sui->nlive_blks_cache_dirty;
> +}
> +
> +/**
> + * nilfs_sufile_cache_node_release_rcu - rcu callback function to free nodes
> + * @head: rcu head
> + *
> + * Description: Rcu callback function to free nodes.
> + */
> +static void nilfs_sufile_cache_node_release_rcu(struct rcu_head *head)
> +{
> +	struct nilfs_sufile_cache_node *node;
> +
> +	node = container_of(head, struct nilfs_sufile_cache_node, rcu_head);
> +
> +	kmem_cache_free(nilfs_sufile_node_cachep, node);
> +}
> +
> +/**
> + * nilfs_sufile_shrink_cache - free all cache nodes
> + * @sufile: inode of segment usage file
> + *
> + * Description: Frees all cache nodes in the cache regardless of their
> + * content. The content will not be flushed and may be lost. This function
> + * is intended to free up memory after the cache was flushed by
> + * nilfs_sufile_flush_cache().
> + */
> +void nilfs_sufile_shrink_cache(struct inode *sufile)
> +{
> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
> +	struct nilfs_sufile_cache_node *node;
> +	struct radix_tree_iter iter;
> +	void **slot;
> +

> +	/* prevent flush form running at the same time */

"flush from" ?

> +	down_read(&NILFS_MDT(sufile)->mi_sem);

This protection with mi_sem seems to be needless because the current
implementation of nilfs_sufile_shrink_cache() doesn't touch buffers of
sufile.  The delete operation is protected by a spinlock and the
counter operations are protected with rcu.  What does this
down_read()/up_read() pair protect?


> +	/* prevent concurrent inserts */
> +	spin_lock(&sui->nlive_blks_cache_lock);
> +
> +	radix_tree_for_each_slot(slot, &sui->nlive_blks_cache, &iter, 0) {
> +		node = radix_tree_deref_slot_protected(slot,
> +				&sui->nlive_blks_cache_lock);
> +		if (!node)
> +			continue;
> +		if (radix_tree_exception(node))
> +			continue;
> +
> +		radix_tree_delete(&sui->nlive_blks_cache, iter.index);
> +		call_rcu(&node->rcu_head, nilfs_sufile_cache_node_release_rcu);
> +	}
> +
> +	spin_unlock(&sui->nlive_blks_cache_lock);
> +	up_read(&NILFS_MDT(sufile)->mi_sem);
> +}
> +
> +/**
>   * nilfs_sufile_read - read or get sufile inode
>   * @sb: super block instance
>   * @susize: size of a segment usage entry
> @@ -1253,6 +1615,13 @@ int nilfs_sufile_read(struct super_block *sb, size_t susize,
>  	sui->allocmax = nilfs_sufile_get_nsegments(sufile) - 1;
>  	sui->allocmin = 0;
>  
> +	if (nilfs_feature_track_live_blks(sb->s_fs_info)) {
> +		bgl_lock_init(&sui->nlive_blks_cache_bgl);
> +		spin_lock_init(&sui->nlive_blks_cache_lock);
> +		INIT_RADIX_TREE(&sui->nlive_blks_cache, GFP_ATOMIC);
> +	}
> +	sui->nlive_blks_cache_dirty = 0;
> +
>  	unlock_new_inode(sufile);
>   out:
>  	*inodep = sufile;

I think we should introduce a destructor for metadata files to prevent
the memory leak introduced by the cache nodes and radix tree.
nilfs_sufile_shrink_cache() should be called from the destructor.

The destructor (e.g. mi->mi_dtor) should be called from
nilfs_clear_inode() if it isn't set to a NULL value.  Initialization
of the destructor will be done in nilfs_xxx_read().

In the current patchset, the call site of nilfs_sufile_shrink_cache()
is well considered, but that is not sufficient.  We have to eliminate
the possibility of a memory leak completely and clearly.

Regards,
Ryusuke Konishi

> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index 520614f..662ab56 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -87,6 +87,11 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>  int nilfs_sufile_read(struct super_block *sb, size_t susize,
>  		      struct nilfs_inode *raw_inode, struct inode **inodep);
>  int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
> +int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum);
> +void nilfs_sufile_shrink_cache(struct inode *sufile);
> +int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
> +			     unsigned long *pndirty_blks);
> +int nilfs_sufile_cache_dirty(struct inode *sufile);
>  
>  /**
>   * nilfs_sufile_scrap - make a segment garbage
> -- 
> 2.3.7
> 

* Re: [PATCH v2 6/9] nilfs2: add tracking of block deletions and updates
       [not found]     ` <1430647522-14304-7-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-09  7:05       ` Ryusuke Konishi
       [not found]         ` <20150509.160512.1087140271092828536.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-09  7:05 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun,  3 May 2015 12:05:19 +0200, Andreas Rohner wrote:
> This patch adds tracking of block deletions and updates for all files.
> It uses the fact that, for every block, NILFS2 keeps an entry in the
> DAT file and stores the checkpoint where it was created, deleted or
> overwritten. So whenever a block is deleted or overwritten
> nilfs_dat_commit_end() is called to update the DAT entry. At this
> point this patch simply decrements the su_nlive_blks field of the
> corresponding segment. The value of su_nlive_blks is set at segment
> creation time.
> 
> The DAT file itself has of course no DAT entries for its own blocks, but
> it still has to propagate deletions and updates to its btree. When this
> happens this patch again decrements the su_nlive_blks field of the
> corresponding segment.
> 
> The new feature compatibility flag NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS
> can be used to enable or disable the block tracking at any time.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/btree.c   | 33 ++++++++++++++++++++++++++++++---
>  fs/nilfs2/dat.c     | 15 +++++++++++++--
>  fs/nilfs2/direct.c  | 20 +++++++++++++++-----
>  fs/nilfs2/page.c    |  6 ++++--
>  fs/nilfs2/page.h    |  3 +++
>  fs/nilfs2/segbuf.c  |  3 +++
>  fs/nilfs2/segbuf.h  |  5 +++++
>  fs/nilfs2/segment.c | 48 +++++++++++++++++++++++++++++++++++++-----------
>  fs/nilfs2/sufile.c  | 17 ++++++++++++++++-
>  fs/nilfs2/sufile.h  |  3 ++-
>  10 files changed, 128 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
> index 059f371..d3b2763 100644
> --- a/fs/nilfs2/btree.c
> +++ b/fs/nilfs2/btree.c
> @@ -30,6 +30,7 @@
>  #include "btree.h"
>  #include "alloc.h"
>  #include "dat.h"
> +#include "sufile.h"
>  
>  static void __nilfs_btree_init(struct nilfs_bmap *bmap);
>  

> @@ -1889,9 +1890,35 @@ static int nilfs_btree_propagate_p(struct nilfs_bmap *btree,
>  				   int level,
>  				   struct buffer_head *bh)
>  {
> -	while ((++level < nilfs_btree_height(btree) - 1) &&
> -	       !buffer_dirty(path[level].bp_bh))
> -		mark_buffer_dirty(path[level].bp_bh);
> +	struct the_nilfs *nilfs = btree->b_inode->i_sb->s_fs_info;
> +	struct nilfs_btree_node *node;
> +	__u64 ptr, segnum;
> +	int ncmax, vol, counted;
> +
> +	vol = buffer_nilfs_volatile(bh);
> +	counted = buffer_nilfs_counted(bh);
> +	set_buffer_nilfs_counted(bh);
> +
> +	while (++level < nilfs_btree_height(btree)) {
> +		if (!vol && !counted && nilfs_feature_track_live_blks(nilfs)) {
> +			node = nilfs_btree_get_node(btree, path, level, &ncmax);
> +			ptr = nilfs_btree_node_get_ptr(node,
> +						       path[level].bp_index,
> +						       ncmax);
> +			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
> +		}
> +
> +		if (path[level].bp_bh) {
> +			if (buffer_dirty(path[level].bp_bh))
> +				break;
> +
> +			mark_buffer_dirty(path[level].bp_bh);
> +			vol = buffer_nilfs_volatile(path[level].bp_bh);
> +			counted = buffer_nilfs_counted(path[level].bp_bh);
> +			set_buffer_nilfs_counted(path[level].bp_bh);
> +		}
> +	}
>  
>  	return 0;
>  }

Consider the following comments:

- Please use the volatile flag also for the duplication check instead of
  adding a nilfs_counted flag.
- btree.c, direct.c, and dat.c shouldn't refer to SUFILE directly.
  Please add a wrapper function like "nilfs_dec_nlive_blks(nilfs, blocknr)"
  to the implementation of the_nilfs.c, and use it instead.
- To clarify the implementation, split out a separate function that updates
  pointers, as nilfs_btree_propagate_v() does.
- The return value of nilfs_sufile_dec_nlive_blks() appears to be ignored
  intentionally.  Please add a comment explaining why you do so.

e.g.

static void nilfs_btree_update_p(struct nilfs_bmap *btree,
                                 struct nilfs_btree_path *path, int level)
{
	struct the_nilfs *nilfs = btree->b_inode->i_sb->s_fs_info;
	struct nilfs_btree_node *parent;
	__u64 ptr;
	int ncmax;

	if (nilfs_feature_track_live_blks(nilfs)) {
		parent = nilfs_btree_get_node(btree, path, level + 1, &ncmax);
		ptr = nilfs_btree_node_get_ptr(parent,
					       path[level + 1].bp_index,
					       ncmax);
		nilfs_dec_nlive_blks(nilfs, ptr);
		/* (Please add a comment explaining why we ignore the return value) */
	}
	set_buffer_nilfs_volatile(path[level].bp_bh);
}

static int nilfs_btree_propagate_p(struct nilfs_bmap *btree,
				   struct nilfs_btree_path *path,
				   int level,
				   struct buffer_head *bh)
{
	/*
	 * Update pointer to the given dirty buffer.  If the buffer is
	 * marked volatile, it shouldn't be updated because it's
	 * either a newly created buffer or an already updated one.
	 */
	if (!buffer_nilfs_volatile(path[level].bp_bh))
		nilfs_btree_update_p(btree, path, level);

	/*
	 * Mark upper nodes dirty and update their pointers unless
	 * they're already marked dirty.
	 */
	while (++level < nilfs_btree_height(btree) - 1 &&
	       !buffer_dirty(path[level].bp_bh)) {

		WARN_ON(buffer_nilfs_volatile(path[level].bp_bh));
		nilfs_btree_update_p(btree, path, level);
		mark_buffer_dirty(path[level].bp_bh);
	}
	return 0;
}
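
The wrapper suggested above boils down to a block-number-to-segment-number
translation plus a saturating decrement of the per-segment live-block
counter. A minimal user-space sketch of that idea follows; the struct
layout, the table size, and the plain division are simplifying assumptions
for illustration (the real nilfs_get_segnum_of_block() accounts for the
on-disk segment geometry), not the kernel implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical stand-in for struct the_nilfs. */
struct the_nilfs_model {
	uint64_t blocks_per_segment;	/* cf. ns_blocks_per_segment */
	uint32_t nlive_blks[16];	/* per-segment live-block counters */
};

/*
 * Model of the proposed nilfs_dec_nlive_blks(nilfs, blocknr) wrapper:
 * translate the block number into a segment number and decrement that
 * segment's live-block counter, saturating at zero.
 */
static int nilfs_dec_nlive_blks_model(struct the_nilfs_model *nilfs,
				      uint64_t blocknr)
{
	uint64_t segnum = blocknr / nilfs->blocks_per_segment;

	if (segnum >= 16)
		return -1;	/* out of range; a real SUFILE lookup would fail */
	if (nilfs->nlive_blks[segnum] > 0)
		nilfs->nlive_blks[segnum]--;
	return 0;
}
```

This also illustrates why the callers can ignore the return value: the
decrement is a best-effort hint for the GC policy, not correctness-critical
metadata.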

> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
> index 0d5fada..9c2fc32 100644
> --- a/fs/nilfs2/dat.c
> +++ b/fs/nilfs2/dat.c
> @@ -28,6 +28,7 @@
>  #include "mdt.h"
>  #include "alloc.h"
>  #include "dat.h"
> +#include "sufile.h"
>  
>  
>  #define NILFS_CNO_MIN	((__u64)1)
> @@ -188,9 +189,10 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
>  			  int dead)
>  {
>  	struct nilfs_dat_entry *entry;
> -	__u64 start, end;
> +	__u64 start, end, segnum;
>  	sector_t blocknr;
>  	void *kaddr;
> +	struct the_nilfs *nilfs;
>  
>  	kaddr = kmap_atomic(req->pr_entry_bh->b_page);
>  	entry = nilfs_palloc_block_get_entry(dat, req->pr_entry_nr,
> @@ -206,8 +208,17 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
>  
>  	if (blocknr == 0)
>  		nilfs_dat_commit_free(dat, req);
> -	else

Add braces around nilfs_dat_commit_free(), since you add multiple
statements to the else clause.  See chapter 3 of the CodingStyle file.

> +	else {
>  		nilfs_dat_commit_entry(dat, req);
> +
> +		nilfs = dat->i_sb->s_fs_info;
> +
> +		if (nilfs_feature_track_live_blks(nilfs)) {

> +			segnum = nilfs_get_segnum_of_block(nilfs, blocknr);
> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);

Ditto.  Call nilfs_dec_nlive_blks(nilfs, blocknr) instead, and do not
add a dependency on SUFILE in dat.c.

> +		}
> +	}
> +
>  }
>  
>  void nilfs_dat_abort_end(struct inode *dat, struct nilfs_palloc_req *req)
> diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c
> index ebf89fd..42704eb 100644
> --- a/fs/nilfs2/direct.c
> +++ b/fs/nilfs2/direct.c
> @@ -26,6 +26,7 @@
>  #include "direct.h"
>  #include "alloc.h"
>  #include "dat.h"
> +#include "sufile.h"
>  
>  static inline __le64 *nilfs_direct_dptrs(const struct nilfs_bmap *direct)
>  {
> @@ -268,18 +269,27 @@ int nilfs_direct_delete_and_convert(struct nilfs_bmap *bmap,
>  static int nilfs_direct_propagate(struct nilfs_bmap *bmap,
>  				  struct buffer_head *bh)
>  {
> +	struct the_nilfs *nilfs = bmap->b_inode->i_sb->s_fs_info;
>  	struct nilfs_palloc_req oldreq, newreq;
>  	struct inode *dat;
> -	__u64 key;
> -	__u64 ptr;
> +	__u64 key, ptr, segnum;
>  	int ret;
>  
> -	if (!NILFS_BMAP_USE_VBN(bmap))
> -		return 0;
> -

>  	dat = nilfs_bmap_get_dat(bmap);
>  	key = nilfs_bmap_data_get_key(bmap, bh);
>  	ptr = nilfs_direct_get_ptr(bmap, key);
> +

> +	if (unlikely(!NILFS_BMAP_USE_VBN(bmap))) {
> +		if (!buffer_nilfs_volatile(bh) && !buffer_nilfs_counted(bh) &&
> +				nilfs_feature_track_live_blks(nilfs)) {
> +			set_buffer_nilfs_counted(bh);
> +			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
> +
> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
> +		}
> +		return 0;
> +	}

Use the volatile flag also for the duplication check, and do not use
the unlikely() macro when testing "!NILFS_BMAP_USE_VBN(bmap)".  It is
not an exceptional error case:

	if (!NILFS_BMAP_USE_VBN(bmap)) {
		if (!buffer_nilfs_volatile(bh)) {
			if (nilfs_feature_track_live_blks(nilfs))
				nilfs_dec_nlive_blks(nilfs, ptr);
			set_buffer_nilfs_volatile(bh);
		}
		return 0;
	}

> +
>  	if (!buffer_nilfs_volatile(bh)) {
>  		oldreq.pr_entry_nr = ptr;
>  		newreq.pr_entry_nr = ptr;
> diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
> index 45d650a..fd21b43 100644
> --- a/fs/nilfs2/page.c
> +++ b/fs/nilfs2/page.c
> @@ -92,7 +92,8 @@ void nilfs_forget_buffer(struct buffer_head *bh)
>  	const unsigned long clear_bits =
>  		(1 << BH_Uptodate | 1 << BH_Dirty | 1 << BH_Mapped |
>  		 1 << BH_Async_Write | 1 << BH_NILFS_Volatile |
> -		 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected);
> +		 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected |

> +		 1 << BH_NILFS_Counted);

You don't have to add the nilfs_counted flag, as I mentioned above.
Remove this.

>  
>  	lock_buffer(bh);
>  	set_mask_bits(&bh->b_state, clear_bits, 0);
> @@ -422,7 +423,8 @@ void nilfs_clear_dirty_page(struct page *page, bool silent)
>  		const unsigned long clear_bits =
>  			(1 << BH_Uptodate | 1 << BH_Dirty | 1 << BH_Mapped |
>  			 1 << BH_Async_Write | 1 << BH_NILFS_Volatile |
> -			 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected);
> +			 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected |

> +			 1 << BH_NILFS_Counted);

Ditto.

>  
>  		bh = head = page_buffers(page);
>  		do {
> diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
> index a43b828..4e35814 100644
> --- a/fs/nilfs2/page.h
> +++ b/fs/nilfs2/page.h
> @@ -36,12 +36,15 @@ enum {
>  	BH_NILFS_Volatile,
>  	BH_NILFS_Checked,
>  	BH_NILFS_Redirected,
> +	BH_NILFS_Counted,

Ditto.

>  };
>  
>  BUFFER_FNS(NILFS_Node, nilfs_node)		/* nilfs node buffers */
>  BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
>  BUFFER_FNS(NILFS_Checked, nilfs_checked)	/* buffer is verified */
>  BUFFER_FNS(NILFS_Redirected, nilfs_redirected)	/* redirected to a copy */

> +/* counted by propagate_p for segment usage */
> +BUFFER_FNS(NILFS_Counted, nilfs_counted)

Ditto.

>  
>  
>  int __nilfs_clear_page_dirty(struct page *);
> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
> index dc3a9efd..dabb65b 100644
> --- a/fs/nilfs2/segbuf.c
> +++ b/fs/nilfs2/segbuf.c
> @@ -57,6 +57,9 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
>  	INIT_LIST_HEAD(&segbuf->sb_segsum_buffers);
>  	INIT_LIST_HEAD(&segbuf->sb_payload_buffers);
>  	segbuf->sb_super_root = NULL;

> +	segbuf->sb_flags = 0;

You don't have to add sb_flags.  Use sci->sc_stage.flags instead,
because the flag is used to manage the internal state of segment
construction rather than the state of the segbuf.

> +	segbuf->sb_nlive_blks = 0;
> +	segbuf->sb_nsnapshot_blks = 0;
>  
>  	init_completion(&segbuf->sb_bio_event);
>  	atomic_set(&segbuf->sb_err, 0);
> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
> index b04f08c..a802f61 100644
> --- a/fs/nilfs2/segbuf.h
> +++ b/fs/nilfs2/segbuf.h
> @@ -83,6 +83,9 @@ struct nilfs_segment_buffer {
>  	sector_t		sb_fseg_start, sb_fseg_end;
>  	sector_t		sb_pseg_start;
>  	unsigned		sb_rest_blocks;

> +	int			sb_flags;

ditto.

> +	__u32			sb_nlive_blks;
> +	__u32			sb_nsnapshot_blks;
>  
>  	/* Buffers */
>  	struct list_head	sb_segsum_buffers;
> @@ -95,6 +98,8 @@ struct nilfs_segment_buffer {
>  	struct completion	sb_bio_event;
>  };
>  
> +#define NILFS_SEGBUF_SUSET	BIT(0)	/* segment usage has been set */
> +

Ditto.

>  #define NILFS_LIST_SEGBUF(head)  \
>  	list_entry((head), struct nilfs_segment_buffer, sb_list)
>  #define NILFS_NEXT_SEGBUF(segbuf)  NILFS_LIST_SEGBUF((segbuf)->sb_list.next)
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index c6abbad9..14e76c3 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -762,7 +762,8 @@ static int nilfs_test_metadata_dirty(struct the_nilfs *nilfs,
>  		ret++;
>  	if (nilfs_mdt_fetch_dirty(nilfs->ns_cpfile))
>  		ret++;
> -	if (nilfs_mdt_fetch_dirty(nilfs->ns_sufile))
> +	if (nilfs_mdt_fetch_dirty(nilfs->ns_sufile) ||
> +	    nilfs_sufile_cache_dirty(nilfs->ns_sufile))
>  		ret++;
>  	if ((ret || nilfs_doing_gc()) && nilfs_mdt_fetch_dirty(nilfs->ns_dat))
>  		ret++;
> @@ -1368,36 +1369,49 @@ static void nilfs_free_incomplete_logs(struct list_head *logs,
>  }
>  
>  static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
> -					  struct inode *sufile)
> +					  struct the_nilfs *nilfs)

Do not change the sufile argument to nilfs.  It's not necessary
for this change.

>  {
>  	struct nilfs_segment_buffer *segbuf;
> -	unsigned long live_blocks;
> +	struct inode *sufile = nilfs->ns_sufile;
> +	unsigned long nblocks;
>  	int ret;
>  
>  	list_for_each_entry(segbuf, &sci->sc_segbufs, sb_list) {
> -		live_blocks = segbuf->sb_sum.nblocks +
> +		nblocks = segbuf->sb_sum.nblocks +
>  			(segbuf->sb_pseg_start - segbuf->sb_fseg_start);

>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
> -						     live_blocks,
> +						     nblocks,
> +						     segbuf->sb_nlive_blks,
> +						     segbuf->sb_nsnapshot_blks,
>  						     sci->sc_seg_ctime);

With this change, two different semantics, "set" and "modify", are
mixed up in the arguments of nilfs_sufile_set_segment_usage().  That
is confusing.

Please rename nilfs_sufile_set_segment_usage(), for instance to
nilfs_sufile_modify_segment_usage(), and rewrite the above part so
that all counter arguments are passed with "modify" semantics.
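
"Modify" semantics here amount to applying a signed delta to a counter
and clamping the result to [0, nblocks], as the quoted sufile.c hunk
does with min_t/max_t. A minimal user-space model of that clamping
(the little-endian on-disk field conversion is omitted):

```c
#include <stdint.h>

/*
 * Apply a signed delta to a 32-bit counter and clamp the result to the
 * range [0, nblocks], mirroring the min_t/max_t clamping applied to
 * su_nlive_blks and su_nsnapshot_blks.
 */
static uint32_t modify_counter(uint32_t cur, int64_t delta, uint32_t nblocks)
{
	int64_t v = (int64_t)cur + delta;

	if (v < 0)
		v = 0;
	if (v > (int64_t)nblocks)
		v = nblocks;
	return (uint32_t)v;
}
```

With uniform "modify" semantics, setting a counter to an absolute value N
is expressed as a delta of N against a counter of zero, so one calling
convention covers both uses.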

>  		WARN_ON(ret); /* always succeed because the segusage is dirty */
> +
> +		segbuf->sb_flags |= NILFS_SEGBUF_SUSET;

Use sci->sc_stage.flags, adding a NILFS_CF_SUMOD flag.  Note that the
flag must also be added to NILFS_CF_HISTORY_MASK so that it is cleared
every time a new cycle starts in the loop of
nilfs_segctor_do_construct().
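
The history-mask mechanism referred to here is simply a per-cycle reset:
flags in the mask describe "what happened during this construction pass"
and are cleared when a new pass begins. A schematic model (the flag
values are illustrative, not the kernel's):

```c
#include <stdint.h>

/* Illustrative per-cycle state flags (values are made up). */
#define CF_SUFREED	(1u << 0)
#define CF_SUMOD	(1u << 1)	/* proposed: segusage was modified */
#define CF_HISTORY_MASK	(CF_SUFREED | CF_SUMOD)

/*
 * At the start of each construction cycle, flags covered by the history
 * mask are cleared so that they only reflect the current cycle; flags
 * outside the mask survive across cycles.
 */
static uint32_t start_new_cycle(uint32_t flags)
{
	return flags & ~CF_HISTORY_MASK;
}
```

A flag left out of CF_HISTORY_MASK would remain set from a previous
cycle, which is exactly the bug the review warns against.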

>  	}
>  }
>  
> -static void nilfs_cancel_segusage(struct list_head *logs, struct inode *sufile)
> +static void nilfs_cancel_segusage(struct list_head *logs,
> +				  struct the_nilfs *nilfs)

Ditto.  Do not change the sufile argument to a pointer to the nilfs
object.

>  {
>  	struct nilfs_segment_buffer *segbuf;
> +	struct inode *sufile = nilfs->ns_sufile;
> +	__s64 nlive_blks = 0, nsnapshot_blks = 0;
>  	int ret;
>  
>  	segbuf = NILFS_FIRST_SEGBUF(logs);

> +	if (segbuf->sb_flags & NILFS_SEGBUF_SUSET) {

Ditto.

> +		nlive_blks = -(__s64)segbuf->sb_nlive_blks;
> +		nsnapshot_blks = -(__s64)segbuf->sb_nsnapshot_blks;
> +	}
>  	ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>  					     segbuf->sb_pseg_start -
> -					     segbuf->sb_fseg_start, 0);
> +					     segbuf->sb_fseg_start,
> +					     nlive_blks, nsnapshot_blks, 0);

Ditto.

>  	WARN_ON(ret); /* always succeed because the segusage is dirty */
>  
>  	list_for_each_entry_continue(segbuf, logs, sb_list) {
>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
> -						     0, 0);
> +						     0, 0, 0, 0);
>  		WARN_ON(ret); /* always succeed */
>  	}
>  }
> @@ -1499,6 +1513,7 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>  	if (!nfinfo)
>  		goto out;
>  
> +	segbuf->sb_nlive_blks = segbuf->sb_sum.nfileblk;
>  	blocknr = segbuf->sb_pseg_start + segbuf->sb_sum.nsumblk;
>  	ssp.bh = NILFS_SEGBUF_FIRST_BH(&segbuf->sb_segsum_buffers);
>  	ssp.offset = sizeof(struct nilfs_segment_summary);
> @@ -1728,7 +1743,7 @@ static void nilfs_segctor_abort_construction(struct nilfs_sc_info *sci,
>  	nilfs_abort_logs(&logs, ret ? : err);
>  
>  	list_splice_tail_init(&sci->sc_segbufs, &logs);
> -	nilfs_cancel_segusage(&logs, nilfs->ns_sufile);
> +	nilfs_cancel_segusage(&logs, nilfs);
>  	nilfs_free_incomplete_logs(&logs, nilfs);
>  
>  	if (sci->sc_stage.flags & NILFS_CF_SUFREED) {
> @@ -1790,7 +1805,8 @@ static void nilfs_segctor_complete_write(struct nilfs_sc_info *sci)
>  			const unsigned long clear_bits =
>  				(1 << BH_Dirty | 1 << BH_Async_Write |
>  				 1 << BH_Delay | 1 << BH_NILFS_Volatile |
> -				 1 << BH_NILFS_Redirected);
> +				 1 << BH_NILFS_Redirected |
> +				 1 << BH_NILFS_Counted);

Ditto.  Stop adding the nilfs_counted flag.

>  
>  			set_mask_bits(&bh->b_state, clear_bits, set_bits);
>  			if (bh == segbuf->sb_super_root) {
> @@ -1995,7 +2011,14 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
>  
>  			nilfs_segctor_fill_in_super_root(sci, nilfs);
>  		}
> -		nilfs_segctor_update_segusage(sci, nilfs->ns_sufile);
> +
> +		if (nilfs_feature_track_live_blks(nilfs)) {
> +			err = nilfs_sufile_flush_cache(nilfs->ns_sufile, 0,
> +						       NULL);
> +			if (unlikely(err))
> +				goto failed_to_write;
> +		}
> +		nilfs_segctor_update_segusage(sci, nilfs);
>  
>  		/* Write partial segments */
>  		nilfs_segctor_prepare_write(sci);
> @@ -2022,6 +2045,9 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
>  		}
>  	} while (sci->sc_stage.scnt != NILFS_ST_DONE);
>  

> +	if (nilfs_feature_track_live_blks(nilfs))
> +		nilfs_sufile_shrink_cache(nilfs->ns_sufile);

As I mentioned earlier, this shrink-cache function should be called
from a destructor of SUFILE, which doesn't exist at present.

> +
>   out:
>  	nilfs_segctor_drop_written_files(sci, nilfs);
>  	return err;
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index 80bbd87..9cd8820d 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -527,10 +527,13 @@ int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum)
>   * @sufile: inode of segment usage file
>   * @segnum: segment number
>   * @nblocks: number of live blocks in the segment
> + * @nlive_blks: number of live blocks to add to the su_nlive_blks field
> + * @nsnapshot_blks: number of snapshot blocks to add to su_nsnapshot_blks
>   * @modtime: modification time (option)
>   */
>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
> -				   unsigned long nblocks, time_t modtime)
> +				   unsigned long nblocks, __s64 nlive_blks,
> +				   __s64 nsnapshot_blks, time_t modtime)

As I mentioned above, this function should be renamed to
nilfs_sufile_modify_segment_usage(), and the semantics of the nblocks,
nlive_blks, and nsnapshot_blks arguments should be unified to "modify"
semantics.

Also, the types of these three counter arguments are not uniform.

Regards,
Ryusuke Konishi

>  {
>  	struct buffer_head *bh;
>  	struct nilfs_segment_usage *su;
> @@ -548,6 +551,18 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>  	if (modtime)
>  		su->su_lastmod = cpu_to_le64(modtime);
>  	su->su_nblocks = cpu_to_le32(nblocks);
> +
> +	if (nilfs_sufile_live_blks_ext_supported(sufile)) {
> +		nsnapshot_blks += le32_to_cpu(su->su_nsnapshot_blks);
> +		nsnapshot_blks = min_t(__s64, max_t(__s64, nsnapshot_blks, 0),
> +				       nblocks);
> +		su->su_nsnapshot_blks = cpu_to_le32(nsnapshot_blks);
> +
> +		nlive_blks += le32_to_cpu(su->su_nlive_blks);
> +		nlive_blks = min_t(__s64, max_t(__s64, nlive_blks, 0), nblocks);
> +		su->su_nlive_blks = cpu_to_le32(nlive_blks);
> +	}
> +
>  	kunmap_atomic(kaddr);
>  
>  	mark_buffer_dirty(bh);
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index 662ab56..3466abb 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -60,7 +60,8 @@ int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
>  int nilfs_sufile_alloc(struct inode *, __u64 *);
>  int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
> -				   unsigned long nblocks, time_t modtime);
> +				   unsigned long nblocks, __s64 nlive_blks,
> +				   __s64 nsnapshot_blks, time_t modtime);
>  int nilfs_sufile_get_stat(struct inode *, struct nilfs_sustat *);
>  ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
>  				size_t);
> -- 
> 2.3.7
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out
       [not found]     ` <1430647522-14304-8-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-09 12:17       ` Ryusuke Konishi
       [not found]         ` <20150509.211741.1463241033923032068.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-09 12:17 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun,  3 May 2015 12:05:20 +0200, Andreas Rohner wrote:
> This patch ensures, that all dirty blocks are written out if the segment
> construction mode is SC_LSEG_SR. The scanning of the DAT file can cause
> blocks in the SUFILE to be dirtied and newly dirtied blocks in the
> SUFILE can in turn dirty more blocks in the DAT file. Since one of
> these stages has to happen before the other during segment
> construction, we end up with unwritten dirty blocks, that are lost
> in case of a file system unmount.
> 
> This patch introduces a new set of file scanning operations that
> only propagate the changes to the bmap and do not add anything to the
> segment buffer. The DAT file and SUFILE are scanned with these
> operations. The function nilfs_sufile_flush_cache() is called in between
> these scans with the parameter only_mark set. That way it can be called
> repeatedly without actually writing anything to the SUFILE. If there are
> no new blocks dirtied in the flush, the normal segment construction
> stages can safely continue.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/segment.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  fs/nilfs2/segment.h |  3 ++-
>  2 files changed, 74 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index 14e76c3..ab8df33 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -579,6 +579,12 @@ static int nilfs_collect_dat_data(struct nilfs_sc_info *sci,
>  	return err;
>  }
>  
> +static int nilfs_collect_prop_data(struct nilfs_sc_info *sci,
> +				  struct buffer_head *bh, struct inode *inode)
> +{
> +	return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
> +}
> +
>  static int nilfs_collect_dat_bmap(struct nilfs_sc_info *sci,
>  				  struct buffer_head *bh, struct inode *inode)
>  {
> @@ -613,6 +619,14 @@ static struct nilfs_sc_operations nilfs_sc_dat_ops = {
>  	.write_node_binfo = nilfs_write_dat_node_binfo,
>  };
>  
> +static struct nilfs_sc_operations nilfs_sc_prop_ops = {
> +	.collect_data = nilfs_collect_prop_data,
> +	.collect_node = nilfs_collect_file_node,
> +	.collect_bmap = NULL,
> +	.write_data_binfo = NULL,
> +	.write_node_binfo = NULL,
> +};
> +
>  static struct nilfs_sc_operations nilfs_sc_dsync_ops = {
>  	.collect_data = nilfs_collect_file_data,
>  	.collect_node = NULL,
> @@ -998,7 +1012,8 @@ static int nilfs_segctor_scan_file(struct nilfs_sc_info *sci,
>  			err = nilfs_segctor_apply_buffers(
>  				sci, inode, &data_buffers,
>  				sc_ops->collect_data);
> -			BUG_ON(!err); /* always receive -E2BIG or true error */
> +			/* always receive -E2BIG or true error (NOT ANYMORE?)*/
> +			/* BUG_ON(!err); */
>  			goto break_or_fail;
>  		}
>  	}

If n > rest, this function will exit without scanning node buffers
for nilfs_segctor_propagate_sufile().  This looks like a problem, right?

I think adding separate functions is better.  For instance,

static int nilfs_propagate_buffer(struct nilfs_sc_info *sci,
				  struct buffer_head *bh,
				  struct inode *inode)
{
	return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
}

static int nilfs_segctor_propagate_file(struct nilfs_sc_info *sci,
					struct inode *inode)
{
	LIST_HEAD(buffers);
	size_t n;
	int ret;

	n = nilfs_lookup_dirty_data_buffers(inode, &buffers, SIZE_MAX, 0,
					    LLONG_MAX);
	if (n > 0) {
		ret = nilfs_segctor_apply_buffers(sci, inode, &buffers,
						  nilfs_propagate_buffer);
		if (unlikely(ret))
			goto fail;
	}

	nilfs_lookup_dirty_node_buffers(inode, &buffers);
	ret = nilfs_segctor_apply_buffers(sci, inode, &buffers,
					  nilfs_propagate_buffer);
fail:
	return ret;
}

With this, you can also avoid defining nilfs_sc_prop_ops, nor touching
the BUG_ON() in nilfs_segctor_scan_file.

> @@ -1055,6 +1070,55 @@ static int nilfs_segctor_scan_file_dsync(struct nilfs_sc_info *sci,
>  	return err;
>  }
>  
> +/**
> + * nilfs_segctor_propagate_sufile - dirties all needed SUFILE blocks
> + * @sci: nilfs_sc_info
> + *
> + * Description: Dirties and propagates all SUFILE blocks that need to be
> + * available later in the segment construction process, when the SUFILE cache
> + * is flushed. Here the SUFILE cache is not actually flushed, but the blocks
> + * that are needed for a later flush are marked as dirty. Since the propagation
> + * of the SUFILE can dirty DAT entries and vice versa, the functions
> + * are executed in a loop until no new blocks are dirtied.
> + *
> + * Return Value: On success, 0 is returned on error, one of the following
> + * negative error codes is returned.
> + *
> + * %-ENOMEM - Insufficient memory available.
> + *
> + * %-EIO - I/O error
> + *
> + * %-EROFS - Read only filesystem (for create mode)
> + */
> +static int nilfs_segctor_propagate_sufile(struct nilfs_sc_info *sci)
> +{
> +	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
> +	unsigned long ndirty_blks;
> +	int ret, retrycount = NILFS_SC_SUFILE_PROP_RETRY;
> +
> +	do {
> +		/* count changes to DAT file before flush */
> +		ret = nilfs_segctor_scan_file(sci, nilfs->ns_dat,
> +					      &nilfs_sc_prop_ops);

Use the previous nilfs_segctor_propagate_file() here.

> +		if (unlikely(ret))
> +			return ret;
> +
> +		ret = nilfs_sufile_flush_cache(nilfs->ns_sufile, 1,
> +					       &ndirty_blks);
> +		if (unlikely(ret))
> +			return ret;
> +		if (!ndirty_blks)
> +			break;
> +
> +		ret = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
> +					      &nilfs_sc_prop_ops);

Ditto.

> +		if (unlikely(ret))
> +			return ret;
> +	} while (ndirty_blks && retrycount-- > 0);
> +

Uum.  This still looks like it can leak dirty blocks between DAT and
SUFILE during collection, since the retry is limited by a fixed retry
count.

How about adding a function that temporarily turns off live block
tracking, and using it after this propagation loop until the log write
finishes?

It would reduce the accuracy of the live block count, but is that
acceptable?  What do you think?  We have to eliminate the possibility
of the leak because it can cause file system corruption.  Every
checkpoint must be self-contained.
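
The convergence concern can be seen in a toy fixed-point model: each
pass over one file may dirty blocks of the other, and a fixed retry cap
stops the loop whether or not it has converged. The sketch below is a
hypothetical illustration only; the propagation function stands in for
the DAT/SUFILE scans and does not model their real behavior:

```c
#include <stdint.h>

/*
 * Toy model: propagating one file's dirty blocks dirties some blocks of
 * the peer file.  Returns the number of newly dirtied peer blocks
 * (here, arbitrarily, half of what was pending).
 */
static unsigned propagate_once(unsigned *pending)
{
	unsigned newly = *pending / 2;

	*pending = newly;
	return newly;
}

/*
 * Iterate until no new blocks are dirtied, or the retry cap is hit.
 * Returns 0 on convergence, -1 if dirty blocks may remain -- the leak
 * the review warns about, which would break checkpoint self-containment.
 */
static int propagate_until_stable(unsigned pending, int retries)
{
	while (retries-- > 0) {
		if (propagate_once(&pending) == 0)
			return 0;
	}
	return pending ? -1 : 0;
}
```

The -1 case is why a bounded retry count alone is insufficient: something
else (such as temporarily disabling live block tracking) must guarantee
that no dirty blocks escape the checkpoint.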


> +	return 0;
> +}
> +
>  static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
>  {
>  	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
> @@ -1160,6 +1224,13 @@ static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
>  		}
>  		sci->sc_stage.flags |= NILFS_CF_SUFREED;
>  

> +		if (mode == SC_LSEG_SR &&

This test ("mode == SC_LSEG_SR") can be removed.  When the thread
comes here, it will always make a checkpoint.

> +		    nilfs_feature_track_live_blks(nilfs)) {
> +			err = nilfs_segctor_propagate_sufile(sci);
> +			if (unlikely(err))
> +				break;
> +		}
> +
>  		err = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
>  					      &nilfs_sc_file_ops);
>  		if (unlikely(err))
> diff --git a/fs/nilfs2/segment.h b/fs/nilfs2/segment.h
> index a48d6de..5aa7f91 100644
> --- a/fs/nilfs2/segment.h
> +++ b/fs/nilfs2/segment.h
> @@ -208,7 +208,8 @@ enum {
>   */
>  #define NILFS_SC_CLEANUP_RETRY	    3  /* Retry count of construction when
>  					  destroying segctord */
> -
> +#define NILFS_SC_SUFILE_PROP_RETRY  10 /* Retry count of the propagate
> +					  sufile loop */

How many times does the propagation loop have to be repeated
until it converges?

The current dirty block scanning function collects all dirty blocks of
the specified file (i.e. SUFILE or DAT), traversing the page cache and
building and destroying a list of dirty buffers every time the
propagation function is called.  It is wasteful to repeat that many
times.

Regards,
Ryusuke Konishi

>  /*
>   * Default values of timeout, in seconds.
>   */
> -- 
> 2.3.7
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 6/9] nilfs2: add tracking of block deletions and updates
       [not found]         ` <20150509.160512.1087140271092828536.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-09 15:58           ` Ryusuke Konishi
  2015-05-09 20:02           ` Andreas Rohner
  1 sibling, 0 replies; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-09 15:58 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

One more comment.

>> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
>> index b04f08c..a802f61 100644
>> --- a/fs/nilfs2/segbuf.h
>> +++ b/fs/nilfs2/segbuf.h
>> @@ -83,6 +83,9 @@ struct nilfs_segment_buffer {
>>  	sector_t		sb_fseg_start, sb_fseg_end;
>>  	sector_t		sb_pseg_start;
>>  	unsigned		sb_rest_blocks;
> 
>> +	int			sb_flags;
> 
> ditto.
> 

>> +	__u32			sb_nlive_blks;
>> +	__u32			sb_nsnapshot_blks;
>>  

Please add corresponding comment lines of these variables to the
kernel-doc comment of the structure.

Regards,
Ryusuke Konishi

On Sat, 09 May 2015 16:05:12 +0900 (JST), Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:19 +0200, Andreas Rohner wrote:
>> This patch adds tracking of block deletions and updates for all files.
>> It uses the fact, that for every block, NILFS2 keeps an entry in the
>> DAT file and stores the checkpoint where it was created, deleted or
>> overwritten. So whenever a block is deleted or overwritten
>> nilfs_dat_commit_end() is called to update the DAT entry. At this
>> point this patch simply decrements the su_nlive_blks field of the
>> corresponding segment. The value of su_nlive_blks is set at segment
>> creation time.
>> 
>> The DAT file itself has of course no DAT entries for its own blocks, but
>> it still has to propagate deletions and updates to its btree. When this
>> happens this patch again decrements the su_nlive_blks field of the
>> corresponding segment.
>> 
>> The new feature compatibility flag NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS
>> can be used to enable or disable the block tracking at any time.
>> 
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/btree.c   | 33 ++++++++++++++++++++++++++++++---
>>  fs/nilfs2/dat.c     | 15 +++++++++++++--
>>  fs/nilfs2/direct.c  | 20 +++++++++++++++-----
>>  fs/nilfs2/page.c    |  6 ++++--
>>  fs/nilfs2/page.h    |  3 +++
>>  fs/nilfs2/segbuf.c  |  3 +++
>>  fs/nilfs2/segbuf.h  |  5 +++++
>>  fs/nilfs2/segment.c | 48 +++++++++++++++++++++++++++++++++++++-----------
>>  fs/nilfs2/sufile.c  | 17 ++++++++++++++++-
>>  fs/nilfs2/sufile.h  |  3 ++-
>>  10 files changed, 128 insertions(+), 25 deletions(-)
>> 
>> diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
>> index 059f371..d3b2763 100644
>> --- a/fs/nilfs2/btree.c
>> +++ b/fs/nilfs2/btree.c
>> @@ -30,6 +30,7 @@
>>  #include "btree.h"
>>  #include "alloc.h"
>>  #include "dat.h"
>> +#include "sufile.h"
>>  
>>  static void __nilfs_btree_init(struct nilfs_bmap *bmap);
>>  
> 
>> @@ -1889,9 +1890,35 @@ static int nilfs_btree_propagate_p(struct nilfs_bmap *btree,
>>  				   int level,
>>  				   struct buffer_head *bh)
>>  {
>> -	while ((++level < nilfs_btree_height(btree) - 1) &&
>> -	       !buffer_dirty(path[level].bp_bh))
>> -		mark_buffer_dirty(path[level].bp_bh);
>> +	struct the_nilfs *nilfs = btree->b_inode->i_sb->s_fs_info;
>> +	struct nilfs_btree_node *node;
>> +	__u64 ptr, segnum;
>> +	int ncmax, vol, counted;
>> +
>> +	vol = buffer_nilfs_volatile(bh);
>> +	counted = buffer_nilfs_counted(bh);
>> +	set_buffer_nilfs_counted(bh);
>> +
>> +	while (++level < nilfs_btree_height(btree)) {
>> +		if (!vol && !counted && nilfs_feature_track_live_blks(nilfs)) {
>> +			node = nilfs_btree_get_node(btree, path, level, &ncmax);
>> +			ptr = nilfs_btree_node_get_ptr(node,
>> +						       path[level].bp_index,
>> +						       ncmax);
>> +			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
>> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
>> +		}
>> +
>> +		if (path[level].bp_bh) {
>> +			if (buffer_dirty(path[level].bp_bh))
>> +				break;
>> +
>> +			mark_buffer_dirty(path[level].bp_bh);
>> +			vol = buffer_nilfs_volatile(path[level].bp_bh);
>> +			counted = buffer_nilfs_counted(path[level].bp_bh);
>> +			set_buffer_nilfs_counted(path[level].bp_bh);
>> +		}
>> +	}
>>  
>>  	return 0;
>>  }
> 
> Consider the following comments:
> 
> - Please use volatile flag also for the duplication check instead of
>   adding nilfs_counted flag.  
> - btree.c, direct.c, and dat.c shouldn't refer to the SUFILE directly.
>   Please add a wrapper function like "nilfs_dec_nlive_blks(nilfs, blocknr)"
>   to the implementation of the_nilfs.c, and use it instead.
> - To clarify the implementation, separate out a function to update
>   pointers, as nilfs_btree_propagate_v() does.
> - The return value of nilfs_sufile_dec_nlive_blks() appears to be ignored
>   intentionally.  Please add a comment explaining why.
> 
> e.g.
> 
> static void nilfs_btree_update_p(struct nilfs_bmap *btree,
>                                  struct nilfs_btree_path *path, int level)
> {
> 	struct the_nilfs *nilfs = btree->b_inode->i_sb->s_fs_info;
> 	struct nilfs_btree_node *parent;
> 	__u64 ptr;
> 	int ncmax;
> 
> 	if (nilfs_feature_track_live_blks(nilfs)) {
> 		parent = nilfs_btree_get_node(btree, path, level + 1, &ncmax);
> 		ptr = nilfs_btree_node_get_ptr(parent,
> 					       path[level + 1].bp_index,
> 					       ncmax);
> 		nilfs_dec_nlive_blks(nilfs, ptr);
> 		/* (Please add a comment explaining why we ignore the return value) */
> 	}
> 	set_buffer_nilfs_volatile(path[level].bp_bh);
> }
> 
> static int nilfs_btree_propagate_p(struct nilfs_bmap *btree,
> 				   struct nilfs_btree_path *path,
> 				   int level,
> 				   struct buffer_head *bh)
> {
> 	/*
> 	 * Update pointer to the given dirty buffer.  If the buffer is
> 	 * marked volatile, it shouldn't be updated because it's
> 	 * either a newly created buffer or an already updated one.
> 	 */
> 	if (!buffer_nilfs_volatile(path[level].bp_bh))
> 		nilfs_btree_update_p(btree, path, level);
> 
> 	/*
> 	 * Mark upper nodes dirty and update their pointers unless
> 	 * they're already marked dirty.
> 	 */
> 	while (++level < nilfs_btree_height(btree) - 1 &&
> 	       !buffer_dirty(path[level].bp_bh)) {
> 
> 		WARN_ON(buffer_nilfs_volatile(path[level].bp_bh));
> 		nilfs_btree_update_p(btree, path, level);
> 		mark_buffer_dirty(path[level].bp_bh);
> 	}
> 	return 0;
> }
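The wrapper the review asks for can be modeled in plain userspace C. This is a hedged sketch, not the actual kernel implementation: the structure layout, the fixed-size counter array, and the segment mapping are simplified stand-ins, and the real function would live in the_nilfs.c and delegate to nilfs_sufile_dec_nlive_blks().

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct the_nilfs (hypothetical). */
struct the_nilfs {
	uint32_t ns_blocks_per_segment;
	uint32_t nlive_blks[4];		/* per-segment live-block counters */
};

/* Model of nilfs_get_segnum_of_block(): map a block number to its segment. */
static uint64_t get_segnum_of_block(const struct the_nilfs *nilfs,
				    uint64_t blocknr)
{
	return blocknr / nilfs->ns_blocks_per_segment;
}

/*
 * Model of the suggested nilfs_dec_nlive_blks(): callers in btree.c,
 * direct.c and dat.c pass a raw block number, and the SUFILE details
 * stay hidden behind this one function.  Errors can be ignored by the
 * callers because a failed decrement merely leaves the live-block
 * counter conservative, which is safe for a GC heuristic.
 */
static void nilfs_dec_nlive_blks(struct the_nilfs *nilfs, uint64_t blocknr)
{
	uint64_t segnum = get_segnum_of_block(nilfs, blocknr);

	if (nilfs->nlive_blks[segnum] > 0)
		nilfs->nlive_blks[segnum]--;
}
```

The design point is the same one the review makes: only the_nilfs.c knows how a block number turns into a SUFILE update.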
> 
>> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
>> index 0d5fada..9c2fc32 100644
>> --- a/fs/nilfs2/dat.c
>> +++ b/fs/nilfs2/dat.c
>> @@ -28,6 +28,7 @@
>>  #include "mdt.h"
>>  #include "alloc.h"
>>  #include "dat.h"
>> +#include "sufile.h"
>>  
>>  
>>  #define NILFS_CNO_MIN	((__u64)1)
>> @@ -188,9 +189,10 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
>>  			  int dead)
>>  {
>>  	struct nilfs_dat_entry *entry;
>> -	__u64 start, end;
>> +	__u64 start, end, segnum;
>>  	sector_t blocknr;
>>  	void *kaddr;
>> +	struct the_nilfs *nilfs;
>>  
>>  	kaddr = kmap_atomic(req->pr_entry_bh->b_page);
>>  	entry = nilfs_palloc_block_get_entry(dat, req->pr_entry_nr,
>> @@ -206,8 +208,17 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
>>  
>>  	if (blocknr == 0)
>>  		nilfs_dat_commit_free(dat, req);
>> -	else
> 
> Add braces around nilfs_dat_commit_free() as well, since you add multiple
> statements to the else clause.  See chapter 3 of the CodingStyle file.
> 
>> +	else {
>>  		nilfs_dat_commit_entry(dat, req);
>> +
>> +		nilfs = dat->i_sb->s_fs_info;
>> +
>> +		if (nilfs_feature_track_live_blks(nilfs)) {
> 
>> +			segnum = nilfs_get_segnum_of_block(nilfs, blocknr);
>> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
> 
> Ditto.  Call nilfs_dec_nlive_blks(nilfs, blocknr) instead, and do not
> add a dependency on the SUFILE in dat.c.
> 
>> +		}
>> +	}
>> +
>>  }
>>  
>>  void nilfs_dat_abort_end(struct inode *dat, struct nilfs_palloc_req *req)
>> diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c
>> index ebf89fd..42704eb 100644
>> --- a/fs/nilfs2/direct.c
>> +++ b/fs/nilfs2/direct.c
>> @@ -26,6 +26,7 @@
>>  #include "direct.h"
>>  #include "alloc.h"
>>  #include "dat.h"
>> +#include "sufile.h"
>>  
>>  static inline __le64 *nilfs_direct_dptrs(const struct nilfs_bmap *direct)
>>  {
>> @@ -268,18 +269,27 @@ int nilfs_direct_delete_and_convert(struct nilfs_bmap *bmap,
>>  static int nilfs_direct_propagate(struct nilfs_bmap *bmap,
>>  				  struct buffer_head *bh)
>>  {
>> +	struct the_nilfs *nilfs = bmap->b_inode->i_sb->s_fs_info;
>>  	struct nilfs_palloc_req oldreq, newreq;
>>  	struct inode *dat;
>> -	__u64 key;
>> -	__u64 ptr;
>> +	__u64 key, ptr, segnum;
>>  	int ret;
>>  
>> -	if (!NILFS_BMAP_USE_VBN(bmap))
>> -		return 0;
>> -
> 
>>  	dat = nilfs_bmap_get_dat(bmap);
>>  	key = nilfs_bmap_data_get_key(bmap, bh);
>>  	ptr = nilfs_direct_get_ptr(bmap, key);
>> +
> 
>> +	if (unlikely(!NILFS_BMAP_USE_VBN(bmap))) {
>> +		if (!buffer_nilfs_volatile(bh) && !buffer_nilfs_counted(bh) &&
>> +				nilfs_feature_track_live_blks(nilfs)) {
>> +			set_buffer_nilfs_counted(bh);
>> +			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
>> +
>> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
>> +		}
>> +		return 0;
>> +	}
> 
> Use the volatile flag also for the duplication check, and do not use the
> unlikely() macro when testing "!NILFS_BMAP_USE_VBN(bmap)".  It's not an
> exceptional condition like an error:
> 
> 	if (!NILFS_BMAP_USE_VBN(bmap)) {
> 		if (!buffer_nilfs_volatile(bh)) {
> 			if (nilfs_feature_track_live_blks(nilfs))
> 				nilfs_dec_nlive_blks(nilfs, ptr);
> 			set_buffer_nilfs_volatile(bh);
> 		}
> 		return 0;
> 	}
> 
>> +
>>  	if (!buffer_nilfs_volatile(bh)) {
>>  		oldreq.pr_entry_nr = ptr;
>>  		newreq.pr_entry_nr = ptr;
>> diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
>> index 45d650a..fd21b43 100644
>> --- a/fs/nilfs2/page.c
>> +++ b/fs/nilfs2/page.c
>> @@ -92,7 +92,8 @@ void nilfs_forget_buffer(struct buffer_head *bh)
>>  	const unsigned long clear_bits =
>>  		(1 << BH_Uptodate | 1 << BH_Dirty | 1 << BH_Mapped |
>>  		 1 << BH_Async_Write | 1 << BH_NILFS_Volatile |
>> -		 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected);
>> +		 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected |
> 
>> +		 1 << BH_NILFS_Counted);
> 
> You don't have to add the nilfs_counted flag, as I mentioned above.
> Remove this.
> 
>>  
>>  	lock_buffer(bh);
>>  	set_mask_bits(&bh->b_state, clear_bits, 0);
>> @@ -422,7 +423,8 @@ void nilfs_clear_dirty_page(struct page *page, bool silent)
>>  		const unsigned long clear_bits =
>>  			(1 << BH_Uptodate | 1 << BH_Dirty | 1 << BH_Mapped |
>>  			 1 << BH_Async_Write | 1 << BH_NILFS_Volatile |
>> -			 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected);
>> +			 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected |
> 
>> +			 1 << BH_NILFS_Counted);
> 
> Ditto.
> 
>>  
>>  		bh = head = page_buffers(page);
>>  		do {
>> diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
>> index a43b828..4e35814 100644
>> --- a/fs/nilfs2/page.h
>> +++ b/fs/nilfs2/page.h
>> @@ -36,12 +36,15 @@ enum {
>>  	BH_NILFS_Volatile,
>>  	BH_NILFS_Checked,
>>  	BH_NILFS_Redirected,
>> +	BH_NILFS_Counted,
> 
> Ditto.
> 
>>  };
>>  
>>  BUFFER_FNS(NILFS_Node, nilfs_node)		/* nilfs node buffers */
>>  BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
>>  BUFFER_FNS(NILFS_Checked, nilfs_checked)	/* buffer is verified */
>>  BUFFER_FNS(NILFS_Redirected, nilfs_redirected)	/* redirected to a copy */
> 
>> +/* counted by propagate_p for segment usage */
>> +BUFFER_FNS(NILFS_Counted, nilfs_counted)
> 
> Ditto.
> 
>>  
>>  
>>  int __nilfs_clear_page_dirty(struct page *);
>> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
>> index dc3a9efd..dabb65b 100644
>> --- a/fs/nilfs2/segbuf.c
>> +++ b/fs/nilfs2/segbuf.c
>> @@ -57,6 +57,9 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
>>  	INIT_LIST_HEAD(&segbuf->sb_segsum_buffers);
>>  	INIT_LIST_HEAD(&segbuf->sb_payload_buffers);
>>  	segbuf->sb_super_root = NULL;
> 
>> +	segbuf->sb_flags = 0;
> 
> You don't have to add sb_flags.  Use sci->sc_stage.flags instead,
> because the flag manages the internal state of segment construction
> rather than the state of the segbuf.
> 
>> +	segbuf->sb_nlive_blks = 0;
>> +	segbuf->sb_nsnapshot_blks = 0;
>>  
>>  	init_completion(&segbuf->sb_bio_event);
>>  	atomic_set(&segbuf->sb_err, 0);
>> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
>> index b04f08c..a802f61 100644
>> --- a/fs/nilfs2/segbuf.h
>> +++ b/fs/nilfs2/segbuf.h
>> @@ -83,6 +83,9 @@ struct nilfs_segment_buffer {
>>  	sector_t		sb_fseg_start, sb_fseg_end;
>>  	sector_t		sb_pseg_start;
>>  	unsigned		sb_rest_blocks;
> 
>> +	int			sb_flags;
> 
> ditto.
> 
>> +	__u32			sb_nlive_blks;
>> +	__u32			sb_nsnapshot_blks;
>>  
>>  	/* Buffers */
>>  	struct list_head	sb_segsum_buffers;
>> @@ -95,6 +98,8 @@ struct nilfs_segment_buffer {
>>  	struct completion	sb_bio_event;
>>  };
>>  
>> +#define NILFS_SEGBUF_SUSET	BIT(0)	/* segment usage has been set */
>> +
> 
> Ditto.
> 
>>  #define NILFS_LIST_SEGBUF(head)  \
>>  	list_entry((head), struct nilfs_segment_buffer, sb_list)
>>  #define NILFS_NEXT_SEGBUF(segbuf)  NILFS_LIST_SEGBUF((segbuf)->sb_list.next)
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index c6abbad9..14e76c3 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -762,7 +762,8 @@ static int nilfs_test_metadata_dirty(struct the_nilfs *nilfs,
>>  		ret++;
>>  	if (nilfs_mdt_fetch_dirty(nilfs->ns_cpfile))
>>  		ret++;
>> -	if (nilfs_mdt_fetch_dirty(nilfs->ns_sufile))
>> +	if (nilfs_mdt_fetch_dirty(nilfs->ns_sufile) ||
>> +	    nilfs_sufile_cache_dirty(nilfs->ns_sufile))
>>  		ret++;
>>  	if ((ret || nilfs_doing_gc()) && nilfs_mdt_fetch_dirty(nilfs->ns_dat))
>>  		ret++;
>> @@ -1368,36 +1369,49 @@ static void nilfs_free_incomplete_logs(struct list_head *logs,
>>  }
>>  
>>  static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
>> -					  struct inode *sufile)
>> +					  struct the_nilfs *nilfs)
> 
> Do not change the sufile argument to nilfs.  It's not necessary
> for this change.
> 
>>  {
>>  	struct nilfs_segment_buffer *segbuf;
>> -	unsigned long live_blocks;
>> +	struct inode *sufile = nilfs->ns_sufile;
>> +	unsigned long nblocks;
>>  	int ret;
>>  
>>  	list_for_each_entry(segbuf, &sci->sc_segbufs, sb_list) {
>> -		live_blocks = segbuf->sb_sum.nblocks +
>> +		nblocks = segbuf->sb_sum.nblocks +
>>  			(segbuf->sb_pseg_start - segbuf->sb_fseg_start);
> 
>>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>> -						     live_blocks,
>> +						     nblocks,
>> +						     segbuf->sb_nlive_blks,
>> +						     segbuf->sb_nsnapshot_blks,
>>  						     sci->sc_seg_ctime);
> 
> With this change, two different semantics, "set" and "modify", are
> mixed up in the arguments of nilfs_sufile_set_segment_usage().  It's
> bad and confusing.
> 
> Please change nilfs_sufile_set_segment_usage() function, for instance,
> to nilfs_sufile_modify_segment_usage() and rewrite the above part
> so that all counter arguments are passed with the "modify" semantics.
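The clamping behavior that the proposed "modify" semantics would need is already visible in the patch (the min_t()/max_t() pair in nilfs_sufile_set_segment_usage). A minimal userspace model of it, with a hypothetical helper name and plain integer types:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Userspace model of the counter update inside the proposed
 * nilfs_sufile_modify_segment_usage(): every counter argument is a
 * signed delta applied to the stored on-disk value, and the result is
 * clamped to the range [0, nblocks], mirroring the patch's
 * min_t()/max_t() pair.
 */
static uint32_t modify_counter(uint32_t old, int64_t delta, uint32_t nblocks)
{
	int64_t v = (int64_t)old + delta;

	if (v < 0)
		v = 0;
	if (v > (int64_t)nblocks)
		v = (int64_t)nblocks;
	return (uint32_t)v;
}
```

With uniform "modify" semantics, both the normal update path and the cancellation path (which passes negated deltas) go through the same clamped arithmetic.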
> 
>>  		WARN_ON(ret); /* always succeed because the segusage is dirty */
>> +
>> +		segbuf->sb_flags |= NILFS_SEGBUF_SUSET;
> 
> Use sci->sc_stage.flags, adding a NILFS_CF_SUMOD flag.  Note that the
> flag must also be added to NILFS_CF_HISTORY_MASK so that it will be
> cleared every time a new cycle starts in the loop of
> nilfs_segctor_do_construct().
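The flag handling the reviewer describes can be sketched in userspace like this. The flag values are illustrative only and NILFS_CF_SUMOD is the hypothetical new flag; what matters is that any flag included in the history mask is reset at the start of each construction cycle.

```c
#include <assert.h>

/* Illustrative values; only the mask logic matters here. */
#define NILFS_CF_SUFREED	0x0004
#define NILFS_CF_SUMOD		0x0008	/* hypothetical: segusage modified */
#define NILFS_CF_HISTORY_MASK	(NILFS_CF_SUFREED | NILFS_CF_SUMOD)

struct stage {
	int flags;
};

/*
 * Model of the per-cycle reset in nilfs_segctor_do_construct(): flags
 * in NILFS_CF_HISTORY_MASK record what happened during one cycle and
 * are cleared when the next cycle starts, so NILFS_CF_SUMOD never
 * leaks into the following construction pass.
 */
static void start_cycle(struct stage *st)
{
	st->flags &= ~NILFS_CF_HISTORY_MASK;
}
```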
> 
>>  	}
>>  }
>>  
>> -static void nilfs_cancel_segusage(struct list_head *logs, struct inode *sufile)
>> +static void nilfs_cancel_segusage(struct list_head *logs,
>> +				  struct the_nilfs *nilfs)
> 
> Ditto.  Do not change the sufile argument to a pointer to the nilfs
> object.
> 
>>  {
>>  	struct nilfs_segment_buffer *segbuf;
>> +	struct inode *sufile = nilfs->ns_sufile;
>> +	__s64 nlive_blks = 0, nsnapshot_blks = 0;
>>  	int ret;
>>  
>>  	segbuf = NILFS_FIRST_SEGBUF(logs);
> 
>> +	if (segbuf->sb_flags & NILFS_SEGBUF_SUSET) {
> 
> Ditto.
> 
>> +		nlive_blks = -(__s64)segbuf->sb_nlive_blks;
>> +		nsnapshot_blks = -(__s64)segbuf->sb_nsnapshot_blks;
>> +	}
>>  	ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>>  					     segbuf->sb_pseg_start -
>> -					     segbuf->sb_fseg_start, 0);
>> +					     segbuf->sb_fseg_start,
>> +					     nlive_blks, nsnapshot_blks, 0);
> 
> Ditto.
> 
>>  	WARN_ON(ret); /* always succeed because the segusage is dirty */
>>  
>>  	list_for_each_entry_continue(segbuf, logs, sb_list) {
>>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>> -						     0, 0);
>> +						     0, 0, 0, 0);
>>  		WARN_ON(ret); /* always succeed */
>>  	}
>>  }
>> @@ -1499,6 +1513,7 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>>  	if (!nfinfo)
>>  		goto out;
>>  
>> +	segbuf->sb_nlive_blks = segbuf->sb_sum.nfileblk;
>>  	blocknr = segbuf->sb_pseg_start + segbuf->sb_sum.nsumblk;
>>  	ssp.bh = NILFS_SEGBUF_FIRST_BH(&segbuf->sb_segsum_buffers);
>>  	ssp.offset = sizeof(struct nilfs_segment_summary);
>> @@ -1728,7 +1743,7 @@ static void nilfs_segctor_abort_construction(struct nilfs_sc_info *sci,
>>  	nilfs_abort_logs(&logs, ret ? : err);
>>  
>>  	list_splice_tail_init(&sci->sc_segbufs, &logs);
>> -	nilfs_cancel_segusage(&logs, nilfs->ns_sufile);
>> +	nilfs_cancel_segusage(&logs, nilfs);
>>  	nilfs_free_incomplete_logs(&logs, nilfs);
>>  
>>  	if (sci->sc_stage.flags & NILFS_CF_SUFREED) {
>> @@ -1790,7 +1805,8 @@ static void nilfs_segctor_complete_write(struct nilfs_sc_info *sci)
>>  			const unsigned long clear_bits =
>>  				(1 << BH_Dirty | 1 << BH_Async_Write |
>>  				 1 << BH_Delay | 1 << BH_NILFS_Volatile |
>> -				 1 << BH_NILFS_Redirected);
>> +				 1 << BH_NILFS_Redirected |
>> +				 1 << BH_NILFS_Counted);
> 
> Ditto.  Stop adding the nilfs_counted flag.
> 
>>  
>>  			set_mask_bits(&bh->b_state, clear_bits, set_bits);
>>  			if (bh == segbuf->sb_super_root) {
>> @@ -1995,7 +2011,14 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
>>  
>>  			nilfs_segctor_fill_in_super_root(sci, nilfs);
>>  		}
>> -		nilfs_segctor_update_segusage(sci, nilfs->ns_sufile);
>> +
>> +		if (nilfs_feature_track_live_blks(nilfs)) {
>> +			err = nilfs_sufile_flush_cache(nilfs->ns_sufile, 0,
>> +						       NULL);
>> +			if (unlikely(err))
>> +				goto failed_to_write;
>> +		}
>> +		nilfs_segctor_update_segusage(sci, nilfs);
>>  
>>  		/* Write partial segments */
>>  		nilfs_segctor_prepare_write(sci);
>> @@ -2022,6 +2045,9 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
>>  		}
>>  	} while (sci->sc_stage.scnt != NILFS_ST_DONE);
>>  
> 
>> +	if (nilfs_feature_track_live_blks(nilfs))
>> +		nilfs_sufile_shrink_cache(nilfs->ns_sufile);
> 
> As I mentioned earlier, this shrink-cache function should be called
> from a destructor of the sufile, which doesn't exist at present.
> 
>> +
>>   out:
>>  	nilfs_segctor_drop_written_files(sci, nilfs);
>>  	return err;
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 80bbd87..9cd8820d 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -527,10 +527,13 @@ int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum)
>>   * @sufile: inode of segment usage file
>>   * @segnum: segment number
>>   * @nblocks: number of live blocks in the segment
>> + * @nlive_blks: number of live blocks to add to the su_nlive_blks field
>> + * @nsnapshot_blks: number of snapshot blocks to add to su_nsnapshot_blks
>>   * @modtime: modification time (option)
>>   */
>>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>> -				   unsigned long nblocks, time_t modtime)
>> +				   unsigned long nblocks, __s64 nlive_blks,
>> +				   __s64 nsnapshot_blks, time_t modtime)
> 
> As I mentioned above, this function should be renamed to
> nilfs_sufile_modify_segment_usage() and the semantics of nblocks,
> nlive_blks, nsnapshot_blks arguments should be uniformed to "modify"
> semantics.
> 
> Also, the types of these three counter arguments are not uniform.
> 
> Regards,
> Ryusuke Konishi
> 
>>  {
>>  	struct buffer_head *bh;
>>  	struct nilfs_segment_usage *su;
>> @@ -548,6 +551,18 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>>  	if (modtime)
>>  		su->su_lastmod = cpu_to_le64(modtime);
>>  	su->su_nblocks = cpu_to_le32(nblocks);
>> +
>> +	if (nilfs_sufile_live_blks_ext_supported(sufile)) {
>> +		nsnapshot_blks += le32_to_cpu(su->su_nsnapshot_blks);
>> +		nsnapshot_blks = min_t(__s64, max_t(__s64, nsnapshot_blks, 0),
>> +				       nblocks);
>> +		su->su_nsnapshot_blks = cpu_to_le32(nsnapshot_blks);
>> +
>> +		nlive_blks += le32_to_cpu(su->su_nlive_blks);
>> +		nlive_blks = min_t(__s64, max_t(__s64, nlive_blks, 0), nblocks);
>> +		su->su_nlive_blks = cpu_to_le32(nlive_blks);
>> +	}
>> +
>>  	kunmap_atomic(kaddr);
>>  
>>  	mark_buffer_dirty(bh);
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index 662ab56..3466abb 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -60,7 +60,8 @@ int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
>>  int nilfs_sufile_alloc(struct inode *, __u64 *);
>>  int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
>>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>> -				   unsigned long nblocks, time_t modtime);
>> +				   unsigned long nblocks, __s64 nlive_blks,
>> +				   __s64 nsnapshot_blks, time_t modtime);
>>  int nilfs_sufile_get_stat(struct inode *, struct nilfs_sustat *);
>>  ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
>>  				size_t);
>> -- 
>> 2.3.7
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH v2 1/9] nilfs2: copy file system feature flags to the nilfs object
       [not found]         ` <20150509.105445.1816655707671265145.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-09 18:41           ` Andreas Rohner
  0 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-09 18:41 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-09 03:54, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:14 +0200, Andreas Rohner wrote:
>> This patch adds three new attributes to the nilfs object, which contain
>> a copy of the feature flags from the super block. These can be used to
>> efficiently test whether file system feature flags are set or not.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/the_nilfs.c | 4 ++++
>>  fs/nilfs2/the_nilfs.h | 8 ++++++++
>>  2 files changed, 12 insertions(+)
>>
>> diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
>> index 69bd801..606fdfc 100644
>> --- a/fs/nilfs2/the_nilfs.c
>> +++ b/fs/nilfs2/the_nilfs.c
>> @@ -630,6 +630,10 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data)
>>  	get_random_bytes(&nilfs->ns_next_generation,
>>  			 sizeof(nilfs->ns_next_generation));
>>  
>> +	nilfs->ns_feature_compat = le64_to_cpu(sbp->s_feature_compat);
>> +	nilfs->ns_feature_compat_ro = le64_to_cpu(sbp->s_feature_compat_ro);
>> +	nilfs->ns_feature_incompat = le64_to_cpu(sbp->s_feature_incompat);
> 
> Consider moving these initializations to just before the call to
> nilfs_check_feature_compatibility().

Yes no problem.

Regards,
Andreas Rohner

> It uses compat flags, and I'd like to unfold the function using these
> internal variables sometime.
> 
>> +
>>  	err = nilfs_store_disk_layout(nilfs, sbp);
>>  	if (err)
>>  		goto failed_sbh;
>> diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
>> index 23778d3..12cd91d 100644
>> --- a/fs/nilfs2/the_nilfs.h
>> +++ b/fs/nilfs2/the_nilfs.h
>> @@ -101,6 +101,9 @@ enum {
>>   * @ns_dev_kobj: /sys/fs/<nilfs>/<device>
>>   * @ns_dev_kobj_unregister: completion state
>>   * @ns_dev_subgroups: <device> subgroups pointer
>> + * @ns_feature_compat: Compatible feature set
>> + * @ns_feature_compat_ro: Read-only compatible feature set
>> + * @ns_feature_incompat: Incompatible feature set
>>   */
>>  struct the_nilfs {
>>  	unsigned long		ns_flags;
>> @@ -201,6 +204,11 @@ struct the_nilfs {
>>  	struct kobject ns_dev_kobj;
>>  	struct completion ns_dev_kobj_unregister;
>>  	struct nilfs_sysfs_dev_subgroups *ns_dev_subgroups;
>> +
>> +	/* Features */
>> +	__u64                   ns_feature_compat;
>> +	__u64                   ns_feature_compat_ro;
>> +	__u64                   ns_feature_incompat;
>>  };
>>  
>>  #define THE_NILFS_FNS(bit, name)					\
>> -- 
>> 2.3.7
>>


* Re: [PATCH v2 2/9] nilfs2: extend SUFILE on-disk format to enable tracking of live blocks
       [not found]         ` <20150509.112403.380867861504859109.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-09 18:47           ` Andreas Rohner
  0 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-09 18:47 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-09 04:24, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:15 +0200, Andreas Rohner wrote:
>> This patch extends the nilfs_segment_usage structure with two extra
>> fields. This changes the on-disk format of the SUFILE, but the NILFS2
>> metadata files are flexible enough, so that there are no compatibility
>> issues. The extension is fully backwards compatible. Nevertheless a
>> feature compatibility flag was added to indicate the on-disk format
>> change.
>>
>> The new field su_nlive_blks is used to track the number of live blocks
>> in the corresponding segment. Its value should always be smaller than
>> su_nblocks, which contains the total number of blocks in the segment.
>>
>> The field su_nlive_lastmod is necessary because of the protection period
>> used by the GC. It is a timestamp, which contains the last time
>> su_nlive_blks was modified. For example if a file is deleted, its
>> blocks are subtracted from su_nlive_blks and are therefore considered to
>> be reclaimable by the kernel. But the GC additionally protects them with
>> the protection period. So while su_nlive_blks contains the number of
>> potentially reclaimable blocks, the actual number depends on the
>> protection period. To enable GC policies to effectively choose or prefer
>> segments with unprotected blocks, the timestamp in su_nlive_lastmod is
>> necessary.
>>
>> The new field su_nsnapshot_blks contains the number of blocks in a
>> segment that are protected by a snapshot. The value is meant to be a
>> heuristic for the GC and is not necessarily always accurate.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/ioctl.c         |  4 +--
>>  fs/nilfs2/sufile.c        | 45 +++++++++++++++++++++++++++++++--
>>  fs/nilfs2/sufile.h        |  6 +++++
>>  include/linux/nilfs2_fs.h | 63 +++++++++++++++++++++++++++++++++++++++++------
>>  4 files changed, 106 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
>> index 9a20e51..f6ee54e 100644
>> --- a/fs/nilfs2/ioctl.c
>> +++ b/fs/nilfs2/ioctl.c
>> @@ -1250,7 +1250,7 @@ static int nilfs_ioctl_set_suinfo(struct inode *inode, struct file *filp,
>>  		goto out;
>>  
>>  	ret = -EINVAL;
>> -	if (argv.v_size < sizeof(struct nilfs_suinfo_update))
>> +	if (argv.v_size < NILFS_MIN_SUINFO_UPDATE_SIZE)
>>  		goto out;
>>  
>>  	if (argv.v_nmembs > nilfs->ns_nsegments)
>> @@ -1316,7 +1316,7 @@ long nilfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>>  		return nilfs_ioctl_get_cpstat(inode, filp, cmd, argp);
>>  	case NILFS_IOCTL_GET_SUINFO:
>>  		return nilfs_ioctl_get_info(inode, filp, cmd, argp,
>> -					    sizeof(struct nilfs_suinfo),
>> +					    NILFS_MIN_SEGMENT_USAGE_SIZE,
>>  					    nilfs_ioctl_do_get_suinfo);
>>  	case NILFS_IOCTL_SET_SUINFO:
>>  		return nilfs_ioctl_set_suinfo(inode, filp, cmd, argp);
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 2a869c3..1cce358 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -453,6 +453,11 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 segnum,
>>  	su->su_lastmod = cpu_to_le64(0);
>>  	su->su_nblocks = cpu_to_le32(0);
>>  	su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
>> +	if (nilfs_sufile_live_blks_ext_supported(sufile)) {
>> +		su->su_nlive_blks = cpu_to_le32(0);
>> +		su->su_nsnapshot_blks = cpu_to_le32(0);
>> +		su->su_nlive_lastmod = cpu_to_le64(0);
>> +	}
>>  	kunmap_atomic(kaddr);
>>  
>>  	nilfs_sufile_mod_counter(header_bh, clean ? (u64)-1 : 0, dirty ? 0 : 1);
>> @@ -482,7 +487,7 @@ void nilfs_sufile_do_free(struct inode *sufile, __u64 segnum,
>>  	WARN_ON(!nilfs_segment_usage_dirty(su));
>>  
>>  	sudirty = nilfs_segment_usage_dirty(su);
>> -	nilfs_segment_usage_set_clean(su);
>> +	nilfs_segment_usage_set_clean(su, NILFS_MDT(sufile)->mi_entry_size);
>>  	kunmap_atomic(kaddr);
>>  	mark_buffer_dirty(su_bh);
>>  
>> @@ -698,7 +703,7 @@ static int nilfs_sufile_truncate_range(struct inode *sufile,
>>  		nc = 0;
>>  		for (su = su2, j = 0; j < n; j++, su = (void *)su + susz) {
>>  			if (nilfs_segment_usage_error(su)) {
>> -				nilfs_segment_usage_set_clean(su);
>> +				nilfs_segment_usage_set_clean(su, susz);
>>  				nc++;
>>  			}
>>  		}
>> @@ -821,6 +826,8 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
>>  	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
>>  	void *kaddr;
>>  	unsigned long nsegs, segusages_per_block;
>> +	__u64 lm = 0;
>> +	__u32 nlb = 0, nsb = 0;
>>  	ssize_t n;
>>  	int ret, i, j;
>>  
>> @@ -858,6 +865,18 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
>>  			if (nilfs_segment_is_active(nilfs, segnum + j))
>>  				si->sui_flags |=
>>  					(1UL << NILFS_SEGMENT_USAGE_ACTIVE);
>> +
>> +			if (susz >= NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE) {
>> +				nlb = le32_to_cpu(su->su_nlive_blks);
>> +				nsb = le32_to_cpu(su->su_nsnapshot_blks);
>> +				lm = le64_to_cpu(su->su_nlive_lastmod);
>> +			}
>> +
>> +			if (sisz >= NILFS_LIVE_BLKS_EXT_SUINFO_SIZE) {
>> +				si->sui_nlive_blks = nlb;
>> +				si->sui_nsnapshot_blks = nsb;
>> +				si->sui_nlive_lastmod = lm;
>> +			}
>>  		}
>>  		kunmap_atomic(kaddr);
>>  		brelse(su_bh);
>> @@ -901,6 +920,9 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>>  	int cleansi, cleansu, dirtysi, dirtysu;
>>  	long ncleaned = 0, ndirtied = 0;
>>  	int ret = 0;
>> +	bool sup_ext = (supsz >= NILFS_LIVE_BLKS_EXT_SUINFO_UPDATE_SIZE);
>> +	bool su_ext = nilfs_sufile_live_blks_ext_supported(sufile);
>> +	bool supsu_ext = sup_ext && su_ext;
> 
> These boolean variables determine the control flow.  For these, more
> intuitive names are preferable.  For instance:
> 
>   - sup_ext -> suinfo_extended
>   - su_ext -> su_extended
>   - supsu_ext -> both_extended

I agree.

>>  
>>  	if (unlikely(nsup == 0))
>>  		return ret;
>> @@ -911,6 +933,13 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>>  				(~0UL << __NR_NILFS_SUINFO_UPDATE_FIELDS))
>>  			|| (nilfs_suinfo_update_nblocks(sup) &&
>>  				sup->sup_sui.sui_nblocks >
>> +				nilfs->ns_blocks_per_segment)
>> +			|| (nilfs_suinfo_update_nlive_blks(sup) && sup_ext &&
>> +				sup->sup_sui.sui_nlive_blks >
>> +				nilfs->ns_blocks_per_segment)
>> +			|| (nilfs_suinfo_update_nsnapshot_blks(sup) &&
>> +				sup_ext &&
>> +				sup->sup_sui.sui_nsnapshot_blks >
>>  				nilfs->ns_blocks_per_segment))
>>  			return -EINVAL;
>>  	}
> 
> Testing sup_ext repeatedly is pointless since it adds branches.
> Consider moving the test forward as follows:
> 
>         for (sup = buf; sup < supend; sup = (void *)sup + supsz) {
>                 if (sup->sup_segnum >= nilfs->ns_nsegments
>                    || (sup->sup_flags &
>                        (~0UL << __NR_NILFS_SUINFO_UPDATE_FIELDS))
>                    || (nilfs_suinfo_update_nblocks(sup) &&
>                        sup->sup_sui.sui_nblocks >
>                        nilfs->ns_blocks_per_segment))
>                         return -EINVAL;
>                 if (!sup_extended)
>                         continue;
>                 if (nilfs_suinfo_update_nlive_blks(sup) &&
>                     (sup->sup_sui.sui_nlive_blks >
>                      nilfs->ns_blocks_per_segment)
>                     || (nilfs_suinfo_update_nsnapshot_blks(sup) &&
>                         sup->sup_sui.sui_nsnapshot_blks >
>                         nilfs->ns_blocks_per_segment))
>                         return -EINVAL;
>         }

Good suggestion.

>> @@ -938,6 +967,18 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>>  		if (nilfs_suinfo_update_nblocks(sup))
>>  			su->su_nblocks = cpu_to_le32(sup->sup_sui.sui_nblocks);
>>  
>> +		if (nilfs_suinfo_update_nlive_blks(sup) && supsu_ext)
>> +			su->su_nlive_blks =
>> +				cpu_to_le32(sup->sup_sui.sui_nlive_blks);
>> +
>> +		if (nilfs_suinfo_update_nsnapshot_blks(sup) && supsu_ext)
>> +			su->su_nsnapshot_blks =
>> +				cpu_to_le32(sup->sup_sui.sui_nsnapshot_blks);
>> +
>> +		if (nilfs_suinfo_update_nlive_lastmod(sup) && supsu_ext)
>> +			su->su_nlive_lastmod =
>> +				cpu_to_le64(sup->sup_sui.sui_nlive_lastmod);
>> +
> 
> Ditto.
> 
> Consider defining pointer to suinfo structure
> 
>         for (;;) {
>                 struct nilfs_suinfo *sui = &sup->sup_sui;
> 
> and simplifying the above part as follows:
> 
>                 if (both_extended) {
>                         if (nilfs_suinfo_update_nlive_blks(sup))
>                                 su->su_nlive_blks =
>                                         cpu_to_le32(sui->sui_nlive_blks);
>                         if (nilfs_suinfo_update_nsnapshot_blks(sup))
>                                 su->su_nsnapshot_blks =
>                                         cpu_to_le32(sui->sui_nsnapshot_blks);
>                         if (nilfs_suinfo_update_nlive_lastmod(sup))
>                                 su->su_nlive_lastmod =
>                                         cpu_to_le64(sui->sui_nlive_lastmod);
>                 }
> 

I agree this is much nicer.

>>  		if (nilfs_suinfo_update_flags(sup)) {
>>  			/*
>>  			 * Active flag is a virtual flag projected by running
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index b8afd72..da78edf 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -28,6 +28,12 @@
>>  #include <linux/nilfs2_fs.h>
>>  #include "mdt.h"
>>  
>> +static inline int
>> +nilfs_sufile_live_blks_ext_supported(const struct inode *sufile)
>> +{
>> +	return NILFS_MDT(sufile)->mi_entry_size >=
>> +			NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE;
>> +}
>>  
>>  static inline unsigned long nilfs_sufile_get_nsegments(struct inode *sufile)
>>  {
>> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
>> index ff3fea3..4800daa 100644
>> --- a/include/linux/nilfs2_fs.h
>> +++ b/include/linux/nilfs2_fs.h
>> @@ -220,9 +220,12 @@ struct nilfs_super_block {
>>   * If there is a bit set in the incompatible feature set that the kernel
>>   * doesn't know about, it should refuse to mount the filesystem.
>>   */
>> -#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT	0x00000001ULL
> 
>> +#define NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT	BIT(0)
> 
> You should not use the BIT() macro or its variants for now, because they
> are only available in kernel space (the __KERNEL__ macro is required).
> 
> "nilfs2_fs.h" must be usable from both kernel space and user space.
> Consider defining the flag as "(1ULL << 0)" instead.

Ok.

>>  
>> -#define NILFS_FEATURE_COMPAT_SUPP	0ULL
> 
>> +#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		BIT(0)
> 
> Ditto.
> 
>> +
>> +#define NILFS_FEATURE_COMPAT_SUPP					\
>> +			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT)
>>  #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>>  #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
>>  
>> @@ -609,19 +612,34 @@ struct nilfs_cpfile_header {
>>  	  sizeof(struct nilfs_checkpoint) - 1) /			\
>>  			sizeof(struct nilfs_checkpoint))
>>  
>> +#ifndef offsetofend
>> +#define offsetofend(TYPE, MEMBER) \
>> +		(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
>> +#endif
>> +
>>  /**
>>   * struct nilfs_segment_usage - segment usage
>>   * @su_lastmod: last modified timestamp
>>   * @su_nblocks: number of blocks in segment
>>   * @su_flags: flags
>> + * @su_nlive_blks: number of live blocks in the segment
>> + * @su_nsnapshot_blks: number of blocks belonging to a snapshot in the segment
>> + * @su_nlive_lastmod: timestamp nlive_blks was last modified
>>   */
>>  struct nilfs_segment_usage {
>>  	__le64 su_lastmod;
>>  	__le32 su_nblocks;
>>  	__le32 su_flags;
>> +	__le32 su_nlive_blks;
>> +	__le32 su_nsnapshot_blks;
>> +	__le64 su_nlive_lastmod;
>>  };
>>  
>> -#define NILFS_MIN_SEGMENT_USAGE_SIZE	16
>> +#define NILFS_MIN_SEGMENT_USAGE_SIZE	\
>> +	offsetofend(struct nilfs_segment_usage, su_flags)
>> +
>> +#define NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE	\
>> +	offsetofend(struct nilfs_segment_usage, su_nlive_lastmod)
>>  
>>  /* segment usage flag */
>>  enum {
>> @@ -658,11 +676,16 @@ NILFS_SEGMENT_USAGE_FNS(DIRTY, dirty)
>>  NILFS_SEGMENT_USAGE_FNS(ERROR, error)
>>  
> 
>>  static inline void
>> -nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su)
>> +nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
>>  {
>>  	su->su_lastmod = cpu_to_le64(0);
>>  	su->su_nblocks = cpu_to_le32(0);
>>  	su->su_flags = cpu_to_le32(0);
>> +	if (susz >= NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE) {
>> +		su->su_nlive_blks = cpu_to_le32(0);
>> +		su->su_nsnapshot_blks = cpu_to_le32(0);
>> +		su->su_nlive_lastmod = cpu_to_le64(0);
>> +	}
>>  }
> 
> nilfs_sufile_do_scrap() function does almost the same thing.
> Consider defining common inline function and using it for
> nilfs_segment_usage_set_clean() and nilfs_sufile_do_scrap():
> 
> static inline void
> nilfs_segment_usage_format(struct nilfs_segment_usage *su, size_t susz,
> 			   __u32 flags)
> {
> 	su->su_lastmod = cpu_to_le64(0);
> 	su->su_nblocks = cpu_to_le32(0);
> 	su->su_flags = cpu_to_le32(flags);
> 	if (susz >= NILFS_LIVE_BLKS_EXT_SEGMENT_USAGE_SIZE) {
> 		su->su_nlive_blks = cpu_to_le32(0);
> 		su->su_nsnapshot_blks = cpu_to_le32(0);
> 		su->su_nlive_lastmod = cpu_to_le64(0);
> 	}
> }
> 
> static inline void
> nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
> {
> 	nilfs_segment_usage_format(su, susz, 0);
> }

Good idea. I will start with the modifications right away. I will of
course hold off on version 3 of the patch set until you have finished your review.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
>>  
>>  static inline int
>> @@ -684,23 +707,33 @@ struct nilfs_sufile_header {
>>  	/* ... */
>>  };
>>  
>> -#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET	\
>> -	((sizeof(struct nilfs_sufile_header) +				\
>> -	  sizeof(struct nilfs_segment_usage) - 1) /			\
>> -			 sizeof(struct nilfs_segment_usage))
>> +#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET(susz)	\
>> +	((sizeof(struct nilfs_sufile_header) + (susz) - 1) / (susz))
>>  
>>  /**
>>   * nilfs_suinfo - segment usage information
>>   * @sui_lastmod: timestamp of last modification
>>   * @sui_nblocks: number of written blocks in segment
>>   * @sui_flags: segment usage flags
>> + * @sui_nlive_blks: number of live blocks in the segment
>> + * @sui_nsnapshot_blks: number of blocks belonging to a snapshot in the segment
>> + * @sui_nlive_lastmod: timestamp nlive_blks was last modified
>>   */
>>  struct nilfs_suinfo {
>>  	__u64 sui_lastmod;
>>  	__u32 sui_nblocks;
>>  	__u32 sui_flags;
>> +	__u32 sui_nlive_blks;
>> +	__u32 sui_nsnapshot_blks;
>> +	__u64 sui_nlive_lastmod;
>>  };
>>  
>> +#define NILFS_MIN_SUINFO_SIZE	\
>> +	offsetofend(struct nilfs_suinfo, sui_flags)
>> +
>> +#define NILFS_LIVE_BLKS_EXT_SUINFO_SIZE	\
>> +	offsetofend(struct nilfs_suinfo, sui_nlive_lastmod)
>> +
>>  #define NILFS_SUINFO_FNS(flag, name)					\
>>  static inline int							\
>>  nilfs_suinfo_##name(const struct nilfs_suinfo *si)			\
>> @@ -736,6 +769,9 @@ enum {
>>  	NILFS_SUINFO_UPDATE_LASTMOD,
>>  	NILFS_SUINFO_UPDATE_NBLOCKS,
>>  	NILFS_SUINFO_UPDATE_FLAGS,
>> +	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
>> +	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
>> +	NILFS_SUINFO_UPDATE_NSNAPSHOT_BLKS,
>>  	__NR_NILFS_SUINFO_UPDATE_FIELDS,
>>  };
>>  
>> @@ -759,6 +795,17 @@ nilfs_suinfo_update_##name(const struct nilfs_suinfo_update *sup)	\
>>  NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
>>  NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
>>  NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
>> +NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
>> +NILFS_SUINFO_UPDATE_FNS(NSNAPSHOT_BLKS, nsnapshot_blks)
>> +NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
>> +
>> +#define NILFS_MIN_SUINFO_UPDATE_SIZE	\
>> +	(offsetofend(struct nilfs_suinfo_update, sup_reserved) + \
>> +	NILFS_MIN_SUINFO_SIZE)
>> +
>> +#define NILFS_LIVE_BLKS_EXT_SUINFO_UPDATE_SIZE	\
>> +	(offsetofend(struct nilfs_suinfo_update, sup_reserved) + \
>> +	NILFS_LIVE_BLKS_EXT_SUINFO_SIZE)
>>  
>>  enum {
>>  	NILFS_CHECKPOINT,
>> -- 
>> 2.3.7
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 



* Re: [PATCH v2 3/9] nilfs2: introduce new feature flag for tracking live blocks
       [not found]         ` <20150509.112814.2026089040966346261.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-09 18:53           ` Andreas Rohner
  0 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-09 18:53 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-09 04:28, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:16 +0200, Andreas Rohner wrote:
>> This patch introduces a new file system feature flag
>> NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS. If it is enabled, the file system
>> will keep track of the number of live blocks per segment. This
>> information can be used by the GC to select segments for cleaning more
>> efficiently.
> 
> Please describe the reason why you separated
> NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS and
> NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT that you mentioned before
> in the commit log.

Yes sure.

Regards,
Andreas Rohner

>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/the_nilfs.h     | 8 ++++++++
>>  include/linux/nilfs2_fs.h | 4 +++-
>>  2 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
>> index 12cd91d..d755b6b 100644
>> --- a/fs/nilfs2/the_nilfs.h
>> +++ b/fs/nilfs2/the_nilfs.h
>> @@ -401,4 +401,12 @@ static inline int nilfs_flush_device(struct the_nilfs *nilfs)
>>  	return err;
>>  }
>>  
>> +static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
>> +{
>> +	const __u64 required_bits = NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
>> +				    NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT;
>> +
>> +	return ((nilfs->ns_feature_compat & required_bits) == required_bits);
>> +}
>> +
>>  #endif /* _THE_NILFS_H */
>> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
>> index 4800daa..5f05bbf 100644
>> --- a/include/linux/nilfs2_fs.h
>> +++ b/include/linux/nilfs2_fs.h
>> @@ -221,11 +221,13 @@ struct nilfs_super_block {
>>   * doesn't know about, it should refuse to mount the filesystem.
>>   */
>>  #define NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT	BIT(0)
>> +#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		BIT(1)
> 
> Ditto.  Avoid using BIT macro in nilfs2_fs.h for now.
> 
> Regards,
> Ryusuke Konishi
> 
>>  #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		BIT(0)
>>  
>>  #define NILFS_FEATURE_COMPAT_SUPP					\
>> -			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT)
>> +			(NILFS_FEATURE_COMPAT_SUFILE_LIVE_BLKS_EXT |	\
>> +			 NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
>>  #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>>  #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
>>  
>> -- 
>> 2.3.7
>>
> 



* Re: [PATCH v2 4/9] nilfs2: add kmem_cache for SUFILE cache nodes
       [not found]         ` <20150509.114149.1643183669812667339.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-09 19:10           ` Andreas Rohner
       [not found]             ` <554E5B9D.7070807-hi6Y0CQ0nG0@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-09 19:10 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-09 04:41, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:17 +0200, Andreas Rohner wrote:
>> This patch adds a kmem_cache to efficiently allocate SUFILE cache nodes.
>> One cache node contains a certain number of unsigned 32 bit values and
>> either a list_head, to string a number of nodes together into a linked
>> list, or an rcu_head to be able to use the node with an rcu
>> callback.
>>
>> These cache nodes can be used to cache small changes to the SUFILE and
>> apply them later at segment construction.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/sufile.h | 14 ++++++++++++++
>>  fs/nilfs2/super.c  | 14 ++++++++++++++
>>  2 files changed, 28 insertions(+)
>>
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index da78edf..520614f 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -28,6 +28,20 @@
>>  #include <linux/nilfs2_fs.h>
>>  #include "mdt.h"
>>  
>> +#define NILFS_SUFILE_CACHE_NODE_SHIFT	6
>> +#define NILFS_SUFILE_CACHE_NODE_COUNT	(1 << NILFS_SUFILE_CACHE_NODE_SHIFT)
>> +
>> +struct nilfs_sufile_cache_node {
>> +	__u32 values[NILFS_SUFILE_CACHE_NODE_COUNT];
>> +	union {
>> +		struct rcu_head rcu_head;
>> +		struct list_head list_head;
>> +	};
>> +	unsigned long index;
>> +};
>> +
>> +extern struct kmem_cache *nilfs_sufile_node_cachep;
>> +
>>  static inline int
>>  nilfs_sufile_live_blks_ext_supported(const struct inode *sufile)
>>  {
>> diff --git a/fs/nilfs2/super.c b/fs/nilfs2/super.c
>> index f47585b..97a30db 100644
>> --- a/fs/nilfs2/super.c
>> +++ b/fs/nilfs2/super.c
>> @@ -71,6 +71,7 @@ static struct kmem_cache *nilfs_inode_cachep;
>>  struct kmem_cache *nilfs_transaction_cachep;
>>  struct kmem_cache *nilfs_segbuf_cachep;
>>  struct kmem_cache *nilfs_btree_path_cache;
>> +struct kmem_cache *nilfs_sufile_node_cachep;
>>  
>>  static int nilfs_setup_super(struct super_block *sb, int is_mount);
>>  static int nilfs_remount(struct super_block *sb, int *flags, char *data);
>> @@ -1397,6 +1398,11 @@ static void nilfs_segbuf_init_once(void *obj)
>>  	memset(obj, 0, sizeof(struct nilfs_segment_buffer));
>>  }
>>  
>> +static void nilfs_sufile_cache_node_init_once(void *obj)
>> +{
>> +	memset(obj, 0, sizeof(struct nilfs_sufile_cache_node));
>> +}
>> +
> 
> Note that nilfs_sufile_cache_node_init_once() is only called when each
> cache entry is allocated first time.  It doesn't ensure each cache
> entry is clean when it will be allocated with kmem_cache_alloc()
> the second time and afterwards.

I kind of assumed it would be called for every object returned by
kmem_cache_alloc(). In that case I will have to do the initialization in
nilfs_sufile_alloc_cache_node() and remove this function.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
>>  static void nilfs_destroy_cachep(void)
>>  {
>>  	/*
>> @@ -1413,6 +1419,8 @@ static void nilfs_destroy_cachep(void)
>>  		kmem_cache_destroy(nilfs_segbuf_cachep);
>>  	if (nilfs_btree_path_cache)
>>  		kmem_cache_destroy(nilfs_btree_path_cache);
>> +	if (nilfs_sufile_node_cachep)
>> +		kmem_cache_destroy(nilfs_sufile_node_cachep);
>>  }
>>  
>>  static int __init nilfs_init_cachep(void)
>> @@ -1441,6 +1449,12 @@ static int __init nilfs_init_cachep(void)
>>  	if (!nilfs_btree_path_cache)
>>  		goto fail;
>>  
>> +	nilfs_sufile_node_cachep = kmem_cache_create("nilfs_sufile_node_cache",
>> +			sizeof(struct nilfs_sufile_cache_node), 0, 0,
>> +			nilfs_sufile_cache_node_init_once);
>> +	if (!nilfs_sufile_node_cachep)
>> +		goto fail;
>> +
>>  	return 0;
>>  
>>  fail:
>> -- 
>> 2.3.7
>>
> 



* Re: [PATCH v2 5/9] nilfs2: add SUFILE cache for changes to su_nlive_blks field
       [not found]         ` <20150509.130900.223492430584220355.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-09 19:39           ` Andreas Rohner
       [not found]             ` <554E626A.2030503-hi6Y0CQ0nG0@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-09 19:39 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-09 06:09, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:18 +0200, Andreas Rohner wrote:
>> This patch adds a cache for the SUFILE to efficiently store lots of
>> small changes to su_nlive_blks in memory and apply the accumulated
>> results later at segment construction. This improves performance of
>> these operations and reduces lock contention in the SUFILE.
>>
>> The implementation uses a radix_tree to store cache nodes, which
>> contain a certain number of values. Every value corresponds to
>> exactly one SUFILE entry. If the cache is flushed the values are
>> subtracted from the su_nlive_blks field of the corresponding SUFILE
>> entry.
>>
>> If the parameter only_mark of the function nilfs_sufile_flush_cache() is
>> set, then the blocks that would have been dirtied by the flush are
>> marked as dirty, but nothing is actually written to them. This mode is
>> useful during segment construction, when blocks need to be marked dirty
>> in advance.
>>
>> New nodes are allocated on demand. The lookup of nodes is protected by
>> rcu_read_lock() and the modification of values is protected by a block
>> group lock. This should allow for concurrent updates to the cache.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/sufile.c | 369 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nilfs2/sufile.h |   5 +
>>  2 files changed, 374 insertions(+)
>>
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 1cce358..80bbd87 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -26,6 +26,7 @@
>>  #include <linux/string.h>
>>  #include <linux/buffer_head.h>
>>  #include <linux/errno.h>
>> +#include <linux/radix-tree.h>
>>  #include <linux/nilfs2_fs.h>
>>  #include "mdt.h"
>>  #include "sufile.h"
>> @@ -42,6 +43,11 @@ struct nilfs_sufile_info {
>>  	unsigned long ncleansegs;/* number of clean segments */
>>  	__u64 allocmin;		/* lower limit of allocatable segment range */
>>  	__u64 allocmax;		/* upper limit of allocatable segment range */
>> +
> 
>> +	struct blockgroup_lock nlive_blks_cache_bgl;
>> +	spinlock_t nlive_blks_cache_lock;
>> +	int nlive_blks_cache_dirty;
>> +	struct radix_tree_root nlive_blks_cache;
> 
> blockgroup_lock is not needed.  For the counter operations in this
> patch, using cmpxchg() or atomic_xxx() is more effective as I mention
> later.
> 
> And, I prefer to address this cache as updates of segment usage
> instead of that of nlive_blks.  In that sense, it's preferable
> to define the array element like:
> 
> struct nilfs_segusage_update {
> 	__u32 nlive_blks_adj;
> };

Great idea!

> and define the variable names like update_cache (instead of
> nlive_blks_cache), update_cache_lock, update_cache_dirty, etc.

I really like this suggestion. I was struggling to come up with good
names for all the cache-related functions.

> 
>>  };
>>  
>>  static inline struct nilfs_sufile_info *NILFS_SUI(struct inode *sufile)
>> @@ -1194,6 +1200,362 @@ out_sem:
>>  }
>>  
>>  /**
>> + * nilfs_sufile_alloc_cache_node - allocate and insert a new cache node
>> + * @sufile: inode of segment usage file
>> + * @group: group to allocate a node for
>> + *
>> + * Description: Allocates a new cache node and inserts it into the cache. If
>> + * there is an error, nothing will be allocated. If there already exists
>> + * a node for @group, no new node will be allocated.
>> + *
>> + * Return Value: On success, 0 is returned, on error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +static int nilfs_sufile_alloc_cache_node(struct inode *sufile,
>> +					 unsigned long group)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct nilfs_sufile_cache_node *node;
>> +	int ret;
>> +
>> +	node = kmem_cache_alloc(nilfs_sufile_node_cachep, GFP_NOFS);
>> +	if (!node)
>> +		return -ENOMEM;
>> +
>> +	ret = radix_tree_preload(GFP_NOFS);
>> +	if (ret)
>> +		goto free_node;
>> +
>> +	spin_lock(&sui->nlive_blks_cache_lock);
>> +	ret = radix_tree_insert(&sui->nlive_blks_cache, group, node);
>> +	spin_unlock(&sui->nlive_blks_cache_lock);
>> +
>> +	radix_tree_preload_end();
>> +
> 
>> +	if (ret == -EEXIST) {
>> +		ret = 0;
>> +		goto free_node;
>> +	} else if (ret)
>> +		goto free_node;
>> +
>> +	return 0;
>> +free_node:
>> +	kmem_cache_free(nilfs_sufile_node_cachep, node);
>> +	return ret;
> 
> The above error check implies two branches in regular path.
> Consider rewriting it as follows:
> 
> 	if (!ret)
> 		return 0;
> 
> 	if (ret == -EEXIST)
> 		ret = 0;
> free_node:
> 	kmem_cache_free(nilfs_sufile_node_cachep, node);
> 	return ret;

Ok.

> By the way, you should use braces in both branches if the one of them
> has multiple statements in an "if else" conditional statement.  This
> exception is written in the Chapter 3 of Documentation/CodingStyle.
> 
>     e.g.
> 
>         if (condition) {
>                 do_this();
>                 do_that();
>         } else {
>                 otherwise();
>         }

Ok.

>> +}
>> +
>> +/**
>> + * nilfs_sufile_dec_nlive_blks - decrements nlive_blks in the cache
>> + * @sufile: inode of segment usage file
>> + * @segnum: segnum for which nlive_blks will be decremented
>> + *
>> + * Description: Decrements the number of live blocks for @segnum in the cache.
>> + * This function only affects the cache. If the cache is not flushed at a
>> + * later time the changes are lost. It tries to lookup the group node to
>> + * which the @segnum belongs in a lock free manner and uses a blockgroup lock
>> + * to do the actual modification on the node.
>> + *
>> + * Return Value: On success, 0 is returned on error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct nilfs_sufile_cache_node *node;
>> +	spinlock_t *lock;
>> +	unsigned long group;
>> +	int ret;
>> +
>> +	group = (unsigned long)(segnum >> NILFS_SUFILE_CACHE_NODE_SHIFT);
>> +
>> +try_again:
>> +	rcu_read_lock();
>> +	node = radix_tree_lookup(&sui->nlive_blks_cache, group);
>> +	if (!node) {
>> +		rcu_read_unlock();
>> +
>> +		ret = nilfs_sufile_alloc_cache_node(sufile, group);
>> +		if (ret)
>> +			return ret;
>> +
>> +		/*
>> +		 * It is important to acquire the rcu_read_lock() before using
>> +		 * the node pointer
>> +		 */
>> +		goto try_again;
>> +	}
>> +
> 
>> +	lock = bgl_lock_ptr(&sui->nlive_blks_cache_bgl, (unsigned int)group);
>> +	spin_lock(lock);
>> +	node->values[segnum & ((1 << NILFS_SUFILE_CACHE_NODE_SHIFT) - 1)] += 1;
>> +	sui->nlive_blks_cache_dirty = 1;
>> +	spin_unlock(lock);
>> +	rcu_read_unlock();
>> +
>> +	return 0;
>> +}
> 
> Consider using cmpxchg() or atomic_inc(), and using
> NILFS_SUFILE_CACHE_NODE_MASK to mask segnum.  The following is an
> example in the case of using cmpxchg():
> 
> 	__u32 old, new, *valuep;
> 	...
> 	valuep = &node->values[segnum & (NILFS_SUFILE_CACHE_NODE_COUNT - 1)];
> 	do {
> 		old = ACCESS_ONCE(*valuep);
> 		new = old + 1;
> 	} while (cmpxchg(valuep, old, new) != old);
> 
> 	sui->nlive_blks_cache_dirty = 1;
> 
> 	rcu_read_unlock();
> 	return 0;
> }
> 
> The current atomic_xxxx() macros are actually defined in the same way
> to the reduce overheads in smp environment.
> 
> Using atomic_xxxx() is more preferable but formally it requires
> initialization with "atomic_set(&counter, 0)" or "ATOMIC_INIT(0)" for
> every element.  I don't know whether initialization with memset()
> function is allowed or not for atomic_t type variables.

Ok, but then I would also have to use cmpxchg() in
nilfs_sufile_flush_cache_node() if the value needs to be reset to 0
atomically.

Currently the cache is only flushed during segment construction, so
nothing should call nilfs_sufile_dec_nlive_blks() concurrently with the
flush. But I thought it would be best to design a thread-safe flush function.

>> +
>> +/**
>> + * nilfs_sufile_flush_cache_node - flushes one cache node to the SUFILE
>> + * @sufile: inode of segment usage file
>> + * @node: cache node to flush
>> + * @only_mark: do not write anything, but mark the blocks as dirty
>> + * @pndirty_blks: pointer to return number of dirtied blocks
>> + *
>> + * Description: Flushes one cache node to the SUFILE and also clears the cache
>> + * node at the same time. If @only_mark is 1, nothing is written to the
>> + * SUFILE, but the blocks are still marked as dirty. This is useful to mark
>> + * the blocks in one phase of the segment creation and write them in another.
>> + *
>> + * Return Value: On success, 0 is returned on error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient memory available.
>> + *
>> + * %-EIO - I/O error
>> + *
>> + * %-EROFS - Read only filesystem (for create mode)
>> + */
>> +static int nilfs_sufile_flush_cache_node(struct inode *sufile,
>> +					 struct nilfs_sufile_cache_node *node,
>> +					 int only_mark,
>> +					 unsigned long *pndirty_blks)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct buffer_head *su_bh;
>> +	struct nilfs_segment_usage *su;
>> +	spinlock_t *lock;
>> +	void *kaddr;
>> +	size_t n, i, j;
>> +	size_t susz = NILFS_MDT(sufile)->mi_entry_size;
>> +	__u64 segnum, seg_start, nsegs;
>> +	__u32 nlive_blocks, value;
>> +	unsigned long secs = get_seconds(), ndirty_blks = 0;
>> +	int ret, dirty;
>> +
>> +	nsegs = nilfs_sufile_get_nsegments(sufile);
>> +	seg_start = node->index << NILFS_SUFILE_CACHE_NODE_SHIFT;
>> +	lock = bgl_lock_ptr(&sui->nlive_blks_cache_bgl, node->index);
>> +
>> +	for (i = 0; i < NILFS_SUFILE_CACHE_NODE_COUNT;) {
>> +		segnum = seg_start + i;
>> +		if (segnum >= nsegs)
>> +			break;
>> +
>> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
>> +				seg_start + NILFS_SUFILE_CACHE_NODE_COUNT - 1);
>> +
>> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
>> +							   0, &su_bh);
>> +		if (ret < 0) {
>> +			if (ret != -ENOENT)
>> +				return ret;
>> +			/* hole */
>> +			i += n;
>> +			continue;
>> +		}
>> +
>> +		if (only_mark && buffer_dirty(su_bh)) {
>> +			/* buffer already dirty */
>> +			put_bh(su_bh);
>> +			i += n;
>> +			continue;
>> +		}
>> +
>> +		spin_lock(lock);
>> +		kaddr = kmap_atomic(su_bh->b_page);
>> +
>> +		dirty = 0;
>> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
>> +							  su_bh, kaddr);
>> +		for (j = 0; j < n; ++j, ++i, su = (void *)su + susz) {
>> +			value = node->values[i];
>> +			if (!value)
>> +				continue;
>> +			if (!only_mark)
>> +				node->values[i] = 0;
>> +
>> +			WARN_ON(nilfs_segment_usage_error(su));
>> +
>> +			nlive_blocks = le32_to_cpu(su->su_nlive_blks);
>> +			if (!nlive_blocks)
>> +				continue;
>> +
>> +			dirty = 1;
>> +			if (only_mark) {
>> +				i += n - j;
>> +				break;
>> +			}
>> +
> 
>> +			if (nlive_blocks <= value)
>> +				nlive_blocks = 0;
>> +			else
>> +				nlive_blocks -= value;
> 
> This can be simplified as below:
> 
> 			nlive_blocks -= min_t(__u32, nlive_blocks, value);

Ok.

>> +
>> +			su->su_nlive_blks = cpu_to_le32(nlive_blocks);
>> +			su->su_nlive_lastmod = cpu_to_le64(secs);
>> +		}
>> +
>> +		kunmap_atomic(kaddr);
>> +		spin_unlock(lock);
>> +
>> +		if (dirty && !buffer_dirty(su_bh)) {
>> +			mark_buffer_dirty(su_bh);
> 
>> +			nilfs_mdt_mark_dirty(sufile);
> 
> nilfs_mdt_mark_dirty() should be called only once if ndirty_blks is
> larger than zero.  We can move it to nilfs_sufile_flush_cache() side
> (to the position just before calling up_write()).

Good idea.

>> +			++ndirty_blks;
>> +		}
>> +
>> +		put_bh(su_bh);
>> +	}
>> +
>> +	*pndirty_blks += ndirty_blks;
>> +	return 0;
>> +}
>> +
>> +/**
>> + * nilfs_sufile_flush_cache - flushes cache to the SUFILE
>> + * @sufile: inode of segment usage file
>> + * @only_mark: do not write anything, but mark the blocks as dirty
>> + * @pndirty_blks: pointer to return number of dirtied blocks
>> + *
>> + * Description: Flushes the whole cache to the SUFILE and also clears it
>> + * at the same time. If @only_mark is 1, nothing is written to the
>> + * SUFILE, but the blocks are still marked as dirty. This is useful to mark
>> + * the blocks in one phase of the segment creation and write them in another.
>> + * If there are concurrent inserts into the cache, it cannot be guaranteed,
>> + * that everything is flushed when the function returns.
>> + *
>> + * Return Value: On success, 0 is returned on error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient memory available.
>> + *
>> + * %-EIO - I/O error
>> + *
>> + * %-EROFS - Read only filesystem (for create mode)
>> + */
>> +int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
>> +			     unsigned long *pndirty_blks)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct nilfs_sufile_cache_node *node;
>> +	LIST_HEAD(nodes);
>> +	struct radix_tree_iter iter;
>> +	void **slot;
>> +	unsigned long ndirty_blks = 0;
>> +	int ret = 0;
>> +
>> +	if (!sui->nlive_blks_cache_dirty)
>> +		goto out;
>> +
>> +	down_write(&NILFS_MDT(sufile)->mi_sem);
>> +
>> +	/* prevent concurrent inserts */
>> +	spin_lock(&sui->nlive_blks_cache_lock);
>> +	radix_tree_for_each_slot(slot, &sui->nlive_blks_cache, &iter, 0) {
>> +		node = radix_tree_deref_slot_protected(slot,
>> +				&sui->nlive_blks_cache_lock);
>> +		if (!node)
>> +			continue;
>> +		if (radix_tree_exception(node))
>> +			continue;
>> +
>> +		list_add(&node->list_head, &nodes);
>> +		node->index = iter.index;
>> +	}
>> +	if (!only_mark)
>> +		sui->nlive_blks_cache_dirty = 0;
>> +	spin_unlock(&sui->nlive_blks_cache_lock);
>> +
>> +	list_for_each_entry(node, &nodes, list_head) {
>> +		ret = nilfs_sufile_flush_cache_node(sufile, node, only_mark,
>> +						    &ndirty_blks);
>> +		if (ret)
>> +			goto out_sem;
>> +	}
>> +
>> +out_sem:
>> +	up_write(&NILFS_MDT(sufile)->mi_sem);
>> +out:
>> +	if (pndirty_blks)
>> +		*pndirty_blks = ndirty_blks;
>> +	return ret;
>> +}
>> +
>> +/**
>> + * nilfs_sufile_cache_dirty - is the sufile cache dirty
>> + * @sufile: inode of segment usage file
>> + *
>> + * Description: Returns whether the sufile cache is dirty. If this flag is
>> + * true, the cache contains unflushed content.
>> + *
>> + * Return Value: If the cache is not dirty, 0 is returned, otherwise
>> + * 1 is returned
>> + */
>> +int nilfs_sufile_cache_dirty(struct inode *sufile)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +
>> +	return sui->nlive_blks_cache_dirty;
>> +}
>> +
>> +/**
>> + * nilfs_sufile_cache_node_release_rcu - rcu callback function to free nodes
>> + * @head: rcu head
>> + *
>> + * Description: Rcu callback function to free nodes.
>> + */
>> +static void nilfs_sufile_cache_node_release_rcu(struct rcu_head *head)
>> +{
>> +	struct nilfs_sufile_cache_node *node;
>> +
>> +	node = container_of(head, struct nilfs_sufile_cache_node, rcu_head);
>> +
>> +	kmem_cache_free(nilfs_sufile_node_cachep, node);
>> +}
>> +
>> +/**
>> + * nilfs_sufile_shrink_cache - free all cache nodes
>> + * @sufile: inode of segment usage file
>> + *
>> + * Description: Frees all cache nodes in the cache regardless of their
>> + * content. The content will not be flushed and may be lost. This function
>> + * is intended to free up memory after the cache was flushed by
>> + * nilfs_sufile_flush_cache().
>> + */
>> +void nilfs_sufile_shrink_cache(struct inode *sufile)
>> +{
>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>> +	struct nilfs_sufile_cache_node *node;
>> +	struct radix_tree_iter iter;
>> +	void **slot;
>> +
> 
>> +	/* prevent flush form running at the same time */
> 
> "flush from" ?

Yes, that is a typo. It should be "flush from".

>> +	down_read(&NILFS_MDT(sufile)->mi_sem);
> 
> This protection with mi_sem seems to be needless because the current
> implementation of nilfs_sufile_shrink_cache() doesn't touch buffers of
> sufile.  The delete operation is protected by a spinlock and the
> counter operations are protected with rcu.  What does this
> down_read()/up_read() protect ?

It is intended to protect against a concurrently running
nilfs_sufile_flush_cache(). This cannot happen the way the functions are
currently called, but I wanted to make them thread-safe.

In nilfs_sufile_flush_cache() I use references to nodes outside of the
spinlock, which I am not allowed to do if they can be deallocated at any
moment. I cannot hold the spinlock for the entire flush, because
nilfs_sufile_flush_cache_node() needs to be able to sleep. I cannot use
a mutex instead of a spinlock, because this would lead to a potential
deadlock:

nilfs_sufile_alloc_cache_node():
1. bmap->b_sem
2. sui->nlive_blks_cache_lock

nilfs_sufile_flush_cache_node():
1. sui->nlive_blks_cache_lock
2. bmap->b_sem

So I decided to "abuse" mi_sem for this purpose, since I already need to
hold mi_sem in nilfs_sufile_flush_cache().

>> +	/* prevent concurrent inserts */
>> +	spin_lock(&sui->nlive_blks_cache_lock);
>> +
>> +	radix_tree_for_each_slot(slot, &sui->nlive_blks_cache, &iter, 0) {
>> +		node = radix_tree_deref_slot_protected(slot,
>> +				&sui->nlive_blks_cache_lock);
>> +		if (!node)
>> +			continue;
>> +		if (radix_tree_exception(node))
>> +			continue;
>> +
>> +		radix_tree_delete(&sui->nlive_blks_cache, iter.index);
>> +		call_rcu(&node->rcu_head, nilfs_sufile_cache_node_release_rcu);
>> +	}
>> +
>> +	spin_unlock(&sui->nlive_blks_cache_lock);
>> +	up_read(&NILFS_MDT(sufile)->mi_sem);
>> +}
>> +
>> +/**
>>   * nilfs_sufile_read - read or get sufile inode
>>   * @sb: super block instance
>>   * @susize: size of a segment usage entry
>> @@ -1253,6 +1615,13 @@ int nilfs_sufile_read(struct super_block *sb, size_t susize,
>>  	sui->allocmax = nilfs_sufile_get_nsegments(sufile) - 1;
>>  	sui->allocmin = 0;
>>  
>> +	if (nilfs_feature_track_live_blks(sb->s_fs_info)) {
>> +		bgl_lock_init(&sui->nlive_blks_cache_bgl);
>> +		spin_lock_init(&sui->nlive_blks_cache_lock);
>> +		INIT_RADIX_TREE(&sui->nlive_blks_cache, GFP_ATOMIC);
>> +	}
>> +	sui->nlive_blks_cache_dirty = 0;
>> +
>>  	unlock_new_inode(sufile);
>>   out:
>>  	*inodep = sufile;
> 
> I think we should introduce a destructor for metadata files to prevent
> the memory leak brought in by the introduction of the cache nodes
> and the radix tree.  nilfs_sufile_shrink_cache() should be called from
> the destructor.
> 
> The destructor (e.g. mi->mi_dtor) should be called from
> nilfs_clear_inode() if it isn't set to a NULL value.  Initialization
> of the destructor will be done in nilfs_xxx_read().
> 
> In the current patchset, the callsite of nilfs_sufile_shrink_cache()
> is well considered, but it's not sufficient.  We have to eliminate the
> possibility of a memory leak completely and clearly.

Ok good idea.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index 520614f..662ab56 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -87,6 +87,11 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>>  int nilfs_sufile_read(struct super_block *sb, size_t susize,
>>  		      struct nilfs_inode *raw_inode, struct inode **inodep);
>>  int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
>> +int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum);
>> +void nilfs_sufile_shrink_cache(struct inode *sufile);
>> +int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
>> +			     unsigned long *pndirty_blks);
>> +int nilfs_sufile_cache_dirty(struct inode *sufile);
>>  
>>  /**
>>   * nilfs_sufile_scrap - make a segment garbage
>> -- 
>> 2.3.7
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


* Re: [PATCH v2 6/9] nilfs2: add tracking of block deletions and updates
       [not found]         ` <20150509.160512.1087140271092828536.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  2015-05-09 15:58           ` Ryusuke Konishi
@ 2015-05-09 20:02           ` Andreas Rohner
       [not found]             ` <554E67C0.1050309-hi6Y0CQ0nG0@public.gmane.org>
  1 sibling, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-09 20:02 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-09 09:05, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:19 +0200, Andreas Rohner wrote:
>> This patch adds tracking of block deletions and updates for all files.
>> It uses the fact that, for every block, NILFS2 keeps an entry in the
>> DAT file and stores the checkpoint where it was created, deleted or
>> overwritten. So whenever a block is deleted or overwritten,
>> nilfs_dat_commit_end() is called to update the DAT entry. At this
>> point this patch simply decrements the su_nlive_blks field of the
>> corresponding segment. The value of su_nlive_blks is set at segment
>> creation time.
>>
>> The DAT file itself has of course no DAT entries for its own blocks, but
>> it still has to propagate deletions and updates to its btree. When this
>> happens this patch again decrements the su_nlive_blks field of the
>> corresponding segment.
>>
>> The new feature compatibility flag NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS
>> can be used to enable or disable the block tracking at any time.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/btree.c   | 33 ++++++++++++++++++++++++++++++---
>>  fs/nilfs2/dat.c     | 15 +++++++++++++--
>>  fs/nilfs2/direct.c  | 20 +++++++++++++++-----
>>  fs/nilfs2/page.c    |  6 ++++--
>>  fs/nilfs2/page.h    |  3 +++
>>  fs/nilfs2/segbuf.c  |  3 +++
>>  fs/nilfs2/segbuf.h  |  5 +++++
>>  fs/nilfs2/segment.c | 48 +++++++++++++++++++++++++++++++++++++-----------
>>  fs/nilfs2/sufile.c  | 17 ++++++++++++++++-
>>  fs/nilfs2/sufile.h  |  3 ++-
>>  10 files changed, 128 insertions(+), 25 deletions(-)
>>
>> diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
>> index 059f371..d3b2763 100644
>> --- a/fs/nilfs2/btree.c
>> +++ b/fs/nilfs2/btree.c
>> @@ -30,6 +30,7 @@
>>  #include "btree.h"
>>  #include "alloc.h"
>>  #include "dat.h"
>> +#include "sufile.h"
>>  
>>  static void __nilfs_btree_init(struct nilfs_bmap *bmap);
>>  
> 
>> @@ -1889,9 +1890,35 @@ static int nilfs_btree_propagate_p(struct nilfs_bmap *btree,
>>  				   int level,
>>  				   struct buffer_head *bh)
>>  {
>> -	while ((++level < nilfs_btree_height(btree) - 1) &&
>> -	       !buffer_dirty(path[level].bp_bh))
>> -		mark_buffer_dirty(path[level].bp_bh);
>> +	struct the_nilfs *nilfs = btree->b_inode->i_sb->s_fs_info;
>> +	struct nilfs_btree_node *node;
>> +	__u64 ptr, segnum;
>> +	int ncmax, vol, counted;
>> +
>> +	vol = buffer_nilfs_volatile(bh);
>> +	counted = buffer_nilfs_counted(bh);
>> +	set_buffer_nilfs_counted(bh);
>> +
>> +	while (++level < nilfs_btree_height(btree)) {
>> +		if (!vol && !counted && nilfs_feature_track_live_blks(nilfs)) {
>> +			node = nilfs_btree_get_node(btree, path, level, &ncmax);
>> +			ptr = nilfs_btree_node_get_ptr(node,
>> +						       path[level].bp_index,
>> +						       ncmax);
>> +			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
>> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
>> +		}
>> +
>> +		if (path[level].bp_bh) {
>> +			if (buffer_dirty(path[level].bp_bh))
>> +				break;
>> +
>> +			mark_buffer_dirty(path[level].bp_bh);
>> +			vol = buffer_nilfs_volatile(path[level].bp_bh);
>> +			counted = buffer_nilfs_counted(path[level].bp_bh);
>> +			set_buffer_nilfs_counted(path[level].bp_bh);
>> +		}
>> +	}
>>  
>>  	return 0;
>>  }
> 
> Consider the following comments:
> 
> - Please use volatile flag also for the duplication check instead of
>   adding nilfs_counted flag.

I thought the volatile flag already meant something else, so I wasn't
sure if I could use it. I will change it and remove the nilfs_counted flag.

> - btree.c, direct.c, and dat.c shouldn't refer SUFILE directly.
>   Please add a wrapper function like "nilfs_dec_nlive_blks(nilfs, blocknr)"
>   to the implementation of the_nilfs.c, and use it instead.
> - To clarify implementation separate function to update pointers
>   like nilfs_btree_propagate_v() is doing.

Ok.

> - The return value of nilfs_sufile_dec_nlive_blks() looks to be ignored
>   intentionally.  Please add a comment explaining why you do so.

I just thought that the block tracking isn't important enough to cause
a fatal error. I should at least use the WARN_ON() macro. Do you think I
should return possible errors?

> e.g.
> 
> static void nilfs_btree_update_p(struct nilfs_bmap *btree,
>                                  struct nilfs_btree_path *path, int level)
> {
> 	struct the_nilfs *nilfs = btree->b_inode->i_sb->s_fs_info;
> 	struct nilfs_btree_node *parent;
> 	__u64 ptr;
> 	int ncmax;
> 
> 	if (nilfs_feature_track_live_blks(nilfs)) {
> 		parent = nilfs_btree_get_node(btree, path, level + 1, &ncmax);
> 		ptr = nilfs_btree_node_get_ptr(parent,
> 					       path[level + 1].bp_index,
> 					       ncmax);
> 		nilfs_dec_nlive_blks(nilfs, ptr);
> 		/* (Please add a comment explaining why we ignore the return value) */
> 	}
> 	set_buffer_nilfs_volatile(path[level].bp_bh);
> }
> 
> static int nilfs_btree_propagate_p(struct nilfs_bmap *btree,
> 				   struct nilfs_btree_path *path,
> 				   int level,
> 				   struct buffer_head *bh)
> {
> 	/*
> 	 * Update pointer to the given dirty buffer.  If the buffer is
> 	 * marked volatile, it shouldn't be updated because it's
> 	 * either a newly created buffer or an already updated one.
> 	 */
> 	if (!buffer_nilfs_volatile(path[level].bp_bh))
> 		nilfs_btree_update_p(btree, path, level);
> 
> 	/*
> 	 * Mark upper nodes dirty and update their pointers unless
> 	 * they're already marked dirty.
> 	 */
> 	while (++level < nilfs_btree_height(btree) - 1 &&
> 	       !buffer_dirty(path[level].bp_bh)) {
> 
> 		WARN_ON(buffer_nilfs_volatile(path[level].bp_bh));
> 		nilfs_btree_update_p(btree, path, level);
> 		mark_buffer_dirty(path[level].bp_bh);
> 	}
> 	return 0;
> }
> 
>> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
>> index 0d5fada..9c2fc32 100644
>> --- a/fs/nilfs2/dat.c
>> +++ b/fs/nilfs2/dat.c
>> @@ -28,6 +28,7 @@
>>  #include "mdt.h"
>>  #include "alloc.h"
>>  #include "dat.h"
>> +#include "sufile.h"
>>  
>>  
>>  #define NILFS_CNO_MIN	((__u64)1)
>> @@ -188,9 +189,10 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
>>  			  int dead)
>>  {
>>  	struct nilfs_dat_entry *entry;
>> -	__u64 start, end;
>> +	__u64 start, end, segnum;
>>  	sector_t blocknr;
>>  	void *kaddr;
>> +	struct the_nilfs *nilfs;
>>  
>>  	kaddr = kmap_atomic(req->pr_entry_bh->b_page);
>>  	entry = nilfs_palloc_block_get_entry(dat, req->pr_entry_nr,
>> @@ -206,8 +208,17 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
>>  
>>  	if (blocknr == 0)
>>  		nilfs_dat_commit_free(dat, req);
>> -	else
> 
> Add braces around nilfs_dat_commit_free() since you add multiple
> statements in the else clause.  See chapter 3 of the CodingStyle file.

Ok, sorry for that.

>> +	else {
>>  		nilfs_dat_commit_entry(dat, req);
>> +
>> +		nilfs = dat->i_sb->s_fs_info;
>> +
>> +		if (nilfs_feature_track_live_blks(nilfs)) {
> 
>> +			segnum = nilfs_get_segnum_of_block(nilfs, blocknr);
>> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
> 
> Ditto.  Call nilfs_dec_nlive_blks(nilfs, blocknr) instead and do not
> to add dependency to SUFILE in dat.c.
> 
>> +		}
>> +	}
>> +
>>  }
>>  
>>  void nilfs_dat_abort_end(struct inode *dat, struct nilfs_palloc_req *req)
>> diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c
>> index ebf89fd..42704eb 100644
>> --- a/fs/nilfs2/direct.c
>> +++ b/fs/nilfs2/direct.c
>> @@ -26,6 +26,7 @@
>>  #include "direct.h"
>>  #include "alloc.h"
>>  #include "dat.h"
>> +#include "sufile.h"
>>  
>>  static inline __le64 *nilfs_direct_dptrs(const struct nilfs_bmap *direct)
>>  {
>> @@ -268,18 +269,27 @@ int nilfs_direct_delete_and_convert(struct nilfs_bmap *bmap,
>>  static int nilfs_direct_propagate(struct nilfs_bmap *bmap,
>>  				  struct buffer_head *bh)
>>  {
>> +	struct the_nilfs *nilfs = bmap->b_inode->i_sb->s_fs_info;
>>  	struct nilfs_palloc_req oldreq, newreq;
>>  	struct inode *dat;
>> -	__u64 key;
>> -	__u64 ptr;
>> +	__u64 key, ptr, segnum;
>>  	int ret;
>>  
>> -	if (!NILFS_BMAP_USE_VBN(bmap))
>> -		return 0;
>> -
> 
>>  	dat = nilfs_bmap_get_dat(bmap);
>>  	key = nilfs_bmap_data_get_key(bmap, bh);
>>  	ptr = nilfs_direct_get_ptr(bmap, key);
>> +
> 
>> +	if (unlikely(!NILFS_BMAP_USE_VBN(bmap))) {
>> +		if (!buffer_nilfs_volatile(bh) && !buffer_nilfs_counted(bh) &&
>> +				nilfs_feature_track_live_blks(nilfs)) {
>> +			set_buffer_nilfs_counted(bh);
>> +			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
>> +
>> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
>> +		}
>> +		return 0;
>> +	}
> 
> Use the volatile flag also for the duplication check, and do not use
> the unlikely() macro when testing "!NILFS_BMAP_USE_VBN(bmap)".  It's
> not exceptional like an error:
> 
> 	if (!NILFS_BMAP_USE_VBN(bmap)) {
> 		if (!buffer_nilfs_volatile(bh)) {
> 			if (nilfs_feature_track_live_blks(nilfs))
> 				nilfs_dec_nlive_blks(nilfs, ptr);
> 			set_buffer_nilfs_volatile(bh);
> 		}
> 		return 0;
> 	}

During my tests, this was only called once, directly after the first
bytes were written to a newly formatted volume. This can only be true
for the DAT file, and the DAT file is very unlikely to be small enough
to use the direct bmap, except on a newly formatted volume. Do you mean
that unlikely() should only be used for errors?

>> +
>>  	if (!buffer_nilfs_volatile(bh)) {
>>  		oldreq.pr_entry_nr = ptr;
>>  		newreq.pr_entry_nr = ptr;
>> diff --git a/fs/nilfs2/page.c b/fs/nilfs2/page.c
>> index 45d650a..fd21b43 100644
>> --- a/fs/nilfs2/page.c
>> +++ b/fs/nilfs2/page.c
>> @@ -92,7 +92,8 @@ void nilfs_forget_buffer(struct buffer_head *bh)
>>  	const unsigned long clear_bits =
>>  		(1 << BH_Uptodate | 1 << BH_Dirty | 1 << BH_Mapped |
>>  		 1 << BH_Async_Write | 1 << BH_NILFS_Volatile |
>> -		 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected);
>> +		 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected |
> 
>> +		 1 << BH_NILFS_Counted);
> 
> You don't have to add nilfs_counted flag as I mentioned above.  Remove
> this.
> 
>>  
>>  	lock_buffer(bh);
>>  	set_mask_bits(&bh->b_state, clear_bits, 0);
>> @@ -422,7 +423,8 @@ void nilfs_clear_dirty_page(struct page *page, bool silent)
>>  		const unsigned long clear_bits =
>>  			(1 << BH_Uptodate | 1 << BH_Dirty | 1 << BH_Mapped |
>>  			 1 << BH_Async_Write | 1 << BH_NILFS_Volatile |
>> -			 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected);
>> +			 1 << BH_NILFS_Checked | 1 << BH_NILFS_Redirected |
> 
>> +			 1 << BH_NILFS_Counted);
> 
> Ditto.
> 
>>  
>>  		bh = head = page_buffers(page);
>>  		do {
>> diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
>> index a43b828..4e35814 100644
>> --- a/fs/nilfs2/page.h
>> +++ b/fs/nilfs2/page.h
>> @@ -36,12 +36,15 @@ enum {
>>  	BH_NILFS_Volatile,
>>  	BH_NILFS_Checked,
>>  	BH_NILFS_Redirected,
>> +	BH_NILFS_Counted,
> 
> Ditto.
> 
>>  };
>>  
>>  BUFFER_FNS(NILFS_Node, nilfs_node)		/* nilfs node buffers */
>>  BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
>>  BUFFER_FNS(NILFS_Checked, nilfs_checked)	/* buffer is verified */
>>  BUFFER_FNS(NILFS_Redirected, nilfs_redirected)	/* redirected to a copy */
> 
>> +/* counted by propagate_p for segment usage */
>> +BUFFER_FNS(NILFS_Counted, nilfs_counted)
> 
> Ditto.
> 
>>  
>>  
>>  int __nilfs_clear_page_dirty(struct page *);
>> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
>> index dc3a9efd..dabb65b 100644
>> --- a/fs/nilfs2/segbuf.c
>> +++ b/fs/nilfs2/segbuf.c
>> @@ -57,6 +57,9 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
>>  	INIT_LIST_HEAD(&segbuf->sb_segsum_buffers);
>>  	INIT_LIST_HEAD(&segbuf->sb_payload_buffers);
>>  	segbuf->sb_super_root = NULL;
> 
>> +	segbuf->sb_flags = 0;
> 
> You don't have to add sb_flags.  Use sci->sc_stage.flags instead
> because the flag is used to manage internal state of segment
> construction rather than the state of segbuf.

Yes, that is true. I'll change that.

>> +	segbuf->sb_nlive_blks = 0;
>> +	segbuf->sb_nsnapshot_blks = 0;
>>  
>>  	init_completion(&segbuf->sb_bio_event);
>>  	atomic_set(&segbuf->sb_err, 0);
>> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
>> index b04f08c..a802f61 100644
>> --- a/fs/nilfs2/segbuf.h
>> +++ b/fs/nilfs2/segbuf.h
>> @@ -83,6 +83,9 @@ struct nilfs_segment_buffer {
>>  	sector_t		sb_fseg_start, sb_fseg_end;
>>  	sector_t		sb_pseg_start;
>>  	unsigned		sb_rest_blocks;
> 
>> +	int			sb_flags;
> 
> ditto.
> 
>> +	__u32			sb_nlive_blks;
>> +	__u32			sb_nsnapshot_blks;
>>  
>>  	/* Buffers */
>>  	struct list_head	sb_segsum_buffers;
>> @@ -95,6 +98,8 @@ struct nilfs_segment_buffer {
>>  	struct completion	sb_bio_event;
>>  };
>>  
>> +#define NILFS_SEGBUF_SUSET	BIT(0)	/* segment usage has been set */
>> +
> 
> Ditto.
> 
>>  #define NILFS_LIST_SEGBUF(head)  \
>>  	list_entry((head), struct nilfs_segment_buffer, sb_list)
>>  #define NILFS_NEXT_SEGBUF(segbuf)  NILFS_LIST_SEGBUF((segbuf)->sb_list.next)
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index c6abbad9..14e76c3 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -762,7 +762,8 @@ static int nilfs_test_metadata_dirty(struct the_nilfs *nilfs,
>>  		ret++;
>>  	if (nilfs_mdt_fetch_dirty(nilfs->ns_cpfile))
>>  		ret++;
>> -	if (nilfs_mdt_fetch_dirty(nilfs->ns_sufile))
>> +	if (nilfs_mdt_fetch_dirty(nilfs->ns_sufile) ||
>> +	    nilfs_sufile_cache_dirty(nilfs->ns_sufile))
>>  		ret++;
>>  	if ((ret || nilfs_doing_gc()) && nilfs_mdt_fetch_dirty(nilfs->ns_dat))
>>  		ret++;
>> @@ -1368,36 +1369,49 @@ static void nilfs_free_incomplete_logs(struct list_head *logs,
>>  }
>>  
>>  static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
>> -					  struct inode *sufile)
>> +					  struct the_nilfs *nilfs)
> 
> Do not change the sufile argument to nilfs.  It's not necessary
> for this change.

Ok.

>>  {
>>  	struct nilfs_segment_buffer *segbuf;
>> -	unsigned long live_blocks;
>> +	struct inode *sufile = nilfs->ns_sufile;
>> +	unsigned long nblocks;
>>  	int ret;
>>  
>>  	list_for_each_entry(segbuf, &sci->sc_segbufs, sb_list) {
>> -		live_blocks = segbuf->sb_sum.nblocks +
>> +		nblocks = segbuf->sb_sum.nblocks +
>>  			(segbuf->sb_pseg_start - segbuf->sb_fseg_start);
> 
>>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>> -						     live_blocks,
>> +						     nblocks,
>> +						     segbuf->sb_nlive_blks,
>> +						     segbuf->sb_nsnapshot_blks,
>>  						     sci->sc_seg_ctime);
> 
> With this change, two different semantics, "set" and "modify", are
> mixed up in the arguments of nilfs_sufile_set_segment_usage().  It's
> bad and confusing.
> 
> Please change nilfs_sufile_set_segment_usage() function, for instance,
> to nilfs_sufile_modify_segment_usage() and rewrite the above part
> so that all counter arguments are passed with the "modify" semantics.

Ok.

>>  		WARN_ON(ret); /* always succeed because the segusage is dirty */
>> +
>> +		segbuf->sb_flags |= NILFS_SEGBUF_SUSET;
> 
> Use sci->sc_stage.flags adding NILFS_CF_SUMOD flag.  Note that the
> flag must be added also to NILFS_CF_HISTORY_MASK so that the flag will
> be cleared every time a new cycle starts in the loop of
> nilfs_segctor_do_construct().

Ok.

>>  	}
>>  }
>>  
>> -static void nilfs_cancel_segusage(struct list_head *logs, struct inode *sufile)
>> +static void nilfs_cancel_segusage(struct list_head *logs,
>> +				  struct the_nilfs *nilfs)
> 
> Ditto.  Do not change the sufile argument to the pointer to nilfs
> object.
> 
>>  {
>>  	struct nilfs_segment_buffer *segbuf;
>> +	struct inode *sufile = nilfs->ns_sufile;
>> +	__s64 nlive_blks = 0, nsnapshot_blks = 0;
>>  	int ret;
>>  
>>  	segbuf = NILFS_FIRST_SEGBUF(logs);
> 
>> +	if (segbuf->sb_flags & NILFS_SEGBUF_SUSET) {
> 
> Ditto.
> 
>> +		nlive_blks = -(__s64)segbuf->sb_nlive_blks;
>> +		nsnapshot_blks = -(__s64)segbuf->sb_nsnapshot_blks;
>> +	}
>>  	ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>>  					     segbuf->sb_pseg_start -
>> -					     segbuf->sb_fseg_start, 0);
>> +					     segbuf->sb_fseg_start,
>> +					     nlive_blks, nsnapshot_blks, 0);
> 
> Ditto.
> 
>>  	WARN_ON(ret); /* always succeed because the segusage is dirty */
>>  
>>  	list_for_each_entry_continue(segbuf, logs, sb_list) {
>>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>> -						     0, 0);
>> +						     0, 0, 0, 0);
>>  		WARN_ON(ret); /* always succeed */
>>  	}
>>  }
>> @@ -1499,6 +1513,7 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>>  	if (!nfinfo)
>>  		goto out;
>>  
>> +	segbuf->sb_nlive_blks = segbuf->sb_sum.nfileblk;
>>  	blocknr = segbuf->sb_pseg_start + segbuf->sb_sum.nsumblk;
>>  	ssp.bh = NILFS_SEGBUF_FIRST_BH(&segbuf->sb_segsum_buffers);
>>  	ssp.offset = sizeof(struct nilfs_segment_summary);
>> @@ -1728,7 +1743,7 @@ static void nilfs_segctor_abort_construction(struct nilfs_sc_info *sci,
>>  	nilfs_abort_logs(&logs, ret ? : err);
>>  
>>  	list_splice_tail_init(&sci->sc_segbufs, &logs);
>> -	nilfs_cancel_segusage(&logs, nilfs->ns_sufile);
>> +	nilfs_cancel_segusage(&logs, nilfs);
>>  	nilfs_free_incomplete_logs(&logs, nilfs);
>>  
>>  	if (sci->sc_stage.flags & NILFS_CF_SUFREED) {
>> @@ -1790,7 +1805,8 @@ static void nilfs_segctor_complete_write(struct nilfs_sc_info *sci)
>>  			const unsigned long clear_bits =
>>  				(1 << BH_Dirty | 1 << BH_Async_Write |
>>  				 1 << BH_Delay | 1 << BH_NILFS_Volatile |
>> -				 1 << BH_NILFS_Redirected);
>> +				 1 << BH_NILFS_Redirected |
>> +				 1 << BH_NILFS_Counted);
> 
> Ditto.  Stop to add nilfs_counted flag.
> 
>>  
>>  			set_mask_bits(&bh->b_state, clear_bits, set_bits);
>>  			if (bh == segbuf->sb_super_root) {
>> @@ -1995,7 +2011,14 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
>>  
>>  			nilfs_segctor_fill_in_super_root(sci, nilfs);
>>  		}
>> -		nilfs_segctor_update_segusage(sci, nilfs->ns_sufile);
>> +
>> +		if (nilfs_feature_track_live_blks(nilfs)) {
>> +			err = nilfs_sufile_flush_cache(nilfs->ns_sufile, 0,
>> +						       NULL);
>> +			if (unlikely(err))
>> +				goto failed_to_write;
>> +		}
>> +		nilfs_segctor_update_segusage(sci, nilfs);
>>  
>>  		/* Write partial segments */
>>  		nilfs_segctor_prepare_write(sci);
>> @@ -2022,6 +2045,9 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
>>  		}
>>  	} while (sci->sc_stage.scnt != NILFS_ST_DONE);
>>  
> 
>> +	if (nilfs_feature_track_live_blks(nilfs))
>> +		nilfs_sufile_shrink_cache(nilfs->ns_sufile);
> 
> As I mentioned on ahead, this shrink cache function should be called
> from a destructor of sufile which doesn't exist at present.
> 
>> +
>>   out:
>>  	nilfs_segctor_drop_written_files(sci, nilfs);
>>  	return err;
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 80bbd87..9cd8820d 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -527,10 +527,13 @@ int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum)
>>   * @sufile: inode of segment usage file
>>   * @segnum: segment number
>>   * @nblocks: number of live blocks in the segment
>> + * @nlive_blks: number of live blocks to add to the su_nlive_blks field
>> + * @nsnapshot_blks: number of snapshot blocks to add to su_nsnapshot_blks
>>   * @modtime: modification time (option)
>>   */
>>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>> -				   unsigned long nblocks, time_t modtime)
>> +				   unsigned long nblocks, __s64 nlive_blks,
>> +				   __s64 nsnapshot_blks, time_t modtime)
> 
> As I mentioned above, this function should be renamed to
> nilfs_sufile_modify_segment_usage() and the semantics of nblocks,
> nlive_blks, nsnapshot_blks arguments should be uniformed to "modify"
> semantics.
> 
> Also, the types of these three counter arguments are not uniform.

I used signed types for nlive_blks and nsnapshot_blks to be able to
pass negative numbers in nilfs_cancel_segusage().

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
>>  {
>>  	struct buffer_head *bh;
>>  	struct nilfs_segment_usage *su;
>> @@ -548,6 +551,18 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>>  	if (modtime)
>>  		su->su_lastmod = cpu_to_le64(modtime);
>>  	su->su_nblocks = cpu_to_le32(nblocks);
>> +
>> +	if (nilfs_sufile_live_blks_ext_supported(sufile)) {
>> +		nsnapshot_blks += le32_to_cpu(su->su_nsnapshot_blks);
>> +		nsnapshot_blks = min_t(__s64, max_t(__s64, nsnapshot_blks, 0),
>> +				       nblocks);
>> +		su->su_nsnapshot_blks = cpu_to_le32(nsnapshot_blks);
>> +
>> +		nlive_blks += le32_to_cpu(su->su_nlive_blks);
>> +		nlive_blks = min_t(__s64, max_t(__s64, nlive_blks, 0), nblocks);
>> +		su->su_nlive_blks = cpu_to_le32(nlive_blks);
>> +	}
>> +
>>  	kunmap_atomic(kaddr);
>>  
>>  	mark_buffer_dirty(bh);
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index 662ab56..3466abb 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -60,7 +60,8 @@ int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
>>  int nilfs_sufile_alloc(struct inode *, __u64 *);
>>  int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
>>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>> -				   unsigned long nblocks, time_t modtime);
>> +				   unsigned long nblocks, __s64 nlive_blks,
>> +				   __s64 nsnapshot_blks, time_t modtime);
>>  int nilfs_sufile_get_stat(struct inode *, struct nilfs_sustat *);
>>  ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
>>  				size_t);
>> -- 
>> 2.3.7
>>
> 


* Re: [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out
       [not found]         ` <20150509.211741.1463241033923032068.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-09 20:18           ` Andreas Rohner
       [not found]             ` <554E6B7E.8070000-hi6Y0CQ0nG0@public.gmane.org>
  2015-05-10 11:04           ` Andreas Rohner
  1 sibling, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-09 20:18 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-09 14:17, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:20 +0200, Andreas Rohner wrote:
>> This patch ensures, that all dirty blocks are written out if the segment
>> construction mode is SC_LSEG_SR. The scanning of the DAT file can cause
>> blocks in the SUFILE to be dirtied and newly dirtied blocks in the
>> SUFILE can in turn dirty more blocks in the DAT file. Since one of
>> these stages has to happen before the other during segment
>> construction, we end up with unwritten dirty blocks that are lost
>> in case of a file system unmount.
>>
>> This patch introduces a new set of file scanning operations that
>> only propagate the changes to the bmap and do not add anything to the
>> segment buffer. The DAT file and SUFILE are scanned with these
>> operations. The function nilfs_sufile_flush_cache() is called in between
>> these scans with the parameter only_mark set. That way it can be called
>> repeatedly without actually writing anything to the SUFILE. If there are
>> no new blocks dirtied in the flush, the normal segment construction
>> stages can safely continue.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/segment.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  fs/nilfs2/segment.h |  3 ++-
>>  2 files changed, 74 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index 14e76c3..ab8df33 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -579,6 +579,12 @@ static int nilfs_collect_dat_data(struct nilfs_sc_info *sci,
>>  	return err;
>>  }
>>  
>> +static int nilfs_collect_prop_data(struct nilfs_sc_info *sci,
>> +				  struct buffer_head *bh, struct inode *inode)
>> +{
>> +	return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
>> +}
>> +
>>  static int nilfs_collect_dat_bmap(struct nilfs_sc_info *sci,
>>  				  struct buffer_head *bh, struct inode *inode)
>>  {
>> @@ -613,6 +619,14 @@ static struct nilfs_sc_operations nilfs_sc_dat_ops = {
>>  	.write_node_binfo = nilfs_write_dat_node_binfo,
>>  };
>>  
>> +static struct nilfs_sc_operations nilfs_sc_prop_ops = {
>> +	.collect_data = nilfs_collect_prop_data,
>> +	.collect_node = nilfs_collect_file_node,
>> +	.collect_bmap = NULL,
>> +	.write_data_binfo = NULL,
>> +	.write_node_binfo = NULL,
>> +};
>> +
>>  static struct nilfs_sc_operations nilfs_sc_dsync_ops = {
>>  	.collect_data = nilfs_collect_file_data,
>>  	.collect_node = NULL,
>> @@ -998,7 +1012,8 @@ static int nilfs_segctor_scan_file(struct nilfs_sc_info *sci,
>>  			err = nilfs_segctor_apply_buffers(
>>  				sci, inode, &data_buffers,
>>  				sc_ops->collect_data);
>> -			BUG_ON(!err); /* always receive -E2BIG or true error */
>> +			/* always receive -E2BIG or true error (NOT ANYMORE?)*/
>> +			/* BUG_ON(!err); */
>>  			goto break_or_fail;
>>  		}
>>  	}
> 
> If n > rest, this function will exit without scanning node buffers
> for nilfs_segctor_propagate_sufile().  This looks like a problem, right?
> 
> I think adding separate functions is better.  For instance,
> 
> static int nilfs_propagate_buffer(struct nilfs_sc_info *sci,
> 				  struct buffer_head *bh,
> 				  struct inode *inode)
> {
> 	return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
> }
> 
> static int nilfs_segctor_propagate_file(struct nilfs_sc_info *sci,
> 					struct inode *inode)
> {
> 	LIST_HEAD(buffers);
> 	size_t n;
> 	int err;
> 
> 	n = nilfs_lookup_dirty_data_buffers(inode, &buffers, SIZE_MAX, 0,
> 					    LLONG_MAX);
> 	if (n > 0) {
> 		ret = nilfs_segctor_apply_buffers(sci, inode, &buffers,
> 						  nilfs_propagate_buffer);
> 		if (unlikely(ret))
> 			goto fail;
> 	}
> 
> 	nilfs_lookup_dirty_node_buffers(inode, &buffers);
> 	ret = nilfs_segctor_apply_buffers(sci, inode, &buffers,
> 					  nilfs_propagate_buffer);
> fail:
> 	return ret;
> }
> 
> With this, you can also avoid defining nilfs_sc_prop_ops, nor touching
> the BUG_ON() in nilfs_segctor_scan_file.

I agree this is a much nicer solution.

>> @@ -1055,6 +1070,55 @@ static int nilfs_segctor_scan_file_dsync(struct nilfs_sc_info *sci,
>>  	return err;
>>  }
>>  
>> +/**
>> + * nilfs_segctor_propagate_sufile - dirties all needed SUFILE blocks
>> + * @sci: nilfs_sc_info
>> + *
>> + * Description: Dirties and propagates all SUFILE blocks that need to be
>> + * available later in the segment construction process, when the SUFILE cache
>> + * is flushed. Here the SUFILE cache is not actually flushed, but the blocks
>> + * that are needed for a later flush are marked as dirty. Since the propagation
>> + * of the SUFILE can dirty DAT entries and vice versa, the functions
>> + * are executed in a loop until no new blocks are dirtied.
>> + *
>> + * Return Value: On success, 0 is returned on error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient memory available.
>> + *
>> + * %-EIO - I/O error
>> + *
>> + * %-EROFS - Read only filesystem (for create mode)
>> + */
>> +static int nilfs_segctor_propagate_sufile(struct nilfs_sc_info *sci)
>> +{
>> +	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
>> +	unsigned long ndirty_blks;
>> +	int ret, retrycount = NILFS_SC_SUFILE_PROP_RETRY;
>> +
>> +	do {
>> +		/* count changes to DAT file before flush */
>> +		ret = nilfs_segctor_scan_file(sci, nilfs->ns_dat,
>> +					      &nilfs_sc_prop_ops);
> 
> Use the previous nilfs_segctor_propagate_file() here.
> 
>> +		if (unlikely(ret))
>> +			return ret;
>> +
>> +		ret = nilfs_sufile_flush_cache(nilfs->ns_sufile, 1,
>> +					       &ndirty_blks);
>> +		if (unlikely(ret))
>> +			return ret;
>> +		if (!ndirty_blks)
>> +			break;
>> +
>> +		ret = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
>> +					      &nilfs_sc_prop_ops);
> 
> Ditto.
> 
>> +		if (unlikely(ret))
>> +			return ret;
>> +	} while (ndirty_blks && retrycount-- > 0);
>> +
> 
> Uum. This still looks to have potential for leak of dirty block
> collection between DAT and SUFILE since this retry is limited by
> the fixed retry count.

Yes, unfortunately.

> How about adding a function that temporarily turns off the live block
> tracking and using it after this propagation loop until the log write
> finishes?

I think this is a great idea.

> It would reduce the accuracy of the live block count, but is that enough?
> What do you think? 

I would suggest to iterate through the loop in
nilfs_segctor_propagate_sufile() at least once or twice, so that we can
count most of the DAT-File blocks. After that we temporarily turn off
the live block tracking until the end of the segment construction. This
should only lead to small inaccuracies.

> We have to eliminate the possibility of the leak
> because it can cause file system corruption.  Every checkpoint must be
> self-contained.

I didn't realize that it could cause file system corruption.

>> +	return 0;
>> +}
>> +
>>  static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
>>  {
>>  	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
>> @@ -1160,6 +1224,13 @@ static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
>>  		}
>>  		sci->sc_stage.flags |= NILFS_CF_SUFREED;
>>  
> 
>> +		if (mode == SC_LSEG_SR &&
> 
> This test ("mode == SC_LSEG_SR") can be removed.  When the thread
> comes here, it will always make a checkpoint.
> 
>> +		    nilfs_feature_track_live_blks(nilfs)) {
>> +			err = nilfs_segctor_propagate_sufile(sci);
>> +			if (unlikely(err))
>> +				break;
>> +		}
>> +
>>  		err = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
>>  					      &nilfs_sc_file_ops);
>>  		if (unlikely(err))
>> diff --git a/fs/nilfs2/segment.h b/fs/nilfs2/segment.h
>> index a48d6de..5aa7f91 100644
>> --- a/fs/nilfs2/segment.h
>> +++ b/fs/nilfs2/segment.h
>> @@ -208,7 +208,8 @@ enum {
>>   */
>>  #define NILFS_SC_CLEANUP_RETRY	    3  /* Retry count of construction when
>>  					  destroying segctord */
>> -
>> +#define NILFS_SC_SUFILE_PROP_RETRY  10 /* Retry count of the propagate
>> +					  sufile loop */
> 
> How many times does the propagation loop have to be repeated
> until it converges?

Most of the time it runs only once, because all the blocks are already
dirty, but sometimes it can go on for more than 10 iterations.

Regards,
Andreas Rohner

> The current dirty block scanning function collects all dirty blocks of
> the specified file (i.e. SUFILE or DAT), traversing the page cache and
> making and destroying a list of dirty buffers, every time the propagation
> function is called.  It's wasteful to repeat that many times.
> 
> Regards,
> Ryusuke Konishi
> 
>>  /*
>>   * Default values of timeout, in seconds.
>>   */
>> -- 
>> 2.3.7
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 4/9] nilfs2: add kmem_cache for SUFILE cache nodes
       [not found]             ` <554E5B9D.7070807-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-10  0:05               ` Ryusuke Konishi
  0 siblings, 0 replies; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-10  0:05 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sat, 09 May 2015 21:10:21 +0200, Andreas Rohner wrote:
> On 2015-05-09 04:41, Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:17 +0200, Andreas Rohner wrote:
>>> +static void nilfs_sufile_cache_node_init_once(void *obj)
>>> +{
>>> +	memset(obj, 0, sizeof(struct nilfs_sufile_cache_node));
>>> +}
>>> +
>> 
>> Note that nilfs_sufile_cache_node_init_once() is only called when each
>> cache entry is allocated for the first time.  It doesn't ensure that each
>> cache entry is clean when it is allocated with kmem_cache_alloc()
>> a second time and afterwards.
> 
> I kind of assumed it would be called for every object returned by
> kmem_cache_alloc(). In that case I have to do the initialization in
> nilfs_sufile_alloc_cache_node() and remove this function.
> 
> Regards,
> Andreas Rohner

You can use kmem_cache_zalloc() instead in that case.

Regards,
Ryusuke Konishi


* Re: [PATCH v2 5/9] nilfs2: add SUFILE cache for changes to su_nlive_blks field
       [not found]             ` <554E626A.2030503-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-10  2:09               ` Ryusuke Konishi
  0 siblings, 0 replies; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-10  2:09 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sat, 09 May 2015 21:39:22 +0200, Andreas Rohner wrote:
> On 2015-05-09 06:09, Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:18 +0200, Andreas Rohner wrote:
<snip>
>>> +	lock = bgl_lock_ptr(&sui->nlive_blks_cache_bgl, (unsigned int)group);
>>> +	spin_lock(lock);
>>> +	node->values[segnum & ((1 << NILFS_SUFILE_CACHE_NODE_SHIFT) - 1)] += 1;
>>> +	sui->nlive_blks_cache_dirty = 1;
>>> +	spin_unlock(lock);
>>> +	rcu_read_unlock();
>>> +
>>> +	return 0;
>>> +}
>> 
>> Consider using cmpxchg() or atomic_inc(), and using

>> NILFS_SUFILE_CACHE_NODE_MASK to mask segnum.

Sorry, I meant NILFS_SUFILE_CACHE_NODE_COUNT as below.

>> The following is an
>> example in the case of using cmpxchg():
>> 
>> 	__u32 old, new, *valuep;
>> 	...
>> 	valuep = &node->values[segnum & (NILFS_SUFILE_CACHE_NODE_COUNT - 1)];
>> 	do {
>> 		old = ACCESS_ONCE(*valuep);
>> 		new = old + 1;
>> 	} while (cmpxchg(valuep, old, new) != old);
>> 
>> 	sui->nlive_blks_cache_dirty = 1;
>> 
>> 	rcu_read_unlock();
>> 	return 0;
>> }
>> 
>> The current atomic_xxxx() macros are actually defined in the same way
>> to reduce overheads in SMP environments.
>> 
>> Using atomic_xxxx() is more preferable but formally it requires
>> initialization with "atomic_set(&counter, 0)" or "ATOMIC_INIT(0)" for
>> every element.  I don't know whether initialization with memset()
>> function is allowed or not for atomic_t type variables.
> 
> Ok, but I would also have to use cmpxchg() in
> nilfs_sufile_flush_cache_node(), if the value needs to be set to 0.
> 
> Currently the cache is only flushed during segment construction, so
> there should be no concurrent calls to nilfs_sufile_dec_nlive_blks().
> But I thought it would be best to design a thread-safe flush function.

You don't have to use cmpxchg() if you just want to set the value
to 0, since the store operation is already atomic (see the
implementation of atomic_set).

Do you mean atomic subtraction is needed in
nilfs_sufile_flush_cache_node()?

If so, using cmpxchg() is right.  But, note that, under the premise,
nilfs_sufile_shrink_cache() may free counters with a non-zero value.

At present, this is not a problem because nilfs_sufile_shrink_cache()
is called within nilfs_segctor_do_construct().  And, even when we add
its callsite in destructor of sufile, ignoring counters with a
non-zero value is not critical.

<snip>
>>> +void nilfs_sufile_shrink_cache(struct inode *sufile)
>>> +{
>>> +	struct nilfs_sufile_info *sui = NILFS_SUI(sufile);
>>> +	struct nilfs_sufile_cache_node *node;
>>> +	struct radix_tree_iter iter;
>>> +	void **slot;
>>> +
>> 
>>> +	/* prevent flush form running at the same time */
>> 
>> "flush from" ?
> 
> Yes, that is a typo. It should be "flush from".
> 
>>> +	down_read(&NILFS_MDT(sufile)->mi_sem);
>> 
>> This protection with mi_sem seems to be needless because the current
>> implementation of nilfs_sufile_shrink_cache() doesn't touch buffers of
>> sufile.  The delete operation is protected by a spinlock and the
>> counter operations are protected with rcu.  What does this
>> down_read()/up_read() protect ?
> 
> It is intended to protect against a concurrently running
> nilfs_sufile_flush_cache() function. This should not happen the way the
> functions are called currently, but I wanted to make them thread safe.
> 
> In nilfs_sufile_flush_cache() I use references to nodes outside of the
> spinlock, which I am not allowed to do if they can be deallocated at any
> moment.

Ok, that's reasonable.  Thanks for the detailed explanation.

Regards,
Ryusuke Konishi

> I cannot hold the spinlock for the entire flush, because
> nilfs_sufile_flush_cache_node() needs to be able to sleep. I cannot use
> a mutex instead of a spinlock, because this would lead to a potential
> deadlock:
> 
> nilfs_sufile_alloc_cache_node():
> 1. bmap->b_sem
> 2. sui->nlive_blks_cache_lock
> 
> nilfs_sufile_flush_cache_node():
> 1. sui->nlive_blks_cache_lock
> 2. bmap->b_sem
> 
> So I decided to "abuse" mi_sem for this purpose, since I already need to
> hold mi_sem in nilfs_sufile_flush_cache().



* Re: [PATCH v2 6/9] nilfs2: add tracking of block deletions and updates
       [not found]             ` <554E67C0.1050309-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-10  3:17               ` Ryusuke Konishi
  0 siblings, 0 replies; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-10  3:17 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sat, 09 May 2015 22:02:08 +0200, Andreas Rohner wrote:
> On 2015-05-09 09:05, Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:19 +0200, Andreas Rohner wrote:
<snip>
>>> @@ -1889,9 +1890,35 @@ static int nilfs_btree_propagate_p(struct nilfs_bmap *btree,
>>>  				   int level,
>>>  				   struct buffer_head *bh)
>>>  {
>>> -	while ((++level < nilfs_btree_height(btree) - 1) &&
>>> -	       !buffer_dirty(path[level].bp_bh))
>>> -		mark_buffer_dirty(path[level].bp_bh);
>>> +	struct the_nilfs *nilfs = btree->b_inode->i_sb->s_fs_info;
>>> +	struct nilfs_btree_node *node;
>>> +	__u64 ptr, segnum;
>>> +	int ncmax, vol, counted;
>>> +
>>> +	vol = buffer_nilfs_volatile(bh);
>>> +	counted = buffer_nilfs_counted(bh);
>>> +	set_buffer_nilfs_counted(bh);
>>> +
>>> +	while (++level < nilfs_btree_height(btree)) {
>>> +		if (!vol && !counted && nilfs_feature_track_live_blks(nilfs)) {
>>> +			node = nilfs_btree_get_node(btree, path, level, &ncmax);
>>> +			ptr = nilfs_btree_node_get_ptr(node,
>>> +						       path[level].bp_index,
>>> +						       ncmax);
>>> +			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
>>> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
>>> +		}
>>> +
>>> +		if (path[level].bp_bh) {
>>> +			if (buffer_dirty(path[level].bp_bh))
>>> +				break;
>>> +
>>> +			mark_buffer_dirty(path[level].bp_bh);
>>> +			vol = buffer_nilfs_volatile(path[level].bp_bh);
>>> +			counted = buffer_nilfs_counted(path[level].bp_bh);
>>> +			set_buffer_nilfs_counted(path[level].bp_bh);
>>> +		}
>>> +	}
>>>  
>>>  	return 0;
>>>  }
>> 
>> Consider the following comments:
>> 
>> - Please use volatile flag also for the duplication check instead of
>>   adding nilfs_counted flag.
> 
> I thought volatile already meant something else. I wasn't sure if I could
> use it. I will change it and remove the nilfs_counted flag.

Sorry for the confusing naming.

The volatile flag originally meant that the parent node of the buffer
stores a memory address of buffer_head struct of it instead of a disk
block address (virtual or real one).  This is needed for newly created
and never written buffers.

It later got used for the duplication check of pointer updating as
well.  For node buffers marked dirty through dirty flag propagation,
the dirty flag can be used for this purpose.  However, for node or
data buffers which trigger the propagation, another flag was needed
since it's already marked dirty.

For details, see what nilfs_btree_commit_propagate_v() and
nilfs_btree_commit_update_v() are doing.

Anyway, we can apply the same semantics for live block tracking of
DAT.

> 
>> - btree.c, direct.c, and dat.c shouldn't refer SUFILE directly.
>>   Please add a wrapper function like "nilfs_dec_nlive_blks(nilfs, blocknr)"
>>   to the implementation of the_nilfs.c, and use it instead.
>> - To clarify implementation separate function to update pointers
>>   like nilfs_btree_propagate_v() is doing.
> 
> Ok.
> 
>> - The return value of nilfs_sufile_dec_nlive_blks() looks to be ignored
>>   intentionally.  Please add a comment explaining why you do so.
> 
> I just thought that the block tracking isn't important enough to cause
> a fatal error. I should at least use the WARN_ON() macro. Do you think I
> should return possible errors?

I think ignoring errors is ok.  I just thought we should clarify the
reason for other kernel developers.  Outputting some warning sounds
good.  In that case, using nilfs_warning() seems better than WARN_ON(),
since the former can deliver a meaningful message.

<snip>
>>> +	if (unlikely(!NILFS_BMAP_USE_VBN(bmap))) {
>>> +		if (!buffer_nilfs_volatile(bh) && !buffer_nilfs_counted(bh) &&
>>> +				nilfs_feature_track_live_blks(nilfs)) {
>>> +			set_buffer_nilfs_counted(bh);
>>> +			segnum = nilfs_get_segnum_of_block(nilfs, ptr);
>>> +
>>> +			nilfs_sufile_dec_nlive_blks(nilfs->ns_sufile, segnum);
>>> +		}
>>> +		return 0;
>>> +	}
>> 
>> Use the volatile flag also for duplication check, and do not use
>> unlikely() marcro when testing "!NILFS_BMAP_USE_VBN(bmap)".  It's
>> not exceptional as error:
>> 
>> 	if (!NILFS_BMAP_USE_VBN(bmap)) {
>> 		if (!buffer_nilfs_volatile(bh)) {
>> 			if (nilfs_feature_track_live_blks(nilfs))
>> 				nilfs_dec_nlive_blks(nilfs, ptr);
>> 			set_buffer_nilfs_volatile(bh);
>> 		}
>> 		return 0;
>> 	}
> 
> During my tests, this was only called once directly after the first
> bytes are written on a newly formatted volume. This can only be true for
> the DAT-File and the DAT-File is very unlikely to be small enough to use
> the direct bmap, except on a newly formatted volume. Do you mean that
> unlikely() should only be used for errors?

Ok, it sounds reasonable.

In general, unlikely() should only be used for errors.  But this is
special; the DAT file soon becomes big and is converted to a B-tree,
so nilfs_direct_propagate() soon becomes a function for other
types of files.

Regards,
Ryusuke Konishi


* Re: [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out
       [not found]             ` <554E6B7E.8070000-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-10  3:31               ` Ryusuke Konishi
  0 siblings, 0 replies; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-10  3:31 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sat, 09 May 2015 22:18:06 +0200, Andreas Rohner wrote:
> On 2015-05-09 14:17, Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:20 +0200, Andreas Rohner wrote:
>>> @@ -1055,6 +1070,55 @@ static int nilfs_segctor_scan_file_dsync(struct nilfs_sc_info *sci,
>>>  	return err;
>>>  }
>>>  
>>> +/**
>>> + * nilfs_segctor_propagate_sufile - dirties all needed SUFILE blocks
>>> + * @sci: nilfs_sc_info
>>> + *
>>> + * Description: Dirties and propagates all SUFILE blocks that need to be
>>> + * available later in the segment construction process, when the SUFILE cache
>>> + * is flushed. Here the SUFILE cache is not actually flushed, but the blocks
>>> + * that are needed for a later flush are marked as dirty. Since the propagation
>>> + * of the SUFILE can dirty DAT entries and vice versa, the functions
>>> + * are executed in a loop until no new blocks are dirtied.
>>> + *
>>> + * Return Value: On success, 0 is returned. On error, one of the following
>>> + * negative error codes is returned.
>>> + *
>>> + * %-ENOMEM - Insufficient memory available.
>>> + *
>>> + * %-EIO - I/O error
>>> + *
>>> + * %-EROFS - Read only filesystem (for create mode)
>>> + */
>>> +static int nilfs_segctor_propagate_sufile(struct nilfs_sc_info *sci)
>>> +{
>>> +	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
>>> +	unsigned long ndirty_blks;
>>> +	int ret, retrycount = NILFS_SC_SUFILE_PROP_RETRY;
>>> +
>>> +	do {
>>> +		/* count changes to DAT file before flush */
>>> +		ret = nilfs_segctor_scan_file(sci, nilfs->ns_dat,
>>> +					      &nilfs_sc_prop_ops);
>> 
>> Use the previous nilfs_segctor_propagate_file() here.
>> 
>>> +		if (unlikely(ret))
>>> +			return ret;
>>> +
>>> +		ret = nilfs_sufile_flush_cache(nilfs->ns_sufile, 1,
>>> +					       &ndirty_blks);
>>> +		if (unlikely(ret))
>>> +			return ret;
>>> +		if (!ndirty_blks)
>>> +			break;
>>> +
>>> +		ret = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
>>> +					      &nilfs_sc_prop_ops);
>> 
>> Ditto.
>> 
>>> +		if (unlikely(ret))
>>> +			return ret;
>>> +	} while (ndirty_blks && retrycount-- > 0);
>>> +
>> 
>> Uum. This still looks to have potential for a leak of dirty block
>> collection between DAT and SUFILE, since this retry is limited by
>> a fixed retry count.
> 
> Yes, unfortunately.
> 
>> How about adding a function that temporarily turns off the live block
>> tracking and using it after this propagation loop until the log write
>> finishes?
> 
> I think this is a great idea.
> 
>> It would reduce the accuracy of the live block count, but is that enough?
>> What do you think? 
> 
> I would suggest to iterate through the loop in
> nilfs_segctor_propagate_sufile() at least once or twice, so that we can
> count most of the DAT-File blocks. After that we temporarily turn off
> the live block tracking until the end of the segment construction. This
> should only lead to small inaccuracies.

Agreed, that sounds better.

>> We have to eliminate the possibility of the leak
>> because it can cause file system corruption.  Every checkpoint must be
>> self-contained.
> 
> I didn't realize that it could cause file system corruption.
> 
>>> +	return 0;
>>> +}
>>> +
>>>  static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
>>>  {
>>>  	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
>>> @@ -1160,6 +1224,13 @@ static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
>>>  		}
>>>  		sci->sc_stage.flags |= NILFS_CF_SUFREED;
>>>  
>> 
>>> +		if (mode == SC_LSEG_SR &&
>> 
>> This test ("mode == SC_LSEG_SR") can be removed.  When the thread
>> comes here, it will always make a checkpoint.
>> 
>>> +		    nilfs_feature_track_live_blks(nilfs)) {
>>> +			err = nilfs_segctor_propagate_sufile(sci);
>>> +			if (unlikely(err))
>>> +				break;
>>> +		}
>>> +
>>>  		err = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
>>>  					      &nilfs_sc_file_ops);
>>>  		if (unlikely(err))
>>> diff --git a/fs/nilfs2/segment.h b/fs/nilfs2/segment.h
>>> index a48d6de..5aa7f91 100644
>>> --- a/fs/nilfs2/segment.h
>>> +++ b/fs/nilfs2/segment.h
>>> @@ -208,7 +208,8 @@ enum {
>>>   */
>>>  #define NILFS_SC_CLEANUP_RETRY	    3  /* Retry count of construction when
>>>  					  destroying segctord */
>>> -
>>> +#define NILFS_SC_SUFILE_PROP_RETRY  10 /* Retry count of the propagate
>>> +					  sufile loop */
>> 
>> How many times does the propagation loop have to be repeated
>> until it converges?
> 
> Most of the time it runs only once, because all the blocks are already
> dirty, but sometimes it can go on for more than 10 iterations.

Thank you for the reply.  I got the situation.

Regards,
Ryusuke Konishi.


* Re: [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out
       [not found]         ` <20150509.211741.1463241033923032068.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  2015-05-09 20:18           ` Andreas Rohner
@ 2015-05-10 11:04           ` Andreas Rohner
       [not found]             ` <554F3B32.5050004-hi6Y0CQ0nG0@public.gmane.org>
  1 sibling, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-10 11:04 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-09 14:17, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:20 +0200, Andreas Rohner wrote:
>> This patch ensures that all dirty blocks are written out if the segment
>> construction mode is SC_LSEG_SR. The scanning of the DAT file can cause
>> blocks in the SUFILE to be dirtied, and newly dirtied blocks in the
>> SUFILE can in turn dirty more blocks in the DAT file. Since one of
>> these stages has to happen before the other during segment
>> construction, we end up with unwritten dirty blocks that are lost
>> in case of a file system unmount.
>>
>> This patch introduces a new set of file scanning operations that
>> only propagate the changes to the bmap and do not add anything to the
>> segment buffer. The DAT file and SUFILE are scanned with these
>> operations. The function nilfs_sufile_flush_cache() is called in between
>> these scans with the parameter only_mark set. That way it can be called
>> repeatedly without actually writing anything to the SUFILE. If there are
>> no new blocks dirtied in the flush, the normal segment construction
>> stages can safely continue.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/segment.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  fs/nilfs2/segment.h |  3 ++-
>>  2 files changed, 74 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index 14e76c3..ab8df33 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -579,6 +579,12 @@ static int nilfs_collect_dat_data(struct nilfs_sc_info *sci,
>>  	return err;
>>  }
>>  
>> +static int nilfs_collect_prop_data(struct nilfs_sc_info *sci,
>> +				  struct buffer_head *bh, struct inode *inode)
>> +{
>> +	return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
>> +}
>> +
>>  static int nilfs_collect_dat_bmap(struct nilfs_sc_info *sci,
>>  				  struct buffer_head *bh, struct inode *inode)
>>  {
>> @@ -613,6 +619,14 @@ static struct nilfs_sc_operations nilfs_sc_dat_ops = {
>>  	.write_node_binfo = nilfs_write_dat_node_binfo,
>>  };
>>  
>> +static struct nilfs_sc_operations nilfs_sc_prop_ops = {
>> +	.collect_data = nilfs_collect_prop_data,
>> +	.collect_node = nilfs_collect_file_node,
>> +	.collect_bmap = NULL,
>> +	.write_data_binfo = NULL,
>> +	.write_node_binfo = NULL,
>> +};
>> +
>>  static struct nilfs_sc_operations nilfs_sc_dsync_ops = {
>>  	.collect_data = nilfs_collect_file_data,
>>  	.collect_node = NULL,
>> @@ -998,7 +1012,8 @@ static int nilfs_segctor_scan_file(struct nilfs_sc_info *sci,
>>  			err = nilfs_segctor_apply_buffers(
>>  				sci, inode, &data_buffers,
>>  				sc_ops->collect_data);
>> -			BUG_ON(!err); /* always receive -E2BIG or true error */
>> +			/* always receive -E2BIG or true error (NOT ANYMORE?)*/
>> +			/* BUG_ON(!err); */
>>  			goto break_or_fail;
>>  		}
>>  	}
> 
> If n > rest, this function will exit without scanning node buffers
> for nilfs_segctor_propagate_sufile().  This looks like a problem, right?
> 
> I think adding separate functions is better.  For instance,
> 
> static int nilfs_propagate_buffer(struct nilfs_sc_info *sci,
> 				  struct buffer_head *bh,
> 				  struct inode *inode)
> {
> 	return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
> }
> 
> static int nilfs_segctor_propagate_file(struct nilfs_sc_info *sci,
> 					struct inode *inode)
> {
> 	LIST_HEAD(buffers);
> 	size_t n;
> 	int ret;
> 
> 	n = nilfs_lookup_dirty_data_buffers(inode, &buffers, SIZE_MAX, 0,
> 					    LLONG_MAX);
> 	if (n > 0) {
> 		ret = nilfs_segctor_apply_buffers(sci, inode, &buffers,
> 						  nilfs_propagate_buffer);
> 		if (unlikely(ret))
> 			goto fail;
> 	}
> 
> 	nilfs_lookup_dirty_node_buffers(inode, &buffers);
> 	ret = nilfs_segctor_apply_buffers(sci, inode, &buffers,
> 					  nilfs_propagate_buffer);
> fail:
> 	return ret;
> }
> 
> With this, you can also avoid defining nilfs_sc_prop_ops, nor touching
> the BUG_ON() in nilfs_segctor_scan_file.
> 
>> @@ -1055,6 +1070,55 @@ static int nilfs_segctor_scan_file_dsync(struct nilfs_sc_info *sci,
>>  	return err;
>>  }
>>  
>> +/**
>> + * nilfs_segctor_propagate_sufile - dirties all needed SUFILE blocks
>> + * @sci: nilfs_sc_info
>> + *
>> + * Description: Dirties and propagates all SUFILE blocks that need to be
>> + * available later in the segment construction process, when the SUFILE cache
>> + * is flushed. Here the SUFILE cache is not actually flushed, but the blocks
>> + * that are needed for a later flush are marked as dirty. Since the propagation
>> + * of the SUFILE can dirty DAT entries and vice versa, the functions
>> + * are executed in a loop until no new blocks are dirtied.
>> + *
>> + * Return Value: On success, 0 is returned. On error, one of the following
>> + * negative error codes is returned.
>> + *
>> + * %-ENOMEM - Insufficient memory available.
>> + *
>> + * %-EIO - I/O error
>> + *
>> + * %-EROFS - Read only filesystem (for create mode)
>> + */
>> +static int nilfs_segctor_propagate_sufile(struct nilfs_sc_info *sci)
>> +{
>> +	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
>> +	unsigned long ndirty_blks;
>> +	int ret, retrycount = NILFS_SC_SUFILE_PROP_RETRY;
>> +
>> +	do {
>> +		/* count changes to DAT file before flush */
>> +		ret = nilfs_segctor_scan_file(sci, nilfs->ns_dat,
>> +					      &nilfs_sc_prop_ops);
> 
> Use the previous nilfs_segctor_propagate_file() here.
> 
>> +		if (unlikely(ret))
>> +			return ret;
>> +
>> +		ret = nilfs_sufile_flush_cache(nilfs->ns_sufile, 1,
>> +					       &ndirty_blks);
>> +		if (unlikely(ret))
>> +			return ret;
>> +		if (!ndirty_blks)
>> +			break;
>> +
>> +		ret = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
>> +					      &nilfs_sc_prop_ops);
> 
> Ditto.
> 
>> +		if (unlikely(ret))
>> +			return ret;
>> +	} while (ndirty_blks && retrycount-- > 0);
>> +
> 
> Uum. This still looks to have potential for a leak of dirty block
> collection between DAT and SUFILE, since this retry is limited by
> a fixed retry count.
> 
> How about adding a function that temporarily turns off the live block
> tracking and using it after this propagation loop until the log write
> finishes?
> 
> It would reduce the accuracy of the live block count, but is that enough?
> What do you think?  We have to eliminate the possibility of the leak
> because it can cause file system corruption.  Every checkpoint must be
> self-contained.

How exactly could it lead to file system corruption? Maybe I'm missing
something important here, but it seems to me that no corruption is
possible.

The nilfs_sufile_flush_cache_node() function only reads in already
existing blocks. No new blocks are created. If I mark those blocks
dirty, the btree is not changed at all. If I do not call
nilfs_bmap_propagate(), then the btree stays unchanged and there are no
dangling pointers. The resulting checkpoint should be self-contained.

The only problem would be that I could lose some nlive_blks updates.

>> +	return 0;
>> +}
>> +
>>  static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
>>  {
>>  	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
>> @@ -1160,6 +1224,13 @@ static int nilfs_segctor_collect_blocks(struct nilfs_sc_info *sci, int mode)
>>  		}
>>  		sci->sc_stage.flags |= NILFS_CF_SUFREED;
>>  
> 
>> +		if (mode == SC_LSEG_SR &&
> 
> This test ("mode == SC_LSEG_SR") can be removed.  When the thread
> comes here, it will always make a checkpoint.
> 
>> +		    nilfs_feature_track_live_blks(nilfs)) {
>> +			err = nilfs_segctor_propagate_sufile(sci);
>> +			if (unlikely(err))
>> +				break;
>> +		}
>> +
>>  		err = nilfs_segctor_scan_file(sci, nilfs->ns_sufile,
>>  					      &nilfs_sc_file_ops);
>>  		if (unlikely(err))
>> diff --git a/fs/nilfs2/segment.h b/fs/nilfs2/segment.h
>> index a48d6de..5aa7f91 100644
>> --- a/fs/nilfs2/segment.h
>> +++ b/fs/nilfs2/segment.h
>> @@ -208,7 +208,8 @@ enum {
>>   */
>>  #define NILFS_SC_CLEANUP_RETRY	    3  /* Retry count of construction when
>>  					  destroying segctord */
>> -
>> +#define NILFS_SC_SUFILE_PROP_RETRY  10 /* Retry count of the propagate
>> +					  sufile loop */
> 
> How many times does the propagation loop have to be repeated
> until it converges?
> 
> The current dirty block scanning function collects all dirty blocks of
> the specified file (i.e. SUFILE or DAT), traversing the page cache and
> making and destroying a list of dirty buffers, every time the propagation
> function is called.  It's wasteful to repeat that many times.
> 
> Regards,
> Ryusuke Konishi
> 
>>  /*
>>   * Default values of timeout, in seconds.
>>   */
>> -- 
>> 2.3.7
>>
> 



* Re: [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period
       [not found]     ` <1430647522-14304-9-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-10 18:15       ` Ryusuke Konishi
       [not found]         ` <20150511.031512.1036934606749624197.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-10 18:15 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun,  3 May 2015 12:05:21 +0200, Andreas Rohner wrote:
> The userspace GC uses the concept of a so-called protection period,
> a period of time during which actually reclaimable blocks are
> protected. If a segment is cleaned and there are blocks in it that are
> protected in this way, they have to be treated as if they were live blocks.
> 
> This is a problem for the live block tracking on the kernel side,
> because the kernel knows nothing about the protection period. This patch
> introduces new flags for the nilfs_vdesc data structure, to mark blocks
> that need to be treated as if they were alive, but must be counted as if
> they were reclaimable. There are two reasons for this to happen.
> Either a block was deleted within the protection period, or it is
> part of a snapshot.
> 
> After the blocks described by the nilfs_vdesc structures are read in,
> the flags are passed on to the buffer_heads to get the information to
> the segment construction phase. During segment construction, the live
> block tracking is adjusted accordingly.
> 
> Additionally the blocks are rechecked if they are reclaimable, since the
> last check was in userspace without the proper locking.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/dat.c           | 66 +++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nilfs2/dat.h           |  1 +
>  fs/nilfs2/ioctl.c         | 15 +++++++++++
>  fs/nilfs2/page.h          |  6 +++++
>  fs/nilfs2/segment.c       | 41 ++++++++++++++++++++++++++++-
>  include/linux/nilfs2_fs.h | 38 +++++++++++++++++++++++++--
>  6 files changed, 164 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
> index 9c2fc32..80a1905 100644
> --- a/fs/nilfs2/dat.c
> +++ b/fs/nilfs2/dat.c
> @@ -35,6 +35,17 @@
>  #define NILFS_CNO_MAX	(~(__u64)0)
>  
>  /**
> + * nilfs_dat_entry_is_live - check if @entry is alive
> + * @entry: DAT-Entry
> + *
> + * Description: Simple check if @entry is alive in the current checkpoint.
> + */
> +static int nilfs_dat_entry_is_live(struct nilfs_dat_entry *entry)
> +{
> +	return entry->de_end == cpu_to_le64(NILFS_CNO_MAX);
> +}
> +
> +/**
>   * struct nilfs_dat_info - on-memory private data of DAT file
>   * @mi: on-memory private data of metadata file
>   * @palloc_cache: persistent object allocator cache of DAT file
> @@ -387,6 +398,61 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, sector_t blocknr)
>  }
>  
>  /**
> + * nilfs_dat_is_live - checks if the virtual block number is alive
> + * @dat: DAT file inode
> + * @vblocknr: virtual block number
> + *
> + * Description: nilfs_dat_is_live() looks up the DAT-Entry for
> + * @vblocknr and determines if the corresponding block is alive in the current
> + * checkpoint or not. This check ignores snapshots and protection periods.
> + *
> + * Return Value: 1 if vblocknr is alive and 0 otherwise. On error one of the
> + * following negative error codes is returned
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + *
> + * %-ENOENT - A block number associated with @vblocknr does not exist.
> + */
> +int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr)
> +{
> +	struct buffer_head *entry_bh, *bh;
> +	struct nilfs_dat_entry *entry;
> +	sector_t blocknr;
> +	void *kaddr;
> +	int ret;
> +
> +	ret = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (!nilfs_doing_gc() && buffer_nilfs_redirected(entry_bh)) {
> +		bh = nilfs_mdt_get_frozen_buffer(dat, entry_bh);
> +		if (bh) {
> +			WARN_ON(!buffer_uptodate(bh));
> +			put_bh(entry_bh);
> +			entry_bh = bh;
> +		}
> +	}
> +
> +	kaddr = kmap_atomic(entry_bh->b_page);
> +	entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
> +	blocknr = le64_to_cpu(entry->de_blocknr);
> +	if (blocknr == 0) {
> +		ret = -ENOENT;
> +		goto out_unmap;
> +	}
> +
> +	ret = nilfs_dat_entry_is_live(entry);
> +
> +out_unmap:
> +	kunmap_atomic(kaddr);
> +	put_bh(entry_bh);
> +	return ret;
> +}
> +
> +/**
>   * nilfs_dat_translate - translate a virtual block number to a block number
>   * @dat: DAT file inode
>   * @vblocknr: virtual block number
> diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
> index cbd8e97..a95547c 100644
> --- a/fs/nilfs2/dat.h
> +++ b/fs/nilfs2/dat.h
> @@ -47,6 +47,7 @@ void nilfs_dat_commit_update(struct inode *, struct nilfs_palloc_req *,
>  			     struct nilfs_palloc_req *, int);
>  void nilfs_dat_abort_update(struct inode *, struct nilfs_palloc_req *,
>  			    struct nilfs_palloc_req *);
> +int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr);
>  
>  int nilfs_dat_mark_dirty(struct inode *, __u64);
>  int nilfs_dat_freev(struct inode *, __u64 *, size_t);
> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
> index f6ee54e..40bf74a 100644
> --- a/fs/nilfs2/ioctl.c
> +++ b/fs/nilfs2/ioctl.c
> @@ -612,6 +612,12 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
>  		brelse(bh);
>  		return -EEXIST;
>  	}
> +
> +	if (nilfs_vdesc_snapshot_protected(vdesc))
> +		set_buffer_nilfs_snapshot_protected(bh);
> +	if (nilfs_vdesc_period_protected(vdesc))
> +		set_buffer_nilfs_period_protected(bh);
> +
>  	list_add_tail(&bh->b_assoc_buffers, buffers);
>  	return 0;
>  }
> @@ -662,6 +668,15 @@ static int nilfs_ioctl_move_blocks(struct super_block *sb,
>  		}
>  
>  		do {
> +			/*
> +			 * old user space tools do not initialize vd_blk_flags;
> +			 * if vd_period.p_start > 0, then vd_blk_flags was
> +			 * not initialized properly and may contain invalid
> +			 * flags
> +			 */
> +			if (vdesc->vd_period.p_start > 0)
> +				vdesc->vd_blk_flags = 0;
> +
>  			ret = nilfs_ioctl_move_inode_block(inode, vdesc,
>  							   &buffers);
>  			if (unlikely(ret < 0)) {
> diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
> index 4e35814..4835e37 100644
> --- a/fs/nilfs2/page.h
> +++ b/fs/nilfs2/page.h
> @@ -36,6 +36,8 @@ enum {
>  	BH_NILFS_Volatile,
>  	BH_NILFS_Checked,
>  	BH_NILFS_Redirected,
> +	BH_NILFS_Snapshot_Protected,
> +	BH_NILFS_Period_Protected,
>  	BH_NILFS_Counted,
>  };
>  
> @@ -43,6 +45,10 @@ BUFFER_FNS(NILFS_Node, nilfs_node)		/* nilfs node buffers */
>  BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
>  BUFFER_FNS(NILFS_Checked, nilfs_checked)	/* buffer is verified */
>  BUFFER_FNS(NILFS_Redirected, nilfs_redirected)	/* redirected to a copy */
> +/* buffer belongs to a snapshot and is protected by it */
> +BUFFER_FNS(NILFS_Snapshot_Protected, nilfs_snapshot_protected)
> +/* protected by protection period */
> +BUFFER_FNS(NILFS_Period_Protected, nilfs_period_protected)
>  /* counted by propagate_p for segment usage */
>  BUFFER_FNS(NILFS_Counted, nilfs_counted)
>  
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index ab8df33..b476ce7 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1564,12 +1564,41 @@ static void nilfs_list_replace_buffer(struct buffer_head *old_bh,
>  	/* The caller must release old_bh */
>  }
>  
> +/**
> + * nilfs_segctor_dec_nlive_blks_gc - dec. nlive_blks for blocks of GC-Inodes
> + * @dat: dat inode
> + * @segbuf: current segment buffer
> + * @bh: current buffer head
> + *
> + * Description: nilfs_segctor_dec_nlive_blks_gc() is called if the inode to
> + * which @bh belongs is a GC-Inode. In that case it is not necessary to
> + * decrement the previous segment, because at the end of the GC process it
> + * will be freed anyway. It is however necessary to check again if the blocks
> + * are alive here, because the last check was in userspace without the proper
> + * locking. Additionally the blocks protected by the protection period should
> + * be considered reclaimable. It is assumed that @bh->b_blocknr contains
> + * a virtual block number, which is only true if @bh is part of a GC-Inode.
> + */

> +static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
> +					    struct nilfs_segment_buffer *segbuf,
> +					    struct buffer_head *bh) {
> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
> +
> +	if (!buffer_nilfs_snapshot_protected(bh) && isreclaimable)
> +		segbuf->sb_nlive_blks--;
> +	if (buffer_nilfs_snapshot_protected(bh))
> +		segbuf->sb_nsnapshot_blks++;
> +}

I have some comments on this function:

 - The position of the opening brace "{" violates the CodingStyle rule for functions.
 - buffer_nilfs_snapshot_protected() is tested twice, but this can be
   reduced as follows:

	if (buffer_nilfs_snapshot_protected(bh))
		segbuf->sb_nsnapshot_blks++;
	else if (isreclaimable)
		segbuf->sb_nlive_blks--;

 - Additionally, I prefer "reclaimable" to "isreclaimable" since it's
   simpler and still trivial.

 - The logic of isreclaimable is counterintuitive.  

> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;

   It looks like buffer_nilfs_period_protected(bh) here implies that
   the block is deleted.  But that is independent of whether the buffer
   is protected by the protection period or not.

   Why not just add a "still alive" or "deleted" flag and its
   corresponding vdesc flag instead of adding the period protected
   flag ?

   If we add the "still alive" flag, which means that the block is
   not yet deleted from the latest checkpoint, then this function
   can be simplified as follows:

static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
					    struct nilfs_segment_buffer *segbuf,
					    struct buffer_head *bh)
{
	if (buffer_nilfs_snapshot_protected(bh))
		segbuf->sb_nsnapshot_blks++;
	else if (!buffer_nilfs_still_alive(bh) ||
		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
		segbuf->sb_nlive_blks--;
}

 - The last comment: we usually expect that the first argument is a
   pointer to nilfs_sc_info struct for function nilfs_segctor_xxxx(),
   but this doesn't.  How about the following name ?

static void nilfs_segbuf_dec_nlive_blks_gc(struct nilfs_segment_buffer *segbuf,
					   struct buffer_head *bh,
				           struct inode *dat)


Regards,
Ryusuke Konishi

> +
>  static int
>  nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>  				     struct nilfs_segment_buffer *segbuf,
>  				     int mode)
>  {
> +	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
>  	struct inode *inode = NULL;
> +	struct nilfs_inode_info *ii;
>  	sector_t blocknr;
>  	unsigned long nfinfo = segbuf->sb_sum.nfinfo;
>  	unsigned long nblocks = 0, ndatablk = 0;
> @@ -1579,7 +1608,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>  	union nilfs_binfo binfo;
>  	struct buffer_head *bh, *bh_org;
>  	ino_t ino = 0;
> -	int err = 0;
> +	int err = 0, gc_inode = 0, track_live_blks;
> +
> +	track_live_blks = nilfs_feature_track_live_blks(nilfs);
>  
>  	if (!nfinfo)
>  		goto out;
> @@ -1601,6 +1632,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>  
>  			inode = bh->b_page->mapping->host;
>  
> +			ii = NILFS_I(inode);
> +			gc_inode = test_bit(NILFS_I_GCINODE, &ii->i_state);
> +
>  			if (mode == SC_LSEG_DSYNC)
>  				sc_op = &nilfs_sc_dsync_ops;
>  			else if (ino == NILFS_DAT_INO)
> @@ -1608,6 +1642,11 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>  			else /* file blocks */
>  				sc_op = &nilfs_sc_file_ops;
>  		}
> +
> +		if (track_live_blks && gc_inode)
> +			nilfs_segctor_dec_nlive_blks_gc(nilfs->ns_dat,
> +							segbuf, bh);
> +
>  		bh_org = bh;
>  		get_bh(bh_org);
>  		err = nilfs_bmap_assign(NILFS_I(inode)->i_bmap, &bh, blocknr,
> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
> index 5f05bbf..ddc98e8 100644
> --- a/include/linux/nilfs2_fs.h
> +++ b/include/linux/nilfs2_fs.h
> @@ -905,7 +905,7 @@ struct nilfs_vinfo {
>   * @vd_blocknr: disk block number
>   * @vd_offset: logical block offset inside a file
>   * @vd_flags: flags (data or node block)
> - * @vd_pad: padding
> + * @vd_blk_flags: additional flags
>   */
>  struct nilfs_vdesc {
>  	__u64 vd_ino;
> @@ -915,9 +915,43 @@ struct nilfs_vdesc {
>  	__u64 vd_blocknr;
>  	__u64 vd_offset;
>  	__u32 vd_flags;
> -	__u32 vd_pad;
> +	/*
> +	 * vd_blk_flags needed because vd_flags doesn't support
> +	 * bit-flags because of backwards compatibility
> +	 */
> +	__u32 vd_blk_flags;
>  };
>  
> +/* vdesc flags */
> +enum {
> +	NILFS_VDESC_SNAPSHOT_PROTECTED,
> +	NILFS_VDESC_PERIOD_PROTECTED,
> +
> +	/* ... */
> +
> +	__NR_NILFS_VDESC_FIELDS,
> +};
> +
> +#define NILFS_VDESC_FNS(flag, name)					\
> +static inline void							\
> +nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
> +{									\
> +	vdesc->vd_blk_flags |= (1UL << NILFS_VDESC_##flag);		\
> +}									\
> +static inline void							\
> +nilfs_vdesc_clear_##name(struct nilfs_vdesc *vdesc)			\
> +{									\
> +	vdesc->vd_blk_flags &= ~(1UL << NILFS_VDESC_##flag);		\
> +}									\
> +static inline int							\
> +nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
> +{									\
> +	return !!(vdesc->vd_blk_flags & (1UL << NILFS_VDESC_##flag));	\
> +}
> +
> +NILFS_VDESC_FNS(SNAPSHOT_PROTECTED, snapshot_protected)
> +NILFS_VDESC_FNS(PERIOD_PROTECTED, period_protected)
> +
>  /**
>   * struct nilfs_bdesc - descriptor of disk block number
>   * @bd_ino: inode number
> -- 
> 2.3.7
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period
       [not found]         ` <20150511.031512.1036934606749624197.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-10 18:23           ` Ryusuke Konishi
       [not found]             ` <20150511.032323.1250231827423193240.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  2015-05-11 13:00           ` Andreas Rohner
  1 sibling, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-10 18:23 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Mon, 11 May 2015 03:15:12 +0900 (JST), Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:21 +0200, Andreas Rohner wrote:
>> +/**
>> + * nilfs_segctor_dec_nlive_blks_gc - dec. nlive_blks for blocks of GC-Inodes
>> + * @dat: dat inode
>> + * @segbuf: current segment buffer
>> + * @bh: current buffer head
>> + *
>> + * Description: nilfs_segctor_dec_nlive_blks_gc() is called if the inode to
>> + * which @bh belongs is a GC-Inode. In that case it is not necessary to
>> + * decrement the previous segment, because at the end of the GC process it
>> + * will be freed anyway. It is however necessary to check again if the blocks
>> + * are alive here, because the last check was in userspace without the proper
>> + * locking. Additionally the blocks protected by the protection period should
>> + * be considered reclaimable. It is assumed that @bh->b_blocknr contains
>> + * a virtual block number, which is only true if @bh is part of a GC-Inode.
>> + */
> 
>> +static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
>> +					    struct nilfs_segment_buffer *segbuf,
>> +					    struct buffer_head *bh) {
>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
>> +
>> +	if (!buffer_nilfs_snapshot_protected(bh) && isreclaimable)
>> +		segbuf->sb_nlive_blks--;
>> +	if (buffer_nilfs_snapshot_protected(bh))
>> +		segbuf->sb_nsnapshot_blks++;
>> +}
> 
> I have some comments on this function:
> 
>  - The position of the opening brace "{" violates the CodingStyle rule for functions.
>  - buffer_nilfs_snapshot_protected() is tested twice, but this can be
>    reduced as follows:
> 
> 	if (buffer_nilfs_snapshot_protected(bh))
> 		segbuf->sb_nsnapshot_blks++;
> 	else if (isreclaimable)
> 		segbuf->sb_nlive_blks--;
> 
>  - Additionally, I prefer "reclaimable" to "isreclaimable" since it's
>    simpler and still trivial.
> 
>  - The logic of isreclaimable is counterintuitive.  
> 
>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
> 
>    It looks like buffer_nilfs_period_protected(bh) here implies that
>    the block is deleted.  But that is independent of whether the buffer
>    is protected by the protection period or not.
> 
>    Why not just add a "still alive" or "deleted" flag and its
>    corresponding vdesc flag instead of adding the period protected
>    flag ?
> 
>    If we add the "still alive" flag, which means that the block is
>    not yet deleted from the latest checkpoint, then this function
>    can be simplified as follows:
> 
> static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
> 					    struct nilfs_segment_buffer *segbuf,
> 					    struct buffer_head *bh)
> {
> 	if (buffer_nilfs_snapshot_protected(bh))
> 		segbuf->sb_nsnapshot_blks++;

> 	else if (!buffer_nilfs_still_alive(bh) ||
> 		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
> 		segbuf->sb_nlive_blks--;

This was wrong.  It should be:

	else if (!buffer_nilfs_still_alive(bh) &&
		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
		segbuf->sb_nlive_blks--;

Regards,
Ryusuke Konishi

> }
> 

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period
       [not found]             ` <20150511.032323.1250231827423193240.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-11  2:07               ` Ryusuke Konishi
       [not found]                 ` <20150511.110726.725667075147435663.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-11  2:07 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Mon, 11 May 2015 03:23:23 +0900 (JST), Ryusuke Konishi wrote:
> On Mon, 11 May 2015 03:15:12 +0900 (JST), Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:21 +0200, Andreas Rohner wrote:
>>> +/**
>>> + * nilfs_segctor_dec_nlive_blks_gc - dec. nlive_blks for blocks of GC-Inodes
>>> + * @dat: dat inode
>>> + * @segbuf: current segment buffer
>>> + * @bh: current buffer head
>>> + *
>>> + * Description: nilfs_segctor_dec_nlive_blks_gc() is called if the inode to
>>> + * which @bh belongs is a GC-Inode. In that case it is not necessary to
>>> + * decrement the previous segment, because at the end of the GC process it
>>> + * will be freed anyway. It is however necessary to check again if the blocks
>>> + * are alive here, because the last check was in userspace without the proper
>>> + * locking. Additionally the blocks protected by the protection period should
>>> + * be considered reclaimable. It is assumed that @bh->b_blocknr contains
>>> + * a virtual block number, which is only true if @bh is part of a GC-Inode.
>>> + */
>> 
>>> +static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
>>> +					    struct nilfs_segment_buffer *segbuf,
>>> +					    struct buffer_head *bh) {
>>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
>>> +
>>> +	if (!buffer_nilfs_snapshot_protected(bh) && isreclaimable)
>>> +		segbuf->sb_nlive_blks--;
>>> +	if (buffer_nilfs_snapshot_protected(bh))
>>> +		segbuf->sb_nsnapshot_blks++;
>>> +}
>> 
>> I have some comments on this function:
>> 
>>  - The position of the opening brace "{" violates the CodingStyle rule for functions.
>>  - buffer_nilfs_snapshot_protected() is tested twice, but this can be
>>    reduced as follows:
>> 
>> 	if (buffer_nilfs_snapshot_protected(bh))
>> 		segbuf->sb_nsnapshot_blks++;
>> 	else if (isreclaimable)
>> 		segbuf->sb_nlive_blks--;
>> 
>>  - Additionally, I prefer "reclaimable" to "isreclaimable" since it's
>>    simpler and still trivial.
>> 
>>  - The logic of isreclaimable is counterintuitive.  
>> 
>>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
>> 
>>    It looks like buffer_nilfs_period_protected(bh) here implies that
>>    the block is deleted.  But that is independent of whether the buffer
>>    is protected by the protection period or not.
>> 
>>    Why not just add a "still alive" or "deleted" flag and its
>>    corresponding vdesc flag instead of adding the period protected
>>    flag ?
>> 
>>    If we add the "still alive" flag, which means that the block is
>>    not yet deleted from the latest checkpoint, then this function
>>    can be simplified as follows:
>> 
>> static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
>> 					    struct nilfs_segment_buffer *segbuf,
>> 					    struct buffer_head *bh)
>> {
>> 	if (buffer_nilfs_snapshot_protected(bh))
>> 		segbuf->sb_nsnapshot_blks++;
> 
>> 	else if (!buffer_nilfs_still_alive(bh) ||
>> 		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
>> 		segbuf->sb_nlive_blks--;
> 
> This was wrong.  It should be:
> 
> 	else if (!buffer_nilfs_still_alive(bh) &&
> 		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
> 		segbuf->sb_nlive_blks--;

Sorry for confusing you.  I read the code again, and now feel that
the previous one (the following) was actually correct.

>> 	if (buffer_nilfs_snapshot_protected(bh))
>> 		segbuf->sb_nsnapshot_blks++;
>> 	else if (!buffer_nilfs_still_alive(bh) ||
>> 		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
>> 		segbuf->sb_nlive_blks--;

Could you confirm which logic correctly implements the algorithm that
you intended ?

Regards,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period
       [not found]                 ` <20150511.110726.725667075147435663.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-11 12:32                   ` Andreas Rohner
  0 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-11 12:32 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-11 04:07, Ryusuke Konishi wrote:
> On Mon, 11 May 2015 03:23:23 +0900 (JST), Ryusuke Konishi wrote:
>> On Mon, 11 May 2015 03:15:12 +0900 (JST), Ryusuke Konishi wrote:
>>> On Sun,  3 May 2015 12:05:21 +0200, Andreas Rohner wrote:
>>>> +/**
>>>> + * nilfs_segctor_dec_nlive_blks_gc - dec. nlive_blks for blocks of GC-Inodes
>>>> + * @dat: dat inode
>>>> + * @segbuf: current segment buffer
>>>> + * @bh: current buffer head
>>>> + *
>>>> + * Description: nilfs_segctor_dec_nlive_blks_gc() is called if the inode to
>>>> + * which @bh belongs is a GC-Inode. In that case it is not necessary to
>>>> + * decrement the previous segment, because at the end of the GC process it
>>>> + * will be freed anyway. It is however necessary to check again if the blocks
>>>> + * are alive here, because the last check was in userspace without the proper
>>>> + * locking. Additionally the blocks protected by the protection period should
>>>> + * be considered reclaimable. It is assumed that @bh->b_blocknr contains
>>>> + * a virtual block number, which is only true if @bh is part of a GC-Inode.
>>>> + */
>>>
>>>> +static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
>>>> +					    struct nilfs_segment_buffer *segbuf,
>>>> +					    struct buffer_head *bh) {
>>>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>>>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
>>>> +
>>>> +	if (!buffer_nilfs_snapshot_protected(bh) && isreclaimable)
>>>> +		segbuf->sb_nlive_blks--;
>>>> +	if (buffer_nilfs_snapshot_protected(bh))
>>>> +		segbuf->sb_nsnapshot_blks++;
>>>> +}
>>>
>>> I have some comments on this function:
>>>
>>>  - The position of the opening brace "{" violates the CodingStyle rule for functions.
>>>  - buffer_nilfs_snapshot_protected() is tested twice, but this can be
>>>    reduced as follows:
>>>
>>> 	if (buffer_nilfs_snapshot_protected(bh))
>>> 		segbuf->sb_nsnapshot_blks++;
>>> 	else if (isreclaimable)
>>> 		segbuf->sb_nlive_blks--;
>>>
>>>  - Additionally, I prefer "reclaimable" to "isreclaimable" since it's
>>>    simpler and still trivial.
>>>
>>>  - The logic of isreclaimable is counterintuitive.  
>>>
>>>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>>>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
>>>
>>>    It looks like buffer_nilfs_period_protected(bh) here implies that
>>>    the block is deleted.  But that is independent of whether the buffer
>>>    is protected by the protection period or not.
>>>
>>>    Why not just add a "still alive" or "deleted" flag and its
>>>    corresponding vdesc flag instead of adding the period protected
>>>    flag ?
>>>
>>>    If we add the "still alive" flag, which means that the block is
>>>    not yet deleted from the latest checkpoint, then this function
>>>    can be simplified as follows:
>>>
>>> static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
>>> 					    struct nilfs_segment_buffer *segbuf,
>>> 					    struct buffer_head *bh)
>>> {
>>> 	if (buffer_nilfs_snapshot_protected(bh))
>>> 		segbuf->sb_nsnapshot_blks++;
>>
>>> 	else if (!buffer_nilfs_still_alive(bh) ||
>>> 		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
>>> 		segbuf->sb_nlive_blks--;
>>
>> This was wrong.  It should be:
>>
>> 	else if (!buffer_nilfs_still_alive(bh) &&
>> 		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
>> 		segbuf->sb_nlive_blks--;
> 
> Sorry for confusing you.  I read the code again, and now feel that
> the previous one (the following) was actually correct.
> 
>>> 	if (buffer_nilfs_snapshot_protected(bh))
>>> 		segbuf->sb_nsnapshot_blks++;
>>> 	else if (!buffer_nilfs_still_alive(bh) ||
>>> 		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
>>> 		segbuf->sb_nlive_blks--;
> 
> Could you confirm which logic correctly implements the algorithm that
> you intended ?

This one is correct. We only have to call nilfs_dat_is_live() if
userspace has flagged the block as still alive, because
nilfs_dat_is_live() is intended to confirm that a block really is live.
If we already know from userspace that a block is dead/reclaimable, we
do not have to check it again.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 




^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period
       [not found]         ` <20150511.031512.1036934606749624197.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  2015-05-10 18:23           ` Ryusuke Konishi
@ 2015-05-11 13:00           ` Andreas Rohner
       [not found]             ` <5550A7FC.4050709-hi6Y0CQ0nG0@public.gmane.org>
  1 sibling, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-11 13:00 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-10 20:15, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:21 +0200, Andreas Rohner wrote:
>> The userspace GC uses the concept of a so-called protection period,
>> which is a period of time in which actually reclaimable blocks are
>> protected. If a segment is cleaned and there are blocks in it that are
>> protected by this, they have to be treated as if they were live blocks.
>>
>> This is a problem for the live block tracking on the kernel side,
>> because the kernel knows nothing about the protection period. This patch
>> introduces new flags for the nilfs_vdesc data structure, to mark blocks
>> that need to be treated as if they were alive, but must be counted as if
>> they were reclaimable. There are two reasons for this to happen.
>> Either a block was deleted within the protection period, or it is
>> part of a snapshot.
>>
>> After the blocks described by the nilfs_vdesc structures are read in,
>> the flags are passed on to the buffer_heads to get the information to
>> the segment construction phase. During segment construction, the live
>> block tracking is adjusted accordingly.
>>
>> Additionally, the blocks are rechecked to see if they are reclaimable, since the
>> last check was in userspace without the proper locking.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/dat.c           | 66 +++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nilfs2/dat.h           |  1 +
>>  fs/nilfs2/ioctl.c         | 15 +++++++++++
>>  fs/nilfs2/page.h          |  6 +++++
>>  fs/nilfs2/segment.c       | 41 ++++++++++++++++++++++++++++-
>>  include/linux/nilfs2_fs.h | 38 +++++++++++++++++++++++++--
>>  6 files changed, 164 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
>> index 9c2fc32..80a1905 100644
>> --- a/fs/nilfs2/dat.c
>> +++ b/fs/nilfs2/dat.c
>> @@ -35,6 +35,17 @@
>>  #define NILFS_CNO_MAX	(~(__u64)0)
>>  
>>  /**
>> + * nilfs_dat_entry_is_live - check if @entry is alive
>> + * @entry: DAT-Entry
>> + *
>> + * Description: Simple check if @entry is alive in the current checkpoint.
>> + */
>> +static int nilfs_dat_entry_is_live(struct nilfs_dat_entry *entry)
>> +{
>> +	return entry->de_end == cpu_to_le64(NILFS_CNO_MAX);
>> +}
>> +
>> +/**
>>   * struct nilfs_dat_info - on-memory private data of DAT file
>>   * @mi: on-memory private data of metadata file
>>   * @palloc_cache: persistent object allocator cache of DAT file
>> @@ -387,6 +398,61 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, sector_t blocknr)
>>  }
>>  
>>  /**
>> + * nilfs_dat_is_live - checks if the virtual block number is alive
>> + * @dat: DAT file inode
>> + * @vblocknr: virtual block number
>> + *
>> + * Description: nilfs_dat_is_live() looks up the DAT-Entry for
>> + * @vblocknr and determines if the corresponding block is alive in the current
>> + * checkpoint or not. This check ignores snapshots and protection periods.
>> + *
>> + * Return Value: 1 if vblocknr is alive and 0 otherwise. On error one of the
>> + * following negative error codes is returned
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + *
>> + * %-ENOENT - A block number associated with @vblocknr does not exist.
>> + */
>> +int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr)
>> +{
>> +	struct buffer_head *entry_bh, *bh;
>> +	struct nilfs_dat_entry *entry;
>> +	sector_t blocknr;
>> +	void *kaddr;
>> +	int ret;
>> +
>> +	ret = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	if (!nilfs_doing_gc() && buffer_nilfs_redirected(entry_bh)) {
>> +		bh = nilfs_mdt_get_frozen_buffer(dat, entry_bh);
>> +		if (bh) {
>> +			WARN_ON(!buffer_uptodate(bh));
>> +			put_bh(entry_bh);
>> +			entry_bh = bh;
>> +		}
>> +	}
>> +
>> +	kaddr = kmap_atomic(entry_bh->b_page);
>> +	entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
>> +	blocknr = le64_to_cpu(entry->de_blocknr);
>> +	if (blocknr == 0) {
>> +		ret = -ENOENT;
>> +		goto out_unmap;
>> +	}
>> +
>> +	ret = nilfs_dat_entry_is_live(entry);
>> +
>> +out_unmap:
>> +	kunmap_atomic(kaddr);
>> +	put_bh(entry_bh);
>> +	return ret;
>> +}
>> +
>> +/**
>>   * nilfs_dat_translate - translate a virtual block number to a block number
>>   * @dat: DAT file inode
>>   * @vblocknr: virtual block number
>> diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
>> index cbd8e97..a95547c 100644
>> --- a/fs/nilfs2/dat.h
>> +++ b/fs/nilfs2/dat.h
>> @@ -47,6 +47,7 @@ void nilfs_dat_commit_update(struct inode *, struct nilfs_palloc_req *,
>>  			     struct nilfs_palloc_req *, int);
>>  void nilfs_dat_abort_update(struct inode *, struct nilfs_palloc_req *,
>>  			    struct nilfs_palloc_req *);
>> +int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr);
>>  
>>  int nilfs_dat_mark_dirty(struct inode *, __u64);
>>  int nilfs_dat_freev(struct inode *, __u64 *, size_t);
>> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
>> index f6ee54e..40bf74a 100644
>> --- a/fs/nilfs2/ioctl.c
>> +++ b/fs/nilfs2/ioctl.c
>> @@ -612,6 +612,12 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
>>  		brelse(bh);
>>  		return -EEXIST;
>>  	}
>> +
>> +	if (nilfs_vdesc_snapshot_protected(vdesc))
>> +		set_buffer_nilfs_snapshot_protected(bh);
>> +	if (nilfs_vdesc_period_protected(vdesc))
>> +		set_buffer_nilfs_period_protected(bh);
>> +
>>  	list_add_tail(&bh->b_assoc_buffers, buffers);
>>  	return 0;
>>  }
>> @@ -662,6 +668,15 @@ static int nilfs_ioctl_move_blocks(struct super_block *sb,
>>  		}
>>  
>>  		do {
>> +			/*
>> +			 * old user space tools do not initialize vd_blk_flags;
>> +			 * if vd_period.p_start > 0, then vd_blk_flags was
>> +			 * not initialized properly and may contain invalid
>> +			 * flags
>> +			 */
>> +			if (vdesc->vd_period.p_start > 0)
>> +				vdesc->vd_blk_flags = 0;
>> +
>>  			ret = nilfs_ioctl_move_inode_block(inode, vdesc,
>>  							   &buffers);
>>  			if (unlikely(ret < 0)) {
>> diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
>> index 4e35814..4835e37 100644
>> --- a/fs/nilfs2/page.h
>> +++ b/fs/nilfs2/page.h
>> @@ -36,6 +36,8 @@ enum {
>>  	BH_NILFS_Volatile,
>>  	BH_NILFS_Checked,
>>  	BH_NILFS_Redirected,
>> +	BH_NILFS_Snapshot_Protected,
>> +	BH_NILFS_Period_Protected,
>>  	BH_NILFS_Counted,
>>  };
>>  
>> @@ -43,6 +45,10 @@ BUFFER_FNS(NILFS_Node, nilfs_node)		/* nilfs node buffers */
>>  BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
>>  BUFFER_FNS(NILFS_Checked, nilfs_checked)	/* buffer is verified */
>>  BUFFER_FNS(NILFS_Redirected, nilfs_redirected)	/* redirected to a copy */
>> +/* buffer belongs to a snapshot and is protected by it */
>> +BUFFER_FNS(NILFS_Snapshot_Protected, nilfs_snapshot_protected)
>> +/* protected by protection period */
>> +BUFFER_FNS(NILFS_Period_Protected, nilfs_period_protected)
>>  /* counted by propagate_p for segment usage */
>>  BUFFER_FNS(NILFS_Counted, nilfs_counted)
>>  
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index ab8df33..b476ce7 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -1564,12 +1564,41 @@ static void nilfs_list_replace_buffer(struct buffer_head *old_bh,
>>  	/* The caller must release old_bh */
>>  }
>>  
>> +/**
>> + * nilfs_segctor_dec_nlive_blks_gc - dec. nlive_blks for blocks of GC-Inodes
>> + * @dat: dat inode
>> + * @segbuf: current segment buffer
>> + * @bh: current buffer head
>> + *
>> + * Description: nilfs_segctor_dec_nlive_blks_gc() is called if the inode to
>> + * which @bh belongs is a GC-Inode. In that case it is not necessary to
>> + * decrement the previous segment, because at the end of the GC process it
>> + * will be freed anyway. It is however necessary to check again if the blocks
>> + * are alive here, because the last check was in userspace without the proper
>> + * locking. Additionally the blocks protected by the protection period should
>> + * be considered reclaimable. It is assumed that @bh->b_blocknr contains
>> + * a virtual block number, which is only true if @bh is part of a GC-Inode.
>> + */
> 
>> +static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
>> +					    struct nilfs_segment_buffer *segbuf,
>> +					    struct buffer_head *bh) {
>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
>> +
>> +	if (!buffer_nilfs_snapshot_protected(bh) && isreclaimable)
>> +		segbuf->sb_nlive_blks--;
>> +	if (buffer_nilfs_snapshot_protected(bh))
>> +		segbuf->sb_nsnapshot_blks++;
>> +}
> 
> I have some comments on this function:
> 
>  - The position of the brace "{" violates a CodingStyle rule of function.

I agree. Sorry for that stupid mistake.

>  - buffer_nilfs_snapshot_protected() is tested twice, but this can be
>    reduced as follows:
> 
> 	if (buffer_nilfs_snapshot_protected(bh))
> 		segbuf->sb_nsnapshot_blks++;
> 	else if (isreclaimable)
> 		segbuf->sb_nlive_blks--;

I agree.

>  - Additionally, I prefer "reclaimable" to "isreclaimable" since it's
>    simpler and still trivial.

Ok.

>  - The logic of isreclaimable is counterintuitive.  
> 
>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
> 
>    It looks like buffer_nilfs_period_protected(bh) here implies that
>    the block is deleted.  But it's independent from the buffer is
>    protected by protection_period or not.

It is not independent. buffer_nilfs_period_protected() is only set for
deleted/reclaimable blocks that have to be copied because of the
protection period. So if the flag is set, then the block is always
reclaimable.

>    Why not just adding "still alive" or "deleted" flag and its
>    corresponding vdesc flag instead of adding the period protected
>    flag ?
> 
>    If we add the "still alive" flag, which means that the block is
>    not yet deleted from the latest checkpoint, then this function
>    can be simplified as follows:

I think buffer_nilfs_period_protected(bh) is a better name.

It does not mark all blocks within the protection period. Live blocks
within the protection period do not have this flag set. It marks exactly
those blocks that are dead and reclaimable but protected from being
discarded by the protection period. The protection period is key.
Without the protection period those blocks would not have been copied.
That is the exact meaning of the flag, and I think the name fits quite well.

> static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
> 					    struct nilfs_segment_buffer *segbuf,
> 					    struct buffer_head *bh)
> {
> 	if (buffer_nilfs_snapshot_protected(bh))
> 		segbuf->sb_nsnapshot_blks++;
> 	else if (!buffer_nilfs_still_alive(bh) ||
> 		 nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
> 		segbuf->sb_nlive_blks--;
> }

I can still simplify it like that:

	if (buffer_nilfs_snapshot_protected(bh))
		segbuf->sb_nsnapshot_blks++;
	else if (buffer_nilfs_period_protected(bh) ||
			nilfs_dat_is_live(dat, bh->b_blocknr) <= 0)
		segbuf->sb_nlive_blks--;

>  - The last comment: we usually expect that the first argument is a
>    pointer to nilfs_sc_info struct for function nilfs_segctor_xxxx(),
>    but this doesn't.  How about the following name ?
> 
> static void nilfs_segbuf_dec_nlive_blks_gc(struct nilfs_segment_buffer *segbuf,
> 					   struct buffer_head *bh,
> 				           struct inode *dat)

Yes I agree that looks better.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
>> +
>>  static int
>>  nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>>  				     struct nilfs_segment_buffer *segbuf,
>>  				     int mode)
>>  {
>> +	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
>>  	struct inode *inode = NULL;
>> +	struct nilfs_inode_info *ii;
>>  	sector_t blocknr;
>>  	unsigned long nfinfo = segbuf->sb_sum.nfinfo;
>>  	unsigned long nblocks = 0, ndatablk = 0;
>> @@ -1579,7 +1608,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>>  	union nilfs_binfo binfo;
>>  	struct buffer_head *bh, *bh_org;
>>  	ino_t ino = 0;
>> -	int err = 0;
>> +	int err = 0, gc_inode = 0, track_live_blks;
>> +
>> +	track_live_blks = nilfs_feature_track_live_blks(nilfs);
>>  
>>  	if (!nfinfo)
>>  		goto out;
>> @@ -1601,6 +1632,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>>  
>>  			inode = bh->b_page->mapping->host;
>>  
>> +			ii = NILFS_I(inode);
>> +			gc_inode = test_bit(NILFS_I_GCINODE, &ii->i_state);
>> +
>>  			if (mode == SC_LSEG_DSYNC)
>>  				sc_op = &nilfs_sc_dsync_ops;
>>  			else if (ino == NILFS_DAT_INO)
>> @@ -1608,6 +1642,11 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>>  			else /* file blocks */
>>  				sc_op = &nilfs_sc_file_ops;
>>  		}
>> +
>> +		if (track_live_blks && gc_inode)
>> +			nilfs_segctor_dec_nlive_blks_gc(nilfs->ns_dat,
>> +							segbuf, bh);
>> +
>>  		bh_org = bh;
>>  		get_bh(bh_org);
>>  		err = nilfs_bmap_assign(NILFS_I(inode)->i_bmap, &bh, blocknr,
>> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
>> index 5f05bbf..ddc98e8 100644
>> --- a/include/linux/nilfs2_fs.h
>> +++ b/include/linux/nilfs2_fs.h
>> @@ -905,7 +905,7 @@ struct nilfs_vinfo {
>>   * @vd_blocknr: disk block number
>>   * @vd_offset: logical block offset inside a file
>>   * @vd_flags: flags (data or node block)
>> - * @vd_pad: padding
>> + * @vd_blk_flags: additional flags
>>   */
>>  struct nilfs_vdesc {
>>  	__u64 vd_ino;
>> @@ -915,9 +915,43 @@ struct nilfs_vdesc {
>>  	__u64 vd_blocknr;
>>  	__u64 vd_offset;
>>  	__u32 vd_flags;
>> -	__u32 vd_pad;
>> +	/*
>> +	 * vd_blk_flags needed because vd_flags doesn't support
>> +	 * bit-flags because of backwards compatibility
>> +	 */
>> +	__u32 vd_blk_flags;
>>  };
>>  
>> +/* vdesc flags */
>> +enum {
>> +	NILFS_VDESC_SNAPSHOT_PROTECTED,
>> +	NILFS_VDESC_PERIOD_PROTECTED,
>> +
>> +	/* ... */
>> +
>> +	__NR_NILFS_VDESC_FIELDS,
>> +};
>> +
>> +#define NILFS_VDESC_FNS(flag, name)					\
>> +static inline void							\
>> +nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
>> +{									\
>> +	vdesc->vd_blk_flags |= (1UL << NILFS_VDESC_##flag);		\
>> +}									\
>> +static inline void							\
>> +nilfs_vdesc_clear_##name(struct nilfs_vdesc *vdesc)			\
>> +{									\
>> +	vdesc->vd_blk_flags &= ~(1UL << NILFS_VDESC_##flag);		\
>> +}									\
>> +static inline int							\
>> +nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
>> +{									\
>> +	return !!(vdesc->vd_blk_flags & (1UL << NILFS_VDESC_##flag));	\
>> +}
>> +
>> +NILFS_VDESC_FNS(SNAPSHOT_PROTECTED, snapshot_protected)
>> +NILFS_VDESC_FNS(PERIOD_PROTECTED, period_protected)
>> +
>>  /**
>>   * struct nilfs_bdesc - descriptor of disk block number
>>   * @bd_ino: inode number
>> -- 
>> 2.3.7
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period
       [not found]             ` <5550A7FC.4050709-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-12 14:31               ` Ryusuke Konishi
       [not found]                 ` <20150512.233126.2206330706583570566.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-12 14:31 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Mon, 11 May 2015 15:00:44 +0200, Andreas Rohner wrote:
> On 2015-05-10 20:15, Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:21 +0200, Andreas Rohner wrote:
>>  - The logic of isreclaimable is counterintuitive.  
>> 
>>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
>> 
>>    It looks like buffer_nilfs_period_protected(bh) here implies that
>>    the block is deleted.  But it's independent from the buffer is
>>    protected by protection_period or not.
> 
> It is not independent. buffer_nilfs_period_protected() is only set for
> deleted/reclaimable blocks that have to be copied because of the
> protection period. So if the flag is set, then the block is always
> reclaimable.

That is why it's confusing.
Recall that we reclaim neither deleted blocks nor alive blocks
if they are protected by protection_period.

This naming and logic, on the contrary, treats period_protected blocks as
"reclaimable".  That's the principal cause of the confusion.

> 
>>    Why not just adding "still alive" or "deleted" flag and its
>>    corresponding vdesc flag instead of adding the period protected
>>    flag ?
>> 
>>    If we add the "still alive" flag, which means that the block is
>>    not yet deleted from the latest checkpoint, then this function
>>    can be simplified as follows:
> 
> I think buffer_nilfs_period_protected(bh) is a better name.
> 
> It does not mark all blocks within the protection period. Live blocks
> within the protection period do not have this flag set.

I know.  That is my discussion point.

> It marks exactly
> those blocks that are dead and reclaimable but protected from being
> discarded by the protection period.
> The protection period is key.
> Without the protection period those blocks would not have been copied.
> That is the exact meaning of the flag, and I think the name fits quite well.

I am saying that the flagging is confusing.  It's neither simple nor
clear.

Copied blocks by GC can have some properties.  For instance,

 1. snapshot protected  (guarded by one or more snapshots)
 2. period protected  (guarded by protection_period)
 3. deleted  (protected but isn't surviving on the latest checkpoint)

Among these, the property 2 does not relate to your live block
counting algorithm; your live block counting algorithm uses
property 1 and property 3 in reality.

If the algorithm counted a buffer marked "period protected" as alive
and prevented the nlive_blks count from being decremented, then I would
agree it "uses" property 2.  But this patch doesn't.

You are giving property 2 to the period protection flag and making it
imply property 3, which is intrinsically independent of property 2.
And you are then using the implication (property 3) instead of the
flag's nominal property 2.

Please think about it once again.

Regards,
Ryusuke Konishi


* Re: [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period
       [not found]                 ` <20150512.233126.2206330706583570566.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-12 15:37                   ` Andreas Rohner
  0 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-05-12 15:37 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-12 16:31, Ryusuke Konishi wrote:
> On Mon, 11 May 2015 15:00:44 +0200, Andreas Rohner wrote:
>> On 2015-05-10 20:15, Ryusuke Konishi wrote:
>>> On Sun,  3 May 2015 12:05:21 +0200, Andreas Rohner wrote:
>>>  - The logic of isreclaimable is counterintuitive.  
>>>
>>>> +	bool isreclaimable = buffer_nilfs_period_protected(bh) ||
>>>> +				nilfs_dat_is_live(dat, bh->b_blocknr) <= 0;
>>>
>>>    It looks like buffer_nilfs_period_protected(bh) here implies that
>>>    the block is deleted.  But it's independent from the buffer is
>>>    protected by protection_period or not.
>>
>> It is not independent. buffer_nilfs_period_protected() is only set for
>> deleted/reclaimable blocks that have to be copied because of the
>> protection period. So if the flag is set, then the block is always
>> reclaimable.
> 
> That is why it's confusing.
> Recall that we reclaim neither deleted blocks nor alive blocks
> if they are protected by protection_period.
> 
> This naming and logic, on the contrary, treats period_protected blocks as
> "reclaimable".  That's the principal cause of the confusion.
> 
>>
>>>    Why not just adding "still alive" or "deleted" flag and its
>>>    corresponding vdesc flag instead of adding the period protected
>>>    flag ?
>>>
>>>    If we add the "still alive" flag, which means that the block is
>>>    not yet deleted from the latest checkpoint, then this function
>>>    can be simplified as follows:
>>
>> I think buffer_nilfs_period_protected(bh) is a better name.
>>
>> It does not mark all blocks within the protection period. Live blocks
>> within the protection period do not have this flag set.
> 
> I know.  That is my discussion point.
> 
>> It marks exactly
>> those blocks that are dead and reclaimable but protected from being
>> discarded by the protection period.
>> The protection period is key.
>> Without the protection period those blocks would not have been copied.
>> That is the exact meaning of the flag, and I think the name fits quite well.
> 
> I say that the flagging is confusing.  It's not simple (nor clear) at
> all.
> 
> Copied blocks by GC can have some properties.  For instance,
> 
>  1. snapshot protected  (guarded by one or more snapshots)
>  2. period protected  (guarded by protection_period)
>  3. deleted  (protected but isn't surviving on the latest checkpoint)
> 
> Among these, the property 2 does not relate to your live block
> counting algorithm; your live block counting algorithm uses
> property 1 and property 3 in reality.
> 
> If the algorithm counted a buffer marked "period protected" as alive
> and prevented the nlive_blks count from being decremented, then I would
> agree it "uses" property 2.  But this patch doesn't.
> 
> You are giving property 2 to the period protection flag and making it
> imply property 3, which is intrinsically independent of property 2.
> And you are then using the implication (property 3) instead of the
> flag's nominal property 2.
> 
> Please think about it once again.

Thank you for the detailed explanation of your reasoning. I can see now
that it could be confusing. Maybe I have just gotten used to calling it
"period_protected". I will change the name to "deleted" in the next
version of the patch.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 



* Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]     ` <1430647522-14304-10-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-20 14:43       ` Ryusuke Konishi
       [not found]         ` <20150520.234335.542615158366069430.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-20 14:43 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
> It doesn't really matter if the number of reclaimable blocks for a
> segment is inaccurate, as long as the overall performance is better than
> the simple timestamp algorithm and starvation is prevented.
> 
> The following steps will lead to starvation of a segment:
> 
> 1. The segment is written
> 2. A snapshot is created
> 3. The files in the segment are deleted and the number of live
>    blocks for the segment is decremented to a very low value
> 4. The GC tries to free the segment, but there are no reclaimable
>    blocks, because they are all protected by the snapshot. To prevent an
>    infinite loop the GC has to adjust the number of live blocks to the
>    correct value.
> 5. The snapshot is converted to a checkpoint and the blocks in the
>    segment are now reclaimable.
> 6. The GC will never attempt to clean the segment again, because it
>    looks as if it had a high number of live blocks.
> 
> To prevent this, the already existing padding field of the SUFILE entry
> is used to track the number of snapshot blocks in the segment. This
> number is only set by the GC, since it collects the necessary
> information anyway. So there is no need to track which block belongs to
> which segment. In step 4 of the list above the GC will set the new field
> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
> reduced.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>

I still don't know whether this workaround is the approach we should
take or not.  This patch has several drawbacks:

 1. It introduces overhead to every "chcp cp" operation
    due to the traversal rewrite of the sufile.
    If the ratio of snapshot-protected blocks is high, then
    this overhead will be big.

 2. The traversal rewrite of the sufile will cause many sufile blocks to
    be written out.  If most blocks are protected by a snapshot,
    more than 4MB of sufile blocks will be written per 1TB of capacity.

    Even though this rewrite may not happen for consecutive "chcp cp"
    operations, it still has the potential to create many dirty sufile
    blocks if the application using nilfs manipulates snapshots frequently.

 3. The threshold "max_segblks" is hard-coded to 50% of
    blocks_per_segment.  It is not clear whether this ratio is a good
    (versatile) choice.

I will add comments inline below.

> ---
>  fs/nilfs2/ioctl.c  | 50 +++++++++++++++++++++++++++++++-
>  fs/nilfs2/sufile.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nilfs2/sufile.h |  3 ++
>  3 files changed, 137 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
> index 40bf74a..431725f 100644
> --- a/fs/nilfs2/ioctl.c
> +++ b/fs/nilfs2/ioctl.c
> @@ -200,6 +200,49 @@ static int nilfs_ioctl_getversion(struct inode *inode, void __user *argp)
>  }
>  
>  /**
> + * nilfs_ioctl_fix_starving_segs - fix potentially starving segments
> + * @nilfs: nilfs object
> + * @inode: inode object
> + *
> + * Description: Scans for segments, which are potentially starving and
> + * reduces the number of live blocks to less than half of the maximum
> + * number of blocks in a segment. This requires a scan of the whole SUFILE,
> + * which can take a long time on certain devices and under certain conditions.
> + * To avoid blocking other file system operations for too long the SUFILE is
> + * scanned in steps of NILFS_SUFILE_STARVING_SEGS_STEP. After each step the
> + * locks are released and cond_resched() is called.
> + *
> + * Return Value: On success, 0 is returned and on error, one of the
> + * following negative error codes is returned.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + */

> +static int nilfs_ioctl_fix_starving_segs(struct the_nilfs *nilfs,
> +					 struct inode *inode) {

This "inode" argument is meaningless for this routine.
Consider passing "sb" instead.

The function name "fix starving segs" feels odd to me.  It suggests
a workaround rather than a solution to the root problem of GC in nilfs.
What this patch actually does is "calibrate" the live block
count.

> +	struct nilfs_transaction_info ti;

> +	unsigned long i, nsegs = nilfs_sufile_get_nsegments(nilfs->ns_sufile);

nsegs is set outside the transaction lock.

Since the file system can be resized (either shrunk or extended)
outside the lock, nsegs must be initialized or updated in the
section where the transaction lock is held.

> +	int ret = 0;
> +
> +	for (i = 0; i < nsegs; i += NILFS_SUFILE_STARVING_SEGS_STEP) {
> +		nilfs_transaction_begin(inode->i_sb, &ti, 0);
> +
> +		ret = nilfs_sufile_fix_starving_segs(nilfs->ns_sufile, i,
> +				NILFS_SUFILE_STARVING_SEGS_STEP);
> +		if (unlikely(ret < 0)) {
> +			nilfs_transaction_abort(inode->i_sb);
> +			break;
> +		}
> +
> +		nilfs_transaction_commit(inode->i_sb); /* never fails */
> +		cond_resched();
> +	}
> +
> +	return ret;
> +}
> +
> +/**
>   * nilfs_ioctl_change_cpmode - change checkpoint mode (checkpoint/snapshot)
>   * @inode: inode object
>   * @filp: file object
> @@ -224,7 +267,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>  	struct the_nilfs *nilfs = inode->i_sb->s_fs_info;
>  	struct nilfs_transaction_info ti;
>  	struct nilfs_cpmode cpmode;
> -	int ret;
> +	int ret, is_snapshot;
>  
>  	if (!capable(CAP_SYS_ADMIN))
>  		return -EPERM;
> @@ -240,6 +283,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>  	mutex_lock(&nilfs->ns_snapshot_mount_mutex);
>  
>  	nilfs_transaction_begin(inode->i_sb, &ti, 0);
> +	is_snapshot = nilfs_cpfile_is_snapshot(nilfs->ns_cpfile, cpmode.cm_cno);
>  	ret = nilfs_cpfile_change_cpmode(
>  		nilfs->ns_cpfile, cpmode.cm_cno, cpmode.cm_mode);
>  	if (unlikely(ret < 0))
> @@ -248,6 +292,10 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>  		nilfs_transaction_commit(inode->i_sb); /* never fails */
>  
>  	mutex_unlock(&nilfs->ns_snapshot_mount_mutex);
> +

> +	if (is_snapshot > 0 && cpmode.cm_mode == NILFS_CHECKPOINT &&
> +			nilfs_feature_track_live_blks(nilfs))
> +		ret = nilfs_ioctl_fix_starving_segs(nilfs, inode);

Should we really use this return value?
It is unrelated to the success or failure of the "chcp" operation itself.

nilfs_ioctl_fix_starving_segs() is called every time "chcp cp" is
called.  I prefer to delay this extra work with a workqueue and to
skip starting a new work if the previous work is still running.

>  out:
>  	mnt_drop_write_file(filp);
>  	return ret;
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index 9cd8820d..47e2c05 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -1215,6 +1215,91 @@ out_sem:
>  }
>  
>  /**
> + * nilfs_sufile_fix_starving_segs - fix potentially starving segments
> + * @sufile: inode of segment usage file
> + * @segnum: segnum to start
> + * @nsegs: number of segments to check
> + *
> + * Description: Scans for segments, which are potentially starving and
> + * reduces the number of live blocks to less than half of the maximum
> + * number of blocks in a segment. This way the segment is more likely to be
> + * chosen by the GC. A segment is marked as potentially starving, if more
> + * than half of the blocks it contains are protected by snapshots.
> + *
> + * Return Value: On success, 0 is returned and on error, one of the
> + * following negative error codes is returned.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + */
> +int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
> +				   __u64 nsegs)
> +{
> +	struct buffer_head *su_bh;
> +	struct nilfs_segment_usage *su;
> +	size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
> +	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
> +	void *kaddr;
> +	unsigned long maxnsegs, segusages_per_block;
> +	__u32 max_segblks = nilfs->ns_blocks_per_segment >> 1;
> +	int ret = 0, blkdirty, dirty = 0;
> +
> +	down_write(&NILFS_MDT(sufile)->mi_sem);
> +

> +	maxnsegs = nilfs_sufile_get_nsegments(sufile);
> +	segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
> +	nsegs += segnum;
> +	if (nsegs > maxnsegs)
> +		nsegs = maxnsegs;
> +
> +	while (segnum < nsegs) {

This local variable "nsegs" is used as an (exclusive) end segment number,
which is confusing.  You should define a separate "end" variable.
It can be simply calculated by:

    end = min_t(__u64, segnum + nsegs, nilfs_sufile_get_nsegments(sufile));

("maxnsegs" can be removed.)

Note that each argument is never evaluated twice in the min_t() macro,
since min_t() stores the evaluation results in hidden local variables
and uses those for the comparison.

Regards,
Ryusuke Konishi


> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
> +							 nsegs - 1);
> +
> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
> +							   0, &su_bh);
> +		if (ret < 0) {
> +			if (ret != -ENOENT)
> +				goto out;
> +			/* hole */
> +			segnum += n;
> +			continue;
> +		}
> +
> +		kaddr = kmap_atomic(su_bh->b_page);
> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
> +							  su_bh, kaddr);
> +		blkdirty = 0;
> +		for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
> +			if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
> +				continue;
> +			if (le32_to_cpu(su->su_nlive_blks) <= max_segblks)
> +				continue;
> +
> +			su->su_nlive_blks = cpu_to_le32(max_segblks);
> +			su->su_nsnapshot_blks = cpu_to_le32(max_segblks);
> +			blkdirty = 1;
> +		}
> +
> +		kunmap_atomic(kaddr);
> +		if (blkdirty) {
> +			mark_buffer_dirty(su_bh);
> +			dirty = 1;
> +		}
> +		put_bh(su_bh);
> +		cond_resched();
> +	}
> +
> +out:
> +	if (dirty)
> +		nilfs_mdt_mark_dirty(sufile);
> +
> +	up_write(&NILFS_MDT(sufile)->mi_sem);
> +	return ret;
> +}
> +
> +/**
>   * nilfs_sufile_alloc_cache_node - allocate and insert a new cache node
>   * @sufile: inode of segment usage file
>   * @group: group to allocate a node for
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index 3466abb..f11e3e6 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -30,6 +30,7 @@
>  
>  #define NILFS_SUFILE_CACHE_NODE_SHIFT	6
>  #define NILFS_SUFILE_CACHE_NODE_COUNT	(1 << NILFS_SUFILE_CACHE_NODE_SHIFT)
> +#define NILFS_SUFILE_STARVING_SEGS_STEP (1 << 15)
>  
>  struct nilfs_sufile_cache_node {
>  	__u32 values[NILFS_SUFILE_CACHE_NODE_COUNT];
> @@ -88,6 +89,8 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>  int nilfs_sufile_read(struct super_block *sb, size_t susize,
>  		      struct nilfs_inode *raw_inode, struct inode **inodep);
>  int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
> +int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
> +				   __u64 nsegs);
>  int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum);
>  void nilfs_sufile_shrink_cache(struct inode *sufile);
>  int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
> -- 
> 2.3.7
> 


* Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]         ` <20150520.234335.542615158366069430.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-20 15:49           ` Ryusuke Konishi
  2015-05-22 18:10           ` Andreas Rohner
  1 sibling, 0 replies; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-20 15:49 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Wed, 20 May 2015 23:43:35 +0900 (JST), Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
>> It doesn't really matter if the number of reclaimable blocks for a
>> segment is inaccurate, as long as the overall performance is better than
>> the simple timestamp algorithm and starvation is prevented.
>> 
>> The following steps will lead to starvation of a segment:
>> 
>> 1. The segment is written
>> 2. A snapshot is created
>> 3. The files in the segment are deleted and the number of live
>>    blocks for the segment is decremented to a very low value
>> 4. The GC tries to free the segment, but there are no reclaimable
>>    blocks, because they are all protected by the snapshot. To prevent an
>>    infinite loop the GC has to adjust the number of live blocks to the
>>    correct value.
>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>    segment are now reclaimable.
>> 6. The GC will never attempt to clean the segment again, because it
>>    looks as if it had a high number of live blocks.
>> 
>> To prevent this, the already existing padding field of the SUFILE entry
>> is used to track the number of snapshot blocks in the segment. This
>> number is only set by the GC, since it collects the necessary
>> information anyway. So there is no need to track which block belongs to
>> which segment. In step 4 of the list above the GC will set the new field
>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>> reduced.
>> 
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> 
> I still don't know whether this workaround is the approach we should
> take or not.  This patch has several drawbacks:
> 
>  1. It introduces overhead to every "chcp cp" operation
>     due to the traversal rewrite of the sufile.
>     If the ratio of snapshot-protected blocks is high, then
>     this overhead will be big.
> 
>  2. The traversal rewrite of the sufile will cause many sufile blocks to
>     be written out.  If most blocks are protected by a snapshot,
>     more than 4MB of sufile blocks will be written per 1TB of capacity.
> 
>     Even though this rewrite may not happen for consecutive "chcp cp"
>     operations, it still has the potential to create many dirty sufile
>     blocks if the application using nilfs manipulates snapshots frequently.
> 
>  3. The ratio of the threshold "max_segblks" is hard-coded to 50%
>     of blocks_per_segment.  It is not clear whether this ratio is a
>     good general-purpose choice.
> 
> I will add comments inline below.
> 
>> ---
>>  fs/nilfs2/ioctl.c  | 50 +++++++++++++++++++++++++++++++-
>>  fs/nilfs2/sufile.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nilfs2/sufile.h |  3 ++
>>  3 files changed, 137 insertions(+), 1 deletion(-)
>> 
>> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
>> index 40bf74a..431725f 100644
>> --- a/fs/nilfs2/ioctl.c
>> +++ b/fs/nilfs2/ioctl.c
>> @@ -200,6 +200,49 @@ static int nilfs_ioctl_getversion(struct inode *inode, void __user *argp)
>>  }
>>  
>>  /**
>> + * nilfs_ioctl_fix_starving_segs - fix potentially starving segments
>> + * @nilfs: nilfs object
>> + * @inode: inode object
>> + *
>> + * Description: Scans for segments, which are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This requires a scan of the whole SUFILE,
>> + * which can take a long time on certain devices and under certain conditions.
>> + * To avoid blocking other file system operations for too long the SUFILE is
>> + * scanned in steps of NILFS_SUFILE_STARVING_SEGS_STEP. After each step the
>> + * locks are released and cond_resched() is called.
>> + *
>> + * Return Value: On success, 0 is returned and on error, one of the
>> + * following negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
> 
>> +static int nilfs_ioctl_fix_starving_segs(struct the_nilfs *nilfs,
>> +					 struct inode *inode) {
> 
> This "inode" argument is meaningless for this routine.
> Consider passing "sb" instead.
> 
> The function name "fix starving segs" feels odd to me.  It suggests
> a workaround rather than a solution to the root problem of gc in
> nilfs.  What this patch actually does is "calibrate" the live block
> count.
> 
>> +	struct nilfs_transaction_info ti;
> 
>> +	unsigned long i, nsegs = nilfs_sufile_get_nsegments(nilfs->ns_sufile);
> 
> nsegs is set outside the transaction lock.
> 
> Since the file system can be resized (either shrunk or extended)
> outside the lock, nsegs must be initialized or updated in the
> section where the transaction lock is held.
> 
>> +	int ret = 0;
>> +
>> +	for (i = 0; i < nsegs; i += NILFS_SUFILE_STARVING_SEGS_STEP) {
>> +		nilfs_transaction_begin(inode->i_sb, &ti, 0);
>> +
>> +		ret = nilfs_sufile_fix_starving_segs(nilfs->ns_sufile, i,
>> +				NILFS_SUFILE_STARVING_SEGS_STEP);
>> +		if (unlikely(ret < 0)) {
>> +			nilfs_transaction_abort(inode->i_sb);
>> +			break;
>> +		}
>> +
>> +		nilfs_transaction_commit(inode->i_sb); /* never fails */
>> +		cond_resched();
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +/**
>>   * nilfs_ioctl_change_cpmode - change checkpoint mode (checkpoint/snapshot)
>>   * @inode: inode object
>>   * @filp: file object
>> @@ -224,7 +267,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>>  	struct the_nilfs *nilfs = inode->i_sb->s_fs_info;
>>  	struct nilfs_transaction_info ti;
>>  	struct nilfs_cpmode cpmode;
>> -	int ret;
>> +	int ret, is_snapshot;
>>  
>>  	if (!capable(CAP_SYS_ADMIN))
>>  		return -EPERM;
>> @@ -240,6 +283,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>>  	mutex_lock(&nilfs->ns_snapshot_mount_mutex);
>>  
>>  	nilfs_transaction_begin(inode->i_sb, &ti, 0);
>> +	is_snapshot = nilfs_cpfile_is_snapshot(nilfs->ns_cpfile, cpmode.cm_cno);
>>  	ret = nilfs_cpfile_change_cpmode(
>>  		nilfs->ns_cpfile, cpmode.cm_cno, cpmode.cm_mode);
>>  	if (unlikely(ret < 0))
>> @@ -248,6 +292,10 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>>  		nilfs_transaction_commit(inode->i_sb); /* never fails */
>>  
>>  	mutex_unlock(&nilfs->ns_snapshot_mount_mutex);
>> +
> 
>> +	if (is_snapshot > 0 && cpmode.cm_mode == NILFS_CHECKPOINT &&
>> +			nilfs_feature_track_live_blks(nilfs))
>> +		ret = nilfs_ioctl_fix_starving_segs(nilfs, inode);
> 
> Should we use this return value?
> It is unrelated to the success or failure of the "chcp" operation.
> 
> nilfs_ioctl_fix_starving_segs() is called every time "chcp cp" is
> called.  I would prefer to defer this extra work to a workqueue and
> skip starting new work if the previous work is still running.
> 
>>  out:
>>  	mnt_drop_write_file(filp);
>>  	return ret;
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 9cd8820d..47e2c05 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -1215,6 +1215,91 @@ out_sem:
>>  }
>>  
>>  /**
>> + * nilfs_sufile_fix_starving_segs - fix potentially starving segments
>> + * @sufile: inode of segment usage file
>> + * @segnum: segnum to start
>> + * @nsegs: number of segments to check
>> + *
>> + * Description: Scans for segments, which are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This way the segment is more likely to be
>> + * chosen by the GC. A segment is marked as potentially starving, if more
>> + * than half of the blocks it contains are protected by snapshots.
>> + *
>> + * Return Value: On success, 0 is returned and on error, one of the
>> + * following negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
>> +				   __u64 nsegs)
>> +{
>> +	struct buffer_head *su_bh;
>> +	struct nilfs_segment_usage *su;
>> +	size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
>> +	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
>> +	void *kaddr;
>> +	unsigned long maxnsegs, segusages_per_block;
>> +	__u32 max_segblks = nilfs->ns_blocks_per_segment >> 1;
>> +	int ret = 0, blkdirty, dirty = 0;
>> +
>> +	down_write(&NILFS_MDT(sufile)->mi_sem);
>> +
> 
>> +	maxnsegs = nilfs_sufile_get_nsegments(sufile);
>> +	segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
>> +	nsegs += segnum;
>> +	if (nsegs > maxnsegs)
>> +		nsegs = maxnsegs;
>> +
>> +	while (segnum < nsegs) {
> 
> This local variable "nsegs" is used as an (exclusive) end segment number,
> which is confusing.  You should define a separate "end" variable.
> It can simply be calculated as:
> 
>     end = min_t(__u64, segnum + nsegs, nilfs_sufile_get_nsegments(sufile));
> 
> ("maxnsegs" can be removed.)
> 
> Note that each argument is never evaluated twice in the min_t() macro,
> since min_t() temporarily stores the evaluation results in hidden local
> variables and uses them for the comparison.
> 
> Regards,
> Ryusuke Konishi
> 
> 
>> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
>> +							 nsegs - 1);
>> +
>> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
>> +							   0, &su_bh);
>> +		if (ret < 0) {
>> +			if (ret != -ENOENT)
>> +				goto out;
>> +			/* hole */
>> +			segnum += n;
>> +			continue;
>> +		}
>> +
>> +		kaddr = kmap_atomic(su_bh->b_page);
>> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
>> +							  su_bh, kaddr);
>> +		blkdirty = 0;
>> +		for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {

I forgot a few comments.

If the segment is not dirty, skip it first for safety.

>> +			if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
>> +				continue;
>> +			if (le32_to_cpu(su->su_nlive_blks) <= max_segblks)
>> +				continue;

The variable name "max_segblks" is not intuitive.  It represents a
threshold on the live block count used to make the segment reclaimable
or to calibrate live block counts.  A name carrying this nuance would
be preferable.

>> +
>> +			su->su_nlive_blks = cpu_to_le32(max_segblks);
>> +			su->su_nsnapshot_blks = cpu_to_le32(max_segblks);

The live block counts are changed, but "su_nlive_lastmod" is not updated.


Regards,
Ryusuke Konishi

>> +			blkdirty = 1;
>> +		}
>> +
>> +		kunmap_atomic(kaddr);
>> +		if (blkdirty) {
>> +			mark_buffer_dirty(su_bh);
>> +			dirty = 1;
>> +		}
>> +		put_bh(su_bh);
>> +		cond_resched();
>> +	}
>> +
>> +out:
>> +	if (dirty)
>> +		nilfs_mdt_mark_dirty(sufile);
>> +
>> +	up_write(&NILFS_MDT(sufile)->mi_sem);
>> +	return ret;
>> +}
>> +
>> +/**
>>   * nilfs_sufile_alloc_cache_node - allocate and insert a new cache node
>>   * @sufile: inode of segment usage file
>>   * @group: group to allocate a node for
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index 3466abb..f11e3e6 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -30,6 +30,7 @@
>>  
>>  #define NILFS_SUFILE_CACHE_NODE_SHIFT	6
>>  #define NILFS_SUFILE_CACHE_NODE_COUNT	(1 << NILFS_SUFILE_CACHE_NODE_SHIFT)
>> +#define NILFS_SUFILE_STARVING_SEGS_STEP (1 << 15)
>>  
>>  struct nilfs_sufile_cache_node {
>>  	__u32 values[NILFS_SUFILE_CACHE_NODE_COUNT];
>> @@ -88,6 +89,8 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>>  int nilfs_sufile_read(struct super_block *sb, size_t susize,
>>  		      struct nilfs_inode *raw_inode, struct inode **inodep);
>>  int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
>> +int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
>> +				   __u64 nsegs);
>>  int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum);
>>  void nilfs_sufile_shrink_cache(struct inode *sufile);
>>  int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
>> -- 
>> 2.3.7
>> 
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]         ` <20150520.234335.542615158366069430.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  2015-05-20 15:49           ` Ryusuke Konishi
@ 2015-05-22 18:10           ` Andreas Rohner
       [not found]             ` <555F70FD.6090500-hi6Y0CQ0nG0@public.gmane.org>
  1 sibling, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-22 18:10 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-20 16:43, Ryusuke Konishi wrote:
> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
>> It doesn't really matter if the number of reclaimable blocks for a
>> segment is inaccurate, as long as the overall performance is better than
>> the simple timestamp algorithm and starvation is prevented.
>>
>> The following steps will lead to starvation of a segment:
>>
>> 1. The segment is written
>> 2. A snapshot is created
>> 3. The files in the segment are deleted and the number of live
>>    blocks for the segment is decremented to a very low value
>> 4. The GC tries to free the segment, but there are no reclaimable
>>    blocks, because they are all protected by the snapshot. To prevent an
>>    infinite loop the GC has to adjust the number of live blocks to the
>>    correct value.
>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>    segment are now reclaimable.
>> 6. The GC will never attempt to clean the segment again, because it
>>    looks as if it had a high number of live blocks.
>>
>> To prevent this, the already existing padding field of the SUFILE entry
>> is used to track the number of snapshot blocks in the segment. This
>> number is only set by the GC, since it collects the necessary
>> information anyway. So there is no need to track which block belongs to
>> which segment. In step 4 of the list above the GC will set the new field
>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>> reduced.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> 
> I still don't know whether this workaround is the approach we should
> take.  This patch has several drawbacks:
> 
>  1. It introduces overhead to every "chcp cp" operation
>     due to the traversal rewrite of sufile.
>     If the ratio of snapshot-protected blocks is high,
>     this overhead will be large.
> 
>  2. The traversal rewrite of sufile will cause many sufile blocks to be
>     written out.   If most blocks are protected by a snapshot,
>     more than 4MB of sufile blocks will be written per 1TB of capacity.
> 
>     Even though this rewrite may not happen for consecutive "chcp cp"
>     operations, it still has the potential to create sufile block writes
>     if the application using nilfs manipulates snapshots frequently.

I could also implement this functionality in nilfs_cleanerd in
userspace. Every time a "chcp cp" happens some kind of permanent flag
like "snapshot_was_recently_deleted" is set at an appropriate location.
The flag could be returned with GET_SUSTAT ioctl(). Then nilfs_cleanerd
would, at certain intervals and if the flag is set, check all segments
with GET_SUINFO ioctl() and set the ones that have potentially invalid
values with SET_SUINFO ioctl(). After that it would clear the
"snapshot_was_recently_deleted" flag. What do you think about this idea?

If the policy is "timestamp" the GC would of course skip this scan,
because it is unnecessary.

>  3. The ratio of the threshold "max_segblks" is hard-coded to 50%
>     of blocks_per_segment.  It is not clear whether this ratio is a
>     good general-purpose choice.

The interval and percentage could be set in /etc/nilfs_cleanerd.conf.

I chose 50% kind of arbitrarily. My intent was to encourage the GC to
check the segment again in the future. I guess anything between 25% and
75% would also work.

> I will add comments inline below.

>> ---
>>  fs/nilfs2/ioctl.c  | 50 +++++++++++++++++++++++++++++++-
>>  fs/nilfs2/sufile.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>  fs/nilfs2/sufile.h |  3 ++
>>  3 files changed, 137 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
>> index 40bf74a..431725f 100644
>> --- a/fs/nilfs2/ioctl.c
>> +++ b/fs/nilfs2/ioctl.c
>> @@ -200,6 +200,49 @@ static int nilfs_ioctl_getversion(struct inode *inode, void __user *argp)
>>  }
>>  
>>  /**
>> + * nilfs_ioctl_fix_starving_segs - fix potentially starving segments
>> + * @nilfs: nilfs object
>> + * @inode: inode object
>> + *
>> + * Description: Scans for segments, which are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This requires a scan of the whole SUFILE,
>> + * which can take a long time on certain devices and under certain conditions.
>> + * To avoid blocking other file system operations for too long the SUFILE is
>> + * scanned in steps of NILFS_SUFILE_STARVING_SEGS_STEP. After each step the
>> + * locks are released and cond_resched() is called.
>> + *
>> + * Return Value: On success, 0 is returned and on error, one of the
>> + * following negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
> 
>> +static int nilfs_ioctl_fix_starving_segs(struct the_nilfs *nilfs,
>> +					 struct inode *inode) {
> 
> This "inode" argument is meaningless for this routine.
> Consider passing "sb" instead.

I agree.

> The function name "fix starving segs" feels odd to me.  It suggests
> a workaround rather than a solution to the root problem of gc in
> nilfs.  What this patch actually does is "calibrate" the live block
> count.

I like the name "calibrating". I will change it.

>> +	struct nilfs_transaction_info ti;
> 
>> +	unsigned long i, nsegs = nilfs_sufile_get_nsegments(nilfs->ns_sufile);
> 
> nsegs is set outside the transaction lock.
> 
> Since the file system can be resized (either shrunk or extended)
> outside the lock, nsegs must be initialized or updated in the
> section where the transaction lock is held.

Good point. I'll change it.

>> +	int ret = 0;
>> +
>> +	for (i = 0; i < nsegs; i += NILFS_SUFILE_STARVING_SEGS_STEP) {
>> +		nilfs_transaction_begin(inode->i_sb, &ti, 0);
>> +
>> +		ret = nilfs_sufile_fix_starving_segs(nilfs->ns_sufile, i,
>> +				NILFS_SUFILE_STARVING_SEGS_STEP);
>> +		if (unlikely(ret < 0)) {
>> +			nilfs_transaction_abort(inode->i_sb);
>> +			break;
>> +		}
>> +
>> +		nilfs_transaction_commit(inode->i_sb); /* never fails */
>> +		cond_resched();
>> +	}
>> +
>> +	return ret;
>> +}
>> +
>> +/**
>>   * nilfs_ioctl_change_cpmode - change checkpoint mode (checkpoint/snapshot)
>>   * @inode: inode object
>>   * @filp: file object
>> @@ -224,7 +267,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>>  	struct the_nilfs *nilfs = inode->i_sb->s_fs_info;
>>  	struct nilfs_transaction_info ti;
>>  	struct nilfs_cpmode cpmode;
>> -	int ret;
>> +	int ret, is_snapshot;
>>  
>>  	if (!capable(CAP_SYS_ADMIN))
>>  		return -EPERM;
>> @@ -240,6 +283,7 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>>  	mutex_lock(&nilfs->ns_snapshot_mount_mutex);
>>  
>>  	nilfs_transaction_begin(inode->i_sb, &ti, 0);
>> +	is_snapshot = nilfs_cpfile_is_snapshot(nilfs->ns_cpfile, cpmode.cm_cno);
>>  	ret = nilfs_cpfile_change_cpmode(
>>  		nilfs->ns_cpfile, cpmode.cm_cno, cpmode.cm_mode);
>>  	if (unlikely(ret < 0))
>> @@ -248,6 +292,10 @@ static int nilfs_ioctl_change_cpmode(struct inode *inode, struct file *filp,
>>  		nilfs_transaction_commit(inode->i_sb); /* never fails */
>>  
>>  	mutex_unlock(&nilfs->ns_snapshot_mount_mutex);
>> +
> 
>> +	if (is_snapshot > 0 && cpmode.cm_mode == NILFS_CHECKPOINT &&
>> +			nilfs_feature_track_live_blks(nilfs))
>> +		ret = nilfs_ioctl_fix_starving_segs(nilfs, inode);
> 
> Should we use this return value?
> It is unrelated to the success or failure of the "chcp" operation.
> 
> nilfs_ioctl_fix_starving_segs() is called every time "chcp cp" is
> called.  I would prefer to defer this extra work to a workqueue and
> skip starting new work if the previous work is still running.

Good idea. I'll look into it.

>>  out:
>>  	mnt_drop_write_file(filp);
>>  	return ret;
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 9cd8820d..47e2c05 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -1215,6 +1215,91 @@ out_sem:
>>  }
>>  
>>  /**
>> + * nilfs_sufile_fix_starving_segs - fix potentially starving segments
>> + * @sufile: inode of segment usage file
>> + * @segnum: segnum to start
>> + * @nsegs: number of segments to check
>> + *
>> + * Description: Scans for segments, which are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This way the segment is more likely to be
>> + * chosen by the GC. A segment is marked as potentially starving, if more
>> + * than half of the blocks it contains are protected by snapshots.
>> + *
>> + * Return Value: On success, 0 is returned and on error, one of the
>> + * following negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
>> +				   __u64 nsegs)
>> +{
>> +	struct buffer_head *su_bh;
>> +	struct nilfs_segment_usage *su;
>> +	size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
>> +	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
>> +	void *kaddr;
>> +	unsigned long maxnsegs, segusages_per_block;
>> +	__u32 max_segblks = nilfs->ns_blocks_per_segment >> 1;
>> +	int ret = 0, blkdirty, dirty = 0;
>> +
>> +	down_write(&NILFS_MDT(sufile)->mi_sem);
>> +
> 
>> +	maxnsegs = nilfs_sufile_get_nsegments(sufile);
>> +	segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
>> +	nsegs += segnum;
>> +	if (nsegs > maxnsegs)
>> +		nsegs = maxnsegs;
>> +
>> +	while (segnum < nsegs) {
> 
> This local variable "nsegs" is used as an (exclusive) end segment number,
> which is confusing.  You should define a separate "end" variable.
> It can simply be calculated as:
> 
>     end = min_t(__u64, segnum + nsegs, nilfs_sufile_get_nsegments(sufile));
> 
> ("maxnsegs" can be removed.)
> 
> Note that each argument is never evaluated twice in the min_t() macro,
> since min_t() temporarily stores the evaluation results in hidden local
> variables and uses them for the comparison.

Ok.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
> 
>> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
>> +							 nsegs - 1);
>> +
>> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
>> +							   0, &su_bh);
>> +		if (ret < 0) {
>> +			if (ret != -ENOENT)
>> +				goto out;
>> +			/* hole */
>> +			segnum += n;
>> +			continue;
>> +		}
>> +
>> +		kaddr = kmap_atomic(su_bh->b_page);
>> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
>> +							  su_bh, kaddr);
>> +		blkdirty = 0;
>> +		for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
>> +			if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
>> +				continue;
>> +			if (le32_to_cpu(su->su_nlive_blks) <= max_segblks)
>> +				continue;
>> +
>> +			su->su_nlive_blks = cpu_to_le32(max_segblks);
>> +			su->su_nsnapshot_blks = cpu_to_le32(max_segblks);
>> +			blkdirty = 1;
>> +		}
>> +
>> +		kunmap_atomic(kaddr);
>> +		if (blkdirty) {
>> +			mark_buffer_dirty(su_bh);
>> +			dirty = 1;
>> +		}
>> +		put_bh(su_bh);
>> +		cond_resched();
>> +	}
>> +
>> +out:
>> +	if (dirty)
>> +		nilfs_mdt_mark_dirty(sufile);
>> +
>> +	up_write(&NILFS_MDT(sufile)->mi_sem);
>> +	return ret;
>> +}
>> +
>> +/**
>>   * nilfs_sufile_alloc_cache_node - allocate and insert a new cache node
>>   * @sufile: inode of segment usage file
>>   * @group: group to allocate a node for
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index 3466abb..f11e3e6 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -30,6 +30,7 @@
>>  
>>  #define NILFS_SUFILE_CACHE_NODE_SHIFT	6
>>  #define NILFS_SUFILE_CACHE_NODE_COUNT	(1 << NILFS_SUFILE_CACHE_NODE_SHIFT)
>> +#define NILFS_SUFILE_STARVING_SEGS_STEP (1 << 15)
>>  
>>  struct nilfs_sufile_cache_node {
>>  	__u32 values[NILFS_SUFILE_CACHE_NODE_COUNT];
>> @@ -88,6 +89,8 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>>  int nilfs_sufile_read(struct super_block *sb, size_t susize,
>>  		      struct nilfs_inode *raw_inode, struct inode **inodep);
>>  int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
>> +int nilfs_sufile_fix_starving_segs(struct inode *sufile, __u64 segnum,
>> +				   __u64 nsegs);
>>  int nilfs_sufile_dec_nlive_blks(struct inode *sufile, __u64 segnum);
>>  void nilfs_sufile_shrink_cache(struct inode *sufile);
>>  int nilfs_sufile_flush_cache(struct inode *sufile, int only_mark,
>> -- 
>> 2.3.7
>>
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]             ` <555F70FD.6090500-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-05-31 16:45               ` Ryusuke Konishi
       [not found]                 ` <20150601.014550.269184778137708369.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-05-31 16:45 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Fri, 22 May 2015 20:10:05 +0200, Andreas Rohner wrote:
> On 2015-05-20 16:43, Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
>>> It doesn't really matter if the number of reclaimable blocks for a
>>> segment is inaccurate, as long as the overall performance is better than
>>> the simple timestamp algorithm and starvation is prevented.
>>>
>>> The following steps will lead to starvation of a segment:
>>>
>>> 1. The segment is written
>>> 2. A snapshot is created
>>> 3. The files in the segment are deleted and the number of live
>>>    blocks for the segment is decremented to a very low value
>>> 4. The GC tries to free the segment, but there are no reclaimable
>>>    blocks, because they are all protected by the snapshot. To prevent an
>>>    infinite loop the GC has to adjust the number of live blocks to the
>>>    correct value.
>>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>>    segment are now reclaimable.
>>> 6. The GC will never attempt to clean the segment again, because it
>>>    looks as if it had a high number of live blocks.
>>>
>>> To prevent this, the already existing padding field of the SUFILE entry
>>> is used to track the number of snapshot blocks in the segment. This
>>> number is only set by the GC, since it collects the necessary
>>> information anyway. So there is no need to track which block belongs to
>>> which segment. In step 4 of the list above the GC will set the new field
>>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>>> reduced.
>>>
>>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> 
>> I still don't know whether this workaround is the approach we should
>> take.  This patch has several drawbacks:
>> 
>>  1. It introduces overhead to every "chcp cp" operation
>>     due to the traversal rewrite of sufile.
>>     If the ratio of snapshot-protected blocks is high,
>>     this overhead will be large.
>> 
>>  2. The traversal rewrite of sufile will cause many sufile blocks to be
>>     written out.   If most blocks are protected by a snapshot,
>>     more than 4MB of sufile blocks will be written per 1TB of capacity.
>> 
>>     Even though this rewrite may not happen for consecutive "chcp cp"
>>     operations, it still has the potential to create sufile block writes
>>     if the application using nilfs manipulates snapshots frequently.
> 
> I could also implement this functionality in nilfs_cleanerd in
> userspace. Every time a "chcp cp" happens some kind of permanent flag
> like "snapshot_was_recently_deleted" is set at an appropriate location.
> The flag could be returned with GET_SUSTAT ioctl(). Then nilfs_cleanerd
> would, at certain intervals and if the flag is set, check all segments
> with GET_SUINFO ioctl() and set the ones that have potentially invalid
> values with SET_SUINFO ioctl(). After that it would clear the
> "snapshot_was_recently_deleted" flag. What do you think about this idea?

Sorry for my late reply.

I think moving the functionality to cleanerd, and notifying userland
through an ioctl for that purpose, is a good idea, except that I feel
the ioctl should be GET_CPSTAT instead of GET_SUINFO, because it is
checkpoint/snapshot related information.

I think the parameter that should be added is a set of statistics
including the number of snapshots deleted since the file system was
last mounted (1).  The counter (1) can serve as the
"snapshot_was_recently_deleted" flag if it increases monotonically.
Although we could use the timestamp of when a snapshot was last
deleted, it is less preferable than the counter (1) because the
system clock may be rewound, and it also has precision issues.

Note that we must add GET_CPSTAT_V2 (or GET_SUSTAT_V2) and a
corresponding structure (i.e. nilfs_cpstat_v2, or the like), since
ioctl codes depend on the size of the argument data, which would
change for both ioctls; unfortunately, neither the GET_CPSTAT nor the
GET_SUSTAT ioctl is expandable.  Ioctls like EVIOCGKEYCODE_V2 can
serve as a reference for this issue.

> 
> If the policy is "timestamp" the GC would of course skip this scan,
> because it is unnecessary.
> 
>>  3. The ratio of the threshold "max_segblks" is hard-coded to 50%
>>     of blocks_per_segment.  It is not clear whether this ratio is a
>>     good general-purpose choice.
> 
> The interval and percentage could be set in /etc/nilfs_cleanerd.conf.
> 
> I chose 50% kind of arbitrarily. My intent was to encourage the GC to
> check the segment again in the future. I guess anything between 25% and
> 75% would also work.

Sounds reasonable.

By the way, I think we should move cleanerd into the kernel as soon
as we can.  It is not only inefficient due to the large amount of
data exchanged between kernel and user-land, but it also hinders
changes like the ones we are attempting.  We have to preserve
compatibility unnecessarily because of an early design mistake
(i.e. the separation of gc into user-land).

Regards,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]                 ` <20150601.014550.269184778137708369.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-05-31 18:13                   ` Andreas Rohner
       [not found]                     ` <556B4F58.9080801-hi6Y0CQ0nG0@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Andreas Rohner @ 2015-05-31 18:13 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-05-31 18:45, Ryusuke Konishi wrote:
> On Fri, 22 May 2015 20:10:05 +0200, Andreas Rohner wrote:
>> On 2015-05-20 16:43, Ryusuke Konishi wrote:
>>> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
>>>> It doesn't really matter if the number of reclaimable blocks for a
>>>> segment is inaccurate, as long as the overall performance is better than
>>>> the simple timestamp algorithm and starvation is prevented.
>>>>
>>>> The following steps will lead to starvation of a segment:
>>>>
>>>> 1. The segment is written
>>>> 2. A snapshot is created
>>>> 3. The files in the segment are deleted and the number of live
>>>>    blocks for the segment is decremented to a very low value
>>>> 4. The GC tries to free the segment, but there are no reclaimable
>>>>    blocks, because they are all protected by the snapshot. To prevent an
>>>>    infinite loop the GC has to adjust the number of live blocks to the
>>>>    correct value.
>>>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>>>    segment are now reclaimable.
>>>> 6. The GC will never attempt to clean the segment again, because it
>>>>    looks as if it had a high number of live blocks.
>>>>
>>>> To prevent this, the already existing padding field of the SUFILE entry
>>>> is used to track the number of snapshot blocks in the segment. This
>>>> number is only set by the GC, since it collects the necessary
>>>> information anyway. So there is no need to track which block belongs to
>>>> which segment. In step 4 of the list above the GC will set the new field
>>>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>>>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>>>> reduced.
>>>>
>>>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
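The step-4/step-5 heuristic described above can be sketched roughly as follows (a minimal sketch with hypothetical names; `seg_usage`, `gc_mark_protected()` and `chcp_cp_fixup()` are illustrative stand-ins for the real SUFILE entry and code paths):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified segment-usage entry, following the patch
 * description; the real on-disk SUFILE entry differs. */
struct seg_usage {
	uint32_t nlive_blks;
	uint32_t nsnapshot_blks;
};

/* Step 4: the GC found every block snapshot-protected; it records that
 * fact and bumps nlive_blks so the segment is not re-selected soon. */
static void gc_mark_protected(struct seg_usage *su, uint32_t nprotected)
{
	su->nsnapshot_blks = nprotected;
	su->nlive_blks = nprotected;	/* avoid immediate re-selection */
}

/* Step 5: on snapshot-to-checkpoint conversion, deflate entries whose
 * snapshot block count dominates, so the GC reconsiders them later. */
static void chcp_cp_fixup(struct seg_usage *su, uint32_t max_segblks)
{
	if (su->nsnapshot_blks > max_segblks) {
		su->nlive_blks = max_segblks;	/* ~50% of blocks_per_segment */
		su->nsnapshot_blks = 0;
	}
}
```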
>>>
>>> I still don't know whether this workaround is the way we should take
>>> or not.  This patch has several drawbacks:
>>>
>>>  1. It introduces overhead to every "chcp cp" operation
>>>     due to the traversal rewrite of the sufile.
>>>     If the ratio of snapshot-protected blocks is high, then
>>>     this overhead will be big.
>>>
>>>  2. The traversal rewrite of the sufile will cause many sufile blocks
>>>     to be written out.  If most blocks are protected by a snapshot,
>>>     more than 4MB of sufile blocks will be written per 1TB of capacity.
>>>
>>>     Even though this rewrite may not happen for consecutive "chcp cp"
>>>     operations, it still has the potential to create sufile block writes
>>>     if the application using nilfs manipulates snapshots frequently.
>>
>> I could also implement this functionality in nilfs_cleanerd in
>> userspace. Every time a "chcp cp" happens, some kind of permanent flag
>> like "snapshot_was_recently_deleted" is set at an appropriate location.
>> The flag could be returned with the GET_SUSTAT ioctl(). Then
>> nilfs_cleanerd would, at certain intervals and if the flag is set, check
>> all segments with the GET_SUINFO ioctl() and correct the ones that have
>> potentially invalid values with the SET_SUINFO ioctl(). After that it
>> would clear the "snapshot_was_recently_deleted" flag. What do you think
>> about this idea?
> 
> Sorry for my late reply.

No problem. I was also very busy last week.

> I think moving the functionality to cleanerd, and notifying some sort
> of information to userland through an ioctl for that purpose, is a good
> idea, except that I feel the ioctl should be GET_CPSTAT instead of
> GET_SUINFO because it is checkpoint/snapshot-related information.

Ok good idea.

> I think the parameter that should be added is a set of statistics
> including the number of snapshots deleted since the file system was
> last mounted (1).  The counter (1) can serve as the
> "snapshot_was_recently_deleted" flag if it monotonically increases.
> Although we could use the timestamp of when a snapshot was last
> deleted, it is less preferable than the counter (1) because the system
> clock may be rewound, and it also has an issue related to precision.

I agree, a counter is better than a simple flag.

> Note that we must add GET_CPSTAT_V2 (or GET_SUSTAT_V2) and the
> corresponding structure (i.e. nilfs_cpstat_v2, or so), since ioctl
> command codes depend on the size of the argument data, which would
> change in both ioctls; unfortunately, neither the GET_CPSTAT nor the
> GET_SUSTAT ioctl is expandable.  Ioctls like EVIOCGKEYCODE_V2 can
> serve as a reference for this issue.
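The expandability problem stems from how ioctl command numbers are built: `_IOR()` folds the size of the argument type into the command code, so growing the struct silently changes the number. A small illustration (the `DEMO_*` names and the v2 field are hypothetical; only the layout of the existing `nilfs_cpstat` follows the nilfs2 UAPI):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/ioctl.h>

/* The existing checkpoint statistics structure (as in the nilfs2 UAPI)
 * and a hypothetical extended version carrying the snapshot-deletion
 * counter discussed above. */
struct nilfs_cpstat {
	uint64_t cs_cno;
	uint64_t cs_ncps;
	uint64_t cs_nsss;
};

struct nilfs_cpstat_v2 {
	uint64_t cs_cno;
	uint64_t cs_ncps;
	uint64_t cs_nsss;
	uint64_t cs_nsnapshot_dels;	/* snapshots deleted since mount (new) */
};

/* Because _IOR() encodes sizeof(arg) into the command number, the two
 * requests get distinct codes, and an old kernel rejects the new one
 * cleanly with -ENOTTY instead of misreading the buffer. */
#define DEMO_GET_CPSTAT    _IOR('n', 0x83, struct nilfs_cpstat)
#define DEMO_GET_CPSTAT_V2 _IOR('n', 0x83, struct nilfs_cpstat_v2)
```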
> 
>>
>> If the policy is "timestamp" the GC would of course skip this scan,
>> because it is unnecessary.
>>
>>>  3. The ratio of the threshold "max_segblks" is hard coded to 50%
>>>     of blocks_per_segment.  It is not clear if the ratio is good
>>>     (versatile).
>>
>> The interval and percentage could be set in /etc/nilfs_cleanerd.conf.
>>
>> I chose 50% kind of arbitrarily. My intent was to encourage the GC to
>> check the segment again in the future. I guess anything between 25% and
>> 75% would also work.
> 
> Sounds reasonable.
> 
> By the way, I am thinking we should move cleanerd into the kernel as
> soon as we can.  It is not only inefficient due to the large amount of
> data exchanged between kernel and userland, but it also hinders
> changes like the ones we are attempting.  We have to worry about
> compatibility unnecessarily due to an early design mistake (i.e. the
> separation of the GC into userland).

I am a bit confused. Is it OK if I implement this functionality in
nilfs_cleanerd for this patch set, or would it be better to implement it
with a workqueue in the kernel, like you've suggested before?

If you intend to move nilfs_cleanerd into the kernel anyway, then the
latter would make more sense to me. Which implementation do you prefer
for this patch set?

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]                     ` <556B4F58.9080801-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-06-01  0:44                       ` Ryusuke Konishi
       [not found]                         ` <20150601.094441.24658496988941562.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-06-01  0:44 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun, 31 May 2015 20:13:44 +0200, Andreas Rohner wrote:
> On 2015-05-31 18:45, Ryusuke Konishi wrote:
>> On Fri, 22 May 2015 20:10:05 +0200, Andreas Rohner wrote:
>>> On 2015-05-20 16:43, Ryusuke Konishi wrote:
>>>> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
[...]
>>>>  3. The ratio of the threshold "max_segblks" is hard coded to 50%
>>>>     of blocks_per_segment.  It is not clear if the ratio is good
>>>>     (versatile).
>>>
>>> The interval and percentage could be set in /etc/nilfs_cleanerd.conf.
>>>
>>> I chose 50% kind of arbitrarily. My intent was to encourage the GC to
>>> check the segment again in the future. I guess anything between 25% and
>>> 75% would also work.
>> 
>> Sounds reasonable.
>> 
>> By the way, I am thinking we should move cleanerd into the kernel as
>> soon as we can.  It is not only inefficient due to the large amount of
>> data exchanged between kernel and userland, but it also hinders
>> changes like the ones we are attempting.  We have to worry about
>> compatibility unnecessarily due to an early design mistake (i.e. the
>> separation of the GC into userland).
> 
> I am a bit confused. Is it OK if I implement this functionality in
> nilfs_cleanerd for this patch set, or would it be better to implement it
> with a workqueue in the kernel, like you've suggested before?
> 
> If you intend to move nilfs_cleanerd into the kernel anyway, then the
> latter would make more sense to me. Which implementation do you prefer
> for this patch set?

If nilfs_cleanerd remains in userland, then the userland
implementation looks better.  But, yes, if we move the cleaner
into the kernel, then the kernel implementation looks better, because
we may be able to avoid an unnecessary API change.  It's a dilemma.

Do you have any good ideas for reducing or hiding the overhead of the
calibration (i.e. the traversal rewrite of the sufile) with regard to
the kernel implementation?
I'm inclined to leave it in the kernel for now.

Regards,
Ryusuke Konishi

> 
> Regards,
> Andreas Rohner
> 
>> Regards,
>> Ryusuke Konishi


* Re: [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out
       [not found]             ` <554F3B32.5050004-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-06-01  4:13               ` Ryusuke Konishi
       [not found]                 ` <20150601.131320.1075202804382267027.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 52+ messages in thread
From: Ryusuke Konishi @ 2015-06-01  4:13 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sun, 10 May 2015 13:04:18 +0200, Andreas Rohner wrote:
> On 2015-05-09 14:17, Ryusuke Konishi wrote:
>> On Sun,  3 May 2015 12:05:20 +0200, Andreas Rohner wrote:
[...]
>> 
>> Uum. This still looks to have the potential for a leak of dirty block
>> collection between the DAT and SUFILE, since this retry is limited by
>> a fixed retry count.
>> 
>> How about adding a function that temporarily turns off the live block
>> tracking, and using it after this propagation loop until the log write
>> finishes?
>> 
>> It would reduce the accuracy of the live block count, but is that
>> enough?  What do you think?  We have to eliminate the possibility of
>> the leak because it can cause file system corruption.  Every
>> checkpoint must be self-contained.
> 
> How exactly could it lead to file system corruption? Maybe I am
> missing something important here, but it seems to me that no
> corruption is possible.
> 
> The nilfs_sufile_flush_cache_node() function only reads in already
> existing blocks. No new blocks are created. If I mark those blocks
> dirty, the btree is not changed at all. If I do not call
> nilfs_bmap_propagate(), then the btree stays unchanged and there are no
> dangling pointers. The resulting checkpoint should be self-contained.

Good point.  As for the btree, it looks like no inconsistency issue
arises, since nilfs_sufile_flush_cache_node() never inserts new blocks,
as you pointed out.  We also must take care of consistency between the
sufile header and the sufile data blocks, and the block count in the
inode as well, but fortunately these look to be OK, too.

However, I still think it is not good to carry over dirty blocks to
the next segment construction; avoiding that prevents extra checkpoint
creation and simplifies things.

From this viewpoint, I also prefer that nilfs_sufile_flush_cache() and
nilfs_sufile_flush_cache_node() be changed a bit so that they skip
adjusting su_nlive_blks and su_nlive_lastmod if the sufile block
that includes the segment usage is not marked dirty and only_mark == 0,
as well as turning off live block counting temporarily after the
sufile/DAT propagation loop.
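Both ideas, the skip rule and the temporary pause of live block counting, reduce to a small amount of state. The sketch below uses hypothetical names and ignores the locking the real kernel code would need:

```c
#include <assert.h>

/* Hypothetical guard combining the two suggestions: a nesting counter
 * that disables live block tracking around the log write, and a rule
 * that refuses to dirty a clean sufile block just to fold in cached
 * su_nlive_blks deltas (only_mark == 0).  Single-threaded sketch. */
static int tracking_disabled;

static void live_tracking_pause(void)  { tracking_disabled++; }
static void live_tracking_resume(void) { tracking_disabled--; }

/* Decide whether a cached counter delta may be applied to a sufile
 * block during flush. */
static int may_apply_delta(int block_is_dirty, int only_mark)
{
	if (tracking_disabled)
		return 0;	/* propagation loop finished; stay clean */
	return block_is_dirty || only_mark;
}
```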

> 
> The only problem would be that I could lose some nlive_blks updates.
> 

Regards,
Ryusuke Konishi


* Re: [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out
       [not found]                 ` <20150601.131320.1075202804382267027.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-06-01 14:33                   ` Andreas Rohner
  0 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-06-01 14:33 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-06-01 06:13, Ryusuke Konishi wrote:
> On Sun, 10 May 2015 13:04:18 +0200, Andreas Rohner wrote:
>> On 2015-05-09 14:17, Ryusuke Konishi wrote:
>>> On Sun,  3 May 2015 12:05:20 +0200, Andreas Rohner wrote:
> [...]
>>>
>>> Uum. This still looks to have the potential for a leak of dirty block
>>> collection between the DAT and SUFILE, since this retry is limited by
>>> a fixed retry count.
>>>
>>> How about adding a function that temporarily turns off the live block
>>> tracking, and using it after this propagation loop until the log write
>>> finishes?
>>>
>>> It would reduce the accuracy of the live block count, but is that
>>> enough?  What do you think?  We have to eliminate the possibility of
>>> the leak because it can cause file system corruption.  Every
>>> checkpoint must be self-contained.
>>
>> How exactly could it lead to file system corruption? Maybe I am
>> missing something important here, but it seems to me that no
>> corruption is possible.
>>
>> The nilfs_sufile_flush_cache_node() function only reads in already
>> existing blocks. No new blocks are created. If I mark those blocks
>> dirty, the btree is not changed at all. If I do not call
>> nilfs_bmap_propagate(), then the btree stays unchanged and there are no
>> dangling pointers. The resulting checkpoint should be self-contained.
> 
> Good point.  As for the btree, it looks like no inconsistency issue
> arises, since nilfs_sufile_flush_cache_node() never inserts new blocks,
> as you pointed out.  We also must take care of consistency between the
> sufile header and the sufile data blocks, and the block count in the
> inode as well, but fortunately these look to be OK, too.
> 
> However, I still think it is not good to carry over dirty blocks to
> the next segment construction; avoiding that prevents extra checkpoint
> creation and simplifies things.
> 
> From this viewpoint, I also prefer that nilfs_sufile_flush_cache() and
> nilfs_sufile_flush_cache_node() be changed a bit so that they skip
> adjusting su_nlive_blks and su_nlive_lastmod if the sufile block
> that includes the segment usage is not marked dirty and only_mark == 0,
> as well as turning off live block counting temporarily after the
> sufile/DAT propagation loop.

Ok I'll start working on this.

Regards,
Andreas Rohner

>>
>> The only problem would be that I could lose some nlive_blks updates.
>>
> 
> Regards,
> Ryusuke Konishi



* Re: [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]                         ` <20150601.094441.24658496988941562.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-06-01 14:45                           ` Andreas Rohner
  0 siblings, 0 replies; 52+ messages in thread
From: Andreas Rohner @ 2015-06-01 14:45 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-06-01 02:44, Ryusuke Konishi wrote:
> On Sun, 31 May 2015 20:13:44 +0200, Andreas Rohner wrote:
>> On 2015-05-31 18:45, Ryusuke Konishi wrote:
>>> On Fri, 22 May 2015 20:10:05 +0200, Andreas Rohner wrote:
>>>> On 2015-05-20 16:43, Ryusuke Konishi wrote:
>>>>> On Sun,  3 May 2015 12:05:22 +0200, Andreas Rohner wrote:
> [...]
>>>>>  3. The ratio of the threshold "max_segblks" is hard coded to 50%
>>>>>     of blocks_per_segment.  It is not clear if the ratio is good
>>>>>     (versatile).
>>>>
>>>> The interval and percentage could be set in /etc/nilfs_cleanerd.conf.
>>>>
>>>> I chose 50% kind of arbitrarily. My intent was to encourage the GC to
>>>> check the segment again in the future. I guess anything between 25% and
>>>> 75% would also work.
>>>
>>> Sounds reasonable.
>>>
>>> By the way, I am thinking we should move cleanerd into the kernel as
>>> soon as we can.  It is not only inefficient due to the large amount of
>>> data exchanged between kernel and userland, but it also hinders
>>> changes like the ones we are attempting.  We have to worry about
>>> compatibility unnecessarily due to an early design mistake (i.e. the
>>> separation of the GC into userland).
>>
>> I am a bit confused. Is it OK if I implement this functionality in
>> nilfs_cleanerd for this patch set, or would it be better to implement it
>> with a workqueue in the kernel, like you've suggested before?
>>
>> If you intend to move nilfs_cleanerd into the kernel anyway, then the
>> latter would make more sense to me. Which implementation do you prefer
>> for this patch set?
> 
> If nilfs_cleanerd remains in userland, then the userland
> implementation looks better.  But, yes, if we move the cleaner
> into the kernel, then the kernel implementation looks better, because
> we may be able to avoid an unnecessary API change.  It's a dilemma.
> 
> Do you have any good ideas for reducing or hiding the overhead of the
> calibration (i.e. the traversal rewrite of the sufile) with regard to
> the kernel implementation?
> I'm inclined to leave it in the kernel for now.

I haven't looked into that yet, so I don't have a good idea right now. I
will do some experiments. The good thing is that the calibration does
not have to happen all at once, and we do not have to do it all in one
iteration. The only question is how to best split up the work and keep
track of the progress.

If it turns out to be too complicated to do it in the kernel, I will go
for the userspace solution.
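Splitting the traversal across iterations mostly needs a persistent cursor. A rough userspace model of that idea (all names hypothetical, with `visit()` standing in for the per-segment su_nlive_blks fixup):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cursor state for splitting the sufile calibration
 * across several iterations instead of one long traversal. */
struct calib_state {
	uint64_t next_segnum;	/* where the previous pass stopped */
	uint64_t nsegments;	/* total segments on the volume */
};

/* Process at most `budget` segments and remember the position, so each
 * cleaner (or workqueue) iteration only pays a bounded cost.  Returns
 * the number of segments visited; stops early when a full sweep of the
 * volume completes. */
static uint64_t calibrate_step(struct calib_state *cs, uint64_t budget,
			       void (*visit)(uint64_t segnum))
{
	uint64_t done = 0;

	while (done < budget) {
		visit(cs->next_segnum);
		cs->next_segnum = (cs->next_segnum + 1) % cs->nsegments;
		done++;
		if (cs->next_segnum == 0)
			break;		/* completed a full sweep */
	}
	return done;
}
```

In the kernel variant the cursor could live next to the sufile inode state; in the userspace variant it would simply persist across cleanerd wakeups.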

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
>>
>> Regards,
>> Andreas Rohner
>>
>>> Regards,
>>> Ryusuke Konishi


end of thread, other threads:[~2015-06-01 14:45 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-03 10:05 [PATCH v2 0/9] nilfs2: implementation of cost-benefit GC policy Andreas Rohner
     [not found] ` <1430647522-14304-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-03 10:05   ` [PATCH v2 1/9] nilfs2: copy file system feature flags to the nilfs object Andreas Rohner
     [not found]     ` <1430647522-14304-2-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-09  1:54       ` Ryusuke Konishi
     [not found]         ` <20150509.105445.1816655707671265145.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-09 18:41           ` Andreas Rohner
2015-05-03 10:05   ` [PATCH v2 2/9] nilfs2: extend SUFILE on-disk format to enable tracking of live blocks Andreas Rohner
     [not found]     ` <1430647522-14304-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-09  2:24       ` Ryusuke Konishi
     [not found]         ` <20150509.112403.380867861504859109.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-09 18:47           ` Andreas Rohner
2015-05-03 10:05   ` [PATCH v2 3/9] nilfs2: introduce new feature flag for tracking " Andreas Rohner
     [not found]     ` <1430647522-14304-4-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-09  2:28       ` Ryusuke Konishi
     [not found]         ` <20150509.112814.2026089040966346261.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-09 18:53           ` Andreas Rohner
2015-05-03 10:05   ` [PATCH v2 4/9] nilfs2: add kmem_cache for SUFILE cache nodes Andreas Rohner
     [not found]     ` <1430647522-14304-5-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-09  2:41       ` Ryusuke Konishi
     [not found]         ` <20150509.114149.1643183669812667339.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-09 19:10           ` Andreas Rohner
     [not found]             ` <554E5B9D.7070807-hi6Y0CQ0nG0@public.gmane.org>
2015-05-10  0:05               ` Ryusuke Konishi
2015-05-03 10:05   ` [PATCH v2 5/9] nilfs2: add SUFILE cache for changes to su_nlive_blks field Andreas Rohner
     [not found]     ` <1430647522-14304-6-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-09  4:09       ` Ryusuke Konishi
     [not found]         ` <20150509.130900.223492430584220355.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-09 19:39           ` Andreas Rohner
     [not found]             ` <554E626A.2030503-hi6Y0CQ0nG0@public.gmane.org>
2015-05-10  2:09               ` Ryusuke Konishi
2015-05-03 10:05   ` [PATCH v2 6/9] nilfs2: add tracking of block deletions and updates Andreas Rohner
     [not found]     ` <1430647522-14304-7-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-09  7:05       ` Ryusuke Konishi
     [not found]         ` <20150509.160512.1087140271092828536.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-09 15:58           ` Ryusuke Konishi
2015-05-09 20:02           ` Andreas Rohner
     [not found]             ` <554E67C0.1050309-hi6Y0CQ0nG0@public.gmane.org>
2015-05-10  3:17               ` Ryusuke Konishi
2015-05-03 10:05   ` [PATCH v2 7/9] nilfs2: ensure that all dirty blocks are written out Andreas Rohner
     [not found]     ` <1430647522-14304-8-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-09 12:17       ` Ryusuke Konishi
     [not found]         ` <20150509.211741.1463241033923032068.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-09 20:18           ` Andreas Rohner
     [not found]             ` <554E6B7E.8070000-hi6Y0CQ0nG0@public.gmane.org>
2015-05-10  3:31               ` Ryusuke Konishi
2015-05-10 11:04           ` Andreas Rohner
     [not found]             ` <554F3B32.5050004-hi6Y0CQ0nG0@public.gmane.org>
2015-06-01  4:13               ` Ryusuke Konishi
     [not found]                 ` <20150601.131320.1075202804382267027.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-06-01 14:33                   ` Andreas Rohner
2015-05-03 10:05   ` [PATCH v2 8/9] nilfs2: correct live block tracking for GC protection period Andreas Rohner
     [not found]     ` <1430647522-14304-9-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-10 18:15       ` Ryusuke Konishi
     [not found]         ` <20150511.031512.1036934606749624197.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-10 18:23           ` Ryusuke Konishi
     [not found]             ` <20150511.032323.1250231827423193240.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-11  2:07               ` Ryusuke Konishi
     [not found]                 ` <20150511.110726.725667075147435663.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-11 12:32                   ` Andreas Rohner
2015-05-11 13:00           ` Andreas Rohner
     [not found]             ` <5550A7FC.4050709-hi6Y0CQ0nG0@public.gmane.org>
2015-05-12 14:31               ` Ryusuke Konishi
     [not found]                 ` <20150512.233126.2206330706583570566.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-12 15:37                   ` Andreas Rohner
2015-05-03 10:05   ` [PATCH v2 9/9] nilfs2: prevent starvation of segments protected by snapshots Andreas Rohner
     [not found]     ` <1430647522-14304-10-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-05-20 14:43       ` Ryusuke Konishi
     [not found]         ` <20150520.234335.542615158366069430.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-20 15:49           ` Ryusuke Konishi
2015-05-22 18:10           ` Andreas Rohner
     [not found]             ` <555F70FD.6090500-hi6Y0CQ0nG0@public.gmane.org>
2015-05-31 16:45               ` Ryusuke Konishi
     [not found]                 ` <20150601.014550.269184778137708369.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-05-31 18:13                   ` Andreas Rohner
     [not found]                     ` <556B4F58.9080801-hi6Y0CQ0nG0@public.gmane.org>
2015-06-01  0:44                       ` Ryusuke Konishi
     [not found]                         ` <20150601.094441.24658496988941562.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-06-01 14:45                           ` Andreas Rohner
2015-05-03 10:07   ` [PATCH v2 1/5] nilfs-utils: extend SUFILE on-disk format to enable track live blocks Andreas Rohner
2015-05-03 10:07   ` [PATCH v2 2/5] nilfs-utils: add additional flags for nilfs_vdesc Andreas Rohner
2015-05-03 10:07   ` [PATCH v2 3/5] nilfs-utils: add support for tracking live blocks Andreas Rohner
2015-05-03 10:07   ` [PATCH v2 4/5] nilfs-utils: implement the tracking of live blocks for set_suinfo Andreas Rohner
2015-05-03 10:07   ` [PATCH v2 5/5] nilfs-utils: add support for greedy/cost-benefit policies Andreas Rohner
2015-05-05  3:09   ` [PATCH v2 0/9] nilfs2: implementation of cost-benefit GC policy Ryusuke Konishi
