* [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
@ 2015-02-24 19:01 Andreas Rohner
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

Hi everyone!

One of the biggest performance problems of NILFS is its
inefficient Timestamp GC policy. This patch set introduces two new GC
policies, namely Cost-Benefit and Greedy.

The Cost-Benefit policy is nothing new. It has been used with
log-structured file systems for a long time [1]. But it relies on accurate
information about the number of live blocks in a segment, which NILFS
currently does not provide. So this patch set
extends the entries in the SUFILE to include a counter for the number of
live blocks. This counter is decremented whenever a file is deleted or
overwritten.
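
For reference, the cost-benefit rule in [1] rates a segment by
(1 - u) * age / (1 + u), where u is the fraction of live blocks in the
segment. The following is only a minimal sketch of that formula, assuming
a live block counter is available. The names cb_score, nlive, nblocks and
age are hypothetical and do not appear in the patches:

```c
#include <stdint.h>

/*
 * Cost-benefit score as described in [1]:
 *   benefit / cost = (1 - u) * age / (1 + u)
 * where u is the utilization (live blocks / total blocks) of a segment.
 * Hypothetical helper for illustration, not code from this patch set.
 */
static double cb_score(uint32_t nlive, uint32_t nblocks, uint64_t age)
{
	double u;

	if (nblocks == 0)
		return 0.0;		/* empty segment: nothing to rate */
	if (nlive > nblocks)
		nlive = nblocks;	/* clamp a stale or invalid counter */

	u = (double)nlive / (double)nblocks;
	return (1.0 - u) * (double)age / (1.0 + u);
}
```

A cleaner following this rule would pick the segment with the highest
score, so old segments with few live blocks are preferred.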

Except for some tricky parts, the counting of live blocks is quite
trivial. The problem is snapshots. At any time, a checkpoint can be
turned into a snapshot or vice versa. So blocks that are reclaimable at
one point in time can be protected by a snapshot a moment later.

This patch set does not try to track snapshots at all. Instead it uses a
heuristic approach to prevent the worst case scenario. In my benchmarks
the performance is still significantly better than with the Timestamp
policy.

The worst case scenario is the following:

1. Segment 1 is written
2. Snapshot is created
3. GC tries to reclaim Segment 1, but all blocks are protected
   by the Snapshot. The GC has to set the number of live blocks
   to maximum to avoid reclaiming this Segment again in the near future.
4. Snapshot is deleted
5. Segment 1 is reclaimable, but its counter is so high that the GC
   will never try to reclaim it again.

To prevent this kind of starvation I use another field in the SUFILE
entry to store the number of blocks that are protected by a snapshot.
This value is just a heuristic and it is usually set to 0. It is only
written to the SUFILE entry when the GC reclaims a segment. The GC has to
check for snapshots anyway, so we get this information for free. By
storing this information in the SUFILE we can avoid starvation in the
following way:

1. Segment 1 is written
2. Snapshot is created
3. GC tries to reclaim Segment 1, but all blocks are protected
   by the Snapshot. The GC has to set the number of live blocks
   to maximum to avoid reclaiming this Segment again in the near future.
4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
   entry
5. Snapshot is deleted
6. On Snapshot deletion we walk through every entry in the SUFILE and
   reduce the number of live blocks to half, if the number of snapshot
   blocks is bigger than half of the maximum.
7. Segment 1 is reclaimable and the number of live blocks entry is at
   half the maximum. The GC will try to reclaim this segment as soon as
   there are no other better choices.
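
Step 6 above can be sketched roughly as follows. This is a userspace
illustration under my own assumptions, not the kernel code: struct
seg_usage, its field names and relax_after_snapshot_delete() are
hypothetical, and I read "reduce to half" as capping the counter at half
of the maximum number of blocks per segment:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one SUFILE entry (illustration only). */
struct seg_usage {
	uint32_t nlive_blks;	/* live block counter */
	uint32_t nsnap_blks;	/* blocks last seen protected by a snapshot */
};

/*
 * Sketch of step 6: after a snapshot is deleted, walk all entries and
 * cap the live block counter at half the maximum for every segment
 * whose snapshot block count exceeds half the maximum.
 */
static void relax_after_snapshot_delete(struct seg_usage *su, size_t nsegs,
					uint32_t max_blks)
{
	uint32_t half = max_blks / 2;
	size_t i;

	for (i = 0; i < nsegs; i++) {
		if (su[i].nsnap_blks > half && su[i].nlive_blks > half)
			su[i].nlive_blks = half;
	}
}
```

Segments that were never mostly protected by a snapshot keep their
counter, so the walk only affects the candidates for starvation.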

BENCHMARKS:
-----------

My benchmark is quite simple. It consists of a process that replays
real NFS traces at an accelerated speed. It thereby creates relatively
realistic patterns of file creations and deletions. At the same time
multiple snapshots are created and deleted in parallel. I use a 100GB
partition of a Samsung SSD:

WITH SNAPSHOTS EVERY 5 MINUTES:
--------------------------------------------------------------------
                Execution time       Wear (Data written to disk)
Timestamp:      100%                 100%
Cost-Benefit:   80%                  43%

NO SNAPSHOTS:
---------------------------------------------------------------------
                Execution time       Wear (Data written to disk)
Timestamp:      100%                 100%
Cost-Benefit:   70%                  45%

I plan on adding more benchmark results soon.

Best regards,
Andreas Rohner

[1] Mendel Rosenblum and John K. Ousterhout. The design and
    implementation of a log-structured file system. ACM Trans. Comput.
    Syst., 10(1):26–52, February 1992.

Andreas Rohner (9):
  nilfs2: refactor nilfs_sufile_updatev()
  nilfs2: add simple cache for modifications to SUFILE
  nilfs2: extend SUFILE on-disk format to enable counting of live blocks
  nilfs2: add function to modify su_nlive_blks
  nilfs2: add simple tracking of block deletions and updates
  nilfs2: use modification cache to improve performance
  nilfs2: add additional flags for nilfs_vdesc
  nilfs2: improve accuracy and correct for invalid GC values
  nilfs2: prevent starvation of segments protected by snapshots

 fs/nilfs2/bmap.c          |  84 +++++++-
 fs/nilfs2/bmap.h          |  14 +-
 fs/nilfs2/btree.c         |   4 +-
 fs/nilfs2/cpfile.c        |   5 +
 fs/nilfs2/dat.c           |  95 ++++++++-
 fs/nilfs2/dat.h           |   8 +-
 fs/nilfs2/direct.c        |   4 +-
 fs/nilfs2/inode.c         |  24 ++-
 fs/nilfs2/ioctl.c         |  27 ++-
 fs/nilfs2/mdt.c           |   5 +-
 fs/nilfs2/page.h          |   6 +-
 fs/nilfs2/segbuf.c        |   6 +
 fs/nilfs2/segbuf.h        |   3 +
 fs/nilfs2/segment.c       | 155 +++++++++++++-
 fs/nilfs2/segment.h       |   3 +
 fs/nilfs2/sufile.c        | 533 +++++++++++++++++++++++++++++++++++++++++++---
 fs/nilfs2/sufile.h        |  97 +++++++--
 fs/nilfs2/the_nilfs.c     |   4 +
 fs/nilfs2/the_nilfs.h     |  23 ++
 include/linux/nilfs2_fs.h | 122 ++++++++++-
 20 files changed, 1126 insertions(+), 96 deletions(-)

-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev()
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-02-24 19:01   ` Andreas Rohner
       [not found]     ` <1424804504-10914-2-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 2/9] nilfs2: add simple cache for modifications to SUFILE Andreas Rohner
                     ` (9 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch refactors nilfs_sufile_updatev() to take an array of
arbitrary data structures instead of an array of segment numbers as
input parameter. With this change it is reusable for cases where
it is necessary to pass extra data to the update function. The only
requirement for the data structures passed as input is that they
contain the segment number within the structure. By passing the
offset to the segment number as another input parameter,
nilfs_sufile_updatev() can be oblivious to the actual type of the
input structures in the array.
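
The offset-based access can be illustrated in plain userspace C. This is
only a sketch under assumptions: struct seg_delta and count_valid() are
hypothetical names, and the sketch uses char pointers and memcpy()
instead of the void pointer arithmetic that kernel code can rely on:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical element type: extra payload plus a segment number. */
struct seg_delta {
	int64_t delta;
	uint64_t segnum;
};

/*
 * Walk an array of opaque elements of size datasz and read the segment
 * number stored segoff bytes into each element. Returns how many
 * elements carry a segment number below nsegs.
 */
static size_t count_valid(const void *datav, size_t datasz, size_t segoff,
			  size_t ndata, uint64_t nsegs)
{
	const char *p = datav;
	const char *end = p + ndata * datasz;
	size_t n = 0;

	for (; p < end; p += datasz) {
		uint64_t seg;

		memcpy(&seg, p + segoff, sizeof(seg));
		if (seg < nsegs)
			n++;
	}
	return n;
}
```

A plain array of segment numbers is the special case with datasz equal
to sizeof(uint64_t) and segoff equal to 0. For struct seg_delta the
caller would pass offsetof(struct seg_delta, segnum).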

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/sufile.c | 79 ++++++++++++++++++++++++++++++++----------------------
 fs/nilfs2/sufile.h | 39 ++++++++++++++-------------
 2 files changed, 68 insertions(+), 50 deletions(-)

diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 2a869c3..1e8cac6 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -138,14 +138,18 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile)
 /**
  * nilfs_sufile_updatev - modify multiple segment usages at a time
  * @sufile: inode of segment usage file
- * @segnumv: array of segment numbers
- * @nsegs: size of @segnumv array
+ * @datav: array of data elements
+ * @datasz: size of elements in @datav
+ * @segoff: offset to segnum within the elements of @datav
+ * @ndata: size of @datav array
  * @create: creation flag
  * @ndone: place to store number of modified segments on @segnumv
  * @dofunc: primitive operation for the update
  *
  * Description: nilfs_sufile_updatev() repeatedly calls @dofunc
- * against the given array of segments.  The @dofunc is called with
+ * against the given array of data elements. Every data element has
+ * to contain a valid segment number and @segoff should be the offset
+ * to that within the data structure. The @dofunc is called with
  * buffers of a header block and the sufile block in which the target
  * segment usage entry is contained.  If @ndone is given, the number
  * of successfully modified segments from the head is stored in the
@@ -163,50 +167,55 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile)
  *
  * %-EINVAL - Invalid segment usage number
  */
-int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs,
-			 int create, size_t *ndone,
-			 void (*dofunc)(struct inode *, __u64,
+int nilfs_sufile_updatev(struct inode *sufile, void *datav, size_t datasz,
+			 size_t segoff, size_t ndata, int create,
+			 size_t *ndone,
+			 void (*dofunc)(struct inode *, void *,
 					struct buffer_head *,
 					struct buffer_head *))
 {
 	struct buffer_head *header_bh, *bh;
 	unsigned long blkoff, prev_blkoff;
 	__u64 *seg;
-	size_t nerr = 0, n = 0;
+	void *data, *dataend = datav + ndata * datasz;
+	size_t n = 0;
 	int ret = 0;
 
-	if (unlikely(nsegs == 0))
+	if (unlikely(ndata == 0))
 		goto out;
 
-	down_write(&NILFS_MDT(sufile)->mi_sem);
-	for (seg = segnumv; seg < segnumv + nsegs; seg++) {
+
+	for (data = datav; data < dataend; data += datasz) {
+		seg = data + segoff;
 		if (unlikely(*seg >= nilfs_sufile_get_nsegments(sufile))) {
 			printk(KERN_WARNING
 			       "%s: invalid segment number: %llu\n", __func__,
 			       (unsigned long long)*seg);
-			nerr++;
+			ret = -EINVAL;
+			goto out;
 		}
 	}
-	if (nerr > 0) {
-		ret = -EINVAL;
-		goto out_sem;
-	}
 
+	down_write(&NILFS_MDT(sufile)->mi_sem);
 	ret = nilfs_sufile_get_header_block(sufile, &header_bh);
 	if (ret < 0)
 		goto out_sem;
 
-	seg = segnumv;
+	data = datav;
+	seg = data + segoff;
 	blkoff = nilfs_sufile_get_blkoff(sufile, *seg);
 	ret = nilfs_mdt_get_block(sufile, blkoff, create, NULL, &bh);
 	if (ret < 0)
 		goto out_header;
 
 	for (;;) {
-		dofunc(sufile, *seg, header_bh, bh);
+		dofunc(sufile, data, header_bh, bh);
 
-		if (++seg >= segnumv + nsegs)
+		++n;
+		data += datasz;
+		if (data >= dataend)
 			break;
+		seg = data + segoff;
 		prev_blkoff = blkoff;
 		blkoff = nilfs_sufile_get_blkoff(sufile, *seg);
 		if (blkoff == prev_blkoff)
@@ -220,28 +229,30 @@ int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs,
 	}
 	brelse(bh);
 
- out_header:
-	n = seg - segnumv;
+out_header:
 	brelse(header_bh);
- out_sem:
+out_sem:
 	up_write(&NILFS_MDT(sufile)->mi_sem);
- out:
+out:
 	if (ndone)
 		*ndone = n;
 	return ret;
 }
 
-int nilfs_sufile_update(struct inode *sufile, __u64 segnum, int create,
-			void (*dofunc)(struct inode *, __u64,
+int nilfs_sufile_update(struct inode *sufile, void *data, size_t segoff,
+			int create,
+			void (*dofunc)(struct inode *, void *,
 				       struct buffer_head *,
 				       struct buffer_head *))
 {
 	struct buffer_head *header_bh, *bh;
+	__u64 *seg;
 	int ret;
 
-	if (unlikely(segnum >= nilfs_sufile_get_nsegments(sufile))) {
+	seg = data + segoff;
+	if (unlikely(*seg >= nilfs_sufile_get_nsegments(sufile))) {
 		printk(KERN_WARNING "%s: invalid segment number: %llu\n",
-		       __func__, (unsigned long long)segnum);
+		       __func__, (unsigned long long)*seg);
 		return -EINVAL;
 	}
 	down_write(&NILFS_MDT(sufile)->mi_sem);
@@ -250,9 +261,9 @@ int nilfs_sufile_update(struct inode *sufile, __u64 segnum, int create,
 	if (ret < 0)
 		goto out_sem;
 
-	ret = nilfs_sufile_get_segment_usage_block(sufile, segnum, create, &bh);
+	ret = nilfs_sufile_get_segment_usage_block(sufile, *seg, create, &bh);
 	if (!ret) {
-		dofunc(sufile, segnum, header_bh, bh);
+		dofunc(sufile, data, header_bh, bh);
 		brelse(bh);
 	}
 	brelse(header_bh);
@@ -406,12 +417,13 @@ int nilfs_sufile_alloc(struct inode *sufile, __u64 *segnump)
 	return ret;
 }
 
-void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum,
+void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 *data,
 				 struct buffer_head *header_bh,
 				 struct buffer_head *su_bh)
 {
 	struct nilfs_segment_usage *su;
 	void *kaddr;
+	__u64 segnum = *data;
 
 	kaddr = kmap_atomic(su_bh->b_page);
 	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
@@ -431,13 +443,14 @@ void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum,
 	nilfs_mdt_mark_dirty(sufile);
 }
 
-void nilfs_sufile_do_scrap(struct inode *sufile, __u64 segnum,
+void nilfs_sufile_do_scrap(struct inode *sufile, __u64 *data,
 			   struct buffer_head *header_bh,
 			   struct buffer_head *su_bh)
 {
 	struct nilfs_segment_usage *su;
 	void *kaddr;
 	int clean, dirty;
+	__u64 segnum = *data;
 
 	kaddr = kmap_atomic(su_bh->b_page);
 	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
@@ -462,13 +475,14 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 segnum,
 	nilfs_mdt_mark_dirty(sufile);
 }
 
-void nilfs_sufile_do_free(struct inode *sufile, __u64 segnum,
+void nilfs_sufile_do_free(struct inode *sufile, __u64 *data,
 			  struct buffer_head *header_bh,
 			  struct buffer_head *su_bh)
 {
 	struct nilfs_segment_usage *su;
 	void *kaddr;
 	int sudirty;
+	__u64 segnum = *data;
 
 	kaddr = kmap_atomic(su_bh->b_page);
 	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
@@ -596,13 +610,14 @@ int nilfs_sufile_get_stat(struct inode *sufile, struct nilfs_sustat *sustat)
 	return ret;
 }
 
-void nilfs_sufile_do_set_error(struct inode *sufile, __u64 segnum,
+void nilfs_sufile_do_set_error(struct inode *sufile, __u64 *data,
 			       struct buffer_head *header_bh,
 			       struct buffer_head *su_bh)
 {
 	struct nilfs_segment_usage *su;
 	void *kaddr;
 	int suclean;
+	__u64 segnum = *data;
 
 	kaddr = kmap_atomic(su_bh->b_page);
 	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index b8afd72..2df6c71 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -46,21 +46,21 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
 				size_t);
 ssize_t nilfs_sufile_set_suinfo(struct inode *, void *, unsigned , size_t);
 
-int nilfs_sufile_updatev(struct inode *, __u64 *, size_t, int, size_t *,
-			 void (*dofunc)(struct inode *, __u64,
-					struct buffer_head *,
-					struct buffer_head *));
-int nilfs_sufile_update(struct inode *, __u64, int,
-			void (*dofunc)(struct inode *, __u64,
+int nilfs_sufile_updatev(struct inode *, void *, size_t, size_t, size_t, int,
+			 size_t *, void (*dofunc)(struct inode *, void *,
+						  struct buffer_head *,
+						  struct buffer_head *));
+int nilfs_sufile_update(struct inode *, void *, size_t, int,
+			void (*dofunc)(struct inode *, void *,
 				       struct buffer_head *,
 				       struct buffer_head *));
-void nilfs_sufile_do_scrap(struct inode *, __u64, struct buffer_head *,
+void nilfs_sufile_do_scrap(struct inode *, __u64 *, struct buffer_head *,
 			   struct buffer_head *);
-void nilfs_sufile_do_free(struct inode *, __u64, struct buffer_head *,
+void nilfs_sufile_do_free(struct inode *, __u64 *, struct buffer_head *,
 			  struct buffer_head *);
-void nilfs_sufile_do_cancel_free(struct inode *, __u64, struct buffer_head *,
+void nilfs_sufile_do_cancel_free(struct inode *, __u64 *, struct buffer_head *,
 				 struct buffer_head *);
-void nilfs_sufile_do_set_error(struct inode *, __u64, struct buffer_head *,
+void nilfs_sufile_do_set_error(struct inode *, __u64 *, struct buffer_head *,
 			       struct buffer_head *);
 
 int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
@@ -75,7 +75,8 @@ int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
  */
 static inline int nilfs_sufile_scrap(struct inode *sufile, __u64 segnum)
 {
-	return nilfs_sufile_update(sufile, segnum, 1, nilfs_sufile_do_scrap);
+	return nilfs_sufile_update(sufile, &segnum, 0, 1,
+				   (void *)nilfs_sufile_do_scrap);
 }
 
 /**
@@ -85,7 +86,8 @@ static inline int nilfs_sufile_scrap(struct inode *sufile, __u64 segnum)
  */
 static inline int nilfs_sufile_free(struct inode *sufile, __u64 segnum)
 {
-	return nilfs_sufile_update(sufile, segnum, 0, nilfs_sufile_do_free);
+	return nilfs_sufile_update(sufile, &segnum, 0, 0,
+				   (void *)nilfs_sufile_do_free);
 }
 
 /**
@@ -98,8 +100,8 @@ static inline int nilfs_sufile_free(struct inode *sufile, __u64 segnum)
 static inline int nilfs_sufile_freev(struct inode *sufile, __u64 *segnumv,
 				     size_t nsegs, size_t *ndone)
 {
-	return nilfs_sufile_updatev(sufile, segnumv, nsegs, 0, ndone,
-				    nilfs_sufile_do_free);
+	return nilfs_sufile_updatev(sufile, segnumv, sizeof(__u64), 0, nsegs,
+				    0, ndone, (void *)nilfs_sufile_do_free);
 }
 
 /**
@@ -116,8 +118,9 @@ static inline int nilfs_sufile_cancel_freev(struct inode *sufile,
 					    __u64 *segnumv, size_t nsegs,
 					    size_t *ndone)
 {
-	return nilfs_sufile_updatev(sufile, segnumv, nsegs, 0, ndone,
-				    nilfs_sufile_do_cancel_free);
+	return nilfs_sufile_updatev(sufile, segnumv, sizeof(__u64), 0, nsegs,
+				    0, ndone,
+				    (void *)nilfs_sufile_do_cancel_free);
 }
 
 /**
@@ -139,8 +142,8 @@ static inline int nilfs_sufile_cancel_freev(struct inode *sufile,
  */
 static inline int nilfs_sufile_set_error(struct inode *sufile, __u64 segnum)
 {
-	return nilfs_sufile_update(sufile, segnum, 0,
-				   nilfs_sufile_do_set_error);
+	return nilfs_sufile_update(sufile, &segnum, 0, 0,
+				   (void *)nilfs_sufile_do_set_error);
 }
 
 #endif	/* _NILFS_SUFILE_H */
-- 
2.3.0


* [PATCH 2/9] nilfs2: add simple cache for modifications to SUFILE
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev() Andreas Rohner
@ 2015-02-24 19:01   ` Andreas Rohner
       [not found]     ` <1424804504-10914-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 3/9] nilfs2: extend SUFILE on-disk format to enable counting of live blocks Andreas Rohner
                     ` (8 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch adds a simple, small cache that can be used to accumulate
modifications to SUFILE entries. This is for example useful for
keeping track of reclaimable blocks, because most of the
modifications consist of small increments or decrements. By adding
these up and temporarily storing them in a small cache, the
performance can be improved. Additionally lock contention is
reduced.
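
The accumulation idea can be sketched in userspace as follows. This is
not the kernel implementation: the fixed-size array, MC_CAPACITY and the
names struct mod, struct mod_cache and mc_add() are assumptions chosen
for illustration:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MC_CAPACITY 5	/* hypothetical capacity of the small cache */

struct mod {
	uint64_t segnum;	/* segment the modification applies to */
	int64_t value;		/* accumulated signed delta */
};

struct mod_cache {
	struct mod mods[MC_CAPACITY];
	size_t size;		/* number of valid entries in mods */
};

/*
 * Add a delta for segnum. Deltas for a segment that is already cached
 * are summed up, otherwise a new entry is used. Returns -ENOSPC if the
 * cache is full, in which case the caller has to flush it first.
 */
static int mc_add(struct mod_cache *mc, uint64_t segnum, int64_t value)
{
	size_t i;

	for (i = 0; i < mc->size; i++) {
		if (mc->mods[i].segnum == segnum) {
			mc->mods[i].value += value;
			return 0;
		}
	}
	if (mc->size == MC_CAPACITY)
		return -ENOSPC;

	mc->mods[mc->size].segnum = segnum;
	mc->mods[mc->size].value = value;
	mc->size++;
	return 0;
}
```

Many decrements for the same segment collapse into one entry, so a flush
touches each affected SUFILE entry only once.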

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/sufile.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nilfs2/sufile.h |  44 +++++++++++++
 2 files changed, 222 insertions(+)

diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 1e8cac6..a369c30 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -1168,6 +1168,184 @@ out_sem:
 }
 
 /**
+ * nilfs_sufile_mc_init - inits segusg modification cache
+ * @mc: modification cache
+ * @capacity: maximum capacity of the mod cache
+ *
+ * Description: Allocates memory for an array of nilfs_sufile_mod structures
+ * according to @capacity. This memory must be freed with
+ * nilfs_sufile_mc_destroy().
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-EINVAL - Invalid capacity.
+ */
+int nilfs_sufile_mc_init(struct nilfs_sufile_mod_cache *mc, size_t capacity)
+{
+	mc->mc_capacity = capacity;
+	if (!capacity)
+		return -EINVAL;
+
+	mc->mc_mods = kmalloc_array(capacity, sizeof(struct nilfs_sufile_mod),
+				    GFP_KERNEL);
+	if (!mc->mc_mods)
+		return -ENOMEM;
+
+	mc->mc_size = 0;
+
+	return 0;
+}
+
+/**
+ * nilfs_sufile_mc_add - add signed value to segusg modification cache
+ * @mc: modification cache
+ * @segnum: segment number
+ * @value: signed value (can be positive and negative)
+ *
+ * Description: nilfs_sufile_mc_add() tries to add a pair of @segnum and
+ * @value to the modification cache. If the cache already contains a
+ * segment number equal to @segnum, then @value is simply added to the
+ * existing value. This way thousands of small modifications can be
+ * accumulated into one value. If @segnum cannot be found and the
+ * capacity allows it, a new element is added to the cache. If the
+ * capacity is reached an error value is returned.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-ENOSPC - The mod cache has reached its capacity and must be flushed.
+ */
+static inline int nilfs_sufile_mc_add(struct nilfs_sufile_mod_cache *mc,
+				      __u64 segnum, __s64 value)
+{
+	struct nilfs_sufile_mod *mods = mc->mc_mods;
+	int i;
+
+	for (i = 0; i < mc->mc_size; ++i, ++mods) {
+		if (mods->m_segnum == segnum) {
+			mods->m_value += value;
+			return 0;
+		}
+	}
+
+	if (mc->mc_size < mc->mc_capacity) {
+		mods->m_segnum = segnum;
+		mods->m_value = value;
+		mc->mc_size++;
+		return 0;
+	}
+
+	return -ENOSPC;
+}
+
+/**
+ * nilfs_sufile_mc_clear - set mc_size to 0
+ * @mc: modification cache
+ *
+ * Description: nilfs_sufile_mc_clear() sets mc_size to 0, which enables
+ * nilfs_sufile_mc_add() to overwrite the elements in @mc.
+ */
+static inline void nilfs_sufile_mc_clear(struct nilfs_sufile_mod_cache *mc)
+{
+	mc->mc_size = 0;
+}
+
+/**
+ * nilfs_sufile_mc_reset - clear cache and add one element
+ * @mc: modification cache
+ * @segnum: segment number
+ * @value: signed value (can be positive and negative)
+ *
+ * Description: Clears the modification cache in @mc and adds a new pair of
+ * @segnum and @value to it at the same time.
+ */
+static inline void nilfs_sufile_mc_reset(struct nilfs_sufile_mod_cache *mc,
+					 __u64 segnum, __s64 value)
+{
+	struct nilfs_sufile_mod *mods = mc->mc_mods;
+
+	mods->m_segnum = segnum;
+	mods->m_value = value;
+	mc->mc_size = 1;
+}
+
+/**
+ * nilfs_sufile_mc_flush - flush modification cache
+ * @sufile: inode of segment usage file
+ * @mc: modification cache
+ * @dofunc: primitive operation for the update
+ *
+ * Description: nilfs_sufile_mc_flush() flushes the cached modifications
+ * and applies them to the segment usages on disk. It persists the cached
+ * changes, by calling @dofunc for every element in the cache. @dofunc also
+ * determines the interpretation of the cached values and how they should
+ * be applied to the corresponding segment usage entries.
+ *
+ * Return Value: On success, zero is returned.  On error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - Given segment usage is in hole block
+ *
+ * %-EINVAL - Invalid segment usage number
+ */
+static inline int nilfs_sufile_mc_flush(struct inode *sufile,
+					struct nilfs_sufile_mod_cache *mc,
+					void (*dofunc)(struct inode *,
+						struct nilfs_sufile_mod *,
+						struct buffer_head *,
+						struct buffer_head *))
+{
+	return nilfs_sufile_updatev(sufile, mc->mc_mods,
+				    sizeof(struct nilfs_sufile_mod),
+				    offsetof(struct nilfs_sufile_mod, m_segnum),
+				    mc->mc_size, 0, NULL, (void *)dofunc);
+}
+
+/**
+ * nilfs_sufile_mc_update - immediately applies modification
+ * @sufile: inode of segment usage file
+ * @segnum: segment number
+ * @value: signed value (can be positive and negative)
+ * @dofunc: primitive operation for the update
+ *
+ * Description: nilfs_sufile_mc_update() is a helper function, that
+ * creates a temporary nilfs_sufile_mod structure out of @segnum and @value
+ * and immediately flushes it using @dofunc, without the use of a
+ * modification cache.
+ *
+ * Return Value: On success, zero is returned.  On error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - Given segment usage is in hole block
+ *
+ * %-EINVAL - Invalid segment usage number
+ */
+static inline int nilfs_sufile_mc_update(struct inode *sufile,
+					 __u64 segnum, __s64 value,
+					 void (*dofunc)(struct inode *,
+						struct nilfs_sufile_mod *,
+						struct buffer_head *,
+						struct buffer_head *))
+{
+	struct nilfs_sufile_mod m = {.m_segnum = segnum, .m_value = value};
+
+	return nilfs_sufile_update(sufile, &m,
+				   offsetof(struct nilfs_sufile_mod, m_segnum),
+				   0, (void *)dofunc);
+}
+
+/**
  * nilfs_sufile_read - read or get sufile inode
  * @sb: super block instance
  * @susize: size of a segment usage entry
diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index 2df6c71..c446325 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -146,4 +146,48 @@ static inline int nilfs_sufile_set_error(struct inode *sufile, __u64 segnum)
 				   (void *)nilfs_sufile_do_set_error);
 }
 
+#define NILFS_SUFILE_MC_SIZE_DEFAULT	5
+#define NILFS_SUFILE_MC_SIZE_EXT	10
+
+/**
+ * struct nilfs_sufile_mod - segment usage modification
+ * @m_segnum: segment number
+ * @m_value: signed value that gets added to respective segusg field
+ */
+struct nilfs_sufile_mod {
+	__u64 m_segnum;
+	__s64 m_value;
+};
+
+/**
+ * struct nilfs_sufile_mod_cache - segment usage modification cache
+ * @mc_mods: array of modifications to segments
+ * @mc_capacity: maximum number of elements that fit in @mc_mods
+ * @mc_size: number of elements currently filled with valid data
+ */
+struct nilfs_sufile_mod_cache {
+	struct nilfs_sufile_mod *mc_mods;
+	size_t mc_capacity;
+	size_t mc_size;
+};
+
+int nilfs_sufile_mc_init(struct nilfs_sufile_mod_cache *, size_t);
+
+/**
+ * nilfs_sufile_mc_destroy - destroy segusg modification cache
+ * @mc: modification cache
+ *
+ * Description: Releases the memory allocated by nilfs_sufile_mc_init and
+ * sets the size and capacity to 0. @mc should not be used after a call to
+ * this function.
+ */
+static inline void nilfs_sufile_mc_destroy(struct nilfs_sufile_mod_cache *mc)
+{
+	if (mc) {
+		kfree(mc->mc_mods);
+		mc->mc_capacity = 0;
+		mc->mc_size = 0;
+	}
+}
+
 #endif	/* _NILFS_SUFILE_H */
-- 
2.3.0


* [PATCH 3/9] nilfs2: extend SUFILE on-disk format to enable counting of live blocks
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev() Andreas Rohner
  2015-02-24 19:01   ` [PATCH 2/9] nilfs2: add simple cache for modifications to SUFILE Andreas Rohner
@ 2015-02-24 19:01   ` Andreas Rohner
       [not found]     ` <1424804504-10914-4-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 4/9] nilfs2: add function to modify su_nlive_blks Andreas Rohner
                     ` (7 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch extends the nilfs_segment_usage structure with two extra
fields. This changes the on-disk format of the SUFILE, but the nilfs2
metadata files are flexible enough, so that there are no compatibility
issues. The extension is fully backwards compatible. Nevertheless a
feature compatibility flag was added to indicate the on-disk format
change.

The new field su_nlive_blks is used to track the number of live blocks
in the corresponding segment. Its value should never exceed
su_nblocks, which contains the total number of blocks in the segment.

The field su_nlive_lastmod is necessary because of the protection period
used by the GC. It is a timestamp, which contains the last time
su_nlive_blks was modified. For example if a file is deleted, its
blocks are subtracted from su_nlive_blks and are therefore considered to
be reclaimable by the kernel. But the GC additionally protects them with
the protection period. So while su_nlive_blks contains the number of
potentially reclaimable blocks, the actual number depends on the
protection period. To enable GC policies to effectively choose or prefer
segments with unprotected blocks, the timestamp in su_nlive_lastmod is
necessary.
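
How a policy might combine the two fields is sketched below. This is an
assumption about a possible userspace policy, not something this patch
implements: reclaimable_now() and its parameters are hypothetical, and
the pessimistic choice of reporting zero inside the protection period is
only one option:

```c
#include <stdint.h>

/*
 * Hypothetical policy helper: estimate how many blocks of a segment
 * can be reclaimed right now. If the live block counter changed within
 * the protection period, some dead blocks may still be protected, so
 * the sketch pessimistically reports none.
 */
static uint32_t reclaimable_now(uint32_t nblocks, uint32_t nlive_blks,
				uint64_t nlive_lastmod, uint64_t now,
				uint64_t prot_period)
{
	if (nlive_blks >= nblocks)
		return 0;	/* fully live, or counter out of range */
	if (now < nlive_lastmod || now - nlive_lastmod < prot_period)
		return 0;	/* last change is still protected */
	return nblocks - nlive_blks;
}
```

A less pessimistic policy could instead lower the priority of such a
segment rather than skipping it entirely.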

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/ioctl.c         |  4 ++--
 fs/nilfs2/sufile.c        | 38 +++++++++++++++++++++++++++++--
 fs/nilfs2/sufile.h        |  5 ++++
 include/linux/nilfs2_fs.h | 58 ++++++++++++++++++++++++++++++++++++++++-------
 4 files changed, 93 insertions(+), 12 deletions(-)

diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index 9a20e51..f6ee54e 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -1250,7 +1250,7 @@ static int nilfs_ioctl_set_suinfo(struct inode *inode, struct file *filp,
 		goto out;
 
 	ret = -EINVAL;
-	if (argv.v_size < sizeof(struct nilfs_suinfo_update))
+	if (argv.v_size < NILFS_MIN_SUINFO_UPDATE_SIZE)
 		goto out;
 
 	if (argv.v_nmembs > nilfs->ns_nsegments)
@@ -1316,7 +1316,7 @@ long nilfs_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		return nilfs_ioctl_get_cpstat(inode, filp, cmd, argp);
 	case NILFS_IOCTL_GET_SUINFO:
 		return nilfs_ioctl_get_info(inode, filp, cmd, argp,
-					    sizeof(struct nilfs_suinfo),
+					    NILFS_MIN_SEGMENT_USAGE_SIZE,
 					    nilfs_ioctl_do_get_suinfo);
 	case NILFS_IOCTL_SET_SUINFO:
 		return nilfs_ioctl_set_suinfo(inode, filp, cmd, argp);
diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index a369c30..ae08050 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -466,6 +466,11 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 *data,
 	su->su_lastmod = cpu_to_le64(0);
 	su->su_nblocks = cpu_to_le32(0);
 	su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
+	if (nilfs_sufile_ext_supported(sufile)) {
+		su->su_nlive_blks = cpu_to_le32(0);
+		su->su_pad = cpu_to_le32(0);
+		su->su_nlive_lastmod = cpu_to_le64(0);
+	}
 	kunmap_atomic(kaddr);
 
 	nilfs_sufile_mod_counter(header_bh, clean ? (u64)-1 : 0, dirty ? 0 : 1);
@@ -496,7 +501,7 @@ void nilfs_sufile_do_free(struct inode *sufile, __u64 *data,
 	WARN_ON(!nilfs_segment_usage_dirty(su));
 
 	sudirty = nilfs_segment_usage_dirty(su);
-	nilfs_segment_usage_set_clean(su);
+	nilfs_segment_usage_set_clean(su, NILFS_MDT(sufile)->mi_entry_size);
 	kunmap_atomic(kaddr);
 	mark_buffer_dirty(su_bh);
 
@@ -551,6 +556,9 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
 	if (modtime)
 		su->su_lastmod = cpu_to_le64(modtime);
 	su->su_nblocks = cpu_to_le32(nblocks);
+	if (nilfs_sufile_ext_supported(sufile) &&
+	    nblocks < le32_to_cpu(su->su_nlive_blks))
+		su->su_nlive_blks = su->su_nblocks;
 	kunmap_atomic(kaddr);
 
 	mark_buffer_dirty(bh);
@@ -713,7 +721,7 @@ static int nilfs_sufile_truncate_range(struct inode *sufile,
 		nc = 0;
 		for (su = su2, j = 0; j < n; j++, su = (void *)su + susz) {
 			if (nilfs_segment_usage_error(su)) {
-				nilfs_segment_usage_set_clean(su);
+				nilfs_segment_usage_set_clean(su, susz);
 				nc++;
 			}
 		}
@@ -836,6 +844,8 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
 	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
 	void *kaddr;
 	unsigned long nsegs, segusages_per_block;
+	__u64 lm = 0;
+	__u32 nlb = 0;
 	ssize_t n;
 	int ret, i, j;
 
@@ -873,6 +883,17 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
 			if (nilfs_segment_is_active(nilfs, segnum + j))
 				si->sui_flags |=
 					(1UL << NILFS_SEGMENT_USAGE_ACTIVE);
+
+			if (susz >= NILFS_EXT_SEGMENT_USAGE_SIZE) {
+				nlb = le32_to_cpu(su->su_nlive_blks);
+				lm = le64_to_cpu(su->su_nlive_lastmod);
+			}
+
+			if (sisz >= NILFS_EXT_SUINFO_SIZE) {
+				si->sui_nlive_blks = nlb;
+				si->sui_pad = 0;
+				si->sui_nlive_lastmod = lm;
+			}
 		}
 		kunmap_atomic(kaddr);
 		brelse(su_bh);
@@ -916,6 +937,8 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 	int cleansi, cleansu, dirtysi, dirtysu;
 	long ncleaned = 0, ndirtied = 0;
 	int ret = 0;
+	bool sup_ext = (supsz >= NILFS_EXT_SUINFO_UPDATE_SIZE);
+	bool su_ext = nilfs_sufile_ext_supported(sufile);
 
 	if (unlikely(nsup == 0))
 		return ret;
@@ -926,6 +949,9 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 				(~0UL << __NR_NILFS_SUINFO_UPDATE_FIELDS))
 			|| (nilfs_suinfo_update_nblocks(sup) &&
 				sup->sup_sui.sui_nblocks >
+				nilfs->ns_blocks_per_segment)
+			|| (nilfs_suinfo_update_nlive_blks(sup) && sup_ext &&
+				sup->sup_sui.sui_nlive_blks >
 				nilfs->ns_blocks_per_segment))
 			return -EINVAL;
 	}
@@ -953,6 +979,14 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 		if (nilfs_suinfo_update_nblocks(sup))
 			su->su_nblocks = cpu_to_le32(sup->sup_sui.sui_nblocks);
 
+		if (nilfs_suinfo_update_nlive_blks(sup) && sup_ext && su_ext)
+			su->su_nlive_blks =
+				cpu_to_le32(sup->sup_sui.sui_nlive_blks);
+
+		if (nilfs_suinfo_update_nlive_lastmod(sup) && sup_ext && su_ext)
+			su->su_nlive_lastmod =
+				cpu_to_le64(sup->sup_sui.sui_nlive_lastmod);
+
 		if (nilfs_suinfo_update_flags(sup)) {
 			/*
 			 * Active flag is a virtual flag projected by running
diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index c446325..d56498b 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -28,6 +28,11 @@
 #include <linux/nilfs2_fs.h>
 #include "mdt.h"
 
+static inline int
+nilfs_sufile_ext_supported(const struct inode *sufile)
+{
+	return NILFS_MDT(sufile)->mi_entry_size >= NILFS_EXT_SEGMENT_USAGE_SIZE;
+}
 
 static inline unsigned long nilfs_sufile_get_nsegments(struct inode *sufile)
 {
diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
index ff3fea3..5d83c55 100644
--- a/include/linux/nilfs2_fs.h
+++ b/include/linux/nilfs2_fs.h
@@ -220,9 +220,11 @@ struct nilfs_super_block {
  * If there is a bit set in the incompatible feature set that the kernel
  * doesn't know about, it should refuse to mount the filesystem.
  */
-#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT	0x00000001ULL
+#define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
 
-#define NILFS_FEATURE_COMPAT_SUPP	0ULL
+#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
+
+#define NILFS_FEATURE_COMPAT_SUPP	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
@@ -609,19 +611,32 @@ struct nilfs_cpfile_header {
 	  sizeof(struct nilfs_checkpoint) - 1) /			\
 			sizeof(struct nilfs_checkpoint))
 
+#define sub_sizeof(TYPE, MEMBER) (offsetof(TYPE, MEMBER) +		\
+					sizeof(((TYPE *)0)->MEMBER))
+
 /**
  * struct nilfs_segment_usage - segment usage
  * @su_lastmod: last modified timestamp
  * @su_nblocks: number of blocks in segment
  * @su_flags: flags
+ * @su_nlive_blks: number of live blocks in the segment
+ * @su_pad: padding bytes
+ * @su_nlive_lastmod: timestamp nlive_blks was last modified
  */
 struct nilfs_segment_usage {
 	__le64 su_lastmod;
 	__le32 su_nblocks;
 	__le32 su_flags;
+	__le32 su_nlive_blks;
+	__le32 su_pad;
+	__le64 su_nlive_lastmod;
 };
 
-#define NILFS_MIN_SEGMENT_USAGE_SIZE	16
+#define NILFS_MIN_SEGMENT_USAGE_SIZE	\
+	sub_sizeof(struct nilfs_segment_usage, su_flags)
+
+#define NILFS_EXT_SEGMENT_USAGE_SIZE	\
+	sub_sizeof(struct nilfs_segment_usage, su_nlive_lastmod)
 
 /* segment usage flag */
 enum {
@@ -658,11 +673,16 @@ NILFS_SEGMENT_USAGE_FNS(DIRTY, dirty)
 NILFS_SEGMENT_USAGE_FNS(ERROR, error)
 
 static inline void
-nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su)
+nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
 {
 	su->su_lastmod = cpu_to_le64(0);
 	su->su_nblocks = cpu_to_le32(0);
 	su->su_flags = cpu_to_le32(0);
+	if (susz >= NILFS_EXT_SEGMENT_USAGE_SIZE) {
+		su->su_nlive_blks = cpu_to_le32(0);
+		su->su_pad = cpu_to_le32(0);
+		su->su_nlive_lastmod = cpu_to_le64(0);
+	}
 }
 
 static inline int
@@ -684,23 +704,33 @@ struct nilfs_sufile_header {
 	/* ... */
 };
 
-#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET	\
-	((sizeof(struct nilfs_sufile_header) +				\
-	  sizeof(struct nilfs_segment_usage) - 1) /			\
-			 sizeof(struct nilfs_segment_usage))
+#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET(susz)	\
+	((sizeof(struct nilfs_sufile_header) + (susz) - 1) / (susz))
 
 /**
  * nilfs_suinfo - segment usage information
  * @sui_lastmod: timestamp of last modification
  * @sui_nblocks: number of written blocks in segment
  * @sui_flags: segment usage flags
+ * @sui_nlive_blks: number of live blocks in the segment
+ * @sui_pad: padding bytes
+ * @sui_nlive_lastmod: timestamp nlive_blks was last modified
  */
 struct nilfs_suinfo {
 	__u64 sui_lastmod;
 	__u32 sui_nblocks;
 	__u32 sui_flags;
+	__u32 sui_nlive_blks;
+	__u32 sui_pad;
+	__u64 sui_nlive_lastmod;
 };
 
+#define NILFS_MIN_SUINFO_SIZE	\
+	sub_sizeof(struct nilfs_suinfo, sui_flags)
+
+#define NILFS_EXT_SUINFO_SIZE	\
+	sub_sizeof(struct nilfs_suinfo, sui_nlive_lastmod)
+
 #define NILFS_SUINFO_FNS(flag, name)					\
 static inline int							\
 nilfs_suinfo_##name(const struct nilfs_suinfo *si)			\
@@ -736,6 +766,8 @@ enum {
 	NILFS_SUINFO_UPDATE_LASTMOD,
 	NILFS_SUINFO_UPDATE_NBLOCKS,
 	NILFS_SUINFO_UPDATE_FLAGS,
+	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
+	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
 	__NR_NILFS_SUINFO_UPDATE_FIELDS,
 };
 
@@ -759,6 +791,16 @@ nilfs_suinfo_update_##name(const struct nilfs_suinfo_update *sup)	\
 NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
 NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
 NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
+NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
+NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
+
+#define NILFS_MIN_SUINFO_UPDATE_SIZE	\
+	(sub_sizeof(struct nilfs_suinfo_update, sup_reserved) + \
+	NILFS_MIN_SUINFO_SIZE)
+
+#define NILFS_EXT_SUINFO_UPDATE_SIZE	\
+	(sub_sizeof(struct nilfs_suinfo_update, sup_reserved) + \
+	NILFS_EXT_SUINFO_SIZE)
 
 enum {
 	NILFS_CHECKPOINT,
-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 4/9] nilfs2: add function to modify su_nlive_blks
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (2 preceding siblings ...)
  2015-02-24 19:01   ` [PATCH 3/9] nilfs2: extend SUFILE on-disk format to enable counting of live blocks Andreas Rohner
@ 2015-02-24 19:01   ` Andreas Rohner
       [not found]     ` <1424804504-10914-5-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 5/9] nilfs2: add simple tracking of block deletions and updates Andreas Rohner
                     ` (6 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch adds a function to modify the su_nlive_blks field of the
nilfs_segment_usage structure in the SUFILE. By passing a positive or
negative integer, any value can be added to or subtracted from the
su_nlive_blks field. The result is clamped to the range from 0 to
su_nblocks.
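
The add-and-clamp step can be sketched in plain userspace C. This is
only a model of what nilfs_sufile_do_flush_nlive_blks() below does to a
single entry; struct su_model and su_model_mod_nlive() are hypothetical
names, and the endianness conversion and buffer handling of the real
code are left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of one segment usage entry. The real
 * on-disk fields are little-endian (__le32) and live in the SUFILE.
 */
struct su_model {
	uint32_t nblocks;	/* blocks written to the segment */
	uint32_t nlive_blks;	/* estimated number of live blocks */
};

/*
 * Add a signed delta to nlive_blks and clamp the result to
 * [0, nblocks]. Returns 1 if the stored value changed, 0 otherwise.
 */
static int su_model_mod_nlive(struct su_model *su, int64_t delta)
{
	int64_t value;

	assert(su);

	value = (int64_t)su->nlive_blks + delta;
	if (value < 0)
		value = 0;
	else if (value > (int64_t)su->nblocks)
		value = su->nblocks;

	/* do nothing if the value didn't change */
	if ((uint32_t)value == su->nlive_blks)
		return 0;

	su->nlive_blks = (uint32_t)value;
	return 1;
}
```

The return value corresponds to the point where the real function marks
the buffer dirty: an unchanged counter causes no write.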

The use of a modification cache is optional: if a NULL pointer is
passed, the value is added or subtracted directly. Otherwise it is
necessary to call nilfs_sufile_flush_nlive_blks() at some point to make
the modifications persistent.

The modification cache is useful, because it allows for small values,
like simple increments and decrements, to be added up before writing
them to the SUFILE.
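
To illustrate the idea, here is a minimal userspace sketch of such a
cache. It is not the implementation from this series: struct mod_cache,
MC_CAPACITY, mc_add(), mc_flush() and mc_mod() are hypothetical names,
the capacity is arbitrary, and the per-segment counters are a plain
array instead of SUFILE entries. The flushes field only counts how
often the expensive step would have been needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MC_CAPACITY 4	/* hypothetical; small so that flushes are visible */

struct mod {
	uint64_t segnum;	/* segment the delta belongs to */
	int64_t value;		/* accumulated signed delta */
};

struct mod_cache {
	struct mod mods[MC_CAPACITY];
	size_t size;		/* number of used slots */
	unsigned long flushes;	/* how often the counters were touched */
};

/* Merge a delta into the cache. Returns 0 on success, -1 if it is full. */
static int mc_add(struct mod_cache *mc, uint64_t segnum, int64_t value)
{
	size_t i;

	for (i = 0; i < mc->size; i++) {
		if (mc->mods[i].segnum == segnum) {
			mc->mods[i].value += value;
			return 0;
		}
	}
	if (mc->size == MC_CAPACITY)
		return -1;

	mc->mods[mc->size].segnum = segnum;
	mc->mods[mc->size].value = value;
	mc->size++;
	return 0;
}

/* Apply all cached deltas to the counters and empty the cache. */
static void mc_flush(struct mod_cache *mc, int64_t *counters, size_t nsegs)
{
	size_t i;

	if (!mc->size)
		return;

	for (i = 0; i < mc->size; i++) {
		assert(mc->mods[i].segnum < nsegs);
		counters[mc->mods[i].segnum] += mc->mods[i].value;
	}
	mc->size = 0;
	mc->flushes++;
}

/* Add a delta; flush first if the cache has no room for a new segment. */
static void mc_mod(struct mod_cache *mc, int64_t *counters, size_t nsegs,
		   uint64_t segnum, int64_t value)
{
	if (mc_add(mc, segnum, value) == 0)
		return;

	mc_flush(mc, counters, nsegs);
	(void)mc_add(mc, segnum, value);	/* cannot fail, cache is empty */
}
```

In this model, a thousand decrements spread over two segments end up as
a single flush, which is the effect the cache is meant to have.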

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/sufile.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nilfs2/sufile.h |   5 ++
 2 files changed, 143 insertions(+)

diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index ae08050..574a77e 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -1380,6 +1380,144 @@ static inline int nilfs_sufile_mc_update(struct inode *sufile,
 }
 
 /**
+ * nilfs_sufile_do_flush_nlive_blks - apply modification to su_nlive_blks
+ * @sufile: inode of segment usage file
+ * @mod: modification structure
+ * @header_bh: sufile header block
+ * @su_bh: block containing segment usage of m_segnum in @mod
+ *
+ * Description: nilfs_sufile_do_flush_nlive_blks() is a callback function
+ * used with nilfs_sufile_updatev(), that adds m_value in @mod to
+ * the su_nlive_blks field of the segment usage entry belonging to m_segnum.
+ */
+static void nilfs_sufile_do_flush_nlive_blks(struct inode *sufile,
+					     struct nilfs_sufile_mod *mod,
+					     struct buffer_head *header_bh,
+					     struct buffer_head *su_bh)
+{
+	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
+	struct nilfs_segment_usage *su;
+	void *kaddr;
+	__u32 nblocks, nlive_blocks;
+	__u64 segnum = mod->m_segnum;
+	__s64 value = mod->m_value;
+
+	if (!value)
+		return;
+
+	kaddr = kmap_atomic(su_bh->b_page);
+
+	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
+	WARN_ON(nilfs_segment_usage_error(su));
+
+	nblocks = le32_to_cpu(su->su_nblocks);
+	nlive_blocks = le32_to_cpu(su->su_nlive_blks);
+
+	value += nlive_blocks;
+	if (value < 0)
+		value = 0;
+	else if (value > nblocks)
+		value = nblocks;
+
+	/* do nothing if the value didn't change */
+	if (value != nlive_blocks) {
+		su->su_nlive_blks = cpu_to_le32(value);
+		su->su_nlive_lastmod = cpu_to_le64(nilfs->ns_ctime);
+	}
+
+	kunmap_atomic(kaddr);
+
+	if (value != nlive_blocks) {
+		mark_buffer_dirty(su_bh);
+		nilfs_mdt_mark_dirty(sufile);
+	}
+}
+
+/**
+ * nilfs_sufile_flush_nlive_blks - flush mod cache to su_nlive_blks
+ * @sufile: inode of segment usage file
+ * @mc: modification cache
+ *
+ * Description: nilfs_sufile_flush_nlive_blks() flushes the cached
+ * modifications in @mc, by applying them to the su_nlive_blks field of
+ * the corresponding segment usage entries. @mc can be NULL or empty. If
+ * the SUFILE lacks the extension needed to support su_nlive_blks, the
+ * function returns without error.
+ *
+ * Return Value: On success, zero is returned.  On error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - Given segment usage is in hole block
+ *
+ * %-EINVAL - Invalid segment usage number
+ */
+int nilfs_sufile_flush_nlive_blks(struct inode *sufile,
+				  struct nilfs_sufile_mod_cache *mc)
+{
+	int ret;
+
+	if (!mc || !mc->mc_size || !nilfs_sufile_ext_supported(sufile))
+		return 0;
+
+	ret = nilfs_sufile_mc_flush(sufile, mc,
+				    nilfs_sufile_do_flush_nlive_blks);
+
+	nilfs_sufile_mc_clear(mc);
+
+	return ret;
+}
+
+/**
+ * nilfs_sufile_mod_nlive_blks - modify su_nlive_blks using mod cache
+ * @sufile: inode of segment usage file
+ * @mc: modification cache
+ * @segnum: segment number
+ * @value: signed value (can be positive or negative)
+ *
+ * Description: nilfs_sufile_mod_nlive_blks() adds @value to the su_nlive_blks
+ * field of the segment usage entry for @segnum. If @mc is not NULL it first
+ * accumulates all modifications in the cache and flushes it if it is full.
+ * Otherwise the change is applied directly.
+ *
+ * Return Value: On success, zero is returned.  On error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - Given segment usage is in hole block
+ *
+ * %-EINVAL - Invalid segment usage number
+ */
+int nilfs_sufile_mod_nlive_blks(struct inode *sufile,
+				struct nilfs_sufile_mod_cache *mc,
+				__u64 segnum, __s64 value)
+{
+	int ret;
+
+	if (!value || !nilfs_sufile_ext_supported(sufile))
+		return 0;
+
+	if (!mc)
+		return nilfs_sufile_mc_update(sufile, segnum, value,
+				nilfs_sufile_do_flush_nlive_blks);
+
+	if (!nilfs_sufile_mc_add(mc, segnum, value))
+		return 0;
+
+	ret = nilfs_sufile_flush_nlive_blks(sufile, mc);
+
+	nilfs_sufile_mc_reset(mc, segnum, value);
+
+	return ret;
+}
+
+/**
  * nilfs_sufile_read - read or get sufile inode
  * @sb: super block instance
  * @susize: size of a segment usage entry
diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index d56498b..ae3c52a 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -195,4 +195,9 @@ static inline void nilfs_sufile_mc_destroy(struct nilfs_sufile_mod_cache *mc)
 	}
 }
 
+int nilfs_sufile_flush_nlive_blks(struct inode *,
+				  struct nilfs_sufile_mod_cache *);
+int nilfs_sufile_mod_nlive_blks(struct inode *, struct nilfs_sufile_mod_cache *,
+				__u64, __s64);
+
 #endif	/* _NILFS_SUFILE_H */
-- 
2.3.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 5/9] nilfs2: add simple tracking of block deletions and updates
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (3 preceding siblings ...)
  2015-02-24 19:01   ` [PATCH 4/9] nilfs2: add function to modify su_nlive_blks Andreas Rohner
@ 2015-02-24 19:01   ` Andreas Rohner
       [not found]     ` <1424804504-10914-6-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 6/9] nilfs2: use modification cache to improve performance Andreas Rohner
                     ` (5 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch adds simple tracking of block deletions and updates for
all files except the DAT and SUFILE metadata files. It relies on the
fact that NILFS2 keeps an entry in the DAT-File for every block and
stores the checkpoint where it was created and the checkpoint where
it was deleted or overwritten. So whenever a block is deleted or
overwritten, nilfs_dat_commit_end() is called to update the
DAT-Entry. At this point this patch simply decrements the
su_nlive_blks field of the corresponding segment. The value of
su_nlive_blks is set at segment creation time.
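
As a rough userspace sketch of that decrement: the block number stored
in the DAT-Entry is mapped to its segment and that segment's counter is
reduced by one. The sketch assumes fixed-size segments, so that the
segment number is simply the block number divided by the number of
blocks per segment. segnum_of_block() and block_died() are hypothetical
names, and the counters are a plain array rather than SUFILE entries.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: segments are fixed-size and laid out back to back. */
static uint64_t segnum_of_block(uint64_t blocknr, uint32_t blocks_per_segment)
{
	assert(blocks_per_segment != 0);
	return blocknr / blocks_per_segment;
}

/*
 * Model of the accounting done when a block is deleted or overwritten:
 * decrement the live block counter of the segment that holds the old
 * copy, without letting it drop below zero.
 */
static void block_died(uint32_t *nlive, uint64_t nsegs, uint64_t blocknr,
		       uint32_t blocks_per_segment)
{
	uint64_t segnum = segnum_of_block(blocknr, blocks_per_segment);

	assert(segnum < nsegs);
	if (nlive[segnum] > 0)
		nlive[segnum]--;
}
```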

The blocks of the DAT-File cannot be counted this way, because it
does not contain any entries about itself, so the function
nilfs_dat_commit_end() is not called when its blocks are deleted or
overwritten.

The SUFILE cannot be counted this way, because it would lead to a
deadlock. When nilfs_dat_commit_end() is called, bmap->b_sem is
held by code further up the call chain. To decrement the SUFILE entry
the bmap->b_sem of the SUFILE has to be acquired. So if the DAT-Entry
belongs to the SUFILE, both semaphores are the same and a deadlock
will occur. But it works for any other file. So by excluding the
SUFILE from being counted, using the extra parameter count_blocks, a
deadlock can be avoided.

With the above changes the code does not pass the lock dependency
checks of the kernel, because all the locks have the same class and
the order in which the locks are taken is different. Usually it is:

1. down_write(&NILFS_MDT(sufile)->mi_sem);
2. down_write(&bmap->b_sem);

Now it can also be reversed, which leads to failed checks:

1. down_write(&bmap->b_sem); /* lock of a file other than SUFILE */
2. down_write(&NILFS_MDT(sufile)->mi_sem);

But this is safe as long as the first lock down_write(&bmap->b_sem)
doesn't belong to the SUFILE.

It is also possible that two bmap->b_sem locks have to be taken at
the same time:

1. down_write(&bmap->b_sem); /* lock of a file other than SUFILE */
2. down_write(&bmap->b_sem); /* lock of SUFILE */

Since bmap->b_sem of normal files and the bmap->b_sem of the
SUFILE have the same lock class, the above behavior would also lead
to a warning.

Because of this, it is necessary to introduce two new lock classes
for the SUFILE. So the bmap->b_sem of the SUFILE gets its own lock
class and the NILFS_MDT(sufile)->mi_sem as well.

A new feature compatibility flag
NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS was added, so that the new
features introduced by this patch can be enabled or disabled at any
time.
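
Note that the check added in the_nilfs.h requires both compat bits to
be set, since live block tracking is useless without the extended
SUFILE entries. A minimal userspace sketch of that test follows. The
bit values mirror the defines in this patch, while the shortened names
and the uint64_t argument (standing in for ns_feature_compat) are only
for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit positions as in this patch, shortened names. */
#define FEATURE_COMPAT_SUFILE_EXTENSION	(1ULL << 0)
#define FEATURE_COMPAT_TRACK_LIVE_BLKS	(1ULL << 1)

static_assert((FEATURE_COMPAT_SUFILE_EXTENSION &
	       FEATURE_COMPAT_TRACK_LIVE_BLKS) == 0,
	      "feature bits must not overlap");

/* Returns nonzero only if both features are enabled. */
static int feature_track_live_blks(uint64_t feature_compat)
{
	const uint64_t needed = FEATURE_COMPAT_SUFILE_EXTENSION |
				FEATURE_COMPAT_TRACK_LIVE_BLKS;

	return (feature_compat & needed) == needed;
}
```

So clearing either bit in the superblock is enough to switch the
tracking off.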

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/bmap.c          |  8 +++++++-
 fs/nilfs2/bmap.h          |  5 +++--
 fs/nilfs2/btree.c         |  4 +++-
 fs/nilfs2/dat.c           | 25 ++++++++++++++++++++-----
 fs/nilfs2/dat.h           |  7 +++++--
 fs/nilfs2/direct.c        |  4 +++-
 fs/nilfs2/mdt.c           |  5 ++++-
 fs/nilfs2/segbuf.c        |  1 +
 fs/nilfs2/segbuf.h        |  1 +
 fs/nilfs2/segment.c       | 25 +++++++++++++++++++++----
 fs/nilfs2/the_nilfs.c     |  4 ++++
 fs/nilfs2/the_nilfs.h     | 16 ++++++++++++++++
 include/linux/nilfs2_fs.h |  4 +++-
 13 files changed, 91 insertions(+), 18 deletions(-)

diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
index aadbd0b..ecd62ba 100644
--- a/fs/nilfs2/bmap.c
+++ b/fs/nilfs2/bmap.c
@@ -467,6 +467,7 @@ __u64 nilfs_bmap_find_target_in_group(const struct nilfs_bmap *bmap)
 
 static struct lock_class_key nilfs_bmap_dat_lock_key;
 static struct lock_class_key nilfs_bmap_mdt_lock_key;
+static struct lock_class_key nilfs_bmap_sufile_lock_key;
 
 /**
  * nilfs_bmap_read - read a bmap from an inode
@@ -498,12 +499,17 @@ int nilfs_bmap_read(struct nilfs_bmap *bmap, struct nilfs_inode *raw_inode)
 		lockdep_set_class(&bmap->b_sem, &nilfs_bmap_dat_lock_key);
 		break;
 	case NILFS_CPFILE_INO:
-	case NILFS_SUFILE_INO:
 		bmap->b_ptr_type = NILFS_BMAP_PTR_VS;
 		bmap->b_last_allocated_key = 0;
 		bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
 		lockdep_set_class(&bmap->b_sem, &nilfs_bmap_mdt_lock_key);
 		break;
+	case NILFS_SUFILE_INO:
+		bmap->b_ptr_type = NILFS_BMAP_PTR_VS;
+		bmap->b_last_allocated_key = 0;
+		bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
+		lockdep_set_class(&bmap->b_sem, &nilfs_bmap_sufile_lock_key);
+		break;
 	case NILFS_IFILE_INO:
 		lockdep_set_class(&bmap->b_sem, &nilfs_bmap_mdt_lock_key);
 		/* Fall through */
diff --git a/fs/nilfs2/bmap.h b/fs/nilfs2/bmap.h
index b89e680..718c814 100644
--- a/fs/nilfs2/bmap.h
+++ b/fs/nilfs2/bmap.h
@@ -222,8 +222,9 @@ static inline void nilfs_bmap_commit_end_ptr(struct nilfs_bmap *bmap,
 					     struct inode *dat)
 {
 	if (dat)
-		nilfs_dat_commit_end(dat, &req->bpr_req,
-				     bmap->b_ptr_type == NILFS_BMAP_PTR_VS);
+		nilfs_dat_commit_end(dat, &req->bpr_req, NULL,
+				     bmap->b_ptr_type == NILFS_BMAP_PTR_VS,
+				     bmap->b_inode->i_ino != NILFS_SUFILE_INO);
 }
 
 static inline void nilfs_bmap_abort_end_ptr(struct nilfs_bmap *bmap,
diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
index b2e3ff3..2af0519 100644
--- a/fs/nilfs2/btree.c
+++ b/fs/nilfs2/btree.c
@@ -1851,7 +1851,9 @@ static void nilfs_btree_commit_update_v(struct nilfs_bmap *btree,
 
 	nilfs_dat_commit_update(dat, &path[level].bp_oldreq.bpr_req,
 				&path[level].bp_newreq.bpr_req,
-				btree->b_ptr_type == NILFS_BMAP_PTR_VS);
+				NULL,
+				btree->b_ptr_type == NILFS_BMAP_PTR_VS,
+				btree->b_inode->i_ino != NILFS_SUFILE_INO);
 
 	if (buffer_nilfs_node(path[level].bp_bh)) {
 		nilfs_btnode_commit_change_key(
diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index 0d5fada..d2c8f7e 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -28,6 +28,7 @@
 #include "mdt.h"
 #include "alloc.h"
 #include "dat.h"
+#include "sufile.h"
 
 
 #define NILFS_CNO_MIN	((__u64)1)
@@ -185,12 +186,14 @@ int nilfs_dat_prepare_end(struct inode *dat, struct nilfs_palloc_req *req)
 }
 
 void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
-			  int dead)
+			  struct nilfs_sufile_mod_cache *mc,
+			  int dead, int count_blocks)
 {
 	struct nilfs_dat_entry *entry;
-	__u64 start, end;
+	__u64 start, end, segnum;
 	sector_t blocknr;
 	void *kaddr;
+	struct the_nilfs *nilfs;
 
 	kaddr = kmap_atomic(req->pr_entry_bh->b_page);
 	entry = nilfs_palloc_block_get_entry(dat, req->pr_entry_nr,
@@ -206,8 +209,18 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
 
 	if (blocknr == 0)
 		nilfs_dat_commit_free(dat, req);
-	else
+	else {
 		nilfs_dat_commit_entry(dat, req);
+
+		nilfs = dat->i_sb->s_fs_info;
+
+		if (count_blocks && nilfs_feature_track_live_blks(nilfs)) {
+			segnum = nilfs_get_segnum_of_block(nilfs, blocknr);
+
+			nilfs_sufile_mod_nlive_blks(nilfs->ns_sufile, mc,
+						    segnum, -1);
+		}
+	}
 }
 
 void nilfs_dat_abort_end(struct inode *dat, struct nilfs_palloc_req *req)
@@ -246,9 +259,11 @@ int nilfs_dat_prepare_update(struct inode *dat,
 
 void nilfs_dat_commit_update(struct inode *dat,
 			     struct nilfs_palloc_req *oldreq,
-			     struct nilfs_palloc_req *newreq, int dead)
+			     struct nilfs_palloc_req *newreq,
+			     struct nilfs_sufile_mod_cache *mc,
+			     int dead, int count_blocks)
 {
-	nilfs_dat_commit_end(dat, oldreq, dead);
+	nilfs_dat_commit_end(dat, oldreq, mc, dead, count_blocks);
 	nilfs_dat_commit_alloc(dat, newreq);
 }
 
diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
index cbd8e97..d196f09 100644
--- a/fs/nilfs2/dat.h
+++ b/fs/nilfs2/dat.h
@@ -29,6 +29,7 @@
 
 
 struct nilfs_palloc_req;
+struct nilfs_sufile_mod_cache;
 
 int nilfs_dat_translate(struct inode *, __u64, sector_t *);
 
@@ -39,12 +40,14 @@ int nilfs_dat_prepare_start(struct inode *, struct nilfs_palloc_req *);
 void nilfs_dat_commit_start(struct inode *, struct nilfs_palloc_req *,
 			    sector_t);
 int nilfs_dat_prepare_end(struct inode *, struct nilfs_palloc_req *);
-void nilfs_dat_commit_end(struct inode *, struct nilfs_palloc_req *, int);
+void nilfs_dat_commit_end(struct inode *, struct nilfs_palloc_req *,
+			  struct nilfs_sufile_mod_cache *, int, int);
 void nilfs_dat_abort_end(struct inode *, struct nilfs_palloc_req *);
 int nilfs_dat_prepare_update(struct inode *, struct nilfs_palloc_req *,
 			     struct nilfs_palloc_req *);
 void nilfs_dat_commit_update(struct inode *, struct nilfs_palloc_req *,
-			     struct nilfs_palloc_req *, int);
+			     struct nilfs_palloc_req *,
+			     struct nilfs_sufile_mod_cache *, int, int);
 void nilfs_dat_abort_update(struct inode *, struct nilfs_palloc_req *,
 			    struct nilfs_palloc_req *);
 
diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c
index 82f4865..e022cfb 100644
--- a/fs/nilfs2/direct.c
+++ b/fs/nilfs2/direct.c
@@ -272,7 +272,9 @@ static int nilfs_direct_propagate(struct nilfs_bmap *bmap,
 		if (ret < 0)
 			return ret;
 		nilfs_dat_commit_update(dat, &oldreq, &newreq,
-					bmap->b_ptr_type == NILFS_BMAP_PTR_VS);
+				NULL,
+				bmap->b_ptr_type == NILFS_BMAP_PTR_VS,
+				bmap->b_inode->i_ino != NILFS_SUFILE_INO);
 		set_buffer_nilfs_volatile(bh);
 		nilfs_direct_set_ptr(bmap, key, newreq.pr_entry_nr);
 	} else
diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
index 892cf5f..2a81f82 100644
--- a/fs/nilfs2/mdt.c
+++ b/fs/nilfs2/mdt.c
@@ -414,7 +414,7 @@ static const struct address_space_operations def_mdt_aops = {
 
 static const struct inode_operations def_mdt_iops;
 static const struct file_operations def_mdt_fops;
-
+static struct lock_class_key nilfs_mdt_mi_sufile_lock_key;
 
 int nilfs_mdt_init(struct inode *inode, gfp_t gfp_mask, size_t objsz)
 {
@@ -427,6 +427,9 @@ int nilfs_mdt_init(struct inode *inode, gfp_t gfp_mask, size_t objsz)
 	init_rwsem(&mi->mi_sem);
 	inode->i_private = mi;
 
+	if (inode->i_ino == NILFS_SUFILE_INO)
+		lockdep_set_class(&mi->mi_sem, &nilfs_mdt_mi_sufile_lock_key);
+
 	inode->i_mode = S_IFREG;
 	mapping_set_gfp_mask(inode->i_mapping, gfp_mask);
 
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index dc3a9efd..7a6e9cd 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -57,6 +57,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
 	INIT_LIST_HEAD(&segbuf->sb_segsum_buffers);
 	INIT_LIST_HEAD(&segbuf->sb_payload_buffers);
 	segbuf->sb_super_root = NULL;
+	segbuf->sb_nlive_blks_added = 0;
 
 	init_completion(&segbuf->sb_bio_event);
 	atomic_set(&segbuf->sb_err, 0);
diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
index b04f08c..d04da26 100644
--- a/fs/nilfs2/segbuf.h
+++ b/fs/nilfs2/segbuf.h
@@ -83,6 +83,7 @@ struct nilfs_segment_buffer {
 	sector_t		sb_fseg_start, sb_fseg_end;
 	sector_t		sb_pseg_start;
 	unsigned		sb_rest_blocks;
+	__u32			sb_nlive_blks_added;
 
 	/* Buffers */
 	struct list_head	sb_segsum_buffers;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 469086b..6059f53 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1367,9 +1367,10 @@ static void nilfs_free_incomplete_logs(struct list_head *logs,
 }
 
 static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
-					  struct inode *sufile)
+					  struct the_nilfs *nilfs)
 {
 	struct nilfs_segment_buffer *segbuf;
+	struct inode *sufile = nilfs->ns_sufile;
 	unsigned long live_blocks;
 	int ret;
 
@@ -1380,12 +1381,22 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
 						     live_blocks,
 						     sci->sc_seg_ctime);
 		WARN_ON(ret); /* always succeed because the segusage is dirty */
+
+		/* should always be positive */
+		segbuf->sb_nlive_blks_added = segbuf->sb_sum.nfileblk;
+
+		if (nilfs_feature_track_live_blks(nilfs))
+			nilfs_sufile_mod_nlive_blks(sufile, NULL,
+						segbuf->sb_segnum,
+						segbuf->sb_nlive_blks_added);
 	}
 }
 
-static void nilfs_cancel_segusage(struct list_head *logs, struct inode *sufile)
+static void nilfs_cancel_segusage(struct list_head *logs,
+				  struct the_nilfs *nilfs)
 {
 	struct nilfs_segment_buffer *segbuf;
+	struct inode *sufile = nilfs->ns_sufile;
 	int ret;
 
 	segbuf = NILFS_FIRST_SEGBUF(logs);
@@ -1394,6 +1405,12 @@ static void nilfs_cancel_segusage(struct list_head *logs, struct inode *sufile)
 					     segbuf->sb_fseg_start, 0);
 	WARN_ON(ret); /* always succeed because the segusage is dirty */
 
+	if (nilfs_feature_track_live_blks(nilfs))
+		nilfs_sufile_mod_nlive_blks(sufile, NULL, segbuf->sb_segnum,
+					-((__s64)segbuf->sb_nlive_blks_added));
+
+	segbuf->sb_nlive_blks_added = 0;
+
 	list_for_each_entry_continue(segbuf, logs, sb_list) {
 		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
 						     0, 0);
@@ -1729,7 +1746,7 @@ static void nilfs_segctor_abort_construction(struct nilfs_sc_info *sci,
 	nilfs_abort_logs(&logs, ret ? : err);
 
 	list_splice_tail_init(&sci->sc_segbufs, &logs);
-	nilfs_cancel_segusage(&logs, nilfs->ns_sufile);
+	nilfs_cancel_segusage(&logs, nilfs);
 	nilfs_free_incomplete_logs(&logs, nilfs);
 
 	if (sci->sc_stage.flags & NILFS_CF_SUFREED) {
@@ -1995,7 +2012,7 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
 
 			nilfs_segctor_fill_in_super_root(sci, nilfs);
 		}
-		nilfs_segctor_update_segusage(sci, nilfs->ns_sufile);
+		nilfs_segctor_update_segusage(sci, nilfs);
 
 		/* Write partial segments */
 		nilfs_segctor_prepare_write(sci);
diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
index 69bd801..606fdfc 100644
--- a/fs/nilfs2/the_nilfs.c
+++ b/fs/nilfs2/the_nilfs.c
@@ -630,6 +630,10 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data)
 	get_random_bytes(&nilfs->ns_next_generation,
 			 sizeof(nilfs->ns_next_generation));
 
+	nilfs->ns_feature_compat = le64_to_cpu(sbp->s_feature_compat);
+	nilfs->ns_feature_compat_ro = le64_to_cpu(sbp->s_feature_compat_ro);
+	nilfs->ns_feature_incompat = le64_to_cpu(sbp->s_feature_incompat);
+
 	err = nilfs_store_disk_layout(nilfs, sbp);
 	if (err)
 		goto failed_sbh;
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index 23778d3..87cab10 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -101,6 +101,9 @@ enum {
  * @ns_dev_kobj: /sys/fs/<nilfs>/<device>
  * @ns_dev_kobj_unregister: completion state
  * @ns_dev_subgroups: <device> subgroups pointer
+ * @ns_feature_compat: Compatible feature set
+ * @ns_feature_compat_ro: Read-only compatible feature set
+ * @ns_feature_incompat: Incompatible feature set
  */
 struct the_nilfs {
 	unsigned long		ns_flags;
@@ -201,6 +204,11 @@ struct the_nilfs {
 	struct kobject ns_dev_kobj;
 	struct completion ns_dev_kobj_unregister;
 	struct nilfs_sysfs_dev_subgroups *ns_dev_subgroups;
+
+	/* Features */
+	__u64			ns_feature_compat;
+	__u64			ns_feature_compat_ro;
+	__u64			ns_feature_incompat;
 };
 
 #define THE_NILFS_FNS(bit, name)					\
@@ -393,4 +401,12 @@ static inline int nilfs_flush_device(struct the_nilfs *nilfs)
 	return err;
 }
 
+static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
+{
+	return (nilfs->ns_feature_compat &
+		NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS) &&
+		(nilfs->ns_feature_compat &
+		NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
+}
+
 #endif /* _THE_NILFS_H */
diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
index 5d83c55..6ccb2ad 100644
--- a/include/linux/nilfs2_fs.h
+++ b/include/linux/nilfs2_fs.h
@@ -221,10 +221,12 @@ struct nilfs_super_block {
  * doesn't know about, it should refuse to mount the filesystem.
  */
 #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
+#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
 
 #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
 
-#define NILFS_FEATURE_COMPAT_SUPP	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION
+#define NILFS_FEATURE_COMPAT_SUPP	(NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
+				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 6/9] nilfs2: use modification cache to improve performance
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (4 preceding siblings ...)
  2015-02-24 19:01   ` [PATCH 5/9] nilfs2: add simple tracking of block deletions and updates Andreas Rohner
@ 2015-02-24 19:01   ` Andreas Rohner
       [not found]     ` <1424804504-10914-7-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 7/9] nilfs2: add additional flags for nilfs_vdesc Andreas Rohner
                     ` (4 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch adds a small cache to accumulate the small decrements of
the number of live blocks in a segment usage entry. If, for example, a
large file is deleted, the segment usage entry has to be updated for
every single block. But for every decrement, an MDT write lock has to
be acquired, which blocks the entire SUFILE and effectively turns
this lock into a global lock for the whole file system.

The cache tries to ameliorate this situation by adding up the
decrements and increments for a given number of segments and
applying the changes all at once. Because the changes are
accumulated in memory and not immediately written to the SUFILE, the
aforementioned lock only needs to be acquired when the cache is full
or at the end of the respective operation.
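
As a rough userspace illustration of the batching idea described above
(not the code from this patch), the sketch below accumulates per-segment
deltas in a small fixed-size cache and applies them in one pass. All
names (mod_cache, usage_table, mc_add, mc_flush, MC_CAPACITY, NSEGS) are
hypothetical, and the lock is only modelled by a counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSEGS		16	/* hypothetical number of segments */
#define MC_CAPACITY	4	/* hypothetical cache size */

struct mod_entry {
	uint64_t segnum;	/* segment the delta applies to */
	int64_t delta;		/* accumulated change in live blocks */
};

struct mod_cache {
	struct mod_entry entries[MC_CAPACITY];
	size_t used;
};

/* Stand-in for the segment usage counters kept in the SUFILE. */
struct usage_table {
	int64_t nlive[NSEGS];
	unsigned long lock_acquisitions;	/* simulated lock round trips */
};

/* Apply all pending deltas under a single simulated lock acquisition. */
static void mc_flush(struct mod_cache *mc, struct usage_table *tbl)
{
	size_t i;

	if (mc->used == 0)
		return;

	tbl->lock_acquisitions++;
	for (i = 0; i < mc->used; i++)
		tbl->nlive[mc->entries[i].segnum] += mc->entries[i].delta;
	mc->used = 0;
}

/* Record a delta: merge into an existing entry, or flush when full. */
static void mc_add(struct mod_cache *mc, struct usage_table *tbl,
		   uint64_t segnum, int64_t delta)
{
	size_t i;

	assert(segnum < NSEGS);

	for (i = 0; i < mc->used; i++) {
		if (mc->entries[i].segnum == segnum) {
			mc->entries[i].delta += delta;
			return;
		}
	}

	if (mc->used == MC_CAPACITY)
		mc_flush(mc, tbl);

	mc->entries[mc->used].segnum = segnum;
	mc->entries[mc->used].delta = delta;
	mc->used++;
}
```

In this model, ten decrements for the same segment followed by one
mc_flush() cost a single simulated lock acquisition instead of ten.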

To effectively get the pointer to the modification cache from the
high level operations down to the update of the individual blocks in
nilfs_dat_commit_end(), a new pointer b_private was added to struct
nilfs_bmap.
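
The pattern of handing a context pointer down through a struct field,
rather than through every function signature, can be sketched in plain
C as follows. This is a simplified model under assumed names (toy_map,
delta_ctx, private_data); it leaves out the b_sem locking that the
patch performs around the assignment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the modification cache. */
struct delta_ctx {
	long pending;
};

struct toy_map {
	void *private_data;	/* plays the role of b_private */
	unsigned long nkeys;
};

/* Low-level step: takes no context argument, reads it from the map. */
static void toy_map_commit_one(struct toy_map *m)
{
	struct delta_ctx *ctx = m->private_data;

	if (ctx)
		ctx->pending--;	/* accumulate instead of updating directly */
	m->nkeys--;
}

static void toy_map_do_truncate(struct toy_map *m, unsigned long key)
{
	while (m->nkeys > key)
		toy_map_commit_one(m);
}

/*
 * High-level entry point: publish the context only for the duration
 * of the call, then clear it again. A NULL ctx is allowed.
 */
static void toy_map_truncate_with_ctx(struct toy_map *m,
				      struct delta_ctx *ctx,
				      unsigned long key)
{
	assert(m != NULL);

	m->private_data = ctx;
	toy_map_do_truncate(m, key);
	m->private_data = NULL;
}
```

Clearing the pointer before returning keeps a stale context from being
picked up by a later call that does not pass one.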

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/bmap.c    | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/nilfs2/bmap.h    | 11 +++++++-
 fs/nilfs2/btree.c   |  2 +-
 fs/nilfs2/direct.c  |  2 +-
 fs/nilfs2/inode.c   | 22 +++++++++++++---
 fs/nilfs2/segment.c | 26 +++++++++++++++---
 fs/nilfs2/segment.h |  3 +++
 7 files changed, 132 insertions(+), 10 deletions(-)

diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
index ecd62ba..927acb7 100644
--- a/fs/nilfs2/bmap.c
+++ b/fs/nilfs2/bmap.c
@@ -288,6 +288,43 @@ int nilfs_bmap_truncate(struct nilfs_bmap *bmap, unsigned long key)
 }
 
 /**
+ * nilfs_bmap_truncate_with_mc - truncate a bmap to a specified key
+ * @bmap: bmap
+ * @mc: modification cache
+ * @key: key
+ *
+ * Description: nilfs_bmap_truncate_with_mc() removes key-record pairs whose
+ * keys are greater than or equal to @key from @bmap. It has the same
+ * functionality as nilfs_bmap_truncate(), but allows the passing
+ * of a modification cache to update segment usage information.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+int nilfs_bmap_truncate_with_mc(struct nilfs_bmap *bmap,
+				struct nilfs_sufile_mod_cache *mc,
+				unsigned long key)
+{
+	int ret;
+
+	down_write(&bmap->b_sem);
+
+	bmap->b_private = mc;
+
+	ret = nilfs_bmap_do_truncate(bmap, key);
+
+	bmap->b_private = NULL;
+
+	up_write(&bmap->b_sem);
+
+	return nilfs_bmap_convert_error(bmap, __func__, ret);
+}
+
+/**
  * nilfs_bmap_clear - free resources a bmap holds
  * @bmap: bmap
  *
@@ -328,6 +365,43 @@ int nilfs_bmap_propagate(struct nilfs_bmap *bmap, struct buffer_head *bh)
 }
 
 /**
+ * nilfs_bmap_propagate_with_mc - propagate dirty state
+ * @bmap: bmap
+ * @mc: modification cache
+ * @bh: buffer head
+ *
+ * Description: nilfs_bmap_propagate_with_mc() marks the buffers that directly
+ * or indirectly refer to the block specified by @bh dirty. It has
+ * the same functionality as nilfs_bmap_propagate(), but allows the passing
+ * of a modification cache to update segment usage information.
+ *
+ * Return Value: On success, 0 is returned. On error, one of the following
+ * negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+int nilfs_bmap_propagate_with_mc(struct nilfs_bmap *bmap,
+				 struct nilfs_sufile_mod_cache *mc,
+				 struct buffer_head *bh)
+{
+	int ret;
+
+	down_write(&bmap->b_sem);
+
+	bmap->b_private = mc;
+
+	ret = bmap->b_ops->bop_propagate(bmap, bh);
+
+	bmap->b_private = NULL;
+
+	up_write(&bmap->b_sem);
+
+	return nilfs_bmap_convert_error(bmap, __func__, ret);
+}
+
+/**
  * nilfs_bmap_lookup_dirty_buffers -
  * @bmap: bmap
  * @listp: pointer to buffer head list
@@ -490,6 +564,7 @@ int nilfs_bmap_read(struct nilfs_bmap *bmap, struct nilfs_inode *raw_inode)
 
 	init_rwsem(&bmap->b_sem);
 	bmap->b_state = 0;
+	bmap->b_private = NULL;
 	bmap->b_inode = &NILFS_BMAP_I(bmap)->vfs_inode;
 	switch (bmap->b_inode->i_ino) {
 	case NILFS_DAT_INO:
@@ -551,6 +626,7 @@ void nilfs_bmap_init_gc(struct nilfs_bmap *bmap)
 	bmap->b_last_allocated_key = 0;
 	bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
 	bmap->b_state = 0;
+	bmap->b_private = NULL;
 	nilfs_btree_init_gc(bmap);
 }
 
diff --git a/fs/nilfs2/bmap.h b/fs/nilfs2/bmap.h
index 718c814..a8b935a 100644
--- a/fs/nilfs2/bmap.h
+++ b/fs/nilfs2/bmap.h
@@ -36,6 +36,7 @@
 
 
 struct nilfs_bmap;
+struct nilfs_sufile_mod_cache;
 
 /**
  * union nilfs_bmap_ptr_req - request for bmap ptr
@@ -106,6 +107,7 @@ static inline int nilfs_bmap_is_new_ptr(unsigned long ptr)
  * @b_ptr_type: pointer type
  * @b_state: state
  * @b_nchildren_per_block: maximum number of child nodes for non-root nodes
+ * @b_private: pointer for extra data
  */
 struct nilfs_bmap {
 	union {
@@ -120,6 +122,7 @@ struct nilfs_bmap {
 	int b_ptr_type;
 	int b_state;
 	__u16 b_nchildren_per_block;
+	void *b_private;
 };
 
 /* pointer type */
@@ -157,8 +160,14 @@ int nilfs_bmap_insert(struct nilfs_bmap *, unsigned long, unsigned long);
 int nilfs_bmap_delete(struct nilfs_bmap *, unsigned long);
 int nilfs_bmap_last_key(struct nilfs_bmap *, unsigned long *);
 int nilfs_bmap_truncate(struct nilfs_bmap *, unsigned long);
+int nilfs_bmap_truncate_with_mc(struct nilfs_bmap *,
+				struct nilfs_sufile_mod_cache *,
+				unsigned long);
 void nilfs_bmap_clear(struct nilfs_bmap *);
 int nilfs_bmap_propagate(struct nilfs_bmap *, struct buffer_head *);
+int nilfs_bmap_propagate_with_mc(struct nilfs_bmap *,
+				 struct nilfs_sufile_mod_cache *,
+				 struct buffer_head *);
 void nilfs_bmap_lookup_dirty_buffers(struct nilfs_bmap *, struct list_head *);
 int nilfs_bmap_assign(struct nilfs_bmap *, struct buffer_head **,
 		      unsigned long, union nilfs_binfo *);
@@ -222,7 +231,7 @@ static inline void nilfs_bmap_commit_end_ptr(struct nilfs_bmap *bmap,
 					     struct inode *dat)
 {
 	if (dat)
-		nilfs_dat_commit_end(dat, &req->bpr_req, NULL,
+		nilfs_dat_commit_end(dat, &req->bpr_req, bmap->b_private,
 				     bmap->b_ptr_type == NILFS_BMAP_PTR_VS,
 				     bmap->b_inode->i_ino != NILFS_SUFILE_INO);
 }
diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
index 2af0519..c3c883e 100644
--- a/fs/nilfs2/btree.c
+++ b/fs/nilfs2/btree.c
@@ -1851,7 +1851,7 @@ static void nilfs_btree_commit_update_v(struct nilfs_bmap *btree,
 
 	nilfs_dat_commit_update(dat, &path[level].bp_oldreq.bpr_req,
 				&path[level].bp_newreq.bpr_req,
-				NULL,
+				btree->b_private,
 				btree->b_ptr_type == NILFS_BMAP_PTR_VS,
 				btree->b_inode->i_ino != NILFS_SUFILE_INO);
 
diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c
index e022cfb..a716bba 100644
--- a/fs/nilfs2/direct.c
+++ b/fs/nilfs2/direct.c
@@ -272,7 +272,7 @@ static int nilfs_direct_propagate(struct nilfs_bmap *bmap,
 		if (ret < 0)
 			return ret;
 		nilfs_dat_commit_update(dat, &oldreq, &newreq,
-				NULL,
+				bmap->b_private,
 				bmap->b_ptr_type == NILFS_BMAP_PTR_VS,
 				bmap->b_inode->i_ino != NILFS_SUFILE_INO);
 		set_buffer_nilfs_volatile(bh);
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 8b59695..7f6d056 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -34,6 +34,7 @@
 #include "mdt.h"
 #include "cpfile.h"
 #include "ifile.h"
+#include "sufile.h"
 
 /**
  * struct nilfs_iget_args - arguments used during comparison between inodes
@@ -714,29 +715,42 @@ void nilfs_update_inode(struct inode *inode, struct buffer_head *ibh, int flags)
 static void nilfs_truncate_bmap(struct nilfs_inode_info *ii,
 				unsigned long from)
 {
+	struct the_nilfs *nilfs = ii->vfs_inode.i_sb->s_fs_info;
+	struct nilfs_sufile_mod_cache mc, *mcp = NULL;
 	unsigned long b;
 	int ret;
 
 	if (!test_bit(NILFS_I_BMAP, &ii->i_state))
 		return;
+
+	if (nilfs_feature_track_live_blks(nilfs) &&
+	    !nilfs_sufile_mc_init(&mc, NILFS_SUFILE_MC_SIZE_DEFAULT))
+		mcp = &mc;
+
 repeat:
 	ret = nilfs_bmap_last_key(ii->i_bmap, &b);
 	if (ret == -ENOENT)
-		return;
+		goto out_free;
 	else if (ret < 0)
 		goto failed;
 
 	if (b < from)
-		return;
+		goto out_free;
 
 	b -= min_t(unsigned long, NILFS_MAX_TRUNCATE_BLOCKS, b - from);
-	ret = nilfs_bmap_truncate(ii->i_bmap, b);
+	ret = nilfs_bmap_truncate_with_mc(ii->i_bmap, mcp, b);
 	nilfs_relax_pressure_in_lock(ii->vfs_inode.i_sb);
 	if (!ret || (ret == -ENOMEM &&
-		     nilfs_bmap_truncate(ii->i_bmap, b) == 0))
+		     nilfs_bmap_truncate_with_mc(ii->i_bmap, mcp, b) == 0))
 		goto repeat;
 
+out_free:
+	nilfs_sufile_flush_nlive_blks(nilfs->ns_sufile, mcp);
+	nilfs_sufile_mc_destroy(mcp);
+	return;
 failed:
+	nilfs_sufile_flush_nlive_blks(nilfs->ns_sufile, mcp);
+	nilfs_sufile_mc_destroy(mcp);
 	nilfs_warning(ii->vfs_inode.i_sb, __func__,
 		      "failed to truncate bmap (ino=%lu, err=%d)",
 		      ii->vfs_inode.i_ino, ret);
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 6059f53..dc0070c 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -511,7 +511,8 @@ static int nilfs_collect_file_data(struct nilfs_sc_info *sci,
 {
 	int err;
 
-	err = nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
+	err = nilfs_bmap_propagate_with_mc(NILFS_I(inode)->i_bmap,
+					   sci->sc_mc, bh);
 	if (err < 0)
 		return err;
 
@@ -526,7 +527,8 @@ static int nilfs_collect_file_node(struct nilfs_sc_info *sci,
 				   struct buffer_head *bh,
 				   struct inode *inode)
 {
-	return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
+	return nilfs_bmap_propagate_with_mc(NILFS_I(inode)->i_bmap,
+					    sci->sc_mc, bh);
 }
 
 static int nilfs_collect_file_bmap(struct nilfs_sc_info *sci,
@@ -1386,7 +1388,7 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
 		segbuf->sb_nlive_blks_added = segbuf->sb_sum.nfileblk;
 
 		if (nilfs_feature_track_live_blks(nilfs))
-			nilfs_sufile_mod_nlive_blks(sufile, NULL,
+			nilfs_sufile_mod_nlive_blks(sufile, sci->sc_mc,
 						segbuf->sb_segnum,
 						segbuf->sb_nlive_blks_added);
 	}
@@ -2014,6 +2016,9 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
 		}
 		nilfs_segctor_update_segusage(sci, nilfs);
 
+		nilfs_sufile_flush_nlive_blks(nilfs->ns_sufile,
+					      sci->sc_mc);
+
 		/* Write partial segments */
 		nilfs_segctor_prepare_write(sci);
 
@@ -2603,6 +2608,7 @@ static struct nilfs_sc_info *nilfs_segctor_new(struct super_block *sb,
 {
 	struct the_nilfs *nilfs = sb->s_fs_info;
 	struct nilfs_sc_info *sci;
+	int ret;
 
 	sci = kzalloc(sizeof(*sci), GFP_KERNEL);
 	if (!sci)
@@ -2633,6 +2639,18 @@ static struct nilfs_sc_info *nilfs_segctor_new(struct super_block *sb,
 		sci->sc_interval = HZ * nilfs->ns_interval;
 	if (nilfs->ns_watermark)
 		sci->sc_watermark = nilfs->ns_watermark;
+
+	if (nilfs_feature_track_live_blks(nilfs)) {
+		sci->sc_mc = kmalloc(sizeof(*(sci->sc_mc)), GFP_KERNEL);
+		if (sci->sc_mc) {
+			ret = nilfs_sufile_mc_init(sci->sc_mc,
+						   NILFS_SUFILE_MC_SIZE_EXT);
+			if (ret) {
+				kfree(sci->sc_mc);
+				sci->sc_mc = NULL;
+			}
+		}
+	}
 	return sci;
 }
 
@@ -2701,6 +2719,8 @@ static void nilfs_segctor_destroy(struct nilfs_sc_info *sci)
 	down_write(&nilfs->ns_segctor_sem);
 
 	del_timer_sync(&sci->sc_timer);
+	nilfs_sufile_mc_destroy(sci->sc_mc);
+	kfree(sci->sc_mc);
 	kfree(sci);
 }
 
diff --git a/fs/nilfs2/segment.h b/fs/nilfs2/segment.h
index a48d6de..a857527 100644
--- a/fs/nilfs2/segment.h
+++ b/fs/nilfs2/segment.h
@@ -80,6 +80,7 @@ struct nilfs_cstage {
 };
 
 struct nilfs_segment_buffer;
+struct nilfs_sufile_mod_cache;
 
 struct nilfs_segsum_pointer {
 	struct buffer_head     *bh;
@@ -129,6 +130,7 @@ struct nilfs_segsum_pointer {
  * @sc_watermark: Watermark for the number of dirty buffers
  * @sc_timer: Timer for segctord
  * @sc_task: current thread of segctord
+ * @sc_mc: mod cache to add up updates for SUFILE during seg construction
  */
 struct nilfs_sc_info {
 	struct super_block     *sc_super;
@@ -185,6 +187,7 @@ struct nilfs_sc_info {
 
 	struct timer_list	sc_timer;
 	struct task_struct     *sc_task;
+	struct nilfs_sufile_mod_cache *sc_mc;
 };
 
 /* sc_flags */
-- 
2.3.0


* [PATCH 7/9] nilfs2: add additional flags for nilfs_vdesc
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (5 preceding siblings ...)
  2015-02-24 19:01   ` [PATCH 6/9] nilfs2: use modification cache to improve performance Andreas Rohner
@ 2015-02-24 19:01   ` Andreas Rohner
       [not found]     ` <1424804504-10914-8-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 8/9] nilfs2: improve accuracy and correct for invalid GC values Andreas Rohner
                     ` (3 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch adds support for additional bit-flags to the
nilfs_vdesc structure used by the GC to communicate block
information from userspace. The field vd_flags cannot be used for
this purpose, because it does not support bit-flags, and changing
that would break backwards compatibility. Therefore the padding
field is renamed to vd_blk_flags to contain more flags.

Unfortunately older versions of the userspace tools do not
initialize the padding field to zero. So it is necessary to signal
to the kernel if the new vd_blk_flags field contains usable flags
or just random data. Since the vd_period field is only used in
userspace, and is guaranteed to contain a value that is > 0
(NILFS_CNO_MIN == 1), it can be used to give the kernel a hint. So
if the userspace tools set vd_period.p_start to 0, the
vd_blk_flags field will be interpreted.
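
A minimal userspace sketch of this compatibility rule, assuming
simplified stand-in types (toy_vdesc, toy_period) rather than the real
nilfs_vdesc layout: the flags word is trusted only when p_start is 0,
and is cleared otherwise before any bit is tested.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions, mirroring the second enum added by the patch. */
enum {
	TOY_VDESC_SNAPSHOT,
	TOY_VDESC_PROTECTION_PERIOD,
};

struct toy_period {
	uint64_t p_start;
	uint64_t p_end;
};

/* Simplified stand-in for struct nilfs_vdesc. */
struct toy_vdesc {
	struct toy_period vd_period;
	uint32_t vd_flags;
	uint32_t vd_blk_flags;	/* former padding, may hold garbage */
};

/* Old tools leave p_start > 0 and the padding uninitialized. */
static void toy_vdesc_sanitize(struct toy_vdesc *vd)
{
	assert(vd != 0);

	if (vd->vd_period.p_start > 0)
		vd->vd_blk_flags = 0;
}

static int toy_vdesc_snapshot(const struct toy_vdesc *vd)
{
	return !!(vd->vd_blk_flags & (UINT32_C(1) << TOY_VDESC_SNAPSHOT));
}

static int toy_vdesc_protection_period(const struct toy_vdesc *vd)
{
	return !!(vd->vd_blk_flags &
		  (UINT32_C(1) << TOY_VDESC_PROTECTION_PERIOD));
}
```

With this check in place, a descriptor from an older tool never appears
to carry either flag, regardless of what the padding happened to
contain.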

To make the flags available for later stages of the GC process,
they are mapped to corresponding buffer_head flags.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/ioctl.c         | 23 ++++++++++++++++---
 fs/nilfs2/page.h          |  6 ++++-
 include/linux/nilfs2_fs.h | 58 +++++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 81 insertions(+), 6 deletions(-)

diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
index f6ee54e..63b1c77 100644
--- a/fs/nilfs2/ioctl.c
+++ b/fs/nilfs2/ioctl.c
@@ -578,7 +578,7 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
 	struct buffer_head *bh;
 	int ret;
 
-	if (vdesc->vd_flags == 0)
+	if (nilfs_vdesc_data(vdesc))
 		ret = nilfs_gccache_submit_read_data(
 			inode, vdesc->vd_offset, vdesc->vd_blocknr,
 			vdesc->vd_vblocknr, &bh);
@@ -592,7 +592,8 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
 			       "%s: invalid virtual block address (%s): "
 			       "ino=%llu, cno=%llu, offset=%llu, "
 			       "blocknr=%llu, vblocknr=%llu\n",
-			       __func__, vdesc->vd_flags ? "node" : "data",
+			       __func__,
+			       nilfs_vdesc_node(vdesc) ? "node" : "data",
 			       (unsigned long long)vdesc->vd_ino,
 			       (unsigned long long)vdesc->vd_cno,
 			       (unsigned long long)vdesc->vd_offset,
@@ -603,7 +604,8 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
 	if (unlikely(!list_empty(&bh->b_assoc_buffers))) {
 		printk(KERN_CRIT "%s: conflicting %s buffer: ino=%llu, "
 		       "cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu\n",
-		       __func__, vdesc->vd_flags ? "node" : "data",
+		       __func__,
+		       nilfs_vdesc_node(vdesc) ? "node" : "data",
 		       (unsigned long long)vdesc->vd_ino,
 		       (unsigned long long)vdesc->vd_cno,
 		       (unsigned long long)vdesc->vd_offset,
@@ -612,6 +614,12 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
 		brelse(bh);
 		return -EEXIST;
 	}
+
+	if (nilfs_vdesc_snapshot(vdesc))
+		set_buffer_nilfs_snapshot(bh);
+	if (nilfs_vdesc_protection_period(vdesc))
+		set_buffer_nilfs_protection_period(bh);
+
 	list_add_tail(&bh->b_assoc_buffers, buffers);
 	return 0;
 }
@@ -662,6 +670,15 @@ static int nilfs_ioctl_move_blocks(struct super_block *sb,
 		}
 
 		do {
+			/*
+			 * old user space tools do not initialize vd_blk_flags
+			 * if vd_period.p_start > 0 then vd_blk_flags was
+			 * not initialized properly and may contain invalid
+			 * flags
+			 */
+			if (vdesc->vd_period.p_start > 0)
+				vdesc->vd_blk_flags = 0;
+
 			ret = nilfs_ioctl_move_inode_block(inode, vdesc,
 							   &buffers);
 			if (unlikely(ret < 0)) {
diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
index a43b828..b9117e6 100644
--- a/fs/nilfs2/page.h
+++ b/fs/nilfs2/page.h
@@ -36,13 +36,17 @@ enum {
 	BH_NILFS_Volatile,
 	BH_NILFS_Checked,
 	BH_NILFS_Redirected,
+	BH_NILFS_Snapshot,
+	BH_NILFS_Protection_Period,
 };
 
 BUFFER_FNS(NILFS_Node, nilfs_node)		/* nilfs node buffers */
 BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
 BUFFER_FNS(NILFS_Checked, nilfs_checked)	/* buffer is verified */
 BUFFER_FNS(NILFS_Redirected, nilfs_redirected)	/* redirected to a copy */
-
+BUFFER_FNS(NILFS_Snapshot, nilfs_snapshot)	/* belongs to a snapshot */
+BUFFER_FNS(NILFS_Protection_Period, nilfs_protection_period) /* protected by
+							protection period */
 
 int __nilfs_clear_page_dirty(struct page *);
 
diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
index 6ccb2ad..6ffdc09 100644
--- a/include/linux/nilfs2_fs.h
+++ b/include/linux/nilfs2_fs.h
@@ -900,7 +900,7 @@ struct nilfs_vinfo {
  * @vd_blocknr: disk block number
  * @vd_offset: logical block offset inside a file
  * @vd_flags: flags (data or node block)
- * @vd_pad: padding
+ * @vd_blk_flags: additional flags
  */
 struct nilfs_vdesc {
 	__u64 vd_ino;
@@ -910,9 +910,63 @@ struct nilfs_vdesc {
 	__u64 vd_blocknr;
 	__u64 vd_offset;
 	__u32 vd_flags;
-	__u32 vd_pad;
+	/*
+	 * vd_blk_flags needed because vd_flags doesn't support
+	 * bit-flags because of backwards compatibility
+	 */
+	__u32 vd_blk_flags;
 };
 
+/* vdesc flags */
+enum {
+	NILFS_VDESC_DATA,
+	NILFS_VDESC_NODE,
+
+	/* ... */
+};
+enum {
+	NILFS_VDESC_SNAPSHOT,
+	NILFS_VDESC_PROTECTION_PERIOD,
+
+	/* ... */
+
+	__NR_NILFS_VDESC_FIELDS,
+};
+
+#define NILFS_VDESC_FNS(flag, name)					\
+static inline void							\
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_flags = NILFS_VDESC_##flag;				\
+}									\
+static inline int							\
+nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
+{									\
+	return vdesc->vd_flags == NILFS_VDESC_##flag;			\
+}
+
+#define NILFS_VDESC_FNS2(flag, name)					\
+static inline void							\
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_blk_flags |= (1UL << NILFS_VDESC_##flag);		\
+}									\
+static inline void							\
+nilfs_vdesc_clear_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_blk_flags &= ~(1UL << NILFS_VDESC_##flag);		\
+}									\
+static inline int							\
+nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
+{									\
+	return !!(vdesc->vd_blk_flags & (1UL << NILFS_VDESC_##flag));	\
+}
+
+NILFS_VDESC_FNS(DATA, data)
+NILFS_VDESC_FNS(NODE, node)
+NILFS_VDESC_FNS2(SNAPSHOT, snapshot)
+NILFS_VDESC_FNS2(PROTECTION_PERIOD, protection_period)
+
 /**
  * struct nilfs_bdesc - descriptor of disk block number
  * @bd_ino: inode number
-- 
2.3.0


* [PATCH 8/9] nilfs2: improve accuracy and correct for invalid GC values
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (6 preceding siblings ...)
  2015-02-24 19:01   ` [PATCH 7/9] nilfs2: add additional flags for nilfs_vdesc Andreas Rohner
@ 2015-02-24 19:01   ` Andreas Rohner
       [not found]     ` <1424804504-10914-9-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:01   ` [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots Andreas Rohner
                     ` (2 subsequent siblings)
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch improves the accuracy of the su_nlive_blks segment
usage field by also counting the blocks of the DAT-File. A block in
the DAT-File is considered reclaimable as soon as it is overwritten.
There is no need to consider protection periods, snapshots or
checkpoints. So whenever a block is overwritten during segment
construction, the segment usage information of the segment at the
previous location of the block is decremented. To get the previous
location of the block the b_blocknr field of the buffer_head
structure is used.
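
The bookkeeping described above can be modelled in userspace roughly as
follows. The types and names (toy_fs, toy_segbuf, dec_old_location) are
hypothetical, the segment number is assumed to be the block number
divided by the number of blocks per segment, and the direct array
update stands in for the call that goes through the modification cache
in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NSEGS 8	/* hypothetical volume size in segments */

struct toy_fs {
	uint64_t blocks_per_segment;
	uint64_t nsegments;
	int64_t nlive[NSEGS];	/* stand-in for su_nlive_blks */
};

struct toy_segbuf {
	uint64_t segnum;	/* segment currently being written */
	int64_t nlive_diff;	/* plays the role of sb_nlive_blks_diff */
};

/* Assumed mapping from a disk block number to its segment. */
static uint64_t segnum_of_block(const struct toy_fs *fs, uint64_t blocknr)
{
	assert(fs->blocks_per_segment > 0);
	return blocknr / fs->blocks_per_segment;
}

/*
 * The block at old_blocknr was overwritten, so its old copy no longer
 * counts as live in the segment that held it.
 */
static void dec_old_location(struct toy_fs *fs, struct toy_segbuf *sb,
			     uint64_t old_blocknr)
{
	uint64_t segnum = segnum_of_block(fs, old_blocknr);

	if (segnum >= fs->nsegments)
		return;		/* not a valid location, ignore */

	if (segnum == sb->segnum)
		sb->nlive_diff--;	/* folded into the current segment */
	else
		fs->nlive[segnum]--;	/* previous segment loses a block */
}
```

Decrements that hit the segment under construction are kept in the
segment buffer and folded in when that segment's own counter is
written.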

SUFILE blocks are counted in a similar way, but if the GC reads a
block into a GC inode that is already in the cache, there are two
versions of the block. If this happens, both versions are counted,
which can lead to small, seemingly random errors in the counter.
But it is better to accept these small inaccuracies than to not
count the SUFILE at all. These inaccuracies do not occur for the
DAT-File, because it does not need a GC inode.

Additionally, the blocks that belong to a GC inode are rechecked to
see if they are reclaimable. If so, the corresponding counter is
decremented. The blocks were already checked in userspace, but
without the proper locking. It is furthermore possible that blocks
become reclaimable during the cleaning process, for example when
checkpoints are deleted. To improve the performance of these extra
checks, flags from userspace are used to determine reclaimability.
If a block belongs to a snapshot, it cannot be reclaimable, and if
it is within the protection period, it must be counted as
reclaimable.
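
The decision rule for blocks of a GC inode can be written as a small
predicate. This is only a sketch with hypothetical names (gc_block,
gc_block_is_reclaimable, gc_nlive_diff); the three booleans stand in
for the two buffer_head flags and the DAT lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct gc_block {
	bool in_snapshot;		/* flag passed in from userspace */
	bool in_protection_period;	/* flag passed in from userspace */
	bool live_in_dat;		/* result of the DAT liveness check */
};

/*
 * A snapshot block is never reclaimable. Otherwise a block is counted
 * as reclaimable if it is only kept by the protection period or if
 * the DAT no longer considers it live.
 */
static bool gc_block_is_reclaimable(const struct gc_block *b)
{
	assert(b != NULL);

	if (b->in_snapshot)
		return false;
	return b->in_protection_period || !b->live_in_dat;
}

/* Net change to the live block counter for a batch of moved blocks. */
static long gc_nlive_diff(const struct gc_block *blocks, size_t n)
{
	long diff = 0;
	size_t i;

	for (i = 0; i < n; i++)
		if (gc_block_is_reclaimable(&blocks[i]))
			diff--;
	return diff;
}
```

Because the protection period flag is tested first, the more expensive
liveness lookup is only needed for blocks that carry neither flag.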

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/dat.c     |  70 ++++++++++++++++++++++++++++++++++++
 fs/nilfs2/dat.h     |   1 +
 fs/nilfs2/inode.c   |   2 ++
 fs/nilfs2/segbuf.c  |   4 +++
 fs/nilfs2/segbuf.h  |   1 +
 fs/nilfs2/segment.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 6 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
index d2c8f7e..63d079c 100644
--- a/fs/nilfs2/dat.c
+++ b/fs/nilfs2/dat.c
@@ -35,6 +35,17 @@
 #define NILFS_CNO_MAX	(~(__u64)0)
 
 /**
+ * nilfs_dat_entry_is_live - check if @entry is alive
+ * @entry: DAT-Entry
+ *
+ * Description: Simple check if @entry is alive in the current checkpoint.
+ */
+static inline int nilfs_dat_entry_is_live(struct nilfs_dat_entry *entry)
+{
+	return entry->de_end == cpu_to_le64(NILFS_CNO_MAX);
+}
+
+/**
  * struct nilfs_dat_info - on-memory private data of DAT file
  * @mi: on-memory private data of metadata file
  * @palloc_cache: persistent object allocator cache of DAT file
@@ -391,6 +402,65 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, sector_t blocknr)
 }
 
 /**
+ * nilfs_dat_is_live - checks if the virtual block number is alive
+ * @dat: DAT file inode
+ * @vblocknr: virtual block number
+ * @errp: pointer to return code if error occurred
+ *
+ * Description: nilfs_dat_is_live() looks up the DAT-Entry for
+ * @vblocknr and determines if the corresponding block is alive in the current
+ * checkpoint or not. This check ignores snapshots and protection periods.
+ *
+ * Return Value: 1 if vblocknr is alive and 0 otherwise. On error, 0 is
+ * returned and @errp is set to one of the following negative error codes.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ *
+ * %-ENOENT - A block number associated with @vblocknr does not exist.
+ */
+int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr, int *errp)
+{
+	struct buffer_head *entry_bh, *bh;
+	struct nilfs_dat_entry *entry;
+	sector_t blocknr;
+	void *kaddr;
+	int ret = 0, err;
+
+	err = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
+	if (err < 0)
+		goto out;
+
+	if (!nilfs_doing_gc() && buffer_nilfs_redirected(entry_bh)) {
+		bh = nilfs_mdt_get_frozen_buffer(dat, entry_bh);
+		if (bh) {
+			WARN_ON(!buffer_uptodate(bh));
+			put_bh(entry_bh);
+			entry_bh = bh;
+		}
+	}
+
+	kaddr = kmap_atomic(entry_bh->b_page);
+	entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
+	blocknr = le64_to_cpu(entry->de_blocknr);
+	if (blocknr == 0) {
+		err = -ENOENT;
+		goto out_unmap;
+	}
+
+	ret = nilfs_dat_entry_is_live(entry);
+
+out_unmap:
+	kunmap_atomic(kaddr);
+	put_bh(entry_bh);
+out:
+	if (errp)
+		*errp = err;
+	return ret;
+}
+
+/**
  * nilfs_dat_translate - translate a virtual block number to a block number
  * @dat: DAT file inode
  * @vblocknr: virtual block number
diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
index d196f09..3cbddd6 100644
--- a/fs/nilfs2/dat.h
+++ b/fs/nilfs2/dat.h
@@ -32,6 +32,7 @@ struct nilfs_palloc_req;
 struct nilfs_sufile_mod_cache;
 
 int nilfs_dat_translate(struct inode *, __u64, sector_t *);
+int nilfs_dat_is_live(struct inode *, __u64, int *);
 
 int nilfs_dat_prepare_alloc(struct inode *, struct nilfs_palloc_req *);
 void nilfs_dat_commit_alloc(struct inode *, struct nilfs_palloc_req *);
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 7f6d056..5412a76 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -90,6 +90,8 @@ int nilfs_get_block(struct inode *inode, sector_t blkoff,
 	int err = 0, ret;
 	unsigned maxblocks = bh_result->b_size >> inode->i_blkbits;
 
+	bh_result->b_blocknr = 0;
+
 	down_read(&NILFS_MDT(nilfs->ns_dat)->mi_sem);
 	ret = nilfs_bmap_lookup_contig(ii->i_bmap, blkoff, &blknum, maxblocks);
 	up_read(&NILFS_MDT(nilfs->ns_dat)->mi_sem);
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index 7a6e9cd..bbd807b 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -58,6 +58,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
 	INIT_LIST_HEAD(&segbuf->sb_payload_buffers);
 	segbuf->sb_super_root = NULL;
 	segbuf->sb_nlive_blks_added = 0;
+	segbuf->sb_nlive_blks_diff = 0;
 
 	init_completion(&segbuf->sb_bio_event);
 	atomic_set(&segbuf->sb_err, 0);
@@ -451,6 +452,9 @@ static int nilfs_segbuf_submit_bh(struct nilfs_segment_buffer *segbuf,
 
 	len = bio_add_page(wi->bio, bh->b_page, bh->b_size, bh_offset(bh));
 	if (len == bh->b_size) {
+		lock_buffer(bh);
+		map_bh(bh, segbuf->sb_super, wi->blocknr + wi->end);
+		unlock_buffer(bh);
 		wi->end++;
 		return 0;
 	}
diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
index d04da26..4e994f7 100644
--- a/fs/nilfs2/segbuf.h
+++ b/fs/nilfs2/segbuf.h
@@ -84,6 +84,7 @@ struct nilfs_segment_buffer {
 	sector_t		sb_pseg_start;
 	unsigned		sb_rest_blocks;
 	__u32			sb_nlive_blks_added;
+	__s64			sb_nlive_blks_diff;
 
 	/* Buffers */
 	struct list_head	sb_segsum_buffers;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index dc0070c..16c7c36 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1385,7 +1385,8 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
 		WARN_ON(ret); /* always succeed because the segusage is dirty */
 
 		/* should always be positive */
-		segbuf->sb_nlive_blks_added = segbuf->sb_sum.nfileblk;
+		segbuf->sb_nlive_blks_added = segbuf->sb_nlive_blks_diff +
+					      segbuf->sb_sum.nfileblk;
 
 		if (nilfs_feature_track_live_blks(nilfs))
 			nilfs_sufile_mod_nlive_blks(sufile, sci->sc_mc,
@@ -1497,12 +1498,98 @@ static void nilfs_list_replace_buffer(struct buffer_head *old_bh,
 	/* The caller must release old_bh */
 }
 
+/**
+ * nilfs_segctor_dec_nlive_blks_gc - dec. nlive_blks for blocks of GC-Inodes
+ * @dat: dat inode
+ * @segbuf: current segment buffer
+ * @bh: current buffer head
+ *
+ * Description: nilfs_segctor_dec_nlive_blks_gc() is called if the inode to
+ * which @bh belongs is a GC-Inode. In that case it is not necessary to
+ * decrement the previous segment, because at the end of the GC process it
+ * will be freed anyway. It is however necessary to check again if the blocks
+ * are alive here, because the last check was in userspace without the proper
+ * locking. Additionally the blocks protected by the protection period should
+ * be considered reclaimable. It is assumed, that @bh->b_blocknr contains
+ * a virtual block number, which is only true if @bh is part of a GC-Inode.
+ */
+static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
+					    struct nilfs_segment_buffer *segbuf,
+					    struct buffer_head *bh) {
+	bool isreclaimable = buffer_nilfs_protection_period(bh) ||
+				!nilfs_dat_is_live(dat, bh->b_blocknr, NULL);
+
+	if (!buffer_nilfs_snapshot(bh) && isreclaimable)
+		segbuf->sb_nlive_blks_diff--;
+}
+
+/**
+ * nilfs_segctor_dec_nlive_blks_nogc - dec. nlive_blks of segment
+ * @nilfs: the nilfs object
+ * @mc: modification cache
+ * @sb: current segment buffer
+ * @blocknr: current block number
+ *
+ * Description: Gets the segment number of the segment @blocknr belongs to
+ * and decrements the su_nlive_blks field of the corresponding segment usage
+ * entry.
+ */
+static void nilfs_segctor_dec_nlive_blks_nogc(struct the_nilfs *nilfs,
+					      struct nilfs_sufile_mod_cache *mc,
+					      struct nilfs_segment_buffer *sb,
+					      sector_t blocknr)
+{
+	__u64 segnum = nilfs_get_segnum_of_block(nilfs, blocknr);
+
+	if (segnum >= nilfs->ns_nsegments)
+		return;
+
+	if (segnum == sb->sb_segnum)
+		sb->sb_nlive_blks_diff--;
+	else
+		nilfs_sufile_mod_nlive_blks(nilfs->ns_sufile, mc, segnum, -1);
+}
+
+/**
+ * nilfs_segctor_dec_nlive_blks - dec. nlive_blks of previous segment
+ * @nilfs: the nilfs object
+ * @mc: modification cache
+ * @sb: current segment buffer
+ * @bh: current buffer head
+ * @ino: current inode number
+ * @gc_inode: true if current inode is a GC-Inode
+ *
+ * Description: Handles GC-Inodes and normal inodes differently. For normal
+ * inodes @bh->b_blocknr contains the location where the block was read in. If
+ * the block is updated, the old version of it is considered reclaimable and so
+ * the su_nlive_blks field of the segment usage information of the old segment
+ * needs to be decremented. Only the DATFILE and SUFILE are decremented here,
+ * because normal files and other meta data files can be better decremented in
+ * nilfs_dat_commit_end().
+ */
+static void nilfs_segctor_dec_nlive_blks(struct the_nilfs *nilfs,
+					 struct nilfs_sufile_mod_cache *mc,
+					 struct nilfs_segment_buffer *sb,
+					 struct buffer_head *bh,
+					 ino_t ino,
+					 bool gc_inode)
+{
+	bool isnode = buffer_nilfs_node(bh);
+
+	if (gc_inode)
+		nilfs_segctor_dec_nlive_blks_gc(nilfs->ns_dat, sb, bh);
+	else if (ino == NILFS_DAT_INO || (ino == NILFS_SUFILE_INO && !isnode))
+		nilfs_segctor_dec_nlive_blks_nogc(nilfs, mc, sb, bh->b_blocknr);
+}
+
 static int
 nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
 				     struct nilfs_segment_buffer *segbuf,
 				     int mode)
 {
+	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
 	struct inode *inode = NULL;
+	struct nilfs_inode_info *ii;
 	sector_t blocknr;
 	unsigned long nfinfo = segbuf->sb_sum.nfinfo;
 	unsigned long nblocks = 0, ndatablk = 0;
@@ -1512,7 +1599,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
 	union nilfs_binfo binfo;
 	struct buffer_head *bh, *bh_org;
 	ino_t ino = 0;
-	int err = 0;
+	int err = 0, gc_inode = 0, track_live_blks;
+
+	track_live_blks = nilfs_feature_track_live_blks(nilfs);
 
 	if (!nfinfo)
 		goto out;
@@ -1533,6 +1622,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
 
 			inode = bh->b_page->mapping->host;
 
+			ii = NILFS_I(inode);
+			gc_inode = test_bit(NILFS_I_GCINODE, &ii->i_state);
+
 			if (mode == SC_LSEG_DSYNC)
 				sc_op = &nilfs_sc_dsync_ops;
 			else if (ino == NILFS_DAT_INO)
@@ -1540,6 +1632,11 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
 			else /* file blocks */
 				sc_op = &nilfs_sc_file_ops;
 		}
+
+		if (track_live_blks)
+			nilfs_segctor_dec_nlive_blks(nilfs, sci->sc_mc, segbuf,
+						     bh, ino, gc_inode);
+
 		bh_org = bh;
 		get_bh(bh_org);
 		err = nilfs_bmap_assign(NILFS_I(inode)->i_bmap, &bh, blocknr,
-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (7 preceding siblings ...)
  2015-02-24 19:01   ` [PATCH 8/9] nilfs2: improve accuracy and correct for invalid GC values Andreas Rohner
@ 2015-02-24 19:01   ` Andreas Rohner
       [not found]     ` <1424804504-10914-10-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:04   ` [PATCH 1/6] nilfs-utils: extend SUFILE on-disk format to enable track live blocks Andreas Rohner
  2015-02-25  0:18   ` [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy Ryusuke Konishi
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:01 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

It doesn't really matter if the number of reclaimable blocks for a
segment is inaccurate, as long as the overall performance is better than
the simple timestamp algorithm and starvation is prevented.

The following steps will lead to starvation of a segment:

1. The segment is written
2. A snapshot is created
3. The files in the segment are deleted and the number of live
   blocks for the segment is decremented to a very low value
4. The GC tries to free the segment, but there are no reclaimable
   blocks, because they are all protected by the snapshot. To prevent an
   infinite loop the GC has to adjust the number of live blocks to the
   correct value.
5. The snapshot is converted to a checkpoint and the blocks in the
   segment are now reclaimable.
6. The GC will never attempt to clean the segment again, because it
   incorrectly shows up as having a high number of live blocks.

To prevent this, the already existing padding field of the SUFILE entry
is used to track the number of snapshot blocks in the segment. This
number is only set by the GC, since it collects the necessary
information anyway. So there is no need to track which block belongs to
which segment. In step 4 of the list above the GC will set the new field
su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
entries with a big su_nsnapshot_blks field get their su_nlive_blks field
reduced.
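
The starvation scenario and the fix can be sketched in plain userspace C.
This is only an illustration under simplifying assumptions: the struct
seg_sketch type and both helper names are hypothetical, endianness
conversion and locking are left out, and the way the GC corrects the live
block count in step 4 is an assumption rather than the code of this
series. Only the "more than half" threshold and the clamping of the
snapshot counter to the segment size follow the patch below.

```c
#include <stdint.h>

/* Hypothetical in-memory stand-in for one SUFILE entry (CPU byte order). */
struct seg_sketch {
	uint32_t nblocks;        /* blocks written to the segment */
	uint32_t nlive_blks;     /* estimated number of live blocks */
	uint32_t nsnapshot_blks; /* blocks the GC found protected by snapshots */
};

/*
 * Step 4: the GC saw snap_blks snapshot-protected blocks. The counter is
 * accumulated and clamped to nblocks. Raising nlive_blks to at least the
 * snapshot count is an assumption made for this sketch.
 */
static void gc_record_snapshot_blks(struct seg_sketch *s, uint32_t snap_blks)
{
	uint32_t total = s->nsnapshot_blks + snap_blks;

	if (total > s->nblocks)
		total = s->nblocks;
	s->nsnapshot_blks = total;
	if (s->nlive_blks < total)
		s->nlive_blks = total;
}

/*
 * Step 5: after a snapshot was turned back into a checkpoint, reduce the
 * live block count of a potentially starving segment to half a segment.
 * Returns 1 if the entry was modified, 0 otherwise.
 */
static int fix_starving_seg(struct seg_sketch *s, uint32_t blocks_per_segment)
{
	uint32_t half = blocks_per_segment / 2;

	if (s->nsnapshot_blks <= half)
		return 0;	/* not dominated by snapshot blocks */
	if (s->nlive_blks <= half)
		return 0;	/* already attractive enough for the GC */
	s->nlive_blks = half;
	return 1;
}
```

With 2048 blocks per segment, a segment whose counter was pushed up to
2000 in step 4 is reduced to 1024 in step 5, so the GC is more likely to
pick it again.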

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 fs/nilfs2/cpfile.c        |   5 ++
 fs/nilfs2/segbuf.c        |   1 +
 fs/nilfs2/segbuf.h        |   1 +
 fs/nilfs2/segment.c       |   7 ++-
 fs/nilfs2/sufile.c        | 114 ++++++++++++++++++++++++++++++++++++++++++----
 fs/nilfs2/sufile.h        |   4 +-
 fs/nilfs2/the_nilfs.h     |   7 +++
 include/linux/nilfs2_fs.h |  12 +++--
 8 files changed, 136 insertions(+), 15 deletions(-)

diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c
index 0d58075..6b61fd7 100644
--- a/fs/nilfs2/cpfile.c
+++ b/fs/nilfs2/cpfile.c
@@ -28,6 +28,7 @@
 #include <linux/nilfs2_fs.h>
 #include "mdt.h"
 #include "cpfile.h"
+#include "sufile.h"
 
 
 static inline unsigned long
@@ -703,6 +704,7 @@ static int nilfs_cpfile_clear_snapshot(struct inode *cpfile, __u64 cno)
 	struct nilfs_cpfile_header *header;
 	struct nilfs_checkpoint *cp;
 	struct nilfs_snapshot_list *list;
+	struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
 	__u64 next, prev;
 	void *kaddr;
 	int ret;
@@ -784,6 +786,9 @@ static int nilfs_cpfile_clear_snapshot(struct inode *cpfile, __u64 cno)
 	mark_buffer_dirty(header_bh);
 	nilfs_mdt_mark_dirty(cpfile);
 
+	if (nilfs_feature_track_snapshots(nilfs))
+		nilfs_sufile_fix_starving_segs(nilfs->ns_sufile);
+
 	brelse(prev_bh);
 
  out_next:
diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
index bbd807b..a98c576 100644
--- a/fs/nilfs2/segbuf.c
+++ b/fs/nilfs2/segbuf.c
@@ -59,6 +59,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
 	segbuf->sb_super_root = NULL;
 	segbuf->sb_nlive_blks_added = 0;
 	segbuf->sb_nlive_blks_diff = 0;
+	segbuf->sb_nsnapshot_blks = 0;
 
 	init_completion(&segbuf->sb_bio_event);
 	atomic_set(&segbuf->sb_err, 0);
diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
index 4e994f7..7a462c4 100644
--- a/fs/nilfs2/segbuf.h
+++ b/fs/nilfs2/segbuf.h
@@ -85,6 +85,7 @@ struct nilfs_segment_buffer {
 	unsigned		sb_rest_blocks;
 	__u32			sb_nlive_blks_added;
 	__s64			sb_nlive_blks_diff;
+	__u32			sb_nsnapshot_blks;
 
 	/* Buffers */
 	struct list_head	sb_segsum_buffers;
diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
index 16c7c36..b976198 100644
--- a/fs/nilfs2/segment.c
+++ b/fs/nilfs2/segment.c
@@ -1381,6 +1381,7 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
 			(segbuf->sb_pseg_start - segbuf->sb_fseg_start);
 		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
 						     live_blocks,
+						     segbuf->sb_nsnapshot_blks,
 						     sci->sc_seg_ctime);
 		WARN_ON(ret); /* always succeed because the segusage is dirty */
 
@@ -1405,7 +1406,7 @@ static void nilfs_cancel_segusage(struct list_head *logs,
 	segbuf = NILFS_FIRST_SEGBUF(logs);
 	ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
 					     segbuf->sb_pseg_start -
-					     segbuf->sb_fseg_start, 0);
+					     segbuf->sb_fseg_start, 0, 0);
 	WARN_ON(ret); /* always succeed because the segusage is dirty */
 
 	if (nilfs_feature_track_live_blks(nilfs))
@@ -1416,7 +1417,7 @@ static void nilfs_cancel_segusage(struct list_head *logs,
 
 	list_for_each_entry_continue(segbuf, logs, sb_list) {
 		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
-						     0, 0);
+						     0, 0, 0);
 		WARN_ON(ret); /* always succeed */
 	}
 }
@@ -1521,6 +1522,8 @@ static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
 
 	if (!buffer_nilfs_snapshot(bh) && isreclaimable)
 		segbuf->sb_nlive_blks_diff--;
+	if (buffer_nilfs_snapshot(bh))
+		segbuf->sb_nsnapshot_blks++;
 }
 
 /**
diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
index 574a77e..a6dc7bf 100644
--- a/fs/nilfs2/sufile.c
+++ b/fs/nilfs2/sufile.c
@@ -468,7 +468,7 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 *data,
 	su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
 	if (nilfs_sufile_ext_supported(sufile)) {
 		su->su_nlive_blks = cpu_to_le32(0);
-		su->su_pad = cpu_to_le32(0);
+		su->su_nsnapshot_blks = cpu_to_le32(0);
 		su->su_nlive_lastmod = cpu_to_le64(0);
 	}
 	kunmap_atomic(kaddr);
@@ -538,7 +538,8 @@ int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum)
  * @modtime: modification time (option)
  */
 int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
-				   unsigned long nblocks, time_t modtime)
+				   unsigned long nblocks, __u32 nsnapshot_blks,
+				   time_t modtime)
 {
 	struct buffer_head *bh;
 	struct nilfs_segment_usage *su;
@@ -556,9 +557,18 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
 	if (modtime)
 		su->su_lastmod = cpu_to_le64(modtime);
 	su->su_nblocks = cpu_to_le32(nblocks);
-	if (nilfs_sufile_ext_supported(sufile) &&
-	    nblocks < le32_to_cpu(su->su_nlive_blks))
-		su->su_nlive_blks = su->su_nblocks;
+	if (nilfs_sufile_ext_supported(sufile)) {
+		if (nblocks < le32_to_cpu(su->su_nlive_blks))
+			su->su_nlive_blks = su->su_nblocks;
+
+		nsnapshot_blks += le32_to_cpu(su->su_nsnapshot_blks);
+
+		if (nblocks < nsnapshot_blks)
+			nsnapshot_blks = nblocks;
+
+		su->su_nsnapshot_blks = cpu_to_le32(nsnapshot_blks);
+	}
+
 	kunmap_atomic(kaddr);
 
 	mark_buffer_dirty(bh);
@@ -891,7 +901,7 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
 
 			if (sisz >= NILFS_EXT_SUINFO_SIZE) {
 				si->sui_nlive_blks = nlb;
-				si->sui_pad = 0;
+				si->sui_nsnapshot_blks = 0;
 				si->sui_nlive_lastmod = lm;
 			}
 		}
@@ -939,6 +949,7 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 	int ret = 0;
 	bool sup_ext = (supsz >= NILFS_EXT_SUINFO_UPDATE_SIZE);
 	bool su_ext = nilfs_sufile_ext_supported(sufile);
+	bool supsu_ext = sup_ext && su_ext;
 
 	if (unlikely(nsup == 0))
 		return ret;
@@ -952,6 +963,10 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 				nilfs->ns_blocks_per_segment)
 			|| (nilfs_suinfo_update_nlive_blks(sup) && sup_ext &&
 				sup->sup_sui.sui_nlive_blks >
+				nilfs->ns_blocks_per_segment)
+			|| (nilfs_suinfo_update_nsnapshot_blks(sup) &&
+				sup_ext &&
+				sup->sup_sui.sui_nsnapshot_blks >
 				nilfs->ns_blocks_per_segment))
 			return -EINVAL;
 	}
@@ -979,11 +994,15 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 		if (nilfs_suinfo_update_nblocks(sup))
 			su->su_nblocks = cpu_to_le32(sup->sup_sui.sui_nblocks);
 
-		if (nilfs_suinfo_update_nlive_blks(sup) && sup_ext && su_ext)
+		if (nilfs_suinfo_update_nlive_blks(sup) && supsu_ext)
 			su->su_nlive_blks =
 				cpu_to_le32(sup->sup_sui.sui_nlive_blks);
 
-		if (nilfs_suinfo_update_nlive_lastmod(sup) && sup_ext && su_ext)
+		if (nilfs_suinfo_update_nsnapshot_blks(sup) && supsu_ext)
+			su->su_nsnapshot_blks =
+				cpu_to_le32(sup->sup_sui.sui_nsnapshot_blks);
+
+		if (nilfs_suinfo_update_nlive_lastmod(sup) && supsu_ext)
 			su->su_nlive_lastmod =
 				cpu_to_le64(sup->sup_sui.sui_nlive_lastmod);
 
@@ -1050,6 +1069,85 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
 }
 
 /**
+ * nilfs_sufile_fix_starving_segs - fix potentially starving segments
+ * @sufile: inode of segment usage file
+ *
+ * Description: Scans for segments, which are potentially starving and
+ * reduces the number of live blocks to less than half of the maximum
+ * number of blocks in a segment. This way the segment is more likely to be
+ * chosen by the GC. A segment is marked as potentially starving, if more
+ * than half of the blocks it contains are protected by snapshots.
+ *
+ * Return Value: On success, 0 is returned and on error, one of the
+ * following negative error codes is returned.
+ *
+ * %-EIO - I/O error.
+ *
+ * %-ENOMEM - Insufficient amount of memory available.
+ */
+int nilfs_sufile_fix_starving_segs(struct inode *sufile)
+{
+	struct buffer_head *su_bh;
+	struct nilfs_segment_usage *su;
+	size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
+	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
+	void *kaddr;
+	unsigned long nsegs, segusages_per_block;
+	__u32 max_segblks = nilfs->ns_blocks_per_segment / 2;
+	__u64 segnum = 0;
+	int ret = 0, blkdirty, dirty = 0;
+
+	down_write(&NILFS_MDT(sufile)->mi_sem);
+
+	segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
+	nsegs = nilfs_sufile_get_nsegments(sufile);
+
+	while (segnum < nsegs) {
+		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
+							 nsegs - 1);
+
+		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
+							   0, &su_bh);
+		if (ret < 0) {
+			if (ret != -ENOENT)
+				goto out;
+			/* hole */
+			segnum += n;
+			continue;
+		}
+
+		kaddr = kmap_atomic(su_bh->b_page);
+		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
+							  su_bh, kaddr);
+		blkdirty = 0;
+		for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
+			if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
+				continue;
+
+			if (le32_to_cpu(su->su_nlive_blks) <= max_segblks)
+				continue;
+
+			su->su_nlive_blks = cpu_to_le32(max_segblks);
+			blkdirty = 1;
+		}
+
+		kunmap_atomic(kaddr);
+		if (blkdirty) {
+			mark_buffer_dirty(su_bh);
+			dirty = 1;
+		}
+		put_bh(su_bh);
+	}
+
+out:
+	if (dirty)
+		nilfs_mdt_mark_dirty(sufile);
+
+	up_write(&NILFS_MDT(sufile)->mi_sem);
+	return ret;
+}
+
+/**
  * nilfs_sufile_trim_fs() - trim ioctl handle function
  * @sufile: inode of segment usage file
  * @range: fstrim_range structure
diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
index ae3c52a..e831622 100644
--- a/fs/nilfs2/sufile.h
+++ b/fs/nilfs2/sufile.h
@@ -45,7 +45,8 @@ int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
 int nilfs_sufile_alloc(struct inode *, __u64 *);
 int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
 int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
-				   unsigned long nblocks, time_t modtime);
+				   unsigned long nblocks, __u32 nsnapshot_blks,
+				   time_t modtime);
 int nilfs_sufile_get_stat(struct inode *, struct nilfs_sustat *);
 ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
 				size_t);
@@ -72,6 +73,7 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
 int nilfs_sufile_read(struct super_block *sb, size_t susize,
 		      struct nilfs_inode *raw_inode, struct inode **inodep);
 int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
+int nilfs_sufile_fix_starving_segs(struct inode *);
 
 /**
  * nilfs_sufile_scrap - make a segment garbage
diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
index 87cab10..3d495f1 100644
--- a/fs/nilfs2/the_nilfs.h
+++ b/fs/nilfs2/the_nilfs.h
@@ -409,4 +409,11 @@ static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
 		NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
 }
 
+static inline int nilfs_feature_track_snapshots(struct the_nilfs *nilfs)
+{
+	return (nilfs->ns_feature_compat &
+		NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS) &&
+		nilfs_feature_track_live_blks(nilfs);
+}
+
 #endif /* _THE_NILFS_H */
diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
index 6ffdc09..a3c7593 100644
--- a/include/linux/nilfs2_fs.h
+++ b/include/linux/nilfs2_fs.h
@@ -222,11 +222,13 @@ struct nilfs_super_block {
  */
 #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
 #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
+#define NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS		(1ULL << 2)
 
 #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
 
 #define NILFS_FEATURE_COMPAT_SUPP	(NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
-				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
+				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS \
+				| NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS)
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
@@ -630,7 +632,7 @@ struct nilfs_segment_usage {
 	__le32 su_nblocks;
 	__le32 su_flags;
 	__le32 su_nlive_blks;
-	__le32 su_pad;
+	__le32 su_nsnapshot_blks;
 	__le64 su_nlive_lastmod;
 };
 
@@ -682,7 +684,7 @@ nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
 	su->su_flags = cpu_to_le32(0);
 	if (susz >= NILFS_EXT_SEGMENT_USAGE_SIZE) {
 		su->su_nlive_blks = cpu_to_le32(0);
-		su->su_pad = cpu_to_le32(0);
+		su->su_nsnapshot_blks = cpu_to_le32(0);
 		su->su_nlive_lastmod = cpu_to_le64(0);
 	}
 }
@@ -723,7 +725,7 @@ struct nilfs_suinfo {
 	__u32 sui_nblocks;
 	__u32 sui_flags;
 	__u32 sui_nlive_blks;
-	__u32 sui_pad;
+	__u32 sui_nsnapshot_blks;
 	__u64 sui_nlive_lastmod;
 };
 
@@ -770,6 +772,7 @@ enum {
 	NILFS_SUINFO_UPDATE_FLAGS,
 	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
 	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
+	NILFS_SUINFO_UPDATE_NSNAPSHOT_BLKS,
 	__NR_NILFS_SUINFO_UPDATE_FIELDS,
 };
 
@@ -794,6 +797,7 @@ NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
 NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
 NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
 NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
+NILFS_SUINFO_UPDATE_FNS(NSNAPSHOT_BLKS, nsnapshot_blks)
 NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
 
 #define NILFS_MIN_SUINFO_UPDATE_SIZE	\
-- 
2.3.0


* [PATCH 1/6] nilfs-utils: extend SUFILE on-disk format to enable track live blocks
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (8 preceding siblings ...)
  2015-02-24 19:01   ` [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots Andreas Rohner
@ 2015-02-24 19:04   ` Andreas Rohner
       [not found]     ` <1424804659-10986-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-25  0:18   ` [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy Ryusuke Konishi
  10 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:04 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch extends the nilfs_segment_usage structure with two extra
fields. This changes the on-disk format of the SUFILE, but the nilfs2
metadata files are flexible enough, so that there are no compatibility
issues. The extension is fully backwards compatible. Nevertheless a
feature compatibility flag was added to indicate the on-disk format
change.
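
As a rough illustration of why the larger entry stays compatible, the
sketch below derives the minimum and extended entry sizes from the
structure layout and steps through a block using the entry size stored on
disk instead of sizeof(). It is a simplified userspace model: the struct
uses fixed-width host types without endianness handling, and the helpers
entry_has_ext_fields() and entry_at() are hypothetical names, not part of
nilfs-utils.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified copy of the extended entry layout (host byte order). */
struct segment_usage_sketch {
	uint64_t su_lastmod;
	uint32_t su_nblocks;
	uint32_t su_flags;
	uint32_t su_nlive_blks;
	uint32_t su_pad;
	uint64_t su_nlive_lastmod;
};

/* Size of TYPE up to and including MEMBER. */
#define SUB_SIZEOF(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

#define MIN_SEGMENT_USAGE_SIZE \
	SUB_SIZEOF(struct segment_usage_sketch, su_flags)
#define EXT_SEGMENT_USAGE_SIZE \
	SUB_SIZEOF(struct segment_usage_sketch, su_nlive_lastmod)

/* The extra fields may only be accessed if the stored entry is big enough. */
static int entry_has_ext_fields(size_t entry_size)
{
	return entry_size >= EXT_SEGMENT_USAGE_SIZE;
}

/* Locate entry number index in a block using the on-disk entry size. */
static void *entry_at(void *block, size_t index, size_t entry_size)
{
	return (unsigned char *)block + index * entry_size;
}
```

Code that only knows the first three fields keeps working as long as it
strides by the stored entry size, which is the property the extension
relies on.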

The new field su_nlive_blks is used to track the number of live blocks
in the corresponding segment. Its value should never exceed
su_nblocks, which contains the total number of blocks in the segment.

The field su_nlive_lastmod is necessary because of the protection period
used by the GC. It is a timestamp, which contains the last time
su_nlive_blks was modified. For example if a file is deleted, its
blocks are subtracted from su_nlive_blks and are therefore
considered to be reclaimable by the kernel. But the GC additionally
protects them with the protection period. So while su_nlive_blks
contains the number of potentially reclaimable blocks, the actual number
depends on the protection period. To enable GC policies to
effectively choose or prefer segments with unprotected blocks, the
timestamp in su_nlive_lastmod is necessary.
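
One way a GC policy could combine the two fields is sketched below. This
is a deliberately pessimistic model and an assumption of this example, not
the policy implemented in nilfs-utils: if the live block counter changed
inside the protection period, none of the dead blocks are treated as
reclaimable yet. The struct suinfo_sketch type and both function names
are hypothetical.

```c
#include <stdint.h>

/* Hypothetical subset of the per-segment information seen by the GC. */
struct suinfo_sketch {
	uint32_t nblocks;       /* total blocks in the segment */
	uint32_t nlive_blks;    /* blocks still counted as live */
	uint64_t nlive_lastmod; /* last time nlive_blks was changed */
};

/* Blocks that are dead according to the counter alone. */
static uint32_t potentially_reclaimable(const struct suinfo_sketch *si)
{
	return si->nblocks > si->nlive_blks ?
		si->nblocks - si->nlive_blks : 0;
}

/*
 * prottime is the start of the protection period: anything modified at
 * or after it is still protected. A recent change to the counter makes
 * the whole estimate untrustworthy in this sketch, so 0 is returned.
 */
static uint32_t reclaimable_now(const struct suinfo_sketch *si,
				uint64_t prottime)
{
	if (si->nlive_lastmod >= prottime)
		return 0;
	return potentially_reclaimable(si);
}
```

A policy could then prefer segments with a high reclaimable_now() value
and postpone the others until their protection period has passed.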

Since the changes to the disk layout are fully backwards compatible and
the feature flag cannot be set after file system creation time,
NILFS_FEATURE_COMPAT_SUFILE_EXTENSION is set by default. It can however
be disabled by mkfs.nilfs2 -O ^sufile_ext

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 bin/lssu.c          | 14 +++++++----
 include/nilfs2_fs.h | 46 +++++++++++++++++++++++++++++------
 lib/feature.c       |  2 ++
 man/mkfs.nilfs2.8   |  8 +++++++
 sbin/mkfs/mkfs.c    | 69 +++++++++++++++++++++++++++++++++++++++--------------
 5 files changed, 109 insertions(+), 30 deletions(-)

diff --git a/bin/lssu.c b/bin/lssu.c
index 09ed973..e50e628 100644
--- a/bin/lssu.c
+++ b/bin/lssu.c
@@ -104,8 +104,8 @@ static const struct lssu_format lssu_format[] = {
 	},
 	{
 		"           SEGNUM        DATE     TIME STAT     NBLOCKS" \
-		"       NLIVEBLOCKS",
-		"%17llu  %s %c%c%c%c  %10u %10u (%3u%%)\n"
+		"       NLIVEBLOCKS   NPREDLIVEBLOCKS",
+		"%17llu  %s %c%c%c%c  %10u %10u (%3u%%) %10u (%3u%%)\n"
 	}
 };
 
@@ -164,9 +164,9 @@ static ssize_t lssu_print_suinfo(struct nilfs *nilfs, __u64 segnum,
 	time_t t;
 	char timebuf[LSSU_BUFSIZE];
 	ssize_t i, n = 0, ret;
-	int ratio;
+	int ratio, predratio;
 	int protected;
-	size_t nliveblks;
+	size_t nliveblks, npredliveblks;
 
 	for (i = 0; i < nsi; i++, segnum++) {
 		if (!all && nilfs_suinfo_clean(&suinfos[i]))
@@ -192,7 +192,10 @@ static ssize_t lssu_print_suinfo(struct nilfs *nilfs, __u64 segnum,
 			break;
 		case LSSU_MODE_LATEST_USAGE:
 			nliveblks = 0;
+			npredliveblks = suinfos[i].sui_nlive_blks;
 			ratio = 0;
+			predratio = (npredliveblks * 100 + 99) /
+					blocks_per_segment;
 			protected = suinfos[i].sui_lastmod >= prottime;
 
 			if (!nilfs_suinfo_dirty(&suinfos[i]) ||
@@ -223,7 +226,8 @@ skip_scan:
 			       nilfs_suinfo_dirty(&suinfos[i]) ? 'd' : '-',
 			       nilfs_suinfo_error(&suinfos[i]) ? 'e' : '-',
 			       protected ? 'p' : '-',
-			       suinfos[i].sui_nblocks, nliveblks, ratio);
+			       suinfos[i].sui_nblocks, nliveblks, ratio,
+			       npredliveblks, predratio);
 			break;
 		}
 		n++;
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index a16ad4c..9137824 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -219,9 +219,11 @@ struct nilfs_super_block {
  * If there is a bit set in the incompatible feature set that the kernel
  * doesn't know about, it should refuse to mount the filesystem.
  */
-#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT	0x00000001ULL
+#define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
 
-#define NILFS_FEATURE_COMPAT_SUPP	0ULL
+#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
+
+#define NILFS_FEATURE_COMPAT_SUPP	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
@@ -607,18 +609,35 @@ struct nilfs_cpfile_header {
 	  sizeof(struct nilfs_checkpoint) - 1) /			\
 			sizeof(struct nilfs_checkpoint))
 
+#undef offsetof
+#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)
+
+#define sub_sizeof(TYPE, MEMBER) (offsetof(TYPE, MEMBER) +		\
+					sizeof(((TYPE *)0)->MEMBER))
 /**
  * struct nilfs_segment_usage - segment usage
  * @su_lastmod: last modified timestamp
  * @su_nblocks: number of blocks in segment
  * @su_flags: flags
+ * @su_nlive_blks: number of live blocks in the segment
+ * @su_pad: padding bytes
+ * @su_nlive_lastmod: timestamp nlive_blks was last modified
  */
 struct nilfs_segment_usage {
 	__le64 su_lastmod;
 	__le32 su_nblocks;
 	__le32 su_flags;
+	__le32 su_nlive_blks;
+	__le32 su_pad;
+	__le64 su_nlive_lastmod;
 };
 
+#define NILFS_MIN_SEGMENT_USAGE_SIZE	\
+	sub_sizeof(struct nilfs_segment_usage, su_flags)
+
+#define NILFS_EXT_SEGMENT_USAGE_SIZE	\
+	sub_sizeof(struct nilfs_segment_usage, su_nlive_lastmod)
+
 /* segment usage flag */
 enum {
 	NILFS_SEGMENT_USAGE_ACTIVE,
@@ -654,11 +673,16 @@ NILFS_SEGMENT_USAGE_FNS(DIRTY, dirty)
 NILFS_SEGMENT_USAGE_FNS(ERROR, error)
 
 static inline void
-nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su)
+nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
 {
 	su->su_lastmod = cpu_to_le64(0);
 	su->su_nblocks = cpu_to_le32(0);
 	su->su_flags = cpu_to_le32(0);
+	if (susz >= NILFS_EXT_SEGMENT_USAGE_SIZE) {
+		su->su_nlive_blks = cpu_to_le32(0);
+		su->su_pad = cpu_to_le32(0);
+		su->su_nlive_lastmod = cpu_to_le64(0);
+	}
 }
 
 static inline int
@@ -680,21 +704,25 @@ struct nilfs_sufile_header {
 	/* ... */
 };
 
-#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET	\
-	((sizeof(struct nilfs_sufile_header) +				\
-	  sizeof(struct nilfs_segment_usage) - 1) /			\
-			 sizeof(struct nilfs_segment_usage))
+#define NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET(susz)	\
+	((sizeof(struct nilfs_sufile_header) + (susz) - 1) / (susz))
 
 /**
  * nilfs_suinfo - segment usage information
  * @sui_lastmod: timestamp of last modification
  * @sui_nblocks: number of written blocks in segment
  * @sui_flags: segment usage flags
+ * @sui_nlive_blks: number of live blocks in the segment
+ * @sui_pad: padding bytes
+ * @sui_nlive_lastmod: timestamp nlive_blks was last modified
  */
 struct nilfs_suinfo {
 	__u64 sui_lastmod;
 	__u32 sui_nblocks;
 	__u32 sui_flags;
+	__u32 sui_nlive_blks;
+	__u32 sui_pad;
+	__u64 sui_nlive_lastmod;
 };
 
 #define NILFS_SUINFO_FNS(flag, name)					\
@@ -732,6 +760,8 @@ enum {
 	NILFS_SUINFO_UPDATE_LASTMOD,
 	NILFS_SUINFO_UPDATE_NBLOCKS,
 	NILFS_SUINFO_UPDATE_FLAGS,
+	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
+	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
 	__NR_NILFS_SUINFO_UPDATE_FIELDS,
 };
 
@@ -755,6 +785,8 @@ nilfs_suinfo_update_##name(const struct nilfs_suinfo_update *sup)	\
 NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
 NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
 NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
+NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
+NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
 
 enum {
 	NILFS_CHECKPOINT,
diff --git a/lib/feature.c b/lib/feature.c
index b3317b7..d954cda 100644
--- a/lib/feature.c
+++ b/lib/feature.c
@@ -55,6 +55,8 @@ struct nilfs_feature {
 
 static const struct nilfs_feature features[] = {
 	/* Compat features */
+	{ NILFS_FEATURE_TYPE_COMPAT,
+	  NILFS_FEATURE_COMPAT_SUFILE_EXTENSION, "sufile_ext" },
 	/* Read-only compat features */
 	{ NILFS_FEATURE_TYPE_COMPAT_RO,
 	  NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT, "block_count" },
diff --git a/man/mkfs.nilfs2.8 b/man/mkfs.nilfs2.8
index 0ff2fbe..6c9a644 100644
--- a/man/mkfs.nilfs2.8
+++ b/man/mkfs.nilfs2.8
@@ -168,6 +168,14 @@ pseudo-filesystem feature "none" will clear all filesystem features.
 .TP
 .B block_count
 Enable block count per checkpoint.
+.TP
+.B sufile_ext
+Enable SUFILE extension with extra fields. This is necessary for the
+track_live_blks and track_snapshots features to work. Once enabled it
+cannot be disabled, because it changes the ondisk format. Nevertheless it
+is fully compatible with older versions of the file system. This feature
+is on by default, because it is fully backwards compatible and can only
+be set at file system creation time.
 .RE
 .TP
 .B \-q
diff --git a/sbin/mkfs/mkfs.c b/sbin/mkfs/mkfs.c
index f5f7dbb..3985262 100644
--- a/sbin/mkfs/mkfs.c
+++ b/sbin/mkfs/mkfs.c
@@ -116,7 +116,12 @@ static time_t creation_time;
 static char volume_label[80];
 static __u64 compat_array[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	0,
+	/*
+	 * SUFILE_EXTENSION is set by default, because
+	 * it is fully compatible with previous versions and it
+	 * cannot be enabled later with nilfs-tune
+	 */
+	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION,
 	/* Read-only compat */
 	0,
 	/* Incompat */
@@ -375,12 +380,33 @@ static unsigned count_ifile_blocks(void)
 	return nblocks;
 }
 
+static inline int sufile_extension_enabled(void)
+{
+	return compat_array[NILFS_FEATURE_TYPE_COMPAT] &
+			NILFS_FEATURE_COMPAT_SUFILE_EXTENSION;
+}
+
+static unsigned get_sufile_entry_size(void)
+{
+	if (sufile_extension_enabled())
+		return NILFS_EXT_SEGMENT_USAGE_SIZE;
+	else
+		return NILFS_MIN_SEGMENT_USAGE_SIZE;
+}
+
+static unsigned get_sufile_first_entry_offset(void)
+{
+	unsigned susz = get_sufile_entry_size();
+
+	return NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET(susz);
+}
+
 static unsigned count_sufile_blocks(void)
 {
 	unsigned long sufile_segment_usages_per_block
-		= blocksize / sizeof(struct nilfs_segment_usage);
+		= blocksize / get_sufile_entry_size();
 	return DIV_ROUND_UP(nr_initial_segments +
-			   NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET,
+			   get_sufile_first_entry_offset(),
 			   sufile_segment_usages_per_block);
 }
 
@@ -1056,7 +1082,7 @@ static inline void check_ctime(time_t ctime)
 
 static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	0,
+	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
@@ -1499,8 +1525,8 @@ static void commit_cpfile(void)
 static void prepare_sufile(void)
 {
 	struct nilfs_file_info *fi = nilfs.files[NILFS_SUFILE_INO];
-	const unsigned entries_per_block
-		= blocksize / sizeof(struct nilfs_segment_usage);
+	const size_t susz = get_sufile_entry_size();
+	const unsigned entries_per_block = blocksize / susz;
 	blocknr_t blocknr = fi->start;
 	blocknr_t entry_block = blocknr;
 	struct nilfs_sufile_header *header;
@@ -1516,10 +1542,10 @@ static void prepare_sufile(void)
 	for (entry_block = blocknr;
 	     entry_block < blocknr + fi->nblocks; entry_block++) {
 		i = (entry_block == blocknr) ?
-			NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET : 0;
-		su = (struct nilfs_segment_usage *)
-			map_disk_buffer(entry_block, 1) + i;
-		for (; i < entries_per_block; i++, su++, segnum++) {
+			get_sufile_first_entry_offset() : 0;
+		su = map_disk_buffer(entry_block, 1) + i * susz;
+		for (; i < entries_per_block; i++, su = (void *)su + susz,
+		     segnum++) {
 #if 0 /* these fields are cleared when mapped first */
 			su->su_lastmod = 0;
 			su->su_nblocks = 0;
@@ -1529,7 +1555,7 @@ static void prepare_sufile(void)
 				nilfs_segment_usage_set_active(su);
 				nilfs_segment_usage_set_dirty(su);
 			} else
-				nilfs_segment_usage_set_clean(su);
+				nilfs_segment_usage_set_clean(su, susz);
 		}
 	}
 	init_inode(NILFS_SUFILE_INO, DT_REG, 0, 0);
@@ -1538,19 +1564,26 @@ static void prepare_sufile(void)
 static void commit_sufile(void)
 {
 	struct nilfs_file_info *fi = nilfs.files[NILFS_SUFILE_INO];
-	const unsigned entries_per_block
-		= blocksize / sizeof(struct nilfs_segment_usage);
+	const size_t susz = get_sufile_entry_size();
+	const unsigned entries_per_block = blocksize / susz;
 	struct nilfs_segment_usage *su;
 	unsigned segnum = fi->start / nilfs.diskinfo->blocks_per_segment;
 	blocknr_t blocknr = fi->start +
-		(segnum + NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET) /
+		(segnum + get_sufile_first_entry_offset()) /
 		entries_per_block;
-
-	su = map_disk_buffer(blocknr, 1);
-	su += (segnum + NILFS_SUFILE_FIRST_SEGMENT_USAGE_OFFSET) %
+	size_t entry_off = (segnum + get_sufile_first_entry_offset()) %
 		entries_per_block;
+
+	su = map_disk_buffer(blocknr, 1) + entry_off * susz;
+
 	su->su_lastmod = cpu_to_le64(nilfs.diskinfo->ctime);
 	su->su_nblocks = cpu_to_le32(nilfs.current_segment->nblocks);
+	if (sufile_extension_enabled()) {
+		/* nlive_blks = nblocks - (nsummary_blks + nsuperroot_blks) */
+		su->su_nlive_blks = cpu_to_le32(nilfs.current_segment->nblocks -
+				(nilfs.current_segment->nblk_sum + 1));
+		su->su_nlive_lastmod = su->su_lastmod;
+	}
 }
 
 static void prepare_dat(void)
@@ -1756,7 +1789,7 @@ static void prepare_super_block(struct nilfs_disk_info *di)
 	raw_sb->s_checkpoint_size =
 		cpu_to_le16(sizeof(struct nilfs_checkpoint));
 	raw_sb->s_segment_usage_size =
-		cpu_to_le16(sizeof(struct nilfs_segment_usage));
+		cpu_to_le16(get_sufile_entry_size());
 
 	raw_sb->s_feature_compat =
 		cpu_to_le64(compat_array[NILFS_FEATURE_TYPE_COMPAT]);
-- 
2.3.0


* [PATCH 2/6] nilfs-utils: add additional flags for nilfs_vdesc
       [not found]     ` <1424804659-10986-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-02-24 19:04       ` Andreas Rohner
  2015-02-24 19:04       ` [PATCH 3/6] nilfs-utils: add support for tracking live blocks Andreas Rohner
                         ` (3 subsequent siblings)
  4 siblings, 0 replies; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:04 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch adds support for additional bit-flags to the nilfs_vdesc
structure used by the GC to communicate block information to the
kernel.

The field vd_flags cannot be used for this purpose, because it does
not support bit-flags, and changing that would break backwards
compatibility. Therefore the padding field is renamed to vd_blk_flags
to contain more flags.

Unfortunately older versions of nilfs-utils do not initialize the
padding field to zero. So it is necessary to signal to the kernel if
the new vd_blk_flags field contains usable flags or just random data.
Since the vd_period field is only used in userspace, and is guaranteed
to contain a value that is > 0 (NILFS_CNO_MIN == 1), it can be used to
give the kernel a hint. So if vd_period.p_start is set to 0, the
vd_blk_flags field will be interpreted by the kernel.

The following new flags are added:

NILFS_VDESC_SNAPSHOT:
    The block corresponding to the vdesc structure is protected by a
    snapshot. This information is used in the kernel as well as in
    nilfs-utils to calculate the number of live blocks in a given
    segment. A block with this flag is counted as live regardless of
    other indicators.

NILFS_VDESC_PROTECTION_PERIOD:
    The block corresponding to the vdesc structure is protected by the
    protection period of the userspace GC. The block is actually
    reclaimable, but for the moment protected. So it has to be
    treated as if it were alive and moved to a new free segment,
    but it must not be counted as live by the kernel. This flag
    indicates to the kernel that this block should be counted as
    reclaimable.

The nilfs_vdesc_is_live() function is modified to store the
corresponding flags in the vdesc structure. However, the algorithm it
uses is not modified, so it should return exactly the same results.
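
As a rough illustration of where the two flags get set, here is a
simplified sketch. It is not the code from lib/gc.c: the sk_ names and
the struct are made up for this example, the snapshot lookup is a plain
linear scan instead of the binary search with the last_hit cache, and
the special case handled at the top of the real function is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_CNO_MAX		UINT64_MAX	/* stand-in for NILFS_CNO_MAX */
#define SK_SNAPSHOT		(1u << 0)	/* same bit order as the new enum */
#define SK_PROTECTION_PERIOD	(1u << 1)

/* Hypothetical, trimmed-down stand-in for struct nilfs_vdesc. */
struct sk_vdesc {
	uint64_t p_start;	/* first checkpoint that sees the block */
	uint64_t p_end;		/* first checkpoint that no longer sees it */
	uint32_t blk_flags;
};

/* Returns non-zero if the block has to be kept (moved) by the GC. */
static int sk_is_live(struct sk_vdesc *vd, uint64_t protect,
		      const uint64_t *ss, size_t n)
{
	size_t i;

	assert(vd != NULL);

	if (vd->p_end == SK_CNO_MAX)
		return 1;	/* no end checkpoint: live, no flag needed */

	if (vd->p_end > protect)
		vd->blk_flags |= SK_PROTECTION_PERIOD;

	/* Linear scan (assumption); the real code uses a binary search. */
	for (i = 0; i < n; i++) {
		if (ss[i] >= vd->p_start && ss[i] < vd->p_end) {
			vd->blk_flags |= SK_SNAPSHOT;
			return 1;
		}
	}

	/* Not pinned by a snapshot: live only while still protected. */
	return !!(vd->blk_flags & SK_PROTECTION_PERIOD);
}
```

The return value is the same as before in every case; the only new
side effect is that the reason for "live" ends up in blk_flags.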

After nilfs_vdesc_is_live() has been called, the vd_period field is no
longer needed and is set to 0 to indicate to the kernel that the
vd_blk_flags field should be interpreted. This ensures full backward
compatibility:

Old nilfs2 and new nilfs-utils:
    vd_blk_flags is ignored

New nilfs2 and old nilfs-utils:
    vd_period.p_start > 0 so vd_blk_flags is ignored

New nilfs2 and new nilfs-utils:
    vd_period.p_start == 0 so vd_blk_flags is interpreted

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 include/nilfs2_fs.h | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++--
 lib/gc.c            | 36 ++++++++++++++++++++++++---------
 2 files changed, 83 insertions(+), 11 deletions(-)

diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index 9137824..d01a924 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -884,7 +884,7 @@ struct nilfs_vinfo {
  * @vd_blocknr: disk block number
  * @vd_offset: logical block offset inside a file
  * @vd_flags: flags (data or node block)
- * @vd_pad: padding
+ * @vd_blk_flags: additional flags
  */
 struct nilfs_vdesc {
 	__u64 vd_ino;
@@ -894,9 +894,63 @@ struct nilfs_vdesc {
 	__u64 vd_blocknr;
 	__u64 vd_offset;
 	__u32 vd_flags;
-	__u32 vd_pad;
+	/*
+	 * vd_blk_flags needed because vd_flags doesn't support
+	 * bit-flags because of backwards compatibility
+	 */
+	__u32 vd_blk_flags;
 };
 
+/* vdesc flags */
+enum {
+	NILFS_VDESC_DATA,
+	NILFS_VDESC_NODE,
+
+	/* ... */
+};
+enum {
+	NILFS_VDESC_SNAPSHOT,
+	NILFS_VDESC_PROTECTION_PERIOD,
+
+	/* ... */
+
+	__NR_NILFS_VDESC_FIELDS,
+};
+
+#define NILFS_VDESC_FNS(flag, name)					\
+static inline void							\
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_flags = NILFS_VDESC_##flag;				\
+}									\
+static inline int							\
+nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
+{									\
+	return vdesc->vd_flags == NILFS_VDESC_##flag;			\
+}
+
+#define NILFS_VDESC_FNS2(flag, name)					\
+static inline void							\
+nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_blk_flags |= (1UL << NILFS_VDESC_##flag);		\
+}									\
+static inline void							\
+nilfs_vdesc_clear_##name(struct nilfs_vdesc *vdesc)			\
+{									\
+	vdesc->vd_blk_flags &= ~(1UL << NILFS_VDESC_##flag);		\
+}									\
+static inline int							\
+nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
+{									\
+	return !!(vdesc->vd_blk_flags & (1UL << NILFS_VDESC_##flag));	\
+}
+
+NILFS_VDESC_FNS(DATA, data)
+NILFS_VDESC_FNS(NODE, node)
+NILFS_VDESC_FNS2(SNAPSHOT, snapshot)
+NILFS_VDESC_FNS2(PROTECTION_PERIOD, protection_period)
+
 /**
  * struct nilfs_bdesc - descriptor of disk block number
  * @bd_ino: inode number
diff --git a/lib/gc.c b/lib/gc.c
index 48c295a..b56744c 100644
--- a/lib/gc.c
+++ b/lib/gc.c
@@ -128,6 +128,7 @@ static int nilfs_acc_blocks_file(struct nilfs_file *file,
 				return -1;
 			bdesc->bd_ino = ino;
 			bdesc->bd_oblocknr = blk.b_blocknr;
+			bdesc->bd_pad = 0;
 			if (nilfs_block_is_data(&blk)) {
 				bdesc->bd_offset =
 					le64_to_cpu(*(__le64 *)blk.b_binfo);
@@ -148,17 +149,19 @@ static int nilfs_acc_blocks_file(struct nilfs_file *file,
 			vdesc->vd_ino = ino;
 			vdesc->vd_cno = cno;
 			vdesc->vd_blocknr = blk.b_blocknr;
+			vdesc->vd_flags = 0;
+			vdesc->vd_blk_flags = 0;
 			if (nilfs_block_is_data(&blk)) {
 				binfo = blk.b_binfo;
 				vdesc->vd_vblocknr =
 					le64_to_cpu(binfo->bi_v.bi_vblocknr);
 				vdesc->vd_offset =
 					le64_to_cpu(binfo->bi_v.bi_blkoff);
-				vdesc->vd_flags = 0;	/* data */
+				nilfs_vdesc_set_data(vdesc);
 			} else {
 				vdesc->vd_vblocknr =
 					le64_to_cpu(*(__le64 *)blk.b_binfo);
-				vdesc->vd_flags = 1;	/* node */
+				nilfs_vdesc_set_node(vdesc);
 			}
 		}
 	}
@@ -392,7 +395,7 @@ static ssize_t nilfs_get_snapshot(struct nilfs *nilfs, nilfs_cno_t **ssp)
  * @n: size of @ss array
  * @last_hit: the last snapshot number hit
  */
-static int nilfs_vdesc_is_live(const struct nilfs_vdesc *vdesc,
+static int nilfs_vdesc_is_live(struct nilfs_vdesc *vdesc,
 			       nilfs_cno_t protect, const nilfs_cno_t *ss,
 			       size_t n, nilfs_cno_t *last_hit)
 {
@@ -408,18 +411,22 @@ static int nilfs_vdesc_is_live(const struct nilfs_vdesc *vdesc,
 		return vdesc->vd_period.p_end == NILFS_CNO_MAX;
 	}
 
-	if (vdesc->vd_period.p_end == NILFS_CNO_MAX ||
-	    vdesc->vd_period.p_end > protect)
+	if (vdesc->vd_period.p_end == NILFS_CNO_MAX)
 		return 1;
 
+	if (vdesc->vd_period.p_end > protect)
+		nilfs_vdesc_set_protection_period(vdesc);
+
 	if (n == 0 || vdesc->vd_period.p_start > ss[n - 1] ||
 	    vdesc->vd_period.p_end <= ss[0])
-		return 0;
+		return nilfs_vdesc_protection_period(vdesc);
 
 	/* Try the last hit snapshot number */
 	if (*last_hit >= vdesc->vd_period.p_start &&
-	    *last_hit < vdesc->vd_period.p_end)
+	    *last_hit < vdesc->vd_period.p_end) {
+		nilfs_vdesc_set_snapshot(vdesc);
 		return 1;
+	}
 
 	low = 0;
 	high = n - 1;
@@ -435,10 +442,11 @@ static int nilfs_vdesc_is_live(const struct nilfs_vdesc *vdesc,
 		} else {
 			/* ss[index] is in the range [p_start, p_end) */
 			*last_hit = ss[index];
+			nilfs_vdesc_set_snapshot(vdesc);
 			return 1;
 		}
 	}
-	return 0;
+	return nilfs_vdesc_protection_period(vdesc);
 }
 
 /**
@@ -476,8 +484,18 @@ static int nilfs_toss_vdescs(struct nilfs *nilfs,
 			vdesc = nilfs_vector_get_element(vdescv, j);
 			assert(vdesc != NULL);
 			if (nilfs_vdesc_is_live(vdesc, protcno, ss, n,
-						&last_hit))
+						&last_hit)) {
+				/*
+				 * vd_period is not used any more after this,
+				 * but by setting it to 0 it can be used
+				 * as a flag to the kernel that vd_blk_flags
+				 * is used (old userspace tools didn't
+				 * initialize vd_pad to 0)
+				 */
+				vdesc->vd_period.p_start = 0;
+				vdesc->vd_period.p_end = 0;
 				break;
+			}
 
 			/*
 			 * Add the virtual block number to the candidate
-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 3/6] nilfs-utils: add support for tracking live blocks
       [not found]     ` <1424804659-10986-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:04       ` [PATCH 2/6] nilfs-utils: add additional flags for nilfs_vdesc Andreas Rohner
@ 2015-02-24 19:04       ` Andreas Rohner
       [not found]         ` <1424804659-10986-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:04       ` [PATCH 4/6] nilfs-utils: implement the tracking of live blocks for set_suinfo Andreas Rohner
                         ` (2 subsequent siblings)
  4 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:04 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch adds a new feature flag NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
which allows the user to enable and disable the tracking of live
blocks. The flag can be set at file system creation time with mkfs or
at any later time with nilfs-tune.

Additionally a new option NILFS_OPT_TRACK_LIVE_BLKS is added to be
used by the GC. It is set to the same value as
NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS at startup. It is mainly used to
easily and efficiently check for the feature at runtime and to disable
it if the kernel doesn't support it.

The feature is fully backwards compatible, because
NILFS_FEATURE_COMPAT_SUFILE_EXTENSION is also backwards compatible and
the new flag only tells the kernel to update a counter for every
segment in the SUFILE. If the kernel doesn't support it, the counter
won't be updated and the GC policies that depend on that information
will work less efficiently, but they will still work.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 include/nilfs.h              | 30 +++++++++++++++++++++++++++---
 include/nilfs2_fs.h          |  4 +++-
 lib/feature.c                |  2 ++
 lib/nilfs.c                  | 32 ++++----------------------------
 man/mkfs.nilfs2.8            |  6 ++++++
 sbin/mkfs/mkfs.c             |  3 ++-
 sbin/nilfs-tune/nilfs-tune.c |  4 ++--
 7 files changed, 46 insertions(+), 35 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index f695f48..22a9190 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -130,6 +130,7 @@ struct nilfs {
 
 #define NILFS_OPT_MMAP		0x01
 #define NILFS_OPT_SET_SUINFO	0x02
+#define NILFS_OPT_TRACK_LIVE_BLKS	0x04
 
 
 struct nilfs *nilfs_open(const char *, const char *, int);
@@ -141,9 +142,25 @@ void nilfs_opt_clear_mmap(struct nilfs *);
 int nilfs_opt_set_mmap(struct nilfs *);
 int nilfs_opt_test_mmap(struct nilfs *);
 
-void nilfs_opt_clear_set_suinfo(struct nilfs *);
-int nilfs_opt_set_set_suinfo(struct nilfs *);
-int nilfs_opt_test_set_suinfo(struct nilfs *);
+#define NILFS_OPT_FLAG(flag, name)					\
+static inline void							\
+nilfs_opt_set_##name(struct nilfs *nilfs)			\
+{									\
+	nilfs->n_opts |= NILFS_OPT_##flag;		\
+}									\
+static inline void							\
+nilfs_opt_clear_##name(struct nilfs *nilfs)			\
+{									\
+	nilfs->n_opts &= ~NILFS_OPT_##flag;		\
+}									\
+static inline int							\
+nilfs_opt_test_##name(const struct nilfs *nilfs)			\
+{									\
+	return !!(nilfs->n_opts & NILFS_OPT_##flag);	\
+}
+
+NILFS_OPT_FLAG(SET_SUINFO, set_suinfo);
+NILFS_OPT_FLAG(TRACK_LIVE_BLKS, track_live_blks);
 
 nilfs_cno_t nilfs_get_oldest_cno(struct nilfs *);
 
@@ -326,4 +343,11 @@ static inline __u32 nilfs_get_blocks_per_segment(const struct nilfs *nilfs)
 	return le32_to_cpu(nilfs->n_sb->s_blocks_per_segment);
 }
 
+static inline int nilfs_feature_track_live_blks(const struct nilfs *nilfs)
+{
+	__u64 fc = le64_to_cpu(nilfs->n_sb->s_feature_compat);
+	return (fc & NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS) &&
+		(fc & NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
+}
+
 #endif	/* NILFS_H */
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index d01a924..427ca53 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -220,10 +220,12 @@ struct nilfs_super_block {
  * doesn't know about, it should refuse to mount the filesystem.
  */
 #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
+#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
 
 #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
 
-#define NILFS_FEATURE_COMPAT_SUPP	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION
+#define NILFS_FEATURE_COMPAT_SUPP	(NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
+				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
diff --git a/lib/feature.c b/lib/feature.c
index d954cda..ebe8c3f 100644
--- a/lib/feature.c
+++ b/lib/feature.c
@@ -57,6 +57,8 @@ static const struct nilfs_feature features[] = {
 	/* Compat features */
 	{ NILFS_FEATURE_TYPE_COMPAT,
 	  NILFS_FEATURE_COMPAT_SUFILE_EXTENSION, "sufile_ext" },
+	{ NILFS_FEATURE_TYPE_COMPAT,
+	  NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS, "track_live_blks" },
 	/* Read-only compat features */
 	{ NILFS_FEATURE_TYPE_COMPAT_RO,
 	  NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT, "block_count" },
diff --git a/lib/nilfs.c b/lib/nilfs.c
index 30db654..2067fc0 100644
--- a/lib/nilfs.c
+++ b/lib/nilfs.c
@@ -290,34 +290,6 @@ int nilfs_opt_test_mmap(struct nilfs *nilfs)
 	return !!(nilfs->n_opts & NILFS_OPT_MMAP);
 }
 
-/**
- * nilfs_opt_set_set_suinfo - set set_suinfo option
- * @nilfs: nilfs object
- */
-int nilfs_opt_set_set_suinfo(struct nilfs *nilfs)
-{
-	nilfs->n_opts |= NILFS_OPT_SET_SUINFO;
-	return 0;
-}
-
-/**
- * nilfs_opt_clear_set_suinfo - clear set_suinfo option
- * @nilfs: nilfs object
- */
-void nilfs_opt_clear_set_suinfo(struct nilfs *nilfs)
-{
-	nilfs->n_opts &= ~NILFS_OPT_SET_SUINFO;
-}
-
-/**
- * nilfs_opt_test_set_suinfo - test whether set_suinfo option is set or not
- * @nilfs: nilfs object
- */
-int nilfs_opt_test_set_suinfo(struct nilfs *nilfs)
-{
-	return !!(nilfs->n_opts & NILFS_OPT_SET_SUINFO);
-}
-
 static int nilfs_open_sem(struct nilfs *nilfs)
 {
 	char semnambuf[NAME_MAX - 4];
@@ -382,6 +354,7 @@ struct nilfs *nilfs_open(const char *dev, const char *dir, int flags)
 	nilfs->n_dev = NULL;
 	nilfs->n_ioc = NULL;
 	nilfs->n_mincno = NILFS_CNO_MIN;
+	nilfs->n_opts = 0;
 	memset(nilfs->n_sems, 0, sizeof(nilfs->n_sems));
 
 	if (flags & NILFS_OPEN_RAW) {
@@ -405,6 +378,9 @@ struct nilfs *nilfs_open(const char *dev, const char *dir, int flags)
 			errno = ENOTSUP;
 			goto out_fd;
 		}
+
+		if (nilfs_feature_track_live_blks(nilfs))
+			nilfs_opt_set_track_live_blks(nilfs);
 	}
 
 	if (flags &
diff --git a/man/mkfs.nilfs2.8 b/man/mkfs.nilfs2.8
index 6c9a644..2431ac9 100644
--- a/man/mkfs.nilfs2.8
+++ b/man/mkfs.nilfs2.8
@@ -176,6 +176,12 @@ cannot be disabled, because it changes the ondisk format. Nevertheless it
 is fully compatible with older versions of the file system. This feature
 is on by default, because it is fully backwards compatible and can only
 be set at file system creation time.
+.TP
+.B track_live_blks
+Enables the tracking of live blocks, which might improve the effectiveness of
+garbage collection, but entails a small runtime overhead. It is important to
+note, that this feature depends on sufile_ext, which can only be set
+at file system creation time.
 .RE
 .TP
 .B \-q
diff --git a/sbin/mkfs/mkfs.c b/sbin/mkfs/mkfs.c
index 3985262..680311c 100644
--- a/sbin/mkfs/mkfs.c
+++ b/sbin/mkfs/mkfs.c
@@ -1082,7 +1082,8 @@ static inline void check_ctime(time_t ctime)
 
 static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION,
+	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION |
+	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
diff --git a/sbin/nilfs-tune/nilfs-tune.c b/sbin/nilfs-tune/nilfs-tune.c
index 60f1d39..7889310 100644
--- a/sbin/nilfs-tune/nilfs-tune.c
+++ b/sbin/nilfs-tune/nilfs-tune.c
@@ -84,7 +84,7 @@ static void nilfs_tune_usage(void)
 
 static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	0,
+	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
@@ -93,7 +93,7 @@ static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 
 static const __u64 clear_ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	0,
+	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
-- 
2.3.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 4/6] nilfs-utils: implement the tracking of live blocks for set_suinfo
       [not found]     ` <1424804659-10986-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
  2015-02-24 19:04       ` [PATCH 2/6] nilfs-utils: add additional flags for nilfs_vdesc Andreas Rohner
  2015-02-24 19:04       ` [PATCH 3/6] nilfs-utils: add support for tracking live blocks Andreas Rohner
@ 2015-02-24 19:04       ` Andreas Rohner
  2015-02-24 19:04       ` [PATCH 5/6] nilfs-utils: add support for greedy/cost-benefit policies Andreas Rohner
  2015-02-24 19:04       ` [PATCH 6/6] nilfs-utils: add su_nsnapshot_blks field to indicate starvation Andreas Rohner
  4 siblings, 0 replies; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:04 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

If the tracking of live blocks is enabled, the information passed to
the kernel with the set_suinfo ioctl must also be modified. To this
end the nilfs_count_nlive_blks() function is introduced. It simply
loops through the vdescv and bdescv vectors and counts the live
blocks belonging to a certain segment. Here the new vdesc flags
introduced earlier come in handy. If the NILFS_VDESC_SNAPSHOT flag is
set, the block is always counted as live. If it is not set and
NILFS_VDESC_PROTECTION_PERIOD is set instead, the block is counted as
reclaimable.
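
The counting rule for the virtual blocks can be sketched as follows.
This is a reduced example with invented sk_ names that works on a
plain array instead of a nilfs_vector and leaves out the bdescv part;
it is only meant to show how the two flags are combined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_SNAPSHOT		(1u << 0)
#define SK_PROTECTION_PERIOD	(1u << 1)

/* Hypothetical, trimmed-down stand-in for struct nilfs_vdesc. */
struct sk_vdesc {
	uint64_t vd_blocknr;
	uint32_t vd_blk_flags;
};

/* Count the entries of v[] that lie in segnum and count as live. */
static size_t sk_count_nlive(const struct sk_vdesc *v, size_t n,
			     uint64_t segnum, uint32_t blocks_per_segment)
{
	size_t i, res = 0;

	assert(blocks_per_segment > 0);

	for (i = 0; i < n; i++) {
		if (v[i].vd_blocknr / blocks_per_segment != segnum)
			continue;

		/* A snapshot always wins over the protection period. */
		if ((v[i].vd_blk_flags & SK_SNAPSHOT) ||
		    !(v[i].vd_blk_flags & SK_PROTECTION_PERIOD))
			res++;
	}

	return res;
}
```

With 8 blocks per segment, the blocks 8 to 15 belong to segment 1. A
block that carries both flags is still counted, one that carries only
the protection period flag is not.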

Additionally the nilfs_xreclaim_segment() function is refactored, so
that the set_suinfo part is extracted into its own function
nilfs_try_set_suinfo(). This is useful, because the code gets more
complicated with the new additions.

If the kernel either doesn't support the set_suinfo ioctl or doesn't
support the set_nlive_blks flag, it returns ENOTTY or EINVAL
respectively and the corresponding options are disabled and not used
again.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 include/nilfs.h |   6 ++
 lib/gc.c        | 168 +++++++++++++++++++++++++++++++++++++++++++-------------
 2 files changed, 136 insertions(+), 38 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index 22a9190..8511163 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -343,6 +343,12 @@ static inline __u32 nilfs_get_blocks_per_segment(const struct nilfs *nilfs)
 	return le32_to_cpu(nilfs->n_sb->s_blocks_per_segment);
 }
 
+static inline __u64
+nilfs_get_segnum_of_block(const struct nilfs *nilfs, sector_t blocknr)
+{
+	return blocknr / nilfs_get_blocks_per_segment(nilfs);
+}
+
 static inline int nilfs_feature_track_live_blks(const struct nilfs *nilfs)
 {
 	__u64 fc = le64_to_cpu(nilfs->n_sb->s_feature_compat);
diff --git a/lib/gc.c b/lib/gc.c
index b56744c..a2461b9 100644
--- a/lib/gc.c
+++ b/lib/gc.c
@@ -620,6 +620,121 @@ static int nilfs_toss_bdescs(struct nilfs_vector *bdescv)
 }
 
 /**
+ * nilfs_count_nlive_blks - returns the number of live blocks in segnum
+ * @nilfs: nilfs object
+ * @segnum: segment number
+ * @vdescv: vector object storing (descriptors of) virtual block numbers
+ * @bdescv: vector object storing (descriptors of) disk block numbers
+ */
+static size_t nilfs_count_nlive_blks(const struct nilfs *nilfs,
+				     __u64 segnum,
+				     struct nilfs_vector *vdescv,
+				     struct nilfs_vector *bdescv)
+{
+	struct nilfs_vdesc *vdesc;
+	struct nilfs_bdesc *bdesc;
+	int i;
+	size_t res = 0;
+
+	for (i = 0; i < nilfs_vector_get_size(bdescv); i++) {
+		bdesc = nilfs_vector_get_element(bdescv, i);
+		assert(bdesc != NULL);
+
+		if (nilfs_get_segnum_of_block(nilfs, bdesc->bd_blocknr) ==
+		    segnum && nilfs_bdesc_is_live(bdesc))
+			++res;
+	}
+
+	for (i = 0; i < nilfs_vector_get_size(vdescv); i++) {
+		vdesc = nilfs_vector_get_element(vdescv, i);
+		assert(vdesc != NULL);
+
+		if (nilfs_get_segnum_of_block(nilfs, vdesc->vd_blocknr) ==
+		    segnum && (nilfs_vdesc_snapshot(vdesc) ||
+		    !nilfs_vdesc_protection_period(vdesc)))
+			++res;
+	}
+
+	return res;
+}
+
+/**
+ * nilfs_try_set_suinfo - wrapper for nilfs_set_suinfo
+ * @nilfs: nilfs object
+ * @segnums: array of segment numbers storing selected segments
+ * @nsegs: size of the @segnums array
+ * @vdescv: vector object storing (descriptors of) virtual block numbers
+ * @bdescv: vector object storing (descriptors of) disk block numbers
+ *
+ * Description: nilfs_try_set_suinfo() prepares the input data structure
+ * for nilfs_set_suinfo(). If the kernel doesn't support the
+ * NILFS_IOCTL_SET_SUINFO ioctl, errno is set to ENOTTY and the set_suinfo
+ * option is cleared to prevent future calls to nilfs_try_set_suinfo().
+ * Similarly if the SUFILE extension is not supported by the kernel,
+ * errno is set to EINVAL and the track_live_blks option is disabled.
+ *
+ * Return Value: On success, zero is returned.  On error, a negative value
+ * is returned. If errno is set to ENOTTY or EINVAL, the kernel doesn't support
+ * the current configuration for nilfs_set_suinfo().
+ */
+static int nilfs_try_set_suinfo(struct nilfs *nilfs, __u64 *segnums,
+		size_t nsegs, struct nilfs_vector *vdescv,
+		struct nilfs_vector *bdescv)
+{
+	struct nilfs_vector *supv;
+	struct nilfs_suinfo_update *sup;
+	struct timeval tv;
+	int ret = -1;
+	size_t i, nblocks;
+
+	supv = nilfs_vector_create(sizeof(struct nilfs_suinfo_update));
+	if (!supv)
+		goto out;
+
+	ret = gettimeofday(&tv, NULL);
+	if (ret < 0)
+		goto out;
+
+	for (i = 0; i < nsegs; ++i) {
+		sup = nilfs_vector_get_new_element(supv);
+		if (!sup) {
+			ret = -1;
+			goto out;
+		}
+
+		sup->sup_segnum = segnums[i];
+		sup->sup_flags = 0;
+		nilfs_suinfo_update_set_lastmod(sup);
+		sup->sup_sui.sui_lastmod = tv.tv_sec;
+
+		if (nilfs_opt_test_track_live_blks(nilfs)) {
+			nilfs_suinfo_update_set_nlive_blks(sup);
+
+			nblocks = nilfs_count_nlive_blks(nilfs,
+					segnums[i], vdescv, bdescv);
+			sup->sup_sui.sui_nlive_blks = nblocks;
+		}
+	}
+
+	ret = nilfs_set_suinfo(nilfs, nilfs_vector_get_data(supv), nsegs);
+	if (ret < 0) {
+		if (errno == ENOTTY) {
+			nilfs_gc_logger(LOG_WARNING,
+					"set_suinfo ioctl is not supported");
+			nilfs_opt_clear_set_suinfo(nilfs);
+		} else if (errno == EINVAL) {
+			nilfs_gc_logger(LOG_WARNING,
+					"sufile extension is not supported");
+			nilfs_opt_clear_track_live_blks(nilfs);
+		}
+	}
+
+out:
+	nilfs_vector_destroy(supv);
+	return ret;
+}
+
+/**
  * nilfs_xreclaim_segment - reclaim segments (enhanced API)
  * @nilfs: nilfs object
  * @segnums: array of segment numbers storing selected segments
@@ -633,14 +748,12 @@ int nilfs_xreclaim_segment(struct nilfs *nilfs,
 			   const struct nilfs_reclaim_params *params,
 			   struct nilfs_reclaim_stat *stat)
 {
-	struct nilfs_vector *vdescv, *bdescv, *periodv, *vblocknrv, *supv;
+	struct nilfs_vector *vdescv, *bdescv, *periodv, *vblocknrv;
 	sigset_t sigset, oldset, waitset;
 	nilfs_cno_t protcno;
-	ssize_t n, i, ret = -1;
+	ssize_t n, ret = -1;
 	size_t nblocks;
 	__u32 reclaimable_blocks;
-	struct nilfs_suinfo_update *sup;
-	struct timeval tv;
 
 	if (!(params->flags & NILFS_RECLAIM_PARAM_PROTSEQ) ||
 	    (params->flags & (~0UL << __NR_NILFS_RECLAIM_PARAMS))) {
@@ -659,8 +772,7 @@ int nilfs_xreclaim_segment(struct nilfs *nilfs,
 	bdescv = nilfs_vector_create(sizeof(struct nilfs_bdesc));
 	periodv = nilfs_vector_create(sizeof(struct nilfs_period));
 	vblocknrv = nilfs_vector_create(sizeof(__u64));
-	supv = nilfs_vector_create(sizeof(struct nilfs_suinfo_update));
-	if (!vdescv || !bdescv || !periodv || !vblocknrv || !supv)
+	if (!vdescv || !bdescv || !periodv || !vblocknrv)
 		goto out_vec;
 
 	sigemptyset(&sigset);
@@ -758,46 +870,27 @@ int nilfs_xreclaim_segment(struct nilfs *nilfs,
 	if ((params->flags & NILFS_RECLAIM_PARAM_MIN_RECLAIMABLE_BLKS) &&
 			nilfs_opt_test_set_suinfo(nilfs) &&
 			reclaimable_blocks < params->min_reclaimable_blks * n) {
-		if (stat) {
-			stat->deferred_segs = n;
-			stat->cleaned_segs = 0;
-		}
 
-		ret = gettimeofday(&tv, NULL);
-		if (ret < 0)
+		ret = nilfs_try_set_suinfo(nilfs, segnums, n, vdescv, bdescv);
+		if (ret == 0) {
+			if (stat) {
+				stat->deferred_segs = n;
+				stat->cleaned_segs = 0;
+			}
 			goto out_lock;
-
-		for (i = 0; i < n; ++i) {
-			sup = nilfs_vector_get_new_element(supv);
-			if (!sup)
-				goto out_lock;
-
-			sup->sup_segnum = segnums[i];
-			sup->sup_flags = 0;
-			nilfs_suinfo_update_set_lastmod(sup);
-			sup->sup_sui.sui_lastmod = tv.tv_sec;
 		}
 
-		ret = nilfs_set_suinfo(nilfs, nilfs_vector_get_data(supv), n);
-
-		if (ret == 0)
-			goto out_lock;
-
-		if (ret < 0 && errno != ENOTTY) {
+		if (ret < 0 && errno != ENOTTY && errno != EINVAL) {
 			nilfs_gc_logger(LOG_ERR, "cannot set suinfo: %s",
 					strerror(errno));
 			goto out_lock;
 		}
 
-		/* errno == ENOTTY */
-		nilfs_gc_logger(LOG_WARNING,
-				"set_suinfo ioctl is not supported");
-		nilfs_opt_clear_set_suinfo(nilfs);
-		if (stat) {
-			stat->deferred_segs = 0;
-			stat->cleaned_segs = n;
-		}
-		/* Try nilfs_clean_segments */
+		/*
+		 * errno == ENOTTY || errno == EINVAL
+		 * nilfs_try_set_suinfo() failed because it is not supported
+		 * so try nilfs_clean_segments() instead
+		 */
 	}
 
 	ret = nilfs_clean_segments(nilfs,
@@ -830,7 +923,6 @@ out_vec:
 	nilfs_vector_destroy(bdescv);
 	nilfs_vector_destroy(periodv);
 	nilfs_vector_destroy(vblocknrv);
-	nilfs_vector_destroy(supv);
 	/*
 	 * Flags of valid fields in stat->exflags must be unset.
 	 */
-- 
2.3.0


^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [PATCH 5/6] nilfs-utils: add support for greedy/cost-benefit policies
       [not found]     ` <1424804659-10986-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                         ` (2 preceding siblings ...)
  2015-02-24 19:04       ` [PATCH 4/6] nilfs-utils: implement the tracking of live blocks for set_suinfo Andreas Rohner
@ 2015-02-24 19:04       ` Andreas Rohner
  2015-02-24 19:04       ` [PATCH 6/6] nilfs-utils: add su_nsnapshot_blks field to indicate starvation Andreas Rohner
  4 siblings, 0 replies; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:04 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch implements the cost-benefit and greedy GC policies. These
are well-known policies for log-structured file systems [1].

* Greedy:
  Select the segments with the most reclaimable space.
* Cost-Benefit [1]:
  Perform a cost-benefit analysis, whereby the reclaimable space
  gained is weighed against the cost of collecting the segment.

Since cost-benefit in particular needs more information than is
available in nilfs_suinfo, a few extra parameters are added to the
policy callback function prototype. The flag p_comparison is added to
indicate how the importance values should be interpreted. For the
timestamp policy, for example, smaller values mean older timestamps,
which is better. For greedy and cost-benefit, on the other hand,
higher values are better. nilfs_cleanerd_select_segments() was updated
accordingly.
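
One way to picture the effect of p_comparison is a sort whose
direction depends on the flag. The sketch below is an assumption about
how the two comparators could be selected, not a copy of
nilfs_cleanerd_select_segments(); all sk_ names are invented and, in
contrast to the comparators in the patch, equal elements compare as 0.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

enum { SK_SMALLER_IS_BETTER, SK_BIGGER_IS_BETTER };

/* Hypothetical stand-in for struct nilfs_segimp. */
struct sk_segimp {
	uint64_t si_segnum;
	unsigned long long si_importance;
};

/* Tie-break on the segment number so the order is deterministic. */
static int sk_cmp_segnum(const struct sk_segimp *x,
			 const struct sk_segimp *y)
{
	return (x->si_segnum > y->si_segnum) -
	       (x->si_segnum < y->si_segnum);
}

static int sk_cmp_asc(const void *a, const void *b)
{
	const struct sk_segimp *x = a, *y = b;

	if (x->si_importance != y->si_importance)
		return x->si_importance < y->si_importance ? -1 : 1;
	return sk_cmp_segnum(x, y);
}

static int sk_cmp_desc(const void *a, const void *b)
{
	const struct sk_segimp *x = a, *y = b;

	if (x->si_importance != y->si_importance)
		return x->si_importance > y->si_importance ? -1 : 1;
	return sk_cmp_segnum(x, y);
}

/* Best candidate first, whatever "best" means for the policy. */
static void sk_sort_candidates(struct sk_segimp *v, size_t n,
			       int comparison)
{
	assert(v != NULL);

	qsort(v, n, sizeof(*v),
	      comparison == SK_BIGGER_IS_BETTER ? sk_cmp_desc : sk_cmp_asc);
}
```

With this arrangement the rest of the selection code can always take
candidates from the front of the array, independent of the policy.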

The threshold in nilfs_cleanerd_select_segments() can no longer
default to sustat->ss_nongc_ctime, because the greedy and
cost-benefit policies do not return a timestamp, so their importance
values cannot be compared to one. Instead, segments that are younger
than sustat->ss_nongc_ctime are always excluded.

[1] Mendel Rosenblum and John K. Ousterhout. The design and
implementation of a log-structured file system. ACM Trans. Comput.
Syst., 10(1):26–52, February 1992.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 sbin/cleanerd/cldconfig.c | 79 +++++++++++++++++++++++++++++++++++++++++++++--
 sbin/cleanerd/cldconfig.h | 22 +++++++++----
 sbin/cleanerd/cleanerd.c  | 43 ++++++++++++++++++++------
 3 files changed, 126 insertions(+), 18 deletions(-)

diff --git a/sbin/cleanerd/cldconfig.c b/sbin/cleanerd/cldconfig.c
index c8b197b..68090e9 100644
--- a/sbin/cleanerd/cldconfig.c
+++ b/sbin/cleanerd/cldconfig.c
@@ -380,7 +380,9 @@ nilfs_cldconfig_handle_clean_check_interval(struct nilfs_cldconfig *config,
 }
 
 static unsigned long long
-nilfs_cldconfig_selection_policy_timestamp(const struct nilfs_suinfo *si)
+nilfs_cldconfig_selection_policy_timestamp(const struct nilfs_suinfo *si,
+					   const struct nilfs_sustat *sustat,
+					   __u64 prottime)
 {
 	return si->sui_lastmod;
 }
@@ -392,13 +394,84 @@ nilfs_cldconfig_handle_selection_policy_timestamp(struct nilfs_cldconfig *config
 	config->cf_selection_policy.p_importance =
 		NILFS_CLDCONFIG_SELECTION_POLICY_IMPORTANCE;
 	config->cf_selection_policy.p_threshold =
-		NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD;
+		NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD;
+	config->cf_selection_policy.p_comparison =
+		NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER;
+	return 0;
+}
+
+static unsigned long long
+nilfs_cldconfig_selection_policy_greedy(const struct nilfs_suinfo *si,
+					const struct nilfs_sustat *sustat,
+					__u64 prottime)
+{
+	if (si->sui_nblocks < si->sui_nlive_blks ||
+	    si->sui_nlive_lastmod >= prottime)
+		return 0;
+
+	return si->sui_nblocks - si->sui_nlive_blks;
+}
+
+static int
+nilfs_cldconfig_handle_selection_policy_greedy(struct nilfs_cldconfig *config,
+					       char **tokens, size_t ntoks)
+{
+	config->cf_selection_policy.p_importance =
+		nilfs_cldconfig_selection_policy_greedy;
+	config->cf_selection_policy.p_threshold =
+		NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD;
+	config->cf_selection_policy.p_comparison =
+		NILFS_CLDCONFIG_SELECTION_POLICY_BIGGER_IS_BETTER;
+	return 0;
+}
+
+static unsigned long long
+nilfs_cldconfig_selection_policy_cost_benefit(const struct nilfs_suinfo *si,
+					      const struct nilfs_sustat *sustat,
+					      __u64 prottime)
+{
+	__u32 free_blocks, cleaning_cost;
+	unsigned long long age;
+
+	if (si->sui_nblocks < si->sui_nlive_blks ||
+	    sustat->ss_nongc_ctime < si->sui_lastmod ||
+	    si->sui_nlive_lastmod >= prottime)
+		return 0;
+
+	free_blocks = si->sui_nblocks - si->sui_nlive_blks;
+	/* read the whole segment + write the live blocks */
+	cleaning_cost = 2 * si->sui_nlive_blks;
+	/*
+	 * multiply by 1000 to convert age to milliseconds
+	 * (higher precision for division)
+	 */
+	age = (sustat->ss_nongc_ctime - si->sui_lastmod) * 1000;
+
+	if (cleaning_cost == 0)
+		cleaning_cost = 1;
+
+	return (age * free_blocks) / cleaning_cost;
+}
+
+static int
+nilfs_cldconfig_handle_selection_policy_cost_benefit(
+						struct nilfs_cldconfig *config,
+						char **tokens, size_t ntoks)
+{
+	config->cf_selection_policy.p_importance =
+		nilfs_cldconfig_selection_policy_cost_benefit;
+	config->cf_selection_policy.p_threshold =
+		NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD;
+	config->cf_selection_policy.p_comparison =
+		NILFS_CLDCONFIG_SELECTION_POLICY_BIGGER_IS_BETTER;
 	return 0;
 }
 
 static const struct nilfs_cldconfig_polhandle
 nilfs_cldconfig_polhandle_table[] = {
 	{"timestamp",	nilfs_cldconfig_handle_selection_policy_timestamp},
+	{"greedy",	nilfs_cldconfig_handle_selection_policy_greedy},
+	{"cost-benefit", nilfs_cldconfig_handle_selection_policy_cost_benefit},
 };
 
 #define NILFS_CLDCONFIG_NPOLHANDLES			\
@@ -690,6 +763,8 @@ static void nilfs_cldconfig_set_default(struct nilfs_cldconfig *config,
 		NILFS_CLDCONFIG_SELECTION_POLICY_IMPORTANCE;
 	config->cf_selection_policy.p_threshold =
 		NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD;
+	config->cf_selection_policy.p_comparison =
+		NILFS_CLDCONFIG_SELECTION_POLICY_COMPARISON;
 	config->cf_protection_period.tv_sec = NILFS_CLDCONFIG_PROTECTION_PERIOD;
 	config->cf_protection_period.tv_usec = 0;
 
diff --git a/sbin/cleanerd/cldconfig.h b/sbin/cleanerd/cldconfig.h
index 2a0af5f..3c9f5e6 100644
--- a/sbin/cleanerd/cldconfig.h
+++ b/sbin/cleanerd/cldconfig.h
@@ -30,16 +30,22 @@
 #include <sys/time.h>
 #include <syslog.h>
 
+struct nilfs;
+struct nilfs_sustat;
 struct nilfs_suinfo;
 
 /**
  * struct nilfs_selection_policy -
- * @p_importance:
- * @p_threshold:
+ * @p_importance: function to calculate the importance for the policy
+ * @p_threshold: segments with lower/higher importance are ignored
+ * @p_comparison: flag that indicates how to sort the importance
  */
 struct nilfs_selection_policy {
-	unsigned long long (*p_importance)(const struct nilfs_suinfo *);
+	unsigned long long (*p_importance)(const struct nilfs_suinfo *,
+					   const struct nilfs_sustat *,
+					   __u64);
 	unsigned long long p_threshold;
+	int p_comparison;
 };
 
 /**
@@ -111,9 +117,15 @@ struct nilfs_cldconfig {
 	unsigned long cf_mc_min_reclaimable_blocks;
 };
 
+#define NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER	0
+#define NILFS_CLDCONFIG_SELECTION_POLICY_BIGGER_IS_BETTER	1
+#define NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD	0
 #define NILFS_CLDCONFIG_SELECTION_POLICY_IMPORTANCE	\
 			nilfs_cldconfig_selection_policy_timestamp
-#define NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD	0
+#define NILFS_CLDCONFIG_SELECTION_POLICY_THRESHOLD	\
+			NILFS_CLDCONFIG_SELECTION_POLICY_NO_THRESHOLD
+#define NILFS_CLDCONFIG_SELECTION_POLICY_COMPARISON	\
+			NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER
 #define NILFS_CLDCONFIG_PROTECTION_PERIOD		3600
 #define NILFS_CLDCONFIG_MIN_CLEAN_SEGMENTS		10
 #define NILFS_CLDCONFIG_MIN_CLEAN_SEGMENTS_UNIT		NILFS_SIZE_UNIT_PERCENT
@@ -135,8 +147,6 @@ struct nilfs_cldconfig {
 
 #define NILFS_CLDCONFIG_NSEGMENTS_PER_CLEAN_MAX	32
 
-struct nilfs;
-
 int nilfs_cldconfig_read(struct nilfs_cldconfig *config, const char *path,
 			 struct nilfs *nilfs);
 
diff --git a/sbin/cleanerd/cleanerd.c b/sbin/cleanerd/cleanerd.c
index d37bd5c..e0741f1 100644
--- a/sbin/cleanerd/cleanerd.c
+++ b/sbin/cleanerd/cleanerd.c
@@ -417,7 +417,7 @@ static void nilfs_cleanerd_destroy(struct nilfs_cleanerd *cleanerd)
 	free(cleanerd);
 }
 
-static int nilfs_comp_segimp(const void *elem1, const void *elem2)
+static int nilfs_comp_segimp_asc(const void *elem1, const void *elem2)
 {
 	const struct nilfs_segimp *segimp1 = elem1, *segimp2 = elem2;
 
@@ -429,6 +429,18 @@ static int nilfs_comp_segimp(const void *elem1, const void *elem2)
 	return (segimp1->si_segnum < segimp2->si_segnum) ? -1 : 1;
 }
 
+static int nilfs_comp_segimp_desc(const void *elem1, const void *elem2)
+{
+	const struct nilfs_segimp *segimp1 = elem1, *segimp2 = elem2;
+
+	if (segimp1->si_importance > segimp2->si_importance)
+		return -1;
+	else if (segimp1->si_importance < segimp2->si_importance)
+		return 1;
+
+	return (segimp1->si_segnum < segimp2->si_segnum) ? -1 : 1;
+}
+
 static int nilfs_cleanerd_automatic_suspend(struct nilfs_cleanerd *cleanerd)
 {
 	return cleanerd->config.cf_min_clean_segments > 0;
@@ -580,7 +592,7 @@ nilfs_cleanerd_select_segments(struct nilfs_cleanerd *cleanerd,
 	size_t count, nsegs;
 	ssize_t nssegs, n;
 	unsigned long long imp, thr;
-	int i;
+	int i, sib;
 
 	nsegs = nilfs_cleanerd_ncleansegs(cleanerd);
 	nilfs = cleanerd->nilfs;
@@ -600,11 +612,17 @@ nilfs_cleanerd_select_segments(struct nilfs_cleanerd *cleanerd,
 	prottime = tv2.tv_sec;
 	oldest = tv.tv_sec;
 
-	/* The segments that have larger importance than thr are not
+	/*
+	 * sufile extension fields may not be initialized by
+	 * nilfs_get_suinfo()
+	 */
+	memset(si, 0, sizeof(si));
+
+	/* The segments that have larger/smaller importance than thr are not
 	 * selected. */
-	thr = (config->cf_selection_policy.p_threshold != 0) ?
-		config->cf_selection_policy.p_threshold :
-		sustat->ss_nongc_ctime;
+	thr = config->cf_selection_policy.p_threshold;
+	sib = config->cf_selection_policy.p_comparison ==
+			NILFS_CLDCONFIG_SELECTION_POLICY_SMALLER_IS_BETTER;
 
 	for (segnum = 0; segnum < sustat->ss_nsegs; segnum += n) {
 		count = min_t(__u64, sustat->ss_nsegs - segnum,
@@ -615,11 +633,13 @@ nilfs_cleanerd_select_segments(struct nilfs_cleanerd *cleanerd,
 			goto out;
 		}
 		for (i = 0; i < n; i++) {
-			if (!nilfs_suinfo_reclaimable(&si[i]))
+			if (!nilfs_suinfo_reclaimable(&si[i]) ||
+				si[i].sui_lastmod >= sustat->ss_nongc_ctime)
 				continue;
 
-			imp = config->cf_selection_policy.p_importance(&si[i]);
-			if (imp < thr) {
+			imp = config->cf_selection_policy.p_importance(&si[i],
+					sustat, prottime);
+			if (!thr || (sib && imp < thr) || (!sib && imp > thr)) {
 				if (si[i].sui_lastmod < oldest)
 					oldest = si[i].sui_lastmod;
 				if (si[i].sui_lastmod < prottime) {
@@ -642,7 +662,10 @@ nilfs_cleanerd_select_segments(struct nilfs_cleanerd *cleanerd,
 			break;
 		}
 	}
-	nilfs_vector_sort(smv, nilfs_comp_segimp);
+	if (sib)
+		nilfs_vector_sort(smv, nilfs_comp_segimp_asc);
+	else
+		nilfs_vector_sort(smv, nilfs_comp_segimp_desc);
 
 	nssegs = (nilfs_vector_get_size(smv) < nsegs) ?
 		nilfs_vector_get_size(smv) : nsegs;
-- 
2.3.0

--
To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH 6/6] nilfs-utils: add su_nsnapshot_blks field to indicate starvation
       [not found]     ` <1424804659-10986-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                         ` (3 preceding siblings ...)
  2015-02-24 19:04       ` [PATCH 5/6] nilfs-utils: add support for greedy/cost-benefit policies Andreas Rohner
@ 2015-02-24 19:04       ` Andreas Rohner
  4 siblings, 0 replies; 36+ messages in thread
From: Andreas Rohner @ 2015-02-24 19:04 UTC (permalink / raw)
  To: linux-nilfs-u79uwXL29TY76Z2rM5mHXA; +Cc: Andreas Rohner

This patch adds support for the field su_nsnapshot_blks and includes the
necessary flags to update it from the GC.

The GC already has the necessary information about which block belongs
to a snapshot and which doesn't. So these blocks are counted up and
passed to the caller.

The number of snapshot blocks will then be updated with the
NILFS_IOCTL_SET_SUINFO ioctl.

Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
---
 include/nilfs.h              |  9 +++++++++
 include/nilfs2_fs.h          | 12 ++++++++----
 lib/feature.c                |  2 ++
 lib/gc.c                     | 19 ++++++++++++++-----
 lib/nilfs.c                  |  2 ++
 man/mkfs.nilfs2.8            |  6 ++++++
 sbin/mkfs/mkfs.c             |  3 ++-
 sbin/nilfs-tune/nilfs-tune.c |  6 ++++--
 8 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/include/nilfs.h b/include/nilfs.h
index 8511163..e84656b 100644
--- a/include/nilfs.h
+++ b/include/nilfs.h
@@ -131,6 +131,7 @@ struct nilfs {
 #define NILFS_OPT_MMAP		0x01
 #define NILFS_OPT_SET_SUINFO	0x02
 #define NILFS_OPT_TRACK_LIVE_BLKS	0x04
+#define NILFS_OPT_TRACK_SNAPSHOTS	0x08
 
 
 struct nilfs *nilfs_open(const char *, const char *, int);
@@ -161,6 +162,7 @@ nilfs_opt_test_##name(const struct nilfs *nilfs)			\
 
 NILFS_OPT_FLAG(SET_SUINFO, set_suinfo);
 NILFS_OPT_FLAG(TRACK_LIVE_BLKS, track_live_blks);
+NILFS_OPT_FLAG(TRACK_SNAPSHOTS, track_snapshots);
 
 nilfs_cno_t nilfs_get_oldest_cno(struct nilfs *);
 
@@ -356,4 +358,11 @@ static inline int nilfs_feature_track_live_blks(const struct nilfs *nilfs)
 		(fc & NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
 }
 
+static inline int nilfs_feature_track_snapshots(const struct nilfs *nilfs)
+{
+	__u64 fc = le64_to_cpu(nilfs->n_sb->s_feature_compat);
+	return (fc & NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS) &&
+		nilfs_feature_track_live_blks(nilfs);
+}
+
 #endif	/* NILFS_H */
diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
index 427ca53..f1f315c 100644
--- a/include/nilfs2_fs.h
+++ b/include/nilfs2_fs.h
@@ -221,11 +221,13 @@ struct nilfs_super_block {
  */
 #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
 #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
+#define NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS		(1ULL << 2)
 
 #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
 
 #define NILFS_FEATURE_COMPAT_SUPP	(NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
-				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
+				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS \
+				| NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS)
 #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
 #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
 
@@ -630,7 +632,7 @@ struct nilfs_segment_usage {
 	__le32 su_nblocks;
 	__le32 su_flags;
 	__le32 su_nlive_blks;
-	__le32 su_pad;
+	__le32 su_nsnapshot_blks;
 	__le64 su_nlive_lastmod;
 };
 
@@ -682,7 +684,7 @@ nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
 	su->su_flags = cpu_to_le32(0);
 	if (susz >= NILFS_EXT_SEGMENT_USAGE_SIZE) {
 		su->su_nlive_blks = cpu_to_le32(0);
-		su->su_pad = cpu_to_le32(0);
+		su->su_nsnapshot_blks = cpu_to_le32(0);
 		su->su_nlive_lastmod = cpu_to_le64(0);
 	}
 }
@@ -723,7 +725,7 @@ struct nilfs_suinfo {
 	__u32 sui_nblocks;
 	__u32 sui_flags;
 	__u32 sui_nlive_blks;
-	__u32 sui_pad;
+	__u32 sui_nsnapshot_blks;
 	__u64 sui_nlive_lastmod;
 };
 
@@ -764,6 +766,7 @@ enum {
 	NILFS_SUINFO_UPDATE_FLAGS,
 	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
 	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
+	NILFS_SUINFO_UPDATE_NSNAPSHOT_BLKS,
 	__NR_NILFS_SUINFO_UPDATE_FIELDS,
 };
 
@@ -788,6 +791,7 @@ NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
 NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
 NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
 NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
+NILFS_SUINFO_UPDATE_FNS(NSNAPSHOT_BLKS, nsnapshot_blks)
 NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
 
 enum {
diff --git a/lib/feature.c b/lib/feature.c
index ebe8c3f..376fa53 100644
--- a/lib/feature.c
+++ b/lib/feature.c
@@ -59,6 +59,8 @@ static const struct nilfs_feature features[] = {
 	  NILFS_FEATURE_COMPAT_SUFILE_EXTENSION, "sufile_ext" },
 	{ NILFS_FEATURE_TYPE_COMPAT,
 	  NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS, "track_live_blks" },
+	{ NILFS_FEATURE_TYPE_COMPAT,
+	  NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS, "track_snapshots" },
 	/* Read-only compat features */
 	{ NILFS_FEATURE_TYPE_COMPAT_RO,
 	  NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT, "block_count" },
diff --git a/lib/gc.c b/lib/gc.c
index a2461b9..f1b8b85 100644
--- a/lib/gc.c
+++ b/lib/gc.c
@@ -629,12 +629,13 @@ static int nilfs_toss_bdescs(struct nilfs_vector *bdescv)
 static size_t nilfs_count_nlive_blks(const struct nilfs *nilfs,
 				     __u64 segnum,
 				     struct nilfs_vector *vdescv,
-				     struct nilfs_vector *bdescv)
+				     struct nilfs_vector *bdescv,
+				     size_t *pnss)
 {
 	struct nilfs_vdesc *vdesc;
 	struct nilfs_bdesc *bdesc;
 	int i;
-	size_t res = 0;
+	size_t res = 0, nss = 0;
 
 	for (i = 0; i < nilfs_vector_get_size(bdescv); i++) {
 		bdesc = nilfs_vector_get_element(bdescv, i);
@@ -651,10 +652,16 @@ static size_t nilfs_count_nlive_blks(const struct nilfs *nilfs,
 
 		if (nilfs_get_segnum_of_block(nilfs, vdesc->vd_blocknr) ==
 		    segnum && (nilfs_vdesc_snapshot(vdesc) ||
-		    !nilfs_vdesc_protection_period(vdesc)))
+		    !nilfs_vdesc_protection_period(vdesc))) {
 			++res;
+			if (nilfs_vdesc_snapshot(vdesc))
+				++nss;
+		}
 	}
 
+	if (pnss)
+		*pnss = nss;
+
 	return res;
 }
 
@@ -685,7 +692,7 @@ static int nilfs_try_set_suinfo(struct nilfs *nilfs, __u64 *segnums,
 	struct nilfs_suinfo_update *sup;
 	struct timeval tv;
 	int ret = -1;
-	size_t i, nblocks;
+	size_t i, nblocks, nss;
 
 	supv = nilfs_vector_create(sizeof(struct nilfs_suinfo_update));
 	if (!supv)
@@ -709,10 +716,12 @@ static int nilfs_try_set_suinfo(struct nilfs *nilfs, __u64 *segnums,
 
 		if (nilfs_opt_test_track_live_blks(nilfs)) {
 			nilfs_suinfo_update_set_nlive_blks(sup);
+			nilfs_suinfo_update_set_nsnapshot_blks(sup);
 
 			nblocks = nilfs_count_nlive_blks(nilfs,
-					segnums[i], vdescv, bdescv);
+					segnums[i], vdescv, bdescv, &nss);
 			sup->sup_sui.sui_nlive_blks = nblocks;
+			sup->sup_sui.sui_nsnapshot_blks = nss;
 		}
 	}
 
diff --git a/lib/nilfs.c b/lib/nilfs.c
index 2067fc0..c453d5b 100644
--- a/lib/nilfs.c
+++ b/lib/nilfs.c
@@ -381,6 +381,8 @@ struct nilfs *nilfs_open(const char *dev, const char *dir, int flags)
 
 		if (nilfs_feature_track_live_blks(nilfs))
 			nilfs_opt_set_track_live_blks(nilfs);
+		if (nilfs_feature_track_snapshots(nilfs))
+			nilfs_opt_set_track_snapshots(nilfs);
 	}
 
 	if (flags &
diff --git a/man/mkfs.nilfs2.8 b/man/mkfs.nilfs2.8
index 2431ac9..c784883 100644
--- a/man/mkfs.nilfs2.8
+++ b/man/mkfs.nilfs2.8
@@ -182,6 +182,12 @@ Enables the tracking of live blocks, which might improve the effectiveness of
 garbage collection, but entails a small runtime overhead. It is important to
 note, that this feature depends on sufile_ext, which can only be set
 at file system creation time.
+.TP
+.B track_snapshots
+Enables an efficient heuristic tracking of the number of snapshot blocks in a
+segment. This prevents starvation of segments and improves the overall
+performance. It is important to note, that this feature depends on sufile_ext,
+which can only be set at file system creation time.
 .RE
 .TP
 .B \-q
diff --git a/sbin/mkfs/mkfs.c b/sbin/mkfs/mkfs.c
index 680311c..e69abc8 100644
--- a/sbin/mkfs/mkfs.c
+++ b/sbin/mkfs/mkfs.c
@@ -1083,7 +1083,8 @@ static inline void check_ctime(time_t ctime)
 static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
 	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION |
-	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
+	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
+	NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
diff --git a/sbin/nilfs-tune/nilfs-tune.c b/sbin/nilfs-tune/nilfs-tune.c
index 7889310..d595366 100644
--- a/sbin/nilfs-tune/nilfs-tune.c
+++ b/sbin/nilfs-tune/nilfs-tune.c
@@ -84,7 +84,8 @@ static void nilfs_tune_usage(void)
 
 static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
+	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
+	NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
@@ -93,7 +94,8 @@ static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
 
 static const __u64 clear_ok_features[NILFS_MAX_FEATURE_TYPES] = {
 	/* Compat */
-	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
+	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
+	NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS,
 	/* Read-only compat */
 	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
 	/* Incompat */
-- 
2.3.0



* Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
       [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
                     ` (9 preceding siblings ...)
  2015-02-24 19:04   ` [PATCH 1/6] nilfs-utils: extend SUFILE on-disk format to enable track live blocks Andreas Rohner
@ 2015-02-25  0:18   ` Ryusuke Konishi
       [not found]     ` <20150225.091804.1850885506186316087.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  10 siblings, 1 reply; 36+ messages in thread
From: Ryusuke Konishi @ 2015-02-25  0:18 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi Andreas,

Thank you for posting this proposal!

I would like to take the time to review this series thoroughly, so
please wait a few days. (I'm quite busy this week until the weekend.)

Thanks,
Ryusuke Konishi

On Tue, 24 Feb 2015 20:01:35 +0100, Andreas Rohner wrote:
> Hi everyone!
> 
> One of the biggest performance problems of NILFS is its
> inefficient Timestamp GC policy. This patch set introduces two new GC
> policies, namely Cost-Benefit and Greedy.
> 
> The Cost-Benefit policy is nothing new. It has been around for a long
> time with log-structured file systems [1]. But it relies on accurate
> information, about the number of live blocks in a segment. NILFS
> currently does not provide the necessary information. So this patch set
> extends the entries in the SUFILE to include a counter for the number of
> live blocks. This counter is decremented whenever a file is deleted or
> overwritten.
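
For readers unfamiliar with the policy, here is a minimal sketch of the
classic cost-benefit ratio from [1], benefit/cost = ((1 - u) * age) / (1 + u),
where u is the fraction of live blocks in a segment and age is derived from
the segment's modification time. The function name and parameters are
assumptions made for this illustration; this is not the importance function
from this patch set, which may weigh things differently.

```c
#include <stdint.h>

/*
 * Cost-benefit ratio as described by Rosenblum and Ousterhout [1].
 * Hypothetical helper for illustration only.
 *
 * nlive_blks: (estimated) number of live blocks in the segment
 * nblocks:    total number of blocks in the segment
 * age_secs:   age of the segment in seconds
 */
static double cost_benefit_ratio(uint32_t nlive_blks, uint32_t nblocks,
				 uint64_t age_secs)
{
	double u;

	if (nblocks == 0)
		return 0.0;
	if (nlive_blks > nblocks)	/* clamp an inaccurate counter */
		nlive_blks = nblocks;

	u = (double)nlive_blks / (double)nblocks;

	/* Reading costs 1, writing back the live data costs u. */
	return (1.0 - u) * (double)age_secs / (1.0 + u);
}
```

A larger ratio means a better cleaning candidate, which is why a policy like
this needs a "bigger is better" comparison, whereas the timestamp policy
prefers the smallest value.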
> 
> Except for some tricky parts, the counting of live blocks is quite
> trivial. The problem is snapshots. At any time, a checkpoint can be
> turned into a snapshot or vice versa. So blocks that are reclaimable at
> one point in time, are protected by a snapshot a moment later.
> 
> This patch set does not try to track snapshots at all. Instead it uses a
> heuristic approach to prevent the worst case scenario. The performance
> is still significantly better than timestamp for my benchmarks.
> 
> The worst case scenario is, the following:
> 
> 1. Segment 1 is written
> 2. Snapshot is created
> 3. GC tries to reclaim Segment 1, but all blocks are protected
>    by the Snapshot. The GC has to set the number of live blocks
>    to maximum to avoid reclaiming this Segment again in the near future.
> 4. Snapshot is deleted
> 5. Segment 1 is reclaimable, but its counter is so high, that the GC
>    will never try to reclaim it again.
> 
> To prevent this kind of starvation I use another field in the SUFILE
> entry, to store the number of blocks that are protected by a snapshot.
> This value is just a heuristic and it is usually set to 0. Only if the
> GC reclaims a segment, it is written to the SUFILE entry. The GC has to
> check for snapshots anyway, so we get this information for free. By
> storing this information in the SUFILE we can avoid starvation in the
> following way:
> 
> 1. Segment 1 is written
> 2. Snapshot is created
> 3. GC tries to reclaim Segment 1, but all blocks are protected
>    by the Snapshot. The GC has to set the number of live blocks
>    to maximum to avoid reclaiming this Segment again in the near future.
> 4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
>    entry
> 5. Snapshot is deleted
> 6. On Snapshot deletion we walk through every entry in the SUFILE and
>    reduce the number of live blocks to half, if the number of snapshot
>    blocks is bigger than half of the maximum.
> 7. Segment 1 is reclaimable and the number of live blocks entry is at
>    half the maximum. The GC will try to reclaim this segment as soon as
>    there are no other better choices.
> 
> BENCHMARKS:
> -----------
> 
> My benchmark is quite simple. It consists of a process, that replays
> real NFS traces at a faster speed. It thereby creates relatively
> realistic patterns of file creation and deletions. At the same time
> multiple snapshots are created and deleted in parallel. I use a 100GB
> partition of a Samsung SSD:
> 
> WITH SNAPSHOTS EVERY 5 MINUTES:
> --------------------------------------------------------------------
>                 Execution time       Wear (Data written to disk)
> Timestamp:      100%                 100%
> Cost-Benefit:   80%                  43%
> 
> NO SNAPSHOTS:
> ---------------------------------------------------------------------
>                 Execution time       Wear (Data written to disk)
> Timestamp:      100%                 100%
> Cost-Benefit:   70%                  45%
> 
> I plan on adding more benchmark results soon.
> 
> Best regards,
> Andreas Rohner
> 
> [1] Mendel Rosenblum and John K. Ousterhout. The design and implementa-
>     tion of a log-structured file system. ACM Trans. Comput. Syst.,
>     10(1):26–52, February 1992.
> 
> Andreas Rohner (9):
>   nilfs2: refactor nilfs_sufile_updatev()
>   nilfs2: add simple cache for modifications to SUFILE
>   nilfs2: extend SUFILE on-disk format to enable counting of live blocks
>   nilfs2: add function to modify su_nlive_blks
>   nilfs2: add simple tracking of block deletions and updates
>   nilfs2: use modification cache to improve performance
>   nilfs2: add additional flags for nilfs_vdesc
>   nilfs2: improve accuracy and correct for invalid GC values
>   nilfs2: prevent starvation of segments protected by snapshots
> 
>  fs/nilfs2/bmap.c          |  84 +++++++-
>  fs/nilfs2/bmap.h          |  14 +-
>  fs/nilfs2/btree.c         |   4 +-
>  fs/nilfs2/cpfile.c        |   5 +
>  fs/nilfs2/dat.c           |  95 ++++++++-
>  fs/nilfs2/dat.h           |   8 +-
>  fs/nilfs2/direct.c        |   4 +-
>  fs/nilfs2/inode.c         |  24 ++-
>  fs/nilfs2/ioctl.c         |  27 ++-
>  fs/nilfs2/mdt.c           |   5 +-
>  fs/nilfs2/page.h          |   6 +-
>  fs/nilfs2/segbuf.c        |   6 +
>  fs/nilfs2/segbuf.h        |   3 +
>  fs/nilfs2/segment.c       | 155 +++++++++++++-
>  fs/nilfs2/segment.h       |   3 +
>  fs/nilfs2/sufile.c        | 533 +++++++++++++++++++++++++++++++++++++++++++---
>  fs/nilfs2/sufile.h        |  97 +++++++--
>  fs/nilfs2/the_nilfs.c     |   4 +
>  fs/nilfs2/the_nilfs.h     |  23 ++
>  include/linux/nilfs2_fs.h | 122 ++++++++++-
>  20 files changed, 1126 insertions(+), 96 deletions(-)
> 
> -- 
> 2.3.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
       [not found]     ` <20150225.091804.1850885506186316087.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-03-10  5:21       ` Ryusuke Konishi
       [not found]         ` <20150310.142119.813265940569588216.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-10  5:21 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA


Hi Andreas,

I looked through all of the kernel patches and part of the utils patches.
My overall comments are as follows:

[Algorithm]
The algorithm looks about OK except for the starvation
countermeasure.  The starvation countermeasure looks ad hoc/hacky, but
it's good that it doesn't change the kernel/userland interface; we may
be able to replace it with something better in the future or in a
revised version of this patchset.

(1) Drawback of the starvation countermeasure
    Patch 9/9 looks likely to make the execution time of the chcp
    operation worse, since it scans through the sufile to modify live
    block counters.  How much does it prolong the execution time?

    In one use case of nilfs, many snapshots are created and are
    automatically changed back to plain checkpoints as old
    snapshots are thinned out over time.  Patch 9/9 may have an impact
    on such usage.
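
To make the cost in question concrete, here is a minimal userspace sketch
of the scan described in steps 6 and 7 of the cover letter. The struct
su_counters type, the function name, and the rule of never raising a
counter are assumptions made for this illustration; they are not taken
from patch 9/9.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory mirror of the two counters, not the on-disk layout. */
struct su_counters {
	uint32_t nlive_blks;
	uint32_t nsnapshot_blks;
};

/*
 * Visit every entry once after a snapshot has been deleted.  If more
 * than half of a segment was protected by snapshots at the last GC
 * pass, cap its live block counter at half the maximum so that the
 * segment becomes a candidate again.  Returns the number of entries
 * that were changed.
 */
static size_t relax_after_snapshot_delete(struct su_counters *su,
					  size_t nsegs,
					  uint32_t blks_per_seg)
{
	uint32_t half = blks_per_seg / 2;
	size_t i, changed = 0;

	for (i = 0; i < nsegs; i++) {
		/* Assumption: only lower a counter, never raise it. */
		if (su[i].nsnapshot_blks > half && su[i].nlive_blks > half) {
			su[i].nlive_blks = half;
			changed++;
		}
	}
	return changed;
}
```

The loop touches every segment usage entry, so in this form the cost of each
snapshot release grows linearly with the number of segments.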

(2) Compatibility
    What will happen in the following case?
    1. Create a file system, use it with the new module, and
       create snapshots.
    2. Mount it with an old module, and release a snapshot with "chcp cp".
    3. Mount it with the new module, and cleanerd runs gc with the
       cost-benefit or greedy policy.
    
(3) Durability against unexpected power failures (just a note)
    The current patchset does not appear to cause a starvation issue
    even when an unexpected power failure occurs during or after
    executing "chcp cp", because nilfs_ioctl_change_cpmode() makes its
    changes in a transactional way with nilfs_transaction_begin/commit.
    We should always consider this kind of situation to keep consistency.

[Coding Style]
(4) This patchset has several coding style issues. Please fix them and
    re-check with the latest checkpatch script (scripts/checkpatch.pl).

patch 2:
WARNING: Prefer kmalloc_array over kmalloc with multiply
#85: FILE: fs/nilfs2/sufile.c:1192:
+    mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),

patch 5,6:
WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
#60: 
the same semaphore has to be aquired. So if the DAT-Entry belongs to

WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
#46: 
be aquired, which blocks the entire SUFILE and effectively turns

WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
#53: 
afore mentioned lock only needs to be aquired, if the cache is full
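
As background for the first warning: kmalloc_array() checks the
multiplication for overflow before allocating and returns NULL if it
would overflow. A minimal userspace analogue of that check is sketched
below; alloc_array() is a hypothetical name used only for this
illustration and is not a kernel function.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Userspace analogue of the overflow check that kmalloc_array() adds
 * over kmalloc(n * size): refuse the request if n * size would not
 * fit in a size_t instead of silently allocating a too-small buffer.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would overflow */

	return malloc(n * size);
}
```

In the kernel the fix is simply to pass the element count and the element
size as separate arguments to kmalloc_array(), keeping the same GFP flags
as the original call.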

(5) sub_sizeof macro:
    The same definition exists as offsetofend() in vfio.h,
    and a patch to move it to stddef.h has been proposed.

    Please use the same name, and define it only if it's not
    already defined:

#ifndef offsetofend
#define offsetofend(TYPE, MEMBER) \
        (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
#endif
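
A typical use of such a macro is to test whether a variable-sized
on-disk entry is large enough to contain a given field. The sketch
below shows this in userspace; struct su_entry and
su_entry_has_nlive_blks() are hypothetical stand-ins for this
illustration, not names from the patchset.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef offsetofend
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
#endif

/* Hypothetical stand-in for an extensible segment usage entry. */
struct su_entry {
	uint64_t lastmod;
	uint32_t nblocks;
	uint32_t flags;
	uint32_t nlive_blks;		/* extension fields start here */
	uint32_t nsnapshot_blks;
	uint64_t nlive_lastmod;
};

/* Nonzero if an entry of entry_size bytes contains nlive_blks. */
static int su_entry_has_nlive_blks(size_t entry_size)
{
	return entry_size >= offsetofend(struct su_entry, nlive_blks);
}
```

With this layout the base entry ends at byte 16 and nlive_blks ends at
byte 20, so a 16-byte entry fails the check and a 20-byte or larger entry
passes it.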

[Implementation]
(6) b_blocknr
    Please do not use bh->b_blocknr to store the disk block number.  This
    field is used to keep the virtual block number, except for DAT files.
    It is only replaced with the actual block number while calling
    submit_bh().  Please keep this policy.

    In the segment constructor context, you can calculate the disk block
    number from the start disk address of the segment and the block
    index (offset) in the segment.
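
A minimal sketch of that calculation, with hypothetical helper names
and plain integer parameters (the real segment constructor keeps these
values in its own structures, which are not reproduced here; it also
assumes blocks_per_segment is nonzero):

```c
#include <stdint.h>

/*
 * Disk block number of the index-th block of a log, given the disk
 * address at which the log starts.  Hypothetical helper.
 */
static uint64_t disk_blocknr_in_segment(uint64_t seg_start, uint64_t index)
{
	return seg_start + index;
}

/*
 * Segment that a disk block number falls into, assuming segments of
 * equal size laid out back to back.  Hypothetical helper.
 */
static uint64_t segnum_of_disk_blocknr(uint64_t blocknr,
					uint64_t blocks_per_segment)
{
	return blocknr / blocks_per_segment;
}
```

Computing the address this way leaves bh->b_blocknr untouched, so it keeps
holding the virtual block number as described above.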

(7) sufile mod cache
    Consider gathering the cache into the nilfs_sufile_info struct and
    no longer passing it as an argument of the bmap/sufile/dat interface
    functions.  Passing it around is hacky, decreases the readability
    of the code, and bloats the changes of this patchset across multiple
    function blocks.

    The cache should be well designed. It's important to balance the
    performance and locality/transparency of the feature.  For
    instance, it can be implemented with a radix-tree of objects in
    which each object has a vector of 2^k cache entries.
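
As a rough illustration of that layout, the sketch below shows how a
segment number could be split into a tree key and a slot inside a
node holding 2^k entries. The value of k, the struct, and all function
names are assumptions for this sketch; the radix tree itself is left
out.

```c
#include <stdint.h>

#define MOD_CACHE_SHIFT		4	/* k, assumed value */
#define MOD_CACHE_ENTRIES	(1U << MOD_CACHE_SHIFT)
#define MOD_CACHE_MASK		(MOD_CACHE_ENTRIES - 1)

/* One cached object: pending live block deltas for 2^k segments. */
struct mod_cache_node {
	int32_t delta[MOD_CACHE_ENTRIES];
};

/* Key under which the node for segnum would be stored in the tree. */
static uint64_t mod_cache_key(uint64_t segnum)
{
	return segnum >> MOD_CACHE_SHIFT;
}

/* Index of segnum inside its node. */
static unsigned int mod_cache_slot(uint64_t segnum)
{
	return (unsigned int)(segnum & MOD_CACHE_MASK);
}

/* Accumulate a pending change for segnum in an already looked-up node. */
static void mod_cache_add(struct mod_cache_node *node, uint64_t segnum,
			  int32_t delta)
{
	node->delta[mod_cache_slot(segnum)] += delta;
}
```

Neighbouring segment numbers share one node, so updates to segments that
are close together are accumulated in the same object before they are
written back.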

    I think the cache should be written back to the sufile buffers
    only within the segment construction context. At the least, it should
    be written back in a context in which a transaction lock is held.

    In addition, introducing a new bmap lock dependency,
    nilfs_sufile_lock_key, is undesirable. You should avoid it
    by delaying the writeback of cache entries to the sufile.

(8) Changes to the sufile must be finished before the dirty buffer
    collection of the sufile.
    All mark_buffer_dirty() calls on the sufile must be finished
    before or in the NILFS_ST_SUFILE stage of nilfs_segctor_collect_blocks().

    (You can write fixed figures to the sufile after its collection
     phase by marking the buffers dirty in advance, before the
     collection phase.)

    In the current patchset, the sufile mod cache can be flushed in
    nilfs_segctor_update_payload_blocknr(), which comes after the
    dirty buffer collection phase.

(9) cpfile should also be excluded from the dead block counting, like sufile
    cpfile is always changed and written back along with sufile and dat.
    So, cpfile must be excluded from the dead block counting.
    Otherwise, a sufile change can trigger cpfile changes, which in turn
    trigger sufile changes.

    This also helps to simplify nilfs_dat_commit_end(), to which the
    patchset added two arguments for the dead block counting.
    I mean, the "dead" argument and the "count_blocks" argument can be
    unified by changing the meaning of the "dead" argument.


I will add detailed comments on the patches tonight or another day.

Regards,
Ryusuke Konishi


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev()
       [not found]     ` <1424804504-10914-2-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-10 15:52       ` Ryusuke Konishi
       [not found]         ` <20150311.005220.1374468405510151934.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-10 15:52 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:01:36 +0100, Andreas Rohner wrote:
> This patch refactors nilfs_sufile_updatev() to take an array of
> arbitrary data structures instead of an array of segment numbers as
> input parameter. With this  change it is reusable for cases, where
> it is necessary to pass extra data to the update function. The only
> requirement for the data structures passed as input is, that they
> contain the segment number within the structure. By passing the
> offset to the segment number as another input parameter,
> nilfs_sufile_updatev() can be oblivious to the actual type of the
> input structures in the array.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/sufile.c | 79 ++++++++++++++++++++++++++++++++----------------------
>  fs/nilfs2/sufile.h | 39 ++++++++++++++-------------
>  2 files changed, 68 insertions(+), 50 deletions(-)
> 
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index 2a869c3..1e8cac6 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -138,14 +138,18 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile)
>  /**
>   * nilfs_sufile_updatev - modify multiple segment usages at a time
>   * @sufile: inode of segment usage file
> - * @segnumv: array of segment numbers
> - * @nsegs: size of @segnumv array
> + * @datav: array of segment numbers
> + * @datasz: size of elements in @datav
> + * @segoff: offset to segnum within the elements of @datav
> + * @ndata: size of @datav array
>   * @create: creation flag
>   * @ndone: place to store number of modified segments on @segnumv
>   * @dofunc: primitive operation for the update
>   *
>   * Description: nilfs_sufile_updatev() repeatedly calls @dofunc
> - * against the given array of segments.  The @dofunc is called with
> + * against the given array of data elements. Every data element has
> + * to contain a valid segment number and @segoff should be the offset
> + * to that within the data structure. The @dofunc is called with
>   * buffers of a header block and the sufile block in which the target
>   * segment usage entry is contained.  If @ndone is given, the number
>   * of successfully modified segments from the head is stored in the
> @@ -163,50 +167,55 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile)
>   *
>   * %-EINVAL - Invalid segment usage number
>   */
> -int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs,
> -			 int create, size_t *ndone,
> -			 void (*dofunc)(struct inode *, __u64,
> +int nilfs_sufile_updatev(struct inode *sufile, void *datav, size_t datasz,
> +			 size_t segoff, size_t ndata, int create,
> +			 size_t *ndone,
> +			 void (*dofunc)(struct inode *, void *,
>  					struct buffer_head *,
>  					struct buffer_head *))

Passing a byte offset into the data, like segoff, is nasty.

Please consider defining a template structure and its variation:

struct nilfs_sufile_update_data {
       __u64 segnum;
       /* Optional data comes after segnum */
};

/**
 * struct nilfs_sufile_update_count - data type of nilfs_sufile_do_xxx
 * @segnum: segment number
 * @nadd: additional value to a counter
 * Description: This structure derives from nilfs_sufile_update_data
 * struct.
 */
struct nilfs_sufile_update_count {
       __u64 segnum;
       __u64 nadd;
};

int nilfs_sufile_updatev(struct inode *sufile,
			 struct nilfs_sufile_update_data *datav,
			 size_t datasz,
			 size_t ndata, int create, size_t *ndone,
                         void (*dofunc)(struct inode *,
					struct nilfs_sufile_update_data *,
					struct buffer_head *,
					struct buffer_head *))
{
	...
}

If you need to define segnum in the middle of the structure, you can use
container_of():

Example:

struct nilfs_sufile_update_xxx {
       __u32 item_a;
       __u32 item_b;
       struct nilfs_sufile_update_data u_data;
};

static inline struct nilfs_sufile_update_xxx *
NILFS_SU_UPDATE_XXX(struct nilfs_sufile_update_data *data)
{
	return container_of(data, struct nilfs_sufile_update_xxx, u_data);
}

void nilfs_sufile_do_xxx(...)
{
   struct nilfs_sufile_update_xxx *xxx;

   xxx = NILFS_SU_UPDATE_XXX(data);
   ...
}
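
For illustration only, here is a minimal userspace sketch of the
container_of() variant. The macro below is a simplified stand-in for the
kernel's container_of(), and all structure and function names are
hypothetical; none of them are taken from the patch set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical base type: only carries the segment number. */
struct su_update_data {
	uint64_t segnum;
};

/* Hypothetical derived type with the base embedded after other members. */
struct su_update_xxx {
	uint32_t item_a;
	uint32_t item_b;
	struct su_update_data u_data;	/* deliberately not at offset 0 */
};

static_assert(offsetof(struct su_update_xxx, u_data) != 0,
	      "example requires the base to sit in the middle");

/* Recover the enclosing structure from a pointer to the embedded base. */
static struct su_update_xxx *to_su_update_xxx(struct su_update_data *data)
{
	return container_of(data, struct su_update_xxx, u_data);
}

/* A callback only sees the base pointer, but can reach the extra fields. */
static uint32_t su_sum_items(struct su_update_data *data)
{
	struct su_update_xxx *xxx = to_su_update_xxx(data);

	return xxx->item_a + xxx->item_b;
}
```

The caller passes &xxx.u_data to the generic code, and the callback
recovers the enclosing object without any byte offset in the interface.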

I believe the former technique is enough in your case (you can
assume that segnum is always the first member of data, right?).
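
A minimal userspace sketch of the former technique (segnum as the first
member, elements walked with a stride of datasz bytes) could look like
the following. The names upd_data, upd_count, upd_updatev and
upd_do_count are hypothetical, and the locking and buffer handling of the
real nilfs_sufile_updatev() are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical template structure: segnum must stay the first member. */
struct upd_data {
	uint64_t segnum;
};

/* Hypothetical variation that appends a counter delta after segnum. */
struct upd_count {
	uint64_t segnum;
	uint64_t nadd;
};

/* The layout assumption this technique depends on. */
static_assert(offsetof(struct upd_count, segnum) == 0,
	      "segnum must be the first member");

typedef void (*upd_fn)(struct upd_data *data, void *ctx);

/*
 * Call fn on each of the ndata elements of datav, where every element is
 * datasz bytes wide. Returns the number of elements visited.
 */
static size_t upd_updatev(struct upd_data *datav, size_t datasz,
			  size_t ndata, upd_fn fn, void *ctx)
{
	unsigned char *p = (unsigned char *)datav;
	size_t n;

	for (n = 0; n < ndata; n++, p += datasz)
		fn((struct upd_data *)p, ctx);
	return n;
}

/* Example callback: it knows the concrete type and casts back to it. */
static void upd_do_count(struct upd_data *data, void *ctx)
{
	struct upd_count *c = (struct upd_count *)data;
	uint64_t *totals = ctx;	/* per-segment totals, indexed by segnum */

	totals[c->segnum] += c->nadd;
}
```

The generic walker never needs to know the element type; it only relies
on datasz and on segnum being at offset 0, so no cast of the callback
pointer is required.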


>  {
>  	struct buffer_head *header_bh, *bh;
>  	unsigned long blkoff, prev_blkoff;
>  	__u64 *seg;
> -	size_t nerr = 0, n = 0;
> +	void *data, *dataend = datav + ndata * datasz;
> +	size_t n = 0;
>  	int ret = 0;
>  
> -	if (unlikely(nsegs == 0))
> +	if (unlikely(ndata == 0))
>  		goto out;
>  
> -	down_write(&NILFS_MDT(sufile)->mi_sem);
> -	for (seg = segnumv; seg < segnumv + nsegs; seg++) {
> +
> +	for (data = datav; data < dataend; data += datasz) {
> +		seg = data + segoff;
>  		if (unlikely(*seg >= nilfs_sufile_get_nsegments(sufile))) {
>  			printk(KERN_WARNING
>  			       "%s: invalid segment number: %llu\n", __func__,
>  			       (unsigned long long)*seg);
> -			nerr++;
> +			ret = -EINVAL;
> +			goto out;
>  		}
>  	}
> -	if (nerr > 0) {
> -		ret = -EINVAL;
> -		goto out_sem;
> -	}
>  
> +	down_write(&NILFS_MDT(sufile)->mi_sem);
>  	ret = nilfs_sufile_get_header_block(sufile, &header_bh);
>  	if (ret < 0)
>  		goto out_sem;
>  
> -	seg = segnumv;
> +	data = datav;
> +	seg = data + segoff;
>  	blkoff = nilfs_sufile_get_blkoff(sufile, *seg);
>  	ret = nilfs_mdt_get_block(sufile, blkoff, create, NULL, &bh);
>  	if (ret < 0)
>  		goto out_header;
>  
>  	for (;;) {
> -		dofunc(sufile, *seg, header_bh, bh);
> +		dofunc(sufile, data, header_bh, bh);
>  
> -		if (++seg >= segnumv + nsegs)
> +		++n;
> +		data += datasz;
> +		if (data >= dataend)
>  			break;
> +		seg = data + segoff;
>  		prev_blkoff = blkoff;
>  		blkoff = nilfs_sufile_get_blkoff(sufile, *seg);
>  		if (blkoff == prev_blkoff)
> @@ -220,28 +229,30 @@ int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs,
>  	}
>  	brelse(bh);
>  
> - out_header:
> -	n = seg - segnumv;
> +out_header:
>  	brelse(header_bh);
> - out_sem:
> +out_sem:
>  	up_write(&NILFS_MDT(sufile)->mi_sem);
> - out:
> +out:
>  	if (ndone)
>  		*ndone = n;
>  	return ret;
>  }
>  
> -int nilfs_sufile_update(struct inode *sufile, __u64 segnum, int create,
> -			void (*dofunc)(struct inode *, __u64,
> +int nilfs_sufile_update(struct inode *sufile, void *data, size_t segoff,
> +			int create,
> +			void (*dofunc)(struct inode *, void *,
>  				       struct buffer_head *,
>  				       struct buffer_head *))

ditto.

>  {
>  	struct buffer_head *header_bh, *bh;
> +	__u64 *seg;
>  	int ret;
>  
> -	if (unlikely(segnum >= nilfs_sufile_get_nsegments(sufile))) {
> +	seg = data + segoff;
> +	if (unlikely(*seg >= nilfs_sufile_get_nsegments(sufile))) {
>  		printk(KERN_WARNING "%s: invalid segment number: %llu\n",
> -		       __func__, (unsigned long long)segnum);
> +		       __func__, (unsigned long long)*seg);
>  		return -EINVAL;
>  	}

You can remove these nasty changes.

>  	down_write(&NILFS_MDT(sufile)->mi_sem);
> @@ -250,9 +261,9 @@ int nilfs_sufile_update(struct inode *sufile, __u64 segnum, int create,
>  	if (ret < 0)
>  		goto out_sem;
>  
> -	ret = nilfs_sufile_get_segment_usage_block(sufile, segnum, create, &bh);
> +	ret = nilfs_sufile_get_segment_usage_block(sufile, *seg, create, &bh);

ditto.

>  	if (!ret) {
> -		dofunc(sufile, segnum, header_bh, bh);
> +		dofunc(sufile, data, header_bh, bh);
>  		brelse(bh);
>  	}
>  	brelse(header_bh);
> @@ -406,12 +417,13 @@ int nilfs_sufile_alloc(struct inode *sufile, __u64 *segnump)
>  	return ret;
>  }
>  
> -void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum,
> +void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 *data,
>  				 struct buffer_head *header_bh,
>  				 struct buffer_head *su_bh)
>  {
>  	struct nilfs_segment_usage *su;
>  	void *kaddr;
> +	__u64 segnum = *data;
>  
>  	kaddr = kmap_atomic(su_bh->b_page);
>  	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
> @@ -431,13 +443,14 @@ void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum,
>  	nilfs_mdt_mark_dirty(sufile);
>  }
>  
> -void nilfs_sufile_do_scrap(struct inode *sufile, __u64 segnum,
> +void nilfs_sufile_do_scrap(struct inode *sufile, __u64 *data,
>  			   struct buffer_head *header_bh,
>  			   struct buffer_head *su_bh)
>  {
>  	struct nilfs_segment_usage *su;
>  	void *kaddr;
>  	int clean, dirty;
> +	__u64 segnum = *data;

This can be converted as follows:

        __u64 segnum = data->segnum;

>  
>  	kaddr = kmap_atomic(su_bh->b_page);
>  	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
> @@ -462,13 +475,14 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 segnum,
>  	nilfs_mdt_mark_dirty(sufile);
>  }
>  
> -void nilfs_sufile_do_free(struct inode *sufile, __u64 segnum,
> +void nilfs_sufile_do_free(struct inode *sufile, __u64 *data,
>  			  struct buffer_head *header_bh,
>  			  struct buffer_head *su_bh)
>  {
>  	struct nilfs_segment_usage *su;
>  	void *kaddr;
>  	int sudirty;
> +	__u64 segnum = *data;
>  
>  	kaddr = kmap_atomic(su_bh->b_page);
>  	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
> @@ -596,13 +610,14 @@ int nilfs_sufile_get_stat(struct inode *sufile, struct nilfs_sustat *sustat)
>  	return ret;
>  }
>  
> -void nilfs_sufile_do_set_error(struct inode *sufile, __u64 segnum,
> +void nilfs_sufile_do_set_error(struct inode *sufile, __u64 *data,
>  			       struct buffer_head *header_bh,
>  			       struct buffer_head *su_bh)
>  {
>  	struct nilfs_segment_usage *su;
>  	void *kaddr;
>  	int suclean;
> +	__u64 segnum = *data;
>  
>  	kaddr = kmap_atomic(su_bh->b_page);
>  	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index b8afd72..2df6c71 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -46,21 +46,21 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
>  				size_t);
>  ssize_t nilfs_sufile_set_suinfo(struct inode *, void *, unsigned , size_t);
>  
> -int nilfs_sufile_updatev(struct inode *, __u64 *, size_t, int, size_t *,
> -			 void (*dofunc)(struct inode *, __u64,
> -					struct buffer_head *,
> -					struct buffer_head *));
> -int nilfs_sufile_update(struct inode *, __u64, int,
> -			void (*dofunc)(struct inode *, __u64,
> +int nilfs_sufile_updatev(struct inode *, void *, size_t, size_t, size_t, int,
> +			 size_t *, void (*dofunc)(struct inode *, void *,
> +						  struct buffer_head *,
> +						  struct buffer_head *));
> +int nilfs_sufile_update(struct inode *, void *, size_t, int,
> +			void (*dofunc)(struct inode *, void *,
>  				       struct buffer_head *,
>  				       struct buffer_head *));

> -void nilfs_sufile_do_scrap(struct inode *, __u64, struct buffer_head *,
> +void nilfs_sufile_do_scrap(struct inode *, __u64 *, struct buffer_head *,
>  			   struct buffer_head *);
> -void nilfs_sufile_do_free(struct inode *, __u64, struct buffer_head *,
> +void nilfs_sufile_do_free(struct inode *, __u64 *, struct buffer_head *,
>  			  struct buffer_head *);
> -void nilfs_sufile_do_cancel_free(struct inode *, __u64, struct buffer_head *,
> +void nilfs_sufile_do_cancel_free(struct inode *, __u64 *, struct buffer_head *,
>  				 struct buffer_head *);
> -void nilfs_sufile_do_set_error(struct inode *, __u64, struct buffer_head *,
> +void nilfs_sufile_do_set_error(struct inode *, __u64 *, struct buffer_head *,
>  			       struct buffer_head *);

Please use the "struct nilfs_sufile_update_data *" type for the second
argument of these declarations.

>  
>  int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
> @@ -75,7 +75,8 @@ int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
>   */
>  static inline int nilfs_sufile_scrap(struct inode *sufile, __u64 segnum)
>  {
> -	return nilfs_sufile_update(sufile, segnum, 1, nilfs_sufile_do_scrap);
> +	return nilfs_sufile_update(sufile, &segnum, 0, 1,
> +				   (void *)nilfs_sufile_do_scrap);
>  }

Then you can avoid this nasty (void *) cast to the callback function.

>  
>  /**
> @@ -85,7 +86,8 @@ static inline int nilfs_sufile_scrap(struct inode *sufile, __u64 segnum)
>   */
>  static inline int nilfs_sufile_free(struct inode *sufile, __u64 segnum)
>  {
> -	return nilfs_sufile_update(sufile, segnum, 0, nilfs_sufile_do_free);
> +	return nilfs_sufile_update(sufile, &segnum, 0, 0,
> +				   (void *)nilfs_sufile_do_free);
>  }

ditto

>  /**
> @@ -98,8 +100,8 @@ static inline int nilfs_sufile_free(struct inode *sufile, __u64 segnum)
>  static inline int nilfs_sufile_freev(struct inode *sufile, __u64 *segnumv,
>  				     size_t nsegs, size_t *ndone)
>  {
> -	return nilfs_sufile_updatev(sufile, segnumv, nsegs, 0, ndone,
> -				    nilfs_sufile_do_free);
> +	return nilfs_sufile_updatev(sufile, segnumv, sizeof(__u64), 0, nsegs,
> +				    0, ndone, (void *)nilfs_sufile_do_free);
>  }

ditto

>  /**
> @@ -116,8 +118,9 @@ static inline int nilfs_sufile_cancel_freev(struct inode *sufile,
>  					    __u64 *segnumv, size_t nsegs,
>  					    size_t *ndone)
>  {
> -	return nilfs_sufile_updatev(sufile, segnumv, nsegs, 0, ndone,
> -				    nilfs_sufile_do_cancel_free);
> +	return nilfs_sufile_updatev(sufile, segnumv, sizeof(__u64), 0, nsegs,
> +				    0, ndone,
> +				    (void *)nilfs_sufile_do_cancel_free);
>  }

ditto

>  /**
> @@ -139,8 +142,8 @@ static inline int nilfs_sufile_cancel_freev(struct inode *sufile,
>   */
>  static inline int nilfs_sufile_set_error(struct inode *sufile, __u64 segnum)
>  {
> -	return nilfs_sufile_update(sufile, segnum, 0,
> -				   nilfs_sufile_do_set_error);
> +	return nilfs_sufile_update(sufile, &segnum, 0, 0,
> +				   (void *)nilfs_sufile_do_set_error);
>  }
>  
>  #endif	/* _NILFS_SUFILE_H */

ditto


Regards,
Ryusuke Konishi

> -- 
> 2.3.0
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
       [not found]         ` <20150310.142119.813265940569588216.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-03-10 20:37           ` Andreas Rohner
       [not found]             ` <54FF561E.7030409-hi6Y0CQ0nG0@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-03-10 20:37 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi Ryusuke,

Thanks for your thorough review.

On 2015-03-10 06:21, Ryusuke Konishi wrote:
> Hi Andreas,
> 
> I looked through all of the kernel patches and part of the util patches.
> Overall comments are as follows:
> 
> [Algorithm]
> As for algorithm, it looks about OK except for the starvation
> countermeasure.  The starvation countermeasure looks ad hoc/hacky, but
> it's good that it doesn't change the kernel/userland interface; we may be
> able to replace it with a better approach in the future or in a revised
> version of this patchset.
> 
> (1) Drawback of the starvation countermeasure
>     The patch 9/9 looks to make the execution time of chcp operation
>     worse since it will scan through sufile to modify live block
>     counters.  How much does it prolong the execution time ?

I'll do some tests, but I haven't noticed any significant performance
drop. The GC basically does the same thing, every time it selects
segments to reclaim.
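
To make the scan concrete, here is a rough userspace sketch of the
heuristic as described in the cover letter: on snapshot deletion, every
entry whose snapshot-block count exceeds half the maximum gets its live
block count reduced to half the maximum. The structure, field names and
function name are hypothetical, and the check that the live count is
actually above half is my assumption about what "reduce" means; this is
not the code from patch 9/9.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of one segment usage entry. */
struct seg_usage {
	uint32_t nlive_blks;		/* estimated number of live blocks */
	uint32_t nsnapshot_blks;	/* blocks last seen protected by a snapshot */
};

/*
 * Walk all entries after a snapshot was deleted. Entries that were mostly
 * protected by snapshots get their live block estimate lowered to half of
 * max_blks so the GC considers them again. Returns the number of entries
 * that were changed.
 */
static size_t fix_starving_segments(struct seg_usage *su, size_t nsegs,
				    uint32_t max_blks)
{
	uint32_t half = max_blks / 2;
	size_t i, changed = 0;

	assert(su != NULL || nsegs == 0);

	for (i = 0; i < nsegs; i++) {
		/* Assumption: only ever lower the estimate, never raise it. */
		if (su[i].nsnapshot_blks > half && su[i].nlive_blks > half) {
			su[i].nlive_blks = half;
			changed++;
		}
	}
	return changed;
}
```

The cost is one linear pass over the SUFILE entries per snapshot
deletion, which is the overhead in question for chcp.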

>     In a use case of nilfs, many snapshots are created and they are
>     automatically changed back to plain checkpoints because old
>     snapshots are thinned out over time.  The patch 9/9 may impact on
>     such usage.
>
> (2) Compatibility
>     What will happen in the following case:
>     1. Create a file system, use it with the new module, and
>        create snapshots.
>     2. Mount it with an old module, and release snapshot with "chcp cp"
>     3. Mount it with the new module, and cleanerd runs gc with
>        cost benefit or greedy policy.

Some segments could be subject to starvation. But it would probably only
affect a small number of segments and it could be fixed by "chcp ss
<CP>; chcp cp <CP>".

> (3) Durability against unexpected power failures (just a note)
>     The current patchset does not appear to cause a starvation issue even
>     when an unexpected power failure occurs during or after executing
>     "chcp cp", because nilfs_ioctl_change_cpmode() makes its changes in a
>     transactional way with nilfs_transaction_begin/commit.
>     We should always consider this kind of situation to keep consistency.
> 
> [Coding Style]
> (4) This patchset has several coding style issues. Please fix them and
>     re-check with the latest checkpatch script (scripts/checkpatch.pl).

I'll fix that. Sorry.

> patch 2:
> WARNING: Prefer kmalloc_array over kmalloc with multiply
> #85: FILE: fs/nilfs2/sufile.c:1192:
> +    mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),
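
For context, the point of this warning is that an unchecked
"count * size" can overflow and yield a buffer that is too small. A
userspace analogue of what kmalloc_array() guards against might look
like the sketch below; alloc_array() is a hypothetical helper, not a
kernel or libc function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Allocate room for n elements of the given size. Returns NULL instead of
 * a too-small buffer when n * size would overflow size_t.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* multiplication would overflow */
	return malloc(n * size);
}
```

In the kernel, the equivalent fix for the line flagged above would
presumably be to pass the element count and the element size to
kmalloc_array() as separate arguments.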
> 
> patch 5,6:
> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
> #60: 
> the same semaphore has to be aquired. So if the DAT-Entry belongs to
> 
> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
> #46: 
> be aquired, which blocks the entire SUFILE and effectively turns
> 
> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
> #53: 
> afore mentioned lock only needs to be aquired, if the cache is full
> 
> (5) sub_sizeof macro:
>     The same definition exists as offsetofend() in vfio.h,
>     and a patch to move it to stddef.h is now proposed.
> 
>     Please use the same name, and redefine it only if it's not
>     defined:
> 
> #ifndef offsetofend
> #define offsetofend(TYPE, MEMBER) \
>         (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
> #endif

Ok I'll change that.
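
As a small illustration of why offsetofend() is handy for an on-disk
format that grows over time, consider the sketch below. The structure
and its nlive_blks field are hypothetical stand-ins, not the actual
layout from patch 3.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef offsetofend
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
#endif

/* Hypothetical extended record: the last two fields were added later. */
struct usage_ext {
	uint64_t lastmod;
	uint32_t nblocks;
	uint32_t flags;
	uint32_t nlive_blks;	/* hypothetical new field */
	uint32_t pad;
};

/* The old format ends exactly where the new field begins. */
static_assert(offsetofend(struct usage_ext, flags) ==
	      offsetof(struct usage_ext, nlive_blks),
	      "new field must directly follow the old layout");

/* Return 1 if an entry of entry_size bytes is big enough for nlive_blks. */
static int has_nlive_blks(size_t entry_size)
{
	return entry_size >= offsetofend(struct usage_ext, nlive_blks);
}
```

A reader of the file can then decide per field whether the entry size
recorded on disk covers it, instead of comparing against the full
sizeof() of the newest structure.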

> [Implementation]
> (6) b_blocknr
>     Please do not use bh->b_blocknr to store disk block number.  This
>     field is used to keep virtual block number except for DAT files.
>     It is only replaced to an actual block number during calling
>     submit_bh().  Keep this policy.

As far as I can tell, this is only true for blocks of GC inodes and node
blocks. All other buffer_heads are always mapped to on disk blocks by
nilfs_get_block(). I only added the mapping in nilfs_segbuf_submit_bh()
to correctly set the value in b_blocknr to the new location.

>     In segment constructor context, you can calculate the disk block
>     number from the start disk address of the segment and the block
>     index (offset) in the segment.

If I understand you correctly, this approach would give me the on disk
location inside of the segment that is currently constructed. But I need
to know the previous on disk location of the buffer_head. I have to
decrement the counter for the previous segment.

> (7) sufile mod cache
>     Consider gathering the cache into the nilfs_sufile_info struct
>     instead of passing it as an argument through the bmap/sufile/dat
>     interface functions.  The current approach is hacky, decreases
>     readability, and bloats the changes of this patchset across
>     multiple function blocks.

If I use a global structure, I have to protect it with a lock. Since
almost any operation has to modify the counters in the SUFILE, this
would serialize the whole file system.

>     The cache should be well designed. It's important to balance the
>     performance and locality/transparency of the feature.  For
>     instance, it can be implemented with radix-tree of objects in
>     which each object has a vector of 2^k cache entries.

I'll look into that.

>     I think the cache should be written back to the sufile buffers
>     only within segment construction context. At least, it should be
>     written back in the context in which a transaction lock is held.
> 
>     In addition, introducing a new bmap lock dependency,
>     nilfs_sufile_lock_key, is undesirable. You should avoid it
>     by delaying the writeback of cache entries to sufile.

The cache could end up using a lot of memory. In the worst case, it
would need one entry per block.

> (8) Changes to the sufile must be finished before dirty buffer
>     collection of sufile.
>     All mark_buffer_dirty() calls to sufile must be finished
>     before or in NILFS_ST_SUFILE stage of nilfs_segctor_collect_blocks().
> 
>     (You can write fixed figures to sufile after the collection phase
>      of sufile by marking the buffers dirty in advance, before the
>      collection phase.)
>
>     In the current patchset, sufile mod cache can be flushed in
>     nilfs_segctor_update_payload_blocknr(), which comes after the
>     dirty buffer collection phase.

This is a hard problem. I have to count the blocks added in the
NILFS_ST_DAT stage, so I don't know which SUFILE blocks I have to mark
in advance. I'll have to think about this.

> (9) cpfile should also be excluded from the dead block counting, like sufile
>     cpfile is always changed and written back along with sufile and dat.
>     So, cpfile must be excluded from the dead block counting.
>     Otherwise, a sufile change can trigger cpfile changes, which in turn
>     trigger sufile changes.

I don't quite understand your example. How exactly can a sufile change
trigger a cpfile change and how can this turn into an infinite loop?

Thanks,
Andreas Rohner

>     This also helps to simplify nilfs_dat_commit_end(), to which the
>     patchset added two arguments for the dead block counting.
>     I mean, the "dead" argument and the "count_blocks" argument can be
>     unified by changing the meaning of the "dead" argument.
> 
> 
> I will add detail comments for patches tonight or another day.
> 
> Regards,
> Ryusuke Konishi
> 
> On Wed, 25 Feb 2015 09:18:04 +0900 (JST), Ryusuke Konishi wrote:
>> Hi Andreas,
>>
>> Thank you for posting this proposal!
>>
>> I would like to have time to review this series through, but please
>> wait for several days. (This week I'm quite busy until weekend)
>>
>> Thanks,
>> Ryusuke Konishi
>>
>> On Tue, 24 Feb 2015 20:01:35 +0100, Andreas Rohner wrote:
>>> Hi everyone!
>>>
>>> One of the biggest performance problems of NILFS is its
>>> inefficient Timestamp GC policy. This patch set introduces two new GC
>>> policies, namely Cost-Benefit and Greedy.
>>>
>>> The Cost-Benefit policy is nothing new. It has been around for a long
>>> time with log-structured file systems [1]. But it relies on accurate
>>> information, about the number of live blocks in a segment. NILFS
>>> currently does not provide the necessary information. So this patch set
>>> extends the entries in the SUFILE to include a counter for the number of
>>> live blocks. This counter is decremented whenever a file is deleted or
>>> overwritten.
>>>
>>> Except for some tricky parts, the counting of live blocks is quite
>>> trivial. The problem is snapshots. At any time, a checkpoint can be
>>> turned into a snapshot or vice versa. So blocks that are reclaimable at
>>> one point in time, are protected by a snapshot a moment later.
>>>
>>> This patch set does not try to track snapshots at all. Instead it uses a
>>> heuristic approach to prevent the worst case scenario. The performance
>>> is still significantly better than timestamp for my benchmarks.
>>>
>>> The worst case scenario is, the following:
>>>
>>> 1. Segment 1 is written
>>> 2. Snapshot is created
>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>>    by the Snapshot. The GC has to set the number of live blocks
>>>    to maximum to avoid reclaiming this Segment again in the near future.
>>> 4. Snapshot is deleted
>>> 5. Segment 1 is reclaimable, but its counter is so high, that the GC
>>>    will never try to reclaim it again.
>>>
>>> To prevent this kind of starvation I use another field in the SUFILE
>>> entry, to store the number of blocks that are protected by a snapshot.
>>> This value is just a heuristic and it is usually set to 0. Only if the
>>> GC reclaims a segment, it is written to the SUFILE entry. The GC has to
>>> check for snapshots anyway, so we get this information for free. By
>>> storing this information in the SUFILE we can avoid starvation in the
>>> following way:
>>>
>>> 1. Segment 1 is written
>>> 2. Snapshot is created
>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>>    by the Snapshot. The GC has to set the number of live blocks
>>>    to maximum to avoid reclaiming this Segment again in the near future.
>>> 4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
>>>    entry
>>> 5. Snapshot is deleted
>>> 6. On Snapshot deletion we walk through every entry in the SUFILE and
>>>    reduce the number of live blocks to half, if the number of snapshot
>>>    blocks is bigger than half of the maximum.
>>> 7. Segment 1 is reclaimable and the number of live blocks entry is at
>>>    half the maximum. The GC will try to reclaim this segment as soon as
>>>    there are no other better choices.
>>>
>>> BENCHMARKS:
>>> -----------
>>>
>>> My benchmark is quite simple. It consists of a process, that replays
>>> real NFS traces at a faster speed. It thereby creates relatively
>>> realistic patterns of file creation and deletions. At the same time
>>> multiple snapshots are created and deleted in parallel. I use a 100GB
>>> partition of a Samsung SSD:
>>>
>>> WITH SNAPSHOTS EVERY 5 MINUTES:
>>> --------------------------------------------------------------------
>>>                 Execution time       Wear (Data written to disk)
>>> Timestamp:      100%                 100%
>>> Cost-Benefit:   80%                  43%
>>>
>>> NO SNAPSHOTS:
>>> ---------------------------------------------------------------------
>>>                 Execution time       Wear (Data written to disk)
>>> Timestamp:      100%                 100%
>>> Cost-Benefit:   70%                  45%
>>>
>>> I plan on adding more benchmark results soon.
>>>
>>> Best regards,
>>> Andreas Rohner
>>>
>>> [1] Mendel Rosenblum and John K. Ousterhout. The design and implementa-
>>>     tion of a log-structured file system. ACM Trans. Comput. Syst.,
>>>     10(1):26–52, February 1992.
>>>
>>> Andreas Rohner (9):
>>>   nilfs2: refactor nilfs_sufile_updatev()
>>>   nilfs2: add simple cache for modifications to SUFILE
>>>   nilfs2: extend SUFILE on-disk format to enable counting of live blocks
>>>   nilfs2: add function to modify su_nlive_blks
>>>   nilfs2: add simple tracking of block deletions and updates
>>>   nilfs2: use modification cache to improve performance
>>>   nilfs2: add additional flags for nilfs_vdesc
>>>   nilfs2: improve accuracy and correct for invalid GC values
>>>   nilfs2: prevent starvation of segments protected by snapshots
>>>
>>>  fs/nilfs2/bmap.c          |  84 +++++++-
>>>  fs/nilfs2/bmap.h          |  14 +-
>>>  fs/nilfs2/btree.c         |   4 +-
>>>  fs/nilfs2/cpfile.c        |   5 +
>>>  fs/nilfs2/dat.c           |  95 ++++++++-
>>>  fs/nilfs2/dat.h           |   8 +-
>>>  fs/nilfs2/direct.c        |   4 +-
>>>  fs/nilfs2/inode.c         |  24 ++-
>>>  fs/nilfs2/ioctl.c         |  27 ++-
>>>  fs/nilfs2/mdt.c           |   5 +-
>>>  fs/nilfs2/page.h          |   6 +-
>>>  fs/nilfs2/segbuf.c        |   6 +
>>>  fs/nilfs2/segbuf.h        |   3 +
>>>  fs/nilfs2/segment.c       | 155 +++++++++++++-
>>>  fs/nilfs2/segment.h       |   3 +
>>>  fs/nilfs2/sufile.c        | 533 +++++++++++++++++++++++++++++++++++++++++++---
>>>  fs/nilfs2/sufile.h        |  97 +++++++--
>>>  fs/nilfs2/the_nilfs.c     |   4 +
>>>  fs/nilfs2/the_nilfs.h     |  23 ++
>>>  include/linux/nilfs2_fs.h | 122 ++++++++++-
>>>  20 files changed, 1126 insertions(+), 96 deletions(-)
>>>
>>> -- 
>>> 2.3.0
>>>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev()
       [not found]         ` <20150311.005220.1374468405510151934.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-03-10 20:40           ` Andreas Rohner
  0 siblings, 0 replies; 36+ messages in thread
From: Andreas Rohner @ 2015-03-10 20:40 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-03-10 16:52, Ryusuke Konishi wrote:
> On Tue, 24 Feb 2015 20:01:36 +0100, Andreas Rohner wrote:
>> This patch refactors nilfs_sufile_updatev() to take an array of
>> arbitrary data structures instead of an array of segment numbers as
>> input parameter. With this  change it is reusable for cases, where
>> it is necessary to pass extra data to the update function. The only
>> requirement for the data structures passed as input is, that they
>> contain the segment number within the structure. By passing the
>> offset to the segment number as another input parameter,
>> nilfs_sufile_updatev() can be oblivious to the actual type of the
>> input structures in the array.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/sufile.c | 79 ++++++++++++++++++++++++++++++++----------------------
>>  fs/nilfs2/sufile.h | 39 ++++++++++++++-------------
>>  2 files changed, 68 insertions(+), 50 deletions(-)
>>
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 2a869c3..1e8cac6 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -138,14 +138,18 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile)
>>  /**
>>   * nilfs_sufile_updatev - modify multiple segment usages at a time
>>   * @sufile: inode of segment usage file
>> - * @segnumv: array of segment numbers
>> - * @nsegs: size of @segnumv array
>> + * @datav: array of segment numbers
>> + * @datasz: size of elements in @datav
>> + * @segoff: offset to segnum within the elements of @datav
>> + * @ndata: size of @datav array
>>   * @create: creation flag
>>   * @ndone: place to store number of modified segments on @segnumv
>>   * @dofunc: primitive operation for the update
>>   *
>>   * Description: nilfs_sufile_updatev() repeatedly calls @dofunc
>> - * against the given array of segments.  The @dofunc is called with
>> + * against the given array of data elements. Every data element has
>> + * to contain a valid segment number and @segoff should be the offset
>> + * to that within the data structure. The @dofunc is called with
>>   * buffers of a header block and the sufile block in which the target
>>   * segment usage entry is contained.  If @ndone is given, the number
>>   * of successfully modified segments from the head is stored in the
>> @@ -163,50 +167,55 @@ unsigned long nilfs_sufile_get_ncleansegs(struct inode *sufile)
>>   *
>>   * %-EINVAL - Invalid segment usage number
>>   */
>> -int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs,
>> -			 int create, size_t *ndone,
>> -			 void (*dofunc)(struct inode *, __u64,
>> +int nilfs_sufile_updatev(struct inode *sufile, void *datav, size_t datasz,
>> +			 size_t segoff, size_t ndata, int create,
>> +			 size_t *ndone,
>> +			 void (*dofunc)(struct inode *, void *,
>>  					struct buffer_head *,
>>  					struct buffer_head *))
> 
> Using a byte offset into the data, like segoff, is nasty.
> 
> Please consider defining a template structure and its variation:
> 
> struct nilfs_sufile_update_data {
>        __u64 segnum;
>        /* Optional data comes after segnum */
> };
> 
> /**
>  * struct nilfs_sufile_update_count - data type of nilfs_sufile_do_xxx
>  * @segnum: segment number
>  * @nadd: additional value to a counter
>  * Description: This structure derives from nilfs_sufile_update_data
>  * struct.
>  */
> struct nilfs_sufile_update_count {
>        __u64 segnum;
>        __u64 nadd;
> };
> 
> int nilfs_sufile_updatev(struct inode *sufile,
> 			 struct nilfs_sufile_update_data *datav,
> 			 size_t datasz,
> 			 size_t ndata, int create, size_t *ndone,
>                          void (*dofunc)(struct inode *,
> 					struct nilfs_sufile_update_data *,
> 					struct buffer_head *,
> 					struct buffer_head *))
> {
> 	...
> }

I agree this is a much better solution. I'll change it.

Regards,
Andreas Rohner
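
A minimal userspace sketch of the "template structure" idea suggested above, for illustration only. All names here (update_data, update_count, updatev, do_add) are hypothetical stand-ins, not the nilfs2 API. One deliberate difference from the proposal: the derived structure embeds the base structure as its first member instead of repeating segnum, which gives the same layout while keeping the pointer conversions well defined in ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical common prefix: every element starts with a segment number. */
struct update_data {
	uint64_t segnum;
};

/* Hypothetical derived element: the base comes first, extra data follows. */
struct update_count {
	struct update_data base;
	uint64_t nadd;
};

/* Callback type: only sees the common prefix. */
typedef void (*dofunc_t)(struct update_data *data, uint64_t *counters);

/* Example callback: recovers the derived type and adds nadd to a counter. */
static void do_add(struct update_data *data, uint64_t *counters)
{
	struct update_count *c = (struct update_count *)data;

	counters[c->base.segnum] += c->nadd;
}

/*
 * Walks ndata elements of datasz bytes each.  All segment numbers are
 * validated first, so nothing is modified if any element is invalid.
 * Returns 0 on success and -1 on an invalid segment number.
 */
static int updatev(struct update_data *datav, size_t datasz, size_t ndata,
		   uint64_t nsegs, uint64_t *counters, dofunc_t dofunc)
{
	char *p = (char *)datav;
	size_t i;

	for (i = 0; i < ndata; i++) {
		struct update_data *d = (struct update_data *)(p + i * datasz);

		if (d->segnum >= nsegs)
			return -1;
	}
	for (i = 0; i < ndata; i++)
		dofunc((struct update_data *)(p + i * datasz), counters);
	return 0;
}
```

With this shape the callback has a real type, so no offset parameter and no (void *) cast on the function pointer are needed; the element size is still passed so that the walker can step over elements it does not understand.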

> If you need to define segnum in the middle of the structure, you can use
> container_of():
> 
> Example:
> 
> struct nilfs_sufile_update_xxx {
>        __u32 item_a;
>        __u32 item_b;
>        struct nilfs_sufile_update_data u_data;
> };
> 
> static inline struct nilfs_sufile_update_xxx *
> NILFS_SU_UPDATE_XXX(struct nilfs_sufile_update_data *data)
> {
> 	return container_of(data, struct nilfs_sufile_update_xxx, u_data);
> }
> 
> void nilfs_sufile_do_xxx(...)
> {
>    struct nilfs_sufile_update_xxx *xxx;
> 
>    xxx = NILFS_SU_UPDATE_XXX(data);
>    ...
> }
> 
> I believe the former technique is enough in your case. (you can
> suppose that segnum is always the first member of data, right ?).
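
For readers unfamiliar with container_of(), here is a small userspace sketch of the second technique. The macro below is a simplified stand-in built on offsetof(); the kernel's version adds type checking. The structure and function names (update_data, update_xxx, to_update_xxx, do_xxx) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-in for the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical common part that carries the segment number. */
struct update_data {
	uint64_t segnum;
};

/* Hypothetical element with the common part in the middle or at the end. */
struct update_xxx {
	uint32_t item_a;
	uint32_t item_b;
	struct update_data u_data;
};

/* Maps a pointer to the embedded member back to the enclosing structure. */
static inline struct update_xxx *to_update_xxx(struct update_data *data)
{
	return container_of(data, struct update_xxx, u_data);
}

/* Example callback body: uses both the common part and the extra fields. */
static uint64_t do_xxx(struct update_data *data)
{
	struct update_xxx *xxx = to_update_xxx(data);

	return data->segnum + xxx->item_a + xxx->item_b;
}
```

The caller passes &elem.u_data, and the callback recovers the full element without the generic code knowing where the segment number lives.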
> 
> 
>>  {
>>  	struct buffer_head *header_bh, *bh;
>>  	unsigned long blkoff, prev_blkoff;
>>  	__u64 *seg;
>> -	size_t nerr = 0, n = 0;
>> +	void *data, *dataend = datav + ndata * datasz;
>> +	size_t n = 0;
>>  	int ret = 0;
>>  
>> -	if (unlikely(nsegs == 0))
>> +	if (unlikely(ndata == 0))
>>  		goto out;
>>  
>> -	down_write(&NILFS_MDT(sufile)->mi_sem);
>> -	for (seg = segnumv; seg < segnumv + nsegs; seg++) {
>> +
>> +	for (data = datav; data < dataend; data += datasz) {
>> +		seg = data + segoff;
>>  		if (unlikely(*seg >= nilfs_sufile_get_nsegments(sufile))) {
>>  			printk(KERN_WARNING
>>  			       "%s: invalid segment number: %llu\n", __func__,
>>  			       (unsigned long long)*seg);
>> -			nerr++;
>> +			ret = -EINVAL;
>> +			goto out;
>>  		}
>>  	}
>> -	if (nerr > 0) {
>> -		ret = -EINVAL;
>> -		goto out_sem;
>> -	}
>>  
>> +	down_write(&NILFS_MDT(sufile)->mi_sem);
>>  	ret = nilfs_sufile_get_header_block(sufile, &header_bh);
>>  	if (ret < 0)
>>  		goto out_sem;
>>  
>> -	seg = segnumv;
>> +	data = datav;
>> +	seg = data + segoff;
>>  	blkoff = nilfs_sufile_get_blkoff(sufile, *seg);
>>  	ret = nilfs_mdt_get_block(sufile, blkoff, create, NULL, &bh);
>>  	if (ret < 0)
>>  		goto out_header;
>>  
>>  	for (;;) {
>> -		dofunc(sufile, *seg, header_bh, bh);
>> +		dofunc(sufile, data, header_bh, bh);
>>  
>> -		if (++seg >= segnumv + nsegs)
>> +		++n;
>> +		data += datasz;
>> +		if (data >= dataend)
>>  			break;
>> +		seg = data + segoff;
>>  		prev_blkoff = blkoff;
>>  		blkoff = nilfs_sufile_get_blkoff(sufile, *seg);
>>  		if (blkoff == prev_blkoff)
>> @@ -220,28 +229,30 @@ int nilfs_sufile_updatev(struct inode *sufile, __u64 *segnumv, size_t nsegs,
>>  	}
>>  	brelse(bh);
>>  
>> - out_header:
>> -	n = seg - segnumv;
>> +out_header:
>>  	brelse(header_bh);
>> - out_sem:
>> +out_sem:
>>  	up_write(&NILFS_MDT(sufile)->mi_sem);
>> - out:
>> +out:
>>  	if (ndone)
>>  		*ndone = n;
>>  	return ret;
>>  }
>>  
>> -int nilfs_sufile_update(struct inode *sufile, __u64 segnum, int create,
>> -			void (*dofunc)(struct inode *, __u64,
>> +int nilfs_sufile_update(struct inode *sufile, void *data, size_t segoff,
>> +			int create,
>> +			void (*dofunc)(struct inode *, void *,
>>  				       struct buffer_head *,
>>  				       struct buffer_head *))
> 
> ditto.
> 
>>  {
>>  	struct buffer_head *header_bh, *bh;
>> +	__u64 *seg;
>>  	int ret;
>>  
>> -	if (unlikely(segnum >= nilfs_sufile_get_nsegments(sufile))) {
>> +	seg = data + segoff;
>> +	if (unlikely(*seg >= nilfs_sufile_get_nsegments(sufile))) {
>>  		printk(KERN_WARNING "%s: invalid segment number: %llu\n",
>> -		       __func__, (unsigned long long)segnum);
>> +		       __func__, (unsigned long long)*seg);
>>  		return -EINVAL;
>>  	}
> 
> You can remove these nasty changes.
> 
>>  	down_write(&NILFS_MDT(sufile)->mi_sem);
>> @@ -250,9 +261,9 @@ int nilfs_sufile_update(struct inode *sufile, __u64 segnum, int create,
>>  	if (ret < 0)
>>  		goto out_sem;
>>  
>> -	ret = nilfs_sufile_get_segment_usage_block(sufile, segnum, create, &bh);
>> +	ret = nilfs_sufile_get_segment_usage_block(sufile, *seg, create, &bh);
> 
> ditto.
> 
>>  	if (!ret) {
>> -		dofunc(sufile, segnum, header_bh, bh);
>> +		dofunc(sufile, data, header_bh, bh);
>>  		brelse(bh);
>>  	}
>>  	brelse(header_bh);
>> @@ -406,12 +417,13 @@ int nilfs_sufile_alloc(struct inode *sufile, __u64 *segnump)
>>  	return ret;
>>  }
>>  
>> -void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum,
>> +void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 *data,
>>  				 struct buffer_head *header_bh,
>>  				 struct buffer_head *su_bh)
>>  {
>>  	struct nilfs_segment_usage *su;
>>  	void *kaddr;
>> +	__u64 segnum = *data;
>>  
>>  	kaddr = kmap_atomic(su_bh->b_page);
>>  	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
>> @@ -431,13 +443,14 @@ void nilfs_sufile_do_cancel_free(struct inode *sufile, __u64 segnum,
>>  	nilfs_mdt_mark_dirty(sufile);
>>  }
>>  
>> -void nilfs_sufile_do_scrap(struct inode *sufile, __u64 segnum,
>> +void nilfs_sufile_do_scrap(struct inode *sufile, __u64 *data,
>>  			   struct buffer_head *header_bh,
>>  			   struct buffer_head *su_bh)
>>  {
>>  	struct nilfs_segment_usage *su;
>>  	void *kaddr;
>>  	int clean, dirty;
>> +	__u64 segnum = *data;
> 
> This can be converted as follows:
> 
>         __u64 segnum = data->segnum;
> 
>>  
>>  	kaddr = kmap_atomic(su_bh->b_page);
>>  	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
>> @@ -462,13 +475,14 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 segnum,
>>  	nilfs_mdt_mark_dirty(sufile);
>>  }
>>  
>> -void nilfs_sufile_do_free(struct inode *sufile, __u64 segnum,
>> +void nilfs_sufile_do_free(struct inode *sufile, __u64 *data,
>>  			  struct buffer_head *header_bh,
>>  			  struct buffer_head *su_bh)
>>  {
>>  	struct nilfs_segment_usage *su;
>>  	void *kaddr;
>>  	int sudirty;
>> +	__u64 segnum = *data;
>>  
>>  	kaddr = kmap_atomic(su_bh->b_page);
>>  	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
>> @@ -596,13 +610,14 @@ int nilfs_sufile_get_stat(struct inode *sufile, struct nilfs_sustat *sustat)
>>  	return ret;
>>  }
>>  
>> -void nilfs_sufile_do_set_error(struct inode *sufile, __u64 segnum,
>> +void nilfs_sufile_do_set_error(struct inode *sufile, __u64 *data,
>>  			       struct buffer_head *header_bh,
>>  			       struct buffer_head *su_bh)
>>  {
>>  	struct nilfs_segment_usage *su;
>>  	void *kaddr;
>>  	int suclean;
>> +	__u64 segnum = *data;
>>  
>>  	kaddr = kmap_atomic(su_bh->b_page);
>>  	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index b8afd72..2df6c71 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -46,21 +46,21 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
>>  				size_t);
>>  ssize_t nilfs_sufile_set_suinfo(struct inode *, void *, unsigned , size_t);
>>  
>> -int nilfs_sufile_updatev(struct inode *, __u64 *, size_t, int, size_t *,
>> -			 void (*dofunc)(struct inode *, __u64,
>> -					struct buffer_head *,
>> -					struct buffer_head *));
>> -int nilfs_sufile_update(struct inode *, __u64, int,
>> -			void (*dofunc)(struct inode *, __u64,
>> +int nilfs_sufile_updatev(struct inode *, void *, size_t, size_t, size_t, int,
>> +			 size_t *, void (*dofunc)(struct inode *, void *,
>> +						  struct buffer_head *,
>> +						  struct buffer_head *));
>> +int nilfs_sufile_update(struct inode *, void *, size_t, int,
>> +			void (*dofunc)(struct inode *, void *,
>>  				       struct buffer_head *,
>>  				       struct buffer_head *));
> 
>> -void nilfs_sufile_do_scrap(struct inode *, __u64, struct buffer_head *,
>> +void nilfs_sufile_do_scrap(struct inode *, __u64 *, struct buffer_head *,
>>  			   struct buffer_head *);
>> -void nilfs_sufile_do_free(struct inode *, __u64, struct buffer_head *,
>> +void nilfs_sufile_do_free(struct inode *, __u64 *, struct buffer_head *,
>>  			  struct buffer_head *);
>> -void nilfs_sufile_do_cancel_free(struct inode *, __u64, struct buffer_head *,
>> +void nilfs_sufile_do_cancel_free(struct inode *, __u64 *, struct buffer_head *,
>>  				 struct buffer_head *);
>> -void nilfs_sufile_do_set_error(struct inode *, __u64, struct buffer_head *,
>> +void nilfs_sufile_do_set_error(struct inode *, __u64 *, struct buffer_head *,
>>  			       struct buffer_head *);
> 
> Please use the "struct nilfs_sufile_update_data *" type for the second
> argument of these declarations.
> 
>>  
>>  int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>> @@ -75,7 +75,8 @@ int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
>>   */
>>  static inline int nilfs_sufile_scrap(struct inode *sufile, __u64 segnum)
>>  {
>> -	return nilfs_sufile_update(sufile, segnum, 1, nilfs_sufile_do_scrap);
>> +	return nilfs_sufile_update(sufile, &segnum, 0, 1,
>> +				   (void *)nilfs_sufile_do_scrap);
>>  }
> 
> Then you can avoid this nasty (void *) cast to the callback function.
> 
>>  
>>  /**
>> @@ -85,7 +86,8 @@ static inline int nilfs_sufile_scrap(struct inode *sufile, __u64 segnum)
>>   */
>>  static inline int nilfs_sufile_free(struct inode *sufile, __u64 segnum)
>>  {
>> -	return nilfs_sufile_update(sufile, segnum, 0, nilfs_sufile_do_free);
>> +	return nilfs_sufile_update(sufile, &segnum, 0, 0,
>> +				   (void *)nilfs_sufile_do_free);
>>  }
> 
> ditto
> 
>>  /**
>> @@ -98,8 +100,8 @@ static inline int nilfs_sufile_free(struct inode *sufile, __u64 segnum)
>>  static inline int nilfs_sufile_freev(struct inode *sufile, __u64 *segnumv,
>>  				     size_t nsegs, size_t *ndone)
>>  {
>> -	return nilfs_sufile_updatev(sufile, segnumv, nsegs, 0, ndone,
>> -				    nilfs_sufile_do_free);
>> +	return nilfs_sufile_updatev(sufile, segnumv, sizeof(__u64), 0, nsegs,
>> +				    0, ndone, (void *)nilfs_sufile_do_free);
>>  }
> 
> ditto
> 
>>  /**
>> @@ -116,8 +118,9 @@ static inline int nilfs_sufile_cancel_freev(struct inode *sufile,
>>  					    __u64 *segnumv, size_t nsegs,
>>  					    size_t *ndone)
>>  {
>> -	return nilfs_sufile_updatev(sufile, segnumv, nsegs, 0, ndone,
>> -				    nilfs_sufile_do_cancel_free);
>> +	return nilfs_sufile_updatev(sufile, segnumv, sizeof(__u64), 0, nsegs,
>> +				    0, ndone,
>> +				    (void *)nilfs_sufile_do_cancel_free);
>>  }
> 
> ditto
> 
>>  /**
>> @@ -139,8 +142,8 @@ static inline int nilfs_sufile_cancel_freev(struct inode *sufile,
>>   */
>>  static inline int nilfs_sufile_set_error(struct inode *sufile, __u64 segnum)
>>  {
>> -	return nilfs_sufile_update(sufile, segnum, 0,
>> -				   nilfs_sufile_do_set_error);
>> +	return nilfs_sufile_update(sufile, &segnum, 0, 0,
>> +				   (void *)nilfs_sufile_do_set_error);
>>  }
>>  
>>  #endif	/* _NILFS_SUFILE_H */
> 
> ditto
> 
> 
> Regards,
> Ryusuke Konishi
> 
>> -- 
>> 2.3.0
>>
> 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
       [not found]             ` <54FF561E.7030409-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-12 12:54               ` Ryusuke Konishi
       [not found]                 ` <20150312.215431.324210374799651841.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-12 12:54 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi Andreas,

On Tue, 10 Mar 2015 21:37:50 +0100, Andreas Rohner wrote:
> Hi Ryusuke,
> 
> Thanks for your thorough review.
> 
> On 2015-03-10 06:21, Ryusuke Konishi wrote:
>> Hi Andreas,
>> 
>> I looked through all of the kernel patches and part of the util patches.
>> Overall comments are as follows:
>> 
>> [Algorithm]
>> As for the algorithm, it looks about OK except for the starvation
>> countermeasure.  The starvation countermeasure looks ad hoc/hacky, but
>> it's good that it doesn't change the kernel/userland interface; we may
>> be able to replace it with a better approach in the future or in a
>> revised version of this patchset.
>> 
>> (1) Drawback of the starvation countermeasure
>>     The patch 9/9 looks to make the execution time of chcp operation
>>     worse since it will scan through sufile to modify live block
>>     counters.  How much does it prolong the execution time ?
> 
> I'll do some tests, but I haven't noticed any significant performance
> drop. The GC basically does the same thing, every time it selects
> segments to reclaim.

GC is performed in the background by an independent process.  What I
care about is that the NILFS_IOCTL_CHANGE_CPMODE ioctl is called from
a command line interface or an application.  The two differ in this
respect.

Was a worst-case scenario considered in the test?

For example:
1. Fill a TB class drive with data file(s), and make a snapshot on it.
2. Run one pass GC to update snapshot block counts.
3. And do "chcp cp"

If we don't observe noticeable delay on this class of drive, then I
think we can put the problem off.

>>     In a use case of nilfs, many snapshots are created and they are
>>     automatically changed back to plain checkpoints because old
>>     snapshots are thinned out over time.  The patch 9/9 may impact on
>>     such usage.
>>
>> (2) Compatibility
>>     What will happen in the following case:
>>     1. Create a file system, use it with the new module, and
>>        create snapshots.
>>     2. Mount it with an old module, and release snapshot with "chcp cp"
>>     3. Mount it with the new module, and cleanerd runs gc with
>>        cost benefit or greedy policy.
> 
> Some segments could be subject to starvation. But it would probably only
> affect a small number of segments and it could be fixed by "chcp ss
> <CP>; chcp cp <CP>".

Ok, let's treat this as a restriction for now.
If you come up with any good idea, please propose.

>> (3) Durability against unexpected power failures (just a note)
>>     The current patchset does not appear to cause a starvation issue
>>     even when an unexpected power failure occurs during or after
>>     executing "chcp cp", because nilfs_ioctl_change_cpmode() makes its
>>     changes in a transactional way with nilfs_transaction_begin/commit.
>>     We should always consider this kind of situation to keep consistency.
>> 
>> [Coding Style]
>> (4) This patchset has several coding style issues. Please fix them and
>>     re-check with the latest checkpatch script (scripts/checkpatch.pl).
> 
> I'll fix that. Sorry.
> 
>> patch 2:
>> WARNING: Prefer kmalloc_array over kmalloc with multiply
>> #85: FILE: fs/nilfs2/sufile.c:1192:
>> +    mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),
>> 
>> patch 5,6:
>> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
>> #60: 
>> the same semaphore has to be aquired. So if the DAT-Entry belongs to
>> 
>> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
>> #46: 
>> be aquired, which blocks the entire SUFILE and effectively turns
>> 
>> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
>> #53: 
>> afore mentioned lock only needs to be aquired, if the cache is full
>> 
>> (5) sub_sizeof macro:
>>     The same definition exists as offsetofend() in vfio.h,
>>     and a patch to move it to stddef.h is now proposed.
>> 
>>     Please use the same name, and redefine it only if it's not
>>     defined:
>> 
>> #ifndef offsetofend
>> #define offsetofend(TYPE, MEMBER) \
>>         (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
>> #endif
> 
> Ok I'll change that.
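
As a quick illustration of what offsetofend() is good for, the sketch below uses it to decide whether an entry of a given size is large enough to contain a newer field. The macro is the one quoted above; the structure layout and the helper entry_has_nlive() are hypothetical and only loosely modeled on a segment usage entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#ifndef offsetofend
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
#endif

/* Hypothetical entry: the old format ends at flags, the new one adds a counter. */
struct usage_entry {
	uint64_t lastmod;
	uint32_t nblocks;
	uint32_t flags;
	uint32_t nlive_blks;	/* hypothetical extension field */
	uint32_t pad;
};

/* Nonzero if an entry of entry_size bytes fully contains nlive_blks. */
static int entry_has_nlive(size_t entry_size)
{
	return entry_size >= offsetofend(struct usage_entry, nlive_blks);
}
```

A check of this kind lets code accept both the old and the extended entry size without hard-coding byte counts.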
> 
>> [Implementation]
>> (6) b_blocknr
>>     Please do not use bh->b_blocknr to store disk block number.  This
>>     field is used to keep virtual block number except for DAT files.
>>     It is only replaced to an actual block number during calling
>>     submit_bh().  Keep this policy.
> 
> As far as I can tell, this is only true for blocks of GC inodes and node
> blocks. All other buffer_heads are always mapped to on disk blocks by
> nilfs_get_block(). I only added the mapping in nilfs_segbuf_submit_bh()
> to correctly set the value in b_blocknr to the new location.
> 

nilfs_get_block() is only used for regular files, directories, and so
on.  Blocks on metadata files are mapped through
nilfs_mdt_submit_block().  Anyway, yes, they store the actual disk block
number in b_blocknr in the current implementation.  But that is just a
shortcut taken by the current implementation, which exists because we
have to set actual disk block numbers when reading blocks with vfs/mm
functions.

Anyway, I would rather you not touch nilfs_get_block() and
nilfs_segbuf_submit_bh() as part of the big patch.  At the least, it
should be a separate patch.  I would prefer that you take an alternative
approach which does the same thing without b_blocknr.  I would like to
help implement the latter approach if you need to know the disk block
number in the patchset.

>>     In segment constructor context, you can calculate the disk block
>>     number from the start disk address of the segment and the block
>>     index (offset) in the segment.
> 
> If I understand you correctly, this approach would give me the on disk
> location inside of the segment that is currently constructed. But I need
> to know the previous on disk location of the buffer_head. I have to
> decrement the counter for the previous segment.

What does the previous on disk location mean ?
And, why do you need to know the previous on disk location?

If it means reclaiming segment, you don't need to decrement its
counter because it will be freed.

If it means the original block to be overwritten,
nilfs_dat_commit_end() is called for the block through
nilfs_bmap_propagate().

If it means the original block of DAT file, it's OK to refer to
b_blocknr because DAT blocks never store virtual block number by
design.  I think it should be done in nilfs_btree_propagate_p() and
nilfs_direct_propagate(), in which no special "end of life" processing
is done against DAT blocks at present.

>> (7) sufile mod cache
>>     Consider gathering the cache into nilfs_sufile_info struct and
>>     stopping to pass it via argument of bmap/sufile/dat interface
>>     functions.  It's hacky, and decreases readability of programs, and
>>     is bloating changes of this patchset over multiple function
>>     blocks.
> 
> If I use a global structure, I have to protect it with a lock. Since
> almost any operation has to modify the counters in the SUFILE, this
> would serialize the whole file system.
> 

The lock acquisition will be needed if you write back to buffers on
SUFILE.  That is the reason why I say you should aggregate the
writeback to sufile into the segment constructor context.

You don't have to suppose a global lock.  You can use bgl_lock, for
example, if the lock contention really matters.
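
To make the "no global lock" point concrete, here is a userspace sketch of lock striping, which is the general idea behind bgl_lock: an array of small locks indexed by a hash of the object number, so that updates to unrelated segments rarely contend. This is not the kernel implementation; it uses C11 atomic_flag spinlocks, and the names (striped_lock, striped_acquire, and so on) and the stripe count are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_STRIPES 16	/* must be a power of two; arbitrary for this sketch */

/* One small spinlock per stripe instead of one global lock. */
struct striped_lock {
	atomic_flag locks[NR_STRIPES];
};

static void striped_init(struct striped_lock *sl)
{
	for (int i = 0; i < NR_STRIPES; i++)
		atomic_flag_clear(&sl->locks[i]);
}

/* Segments that share the low bits of their number share a stripe. */
static atomic_flag *stripe_of(struct striped_lock *sl, uint64_t segnum)
{
	return &sl->locks[segnum & (NR_STRIPES - 1)];
}

/* Returns true if the stripe covering segnum was free and is now held. */
static bool striped_trylock(struct striped_lock *sl, uint64_t segnum)
{
	return !atomic_flag_test_and_set_explicit(stripe_of(sl, segnum),
						  memory_order_acquire);
}

/* Spins until the stripe covering segnum is acquired. */
static void striped_acquire(struct striped_lock *sl, uint64_t segnum)
{
	while (!striped_trylock(sl, segnum))
		;	/* busy-wait */
}

static void striped_release(struct striped_lock *sl, uint64_t segnum)
{
	atomic_flag_clear_explicit(stripe_of(sl, segnum),
				   memory_order_release);
}
```

Two segments only block each other when they hash to the same stripe, so the degree of serialization is bounded by the stripe count rather than being total.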

>>     The cache should be well designed. It's important to balance the
>>     performance and locality/transparency of the feature.  For
>>     instance, it can be implemented with radix-tree of objects in
>>     which each object has a vector of 2^k cache entries.
> 
> I'll look into that.
> 
>>     I think the cache should be written back to the sufile buffers
>>     only within segment construction context. At least, it should be
>>     written back in the context in which a transaction lock is held.
>> 
>>     In addition, introducing a new bmap lock dependency,
>>     nilfs_sufile_lock_key, is undesirable. You should avoid it
>>     by delaying the writeback of cache entries to sufile.
> 
> The cache could end up using a lot of memory. In the worst case one
> entry per block.

Why do you think it matters?  When you modify the block counters of
segments, all the modified SUFILE blocks become dirty and pinned to
memory.  The cache can at least be designed to do better than the dirty
SUFILE buffers.

If you are concerned about needing a "shrinker", we can use other
techniques, such as queuing changes and applying them to sufile in
batches using a workqueue.  Anyway, it's a matter of design or
implementation technique.

>> (8) Changes to the sufile must be finished before dirty buffer
>>     collection of sufile.
>>     All mark_buffer_dirty() calls to sufile must be finished
>>     before or in NILFS_ST_SUFILE stage of nilfs_segctor_collect_blocks().
>> 
>>     (You can write fixed figures to sufile after the collection phase
>>      of sufile by preparatory marking buffer dirty before the
>>      collection phase.)
>>
>>     In the current patchset, sufile mod cache can be flushed in
>>     nilfs_segctor_update_payload_blocknr(), which comes after the
>>     dirty buffer collection phase.
> 
> This is a hard problem. I have to count the blocks added in the
> NILFS_ST_DAT stage. I don't know, which SUFILE blocks I have to mark in
> advance. I'll have to think about this.
> 
>> (9) cpfile is also excluded in the dead block counting like sufile
>>     cpfile is always changed and written back along with sufile and dat.
>>     So, cpfile must be excluded from the dead block counting.
>>     Otherwise, sufile change can trigger cpfile changes, and it in turn
>>     triggers sufile.
> 
> I don't quite understand your example. How exactly can a sufile change
> trigger a cpfile change and how can this turn into an infinite loop?
> 

Sorry, that was my misunderstanding.  Since dirty blocks of cpfile are
collected before sufile, it is possible to avoid the loop by finishing
all dead block counting on cpfile and flushing it to sufile before or
in NILFS_ST_SUFILE stage of nilfs_segctor_collect_blocks().

Regards,
Ryusuke Konishi

> Thanks,
> Andreas Rohner
> 
>>     This also helps to simplify nilfs_dat_commit_end(), to which the
>>     patchset added two arguments for the dead block counting.
>>     I mean, the "dead" argument and the "count_blocks" argument can be
>>     unified by changing the meaning of the "dead" argument.
>> 
>> 
>> I will add detail comments for patches tonight or another day.
>> 
>> Regards,
>> Ryusuke Konishi
>> 
>> On Wed, 25 Feb 2015 09:18:04 +0900 (JST), Ryusuke Konishi wrote:
>>> Hi Andreas,
>>>
>>> Thank you for posting this proposal!
>>>
>>> I would like to have time to review this series through, but please
>>> wait for several days. (This week I'm quite busy until weekend)
>>>
>>> Thanks,
>>> Ryusuke Konishi
>>>
>>> On Tue, 24 Feb 2015 20:01:35 +0100, Andreas Rohner wrote:
>>>> Hi everyone!
>>>>
>>>> One of the biggest performance problems of NILFS is its
>>>> inefficient Timestamp GC policy. This patch set introduces two new GC
>>>> policies, namely Cost-Benefit and Greedy.
>>>>
>>>> The Cost-Benefit policy is nothing new. It has been around for a long
>>>> time with log-structured file systems [1]. But it relies on accurate
>>>> information, about the number of live blocks in a segment. NILFS
>>>> currently does not provide the necessary information. So this patch set
>>>> extends the entries in the SUFILE to include a counter for the number of
>>>> live blocks. This counter is decremented whenever a file is deleted or
>>>> overwritten.
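
For reference, a sketch of how a live block counter can feed segment selection. The score is the cost-benefit ratio from the cited LFS paper [1], (1 - u) * age / (1 + u), where u is the fraction of live blocks in the segment. "Greedy" is assumed here to mean "pick the segment with the fewest live blocks". The structure, field names, and functions are hypothetical, and this is not necessarily how the patch set or cleanerd computes its priorities.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-segment summary; the field names are illustrative. */
struct seg_info {
	unsigned int nlive_blks;	/* live blocks in the segment */
	unsigned long age;		/* time since last modification */
};

/* Cost-benefit ratio from the LFS paper: (1 - u) * age / (1 + u). */
static double cost_benefit_score(const struct seg_info *s,
				 unsigned int blks_per_seg)
{
	double u = (double)s->nlive_blks / blks_per_seg;

	return (1.0 - u) * (double)s->age / (1.0 + u);
}

/* Index of the segment with the highest score; nsegs must be > 0. */
static size_t pick_cost_benefit(const struct seg_info *segs, size_t nsegs,
				unsigned int blks_per_seg)
{
	size_t i, best = 0;

	for (i = 1; i < nsegs; i++)
		if (cost_benefit_score(&segs[i], blks_per_seg) >
		    cost_benefit_score(&segs[best], blks_per_seg))
			best = i;
	return best;
}

/* Index of the segment with the fewest live blocks; nsegs must be > 0. */
static size_t pick_greedy(const struct seg_info *segs, size_t nsegs)
{
	size_t i, best = 0;

	for (i = 1; i < nsegs; i++)
		if (segs[i].nlive_blks < segs[best].nlive_blks)
			best = i;
	return best;
}
```

The two policies can disagree: an old, mostly live segment may score higher under cost-benefit than a young, half-empty one, while greedy always prefers the emptier segment.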
>>>>
>>>> Except for some tricky parts, the counting of live blocks is quite
>>>> trivial. The problem is snapshots. At any time, a checkpoint can be
>>>> turned into a snapshot or vice versa. So blocks that are reclaimable at
>>>> one point in time, are protected by a snapshot a moment later.
>>>>
>>>> This patch set does not try to track snapshots at all. Instead it uses a
>>>> heuristic approach to prevent the worst case scenario. The performance
>>>> is still significantly better than timestamp for my benchmarks.
>>>>
>>>> The worst case scenario is, the following:
>>>>
>>>> 1. Segment 1 is written
>>>> 2. Snapshot is created
>>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>>>    by the Snapshot. The GC has to set the number of live blocks
>>>>    to maximum to avoid reclaiming this Segment again in the near future.
>>>> 4. Snapshot is deleted
>>>> 5. Segment 1 is reclaimable, but its counter is so high, that the GC
>>>>    will never try to reclaim it again.
>>>>
>>>> To prevent this kind of starvation I use another field in the SUFILE
>>>> entry, to store the number of blocks that are protected by a snapshot.
>>>> This value is just a heuristic and it is usually set to 0. Only if the
>>>> GC reclaims a segment, it is written to the SUFILE entry. The GC has to
>>>> check for snapshots anyway, so we get this information for free. By
>>>> storing this information in the SUFILE we can avoid starvation in the
>>>> following way:
>>>>
>>>> 1. Segment 1 is written
>>>> 2. Snapshot is created
>>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>>>    by the Snapshot. The GC has to set the number of live blocks
>>>>    to maximum to avoid reclaiming this Segment again in the near future.
>>>> 4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
>>>>    entry
>>>> 5. Snapshot is deleted
>>>> 6. On Snapshot deletion we walk through every entry in the SUFILE and
>>>>    reduce the number of live blocks to half, if the number of snapshot
>>>>    blocks is bigger than half of the maximum.
>>>> 7. Segment 1 is reclaimable and the number of live blocks entry is at
>>>>    half the maximum. The GC will try to reclaim this segment as soon as
>>>>    there are no other better choices.
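
Step 6 of the list above can be sketched in userspace as a single pass over the entries. The structure and function names are hypothetical. The extra check that the live block counter is actually above half the maximum is an assumption added so that the pass only ever lowers a counter; the description above does not spell that out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory view of the two counters in a SUFILE entry. */
struct seg_usage {
	uint32_t nlive_blks;		/* live block counter */
	uint32_t nsnapshot_blks;	/* heuristic: blocks seen under a snapshot */
};

/*
 * On snapshot deletion: for every segment whose snapshot block count
 * exceeds half of max_blks, lower the live block counter to half of
 * max_blks.  Returns the number of entries that were changed.
 */
static size_t fix_starving_segments(struct seg_usage *su, size_t nsegs,
				    uint32_t max_blks)
{
	uint32_t half = max_blks / 2;
	size_t i, nfixed = 0;

	for (i = 0; i < nsegs; i++) {
		/* The nlive_blks > half guard is an assumption, see above. */
		if (su[i].nsnapshot_blks > half && su[i].nlive_blks > half) {
			su[i].nlive_blks = half;
			nfixed++;
		}
	}
	return nfixed;
}
```

The pass touches every entry, which is why its cost on large devices is relevant to the execution time of "chcp cp".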
>>>>
>>>> BENCHMARKS:
>>>> -----------
>>>>
>>>> My benchmark is quite simple. It consists of a process, that replays
>>>> real NFS traces at a faster speed. It thereby creates relatively
>>>> realistic patterns of file creation and deletions. At the same time
>>>> multiple snapshots are created and deleted in parallel. I use a 100GB
>>>> partition of a Samsung SSD:
>>>>
>>>> WITH SNAPSHOTS EVERY 5 MINUTES:
>>>> --------------------------------------------------------------------
>>>>                 Execution time       Wear (Data written to disk)
>>>> Timestamp:      100%                 100%
>>>> Cost-Benefit:   80%                  43%
>>>>
>>>> NO SNAPSHOTS:
>>>> ---------------------------------------------------------------------
>>>>                 Execution time       Wear (Data written to disk)
>>>> Timestamp:      100%                 100%
>>>> Cost-Benefit:   70%                  45%
>>>>
>>>> I plan on adding more benchmark results soon.
>>>>
>>>> Best regards,
>>>> Andreas Rohner
>>>>
>>>> [1] Mendel Rosenblum and John K. Ousterhout. The design and
>>>>     implementation of a log-structured file system. ACM Trans.
>>>>     Comput. Syst., 10(1):26–52, February 1992.
>>>>
>>>> Andreas Rohner (9):
>>>>   nilfs2: refactor nilfs_sufile_updatev()
>>>>   nilfs2: add simple cache for modifications to SUFILE
>>>>   nilfs2: extend SUFILE on-disk format to enable counting of live blocks
>>>>   nilfs2: add function to modify su_nlive_blks
>>>>   nilfs2: add simple tracking of block deletions and updates
>>>>   nilfs2: use modification cache to improve performance
>>>>   nilfs2: add additional flags for nilfs_vdesc
>>>>   nilfs2: improve accuracy and correct for invalid GC values
>>>>   nilfs2: prevent starvation of segments protected by snapshots
>>>>
>>>>  fs/nilfs2/bmap.c          |  84 +++++++-
>>>>  fs/nilfs2/bmap.h          |  14 +-
>>>>  fs/nilfs2/btree.c         |   4 +-
>>>>  fs/nilfs2/cpfile.c        |   5 +
>>>>  fs/nilfs2/dat.c           |  95 ++++++++-
>>>>  fs/nilfs2/dat.h           |   8 +-
>>>>  fs/nilfs2/direct.c        |   4 +-
>>>>  fs/nilfs2/inode.c         |  24 ++-
>>>>  fs/nilfs2/ioctl.c         |  27 ++-
>>>>  fs/nilfs2/mdt.c           |   5 +-
>>>>  fs/nilfs2/page.h          |   6 +-
>>>>  fs/nilfs2/segbuf.c        |   6 +
>>>>  fs/nilfs2/segbuf.h        |   3 +
>>>>  fs/nilfs2/segment.c       | 155 +++++++++++++-
>>>>  fs/nilfs2/segment.h       |   3 +
>>>>  fs/nilfs2/sufile.c        | 533 +++++++++++++++++++++++++++++++++++++++++++---
>>>>  fs/nilfs2/sufile.h        |  97 +++++++--
>>>>  fs/nilfs2/the_nilfs.c     |   4 +
>>>>  fs/nilfs2/the_nilfs.h     |  23 ++
>>>>  include/linux/nilfs2_fs.h | 122 ++++++++++-
>>>>  20 files changed, 1126 insertions(+), 96 deletions(-)
>>>>
>>>> -- 
>>>> 2.3.0
>>>>
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 2/9] nilfs2: add simple cache for modifications to SUFILE
       [not found]     ` <1424804504-10914-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14  0:45       ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14  0:45 UTC (permalink / raw)
  To: andreas.rohner-hi6Y0CQ0nG0; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:01:37 +0100, Andreas Rohner wrote:
> This patch adds a simple, small cache that can be used to accumulate
> modifications to SUFILE entries. This is for example useful for
> keeping track of reclaimable blocks, because most of the
> modifications consist of small increments or decrements. By adding
> these up and temporarily storing them in a small cache, the
> performance can be improved. Additionally lock contention is
> reduced.
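
The accumulation idea described above can be sketched in userspace as follows. This is a simplified model with a fixed capacity and a linear search, not the code from the patch; the names (seg_mod, mod_cache, mod_cache_add, mod_cache_flush) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOD_CACHE_CAPACITY 4	/* arbitrary small capacity for the sketch */

/* One pending modification: a signed delta for one segment. */
struct seg_mod {
	uint64_t segnum;
	int64_t value;
};

struct mod_cache {
	struct seg_mod mods[MOD_CACHE_CAPACITY];
	size_t size;
};

/*
 * Adds value to the pending delta of segnum.  Deltas for a segment that
 * is already cached are merged into the existing entry.  Returns 0 on
 * success and -1 if a new entry is needed but the cache is full.
 */
static int mod_cache_add(struct mod_cache *c, uint64_t segnum, int64_t value)
{
	size_t i;

	for (i = 0; i < c->size; i++) {
		if (c->mods[i].segnum == segnum) {
			c->mods[i].value += value;
			return 0;
		}
	}
	if (c->size == MOD_CACHE_CAPACITY)
		return -1;
	c->mods[c->size].segnum = segnum;
	c->mods[c->size].value = value;
	c->size++;
	return 0;
}

/* Applies all pending deltas to the per-segment counters and empties the cache. */
static void mod_cache_flush(struct mod_cache *c, int64_t *counters)
{
	size_t i;

	for (i = 0; i < c->size; i++)
		counters[c->mods[i].segnum] += c->mods[i].value;
	c->size = 0;
}
```

Many small decrements to the same segment collapse into one cached entry, so the shared counters (and whatever lock protects them) are touched once per flush instead of once per block.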
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/sufile.c | 178 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nilfs2/sufile.h |  44 +++++++++++++
>  2 files changed, 222 insertions(+)
> 
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index 1e8cac6..a369c30 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -1168,6 +1168,184 @@ out_sem:
>  }
>  
>  /**
> + * nilfs_sufile_mc_init - inits segusg modification cache
> + * @mc: modification cache
> + * @capacity: maximum capacity of the mod cache
> + *
> + * Description: Allocates memory for an array of nilfs_sufile_mod structures
> + * according to @capacity. This memory must be freed with
> + * nilfs_sufile_mc_destroy().
> + *
> + * Return Value: On success, 0 is returned. On error, one of the following
> + * negative error codes is returned.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + *
> + * %-EINVAL - Invalid capacity.
> + */
> +int nilfs_sufile_mc_init(struct nilfs_sufile_mod_cache *mc, size_t capacity)
> +{
> +	mc->mc_capacity = capacity;
> +	if (!capacity)
> +		return -EINVAL;
> +
> +	mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),
> +			      GFP_KERNEL);

GFP_NOFS must be used instead of GFP_KERNEL so that the allocation
cannot initiate other filesystem operations.

The abbreviation "mc" is not a good choice, because it is already used
as the abbreviation of "minimum clean" in userland.

> +	if (!mc->mc_mods)
> +		return -ENOMEM;
> +
> +	mc->mc_size = 0;
> +
> +	return 0;
> +}
> +
> +/**
> + * nilfs_sufile_mc_add - add signed value to segusg modification cache
> + * @mc: modification cache
> + * @segnum: segment number
> + * @value: signed value (can be positive and negative)
> + *
> + * Description: nilfs_sufile_mc_add() tries to add a pair of @segnum and
> + * @value to the modification cache. If the cache already contains a
> + * segment number equal to @segnum, then @value is simply added to the
> + * existing value. This way thousands of small modifications can be
> + * accumulated into one value. If @segnum cannot be found and the
> + * capacity allows it, a new element is added to the cache. If the
> + * capacity is reached an error value is returned.
> + *
> + * Return Value: On success, 0 is returned. On error, one of the following
> + * negative error codes is returned.
> + *
> + * %-ENOSPC - The mod cache has reached its capacity and must be flushed.
> + */
> +static inline int nilfs_sufile_mc_add(struct nilfs_sufile_mod_cache *mc,
> +				      __u64 segnum, __s64 value)
> +{
> +	struct nilfs_sufile_mod *mods = mc->mc_mods;
> +	int i;
> +
> +	for (i = 0; i < mc->mc_size; ++i, ++mods) {
> +		if (mods->m_segnum == segnum) {
> +			mods->m_value += value;
> +			return 0;
> +		}
> +	}
> +
> +	if (mc->mc_size < mc->mc_capacity) {
> +		mods->m_segnum = segnum;
> +		mods->m_value = value;
> +		mc->mc_size++;
> +		return 0;
> +	}
> +
> +	return -ENOSPC;
> +}
> +
> +/**
> + * nilfs_sufile_mc_clear - set mc_size to 0
> + * @mc: modification cache
> + *
> + * Description: nilfs_sufile_mc_clear() sets mc_size to 0, which enables
> + * nilfs_sufile_mc_add() to overwrite the elements in @mc.
> + */
> +static inline void nilfs_sufile_mc_clear(struct nilfs_sufile_mod_cache *mc)
> +{
> +	mc->mc_size = 0;
> +}
> +
> +/**
> + * nilfs_sufile_mc_reset - clear cache and add one element
> + * @mc: modification cache
> + * @segnum: segment number
> + * @value: signed value (can be positive and negative)
> + *
> + * Description: Clears the modification cache in @mc and adds a new pair of
> + * @segnum and @value to it at the same time.
> + */
> +static inline void nilfs_sufile_mc_reset(struct nilfs_sufile_mod_cache *mc,
> +					 __u64 segnum, __s64 value)
> +{
> +	struct nilfs_sufile_mod *mods = mc->mc_mods;
> +
> +	mods->m_segnum = segnum;
> +	mods->m_value = value;
> +	mc->mc_size = 1;
> +}

The name of this function is confusing.  What it actually does is
"reset" followed by "add", and it can be replaced with a call to
mc_clear and a call to mc_add.  Please remove this function to simplify
the interface.
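
A minimal userspace sketch of the caller pattern, for illustration
only.  All names here (toy_mod_cache, toy_add, toy_clear, toy_flush,
toy_add_or_flush) are hypothetical stand-ins and not nilfs2 code, and
the "flush" merely sums the cached values instead of touching a SUFILE:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the patch's structures; names are hypothetical. */
struct toy_mod {
	uint64_t segnum;
	int64_t value;
};

struct toy_mod_cache {
	struct toy_mod *mods;
	size_t capacity;
	size_t size;
};

/* Sum of all values seen by toy_flush(); stands in for the on-disk update. */
static int64_t toy_applied_total;

/* Accumulate @value for @segnum; returns -ENOSPC when the cache is full. */
static int toy_add(struct toy_mod_cache *c, uint64_t segnum, int64_t value)
{
	size_t i;

	assert(c && c->mods);
	for (i = 0; i < c->size; i++) {
		if (c->mods[i].segnum == segnum) {
			c->mods[i].value += value;
			return 0;
		}
	}
	if (c->size == c->capacity)
		return -ENOSPC;
	c->mods[c->size].segnum = segnum;
	c->mods[c->size].value = value;
	c->size++;
	return 0;
}

/* Forget all cached entries; the storage is reused by later adds. */
static void toy_clear(struct toy_mod_cache *c)
{
	c->size = 0;
}

/* Pretend to apply the cached modifications (here: just sum them up). */
static int toy_flush(struct toy_mod_cache *c)
{
	size_t i;

	for (i = 0; i < c->size; i++)
		toy_applied_total += c->mods[i].value;
	return 0;
}

/* On a full cache: flush, clear, then add. No separate "reset" is needed. */
static int toy_add_or_flush(struct toy_mod_cache *c, uint64_t segnum,
			    int64_t value)
{
	int ret = toy_add(c, segnum, value);

	if (ret != -ENOSPC)
		return ret;
	ret = toy_flush(c);
	if (ret < 0)
		return ret;
	toy_clear(c);
	return toy_add(c, segnum, value);
}
```

With a capacity of at least one, the add that follows the clear cannot
fail with -ENOSPC, so the two-call sequence covers what the combined
helper did.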

Regards,
Ryusuke Konishi

> +/**
> + * nilfs_sufile_mc_flush - flush modification cache
> + * @sufile: inode of segment usage file
> + * @mc: modification cache
> + * @dofunc: primitive operation for the update
> + *
> + * Description: nilfs_sufile_mc_flush() flushes the cached modifications
> + * and applies them to the segment usages on disk. It persists the cached
> + * changes, by calling @dofunc for every element in the cache. @dofunc also
> + * determines the interpretation of the cached values and how they should
> + * be applied to the corresponding segment usage entries.
> + *
> + * Return Value: On success, zero is returned.  On error, one of the
> + * following negative error codes is returned.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + *
> + * %-ENOENT - Given segment usage is in hole block
> + *
> + * %-EINVAL - Invalid segment usage number
> + */
> +static inline int nilfs_sufile_mc_flush(struct inode *sufile,
> +					struct nilfs_sufile_mod_cache *mc,
> +					void (*dofunc)(struct inode *,
> +						struct nilfs_sufile_mod *,
> +						struct buffer_head *,
> +						struct buffer_head *))
> +{
> +	return nilfs_sufile_updatev(sufile, mc->mc_mods,
> +				    sizeof(struct nilfs_sufile_mod),
> +				    offsetof(struct nilfs_sufile_mod, m_segnum),
> +				    mc->mc_size, 0, NULL, (void *)dofunc);
> +}
> +
> +/**
> + * nilfs_sufile_mc_update - immediately applies modification
> + * @sufile: inode of segment usage file
> + * @segnum: segment number
> + * @value: signed value (can be positive and negative)
> + * @dofunc: primitive operation for the update
> + *
> + * Description: nilfs_sufile_mc_update() is a helper function, that
> + * creates a temporary nilfs_sufile_mod structure out of @segnum and @value
> + * and immediately flushes it using @dofunc, without the use of a
> + * modification cache.
> + *
> + * Return Value: On success, zero is returned.  On error, one of the
> + * following negative error codes is returned.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + *
> + * %-ENOENT - Given segment usage is in hole block
> + *
> + * %-EINVAL - Invalid segment usage number
> + */
> +static inline int nilfs_sufile_mc_update(struct inode *sufile,
> +					 __u64 segnum, __s64 value,
> +					 void (*dofunc)(struct inode *,
> +						struct nilfs_sufile_mod *,
> +						struct buffer_head *,
> +						struct buffer_head *))
> +{
> +	struct nilfs_sufile_mod m = {.m_segnum = segnum, .m_value = value};
> +
> +	return nilfs_sufile_update(sufile, &m,
> +				   offsetof(struct nilfs_sufile_mod, m_segnum),
> +				   0, (void *)dofunc);
> +}
> +
> +/**
>   * nilfs_sufile_read - read or get sufile inode
>   * @sb: super block instance
>   * @susize: size of a segment usage entry
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index 2df6c71..c446325 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -146,4 +146,48 @@ static inline int nilfs_sufile_set_error(struct inode *sufile, __u64 segnum)
>  				   (void *)nilfs_sufile_do_set_error);
>  }
>  
> +#define NILFS_SUFILE_MC_SIZE_DEFAULT	5
> +#define NILFS_SUFILE_MC_SIZE_EXT	10
> +
> +/**
> + * struct nilfs_sufile_mod - segment usage modification
> + * @m_segnum: segment number
> + * @m_value: signed value that gets added to respective segusg field
> + */
> +struct nilfs_sufile_mod {
> +	__u64 m_segnum;
> +	__s64 m_value;
> +};
> +
> +/**
> + * struct nilfs_sufile_mod_cache - segment usage modification cache
> + * @mc_mods: array of modifications to segments
> + * @mc_capacity: maximum number of elements that fit in @mc_mods
> + * @mc_size: number of elements currently filled with valid data
> + */
> +struct nilfs_sufile_mod_cache {
> +	struct nilfs_sufile_mod *mc_mods;
> +	size_t mc_capacity;
> +	size_t mc_size;
> +};
> +
> +int nilfs_sufile_mc_init(struct nilfs_sufile_mod_cache *, size_t);
> +
> +/**
> + * nilfs_sufile_mc_destroy - destroy segusg modification cache
> + * @mc: modification cache
> + *
> + * Description: Releases the memory allocated by nilfs_sufile_mc_init and
> + * sets the size and capacity to 0. @mc should not be used after a call to
> + * this function.
> + */
> +static inline void nilfs_sufile_mc_destroy(struct nilfs_sufile_mod_cache *mc)
> +{
> +	if (mc) {
> +		kfree(mc->mc_mods);
> +		mc->mc_capacity = 0;
> +		mc->mc_size = 0;
> +	}
> +}
> +
>  #endif	/* _NILFS_SUFILE_H */
> -- 
> 2.3.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 6/9] nilfs2: use modification cache to improve performance
       [not found]     ` <1424804504-10914-7-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14  1:04       ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14  1:04 UTC (permalink / raw)
  To: andreas.rohner-hi6Y0CQ0nG0; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:01:41 +0100, Andreas Rohner wrote:
> This patch adds a small cache to accumulate the small decrements of
> the number of live blocks in a segment usage entry. If for example a
> large file is deleted, the segment usage entry has to be updated for
> every single block. But for every decrement, a MDT write lock has to
> be acquired, which blocks the entire SUFILE and effectively turns
> this lock into a global lock for the whole file system.
> 
> The cache tries to ameliorate this situation by adding up the
> decrements and increments for a given number of segments and
> applying the changes all at once. Because the changes are
> accumulated in memory and not immediately written to the SUFILE, the
> aforementioned lock only needs to be acquired if the cache is full
> or at the end of the respective operation.
> 
> To effectively get the pointer to the modification cache from the
> high level operations down to the update of the individual blocks in
> nilfs_dat_commit_end(), a new pointer b_private was added to struct
> nilfs_bmap.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/bmap.c    | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nilfs2/bmap.h    | 11 +++++++-
>  fs/nilfs2/btree.c   |  2 +-
>  fs/nilfs2/direct.c  |  2 +-
>  fs/nilfs2/inode.c   | 22 +++++++++++++---
>  fs/nilfs2/segment.c | 26 +++++++++++++++---
>  fs/nilfs2/segment.h |  3 +++
>  7 files changed, 132 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
> index ecd62ba..927acb7 100644
> --- a/fs/nilfs2/bmap.c
> +++ b/fs/nilfs2/bmap.c
> @@ -288,6 +288,43 @@ int nilfs_bmap_truncate(struct nilfs_bmap *bmap, unsigned long key)
>  }
>  
>  /**
> + * nilfs_bmap_truncate_with_mc - truncate a bmap to a specified key
> + * @bmap: bmap
> + * @mc: modification cache
> + * @key: key
> + *
> + * Description: nilfs_bmap_truncate_with_mc() removes key-record pairs whose
> + * keys are greater than or equal to @key from @bmap. It has the same
> + * functionality as nilfs_bmap_truncate(), but allows the passing
> + * of a modification cache to update segment usage information.
> + *
> + * Return Value: On success, 0 is returned. On error, one of the following
> + * negative error codes is returned.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + */
> +int nilfs_bmap_truncate_with_mc(struct nilfs_bmap *bmap,
> +				struct nilfs_sufile_mod_cache *mc,
> +				unsigned long key)
> +{
> +	int ret;
> +
> +	down_write(&bmap->b_sem);
> +
> +	bmap->b_private = mc;
> +
> +	ret = nilfs_bmap_do_truncate(bmap, key);
> +
> +	bmap->b_private = NULL;
> +
> +	up_write(&bmap->b_sem);
> +
> +	return nilfs_bmap_convert_error(bmap, __func__, ret);
> +}
> +
> +/**
>   * nilfs_bmap_clear - free resources a bmap holds
>   * @bmap: bmap
>   *
> @@ -328,6 +365,43 @@ int nilfs_bmap_propagate(struct nilfs_bmap *bmap, struct buffer_head *bh)
>  }
>  
>  /**
> + * nilfs_bmap_propagate_with_mc - propagate dirty state
> + * @bmap: bmap
> + * @mc: modification cache
> + * @bh: buffer head
> + *
> + * Description: nilfs_bmap_propagate_with_mc() marks the buffers that directly
> + * or indirectly refer to the block specified by @bh dirty. It has
> + * the same functionality as nilfs_bmap_propagate(), but allows the passing
> + * of a modification cache to update segment usage information.
> + *
> + * Return Value: On success, 0 is returned. On error, one of the following
> + * negative error codes is returned.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + */
> +int nilfs_bmap_propagate_with_mc(struct nilfs_bmap *bmap,
> +				 struct nilfs_sufile_mod_cache *mc,
> +				 struct buffer_head *bh)
> +{
> +	int ret;
> +
> +	down_write(&bmap->b_sem);
> +
> +	bmap->b_private = mc;
> +
> +	ret = bmap->b_ops->bop_propagate(bmap, bh);
> +
> +	bmap->b_private = NULL;
> +
> +	up_write(&bmap->b_sem);
> +
> +	return nilfs_bmap_convert_error(bmap, __func__, ret);
> +}

These bmap functions are really bad.  The mod cache argument has no
meaning with regard to block mapping operations.  I really hope we can
avoid adding these variants by hiding the cache in sufile.
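
A rough userspace sketch of what "hiding the cache in sufile" could
mean, under loud assumptions: toy_sufile, toy_sufile_mod_nlive_blks and
toy_sufile_flush are hypothetical names, the array stands in for the
on-disk segment usage entries, and locking is left out entirely.  The
point is only that callers pass a segment number and a delta, never a
cache pointer:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NSEGMENTS	8	/* toy number of segments */
#define TOY_CACHE_SLOTS	4	/* toy cache capacity */

struct toy_pending_mod {
	uint64_t segnum;
	int64_t delta;
};

/* Toy stand-in for the sufile state; it owns its modification cache. */
struct toy_sufile {
	int64_t nlive_blks[TOY_NSEGMENTS];	/* "on-disk" counters */
	struct toy_pending_mod cache[TOY_CACHE_SLOTS];
	size_t ncached;
	unsigned int nflushes;	/* how often the expensive path ran */
};

/* Apply every cached delta in one pass and empty the cache. */
static void toy_sufile_flush(struct toy_sufile *su)
{
	size_t i;

	if (!su->ncached)
		return;
	for (i = 0; i < su->ncached; i++)
		su->nlive_blks[su->cache[i].segnum] += su->cache[i].delta;
	su->ncached = 0;
	su->nflushes++;
}

/* The only entry point callers need; the cache stays an internal detail. */
static int toy_sufile_mod_nlive_blks(struct toy_sufile *su, uint64_t segnum,
				     int64_t delta)
{
	size_t i;

	assert(su);
	if (segnum >= TOY_NSEGMENTS)
		return -EINVAL;

	/* Merge with an existing entry for the same segment if there is one. */
	for (i = 0; i < su->ncached; i++) {
		if (su->cache[i].segnum == segnum) {
			su->cache[i].delta += delta;
			return 0;
		}
	}

	/* Cache full: write back everything, then reuse the first slot. */
	if (su->ncached == TOY_CACHE_SLOTS)
		toy_sufile_flush(su);

	su->cache[su->ncached].segnum = segnum;
	su->cache[su->ncached].delta = delta;
	su->ncached++;
	return 0;
}
```

In this shape the bmap layer would not need _with_mc variants or a
b_private pointer; whether such a cache can be shared safely inside the
kernel depends on locking that this sketch does not model.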

> +
> +/**
>   * nilfs_bmap_lookup_dirty_buffers -
>   * @bmap: bmap
>   * @listp: pointer to buffer head list
> @@ -490,6 +564,7 @@ int nilfs_bmap_read(struct nilfs_bmap *bmap, struct nilfs_inode *raw_inode)
>  
>  	init_rwsem(&bmap->b_sem);
>  	bmap->b_state = 0;
> +	bmap->b_private = NULL;
>  	bmap->b_inode = &NILFS_BMAP_I(bmap)->vfs_inode;
>  	switch (bmap->b_inode->i_ino) {
>  	case NILFS_DAT_INO:
> @@ -551,6 +626,7 @@ void nilfs_bmap_init_gc(struct nilfs_bmap *bmap)
>  	bmap->b_last_allocated_key = 0;
>  	bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
>  	bmap->b_state = 0;
> +	bmap->b_private = NULL;
>  	nilfs_btree_init_gc(bmap);
>  }
>  
> diff --git a/fs/nilfs2/bmap.h b/fs/nilfs2/bmap.h
> index 718c814..a8b935a 100644
> --- a/fs/nilfs2/bmap.h
> +++ b/fs/nilfs2/bmap.h
> @@ -36,6 +36,7 @@
>  
>  
>  struct nilfs_bmap;
> +struct nilfs_sufile_mod_cache;
>  
>  /**
>   * union nilfs_bmap_ptr_req - request for bmap ptr
> @@ -106,6 +107,7 @@ static inline int nilfs_bmap_is_new_ptr(unsigned long ptr)
>   * @b_ptr_type: pointer type
>   * @b_state: state
>   * @b_nchildren_per_block: maximum number of child nodes for non-root nodes
> + * @b_private: pointer for extra data
>   */
>  struct nilfs_bmap {
>  	union {
> @@ -120,6 +122,7 @@ struct nilfs_bmap {
>  	int b_ptr_type;
>  	int b_state;
>  	__u16 b_nchildren_per_block;
> +	void *b_private;
>  };
>  
>  /* pointer type */
> @@ -157,8 +160,14 @@ int nilfs_bmap_insert(struct nilfs_bmap *, unsigned long, unsigned long);
>  int nilfs_bmap_delete(struct nilfs_bmap *, unsigned long);
>  int nilfs_bmap_last_key(struct nilfs_bmap *, unsigned long *);
>  int nilfs_bmap_truncate(struct nilfs_bmap *, unsigned long);
> +int nilfs_bmap_truncate_with_mc(struct nilfs_bmap *,
> +				struct nilfs_sufile_mod_cache *,
> +				unsigned long);
>  void nilfs_bmap_clear(struct nilfs_bmap *);
>  int nilfs_bmap_propagate(struct nilfs_bmap *, struct buffer_head *);
> +int nilfs_bmap_propagate_with_mc(struct nilfs_bmap *,
> +				 struct nilfs_sufile_mod_cache *,
> +				 struct buffer_head *);
>  void nilfs_bmap_lookup_dirty_buffers(struct nilfs_bmap *, struct list_head *);
>  int nilfs_bmap_assign(struct nilfs_bmap *, struct buffer_head **,
>  		      unsigned long, union nilfs_binfo *);
> @@ -222,7 +231,7 @@ static inline void nilfs_bmap_commit_end_ptr(struct nilfs_bmap *bmap,
>  					     struct inode *dat)
>  {
>  	if (dat)
> -		nilfs_dat_commit_end(dat, &req->bpr_req, NULL,
> +		nilfs_dat_commit_end(dat, &req->bpr_req, bmap->b_private,
>  				     bmap->b_ptr_type == NILFS_BMAP_PTR_VS,
>  				     bmap->b_inode->i_ino != NILFS_SUFILE_INO);
>  }
> diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
> index 2af0519..c3c883e 100644
> --- a/fs/nilfs2/btree.c
> +++ b/fs/nilfs2/btree.c
> @@ -1851,7 +1851,7 @@ static void nilfs_btree_commit_update_v(struct nilfs_bmap *btree,
>  
>  	nilfs_dat_commit_update(dat, &path[level].bp_oldreq.bpr_req,
>  				&path[level].bp_newreq.bpr_req,
> -				NULL,
> +				btree->b_private,
>  				btree->b_ptr_type == NILFS_BMAP_PTR_VS,
>  				btree->b_inode->i_ino != NILFS_SUFILE_INO);
>  
> diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c
> index e022cfb..a716bba 100644
> --- a/fs/nilfs2/direct.c
> +++ b/fs/nilfs2/direct.c
> @@ -272,7 +272,7 @@ static int nilfs_direct_propagate(struct nilfs_bmap *bmap,
>  		if (ret < 0)
>  			return ret;
>  		nilfs_dat_commit_update(dat, &oldreq, &newreq,
> -				NULL,
> +				bmap->b_private,
>  				bmap->b_ptr_type == NILFS_BMAP_PTR_VS,
>  				bmap->b_inode->i_ino != NILFS_SUFILE_INO);
>  		set_buffer_nilfs_volatile(bh);
> diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
> index 8b59695..7f6d056 100644
> --- a/fs/nilfs2/inode.c
> +++ b/fs/nilfs2/inode.c
> @@ -34,6 +34,7 @@
>  #include "mdt.h"
>  #include "cpfile.h"
>  #include "ifile.h"
> +#include "sufile.h"
>  
>  /**
>   * struct nilfs_iget_args - arguments used during comparison between inodes
> @@ -714,29 +715,42 @@ void nilfs_update_inode(struct inode *inode, struct buffer_head *ibh, int flags)
>  static void nilfs_truncate_bmap(struct nilfs_inode_info *ii,
>  				unsigned long from)
>  {
> +	struct the_nilfs *nilfs = ii->vfs_inode.i_sb->s_fs_info;
> +	struct nilfs_sufile_mod_cache mc, *mcp = NULL;
>  	unsigned long b;
>  	int ret;
>  
>  	if (!test_bit(NILFS_I_BMAP, &ii->i_state))
>  		return;
> +
> +	if (nilfs_feature_track_live_blks(nilfs) &&
> +	    !nilfs_sufile_mc_init(&mc, NILFS_SUFILE_MC_SIZE_DEFAULT))
> +		mcp = &mc;
> +
>  repeat:
>  	ret = nilfs_bmap_last_key(ii->i_bmap, &b);
>  	if (ret == -ENOENT)
> -		return;
> +		goto out_free;
>  	else if (ret < 0)
>  		goto failed;
>  
>  	if (b < from)
> -		return;
> +		goto out_free;
>  
>  	b -= min_t(unsigned long, NILFS_MAX_TRUNCATE_BLOCKS, b - from);
> -	ret = nilfs_bmap_truncate(ii->i_bmap, b);
> +	ret = nilfs_bmap_truncate_with_mc(ii->i_bmap, mcp, b);
>  	nilfs_relax_pressure_in_lock(ii->vfs_inode.i_sb);
>  	if (!ret || (ret == -ENOMEM &&
> -		     nilfs_bmap_truncate(ii->i_bmap, b) == 0))
> +		     nilfs_bmap_truncate_with_mc(ii->i_bmap, mcp, b) == 0))
>  		goto repeat;
>  
> +out_free:
> +	nilfs_sufile_flush_nlive_blks(nilfs->ns_sufile, mcp);
> +	nilfs_sufile_mc_destroy(mcp);
> +	return;
>  failed:
> +	nilfs_sufile_flush_nlive_blks(nilfs->ns_sufile, mcp);
> +	nilfs_sufile_mc_destroy(mcp);
>  	nilfs_warning(ii->vfs_inode.i_sb, __func__,
>  		      "failed to truncate bmap (ino=%lu, err=%d)",
>  		      ii->vfs_inode.i_ino, ret);
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index 6059f53..dc0070c 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -511,7 +511,8 @@ static int nilfs_collect_file_data(struct nilfs_sc_info *sci,
>  {
>  	int err;
>  
> -	err = nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
> +	err = nilfs_bmap_propagate_with_mc(NILFS_I(inode)->i_bmap,
> +					   sci->sc_mc, bh);
>  	if (err < 0)
>  		return err;
>  
> @@ -526,7 +527,8 @@ static int nilfs_collect_file_node(struct nilfs_sc_info *sci,
>  				   struct buffer_head *bh,
>  				   struct inode *inode)
>  {
> -	return nilfs_bmap_propagate(NILFS_I(inode)->i_bmap, bh);
> +	return nilfs_bmap_propagate_with_mc(NILFS_I(inode)->i_bmap,
> +					    sci->sc_mc, bh);
>  }
>  
>  static int nilfs_collect_file_bmap(struct nilfs_sc_info *sci,
> @@ -1386,7 +1388,7 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
>  		segbuf->sb_nlive_blks_added = segbuf->sb_sum.nfileblk;
>  
>  		if (nilfs_feature_track_live_blks(nilfs))
> -			nilfs_sufile_mod_nlive_blks(sufile, NULL,
> +			nilfs_sufile_mod_nlive_blks(sufile, sci->sc_mc,
>  						segbuf->sb_segnum,
>  						segbuf->sb_nlive_blks_added);
>  	}
> @@ -2014,6 +2016,9 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
>  		}
>  		nilfs_segctor_update_segusage(sci, nilfs);
>  
> +		nilfs_sufile_flush_nlive_blks(nilfs->ns_sufile,
> +					      sci->sc_mc);
> +
>  		/* Write partial segments */
>  		nilfs_segctor_prepare_write(sci);
>  
> @@ -2603,6 +2608,7 @@ static struct nilfs_sc_info *nilfs_segctor_new(struct super_block *sb,
>  {
>  	struct the_nilfs *nilfs = sb->s_fs_info;
>  	struct nilfs_sc_info *sci;
> +	int ret;
>  
>  	sci = kzalloc(sizeof(*sci), GFP_KERNEL);
>  	if (!sci)
> @@ -2633,6 +2639,18 @@ static struct nilfs_sc_info *nilfs_segctor_new(struct super_block *sb,
>  		sci->sc_interval = HZ * nilfs->ns_interval;
>  	if (nilfs->ns_watermark)
>  		sci->sc_watermark = nilfs->ns_watermark;
> +
> +	if (nilfs_feature_track_live_blks(nilfs)) {
> +		sci->sc_mc = kmalloc(sizeof(*(sci->sc_mc)), GFP_KERNEL);
> +		if (sci->sc_mc) {
> +			ret = nilfs_sufile_mc_init(sci->sc_mc,
> +						   NILFS_SUFILE_MC_SIZE_EXT);
> +			if (ret) {
> +				kfree(sci->sc_mc);
> +				sci->sc_mc = NULL;
> +			}
> +		}
> +	}
>  	return sci;
>  }
>  
> @@ -2701,6 +2719,8 @@ static void nilfs_segctor_destroy(struct nilfs_sc_info *sci)
>  	down_write(&nilfs->ns_segctor_sem);
>  
>  	del_timer_sync(&sci->sc_timer);
> +	nilfs_sufile_mc_destroy(sci->sc_mc);
> +	kfree(sci->sc_mc);
>  	kfree(sci);
>  }
>  
> diff --git a/fs/nilfs2/segment.h b/fs/nilfs2/segment.h
> index a48d6de..a857527 100644
> --- a/fs/nilfs2/segment.h
> +++ b/fs/nilfs2/segment.h
> @@ -80,6 +80,7 @@ struct nilfs_cstage {
>  };
>  
>  struct nilfs_segment_buffer;
> +struct nilfs_sufile_mod_cache;
>  
>  struct nilfs_segsum_pointer {
>  	struct buffer_head     *bh;
> @@ -129,6 +130,7 @@ struct nilfs_segsum_pointer {
>   * @sc_watermark: Watermark for the number of dirty buffers
>   * @sc_timer: Timer for segctord
>   * @sc_task: current thread of segctord
> + * @sc_mc: mod cache to add up updates for SUFILE during seg construction
>   */
>  struct nilfs_sc_info {
>  	struct super_block     *sc_super;
> @@ -185,6 +187,7 @@ struct nilfs_sc_info {
>  
>  	struct timer_list	sc_timer;
>  	struct task_struct     *sc_task;
> +	struct nilfs_sufile_mod_cache *sc_mc;
>  };
>  
>  /* sc_flags */

Again, I really hope you can eliminate these changes by hiding the
cache in sufile.

Regards,
Ryusuke Konishi

> -- 
> 2.3.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 8/9] nilfs2: improve accuracy and correct for invalid GC values
       [not found]     ` <1424804504-10914-9-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14  2:50       ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14  2:50 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:01:43 +0100, Andreas Rohner wrote:
> This patch improves the accuracy of the su_nlive_blks segment
> usage field by also counting the blocks of the DAT-File. A block in
> the DAT-File is considered reclaimable as soon as it is overwritten.
> There is no need to consider protection periods, snapshots or
> checkpoints. So whenever a block is overwritten during segment
> construction, the segment usage information of the segment at the
> previous location of the block is decremented. To get the previous
> location of the block the b_blocknr field of the buffer_head
> structure is used.
> 
> SUFILE blocks are counted in a similar way, but if the GC reads a
> block into a GC inode, that already is in the cache, then there are
> two versions of the block. If this happens both versions will be
> counted, which can lead to small seemingly random incorrect values.
> But it is better to accept these small inaccuracies than to not
> count the SUFILE at all. These inaccuracies do not occur for the
> DAT-File, because it does not need a GC inode.
> 
> Additionally the blocks that belong to a GC inode are rechecked if
> they are reclaimable. If so the corresponding counter is
> decremented. The blocks were already checked in userspace, but
> without the proper locking. It is furthermore possible, that blocks
> become reclaimable during the cleaning process. For example by
> deleting checkpoints. To improve the performance of these extra
> checks, flags from userspace are used to determine reclaimability.
> If a block belongs to a snapshot it cannot be reclaimable and if
> it is within the protection period it must be counted as
> reclaimable.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/dat.c     |  70 ++++++++++++++++++++++++++++++++++++
>  fs/nilfs2/dat.h     |   1 +
>  fs/nilfs2/inode.c   |   2 ++
>  fs/nilfs2/segbuf.c  |   4 +++
>  fs/nilfs2/segbuf.h  |   1 +
>  fs/nilfs2/segment.c | 101 ++++++++++++++++++++++++++++++++++++++++++++++++++--
>  6 files changed, 177 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
> index d2c8f7e..63d079c 100644
> --- a/fs/nilfs2/dat.c
> +++ b/fs/nilfs2/dat.c
> @@ -35,6 +35,17 @@
>  #define NILFS_CNO_MAX	(~(__u64)0)
>  
>  /**
> + * nilfs_dat_entry_is_live - check if @entry is alive
> + * @entry: DAT-Entry
> + *
> + * Description: Simple check if @entry is alive in the current checkpoint.
> + */
> +static inline int nilfs_dat_entry_is_live(struct nilfs_dat_entry *entry)
> +{
> +	return entry->de_end == cpu_to_le64(NILFS_CNO_MAX);
> +}
> +

Do not use the "inline" directive in *.c files.  The compiler already
inlines aggressively on its own.  The "noinline" directive, on the
other hand, should be used for functions that we want to prevent from
being inlined.

> +/**
>   * struct nilfs_dat_info - on-memory private data of DAT file
>   * @mi: on-memory private data of metadata file
>   * @palloc_cache: persistent object allocator cache of DAT file
> @@ -391,6 +402,65 @@ int nilfs_dat_move(struct inode *dat, __u64 vblocknr, sector_t blocknr)
>  }
>  
>  /**
> + * nilfs_dat_is_live - checks if the virtual block number is alive
> + * @dat: DAT file inode
> + * @vblocknr: virtual block number
> + * @errp: pointer to return code if error occurred
> + *
> + * Description: nilfs_dat_is_live() looks up the DAT-Entry for
> + * @vblocknr and determines if the corresponding block is alive in the current
> + * checkpoint or not. This check ignores snapshots and protection periods.
> + *
> + * Return Value: 1 if vblocknr is alive and 0 otherwise. On error, 0 is
> + * returned and @errp is set to one of the following negative error codes.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + *
> + * %-ENOENT - A block number associated with @vblocknr does not exist.
> + */
> +int nilfs_dat_is_live(struct inode *dat, __u64 vblocknr, int *errp)
> +{
> +	struct buffer_head *entry_bh, *bh;
> +	struct nilfs_dat_entry *entry;
> +	sector_t blocknr;
> +	void *kaddr;
> +	int ret = 0, err;
> +
> +	err = nilfs_palloc_get_entry_block(dat, vblocknr, 0, &entry_bh);
> +	if (err < 0)
> +		goto out;
> +
> +	if (!nilfs_doing_gc() && buffer_nilfs_redirected(entry_bh)) {
> +		bh = nilfs_mdt_get_frozen_buffer(dat, entry_bh);
> +		if (bh) {
> +			WARN_ON(!buffer_uptodate(bh));
> +			put_bh(entry_bh);
> +			entry_bh = bh;
> +		}
> +	}
> +
> +	kaddr = kmap_atomic(entry_bh->b_page);
> +	entry = nilfs_palloc_block_get_entry(dat, vblocknr, entry_bh, kaddr);
> +	blocknr = le64_to_cpu(entry->de_blocknr);
> +	if (blocknr == 0) {
> +		err = -ENOENT;
> +		goto out_unmap;
> +	}
> +
> +	ret = nilfs_dat_entry_is_live(entry);
> +
> +out_unmap:
> +	kunmap_atomic(kaddr);
> +	put_bh(entry_bh);
> +out:
> +	if (errp)
> +		*errp = err;

Remove the errp argument and return the error code as the return value.
If you'd like to avoid mixing the two, the true/false result should be
returned via an argument instead.
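
A small userspace sketch of that calling convention, for illustration
only.  toy_dat_entry and toy_dat_is_live are hypothetical stand-ins
(a flat array replaces the DAT lookup, and the -EINVAL range check is
an assumption of this sketch); only the split between the returned
error code and the bool out-argument is the point:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CNO_MAX	(~(uint64_t)0)	/* mirrors NILFS_CNO_MAX in the patch */

/* Toy stand-in for a DAT entry; only the fields the check needs. */
struct toy_dat_entry {
	uint64_t blocknr;	/* 0 means no block is assigned */
	uint64_t end;		/* TOY_CNO_MAX while the entry is live */
};

/*
 * Returns 0 on success and stores the answer in @is_live, or a negative
 * error code, in which case @is_live is left untouched.
 */
static int toy_dat_is_live(const struct toy_dat_entry *entries,
			   size_t nentries, uint64_t vblocknr, bool *is_live)
{
	const struct toy_dat_entry *entry;

	assert(entries && is_live);
	if (vblocknr >= nentries)
		return -EINVAL;

	entry = &entries[vblocknr];
	if (entry->blocknr == 0)
		return -ENOENT;

	*is_live = (entry->end == TOY_CNO_MAX);
	return 0;
}
```

A caller can then tell "not live" apart from "lookup failed" without
passing NULL for an error pointer and silently ignoring failures.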

> +	return ret;
> +}
> +
> +/**
>   * nilfs_dat_translate - translate a virtual block number to a block number
>   * @dat: DAT file inode
>   * @vblocknr: virtual block number
> diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
> index d196f09..3cbddd6 100644
> --- a/fs/nilfs2/dat.h
> +++ b/fs/nilfs2/dat.h
> @@ -32,6 +32,7 @@ struct nilfs_palloc_req;
>  struct nilfs_sufile_mod_cache;
>  
>  int nilfs_dat_translate(struct inode *, __u64, sector_t *);
> +int nilfs_dat_is_live(struct inode *, __u64, int *);
>  
>  int nilfs_dat_prepare_alloc(struct inode *, struct nilfs_palloc_req *);
>  void nilfs_dat_commit_alloc(struct inode *, struct nilfs_palloc_req *);
> diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
> index 7f6d056..5412a76 100644
> --- a/fs/nilfs2/inode.c
> +++ b/fs/nilfs2/inode.c
> @@ -90,6 +90,8 @@ int nilfs_get_block(struct inode *inode, sector_t blkoff,
>  	int err = 0, ret;
>  	unsigned maxblocks = bh_result->b_size >> inode->i_blkbits;
>  
> +	bh_result->b_blocknr = 0;
> +

Please do not add this in this big patch as I mentioned before.

>  	down_read(&NILFS_MDT(nilfs->ns_dat)->mi_sem);
>  	ret = nilfs_bmap_lookup_contig(ii->i_bmap, blkoff, &blknum, maxblocks);
>  	up_read(&NILFS_MDT(nilfs->ns_dat)->mi_sem);
> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
> index 7a6e9cd..bbd807b 100644
> --- a/fs/nilfs2/segbuf.c
> +++ b/fs/nilfs2/segbuf.c
> @@ -58,6 +58,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
>  	INIT_LIST_HEAD(&segbuf->sb_payload_buffers);
>  	segbuf->sb_super_root = NULL;
>  	segbuf->sb_nlive_blks_added = 0;
> +	segbuf->sb_nlive_blks_diff = 0;
>  
>  	init_completion(&segbuf->sb_bio_event);
>  	atomic_set(&segbuf->sb_err, 0);
> @@ -451,6 +452,9 @@ static int nilfs_segbuf_submit_bh(struct nilfs_segment_buffer *segbuf,
>  

>  	len = bio_add_page(wi->bio, bh->b_page, bh->b_size, bh_offset(bh));
>  	if (len == bh->b_size) {
> +		lock_buffer(bh);
> +		map_bh(bh, segbuf->sb_super, wi->blocknr + wi->end);
> +		unlock_buffer(bh);
>  		wi->end++;
>  		return 0;
>  	}

ditto.  Stop this.

> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
> index d04da26..4e994f7 100644
> --- a/fs/nilfs2/segbuf.h
> +++ b/fs/nilfs2/segbuf.h
> @@ -84,6 +84,7 @@ struct nilfs_segment_buffer {
>  	sector_t		sb_pseg_start;
>  	unsigned		sb_rest_blocks;
>  	__u32			sb_nlive_blks_added;

> +	__s64			sb_nlive_blks_diff;

sb_nlive_blks_diff is always decremented.  It looks better
to alter this to

	__u32			sb_nlive_blks_deducted;

and increment it. (The term "diff" is ambiguous. Maybe,
it should be "deducted" or so.)
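
A tiny userspace sketch of the suggested bookkeeping, for illustration
only.  toy_segbuf and its helpers are hypothetical, and the sketch
assumes the deduction never exceeds nfileblk, which is what the
"should always be positive" comment in the patch implies:

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the counters in struct nilfs_segment_buffer. */
struct toy_segbuf {
	uint32_t nfileblk;		/* blocks written into this segment */
	uint32_t nlive_blks_deducted;	/* blocks already known to be dead */
	uint32_t nlive_blks_added;	/* net value reported to the sufile */
};

/* One more block of this segment turned out to be reclaimable. */
static void toy_segbuf_deduct(struct toy_segbuf *sb)
{
	sb->nlive_blks_deducted++;
}

/* Compute the net number of live blocks added by this segment. */
static void toy_segbuf_finish(struct toy_segbuf *sb)
{
	/* Assumption of this sketch: deductions are bounded by nfileblk. */
	assert(sb->nlive_blks_deducted <= sb->nfileblk);
	sb->nlive_blks_added = sb->nfileblk - sb->nlive_blks_deducted;
}
```

An unsigned counter that only grows makes the direction of the
adjustment explicit at the point where the two values are combined.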


>  
>  	/* Buffers */
>  	struct list_head	sb_segsum_buffers;
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index dc0070c..16c7c36 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1385,7 +1385,8 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
>  		WARN_ON(ret); /* always succeed because the segusage is dirty */
>  
>  		/* should always be positive */
> -		segbuf->sb_nlive_blks_added = segbuf->sb_sum.nfileblk;
> +		segbuf->sb_nlive_blks_added = segbuf->sb_nlive_blks_diff +
> +					      segbuf->sb_sum.nfileblk;
>  
>  		if (nilfs_feature_track_live_blks(nilfs))
>  			nilfs_sufile_mod_nlive_blks(sufile, sci->sc_mc,
> @@ -1497,12 +1498,98 @@ static void nilfs_list_replace_buffer(struct buffer_head *old_bh,
>  	/* The caller must release old_bh */
>  }
>  
> +/**
> + * nilfs_segctor_dec_nlive_blks_gc - dec. nlive_blks for blocks of GC-Inodes
> + * @dat: dat inode
> + * @segbuf: current segment buffer
> + * @bh: current buffer head
> + *
> + * Description: nilfs_segctor_dec_nlive_blks_gc() is called if the inode to
> + * which @bh belongs is a GC-Inode. In that case it is not necessary to
> + * decrement the previous segment, because at the end of the GC process it
> + * will be freed anyway. It is however necessary to check again if the blocks
> + * are alive here, because the last check was in userspace without the proper
> + * locking. Additionally the blocks protected by the protection period should
> + * be considered reclaimable. It is assumed, that @bh->b_blocknr contains
> + * a virtual block number, which is only true if @bh is part of a GC-Inode.
> + */
> +static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
> +					    struct nilfs_segment_buffer *segbuf,
> +					    struct buffer_head *bh) {
> +	bool isreclaimable = buffer_nilfs_protection_period(bh) ||
> +				!nilfs_dat_is_live(dat, bh->b_blocknr, NULL);
> +
> +	if (!buffer_nilfs_snapshot(bh) && isreclaimable)
> +		segbuf->sb_nlive_blks_diff--;
> +}
> +
> +/**
> + * nilfs_segctor_dec_nlive_blks_nogc - dec. nlive_blks of segment
> + * @nilfs: the nilfs object
> + * @mc: modification cache
> + * @sb: current segment buffer
> + * @blocknr: current block number
> + *
> + * Description: Gets the segment number of the segment @blocknr belongs to
> + * and decrements the su_nlive_blks field of the corresponding segment usage
> + * entry.
> + */
> +static void nilfs_segctor_dec_nlive_blks_nogc(struct the_nilfs *nilfs,
> +					      struct nilfs_sufile_mod_cache *mc,
> +					      struct nilfs_segment_buffer *sb,
> +					      sector_t blocknr)
> +{
> +	__u64 segnum = nilfs_get_segnum_of_block(nilfs, blocknr);
> +
> +	if (segnum >= nilfs->ns_nsegments)
> +		return;
> +
> +	if (segnum == sb->sb_segnum)
> +		sb->sb_nlive_blks_diff--;
> +	else
> +		nilfs_sufile_mod_nlive_blks(nilfs->ns_sufile, mc, segnum, -1);
> +}

As I mentioned before, the sufile shouldn't be changed (to be precise,
newly marked dirty) after the collection phase of the sufile.  This
appears to violate that rule.

Regards,
Ryusuke Konishi

> +
> +/**
> + * nilfs_segctor_dec_nlive_blks - dec. nlive_blks of previous segment
> + * @nilfs: the nilfs object
> + * @mc: modification cache
> + * @sb: current segment buffer
> + * @bh: current buffer head
> + * @ino: current inode number
> + * @gc_inode: true if current inode is a GC-Inode
> + *
> + * Description: Handles GC-Inodes and normal inodes differently. For normal
> + * inodes @bh->b_blocknr contains the location where the block was read in. If
> + * the block is updated, the old version of it is considered reclaimable and so
> + * the su_nlive_blks field of the segment usage information of the old segment
> + * needs to be decremented. Only the DATFILE and SUFILE are decremented here,
> + * because normal files and other meta data files can be better decremented in
> + * nilfs_dat_commit_end().
> + */
> +static void nilfs_segctor_dec_nlive_blks(struct the_nilfs *nilfs,
> +					 struct nilfs_sufile_mod_cache *mc,
> +					 struct nilfs_segment_buffer *sb,
> +					 struct buffer_head *bh,
> +					 ino_t ino,
> +					 bool gc_inode)
> +{
> +	bool isnode = buffer_nilfs_node(bh);
> +
> +	if (gc_inode)
> +		nilfs_segctor_dec_nlive_blks_gc(nilfs->ns_dat, sb, bh);
> +	else if (ino == NILFS_DAT_INO || (ino == NILFS_SUFILE_INO && !isnode))
> +		nilfs_segctor_dec_nlive_blks_nogc(nilfs, mc, sb, bh->b_blocknr);
> +}
> +
>  static int
>  nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>  				     struct nilfs_segment_buffer *segbuf,
>  				     int mode)
>  {
> +	struct the_nilfs *nilfs = sci->sc_super->s_fs_info;
>  	struct inode *inode = NULL;
> +	struct nilfs_inode_info *ii;
>  	sector_t blocknr;
>  	unsigned long nfinfo = segbuf->sb_sum.nfinfo;
>  	unsigned long nblocks = 0, ndatablk = 0;
> @@ -1512,7 +1599,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>  	union nilfs_binfo binfo;
>  	struct buffer_head *bh, *bh_org;
>  	ino_t ino = 0;
> -	int err = 0;
> +	int err = 0, gc_inode = 0, track_live_blks;
> +
> +	track_live_blks = nilfs_feature_track_live_blks(nilfs);
>  
>  	if (!nfinfo)
>  		goto out;
> @@ -1533,6 +1622,9 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>  
>  			inode = bh->b_page->mapping->host;
>  
> +			ii = NILFS_I(inode);
> +			gc_inode = test_bit(NILFS_I_GCINODE, &ii->i_state);
> +
>  			if (mode == SC_LSEG_DSYNC)
>  				sc_op = &nilfs_sc_dsync_ops;
>  			else if (ino == NILFS_DAT_INO)
> @@ -1540,6 +1632,11 @@ nilfs_segctor_update_payload_blocknr(struct nilfs_sc_info *sci,
>  			else /* file blocks */
>  				sc_op = &nilfs_sc_file_ops;
>  		}
> +
> +		if (track_live_blks)
> +			nilfs_segctor_dec_nlive_blks(nilfs, sci->sc_mc, segbuf,
> +						     bh, ino, gc_inode);
> +
>  		bh_org = bh;
>  		get_bh(bh_org);
>  		err = nilfs_bmap_assign(NILFS_I(inode)->i_bmap, &bh, blocknr,
> -- 
> 2.3.0
> 


* Re: [PATCH 7/9] nilfs2: add additional flags for nilfs_vdesc
       [not found]     ` <1424804504-10914-8-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14  3:21       ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14  3:21 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:01:42 +0100, Andreas Rohner wrote:
> This patch adds support for additional bit-flags to the
> nilfs_vdesc structure used by the GC to communicate block
> information from userspace. The field vd_flags cannot be used for
> this purpose, because it does not support bit-flags, and changing
> that would break backwards compatibility. Therefore the padding
> field is renamed to vd_blk_flags to contain more flags.
> 
> Unfortunately older versions of the userspace tools do not
> initialize the padding field to zero. So it is necessary to signal
> to the kernel if the new vd_blk_flags field contains usable flags
> or just random data. Since the vd_period field is only used in
> userspace, and is guaranteed to contain a value that is > 0
> (NILFS_CNO_MIN == 1), it can be used to give the kernel a hint. So
> if the userspace tools set vd_period.p_start to 0, the
> vd_blk_flags field will be interpreted.
> 
> To make the flags available for later stages of the GC process,
> they are mapped to corresponding buffer_head flags.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/ioctl.c         | 23 ++++++++++++++++---
>  fs/nilfs2/page.h          |  6 ++++-
>  include/linux/nilfs2_fs.h | 58 +++++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 81 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/nilfs2/ioctl.c b/fs/nilfs2/ioctl.c
> index f6ee54e..63b1c77 100644
> --- a/fs/nilfs2/ioctl.c
> +++ b/fs/nilfs2/ioctl.c
> @@ -578,7 +578,7 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
>  	struct buffer_head *bh;
>  	int ret;
>  
> -	if (vdesc->vd_flags == 0)
> +	if (nilfs_vdesc_data(vdesc))
>  		ret = nilfs_gccache_submit_read_data(
>  			inode, vdesc->vd_offset, vdesc->vd_blocknr,
>  			vdesc->vd_vblocknr, &bh);
> @@ -592,7 +592,8 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
>  			       "%s: invalid virtual block address (%s): "
>  			       "ino=%llu, cno=%llu, offset=%llu, "
>  			       "blocknr=%llu, vblocknr=%llu\n",
> -			       __func__, vdesc->vd_flags ? "node" : "data",
> +			       __func__,
> +			       nilfs_vdesc_node(vdesc) ? "node" : "data",
>  			       (unsigned long long)vdesc->vd_ino,
>  			       (unsigned long long)vdesc->vd_cno,
>  			       (unsigned long long)vdesc->vd_offset,
> @@ -603,7 +604,8 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
>  	if (unlikely(!list_empty(&bh->b_assoc_buffers))) {
>  		printk(KERN_CRIT "%s: conflicting %s buffer: ino=%llu, "
>  		       "cno=%llu, offset=%llu, blocknr=%llu, vblocknr=%llu\n",
> -		       __func__, vdesc->vd_flags ? "node" : "data",
> +		       __func__,
> +		       nilfs_vdesc_node(vdesc) ? "node" : "data",
>  		       (unsigned long long)vdesc->vd_ino,
>  		       (unsigned long long)vdesc->vd_cno,
>  		       (unsigned long long)vdesc->vd_offset,
> @@ -612,6 +614,12 @@ static int nilfs_ioctl_move_inode_block(struct inode *inode,
>  		brelse(bh);
>  		return -EEXIST;
>  	}
> +
> +	if (nilfs_vdesc_snapshot(vdesc))
> +		set_buffer_nilfs_snapshot(bh);
> +	if (nilfs_vdesc_protection_period(vdesc))
> +		set_buffer_nilfs_protection_period(bh);
> +
>  	list_add_tail(&bh->b_assoc_buffers, buffers);
>  	return 0;
>  }
> @@ -662,6 +670,15 @@ static int nilfs_ioctl_move_blocks(struct super_block *sb,
>  		}
>  
>  		do {
> +			/*
> +			 * old user space tools do not initialize vd_blk_flags
> +			 * if vd_period.p_start > 0 then vd_blk_flags was
> +			 * not initialized properly and may contain invalid
> +			 * flags
> +			 */
> +			if (vdesc->vd_period.p_start > 0)
> +				vdesc->vd_blk_flags = 0;
> +
>  			ret = nilfs_ioctl_move_inode_block(inode, vdesc,
>  							   &buffers);
>  			if (unlikely(ret < 0)) {
> diff --git a/fs/nilfs2/page.h b/fs/nilfs2/page.h
> index a43b828..b9117e6 100644
> --- a/fs/nilfs2/page.h
> +++ b/fs/nilfs2/page.h
> @@ -36,13 +36,17 @@ enum {
>  	BH_NILFS_Volatile,
>  	BH_NILFS_Checked,
>  	BH_NILFS_Redirected,
> +	BH_NILFS_Snapshot,
> +	BH_NILFS_Protection_Period,
>  };
>  
>  BUFFER_FNS(NILFS_Node, nilfs_node)		/* nilfs node buffers */
>  BUFFER_FNS(NILFS_Volatile, nilfs_volatile)
>  BUFFER_FNS(NILFS_Checked, nilfs_checked)	/* buffer is verified */
>  BUFFER_FNS(NILFS_Redirected, nilfs_redirected)	/* redirected to a copy */
> -

> +BUFFER_FNS(NILFS_Snapshot, nilfs_snapshot)	/* belongs to a snapshot */
> +BUFFER_FNS(NILFS_Protection_Period, nilfs_protection_period) /* protected by
> +							protection period */
>  

I propose alternative names, "snapshot_protected" and
"period_protected" (or "time_protected") respectively, to clarify
the meaning of the flags.

>  int __nilfs_clear_page_dirty(struct page *);
>  
> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
> index 6ccb2ad..6ffdc09 100644
> --- a/include/linux/nilfs2_fs.h
> +++ b/include/linux/nilfs2_fs.h
> @@ -900,7 +900,7 @@ struct nilfs_vinfo {
>   * @vd_blocknr: disk block number
>   * @vd_offset: logical block offset inside a file
>   * @vd_flags: flags (data or node block)
> - * @vd_pad: padding
> + * @vd_blk_flags: additional flags
>   */
>  struct nilfs_vdesc {
>  	__u64 vd_ino;
> @@ -910,9 +910,63 @@ struct nilfs_vdesc {
>  	__u64 vd_blocknr;
>  	__u64 vd_offset;
>  	__u32 vd_flags;
> -	__u32 vd_pad;
> +	/*
> +	 * vd_blk_flags needed because vd_flags doesn't support
> +	 * bit-flags because of backwards compatibility
> +	 */
> +	__u32 vd_blk_flags;
>  };
>  

> +/* vdesc flags */
> +enum {
> +	NILFS_VDESC_DATA,
> +	NILFS_VDESC_NODE,
> +
> +	/* ... */
> +};
> +enum {
> +	NILFS_VDESC_SNAPSHOT,
> +	NILFS_VDESC_PROTECTION_PERIOD,
> +
> +	/* ... */
> +
> +	__NR_NILFS_VDESC_FIELDS,
> +};
> +
> +#define NILFS_VDESC_FNS(flag, name)					\
> +static inline void							\
> +nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
> +{									\
> +	vdesc->vd_flags = NILFS_VDESC_##flag;				\
> +}									\
> +static inline int							\
> +nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
> +{									\
> +	return vdesc->vd_flags == NILFS_VDESC_##flag;			\
> +}
> +

Do not add definitions for vd_flags; leave its handling as it is
and simplify your patch.
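
As a rough sketch of what a reduced version could look like, assuming
only the bit-flag helpers for vd_blk_flags are kept and vd_flags is
still compared against 0 as before.  The struct below is a cut-down,
hypothetical copy for userspace experiments (not the real ABI layout),
p_start stands in for vd_period.p_start, and the flag names follow the
naming proposed earlier in this reply:

```c
#include <stdint.h>

/* Reduced, hypothetical copy of the relevant fields of struct nilfs_vdesc. */
struct vdesc_sketch {
	uint64_t p_start;	/* stands in for vd_period.p_start */
	uint32_t vd_flags;	/* 0 = data block, non-zero = node block */
	uint32_t vd_blk_flags;	/* former padding, now used as bit flags */
};

/* Bit positions inside vd_blk_flags (names are assumptions). */
enum {
	VDESC_SNAPSHOT_PROTECTED,	/* bit 0 */
	VDESC_PERIOD_PROTECTED,		/* bit 1 */
};

static inline void vdesc_set_flag(struct vdesc_sketch *vd, int bit)
{
	vd->vd_blk_flags |= (UINT32_C(1) << bit);
}

static inline int vdesc_test_flag(const struct vdesc_sketch *vd, int bit)
{
	return !!(vd->vd_blk_flags & (UINT32_C(1) << bit));
}

/*
 * Old userspace tools leave the padding uninitialized and always pass
 * p_start > 0, so in that case the flags are treated as garbage and
 * cleared before they are interpreted.
 */
static inline void vdesc_sanitize(struct vdesc_sketch *vd)
{
	if (vd->p_start > 0)
		vd->vd_blk_flags = 0;
}
```

The data/node distinction stays a plain comparison of vd_flags, so no
new accessors are needed for it.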

Regards,
Ryusuke Konishi

> +#define NILFS_VDESC_FNS2(flag, name)					\
> +static inline void							\
> +nilfs_vdesc_set_##name(struct nilfs_vdesc *vdesc)			\
> +{									\
> +	vdesc->vd_blk_flags |= (1UL << NILFS_VDESC_##flag);		\
> +}									\
> +static inline void							\
> +nilfs_vdesc_clear_##name(struct nilfs_vdesc *vdesc)			\
> +{									\
> +	vdesc->vd_blk_flags &= ~(1UL << NILFS_VDESC_##flag);		\
> +}									\
> +static inline int							\
> +nilfs_vdesc_##name(const struct nilfs_vdesc *vdesc)			\
> +{									\
> +	return !!(vdesc->vd_blk_flags & (1UL << NILFS_VDESC_##flag));	\
> +}
> +
> +NILFS_VDESC_FNS(DATA, data)
> +NILFS_VDESC_FNS(NODE, node)
> +NILFS_VDESC_FNS2(SNAPSHOT, snapshot)
> +NILFS_VDESC_FNS2(PROTECTION_PERIOD, protection_period)
> +
>  /**
>   * struct nilfs_bdesc - descriptor of disk block number
>   * @bd_ino: inode number
> -- 
> 2.3.0
> 


* Re: [PATCH 5/9] nilfs2: add simple tracking of block deletions and updates
       [not found]     ` <1424804504-10914-6-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14  3:46       ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14  3:46 UTC (permalink / raw)
  To: andreas.rohner-hi6Y0CQ0nG0; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:01:40 +0100, Andreas Rohner wrote:
> This patch adds simple tracking of block deletions and updates for
> all files except the DAT- and the SUFILE-Metadatafiles. It uses the
> fact, that for every block, NILFS2 keeps an entry in the DAT-File
> and stores the checkpoint where it was created and deleted or
> overwritten. So whenever a block is deleted or overwritten
> nilfs_dat_commit_end() is called to update the DAT-Entry. At this
> point this patch simply decrements the su_nlive_blks field of the
> corresponding segment. The value of su_nlive_blks is set at segment
> creation time.
> 
> The blocks of the DAT-File cannot be counted this way, because it
> does not contain any entries about itself, so the function
> nilfs_dat_commit_end() is not called when its blocks are deleted or
> overwritten.
> 
> The SUFILE cannot be counted this way, because it would lead to a
> deadlock. When nilfs_dat_commit_end() is called, the bmap->b_sem is
> held by code way up the call chain. To decrement the SUFILE entry
> the same semaphore has to be acquired. So if the DAT-Entry belongs to
> the SUFILE both semaphores are the same and a deadlock will occur.
> But it works for any other file. So by excluding the SUFILE from
> being counted by the extra parameter count_blocks a deadlock can be
> avoided.
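
The accounting described above can be sketched in userspace roughly as
follows.  This is only an illustration: the struct, the fixed segment
count and the helper names are hypothetical, the segment number is
assumed to be the block number divided by the blocks per segment, and
the guard against underflow is an addition of this sketch:

```c
#include <stdint.h>

#define NSEGMENTS 4	/* arbitrary size for the sketch */

/* Hypothetical stand-in for the parts of the_nilfs and SUFILE used here. */
struct fs_sketch {
	uint64_t blocks_per_segment;
	uint32_t nlive_blks[NSEGMENTS];	/* per-segment live block counters */
};

/* Assumed mapping from a disk block number to its segment number. */
static inline uint64_t segnum_of_block(const struct fs_sketch *fs,
				       uint64_t blocknr)
{
	return blocknr / fs->blocks_per_segment;
}

/*
 * Called when a DAT entry is ended (block deleted or overwritten).
 * count_blocks is 0 for blocks of the SUFILE itself, which must not be
 * counted here because that would take the same semaphore twice.
 */
static inline void commit_end_account(struct fs_sketch *fs, uint64_t blocknr,
				      int count_blocks)
{
	uint64_t segnum;

	/* blocknr == 0: the block was never written, nothing to deduct */
	if (blocknr == 0 || !count_blocks)
		return;

	segnum = segnum_of_block(fs, blocknr);
	if (segnum < NSEGMENTS && fs->nlive_blks[segnum] > 0)
		fs->nlive_blks[segnum]--;
}
```
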
> 
> With the above changes the code does not pass the lock dependency
> checks of the kernel, because all the locks have the same class and
> the order in which the locks are taken is different. Usually it is:
> 
> 1. down_write(&NILFS_MDT(sufile)->mi_sem);
> 2. down_write(&bmap->b_sem);
> 
> Now it can also be reversed, which leads to failed checks:
> 
> 1. down_write(&bmap->b_sem); /* lock of a file other than SUFILE */
> 2. down_write(&NILFS_MDT(sufile)->mi_sem);
> 
> But this is safe as long as the first lock down_write(&bmap->b_sem)
> doesn't belong to the SUFILE.
> 
> It is also possible, that two bmap->b_sem locks have to be taken at
> the same time:
> 
> 1. down_write(&bmap->b_sem); /* lock of a file other than SUFILE */
> 2. down_write(&bmap->b_sem); /* lock of SUFILE */
> 
> Since bmap->b_sem of normal files and the bmap->b_sem of the
> SUFILE have the same lock class, the above behavior would also lead
> to a warning.
> 
> Because of this, it is necessary to introduce two new lock classes
> for the SUFILE. So the bmap->b_sem of the SUFILE gets its own lock
> class and the NILFS_MDT(sufile)->mi_sem as well.
> 
> A new feature compatibility flag
> NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS was added, so that the new
> features introduced by this patch can be enabled or disabled at any
> time.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/bmap.c          |  8 +++++++-
>  fs/nilfs2/bmap.h          |  5 +++--
>  fs/nilfs2/btree.c         |  4 +++-
>  fs/nilfs2/dat.c           | 25 ++++++++++++++++++++-----
>  fs/nilfs2/dat.h           |  7 +++++--
>  fs/nilfs2/direct.c        |  4 +++-
>  fs/nilfs2/mdt.c           |  5 ++++-
>  fs/nilfs2/segbuf.c        |  1 +
>  fs/nilfs2/segbuf.h        |  1 +
>  fs/nilfs2/segment.c       | 25 +++++++++++++++++++++----
>  fs/nilfs2/the_nilfs.c     |  4 ++++
>  fs/nilfs2/the_nilfs.h     | 16 ++++++++++++++++
>  include/linux/nilfs2_fs.h |  4 +++-
>  13 files changed, 91 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/nilfs2/bmap.c b/fs/nilfs2/bmap.c
> index aadbd0b..ecd62ba 100644
> --- a/fs/nilfs2/bmap.c
> +++ b/fs/nilfs2/bmap.c
> @@ -467,6 +467,7 @@ __u64 nilfs_bmap_find_target_in_group(const struct nilfs_bmap *bmap)
>  
>  static struct lock_class_key nilfs_bmap_dat_lock_key;
>  static struct lock_class_key nilfs_bmap_mdt_lock_key;
> +static struct lock_class_key nilfs_bmap_sufile_lock_key;
>  
>  /**
>   * nilfs_bmap_read - read a bmap from an inode
> @@ -498,12 +499,17 @@ int nilfs_bmap_read(struct nilfs_bmap *bmap, struct nilfs_inode *raw_inode)
>  		lockdep_set_class(&bmap->b_sem, &nilfs_bmap_dat_lock_key);
>  		break;
>  	case NILFS_CPFILE_INO:
> -	case NILFS_SUFILE_INO:
>  		bmap->b_ptr_type = NILFS_BMAP_PTR_VS;
>  		bmap->b_last_allocated_key = 0;
>  		bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
>  		lockdep_set_class(&bmap->b_sem, &nilfs_bmap_mdt_lock_key);
>  		break;
> +	case NILFS_SUFILE_INO:
> +		bmap->b_ptr_type = NILFS_BMAP_PTR_VS;
> +		bmap->b_last_allocated_key = 0;
> +		bmap->b_last_allocated_ptr = NILFS_BMAP_INVALID_PTR;
> +		lockdep_set_class(&bmap->b_sem, &nilfs_bmap_sufile_lock_key);
> +		break;
>  	case NILFS_IFILE_INO:
>  		lockdep_set_class(&bmap->b_sem, &nilfs_bmap_mdt_lock_key);
>  		/* Fall through */
> diff --git a/fs/nilfs2/bmap.h b/fs/nilfs2/bmap.h
> index b89e680..718c814 100644
> --- a/fs/nilfs2/bmap.h
> +++ b/fs/nilfs2/bmap.h
> @@ -222,8 +222,9 @@ static inline void nilfs_bmap_commit_end_ptr(struct nilfs_bmap *bmap,
>  					     struct inode *dat)
>  {
>  	if (dat)
> -		nilfs_dat_commit_end(dat, &req->bpr_req,
> -				     bmap->b_ptr_type == NILFS_BMAP_PTR_VS);
> +		nilfs_dat_commit_end(dat, &req->bpr_req, NULL,
> +				     bmap->b_ptr_type == NILFS_BMAP_PTR_VS,
> +				     bmap->b_inode->i_ino != NILFS_SUFILE_INO);
>  }
>  
>  static inline void nilfs_bmap_abort_end_ptr(struct nilfs_bmap *bmap,
> diff --git a/fs/nilfs2/btree.c b/fs/nilfs2/btree.c
> index b2e3ff3..2af0519 100644
> --- a/fs/nilfs2/btree.c
> +++ b/fs/nilfs2/btree.c
> @@ -1851,7 +1851,9 @@ static void nilfs_btree_commit_update_v(struct nilfs_bmap *btree,
>  
>  	nilfs_dat_commit_update(dat, &path[level].bp_oldreq.bpr_req,
>  				&path[level].bp_newreq.bpr_req,
> -				btree->b_ptr_type == NILFS_BMAP_PTR_VS);
> +				NULL,
> +				btree->b_ptr_type == NILFS_BMAP_PTR_VS,
> +				btree->b_inode->i_ino != NILFS_SUFILE_INO);
>  
>  	if (buffer_nilfs_node(path[level].bp_bh)) {
>  		nilfs_btnode_commit_change_key(
> diff --git a/fs/nilfs2/dat.c b/fs/nilfs2/dat.c
> index 0d5fada..d2c8f7e 100644
> --- a/fs/nilfs2/dat.c
> +++ b/fs/nilfs2/dat.c
> @@ -28,6 +28,7 @@
>  #include "mdt.h"
>  #include "alloc.h"
>  #include "dat.h"
> +#include "sufile.h"
>  
>  
>  #define NILFS_CNO_MIN	((__u64)1)
> @@ -185,12 +186,14 @@ int nilfs_dat_prepare_end(struct inode *dat, struct nilfs_palloc_req *req)
>  }
>  
>  void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
> -			  int dead)
> +			  struct nilfs_sufile_mod_cache *mc,
> +			  int dead, int count_blocks)
>  {
>  	struct nilfs_dat_entry *entry;
> -	__u64 start, end;
> +	__u64 start, end, segnum;
>  	sector_t blocknr;
>  	void *kaddr;
> +	struct the_nilfs *nilfs;
>  
>  	kaddr = kmap_atomic(req->pr_entry_bh->b_page);
>  	entry = nilfs_palloc_block_get_entry(dat, req->pr_entry_nr,
> @@ -206,8 +209,18 @@ void nilfs_dat_commit_end(struct inode *dat, struct nilfs_palloc_req *req,
>  
>  	if (blocknr == 0)
>  		nilfs_dat_commit_free(dat, req);
> -	else
> +	else {
>  		nilfs_dat_commit_entry(dat, req);
> +
> +		nilfs = dat->i_sb->s_fs_info;
> +
> +		if (count_blocks && nilfs_feature_track_live_blks(nilfs)) {
> +			segnum = nilfs_get_segnum_of_block(nilfs, blocknr);
> +
> +			nilfs_sufile_mod_nlive_blks(nilfs->ns_sufile, mc,
> +						    segnum, -1);
> +		}
> +	}
>  }
>  
>  void nilfs_dat_abort_end(struct inode *dat, struct nilfs_palloc_req *req)
> @@ -246,9 +259,11 @@ int nilfs_dat_prepare_update(struct inode *dat,
>  
>  void nilfs_dat_commit_update(struct inode *dat,
>  			     struct nilfs_palloc_req *oldreq,
> -			     struct nilfs_palloc_req *newreq, int dead)
> +			     struct nilfs_palloc_req *newreq,
> +			     struct nilfs_sufile_mod_cache *mc,
> +			     int dead, int count_blocks)
>  {
> -	nilfs_dat_commit_end(dat, oldreq, dead);
> +	nilfs_dat_commit_end(dat, oldreq, mc, dead, count_blocks);
>  	nilfs_dat_commit_alloc(dat, newreq);
>  }
>  
> diff --git a/fs/nilfs2/dat.h b/fs/nilfs2/dat.h
> index cbd8e97..d196f09 100644
> --- a/fs/nilfs2/dat.h
> +++ b/fs/nilfs2/dat.h
> @@ -29,6 +29,7 @@
>  
>  
>  struct nilfs_palloc_req;
> +struct nilfs_sufile_mod_cache;
>  
>  int nilfs_dat_translate(struct inode *, __u64, sector_t *);
>  
> @@ -39,12 +40,14 @@ int nilfs_dat_prepare_start(struct inode *, struct nilfs_palloc_req *);
>  void nilfs_dat_commit_start(struct inode *, struct nilfs_palloc_req *,
>  			    sector_t);
>  int nilfs_dat_prepare_end(struct inode *, struct nilfs_palloc_req *);
> -void nilfs_dat_commit_end(struct inode *, struct nilfs_palloc_req *, int);
> +void nilfs_dat_commit_end(struct inode *, struct nilfs_palloc_req *,
> +			  struct nilfs_sufile_mod_cache *, int, int);
>  void nilfs_dat_abort_end(struct inode *, struct nilfs_palloc_req *);
>  int nilfs_dat_prepare_update(struct inode *, struct nilfs_palloc_req *,
>  			     struct nilfs_palloc_req *);
>  void nilfs_dat_commit_update(struct inode *, struct nilfs_palloc_req *,
> -			     struct nilfs_palloc_req *, int);
> +			     struct nilfs_palloc_req *,
> +			     struct nilfs_sufile_mod_cache *, int, int);
>  void nilfs_dat_abort_update(struct inode *, struct nilfs_palloc_req *,
>  			    struct nilfs_palloc_req *);
>  
> diff --git a/fs/nilfs2/direct.c b/fs/nilfs2/direct.c
> index 82f4865..e022cfb 100644
> --- a/fs/nilfs2/direct.c
> +++ b/fs/nilfs2/direct.c
> @@ -272,7 +272,9 @@ static int nilfs_direct_propagate(struct nilfs_bmap *bmap,
>  		if (ret < 0)
>  			return ret;
>  		nilfs_dat_commit_update(dat, &oldreq, &newreq,
> -					bmap->b_ptr_type == NILFS_BMAP_PTR_VS);
> +				NULL,
> +				bmap->b_ptr_type == NILFS_BMAP_PTR_VS,
> +				bmap->b_inode->i_ino != NILFS_SUFILE_INO);
>  		set_buffer_nilfs_volatile(bh);
>  		nilfs_direct_set_ptr(bmap, key, newreq.pr_entry_nr);
>  	} else
> diff --git a/fs/nilfs2/mdt.c b/fs/nilfs2/mdt.c
> index 892cf5f..2a81f82 100644
> --- a/fs/nilfs2/mdt.c
> +++ b/fs/nilfs2/mdt.c
> @@ -414,7 +414,7 @@ static const struct address_space_operations def_mdt_aops = {
>  
>  static const struct inode_operations def_mdt_iops;
>  static const struct file_operations def_mdt_fops;
> -
> +static struct lock_class_key nilfs_mdt_mi_sufile_lock_key;
>  
>  int nilfs_mdt_init(struct inode *inode, gfp_t gfp_mask, size_t objsz)
>  {
> @@ -427,6 +427,9 @@ int nilfs_mdt_init(struct inode *inode, gfp_t gfp_mask, size_t objsz)
>  	init_rwsem(&mi->mi_sem);
>  	inode->i_private = mi;
>  
> +	if (inode->i_ino == NILFS_SUFILE_INO)
> +		lockdep_set_class(&mi->mi_sem, &nilfs_mdt_mi_sufile_lock_key);
> +
>  	inode->i_mode = S_IFREG;
>  	mapping_set_gfp_mask(inode->i_mapping, gfp_mask);
>  
> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
> index dc3a9efd..7a6e9cd 100644
> --- a/fs/nilfs2/segbuf.c
> +++ b/fs/nilfs2/segbuf.c
> @@ -57,6 +57,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
>  	INIT_LIST_HEAD(&segbuf->sb_segsum_buffers);
>  	INIT_LIST_HEAD(&segbuf->sb_payload_buffers);
>  	segbuf->sb_super_root = NULL;
> +	segbuf->sb_nlive_blks_added = 0;
>  
>  	init_completion(&segbuf->sb_bio_event);
>  	atomic_set(&segbuf->sb_err, 0);
> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
> index b04f08c..d04da26 100644
> --- a/fs/nilfs2/segbuf.h
> +++ b/fs/nilfs2/segbuf.h
> @@ -83,6 +83,7 @@ struct nilfs_segment_buffer {
>  	sector_t		sb_fseg_start, sb_fseg_end;
>  	sector_t		sb_pseg_start;
>  	unsigned		sb_rest_blocks;
> +	__u32			sb_nlive_blks_added;
>  
>  	/* Buffers */
>  	struct list_head	sb_segsum_buffers;
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index 469086b..6059f53 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1367,9 +1367,10 @@ static void nilfs_free_incomplete_logs(struct list_head *logs,
>  }
>  
>  static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
> -					  struct inode *sufile)
> +					  struct the_nilfs *nilfs)
>  {
>  	struct nilfs_segment_buffer *segbuf;
> +	struct inode *sufile = nilfs->ns_sufile;
>  	unsigned long live_blocks;
>  	int ret;
>  
> @@ -1380,12 +1381,22 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
>  						     live_blocks,
>  						     sci->sc_seg_ctime);
>  		WARN_ON(ret); /* always succeed because the segusage is dirty */
> +
> +		/* should always be positive */
> +		segbuf->sb_nlive_blks_added = segbuf->sb_sum.nfileblk;
> +
> +		if (nilfs_feature_track_live_blks(nilfs))
> +			nilfs_sufile_mod_nlive_blks(sufile, NULL,
> +						segbuf->sb_segnum,
> +						segbuf->sb_nlive_blks_added);
>  	}
>  }
>  
> -static void nilfs_cancel_segusage(struct list_head *logs, struct inode *sufile)
> +static void nilfs_cancel_segusage(struct list_head *logs,
> +				  struct the_nilfs *nilfs)
>  {
>  	struct nilfs_segment_buffer *segbuf;
> +	struct inode *sufile = nilfs->ns_sufile;
>  	int ret;
>  
>  	segbuf = NILFS_FIRST_SEGBUF(logs);
> @@ -1394,6 +1405,12 @@ static void nilfs_cancel_segusage(struct list_head *logs, struct inode *sufile)
>  					     segbuf->sb_fseg_start, 0);
>  	WARN_ON(ret); /* always succeed because the segusage is dirty */
>  
> +	if (nilfs_feature_track_live_blks(nilfs))
> +		nilfs_sufile_mod_nlive_blks(sufile, NULL, segbuf->sb_segnum,
> +					-((__s64)segbuf->sb_nlive_blks_added));
> +
> +	segbuf->sb_nlive_blks_added = 0;
> +
>  	list_for_each_entry_continue(segbuf, logs, sb_list) {
>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>  						     0, 0);
> @@ -1729,7 +1746,7 @@ static void nilfs_segctor_abort_construction(struct nilfs_sc_info *sci,
>  	nilfs_abort_logs(&logs, ret ? : err);
>  
>  	list_splice_tail_init(&sci->sc_segbufs, &logs);
> -	nilfs_cancel_segusage(&logs, nilfs->ns_sufile);
> +	nilfs_cancel_segusage(&logs, nilfs);
>  	nilfs_free_incomplete_logs(&logs, nilfs);
>  
>  	if (sci->sc_stage.flags & NILFS_CF_SUFREED) {
> @@ -1995,7 +2012,7 @@ static int nilfs_segctor_do_construct(struct nilfs_sc_info *sci, int mode)
>  
>  			nilfs_segctor_fill_in_super_root(sci, nilfs);
>  		}
> -		nilfs_segctor_update_segusage(sci, nilfs->ns_sufile);
> +		nilfs_segctor_update_segusage(sci, nilfs);
>  
>  		/* Write partial segments */
>  		nilfs_segctor_prepare_write(sci);

Please separate the changes below into their own patch.

> diff --git a/fs/nilfs2/the_nilfs.c b/fs/nilfs2/the_nilfs.c
> index 69bd801..606fdfc 100644
> --- a/fs/nilfs2/the_nilfs.c
> +++ b/fs/nilfs2/the_nilfs.c
> @@ -630,6 +630,10 @@ int init_nilfs(struct the_nilfs *nilfs, struct super_block *sb, char *data)
>  	get_random_bytes(&nilfs->ns_next_generation,
>  			 sizeof(nilfs->ns_next_generation));
>  
> +	nilfs->ns_feature_compat = le64_to_cpu(sbp->s_feature_compat);
> +	nilfs->ns_feature_compat_ro = le64_to_cpu(sbp->s_feature_compat_ro);
> +	nilfs->ns_feature_incompat = le64_to_cpu(sbp->s_feature_incompat);
> +
>  	err = nilfs_store_disk_layout(nilfs, sbp);
>  	if (err)
>  		goto failed_sbh;
> diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
> index 23778d3..87cab10 100644
> --- a/fs/nilfs2/the_nilfs.h
> +++ b/fs/nilfs2/the_nilfs.h
> @@ -101,6 +101,9 @@ enum {
>   * @ns_dev_kobj: /sys/fs/<nilfs>/<device>
>   * @ns_dev_kobj_unregister: completion state
>   * @ns_dev_subgroups: <device> subgroups pointer
> + * @ns_feature_compat: Compatible feature set
> + * @ns_feature_compat_ro: Read-only compatible feature set
> + * @ns_feature_incompat: Incompatible feature set
>   */
>  struct the_nilfs {
>  	unsigned long		ns_flags;
> @@ -201,6 +204,11 @@ struct the_nilfs {
>  	struct kobject ns_dev_kobj;
>  	struct completion ns_dev_kobj_unregister;
>  	struct nilfs_sysfs_dev_subgroups *ns_dev_subgroups;
> +
> +	/* Features */
> +	__u64			ns_feature_compat;
> +	__u64			ns_feature_compat_ro;
> +	__u64			ns_feature_incompat;
>  };
>  
>  #define THE_NILFS_FNS(bit, name)					\
> @@ -393,4 +401,12 @@ static inline int nilfs_flush_device(struct the_nilfs *nilfs)
>  	return err;
>  }
>  
> +static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
> +{
> +	return (nilfs->ns_feature_compat &
> +		NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS) &&
> +		(nilfs->ns_feature_compat &
> +		NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
> +}
> +

This should be written as below:

static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
{
	const __u64 required_bits = NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS |
				    NILFS_FEATURE_COMPAT_SUFILE_EXTENSION;

	return ((nilfs->ns_feature_compat & required_bits) == required_bits);
}

Alternatively, you can drop the track flag at mount time if the
NILFS_FEATURE_COMPAT_SUFILE_EXTENSION flag is not set or
nilfs_sufile_ext_supported(sufile) is false.
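
Both variants can be tried out in userspace with a sketch like the
following.  The flag values are copied from the patch, but the function
names are hypothetical and sufile_ext_supported is a plain parameter
standing in for the result of nilfs_sufile_ext_supported(sufile):

```c
#include <stdint.h>

/* Flag values as defined in the patch. */
#define FEATURE_COMPAT_SUFILE_EXTENSION	(1ULL << 0)
#define FEATURE_COMPAT_TRACK_LIVE_BLKS	(1ULL << 1)

/* Variant 1: test that both required bits are set with a single mask. */
static inline int feature_track_live_blks(uint64_t feature_compat)
{
	const uint64_t required_bits = FEATURE_COMPAT_TRACK_LIVE_BLKS |
				       FEATURE_COMPAT_SUFILE_EXTENSION;

	return (feature_compat & required_bits) == required_bits;
}

/*
 * Variant 2: clear the track flag once at mount time if it cannot be
 * honoured, so later code only needs to test the track bit itself.
 */
static inline uint64_t drop_unusable_track_flag(uint64_t feature_compat,
						int sufile_ext_supported)
{
	if (!(feature_compat & FEATURE_COMPAT_SUFILE_EXTENSION) ||
	    !sufile_ext_supported)
		feature_compat &= ~FEATURE_COMPAT_TRACK_LIVE_BLKS;
	return feature_compat;
}
```

The second variant moves the dependency check out of the hot path, at
the cost of modifying the in-memory feature set during mount.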

Regards,
Ryusuke Konishi

>  #endif /* _THE_NILFS_H */
> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
> index 5d83c55..6ccb2ad 100644
> --- a/include/linux/nilfs2_fs.h
> +++ b/include/linux/nilfs2_fs.h
> @@ -221,10 +221,12 @@ struct nilfs_super_block {
>   * doesn't know about, it should refuse to mount the filesystem.
>   */
>  #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
> +#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
>  
>  #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
>  
> -#define NILFS_FEATURE_COMPAT_SUPP	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION
> +#define NILFS_FEATURE_COMPAT_SUPP	(NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
> +				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
>  #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>  #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
>  
> -- 
> 2.3.0
> 


* Re: [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]     ` <1424804504-10914-10-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14  3:51       ` Ryusuke Konishi
       [not found]         ` <20150314.125109.1017248837083480553.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14  3:51 UTC (permalink / raw)
  To: andreas.rohner-hi6Y0CQ0nG0; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:01:44 +0100, Andreas Rohner wrote:
> It doesn't really matter if the number of reclaimable blocks for a
> segment is inaccurate, as long as the overall performance is better than
> the simple timestamp algorithm and starvation is prevented.
> 
> The following steps will lead to starvation of a segment:
> 
> 1. The segment is written
> 2. A snapshot is created
> 3. The files in the segment are deleted and the number of live
>    blocks for the segment is decremented to a very low value
> 4. The GC tries to free the segment, but there are no reclaimable
>    blocks, because they are all protected by the snapshot. To prevent an
>    infinite loop the GC has to adjust the number of live blocks to the
>    correct value.
> 5. The snapshot is converted to a checkpoint and the blocks in the
>    segment are now reclaimable.
> 6. The GC will never attempt to clean the segment again, because it
>    incorrectly shows up as having a high number of live blocks.
> 
> To prevent this, the already existing padding field of the SUFILE entry
> is used to track the number of snapshot blocks in the segment. This
> number is only set by the GC, since it collects the necessary
> information anyway. So there is no need to track which block belongs to
> which segment. In step 4 of the list above the GC will set the new field
> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
> reduced.
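
For readers following the thread, the heuristic described in the quoted
text can be sketched in userspace C roughly as follows.  This is only an
illustration under assumptions: the struct and function names are
hypothetical, and the values are kept in host byte order, whereas the
on-disk fields are __le32 and kernel code would go through
le32_to_cpu()/cpu_to_le32().

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory stand-in for one SUFILE entry (host byte order). */
struct seg_usage_sketch {
	uint32_t nblocks;        /* blocks written to the segment */
	uint32_t nlive_blks;     /* estimated number of live blocks */
	uint32_t nsnapshot_blks; /* blocks the GC found protected by snapshots */
};

/*
 * Cap the live-block estimate of every segment in which more than half of
 * the blocks were last seen as snapshot-protected.  Returns the number of
 * entries that were changed.
 */
static size_t fix_starving_segs_sketch(struct seg_usage_sketch *su,
				       size_t nsegs,
				       uint32_t blocks_per_segment)
{
	uint32_t half = blocks_per_segment / 2;
	size_t i, nfixed = 0;

	for (i = 0; i < nsegs; i++) {
		if (su[i].nsnapshot_blks <= half)
			continue;	/* not marked as potentially starving */
		if (su[i].nlive_blks <= half)
			continue;	/* already attractive enough for the GC */
		su[i].nlive_blks = half;
		nfixed++;
	}
	return nfixed;
}
```

Only segments that are both marked by the GC and still carry a high
live-block estimate are touched, so the scan leaves accurate counters
alone.
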
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/cpfile.c        |   5 ++
>  fs/nilfs2/segbuf.c        |   1 +
>  fs/nilfs2/segbuf.h        |   1 +
>  fs/nilfs2/segment.c       |   7 ++-
>  fs/nilfs2/sufile.c        | 114 ++++++++++++++++++++++++++++++++++++++++++----
>  fs/nilfs2/sufile.h        |   4 +-
>  fs/nilfs2/the_nilfs.h     |   7 +++
>  include/linux/nilfs2_fs.h |  12 +++--
>  8 files changed, 136 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c
> index 0d58075..6b61fd7 100644
> --- a/fs/nilfs2/cpfile.c
> +++ b/fs/nilfs2/cpfile.c
> @@ -28,6 +28,7 @@
>  #include <linux/nilfs2_fs.h>
>  #include "mdt.h"
>  #include "cpfile.h"
> +#include "sufile.h"
>  
>  
>  static inline unsigned long
> @@ -703,6 +704,7 @@ static int nilfs_cpfile_clear_snapshot(struct inode *cpfile, __u64 cno)
>  	struct nilfs_cpfile_header *header;
>  	struct nilfs_checkpoint *cp;
>  	struct nilfs_snapshot_list *list;
> +	struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
>  	__u64 next, prev;
>  	void *kaddr;
>  	int ret;
> @@ -784,6 +786,9 @@ static int nilfs_cpfile_clear_snapshot(struct inode *cpfile, __u64 cno)
>  	mark_buffer_dirty(header_bh);
>  	nilfs_mdt_mark_dirty(cpfile);
>  
> +	if (nilfs_feature_track_snapshots(nilfs))
> +		nilfs_sufile_fix_starving_segs(nilfs->ns_sufile);
> +
>  	brelse(prev_bh);
>  
>   out_next:
> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
> index bbd807b..a98c576 100644
> --- a/fs/nilfs2/segbuf.c
> +++ b/fs/nilfs2/segbuf.c
> @@ -59,6 +59,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
>  	segbuf->sb_super_root = NULL;
>  	segbuf->sb_nlive_blks_added = 0;
>  	segbuf->sb_nlive_blks_diff = 0;
> +	segbuf->sb_nsnapshot_blks = 0;
>  
>  	init_completion(&segbuf->sb_bio_event);
>  	atomic_set(&segbuf->sb_err, 0);
> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
> index 4e994f7..7a462c4 100644
> --- a/fs/nilfs2/segbuf.h
> +++ b/fs/nilfs2/segbuf.h
> @@ -85,6 +85,7 @@ struct nilfs_segment_buffer {
>  	unsigned		sb_rest_blocks;
>  	__u32			sb_nlive_blks_added;
>  	__s64			sb_nlive_blks_diff;
> +	__u32			sb_nsnapshot_blks;
>  
>  	/* Buffers */
>  	struct list_head	sb_segsum_buffers;
> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
> index 16c7c36..b976198 100644
> --- a/fs/nilfs2/segment.c
> +++ b/fs/nilfs2/segment.c
> @@ -1381,6 +1381,7 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
>  			(segbuf->sb_pseg_start - segbuf->sb_fseg_start);
>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>  						     live_blocks,
> +						     segbuf->sb_nsnapshot_blks,
>  						     sci->sc_seg_ctime);
>  		WARN_ON(ret); /* always succeed because the segusage is dirty */
>  
> @@ -1405,7 +1406,7 @@ static void nilfs_cancel_segusage(struct list_head *logs,
>  	segbuf = NILFS_FIRST_SEGBUF(logs);
>  	ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>  					     segbuf->sb_pseg_start -
> -					     segbuf->sb_fseg_start, 0);
> +					     segbuf->sb_fseg_start, 0, 0);
>  	WARN_ON(ret); /* always succeed because the segusage is dirty */
>  
>  	if (nilfs_feature_track_live_blks(nilfs))
> @@ -1416,7 +1417,7 @@ static void nilfs_cancel_segusage(struct list_head *logs,
>  
>  	list_for_each_entry_continue(segbuf, logs, sb_list) {
>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
> -						     0, 0);
> +						     0, 0, 0);
>  		WARN_ON(ret); /* always succeed */
>  	}
>  }
> @@ -1521,6 +1522,8 @@ static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
>  
>  	if (!buffer_nilfs_snapshot(bh) && isreclaimable)
>  		segbuf->sb_nlive_blks_diff--;
> +	if (buffer_nilfs_snapshot(bh))
> +		segbuf->sb_nsnapshot_blks++;
>  }
>  
>  /**
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index 574a77e..a6dc7bf 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -468,7 +468,7 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 *data,
>  	su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
>  	if (nilfs_sufile_ext_supported(sufile)) {
>  		su->su_nlive_blks = cpu_to_le32(0);
> -		su->su_pad = cpu_to_le32(0);
> +		su->su_nsnapshot_blks = cpu_to_le32(0);
>  		su->su_nlive_lastmod = cpu_to_le64(0);
>  	}
>  	kunmap_atomic(kaddr);
> @@ -538,7 +538,8 @@ int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum)
>   * @modtime: modification time (option)
>   */
>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
> -				   unsigned long nblocks, time_t modtime)
> +				   unsigned long nblocks, __u32 nsnapshot_blks,
> +				   time_t modtime)
>  {
>  	struct buffer_head *bh;
>  	struct nilfs_segment_usage *su;
> @@ -556,9 +557,18 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>  	if (modtime)
>  		su->su_lastmod = cpu_to_le64(modtime);
>  	su->su_nblocks = cpu_to_le32(nblocks);
> -	if (nilfs_sufile_ext_supported(sufile) &&
> -	    nblocks < le32_to_cpu(su->su_nlive_blks))
> -		su->su_nlive_blks = su->su_nblocks;
> +	if (nilfs_sufile_ext_supported(sufile)) {
> +		if (nblocks < le32_to_cpu(su->su_nlive_blks))
> +			su->su_nlive_blks = su->su_nblocks;
> +
> +		nsnapshot_blks += le32_to_cpu(su->su_nsnapshot_blks);
> +
> +		if (nblocks < nsnapshot_blks)
> +			nsnapshot_blks = nblocks;
> +
> +		su->su_nsnapshot_blks = cpu_to_le32(nsnapshot_blks);
> +	}
> +
>  	kunmap_atomic(kaddr);
>  
>  	mark_buffer_dirty(bh);
> @@ -891,7 +901,7 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
>  
>  			if (sisz >= NILFS_EXT_SUINFO_SIZE) {
>  				si->sui_nlive_blks = nlb;
> -				si->sui_pad = 0;
> +				si->sui_nsnapshot_blks = 0;
>  				si->sui_nlive_lastmod = lm;
>  			}
>  		}
> @@ -939,6 +949,7 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>  	int ret = 0;
>  	bool sup_ext = (supsz >= NILFS_EXT_SUINFO_UPDATE_SIZE);
>  	bool su_ext = nilfs_sufile_ext_supported(sufile);
> +	bool supsu_ext = sup_ext && su_ext;
>  
>  	if (unlikely(nsup == 0))
>  		return ret;
> @@ -952,6 +963,10 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>  				nilfs->ns_blocks_per_segment)
>  			|| (nilfs_suinfo_update_nlive_blks(sup) && sup_ext &&
>  				sup->sup_sui.sui_nlive_blks >
> +				nilfs->ns_blocks_per_segment)
> +			|| (nilfs_suinfo_update_nsnapshot_blks(sup) &&
> +				sup_ext &&
> +				sup->sup_sui.sui_nsnapshot_blks >
>  				nilfs->ns_blocks_per_segment))
>  			return -EINVAL;
>  	}
> @@ -979,11 +994,15 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>  		if (nilfs_suinfo_update_nblocks(sup))
>  			su->su_nblocks = cpu_to_le32(sup->sup_sui.sui_nblocks);
>  
> -		if (nilfs_suinfo_update_nlive_blks(sup) && sup_ext && su_ext)
> +		if (nilfs_suinfo_update_nlive_blks(sup) && supsu_ext)
>  			su->su_nlive_blks =
>  				cpu_to_le32(sup->sup_sui.sui_nlive_blks);
>  
> -		if (nilfs_suinfo_update_nlive_lastmod(sup) && sup_ext && su_ext)
> +		if (nilfs_suinfo_update_nsnapshot_blks(sup) && supsu_ext)
> +			su->su_nsnapshot_blks =
> +				cpu_to_le32(sup->sup_sui.sui_nsnapshot_blks);
> +
> +		if (nilfs_suinfo_update_nlive_lastmod(sup) && supsu_ext)
>  			su->su_nlive_lastmod =
>  				cpu_to_le64(sup->sup_sui.sui_nlive_lastmod);
>  
> @@ -1050,6 +1069,85 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>  }
>  
>  /**
> + * nilfs_sufile_fix_starving_segs - fix potentially starving segments
> + * @sufile: inode of segment usage file
> + *
> + * Description: Scans for segments, which are potentially starving and
> + * reduces the number of live blocks to less than half of the maximum
> + * number of blocks in a segment. This way the segment is more likely to be
> + * chosen by the GC. A segment is marked as potentially starving, if more
> + * than half of the blocks it contains are protected by snapshots.
> + *
> + * Return Value: On success, 0 is returned and on error, one of the
> + * following negative error codes is returned.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + */
> +int nilfs_sufile_fix_starving_segs(struct inode *sufile)
> +{
> +	struct buffer_head *su_bh;
> +	struct nilfs_segment_usage *su;
> +	size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
> +	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
> +	void *kaddr;
> +	unsigned long nsegs, segusages_per_block;
> +	__u32 max_segblks = nilfs->ns_blocks_per_segment / 2;
> +	__u64 segnum = 0;
> +	int ret = 0, blkdirty, dirty = 0;
> +
> +	down_write(&NILFS_MDT(sufile)->mi_sem);
> +
> +	segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
> +	nsegs = nilfs_sufile_get_nsegments(sufile);
> +
> +	while (segnum < nsegs) {
> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
> +							 nsegs - 1);
> +
> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
> +							   0, &su_bh);
> +		if (ret < 0) {
> +			if (ret != -ENOENT)
> +				goto out;
> +			/* hole */
> +			segnum += n;
> +			continue;
> +		}
> +
> +		kaddr = kmap_atomic(su_bh->b_page);
> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
> +							  su_bh, kaddr);
> +		blkdirty = 0;
> +		for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
> +			if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
> +				continue;
> +
> +			if (su->su_nlive_blks <= max_segblks)
> +				continue;
> +
> +			su->su_nlive_blks = max_segblks;
> +			blkdirty = 1;
> +		}
> +
> +		kunmap_atomic(kaddr);
> +		if (blkdirty) {
> +			mark_buffer_dirty(su_bh);
> +			dirty = 1;
> +		}
> +		put_bh(su_bh);
> +	}
> +
> +out:
> +	if (dirty)
> +		nilfs_mdt_mark_dirty(sufile);
> +
> +	up_write(&NILFS_MDT(sufile)->mi_sem);
> +	return ret;
> +}
> +
> +/**
>   * nilfs_sufile_trim_fs() - trim ioctl handle function
>   * @sufile: inode of segment usage file
>   * @range: fstrim_range structure
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index ae3c52a..e831622 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -45,7 +45,8 @@ int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
>  int nilfs_sufile_alloc(struct inode *, __u64 *);
>  int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
> -				   unsigned long nblocks, time_t modtime);
> +				   unsigned long nblocks, __u32 nsnapshot_blks,
> +				   time_t modtime);
>  int nilfs_sufile_get_stat(struct inode *, struct nilfs_sustat *);
>  ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
>  				size_t);
> @@ -72,6 +73,7 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>  int nilfs_sufile_read(struct super_block *sb, size_t susize,
>  		      struct nilfs_inode *raw_inode, struct inode **inodep);
>  int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
> +int nilfs_sufile_fix_starving_segs(struct inode *);
>  
>  /**
>   * nilfs_sufile_scrap - make a segment garbage
> diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
> index 87cab10..3d495f1 100644
> --- a/fs/nilfs2/the_nilfs.h
> +++ b/fs/nilfs2/the_nilfs.h
> @@ -409,4 +409,11 @@ static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
>  		NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
>  }
>  
> +static inline int nilfs_feature_track_snapshots(struct the_nilfs *nilfs)
> +{
> +	return (nilfs->ns_feature_compat &
> +		NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS) &&
> +		nilfs_feature_track_live_blks(nilfs);
> +}
> +
>  #endif /* _THE_NILFS_H */
> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
> index 6ffdc09..a3c7593 100644
> --- a/include/linux/nilfs2_fs.h
> +++ b/include/linux/nilfs2_fs.h
> @@ -222,11 +222,13 @@ struct nilfs_super_block {
>   */
>  #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
>  #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
> +#define NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS		(1ULL << 2)
>  
>  #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
>  
>  #define NILFS_FEATURE_COMPAT_SUPP	(NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
> -				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
> +				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS \
> +				| NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS)
>  #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>  #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
>  

You don't have to add three compat flags just for this one patchset.
Please unify them.

#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 0)

looks to be enough.
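
A minimal sketch of what the unified flag could look like, assuming a
single compat bit covers the whole extension.  The two helper functions
are hypothetical and only illustrate how the bit would be tested; they
are not part of the patchset.

```c
#include <stdint.h>

/* One compat bit covering the whole live-block tracking extension. */
#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS	(1ULL << 0)
#define NILFS_FEATURE_COMPAT_SUPP	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS

/* Hypothetical helper: is live-block tracking enabled in this compat word? */
static int feature_track_live_blks_sketch(uint64_t feature_compat)
{
	return (feature_compat & NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS) != 0;
}

/* Hypothetical helper: compat bits this implementation does not know. */
static uint64_t unknown_compat_bits_sketch(uint64_t feature_compat)
{
	return feature_compat & ~NILFS_FEATURE_COMPAT_SUPP;
}
```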

Regards,
Ryusuke Konishi


> @@ -630,7 +632,7 @@ struct nilfs_segment_usage {
>  	__le32 su_nblocks;
>  	__le32 su_flags;
>  	__le32 su_nlive_blks;
> -	__le32 su_pad;
> +	__le32 su_nsnapshot_blks;
>  	__le64 su_nlive_lastmod;
>  };
>  
> @@ -682,7 +684,7 @@ nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
>  	su->su_flags = cpu_to_le32(0);
>  	if (susz >= NILFS_EXT_SEGMENT_USAGE_SIZE) {
>  		su->su_nlive_blks = cpu_to_le32(0);
> -		su->su_pad = cpu_to_le32(0);
> +		su->su_nsnapshot_blks = cpu_to_le32(0);
>  		su->su_nlive_lastmod = cpu_to_le64(0);
>  	}
>  }
> @@ -723,7 +725,7 @@ struct nilfs_suinfo {
>  	__u32 sui_nblocks;
>  	__u32 sui_flags;
>  	__u32 sui_nlive_blks;
> -	__u32 sui_pad;
> +	__u32 sui_nsnapshot_blks;
>  	__u64 sui_nlive_lastmod;
>  };
>  
> @@ -770,6 +772,7 @@ enum {
>  	NILFS_SUINFO_UPDATE_FLAGS,
>  	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
>  	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
> +	NILFS_SUINFO_UPDATE_NSNAPSHOT_BLKS,
>  	__NR_NILFS_SUINFO_UPDATE_FIELDS,
>  };
>  
> @@ -794,6 +797,7 @@ NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
>  NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
>  NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
>  NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
> +NILFS_SUINFO_UPDATE_FNS(NSNAPSHOT_BLKS, nsnapshot_blks)
>  NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
>  
>  #define NILFS_MIN_SUINFO_UPDATE_SIZE	\
> -- 
> 2.3.0
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/9] nilfs2: extend SUFILE on-disk format to enable counting of live blocks
       [not found]     ` <1424804504-10914-4-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14  4:05       ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14  4:05 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:01:38 +0100, Andreas Rohner wrote:
> *buf,
>  	int cleansi, cleansu, dirtysi, dirtysu;
>  	long ncleaned = 0, ndirtied = 0;
>  	int ret = 0;
> +	bool sup_ext = (supsz >= NILFS_EXT_SUINFO_UPDATE_SIZE);
> +	bool su_ext = nilfs_sufile_ext_supported(sufile);
>  
>  	if (unlikely(nsup == 0))
>  		return ret;
> @@ -926,6 +949,9 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>  				(~0UL << __NR_NILFS_SUINFO_UPDATE_FIELDS))
>  			|| (nilfs_suinfo_update_nblocks(sup) &&
>  				sup->sup_sui.sui_nblocks >
> +				nilfs->ns_blocks_per_segment)
> +			|| (nilfs_suinfo_update_nlive_blks(sup) && sup_ext &&
> +				sup->sup_sui.sui_nlive_blks >
>  				nilfs->ns_blocks_per_segment))
>  			return -EINVAL;
>  	}
> @@ -953,6 +979,14 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>  		if (nilfs_suinfo_update_nblocks(sup))
>  			su->su_nblocks = cpu_to_le32(sup->sup_sui.sui_nblocks);
>  
> +		if (nilfs_suinfo_update_nlive_blks(sup) && sup_ext && su_ext)
> +			su->su_nlive_blks =
> +				cpu_to_le32(sup->sup_sui.sui_nlive_blks);
> +
> +		if (nilfs_suinfo_update_nlive_lastmod(sup) && sup_ext && su_ext)
> +			su->su_nlive_lastmod =
> +				cpu_to_le64(sup->sup_sui.sui_nlive_lastmod);
> +
>  		if (nilfs_suinfo_update_flags(sup)) {
>  			/*
>  			 * Active flag is a virtual flag projected by running
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index c446325..d56498b 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -28,6 +28,11 @@
>  #include <linux/nilfs2_fs.h>
>  #include "mdt.h"
>  
> +static inline int
> +nilfs_sufile_ext_supported(const struct inode *sufile)
> +{
> +	return NILFS_MDT(sufile)->mi_entry_size >= NILFS_EXT_SEGMENT_USAGE_SIZE;
> +}
>  
>  static inline unsigned long nilfs_sufile_get_nsegments(struct inode *sufile)
>  {
> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
> index ff3fea3..5d83c55 100644
> --- a/include/linux/nilfs2_fs.h
> +++ b/include/linux/nilfs2_fs.h
> @@ -220,9 +220,11 @@ struct nilfs_super_block {
>   * If there is a bit set in the incompatible feature set that the kernel
>   * doesn't know about, it should refuse to mount the filesystem.
>   */
> -#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT	0x00000001ULL
> +#define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)

This feature name is not good.  The sufile can be extended further in the
future, so you should name the flag after what this particular extension
means.

As I mentioned in another patch, I think this could be unified into the
TRACK_LIVE_BLKS feature that a later patch adds, since the live block
counting of this patchset inherently depends on the extension of
sufile.

>  
> -#define NILFS_FEATURE_COMPAT_SUPP	0ULL
> +#define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
> +

Regards,
Ryusuke Konishi

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 4/9] nilfs2: add function to modify su_nlive_blks
       [not found]     ` <1424804504-10914-5-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14  4:57       ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14  4:57 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:01:39 +0100, Andreas Rohner wrote:
> This patch adds a function to modify the su_nlive_blks field of the
> nilfs_segment_usage structure in the SUFILE. By using positive or
> negative integers, it is possible to add and subtract any value from
> the su_nlive_blks field.
> 
> The use of a modification cache is optional and by passing a NULL
> pointer the value will be added or subtracted directly. Otherwise it is
> necessary to call nilfs_sufile_flush_nlive_blks() at some point to make
> the modifications persistent.
> 
> The modification cache is useful, because it allows for small values,
> like simple increments and decrements, to be added up before writing
> them to the SUFILE.
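
To illustrate the idea of the modification cache described above, here is
a small userspace sketch.  It is not the implementation from the
patchset: the fixed capacity, the struct layout and all names ending in
_sketch are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define MC_CAPACITY_SKETCH 4	/* arbitrary size for the illustration */

struct mod_sketch {
	uint64_t segnum;
	int64_t value;		/* accumulated signed delta */
};

struct mod_cache_sketch {
	struct mod_sketch mods[MC_CAPACITY_SKETCH];
	size_t size;
};

/*
 * Record @value for @segnum.  Deltas for a segment that is already cached
 * are merged.  Returns 0 on success, or 1 if the cache is full and must be
 * flushed before the delta can be stored.
 */
static int mc_add_sketch(struct mod_cache_sketch *mc, uint64_t segnum,
			 int64_t value)
{
	size_t i;

	for (i = 0; i < mc->size; i++) {
		if (mc->mods[i].segnum == segnum) {
			mc->mods[i].value += value;
			return 0;
		}
	}
	if (mc->size == MC_CAPACITY_SKETCH)
		return 1;
	mc->mods[mc->size].segnum = segnum;
	mc->mods[mc->size].value = value;
	mc->size++;
	return 0;
}

/* Example callback: add the delta to a plain counter array in @ctx. */
static void apply_to_counters_sketch(uint64_t segnum, int64_t value, void *ctx)
{
	((int64_t *)ctx)[segnum] += value;
}

/*
 * Hand every cached delta to @apply, then empty the cache.  Returns the
 * number of entries that were applied.
 */
static size_t mc_flush_sketch(struct mod_cache_sketch *mc,
			      void (*apply)(uint64_t, int64_t, void *),
			      void *ctx)
{
	size_t i, n = mc->size;

	for (i = 0; i < n; i++)
		apply(mc->mods[i].segnum, mc->mods[i].value, ctx);
	mc->size = 0;
	return n;
}
```

Two decrements for the same segment end up as a single entry with value
-2, so the flush touches that segment's usage entry only once.
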
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  fs/nilfs2/sufile.c | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  fs/nilfs2/sufile.h |   5 ++
>  2 files changed, 143 insertions(+)
> 
> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
> index ae08050..574a77e 100644
> --- a/fs/nilfs2/sufile.c
> +++ b/fs/nilfs2/sufile.c
> @@ -1380,6 +1380,144 @@ static inline int nilfs_sufile_mc_update(struct inode *sufile,
>  }
>  
>  /**
> + * nilfs_sufile_do_flush_nlive_blks - apply modification to su_nlive_blks
> + * @sufile: inode of segment usage file
> + * @mod: modification structure
> + * @header_bh: sufile header block
> + * @su_bh: block containing segment usage of m_segnum in @mod
> + *
> + * Description: nilfs_sufile_do_flush_nlive_blks() is a callback function
> + * used with nilfs_sufile_updatev(), that adds m_value in @mod to
> + * the su_nlive_blks field of the segment usage entry belonging to m_segnum.
> + */
> +static void nilfs_sufile_do_flush_nlive_blks(struct inode *sufile,
> +					     struct nilfs_sufile_mod *mod,
> +					     struct buffer_head *header_bh,
> +					     struct buffer_head *su_bh)
> +{
> +	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
> +	struct nilfs_segment_usage *su;
> +	void *kaddr;
> +	__u32 nblocks, nlive_blocks;
> +	__u64 segnum = mod->m_segnum;
> +	__s64 value = mod->m_value;
> +
> +	if (!value)
> +		return;
> +
> +	kaddr = kmap_atomic(su_bh->b_page);
> +
> +	su = nilfs_sufile_block_get_segment_usage(sufile, segnum, su_bh, kaddr);
> +	WARN_ON(nilfs_segment_usage_error(su));
> +
> +	nblocks = le32_to_cpu(su->su_nblocks);
> +	nlive_blocks = le32_to_cpu(su->su_nlive_blks);
> +
> +	value += nlive_blocks;
> +	if (value < 0)
> +		value = 0;
> +	else if (value > nblocks)
> +		value = nblocks;
> +
> +	/* do nothing if the value didn't change */
> +	if (value != nlive_blocks) {
> +		su->su_nlive_blks = cpu_to_le32(value);
> +		su->su_nlive_lastmod = cpu_to_le64(nilfs->ns_ctime);

ns_ctime should not be used here because it is updated after writing a
segment.  get_seconds() should be used instead.
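
As a rough userspace sketch of the intended behaviour: the clamping
follows the quoted hunk, while the timestamp is taken at the time of the
modification.  time(NULL) stands in for get_seconds() here, and the
struct and function names are hypothetical.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical host-order stand-in for the relevant SUFILE entry fields. */
struct seg_usage_sketch {
	uint32_t nblocks;
	uint32_t nlive_blks;
	uint64_t nlive_lastmod;
};

/*
 * Add the signed @delta to nlive_blks, clamped to [0, nblocks].  The
 * timestamp is only updated if the counter really changed.  Returns 1 if
 * the entry was modified, 0 otherwise.
 */
static int mod_nlive_blks_sketch(struct seg_usage_sketch *su, int64_t delta,
				 uint64_t now)
{
	int64_t value = (int64_t)su->nlive_blks + delta;

	if (value < 0)
		value = 0;
	else if (value > (int64_t)su->nblocks)
		value = su->nblocks;

	if ((uint32_t)value == su->nlive_blks)
		return 0;

	su->nlive_blks = (uint32_t)value;
	su->nlive_lastmod = now;
	return 1;
}

/* Same as above, stamped with the current wall-clock time in seconds. */
static int mod_nlive_blks_now_sketch(struct seg_usage_sketch *su,
				     int64_t delta)
{
	return mod_nlive_blks_sketch(su, delta, (uint64_t)time(NULL));
}
```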

> +	}
> +
> +	kunmap_atomic(kaddr);
> +
> +	if (value != nlive_blocks) {
> +		mark_buffer_dirty(su_bh);
> +		nilfs_mdt_mark_dirty(sufile);
> +	}
> +}
> +
> +/**
> + * nilfs_sufile_flush_nlive_blks - flush mod cache to su_nlive_blks
> + * @sufile: inode of segment usage file
> + * @mc: modification cache
> + *
> + * Description: nilfs_sufile_flush_nlive_blks() flushes the cached
> + * modifications in @mc, by applying them to the su_nlive_blks field of
> + * the corresponding segment usage entries. @mc can be NULL or empty. If
> + * the sufile extension needed to support su_nlive_blks is not supported the
> + * function will abort without error.
> + *
> + * Return Value: On success, zero is returned.  On error, one of the
> + * following negative error codes is returned.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + *
> + * %-ENOENT - Given segment usage is in hole block
> + *
> + * %-EINVAL - Invalid segment usage number
> + */
> +int nilfs_sufile_flush_nlive_blks(struct inode *sufile,
> +				  struct nilfs_sufile_mod_cache *mc)
> +{
> +	int ret;
> +
> +	if (!mc || !mc->mc_size || !nilfs_sufile_ext_supported(sufile))
> +		return 0;
> +
> +	ret = nilfs_sufile_mc_flush(sufile, mc,
> +				    nilfs_sufile_do_flush_nlive_blks);
> +
> +	nilfs_sufile_mc_clear(mc);
> +
> +	return ret;
> +}
> +
> +/**
> + * nilfs_sufile_mod_nlive_blks - modify su_nlive_blks using mod cache
> + * @sufile: inode of segment usage file
> + * @mc: modification cache
> + * @segnum: segment number
> + * @value: signed value (can be positive and negative)
> + *
> + * Description: nilfs_sufile_mod_nlive_blks() adds @value to the su_nlive_blks
> + * field of the segment usage entry for @segnum. If @mc is not NULL it first
> + * accumulates all modifications in the cache and flushes it if it is full.
> + * Otherwise the change is applied directly.
> + *
> + * Return Value: On success, zero is returned.  On error, one of the
> + * following negative error codes is returned.
> + *
> + * %-EIO - I/O error.
> + *
> + * %-ENOMEM - Insufficient amount of memory available.
> + *
> + * %-ENOENT - Given segment usage is in hole block
> + *
> + * %-EINVAL - Invalid segment usage number
> + */
> +int nilfs_sufile_mod_nlive_blks(struct inode *sufile,
> +				struct nilfs_sufile_mod_cache *mc,
> +				__u64 segnum, __s64 value)
> +{
> +	int ret;
> +
> +	if (!value || !nilfs_sufile_ext_supported(sufile))
> +		return 0;
> +
> +	if (!mc)
> +		return nilfs_sufile_mc_update(sufile, segnum, value,
> +				nilfs_sufile_do_flush_nlive_blks);
> +
> +	if (!nilfs_sufile_mc_add(mc, segnum, value))
> +		return 0;
> +
> +	ret = nilfs_sufile_flush_nlive_blks(sufile, mc);
> +
> +	nilfs_sufile_mc_reset(mc, segnum, value);
> +
> +	return ret;
> +}
> +
> +/**
>   * nilfs_sufile_read - read or get sufile inode
>   * @sb: super block instance
>   * @susize: size of a segment usage entry
> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
> index d56498b..ae3c52a 100644
> --- a/fs/nilfs2/sufile.h
> +++ b/fs/nilfs2/sufile.h
> @@ -195,4 +195,9 @@ static inline void nilfs_sufile_mc_destroy(struct nilfs_sufile_mod_cache *mc)
>  	}
>  }
>  
> +int nilfs_sufile_flush_nlive_blks(struct inode *,
> +				  struct nilfs_sufile_mod_cache *);
> +int nilfs_sufile_mod_nlive_blks(struct inode *, struct nilfs_sufile_mod_cache *,
> +				__u64, __s64);
> +

Please add variable names to arguments of new declarations.
(You don't have to add variable names to unrelated declarations)
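
For example, the two new declarations could read as follows.  The
parameter names are taken from the function definitions in the quoted
patch; the typedefs and forward declarations are only there so that the
fragment compiles on its own outside the kernel tree.

```c
#include <stdint.h>

/* Forward declarations so the fragment stands alone. */
struct inode;
struct nilfs_sufile_mod_cache;

/* Userspace stand-ins for the kernel's __u64 and __s64 (assumption). */
typedef uint64_t u64_sketch;
typedef int64_t s64_sketch;

int nilfs_sufile_flush_nlive_blks(struct inode *sufile,
				  struct nilfs_sufile_mod_cache *mc);
int nilfs_sufile_mod_nlive_blks(struct inode *sufile,
				struct nilfs_sufile_mod_cache *mc,
				u64_sketch segnum, s64_sketch value);
```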

Regards,
Ryusuke Konishi

>  #endif	/* _NILFS_SUFILE_H */
> -- 
> 2.3.0
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH 3/6] nilfs-utils: add support for tracking live blocks
       [not found]         ` <1424804659-10986-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14  5:52           ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14  5:52 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Tue, 24 Feb 2015 20:04:16 +0100, Andreas Rohner wrote:
> This patch adds a new feature flag NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
> which allows the user to enable and disable the tracking of live
> blocks. The flag can be set at file system creation time with mkfs or
> at any later time with nilfs-tune.
> 
> Additionally a new option NILFS_OPT_TRACK_LIVE_BLKS is added to be
> used by the GC. It is set to the same value as
> NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS at startup. It is mainly used to
> easily and efficiently check for the feature at runtime and to disable
> it if the kernel doesn't support it.
> 
> It is fully backwards compatible, because
> NILFS_FEATURE_COMPAT_SUFILE_EXTENSION also is backwards compatible and
> it basically only tells the kernel to update a counter for every
> segment in the SUFILE. If the kernel doesn't support it, the counter
> won't be updated and the GC policies depending on that information
> will work less efficiently, but they would still work.
> 
> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
> ---
>  include/nilfs.h              | 30 +++++++++++++++++++++++++++---
>  include/nilfs2_fs.h          |  4 +++-
>  lib/feature.c                |  2 ++
>  lib/nilfs.c                  | 32 ++++----------------------------
>  man/mkfs.nilfs2.8            |  6 ++++++
>  sbin/mkfs/mkfs.c             |  3 ++-
>  sbin/nilfs-tune/nilfs-tune.c |  4 ++--
>  7 files changed, 46 insertions(+), 35 deletions(-)
> 
> diff --git a/include/nilfs.h b/include/nilfs.h
> index f695f48..22a9190 100644
> --- a/include/nilfs.h
> +++ b/include/nilfs.h
> @@ -130,6 +130,7 @@ struct nilfs {
>  
>  #define NILFS_OPT_MMAP		0x01
>  #define NILFS_OPT_SET_SUINFO	0x02
> +#define NILFS_OPT_TRACK_LIVE_BLKS	0x04
>  
>  
>  struct nilfs *nilfs_open(const char *, const char *, int);
> @@ -141,9 +142,25 @@ void nilfs_opt_clear_mmap(struct nilfs *);
>  int nilfs_opt_set_mmap(struct nilfs *);
>  int nilfs_opt_test_mmap(struct nilfs *);
>  
> -void nilfs_opt_clear_set_suinfo(struct nilfs *);
> -int nilfs_opt_set_set_suinfo(struct nilfs *);
> -int nilfs_opt_test_set_suinfo(struct nilfs *);
> +#define NILFS_OPT_FLAG(flag, name)					\
> +static inline void							\
> +nilfs_opt_set_##name(struct nilfs *nilfs)			\
> +{									\
> +	nilfs->n_opts |= NILFS_OPT_##flag;		\
> +}									\
> +static inline void							\
> +nilfs_opt_clear_##name(struct nilfs *nilfs)			\
> +{									\
> +	nilfs->n_opts &= ~NILFS_OPT_##flag;		\
> +}									\
> +static inline int							\
> +nilfs_opt_test_##name(const struct nilfs *nilfs)			\
> +{									\
> +	return !!(nilfs->n_opts & NILFS_OPT_##flag);	\
> +}

Don't break library compatibility by inlining.  At least this should
be done in a separate patch.

I think we should do the opposite.  I mean that the implementation of
the nilfs structure should be hidden inside the nilfs library at some
point.  That has to wait until we bump the library version, because it
will break library compatibility.
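
A minimal sketch of that direction, assuming an opaque handle with
out-of-line accessors.  Every name with "sketch" in it is hypothetical
and the single n_opts field is a simplification; this is not the current
libnilfs API.

```c
#include <stdlib.h>

/* Public header part: callers only ever see an incomplete type. */
struct nilfs_sketch;

struct nilfs_sketch *nilfs_sketch_open(void);
void nilfs_sketch_close(struct nilfs_sketch *nilfs);
void nilfs_sketch_opt_set_set_suinfo(struct nilfs_sketch *nilfs);
void nilfs_sketch_opt_clear_set_suinfo(struct nilfs_sketch *nilfs);
int nilfs_sketch_opt_test_set_suinfo(const struct nilfs_sketch *nilfs);

/* Library-private part: the layout can change without breaking callers. */
#define NILFS_SKETCH_OPT_SET_SUINFO	0x02

struct nilfs_sketch {
	int n_opts;
};

struct nilfs_sketch *nilfs_sketch_open(void)
{
	/* calloc() zero-fills, so no option bit is set initially */
	return calloc(1, sizeof(struct nilfs_sketch));
}

void nilfs_sketch_close(struct nilfs_sketch *nilfs)
{
	free(nilfs);
}

void nilfs_sketch_opt_set_set_suinfo(struct nilfs_sketch *nilfs)
{
	nilfs->n_opts |= NILFS_SKETCH_OPT_SET_SUINFO;
}

void nilfs_sketch_opt_clear_set_suinfo(struct nilfs_sketch *nilfs)
{
	nilfs->n_opts &= ~NILFS_SKETCH_OPT_SET_SUINFO;
}

int nilfs_sketch_opt_test_set_suinfo(const struct nilfs_sketch *nilfs)
{
	return !!(nilfs->n_opts & NILFS_SKETCH_OPT_SET_SUINFO);
}
```

Because the accessors are real exported functions, programs linked
against the library keep working when fields are added to or reordered
in the structure, which inline accessors in the header cannot guarantee.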

> +
> +NILFS_OPT_FLAG(SET_SUINFO, set_suinfo);
> +NILFS_OPT_FLAG(TRACK_LIVE_BLKS, track_live_blks);
>  
>  nilfs_cno_t nilfs_get_oldest_cno(struct nilfs *);
>  
> @@ -326,4 +343,11 @@ static inline __u32 nilfs_get_blocks_per_segment(const struct nilfs *nilfs)
>  	return le32_to_cpu(nilfs->n_sb->s_blocks_per_segment);
>  }
>  
> +static inline int nilfs_feature_track_live_blks(const struct nilfs *nilfs)
> +{
> +	__u64 fc = le64_to_cpu(nilfs->n_sb->s_feature_compat);
> +	return (fc & NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS) &&
> +		(fc & NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
> +}
> +
>  #endif	/* NILFS_H */
> diff --git a/include/nilfs2_fs.h b/include/nilfs2_fs.h
> index d01a924..427ca53 100644
> --- a/include/nilfs2_fs.h
> +++ b/include/nilfs2_fs.h
> @@ -220,10 +220,12 @@ struct nilfs_super_block {
>   * doesn't know about, it should refuse to mount the filesystem.
>   */
>  #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
> +#define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
>  
>  #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
>  
> -#define NILFS_FEATURE_COMPAT_SUPP	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION
> +#define NILFS_FEATURE_COMPAT_SUPP	(NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
> +				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
>  #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>  #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
>  
> diff --git a/lib/feature.c b/lib/feature.c
> index d954cda..ebe8c3f 100644
> --- a/lib/feature.c
> +++ b/lib/feature.c
> @@ -57,6 +57,8 @@ static const struct nilfs_feature features[] = {
>  	/* Compat features */
>  	{ NILFS_FEATURE_TYPE_COMPAT,
>  	  NILFS_FEATURE_COMPAT_SUFILE_EXTENSION, "sufile_ext" },
> +	{ NILFS_FEATURE_TYPE_COMPAT,
> +	  NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS, "track_live_blks" },
>  	/* Read-only compat features */
>  	{ NILFS_FEATURE_TYPE_COMPAT_RO,
>  	  NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT, "block_count" },
> diff --git a/lib/nilfs.c b/lib/nilfs.c
> index 30db654..2067fc0 100644
> --- a/lib/nilfs.c
> +++ b/lib/nilfs.c
> @@ -290,34 +290,6 @@ int nilfs_opt_test_mmap(struct nilfs *nilfs)
>  	return !!(nilfs->n_opts & NILFS_OPT_MMAP);
>  }
>  
> -/**
> - * nilfs_opt_set_set_suinfo - set set_suinfo option
> - * @nilfs: nilfs object
> - */
> -int nilfs_opt_set_set_suinfo(struct nilfs *nilfs)
> -{
> -	nilfs->n_opts |= NILFS_OPT_SET_SUINFO;
> -	return 0;
> -}
> -
> -/**
> - * nilfs_opt_clear_set_suinfo - clear set_suinfo option
> - * @nilfs: nilfs object
> - */
> -void nilfs_opt_clear_set_suinfo(struct nilfs *nilfs)
> -{
> -	nilfs->n_opts &= ~NILFS_OPT_SET_SUINFO;
> -}
> -
> -/**
> - * nilfs_opt_test_set_suinfo - test whether set_suinfo option is set or not
> - * @nilfs: nilfs object
> - */
> -int nilfs_opt_test_set_suinfo(struct nilfs *nilfs)
> -{
> -	return !!(nilfs->n_opts & NILFS_OPT_SET_SUINFO);
> -}
> -
>  static int nilfs_open_sem(struct nilfs *nilfs)
>  {
>  	char semnambuf[NAME_MAX - 4];
> @@ -382,6 +354,7 @@ struct nilfs *nilfs_open(const char *dev, const char *dir, int flags)
>  	nilfs->n_dev = NULL;
>  	nilfs->n_ioc = NULL;
>  	nilfs->n_mincno = NILFS_CNO_MIN;
> +	nilfs->n_opts = 0;

Please fix this as a separate patch.  This is a leak bug even though
it doesn't really matter.

Regards,
Ryusuke Konishi

>  	memset(nilfs->n_sems, 0, sizeof(nilfs->n_sems));
>  
>  	if (flags & NILFS_OPEN_RAW) {
> @@ -405,6 +378,9 @@ struct nilfs *nilfs_open(const char *dev, const char *dir, int flags)
>  			errno = ENOTSUP;
>  			goto out_fd;
>  		}
> +
> +		if (nilfs_feature_track_live_blks(nilfs))
> +			nilfs_opt_set_track_live_blks(nilfs);
>  	}
>  
>  	if (flags &
> diff --git a/man/mkfs.nilfs2.8 b/man/mkfs.nilfs2.8
> index 6c9a644..2431ac9 100644
> --- a/man/mkfs.nilfs2.8
> +++ b/man/mkfs.nilfs2.8
> @@ -176,6 +176,12 @@ cannot be disabled, because it changes the ondisk format. Nevertheless it
>  is fully compatible with older versions of the file system. This feature
>  is on by default, because it is fully backwards compatible and can only
>  be set at file system creation time.
> +.TP
> +.B track_live_blks
> +Enables the tracking of live blocks, which might improve the effectiveness of
> +garbage collection, but entails a small runtime overhead. It is important to
> +note, that this feature depends on sufile_ext, which can only be set
> +at file system creation time.
>  .RE
>  .TP
>  .B \-q
> diff --git a/sbin/mkfs/mkfs.c b/sbin/mkfs/mkfs.c
> index 3985262..680311c 100644
> --- a/sbin/mkfs/mkfs.c
> +++ b/sbin/mkfs/mkfs.c
> @@ -1082,7 +1082,8 @@ static inline void check_ctime(time_t ctime)
>  
>  static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
>  	/* Compat */
> -	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION,
> +	NILFS_FEATURE_COMPAT_SUFILE_EXTENSION |
> +	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
>  	/* Read-only compat */
>  	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
>  	/* Incompat */
> diff --git a/sbin/nilfs-tune/nilfs-tune.c b/sbin/nilfs-tune/nilfs-tune.c
> index 60f1d39..7889310 100644
> --- a/sbin/nilfs-tune/nilfs-tune.c
> +++ b/sbin/nilfs-tune/nilfs-tune.c
> @@ -84,7 +84,7 @@ static void nilfs_tune_usage(void)
>  
>  static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
>  	/* Compat */
> -	0,
> +	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
>  	/* Read-only compat */
>  	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
>  	/* Incompat */
> @@ -93,7 +93,7 @@ static const __u64 ok_features[NILFS_MAX_FEATURE_TYPES] = {
>  
>  static const __u64 clear_ok_features[NILFS_MAX_FEATURE_TYPES] = {
>  	/* Compat */
> -	0,
> +	NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS,
>  	/* Read-only compat */
>  	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT,
>  	/* Incompat */
> -- 
> 2.3.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nilfs" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
       [not found]                 ` <20150312.215431.324210374799651841.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-03-14 12:24                   ` Andreas Rohner
       [not found]                     ` <55042879.90701-hi6Y0CQ0nG0@public.gmane.org>
  0 siblings, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-03-14 12:24 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

Hi Ryusuke,

Thank you very much for your detailed review and feedback. I agree with
all of your points and I will start working on a rewrite immediately.

On 2015-03-12 13:54, Ryusuke Konishi wrote:
> Hi Andreas,
> 
> On Tue, 10 Mar 2015 21:37:50 +0100, Andreas Rohner wrote:
>> Hi Ryusuke,
>>
>> Thanks for your thorough review.
>>
>> On 2015-03-10 06:21, Ryusuke Konishi wrote:
>>> Hi Andreas,
>>>
>>> I looked through all of the kernel patches and part of the util patches.
>>> Overall comments are as follows:
>>>
>>> [Algorithm]
>>> As for the algorithm, it looks about OK except for the starvation
>>> countermeasure.  The starvation countermeasure looks ad hoc/hacky, but
>>> it's good that it doesn't change the kernel/userland interface; we may
>>> be able to replace it with a better approach in the future or in a
>>> revised version of this patchset.
>>>
>>> (1) Drawback of the starvation countermeasure
>>>     Patch 9/9 looks like it makes the execution time of the chcp
>>>     operation worse, since it scans through the sufile to modify live
>>>     block counters.  How much does it prolong the execution time?
>>
>> I'll do some tests, but I haven't noticed any significant performance
>> drop. The GC basically does the same thing, every time it selects
>> segments to reclaim.
> 
> GC is performed in the background by an independent process.  What I
> care about is that the NILFS_IOCTL_CHANGE_CPMODE ioctl is called from
> a command line interface or an application.  They differ in this respect.
> 
> Was a worst case scenario considered in the test?
> 
> For example:
> 1. Fill a TB class drive with data file(s), and make a snapshot on it.
> 2. Run one pass GC to update snapshot block counts.
> 3. And do "chcp cp"
> 
> If we don't observe noticeable delay on this class of drive, then I
> think we can put the problem off.

Yesterday I did a worst case test as you suggested. I used an old 1 TB
hard drive I had lying around. This was my setup:

1. Write an 850GB file
2. Create a snapshot
3. Delete the file
4. Let GC run through all segments
5. Verify with lssu that the GC has updated all SUFILE entries
6. Drop the page cache
7. chcp cp

The following results are with the page cache dropped immediately before
each call:

1. chcp ss
real	0m1.337s
user	0m0.017s
sys	0m0.030s

2. chcp cp
real	0m6.377s
user	0m0.023s
sys	0m0.053s

The following results are without the drop of the page cache:

1. chcp ss
real	0m0.137s
user	0m0.010s
sys	0m0.000s

2. chcp cp
real	0m0.016s
user	0m0.010s
sys	0m0.007s

There are 119233 segments in my test. Each SUFILE entry uses 32 bytes.
So the worst case for 1 TB with 8 MB segments would be 3.57 MB of random
reads and one 3.57 MB continuous write. You only get 6.377s because my
hard drive is so slow. You wouldn't notice any difference on a modern
SSD. Furthermore the SUFILE is also scanned by the segment allocation
algorithm and the GC, so it is very likely already in the page cache.
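
The size estimate above can be reproduced with a few lines of C. This is
only a back-of-the-envelope sketch: the helper names are made up for
illustration, the 4096-byte block size in the note below is an
assumption, and the SUFILE header and any padding are ignored.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of nilfs2): number of bytes of SUFILE
 * entries that a full scan has to touch, ignoring the SUFILE header.
 */
static uint64_t sufile_scan_bytes(uint64_t nsegments, uint64_t entry_size)
{
	return nsegments * entry_size;
}

/* Hypothetical helper: number of entries that fit into one block. */
static uint64_t sufile_entries_per_block(uint64_t block_size,
					 uint64_t entry_size)
{
	return block_size / entry_size;
}
```

With the 119233 segments and 32-byte entries from the test,
sufile_scan_bytes(119233, 32) gives 3815456 bytes, so a full scan touches
only a few megabytes. With an assumed block size of 4096 bytes,
sufile_entries_per_block(4096, 32) gives 128 entries per block.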

>>>     In one use case of nilfs, many snapshots are created and then
>>>     automatically changed back to plain checkpoints, because old
>>>     snapshots are thinned out over time.  Patch 9/9 may affect
>>>     such usage.
>>>
>>> (2) Compatibility
>>>     What will happen in the following case:
>>>     1. Create a file system, use it with the new module, and
>>>        create snapshots.
>>>     2. Mount it with an old module, and release snapshot with "chcp cp"
>>>     3. Mount it with the new module, and cleanerd runs gc with
>>>        cost benefit or greedy policy.
>>
>> Some segments could be subject to starvation. But it would probably only
>> affect a small number of segments and it could be fixed by "chcp ss
>> <CP>; chcp cp <CP>".
> 
> Ok, let's treat this as a restriction for now.
> If you come up with any good idea, please propose.
> 
>>> (3) Durability against unexpected power failures (just a note)
>>>     The current patchset does not look like it causes a starvation
>>>     issue even when an unexpected power failure occurs during or after
>>>     executing "chcp cp", because nilfs_ioctl_change_cpmode() makes its
>>>     changes in a transactional way with nilfs_transaction_begin/commit.
>>>     We should always consider this kind of situation to keep consistency.
>>>
>>> [Coding Style]
>>> (4) This patchset has several coding style issues. Please fix them and
>>>     re-check with the latest checkpatch script (scripts/checkpatch.pl).
>>
>> I'll fix that. Sorry.
>>
>>> patch 2:
>>> WARNING: Prefer kmalloc_array over kmalloc with multiply
>>> #85: FILE: fs/nilfs2/sufile.c:1192:
>>> +    mc->mc_mods = kmalloc(capacity * sizeof(struct nilfs_sufile_mod),
>>>
>>> patch 5,6:
>>> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
>>> #60: 
>>> the same semaphore has to be aquired. So if the DAT-Entry belongs to
>>>
>>> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
>>> #46: 
>>> be aquired, which blocks the entire SUFILE and effectively turns
>>>
>>> WARNING: 'aquired' may be misspelled - perhaps 'acquired'?
>>> #53: 
>>> afore mentioned lock only needs to be aquired, if the cache is full
>>>
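
The kmalloc_array warning above is about the open-coded multiplication:
kmalloc_array() checks whether n * size overflows and fails the
allocation in that case. A userspace sketch of the same idea follows;
alloc_array() is a made-up name used only for illustration and is not a
kernel or libc function.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical userspace analogue of an overflow-checked array
 * allocation: refuse the request if n * size would wrap around,
 * instead of silently allocating a buffer that is too small.
 */
static void *alloc_array(size_t n, size_t size)
{
	if (size != 0 && n > SIZE_MAX / size)
		return NULL;	/* n * size would overflow */
	return malloc(n * size);
}
```

A caller treats a NULL return like any other allocation failure, so an
oversized capacity value is rejected rather than turned into a short
buffer.
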
>>> (5) sub_sizeof macro:
>>>     The same definition exists as offsetofend() in vfio.h,
>>>     and a patch to move it to stddef.h is now proposed.
>>>
>>>     Please use the same name, and redefine it only if it's not
>>>     defined:
>>>
>>> #ifndef offsetofend
>>> #define offsetofend(TYPE, MEMBER) \
>>>         (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
>>> #endif
>>
>> Ok I'll change that.
>>
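
As an illustration of what offsetofend() is typically used for, here is
a small sketch that derives the size of an older on-disk layout from the
last field that layout contained. The structure and helper names are
made up for this example and are not the ones used in the patch set.

```c
#include <stddef.h>
#include <stdint.h>

#ifndef offsetofend
#define offsetofend(TYPE, MEMBER) \
	(offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))
#endif

/* Made-up structure that gained fields in a later format version. */
struct demo_usage {
	uint64_t lastmod;	/* original field */
	uint32_t nblocks;	/* original field */
	uint32_t flags;		/* original field, end of the old layout */
	uint32_t nlive_blks;	/* field added by the extension */
	uint32_t pad;
};

/* Size of the old layout: everything up to and including 'flags'. */
#define DEMO_USAGE_OLD_SIZE	offsetofend(struct demo_usage, flags)

/* Return nonzero if a record of rec_size bytes carries 'nlive_blks'. */
static int demo_has_nlive_blks(size_t rec_size)
{
	return rec_size >= offsetofend(struct demo_usage, nlive_blks);
}
```

Comparing a record size against offsetofend() of a field keeps the check
correct if more fields are appended after it later on.
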
>>> [Implementation]
>>> (6) b_blocknr
>>>     Please do not use bh->b_blocknr to store the disk block number.
>>>     This field is used to keep the virtual block number except for DAT
>>>     files.  It is only replaced with an actual block number while
>>>     calling submit_bh().  Keep this policy.
>>
>> As far as I can tell, this is only true for blocks of GC inodes and node
>> blocks. All other buffer_heads are always mapped to on disk blocks by
>> nilfs_get_block(). I only added the mapping in nilfs_segbuf_submit_bh()
>> to correctly set the value in b_blocknr to the new location.
>>
> 
> nilfs_get_block() is only used for regular files, directories, and so
> on.  Blocks on metadata files are mapped through
> nilfs_mdt_submit_block().  Anyway, yes, they store the actual disk block
> number in b_blocknr in the current implementation.  But that is just
> a shortcut taken by the current implementation, which exists because
> we have to set actual disk block numbers when reading blocks with
> vfs/mm functions.
> 
> Anyway, I don't like that you touch nilfs_get_block() and
> nilfs_segbuf_submit_bh() as part of the big patch.  At least, it
> should be a separate patch.  I would prefer that you take an alternative
> approach which does the same thing without b_blocknr.  I would like to
> help implement the latter approach if you need to know the disk block
> number in the patchset.

I agree not using b_blocknr would be preferable.

>>>     In segment constructor context, you can calculate the disk block
>>>     number from the start disk address of the segment and the block
>>>     index (offset) in the segment.
>>
>> If I understand you correctly, this approach would give me the on disk
>> location inside of the segment that is currently constructed. But I need
>> to know the previous on disk location of the buffer_head. I have to
>> decrement the counter for the previous segment.
> 
> What does the previous on disk location mean?
> And why do you need to know the previous on disk location?
> 
> If it means a segment being reclaimed, you don't need to decrement its
> counter because it will be freed.
> 
> If it means the original block to be overwritten,
> nilfs_dat_commit_end() is called for the block through
> nilfs_bmap_propagate().

Yes I mean the original block to be overwritten.

> If it means the original block of DAT file, it's OK to refer to
> b_blocknr because DAT blocks never store virtual block number by
> design.  I think it should be done in nilfs_btree_propagate_p() and
> nilfs_direct_propagate(), in which no special "end of life" processing
> is done against DAT blocks at present.

I had problems with nilfs_btree_propagate_p(), because of the retry loop
in nilfs_segctor_collect(). In rare cases it will redo the collection,
which messes up the count.

But I think there is a way around that. We would have to flush the
counting cache before entering into the loop in nilfs_segctor_collect().
At the start of every iteration of the loop we clear the cache.
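
The flush-then-clear idea can be sketched in plain C. Everything below is
hypothetical: the structure, the function names and the fixed number of
slots are made up for illustration, and the final flush after the loop
is an assumption about how the last pass would be committed.

```c
#include <stddef.h>

#define DEMO_CACHE_SLOTS 8

/* Hypothetical cache of pending live block changes per segment. */
struct demo_count_cache {
	long delta[DEMO_CACHE_SLOTS];		/* not yet written back */
	long committed[DEMO_CACHE_SLOTS];	/* stands in for the SUFILE */
};

static void demo_cache_clear(struct demo_count_cache *c)
{
	for (size_t i = 0; i < DEMO_CACHE_SLOTS; i++)
		c->delta[i] = 0;
}

static void demo_cache_flush(struct demo_count_cache *c)
{
	for (size_t i = 0; i < DEMO_CACHE_SLOTS; i++) {
		c->committed[i] += c->delta[i];
		c->delta[i] = 0;
	}
}

/* One collection pass; returns nonzero if the pass has to be redone. */
typedef int (*demo_collect_fn)(struct demo_count_cache *c, int attempt);

static void demo_collect_with_retry(struct demo_count_cache *c,
				    demo_collect_fn collect)
{
	int attempt = 0;

	/* Counts gathered before the loop are final: write them out. */
	demo_cache_flush(c);

	for (;;) {
		/* Drop whatever an abandoned pass has counted. */
		demo_cache_clear(c);
		if (!collect(c, attempt++))
			break;
	}

	/* Only the last, successful pass is applied (assumption). */
	demo_cache_flush(c);
}

/* Example pass: one dead block in segment 1, redone exactly once. */
static int demo_sample_collect(struct demo_count_cache *c, int attempt)
{
	c->delta[1]--;
	return attempt == 0;
}
```

With demo_sample_collect() the pass runs twice, but segment 1 ends up
decremented only once, because the count of the abandoned first pass is
cleared before the retry.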

>>> (7) sufile mod cache
>>>     Consider gathering the cache into the nilfs_sufile_info struct and
>>>     no longer passing it as an argument of the bmap/sufile/dat
>>>     interface functions.  Passing it around is hacky, decreases the
>>>     readability of the code, and bloats the changes of this patchset
>>>     across multiple function blocks.
>>
>> If I use a global structure, I have to protect it with a lock. Since
>> almost any operation has to modify the counters in the SUFILE, this
>> would serialize the whole file system.
>>
> 
> The lock acquisition will be needed if you write back to buffers on
> SUFILE.  That is the reason why I say you should aggregate the
> writeback to sufile into the segment constructor context.
> 
> You don't have to assume a global lock.  You can use bgl_lock, for
> example, if the lock contention really matters.

I agree.

>>>     The cache should be well designed. It's important to balance the
>>>     performance and locality/transparency of the feature.  For
>>>     instance, it can be implemented with radix-tree of objects in
>>>     which each object has a vector of 2^k cache entries.
>>
>> I'll look into that.
>>
>>>     I think the cache should be written back to the sufile buffers
>>>     only within segment construction context. At least, it should be
>>>     written back in the context in which a transaction lock is held.
>>>
>>>     In addition, introducing a new bmap lock dependency,
>>>     nilfs_sufile_lock_key, is undesirable. You should avoid it
>>>     by delaying the writeback of cache entries to sufile.
>>
>> The cache could end up using a lot of memory. In the worst case one
>> entry per block.
> 
> Why do you think it matters?  When you modify the block counters of
> segments, all the modified SUFILE blocks become dirty and are pinned in
> memory.  The cache can be designed to do better than the dirty SUFILE
> buffers, at least.

I never thought about it that way, but you are absolutely right. The
cache would use less memory than the dirty SUFILE blocks.
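
A rough comparison of the two memory footprints, as a sketch only. The
cache entry layout below is hypothetical (one aggregated entry per
touched segment), and the figures in the note assume a block size of
4096 bytes and the worst case where every touched segment lives in a
different SUFILE block.

```c
#include <stdint.h>

/* Hypothetical cache entry: one pending counter change per segment. */
struct demo_mod_entry {
	uint64_t segnum;
	int64_t  delta;
};

/*
 * Memory pinned if each of nsegs touched segments dirties its own
 * SUFILE block (worst case for the dirty buffers).
 */
static uint64_t demo_dirty_block_bytes(uint64_t nsegs, uint64_t block_size)
{
	return nsegs * block_size;
}

/* Memory used by a cache that keeps one entry per touched segment. */
static uint64_t demo_cache_bytes(uint64_t nsegs)
{
	return nsegs * sizeof(struct demo_mod_entry);
}
```

For 1000 touched segments this gives 4096000 bytes of pinned buffers
against 16000 bytes of cache entries under these assumptions. A cache
that keeps one entry per modified block instead of one per segment
would need more.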

Regards,
Andreas Rohner

> If you are concerned about needing a "shrinker", we can use other
> techniques, such as queuing changes and applying them to the sufile in
> bulk by using a workqueue.  Anyway it's a matter of design or
> implementation technique.
> 
>>> (8) Changes to the sufile must be finished before dirty buffer
>>>     collection of sufile.
>>>     All mark_buffer_dirty() calls to sufile must be finished
>>>     before or in NILFS_ST_SUFILE stage of nilfs_segctor_collect_blocks().
>>>
>>>     (You can write fixed figures to sufile after the collection phase
>>>      of sufile by marking the buffers dirty in advance, before the
>>>      collection phase.)
>>>
>>>     In the current patchset, sufile mod cache can be flushed in
>>>     nilfs_segctor_update_payload_blocknr(), which comes after the
>>>     dirty buffer collection phase.
>>
>> This is a hard problem. I have to count the blocks added in the
>> NILFS_ST_DAT stage. I don't know, which SUFILE blocks I have to mark in
>> advance. I'll have to think about this.
>>
>>> (9) cpfile should also be excluded from the dead block counting, like sufile
>>>     cpfile is always changed and written back along with sufile and dat.
>>>     So, cpfile must be excluded from the dead block counting.
>>>     Otherwise, a sufile change can trigger cpfile changes, which in turn
>>>     trigger sufile changes.
>>
>> I don't quite understand your example. How exactly can a sufile change
>> trigger a cpfile change and how can this turn into an infinite loop?
>>
> 
> Sorry, that was my misunderstanding.  Since dirty blocks of cpfile are
> collected before sufile, it is possible to avoid the loop by finishing
> all dead block counting on cpfile and flushing it to sufile before or
> in NILFS_ST_SUFILE stage of nilfs_segctor_collect_blocks().
> 
> Regards,
> Ryusuke Konishi
> 
>> Thanks,
>> Andreas Rohner
>>
>>>     This also helps to simplify nilfs_dat_commit_end(), to which the
>>>     patchset added two arguments for the dead block counting.
>>>     I mean, the "dead" argument and the "count_blocks" argument can be
>>>     unified by changing the meaning of the "dead" argument.
>>>
>>>
>>> I will add detailed comments for patches tonight or another day.
>>>
>>> Regards,
>>> Ryusuke Konishi
>>>
>>> On Wed, 25 Feb 2015 09:18:04 +0900 (JST), Ryusuke Konishi wrote:
>>>> Hi Andreas,
>>>>
>>>> Thank you for posting this proposal!
>>>>
>>>> I would like to have time to review this series through, but please
>>>> wait for several days. (This week I'm quite busy until weekend)
>>>>
>>>> Thanks,
>>>> Ryusuke Konishi
>>>>
>>>> On Tue, 24 Feb 2015 20:01:35 +0100, Andreas Rohner wrote:
>>>>> Hi everyone!
>>>>>
>>>>> One of the biggest performance problems of NILFS is its
>>>>> inefficient Timestamp GC policy. This patch set introduces two new GC
>>>>> policies, namely Cost-Benefit and Greedy.
>>>>>
>>>>> The Cost-Benefit policy is nothing new. It has been around for a long
>>>>> time with log-structured file systems [1]. But it relies on accurate
>>>>> information, about the number of live blocks in a segment. NILFS
>>>>> currently does not provide the necessary information. So this patch set
>>>>> extends the entries in the SUFILE to include a counter for the number of
>>>>> live blocks. This counter is decremented whenever a file is deleted or
>>>>> overwritten.
>>>>>
>>>>> Except for some tricky parts, the counting of live blocks is quite
>>>>> trivial. The problem is snapshots. At any time, a checkpoint can be
>>>>> turned into a snapshot or vice versa. So blocks that are reclaimable at
>>>>> one point in time, are protected by a snapshot a moment later.
>>>>>
>>>>> This patch set does not try to track snapshots at all. Instead it uses a
>>>>> heuristic approach to prevent the worst case scenario. The performance
>>>>> is still significantly better than timestamp for my benchmarks.
>>>>>
>>>>> The worst case scenario is, the following:
>>>>>
>>>>> 1. Segment 1 is written
>>>>> 2. Snapshot is created
>>>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>>>>    by the Snapshot. The GC has to set the number of live blocks
>>>>>    to maximum to avoid reclaiming this Segment again in the near future.
>>>>> 4. Snapshot is deleted
>>>>> 5. Segment 1 is reclaimable, but its counter is so high, that the GC
>>>>>    will never try to reclaim it again.
>>>>>
>>>>> To prevent this kind of starvation I use another field in the SUFILE
>>>>> entry, to store the number of blocks that are protected by a snapshot.
>>>>> This value is just a heuristic and it is usually set to 0. Only if the
>>>>> GC reclaims a segment, it is written to the SUFILE entry. The GC has to
>>>>> check for snapshots anyway, so we get this information for free. By
>>>>> storing this information in the SUFILE we can avoid starvation in the
>>>>> following way:
>>>>>
>>>>> 1. Segment 1 is written
>>>>> 2. Snapshot is created
>>>>> 3. GC tries to reclaim Segment 1, but all blocks are protected
>>>>>    by the Snapshot. The GC has to set the number of live blocks
>>>>>    to maximum to avoid reclaiming this Segment again in the near future.
>>>>> 4. GC sets the number of snapshot blocks in Segment 1 in the SUFILE
>>>>>    entry
>>>>> 5. Snapshot is deleted
>>>>> 6. On Snapshot deletion we walk through every entry in the SUFILE and
>>>>>    reduce the number of live blocks to half, if the number of snapshot
>>>>>    blocks is bigger than half of the maximum.
>>>>> 7. Segment 1 is reclaimable and the number of live blocks entry is at
>>>>>    half the maximum. The GC will try to reclaim this segment as soon as
>>>>>    there are no other better choices.
>>>>>
>>>>> BENCHMARKS:
>>>>> -----------
>>>>>
>>>>> My benchmark is quite simple. It consists of a process, that replays
>>>>> real NFS traces at a faster speed. It thereby creates relatively
>>>>> realistic patterns of file creation and deletions. At the same time
>>>>> multiple snapshots are created and deleted in parallel. I use a 100GB
>>>>> partition of a Samsung SSD:
>>>>>
>>>>> WITH SNAPSHOTS EVERY 5 MINUTES:
>>>>> --------------------------------------------------------------------
>>>>>                 Execution time       Wear (Data written to disk)
>>>>> Timestamp:      100%                 100%
>>>>> Cost-Benefit:   80%                  43%
>>>>>
>>>>> NO SNAPSHOTS:
>>>>> ---------------------------------------------------------------------
>>>>>                 Execution time       Wear (Data written to disk)
>>>>> Timestamp:      100%                 100%
>>>>> Cost-Benefit:   70%                  45%
>>>>>
>>>>> I plan on adding more benchmark results soon.
>>>>>
>>>>> Best regards,
>>>>> Andreas Rohner
>>>>>
>>>>> [1] Mendel Rosenblum and John K. Ousterhout. The design and
>>>>>     implementation of a log-structured file system. ACM Trans.
>>>>>     Comput. Syst., 10(1):26–52, February 1992.
>>>>>
>>>>> Andreas Rohner (9):
>>>>>   nilfs2: refactor nilfs_sufile_updatev()
>>>>>   nilfs2: add simple cache for modifications to SUFILE
>>>>>   nilfs2: extend SUFILE on-disk format to enable counting of live blocks
>>>>>   nilfs2: add function to modify su_nlive_blks
>>>>>   nilfs2: add simple tracking of block deletions and updates
>>>>>   nilfs2: use modification cache to improve performance
>>>>>   nilfs2: add additional flags for nilfs_vdesc
>>>>>   nilfs2: improve accuracy and correct for invalid GC values
>>>>>   nilfs2: prevent starvation of segments protected by snapshots
>>>>>
>>>>>  fs/nilfs2/bmap.c          |  84 +++++++-
>>>>>  fs/nilfs2/bmap.h          |  14 +-
>>>>>  fs/nilfs2/btree.c         |   4 +-
>>>>>  fs/nilfs2/cpfile.c        |   5 +
>>>>>  fs/nilfs2/dat.c           |  95 ++++++++-
>>>>>  fs/nilfs2/dat.h           |   8 +-
>>>>>  fs/nilfs2/direct.c        |   4 +-
>>>>>  fs/nilfs2/inode.c         |  24 ++-
>>>>>  fs/nilfs2/ioctl.c         |  27 ++-
>>>>>  fs/nilfs2/mdt.c           |   5 +-
>>>>>  fs/nilfs2/page.h          |   6 +-
>>>>>  fs/nilfs2/segbuf.c        |   6 +
>>>>>  fs/nilfs2/segbuf.h        |   3 +
>>>>>  fs/nilfs2/segment.c       | 155 +++++++++++++-
>>>>>  fs/nilfs2/segment.h       |   3 +
>>>>>  fs/nilfs2/sufile.c        | 533 +++++++++++++++++++++++++++++++++++++++++++---
>>>>>  fs/nilfs2/sufile.h        |  97 +++++++--
>>>>>  fs/nilfs2/the_nilfs.c     |   4 +
>>>>>  fs/nilfs2/the_nilfs.h     |  23 ++
>>>>>  include/linux/nilfs2_fs.h | 122 ++++++++++-
>>>>>  20 files changed, 1126 insertions(+), 96 deletions(-)
>>>>>
>>>>> -- 
>>>>> 2.3.0
>>>>>


* Re: [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]         ` <20150314.125109.1017248837083480553.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
@ 2015-03-14 12:36           ` Andreas Rohner
       [not found]             ` <55042B53.5000101-hi6Y0CQ0nG0@public.gmane.org>
  2015-03-14 14:32           ` Ryusuke Konishi
  1 sibling, 1 reply; 36+ messages in thread
From: Andreas Rohner @ 2015-03-14 12:36 UTC (permalink / raw)
  To: Ryusuke Konishi; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On 2015-03-14 04:51, Ryusuke Konishi wrote:
> On Tue, 24 Feb 2015 20:01:44 +0100, Andreas Rohner wrote:
>> It doesn't really matter if the number of reclaimable blocks for a
>> segment is inaccurate, as long as the overall performance is better than
>> the simple timestamp algorithm and starvation is prevented.
>>
>> The following steps will lead to starvation of a segment:
>>
>> 1. The segment is written
>> 2. A snapshot is created
>> 3. The files in the segment are deleted and the number of live
>>    blocks for the segment is decremented to a very low value
>> 4. The GC tries to free the segment, but there are no reclaimable
>>    blocks, because they are all protected by the snapshot. To prevent an
>>    infinite loop the GC has to adjust the number of live blocks to the
>>    correct value.
>> 5. The snapshot is converted to a checkpoint and the blocks in the
>>    segment are now reclaimable.
>> 6. The GC will never attempt to clean the segment again, because it
>>    incorrectly shows up as having a high number of live blocks.
>>
>> To prevent this, the already existing padding field of the SUFILE entry
>> is used to track the number of snapshot blocks in the segment. This
>> number is only set by the GC, since it collects the necessary
>> information anyway. So there is no need to track which block belongs to
>> which segment. In step 4 of the list above the GC will set the new field
>> su_nsnapshot_blks. In step 5 all entries in the SUFILE are checked and
>> entries with a big su_nsnapshot_blks field get their su_nlive_blks field
>> reduced.
>>
>> Signed-off-by: Andreas Rohner <andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
>> ---
>>  fs/nilfs2/cpfile.c        |   5 ++
>>  fs/nilfs2/segbuf.c        |   1 +
>>  fs/nilfs2/segbuf.h        |   1 +
>>  fs/nilfs2/segment.c       |   7 ++-
>>  fs/nilfs2/sufile.c        | 114 ++++++++++++++++++++++++++++++++++++++++++----
>>  fs/nilfs2/sufile.h        |   4 +-
>>  fs/nilfs2/the_nilfs.h     |   7 +++
>>  include/linux/nilfs2_fs.h |  12 +++--
>>  8 files changed, 136 insertions(+), 15 deletions(-)
>>
>> diff --git a/fs/nilfs2/cpfile.c b/fs/nilfs2/cpfile.c
>> index 0d58075..6b61fd7 100644
>> --- a/fs/nilfs2/cpfile.c
>> +++ b/fs/nilfs2/cpfile.c
>> @@ -28,6 +28,7 @@
>>  #include <linux/nilfs2_fs.h>
>>  #include "mdt.h"
>>  #include "cpfile.h"
>> +#include "sufile.h"
>>  
>>  
>>  static inline unsigned long
>> @@ -703,6 +704,7 @@ static int nilfs_cpfile_clear_snapshot(struct inode *cpfile, __u64 cno)
>>  	struct nilfs_cpfile_header *header;
>>  	struct nilfs_checkpoint *cp;
>>  	struct nilfs_snapshot_list *list;
>> +	struct the_nilfs *nilfs = cpfile->i_sb->s_fs_info;
>>  	__u64 next, prev;
>>  	void *kaddr;
>>  	int ret;
>> @@ -784,6 +786,9 @@ static int nilfs_cpfile_clear_snapshot(struct inode *cpfile, __u64 cno)
>>  	mark_buffer_dirty(header_bh);
>>  	nilfs_mdt_mark_dirty(cpfile);
>>  
>> +	if (nilfs_feature_track_snapshots(nilfs))
>> +		nilfs_sufile_fix_starving_segs(nilfs->ns_sufile);
>> +
>>  	brelse(prev_bh);
>>  
>>   out_next:
>> diff --git a/fs/nilfs2/segbuf.c b/fs/nilfs2/segbuf.c
>> index bbd807b..a98c576 100644
>> --- a/fs/nilfs2/segbuf.c
>> +++ b/fs/nilfs2/segbuf.c
>> @@ -59,6 +59,7 @@ struct nilfs_segment_buffer *nilfs_segbuf_new(struct super_block *sb)
>>  	segbuf->sb_super_root = NULL;
>>  	segbuf->sb_nlive_blks_added = 0;
>>  	segbuf->sb_nlive_blks_diff = 0;
>> +	segbuf->sb_nsnapshot_blks = 0;
>>  
>>  	init_completion(&segbuf->sb_bio_event);
>>  	atomic_set(&segbuf->sb_err, 0);
>> diff --git a/fs/nilfs2/segbuf.h b/fs/nilfs2/segbuf.h
>> index 4e994f7..7a462c4 100644
>> --- a/fs/nilfs2/segbuf.h
>> +++ b/fs/nilfs2/segbuf.h
>> @@ -85,6 +85,7 @@ struct nilfs_segment_buffer {
>>  	unsigned		sb_rest_blocks;
>>  	__u32			sb_nlive_blks_added;
>>  	__s64			sb_nlive_blks_diff;
>> +	__u32			sb_nsnapshot_blks;
>>  
>>  	/* Buffers */
>>  	struct list_head	sb_segsum_buffers;
>> diff --git a/fs/nilfs2/segment.c b/fs/nilfs2/segment.c
>> index 16c7c36..b976198 100644
>> --- a/fs/nilfs2/segment.c
>> +++ b/fs/nilfs2/segment.c
>> @@ -1381,6 +1381,7 @@ static void nilfs_segctor_update_segusage(struct nilfs_sc_info *sci,
>>  			(segbuf->sb_pseg_start - segbuf->sb_fseg_start);
>>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>>  						     live_blocks,
>> +						     segbuf->sb_nsnapshot_blks,
>>  						     sci->sc_seg_ctime);
>>  		WARN_ON(ret); /* always succeed because the segusage is dirty */
>>  
>> @@ -1405,7 +1406,7 @@ static void nilfs_cancel_segusage(struct list_head *logs,
>>  	segbuf = NILFS_FIRST_SEGBUF(logs);
>>  	ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>>  					     segbuf->sb_pseg_start -
>> -					     segbuf->sb_fseg_start, 0);
>> +					     segbuf->sb_fseg_start, 0, 0);
>>  	WARN_ON(ret); /* always succeed because the segusage is dirty */
>>  
>>  	if (nilfs_feature_track_live_blks(nilfs))
>> @@ -1416,7 +1417,7 @@ static void nilfs_cancel_segusage(struct list_head *logs,
>>  
>>  	list_for_each_entry_continue(segbuf, logs, sb_list) {
>>  		ret = nilfs_sufile_set_segment_usage(sufile, segbuf->sb_segnum,
>> -						     0, 0);
>> +						     0, 0, 0);
>>  		WARN_ON(ret); /* always succeed */
>>  	}
>>  }
>> @@ -1521,6 +1522,8 @@ static void nilfs_segctor_dec_nlive_blks_gc(struct inode *dat,
>>  
>>  	if (!buffer_nilfs_snapshot(bh) && isreclaimable)
>>  		segbuf->sb_nlive_blks_diff--;
>> +	if (buffer_nilfs_snapshot(bh))
>> +		segbuf->sb_nsnapshot_blks++;
>>  }
>>  
>>  /**
>> diff --git a/fs/nilfs2/sufile.c b/fs/nilfs2/sufile.c
>> index 574a77e..a6dc7bf 100644
>> --- a/fs/nilfs2/sufile.c
>> +++ b/fs/nilfs2/sufile.c
>> @@ -468,7 +468,7 @@ void nilfs_sufile_do_scrap(struct inode *sufile, __u64 *data,
>>  	su->su_flags = cpu_to_le32(1UL << NILFS_SEGMENT_USAGE_DIRTY);
>>  	if (nilfs_sufile_ext_supported(sufile)) {
>>  		su->su_nlive_blks = cpu_to_le32(0);
>> -		su->su_pad = cpu_to_le32(0);
>> +		su->su_nsnapshot_blks = cpu_to_le32(0);
>>  		su->su_nlive_lastmod = cpu_to_le64(0);
>>  	}
>>  	kunmap_atomic(kaddr);
>> @@ -538,7 +538,8 @@ int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum)
>>   * @modtime: modification time (option)
>>   */
>>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>> -				   unsigned long nblocks, time_t modtime)
>> +				   unsigned long nblocks, __u32 nsnapshot_blks,
>> +				   time_t modtime)
>>  {
>>  	struct buffer_head *bh;
>>  	struct nilfs_segment_usage *su;
>> @@ -556,9 +557,18 @@ int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>>  	if (modtime)
>>  		su->su_lastmod = cpu_to_le64(modtime);
>>  	su->su_nblocks = cpu_to_le32(nblocks);
>> -	if (nilfs_sufile_ext_supported(sufile) &&
>> -	    nblocks < le32_to_cpu(su->su_nlive_blks))
>> -		su->su_nlive_blks = su->su_nblocks;
>> +	if (nilfs_sufile_ext_supported(sufile)) {
>> +		if (nblocks < le32_to_cpu(su->su_nlive_blks))
>> +			su->su_nlive_blks = su->su_nblocks;
>> +
>> +		nsnapshot_blks += le32_to_cpu(su->su_nsnapshot_blks);
>> +
>> +		if (nblocks < nsnapshot_blks)
>> +			nsnapshot_blks = nblocks;
>> +
>> +		su->su_nsnapshot_blks = cpu_to_le32(nsnapshot_blks);
>> +	}
>> +
>>  	kunmap_atomic(kaddr);
>>  
>>  	mark_buffer_dirty(bh);
>> @@ -891,7 +901,7 @@ ssize_t nilfs_sufile_get_suinfo(struct inode *sufile, __u64 segnum, void *buf,
>>  
>>  			if (sisz >= NILFS_EXT_SUINFO_SIZE) {
>>  				si->sui_nlive_blks = nlb;
>> -				si->sui_pad = 0;
>> +				si->sui_nsnapshot_blks = 0;
>>  				si->sui_nlive_lastmod = lm;
>>  			}
>>  		}
>> @@ -939,6 +949,7 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>>  	int ret = 0;
>>  	bool sup_ext = (supsz >= NILFS_EXT_SUINFO_UPDATE_SIZE);
>>  	bool su_ext = nilfs_sufile_ext_supported(sufile);
>> +	bool supsu_ext = sup_ext && su_ext;
>>  
>>  	if (unlikely(nsup == 0))
>>  		return ret;
>> @@ -952,6 +963,10 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>>  				nilfs->ns_blocks_per_segment)
>>  			|| (nilfs_suinfo_update_nlive_blks(sup) && sup_ext &&
>>  				sup->sup_sui.sui_nlive_blks >
>> +				nilfs->ns_blocks_per_segment)
>> +			|| (nilfs_suinfo_update_nsnapshot_blks(sup) &&
>> +				sup_ext &&
>> +				sup->sup_sui.sui_nsnapshot_blks >
>>  				nilfs->ns_blocks_per_segment))
>>  			return -EINVAL;
>>  	}
>> @@ -979,11 +994,15 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>>  		if (nilfs_suinfo_update_nblocks(sup))
>>  			su->su_nblocks = cpu_to_le32(sup->sup_sui.sui_nblocks);
>>  
>> -		if (nilfs_suinfo_update_nlive_blks(sup) && sup_ext && su_ext)
>> +		if (nilfs_suinfo_update_nlive_blks(sup) && supsu_ext)
>>  			su->su_nlive_blks =
>>  				cpu_to_le32(sup->sup_sui.sui_nlive_blks);
>>  
>> -		if (nilfs_suinfo_update_nlive_lastmod(sup) && sup_ext && su_ext)
>> +		if (nilfs_suinfo_update_nsnapshot_blks(sup) && supsu_ext)
>> +			su->su_nsnapshot_blks =
>> +				cpu_to_le32(sup->sup_sui.sui_nsnapshot_blks);
>> +
>> +		if (nilfs_suinfo_update_nlive_lastmod(sup) && supsu_ext)
>>  			su->su_nlive_lastmod =
>>  				cpu_to_le64(sup->sup_sui.sui_nlive_lastmod);
>>  
>> @@ -1050,6 +1069,85 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>>  }
>>  
>>  /**
>> + * nilfs_sufile_fix_starving_segs - fix potentially starving segments
>> + * @sufile: inode of segment usage file
>> + *
>> + * Description: Scans for segments, which are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This way the segment is more likely to be
>> + * chosen by the GC. A segment is marked as potentially starving, if more
>> + * than half of the blocks it contains are protected by snapshots.
>> + *
>> + * Return Value: On success, 0 is returned and on error, one of the
>> + * following negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +int nilfs_sufile_fix_starving_segs(struct inode *sufile)
>> +{
>> +	struct buffer_head *su_bh;
>> +	struct nilfs_segment_usage *su;
>> +	size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
>> +	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
>> +	void *kaddr;
>> +	unsigned long nsegs, segusages_per_block;
>> +	__u32 max_segblks = nilfs->ns_blocks_per_segment / 2;
>> +	__u64 segnum = 0;
>> +	int ret = 0, blkdirty, dirty = 0;
>> +
>> +	down_write(&NILFS_MDT(sufile)->mi_sem);
>> +
>> +	segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
>> +	nsegs = nilfs_sufile_get_nsegments(sufile);
>> +
>> +	while (segnum < nsegs) {
>> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
>> +							 nsegs - 1);
>> +
>> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
>> +							   0, &su_bh);
>> +		if (ret < 0) {
>> +			if (ret != -ENOENT)
>> +				goto out;
>> +			/* hole */
>> +			segnum += n;
>> +			continue;
>> +		}
>> +
>> +		kaddr = kmap_atomic(su_bh->b_page);
>> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
>> +							  su_bh, kaddr);
>> +		blkdirty = 0;
>> +		for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
>> +			if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
>> +				continue;
>> +
>> +			if (su->su_nlive_blks <= max_segblks)
>> +				continue;
>> +
>> +			su->su_nlive_blks = max_segblks;
>> +			blkdirty = 1;
>> +		}
>> +
>> +		kunmap_atomic(kaddr);
>> +		if (blkdirty) {
>> +			mark_buffer_dirty(su_bh);
>> +			dirty = 1;
>> +		}
>> +		put_bh(su_bh);
>> +	}
>> +
>> +out:
>> +	if (dirty)
>> +		nilfs_mdt_mark_dirty(sufile);
>> +
>> +	up_write(&NILFS_MDT(sufile)->mi_sem);
>> +	return ret;
>> +}
>> +
>> +/**
>>   * nilfs_sufile_trim_fs() - trim ioctl handle function
>>   * @sufile: inode of segment usage file
>>   * @range: fstrim_range structure
>> diff --git a/fs/nilfs2/sufile.h b/fs/nilfs2/sufile.h
>> index ae3c52a..e831622 100644
>> --- a/fs/nilfs2/sufile.h
>> +++ b/fs/nilfs2/sufile.h
>> @@ -45,7 +45,8 @@ int nilfs_sufile_set_alloc_range(struct inode *sufile, __u64 start, __u64 end);
>>  int nilfs_sufile_alloc(struct inode *, __u64 *);
>>  int nilfs_sufile_mark_dirty(struct inode *sufile, __u64 segnum);
>>  int nilfs_sufile_set_segment_usage(struct inode *sufile, __u64 segnum,
>> -				   unsigned long nblocks, time_t modtime);
>> +				   unsigned long nblocks, __u32 nsnapshot_blks,
>> +				   time_t modtime);
>>  int nilfs_sufile_get_stat(struct inode *, struct nilfs_sustat *);
>>  ssize_t nilfs_sufile_get_suinfo(struct inode *, __u64, void *, unsigned,
>>  				size_t);
>> @@ -72,6 +73,7 @@ int nilfs_sufile_resize(struct inode *sufile, __u64 newnsegs);
>>  int nilfs_sufile_read(struct super_block *sb, size_t susize,
>>  		      struct nilfs_inode *raw_inode, struct inode **inodep);
>>  int nilfs_sufile_trim_fs(struct inode *sufile, struct fstrim_range *range);
>> +int nilfs_sufile_fix_starving_segs(struct inode *);
>>  
>>  /**
>>   * nilfs_sufile_scrap - make a segment garbage
>> diff --git a/fs/nilfs2/the_nilfs.h b/fs/nilfs2/the_nilfs.h
>> index 87cab10..3d495f1 100644
>> --- a/fs/nilfs2/the_nilfs.h
>> +++ b/fs/nilfs2/the_nilfs.h
>> @@ -409,4 +409,11 @@ static inline int nilfs_feature_track_live_blks(struct the_nilfs *nilfs)
>>  		NILFS_FEATURE_COMPAT_SUFILE_EXTENSION);
>>  }
>>  
>> +static inline int nilfs_feature_track_snapshots(struct the_nilfs *nilfs)
>> +{
>> +	return (nilfs->ns_feature_compat &
>> +		NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS) &&
>> +		nilfs_feature_track_live_blks(nilfs);
>> +}
>> +
>>  #endif /* _THE_NILFS_H */
>> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
>> index 6ffdc09..a3c7593 100644
>> --- a/include/linux/nilfs2_fs.h
>> +++ b/include/linux/nilfs2_fs.h
>> @@ -222,11 +222,13 @@ struct nilfs_super_block {
>>   */
>>  #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
>>  #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
>> +#define NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS		(1ULL << 2)
>>  
>>  #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
>>  
>>  #define NILFS_FEATURE_COMPAT_SUPP	(NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
>> -				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
>> +				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS \
>> +				| NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS)
>>  #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>>  #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
>>  
> 
> You don't have to add three compat flags just for this one patchset.
> Please unify it.
> 
> #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 0)
> 
> looks to be enough.

I could merge the TRACK_LIVE_BLKS and TRACK_SNAPSHOTS flags, but I would
suggest keeping at least the SUFILE_EXTENSION flag (maybe under a
different name). The SUFILE_EXTENSION flag has to be set at mkfs time
and cannot be set or removed later, because the on-disk format cannot
be changed afterwards. I actually set SUFILE_EXTENSION by default in mkfs,
because it is not harmful and it gives the user the option to switch the
other flags on later.

Regards,
Andreas Rohner

> Regards,
> Ryusuke Konishi
> 
> 
>> @@ -630,7 +632,7 @@ struct nilfs_segment_usage {
>>  	__le32 su_nblocks;
>>  	__le32 su_flags;
>>  	__le32 su_nlive_blks;
>> -	__le32 su_pad;
>> +	__le32 su_nsnapshot_blks;
>>  	__le64 su_nlive_lastmod;
>>  };
>>  
>> @@ -682,7 +684,7 @@ nilfs_segment_usage_set_clean(struct nilfs_segment_usage *su, size_t susz)
>>  	su->su_flags = cpu_to_le32(0);
>>  	if (susz >= NILFS_EXT_SEGMENT_USAGE_SIZE) {
>>  		su->su_nlive_blks = cpu_to_le32(0);
>> -		su->su_pad = cpu_to_le32(0);
>> +		su->su_nsnapshot_blks = cpu_to_le32(0);
>>  		su->su_nlive_lastmod = cpu_to_le64(0);
>>  	}
>>  }
>> @@ -723,7 +725,7 @@ struct nilfs_suinfo {
>>  	__u32 sui_nblocks;
>>  	__u32 sui_flags;
>>  	__u32 sui_nlive_blks;
>> -	__u32 sui_pad;
>> +	__u32 sui_nsnapshot_blks;
>>  	__u64 sui_nlive_lastmod;
>>  };
>>  
>> @@ -770,6 +772,7 @@ enum {
>>  	NILFS_SUINFO_UPDATE_FLAGS,
>>  	NILFS_SUINFO_UPDATE_NLIVE_BLKS,
>>  	NILFS_SUINFO_UPDATE_NLIVE_LASTMOD,
>> +	NILFS_SUINFO_UPDATE_NSNAPSHOT_BLKS,
>>  	__NR_NILFS_SUINFO_UPDATE_FIELDS,
>>  };
>>  
>> @@ -794,6 +797,7 @@ NILFS_SUINFO_UPDATE_FNS(LASTMOD, lastmod)
>>  NILFS_SUINFO_UPDATE_FNS(NBLOCKS, nblocks)
>>  NILFS_SUINFO_UPDATE_FNS(FLAGS, flags)
>>  NILFS_SUINFO_UPDATE_FNS(NLIVE_BLKS, nlive_blks)
>> +NILFS_SUINFO_UPDATE_FNS(NSNAPSHOT_BLKS, nsnapshot_blks)
>>  NILFS_SUINFO_UPDATE_FNS(NLIVE_LASTMOD, nlive_lastmod)
>>  
>>  #define NILFS_MIN_SUINFO_UPDATE_SIZE	\
>> -- 
>> 2.3.0
>>
> 



* Re: [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]             ` <55042B53.5000101-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14 12:49               ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14 12:49 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sat, 14 Mar 2015 13:36:35 +0100, Andreas Rohner wrote:
> On 2015-03-14 04:51, Ryusuke Konishi wrote:
>> On Tue, 24 Feb 2015 20:01:44 +0100, Andreas Rohner wrote:
>>> diff --git a/include/linux/nilfs2_fs.h b/include/linux/nilfs2_fs.h
>>> index 6ffdc09..a3c7593 100644
>>> --- a/include/linux/nilfs2_fs.h
>>> +++ b/include/linux/nilfs2_fs.h
>>> @@ -222,11 +222,13 @@ struct nilfs_super_block {
>>>   */
>>>  #define NILFS_FEATURE_COMPAT_SUFILE_EXTENSION		(1ULL << 0)
>>>  #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 1)
>>> +#define NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS		(1ULL << 2)
>>>  
>>>  #define NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT		(1ULL << 0)
>>>  
>>>  #define NILFS_FEATURE_COMPAT_SUPP	(NILFS_FEATURE_COMPAT_SUFILE_EXTENSION \
>>> -				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS)
>>> +				| NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS \
>>> +				| NILFS_FEATURE_COMPAT_TRACK_SNAPSHOTS)
>>>  #define NILFS_FEATURE_COMPAT_RO_SUPP	NILFS_FEATURE_COMPAT_RO_BLOCK_COUNT
>>>  #define NILFS_FEATURE_INCOMPAT_SUPP	0ULL
>>>  
>> 
>> You don't have to add three compat flags just for this one patchset.
>> Please unify it.
>> 
>> #define NILFS_FEATURE_COMPAT_TRACK_LIVE_BLKS		(1ULL << 0)
>> 
>> looks to be enough.
> 
> I could merge the TRACK_LIVE_BLKS and TRACK_SNAPSHOTS flags, but I would
> suggest keeping at least the SUFILE_EXTENSION flag (maybe under a
> different name). The SUFILE_EXTENSION flag has to be set at mkfs time
> and cannot be set or removed later, because the on-disk format cannot
> be changed afterwards. I actually set SUFILE_EXTENSION by default in mkfs,
> because it is not harmful and it gives the user the option to switch the
> other flags on later.

I see, it sounds reasonable.

Regards,
Ryusuke Konishi


* Re: [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots
       [not found]         ` <20150314.125109.1017248837083480553.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
  2015-03-14 12:36           ` Andreas Rohner
@ 2015-03-14 14:32           ` Ryusuke Konishi
  1 sibling, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14 14:32 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA


One more comment.

On Sat, 14 Mar 2015 12:51:09 +0900 (JST), Ryusuke Konishi wrote:
> On Tue, 24 Feb 2015 20:01:44 +0100, Andreas Rohner wrote:
>> @@ -1050,6 +1069,85 @@ ssize_t nilfs_sufile_set_suinfo(struct inode *sufile, void *buf,
>>  }
>>  
>>  /**
>> + * nilfs_sufile_fix_starving_segs - fix potentially starving segments
>> + * @sufile: inode of segment usage file
>> + *
>> + * Description: Scans for segments, which are potentially starving and
>> + * reduces the number of live blocks to less than half of the maximum
>> + * number of blocks in a segment. This way the segment is more likely to be
>> + * chosen by the GC. A segment is marked as potentially starving, if more
>> + * than half of the blocks it contains are protected by snapshots.
>> + *
>> + * Return Value: On success, 0 is returned and on error, one of the
>> + * following negative error codes is returned.
>> + *
>> + * %-EIO - I/O error.
>> + *
>> + * %-ENOMEM - Insufficient amount of memory available.
>> + */
>> +int nilfs_sufile_fix_starving_segs(struct inode *sufile)
>> +{
>> +	struct buffer_head *su_bh;
>> +	struct nilfs_segment_usage *su;
>> +	size_t n, i, susz = NILFS_MDT(sufile)->mi_entry_size;
>> +	struct the_nilfs *nilfs = sufile->i_sb->s_fs_info;
>> +	void *kaddr;
>> +	unsigned long nsegs, segusages_per_block;
>> +	__u32 max_segblks = nilfs->ns_blocks_per_segment / 2;
>> +	__u64 segnum = 0;
>> +	int ret = 0, blkdirty, dirty = 0;
>> +
>> +	down_write(&NILFS_MDT(sufile)->mi_sem);
>> +
>> +	segusages_per_block = nilfs_sufile_segment_usages_per_block(sufile);
>> +	nsegs = nilfs_sufile_get_nsegments(sufile);
>> +
>> +	while (segnum < nsegs) {
>> +		n = nilfs_sufile_segment_usages_in_block(sufile, segnum,
>> +							 nsegs - 1);
>> +
>> +		ret = nilfs_sufile_get_segment_usage_block(sufile, segnum,
>> +							   0, &su_bh);
>> +		if (ret < 0) {
>> +			if (ret != -ENOENT)
>> +				goto out;
>> +			/* hole */
>> +			segnum += n;
>> +			continue;
>> +		}
>> +
>> +		kaddr = kmap_atomic(su_bh->b_page);
>> +		su = nilfs_sufile_block_get_segment_usage(sufile, segnum,
>> +							  su_bh, kaddr);
>> +		blkdirty = 0;
>> +		for (i = 0; i < n; ++i, ++segnum, su = (void *)su + susz) {
>> +			if (le32_to_cpu(su->su_nsnapshot_blks) <= max_segblks)
>> +				continue;
>> +
>> +			if (su->su_nlive_blks <= max_segblks)
>> +				continue;
>> +
>> +			su->su_nlive_blks = max_segblks;
>> +			blkdirty = 1;
>> +		}
>> +
>> +		kunmap_atomic(kaddr);
>> +		if (blkdirty) {
>> +			mark_buffer_dirty(su_bh);
>> +			dirty = 1;
>> +		}
>> +		put_bh(su_bh);

Insert cond_resched() here to mitigate the latency issue (mainly for
environments in which voluntary preemption is turned off).
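
As a rough illustration of where the yield point sits, here is a
userspace model of the scan loop. It is only a sketch: the segment
usage entries are plain host-order integers (no byte-order conversion
is modelled), there is no buffer handling, and cond_resched_stub() is a
hypothetical stand-in that merely counts how often the kernel's
cond_resched() would be reached.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified, host-order model of the two counters used by the scan. */
struct seg_usage {
	uint32_t nlive_blks;
	uint32_t nsnapshot_blks;
};

/* Hypothetical stand-in for cond_resched(); it only counts the calls. */
static unsigned long yield_points;

static void cond_resched_stub(void)
{
	yield_points++;
}

/*
 * Walk the entries one "block" (per_block entries) at a time, clamp the
 * live block counter of potentially starving segments, and offer to
 * yield the CPU once per block. Returns the number of entries changed.
 */
static size_t fix_starving(struct seg_usage *su, size_t nsegs,
			   size_t per_block, uint32_t blocks_per_segment)
{
	uint32_t max_segblks = blocks_per_segment / 2;
	size_t segnum = 0, fixed = 0;

	assert(per_block > 0);

	while (segnum < nsegs) {
		size_t left = nsegs - segnum;
		size_t n = left < per_block ? left : per_block;
		size_t i;

		for (i = 0; i < n; ++i, ++segnum) {
			if (su[segnum].nsnapshot_blks <= max_segblks)
				continue;
			if (su[segnum].nlive_blks <= max_segblks)
				continue;
			su[segnum].nlive_blks = max_segblks;
			fixed++;
		}

		/* one scheduling point per processed block */
		cond_resched_stub();
	}
	return fixed;
}
```

The point of calling it once per block rather than once per entry is
that the cost of the check stays negligible compared with the work done
on each block.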

Regards,
Ryusuke Konishi

>> +	}
>> +
>> +out:
>> +	if (dirty)
>> +		nilfs_mdt_mark_dirty(sufile);
>> +
>> +	up_write(&NILFS_MDT(sufile)->mi_sem);
>> +	return ret;
>> +}
>> +
>> +/**
>>   * nilfs_sufile_trim_fs() - trim ioctl handle function
>>   * @sufile: inode of segment usage file
>>   * @range: fstrim_range structure


* Re: [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy
       [not found]                     ` <55042879.90701-hi6Y0CQ0nG0@public.gmane.org>
@ 2015-03-14 15:40                       ` Ryusuke Konishi
  0 siblings, 0 replies; 36+ messages in thread
From: Ryusuke Konishi @ 2015-03-14 15:40 UTC (permalink / raw)
  To: Andreas Rohner; +Cc: linux-nilfs-u79uwXL29TY76Z2rM5mHXA

On Sat, 14 Mar 2015 13:24:25 +0100, Andreas Rohner wrote:
> Hi Ryusuke,
> 
> Thank you very much for your detailed review and feedback. I agree with
> all of your points and I will start working on a rewrite immediately.
> 
> On 2015-03-12 13:54, Ryusuke Konishi wrote:
>> Hi Andreas,
>> 
>> On Tue, 10 Mar 2015 21:37:50 +0100, Andreas Rohner wrote:
>>> Hi Ryusuke,
>>>
>>> Thanks for your thorough review.
>>>
>>> On 2015-03-10 06:21, Ryusuke Konishi wrote:
>>>> Hi Andreas,
>>>>
>>>> I looked through all of the kernel patches and part of the util
>>>> patches.  Overall comments are as follows:
>>>>
>>>> [Algorithm]
>>>> As for the algorithm, it looks about OK except for the starvation
>>>> countermeasure.  The starvation countermeasure looks ad hoc/hacky, but
>>>> it's good that it doesn't change the kernel/userland interface; we may
>>>> be able to replace it with a better approach in the future or in a
>>>> revised version of this patchset.
>>>>
>>>> (1) Drawback of the starvation countermeasure
>>>>     Patch 9/9 appears to make the execution time of the chcp operation
>>>>     worse, since it scans through the sufile to modify live block
>>>>     counters.  How much does it prolong the execution time?
>>>
>>> I'll do some tests, but I haven't noticed any significant performance
>>> drop. The GC basically does the same thing, every time it selects
>>> segments to reclaim.
>> 
>> GC is performed in the background by an independent process.  What I
>> care about is that the NILFS_IOCTL_CHANGE_CPMODE ioctl is called from
>> a command line interface or an application.  The two differ in this
>> respect.
>> 
>> Was a worst-case scenario considered in the test?
>> 
>> For example:
>> 1. Fill a TB class drive with data file(s), and make a snapshot on it.
>> 2. Run one pass GC to update snapshot block counts.
>> 3. And do "chcp cp"
>> 
>> If we don't observe noticeable delay on this class of drive, then I
>> think we can put the problem off.
> 
> Yesterday I did a worst case test as you suggested. I used an old 1 TB
> hard drive I had lying around. This was my setup:
> 
> 1. Write a 850GB file
> 2. Create a snapshot
> 3. Delete the file
> 4. Let GC run through all segments
> 5. Verify with lssu that the GC has updated all SUFILE entries
> 6. Drop the page cache
> 7. chcp cp
> 
> The following results are with the page cache dropped immediately before
> each call:
> 
> 1. chcp ss
> real	0m1.337s
> user	0m0.017s
> sys	0m0.030s
> 
> 2. chcp cp
> real	0m6.377s
> user	0m0.023s
> sys	0m0.053s
> 
> The following results are without the drop of the page cache:
> 
> 1. chcp ss
> real	0m0.137s
> user	0m0.010s
> sys	0m0.000s
> 
> 2. chcp cp
> real	0m0.016s
> user	0m0.010s
> sys	0m0.007s
> 
> There are 119233 segments in my test. Each SUFILE entry uses 32 bytes.
> So the worst case for 1 TB with 8 MB segments would be 3.57 MB of random
> reads and one 3.57 MB continuous write. You only get 6.377s because my
> hard drive is so slow. You wouldn't notice any difference on a modern
> SSD. Furthermore the SUFILE is also scanned by the segment allocation
> algorithm and the GC, so it is very likely already in the page cache.
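
For reference, the size estimate above can be reproduced with a few
lines of C. This is only a back-of-the-envelope sketch: the struct is a
host-order model of the extended segment usage entry (the field names
follow the patch, the struct name is made up), and the SUFILE header
and block-size rounding are ignored. With 119233 entries of 32 bytes
each it gives 3,815,456 bytes, i.e. roughly 3.8 MB (about 3.64 MiB),
which is in the same range as the 3.57 MB quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Host-order model of the extended on-disk entry: 8+4+4+4+4+8 bytes. */
struct seg_usage_model {
	uint64_t su_lastmod;
	uint32_t su_nblocks;
	uint32_t su_flags;
	uint32_t su_nlive_blks;
	uint32_t su_nsnapshot_blks;
	uint64_t su_nlive_lastmod;
};

/* Bytes occupied by the entries alone, without header or padding. */
static uint64_t sufile_entry_bytes(uint64_t nsegs, uint64_t entry_size)
{
	assert(entry_size > 0);
	return nsegs * entry_size;
}
```

Calling sufile_entry_bytes(119233, sizeof(struct seg_usage_model))
yields the figure given above.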

6.377s is too long, because nilfs_sufile_fix_starving_segs() holds the
sufile mi_sem for the whole scan and also lengthens the hold time of
the following locks:

 - cpfile mi_sem (held at nilfs_cpfile_clear_snapshot()).
 - transaction lock (held at nilfs_ioctl_change_cpmode()).
 - ns_snapshot_mount_mutex (held at nilfs_ioctl_change_cpmode()).

This freezes all write operations, lssu, lscp, cleanerd, snapshot
mounts, etc. for that period.

It would be preferable to move the function outside of these locks and
to have it release and reacquire the transaction lock and the sufile
mi_sem at regular intervals in some way.
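
One possible shape of such a scan, sketched in userspace C: the work is
split into batches, the lock is held only for one batch at a time, and
the position is carried across batches in a cursor. Everything here is
hypothetical (struct fake_lock, the helpers and the batch size are made
up for illustration, and the fake lock only counts acquisitions); it is
not a proposal for the actual locking order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lock model that records how often it was taken. */
struct fake_lock {
	int held;
	unsigned long acquisitions;
};

static void lock_acquire(struct fake_lock *l)
{
	assert(!l->held);
	l->held = 1;
	l->acquisitions++;
}

static void lock_release(struct fake_lock *l)
{
	assert(l->held);
	l->held = 0;
}

/*
 * Count the entries whose snapshot block counter exceeds the threshold,
 * holding the lock for at most 'batch' entries at a time. Other lock
 * users get a chance to run between two batches.
 */
static size_t scan_in_batches(struct fake_lock *lock,
			      const uint32_t *nsnapshot_blks, size_t nsegs,
			      size_t batch, uint32_t threshold)
{
	size_t segnum = 0, hits = 0;

	assert(batch > 0);

	while (segnum < nsegs) {
		size_t end = (nsegs - segnum < batch) ? nsegs : segnum + batch;

		lock_acquire(lock);
		for (; segnum < end; segnum++)
			if (nsnapshot_blks[segnum] > threshold)
				hits++;
		lock_release(lock);
		/* writers, lssu, lscp, cleanerd etc. could proceed here */
	}
	return hits;
}
```

A real implementation would also have to cope with entries changing
between two batches, which this model does not address.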

Regards,
Ryusuke Konishi



Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
2015-02-24 19:01 [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy Andreas Rohner
     [not found] ` <1424804504-10914-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-02-24 19:01   ` [PATCH 1/9] nilfs2: refactor nilfs_sufile_updatev() Andreas Rohner
     [not found]     ` <1424804504-10914-2-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-10 15:52       ` Ryusuke Konishi
     [not found]         ` <20150311.005220.1374468405510151934.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-10 20:40           ` Andreas Rohner
2015-02-24 19:01   ` [PATCH 2/9] nilfs2: add simple cache for modifications to SUFILE Andreas Rohner
     [not found]     ` <1424804504-10914-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  0:45       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 3/9] nilfs2: extend SUFILE on-disk format to enable counting of live blocks Andreas Rohner
     [not found]     ` <1424804504-10914-4-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  4:05       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 4/9] nilfs2: add function to modify su_nlive_blks Andreas Rohner
     [not found]     ` <1424804504-10914-5-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  4:57       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 5/9] nilfs2: add simple tracking of block deletions and updates Andreas Rohner
     [not found]     ` <1424804504-10914-6-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  3:46       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 6/9] nilfs2: use modification cache to improve performance Andreas Rohner
     [not found]     ` <1424804504-10914-7-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  1:04       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 7/9] nilfs2: add additional flags for nilfs_vdesc Andreas Rohner
     [not found]     ` <1424804504-10914-8-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  3:21       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 8/9] nilfs2: improve accuracy and correct for invalid GC values Andreas Rohner
     [not found]     ` <1424804504-10914-9-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  2:50       ` Ryusuke Konishi
2015-02-24 19:01   ` [PATCH 9/9] nilfs2: prevent starvation of segments protected by snapshots Andreas Rohner
     [not found]     ` <1424804504-10914-10-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  3:51       ` Ryusuke Konishi
     [not found]         ` <20150314.125109.1017248837083480553.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-14 12:36           ` Andreas Rohner
     [not found]             ` <55042B53.5000101-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 12:49               ` Ryusuke Konishi
2015-03-14 14:32           ` Ryusuke Konishi
2015-02-24 19:04   ` [PATCH 1/6] nilfs-utils: extend SUFILE on-disk format to enable track live blocks Andreas Rohner
     [not found]     ` <1424804659-10986-1-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-02-24 19:04       ` [PATCH 2/6] nilfs-utils: add additional flags for nilfs_vdesc Andreas Rohner
2015-02-24 19:04       ` [PATCH 3/6] nilfs-utils: add support for tracking live blocks Andreas Rohner
     [not found]         ` <1424804659-10986-3-git-send-email-andreas.rohner-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14  5:52           ` Ryusuke Konishi
2015-02-24 19:04       ` [PATCH 4/6] nilfs-utils: implement the tracking of live blocks for set_suinfo Andreas Rohner
2015-02-24 19:04       ` [PATCH 5/6] nilfs-utils: add support for greedy/cost-benefit policies Andreas Rohner
2015-02-24 19:04       ` [PATCH 6/6] nilfs-utils: add su_nsnapshot_blks field to indicate starvation Andreas Rohner
2015-02-25  0:18   ` [PATCH 0/9] nilfs2: implementation of cost-benefit GC policy Ryusuke Konishi
     [not found]     ` <20150225.091804.1850885506186316087.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-10  5:21       ` Ryusuke Konishi
     [not found]         ` <20150310.142119.813265940569588216.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-10 20:37           ` Andreas Rohner
     [not found]             ` <54FF561E.7030409-hi6Y0CQ0nG0@public.gmane.org>
2015-03-12 12:54               ` Ryusuke Konishi
     [not found]                 ` <20150312.215431.324210374799651841.konishi.ryusuke-Zyj7fXuS5i5L9jVzuh4AOg@public.gmane.org>
2015-03-14 12:24                   ` Andreas Rohner
     [not found]                     ` <55042879.90701-hi6Y0CQ0nG0@public.gmane.org>
2015-03-14 15:40                       ` Ryusuke Konishi
