* [PATCH 00/13] ceph: snapshot and multimds fixes
@ 2017-09-04  9:38 Yan, Zheng
  2017-09-04  9:38 ` [PATCH 01/13] ceph: fix null pointer dereference in ceph_flush_snaps() Yan, Zheng
                   ` (12 more replies)
  0 siblings, 13 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:38 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

Yan, Zheng (13):
  ceph: fix null pointer dereference in ceph_flush_snaps()
  ceph: fix message order check in handle_cap_export()
  ceph: handle race between vmtruncate and queuing cap snap
  ceph: queue cap snap only when snap realm's context changes
  ceph: remove stale check in ceph_invalidatepage()
  ceph: properly get capsnap's size in get_oldest_context()
  ceph: make writepage_nounlock() invalidate pages beyond EOF
  ceph: optimize pagevec iterating in ceph_writepages_start()
  ceph: cleanup local variables in ceph_writepages_start()
  ceph: fix "range cyclic" mode writepages
  ceph: ignore wbc->range_{start,end} when write back snapshot data
  ceph: fix capsnap dirty pages accounting
  ceph: wait on writeback after writing snapshot data

 fs/ceph/addr.c  | 362 +++++++++++++++++++++++++++++++++-----------------------
 fs/ceph/caps.c  |   4 +-
 fs/ceph/inode.c |  13 +-
 fs/ceph/snap.c  |  37 +++---
 4 files changed, 244 insertions(+), 172 deletions(-)

-- 
2.13.5


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 01/13] ceph: fix null pointer dereference in ceph_flush_snaps()
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
@ 2017-09-04  9:38 ` Yan, Zheng
  2017-09-04  9:38 ` [PATCH 02/13] ceph: fix message order check in handle_cap_export() Yan, Zheng
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:38 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/caps.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 662ada467c32..5daf86621871 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -1469,7 +1469,7 @@ void ceph_flush_snaps(struct ceph_inode_info *ci,
 
 	if (psession) {
 		*psession = session;
-	} else {
+	} else if (session) {
 		mutex_unlock(&session->s_mutex);
 		ceph_put_mds_session(session);
 	}
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 02/13] ceph: fix message order check in handle_cap_export()
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
  2017-09-04  9:38 ` [PATCH 01/13] ceph: fix null pointer dereference in ceph_flush_snaps() Yan, Zheng
@ 2017-09-04  9:38 ` Yan, Zheng
  2017-09-04  9:38 ` [PATCH 03/13] ceph: handle race between vmtruncate and queuing cap snap Yan, Zheng
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:38 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

If a cap for the importer MDS already exists but its cap ID does not
match, the client should have already received the corresponding import
message, because a cap's ID does not change as long as the client holds
the cap.
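
For illustration, a rough user-space model of the corrected check follows
(stand-in struct and a simplified seq_cmp(); the real code operates on
struct ceph_cap and uses ceph_seq_cmp(), not these types). The existing
import cap is only updated when the cap ID matches and the cached seq is
older:

#include <stdbool.h>
#include <stdio.h>

struct tcap { unsigned long long cap_id; unsigned seq; };

/* simplified stand-in for ceph_seq_cmp(): negative when a is older than b */
static int seq_cmp(unsigned a, unsigned b) { return (int)(a - b); }

static bool should_update(const struct tcap *tcap,
                          unsigned long long t_cap_id, unsigned t_seq)
{
	/* old (buggy) test: cap_id != t_cap_id || seq_cmp(seq, t_seq) < 0;
	 * corrected test: the IDs must match AND the cached seq be older */
	return tcap->cap_id == t_cap_id && seq_cmp(tcap->seq, t_seq) < 0;
}

int main(void)
{
	struct tcap t = { .cap_id = 42, .seq = 3 };

	/* mismatched cap id: the corresponding import message is handled
	 * separately, so the existing cap is left alone */
	printf("%d\n", should_update(&t, 43, 5)); /* prints 0 */
	/* same cap, newer seq: apply the update */
	printf("%d\n", should_update(&t, 42, 5)); /* prints 1 */
	return 0;
}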

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/caps.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 5daf86621871..7a7945032802 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3427,7 +3427,7 @@ static void handle_cap_export(struct inode *inode, struct ceph_mds_caps *ex,
 	tcap = __get_cap_for_mds(ci, target);
 	if (tcap) {
 		/* already have caps from the target */
-		if (tcap->cap_id != t_cap_id ||
+		if (tcap->cap_id == t_cap_id &&
 		    ceph_seq_cmp(tcap->seq, t_seq) < 0) {
 			dout(" updating import cap %p mds%d\n", tcap, target);
 			tcap->cap_id = t_cap_id;
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 03/13] ceph: handle race between vmtruncate and queuing cap snap
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
  2017-09-04  9:38 ` [PATCH 01/13] ceph: fix null pointer dereference in ceph_flush_snaps() Yan, Zheng
  2017-09-04  9:38 ` [PATCH 02/13] ceph: fix message order check in handle_cap_export() Yan, Zheng
@ 2017-09-04  9:38 ` Yan, Zheng
  2017-09-04  9:38 ` [PATCH 04/13] ceph: queue cap snap only when snap realm's context changes Yan, Zheng
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:38 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

It's possible to create a cap snap while there is a pending
vmtruncate (the truncate hasn't been processed by the worker thread
yet). In that case we should truncate the dirty pages beyond
capsnap->size.
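
As a rough sketch of the size calculation this patch adds (a user-space
model with a plain array standing in for ci->i_cap_snaps; not the kernel
code), the truncation point starts at i_truncate_size and is pushed out to
cover any capsnap that still has dirty pages:

#include <stdio.h>

struct capsnap { unsigned long long size; int dirty_pages; };

static unsigned long long truncate_point(unsigned long long i_truncate_size,
                                         const struct capsnap *snaps, int n)
{
	unsigned long long to = i_truncate_size;
	int i;

	/* start from i_truncate_size and push the truncation point out to
	 * cover any capsnap that still has dirty pages to write back */
	for (i = 0; i < n; i++)
		if (snaps[i].dirty_pages && snaps[i].size > to)
			to = snaps[i].size;
	return to;
}

int main(void)
{
	struct capsnap snaps[] = { { 8192, 1 }, { 4096, 0 } };

	/* a truncate to 4096 is pending, but one capsnap still has dirty
	 * pages up to 8192, so only pages beyond 8192 may be dropped */
	printf("%llu\n", truncate_point(4096, snaps, 2)); /* prints 8192 */
	return 0;
}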

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/inode.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index a19fafdf87f8..373dab5173ca 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1833,9 +1833,20 @@ void __ceph_do_pending_vmtruncate(struct inode *inode)
 	 * possibly truncate them.. so write AND block!
 	 */
 	if (ci->i_wrbuffer_ref_head < ci->i_wrbuffer_ref) {
+		struct ceph_cap_snap *capsnap;
+		to = ci->i_truncate_size;
+		list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
+			// MDS should have revoked Frw caps
+			WARN_ON_ONCE(capsnap->writing);
+			if (capsnap->dirty_pages && capsnap->size > to)
+				to = capsnap->size;
+		}
+		spin_unlock(&ci->i_ceph_lock);
 		dout("__do_pending_vmtruncate %p flushing snaps first\n",
 		     inode);
-		spin_unlock(&ci->i_ceph_lock);
+
+		truncate_pagecache(inode, to);
+
 		filemap_write_and_wait_range(&inode->i_data, 0,
 					     inode->i_sb->s_maxbytes);
 		goto retry;
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 04/13] ceph: queue cap snap only when snap realm's context changes
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (2 preceding siblings ...)
  2017-09-04  9:38 ` [PATCH 03/13] ceph: handle race between vmtruncate and queuing cap snap Yan, Zheng
@ 2017-09-04  9:38 ` Yan, Zheng
  2017-09-04  9:39 ` [PATCH 05/13] ceph: remove stale check in ceph_invalidatepage() Yan, Zheng
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:38 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

If we create a capsnap when the snap realm's context has not changed,
the new capsnap's snapc is equal to ci->i_head_snapc. The page
writeback code then can't differentiate dirty pages associated with
the new capsnap from dirty pages associated with i_head_snapc.
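
To illustrate why a distinct snapc matters, here is a small user-space
model (hypothetical structs; in the kernel the per-page context is reached
via page_snap_context()): writeback attributes a dirty page to a capsnap or
to the head purely by snapc pointer identity, which only works when the two
contexts are different objects:

#include <stdio.h>

struct snap_context { unsigned long long seq; };
/* stand-in for page_snap_context(): the snapc a page was dirtied under */
struct page { const struct snap_context *snapc; };

/* writeback picks pages for a given snapc purely by pointer identity */
static int page_matches(const struct page *pg,
                        const struct snap_context *snapc)
{
	return pg->snapc == snapc;
}

int main(void)
{
	struct snap_context capsnap_ctx = { .seq = 4 }; /* frozen at snapshot */
	struct snap_context head_snapc  = { .seq = 5 }; /* live head context */
	struct page old_dirty = { .snapc = &capsnap_ctx }; /* dirtied pre-snap */
	struct page new_dirty = { .snapc = &head_snapc };  /* dirtied post-snap */

	/* distinct contexts make the split unambiguous; had the capsnap been
	 * queued without a context change, both pages would carry the same
	 * pointer and old snapshot data could be mixed with new writes */
	printf("%d %d\n", page_matches(&old_dirty, &capsnap_ctx),
	       page_matches(&new_dirty, &capsnap_ctx)); /* prints 1 0 */
	return 0;
}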

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/snap.c | 37 ++++++++++++++++---------------------
 1 file changed, 16 insertions(+), 21 deletions(-)

diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index dab5d6732345..1ffc8b426c1c 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -299,7 +299,8 @@ static int cmpu64_rev(const void *a, const void *b)
 /*
  * build the snap context for a given realm.
  */
-static int build_snap_context(struct ceph_snap_realm *realm)
+static int build_snap_context(struct ceph_snap_realm *realm,
+			      struct list_head* dirty_realms)
 {
 	struct ceph_snap_realm *parent = realm->parent;
 	struct ceph_snap_context *snapc;
@@ -313,7 +314,7 @@ static int build_snap_context(struct ceph_snap_realm *realm)
 	 */
 	if (parent) {
 		if (!parent->cached_context) {
-			err = build_snap_context(parent);
+			err = build_snap_context(parent, dirty_realms);
 			if (err)
 				goto fail;
 		}
@@ -332,7 +333,7 @@ static int build_snap_context(struct ceph_snap_realm *realm)
 		     " (unchanged)\n",
 		     realm->ino, realm, realm->cached_context,
 		     realm->cached_context->seq,
-		     (unsigned int) realm->cached_context->num_snaps);
+		     (unsigned int)realm->cached_context->num_snaps);
 		return 0;
 	}
 
@@ -373,7 +374,11 @@ static int build_snap_context(struct ceph_snap_realm *realm)
 	     realm->ino, realm, snapc, snapc->seq,
 	     (unsigned int) snapc->num_snaps);
 
-	ceph_put_snap_context(realm->cached_context);
+	if (realm->cached_context) {
+		ceph_put_snap_context(realm->cached_context);
+		/* queue realm for cap_snap creation */
+		list_add_tail(&realm->dirty_item, dirty_realms);
+	}
 	realm->cached_context = snapc;
 	return 0;
 
@@ -394,15 +399,16 @@ static int build_snap_context(struct ceph_snap_realm *realm)
 /*
  * rebuild snap context for the given realm and all of its children.
  */
-static void rebuild_snap_realms(struct ceph_snap_realm *realm)
+static void rebuild_snap_realms(struct ceph_snap_realm *realm,
+				struct list_head *dirty_realms)
 {
 	struct ceph_snap_realm *child;
 
 	dout("rebuild_snap_realms %llx %p\n", realm->ino, realm);
-	build_snap_context(realm);
+	build_snap_context(realm, dirty_realms);
 
 	list_for_each_entry(child, &realm->children, child_item)
-		rebuild_snap_realms(child);
+		rebuild_snap_realms(child, dirty_realms);
 }
 
 
@@ -624,13 +630,11 @@ static void queue_realm_cap_snaps(struct ceph_snap_realm *realm)
 {
 	struct ceph_inode_info *ci;
 	struct inode *lastinode = NULL;
-	struct ceph_snap_realm *child;
 
 	dout("queue_realm_cap_snaps %p %llx inodes\n", realm, realm->ino);
 
 	spin_lock(&realm->inodes_with_caps_lock);
-	list_for_each_entry(ci, &realm->inodes_with_caps,
-			    i_snap_realm_item) {
+	list_for_each_entry(ci, &realm->inodes_with_caps, i_snap_realm_item) {
 		struct inode *inode = igrab(&ci->vfs_inode);
 		if (!inode)
 			continue;
@@ -643,14 +647,6 @@ static void queue_realm_cap_snaps(struct ceph_snap_realm *realm)
 	spin_unlock(&realm->inodes_with_caps_lock);
 	iput(lastinode);
 
-	list_for_each_entry(child, &realm->children, child_item) {
-		dout("queue_realm_cap_snaps %p %llx queue child %p %llx\n",
-		     realm, realm->ino, child, child->ino);
-		list_del_init(&child->dirty_item);
-		list_add(&child->dirty_item, &realm->dirty_item);
-	}
-
-	list_del_init(&realm->dirty_item);
 	dout("queue_realm_cap_snaps %p %llx done\n", realm, realm->ino);
 }
 
@@ -721,8 +717,6 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
 		if (err < 0)
 			goto fail;
 
-		/* queue realm for cap_snap creation */
-		list_add(&realm->dirty_item, &dirty_realms);
 		if (realm->seq > mdsc->last_snap_seq)
 			mdsc->last_snap_seq = realm->seq;
 
@@ -741,7 +735,7 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
 
 	/* invalidate when we reach the _end_ (root) of the trace */
 	if (invalidate && p >= e)
-		rebuild_snap_realms(realm);
+		rebuild_snap_realms(realm, &dirty_realms);
 
 	if (!first_realm)
 		first_realm = realm;
@@ -758,6 +752,7 @@ int ceph_update_snap_trace(struct ceph_mds_client *mdsc,
 	while (!list_empty(&dirty_realms)) {
 		realm = list_first_entry(&dirty_realms, struct ceph_snap_realm,
 					 dirty_item);
+		list_del_init(&realm->dirty_item);
 		queue_realm_cap_snaps(realm);
 	}
 
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 05/13] ceph: remove stale check in ceph_invalidatepage()
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (3 preceding siblings ...)
  2017-09-04  9:38 ` [PATCH 04/13] ceph: queue cap snap only when snap realm's context changes Yan, Zheng
@ 2017-09-04  9:39 ` Yan, Zheng
  2017-09-04  9:39 ` [PATCH 06/13] ceph: properly get capsnap's size in get_oldest_context() Yan, Zheng
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:39 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

Both set_page_dirty() and truncate_complete_page() should be called
on a locked page, so they can't race with each other.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c | 9 +--------
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index d82036e19083..b6ac3da9ddab 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -152,17 +152,10 @@ static void ceph_invalidatepage(struct page *page, unsigned int offset,
 
 	ceph_invalidate_fscache_page(inode, page);
 
+	WARN_ON(!PageLocked(page));
 	if (!PagePrivate(page))
 		return;
 
-	/*
-	 * We can get non-dirty pages here due to races between
-	 * set_page_dirty and truncate_complete_page; just spit out a
-	 * warning, in case we end up with accounting problems later.
-	 */
-	if (!PageDirty(page))
-		pr_err("%p invalidatepage %p page not dirty\n", inode, page);
-
 	ClearPageChecked(page);
 
 	dout("%p invalidatepage %p idx %lu full dirty page\n",
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 06/13] ceph: properly get capsnap's size in get_oldest_context()
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (4 preceding siblings ...)
  2017-09-04  9:39 ` [PATCH 05/13] ceph: remove stale check in ceph_invalidatepage() Yan, Zheng
@ 2017-09-04  9:39 ` Yan, Zheng
  2017-09-04  9:39 ` [PATCH 07/13] ceph: make writepage_nounlock() invalidate pages beyond EOF Yan, Zheng
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:39 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

A capsnap's size is set by __ceph_finish_cap_snap(). While the
capsnap is still being written, its size is zero; in that case
get_oldest_context() should read i_size instead. Besides,
ceph_writepages_start() should re-check the capsnap's size after the
dirty pages are locked.
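
A minimal user-space sketch of the idea behind the ceph_writeback_ctl
helper this patch introduces (stand-in types, not the kernel code): while
the matching capsnap is still being written its recorded size cannot be
trusted, so the live i_size is used and flagged as unstable for the later
re-check:

#include <stdbool.h>
#include <stdio.h>

struct wb_ctl { unsigned long long i_size; bool size_stable; };
struct capsnap { unsigned long long size; bool writing; };

/* stand-in for the capsnap branch of get_oldest_context() */
static void fill_ctl(struct wb_ctl *ctl, const struct capsnap *cs,
                     unsigned long long live_i_size)
{
	if (cs->writing) {
		ctl->i_size = live_i_size; /* capsnap->size is still zero */
		ctl->size_stable = false;  /* re-check after locking pages */
	} else {
		ctl->i_size = cs->size;    /* frozen by __ceph_finish_cap_snap() */
		ctl->size_stable = true;
	}
}

int main(void)
{
	struct capsnap in_progress = { .size = 0, .writing = true };
	struct wb_ctl ctl;

	fill_ctl(&ctl, &in_progress, 12345);
	printf("i_size=%llu stable=%d\n", ctl.i_size, (int)ctl.size_stable);
	return 0;
}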

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c | 137 +++++++++++++++++++++++++++++++++------------------------
 1 file changed, 80 insertions(+), 57 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index b6ac3da9ddab..03a1ee27b33c 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -463,14 +463,20 @@ static int ceph_readpages(struct file *file, struct address_space *mapping,
 	return rc;
 }
 
+struct ceph_writeback_ctl
+{
+	loff_t i_size;
+	u64 truncate_size;
+	u32 truncate_seq;
+	bool size_stable;
+};
+
 /*
  * Get ref for the oldest snapc for an inode with dirty data... that is, the
  * only snap context we are allowed to write back.
  */
-static struct ceph_snap_context *get_oldest_context(struct inode *inode,
-						    loff_t *snap_size,
-						    u64 *truncate_size,
-						    u32 *truncate_seq)
+static struct ceph_snap_context *
+get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_snap_context *snapc = NULL;
@@ -482,12 +488,17 @@ static struct ceph_snap_context *get_oldest_context(struct inode *inode,
 		     capsnap->context, capsnap->dirty_pages);
 		if (capsnap->dirty_pages) {
 			snapc = ceph_get_snap_context(capsnap->context);
-			if (snap_size)
-				*snap_size = capsnap->size;
-			if (truncate_size)
-				*truncate_size = capsnap->truncate_size;
-			if (truncate_seq)
-				*truncate_seq = capsnap->truncate_seq;
+			if (ctl) {
+				if (capsnap->writing) {
+					ctl->i_size = i_size_read(inode);
+					ctl->size_stable = false;
+				} else {
+					ctl->i_size = capsnap->size;
+					ctl->size_stable = true;
+				}
+				ctl->truncate_size = capsnap->truncate_size;
+				ctl->truncate_seq = capsnap->truncate_seq;
+			}
 			break;
 		}
 	}
@@ -495,15 +506,44 @@ static struct ceph_snap_context *get_oldest_context(struct inode *inode,
 		snapc = ceph_get_snap_context(ci->i_head_snapc);
 		dout(" head snapc %p has %d dirty pages\n",
 		     snapc, ci->i_wrbuffer_ref_head);
-		if (truncate_size)
-			*truncate_size = ci->i_truncate_size;
-		if (truncate_seq)
-			*truncate_seq = ci->i_truncate_seq;
+		if (ctl) {
+			ctl->i_size = i_size_read(inode);
+			ctl->truncate_size = ci->i_truncate_size;
+			ctl->truncate_seq = ci->i_truncate_seq;
+			ctl->size_stable = false;
+		}
 	}
 	spin_unlock(&ci->i_ceph_lock);
 	return snapc;
 }
 
+static u64 get_writepages_data_length(struct inode *inode,
+				      struct page *page, u64 start)
+{
+	struct ceph_inode_info *ci = ceph_inode(inode);
+	struct ceph_snap_context *snapc = page_snap_context(page);
+	struct ceph_cap_snap *capsnap = NULL;
+	u64 end = i_size_read(inode);
+
+	if (snapc != ci->i_head_snapc) {
+		bool found = false;
+		spin_lock(&ci->i_ceph_lock);
+		list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
+			if (capsnap->context == snapc) {
+				if (!capsnap->writing)
+					end = capsnap->size;
+				found = true;
+				break;
+			}
+		}
+		spin_unlock(&ci->i_ceph_lock);
+		WARN_ON(!found);
+	}
+	if (end > page_offset(page) + PAGE_SIZE)
+		end = page_offset(page) + PAGE_SIZE;
+	return end > start ? end - start : 0;
+}
+
 /*
  * Write a single page, but leave the page locked.
  *
@@ -515,21 +555,17 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 	struct inode *inode;
 	struct ceph_inode_info *ci;
 	struct ceph_fs_client *fsc;
-	struct ceph_osd_client *osdc;
 	struct ceph_snap_context *snapc, *oldest;
 	loff_t page_off = page_offset(page);
-	loff_t snap_size = -1;
 	long writeback_stat;
-	u64 truncate_size;
-	u32 truncate_seq;
 	int err, len = PAGE_SIZE;
+	struct ceph_writeback_ctl ceph_wbc;
 
 	dout("writepage %p idx %lu\n", page, page->index);
 
 	inode = page->mapping->host;
 	ci = ceph_inode(inode);
 	fsc = ceph_inode_to_client(inode);
-	osdc = &fsc->client->osdc;
 
 	/* verify this is a writeable snap context */
 	snapc = page_snap_context(page);
@@ -537,8 +573,7 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 		dout("writepage %p page %p not dirty?\n", inode, page);
 		return 0;
 	}
-	oldest = get_oldest_context(inode, &snap_size,
-				    &truncate_size, &truncate_seq);
+	oldest = get_oldest_context(inode, &ceph_wbc);
 	if (snapc->seq > oldest->seq) {
 		dout("writepage %p page %p snapc %p not writeable - noop\n",
 		     inode, page, snapc);
@@ -550,17 +585,14 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 	}
 	ceph_put_snap_context(oldest);
 
-	if (snap_size == -1)
-		snap_size = i_size_read(inode);
-
 	/* is this a partial page at end of file? */
-	if (page_off >= snap_size) {
-		dout("%p page eof %llu\n", page, snap_size);
+	if (page_off >= ceph_wbc.i_size) {
+		dout("%p page eof %llu\n", page, ceph_wbc.i_size);
 		return 0;
 	}
 
-	if (snap_size < page_off + len)
-		len = snap_size - page_off;
+	if (ceph_wbc.i_size < page_off + len)
+		len = ceph_wbc.i_size - page_off;
 
 	dout("writepage %p page %p index %lu on %llu~%u snapc %p seq %lld\n",
 	     inode, page, page->index, page_off, len, snapc, snapc->seq);
@@ -571,10 +603,10 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 		set_bdi_congested(inode_to_bdi(inode), BLK_RW_ASYNC);
 
 	set_page_writeback(page);
-	err = ceph_osdc_writepages(osdc, ceph_vino(inode),
-				   &ci->i_layout, snapc,
-				   page_off, len,
-				   truncate_seq, truncate_size,
+	err = ceph_osdc_writepages(&fsc->client->osdc, ceph_vino(inode),
+				   &ci->i_layout, snapc, page_off, len,
+				   ceph_wbc.truncate_seq,
+				   ceph_wbc.truncate_size,
 				   &inode->i_mtime, &page, 1);
 	if (err < 0) {
 		struct writeback_control tmp_wbc;
@@ -745,9 +777,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 	int rc = 0;
 	unsigned int wsize = i_blocksize(inode);
 	struct ceph_osd_request *req = NULL;
-	loff_t snap_size, i_size;
-	u64 truncate_size;
-	u32 truncate_seq;
+	struct ceph_writeback_ctl ceph_wbc;
 
 	dout("writepages_start %p (mode=%s)\n", inode,
 	     wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
@@ -786,9 +816,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 retry:
 	/* find oldest snap context with dirty data */
 	ceph_put_snap_context(snapc);
-	snap_size = -1;
-	snapc = get_oldest_context(inode, &snap_size,
-				   &truncate_size, &truncate_seq);
+	snapc = get_oldest_context(inode, &ceph_wbc);
 	if (!snapc) {
 		/* hmm, why does writepages get called when there
 		   is no dirty data? */
@@ -798,8 +826,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 	dout(" oldest snapc is %p seq %lld (%d snaps)\n",
 	     snapc, snapc->seq, snapc->num_snaps);
 
-	i_size = i_size_read(inode);
-
 	if (last_snapc && snapc != last_snapc) {
 		/* if we switched to a newer snapc, restart our scan at the
 		 * start of the original file range. */
@@ -865,10 +891,9 @@ static int ceph_writepages_start(struct address_space *mapping,
 				dout("waiting on writeback %p\n", page);
 				wait_on_page_writeback(page);
 			}
-			if (page_offset(page) >=
-			    (snap_size == -1 ? i_size : snap_size)) {
-				dout("%p page eof %llu\n", page,
-				     (snap_size == -1 ? i_size : snap_size));
+			if (page_offset(page) >= ceph_wbc.i_size) {
+				dout("%p page eof %llu\n",
+				     page, ceph_wbc.i_size);
 				done = 1;
 				unlock_page(page);
 				break;
@@ -996,10 +1021,9 @@ static int ceph_writepages_start(struct address_space *mapping,
 		req = ceph_osdc_new_request(&fsc->client->osdc,
 					&ci->i_layout, vino,
 					offset, &len, 0, num_ops,
-					CEPH_OSD_OP_WRITE,
-					CEPH_OSD_FLAG_WRITE,
-					snapc, truncate_seq,
-					truncate_size, false);
+					CEPH_OSD_OP_WRITE, CEPH_OSD_FLAG_WRITE,
+					snapc, ceph_wbc.truncate_seq,
+					ceph_wbc.truncate_size, false);
 		if (IS_ERR(req)) {
 			req = ceph_osdc_new_request(&fsc->client->osdc,
 						&ci->i_layout, vino,
@@ -1008,8 +1032,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 						    CEPH_OSD_SLAB_OPS),
 						CEPH_OSD_OP_WRITE,
 						CEPH_OSD_FLAG_WRITE,
-						snapc, truncate_seq,
-						truncate_size, true);
+						snapc, ceph_wbc.truncate_seq,
+						ceph_wbc.truncate_size, true);
 			BUG_ON(IS_ERR(req));
 		}
 		BUG_ON(len < page_offset(pages[locked_pages - 1]) +
@@ -1046,14 +1070,15 @@ static int ceph_writepages_start(struct address_space *mapping,
 			len += PAGE_SIZE;
 		}
 
-		if (snap_size != -1) {
-			len = min(len, snap_size - offset);
+		if (ceph_wbc.size_stable) {
+			len = min(len, ceph_wbc.i_size - offset);
 		} else if (i == locked_pages) {
 			/* writepages_finish() clears writeback pages
 			 * according to the data length, so make sure
 			 * data length covers all locked pages */
 			u64 min_len = len + 1 - PAGE_SIZE;
-			len = min(len, (u64)i_size_read(inode) - offset);
+			len = get_writepages_data_length(inode, pages[i - 1],
+							 offset);
 			len = max(len, min_len);
 		}
 		dout("writepages got pages at %llu~%llu\n", offset, len);
@@ -1137,8 +1162,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 static int context_is_writeable_or_written(struct inode *inode,
 					   struct ceph_snap_context *snapc)
 {
-	struct ceph_snap_context *oldest = get_oldest_context(inode, NULL,
-							      NULL, NULL);
+	struct ceph_snap_context *oldest = get_oldest_context(inode, NULL);
 	int ret = !oldest || snapc->seq <= oldest->seq;
 
 	ceph_put_snap_context(oldest);
@@ -1183,8 +1207,7 @@ static int ceph_update_writeable_page(struct file *file,
 		 * this page is already dirty in another (older) snap
 		 * context!  is it writeable now?
 		 */
-		oldest = get_oldest_context(inode, NULL, NULL, NULL);
-
+		oldest = get_oldest_context(inode, NULL);
 		if (snapc->seq > oldest->seq) {
 			ceph_put_snap_context(oldest);
 			dout(" page %p snapc %p not current or oldest\n",
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 07/13] ceph: make writepage_nounlock() invalidate pages beyond EOF
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (5 preceding siblings ...)
  2017-09-04  9:39 ` [PATCH 06/13] ceph: properly get capsnap's size in get_oldest_context() Yan, Zheng
@ 2017-09-04  9:39 ` Yan, Zheng
  2017-09-04  9:39 ` [PATCH 08/13] ceph: optimize pagevec iterating in ceph_writepages_start() Yan, Zheng
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:39 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

Otherwise the page is left in a state where it is still associated
with a snapc, but (PageDirty(page) || PageWriteback(page)) is false.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c | 50 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 32 insertions(+), 18 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 03a1ee27b33c..8526359c08b2 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -476,7 +476,8 @@ struct ceph_writeback_ctl
  * only snap context we are allowed to write back.
  */
 static struct ceph_snap_context *
-get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl)
+get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl,
+		   struct ceph_snap_context *page_snapc)
 {
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_snap_context *snapc = NULL;
@@ -486,21 +487,33 @@ get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl)
 	list_for_each_entry(capsnap, &ci->i_cap_snaps, ci_item) {
 		dout(" cap_snap %p snapc %p has %d dirty pages\n", capsnap,
 		     capsnap->context, capsnap->dirty_pages);
-		if (capsnap->dirty_pages) {
-			snapc = ceph_get_snap_context(capsnap->context);
-			if (ctl) {
-				if (capsnap->writing) {
-					ctl->i_size = i_size_read(inode);
-					ctl->size_stable = false;
-				} else {
-					ctl->i_size = capsnap->size;
-					ctl->size_stable = true;
-				}
-				ctl->truncate_size = capsnap->truncate_size;
-				ctl->truncate_seq = capsnap->truncate_seq;
+		if (!capsnap->dirty_pages)
+			continue;
+
+		/* get i_size, truncate_{seq,size} for page_snapc? */
+		if (snapc && capsnap->context != page_snapc)
+			continue;
+
+		if (ctl) {
+			if (capsnap->writing) {
+				ctl->i_size = i_size_read(inode);
+				ctl->size_stable = false;
+			} else {
+				ctl->i_size = capsnap->size;
+				ctl->size_stable = true;
 			}
-			break;
+			ctl->truncate_size = capsnap->truncate_size;
+			ctl->truncate_seq = capsnap->truncate_seq;
 		}
+
+		if (snapc)
+			break;
+
+		snapc = ceph_get_snap_context(capsnap->context);
+		if (!page_snapc ||
+		    page_snapc == snapc ||
+		    page_snapc->seq > snapc->seq)
+			break;
 	}
 	if (!snapc && ci->i_wrbuffer_ref_head) {
 		snapc = ceph_get_snap_context(ci->i_head_snapc);
@@ -573,7 +586,7 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 		dout("writepage %p page %p not dirty?\n", inode, page);
 		return 0;
 	}
-	oldest = get_oldest_context(inode, &ceph_wbc);
+	oldest = get_oldest_context(inode, &ceph_wbc, snapc);
 	if (snapc->seq > oldest->seq) {
 		dout("writepage %p page %p snapc %p not writeable - noop\n",
 		     inode, page, snapc);
@@ -588,6 +601,7 @@ static int writepage_nounlock(struct page *page, struct writeback_control *wbc)
 	/* is this a partial page at end of file? */
 	if (page_off >= ceph_wbc.i_size) {
 		dout("%p page eof %llu\n", page, ceph_wbc.i_size);
+		page->mapping->a_ops->invalidatepage(page, 0, PAGE_SIZE);
 		return 0;
 	}
 
@@ -816,7 +830,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 retry:
 	/* find oldest snap context with dirty data */
 	ceph_put_snap_context(snapc);
-	snapc = get_oldest_context(inode, &ceph_wbc);
+	snapc = get_oldest_context(inode, &ceph_wbc, NULL);
 	if (!snapc) {
 		/* hmm, why does writepages get called when there
 		   is no dirty data? */
@@ -1162,7 +1176,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 static int context_is_writeable_or_written(struct inode *inode,
 					   struct ceph_snap_context *snapc)
 {
-	struct ceph_snap_context *oldest = get_oldest_context(inode, NULL);
+	struct ceph_snap_context *oldest = get_oldest_context(inode, NULL, NULL);
 	int ret = !oldest || snapc->seq <= oldest->seq;
 
 	ceph_put_snap_context(oldest);
@@ -1207,7 +1221,7 @@ static int ceph_update_writeable_page(struct file *file,
 		 * this page is already dirty in another (older) snap
 		 * context!  is it writeable now?
 		 */
-		oldest = get_oldest_context(inode, NULL);
+		oldest = get_oldest_context(inode, NULL, NULL);
 		if (snapc->seq > oldest->seq) {
 			ceph_put_snap_context(oldest);
 			dout(" page %p snapc %p not current or oldest\n",
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 08/13] ceph: optimize pagevec iterating in ceph_writepages_start()
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (6 preceding siblings ...)
  2017-09-04  9:39 ` [PATCH 07/13] ceph: make writepage_nounlock() invalidate pages beyond EOF Yan, Zheng
@ 2017-09-04  9:39 ` Yan, Zheng
  2017-09-04  9:39 ` [PATCH 09/13] ceph: cleanup local variables " Yan, Zheng
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:39 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

ceph_writepages_start() supports writing non-contiguous pages. If it
encounters a non-dirty or non-writeable page in the pagevec, it can
continue checking the rest of the pages in the pagevec.
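
The change amounts to treating an unsuitable page as something to skip
rather than a reason to stop scanning the batch. A simplified user-space
sketch (hypothetical is_writeable() predicate, plain ints standing in for
pages):

#include <stdio.h>

#define BATCH 8

/* stand-in for the per-page checks (dirty, mapping, snapc, ...) */
static int is_writeable(int page)
{
	return page % 3 != 0; /* arbitrary demo predicate */
}

int main(void)
{
	int batch[BATCH] = { 1, 2, 3, 4, 5, 6, 7, 8 };
	int picked[BATCH];
	int locked = 0, i;

	for (i = 0; i < BATCH; i++) {
		if (!is_writeable(batch[i]))
			continue; /* the old code would 'break' here and give
				   * up on the rest of the looked-up batch */
		picked[locked++] = batch[i];
	}

	printf("picked %d pages:", locked);
	for (i = 0; i < locked; i++)
		printf(" %d", picked[i]);
	printf("\n");
	return 0;
}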

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c | 54 +++++++++++++++++++++++++-----------------------------
 1 file changed, 25 insertions(+), 29 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8526359c08b2..5ca887bb5cae 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -851,7 +851,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 
 	while (!done && index <= end) {
 		unsigned i;
-		int first;
 		pgoff_t strip_unit_end = 0;
 		int num_ops = 0, op_idx;
 		int pvec_pages, locked_pages = 0;
@@ -864,7 +863,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 		max_pages = max_pages_ever;
 
 get_more_pages:
-		first = -1;
 		want = min(end - index,
 			   min((pgoff_t)PAGEVEC_SIZE,
 			       max_pages - (pgoff_t)locked_pages) - 1)
@@ -888,7 +886,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 			    unlikely(page->mapping != mapping)) {
 				dout("!dirty or !mapping %p\n", page);
 				unlock_page(page);
-				break;
+				continue;
 			}
 			if (!wbc->range_cyclic && page->index > end) {
 				dout("end of range %p\n", page);
@@ -901,10 +899,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 				unlock_page(page);
 				break;
 			}
-			if (wbc->sync_mode != WB_SYNC_NONE) {
-				dout("waiting on writeback %p\n", page);
-				wait_on_page_writeback(page);
-			}
 			if (page_offset(page) >= ceph_wbc.i_size) {
 				dout("%p page eof %llu\n",
 				     page, ceph_wbc.i_size);
@@ -913,9 +907,13 @@ static int ceph_writepages_start(struct address_space *mapping,
 				break;
 			}
 			if (PageWriteback(page)) {
-				dout("%p under writeback\n", page);
-				unlock_page(page);
-				break;
+				if (wbc->sync_mode == WB_SYNC_NONE) {
+					dout("%p under writeback\n", page);
+					unlock_page(page);
+					continue;
+				}
+				dout("waiting on writeback %p\n", page);
+				wait_on_page_writeback(page);
 			}
 
 			/* only if matching snap context */
@@ -924,15 +922,13 @@ static int ceph_writepages_start(struct address_space *mapping,
 				dout("page snapc %p %lld > oldest %p %lld\n",
 				     pgsnapc, pgsnapc->seq, snapc, snapc->seq);
 				unlock_page(page);
-				if (!locked_pages)
-					continue; /* keep looking for snap */
-				break;
+				continue;
 			}
 
 			if (!clear_page_dirty_for_io(page)) {
 				dout("%p !clear_page_dirty_for_io\n", page);
 				unlock_page(page);
-				break;
+				continue;
 			}
 
 			/*
@@ -988,8 +984,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 			}
 
 			/* note position of first page in pvec */
-			if (first < 0)
-				first = i;
 			dout("%p will write page %p idx %lu\n",
 			     inode, page, page->index);
 
@@ -1000,8 +994,10 @@ static int ceph_writepages_start(struct address_space *mapping,
 						  BLK_RW_ASYNC);
 			}
 
-			pages[locked_pages] = page;
-			locked_pages++;
+
+			pages[locked_pages++] = page;
+			pvec.pages[i] = NULL;
+
 			len += PAGE_SIZE;
 		}
 
@@ -1009,23 +1005,23 @@ static int ceph_writepages_start(struct address_space *mapping,
 		if (!locked_pages)
 			goto release_pvec_pages;
 		if (i) {
-			int j;
-			BUG_ON(!locked_pages || first < 0);
+			unsigned j, n = 0;
+			/* shift unused page to beginning of pvec */
+			for (j = 0; j < pvec_pages; j++) {
+				if (!pvec.pages[j])
+					continue;
+				if (n < j)
+					pvec.pages[n] = pvec.pages[j];
+				n++;
+			}
+			pvec.nr = n;
 
 			if (pvec_pages && i == pvec_pages &&
 			    locked_pages < max_pages) {
 				dout("reached end pvec, trying for more\n");
-				pagevec_reinit(&pvec);
+				pagevec_release(&pvec);
 				goto get_more_pages;
 			}
-
-			/* shift unused pages over in the pvec...  we
-			 * will need to release them below. */
-			for (j = i; j < pvec_pages; j++) {
-				dout(" pvec leftover page %p\n", pvec.pages[j]);
-				pvec.pages[j-i+first] = pvec.pages[j];
-			}
-			pvec.nr -= i-first;
 		}
 
 new_request:
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 09/13] ceph: cleanup local variables in ceph_writepages_start()
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (7 preceding siblings ...)
  2017-09-04  9:39 ` [PATCH 08/13] ceph: optimize pagevec iterating in ceph_writepages_start() Yan, Zheng
@ 2017-09-04  9:39 ` Yan, Zheng
  2017-09-04  9:39 ` [PATCH 10/13] ceph: fix "range cyclic" mode writepages Yan, Zheng
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:39 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

Remove two variables and define variables of the same type together.

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 5ca887bb5cae..221df531b0c3 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -784,7 +784,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 	pgoff_t index, start, end;
 	int range_whole = 0;
 	int should_loop = 1;
-	pgoff_t max_pages = 0, max_pages_ever = 0;
 	struct ceph_snap_context *snapc = NULL, *last_snapc = NULL, *pgsnapc;
 	struct pagevec pvec;
 	int done = 0;
@@ -808,7 +807,6 @@ static int ceph_writepages_start(struct address_space *mapping,
 	}
 	if (fsc->mount_options->wsize < wsize)
 		wsize = fsc->mount_options->wsize;
-	max_pages_ever = wsize >> PAGE_SHIFT;
 
 	pagevec_init(&pvec, 0);
 
@@ -850,26 +848,25 @@ static int ceph_writepages_start(struct address_space *mapping,
 	last_snapc = snapc;
 
 	while (!done && index <= end) {
-		unsigned i;
-		pgoff_t strip_unit_end = 0;
 		int num_ops = 0, op_idx;
-		int pvec_pages, locked_pages = 0;
+		unsigned i, pvec_pages, max_pages, locked_pages = 0;
 		struct page **pages = NULL, **data_pages;
 		mempool_t *pool = NULL;	/* Becomes non-null if mempool used */
 		struct page *page;
-		int want;
+		pgoff_t strip_unit_end = 0;
 		u64 offset = 0, len = 0;
 
-		max_pages = max_pages_ever;
+		max_pages = wsize >> PAGE_SHIFT;
 
 get_more_pages:
-		want = min(end - index,
-			   min((pgoff_t)PAGEVEC_SIZE,
-			       max_pages - (pgoff_t)locked_pages) - 1)
-			+ 1;
+		pvec_pages = min_t(unsigned, PAGEVEC_SIZE,
+				   max_pages - locked_pages);
+		if (end - index < (u64)(pvec_pages - 1))
+			pvec_pages = (unsigned)(end - index) + 1;
+
 		pvec_pages = pagevec_lookup_tag(&pvec, mapping, &index,
 						PAGECACHE_TAG_DIRTY,
-						want);
+						pvec_pages);
 		dout("pagevec_lookup_tag got %d\n", pvec_pages);
 		if (!pvec_pages && !locked_pages)
 			break;
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 10/13] ceph: fix "range cyclic" mode writepages
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (8 preceding siblings ...)
  2017-09-04  9:39 ` [PATCH 09/13] ceph: cleanup local variables " Yan, Zheng
@ 2017-09-04  9:39 ` Yan, Zheng
  2017-09-04  9:39 ` [PATCH 11/13] ceph: ignore wbc->range_{start,end} when write back snapshot data Yan, Zheng
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:39 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

In range cyclic mode, writepages() should first write dirty pages in
the range [writeback_index, (pgoff_t)-1], then write pages in the
range [0, writeback_index - 1]. Besides, if writepages() encounters a
page beyond EOF, it should restart from the beginning.
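
A minimal user-space sketch of the intended two-pass scan order (indices
only, no real pages or EOF handling): the first pass covers
[writeback_index, end of range], the second wraps around to
[0, writeback_index - 1]:

#include <stdio.h>

int main(void)
{
	unsigned long nr_pages = 10;       /* pretend the file has 10 pages */
	unsigned long writeback_index = 6; /* where the previous pass stopped */
	unsigned long index, end;
	int should_loop;

	/* first pass: [writeback_index, end of range] */
	index = writeback_index;
	end = nr_pages - 1;                /* stands in for (pgoff_t)-1 */
	should_loop = index > 0;
	for (; index <= end; index++)
		printf("write page %lu\n", index);

	/* second pass: wrap around to [0, writeback_index - 1] */
	if (should_loop) {
		end = writeback_index - 1;
		for (index = 0; index <= end; index++)
			printf("write page %lu\n", index);
	}
	return 0;
}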

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c | 41 ++++++++++++++++++++++++-----------------
 1 file changed, 24 insertions(+), 17 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 221df531b0c3..46658d548a6e 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -781,16 +781,15 @@ static int ceph_writepages_start(struct address_space *mapping,
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
 	struct ceph_vino vino = ceph_vino(inode);
-	pgoff_t index, start, end;
-	int range_whole = 0;
-	int should_loop = 1;
+	pgoff_t index, start_index, end;
 	struct ceph_snap_context *snapc = NULL, *last_snapc = NULL, *pgsnapc;
 	struct pagevec pvec;
-	int done = 0;
 	int rc = 0;
 	unsigned int wsize = i_blocksize(inode);
 	struct ceph_osd_request *req = NULL;
 	struct ceph_writeback_ctl ceph_wbc;
+	bool should_loop, range_whole = false;
+	bool stop, done = false;
 
 	dout("writepages_start %p (mode=%s)\n", inode,
 	     wbc->sync_mode == WB_SYNC_NONE ? "NONE" :
@@ -810,20 +809,22 @@ static int ceph_writepages_start(struct address_space *mapping,
 
 	pagevec_init(&pvec, 0);
 
+	start_index = wbc->range_cyclic ? mapping->writeback_index : 0;
+
 	/* where to start/end? */
 	if (wbc->range_cyclic) {
-		start = mapping->writeback_index; /* Start from prev offset */
+		index = start_index;
 		end = -1;
-		dout(" cyclic, start at %lu\n", start);
+		should_loop = (index > 0);
+		dout(" cyclic, start at %lu\n", index);
 	} else {
-		start = wbc->range_start >> PAGE_SHIFT;
+		index = wbc->range_start >> PAGE_SHIFT;
 		end = wbc->range_end >> PAGE_SHIFT;
 		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-			range_whole = 1;
-		should_loop = 0;
-		dout(" not cyclic, %lu to %lu\n", start, end);
+			range_whole = true;
+		should_loop = false;
+		dout(" not cyclic, %lu to %lu\n", index, end);
 	}
-	index = start;
 
 retry:
 	/* find oldest snap context with dirty data */
@@ -847,7 +848,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 	}
 	last_snapc = snapc;
 
-	while (!done && index <= end) {
+	stop = false;
+	while (!stop && index <= end) {
 		int num_ops = 0, op_idx;
 		unsigned i, pvec_pages, max_pages, locked_pages = 0;
 		struct page **pages = NULL, **data_pages;
@@ -885,9 +887,11 @@ static int ceph_writepages_start(struct address_space *mapping,
 				unlock_page(page);
 				continue;
 			}
-			if (!wbc->range_cyclic && page->index > end) {
+			if (page->index > end) {
 				dout("end of range %p\n", page);
-				done = 1;
+				/* can't be range_cyclic (1st pass) because
+				 * end == -1 in that case. */
+				stop = done = true;
 				unlock_page(page);
 				break;
 			}
@@ -899,7 +903,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 			if (page_offset(page) >= ceph_wbc.i_size) {
 				dout("%p page eof %llu\n",
 				     page, ceph_wbc.i_size);
-				done = 1;
+				/* not done if range_cyclic */
+				stop = true;
 				unlock_page(page);
 				break;
 			}
@@ -1132,7 +1137,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 			goto new_request;
 
 		if (wbc->nr_to_write <= 0)
-			done = 1;
+			stop = done = true;
 
 release_pvec_pages:
 		dout("pagevec_release on %d pages (%p)\n", (int)pvec.nr,
@@ -1146,7 +1151,9 @@ static int ceph_writepages_start(struct address_space *mapping,
 	if (should_loop && !done) {
 		/* more to do; loop back to beginning of file */
 		dout("writepages looping back to beginning of file\n");
-		should_loop = 0;
+		should_loop = false;
+		end = start_index - 1;
+
 		index = 0;
 		goto retry;
 	}
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 11/13] ceph: ignore wbc->range_{start,end} when write back snapshot data
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (9 preceding siblings ...)
  2017-09-04  9:39 ` [PATCH 10/13] ceph: fix "range cyclic" mode writepages Yan, Zheng
@ 2017-09-04  9:39 ` Yan, Zheng
  2017-09-04  9:39 ` [PATCH 12/13] ceph: fix capsnap dirty pages accounting Yan, Zheng
  2017-09-04  9:39 ` [PATCH 13/13] ceph: wait on writeback after writing snapshot data Yan, Zheng
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:39 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

writepages() needs to write dirty pages to the OSDs in strict order
of snapshot context: it must first write the dirty pages associated
with the oldest snapshot context. In the write-range case, dirty
pages in the specified range can be associated with a newer snapc;
they are not writeable until all dirty pages associated with the
oldest snapc have been written.
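
A small user-space sketch of the ordering constraint (hypothetical structs;
the kernel keys the comparison off page_snap_context()): whatever range the
caller asked for, only pages dirtied under the oldest snapc are writeable
in the current pass:

#include <stdio.h>

struct snapc { unsigned long long seq; };

struct dirty_page {
	unsigned long index;
	const struct snapc *ctx; /* snapc the page was dirtied under */
};

int main(void)
{
	struct snapc oldest = { .seq = 1 }, newer = { .seq = 2 };
	struct dirty_page pages[] = {
		{ 0, &oldest }, { 1, &newer }, { 2, &oldest }, { 3, &newer },
	};
	unsigned i;

	/* regardless of the range the caller asked for, only pages dirtied
	 * under the oldest snapc are writeable in this pass; the rest must
	 * wait for a later pass */
	for (i = 0; i < sizeof(pages) / sizeof(pages[0]); i++) {
		if (pages[i].ctx != &oldest)
			continue;
		printf("write page %lu (snapc seq %llu)\n",
		       pages[i].index, pages[i].ctx->seq);
	}
	return 0;
}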

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c | 80 +++++++++++++++++++++++++++++++++-------------------------
 1 file changed, 46 insertions(+), 34 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 46658d548a6e..201e529e8a6c 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -469,6 +469,7 @@ struct ceph_writeback_ctl
 	u64 truncate_size;
 	u32 truncate_seq;
 	bool size_stable;
+	bool head_snapc;
 };
 
 /*
@@ -504,6 +505,7 @@ get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl,
 			}
 			ctl->truncate_size = capsnap->truncate_size;
 			ctl->truncate_seq = capsnap->truncate_seq;
+			ctl->head_snapc = false;
 		}
 
 		if (snapc)
@@ -524,6 +526,7 @@ get_oldest_context(struct inode *inode, struct ceph_writeback_ctl *ctl,
 			ctl->truncate_size = ci->i_truncate_size;
 			ctl->truncate_seq = ci->i_truncate_seq;
 			ctl->size_stable = false;
+			ctl->head_snapc = true;
 		}
 	}
 	spin_unlock(&ci->i_ceph_lock);
@@ -781,7 +784,7 @@ static int ceph_writepages_start(struct address_space *mapping,
 	struct ceph_inode_info *ci = ceph_inode(inode);
 	struct ceph_fs_client *fsc = ceph_inode_to_client(inode);
 	struct ceph_vino vino = ceph_vino(inode);
-	pgoff_t index, start_index, end;
+	pgoff_t index, start_index, end = -1;
 	struct ceph_snap_context *snapc = NULL, *last_snapc = NULL, *pgsnapc;
 	struct pagevec pvec;
 	int rc = 0;
@@ -810,25 +813,10 @@ static int ceph_writepages_start(struct address_space *mapping,
 	pagevec_init(&pvec, 0);
 
 	start_index = wbc->range_cyclic ? mapping->writeback_index : 0;
-
-	/* where to start/end? */
-	if (wbc->range_cyclic) {
-		index = start_index;
-		end = -1;
-		should_loop = (index > 0);
-		dout(" cyclic, start at %lu\n", index);
-	} else {
-		index = wbc->range_start >> PAGE_SHIFT;
-		end = wbc->range_end >> PAGE_SHIFT;
-		if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
-			range_whole = true;
-		should_loop = false;
-		dout(" not cyclic, %lu to %lu\n", index, end);
-	}
+	index = start_index;
 
 retry:
 	/* find oldest snap context with dirty data */
-	ceph_put_snap_context(snapc);
 	snapc = get_oldest_context(inode, &ceph_wbc, NULL);
 	if (!snapc) {
 		/* hmm, why does writepages get called when there
@@ -839,13 +827,33 @@ static int ceph_writepages_start(struct address_space *mapping,
 	dout(" oldest snapc is %p seq %lld (%d snaps)\n",
 	     snapc, snapc->seq, snapc->num_snaps);
 
-	if (last_snapc && snapc != last_snapc) {
-		/* if we switched to a newer snapc, restart our scan at the
-		 * start of the original file range. */
-		dout("  snapc differs from last pass, restarting at %lu\n",
-		     index);
-		index = start;
+	should_loop = false;
+	if (ceph_wbc.head_snapc && snapc != last_snapc) {
+		/* where to start/end? */
+		if (wbc->range_cyclic) {
+			index = start_index;
+			end = -1;
+			if (index > 0)
+				should_loop = true;
+			dout(" cyclic, start at %lu\n", index);
+		} else {
+			index = wbc->range_start >> PAGE_SHIFT;
+			end = wbc->range_end >> PAGE_SHIFT;
+			if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)
+				range_whole = true;
+			dout(" not cyclic, %lu to %lu\n", index, end);
+		}
+	} else if (!ceph_wbc.head_snapc) {
+		/* Do not respect wbc->range_{start,end}. Dirty pages
+		 * in that range can be associated with newer snapc.
+		 * They are not writeable until we write all dirty pages
+		 * associated with 'snapc' get written */
+		if (index > 0 || wbc->sync_mode != WB_SYNC_NONE)
+			should_loop = true;
+		dout(" non-head snapc, range whole\n");
 	}
+
+	ceph_put_snap_context(last_snapc);
 	last_snapc = snapc;
 
 	stop = false;
@@ -891,7 +899,9 @@ static int ceph_writepages_start(struct address_space *mapping,
 				dout("end of range %p\n", page);
 				/* can't be range_cyclic (1st pass) because
 				 * end == -1 in that case. */
-				stop = done = true;
+				stop = true;
+				if (ceph_wbc.head_snapc)
+					done = true;
 				unlock_page(page);
 				break;
 			}
@@ -1136,24 +1146,26 @@ static int ceph_writepages_start(struct address_space *mapping,
 		if (pages)
 			goto new_request;
 
-		if (wbc->nr_to_write <= 0)
-			stop = done = true;
+		/*
+		 * We stop writing back only if we are not doing
+		 * integrity sync. In case of integrity sync we have to
+		 * keep going until we have written all the pages
+		 * we tagged for writeback prior to entering this loop.
+		 */
+		if (wbc->nr_to_write <= 0 && wbc->sync_mode == WB_SYNC_NONE)
+			done = stop = true;
 
 release_pvec_pages:
 		dout("pagevec_release on %d pages (%p)\n", (int)pvec.nr,
 		     pvec.nr ? pvec.pages[0] : NULL);
 		pagevec_release(&pvec);
-
-		if (locked_pages && !done)
-			goto retry;
 	}
 
 	if (should_loop && !done) {
 		/* more to do; loop back to beginning of file */
 		dout("writepages looping back to beginning of file\n");
-		should_loop = false;
-		end = start_index - 1;
-
+		end = start_index - 1; /* OK even when start_index == 0 */
+		start_index = 0;
 		index = 0;
 		goto retry;
 	}
@@ -1163,8 +1175,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 
 out:
 	ceph_osdc_put_request(req);
-	ceph_put_snap_context(snapc);
-	dout("writepages done, rc = %d\n", rc);
+	ceph_put_snap_context(last_snapc);
+	dout("writepages dend - startone, rc = %d\n", rc);
 	return rc;
 }
 
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 12/13] ceph: fix capsnap dirty pages accounting
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (10 preceding siblings ...)
  2017-09-04  9:39 ` [PATCH 11/13] ceph: ignore wbc->range_{start,end} when write back snapshot data Yan, Zheng
@ 2017-09-04  9:39 ` Yan, Zheng
  2017-09-04  9:39 ` [PATCH 13/13] ceph: wait on writeback after writing snapshot data Yan, Zheng
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:39 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

writepages_finish() calls ceph_put_wrbuffer_cap_refs() once for all
pages, with the snapc parameter set to req->r_snapc. So writepages()
shouldn't write dirty pages associated with different snapcs in one
OSD request.
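
A rough user-space sketch of the accounting problem (stand-in types; the
real code releases references via ceph_put_wrbuffer_cap_refs() against
req->r_snapc): because completion drops one reference per page against a
single snapc, a request must only contain pages carrying exactly that
snapc:

#include <stdio.h>

struct snapc { unsigned long long seq; int wrbuffer_refs; };
struct page { struct snapc *snapc; };

/* stand-in for writepages_finish(): drops one reference per written page,
 * always against the request's snapc */
static void request_finish(struct snapc *r_snapc, int nr_pages)
{
	r_snapc->wrbuffer_refs -= nr_pages;
}

int main(void)
{
	struct snapc a = { .seq = 1, .wrbuffer_refs = 2 };
	struct snapc b = { .seq = 2, .wrbuffer_refs = 1 };
	struct page pages[] = { { &a }, { &a }, { &b } };
	int nr = 0;
	unsigned i;

	/* only pages whose snapc is exactly the request's snapc go into the
	 * request; stuffing all three in with r_snapc == &a would leave the
	 * counts at a = -1 and b = 1 after completion */
	for (i = 0; i < 3; i++)
		if (pages[i].snapc == &a)
			nr++;
	request_finish(&a, nr);
	printf("a refs %d, b refs %d\n", a.wrbuffer_refs, b.wrbuffer_refs);
	return 0;
}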

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 201e529e8a6c..1ffdb903eb79 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -930,8 +930,8 @@ static int ceph_writepages_start(struct address_space *mapping,
 
 			/* only if matching snap context */
 			pgsnapc = page_snap_context(page);
-			if (pgsnapc->seq > snapc->seq) {
-				dout("page snapc %p %lld > oldest %p %lld\n",
+			if (pgsnapc != snapc) {
+				dout("page snapc %p %lld != oldest %p %lld\n",
 				     pgsnapc, pgsnapc->seq, snapc, snapc->seq);
 				unlock_page(page);
 				continue;
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 13/13] ceph: wait on writeback after writing snapshot data
  2017-09-04  9:38 [PATCH 00/13] ceph: snapshot and multimds fixes Yan, Zheng
                   ` (11 preceding siblings ...)
  2017-09-04  9:39 ` [PATCH 12/13] ceph: fix capsnap dirty pages accounting Yan, Zheng
@ 2017-09-04  9:39 ` Yan, Zheng
  12 siblings, 0 replies; 14+ messages in thread
From: Yan, Zheng @ 2017-09-04  9:39 UTC (permalink / raw)
  To: ceph-devel, jlayton, idryomov; +Cc: Yan, Zheng

In sync mode, writepages() needs to write all dirty pages, but it
can only write dirty pages associated with the oldest snapc. To write
dirty pages associated with the next snapc, it needs to wait until
the current writes complete.

Without this wait, writepages() keeps looking up dirty pages, but the
pages it finds are not writeable, which wastes CPU time.
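
A toy user-space model of the resulting loop (made-up bookkeeping, not the
kernel code): submit the dirty pages of the oldest snapc, wait for that
writeback to finish, and only then move on to the next snapc:

#include <stdio.h>

struct snapc { unsigned long long seq; };

static int dirty[2] = { 3, 2 }; /* dirty page counts per snapc, oldest first */

static struct snapc *oldest_dirty_snapc(struct snapc *s)
{
	int i;

	for (i = 0; i < 2; i++)
		if (dirty[i])
			return &s[i];
	return NULL;
}

int main(void)
{
	struct snapc snaps[2] = { { 1 }, { 2 } };
	struct snapc *cur;

	while ((cur = oldest_dirty_snapc(snaps)) != NULL) {
		printf("submit %d pages for snapc seq %llu\n",
		       dirty[cur - snaps], cur->seq);
		dirty[cur - snaps] = 0;
		/* the wait_on_page_writeback() step added by this patch:
		 * without it, the next iteration would immediately find the
		 * newer snapc's dirty pages, fail to write them, and spin */
		printf("wait for writeback of snapc seq %llu\n", cur->seq);
	}
	return 0;
}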

Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
---
 fs/ceph/addr.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 1ffdb903eb79..b3e3edc09d80 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1165,6 +1165,30 @@ static int ceph_writepages_start(struct address_space *mapping,
 		/* more to do; loop back to beginning of file */
 		dout("writepages looping back to beginning of file\n");
 		end = start_index - 1; /* OK even when start_index == 0 */
+
+		/* to write dirty pages associated with next snapc,
+		 * we need to wait until current writes complete */
+		if (wbc->sync_mode != WB_SYNC_NONE &&
+		    start_index == 0 && /* all dirty pages were checked */
+		    !ceph_wbc.head_snapc) {
+			struct page *page;
+			unsigned i, nr;
+			index = 0;
+			while ((index <= end) &&
+			       (nr = pagevec_lookup_tag(&pvec, mapping, &index,
+							PAGECACHE_TAG_WRITEBACK,
+							PAGEVEC_SIZE))) {
+				for (i = 0; i < nr; i++) {
+					page = pvec.pages[i];
+					if (page_snap_context(page) != snapc)
+						continue;
+					wait_on_page_writeback(page);
+				}
+				pagevec_release(&pvec);
+				cond_resched();
+			}
+		}
+
 		start_index = 0;
 		index = 0;
 		goto retry;
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 14+ messages in thread
