All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC][PATCH 1/2] Add a super operation for writeback
@ 2014-06-01 21:41 Daniel Phillips
  2014-06-01 21:42 ` [RFC][PATCH 2/2] tux3: Use writeback hook to remove duplicated core code Daniel Phillips
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Daniel Phillips @ 2014-06-01 21:41 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: Linus Torvalds, Andrew Morton, OGAWA Hirofumi

Hi,

This is the first of four core changes we would like for Tux3. We
start with a hard one and suggest a simple solution.

The first patch in this series adds a new super operation to write
back multiple inodes in a single call. The second patch applies to
our linux-tux3 repository at git.kernel.org to demonstrate how this
interface is used, and removes about 450 lines of workaround code.

Traditionally, core kernel tells each mounted filesystems which
dirty pages of which inodes should be flushed to disk, but
unfortunately, is blissfully ignorant of filesystem-specific
ordering constraints. This scheme was really invented for Ext2 and
has not evolved much recently. Tux3, with its strong ordering and
optimized delta commit, cannot tolerate random flushing and
therefore takes responsibility for flush ordering itself. On the
other hand, Tux3 has no good way to know when is the right time to
flush, but core is good at that. This proposed API harmonizes those
two competencies so that Tux3 and core each take care of what they
are good at, and not more.

The API extension consists of a new writeback hook and two helpers
to be uses within the hook. The hook sits about halfway down the
fs-writeback stack, just after core has determined that some dirty
inodes should be flushed to disk and just before it starts thinking
about which inodes those should be. At that point, core calls Tux3
instead of continuing on down the usual do_writepages path. Tux3
responds by staging a new delta commit, using the new helpers to
tell core which inodes were flushed versus being left dirty in
cache. This is pretty much the same behavior as the traditional
writeout path, but less convoluted, probably more efficient, and
certainly easier to analyze.

The new writeback hook looks like:

    progress = sb->s_op->writeback(sb, &writeback_control, &nr_pages);

This should be self-explanatory: nr_pages and progress have the
semantics of existing usage in fs-writeback; writeback_control is
ignored by Tux3, but that is only because Tux3 always flushes
everything and does not require hints for now. We can safely assume
that &wbc or equivalent is wanted here. An obvious wart is the
overlap between "progress" and "nr_pages", but fs-writeback thinks
that way, so it would not make much sense to improve one without
improving the other.

Tux3 necessarily keeps its own dirty inode list, which is an area
of overlap with fs-writeback. In a perfect world, there would be
just one dirty inode list per superblock, on which both fs-writeback
and Tux3 would operate. That would be a deeper core change than
seems appropriate right now.

Potential races are a big issue with this API, which is no surprise.
The fs-writeback scheme requires keeping several kinds of object in
sync: tux3 dirty inode lists, fs-writeback dirty inode lists and
inode dirty state. The new helpers inode_writeback_done(inode) and
inode_writeback_touch(inode) take care of that while hiding
internal details of the fs-writeback implementation.

Tux3 calls inode_writeback_done when it has flushed an inode and
marked it clean, or calls inode_writeback_touch if it intends to
retain a dirty inode in cache. These have simple implementations.
The former just removes a clean inode from any fs-writeback list.
The latter updates the inode's dirty timestamp so that fs-writeback
does not keep trying flush it. Both these things could be done more
efficiently by re-engineering fs-writeback, but we prefer to work
with the existing scheme for now.

Hirofumi's masterful hack nicely avoided racy removal of inodes from
the writeback list by taking an internal fs-writeback lock inside
filesystem code. The new helper requires dropping i_lock inside
filesystem code and retaking it in the helper, so inode redirty can
race with writeback list removal. This does not seem to present a
problem because filesystem code is able to enforce strict
alternation of cleaning and calling the helper. As an offsetting
advantage, writeback lock contention is reduced.

Compared to Hirofumi's hack, the cost of this interface is one
additional spinlock per inode_writeback_done,  which is
insignificant compared to the convoluted code path that is avoided.
Regards,

Daniel

---
From: Daniel Phillips <daniel@tux3.org>
Subject: [PATCH] Add a super operation for writeback

Add a "writeback" super operation to be called in the
form:

        progress = sb->s_op->writeback(sb, &wbc, &pages);

The filesystem is expected to flush some inodes to disk
and return progress of at least 1, or if no inodes are
flushed, return progress of zero. The filesystem should
try to flush at least the number of pages specified in
*pages, or if that is not possible, return approximately
the number of pages not flushed into *pages.

Within the ->writeback callback, the filesystem should
call inode_writeback_done(inode) for each inode flushed
(and therefore set clean) or inode_writeback_touch(inode)
for any inode that will be retained dirty in cache.

Signed-off-by: Daniel Phillips  <daniel@tux3.org>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
---

 fs/fs-writeback.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++++++---
 include/linux/fs.h |    4 +++
 2 files changed, 60 insertions(+), 3 deletions(-)

diff -puN fs/fs-writeback.c~core-writeback fs/fs-writeback.c
--- linux-tux3/fs/fs-writeback.c~core-writeback	2014-05-31 06:43:19.416031712 +0900
+++ linux-tux3-hirofumi/fs/fs-writeback.c	2014-05-31 06:44:46.087904373 +0900
@@ -192,6 +192,35 @@ void inode_wb_list_del(struct inode *ino
 }

 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+	spin_lock(&bdi->wb.list_lock);
+	spin_lock(&inode->i_lock);
+	if (!(inode->i_state & I_DIRTY))
+		list_del_init(&inode->i_wb_list);
+	spin_unlock(&inode->i_lock);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
+	spin_lock(&bdi->wb.list_lock);
+	inode->dirtied_when = jiffies;
+	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+	spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -593,9 +622,9 @@ static long writeback_chunk_size(struct
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-				struct bdi_writeback *wb,
-				struct wb_writeback_work *work)
+static long __writeback_sb_inodes(struct super_block *sb,
+				  struct bdi_writeback *wb,
+				  struct wb_writeback_work *work)
 {
 	struct writeback_control wbc = {
 		.sync_mode		= work->sync_mode,
@@ -710,6 +739,30 @@ static long writeback_sb_inodes(struct s
 	return wrote;
 }

+static long writeback_sb_inodes(struct super_block *sb,
+				struct bdi_writeback *wb,
+				struct wb_writeback_work *work)
+{
+	if (sb->s_op->writeback) {
+		struct writeback_control wbc = {
+			.sync_mode		= work->sync_mode,
+			.tagged_writepages	= work->tagged_writepages,
+			.for_kupdate		= work->for_kupdate,
+			.for_background		= work->for_background,
+			.for_sync		= work->for_sync,
+			.range_cyclic		= work->range_cyclic,
+		};
+		long ret;
+
+		spin_unlock(&wb->list_lock);
+		ret = sb->s_op->writeback(sb, &wbc, &work->nr_pages);
+		spin_lock(&wb->list_lock);
+		return ret;
+	}
+
+	return __writeback_sb_inodes(sb, wb, work);
+}
+
 static long __writeback_inodes_wb(struct bdi_writeback *wb,
 				  struct wb_writeback_work *work)
 {
diff -puN include/linux/fs.h~core-writeback include/linux/fs.h
--- linux-tux3/include/linux/fs.h~core-writeback	2014-05-31 06:43:19.436031682 +0900
+++ linux-tux3-hirofumi/include/linux/fs.h	2014-05-31 06:48:41.619558001 +0900
@@ -1536,6 +1536,8 @@ struct super_operations {
 	int (*drop_inode) (struct inode *);
 	void (*evict_inode) (struct inode *);
 	void (*put_super) (struct super_block *);
+	long (*writeback)(struct super_block *super,
+			  struct writeback_control *wbc, long *nr_pages);
 	int (*sync_fs)(struct super_block *sb, int wait);
 	int (*freeze_fs) (struct super_block *);
 	int (*unfreeze_fs) (struct super_block *);
@@ -1737,6 +1739,8 @@ static inline void file_accessed(struct
 		touch_atime(&file->f_path);
 }

+void inode_writeback_done(struct inode *inode);
+void inode_writeback_touch(struct inode *inode);
 int sync_inode(struct inode *inode, struct writeback_control *wbc);
 int sync_inode_metadata(struct inode *inode, int wait);



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [RFC][PATCH 2/2] tux3: Use writeback hook to remove duplicated core code
  2014-06-01 21:41 [RFC][PATCH 1/2] Add a super operation for writeback Daniel Phillips
@ 2014-06-01 21:42 ` Daniel Phillips
  2014-06-02  3:30   ` Dave Chinner
  2014-06-02  3:15 ` [RFC][PATCH 1/2] Add a super operation for writeback Dave Chinner
  2014-06-02  8:30 ` Christian Stroetmann
  2 siblings, 1 reply; 24+ messages in thread
From: Daniel Phillips @ 2014-06-01 21:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel; +Cc: Linus Torvalds, Andrew Morton, OGAWA Hirofumi

Instead of re-implementing part of fs/fs-writeback.c, use a proposed
net ->writeback super operation to drive delta writeback. For each
inode that is cleaned, call inode_writeback_done(inode). For each
inode that will be kept dirty in cache, call inode_writeback_touch
so that the inode appears young to fs-writeback and does not trigger
repeated ->writeback flushes.

Signed-off-by: Daniel Phillips <daniel@tux3.org>
---
 fs/tux3/Makefile              |   2 +-
 fs/tux3/commit.c              |   1 -
 fs/tux3/commit_flusher.c      | 180 ++++++++++--------
 fs/tux3/commit_flusher.h      |  16 --
 fs/tux3/commit_flusher_hack.c | 423 ------------------------------------------
 fs/tux3/inode.c               |   2 -
 fs/tux3/super.c               |  17 +-
 fs/tux3/tux3.h                |  11 +-
 fs/tux3/writeback.c           |  75 ++------
 11 files changed, 128 insertions(+), 599 deletions(-)
 delete mode 100644 fs/tux3/commit_flusher.h
 delete mode 100644 fs/tux3/commit_flusher_hack.c

diff --git a/fs/tux3/Makefile b/fs/tux3/Makefile
index 9623a54..30faba5 100644
--- a/fs/tux3/Makefile
+++ b/fs/tux3/Makefile
@@ -13,7 +13,7 @@ tux3-objs += balloc.o btree.o buffer.o commit.o dir.o dleaf.o \
 EXTRA_CFLAGS += -Werror -std=gnu99 -Wno-declaration-after-statement
 #EXTRA_CFLAGS += -DTUX3_FLUSHER=TUX3_FLUSHER_SYNC
 #EXTRA_CFLAGS += -DTUX3_FLUSHER=TUX3_FLUSHER_ASYNC_OWN
-EXTRA_CFLAGS += -DTUX3_FLUSHER=TUX3_FLUSHER_ASYNC_HACK
+EXTRA_CFLAGS += -DTUX3_FLUSHER=TUX3_FLUSHER_ASYNC

 obj-$(CONFIG_TUX3_MMAP) += mmap_builtin_hack.o
 endif
diff --git a/fs/tux3/commit.c b/fs/tux3/commit.c
index dd76d49..84e686e 100644
--- a/fs/tux3/commit.c
+++ b/fs/tux3/commit.c
@@ -638,7 +638,6 @@ static void delta_transition(struct sb *sb)
      ((int)(a) - (int)(b) >= 0))

 #include "commit_flusher.c"
-#include "commit_flusher_hack.c"

 int force_unify(struct sb *sb)
 {
diff --git a/fs/tux3/commit_flusher.c b/fs/tux3/commit_flusher.c
index 8e7057d..2d938c5 100644
--- a/fs/tux3/commit_flusher.c
+++ b/fs/tux3/commit_flusher.c
@@ -4,7 +4,7 @@
  * Copyright (c) 2008-2014 OGAWA Hirofumi
  */

-#if TUX3_FLUSHER != TUX3_FLUSHER_ASYNC_HACK
+#if TUX3_FLUSHER == TUX3_FLUSHER_SYNC
 #include "tux3.h"

 static void __tux3_init_flusher(struct sb *sb)
@@ -15,72 +15,6 @@ static void __tux3_init_flusher(struct sb *sb)
 #endif
 }

-#if TUX3_FLUSHER == TUX3_FLUSHER_ASYNC_OWN
-static int flush_delta_work(void *data)
-{
-    struct sb *sb = data;
-    int err;
-
-    set_freezable();
-
-    /*
-     * Our parent may run at a different priority, just set us to normal
-     */
-    set_user_nice(current, 0);
-
-    while (!kthread_freezable_should_stop(NULL)) {
-        if (test_bit(TUX3_COMMIT_PENDING_BIT, &sb->backend_state)) {
-            clear_bit(TUX3_COMMIT_PENDING_BIT, &sb->backend_state);
-
-            err = flush_delta(sb);
-            /* FIXME: error handling */
-        }
-
-        set_current_state(TASK_INTERRUPTIBLE);
-        if (!test_bit(TUX3_COMMIT_PENDING_BIT, &sb->backend_state) &&
-            !kthread_should_stop())
-            schedule();
-        __set_current_state(TASK_RUNNING);
-    }
-
-    return 0;
-}
-
-int tux3_init_flusher(struct sb *sb)
-{
-    struct task_struct *task;
-    char b[BDEVNAME_SIZE];
-
-    __tux3_init_flusher(sb);
-
-    bdevname(vfs_sb(sb)->s_bdev, b);
-
-    /* FIXME: we should use normal bdi-writeback by changing core */
-    task = kthread_run(flush_delta_work, sb, "tux3/%s", b);
-    if (IS_ERR(task))
-        return PTR_ERR(task);
-
-    sb->flush_task = task;
-
-    return 0;
-}
-
-void tux3_exit_flusher(struct sb *sb)
-{
-    if (sb->flush_task) {
-        kthread_stop(sb->flush_task);
-        sb->flush_task = NULL;
-    }
-}
-
-static void schedule_flush_delta(struct sb *sb)
-{
-    /* Start the flusher for pending delta */
-    wake_up_process(sb->flush_task);
-}
-
-#else /* TUX3_FLUSHER != TUX3_FLUSHER_ASYNC_OWN */
-
 int tux3_init_flusher(struct sb *sb)
 {
     __tux3_init_flusher(sb);
@@ -109,7 +43,6 @@ static int flush_pending_delta(struct sb *sb)
 out:
     return err;
 }
-#endif /* TUX3_FLUSHER != TUX3_FLUSHER_ASYNC_OWN */

 /* Try delta transition */
 static void try_delta_transition(struct sb *sb)
@@ -155,10 +88,8 @@ static int try_flush_pending_until_delta(struct sb *sb, unsigned delta)
     trace("delta %u, committed %u, backend_state %lx",
           delta, sb->committed_delta, sb->backend_state);

-#if TUX3_FLUSHER == TUX3_FLUSHER_SYNC
     if (!delta_after_eq(sb->committed_delta, delta))
         flush_pending_delta(sb);
-#endif

     return delta_after_eq(sb->committed_delta, delta);
 }
@@ -175,9 +106,7 @@ static int sync_current_delta(struct sb *sb, enum unify_flags unify_flag)
     unsigned delta;
     int err = 0;

-#if TUX3_FLUSHER == TUX3_FLUSHER_SYNC
     down_write(&sb->delta_lock);
-#endif
     /* Get delta that have to write */
     delta_ref = delta_get(sb);
 #ifdef UNIFY_DEBUG
@@ -197,10 +126,111 @@ static int sync_current_delta(struct sb *sb, enum unify_flags unify_flag)
     /* Wait until committing the current delta */
     err = wait_for_commit(sb, delta);
     assert(err || delta_after_eq(sb->committed_delta, delta));
-#if TUX3_FLUSHER == TUX3_FLUSHER_SYNC
     up_write(&sb->delta_lock);
+    return err;
+}
+
+#else /* TUX3_FLUSHER == TUX3_FLUSHER_ASYNC */
+
+static void try_delta_transition(struct sb *sb)
+{
+#if 0
+    trace("stage %u, backend_state %lx",
+          sb->staging_delta, sb->backend_state);
+    sync_inodes_sb(vfs_sb(sb));
 #endif
+}

-    return err;
+/* Do the delta transition until specified delta */
+static int try_delta_transition_until_delta(struct sb *sb, unsigned delta)
+{
+    trace("delta %u, stage %u, backend_state %lx",
+          delta, sb->staging_delta, sb->backend_state);
+
+    /* Already delta transition was started for delta */
+    if (delta_after_eq(sb->staging_delta, delta))
+        return 1;
+
+    if (!test_and_set_bit(TUX3_COMMIT_RUNNING_BIT, &sb->backend_state)) {
+        /* Recheck after grabed TUX3_COMMIT_RUNNING_BIT */
+        if (delta_after_eq(sb->staging_delta, delta)) {
+            clear_bit(TUX3_COMMIT_RUNNING_BIT, &sb->backend_state);
+            return 1;
+        }
+
+        delta_transition(sb);
+    }
+
+    return delta_after_eq(sb->staging_delta, delta);
 }
-#endif /* TUX3_FLUSHER == TUX3_FLUSHER_ASYNC_HACK */
+
+/* Advance delta transition until specified delta */
+static int wait_for_transition(struct sb *sb, unsigned delta)
+{
+    return wait_event_killable(sb->delta_event_wq,
+        try_delta_transition_until_delta(sb, delta));
+}
+
+long tux3_writeback(struct super_block *super, struct writeback_control *wbc, long *nr_pages)
+{
+    struct sb *sb = tux_sb(super);
+    struct delta_ref *delta_ref;
+    unsigned delta;
+    int err;
+
+    /* If we didn't finish replay yet, don't flush. */
+    if (!(super->s_flags & MS_ACTIVE))
+        return 0;
+
+    /* Get delta that have to write */
+    delta_ref = delta_get(sb);
+#ifdef UNIFY_DEBUG
+    /* NO_UNIFY and FORCE_UNIFY are not supported for now */
+    delta_ref->unify_flag = ALLOW_UNIFY;
+#endif
+    delta = delta_ref->delta;
+    delta_put(sb, delta_ref);
+
+    /* Make sure the delta transition was done for current delta */
+    err = wait_for_transition(sb, delta);
+    if (err)
+        return err;
+    assert(delta_after_eq(sb->staging_delta, delta));
+
+    /* Wait for last referencer of delta was gone */
+    wait_event(sb->delta_event_wq,
+           test_bit(TUX3_COMMIT_PENDING_BIT, &sb->backend_state));
+
+    if (test_bit(TUX3_COMMIT_PENDING_BIT, &sb->backend_state)) {
+        clear_bit(TUX3_COMMIT_PENDING_BIT, &sb->backend_state);
+
+        err = flush_delta(sb);
+        /* FIXME: error handling */
+#if 0
+        /* wb_update_bandwidth() is not exported to module */
+        wb_update_bandwidth(wb, wb_start);
+#endif
+    }
+
+    *nr_pages = 0;
+    return 1;
+}
+
+static int sync_current_delta(struct sb *sb, enum unify_flags unify_flag)
+{
+    /* FORCE_UNIFY is not supported */
+    WARN_ON(unify_flag == FORCE_UNIFY);
+    /* This is called only for fsync, so we can take ->s_umount here */
+    down_read(&vfs_sb(sb)->s_umount);
+    sync_inodes_sb(vfs_sb(sb));
+    up_read(&vfs_sb(sb)->s_umount);
+    return 0;    /* FIXME: error code */
+}
+
+static void schedule_flush_delta(struct sb *sb)
+{
+    /* Wake up waiters for pending delta staging */
+    wake_up_all(&sb->delta_event_wq);
+}
+
+#endif /* TUX3_FLUSHER == TUX3_FLUSHER_ASYNC */
diff --git a/fs/tux3/commit_flusher.h b/fs/tux3/commit_flusher.h
deleted file mode 100644
index 2c0a144..0000000
--- a/fs/tux3/commit_flusher.h
+++ /dev/null
@@ -1,16 +0,0 @@
-#ifndef TUX3_COMMIT_FLUSHER_H
-#define TUX3_COMMIT_FLUSHER_H
-
-/* FIXME: Remove this file after implement of flusher interface */
-
-#if TUX3_FLUSHER == TUX3_FLUSHER_ASYNC_HACK
-/* Hack for BDI_CAP_NO_WRITEBACK */
-void tux3_set_mapping_bdi(struct inode *inode);
-#else
-static inline void tux3_set_mapping_bdi(struct inode *inode) { }
-#endif
-
-int tux3_init_flusher(struct sb *sb);
-void tux3_exit_flusher(struct sb *sb);
-
-#endif /* !TUX3_COMMIT_FLUSHER_H */
diff --git a/fs/tux3/commit_flusher_hack.c b/fs/tux3/commit_flusher_hack.c
deleted file mode 100644
index 08696ed..0000000
--- a/fs/tux3/commit_flusher_hack.c
+++ /dev/null
@@ -1,423 +0,0 @@
-/*
- * FIXME: this is hack to override writeback without patch kernel.
- * We should add proper interfaces to do this, instead. Then, remove
- * this stuff.
- */
-
-#if TUX3_FLUSHER == TUX3_FLUSHER_ASYNC_HACK
-#include "tux3.h"
-#include <linux/kthread.h>
-#include <linux/freezer.h>
-
-void tux3_set_mapping_bdi(struct inode *inode)
-{
-    /*
-     * Hack: set backing_dev_info to use our bdi.
-     */
-    inode->i_mapping->backing_dev_info = inode->i_sb->s_bdi;
-}
-
-/*
- * FIXME: dirty hack for now. We should add callback in writeback task
- * instead of custom bdi.
- */
-struct wb_writeback_work {
-    long nr_pages;
-    struct super_block *sb;
-    unsigned long *older_than_this;
-    enum writeback_sync_modes sync_mode;
-    unsigned int tagged_writepages:1;
-    unsigned int for_kupdate:1;
-    unsigned int range_cyclic:1;
-    unsigned int for_background:1;
-    unsigned int for_sync:1;    /* sync(2) WB_SYNC_ALL writeback */
-    enum wb_reason reason;        /* why was writeback initiated? */
-
-    struct list_head list;        /* pending work list */
-    struct completion *done;    /* set if the caller waits */
-};
-
-/* Do the delta transition until specified delta */
-static int try_delta_transition_until_delta(struct sb *sb, unsigned delta)
-{
-    trace("delta %u, stage %u, backend_state %lx",
-          delta, sb->staging_delta, sb->backend_state);
-
-    /* Already delta transition was started for delta */
-    if (delta_after_eq(sb->staging_delta, delta))
-        return 1;
-
-    if (!test_and_set_bit(TUX3_COMMIT_RUNNING_BIT, &sb->backend_state)) {
-        /* Recheck after grabed TUX3_COMMIT_RUNNING_BIT */
-        if (delta_after_eq(sb->staging_delta, delta)) {
-            clear_bit(TUX3_COMMIT_RUNNING_BIT, &sb->backend_state);
-            return 1;
-        }
-
-        delta_transition(sb);
-    }
-
-    return delta_after_eq(sb->staging_delta, delta);
-}
-
-/* Advance delta transition until specified delta */
-static int wait_for_transition(struct sb *sb, unsigned delta)
-{
-    return wait_event_killable(sb->delta_event_wq,
-                   try_delta_transition_until_delta(sb, delta));
-}
-
-static long tux3_wb_writeback(struct bdi_writeback *wb,
-                  struct wb_writeback_work *work)
-{
-    struct sb *sb = container_of(wb->bdi, struct sb, bdi);
-    struct delta_ref *delta_ref;
-    unsigned delta;
-    int err;
-
-    /* If we didn't finish replay yet, don't flush. */
-    if (!(vfs_sb(sb)->s_flags & MS_ACTIVE))
-        return 0;
-
-    /* Get delta that have to write */
-    delta_ref = delta_get(sb);
-#ifdef UNIFY_DEBUG
-    /* NO_UNIFY and FORCE_UNIFY are not supported for now */
-    delta_ref->unify_flag = ALLOW_UNIFY;
-#endif
-    delta = delta_ref->delta;
-    delta_put(sb, delta_ref);
-
-    /* Make sure the delta transition was done for current delta */
-    err = wait_for_transition(sb, delta);
-    if (err)
-        return err;
-    assert(delta_after_eq(sb->staging_delta, delta));
-
-    /* Wait for last referencer of delta was gone */
-    wait_event(sb->delta_event_wq,
-           test_bit(TUX3_COMMIT_PENDING_BIT, &sb->backend_state));
-
-    if (test_bit(TUX3_COMMIT_PENDING_BIT, &sb->backend_state)) {
-        clear_bit(TUX3_COMMIT_PENDING_BIT, &sb->backend_state);
-
-        err = flush_delta(sb);
-        /* FIXME: error handling */
-#if 0
-        /* wb_update_bandwidth() is not exported to module */
-        wb_update_bandwidth(wb, wb_start);
-#endif
-    }
-
-    return 1; /* FIXME: return code */
-}
-
-static bool inode_dirtied_after(struct inode *inode, unsigned long t)
-{
-    bool ret = time_after(inode->dirtied_when, t);
-#ifndef CONFIG_64BIT
-    /*
-     * For inodes being constantly redirtied, dirtied_when can get stuck.
-     * It _appears_ to be in the future, but is actually in distant past.
-     * This test is necessary to prevent such wrapped-around relative times
-     * from permanently stopping the whole bdi writeback.
-     */
-    ret = ret && time_before_eq(inode->dirtied_when, jiffies);
-#endif
-    return ret;
-}
-
-static int tux3_has_old_data(struct bdi_writeback *wb)
-{
-    static unsigned int tux3_dirty_expire_interval = 30 * 100;
-
-    int has_old = 0;
-
-    /*
-     * We don't flush for each inodes. So, we flush all for each
-     * tux3_dirty_expire_interval.
-     *
-     * FIXME: we should pickup only older inodes?
-     */
-    spin_lock(&wb->list_lock);
-    if (wb_has_dirty_io(wb)) {
-        unsigned long older_than_this = jiffies -
-            msecs_to_jiffies(tux3_dirty_expire_interval * 10);
-        struct inode *inode =
-            list_entry(wb->b_dirty.prev, struct inode, i_wb_list);
-
-        if (!inode_dirtied_after(inode, older_than_this))
-            has_old = 1;
-    }
-    spin_unlock(&wb->list_lock);
-
-    return has_old;
-}
-
-static long tux3_wb_check_old_data_flush(struct bdi_writeback *wb)
-{
-    /* Hack: dirty_expire_interval is not exported to module */
-    unsigned long expired;
-
-    /*
-     * When set to zero, disable periodic writeback
-     */
-    if (!dirty_writeback_interval)
-        return 0;
-
-    expired = wb->last_old_flush +
-            msecs_to_jiffies(dirty_writeback_interval * 10);
-    if (time_before(jiffies, expired))
-        return 0;
-
-    wb->last_old_flush = jiffies;
-
-    if (!tux3_has_old_data(wb)) {
-        /*
-         * If now after interval, we return 1 at least, to
-         * avoid to run tux3_wb_check_background_flush().
-         */
-        return 1;
-    }
-
-    struct wb_writeback_work work = {
-        .nr_pages    = 0,
-        .sync_mode    = WB_SYNC_NONE,
-        .for_kupdate    = 1,
-        .range_cyclic    = 1,
-        .reason        = WB_REASON_PERIODIC,
-    };
-
-    return tux3_wb_writeback(wb, &work);
-}
-
-static inline int tux3_over_bground_thresh(struct backing_dev_info *bdi,
-                       long wrote)
-{
-    /*
-     * FIXME: Memory pressure functions are not exported to module.
-     *
-     * So, if we didn't wrote any data on this wakeup, we assume
-     * this wakeup call is from memory pressure.
-     */
-    return !wrote;
-}
-
-static long tux3_wb_check_background_flush(struct bdi_writeback *wb, long wrote)
-{
-    if (tux3_over_bground_thresh(wb->bdi, wrote)) {
-
-        struct wb_writeback_work work = {
-            .nr_pages    = LONG_MAX,
-            .sync_mode    = WB_SYNC_NONE,
-            .for_background    = 1,
-            .range_cyclic    = 1,
-            .reason        = WB_REASON_BACKGROUND,
-        };
-
-        return tux3_wb_writeback(wb, &work);
-    }
-
-    return 0;
-}
-
-static struct wb_writeback_work *
-get_next_work_item(struct backing_dev_info *bdi)
-{
-    struct wb_writeback_work *work = NULL;
-
-    spin_lock_bh(&bdi->wb_lock);
-    if (!list_empty(&bdi->work_list)) {
-        work = list_entry(bdi->work_list.next,
-                  struct wb_writeback_work, list);
-        list_del_init(&work->list);
-    }
-    spin_unlock_bh(&bdi->wb_lock);
-    return work;
-}
-
-static long tux3_do_writeback(struct bdi_writeback *wb)
-{
-    struct backing_dev_info *bdi = wb->bdi;
-    struct wb_writeback_work *work = NULL;
-    long wrote = 0;
-
-    set_bit(BDI_writeback_running, &wb->bdi->state);
-    while ((work = get_next_work_item(bdi)) != NULL) {
-        trace("nr_pages %ld, sb %p, sync_mode %d, "
-              "tagged_writepages %d, for_kupdate %d, range_cyclic %d, "
-              "for_background %d, reason %d, done %p",
-              work->nr_pages, work->sb, work->sync_mode,
-              work->tagged_writepages, work->for_kupdate,
-              work->range_cyclic, work->for_background,
-              work->reason, work->done);
-
-        wrote += tux3_wb_writeback(wb, work);
-
-        /*
-         * Notify the caller of completion if this is a synchronous
-         * work item, otherwise just free it.
-         */
-        if (work->done)
-            complete(work->done);
-        else
-            kfree(work);
-    }
-    trace("flush done");
-
-    /*
-     * Check for periodic writeback, kupdated() style
-     */
-    wrote += tux3_wb_check_old_data_flush(wb);
-    wrote += tux3_wb_check_background_flush(wb, wrote);
-    clear_bit(BDI_writeback_running, &wb->bdi->state);
-
-    return wrote;
-}
-
-/* Dirty hack to get bdi_wq address from module */
-static struct workqueue_struct *kernel_bdi_wq;
-
-/*
- * Handle writeback of dirty data for the device backed by this bdi. Also
- * reschedules periodically and does kupdated style flushing.
- */
-static void tux3_writeback_workfn(struct work_struct *work)
-{
-    struct bdi_writeback *wb = container_of(to_delayed_work(work),
-                        struct bdi_writeback, dwork);
-    struct backing_dev_info *bdi = wb->bdi;
-    long pages_written;
-
-#if 0
-    /* set_worker_desc() is not exported to module */
-    set_worker_desc("flush-tux3-%s", dev_name(bdi->dev));
-#endif
-    current->flags |= PF_SWAPWRITE;
-
-#if 0
-    /* current_is_workqueue_rescuer() is not exported to module */
-    if (likely(!current_is_workqueue_rescuer() ||
-           list_empty(&bdi->bdi_list)))
-#endif
-    {
-        /*
-         * The normal path.  Keep writing back @bdi until its
-         * work_list is empty.  Note that this path is also taken
-         * if @bdi is shutting down even when we're running off the
-         * rescuer as work_list needs to be drained.
-         */
-        do {
-            pages_written = tux3_do_writeback(wb);
-//            trace_writeback_pages_written(pages_written);
-        } while (!list_empty(&bdi->work_list));
-    }
-#if 0
-    else {
-        /*
-         * bdi_wq can't get enough workers and we're running off
-         * the emergency worker.  Don't hog it.  Hopefully, 1024 is
-         * enough for efficient IO.
-         */
-        pages_written = writeback_inodes_wb(&bdi->wb, 1024,
-                            WB_REASON_FORKER_THREAD);
-        trace_writeback_pages_written(pages_written);
-    }
-#endif
-    if (!list_empty(&bdi->work_list) ||
-        (wb_has_dirty_io(wb) && dirty_writeback_interval))
-        queue_delayed_work(kernel_bdi_wq, &wb->dwork,
-            msecs_to_jiffies(dirty_writeback_interval * 10));
-
-    current->flags &= ~PF_SWAPWRITE;
-}
-
-#include <linux/kallsyms.h>
-static int tux3_setup_writeback(struct sb *sb, struct backing_dev_info *bdi)
-{
-    /* Dirty hack to get bdi_wq address from module */
-    if (kernel_bdi_wq == NULL) {
-        unsigned long wq_addr;
-
-        wq_addr = kallsyms_lookup_name("bdi_wq");
-        if (!wq_addr) {
-            tux3_err(sb, "couldn't find bdi_wq address\n");
-            return -EINVAL;
-        }
-        kernel_bdi_wq = *(struct workqueue_struct **)wq_addr;
-        tux3_msg(sb, "use bdi_wq %p", kernel_bdi_wq);
-    }
-
-    /* Overwrite callback by ourself handler */
-    INIT_DELAYED_WORK(&bdi->wb.dwork, tux3_writeback_workfn);
-
-    return 0;
-}
-
-static int tux3_congested_fn(void *congested_data, int bdi_bits)
-{
-    return bdi_congested(congested_data, bdi_bits);
-}
-
-/*
- * We need to disable writeback to control dirty flags of inode.
- * Otherwise, writeback will clear dirty, and inode can be reclaimed
- * without our control.
- */
-int tux3_init_flusher(struct sb *sb)
-{
-    struct backing_dev_info *bdi = &sb->bdi;
-    int err;
-
-    bdi->ra_pages        = vfs_sb(sb)->s_bdi->ra_pages;
-    bdi->congested_fn    = tux3_congested_fn;
-    bdi->congested_data    = vfs_sb(sb)->s_bdi;
-
-    err = bdi_setup_and_register(bdi, "tux3", BDI_CAP_MAP_COPY);
-    if (err)
-        return err;
-
-    err = tux3_setup_writeback(sb, bdi);
-    if (err) {
-        bdi_destroy(bdi);
-        return err;
-    }
-
-    vfs_sb(sb)->s_bdi = bdi;
-
-    return 0;
-}
-
-void tux3_exit_flusher(struct sb *sb)
-{
-    struct backing_dev_info *bdi = vfs_sb(sb)->s_bdi;
-    if (bdi == &sb->bdi)
-        bdi_destroy(bdi);
-}
-
-static void schedule_flush_delta(struct sb *sb)
-{
-    /* Wake up waiters for pending delta staging */
-    wake_up_all(&sb->delta_event_wq);
-}
-
-static void try_delta_transition(struct sb *sb)
-{
-#if 0
-    trace("stage %u, backend_state %lx",
-          sb->staging_delta, sb->backend_state);
-    sync_inodes_sb(vfs_sb(sb));
-#endif
-}
-
-static int sync_current_delta(struct sb *sb, enum unify_flags unify_flag)
-{
-    /* FORCE_UNIFY is not supported */
-    WARN_ON(unify_flag == FORCE_UNIFY);
-    /* This is called only for fsync, so we can take ->s_umount here */
-    down_read(&vfs_sb(sb)->s_umount);
-    sync_inodes_sb(vfs_sb(sb));
-    up_read(&vfs_sb(sb)->s_umount);
-    return 0;    /* FIXME: error code */
-}
-#endif /* TUX3_FLUSHER != TUX3_FLUSHER_ASYNC_HACK */
diff --git a/fs/tux3/inode.c b/fs/tux3/inode.c
index 1bfb28f..5c9b1f4 100644
--- a/fs/tux3/inode.c
+++ b/fs/tux3/inode.c
@@ -932,8 +932,6 @@ static void tux_setup_inode(struct inode *inode)

     assert(tux_inode(inode)->inum != TUX_INVALID_INO);

-    tux3_set_mapping_bdi(inode);
-
 //    inode->i_generation = 0;
 //    inode->i_flags = 0;

diff --git a/fs/tux3/super.c b/fs/tux3/super.c
index 931c86d..68642d4 100644
--- a/fs/tux3/super.c
+++ b/fs/tux3/super.c
@@ -126,9 +126,6 @@ static void __tux3_put_super(struct sb *sbi)
     iput(sbi->volmap);
     sbi->volmap = NULL;

-    /* Cleanup flusher after inode was evicted */
-    tux3_exit_flusher(sbi);
-
     tux3_free_idefer_map(sbi->idefer_map);
     sbi->idefer_map = NULL;
     /* FIXME: add more sanity check */
@@ -178,13 +175,6 @@ struct replay *tux3_init_fs(struct sb *sbi)
     char *name;
     int err;

-    /* Initialize flusher before setup inode */
-    err = tux3_init_flusher(sbi);
-    if (err) {
-        tux3_err(sbi, "failed to initialize flusher");
-        goto error;
-    }
-
     err = -ENOMEM;

     /* Prepare non on-disk inodes */
@@ -375,7 +365,7 @@ static void tux3_destroy_inode(struct inode *inode)
     call_rcu(&inode->i_rcu, tux3_i_callback);
 }

-#if TUX3_FLUSHER != TUX3_FLUSHER_ASYNC_HACK
+#if TUX3_FLUSHER != TUX3_FLUSHER_ASYNC
 static int tux3_sync_fs(struct super_block *sb, int wait)
 {
     /* FIXME: We should support "wait" parameter. */
@@ -423,12 +413,13 @@ static const struct super_operations tux3_super_ops = {
     .evict_inode    = tux3_evict_inode,
     /* FIXME: we have to handle write_inode of sync (e.g. cache pressure) */
 //    .write_inode    = tux3_write_inode,
-#if TUX3_FLUSHER != TUX3_FLUSHER_ASYNC_HACK
-    /* If TUX3_FLUSHER_ASYNC_HACK, normal kernel flush request does all */
+#if TUX3_FLUSHER != TUX3_FLUSHER_ASYNC
+    /* If TUX3_FLUSHER_ASYNC, normal kernel flush request does all */
     .sync_fs    = tux3_sync_fs,
 #endif
     .put_super    = tux3_put_super,
     .statfs        = tux3_statfs,
+    .writeback = tux3_writeback,
 };

 static int tux3_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tux3/tux3.h b/fs/tux3/tux3.h
index 002d6d4..3ca6756 100644
--- a/fs/tux3/tux3.h
+++ b/fs/tux3/tux3.h
@@ -222,7 +222,7 @@ struct stash { struct flink_head head; u64 *pos, *top; };
 /* Flush asynchronously by own timing */
 #define TUX3_FLUSHER_ASYNC_OWN        2
 /* Flush asynchronously by kernel normal timing (by hackish way) */
-#define TUX3_FLUSHER_ASYNC_HACK        3
+#define TUX3_FLUSHER_ASYNC        3

 /* Refcount for delta */
 struct delta_ref {
@@ -271,9 +271,6 @@ struct sb {
 #if TUX3_FLUSHER == TUX3_FLUSHER_ASYNC_OWN
     struct task_struct *flush_task;        /* work to flush delta */
 #endif
-#if TUX3_FLUSHER == TUX3_FLUSHER_ASYNC_HACK
-    struct backing_dev_info bdi;
-#endif

     struct btree itree;    /* Inode btree */
     struct btree otree;    /* Orphan btree */
@@ -793,9 +790,6 @@ int change_end(struct sb *sb);
 void change_begin_if_needed(struct sb *sb, int need_sep);
 void change_end_if_needed(struct sb *sb);

-/* commit_flusher.c */
-#include "commit_flusher.h"
-
 /* dir.c */
 void tux_set_entry(struct buffer_head *buffer, struct tux3_dirent *entry,
            inum_t inum, umode_t mode);
@@ -978,6 +972,9 @@ static inline void tux3_mark_inode_dirty_sync(struct inode *inode)
     __tux3_mark_inode_dirty(inode, I_DIRTY_SYNC);
 }

+struct super_block;
+struct writeback_control;
+long tux3_writeback(struct super_block *super, struct writeback_control *wbc, long *nr_pages);
 void tux3_dirty_inode(struct inode *inode, int flags);
 void tux3_mark_inode_to_delete(struct inode *inode);
 void tux3_iattrdirty(struct inode *inode);
diff --git a/fs/tux3/writeback.c b/fs/tux3/writeback.c
index 9ecafc0..b4b4798 100644
--- a/fs/tux3/writeback.c
+++ b/fs/tux3/writeback.c
@@ -124,57 +124,6 @@ static inline unsigned tux3_dirty_flags(struct inode *inode, unsigned delta)
     return ret;
 }

-/*
- * We don't use i_wb_list though, bdi flusher checks this via
- * wb_has_dirty_io(). So if inode become clean, we remove inode from
- * it.
- */
-static inline void tux3_inode_wb_lock(struct inode *inode)
-{
-#ifdef __KERNEL__
-    struct backing_dev_info *bdi = inode->i_sb->s_bdi;
-    spin_lock(&bdi->wb.list_lock);
-#endif
-}
-
-static inline void tux3_inode_wb_unlock(struct inode *inode)
-{
-#ifdef __KERNEL__
-    struct backing_dev_info *bdi = inode->i_sb->s_bdi;
-    spin_unlock(&bdi->wb.list_lock);
-#endif
-}
-
-static inline void tux3_inode_wb_list_del(struct inode *inode)
-{
-#ifdef __KERNEL__
-    list_del_init(&inode->i_wb_list);
-#endif
-}
-
-/*
- * __mark_inode_dirty() doesn't know about delta boundary (we don't
- * clear I_DIRTY before flush, in order to prevent the inode to be
- * freed). So, if inode was re-dirtied for frontend delta while
- * flushing old delta, ->dirtied_when may not be updated by
- * __mark_inode_dirty() forever.
- *
- * Although we don't use ->dirtied_when, bdi flusher uses
- * ->dirtied_when to decide flush timing, so we have to update
- * ->dirtied_when ourself.
- */
-static void tux3_inode_wb_update_dirtied_when(struct inode *inode)
-{
-#ifdef __KERNEL__
-    /* Take lock only if we have to update. */
-    struct backing_dev_info *bdi = inode->i_sb->s_bdi;
-    tux3_inode_wb_lock(inode);
-    inode->dirtied_when = jiffies;
-    list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
-    tux3_inode_wb_unlock(inode);
-#endif
-}
-
 /* This is hook of __mark_inode_dirty() and called I_DIRTY_PAGES too */
 void tux3_dirty_inode(struct inode *inode, int flags)
 {
@@ -220,11 +169,19 @@ void tux3_dirty_inode(struct inode *inode, int flags)
     spin_unlock(&tuxnode->lock);

     /*
-     * Update ->i_wb_list and ->dirtied_when if need. See comment
-     * of tux3_inode_wb_update_dirtied_when().
+     * Update ->i_wb_list and ->dirtied_when if needed.
+     * __mark_inode_dirty() doesn't know about delta boundary (we don't
+     * clear I_DIRTY before flush, in order to prevent the inode to be
+     * freed). So, if inode was re-dirtied for frontend delta while
+     * flushing old delta, ->dirtied_when may not be updated by
+     * __mark_inode_dirty() forever.
+     *
+     * Although we don't use ->dirtied_when, bdi flusher uses
+     * ->dirtied_when to decide flush timing, so we have to update
+     * ->dirtied_when ourself.
      */
     if (re_dirtied)
-        tux3_inode_wb_update_dirtied_when(inode);
+        inode_writeback_touch(inode);
 }

 /*
@@ -289,23 +246,20 @@ static void tux3_clear_dirty_inode_nolock(struct inode *inode, unsigned delta,
     }

     /* Update state if inode isn't dirty anymore */
-    if (!(tuxnode->flags & ~NON_DIRTY_FLAGS)) {
+    if (!(tuxnode->flags & ~NON_DIRTY_FLAGS))
         inode->i_state &= ~I_DIRTY;
-        tux3_inode_wb_list_del(inode);
-    }
 }

 /* Clear dirty flags for delta */
 static void __tux3_clear_dirty_inode(struct inode *inode, unsigned delta)
 {
     struct tux3_inode *tuxnode = tux_inode(inode);
-    tux3_inode_wb_lock(inode);
     spin_lock(&inode->i_lock);
     spin_lock(&tuxnode->lock);
     tux3_clear_dirty_inode_nolock(inode, delta, 0);
     spin_unlock(&tuxnode->lock);
     spin_unlock(&inode->i_lock);
-    tux3_inode_wb_unlock(inode);
+    inode_writeback_done(inode);
 }

 /*
@@ -315,14 +269,13 @@ static void __tux3_clear_dirty_inode(struct inode *inode, unsigned delta)
 void tux3_clear_dirty_inode(struct inode *inode)
 {
     struct tux3_inode *tuxnode = tux_inode(inode);
-    tux3_inode_wb_lock(inode);
     spin_lock(&inode->i_lock);
     spin_lock(&tuxnode->lock);
     tux3_iattr_clear_dirty(tuxnode);
     tux3_clear_dirty_inode_nolock(inode, tux3_inode_delta(inode), 1);
     spin_unlock(&tuxnode->lock);
     spin_unlock(&inode->i_lock);
-    tux3_inode_wb_unlock(inode);
+    inode_writeback_done(inode);
 }

 void __tux3_mark_inode_dirty(struct inode *inode, int flags)




^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-01 21:41 [RFC][PATCH 1/2] Add a super operation for writeback Daniel Phillips
  2014-06-01 21:42 ` [RFC][PATCH 2/2] tux3: Use writeback hook to remove duplicated core code Daniel Phillips
@ 2014-06-02  3:15 ` Dave Chinner
  2014-06-02 20:02   ` Daniel Phillips
  2014-06-02  8:30 ` Christian Stroetmann
  2 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2014-06-02  3:15 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, Linus Torvalds, Andrew Morton,
	OGAWA Hirofumi

On Sun, Jun 01, 2014 at 02:41:02PM -0700, Daniel Phillips wrote:
> ---
> From: Daniel Phillips <daniel@tux3.org>
> Subject: [PATCH] Add a super operation for writeback
> 
> Add a "writeback" super operation to be called in the
> form:
> 
>         progress = sb->s_op->writeback(sb, &wbc, &pages);
> 
> The filesystem is expected to flush some inodes to disk
> and return progress of at least 1, or if no inodes are
> flushed, return progress of zero. The filesystem should
> try to flush at least the number of pages specified in
> *pages, or if that is not possible, return approximately
> the number of pages not flushed into *pages.
> 
> Within the ->writeback callback, the filesystem should
> call inode_writeback_done(inode) for each inode flushed
> (and therefore set clean) or inode_writeback_touch(inode)
> for any inode that will be retained dirty in cache.
> 
> Signed-off-by: Daniel Phillips  <daniel@tux3.org>
> Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
> ---
> 
>  fs/fs-writeback.c  |   59 +++++++++++++++++++++++++++++++++++++++++++++++++---
>  include/linux/fs.h |    4 +++
>  2 files changed, 60 insertions(+), 3 deletions(-)
> 
> diff -puN fs/fs-writeback.c~core-writeback fs/fs-writeback.c
> --- linux-tux3/fs/fs-writeback.c~core-writeback	2014-05-31 06:43:19.416031712 +0900
> +++ linux-tux3-hirofumi/fs/fs-writeback.c	2014-05-31 06:44:46.087904373 +0900
> @@ -192,6 +192,35 @@ void inode_wb_list_del(struct inode *ino
>  }
> 
>  /*
> + * Remove inode from writeback list if clean.
> + */
> +void inode_writeback_done(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode_to_bdi(inode);
> +
> +	spin_lock(&bdi->wb.list_lock);
> +	spin_lock(&inode->i_lock);
> +	if (!(inode->i_state & I_DIRTY))
> +		list_del_init(&inode->i_wb_list);
> +	spin_unlock(&inode->i_lock);
> +	spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_done);
> +
> +/*
> + * Add inode to writeback dirty list with current time.
> + */
> +void inode_writeback_touch(struct inode *inode)
> +{
> +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> +	spin_lock(&bdi->wb.list_lock);
> +	inode->dirtied_when = jiffies;
> +	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> +	spin_unlock(&bdi->wb.list_lock);
> +}
> +EXPORT_SYMBOL_GPL(inode_writeback_touch);

You should be able to use redirty_tail() for this....

Hmmmm - this is using the wb dirty lists and locks, but you
don't pass the wb structure to the writeback callout? That seem
wrong to me - why would you bother manipulating these lists if you
aren't using them to track dirty inodes in the first place?

> +
> +/*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
>   *
> @@ -593,9 +622,9 @@ static long writeback_chunk_size(struct
>   *
>   * Return the number of pages and/or inodes written.
>   */
> -static long writeback_sb_inodes(struct super_block *sb,
> -				struct bdi_writeback *wb,
> -				struct wb_writeback_work *work)
> +static long __writeback_sb_inodes(struct super_block *sb,
> +				  struct bdi_writeback *wb,
> +				  struct wb_writeback_work *work)
>  {
>  	struct writeback_control wbc = {
>  		.sync_mode		= work->sync_mode,
> @@ -710,6 +739,30 @@ static long writeback_sb_inodes(struct s
>  	return wrote;
>  }
> 
> +static long writeback_sb_inodes(struct super_block *sb,
> +				struct bdi_writeback *wb,
> +				struct wb_writeback_work *work)
> +{
> +	if (sb->s_op->writeback) {
> +		struct writeback_control wbc = {
> +			.sync_mode		= work->sync_mode,
> +			.tagged_writepages	= work->tagged_writepages,
> +			.for_kupdate		= work->for_kupdate,
> +			.for_background		= work->for_background,
> +			.for_sync		= work->for_sync,
> +			.range_cyclic		= work->range_cyclic,
> +		};
> +		long ret;
> +
> +		spin_unlock(&wb->list_lock);
> +		ret = sb->s_op->writeback(sb, &wbc, &work->nr_pages);
> +		spin_lock(&wb->list_lock);
> +		return ret;
> +	}
> +
> +	return __writeback_sb_inodes(sb, wb, work);
> +}

The first thing that __writeback_sb_inodes() does is create a struct
writeback_control from the wb_writeback_work. That should be done
here and passed to __writeback_sb_inodes(), which should be renamed
"generic_writeback_sb_inodes()".  Also, all the fields in the wbc
need to be initialised correctly (i.e including range_start/end).

Further, a writeback implementation will need to use the generic bdi
list and lock structures and so we need to pass the bdi_writeback.
Similarly, if we are going to pass nr_pages, we might as well pass
the entire work structure. 

Finally, I don't like the way the wb->list_lock is treated
differently by this code. I suspect that we need to rationalise the
layering of the wb->list_lock as it is currently not clear what it
protects and what (nested) layers of the writeback code actually
require it.

What I'd like to see is this work:

struct super_ops ... = {
....
	.writeback = generic_writeback_sb_inodes,
....

And that means writeback_sb_inodes() would become:

static long writeback_sb_inodes(struct super_block *sb,
				struct bdi_writeback *wb,
				struct wb_writeback_work *work)
{
	struct writeback_control wbc = {
		.sync_mode		= work->sync_mode,
		.tagged_writepages	= work->tagged_writepages,
		.for_kupdate		= work->for_kupdate,
		.for_background		= work->for_background,
		.for_sync		= work->for_sync,
		.range_cyclic		= work->range_cyclic,
		.range_start		= 0,
		.range_end		= LLONG_MAX,
	};

	if (sb->s_op->writeback)
		return sb->s_op->writeback(sb, wb, work, &wbc);

	return generic_writeback_sb_inodes(sb, wb, work, &wbc);
}

And the higher/lower layers deal with wb->list_lock appropriately.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 2/2] tux3: Use writeback hook to remove duplicated core code
  2014-06-01 21:42 ` [RFC][PATCH 2/2] tux3: Use writeback hook to remove duplicated core code Daniel Phillips
@ 2014-06-02  3:30   ` Dave Chinner
  2014-06-02 20:07     ` Daniel Phillips
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2014-06-02  3:30 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, Linus Torvalds, Andrew Morton,
	OGAWA Hirofumi

On Sun, Jun 01, 2014 at 02:42:48PM -0700, Daniel Phillips wrote:
> Instead of re-implementing part of fs/fs-writeback.c, use a proposed
> net ->writeback super operation to drive delta writeback. For each
> inode that is cleaned, call inode_writeback_done(inode). For each
> inode that will be kept dirty in cache, call inode_writeback_touch
> so that the inode appears young to fs-writeback and does not trigger
> repeated ->writeback flushes.
> 
> Signed-off-by: Daniel Phillips <daniel@tux3.org>

I have not looked at the sanity of the tux3 writeback algorithm, so
I'm not commenting on whether it works or not. However, this caught
my eye:

>  static void __tux3_clear_dirty_inode(struct inode *inode, unsigned delta)
>  {
>      struct tux3_inode *tuxnode = tux_inode(inode);
> -    tux3_inode_wb_lock(inode);
>      spin_lock(&inode->i_lock);
>      spin_lock(&tuxnode->lock);
>      tux3_clear_dirty_inode_nolock(inode, delta, 0);
>      spin_unlock(&tuxnode->lock);
>      spin_unlock(&inode->i_lock);
> -    tux3_inode_wb_unlock(inode);
> +    inode_writeback_done(inode);
>  }

I get very worried whenever I see locks inside inode->i_lock. In
general, i_lock is supposed to be the innermost lock that is taken,
and there are very few exceptions to that - the inode LRU list is
one of the few.

I don't know what the tuxnode->lock is, but I found this:

 *     inode->i_lock
 *         tuxnode->lock (to protect tuxnode data)
 *             tuxnode->dirty_inodes_lock (for i_ddc->dirty_inodes,
 *                                         Note: timestamp can be updated
 *                                         outside inode->i_mutex)

and this:

 *     inode->i_lock
 *         tuxnode->lock
 *         sb->dirty_inodes_lock

Which indicates that you take a filesystem global lock a couple of
layers underneath the VFS per-inode i_lock. I'd suggest you want to
separate the use of the vfs inode ilock from the locking heirarchy
of the tux3 inode....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-01 21:41 [RFC][PATCH 1/2] Add a super operation for writeback Daniel Phillips
  2014-06-01 21:42 ` [RFC][PATCH 2/2] tux3: Use writeback hook to remove duplicated core code Daniel Phillips
  2014-06-02  3:15 ` [RFC][PATCH 1/2] Add a super operation for writeback Dave Chinner
@ 2014-06-02  8:30 ` Christian Stroetmann
  2014-06-03  3:39   ` Dave Chinner
  2 siblings, 1 reply; 24+ messages in thread
From: Christian Stroetmann @ 2014-06-02  8:30 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Linux Kernel, Linux FS Devel, Linus Torvalds, Andrew Morton,
	OGAWA Hirofumi

On the 1st of June 2014 23:41, Daniel Phillips wrote:
> Hi,
>
> This is the first of four core changes we would like for Tux3. We
> start with a hard one and suggest a simple solution.
>
> The first patch in this series adds a new super operation to write
> back multiple inodes in a single call. The second patch applies to
> our linux-tux3 repository at git.kernel.org to demonstrate how this
> interface is used, and removes about 450 lines of workaround code.
>
> Traditionally, core kernel tells each mounted filesystems which
> dirty pages of which inodes should be flushed to disk, but
> unfortunately, is blissfully ignorant of filesystem-specific
> ordering constraints. This scheme was really invented for Ext2 and
> has not evolved much recently. Tux3, with its strong ordering and
> optimized delta commit, cannot tolerate random flushing and
> therefore takes responsibility for flush ordering itself. On the
> other hand, Tux3 has no good way to know when is the right time to
> flush, but core is good at that. This proposed API harmonizes those
> two competencies so that Tux3 and core each take care of what they
> are good at, and not more.
>
> The API extension consists of a new writeback hook and two helpers
> to be uses within the hook. The hook sits about halfway down the
> fs-writeback stack, just after core has determined that some dirty
> inodes should be flushed to disk and just before it starts thinking
> about which inodes those should be. At that point, core calls Tux3
> instead of continuing on down the usual do_writepages path. Tux3
> responds by staging a new delta commit, using the new helpers to
> tell core which inodes were flushed versus being left dirty in
> cache. This is pretty much the same behavior as the traditional
> writeout path, but less convoluted, probably more efficient, and
> certainly easier to analyze.
>
> The new writeback hook looks like:
>
>      progress = sb->s_op->writeback(sb,&writeback_control,&nr_pages);
>
> This should be self-explanatory: nr_pages and progress have the
> semantics of existing usage in fs-writeback; writeback_control is
> ignored by Tux3, but that is only because Tux3 always flushes
> everything and does not require hints for now. We can safely assume
> that&wbc or equivalent is wanted here. An obvious wart is the
> overlap between "progress" and "nr_pages", but fs-writeback thinks
> that way, so it would not make much sense to improve one without
> improving the other.
>
> Tux3 necessarily keeps its own dirty inode list, which is an area
> of overlap with fs-writeback. In a perfect world, there would be
> just one dirty inode list per superblock, on which both fs-writeback
> and Tux3 would operate. That would be a deeper core change than
> seems appropriate right now.
>
> Potential races are a big issue with this API, which is no surprise.
> The fs-writeback scheme requires keeping several kinds of object in
> sync: tux3 dirty inode lists, fs-writeback dirty inode lists and
> inode dirty state. The new helpers inode_writeback_done(inode) and
> inode_writeback_touch(inode) take care of that while hiding
> internal details of the fs-writeback implementation.
>
> Tux3 calls inode_writeback_done when it has flushed an inode and
> marked it clean, or calls inode_writeback_touch if it intends to
> retain a dirty inode in cache. These have simple implementations.
> The former just removes a clean inode from any fs-writeback list.
> The latter updates the inode's dirty timestamp so that fs-writeback
> does not keep trying flush it. Both these things could be done more
> efficiently by re-engineering fs-writeback, but we prefer to work
> with the existing scheme for now.
>
> Hirofumi's masterful hack nicely avoided racy removal of inodes from
> the writeback list by taking an internal fs-writeback lock inside
> filesystem code. The new helper requires dropping i_lock inside
> filesystem code and retaking it in the helper, so inode redirty can
> race with writeback list removal. This does not seem to present a
> problem because filesystem code is able to enforce strict
> alternation of cleaning and calling the helper. As an offsetting
> advantage, writeback lock contention is reduced.
>
> Compared to Hirofumi's hack, the cost of this interface is one
> additional spinlock per inode_writeback_done,  which is
> insignificant compared to the convoluted code path that is avoided.
> Regards,
>
> Daniel
When I followed the advice of Dave Chinner:
"We're not going to merge that page forking stuff (like you were told at 
LSF 2013 more than a year ago: http://lwn.net/Articles/548091/) without 
rigorous design review and a demonstration of the solutions to all the 
hard corner cases it has"
given in his e-mail related with the presentation of the latest version 
of the Tux3 file system (see [1]) and read the linked article, I found 
in the second comments:
"Parts of this almost sound like it either a.) overlaps with or b.) 
would benefit greatly from something similar to Featherstitch [[2]]."

Could it be that we have with Featherstitch a general solution already 
that is said to be even "file system agnostic"?
Honestly, I thought that something like this would make its way into the 
Linux code base.



Have fun
Christian

[1] Dave Chinner Re: [RFC] Tux3 for review 
https://lkml.org/lkml/2014/5/18/158
[2] Featherstitch http://featherstitch.cs.ucla.edu/

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-02  3:15 ` [RFC][PATCH 1/2] Add a super operation for writeback Dave Chinner
@ 2014-06-02 20:02   ` Daniel Phillips
  2014-06-03  3:33     ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Daniel Phillips @ 2014-06-02 20:02 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, Linus Torvalds, Andrew Morton,
	OGAWA Hirofumi


On 06/01/2014 08:15 PM, Dave Chinner wrote:
> On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:
>> +
>> +/*
>> + * Add inode to writeback dirty list with current time.
>> + */
>> +void inode_writeback_touch(struct inode *inode)
>> +{
>> +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
>> +	spin_lock(&bdi->wb.list_lock);
>> +	inode->dirtied_when = jiffies;
>> +	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
>> +	spin_unlock(&bdi->wb.list_lock);
>> +}
>> +EXPORT_SYMBOL_GPL(inode_writeback_touch);
> You should be able to use redirty_tail() for this....

Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is not correct because the inode
needs to end up on the dirty list whether it was already there or not. This requirement is analogous
to __mark_inode_dirty() and must tolerate similar races. At the microoptimization level, calling
redirty_tail from inode_writeback_touch would be less efficient and more bulky. Another small issue
is, redirty_tail does not always update the timestamp, which could trigger some bogus writeback events.

> Hmmmm - this is using the wb dirty lists and locks, but you
> don't pass the wb structure to the writeback callout? That seem
> wrong to me - why would you bother manipulating these lists if you
> aren't using them to track dirty inodes in the first place?

>From Tux3's point of view, the wb lists just allow fs-writeback to determine when to call
->writeback(). We agree that inode lists are a suboptimal mechanism, but that is how fs-writeback
currently thinks. It would be better if our inodes never go onto wb lists in the first place,
provided that fs-writeback can still drive writeback appropriately.

Perhaps fs-writeback should have an option to work without inode lists at all, and just maintain
writeback scheduling statistics in the superblock or similar. That would be a big change, more on
the experimental side. We would be happy to take it on after merge, hopefully for the benefit of
other filesystems besides Tux3.

What we pass to ->writeback() is just a matter of taste at the moment because we currently ignore
everything except &work->nr_pages. Anything sane is fine. Note that bdi_writeback is already
available from bdi->wb, the "default writeback info", whatever that means. A quick tour of existing
usage suggests that you can reliably obtain the wb that way, but a complete audit would be needed.

Most probably, this API will evolve as new users arrive, and also as our Tux3 usage becomes more
sophisticated. For now, Tux3 just flushes everything without asking questions. Exactly how that
might change in the future is hard to predict. You are in a better position to know what XFS would
require in order to use this interface.

> The first thing that __writeback_sb_inodes() does is create a struct
> writeback_control from the wb_writeback_work. That should be done
> here and passed to __writeback_sb_inodes(), which should be renamed
> "generic_writeback_sb_inodes()".  Also, all the fields in the wbc
> need to be initialised correctly (i.e including range_start/end).

Good point. I will try that generic_writeback_sb_inode concept for the next iteration, which will
need a day or two including regression testing.

> Further, a writeback implementation will need to use the generic bdi
> list and lock structures and so we need to pass the bdi_writeback.
> Similarly, if we are going to pass nr_pages, we might as well pass
> the entire work structure. 
>
> Finally, I don't like the way the wb->list_lock is treated
> differently by this code. I suspect that we need to rationalise the
> layering of the wb->list_lock as it is currently not clear what it
> protects and what (nested) layers of the writeback code actually
> require it.

One design goal of this proposed writeback interface is to hide the wb list lock entirely from the
filesystem so core writeback can evolve more easily. This is not cast in stone, but it smells like
decent factoring. We can save some spinlocks by violating that factoring (as Hirofumi's original
hack did) at the cost of holding a potentially busy wb lock longer, which does not seem like a good
trade.

I agree that writeback list locking is murky, and fs-writeback is murky in general. We would like
Tux3 to be part of the solution, not the problem.

> What I'd like to see is this work:
>
> struct super_ops ... = {
> ....
> 	.writeback = generic_writeback_sb_inodes,
> ....
>
> And that means writeback_sb_inodes() would become:
>
> static long writeback_sb_inodes(struct super_block *sb,
> 				struct bdi_writeback *wb,
> 				struct wb_writeback_work *work)
> {
> 	struct writeback_control wbc = {
> 		.sync_mode		= work->sync_mode,
> 		.tagged_writepages	= work->tagged_writepages,
> 		.for_kupdate		= work->for_kupdate,
> 		.for_background		= work->for_background,
> 		.for_sync		= work->for_sync,
> 		.range_cyclic		= work->range_cyclic,
> 		.range_start		= 0,
> 		.range_end		= LLONG_MAX,
> 	};
>
> 	if (sb->s_op->writeback)
> 		return sb->s_op->writeback(sb, wb, work, &wbc);
>
> 	return generic_writeback_sb_inodes(sb, wb, work, &wbc);
> }
>
> And the higher/lower layers deal with wb->list_lock appropriately.
>

Looks good to me. As noted above, I am not sure that *work is actually required but it does no harm,
so...

Regards,

Daniel


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 2/2] tux3: Use writeback hook to remove duplicated core code
  2014-06-02  3:30   ` Dave Chinner
@ 2014-06-02 20:07     ` Daniel Phillips
  0 siblings, 0 replies; 24+ messages in thread
From: Daniel Phillips @ 2014-06-02 20:07 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, Linus Torvalds, Andrew Morton,
	OGAWA Hirofumi


On 06/01/2014 08:30 PM, Dave Chinner wrote:
> I get very worried whenever I see locks inside inode->i_lock. In
> general, i_lock is supposed to be the innermost lock that is taken,
> and there are very few exceptions to that - the inode LRU list is
> one of the few.

I generally trust Hirofumi to ensure that our locking is sane, but please point out any specific
issue. We are well aware of the need to get out of our critical section fast, as is apparent in
tux3_clear_dirty_inode_nolock. Hogging our own i_locks would mainly hurt our own benchmarks.

For what it is worth, the proposed writeback API improves our SMP situation with respect to other
filesystems by moving tux3_clear_dirty_inode_nolock outside the wb list lock.

> I don't know what the tuxnode->lock is, but I found this:
>
>  *     inode->i_lock
>  *         tuxnode->lock (to protect tuxnode data)
>  *             tuxnode->dirty_inodes_lock (for i_ddc->dirty_inodes,
>  *                                         Note: timestamp can be updated
>  *                                         outside inode->i_mutex)
>
> and this:
>
>  *     inode->i_lock
>  *         tuxnode->lock
>  *         sb->dirty_inodes_lock
>
> Which indicates that you take a filesystem global lock a couple of
> layers underneath the VFS per-inode i_lock. I'd suggest you want to
> separate the use of the vfs inode ilock from the locking heirarchy
> of the tux3 inode....
>

Our nested locks synchronize VFS state with Tux3 state, which is not optional. Alternatively, we
could rely on i_lock alone, which would increase contention.

The sb->dirty_inodes_lock is held briefly, which you can see in tux3_dirty_inode and
tux3_clear_dirty_inode_nolock. If it shows up in a profile we could break it up.

Regards,

Daniel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-02 20:02   ` Daniel Phillips
@ 2014-06-03  3:33     ` Dave Chinner
  2014-06-03  7:01       ` Daniel Phillips
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2014-06-03  3:33 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, Linus Torvalds, Andrew Morton,
	OGAWA Hirofumi


[ please line wrap at something sane like 68 columns ]

On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> On 06/01/2014 08:15 PM, Dave Chinner wrote:
> > On Sun, Jun 01, 2014 at 02:41:02PM -0700, I wrote:
> >> +
> >> +/*
> >> + * Add inode to writeback dirty list with current time.
> >> + */
> >> +void inode_writeback_touch(struct inode *inode)
> >> +{
> >> +	struct backing_dev_info *bdi = inode->i_sb->s_bdi;
> >> +	spin_lock(&bdi->wb.list_lock);
> >> +	inode->dirtied_when = jiffies;
> >> +	list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
> >> +	spin_unlock(&bdi->wb.list_lock);
> >> +}
> >> +EXPORT_SYMBOL_GPL(inode_writeback_touch);
> > You should be able to use redirty_tail() for this....
> 
> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
> not correct because the inode needs to end up on the dirty list
> whether it was already there or not.

redirty_tail() always moves the inode to the end of the dirty
list.

> This requirement is analogous to __mark_inode_dirty() and must
> tolerate similar races. At the microoptimization level, calling
> redirty_tail from inode_writeback_touch would be less efficient
> and more bulky. Another small issue is, redirty_tail does not
> always update the timestamp, which could trigger some bogus
> writeback events.

redirty_tail does not update the timestamp when it doesn't need to
change. If it needs to be changed because the current value would
violate the time ordering requirements of the list, it rewrites it.

So there is essentially no functional difference between the new
function and redirty_tail....

> > Hmmmm - this is using the wb dirty lists and locks, but you
> > don't pass the wb structure to the writeback callout? That seem
> > wrong to me - why would you bother manipulating these lists if
> > you aren't using them to track dirty inodes in the first place?
> 
> From Tux3's point of view, the wb lists just allow fs-writeback to
> determine when to call ->writeback(). We agree that inode lists
> are a suboptimal mechanism, but that is how fs-writeback currently
> thinks. It would be better if our inodes never go onto wb lists in
> the first place, provided that fs-writeback can still drive
> writeback appropriately.

It can't, and definitely not with the callout you added.

However, we already avoid the VFS writeback lists for certain
filesystems for pure metadata. e.g. XFS does not use the VFS dirty
inode lists for inode metadata changes.  They get tracked internally
by the transaction subsystem which does it's own writeback according
to the requirements of journal space availability.

This is done simply by not calling mark_inode_dirty() on any
metadata only change. If we want to do the same for data, then we'd
simply not call mark_inode_dirty() in the data IO path. That
requires a custom ->set_page_dirty method to be provided by the
filesystem that didn't call

	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

and instead did it's own thing.

So the per-superblock dirty tracking is something we can do right
now, and some filesystems do it for metadata. The missing piece for
data is writeback infrastructure capable of deferring to superblocks
for writeback rather than BDIs....

> Perhaps fs-writeback should have an option to work without inode
> lists at all, and just maintain writeback scheduling statistics in
> the superblock or similar. That would be a big change, more on the
> experimental side. We would be happy to take it on after merge,
> hopefully for the benefit of other filesystems besides Tux3.

Well, I don't see it that way. If you have a requirement to be able
to track dirty inodes internally, then lets move to that implement
that infrastructure rather than hacking around with callouts that
only have a limited shelf-life.

> What we pass to ->writeback() is just a matter of taste at the
> moment because we currently ignore everything except
> &work->nr_pages. Anything sane is fine. Note that bdi_writeback is
> already available from bdi->wb, the "default writeback info",
> whatever that means. A quick tour of existing usage suggests that
> you can reliably obtain the wb that way, but a complete audit
> would be needed.
> 
> Most probably, this API will evolve as new users arrive, and also
> as our Tux3 usage becomes more sophisticated. For now, Tux3 just
> flushes everything without asking questions. Exactly how that
> might change in the future is hard to predict. You are in a better
> position to know what XFS would require in order to use this
> interface.

XFS already has everything it needs internally to track dirty
inodes. In fact, we used to do data writeback from within XFS and we
slowly removed it as the generic writeback code was improved made
the XFS code redundant.

That said, parallelising writeback so we can support hundreds of
GB/s of delayed allocation based writeback is something we
eventually need to do, and that pretty much means we need to bring
dirty data inode tracking back into XFS.

So what we really need is writeback infrastructure that is:

	a) independent of the dirty inode lists
	b) implements background, periodic and immediate writeback
	c) can be driven by BDI or superblock

IOWs, the abstraction we need for this writeback callout is not at
the writeback_sb_inodes() layer, it's above the dirty inode list
queuing layer. IOWs, the callout needs to be closer to the
wb_do_writeback() layer which drives all the writeback work
including periodic and background writeback, and the interactions
between foreground and background work that wb_writeback() uses
needs to be turned into a bunch of helpers...

> > Finally, I don't like the way the wb->list_lock is treated
> > differently by this code. I suspect that we need to rationalise
> > the layering of the wb->list_lock as it is currently not clear
> > what it protects and what (nested) layers of the writeback code
> > actually require it.
> 
> One design goal of this proposed writeback interface is to hide
> the wb list lock entirely from the filesystem so core writeback
> can evolve more easily.

Yes, I realise that. Hence my point that the bleeding of it across
layers of writeback infrstructure is sub-optimal.

> This is not cast in stone, but it smells
> like decent factoring. We can save some spinlocks by violating
> that factoring (as Hirofumi's original hack did) at the cost of
> holding a potentially busy wb lock longer, which does not seem
> like a good trade.

If we abstract at a higher layer, the wb lock protects just the work
queuing and dispatch, and everything else can be done by the
superblock callout. If the superblock callout is missing, then
we simply fall down to the existing wb_writeback() code that retakes
the lock and does it's work....

Let's get the layering and abstraction in the right place the first
time, rather having to revist this in the not-to-distant future.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-02  8:30 ` Christian Stroetmann
@ 2014-06-03  3:39   ` Dave Chinner
  2014-06-03  5:30     ` Christian Stroetmann
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2014-06-03  3:39 UTC (permalink / raw)
  To: Christian Stroetmann
  Cc: Daniel Phillips, Linux Kernel, Linux FS Devel, Linus Torvalds,
	Andrew Morton, OGAWA Hirofumi

On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote:
> When I followed the advice of Dave Chinner:
> "We're not going to merge that page forking stuff (like you were
> told at LSF 2013 more than a year ago:
> http://lwn.net/Articles/548091/) without rigorous design review and
> a demonstration of the solutions to all the hard corner cases it
> has"
> given in his e-mail related with the presentation of the latest
> version of the Tux3 file system (see [1]) and read the linked
> article, I found in the second comments:
> "Parts of this almost sound like it either a.) overlaps with or b.)
> would benefit greatly from something similar to Featherstitch
> [[2]]."
> 
> Could it be that we have with Featherstitch a general solution
> already that is said to be even "file system agnostic"?
> Honestly, I thought that something like this would make its way into
> the Linux code base.

Here's what I said about the last proposal (a few months ago) for
integrating featherstitch into the kernel:

http://www.spinics.net/lists/linux-fsdevel/msg72799.html

It's not a viable solution.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03  3:39   ` Dave Chinner
@ 2014-06-03  5:30     ` Christian Stroetmann
  2014-06-03 14:57       ` Theodore Ts'o
  0 siblings, 1 reply; 24+ messages in thread
From: Christian Stroetmann @ 2014-06-03  5:30 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Daniel Phillips, Linux Kernel, Linux FS Devel, Linus Torvalds,
	Andrew Morton, OGAWA Hirofumi

On the 3rd of June 2014 05:39, Dave Chinner wrote:
> On Mon, Jun 02, 2014 at 10:30:07AM +0200, Christian Stroetmann wrote:
>> When I followed the advice of Dave Chinner:
>> "We're not going to merge that page forking stuff (like you were
>> told at LSF 2013 more than a year ago:
>> http://lwn.net/Articles/548091/) without rigorous design review and
>> a demonstration of the solutions to all the hard corner cases it
>> has"
>> given in his e-mail related with the presentation of the latest
>> version of the Tux3 file system (see [1]) and read the linked
>> article, I found in the second comments:
>> "Parts of this almost sound like it either a.) overlaps with or b.)
>> would benefit greatly from something similar to Featherstitch
>> [[2]]."
>>
>> Could it be that we have with Featherstitch a general solution
>> already that is said to be even "file system agnostic"?
>> Honestly, I thought that something like this would make its way into
>> the Linux code base.
> Here's what I said about the last proposal (a few months ago) for
> integrating featherstitch into the kernel:
>
> http://www.spinics.net/lists/linux-fsdevel/msg72799.html
>
> It's not a viable solution.
>
> Cheers,
>
> Dave.

How annoying, I did not remember your e-mail of the referenced thread 
"[Lsf-pc] [LSF/MM TOPIC] atomic block device" despite I saved it on 
local disk. Thanks a lot for the reminder.

I also directly saw the problem with the research prototype 
Featherstitch, specifically the point "All the filesystem modules it has 
are built into the featherstitch kernel module, and called through a VFS 
shim layer". But it is just a prototype and its concept of abstraction 
has not to be copied 1:1 into the Linux code base.

In general, I do not believe that the complexity problems of soft 
updates, atomic writes, and related techniques can be solved by 
hand/manually. So my suggestion is to automatically handle the 
complexity problem of e.g. dependancies in a way that is comparable to 
a(n on-the-fly) file-system compiler so to say that works on a very 
large dependancy graph (having several billions of graph vertices 
actually). And at this point an abstraction like it is given with 
Featherstitch helps to feed and control this special FS compiler.

Actually, I have to follow the discussion further on the one hand and go 
deeper into the highly complex problem space on the other hand.



With all the best
Christian

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03  3:33     ` Dave Chinner
@ 2014-06-03  7:01       ` Daniel Phillips
  2014-06-03  7:26         ` Daniel Phillips
                           ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Daniel Phillips @ 2014-06-03  7:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, Linus Torvalds, Andrew Morton,
	OGAWA Hirofumi

Hi Dave,
On 06/02/2014 08:33 PM, Dave Chinner wrote:
> On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
>>
>> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
>> not correct because the inode needs to end up on the dirty list
>> whether it was already there or not.
> redirty_tail() always moves the inode to the end of the dirty
> list.
>
>> This requirement is analogous to __mark_inode_dirty() and must
>> tolerate similar races. At the microoptimization level, calling
>> redirty_tail from inode_writeback_touch would be less efficient
>> and more bulky. Another small issue is, redirty_tail does not
>> always update the timestamp, which could trigger some bogus
>> writeback events.
> redirty_tail does not update the timestamp when it doesn't need to
> change. If it needs to be changed because the current value would
> violate the time ordering requirements of the list, it rewrites it.
>
> So there is essentially no functional difference between the new
> function and redirty_tail....

Hirofumi, would you care to comment?

>
>>> Hmmmm - this is using the wb dirty lists and locks, but you
>>> don't pass the wb structure to the writeback callout? That seem
>>> wrong to me - why would you bother manipulating these lists if
>>> you aren't using them to track dirty inodes in the first place?
>> From Tux3's point of view, the wb lists just allow fs-writeback to
>> determine when to call ->writeback(). We agree that inode lists
>> are a suboptimal mechanism, but that is how fs-writeback currently
>> thinks. It would be better if our inodes never go onto wb lists in
>> the first place, provided that fs-writeback can still drive
>> writeback appropriately.
> It can't, and definitely not with the callout you added.

Obviously, fs-writeback does not get to choose inodes or specific pages
with our proposal (which is the whole point) but it still decides when
to call Tux3 vs some other filesystem on the same device, and is still
able to indicate how much data it thinks should be written out. Even
though we ignore the latter suggestion and always flush everything, we
appreciate the timing of those callbacks very much, because they give
us exactly the pressure sensitive batching behavior we want.

> However, we already avoid the VFS writeback lists for certain
> filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> inode lists for inode metadata changes.  They get tracked internally
> by the transaction subsystem which does it's own writeback according
> to the requirements of journal space availability.
>
> This is done simply by not calling mark_inode_dirty() on any
> metadata only change. If we want to do the same for data, then we'd
> simply not call mark_inode_dirty() in the data IO path. That
> requires a custom ->set_page_dirty method to be provided by the
> filesystem that didn't call
>
> 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> and instead did it's own thing.
>
> So the per-superblock dirty tracking is something we can do right
> now, and some filesystems do it for metadata. The missing piece for
> data is writeback infrastructure capable of deferring to superblocks
> for writeback rather than BDIs....

We agree that fs-writeback inode lists are broken for anything
more sophisticated than Ext2. An advantage of the patch under
consideration is that it still lets fs-writeback mostly work the
way it has worked for the last few years, except for not allowing it
to pick specific inodes and data pages for writeout. As far as I
can see, it still balances writeout between different filesystems
on the same block device pretty well.

>> Perhaps fs-writeback should have an option to work without inode
>> lists at all, and just maintain writeback scheduling statistics in
>> the superblock or similar. That would be a big change, more on the
>> experimental side. We would be happy to take it on after merge,
>> hopefully for the benefit of other filesystems besides Tux3.
> Well, I don't see it that way. If you have a requirement to be able
> to track dirty inodes internally, then lets move to that implement
> that infrastructure rather than hacking around with callouts that
> only have a limited shelf-life.

What you call shelf-life, I call iteration. Small changes are beautiful.
Apologies for the rhetoric, content is below.

> XFS already has everything it needs internally to track dirty
> inodes. In fact, we used to do data writeback from within XFS and we
> slowly removed it as the generic writeback code was improved made
> the XFS code redundant.
>
> That said, parallelising writeback so we can support hundreds of
> GB/s of delayed allocation based writeback is something we
> eventually need to do, and that pretty much means we need to bring
> dirty data inode tracking back into XFS.
>
> So what we really need is writeback infrastructure that is:
>
> 	a) independent of the dirty inode lists
> 	b) implements background, periodic and immediate writeback
> 	c) can be driven by BDI or superblock
>
> IOWs, the abstraction we need for this writeback callout is not at
> the writeback_sb_inodes() layer, it's above the dirty inode list
> queuing layer. IOWs, the callout needs to be closer to the
> wb_do_writeback() layer which drives all the writeback work
> including periodic and background writeback, and the interactions
> between foreground and background work that wb_writeback() uses
> needs to be turned into a bunch of helpers...

I agree, and that is what Tux3 really wants. I just wonder about the
wisdom of embarking on a big change to fs-writeback when a small
change will do. How will fs-writeback balancing between different
filesystems on the same device work? What will those helpers look
like? How long will it take to prove the new scheme stable? How can
we prove that no existing fs-writeback clients will regress?

What about a sporting proposal: you post a patch that trumps ours in
elegance, that you could in theory use for XFS. We verify that it
works for Tux3 at least as well as the current patch. We also test
it for you.

>> This is not cast in stone, but it smells
>> like decent factoring. We can save some spinlocks by violating
>> that factoring (as Hirofumi's original hack did) at the cost of
>> holding a potentially busy wb lock longer, which does not seem
>> like a good trade.
> If we abstract at a higher layer, the wb lock protects just the work
> queuing and dispatch, and everything else can be done by the
> superblock callout. If the superblock callout is missing, then
> we simply fall down to the existing wb_writeback() code that retakes
> the lock and does it's work....
>
> Let's get the layering and abstraction in the right place the first
> time, rather having to revist this in the not-to-distant future.
>

That would be ideal, but incremental also has its attractions. In that
vein, here is a rewrite along the lines you suggested yesterday. The
only change to your code is, the writeback lock is dropped before calling
->writeback(). I tried doing it the other way (in the fs) but it was just
too ugly. As a result, generic_writeback_sb_inodes is called under the wb
list lock, but ->writeback drops it. An oddity that would obviously go
away with a better abstraction along the lines you suggest.

Regards,

Daniel

---

>From 8672ba5915e94a833321435d979615ac7059c789 Mon Sep 17 00:00:00 2001
From: Daniel Phillips <daniel@tux3.org>
Date: Mon, 2 Jun 2014 21:40:19 -0700
Subject: [PATCH] Add a super operation for writeback

Add a "writeback" super operation to be called in the
form:

   progress = s_op->writeback(sb, wb, work, wbc);

Where sb is (struct super_block *), wb is (struct
bdi_writeback *), work is (struct wb_writeback_work *),
and wbc is (struct writeback_control *).

The filesystem is expected to flush some inodes to disk
and return progress of at least 1, or if no inodes are
flushed, return progress of zero. The filesystem should
try to flush at least the number of pages specified in
work->nr_pages, or if that is not possible, return
approximately the number of pages that were not flushed
in work->nr_pages.

Within the ->writeback callback, the filesystem should
call inode_writeback_done(inode) for each inode flushed
(and therefore set clean) or inode_writeback_touch(inode)
for any inode that will be retained dirty in cache.

Signed-off-by: Daniel Phillips <daniel@tux3.org>
---
 fs/fs-writeback.c         | 86 +++++++++++++++++------------------------------
 include/linux/fs.h        |  8 +++--
 include/linux/writeback.h | 19 +++++++++++
 3 files changed, 56 insertions(+), 57 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 154d63e..ae25d25 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
  */
 #define MIN_WRITEBACK_PAGES    (4096UL >> (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-    long nr_pages;
-    struct super_block *sb;
-    unsigned long *older_than_this;
-    enum writeback_sync_modes sync_mode;
-    unsigned int tagged_writepages:1;
-    unsigned int for_kupdate:1;
-    unsigned int range_cyclic:1;
-    unsigned int for_background:1;
-    unsigned int for_sync:1;    /* sync(2) WB_SYNC_ALL writeback */
-    enum wb_reason reason;        /* why was writeback initiated? */
-
-    struct list_head list;        /* pending work list */
-    struct completion *done;    /* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -622,20 +603,12 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
  *
  * Return the number of pages and/or inodes written.
  */
-static long __writeback_sb_inodes(struct super_block *sb,
-                struct bdi_writeback *wb,
-                struct wb_writeback_work *work)
+long generic_writeback_sb_inodes(
+    struct super_block *sb,
+    struct bdi_writeback *wb,
+    struct wb_writeback_work *work,
+    struct writeback_control *wbc)
 {
-    struct writeback_control wbc = {
-        .sync_mode        = work->sync_mode,
-        .tagged_writepages    = work->tagged_writepages,
-        .for_kupdate        = work->for_kupdate,
-        .for_background        = work->for_background,
-        .for_sync        = work->for_sync,
-        .range_cyclic        = work->range_cyclic,
-        .range_start        = 0,
-        .range_end        = LLONG_MAX,
-    };
     unsigned long start_time = jiffies;
     long write_chunk;
     long wrote = 0;  /* count both pages and inodes */
@@ -673,7 +646,7 @@ static long __writeback_sb_inodes(struct super_block *sb,
             redirty_tail(inode, wb);
             continue;
         }
-        if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
+        if ((inode->i_state & I_SYNC) && wbc->sync_mode != WB_SYNC_ALL) {
             /*
              * If this inode is locked for writeback and we are not
              * doing writeback-for-data-integrity, move it to
@@ -706,22 +679,22 @@ static long __writeback_sb_inodes(struct super_block *sb,
         spin_unlock(&inode->i_lock);
 
         write_chunk = writeback_chunk_size(wb->bdi, work);
-        wbc.nr_to_write = write_chunk;
-        wbc.pages_skipped = 0;
+        wbc->nr_to_write = write_chunk;
+        wbc->pages_skipped = 0;
 
         /*
          * We use I_SYNC to pin the inode in memory. While it is set
          * evict_inode() will wait so the inode cannot be freed.
          */
-        __writeback_single_inode(inode, &wbc);
+        __writeback_single_inode(inode, wbc);
 
-        work->nr_pages -= write_chunk - wbc.nr_to_write;
-        wrote += write_chunk - wbc.nr_to_write;
+        work->nr_pages -= write_chunk - wbc->nr_to_write;
+        wrote += write_chunk - wbc->nr_to_write;
         spin_lock(&wb->list_lock);
         spin_lock(&inode->i_lock);
         if (!(inode->i_state & I_DIRTY))
             wrote++;
-        requeue_inode(inode, wb, &wbc);
+        requeue_inode(inode, wb, wbc);
         inode_sync_complete(inode);
         spin_unlock(&inode->i_lock);
         cond_resched_lock(&wb->list_lock);
@@ -739,28 +712,31 @@ static long __writeback_sb_inodes(struct super_block *sb,
     return wrote;
 }
 
-static long writeback_sb_inodes(struct super_block *sb,
-                struct bdi_writeback *wb,
-                struct wb_writeback_work *work)
+static long writeback_sb_inodes(
+    struct super_block *sb,
+    struct bdi_writeback *wb,
+    struct wb_writeback_work *work)
 {
-    if (sb->s_op->writeback) {
-        struct writeback_control wbc = {
-            .sync_mode = work->sync_mode,
-            .tagged_writepages = work->tagged_writepages,
-            .for_kupdate = work->for_kupdate,
-            .for_background = work->for_background,
-            .for_sync = work->for_sync,
-            .range_cyclic = work->range_cyclic,
-        };
-        long ret;
+    struct writeback_control wbc = {
+        .sync_mode        = work->sync_mode,
+        .tagged_writepages    = work->tagged_writepages,
+        .for_kupdate        = work->for_kupdate,
+        .for_background        = work->for_background,
+        .for_sync        = work->for_sync,
+        .range_cyclic        = work->range_cyclic,
+        .range_start        = 0,
+        .range_end        = LLONG_MAX,
+    };
 
+    if (sb->s_op->writeback) {
+        int err;
         spin_unlock(&wb->list_lock);
-        ret = sb->s_op->writeback(sb, &wbc, &work->nr_pages);
+        err = sb->s_op->writeback(sb, wb, work, &wbc);
         spin_lock(&wb->list_lock);
-        return ret;
+        return err;
     }
 
-    return __writeback_sb_inodes(sb, wb, work);
+    return generic_writeback_sb_inodes(sb, wb, work, &wbc);
 }
 
 static long __writeback_inodes_wb(struct bdi_writeback *wb,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 91dae3e..fc07d33 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -295,6 +295,8 @@ enum positive_aop_returns {
 struct page;
 struct address_space;
 struct writeback_control;
+struct wb_writeback_work;
+struct bdi_writeback;
 
 /*
  * "descriptor" for what we're up to with a read.
@@ -1542,8 +1544,10 @@ struct super_operations {
     int (*statfs) (struct dentry *, struct kstatfs *);
     int (*remount_fs) (struct super_block *, int *, char *);
     void (*umount_begin) (struct super_block *);
-    long (*writeback)(struct super_block *super, struct writeback_control *wbc, long *nr_pages);
-
+    long (*writeback)(struct super_block *sb,
+                struct bdi_writeback *wb,
+                struct wb_writeback_work *work,
+                struct writeback_control *wbc);
     int (*show_options)(struct seq_file *, struct dentry *);
     int (*show_devname)(struct seq_file *, struct dentry *);
     int (*show_path)(struct seq_file *, struct dentry *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5777c13..24e12be 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -87,6 +87,25 @@ struct writeback_control {
 };
 
 /*
+ * Passed into wb_writeback(), essentially a subset of writeback_control
+ */
+struct wb_writeback_work {
+    long nr_pages;
+    struct super_block *sb;
+    unsigned long *older_than_this;
+    enum writeback_sync_modes sync_mode;
+    unsigned int tagged_writepages:1;
+    unsigned int for_kupdate:1;
+    unsigned int range_cyclic:1;
+    unsigned int for_background:1;
+    unsigned int for_sync:1;    /* sync(2) WB_SYNC_ALL writeback */
+    enum wb_reason reason;        /* why was writeback initiated? */
+
+    struct list_head list;        /* pending work list */
+    struct completion *done;    /* set if the caller waits */
+};
+
+/*
  * fs/fs-writeback.c
  */   
 struct bdi_writeback;
-- 
1.9.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03  7:01       ` Daniel Phillips
@ 2014-06-03  7:26         ` Daniel Phillips
  2014-06-03  7:47         ` OGAWA Hirofumi
  2014-06-03  7:52         ` Dave Chinner
  2 siblings, 0 replies; 24+ messages in thread
From: Daniel Phillips @ 2014-06-03  7:26 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-kernel, linux-fsdevel, Linus Torvalds, Andrew Morton,
	OGAWA Hirofumi

Hi Dave,

Here is a non-incremental patch. This implements your suggestion
from yesterday, except that the wb list lock is dropped before
calling ->writeback().

Regards,

Daniel

>From d030d328757b160b39b252e82811a94843513cfc Mon Sep 17 00:00:00 2001
From: Daniel Phillips <daniel@tux3.org>
Date: Tue, 3 Jun 2014 00:19:11 -0700
Subject: [PATCH]     Add a super operation for writeback

Add a "writeback" super operation to be called in the
form:

   progress = s_op->writeback(sb, wb, work, wbc);

Where sb is (struct super_block *), wb is (struct
bdi_writeback *), work is (struct wb_writeback_work *),
and wbc is (struct writeback_control *).

The filesystem is expected to flush some inodes to disk
and return progress of at least 1, or if no inodes are
flushed, return progress of zero. The filesystem should
try to flush at least the number of pages specified in
work->nr_pages, or if that is not possible, return
approximately the number of pages that were not flushed
in work->nr_pages.

Within the ->writeback callback, the filesystem should
call inode_writeback_done(inode) for each inode flushed
and therefore set clean) or inode_writeback_touch(inode)
for any inode that will be retained dirty in cache.

Signed-off-by: Daniel Phillips <daniel@tux3.org>
---
 fs/fs-writeback.c         | 107 +++++++++++++++++++++++++++++-----------------
 include/linux/fs.h        |   9 +++-
 include/linux/writeback.h |  19 ++++++++
 3 files changed, 95 insertions(+), 40 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index be568b7..98810bd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,25 +34,6 @@
  */
 #define MIN_WRITEBACK_PAGES    (4096UL >> (PAGE_CACHE_SHIFT - 10))
 
-/*
- * Passed into wb_writeback(), essentially a subset of writeback_control
- */
-struct wb_writeback_work {
-    long nr_pages;
-    struct super_block *sb;
-    unsigned long *older_than_this;
-    enum writeback_sync_modes sync_mode;
-    unsigned int tagged_writepages:1;
-    unsigned int for_kupdate:1;
-    unsigned int range_cyclic:1;
-    unsigned int for_background:1;
-    unsigned int for_sync:1;    /* sync(2) WB_SYNC_ALL writeback */
-    enum wb_reason reason;        /* why was writeback initiated? */
-
-    struct list_head list;        /* pending work list */
-    struct completion *done;    /* set if the caller waits */
-};
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -192,6 +173,35 @@ void inode_wb_list_del(struct inode *inode)
 }
 
 /*
+ * Remove inode from writeback list if clean.
+ */
+void inode_writeback_done(struct inode *inode)
+{
+    struct backing_dev_info *bdi = inode_to_bdi(inode);
+
+    spin_lock(&bdi->wb.list_lock);
+    spin_lock(&inode->i_lock);
+    if (!(inode->i_state & I_DIRTY))
+        list_del_init(&inode->i_wb_list);
+    spin_unlock(&inode->i_lock);
+    spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_done);
+
+/*
+ * Add inode to writeback dirty list with current time.
+ */
+void inode_writeback_touch(struct inode *inode)
+{
+    struct backing_dev_info *bdi = inode->i_sb->s_bdi;
+    spin_lock(&bdi->wb.list_lock);
+    inode->dirtied_when = jiffies;
+    list_move(&inode->i_wb_list, &bdi->wb.b_dirty);
+    spin_unlock(&bdi->wb.list_lock);
+}
+EXPORT_SYMBOL_GPL(inode_writeback_touch);
+
+/*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
  *
@@ -593,20 +603,12 @@ static long writeback_chunk_size(struct backing_dev_info *bdi,
  *
  * Return the number of pages and/or inodes written.
  */
-static long writeback_sb_inodes(struct super_block *sb,
-                struct bdi_writeback *wb,
-                struct wb_writeback_work *work)
+long generic_writeback_sb_inodes(
+    struct super_block *sb,
+    struct bdi_writeback *wb,
+    struct wb_writeback_work *work,
+    struct writeback_control *wbc)
 {
-    struct writeback_control wbc = {
-        .sync_mode        = work->sync_mode,
-        .tagged_writepages    = work->tagged_writepages,
-        .for_kupdate        = work->for_kupdate,
-        .for_background        = work->for_background,
-        .for_sync        = work->for_sync,
-        .range_cyclic        = work->range_cyclic,
-        .range_start        = 0,
-        .range_end        = LLONG_MAX,
-    };
     unsigned long start_time = jiffies;
     long write_chunk;
     long wrote = 0;  /* count both pages and inodes */
@@ -644,7 +646,7 @@ static long writeback_sb_inodes(struct super_block *sb,
             redirty_tail(inode, wb);
             continue;
         }
-        if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
+        if ((inode->i_state & I_SYNC) && wbc->sync_mode != WB_SYNC_ALL) {
             /*
              * If this inode is locked for writeback and we are not
              * doing writeback-for-data-integrity, move it to
@@ -677,22 +679,22 @@ static long writeback_sb_inodes(struct super_block *sb,
         spin_unlock(&inode->i_lock);
 
         write_chunk = writeback_chunk_size(wb->bdi, work);
-        wbc.nr_to_write = write_chunk;
-        wbc.pages_skipped = 0;
+        wbc->nr_to_write = write_chunk;
+        wbc->pages_skipped = 0;
 
         /*
          * We use I_SYNC to pin the inode in memory. While it is set
          * evict_inode() will wait so the inode cannot be freed.
          */
-        __writeback_single_inode(inode, &wbc);
+        __writeback_single_inode(inode, wbc);
 
-        work->nr_pages -= write_chunk - wbc.nr_to_write;
-        wrote += write_chunk - wbc.nr_to_write;
+        work->nr_pages -= write_chunk - wbc->nr_to_write;
+        wrote += write_chunk - wbc->nr_to_write;
         spin_lock(&wb->list_lock);
         spin_lock(&inode->i_lock);
         if (!(inode->i_state & I_DIRTY))
             wrote++;
-        requeue_inode(inode, wb, &wbc);
+        requeue_inode(inode, wb, wbc);
         inode_sync_complete(inode);
         spin_unlock(&inode->i_lock);
         cond_resched_lock(&wb->list_lock);
@@ -710,6 +712,33 @@ static long writeback_sb_inodes(struct super_block *sb,
     return wrote;
 }
 
+static long writeback_sb_inodes(
+    struct super_block *sb,
+    struct bdi_writeback *wb,
+    struct wb_writeback_work *work)
+{
+    struct writeback_control wbc = {
+        .sync_mode        = work->sync_mode,
+        .tagged_writepages    = work->tagged_writepages,
+        .for_kupdate        = work->for_kupdate,
+        .for_background        = work->for_background,
+        .for_sync        = work->for_sync,
+        .range_cyclic        = work->range_cyclic,
+        .range_start        = 0,
+        .range_end        = LLONG_MAX,
+    };
+
+    if (sb->s_op->writeback) {
+        long ret;
+        spin_unlock(&wb->list_lock);
+        ret = sb->s_op->writeback(sb, wb, work, &wbc);
+        spin_lock(&wb->list_lock);
+        return ret;
+    }
+
+    return generic_writeback_sb_inodes(sb, wb, work, &wbc);
+}
+
 static long __writeback_inodes_wb(struct bdi_writeback *wb,
                   struct wb_writeback_work *work)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8780312..fc07d33 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -295,6 +295,8 @@ enum positive_aop_returns {
 struct page;
 struct address_space;
 struct writeback_control;
+struct wb_writeback_work;
+struct bdi_writeback;
 
 /*
  * "descriptor" for what we're up to with a read.
@@ -1542,7 +1544,10 @@ struct super_operations {
     int (*statfs) (struct dentry *, struct kstatfs *);
     int (*remount_fs) (struct super_block *, int *, char *);
     void (*umount_begin) (struct super_block *);
-
+    long (*writeback)(struct super_block *sb,
+                struct bdi_writeback *wb,
+                struct wb_writeback_work *work,
+                struct writeback_control *wbc);
     int (*show_options)(struct seq_file *, struct dentry *);
     int (*show_devname)(struct seq_file *, struct dentry *);
     int (*show_path)(struct seq_file *, struct dentry *);
@@ -1739,6 +1744,8 @@ static inline void file_accessed(struct file *file)
 
 int sync_inode(struct inode *inode, struct writeback_control *wbc);
 int sync_inode_metadata(struct inode *inode, int wait);
+void inode_writeback_done(struct inode *inode);
+void inode_writeback_touch(struct inode *inode);
 
 struct file_system_type {
     const char *name;
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 5777c13..24e12be 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -87,6 +87,25 @@ struct writeback_control {
 };
 
 /*
+ * Passed into wb_writeback(), essentially a subset of writeback_control
+ */
+struct wb_writeback_work {
+    long nr_pages;
+    struct super_block *sb;
+    unsigned long *older_than_this;
+    enum writeback_sync_modes sync_mode;
+    unsigned int tagged_writepages:1;
+    unsigned int for_kupdate:1;
+    unsigned int range_cyclic:1;
+    unsigned int for_background:1;
+    unsigned int for_sync:1;    /* sync(2) WB_SYNC_ALL writeback */
+    enum wb_reason reason;        /* why was writeback initiated? */
+
+    struct list_head list;        /* pending work list */
+    struct completion *done;    /* set if the caller waits */
+};
+
+/*
  * fs/fs-writeback.c
  */   
 struct bdi_writeback;


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03  7:01       ` Daniel Phillips
  2014-06-03  7:26         ` Daniel Phillips
@ 2014-06-03  7:47         ` OGAWA Hirofumi
  2014-06-03  8:12           ` Dave Chinner
  2014-06-03  7:52         ` Dave Chinner
  2 siblings, 1 reply; 24+ messages in thread
From: OGAWA Hirofumi @ 2014-06-03  7:47 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Dave Chinner, linux-kernel, linux-fsdevel, Linus Torvalds, Andrew Morton

Daniel Phillips <daniel@phunq.net> writes:

> Hi Dave,
> On 06/02/2014 08:33 PM, Dave Chinner wrote:
>> On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
>>>
>>> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
>>> not correct because the inode needs to end up on the dirty list
>>> whether it was already there or not.
>> redirty_tail() always moves the inode to the end of the dirty
>> list.

It doesn't move inode to end of the dirty if wb.b_dirty is empty
(I.e. it can move from wb.b_io to wb.b_dirty too).

BTW, this is called for like mark inode dirty process, not always
writeback path.

>>> This requirement is analogous to __mark_inode_dirty() and must
>>> tolerate similar races. At the microoptimization level, calling
>>> redirty_tail from inode_writeback_touch would be less efficient
>>> and more bulky. Another small issue is, redirty_tail does not
>>> always update the timestamp, which could trigger some bogus
>>> writeback events.
>> redirty_tail does not update the timestamp when it doesn't need to
>> change. If it needs to be changed because the current value would
>> violate the time ordering requirements of the list, it rewrites it.
>>
>> So there is essentially no functional difference between the new
>> function and redirty_tail....
>
> Hirofumi, would you care to comment?

It has difference.

Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
(tail->dirtied_when is expired at 31 with default config). In this case,
redirty_tail() doesn't update ->dirtied_when.

And if inode->dirtied_when is not updated to 30, expire time has
difference. I.e. 32 vs 60.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03  7:01       ` Daniel Phillips
  2014-06-03  7:26         ` Daniel Phillips
  2014-06-03  7:47         ` OGAWA Hirofumi
@ 2014-06-03  7:52         ` Dave Chinner
  2014-06-03 14:05           ` Jan Kara
  2 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2014-06-03  7:52 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: linux-kernel, linux-fsdevel, Linus Torvalds, Andrew Morton,
	OGAWA Hirofumi

On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote:
> Hi Dave,
> On 06/02/2014 08:33 PM, Dave Chinner wrote:
> > On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> >>> Hmmmm - this is using the wb dirty lists and locks, but you
> >>> don't pass the wb structure to the writeback callout? That seem
> >>> wrong to me - why would you bother manipulating these lists if
> >>> you aren't using them to track dirty inodes in the first place?
> >> From Tux3's point of view, the wb lists just allow fs-writeback to
> >> determine when to call ->writeback(). We agree that inode lists
> >> are a suboptimal mechanism, but that is how fs-writeback currently
> >> thinks. It would be better if our inodes never go onto wb lists in
> >> the first place, provided that fs-writeback can still drive
> >> writeback appropriately.
> > It can't, and definitely not with the callout you added.
> 
> Obviously, fs-writeback does not get to choose inodes or specific pages
> with our proposal (which is the whole point) but it still decides when
> to call Tux3 vs some other filesystem on the same device, and is still
> able to indicate how much data it thinks should be written out. Even
> though we ignore the latter suggestion and always flush everything, we
> appreciate the timing of those callbacks very much, because they give
> us exactly the pressure sensitive batching behavior we want.

Which, again, points out that you want per-superblock flushing, not
per-bdi flushing which is what the current writeback code does.

> > However, we already avoid the VFS writeback lists for certain
> > filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> > inode lists for inode metadata changes.  They get tracked internally
> > by the transaction subsystem which does it's own writeback according
> > to the requirements of journal space availability.
> >
> > This is done simply by not calling mark_inode_dirty() on any
> > metadata only change. If we want to do the same for data, then we'd
> > simply not call mark_inode_dirty() in the data IO path. That
> > requires a custom ->set_page_dirty method to be provided by the
> > filesystem that didn't call
> >
> > 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> >
> > and instead did it's own thing.
> >
> > So the per-superblock dirty tracking is something we can do right
> > now, and some filesystems do it for metadata. The missing piece for
> > data is writeback infrastructure capable of deferring to superblocks
> > for writeback rather than BDIs....
> 
> We agree that fs-writeback inode lists are broken for anything
> more sophisticated than Ext2.

No, I did not say that.

I said XFS does something different for metadata changes because it
has different flushing constraints and requirements than the generic
code provides. That does not make the generic code broken.

> An advantage of the patch under
> consideration is that it still lets fs-writeback mostly work the
> way it has worked for the last few years, except for not allowing it
> to pick specific inodes and data pages for writeout. As far as I
> can see, it still balances writeout between different filesystems
> on the same block device pretty well.

Not really. If there are 3 superblocks on a BDI, and the dirty inode
list iterates between 2 of them with lots of dirty inodes, it can
starve writeback from the third until one of it's dirty inodes pops
to the head of the b_io list. So it's inherently unfair from that
perspective.

Changing the high level flushing to be per-superblock rather than
per-BDI would enable us to control that behaviour and be much fairer
to all the superblocks on a given BDI. That said, I don't really
care that much about this case...

> >> Perhaps fs-writeback should have an option to work without inode
> >> lists at all, and just maintain writeback scheduling statistics in
> >> the superblock or similar. That would be a big change, more on the
> >> experimental side. We would be happy to take it on after merge,
> >> hopefully for the benefit of other filesystems besides Tux3.
> > Well, I don't see it that way. If you have a requirement to be able
> > to track dirty inodes internally, then lets move to that implement
> > that infrastructure rather than hacking around with callouts that
> > only have a limited shelf-life.
> 
> What you call shelf-life, I call iteration. Small changes are beautiful.
> Apologies for the rhetoric, content is below.

Ok, let me use stronger language: implementing something at the
wrong layer only to have to redo it at the correct layer a short
while later is wasted effort. Do it right the first time.

I do beleive I've explained this to you at least once before, so
lets not have to go over this again.

> > XFS already has everything it needs internally to track dirty
> > inodes. In fact, we used to do data writeback from within XFS and we
> > slowly removed it as the generic writeback code was improved made
> > the XFS code redundant.
> >
> > That said, parallelising writeback so we can support hundreds of
> > GB/s of delayed allocation based writeback is something we
> > eventually need to do, and that pretty much means we need to bring
> > dirty data inode tracking back into XFS.
> >
> > So what we really need is writeback infrastructure that is:
> >
> > 	a) independent of the dirty inode lists
> > 	b) implements background, periodic and immediate writeback
> > 	c) can be driven by BDI or superblock
> >
> > IOWs, the abstraction we need for this writeback callout is not at
> > the writeback_sb_inodes() layer, it's above the dirty inode list
> > queuing layer. IOWs, the callout needs to be closer to the
> > wb_do_writeback() layer which drives all the writeback work
> > including periodic and background writeback, and the interactions
> > between foreground and background work that wb_writeback() uses
> > needs to be turned into a bunch of helpers...
> 
> I agree, and that is what Tux3 really wants. I just wonder about the
> wisdom of embarking on a big change to fs-writeback when a small
> change will do. How will fs-writeback balancing between different
> filesystems on the same device work? What will those helpers look
> like? How long will it take to prove the new scheme stable? How can
> we prove that no existing fs-writeback clients will regress?

If the filesystem doesn't provide a callout, then the existing
writeback code runs. So for everything but tux3, there is no change
in behaviour. For tux3, you don't have to care about such
interactions while you are developing it. IOWs, we don't have to
solve every problem immediately but we get the hooks in the right
place and clean up and sanitise the locking and writeback
infrastructure at the same time.

> What about a sporting proposal: you post a patch that trumps ours in
> elegance, that you could in theory use for XFS. We verify that it
> works for Tux3 at least as well as the current patch. We also test
> it for you.

Sorry, the world doesn't work that way. I have other work to do, and
the onus on getting new code into the tree is on the people who need
it, not the people who are reviewing the code.

> >> This is not cast in stone, but it smells
> >> like decent factoring. We can save some spinlocks by violating
> >> that factoring (as Hirofumi's original hack did) at the cost of
> >> holding a potentially busy wb lock longer, which does not seem
> >> like a good trade.
> > If we abstract at a higher layer, the wb lock protects just the work
> > queuing and dispatch, and everything else can be done by the
> > superblock callout. If the superblock callout is missing, then
> > we simply fall down to the existing wb_writeback() code that retakes
> > the lock and does it's work....
> >
> > Let's get the layering and abstraction in the right place the first
> > time, rather having to revist this in the not-to-distant future.
> >
> 
> That would be ideal, but incremental also has its attractions. In that
> vein, here is a rewrite along the lines you suggested yesterday.

Sure, but that assumes the hook is in the correct place, which with
the information you gave today tells me it isn't. So polishing the
original patch isn't the solution. Driving the hook up the stack
seems to me like the better approach.

> - * Passed into wb_writeback(), essentially a subset of writeback_control
> - */
> -struct wb_writeback_work {
> -    long nr_pages;
> -    struct super_block *sb;
> -    unsigned long *older_than_this;

Something is mangling tabs to 4 spaces in your patches.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03  7:47         ` OGAWA Hirofumi
@ 2014-06-03  8:12           ` Dave Chinner
  2014-06-03  8:57             ` OGAWA Hirofumi
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2014-06-03  8:12 UTC (permalink / raw)
  To: OGAWA Hirofumi
  Cc: Daniel Phillips, linux-kernel, linux-fsdevel, Linus Torvalds,
	Andrew Morton

On Tue, Jun 03, 2014 at 04:47:52PM +0900, OGAWA Hirofumi wrote:
> Daniel Phillips <daniel@phunq.net> writes:
> 
> > Hi Dave,
> > On 06/02/2014 08:33 PM, Dave Chinner wrote:
> >> On Mon, Jun 02, 2014 at 01:02:29PM -0700, Daniel Phillips wrote:
> >>>
> >>> Redirty_tail nearly works, but "if (!list_empty(&wb->b_dirty))" is
> >>> not correct because the inode needs to end up on the dirty list
> >>> whether it was already there or not.
> >> redirty_tail() always moves the inode to the end of the dirty
> >> list.
> 
> It doesn't move inode to end of the dirty if wb.b_dirty is empty
> (I.e. it can move from wb.b_io to wb.b_dirty too).

Um, really?  What code are you reading? From 3.15-rc8:

static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
        assert_spin_locked(&wb->list_lock);
        if (!list_empty(&wb->b_dirty)) {
                struct inode *tail;

                tail = wb_inode(wb->b_dirty.next);
                if (time_before(inode->dirtied_when, tail->dirtied_when))
                        inode->dirtied_when = jiffies;
        }
        list_move(&inode->i_wb_list, &wb->b_dirty);
}

The list_move() is not conditional at all and so the inode is
*always* moved to the tail of wb->b_dirty....

> It has difference.
> 
> Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
> (tail->dirtied_when is expired at 31 with default config). In this case,
> redirty_tail() doesn't update ->dirtied_when.

OK, that's a different issue, and is actually handled by
requeue_inode(), which is called to put inodes back on the correct
dirty list when IO completes. I think that if you are going to use
the wb dirty inode lists, you should probably use the existing
functions to manage the inode lists appropriately rather than
creating your own writeback list lifecycle.

If tux3 wants it's own dirty inode list lifecycles, then that's
where avoiding the wb lists completely is an appropriate design
point. I don't want to hack little bits and pieces all over the
writeback code to support what tux3 is doing right now if it's going
to do something else real soon. When tux3 moves to use it's own
internal lists these new funcitons just have to be removed again, so
let's skip the hack step we are at right now and go straight for
supporting the "don't need the fs-writeback lists" infrstructure.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03  8:12           ` Dave Chinner
@ 2014-06-03  8:57             ` OGAWA Hirofumi
  0 siblings, 0 replies; 24+ messages in thread
From: OGAWA Hirofumi @ 2014-06-03  8:57 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Daniel Phillips, linux-kernel, linux-fsdevel, Linus Torvalds,
	Andrew Morton

Dave Chinner <david@fromorbit.com> writes:

>> It doesn't move inode to end of the dirty if wb.b_dirty is empty
>> (I.e. it can move from wb.b_io to wb.b_dirty too).
>
> Um, really?  What code are you reading? From 3.15-rc8:
>
> static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
> {
>         assert_spin_locked(&wb->list_lock);
>         if (!list_empty(&wb->b_dirty)) {
>                 struct inode *tail;
>
>                 tail = wb_inode(wb->b_dirty.next);
>                 if (time_before(inode->dirtied_when, tail->dirtied_when))
>                         inode->dirtied_when = jiffies;
>         }
>         list_move(&inode->i_wb_list, &wb->b_dirty);
> }
>
> The list_move() is not conditional at all and so the inode is
> *always* moved to the tail of wb->b_dirty....

Oh, you are right.

>> It has difference.
>> 
>> Say, tail->dirtied_when == 1, inode->dirtied_when == 2, and now == 30
>> (tail->dirtied_when is expired at 31 with default config). In this case,
>> redirty_tail() doesn't update ->dirtied_when.
>
> OK, that's a different issue, and is actually handled by
> requeue_inode(), which is called to put inodes back on the correct
> dirty list when IO completes. I think that if you are going to use
> the wb dirty inode lists, you should probably use the existing
> functions to manage the inode lists appropriately rather than
> creating your own writeback list lifecycle.

See __mark_inode_dirty() what does. We can consolidate that with
__mark_inode_dirty() if you want.

In our case, dirty while inode has I_DIRTY needs to handle new dirty
sometime, because of data=journal behavior.

> If tux3 wants it's own dirty inode list lifecycles, then that's
> where avoiding the wb lists completely is an appropriate design
> point. I don't want to hack little bits and pieces all over the
> writeback code to support what tux3 is doing right now if it's going
> to do something else real soon. When tux3 moves to use it's own
> internal lists these new funcitons just have to be removed again, so
> let's skip the hack step we are at right now and go straight for
> supporting the "don't need the fs-writeback lists" infrstructure.

Is it agreed with someone? There was bdflush, pdflush, and now bdi
flush. Why changed to bdi flush? This respects the intent of change
"pdflush to bdi flush" more or less.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03  7:52         ` Dave Chinner
@ 2014-06-03 14:05           ` Jan Kara
  2014-06-03 14:14             ` Christoph Hellwig
  0 siblings, 1 reply; 24+ messages in thread
From: Jan Kara @ 2014-06-03 14:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Daniel Phillips, linux-kernel, linux-fsdevel, Linus Torvalds,
	Andrew Morton, OGAWA Hirofumi

On Tue 03-06-14 17:52:09, Dave Chinner wrote:
> On Tue, Jun 03, 2014 at 12:01:11AM -0700, Daniel Phillips wrote:
> > > However, we already avoid the VFS writeback lists for certain
> > > filesystems for pure metadata. e.g. XFS does not use the VFS dirty
> > > inode lists for inode metadata changes.  They get tracked internally
> > > by the transaction subsystem which does it's own writeback according
> > > to the requirements of journal space availability.
> > >
> > > This is done simply by not calling mark_inode_dirty() on any
> > > metadata only change. If we want to do the same for data, then we'd
> > > simply not call mark_inode_dirty() in the data IO path. That
> > > requires a custom ->set_page_dirty method to be provided by the
> > > filesystem that didn't call
> > >
> > > 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > >
> > > and instead did it's own thing.
> > >
> > > So the per-superblock dirty tracking is something we can do right
> > > now, and some filesystems do it for metadata. The missing piece for
> > > data is writeback infrastructure capable of deferring to superblocks
> > > for writeback rather than BDIs....
> > 
> > We agree that fs-writeback inode lists are broken for anything
> > more sophisticated than Ext2.
> 
> No, I did not say that.
> 
> I said XFS does something different for metadata changes because it
> has different flushing constraints and requirements than the generic
> code provides. That does not make the generic code broken.
> 
> > An advantage of the patch under
> > consideration is that it still lets fs-writeback mostly work the
> > way it has worked for the last few years, except for not allowing it
> > to pick specific inodes and data pages for writeout. As far as I
> > can see, it still balances writeout between different filesystems
> > on the same block device pretty well.
> 
> Not really. If there are 3 superblocks on a BDI, and the dirty inode
> list iterates between 2 of them with lots of dirty inodes, it can
> starve writeback from the third until one of it's dirty inodes pops
> to the head of the b_io list. So it's inherently unfair from that
> perspective.
> 
> Changing the high level flushing to be per-superblock rather than
> per-BDI would enable us to control that behaviour and be much fairer
> to all the superblocks on a given BDI. That said, I don't really
> care that much about this case...
So we currently flush inodes in first dirtied first written back order when
superblock is not specified in writeback work. That completely ignores the
fact to which superblock inode belongs but I don't see per-sb fairness to
actually make any sense when
1) flushing old data (to keep promise set in dirty_expire_centisecs)
2) flushing data to reduce number of dirty pages
And these are really the only two cases where we don't do per-sb flushing.

Now when filesystems want to do something more clever (and I can see
reasons for that e.g. when journalling metadata, even more so when
journalling data) I agree we need to somehow implement the above two types
of writeback using per-sb flushing. Type 1) is actually pretty easy - just
tell each sb to writeback dirty data upto time T. Type 2) is more difficult
because that is more openended task - it seems similar to what shrinkers do
but that would require us to track per sb amount of dirty pages / inodes
and I'm not sure we want to add even more page counting statistics...
Especially since often bdi == fs. Thoughts?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03 14:05           ` Jan Kara
@ 2014-06-03 14:14             ` Christoph Hellwig
  2014-06-03 14:25               ` Theodore Ts'o
  2014-06-03 15:21               ` Jan Kara
  0 siblings, 2 replies; 24+ messages in thread
From: Christoph Hellwig @ 2014-06-03 14:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Dave Chinner, Daniel Phillips, linux-kernel, linux-fsdevel,
	Linus Torvalds, Andrew Morton, OGAWA Hirofumi, Jens Axboe

On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> So we currently flush inodes in first dirtied first written back order when
> superblock is not specified in writeback work. That completely ignores the
> fact to which superblock inode belongs but I don't see per-sb fairness to
> actually make any sense when
> 1) flushing old data (to keep promise set in dirty_expire_centisecs)
> 2) flushing data to reduce number of dirty pages
> And these are really the only two cases where we don't do per-sb flushing.
> 
> Now when filesystems want to do something more clever (and I can see
> reasons for that e.g. when journalling metadata, even more so when
> journalling data) I agree we need to somehow implement the above two types
> of writeback using per-sb flushing. Type 1) is actually pretty easy - just
> tell each sb to writeback dirty data upto time T. Type 2) is more difficult
> because that is more openended task - it seems similar to what shrinkers do
> but that would require us to track per sb amount of dirty pages / inodes
> and I'm not sure we want to add even more page counting statistics...
> Especially since often bdi == fs. Thoughts?

Honestly I think doing per-bdi writeback has been a major mistake.  As
you said it only even matters when we have filesystems on multiple
partitions on a single device, and even then only in a simple setup,
as soon as we use LVM or btrfs this sort of sharing stops to happen
anyway.  I don't even see much of a benefit except that we prevent
two flushing daemons to congest a single device for that special case
of multiple filesystems on partitions of the same device, and that could
be solved in other ways.

The major benefit of the per-bdi writeback was that for the usual case
of one filesystem per device we get exactly one flusher thread per
filesystems intead of multiple competing ones, but per-sb writeback
would solve that just as fine.


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03 14:14             ` Christoph Hellwig
@ 2014-06-03 14:25               ` Theodore Ts'o
  2014-06-03 15:21               ` Jan Kara
  1 sibling, 0 replies; 24+ messages in thread
From: Theodore Ts'o @ 2014-06-03 14:25 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, Daniel Phillips, linux-kernel,
	linux-fsdevel, Linus Torvalds, Andrew Morton, OGAWA Hirofumi,
	Jens Axboe

On Tue, Jun 03, 2014 at 07:14:44AM -0700, Christoph Hellwig wrote:
> Honestly I think doing per-bdi writeback has been a major mistake.  As
> you said it only even matters when we have filesystems on multiple
> partitions on a single device, and even then only in a simple setup,
> as soon as we use LVM or btrfs this sort of sharing stops to happen
> anyway.  I don't even see much of a benefit except that we prevent
> two flushing daemons to congest a single device for that special case
> of multiple filesystems on partitions of the same device, and that could
> be solved in other ways.

To be fair, back when per-bdi writeback was introduced, having
multiple partitions on a single disk was far more common, and the use
of LVM was much less common.  These days, many more systems using one
big root filesystem, and or using flash where the parallel writes can
actually be a good thing (since there isn't a single disk head which
has to seek all over the HDD), the case for keeping per-bdi writeback
is much weaker, if not non-existent.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03  5:30     ` Christian Stroetmann
@ 2014-06-03 14:57       ` Theodore Ts'o
  2014-06-03 16:30         ` Christian Stroetmann
  0 siblings, 1 reply; 24+ messages in thread
From: Theodore Ts'o @ 2014-06-03 14:57 UTC (permalink / raw)
  To: Christian Stroetmann
  Cc: Dave Chinner, Daniel Phillips, Linux Kernel, Linux FS Devel,
	Linus Torvalds, Andrew Morton, OGAWA Hirofumi

On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:
> In general, I do not believe that the complexity problems of soft updates,
> atomic writes, and related techniques can be solved by hand/manually. So my
> suggestion is to automatically handle the complexity problem of e.g.
> dependancies in a way that is comparable to a(n on-the-fly) file-system
> compiler so to say that works on a very large dependancy graph (having
> several billions of graph vertices actually). And at this point an
> abstraction like it is given with Featherstitch helps to feed and control
> this special FS compiler.

Well, if you want to try to implement something like this, go for it!

I'd be very curious to see how well (a) how much CPU overhead it takes
to crunch on a dependency graph with billions of vertices, and (b) how
easily can it be to express these dependencies and maintainable such a
dependency language would be.  Sounds like a great research topic, and
I'll note the Call For Papers for FAST 2015 is out, and if you can
solve these problems, it would make a great FAST 2015 submission:

https://www.usenix.org/conference/fast15/call-for-papers

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03 14:14             ` Christoph Hellwig
  2014-06-03 14:25               ` Theodore Ts'o
@ 2014-06-03 15:21               ` Jan Kara
  2014-06-03 22:37                 ` Daniel Phillips
  1 sibling, 1 reply; 24+ messages in thread
From: Jan Kara @ 2014-06-03 15:21 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, Dave Chinner, Daniel Phillips, linux-kernel,
	linux-fsdevel, Linus Torvalds, Andrew Morton, OGAWA Hirofumi,
	Jens Axboe

On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> > So we currently flush inodes in first dirtied first written back order when
> > superblock is not specified in writeback work. That completely ignores the
> > fact to which superblock inode belongs but I don't see per-sb fairness to
> > actually make any sense when
> > 1) flushing old data (to keep promise set in dirty_expire_centisecs)
> > 2) flushing data to reduce number of dirty pages
> > And these are really the only two cases where we don't do per-sb flushing.
> > 
> > Now when filesystems want to do something more clever (and I can see
> > reasons for that e.g. when journalling metadata, even more so when
> > journalling data) I agree we need to somehow implement the above two types
> > of writeback using per-sb flushing. Type 1) is actually pretty easy - just
> > tell each sb to writeback dirty data upto time T. Type 2) is more difficult
> > because that is more openended task - it seems similar to what shrinkers do
> > but that would require us to track per sb amount of dirty pages / inodes
> > and I'm not sure we want to add even more page counting statistics...
> > Especially since often bdi == fs. Thoughts?
> 
> Honestly I think doing per-bdi writeback has been a major mistake.  As
> you said it only even matters when we have filesystems on multiple
> partitions on a single device, and even then only in a simple setup,
> as soon as we use LVM or btrfs this sort of sharing stops to happen
> anyway.  I don't even see much of a benefit except that we prevent
> two flushing daemons to congest a single device for that special case
> of multiple filesystems on partitions of the same device, and that could
> be solved in other ways.
  So I agree per-bdi / per-sb matters only in simple setups but machines
with single rotating disk with several partitions and without LVM aren't
that rare AFAICT from my experience. And I agree we went for per-bdi
flushing to avoid two threads congesting a single device leading to
suboptimal IO patterns during background writeback.

So currently I'm convinced we want to go for per-sb dirty tracking. That
also makes some speedups in that code noticeably simpler. I'm not convinced
about the per-sb flushing thread - if we don't regress the multiple sb on
bdi case when we just let the threads from different superblocks contend
for IO, then that would be a natural thing to do. But once we have to
introduce some synchronization between threads to avoid regressions, I
think it might be easier to just stay with per-bdi thread which switches
between superblocks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03 14:57       ` Theodore Ts'o
@ 2014-06-03 16:30         ` Christian Stroetmann
  0 siblings, 0 replies; 24+ messages in thread
From: Christian Stroetmann @ 2014-06-03 16:30 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Daniel Phillips, Linux Kernel, Linux FS Devel,
	Linus Torvalds, Andrew Morton, OGAWA Hirofumi

On the 3rd of June 2014 16:57, Theodore Ts'o wrote:
> On Tue, Jun 03, 2014 at 07:30:32AM +0200, Christian Stroetmann wrote:
>> In general, I do not believe that the complexity problems of soft updates,
>> atomic writes, and related techniques can be solved by hand/manually. So my
>> suggestion is to automatically handle the complexity problem of e.g.
>> dependancies in a way that is comparable to a(n on-the-fly) file-system
>> compiler so to say that works on a very large dependancy graph (having
>> several billions of graph vertices actually). And at this point an
>> abstraction like it is given with Featherstitch helps to feed and control
>> this special FS compiler.
> Well, if you want to try to implement something like this, go for it!

I am already active since some weeks.

> I'd be very curious to see how well (a) how much CPU overhead it takes
> to crunch on a dependency graph with billions of vertices, and (b) how
> easily can it be to express these dependencies and maintainable such a
> dependency language would be.  Sounds like a great research topic, and

To a) A run is expected to take some few hours on a single computing node.

Also, such a graph processing must not be done all the time, but only if 
a new application demands a specific handling of the data in respect to 
e.g. one of the ACID criterias. That is the reason why I put 
"on-the-fly" in paranthesis.

To b) I hoped that file system developers could make some suggestions or 
point to some no-gos.
I am also thinking about Petri-Nets in this relation, though it is just 
an idea.

I would also like to mention that it could be used in conjunction with 
Non-Volatile Memory (NVM) as well.

> I'll note the Call For Papers for FAST 2015 is out, and if you can
> solve these problems, it would make a great FAST 2015 submission:
>
> https://www.usenix.org/conference/fast15/call-for-papers

Are you serious or have I missed the 1st of April once again?
Actually, I could only write a general overview about the approach 
comparable to a white paper, but nothing more.

> Cheers,
>
> 					- Ted
>

Best regards
Christian

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03 15:21               ` Jan Kara
@ 2014-06-03 22:37                 ` Daniel Phillips
  2014-06-04 20:16                   ` Jan Kara
  0 siblings, 1 reply; 24+ messages in thread
From: Daniel Phillips @ 2014-06-03 22:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Dave Chinner, linux-kernel, linux-fsdevel,
	Linus Torvalds, Andrew Morton, OGAWA Hirofumi, Jens Axboe

On Tuesday, June 3, 2014 8:21:55 AM PDT, Jan Kara wrote:
> On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
>> On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
>  ...
>   So I agree per-bdi / per-sb matters only in simple setups but machines
> with single rotating disk with several partitions and without LVM aren't
> that rare AFAICT from my experience.

Retribution is sure to be swift, terrible and eternal for anyone who dares 
to
break those.

> And I agree we went for per-bdi
> flushing to avoid two threads congesting a single device leading to
> suboptimal IO patterns during background writeback.

A proposal is on the table to implement s_ops->writeback() as a per-sb
operation in such a way that nothing changes in the current per-inode path.
Good or bad approach?

> So currently I'm convinced we want to go for per-sb dirty tracking. That
> also makes some speedups in that code noticeably simpler. I'm not 
convinced
> about the per-sb flushing thread - if we don't regress the multiple sb on
> bdi case when we just let the threads from different superblocks contend
> for IO, then that would be a natural thing to do. But once we have to
> introduce some synchronization between threads to avoid regressions, I
> think it might be easier to just stay with per-bdi thread which switches
> between superblocks.

Could you elaborate on the means of switching between superblocks? Do you 
mean
a new fs-writeback path just for data=journal class filesystems, or are you
suggesting changing the way all filesystems are driven?

Regards,

Daniel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC][PATCH 1/2] Add a super operation for writeback
  2014-06-03 22:37                 ` Daniel Phillips
@ 2014-06-04 20:16                   ` Jan Kara
  0 siblings, 0 replies; 24+ messages in thread
From: Jan Kara @ 2014-06-04 20:16 UTC (permalink / raw)
  To: Daniel Phillips
  Cc: Jan Kara, Christoph Hellwig, Dave Chinner, linux-kernel,
	linux-fsdevel, Linus Torvalds, Andrew Morton, OGAWA Hirofumi,
	Jens Axboe

On Tue 03-06-14 15:37:39, Daniel Phillips wrote:
> On Tuesday, June 3, 2014 8:21:55 AM PDT, Jan Kara wrote:
> >On Tue 03-06-14 07:14:44, Christoph Hellwig wrote:
> >>On Tue, Jun 03, 2014 at 04:05:31PM +0200, Jan Kara wrote:
> >And I agree we went for per-bdi
> >flushing to avoid two threads congesting a single device leading to
> >suboptimal IO patterns during background writeback.
> 
> A proposal is on the table to implement s_ops->writeback() as a per-sb
> operation in such a way that nothing changes in the current per-inode path.
> Good or bad approach?
  Having s_ops->writeback() is fine. But I just hate how you hook in that
callback into the writeback code (and Dave has expressed similar concerns).
The way you hook it up, filesystem still has to have one inode in the dirty
list so that __writeback_inodes_wb() can find that inode, get superblock
from it and ask for writeback to be done via callback. That's really ugly
and no-go for me.

> >So currently I'm convinced we want to go for per-sb dirty tracking. That
> >also makes some speedups in that code noticeably simpler. I'm not
> convinced
> >about the per-sb flushing thread - if we don't regress the multiple sb on
> >bdi case when we just let the threads from different superblocks contend
> >for IO, then that would be a natural thing to do. But once we have to
> >introduce some synchronization between threads to avoid regressions, I
> >think it might be easier to just stay with per-bdi thread which switches
> >between superblocks.
> 
> Could you elaborate on the means of switching between superblocks?  Do
> you mean a new fs-writeback path just for data=journal class filesystems,
> or are you suggesting changing the way all filesystems are driven?
  So I suggest changing the way all filesystems are driven to a per-sb one.
That makes sense for other reasons as well and allows clean incorporation
of your writeback callback (either a fs will use generic callback which
would do what generic writeback code does now, only with per-sb dirty lists
instead of current per-bdi ones, or the fs can do its own stuff). I can
write something (the switch isn't really that complex) but it will need
quite some testing to verify we don't regress somewhere...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2014-06-04 20:16 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-06-01 21:41 [RFC][PATCH 1/2] Add a super operation for writeback Daniel Phillips
2014-06-01 21:42 ` [RFC][PATCH 2/2] tux3: Use writeback hook to remove duplicated core code Daniel Phillips
2014-06-02  3:30   ` Dave Chinner
2014-06-02 20:07     ` Daniel Phillips
2014-06-02  3:15 ` [RFC][PATCH 1/2] Add a super operation for writeback Dave Chinner
2014-06-02 20:02   ` Daniel Phillips
2014-06-03  3:33     ` Dave Chinner
2014-06-03  7:01       ` Daniel Phillips
2014-06-03  7:26         ` Daniel Phillips
2014-06-03  7:47         ` OGAWA Hirofumi
2014-06-03  8:12           ` Dave Chinner
2014-06-03  8:57             ` OGAWA Hirofumi
2014-06-03  7:52         ` Dave Chinner
2014-06-03 14:05           ` Jan Kara
2014-06-03 14:14             ` Christoph Hellwig
2014-06-03 14:25               ` Theodore Ts'o
2014-06-03 15:21               ` Jan Kara
2014-06-03 22:37                 ` Daniel Phillips
2014-06-04 20:16                   ` Jan Kara
2014-06-02  8:30 ` Christian Stroetmann
2014-06-03  3:39   ` Dave Chinner
2014-06-03  5:30     ` Christian Stroetmann
2014-06-03 14:57       ` Theodore Ts'o
2014-06-03 16:30         ` Christian Stroetmann

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.