linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v10 0/9] Add fast commits in Ext4 file system
@ 2020-10-15 20:37 Harshad Shirwadkar
  2020-10-15 20:37 ` [PATCH v10 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
                   ` (8 more replies)
  0 siblings, 9 replies; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch series adds support for fast commits which is a simplified
version of the scheme proposed by Park and Shin, in their paper,
"iJournaling: Fine-Grained Journaling for Improving the Latency of
Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
give the client file system an opportunity to perform a faster
commit. Only if the file system cannot perform such a commit
operation, then JBD2 should fall back to traditional commits.

Because JBD2 operates at block granularity, for every file system
metadata update it commits all the changed blocks are written to the
journal at commit time. This is inefficient because updates to some
blocks that JBD2 commits are derivable from some other blocks. For
example, if a new extent is added to an inode, then corresponding
updates to the inode table, the block bitmap, the group descriptor and
the superblock can be derived based on just the extent information and
the corresponding inode information. So, if we take this relationship
between blocks into account and replay the journalled blocks smartly,
we could increase performance of file system commits significantly.

Fast commits introduced in this patch have two main contributions:

(1) Making JBD2 fast commit aware, so that clients of JBD2 can
    implement fast commits

(2) Add support in ext4 to use JBD2's new interfaces and implement
    fast commits

Fast commit operation
---------------------

The new fast commit operation works by tracking file system deltas
since last commit in memory and committing these deltas to disk during
fsync(). Ext4 maintains directory entry updates in an in-memory
queue. Also, the inodes that have changed since last commit are
maintained in an in-memory queue. These queues are flushed to disk
during the commit time in a log-structured way. Fast commit area is
organized as a log of TAG-LENGTH-VALUE tuples with a special "tail"
tag marking the end of a commit. If certain operation prevents fast
commit from happening, the commit code falls back to JBD2 full commit
operation and thus invalidating all the fast commits since last full
commit. JBD2 provides new jbd2_fc_start() and jbd2_fc_stop() functions
to co-ordinate between JBD2's full commits and client file system's
fast commits.

Recovery operation
------------------

During recovery, JBD2 lets the client file system handle fast commit
blocks as it wants. After performing transaction replay, JBD2 invokes
client file system's recovery path handler. During the scan phase,
Ext4's recovery path handler determines the validity of fast commit
log by making sure CRC and TID of fast commits are valid. During the
replay phase, the recovery handler replays tags one by one. These
replay handlers are idempotent. Thus, if we crash in the middle of
recovery, Ext4 can restart the log replay and reach the identical
final state.

Testing
-------

e2fsprogs was updated to set fast commit feature flag and to ignore
fast commit blocks during e2fsck.

https://github.com/harshadjs/e2fsprogs.git

No regressions were introduced in smoke tests.

How to Use this feature?
-----------------------

This feature should not be used in production until corresponding
e2fsprogs changes are ready. These changes are being worked on at -
https://github.com/harshadjs/e2fsprogs.git. This feature can be set at
mkfs time. For testing purposes, this feature can also be enabled by
passing a mount time flag "fc_debug_force". This mount flag should
only be used for testing purposes and never for production.

Once enabled, fast commit information can be viewed in
/proc/fs/ext4/<dev>/fc_info.

Performance Evaluation
----------------------

Ext4 performance was compared with and without fast commits using
fsmark, dbench and filebench benchmarks with local file system and
over NFS. This is the summary of results:

|-----------+-------------------+----------------+----------------+--------|
| Benchmark | Config            | No FC          | FC             | % diff |
|-----------+-------------------+----------------+----------------+--------|
| Fsmark    | Local, 8 threads  | 1475.1 files/s | 4309.8 files/s | +192.2 |
| Fsmark    | NFS, 4 threads    | 299.4 files/s  | 409.45 files/s |  +36.8 |
|-----------+-------------------+----------------+----------------+--------|
| Dbench    | Local, 2 procs    | 33.32 MB/s     | 70.87 MB/s     | +112.7 |
| Dbench    | NFS, 2 procs      | 8.84 MB/s      | 11.88 MB/s     |  +34.4 |
|-----------+-------------------+----------------+----------------+--------|
| Dbench    | Local, 10 procs   | 90.48 MB/s     | 110.12 MB/s    |  +21.7 |
| Dbench    | NFS, 10 procs     | 34.62 MB/s     | 52.83 MB/s     |  +52.6 |
|-----------+-------------------+----------------+----------------+--------|
| FileBench | Local, 16 threads | 10442.3 ops/s  | 18617.8 ops/s  |  +78.3 |
|           | (Varmail)         |                |                |        |
| FileBench | NFS, 16 threads   | 1531.3 ops/s   | 2681.5 ops/s   |  +75.1 |
|           | (Varmail)         |                |                |        |
|-----------+-------------------+----------------+----------------+--------|

NFS Performance Evaluation
--------------------------

NFS performs commit_metadata operation very frequently which resulted
in a linux kernel untar operation resulting in over ~180 journal
commits / second. The same untar operation results in 2.5 commits /
second. However, as the above table shows, the benefits that NFS sees
aren't as great as the local disk. The reason for that is the network
latency. Before fast commits, NFS was bottlenecked on journal commit
performance. However, with fast commits reducing that time
significantly, NFS performance now gets bottlenecked on network
latency. NFS running on networks with lower latency (< 300 us) will
see better performance than the NFS numbers reported above.

DAX Support
-----------

Fast commits helps improve Ext4 performance on DAX devices
too. However, there as an opportunity to do even better. Collaborating
with Rohan Kadekodi (rak@cs.utexas.edu) from UT Austin and Saurabh
Kadekodi (saukad@cs.cmu.edu) from CMU, we have added synchronous fast
commits which write at byte granularity (instead of block
granularity). This is WIP available at -
https://github.com/harshadjs/linux/tree/fc-pmem-renewed. Doing this
way, we get stronger guarantees than current Ext4 very cheaply on
persistent memory devices.

Changes since V9
----------------

* Removed "PARTIAL_INODE" tag and now only using "FULL_INODE" tag for
  replay.
* A few bugfixes as pointed out by Ritesh and Ted.
* Readability improvements: added more comments and made naming of
  variables more consistent
* Documentation updates

[1] iJournaling: Fine-Grained Journaling for Improving the Latency of
Fsync System Call
https://www.usenix.org/conference/atc17/technical-sessions/presentation/park

Harshad Shirwadkar (9):
  doc: update ext4 and journalling docs to include fast commit feature
  ext4: add fast_commit feature and handling for extended mount options
  ext4 / jbd2: add fast commit initialization
  jbd2: add fast commit machinery
  ext4: main fast-commit commit path
  jbd2: fast commit recovery path
  ext4: fast commit recovery path
  ext4: add a mount opt to forcefully turn fast commits on
  ext4: add fast commit stats in procfs

 Documentation/filesystems/ext4/journal.rst |   66 +
 Documentation/filesystems/journalling.rst  |   33 +
 fs/ext4/Makefile                           |    2 +-
 fs/ext4/acl.c                              |    2 +
 fs/ext4/balloc.c                           |    7 +-
 fs/ext4/ext4.h                             |  101 +
 fs/ext4/ext4_jbd2.c                        |    2 +-
 fs/ext4/extents.c                          |  309 ++-
 fs/ext4/extents_status.c                   |   24 +
 fs/ext4/fast_commit.c                      | 2128 ++++++++++++++++++++
 fs/ext4/fast_commit.h                      |  159 ++
 fs/ext4/file.c                             |   10 +-
 fs/ext4/fsync.c                            |    2 +-
 fs/ext4/ialloc.c                           |  168 +-
 fs/ext4/inode.c                            |  130 +-
 fs/ext4/ioctl.c                            |   22 +-
 fs/ext4/mballoc.c                          |  206 +-
 fs/ext4/namei.c                            |  186 +-
 fs/ext4/super.c                            |   84 +-
 fs/ext4/sysfs.c                            |    2 +
 fs/ext4/xattr.c                            |    3 +
 fs/jbd2/commit.c                           |   44 +
 fs/jbd2/journal.c                          |  243 ++-
 fs/jbd2/recovery.c                         |   57 +-
 include/linux/jbd2.h                       |   91 +-
 include/trace/events/ext4.h                |  228 ++-
 26 files changed, 4144 insertions(+), 165 deletions(-)
 create mode 100644 fs/ext4/fast_commit.c
 create mode 100644 fs/ext4/fast_commit.h

-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH v10 1/9] doc: update ext4 and journalling docs to include fast commit feature
  2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
@ 2020-10-15 20:37 ` Harshad Shirwadkar
  2020-10-21 16:04   ` Jan Kara
  2020-10-15 20:37 ` [PATCH v10 2/9] ext4: add fast_commit feature and handling for extended mount options Harshad Shirwadkar
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch adds necessary documentation for fast commits.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 Documentation/filesystems/ext4/journal.rst | 66 ++++++++++++++++++++++
 Documentation/filesystems/journalling.rst  | 33 +++++++++++
 2 files changed, 99 insertions(+)

diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..a522037a28cf 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -28,6 +28,17 @@ metadata are written to disk through the journal. This is slower but
 safest. If ``data=writeback``, dirty data blocks are not flushed to the
 disk before the metadata are written to disk through the journal.
 
+In case of ``data=ordered`` mode, Ext4 also supports fast commits which
+help reduce commit latency significantly. The default ``data=ordered``
+mode works by logging metadata blocks to the journal. In fast commit
+mode, Ext4 only stores the minimal delta needed to recreate the
+affected metadata in fast commit space that is shared with JBD2.
+Once the fast commit area fills in or if fast commit is not possible
+or if JBD2 commit timer goes off, Ext4 performs a traditional full commit.
+A full commit invalidates all the fast commits that happened before
+it and thus it makes the fast commit area empty for further fast
+commits. This feature needs to be enabled at mkfs time.
+
 The journal inode is typically inode 8. The first 68 bytes of the
 journal inode are replicated in the ext4 superblock. The journal itself
 is normal (but hidden) file within the filesystem. The file usually
@@ -609,3 +620,58 @@ bytes long (but uses a full block):
      - h\_commit\_nsec
      - Nanoseconds component of the above timestamp.
 
+Fast commits
+~~~~~~~~~~~~
+
+Fast commit area is organized as a log of tag length values. Each TLV has
+a ``struct ext4_fc_tl`` in the beginning which stores the tag and the length
+of the entire field. It is followed by variable length tag specific value.
+Here is the list of supported tags and their meanings:
+
+.. list-table::
+   :widths: 8 20 20 32
+   :header-rows: 1
+
+   * - Tag
+     - Meaning
+     - Value struct
+     - Description
+   * - EXT4_FC_TAG_HEAD
+     - Fast commit area header
+     - ``struct ext4_fc_head``
+     - Stores the TID of the transaction after which these fast commits should
+       be applied.
+   * - EXT4_FC_TAG_ADD_RANGE
+     - Add extent to inode
+     - ``struct ext4_fc_add_range``
+     - Stores the inode number and extent to be added in this inode
+   * - EXT4_FC_TAG_DEL_RANGE
+     - Remove logical offsets to inode
+     - ``struct ext4_fc_del_range``
+     - Stores the inode number and the logical offset range that needs to be
+       removed
+   * - EXT4_FC_TAG_CREAT
+     - Create directory entry for a newly created file
+     - ``struct ext4_fc_dentry_info``
+     - Stores the parent inode numer, inode number and directory entry of the
+       newly created file
+   * - EXT4_FC_TAG_LINK
+     - Link a directory entry to an inode
+     - ``struct ext4_fc_dentry_info``
+     - Stores the parent inode numer, inode number and directory entry
+   * - EXT4_FC_TAG_UNLINK
+     - Unink a directory entry of an inode
+     - ``struct ext4_fc_dentry_info``
+     - Stores the parent inode numer, inode number and directory entry
+
+   * - EXT4_FC_TAG_PAD
+     - Padding (unused area)
+     - None
+     - Unused bytes in the fast commit area.
+
+   * - EXT4_FC_TAG_TAIL
+     - Mark the end of a fast commit
+     - ``struct ext4_fc_tail``
+     - Stores the TID of the commit, CRC of the fast commit of which this tag
+       represents the end of
+
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 7e2be2faf653..5a5f70b4063e 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -132,6 +132,39 @@ The opportunities for abuse and DOS attacks with this should be obvious,
 if you allow unprivileged userspace to trigger codepaths containing
 these calls.
 
+Fast commits
+~~~~~~~~~~~~
+
+JBD2 to also allows you to perform file-system specific delta commits known as
+fast commits. In order to use fast commits, you first need to call
+:c:func:`jbd2_fc_init` and tell how many blocks at the end of journal
+area should be reserved for fast commits. Along with that, you will also need
+to set following callbacks that perform correspodning work:
+
+`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
+fast commit.
+
+`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
+blocks.
+
+File system is free to perform fast commits as and when it wants as long as it
+gets permission from JBD2 to do so by calling the function
+:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
+file  system should tell JBD2 about it by calling
+:c:func:`jbd2_fc_end_commit()`. If file system wants JBD2 to perform a full
+commit immediately after stopping the fast commit it can do so by calling
+:c:func:`jbd2_fc_end_commit_fallback()`. This is useful if fast commit operation
+fails for some reason and the only way to guarantee consistency is for JBD2 to
+perform the full traditional commit.
+
+JBD2 helper functions to manage fast commit buffers. File system can use
+:c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()` to allocate
+and wait on IO completion of fast commit buffers.
+
+Currently, only Ext4 implements fast commits. For details of its implementation
+of fast commits, please refer to the top level comments in
+fs/ext4/fast_commit.c.
+
 Summary
 ~~~~~~~
 
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v10 2/9] ext4: add fast_commit feature and handling for extended mount options
  2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
  2020-10-15 20:37 ` [PATCH v10 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
@ 2020-10-15 20:37 ` Harshad Shirwadkar
  2020-10-21 16:18   ` Jan Kara
  2020-10-15 20:37 ` [PATCH v10 3/9] ext4 / jbd2: add fast commit initialization Harshad Shirwadkar
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

We are running out of mount option bits. Add handling for using
s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
ability to turn off the fast commit feature in Ext4.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h       |  4 ++++
 fs/ext4/super.c      | 27 ++++++++++++++++++++++-----
 include/linux/jbd2.h |  5 ++++-
 3 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1879531a119f..02d7dc378505 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1213,6 +1213,8 @@ struct ext4_inode_info {
 #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM	0x00000008 /* User explicitly
 						specified journal checksum */
 
+#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT	0x00000010 /* Journal fast commit */
+
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
 #define set_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt |= \
@@ -1813,6 +1815,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode)
 #define EXT4_FEATURE_COMPAT_RESIZE_INODE	0x0010
 #define EXT4_FEATURE_COMPAT_DIR_INDEX		0x0020
 #define EXT4_FEATURE_COMPAT_SPARSE_SUPER2	0x0200
+#define EXT4_FEATURE_COMPAT_FAST_COMMIT		0x0400
 #define EXT4_FEATURE_COMPAT_STABLE_INODES	0x0800
 
 #define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
@@ -1915,6 +1918,7 @@ EXT4_FEATURE_COMPAT_FUNCS(xattr,		EXT_ATTR)
 EXT4_FEATURE_COMPAT_FUNCS(resize_inode,		RESIZE_INODE)
 EXT4_FEATURE_COMPAT_FUNCS(dir_index,		DIR_INDEX)
 EXT4_FEATURE_COMPAT_FUNCS(sparse_super2,	SPARSE_SUPER2)
+EXT4_FEATURE_COMPAT_FUNCS(fast_commit,		FAST_COMMIT)
 EXT4_FEATURE_COMPAT_FUNCS(stable_inodes,	STABLE_INODES)
 
 EXT4_FEATURE_RO_COMPAT_FUNCS(sparse_super,	SPARSE_SUPER)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 901c1c938276..70256a240442 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1709,7 +1709,7 @@ enum {
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
-	Opt_prefetch_block_bitmaps,
+	Opt_prefetch_block_bitmaps, Opt_no_fc,
 };
 
 static const match_table_t tokens = {
@@ -1796,6 +1796,7 @@ static const match_table_t tokens = {
 	{Opt_init_itable, "init_itable=%u"},
 	{Opt_init_itable, "init_itable"},
 	{Opt_noinit_itable, "noinit_itable"},
+	{Opt_no_fc, "no_fc"},
 	{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption"},
@@ -1922,6 +1923,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 #define MOPT_EXT4_ONLY	(MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING	0x0400
 #define MOPT_SKIP	0x0800
+#define	MOPT_2		0x1000
 
 static const struct mount_opts {
 	int	token;
@@ -2022,6 +2024,8 @@ static const struct mount_opts {
 	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
 	{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
 	 MOPT_SET},
+	{Opt_no_fc, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
+	 MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
 	{Opt_err, 0, 0}
 };
 
@@ -2398,10 +2402,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 			WARN_ON(1);
 			return -1;
 		}
-		if (arg != 0)
-			sbi->s_mount_opt |= m->mount_opt;
-		else
-			sbi->s_mount_opt &= ~m->mount_opt;
+		if (m->flags & MOPT_2) {
+			if (arg != 0)
+				sbi->s_mount_opt2 |= m->mount_opt;
+			else
+				sbi->s_mount_opt2 &= ~m->mount_opt;
+		} else {
+			if (arg != 0)
+				sbi->s_mount_opt |= m->mount_opt;
+			else
+				sbi->s_mount_opt &= ~m->mount_opt;
+		}
 	}
 	return 1;
 }
@@ -2618,6 +2629,9 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 		SEQ_OPTS_PUTS("dax=inode");
 	}
 
+	if (test_opt2(sb, JOURNAL_FAST_COMMIT))
+		SEQ_OPTS_PUTS("fast_commit");
+
 	ext4_show_quota_options(seq, sb);
 	return 0;
 }
@@ -4121,6 +4135,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
 	set_opt(sb, POSIX_ACL);
 #endif
+	if (ext4_has_feature_fast_commit(sb))
+		set_opt2(sb, JOURNAL_FAST_COMMIT);
 	/* don't forget to enable journal_csum when metadata_csum is enabled. */
 	if (ext4_has_metadata_csum(sb))
 		set_opt(sb, JOURNAL_CHECKSUM);
@@ -4777,6 +4793,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		sbi->s_def_mount_opt &= ~EXT4_MOUNT_JOURNAL_CHECKSUM;
 		clear_opt(sb, JOURNAL_CHECKSUM);
 		clear_opt(sb, DATA_FLAGS);
+		clear_opt2(sb, JOURNAL_FAST_COMMIT);
 		sbi->s_journal = NULL;
 		needs_recovery = 0;
 		goto no_journal;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 04afa6dcd60d..0685cc95e501 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -289,6 +289,7 @@ typedef struct journal_superblock_s
 #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT	0x00000004
 #define JBD2_FEATURE_INCOMPAT_CSUM_V2		0x00000008
 #define JBD2_FEATURE_INCOMPAT_CSUM_V3		0x00000010
+#define JBD2_FEATURE_INCOMPAT_FAST_COMMIT	0x00000020
 
 /* See "journal feature predicate functions" below */
 
@@ -299,7 +300,8 @@ typedef struct journal_superblock_s
 					JBD2_FEATURE_INCOMPAT_64BIT | \
 					JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \
 					JBD2_FEATURE_INCOMPAT_CSUM_V2 | \
-					JBD2_FEATURE_INCOMPAT_CSUM_V3)
+					JBD2_FEATURE_INCOMPAT_CSUM_V3 | \
+					JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
 
 #ifdef __KERNEL__
 
@@ -1263,6 +1265,7 @@ JBD2_FEATURE_INCOMPAT_FUNCS(64bit,		64BIT)
 JBD2_FEATURE_INCOMPAT_FUNCS(async_commit,	ASYNC_COMMIT)
 JBD2_FEATURE_INCOMPAT_FUNCS(csum2,		CSUM_V2)
 JBD2_FEATURE_INCOMPAT_FUNCS(csum3,		CSUM_V3)
+JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit,	FAST_COMMIT)
 
 /*
  * Journal flag definitions
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v10 3/9] ext4 / jbd2: add fast commit initialization
  2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
  2020-10-15 20:37 ` [PATCH v10 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
  2020-10-15 20:37 ` [PATCH v10 2/9] ext4: add fast_commit feature and handling for extended mount options Harshad Shirwadkar
@ 2020-10-15 20:37 ` Harshad Shirwadkar
  2020-10-21 20:00   ` Jan Kara
  2020-10-15 20:37 ` [PATCH v10 4/9] jbd2: add fast commit machinery Harshad Shirwadkar
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar, kernel test robot

This patch adds fast commit area trackers in the journal_t
structure. These are initialized via the jbd2_fc_init() routine that
this patch adds. This patch also adds ext4/fast_commit.c and
ext4/fast_commit.h files for fast commit code that will be added in
subsequent patches in this series.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/Makefile      |  2 +-
 fs/ext4/ext4.h        |  4 ++++
 fs/ext4/fast_commit.c | 20 ++++++++++++++++
 fs/ext4/fast_commit.h |  9 ++++++++
 fs/ext4/super.c       |  1 +
 fs/jbd2/journal.c     | 53 +++++++++++++++++++++++++++++++++++++++----
 include/linux/jbd2.h  | 39 +++++++++++++++++++++++++++++++
 7 files changed, 122 insertions(+), 6 deletions(-)
 create mode 100644 fs/ext4/fast_commit.c
 create mode 100644 fs/ext4/fast_commit.h

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 2e42f47a7f98..49e7af6cc93f 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -10,7 +10,7 @@ ext4-y	:= balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
 		indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
 		mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \
 		super.o symlink.o sysfs.o xattr.o xattr_hurd.o xattr_trusted.o \
-		xattr_user.o
+		xattr_user.o fast_commit.o
 
 ext4-$(CONFIG_EXT4_FS_POSIX_ACL)	+= acl.o
 ext4-$(CONFIG_EXT4_FS_SECURITY)		+= xattr_security.o
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 02d7dc378505..2c412d32db0f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -963,6 +963,7 @@ do {									       \
 #endif /* defined(__KERNEL__) || defined(__linux__) */
 
 #include "extents_status.h"
+#include "fast_commit.h"
 
 /*
  * Lock subclasses for i_data_sem in the ext4_inode_info structure.
@@ -2678,6 +2679,9 @@ extern int ext4_init_inode_table(struct super_block *sb,
 				 ext4_group_t group, int barrier);
 extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
 
+/* fast_commit.c */
+
+void ext4_fc_init(struct super_block *sb, journal_t *journal);
 /* mballoc.c */
 extern const struct seq_operations ext4_mb_seq_groups_ops;
 extern long ext4_mb_stats;
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
new file mode 100644
index 000000000000..0dad8bdb1253
--- /dev/null
+++ b/fs/ext4/fast_commit.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * fs/ext4/fast_commit.c
+ *
+ * Written by Harshad Shirwadkar <harshadshirwadkar@gmail.com>
+ *
+ * Ext4 fast commits routines.
+ */
+#include "ext4_jbd2.h"
+
+void ext4_fc_init(struct super_block *sb, journal_t *journal)
+{
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
+		return;
+	if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
+		pr_warn("Error while enabling fast commits, turning off.");
+		ext4_clear_feature_fast_commit(sb);
+	}
+}
diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
new file mode 100644
index 000000000000..8362bf5e6e00
--- /dev/null
+++ b/fs/ext4/fast_commit.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __FAST_COMMIT_H__
+#define __FAST_COMMIT_H__
+
+/* Number of blocks in journal area to allocate for fast commits */
+#define EXT4_NUM_FC_BLKS		256
+
+#endif /* __FAST_COMMIT_H__ */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 70256a240442..23bf55057fc2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -5170,6 +5170,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
 	journal->j_commit_interval = sbi->s_commit_interval;
 	journal->j_min_batch_time = sbi->s_min_batch_time;
 	journal->j_max_batch_time = sbi->s_max_batch_time;
+	ext4_fc_init(sb, journal);
 
 	write_lock(&journal->j_state_lock);
 	if (test_opt(sb, BARRIER))
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index c0600405e7a2..4497bfbac527 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1181,6 +1181,14 @@ static journal_t *journal_init_common(struct block_device *bdev,
 	if (!journal->j_wbuf)
 		goto err_cleanup;
 
+	if (journal->j_fc_wbufsize > 0) {
+		journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
+					sizeof(struct buffer_head *),
+					GFP_KERNEL);
+		if (!journal->j_fc_wbuf)
+			goto err_cleanup;
+	}
+
 	bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
 	if (!bh) {
 		pr_err("%s: Cannot get buffer for journal superblock\n",
@@ -1194,11 +1202,23 @@ static journal_t *journal_init_common(struct block_device *bdev,
 
 err_cleanup:
 	kfree(journal->j_wbuf);
+	kfree(journal->j_fc_wbuf);
 	jbd2_journal_destroy_revoke(journal);
 	kfree(journal);
 	return NULL;
 }
 
+int jbd2_fc_init(journal_t *journal, int num_fc_blks)
+{
+	journal->j_fc_wbufsize = num_fc_blks;
+	journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
+				sizeof(struct buffer_head *), GFP_KERNEL);
+	if (!journal->j_fc_wbuf)
+		return -ENOMEM;
+	return 0;
+}
+EXPORT_SYMBOL(jbd2_fc_init);
+
 /* jbd2_journal_init_dev and jbd2_journal_init_inode:
  *
  * Create a journal structure assigned some fixed set of disk blocks to
@@ -1316,11 +1336,20 @@ static int journal_reset(journal_t *journal)
 	}
 
 	journal->j_first = first;
-	journal->j_last = last;
 
-	journal->j_head = first;
-	journal->j_tail = first;
-	journal->j_free = last - first;
+	if (jbd2_has_feature_fast_commit(journal) &&
+	    journal->j_fc_wbufsize > 0) {
+		journal->j_fc_last = last;
+		journal->j_last = last - journal->j_fc_wbufsize;
+		journal->j_fc_first = journal->j_last + 1;
+		journal->j_fc_off = 0;
+	} else {
+		journal->j_last = last;
+	}
+
+	journal->j_head = journal->j_first;
+	journal->j_tail = journal->j_first;
+	journal->j_free = journal->j_last - journal->j_first;
 
 	journal->j_tail_sequence = journal->j_transaction_sequence;
 	journal->j_commit_sequence = journal->j_transaction_sequence - 1;
@@ -1665,9 +1694,18 @@ static int load_superblock(journal_t *journal)
 	journal->j_tail_sequence = be32_to_cpu(sb->s_sequence);
 	journal->j_tail = be32_to_cpu(sb->s_start);
 	journal->j_first = be32_to_cpu(sb->s_first);
-	journal->j_last = be32_to_cpu(sb->s_maxlen);
 	journal->j_errno = be32_to_cpu(sb->s_errno);
 
+	if (jbd2_has_feature_fast_commit(journal) &&
+	    journal->j_fc_wbufsize > 0) {
+		journal->j_fc_last = be32_to_cpu(sb->s_maxlen);
+		journal->j_last = journal->j_fc_last - journal->j_fc_wbufsize;
+		journal->j_fc_first = journal->j_last + 1;
+		journal->j_fc_off = 0;
+	} else {
+		journal->j_last = be32_to_cpu(sb->s_maxlen);
+	}
+
 	return 0;
 }
 
@@ -1728,6 +1766,9 @@ int jbd2_journal_load(journal_t *journal)
 	 */
 	journal->j_flags &= ~JBD2_ABORT;
 
+	if (journal->j_fc_wbufsize > 0)
+		jbd2_journal_set_features(journal, 0, 0,
+					  JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
 	/* OK, we've finished with the dynamic journal bits:
 	 * reinitialise the dynamic contents of the superblock in memory
 	 * and reset them on disk. */
@@ -1811,6 +1852,8 @@ int jbd2_journal_destroy(journal_t *journal)
 		jbd2_journal_destroy_revoke(journal);
 	if (journal->j_chksum_driver)
 		crypto_free_shash(journal->j_chksum_driver);
+	if (journal->j_fc_wbufsize > 0)
+		kfree(journal->j_fc_wbuf);
 	kfree(journal->j_wbuf);
 	kfree(journal);
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 0685cc95e501..008629b4d615 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -918,6 +918,30 @@ struct journal_s
 	 */
 	unsigned long		j_last;
 
+	/**
+	 * @j_fc_first:
+	 *
+	 * The block number of the first fast commit block in the journal
+	 * [j_state_lock].
+	 */
+	unsigned long		j_fc_first;
+
+	/**
+	 * @j_fc_off:
+	 *
+	 * Number of fast commit blocks currently allocated.
+	 * [j_state_lock].
+	 */
+	unsigned long		j_fc_off;
+
+	/**
+	 * @j_fc_last:
+	 *
+	 * The block number one beyond the last fast commit block in the journal
+	 * [j_state_lock].
+	 */
+	unsigned long		j_fc_last;
+
 	/**
 	 * @j_dev: Device where we store the journal.
 	 */
@@ -1068,6 +1092,12 @@ struct journal_s
 	 */
 	struct buffer_head	**j_wbuf;
 
+	/**
+	 * @j_fc_wbuf: Array of fast commit bhs for
+	 * jbd2_journal_commit_transaction.
+	 */
+	struct buffer_head	**j_fc_wbuf;
+
 	/**
 	 * @j_wbufsize:
 	 *
@@ -1075,6 +1105,13 @@ struct journal_s
 	 */
 	int			j_wbufsize;
 
+	/**
+	 * @j_fc_wbufsize:
+	 *
+	 * Size of @j_fc_wbuf array.
+	 */
+	int			j_fc_wbufsize;
+
 	/**
 	 * @j_last_sync_writer:
 	 *
@@ -1535,6 +1572,8 @@ void __jbd2_log_wait_for_space(journal_t *journal);
 extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *);
 extern int jbd2_cleanup_journal_tail(journal_t *);
 
+/* Fast commit related APIs */
+int jbd2_fc_init(journal_t *journal, int num_fc_blks);
 /*
  * is_journal_abort
  *
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v10 4/9] jbd2: add fast commit machinery
  2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
                   ` (2 preceding siblings ...)
  2020-10-15 20:37 ` [PATCH v10 3/9] ext4 / jbd2: add fast commit initialization Harshad Shirwadkar
@ 2020-10-15 20:37 ` Harshad Shirwadkar
  2020-10-22 10:16   ` Jan Kara
  2020-10-15 20:37 ` [PATCH v10 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This functions adds necessary APIs needed in JBD2 layer for fast
commits.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/fast_commit.c |   8 ++
 fs/jbd2/commit.c      |  44 ++++++++++
 fs/jbd2/journal.c     | 190 +++++++++++++++++++++++++++++++++++++++++-
 include/linux/jbd2.h  |  27 ++++++
 4 files changed, 268 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 0dad8bdb1253..f2d11b4c6b62 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -8,11 +8,19 @@
  * Ext4 fast commits routines.
  */
 #include "ext4_jbd2.h"
+/*
+ * Fast commit cleanup routine. This is called after every fast commit and
+ * full commit. full is true if we are called after a full commit.
+ */
+static void ext4_fc_cleanup(journal_t *journal, int full)
+{
+}
 
 void ext4_fc_init(struct super_block *sb, journal_t *journal)
 {
 	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
 		return;
+	journal->j_fc_cleanup_callback = ext4_fc_cleanup;
 	if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
 		pr_warn("Error while enabling fast commits, turning off.");
 		ext4_clear_feature_fast_commit(sb);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6252b4c50666..fa688e163a80 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -206,6 +206,30 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
 	return generic_writepages(mapping, &wbc);
 }
 
+/* Send all the data buffers related to an inode */
+int jbd2_submit_inode_data(struct jbd2_inode *jinode)
+{
+
+	if (!jinode || !(jinode->i_flags & JI_WRITE_DATA))
+		return 0;
+
+	trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
+	return jbd2_journal_submit_inode_data_buffers(jinode);
+
+}
+EXPORT_SYMBOL(jbd2_submit_inode_data);
+
+int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
+{
+	if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
+		!jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
+		return 0;
+	return filemap_fdatawait_range_keep_errors(
+		jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start,
+		jinode->i_dirty_end);
+}
+EXPORT_SYMBOL(jbd2_wait_inode_data);
+
 /*
  * Submit all the data buffers of inode associated with the transaction to
  * disk.
@@ -415,6 +439,20 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	J_ASSERT(journal->j_running_transaction != NULL);
 	J_ASSERT(journal->j_committing_transaction == NULL);
 
+	write_lock(&journal->j_state_lock);
+	journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
+	while (journal->j_flags & JBD2_FAST_COMMIT_ONGOING) {
+		DEFINE_WAIT(wait);
+
+		prepare_to_wait(&journal->j_fc_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+		write_unlock(&journal->j_state_lock);
+		schedule();
+		write_lock(&journal->j_state_lock);
+		finish_wait(&journal->j_fc_wait, &wait);
+	}
+	write_unlock(&journal->j_state_lock);
+
 	commit_transaction = journal->j_running_transaction;
 
 	trace_jbd2_start_commit(journal, commit_transaction);
@@ -422,6 +460,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 			commit_transaction->t_tid);
 
 	write_lock(&journal->j_state_lock);
+	journal->j_fc_off = 0;
 	J_ASSERT(commit_transaction->t_state == T_RUNNING);
 	commit_transaction->t_state = T_LOCKED;
 
@@ -1121,12 +1160,16 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 
 	if (journal->j_commit_callback)
 		journal->j_commit_callback(journal, commit_transaction);
+	if (journal->j_fc_cleanup_callback)
+		journal->j_fc_cleanup_callback(journal, 1);
 
 	trace_jbd2_end_commit(journal, commit_transaction);
 	jbd_debug(1, "JBD2: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
 
 	write_lock(&journal->j_state_lock);
+	journal->j_flags &= ~JBD2_FULL_COMMIT_ONGOING;
+	journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
 	spin_lock(&journal->j_list_lock);
 	commit_transaction->t_state = T_FINISHED;
 	/* Check if the transaction can be dropped now that we are finished */
@@ -1138,6 +1181,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	spin_unlock(&journal->j_list_lock);
 	write_unlock(&journal->j_state_lock);
 	wake_up(&journal->j_wait_done_commit);
+	wake_up(&journal->j_fc_wait);
 
 	/*
 	 * Calculate overall stats
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 4497bfbac527..0c7c42bd530f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -159,7 +159,9 @@ static void commit_timeout(struct timer_list *t)
  *
  * 1) COMMIT:  Every so often we need to commit the current state of the
  *    filesystem to disk.  The journal thread is responsible for writing
- *    all of the metadata buffers to disk.
+ *    all of the metadata buffers to disk. If a fast commit is ongoing
+ *    journal thread waits until it's done and then continues from
+ *    there on.
  *
  * 2) CHECKPOINT: We cannot reuse a used section of the log file until all
  *    of the data in that part of the log has been rewritten elsewhere on
@@ -716,6 +718,75 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 	return err;
 }
 
+/*
+ * Start a fast commit. If there's an ongoing fast or full commit wait for
+ * it to complete. Returns 0 if a new fast commit was started. Returns -EALREADY
+ * if a fast commit is not needed, either because there's an already a commit
+ * going on or this tid has already been committed. Returns -EINVAL if no jbd2
+ * commit has yet been performed.
+ */
+int jbd2_fc_begin_commit(journal_t *journal, tid_t tid)
+{
+	/*
+	 * Fast commits only allowed if at least one full commit has
+	 * been processed.
+	 */
+	if (!journal->j_stats.ts_tid)
+		return -EINVAL;
+
+	if (tid <= journal->j_commit_sequence)
+		return -EALREADY;
+
+	write_lock(&journal->j_state_lock);
+	if (journal->j_flags & JBD2_FULL_COMMIT_ONGOING ||
+	    (journal->j_flags & JBD2_FAST_COMMIT_ONGOING)) {
+		DEFINE_WAIT(wait);
+
+		prepare_to_wait(&journal->j_fc_wait, &wait,
+				TASK_UNINTERRUPTIBLE);
+		write_unlock(&journal->j_state_lock);
+		schedule();
+		finish_wait(&journal->j_fc_wait, &wait);
+		return -EALREADY;
+	}
+	journal->j_flags |= JBD2_FAST_COMMIT_ONGOING;
+	write_unlock(&journal->j_state_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(jbd2_fc_begin_commit);
+
+/*
+ * Stop a fast commit. If fallback is set, this function starts commit of
+ * TID tid before any other fast commit can start.
+ */
+static int __jbd2_fc_end_commit(journal_t *journal, tid_t tid, bool fallback)
+{
+	if (journal->j_fc_cleanup_callback)
+		journal->j_fc_cleanup_callback(journal, 0);
+	write_lock(&journal->j_state_lock);
+	journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
+	if (fallback)
+		journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
+	write_unlock(&journal->j_state_lock);
+	wake_up(&journal->j_fc_wait);
+	if (fallback)
+		return jbd2_complete_transaction(journal, tid);
+	return 0;
+}
+
+int jbd2_fc_end_commit(journal_t *journal)
+{
+	return __jbd2_fc_end_commit(journal, 0, 0);
+}
+EXPORT_SYMBOL(jbd2_fc_end_commit);
+
+int jbd2_fc_end_commit_fallback(journal_t *journal, tid_t tid)
+{
+	return __jbd2_fc_end_commit(journal, tid, 1);
+}
+EXPORT_SYMBOL(jbd2_fc_end_commit_fallback);
+
 /* Return 1 when transaction with given tid has already committed. */
 int jbd2_transaction_committed(journal_t *journal, tid_t tid)
 {
@@ -784,6 +855,110 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
 	return jbd2_journal_bmap(journal, blocknr, retp);
 }
 
+/* Map one fast commit buffer for use by the file system */
+int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out)
+{
+	unsigned long long pblock;
+	unsigned long blocknr;
+	int ret = 0;
+	struct buffer_head *bh;
+	int fc_off;
+
+	*bh_out = NULL;
+	write_lock(&journal->j_state_lock);
+
+	if (journal->j_fc_off + journal->j_fc_first < journal->j_fc_last) {
+		fc_off = journal->j_fc_off;
+		blocknr = journal->j_fc_first + fc_off;
+		journal->j_fc_off++;
+	} else {
+		ret = -EINVAL;
+	}
+	write_unlock(&journal->j_state_lock);
+
+	if (ret)
+		return ret;
+
+	ret = jbd2_journal_bmap(journal, blocknr, &pblock);
+	if (ret)
+		return ret;
+
+	bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
+	if (!bh)
+		return -ENOMEM;
+
+	lock_buffer(bh);
+
+	clear_buffer_uptodate(bh);
+	set_buffer_dirty(bh);
+	unlock_buffer(bh);
+	journal->j_fc_wbuf[fc_off] = bh;
+
+	*bh_out = bh;
+
+	return 0;
+}
+EXPORT_SYMBOL(jbd2_fc_get_buf);
+
+/*
+ * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
+ * for completion.
+ */
+int jbd2_fc_wait_bufs(journal_t *journal, int num_blks)
+{
+	struct buffer_head *bh;
+	int i, j_fc_off;
+
+	read_lock(&journal->j_state_lock);
+	j_fc_off = journal->j_fc_off;
+	read_unlock(&journal->j_state_lock);
+
+	/*
+	 * Wait in reverse order to minimize chances of us being woken up before
+	 * all IOs have completed
+	 */
+	for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
+		bh = journal->j_fc_wbuf[i];
+		wait_on_buffer(bh);
+		put_bh(bh);
+		journal->j_fc_wbuf[i] = NULL;
+		if (unlikely(!buffer_uptodate(bh)))
+			return -EIO;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(jbd2_fc_wait_bufs);
+
+/*
+ * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
+ * for completion.
+ */
+int jbd2_fc_release_bufs(journal_t *journal)
+{
+	struct buffer_head *bh;
+	int i, j_fc_off;
+
+	read_lock(&journal->j_state_lock);
+	j_fc_off = journal->j_fc_off;
+	read_unlock(&journal->j_state_lock);
+
+	/*
+	 * Wait in reverse order to minimize chances of us being woken up before
+	 * all IOs have completed
+	 */
+	for (i = j_fc_off - 1; i >= 0; i--) {
+		bh = journal->j_fc_wbuf[i];
+		if (!bh)
+			break;
+		put_bh(bh);
+		journal->j_fc_wbuf[i] = NULL;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(jbd2_fc_release_bufs);
+
 /*
  * Conversion of logical to physical block numbers for the journal
  *
@@ -1142,6 +1317,7 @@ static journal_t *journal_init_common(struct block_device *bdev,
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
 	init_waitqueue_head(&journal->j_wait_reserved);
+	init_waitqueue_head(&journal->j_fc_wait);
 	mutex_init(&journal->j_abort_mutex);
 	mutex_init(&journal->j_barrier);
 	mutex_init(&journal->j_checkpoint_mutex);
@@ -1495,6 +1671,7 @@ int jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
 static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
 {
 	journal_superblock_t *sb = journal->j_superblock;
+	bool had_fast_commit = false;
 
 	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
 	lock_buffer(journal->j_sb_buffer);
@@ -1508,9 +1685,20 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
 
 	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
 	sb->s_start    = cpu_to_be32(0);
+	if (jbd2_has_feature_fast_commit(journal)) {
+		/*
+		 * When journal is clean, no need to commit fast commit flag and
+		 * make file system incompatible with older kernels.
+		 */
+		jbd2_clear_feature_fast_commit(journal);
+		had_fast_commit = true;
+	}
 
 	jbd2_write_superblock(journal, write_op);
 
+	if (had_fast_commit)
+		jbd2_set_feature_fast_commit(journal);
+
 	/* Log is no longer empty */
 	write_lock(&journal->j_state_lock);
 	journal->j_flags |= JBD2_FLUSHED;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 008629b4d615..a009d9b9c620 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -861,6 +861,13 @@ struct journal_s
 	 */
 	wait_queue_head_t	j_wait_reserved;
 
+	/**
+	 * @j_fc_wait:
+	 *
+	 * Wait queue to wait for completion of async fast commits.
+	 */
+	wait_queue_head_t	j_fc_wait;
+
 	/**
 	 * @j_checkpoint_mutex:
 	 *
@@ -1232,6 +1239,15 @@ struct journal_s
 	 */
 	struct lockdep_map	j_trans_commit_map;
 #endif
+
+	/**
+	 * @j_fc_cleanup_callback:
+	 *
+	 * Clean-up after fast commit or full commit. JBD2 calls this function
+	 * after every commit operation.
+	 */
+	void (*j_fc_cleanup_callback)(struct journal_s *journal, int);
+
 };
 
 #define jbd2_might_wait_for_commit(j) \
@@ -1316,6 +1332,8 @@ JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit,	FAST_COMMIT)
 #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
 						 * data write error in ordered
 						 * mode */
+#define JBD2_FAST_COMMIT_ONGOING	0x100	/* Fast commit is ongoing */
+#define JBD2_FULL_COMMIT_ONGOING	0x200	/* Full commit is ongoing */
 
 /*
  * Function declarations for the journaling transaction and buffer
@@ -1574,6 +1592,15 @@ extern int jbd2_cleanup_journal_tail(journal_t *);
 
 /* Fast commit related APIs */
 int jbd2_fc_init(journal_t *journal, int num_fc_blks);
+int jbd2_fc_begin_commit(journal_t *journal, tid_t tid);
+int jbd2_fc_end_commit(journal_t *journal);
+int jbd2_fc_end_commit_fallback(journal_t *journal, tid_t tid);
+int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out);
+int jbd2_submit_inode_data(struct jbd2_inode *jinode);
+int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode);
+int jbd2_fc_wait_bufs(journal_t *journal, int num_blks);
+int jbd2_fc_release_bufs(journal_t *journal);
+
 /*
  * is_journal_abort
  *
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
                   ` (3 preceding siblings ...)
  2020-10-15 20:37 ` [PATCH v10 4/9] jbd2: add fast commit machinery Harshad Shirwadkar
@ 2020-10-15 20:37 ` Harshad Shirwadkar
  2020-10-23 10:30   ` Jan Kara
  2020-10-15 20:37 ` [PATCH v10 6/9] jbd2: fast commit recovery path Harshad Shirwadkar
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar, kernel test robot

This patch adds main fast commit commit path handlers. The overall
patch can be divided into two inter-related parts:

(A) Metadata updates tracking

    This part consists of helper functions to track changes that need
    to be committed during a commit operation. These updates are
    maintained by Ext4 in different in-memory queues. Following are
    the APIs and their short description that are implemented in this
    patch:

    - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat
      operations
    - ext4_fc_track_range() - Track changed logical block offsets
      inodes
    - ext4_fc_track_inode() - Track inodes
    - ext4_fc_mark_ineligible() - Mark file system fast commit
      ineligible()
    - ext4_fc_start_update() / ext4_fc_stop_update() /
      ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These
      functions are useful for co-ordinating inode updates with
      commits.

(B) Main commit Path

    This part consists of functions to convert updates tracked in
    in-memory data structures into on-disk commits. Function
    ext4_fc_commit() is the main entry point to commit path.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/acl.c               |    2 +
 fs/ext4/ext4.h              |   67 ++
 fs/ext4/extents.c           |   48 +-
 fs/ext4/fast_commit.c       | 1172 +++++++++++++++++++++++++++++++++++
 fs/ext4/fast_commit.h       |  110 ++++
 fs/ext4/file.c              |   10 +-
 fs/ext4/fsync.c             |    2 +-
 fs/ext4/inode.c             |   41 +-
 fs/ext4/ioctl.c             |   16 +-
 fs/ext4/namei.c             |   37 +-
 fs/ext4/super.c             |   31 +
 fs/ext4/xattr.c             |    3 +
 include/trace/events/ext4.h |  172 +++++
 13 files changed, 1685 insertions(+), 26 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 76f634d185f1..68aaed48315f 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -242,6 +242,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
+	ext4_fc_start_update(inode);
 
 	if ((type == ACL_TYPE_ACCESS) && acl) {
 		error = posix_acl_update_mode(inode, &mode, &acl);
@@ -259,6 +260,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	}
 out_stop:
 	ext4_journal_stop(handle);
+	ext4_fc_stop_update(inode);
 	if (error == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 	return error;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 2c412d32db0f..6b291cad72be 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1021,6 +1021,28 @@ struct ext4_inode_info {
 
 	struct list_head i_orphan;	/* unlinked but open inodes */
 
+	/* Fast commit related info */
+
+	struct list_head i_fc_list;	/*
+					 * inodes that need fast commit
+					 * protected by sbi->s_fc_lock.
+					 */
+
+	/* Start of lblk range that needs to be committed in this fast commit */
+	ext4_lblk_t i_fc_lblk_start;
+
+	/* End of lblk range that needs to be committed in this fast commit */
+	ext4_lblk_t i_fc_lblk_len;
+
+	/* Number of ongoing updates on this inode */
+	atomic_t  i_fc_updates;
+
+	/* Fast commit wait queue for this inode */
+	wait_queue_head_t i_fc_wait;
+
+	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
+	struct mutex i_fc_lock;
+
 	/*
 	 * i_disksize keeps track of what the inode size is ON DISK, not
 	 * in memory.  During truncate, i_size is set to the new size by
@@ -1141,6 +1163,10 @@ struct ext4_inode_info {
 #define	EXT4_VALID_FS			0x0001	/* Unmounted cleanly */
 #define	EXT4_ERROR_FS			0x0002	/* Errors detected */
 #define	EXT4_ORPHAN_FS			0x0004	/* Orphans being recovered */
+#define EXT4_FC_INELIGIBLE		0x0008	/* Fast commit ineligible */
+#define EXT4_FC_COMMITTING		0x0010	/* File system underoing a fast
+						 * commit.
+						 */
 
 /*
  * Misc. filesystem flags
@@ -1613,6 +1639,30 @@ struct ext4_sb_info {
 	/* Record the errseq of the backing block device */
 	errseq_t s_bdev_wb_err;
 	spinlock_t s_bdev_wb_lock;
+
+	/* Ext4 fast commit stuff */
+	atomic_t s_fc_subtid;
+	atomic_t s_fc_ineligible_updates;
+	/*
+	 * After commit starts, the main queue gets locked, and the further
+	 * updates get added in the staging queue.
+	 */
+#define FC_Q_MAIN	0
+#define FC_Q_STAGING	1
+	struct list_head s_fc_q[2];	/* Inodes staged for fast commit
+					 * that have data changes in them.
+					 */
+	struct list_head s_fc_dentry_q[2];	/* directory entry updates */
+	unsigned int s_fc_bytes;
+	/*
+	 * Main fast commit lock. This lock protects accesses to the
+	 * following fields:
+	 * ei->i_fc_list, s_fc_dentry_q, s_fc_q, s_fc_bytes, s_fc_bh.
+	 */
+	spinlock_t s_fc_lock;
+	struct buffer_head *s_fc_bh;
+	struct ext4_fc_stats s_fc_stats;
+	u64 s_fc_avg_commit_time;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1723,6 +1773,7 @@ enum {
 	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
 	EXT4_STATE_LUSTRE_EA_INODE,	/* Lustre-style ea_inode */
 	EXT4_STATE_VERITY_IN_PROGRESS,	/* building fs-verity Merkle tree */
+	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
@@ -2682,6 +2733,22 @@ extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
 /* fast_commit.c */
 
 void ext4_fc_init(struct super_block *sb, journal_t *journal);
+void ext4_fc_init_inode(struct inode *inode);
+void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
+			 ext4_lblk_t end);
+void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry);
+void ext4_fc_track_link(struct inode *inode, struct dentry *dentry);
+void ext4_fc_track_create(struct inode *inode, struct dentry *dentry);
+void ext4_fc_track_inode(struct inode *inode);
+void ext4_fc_mark_ineligible(struct super_block *sb, int reason);
+void ext4_fc_start_ineligible(struct super_block *sb, int reason);
+void ext4_fc_stop_ineligible(struct super_block *sb);
+void ext4_fc_start_update(struct inode *inode);
+void ext4_fc_stop_update(struct inode *inode);
+void ext4_fc_del(struct inode *inode);
+int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
+int __init ext4_fc_init_dentry_cache(void);
+
 /* mballoc.c */
 extern const struct seq_operations ext4_mb_seq_groups_ops;
 extern long ext4_mb_stats;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e46f3381ba4c..a2bb87d75500 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3723,6 +3723,7 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
 	err = ext4_ext_dirty(handle, inode, path + path->p_depth);
 out:
 	ext4_ext_show_leaf(inode, path);
+	ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
 	return err;
 }
 
@@ -3794,6 +3795,7 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
 	if (*allocated > map->m_len)
 		*allocated = map->m_len;
 	map->m_len = *allocated;
+	ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
 	return 0;
 }
 
@@ -4327,7 +4329,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	map->m_len = ar.len;
 	allocated = map->m_len;
 	ext4_ext_show_leaf(inode, path);
-
+	ext4_fc_track_range(inode, map->m_lblk, map->m_lblk + map->m_len - 1);
 out:
 	ext4_ext_drop_refs(path);
 	kfree(path);
@@ -4600,7 +4602,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	ret = ext4_mark_inode_dirty(handle, inode);
 	if (unlikely(ret))
 		goto out_handle;
-
+	ext4_fc_track_range(inode, offset >> inode->i_sb->s_blocksize_bits,
+			(offset + len - 1) >> inode->i_sb->s_blocksize_bits);
 	/* Zero out partial block at the edges of the range */
 	ret = ext4_zero_partial_blocks(handle, inode, offset, len);
 	if (ret >= 0)
@@ -4648,23 +4651,34 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
 		     FALLOC_FL_INSERT_RANGE))
 		return -EOPNOTSUPP;
+	ext4_fc_track_range(inode, offset >> blkbits,
+			(offset + len - 1) >> blkbits);
 
-	if (mode & FALLOC_FL_PUNCH_HOLE)
-		return ext4_punch_hole(inode, offset, len);
+	ext4_fc_start_update(inode);
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		ret = ext4_punch_hole(inode, offset, len);
+		goto exit;
+	}
 
 	ret = ext4_convert_inline_data(inode);
 	if (ret)
-		return ret;
+		goto exit;
 
-	if (mode & FALLOC_FL_COLLAPSE_RANGE)
-		return ext4_collapse_range(inode, offset, len);
-
-	if (mode & FALLOC_FL_INSERT_RANGE)
-		return ext4_insert_range(inode, offset, len);
+	if (mode & FALLOC_FL_COLLAPSE_RANGE) {
+		ret = ext4_collapse_range(inode, offset, len);
+		goto exit;
+	}
 
-	if (mode & FALLOC_FL_ZERO_RANGE)
-		return ext4_zero_range(file, offset, len, mode);
+	if (mode & FALLOC_FL_INSERT_RANGE) {
+		ret = ext4_insert_range(inode, offset, len);
+		goto exit;
+	}
 
+	if (mode & FALLOC_FL_ZERO_RANGE) {
+		ret = ext4_zero_range(file, offset, len, mode);
+		goto exit;
+	}
 	trace_ext4_fallocate_enter(inode, offset, len, mode);
 	lblk = offset >> blkbits;
 
@@ -4698,12 +4712,14 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		goto out;
 
 	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
-		ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
-						EXT4_I(inode)->i_sync_tid);
+		ret = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
+					EXT4_I(inode)->i_sync_tid);
 	}
 out:
 	inode_unlock(inode);
 	trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
+exit:
+	ext4_fc_stop_update(inode);
 	return ret;
 }
 
@@ -5291,6 +5307,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 		ret = PTR_ERR(handle);
 		goto out_mmap;
 	}
+	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE);
 
 	down_write(&EXT4_I(inode)->i_data_sem);
 	ext4_discard_preallocations(inode, 0);
@@ -5329,6 +5346,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 
 out_stop:
 	ext4_journal_stop(handle);
+	ext4_fc_stop_ineligible(sb);
 out_mmap:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
@@ -5429,6 +5447,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 		ret = PTR_ERR(handle);
 		goto out_mmap;
 	}
+	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE);
 
 	/* Expand file to avoid data loss if there is error while shifting */
 	inode->i_size += len;
@@ -5503,6 +5522,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 
 out_stop:
 	ext4_journal_stop(handle);
+	ext4_fc_stop_ineligible(sb);
 out_mmap:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index f2d11b4c6b62..e0fa3bd18346 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -7,13 +7,1174 @@
  *
  * Ext4 fast commits routines.
  */
+#include "ext4.h"
 #include "ext4_jbd2.h"
+#include "ext4_extents.h"
+#include "mballoc.h"
+
+/*
+ * Ext4 Fast Commits
+ * -----------------
+ *
+ * Ext4 fast commits implement fine grained journalling for Ext4.
+ *
+ * Fast commits are organized as a log of tag-length-value (TLV) structs. (See
+ * struct ext4_fc_tl). Each TLV contains some delta that is replayed TLV by
+ * TLV during the recovery phase. For the scenarios for which we currently
+ * don't have replay code, fast commit falls back to full commits.
+ * Fast commits record delta in one of the following three categories.
+ *
+ * (A) Directory entry updates:
+ *
+ * - EXT4_FC_TAG_UNLINK		- records directory entry unlink
+ * - EXT4_FC_TAG_LINK		- records directory entry link
+ * - EXT4_FC_TAG_CREAT		- records inode and directory entry creation
+ *
+ * (B) File specific data range updates:
+ *
+ * - EXT4_FC_TAG_ADD_RANGE	- records addition of new blocks to an inode
+ * - EXT4_FC_TAG_DEL_RANGE	- records deletion of blocks from an inode
+ *
+ * (C) Inode metadata (mtime / ctime etc):
+ *
+ * - EXT4_FC_TAG_INODE		- record the inode that should be replayed
+ *				  during recovery. Note that iblocks field is
+ *				  not replayed and instead derived during
+ *				  replay.
+ * Commit Operation
+ * ----------------
+ * With fast commits, we maintain all the directory entry operations in the
+ * order in which they are issued in an in-memory queue. This queue is flushed
+ * to disk during the commit operation. We also maintain a list of inodes
+ * that need to be committed during a fast commit in another in memory queue of
+ * inodes. During the commit operation, we commit in the following order:
+ *
+ * [1] Lock inodes for any further data updates by setting COMMITTING state
+ * [2] Submit data buffers of all the inodes
+ * [3] Wait for [2] to complete
+ * [4] Commit all the directory entry updates in the fast commit space
+ * [5] Commit all the changed inode structures
+ * [6] Write tail tag (this tag ensures the atomicity, please read the following
+ *     section for more details).
+ * [7] Wait for [4], [5] and [6] to complete.
+ *
+ * All the inode updates must call ext4_fc_start_update() before starting an
+ * update. If such an ongoing update is present, fast commit waits for it to
+ * complete. The completion of such an update is marked by
+ * ext4_fc_stop_update().
+ *
+ * Fast Commit Ineligibility
+ * -------------------------
+ * Not all operations are supported by fast commits today (e.g extended
+ * attributes). Fast commit ineligiblity is marked by calling one of the
+ * two following functions:
+ *
+ * - ext4_fc_mark_ineligible(): This makes next fast commit operation to fall
+ *   back to full commit. This is useful in case of transient errors.
+ *
+ * - ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() - This makes all
+ *   the fast commits happening between ext4_fc_start_ineligible() and
+ *   ext4_fc_stop_ineligible() and one fast commit after the call to
+ *   ext4_fc_stop_ineligible() to fall back to full commits. It is important to
+ *   make one more fast commit to fall back to full commit after stop call so
+ *   that it guaranteed that the fast commit ineligible operation contained
+ *   within ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() is
+ *   followed by at least 1 full commit.
+ *
+ * Atomicity of commits
+ * --------------------
+ * In order to gaurantee atomicity during the commit operation, fast commit
+ * uses "EXT4_FC_TAG_TAIL" tag that marks a fast commit as complete. Tail
+ * tag contains CRC of the contents and TID of the transaction after which
+ * this fast commit should be applied. Recovery code replays fast commit
+ * logs only if there's at least 1 valid tail present. For every fast commit
+ * operation, there is 1 tail. This means, we may end up with multiple tails
+ * in the fast commit space. Here's an example:
+ *
+ * - Create a new file A and remove existing file B
+ * - fsync()
+ * - Append contents to file A
+ * - Truncate file A
+ * - fsync()
+ *
+ * The fast commit space at the end of above operations would look like this:
+ *      [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL]
+ *             |<---  Fast Commit 1   --->|<---      Fast Commit 2     ---->|
+ *
+ * Replay code should thus check for all the valid tails in the FC area.
+ *
+ * TODOs
+ * -----
+ * 1) Make fast commit atomic updates more fine grained. Today, a fast commit
+ *    eligible update must be protected within ext4_fc_start_update() and
+ *    ext4_fc_stop_update(). These routines are called at much higher
+ *    routines. This can be made more fine grained by combining with
+ *    ext4_journal_start().
+ *
+ * 2) Same above for ext4_fc_start_ineligible() and ext4_fc_stop_ineligible()
+ *
+ * 3) Handle more ineligible cases.
+ */
+
+#include <trace/events/ext4.h>
+static struct kmem_cache *ext4_fc_dentry_cachep;
+
+static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+{
+	BUFFER_TRACE(bh, "");
+	if (uptodate) {
+		ext4_debug("%s: Block %lld up-to-date",
+			   __func__, bh->b_blocknr);
+		set_buffer_uptodate(bh);
+	} else {
+		ext4_debug("%s: Block %lld not up-to-date",
+			   __func__, bh->b_blocknr);
+		clear_buffer_uptodate(bh);
+	}
+
+	unlock_buffer(bh);
+}
+
+static inline void ext4_fc_reset_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	ei->i_fc_lblk_start = 0;
+	ei->i_fc_lblk_len = 0;
+}
+
+void ext4_fc_init_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	ext4_fc_reset_inode(inode);
+	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+	INIT_LIST_HEAD(&ei->i_fc_list);
+	init_waitqueue_head(&ei->i_fc_wait);
+	atomic_set(&ei->i_fc_updates, 0);
+}
+
+/*
+ * Inform Ext4's fast about start of an inode update
+ *
+ * This function is called by the high level call VFS callbacks before
+ * performing any inode update. This function blocks if there's an ongoing
+ * fast commit on the inode in question.
+ */
+void ext4_fc_start_update(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return;
+
+restart:
+	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+	if (list_empty(&ei->i_fc_list))
+		goto out;
+
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+		wait_queue_head_t *wq;
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+		schedule();
+		finish_wait(wq, &wait.wq_entry);
+		goto restart;
+	}
+out:
+	atomic_inc(&ei->i_fc_updates);
+	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+}
+
+/*
+ * Stop inode update and wake up waiting fast commits if any.
+ */
+void ext4_fc_stop_update(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return;
+
+	if (atomic_dec_and_test(&ei->i_fc_updates))
+		wake_up_all(&ei->i_fc_wait);
+}
+
+/*
+ * Remove inode from fast commit list. If the inode is being committed
+ * we wait until inode commit is done.
+ */
+void ext4_fc_del(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return;
+
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return;
+
+restart:
+	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+	if (list_empty(&ei->i_fc_list)) {
+		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+		return;
+	}
+
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+		wait_queue_head_t *wq;
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+		schedule();
+		finish_wait(wq, &wait.wq_entry);
+		goto restart;
+	}
+	if (!list_empty(&ei->i_fc_list))
+		list_del_init(&ei->i_fc_list);
+	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+}
+
+/*
+ * Mark file system as fast commit ineligible. This means that next commit
+ * operation would result in a full jbd2 commit.
+ */
+void ext4_fc_mark_ineligible(struct super_block *sb, int reason)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	sbi->s_mount_state |= EXT4_FC_INELIGIBLE;
+	WARN_ON(reason >= EXT4_FC_REASON_MAX);
+	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
+}
+
+/*
+ * Start a fast commit ineligible update. Any commits that happen while
+ * such an operation is in progress fall back to full commits.
+ */
+void ext4_fc_start_ineligible(struct super_block *sb, int reason)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	WARN_ON(reason >= EXT4_FC_REASON_MAX);
+	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
+	atomic_inc(&sbi->s_fc_ineligible_updates);
+}
+
+/*
+ * Stop a fast commit ineligible update. We set EXT4_FC_INELIGIBLE flag here
+ * to ensure that after stopping the ineligible update, at least one full
+ * commit takes place.
+ */
+void ext4_fc_stop_ineligible(struct super_block *sb)
+{
+	EXT4_SB(sb)->s_mount_state |= EXT4_FC_INELIGIBLE;
+	atomic_dec(&EXT4_SB(sb)->s_fc_ineligible_updates);
+}
+
+static inline int ext4_fc_is_ineligible(struct super_block *sb)
+{
+	return (EXT4_SB(sb)->s_mount_state & EXT4_FC_INELIGIBLE) ||
+		atomic_read(&EXT4_SB(sb)->s_fc_ineligible_updates);
+}
+
+/*
+ * Generic fast commit tracking function. If this is the first time this we are
+ * called after a full commit, we initialize fast commit fields and then call
+ * __fc_track_fn() with update = 0. If we have already been called after a full
+ * commit, we pass update = 1. Based on that, the track function can determine
+ * if it needs to track a field for the first time or if it needs to just
+ * update the previously tracked value.
+ *
+ * If enqueue is set, this function enqueues the inode in fast commit list.
+ */
+static int ext4_fc_track_template(
+	struct inode *inode, int (*__fc_track_fn)(struct inode *, void *, bool),
+	void *args, int enqueue)
+{
+	tid_t running_txn_tid;
+	bool update = false;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	int ret;
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return -EOPNOTSUPP;
+
+	if (ext4_fc_is_ineligible(inode->i_sb))
+		return -EINVAL;
+
+	running_txn_tid = sbi->s_journal ?
+		sbi->s_journal->j_commit_sequence + 1 : 0;
+
+	mutex_lock(&ei->i_fc_lock);
+	if (running_txn_tid == ei->i_sync_tid) {
+		update = true;
+	} else {
+		ext4_fc_reset_inode(inode);
+		ei->i_sync_tid = running_txn_tid;
+	}
+	ret = __fc_track_fn(inode, args, update);
+	mutex_unlock(&ei->i_fc_lock);
+
+	if (!enqueue)
+		return ret;
+
+	spin_lock(&sbi->s_fc_lock);
+	if (list_empty(&EXT4_I(inode)->i_fc_list))
+		list_add_tail(&EXT4_I(inode)->i_fc_list,
+				(sbi->s_mount_state & EXT4_FC_COMMITTING) ?
+				&sbi->s_fc_q[FC_Q_STAGING] :
+				&sbi->s_fc_q[FC_Q_MAIN]);
+	spin_unlock(&sbi->s_fc_lock);
+
+	return ret;
+}
+
+struct __track_dentry_update_args {
+	struct dentry *dentry;
+	int op;
+};
+
+/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */
+static int __track_dentry_update(struct inode *inode, void *arg, bool update)
+{
+	struct ext4_fc_dentry_update *node;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct __track_dentry_update_args *dentry_update =
+		(struct __track_dentry_update_args *)arg;
+	struct dentry *dentry = dentry_update->dentry;
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	mutex_unlock(&ei->i_fc_lock);
+	node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS);
+	if (!node) {
+		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_MEM);
+		mutex_lock(&ei->i_fc_lock);
+		return -ENOMEM;
+	}
+
+	node->fcd_op = dentry_update->op;
+	node->fcd_parent = dentry->d_parent->d_inode->i_ino;
+	node->fcd_ino = inode->i_ino;
+	if (dentry->d_name.len > DNAME_INLINE_LEN) {
+		node->fcd_name.name = kmalloc(dentry->d_name.len, GFP_NOFS);
+		if (!node->fcd_name.name) {
+			kmem_cache_free(ext4_fc_dentry_cachep, node);
+			ext4_fc_mark_ineligible(inode->i_sb,
+				EXT4_FC_REASON_MEM);
+			mutex_lock(&ei->i_fc_lock);
+			return -ENOMEM;
+		}
+		memcpy((u8 *)node->fcd_name.name, dentry->d_name.name,
+			dentry->d_name.len);
+	} else {
+		memcpy(node->fcd_iname, dentry->d_name.name,
+			dentry->d_name.len);
+		node->fcd_name.name = node->fcd_iname;
+	}
+	node->fcd_name.len = dentry->d_name.len;
+
+	spin_lock(&sbi->s_fc_lock);
+	if (sbi->s_mount_state & EXT4_FC_COMMITTING)
+		list_add_tail(&node->fcd_list,
+				&sbi->s_fc_dentry_q[FC_Q_STAGING]);
+	else
+		list_add_tail(&node->fcd_list, &sbi->s_fc_dentry_q[FC_Q_MAIN]);
+	spin_unlock(&sbi->s_fc_lock);
+	mutex_lock(&ei->i_fc_lock);
+
+	return 0;
+}
+
+void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry)
+{
+	struct __track_dentry_update_args args;
+	int ret;
+
+	args.dentry = dentry;
+	args.op = EXT4_FC_TAG_UNLINK;
+
+	ret = ext4_fc_track_template(inode, __track_dentry_update,
+					(void *)&args, 0);
+	trace_ext4_fc_track_unlink(inode, dentry, ret);
+}
+
+void ext4_fc_track_link(struct inode *inode, struct dentry *dentry)
+{
+	struct __track_dentry_update_args args;
+	int ret;
+
+	args.dentry = dentry;
+	args.op = EXT4_FC_TAG_LINK;
+
+	ret = ext4_fc_track_template(inode, __track_dentry_update,
+					(void *)&args, 0);
+	trace_ext4_fc_track_link(inode, dentry, ret);
+}
+
+void ext4_fc_track_create(struct inode *inode, struct dentry *dentry)
+{
+	struct __track_dentry_update_args args;
+	int ret;
+
+	args.dentry = dentry;
+	args.op = EXT4_FC_TAG_CREAT;
+
+	ret = ext4_fc_track_template(inode, __track_dentry_update,
+					(void *)&args, 0);
+	trace_ext4_fc_track_create(inode, dentry, ret);
+}
+
+/* __track_fn for inode tracking */
+static int __track_inode(struct inode *inode, void *arg, bool update)
+{
+	if (update)
+		return -EEXIST;
+
+	EXT4_I(inode)->i_fc_lblk_len = 0;
+
+	return 0;
+}
+
+void ext4_fc_track_inode(struct inode *inode)
+{
+	int ret;
+
+	if (S_ISDIR(inode->i_mode))
+		return;
+
+	ret = ext4_fc_track_template(inode, __track_inode, NULL, 1);
+	trace_ext4_fc_track_inode(inode, ret);
+}
+
+struct __track_range_args {
+	ext4_lblk_t start, end;
+};
+
+/* __track_fn for tracking data updates */
+static int __track_range(struct inode *inode, void *arg, bool update)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	ext4_lblk_t oldstart;
+	struct __track_range_args *__arg =
+		(struct __track_range_args *)arg;
+
+	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb)) {
+		ext4_debug("Special inode %ld being modified\n", inode->i_ino);
+		return -ECANCELED;
+	}
+
+	oldstart = ei->i_fc_lblk_start;
+
+	if (update && ei->i_fc_lblk_len > 0) {
+		ei->i_fc_lblk_start = min(ei->i_fc_lblk_start, __arg->start);
+		ei->i_fc_lblk_len =
+			max(oldstart + ei->i_fc_lblk_len - 1, __arg->end) -
+				ei->i_fc_lblk_start + 1;
+	} else {
+		ei->i_fc_lblk_start = __arg->start;
+		ei->i_fc_lblk_len = __arg->end - __arg->start + 1;
+	}
+
+	return 0;
+}
+
+void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
+			 ext4_lblk_t end)
+{
+	struct __track_range_args args;
+	int ret;
+
+	if (S_ISDIR(inode->i_mode))
+		return;
+
+	args.start = start;
+	args.end = end;
+
+	ret = ext4_fc_track_template(inode,  __track_range, &args, 1);
+
+	trace_ext4_fc_track_range(inode, start, end, ret);
+}
+
+static void ext4_fc_submit_bh(struct super_block *sb)
+{
+	int write_flags = REQ_SYNC;
+	struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
+
+	if (test_opt(sb, BARRIER))
+		write_flags |= REQ_FUA | REQ_PREFLUSH;
+	lock_buffer(bh);
+	clear_buffer_dirty(bh);
+	set_buffer_uptodate(bh);
+	bh->b_end_io = ext4_end_buffer_io_sync;
+	submit_bh(REQ_OP_WRITE, write_flags, bh);
+	EXT4_SB(sb)->s_fc_bh = NULL;
+}
+
+/* Ext4 commit path routines */
+
+/* memzero and update CRC */
+static void *ext4_fc_memzero(struct super_block *sb, void *dst, int len,
+				u32 *crc)
+{
+	void *ret;
+
+	ret = memset(dst, 0, len);
+	if (crc)
+		*crc = ext4_chksum(EXT4_SB(sb), *crc, dst, len);
+	return ret;
+}
+
+/*
+ * Allocate len bytes on a fast commit buffer.
+ *
+ * During the commit time this function is used to manage fast commit
+ * block space. We don't split a fast commit log onto different
+ * blocks. So this function makes sure that if there's not enough space
+ * on the current block, the remaining space in the current block is
+ * marked as unused by adding EXT4_FC_TAG_PAD tag. In that case,
+ * new block is from jbd2 and CRC is updated to reflect the padding
+ * we added.
+ */
+static u8 *ext4_fc_reserve_space(struct super_block *sb, int len, u32 *crc)
+{
+	struct ext4_fc_tl *tl;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct buffer_head *bh;
+	int bsize = sbi->s_journal->j_blocksize;
+	int ret, off = sbi->s_fc_bytes % bsize;
+	int pad_len;
+
+	/*
+	 * After allocating len, we should have space at least for a 0 byte
+	 * padding.
+	 */
+	if (len + sizeof(struct ext4_fc_tl) > bsize)
+		return NULL;
+
+	if (bsize - off - 1 > len + sizeof(struct ext4_fc_tl)) {
+		/*
+		 * Only allocate from current buffer if we have enough space for
+		 * this request AND we have space to add a zero byte padding.
+		 */
+		if (!sbi->s_fc_bh) {
+			ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
+			if (ret)
+				return NULL;
+			sbi->s_fc_bh = bh;
+		}
+		sbi->s_fc_bytes += len;
+		return sbi->s_fc_bh->b_data + off;
+	}
+	/* Need to add PAD tag */
+	tl = (struct ext4_fc_tl *)(sbi->s_fc_bh->b_data + off);
+	tl->fc_tag = cpu_to_le16(EXT4_FC_TAG_PAD);
+	pad_len = bsize - off - 1 - sizeof(struct ext4_fc_tl);
+	tl->fc_len = cpu_to_le16(pad_len);
+	if (crc)
+		*crc = ext4_chksum(sbi, *crc, tl, sizeof(*tl));
+	if (pad_len > 0)
+		ext4_fc_memzero(sb, tl + 1, pad_len, crc);
+	ext4_fc_submit_bh(sb);
+
+	ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
+	if (ret)
+		return NULL;
+	sbi->s_fc_bh = bh;
+	sbi->s_fc_bytes = (sbi->s_fc_bytes / bsize + 1) * bsize + len;
+	return sbi->s_fc_bh->b_data;
+}
+
+/* memcpy to fc reserved space and update CRC */
+static void *ext4_fc_memcpy(struct super_block *sb, void *dst, const void *src,
+				int len, u32 *crc)
+{
+	if (crc)
+		*crc = ext4_chksum(EXT4_SB(sb), *crc, src, len);
+	return memcpy(dst, src, len);
+}
+
+/*
+ * Complete a fast commit by writing tail tag.
+ *
+ * Writing tail tag marks the end of a fast commit. In order to guarantee
+ * atomicity, after writing tail tag, even if there's space remaining
+ * in the block, next commit shouldn't use it. That's why tail tag
+ * has the length as that of the remaining space on the block.
+ */
+static int ext4_fc_write_tail(struct super_block *sb, u32 crc)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_tl tl;
+	struct ext4_fc_tail tail;
+	int off, bsize = sbi->s_journal->j_blocksize;
+	u8 *dst;
+
+	/*
+	 * ext4_fc_reserve_space takes care of allocating an extra block if
+	 * there's no enough space on this block for accommodating this tail.
+	 */
+	dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc);
+	if (!dst)
+		return -ENOSPC;
+
+	off = sbi->s_fc_bytes % bsize;
+
+	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL);
+	tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail));
+	sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize);
+
+	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc);
+	dst += sizeof(tl);
+	tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid);
+	ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc);
+	dst += sizeof(tail.fc_tid);
+	tail.fc_crc = cpu_to_le32(crc);
+	ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL);
+
+	ext4_fc_submit_bh(sb);
+
+	return 0;
+}
+
+/*
+ * Adds tag, length, value and updates CRC. Returns true if tlv was added.
+ * Returns false if there's not enough space.
+ */
+static bool ext4_fc_add_tlv(struct super_block *sb, u16 tag, u16 len, u8 *val,
+			   u32 *crc)
+{
+	struct ext4_fc_tl tl;
+	u8 *dst;
+
+	dst = ext4_fc_reserve_space(sb, sizeof(tl) + len, crc);
+	if (!dst)
+		return false;
+
+	tl.fc_tag = cpu_to_le16(tag);
+	tl.fc_len = cpu_to_le16(len);
+
+	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
+	ext4_fc_memcpy(sb, dst + sizeof(tl), val, len, crc);
+
+	return true;
+}
+
+/* Same as above, but adds dentry tlv. */
+static  bool ext4_fc_add_dentry_tlv(struct super_block *sb, u16 tag,
+					int parent_ino, int ino, int dlen,
+					const unsigned char *dname,
+					u32 *crc)
+{
+	struct ext4_fc_dentry_info fcd;
+	struct ext4_fc_tl tl;
+	u8 *dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(fcd) + dlen,
+					crc);
+
+	if (!dst)
+		return false;
+
+	fcd.fc_parent_ino = cpu_to_le32(parent_ino);
+	fcd.fc_ino = cpu_to_le32(ino);
+	tl.fc_tag = cpu_to_le16(tag);
+	tl.fc_len = cpu_to_le16(sizeof(fcd) + dlen);
+	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
+	dst += sizeof(tl);
+	ext4_fc_memcpy(sb, dst, &fcd, sizeof(fcd), crc);
+	dst += sizeof(fcd);
+	ext4_fc_memcpy(sb, dst, dname, dlen, crc);
+	dst += dlen;
+
+	return true;
+}
+
+/*
+ * Writes inode in the fast commit space under TLV with tag @tag.
+ * Returns 0 on success, error on failure.
+ */
+static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
+	int ret;
+	struct ext4_iloc iloc;
+	struct ext4_fc_inode fc_inode;
+	struct ext4_fc_tl tl;
+	u8 *dst;
+
+	ret = ext4_get_inode_loc(inode, &iloc);
+	if (ret)
+		return ret;
+
+	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
+		inode_len += ei->i_extra_isize;
+
+	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
+	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
+	tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino));
+
+	dst = ext4_fc_reserve_space(inode->i_sb,
+			sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc);
+	if (!dst)
+		return -ECANCELED;
+
+	if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc))
+		return -ECANCELED;
+	dst += sizeof(tl);
+	if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc))
+		return -ECANCELED;
+	dst += sizeof(fc_inode);
+	if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc),
+					inode_len, crc))
+		return -ECANCELED;
+
+	return 0;
+}
+
+/*
+ * Writes updated data ranges for the inode in question. Updates CRC.
+ * Returns 0 on success, error otherwise.
+ */
+static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
+{
+	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_map_blocks map;
+	struct ext4_fc_add_range fc_ext;
+	struct ext4_fc_del_range lrange;
+	struct ext4_extent *ex;
+	int ret;
+
+	mutex_lock(&ei->i_fc_lock);
+	if (ei->i_fc_lblk_len == 0) {
+		mutex_unlock(&ei->i_fc_lock);
+		return 0;
+	}
+	old_blk_size = ei->i_fc_lblk_start;
+	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
+	ei->i_fc_lblk_len = 0;
+	mutex_unlock(&ei->i_fc_lock);
+
+	cur_lblk_off = old_blk_size;
+	jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
+		  __func__, cur_lblk_off, new_blk_size, inode->i_ino);
+
+	while (cur_lblk_off <= new_blk_size) {
+		map.m_lblk = cur_lblk_off;
+		map.m_len = new_blk_size - cur_lblk_off + 1;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0)
+			return -ECANCELED;
+
+		if (map.m_len == 0) {
+			cur_lblk_off++;
+			continue;
+		}
+
+		if (ret == 0) {
+			lrange.fc_ino = cpu_to_le32(inode->i_ino);
+			lrange.fc_lblk = cpu_to_le32(map.m_lblk);
+			lrange.fc_len = cpu_to_le32(map.m_len);
+			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
+					    sizeof(lrange), (u8 *)&lrange, crc))
+				return -ENOSPC;
+		} else {
+			fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
+			ex = (struct ext4_extent *)&fc_ext.fc_ex;
+			ex->ee_block = cpu_to_le32(map.m_lblk);
+			ex->ee_len = cpu_to_le16(map.m_len);
+			ext4_ext_store_pblock(ex, map.m_pblk);
+			if (map.m_flags & EXT4_MAP_UNWRITTEN)
+				ext4_ext_mark_unwritten(ex);
+			else
+				ext4_ext_mark_initialized(ex);
+			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
+					    sizeof(fc_ext), (u8 *)&fc_ext, crc))
+				return -ENOSPC;
+		}
+
+		cur_lblk_off += map.m_len;
+	}
+
+	return 0;
+}
+
+
+/* Submit data for all the fast commit inodes */
+static int ext4_fc_submit_inode_data_all(journal_t *journal)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *ei;
+	struct list_head *pos;
+	int ret = 0;
+
+	spin_lock(&sbi->s_fc_lock);
+	sbi->s_mount_state |= EXT4_FC_COMMITTING;
+	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
+		ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		ext4_set_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING);
+		while (atomic_read(&ei->i_fc_updates)) {
+			DEFINE_WAIT(wait);
+
+			prepare_to_wait(&ei->i_fc_wait, &wait,
+						TASK_UNINTERRUPTIBLE);
+			if (atomic_read(&ei->i_fc_updates)) {
+				spin_unlock(&sbi->s_fc_lock);
+				schedule();
+				spin_lock(&sbi->s_fc_lock);
+			}
+			finish_wait(&ei->i_fc_wait, &wait);
+		}
+		spin_unlock(&sbi->s_fc_lock);
+		ret = jbd2_submit_inode_data(ei->jinode);
+		if (ret)
+			return ret;
+		spin_lock(&sbi->s_fc_lock);
+	}
+	spin_unlock(&sbi->s_fc_lock);
+
+	return ret;
+}
+
+/* Wait for completion of data for all the fast commit inodes */
+static int ext4_fc_wait_inode_data_all(journal_t *journal)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *pos, *n;
+	int ret = 0;
+
+	spin_lock(&sbi->s_fc_lock);
+	list_for_each_entry_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		if (!ext4_test_inode_state(&pos->vfs_inode,
+					   EXT4_STATE_FC_COMMITTING))
+			continue;
+		spin_unlock(&sbi->s_fc_lock);
+
+		ret = jbd2_wait_inode_data(journal, pos->jinode);
+		if (ret)
+			return ret;
+		spin_lock(&sbi->s_fc_lock);
+	}
+	spin_unlock(&sbi->s_fc_lock);
+
+	return 0;
+}
+
+/* Commit all the directory entry updates */
+static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_dentry_update *fc_dentry;
+	struct inode *inode;
+	struct list_head *pos, *n, *fcd_pos, *fcd_n;
+	struct ext4_inode_info *ei;
+	int ret;
+
+	if (list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN]))
+		return 0;
+	list_for_each_safe(fcd_pos, fcd_n, &sbi->s_fc_dentry_q[FC_Q_MAIN]) {
+		fc_dentry = list_entry(fcd_pos, struct ext4_fc_dentry_update,
+					fcd_list);
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT) {
+			spin_unlock(&sbi->s_fc_lock);
+			if (!ext4_fc_add_dentry_tlv(
+				sb, fc_dentry->fcd_op,
+				fc_dentry->fcd_parent, fc_dentry->fcd_ino,
+				fc_dentry->fcd_name.len,
+				fc_dentry->fcd_name.name, crc)) {
+				return -ENOSPC;
+			}
+			spin_lock(&sbi->s_fc_lock);
+			continue;
+		}
+
+		inode = NULL;
+		list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
+			ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
+			if (ei->vfs_inode.i_ino == fc_dentry->fcd_ino) {
+				inode = &ei->vfs_inode;
+				break;
+			}
+		}
+		/*
+		 * If we don't find inode in our list, then it was deleted,
+		 * in which case, we don't need to record it's create tag.
+		 */
+		if (!inode)
+			continue;
+		spin_unlock(&sbi->s_fc_lock);
+
+		/*
+		 * We first write the inode and then the create dirent. This
+		 * allows the recovery code to create an unnamed inode first
+		 * and then link it to a directory entry. This allows us
+		 * to use namei.c routines almost as is and simplifies
+		 * the recovery code.
+		 */
+		ret = ext4_fc_write_inode(inode, crc);
+		if (ret)
+			return ret;
+		ret = ext4_fc_write_inode_data(inode, crc);
+		if (ret)
+			return ret;
+
+		if (!ext4_fc_add_dentry_tlv(
+			sb, fc_dentry->fcd_op,
+			fc_dentry->fcd_parent, fc_dentry->fcd_ino,
+			fc_dentry->fcd_name.len,
+			fc_dentry->fcd_name.name, crc))
+			return -ENOSPC;
+
+		spin_lock(&sbi->s_fc_lock);
+	}
+	return 0;
+}
+
+static int ext4_fc_perform_commit(journal_t *journal)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_head head;
+	struct list_head *pos;
+	struct inode *inode;
+	struct blk_plug plug;
+	int ret = 0;
+	u32 crc = 0;
+
+	ret = ext4_fc_submit_inode_data_all(journal);
+	if (ret)
+		return ret;
+
+	ret = ext4_fc_wait_inode_data_all(journal);
+	if (ret)
+		return ret;
+
+	blk_start_plug(&plug);
+	if (sbi->s_fc_bytes == 0) {
+		/*
+		 * Add a head tag only if this is the first fast commit
+		 * in this TID.
+		 */
+		head.fc_features = cpu_to_le32(EXT4_FC_SUPPORTED_FEATURES);
+		head.fc_tid = cpu_to_le32(
+			sbi->s_journal->j_running_transaction->t_tid);
+		if (!ext4_fc_add_tlv(sb, EXT4_FC_TAG_HEAD, sizeof(head),
+			(u8 *)&head, &crc))
+			goto out;
+	}
+
+	spin_lock(&sbi->s_fc_lock);
+	ret = ext4_fc_commit_dentry_updates(journal, &crc);
+	if (ret) {
+		spin_unlock(&sbi->s_fc_lock);
+		goto out;
+	}
+
+	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		inode = &iter->vfs_inode;
+		if (!ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+			continue;
+
+		spin_unlock(&sbi->s_fc_lock);
+		ret = ext4_fc_write_inode_data(inode, &crc);
+		if (ret)
+			goto out;
+		ret = ext4_fc_write_inode(inode, &crc);
+		if (ret)
+			goto out;
+		spin_lock(&sbi->s_fc_lock);
+	}
+	spin_unlock(&sbi->s_fc_lock);
+
+	ret = ext4_fc_write_tail(sb, crc);
+
+out:
+	blk_finish_plug(&plug);
+	return ret;
+}
+
+/*
+ * The main commit entry point. Performs a fast commit for transaction
+ * commit_tid if needed. If it's not possible to perform a fast commit
+ * due to various reasons, we fall back to full commit. Returns 0
+ * on success, error otherwise.
+ */
+int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	int nblks = 0, ret, bsize = journal->j_blocksize;
+	int subtid = atomic_read(&sbi->s_fc_subtid);
+	int reason = EXT4_FC_REASON_OK, fc_bufs_before = 0;
+	ktime_t start_time, commit_time;
+
+	trace_ext4_fc_commit_start(sb);
+
+	start_time = ktime_get();
+
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+		(ext4_fc_is_ineligible(sb))) {
+		reason = EXT4_FC_REASON_INELIGIBLE;
+		goto out;
+	}
+
+restart_fc:
+	ret = jbd2_fc_begin_commit(journal, commit_tid);
+	if (ret == -EALREADY) {
+		/* There was an ongoing commit, check if we need to restart */
+		if (atomic_read(&sbi->s_fc_subtid) <= subtid &&
+			commit_tid > journal->j_commit_sequence)
+			goto restart_fc;
+		reason = EXT4_FC_REASON_ALREADY_COMMITTED;
+		goto out;
+	} else if (ret) {
+		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
+		reason = EXT4_FC_REASON_FC_START_FAILED;
+		goto out;
+	}
+
+	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
+	ret = ext4_fc_perform_commit(journal);
+	if (ret < 0) {
+		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
+		reason = EXT4_FC_REASON_FC_FAILED;
+		goto out;
+	}
+	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
+	ret = jbd2_fc_wait_bufs(journal, nblks);
+	if (ret < 0) {
+		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
+		reason = EXT4_FC_REASON_FC_FAILED;
+		goto out;
+	}
+	atomic_inc(&sbi->s_fc_subtid);
+	jbd2_fc_end_commit(journal);
+out:
+	/* Has any ineligible update happened since we started? */
+	if (reason == EXT4_FC_REASON_OK && ext4_fc_is_ineligible(sb)) {
+		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
+		reason = EXT4_FC_REASON_INELIGIBLE;
+	}
+
+	spin_lock(&sbi->s_fc_lock);
+	if (reason != EXT4_FC_REASON_OK &&
+		reason != EXT4_FC_REASON_ALREADY_COMMITTED) {
+		sbi->s_fc_stats.fc_ineligible_commits++;
+	} else {
+		sbi->s_fc_stats.fc_num_commits++;
+		sbi->s_fc_stats.fc_numblks += nblks;
+	}
+	spin_unlock(&sbi->s_fc_lock);
+	nblks = (reason == EXT4_FC_REASON_OK) ? nblks : 0;
+	trace_ext4_fc_commit_stop(sb, nblks, reason);
+	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+	/*
+	 * weight the commit time higher than the average time so we don't
+	 * react too strongly to vast changes in the commit time
+	 */
+	if (likely(sbi->s_fc_avg_commit_time))
+		sbi->s_fc_avg_commit_time = (commit_time +
+				sbi->s_fc_avg_commit_time * 3) / 4;
+	else
+		sbi->s_fc_avg_commit_time = commit_time;
+	jbd_debug(1,
+		"Fast commit ended with blks = %d, reason = %d, subtid - %d",
+		nblks, reason, subtid);
+	if (reason == EXT4_FC_REASON_FC_FAILED)
+		return jbd2_fc_end_commit_fallback(journal, commit_tid);
+	if (reason == EXT4_FC_REASON_FC_START_FAILED ||
+		reason == EXT4_FC_REASON_INELIGIBLE)
+		return jbd2_complete_transaction(journal, commit_tid);
+	return 0;
+}
+
 /*
  * Fast commit cleanup routine. This is called after every fast commit and
  * full commit. full is true if we are called after a full commit.
  */
 static void ext4_fc_cleanup(journal_t *journal, int full)
 {
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	struct list_head *pos, *n;
+
+	if (full && sbi->s_fc_bh)
+		sbi->s_fc_bh = NULL;
+
+	jbd2_fc_release_bufs(journal);
+
+	spin_lock(&sbi->s_fc_lock);
+	list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		list_del_init(&iter->i_fc_list);
+		ext4_clear_inode_state(&iter->vfs_inode,
+				       EXT4_STATE_FC_COMMITTING);
+		ext4_fc_reset_inode(&iter->vfs_inode);
+		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
+		smp_mb();
+#if (BITS_PER_LONG < 64)
+		wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
+#else
+		wake_up_bit(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
+#endif
+	}
+
+	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
+		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
+					     struct ext4_fc_dentry_update,
+					     fcd_list);
+		list_del_init(&fc_dentry->fcd_list);
+		spin_unlock(&sbi->s_fc_lock);
+
+		if (fc_dentry->fcd_name.name &&
+			fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
+			kfree(fc_dentry->fcd_name.name);
+		kmem_cache_free(ext4_fc_dentry_cachep, fc_dentry);
+		spin_lock(&sbi->s_fc_lock);
+	}
+
+	list_splice_init(&sbi->s_fc_dentry_q[FC_Q_STAGING],
+				&sbi->s_fc_dentry_q[FC_Q_MAIN]);
+	list_splice_init(&sbi->s_fc_q[FC_Q_STAGING],
+				&sbi->s_fc_q[FC_Q_STAGING]);
+
+	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
+	sbi->s_mount_state &= ~EXT4_FC_INELIGIBLE;
+
+	if (full)
+		sbi->s_fc_bytes = 0;
+	spin_unlock(&sbi->s_fc_lock);
+	trace_ext4_fc_stats(sb);
 }
 
 void ext4_fc_init(struct super_block *sb, journal_t *journal)
@@ -26,3 +1187,14 @@ void ext4_fc_init(struct super_block *sb, journal_t *journal)
 		ext4_clear_feature_fast_commit(sb);
 	}
 }
+
+int __init ext4_fc_init_dentry_cache(void)
+{
+	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
+					   SLAB_RECLAIM_ACCOUNT);
+
+	if (ext4_fc_dentry_cachep == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
index 8362bf5e6e00..560bc9ca8c79 100644
--- a/fs/ext4/fast_commit.h
+++ b/fs/ext4/fast_commit.h
@@ -6,4 +6,114 @@
 /* Number of blocks in journal area to allocate for fast commits */
 #define EXT4_NUM_FC_BLKS		256
 
+/* Fast commit tags */
+#define EXT4_FC_TAG_ADD_RANGE		0x0001
+#define EXT4_FC_TAG_DEL_RANGE		0x0002
+#define EXT4_FC_TAG_CREAT		0x0003
+#define EXT4_FC_TAG_LINK		0x0004
+#define EXT4_FC_TAG_UNLINK		0x0005
+#define EXT4_FC_TAG_INODE		0x0006
+#define EXT4_FC_TAG_PAD			0x0007
+#define EXT4_FC_TAG_TAIL		0x0008
+#define EXT4_FC_TAG_HEAD		0x0009
+
+#define EXT4_FC_SUPPORTED_FEATURES	0x0
+
+/* On disk fast commit tlv value structures */
+
+/* Fast commit on disk tag length structure */
+struct ext4_fc_tl {
+	__le16 fc_tag;
+	__le16 fc_len;
+};
+
+/* Value structure for tag EXT4_FC_TAG_HEAD. */
+struct ext4_fc_head {
+	__le32 fc_features;
+	__le32 fc_tid;
+};
+
+/* Value structure for EXT4_FC_TAG_ADD_RANGE. */
+struct ext4_fc_add_range {
+	__le32 fc_ino;
+	__u8 fc_ex[12];
+};
+
+/* Value structure for tag EXT4_FC_TAG_DEL_RANGE. */
+struct ext4_fc_del_range {
+	__le32 fc_ino;
+	__le32 fc_lblk;
+	__le32 fc_len;
+};
+
+/*
+ * This is the value structure for tags EXT4_FC_TAG_CREAT, EXT4_FC_TAG_LINK
+ * and EXT4_FC_TAG_UNLINK.
+ */
+struct ext4_fc_dentry_info {
+	__le32 fc_parent_ino;
+	__le32 fc_ino;
+	u8 fc_dname[0];
+};
+
+/* Value structure for EXT4_FC_TAG_INODE and EXT4_FC_TAG_INODE_PARTIAL. */
+struct ext4_fc_inode {
+	__le32 fc_ino;
+	__u8 fc_raw_inode[0];
+};
+
+/* Value structure for tag EXT4_FC_TAG_TAIL. */
+struct ext4_fc_tail {
+	__le32 fc_tid;
+	__le32 fc_crc;
+};
+
+/*
+ * In memory list of dentry updates that are performed on the file
+ * system used by fast commit code.
+ */
+struct ext4_fc_dentry_update {
+	int fcd_op;		/* Type of update create / unlink / link */
+	int fcd_parent;		/* Parent inode number */
+	int fcd_ino;		/* Inode number */
+	struct qstr fcd_name;	/* Dirent name */
+	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */
+	struct list_head fcd_list;
+};
+
+/*
+ * Fast commit reason codes
+ */
+enum {
+	/*
+	 * Commit status codes:
+	 */
+	EXT4_FC_REASON_OK = 0,
+	EXT4_FC_REASON_INELIGIBLE,
+	EXT4_FC_REASON_ALREADY_COMMITTED,
+	EXT4_FC_REASON_FC_START_FAILED,
+	EXT4_FC_REASON_FC_FAILED,
+
+	/*
+	 * Fast commit ineligiblity reasons:
+	 */
+	EXT4_FC_REASON_XATTR = 0,
+	EXT4_FC_REASON_CROSS_RENAME,
+	EXT4_FC_REASON_JOURNAL_FLAG_CHANGE,
+	EXT4_FC_REASON_MEM,
+	EXT4_FC_REASON_SWAP_BOOT,
+	EXT4_FC_REASON_RESIZE,
+	EXT4_FC_REASON_RENAME_DIR,
+	EXT4_FC_REASON_FALLOC_RANGE,
+	EXT4_FC_COMMIT_FAILED,
+	EXT4_FC_REASON_MAX
+};
+
+struct ext4_fc_stats {
+	unsigned int fc_ineligible_reason_count[EXT4_FC_REASON_MAX];
+	unsigned long fc_num_commits;
+	unsigned long fc_ineligible_commits;
+	unsigned long fc_numblks;
+};
+
 #endif /* __FAST_COMMIT_H__ */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 02ffbd29d6b0..d85412d12e3a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -260,6 +260,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		return -EOPNOTSUPP;
 
+	ext4_fc_start_update(inode);
 	inode_lock(inode);
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0)
@@ -271,6 +272,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 
 out:
 	inode_unlock(inode);
+	ext4_fc_stop_update(inode);
 	if (likely(ret > 0)) {
 		iocb->ki_pos += ret;
 		ret = generic_write_sync(iocb, ret);
@@ -534,7 +536,9 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out;
 		}
 
+		ext4_fc_start_update(inode);
 		ret = ext4_orphan_add(handle, inode);
+		ext4_fc_stop_update(inode);
 		if (ret) {
 			ext4_journal_stop(handle);
 			goto out;
@@ -656,8 +660,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 #endif
 	if (iocb->ki_flags & IOCB_DIRECT)
 		return ext4_dio_write_iter(iocb, from);
-
-	return ext4_buffered_write_iter(iocb, from);
+	else
+		return ext4_buffered_write_iter(iocb, from);
 }
 
 #ifdef CONFIG_FS_DAX
@@ -757,6 +761,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!daxdev_mapping_supported(vma, dax_dev))
 		return -EOPNOTSUPP;
 
+	ext4_fc_start_update(inode);
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
@@ -764,6 +769,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
+	ext4_fc_stop_update(inode);
 	return 0;
 }
 
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 6476994d9861..81a545fd14a3 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -112,7 +112,7 @@ static int ext4_fsync_journal(struct inode *inode, bool datasync,
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		*needs_barrier = true;
 
-	return jbd2_complete_transaction(journal, commit_tid);
+	return ext4_fc_commit(journal, commit_tid);
 }
 
 /*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 09096fe6170e..f5e9c76c9b07 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -729,6 +729,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			if (ret)
 				return ret;
 		}
+		ext4_fc_track_range(inode, map->m_lblk,
+			    map->m_lblk + map->m_len - 1);
 	}
 
 	if (retval < 0)
@@ -4097,6 +4099,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 
 		up_write(&EXT4_I(inode)->i_data_sem);
 	}
+	ext4_fc_track_range(inode, first_block, stop_block);
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
 
@@ -4716,6 +4719,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	for (block = 0; block < EXT4_N_BLOCKS; block++)
 		ei->i_data[block] = raw_inode->i_block[block];
 	INIT_LIST_HEAD(&ei->i_orphan);
+	ext4_fc_init_inode(&ei->vfs_inode);
 
 	/*
 	 * Set transaction id's of transactions that have to be committed
@@ -5162,7 +5166,7 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 		if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 			return 0;
 
-		err = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
+		err = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
 						EXT4_I(inode)->i_sync_tid);
 	} else {
 		struct ext4_iloc iloc;
@@ -5291,6 +5295,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		if (error)
 			return error;
 	}
+	ext4_fc_start_update(inode);
 	if ((ia_valid & ATTR_UID && !uid_eq(attr->ia_uid, inode->i_uid)) ||
 	    (ia_valid & ATTR_GID && !gid_eq(attr->ia_gid, inode->i_gid))) {
 		handle_t *handle;
@@ -5314,6 +5319,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 
 		if (error) {
 			ext4_journal_stop(handle);
+			ext4_fc_stop_update(inode);
 			return error;
 		}
 		/* Update corresponding info in inode so that everything is in
@@ -5336,11 +5342,15 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
 			struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
-			if (attr->ia_size > sbi->s_bitmap_maxbytes)
+			if (attr->ia_size > sbi->s_bitmap_maxbytes) {
+				ext4_fc_stop_update(inode);
 				return -EFBIG;
+			}
 		}
-		if (!S_ISREG(inode->i_mode))
+		if (!S_ISREG(inode->i_mode)) {
+			ext4_fc_stop_update(inode);
 			return -EINVAL;
+		}
 
 		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
 			inode_inc_iversion(inode);
@@ -5364,7 +5374,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		rc = ext4_break_layouts(inode);
 		if (rc) {
 			up_write(&EXT4_I(inode)->i_mmap_sem);
-			return rc;
+			goto err_out;
 		}
 
 		if (attr->ia_size != inode->i_size) {
@@ -5385,6 +5395,21 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				inode->i_mtime = current_time(inode);
 				inode->i_ctime = inode->i_mtime;
 			}
+
+			if (shrink)
+				ext4_fc_track_range(inode,
+					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
+					inode->i_sb->s_blocksize_bits,
+					(oldsize > 0 ? oldsize - 1 : 0) >>
+					inode->i_sb->s_blocksize_bits);
+			else
+				ext4_fc_track_range(
+					inode,
+					(oldsize > 0 ? oldsize - 1 : oldsize) >>
+					inode->i_sb->s_blocksize_bits,
+					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
+					inode->i_sb->s_blocksize_bits);
+
 			down_write(&EXT4_I(inode)->i_data_sem);
 			EXT4_I(inode)->i_disksize = attr->ia_size;
 			rc = ext4_mark_inode_dirty(handle, inode);
@@ -5443,9 +5468,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		rc = posix_acl_chmod(inode, inode->i_mode);
 
 err_out:
-	ext4_std_error(inode->i_sb, error);
+	if  (error)
+		ext4_std_error(inode->i_sb, error);
 	if (!error)
 		error = rc;
+	ext4_fc_stop_update(inode);
 	return error;
 }
 
@@ -5627,6 +5654,8 @@ int ext4_mark_iloc_dirty(handle_t *handle,
 		put_bh(iloc->bh);
 		return -EIO;
 	}
+	ext4_fc_track_inode(inode);
+
 	if (IS_I_VERSION(inode))
 		inode_inc_iversion(inode);
 
@@ -5950,6 +5979,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
+	ext4_fc_mark_ineligible(inode->i_sb,
+		EXT4_FC_REASON_JOURNAL_FLAG_CHANGE);
 	err = ext4_mark_inode_dirty(handle, inode);
 	ext4_handle_sync(handle);
 	ext4_journal_stop(handle);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 36eca3bc036a..d2f8f50deef6 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -165,6 +165,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
 		err = -EINVAL;
 		goto err_out;
 	}
+	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_SWAP_BOOT);
 
 	/* Protect extent tree against block allocations via delalloc */
 	ext4_double_down_write_data_sem(inode, inode_bl);
@@ -247,6 +248,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
 
 err_out1:
 	ext4_journal_stop(handle);
+	ext4_fc_stop_ineligible(sb);
 	ext4_double_up_write_data_sem(inode, inode_bl);
 
 err_out:
@@ -807,7 +809,7 @@ static int ext4_ioctl_get_es_cache(struct file *filp, unsigned long arg)
 	return error;
 }
 
-long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = file_inode(filp);
 	struct super_block *sb = inode->i_sb;
@@ -1074,6 +1076,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 
 		err = ext4_resize_fs(sb, n_blocks_count);
 		if (EXT4_SB(sb)->s_journal) {
+			ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_RESIZE);
 			jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
 			err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
 			jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
@@ -1308,6 +1311,17 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	}
 }
 
+long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	long ret;
+
+	ext4_fc_start_update(file_inode(filp));
+	ret = __ext4_ioctl(filp, cmd, arg);
+	ext4_fc_stop_update(file_inode(filp));
+
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 701ef9fa21c3..fd7be1435f2d 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2611,7 +2611,7 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		       bool excl)
 {
 	handle_t *handle;
-	struct inode *inode;
+	struct inode *inode, *inode_save;
 	int err, credits, retries = 0;
 
 	err = dquot_initialize(dir);
@@ -2629,7 +2629,11 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		inode->i_op = &ext4_file_inode_operations;
 		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
+		inode_save = inode;
+		ihold(inode_save);
 		err = ext4_add_nondir(handle, dentry, &inode);
+		ext4_fc_track_create(inode_save, dentry);
+		iput(inode_save);
 	}
 	if (handle)
 		ext4_journal_stop(handle);
@@ -2644,7 +2648,7 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
 		      umode_t mode, dev_t rdev)
 {
 	handle_t *handle;
-	struct inode *inode;
+	struct inode *inode, *inode_save;
 	int err, credits, retries = 0;
 
 	err = dquot_initialize(dir);
@@ -2661,7 +2665,12 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
 	if (!IS_ERR(inode)) {
 		init_special_inode(inode, inode->i_mode, rdev);
 		inode->i_op = &ext4_special_inode_operations;
+		inode_save = inode;
+		ihold(inode_save);
 		err = ext4_add_nondir(handle, dentry, &inode);
+		if (!err)
+			ext4_fc_track_create(inode_save, dentry);
+		iput(inode_save);
 	}
 	if (handle)
 		ext4_journal_stop(handle);
@@ -2825,7 +2834,9 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 		iput(inode);
 		goto out_retry;
 	}
+	ext4_fc_track_create(inode, dentry);
 	ext4_inc_count(dir);
+
 	ext4_update_dx_flag(dir);
 	err = ext4_mark_inode_dirty(handle, dir);
 	if (err)
@@ -3165,6 +3176,7 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
 		goto end_rmdir;
 	ext4_dec_count(dir);
 	ext4_update_dx_flag(dir);
+	ext4_fc_track_unlink(inode, dentry);
 	retval = ext4_mark_inode_dirty(handle, dir);
 
 #ifdef CONFIG_UNICODE
@@ -3251,6 +3263,8 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 	inode->i_ctime = current_time(inode);
 	retval = ext4_mark_inode_dirty(handle, inode);
 
+	if (!retval)
+		ext4_fc_track_unlink(d_inode(dentry), dentry);
 #ifdef CONFIG_UNICODE
 	/* VFS negative dentries are incompatible with Encoding and
 	 * Case-insensitiveness. Eventually we'll want avoid
@@ -3872,6 +3886,22 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
 	retval = ext4_mark_inode_dirty(handle, old.dir);
 	if (unlikely(retval))
 		goto end_rename;
+
+	if (S_ISDIR(old.inode->i_mode)) {
+		/*
+		 * We disable fast commits here that's because the
+		 * replay code is not yet capable of changing dot dot
+		 * dirents in directories.
+		 */
+		ext4_fc_mark_ineligible(old.inode->i_sb,
+			EXT4_FC_REASON_RENAME_DIR);
+	} else {
+		if (new.inode)
+			ext4_fc_track_unlink(new.inode, new.dentry);
+		ext4_fc_track_link(old.inode, new.dentry);
+		ext4_fc_track_unlink(old.inode, old.dentry);
+	}
+
 	if (new.inode) {
 		retval = ext4_mark_inode_dirty(handle, new.inode);
 		if (unlikely(retval))
@@ -4015,7 +4045,8 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
 	retval = ext4_mark_inode_dirty(handle, new.inode);
 	if (unlikely(retval))
 		goto end_rename;
-
+	ext4_fc_mark_ineligible(new.inode->i_sb,
+				EXT4_FC_REASON_CROSS_RENAME);
 	if (old.dir_bh) {
 		retval = ext4_rename_dir_finish(handle, &old, new.dir->i_ino);
 		if (retval)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 23bf55057fc2..505cebd26235 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1331,6 +1331,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_unwritten, 0);
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+	ext4_fc_init_inode(&ei->vfs_inode);
+	mutex_init(&ei->i_fc_lock);
 	return &ei->vfs_inode;
 }
 
@@ -1348,6 +1350,10 @@ static int ext4_drop_inode(struct inode *inode)
 static void ext4_free_in_core_inode(struct inode *inode)
 {
 	fscrypt_free_inode(inode);
+	if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
+		pr_warn("%s: inode %ld still in fc list",
+			__func__, inode->i_ino);
+	}
 	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
 }
 
@@ -1373,6 +1379,7 @@ static void init_once(void *foo)
 	init_rwsem(&ei->i_data_sem);
 	init_rwsem(&ei->i_mmap_sem);
 	inode_init_once(&ei->vfs_inode);
+	ext4_fc_init_inode(&ei->vfs_inode);
 }
 
 static int __init init_inodecache(void)
@@ -1401,6 +1408,7 @@ static void destroy_inodecache(void)
 
 void ext4_clear_inode(struct inode *inode)
 {
+	ext4_fc_del(inode);
 	invalidate_inode_buffers(inode);
 	clear_inode(inode);
 	ext4_discard_preallocations(inode, 0);
@@ -4744,6 +4752,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
 	mutex_init(&sbi->s_orphan_lock);
 
+	/* Initialize fast commit stuff */
+	atomic_set(&sbi->s_fc_subtid, 0);
+	atomic_set(&sbi->s_fc_ineligible_updates, 0);
+	INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_MAIN]);
+	INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_STAGING]);
+	INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_MAIN]);
+	INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_STAGING]);
+	sbi->s_fc_bytes = 0;
+	sbi->s_mount_state &= ~EXT4_FC_INELIGIBLE;
+	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
+	spin_lock_init(&sbi->s_fc_lock);
+	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
+
 	sb->s_root = NULL;
 
 	needs_recovery = (es->s_last_orphan != 0 ||
@@ -6510,6 +6531,10 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
 	brelse(bh);
 out:
 	if (inode->i_size < off + len) {
+		ext4_fc_track_range(inode,
+			(inode->i_size > 0 ? inode->i_size - 1 : 0)
+				>> inode->i_sb->s_blocksize_bits,
+			(off + len) >> inode->i_sb->s_blocksize_bits);
 		i_size_write(inode, off + len);
 		EXT4_I(inode)->i_disksize = inode->i_size;
 		err2 = ext4_mark_inode_dirty(handle, inode);
@@ -6638,6 +6663,11 @@ static int __init ext4_init_fs(void)
 	err = init_inodecache();
 	if (err)
 		goto out1;
+
+	err = ext4_fc_init_dentry_cache();
+	if (err)
+		goto out05;
+
 	register_as_ext3();
 	register_as_ext2();
 	err = register_filesystem(&ext4_fs_type);
@@ -6648,6 +6678,7 @@ static int __init ext4_init_fs(void)
 out:
 	unregister_as_ext2();
 	unregister_as_ext3();
+out05:
 	destroy_inodecache();
 out1:
 	ext4_exit_mballoc();
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index cba4b877c606..6127e94ea4f5 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -2419,6 +2419,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		if (IS_SYNC(inode))
 			ext4_handle_sync(handle);
 	}
+	ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR);
 
 cleanup:
 	brelse(is.iloc.bh);
@@ -2496,6 +2497,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 		if (error == 0)
 			error = error2;
 	}
+	ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR);
 
 	return error;
 }
@@ -2928,6 +2930,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 					 error);
 			goto cleanup;
 		}
+		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR);
 	}
 	error = 0;
 cleanup:
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 4c8b99ec8606..521de3a82118 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -95,6 +95,16 @@ TRACE_DEFINE_ENUM(ES_REFERENCED_B);
 	{ FALLOC_FL_COLLAPSE_RANGE,	"COLLAPSE_RANGE"},	\
 	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"})
 
+#define show_fc_reason(reason)						\
+	__print_symbolic(reason,					\
+		{ EXT4_FC_REASON_XATTR,		"XATTR"},		\
+		{ EXT4_FC_REASON_CROSS_RENAME,	"CROSS_RENAME"},	\
+		{ EXT4_FC_REASON_JOURNAL_FLAG_CHANGE, "JOURNAL_FLAG_CHANGE"}, \
+		{ EXT4_FC_REASON_MEM,	"NO_MEM"},			\
+		{ EXT4_FC_REASON_SWAP_BOOT,	"SWAP_BOOT"},		\
+		{ EXT4_FC_REASON_RESIZE,	"RESIZE"},		\
+		{ EXT4_FC_REASON_RENAME_DIR,	"RENAME_DIR"},		\
+		{ EXT4_FC_REASON_FALLOC_RANGE,	"FALLOC_RANGE"})
 
 TRACE_EVENT(ext4_other_inode_update_time,
 	TP_PROTO(struct inode *inode, ino_t orig_ino),
@@ -2791,6 +2801,168 @@ TRACE_EVENT(ext4_lazy_itable_init,
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->group)
 );
 
+TRACE_EVENT(ext4_fc_commit_start,
+	TP_PROTO(struct super_block *sb),
+
+	TP_ARGS(sb),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+	),
+
+	TP_printk("fast_commit started on dev %d,%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev))
+);
+
+TRACE_EVENT(ext4_fc_commit_stop,
+	    TP_PROTO(struct super_block *sb, int nblks, int reason),
+
+	TP_ARGS(sb, nblks, reason),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, nblks)
+		__field(int, reason)
+		__field(int, num_fc)
+		__field(int, num_fc_ineligible)
+		__field(int, nblks_agg)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->nblks = nblks;
+		__entry->reason = reason;
+		__entry->num_fc = EXT4_SB(sb)->s_fc_stats.fc_num_commits;
+		__entry->num_fc_ineligible =
+			EXT4_SB(sb)->s_fc_stats.fc_ineligible_commits;
+		__entry->nblks_agg = EXT4_SB(sb)->s_fc_stats.fc_numblks;
+	),
+
+	TP_printk("fc on [%d,%d] nblks %d, reason %d, fc = %d, ineligible = %d, agg_nblks %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nblks, __entry->reason, __entry->num_fc,
+		  __entry->num_fc_ineligible, __entry->nblks_agg)
+);
+
+#define FC_REASON_NAME_STAT(reason)					\
+	show_fc_reason(reason),						\
+	__entry->sbi->s_fc_stats.fc_ineligible_reason_count[reason]
+
+TRACE_EVENT(ext4_fc_stats,
+	    TP_PROTO(struct super_block *sb),
+
+	    TP_ARGS(sb),
+
+	    TP_STRUCT__entry(
+		    __field(dev_t, dev)
+		    __field(struct ext4_sb_info *, sbi)
+		    __field(int, count)
+		    ),
+
+	    TP_fast_assign(
+		    __entry->dev = sb->s_dev;
+		    __entry->sbi = EXT4_SB(sb);
+		    ),
+
+	    TP_printk("dev %d:%d fc ineligible reasons:\n"
+		      "%s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s,%d; "
+		      "num_commits:%ld, ineligible: %ld, numblks: %ld",
+		      MAJOR(__entry->dev), MINOR(__entry->dev),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_XATTR),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_CROSS_RENAME),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_JOURNAL_FLAG_CHANGE),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_MEM),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_SWAP_BOOT),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_RESIZE),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_RENAME_DIR),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_FALLOC_RANGE),
+		      __entry->sbi->s_fc_stats.fc_num_commits,
+		      __entry->sbi->s_fc_stats.fc_ineligible_commits,
+		      __entry->sbi->s_fc_stats.fc_numblks)
+
+);
+
+#define DEFINE_TRACE_DENTRY_EVENT(__type)				\
+	TRACE_EVENT(ext4_fc_track_##__type,				\
+	    TP_PROTO(struct inode *inode, struct dentry *dentry, int ret), \
+									\
+	    TP_ARGS(inode, dentry, ret),				\
+									\
+	    TP_STRUCT__entry(						\
+		    __field(dev_t, dev)					\
+		    __field(int, ino)					\
+		    __field(int, error)					\
+		    ),							\
+									\
+	    TP_fast_assign(						\
+		    __entry->dev = inode->i_sb->s_dev;			\
+		    __entry->ino = inode->i_ino;			\
+		    __entry->error = ret;				\
+		    ),							\
+									\
+	    TP_printk("dev %d:%d, inode %d, error %d, fc_%s",		\
+		      MAJOR(__entry->dev), MINOR(__entry->dev),		\
+		      __entry->ino, __entry->error,			\
+		      #__type)						\
+	)
+
+DEFINE_TRACE_DENTRY_EVENT(create);
+DEFINE_TRACE_DENTRY_EVENT(link);
+DEFINE_TRACE_DENTRY_EVENT(unlink);
+
+TRACE_EVENT(ext4_fc_track_inode,
+	    TP_PROTO(struct inode *inode, int ret),
+
+	    TP_ARGS(inode, ret),
+
+	    TP_STRUCT__entry(
+		    __field(dev_t, dev)
+		    __field(int, ino)
+		    __field(int, error)
+		    ),
+
+	    TP_fast_assign(
+		    __entry->dev = inode->i_sb->s_dev;
+		    __entry->ino = inode->i_ino;
+		    __entry->error = ret;
+		    ),
+
+	    TP_printk("dev %d:%d, inode %d, error %d",
+		      MAJOR(__entry->dev), MINOR(__entry->dev),
+		      __entry->ino, __entry->error)
+	);
+
+TRACE_EVENT(ext4_fc_track_range,
+	    TP_PROTO(struct inode *inode, long start, long end, int ret),
+
+	    TP_ARGS(inode, start, end, ret),
+
+	    TP_STRUCT__entry(
+		    __field(dev_t, dev)
+		    __field(int, ino)
+		    __field(long, start)
+		    __field(long, end)
+		    __field(int, error)
+		    ),
+
+	    TP_fast_assign(
+		    __entry->dev = inode->i_sb->s_dev;
+		    __entry->ino = inode->i_ino;
+		    __entry->start = start;
+		    __entry->end = end;
+		    __entry->error = ret;
+		    ),
+
+	    TP_printk("dev %d:%d, inode %d, error %d, start %ld, end %ld",
+		      MAJOR(__entry->dev), MINOR(__entry->dev),
+		      __entry->ino, __entry->error, __entry->start,
+		      __entry->end)
+	);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v10 6/9] jbd2: fast commit recovery path
  2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
                   ` (4 preceding siblings ...)
  2020-10-15 20:37 ` [PATCH v10 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
@ 2020-10-15 20:37 ` Harshad Shirwadkar
  2020-10-15 20:37 ` [PATCH v10 7/9] ext4: " Harshad Shirwadkar
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch adds fast commit recovery support in JBD2.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/fast_commit.c | 15 ++++++++++++
 fs/jbd2/recovery.c    | 57 ++++++++++++++++++++++++++++++++++++++++---
 include/linux/jbd2.h  | 20 +++++++++++++++
 3 files changed, 88 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index e0fa3bd18346..32ed4495f9c6 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -1177,8 +1177,23 @@ static void ext4_fc_cleanup(journal_t *journal, int full)
 	trace_ext4_fc_stats(sb);
 }
 
+/*
+ * Main recovery path entry point.
+ */
+static int ext4_fc_replay(journal_t *journal, struct buffer_head *bh,
+				enum passtype pass, int off, tid_t expected_tid)
+{
+	return 0;
+}
+
 void ext4_fc_init(struct super_block *sb, journal_t *journal)
 {
+	/*
+	 * We set replay callback even if fast commit disabled because we may
+	 * could still have fast commit blocks that need to be replayed even if
+	 * fast commit has now been turned off.
+	 */
+	journal->j_fc_replay_callback = ext4_fc_replay;
 	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
 		return;
 	journal->j_fc_cleanup_callback = ext4_fc_cleanup;
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index 5f61ce83e940..b9c734b34e26 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -35,7 +35,6 @@ struct recovery_info
 	int		nr_revoke_hits;
 };
 
-enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
 static int do_one_pass(journal_t *journal,
 				struct recovery_info *info, enum passtype pass);
 static int scan_revoke_records(journal_t *, struct buffer_head *,
@@ -225,10 +224,51 @@ static int count_tags(journal_t *journal, struct buffer_head *bh)
 /* Make sure we wrap around the log correctly! */
 #define wrap(journal, var)						\
 do {									\
-	if (var >= (journal)->j_last)					\
-		var -= ((journal)->j_last - (journal)->j_first);	\
+	unsigned long _wrap_last =					\
+		jbd2_has_feature_fast_commit(journal) ?			\
+			(journal)->j_fc_last : (journal)->j_last;	\
+									\
+	if (var >= _wrap_last)						\
+		var -= (_wrap_last - (journal)->j_first);		\
 } while (0)
 
+static int fc_do_one_pass(journal_t *journal,
+			  struct recovery_info *info, enum passtype pass)
+{
+	unsigned int expected_commit_id = info->end_transaction;
+	unsigned long next_fc_block;
+	struct buffer_head *bh;
+	int err = 0;
+
+	next_fc_block = journal->j_fc_first;
+	if (!journal->j_fc_replay_callback)
+		return 0;
+
+	while (next_fc_block <= journal->j_fc_last) {
+		jbd_debug(3, "Fast commit replay: next block %ld",
+			  next_fc_block);
+		err = jread(&bh, journal, next_fc_block);
+		if (err) {
+			jbd_debug(3, "Fast commit replay: read error");
+			break;
+		}
+
+		jbd_debug(3, "Processing fast commit blk with seq %d");
+		err = journal->j_fc_replay_callback(journal, bh, pass,
+					next_fc_block - journal->j_fc_first,
+					expected_commit_id);
+		next_fc_block++;
+		if (err < 0 || err == JBD2_FC_REPLAY_STOP)
+			break;
+		err = 0;
+	}
+
+	if (err)
+		jbd_debug(3, "Fast commit replay failed, err = %d\n", err);
+
+	return err;
+}
+
 /**
  * jbd2_journal_recover - recovers a on-disk journal
  * @journal: the journal to recover
@@ -472,7 +512,9 @@ static int do_one_pass(journal_t *journal,
 				break;
 
 		jbd_debug(2, "Scanning for sequence ID %u at %lu/%lu\n",
-			  next_commit_ID, next_log_block, journal->j_last);
+			  next_commit_ID, next_log_block,
+			  jbd2_has_feature_fast_commit(journal) ?
+			  journal->j_fc_last : journal->j_last);
 
 		/* Skip over each chunk of the transaction looking
 		 * either the next descriptor block or the final commit
@@ -832,6 +874,13 @@ static int do_one_pass(journal_t *journal,
 				success = -EIO;
 		}
 	}
+
+	if (jbd2_has_feature_fast_commit(journal) &&  pass != PASS_REVOKE) {
+		err = fc_do_one_pass(journal, info, pass);
+		if (err)
+			success = err;
+	}
+
 	if (block_error && success == 0)
 		success = -EIO;
 	return success;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a009d9b9c620..fb3d71ad6eea 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -751,6 +751,11 @@ jbd2_time_diff(unsigned long start, unsigned long end)
 
 #define JBD2_NR_BATCH	64
 
+enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
+
+#define JBD2_FC_REPLAY_STOP	0
+#define JBD2_FC_REPLAY_CONTINUE	1
+
 /**
  * struct journal_s - The journal_s type is the concrete type associated with
  *     journal_t.
@@ -1248,6 +1253,21 @@ struct journal_s
 	 */
 	void (*j_fc_cleanup_callback)(struct journal_s *journal, int);
 
+	/*
+	 * @j_fc_replay_callback:
+	 *
+	 * File-system specific function that performs replay of a fast
+	 * commit. JBD2 calls this function for each fast commit block found in
+	 * the journal. This function should return JBD2_FC_REPLAY_CONTINUE
+	 * to indicate that the block was processed correctly and more fast
+	 * commit replay should continue. Return value of JBD2_FC_REPLAY_STOP
+	 * indicates the end of replay (no more blocks remaining). A negative
+	 * return value indicates error.
+	 */
+	int (*j_fc_replay_callback)(struct journal_s *journal,
+				    struct buffer_head *bh,
+				    enum passtype pass, int off,
+				    tid_t expected_commit_id);
 };
 
 #define jbd2_might_wait_for_commit(j) \
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v10 7/9] ext4: fast commit recovery path
  2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
                   ` (5 preceding siblings ...)
  2020-10-15 20:37 ` [PATCH v10 6/9] jbd2: fast commit recovery path Harshad Shirwadkar
@ 2020-10-15 20:37 ` Harshad Shirwadkar
  2020-10-15 20:38 ` [PATCH v10 8/9] ext4: add a mount opt to forcefully turn fast commits on Harshad Shirwadkar
  2020-10-15 20:38 ` [PATCH v10 9/9] ext4: add fast commit stats in procfs Harshad Shirwadkar
  8 siblings, 0 replies; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:37 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar, kernel test robot

This patch adds fast commit recovery path support for Ext4 file
system. We add several helper functions that are similar in spirit to
e2fsprogs journal recovery path handlers. Example of such functions
include - a simple block allocator, idempotent block bitmap update
function etc. Using these routines and the fast commit log in the fast
commit area, the recovery path (ext4_fc_replay()) performs fast commit
log recovery.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/balloc.c            |   7 +-
 fs/ext4/ext4.h              |  26 ++
 fs/ext4/ext4_jbd2.c         |   2 +-
 fs/ext4/extents.c           | 261 +++++++++++
 fs/ext4/extents_status.c    |  24 +
 fs/ext4/fast_commit.c       | 897 +++++++++++++++++++++++++++++++++++-
 fs/ext4/fast_commit.h       |  40 ++
 fs/ext4/ialloc.c            | 168 ++++++-
 fs/ext4/inode.c             |  89 ++--
 fs/ext4/ioctl.c             |   6 +-
 fs/ext4/mballoc.c           | 206 ++++++++-
 fs/ext4/namei.c             | 149 +++---
 fs/ext4/super.c             |  21 +
 include/trace/events/ext4.h |  56 ++-
 14 files changed, 1821 insertions(+), 131 deletions(-)

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index dea738ba2acd..1d640b145637 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -368,7 +368,12 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
 				      struct buffer_head *bh)
 {
 	ext4_fsblk_t	blk;
-	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+	struct ext4_group_info *grp;
+
+	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
+	grp = ext4_get_group_info(sb, block_group);
 
 	if (buffer_verified(bh))
 		return 0;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 6b291cad72be..ff5094eb0e39 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1167,6 +1167,7 @@ struct ext4_inode_info {
 #define EXT4_FC_COMMITTING		0x0010	/* File system underoing a fast
 						 * commit.
 						 */
+#define EXT4_FC_REPLAY			0x0020	/* Fast commit replay ongoing */
 
 /*
  * Misc. filesystem flags
@@ -1663,6 +1664,10 @@ struct ext4_sb_info {
 	struct buffer_head *s_fc_bh;
 	struct ext4_fc_stats s_fc_stats;
 	u64 s_fc_avg_commit_time;
+#ifdef CONFIG_EXT4_DEBUG
+	int s_fc_debug_max_replay;
+#endif
+	struct ext4_fc_replay_state s_fc_replay_state;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -2705,6 +2710,7 @@ extern int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
 			  struct dx_hash_info *hinfo);
 
 /* ialloc.c */
+extern int ext4_mark_inode_used(struct super_block *sb, int ino);
 extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t,
 				      const struct qstr *qstr, __u32 goal,
 				      uid_t *owner, __u32 i_flags,
@@ -2746,6 +2752,8 @@ void ext4_fc_stop_ineligible(struct super_block *sb);
 void ext4_fc_start_update(struct inode *inode);
 void ext4_fc_stop_update(struct inode *inode);
 void ext4_fc_del(struct inode *inode);
+bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t block);
+void ext4_fc_replay_cleanup(struct super_block *sb);
 int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
 int __init ext4_fc_init_dentry_cache(void);
 
@@ -2778,8 +2786,12 @@ extern int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
 				ext4_fsblk_t block, unsigned long count);
 extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
 extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
+extern void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
+		       int len, int state);
 
 /* inode.c */
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+			 struct ext4_inode_info *ei);
 int ext4_inode_is_fast_symlink(struct inode *inode);
 struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
 struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
@@ -2826,6 +2838,8 @@ extern int  ext4_sync_inode(handle_t *, struct inode *);
 extern void ext4_dirty_inode(struct inode *, int);
 extern int ext4_change_inode_journal_flag(struct inode *, int);
 extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
+extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+			  struct ext4_iloc *iloc);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
@@ -2859,12 +2873,15 @@ extern int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
 /* ioctl.c */
 extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
 extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
+extern void ext4_reset_inode_seed(struct inode *inode);
 
 /* migrate.c */
 extern int ext4_ext_migrate(struct inode *);
 extern int ext4_ind_migrate(struct inode *inode);
 
 /* namei.c */
+extern int ext4_init_new_dir(handle_t *handle, struct inode *dir,
+			     struct inode *inode);
 extern int ext4_dirblock_csum_verify(struct inode *inode,
 				     struct buffer_head *bh);
 extern int ext4_orphan_add(handle_t *, struct inode *);
@@ -3444,6 +3461,10 @@ extern int ext4_handle_dirty_dirblock(handle_t *handle, struct inode *inode,
 extern int ext4_ci_compare(const struct inode *parent,
 			   const struct qstr *fname,
 			   const struct qstr *entry, bool quick);
+extern int __ext4_unlink(struct inode *dir, const struct qstr *d_name,
+			 struct inode *inode);
+extern int __ext4_link(struct inode *dir, struct inode *inode,
+		       struct dentry *dentry);
 
 #define S_SHIFT 12
 static const unsigned char ext4_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = {
@@ -3544,6 +3565,11 @@ extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
 extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
 				       int check_cred, int restart_cred,
 				       int revoke_cred);
+extern void ext4_ext_replay_shrink_inode(struct inode *inode, ext4_lblk_t end);
+extern int ext4_ext_replay_set_iblocks(struct inode *inode);
+extern int ext4_ext_replay_update_ex(struct inode *inode, ext4_lblk_t start,
+		int len, int unwritten, ext4_fsblk_t pblk);
+extern int ext4_ext_clear_bb(struct inode *inode);
 
 
 /* move_extent.c */
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 760b9ee49dc0..0fd0c42a4f7d 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -100,7 +100,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 		return ERR_PTR(err);
 
 	journal = EXT4_SB(sb)->s_journal;
-	if (!journal)
+	if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
 		return ext4_get_nojournal();
 	return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
 				   GFP_NOFS, type, line);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a2bb87d75500..6b33b9c86b00 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5804,3 +5804,264 @@ int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu)
 
 	return err ? err : mapped;
 }
+
+/*
+ * Updates physical block address and unwritten status of extent starting at
+ * lblk start and of len. If such an extent doesn't exist, this function
+ * splits the extent tree appropriately to create an extent like this.
+ * This function is called in Ext4 fast commit replay path. Returns 0 on success
+ * and error on failure.
+ */
+int ext4_ext_replay_update_ex(struct inode *inode, ext4_lblk_t start,
+		int len, int unwritten, ext4_fsblk_t pblk)
+{
+	struct ext4_ext_path *path = NULL, *ppath;
+	struct ext4_extent *ex;
+	int ret;
+
+	path = ext4_find_extent(inode, start, NULL, 0);
+	if (!path)
+		return -EINVAL;
+	ex = path[path->p_depth].p_ext;
+	if (!ex) {
+		ret = -EFSCORRUPTED;
+		goto out;
+	}
+
+	if (le32_to_cpu(ex->ee_block) != start ||
+		ext4_ext_get_actual_len(ex) != len) {
+		/* We need to split this extent to match our extent first */
+		ppath = path;
+		down_write(&EXT4_I(inode)->i_data_sem);
+		ret = ext4_force_split_extent_at(NULL, inode, &ppath, start, 1);
+		up_write(&EXT4_I(inode)->i_data_sem);
+		if (ret)
+			goto out;
+		kfree(path);
+		path = ext4_find_extent(inode, start, NULL, 0);
+		if (IS_ERR(path))
+			return -1;
+		ppath = path;
+		ex = path[path->p_depth].p_ext;
+		WARN_ON(le32_to_cpu(ex->ee_block) != start);
+		if (ext4_ext_get_actual_len(ex) != len) {
+			down_write(&EXT4_I(inode)->i_data_sem);
+			ret = ext4_force_split_extent_at(NULL, inode, &ppath,
+							 start + len, 1);
+			up_write(&EXT4_I(inode)->i_data_sem);
+			if (ret)
+				goto out;
+			kfree(path);
+			path = ext4_find_extent(inode, start, NULL, 0);
+			if (IS_ERR(path))
+				return -EINVAL;
+			ex = path[path->p_depth].p_ext;
+		}
+	}
+	if (unwritten)
+		ext4_ext_mark_unwritten(ex);
+	else
+		ext4_ext_mark_initialized(ex);
+	ext4_ext_store_pblock(ex, pblk);
+	down_write(&EXT4_I(inode)->i_data_sem);
+	ret = ext4_ext_dirty(NULL, inode, &path[path->p_depth]);
+	up_write(&EXT4_I(inode)->i_data_sem);
+out:
+	ext4_ext_drop_refs(path);
+	kfree(path);
+	ext4_mark_inode_dirty(NULL, inode);
+	return ret;
+}
+
+/* Try to shrink the extent tree */
+void ext4_ext_replay_shrink_inode(struct inode *inode, ext4_lblk_t end)
+{
+	struct ext4_ext_path *path = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t old_cur, cur = 0;
+
+	while (cur < end) {
+		path = ext4_find_extent(inode, cur, NULL, 0);
+		if (IS_ERR(path))
+			return;
+		ex = path[path->p_depth].p_ext;
+		if (!ex) {
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			ext4_mark_inode_dirty(NULL, inode);
+			return;
+		}
+		old_cur = cur;
+		cur = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
+		if (cur <= old_cur)
+			cur = old_cur + 1;
+		ext4_ext_try_to_merge(NULL, inode, path, ex);
+		down_write(&EXT4_I(inode)->i_data_sem);
+		ext4_ext_dirty(NULL, inode, &path[path->p_depth]);
+		up_write(&EXT4_I(inode)->i_data_sem);
+		ext4_mark_inode_dirty(NULL, inode);
+		ext4_ext_drop_refs(path);
+		kfree(path);
+	}
+}
+
+/* Check if *cur is a hole and if it is, skip it */
+static void skip_hole(struct inode *inode, ext4_lblk_t *cur)
+{
+	int ret;
+	struct ext4_map_blocks map;
+
+	map.m_lblk = *cur;
+	map.m_len = ((inode->i_size) >> inode->i_sb->s_blocksize_bits) - *cur;
+
+	ret = ext4_map_blocks(NULL, inode, &map, 0);
+	if (ret != 0)
+		return;
+	*cur = *cur + map.m_len;
+}
+
+/* Count number of blocks used by this inode and update i_blocks */
+int ext4_ext_replay_set_iblocks(struct inode *inode)
+{
+	struct ext4_ext_path *path = NULL, *path2 = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t cur = 0, end;
+	int numblks = 0, i, ret = 0;
+	ext4_fsblk_t cmp1, cmp2;
+	struct ext4_map_blocks map;
+
+	/* Determin the size of the file first */
+	path = ext4_find_extent(inode, EXT_MAX_BLOCKS - 1, NULL,
+					EXT4_EX_NOCACHE);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+	ex = path[path->p_depth].p_ext;
+	if (!ex) {
+		ext4_ext_drop_refs(path);
+		kfree(path);
+		goto out;
+	}
+	end = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
+	ext4_ext_drop_refs(path);
+	kfree(path);
+
+	/* Count the number of data blocks */
+	cur = 0;
+	while (cur < end) {
+		map.m_lblk = cur;
+		map.m_len = end - cur;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0)
+			numblks += ret;
+		cur = cur + map.m_len;
+	}
+
+	/*
+	 * Count the number of extent tree blocks. We do it by looking up
+	 * two successive extents and determining the difference between
+	 * their paths. When path is different for 2 successive extents
+	 * we compare the blocks in the path at each level and increment
+	 * iblocks by total number of differences found.
+	 */
+	cur = 0;
+	skip_hole(inode, &cur);
+	path = ext4_find_extent(inode, cur, NULL, 0);
+	if (IS_ERR(path))
+		goto out;
+	numblks += path->p_depth;
+	ext4_ext_drop_refs(path);
+	kfree(path);
+	while (cur < end) {
+		path = ext4_find_extent(inode, cur, NULL, 0);
+		if (IS_ERR(path))
+			break;
+		ex = path[path->p_depth].p_ext;
+		if (!ex) {
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			return 0;
+		}
+		cur = max(cur + 1, le32_to_cpu(ex->ee_block) +
+					ext4_ext_get_actual_len(ex));
+		skip_hole(inode, &cur);
+
+		path2 = ext4_find_extent(inode, cur, NULL, 0);
+		if (IS_ERR(path2)) {
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			break;
+		}
+		ex = path2[path2->p_depth].p_ext;
+		for (i = 0; i <= max(path->p_depth, path2->p_depth); i++) {
+			cmp1 = cmp2 = 0;
+			if (i <= path->p_depth)
+				cmp1 = path[i].p_bh ?
+					path[i].p_bh->b_blocknr : 0;
+			if (i <= path2->p_depth)
+				cmp2 = path2[i].p_bh ?
+					path2[i].p_bh->b_blocknr : 0;
+			if (cmp1 != cmp2 && cmp2 != 0)
+				numblks++;
+		}
+		ext4_ext_drop_refs(path);
+		ext4_ext_drop_refs(path2);
+		kfree(path);
+		kfree(path2);
+	}
+
+out:
+	inode->i_blocks = numblks << (inode->i_sb->s_blocksize_bits - 9);
+	ext4_mark_inode_dirty(NULL, inode);
+	return 0;
+}
+
+int ext4_ext_clear_bb(struct inode *inode)
+{
+	struct ext4_ext_path *path = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t cur = 0, end;
+	int j, ret = 0;
+	struct ext4_map_blocks map;
+
+	/* Determin the size of the file first */
+	path = ext4_find_extent(inode, EXT_MAX_BLOCKS - 1, NULL,
+					EXT4_EX_NOCACHE);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+	ex = path[path->p_depth].p_ext;
+	if (!ex) {
+		ext4_ext_drop_refs(path);
+		kfree(path);
+		return 0;
+	}
+	end = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
+	ext4_ext_drop_refs(path);
+	kfree(path);
+
+	cur = 0;
+	while (cur < end) {
+		map.m_lblk = cur;
+		map.m_len = end - cur;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			path = ext4_find_extent(inode, map.m_lblk, NULL, 0);
+			if (!IS_ERR_OR_NULL(path)) {
+				for (j = 0; j < path->p_depth; j++) {
+
+					ext4_mb_mark_bb(inode->i_sb,
+							path[j].p_block, 1, 0);
+				}
+				ext4_ext_drop_refs(path);
+				kfree(path);
+			}
+			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
+		}
+		cur = cur + map.m_len;
+	}
+
+	return 0;
+}
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e75171535375..0a729027322d 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -311,6 +311,9 @@ void ext4_es_find_extent_range(struct inode *inode,
 			       ext4_lblk_t lblk, ext4_lblk_t end,
 			       struct extent_status *es)
 {
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return;
+
 	trace_ext4_es_find_extent_range_enter(inode, lblk);
 
 	read_lock(&EXT4_I(inode)->i_es_lock);
@@ -361,6 +364,9 @@ bool ext4_es_scan_range(struct inode *inode,
 {
 	bool ret;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return false;
+
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	ret = __es_scan_range(inode, matching_fn, lblk, end);
 	read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -404,6 +410,9 @@ bool ext4_es_scan_clu(struct inode *inode,
 {
 	bool ret;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return false;
+
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	ret = __es_scan_clu(inode, matching_fn, lblk);
 	read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -812,6 +821,9 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	int err = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
 	es_debug("add [%u/%u) %llu %x to extent status tree of inode %lu\n",
 		 lblk, len, pblk, status, inode->i_ino);
 
@@ -873,6 +885,9 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return;
+
 	newes.es_lblk = lblk;
 	newes.es_len = len;
 	ext4_es_store_pblock_status(&newes, pblk, status);
@@ -908,6 +923,9 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct rb_node *node;
 	int found = 0;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
 	trace_ext4_es_lookup_extent_enter(inode, lblk);
 	es_debug("lookup extent in block %u\n", lblk);
 
@@ -1419,6 +1437,9 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	int err = 0;
 	int reserved = 0;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
 	trace_ext4_es_remove_extent(inode, lblk, len);
 	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
 		 lblk, len, inode->i_ino);
@@ -1969,6 +1990,9 @@ int ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 	struct extent_status newes;
 	int err = 0;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
 	es_debug("add [%u/1) delayed to extent status tree of inode %lu\n",
 		 lblk, inode->i_ino);
 
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 32ed4495f9c6..1dda5329be61 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -165,7 +165,8 @@ void ext4_fc_start_update(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
 		return;
 
 restart:
@@ -204,7 +205,8 @@ void ext4_fc_stop_update(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
 		return;
 
 	if (atomic_dec_and_test(&ei->i_fc_updates))
@@ -219,11 +221,8 @@ void ext4_fc_del(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
-		return;
-
-
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
 		return;
 
 restart:
@@ -265,6 +264,10 @@ void ext4_fc_mark_ineligible(struct super_block *sb, int reason)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
 	sbi->s_mount_state |= EXT4_FC_INELIGIBLE;
 	WARN_ON(reason >= EXT4_FC_REASON_MAX);
 	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
@@ -278,6 +281,10 @@ void ext4_fc_start_ineligible(struct super_block *sb, int reason)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
 	WARN_ON(reason >= EXT4_FC_REASON_MAX);
 	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
 	atomic_inc(&sbi->s_fc_ineligible_updates);
@@ -290,6 +297,10 @@ void ext4_fc_start_ineligible(struct super_block *sb, int reason)
  */
 void ext4_fc_stop_ineligible(struct super_block *sb)
 {
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
 	EXT4_SB(sb)->s_mount_state |= EXT4_FC_INELIGIBLE;
 	atomic_dec(&EXT4_SB(sb)->s_fc_ineligible_updates);
 }
@@ -320,7 +331,8 @@ static int ext4_fc_track_template(
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	int ret;
 
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (sbi->s_mount_state & EXT4_FC_REPLAY))
 		return -EOPNOTSUPP;
 
 	if (ext4_fc_is_ineligible(inode->i_sb))
@@ -1177,13 +1189,880 @@ static void ext4_fc_cleanup(journal_t *journal, int full)
 	trace_ext4_fc_stats(sb);
 }
 
+/* Ext4 Replay Path Routines */
+
+/* Get length of a particular tlv */
+static inline int ext4_fc_tag_len(struct ext4_fc_tl *tl)
+{
+	return le16_to_cpu(tl->fc_len);
+}
+
+/* Get a pointer to "value" of a tlv */
+static inline u8 *ext4_fc_tag_val(struct ext4_fc_tl *tl)
+{
+	return (u8 *)tl + sizeof(*tl);
+}
+
+/* Helper struct for dentry replay routines */
+struct dentry_info_args {
+	int parent_ino, dname_len, ino, inode_len;
+	char *dname;
+};
+
+static inline void tl_to_darg(struct dentry_info_args *darg,
+				struct  ext4_fc_tl *tl)
+{
+	struct ext4_fc_dentry_info *fcd;
+
+	fcd = (struct ext4_fc_dentry_info *)ext4_fc_tag_val(tl);
+
+	darg->parent_ino = le32_to_cpu(fcd->fc_parent_ino);
+	darg->ino = le32_to_cpu(fcd->fc_ino);
+	darg->dname = fcd->fc_dname;
+	darg->dname_len = ext4_fc_tag_len(tl) -
+			sizeof(struct ext4_fc_dentry_info);
+}
+
+/* Unlink replay function */
+static int ext4_fc_replay_unlink(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	struct inode *inode, *old_parent;
+	struct qstr entry;
+	struct dentry_info_args darg;
+	int ret = 0;
+
+	tl_to_darg(&darg, tl);
+
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_UNLINK, darg.ino,
+			darg.parent_ino, darg.dname_len);
+
+	entry.name = darg.dname;
+	entry.len = darg.dname_len;
+	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
+
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode %d not found", darg.ino);
+		return 0;
+	}
+
+	old_parent = ext4_iget(sb, darg.parent_ino,
+				EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(old_parent)) {
+		jbd_debug(1, "Dir with inode  %d not found", darg.parent_ino);
+		iput(inode);
+		return 0;
+	}
+
+	ret = __ext4_unlink(old_parent, &entry, inode);
+	/* -ENOENT ok coz it might not exist anymore. */
+	if (ret == -ENOENT)
+		ret = 0;
+	iput(old_parent);
+	iput(inode);
+	return ret;
+}
+
+static int ext4_fc_replay_link_internal(struct super_block *sb,
+				struct dentry_info_args *darg,
+				struct inode *inode)
+{
+	struct inode *dir = NULL;
+	struct dentry *dentry_dir = NULL, *dentry_inode = NULL;
+	struct qstr qstr_dname = QSTR_INIT(darg->dname, darg->dname_len);
+	int ret = 0;
+
+	dir = ext4_iget(sb, darg->parent_ino, EXT4_IGET_NORMAL);
+	if (IS_ERR(dir)) {
+		jbd_debug(1, "Dir with inode %d not found.", darg->parent_ino);
+		dir = NULL;
+		goto out;
+	}
+
+	dentry_dir = d_obtain_alias(dir);
+	if (IS_ERR(dentry_dir)) {
+		jbd_debug(1, "Failed to obtain dentry");
+		dentry_dir = NULL;
+		goto out;
+	}
+
+	dentry_inode = d_alloc(dentry_dir, &qstr_dname);
+	if (!dentry_inode) {
+		jbd_debug(1, "Inode dentry not created.");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = __ext4_link(dir, inode, dentry_inode);
+	/*
+	 * It's possible that link already existed since data blocks
+	 * for the dir in question got persisted before we crashed OR
+	 * we replayed this tag and crashed before the entire replay
+	 * could complete.
+	 */
+	if (ret && ret != -EEXIST) {
+		jbd_debug(1, "Failed to link\n");
+		goto out;
+	}
+
+	ret = 0;
+out:
+	if (dentry_dir) {
+		d_drop(dentry_dir);
+		dput(dentry_dir);
+	} else if (dir) {
+		iput(dir);
+	}
+	if (dentry_inode) {
+		d_drop(dentry_inode);
+		dput(dentry_inode);
+	}
+
+	return ret;
+}
+
+/* Link replay function */
+static int ext4_fc_replay_link(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	struct inode *inode;
+	struct dentry_info_args darg;
+	int ret = 0;
+
+	tl_to_darg(&darg, tl);
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_LINK, darg.ino,
+			darg.parent_ino, darg.dname_len);
+
+	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode not found.");
+		return 0;
+	}
+
+	ret = ext4_fc_replay_link_internal(sb, &darg, inode);
+	iput(inode);
+	return ret;
+}
+
+/*
+ * Record all the modified inodes during replay. We use this later to setup
+ * block bitmaps correctly.
+ */
+static int ext4_fc_record_modified_inode(struct super_block *sb, int ino)
+{
+	struct ext4_fc_replay_state *state;
+	int i;
+
+	state = &EXT4_SB(sb)->s_fc_replay_state;
+	for (i = 0; i < state->fc_modified_inodes_used; i++)
+		if (state->fc_modified_inodes[i] == ino)
+			return 0;
+	if (state->fc_modified_inodes_used == state->fc_modified_inodes_size) {
+		state->fc_modified_inodes_size +=
+			EXT4_FC_REPLAY_REALLOC_INCREMENT;
+		state->fc_modified_inodes = krealloc(
+					state->fc_modified_inodes, sizeof(int) *
+					state->fc_modified_inodes_size,
+					GFP_KERNEL);
+		if (!state->fc_modified_inodes)
+			return -ENOMEM;
+	}
+	state->fc_modified_inodes[state->fc_modified_inodes_used++] = ino;
+	return 0;
+}
+
+/*
+ * Inode replay function
+ */
+static int ext4_fc_replay_inode(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	struct ext4_fc_inode *fc_inode;
+	struct ext4_inode *raw_inode;
+	struct ext4_inode *raw_fc_inode;
+	struct inode *inode = NULL;
+	struct ext4_iloc iloc;
+	int inode_len, ino, ret, tag = le16_to_cpu(tl->fc_tag);
+	struct ext4_extent_header *eh;
+
+	fc_inode = (struct ext4_fc_inode *)ext4_fc_tag_val(tl);
+
+	ino = le32_to_cpu(fc_inode->fc_ino);
+	trace_ext4_fc_replay(sb, tag, ino, 0, 0);
+
+	inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
+	if (!IS_ERR_OR_NULL(inode)) {
+		ext4_ext_clear_bb(inode);
+		iput(inode);
+	}
+
+	ext4_fc_record_modified_inode(sb, ino);
+
+	raw_fc_inode = (struct ext4_inode *)fc_inode->fc_raw_inode;
+	ret = ext4_get_fc_inode_loc(sb, ino, &iloc);
+	if (ret)
+		goto out;
+
+	inode_len = ext4_fc_tag_len(tl) - sizeof(struct ext4_fc_inode);
+	raw_inode = ext4_raw_inode(&iloc);
+
+	memcpy(raw_inode, raw_fc_inode, offsetof(struct ext4_inode, i_block));
+	memcpy(&raw_inode->i_generation, &raw_fc_inode->i_generation,
+		inode_len - offsetof(struct ext4_inode, i_generation));
+	if (le32_to_cpu(raw_inode->i_flags) & EXT4_EXTENTS_FL) {
+		eh = (struct ext4_extent_header *)(&raw_inode->i_block[0]);
+		if (eh->eh_magic != EXT4_EXT_MAGIC) {
+			memset(eh, 0, sizeof(*eh));
+			eh->eh_magic = EXT4_EXT_MAGIC;
+			eh->eh_max = cpu_to_le16(
+				(sizeof(raw_inode->i_block) -
+				 sizeof(struct ext4_extent_header))
+				 / sizeof(struct ext4_extent));
+		}
+	} else if (le32_to_cpu(raw_inode->i_flags) & EXT4_INLINE_DATA_FL) {
+		memcpy(raw_inode->i_block, raw_fc_inode->i_block,
+			sizeof(raw_inode->i_block));
+	}
+
+	/* Immediately update the inode on disk. */
+	ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh);
+	if (ret)
+		goto out;
+	ret = sync_dirty_buffer(iloc.bh);
+	if (ret)
+		goto out;
+	ret = ext4_mark_inode_used(sb, ino);
+	if (ret)
+		goto out;
+
+	/* Given that we just wrote the inode on disk, this SHOULD succeed. */
+	inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode not found.");
+		return -EFSCORRUPTED;
+	}
+
+	/*
+	 * Our allocator could have made different decisions than before
+	 * crashing. This should be fixed but until then, we calculate
+	 * the number of blocks the inode.
+	 */
+	ext4_ext_replay_set_iblocks(inode);
+
+	inode->i_generation = le32_to_cpu(ext4_raw_inode(&iloc)->i_generation);
+	ext4_reset_inode_seed(inode);
+
+	ext4_inode_csum_set(inode, ext4_raw_inode(&iloc), EXT4_I(inode));
+	ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh);
+	sync_dirty_buffer(iloc.bh);
+	brelse(iloc.bh);
+out:
+	iput(inode);
+	if (!ret)
+		blkdev_issue_flush(sb->s_bdev, GFP_KERNEL);
+
+	return 0;
+}
+
+/*
+ * Dentry create replay function.
+ *
+ * EXT4_FC_TAG_CREAT is preceded by EXT4_FC_TAG_INODE_FULL. Which means, the
+ * inode for which we are trying to create a dentry here, should already have
+ * been replayed before we start here.
+ */
+static int ext4_fc_replay_create(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	int ret = 0;
+	struct inode *inode = NULL;
+	struct inode *dir = NULL;
+	struct dentry_info_args darg;
+
+	tl_to_darg(&darg, tl);
+
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_CREAT, darg.ino,
+			darg.parent_ino, darg.dname_len);
+
+	/* This takes care of update group descriptor and other metadata */
+	ret = ext4_mark_inode_used(sb, darg.ino);
+	if (ret)
+		goto out;
+
+	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "inode %d not found.", darg.ino);
+		inode = NULL;
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (S_ISDIR(inode->i_mode)) {
+		/*
+		 * If we are creating a directory, we need to make sure that the
+		 * dot and dot dot dirents are setup properly.
+		 */
+		dir = ext4_iget(sb, darg.parent_ino, EXT4_IGET_NORMAL);
+		if (IS_ERR_OR_NULL(dir)) {
+			jbd_debug(1, "Dir %d not found.", darg.ino);
+			goto out;
+		}
+		ret = ext4_init_new_dir(NULL, dir, inode);
+		iput(dir);
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = ext4_fc_replay_link_internal(sb, &darg, inode);
+	if (ret)
+		goto out;
+	set_nlink(inode, 1);
+	ext4_mark_inode_dirty(NULL, inode);
+out:
+	if (inode)
+		iput(inode);
+	return ret;
+}
+
+/*
+ * Record physical disk regions which are in use as per fast commit area. Our
+ * simple replay phase allocator excludes these regions from allocation.
+ */
+static int ext4_fc_record_regions(struct super_block *sb, int ino,
+		ext4_lblk_t lblk, ext4_fsblk_t pblk, int len)
+{
+	struct ext4_fc_replay_state *state;
+	struct ext4_fc_alloc_region *region;
+
+	state = &EXT4_SB(sb)->s_fc_replay_state;
+	if (state->fc_regions_used == state->fc_regions_size) {
+		state->fc_regions_size +=
+			EXT4_FC_REPLAY_REALLOC_INCREMENT;
+		state->fc_regions = krealloc(
+					state->fc_regions,
+					state->fc_regions_size *
+					sizeof(struct ext4_fc_alloc_region),
+					GFP_KERNEL);
+		if (!state->fc_regions)
+			return -ENOMEM;
+	}
+	region = &state->fc_regions[state->fc_regions_used++];
+	region->ino = ino;
+	region->lblk = lblk;
+	region->pblk = pblk;
+	region->len = len;
+
+	return 0;
+}
+
+/* Replay add range tag */
+static int ext4_fc_replay_add_range(struct super_block *sb,
+				struct ext4_fc_tl *tl)
+{
+	struct ext4_fc_add_range *fc_add_ex;
+	struct ext4_extent newex, *ex;
+	struct inode *inode;
+	ext4_lblk_t start, cur;
+	int remaining, len;
+	ext4_fsblk_t start_pblk;
+	struct ext4_map_blocks map;
+	struct ext4_ext_path *path = NULL;
+	int ret;
+
+	fc_add_ex = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl);
+	ex = (struct ext4_extent *)&fc_add_ex->fc_ex;
+
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_ADD_RANGE,
+		le32_to_cpu(fc_add_ex->fc_ino), le32_to_cpu(ex->ee_block),
+		ext4_ext_get_actual_len(ex));
+
+	inode = ext4_iget(sb, le32_to_cpu(fc_add_ex->fc_ino),
+				EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode not found.");
+		return 0;
+	}
+
+	ret = ext4_fc_record_modified_inode(sb, inode->i_ino);
+
+	start = le32_to_cpu(ex->ee_block);
+	start_pblk = ext4_ext_pblock(ex);
+	len = ext4_ext_get_actual_len(ex);
+
+	cur = start;
+	remaining = len;
+	jbd_debug(1, "ADD_RANGE, lblk %d, pblk %lld, len %d, unwritten %d, inode %ld\n",
+		  start, start_pblk, len, ext4_ext_is_unwritten(ex),
+		  inode->i_ino);
+
+	while (remaining > 0) {
+		map.m_lblk = cur;
+		map.m_len = remaining;
+		map.m_pblk = 0;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+
+		if (ret < 0) {
+			iput(inode);
+			return 0;
+		}
+
+		if (ret == 0) {
+			/* Range is not mapped */
+			path = ext4_find_extent(inode, cur, NULL, 0);
+			if (!path)
+				continue;
+			memset(&newex, 0, sizeof(newex));
+			newex.ee_block = cpu_to_le32(cur);
+			ext4_ext_store_pblock(
+				&newex, start_pblk + cur - start);
+			newex.ee_len = cpu_to_le16(map.m_len);
+			if (ext4_ext_is_unwritten(ex))
+				ext4_ext_mark_unwritten(&newex);
+			down_write(&EXT4_I(inode)->i_data_sem);
+			ret = ext4_ext_insert_extent(
+				NULL, inode, &path, &newex, 0);
+			up_write((&EXT4_I(inode)->i_data_sem));
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			if (ret) {
+				iput(inode);
+				return 0;
+			}
+			goto next;
+		}
+
+		if (start_pblk + cur - start != map.m_pblk) {
+			/*
+			 * Logical to physical mapping changed. This can happen
+			 * if this range was removed and then reallocated to
+			 * map to new physical blocks during a fast commit.
+			 */
+			ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
+					ext4_ext_is_unwritten(ex),
+					start_pblk + cur - start);
+			if (ret) {
+				iput(inode);
+				return 0;
+			}
+			/*
+			 * Mark the old blocks as free since they aren't used
+			 * anymore. We maintain an array of all the modified
+			 * inodes. In case these blocks are still used at either
+			 * a different logical range in the same inode or in
+			 * some different inode, we will mark them as allocated
+			 * at the end of the FC replay using our array of
+			 * modified inodes.
+			 */
+			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
+			goto next;
+		}
+
+		/* Range is mapped and needs a state change */
+		jbd_debug(1, "Converting from %d to %d %lld",
+				map.m_flags & EXT4_MAP_UNWRITTEN,
+			ext4_ext_is_unwritten(ex), map.m_pblk);
+		ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
+					ext4_ext_is_unwritten(ex), map.m_pblk);
+		if (ret) {
+			iput(inode);
+			return 0;
+		}
+		/*
+		 * We may have split the extent tree while toggling the state.
+		 * Try to shrink the extent tree now.
+		 */
+		ext4_ext_replay_shrink_inode(inode, start + len);
+next:
+		cur += map.m_len;
+		remaining -= map.m_len;
+	}
+	ext4_ext_replay_shrink_inode(inode, i_size_read(inode) >>
+					sb->s_blocksize_bits);
+	iput(inode);
+	return 0;
+}
+
+/* Replay DEL_RANGE tag */
+static int
+ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	struct inode *inode;
+	struct ext4_fc_del_range *lrange;
+	struct ext4_map_blocks map;
+	ext4_lblk_t cur, remaining;
+	int ret;
+
+	lrange = (struct ext4_fc_del_range *)ext4_fc_tag_val(tl);
+	cur = le32_to_cpu(lrange->fc_lblk);
+	remaining = le32_to_cpu(lrange->fc_len);
+
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_DEL_RANGE,
+		le32_to_cpu(lrange->fc_ino), cur, remaining);
+
+	inode = ext4_iget(sb, le32_to_cpu(lrange->fc_ino), EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode %d not found", le32_to_cpu(lrange->fc_ino));
+		return 0;
+	}
+
+	ret = ext4_fc_record_modified_inode(sb, inode->i_ino);
+
+	jbd_debug(1, "DEL_RANGE, inode %ld, lblk %d, len %d\n",
+			inode->i_ino, le32_to_cpu(lrange->fc_lblk),
+			le32_to_cpu(lrange->fc_len));
+	while (remaining > 0) {
+		map.m_lblk = cur;
+		map.m_len = remaining;
+
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0) {
+			iput(inode);
+			return 0;
+		}
+		if (ret > 0) {
+			remaining -= ret;
+			cur += ret;
+			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
+		} else {
+			remaining -= map.m_len;
+			cur += map.m_len;
+		}
+	}
+
+	ret = ext4_punch_hole(inode,
+		le32_to_cpu(lrange->fc_lblk) << sb->s_blocksize_bits,
+		le32_to_cpu(lrange->fc_len) <<  sb->s_blocksize_bits);
+	if (ret)
+		jbd_debug(1, "ext4_punch_hole returned %d", ret);
+	ext4_ext_replay_shrink_inode(inode,
+		i_size_read(inode) >> sb->s_blocksize_bits);
+	ext4_mark_inode_dirty(NULL, inode);
+	iput(inode);
+
+	return 0;
+}
+
+static inline const char *tag2str(u16 tag)
+{
+	switch (tag) {
+	case EXT4_FC_TAG_LINK:
+		return "TAG_ADD_ENTRY";
+	case EXT4_FC_TAG_UNLINK:
+		return "TAG_DEL_ENTRY";
+	case EXT4_FC_TAG_ADD_RANGE:
+		return "TAG_ADD_RANGE";
+	case EXT4_FC_TAG_CREAT:
+		return "TAG_CREAT_DENTRY";
+	case EXT4_FC_TAG_DEL_RANGE:
+		return "TAG_DEL_RANGE";
+	case EXT4_FC_TAG_INODE:
+		return "TAG_INODE";
+	case EXT4_FC_TAG_PAD:
+		return "TAG_PAD";
+	case EXT4_FC_TAG_TAIL:
+		return "TAG_TAIL";
+	case EXT4_FC_TAG_HEAD:
+		return "TAG_HEAD";
+	default:
+		return "TAG_ERROR";
+	}
+}
+
+static void ext4_fc_set_bitmaps_and_counters(struct super_block *sb)
+{
+	struct ext4_fc_replay_state *state;
+	struct inode *inode;
+	struct ext4_ext_path *path = NULL;
+	struct ext4_map_blocks map;
+	int i, ret, j;
+	ext4_lblk_t cur, end;
+
+	state = &EXT4_SB(sb)->s_fc_replay_state;
+	for (i = 0; i < state->fc_modified_inodes_used; i++) {
+		inode = ext4_iget(sb, state->fc_modified_inodes[i],
+			EXT4_IGET_NORMAL);
+		if (IS_ERR_OR_NULL(inode)) {
+			jbd_debug(1, "Inode %d not found.",
+				state->fc_modified_inodes[i]);
+			continue;
+		}
+		cur = 0;
+		end = EXT_MAX_BLOCKS;
+		while (cur < end) {
+			map.m_lblk = cur;
+			map.m_len = end - cur;
+
+			ret = ext4_map_blocks(NULL, inode, &map, 0);
+			if (ret < 0)
+				break;
+
+			if (ret > 0) {
+				path = ext4_find_extent(inode, map.m_lblk, NULL, 0);
+				if (!IS_ERR_OR_NULL(path)) {
+					for (j = 0; j < path->p_depth; j++)
+						ext4_mb_mark_bb(inode->i_sb,
+							path[j].p_block, 1, 1);
+					ext4_ext_drop_refs(path);
+					kfree(path);
+				}
+				cur += ret;
+				ext4_mb_mark_bb(inode->i_sb, map.m_pblk,
+							map.m_len, 1);
+			} else {
+				cur = cur + (map.m_len ? map.m_len : 1);
+			}
+		}
+		iput(inode);
+	}
+}
+
+/*
+ * Check if block is in excluded regions for block allocation. The simple
+ * allocator that runs during replay phase is calls this function to see
+ * if it is okay to use a block.
+ */
+bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t blk)
+{
+	int i;
+	struct ext4_fc_replay_state *state;
+
+	state = &EXT4_SB(sb)->s_fc_replay_state;
+	for (i = 0; i < state->fc_regions_valid; i++) {
+		if (state->fc_regions[i].ino == 0 ||
+			state->fc_regions[i].len == 0)
+			continue;
+		if (blk >= state->fc_regions[i].pblk &&
+		    blk < state->fc_regions[i].pblk + state->fc_regions[i].len)
+			return true;
+	}
+	return false;
+}
+
+/* Cleanup function called after replay */
+void ext4_fc_replay_cleanup(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	sbi->s_mount_state &= ~EXT4_FC_REPLAY;
+	kfree(sbi->s_fc_replay_state.fc_regions);
+	kfree(sbi->s_fc_replay_state.fc_modified_inodes);
+}
+
+/*
+ * Recovery Scan phase handler
+ *
+ * This function is called during the scan phase and is responsible
+ * for doing following things:
+ * - Make sure the fast commit area has valid tags for replay
+ * - Count number of tags that need to be replayed by the replay handler
+ * - Verify CRC
+ * - Create a list of excluded blocks for allocation during replay phase
+ *
+ * This function returns JBD2_FC_REPLAY_CONTINUE to indicate that SCAN is
+ * incomplete and JBD2 should send more blocks. It returns JBD2_FC_REPLAY_STOP
+ * to indicate that scan has finished and JBD2 can now start replay phase.
+ * It returns a negative error to indicate that there was an error. At the end
+ * of a successful scan phase, sbi->s_fc_replay_state.fc_replay_num_tags is set
+ * to indicate the number of tags that need to replayed during the replay phase.
+ */
+static int ext4_fc_replay_scan(journal_t *journal,
+				struct buffer_head *bh, int off,
+				tid_t expected_tid)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_replay_state *state;
+	int ret = JBD2_FC_REPLAY_CONTINUE;
+	struct ext4_fc_add_range *ext;
+	struct ext4_fc_tl *tl;
+	struct ext4_fc_tail *tail;
+	__u8 *start, *end;
+	struct ext4_fc_head *head;
+	struct ext4_extent *ex;
+
+	state = &sbi->s_fc_replay_state;
+
+	start = (u8 *)bh->b_data;
+	end = (__u8 *)bh->b_data + journal->j_blocksize - 1;
+
+	if (state->fc_replay_expected_off == 0) {
+		state->fc_cur_tag = 0;
+		state->fc_replay_num_tags = 0;
+		state->fc_crc = 0;
+		state->fc_regions = NULL;
+		state->fc_regions_valid = state->fc_regions_used =
+			state->fc_regions_size = 0;
+		/* Check if we can stop early */
+		if (le16_to_cpu(((struct ext4_fc_tl *)start)->fc_tag)
+			!= EXT4_FC_TAG_HEAD)
+			return 0;
+	}
+
+	if (off != state->fc_replay_expected_off) {
+		ret = -EFSCORRUPTED;
+		goto out_err;
+	}
+
+	state->fc_replay_expected_off++;
+	fc_for_each_tl(start, end, tl) {
+		jbd_debug(3, "Scan phase, tag:%s, blk %lld\n",
+			  tag2str(le16_to_cpu(tl->fc_tag)), bh->b_blocknr);
+		switch (le16_to_cpu(tl->fc_tag)) {
+		case EXT4_FC_TAG_ADD_RANGE:
+			ext = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl);
+			ex = (struct ext4_extent *)&ext->fc_ex;
+			ret = ext4_fc_record_regions(sb,
+				le32_to_cpu(ext->fc_ino),
+				le32_to_cpu(ex->ee_block), ext4_ext_pblock(ex),
+				ext4_ext_get_actual_len(ex));
+			if (ret < 0)
+				break;
+			ret = JBD2_FC_REPLAY_CONTINUE;
+			fallthrough;
+		case EXT4_FC_TAG_DEL_RANGE:
+		case EXT4_FC_TAG_LINK:
+		case EXT4_FC_TAG_UNLINK:
+		case EXT4_FC_TAG_CREAT:
+		case EXT4_FC_TAG_INODE:
+		case EXT4_FC_TAG_PAD:
+			state->fc_cur_tag++;
+			state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
+					sizeof(*tl) + ext4_fc_tag_len(tl));
+			break;
+		case EXT4_FC_TAG_TAIL:
+			state->fc_cur_tag++;
+			tail = (struct ext4_fc_tail *)ext4_fc_tag_val(tl);
+			state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
+						sizeof(*tl) +
+						offsetof(struct ext4_fc_tail,
+						fc_crc));
+			if (le32_to_cpu(tail->fc_tid) == expected_tid &&
+				le32_to_cpu(tail->fc_crc) == state->fc_crc) {
+				state->fc_replay_num_tags = state->fc_cur_tag;
+				state->fc_regions_valid =
+					state->fc_regions_used;
+			} else {
+				ret = state->fc_replay_num_tags ?
+					JBD2_FC_REPLAY_STOP : -EFSBADCRC;
+			}
+			state->fc_crc = 0;
+			break;
+		case EXT4_FC_TAG_HEAD:
+			head = (struct ext4_fc_head *)ext4_fc_tag_val(tl);
+			if (le32_to_cpu(head->fc_features) &
+				~EXT4_FC_SUPPORTED_FEATURES) {
+				ret = -EOPNOTSUPP;
+				break;
+			}
+			if (le32_to_cpu(head->fc_tid) != expected_tid) {
+				ret = JBD2_FC_REPLAY_STOP;
+				break;
+			}
+			state->fc_cur_tag++;
+			state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
+					sizeof(*tl) + ext4_fc_tag_len(tl));
+			break;
+		default:
+			ret = state->fc_replay_num_tags ?
+				JBD2_FC_REPLAY_STOP : -ECANCELED;
+		}
+		if (ret < 0 || ret == JBD2_FC_REPLAY_STOP)
+			break;
+	}
+
+out_err:
+	trace_ext4_fc_replay_scan(sb, ret, off);
+	return ret;
+}
+
 /*
  * Main recovery path entry point.
+ * The meaning of return codes is similar as above.
  */
 static int ext4_fc_replay(journal_t *journal, struct buffer_head *bh,
 				enum passtype pass, int off, tid_t expected_tid)
 {
-	return 0;
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_tl *tl;
+	__u8 *start, *end;
+	int ret = JBD2_FC_REPLAY_CONTINUE;
+	struct ext4_fc_replay_state *state = &sbi->s_fc_replay_state;
+	struct ext4_fc_tail *tail;
+
+	if (pass == PASS_SCAN) {
+		state->fc_current_pass = PASS_SCAN;
+		return ext4_fc_replay_scan(journal, bh, off, expected_tid);
+	}
+
+	if (state->fc_current_pass != pass) {
+		state->fc_current_pass = pass;
+		sbi->s_mount_state |= EXT4_FC_REPLAY;
+	}
+	if (!sbi->s_fc_replay_state.fc_replay_num_tags) {
+		jbd_debug(1, "Replay stops\n");
+		ext4_fc_set_bitmaps_and_counters(sb);
+		return 0;
+	}
+
+#ifdef CONFIG_EXT4_DEBUG
+	if (sbi->s_fc_debug_max_replay && off >= sbi->s_fc_debug_max_replay) {
+		pr_warn("Dropping fc block %d because max_replay set\n", off);
+		return JBD2_FC_REPLAY_STOP;
+	}
+#endif
+
+	start = (u8 *)bh->b_data;
+	end = (__u8 *)bh->b_data + journal->j_blocksize - 1;
+
+	fc_for_each_tl(start, end, tl) {
+		if (state->fc_replay_num_tags == 0) {
+			ret = JBD2_FC_REPLAY_STOP;
+			ext4_fc_set_bitmaps_and_counters(sb);
+			break;
+		}
+		jbd_debug(3, "Replay phase, tag:%s\n",
+				tag2str(le16_to_cpu(tl->fc_tag)));
+		state->fc_replay_num_tags--;
+		switch (le16_to_cpu(tl->fc_tag)) {
+		case EXT4_FC_TAG_LINK:
+			ret = ext4_fc_replay_link(sb, tl);
+			break;
+		case EXT4_FC_TAG_UNLINK:
+			ret = ext4_fc_replay_unlink(sb, tl);
+			break;
+		case EXT4_FC_TAG_ADD_RANGE:
+			ret = ext4_fc_replay_add_range(sb, tl);
+			break;
+		case EXT4_FC_TAG_CREAT:
+			ret = ext4_fc_replay_create(sb, tl);
+			break;
+		case EXT4_FC_TAG_DEL_RANGE:
+			ret = ext4_fc_replay_del_range(sb, tl);
+			break;
+		case EXT4_FC_TAG_INODE:
+			ret = ext4_fc_replay_inode(sb, tl);
+			break;
+		case EXT4_FC_TAG_PAD:
+			trace_ext4_fc_replay(sb, EXT4_FC_TAG_PAD, 0,
+				ext4_fc_tag_len(tl), 0);
+			break;
+		case EXT4_FC_TAG_TAIL:
+			trace_ext4_fc_replay(sb, EXT4_FC_TAG_TAIL, 0,
+				ext4_fc_tag_len(tl), 0);
+			tail = (struct ext4_fc_tail *)ext4_fc_tag_val(tl);
+			WARN_ON(le32_to_cpu(tail->fc_tid) != expected_tid);
+			break;
+		case EXT4_FC_TAG_HEAD:
+			break;
+		default:
+			trace_ext4_fc_replay(sb, le16_to_cpu(tl->fc_tag), 0,
+				ext4_fc_tag_len(tl), 0);
+			ret = -ECANCELED;
+			break;
+		}
+		if (ret < 0)
+			break;
+		ret = JBD2_FC_REPLAY_CONTINUE;
+	}
+	return ret;
 }
 
 void ext4_fc_init(struct super_block *sb, journal_t *journal)
diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
index 560bc9ca8c79..06907d485989 100644
--- a/fs/ext4/fast_commit.h
+++ b/fs/ext4/fast_commit.h
@@ -116,4 +116,44 @@ struct ext4_fc_stats {
 	unsigned long fc_numblks;
 };
 
+#define EXT4_FC_REPLAY_REALLOC_INCREMENT	4
+
+/*
+ * Physical block regions added to different inodes due to fast commit
+ * recovery. These are set during the SCAN phase. During the replay phase,
+ * our allocator excludes these from its allocation. This ensures that
+ * we don't accidentally allocating a block that is going to be used by
+ * another inode.
+ */
+struct ext4_fc_alloc_region {
+	ext4_lblk_t lblk;
+	ext4_fsblk_t pblk;
+	int ino, len;
+};
+
+/*
+ * Fast commit replay state.
+ */
+struct ext4_fc_replay_state {
+	int fc_replay_num_tags;
+	int fc_replay_expected_off;
+	int fc_current_pass;
+	int fc_cur_tag;
+	int fc_crc;
+	struct ext4_fc_alloc_region *fc_regions;
+	int fc_regions_size, fc_regions_used, fc_regions_valid;
+	int *fc_modified_inodes;
+	int fc_modified_inodes_used, fc_modified_inodes_size;
+};
+
+#define region_last(__region) (((__region)->lblk) + ((__region)->len) - 1)
+
+#define fc_for_each_tl(__start, __end, __tl)				\
+	for (tl = (struct ext4_fc_tl *)start;				\
+		(u8 *)tl < (u8 *)end;					\
+		tl = (struct ext4_fc_tl *)((u8 *)tl +			\
+					sizeof(struct ext4_fc_tl) +	\
+					+ le16_to_cpu(tl->fc_len)))
+
+
 #endif /* __FAST_COMMIT_H__ */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index 33c0fc0197ce..2400a8200435 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -82,7 +82,12 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
 				      struct buffer_head *bh)
 {
 	ext4_fsblk_t	blk;
-	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+	struct ext4_group_info *grp;
+
+	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
+	grp = ext4_get_group_info(sb, block_group);
 
 	if (buffer_verified(bh))
 		return 0;
@@ -281,15 +286,17 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
 	bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
 	/* Don't bother if the inode bitmap is corrupt. */
-	grp = ext4_get_group_info(sb, block_group);
 	if (IS_ERR(bitmap_bh)) {
 		fatal = PTR_ERR(bitmap_bh);
 		bitmap_bh = NULL;
 		goto error_return;
 	}
-	if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
-		fatal = -EFSCORRUPTED;
-		goto error_return;
+	if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+		grp = ext4_get_group_info(sb, block_group);
+		if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
+			fatal = -EFSCORRUPTED;
+			goto error_return;
+		}
 	}
 
 	BUFFER_TRACE(bitmap_bh, "get_write_access");
@@ -739,6 +746,122 @@ static int find_inode_bit(struct super_block *sb, ext4_group_t group,
 	return 1;
 }
 
+int ext4_mark_inode_used(struct super_block *sb, int ino)
+{
+	unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count);
+	struct buffer_head *inode_bitmap_bh = NULL, *group_desc_bh = NULL;
+	struct ext4_group_desc *gdp;
+	ext4_group_t group;
+	int bit;
+	int err = -EFSCORRUPTED;
+
+	if (ino < EXT4_FIRST_INO(sb) || ino > max_ino)
+		goto out;
+
+	group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
+	bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
+	inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
+	if (IS_ERR(inode_bitmap_bh))
+		return PTR_ERR(inode_bitmap_bh);
+
+	if (ext4_test_bit(bit, inode_bitmap_bh->b_data)) {
+		err = 0;
+		goto out;
+	}
+
+	gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
+	if (!gdp || !group_desc_bh) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	ext4_set_bit(bit, inode_bitmap_bh->b_data);
+
+	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+	err = ext4_handle_dirty_metadata(NULL, NULL, inode_bitmap_bh);
+	if (err) {
+		ext4_std_error(sb, err);
+		goto out;
+	}
+	err = sync_dirty_buffer(inode_bitmap_bh);
+	if (err) {
+		ext4_std_error(sb, err);
+		goto out;
+	}
+
+	/* We may have to initialize the block bitmap if it isn't already */
+	if (ext4_has_group_desc_csum(sb) &&
+	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+		struct buffer_head *block_bitmap_bh;
+
+		block_bitmap_bh = ext4_read_block_bitmap(sb, group);
+		if (IS_ERR(block_bitmap_bh)) {
+			err = PTR_ERR(block_bitmap_bh);
+			goto out;
+		}
+
+		BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap");
+		err = ext4_handle_dirty_metadata(NULL, NULL, block_bitmap_bh);
+		sync_dirty_buffer(block_bitmap_bh);
+
+		/* recheck and clear flag under lock if we still need to */
+		ext4_lock_group(sb, group);
+		if (ext4_has_group_desc_csum(sb) &&
+		    (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+			ext4_free_group_clusters_set(sb, gdp,
+				ext4_free_clusters_after_init(sb, group, gdp));
+			ext4_block_bitmap_csum_set(sb, group, gdp,
+						   block_bitmap_bh);
+			ext4_group_desc_csum_set(sb, group, gdp);
+		}
+		ext4_unlock_group(sb, group);
+		brelse(block_bitmap_bh);
+
+		if (err) {
+			ext4_std_error(sb, err);
+			goto out;
+		}
+	}
+
+	/* Update the relevant bg descriptor fields */
+	if (ext4_has_group_desc_csum(sb)) {
+		int free;
+
+		ext4_lock_group(sb, group); /* while we modify the bg desc */
+		free = EXT4_INODES_PER_GROUP(sb) -
+			ext4_itable_unused_count(sb, gdp);
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
+			free = 0;
+		}
+
+		/*
+		 * Check the relative inode number against the last used
+		 * relative inode number in this group. if it is greater
+		 * we need to update the bg_itable_unused count
+		 */
+		if (bit >= free)
+			ext4_itable_unused_set(sb, gdp,
+					(EXT4_INODES_PER_GROUP(sb) - bit - 1));
+	} else {
+		ext4_lock_group(sb, group);
+	}
+
+	ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1);
+	if (ext4_has_group_desc_csum(sb)) {
+		ext4_inode_bitmap_csum_set(sb, group, gdp, inode_bitmap_bh,
+					   EXT4_INODES_PER_GROUP(sb) / 8);
+		ext4_group_desc_csum_set(sb, group, gdp);
+	}
+
+	ext4_unlock_group(sb, group);
+	err = ext4_handle_dirty_metadata(NULL, NULL, group_desc_bh);
+	sync_dirty_buffer(group_desc_bh);
+out:
+	return err;
+}
+
 /*
  * There are two policies for allocating an inode.  If the new inode is
  * a directory, then a forward search is made for a block group with both
@@ -768,7 +891,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	struct inode *ret;
 	ext4_group_t i;
 	ext4_group_t flex_group;
-	struct ext4_group_info *grp;
+	struct ext4_group_info *grp = NULL;
 	int encrypt = 0;
 
 	/* Cannot create files in a deleted directory */
@@ -906,15 +1029,21 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		if (ext4_free_inodes_count(sb, gdp) == 0)
 			goto next_group;
 
-		grp = ext4_get_group_info(sb, group);
-		/* Skip groups with already-known suspicious inode tables */
-		if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
-			goto next_group;
+		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+			grp = ext4_get_group_info(sb, group);
+			/*
+			 * Skip groups with already-known suspicious inode
+			 * tables
+			 */
+			if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
+				goto next_group;
+		}
 
 		brelse(inode_bitmap_bh);
 		inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
 		/* Skip groups with suspicious inode tables */
-		if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp) ||
+		if (((!(sbi->s_mount_state & EXT4_FC_REPLAY))
+		     && EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) ||
 		    IS_ERR(inode_bitmap_bh)) {
 			inode_bitmap_bh = NULL;
 			goto next_group;
@@ -933,7 +1062,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			goto next_group;
 		}
 
-		if (!handle) {
+		if ((!(sbi->s_mount_state & EXT4_FC_REPLAY)) && !handle) {
 			BUG_ON(nblocks <= 0);
 			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
 				 handle_type, nblocks, 0,
@@ -1037,9 +1166,15 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	/* Update the relevant bg descriptor fields */
 	if (ext4_has_group_desc_csum(sb)) {
 		int free;
-		struct ext4_group_info *grp = ext4_get_group_info(sb, group);
-
-		down_read(&grp->alloc_sem); /* protect vs itable lazyinit */
+		struct ext4_group_info *grp = NULL;
+
+		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+			grp = ext4_get_group_info(sb, group);
+			down_read(&grp->alloc_sem); /*
+						     * protect vs itable
+						     * lazyinit
+						     */
+		}
 		ext4_lock_group(sb, group); /* while we modify the bg desc */
 		free = EXT4_INODES_PER_GROUP(sb) -
 			ext4_itable_unused_count(sb, gdp);
@@ -1055,7 +1190,8 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		if (ino > free)
 			ext4_itable_unused_set(sb, gdp,
 					(EXT4_INODES_PER_GROUP(sb) - ino));
-		up_read(&grp->alloc_sem);
+		if (!(sbi->s_mount_state & EXT4_FC_REPLAY))
+			up_read(&grp->alloc_sem);
 	} else {
 		ext4_lock_group(sb, group);
 	}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f5e9c76c9b07..2154e08d8026 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -101,8 +101,8 @@ static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
 	return provided == calculated;
 }
 
-static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
-				struct ext4_inode_info *ei)
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+			 struct ext4_inode_info *ei)
 {
 	__u32 csum;
 
@@ -514,7 +514,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		return -EFSCORRUPTED;
 
 	/* Lookup extent status tree firstly */
-	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
+	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) &&
+	    ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
 		if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
 			map->m_pblk = ext4_es_pblock(&es) +
 					map->m_lblk - es.es_lblk;
@@ -827,7 +828,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
 	int create = map_flags & EXT4_GET_BLOCKS_CREATE;
 	int err;
 
-	J_ASSERT(handle != NULL || create == 0);
+	J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		 || handle != NULL || create == 0);
 
 	map.m_lblk = block;
 	map.m_len = 1;
@@ -843,7 +845,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
 		return ERR_PTR(-ENOMEM);
 	if (map.m_flags & EXT4_MAP_NEW) {
 		J_ASSERT(create != 0);
-		J_ASSERT(handle != NULL);
+		J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+			 || (handle != NULL));
 
 		/*
 		 * Now that we do not always journal data, we should
@@ -4279,22 +4282,22 @@ int ext4_truncate(struct inode *inode)
  * data in memory that is needed to recreate the on-disk version of this
  * inode.
  */
-static int __ext4_get_inode_loc(struct inode *inode,
-				struct ext4_iloc *iloc, int in_mem)
+static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
+				struct ext4_iloc *iloc, int in_mem,
+				ext4_fsblk_t *ret_block)
 {
 	struct ext4_group_desc	*gdp;
 	struct buffer_head	*bh;
-	struct super_block	*sb = inode->i_sb;
 	ext4_fsblk_t		block;
 	struct blk_plug		plug;
 	int			inodes_per_block, inode_offset;
 
 	iloc->bh = NULL;
-	if (inode->i_ino < EXT4_ROOT_INO ||
-	    inode->i_ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
+	if (ino < EXT4_ROOT_INO ||
+	    ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
 		return -EFSCORRUPTED;
 
-	iloc->block_group = (inode->i_ino - 1) / EXT4_INODES_PER_GROUP(sb);
+	iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
 	gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
 	if (!gdp)
 		return -EIO;
@@ -4303,7 +4306,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
 	 * Figure out the offset within the block group inode table
 	 */
 	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
-	inode_offset = ((inode->i_ino - 1) %
+	inode_offset = ((ino - 1) %
 			EXT4_INODES_PER_GROUP(sb));
 	block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
 	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
@@ -4395,14 +4398,14 @@ static int __ext4_get_inode_loc(struct inode *inode,
 		 * has in-inode xattrs, or we don't have this inode in memory.
 		 * Read the block from disk.
 		 */
-		trace_ext4_load_inode(inode);
+		trace_ext4_load_inode(sb, ino);
 		ext4_read_bh_nowait(bh, REQ_META | REQ_PRIO, NULL);
 		blk_finish_plug(&plug);
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh)) {
 		simulate_eio:
-			ext4_error_inode_block(inode, block, EIO,
-					       "unable to read itable block");
+			if (ret_block)
+				*ret_block = block;
 			brelse(bh);
 			return -EIO;
 		}
@@ -4412,11 +4415,43 @@ static int __ext4_get_inode_loc(struct inode *inode,
 	return 0;
 }
 
+static int __ext4_get_inode_loc_noinmem(struct inode *inode,
+					struct ext4_iloc *iloc)
+{
+	ext4_fsblk_t err_blk;
+	int ret;
+
+	ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc, 0,
+					&err_blk);
+
+	if (ret == -EIO)
+		ext4_error_inode_block(inode, err_blk, EIO,
+					"unable to read itable block");
+
+	return ret;
+}
+
 int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
 {
+	ext4_fsblk_t err_blk;
+	int ret;
+
 	/* We have all inode data except xattrs in memory here. */
-	return __ext4_get_inode_loc(inode, iloc,
-		!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
+	ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc,
+		!ext4_test_inode_state(inode, EXT4_STATE_XATTR), &err_blk);
+
+	if (ret == -EIO)
+		ext4_error_inode_block(inode, err_blk, EIO,
+					"unable to read itable block");
+
+	return ret;
+}
+
+
+int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+			  struct ext4_iloc *iloc)
+{
+	return __ext4_get_inode_loc(sb, ino, iloc, 0, NULL);
 }
 
 static bool ext4_should_enable_dax(struct inode *inode)
@@ -4582,7 +4617,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	ei = EXT4_I(inode);
 	iloc.bh = NULL;
 
-	ret = __ext4_get_inode_loc(inode, &iloc, 0);
+	ret = __ext4_get_inode_loc_noinmem(inode, &iloc);
 	if (ret < 0)
 		goto bad_inode;
 	raw_inode = ext4_raw_inode(&iloc);
@@ -4628,10 +4663,11 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 					      sizeof(gen));
 	}
 
-	if (!ext4_inode_csum_verify(inode, raw_inode, ei) ||
-	    ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) {
-		ext4_error_inode_err(inode, function, line, 0, EFSBADCRC,
-				     "iget: checksum invalid");
+	if ((!ext4_inode_csum_verify(inode, raw_inode, ei) ||
+	    ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) &&
+	     (!(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))) {
+		ext4_error_inode_err(inode, function, line, 0,
+				EFSBADCRC, "iget: checksum invalid");
 		ret = -EFSBADCRC;
 		goto bad_inode;
 	}
@@ -4785,9 +4821,10 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 		goto bad_inode;
 	} else if (!ext4_has_inline_data(inode)) {
 		/* validate the block references in the inode */
-		if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
-		   (S_ISLNK(inode->i_mode) &&
-		    !ext4_inode_is_fast_symlink(inode))) {
+		if (!(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY) &&
+			(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+			(S_ISLNK(inode->i_mode) &&
+			!ext4_inode_is_fast_symlink(inode)))) {
 			if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 				ret = ext4_ext_check_inode(inode);
 			else
@@ -5171,7 +5208,7 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 	} else {
 		struct ext4_iloc iloc;
 
-		err = __ext4_get_inode_loc(inode, &iloc, 0);
+		err = __ext4_get_inode_loc_noinmem(inode, &iloc);
 		if (err)
 			return err;
 		/*
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index d2f8f50deef6..f0381876a7e5 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -86,7 +86,7 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
 	i_size_write(inode2, isize);
 }
 
-static void reset_inode_seed(struct inode *inode)
+void ext4_reset_inode_seed(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -200,8 +200,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
 
 	inode->i_generation = prandom_u32();
 	inode_bl->i_generation = prandom_u32();
-	reset_inode_seed(inode);
-	reset_inode_seed(inode_bl);
+	ext4_reset_inode_seed(inode);
+	ext4_reset_inode_seed(inode_bl);
 
 	ext4_discard_preallocations(inode, 0);
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 74a48d6ff9cc..85abbfb98cbe 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1502,14 +1502,16 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
 
 		blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
 		blocknr += EXT4_C2B(sbi, block);
-		ext4_grp_locked_error(sb, e4b->bd_group,
-				      inode ? inode->i_ino : 0,
-				      blocknr,
-				      "freeing already freed block "
-				      "(bit %u); block bitmap corrupt.",
-				      block);
-		ext4_mark_group_bitmap_corrupted(sb, e4b->bd_group,
+		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+			ext4_grp_locked_error(sb, e4b->bd_group,
+					      inode ? inode->i_ino : 0,
+					      blocknr,
+					      "freeing already freed block (bit %u); block bitmap corrupt.",
+					      block);
+			ext4_mark_group_bitmap_corrupted(
+				sb, e4b->bd_group,
 				EXT4_GROUP_INFO_BBITMAP_CORRUPT);
+		}
 		mb_regenerate_buddy(e4b);
 		goto done;
 	}
@@ -3296,6 +3298,84 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
 	return err;
 }
 
+/*
+ * Idempotent helper for Ext4 fast commit replay path to set the state of
+ * blocks in bitmaps and update counters.
+ */
+void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
+			int len, int state)
+{
+	struct buffer_head *bitmap_bh = NULL;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *gdp_bh;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	ext4_group_t group;
+	ext4_grpblk_t blkoff;
+	int i, clen, err;
+	int already;
+
+	clen = EXT4_B2C(sbi, len);
+
+	ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
+	bitmap_bh = ext4_read_block_bitmap(sb, group);
+	if (IS_ERR(bitmap_bh)) {
+		err = PTR_ERR(bitmap_bh);
+		bitmap_bh = NULL;
+		goto out_err;
+	}
+
+	err = -EIO;
+	gdp = ext4_get_group_desc(sb, group, &gdp_bh);
+	if (!gdp)
+		goto out_err;
+
+	ext4_lock_group(sb, group);
+	already = 0;
+	for (i = 0; i < clen; i++)
+		if (!mb_test_bit(blkoff + i, bitmap_bh->b_data) == !state)
+			already++;
+
+	if (state)
+		ext4_set_bits(bitmap_bh->b_data, blkoff, clen);
+	else
+		mb_test_and_clear_bits(bitmap_bh->b_data, blkoff, clen);
+	if (ext4_has_group_desc_csum(sb) &&
+	    (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
+		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+		ext4_free_group_clusters_set(sb, gdp,
+					     ext4_free_clusters_after_init(sb,
+						group, gdp));
+	}
+	if (state)
+		clen = ext4_free_group_clusters(sb, gdp) - clen + already;
+	else
+		clen = ext4_free_group_clusters(sb, gdp) + clen - already;
+
+	ext4_free_group_clusters_set(sb, gdp, clen);
+	ext4_block_bitmap_csum_set(sb, group, gdp, bitmap_bh);
+	ext4_group_desc_csum_set(sb, group, gdp);
+
+	ext4_unlock_group(sb, group);
+
+	if (sbi->s_log_groups_per_flex) {
+		ext4_group_t flex_group = ext4_flex_group(sbi, group);
+
+		atomic64_sub(len,
+			     &sbi_array_rcu_deref(sbi, s_flex_groups,
+						  flex_group)->free_clusters);
+	}
+
+	err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+	if (err)
+		goto out_err;
+	sync_dirty_buffer(bitmap_bh);
+	err = ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
+	sync_dirty_buffer(gdp_bh);
+
+out_err:
+	brelse(bitmap_bh);
+}
+
 /*
  * here we normalize request for locality group
  * Group request are normalized to s_mb_group_prealloc, which goes to
@@ -4272,6 +4352,9 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 		return;
 	}
 
+	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+		return;
+
 	mb_debug(sb, "discard preallocation for inode %lu\n",
 		 inode->i_ino);
 	trace_ext4_discard_preallocations(inode,
@@ -4819,6 +4902,9 @@ static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
 	return ret;
 }
 
+static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle,
+				struct ext4_allocation_request *ar, int *errp);
+
 /*
  * Main entry point into mballoc to allocate blocks
  * it tries to use preallocation first, then falls back
@@ -4840,6 +4926,8 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 	sbi = EXT4_SB(sb);
 
 	trace_ext4_request_blocks(ar);
+	if (sbi->s_mount_state & EXT4_FC_REPLAY)
+		return ext4_mb_new_blocks_simple(handle, ar, errp);
 
 	/* Allow to use superuser reservation for quota file */
 	if (ext4_is_quota_file(ar->inode))
@@ -5067,6 +5155,102 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	return 0;
 }
 
+/*
+ * Simple allocator for Ext4 fast commit replay path. It searches for blocks
+ * linearly starting at the goal block and also excludes the blocks which
+ * are going to be in use after fast commit replay.
+ */
+static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle,
+				struct ext4_allocation_request *ar, int *errp)
+{
+	struct buffer_head *bitmap_bh;
+	struct super_block *sb = ar->inode->i_sb;
+	ext4_group_t group;
+	ext4_grpblk_t blkoff;
+	int  i;
+	ext4_fsblk_t goal, block;
+	struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+
+	goal = ar->goal;
+	if (goal < le32_to_cpu(es->s_first_data_block) ||
+			goal >= ext4_blocks_count(es))
+		goal = le32_to_cpu(es->s_first_data_block);
+
+	ar->len = 0;
+	ext4_get_group_no_and_offset(sb, goal, &group, &blkoff);
+	for (; group < ext4_get_groups_count(sb); group++) {
+		bitmap_bh = ext4_read_block_bitmap(sb, group);
+		if (IS_ERR(bitmap_bh)) {
+			*errp = PTR_ERR(bitmap_bh);
+			pr_warn("Failed to read block bitmap\n");
+			return 0;
+		}
+
+		ext4_get_group_no_and_offset(sb,
+			max(ext4_group_first_block_no(sb, group), goal),
+			NULL, &blkoff);
+		i = mb_find_next_zero_bit(bitmap_bh->b_data, sb->s_blocksize,
+						blkoff);
+		brelse(bitmap_bh);
+		if (i >= sb->s_blocksize)
+			continue;
+		if (ext4_fc_replay_check_excluded(sb,
+			ext4_group_first_block_no(sb, group) + i))
+			continue;
+		break;
+	}
+
+	if (group >= ext4_get_groups_count(sb) && i >= sb->s_blocksize)
+		return 0;
+
+	block = ext4_group_first_block_no(sb, group) + i;
+	ext4_mb_mark_bb(sb, block, 1, 1);
+	ar->len = 1;
+
+	return block;
+}
+
+static void ext4_free_blocks_simple(struct inode *inode, ext4_fsblk_t block,
+					unsigned long count)
+{
+	struct buffer_head *bitmap_bh;
+	struct super_block *sb = inode->i_sb;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *gdp_bh;
+	ext4_group_t group;
+	ext4_grpblk_t blkoff;
+	int already_freed = 0, err, i;
+
+	ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
+	bitmap_bh = ext4_read_block_bitmap(sb, group);
+	if (IS_ERR(bitmap_bh)) {
+		err = PTR_ERR(bitmap_bh);
+		pr_warn("Failed to read block bitmap\n");
+		return;
+	}
+	gdp = ext4_get_group_desc(sb, group, &gdp_bh);
+	if (!gdp)
+		return;
+
+	for (i = 0; i < count; i++) {
+		if (!mb_test_bit(blkoff + i, bitmap_bh->b_data))
+			already_freed++;
+	}
+	mb_clear_bits(bitmap_bh->b_data, blkoff, count);
+	err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+	if (err)
+		return;
+	ext4_free_group_clusters_set(
+		sb, gdp, ext4_free_group_clusters(sb, gdp) +
+		count - already_freed);
+	ext4_block_bitmap_csum_set(sb, group, gdp, bitmap_bh);
+	ext4_group_desc_csum_set(sb, group, gdp);
+	ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
+	sync_dirty_buffer(bitmap_bh);
+	sync_dirty_buffer(gdp_bh);
+	brelse(bitmap_bh);
+}
+
 /**
  * ext4_free_blocks() -- Free given blocks and update quota
  * @handle:		handle for this transaction
@@ -5093,6 +5277,13 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 	int err = 0;
 	int ret;
 
+	sbi = EXT4_SB(sb);
+
+	if (sbi->s_mount_state & EXT4_FC_REPLAY) {
+		ext4_free_blocks_simple(inode, block, count);
+		return;
+	}
+
 	might_sleep();
 	if (bh) {
 		if (block)
@@ -5101,7 +5292,6 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 			block = bh->b_blocknr;
 	}
 
-	sbi = EXT4_SB(sb);
 	if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) &&
 	    !ext4_inode_block_valid(inode, block, count)) {
 		ext4_error(sb, "Freeing blocks not in datazone - "
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index fd7be1435f2d..cde346074662 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2749,7 +2749,7 @@ struct ext4_dir_entry_2 *ext4_init_dot_dotdot(struct inode *inode,
 	return ext4_next_entry(de, blocksize);
 }
 
-static int ext4_init_new_dir(handle_t *handle, struct inode *dir,
+int ext4_init_new_dir(handle_t *handle, struct inode *dir,
 			     struct inode *inode)
 {
 	struct buffer_head *dir_block = NULL;
@@ -3197,42 +3197,32 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
 	return retval;
 }
 
-static int ext4_unlink(struct inode *dir, struct dentry *dentry)
+int __ext4_unlink(struct inode *dir, const struct qstr *d_name,
+		  struct inode *inode)
 {
-	int retval;
-	struct inode *inode;
+	int retval = -ENOENT;
 	struct buffer_head *bh;
 	struct ext4_dir_entry_2 *de;
 	handle_t *handle = NULL;
+	int skip_remove_dentry = 0;
 
-	if (unlikely(ext4_forced_shutdown(EXT4_SB(dir->i_sb))))
-		return -EIO;
-
-	trace_ext4_unlink_enter(dir, dentry);
-	/* Initialize quotas before so that eventual writes go
-	 * in separate transaction */
-	retval = dquot_initialize(dir);
-	if (retval)
-		goto out_trace;
-	retval = dquot_initialize(d_inode(dentry));
-	if (retval)
-		goto out_trace;
-
-	bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
-	if (IS_ERR(bh)) {
-		retval = PTR_ERR(bh);
-		goto out_trace;
-	}
-	if (!bh) {
-		retval = -ENOENT;
-		goto out_trace;
-	}
+	bh = ext4_find_entry(dir, d_name, &de, NULL);
+	if (IS_ERR(bh))
+		return PTR_ERR(bh);
 
-	inode = d_inode(dentry);
+	if (!bh)
+		return -ENOENT;
 
 	if (le32_to_cpu(de->inode) != inode->i_ino) {
-		retval = -EFSCORRUPTED;
-		goto out_bh;
+		/*
+		 * It's okay if we find dont find dentry which matches
+		 * the inode. That's because it might have gotten
+		 * renamed to a different inode number
+		 */
+		if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+			skip_remove_dentry = 1;
+		else
+			goto out_bh;
 	}
 
 	handle = ext4_journal_start(dir, EXT4_HT_DIR,
@@ -3245,17 +3235,21 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 	if (IS_DIRSYNC(dir))
 		ext4_handle_sync(handle);
 
-	retval = ext4_delete_entry(handle, dir, de, bh);
-	if (retval)
-		goto out_handle;
-	dir->i_ctime = dir->i_mtime = current_time(dir);
-	ext4_update_dx_flag(dir);
-	retval = ext4_mark_inode_dirty(handle, dir);
-	if (retval)
-		goto out_handle;
+	if (!skip_remove_dentry) {
+		retval = ext4_delete_entry(handle, dir, de, bh);
+		if (retval)
+			goto out_handle;
+		dir->i_ctime = dir->i_mtime = current_time(dir);
+		ext4_update_dx_flag(dir);
+		retval = ext4_mark_inode_dirty(handle, dir);
+		if (retval)
+			goto out_handle;
+	} else {
+		retval = 0;
+	}
 	if (inode->i_nlink == 0)
 		ext4_warning_inode(inode, "Deleting file '%.*s' with no links",
-				   dentry->d_name.len, dentry->d_name.name);
+				   d_name->len, d_name->name);
 	else
 		drop_nlink(inode);
 	if (!inode->i_nlink)
@@ -3263,6 +3257,33 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 	inode->i_ctime = current_time(inode);
 	retval = ext4_mark_inode_dirty(handle, inode);
 
+out_handle:
+	ext4_journal_stop(handle);
+out_bh:
+	brelse(bh);
+	return retval;
+}
+
+static int ext4_unlink(struct inode *dir, struct dentry *dentry)
+{
+	int retval;
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(dir->i_sb))))
+		return -EIO;
+
+	trace_ext4_unlink_enter(dir, dentry);
+	/*
+	 * Initialize quotas before so that eventual writes go
+	 * in separate transaction
+	 */
+	retval = dquot_initialize(dir);
+	if (retval)
+		goto out_trace;
+	retval = dquot_initialize(d_inode(dentry));
+	if (retval)
+		goto out_trace;
+
+	retval = __ext4_unlink(dir, &dentry->d_name, d_inode(dentry));
 	if (!retval)
 		ext4_fc_track_unlink(d_inode(dentry), dentry);
 #ifdef CONFIG_UNICODE
@@ -3276,10 +3297,6 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 		d_invalidate(dentry);
 #endif
 
-out_handle:
-	ext4_journal_stop(handle);
-out_bh:
-	brelse(bh);
 out_trace:
 	trace_ext4_unlink_exit(dentry, retval);
 	return retval;
@@ -3360,7 +3377,8 @@ static int ext4_symlink(struct inode *dir,
 		 */
 		drop_nlink(inode);
 		err = ext4_orphan_add(handle, inode);
-		ext4_journal_stop(handle);
+		if (handle)
+			ext4_journal_stop(handle);
 		handle = NULL;
 		if (err)
 			goto err_drop_inode;
@@ -3414,29 +3432,10 @@ static int ext4_symlink(struct inode *dir,
 	return err;
 }
 
-static int ext4_link(struct dentry *old_dentry,
-		     struct inode *dir, struct dentry *dentry)
+int __ext4_link(struct inode *dir, struct inode *inode, struct dentry *dentry)
 {
 	handle_t *handle;
-	struct inode *inode = d_inode(old_dentry);
 	int err, retries = 0;
-
-	if (inode->i_nlink >= EXT4_LINK_MAX)
-		return -EMLINK;
-
-	err = fscrypt_prepare_link(old_dentry, dir, dentry);
-	if (err)
-		return err;
-
-	if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) &&
-	    (!projid_eq(EXT4_I(dir)->i_projid,
-			EXT4_I(old_dentry->d_inode)->i_projid)))
-		return -EXDEV;
-
-	err = dquot_initialize(dir);
-	if (err)
-		return err;
-
 retry:
 	handle = ext4_journal_start(dir, EXT4_HT_DIR,
 		(EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
@@ -3453,6 +3452,7 @@ static int ext4_link(struct dentry *old_dentry,
 
 	err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
+		ext4_fc_track_link(inode, dentry);
 		err = ext4_mark_inode_dirty(handle, inode);
 		/* this can happen only for tmpfile being
 		 * linked the first time
@@ -3470,6 +3470,29 @@ static int ext4_link(struct dentry *old_dentry,
 	return err;
 }
 
+static int ext4_link(struct dentry *old_dentry,
+		     struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = d_inode(old_dentry);
+	int err;
+
+	if (inode->i_nlink >= EXT4_LINK_MAX)
+		return -EMLINK;
+
+	err = fscrypt_prepare_link(old_dentry, dir, dentry);
+	if (err)
+		return err;
+
+	if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) &&
+	    (!projid_eq(EXT4_I(dir)->i_projid,
+			EXT4_I(old_dentry->d_inode)->i_projid)))
+		return -EXDEV;
+
+	err = dquot_initialize(dir);
+	if (err)
+		return err;
+	return __ext4_link(dir, inode, dentry);
+}
 
 /*
  * Try to find buffer head where contains the parent block.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 505cebd26235..ced05c6879a6 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1718,6 +1718,9 @@ enum {
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
 	Opt_prefetch_block_bitmaps, Opt_no_fc,
+#ifdef CONFIG_EXT4_DEBUG
+	Opt_fc_debug_max_replay
+#endif
 };
 
 static const match_table_t tokens = {
@@ -1805,6 +1808,9 @@ static const match_table_t tokens = {
 	{Opt_init_itable, "init_itable"},
 	{Opt_noinit_itable, "noinit_itable"},
 	{Opt_no_fc, "no_fc"},
+#ifdef CONFIG_EXT4_DEBUG
+	{Opt_fc_debug_max_replay, "fc_debug_max_replay=%u"},
+#endif
 	{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption"},
@@ -2034,6 +2040,9 @@ static const struct mount_opts {
 	 MOPT_SET},
 	{Opt_no_fc, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
 	 MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
+#ifdef CONFIG_EXT4_DEBUG
+	{Opt_fc_debug_max_replay, 0, MOPT_GTE0},
+#endif
 	{Opt_err, 0, 0}
 };
 
@@ -2242,6 +2251,10 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		sbi->s_li_wait_mult = arg;
 	} else if (token == Opt_max_dir_size_kb) {
 		sbi->s_max_dir_size_kb = arg;
+#ifdef CONFIG_EXT4_DEBUG
+	} else if (token == Opt_fc_debug_max_replay) {
+		sbi->s_fc_debug_max_replay = arg;
+#endif
 	} else if (token == Opt_stripe) {
 		sbi->s_stripe = arg;
 	} else if (token == Opt_resuid) {
@@ -4764,6 +4777,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
 	spin_lock_init(&sbi->s_fc_lock);
 	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
+	sbi->s_fc_replay_state.fc_regions = NULL;
+	sbi->s_fc_replay_state.fc_regions_size = 0;
+	sbi->s_fc_replay_state.fc_regions_used = 0;
+	sbi->s_fc_replay_state.fc_regions_valid = 0;
+	sbi->s_fc_replay_state.fc_modified_inodes = NULL;
+	sbi->s_fc_replay_state.fc_modified_inodes_size = 0;
+	sbi->s_fc_replay_state.fc_modified_inodes_used = 0;
 
 	sb->s_root = NULL;
 
@@ -4979,6 +4999,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			goto failed_mount4a;
 		}
 	}
+	ext4_fc_replay_cleanup(sb);
 
 	ext4_ext_init(sb);
 	err = ext4_mb_init(sb);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 521de3a82118..b14314fcf732 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1776,9 +1776,9 @@ TRACE_EVENT(ext4_ext_load_extent,
 );
 
 TRACE_EVENT(ext4_load_inode,
-	TP_PROTO(struct inode *inode),
+	TP_PROTO(struct super_block *sb, unsigned long ino),
 
-	TP_ARGS(inode),
+	TP_ARGS(sb, ino),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev		)
@@ -1786,8 +1786,8 @@ TRACE_EVENT(ext4_load_inode,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= inode->i_sb->s_dev;
-		__entry->ino		= inode->i_ino;
+		__entry->dev		= sb->s_dev;
+		__entry->ino		= ino;
 	),
 
 	TP_printk("dev %d,%d ino %ld",
@@ -2801,6 +2801,54 @@ TRACE_EVENT(ext4_lazy_itable_init,
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->group)
 );
 
+TRACE_EVENT(ext4_fc_replay_scan,
+	TP_PROTO(struct super_block *sb, int error, int off),
+
+	TP_ARGS(sb, error, off),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, error)
+		__field(int, off)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->error = error;
+		__entry->off = off;
+	),
+
+	TP_printk("FC scan pass on dev %d,%d: error %d, off %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->error, __entry->off)
+);
+
+TRACE_EVENT(ext4_fc_replay,
+	TP_PROTO(struct super_block *sb, int tag, int ino, int priv1, int priv2),
+
+	TP_ARGS(sb, tag, ino, priv1, priv2),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, tag)
+		__field(int, ino)
+		__field(int, priv1)
+		__field(int, priv2)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->tag = tag;
+		__entry->ino = ino;
+		__entry->priv1 = priv1;
+		__entry->priv2 = priv2;
+	),
+
+	TP_printk("FC Replay %d,%d: tag %d, ino %d, data1 %d, data2 %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->tag, __entry->ino, __entry->priv1, __entry->priv2)
+);
+
 TRACE_EVENT(ext4_fc_commit_start,
 	TP_PROTO(struct super_block *sb),
 
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v10 8/9] ext4: add a mount opt to forcefully turn fast commits on
  2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
                   ` (6 preceding siblings ...)
  2020-10-15 20:37 ` [PATCH v10 7/9] ext4: " Harshad Shirwadkar
@ 2020-10-15 20:38 ` Harshad Shirwadkar
  2020-10-15 20:38 ` [PATCH v10 9/9] ext4: add fast commit stats in procfs Harshad Shirwadkar
  8 siblings, 0 replies; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:38 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This is a debug only mount option that forcefully turns fast commits
on at mount time.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/super.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ced05c6879a6..114753e66391 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1719,8 +1719,9 @@ enum {
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
 	Opt_prefetch_block_bitmaps, Opt_no_fc,
 #ifdef CONFIG_EXT4_DEBUG
-	Opt_fc_debug_max_replay
+	Opt_fc_debug_max_replay,
 #endif
+	Opt_fc_debug_force
 };
 
 static const match_table_t tokens = {
@@ -1808,6 +1809,7 @@ static const match_table_t tokens = {
 	{Opt_init_itable, "init_itable"},
 	{Opt_noinit_itable, "noinit_itable"},
 	{Opt_no_fc, "no_fc"},
+	{Opt_fc_debug_force, "fc_debug_force"},
 #ifdef CONFIG_EXT4_DEBUG
 	{Opt_fc_debug_max_replay, "fc_debug_max_replay=%u"},
 #endif
@@ -2040,6 +2042,8 @@ static const struct mount_opts {
 	 MOPT_SET},
 	{Opt_no_fc, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
 	 MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
+	{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
+	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
 #ifdef CONFIG_EXT4_DEBUG
 	{Opt_fc_debug_max_replay, 0, MOPT_GTE0},
 #endif
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH v10 9/9] ext4: add fast commit stats in procfs
  2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
                   ` (7 preceding siblings ...)
  2020-10-15 20:38 ` [PATCH v10 8/9] ext4: add a mount opt to forcefully turn fast commits on Harshad Shirwadkar
@ 2020-10-15 20:38 ` Harshad Shirwadkar
  8 siblings, 0 replies; 33+ messages in thread
From: Harshad Shirwadkar @ 2020-10-15 20:38 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This commit adds a file in procfs that tracks fast commit related
statistics.

root@kvm-xfstests:/mnt# cat /proc/fs/ext4/vdc/fc_info
fc stats:
7772 commits
15 ineligible
4083 numblks
2242us avg_commit_time
Ineligible reasons:
"Extended attributes changed":  0
"Cross rename": 0
"Journal flag changed": 0
"Insufficient memory":  0
"Swap boot":    0
"Resize":       0
"Dir renamed":  0
"Falloc range op":      0
"FC Commit Failed":     15

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h        |  2 +-
 fs/ext4/fast_commit.c | 34 ++++++++++++++++++++++++++++++++++
 fs/ext4/sysfs.c       |  2 ++
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index ff5094eb0e39..18a6df442671 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2737,7 +2737,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
 extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
 
 /* fast_commit.c */
-
+int ext4_fc_info_show(struct seq_file *seq, void *v);
 void ext4_fc_init(struct super_block *sb, journal_t *journal);
 void ext4_fc_init_inode(struct inode *inode);
 void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 1dda5329be61..3e3ec989a2df 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -2082,6 +2082,40 @@ void ext4_fc_init(struct super_block *sb, journal_t *journal)
 	}
 }
 
+const char *fc_ineligible_reasons[] = {
+	"Extended attributes changed",
+	"Cross rename",
+	"Journal flag changed",
+	"Insufficient memory",
+	"Swap boot",
+	"Resize",
+	"Dir renamed",
+	"Falloc range op",
+	"FC Commit Failed"
+};
+
+int ext4_fc_info_show(struct seq_file *seq, void *v)
+{
+	struct ext4_sb_info *sbi = EXT4_SB((struct super_block *)seq->private);
+	struct ext4_fc_stats *stats = &sbi->s_fc_stats;
+	int i;
+
+	if (v != SEQ_START_TOKEN)
+		return 0;
+
+	seq_printf(seq,
+		"fc stats:\n%ld commits\n%ld ineligible\n%ld numblks\n%lluus avg_commit_time\n",
+		   stats->fc_num_commits, stats->fc_ineligible_commits,
+		   stats->fc_numblks,
+		   div_u64(sbi->s_fc_avg_commit_time, 1000));
+	seq_puts(seq, "Ineligible reasons:\n");
+	for (i = 0; i < EXT4_FC_REASON_MAX; i++)
+		seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i],
+			stats->fc_ineligible_reason_count[i]);
+
+	return 0;
+}
+
 int __init ext4_fc_init_dentry_cache(void)
 {
 	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index bfabb799fa45..5ff33d18996a 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -521,6 +521,8 @@ int ext4_register_sysfs(struct super_block *sb)
 		proc_create_single_data("es_shrinker_info", S_IRUGO,
 				sbi->s_proc, ext4_seq_es_shrinker_info_show,
 				sb);
+		proc_create_single_data("fc_info", 0444, sbi->s_proc,
+					ext4_fc_info_show, sb);
 		proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
 				&ext4_mb_seq_groups_ops, sb);
 	}
-- 
2.29.0.rc1.297.gfa9743e501-goog


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 1/9] doc: update ext4 and journalling docs to include fast commit feature
  2020-10-15 20:37 ` [PATCH v10 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
@ 2020-10-21 16:04   ` Jan Kara
  2020-10-21 17:25     ` harshad shirwadkar
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2020-10-21 16:04 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso

On Thu 15-10-20 13:37:53, Harshad Shirwadkar wrote:
> +   * - EXT4_FC_TAG_CREAT
> +     - Create directory entry for a newly created file
> +     - ``struct ext4_fc_dentry_info``
> +     - Stores the parent inode numer, inode number and directory entry of the
                                  ^^^ number

> +       newly created file
> +   * - EXT4_FC_TAG_LINK
> +     - Link a directory entry to an inode
> +     - ``struct ext4_fc_dentry_info``
> +     - Stores the parent inode numer, inode number and directory entry
                                  ^^^^ number

BTW, how is EXT4_FC_TAG_CREAT different from EXT4_FC_TAG_LINK? It seems
like they describe essentially the same operation?

> +   * - EXT4_FC_TAG_UNLINK
> +     - Unink a directory entry of an inode
          ^^^^ Unlink

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 2/9] ext4: add fast_commit feature and handling for extended mount options
  2020-10-15 20:37 ` [PATCH v10 2/9] ext4: add fast_commit feature and handling for extended mount options Harshad Shirwadkar
@ 2020-10-21 16:18   ` Jan Kara
  2020-10-21 17:31     ` harshad shirwadkar
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2020-10-21 16:18 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso

On Thu 15-10-20 13:37:54, Harshad Shirwadkar wrote:
> We are running out of mount option bits. Add handling for using
> s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
> ability to turn off the fast commit feature in Ext4.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>  fs/ext4/ext4.h       |  4 ++++
>  fs/ext4/super.c      | 27 ++++++++++++++++++++++-----
>  include/linux/jbd2.h |  5 ++++-
>  3 files changed, 30 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 1879531a119f..02d7dc378505 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1213,6 +1213,8 @@ struct ext4_inode_info {
>  #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM	0x00000008 /* User explicitly
>  						specified journal checksum */
>  
> +#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT	0x00000010 /* Journal fast commit */
> +
>  #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
>  						~EXT4_MOUNT_##opt
>  #define set_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt |= \
> @@ -1813,6 +1815,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode)
>  #define EXT4_FEATURE_COMPAT_RESIZE_INODE	0x0010
>  #define EXT4_FEATURE_COMPAT_DIR_INDEX		0x0020
>  #define EXT4_FEATURE_COMPAT_SPARSE_SUPER2	0x0200
> +#define EXT4_FEATURE_COMPAT_FAST_COMMIT		0x0400
>  #define EXT4_FEATURE_COMPAT_STABLE_INODES	0x0800

Is fast commit really a compat feature? IMO if there are fast commits
stored in the journal, the filesystem is actually incompatible with the
old kernels because data we guranteed to be permanenly stored may be
invisible for the old kernel (since it won't replay fastcommit
transactions).

...

Oh, now I see that the journal FAST_COMMIT is actually incompat. So what's
the point of compat ext4 feature with incompat JBD2 feature?

> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 901c1c938276..70256a240442 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1709,7 +1709,7 @@ enum {
>  	Opt_dioread_nolock, Opt_dioread_lock,
>  	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
>  	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> -	Opt_prefetch_block_bitmaps,
> +	Opt_prefetch_block_bitmaps, Opt_no_fc,

It would be more consistent to use a name 'Opt_nofc' and IMHO 'fc' is
really too short an ambiguous. I agree "nofastcommit" is somewhat long but
still OK and much more descriptive...

>  };
>  
>  static const match_table_t tokens = {
> @@ -1796,6 +1796,7 @@ static const match_table_t tokens = {
>  	{Opt_init_itable, "init_itable=%u"},
>  	{Opt_init_itable, "init_itable"},
>  	{Opt_noinit_itable, "noinit_itable"},
> +	{Opt_no_fc, "no_fc"},

And here "nofastcommit", or perhaps "nofast_commit".

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 1/9] doc: update ext4 and journalling docs to include fast commit feature
  2020-10-21 16:04   ` Jan Kara
@ 2020-10-21 17:25     ` harshad shirwadkar
  2020-10-22 13:06       ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: harshad shirwadkar @ 2020-10-21 17:25 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4 Developers List, Theodore Y. Ts'o

Thanks Jan for taking a look at the patches.

On Wed, Oct 21, 2020 at 9:04 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 15-10-20 13:37:53, Harshad Shirwadkar wrote:
> > +   * - EXT4_FC_TAG_CREAT
> > +     - Create directory entry for a newly created file
> > +     - ``struct ext4_fc_dentry_info``
> > +     - Stores the parent inode numer, inode number and directory entry of the
>                                   ^^^ number
Ack
>
> > +       newly created file
> > +   * - EXT4_FC_TAG_LINK
> > +     - Link a directory entry to an inode
> > +     - ``struct ext4_fc_dentry_info``
> > +     - Stores the parent inode numer, inode number and directory entry
>                                   ^^^^ number
Ack
>
> BTW, how is EXT4_FC_TAG_CREAT different from EXT4_FC_TAG_LINK? It seems
> like they describe essentially the same operation?
In the replay path, creat has to do certain things that link doesn't.
For example, "creat" needs to mark the inode as used in the bitmap and
also if it's a directory that's being created, it needs to initialize
the "." and ".." dirents in the directory. That's why we need
different tags.
>
> > +   * - EXT4_FC_TAG_UNLINK
> > +     - Unink a directory entry of an inode
>           ^^^^ Unlink
Ack

Thanks,
Harshad
>
>                                                                         Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 2/9] ext4: add fast_commit feature and handling for extended mount options
  2020-10-21 16:18   ` Jan Kara
@ 2020-10-21 17:31     ` harshad shirwadkar
  2020-10-22 13:09       ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: harshad shirwadkar @ 2020-10-21 17:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4 Developers List, Theodore Y. Ts'o

On Wed, Oct 21, 2020 at 9:18 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 15-10-20 13:37:54, Harshad Shirwadkar wrote:
> > We are running out of mount option bits. Add handling for using
> > s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
> > ability to turn off the fast commit feature in Ext4.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > ---
> >  fs/ext4/ext4.h       |  4 ++++
> >  fs/ext4/super.c      | 27 ++++++++++++++++++++++-----
> >  include/linux/jbd2.h |  5 ++++-
> >  3 files changed, 30 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 1879531a119f..02d7dc378505 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1213,6 +1213,8 @@ struct ext4_inode_info {
> >  #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM        0x00000008 /* User explicitly
> >                                               specified journal checksum */
> >
> > +#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT      0x00000010 /* Journal fast commit */
> > +
> >  #define clear_opt(sb, opt)           EXT4_SB(sb)->s_mount_opt &= \
> >                                               ~EXT4_MOUNT_##opt
> >  #define set_opt(sb, opt)             EXT4_SB(sb)->s_mount_opt |= \
> > @@ -1813,6 +1815,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode)
> >  #define EXT4_FEATURE_COMPAT_RESIZE_INODE     0x0010
> >  #define EXT4_FEATURE_COMPAT_DIR_INDEX                0x0020
> >  #define EXT4_FEATURE_COMPAT_SPARSE_SUPER2    0x0200
> > +#define EXT4_FEATURE_COMPAT_FAST_COMMIT              0x0400
> >  #define EXT4_FEATURE_COMPAT_STABLE_INODES    0x0800
>
> Is fast commit really a compat feature? IMO if there are fast commits
> stored in the journal, the filesystem is actually incompatible with the
> old kernels because data we guranteed to be permanenly stored may be
> invisible for the old kernel (since it won't replay fastcommit
> transactions).
>
> ...
>
> Oh, now I see that the journal FAST_COMMIT is actually incompat. So what's
> the point of compat ext4 feature with incompat JBD2 feature?
So having fast commits enabled on an ext4 file system doesn't
immediately make it incompatible with the older kernels. FS becomes
incompatible only if there are fast commits blocks that are stored in
the journal. So, one of the tricks that this patchset does is on a
clean unmount, since it's guaranteed that there are no fast commit
blocks in journal, we clear out the JBD2 incompat flag and preserve
the compat flag in ext4. So, we can think of ext4 compat flag as "FS
will try fast commits when possible" while jbd2 incompat flag as
"There are fast commits blocks present in the journal". Does that make
sense?
>
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 901c1c938276..70256a240442 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -1709,7 +1709,7 @@ enum {
> >       Opt_dioread_nolock, Opt_dioread_lock,
> >       Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
> >       Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
> > -     Opt_prefetch_block_bitmaps,
> > +     Opt_prefetch_block_bitmaps, Opt_no_fc,
>
> It would be more consistent to use a name 'Opt_nofc' and IMHO 'fc' is
> really too short an ambiguous. I agree "nofastcommit" is somewhat long but
> still OK and much more descriptive...
Ack
>
> >  };
> >
> >  static const match_table_t tokens = {
> > @@ -1796,6 +1796,7 @@ static const match_table_t tokens = {
> >       {Opt_init_itable, "init_itable=%u"},
> >       {Opt_init_itable, "init_itable"},
> >       {Opt_noinit_itable, "noinit_itable"},
> > +     {Opt_no_fc, "no_fc"},
>
> And here "nofastcommit", or perhaps "nofast_commit".
Ack

Thanks,
Harshad
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 3/9] ext4 / jbd2: add fast commit initialization
  2020-10-15 20:37 ` [PATCH v10 3/9] ext4 / jbd2: add fast commit initialization Harshad Shirwadkar
@ 2020-10-21 20:00   ` Jan Kara
  2020-10-29 23:28     ` harshad shirwadkar
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2020-10-21 20:00 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, kernel test robot

On Thu 15-10-20 13:37:55, Harshad Shirwadkar wrote:
> diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
> new file mode 100644
> index 000000000000..8362bf5e6e00
> --- /dev/null
> +++ b/fs/ext4/fast_commit.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __FAST_COMMIT_H__
> +#define __FAST_COMMIT_H__
> +
> +/* Number of blocks in journal area to allocate for fast commits */
> +#define EXT4_NUM_FC_BLKS		256

Maybe this could be tunable (at least during mkfs but maybe also with
a mount option)? I can imagine some people will want to tune this for their
workloads similarly as they tune the journal size. And although current
minimal journal size is 1024, I'd be actually calmer if jbd2 properly
checked from the start that requested fastcommit area isn't too big for the
journal...

> +
> +#endif /* __FAST_COMMIT_H__ */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 70256a240442..23bf55057fc2 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -5170,6 +5170,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
>  	journal->j_commit_interval = sbi->s_commit_interval;
>  	journal->j_min_batch_time = sbi->s_min_batch_time;
>  	journal->j_max_batch_time = sbi->s_max_batch_time;
> +	ext4_fc_init(sb, journal);
>  
>  	write_lock(&journal->j_state_lock);
>  	if (test_opt(sb, BARRIER))
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index c0600405e7a2..4497bfbac527 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1181,6 +1181,14 @@ static journal_t *journal_init_common(struct block_device *bdev,
>  	if (!journal->j_wbuf)
>  		goto err_cleanup;
>  
> +	if (journal->j_fc_wbufsize > 0) {
> +		journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> +					sizeof(struct buffer_head *),
> +					GFP_KERNEL);
> +		if (!journal->j_fc_wbuf)
> +			goto err_cleanup;
> +	}
> +

Hum, but journal_init_common() gets called e.g. through
jbd2_journal_init_inode() before ext4_init_journal_params() sets
j_fc_wbufsize? How is this supposed to work?

>  	bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
>  	if (!bh) {
>  		pr_err("%s: Cannot get buffer for journal superblock\n",
> @@ -1194,11 +1202,23 @@ static journal_t *journal_init_common(struct block_device *bdev,
>  
>  err_cleanup:
>  	kfree(journal->j_wbuf);
> +	kfree(journal->j_fc_wbuf);
>  	jbd2_journal_destroy_revoke(journal);
>  	kfree(journal);
>  	return NULL;
>  }
>  
> +int jbd2_fc_init(journal_t *journal, int num_fc_blks)
> +{
> +	journal->j_fc_wbufsize = num_fc_blks;
> +	journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> +				sizeof(struct buffer_head *), GFP_KERNEL);
> +	if (!journal->j_fc_wbuf)
> +		return -ENOMEM;
> +	return 0;
> +}
> +EXPORT_SYMBOL(jbd2_fc_init);

Hum, probably I'd find it less error prone to have size of fastcommit area
as an argument to jbd2_journal_init_dev() and jbd2_journal_init_inode().
That way we are sure journal parameters are initialized correctly from the
start. OTOH number of fastcommit blocks in the journal as we load it from
the disk and need to replay could be different from the number of
fastcommit blocks requested now (once we allow tuning) and this can get
confusing pretty fast. So maybe we just set number of fastcommit blocks in
journal_init_common() and then perform setup of everything else in
journal_reset()?

> +
>  /* jbd2_journal_init_dev and jbd2_journal_init_inode:
>   *
>   * Create a journal structure assigned some fixed set of disk blocks to
> @@ -1316,11 +1336,20 @@ static int journal_reset(journal_t *journal)
>  	}
>  
>  	journal->j_first = first;
> -	journal->j_last = last;
>  
> -	journal->j_head = first;
> -	journal->j_tail = first;
> -	journal->j_free = last - first;
> +	if (jbd2_has_feature_fast_commit(journal) &&
> +	    journal->j_fc_wbufsize > 0) {
> +		journal->j_fc_last = last;
> +		journal->j_last = last - journal->j_fc_wbufsize;
> +		journal->j_fc_first = journal->j_last + 1;
> +		journal->j_fc_off = 0;
> +	} else {
> +		journal->j_last = last;
> +	}
> +
> +	journal->j_head = journal->j_first;
> +	journal->j_tail = journal->j_first;
> +	journal->j_free = journal->j_last - journal->j_first;

So the journal size is effectively shorter by j_fc_wbufsize. But this has
also impact on maximum transaction size we can allow for the journal and
related parameters (generally derived from j_maxlen you don't touch).
So this needs to get fixed. Maybe just setting j_maxlen lower is the
easiest but then please change the comment at its definition to mention in
memory value is without fastcommit blocks. Or just create new journal
parameter for the size of area usable for normal commits.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 4/9] jbd2: add fast commit machinery
  2020-10-15 20:37 ` [PATCH v10 4/9] jbd2: add fast commit machinery Harshad Shirwadkar
@ 2020-10-22 10:16   ` Jan Kara
  2020-10-23 17:17     ` harshad shirwadkar
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2020-10-22 10:16 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso

On Thu 15-10-20 13:37:56, Harshad Shirwadkar wrote:
> This functions adds necessary APIs needed in JBD2 layer for fast
> commits.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>  fs/ext4/fast_commit.c |   8 ++
>  fs/jbd2/commit.c      |  44 ++++++++++
>  fs/jbd2/journal.c     | 190 +++++++++++++++++++++++++++++++++++++++++-
>  include/linux/jbd2.h  |  27 ++++++
>  4 files changed, 268 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 0dad8bdb1253..f2d11b4c6b62 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -8,11 +8,19 @@
>   * Ext4 fast commits routines.
>   */
>  #include "ext4_jbd2.h"
> +/*
> + * Fast commit cleanup routine. This is called after every fast commit and
> + * full commit. full is true if we are called after a full commit.
> + */
> +static void ext4_fc_cleanup(journal_t *journal, int full)
> +{
> +}
>  
>  void ext4_fc_init(struct super_block *sb, journal_t *journal)
>  {
>  	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
>  		return;
> +	journal->j_fc_cleanup_callback = ext4_fc_cleanup;
>  	if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
>  		pr_warn("Error while enabling fast commits, turning off.");
>  		ext4_clear_feature_fast_commit(sb);
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 6252b4c50666..fa688e163a80 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -206,6 +206,30 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
>  	return generic_writepages(mapping, &wbc);
>  }
>  
> +/* Send all the data buffers related to an inode */
> +int jbd2_submit_inode_data(struct jbd2_inode *jinode)
> +{
> +
> +	if (!jinode || !(jinode->i_flags & JI_WRITE_DATA))
> +		return 0;
> +
> +	trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> +	return jbd2_journal_submit_inode_data_buffers(jinode);
> +
> +}
> +EXPORT_SYMBOL(jbd2_submit_inode_data);
> +
> +int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> +{
> +	if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
> +		!jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
> +		return 0;
> +	return filemap_fdatawait_range_keep_errors(
> +		jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start,
> +		jinode->i_dirty_end);
> +}
> +EXPORT_SYMBOL(jbd2_wait_inode_data);
> +
>  /*
>   * Submit all the data buffers of inode associated with the transaction to
>   * disk.
> @@ -415,6 +439,20 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	J_ASSERT(journal->j_running_transaction != NULL);
>  	J_ASSERT(journal->j_committing_transaction == NULL);
>  
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
> +	while (journal->j_flags & JBD2_FAST_COMMIT_ONGOING) {
> +		DEFINE_WAIT(wait);
> +
> +		prepare_to_wait(&journal->j_fc_wait, &wait,
> +				TASK_UNINTERRUPTIBLE);
> +		write_unlock(&journal->j_state_lock);
> +		schedule();
> +		write_lock(&journal->j_state_lock);
> +		finish_wait(&journal->j_fc_wait, &wait);
> +	}
> +	write_unlock(&journal->j_state_lock);

Hum, I'd like to understand: Is there a reason to block fastcommits already
when the running transaction is in T_LOCKED state? Strictly speaking it is
necessary only once we get to T_FLUSH state AFAIU (because only then we
start to write transaction to the journal). I guess there are both
advantages and disadvantages to it - if we allowed fastcommits running in
T_LOCKED state, we could lower fsync() latency more. OTOH it could increase
commit latency because we'd have to wait for fastcommits after T_LOCKED
state.

Another option is to just block new fast commits at the beginning of
T_LOCKED state and wait for running fastcommits at the end of T_LOCKED
state. That way waiting for outstanding handles and waiting for fastcommits
would be running in parallel and we'd reduce the latency...

Also I'm not sure JBD2_FULL_COMMIT_ONGOING is really needed. I understand
it is handy at this point but longer term, I'd find it more maintainable if
we just had a helper function jbd2_fastcommit_allowed() (or whatever) that
will check journal state and based on presence and state of committing
transaction return whether fastcommits are allowed or not...

> +
>  	commit_transaction = journal->j_running_transaction;
>  
>  	trace_jbd2_start_commit(journal, commit_transaction);
> @@ -422,6 +460,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  			commit_transaction->t_tid);
>  
>  	write_lock(&journal->j_state_lock);
> +	journal->j_fc_off = 0;
>  	J_ASSERT(commit_transaction->t_state == T_RUNNING);
>  	commit_transaction->t_state = T_LOCKED;
>  
> @@ -1121,12 +1160,16 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  
>  	if (journal->j_commit_callback)
>  		journal->j_commit_callback(journal, commit_transaction);
> +	if (journal->j_fc_cleanup_callback)
> +		journal->j_fc_cleanup_callback(journal, 1);
>  
>  	trace_jbd2_end_commit(journal, commit_transaction);
>  	jbd_debug(1, "JBD2: commit %d complete, head %d\n",
>  		  journal->j_commit_sequence, journal->j_tail_sequence);
>  
>  	write_lock(&journal->j_state_lock);
> +	journal->j_flags &= ~JBD2_FULL_COMMIT_ONGOING;
> +	journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
>  	spin_lock(&journal->j_list_lock);
>  	commit_transaction->t_state = T_FINISHED;
>  	/* Check if the transaction can be dropped now that we are finished */
> @@ -1138,6 +1181,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>  	spin_unlock(&journal->j_list_lock);
>  	write_unlock(&journal->j_state_lock);
>  	wake_up(&journal->j_wait_done_commit);
> +	wake_up(&journal->j_fc_wait);
>  
>  	/*
>  	 * Calculate overall stats
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 4497bfbac527..0c7c42bd530f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -159,7 +159,9 @@ static void commit_timeout(struct timer_list *t)
>   *
>   * 1) COMMIT:  Every so often we need to commit the current state of the
>   *    filesystem to disk.  The journal thread is responsible for writing
> - *    all of the metadata buffers to disk.
> + *    all of the metadata buffers to disk. If a fast commit is ongoing
> + *    journal thread waits until it's done and then continues from
> + *    there on.
>   *
>   * 2) CHECKPOINT: We cannot reuse a used section of the log file until all
>   *    of the data in that part of the log has been rewritten elsewhere on
> @@ -716,6 +718,75 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
>  	return err;
>  }
>  
> +/*
> + * Start a fast commit. If there's an ongoing fast or full commit wait for
> + * it to complete. Returns 0 if a new fast commit was started. Returns -EALREADY
> + * if a fast commit is not needed, either because there's an already a commit
> + * going on or this tid has already been committed. Returns -EINVAL if no jbd2
> + * commit has yet been performed.
> + */
> +int jbd2_fc_begin_commit(journal_t *journal, tid_t tid)
> +{
> +	/*
> +	 * Fast commits only allowed if at least one full commit has
> +	 * been processed.
> +	 */
> +	if (!journal->j_stats.ts_tid)
> +		return -EINVAL;
> +
> +	if (tid <= journal->j_commit_sequence)
> +		return -EALREADY;

This check is racy and possibly using stale value of j_commit_sequence
since j_commit_sequence needs j_state_lock for reliable reading.

> +
> +	write_lock(&journal->j_state_lock);
> +	if (journal->j_flags & JBD2_FULL_COMMIT_ONGOING ||
> +	    (journal->j_flags & JBD2_FAST_COMMIT_ONGOING)) {
> +		DEFINE_WAIT(wait);
> +
> +		prepare_to_wait(&journal->j_fc_wait, &wait,
> +				TASK_UNINTERRUPTIBLE);
> +		write_unlock(&journal->j_state_lock);
> +		schedule();
> +		finish_wait(&journal->j_fc_wait, &wait);
> +		return -EALREADY;
> +	}
> +	journal->j_flags |= JBD2_FAST_COMMIT_ONGOING;
> +	write_unlock(&journal->j_state_lock);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(jbd2_fc_begin_commit);
> +
> +/*
> + * Stop a fast commit. If fallback is set, this function starts commit of
> + * TID tid before any other fast commit can start.
> + */
> +static int __jbd2_fc_end_commit(journal_t *journal, tid_t tid, bool fallback)
> +{
> +	if (journal->j_fc_cleanup_callback)
> +		journal->j_fc_cleanup_callback(journal, 0);
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
> +	if (fallback)
> +		journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
> +	write_unlock(&journal->j_state_lock);
> +	wake_up(&journal->j_fc_wait);
> +	if (fallback)
> +		return jbd2_complete_transaction(journal, tid);
> +	return 0;
> +}
> +
> +int jbd2_fc_end_commit(journal_t *journal)
> +{
> +	return __jbd2_fc_end_commit(journal, 0, 0);

'fallback' is bool so please use true / false for it.

> +}
> +EXPORT_SYMBOL(jbd2_fc_end_commit);
> +
> +int jbd2_fc_end_commit_fallback(journal_t *journal, tid_t tid)
> +{
> +	return __jbd2_fc_end_commit(journal, tid, 1);
> +}
> +EXPORT_SYMBOL(jbd2_fc_end_commit_fallback);
> +

Is there a need for 'tid' here? Once jbd2_fc_begin_commit() sets
JBD2_FAST_COMMIT_ONGOING normal commit cannot proceed so when we decide we
cannot do fastcommit in the end, we know the transaction that needs to
commit is the currently running transaction, so we can fetch its TID from
the journal once we hold j_state_lock before clearing
JBD2_FAST_COMMIT_ONGOING. Cannot we?

>  /* Return 1 when transaction with given tid has already committed. */
>  int jbd2_transaction_committed(journal_t *journal, tid_t tid)
>  {
> @@ -784,6 +855,110 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
>  	return jbd2_journal_bmap(journal, blocknr, retp);
>  }
>  
> +/* Map one fast commit buffer for use by the file system */
> +int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out)
> +{
> +	unsigned long long pblock;
> +	unsigned long blocknr;
> +	int ret = 0;
> +	struct buffer_head *bh;
> +	int fc_off;
> +
> +	*bh_out = NULL;
> +	write_lock(&journal->j_state_lock);
> +
> +	if (journal->j_fc_off + journal->j_fc_first < journal->j_fc_last) {
> +		fc_off = journal->j_fc_off;
> +		blocknr = journal->j_fc_first + fc_off;
> +		journal->j_fc_off++;
> +	} else {
> +		ret = -EINVAL;
> +	}
> +	write_unlock(&journal->j_state_lock);

Is j_state_lock really needed here? There is always only one process doing
fastcommit so nobody else should be touching j_fc_off and other fields. Or
am I missing something?

> +
> +	if (ret)
> +		return ret;
> +
> +	ret = jbd2_journal_bmap(journal, blocknr, &pblock);
> +	if (ret)
> +		return ret;
> +
> +	bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
> +	if (!bh)
> +		return -ENOMEM;
> +
> +	lock_buffer(bh);
> +
> +	clear_buffer_uptodate(bh);
> +	set_buffer_dirty(bh);

Uh, that's a weird state to leave buffer in (!uptodate & dirty). Flush
worker could spot such buffer and try to write it out, which would blow
up... I wouldn't touch the buffer state here, once proper content is
filled, I'd mark the buffer as uptodate & dirty. That's how buffer state is
usually managed.

> +	unlock_buffer(bh);
> +	journal->j_fc_wbuf[fc_off] = bh;
> +
> +	*bh_out = bh;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(jbd2_fc_get_buf);
> +
> +/*
> + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
> + * for completion.
> + */
> +int jbd2_fc_wait_bufs(journal_t *journal, int num_blks)
> +{
> +	struct buffer_head *bh;
> +	int i, j_fc_off;
> +
> +	read_lock(&journal->j_state_lock);
> +	j_fc_off = journal->j_fc_off;
> +	read_unlock(&journal->j_state_lock);

Same comment regarding j_state_lock as for jbd2_fc_get_buf().

> +
> +	/*
> +	 * Wait in reverse order to minimize chances of us being woken up before
> +	 * all IOs have completed
> +	 */
> +	for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
> +		bh = journal->j_fc_wbuf[i];
> +		wait_on_buffer(bh);
> +		put_bh(bh);
> +		journal->j_fc_wbuf[i] = NULL;
> +		if (unlikely(!buffer_uptodate(bh)))
> +			return -EIO;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(jbd2_fc_wait_bufs);
> +
> +/*
> + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
> + * for completion.
> + */
> +int jbd2_fc_release_bufs(journal_t *journal)
> +{
> +	struct buffer_head *bh;
> +	int i, j_fc_off;
> +
> +	read_lock(&journal->j_state_lock);
> +	j_fc_off = journal->j_fc_off;
> +	read_unlock(&journal->j_state_lock);
> +
> +	/*
> +	 * Wait in reverse order to minimize chances of us being woken up before
> +	 * all IOs have completed
> +	 */
> +	for (i = j_fc_off - 1; i >= 0; i--) {
> +		bh = journal->j_fc_wbuf[i];
> +		if (!bh)
> +			break;
> +		put_bh(bh);
> +		journal->j_fc_wbuf[i] = NULL;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(jbd2_fc_release_bufs);
> +

I kind of wonder if releasing of buffers shouldn't be done automatically
either as part of jbd2_fc_wait_bufs() or when ending fastcommit. But I
don't have a strong opinion so this is just an idea for consideration.

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 1/9] doc: update ext4 and journalling docs to include fast commit feature
  2020-10-21 17:25     ` harshad shirwadkar
@ 2020-10-22 13:06       ` Jan Kara
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Kara @ 2020-10-22 13:06 UTC (permalink / raw)
  To: harshad shirwadkar; +Cc: Jan Kara, Ext4 Developers List, Theodore Y. Ts'o

On Wed 21-10-20 10:25:14, harshad shirwadkar wrote:
> > BTW, how is EXT4_FC_TAG_CREAT different from EXT4_FC_TAG_LINK? It seems
> > like they describe essentially the same operation?
> In the replay path, creat has to do certain things that link doesn't.
> For example, "creat" needs to mark the inode as used in the bitmap and
> also if it's a directory that's being created, it needs to initialize
> the "." and ".." dirents in the directory. That's why we need
> different tags.

Aha, OK, makes sence. Thanks for explanation. BTW it would be good to have
some documentation (or at least examples) how a sequence of system calls
translates to fastcommit log entries and then how these are replayed in
case of crash.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 2/9] ext4: add fast_commit feature and handling for extended mount options
  2020-10-21 17:31     ` harshad shirwadkar
@ 2020-10-22 13:09       ` Jan Kara
  2020-10-26 16:40         ` harshad shirwadkar
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2020-10-22 13:09 UTC (permalink / raw)
  To: harshad shirwadkar; +Cc: Jan Kara, Ext4 Developers List, Theodore Y. Ts'o

On Wed 21-10-20 10:31:48, harshad shirwadkar wrote:
> On Wed, Oct 21, 2020 at 9:18 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 15-10-20 13:37:54, Harshad Shirwadkar wrote:
> > > We are running out of mount option bits. Add handling for using
> > > s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
> > > ability to turn off the fast commit feature in Ext4.
> > >
> > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > > ---
> > >  fs/ext4/ext4.h       |  4 ++++
> > >  fs/ext4/super.c      | 27 ++++++++++++++++++++++-----
> > >  include/linux/jbd2.h |  5 ++++-
> > >  3 files changed, 30 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > index 1879531a119f..02d7dc378505 100644
> > > --- a/fs/ext4/ext4.h
> > > +++ b/fs/ext4/ext4.h
> > > @@ -1213,6 +1213,8 @@ struct ext4_inode_info {
> > >  #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM        0x00000008 /* User explicitly
> > >                                               specified journal checksum */
> > >
> > > +#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT      0x00000010 /* Journal fast commit */
> > > +
> > >  #define clear_opt(sb, opt)           EXT4_SB(sb)->s_mount_opt &= \
> > >                                               ~EXT4_MOUNT_##opt
> > >  #define set_opt(sb, opt)             EXT4_SB(sb)->s_mount_opt |= \
> > > @@ -1813,6 +1815,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode)
> > >  #define EXT4_FEATURE_COMPAT_RESIZE_INODE     0x0010
> > >  #define EXT4_FEATURE_COMPAT_DIR_INDEX                0x0020
> > >  #define EXT4_FEATURE_COMPAT_SPARSE_SUPER2    0x0200
> > > +#define EXT4_FEATURE_COMPAT_FAST_COMMIT              0x0400
> > >  #define EXT4_FEATURE_COMPAT_STABLE_INODES    0x0800
> >
> > Is fast commit really a compat feature? IMO if there are fast commits
> > stored in the journal, the filesystem is actually incompatible with the
> > old kernels because data we guranteed to be permanenly stored may be
> > invisible for the old kernel (since it won't replay fastcommit
> > transactions).
> >
> > ...
> >
> > Oh, now I see that the journal FAST_COMMIT is actually incompat. So what's
> > the point of compat ext4 feature with incompat JBD2 feature?
> So having fast commits enabled on an ext4 file system doesn't
> immediately make it incompatible with the older kernels. FS becomes
> incompatible only if there are fast commits blocks that are stored in
> the journal. So, one of the tricks that this patchset does is on a
> clean unmount, since it's guaranteed that there are no fast commit
> blocks in journal, we clear out the JBD2 incompat flag and preserve
> the compat flag in ext4. So, we can think of ext4 compat flag as "FS
> will try fast commits when possible" while jbd2 incompat flag as
> "There are fast commits blocks present in the journal". Does that make
> sense?

Yes, understood. That's clever. Thanks for explanation! But please add the
above justification to the description of EXT4_FEATURE_COMPAT_FAST_COMMIT
feature or somewhere around that.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-10-15 20:37 ` [PATCH v10 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
@ 2020-10-23 10:30   ` Jan Kara
  2020-10-26 20:55     ` harshad shirwadkar
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2020-10-23 10:30 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso, kernel test robot

On Thu 15-10-20 13:37:57, Harshad Shirwadkar wrote:
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index 76f634d185f1..68aaed48315f 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -242,6 +242,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  	handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
>  	if (IS_ERR(handle))
>  		return PTR_ERR(handle);
> +	ext4_fc_start_update(inode);
>  
>  	if ((type == ACL_TYPE_ACCESS) && acl) {
>  		error = posix_acl_update_mode(inode, &mode, &acl);
> @@ -259,6 +260,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  	}
>  out_stop:
>  	ext4_journal_stop(handle);
> +	ext4_fc_stop_update(inode);
>  	if (error == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
>  		goto retry;
>  	return error;
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 2c412d32db0f..6b291cad72be 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1021,6 +1021,28 @@ struct ext4_inode_info {
>  
>  	struct list_head i_orphan;	/* unlinked but open inodes */
>  
> +	/* Fast commit related info */
> +
> +	struct list_head i_fc_list;	/*
> +					 * inodes that need fast commit
> +					 * protected by sbi->s_fc_lock.
> +					 */
> +
> +	/* Start of lblk range that needs to be committed in this fast commit */
> +	ext4_lblk_t i_fc_lblk_start;
> +
> +	/* End of lblk range that needs to be committed in this fast commit */
> +	ext4_lblk_t i_fc_lblk_len;
> +
> +	/* Number of ongoing updates on this inode */
> +	atomic_t  i_fc_updates;
> +
> +	/* Fast commit wait queue for this inode */
> +	wait_queue_head_t i_fc_wait;
> +
> +	/* Protect concurrent accesses on i_fc_lblk_start, i_fc_lblk_len */
> +	struct mutex i_fc_lock;
> +
>  	/*
>  	 * i_disksize keeps track of what the inode size is ON DISK, not
>  	 * in memory.  During truncate, i_size is set to the new size by
> @@ -1141,6 +1163,10 @@ struct ext4_inode_info {
>  #define	EXT4_VALID_FS			0x0001	/* Unmounted cleanly */
>  #define	EXT4_ERROR_FS			0x0002	/* Errors detected */
>  #define	EXT4_ORPHAN_FS			0x0004	/* Orphans being recovered */
> +#define EXT4_FC_INELIGIBLE		0x0008	/* Fast commit ineligible */
> +#define EXT4_FC_COMMITTING		0x0010	/* File system underoing a fast
	  ^^ please align these as the previous values
Also the names should have _FS suffix.

Now after more looking, these are actually used in s_mount_state which is
persistently stored on disk which is probably not what you want. You rather
want to use something like sbi->s_mount_flags for these?

And now that I also look at sbi->s_mount_flags, these should use atomic
bitops as currently they seem to be succeptible to RMW races (e.g. due to
EXT4_MF_MNTDIR_SAMPLED flag) and your flags also need the atomic behavior.
That would be a separate patch fixing this.

> +						 * commit.
> +						 */
>  
>  /*
>   * Misc. filesystem flags
> @@ -1613,6 +1639,30 @@ struct ext4_sb_info {
>  	/* Record the errseq of the backing block device */
>  	errseq_t s_bdev_wb_err;
>  	spinlock_t s_bdev_wb_lock;
> +
> +	/* Ext4 fast commit stuff */
> +	atomic_t s_fc_subtid;
> +	atomic_t s_fc_ineligible_updates;
> +	/*
> +	 * After commit starts, the main queue gets locked, and the further
> +	 * updates get added in the staging queue.
> +	 */
> +#define FC_Q_MAIN	0
> +#define FC_Q_STAGING	1
> +	struct list_head s_fc_q[2];	/* Inodes staged for fast commit
> +					 * that have data changes in them.
> +					 */
> +	struct list_head s_fc_dentry_q[2];	/* directory entry updates */
> +	unsigned int s_fc_bytes;
> +	/*
> +	 * Main fast commit lock. This lock protects accesses to the
> +	 * following fields:
> +	 * ei->i_fc_list, s_fc_dentry_q, s_fc_q, s_fc_bytes, s_fc_bh.
> +	 */
> +	spinlock_t s_fc_lock;
> +	struct buffer_head *s_fc_bh;
> +	struct ext4_fc_stats s_fc_stats;
> +	u64 s_fc_avg_commit_time;
>  };
>  
>  static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> @@ -1723,6 +1773,7 @@ enum {
>  	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
>  	EXT4_STATE_LUSTRE_EA_INODE,	/* Lustre-style ea_inode */
>  	EXT4_STATE_VERITY_IN_PROGRESS,	/* building fs-verity Merkle tree */
> +	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
>  };
>  
>  #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index e46f3381ba4c..a2bb87d75500 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3723,6 +3723,7 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
>  	err = ext4_ext_dirty(handle, inode, path + path->p_depth);
>  out:
>  	ext4_ext_show_leaf(inode, path);
> +	ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
>  	return err;
>  }
>  
> @@ -3794,6 +3795,7 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
>  	if (*allocated > map->m_len)
>  		*allocated = map->m_len;
>  	map->m_len = *allocated;
> +	ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
>  	return 0;
>  }
>  
> @@ -4327,7 +4329,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
>  	map->m_len = ar.len;
>  	allocated = map->m_len;
>  	ext4_ext_show_leaf(inode, path);
> -
> +	ext4_fc_track_range(inode, map->m_lblk, map->m_lblk + map->m_len - 1);
>  out:
>  	ext4_ext_drop_refs(path);
>  	kfree(path);
> @@ -4600,7 +4602,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>  	ret = ext4_mark_inode_dirty(handle, inode);
>  	if (unlikely(ret))
>  		goto out_handle;
> -
> +	ext4_fc_track_range(inode, offset >> inode->i_sb->s_blocksize_bits,
> +			(offset + len - 1) >> inode->i_sb->s_blocksize_bits);
>  	/* Zero out partial block at the edges of the range */
>  	ret = ext4_zero_partial_blocks(handle, inode, offset, len);
>  	if (ret >= 0)
> @@ -4648,23 +4651,34 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
>  		     FALLOC_FL_INSERT_RANGE))
>  		return -EOPNOTSUPP;
> +	ext4_fc_track_range(inode, offset >> blkbits,
> +			(offset + len - 1) >> blkbits);
>  
> -	if (mode & FALLOC_FL_PUNCH_HOLE)
> -		return ext4_punch_hole(inode, offset, len);
> +	ext4_fc_start_update(inode);
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		ret = ext4_punch_hole(inode, offset, len);
> +		goto exit;
> +	}
>  
>  	ret = ext4_convert_inline_data(inode);
>  	if (ret)
> -		return ret;
> +		goto exit;
>  
> -	if (mode & FALLOC_FL_COLLAPSE_RANGE)
> -		return ext4_collapse_range(inode, offset, len);
> -
> -	if (mode & FALLOC_FL_INSERT_RANGE)
> -		return ext4_insert_range(inode, offset, len);
> +	if (mode & FALLOC_FL_COLLAPSE_RANGE) {
> +		ret = ext4_collapse_range(inode, offset, len);
> +		goto exit;
> +	}
>  
> -	if (mode & FALLOC_FL_ZERO_RANGE)
> -		return ext4_zero_range(file, offset, len, mode);
> +	if (mode & FALLOC_FL_INSERT_RANGE) {
> +		ret = ext4_insert_range(inode, offset, len);
> +		goto exit;
> +	}
>  
> +	if (mode & FALLOC_FL_ZERO_RANGE) {
> +		ret = ext4_zero_range(file, offset, len, mode);
> +		goto exit;
> +	}
>  	trace_ext4_fallocate_enter(inode, offset, len, mode);
>  	lblk = offset >> blkbits;
>  
> @@ -4698,12 +4712,14 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		goto out;
>  
>  	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
> -		ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
> -						EXT4_I(inode)->i_sync_tid);
> +		ret = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
> +					EXT4_I(inode)->i_sync_tid);
>  	}
>  out:
>  	inode_unlock(inode);
>  	trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
> +exit:
> +	ext4_fc_stop_update(inode);
>  	return ret;
>  }
>  
> @@ -5291,6 +5307,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
>  		ret = PTR_ERR(handle);
>  		goto out_mmap;
>  	}
> +	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE);
>  
>  	down_write(&EXT4_I(inode)->i_data_sem);
>  	ext4_discard_preallocations(inode, 0);
> @@ -5329,6 +5346,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
>  
>  out_stop:
>  	ext4_journal_stop(handle);
> +	ext4_fc_stop_ineligible(sb);
>  out_mmap:
>  	up_write(&EXT4_I(inode)->i_mmap_sem);
>  out_mutex:
> @@ -5429,6 +5447,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
>  		ret = PTR_ERR(handle);
>  		goto out_mmap;
>  	}
> +	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE);
>  
>  	/* Expand file to avoid data loss if there is error while shifting */
>  	inode->i_size += len;
> @@ -5503,6 +5522,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
>  
>  out_stop:
>  	ext4_journal_stop(handle);
> +	ext4_fc_stop_ineligible(sb);
>  out_mmap:
>  	up_write(&EXT4_I(inode)->i_mmap_sem);
>  out_mutex:
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index f2d11b4c6b62..e0fa3bd18346 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -7,13 +7,1174 @@
>   *
>   * Ext4 fast commits routines.
>   */
> +#include "ext4.h"
>  #include "ext4_jbd2.h"
> +#include "ext4_extents.h"
> +#include "mballoc.h"
> +
> +/*
> + * Ext4 Fast Commits
> + * -----------------
> + *
> + * Ext4 fast commits implement fine grained journalling for Ext4.
> + *
> + * Fast commits are organized as a log of tag-length-value (TLV) structs. (See
> + * struct ext4_fc_tl). Each TLV contains some delta that is replayed TLV by
> + * TLV during the recovery phase. For the scenarios for which we currently
> + * don't have replay code, fast commit falls back to full commits.
> + * Fast commits record delta in one of the following three categories.
> + *
> + * (A) Directory entry updates:
> + *
> + * - EXT4_FC_TAG_UNLINK		- records directory entry unlink
> + * - EXT4_FC_TAG_LINK		- records directory entry link
> + * - EXT4_FC_TAG_CREAT		- records inode and directory entry creation
> + *
> + * (B) File specific data range updates:
> + *
> + * - EXT4_FC_TAG_ADD_RANGE	- records addition of new blocks to an inode
> + * - EXT4_FC_TAG_DEL_RANGE	- records deletion of blocks from an inode
> + *
> + * (C) Inode metadata (mtime / ctime etc):
> + *
> + * - EXT4_FC_TAG_INODE		- record the inode that should be replayed
> + *				  during recovery. Note that iblocks field is
> + *				  not replayed and instead derived during
> + *				  replay.
> + * Commit Operation
> + * ----------------
> + * With fast commits, we maintain all the directory entry operations in the
> + * order in which they are issued in an in-memory queue. This queue is flushed
> + * to disk during the commit operation. We also maintain a list of inodes
> + * that need to be committed during a fast commit in another in memory queue of
> + * inodes. During the commit operation, we commit in the following order:
> + *
> + * [1] Lock inodes for any further data updates by setting COMMITTING state
> + * [2] Submit data buffers of all the inodes
> + * [3] Wait for [2] to complete
> + * [4] Commit all the directory entry updates in the fast commit space
> + * [5] Commit all the changed inode structures
> + * [6] Write tail tag (this tag ensures the atomicity, please read the following
> + *     section for more details).
> + * [7] Wait for [4], [5] and [6] to complete.
> + *
> + * All the inode updates must call ext4_fc_start_update() before starting an
> + * update. If such an ongoing update is present, fast commit waits for it to
> + * complete. The completion of such an update is marked by
> + * ext4_fc_stop_update().
> + *
> + * Fast Commit Ineligibility
> + * -------------------------
> + * Not all operations are supported by fast commits today (e.g extended
> + * attributes). Fast commit ineligiblity is marked by calling one of the
> + * two following functions:
> + *
> + * - ext4_fc_mark_ineligible(): This makes next fast commit operation to fall
> + *   back to full commit. This is useful in case of transient errors.
> + *
> + * - ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() - This makes all
> + *   the fast commits happening between ext4_fc_start_ineligible() and
> + *   ext4_fc_stop_ineligible() and one fast commit after the call to
> + *   ext4_fc_stop_ineligible() to fall back to full commits. It is important to
> + *   make one more fast commit to fall back to full commit after stop call so
> + *   that it guaranteed that the fast commit ineligible operation contained
> + *   within ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() is
> + *   followed by at least 1 full commit.
> + *
> + * Atomicity of commits
> + * --------------------
> + * In order to gaurantee atomicity during the commit operation, fast commit
                  ^^^ guarantee

> + * uses "EXT4_FC_TAG_TAIL" tag that marks a fast commit as complete. Tail
> + * tag contains CRC of the contents and TID of the transaction after which
> + * this fast commit should be applied. Recovery code replays fast commit
> + * logs only if there's at least 1 valid tail present. For every fast commit
> + * operation, there is 1 tail. This means, we may end up with multiple tails
> + * in the fast commit space. Here's an example:
> + *
> + * - Create a new file A and remove existing file B
> + * - fsync()

Great that there's an example here. But what do we fsync here? A or dir with
A or something else?

> + * - Append contents to file A
> + * - Truncate file A
> + * - fsync()

And what is fsynced here?

> + *
> + * The fast commit space at the end of above operations would look like this:
> + *      [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL]
> + *             |<---  Fast Commit 1   --->|<---      Fast Commit 2     ---->|
> + *
> + * Replay code should thus check for all the valid tails in the FC area.

And one design question: Why do we record unlink of B here? I was kind of
hoping that fastcommit due to fsync(A) would record only operations related
to A. Because the way you wrote it, fast commit is inherently still a
filesystem-global operation requiring global ordering of metadata changes
with all the scalability bottlenecks current journalling code has. It's
faster by some factor due to more efficient packing of "small" changes not
fundamentally faster AFAICT...

> + *
> + * TODOs
> + * -----
> + * 1) Make fast commit atomic updates more fine grained. Today, a fast commit
> + *    eligible update must be protected within ext4_fc_start_update() and
> + *    ext4_fc_stop_update(). These routines are called at much higher
> + *    routines. This can be made more fine grained by combining with
> + *    ext4_journal_start().
> + *
> + * 2) Same above for ext4_fc_start_ineligible() and ext4_fc_stop_ineligible()
> + *
> + * 3) Handle more ineligible cases.
> + */
> +
> +#include <trace/events/ext4.h>
> +static struct kmem_cache *ext4_fc_dentry_cachep;
> +
> +static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
> +{
> +	BUFFER_TRACE(bh, "");
> +	if (uptodate) {
> +		ext4_debug("%s: Block %lld up-to-date",
> +			   __func__, bh->b_blocknr);
> +		set_buffer_uptodate(bh);
> +	} else {
> +		ext4_debug("%s: Block %lld not up-to-date",
> +			   __func__, bh->b_blocknr);
> +		clear_buffer_uptodate(bh);
> +	}
> +
> +	unlock_buffer(bh);
> +}
> +
> +static inline void ext4_fc_reset_inode(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	ei->i_fc_lblk_start = 0;
> +	ei->i_fc_lblk_len = 0;
> +}
> +
> +void ext4_fc_init_inode(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	ext4_fc_reset_inode(inode);
> +	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
> +	INIT_LIST_HEAD(&ei->i_fc_list);
> +	init_waitqueue_head(&ei->i_fc_wait);
> +	atomic_set(&ei->i_fc_updates, 0);
> +}
> +
> +/*
> + * Inform Ext4's fast about start of an inode update
> + *
> + * This function is called by the high level call VFS callbacks before
> + * performing any inode update. This function blocks if there's an ongoing
> + * fast commit on the inode in question.
> + */
> +void ext4_fc_start_update(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return;
> +
> +restart:
> +	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +	if (list_empty(&ei->i_fc_list))
> +		goto out;
> +
> +	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> +		wait_queue_head_t *wq;
> +#if (BITS_PER_LONG < 64)
> +		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
> +				EXT4_STATE_FC_COMMITTING);
> +		wq = bit_waitqueue(&ei->i_state_flags,
> +				   EXT4_STATE_FC_COMMITTING);
> +#else
> +		DEFINE_WAIT_BIT(wait, &ei->i_flags,
> +				EXT4_STATE_FC_COMMITTING);
> +		wq = bit_waitqueue(&ei->i_flags,
> +				   EXT4_STATE_FC_COMMITTING);
> +#endif
> +		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> +		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +		schedule();
> +		finish_wait(wq, &wait.wq_entry);
> +		goto restart;
> +	}
> +out:
> +	atomic_inc(&ei->i_fc_updates);
> +	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +}
> +
> +/*
> + * Stop inode update and wake up waiting fast commits if any.
> + */
> +void ext4_fc_stop_update(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return;
> +
> +	if (atomic_dec_and_test(&ei->i_fc_updates))
> +		wake_up_all(&ei->i_fc_wait);
> +}
> +
> +/*
> + * Remove inode from fast commit list. If the inode is being committed
> + * we wait until inode commit is done.
> + */
> +void ext4_fc_del(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return;
> +
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return;

Uh, why testing twice?

> +
> +restart:
> +	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +	if (list_empty(&ei->i_fc_list)) {
> +		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +		return;
> +	}
> +
> +	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> +		wait_queue_head_t *wq;
> +#if (BITS_PER_LONG < 64)
> +		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
> +				EXT4_STATE_FC_COMMITTING);
> +		wq = bit_waitqueue(&ei->i_state_flags,
> +				   EXT4_STATE_FC_COMMITTING);
> +#else
> +		DEFINE_WAIT_BIT(wait, &ei->i_flags,
> +				EXT4_STATE_FC_COMMITTING);
> +		wq = bit_waitqueue(&ei->i_flags,
> +				   EXT4_STATE_FC_COMMITTING);
> +#endif

Create a helper function for waiting for EXT4_STATE_FC_COMMITTING? It is
opencoded several times...

> +		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> +		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +		schedule();
> +		finish_wait(wq, &wait.wq_entry);
> +		goto restart;
> +	}
> +	if (!list_empty(&ei->i_fc_list))

You've checked for list_empty() above, no need to recheck again...

> +		list_del_init(&ei->i_fc_list);
> +	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +}
> +
> +/*
> + * Mark file system as fast commit ineligible. This means that next commit
> + * operation would result in a full jbd2 commit.
> + */
> +void ext4_fc_mark_ineligible(struct super_block *sb, int reason)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> +	sbi->s_mount_state |= EXT4_FC_INELIGIBLE;
> +	WARN_ON(reason >= EXT4_FC_REASON_MAX);
> +	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
> +}
> +
> +/*
> + * Start a fast commit ineligible update. Any commits that happen while
> + * such an operation is in progress fall back to full commits.
> + */
> +void ext4_fc_start_ineligible(struct super_block *sb, int reason)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> +	WARN_ON(reason >= EXT4_FC_REASON_MAX);
> +	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
> +	atomic_inc(&sbi->s_fc_ineligible_updates);
> +}
> +
> +/*
> + * Stop a fast commit ineligible update. We set EXT4_FC_INELIGIBLE flag here
> + * to ensure that after stopping the ineligible update, at least one full
> + * commit takes place.
> + */
> +void ext4_fc_stop_ineligible(struct super_block *sb)
> +{
> +	EXT4_SB(sb)->s_mount_state |= EXT4_FC_INELIGIBLE;
> +	atomic_dec(&EXT4_SB(sb)->s_fc_ineligible_updates);
> +}
> +
> +static inline int ext4_fc_is_ineligible(struct super_block *sb)
> +{
> +	return (EXT4_SB(sb)->s_mount_state & EXT4_FC_INELIGIBLE) ||
> +		atomic_read(&EXT4_SB(sb)->s_fc_ineligible_updates);
> +}
> +
> +/*
> + * Generic fast commit tracking function. If this is the first time this we are
> + * called after a full commit, we initialize fast commit fields and then call
> + * __fc_track_fn() with update = 0. If we have already been called after a full
> + * commit, we pass update = 1. Based on that, the track function can determine
> + * if it needs to track a field for the first time or if it needs to just
> + * update the previously tracked value.
> + *
> + * If enqueue is set, this function enqueues the inode in fast commit list.
> + */
> +static int ext4_fc_track_template(
> +	struct inode *inode, int (*__fc_track_fn)(struct inode *, void *, bool),
> +	void *args, int enqueue)
> +{
> +	tid_t running_txn_tid;
> +	bool update = false;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +	int ret;
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return -EOPNOTSUPP;
> +
> +	if (ext4_fc_is_ineligible(inode->i_sb))
> +		return -EINVAL;
> +
> +	running_txn_tid = sbi->s_journal ?
> +		sbi->s_journal->j_commit_sequence + 1 : 0;

This looks problematic. The j_commit_sequence sampling is racy - first
without j_state_lock you can be fetching stale value, second you don't
know whether there is transaction currently committing or not. If there is,
j_commit_sequence will contain TID of the transaction before it which is
wrong for your purposes. I think you should pass 'handle' into all the
tracking functions and derive running transaction TID from that as we do it
elsewhere.

> +
> +	mutex_lock(&ei->i_fc_lock);
> +	if (running_txn_tid == ei->i_sync_tid) {
> +		update = true;
> +	} else {
> +		ext4_fc_reset_inode(inode);
> +		ei->i_sync_tid = running_txn_tid;
> +	}
> +	ret = __fc_track_fn(inode, args, update);
> +	mutex_unlock(&ei->i_fc_lock);
> +
> +	if (!enqueue)
> +		return ret;
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	if (list_empty(&EXT4_I(inode)->i_fc_list))
> +		list_add_tail(&EXT4_I(inode)->i_fc_list,
> +				(sbi->s_mount_state & EXT4_FC_COMMITTING) ?
> +				&sbi->s_fc_q[FC_Q_STAGING] :
> +				&sbi->s_fc_q[FC_Q_MAIN]);
> +	spin_unlock(&sbi->s_fc_lock);

OK, so how do you prevent inode from being freed while it is still on
i_fc_list? I don't see anything preventing that and it could cause nasty
use-after-free issues. Note that for similar reasons JBD2 uses external
separately allocated inode for jbd2_inode so that it can have separate
lifetime (related to transaction commits) from struct ext4_inode_info.

> +
> +	return ret;
> +}
> +
> +struct __track_dentry_update_args {
> +	struct dentry *dentry;
> +	int op;
> +};
> +
> +/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */
> +static int __track_dentry_update(struct inode *inode, void *arg, bool update)
> +{
> +	struct ext4_fc_dentry_update *node;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct __track_dentry_update_args *dentry_update =
> +		(struct __track_dentry_update_args *)arg;
> +	struct dentry *dentry = dentry_update->dentry;
> +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +
> +	mutex_unlock(&ei->i_fc_lock);
> +	node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS);
> +	if (!node) {
> +		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_MEM);
> +		mutex_lock(&ei->i_fc_lock);
> +		return -ENOMEM;
> +	}
> +
> +	node->fcd_op = dentry_update->op;
> +	node->fcd_parent = dentry->d_parent->d_inode->i_ino;
> +	node->fcd_ino = inode->i_ino;
> +	if (dentry->d_name.len > DNAME_INLINE_LEN) {
> +		node->fcd_name.name = kmalloc(dentry->d_name.len, GFP_NOFS);
> +		if (!node->fcd_name.name) {
> +			kmem_cache_free(ext4_fc_dentry_cachep, node);
> +			ext4_fc_mark_ineligible(inode->i_sb,
> +				EXT4_FC_REASON_MEM);
> +			mutex_lock(&ei->i_fc_lock);
> +			return -ENOMEM;
> +		}
> +		memcpy((u8 *)node->fcd_name.name, dentry->d_name.name,
> +			dentry->d_name.len);
> +	} else {
> +		memcpy(node->fcd_iname, dentry->d_name.name,
> +			dentry->d_name.len);
> +		node->fcd_name.name = node->fcd_iname;
> +	}
> +	node->fcd_name.len = dentry->d_name.len;
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	if (sbi->s_mount_state & EXT4_FC_COMMITTING)
> +		list_add_tail(&node->fcd_list,
> +				&sbi->s_fc_dentry_q[FC_Q_STAGING]);
> +	else
> +		list_add_tail(&node->fcd_list, &sbi->s_fc_dentry_q[FC_Q_MAIN]);
> +	spin_unlock(&sbi->s_fc_lock);
> +	mutex_lock(&ei->i_fc_lock);
> +
> +	return 0;
> +}
> +
> +void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry)
> +{
> +	struct __track_dentry_update_args args;
> +	int ret;
> +
> +	args.dentry = dentry;
> +	args.op = EXT4_FC_TAG_UNLINK;
> +
> +	ret = ext4_fc_track_template(inode, __track_dentry_update,
> +					(void *)&args, 0);
> +	trace_ext4_fc_track_unlink(inode, dentry, ret);
> +}
> +
> +void ext4_fc_track_link(struct inode *inode, struct dentry *dentry)
> +{
> +	struct __track_dentry_update_args args;
> +	int ret;
> +
> +	args.dentry = dentry;
> +	args.op = EXT4_FC_TAG_LINK;
> +
> +	ret = ext4_fc_track_template(inode, __track_dentry_update,
> +					(void *)&args, 0);
> +	trace_ext4_fc_track_link(inode, dentry, ret);
> +}
> +
> +void ext4_fc_track_create(struct inode *inode, struct dentry *dentry)
> +{
> +	struct __track_dentry_update_args args;
> +	int ret;
> +
> +	args.dentry = dentry;
> +	args.op = EXT4_FC_TAG_CREAT;
> +
> +	ret = ext4_fc_track_template(inode, __track_dentry_update,
> +					(void *)&args, 0);
> +	trace_ext4_fc_track_create(inode, dentry, ret);
> +}
> +
> +/* __track_fn for inode tracking */
> +static int __track_inode(struct inode *inode, void *arg, bool update)
> +{
> +	if (update)
> +		return -EEXIST;
> +
> +	EXT4_I(inode)->i_fc_lblk_len = 0;
> +
> +	return 0;
> +}
> +
> +void ext4_fc_track_inode(struct inode *inode)
> +{
> +	int ret;
> +
> +	if (S_ISDIR(inode->i_mode))
> +		return;
> +
> +	ret = ext4_fc_track_template(inode, __track_inode, NULL, 1);
> +	trace_ext4_fc_track_inode(inode, ret);
> +}
> +
> +struct __track_range_args {
> +	ext4_lblk_t start, end;
> +};
> +
> +/* __track_fn for tracking data updates */
> +static int __track_range(struct inode *inode, void *arg, bool update)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	ext4_lblk_t oldstart;
> +	struct __track_range_args *__arg =
> +		(struct __track_range_args *)arg;
> +
> +	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb)) {
> +		ext4_debug("Special inode %ld being modified\n", inode->i_ino);
> +		return -ECANCELED;
> +	}
> +
> +	oldstart = ei->i_fc_lblk_start;
> +
> +	if (update && ei->i_fc_lblk_len > 0) {
> +		ei->i_fc_lblk_start = min(ei->i_fc_lblk_start, __arg->start);
> +		ei->i_fc_lblk_len =
> +			max(oldstart + ei->i_fc_lblk_len - 1, __arg->end) -
> +				ei->i_fc_lblk_start + 1;
> +	} else {
> +		ei->i_fc_lblk_start = __arg->start;
> +		ei->i_fc_lblk_len = __arg->end - __arg->start + 1;
> +	}
> +
> +	return 0;
> +}
> +
> +void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
> +			 ext4_lblk_t end)
> +{
> +	struct __track_range_args args;
> +	int ret;
> +
> +	if (S_ISDIR(inode->i_mode))
> +		return;
> +
> +	args.start = start;
> +	args.end = end;
> +
> +	ret = ext4_fc_track_template(inode,  __track_range, &args, 1);
> +
> +	trace_ext4_fc_track_range(inode, start, end, ret);
> +}
> +
> +static void ext4_fc_submit_bh(struct super_block *sb)
> +{
> +	int write_flags = REQ_SYNC;
> +	struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
> +
> +	if (test_opt(sb, BARRIER))
> +		write_flags |= REQ_FUA | REQ_PREFLUSH;

Submitting each fastcommit buffer with REQ_FUA | REQ_PREFLUSH is
unnecessarily expensive (especially if there will be unrelated writes
happening to the filesystem while fastcommit is running). If nothing else,
it's enough to have REQ_PREFLUSH only once during the whole fastcommit to
flush out written back data blocks (plus journal device may be different
from the filesystem device so you need to be flushing the filesystem device
for this - see how the jbd2 commit code does this).

Also REQ_FUA on each block may be overkill for devices that don't support
it natively (and thus REQ_FUA is simulated with full write cache pre and
post flush) - for such devices it would be better to just write out
fastcommit normally and then issue one cache flush. With careful
checksumming, block ID tagging and such, it should be safe against disk
reordering writes. But I guess we can leave this optimization as a TODO
item for later (but I think it would be good to design the on-disk format of
fastcommit blocks so that it does not rely on FUA writes).
 
> +	lock_buffer(bh);
> +	clear_buffer_dirty(bh);
> +	set_buffer_uptodate(bh);
> +	bh->b_end_io = ext4_end_buffer_io_sync;
> +	submit_bh(REQ_OP_WRITE, write_flags, bh);
> +	EXT4_SB(sb)->s_fc_bh = NULL;
> +}
> +
> +/* Ext4 commit path routines */
> +
> +/* memzero and update CRC */
> +static void *ext4_fc_memzero(struct super_block *sb, void *dst, int len,
> +				u32 *crc)
> +{
> +	void *ret;
> +
> +	ret = memset(dst, 0, len);
> +	if (crc)
> +		*crc = ext4_chksum(EXT4_SB(sb), *crc, dst, len);
> +	return ret;
> +}
> +
> +/*
> + * Allocate len bytes on a fast commit buffer.
> + *
> + * During the commit time this function is used to manage fast commit
> + * block space. We don't split a fast commit log onto different
> + * blocks. So this function makes sure that if there's not enough space
> + * on the current block, the remaining space in the current block is
> + * marked as unused by adding EXT4_FC_TAG_PAD tag. In that case,
> + * new block is from jbd2 and CRC is updated to reflect the padding
> + * we added.
> + */
> +static u8 *ext4_fc_reserve_space(struct super_block *sb, int len, u32 *crc)
> +{
> +	struct ext4_fc_tl *tl;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct buffer_head *bh;
> +	int bsize = sbi->s_journal->j_blocksize;
> +	int ret, off = sbi->s_fc_bytes % bsize;
> +	int pad_len;
> +
> +	/*
> +	 * After allocating len, we should have space at least for a 0 byte
> +	 * padding.
> +	 */
> +	if (len + sizeof(struct ext4_fc_tl) > bsize)
> +		return NULL;
> +
> +	if (bsize - off - 1 > len + sizeof(struct ext4_fc_tl)) {
> +		/*
> +		 * Only allocate from current buffer if we have enough space for
> +		 * this request AND we have space to add a zero byte padding.
> +		 */
> +		if (!sbi->s_fc_bh) {
> +			ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
> +			if (ret)
> +				return NULL;
> +			sbi->s_fc_bh = bh;
> +		}
> +		sbi->s_fc_bytes += len;
> +		return sbi->s_fc_bh->b_data + off;
> +	}
> +	/* Need to add PAD tag */
> +	tl = (struct ext4_fc_tl *)(sbi->s_fc_bh->b_data + off);
> +	tl->fc_tag = cpu_to_le16(EXT4_FC_TAG_PAD);
> +	pad_len = bsize - off - 1 - sizeof(struct ext4_fc_tl);
> +	tl->fc_len = cpu_to_le16(pad_len);
> +	if (crc)
> +		*crc = ext4_chksum(sbi, *crc, tl, sizeof(*tl));
> +	if (pad_len > 0)
> +		ext4_fc_memzero(sb, tl + 1, pad_len, crc);
> +	ext4_fc_submit_bh(sb);
> +
> +	ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
> +	if (ret)
> +		return NULL;
> +	sbi->s_fc_bh = bh;
> +	sbi->s_fc_bytes = (sbi->s_fc_bytes / bsize + 1) * bsize + len;
> +	return sbi->s_fc_bh->b_data;
> +}
> +
> +/* memcpy to fc reserved space and update CRC */
> +static void *ext4_fc_memcpy(struct super_block *sb, void *dst, const void *src,
> +				int len, u32 *crc)
> +{
> +	if (crc)
> +		*crc = ext4_chksum(EXT4_SB(sb), *crc, src, len);
> +	return memcpy(dst, src, len);
> +}
> +
> +/*
> + * Complete a fast commit by writing tail tag.
> + *
> + * Writing tail tag marks the end of a fast commit. In order to guarantee
> + * atomicity, after writing tail tag, even if there's space remaining
> + * in the block, next commit shouldn't use it. That's why tail tag
> + * has the length as that of the remaining space on the block.
> + */
> +static int ext4_fc_write_tail(struct super_block *sb, u32 crc)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_fc_tl tl;
> +	struct ext4_fc_tail tail;
> +	int off, bsize = sbi->s_journal->j_blocksize;
> +	u8 *dst;
> +
> +	/*
> +	 * ext4_fc_reserve_space takes care of allocating an extra block if
> +	 * there's no enough space on this block for accommodating this tail.
> +	 */
> +	dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc);
> +	if (!dst)
> +		return -ENOSPC;
> +
> +	off = sbi->s_fc_bytes % bsize;
> +
> +	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL);
> +	tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail));
> +	sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize);
> +
> +	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc);
> +	dst += sizeof(tl);
> +	tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid);
> +	ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc);
> +	dst += sizeof(tail.fc_tid);
> +	tail.fc_crc = cpu_to_le32(crc);
> +	ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL);
> +
> +	ext4_fc_submit_bh(sb);
> +
> +	return 0;
> +}

Is there a reason to pass CRC all around (so you have to have special
functions like ext4_fc_memcpy(), ext4_fc_memzero(), ...) instead of just
creating the whole block and then computing CRC in one go?

In fact, as looking through the code, it seems to me it would be slightly
nicer layer separation and interface if JBD2 provided functions for storage
of data blobs and handled the details of space & block management,
checksums, writeout, on recovery verification of correctness (so it would
just provide back a stream of blobs for FS to replay). Just an idea for
consideration, the current interface isn't too bad and we can change it
later if we decide so.

> +
> +/*
> + * Adds tag, length, value and updates CRC. Returns true if tlv was added.
> + * Returns false if there's not enough space.
> + */
> +static bool ext4_fc_add_tlv(struct super_block *sb, u16 tag, u16 len, u8 *val,
> +			   u32 *crc)
> +{
> +	struct ext4_fc_tl tl;
> +	u8 *dst;
> +
> +	dst = ext4_fc_reserve_space(sb, sizeof(tl) + len, crc);
> +	if (!dst)
> +		return false;
> +
> +	tl.fc_tag = cpu_to_le16(tag);
> +	tl.fc_len = cpu_to_le16(len);
> +
> +	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
> +	ext4_fc_memcpy(sb, dst + sizeof(tl), val, len, crc);
> +
> +	return true;
> +}
> +
> +/* Same as above, but adds dentry tlv. */
> +static  bool ext4_fc_add_dentry_tlv(struct super_block *sb, u16 tag,
> +					int parent_ino, int ino, int dlen,
> +					const unsigned char *dname,
> +					u32 *crc)
> +{
> +	struct ext4_fc_dentry_info fcd;
> +	struct ext4_fc_tl tl;
> +	u8 *dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(fcd) + dlen,
> +					crc);
> +
> +	if (!dst)
> +		return false;
> +
> +	fcd.fc_parent_ino = cpu_to_le32(parent_ino);
> +	fcd.fc_ino = cpu_to_le32(ino);
> +	tl.fc_tag = cpu_to_le16(tag);
> +	tl.fc_len = cpu_to_le16(sizeof(fcd) + dlen);
> +	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
> +	dst += sizeof(tl);
> +	ext4_fc_memcpy(sb, dst, &fcd, sizeof(fcd), crc);
> +	dst += sizeof(fcd);
> +	ext4_fc_memcpy(sb, dst, dname, dlen, crc);
> +	dst += dlen;
> +
> +	return true;
> +}
> +
> +/*
> + * Writes inode in the fast commit space under TLV with tag @tag.
> + * Returns 0 on success, error on failure.
> + */
> +static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
> +	int ret;
> +	struct ext4_iloc iloc;
> +	struct ext4_fc_inode fc_inode;
> +	struct ext4_fc_tl tl;
> +	u8 *dst;
> +
> +	ret = ext4_get_inode_loc(inode, &iloc);
> +	if (ret)
> +		return ret;
> +
> +	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
> +		inode_len += ei->i_extra_isize;
> +
> +	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
> +	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
> +	tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino));
> +
> +	dst = ext4_fc_reserve_space(inode->i_sb,
> +			sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc);
> +	if (!dst)
> +		return -ECANCELED;
> +
> +	if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc))
> +		return -ECANCELED;
> +	dst += sizeof(tl);
> +	if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc))
> +		return -ECANCELED;
> +	dst += sizeof(fc_inode);
> +	if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc),
> +					inode_len, crc))
> +		return -ECANCELED;

Isn't this racy? What guarantees the inode state you record here is a valid
one for the fastcommit? I mean this gets called at the time of fastcommit
(i.e., fsync), so a fastcommit code must record changes to all other
metadata that relate to the currently recorded inode state. But this isn't
serialized in any way (AFAICT) with on-going inode changes so how can
fastcommit code guarantee that? This is a similar case as a problem I
describe below...

> +
> +	return 0;
> +}
> +
> +/*
> + * Writes updated data ranges for the inode in question. Updates CRC.
> + * Returns 0 on success, error otherwise.
> + */
> +static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> +{
> +	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct ext4_map_blocks map;
> +	struct ext4_fc_add_range fc_ext;
> +	struct ext4_fc_del_range lrange;
> +	struct ext4_extent *ex;
> +	int ret;
> +
> +	mutex_lock(&ei->i_fc_lock);
> +	if (ei->i_fc_lblk_len == 0) {
> +		mutex_unlock(&ei->i_fc_lock);
> +		return 0;
> +	}
> +	old_blk_size = ei->i_fc_lblk_start;
> +	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> +	ei->i_fc_lblk_len = 0;
> +	mutex_unlock(&ei->i_fc_lock);
> +
> +	cur_lblk_off = old_blk_size;
> +	jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> +		  __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> +
> +	while (cur_lblk_off <= new_blk_size) {
> +		map.m_lblk = cur_lblk_off;
> +		map.m_len = new_blk_size - cur_lblk_off + 1;
> +		ret = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (ret < 0)
> +			return -ECANCELED;

So isn't this actually racy with a risk of stale data exposure? Consider a
situation like:

Task 1:				Task 2:
pwrite(file, buf, 8192, 0)
punch(file, 0, 4096)
fsync(file)
  writeout range 4096-8192
  fastcommit for inode range 0-8192
				pwrite(file, buf, 4096, 0)
    ext4_map_blocks(file)
      - reports that block at offset 0 is mapped so that is recorded in
        fastcommit record. But data for that is not written so after a
        crash we'd expose stale data in that block.

Am I missing something?  

> +
> +		if (map.m_len == 0) {
> +			cur_lblk_off++;
> +			continue;
> +		}
> +
> +		if (ret == 0) {
> +			lrange.fc_ino = cpu_to_le32(inode->i_ino);
> +			lrange.fc_lblk = cpu_to_le32(map.m_lblk);
> +			lrange.fc_len = cpu_to_le32(map.m_len);
> +			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
> +					    sizeof(lrange), (u8 *)&lrange, crc))
> +				return -ENOSPC;
> +		} else {
> +			fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
> +			ex = (struct ext4_extent *)&fc_ext.fc_ex;
> +			ex->ee_block = cpu_to_le32(map.m_lblk);
> +			ex->ee_len = cpu_to_le16(map.m_len);
> +			ext4_ext_store_pblock(ex, map.m_pblk);
> +			if (map.m_flags & EXT4_MAP_UNWRITTEN)
> +				ext4_ext_mark_unwritten(ex);
> +			else
> +				ext4_ext_mark_initialized(ex);
> +			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
> +					    sizeof(fc_ext), (u8 *)&fc_ext, crc))
> +				return -ENOSPC;
> +		}
> +
> +		cur_lblk_off += map.m_len;
> +	}
> +
> +	return 0;
> +}
> +
> +
> +/* Submit data for all the fast commit inodes */
> +static int ext4_fc_submit_inode_data_all(journal_t *journal)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_inode_info *ei;
> +	struct list_head *pos;
> +	int ret = 0;
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	sbi->s_mount_state |= EXT4_FC_COMMITTING;
> +	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
> +		ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
> +		ext4_set_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING);
> +		while (atomic_read(&ei->i_fc_updates)) {
> +			DEFINE_WAIT(wait);
> +
> +			prepare_to_wait(&ei->i_fc_wait, &wait,
> +						TASK_UNINTERRUPTIBLE);
> +			if (atomic_read(&ei->i_fc_updates)) {
> +				spin_unlock(&sbi->s_fc_lock);
> +				schedule();
> +				spin_lock(&sbi->s_fc_lock);
> +			}
> +			finish_wait(&ei->i_fc_wait, &wait);
> +		}
> +		spin_unlock(&sbi->s_fc_lock);
> +		ret = jbd2_submit_inode_data(ei->jinode);
> +		if (ret)
> +			return ret;
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +
> +	return ret;
> +}
> +
> +/* Wait for completion of data for all the fast commit inodes */
> +static int ext4_fc_wait_inode_data_all(journal_t *journal)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_inode_info *pos, *n;
> +	int ret = 0;
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	list_for_each_entry_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
> +		if (!ext4_test_inode_state(&pos->vfs_inode,
> +					   EXT4_STATE_FC_COMMITTING))
> +			continue;
> +		spin_unlock(&sbi->s_fc_lock);
> +
> +		ret = jbd2_wait_inode_data(journal, pos->jinode);
> +		if (ret)
> +			return ret;
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +
> +	return 0;
> +}
> +
> +/* Commit all the directory entry updates */
> +static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_fc_dentry_update *fc_dentry;
> +	struct inode *inode;
> +	struct list_head *pos, *n, *fcd_pos, *fcd_n;
> +	struct ext4_inode_info *ei;
> +	int ret;
> +
> +	if (list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN]))
> +		return 0;
> +	list_for_each_safe(fcd_pos, fcd_n, &sbi->s_fc_dentry_q[FC_Q_MAIN]) {
> +		fc_dentry = list_entry(fcd_pos, struct ext4_fc_dentry_update,
> +					fcd_list);
> +		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT) {
> +			spin_unlock(&sbi->s_fc_lock);
> +			if (!ext4_fc_add_dentry_tlv(
> +				sb, fc_dentry->fcd_op,
> +				fc_dentry->fcd_parent, fc_dentry->fcd_ino,
> +				fc_dentry->fcd_name.len,
> +				fc_dentry->fcd_name.name, crc)) {
> +				return -ENOSPC;
> +			}
> +			spin_lock(&sbi->s_fc_lock);
> +			continue;
> +		}
> +
> +		inode = NULL;
> +		list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
> +			ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
> +			if (ei->vfs_inode.i_ino == fc_dentry->fcd_ino) {
> +				inode = &ei->vfs_inode;
> +				break;
> +			}
> +		}
> +		/*
> +		 * If we don't find inode in our list, then it was deleted,
> +		 * in which case, we don't need to record it's create tag.
> +		 */
> +		if (!inode)
> +			continue;
> +		spin_unlock(&sbi->s_fc_lock);
> +
> +		/*
> +		 * We first write the inode and then the create dirent. This
> +		 * allows the recovery code to create an unnamed inode first
> +		 * and then link it to a directory entry. This allows us
> +		 * to use namei.c routines almost as is and simplifies
> +		 * the recovery code.
> +		 */
> +		ret = ext4_fc_write_inode(inode, crc);
> +		if (ret)
> +			return ret;
> +		ret = ext4_fc_write_inode_data(inode, crc);
> +		if (ret)
> +			return ret;
> +
> +		if (!ext4_fc_add_dentry_tlv(
> +			sb, fc_dentry->fcd_op,
> +			fc_dentry->fcd_parent, fc_dentry->fcd_ino,
> +			fc_dentry->fcd_name.len,
> +			fc_dentry->fcd_name.name, crc))
> +			return -ENOSPC;
> +
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +	return 0;
> +}
> +
> +static int ext4_fc_perform_commit(journal_t *journal)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_inode_info *iter;
> +	struct ext4_fc_head head;
> +	struct list_head *pos;
> +	struct inode *inode;
> +	struct blk_plug plug;
> +	int ret = 0;
> +	u32 crc = 0;
> +
> +	ret = ext4_fc_submit_inode_data_all(journal);
> +	if (ret)
> +		return ret;
> +
> +	ret = ext4_fc_wait_inode_data_all(journal);
> +	if (ret)
> +		return ret;
> +
> +	blk_start_plug(&plug);
> +	if (sbi->s_fc_bytes == 0) {
> +		/*
> +		 * Add a head tag only if this is the first fast commit
> +		 * in this TID.
> +		 */
> +		head.fc_features = cpu_to_le32(EXT4_FC_SUPPORTED_FEATURES);
> +		head.fc_tid = cpu_to_le32(
> +			sbi->s_journal->j_running_transaction->t_tid);
> +		if (!ext4_fc_add_tlv(sb, EXT4_FC_TAG_HEAD, sizeof(head),
> +			(u8 *)&head, &crc))
> +			goto out;
> +	}
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	ret = ext4_fc_commit_dentry_updates(journal, &crc);
> +	if (ret) {
> +		spin_unlock(&sbi->s_fc_lock);
> +		goto out;
> +	}
> +
> +	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
> +		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
> +		inode = &iter->vfs_inode;
> +		if (!ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
> +			continue;
> +
> +		spin_unlock(&sbi->s_fc_lock);
> +		ret = ext4_fc_write_inode_data(inode, &crc);
> +		if (ret)
> +			goto out;
> +		ret = ext4_fc_write_inode(inode, &crc);
> +		if (ret)
> +			goto out;
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +
> +	ret = ext4_fc_write_tail(sb, crc);
> +
> +out:
> +	blk_finish_plug(&plug);
> +	return ret;
> +}
> +
> +/*
> + * The main commit entry point. Performs a fast commit for transaction
> + * commit_tid if needed. If it's not possible to perform a fast commit
> + * due to various reasons, we fall back to full commit. Returns 0
> + * on success, error otherwise.
> + */
> +int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	int nblks = 0, ret, bsize = journal->j_blocksize;
> +	int subtid = atomic_read(&sbi->s_fc_subtid);
> +	int reason = EXT4_FC_REASON_OK, fc_bufs_before = 0;
> +	ktime_t start_time, commit_time;
> +
> +	trace_ext4_fc_commit_start(sb);
> +
> +	start_time = ktime_get();
> +
> +	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
> +		(ext4_fc_is_ineligible(sb))) {
> +		reason = EXT4_FC_REASON_INELIGIBLE;
> +		goto out;
> +	}
> +
> +restart_fc:
> +	ret = jbd2_fc_begin_commit(journal, commit_tid);
> +	if (ret == -EALREADY) {
> +		/* There was an ongoing commit, check if we need to restart */
> +		if (atomic_read(&sbi->s_fc_subtid) <= subtid &&
> +			commit_tid > journal->j_commit_sequence)
> +			goto restart_fc;
> +		reason = EXT4_FC_REASON_ALREADY_COMMITTED;
> +		goto out;
> +	} else if (ret) {
> +		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
> +		reason = EXT4_FC_REASON_FC_START_FAILED;
> +		goto out;
> +	}
> +
> +	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
> +	ret = ext4_fc_perform_commit(journal);
> +	if (ret < 0) {
> +		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
> +		reason = EXT4_FC_REASON_FC_FAILED;
> +		goto out;
> +	}
> +	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
> +	ret = jbd2_fc_wait_bufs(journal, nblks);
> +	if (ret < 0) {
> +		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
> +		reason = EXT4_FC_REASON_FC_FAILED;
> +		goto out;
> +	}
> +	atomic_inc(&sbi->s_fc_subtid);
> +	jbd2_fc_end_commit(journal);
> +out:
> +	/* Has any ineligible update happened since we started? */
> +	if (reason == EXT4_FC_REASON_OK && ext4_fc_is_ineligible(sb)) {
> +		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
> +		reason = EXT4_FC_REASON_INELIGIBLE;
> +	}
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	if (reason != EXT4_FC_REASON_OK &&
> +		reason != EXT4_FC_REASON_ALREADY_COMMITTED) {
> +		sbi->s_fc_stats.fc_ineligible_commits++;
> +	} else {
> +		sbi->s_fc_stats.fc_num_commits++;
> +		sbi->s_fc_stats.fc_numblks += nblks;
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +	nblks = (reason == EXT4_FC_REASON_OK) ? nblks : 0;
> +	trace_ext4_fc_commit_stop(sb, nblks, reason);
> +	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> +	/*
> +	 * weight the commit time higher than the average time so we don't
> +	 * react too strongly to vast changes in the commit time
> +	 */
> +	if (likely(sbi->s_fc_avg_commit_time))
> +		sbi->s_fc_avg_commit_time = (commit_time +
> +				sbi->s_fc_avg_commit_time * 3) / 4;
> +	else
> +		sbi->s_fc_avg_commit_time = commit_time;
> +	jbd_debug(1,
> +		"Fast commit ended with blks = %d, reason = %d, subtid - %d",
> +		nblks, reason, subtid);
> +	if (reason == EXT4_FC_REASON_FC_FAILED)
> +		return jbd2_fc_end_commit_fallback(journal, commit_tid);
> +	if (reason == EXT4_FC_REASON_FC_START_FAILED ||
> +		reason == EXT4_FC_REASON_INELIGIBLE)
> +		return jbd2_complete_transaction(journal, commit_tid);
> +	return 0;
> +}
> +
>  /*
>   * Fast commit cleanup routine. This is called after every fast commit and
>   * full commit. full is true if we are called after a full commit.
>   */
>  static void ext4_fc_cleanup(journal_t *journal, int full)
>  {
> +	struct super_block *sb = journal->j_private;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_inode_info *iter;
> +	struct ext4_fc_dentry_update *fc_dentry;
> +	struct list_head *pos, *n;
> +
> +	if (full && sbi->s_fc_bh)
> +		sbi->s_fc_bh = NULL;
> +
> +	jbd2_fc_release_bufs(journal);
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
> +		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
> +		list_del_init(&iter->i_fc_list);
> +		ext4_clear_inode_state(&iter->vfs_inode,
> +				       EXT4_STATE_FC_COMMITTING);
> +		ext4_fc_reset_inode(&iter->vfs_inode);
> +		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
> +		smp_mb();
> +#if (BITS_PER_LONG < 64)
> +		wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
> +#else
> +		wake_up_bit(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
> +#endif
> +	}
> +
> +	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
> +		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
> +					     struct ext4_fc_dentry_update,
> +					     fcd_list);
> +		list_del_init(&fc_dentry->fcd_list);
> +		spin_unlock(&sbi->s_fc_lock);
> +
> +		if (fc_dentry->fcd_name.name &&
> +			fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
> +			kfree(fc_dentry->fcd_name.name);
> +		kmem_cache_free(ext4_fc_dentry_cachep, fc_dentry);
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +
> +	list_splice_init(&sbi->s_fc_dentry_q[FC_Q_STAGING],
> +				&sbi->s_fc_dentry_q[FC_Q_MAIN]);
> +	list_splice_init(&sbi->s_fc_q[FC_Q_STAGING],
> +				&sbi->s_fc_q[FC_Q_STAGING]);
> +
> +	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
> +	sbi->s_mount_state &= ~EXT4_FC_INELIGIBLE;
> +
> +	if (full)
> +		sbi->s_fc_bytes = 0;
> +	spin_unlock(&sbi->s_fc_lock);
> +	trace_ext4_fc_stats(sb);
>  }
>  
>  void ext4_fc_init(struct super_block *sb, journal_t *journal)
> @@ -26,3 +1187,14 @@ void ext4_fc_init(struct super_block *sb, journal_t *journal)
>  		ext4_clear_feature_fast_commit(sb);
>  	}
>  }
> +
> +int __init ext4_fc_init_dentry_cache(void)
> +{
> +	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
> +					   SLAB_RECLAIM_ACCOUNT);
> +
> +	if (ext4_fc_dentry_cachep == NULL)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
> index 8362bf5e6e00..560bc9ca8c79 100644
> --- a/fs/ext4/fast_commit.h
> +++ b/fs/ext4/fast_commit.h
> @@ -6,4 +6,114 @@
>  /* Number of blocks in journal area to allocate for fast commits */
>  #define EXT4_NUM_FC_BLKS		256
>  
> +/* Fast commit tags */
> +#define EXT4_FC_TAG_ADD_RANGE		0x0001
> +#define EXT4_FC_TAG_DEL_RANGE		0x0002
> +#define EXT4_FC_TAG_CREAT		0x0003
> +#define EXT4_FC_TAG_LINK		0x0004
> +#define EXT4_FC_TAG_UNLINK		0x0005
> +#define EXT4_FC_TAG_INODE		0x0006
> +#define EXT4_FC_TAG_PAD			0x0007
> +#define EXT4_FC_TAG_TAIL		0x0008
> +#define EXT4_FC_TAG_HEAD		0x0009
> +
> +#define EXT4_FC_SUPPORTED_FEATURES	0x0
> +
> +/* On disk fast commit tlv value structures */
> +
> +/* Fast commit on disk tag length structure */
> +struct ext4_fc_tl {
> +	__le16 fc_tag;
> +	__le16 fc_len;
> +};
> +
> +/* Value structure for tag EXT4_FC_TAG_HEAD. */
> +struct ext4_fc_head {
> +	__le32 fc_features;
> +	__le32 fc_tid;
> +};
> +
> +/* Value structure for EXT4_FC_TAG_ADD_RANGE. */
> +struct ext4_fc_add_range {
> +	__le32 fc_ino;
> +	__u8 fc_ex[12];
> +};
> +
> +/* Value structure for tag EXT4_FC_TAG_DEL_RANGE. */
> +struct ext4_fc_del_range {
> +	__le32 fc_ino;
> +	__le32 fc_lblk;
> +	__le32 fc_len;
> +};
> +
> +/*
> + * This is the value structure for tags EXT4_FC_TAG_CREAT, EXT4_FC_TAG_LINK
> + * and EXT4_FC_TAG_UNLINK.
> + */
> +struct ext4_fc_dentry_info {
> +	__le32 fc_parent_ino;
> +	__le32 fc_ino;
> +	u8 fc_dname[0];
> +};
> +
> +/* Value structure for EXT4_FC_TAG_INODE and EXT4_FC_TAG_INODE_PARTIAL. */
> +struct ext4_fc_inode {
> +	__le32 fc_ino;
> +	__u8 fc_raw_inode[0];
> +};
> +
> +/* Value structure for tag EXT4_FC_TAG_TAIL. */
> +struct ext4_fc_tail {
> +	__le32 fc_tid;
> +	__le32 fc_crc;
> +};
> +
> +/*
> + * In memory list of dentry updates that are performed on the file
> + * system used by fast commit code.
> + */
> +struct ext4_fc_dentry_update {
> +	int fcd_op;		/* Type of update create / unlink / link */
> +	int fcd_parent;		/* Parent inode number */
> +	int fcd_ino;		/* Inode number */
> +	struct qstr fcd_name;	/* Dirent name */
> +	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */
> +	struct list_head fcd_list;
> +};
> +
> +/*
> + * Fast commit reason codes
> + */
> +enum {
> +	/*
> +	 * Commit status codes:
> +	 */
> +	EXT4_FC_REASON_OK = 0,
> +	EXT4_FC_REASON_INELIGIBLE,
> +	EXT4_FC_REASON_ALREADY_COMMITTED,
> +	EXT4_FC_REASON_FC_START_FAILED,
> +	EXT4_FC_REASON_FC_FAILED,
> +
> +	/*
> +	 * Fast commit ineligiblity reasons:
> +	 */
> +	EXT4_FC_REASON_XATTR = 0,
> +	EXT4_FC_REASON_CROSS_RENAME,
> +	EXT4_FC_REASON_JOURNAL_FLAG_CHANGE,
> +	EXT4_FC_REASON_MEM,
> +	EXT4_FC_REASON_SWAP_BOOT,
> +	EXT4_FC_REASON_RESIZE,
> +	EXT4_FC_REASON_RENAME_DIR,
> +	EXT4_FC_REASON_FALLOC_RANGE,
> +	EXT4_FC_COMMIT_FAILED,
> +	EXT4_FC_REASON_MAX
> +};
> +
> +struct ext4_fc_stats {
> +	unsigned int fc_ineligible_reason_count[EXT4_FC_REASON_MAX];
> +	unsigned long fc_num_commits;
> +	unsigned long fc_ineligible_commits;
> +	unsigned long fc_numblks;
> +};
> +
>  #endif /* __FAST_COMMIT_H__ */
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 02ffbd29d6b0..d85412d12e3a 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -260,6 +260,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  	if (iocb->ki_flags & IOCB_NOWAIT)
>  		return -EOPNOTSUPP;
>  
> +	ext4_fc_start_update(inode);
>  	inode_lock(inode);
>  	ret = ext4_write_checks(iocb, from);
>  	if (ret <= 0)
> @@ -271,6 +272,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  
>  out:
>  	inode_unlock(inode);
> +	ext4_fc_stop_update(inode);
>  	if (likely(ret > 0)) {
>  		iocb->ki_pos += ret;
>  		ret = generic_write_sync(iocb, ret);
> @@ -534,7 +536,9 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			goto out;
>  		}
>  
> +		ext4_fc_start_update(inode);
>  		ret = ext4_orphan_add(handle, inode);
> +		ext4_fc_stop_update(inode);

Why is here protected only the orphan addition? What about other changes
happening to the inode during direct write?

>  		if (ret) {
>  			ext4_journal_stop(handle);
>  			goto out;
> @@ -656,8 +660,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  #endif
>  	if (iocb->ki_flags & IOCB_DIRECT)
>  		return ext4_dio_write_iter(iocb, from);
> -
> -	return ext4_buffered_write_iter(iocb, from);
> +	else
> +		return ext4_buffered_write_iter(iocb, from);

Why this change?

>  }
>  
>  #ifdef CONFIG_FS_DAX
> @@ -757,6 +761,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>  	if (!daxdev_mapping_supported(vma, dax_dev))
>  		return -EOPNOTSUPP;
>  
> +	ext4_fc_start_update(inode);
>  	file_accessed(file);

Uh, is this ext4_fc_start_update() for the file_accessed() call? What about
all the other inode timestamp updates? I'd say handling in ext4_setattr()
should be enough?

Also I don't see anything tracking inode changes due to writes through mmap?
How is that supposed to work?

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 4/9] jbd2: add fast commit machinery
  2020-10-22 10:16   ` Jan Kara
@ 2020-10-23 17:17     ` harshad shirwadkar
  2020-10-26  9:03       ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: harshad shirwadkar @ 2020-10-23 17:17 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4 Developers List, Theodore Y. Ts'o

Thanks Jan for reviewing the patches.

On Thu, Oct 22, 2020 at 3:16 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 15-10-20 13:37:56, Harshad Shirwadkar wrote:
> > This functions adds necessary APIs needed in JBD2 layer for fast
> > commits.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > ---
> >  fs/ext4/fast_commit.c |   8 ++
> >  fs/jbd2/commit.c      |  44 ++++++++++
> >  fs/jbd2/journal.c     | 190 +++++++++++++++++++++++++++++++++++++++++-
> >  include/linux/jbd2.h  |  27 ++++++
> >  4 files changed, 268 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > index 0dad8bdb1253..f2d11b4c6b62 100644
> > --- a/fs/ext4/fast_commit.c
> > +++ b/fs/ext4/fast_commit.c
> > @@ -8,11 +8,19 @@
> >   * Ext4 fast commits routines.
> >   */
> >  #include "ext4_jbd2.h"
> > +/*
> > + * Fast commit cleanup routine. This is called after every fast commit and
> > + * full commit. full is true if we are called after a full commit.
> > + */
> > +static void ext4_fc_cleanup(journal_t *journal, int full)
> > +{
> > +}
> >
> >  void ext4_fc_init(struct super_block *sb, journal_t *journal)
> >  {
> >       if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
> >               return;
> > +     journal->j_fc_cleanup_callback = ext4_fc_cleanup;
> >       if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
> >               pr_warn("Error while enabling fast commits, turning off.");
> >               ext4_clear_feature_fast_commit(sb);
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 6252b4c50666..fa688e163a80 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -206,6 +206,30 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
> >       return generic_writepages(mapping, &wbc);
> >  }
> >
> > +/* Send all the data buffers related to an inode */
> > +int jbd2_submit_inode_data(struct jbd2_inode *jinode)
> > +{
> > +
> > +     if (!jinode || !(jinode->i_flags & JI_WRITE_DATA))
> > +             return 0;
> > +
> > +     trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> > +     return jbd2_journal_submit_inode_data_buffers(jinode);
> > +
> > +}
> > +EXPORT_SYMBOL(jbd2_submit_inode_data);
> > +
> > +int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> > +{
> > +     if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
> > +             !jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
> > +             return 0;
> > +     return filemap_fdatawait_range_keep_errors(
> > +             jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start,
> > +             jinode->i_dirty_end);
> > +}
> > +EXPORT_SYMBOL(jbd2_wait_inode_data);
> > +
> >  /*
> >   * Submit all the data buffers of inode associated with the transaction to
> >   * disk.
> > @@ -415,6 +439,20 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >       J_ASSERT(journal->j_running_transaction != NULL);
> >       J_ASSERT(journal->j_committing_transaction == NULL);
> >
> > +     write_lock(&journal->j_state_lock);
> > +     journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
> > +     while (journal->j_flags & JBD2_FAST_COMMIT_ONGOING) {
> > +             DEFINE_WAIT(wait);
> > +
> > +             prepare_to_wait(&journal->j_fc_wait, &wait,
> > +                             TASK_UNINTERRUPTIBLE);
> > +             write_unlock(&journal->j_state_lock);
> > +             schedule();
> > +             write_lock(&journal->j_state_lock);
> > +             finish_wait(&journal->j_fc_wait, &wait);
> > +     }
> > +     write_unlock(&journal->j_state_lock);
>
> Hum, I'd like to understand: Is there a reason to block fastcommits already
> when the running transaction is in T_LOCKED state? Strictly speaking it is
> necessary only once we get to T_FLUSH state AFAIU (because only then we
> start to write transaction to the journal). I guess there are both
> advantages and disadvantages to it - if we allowed fastcommits running in
> T_LOCKED state, we could lower fsync() latency more. OTOH it could increase
> commit latency because we'd have to wait for fastcommits after T_LOCKED
> state.
That's right. I thought given that the transaction is anyway entering
locked state, might as well wait for it to complete instead of writing
blocks that are going to be obsoleted immediately. Also note that this
full commit could have started due to fast commits being in ineligible
state. If that's the case, the fast commit code will realize that it
can't do much and it will again wait for a full commit. So, even
though there is a fsync latency benefit to waiting till T_FLUSH, I'd
still marginally prefer blocking fast commits once the transaction
enters T_LOCKED state.
>
> Another option is to just block new fast commits at the beginning of
> T_LOCKED state and wait for running fastcommits at the end of T_LOCKED
> state. That way waiting for outstanding handles and waiting for fastcommits
> would be running in parallel and we'd reduce the latency...
This is a good idea! I'll add that TODO item in the code.
>
> Also I'm not sure JBD2_FULL_COMMIT_ONGOING is really needed. I understand
> it is handy at this point but longer term, I'd find it more maintainable if
> we just had a helper function jbd2_fastcommit_allowed() (or whatever) that
> will check journal state and based on presence and state of committing
> transaction return whether fastcommits are allowed or not...
Makes sense.
>
> > +
> >       commit_transaction = journal->j_running_transaction;
> >
> >       trace_jbd2_start_commit(journal, commit_transaction);
> > @@ -422,6 +460,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >                       commit_transaction->t_tid);
> >
> >       write_lock(&journal->j_state_lock);
> > +     journal->j_fc_off = 0;
> >       J_ASSERT(commit_transaction->t_state == T_RUNNING);
> >       commit_transaction->t_state = T_LOCKED;
> >
> > @@ -1121,12 +1160,16 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >
> >       if (journal->j_commit_callback)
> >               journal->j_commit_callback(journal, commit_transaction);
> > +     if (journal->j_fc_cleanup_callback)
> > +             journal->j_fc_cleanup_callback(journal, 1);
> >
> >       trace_jbd2_end_commit(journal, commit_transaction);
> >       jbd_debug(1, "JBD2: commit %d complete, head %d\n",
> >                 journal->j_commit_sequence, journal->j_tail_sequence);
> >
> >       write_lock(&journal->j_state_lock);
> > +     journal->j_flags &= ~JBD2_FULL_COMMIT_ONGOING;
> > +     journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
> >       spin_lock(&journal->j_list_lock);
> >       commit_transaction->t_state = T_FINISHED;
> >       /* Check if the transaction can be dropped now that we are finished */
> > @@ -1138,6 +1181,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >       spin_unlock(&journal->j_list_lock);
> >       write_unlock(&journal->j_state_lock);
> >       wake_up(&journal->j_wait_done_commit);
> > +     wake_up(&journal->j_fc_wait);
> >
> >       /*
> >        * Calculate overall stats
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 4497bfbac527..0c7c42bd530f 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -159,7 +159,9 @@ static void commit_timeout(struct timer_list *t)
> >   *
> >   * 1) COMMIT:  Every so often we need to commit the current state of the
> >   *    filesystem to disk.  The journal thread is responsible for writing
> > - *    all of the metadata buffers to disk.
> > + *    all of the metadata buffers to disk. If a fast commit is ongoing
> > + *    journal thread waits until it's done and then continues from
> > + *    there on.
> >   *
> >   * 2) CHECKPOINT: We cannot reuse a used section of the log file until all
> >   *    of the data in that part of the log has been rewritten elsewhere on
> > @@ -716,6 +718,75 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> >       return err;
> >  }
> >
> > +/*
> > + * Start a fast commit. If there's an ongoing fast or full commit wait for
> > + * it to complete. Returns 0 if a new fast commit was started. Returns -EALREADY
> > + * if a fast commit is not needed, either because there's an already a commit
> > + * going on or this tid has already been committed. Returns -EINVAL if no jbd2
> > + * commit has yet been performed.
> > + */
> > +int jbd2_fc_begin_commit(journal_t *journal, tid_t tid)
> > +{
> > +     /*
> > +      * Fast commits only allowed if at least one full commit has
> > +      * been processed.
> > +      */
> > +     if (!journal->j_stats.ts_tid)
> > +             return -EINVAL;
> > +
> > +     if (tid <= journal->j_commit_sequence)
> > +             return -EALREADY;
>
> This check is racy and possibly using stale value of j_commit_sequence
> since j_commit_sequence needs j_state_lock for reliable reading.
Ack, I'll fix this.
>
> > +
> > +     write_lock(&journal->j_state_lock);
> > +     if (journal->j_flags & JBD2_FULL_COMMIT_ONGOING ||
> > +         (journal->j_flags & JBD2_FAST_COMMIT_ONGOING)) {
> > +             DEFINE_WAIT(wait);
> > +
> > +             prepare_to_wait(&journal->j_fc_wait, &wait,
> > +                             TASK_UNINTERRUPTIBLE);
> > +             write_unlock(&journal->j_state_lock);
> > +             schedule();
> > +             finish_wait(&journal->j_fc_wait, &wait);
> > +             return -EALREADY;
> > +     }
> > +     journal->j_flags |= JBD2_FAST_COMMIT_ONGOING;
> > +     write_unlock(&journal->j_state_lock);
> > +
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL(jbd2_fc_begin_commit);
> > +
> > +/*
> > + * Stop a fast commit. If fallback is set, this function starts commit of
> > + * TID tid before any other fast commit can start.
> > + */
> > +static int __jbd2_fc_end_commit(journal_t *journal, tid_t tid, bool fallback)
> > +{
> > +     if (journal->j_fc_cleanup_callback)
> > +             journal->j_fc_cleanup_callback(journal, 0);
> > +     write_lock(&journal->j_state_lock);
> > +     journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
> > +     if (fallback)
> > +             journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
> > +     write_unlock(&journal->j_state_lock);
> > +     wake_up(&journal->j_fc_wait);
> > +     if (fallback)
> > +             return jbd2_complete_transaction(journal, tid);
> > +     return 0;
> > +}
> > +
> > +int jbd2_fc_end_commit(journal_t *journal)
> > +{
> > +     return __jbd2_fc_end_commit(journal, 0, 0);
>
> 'fallback' is bool so please use true / false for it.
Ack
>
> > +}
> > +EXPORT_SYMBOL(jbd2_fc_end_commit);
> > +
> > +int jbd2_fc_end_commit_fallback(journal_t *journal, tid_t tid)
> > +{
> > +     return __jbd2_fc_end_commit(journal, tid, 1);
> > +}
> > +EXPORT_SYMBOL(jbd2_fc_end_commit_fallback);
> > +
>
> Is there a need for 'tid' here? Once jbd2_fc_begin_commit() sets
> JBD2_FAST_COMMIT_ONGOING normal commit cannot proceed so when we decide we
> cannot do fastcommit in the end, we know the transaction that needs to
> commit is the currently running transaction, so we can fetch its TID from
> the journal once we hold j_state_lock before clearing
> JBD2_FAST_COMMIT_ONGOING. Cannot we?
>
> >  /* Return 1 when transaction with given tid has already committed. */
> >  int jbd2_transaction_committed(journal_t *journal, tid_t tid)
> >  {
> > @@ -784,6 +855,110 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
> >       return jbd2_journal_bmap(journal, blocknr, retp);
> >  }
> >
> > +/* Map one fast commit buffer for use by the file system */
> > +int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out)
> > +{
> > +     unsigned long long pblock;
> > +     unsigned long blocknr;
> > +     int ret = 0;
> > +     struct buffer_head *bh;
> > +     int fc_off;
> > +
> > +     *bh_out = NULL;
> > +     write_lock(&journal->j_state_lock);
> > +
> > +     if (journal->j_fc_off + journal->j_fc_first < journal->j_fc_last) {
> > +             fc_off = journal->j_fc_off;
> > +             blocknr = journal->j_fc_first + fc_off;
> > +             journal->j_fc_off++;
> > +     } else {
> > +             ret = -EINVAL;
> > +     }
> > +     write_unlock(&journal->j_state_lock);
>
> Is j_state_lock really needed here? There is always only one process doing
> fastcommit so nobody else should be touching j_fc_off and other fields. Or
> am I missing something?
You are right, there should only be one process calling
jbd2_fc_get_buf. I'll fix this.
>
> > +
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = jbd2_journal_bmap(journal, blocknr, &pblock);
> > +     if (ret)
> > +             return ret;
> > +
> > +     bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
> > +     if (!bh)
> > +             return -ENOMEM;
> > +
> > +     lock_buffer(bh);
> > +
> > +     clear_buffer_uptodate(bh);
> > +     set_buffer_dirty(bh);
>
> Uh, that's a weird state to leave buffer in (!uptodate & dirty). Flush
> worker could spot such buffer and try to write it out, which would blow
> up... I wouldn't touch the buffer state here, once proper content is
> filled, I'd mark the buffer as uptodate & dirty. That's how buffer state is
> usually managed.
Ack.
>
> > +     unlock_buffer(bh);
> > +     journal->j_fc_wbuf[fc_off] = bh;
> > +
> > +     *bh_out = bh;
> > +
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL(jbd2_fc_get_buf);
> > +
> > +/*
> > + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
> > + * for completion.
> > + */
> > +int jbd2_fc_wait_bufs(journal_t *journal, int num_blks)
> > +{
> > +     struct buffer_head *bh;
> > +     int i, j_fc_off;
> > +
> > +     read_lock(&journal->j_state_lock);
> > +     j_fc_off = journal->j_fc_off;
> > +     read_unlock(&journal->j_state_lock);
>
> Same comment regarding j_state_lock as for jbd2_fc_get_buf().
>
> > +
> > +     /*
> > +      * Wait in reverse order to minimize chances of us being woken up before
> > +      * all IOs have completed
> > +      */
> > +     for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
> > +             bh = journal->j_fc_wbuf[i];
> > +             wait_on_buffer(bh);
> > +             put_bh(bh);
> > +             journal->j_fc_wbuf[i] = NULL;
> > +             if (unlikely(!buffer_uptodate(bh)))
> > +                     return -EIO;
> > +     }
> > +
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL(jbd2_fc_wait_bufs);
> > +
> > +/*
> > + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
> > + * for completion.
> > + */
> > +int jbd2_fc_release_bufs(journal_t *journal)
> > +{
> > +     struct buffer_head *bh;
> > +     int i, j_fc_off;
> > +
> > +     read_lock(&journal->j_state_lock);
> > +     j_fc_off = journal->j_fc_off;
> > +     read_unlock(&journal->j_state_lock);
> > +
> > +     /*
> > +      * Wait in reverse order to minimize chances of us being woken up before
> > +      * all IOs have completed
> > +      */
> > +     for (i = j_fc_off - 1; i >= 0; i--) {
> > +             bh = journal->j_fc_wbuf[i];
> > +             if (!bh)
> > +                     break;
> > +             put_bh(bh);
> > +             journal->j_fc_wbuf[i] = NULL;
> > +     }
> > +
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL(jbd2_fc_release_bufs);
> > +
>
> I kind of wonder if releasing of buffers shouldn't be done automatically
> either as part of jbd2_fc_wait_bufs() or when ending fastcommit. But I
> don't have a strong opinion so this is just an idea for consideration.
So, that's what I do. The buffers get released in jbd2_fc_wait_bufs().
However, in case of errors or fallback to full commits, buffers may
not be submitted and thus won't be released. So this function is to
release all the unsubmitted buffers. This gets called from the cleanup
callback which is called after every successful or failed full commit
or fast commit.

Thanks,
Harshad
>
>                                                                 Honza
>
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 4/9] jbd2: add fast commit machinery
  2020-10-23 17:17     ` harshad shirwadkar
@ 2020-10-26  9:03       ` Jan Kara
  2020-10-26 16:34         ` harshad shirwadkar
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2020-10-26  9:03 UTC (permalink / raw)
  To: harshad shirwadkar; +Cc: Jan Kara, Ext4 Developers List, Theodore Y. Ts'o

On Fri 23-10-20 10:17:18, harshad shirwadkar wrote:
> Thanks Jan for reviewing the patches.

You're welcome. Rather I'm sorry that I've got to that after so long time.

> On Thu, Oct 22, 2020 at 3:16 AM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 15-10-20 13:37:56, Harshad Shirwadkar wrote:
> > > This functions adds necessary APIs needed in JBD2 layer for fast
> > > commits.
> > >
> > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > > ---
> > >  fs/ext4/fast_commit.c |   8 ++
> > >  fs/jbd2/commit.c      |  44 ++++++++++
> > >  fs/jbd2/journal.c     | 190 +++++++++++++++++++++++++++++++++++++++++-
> > >  include/linux/jbd2.h  |  27 ++++++
> > >  4 files changed, 268 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > > index 0dad8bdb1253..f2d11b4c6b62 100644
> > > --- a/fs/ext4/fast_commit.c
> > > +++ b/fs/ext4/fast_commit.c
> > > @@ -8,11 +8,19 @@
> > >   * Ext4 fast commits routines.
> > >   */
> > >  #include "ext4_jbd2.h"
> > > +/*
> > > + * Fast commit cleanup routine. This is called after every fast commit and
> > > + * full commit. full is true if we are called after a full commit.
> > > + */
> > > +static void ext4_fc_cleanup(journal_t *journal, int full)
> > > +{
> > > +}
> > >
> > >  void ext4_fc_init(struct super_block *sb, journal_t *journal)
> > >  {
> > >       if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
> > >               return;
> > > +     journal->j_fc_cleanup_callback = ext4_fc_cleanup;
> > >       if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
> > >               pr_warn("Error while enabling fast commits, turning off.");
> > >               ext4_clear_feature_fast_commit(sb);
> > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > > index 6252b4c50666..fa688e163a80 100644
> > > --- a/fs/jbd2/commit.c
> > > +++ b/fs/jbd2/commit.c
> > > @@ -206,6 +206,30 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
> > >       return generic_writepages(mapping, &wbc);
> > >  }
> > >
> > > +/* Send all the data buffers related to an inode */
> > > +int jbd2_submit_inode_data(struct jbd2_inode *jinode)
> > > +{
> > > +
> > > +     if (!jinode || !(jinode->i_flags & JI_WRITE_DATA))
> > > +             return 0;
> > > +
> > > +     trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> > > +     return jbd2_journal_submit_inode_data_buffers(jinode);
> > > +
> > > +}
> > > +EXPORT_SYMBOL(jbd2_submit_inode_data);
> > > +
> > > +int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> > > +{
> > > +     if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
> > > +             !jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
> > > +             return 0;
> > > +     return filemap_fdatawait_range_keep_errors(
> > > +             jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start,
> > > +             jinode->i_dirty_end);
> > > +}
> > > +EXPORT_SYMBOL(jbd2_wait_inode_data);
> > > +
> > >  /*
> > >   * Submit all the data buffers of inode associated with the transaction to
> > >   * disk.
> > > @@ -415,6 +439,20 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > >       J_ASSERT(journal->j_running_transaction != NULL);
> > >       J_ASSERT(journal->j_committing_transaction == NULL);
> > >
> > > +     write_lock(&journal->j_state_lock);
> > > +     journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
> > > +     while (journal->j_flags & JBD2_FAST_COMMIT_ONGOING) {
> > > +             DEFINE_WAIT(wait);
> > > +
> > > +             prepare_to_wait(&journal->j_fc_wait, &wait,
> > > +                             TASK_UNINTERRUPTIBLE);
> > > +             write_unlock(&journal->j_state_lock);
> > > +             schedule();
> > > +             write_lock(&journal->j_state_lock);
> > > +             finish_wait(&journal->j_fc_wait, &wait);
> > > +     }
> > > +     write_unlock(&journal->j_state_lock);
> >
> > Hum, I'd like to understand: Is there a reason to block fastcommits already
> > when the running transaction is in T_LOCKED state? Strictly speaking it is
> > necessary only once we get to T_FLUSH state AFAIU (because only then we
> > start to write transaction to the journal). I guess there are both
> > advantages and disadvantages to it - if we allowed fastcommits running in
> > T_LOCKED state, we could lower fsync() latency more. OTOH it could increase
> > commit latency because we'd have to wait for fastcommits after T_LOCKED
> > state.
> That's right. I thought given that the transaction is anyway entering
> locked state, might as well wait for it to complete instead of writing
> blocks that are going to be obsoleted immediately. Also note that this
> full commit could have started due to fast commits being in ineligible
> state. If that's the case, the fast commit code will realize that it
> can't do much and it will again wait for a full commit. So, even
> though there is a fsync latency benefit to waiting till T_FLUSH, I'd
> still marginally prefer blocking fast commits once the transaction
> enters T_LOCKED state.

OK, makes sence. We can always change it if we find good performance
reasons for other choice.

> > > +}
> > > +EXPORT_SYMBOL(jbd2_fc_end_commit);
> > > +
> > > +int jbd2_fc_end_commit_fallback(journal_t *journal, tid_t tid)
> > > +{
> > > +     return __jbd2_fc_end_commit(journal, tid, 1);
> > > +}
> > > +EXPORT_SYMBOL(jbd2_fc_end_commit_fallback);
> > > +
> >
> > Is there a need for 'tid' here? Once jbd2_fc_begin_commit() sets
> > JBD2_FAST_COMMIT_ONGOING normal commit cannot proceed so when we decide we
> > cannot do fastcommit in the end, we know the transaction that needs to
> > commit is the currently running transaction, so we can fetch its TID from
> > the journal once we hold j_state_lock before clearing
> > JBD2_FAST_COMMIT_ONGOING. Cannot we?

Did you miss this comment?

> > >  /* Return 1 when transaction with given tid has already committed. */
> > >  int jbd2_transaction_committed(journal_t *journal, tid_t tid)
> > >  {
> > > @@ -784,6 +855,110 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
> > >       return jbd2_journal_bmap(journal, blocknr, retp);
> > >  }
> > >
> > > +/* Map one fast commit buffer for use by the file system */
> > > +int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out)
> > > +{
> > > +     unsigned long long pblock;
> > > +     unsigned long blocknr;
> > > +     int ret = 0;
> > > +     struct buffer_head *bh;
> > > +     int fc_off;
> > > +
> > > +     *bh_out = NULL;
> > > +     write_lock(&journal->j_state_lock);
> > > +
> > > +     if (journal->j_fc_off + journal->j_fc_first < journal->j_fc_last) {
> > > +             fc_off = journal->j_fc_off;
> > > +             blocknr = journal->j_fc_first + fc_off;
> > > +             journal->j_fc_off++;
> > > +     } else {
> > > +             ret = -EINVAL;
> > > +     }
> > > +     write_unlock(&journal->j_state_lock);
> >
> > Is j_state_lock really needed here? There is always only one process doing
> > fastcommit so nobody else should be touching j_fc_off and other fields. Or
> > am I missing something?
> You are right, there should only be one process calling
> jbd2_fc_get_buf. I'll fix this.

Maybe add a comment to j_fc_off & co. that they are not protected by any
lock - only by the fact that there's always only a single process doing
fastcommit.

> > > +
> > > +     /*
> > > +      * Wait in reverse order to minimize chances of us being woken up before
> > > +      * all IOs have completed
> > > +      */
> > > +     for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
> > > +             bh = journal->j_fc_wbuf[i];
> > > +             wait_on_buffer(bh);
> > > +             put_bh(bh);
> > > +             journal->j_fc_wbuf[i] = NULL;
> > > +             if (unlikely(!buffer_uptodate(bh)))
> > > +                     return -EIO;
> > > +     }
> > > +
> > > +     return 0;
> > > +}
> > > +EXPORT_SYMBOL(jbd2_fc_wait_bufs);
> > > +
> > > +/*
> > > + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
> > > + * for completion.
> > > + */
> > > +int jbd2_fc_release_bufs(journal_t *journal)
> > > +{
> > > +     struct buffer_head *bh;
> > > +     int i, j_fc_off;
> > > +
> > > +     read_lock(&journal->j_state_lock);
> > > +     j_fc_off = journal->j_fc_off;
> > > +     read_unlock(&journal->j_state_lock);
> > > +
> > > +     /*
> > > +      * Wait in reverse order to minimize chances of us being woken up before
> > > +      * all IOs have completed
> > > +      */
> > > +     for (i = j_fc_off - 1; i >= 0; i--) {
> > > +             bh = journal->j_fc_wbuf[i];
> > > +             if (!bh)
> > > +                     break;
> > > +             put_bh(bh);
> > > +             journal->j_fc_wbuf[i] = NULL;
> > > +     }
> > > +
> > > +     return 0;
> > > +}
> > > +EXPORT_SYMBOL(jbd2_fc_release_bufs);
> > > +
> >
> > I kind of wonder if releasing of buffers shouldn't be done automatically
> > either as part of jbd2_fc_wait_bufs() or when ending fastcommit. But I
> > don't have a strong opinion so this is just an idea for consideration.
> So, that's what I do. The buffers get released in jbd2_fc_wait_bufs().
> However, in case of errors or fallback to full commits, buffers may
> not be submitted and thus won't be released. So this function is to
> release all the unsubmitted buffers. This gets called from the cleanup
> callback which is called after every successful or failed full commit
> or fast commit.

Aha, I missed that. Thanks for explanation.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 4/9] jbd2: add fast commit machinery
  2020-10-26  9:03       ` Jan Kara
@ 2020-10-26 16:34         ` harshad shirwadkar
  0 siblings, 0 replies; 33+ messages in thread
From: harshad shirwadkar @ 2020-10-26 16:34 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4 Developers List, Theodore Y. Ts'o

On Mon, Oct 26, 2020 at 2:03 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 23-10-20 10:17:18, harshad shirwadkar wrote:
> > Thanks Jan for reviewing the patches.
>
> You're welcome. Rather I'm sorry that I've got to that after so long time.
>
> > On Thu, Oct 22, 2020 at 3:16 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 15-10-20 13:37:56, Harshad Shirwadkar wrote:
> > > > This functions adds necessary APIs needed in JBD2 layer for fast
> > > > commits.
> > > >
> > > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > > > ---
> > > >  fs/ext4/fast_commit.c |   8 ++
> > > >  fs/jbd2/commit.c      |  44 ++++++++++
> > > >  fs/jbd2/journal.c     | 190 +++++++++++++++++++++++++++++++++++++++++-
> > > >  include/linux/jbd2.h  |  27 ++++++
> > > >  4 files changed, 268 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > > > index 0dad8bdb1253..f2d11b4c6b62 100644
> > > > --- a/fs/ext4/fast_commit.c
> > > > +++ b/fs/ext4/fast_commit.c
> > > > @@ -8,11 +8,19 @@
> > > >   * Ext4 fast commits routines.
> > > >   */
> > > >  #include "ext4_jbd2.h"
> > > > +/*
> > > > + * Fast commit cleanup routine. This is called after every fast commit and
> > > > + * full commit. full is true if we are called after a full commit.
> > > > + */
> > > > +static void ext4_fc_cleanup(journal_t *journal, int full)
> > > > +{
> > > > +}
> > > >
> > > >  void ext4_fc_init(struct super_block *sb, journal_t *journal)
> > > >  {
> > > >       if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
> > > >               return;
> > > > +     journal->j_fc_cleanup_callback = ext4_fc_cleanup;
> > > >       if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
> > > >               pr_warn("Error while enabling fast commits, turning off.");
> > > >               ext4_clear_feature_fast_commit(sb);
> > > > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > > > index 6252b4c50666..fa688e163a80 100644
> > > > --- a/fs/jbd2/commit.c
> > > > +++ b/fs/jbd2/commit.c
> > > > @@ -206,6 +206,30 @@ int jbd2_journal_submit_inode_data_buffers(struct jbd2_inode *jinode)
> > > >       return generic_writepages(mapping, &wbc);
> > > >  }
> > > >
> > > > +/* Send all the data buffers related to an inode */
> > > > +int jbd2_submit_inode_data(struct jbd2_inode *jinode)
> > > > +{
> > > > +
> > > > +     if (!jinode || !(jinode->i_flags & JI_WRITE_DATA))
> > > > +             return 0;
> > > > +
> > > > +     trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> > > > +     return jbd2_journal_submit_inode_data_buffers(jinode);
> > > > +
> > > > +}
> > > > +EXPORT_SYMBOL(jbd2_submit_inode_data);
> > > > +
> > > > +int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> > > > +{
> > > > +     if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
> > > > +             !jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
> > > > +             return 0;
> > > > +     return filemap_fdatawait_range_keep_errors(
> > > > +             jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start,
> > > > +             jinode->i_dirty_end);
> > > > +}
> > > > +EXPORT_SYMBOL(jbd2_wait_inode_data);
> > > > +
> > > >  /*
> > > >   * Submit all the data buffers of inode associated with the transaction to
> > > >   * disk.
> > > > @@ -415,6 +439,20 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> > > >       J_ASSERT(journal->j_running_transaction != NULL);
> > > >       J_ASSERT(journal->j_committing_transaction == NULL);
> > > >
> > > > +     write_lock(&journal->j_state_lock);
> > > > +     journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
> > > > +     while (journal->j_flags & JBD2_FAST_COMMIT_ONGOING) {
> > > > +             DEFINE_WAIT(wait);
> > > > +
> > > > +             prepare_to_wait(&journal->j_fc_wait, &wait,
> > > > +                             TASK_UNINTERRUPTIBLE);
> > > > +             write_unlock(&journal->j_state_lock);
> > > > +             schedule();
> > > > +             write_lock(&journal->j_state_lock);
> > > > +             finish_wait(&journal->j_fc_wait, &wait);
> > > > +     }
> > > > +     write_unlock(&journal->j_state_lock);
> > >
> > > Hum, I'd like to understand: Is there a reason to block fastcommits already
> > > when the running transaction is in T_LOCKED state? Strictly speaking it is
> > > necessary only once we get to T_FLUSH state AFAIU (because only then we
> > > start to write transaction to the journal). I guess there are both
> > > advantages and disadvantages to it - if we allowed fastcommits running in
> > > T_LOCKED state, we could lower fsync() latency more. OTOH it could increase
> > > commit latency because we'd have to wait for fastcommits after T_LOCKED
> > > state.
> > That's right. I thought given that the transaction is anyway entering
> > locked state, might as well wait for it to complete instead of writing
> > blocks that are going to be obsoleted immediately. Also note that this
> > full commit could have started due to fast commits being in ineligible
> > state. If that's the case, the fast commit code will realize that it
> > can't do much and it will again wait for a full commit. So, even
> > though there is a fsync latency benefit to waiting till T_FLUSH, I'd
> > still marginally prefer blocking fast commits once the transaction
> > enters T_LOCKED state.
>
> OK, makes sence. We can always change it if we find good performance
> reasons for other choice.
Sounds good!
>
> > > > +}
> > > > +EXPORT_SYMBOL(jbd2_fc_end_commit);
> > > > +
> > > > +int jbd2_fc_end_commit_fallback(journal_t *journal, tid_t tid)
> > > > +{
> > > > +     return __jbd2_fc_end_commit(journal, tid, 1);
> > > > +}
> > > > +EXPORT_SYMBOL(jbd2_fc_end_commit_fallback);
> > > > +
> > >
> > > Is there a need for 'tid' here? Once jbd2_fc_begin_commit() sets
> > > JBD2_FAST_COMMIT_ONGOING normal commit cannot proceed so when we decide we
> > > cannot do fastcommit in the end, we know the transaction that needs to
> > > commit is the currently running transaction, so we can fetch its TID from
> > > the journal once we hold j_state_lock before clearing
> > > JBD2_FAST_COMMIT_ONGOING. Cannot we?
>
> Did you miss this comment?
Oops. Yeah, I did. Sorry for that. Yes, when we call this function, we
know that the TID that needs to be committed is the current running
transaction. I'll fix this.
>
> > > >  /* Return 1 when transaction with given tid has already committed. */
> > > >  int jbd2_transaction_committed(journal_t *journal, tid_t tid)
> > > >  {
> > > > @@ -784,6 +855,110 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
> > > >       return jbd2_journal_bmap(journal, blocknr, retp);
> > > >  }
> > > >
> > > > +/* Map one fast commit buffer for use by the file system */
> > > > +int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out)
> > > > +{
> > > > +     unsigned long long pblock;
> > > > +     unsigned long blocknr;
> > > > +     int ret = 0;
> > > > +     struct buffer_head *bh;
> > > > +     int fc_off;
> > > > +
> > > > +     *bh_out = NULL;
> > > > +     write_lock(&journal->j_state_lock);
> > > > +
> > > > +     if (journal->j_fc_off + journal->j_fc_first < journal->j_fc_last) {
> > > > +             fc_off = journal->j_fc_off;
> > > > +             blocknr = journal->j_fc_first + fc_off;
> > > > +             journal->j_fc_off++;
> > > > +     } else {
> > > > +             ret = -EINVAL;
> > > > +     }
> > > > +     write_unlock(&journal->j_state_lock);
> > >
> > > Is j_state_lock really needed here? There is always only one process doing
> > > fastcommit so nobody else should be touching j_fc_off and other fields. Or
> > > am I missing something?
> > You are right, there should only be one process calling
> > jbd2_fc_get_buf. I'll fix this.
>
> Maybe add a comment to j_fc_off & co. that they are not protected by any
> lock - only by the fact that there's always only a single process doing
> fastcommit.
Ack

Thanks,
Harshad

>
> > > > +
> > > > +     /*
> > > > +      * Wait in reverse order to minimize chances of us being woken up before
> > > > +      * all IOs have completed
> > > > +      */
> > > > +     for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
> > > > +             bh = journal->j_fc_wbuf[i];
> > > > +             wait_on_buffer(bh);
> > > > +             put_bh(bh);
> > > > +             journal->j_fc_wbuf[i] = NULL;
> > > > +             if (unlikely(!buffer_uptodate(bh)))
> > > > +                     return -EIO;
> > > > +     }
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(jbd2_fc_wait_bufs);
> > > > +
> > > > +/*
> > > > + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
> > > > + * for completion.
> > > > + */
> > > > +int jbd2_fc_release_bufs(journal_t *journal)
> > > > +{
> > > > +     struct buffer_head *bh;
> > > > +     int i, j_fc_off;
> > > > +
> > > > +     read_lock(&journal->j_state_lock);
> > > > +     j_fc_off = journal->j_fc_off;
> > > > +     read_unlock(&journal->j_state_lock);
> > > > +
> > > > +     /*
> > > > +      * Wait in reverse order to minimize chances of us being woken up before
> > > > +      * all IOs have completed
> > > > +      */
> > > > +     for (i = j_fc_off - 1; i >= 0; i--) {
> > > > +             bh = journal->j_fc_wbuf[i];
> > > > +             if (!bh)
> > > > +                     break;
> > > > +             put_bh(bh);
> > > > +             journal->j_fc_wbuf[i] = NULL;
> > > > +     }
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +EXPORT_SYMBOL(jbd2_fc_release_bufs);
> > > > +
> > >
> > > I kind of wonder if releasing of buffers shouldn't be done automatically
> > > either as part of jbd2_fc_wait_bufs() or when ending fastcommit. But I
> > > don't have a strong opinion so this is just an idea for consideration.
> > So, that's what I do. The buffers get released in jbd2_fc_wait_bufs().
> > However, in case of errors or fallback to full commits, buffers may
> > not be submitted and thus won't be released. So this function is to
> > release all the unsubmitted buffers. This gets called from the cleanup
> > callback which is called after every successful or failed full commit
> > or fast commit.
>
> Aha, I missed that. Thanks for explanation.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 2/9] ext4: add fast_commit feature and handling for extended mount options
  2020-10-22 13:09       ` Jan Kara
@ 2020-10-26 16:40         ` harshad shirwadkar
  0 siblings, 0 replies; 33+ messages in thread
From: harshad shirwadkar @ 2020-10-26 16:40 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4 Developers List, Theodore Y. Ts'o

Will do, thanks!

On Thu, Oct 22, 2020 at 6:09 AM Jan Kara <jack@suse.cz> wrote:
>
> On Wed 21-10-20 10:31:48, harshad shirwadkar wrote:
> > On Wed, Oct 21, 2020 at 9:18 AM Jan Kara <jack@suse.cz> wrote:
> > >
> > > On Thu 15-10-20 13:37:54, Harshad Shirwadkar wrote:
> > > > We are running out of mount option bits. Add handling for using
> > > > s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
> > > > ability to turn off the fast commit feature in Ext4.
> > > >
> > > > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > > > ---
> > > >  fs/ext4/ext4.h       |  4 ++++
> > > >  fs/ext4/super.c      | 27 ++++++++++++++++++++++-----
> > > >  include/linux/jbd2.h |  5 ++++-
> > > >  3 files changed, 30 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > > > index 1879531a119f..02d7dc378505 100644
> > > > --- a/fs/ext4/ext4.h
> > > > +++ b/fs/ext4/ext4.h
> > > > @@ -1213,6 +1213,8 @@ struct ext4_inode_info {
> > > >  #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM        0x00000008 /* User explicitly
> > > >                                               specified journal checksum */
> > > >
> > > > +#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT      0x00000010 /* Journal fast commit */
> > > > +
> > > >  #define clear_opt(sb, opt)           EXT4_SB(sb)->s_mount_opt &= \
> > > >                                               ~EXT4_MOUNT_##opt
> > > >  #define set_opt(sb, opt)             EXT4_SB(sb)->s_mount_opt |= \
> > > > @@ -1813,6 +1815,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode)
> > > >  #define EXT4_FEATURE_COMPAT_RESIZE_INODE     0x0010
> > > >  #define EXT4_FEATURE_COMPAT_DIR_INDEX                0x0020
> > > >  #define EXT4_FEATURE_COMPAT_SPARSE_SUPER2    0x0200
> > > > +#define EXT4_FEATURE_COMPAT_FAST_COMMIT              0x0400
> > > >  #define EXT4_FEATURE_COMPAT_STABLE_INODES    0x0800
> > >
> > > Is fast commit really a compat feature? IMO if there are fast commits
> > > stored in the journal, the filesystem is actually incompatible with the
> > > old kernels because data we guranteed to be permanenly stored may be
> > > invisible for the old kernel (since it won't replay fastcommit
> > > transactions).
> > >
> > > ...
> > >
> > > Oh, now I see that the journal FAST_COMMIT is actually incompat. So what's
> > > the point of compat ext4 feature with incompat JBD2 feature?
> > So having fast commits enabled on an ext4 file system doesn't
> > immediately make it incompatible with the older kernels. FS becomes
> > incompatible only if there are fast commits blocks that are stored in
> > the journal. So, one of the tricks that this patchset does is on a
> > clean unmount, since it's guaranteed that there are no fast commit
> > blocks in journal, we clear out the JBD2 incompat flag and preserve
> > the compat flag in ext4. So, we can think of ext4 compat flag as "FS
> > will try fast commits when possible" while jbd2 incompat flag as
> > "There are fast commits blocks present in the journal". Does that make
> > sense?
>
> Yes, understood. That's clever. Thanks for explanation! But please add the
> above justification to the description of EXT4_FEATURE_COMPAT_FAST_COMMIT
> feature or somewhere around that.
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-10-23 10:30   ` Jan Kara
@ 2020-10-26 20:55     ` harshad shirwadkar
  2020-10-27 14:29       ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: harshad shirwadkar @ 2020-10-26 20:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4 Developers List, Theodore Y. Ts'o, kernel test robot

I have reduced the size of the email by trimming off some parts of the
original email. Responses inlined:

On Fri, Oct 23, 2020 at 3:30 AM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 15-10-20 13:37:57, Harshad Shirwadkar wrote:
> > diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> > index 76f634d185f1..68aaed48315f 100644
> >       /*
> >        * i_disksize keeps track of what the inode size is ON DISK, not
> >        * in memory.  During truncate, i_size is set to the new size by
> > @@ -1141,6 +1163,10 @@ struct ext4_inode_info {
> >  #define      EXT4_VALID_FS                   0x0001  /* Unmounted cleanly */
> >  #define      EXT4_ERROR_FS                   0x0002  /* Errors detected */
> >  #define      EXT4_ORPHAN_FS                  0x0004  /* Orphans being recovered */
> > +#define EXT4_FC_INELIGIBLE           0x0008  /* Fast commit ineligible */
> > +#define EXT4_FC_COMMITTING           0x0010  /* File system underoing a fast
>           ^^ please align these as the previous values
> Also the names should have _FS suffix.
Ack
>
> Now after more looking, these are actually used in s_mount_state which is
> persistently stored on disk which is probably not what you want. You rather
> want to use something like sbi->s_mount_flags for these?
Oh, you are right, thanks for catching this. I wanted to use
sbi->s_mount_flags. Will fix this.

>
> And now that I also look at sbi->s_mount_flags, these should use atomic
> bitops as currently they seem to be succeptible to RMW races (e.g. due to
> EXT4_MF_MNTDIR_SAMPLED flag) and your flags also need the atomic behavior.
> That would be a separate patch fixing this.
Right, I'll send out a patch for this.
>
> > +                                              * commit.
> > +                                              */
> >
> >  /*
> >   * Misc. filesystem flags
> > @@ -1613,6 +1639,30 @@ struct ext4_sb_info {
> > + *   ext4_fc_stop_ineligible() to fall back to full commits. It is important to
> > + *   make one more fast commit to fall back to full commit after stop call so
> > + *   that it guaranteed that the fast commit ineligible operation contained
> > + *   within ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() is
> > + *   followed by at least 1 full commit.
> > + *
> > + * Atomicity of commits
> > + * --------------------
> > + * In order to gaurantee atomicity during the commit operation, fast commit
>                   ^^^ guarantee

Ack
>
> > + * uses "EXT4_FC_TAG_TAIL" tag that marks a fast commit as complete. Tail
> > + * tag contains CRC of the contents and TID of the transaction after which
> > + * this fast commit should be applied. Recovery code replays fast commit
> > + * logs only if there's at least 1 valid tail present. For every fast commit
> > + * operation, there is 1 tail. This means, we may end up with multiple tails
> > + * in the fast commit space. Here's an example:
> > + *
> > + * - Create a new file A and remove existing file B
> > + * - fsync()
>
> Great that there's an example here. But what do we fsync here? A or dir with
> A or something else?
Well it doesn't matter, explained below:
>
> > + * - Append contents to file A
> > + * - Truncate file A
> > + * - fsync()
>
> And what is fsynced here?
Same
>
> > + *
> > + * The fast commit space at the end of above operations would look like this:
> > + *      [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL]
> > + *             |<---  Fast Commit 1   --->|<---      Fast Commit 2     ---->|
> > + *
> > + * Replay code should thus check for all the valid tails in the FC area.
>
> And one design question: Why do we record unlink of B here? I was kind of
> hoping that fastcommit due to fsync(A) would record only operations related
> to A. Because the way you wrote it, fast commit is inherently still a
> filesystem-global operation requiring global ordering of metadata changes
> with all the scalability bottlenecks current journalling code has. It's
> faster by some factor due to more efficient packing of "small" changes not
> fundamentally faster AFAICT...
So given that fsync() for Ext4 traditionally resulted in syncing of
all the dirty inodes / buffers. If we fsync() only the file in
question, I'm worried that we may break some of the existing
applications. In the earlier version of the series, I had a
"soft_consistency" mode which did exactly that. It broke a bunch of
xfstests that had this assumption. Also, in my tests I didn't see a
big performance difference between these fast commits and the fast
commits with soft consistency. Most probably, that's because the
benchmarks perform a fsync on all the files and current fast commits
give it a batching effect which soft consistency mode would fail to
provide.

But I'm not fixated on this, I think it's still good to have
soft_consistency mode. Good thing is this doesn't affect the on-disk
format. So, this is something that can be gradually added to Ext4.
>
> > + *
> > + * TODOs
> > + * -----
> > + * 1) Make fast commit atomic updates more fine grained. Today, a fast commit
> > + *    eligible update must be protected within ext4_fc_start_update() and
> > + *    ext4_fc_stop_update(). These routines are called at much higher
> > + *    routines. This can be made more fine grained by combining with
> > + *    ext4_journal_start().
> > + *
> > + * 2) Same above for ext4_fc_start_ineligible() and ext4_fc_stop_ineligible()
> > + *
> > + * 3) Handle more ineligible cases.
> > + */
> > +void ext4_fc_del(struct inode *inode)
> > +{
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +
> > +     if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> > +             return;
> > +
> > +
> > +     if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> > +             return;
>
> Uh, why testing twice?
Oops, will fix this.
>
> > +
> > +restart:
> > +     spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> > +     if (list_empty(&ei->i_fc_list)) {
> > +             spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> > +             return;
> > +     }
> > +
> > +     if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> > +             wait_queue_head_t *wq;
> > +#if (BITS_PER_LONG < 64)
> > +             DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
> > +                             EXT4_STATE_FC_COMMITTING);
> > +             wq = bit_waitqueue(&ei->i_state_flags,
> > +                                EXT4_STATE_FC_COMMITTING);
> > +#else
> > +             DEFINE_WAIT_BIT(wait, &ei->i_flags,
> > +                             EXT4_STATE_FC_COMMITTING);
> > +             wq = bit_waitqueue(&ei->i_flags,
> > +                                EXT4_STATE_FC_COMMITTING);
> > +#endif
>
> Create a helper function for waiting for EXT4_STATE_FC_COMMITTING? It is
> opencoded several times...
Ack
>
> > +             prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> > +             spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> > +             schedule();
> > +             finish_wait(wq, &wait.wq_entry);
> > +             goto restart;
> > +     }
> > +     if (!list_empty(&ei->i_fc_list))
>
> You've checked for list_empty() above, no need to recheck again...
Ack
>
> > +             list_del_init(&ei->i_fc_list);
> > +     spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> > +}
> > +
> > +static int ext4_fc_track_template(
> > +     struct inode *inode, int (*__fc_track_fn)(struct inode *, void *, bool),
> > +     void *args, int enqueue)
> > +{
> > +     tid_t running_txn_tid;
> > +     bool update = false;
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +     struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > +     int ret;
> > +
> > +     if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> > +             return -EOPNOTSUPP;
> > +
> > +     if (ext4_fc_is_ineligible(inode->i_sb))
> > +             return -EINVAL;
> > +
> > +     running_txn_tid = sbi->s_journal ?
> > +             sbi->s_journal->j_commit_sequence + 1 : 0;
>
> This looks problematic. The j_commit_sequence sampling is racy - first
> without j_state_lock you can be fetching stale value, second you don't
> know whether there is transaction currently committing or not. If there is,
> j_commit_sequence will contain TID of the transaction before it which is
> wrong for your purposes. I think you should pass 'handle' into all the
> tracking functions and derive running transaction TID from that as we do it
> elsewhere.
Oh thanks for pointing this out. Okay makes sense, I'll use handle for this.
>
> > +
> > +     mutex_lock(&ei->i_fc_lock);
> > +     if (running_txn_tid == ei->i_sync_tid) {
> > +             update = true;
> > +     } else {
> > +             ext4_fc_reset_inode(inode);
> > +             ei->i_sync_tid = running_txn_tid;
> > +     }
> > +     ret = __fc_track_fn(inode, args, update);
> > +     mutex_unlock(&ei->i_fc_lock);
> > +
> > +     if (!enqueue)
> > +             return ret;
> > +
> > +     spin_lock(&sbi->s_fc_lock);
> > +     if (list_empty(&EXT4_I(inode)->i_fc_list))
> > +             list_add_tail(&EXT4_I(inode)->i_fc_list,
> > +                             (sbi->s_mount_state & EXT4_FC_COMMITTING) ?
> > +                             &sbi->s_fc_q[FC_Q_STAGING] :
> > +                             &sbi->s_fc_q[FC_Q_MAIN]);
> > +     spin_unlock(&sbi->s_fc_lock);
>
> OK, so how do you prevent inode from being freed while it is still on
> i_fc_list? I don't see anything preventing that and it could cause nasty
> use-after-free issues. Note that for similar reasons JBD2 uses external
> separately allocated inode for jbd2_inode so that it can have separate
> lifetime (related to transaction commits) from struct ext4_inode_info.
So, if you see the function ext4_fc_del() above, it's called from
ext4_clear_inode(). What ext4_fc_del() does is that, if the inode is
not being committed, it just removes it from the list. If that inode
was deleted, we have a separate dentry queue which will record the
deletion of the inode, so we don't really need the struct
ext4_inode_info for recording that on-disk. However, if the inode is
being committed (this is figured out by checking the per inode
COMMITTING state), ext4_fc_del() waits until the completion.
>
> > +
> > +     return ret;
> > +}
> > +
> > +struct __track_dentry_update_args {
> > +     struct dentry *dentry;
> > +     int op;
> > +};
> > +
> > +/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */
> > +static void ext4_fc_submit_bh(struct super_block *sb)
> > +{
> > +     int write_flags = REQ_SYNC;
> > +     struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
> > +
> > +     if (test_opt(sb, BARRIER))
> > +             write_flags |= REQ_FUA | REQ_PREFLUSH;
>
> Submitting each fastcommit buffer with REQ_FUA | REQ_PREFLUSH is
> unnecessarily expensive (especially if there will be unrelated writes
> happening to the filesystem while fastcommit is running). If nothing else,
> it's enough to have REQ_PREFLUSH only once during the whole fastcommit to
> flush out written back data blocks (plus journal device may be different
> from the filesystem device so you need to be flushing the filesystem device
> for this - see how the jbd2 commit code does this).
>
> Also REQ_FUA on each block may be overkill for devices that don't support
> it natively (and thus REQ_FUA is simulated with full write cache pre and
> post flush) - for such devices it would be better to just write out
> fastcommit normally and then issue one cache flush. With careful
> checksumming, block ID tagging and such, it should be safe against disk
> reordering writes. But I guess we can leave this optimization as a TODO
> item for later (but I think it would be good to design the on-disk format of
> fastcommit blocks so that it does not rely on FUA writes).
I see. The on disk format doesn't rely on FUA / PREFLUSH, I added it
based on the observation that in most cases all the fast commit info
was written in 1 block only. I didn't see much difference in the
performance but I get your point. I'll add this as a TODO in the code
for now.
>
> > +     lock_buffer(bh);
> > +     clear_buffer_dirty(bh);
> > +     set_buffer_uptodate(bh);
> > +     bh->b_end_io = ext4_end_buffer_io_sync;
> > +     submit_bh(REQ_OP_WRITE, write_flags, bh);
> > +     EXT4_SB(sb)->s_fc_bh = NULL;
> > +}
> > +
> > +/* Ext4 commit path routines */
> > +
> > +/* memzero and update CRC */
> > +static void *ext4_fc_memzero(struct super_block *sb, void *dst, int len,
> > +                             u32 *crc)
> > +{
> > +     void *ret;
> > +
> > +     ret = memset(dst, 0, len);
> > +     if (crc)
> > +             *crc = ext4_chksum(EXT4_SB(sb), *crc, dst, len);
> > +     return ret;
> > +}
> > +
> > +/*
> > + * Allocate len bytes on a fast commit buffer.
> > + *
> > + * During the commit time this function is used to manage fast commit
> > + * block space. We don't split a fast commit log onto different
> > + * blocks. So this function makes sure that if there's not enough space
> > + * on the current block, the remaining space in the current block is
> > + * marked as unused by adding EXT4_FC_TAG_PAD tag. In that case,
> > + * new block is from jbd2 and CRC is updated to reflect the padding
> > + * we added.
> > + */
> > +static u8 *ext4_fc_reserve_space(struct super_block *sb, int len, u32 *crc)
> > +{
> > +     struct ext4_fc_tl *tl;
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     struct buffer_head *bh;
> > +     int bsize = sbi->s_journal->j_blocksize;
> > +     int ret, off = sbi->s_fc_bytes % bsize;
> > +     int pad_len;
> > +
> > +     /*
> > +      * After allocating len, we should have space at least for a 0 byte
> > +      * padding.
> > +      */
> > +     if (len + sizeof(struct ext4_fc_tl) > bsize)
> > +             return NULL;
> > +
> > +     if (bsize - off - 1 > len + sizeof(struct ext4_fc_tl)) {
> > +             /*
> > +              * Only allocate from current buffer if we have enough space for
> > +              * this request AND we have space to add a zero byte padding.
> > +              */
> > +             if (!sbi->s_fc_bh) {
> > +                     ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
> > +                     if (ret)
> > +                             return NULL;
> > +                     sbi->s_fc_bh = bh;
> > +             }
> > +             sbi->s_fc_bytes += len;
> > +             return sbi->s_fc_bh->b_data + off;
> > +     }
> > +     /* Need to add PAD tag */
> > +     tl = (struct ext4_fc_tl *)(sbi->s_fc_bh->b_data + off);
> > +     tl->fc_tag = cpu_to_le16(EXT4_FC_TAG_PAD);
> > +     pad_len = bsize - off - 1 - sizeof(struct ext4_fc_tl);
> > +     tl->fc_len = cpu_to_le16(pad_len);
> > +     if (crc)
> > +             *crc = ext4_chksum(sbi, *crc, tl, sizeof(*tl));
> > +     if (pad_len > 0)
> > +             ext4_fc_memzero(sb, tl + 1, pad_len, crc);
> > +     ext4_fc_submit_bh(sb);
> > +
> > +     ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
> > +     if (ret)
> > +             return NULL;
> > +     sbi->s_fc_bh = bh;
> > +     sbi->s_fc_bytes = (sbi->s_fc_bytes / bsize + 1) * bsize + len;
> > +     return sbi->s_fc_bh->b_data;
> > +}
> > +
> > +/* memcpy to fc reserved space and update CRC */
> > +static void *ext4_fc_memcpy(struct super_block *sb, void *dst, const void *src,
> > +                             int len, u32 *crc)
> > +{
> > +     if (crc)
> > +             *crc = ext4_chksum(EXT4_SB(sb), *crc, src, len);
> > +     return memcpy(dst, src, len);
> > +}
> > +
> > +/*
> > + * Complete a fast commit by writing tail tag.
> > + *
> > + * Writing tail tag marks the end of a fast commit. In order to guarantee
> > + * atomicity, after writing tail tag, even if there's space remaining
> > + * in the block, next commit shouldn't use it. That's why tail tag
> > + * has the length as that of the remaining space on the block.
> > + */
> > +static int ext4_fc_write_tail(struct super_block *sb, u32 crc)
> > +{
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     struct ext4_fc_tl tl;
> > +     struct ext4_fc_tail tail;
> > +     int off, bsize = sbi->s_journal->j_blocksize;
> > +     u8 *dst;
> > +
> > +     /*
> > +      * ext4_fc_reserve_space takes care of allocating an extra block if
> > +      * there's no enough space on this block for accommodating this tail.
> > +      */
> > +     dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc);
> > +     if (!dst)
> > +             return -ENOSPC;
> > +
> > +     off = sbi->s_fc_bytes % bsize;
> > +
> > +     tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL);
> > +     tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail));
> > +     sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize);
> > +
> > +     ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc);
> > +     dst += sizeof(tl);
> > +     tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid);
> > +     ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc);
> > +     dst += sizeof(tail.fc_tid);
> > +     tail.fc_crc = cpu_to_le32(crc);
> > +     ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL);
> > +
> > +     ext4_fc_submit_bh(sb);
> > +
> > +     return 0;
> > +}
>
> Is there a reason to pass CRC all around (so you have to have special
> functions like ext4_fc_memcpy(), ext4_fc_memzero(), ...) instead of just
> creating the whole block and then computing CRC in one go?
>
> In fact, as looking through the code, it seems to me it would be slightly
> nicer layer separation and interface if JBD2 provided functions for storage
> of data blobs and handled the details of space & block management,
> checksums, writeout, on recovery verification of correctness (so it would
> just provide back a stream of blobs for FS to replay). Just an idea for
> consideration, the current interface isn't too bad and we can change it
> later if we decide so.
I designed this keeping DAX mode in mind where we would benefit if we
don't use buffer heads and blocks. There is no block level CRC, but
CRC covers all the tags either from the start or from the last tail
tag (whichever comes first). This kind of CRC can span across
multipleblocks or we could have multiple CRCs in one block. Passing
CRC around helps us to compute CRC as we write tags to storage. In DAX
mode, this would allow fast commit commits to be smaller than block
size. DAX mode code isn't implemented completely yet, but I wanted to
make sure that the design of on-disk format is consistent and
efficient for both DAX and non-DAX modes.
>
> > +
> > +/*
> > + * Adds tag, length, value and updates CRC. Returns true if tlv was added.
> > + * Returns false if there's not enough space.
> > + */
> > + */
> > +static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
> > +{
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +     int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
> > +     int ret;
> > +     struct ext4_iloc iloc;
> > +     struct ext4_fc_inode fc_inode;
> > +     struct ext4_fc_tl tl;
> > +     u8 *dst;
> > +
> > +     ret = ext4_get_inode_loc(inode, &iloc);
> > +     if (ret)
> > +             return ret;
> > +
> > +     if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
> > +             inode_len += ei->i_extra_isize;
> > +
> > +     fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
> > +     tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
> > +     tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino));
> > +
> > +     dst = ext4_fc_reserve_space(inode->i_sb,
> > +                     sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc);
> > +     if (!dst)
> > +             return -ECANCELED;
> > +
> > +     if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc))
> > +             return -ECANCELED;
> > +     dst += sizeof(tl);
> > +     if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc))
> > +             return -ECANCELED;
> > +     dst += sizeof(fc_inode);
> > +     if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc),
> > +                                     inode_len, crc))
> > +             return -ECANCELED;
>
> Isn't this racy? What guarantees the inode state you record here is a valid
> one for the fastcommit? I mean this gets called at the time of fastcommit
> (i.e., fsync), so a fastcommit code must record changes to all other
> metadata that relate to the currently recorded inode state. But this isn't
> serialized in any way (AFAICT) with on-going inode changes so how can
> fastcommit code guarantee that? This is a similar case as a problem I
> describe below...
So we have ext4_fc_start_update(inode) / ext4_fc_stop_update(inode)
which is called by all the operations that happen on an inode. If the
inode in question is undergoing a fast commit, ext4_fc_start_update()
will block. So that ensures that inode won't be modified once fast
commit starts. So, in general, before doing any fast commit related
operation, we'll first put the inode in committing state, that's the
state of the inode which will be committed on-disk in fast commit.
>
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Writes updated data ranges for the inode in question. Updates CRC.
> > + * Returns 0 on success, error otherwise.
> > + */
> > +static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> > +{
> > +     ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +     struct ext4_map_blocks map;
> > +     struct ext4_fc_add_range fc_ext;
> > +     struct ext4_fc_del_range lrange;
> > +     struct ext4_extent *ex;
> > +     int ret;
> > +
> > +     mutex_lock(&ei->i_fc_lock);
> > +     if (ei->i_fc_lblk_len == 0) {
> > +             mutex_unlock(&ei->i_fc_lock);
> > +             return 0;
> > +     }
> > +     old_blk_size = ei->i_fc_lblk_start;
> > +     new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> > +     ei->i_fc_lblk_len = 0;
> > +     mutex_unlock(&ei->i_fc_lock);
> > +
> > +     cur_lblk_off = old_blk_size;
> > +     jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> > +               __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> > +
> > +     while (cur_lblk_off <= new_blk_size) {
> > +             map.m_lblk = cur_lblk_off;
> > +             map.m_len = new_blk_size - cur_lblk_off + 1;
> > +             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > +             if (ret < 0)
> > +                     return -ECANCELED;
>
> So isn't this actually racy with a risk of stale data exposure? Consider a
> situation like:
>
> Task 1:                         Task 2:
> pwrite(file, buf, 8192, 0)
> punch(file, 0, 4096)
> fsync(file)
>   writeout range 4096-8192
>   fastcommit for inode range 0-8192
>                                 pwrite(file, buf, 4096, 0)
>     ext4_map_blocks(file)
>       - reports that block at offset 0 is mapped so that is recorded in
>         fastcommit record. But data for that is not written so after a
>         crash we'd expose stale data in that block.
>
> Am I missing something?
So the way this gets handled is before entering this function, the
inode enters COMMITTING state (in ext4_fc_submit_inode_data_all
function). Once in COMMITTING state, all the inodes on this inode get
paused. Also, the commit path waits until all the ongoing updates on
that inode are completed. Once they are completed, only then its data
buffers are flushed and this ext4_map_blocks is called. So Task-2 here
would have either completely finished or would wait until the end of
this inode's commit. I realize that I should probably add more
comments to make this more clearer in the code. But is handling it
this way sufficient or am I missing any more cases?
>
> > +
> > +             if (map.m_len == 0) {
> > +                     cur_lblk_off++;
> > +                     continue;
> > +             }
> > +
> > @@ -271,6 +272,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >
> >  out:
> >       inode_unlock(inode);
> > +     ext4_fc_stop_update(inode);
> >       if (likely(ret > 0)) {
> >               iocb->ki_pos += ret;
> >               ret = generic_write_sync(iocb, ret);
> > @@ -534,7 +536,9 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >                       goto out;
> >               }
> >
> > +             ext4_fc_start_update(inode);
> >               ret = ext4_orphan_add(handle, inode);
> > +             ext4_fc_stop_update(inode);
>
> Why is here protected only the orphan addition? What about other changes
> happening to the inode during direct write?
This is the only change that is protected by handle in this function.
What I'm trying to do here (and in other places) is that anything that
happens between ext4_journal_start() and ext4_journal_stop() happens
atomically. The way to guarantee that is to ensure that the same block
is also surrounded by ext4_fc_start_update(inode) and
ext4_fc_stop_update(inode).

I also realized while looking at this comment is that we probably need
a new TLV for adding orphan inode to the list?
>
> >               if (ret) {
> >                       ext4_journal_stop(handle);
> >                       goto out;
> > @@ -656,8 +660,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >  #endif
> >       if (iocb->ki_flags & IOCB_DIRECT)
> >               return ext4_dio_write_iter(iocb, from);
> > -
> > -     return ext4_buffered_write_iter(iocb, from);
> > +     else
> > +             return ext4_buffered_write_iter(iocb, from);
>
> Why this change?
Oops, this can be removed.
>
> >  }
> >
> >  #ifdef CONFIG_FS_DAX
> > @@ -757,6 +761,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
> >       if (!daxdev_mapping_supported(vma, dax_dev))
> >               return -EOPNOTSUPP;
> >
> > +     ext4_fc_start_update(inode);
> >       file_accessed(file);
>
> Uh, is this ext4_fc_start_update() for the file_accessed() call? What about
> all the other inode timestamp updates? I'd say handling in ext4_setattr()
> should be enough?
Makes sense. I'll remove this.
>
> Also I don't see anything tracking inode changes due to writes through mmap?
> How is that supposed to work?
Right, I have missed those. I see that mmap function
ext4_page_mkwrite() calls ext4_jbd2_inode_add_write that tells jbd2
what is the range that needs to be written for the inode in question.
I guess I can just update that function to update inode's FC range as
well?

Thanks,
Harshad
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-10-26 20:55     ` harshad shirwadkar
@ 2020-10-27 14:29       ` Jan Kara
  2020-10-27 17:38         ` harshad shirwadkar
  2020-10-27 18:45         ` Theodore Y. Ts'o
  0 siblings, 2 replies; 33+ messages in thread
From: Jan Kara @ 2020-10-27 14:29 UTC (permalink / raw)
  To: harshad shirwadkar
  Cc: Jan Kara, Ext4 Developers List, Theodore Y. Ts'o, kernel test robot

On Mon 26-10-20 13:55:47, harshad shirwadkar wrote:
> > > + *
> > > + * The fast commit space at the end of above operations would look like this:
> > > + *      [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL]
> > > + *             |<---  Fast Commit 1   --->|<---      Fast Commit 2     ---->|
> > > + *
> > > + * Replay code should thus check for all the valid tails in the FC area.
> >
> > And one design question: Why do we record unlink of B here? I was kind of
> > hoping that fastcommit due to fsync(A) would record only operations related
> > to A. Because the way you wrote it, fast commit is inherently still a
> > filesystem-global operation requiring global ordering of metadata changes
> > with all the scalability bottlenecks current journalling code has. It's
> > faster by some factor due to more efficient packing of "small" changes not
> > fundamentally faster AFAICT...
> So given that fsync() for Ext4 traditionally resulted in syncing of
> all the dirty inodes / buffers. If we fsync() only the file in
> question, I'm worried that we may break some of the existing
> applications. In the earlier version of the series, I had a
> "soft_consistency" mode which did exactly that. It broke a bunch of
> xfstests that had this assumption. Also, in my tests I didn't see a
> big performance difference between these fast commits and the fast
> commits with soft consistency. Most probably, that's because the
> benchmarks perform a fsync on all the files and current fast commits
> give it a batching effect which soft consistency mode would fail to
> provide.
> 
> But I'm not fixated on this, I think it's still good to have
> soft_consistency mode. Good thing is this doesn't affect the on-disk
> format. So, this is something that can be gradually added to Ext4.

OK, I see. Maybe add a paragraph about this to fastcommit doc? I agree that
we can leave these optimizations for later, I was just wondering whether
there isn't some fundamental reason why global flush would be required and
I'm happy to hear that there isn't.

The advantage of soft_consistency as you call it would be IMO most seen if
there's relatively heavy non-fsync IO load in parallel with frequent fsyncs
of a tiny file. And such load is not infrequent in practice. I agree that
benchmarks like dbench are unlikely to benefit from soft_consistency since
all IO the benchmark does is in fact forced by fsync.

I also think that with soft_consistency we could benefit (e.g. on SSD
storage) from having several fast-commit areas in the journal so multiple
fastcommits can run in parallel. But that's also for some later
experimentation...

> > > +
> > > +     mutex_lock(&ei->i_fc_lock);
> > > +     if (running_txn_tid == ei->i_sync_tid) {
> > > +             update = true;
> > > +     } else {
> > > +             ext4_fc_reset_inode(inode);
> > > +             ei->i_sync_tid = running_txn_tid;
> > > +     }
> > > +     ret = __fc_track_fn(inode, args, update);
> > > +     mutex_unlock(&ei->i_fc_lock);
> > > +
> > > +     if (!enqueue)
> > > +             return ret;
> > > +
> > > +     spin_lock(&sbi->s_fc_lock);
> > > +     if (list_empty(&EXT4_I(inode)->i_fc_list))
> > > +             list_add_tail(&EXT4_I(inode)->i_fc_list,
> > > +                             (sbi->s_mount_state & EXT4_FC_COMMITTING) ?
> > > +                             &sbi->s_fc_q[FC_Q_STAGING] :
> > > +                             &sbi->s_fc_q[FC_Q_MAIN]);
> > > +     spin_unlock(&sbi->s_fc_lock);
> >
> > OK, so how do you prevent inode from being freed while it is still on
> > i_fc_list? I don't see anything preventing that and it could cause nasty
> > use-after-free issues. Note that for similar reasons JBD2 uses external
> > separately allocated inode for jbd2_inode so that it can have separate
> > lifetime (related to transaction commits) from struct ext4_inode_info.
> So, if you see the function ext4_fc_del() above, it's called from
> ext4_clear_inode(). What ext4_fc_del() does is that, if the inode is
> not being committed, it just removes it from the list. If that inode
> was deleted, we have a separate dentry queue which will record the
> deletion of the inode, so we don't really need the struct
> ext4_inode_info for recording that on-disk. However, if the inode is
> being committed (this is figured out by checking the per inode
> COMMITTING state), ext4_fc_del() waits until the completion.

But I don't think this quite works. Consider the following scenario:

inode I gets modified in transaction T
  you add I to FC list

memory pressure reclaims I from memory
  you remove I from FC list

open(I) -> inode gets loaded to memory again. Not tracked in FC list.
fsync(I) -> nothing to do, FC list is empty
<crash>

And 'I' now doesn't contain data in should because T didn't commit yet and
FC was empty.

> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +struct __track_dentry_update_args {
> > > +     struct dentry *dentry;
> > > +     int op;
> > > +};
> > > +
> > > +/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */
> > > +static void ext4_fc_submit_bh(struct super_block *sb)
> > > +{
> > > +     int write_flags = REQ_SYNC;
> > > +     struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
> > > +
> > > +     if (test_opt(sb, BARRIER))
> > > +             write_flags |= REQ_FUA | REQ_PREFLUSH;
> >
> > Submitting each fastcommit buffer with REQ_FUA | REQ_PREFLUSH is
> > unnecessarily expensive (especially if there will be unrelated writes
> > happening to the filesystem while fastcommit is running). If nothing else,
> > it's enough to have REQ_PREFLUSH only once during the whole fastcommit to
> > flush out written back data blocks (plus journal device may be different
> > from the filesystem device so you need to be flushing the filesystem device
> > for this - see how the jbd2 commit code does this).
> >
> > Also REQ_FUA on each block may be overkill for devices that don't support
> > it natively (and thus REQ_FUA is simulated with full write cache pre and
> > post flush) - for such devices it would be better to just write out
> > fastcommit normally and then issue one cache flush. With careful
> > checksumming, block ID tagging and such, it should be safe against disk
> > reordering writes. But I guess we can leave this optimization as a TODO
> > item for later (but I think it would be good to design the on-disk format of
> > fastcommit blocks so that it does not rely on FUA writes).
> I see. The on disk format doesn't rely on FUA / PREFLUSH, I added it
> based on the observation that in most cases all the fast commit info
> was written in 1 block only. I didn't see much difference in the
> performance but I get your point. I'll add this as a TODO in the code
> for now.

OK, the performance optimization can wait for later but the flushing of
proper device needs to be fixed soon - as I wrote above REQ_PREFLUSH is not
enough (and needed at all) when the journal device is different from the
filesystem device.

> > > +/*
> > > + * Complete a fast commit by writing tail tag.
> > > + *
> > > + * Writing tail tag marks the end of a fast commit. In order to guarantee
> > > + * atomicity, after writing tail tag, even if there's space remaining
> > > + * in the block, next commit shouldn't use it. That's why tail tag
> > > + * has the length as that of the remaining space on the block.
> > > + */
> > > +static int ext4_fc_write_tail(struct super_block *sb, u32 crc)
> > > +{
> > > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > > +     struct ext4_fc_tl tl;
> > > +     struct ext4_fc_tail tail;
> > > +     int off, bsize = sbi->s_journal->j_blocksize;
> > > +     u8 *dst;
> > > +
> > > +     /*
> > > +      * ext4_fc_reserve_space takes care of allocating an extra block if
> > > +      * there's no enough space on this block for accommodating this tail.
> > > +      */
> > > +     dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc);
> > > +     if (!dst)
> > > +             return -ENOSPC;
> > > +
> > > +     off = sbi->s_fc_bytes % bsize;
> > > +
> > > +     tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL);
> > > +     tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail));
> > > +     sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize);
> > > +
> > > +     ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc);
> > > +     dst += sizeof(tl);
> > > +     tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid);
> > > +     ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc);
> > > +     dst += sizeof(tail.fc_tid);
> > > +     tail.fc_crc = cpu_to_le32(crc);
> > > +     ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL);
> > > +
> > > +     ext4_fc_submit_bh(sb);
> > > +
> > > +     return 0;
> > > +}
> >
> > Is there a reason to pass CRC all around (so you have to have special
> > functions like ext4_fc_memcpy(), ext4_fc_memzero(), ...) instead of just
> > creating the whole block and then computing CRC in one go?
> >
> > In fact, as looking through the code, it seems to me it would be slightly
> > nicer layer separation and interface if JBD2 provided functions for storage
> > of data blobs and handled the details of space & block management,
> > checksums, writeout, on recovery verification of correctness (so it would
> > just provide back a stream of blobs for FS to replay). Just an idea for
> > consideration, the current interface isn't too bad and we can change it
> > later if we decide so.
> I designed this keeping DAX mode in mind where we would benefit if we
> don't use buffer heads and blocks. There is no block level CRC, but
> CRC covers all the tags either from the start or from the last tail
> tag (whichever comes first). This kind of CRC can span across
> multipleblocks or we could have multiple CRCs in one block. Passing
> CRC around helps us to compute CRC as we write tags to storage. In DAX
> mode, this would allow fast commit commits to be smaller than block
> size. DAX mode code isn't implemented completely yet, but I wanted to
> make sure that the design of on-disk format is consistent and
> efficient for both DAX and non-DAX modes.

OK, I understand. Thanks for explanation!

> > > +
> > > +/*
> > > + * Adds tag, length, value and updates CRC. Returns true if tlv was added.
> > > + * Returns false if there's not enough space.
> > > + */
> > > + */
> > > +static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
> > > +{
> > > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > > +     int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
> > > +     int ret;
> > > +     struct ext4_iloc iloc;
> > > +     struct ext4_fc_inode fc_inode;
> > > +     struct ext4_fc_tl tl;
> > > +     u8 *dst;
> > > +
> > > +     ret = ext4_get_inode_loc(inode, &iloc);
> > > +     if (ret)
> > > +             return ret;
> > > +
> > > +     if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
> > > +             inode_len += ei->i_extra_isize;
> > > +
> > > +     fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
> > > +     tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
> > > +     tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino));
> > > +
> > > +     dst = ext4_fc_reserve_space(inode->i_sb,
> > > +                     sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc);
> > > +     if (!dst)
> > > +             return -ECANCELED;
> > > +
> > > +     if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc))
> > > +             return -ECANCELED;
> > > +     dst += sizeof(tl);
> > > +     if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc))
> > > +             return -ECANCELED;
> > > +     dst += sizeof(fc_inode);
> > > +     if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc),
> > > +                                     inode_len, crc))
> > > +             return -ECANCELED;
> >
> > Isn't this racy? What guarantees the inode state you record here is a valid
> > one for the fastcommit? I mean this gets called at the time of fastcommit
> > (i.e., fsync), so a fastcommit code must record changes to all other
> > metadata that relate to the currently recorded inode state. But this isn't
> > serialized in any way (AFAICT) with on-going inode changes so how can
> > fastcommit code guarantee that? This is a similar case as a problem I
> > describe below...
> So we have ext4_fc_start_update(inode) / ext4_fc_stop_update(inode)
> which is called by all the operations that happen on an inode. If the
> inode in question is undergoing a fast commit, ext4_fc_start_update()
> will block. So that ensures that inode won't be modified once fast
> commit starts. So, in general, before doing any fast commit related
> operation, we'll first put the inode in committing state, that's the
> state of the inode which will be committed on-disk in fast commit.

I see. See the case below for my comments.

> > > +
> > > +     return 0;
> > > +}
> > > +
> > > +/*
> > > + * Writes updated data ranges for the inode in question. Updates CRC.
> > > + * Returns 0 on success, error otherwise.
> > > + */
> > > +static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> > > +{
> > > +     ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> > > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > > +     struct ext4_map_blocks map;
> > > +     struct ext4_fc_add_range fc_ext;
> > > +     struct ext4_fc_del_range lrange;
> > > +     struct ext4_extent *ex;
> > > +     int ret;
> > > +
> > > +     mutex_lock(&ei->i_fc_lock);
> > > +     if (ei->i_fc_lblk_len == 0) {
> > > +             mutex_unlock(&ei->i_fc_lock);
> > > +             return 0;
> > > +     }
> > > +     old_blk_size = ei->i_fc_lblk_start;
> > > +     new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> > > +     ei->i_fc_lblk_len = 0;
> > > +     mutex_unlock(&ei->i_fc_lock);
> > > +
> > > +     cur_lblk_off = old_blk_size;
> > > +     jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> > > +               __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> > > +
> > > +     while (cur_lblk_off <= new_blk_size) {
> > > +             map.m_lblk = cur_lblk_off;
> > > +             map.m_len = new_blk_size - cur_lblk_off + 1;
> > > +             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > > +             if (ret < 0)
> > > +                     return -ECANCELED;
> >
> > So isn't this actually racy with a risk of stale data exposure? Consider a
> > situation like:
> >
> > Task 1:                         Task 2:
> > pwrite(file, buf, 8192, 0)
> > punch(file, 0, 4096)
> > fsync(file)
> >   writeout range 4096-8192
> >   fastcommit for inode range 0-8192
> >                                 pwrite(file, buf, 4096, 0)
> >     ext4_map_blocks(file)
> >       - reports that block at offset 0 is mapped so that is recorded in
> >         fastcommit record. But data for that is not written so after a
> >         crash we'd expose stale data in that block.
> >
> > Am I missing something?
> So the way this gets handled is before entering this function, the
> inode enters COMMITTING state (in ext4_fc_submit_inode_data_all
> function). Once in COMMITTING state, all the inodes on this inode get
> paused. Also, the commit path waits until all the ongoing updates on
> that inode are completed. Once they are completed, only then its data
> buffers are flushed and this ext4_map_blocks is called. So Task-2 here
> would have either completely finished or would wait until the end of
> this inode's commit. I realize that I should probably add more
> comments to make this more clearer in the code. But is handling it
> this way sufficient or am I missing any more cases?

I see. In principle this should work. But I don't like that we have yet
another mechanism that needs to properly wrap inode changes to make
fastcommits work. And if we get it wrong somewhere, the breakage will be
almost impossible to notice until someone looses data after a power
failure. So it seems a bit fragile to me.

Ideally I think we would reuse the current transaction machinery for this
somehow (so that changes added through one transaction handle would behave
atomically wrt to fastcommits) but the details are not clear to me yet. I
need to think more about this...

> > > +
> > > +             if (map.m_len == 0) {
> > > +                     cur_lblk_off++;
> > > +                     continue;
> > > +             }
> > > +
> > > @@ -271,6 +272,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > >
> > >  out:
> > >       inode_unlock(inode);
> > > +     ext4_fc_stop_update(inode);
> > >       if (likely(ret > 0)) {
> > >               iocb->ki_pos += ret;
> > >               ret = generic_write_sync(iocb, ret);
> > > @@ -534,7 +536,9 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > >                       goto out;
> > >               }
> > >
> > > +             ext4_fc_start_update(inode);
> > >               ret = ext4_orphan_add(handle, inode);
> > > +             ext4_fc_stop_update(inode);
> >
> > Why is here protected only the orphan addition? What about other changes
> > happening to the inode during direct write?
> This is the only change that is protected by handle in this function.
> What I'm trying to do here (and in other places) is that anything that
> happens between ext4_journal_start() and ext4_journal_stop() happens
> atomically. The way to guarantee that is to ensure that the same block
> is also surrounded by ext4_fc_start_update(inode) and
> ext4_fc_stop_update(inode).
> 
> I also realized while looking at this comment is that we probably need
> a new TLV for adding orphan inode to the list?

> > Also I don't see anything tracking inode changes due to writes through mmap?
> > How is that supposed to work?
> Right, I have missed those. I see that mmap function
> ext4_page_mkwrite() calls ext4_jbd2_inode_add_write that tells jbd2
> what is the range that needs to be written for the inode in question.
> I guess I can just update that function to update inode's FC range as
> well?

Yes, you need to add tracking of the page range handled in
ext4_page_mkwrite() to FC.

I've also realized that you probably need to disable fastcommits when data
journaling is enabled for the inode (probably just disable fastcommit
feature with data=jounral mount option, make inode ineligible if it has
'journal data' flag set).

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-10-27 14:29       ` Jan Kara
@ 2020-10-27 17:38         ` harshad shirwadkar
  2020-10-30 15:28           ` Jan Kara
  2020-10-27 18:45         ` Theodore Y. Ts'o
  1 sibling, 1 reply; 33+ messages in thread
From: harshad shirwadkar @ 2020-10-27 17:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4 Developers List, Theodore Y. Ts'o, kernel test robot

On Tue, Oct 27, 2020 at 7:29 AM Jan Kara <jack@suse.cz> wrote:
>
> On Mon 26-10-20 13:55:47, harshad shirwadkar wrote:
> > > > + *
> > > > + * The fast commit space at the end of above operations would look like this:
> > > > + *      [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL]
> > > > + *             |<---  Fast Commit 1   --->|<---      Fast Commit 2     ---->|
> > > > + *
> > > > + * Replay code should thus check for all the valid tails in the FC area.
> > >
> > > And one design question: Why do we record unlink of B here? I was kind of
> > > hoping that fastcommit due to fsync(A) would record only operations related
> > > to A. Because the way you wrote it, fast commit is inherently still a
> > > filesystem-global operation requiring global ordering of metadata changes
> > > with all the scalability bottlenecks current journalling code has. It's
> > > faster by some factor due to more efficient packing of "small" changes not
> > > fundamentally faster AFAICT...
> > So given that fsync() for Ext4 traditionally resulted in syncing of
> > all the dirty inodes / buffers. If we fsync() only the file in
> > question, I'm worried that we may break some of the existing
> > applications. In the earlier version of the series, I had a
> > "soft_consistency" mode which did exactly that. It broke a bunch of
> > xfstests that had this assumption. Also, in my tests I didn't see a
> > big performance difference between these fast commits and the fast
> > commits with soft consistency. Most probably, that's because the
> > benchmarks perform a fsync on all the files and current fast commits
> > give it a batching effect which soft consistency mode would fail to
> > provide.
> >
> > But I'm not fixated on this, I think it's still good to have
> > soft_consistency mode. Good thing is this doesn't affect the on-disk
> > format. So, this is something that can be gradually added to Ext4.
>
> OK, I see. Maybe add a paragraph about this to fastcommit doc? I agree that
> we can leave these optimizations for later, I was just wondering whether
> there isn't some fundamental reason why global flush would be required and
> I'm happy to hear that there isn't.
Ack
>
> The advantage of soft_consistency as you call it would be IMO most seen if
> there's relatively heavy non-fsync IO load in parallel with frequent fsyncs
> of a tiny file. And such load is not infrequent in practice. I agree that
> benchmarks like dbench are unlikely to benefit from soft_consistency since
> all IO the benchmark does is in fact forced by fsync.
>
> I also think that with soft_consistency we could benefit (e.g. on SSD
> storage) from having several fast-commit areas in the journal so multiple
> fastcommits can run in parallel. But that's also for some later
> experimentation...
Yeah makes sense.
>
> > > > +
> > > > +     mutex_lock(&ei->i_fc_lock);
> > > > +     if (running_txn_tid == ei->i_sync_tid) {
> > > > +             update = true;
> > > > +     } else {
> > > > +             ext4_fc_reset_inode(inode);
> > > > +             ei->i_sync_tid = running_txn_tid;
> > > > +     }
> > > > +     ret = __fc_track_fn(inode, args, update);
> > > > +     mutex_unlock(&ei->i_fc_lock);
> > > > +
> > > > +     if (!enqueue)
> > > > +             return ret;
> > > > +
> > > > +     spin_lock(&sbi->s_fc_lock);
> > > > +     if (list_empty(&EXT4_I(inode)->i_fc_list))
> > > > +             list_add_tail(&EXT4_I(inode)->i_fc_list,
> > > > +                             (sbi->s_mount_state & EXT4_FC_COMMITTING) ?
> > > > +                             &sbi->s_fc_q[FC_Q_STAGING] :
> > > > +                             &sbi->s_fc_q[FC_Q_MAIN]);
> > > > +     spin_unlock(&sbi->s_fc_lock);
> > >
> > > OK, so how do you prevent inode from being freed while it is still on
> > > i_fc_list? I don't see anything preventing that and it could cause nasty
> > > use-after-free issues. Note that for similar reasons JBD2 uses external
> > > separately allocated inode for jbd2_inode so that it can have separate
> > > lifetime (related to transaction commits) from struct ext4_inode_info.
> > So, if you see the function ext4_fc_del() above, it's called from
> > ext4_clear_inode(). What ext4_fc_del() does is that, if the inode is
> > not being committed, it just removes it from the list. If that inode
> > was deleted, we have a separate dentry queue which will record the
> > deletion of the inode, so we don't really need the struct
> > ext4_inode_info for recording that on-disk. However, if the inode is
> > being committed (this is figured out by checking the per inode
> > COMMITTING state), ext4_fc_del() waits until the completion.
>
> But I don't think this quite works. Consider the following scenario:
>
> inode I gets modified in transaction T
>   you add I to FC list
>
> memory pressure reclaims I from memory
>   you remove I from FC list
>
> open(I) -> inode gets loaded to memory again. Not tracked in FC list.
> fsync(I) -> nothing to do, FC list is empty
> <crash>
>
> And 'I' now doesn't contain data in should because T didn't commit yet and
> FC was empty.
Hmmm, I see. This needs to get fixed. However, I'm a little confused
here. On memory pressure, the call chain would be like:
VFS->ext4_evict_inode() -> ext4_free_inode() -> ext4_clear_inode(). In
ext4_clear_inode(), we free up the jbd2_inode as well. If that's the
case, how does jbd2_inode survive the memory pressure where its
corresponding VFS inode is freed up?

Assuming I'm missing something, one option would be to track
jbd2_inode in the FC list instead of ext4_inode_info? Would that take
care of the problem? Another option would be to trigger a fast_commit
from ext4_evict_inode if the inode being freed is on fc list. But I'm
worried that would increase the latency of unlink operation.
>
> > > > +
> > > > +     return ret;
> > > > +}
> > > > +
> > > > +struct __track_dentry_update_args {
> > > > +     struct dentry *dentry;
> > > > +     int op;
> > > > +};
> > > > +
> > > > +/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */
> > > > +static void ext4_fc_submit_bh(struct super_block *sb)
> > > > +{
> > > > +     int write_flags = REQ_SYNC;
> > > > +     struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
> > > > +
> > > > +     if (test_opt(sb, BARRIER))
> > > > +             write_flags |= REQ_FUA | REQ_PREFLUSH;
> > >
> > > Submitting each fastcommit buffer with REQ_FUA | REQ_PREFLUSH is
> > > unnecessarily expensive (especially if there will be unrelated writes
> > > happening to the filesystem while fastcommit is running). If nothing else,
> > > it's enough to have REQ_PREFLUSH only once during the whole fastcommit to
> > > flush out written back data blocks (plus journal device may be different
> > > from the filesystem device so you need to be flushing the filesystem device
> > > for this - see how the jbd2 commit code does this).
> > >
> > > Also REQ_FUA on each block may be overkill for devices that don't support
> > > it natively (and thus REQ_FUA is simulated with full write cache pre and
> > > post flush) - for such devices it would be better to just write out
> > > fastcommit normally and then issue one cache flush. With careful
> > > checksumming, block ID tagging and such, it should be safe against disk
> > > reordering writes. But I guess we can leave this optimization as a TODO
> > > item for later (but I think it would be good to design the on-disk format of
> > > fastcommit blocks so that it does not rely on FUA writes).
> > I see. The on disk format doesn't rely on FUA / PREFLUSH, I added it
> > based on the observation that in most cases all the fast commit info
> > was written in 1 block only. I didn't see much difference in the
> > performance but I get your point. I'll add this as a TODO in the code
> > for now.
>
> OK, the performance optimization can wait for later but the flushing of
> proper device needs to be fixed soon - as I wrote above REQ_PREFLUSH is not
> enough (and needed at all) when the journal device is different from the
> filesystem device.
Ack
>
> > > > +/*
> > > > + * Complete a fast commit by writing tail tag.
> > > > + *
> > > > + * Writing tail tag marks the end of a fast commit. In order to guarantee
> > > > + * atomicity, after writing tail tag, even if there's space remaining
> > > > + * in the block, next commit shouldn't use it. That's why tail tag
> > > > + * has the length as that of the remaining space on the block.
> > > > + */
> > > > +static int ext4_fc_write_tail(struct super_block *sb, u32 crc)
> > > > +{
> > > > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > > > +     struct ext4_fc_tl tl;
> > > > +     struct ext4_fc_tail tail;
> > > > +     int off, bsize = sbi->s_journal->j_blocksize;
> > > > +     u8 *dst;
> > > > +
> > > > +     /*
> > > > +      * ext4_fc_reserve_space takes care of allocating an extra block if
> > > > +      * there's no enough space on this block for accommodating this tail.
> > > > +      */
> > > > +     dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc);
> > > > +     if (!dst)
> > > > +             return -ENOSPC;
> > > > +
> > > > +     off = sbi->s_fc_bytes % bsize;
> > > > +
> > > > +     tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL);
> > > > +     tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail));
> > > > +     sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize);
> > > > +
> > > > +     ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc);
> > > > +     dst += sizeof(tl);
> > > > +     tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid);
> > > > +     ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc);
> > > > +     dst += sizeof(tail.fc_tid);
> > > > +     tail.fc_crc = cpu_to_le32(crc);
> > > > +     ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL);
> > > > +
> > > > +     ext4_fc_submit_bh(sb);
> > > > +
> > > > +     return 0;
> > > > +}
> > >
> > > Is there a reason to pass CRC all around (so you have to have special
> > > functions like ext4_fc_memcpy(), ext4_fc_memzero(), ...) instead of just
> > > creating the whole block and then computing CRC in one go?
> > >
> > > In fact, as looking through the code, it seems to me it would be slightly
> > > nicer layer separation and interface if JBD2 provided functions for storage
> > > of data blobs and handled the details of space & block management,
> > > checksums, writeout, on recovery verification of correctness (so it would
> > > just provide back a stream of blobs for FS to replay). Just an idea for
> > > consideration, the current interface isn't too bad and we can change it
> > > later if we decide so.
> > I designed this keeping DAX mode in mind where we would benefit if we
> > don't use buffer heads and blocks. There is no block level CRC, but
> > CRC covers all the tags either from the start or from the last tail
> > tag (whichever comes first). This kind of CRC can span across
> > multipleblocks or we could have multiple CRCs in one block. Passing
> > CRC around helps us to compute CRC as we write tags to storage. In DAX
> > mode, this would allow fast commit commits to be smaller than block
> > size. DAX mode code isn't implemented completely yet, but I wanted to
> > make sure that the design of on-disk format is consistent and
> > efficient for both DAX and non-DAX modes.
>
> OK, I understand. Thanks for explanation!
>
> > > > +
> > > > +/*
> > > > + * Adds tag, length, value and updates CRC. Returns true if tlv was added.
> > > > + * Returns false if there's not enough space.
> > > > + */
> > > > + */
> > > > +static int ext4_fc_write_inode(struct inode *inode, u32 *crc)
> > > > +{
> > > > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > > > +     int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
> > > > +     int ret;
> > > > +     struct ext4_iloc iloc;
> > > > +     struct ext4_fc_inode fc_inode;
> > > > +     struct ext4_fc_tl tl;
> > > > +     u8 *dst;
> > > > +
> > > > +     ret = ext4_get_inode_loc(inode, &iloc);
> > > > +     if (ret)
> > > > +             return ret;
> > > > +
> > > > +     if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
> > > > +             inode_len += ei->i_extra_isize;
> > > > +
> > > > +     fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
> > > > +     tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_INODE);
> > > > +     tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino));
> > > > +
> > > > +     dst = ext4_fc_reserve_space(inode->i_sb,
> > > > +                     sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc);
> > > > +     if (!dst)
> > > > +             return -ECANCELED;
> > > > +
> > > > +     if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc))
> > > > +             return -ECANCELED;
> > > > +     dst += sizeof(tl);
> > > > +     if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc))
> > > > +             return -ECANCELED;
> > > > +     dst += sizeof(fc_inode);
> > > > +     if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc),
> > > > +                                     inode_len, crc))
> > > > +             return -ECANCELED;
> > >
> > > Isn't this racy? What guarantees the inode state you record here is a valid
> > > one for the fastcommit? I mean this gets called at the time of fastcommit
> > > (i.e., fsync), so a fastcommit code must record changes to all other
> > > metadata that relate to the currently recorded inode state. But this isn't
> > > serialized in any way (AFAICT) with on-going inode changes so how can
> > > fastcommit code guarantee that? This is a similar case as a problem I
> > > describe below...
> > So we have ext4_fc_start_update(inode) / ext4_fc_stop_update(inode)
> > which is called by all the operations that happen on an inode. If the
> > inode in question is undergoing a fast commit, ext4_fc_start_update()
> > will block. So that ensures that inode won't be modified once fast
> > commit starts. So, in general, before doing any fast commit related
> > operation, we'll first put the inode in committing state, that's the
> > state of the inode which will be committed on-disk in fast commit.
>
> I see. See the case below for my comments.
>
> > > > +
> > > > +     return 0;
> > > > +}
> > > > +
> > > > +/*
> > > > + * Writes updated data ranges for the inode in question. Updates CRC.
> > > > + * Returns 0 on success, error otherwise.
> > > > + */
> > > > +static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> > > > +{
> > > > +     ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> > > > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > > > +     struct ext4_map_blocks map;
> > > > +     struct ext4_fc_add_range fc_ext;
> > > > +     struct ext4_fc_del_range lrange;
> > > > +     struct ext4_extent *ex;
> > > > +     int ret;
> > > > +
> > > > +     mutex_lock(&ei->i_fc_lock);
> > > > +     if (ei->i_fc_lblk_len == 0) {
> > > > +             mutex_unlock(&ei->i_fc_lock);
> > > > +             return 0;
> > > > +     }
> > > > +     old_blk_size = ei->i_fc_lblk_start;
> > > > +     new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> > > > +     ei->i_fc_lblk_len = 0;
> > > > +     mutex_unlock(&ei->i_fc_lock);
> > > > +
> > > > +     cur_lblk_off = old_blk_size;
> > > > +     jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> > > > +               __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> > > > +
> > > > +     while (cur_lblk_off <= new_blk_size) {
> > > > +             map.m_lblk = cur_lblk_off;
> > > > +             map.m_len = new_blk_size - cur_lblk_off + 1;
> > > > +             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > > > +             if (ret < 0)
> > > > +                     return -ECANCELED;
> > >
> > > So isn't this actually racy with a risk of stale data exposure? Consider a
> > > situation like:
> > >
> > > Task 1:                         Task 2:
> > > pwrite(file, buf, 8192, 0)
> > > punch(file, 0, 4096)
> > > fsync(file)
> > >   writeout range 4096-8192
> > >   fastcommit for inode range 0-8192
> > >                                 pwrite(file, buf, 4096, 0)
> > >     ext4_map_blocks(file)
> > >       - reports that block at offset 0 is mapped so that is recorded in
> > >         fastcommit record. But data for that is not written so after a
> > >         crash we'd expose stale data in that block.
> > >
> > > Am I missing something?
> > So the way this gets handled is before entering this function, the
> > inode enters COMMITTING state (in ext4_fc_submit_inode_data_all
> > function). Once in COMMITTING state, all the inodes on this inode get
> > paused. Also, the commit path waits until all the ongoing updates on
> > that inode are completed. Once they are completed, only then its data
> > buffers are flushed and this ext4_map_blocks is called. So Task-2 here
> > would have either completely finished or would wait until the end of
> > this inode's commit. I realize that I should probably add more
> > comments to make this more clearer in the code. But is handling it
> > this way sufficient or am I missing any more cases?
>
> I see. In principle this should work. But I don't like that we have yet
> another mechanism that needs to properly wrap inode changes to make
> fastcommits work. And if we get it wrong somewhere, the breakage will be
> almost impossible to notice until someone looses data after a power
> failure. So it seems a bit fragile to me.
Ack
>
> Ideally I think we would reuse the current transaction machinery for this
> somehow (so that changes added through one transaction handle would behave
> atomically wrt to fastcommits) but the details are not clear to me yet. I
> need to think more about this...
Yeah, I thought about that too. All we need to do is to atomically
increment an "number of ongoing updates" counter on an inode, which
could be done by existing ext4_journal_start()/stop() functions.
However, the problem is that current ext4_journal_start()/stop() don't
take inode as an argumen. I considered changing all the
ext4_journal_start/stop calls but that would have inflated the size of
this patch series which is already pretty big. But we can do that as a
follow up cleanup. Does that sound reasonable?
>
> > > > +
> > > > +             if (map.m_len == 0) {
> > > > +                     cur_lblk_off++;
> > > > +                     continue;
> > > > +             }
> > > > +
> > > > @@ -271,6 +272,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > > >
> > > >  out:
> > > >       inode_unlock(inode);
> > > > +     ext4_fc_stop_update(inode);
> > > >       if (likely(ret > 0)) {
> > > >               iocb->ki_pos += ret;
> > > >               ret = generic_write_sync(iocb, ret);
> > > > @@ -534,7 +536,9 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > > >                       goto out;
> > > >               }
> > > >
> > > > +             ext4_fc_start_update(inode);
> > > >               ret = ext4_orphan_add(handle, inode);
> > > > +             ext4_fc_stop_update(inode);
> > >
> > > Why is here protected only the orphan addition? What about other changes
> > > happening to the inode during direct write?
> > This is the only change that is protected by handle in this function.
> > What I'm trying to do here (and in other places) is that anything that
> > happens between ext4_journal_start() and ext4_journal_stop() happens
> > atomically. The way to guarantee that is to ensure that the same block
> > is also surrounded by ext4_fc_start_update(inode) and
> > ext4_fc_stop_update(inode).
> >
> > I also realized while looking at this comment is that we probably need
> > a new TLV for adding orphan inode to the list?
>
> > > Also I don't see anything tracking inode changes due to writes through mmap?
> > > How is that supposed to work?
> > Right, I have missed those. I see that mmap function
> > ext4_page_mkwrite() calls ext4_jbd2_inode_add_write that tells jbd2
> > what is the range that needs to be written for the inode in question.
> > I guess I can just update that function to update inode's FC range as
> > well?
>
> Yes, you need to add tracking of the page range handled in
> ext4_page_mkwrite() to FC.
Ack
>
> I've also realized that you probably need to disable fastcommits when data
> journaling is enabled for the inode (probably just disable fastcommit
> feature with data=jounral mount option, make inode ineligible if it has
> 'journal data' flag set).
Oh yes, thanks for pointing that out. I'll fix that too

Thanks,
Harshad
>
>                                                                         Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-10-27 14:29       ` Jan Kara
  2020-10-27 17:38         ` harshad shirwadkar
@ 2020-10-27 18:45         ` Theodore Y. Ts'o
  1 sibling, 0 replies; 33+ messages in thread
From: Theodore Y. Ts'o @ 2020-10-27 18:45 UTC (permalink / raw)
  To: Jan Kara; +Cc: harshad shirwadkar, Ext4 Developers List, Jayashree, vijay

On Tue, Oct 27, 2020 at 03:29:10PM +0100, Jan Kara wrote:
> 
> OK, I see. Maybe add a paragraph about this to fastcommit doc? I agree that
> we can leave these optimizations for later, I was just wondering whether
> there isn't some fundamental reason why global flush would be required and
> I'm happy to hear that there isn't.
> 
> The advantage of soft_consistency as you call it would be IMO most seen if
> there's relatively heavy non-fsync IO load in parallel with frequent fsyncs
> of a tiny file. And such load is not infrequent in practice. I agree that
> benchmarks like dbench are unlikely to benefit from soft_consistency since
> all IO the benchmark does is in fact forced by fsync.
> 
> I also think that with soft_consistency we could benefit (e.g. on SSD
> storage) from having several fast-commit areas in the journal so multiple
> fastcommits can run in parallel. But that's also for some later
> experimentation...

Right, so this is the reason why I wasn't super-excited by the
proposal to document crash recovery semantics in Linux file systems
proposed by Jayashree Mohan and Prof. Vijay Chidambaram last year[1].  I
knew that we were planning the Fast Commit work (Jayashree and Vijay,
this is a simplified version of the proposal made by Park and Shin in
their iJournaling paper[2]) and having something document that an
fsync(2) to one file guarantees that changes made to some other file
that were made "earlier" would disallow this particular optimization.

[1] http://lore.kernel.org/r/1552418820-18102-1-git-send-email-jaya@cs.utexas.edu
[2] https://www.usenix.org/conference/atc17/technical-sessions/presentation/park

That being said, I was afraid that there *were* applications that
might be (wrongly) making this assumption, even though it wasn't
guaranteed by POSIX.  So when it didn't make much difference for
benchmarks, and given that our original goal was to speed up NFS file
serving, where every single NFS RPC has to be synchronous before an
acknowledgement is sent back to the client, we decided to take the
conservative path --- at least for now.

I do agree with you that I can certainly think of workloads where not
requiring entanglement of unrelated file writes via fsync(2) could be
a huge performance win.

One of the things that I did discuss with Harshad was using some
hueristics, where if there are two "unrelated" applications (e.g.,
different session id, or process group leader, or different uid,
etc. --- details to be determined layer), we would not entangele
writes to unrelated files via fsync(2), while forcing files written by
the same application to share fate with one another even if only file
is fsync'ed.  Hopefully, this would head off the possibility of
another O_PONIES[3] controversy while still giving most of the
benefits of not making fsync(2) a global file system barrier.  It
would be hell to document in a standards specification, so the
official rule would still be "fsync(2) only applies to the single
file, and anything else is an accident of the implementation", per
POSIX.

[3] https://lwn.net/Articles/322823/

I still think the right answer is a new system call which takes an
array of file descriptors, so the application can explicitly declare
which set of files can be reliably fsync'ed in the same transaction
commit.  The downside is that this would require applications to
change what they are doing, and it would take the better part of a
decade before we could assume well-written applications are explicitly
declaring their crash recovery needs.

					- Ted

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 3/9] ext4 / jbd2: add fast commit initialization
  2020-10-21 20:00   ` Jan Kara
@ 2020-10-29 23:28     ` harshad shirwadkar
  2020-10-30 15:40       ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: harshad shirwadkar @ 2020-10-29 23:28 UTC (permalink / raw)
  To: Jan Kara; +Cc: Ext4 Developers List, Theodore Y. Ts'o, kernel test robot

On Wed, Oct 21, 2020 at 1:00 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 15-10-20 13:37:55, Harshad Shirwadkar wrote:
> > diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
> > new file mode 100644
> > index 000000000000..8362bf5e6e00
> > --- /dev/null
> > +++ b/fs/ext4/fast_commit.h
> > @@ -0,0 +1,9 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef __FAST_COMMIT_H__
> > +#define __FAST_COMMIT_H__
> > +
> > +/* Number of blocks in journal area to allocate for fast commits */
> > +#define EXT4_NUM_FC_BLKS             256
>
> Maybe this could be tunable (at least during mkfs but maybe also with
> a mount option)? I can imagine some people will want to tune this for their
> workloads similarly as they tune the journal size. And although current
> minimal journal size is 1024, I'd be actually calmer if jbd2 properly
> checked from the start that requested fastcommit area isn't too big for the
> journal...
Sounds good, commit e029c5f2798720b463e8df0e184a4d1036311b43 ("ext4:
make num of fast commit blocks configurable") fixes this. With that
commit, now we have reserved a field in the superblock that tells the
number of fast commit blocks. Now that this is configurable, I wonder
if there's any point in giving the file system the ability to
configure the number of blocks? In other words, I'm thinking of
dropping jbd2_fc_init() which takes the number of fast commit blocks
as an argument and just solely rely on the value found in the journal
superblock. New mke2fs will allow you to set the number of fast commit
blocks in JBD2 superblock. Any objections on that?
>
> > +
> > +#endif /* __FAST_COMMIT_H__ */
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 70256a240442..23bf55057fc2 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -5170,6 +5170,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
> >       journal->j_commit_interval = sbi->s_commit_interval;
> >       journal->j_min_batch_time = sbi->s_min_batch_time;
> >       journal->j_max_batch_time = sbi->s_max_batch_time;
> > +     ext4_fc_init(sb, journal);
> >
> >       write_lock(&journal->j_state_lock);
> >       if (test_opt(sb, BARRIER))
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index c0600405e7a2..4497bfbac527 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -1181,6 +1181,14 @@ static journal_t *journal_init_common(struct block_device *bdev,
> >       if (!journal->j_wbuf)
> >               goto err_cleanup;
> >
> > +     if (journal->j_fc_wbufsize > 0) {
> > +             journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> > +                                     sizeof(struct buffer_head *),
> > +                                     GFP_KERNEL);
> > +             if (!journal->j_fc_wbuf)
> > +                     goto err_cleanup;
> > +     }
> > +
>
> Hum, but journal_init_common() gets called e.g. through
> jbd2_journal_init_inode() before ext4_init_journal_params() sets
> j_fc_wbufsize? How is this supposed to work?
I realized that this part never really gets executed in the current
code. That's because when journal_init_common is called, j_fc_wbufsize
is not set. It only gets set later, so this could have been dropped.
>
> >       bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
> >       if (!bh) {
> >               pr_err("%s: Cannot get buffer for journal superblock\n",
> > @@ -1194,11 +1202,23 @@ static journal_t *journal_init_common(struct block_device *bdev,
> >
> >  err_cleanup:
> >       kfree(journal->j_wbuf);
> > +     kfree(journal->j_fc_wbuf);
> >       jbd2_journal_destroy_revoke(journal);
> >       kfree(journal);
> >       return NULL;
> >  }
> >
> > +int jbd2_fc_init(journal_t *journal, int num_fc_blks)
> > +{
> > +     journal->j_fc_wbufsize = num_fc_blks;
> > +     journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> > +                             sizeof(struct buffer_head *), GFP_KERNEL);
> > +     if (!journal->j_fc_wbuf)
> > +             return -ENOMEM;
> > +     return 0;
> > +}
> > +EXPORT_SYMBOL(jbd2_fc_init);
>
> Hum, probably I'd find it less error prone to have size of fastcommit area
> as an argument to jbd2_journal_init_dev() and jbd2_journal_init_inode().
> That way we are sure journal parameters are initialized correctly from the
> start. OTOH number of fastcommit blocks in the journal as we load it from
> the disk and need to replay could be different from the number of
> fastcommit blocks requested now (once we allow tuning) and this can get
> confusing pretty fast. So maybe we just set number of fastcommit blocks in
> journal_init_common() and then perform setup of everything else in
> journal_reset()?
Please see my comment above. If we just rely on the value found in the
superblock, then there is no question of FS requesting a different
number of FC blocks than what we find in journal superblock. If we go
that route, then we can set the default value of j_fc_wbufsize in
journal_init_common().  Whenever we journal superblock after that, we
can override the default value with what we find in the superblock.
>
> > +
> >  /* jbd2_journal_init_dev and jbd2_journal_init_inode:
> >   *
> >   * Create a journal structure assigned some fixed set of disk blocks to
> > @@ -1316,11 +1336,20 @@ static int journal_reset(journal_t *journal)
> >       }
> >
> >       journal->j_first = first;
> > -     journal->j_last = last;
> >
> > -     journal->j_head = first;
> > -     journal->j_tail = first;
> > -     journal->j_free = last - first;
> > +     if (jbd2_has_feature_fast_commit(journal) &&
> > +         journal->j_fc_wbufsize > 0) {
> > +             journal->j_fc_last = last;
> > +             journal->j_last = last - journal->j_fc_wbufsize;
> > +             journal->j_fc_first = journal->j_last + 1;
> > +             journal->j_fc_off = 0;
> > +     } else {
> > +             journal->j_last = last;
> > +     }
> > +
> > +     journal->j_head = journal->j_first;
> > +     journal->j_tail = journal->j_first;
> > +     journal->j_free = journal->j_last - journal->j_first;
>
> So the journal size is effectively shorter by j_fc_wbufsize. But this has
> also impact on maximum transaction size we can allow for the journal and
> related parameters (generally derived from j_maxlen you don't touch).
> So this needs to get fixed. Maybe just setting j_maxlen lower is the
> easiest but then please change the comment at its definition to mention in
> memory value is without fastcommit blocks. Or just create new journal
> parameter for the size of area usable for normal commits.
Ack, will do

Thanks,
Harshad.

>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-10-27 17:38         ` harshad shirwadkar
@ 2020-10-30 15:28           ` Jan Kara
  2020-10-30 16:45             ` harshad shirwadkar
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2020-10-30 15:28 UTC (permalink / raw)
  To: harshad shirwadkar
  Cc: Jan Kara, Ext4 Developers List, Theodore Y. Ts'o, kernel test robot

On Tue 27-10-20 10:38:19, harshad shirwadkar wrote:
> On Tue, Oct 27, 2020 at 7:29 AM Jan Kara <jack@suse.cz> wrote:
> > > > > +
> > > > > +     mutex_lock(&ei->i_fc_lock);
> > > > > +     if (running_txn_tid == ei->i_sync_tid) {
> > > > > +             update = true;
> > > > > +     } else {
> > > > > +             ext4_fc_reset_inode(inode);
> > > > > +             ei->i_sync_tid = running_txn_tid;
> > > > > +     }
> > > > > +     ret = __fc_track_fn(inode, args, update);
> > > > > +     mutex_unlock(&ei->i_fc_lock);
> > > > > +
> > > > > +     if (!enqueue)
> > > > > +             return ret;
> > > > > +
> > > > > +     spin_lock(&sbi->s_fc_lock);
> > > > > +     if (list_empty(&EXT4_I(inode)->i_fc_list))
> > > > > +             list_add_tail(&EXT4_I(inode)->i_fc_list,
> > > > > +                             (sbi->s_mount_state & EXT4_FC_COMMITTING) ?
> > > > > +                             &sbi->s_fc_q[FC_Q_STAGING] :
> > > > > +                             &sbi->s_fc_q[FC_Q_MAIN]);
> > > > > +     spin_unlock(&sbi->s_fc_lock);
> > > >
> > > > OK, so how do you prevent inode from being freed while it is still on
> > > > i_fc_list? I don't see anything preventing that and it could cause nasty
> > > > use-after-free issues. Note that for similar reasons JBD2 uses external
> > > > separately allocated inode for jbd2_inode so that it can have separate
> > > > lifetime (related to transaction commits) from struct ext4_inode_info.
> > > So, if you see the function ext4_fc_del() above, it's called from
> > > ext4_clear_inode(). What ext4_fc_del() does is that, if the inode is
> > > not being committed, it just removes it from the list. If that inode
> > > was deleted, we have a separate dentry queue which will record the
> > > deletion of the inode, so we don't really need the struct
> > > ext4_inode_info for recording that on-disk. However, if the inode is
> > > being committed (this is figured out by checking the per inode
> > > COMMITTING state), ext4_fc_del() waits until the completion.
> >
> > But I don't think this quite works. Consider the following scenario:
> >
> > inode I gets modified in transaction T
> >   you add I to FC list
> >
> > memory pressure reclaims I from memory
> >   you remove I from FC list
> >
> > open(I) -> inode gets loaded to memory again. Not tracked in FC list.
> > fsync(I) -> nothing to do, FC list is empty
> > <crash>
> >
> > And 'I' now doesn't contain data in should because T didn't commit yet and
> > FC was empty.
> Hmmm, I see. This needs to get fixed. However, I'm a little confused
> here. On memory pressure, the call chain would be like:
> VFS->ext4_evict_inode() -> ext4_free_inode() -> ext4_clear_inode(). In
> ext4_clear_inode(), we free up the jbd2_inode as well. If that's the
> case, how does jbd2_inode survive the memory pressure where its
> corresponding VFS inode is freed up?

Right (and I forgot about this detail of jbd2_inode lifetime). But with
jbd2_inode the thing is that it needs to stay around only as long as there
are dirty pages attached to the inode - once pages are written out (and
this always happens before inode can be evicted from memory), we are sure
the following transaction commit has nothing to do with the inode so we can
safely free it.

With your FC list, we need to track what has changed in the inode even
after all data pages have been written out.

> Assuming I'm missing something, one option would be to track
> jbd2_inode in the FC list instead of ext4_inode_info? Would that take
> care of the problem? Another option would be to trigger a fast_commit
> from ext4_evict_inode if the inode being freed is on fc list. But I'm
> worried that would increase the latency of unlink operation.

So tracking in jbd2_inode will not help - I was confused about that.
Forcing FC on inode eviction is IMO a no-go. That would regress some loads
and also make behavior under memory pressure worse (XFS was actually doing
something similar and they had serious trouble with that under heavy memory
pressure because they needed to write tens of thousands of inodes to the
log during reclaim).

I think that if we are evicting an inode that is in fastcommit and that isn't
unlinked, we just mark the fs as ineligible - to note we are loosing info
needed for fastcommit. This shouldn't happen frequently and if it does, it
means the machine is under heavy memory pressure and it likely isn't
beneficial to keep the info around or try to reload inode from disk on
fastcommit.

> > > > > +
> > > > > +     return 0;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * Writes updated data ranges for the inode in question. Updates CRC.
> > > > > + * Returns 0 on success, error otherwise.
> > > > > + */
> > > > > +static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> > > > > +{
> > > > > +     ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> > > > > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > > > > +     struct ext4_map_blocks map;
> > > > > +     struct ext4_fc_add_range fc_ext;
> > > > > +     struct ext4_fc_del_range lrange;
> > > > > +     struct ext4_extent *ex;
> > > > > +     int ret;
> > > > > +
> > > > > +     mutex_lock(&ei->i_fc_lock);
> > > > > +     if (ei->i_fc_lblk_len == 0) {
> > > > > +             mutex_unlock(&ei->i_fc_lock);
> > > > > +             return 0;
> > > > > +     }
> > > > > +     old_blk_size = ei->i_fc_lblk_start;
> > > > > +     new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> > > > > +     ei->i_fc_lblk_len = 0;
> > > > > +     mutex_unlock(&ei->i_fc_lock);
> > > > > +
> > > > > +     cur_lblk_off = old_blk_size;
> > > > > +     jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> > > > > +               __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> > > > > +
> > > > > +     while (cur_lblk_off <= new_blk_size) {
> > > > > +             map.m_lblk = cur_lblk_off;
> > > > > +             map.m_len = new_blk_size - cur_lblk_off + 1;
> > > > > +             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > > > > +             if (ret < 0)
> > > > > +                     return -ECANCELED;
> > > >
> > > > So isn't this actually racy with a risk of stale data exposure? Consider a
> > > > situation like:
> > > >
> > > > Task 1:                         Task 2:
> > > > pwrite(file, buf, 8192, 0)
> > > > punch(file, 0, 4096)
> > > > fsync(file)
> > > >   writeout range 4096-8192
> > > >   fastcommit for inode range 0-8192
> > > >                                 pwrite(file, buf, 4096, 0)
> > > >     ext4_map_blocks(file)
> > > >       - reports that block at offset 0 is mapped so that is recorded in
> > > >         fastcommit record. But data for that is not written so after a
> > > >         crash we'd expose stale data in that block.
> > > >
> > > > Am I missing something?
> > > So the way this gets handled is before entering this function, the
> > > inode enters COMMITTING state (in ext4_fc_submit_inode_data_all
> > > function). Once in COMMITTING state, all the inodes on this inode get
> > > paused. Also, the commit path waits until all the ongoing updates on
> > > that inode are completed. Once they are completed, only then its data
> > > buffers are flushed and this ext4_map_blocks is called. So Task-2 here
> > > would have either completely finished or would wait until the end of
> > > this inode's commit. I realize that I should probably add more
> > > comments to make this more clearer in the code. But is handling it
> > > this way sufficient or am I missing any more cases?
> >
> > I see. In principle this should work. But I don't like that we have yet
> > another mechanism that needs to properly wrap inode changes to make
> > fastcommits work. And if we get it wrong somewhere, the breakage will be
> > almost impossible to notice until someone looses data after a power
> > failure. So it seems a bit fragile to me.
> Ack
> >
> > Ideally I think we would reuse the current transaction machinery for this
> > somehow (so that changes added through one transaction handle would behave
> > atomically wrt to fastcommits) but the details are not clear to me yet. I
> > need to think more about this...
> Yeah, I thought about that too. All we need to do is to atomically
> increment an "number of ongoing updates" counter on an inode, which
> could be done by existing ext4_journal_start()/stop() functions.
> However, the problem is that current ext4_journal_start()/stop() don't
> take inode as an argumen. I considered changing all the
> ext4_journal_start/stop calls but that would have inflated the size of
> this patch series which is already pretty big. But we can do that as a
> follow up cleanup. Does that sound reasonable?

So ext4_journal_start() actually does take inode as an argument and we use
it quite some places (we also have ext4_journal_start_sb() which takes just
the superblock). What I'm not sure about is whether that's the inode you
want to protect for fastcommit purposes (would need some code auditing) or
whether there are not more inodes that need the protection for some
operations. ext4_journal_stop() could be handled by recording the inode in
the handle on ext4_journal_start() so ext4_journal_stop() then knows for
which inode to decrement the counter.

Another possibility would be to increment the counter in
ext4_get_inode_loc() - that is a clear indication we are going to change
something in the inode. This also automatically handles the situation when
multiple inodes are modified by the operation or that proper inodes are
being protected. With decrementing the counter it is somewhat more
difficult. I think we can only do that at ext4_journal_stop() time so we
need to record in the handle for which inodes we acquired the update
references and drop them from ext4_journal_stop(). This would look as a
rather robust solution to me...

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 3/9] ext4 / jbd2: add fast commit initialization
  2020-10-29 23:28     ` harshad shirwadkar
@ 2020-10-30 15:40       ` Jan Kara
  0 siblings, 0 replies; 33+ messages in thread
From: Jan Kara @ 2020-10-30 15:40 UTC (permalink / raw)
  To: harshad shirwadkar
  Cc: Jan Kara, Ext4 Developers List, Theodore Y. Ts'o, kernel test robot

On Thu 29-10-20 16:28:34, harshad shirwadkar wrote:
> On Wed, Oct 21, 2020 at 1:00 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 15-10-20 13:37:55, Harshad Shirwadkar wrote:
> > > diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
> > > new file mode 100644
> > > index 000000000000..8362bf5e6e00
> > > --- /dev/null
> > > +++ b/fs/ext4/fast_commit.h
> > > @@ -0,0 +1,9 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +
> > > +#ifndef __FAST_COMMIT_H__
> > > +#define __FAST_COMMIT_H__
> > > +
> > > +/* Number of blocks in journal area to allocate for fast commits */
> > > +#define EXT4_NUM_FC_BLKS             256
> >
> > Maybe this could be tunable (at least during mkfs but maybe also with
> > a mount option)? I can imagine some people will want to tune this for their
> > workloads similarly as they tune the journal size. And although current
> > minimal journal size is 1024, I'd be actually calmer if jbd2 properly
> > checked from the start that requested fastcommit area isn't too big for the
> > journal...
> Sounds good, commit e029c5f2798720b463e8df0e184a4d1036311b43 ("ext4:
> make num of fast commit blocks configurable") fixes this. With that
> commit, now we have reserved a field in the superblock that tells the
> number of fast commit blocks. Now that this is configurable, I wonder
> if there's any point in giving the file system the ability to
> configure the number of blocks? In other words, I'm thinking of
> dropping jbd2_fc_init() which takes the number of fast commit blocks
> as an argument and just solely rely on the value found in the journal
> superblock. New mke2fs will allow you to set the number of fast commit
> blocks in JBD2 superblock. Any objections on that?

Yeah, that sounds as a good cleanup to me.

> > > +
> > > +#endif /* __FAST_COMMIT_H__ */
> > > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > > index 70256a240442..23bf55057fc2 100644
> > > --- a/fs/ext4/super.c
> > > +++ b/fs/ext4/super.c
> > > @@ -5170,6 +5170,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
> > >       journal->j_commit_interval = sbi->s_commit_interval;
> > >       journal->j_min_batch_time = sbi->s_min_batch_time;
> > >       journal->j_max_batch_time = sbi->s_max_batch_time;
> > > +     ext4_fc_init(sb, journal);
> > >
> > >       write_lock(&journal->j_state_lock);
> > >       if (test_opt(sb, BARRIER))
> > > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > > index c0600405e7a2..4497bfbac527 100644
> > > --- a/fs/jbd2/journal.c
> > > +++ b/fs/jbd2/journal.c
> > > @@ -1181,6 +1181,14 @@ static journal_t *journal_init_common(struct block_device *bdev,
> > >       if (!journal->j_wbuf)
> > >               goto err_cleanup;
> > >
> > > +     if (journal->j_fc_wbufsize > 0) {
> > > +             journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> > > +                                     sizeof(struct buffer_head *),
> > > +                                     GFP_KERNEL);
> > > +             if (!journal->j_fc_wbuf)
> > > +                     goto err_cleanup;
> > > +     }
> > > +
> >
> > Hum, but journal_init_common() gets called e.g. through
> > jbd2_journal_init_inode() before ext4_init_journal_params() sets
> > j_fc_wbufsize? How is this supposed to work?
> I realized that this part never really gets executed in the current
> code. That's because when journal_init_common is called, j_fc_wbufsize
> is not set. It only gets set later, so this could have been dropped.

OK, just clean it up please..

> > >       bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
> > >       if (!bh) {
> > >               pr_err("%s: Cannot get buffer for journal superblock\n",
> > > @@ -1194,11 +1202,23 @@ static journal_t *journal_init_common(struct block_device *bdev,
> > >
> > >  err_cleanup:
> > >       kfree(journal->j_wbuf);
> > > +     kfree(journal->j_fc_wbuf);
> > >       jbd2_journal_destroy_revoke(journal);
> > >       kfree(journal);
> > >       return NULL;
> > >  }
> > >
> > > +int jbd2_fc_init(journal_t *journal, int num_fc_blks)
> > > +{
> > > +     journal->j_fc_wbufsize = num_fc_blks;
> > > +     journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> > > +                             sizeof(struct buffer_head *), GFP_KERNEL);
> > > +     if (!journal->j_fc_wbuf)
> > > +             return -ENOMEM;
> > > +     return 0;
> > > +}
> > > +EXPORT_SYMBOL(jbd2_fc_init);
> >
> > Hum, probably I'd find it less error prone to have size of fastcommit area
> > as an argument to jbd2_journal_init_dev() and jbd2_journal_init_inode().
> > That way we are sure journal parameters are initialized correctly from the
> > start. OTOH number of fastcommit blocks in the journal as we load it from
> > the disk and need to replay could be different from the number of
> > fastcommit blocks requested now (once we allow tuning) and this can get
> > confusing pretty fast. So maybe we just set number of fastcommit blocks in
> > journal_init_common() and then perform setup of everything else in
> > journal_reset()?
> Please see my comment above. If we just rely on the value found in the
> superblock, then there is no question of FS requesting a different
> number of FC blocks than what we find in journal superblock. If we go
> that route, then we can set the default value of j_fc_wbufsize in
> journal_init_common().  Whenever we journal superblock after that, we
> can override the default value with what we find in the superblock.

Yes, having only the value in journal superblock deals nicely with all
these issues. I'm for it.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-10-30 15:28           ` Jan Kara
@ 2020-10-30 16:45             ` harshad shirwadkar
  2020-11-03 10:04               ` Jan Kara
  0 siblings, 1 reply; 33+ messages in thread
From: harshad shirwadkar @ 2020-10-30 16:45 UTC (permalink / raw)
  To: Jan Kara, Andreas Dilger, Theodore Y. Ts'o
  Cc: Ext4 Developers List, kernel test robot

On Fri, Oct 30, 2020 at 8:28 AM Jan Kara <jack@suse.cz> wrote:
>
> On Tue 27-10-20 10:38:19, harshad shirwadkar wrote:
> > On Tue, Oct 27, 2020 at 7:29 AM Jan Kara <jack@suse.cz> wrote:
> > > > > > +
> > > > > > +     mutex_lock(&ei->i_fc_lock);
> > > > > > +     if (running_txn_tid == ei->i_sync_tid) {
> > > > > > +             update = true;
> > > > > > +     } else {
> > > > > > +             ext4_fc_reset_inode(inode);
> > > > > > +             ei->i_sync_tid = running_txn_tid;
> > > > > > +     }
> > > > > > +     ret = __fc_track_fn(inode, args, update);
> > > > > > +     mutex_unlock(&ei->i_fc_lock);
> > > > > > +
> > > > > > +     if (!enqueue)
> > > > > > +             return ret;
> > > > > > +
> > > > > > +     spin_lock(&sbi->s_fc_lock);
> > > > > > +     if (list_empty(&EXT4_I(inode)->i_fc_list))
> > > > > > +             list_add_tail(&EXT4_I(inode)->i_fc_list,
> > > > > > +                             (sbi->s_mount_state & EXT4_FC_COMMITTING) ?
> > > > > > +                             &sbi->s_fc_q[FC_Q_STAGING] :
> > > > > > +                             &sbi->s_fc_q[FC_Q_MAIN]);
> > > > > > +     spin_unlock(&sbi->s_fc_lock);
> > > > >
> > > > > OK, so how do you prevent inode from being freed while it is still on
> > > > > i_fc_list? I don't see anything preventing that and it could cause nasty
> > > > > use-after-free issues. Note that for similar reasons JBD2 uses external
> > > > > separately allocated inode for jbd2_inode so that it can have separate
> > > > > lifetime (related to transaction commits) from struct ext4_inode_info.
> > > > So, if you see the function ext4_fc_del() above, it's called from
> > > > ext4_clear_inode(). What ext4_fc_del() does is that, if the inode is
> > > > not being committed, it just removes it from the list. If that inode
> > > > was deleted, we have a separate dentry queue which will record the
> > > > deletion of the inode, so we don't really need the struct
> > > > ext4_inode_info for recording that on-disk. However, if the inode is
> > > > being committed (this is figured out by checking the per inode
> > > > COMMITTING state), ext4_fc_del() waits until the completion.
> > >
> > > But I don't think this quite works. Consider the following scenario:
> > >
> > > inode I gets modified in transaction T
> > >   you add I to FC list
> > >
> > > memory pressure reclaims I from memory
> > >   you remove I from FC list
> > >
> > > open(I) -> inode gets loaded to memory again. Not tracked in FC list.
> > > fsync(I) -> nothing to do, FC list is empty
> > > <crash>
> > >
> > > And 'I' now doesn't contain data in should because T didn't commit yet and
> > > FC was empty.
> > Hmmm, I see. This needs to get fixed. However, I'm a little confused
> > here. On memory pressure, the call chain would be like:
> > VFS->ext4_evict_inode() -> ext4_free_inode() -> ext4_clear_inode(). In
> > ext4_clear_inode(), we free up the jbd2_inode as well. If that's the
> > case, how does jbd2_inode survive the memory pressure where its
> > corresponding VFS inode is freed up?
>
> Right (and I forgot about this detail of jbd2_inode lifetime). But with
> jbd2_inode the thing is that it needs to stay around only as long as there
> are dirty pages attached to the inode - once pages are written out (and
> this always happens before inode can be evicted from memory), we are sure
> the following transaction commit has nothing to do with the inode so we can
> safely free it.
>
> With your FC list, we need to track what has changed in the inode even
> after all data pages have been written out.
>
> > Assuming I'm missing something, one option would be to track
> > jbd2_inode in the FC list instead of ext4_inode_info? Would that take
> > care of the problem? Another option would be to trigger a fast_commit
> > from ext4_evict_inode if the inode being freed is on fc list. But I'm
> > worried that would increase the latency of unlink operation.
>
> So tracking in jbd2_inode will not help - I was confused about that.
> Forcing FC on inode eviction is IMO a no-go. That would regress some loads
> and also make behavior under memory pressure worse (XFS was actually doing
> something similar and they had serious trouble with that under heavy memory
> pressure because they needed to write tens of thousands of inodes to the
> log during reclaim).
>
> I think that if we are evicting an inode that is in fastcommit and that isn't
> unlinked, we just mark the fs as ineligible - to note we are loosing info
> needed for fastcommit. This shouldn't happen frequently and if it does, it
> means the machine is under heavy memory pressure and it likely isn't
> beneficial to keep the info around or try to reload inode from disk on
> fastcommit.
Ack, this sounds good to me! Thanks, I'll do this.
>
> > > > > > +
> > > > > > +     return 0;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * Writes updated data ranges for the inode in question. Updates CRC.
> > > > > > + * Returns 0 on success, error otherwise.
> > > > > > + */
> > > > > > +static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> > > > > > +{
> > > > > > +     ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> > > > > > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > > > > > +     struct ext4_map_blocks map;
> > > > > > +     struct ext4_fc_add_range fc_ext;
> > > > > > +     struct ext4_fc_del_range lrange;
> > > > > > +     struct ext4_extent *ex;
> > > > > > +     int ret;
> > > > > > +
> > > > > > +     mutex_lock(&ei->i_fc_lock);
> > > > > > +     if (ei->i_fc_lblk_len == 0) {
> > > > > > +             mutex_unlock(&ei->i_fc_lock);
> > > > > > +             return 0;
> > > > > > +     }
> > > > > > +     old_blk_size = ei->i_fc_lblk_start;
> > > > > > +     new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> > > > > > +     ei->i_fc_lblk_len = 0;
> > > > > > +     mutex_unlock(&ei->i_fc_lock);
> > > > > > +
> > > > > > +     cur_lblk_off = old_blk_size;
> > > > > > +     jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> > > > > > +               __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> > > > > > +
> > > > > > +     while (cur_lblk_off <= new_blk_size) {
> > > > > > +             map.m_lblk = cur_lblk_off;
> > > > > > +             map.m_len = new_blk_size - cur_lblk_off + 1;
> > > > > > +             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > > > > > +             if (ret < 0)
> > > > > > +                     return -ECANCELED;
> > > > >
> > > > > So isn't this actually racy with a risk of stale data exposure? Consider a
> > > > > situation like:
> > > > >
> > > > > Task 1:                         Task 2:
> > > > > pwrite(file, buf, 8192, 0)
> > > > > punch(file, 0, 4096)
> > > > > fsync(file)
> > > > >   writeout range 4096-8192
> > > > >   fastcommit for inode range 0-8192
> > > > >                                 pwrite(file, buf, 4096, 0)
> > > > >     ext4_map_blocks(file)
> > > > >       - reports that block at offset 0 is mapped so that is recorded in
> > > > >         fastcommit record. But data for that is not written so after a
> > > > >         crash we'd expose stale data in that block.
> > > > >
> > > > > Am I missing something?
> > > > So the way this gets handled is before entering this function, the
> > > > inode enters COMMITTING state (in ext4_fc_submit_inode_data_all
> > > > function). Once in COMMITTING state, all the inodes on this inode get
> > > > paused. Also, the commit path waits until all the ongoing updates on
> > > > that inode are completed. Once they are completed, only then its data
> > > > buffers are flushed and this ext4_map_blocks is called. So Task-2 here
> > > > would have either completely finished or would wait until the end of
> > > > this inode's commit. I realize that I should probably add more
> > > > comments to make this more clearer in the code. But is handling it
> > > > this way sufficient or am I missing any more cases?
> > >
> > > I see. In principle this should work. But I don't like that we have yet
> > > another mechanism that needs to properly wrap inode changes to make
> > > fastcommits work. And if we get it wrong somewhere, the breakage will be
> > > almost impossible to notice until someone looses data after a power
> > > failure. So it seems a bit fragile to me.
> > Ack
> > >
> > > Ideally I think we would reuse the current transaction machinery for this
> > > somehow (so that changes added through one transaction handle would behave
> > > atomically wrt to fastcommits) but the details are not clear to me yet. I
> > > need to think more about this...
> > Yeah, I thought about that too. All we need to do is to atomically
> > increment an "number of ongoing updates" counter on an inode, which
> > could be done by existing ext4_journal_start()/stop() functions.
> > However, the problem is that current ext4_journal_start()/stop() don't
> > take inode as an argumen. I considered changing all the
> > ext4_journal_start/stop calls but that would have inflated the size of
> > this patch series which is already pretty big. But we can do that as a
> > follow up cleanup. Does that sound reasonable?
>
> So ext4_journal_start() actually does take inode as an argument and we use
> it quite some places (we also have ext4_journal_start_sb() which takes just
> the superblock). What I'm not sure about is whether that's the inode you
> want to protect for fastcommit purposes (would need some code auditing) or
> whether there are not more inodes that need the protection for some
> operations. ext4_journal_stop() could be handled by recording the inode in
> the handle on ext4_journal_start() so ext4_journal_stop() then knows for
> which inode to decrement the counter.
>
> Another possibility would be to increment the counter in
> ext4_get_inode_loc() - that is a clear indication we are going to change
> something in the inode. This also automatically handles the situation when
> multiple inodes are modified by the operation or that proper inodes are
> being protected. With decrementing the counter it is somewhat more
> difficult. I think we can only do that at ext4_journal_stop() time so we
> need to record in the handle for which inodes we acquired the update
> references and drop them from ext4_journal_stop(). This would look as a
> rather robust solution to me...
..the only problem here is that the same handle can be returned by
multiple calls to ext4_journal_start(). That means a handle returned
by ext4_journal_start() could be associated with multiple inodes. One
way to deal with this would be to define ext4 specific handle
structure. So, each call to ext4_journal_start would return a struct
that looks like following:

struct ext4_handle {
    handle_t *jbd2_handle;
    struct inode *inode;
}

So now on ext4_journal_stop(), we know for which inode we need to drop
counters. The objects of this struct would either need to have their
own kmem_cache or would need to be defined on stack (I think the
latter is preferred). Should we do this? If we do this, this is going
to be a pretty big change (will have to inspect all the existing
callers of ext4_journal_start() and ext4_journal_stop()).

Another option would be to change the definition of handle_t such that
on every call to jbd2_journal_start(), we get a new wrapper object
that takes a reference on handle_t. Such an object would have a
private pointer that FS can use the way it wants. This will be a
relatively smaller change but it would impact OCFS too. But if we go
this route, we can't avoid using a new kmem_cache, since now these new
handle wrappers would need to be allocated inside of JBD2.

I kind of like the second option better because it keeps the change
comparatively smaller. Wdyt? Also, Ted / Andreas, wdyt?

Thanks,
Harshad

Thank
>
>                                                                 Honza
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-10-30 16:45             ` harshad shirwadkar
@ 2020-11-03 10:04               ` Jan Kara
  2020-11-03 18:31                 ` harshad shirwadkar
  0 siblings, 1 reply; 33+ messages in thread
From: Jan Kara @ 2020-11-03 10:04 UTC (permalink / raw)
  To: harshad shirwadkar
  Cc: Jan Kara, Andreas Dilger, Theodore Y. Ts'o,
	Ext4 Developers List, kernel test robot

On Fri 30-10-20 09:45:10, harshad shirwadkar wrote:
> > > > > > > +
> > > > > > > +     return 0;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * Writes updated data ranges for the inode in question. Updates CRC.
> > > > > > > + * Returns 0 on success, error otherwise.
> > > > > > > + */
> > > > > > > +static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> > > > > > > +{
> > > > > > > +     ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> > > > > > > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > > > > > > +     struct ext4_map_blocks map;
> > > > > > > +     struct ext4_fc_add_range fc_ext;
> > > > > > > +     struct ext4_fc_del_range lrange;
> > > > > > > +     struct ext4_extent *ex;
> > > > > > > +     int ret;
> > > > > > > +
> > > > > > > +     mutex_lock(&ei->i_fc_lock);
> > > > > > > +     if (ei->i_fc_lblk_len == 0) {
> > > > > > > +             mutex_unlock(&ei->i_fc_lock);
> > > > > > > +             return 0;
> > > > > > > +     }
> > > > > > > +     old_blk_size = ei->i_fc_lblk_start;
> > > > > > > +     new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> > > > > > > +     ei->i_fc_lblk_len = 0;
> > > > > > > +     mutex_unlock(&ei->i_fc_lock);
> > > > > > > +
> > > > > > > +     cur_lblk_off = old_blk_size;
> > > > > > > +     jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> > > > > > > +               __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> > > > > > > +
> > > > > > > +     while (cur_lblk_off <= new_blk_size) {
> > > > > > > +             map.m_lblk = cur_lblk_off;
> > > > > > > +             map.m_len = new_blk_size - cur_lblk_off + 1;
> > > > > > > +             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > > > > > > +             if (ret < 0)
> > > > > > > +                     return -ECANCELED;
> > > > > >
> > > > > > So isn't this actually racy with a risk of stale data exposure? Consider a
> > > > > > situation like:
> > > > > >
> > > > > > Task 1:                         Task 2:
> > > > > > pwrite(file, buf, 8192, 0)
> > > > > > punch(file, 0, 4096)
> > > > > > fsync(file)
> > > > > >   writeout range 4096-8192
> > > > > >   fastcommit for inode range 0-8192
> > > > > >                                 pwrite(file, buf, 4096, 0)
> > > > > >     ext4_map_blocks(file)
> > > > > >       - reports that block at offset 0 is mapped so that is recorded in
> > > > > >         fastcommit record. But data for that is not written so after a
> > > > > >         crash we'd expose stale data in that block.
> > > > > >
> > > > > > Am I missing something?
> > > > > So the way this gets handled is before entering this function, the
> > > > > inode enters COMMITTING state (in ext4_fc_submit_inode_data_all
> > > > > function). Once in COMMITTING state, all the inodes on this inode get
> > > > > paused. Also, the commit path waits until all the ongoing updates on
> > > > > that inode are completed. Once they are completed, only then its data
> > > > > buffers are flushed and this ext4_map_blocks is called. So Task-2 here
> > > > > would have either completely finished or would wait until the end of
> > > > > this inode's commit. I realize that I should probably add more
> > > > > comments to make this more clearer in the code. But is handling it
> > > > > this way sufficient or am I missing any more cases?
> > > >
> > > > I see. In principle this should work. But I don't like that we have yet
> > > > another mechanism that needs to properly wrap inode changes to make
> > > > fastcommits work. And if we get it wrong somewhere, the breakage will be
> > > > almost impossible to notice until someone looses data after a power
> > > > failure. So it seems a bit fragile to me.
> > > Ack
> > > >
> > > > Ideally I think we would reuse the current transaction machinery for this
> > > > somehow (so that changes added through one transaction handle would behave
> > > > atomically wrt to fastcommits) but the details are not clear to me yet. I
> > > > need to think more about this...
> > > Yeah, I thought about that too. All we need to do is to atomically
> > > increment an "number of ongoing updates" counter on an inode, which
> > > could be done by existing ext4_journal_start()/stop() functions.
> > > However, the problem is that current ext4_journal_start()/stop() don't
> > > take inode as an argumen. I considered changing all the
> > > ext4_journal_start/stop calls but that would have inflated the size of
> > > this patch series which is already pretty big. But we can do that as a
> > > follow up cleanup. Does that sound reasonable?
> >
> > So ext4_journal_start() actually does take inode as an argument and we use
> > it quite some places (we also have ext4_journal_start_sb() which takes just
> > the superblock). What I'm not sure about is whether that's the inode you
> > want to protect for fastcommit purposes (would need some code auditing) or
> > whether there are not more inodes that need the protection for some
> > operations. ext4_journal_stop() could be handled by recording the inode in
> > the handle on ext4_journal_start() so ext4_journal_stop() then knows for
> > which inode to decrement the counter.
> >
> > Another possibility would be to increment the counter in
> > ext4_get_inode_loc() - that is a clear indication we are going to change
> > something in the inode. This also automatically handles the situation when
> > multiple inodes are modified by the operation or that proper inodes are
> > being protected. With decrementing the counter it is somewhat more
> > difficult. I think we can only do that at ext4_journal_stop() time so we
> > need to record in the handle for which inodes we acquired the update
> > references and drop them from ext4_journal_stop(). This would look as a
> > rather robust solution to me...
> ..the only problem here is that the same handle can be returned by
> multiple calls to ext4_journal_start(). That means a handle returned
> by ext4_journal_start() could be associated with multiple inodes. One

That is not quite true. ext4_journal_start() returns always a new handle
(unless that process has already a handle started, but nested handles are
not interesting for our case). Just multiple handles may refer to the same
transaction which is what confused you I guess. So each handle has 1:1
correspondence with a logical operation that needs to be performed
atomically and you can store your inode in handle_t (==
jbd2_journal_handle). Maybe to make the layering clear, you could add a
helper jbd2_associate_handle_with_inode() or something like that for the
storing and similar helper for fetching the inode.

Now I'm not certain that each logical operation has only single inode that
gets modified in it - e.g. rename may modify multiple inodes. Now I suspect
that you are marking the fs as inelligible in all the cases that modify
more inodes but it's difficult to be sure with your current scheme. That's
another way that should be automated by the scheme (which is easy enough -
you can mark fs as ineligible if handle already has different inode
associated with it in ext2_get_inode_loc()).

I don't think you need to play any games with fs private structure at this
point as you describe below...

								Honza


> way to deal with this would be to define ext4 specific handle
> structure. So, each call to ext4_journal_start would return a struct
> that looks like following:
> 
> struct ext4_handle {
>     handle_t *jbd2_handle;
>     struct inode *inode;
> }
> 
> So now on ext4_journal_stop(), we know for which inode we need to drop
> counters. The objects of this struct would either need to have their
> own kmem_cache or would need to be defined on stack (I think the
> latter is preferred). Should we do this? If we do this, this is going
> to be a pretty big change (will have to inspect all the existing
> callers of ext4_journal_start() and ext4_journal_stop()).
> 
> Another option would be to change the definition of handle_t such that
> on every call to jbd2_journal_start(), we get a new wrapper object
> that takes a reference on handle_t. Such an object would have a
> private pointer that FS can use the way it wants. This will be a
> relatively smaller change but it would impact OCFS too. But if we go
> this route, we can't avoid using a new kmem_cache, since now these new
> handle wrappers would need to be allocated inside of JBD2.
> 
> I kind of like the second option better because it keeps the change
> comparatively smaller. Wdyt? Also, Ted / Andreas, wdyt?
> 
> Thanks,
> Harshad
> 
> Thank
> >
> >                                                                 Honza
> > --
> > Jan Kara <jack@suse.com>
> > SUSE Labs, CR
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH v10 5/9] ext4: main fast-commit commit path
  2020-11-03 10:04               ` Jan Kara
@ 2020-11-03 18:31                 ` harshad shirwadkar
  0 siblings, 0 replies; 33+ messages in thread
From: harshad shirwadkar @ 2020-11-03 18:31 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andreas Dilger, Theodore Y. Ts'o, Ext4 Developers List,
	kernel test robot

On Tue, Nov 3, 2020 at 2:04 AM Jan Kara <jack@suse.cz> wrote:
>
> On Fri 30-10-20 09:45:10, harshad shirwadkar wrote:
> > > > > > > > +
> > > > > > > > +     return 0;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * Writes updated data ranges for the inode in question. Updates CRC.
> > > > > > > > + * Returns 0 on success, error otherwise.
> > > > > > > > + */
> > > > > > > > +static int ext4_fc_write_inode_data(struct inode *inode, u32 *crc)
> > > > > > > > +{
> > > > > > > > +     ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> > > > > > > > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > > > > > > > +     struct ext4_map_blocks map;
> > > > > > > > +     struct ext4_fc_add_range fc_ext;
> > > > > > > > +     struct ext4_fc_del_range lrange;
> > > > > > > > +     struct ext4_extent *ex;
> > > > > > > > +     int ret;
> > > > > > > > +
> > > > > > > > +     mutex_lock(&ei->i_fc_lock);
> > > > > > > > +     if (ei->i_fc_lblk_len == 0) {
> > > > > > > > +             mutex_unlock(&ei->i_fc_lock);
> > > > > > > > +             return 0;
> > > > > > > > +     }
> > > > > > > > +     old_blk_size = ei->i_fc_lblk_start;
> > > > > > > > +     new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> > > > > > > > +     ei->i_fc_lblk_len = 0;
> > > > > > > > +     mutex_unlock(&ei->i_fc_lock);
> > > > > > > > +
> > > > > > > > +     cur_lblk_off = old_blk_size;
> > > > > > > > +     jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> > > > > > > > +               __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> > > > > > > > +
> > > > > > > > +     while (cur_lblk_off <= new_blk_size) {
> > > > > > > > +             map.m_lblk = cur_lblk_off;
> > > > > > > > +             map.m_len = new_blk_size - cur_lblk_off + 1;
> > > > > > > > +             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > > > > > > > +             if (ret < 0)
> > > > > > > > +                     return -ECANCELED;
> > > > > > >
> > > > > > > So isn't this actually racy with a risk of stale data exposure? Consider a
> > > > > > > situation like:
> > > > > > >
> > > > > > > Task 1:                         Task 2:
> > > > > > > pwrite(file, buf, 8192, 0)
> > > > > > > punch(file, 0, 4096)
> > > > > > > fsync(file)
> > > > > > >   writeout range 4096-8192
> > > > > > >   fastcommit for inode range 0-8192
> > > > > > >                                 pwrite(file, buf, 4096, 0)
> > > > > > >     ext4_map_blocks(file)
> > > > > > >       - reports that block at offset 0 is mapped so that is recorded in
> > > > > > >         fastcommit record. But data for that is not written so after a
> > > > > > >         crash we'd expose stale data in that block.
> > > > > > >
> > > > > > > Am I missing something?
> > > > > > So the way this gets handled is before entering this function, the
> > > > > > inode enters COMMITTING state (in ext4_fc_submit_inode_data_all
> > > > > > function). Once in COMMITTING state, all the inodes on this inode get
> > > > > > paused. Also, the commit path waits until all the ongoing updates on
> > > > > > that inode are completed. Once they are completed, only then its data
> > > > > > buffers are flushed and this ext4_map_blocks is called. So Task-2 here
> > > > > > would have either completely finished or would wait until the end of
> > > > > > this inode's commit. I realize that I should probably add more
> > > > > > comments to make this more clearer in the code. But is handling it
> > > > > > this way sufficient or am I missing any more cases?
> > > > >
> > > > > I see. In principle this should work. But I don't like that we have yet
> > > > > another mechanism that needs to properly wrap inode changes to make
> > > > > fastcommits work. And if we get it wrong somewhere, the breakage will be
> > > > > almost impossible to notice until someone looses data after a power
> > > > > failure. So it seems a bit fragile to me.
> > > > Ack
> > > > >
> > > > > Ideally I think we would reuse the current transaction machinery for this
> > > > > somehow (so that changes added through one transaction handle would behave
> > > > > atomically wrt to fastcommits) but the details are not clear to me yet. I
> > > > > need to think more about this...
> > > > Yeah, I thought about that too. All we need to do is to atomically
> > > > increment an "number of ongoing updates" counter on an inode, which
> > > > could be done by existing ext4_journal_start()/stop() functions.
> > > > However, the problem is that current ext4_journal_start()/stop() don't
> > > > take inode as an argumen. I considered changing all the
> > > > ext4_journal_start/stop calls but that would have inflated the size of
> > > > this patch series which is already pretty big. But we can do that as a
> > > > follow up cleanup. Does that sound reasonable?
> > >
> > > So ext4_journal_start() actually does take inode as an argument and we use
> > > it quite some places (we also have ext4_journal_start_sb() which takes just
> > > the superblock). What I'm not sure about is whether that's the inode you
> > > want to protect for fastcommit purposes (would need some code auditing) or
> > > whether there are not more inodes that need the protection for some
> > > operations. ext4_journal_stop() could be handled by recording the inode in
> > > the handle on ext4_journal_start() so ext4_journal_stop() then knows for
> > > which inode to decrement the counter.
> > >
> > > Another possibility would be to increment the counter in
> > > ext4_get_inode_loc() - that is a clear indication we are going to change
> > > something in the inode. This also automatically handles the situation when
> > > multiple inodes are modified by the operation or that proper inodes are
> > > being protected. With decrementing the counter it is somewhat more
> > > difficult. I think we can only do that at ext4_journal_stop() time so we
> > > need to record in the handle for which inodes we acquired the update
> > > references and drop them from ext4_journal_stop(). This would look as a
> > > rather robust solution to me...
> > ..the only problem here is that the same handle can be returned by
> > multiple calls to ext4_journal_start(). That means a handle returned
> > by ext4_journal_start() could be associated with multiple inodes. One
>
> That is not quite true. ext4_journal_start() returns always a new handle
> (unless that process has already a handle started, but nested handles are
> not interesting for our case). Just multiple handles may refer to the same
> transaction which is what confused you I guess. So each handle has 1:1
> correspondence with a logical operation that needs to be performed
> atomically and you can store your inode in handle_t (==
> jbd2_journal_handle). Maybe to make the layering clear, you could add a
> helper jbd2_associate_handle_with_inode() or something like that for the
> storing and similar helper for fetching the inode.
Ah I see, I was definitely confused about that. Thanks for the
explanation. I understand it now.
>
> Now I'm not certain that each logical operation has only single inode that
> gets modified in it - e.g. rename may modify multiple inodes. Now I suspect
> that you are marking the fs as inelligible in all the cases that modify
> more inodes but it's difficult to be sure with your current scheme. That's
> another way that should be automated by the scheme (which is easy enough -
> you can mark fs as ineligible if handle already has different inode
> associated with it in ext2_get_inode_loc()).
Makes sense, I'll recheck what happens in case of multiple inodes
modification in 1 logical operation. But this solution sounds good to
me.
>
> I don't think you need to play any games with fs private structure at this
> point as you describe below...
I agree, thanks,
Harshad
>
>                                                                 Honza
>
>
> > way to deal with this would be to define ext4 specific handle
> > structure. So, each call to ext4_journal_start would return a struct
> > that looks like following:
> >
> > struct ext4_handle {
> >     handle_t *jbd2_handle;
> >     struct inode *inode;
> > }
> >
> > So now on ext4_journal_stop(), we know for which inode we need to drop
> > counters. The objects of this struct would either need to have their
> > own kmem_cache or would need to be defined on stack (I think the
> > latter is preferred). Should we do this? If we do this, this is going
> > to be a pretty big change (will have to inspect all the existing
> > callers of ext4_journal_start() and ext4_journal_stop()).
> >
> > Another option would be to change the definition of handle_t such that
> > on every call to jbd2_journal_start(), we get a new wrapper object
> > that takes a reference on handle_t. Such an object would have a
> > private pointer that FS can use the way it wants. This will be a
> > relatively smaller change but it would impact OCFS too. But if we go
> > this route, we can't avoid using a new kmem_cache, since now these new
> > handle wrappers would need to be allocated inside of JBD2.
> >
> > I kind of like the second option better because it keeps the change
> > comparatively smaller. Wdyt? Also, Ted / Andreas, wdyt?
> >
> > Thanks,
> > Harshad
> >
> > Thank
> > >
> > >                                                                 Honza
> > > --
> > > Jan Kara <jack@suse.com>
> > > SUSE Labs, CR
> --
> Jan Kara <jack@suse.com>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2020-11-03 18:31 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-15 20:37 [PATCH v10 0/9] Add fast commits in Ext4 file system Harshad Shirwadkar
2020-10-15 20:37 ` [PATCH v10 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
2020-10-21 16:04   ` Jan Kara
2020-10-21 17:25     ` harshad shirwadkar
2020-10-22 13:06       ` Jan Kara
2020-10-15 20:37 ` [PATCH v10 2/9] ext4: add fast_commit feature and handling for extended mount options Harshad Shirwadkar
2020-10-21 16:18   ` Jan Kara
2020-10-21 17:31     ` harshad shirwadkar
2020-10-22 13:09       ` Jan Kara
2020-10-26 16:40         ` harshad shirwadkar
2020-10-15 20:37 ` [PATCH v10 3/9] ext4 / jbd2: add fast commit initialization Harshad Shirwadkar
2020-10-21 20:00   ` Jan Kara
2020-10-29 23:28     ` harshad shirwadkar
2020-10-30 15:40       ` Jan Kara
2020-10-15 20:37 ` [PATCH v10 4/9] jbd2: add fast commit machinery Harshad Shirwadkar
2020-10-22 10:16   ` Jan Kara
2020-10-23 17:17     ` harshad shirwadkar
2020-10-26  9:03       ` Jan Kara
2020-10-26 16:34         ` harshad shirwadkar
2020-10-15 20:37 ` [PATCH v10 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
2020-10-23 10:30   ` Jan Kara
2020-10-26 20:55     ` harshad shirwadkar
2020-10-27 14:29       ` Jan Kara
2020-10-27 17:38         ` harshad shirwadkar
2020-10-30 15:28           ` Jan Kara
2020-10-30 16:45             ` harshad shirwadkar
2020-11-03 10:04               ` Jan Kara
2020-11-03 18:31                 ` harshad shirwadkar
2020-10-27 18:45         ` Theodore Y. Ts'o
2020-10-15 20:37 ` [PATCH v10 6/9] jbd2: fast commit recovery path Harshad Shirwadkar
2020-10-15 20:37 ` [PATCH v10 7/9] ext4: " Harshad Shirwadkar
2020-10-15 20:38 ` [PATCH v10 8/9] ext4: add a mount opt to forcefully turn fast commits on Harshad Shirwadkar
2020-10-15 20:38 ` [PATCH v10 9/9] ext4: add fast commit stats in procfs Harshad Shirwadkar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).