Linux-ext4 Archive on lore.kernel.org
* [PATCH v9 0/9] ext4: add fast commits feature
@ 2020-09-19  0:54 Harshad Shirwadkar
  2020-09-19  0:54 ` [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
                   ` (8 more replies)
  0 siblings, 9 replies; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch series adds support for fast commits, a simplified version
of the scheme proposed by Park and Shin in their paper,
"iJournaling: Fine-Grained Journaling for Improving the Latency of
Fsync System Call"[1]. The basic idea of fast commits is to make JBD2
give the client file system an opportunity to perform a faster
commit. JBD2 falls back to traditional commits only if the file
system cannot perform such a commit operation.

Because JBD2 operates at block granularity, for every file system
metadata update it writes all the changed blocks to the journal at
commit time. This is inefficient because updates to some of the
blocks that JBD2 commits are derivable from other blocks. For
example, if a new extent is added to an inode, then corresponding
updates to the inode table, the block bitmap, the group descriptor and
the superblock can be derived based on just the extent information and
the corresponding inode information. So, if we take this relationship
between blocks into account and replay the journalled blocks smartly,
we could increase performance of file system commits significantly.

The fast commits introduced in this patch series make two main
contributions:

(1) Making JBD2 fast commit aware, so that clients of JBD2 can
    implement fast commits

(2) Adding support in ext4 to use JBD2's new interfaces and implement
    fast commits

Fast commit operation
---------------------

The new fast commit operation works by tracking file system deltas
since the last commit in memory and committing these deltas to disk
during fsync(). Ext4 maintains directory entry updates in an in-memory
queue. The inodes that have changed since the last commit are also
maintained in an in-memory queue. These queues are flushed to disk
at commit time in a log-structured way. The fast commit area is
organized as a log of TAG-LENGTH-VALUE tuples with a special "tail"
tag marking the end of a commit. If an operation prevents a fast
commit from happening, the commit code falls back to a JBD2 full
commit, which invalidates all the fast commits since the last full
commit. JBD2 provides new jbd2_fc_start() and jbd2_fc_stop() functions
to co-ordinate between JBD2's full commits and the client file
system's fast commits.
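
As a rough illustration of the log layout described above, here is a
minimal userspace sketch of appending TAG-LENGTH-VALUE records and
counting the commits closed by a "tail" tag. The names (fc_log,
fc_append, fc_count_commits), the tag values and the 4-byte header are
assumptions made for this example only; they are not the on-disk
format or the kernel code from this series.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag values for illustration; not ext4's on-disk constants. */
enum { FC_TAG_ADD_RANGE = 1, FC_TAG_CREAT = 2, FC_TAG_TAIL = 3 };

/* Each record starts with a fixed header followed by `len` value bytes. */
struct fc_tl {
	uint16_t tag;
	uint16_t len;
};

/* A small fixed-size buffer standing in for the fast commit area. */
struct fc_log {
	uint8_t buf[256];
	size_t used;
};

/* Append one TLV; returns 0 on success, -1 if the log area is full. */
static int fc_append(struct fc_log *log, uint16_t tag,
		     const void *val, uint16_t len)
{
	struct fc_tl tl = { .tag = tag, .len = len };

	assert(log->used <= sizeof(log->buf));
	if (log->used + sizeof(tl) + len > sizeof(log->buf))
		return -1;	/* a caller would fall back to a full commit */
	memcpy(log->buf + log->used, &tl, sizeof(tl));
	log->used += sizeof(tl);
	if (len)
		memcpy(log->buf + log->used, val, len);
	log->used += len;
	return 0;
}

/* Count complete commits, i.e. runs of records closed by a tail tag. */
static int fc_count_commits(const struct fc_log *log)
{
	size_t off = 0;
	int commits = 0;

	while (off + sizeof(struct fc_tl) <= log->used) {
		struct fc_tl tl;

		memcpy(&tl, log->buf + off, sizeof(tl));
		if (off + sizeof(tl) + tl.len > log->used)
			break;	/* truncated record: stop scanning */
		if (tl.tag == FC_TAG_TAIL)
			commits++;
		off += sizeof(tl) + tl.len;
	}
	return commits;
}
```

Records that are not yet followed by a tail tag do not count as a
commit in this sketch, which mirrors the role the tail tag plays in
the description above.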

Recovery operation
------------------

During recovery, JBD2 lets the client file system handle fast commit
blocks as it wants. After performing transaction replay, JBD2 invokes
the client file system's recovery path handler. During the scan phase,
Ext4's recovery path handler determines the validity of the fast
commit log by checking the CRC and TID of each fast commit. During the
replay phase, the recovery handler replays tags one by one. These
replay handlers are idempotent. Thus, if we crash in the middle of
recovery, Ext4 can restart the log replay and reach the identical
final state.
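
To make the scan / replay split and the idempotency property a bit
more concrete, here is a small userspace sketch. Everything in it is
hypothetical: the record types, the 64-block bitmap standing in for an
inode's mapping, and the toy checksum (used in place of the real CRC)
are assumptions chosen for illustration, not the structures used by
this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory record types; not the ext4 on-disk layout. */
enum fc_op { FC_ADD_RANGE, FC_DEL_RANGE };

struct fc_rec {
	enum fc_op op;
	uint32_t start;	/* first logical block */
	uint32_t len;	/* number of blocks */
};

struct fc_commit {
	uint32_t tid;		/* transaction ID this commit applies after */
	uint32_t csum;		/* checksum over the records */
	const struct fc_rec *recs;
	size_t nrecs;
};

/* Toy checksum standing in for the real CRC; an assumption of this sketch. */
static uint32_t fc_csum(const struct fc_rec *recs, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum = sum * 31 + recs[i].op + recs[i].start * 7 + recs[i].len;
	return sum;
}

/* Scan phase: a commit is valid only if both TID and checksum match. */
static int fc_scan_valid(const struct fc_commit *c, uint32_t expected_tid)
{
	return c->tid == expected_tid && c->csum == fc_csum(c->recs, c->nrecs);
}

/*
 * Replay phase: each record sets or clears bits in a 64-block map.
 * Setting an already-set bit (or clearing a clear one) is a no-op,
 * which is what makes replaying the same log twice harmless.
 */
static void fc_replay(const struct fc_commit *c, uint64_t *map)
{
	assert(map != NULL);
	for (size_t i = 0; i < c->nrecs; i++) {
		const struct fc_rec *r = &c->recs[i];

		for (uint32_t b = r->start; b < r->start + r->len && b < 64; b++) {
			if (r->op == FC_ADD_RANGE)
				*map |= UINT64_C(1) << b;
			else
				*map &= ~(UINT64_C(1) << b);
		}
	}
}
```

Because each record describes the final state of a range rather than
a relative change, running fc_replay() again after an interrupted
recovery ends in the same map.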

Testing
-------

e2fsprogs was updated to set fast commit feature flag and to ignore
fast commit blocks during e2fsck.

https://github.com/harshadjs/e2fsprogs.git

No regressions were introduced in smoke tests.

How to use this feature?
------------------------

This feature should not be used in production until corresponding
e2fsprogs changes are ready. These changes are being worked on at -
https://github.com/harshadjs/e2fsprogs.git. This feature can be set at
mkfs time. For testing purposes, this feature can also be enabled by
passing a mount time flag "fc_debug_force". This mount flag should
only be used for testing purposes and never for production.

Once enabled, fast commit information can be viewed in
/proc/fs/ext4/<dev>/fc_info.

Performance Evaluation
----------------------

Ext4 performance was compared with and without fast commits using
fsmark, dbench and filebench benchmarks with local file system and
over NFS. This is the summary of results:

|-----------+-------------------+----------------+----------------+--------|
| Benchmark | Config            | No FC          | FC             | % diff |
|-----------+-------------------+----------------+----------------+--------|
| Fsmark    | Local, 8 threads  | 1475.1 files/s | 4309.8 files/s | +192.2 |
| Fsmark    | NFS, 4 threads    | 299.4 files/s  | 409.45 files/s |  +36.8 |
|-----------+-------------------+----------------+----------------+--------|
| Dbench    | Local, 2 procs    | 33.32 MB/s     | 70.87 MB/s     | +112.7 |
| Dbench    | NFS, 2 procs      | 8.84 MB/s      | 11.88 MB/s     |  +34.4 |
|-----------+-------------------+----------------+----------------+--------|
| Dbench    | Local, 10 procs   | 90.48 MB/s     | 110.12 MB/s    |  +21.7 |
| Dbench    | NFS, 10 procs     | 34.62 MB/s     | 52.83 MB/s     |  +52.6 |
|-----------+-------------------+----------------+----------------+--------|
| FileBench | Local, 16 threads | 10442.3 ops/s  | 18617.8 ops/s  |  +78.3 |
|           | (Varmail)         |                |                |        |
| FileBench | NFS, 16 threads   | 1531.3 ops/s   | 2681.5 ops/s   |  +75.1 |
|           | (Varmail)         |                |                |        |
|-----------+-------------------+----------------+----------------+--------|

NFS Performance Evaluation
--------------------------

NFS performs the commit_metadata operation very frequently, which
caused a Linux kernel untar operation to generate over ~180 journal
commits / second. The same untar operation results in 2.5 commits /
second. However, as the above table shows, the benefits that NFS sees
aren't as great as those on the local disk. The reason is network
latency. Before fast commits, NFS was bottlenecked on journal commit
performance. However, with fast commits reducing that time
significantly, NFS performance now gets bottlenecked on network
latency. NFS running on networks with lower latency (< 300 us) will
see better performance than the NFS numbers reported above.

DAX Support
-----------

Fast commits help improve Ext4 performance on DAX devices
too. However, there is an opportunity to do even better. Collaborating
with Rohan Kadekodi (rak@cs.utexas.edu) from UT Austin and Saurabh
Kadekodi (saukad@cs.cmu.edu) from CMU, we have added synchronous fast
commits which write at byte granularity (instead of block
granularity). This is WIP available at -
https://github.com/harshadjs/linux/tree/fc-pmem-renewed. Done this
way, we get stronger guarantees than current Ext4 very cheaply on
persistent memory devices.

Changes since V8
----------------

* Added procfs tracking for fast commits
* Improved recovery path
* Added mount option to turn fast commits on for testing purpose
* A few bugfixes
* Rebased on top of ext4 dev branch

[1] iJournaling: Fine-Grained Journaling for Improving the Latency of
Fsync System Call
https://www.usenix.org/conference/atc17/technical-sessions/presentation/park

Harshad Shirwadkar (9):
  doc: update ext4 and journalling docs to include fast commit feature
  ext4: add fast_commit feature and handling for extended mount options
  ext4 / jbd2: add fast commit initialization
  jbd2: add fast commit machinery
  ext4: main fast-commit commit path
  jbd2: fast commit recovery path
  ext4: fast commit recovery path
  ext4: add a mount opt to forcefully turn fast commits on
  ext4: add fast commit stats in procfs

 Documentation/filesystems/ext4/journal.rst |   66 +
 Documentation/filesystems/journalling.rst  |   28 +
 fs/ext4/Makefile                           |    2 +-
 fs/ext4/acl.c                              |    2 +
 fs/ext4/balloc.c                           |    7 +-
 fs/ext4/ext4.h                             |   95 +
 fs/ext4/ext4_jbd2.c                        |    2 +-
 fs/ext4/extents.c                          |  309 ++-
 fs/ext4/extents_status.c                   |   24 +
 fs/ext4/fast_commit.c                      | 2149 ++++++++++++++++++++
 fs/ext4/fast_commit.h                      |  160 ++
 fs/ext4/file.c                             |   10 +-
 fs/ext4/fsync.c                            |    2 +-
 fs/ext4/ialloc.c                           |  165 +-
 fs/ext4/inode.c                            |  130 +-
 fs/ext4/ioctl.c                            |   22 +-
 fs/ext4/mballoc.c                          |  208 +-
 fs/ext4/namei.c                            |  185 +-
 fs/ext4/super.c                            |   81 +-
 fs/ext4/sysfs.c                            |    2 +
 fs/ext4/xattr.c                            |    3 +
 fs/jbd2/commit.c                           |   61 +
 fs/jbd2/journal.c                          |  238 ++-
 fs/jbd2/recovery.c                         |   59 +-
 include/linux/jbd2.h                       |   91 +-
 include/trace/events/ext4.h                |  228 ++-
 26 files changed, 4164 insertions(+), 165 deletions(-)
 create mode 100644 fs/ext4/fast_commit.c
 create mode 100644 fs/ext4/fast_commit.h

-- 
2.28.0.681.g6f77f65b4e-goog



* [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature
  2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
@ 2020-09-19  0:54 ` Harshad Shirwadkar
  2020-09-22 17:50   ` Darrick J. Wong
  2020-10-09 18:28   ` Theodore Y. Ts'o
  2020-09-19  0:54 ` [PATCH v9 2/9] ext4: add fast_commit feature and handling for extended mount options Harshad Shirwadkar
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch adds necessary documentation for fast commits.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 Documentation/filesystems/ext4/journal.rst | 66 ++++++++++++++++++++++
 Documentation/filesystems/journalling.rst  | 28 +++++++++
 2 files changed, 94 insertions(+)

diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
index ea613ee701f5..c2e4d010a201 100644
--- a/Documentation/filesystems/ext4/journal.rst
+++ b/Documentation/filesystems/ext4/journal.rst
@@ -28,6 +28,17 @@ metadata are written to disk through the journal. This is slower but
 safest. If ``data=writeback``, dirty data blocks are not flushed to the
 disk before the metadata are written to disk through the journal.
 
+In case of ``data=ordered`` mode, Ext4 also supports fast commits which
+help reduce commit latency significantly. The default ``data=ordered``
+mode works by logging metadata blocks to the journal. In fast commit
+mode, Ext4 only stores the minimal delta needed to recreate the
+affected metadata in fast commit space that is shared with JBD2.
+Once the fast commit area fills in or if fast commit is not possible
+or if JBD2 commit timer goes off, Ext4 performs a traditional full commit.
+A full commit invalidates all the fast commits that happened before
+it and thus it makes the fast commit area empty for further fast
+commits. This feature needs to be enabled at compile time.
+
 The journal inode is typically inode 8. The first 68 bytes of the
 journal inode are replicated in the ext4 superblock. The journal itself
 is normal (but hidden) file within the filesystem. The file usually
@@ -609,3 +620,58 @@ bytes long (but uses a full block):
      - h\_commit\_nsec
      - Nanoseconds component of the above timestamp.
 
+Fast commits
+~~~~~~~~~~~~
+
+Fast commit area is organized as a log of tag length values. Each TLV has
+a ``struct ext4_fc_tl`` in the beginning which stores the tag and the length
+of the entire field. It is followed by variable length tag specific value.
+Here is the list of supported tags and their meanings:
+
+.. list-table::
+   :widths: 8 20 20 32
+   :header-rows: 1
+
+   * - Tag
+     - Meaning
+     - Value struct
+     - Description
+   * - EXT4_FC_TAG_HEAD
+     - Fast commit area header
+     - ``struct ext4_fc_head``
+     - Stores the TID of the transaction after which these fast commits should
+       be applied.
+   * - EXT4_FC_TAG_ADD_RANGE
+     - Add extent to inode
+     - ``struct ext4_fc_add_range``
+     - Stores the inode number and extent to be added in this inode
+   * - EXT4_FC_TAG_DEL_RANGE
+     - Remove logical offsets from inode
+     - ``struct ext4_fc_del_range``
+     - Stores the inode number and the logical offset range that needs to be
+       removed
+   * - EXT4_FC_TAG_CREAT
+     - Create directory entry for a newly created file
+     - ``struct ext4_fc_dentry_info``
+     - Stores the parent inode number, inode number and directory entry of the
+       newly created file
+   * - EXT4_FC_TAG_LINK
+     - Link a directory entry to an inode
+     - ``struct ext4_fc_dentry_info``
+     - Stores the parent inode number, inode number and directory entry
+   * - EXT4_FC_TAG_UNLINK
+     - Unlink a directory entry of an inode
+     - ``struct ext4_fc_dentry_info``
+     - Stores the parent inode number, inode number and directory entry
+
+   * - EXT4_FC_TAG_PAD
+     - Padding (unused area)
+     - None
+     - Unused bytes in the fast commit area.
+
+   * - EXT4_FC_TAG_TAIL
+     - Mark the end of a fast commit
+     - ``struct ext4_fc_tail``
+     - Stores the TID of the commit, CRC of the fast commit of which this tag
+       represents the end of
+
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index 58ce6b395206..a9817220dc9b 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -132,6 +132,34 @@ The opportunities for abuse and DOS attacks with this should be obvious,
 if you allow unprivileged userspace to trigger codepaths containing
 these calls.
 
+Fast commits
+~~~~~~~~~~~~
+
+JBD2 also allows you to perform file-system specific delta commits known as
+fast commits. In order to use fast commits, you first need to call
+:c:func:`jbd2_fc_init` and tell how many blocks at the end of journal
+area should be reserved for fast commits. Along with that, you will also need
+to set the following callbacks that perform the corresponding work:
+
+`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
+fast commit.
+
+`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
+blocks.
+
+File system is free to perform fast commits as and when it wants as long as it
+gets permission from JBD2 to do so by calling the function
+:c:func:`jbd2_fc_start()`. Once a fast commit is done, the client
+file system should tell JBD2 about it by calling :c:func:`jbd2_fc_stop()`.
+If file system wants JBD2 to perform a full commit immediately after stopping
+the fast commit it can do so by calling :c:func:`jbd2_fc_stop_do_commit()`.
+This is useful if fast commit operation fails for some reason and the only way
+to guarantee consistency is for JBD2 to perform the full traditional commit.
+
+JBD2 also provides helper functions to manage fast commit buffers. File system can use
+:c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()` to allocate
+and wait on IO completion of fast commit buffers.
+
 Summary
 ~~~~~~~
 
-- 
2.28.0.681.g6f77f65b4e-goog



* [PATCH v9 2/9] ext4: add fast_commit feature and handling for extended mount options
  2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
  2020-09-19  0:54 ` [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
@ 2020-09-19  0:54 ` Harshad Shirwadkar
  2020-10-09 17:58   ` Theodore Y. Ts'o
  2020-09-19  0:54 ` [PATCH v9 3/9] ext4 / jbd2: add fast commit initialization Harshad Shirwadkar
                   ` (6 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

We are running out of mount option bits. Add handling for using
s_mount_opt2. Add ext4 and jbd2 fast commit feature flags and also add
the ability to turn the fast commit feature on / off in Ext4.
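
For readers skimming the diff, the idea of routing an option bit to
either the first or the second option word can be sketched in
userspace as follows. This is a simplified illustration, not the
kernel's handle_mount_opt(): struct sb_opts and apply_opt() are
hypothetical names, and the MOPT_SET / MOPT_CLEAR values are
assumptions made for the example (only MOPT_2 = 0x1000 is taken from
this patch).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative flag values; only MOPT_2 matches the value in the patch. */
#define MOPT_SET	0x0001
#define MOPT_CLEAR	0x0002
#define MOPT_2		0x1000	/* option lives in the second word */

/* Hypothetical stand-in for the two option words in the superblock info. */
struct sb_opts {
	uint32_t mount_opt;	/* first option word */
	uint32_t mount_opt2;	/* second option word */
};

/* Apply one option bit to whichever word the flags select. */
static void apply_opt(struct sb_opts *o, uint32_t bit, unsigned int flags)
{
	uint32_t *word;

	assert(o != NULL);
	word = (flags & MOPT_2) ? &o->mount_opt2 : &o->mount_opt;
	if (flags & MOPT_SET)
		*word |= bit;
	else if (flags & MOPT_CLEAR)
		*word &= ~bit;
}
```

With this shape, an option such as "no_fc" only needs a table entry
carrying MOPT_CLEAR | MOPT_2; the same bit value can then be reused in
both words without clashing.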

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h       |  4 ++++
 fs/ext4/super.c      | 24 +++++++++++++++++++-----
 include/linux/jbd2.h |  5 ++++-
 3 files changed, 27 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 523e00d7b392..82e889d5c2ed 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1214,6 +1214,8 @@ struct ext4_inode_info {
 #define EXT4_MOUNT2_EXPLICIT_JOURNAL_CHECKSUM	0x00000008 /* User explicitly
 						specified journal checksum */
 
+#define EXT4_MOUNT2_JOURNAL_FAST_COMMIT	0x00000010 /* Journal fast commit */
+
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
 #define set_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt |= \
@@ -1814,6 +1816,7 @@ static inline bool ext4_verity_in_progress(struct inode *inode)
 #define EXT4_FEATURE_COMPAT_RESIZE_INODE	0x0010
 #define EXT4_FEATURE_COMPAT_DIR_INDEX		0x0020
 #define EXT4_FEATURE_COMPAT_SPARSE_SUPER2	0x0200
+#define EXT4_FEATURE_COMPAT_FAST_COMMIT		0x0400
 #define EXT4_FEATURE_COMPAT_STABLE_INODES	0x0800
 
 #define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
@@ -1916,6 +1919,7 @@ EXT4_FEATURE_COMPAT_FUNCS(xattr,		EXT_ATTR)
 EXT4_FEATURE_COMPAT_FUNCS(resize_inode,		RESIZE_INODE)
 EXT4_FEATURE_COMPAT_FUNCS(dir_index,		DIR_INDEX)
 EXT4_FEATURE_COMPAT_FUNCS(sparse_super2,	SPARSE_SUPER2)
+EXT4_FEATURE_COMPAT_FUNCS(fast_commit,		FAST_COMMIT)
 EXT4_FEATURE_COMPAT_FUNCS(stable_inodes,	STABLE_INODES)
 
 EXT4_FEATURE_RO_COMPAT_FUNCS(sparse_super,	SPARSE_SUPER)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 13bdddc081e0..b62858ee420b 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1526,7 +1526,7 @@ enum {
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
-	Opt_prefetch_block_bitmaps,
+	Opt_prefetch_block_bitmaps, Opt_no_fc,
 };
 
 static const match_table_t tokens = {
@@ -1613,6 +1613,7 @@ static const match_table_t tokens = {
 	{Opt_init_itable, "init_itable=%u"},
 	{Opt_init_itable, "init_itable"},
 	{Opt_noinit_itable, "noinit_itable"},
+	{Opt_no_fc, "no_fc"},
 	{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption"},
@@ -1738,6 +1739,7 @@ static int clear_qf_name(struct super_block *sb, int qtype)
 #define MOPT_EXT4_ONLY	(MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING	0x0400
 #define MOPT_SKIP	0x0800
+#define	MOPT_2		0x1000
 
 static const struct mount_opts {
 	int	token;
@@ -1838,6 +1840,8 @@ static const struct mount_opts {
 	{Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
 	{Opt_prefetch_block_bitmaps, EXT4_MOUNT_PREFETCH_BLOCK_BITMAPS,
 	 MOPT_SET},
+	{Opt_no_fc, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
+	 MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
 	{Opt_err, 0, 0}
 };
 
@@ -2207,10 +2211,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 			WARN_ON(1);
 			return -1;
 		}
-		if (arg != 0)
-			sbi->s_mount_opt |= m->mount_opt;
-		else
-			sbi->s_mount_opt &= ~m->mount_opt;
+		if (m->flags & MOPT_2) {
+			if (arg != 0)
+				sbi->s_mount_opt2 |= m->mount_opt;
+			else
+				sbi->s_mount_opt2 &= ~m->mount_opt;
+		} else {
+			if (arg != 0)
+				sbi->s_mount_opt |= m->mount_opt;
+			else
+				sbi->s_mount_opt &= ~m->mount_opt;
+		}
 	}
 	return 1;
 }
@@ -3924,6 +3935,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 #ifdef CONFIG_EXT4_FS_POSIX_ACL
 	set_opt(sb, POSIX_ACL);
 #endif
+	if (ext4_has_feature_fast_commit(sb))
+		set_opt2(sb, JOURNAL_FAST_COMMIT);
 	/* don't forget to enable journal_csum when metadata_csum is enabled. */
 	if (ext4_has_metadata_csum(sb))
 		set_opt(sb, JOURNAL_CHECKSUM);
@@ -4576,6 +4589,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		sbi->s_def_mount_opt &= ~EXT4_MOUNT_JOURNAL_CHECKSUM;
 		clear_opt(sb, JOURNAL_CHECKSUM);
 		clear_opt(sb, DATA_FLAGS);
+		clear_opt2(sb, JOURNAL_FAST_COMMIT);
 		sbi->s_journal = NULL;
 		needs_recovery = 0;
 		goto no_journal;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index a756a4cdf939..f438257d7f31 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -288,6 +288,7 @@ typedef struct journal_superblock_s
 #define JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT	0x00000004
 #define JBD2_FEATURE_INCOMPAT_CSUM_V2		0x00000008
 #define JBD2_FEATURE_INCOMPAT_CSUM_V3		0x00000010
+#define JBD2_FEATURE_INCOMPAT_FAST_COMMIT	0x00000020
 
 /* See "journal feature predicate functions" below */
 
@@ -298,7 +299,8 @@ typedef struct journal_superblock_s
 					JBD2_FEATURE_INCOMPAT_64BIT | \
 					JBD2_FEATURE_INCOMPAT_ASYNC_COMMIT | \
 					JBD2_FEATURE_INCOMPAT_CSUM_V2 | \
-					JBD2_FEATURE_INCOMPAT_CSUM_V3)
+					JBD2_FEATURE_INCOMPAT_CSUM_V3 | \
+					JBD2_FEATURE_INCOMPAT_FAST_COMMIT)
 
 #ifdef __KERNEL__
 
@@ -1239,6 +1241,7 @@ JBD2_FEATURE_INCOMPAT_FUNCS(64bit,		64BIT)
 JBD2_FEATURE_INCOMPAT_FUNCS(async_commit,	ASYNC_COMMIT)
 JBD2_FEATURE_INCOMPAT_FUNCS(csum2,		CSUM_V2)
 JBD2_FEATURE_INCOMPAT_FUNCS(csum3,		CSUM_V3)
+JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit,	FAST_COMMIT)
 
 /*
  * Journal flag definitions
-- 
2.28.0.681.g6f77f65b4e-goog



* [PATCH v9 3/9] ext4 / jbd2: add fast commit initialization
  2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
  2020-09-19  0:54 ` [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
  2020-09-19  0:54 ` [PATCH v9 2/9] ext4: add fast_commit feature and handling for extended mount options Harshad Shirwadkar
@ 2020-09-19  0:54 ` Harshad Shirwadkar
  2020-09-19 15:22   ` kernel test robot
  2020-10-09 16:10   ` Ritesh Harjani
  2020-09-19  0:54 ` [PATCH v9 4/9] jbd2: add fast commit machinery Harshad Shirwadkar
                   ` (5 subsequent siblings)
  8 siblings, 2 replies; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch adds fast commit area trackers in the journal_t
structure. These are initialized via the jbd2_fc_init() routine that
this patch adds. This patch also adds ext4/fast_commit.c and
ext4/fast_commit.h files for fast commit code that will be added in
subsequent patches in this series.
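
The way the fast commit area is carved out of the end of the journal
can be sketched as below. This is a userspace approximation of the
arithmetic in journal_reset() / load_superblock() from this patch:
struct fc_layout and fc_split() are hypothetical names, and the extra
check that the fast commit area is smaller than the journal is an
assumption added for the example, not something this patch does.

```c
#include <assert.h>

/* Hypothetical mirror of the journal_t fields touched by this patch. */
struct fc_layout {
	unsigned long j_first;
	unsigned long j_last;		/* end of the regular journal area */
	unsigned long j_first_fc;	/* first fast commit block */
	unsigned long j_last_fc;	/* end of the fast commit area */
	unsigned long j_free;		/* free blocks in the regular area */
};

/* Split [first, last] into a regular area and a trailing fast commit area. */
static struct fc_layout fc_split(unsigned long first, unsigned long last,
				 unsigned long num_fc_blks)
{
	struct fc_layout l = { .j_first = first };

	assert(last > first);
	if (num_fc_blks > 0 && num_fc_blks < last - first) {
		l.j_last_fc = last;
		l.j_last = last - num_fc_blks;
		l.j_first_fc = l.j_last + 1;
	} else {
		/* No fast commit area: the whole journal is regular. */
		l.j_last = last;
	}
	l.j_free = l.j_last - l.j_first;
	return l;
}
```

For example, with first = 1, last = 1024 and 256 fast commit blocks,
the regular area ends at block 768 and the fast commit area starts at
block 769.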

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/Makefile      |  2 +-
 fs/ext4/ext4.h        |  4 ++++
 fs/ext4/fast_commit.c | 20 +++++++++++++++++
 fs/ext4/fast_commit.h |  9 ++++++++
 fs/ext4/super.c       |  1 +
 fs/jbd2/journal.c     | 52 ++++++++++++++++++++++++++++++++++++++-----
 include/linux/jbd2.h  | 39 ++++++++++++++++++++++++++++++++
 7 files changed, 121 insertions(+), 6 deletions(-)
 create mode 100644 fs/ext4/fast_commit.c
 create mode 100644 fs/ext4/fast_commit.h

diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
index 2e42f47a7f98..49e7af6cc93f 100644
--- a/fs/ext4/Makefile
+++ b/fs/ext4/Makefile
@@ -10,7 +10,7 @@ ext4-y	:= balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
 		indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
 		mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \
 		super.o symlink.o sysfs.o xattr.o xattr_hurd.o xattr_trusted.o \
-		xattr_user.o
+		xattr_user.o fast_commit.o
 
 ext4-$(CONFIG_EXT4_FS_POSIX_ACL)	+= acl.o
 ext4-$(CONFIG_EXT4_FS_SECURITY)		+= xattr_security.o
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 82e889d5c2ed..9af3971dd12e 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -964,6 +964,7 @@ do {									       \
 #endif /* defined(__KERNEL__) || defined(__linux__) */
 
 #include "extents_status.h"
+#include "fast_commit.h"
 
 /*
  * Lock subclasses for i_data_sem in the ext4_inode_info structure.
@@ -2679,6 +2680,9 @@ extern int ext4_init_inode_table(struct super_block *sb,
 				 ext4_group_t group, int barrier);
 extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
 
+/* fast_commit.c */
+
+void ext4_fc_init(struct super_block *sb, journal_t *journal);
 /* mballoc.c */
 extern const struct seq_operations ext4_mb_seq_groups_ops;
 extern long ext4_mb_stats;
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
new file mode 100644
index 000000000000..0dad8bdb1253
--- /dev/null
+++ b/fs/ext4/fast_commit.c
@@ -0,0 +1,20 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * fs/ext4/fast_commit.c
+ *
+ * Written by Harshad Shirwadkar <harshadshirwadkar@gmail.com>
+ *
+ * Ext4 fast commits routines.
+ */
+#include "ext4_jbd2.h"
+
+void ext4_fc_init(struct super_block *sb, journal_t *journal)
+{
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
+		return;
+	if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
+		pr_warn("Error while enabling fast commits, turning off.");
+		ext4_clear_feature_fast_commit(sb);
+	}
+}
diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
new file mode 100644
index 000000000000..8362bf5e6e00
--- /dev/null
+++ b/fs/ext4/fast_commit.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __FAST_COMMIT_H__
+#define __FAST_COMMIT_H__
+
+/* Number of blocks in journal area to allocate for fast commits */
+#define EXT4_NUM_FC_BLKS		256
+
+#endif /* __FAST_COMMIT_H__ */
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b62858ee420b..94aaaf940449 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4962,6 +4962,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
 	journal->j_commit_interval = sbi->s_commit_interval;
 	journal->j_min_batch_time = sbi->s_min_batch_time;
 	journal->j_max_batch_time = sbi->s_max_batch_time;
+	ext4_fc_init(sb, journal);
 
 	write_lock(&journal->j_state_lock);
 	if (test_opt(sb, BARRIER))
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 17fdc482f554..736a1736619f 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -1179,6 +1179,14 @@ static journal_t *journal_init_common(struct block_device *bdev,
 	if (!journal->j_wbuf)
 		goto err_cleanup;
 
+	if (journal->j_fc_wbufsize > 0) {
+		journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
+					sizeof(struct buffer_head *),
+					GFP_KERNEL);
+		if (!journal->j_fc_wbuf)
+			goto err_cleanup;
+	}
+
 	bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
 	if (!bh) {
 		pr_err("%s: Cannot get buffer for journal superblock\n",
@@ -1192,11 +1200,22 @@ static journal_t *journal_init_common(struct block_device *bdev,
 
 err_cleanup:
 	kfree(journal->j_wbuf);
+	kfree(journal->j_fc_wbuf);
 	jbd2_journal_destroy_revoke(journal);
 	kfree(journal);
 	return NULL;
 }
 
+int jbd2_fc_init(journal_t *journal, int num_fc_blks)
+{
+	journal->j_fc_wbufsize = num_fc_blks;
+	journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
+				sizeof(struct buffer_head *), GFP_KERNEL);
+	if (!journal->j_fc_wbuf)
+		return -ENOMEM;
+	return 0;
+}
+
 /* jbd2_journal_init_dev and jbd2_journal_init_inode:
  *
  * Create a journal structure assigned some fixed set of disk blocks to
@@ -1314,11 +1333,20 @@ static int journal_reset(journal_t *journal)
 	}
 
 	journal->j_first = first;
-	journal->j_last = last;
 
-	journal->j_head = first;
-	journal->j_tail = first;
-	journal->j_free = last - first;
+	if (jbd2_has_feature_fast_commit(journal) &&
+	    journal->j_fc_wbufsize > 0) {
+		journal->j_last_fc = last;
+		journal->j_last = last - journal->j_fc_wbufsize;
+		journal->j_first_fc = journal->j_last + 1;
+		journal->j_fc_off = 0;
+	} else {
+		journal->j_last = last;
+	}
+
+	journal->j_head = journal->j_first;
+	journal->j_tail = journal->j_first;
+	journal->j_free = journal->j_last - journal->j_first;
 
 	journal->j_tail_sequence = journal->j_transaction_sequence;
 	journal->j_commit_sequence = journal->j_transaction_sequence - 1;
@@ -1663,9 +1691,18 @@ static int load_superblock(journal_t *journal)
 	journal->j_tail_sequence = be32_to_cpu(sb->s_sequence);
 	journal->j_tail = be32_to_cpu(sb->s_start);
 	journal->j_first = be32_to_cpu(sb->s_first);
-	journal->j_last = be32_to_cpu(sb->s_maxlen);
 	journal->j_errno = be32_to_cpu(sb->s_errno);
 
+	if (jbd2_has_feature_fast_commit(journal) &&
+	    journal->j_fc_wbufsize > 0) {
+		journal->j_last_fc = be32_to_cpu(sb->s_maxlen);
+		journal->j_last = journal->j_last_fc - journal->j_fc_wbufsize;
+		journal->j_first_fc = journal->j_last + 1;
+		journal->j_fc_off = 0;
+	} else {
+		journal->j_last = be32_to_cpu(sb->s_maxlen);
+	}
+
 	return 0;
 }
 
@@ -1726,6 +1763,9 @@ int jbd2_journal_load(journal_t *journal)
 	 */
 	journal->j_flags &= ~JBD2_ABORT;
 
+	if (journal->j_fc_wbufsize > 0)
+		jbd2_journal_set_features(journal, 0, 0,
+					  JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
 	/* OK, we've finished with the dynamic journal bits:
 	 * reinitialise the dynamic contents of the superblock in memory
 	 * and reset them on disk. */
@@ -1809,6 +1849,8 @@ int jbd2_journal_destroy(journal_t *journal)
 		jbd2_journal_destroy_revoke(journal);
 	if (journal->j_chksum_driver)
 		crypto_free_shash(journal->j_chksum_driver);
+	if (journal->j_fc_wbufsize > 0)
+		kfree(journal->j_fc_wbuf);
 	kfree(journal->j_wbuf);
 	kfree(journal);
 
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index f438257d7f31..36f65a818366 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -915,6 +915,30 @@ struct journal_s
 	 */
 	unsigned long		j_last;
 
+	/**
+	 * @j_first_fc:
+	 *
+	 * The block number of the first fast commit block in the journal
+	 * [j_state_lock].
+	 */
+	unsigned long		j_first_fc;
+
+	/**
+	 * @j_fc_off:
+	 *
+	 * Number of fast commit blocks currently allocated.
+	 * [j_state_lock].
+	 */
+	unsigned long		j_fc_off;
+
+	/**
+	 * @j_last_fc:
+	 *
+	 * The block number one beyond the last fast commit block in the journal
+	 * [j_state_lock].
+	 */
+	unsigned long		j_last_fc;
+
 	/**
 	 * @j_dev: Device where we store the journal.
 	 */
@@ -1065,6 +1089,12 @@ struct journal_s
 	 */
 	struct buffer_head	**j_wbuf;
 
+	/**
+	 * @j_fc_wbuf: Array of fast commit bhs for
+	 * jbd2_journal_commit_transaction.
+	 */
+	struct buffer_head	**j_fc_wbuf;
+
 	/**
 	 * @j_wbufsize:
 	 *
@@ -1072,6 +1102,13 @@ struct journal_s
 	 */
 	int			j_wbufsize;
 
+	/**
+	 * @j_fc_wbufsize:
+	 *
+	 * Size of @j_fc_wbuf array.
+	 */
+	int			j_fc_wbufsize;
+
 	/**
 	 * @j_last_sync_writer:
 	 *
@@ -1507,6 +1544,8 @@ void __jbd2_log_wait_for_space(journal_t *journal);
 extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *);
 extern int jbd2_cleanup_journal_tail(journal_t *);
 
+/* Fast commit related APIs */
+int jbd2_fc_init(journal_t *journal, int num_fc_blks);
 /*
  * is_journal_abort
  *
-- 
2.28.0.681.g6f77f65b4e-goog



* [PATCH v9 4/9] jbd2: add fast commit machinery
  2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
                   ` (2 preceding siblings ...)
  2020-09-19  0:54 ` [PATCH v9 3/9] ext4 / jbd2: add fast commit initialization Harshad Shirwadkar
@ 2020-09-19  0:54 ` Harshad Shirwadkar
  2020-10-09 16:16   ` Ritesh Harjani
  2020-09-19  0:54 ` [PATCH v9 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch implements the following APIs in JBD2 to allow for fast
commits:

jbd2_fc_start(): Start a new fast commit. This function waits for any
existing fast commit or full commit to complete.

jbd2_fc_stop(): Stop fast commit. This function ends current fast
commit and wakes up either the journal thread or the other fast commit
waiting for current fast commit to complete.

jbd2_fc_stop_do_commit(): Stop fast commit and perform a full
commit. This is the same as jbd2_fc_stop() but also performs a full commit.

This patch also adds a cleanup handler in journal_t that is called
after every full and fast commit.
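
The flag handshake between fast and full commits can be sketched in user
space as follows. This is only a simplified, non-blocking model: the two
flag values are taken from this patch, but struct model_journal and the
model_*() helpers are hypothetical names, and the real functions sleep on
j_wait_fc under j_state_lock instead of returning immediately.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Flag values as defined by this patch in include/linux/jbd2.h. */
#define JBD2_FAST_COMMIT_ONGOING	0x100
#define JBD2_FULL_COMMIT_ONGOING	0x200

/* Hypothetical stand-in for the few journal_t fields the handshake uses. */
struct model_journal {
	unsigned long flags;		/* models j_flags */
	unsigned int commit_sequence;	/* models j_commit_sequence */
	bool had_full_commit;		/* models j_stats.ts_tid != 0 */
};

/* Models jbd2_fc_start() without sleeping. */
static int model_fc_start(struct model_journal *j, unsigned int tid)
{
	if (!j->had_full_commit)
		return -EINVAL;		/* no full commit processed yet */
	if (tid <= j->commit_sequence)
		return -EALREADY;	/* tid is already committed */
	if (j->flags & (JBD2_FAST_COMMIT_ONGOING | JBD2_FULL_COMMIT_ONGOING))
		return -EALREADY;	/* real code waits on j_wait_fc first */
	j->flags |= JBD2_FAST_COMMIT_ONGOING;
	return 0;
}

/* Models __jbd2_fc_stop(); fallback requests a full commit. */
static void model_fc_stop(struct model_journal *j, bool fallback)
{
	j->flags &= ~JBD2_FAST_COMMIT_ONGOING;
	if (fallback)
		j->flags |= JBD2_FULL_COMMIT_ONGOING;
}

/* Models the tail of jbd2_journal_commit_transaction(). */
static void model_full_commit_done(struct model_journal *j, unsigned int tid)
{
	j->flags &= ~(JBD2_FULL_COMMIT_ONGOING | JBD2_FAST_COMMIT_ONGOING);
	j->commit_sequence = tid;
	j->had_full_commit = true;
}
```

In this model a caller that gets 0 from model_fc_start() owns the fast
commit until it calls model_fc_stop(); any other return value means the
caller should not write fast commit blocks for that tid.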

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/fast_commit.c |  8 ++++++
 fs/jbd2/commit.c      | 19 ++++++++++++
 fs/jbd2/journal.c     | 67 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/jbd2.h  | 21 ++++++++++++++
 4 files changed, 115 insertions(+)

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 0dad8bdb1253..f2d11b4c6b62 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -8,11 +8,19 @@
  * Ext4 fast commits routines.
  */
 #include "ext4_jbd2.h"
+/*
+ * Fast commit cleanup routine. This is called after every fast commit and
+ * full commit. full is true if we are called after a full commit.
+ */
+static void ext4_fc_cleanup(journal_t *journal, int full)
+{
+}
 
 void ext4_fc_init(struct super_block *sb, journal_t *journal)
 {
 	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
 		return;
+	journal->j_fc_cleanup_callback = ext4_fc_cleanup;
 	if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
 		pr_warn("Error while enabling fast commits, turning off.");
 		ext4_clear_feature_fast_commit(sb);
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index 6d2da8ad0e6f..ba35ecb18616 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -413,6 +413,20 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	J_ASSERT(journal->j_running_transaction != NULL);
 	J_ASSERT(journal->j_committing_transaction == NULL);
 
+	write_lock(&journal->j_state_lock);
+	journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
+	while (journal->j_flags & JBD2_FAST_COMMIT_ONGOING) {
+		DEFINE_WAIT(wait);
+
+		prepare_to_wait(&journal->j_wait_fc, &wait,
+				TASK_UNINTERRUPTIBLE);
+		write_unlock(&journal->j_state_lock);
+		schedule();
+		write_lock(&journal->j_state_lock);
+		finish_wait(&journal->j_wait_fc, &wait);
+	}
+	write_unlock(&journal->j_state_lock);
+
 	commit_transaction = journal->j_running_transaction;
 
 	trace_jbd2_start_commit(journal, commit_transaction);
@@ -1119,12 +1133,16 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 
 	if (journal->j_commit_callback)
 		journal->j_commit_callback(journal, commit_transaction);
+	if (journal->j_fc_cleanup_callback)
+		journal->j_fc_cleanup_callback(journal, 1);
 
 	trace_jbd2_end_commit(journal, commit_transaction);
 	jbd_debug(1, "JBD2: commit %d complete, head %d\n",
 		  journal->j_commit_sequence, journal->j_tail_sequence);
 
 	write_lock(&journal->j_state_lock);
+	journal->j_flags &= ~JBD2_FULL_COMMIT_ONGOING;
+	journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
 	spin_lock(&journal->j_list_lock);
 	commit_transaction->t_state = T_FINISHED;
 	/* Check if the transaction can be dropped now that we are finished */
@@ -1136,6 +1154,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 	spin_unlock(&journal->j_list_lock);
 	write_unlock(&journal->j_state_lock);
 	wake_up(&journal->j_wait_done_commit);
+	wake_up(&journal->j_wait_fc);
 
 	/*
 	 * Calculate overall stats
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 736a1736619f..17a30a2c38f9 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -714,6 +714,72 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
 	return err;
 }
 
+/*
+ * Start a fast commit. If there's an ongoing fast or full commit wait for
+ * it to complete. Returns 0 if a new fast commit was started. Returns -EALREADY
+ * if a fast commit is not needed, either because there's already a commit
+ * going on or this tid has already been committed. Returns -EINVAL if no jbd2
+ * commit has yet been performed.
+ */
+int jbd2_fc_start(journal_t *journal, tid_t tid)
+{
+	/*
+	 * Fast commits only allowed if at least one full commit has
+	 * been processed.
+	 */
+	if (!journal->j_stats.ts_tid)
+		return -EINVAL;
+
+	if (tid <= journal->j_commit_sequence)
+		return -EALREADY;
+
+	write_lock(&journal->j_state_lock);
+	if (journal->j_flags & JBD2_FULL_COMMIT_ONGOING ||
+	    (journal->j_flags & JBD2_FAST_COMMIT_ONGOING)) {
+		DEFINE_WAIT(wait);
+
+		prepare_to_wait(&journal->j_wait_fc, &wait,
+				TASK_UNINTERRUPTIBLE);
+		write_unlock(&journal->j_state_lock);
+		schedule();
+		finish_wait(&journal->j_wait_fc, &wait);
+		return -EALREADY;
+	}
+	journal->j_flags |= JBD2_FAST_COMMIT_ONGOING;
+	write_unlock(&journal->j_state_lock);
+
+	return 0;
+}
+
+/*
+ * Stop a fast commit. If fallback is set, this function starts commit of
+ * TID tid before any other fast commit can start.
+ */
+static int __jbd2_fc_stop(journal_t *journal, tid_t tid, bool fallback)
+{
+	if (journal->j_fc_cleanup_callback)
+		journal->j_fc_cleanup_callback(journal, 0);
+	write_lock(&journal->j_state_lock);
+	journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
+	if (fallback)
+		journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
+	write_unlock(&journal->j_state_lock);
+	wake_up(&journal->j_wait_fc);
+	if (fallback)
+		return jbd2_complete_transaction(journal, tid);
+	return 0;
+}
+
+int jbd2_fc_stop(journal_t *journal)
+{
+	return __jbd2_fc_stop(journal, 0, 0);
+}
+
+int jbd2_fc_stop_do_commit(journal_t *journal, tid_t tid)
+{
+	return __jbd2_fc_stop(journal, tid, 1);
+}
+
 /* Return 1 when transaction with given tid has already committed. */
 int jbd2_transaction_committed(journal_t *journal, tid_t tid)
 {
@@ -1140,6 +1206,7 @@ static journal_t *journal_init_common(struct block_device *bdev,
 	init_waitqueue_head(&journal->j_wait_commit);
 	init_waitqueue_head(&journal->j_wait_updates);
 	init_waitqueue_head(&journal->j_wait_reserved);
+	init_waitqueue_head(&journal->j_wait_fc);
 	mutex_init(&journal->j_abort_mutex);
 	mutex_init(&journal->j_barrier);
 	mutex_init(&journal->j_checkpoint_mutex);
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 36f65a818366..aad986a9f3ff 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -858,6 +858,13 @@ struct journal_s
 	 */
 	wait_queue_head_t	j_wait_reserved;
 
+	/**
+	 * @j_wait_fc:
+	 *
+	 * Wait queue to wait for completion of async fast commits.
+	 */
+	wait_queue_head_t	j_wait_fc;
+
 	/**
 	 * @j_checkpoint_mutex:
 	 *
@@ -1208,6 +1215,15 @@ struct journal_s
 	 */
 	struct lockdep_map	j_trans_commit_map;
 #endif
+
+	/**
+	 * @j_fc_cleanup_callback:
+	 *
+	 * Clean-up after fast commit or full commit. JBD2 calls this function
+	 * after every commit operation.
+	 */
+	void (*j_fc_cleanup_callback)(struct journal_s *journal, int);
+
 };
 
 #define jbd2_might_wait_for_commit(j) \
@@ -1292,6 +1308,8 @@ JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit,	FAST_COMMIT)
 #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
 						 * data write error in ordered
 						 * mode */
+#define JBD2_FAST_COMMIT_ONGOING	0x100	/* Fast commit is ongoing */
+#define JBD2_FULL_COMMIT_ONGOING	0x200	/* Full commit is ongoing */
 
 /*
  * Function declarations for the journaling transaction and buffer
@@ -1546,6 +1564,9 @@ extern int jbd2_cleanup_journal_tail(journal_t *);
 
 /* Fast commit related APIs */
 int jbd2_fc_init(journal_t *journal, int num_fc_blks);
+int jbd2_fc_start(journal_t *journal, tid_t tid);
+int jbd2_fc_stop(journal_t *journal);
+int jbd2_fc_stop_do_commit(journal_t *journal, tid_t tid);
 /*
  * is_journal_abort
  *
-- 
2.28.0.681.g6f77f65b4e-goog



* [PATCH v9 5/9] ext4: main fast-commit commit path
  2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
                   ` (3 preceding siblings ...)
  2020-09-19  0:54 ` [PATCH v9 4/9] jbd2: add fast commit machinery Harshad Shirwadkar
@ 2020-09-19  0:54 ` Harshad Shirwadkar
  2020-09-19  8:19   ` kernel test robot
                     ` (2 more replies)
  2020-09-19  0:54 ` [PATCH v9 6/9] jbd2: fast commit recovery path Harshad Shirwadkar
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch adds the main commit path handlers for fast commits. The
overall patch can be divided into two inter-related parts:

(A) Metadata updates tracking

    This part consists of helper functions to track changes that need
    to be committed during a commit operation. These updates are
    maintained by Ext4 in different in-memory queues. The following
    APIs are implemented in this patch:

    - ext4_fc_track_link/unlink/create() - Track link, unlink and
      create operations
    - ext4_fc_track_range() - Track changed logical block offsets of
      inodes
    - ext4_fc_track_inode() - Track inodes
    - ext4_fc_mark_ineligible() - Mark file system fast commit
      ineligible
    - ext4_fc_start_update() / ext4_fc_stop_update() /
      ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() - These
      functions are useful for co-ordinating inode updates with
      commits.
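
As a rough illustration of the queueing described in (A), the sketch below
models how a tracked inode lands in either the main or the staging queue.
FC_Q_MAIN and FC_Q_STAGING come from this patch; struct model_sb, the
model_*() helpers and the fixed-size arrays are hypothetical, and the
promotion of staged entries in model_commit_done() is an assumption about
what the cleanup step does, not something shown in this part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FC_Q_MAIN	0
#define FC_Q_STAGING	1
#define MODEL_Q_CAP	8	/* hypothetical fixed capacity for the model */

/* Hypothetical stand-in for the fast commit fields of ext4_sb_info. */
struct model_sb {
	bool committing;			/* models EXT4_FC_COMMITTING */
	unsigned long q[2][MODEL_Q_CAP];	/* inode numbers per queue */
	size_t q_len[2];
};

/* Return true if the inode is already on either queue. */
static bool model_queued(const struct model_sb *sb, unsigned long ino)
{
	for (int q = 0; q < 2; q++)
		for (size_t i = 0; i < sb->q_len[q]; i++)
			if (sb->q[q][i] == ino)
				return true;
	return false;
}

/* Queue an inode once; use the staging queue while a commit is running. */
static bool model_track_inode(struct model_sb *sb, unsigned long ino)
{
	int q = sb->committing ? FC_Q_STAGING : FC_Q_MAIN;

	if (model_queued(sb, ino))
		return true;
	if (sb->q_len[q] == MODEL_Q_CAP)
		return false;
	sb->q[q][sb->q_len[q]++] = ino;
	return true;
}

/* Assumed cleanup: drop committed entries, promote the staged ones. */
static void model_commit_done(struct model_sb *sb)
{
	for (size_t i = 0; i < sb->q_len[FC_Q_STAGING]; i++)
		sb->q[FC_Q_MAIN][i] = sb->q[FC_Q_STAGING][i];
	sb->q_len[FC_Q_MAIN] = sb->q_len[FC_Q_STAGING];
	sb->q_len[FC_Q_STAGING] = 0;
	sb->committing = false;
}
```

The point of the two queues in this model is that a running commit sees a
stable main queue while new updates keep accumulating on the side.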

(B) Main commit Path

    This part consists of functions to convert updates tracked in
    in-memory data structures into on-disk commits. The function
    ext4_fc_commit() is the main entry point to the commit path.
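
To make the on-disk side of (B) more concrete, here is a minimal sketch of
a tag-length-value log in which only the data up to the last tail tag
counts as committed, as described in the header comment this patch adds to
fs/ext4/fast_commit.c. The tag numbers, struct model_tl and the model_*()
helpers are hypothetical. The model also omits the CRC and TID that the
real tail tag carries and treats tag 0 as unwritten space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag numbers, not the values used by ext4. */
enum { MODEL_TAG_CREAT = 1, MODEL_TAG_UNLINK = 2, MODEL_TAG_TAIL = 3 };

/* Hypothetical TLV header: a tag followed by the value length. */
struct model_tl {
	uint16_t tag;
	uint16_t len;
};

/* Append one TLV at offset off and return the offset after it. */
static size_t model_put(uint8_t *log, size_t off, uint16_t tag,
			const void *val, uint16_t len)
{
	struct model_tl tl = { tag, len };

	memcpy(log + off, &tl, sizeof(tl));
	if (len)
		memcpy(log + off + sizeof(tl), val, len);
	return off + sizeof(tl) + len;
}

/* Return how many bytes are covered by complete (tail-terminated) commits. */
static size_t model_replayable_bytes(const uint8_t *log, size_t size)
{
	size_t off = 0, valid = 0;

	while (off + sizeof(struct model_tl) <= size) {
		struct model_tl tl;

		memcpy(&tl, log + off, sizeof(tl));
		if (tl.tag == 0)
			break;		/* unwritten space in this model */
		if (off + sizeof(tl) + tl.len > size)
			break;		/* truncated record */
		off += sizeof(tl) + tl.len;
		if (tl.tag == MODEL_TAG_TAIL)
			valid = off;	/* everything up to here is atomic */
	}
	return valid;
}
```

Records written after the last tail are ignored by
model_replayable_bytes(), which mirrors the idea that a fast commit
without its tail must not be replayed.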

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/acl.c               |    2 +
 fs/ext4/ext4.h              |   61 ++
 fs/ext4/extents.c           |   48 +-
 fs/ext4/fast_commit.c       | 1209 +++++++++++++++++++++++++++++++++++
 fs/ext4/fast_commit.h       |  111 ++++
 fs/ext4/file.c              |   10 +-
 fs/ext4/fsync.c             |    2 +-
 fs/ext4/inode.c             |   41 +-
 fs/ext4/ioctl.c             |   16 +-
 fs/ext4/namei.c             |   36 +-
 fs/ext4/super.c             |   31 +
 fs/ext4/xattr.c             |    3 +
 fs/jbd2/commit.c            |   42 ++
 fs/jbd2/journal.c           |  119 +++-
 include/linux/jbd2.h        |    6 +
 include/trace/events/ext4.h |  172 +++++
 16 files changed, 1882 insertions(+), 27 deletions(-)

diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
index 76f634d185f1..68aaed48315f 100644
--- a/fs/ext4/acl.c
+++ b/fs/ext4/acl.c
@@ -242,6 +242,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
+	ext4_fc_start_update(inode);
 
 	if ((type == ACL_TYPE_ACCESS) && acl) {
 		error = posix_acl_update_mode(inode, &mode, &acl);
@@ -259,6 +260,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 	}
 out_stop:
 	ext4_journal_stop(handle);
+	ext4_fc_stop_update(inode);
 	if (error == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
 		goto retry;
 	return error;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9af3971dd12e..27d48d166e5d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1022,6 +1022,27 @@ struct ext4_inode_info {
 
 	struct list_head i_orphan;	/* unlinked but open inodes */
 
+	/* Fast commit related info */
+
+	struct list_head i_fc_list;	/*
+					 * inodes that need fast commit
+					 * protected by sbi->s_fc_lock.
+					 */
+
+	/* Start of lblk range that needs to be committed in this fast commit */
+	ext4_lblk_t i_fc_lblk_start;
+
+	/* Length of lblk range to be committed in this fast commit */
+	ext4_lblk_t i_fc_lblk_len;
+
+	/* Number of ongoing updates on this inode */
+	atomic_t  i_fc_updates;
+
+	/* Fast commit wait queue for this inode */
+	wait_queue_head_t i_fc_wait;
+
+	struct mutex i_fc_lock;
+
 	/*
 	 * i_disksize keeps track of what the inode size is ON DISK, not
 	 * in memory.  During truncate, i_size is set to the new size by
@@ -1142,6 +1163,10 @@ struct ext4_inode_info {
 #define	EXT4_VALID_FS			0x0001	/* Unmounted cleanly */
 #define	EXT4_ERROR_FS			0x0002	/* Errors detected */
 #define	EXT4_ORPHAN_FS			0x0004	/* Orphans being recovered */
+#define EXT4_FC_INELIGIBLE		0x0008	/* Fast commit ineligible */
+#define EXT4_FC_COMMITTING		0x0010	/* File system undergoing a fast
+						 * commit.
+						 */
 
 /*
  * Misc. filesystem flags
@@ -1614,6 +1639,25 @@ struct ext4_sb_info {
 	/* Record the errseq of the backing block device */
 	errseq_t s_bdev_wb_err;
 	spinlock_t s_bdev_wb_lock;
+
+	/* Ext4 fast commit stuff */
+	atomic_t s_fc_subtid;
+	atomic_t s_fc_ineligible_updates;
+	/*
+	 * After commit starts, the main queue gets locked, and the further
+	 * updates get added in the staging queue
+	 */
+#define FC_Q_MAIN	0
+#define FC_Q_STAGING	1
+	struct list_head s_fc_q[2];	/* Inodes staged for fast commit
+					 * that have data changes in them.
+					 */
+	struct list_head s_fc_dentry_q[2];	/* directory entry updates */
+	int s_fc_bytes;
+	spinlock_t s_fc_lock;
+	struct buffer_head *s_fc_bh;
+	struct ext4_fc_stats s_fc_stats;
+	u64 s_fc_avg_commit_time;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -1724,6 +1768,7 @@ enum {
 	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
 	EXT4_STATE_LUSTRE_EA_INODE,	/* Lustre-style ea_inode */
 	EXT4_STATE_VERITY_IN_PROGRESS,	/* building fs-verity Merkle tree */
+	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
 };
 
 #define EXT4_INODE_BIT_FNS(name, field, offset)				\
@@ -2683,6 +2728,22 @@ extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
 /* fast_commit.c */
 
 void ext4_fc_init(struct super_block *sb, journal_t *journal);
+void ext4_fc_init_inode(struct inode *inode);
+void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
+			 ext4_lblk_t end);
+void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry);
+void ext4_fc_track_link(struct inode *inode, struct dentry *dentry);
+void ext4_fc_track_create(struct inode *inode, struct dentry *dentry);
+void ext4_fc_track_inode(struct inode *inode);
+void ext4_fc_mark_ineligible(struct super_block *sb, int reason);
+void ext4_fc_start_ineligible(struct super_block *sb, int reason);
+void ext4_fc_stop_ineligible(struct super_block *sb);
+void ext4_fc_start_update(struct inode *inode);
+void ext4_fc_stop_update(struct inode *inode);
+void ext4_fc_del(struct inode *inode);
+int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
+int __init ext4_fc_init_dentry_cache(void);
+
 /* mballoc.c */
 extern const struct seq_operations ext4_mb_seq_groups_ops;
 extern long ext4_mb_stats;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a0481582187a..8de236fedade 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3723,6 +3723,7 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
 	err = ext4_ext_dirty(handle, inode, path + path->p_depth);
 out:
 	ext4_ext_show_leaf(inode, path);
+	ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
 	return err;
 }
 
@@ -3794,6 +3795,7 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
 	if (*allocated > map->m_len)
 		*allocated = map->m_len;
 	map->m_len = *allocated;
+	ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
 	return 0;
 }
 
@@ -4327,7 +4329,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	map->m_len = ar.len;
 	allocated = map->m_len;
 	ext4_ext_show_leaf(inode, path);
-
+	ext4_fc_track_range(inode, map->m_lblk, map->m_lblk + map->m_len - 1);
 out:
 	ext4_ext_drop_refs(path);
 	kfree(path);
@@ -4600,7 +4602,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
 	ret = ext4_mark_inode_dirty(handle, inode);
 	if (unlikely(ret))
 		goto out_handle;
-
+	ext4_fc_track_range(inode, offset >> inode->i_sb->s_blocksize_bits,
+			(offset + len - 1) >> inode->i_sb->s_blocksize_bits);
 	/* Zero out partial block at the edges of the range */
 	ret = ext4_zero_partial_blocks(handle, inode, offset, len);
 	if (ret >= 0)
@@ -4648,23 +4651,34 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
 		     FALLOC_FL_INSERT_RANGE))
 		return -EOPNOTSUPP;
+	ext4_fc_track_range(inode, offset >> blkbits,
+			(offset + len - 1) >> blkbits);
 
-	if (mode & FALLOC_FL_PUNCH_HOLE)
-		return ext4_punch_hole(inode, offset, len);
+	ext4_fc_start_update(inode);
+
+	if (mode & FALLOC_FL_PUNCH_HOLE) {
+		ret = ext4_punch_hole(inode, offset, len);
+		goto exit;
+	}
 
 	ret = ext4_convert_inline_data(inode);
 	if (ret)
-		return ret;
+		goto exit;
 
-	if (mode & FALLOC_FL_COLLAPSE_RANGE)
-		return ext4_collapse_range(inode, offset, len);
-
-	if (mode & FALLOC_FL_INSERT_RANGE)
-		return ext4_insert_range(inode, offset, len);
+	if (mode & FALLOC_FL_COLLAPSE_RANGE) {
+		ret = ext4_collapse_range(inode, offset, len);
+		goto exit;
+	}
 
-	if (mode & FALLOC_FL_ZERO_RANGE)
-		return ext4_zero_range(file, offset, len, mode);
+	if (mode & FALLOC_FL_INSERT_RANGE) {
+		ret = ext4_insert_range(inode, offset, len);
+		goto exit;
+	}
 
+	if (mode & FALLOC_FL_ZERO_RANGE) {
+		ret = ext4_zero_range(file, offset, len, mode);
+		goto exit;
+	}
 	trace_ext4_fallocate_enter(inode, offset, len, mode);
 	lblk = offset >> blkbits;
 
@@ -4698,12 +4712,14 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		goto out;
 
 	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
-		ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
-						EXT4_I(inode)->i_sync_tid);
+		ret = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
+					EXT4_I(inode)->i_sync_tid);
 	}
 out:
 	inode_unlock(inode);
 	trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
+exit:
+	ext4_fc_stop_update(inode);
 	return ret;
 }
 
@@ -5291,6 +5307,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 		ret = PTR_ERR(handle);
 		goto out_mmap;
 	}
+	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE);
 
 	down_write(&EXT4_I(inode)->i_data_sem);
 	ext4_discard_preallocations(inode, 0);
@@ -5329,6 +5346,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
 
 out_stop:
 	ext4_journal_stop(handle);
+	ext4_fc_stop_ineligible(sb);
 out_mmap:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
@@ -5429,6 +5447,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 		ret = PTR_ERR(handle);
 		goto out_mmap;
 	}
+	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE);
 
 	/* Expand file to avoid data loss if there is error while shifting */
 	inode->i_size += len;
@@ -5503,6 +5522,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
 
 out_stop:
 	ext4_journal_stop(handle);
+	ext4_fc_stop_ineligible(sb);
 out_mmap:
 	up_write(&EXT4_I(inode)->i_mmap_sem);
 out_mutex:
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index f2d11b4c6b62..1c3e5f39d643 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -7,13 +7,1211 @@
  *
  * Ext4 fast commits routines.
  */
+#include "ext4.h"
 #include "ext4_jbd2.h"
+#include "ext4_extents.h"
+#include "mballoc.h"
+
+/*
+ * Ext4 Fast Commits
+ * -----------------
+ *
+ * Ext4 fast commits implement fine grained journalling for Ext4.
+ *
+ * Fast commits are organized as a log of tag-length-value (TLV) structs. (See
+ * struct ext4_fc_tl). Each TLV contains some delta that is replayed TLV by
+ * TLV during the recovery phase. For the scenarios for which we currently
+ * don't have replay code, fast commit falls back to full commits.
+ * Fast commits record delta in one of the following three categories.
+ *
+ * (A) Directory entry updates:
+ *
+ * - EXT4_FC_TAG_UNLINK		- records directory entry unlink
+ * - EXT4_FC_TAG_LINK		- records directory entry link
+ * - EXT4_FC_TAG_CREAT		- records inode and directory entry creation
+ *
+ * (B) File specific data range updates:
+ *
+ * - EXT4_FC_TAG_ADD_RANGE	- records addition of new blocks to an inode
+ * - EXT4_FC_TAG_DEL_RANGE	- records deletion of blocks from an inode
+ *
+ * (C) Inode metadata (mtime / ctime etc):
+ *
+ * - EXT4_FC_TAG_INODE_FULL	- record the inode that should be replayed as is
+ *				  during recovery. This tag is useful for
+ *				  setting up a new inode. This tag is necessary
+ *				  for CREAT tag.
+ * - EXT4_FC_TAG_INODE_PARTIAL	- record the inode that should only partially be
+ *				  replayed during recovery. This tag is
+ *				  useful when the same inode has had updates
+ *				  recorded as ADD_RANGE or DEL_RANGE tags.
+ *
+ * Commit Operation
+ * ----------------
+ * With fast commits, we maintain all the directory entry operations in the
+ * order in which they are issued in an in-memory queue. This queue is flushed
+ * to disk during the commit operation. We also maintain a list of inodes
+ * that need to be committed during a fast commit in another in memory queue of
+ * inodes. During the commit operation, we commit in the following order:
+ *
+ * [1] Lock inodes for any further data updates by setting COMMITTING state
+ * [2] Submit data buffers of all the inodes
+ * [3] Wait for [2] to complete
+ * [4] Commit all the directory entry updates in the fast commit space
+ * [5] Commit all the changed inode structures
+ * [6] Write tail tag (this tag ensures the atomicity, please read the following
+ *     section for more details).
+ * [7] Wait for [4], [5] and [6] to complete.
+ *
+ * All the inode updates must call ext4_fc_start_update() before starting an
+ * update. If such an ongoing update is present, fast commit waits for it to
+ * complete. The completion of such an update is marked by
+ * ext4_fc_stop_update().
+ *
+ * Fast Commit Ineligibility
+ * -------------------------
+ * Not all operations are supported by fast commits today (e.g. extended
+ * attributes). Fast commit ineligibility is marked by calling one of the
+ * two following functions:
+ *
+ * - ext4_fc_mark_ineligible(): This makes the next fast commit operation fall
+ *   back to a full commit. This is useful in case of transient errors.
+ *
+ * - ext4_fc_start_ineligible() and ext4_fc_stop_ineligible(): This makes all
+ *   the fast commits happening between ext4_fc_start_ineligible() and
+ *   ext4_fc_stop_ineligible(), and one fast commit after the call to
+ *   ext4_fc_stop_ineligible(), fall back to full commits. It is important to
+ *   make one more fast commit fall back to a full commit after the stop call
+ *   so that it is guaranteed that the fast commit ineligible operation
+ *   contained within ext4_fc_start_ineligible() and
+ *   ext4_fc_stop_ineligible() is followed by at least 1 full commit.
+ *
+ * Atomicity of commits
+ * --------------------
+ * In order to guarantee atomicity during the commit operation, fast commit
+ * uses "EXT4_FC_TAG_TAIL" tag that marks a fast commit as complete. Tail
+ * tag contains CRC of the contents and TID of the transaction after which
+ * this fast commit should be applied. Recovery code replays fast commit
+ * logs only if there's at least 1 valid tail present. For every fast commit
+ * operation, there is 1 tail. This means, we may end up with multiple tails
+ * in the fast commit space. Here's an example:
+ *
+ * - Create a new file A and remove existing file B
+ * - fsync()
+ * - Append contents to file A
+ * - Truncate file A
+ * - fsync()
+ *
+ * The fast commit space at the end of above operations would look like this:
+ *      [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL]
+ *             |<---  Fast Commit 1   --->|<---      Fast Commit 2     ---->|
+ *
+ * Replay code should thus check for all the valid tails in the FC area.
+ *
+ * TODOs
+ * -----
+ * 1) Make fast commit atomic updates more fine grained. Today, a fast commit
+ *    eligible update must be protected within ext4_fc_start_update() and
+ *    ext4_fc_stop_update(). These routines are called from much higher
+ *    level routines. This can be made more fine grained by combining with
+ *    ext4_journal_start().
+ *
+ * 2) Same as above for ext4_fc_start_ineligible() and ext4_fc_stop_ineligible()
+ *
+ * 3) Handle more ineligible cases.
+ */
+
+#include <trace/events/ext4.h>
+static struct kmem_cache *ext4_fc_dentry_cachep;
+
+static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
+{
+	BUFFER_TRACE(bh, "");
+	if (uptodate) {
+		ext4_debug("%s: Block %lld up-to-date",
+			   __func__, bh->b_blocknr);
+		set_buffer_uptodate(bh);
+	} else {
+		ext4_debug("%s: Block %lld not up-to-date",
+			   __func__, bh->b_blocknr);
+		clear_buffer_uptodate(bh);
+	}
+
+	unlock_buffer(bh);
+}
+
+static inline void ext4_fc_reset_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	ei->i_fc_lblk_start = 0;
+	ei->i_fc_lblk_len = 0;
+}
+
+void ext4_fc_init_inode(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	ext4_fc_reset_inode(inode);
+	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
+	INIT_LIST_HEAD(&ei->i_fc_list);
+	init_waitqueue_head(&ei->i_fc_wait);
+	atomic_set(&ei->i_fc_updates, 0);
+}
+
+/*
+ * Inform Ext4's fast commit about the start of an inode update
+ *
+ * This function is called by the high level VFS callbacks before
+ * performing any inode update. This function blocks if there's an ongoing
+ * fast commit on the inode in question.
+ */
+void ext4_fc_start_update(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return;
+
+restart:
+	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+	if (list_empty(&EXT4_I(inode)->i_fc_list))
+		goto out;
+
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+		wait_queue_head_t *wq;
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+		schedule();
+		finish_wait(wq, &wait.wq_entry);
+		goto restart;
+	}
+out:
+	atomic_inc(&ei->i_fc_updates);
+	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+}
+
+/*
+ * Stop inode update and wake up waiting fast commits if any.
+ */
+void ext4_fc_stop_update(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return;
+
+	if (atomic_dec_and_test(&ei->i_fc_updates))
+		wake_up_all(&ei->i_fc_wait);
+}
+
+/*
+ * Remove inode from fast commit list. If the inode is being committed
+ * we wait until inode commit is done.
+ */
+void ext4_fc_del(struct inode *inode)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return;
+
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return;
+
+restart:
+	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+	if (list_empty(&ei->i_fc_list)) {
+		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+		return;
+	}
+
+	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
+		wait_queue_head_t *wq;
+#if (BITS_PER_LONG < 64)
+		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_state_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#else
+		DEFINE_WAIT_BIT(wait, &ei->i_flags,
+				EXT4_STATE_FC_COMMITTING);
+		wq = bit_waitqueue(&ei->i_flags,
+				   EXT4_STATE_FC_COMMITTING);
+#endif
+		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
+		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+		schedule();
+		finish_wait(wq, &wait.wq_entry);
+		goto restart;
+	}
+	if (!list_empty(&ei->i_fc_list))
+		list_del_init(&ei->i_fc_list);
+	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
+}
+
+/*
+ * Mark file system as fast commit ineligible. This means that next commit
+ * operation would result in a full jbd2 commit.
+ */
+void ext4_fc_mark_ineligible(struct super_block *sb, int reason)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	sbi->s_mount_state |= EXT4_FC_INELIGIBLE;
+	WARN_ON(reason >= EXT4_FC_REASON_MAX);
+	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
+}
+
+/*
+ * Start a fast commit ineligible update. Any commits that happen while
+ * such an operation is in progress fall back to full commits.
+ */
+void ext4_fc_start_ineligible(struct super_block *sb, int reason)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	WARN_ON(reason >= EXT4_FC_REASON_MAX);
+	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
+	atomic_inc(&sbi->s_fc_ineligible_updates);
+}
+
+/*
+ * Stop a fast commit ineligible update. We set EXT4_FC_INELIGIBLE flag here
+ * to ensure that after stopping the ineligible update, at least one full
+ * commit takes place.
+ */
+void ext4_fc_stop_ineligible(struct super_block *sb)
+{
+	EXT4_SB(sb)->s_mount_state |= EXT4_FC_INELIGIBLE;
+	atomic_dec(&EXT4_SB(sb)->s_fc_ineligible_updates);
+}
+
+static inline int ext4_fc_is_ineligible(struct super_block *sb)
+{
+	return (EXT4_SB(sb)->s_mount_state & EXT4_FC_INELIGIBLE) ||
+		atomic_read(&EXT4_SB(sb)->s_fc_ineligible_updates);
+}
+
+/*
+ * Generic fast commit tracking function. If this is the first time we are
+ * called after a full commit, we initialize fast commit fields and then call
+ * __fc_track_fn() with update = 0. If we have already been called after a full
+ * commit, we pass update = 1. Based on that, the track function can determine
+ * if it needs to track a field for the first time or if it needs to just
+ * update the previously tracked value.
+ *
+ * If enqueue is set, this function enqueues the inode in fast commit list.
+ */
+static int ext4_fc_track_template(
+	struct inode *inode, int (*__fc_track_fn)(struct inode *, void *, bool),
+	void *args, int enqueue)
+{
+	tid_t running_txn_tid;
+	bool update = false;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+	int ret;
+
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+		return -EOPNOTSUPP;
+
+	if (ext4_fc_is_ineligible(inode->i_sb))
+		return -EINVAL;
+
+	running_txn_tid = sbi->s_journal ?
+		sbi->s_journal->j_commit_sequence + 1 : 0;
+
+	mutex_lock(&ei->i_fc_lock);
+	if (running_txn_tid == ei->i_sync_tid) {
+		update = true;
+	} else {
+		ext4_fc_reset_inode(inode);
+		ei->i_sync_tid = running_txn_tid;
+	}
+	ret = __fc_track_fn(inode, args, update);
+	mutex_unlock(&ei->i_fc_lock);
+
+	if (!enqueue)
+		return ret;
+
+	spin_lock(&sbi->s_fc_lock);
+	if (list_empty(&EXT4_I(inode)->i_fc_list))
+		list_add_tail(&EXT4_I(inode)->i_fc_list,
+				(sbi->s_mount_state & EXT4_FC_COMMITTING) ?
+				&sbi->s_fc_q[FC_Q_STAGING] :
+				&sbi->s_fc_q[FC_Q_MAIN]);
+	spin_unlock(&sbi->s_fc_lock);
+
+	return ret;
+}
+
+struct __track_dentry_update_args {
+	struct dentry *dentry;
+	int op;
+};
+
+/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */
+static int __track_dentry_update(struct inode *inode, void *arg, bool update)
+{
+	struct ext4_fc_dentry_update *node;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct __track_dentry_update_args *dentry_update =
+		(struct __track_dentry_update_args *)arg;
+	struct dentry *dentry = dentry_update->dentry;
+	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+
+	mutex_unlock(&ei->i_fc_lock);
+	node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS);
+	if (!node) {
+		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_MEM);
+		mutex_lock(&ei->i_fc_lock);
+		return -ENOMEM;
+	}
+
+	node->fcd_op = dentry_update->op;
+	node->fcd_parent = dentry->d_parent->d_inode->i_ino;
+	node->fcd_ino = inode->i_ino;
+	if (dentry->d_name.len > DNAME_INLINE_LEN) {
+		node->fcd_name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
+		if (!node->fcd_name.name) {
+			kmem_cache_free(ext4_fc_dentry_cachep, node);
+			ext4_fc_mark_ineligible(inode->i_sb,
+				EXT4_FC_REASON_MEM);
+			mutex_lock(&ei->i_fc_lock);
+			return -ENOMEM;
+		}
+		memcpy((u8 *)node->fcd_name.name, dentry->d_name.name,
+			dentry->d_name.len);
+	} else {
+		memcpy(node->fcd_iname, dentry->d_name.name,
+			dentry->d_name.len);
+		node->fcd_name.name = node->fcd_iname;
+	}
+	node->fcd_name.len = dentry->d_name.len;
+
+	spin_lock(&sbi->s_fc_lock);
+	if (sbi->s_mount_state & EXT4_FC_COMMITTING)
+		list_add_tail(&node->fcd_list,
+				&sbi->s_fc_dentry_q[FC_Q_STAGING]);
+	else
+		list_add_tail(&node->fcd_list, &sbi->s_fc_dentry_q[FC_Q_MAIN]);
+	spin_unlock(&sbi->s_fc_lock);
+	mutex_lock(&ei->i_fc_lock);
+
+	return 0;
+}
+
+void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry)
+{
+	struct __track_dentry_update_args args;
+	int ret;
+
+	args.dentry = dentry;
+	args.op = EXT4_FC_TAG_UNLINK;
+
+	ret = ext4_fc_track_template(inode, __track_dentry_update,
+					(void *)&args, 0);
+	trace_ext4_fc_track_unlink(inode, dentry, ret);
+}
+
+void ext4_fc_track_link(struct inode *inode, struct dentry *dentry)
+{
+	struct __track_dentry_update_args args;
+	int ret;
+
+	args.dentry = dentry;
+	args.op = EXT4_FC_TAG_LINK;
+
+	ret = ext4_fc_track_template(inode, __track_dentry_update,
+					(void *)&args, 0);
+	trace_ext4_fc_track_link(inode, dentry, ret);
+}
+
+void ext4_fc_track_create(struct inode *inode, struct dentry *dentry)
+{
+	struct __track_dentry_update_args args;
+	int ret;
+
+	args.dentry = dentry;
+	args.op = EXT4_FC_TAG_CREAT;
+
+	ret = ext4_fc_track_template(inode, __track_dentry_update,
+					(void *)&args, 0);
+	trace_ext4_fc_track_create(inode, dentry, ret);
+}
+
+/* __track_fn for inode tracking */
+static int __track_inode(struct inode *inode, void *arg, bool update)
+{
+	if (update)
+		return -EEXIST;
+
+	EXT4_I(inode)->i_fc_lblk_len = 0;
+
+	return 0;
+}
+
+void ext4_fc_track_inode(struct inode *inode)
+{
+	int ret;
+
+	if (S_ISDIR(inode->i_mode))
+		return;
+
+	ret = ext4_fc_track_template(inode, __track_inode, NULL, 1);
+	trace_ext4_fc_track_inode(inode, ret);
+}
+
+struct __track_range_args {
+	ext4_lblk_t start, end;
+};
+
+/* __track_fn for tracking data updates */
+static int __track_range(struct inode *inode, void *arg, bool update)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	ext4_lblk_t oldstart;
+	struct __track_range_args *__arg =
+		(struct __track_range_args *)arg;
+
+	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb)) {
+		ext4_debug("Special inode %lu being modified\n", inode->i_ino);
+		return -ECANCELED;
+	}
+
+	oldstart = ei->i_fc_lblk_start;
+
+	if (update && ei->i_fc_lblk_len > 0) {
+		ei->i_fc_lblk_start = min(ei->i_fc_lblk_start, __arg->start);
+		ei->i_fc_lblk_len =
+			max(oldstart + ei->i_fc_lblk_len - 1, __arg->end) -
+				ei->i_fc_lblk_start + 1;
+	} else {
+		ei->i_fc_lblk_start = __arg->start;
+		ei->i_fc_lblk_len = __arg->end - __arg->start + 1;
+	}
+
+	return 0;
+}
+
+void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
+			 ext4_lblk_t end)
+{
+	struct __track_range_args args;
+	int ret;
+
+	if (S_ISDIR(inode->i_mode))
+		return;
+
+	args.start = start;
+	args.end = end;
+
+	ret = ext4_fc_track_template(inode, __track_range, &args, 1);
+
+	trace_ext4_fc_track_range(inode, start, end, ret);
+}
+
+static void ext4_fc_submit_bh(struct super_block *sb)
+{
+	int write_flags = REQ_SYNC;
+	struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
+
+	if (test_opt(sb, BARRIER))
+		write_flags |= REQ_FUA | REQ_PREFLUSH;
+	lock_buffer(bh);
+	clear_buffer_dirty(bh);
+	set_buffer_uptodate(bh);
+	bh->b_end_io = ext4_end_buffer_io_sync;
+	submit_bh(REQ_OP_WRITE, write_flags, bh);
+	EXT4_SB(sb)->s_fc_bh = NULL;
+}
+
+/* Ext4 commit path routines */
+
+/* memzero and update CRC */
+static void *ext4_fc_memzero(struct super_block *sb, void *dst, int len,
+				u32 *crc)
+{
+	void *ret;
+
+	ret = memset(dst, 0, len);
+	if (crc)
+		*crc = ext4_chksum(EXT4_SB(sb), *crc, dst, len);
+	return ret;
+}
+
+/*
+ * Allocate len bytes on a fast commit buffer.
+ *
+ * At commit time this function is used to manage fast commit block
+ * space. We don't split a single fast commit TLV across different
+ * blocks. So this function makes sure that if there's not enough space
+ * on the current block, the remaining space in the current block is
+ * marked as unused by adding an EXT4_FC_TAG_PAD tag. In that case, a
+ * new block is obtained from jbd2 and the CRC is updated to reflect
+ * the padding we added.
+ */
+static u8 *ext4_fc_reserve_space(struct super_block *sb, int len, u32 *crc)
+{
+	struct ext4_fc_tl *tl;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct buffer_head *bh;
+	int bsize = sbi->s_journal->j_blocksize;
+	int ret, off = sbi->s_fc_bytes % bsize;
+	int pad_len;
+
+	if (bsize - off - 1 > len + sizeof(struct ext4_fc_tl)) {
+		/*
+		 * Only allocate from current buffer if we have enough space for
+		 * this request AND we have space to add a zero byte padding.
+		 */
+		if (!sbi->s_fc_bh) {
+			ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
+			if (ret)
+				return NULL;
+			sbi->s_fc_bh = bh;
+		}
+		sbi->s_fc_bytes += len;
+		return sbi->s_fc_bh->b_data + off;
+	}
+	/* Need to add PAD tag */
+	tl = (struct ext4_fc_tl *)(sbi->s_fc_bh->b_data + off);
+	tl->fc_tag = cpu_to_le16(EXT4_FC_TAG_PAD);
+	pad_len = bsize - off - 1 - sizeof(struct ext4_fc_tl);
+	tl->fc_len = cpu_to_le16(pad_len);
+	if (crc)
+		*crc = ext4_chksum(sbi, *crc, tl, sizeof(*tl));
+	if (pad_len > 0)
+		ext4_fc_memzero(sb, tl + 1, pad_len, crc);
+	ext4_fc_submit_bh(sb);
+
+	ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
+	if (ret)
+		return NULL;
+	sbi->s_fc_bh = bh;
+	sbi->s_fc_bytes = (sbi->s_fc_bytes / bsize + 1) * bsize + len;
+	return sbi->s_fc_bh->b_data;
+}
+
+/* memcpy to fc reserved space and update CRC */
+static void *ext4_fc_memcpy(struct super_block *sb, void *dst, const void *src,
+				int len, u32 *crc)
+{
+	if (crc)
+		*crc = ext4_chksum(EXT4_SB(sb), *crc, src, len);
+	return memcpy(dst, src, len);
+}
+
+/*
+ * Complete a fast commit by writing tail tag.
+ *
+ * Writing the tail tag marks the end of a fast commit. In order to
+ * guarantee atomicity, after writing the tail tag, the next commit
+ * shouldn't use the block even if space remains in it. That's why the
+ * tail tag's length covers the remaining space on the block.
+ */
+static int ext4_fc_write_tail(struct super_block *sb, u32 crc)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_tl tl;
+	struct ext4_fc_tail tail;
+	int off, bsize = sbi->s_journal->j_blocksize;
+	u8 *dst;
+
+	/*
+	 * ext4_fc_reserve_space takes care of allocating an extra block if
+	 * there's not enough space on this block for accommodating this tail.
+	 */
+	dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc);
+	if (!dst)
+		return -ENOSPC;
+
+	off = sbi->s_fc_bytes % bsize;
+
+	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL);
+	tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail));
+	sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize);
+
+	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc);
+	dst += sizeof(tl);
+	tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid);
+	ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc);
+	dst += sizeof(tail.fc_tid);
+	tail.fc_crc = cpu_to_le32(crc);
+	ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL);
+
+	ext4_fc_submit_bh(sb);
+
+	return 0;
+}
+
+/*
+ * Adds tag, length, value and updates CRC. Returns true if tlv was added.
+ * Returns false if there's not enough space.
+ */
+static bool ext4_fc_add_tlv(struct super_block *sb, u16 tag, u16 len, u8 *val,
+			   u32 *crc)
+{
+	struct ext4_fc_tl tl;
+	u8 *dst;
+
+	dst = ext4_fc_reserve_space(sb, sizeof(tl) + len, crc);
+	if (!dst)
+		return false;
+
+	tl.fc_tag = cpu_to_le16(tag);
+	tl.fc_len = cpu_to_le16(len);
+
+	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
+	ext4_fc_memcpy(sb, dst + sizeof(tl), val, len, crc);
+
+	return true;
+}
+
+/* Same as above, but adds dentry tlv. */
+static bool ext4_fc_add_dentry_tlv(struct super_block *sb, u16 tag,
+					int parent_ino, int ino, int dlen,
+					const unsigned char *dname,
+					u32 *crc)
+{
+	struct ext4_fc_dentry_info fcd;
+	struct ext4_fc_tl tl;
+	u8 *dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(fcd) + dlen,
+					crc);
+
+	if (!dst)
+		return false;
+
+	fcd.fc_parent_ino = cpu_to_le32(parent_ino);
+	fcd.fc_ino = cpu_to_le32(ino);
+	tl.fc_tag = cpu_to_le16(tag);
+	tl.fc_len = cpu_to_le16(sizeof(fcd) + dlen);
+	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
+	dst += sizeof(tl);
+	ext4_fc_memcpy(sb, dst, &fcd, sizeof(fcd), crc);
+	dst += sizeof(fcd);
+	ext4_fc_memcpy(sb, dst, dname, dlen, crc);
+	dst += dlen;
+
+	return true;
+}
+
+/*
+ * Writes inode in the fast commit space under TLV with tag @tag.
+ * Returns 0 on success, error on failure.
+ */
+static int ext4_fc_write_inode(struct inode *inode, u32 *crc, int tag)
+{
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
+	int ret;
+	struct ext4_iloc iloc;
+	struct ext4_fc_inode fc_inode;
+	struct ext4_fc_tl tl;
+	u8 *dst;
+
+	ret = ext4_get_inode_loc(inode, &iloc);
+	if (ret)
+		return ret;
+
+	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
+		inode_len += ei->i_extra_isize;
+
+	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
+	tl.fc_tag = cpu_to_le16(tag);
+	tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino));
+
+	dst = ext4_fc_reserve_space(inode->i_sb,
+			sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc);
+	if (!dst)
+		return -ECANCELED;
+
+	if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc))
+		return -ECANCELED;
+	dst += sizeof(tl);
+	if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc))
+		return -ECANCELED;
+	dst += sizeof(fc_inode);
+	if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc),
+					inode_len, crc))
+		return -ECANCELED;
+
+	return 0;
+}
+
+/*
+ * Writes updated data ranges for the inode in question. Updates CRC.
+ * Returns 0 on success, error otherwise.
+ */
+static int ext4_fc_write_data(struct inode *inode, u32 *crc)
+{
+	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	struct ext4_map_blocks map;
+	struct ext4_fc_add_range fc_ext;
+	struct ext4_fc_del_range lrange;
+	struct ext4_extent *ex;
+	int ret;
+
+	mutex_lock(&ei->i_fc_lock);
+	if (ei->i_fc_lblk_len == 0) {
+		mutex_unlock(&ei->i_fc_lock);
+		return 0;
+	}
+	old_blk_size = ei->i_fc_lblk_start;
+	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
+	ei->i_fc_lblk_len = 0;
+	mutex_unlock(&ei->i_fc_lock);
+
+	cur_lblk_off = old_blk_size;
+	jbd_debug(1, "%s: will try writing %u to %u for inode %lu\n",
+		  __func__, cur_lblk_off, new_blk_size, inode->i_ino);
+
+	while (cur_lblk_off <= new_blk_size) {
+		map.m_lblk = cur_lblk_off;
+		map.m_len = new_blk_size - cur_lblk_off + 1;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0)
+			return -ECANCELED;
+
+		if (map.m_len == 0) {
+			cur_lblk_off++;
+			continue;
+		}
+
+		if (ret == 0) {
+			lrange.fc_ino = cpu_to_le32(inode->i_ino);
+			lrange.fc_lblk = cpu_to_le32(map.m_lblk);
+			lrange.fc_len = cpu_to_le32(map.m_len);
+			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
+					    sizeof(lrange), (u8 *)&lrange, crc))
+				return -ENOSPC;
+		} else {
+			fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
+			ex = (struct ext4_extent *)&fc_ext.fc_ex;
+			ex->ee_block = cpu_to_le32(map.m_lblk);
+			ex->ee_len = cpu_to_le16(map.m_len);
+			ext4_ext_store_pblock(ex, map.m_pblk);
+			if (map.m_flags & EXT4_MAP_UNWRITTEN)
+				ext4_ext_mark_unwritten(ex);
+			else
+				ext4_ext_mark_initialized(ex);
+			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
+					    sizeof(fc_ext), (u8 *)&fc_ext, crc))
+				return -ENOSPC;
+		}
+
+		cur_lblk_off += map.m_len;
+	}
+
+	return 0;
+}
+
+/*
+ * Commit data inode. If tag == EXT4_FC_TAG_INODE_PARTIAL, we write the
+ * updated ranges for that inode before committing the inode itself.
+ * This ensures that, during replay, inode->i_block[] is properly
+ * initialized before the partial replay of the inode.
+ */
+static int ext4_fc_commit_inode(journal_t *journal, struct inode *inode,
+				u32 *crc, int tag)
+{
+	int ret;
+	int commit_full_inode =
+		tag == EXT4_FC_TAG_INODE_FULL || ext4_has_inline_data(inode);
+
+	if (commit_full_inode) {
+		ret = ext4_fc_write_inode(inode, crc, tag);
+		if (ret < 0)
+			return ret;
+	}
+
+	ret = ext4_fc_write_data(inode, crc);
+	if (ret < 0)
+		return ret;
+
+	if (!commit_full_inode)
+		ret = ext4_fc_write_inode(inode, crc, tag);
+
+	return ret;
+}
+
+/* Submit data for all the fast commit inodes */
+static int ext4_fc_submit_inode_data_all(journal_t *journal)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *ei;
+	struct list_head *pos;
+	int ret = 0;
+
+	spin_lock(&sbi->s_fc_lock);
+	sbi->s_mount_state |= EXT4_FC_COMMITTING;
+	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
+		ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		ext4_set_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING);
+		while (atomic_read(&ei->i_fc_updates)) {
+			DEFINE_WAIT(wait);
+
+			prepare_to_wait(&ei->i_fc_wait, &wait,
+						TASK_UNINTERRUPTIBLE);
+			if (atomic_read(&ei->i_fc_updates)) {
+				spin_unlock(&sbi->s_fc_lock);
+				schedule();
+				spin_lock(&sbi->s_fc_lock);
+			}
+			finish_wait(&ei->i_fc_wait, &wait);
+		}
+		spin_unlock(&sbi->s_fc_lock);
+		ret = jbd2_submit_inode_data(journal, ei->jinode);
+		if (ret)
+			return ret;
+		spin_lock(&sbi->s_fc_lock);
+	}
+	spin_unlock(&sbi->s_fc_lock);
+
+	return ret;
+}
+
+/* Wait for completion of data for all the fast commit inodes */
+static int ext4_fc_wait_inode_data_all(journal_t *journal)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *pos, *n;
+	int ret = 0;
+
+	spin_lock(&sbi->s_fc_lock);
+	list_for_each_entry_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
+		if (!ext4_test_inode_state(&pos->vfs_inode,
+					   EXT4_STATE_FC_COMMITTING))
+			continue;
+		spin_unlock(&sbi->s_fc_lock);
+
+		ret = jbd2_wait_inode_data(journal, pos->jinode);
+		if (ret)
+			return ret;
+		spin_lock(&sbi->s_fc_lock);
+	}
+	spin_unlock(&sbi->s_fc_lock);
+
+	return 0;
+}
+
+/* Commit all the directory entry updates */
+static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_dentry_update *fc_dentry;
+	struct inode *inode;
+	struct list_head *pos, *n, *fcd_pos, *fcd_n;
+	struct ext4_inode_info *ei;
+	int ret;
+
+	spin_lock(&sbi->s_fc_lock);
+	if (list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
+		spin_unlock(&sbi->s_fc_lock);
+		return 0;
+	}
+	list_for_each_safe(fcd_pos, fcd_n, &sbi->s_fc_dentry_q[FC_Q_MAIN]) {
+		fc_dentry = list_entry(fcd_pos, struct ext4_fc_dentry_update,
+					fcd_list);
+		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT) {
+			spin_unlock(&sbi->s_fc_lock);
+			if (!ext4_fc_add_dentry_tlv(
+				sb, fc_dentry->fcd_op,
+				fc_dentry->fcd_parent, fc_dentry->fcd_ino,
+				fc_dentry->fcd_name.len,
+				fc_dentry->fcd_name.name, crc)) {
+				return -ENOSPC;
+			}
+			spin_lock(&sbi->s_fc_lock);
+			continue;
+		}
+
+		inode = NULL;
+		list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
+			ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
+			if (ei->vfs_inode.i_ino == fc_dentry->fcd_ino) {
+				inode = &ei->vfs_inode;
+				break;
+			}
+		}
+		/*
+		 * If we don't find inode in our list, then it was deleted,
+		 * in which case, we don't need to record its create tag.
+		 */
+		if (!inode)
+			continue;
+		spin_unlock(&sbi->s_fc_lock);
+
+		/*
+		 * We first write the inode and then the create dirent. This
+		 * allows the recovery code to create an unnamed inode first
+		 * and then link it to a directory entry. This allows us
+		 * to use namei.c routines almost as is and simplifies
+		 * the recovery code.
+		 */
+		ret = ext4_fc_commit_inode(journal, inode, crc,
+						EXT4_FC_TAG_INODE_FULL);
+		if (ret)
+			return ret;
+
+		if (!ext4_fc_add_dentry_tlv(
+			sb, fc_dentry->fcd_op,
+			fc_dentry->fcd_parent, fc_dentry->fcd_ino,
+			fc_dentry->fcd_name.len,
+			fc_dentry->fcd_name.name, crc))
+			return -ENOSPC;
+
+		spin_lock(&sbi->s_fc_lock);
+	}
+	spin_unlock(&sbi->s_fc_lock);
+	return 0;
+}
+
+static int ext4_fc_perform_commit(journal_t *journal)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_head head;
+	struct list_head *pos;
+	struct inode *inode;
+	struct blk_plug plug;
+	int ret = 0;
+	u32 crc = 0;
+
+	ret = ext4_fc_submit_inode_data_all(journal);
+	if (ret)
+		return ret;
+
+	ret = ext4_fc_wait_inode_data_all(journal);
+	if (ret)
+		return ret;
+
+	blk_start_plug(&plug);
+	if (sbi->s_fc_bytes == 0) {
+		/* Add a head tag only for the first fast commit in a TID. */
+		head.fc_features = cpu_to_le32(EXT4_FC_SUPPORTED_FEATURES);
+		head.fc_tid = cpu_to_le32(
+			sbi->s_journal->j_running_transaction->t_tid);
+
+		if (!ext4_fc_add_tlv(sb, EXT4_FC_TAG_HEAD, sizeof(head),
+			(u8 *)&head, &crc)) {
+			ret = -ENOSPC;
+			goto out;
+		}
+	}
+
+	spin_lock(&sbi->s_fc_lock);
+	if (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
+		spin_unlock(&sbi->s_fc_lock);
+		ret = ext4_fc_commit_dentry_updates(journal, &crc);
+		if (ret)
+			goto out;
+		spin_lock(&sbi->s_fc_lock);
+	}
+
+	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		inode = &iter->vfs_inode;
+		if (!ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
+			continue;
+
+		spin_unlock(&sbi->s_fc_lock);
+		/*
+		 * Given that we have already written data range updates for
+		 * this inode, we don't want the recovery code to recreate
+		 * the entire inode at this point. Also, the recovery code
+		 * allocator may make slightly different decisions for
+		 * extent tree blocks allocations, resulting in potentially
+		 * different inode.i_blocks and inode.i_block[]. So, until
+		 * we have identical allocator in recovery path, we should
+		 * record only partial inode here.
+		 */
+		ret = ext4_fc_commit_inode(journal, inode, &crc,
+						EXT4_FC_TAG_INODE_PARTIAL);
+		if (ret)
+			goto out;
+		spin_lock(&sbi->s_fc_lock);
+	}
+	spin_unlock(&sbi->s_fc_lock);
+
+	ret = ext4_fc_write_tail(sb, crc);
+
+out:
+	blk_finish_plug(&plug);
+	return ret;
+}
+
+/*
+ * The main commit entry point. Performs a fast commit for transaction
+ * commit_tid if needed. If it's not possible to perform a fast commit
+ * due to various reasons, we fall back to full commit. Returns 0
+ * on success, error otherwise.
+ */
+int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
+{
+	struct super_block *sb = (struct super_block *)(journal->j_private);
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	int nblks = 0, ret, bsize = journal->j_blocksize;
+	int subtid = atomic_read(&sbi->s_fc_subtid);
+	int reason = EXT4_FC_REASON_OK, fc_bufs_before = 0;
+	ktime_t start_time, commit_time;
+
+	trace_ext4_fc_commit_start(sb);
+
+	start_time = ktime_get();
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+		(ext4_fc_is_ineligible(sb))) {
+		reason = EXT4_FC_REASON_INELIGIBLE;
+		goto out;
+	}
+
+restart_fc:
+	ret = jbd2_fc_start(journal, commit_tid);
+	if (ret == -EALREADY) {
+		/* There was an ongoing commit, check if we need to restart */
+		if (atomic_read(&sbi->s_fc_subtid) <= subtid &&
+			commit_tid > journal->j_commit_sequence)
+			goto restart_fc;
+		reason = EXT4_FC_REASON_ALREADY_COMMITTED;
+		goto out;
+	} else if (ret) {
+		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
+		reason = EXT4_FC_REASON_FC_START_FAILED;
+		goto out;
+	}
+
+	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
+	ret = ext4_fc_perform_commit(journal);
+	if (ret < 0) {
+		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
+		reason = EXT4_FC_REASON_FC_FAILED;
+		goto out;
+	}
+	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
+	ret = jbd2_fc_wait_bufs(journal, nblks);
+	if (ret < 0) {
+		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
+		reason = EXT4_FC_REASON_FC_FAILED;
+		goto out;
+	}
+	atomic_inc(&sbi->s_fc_subtid);
+	jbd2_fc_stop(journal);
+out:
+	/* Has any ineligible update happened since we started? */
+	if (reason == EXT4_FC_REASON_OK && ext4_fc_is_ineligible(sb)) {
+		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
+		reason = EXT4_FC_REASON_INELIGIBLE;
+	}
+
+	spin_lock(&sbi->s_fc_lock);
+	if (reason != EXT4_FC_REASON_OK &&
+		reason != EXT4_FC_REASON_ALREADY_COMMITTED) {
+		sbi->s_fc_stats.fc_ineligible_commits++;
+	} else {
+		sbi->s_fc_stats.fc_num_commits++;
+		sbi->s_fc_stats.fc_numblks += nblks;
+	}
+	spin_unlock(&sbi->s_fc_lock);
+	nblks = (reason == EXT4_FC_REASON_OK) ? nblks : 0;
+	trace_ext4_fc_commit_stop(sb, nblks, reason);
+	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
+	/*
+	 * Weight the running average higher than the latest commit time so
+	 * we don't react too strongly to vast changes in the commit time.
+	 */
+	if (likely(sbi->s_fc_avg_commit_time))
+		sbi->s_fc_avg_commit_time = (commit_time +
+				sbi->s_fc_avg_commit_time * 3) / 4;
+	else
+		sbi->s_fc_avg_commit_time = commit_time;
+	jbd_debug(1,
+		"Fast commit ended with blks = %d, reason = %d, subtid = %d\n",
+		nblks, reason, subtid);
+	if (reason == EXT4_FC_REASON_FC_FAILED)
+		return jbd2_fc_stop_do_commit(journal, commit_tid);
+	if (reason == EXT4_FC_REASON_FC_START_FAILED ||
+		reason == EXT4_FC_REASON_INELIGIBLE)
+		return jbd2_complete_transaction(journal, commit_tid);
+	return 0;
+}
+
 /*
  * Fast commit cleanup routine. This is called after every fast commit and
  * full commit. full is true if we are called after a full commit.
  */
 static void ext4_fc_cleanup(journal_t *journal, int full)
 {
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_inode_info *iter;
+	struct ext4_fc_dentry_update *fc_dentry;
+	struct list_head *pos, *n;
+
+	if (full && sbi->s_fc_bh)
+		sbi->s_fc_bh = NULL;
+
+	jbd2_fc_release_bufs(journal);
+
+	spin_lock(&sbi->s_fc_lock);
+	list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
+		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
+		list_del_init(&iter->i_fc_list);
+		ext4_clear_inode_state(&iter->vfs_inode,
+				       EXT4_STATE_FC_COMMITTING);
+		ext4_fc_reset_inode(&iter->vfs_inode);
+		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
+		smp_mb();
+#if (BITS_PER_LONG < 64)
+		wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
+#else
+		wake_up_bit(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
+#endif
+	}
+
+	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
+		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
+					     struct ext4_fc_dentry_update,
+					     fcd_list);
+		list_del_init(&fc_dentry->fcd_list);
+		spin_unlock(&sbi->s_fc_lock);
+
+		if (fc_dentry->fcd_name.name &&
+			fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
+			kfree(fc_dentry->fcd_name.name);
+		kmem_cache_free(ext4_fc_dentry_cachep, fc_dentry);
+		spin_lock(&sbi->s_fc_lock);
+	}
+
+	list_splice_init(&sbi->s_fc_dentry_q[FC_Q_STAGING],
+				&sbi->s_fc_dentry_q[FC_Q_MAIN]);
+	list_splice_init(&sbi->s_fc_q[FC_Q_STAGING],
+				&sbi->s_fc_q[FC_Q_MAIN]);
+
+	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
+	sbi->s_mount_state &= ~EXT4_FC_INELIGIBLE;
+
+	if (full)
+		sbi->s_fc_bytes = 0;
+	spin_unlock(&sbi->s_fc_lock);
+	trace_ext4_fc_stats(sb);
 }
 
 void ext4_fc_init(struct super_block *sb, journal_t *journal)
@@ -26,3 +1224,14 @@ void ext4_fc_init(struct super_block *sb, journal_t *journal)
 		ext4_clear_feature_fast_commit(sb);
 	}
 }
+
+int __init ext4_fc_init_dentry_cache(void)
+{
+	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
+					   SLAB_RECLAIM_ACCOUNT);
+
+	if (ext4_fc_dentry_cachep == NULL)
+		return -ENOMEM;
+
+	return 0;
+}
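
Reviewer note on __track_range() above: the per-inode tracking keeps a
single bounding interval (i_fc_lblk_start, i_fc_lblk_len), so two
disjoint writes in the same transaction also cover the blocks between
them, and ext4_fc_write_data() later walks that whole interval. A
minimal user-space sketch of the merge follows. The struct and function
names are hypothetical stand-ins for illustration, not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for the i_fc_lblk_* inode fields. */
struct fc_range {
	uint32_t start;	/* first tracked logical block */
	uint32_t len;	/* number of tracked blocks, 0 = nothing tracked */
};

/*
 * Widen the tracked range so it also covers [start, end] (inclusive).
 * 'update' mirrors the flag passed down by ext4_fc_track_template():
 * false means the inode state was just reset for a new transaction,
 * so the previous range is discarded.
 */
static void fc_range_track(struct fc_range *r, uint32_t start, uint32_t end,
			   bool update)
{
	uint32_t last;

	assert(end >= start);
	if (update && r->len > 0) {
		last = r->start + r->len - 1;	/* old inclusive end */
		if (start < r->start)
			r->start = start;
		if (end > last)
			last = end;
		r->len = last - r->start + 1;
	} else {
		r->start = start;
		r->len = end - start + 1;
	}
}
```

Tracking blocks 10..19 and then 30..39 with update set yields one range
of 30 blocks starting at block 10, so the untouched gap 20..29 is
included in the tracked range.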
diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
index 8362bf5e6e00..a541d2bbe24b 100644
--- a/fs/ext4/fast_commit.h
+++ b/fs/ext4/fast_commit.h
@@ -6,4 +6,115 @@
 /* Number of blocks in journal area to allocate for fast commits */
 #define EXT4_NUM_FC_BLKS		256
 
+/* Fast commit tags */
+#define EXT4_FC_TAG_ADD_RANGE		0x0001
+#define EXT4_FC_TAG_DEL_RANGE		0x0002
+#define EXT4_FC_TAG_CREAT		0x0003
+#define EXT4_FC_TAG_LINK		0x0004
+#define EXT4_FC_TAG_UNLINK		0x0005
+#define EXT4_FC_TAG_INODE_FULL		0x0006
+#define EXT4_FC_TAG_INODE_PARTIAL	0x0007
+#define EXT4_FC_TAG_PAD			0x0008
+#define EXT4_FC_TAG_TAIL		0x0009
+#define EXT4_FC_TAG_HEAD		0x000A
+
+#define EXT4_FC_SUPPORTED_FEATURES	0x0
+
+/* On disk fast commit tlv value structures */
+
+/* Fast commit on disk tag length structure */
+struct ext4_fc_tl {
+	__le16 fc_tag;
+	__le16 fc_len;
+};
+
+/* Value structure for tag EXT4_FC_TAG_HEAD. */
+struct ext4_fc_head {
+	__le32 fc_features;
+	__le32 fc_tid;
+};
+
+/* Value structure for EXT4_FC_TAG_ADD_RANGE. */
+struct ext4_fc_add_range {
+	__le32 fc_ino;
+	__u8 fc_ex[12];
+};
+
+/* Value structure for tag EXT4_FC_TAG_DEL_RANGE. */
+struct ext4_fc_del_range {
+	__le32 fc_ino;
+	__le32 fc_lblk;
+	__le32 fc_len;
+};
+
+/*
+ * This is the value structure for tags EXT4_FC_TAG_CREAT, EXT4_FC_TAG_LINK
+ * and EXT4_FC_TAG_UNLINK.
+ */
+struct ext4_fc_dentry_info {
+	__le32 fc_parent_ino;
+	__le32 fc_ino;
+	u8 fc_dname[0];
+};
+
+/* Value structure for EXT4_FC_TAG_INODE_FULL and EXT4_FC_TAG_INODE_PARTIAL. */
+struct ext4_fc_inode {
+	__le32 fc_ino;
+	__u8 fc_raw_inode[0];
+};
+
+/* Value structure for tag EXT4_FC_TAG_TAIL. */
+struct ext4_fc_tail {
+	__le32 fc_tid;
+	__le32 fc_crc;
+};
+
+/*
+ * In memory list of dentry updates that are performed on the file
+ * system used by fast commit code.
+ */
+struct ext4_fc_dentry_update {
+	int fcd_op;		/* Type of update create / unlink / link */
+	int fcd_parent;		/* Parent inode number */
+	int fcd_ino;		/* Inode number */
+	struct qstr fcd_name;	/* Dirent name */
+	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */
+	struct list_head fcd_list;
+};
+
+/*
+ * Fast commit reason codes
+ */
+enum {
+	/*
+	 * Commit status codes:
+	 */
+	EXT4_FC_REASON_OK = 0,
+	EXT4_FC_REASON_INELIGIBLE,
+	EXT4_FC_REASON_ALREADY_COMMITTED,
+	EXT4_FC_REASON_FC_START_FAILED,
+	EXT4_FC_REASON_FC_FAILED,
+
+	/*
+	 * Fast commit ineligibility reasons:
+	 */
+	EXT4_FC_REASON_XATTR = 0,
+	EXT4_FC_REASON_CROSS_RENAME,
+	EXT4_FC_REASON_JOURNAL_FLAG_CHANGE,
+	EXT4_FC_REASON_MEM,
+	EXT4_FC_REASON_SWAP_BOOT,
+	EXT4_FC_REASON_RESIZE,
+	EXT4_FC_REASON_RENAME_DIR,
+	EXT4_FC_REASON_FALLOC_RANGE,
+	EXT4_FC_COMMIT_FAILED,
+	EXT4_FC_REASON_MAX
+};
+
+struct ext4_fc_stats {
+	int fc_ineligible_reason_count[EXT4_FC_REASON_MAX];
+	int fc_num_commits;
+	int fc_ineligible_commits;
+	int fc_numblks;
+};
+
 #endif /* __FAST_COMMIT_H__ */
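
Reviewer note on the on-disk layout above: every record is a 4-byte
little-endian header (struct ext4_fc_tl: tag, then length) followed by
that many value bytes. The user-space sketch below encodes a few records
into one buffer and walks them back. The helper names are hypothetical,
it handles a single buffer only, and it does not verify the CRC or
follow a log across block boundaries, so it is an illustration of the
framing rather than of the replay code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FC_TAG_PAD	0x0008	/* same value as EXT4_FC_TAG_PAD */
#define FC_TAG_TAIL	0x0009	/* same value as EXT4_FC_TAG_TAIL */
#define FC_TL_SIZE	4	/* sizeof(struct ext4_fc_tl) */

static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static void put_le16(uint8_t *p, uint16_t v)
{
	p[0] = v & 0xff;
	p[1] = v >> 8;
}

/* Append one TLV at offset off; returns the offset just past it. */
static size_t fc_put_tlv(uint8_t *buf, size_t off, uint16_t tag,
			 const void *val, uint16_t len)
{
	put_le16(buf + off, tag);
	put_le16(buf + off + 2, len);
	if (len)
		memcpy(buf + off + FC_TL_SIZE, val, len);
	return off + FC_TL_SIZE + len;
}

/*
 * Walk the TLVs in buf[0..size) and count every record that is not a
 * PAD, including the closing TAIL. Returns -1 if a value would run
 * past the buffer or if no TAIL is found.
 */
static int fc_count_tlvs(const uint8_t *buf, size_t size)
{
	size_t off = 0;
	int count = 0;

	while (off + FC_TL_SIZE <= size) {
		uint16_t tag = get_le16(buf + off);
		uint16_t len = get_le16(buf + off + 2);

		if (len > size - off - FC_TL_SIZE)
			return -1;
		if (tag == FC_TAG_TAIL)
			return count + 1;
		if (tag != FC_TAG_PAD)
			count++;
		off += FC_TL_SIZE + len;
	}
	return -1;
}
```

A walker of this shape treats a log without a TAIL record as invalid,
which matches the patch's description of the tail tag as the marker for
the end of a fast commit.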
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index e608ce3fb535..018f698caa71 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -260,6 +260,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	if (iocb->ki_flags & IOCB_NOWAIT)
 		return -EOPNOTSUPP;
 
+	ext4_fc_start_update(inode);
 	inode_lock(inode);
 	ret = ext4_write_checks(iocb, from);
 	if (ret <= 0)
@@ -271,6 +272,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 
 out:
 	inode_unlock(inode);
+	ext4_fc_stop_update(inode);
 	if (likely(ret > 0)) {
 		iocb->ki_pos += ret;
 		ret = generic_write_sync(iocb, ret);
@@ -527,6 +529,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	if (unaligned_io)
 		inode_dio_wait(inode);
 
+	ext4_fc_start_update(inode);
 	if (extend) {
 		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
 		if (IS_ERR(handle)) {
@@ -556,6 +559,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		inode_unlock_shared(inode);
 	else
 		inode_unlock(inode);
+	ext4_fc_stop_update(inode);
 
 	if (ret >= 0 && iov_iter_count(from)) {
 		ssize_t err;
@@ -654,8 +658,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 #endif
 	if (iocb->ki_flags & IOCB_DIRECT)
 		return ext4_dio_write_iter(iocb, from);
-
-	return ext4_buffered_write_iter(iocb, from);
+	else
+		return ext4_buffered_write_iter(iocb, from);
 }
 
 #ifdef CONFIG_FS_DAX
@@ -755,6 +759,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	if (!daxdev_mapping_supported(vma, dax_dev))
 		return -EOPNOTSUPP;
 
+	ext4_fc_start_update(inode);
 	file_accessed(file);
 	if (IS_DAX(file_inode(file))) {
 		vma->vm_ops = &ext4_dax_vm_ops;
@@ -762,6 +767,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
 	} else {
 		vma->vm_ops = &ext4_file_vm_ops;
 	}
+	ext4_fc_stop_update(inode);
 	return 0;
 }
 
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 1d668c8f131f..7c9fa35a01d1 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -112,7 +112,7 @@ static int ext4_fsync_journal(struct inode *inode, bool datasync,
 	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
 		*needs_barrier = true;
 
-	return jbd2_complete_transaction(journal, commit_tid);
+	return ext4_fc_commit(journal, commit_tid);
 }
 
 /*
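
Reviewer note on the fsync change above: with ext4_fc_commit() now on
the fsync path, the caller still ends up with a full commit whenever the
fast commit cannot be used. The mapping from reason code to fallback at
the end of ext4_fc_commit() can be summarised with the user-space sketch
below. The enum fc_fallback names are hypothetical labels for the three
outcomes (return 0, jbd2_complete_transaction(), and
jbd2_fc_stop_do_commit()), not identifiers from the patch.

```c
#include <assert.h>

/* Mirrors the commit status codes from fast_commit.h in this patch. */
enum fc_reason {
	FC_REASON_OK = 0,
	FC_REASON_INELIGIBLE,
	FC_REASON_ALREADY_COMMITTED,
	FC_REASON_FC_START_FAILED,
	FC_REASON_FC_FAILED,
	FC_REASON_NR
};

/* Hypothetical labels for the three ways ext4_fc_commit() can finish. */
enum fc_fallback {
	FC_FALLBACK_NONE,		/* return 0, nothing more to do */
	FC_FALLBACK_FULL_COMMIT,	/* jbd2_complete_transaction() */
	FC_FALLBACK_STOP_THEN_FULL	/* jbd2_fc_stop_do_commit() */
};

static enum fc_fallback fc_fallback_for(enum fc_reason reason)
{
	assert(reason < FC_REASON_NR);
	switch (reason) {
	case FC_REASON_FC_FAILED:
		/* A fast commit was started, so it has to be stopped first. */
		return FC_FALLBACK_STOP_THEN_FULL;
	case FC_REASON_FC_START_FAILED:
	case FC_REASON_INELIGIBLE:
		return FC_FALLBACK_FULL_COMMIT;
	default:
		return FC_FALLBACK_NONE;
	}
}
```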
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 77543f988258..26eed76812f9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -729,6 +729,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			if (ret)
 				return ret;
 		}
+		ext4_fc_track_range(inode, map->m_lblk,
+			    map->m_lblk + map->m_len - 1);
 	}
 
 	if (retval < 0)
@@ -4073,6 +4075,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
 
 		up_write(&EXT4_I(inode)->i_data_sem);
 	}
+	ext4_fc_track_range(inode, first_block, stop_block);
 	if (IS_SYNC(inode))
 		ext4_handle_sync(handle);
 
@@ -4703,6 +4706,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	for (block = 0; block < EXT4_N_BLOCKS; block++)
 		ei->i_data[block] = raw_inode->i_block[block];
 	INIT_LIST_HEAD(&ei->i_orphan);
+	ext4_fc_init_inode(&ei->vfs_inode);
 
 	/*
 	 * Set transaction id's of transactions that have to be committed
@@ -5149,7 +5153,7 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 		if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
 			return 0;
 
-		err = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
+		err = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
 						EXT4_I(inode)->i_sync_tid);
 	} else {
 		struct ext4_iloc iloc;
@@ -5278,6 +5282,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		if (error)
 			return error;
 	}
+	ext4_fc_start_update(inode);
 	if ((ia_valid & ATTR_UID && !uid_eq(attr->ia_uid, inode->i_uid)) ||
 	    (ia_valid & ATTR_GID && !gid_eq(attr->ia_gid, inode->i_gid))) {
 		handle_t *handle;
@@ -5301,6 +5306,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 
 		if (error) {
 			ext4_journal_stop(handle);
+			ext4_fc_stop_update(inode);
 			return error;
 		}
 		/* Update corresponding info in inode so that everything is in
@@ -5323,11 +5329,15 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
 			struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
-			if (attr->ia_size > sbi->s_bitmap_maxbytes)
+			if (attr->ia_size > sbi->s_bitmap_maxbytes) {
+				ext4_fc_stop_update(inode);
 				return -EFBIG;
+			}
 		}
-		if (!S_ISREG(inode->i_mode))
+		if (!S_ISREG(inode->i_mode)) {
+			ext4_fc_stop_update(inode);
 			return -EINVAL;
+		}
 
 		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
 			inode_inc_iversion(inode);
@@ -5351,7 +5361,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		rc = ext4_break_layouts(inode);
 		if (rc) {
 			up_write(&EXT4_I(inode)->i_mmap_sem);
-			return rc;
+			goto err_out;
 		}
 
 		if (attr->ia_size != inode->i_size) {
@@ -5372,6 +5382,21 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				inode->i_mtime = current_time(inode);
 				inode->i_ctime = inode->i_mtime;
 			}
+
+			if (shrink)
+				ext4_fc_track_range(inode,
+					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
+					inode->i_sb->s_blocksize_bits,
+					(oldsize > 0 ? oldsize - 1 : 0) >>
+					inode->i_sb->s_blocksize_bits);
+			else
+				ext4_fc_track_range(
+					inode,
+					(oldsize > 0 ? oldsize - 1 : oldsize) >>
+					inode->i_sb->s_blocksize_bits,
+					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
+					inode->i_sb->s_blocksize_bits);
+
 			down_write(&EXT4_I(inode)->i_data_sem);
 			EXT4_I(inode)->i_disksize = attr->ia_size;
 			rc = ext4_mark_inode_dirty(handle, inode);
@@ -5430,9 +5455,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 		rc = posix_acl_chmod(inode, inode->i_mode);
 
 err_out:
-	ext4_std_error(inode->i_sb, error);
+	if (error)
+		ext4_std_error(inode->i_sb, error);
 	if (!error)
 		error = rc;
+	ext4_fc_stop_update(inode);
 	return error;
 }
 
@@ -5614,6 +5641,8 @@ int ext4_mark_iloc_dirty(handle_t *handle,
 		put_bh(iloc->bh);
 		return -EIO;
 	}
+	ext4_fc_track_inode(inode);
+
 	if (IS_I_VERSION(inode))
 		inode_inc_iversion(inode);
 
@@ -5937,6 +5966,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
 	if (IS_ERR(handle))
 		return PTR_ERR(handle);
 
+	ext4_fc_mark_ineligible(inode->i_sb,
+		EXT4_FC_REASON_JOURNAL_FLAG_CHANGE);
 	err = ext4_mark_inode_dirty(handle, inode);
 	ext4_handle_sync(handle);
 	ext4_journal_stop(handle);
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 36eca3bc036a..d2f8f50deef6 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -165,6 +165,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
 		err = -EINVAL;
 		goto err_out;
 	}
+	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_SWAP_BOOT);
 
 	/* Protect extent tree against block allocations via delalloc */
 	ext4_double_down_write_data_sem(inode, inode_bl);
@@ -247,6 +248,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
 
 err_out1:
 	ext4_journal_stop(handle);
+	ext4_fc_stop_ineligible(sb);
 	ext4_double_up_write_data_sem(inode, inode_bl);
 
 err_out:
@@ -807,7 +809,7 @@ static int ext4_ioctl_get_es_cache(struct file *filp, unsigned long arg)
 	return error;
 }
 
-long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = file_inode(filp);
 	struct super_block *sb = inode->i_sb;
@@ -1074,6 +1076,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 
 		err = ext4_resize_fs(sb, n_blocks_count);
 		if (EXT4_SB(sb)->s_journal) {
+			ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_RESIZE);
 			jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
 			err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
 			jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
@@ -1308,6 +1311,17 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 	}
 }
 
+long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
+{
+	long ret;
+
+	ext4_fc_start_update(file_inode(filp));
+	ret = __ext4_ioctl(filp, cmd, arg);
+	ext4_fc_stop_update(file_inode(filp));
+
+	return ret;
+}
+
 #ifdef CONFIG_COMPAT
 long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 153a9fbe1dd0..563243f3b682 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2611,7 +2611,7 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		       bool excl)
 {
 	handle_t *handle;
-	struct inode *inode;
+	struct inode *inode, *inode_save;
 	int err, credits, retries = 0;
 
 	err = dquot_initialize(dir);
@@ -2629,7 +2629,11 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
 		inode->i_op = &ext4_file_inode_operations;
 		inode->i_fop = &ext4_file_operations;
 		ext4_set_aops(inode);
+		inode_save = inode;
+		ihold(inode_save);
 		err = ext4_add_nondir(handle, dentry, &inode);
+		ext4_fc_track_create(inode_save, dentry);
+		iput(inode_save);
 	}
 	if (handle)
 		ext4_journal_stop(handle);
@@ -2644,7 +2648,7 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
 		      umode_t mode, dev_t rdev)
 {
 	handle_t *handle;
-	struct inode *inode;
+	struct inode *inode, *inode_save;
 	int err, credits, retries = 0;
 
 	err = dquot_initialize(dir);
@@ -2661,7 +2665,12 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
 	if (!IS_ERR(inode)) {
 		init_special_inode(inode, inode->i_mode, rdev);
 		inode->i_op = &ext4_special_inode_operations;
+		inode_save = inode;
+		ihold(inode_save);
 		err = ext4_add_nondir(handle, dentry, &inode);
+		if (!err)
+			ext4_fc_track_create(inode_save, dentry);
+		iput(inode_save);
 	}
 	if (handle)
 		ext4_journal_stop(handle);
@@ -2825,6 +2834,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 		iput(inode);
 		goto out_retry;
 	}
+	ext4_fc_track_create(inode, dentry);
 	ext4_inc_count(handle, dir);
 	ext4_update_dx_flag(dir);
 	err = ext4_mark_inode_dirty(handle, dir);
@@ -3165,6 +3175,7 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
 		goto end_rmdir;
 	ext4_dec_count(handle, dir);
 	ext4_update_dx_flag(dir);
+	ext4_fc_track_unlink(inode, dentry);
 	retval = ext4_mark_inode_dirty(handle, dir);
 
 #ifdef CONFIG_UNICODE
@@ -3251,6 +3262,8 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 	inode->i_ctime = current_time(inode);
 	retval = ext4_mark_inode_dirty(handle, inode);
 
+	if (!retval)
+		ext4_fc_track_unlink(d_inode(dentry), dentry);
 #ifdef CONFIG_UNICODE
 	/* VFS negative dentries are incompatible with Encoding and
 	 * Case-insensitiveness. Eventually we'll want avoid
@@ -3872,6 +3885,22 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
 	retval = ext4_mark_inode_dirty(handle, old.dir);
 	if (unlikely(retval))
 		goto end_rename;
+
+	if (S_ISDIR(old.inode->i_mode)) {
+		/*
+		 * We disable fast commits here because the replay
+		 * code is not yet capable of changing dot dot
+		 * dirents in directories.
+		 */
+		ext4_fc_mark_ineligible(old.inode->i_sb,
+			EXT4_FC_REASON_RENAME_DIR);
+	} else {
+		if (new.inode)
+			ext4_fc_track_unlink(new.inode, new.dentry);
+		ext4_fc_track_link(old.inode, new.dentry);
+		ext4_fc_track_unlink(old.inode, old.dentry);
+	}
+
 	if (new.inode) {
 		retval = ext4_mark_inode_dirty(handle, new.inode);
 		if (unlikely(retval))
@@ -4015,7 +4044,8 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
 	retval = ext4_mark_inode_dirty(handle, new.inode);
 	if (unlikely(retval))
 		goto end_rename;
-
+	ext4_fc_mark_ineligible(new.inode->i_sb,
+				EXT4_FC_REASON_CROSS_RENAME);
 	if (old.dir_bh) {
 		retval = ext4_rename_dir_finish(handle, &old, new.dir->i_ino);
 		if (retval)
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 94aaaf940449..2a959e2c7a91 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1149,6 +1149,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
 	ei->i_datasync_tid = 0;
 	atomic_set(&ei->i_unwritten, 0);
 	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
+	ext4_fc_init_inode(&ei->vfs_inode);
+	mutex_init(&ei->i_fc_lock);
 	return &ei->vfs_inode;
 }
 
@@ -1166,6 +1168,10 @@ static int ext4_drop_inode(struct inode *inode)
 static void ext4_free_in_core_inode(struct inode *inode)
 {
 	fscrypt_free_inode(inode);
+	if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
+		pr_warn("%s: inode %lu still in fc list\n",
+			__func__, inode->i_ino);
+	}
 	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
 }
 
@@ -1191,6 +1197,7 @@ static void init_once(void *foo)
 	init_rwsem(&ei->i_data_sem);
 	init_rwsem(&ei->i_mmap_sem);
 	inode_init_once(&ei->vfs_inode);
+	ext4_fc_init_inode(&ei->vfs_inode);
 }
 
 static int __init init_inodecache(void)
@@ -1219,6 +1226,7 @@ static void destroy_inodecache(void)
 
 void ext4_clear_inode(struct inode *inode)
 {
+	ext4_fc_del(inode);
 	invalidate_inode_buffers(inode);
 	clear_inode(inode);
 	ext4_discard_preallocations(inode, 0);
@@ -4540,6 +4548,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
 	mutex_init(&sbi->s_orphan_lock);
 
+	/* Initialize fast commit stuff */
+	atomic_set(&sbi->s_fc_subtid, 0);
+	atomic_set(&sbi->s_fc_ineligible_updates, 0);
+	INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_MAIN]);
+	INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_STAGING]);
+	INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_MAIN]);
+	INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_STAGING]);
+	sbi->s_fc_bytes = 0;
+	sbi->s_mount_state &= ~EXT4_FC_INELIGIBLE;
+	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
+	spin_lock_init(&sbi->s_fc_lock);
+	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
+
 	sb->s_root = NULL;
 
 	needs_recovery = (es->s_last_orphan != 0 ||
@@ -6312,6 +6333,10 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
 	brelse(bh);
 out:
 	if (inode->i_size < off + len) {
+		ext4_fc_track_range(inode,
+			(inode->i_size > 0 ? inode->i_size - 1 : 0)
+				>> inode->i_sb->s_blocksize_bits,
+			(off + len) >> inode->i_sb->s_blocksize_bits);
 		i_size_write(inode, off + len);
 		EXT4_I(inode)->i_disksize = inode->i_size;
 		err2 = ext4_mark_inode_dirty(handle, inode);
@@ -6440,6 +6465,11 @@ static int __init ext4_init_fs(void)
 	err = init_inodecache();
 	if (err)
 		goto out1;
+
+	err = ext4_fc_init_dentry_cache();
+	if (err)
+		goto out05;
+
 	register_as_ext3();
 	register_as_ext2();
 	err = register_filesystem(&ext4_fs_type);
@@ -6450,6 +6480,7 @@ static int __init ext4_init_fs(void)
 out:
 	unregister_as_ext2();
 	unregister_as_ext3();
+out05:
 	destroy_inodecache();
 out1:
 	ext4_exit_mballoc();
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index cba4b877c606..6127e94ea4f5 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -2419,6 +2419,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
 		if (IS_SYNC(inode))
 			ext4_handle_sync(handle);
 	}
+	ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR);
 
 cleanup:
 	brelse(is.iloc.bh);
@@ -2496,6 +2497,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
 		if (error == 0)
 			error = error2;
 	}
+	ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR);
 
 	return error;
 }
@@ -2928,6 +2930,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
 					 error);
 			goto cleanup;
 		}
+		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR);
 	}
 	error = 0;
 cleanup:
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index ba35ecb18616..dadd9994e74b 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -202,6 +202,47 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
 	return ret;
 }
 
+/* Send all the data buffers related to an inode */
+int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
+{
+	struct address_space *mapping;
+	loff_t dirty_start;
+	loff_t dirty_end;
+	int ret;
+
+	if (!jinode)
+		return 0;
+
+	dirty_start = jinode->i_dirty_start;
+	dirty_end = jinode->i_dirty_end;
+
+	if (!(jinode->i_flags & JI_WRITE_DATA))
+		return 0;
+
+	dirty_start = jinode->i_dirty_start;
+	dirty_end = jinode->i_dirty_end;
+
+	mapping = jinode->i_vfs_inode->i_mapping;
+
+	trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
+	ret = journal_submit_inode_data_buffers(mapping, dirty_start,
+						dirty_end);
+
+	return ret;
+}
+EXPORT_SYMBOL(jbd2_submit_inode_data);
+
+int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
+{
+	if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
+		!jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
+		return 0;
+	return filemap_fdatawait_range_keep_errors(
+		jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start,
+		jinode->i_dirty_end);
+}
+EXPORT_SYMBOL(jbd2_wait_inode_data);
+
 /*
  * Submit all the data buffers of inode associated with the transaction to
  * disk.
@@ -434,6 +475,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
 			commit_transaction->t_tid);
 
 	write_lock(&journal->j_state_lock);
+	journal->j_fc_off = 0;
 	J_ASSERT(commit_transaction->t_state == T_RUNNING);
 	commit_transaction->t_state = T_LOCKED;
 
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 17a30a2c38f9..0661e00b46c0 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -157,7 +157,9 @@ static void commit_timeout(struct timer_list *t)
  *
  * 1) COMMIT:  Every so often we need to commit the current state of the
  *    filesystem to disk.  The journal thread is responsible for writing
- *    all of the metadata buffers to disk.
+ *    all of the metadata buffers to disk. If a fast commit is ongoing,
+ *    the journal thread waits until it's done and then continues from
+ *    there on.
  *
  * 2) CHECKPOINT: We cannot reuse a used section of the log file until all
  *    of the data in that part of the log has been rewritten elsewhere on
@@ -848,6 +850,109 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
 	return jbd2_journal_bmap(journal, blocknr, retp);
 }
 
+/* Map one fast commit buffer for use by the file system */
+int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out)
+{
+	unsigned long long pblock;
+	unsigned long blocknr;
+	int ret = 0;
+	struct buffer_head *bh;
+	int fc_off;
+
+	*bh_out = NULL;
+	write_lock(&journal->j_state_lock);
+
+	if (journal->j_fc_off + journal->j_first_fc < journal->j_last_fc) {
+		fc_off = journal->j_fc_off;
+		blocknr = journal->j_first_fc + fc_off;
+		journal->j_fc_off++;
+	} else {
+		ret = -EINVAL;
+	}
+	write_unlock(&journal->j_state_lock);
+
+	if (ret)
+		return ret;
+
+	ret = jbd2_journal_bmap(journal, blocknr, &pblock);
+	if (ret)
+		return ret;
+
+	bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
+	if (!bh)
+		return -ENOMEM;
+
+	lock_buffer(bh);
+
+	clear_buffer_uptodate(bh);
+	set_buffer_dirty(bh);
+	unlock_buffer(bh);
+	journal->j_fc_wbuf[fc_off] = bh;
+
+	*bh_out = bh;
+
+	return 0;
+}
+EXPORT_SYMBOL(jbd2_fc_get_buf);
+
+/*
+ * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
+ * for completion.
+ */
+int jbd2_fc_wait_bufs(journal_t *journal, int num_blks)
+{
+	struct buffer_head *bh;
+	int i, j_fc_off;
+
+	read_lock(&journal->j_state_lock);
+	j_fc_off = journal->j_fc_off;
+	read_unlock(&journal->j_state_lock);
+
+	/*
+	 * Wait in reverse order to minimize chances of us being woken up before
+	 * all IOs have completed
+	 */
+	for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
+		bh = journal->j_fc_wbuf[i];
+		wait_on_buffer(bh);
+		put_bh(bh);
+		journal->j_fc_wbuf[i] = NULL;
+		if (unlikely(!buffer_uptodate(bh)))
+			return -EIO;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(jbd2_fc_wait_bufs);
+
+/*
+ * Release fast commit buffers that were allocated by jbd2_fc_get_buf
+ * without waiting for IO on them to complete.
+ */
+int jbd2_fc_release_bufs(journal_t *journal)
+{
+	struct buffer_head *bh;
+	int i, j_fc_off;
+
+	read_lock(&journal->j_state_lock);
+	j_fc_off = journal->j_fc_off;
+	read_unlock(&journal->j_state_lock);
+
+	/*
+	 * Walk the buffers in reverse order and stop at the first slot that
+	 * has already been released
+	 */
+	for (i = j_fc_off - 1; i >= 0; i--) {
+		bh = journal->j_fc_wbuf[i];
+		if (!bh)
+			break;
+		put_bh(bh);
+		journal->j_fc_wbuf[i] = NULL;
+	}
+
+	return 0;
+}
+
 /*
  * Conversion of logical to physical block numbers for the journal
  *
@@ -1559,6 +1664,7 @@ int jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
 static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
 {
 	journal_superblock_t *sb = journal->j_superblock;
+	bool had_fast_commit = false;
 
 	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
 	lock_buffer(journal->j_sb_buffer);
@@ -1572,9 +1678,20 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
 
 	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
 	sb->s_start    = cpu_to_be32(0);
+	if (jbd2_has_feature_fast_commit(journal)) {
+		/*
+		 * When journal is clean, no need to commit fast commit flag and
+		 * make file system incompatible with older kernels.
+		 */
+		jbd2_clear_feature_fast_commit(journal);
+		had_fast_commit = true;
+	}
 
 	jbd2_write_superblock(journal, write_op);
 
+	if (had_fast_commit)
+		jbd2_set_feature_fast_commit(journal);
+
 	/* Log is no longer empty */
 	write_lock(&journal->j_state_lock);
 	journal->j_flags |= JBD2_FLUSHED;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index aad986a9f3ff..11c4ec967662 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1567,6 +1567,12 @@ int jbd2_fc_init(journal_t *journal, int num_fc_blks);
 int jbd2_fc_start(journal_t *journal, tid_t tid);
 int jbd2_fc_stop(journal_t *journal);
 int jbd2_fc_stop_do_commit(journal_t *journal, tid_t tid);
+int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out);
+int jbd2_fc_wait_bufs(journal_t *journal, int num_blks);
+int jbd2_fc_release_bufs(journal_t *journal);
+int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode);
+int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode);
+
 /*
  * is_journal_abort
  *
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 4c8b99ec8606..c8cb43189f02 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -95,6 +95,16 @@ TRACE_DEFINE_ENUM(ES_REFERENCED_B);
 	{ FALLOC_FL_COLLAPSE_RANGE,	"COLLAPSE_RANGE"},	\
 	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"})
 
+#define show_fc_reason(reason)						\
+	__print_symbolic(reason,					\
+		{ EXT4_FC_REASON_XATTR,		"XATTR"},		\
+		{ EXT4_FC_REASON_CROSS_RENAME,	"CROSS_RENAME"},	\
+		{ EXT4_FC_REASON_JOURNAL_FLAG_CHANGE, "JOURNAL_FLAG_CHANGE"}, \
+		{ EXT4_FC_REASON_MEM,	"NO_MEM"},			\
+		{ EXT4_FC_REASON_SWAP_BOOT,	"SWAP_BOOT"},		\
+		{ EXT4_FC_REASON_RESIZE,	"RESIZE"},		\
+		{ EXT4_FC_REASON_RENAME_DIR,	"RENAME_DIR"},		\
+		{ EXT4_FC_REASON_FALLOC_RANGE,	"FALLOC_RANGE"})
 
 TRACE_EVENT(ext4_other_inode_update_time,
 	TP_PROTO(struct inode *inode, ino_t orig_ino),
@@ -2791,6 +2801,168 @@ TRACE_EVENT(ext4_lazy_itable_init,
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->group)
 );
 
+TRACE_EVENT(ext4_fc_commit_start,
+	TP_PROTO(struct super_block *sb),
+
+	TP_ARGS(sb),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+	),
+
+	TP_printk("fast_commit started on dev %d,%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev))
+);
+
+TRACE_EVENT(ext4_fc_commit_stop,
+	    TP_PROTO(struct super_block *sb, int nblks, int reason),
+
+	TP_ARGS(sb, nblks, reason),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, nblks)
+		__field(int, reason)
+		__field(int, num_fc)
+		__field(int, num_fc_ineligible)
+		__field(int, nblks_agg)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->nblks = nblks;
+		__entry->reason = reason;
+		__entry->num_fc = EXT4_SB(sb)->s_fc_stats.fc_num_commits;
+		__entry->num_fc_ineligible =
+			EXT4_SB(sb)->s_fc_stats.fc_ineligible_commits;
+		__entry->nblks_agg = EXT4_SB(sb)->s_fc_stats.fc_numblks;
+	),
+
+	TP_printk("fc on [%d,%d] nblks %d, reason %d, fc = %d, ineligible = %d, agg_nblks %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->nblks, __entry->reason, __entry->num_fc,
+		  __entry->num_fc_ineligible, __entry->nblks_agg)
+);
+
+#define FC_REASON_NAME_STAT(reason)					\
+	show_fc_reason(reason),						\
+	__entry->sbi->s_fc_stats.fc_ineligible_reason_count[reason]
+
+TRACE_EVENT(ext4_fc_stats,
+	    TP_PROTO(struct super_block *sb),
+
+	    TP_ARGS(sb),
+
+	    TP_STRUCT__entry(
+		    __field(dev_t, dev)
+		    __field(struct ext4_sb_info *, sbi)
+		    __field(int, count)
+		    ),
+
+	    TP_fast_assign(
+		    __entry->dev = sb->s_dev;
+		    __entry->sbi = EXT4_SB(sb);
+		    ),
+
+	    TP_printk("dev %d:%d fc ineligible reasons:\n"
+		      "%s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s:%d; "
+		      "num_commits:%d, ineligible: %d, numblks: %d",
+		      MAJOR(__entry->dev), MINOR(__entry->dev),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_XATTR),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_CROSS_RENAME),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_JOURNAL_FLAG_CHANGE),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_MEM),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_SWAP_BOOT),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_RESIZE),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_RENAME_DIR),
+		      FC_REASON_NAME_STAT(EXT4_FC_REASON_FALLOC_RANGE),
+		      __entry->sbi->s_fc_stats.fc_num_commits,
+		      __entry->sbi->s_fc_stats.fc_ineligible_commits,
+		      __entry->sbi->s_fc_stats.fc_numblks)
+
+);
+
+#define DEFINE_TRACE_DENTRY_EVENT(__type)				\
+	TRACE_EVENT(ext4_fc_track_##__type,				\
+	    TP_PROTO(struct inode *inode, struct dentry *dentry, int ret), \
+									\
+	    TP_ARGS(inode, dentry, ret),				\
+									\
+	    TP_STRUCT__entry(						\
+		    __field(dev_t, dev)					\
+		    __field(int, ino)					\
+		    __field(int, error)					\
+		    ),							\
+									\
+	    TP_fast_assign(						\
+		    __entry->dev = inode->i_sb->s_dev;			\
+		    __entry->ino = inode->i_ino;			\
+		    __entry->error = ret;				\
+		    ),							\
+									\
+	    TP_printk("dev %d:%d, inode %d, error %d, fc_%s",		\
+		      MAJOR(__entry->dev), MINOR(__entry->dev),		\
+		      __entry->ino, __entry->error,			\
+		      #__type)						\
+	)
+
+DEFINE_TRACE_DENTRY_EVENT(create);
+DEFINE_TRACE_DENTRY_EVENT(link);
+DEFINE_TRACE_DENTRY_EVENT(unlink);
+
+TRACE_EVENT(ext4_fc_track_inode,
+	    TP_PROTO(struct inode *inode, int ret),
+
+	    TP_ARGS(inode, ret),
+
+	    TP_STRUCT__entry(
+		    __field(dev_t, dev)
+		    __field(int, ino)
+		    __field(int, error)
+		    ),
+
+	    TP_fast_assign(
+		    __entry->dev = inode->i_sb->s_dev;
+		    __entry->ino = inode->i_ino;
+		    __entry->error = ret;
+		    ),
+
+	    TP_printk("dev %d:%d, inode %d, error %d",
+		      MAJOR(__entry->dev), MINOR(__entry->dev),
+		      __entry->ino, __entry->error)
+	);
+
+TRACE_EVENT(ext4_fc_track_range,
+	    TP_PROTO(struct inode *inode, long start, long end, int ret),
+
+	    TP_ARGS(inode, start, end, ret),
+
+	    TP_STRUCT__entry(
+		    __field(dev_t, dev)
+		    __field(int, ino)
+		    __field(long, start)
+		    __field(long, end)
+		    __field(int, error)
+		    ),
+
+	    TP_fast_assign(
+		    __entry->dev = inode->i_sb->s_dev;
+		    __entry->ino = inode->i_ino;
+		    __entry->start = start;
+		    __entry->end = end;
+		    __entry->error = ret;
+		    ),
+
+	    TP_printk("dev %d:%d, inode %d, error %d, start %ld, end %ld",
+		      MAJOR(__entry->dev), MINOR(__entry->dev),
+		      __entry->ino, __entry->error, __entry->start,
+		      __entry->end)
+	);
+
 #endif /* _TRACE_EXT4_H */
 
 /* This part must be outside protection */
-- 
2.28.0.681.g6f77f65b4e-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v9 6/9] jbd2: fast commit recovery path
  2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
                   ` (4 preceding siblings ...)
  2020-09-19  0:54 ` [PATCH v9 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
@ 2020-09-19  0:54 ` Harshad Shirwadkar
  2020-09-19  0:54 ` [PATCH v9 7/9] ext4: " Harshad Shirwadkar
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch adds fast commit recovery support in JBD2.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/fast_commit.c | 15 +++++++++++
 fs/jbd2/recovery.c    | 59 ++++++++++++++++++++++++++++++++++++++++---
 include/linux/jbd2.h  | 20 +++++++++++++++
 3 files changed, 90 insertions(+), 4 deletions(-)
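
Note for readers: the contract between fc_do_one_pass() and
j_fc_replay_callback can be sketched in userspace as below. This is
an illustration under stated assumptions, not the kernel code:
fc_replay_walk(), fc_replay_fn, stop_at_end() and fail_on_second() are
hypothetical names, and the callback takes only an opaque context and
a block offset instead of a journal and buffer head. The two constants
mirror the values this patch adds to include/linux/jbd2.h.

```c
#include <stddef.h>

/* Values mirror JBD2_FC_REPLAY_STOP / JBD2_FC_REPLAY_CONTINUE */
#define FC_REPLAY_STOP		0
#define FC_REPLAY_CONTINUE	1

/* Hypothetical stand-in for j_fc_replay_callback */
typedef int (*fc_replay_fn)(void *ctx, int off);

/*
 * Hand blocks first..last (inclusive) to the callback one at a time.
 * Stops on a negative error or on FC_REPLAY_STOP. Returns 0 on a normal
 * end of replay or the negative error; *visited counts callback calls.
 */
static int fc_replay_walk(fc_replay_fn cb, void *ctx,
			  int first, int last, int *visited)
{
	int err = 0;
	int blk;

	*visited = 0;
	for (blk = first; blk <= last; blk++) {
		err = cb(ctx, blk - first);
		(*visited)++;
		if (err < 0 || err == FC_REPLAY_STOP)
			break;
		err = 0;
	}
	return err;
}

/* Example callback: ctx points to the number of valid blocks in the log */
static int stop_at_end(void *ctx, int off)
{
	int valid = *(int *)ctx;

	return off + 1 < valid ? FC_REPLAY_CONTINUE : FC_REPLAY_STOP;
}

/* Example callback: report an error (-5) on the second block */
static int fail_on_second(void *ctx, int off)
{
	(void)ctx;
	return off == 1 ? -5 : FC_REPLAY_CONTINUE;
}
```

Because JBD2_FC_REPLAY_STOP is 0, the stub ext4_fc_replay() added in
this patch (which returns 0) ends replay after the first block.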

diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 1c3e5f39d643..6e251b5682b4 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -1214,8 +1214,23 @@ static void ext4_fc_cleanup(journal_t *journal, int full)
 	trace_ext4_fc_stats(sb);
 }
 
+/*
+ * Main recovery path entry point.
+ */
+static int ext4_fc_replay(journal_t *journal, struct buffer_head *bh,
+				enum passtype pass, int off, tid_t expected_tid)
+{
+	return 0;
+}
+
 void ext4_fc_init(struct super_block *sb, journal_t *journal)
 {
+	/*
+	 * We set replay callback even if fast commit is disabled because we
+	 * could still have fast commit blocks that need to be replayed even if
+	 * fast commit has now been turned off.
+	 */
+	journal->j_fc_replay_callback = ext4_fc_replay;
 	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
 		return;
 	journal->j_fc_cleanup_callback = ext4_fc_cleanup;
diff --git a/fs/jbd2/recovery.c b/fs/jbd2/recovery.c
index faa97d748474..6c6107c8df96 100644
--- a/fs/jbd2/recovery.c
+++ b/fs/jbd2/recovery.c
@@ -35,7 +35,6 @@ struct recovery_info
 	int		nr_revoke_hits;
 };
 
-enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
 static int do_one_pass(journal_t *journal,
 				struct recovery_info *info, enum passtype pass);
 static int scan_revoke_records(journal_t *, struct buffer_head *,
@@ -225,10 +224,53 @@ static int count_tags(journal_t *journal, struct buffer_head *bh)
 /* Make sure we wrap around the log correctly! */
 #define wrap(journal, var)						\
 do {									\
-	if (var >= (journal)->j_last)					\
-		var -= ((journal)->j_last - (journal)->j_first);	\
+	unsigned long _wrap_last =					\
+		jbd2_has_feature_fast_commit(journal) ?			\
+			(journal)->j_last_fc : (journal)->j_last;	\
+									\
+	if (var >= _wrap_last)						\
+		var -= (_wrap_last - (journal)->j_first);		\
 } while (0)
 
+static int fc_do_one_pass(journal_t *journal,
+			  struct recovery_info *info, enum passtype pass)
+{
+	unsigned int expected_commit_id = info->end_transaction;
+	unsigned long next_fc_block;
+	struct buffer_head *bh;
+	unsigned int seq;
+	int err = 0;
+
+	next_fc_block = journal->j_first_fc;
+	if (!journal->j_fc_replay_callback)
+		return 0;
+
+	while (next_fc_block <= journal->j_last_fc) {
+		jbd_debug(3, "Fast commit replay: next block %ld",
+			  next_fc_block);
+		err = jread(&bh, journal, next_fc_block);
+		if (err) {
+			jbd_debug(3, "Fast commit replay: read error");
+			break;
+		}
+
+		jbd_debug(3, "Processing fast commit blk with seq %d",
+			  seq);
+		err = journal->j_fc_replay_callback(journal, bh, pass,
+					next_fc_block - journal->j_first_fc,
+					expected_commit_id);
+		next_fc_block++;
+		if (err < 0 || err == JBD2_FC_REPLAY_STOP)
+			break;
+		err = 0;
+	}
+
+	if (err)
+		jbd_debug(3, "Fast commit replay failed, err = %d\n", err);
+
+	return err;
+}
+
 /**
  * jbd2_journal_recover - recovers a on-disk journal
  * @journal: the journal to recover
@@ -470,7 +512,9 @@ static int do_one_pass(journal_t *journal,
 				break;
 
 		jbd_debug(2, "Scanning for sequence ID %u at %lu/%lu\n",
-			  next_commit_ID, next_log_block, journal->j_last);
+			  next_commit_ID, next_log_block,
+			  jbd2_has_feature_fast_commit(journal) ?
+			  journal->j_last_fc : journal->j_last);
 
 		/* Skip over each chunk of the transaction looking
 		 * either the next descriptor block or the final commit
@@ -777,6 +821,13 @@ static int do_one_pass(journal_t *journal,
 				success = -EIO;
 		}
 	}
+
+	if (jbd2_has_feature_fast_commit(journal) && pass != PASS_REVOKE) {
+		err = fc_do_one_pass(journal, info, pass);
+		if (err)
+			success = err;
+	}
+
 	if (block_error && success == 0)
 		success = -EIO;
 	return success;
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 11c4ec967662..8e1849d81a7f 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -748,6 +748,11 @@ jbd2_time_diff(unsigned long start, unsigned long end)
 
 #define JBD2_NR_BATCH	64
 
+enum passtype {PASS_SCAN, PASS_REVOKE, PASS_REPLAY};
+
+#define JBD2_FC_REPLAY_STOP	0
+#define JBD2_FC_REPLAY_CONTINUE	1
+
 /**
  * struct journal_s - The journal_s type is the concrete type associated with
  *     journal_t.
@@ -1224,6 +1229,21 @@ struct journal_s
 	 */
 	void (*j_fc_cleanup_callback)(struct journal_s *journal, int);
 
+	/*
+	 * @j_fc_replay_callback:
+	 *
+	 * File-system specific function that performs replay of a fast
+	 * commit. JBD2 calls this function for each fast commit block found in
+	 * the journal. This function should return JBD2_FC_REPLAY_CONTINUE
+	 * to indicate that the block was processed correctly and that fast
+	 * commit replay should continue. Return value of JBD2_FC_REPLAY_STOP
+	 * indicates the end of replay (no more blocks remaining). A negative
+	 * return value indicates error.
+	 */
+	int (*j_fc_replay_callback)(struct journal_s *journal,
+				    struct buffer_head *bh,
+				    enum passtype pass, int off,
+				    tid_t expected_commit_id);
 };
 
 #define jbd2_might_wait_for_commit(j) \
-- 
2.28.0.681.g6f77f65b4e-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v9 7/9] ext4: fast commit recovery path
  2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
                   ` (5 preceding siblings ...)
  2020-09-19  0:54 ` [PATCH v9 6/9] jbd2: fast commit recovery path Harshad Shirwadkar
@ 2020-09-19  0:54 ` Harshad Shirwadkar
  2020-09-19 14:15   ` kernel test robot
  2020-10-09 17:14   ` Ritesh Harjani
  2020-09-19  0:54 ` [PATCH v9 8/9] ext4: add a mount opt to forcefully turn fast commits on Harshad Shirwadkar
  2020-09-19  0:54 ` [PATCH v9 9/9] ext4: add fast commit stats in procfs Harshad Shirwadkar
  8 siblings, 2 replies; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This patch adds fast commit recovery path support for the Ext4 file
system. We add several helper functions that are similar in spirit to
the e2fsprogs journal recovery path handlers. Examples of such
functions include a simple block allocator and an idempotent block
bitmap update function. Using these routines and the fast commit log
in the fast commit area, the recovery path (ext4_fc_replay()) performs
fast commit log recovery.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/balloc.c            |   7 +-
 fs/ext4/ext4.h              |  26 ++
 fs/ext4/ext4_jbd2.c         |   2 +-
 fs/ext4/extents.c           | 261 +++++++++++
 fs/ext4/extents_status.c    |  24 +
 fs/ext4/fast_commit.c       | 881 +++++++++++++++++++++++++++++++++++-
 fs/ext4/fast_commit.h       |  40 ++
 fs/ext4/ialloc.c            | 165 ++++++-
 fs/ext4/inode.c             |  89 ++--
 fs/ext4/ioctl.c             |   6 +-
 fs/ext4/mballoc.c           | 208 ++++++++-
 fs/ext4/namei.c             | 149 +++---
 fs/ext4/super.c             |  21 +
 include/trace/events/ext4.h |  56 ++-
 14 files changed, 1804 insertions(+), 131 deletions(-)
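
Note for readers: the commit message mentions an idempotent block
bitmap update function. The following is a minimal userspace sketch of
what idempotent means here, assuming a plain byte-array bitmap.
mark_range() is a hypothetical helper whose arguments loosely follow
the (block, len, state) shape of the ext4_mb_mark_bb() declaration in
this patch; it is not that function's implementation.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: force bits [start, start + len) to `state`.
 * Bits are assigned rather than toggled, so applying the same update
 * twice leaves the bitmap exactly as it was after the first time.
 * That property lets a replay be restarted after a crash in the
 * middle of recovery.
 */
static void mark_range(uint8_t *bitmap, size_t start, size_t len, int state)
{
	size_t i;

	for (i = start; i < start + len; i++) {
		if (state)
			bitmap[i / 8] |= (uint8_t)(1u << (i % 8));
		else
			bitmap[i / 8] &= (uint8_t)~(1u << (i % 8));
	}
}
```

A toggling (XOR) update would not have this property: replaying the
same record twice would undo the first application.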

diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index 48c3df47748d..77108c99ae90 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -368,7 +368,12 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
 				      struct buffer_head *bh)
 {
 	ext4_fsblk_t	blk;
-	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+	struct ext4_group_info *grp;
+
+	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
+	grp = ext4_get_group_info(sb, block_group);
 
 	if (buffer_verified(bh))
 		return 0;
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 27d48d166e5d..372a38292ed1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1167,6 +1167,7 @@ struct ext4_inode_info {
 #define EXT4_FC_COMMITTING		0x0010	/* File system underoing a fast
 						 * commit.
 						 */
+#define EXT4_FC_REPLAY			0x0020	/* Fast commit replay ongoing */
 
 /*
  * Misc. filesystem flags
@@ -1658,6 +1659,10 @@ struct ext4_sb_info {
 	struct buffer_head *s_fc_bh;
 	struct ext4_fc_stats s_fc_stats;
 	u64 s_fc_avg_commit_time;
+#ifdef CONFIG_EXT4_DEBUG
+	int s_fc_debug_max_replay;
+#endif
+	struct ext4_fc_replay_state s_fc_replay_state;
 };
 
 static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
@@ -2700,6 +2705,7 @@ extern int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
 			  struct dx_hash_info *hinfo);
 
 /* ialloc.c */
+extern int ext4_mark_inode_used(struct super_block *sb, int ino);
 extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t,
 				      const struct qstr *qstr, __u32 goal,
 				      uid_t *owner, __u32 i_flags,
@@ -2741,6 +2747,8 @@ void ext4_fc_stop_ineligible(struct super_block *sb);
 void ext4_fc_start_update(struct inode *inode);
 void ext4_fc_stop_update(struct inode *inode);
 void ext4_fc_del(struct inode *inode);
+bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t block);
+void ext4_fc_replay_cleanup(struct super_block *sb);
 int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
 int __init ext4_fc_init_dentry_cache(void);
 
@@ -2773,8 +2781,12 @@ extern int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
 				ext4_fsblk_t block, unsigned long count);
 extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
 extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
+extern void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
+		       int len, int state);
 
 /* inode.c */
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+			 struct ext4_inode_info *ei);
 int ext4_inode_is_fast_symlink(struct inode *inode);
 struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
 struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
@@ -2821,6 +2833,8 @@ extern int  ext4_sync_inode(handle_t *, struct inode *);
 extern void ext4_dirty_inode(struct inode *, int);
 extern int ext4_change_inode_journal_flag(struct inode *, int);
 extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
+extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+			  struct ext4_iloc *iloc);
 extern int ext4_inode_attach_jinode(struct inode *inode);
 extern int ext4_can_truncate(struct inode *inode);
 extern int ext4_truncate(struct inode *);
@@ -2854,12 +2868,15 @@ extern int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
 /* ioctl.c */
 extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
 extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
+extern void ext4_reset_inode_seed(struct inode *inode);
 
 /* migrate.c */
 extern int ext4_ext_migrate(struct inode *);
 extern int ext4_ind_migrate(struct inode *inode);
 
 /* namei.c */
+extern int ext4_init_new_dir(handle_t *handle, struct inode *dir,
+			     struct inode *inode);
 extern int ext4_dirblock_csum_verify(struct inode *inode,
 				     struct buffer_head *bh);
 extern int ext4_orphan_add(handle_t *, struct inode *);
@@ -3426,6 +3443,10 @@ extern int ext4_handle_dirty_dirblock(handle_t *handle, struct inode *inode,
 extern int ext4_ci_compare(const struct inode *parent,
 			   const struct qstr *fname,
 			   const struct qstr *entry, bool quick);
+extern int __ext4_unlink(struct inode *dir, const struct qstr *d_name,
+			 struct inode *inode);
+extern int __ext4_link(struct inode *dir, struct inode *inode,
+		       struct dentry *dentry);
 
 #define S_SHIFT 12
 static const unsigned char ext4_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = {
@@ -3526,6 +3547,11 @@ extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
 extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
 				       int check_cred, int restart_cred,
 				       int revoke_cred);
+extern void ext4_ext_replay_shrink_inode(struct inode *inode, ext4_lblk_t end);
+extern int ext4_ext_replay_set_iblocks(struct inode *inode);
+extern int ext4_ext_replay_update_ex(struct inode *inode, ext4_lblk_t start,
+		int len, int unwritten, ext4_fsblk_t pblk);
+extern int ext4_ext_clear_bb(struct inode *inode);
 
 
 /* move_extent.c */
diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index 760b9ee49dc0..0fd0c42a4f7d 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -100,7 +100,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
 		return ERR_PTR(err);
 
 	journal = EXT4_SB(sb)->s_journal;
-	if (!journal)
+	if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
 		return ext4_get_nojournal();
 	return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
 				   GFP_NOFS, type, line);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 8de236fedade..29945f1172fc 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5804,3 +5804,264 @@ int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu)
 
 	return err ? err : mapped;
 }
+
+/*
+ * Updates physical block address and unwritten status of extent starting at
+ * lblk start and of len. If such an extent doesn't exist, this function
+ * splits the extent tree appropriately to create an extent like this.
+ * This function is called in Ext4 fast commit replay path. Returns 0 on success
+ * and error on failure.
+ */
+int ext4_ext_replay_update_ex(struct inode *inode, ext4_lblk_t start,
+		int len, int unwritten, ext4_fsblk_t pblk)
+{
+	struct ext4_ext_path *path = NULL, *ppath;
+	struct ext4_extent *ex;
+	int ret;
+
+	path = ext4_find_extent(inode, start, NULL, 0);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+	ex = path[path->p_depth].p_ext;
+	if (!ex) {
+		ret = -EFSCORRUPTED;
+		goto out;
+	}
+
+	if (le32_to_cpu(ex->ee_block) != start ||
+		ext4_ext_get_actual_len(ex) != len) {
+		/* We need to split this extent to match our extent first */
+		ppath = path;
+		down_write(&EXT4_I(inode)->i_data_sem);
+		ret = ext4_force_split_extent_at(NULL, inode, &ppath, start, 1);
+		up_write(&EXT4_I(inode)->i_data_sem);
+		if (ret)
+			goto out;
+		kfree(path);
+		path = ext4_find_extent(inode, start, NULL, 0);
+		if (IS_ERR(path))
+			return PTR_ERR(path);
+		ppath = path;
+		ex = path[path->p_depth].p_ext;
+		WARN_ON(le32_to_cpu(ex->ee_block) != start);
+		if (ext4_ext_get_actual_len(ex) != len) {
+			down_write(&EXT4_I(inode)->i_data_sem);
+			ret = ext4_force_split_extent_at(NULL, inode, &ppath,
+							 start + len, 1);
+			up_write(&EXT4_I(inode)->i_data_sem);
+			if (ret)
+				goto out;
+			kfree(path);
+			path = ext4_find_extent(inode, start, NULL, 0);
+			if (IS_ERR(path))
+				return -EINVAL;
+			ex = path[path->p_depth].p_ext;
+		}
+	}
+	if (unwritten)
+		ext4_ext_mark_unwritten(ex);
+	else
+		ext4_ext_mark_initialized(ex);
+	ext4_ext_store_pblock(ex, pblk);
+	down_write(&EXT4_I(inode)->i_data_sem);
+	ret = ext4_ext_dirty(NULL, inode, &path[path->p_depth]);
+	up_write(&EXT4_I(inode)->i_data_sem);
+out:
+	ext4_ext_drop_refs(path);
+	kfree(path);
+	ext4_mark_inode_dirty(NULL, inode);
+	return ret;
+}
+
+/* Try to shrink the extent tree */
+void ext4_ext_replay_shrink_inode(struct inode *inode, ext4_lblk_t end)
+{
+	struct ext4_ext_path *path = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t old_cur, cur = 0;
+
+	while (cur < end) {
+		path = ext4_find_extent(inode, cur, NULL, 0);
+		if (IS_ERR(path))
+			return;
+		ex = path[path->p_depth].p_ext;
+		if (!ex) {
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			ext4_mark_inode_dirty(NULL, inode);
+			return;
+		}
+		old_cur = cur;
+		cur = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
+		if (cur <= old_cur)
+			cur = old_cur + 1;
+		ext4_ext_try_to_merge(NULL, inode, path, ex);
+		down_write(&EXT4_I(inode)->i_data_sem);
+		ext4_ext_dirty(NULL, inode, &path[path->p_depth]);
+		up_write(&EXT4_I(inode)->i_data_sem);
+		ext4_mark_inode_dirty(NULL, inode);
+		ext4_ext_drop_refs(path);
+		kfree(path);
+	}
+}
+
+/* Check if *cur is a hole and if it is, skip it */
+static void skip_hole(struct inode *inode, ext4_lblk_t *cur)
+{
+	int ret;
+	struct ext4_map_blocks map;
+
+	map.m_lblk = *cur;
+	map.m_len = ((inode->i_size) >> inode->i_sb->s_blocksize_bits) - *cur;
+
+	ret = ext4_map_blocks(NULL, inode, &map, 0);
+	if (ret != 0)
+		return;
+	*cur = *cur + map.m_len;
+}
+
+/* Count number of blocks used by this inode and update i_blocks */
+int ext4_ext_replay_set_iblocks(struct inode *inode)
+{
+	struct ext4_ext_path *path = NULL, *path2 = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t cur = 0, end;
+	int numblks = 0, i, ret = 0;
+	ext4_fsblk_t cmp1, cmp2;
+	struct ext4_map_blocks map;
+
+	/* Determine the size of the file first */
+	path = ext4_find_extent(inode, EXT_MAX_BLOCKS - 1, NULL,
+					EXT4_EX_NOCACHE);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+	ex = path[path->p_depth].p_ext;
+	if (!ex) {
+		ext4_ext_drop_refs(path);
+		kfree(path);
+		goto out;
+	}
+	end = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
+	ext4_ext_drop_refs(path);
+	kfree(path);
+
+	/* Count the number of data blocks */
+	cur = 0;
+	while (cur < end) {
+		map.m_lblk = cur;
+		map.m_len = end - cur;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0)
+			numblks += ret;
+		cur = cur + map.m_len;
+	}
+
+	/*
+	 * Count the number of extent tree blocks. We do it by looking up
+	 * two successive extents and determining the difference between
+	 * their paths. When path is different for 2 successive extents
+	 * we compare the blocks in the path at each level and increment
+	 * iblocks by total number of differences found.
+	 */
+	cur = 0;
+	skip_hole(inode, &cur);
+	path = ext4_find_extent(inode, cur, NULL, 0);
+	if (IS_ERR(path))
+		goto out;
+	numblks += path->p_depth;
+	ext4_ext_drop_refs(path);
+	kfree(path);
+	while (cur < end) {
+		path = ext4_find_extent(inode, cur, NULL, 0);
+		if (IS_ERR(path))
+			break;
+		ex = path[path->p_depth].p_ext;
+		if (!ex) {
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			return 0;
+		}
+		cur = max(cur + 1, le32_to_cpu(ex->ee_block) +
+					ext4_ext_get_actual_len(ex));
+		skip_hole(inode, &cur);
+
+		path2 = ext4_find_extent(inode, cur, NULL, 0);
+		if (IS_ERR(path2)) {
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			break;
+		}
+		ex = path2[path2->p_depth].p_ext;
+		for (i = 0; i <= max(path->p_depth, path2->p_depth); i++) {
+			cmp1 = cmp2 = 0;
+			if (i <= path->p_depth)
+				cmp1 = path[i].p_bh ?
+					path[i].p_bh->b_blocknr : 0;
+			if (i <= path2->p_depth)
+				cmp2 = path2[i].p_bh ?
+					path2[i].p_bh->b_blocknr : 0;
+			if (cmp1 != cmp2 && cmp2 != 0)
+				numblks++;
+		}
+		ext4_ext_drop_refs(path);
+		ext4_ext_drop_refs(path2);
+		kfree(path);
+		kfree(path2);
+	}
+
+out:
+	inode->i_blocks = numblks << (inode->i_sb->s_blocksize_bits - 9);
+	ext4_mark_inode_dirty(NULL, inode);
+	return 0;
+}
+
+int ext4_ext_clear_bb(struct inode *inode)
+{
+	struct ext4_ext_path *path = NULL;
+	struct ext4_extent *ex;
+	ext4_lblk_t cur = 0, end;
+	int j, ret = 0;
+	struct ext4_map_blocks map;
+
+	/* Determine the size of the file first */
+	path = ext4_find_extent(inode, EXT_MAX_BLOCKS - 1, NULL,
+					EXT4_EX_NOCACHE);
+	if (IS_ERR(path))
+		return PTR_ERR(path);
+	ex = path[path->p_depth].p_ext;
+	if (!ex) {
+		ext4_ext_drop_refs(path);
+		kfree(path);
+		return 0;
+	}
+	end = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
+	ext4_ext_drop_refs(path);
+	kfree(path);
+
+	cur = 0;
+	while (cur < end) {
+		map.m_lblk = cur;
+		map.m_len = end - cur;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			path = ext4_find_extent(inode, map.m_lblk, NULL, 0);
+			if (!IS_ERR_OR_NULL(path)) {
+				for (j = 0; j < path->p_depth; j++) {
+
+					ext4_mb_mark_bb(inode->i_sb,
+							path[j].p_block, 1, 0);
+				}
+				ext4_ext_drop_refs(path);
+				kfree(path);
+			}
+			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
+		}
+		cur = cur + map.m_len;
+	}
+
+	return 0;
+}
diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
index e75171535375..0a729027322d 100644
--- a/fs/ext4/extents_status.c
+++ b/fs/ext4/extents_status.c
@@ -311,6 +311,9 @@ void ext4_es_find_extent_range(struct inode *inode,
 			       ext4_lblk_t lblk, ext4_lblk_t end,
 			       struct extent_status *es)
 {
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return;
+
 	trace_ext4_es_find_extent_range_enter(inode, lblk);
 
 	read_lock(&EXT4_I(inode)->i_es_lock);
@@ -361,6 +364,9 @@ bool ext4_es_scan_range(struct inode *inode,
 {
 	bool ret;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return false;
+
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	ret = __es_scan_range(inode, matching_fn, lblk, end);
 	read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -404,6 +410,9 @@ bool ext4_es_scan_clu(struct inode *inode,
 {
 	bool ret;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return false;
+
 	read_lock(&EXT4_I(inode)->i_es_lock);
 	ret = __es_scan_clu(inode, matching_fn, lblk);
 	read_unlock(&EXT4_I(inode)->i_es_lock);
@@ -812,6 +821,9 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
 	int err = 0;
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
 	es_debug("add [%u/%u) %llu %x to extent status tree of inode %lu\n",
 		 lblk, len, pblk, status, inode->i_ino);
 
@@ -873,6 +885,9 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct extent_status newes;
 	ext4_lblk_t end = lblk + len - 1;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return;
+
 	newes.es_lblk = lblk;
 	newes.es_len = len;
 	ext4_es_store_pblock_status(&newes, pblk, status);
@@ -908,6 +923,9 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
 	struct rb_node *node;
 	int found = 0;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
 	trace_ext4_es_lookup_extent_enter(inode, lblk);
 	es_debug("lookup extent in block %u\n", lblk);
 
@@ -1419,6 +1437,9 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
 	int err = 0;
 	int reserved = 0;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
 	trace_ext4_es_remove_extent(inode, lblk, len);
 	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
 		 lblk, len, inode->i_ino);
@@ -1969,6 +1990,9 @@ int ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
 	struct extent_status newes;
 	int err = 0;
 
+	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
 	es_debug("add [%u/1) delayed to extent status tree of inode %lu\n",
 		 lblk, inode->i_ino);
 
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 6e251b5682b4..63429076ad59 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -170,7 +170,8 @@ void ext4_fc_start_update(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
 		return;
 
 restart:
@@ -209,7 +210,8 @@ void ext4_fc_stop_update(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
 		return;
 
 	if (atomic_dec_and_test(&ei->i_fc_updates))
@@ -224,11 +226,8 @@ void ext4_fc_del(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
-		return;
-
-
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
 		return;
 
 restart:
@@ -270,6 +269,10 @@ void ext4_fc_mark_ineligible(struct super_block *sb, int reason)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
 	sbi->s_mount_state |= EXT4_FC_INELIGIBLE;
 	WARN_ON(reason >= EXT4_FC_REASON_MAX);
 	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
@@ -283,6 +286,10 @@ void ext4_fc_start_ineligible(struct super_block *sb, int reason)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
 	WARN_ON(reason >= EXT4_FC_REASON_MAX);
 	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
 	atomic_inc(&sbi->s_fc_ineligible_updates);
@@ -295,6 +302,10 @@ void ext4_fc_start_ineligible(struct super_block *sb, int reason)
  */
 void ext4_fc_stop_ineligible(struct super_block *sb)
 {
+	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
+	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
+		return;
+
 	EXT4_SB(sb)->s_mount_state |= EXT4_FC_INELIGIBLE;
 	atomic_dec(&EXT4_SB(sb)->s_fc_ineligible_updates);
 }
@@ -325,7 +336,8 @@ static int ext4_fc_track_template(
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
 	int ret;
 
-	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
+	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
+	    (sbi->s_mount_state & EXT4_FC_REPLAY))
 		return -EOPNOTSUPP;
 
 	if (ext4_fc_is_ineligible(inode->i_sb))
@@ -1214,13 +1226,864 @@ static void ext4_fc_cleanup(journal_t *journal, int full)
 	trace_ext4_fc_stats(sb);
 }
 
+/* Ext4 Replay Path Routines */
+
+/* Get length of a particular tlv */
+static inline int ext4_fc_tag_len(struct ext4_fc_tl *tl)
+{
+	return le16_to_cpu(tl->fc_len);
+}
+
+/* Get a pointer to "value" of a tlv */
+static inline u8 *ext4_fc_tag_val(struct ext4_fc_tl *tl)
+{
+	return (u8 *)tl + sizeof(*tl);
+}
+
+/* Helper struct for dentry replay routines */
+struct dentry_info_args {
+	int parent_ino, dname_len, ino, inode_len;
+	char *dname;
+};
+
+static inline void tl_to_darg(struct dentry_info_args *darg,
+				struct  ext4_fc_tl *tl)
+{
+	struct ext4_fc_dentry_info *fcd;
+
+	fcd = (struct ext4_fc_dentry_info *)ext4_fc_tag_val(tl);
+
+	darg->parent_ino = le32_to_cpu(fcd->fc_parent_ino);
+	darg->ino = le32_to_cpu(fcd->fc_ino);
+	darg->dname = fcd->fc_dname;
+	darg->dname_len = ext4_fc_tag_len(tl) -
+			sizeof(struct ext4_fc_dentry_info);
+}
+
+/* Unlink replay function */
+static int ext4_fc_replay_unlink(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	struct inode *inode, *old_parent;
+	struct qstr entry;
+	struct dentry_info_args darg;
+	int ret = 0;
+
+	tl_to_darg(&darg, tl);
+
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_UNLINK, darg.ino,
+			darg.parent_ino, darg.dname_len);
+
+	entry.name = darg.dname;
+	entry.len = darg.dname_len;
+	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
+
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode %d not found", darg.ino);
+		return 0;
+	}
+
+	old_parent = ext4_iget(sb, darg.parent_ino,
+				EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(old_parent)) {
+		jbd_debug(1, "Dir with inode %d not found", darg.parent_ino);
+		iput(inode);
+		return 0;
+	}
+
+	ret = __ext4_unlink(old_parent, &entry, inode);
+	/* -ENOENT is fine: the entry might not exist anymore. */
+	if (ret == -ENOENT)
+		ret = 0;
+	iput(old_parent);
+	iput(inode);
+	return ret;
+}
+
+static int ext4_fc_replay_link_internal(struct super_block *sb,
+				struct dentry_info_args *darg,
+				struct inode *inode)
+{
+	struct inode *dir = NULL;
+	struct dentry *dentry_dir = NULL, *dentry_inode = NULL;
+	struct qstr qstr_dname = QSTR_INIT(darg->dname, darg->dname_len);
+	int ret = 0;
+
+	dir = ext4_iget(sb, darg->parent_ino, EXT4_IGET_NORMAL);
+	if (IS_ERR(dir)) {
+		jbd_debug(1, "Dir with inode %d not found.", darg->parent_ino);
+		dir = NULL;
+		goto out;
+	}
+
+	dentry_dir = d_obtain_alias(dir);
+	if (IS_ERR(dentry_dir)) {
+		jbd_debug(1, "Failed to obtain dentry");
+		dentry_dir = NULL;
+		goto out;
+	}
+
+	dentry_inode = d_alloc(dentry_dir, &qstr_dname);
+	if (!dentry_inode) {
+		jbd_debug(1, "Inode dentry not created.");
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = __ext4_link(dir, inode, dentry_inode);
+	/*
+	 * It's possible that link already existed since data blocks
+	 * for the dir in question got persisted before we crashed OR
+	 * we replayed this tag and crashed before the entire replay
+	 * could complete.
+	 */
+	if (ret && ret != -EEXIST) {
+		jbd_debug(1, "Failed to link\n");
+		goto out;
+	}
+
+	ret = 0;
+out:
+	if (dentry_dir) {
+		d_drop(dentry_dir);
+		dput(dentry_dir);
+	} else if (dir) {
+		iput(dir);
+	}
+	if (dentry_inode) {
+		d_drop(dentry_inode);
+		dput(dentry_inode);
+	}
+
+	return ret;
+}
+
+/* Link replay function */
+static int ext4_fc_replay_link(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	struct inode *inode;
+	struct dentry_info_args darg;
+	int ret = 0;
+
+	tl_to_darg(&darg, tl);
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_LINK, darg.ino,
+			darg.parent_ino, darg.dname_len);
+
+	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode not found.");
+		return 0;
+	}
+
+	ret = ext4_fc_replay_link_internal(sb, &darg, inode);
+	iput(inode);
+	return ret;
+}
+
+/*
+ * Record all the modified inodes during replay. We use this later to setup
+ * block bitmaps correctly.
+ */
+static int ext4_fc_record_modified_inode(struct super_block *sb, int ino)
+{
+	struct ext4_fc_replay_state *state;
+	int i;
+
+	state = &EXT4_SB(sb)->s_fc_replay_state;
+	for (i = 0; i < state->fc_modified_inodes_used; i++)
+		if (state->fc_modified_inodes[i] == ino)
+			return 0;
+	if (state->fc_modified_inodes_used == state->fc_modified_inodes_size) {
+		state->fc_modified_inodes_size +=
+			EXT4_FC_REPLAY_REALLOC_INCREMENT;
+		state->fc_modified_inodes = krealloc(
+					state->fc_modified_inodes, sizeof(int) *
+					state->fc_modified_inodes_size,
+					GFP_KERNEL);
+		if (!state->fc_modified_inodes)
+			return -ENOMEM;
+	}
+	state->fc_modified_inodes[state->fc_modified_inodes_used++] = ino;
+	return 0;
+}
+
+/*
+ * Inode replay function
+ *
+ * If the tag is EXT4_FC_TAG_INODE_FULL, copy the entire inode to its location.
+ * If the tag is EXT4_FC_TAG_INODE_PARTIAL, copy everything except i_block.
+ * This is useful if i_block has been modified due to previous ADD_RANGE /
+ * DEL_RANGE tags.
+ */
+static int ext4_fc_replay_inode(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	struct ext4_fc_inode *fc_inode;
+	u8 *raw_fc_inode;
+	struct inode *inode = NULL;
+	struct ext4_iloc iloc;
+	int inode_len, ino, ret, tag = le16_to_cpu(tl->fc_tag);
+
+	fc_inode = (struct ext4_fc_inode *)ext4_fc_tag_val(tl);
+
+	ino = le32_to_cpu(fc_inode->fc_ino);
+	trace_ext4_fc_replay(sb, tag, ino, 0, 0);
+
+	inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
+	if (!IS_ERR_OR_NULL(inode)) {
+		ext4_ext_clear_bb(inode);
+		iput(inode);
+	}
+
+	ext4_fc_record_modified_inode(sb, ino);
+
+	raw_fc_inode = fc_inode->fc_raw_inode;
+	ret = ext4_get_fc_inode_loc(sb, ino, &iloc);
+	if (ret)
+		goto out;
+
+	inode_len = ext4_fc_tag_len(tl) - sizeof(struct ext4_fc_inode);
+
+	if (tag == EXT4_FC_TAG_INODE_FULL) {
+		memcpy(ext4_raw_inode(&iloc), raw_fc_inode, inode_len);
+	} else {
+		memcpy(ext4_raw_inode(&iloc), raw_fc_inode,
+			offsetof(struct ext4_inode, i_block));
+		memcpy(&ext4_raw_inode(&iloc)->i_generation,
+			&((struct ext4_inode *)(raw_fc_inode))->i_generation,
+			inode_len -
+			offsetof(struct ext4_inode, i_generation));
+	}
+
+	/* Immediately update the inode on disk. */
+	ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh);
+	sync_dirty_buffer(iloc.bh);
+
+	ret = ext4_mark_inode_used(sb, ino);
+	if (ret)
+		goto out;
+
+	/* Given that we just wrote the inode on disk, this SHOULD succeed. */
+	inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode not found.");
+		return -EFSCORRUPTED;
+	}
+
+	/*
+	 * Our allocator could have made different decisions than before
+	 * crashing. This should be fixed but until then, we recalculate
+	 * the number of blocks used by the inode.
+	 */
+	if (tag == EXT4_FC_TAG_INODE_PARTIAL)
+		ext4_ext_replay_set_iblocks(inode);
+
+	inode->i_generation = le32_to_cpu(ext4_raw_inode(&iloc)->i_generation);
+	ext4_reset_inode_seed(inode);
+
+	ext4_inode_csum_set(inode, ext4_raw_inode(&iloc), EXT4_I(inode));
+	ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh);
+	sync_dirty_buffer(iloc.bh);
+	brelse(iloc.bh);
+out:
+	iput(inode);
+	if (!ret)
+		blkdev_issue_flush(sb->s_bdev, GFP_KERNEL);
+
+	return 0;
+}
+
+/*
+ * Dentry create replay function.
+ *
+ * EXT4_FC_TAG_CREAT is preceded by EXT4_FC_TAG_INODE_FULL, which means the
+ * inode for which we are trying to create a dentry here should already have
+ * been replayed before we start here.
+ */
+static int ext4_fc_replay_create(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	int ret = 0;
+	struct inode *inode = NULL;
+	struct inode *dir = NULL;
+	struct dentry_info_args darg;
+
+	tl_to_darg(&darg, tl);
+
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_CREAT, darg.ino,
+			darg.parent_ino, darg.dname_len);
+
+	/* This takes care of updating the group descriptor and other metadata */
+	ret = ext4_mark_inode_used(sb, darg.ino);
+	if (ret)
+		goto out;
+
+	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "inode %d not found.", darg.ino);
+		inode = NULL;
+		ret = -EINVAL;
+		goto out;
+	}
+
+	if (S_ISDIR(inode->i_mode)) {
+		/*
+		 * If we are creating a directory, we need to make sure that the
+		 * dot and dot dot dirents are setup properly.
+		 */
+		dir = ext4_iget(sb, darg.parent_ino, EXT4_IGET_NORMAL);
+		if (IS_ERR_OR_NULL(dir)) {
+			jbd_debug(1, "Dir %d not found.", darg.parent_ino);
+			goto out;
+		}
+		ret = ext4_init_new_dir(NULL, dir, inode);
+		iput(dir);
+		if (ret) {
+			ret = 0;
+			goto out;
+		}
+	}
+	ret = ext4_fc_replay_link_internal(sb, &darg, inode);
+	if (ret)
+		goto out;
+	set_nlink(inode, 1);
+	ext4_mark_inode_dirty(NULL, inode);
+out:
+	if (inode)
+		iput(inode);
+	return ret;
+}
+
+/*
+ * Record physical disk regions which are in use as per fast commit area. Our
+ * simple replay phase allocator excludes these regions from allocation.
+ */
+static int ext4_fc_record_regions(struct super_block *sb, int ino,
+		ext4_lblk_t lblk, ext4_fsblk_t pblk, int len)
+{
+	struct ext4_fc_replay_state *state;
+	struct ext4_fc_alloc_region *region;
+
+	state = &EXT4_SB(sb)->s_fc_replay_state;
+	if (state->fc_regions_used == state->fc_regions_size) {
+		state->fc_regions_size +=
+			EXT4_FC_REPLAY_REALLOC_INCREMENT;
+		state->fc_regions = krealloc(
+					state->fc_regions,
+					state->fc_regions_size *
+					sizeof(struct ext4_fc_alloc_region),
+					GFP_KERNEL);
+		if (!state->fc_regions)
+			return -ENOMEM;
+	}
+	region = &state->fc_regions[state->fc_regions_used++];
+	region->ino = ino;
+	region->lblk = lblk;
+	region->pblk = pblk;
+	region->len = len;
+
+	return 0;
+}
+
+/* Replay add range tag */
+static int ext4_fc_replay_add_range(struct super_block *sb,
+				struct ext4_fc_tl *tl)
+{
+	struct ext4_fc_add_range *fc_add_ex;
+	struct ext4_extent newex, *ex;
+	struct inode *inode;
+	ext4_lblk_t start, cur;
+	int remaining, len;
+	ext4_fsblk_t start_pblk;
+	struct ext4_map_blocks map;
+	struct ext4_ext_path *path = NULL;
+	int ret;
+
+	fc_add_ex = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl);
+	ex = (struct ext4_extent *)&fc_add_ex->fc_ex;
+
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_ADD_RANGE,
+		le32_to_cpu(fc_add_ex->fc_ino), le32_to_cpu(ex->ee_block),
+		ext4_ext_get_actual_len(ex));
+
+	inode = ext4_iget(sb, le32_to_cpu(fc_add_ex->fc_ino),
+				EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode not found.");
+		return 0;
+	}
+
+	ret = ext4_fc_record_modified_inode(sb, inode->i_ino);
+
+	start = le32_to_cpu(ex->ee_block);
+	start_pblk = ext4_ext_pblock(ex);
+	len = ext4_ext_get_actual_len(ex);
+
+	cur = start;
+	remaining = len;
+	jbd_debug(1, "ADD_RANGE, lblk %d, pblk %lld, len %d, unwritten %d, inode %ld\n",
+		  start, start_pblk, len, ext4_ext_is_unwritten(ex),
+		  inode->i_ino);
+
+	while (remaining > 0) {
+		map.m_lblk = cur;
+		map.m_len = remaining;
+		map.m_pblk = 0;
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+
+		if (ret < 0) {
+			iput(inode);
+			return 0;
+		}
+
+		if (ret == 0) {
+			/* Range not mapped */
+			path = ext4_find_extent(inode, cur, NULL, 0);
+			if (IS_ERR(path))
+				break;
+			memset(&newex, 0, sizeof(newex));
+			newex.ee_block = cpu_to_le32(cur);
+			ext4_ext_store_pblock(
+				&newex, start_pblk + cur - start);
+			newex.ee_len = cpu_to_le16(map.m_len);
+			if (ext4_ext_is_unwritten(ex))
+				ext4_ext_mark_unwritten(&newex);
+			down_write(&EXT4_I(inode)->i_data_sem);
+			ret = ext4_ext_insert_extent(
+				NULL, inode, &path, &newex, 0);
+			up_write((&EXT4_I(inode)->i_data_sem));
+			ext4_ext_drop_refs(path);
+			kfree(path);
+			if (ret) {
+				iput(inode);
+				return 0;
+			}
+			goto next;
+		}
+
+		if (start_pblk + cur - start != map.m_pblk) {
+			/* Logical to physical mapping changed */
+			ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
+					ext4_ext_is_unwritten(ex),
+					start_pblk + cur - start);
+			if (ret) {
+				iput(inode);
+				return 0;
+			}
+			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
+			goto next;
+		}
+
+		/* Range is mapped and needs a state change */
+		jbd_debug(1, "Converting from %d to %d %lld",
+				map.m_flags & EXT4_MAP_UNWRITTEN,
+			ext4_ext_is_unwritten(ex), map.m_pblk);
+		ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
+					ext4_ext_is_unwritten(ex), map.m_pblk);
+		if (ret) {
+			iput(inode);
+			return 0;
+		}
+		/*
+		 * We may have split the extent tree while toggling the state.
+		 * Try to shrink the extent tree now.
+		 */
+		ext4_ext_replay_shrink_inode(inode, start + len);
+next:
+		cur += map.m_len;
+		remaining -= map.m_len;
+	}
+	ext4_ext_replay_shrink_inode(inode, i_size_read(inode) >>
+					sb->s_blocksize_bits);
+	iput(inode);
+	return 0;
+}
+
+/* Replay DEL_RANGE tag */
+static int
+ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl)
+{
+	struct inode *inode;
+	struct ext4_fc_del_range *lrange;
+	struct ext4_map_blocks map;
+	ext4_lblk_t cur, remaining;
+	int ret;
+
+	lrange = (struct ext4_fc_del_range *)ext4_fc_tag_val(tl);
+	cur = le32_to_cpu(lrange->fc_lblk);
+	remaining = le32_to_cpu(lrange->fc_len);
+
+	trace_ext4_fc_replay(sb, EXT4_FC_TAG_DEL_RANGE,
+		le32_to_cpu(lrange->fc_ino), cur, remaining);
+
+	inode = ext4_iget(sb, le32_to_cpu(lrange->fc_ino), EXT4_IGET_NORMAL);
+	if (IS_ERR_OR_NULL(inode)) {
+		jbd_debug(1, "Inode %d not found", le32_to_cpu(lrange->fc_ino));
+		return 0;
+	}
+
+	ret = ext4_fc_record_modified_inode(sb, inode->i_ino);
+
+	jbd_debug(1, "DEL_RANGE, inode %ld, lblk %d, len %d\n",
+			inode->i_ino, le32_to_cpu(lrange->fc_lblk),
+			le32_to_cpu(lrange->fc_len));
+	while (remaining > 0) {
+		map.m_lblk = cur;
+		map.m_len = remaining;
+
+		ret = ext4_map_blocks(NULL, inode, &map, 0);
+		if (ret < 0) {
+			iput(inode);
+			return 0;
+		}
+		if (ret > 0) {
+			remaining -= ret;
+			cur += ret;
+			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
+		} else {
+			remaining -= map.m_len;
+			cur += map.m_len;
+		}
+	}
+
+	ret = ext4_punch_hole(inode,
+		le32_to_cpu(lrange->fc_lblk) << sb->s_blocksize_bits,
+		le32_to_cpu(lrange->fc_len) <<  sb->s_blocksize_bits);
+	if (ret)
+		jbd_debug(1, "ext4_punch_hole returned %d", ret);
+	ext4_ext_replay_shrink_inode(inode,
+		i_size_read(inode) >> sb->s_blocksize_bits);
+	ext4_mark_inode_dirty(NULL, inode);
+	iput(inode);
+
+	return 0;
+}
+
+static inline const char *tag2str(u16 tag)
+{
+	switch (tag) {
+	case EXT4_FC_TAG_LINK:
+		return "TAG_ADD_ENTRY";
+	case EXT4_FC_TAG_UNLINK:
+		return "TAG_DEL_ENTRY";
+	case EXT4_FC_TAG_ADD_RANGE:
+		return "TAG_ADD_RANGE";
+	case EXT4_FC_TAG_CREAT:
+		return "TAG_CREAT_DENTRY";
+	case EXT4_FC_TAG_DEL_RANGE:
+		return "TAG_DEL_RANGE";
+	case EXT4_FC_TAG_INODE_FULL:
+		return "TAG_INODE_FULL";
+	case EXT4_FC_TAG_INODE_PARTIAL:
+		return "TAG_INODE_PARTIAL";
+	case EXT4_FC_TAG_PAD:
+		return "TAG_PAD";
+	case EXT4_FC_TAG_TAIL:
+		return "TAG_TAIL";
+	case EXT4_FC_TAG_HEAD:
+		return "TAG_HEAD";
+	default:
+		return "TAG_ERROR";
+	}
+}
+
+void ext4_fc_set_bitmaps_and_counters(struct super_block *sb)
+{
+	struct ext4_fc_replay_state *state;
+	struct inode *inode;
+	struct ext4_ext_path *path = NULL;
+	struct ext4_map_blocks map;
+	int i, ret, j;
+	ext4_lblk_t cur, end;
+
+	state = &EXT4_SB(sb)->s_fc_replay_state;
+	for (i = 0; i < state->fc_modified_inodes_used; i++) {
+		inode = ext4_iget(sb, state->fc_modified_inodes[i],
+			EXT4_IGET_NORMAL);
+		if (IS_ERR_OR_NULL(inode)) {
+			jbd_debug(1, "Inode %d not found.",
+				state->fc_modified_inodes[i]);
+			continue;
+		}
+		cur = 0;
+		end = EXT_MAX_BLOCKS;
+		while (cur < end) {
+			map.m_lblk = cur;
+			map.m_len = end - cur;
+
+			ret = ext4_map_blocks(NULL, inode, &map, 0);
+			if (ret < 0)
+				break;
+
+			if (ret > 0) {
+				path = ext4_find_extent(inode, map.m_lblk, NULL, 0);
+				if (!IS_ERR_OR_NULL(path)) {
+					for (j = 0; j < path->p_depth; j++)
+						ext4_mb_mark_bb(inode->i_sb,
+							path[j].p_block, 1, 1);
+					ext4_ext_drop_refs(path);
+					kfree(path);
+				}
+				cur += ret;
+				ext4_mb_mark_bb(inode->i_sb, map.m_pblk,
+							map.m_len, 1);
+			} else {
+				cur = cur + (map.m_len ? map.m_len : 1);
+			}
+		}
+		iput(inode);
+	}
+}
+
+/*
+ * Check if a block is in the regions excluded from block allocation. The
+ * simple allocator that runs during the replay phase calls this function to
+ * see if it is okay to use a block.
+ */
+bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t blk)
+{
+	int i;
+	struct ext4_fc_replay_state *state;
+
+	state = &EXT4_SB(sb)->s_fc_replay_state;
+	for (i = 0; i < state->fc_regions_valid; i++) {
+		if (state->fc_regions[i].ino == 0 ||
+			state->fc_regions[i].len == 0)
+			continue;
+		if (blk >= state->fc_regions[i].pblk &&
+		    blk < state->fc_regions[i].pblk + state->fc_regions[i].len)
+			return true;
+	}
+	return false;
+}
+
+/* Cleanup function called after replay */
+void ext4_fc_replay_cleanup(struct super_block *sb)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+
+	sbi->s_mount_state &= ~EXT4_FC_REPLAY;
+	kfree(sbi->s_fc_replay_state.fc_regions);
+	kfree(sbi->s_fc_replay_state.fc_modified_inodes);
+}
+
+/*
+ * Recovery Scan phase handler
+ *
+ * This function is called during the scan phase and is responsible
+ * for doing the following things:
+ * - Make sure the fast commit area has valid tags for replay
+ * - Count number of tags that need to be replayed by the replay handler
+ * - Verify CRC
+ * - Create a list of excluded blocks for allocation during replay phase
+ *
+ * This function returns JBD2_FC_REPLAY_CONTINUE to indicate that SCAN is
+ * incomplete and JBD2 should send more blocks. It returns JBD2_FC_REPLAY_STOP
+ * to indicate that scan has finished and JBD2 can now start replay phase.
+ * It returns a negative error to indicate that there was an error. At the end
+ * of a successful scan phase, sbi->s_fc_replay_state.fc_replay_num_tags is set
+ * to the number of tags that need to be replayed during the replay phase.
+ */
+static int ext4_fc_replay_scan(journal_t *journal,
+				struct buffer_head *bh, int off,
+				tid_t expected_tid)
+{
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_replay_state *state;
+	int ret = JBD2_FC_REPLAY_CONTINUE;
+	struct ext4_fc_add_range *ext;
+	struct ext4_fc_tl *tl;
+	struct ext4_fc_tail *tail;
+	__u8 *start, *end;
+	struct ext4_fc_head *head;
+	struct ext4_extent *ex;
+
+	state = &sbi->s_fc_replay_state;
+
+	start = (u8 *)bh->b_data;
+	end = (__u8 *)bh->b_data + journal->j_blocksize - 1;
+
+	if (state->fc_replay_expected_off == 0) {
+		state->fc_cur_tag = 0;
+		state->fc_replay_num_tags = 0;
+		state->fc_crc = 0;
+		state->fc_regions = NULL;
+		state->fc_regions_valid = state->fc_regions_used =
+			state->fc_regions_size = 0;
+		/* Check if we can stop early */
+		if (le16_to_cpu(((struct ext4_fc_tl *)start)->fc_tag)
+			!= EXT4_FC_TAG_HEAD)
+			return 0;
+	}
+
+	if (off != state->fc_replay_expected_off) {
+		ret = -EFSCORRUPTED;
+		goto out_err;
+	}
+
+	state->fc_replay_expected_off++;
+	fc_for_each_tl(start, end, tl) {
+		jbd_debug(3, "Scan phase, tag:%s, blk %lld\n",
+			  tag2str(le16_to_cpu(tl->fc_tag)), bh->b_blocknr);
+		switch (le16_to_cpu(tl->fc_tag)) {
+		case EXT4_FC_TAG_ADD_RANGE:
+			ext = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl);
+			ex = (struct ext4_extent *)&ext->fc_ex;
+			ret = ext4_fc_record_regions(sb,
+				le32_to_cpu(ext->fc_ino),
+				le32_to_cpu(ex->ee_block), ext4_ext_pblock(ex),
+				ext4_ext_get_actual_len(ex));
+			if (ret < 0)
+				break;
+			ret = JBD2_FC_REPLAY_CONTINUE;
+			fallthrough;
+		case EXT4_FC_TAG_DEL_RANGE:
+		case EXT4_FC_TAG_LINK:
+		case EXT4_FC_TAG_UNLINK:
+		case EXT4_FC_TAG_CREAT:
+		case EXT4_FC_TAG_INODE_FULL:
+		case EXT4_FC_TAG_INODE_PARTIAL:
+		case EXT4_FC_TAG_PAD:
+			state->fc_cur_tag++;
+			state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
+					sizeof(*tl) + ext4_fc_tag_len(tl));
+			break;
+		case EXT4_FC_TAG_TAIL:
+			state->fc_cur_tag++;
+			tail = (struct ext4_fc_tail *)ext4_fc_tag_val(tl);
+			state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
+						sizeof(*tl) +
+						offsetof(struct ext4_fc_tail,
+						fc_crc));
+			if (le32_to_cpu(tail->fc_tid) == expected_tid &&
+				le32_to_cpu(tail->fc_crc) == state->fc_crc) {
+				state->fc_replay_num_tags = state->fc_cur_tag;
+				state->fc_regions_valid =
+					state->fc_regions_used;
+			} else {
+				ret = state->fc_replay_num_tags ?
+					JBD2_FC_REPLAY_STOP : -EFSBADCRC;
+			}
+			state->fc_crc = 0;
+			break;
+		case EXT4_FC_TAG_HEAD:
+			head = (struct ext4_fc_head *)ext4_fc_tag_val(tl);
+			if (le32_to_cpu(head->fc_features) &
+				~EXT4_FC_SUPPORTED_FEATURES) {
+				ret = -EOPNOTSUPP;
+				break;
+			}
+			if (le32_to_cpu(head->fc_tid) != expected_tid) {
+				ret = JBD2_FC_REPLAY_STOP;
+				break;
+			}
+			state->fc_cur_tag++;
+			state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
+					sizeof(*tl) + ext4_fc_tag_len(tl));
+			break;
+		default:
+			ret = state->fc_replay_num_tags ?
+				JBD2_FC_REPLAY_STOP : -ECANCELED;
+		}
+		if (ret < 0 || ret == JBD2_FC_REPLAY_STOP)
+			break;
+	}
+
+out_err:
+	trace_ext4_fc_replay_scan(sb, ret, off);
+	return ret;
+}
+
 /*
  * Main recovery path entry point.
+ * The meaning of the return codes is similar to that of ext4_fc_replay_scan().
  */
 static int ext4_fc_replay(journal_t *journal, struct buffer_head *bh,
 				enum passtype pass, int off, tid_t expected_tid)
 {
-	return 0;
+	struct super_block *sb = journal->j_private;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_fc_tl *tl;
+	__u8 *start, *end;
+	int ret = JBD2_FC_REPLAY_CONTINUE;
+	struct ext4_fc_replay_state *state = &sbi->s_fc_replay_state;
+	struct ext4_fc_tail *tail;
+
+	if (pass == PASS_SCAN) {
+		state->fc_current_pass = PASS_SCAN;
+		return ext4_fc_replay_scan(journal, bh, off, expected_tid);
+	}
+
+	if (state->fc_current_pass != pass) {
+		state->fc_current_pass = pass;
+		sbi->s_mount_state |= EXT4_FC_REPLAY;
+	}
+	if (!sbi->s_fc_replay_state.fc_replay_num_tags) {
+		jbd_debug(1, "Replay stops\n");
+		ext4_fc_set_bitmaps_and_counters(sb);
+		return 0;
+	}
+
+#ifdef CONFIG_EXT4_DEBUG
+	if (sbi->s_fc_debug_max_replay && off >= sbi->s_fc_debug_max_replay) {
+		pr_warn("Dropping fc block %d because max_replay set\n", off);
+		return -EINVAL;
+	}
+#endif
+
+	start = (u8 *)bh->b_data;
+	end = (__u8 *)bh->b_data + journal->j_blocksize - 1;
+
+	fc_for_each_tl(start, end, tl) {
+		if (state->fc_replay_num_tags == 0) {
+			ret = JBD2_FC_REPLAY_STOP;
+			ext4_fc_set_bitmaps_and_counters(sb);
+			break;
+		}
+		jbd_debug(3, "Replay phase, tag:%s\n",
+				tag2str(le16_to_cpu(tl->fc_tag)));
+		state->fc_replay_num_tags--;
+		switch (le16_to_cpu(tl->fc_tag)) {
+		case EXT4_FC_TAG_LINK:
+			ret = ext4_fc_replay_link(sb, tl);
+			break;
+		case EXT4_FC_TAG_UNLINK:
+			ret = ext4_fc_replay_unlink(sb, tl);
+			break;
+		case EXT4_FC_TAG_ADD_RANGE:
+			ret = ext4_fc_replay_add_range(sb, tl);
+			break;
+		case EXT4_FC_TAG_CREAT:
+			ret = ext4_fc_replay_create(sb, tl);
+			break;
+		case EXT4_FC_TAG_DEL_RANGE:
+			ret = ext4_fc_replay_del_range(sb, tl);
+			break;
+		case EXT4_FC_TAG_INODE_PARTIAL:
+		case EXT4_FC_TAG_INODE_FULL:
+			ret = ext4_fc_replay_inode(sb, tl);
+			break;
+		case EXT4_FC_TAG_PAD:
+			trace_ext4_fc_replay(sb, EXT4_FC_TAG_PAD, 0,
+				ext4_fc_tag_len(tl), 0);
+			break;
+		case EXT4_FC_TAG_TAIL:
+			trace_ext4_fc_replay(sb, EXT4_FC_TAG_TAIL, 0,
+				ext4_fc_tag_len(tl), 0);
+			tail = (struct ext4_fc_tail *)ext4_fc_tag_val(tl);
+			WARN_ON(le32_to_cpu(tail->fc_tid) != expected_tid);
+			break;
+		case EXT4_FC_TAG_HEAD:
+			break;
+		default:
+			trace_ext4_fc_replay(sb, le16_to_cpu(tl->fc_tag), 0,
+				ext4_fc_tag_len(tl), 0);
+			ret = -ECANCELED;
+			break;
+		}
+		if (ret < 0)
+			break;
+		ret = JBD2_FC_REPLAY_CONTINUE;
+	}
+	return ret;
 }
 
 void ext4_fc_init(struct super_block *sb, journal_t *journal)
diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
index a541d2bbe24b..cf4d8772d055 100644
--- a/fs/ext4/fast_commit.h
+++ b/fs/ext4/fast_commit.h
@@ -117,4 +117,44 @@ struct ext4_fc_stats {
 	int fc_numblks;
 };
 
+#define EXT4_FC_REPLAY_REALLOC_INCREMENT	4
+
+/*
+ * Physical block regions added to different inodes due to fast commit
+ * recovery. These are set during the SCAN phase. During the replay phase,
+ * our allocator excludes these from its allocation. This ensures that
+ * we don't accidentally allocate a block that is going to be used by
+ * another inode.
+ */
+struct ext4_fc_alloc_region {
+	ext4_lblk_t lblk;
+	ext4_fsblk_t pblk;
+	int ino, len;
+};
+
+/*
+ * Fast commit replay state.
+ */
+struct ext4_fc_replay_state {
+	int fc_replay_num_tags;
+	int fc_replay_expected_off;
+	int fc_current_pass;
+	int fc_cur_tag;
+	int fc_crc;
+	struct ext4_fc_alloc_region *fc_regions;
+	int fc_regions_size, fc_regions_used, fc_regions_valid;
+	int *fc_modified_inodes;
+	int fc_modified_inodes_used, fc_modified_inodes_size;
+};
+
+#define region_last(__region) (((__region)->lblk) + ((__region)->len) - 1)
+
+#define fc_for_each_tl(__start, __end, __tl)				\
+	for (__tl = (struct ext4_fc_tl *)(__start);			\
+		(u8 *)__tl < (u8 *)(__end);				\
+		__tl = (struct ext4_fc_tl *)((u8 *)__tl +		\
+					sizeof(struct ext4_fc_tl) +	\
+					le16_to_cpu(__tl->fc_len)))
+
+
 #endif /* __FAST_COMMIT_H__ */
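
For illustration only, and not part of the patch: fc_for_each_tl() above
steps through a fast commit block as a sequence of tag-length-value
records, advancing by the header size plus the value length each time.
A minimal user-space sketch of the same walk follows. It assumes a 4-byte
header holding a little-endian 16-bit tag followed by a 16-bit length;
the names TL_HDR_SIZE, get_le16() and tlv_count() are hypothetical and do
not exist in the kernel. Unlike the macro, the sketch also rejects a
record that would run past the end of the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed header layout: 16-bit tag, then 16-bit value length (LE). */
#define TL_HDR_SIZE 4u

/* Decode a little-endian 16-bit value from two bytes. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Count the records in buf[0..size). Returns the count, or -1 if a
 * record header or value would run past the end of the buffer.
 * The walk stops at the first record whose tag equals stop_tag
 * (that record is included in the count).
 */
static int tlv_count(const uint8_t *buf, size_t size, uint16_t stop_tag)
{
	size_t off = 0;
	int n = 0;

	while (off < size) {
		uint16_t tag, len;

		if (size - off < TL_HDR_SIZE)
			return -1;	/* truncated header */
		tag = get_le16(buf + off);
		len = get_le16(buf + off + 2);
		if (size - off - TL_HDR_SIZE < len)
			return -1;	/* value overruns the buffer */
		n++;
		if (tag == stop_tag)
			break;
		off += TL_HDR_SIZE + len;
	}
	return n;
}
```

Calling tlv_count() with a stop tag that never occurs counts every record
in the buffer, which is roughly what the scan phase does with
fc_cur_tag before it reaches a tail tag.
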
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index df25d38d6539..db9b9eeb9560 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -82,7 +82,12 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
 				      struct buffer_head *bh)
 {
 	ext4_fsblk_t	blk;
-	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
+	struct ext4_group_info *grp;
+
+	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+		return 0;
+
+	grp = ext4_get_group_info(sb, block_group);
 
 	if (buffer_verified(bh))
 		return 0;
@@ -284,15 +289,17 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
 	bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
 	bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
 	/* Don't bother if the inode bitmap is corrupt. */
-	grp = ext4_get_group_info(sb, block_group);
 	if (IS_ERR(bitmap_bh)) {
 		fatal = PTR_ERR(bitmap_bh);
 		bitmap_bh = NULL;
 		goto error_return;
 	}
-	if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
-		fatal = -EFSCORRUPTED;
-		goto error_return;
+	if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+		grp = ext4_get_group_info(sb, block_group);
+		if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
+			fatal = -EFSCORRUPTED;
+			goto error_return;
+		}
 	}
 
 	BUFFER_TRACE(bitmap_bh, "get_write_access");
@@ -742,6 +749,119 @@ static int find_inode_bit(struct super_block *sb, ext4_group_t group,
 	return 1;
 }
 
+int ext4_mark_inode_used(struct super_block *sb, int ino)
+{
+	unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count);
+	struct buffer_head *inode_bitmap_bh = NULL, *group_desc_bh = NULL;
+	struct ext4_group_desc *gdp;
+	ext4_group_t group;
+	int bit;
+	int err = -EFSCORRUPTED;
+
+	if (ino < EXT4_FIRST_INO(sb) || ino > max_ino)
+		goto out;
+
+	group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
+	bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
+	inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
+	if (IS_ERR(inode_bitmap_bh))
+		return PTR_ERR(inode_bitmap_bh);
+
+	if (ext4_test_bit(bit, inode_bitmap_bh->b_data)) {
+		err = 0;
+		goto out;
+	}
+
+	gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
+	if (!gdp || !group_desc_bh) {
+		err = -EINVAL;
+		goto out;
+	}
+
+	ext4_set_bit(bit, inode_bitmap_bh->b_data);
+
+	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
+	err = ext4_handle_dirty_metadata(NULL, NULL, inode_bitmap_bh);
+	if (err) {
+		ext4_std_error(sb, err);
+		goto out;
+	}
+	sync_dirty_buffer(inode_bitmap_bh);
+	BUFFER_TRACE(group_desc_bh, "get_write_access");
+
+	/* We may have to initialize the block bitmap if it isn't already */
+	if (ext4_has_group_desc_csum(sb) &&
+	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
+		struct buffer_head *block_bitmap_bh;
+
+		block_bitmap_bh = ext4_read_block_bitmap(sb, group);
+		if (IS_ERR(block_bitmap_bh)) {
+			err = PTR_ERR(block_bitmap_bh);
+			goto out;
+		}
+
+		BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap");
+		err = ext4_handle_dirty_metadata(NULL, NULL, block_bitmap_bh);
+		sync_dirty_buffer(block_bitmap_bh);
+
+		/* recheck and clear flag under lock if we still need to */
+		ext4_lock_group(sb, group);
+		if (ext4_has_group_desc_csum(sb) &&
+		    (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+			ext4_free_group_clusters_set(sb, gdp,
+				ext4_free_clusters_after_init(sb, group, gdp));
+			ext4_block_bitmap_csum_set(sb, group, gdp,
+						   block_bitmap_bh);
+			ext4_group_desc_csum_set(sb, group, gdp);
+		}
+		ext4_unlock_group(sb, group);
+		brelse(block_bitmap_bh);
+
+		if (err) {
+			ext4_std_error(sb, err);
+			goto out;
+		}
+	}
+
+	/* Update the relevant bg descriptor fields */
+	if (ext4_has_group_desc_csum(sb)) {
+		int free;
+
+		ext4_lock_group(sb, group); /* while we modify the bg desc */
+		free = EXT4_INODES_PER_GROUP(sb) -
+			ext4_itable_unused_count(sb, gdp);
+		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
+			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
+			free = 0;
+		}
+
+		/*
+		 * Check the relative inode number against the last used
+		 * relative inode number in this group. If it is greater,
+		 * we need to update the bg_itable_unused count.
+		 */
+		if (bit >= free)
+			ext4_itable_unused_set(sb, gdp,
+					(EXT4_INODES_PER_GROUP(sb) - bit - 1));
+	} else {
+		ext4_lock_group(sb, group);
+	}
+
+	ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1);
+	if (ext4_has_group_desc_csum(sb)) {
+		ext4_inode_bitmap_csum_set(sb, group, gdp, inode_bitmap_bh,
+					   EXT4_INODES_PER_GROUP(sb) / 8);
+		ext4_group_desc_csum_set(sb, group, gdp);
+	}
+
+	ext4_unlock_group(sb, group);
+	err = ext4_handle_dirty_metadata(NULL, NULL, group_desc_bh);
+	sync_dirty_buffer(group_desc_bh);
+out:
+	return err;
+}
+
 /*
  * There are two policies for allocating an inode.  If the new inode is
  * a directory, then a forward search is made for a block group with both
@@ -771,7 +891,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	struct inode *ret;
 	ext4_group_t i;
 	ext4_group_t flex_group;
-	struct ext4_group_info *grp;
+	struct ext4_group_info *grp = NULL;
 	int encrypt = 0;
 
 	/* Cannot create files in a deleted directory */
@@ -909,15 +1029,21 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		if (ext4_free_inodes_count(sb, gdp) == 0)
 			goto next_group;
 
-		grp = ext4_get_group_info(sb, group);
-		/* Skip groups with already-known suspicious inode tables */
-		if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
-			goto next_group;
+		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+			grp = ext4_get_group_info(sb, group);
+			/*
+			 * Skip groups with already-known suspicious inode
+			 * tables
+			 */
+			if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
+				goto next_group;
+		}
 
 		brelse(inode_bitmap_bh);
 		inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
 		/* Skip groups with suspicious inode tables */
-		if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp) ||
+		if (((!(sbi->s_mount_state & EXT4_FC_REPLAY))
+		     && EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) ||
 		    IS_ERR(inode_bitmap_bh)) {
 			inode_bitmap_bh = NULL;
 			goto next_group;
@@ -936,7 +1062,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			goto next_group;
 		}
 
-		if (!handle) {
+		if ((!(sbi->s_mount_state & EXT4_FC_REPLAY)) && !handle) {
 			BUG_ON(nblocks <= 0);
 			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
 				 handle_type, nblocks, 0,
@@ -1040,9 +1166,15 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 	/* Update the relevant bg descriptor fields */
 	if (ext4_has_group_desc_csum(sb)) {
 		int free;
-		struct ext4_group_info *grp = ext4_get_group_info(sb, group);
-
-		down_read(&grp->alloc_sem); /* protect vs itable lazyinit */
+		struct ext4_group_info *grp = NULL;
+
+		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+			grp = ext4_get_group_info(sb, group);
+			/*
+			 * protect vs itable lazyinit
+			 */
+			down_read(&grp->alloc_sem);
+		}
 		ext4_lock_group(sb, group); /* while we modify the bg desc */
 		free = EXT4_INODES_PER_GROUP(sb) -
 			ext4_itable_unused_count(sb, gdp);
@@ -1058,7 +1190,8 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 		if (ino > free)
 			ext4_itable_unused_set(sb, gdp,
 					(EXT4_INODES_PER_GROUP(sb) - ino));
-		up_read(&grp->alloc_sem);
+		if (!(sbi->s_mount_state & EXT4_FC_REPLAY))
+			up_read(&grp->alloc_sem);
 	} else {
 		ext4_lock_group(sb, group);
 	}
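
For illustration only, and not part of the patch: ext4_mark_inode_used()
above locates an inode's bit by splitting the 1-based inode number into a
block group index and a bit offset within that group's inode bitmap. A
small user-space sketch of that arithmetic follows; struct ino_loc and
ino_to_loc() are hypothetical names, and inodes_per_group stands in for
EXT4_INODES_PER_GROUP(sb).

```c
#include <assert.h>
#include <stdint.h>

struct ino_loc {
	uint32_t group;	/* block group index */
	uint32_t bit;	/* bit index inside that group's inode bitmap */
};

/* Inode numbers start at 1, hence the "ino - 1" before dividing. */
static struct ino_loc ino_to_loc(uint32_t ino, uint32_t inodes_per_group)
{
	struct ino_loc loc;

	loc.group = (ino - 1) / inodes_per_group;
	loc.bit = (ino - 1) % inodes_per_group;
	return loc;
}
```

With 8192 inodes per group, inode 8192 is the last bit of group 0 and
inode 8193 is bit 0 of group 1. The caller is expected to have rejected
ino == 0 beforehand, as the range check in the patch does.
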
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 26eed76812f9..9dce088171cc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -101,8 +101,8 @@ static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
 	return provided == calculated;
 }
 
-static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
-				struct ext4_inode_info *ei)
+void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
+			 struct ext4_inode_info *ei)
 {
 	__u32 csum;
 
@@ -514,7 +514,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		return -EFSCORRUPTED;
 
 	/* Lookup extent status tree firstly */
-	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
+	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) &&
+	    ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
 		if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
 			map->m_pblk = ext4_es_pblock(&es) +
 					map->m_lblk - es.es_lblk;
@@ -827,7 +828,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
 	int create = map_flags & EXT4_GET_BLOCKS_CREATE;
 	int err;
 
-	J_ASSERT(handle != NULL || create == 0);
+	J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+		 || handle != NULL || create == 0);
 
 	map.m_lblk = block;
 	map.m_len = 1;
@@ -843,7 +845,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
 		return ERR_PTR(-ENOMEM);
 	if (map.m_flags & EXT4_MAP_NEW) {
 		J_ASSERT(create != 0);
-		J_ASSERT(handle != NULL);
+		J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+			 || (handle != NULL));
 
 		/*
 		 * Now that we do not always journal data, we should
@@ -4255,22 +4258,22 @@ int ext4_truncate(struct inode *inode)
  * data in memory that is needed to recreate the on-disk version of this
  * inode.
  */
-static int __ext4_get_inode_loc(struct inode *inode,
-				struct ext4_iloc *iloc, int in_mem)
+static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
+				struct ext4_iloc *iloc, int in_mem,
+				ext4_fsblk_t *ret_block)
 {
 	struct ext4_group_desc	*gdp;
 	struct buffer_head	*bh;
-	struct super_block	*sb = inode->i_sb;
 	ext4_fsblk_t		block;
 	struct blk_plug		plug;
 	int			inodes_per_block, inode_offset;
 
 	iloc->bh = NULL;
-	if (inode->i_ino < EXT4_ROOT_INO ||
-	    inode->i_ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
+	if (ino < EXT4_ROOT_INO ||
+	    ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
 		return -EFSCORRUPTED;
 
-	iloc->block_group = (inode->i_ino - 1) / EXT4_INODES_PER_GROUP(sb);
+	iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
 	gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
 	if (!gdp)
 		return -EIO;
@@ -4279,7 +4282,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
 	 * Figure out the offset within the block group inode table
 	 */
 	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
-	inode_offset = ((inode->i_ino - 1) %
+	inode_offset = ((ino - 1) %
 			EXT4_INODES_PER_GROUP(sb));
 	block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
 	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
@@ -4380,7 +4383,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
 		 * has in-inode xattrs, or we don't have this inode in memory.
 		 * Read the block from disk.
 		 */
-		trace_ext4_load_inode(inode);
+		trace_ext4_load_inode(sb, ino);
 		get_bh(bh);
 		bh->b_end_io = end_buffer_read_sync;
 		submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
@@ -4388,8 +4391,8 @@ static int __ext4_get_inode_loc(struct inode *inode,
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh)) {
 		simulate_eio:
-			ext4_error_inode_block(inode, block, EIO,
-					       "unable to read itable block");
+			if (ret_block)
+				*ret_block = block;
 			brelse(bh);
 			return -EIO;
 		}
@@ -4399,11 +4402,43 @@ static int __ext4_get_inode_loc(struct inode *inode,
 	return 0;
 }
 
+static int __ext4_get_inode_loc_noinmem(struct inode *inode,
+					struct ext4_iloc *iloc)
+{
+	ext4_fsblk_t err_blk;
+	int ret;
+
+	ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc, 0,
+					&err_blk);
+
+	if (ret == -EIO)
+		ext4_error_inode_block(inode, err_blk, EIO,
+					"unable to read itable block");
+
+	return ret;
+}
+
 int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
 {
+	ext4_fsblk_t err_blk;
+	int ret;
+
 	/* We have all inode data except xattrs in memory here. */
-	return __ext4_get_inode_loc(inode, iloc,
-		!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
+	ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc,
+		!ext4_test_inode_state(inode, EXT4_STATE_XATTR), &err_blk);
+
+	if (ret == -EIO)
+		ext4_error_inode_block(inode, err_blk, EIO,
+					"unable to read itable block");
+
+	return ret;
+}
+
+
+int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
+			  struct ext4_iloc *iloc)
+{
+	return __ext4_get_inode_loc(sb, ino, iloc, 0, NULL);
 }
 
 static bool ext4_should_enable_dax(struct inode *inode)
@@ -4569,7 +4604,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 	ei = EXT4_I(inode);
 	iloc.bh = NULL;
 
-	ret = __ext4_get_inode_loc(inode, &iloc, 0);
+	ret = __ext4_get_inode_loc_noinmem(inode, &iloc);
 	if (ret < 0)
 		goto bad_inode;
 	raw_inode = ext4_raw_inode(&iloc);
@@ -4615,10 +4650,11 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 					      sizeof(gen));
 	}
 
-	if (!ext4_inode_csum_verify(inode, raw_inode, ei) ||
-	    ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) {
-		ext4_error_inode_err(inode, function, line, 0, EFSBADCRC,
-				     "iget: checksum invalid");
+	if ((!ext4_inode_csum_verify(inode, raw_inode, ei) ||
+	    ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) &&
+	     (!(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))) {
+		ext4_error_inode_err(inode, function, line, 0,
+				EFSBADCRC, "iget: checksum invalid");
 		ret = -EFSBADCRC;
 		goto bad_inode;
 	}
@@ -4772,9 +4808,10 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
 		goto bad_inode;
 	} else if (!ext4_has_inline_data(inode)) {
 		/* validate the block references in the inode */
-		if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
-		   (S_ISLNK(inode->i_mode) &&
-		    !ext4_inode_is_fast_symlink(inode))) {
+		if (!(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY) &&
+			(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
+			(S_ISLNK(inode->i_mode) &&
+			!ext4_inode_is_fast_symlink(inode)))) {
 			if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 				ret = ext4_ext_check_inode(inode);
 			else
@@ -5158,7 +5195,7 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
 	} else {
 		struct ext4_iloc iloc;
 
-		err = __ext4_get_inode_loc(inode, &iloc, 0);
+		err = __ext4_get_inode_loc_noinmem(inode, &iloc);
 		if (err)
 			return err;
 		/*
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index d2f8f50deef6..f0381876a7e5 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -86,7 +86,7 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
 	i_size_write(inode2, isize);
 }
 
-static void reset_inode_seed(struct inode *inode)
+void ext4_reset_inode_seed(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
@@ -200,8 +200,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
 
 	inode->i_generation = prandom_u32();
 	inode_bl->i_generation = prandom_u32();
-	reset_inode_seed(inode);
-	reset_inode_seed(inode_bl);
+	ext4_reset_inode_seed(inode);
+	ext4_reset_inode_seed(inode_bl);
 
 	ext4_discard_preallocations(inode, 0);
 
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 132c118d12e1..ea894529118a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1508,14 +1508,16 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
 
 		blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
 		blocknr += EXT4_C2B(sbi, block);
-		ext4_grp_locked_error(sb, e4b->bd_group,
-				      inode ? inode->i_ino : 0,
-				      blocknr,
-				      "freeing already freed block "
-				      "(bit %u); block bitmap corrupt.",
-				      block);
-		ext4_mark_group_bitmap_corrupted(sb, e4b->bd_group,
+		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
+			ext4_grp_locked_error(sb, e4b->bd_group,
+					      inode ? inode->i_ino : 0,
+					      blocknr,
+					      "freeing already freed block (bit %u); block bitmap corrupt.",
+					      block);
+			ext4_mark_group_bitmap_corrupted(
+				sb, e4b->bd_group,
 				EXT4_GROUP_INFO_BBITMAP_CORRUPT);
+		}
 		mb_regenerate_buddy(e4b);
 		goto done;
 	}
@@ -3302,6 +3304,86 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
 	return err;
 }
 
+/*
+ * Idempotent helper for Ext4 fast commit replay path to set the state of
+ * blocks in bitmaps and update counters.
+ */
+void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
+			int len, int state)
+{
+	struct buffer_head *bitmap_bh = NULL;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *gdp_bh;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	ext4_group_t group;
+	ext4_fsblk_t cluster;
+	ext4_grpblk_t blkoff;
+	int i, clen, err;
+	int already;
+
+	cluster = EXT4_B2C(sbi, block);
+	clen = EXT4_B2C(sbi, len);
+
+	ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
+	bitmap_bh = ext4_read_block_bitmap(sb, group);
+	if (IS_ERR(bitmap_bh)) {
+		err = PTR_ERR(bitmap_bh);
+		bitmap_bh = NULL;
+		goto out_err;
+	}
+
+	err = -EIO;
+	gdp = ext4_get_group_desc(sb, group, &gdp_bh);
+	if (!gdp)
+		goto out_err;
+
+	ext4_lock_group(sb, group);
+	already = 0;
+	for (i = 0; i < clen; i++)
+		if (!mb_test_bit(blkoff + i, bitmap_bh->b_data) == !state)
+			already++;
+
+	if (state)
+		ext4_set_bits(bitmap_bh->b_data, blkoff, clen);
+	else
+		mb_test_and_clear_bits(bitmap_bh->b_data, blkoff, clen);
+	if (ext4_has_group_desc_csum(sb) &&
+	    (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
+		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
+		ext4_free_group_clusters_set(sb, gdp,
+					     ext4_free_clusters_after_init(sb,
+						group, gdp));
+	}
+	if (state)
+		clen = ext4_free_group_clusters(sb, gdp) - clen + already;
+	else
+		clen = ext4_free_group_clusters(sb, gdp) + clen - already;
+
+	ext4_free_group_clusters_set(sb, gdp, clen);
+	ext4_block_bitmap_csum_set(sb, group, gdp, bitmap_bh);
+	ext4_group_desc_csum_set(sb, group, gdp);
+
+	ext4_unlock_group(sb, group);
+
+	if (sbi->s_log_groups_per_flex) {
+		ext4_group_t flex_group = ext4_flex_group(sbi, group);
+
+		atomic64_sub(len,
+			     &sbi_array_rcu_deref(sbi, s_flex_groups,
+						  flex_group)->free_clusters);
+	}
+
+	err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+	if (err)
+		goto out_err;
+	sync_dirty_buffer(bitmap_bh);
+	err = ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
+	sync_dirty_buffer(gdp_bh);
+
+out_err:
+	brelse(bitmap_bh);
+}
+
 /*
  * here we normalize request for locality group
  * Group request are normalized to s_mb_group_prealloc, which goes to
@@ -4283,6 +4365,9 @@ void ext4_discard_preallocations(struct inode *inode, unsigned int needed)
 		return;
 	}
 
+	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
+		return;
+
 	mb_debug(sb, "discard preallocation for inode %lu\n",
 		 inode->i_ino);
 	trace_ext4_discard_preallocations(inode,
@@ -4830,6 +4915,9 @@ static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
 	return ret;
 }
 
+static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle,
+				struct ext4_allocation_request *ar, int *errp);
+
 /*
  * Main entry point into mballoc to allocate blocks
  * it tries to use preallocation first, then falls back
@@ -4851,6 +4939,8 @@ ext4_fsblk_t ext4_mb_new_blocks(handle_t *handle,
 	sbi = EXT4_SB(sb);
 
 	trace_ext4_request_blocks(ar);
+	if (sbi->s_mount_state & EXT4_FC_REPLAY)
+		return ext4_mb_new_blocks_simple(handle, ar, errp);
 
 	/* Allow to use superuser reservation for quota file */
 	if (ext4_is_quota_file(ar->inode))
@@ -5078,6 +5168,102 @@ ext4_mb_free_metadata(handle_t *handle, struct ext4_buddy *e4b,
 	return 0;
 }
 
+/*
+ * Simple allocator for Ext4 fast commit replay path. It searches for blocks
+ * linearly starting at the goal block and also excludes the blocks which
+ * are going to be in use after fast commit replay.
+ */
+static ext4_fsblk_t ext4_mb_new_blocks_simple(handle_t *handle,
+				struct ext4_allocation_request *ar, int *errp)
+{
+	struct buffer_head *bitmap_bh;
+	struct super_block *sb = ar->inode->i_sb;
+	ext4_group_t group;
+	ext4_grpblk_t blkoff;
+	int  i;
+	ext4_fsblk_t goal, block;
+	struct ext4_super_block *es = EXT4_SB(sb)->s_es;
+
+	goal = ar->goal;
+	if (goal < le32_to_cpu(es->s_first_data_block) ||
+			goal >= ext4_blocks_count(es))
+		goal = le32_to_cpu(es->s_first_data_block);
+
+	ar->len = 0;
+	ext4_get_group_no_and_offset(sb, goal, &group, &blkoff);
+	for (; group < ext4_get_groups_count(sb); group++) {
+		bitmap_bh = ext4_read_block_bitmap(sb, group);
+		if (IS_ERR(bitmap_bh)) {
+			*errp = PTR_ERR(bitmap_bh);
+			pr_warn("Failed to read block bitmap\n");
+			return 0;
+		}
+
+		ext4_get_group_no_and_offset(sb,
+			max(ext4_group_first_block_no(sb, group), goal),
+			NULL, &blkoff);
+		i = mb_find_next_zero_bit(bitmap_bh->b_data, sb->s_blocksize,
+						blkoff);
+		brelse(bitmap_bh);
+		if (i >= sb->s_blocksize)
+			continue;
+		if (ext4_fc_replay_check_excluded(sb,
+			ext4_group_first_block_no(sb, group) + i))
+			continue;
+		break;
+	}
+
+	if (group >= ext4_get_groups_count(sb) && i >= sb->s_blocksize)
+		return 0;
+
+	block = ext4_group_first_block_no(sb, group) + i;
+	ext4_mb_mark_bb(sb, block, 1, 1);
+	ar->len = 1;
+
+	return block;
+}
+
+void ext4_free_blocks_simple(struct inode *inode, ext4_fsblk_t block,
+			     unsigned long count)
+{
+	struct buffer_head *bitmap_bh;
+	struct super_block *sb = inode->i_sb;
+	struct ext4_group_desc *gdp;
+	struct buffer_head *gdp_bh;
+	ext4_group_t group;
+	ext4_grpblk_t blkoff;
+	int already_freed = 0, err, i;
+
+	ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
+	bitmap_bh = ext4_read_block_bitmap(sb, group);
+	if (IS_ERR(bitmap_bh)) {
+		err = PTR_ERR(bitmap_bh);
+		pr_warn("Failed to read block bitmap\n");
+		return;
+	}
+	gdp = ext4_get_group_desc(sb, group, &gdp_bh);
+	if (!gdp)
+		return;
+
+	for (i = 0; i < count; i++) {
+		if (!mb_test_bit(blkoff + i, bitmap_bh->b_data))
+			already_freed++;
+	}
+	mb_clear_bits(bitmap_bh->b_data, blkoff, count);
+	err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
+	if (err)
+		return;
+	ext4_free_group_clusters_set(
+		sb, gdp, ext4_free_group_clusters(sb, gdp) +
+		count - already_freed);
+	ext4_block_bitmap_csum_set(sb, group, gdp, bitmap_bh);
+	ext4_group_desc_csum_set(sb, group, gdp);
+	ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
+	sync_dirty_buffer(bitmap_bh);
+	sync_dirty_buffer(gdp_bh);
+	brelse(bitmap_bh);
+}
+
 /**
  * ext4_free_blocks() -- Free given blocks and update quota
  * @handle:		handle for this transaction
@@ -5104,6 +5290,13 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 	int err = 0;
 	int ret;
 
+	sbi = EXT4_SB(sb);
+
+	if (sbi->s_mount_state & EXT4_FC_REPLAY) {
+		ext4_free_blocks_simple(inode, block, count);
+		return;
+	}
+
 	might_sleep();
 	if (bh) {
 		if (block)
@@ -5112,7 +5305,6 @@ void ext4_free_blocks(handle_t *handle, struct inode *inode,
 			block = bh->b_blocknr;
 	}
 
-	sbi = EXT4_SB(sb);
 	if (!(flags & EXT4_FREE_BLOCKS_VALIDATED) &&
 	    !ext4_inode_block_valid(inode, block, count)) {
 		ext4_error(sb, "Freeing blocks not in datazone - "
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 563243f3b682..3c4d882d4d7b 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2749,7 +2749,7 @@ struct ext4_dir_entry_2 *ext4_init_dot_dotdot(struct inode *inode,
 	return ext4_next_entry(de, blocksize);
 }
 
-static int ext4_init_new_dir(handle_t *handle, struct inode *dir,
+int ext4_init_new_dir(handle_t *handle, struct inode *dir,
 			     struct inode *inode)
 {
 	struct buffer_head *dir_block = NULL;
@@ -3196,42 +3196,32 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
 	return retval;
 }
 
-static int ext4_unlink(struct inode *dir, struct dentry *dentry)
+int __ext4_unlink(struct inode *dir, const struct qstr *d_name,
+		  struct inode *inode)
 {
-	int retval;
-	struct inode *inode;
+	int retval = -ENOENT;
 	struct buffer_head *bh;
 	struct ext4_dir_entry_2 *de;
 	handle_t *handle = NULL;
+	int skip_remove_dentry = 0;
 
-	if (unlikely(ext4_forced_shutdown(EXT4_SB(dir->i_sb))))
-		return -EIO;
-
-	trace_ext4_unlink_enter(dir, dentry);
-	/* Initialize quotas before so that eventual writes go
-	 * in separate transaction */
-	retval = dquot_initialize(dir);
-	if (retval)
-		goto out_trace;
-	retval = dquot_initialize(d_inode(dentry));
-	if (retval)
-		goto out_trace;
-
-	bh = ext4_find_entry(dir, &dentry->d_name, &de, NULL);
-	if (IS_ERR(bh)) {
-		retval = PTR_ERR(bh);
-		goto out_trace;
-	}
-	if (!bh) {
-		retval = -ENOENT;
-		goto out_trace;
-	}
+	bh = ext4_find_entry(dir, d_name, &de, NULL);
+	if (IS_ERR(bh))
+		return PTR_ERR(bh);
 
-	inode = d_inode(dentry);
+	if (!bh)
+		return -ENOENT;
 
 	if (le32_to_cpu(de->inode) != inode->i_ino) {
-		retval = -EFSCORRUPTED;
-		goto out_bh;
+		/*
+		 * It's okay if we don't find a dentry which matches
+		 * the inode. That's because it might have gotten
+		 * renamed to a different inode number
+		 */
+		if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
+			skip_remove_dentry = 1;
+		else
+			goto out_bh;
 	}
 
 	handle = ext4_journal_start(dir, EXT4_HT_DIR,
@@ -3244,17 +3234,21 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 	if (IS_DIRSYNC(dir))
 		ext4_handle_sync(handle);
 
-	retval = ext4_delete_entry(handle, dir, de, bh);
-	if (retval)
-		goto out_handle;
-	dir->i_ctime = dir->i_mtime = current_time(dir);
-	ext4_update_dx_flag(dir);
-	retval = ext4_mark_inode_dirty(handle, dir);
-	if (retval)
-		goto out_handle;
+	if (!skip_remove_dentry) {
+		retval = ext4_delete_entry(handle, dir, de, bh);
+		if (retval)
+			goto out_handle;
+		dir->i_ctime = dir->i_mtime = current_time(dir);
+		ext4_update_dx_flag(dir);
+		retval = ext4_mark_inode_dirty(handle, dir);
+		if (retval)
+			goto out_handle;
+	} else {
+		retval = 0;
+	}
 	if (inode->i_nlink == 0)
 		ext4_warning_inode(inode, "Deleting file '%.*s' with no links",
-				   dentry->d_name.len, dentry->d_name.name);
+				   d_name->len, d_name->name);
 	else
 		drop_nlink(inode);
 	if (!inode->i_nlink)
@@ -3262,6 +3256,33 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 	inode->i_ctime = current_time(inode);
 	retval = ext4_mark_inode_dirty(handle, inode);
 
+out_handle:
+	ext4_journal_stop(handle);
+out_bh:
+	brelse(bh);
+	return retval;
+}
+
+static int ext4_unlink(struct inode *dir, struct dentry *dentry)
+{
+	int retval;
+
+	if (unlikely(ext4_forced_shutdown(EXT4_SB(dir->i_sb))))
+		return -EIO;
+
+	trace_ext4_unlink_enter(dir, dentry);
+	/*
+	 * Initialize quotas before so that eventual writes go
+	 * in separate transaction
+	 */
+	retval = dquot_initialize(dir);
+	if (retval)
+		goto out_trace;
+	retval = dquot_initialize(d_inode(dentry));
+	if (retval)
+		goto out_trace;
+
+	retval = __ext4_unlink(dir, &dentry->d_name, d_inode(dentry));
 	if (!retval)
 		ext4_fc_track_unlink(d_inode(dentry), dentry);
 #ifdef CONFIG_UNICODE
@@ -3275,10 +3296,6 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
 		d_invalidate(dentry);
 #endif
 
-out_handle:
-	ext4_journal_stop(handle);
-out_bh:
-	brelse(bh);
 out_trace:
 	trace_ext4_unlink_exit(dentry, retval);
 	return retval;
@@ -3359,7 +3376,8 @@ static int ext4_symlink(struct inode *dir,
 		 */
 		drop_nlink(inode);
 		err = ext4_orphan_add(handle, inode);
-		ext4_journal_stop(handle);
+		if (handle)
+			ext4_journal_stop(handle);
 		handle = NULL;
 		if (err)
 			goto err_drop_inode;
@@ -3413,29 +3431,10 @@ static int ext4_symlink(struct inode *dir,
 	return err;
 }
 
-static int ext4_link(struct dentry *old_dentry,
-		     struct inode *dir, struct dentry *dentry)
+int __ext4_link(struct inode *dir, struct inode *inode, struct dentry *dentry)
 {
 	handle_t *handle;
-	struct inode *inode = d_inode(old_dentry);
 	int err, retries = 0;
-
-	if (inode->i_nlink >= EXT4_LINK_MAX)
-		return -EMLINK;
-
-	err = fscrypt_prepare_link(old_dentry, dir, dentry);
-	if (err)
-		return err;
-
-	if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) &&
-	    (!projid_eq(EXT4_I(dir)->i_projid,
-			EXT4_I(old_dentry->d_inode)->i_projid)))
-		return -EXDEV;
-
-	err = dquot_initialize(dir);
-	if (err)
-		return err;
-
 retry:
 	handle = ext4_journal_start(dir, EXT4_HT_DIR,
 		(EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
@@ -3452,6 +3451,7 @@ static int ext4_link(struct dentry *old_dentry,
 
 	err = ext4_add_entry(handle, dentry, inode);
 	if (!err) {
+		ext4_fc_track_link(inode, dentry);
 		err = ext4_mark_inode_dirty(handle, inode);
 		/* this can happen only for tmpfile being
 		 * linked the first time
@@ -3469,6 +3469,29 @@ static int ext4_link(struct dentry *old_dentry,
 	return err;
 }
 
+static int ext4_link(struct dentry *old_dentry,
+		     struct inode *dir, struct dentry *dentry)
+{
+	struct inode *inode = d_inode(old_dentry);
+	int err;
+
+	if (inode->i_nlink >= EXT4_LINK_MAX)
+		return -EMLINK;
+
+	err = fscrypt_prepare_link(old_dentry, dir, dentry);
+	if (err)
+		return err;
+
+	if ((ext4_test_inode_flag(dir, EXT4_INODE_PROJINHERIT)) &&
+	    (!projid_eq(EXT4_I(dir)->i_projid,
+			EXT4_I(old_dentry->d_inode)->i_projid)))
+		return -EXDEV;
+
+	err = dquot_initialize(dir);
+	if (err)
+		return err;
+	return __ext4_link(dir, inode, dentry);
+}
 
 /*
  * Try to find buffer head where contains the parent block.
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2a959e2c7a91..967c4eac87d2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1535,6 +1535,9 @@ enum {
 	Opt_discard, Opt_nodiscard, Opt_init_itable, Opt_noinit_itable,
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
 	Opt_prefetch_block_bitmaps, Opt_no_fc,
+#ifdef CONFIG_EXT4_DEBUG
+	Opt_fc_debug_max_replay
+#endif
 };
 
 static const match_table_t tokens = {
@@ -1622,6 +1625,9 @@ static const match_table_t tokens = {
 	{Opt_init_itable, "init_itable"},
 	{Opt_noinit_itable, "noinit_itable"},
 	{Opt_no_fc, "no_fc"},
+#ifdef CONFIG_EXT4_DEBUG
+	{Opt_fc_debug_max_replay, "fc_debug_max_replay=%u"},
+#endif
 	{Opt_max_dir_size_kb, "max_dir_size_kb=%u"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption=%s"},
 	{Opt_test_dummy_encryption, "test_dummy_encryption"},
@@ -1850,6 +1856,9 @@ static const struct mount_opts {
 	 MOPT_SET},
 	{Opt_no_fc, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
 	 MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
+#ifdef CONFIG_EXT4_DEBUG
+	{Opt_fc_debug_max_replay, 0, MOPT_GTE0},
+#endif
 	{Opt_err, 0, 0}
 };
 
@@ -2051,6 +2060,10 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
 		sbi->s_li_wait_mult = arg;
 	} else if (token == Opt_max_dir_size_kb) {
 		sbi->s_max_dir_size_kb = arg;
+#ifdef CONFIG_EXT4_DEBUG
+	} else if (token == Opt_fc_debug_max_replay) {
+		sbi->s_fc_debug_max_replay = arg;
+#endif
 	} else if (token == Opt_stripe) {
 		sbi->s_stripe = arg;
 	} else if (token == Opt_resuid) {
@@ -4560,6 +4573,13 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
 	spin_lock_init(&sbi->s_fc_lock);
 	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
+	sbi->s_fc_replay_state.fc_regions = NULL;
+	sbi->s_fc_replay_state.fc_regions_size = 0;
+	sbi->s_fc_replay_state.fc_regions_used = 0;
+	sbi->s_fc_replay_state.fc_regions_valid = 0;
+	sbi->s_fc_replay_state.fc_modified_inodes = NULL;
+	sbi->s_fc_replay_state.fc_modified_inodes_size = 0;
+	sbi->s_fc_replay_state.fc_modified_inodes_used = 0;
 
 	sb->s_root = NULL;
 
@@ -4771,6 +4791,7 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			goto failed_mount4a;
 		}
 	}
+	ext4_fc_replay_cleanup(sb);
 
 	ext4_ext_init(sb);
 	err = ext4_mb_init(sb);
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index c8cb43189f02..50dc71366482 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -1776,9 +1776,9 @@ TRACE_EVENT(ext4_ext_load_extent,
 );
 
 TRACE_EVENT(ext4_load_inode,
-	TP_PROTO(struct inode *inode),
+	TP_PROTO(struct super_block *sb, unsigned long ino),
 
-	TP_ARGS(inode),
+	TP_ARGS(sb, ino),
 
 	TP_STRUCT__entry(
 		__field(	dev_t,	dev		)
@@ -1786,8 +1786,8 @@ TRACE_EVENT(ext4_load_inode,
 	),
 
 	TP_fast_assign(
-		__entry->dev		= inode->i_sb->s_dev;
-		__entry->ino		= inode->i_ino;
+		__entry->dev		= sb->s_dev;
+		__entry->ino		= ino;
 	),
 
 	TP_printk("dev %d,%d ino %ld",
@@ -2801,6 +2801,54 @@ TRACE_EVENT(ext4_lazy_itable_init,
 		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->group)
 );
 
+TRACE_EVENT(ext4_fc_replay_scan,
+	TP_PROTO(struct super_block *sb, int error, int off),
+
+	TP_ARGS(sb, error, off),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, error)
+		__field(int, off)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->error = error;
+		__entry->off = off;
+	),
+
+	TP_printk("FC scan pass on dev %d,%d: error %d, off %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->error, __entry->off)
+);
+
+TRACE_EVENT(ext4_fc_replay,
+	TP_PROTO(struct super_block *sb, int tag, int ino, int priv1, int priv2),
+
+	TP_ARGS(sb, tag, ino, priv1, priv2),
+
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, tag)
+		__field(int, ino)
+		__field(int, priv1)
+		__field(int, priv2)
+	),
+
+	TP_fast_assign(
+		__entry->dev = sb->s_dev;
+		__entry->tag = tag;
+		__entry->ino = ino;
+		__entry->priv1 = priv1;
+		__entry->priv2 = priv2;
+	),
+
+	TP_printk("FC Replay %d,%d: tag %d, ino %d, data1 %d, data2 %d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->tag, __entry->ino, __entry->priv1, __entry->priv2)
+);
+
 TRACE_EVENT(ext4_fc_commit_start,
 	TP_PROTO(struct super_block *sb),
 
-- 
2.28.0.681.g6f77f65b4e-goog


* [PATCH v9 8/9] ext4: add a mount opt to forcefully turn fast commits on
  2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
                   ` (6 preceding siblings ...)
  2020-09-19  0:54 ` [PATCH v9 7/9] ext4: " Harshad Shirwadkar
@ 2020-09-19  0:54 ` Harshad Shirwadkar
  2020-09-19  0:54 ` [PATCH v9 9/9] ext4: add fast commit stats in procfs Harshad Shirwadkar
  8 siblings, 0 replies; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This is a debug-only mount option that forcefully turns fast commits
on at mount time.

Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/super.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 967c4eac87d2..43ac8c19bf1d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1536,8 +1536,9 @@ enum {
 	Opt_max_dir_size_kb, Opt_nojournal_checksum, Opt_nombcache,
 	Opt_prefetch_block_bitmaps, Opt_no_fc,
 #ifdef CONFIG_EXT4_DEBUG
-	Opt_fc_debug_max_replay
+	Opt_fc_debug_max_replay,
 #endif
+	Opt_fc_debug_force
 };
 
 static const match_table_t tokens = {
@@ -1625,6 +1626,7 @@ static const match_table_t tokens = {
 	{Opt_init_itable, "init_itable"},
 	{Opt_noinit_itable, "noinit_itable"},
 	{Opt_no_fc, "no_fc"},
+	{Opt_fc_debug_force, "fc_debug_force"},
 #ifdef CONFIG_EXT4_DEBUG
 	{Opt_fc_debug_max_replay, "fc_debug_max_replay=%u"},
 #endif
@@ -1856,6 +1858,8 @@ static const struct mount_opts {
 	 MOPT_SET},
 	{Opt_no_fc, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
 	 MOPT_CLEAR | MOPT_2 | MOPT_EXT4_ONLY},
+	{Opt_fc_debug_force, EXT4_MOUNT2_JOURNAL_FAST_COMMIT,
+	 MOPT_SET | MOPT_2 | MOPT_EXT4_ONLY},
 #ifdef CONFIG_EXT4_DEBUG
 	{Opt_fc_debug_max_replay, 0, MOPT_GTE0},
 #endif
-- 
2.28.0.681.g6f77f65b4e-goog


* [PATCH v9 9/9] ext4: add fast commit stats in procfs
  2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
                   ` (7 preceding siblings ...)
  2020-09-19  0:54 ` [PATCH v9 8/9] ext4: add a mount opt to forcefully turn fast commits on Harshad Shirwadkar
@ 2020-09-19  0:54 ` Harshad Shirwadkar
  8 siblings, 0 replies; 29+ messages in thread
From: Harshad Shirwadkar @ 2020-09-19  0:54 UTC (permalink / raw)
  To: linux-ext4; +Cc: tytso, Harshad Shirwadkar

This commit adds a procfs file, fc_info, that reports fast commit
statistics. Sample output:

root@kvm-xfstests:/mnt# cat /proc/fs/ext4/vdc/fc_info
fc stats:
7772 commits
15 ineligible
4083 numblks
2242us avg_commit_time
Ineligible reasons:
"Extended attributes changed":  0
"Cross rename": 0
"Journal flag changed": 0
"Insufficient memory":  0
"Swap boot":    0
"Resize":       0
"Dir renamed":  0
"Falloc range op":      0
"FC Commit Failed":     15

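The summary lines above have a simple "<number> <name>" shape, so a
monitoring tool could pick them up with a few lines of user-space C. The
following is only a sketch, assuming the format stays as in this sample
(procfs text formats are not a stable interface). struct fc_summary and
fc_parse_summary() are hypothetical names and not part of the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the four summary counters in fc_info. */
struct fc_summary {
	unsigned long long commits;
	unsigned long long ineligible;
	unsigned long long numblks;
	unsigned long long avg_commit_time_us;
};

/*
 * Parse the summary lines of an fc_info dump held in text.
 * Returns 0 if all four counters were found, -1 otherwise.
 */
static int fc_parse_summary(const char *text, struct fc_summary *out)
{
	unsigned int found = 0;

	assert(text && out);
	memset(out, 0, sizeof(*out));

	while (*text) {
		size_t len = strcspn(text, "\n");
		char line[128], w1[32], w2[32];
		unsigned long long val;
		int n = 0;

		if (len < sizeof(line)) {
			memcpy(line, text, len);
			line[len] = '\0';
			/* "<n> <key>" yields 2 fields, "<n>us <key>" yields 3. */
			n = sscanf(line, "%llu%31s%31s", &val, w1, w2);
		}

		if (n == 2 && !strcmp(w1, "commits")) {
			out->commits = val;
			found |= 1;
		} else if (n == 2 && !strcmp(w1, "ineligible")) {
			out->ineligible = val;
			found |= 2;
		} else if (n == 2 && !strcmp(w1, "numblks")) {
			out->numblks = val;
			found |= 4;
		} else if (n == 3 && !strcmp(w1, "us") &&
			   !strcmp(w2, "avg_commit_time")) {
			out->avg_commit_time_us = val;
			found |= 8;
		}

		/* Advance past this line and its newline, if any. */
		text += len;
		if (*text == '\n')
			text++;
	}

	return found == 0xf ? 0 : -1;
}
```

The quoted "Ineligible reasons" lines start with a double quote, so the
leading %llu conversion fails on them and they are skipped; a caller that
wants the per-reason counts would need a second pass.
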
Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
---
 fs/ext4/ext4.h        |  2 +-
 fs/ext4/fast_commit.c | 34 ++++++++++++++++++++++++++++++++++
 fs/ext4/sysfs.c       |  2 ++
 3 files changed, 37 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 372a38292ed1..1a8b10ed412f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2732,7 +2732,7 @@ extern int ext4_init_inode_table(struct super_block *sb,
 extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
 
 /* fast_commit.c */
-
+int ext4_fc_info_show(struct seq_file *seq, void *v);
 void ext4_fc_init(struct super_block *sb, journal_t *journal);
 void ext4_fc_init_inode(struct inode *inode);
 void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
index 63429076ad59..1380bfe7bd0d 100644
--- a/fs/ext4/fast_commit.c
+++ b/fs/ext4/fast_commit.c
@@ -2103,6 +2103,40 @@ void ext4_fc_init(struct super_block *sb, journal_t *journal)
 	}
 }
 
+const char *fc_ineligible_reasons[] = {
+	"Extended attributes changed",
+	"Cross rename",
+	"Journal flag changed",
+	"Insufficient memory",
+	"Swap boot",
+	"Resize",
+	"Dir renamed",
+	"Falloc range op",
+	"FC Commit Failed"
+};
+
+int ext4_fc_info_show(struct seq_file *seq, void *v)
+{
+	struct ext4_sb_info *sbi = EXT4_SB((struct super_block *)seq->private);
+	struct ext4_fc_stats *stats = &sbi->s_fc_stats;
+	int i;
+
+	if (v != SEQ_START_TOKEN)
+		return 0;
+
+	seq_printf(seq,
+		"fc stats:\n%d commits\n%d ineligible\n%d numblks\n%lluus avg_commit_time\n",
+		   stats->fc_num_commits, stats->fc_ineligible_commits,
+		   stats->fc_numblks,
+		   div_u64(sbi->s_fc_avg_commit_time, 1000));
+	seq_puts(seq, "Ineligible reasons:\n");
+	for (i = 0; i < EXT4_FC_REASON_MAX; i++)
+		seq_printf(seq, "\"%s\":\t%d\n", fc_ineligible_reasons[i],
+			stats->fc_ineligible_reason_count[i]);
+
+	return 0;
+}
+
 int __init ext4_fc_init_dentry_cache(void)
 {
 	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
diff --git a/fs/ext4/sysfs.c b/fs/ext4/sysfs.c
index bfabb799fa45..5ff33d18996a 100644
--- a/fs/ext4/sysfs.c
+++ b/fs/ext4/sysfs.c
@@ -521,6 +521,8 @@ int ext4_register_sysfs(struct super_block *sb)
 		proc_create_single_data("es_shrinker_info", S_IRUGO,
 				sbi->s_proc, ext4_seq_es_shrinker_info_show,
 				sb);
+		proc_create_single_data("fc_info", 0444, sbi->s_proc,
+					ext4_fc_info_show, sb);
 		proc_create_seq_data("mb_groups", S_IRUGO, sbi->s_proc,
 				&ext4_mb_seq_groups_ops, sb);
 	}
-- 
2.28.0.681.g6f77f65b4e-goog


* Re: [PATCH v9 5/9] ext4: main fast-commit commit path
  2020-09-19  0:54 ` [PATCH v9 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
@ 2020-09-19  8:19   ` kernel test robot
  2020-10-09 17:04   ` Ritesh Harjani
  2020-10-09 19:14   ` Theodore Y. Ts'o
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-09-19  8:19 UTC (permalink / raw)
  To: Harshad Shirwadkar, linux-ext4
  Cc: kbuild-all, clang-built-linux, tytso, Harshad Shirwadkar


Hi Harshad,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on ext4/dev]
[also build test WARNING on linus/master v5.9-rc5 next-20200918]
[cannot apply to tip/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-fast-commits-feature/20200919-085652
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: x86_64-randconfig-a006-20200917 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project ed79827aea444e6995fb3d36abc2bfd36331773c)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/ext4/fast_commit.c:1079:6: warning: variable 'start_time' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
           if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/ext4/fast_commit.c:1135:51: note: uninitialized use occurs here
           commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
                                                            ^~~~~~~~~~
   include/linux/ktime.h:46:39: note: expanded from macro 'ktime_sub'
   #define ktime_sub(lhs, rhs)     ((lhs) - (rhs))
                                             ^~~
   fs/ext4/fast_commit.c:1079:2: note: remove the 'if' if its condition is always false
           if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> fs/ext4/fast_commit.c:1079:6: warning: variable 'start_time' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
           if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/ext4/fast_commit.c:1135:51: note: uninitialized use occurs here
           commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
                                                            ^~~~~~~~~~
   include/linux/ktime.h:46:39: note: expanded from macro 'ktime_sub'
   #define ktime_sub(lhs, rhs)     ((lhs) - (rhs))
                                             ^~~
   fs/ext4/fast_commit.c:1079:6: note: remove the '||' if its condition is always false
           if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/ext4/fast_commit.c:1075:20: note: initialize the variable 'start_time' to silence this warning
           ktime_t start_time, commit_time;
                             ^
                              = 0
   2 warnings generated.
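
Both warnings come from the early "goto out" at line 1082 of the listing
below: that path reaches the ktime_sub() at line 1135 before start_time
has been assigned at line 1085. One possible restructuring, sketched here
as a user-space model rather than as the actual fix, is to take the start
timestamp before the first branch that can jump to the shared exit label.
fc_update_avg() and fc_commit_model() are hypothetical names; the
averaging rule is the one from the listing, (commit_time + 3 * avg) / 4.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the s_fc_avg_commit_time update: the first
 * sample seeds the average, later samples contribute with weight 1/4.
 */
static uint64_t fc_update_avg(uint64_t avg_ns, uint64_t sample_ns)
{
	if (avg_ns == 0)
		return sample_ns;
	return (sample_ns + avg_ns * 3) / 4;
}

/*
 * Hypothetical model of the control flow in ext4_fc_commit(). now_start and
 * now_end stand in for the two ktime_get() calls; eligible selects between
 * the early exit and the normal path.
 */
static uint64_t fc_commit_model(int eligible, uint64_t now_start,
				uint64_t now_end, uint64_t avg_ns)
{
	uint64_t start_time;

	assert(now_end >= now_start);

	/* Taken before any goto, so every path to "out" sees a valid value. */
	start_time = now_start;

	if (!eligible)
		goto out;

	/* ... the fast commit work would happen here ... */

out:
	return fc_update_avg(avg_ns, now_end - start_time);
}
```

An alternative that also silences the warning would be to skip the average
update when no fast commit was attempted; which one is preferable depends
on whether ineligible commits should count toward avg_commit_time.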

# https://github.com/0day-ci/linux/commit/2384cbfbcf98b789d426c39b458c52adbb36d4f9
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Harshad-Shirwadkar/ext4-add-fast-commits-feature/20200919-085652
git checkout 2384cbfbcf98b789d426c39b458c52adbb36d4f9
vim +1079 fs/ext4/fast_commit.c

  1061	
  1062	/*
  1063	 * The main commit entry point. Performs a fast commit for transaction
  1064	 * commit_tid if needed. If it's not possible to perform a fast commit
  1065	 * due to various reasons, we fall back to full commit. Returns 0
  1066	 * on success, error otherwise.
  1067	 */
  1068	int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
  1069	{
  1070		struct super_block *sb = (struct super_block *)(journal->j_private);
  1071		struct ext4_sb_info *sbi = EXT4_SB(sb);
  1072		int nblks = 0, ret, bsize = journal->j_blocksize;
  1073		int subtid = atomic_read(&sbi->s_fc_subtid);
  1074		int reason = EXT4_FC_REASON_OK, fc_bufs_before = 0;
  1075		ktime_t start_time, commit_time;
  1076	
  1077		trace_ext4_fc_commit_start(sb);
  1078	
> 1079		if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
  1080			(ext4_fc_is_ineligible(sb))) {
  1081			reason = EXT4_FC_REASON_INELIGIBLE;
  1082			goto out;
  1083		}
  1084	
  1085		start_time = ktime_get();
  1086	restart_fc:
  1087		ret = jbd2_fc_start(journal, commit_tid);
  1088		if (ret == -EALREADY) {
  1089			/* There was an ongoing commit, check if we need to restart */
  1090			if (atomic_read(&sbi->s_fc_subtid) <= subtid &&
  1091				commit_tid > journal->j_commit_sequence)
  1092				goto restart_fc;
  1093			reason = EXT4_FC_REASON_ALREADY_COMMITTED;
  1094			goto out;
  1095		} else if (ret) {
  1096			sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
  1097			reason = EXT4_FC_REASON_FC_START_FAILED;
  1098			goto out;
  1099		}
  1100	
  1101		fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
  1102		ret = ext4_fc_perform_commit(journal);
  1103		if (ret < 0) {
  1104			sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
  1105			reason = EXT4_FC_REASON_FC_FAILED;
  1106			goto out;
  1107		}
  1108		nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
  1109		ret = jbd2_fc_wait_bufs(journal, nblks);
  1110		if (ret < 0) {
  1111			sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
  1112			reason = EXT4_FC_REASON_FC_FAILED;
  1113			goto out;
  1114		}
  1115		atomic_inc(&sbi->s_fc_subtid);
  1116		jbd2_fc_stop(journal);
  1117	out:
  1118		/* Has any ineligible update happened since we started? */
  1119		if (reason == EXT4_FC_REASON_OK && ext4_fc_is_ineligible(sb)) {
  1120			sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
  1121			reason = EXT4_FC_REASON_INELIGIBLE;
  1122		}
  1123	
  1124		spin_lock(&sbi->s_fc_lock);
  1125		if (reason != EXT4_FC_REASON_OK &&
  1126			reason != EXT4_FC_REASON_ALREADY_COMMITTED) {
  1127			sbi->s_fc_stats.fc_ineligible_commits++;
  1128		} else {
  1129			sbi->s_fc_stats.fc_num_commits++;
  1130			sbi->s_fc_stats.fc_numblks += nblks;
  1131		}
  1132		spin_unlock(&sbi->s_fc_lock);
  1133		nblks = (reason == EXT4_FC_REASON_OK) ? nblks : 0;
  1134		trace_ext4_fc_commit_stop(sb, nblks, reason);
  1135		commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
  1136		/*
  1137		 * weight the commit time higher than the average time so we don't
  1138		 * react too strongly to vast changes in the commit time
  1139		 */
  1140		if (likely(sbi->s_fc_avg_commit_time))
  1141			sbi->s_fc_avg_commit_time = (commit_time +
  1142					sbi->s_fc_avg_commit_time * 3) / 4;
  1143		else
  1144			sbi->s_fc_avg_commit_time = commit_time;
  1145		jbd_debug(1,
  1146			"Fast commit ended with blks = %d, reason = %d, subtid - %d",
  1147			nblks, reason, subtid);
  1148		if (reason == EXT4_FC_REASON_FC_FAILED)
  1149			return jbd2_fc_stop_do_commit(journal, commit_tid);
  1150		if (reason == EXT4_FC_REASON_FC_START_FAILED ||
  1151			reason == EXT4_FC_REASON_INELIGIBLE)
  1152			return jbd2_complete_transaction(journal, commit_tid);
  1153		return 0;
  1154	}
  1155	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 43605 bytes --]

* Re: [PATCH v9 7/9] ext4: fast commit recovery path
  2020-09-19  0:54 ` [PATCH v9 7/9] ext4: " Harshad Shirwadkar
@ 2020-09-19 14:15   ` kernel test robot
  2020-10-09 17:14   ` Ritesh Harjani
  1 sibling, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-09-19 14:15 UTC (permalink / raw)
  To: Harshad Shirwadkar, linux-ext4
  Cc: kbuild-all, clang-built-linux, tytso, Harshad Shirwadkar


Hi Harshad,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on ext4/dev]
[also build test WARNING on linus/master v5.9-rc5 next-20200918]
[cannot apply to tip/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-fast-commits-feature/20200919-085652
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: x86_64-randconfig-a006-20200917 (attached as .config)
compiler: clang version 12.0.0 (https://github.com/llvm/llvm-project ed79827aea444e6995fb3d36abc2bfd36331773c)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install x86_64 cross compiling tool for clang build
        # apt-get install binutils-x86-64-linux-gnu
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

>> fs/ext4/mballoc.c:5226:6: warning: no previous prototype for function 'ext4_free_blocks_simple' [-Wmissing-prototypes]
   void ext4_free_blocks_simple(struct inode *inode, ext4_fsblk_t block,
        ^
   fs/ext4/mballoc.c:5226:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void ext4_free_blocks_simple(struct inode *inode, ext4_fsblk_t block,
   ^
   static 
   1 warning generated.
--
   fs/ext4/fast_commit.c:1091:6: warning: variable 'start_time' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
           if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/ext4/fast_commit.c:1147:51: note: uninitialized use occurs here
           commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
                                                            ^~~~~~~~~~
   include/linux/ktime.h:46:39: note: expanded from macro 'ktime_sub'
   #define ktime_sub(lhs, rhs)     ((lhs) - (rhs))
                                             ^~~
   fs/ext4/fast_commit.c:1091:2: note: remove the 'if' if its condition is always false
           if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/ext4/fast_commit.c:1091:6: warning: variable 'start_time' is used uninitialized whenever '||' condition is true [-Wsometimes-uninitialized]
           if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/ext4/fast_commit.c:1147:51: note: uninitialized use occurs here
           commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
                                                            ^~~~~~~~~~
   include/linux/ktime.h:46:39: note: expanded from macro 'ktime_sub'
   #define ktime_sub(lhs, rhs)     ((lhs) - (rhs))
                                             ^~~
   fs/ext4/fast_commit.c:1091:6: note: remove the '||' if its condition is always false
           if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   fs/ext4/fast_commit.c:1087:20: note: initialize the variable 'start_time' to silence this warning
           ktime_t start_time, commit_time;
                             ^
                              = 0
>> fs/ext4/fast_commit.c:1787:6: warning: no previous prototype for function 'ext4_fc_set_bitmaps_and_counters' [-Wmissing-prototypes]
   void ext4_fc_set_bitmaps_and_counters(struct super_block *sb)
        ^
   fs/ext4/fast_commit.c:1787:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
   void ext4_fc_set_bitmaps_and_counters(struct super_block *sb)
   ^
   static 
   3 warnings generated.
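
For the two -Wmissing-prototypes warnings there are two usual remedies:
declare the function in a header that the defining file includes, or give
it internal linkage if nothing outside the file calls it. In the mballoc.c
hunks of this patch, ext4_free_blocks_simple() is called from
ext4_free_blocks() in the same file, so static may be enough there if no
other file uses it; for ext4_fc_set_bitmaps_and_counters() that depends on
its callers. A small stand-alone illustration with hypothetical names:

```c
#include <assert.h>

/* Remedy 1: a prototype seen before the definition (normally via a header). */
unsigned long demo_blocks_to_free(unsigned long count,
				  unsigned long already_freed);

unsigned long demo_blocks_to_free(unsigned long count,
				  unsigned long already_freed)
{
	assert(already_freed <= count);
	return count - already_freed;
}

/* Remedy 2: internal linkage, so no external prototype is expected. */
static unsigned long demo_free_clusters(unsigned long free_now,
					unsigned long count,
					unsigned long already_freed)
{
	return free_now + demo_blocks_to_free(count, already_freed);
}
```

Built with -Wmissing-prototypes, neither definition above should trigger
the warning; removing the separate prototype for demo_blocks_to_free()
would bring it back.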

# https://github.com/0day-ci/linux/commit/e027573763bf3d778cd1c279f9b08698817c00f1
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Harshad-Shirwadkar/ext4-add-fast-commits-feature/20200919-085652
git checkout e027573763bf3d778cd1c279f9b08698817c00f1
vim +/ext4_free_blocks_simple +5226 fs/ext4/mballoc.c

  5225	
> 5226	void ext4_free_blocks_simple(struct inode *inode, ext4_fsblk_t block,
  5227				     unsigned long count)
  5228	{
  5229		struct buffer_head *bitmap_bh;
  5230		struct super_block *sb = inode->i_sb;
  5231		struct ext4_group_desc *gdp;
  5232		struct buffer_head *gdp_bh;
  5233		ext4_group_t group;
  5234		ext4_grpblk_t blkoff;
  5235		int already_freed = 0, err, i;
  5236	
  5237		ext4_get_group_no_and_offset(sb, block, &group, &blkoff);
  5238		bitmap_bh = ext4_read_block_bitmap(sb, group);
  5239		if (IS_ERR(bitmap_bh)) {
  5240			err = PTR_ERR(bitmap_bh);
  5241			pr_warn("Failed to read block bitmap\n");
  5242			return;
  5243		}
  5244		gdp = ext4_get_group_desc(sb, group, &gdp_bh);
  5245		if (!gdp)
  5246			return;
  5247	
  5248		for (i = 0; i < count; i++) {
  5249			if (!mb_test_bit(blkoff + i, bitmap_bh->b_data))
  5250				already_freed++;
  5251		}
  5252		mb_clear_bits(bitmap_bh->b_data, blkoff, count);
  5253		err = ext4_handle_dirty_metadata(NULL, NULL, bitmap_bh);
  5254		if (err)
  5255			return;
  5256		ext4_free_group_clusters_set(
  5257			sb, gdp, ext4_free_group_clusters(sb, gdp) +
  5258			count - already_freed);
  5259		ext4_block_bitmap_csum_set(sb, group, gdp, bitmap_bh);
  5260		ext4_group_desc_csum_set(sb, group, gdp);
  5261		ext4_handle_dirty_metadata(NULL, NULL, gdp_bh);
  5262		sync_dirty_buffer(bitmap_bh);
  5263		sync_dirty_buffer(gdp_bh);
  5264		brelse(bitmap_bh);
  5265	}
  5266	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 43605 bytes --]


* Re: [PATCH v9 3/9] ext4 / jbd2: add fast commit initialization
  2020-09-19  0:54 ` [PATCH v9 3/9] ext4 / jbd2: add fast commit initialization Harshad Shirwadkar
@ 2020-09-19 15:22   ` kernel test robot
  2020-10-09 16:10   ` Ritesh Harjani
  1 sibling, 0 replies; 29+ messages in thread
From: kernel test robot @ 2020-09-19 15:22 UTC (permalink / raw)
  To: Harshad Shirwadkar, linux-ext4; +Cc: kbuild-all, tytso, Harshad Shirwadkar


[-- Attachment #1: Type: text/plain, Size: 1341 bytes --]

Hi Harshad,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on ext4/dev]
[also build test ERROR on linus/master v5.9-rc5 next-20200918]
[cannot apply to tip/perf/core tytso-fscrypt/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Harshad-Shirwadkar/ext4-add-fast-commits-feature/20200919-085652
base:   https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git dev
config: m68k-allmodconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=m68k 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All errors (new ones prefixed by >>, old ones prefixed by <<):

>> ERROR: modpost: "jbd2_fc_init" [fs/ext4/ext4.ko] undefined!

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 57294 bytes --]


* Re: [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature
  2020-09-19  0:54 ` [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
@ 2020-09-22 17:50   ` Darrick J. Wong
  2020-09-24  6:56     ` harshad shirwadkar
  2020-10-09 18:28   ` Theodore Y. Ts'o
  1 sibling, 1 reply; 29+ messages in thread
From: Darrick J. Wong @ 2020-09-22 17:50 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4, tytso

On Fri, Sep 18, 2020 at 05:54:43PM -0700, Harshad Shirwadkar wrote:
> This patch adds necessary documentation for fast commits.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>  Documentation/filesystems/ext4/journal.rst | 66 ++++++++++++++++++++++
>  Documentation/filesystems/journalling.rst  | 28 +++++++++
>  2 files changed, 94 insertions(+)
> 
> diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> index ea613ee701f5..c2e4d010a201 100644
> --- a/Documentation/filesystems/ext4/journal.rst
> +++ b/Documentation/filesystems/ext4/journal.rst
> @@ -28,6 +28,17 @@ metadata are written to disk through the journal. This is slower but
>  safest. If ``data=writeback``, dirty data blocks are not flushed to the
>  disk before the metadata are written to disk through the journal.
>  
> +In case of ``data=ordered`` mode, Ext4 also supports fast commits which
> +help reduce commit latency significantly. The default ``data=ordered``
> +mode works by logging metadata blocks tothe journal. In fast commit

"to the journal"

> +mode, Ext4 only stores the minimal delta needed to recreate the
> +affected metadata in fast commit space that is shared with JBD2.
> +Once the fast commit area fills in or if fast commit is not possible
> +or if JBD2 commit timer goes off, Ext4 performs a traditional full commit.
> +A full commit invalidates all the fast commits that happened before
> +it and thus it makes the fast commit area empty for further fast
> +commits. This feature needs to be enabled at compile time.

And mkfs time too, I would hope?

> +
>  The journal inode is typically inode 8. The first 68 bytes of the
>  journal inode are replicated in the ext4 superblock. The journal itself
>  is normal (but hidden) file within the filesystem. The file usually
> @@ -609,3 +620,58 @@ bytes long (but uses a full block):
>       - h\_commit\_nsec
>       - Nanoseconds component of the above timestamp.
>  
> +Fast commits
> +~~~~~~~~~~~~
> +
> +Fast commit area is organized as a log of tag tag length values. Each TLV has
> +a ``struct ext4_fc_tl`` in the beginning which stores the tag and the length
> +of the entire field. It is followed by variable length tag specific value.

"The fast commit area is organized as a log of tagged variable-length
values.  Each value begins with a ``struct ext4_fc_tl`` tag that
identifies the type of the value and its length, and is followed by the
value itself." ?

I would've called that struct "ext4_fc_tag" or something, since "tl"
isn't really a word... ah well.
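
For illustration only, here is a minimal userspace sketch of how a log of
tagged variable-length values can be walked. The 16-bit tag and length
fields, the host byte order, the tag numbers, and the names fc_tl_sketch
and fc_count_records are all assumptions made for the example; they are
not the on-disk layout from the patch. The sketch also assumes the length
covers the value only, not the header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record header: a tag followed by the length of the value. */
struct fc_tl_sketch {
	uint16_t tag;
	uint16_t len;	/* length of the value only, header excluded */
};

#define FC_TAG_PAD_SKETCH	0	/* hypothetical tag numbers */
#define FC_TAG_TAIL_SKETCH	1

/*
 * Walk the records in buf and return how many were seen, stopping after the
 * first tail record. Returns -1 if a value runs past the end of the buffer.
 */
static int fc_count_records(const uint8_t *buf, size_t size)
{
	size_t off = 0;
	int n = 0;

	assert(buf != NULL);
	while (size - off >= sizeof(struct fc_tl_sketch)) {
		struct fc_tl_sketch tl;

		/* memcpy avoids unaligned access to the header. */
		memcpy(&tl, buf + off, sizeof(tl));
		off += sizeof(tl);
		if (tl.len > size - off)
			return -1;	/* truncated value */
		off += tl.len;		/* skip over the value itself */
		n++;
		if (tl.tag == FC_TAG_TAIL_SKETCH)
			break;
	}
	return n;
}
```

A real parser would dispatch on the tag instead of just counting, and
would convert the fields from their on-disk byte order first.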

> +Here is the list of supported tags and their meanings:
> +
> +.. list-table::
> +   :widths: 8 20 20 32
> +   :header-rows: 1
> +
> +   * - Tag
> +     - Meaning
> +     - Value struct
> +     - Description
> +   * - EXT4_FC_TAG_HEAD
> +     - Fast commit area header
> +     - ``struct ext4_fc_head``
> +     - Stores the TID of the transaction after which these fast commits should
> +       be applied.

So I guess log recovery is supposed to apply the transaction TID, then
apply these fast commits, and then move on to the next transaction?

--D

> +   * - EXT4_FC_TAG_ADD_RANGE
> +     - Add extent to inode
> +     - ``struct ext4_fc_add_range``
> +     - Stores the inode number and extent to be added in this inode
> +   * - EXT4_FC_TAG_DEL_RANGE
> +     - Remove logical offsets to inode
> +     - ``struct ext4_fc_del_range``
> +     - Stores the inode number and the logical offset range that needs to be
> +       removed
> +   * - EXT4_FC_TAG_CREAT
> +     - Create directory entry for a newly created file
> +     - ``struct ext4_fc_dentry_info``
> +     - Stores the parent inode numer, inode number and directory entry of the
> +       newly created file
> +   * - EXT4_FC_TAG_LINK
> +     - Link a directory entry to an inode
> +     - ``struct ext4_fc_dentry_info``
> +     - Stores the parent inode numer, inode number and directory entry
> +   * - EXT4_FC_TAG_UNLINK
> +     - Unink a directory entry of an inode
> +     - ``struct ext4_fc_dentry_info``
> +     - Stores the parent inode numer, inode number and directory entry
> +
> +   * - EXT4_FC_TAG_PAD
> +     - Padding (unused area)
> +     - None
> +     - Unused bytes in the fast commit area.
> +
> +   * - EXT4_FC_TAG_TAIL
> +     - Mark the end of a fast commit
> +     - ``struct ext4_fc_tail``
> +     - Stores the TID of the commit, CRC of the fast commit of which this tag
> +       represents the end of
> +
> diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> index 58ce6b395206..a9817220dc9b 100644
> --- a/Documentation/filesystems/journalling.rst
> +++ b/Documentation/filesystems/journalling.rst
> @@ -132,6 +132,34 @@ The opportunities for abuse and DOS attacks with this should be obvious,
>  if you allow unprivileged userspace to trigger codepaths containing
>  these calls.
>  
> +Fast commits
> +~~~~~~~~~~~~
> +
> +JBD2 to also allows you to perform file-system specific delta commits known as
> +fast commits. In order to use fast commits, you first need to call
> +:c:func:`jbd2_fc_init` and tell how many blocks at the end of journal
> +area should be reserved for fast commits. Along with that, you will also need
> +to set following callbacks that perform correspodning work:
> +
> +`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
> +fast commit.
> +
> +`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
> +blocks.
> +
> +File system is free to perform fast commits as and when it wants as long as it
> +gets permission from JBD2 to do so by calling the function
> +:c:func:`jbd2_fc_start()`. Once a fast commit is done, the client
> +file  system should tell JBD2 about it by calling :c:func:`jbd2_fc_stop()`.
> +If file system wants JBD2 to perform a full commit immediately after stopping
> +the fast commit it can do so by calling :c:func:`jbd2_fc_stop_do_commit()`.
> +This is useful if fast commit operation fails for some reason and the only way
> +to guarantee consistency is for JBD2 to perform the full traditional commit.
> +
> +JBD2 helper functions to manage fast commit buffers. File system can use
> +:c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()` to allocate
> +and wait on IO completion of fast commit buffers.
> +
>  Summary
>  ~~~~~~~
>  
> -- 
> 2.28.0.681.g6f77f65b4e-goog
> 


* Re: [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature
  2020-09-22 17:50   ` Darrick J. Wong
@ 2020-09-24  6:56     ` harshad shirwadkar
  0 siblings, 0 replies; 29+ messages in thread
From: harshad shirwadkar @ 2020-09-24  6:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Ext4 Developers List, Theodore Y. Ts'o

Thanks for the comments, Darrick. Responses inline below:

On Tue, Sep 22, 2020 at 10:52 AM Darrick J. Wong
<darrick.wong@oracle.com> wrote:
>
> On Fri, Sep 18, 2020 at 05:54:43PM -0700, Harshad Shirwadkar wrote:
> > This patch adds necessary documentation for fast commits.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > ---
> >  Documentation/filesystems/ext4/journal.rst | 66 ++++++++++++++++++++++
> >  Documentation/filesystems/journalling.rst  | 28 +++++++++
> >  2 files changed, 94 insertions(+)
> >
> > diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> > index ea613ee701f5..c2e4d010a201 100644
> > --- a/Documentation/filesystems/ext4/journal.rst
> > +++ b/Documentation/filesystems/ext4/journal.rst
> > @@ -28,6 +28,17 @@ metadata are written to disk through the journal. This is slower but
> >  safest. If ``data=writeback``, dirty data blocks are not flushed to the
> >  disk before the metadata are written to disk through the journal.
> >
> > +In case of ``data=ordered`` mode, Ext4 also supports fast commits which
> > +help reduce commit latency significantly. The default ``data=ordered``
> > +mode works by logging metadata blocks tothe journal. In fast commit
>
> "to the journal"
ack
>
> > +mode, Ext4 only stores the minimal delta needed to recreate the
> > +affected metadata in fast commit space that is shared with JBD2.
> > +Once the fast commit area fills in or if fast commit is not possible
> > +or if JBD2 commit timer goes off, Ext4 performs a traditional full commit.
> > +A full commit invalidates all the fast commits that happened before
> > +it and thus it makes the fast commit area empty for further fast
> > +commits. This feature needs to be enabled at compile time.
>
> And mkfs time too, I would hope?
Haha, thanks for catching this, yeah I meant mkfs time :)
>
> > +
> >  The journal inode is typically inode 8. The first 68 bytes of the
> >  journal inode are replicated in the ext4 superblock. The journal itself
> >  is normal (but hidden) file within the filesystem. The file usually
> > @@ -609,3 +620,58 @@ bytes long (but uses a full block):
> >       - h\_commit\_nsec
> >       - Nanoseconds component of the above timestamp.
> >
> > +Fast commits
> > +~~~~~~~~~~~~
> > +
> > +Fast commit area is organized as a log of tag tag length values. Each TLV has
> > +a ``struct ext4_fc_tl`` in the beginning which stores the tag and the length
> > +of the entire field. It is followed by variable length tag specific value.
>
> "The fast commit area is organized as a log of tagged variable-length
> values.  Each value begins with a ``struct ext4_fc_tl`` tag that
> identifies the type of the value and its length, and is followed by the
> value itself." ?
>
> I would've called that struct "ext4_fc_tag" or something, since "tl"
> isn't really a word... ah well.
Ack, ext4_fc_tag is good. Ideally I would have liked a name that
somehow conveys that there is both a tag and a length in there. But I
get that "tl" doesn't mean anything, so unless we find a better name, I
can use ext4_fc_tag.
>
> > +Here is the list of supported tags and their meanings:
> > +
> > +.. list-table::
> > +   :widths: 8 20 20 32
> > +   :header-rows: 1
> > +
> > +   * - Tag
> > +     - Meaning
> > +     - Value struct
> > +     - Description
> > +   * - EXT4_FC_TAG_HEAD
> > +     - Fast commit area header
> > +     - ``struct ext4_fc_head``
> > +     - Stores the TID of the transaction after which these fast commits should
> > +       be applied.
>
> So I guess log recovery is supposed to apply the transaction TID, then
> apply these fast commits, and then move on to the next transaction?
Not quite. Log recovery should first apply all the full transactions in
the journal. Say the last of those transactions has ID TID. Recovery
should then apply the fast commits only if the head tag in the fast
commit area records that same TID. Since a full commit invalidates all
the previous fast commits, there is no need to replay any fast commits
whose head TID is < TID.
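
Roughly, the check looks like the sketch below. This is only a userspace
illustration of the rule described above, not code from the series; the
struct fc_head_sketch, its fc_tid field, and fc_should_replay() are names
made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-memory copy of the value stored with the head tag. */
struct fc_head_sketch {
	uint32_t fc_tid;	/* TID after which the fast commits apply */
};

/*
 * Decide whether the fast commit area should be replayed.
 * last_full_tid is the TID of the last full transaction that recovery
 * has already applied.
 */
static bool fc_should_replay(uint32_t last_full_tid,
			     const struct fc_head_sketch *head)
{
	assert(head != NULL);
	/* Fast commits recorded against an older full commit are stale. */
	return head->fc_tid == last_full_tid;
}
```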

Thanks,
Harshad
>
> --D
>
> > +   * - EXT4_FC_TAG_ADD_RANGE
> > +     - Add extent to inode
> > +     - ``struct ext4_fc_add_range``
> > +     - Stores the inode number and extent to be added in this inode
> > +   * - EXT4_FC_TAG_DEL_RANGE
> > +     - Remove logical offsets to inode
> > +     - ``struct ext4_fc_del_range``
> > +     - Stores the inode number and the logical offset range that needs to be
> > +       removed
> > +   * - EXT4_FC_TAG_CREAT
> > +     - Create directory entry for a newly created file
> > +     - ``struct ext4_fc_dentry_info``
> > +     - Stores the parent inode numer, inode number and directory entry of the
> > +       newly created file
> > +   * - EXT4_FC_TAG_LINK
> > +     - Link a directory entry to an inode
> > +     - ``struct ext4_fc_dentry_info``
> > +     - Stores the parent inode numer, inode number and directory entry
> > +   * - EXT4_FC_TAG_UNLINK
> > +     - Unink a directory entry of an inode
> > +     - ``struct ext4_fc_dentry_info``
> > +     - Stores the parent inode numer, inode number and directory entry
> > +
> > +   * - EXT4_FC_TAG_PAD
> > +     - Padding (unused area)
> > +     - None
> > +     - Unused bytes in the fast commit area.
> > +
> > +   * - EXT4_FC_TAG_TAIL
> > +     - Mark the end of a fast commit
> > +     - ``struct ext4_fc_tail``
> > +     - Stores the TID of the commit, CRC of the fast commit of which this tag
> > +       represents the end of
> > +
> > diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
> > index 58ce6b395206..a9817220dc9b 100644
> > --- a/Documentation/filesystems/journalling.rst
> > +++ b/Documentation/filesystems/journalling.rst
> > @@ -132,6 +132,34 @@ The opportunities for abuse and DOS attacks with this should be obvious,
> >  if you allow unprivileged userspace to trigger codepaths containing
> >  these calls.
> >
> > +Fast commits
> > +~~~~~~~~~~~~
> > +
> > +JBD2 to also allows you to perform file-system specific delta commits known as
> > +fast commits. In order to use fast commits, you first need to call
> > +:c:func:`jbd2_fc_init` and tell how many blocks at the end of journal
> > +area should be reserved for fast commits. Along with that, you will also need
> > +to set following callbacks that perform correspodning work:
> > +
> > +`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
> > +fast commit.
> > +
> > +`journal->j_fc_replay_cb`: Replay function called for replay of fast commit
> > +blocks.
> > +
> > +File system is free to perform fast commits as and when it wants as long as it
> > +gets permission from JBD2 to do so by calling the function
> > +:c:func:`jbd2_fc_start()`. Once a fast commit is done, the client
> > +file  system should tell JBD2 about it by calling :c:func:`jbd2_fc_stop()`.
> > +If file system wants JBD2 to perform a full commit immediately after stopping
> > +the fast commit it can do so by calling :c:func:`jbd2_fc_stop_do_commit()`.
> > +This is useful if fast commit operation fails for some reason and the only way
> > +to guarantee consistency is for JBD2 to perform the full traditional commit.
> > +
> > +JBD2 helper functions to manage fast commit buffers. File system can use
> > +:c:func:`jbd2_fc_get_buf()` and :c:func:`jbd2_fc_wait_bufs()` to allocate
> > +and wait on IO completion of fast commit buffers.
> > +
> >  Summary
> >  ~~~~~~~
> >
> > --
> > 2.28.0.681.g6f77f65b4e-goog
> >


* Re: [PATCH v9 3/9] ext4 / jbd2: add fast commit initialization
  2020-09-19  0:54 ` [PATCH v9 3/9] ext4 / jbd2: add fast commit initialization Harshad Shirwadkar
  2020-09-19 15:22   ` kernel test robot
@ 2020-10-09 16:10   ` Ritesh Harjani
  2020-10-13  0:28     ` harshad shirwadkar
  1 sibling, 1 reply; 29+ messages in thread
From: Ritesh Harjani @ 2020-10-09 16:10 UTC (permalink / raw)
  To: Harshad Shirwadkar, linux-ext4; +Cc: tytso


Sorry about the delay. A few comments below.

On 9/19/20 6:24 AM, Harshad Shirwadkar wrote:
> This patch adds fast commit area trackers in the journal_t
> structure. These are initialized via the jbd2_fc_init() routine that
> this patch adds. This patch also adds ext4/fast_commit.c and
> ext4/fast_commit.h files for fast commit code that will be added in
> subsequent patches in this series.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>   fs/ext4/Makefile      |  2 +-
>   fs/ext4/ext4.h        |  4 ++++
>   fs/ext4/fast_commit.c | 20 +++++++++++++++++
>   fs/ext4/fast_commit.h |  9 ++++++++
>   fs/ext4/super.c       |  1 +
>   fs/jbd2/journal.c     | 52 ++++++++++++++++++++++++++++++++++++++-----
>   include/linux/jbd2.h  | 39 ++++++++++++++++++++++++++++++++
>   7 files changed, 121 insertions(+), 6 deletions(-)
>   create mode 100644 fs/ext4/fast_commit.c
>   create mode 100644 fs/ext4/fast_commit.h
> 
> diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
> index 2e42f47a7f98..49e7af6cc93f 100644
> --- a/fs/ext4/Makefile
> +++ b/fs/ext4/Makefile
> @@ -10,7 +10,7 @@ ext4-y	:= balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
>   		indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
>   		mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \
>   		super.o symlink.o sysfs.o xattr.o xattr_hurd.o xattr_trusted.o \
> -		xattr_user.o
> +		xattr_user.o fast_commit.o
> 
>   ext4-$(CONFIG_EXT4_FS_POSIX_ACL)	+= acl.o
>   ext4-$(CONFIG_EXT4_FS_SECURITY)		+= xattr_security.o
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 82e889d5c2ed..9af3971dd12e 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -964,6 +964,7 @@ do {									       \
>   #endif /* defined(__KERNEL__) || defined(__linux__) */
> 
>   #include "extents_status.h"
> +#include "fast_commit.h"
> 
>   /*
>    * Lock subclasses for i_data_sem in the ext4_inode_info structure.
> @@ -2679,6 +2680,9 @@ extern int ext4_init_inode_table(struct super_block *sb,
>   				 ext4_group_t group, int barrier);
>   extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
> 
> +/* fast_commit.c */
> +
> +void ext4_fc_init(struct super_block *sb, journal_t *journal);
>   /* mballoc.c */
>   extern const struct seq_operations ext4_mb_seq_groups_ops;
>   extern long ext4_mb_stats;
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> new file mode 100644
> index 000000000000..0dad8bdb1253
> --- /dev/null
> +++ b/fs/ext4/fast_commit.c
> @@ -0,0 +1,20 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * fs/ext4/fast_commit.c
> + *
> + * Written by Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> + *
> + * Ext4 fast commits routines.
> + */
> +#include "ext4_jbd2.h"
> +
> +void ext4_fc_init(struct super_block *sb, journal_t *journal)
> +{
> +	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
> +		return;
> +	if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
> +		pr_warn("Error while enabling fast commits, turning off.");
> +		ext4_clear_feature_fast_commit(sb);
> +	}
> +}
> diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
> new file mode 100644
> index 000000000000..8362bf5e6e00
> --- /dev/null
> +++ b/fs/ext4/fast_commit.h
> @@ -0,0 +1,9 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +#ifndef __FAST_COMMIT_H__
> +#define __FAST_COMMIT_H__
> +
> +/* Number of blocks in journal area to allocate for fast commits */
> +#define EXT4_NUM_FC_BLKS		256

Just wanted to understand how this value was determined.
Do you think it needs to be configurable?
I ask because on some platforms the block size could be 64K.
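
To put numbers on it: with a fixed count of 256 blocks, the reserved area
scales with the block size, so it is 1 MiB with 4K blocks but 16 MiB with
64K blocks. A small userspace sketch of the arithmetic (the names
NUM_FC_BLKS_SKETCH and fc_area_bytes are made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

#define NUM_FC_BLKS_SKETCH 256	/* same value as EXT4_NUM_FC_BLKS above */

/* Bytes of journal space reserved for fast commits at a given block size. */
static uint64_t fc_area_bytes(uint32_t block_size)
{
	/* Block sizes are expected to be non-zero powers of two. */
	assert(block_size != 0 && (block_size & (block_size - 1)) == 0);
	return (uint64_t)NUM_FC_BLKS_SKETCH * block_size;
}
```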

> +
> +#endif /* __FAST_COMMIT_H__ */
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b62858ee420b..94aaaf940449 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4962,6 +4962,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
>   	journal->j_commit_interval = sbi->s_commit_interval;
>   	journal->j_min_batch_time = sbi->s_min_batch_time;
>   	journal->j_max_batch_time = sbi->s_max_batch_time;
> +	ext4_fc_init(sb, journal);
> 
>   	write_lock(&journal->j_state_lock);
>   	if (test_opt(sb, BARRIER))
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 17fdc482f554..736a1736619f 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -1179,6 +1179,14 @@ static journal_t *journal_init_common(struct block_device *bdev,
>   	if (!journal->j_wbuf)
>   		goto err_cleanup;
> 
> +	if (journal->j_fc_wbufsize > 0) {
> +		journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> +					sizeof(struct buffer_head *),
> +					GFP_KERNEL);
> +		if (!journal->j_fc_wbuf)
> +			goto err_cleanup;
> +	}
> +
>   	bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
>   	if (!bh) {
>   		pr_err("%s: Cannot get buffer for journal superblock\n",
> @@ -1192,11 +1200,22 @@ static journal_t *journal_init_common(struct block_device *bdev,
> 
>   err_cleanup:
>   	kfree(journal->j_wbuf);
> +	kfree(journal->j_fc_wbuf);
>   	jbd2_journal_destroy_revoke(journal);
>   	kfree(journal);
>   	return NULL;
>   }
> 
> +int jbd2_fc_init(journal_t *journal, int num_fc_blks)
> +{
> +	journal->j_fc_wbufsize = num_fc_blks;
> +	journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> +				sizeof(struct buffer_head *), GFP_KERNEL);
> +	if (!journal->j_fc_wbuf)
> +		return -ENOMEM;
> +	return 0;
> +}
> +
>   /* jbd2_journal_init_dev and jbd2_journal_init_inode:
>    *
>    * Create a journal structure assigned some fixed set of disk blocks to
> @@ -1314,11 +1333,20 @@ static int journal_reset(journal_t *journal)
>   	}
> 
>   	journal->j_first = first;
> -	journal->j_last = last;
> 
> -	journal->j_head = first;
> -	journal->j_tail = first;
> -	journal->j_free = last - first;
> +	if (jbd2_has_feature_fast_commit(journal) &&
> +	    journal->j_fc_wbufsize > 0) {
> +		journal->j_last_fc = last;
> +		journal->j_last = last - journal->j_fc_wbufsize;
> +		journal->j_first_fc = journal->j_last + 1;
> +		journal->j_fc_off = 0;
> +	} else {
> +		journal->j_last = last;
> +	}
> +
> +	journal->j_head = journal->j_first;
> +	journal->j_tail = journal->j_first;
> +	journal->j_free = journal->j_last - journal->j_first;
> 
>   	journal->j_tail_sequence = journal->j_transaction_sequence;
>   	journal->j_commit_sequence = journal->j_transaction_sequence - 1;
> @@ -1663,9 +1691,18 @@ static int load_superblock(journal_t *journal)
>   	journal->j_tail_sequence = be32_to_cpu(sb->s_sequence);
>   	journal->j_tail = be32_to_cpu(sb->s_start);
>   	journal->j_first = be32_to_cpu(sb->s_first);
> -	journal->j_last = be32_to_cpu(sb->s_maxlen);
>   	journal->j_errno = be32_to_cpu(sb->s_errno);
> 
> +	if (jbd2_has_feature_fast_commit(journal) &&
> +	    journal->j_fc_wbufsize > 0) {
> +		journal->j_last_fc = be32_to_cpu(sb->s_maxlen);
> +		journal->j_last = journal->j_last_fc - journal->j_fc_wbufsize;
> +		journal->j_first_fc = journal->j_last + 1;
> +		journal->j_fc_off = 0;
> +	} else {
> +		journal->j_last = be32_to_cpu(sb->s_maxlen);
> +	}
> +
>   	return 0;
>   }
> 
> @@ -1726,6 +1763,9 @@ int jbd2_journal_load(journal_t *journal)
>   	 */
>   	journal->j_flags &= ~JBD2_ABORT;
> 
> +	if (journal->j_fc_wbufsize > 0)
> +		jbd2_journal_set_features(journal, 0, 0,
> +					  JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
>   	/* OK, we've finished with the dynamic journal bits:
>   	 * reinitialise the dynamic contents of the superblock in memory
>   	 * and reset them on disk. */
> @@ -1809,6 +1849,8 @@ int jbd2_journal_destroy(journal_t *journal)
>   		jbd2_journal_destroy_revoke(journal);
>   	if (journal->j_chksum_driver)
>   		crypto_free_shash(journal->j_chksum_driver);
> +	if (journal->j_fc_wbufsize > 0)
> +		kfree(journal->j_fc_wbuf);
>   	kfree(journal->j_wbuf);
>   	kfree(journal);
> 
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index f438257d7f31..36f65a818366 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -915,6 +915,30 @@ struct journal_s
>   	 */
>   	unsigned long		j_last;
> 
> +	/**
> +	 * @j_first_fc:
> +	 *
> +	 * The block number of the first fast commit block in the journal
> +	 * [j_state_lock].
> +	 */
> +	unsigned long		j_first_fc;
> +
> +	/**
> +	 * @j_fc_off:
> +	 *
> +	 * Number of fast commit blocks currently allocated.
> +	 * [j_state_lock].
> +	 */
> +	unsigned long		j_fc_off;

I think choosing a single naming convention for the fast commit fields
would make grepping/searching much easier.
For example, everything could use the j_fc_* prefix.
If you agree, the other fast commit members of this structure
(e.g. j_first_fc and j_last_fc) would have to be renamed accordingly.

-ritesh

> +
> +	/**
> +	 * @j_last_fc:
> +	 *
> +	 * The block number one beyond the last fast commit block in the journal
> +	 * [j_state_lock].
> +	 */
> +	unsigned long		j_last_fc;
> +
>   	/**
>   	 * @j_dev: Device where we store the journal.
>   	 */
> @@ -1065,6 +1089,12 @@ struct journal_s
>   	 */
>   	struct buffer_head	**j_wbuf;
> 
> +	/**
> +	 * @j_fc_wbuf: Array of fast commit bhs for
> +	 * jbd2_journal_commit_transaction.
> +	 */
> +	struct buffer_head	**j_fc_wbuf;
> +
>   	/**
>   	 * @j_wbufsize:
>   	 *
> @@ -1072,6 +1102,13 @@ struct journal_s
>   	 */
>   	int			j_wbufsize;
> 
> +	/**
> +	 * @j_fc_wbufsize:
> +	 *
> +	 * Size of @j_fc_wbuf array.
> +	 */
> +	int			j_fc_wbufsize;
> +
>   	/**
>   	 * @j_last_sync_writer:
>   	 *
> @@ -1507,6 +1544,8 @@ void __jbd2_log_wait_for_space(journal_t *journal);
>   extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *);
>   extern int jbd2_cleanup_journal_tail(journal_t *);
> 
> +/* Fast commit related APIs */
> +int jbd2_fc_init(journal_t *journal, int num_fc_blks);
>   /*
>    * is_journal_abort
>    *
> 


* Re: [PATCH v9 4/9] jbd2: add fast commit machinery
  2020-09-19  0:54 ` [PATCH v9 4/9] jbd2: add fast commit machinery Harshad Shirwadkar
@ 2020-10-09 16:16   ` Ritesh Harjani
  2020-10-13  0:27     ` harshad shirwadkar
  0 siblings, 1 reply; 29+ messages in thread
From: Ritesh Harjani @ 2020-10-09 16:16 UTC (permalink / raw)
  To: Harshad Shirwadkar, linux-ext4; +Cc: tytso



On 9/19/20 6:24 AM, Harshad Shirwadkar wrote:
> This patch implements following APIs in JBD2 to allow for fast
> commits:
> 
> jbd2_fc_start(): Start a new fast commit. This function waits for any
> existing fast commit or full commit to complete.
> 
> jbd2_fc_stop(): Stop fast commit. This function ends current fast
> commit and wakes up either the journal thread or the other fast commit
> waiting for current fast commit to complete.
> 
> jbd2_fc_stop_do_commit(): Stop fast commit and perform a full
> commit. This is same as above but also performs a full commit.
> 
> This patch also adds a cleanup handler in journal_t that is called
> after every full and fast commit.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>   fs/ext4/fast_commit.c |  8 ++++++
>   fs/jbd2/commit.c      | 19 ++++++++++++
>   fs/jbd2/journal.c     | 67 +++++++++++++++++++++++++++++++++++++++++++
>   include/linux/jbd2.h  | 21 ++++++++++++++
>   4 files changed, 115 insertions(+)
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 0dad8bdb1253..f2d11b4c6b62 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -8,11 +8,19 @@
>    * Ext4 fast commits routines.
>    */
>   #include "ext4_jbd2.h"
> +/*
> + * Fast commit cleanup routine. This is called after every fast commit and
> + * full commit. full is true if we are called after a full commit.
> + */
> +static void ext4_fc_cleanup(journal_t *journal, int full)
> +{
> +}
> 
>   void ext4_fc_init(struct super_block *sb, journal_t *journal)
>   {
>   	if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
>   		return;
> +	journal->j_fc_cleanup_callback = ext4_fc_cleanup;
>   	if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
>   		pr_warn("Error while enabling fast commits, turning off.");
>   		ext4_clear_feature_fast_commit(sb);
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index 6d2da8ad0e6f..ba35ecb18616 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -413,6 +413,20 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>   	J_ASSERT(journal->j_running_transaction != NULL);
>   	J_ASSERT(journal->j_committing_transaction == NULL);
> 
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;

Shouldn't we set this flag only after the while loop ends and before
releasing the write lock, like we do in jbd2_fc_start()?
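
To make the ordering I have in mind concrete, here is a userspace model
of it using pthreads. It is only a sketch: the mutex stands in for
j_state_lock, the condition variable for j_wait_fc, and the struct, flag
and function names are invented for the example.

```c
#include <assert.h>
#include <pthread.h>

#define FULL_COMMIT_ONGOING	0x1	/* stand-ins for the JBD2_* flags */
#define FAST_COMMIT_ONGOING	0x2

struct journal_sketch {
	pthread_mutex_t lock;	/* plays the role of j_state_lock */
	pthread_cond_t wait_fc;	/* plays the role of j_wait_fc */
	unsigned int flags;
};

/* Wait for any fast commit to finish, then mark the full commit. */
static void full_commit_begin(struct journal_sketch *j)
{
	pthread_mutex_lock(&j->lock);
	while (j->flags & FAST_COMMIT_ONGOING)
		pthread_cond_wait(&j->wait_fc, &j->lock);
	/* The flag is set only once the loop is done, still under the lock. */
	j->flags |= FULL_COMMIT_ONGOING;
	pthread_mutex_unlock(&j->lock);
}

/* Clear the flag and wake anyone waiting to start a commit. */
static void full_commit_end(struct journal_sketch *j)
{
	pthread_mutex_lock(&j->lock);
	assert(j->flags & FULL_COMMIT_ONGOING);
	j->flags &= ~FULL_COMMIT_ONGOING;
	pthread_mutex_unlock(&j->lock);
	pthread_cond_broadcast(&j->wait_fc);
}
```

One possible downside, as far as I can tell: with this ordering a new
fast commit is not turned away while the full commit is still waiting,
so setting the flag first (as the patch does) may have been intentional.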


> +	while (journal->j_flags & JBD2_FAST_COMMIT_ONGOING) {
> +		DEFINE_WAIT(wait);
> +
> +		prepare_to_wait(&journal->j_wait_fc, &wait,
> +				TASK_UNINTERRUPTIBLE);
> +		write_unlock(&journal->j_state_lock);
> +		schedule();
> +		write_lock(&journal->j_state_lock);
> +		finish_wait(&journal->j_wait_fc, &wait);
> +	}
> +	write_unlock(&journal->j_state_lock);
> +
>   	commit_transaction = journal->j_running_transaction;
> 
>   	trace_jbd2_start_commit(journal, commit_transaction);
> @@ -1119,12 +1133,16 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> 
>   	if (journal->j_commit_callback)
>   		journal->j_commit_callback(journal, commit_transaction);
> +	if (journal->j_fc_cleanup_callback)
> +		journal->j_fc_cleanup_callback(journal, 1);
> 
>   	trace_jbd2_end_commit(journal, commit_transaction);
>   	jbd_debug(1, "JBD2: commit %d complete, head %d\n",
>   		  journal->j_commit_sequence, journal->j_tail_sequence);
> 
>   	write_lock(&journal->j_state_lock);
> +	journal->j_flags &= ~JBD2_FULL_COMMIT_ONGOING;
> +	journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
>   	spin_lock(&journal->j_list_lock);
>   	commit_transaction->t_state = T_FINISHED;
>   	/* Check if the transaction can be dropped now that we are finished */
> @@ -1136,6 +1154,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>   	spin_unlock(&journal->j_list_lock);
>   	write_unlock(&journal->j_state_lock);
>   	wake_up(&journal->j_wait_done_commit);
> +	wake_up(&journal->j_wait_fc);
> 
>   	/*
>   	 * Calculate overall stats
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 736a1736619f..17a30a2c38f9 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -714,6 +714,72 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
>   	return err;
>   }
> 
> +/*
> + * Start a fast commit. If there's an ongoing fast or full commit wait for
> + * it to complete. Returns 0 if a new fast commit was started. Returns -EALREADY
> + * if a fast commit is not needed, either because there's an already a commit
> + * going on or this tid has already been committed. Returns -EINVAL if no jbd2
> + * commit has yet been performed.
> + */
> +int jbd2_fc_start(journal_t *journal, tid_t tid)
> +{
> +	/*
> +	 * Fast commits only allowed if at least one full commit has
> +	 * been processed.
> +	 */
> +	if (!journal->j_stats.ts_tid)
> +		return -EINVAL;
> +
> +	if (tid <= journal->j_commit_sequence)
> +		return -EALREADY;
> +
> +	write_lock(&journal->j_state_lock);
> +	if (journal->j_flags & JBD2_FULL_COMMIT_ONGOING ||
> +	    (journal->j_flags & JBD2_FAST_COMMIT_ONGOING)) {
> +		DEFINE_WAIT(wait);
> +
> +		prepare_to_wait(&journal->j_wait_fc, &wait,
> +				TASK_UNINTERRUPTIBLE);
> +		write_unlock(&journal->j_state_lock);
> +		schedule();
> +		finish_wait(&journal->j_wait_fc, &wait);
> +		return -EALREADY;
> +	}
> +	journal->j_flags |= JBD2_FAST_COMMIT_ONGOING;
> +	write_unlock(&journal->j_state_lock);
> +
> +	return 0;
> +}
> +
> +/*
> + * Stop a fast commit. If fallback is set, this function starts commit of
> + * TID tid before any other fast commit can start.
> + */
> +static int __jbd2_fc_stop(journal_t *journal, tid_t tid, bool fallback)
> +{
> +	if (journal->j_fc_cleanup_callback)
> +		journal->j_fc_cleanup_callback(journal, 0);
> +	write_lock(&journal->j_state_lock);
> +	journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
> +	if (fallback)
> +		journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
> +	write_unlock(&journal->j_state_lock);
> +	wake_up(&journal->j_wait_fc);
> +	if (fallback)
> +		return jbd2_complete_transaction(journal, tid);
> +	return 0;
> +}
> +
> +int jbd2_fc_stop(journal_t *journal)
> +{
> +	return __jbd2_fc_stop(journal, 0, 0);
> +}
> +
> +int jbd2_fc_stop_do_commit(journal_t *journal, tid_t tid)
> +{
> +	return __jbd2_fc_stop(journal, tid, 1);
> +}
> +
>   /* Return 1 when transaction with given tid has already committed. */
>   int jbd2_transaction_committed(journal_t *journal, tid_t tid)
>   {
> @@ -1140,6 +1206,7 @@ static journal_t *journal_init_common(struct block_device *bdev,
>   	init_waitqueue_head(&journal->j_wait_commit);
>   	init_waitqueue_head(&journal->j_wait_updates);
>   	init_waitqueue_head(&journal->j_wait_reserved);
> +	init_waitqueue_head(&journal->j_wait_fc);
>   	mutex_init(&journal->j_abort_mutex);
>   	mutex_init(&journal->j_barrier);
>   	mutex_init(&journal->j_checkpoint_mutex);
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index 36f65a818366..aad986a9f3ff 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -858,6 +858,13 @@ struct journal_s
>   	 */
>   	wait_queue_head_t	j_wait_reserved;
> 
> +	/**
> +	 * @j_wait_fc:
> +	 *
> +	 * Wait queue to wait for completion of async fast commits.
> +	 */
> +	wait_queue_head_t	j_wait_fc;

If we follow the naming convention, then j_fc_wait would be a more
consistent name.

> +
>   	/**
>   	 * @j_checkpoint_mutex:
>   	 *
> @@ -1208,6 +1215,15 @@ struct journal_s
>   	 */
>   	struct lockdep_map	j_trans_commit_map;
>   #endif
> +
> +	/**
> +	 * @j_fc_cleanup_callback:
> +	 *
> +	 * Clean-up after fast commit or full commit. JBD2 calls this function
> +	 * after every commit operation.
> +	 */
> +	void (*j_fc_cleanup_callback)(struct journal_s *journal, int);
> +
>   };
> 
>   #define jbd2_might_wait_for_commit(j) \
> @@ -1292,6 +1308,8 @@ JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit,	FAST_COMMIT)
>   #define JBD2_ABORT_ON_SYNCDATA_ERR	0x040	/* Abort the journal on file
>   						 * data write error in ordered
>   						 * mode */
> +#define JBD2_FAST_COMMIT_ONGOING	0x100	/* Fast commit is ongoing */
> +#define JBD2_FULL_COMMIT_ONGOING	0x200	/* Full commit is ongoing */
> 
>   /*
>    * Function declarations for the journaling transaction and buffer
> @@ -1546,6 +1564,9 @@ extern int jbd2_cleanup_journal_tail(journal_t *);
> 
>   /* Fast commit related APIs */
>   int jbd2_fc_init(journal_t *journal, int num_fc_blks);
> +int jbd2_fc_start(journal_t *journal, tid_t tid);
> +int jbd2_fc_stop(journal_t *journal);
> +int jbd2_fc_stop_do_commit(journal_t *journal, tid_t tid);
>   /*
>    * is_journal_abort
>    *
> 


* Re: [PATCH v9 5/9] ext4: main fast-commit commit path
  2020-09-19  0:54 ` [PATCH v9 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
  2020-09-19  8:19   ` kernel test robot
@ 2020-10-09 17:04   ` Ritesh Harjani
  2020-10-13  0:25     ` harshad shirwadkar
  2020-10-09 19:14   ` Theodore Y. Ts'o
  2 siblings, 1 reply; 29+ messages in thread
From: Ritesh Harjani @ 2020-10-09 17:04 UTC (permalink / raw)
  To: Harshad Shirwadkar, linux-ext4; +Cc: tytso



On 9/19/20 6:24 AM, Harshad Shirwadkar wrote:
> This patch adds main fast commit commit path handlers. The overall
> patch can be divided into two inter-related parts:
> 
> (A) Metadata updates tracking
> 
>      This part consists of helper functions to track changes that need
>      to be committed during a commit operation. These updates are
>      maintained by Ext4 in different in-memory queues. Following are
>      the APIs and their short description that are implemented in this
>      patch:
> 
>      - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat
>        operations
>      - ext4_fc_track_range() - Track changed logical block offsets
>        inodes
>      - ext4_fc_track_inode() - Track inodes
>      - ext4_fc_mark_ineligible() - Mark file system fast commit
>        ineligible()
>      - ext4_fc_start_update() / ext4_fc_stop_update() /
>        ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These
>        functions are useful for co-ordinating inode updates with
>        commits.
> 
> (B) Main commit Path
> 
>      This part consists of functions to convert updates tracked in
>      in-memory data structures into on-disk commits. Function
>      ext4_fc_commit() is the main entry point to commit path.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>   fs/ext4/acl.c               |    2 +
>   fs/ext4/ext4.h              |   61 ++
>   fs/ext4/extents.c           |   48 +-
>   fs/ext4/fast_commit.c       | 1209 +++++++++++++++++++++++++++++++++++
>   fs/ext4/fast_commit.h       |  111 ++++
>   fs/ext4/file.c              |   10 +-
>   fs/ext4/fsync.c             |    2 +-
>   fs/ext4/inode.c             |   41 +-
>   fs/ext4/ioctl.c             |   16 +-
>   fs/ext4/namei.c             |   36 +-
>   fs/ext4/super.c             |   31 +
>   fs/ext4/xattr.c             |    3 +
>   fs/jbd2/commit.c            |   42 ++
>   fs/jbd2/journal.c           |  119 +++-
>   include/linux/jbd2.h        |    6 +
>   include/trace/events/ext4.h |  172 +++++
>   16 files changed, 1882 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> index 76f634d185f1..68aaed48315f 100644
> --- a/fs/ext4/acl.c
> +++ b/fs/ext4/acl.c
> @@ -242,6 +242,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>   	handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
>   	if (IS_ERR(handle))
>   		return PTR_ERR(handle);
> +	ext4_fc_start_update(inode);
> 
>   	if ((type == ACL_TYPE_ACCESS) && acl) {
>   		error = posix_acl_update_mode(inode, &mode, &acl);
> @@ -259,6 +260,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>   	}
>   out_stop:
>   	ext4_journal_stop(handle);
> +	ext4_fc_stop_update(inode);
>   	if (error == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
>   		goto retry;
>   	return error;
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 9af3971dd12e..27d48d166e5d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1022,6 +1022,27 @@ struct ext4_inode_info {
> 
>   	struct list_head i_orphan;	/* unlinked but open inodes */
> 
> +	/* Fast commit related info */
> +
> +	struct list_head i_fc_list;	/*
> +					 * inodes that need fast commit
> +					 * protected by sbi->s_fc_lock.
> +					 */
> +
> +	/* Start of lblk range that needs to be committed in this fast commit */
> +	ext4_lblk_t i_fc_lblk_start;
> +
> +	/* End of lblk range that needs to be committed in this fast commit */
> +	ext4_lblk_t i_fc_lblk_len;
> +
> +	/* Number of ongoing updates on this inode */
> +	atomic_t  i_fc_updates;
> +
> +	/* Fast commit wait queue for this inode */
> +	wait_queue_head_t i_fc_wait;
> +
> +	struct mutex i_fc_lock;

Although the code does explain the use of i_fc_lock, a small comment
here explaining what this mutex protects would be helpful.

>   	/*
>   	 * i_disksize keeps track of what the inode size is ON DISK, not
>   	 * in memory.  During truncate, i_size is set to the new size by
> @@ -1142,6 +1163,10 @@ struct ext4_inode_info {
>   #define	EXT4_VALID_FS			0x0001	/* Unmounted cleanly */
>   #define	EXT4_ERROR_FS			0x0002	/* Errors detected */
>   #define	EXT4_ORPHAN_FS			0x0004	/* Orphans being recovered */
> +#define EXT4_FC_INELIGIBLE		0x0008	/* Fast commit ineligible */
> +#define EXT4_FC_COMMITTING		0x0010	/* File system underoing a fast
> +						 * commit.
> +						 */
> 
>   /*
>    * Misc. filesystem flags
> @@ -1614,6 +1639,25 @@ struct ext4_sb_info {
>   	/* Record the errseq of the backing block device */
>   	errseq_t s_bdev_wb_err;
>   	spinlock_t s_bdev_wb_lock;
> +
> +	/* Ext4 fast commit stuff */
> +	atomic_t s_fc_subtid;
> +	atomic_t s_fc_ineligible_updates;
> +	/*
> +	 * After commit starts, the main queue gets locked, and the further
> +	 * updates get added in the the staging queue
> +	 */
> +#define FC_Q_MAIN	0
> +#define FC_Q_STAGING	1
> +	struct list_head s_fc_q[2];	/* Inodes staged for fast commit
> +					 * that have data changes in them.
> +					 */
> +	struct list_head s_fc_dentry_q[2];	/* directory entry updates */
> +	int s_fc_bytes;

We don't need unsigned long for s_fc_bytes here, do we?


> +	spinlock_t s_fc_lock;

Please add a comment explaining what s_fc_lock protects.

> +	struct buffer_head *s_fc_bh;
> +	struct ext4_fc_stats s_fc_stats;
> +	u64 s_fc_avg_commit_time;
>   };


Also I really like this i_fc_** & s_fc_** & ext4_fc_**() convention :)


> 
>   static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> @@ -1724,6 +1768,7 @@ enum {
>   	EXT4_STATE_EXT_PRECACHED,	/* extents have been precached */
>   	EXT4_STATE_LUSTRE_EA_INODE,	/* Lustre-style ea_inode */
>   	EXT4_STATE_VERITY_IN_PROGRESS,	/* building fs-verity Merkle tree */
> +	EXT4_STATE_FC_COMMITTING,	/* Fast commit ongoing */
>   };
> 
>   #define EXT4_INODE_BIT_FNS(name, field, offset)				\
> @@ -2683,6 +2728,22 @@ extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
>   /* fast_commit.c */
> 
>   void ext4_fc_init(struct super_block *sb, journal_t *journal);
> +void ext4_fc_init_inode(struct inode *inode);
> +void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
> +			 ext4_lblk_t end);
> +void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry);
> +void ext4_fc_track_link(struct inode *inode, struct dentry *dentry);
> +void ext4_fc_track_create(struct inode *inode, struct dentry *dentry);
> +void ext4_fc_track_inode(struct inode *inode);
> +void ext4_fc_mark_ineligible(struct super_block *sb, int reason);
> +void ext4_fc_start_ineligible(struct super_block *sb, int reason);
> +void ext4_fc_stop_ineligible(struct super_block *sb);
> +void ext4_fc_start_update(struct inode *inode);
> +void ext4_fc_stop_update(struct inode *inode);
> +void ext4_fc_del(struct inode *inode);
> +int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
> +int __init ext4_fc_init_dentry_cache(void);
> +
>   /* mballoc.c */
>   extern const struct seq_operations ext4_mb_seq_groups_ops;
>   extern long ext4_mb_stats;
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index a0481582187a..8de236fedade 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -3723,6 +3723,7 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
>   	err = ext4_ext_dirty(handle, inode, path + path->p_depth);
>   out:
>   	ext4_ext_show_leaf(inode, path);
> +	ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
>   	return err;
>   }
> 
> @@ -3794,6 +3795,7 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
>   	if (*allocated > map->m_len)
>   		*allocated = map->m_len;
>   	map->m_len = *allocated;
> +	ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
>   	return 0;
>   }
> 
> @@ -4327,7 +4329,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
>   	map->m_len = ar.len;
>   	allocated = map->m_len;
>   	ext4_ext_show_leaf(inode, path);
> -
> +	ext4_fc_track_range(inode, map->m_lblk, map->m_len);

Shouldn't the last argument be map->m_lblk + map->m_len - 1?
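
As a worked example of the inclusive-end convention, here is a small
userspace sketch. It follows the merge arithmetic of the __track_range()
hunk quoted later in this patch, but lblk_t, struct tracked_range,
track_range() and range_last() are hypothetical names used only for
illustration, and the "update" argument of the real function is dropped.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ext4_lblk_t. */
typedef uint32_t lblk_t;

/* Stand-in for the i_fc_lblk_start / i_fc_lblk_len pair. */
struct tracked_range {
	lblk_t start;
	lblk_t len;
};

/* Convert (first block, block count) into an inclusive last block. */
lblk_t range_last(lblk_t lblk, lblk_t len)
{
	assert(len > 0);
	return lblk + len - 1;
}

/*
 * Fold the inclusive range [start, end] into the tracked range, using
 * the same min/max arithmetic as the quoted __track_range().
 */
void track_range(struct tracked_range *r, lblk_t start, lblk_t end)
{
	assert(end >= start);	/* an (lblk, len) pair would trip this */

	if (r->len > 0) {
		lblk_t old_end = r->start + r->len - 1;
		lblk_t new_start = start < r->start ? start : r->start;
		lblk_t new_end = end > old_end ? end : old_end;

		r->start = new_start;
		r->len = new_end - new_start + 1;
	} else {
		r->start = start;
		r->len = end - start + 1;
	}
}
```

With m_lblk = 100 and m_len = 8, the inclusive end is 107. Passing m_len
itself would hand the tracker end = 8, which is smaller than start, and
since ext4_lblk_t is unsigned the end - start + 1 computation would not
yield a meaningful length.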



>   out:
>   	ext4_ext_drop_refs(path);
>   	kfree(path);
> @@ -4600,7 +4602,8 @@ static long ext4_zero_range(struct file *file, loff_t offset,
>   	ret = ext4_mark_inode_dirty(handle, inode);
>   	if (unlikely(ret))
>   		goto out_handle;
> -
> +	ext4_fc_track_range(inode, offset >> inode->i_sb->s_blocksize_bits,
> +			(offset + len - 1) >> inode->i_sb->s_blocksize_bits);
>   	/* Zero out partial block at the edges of the range */
>   	ret = ext4_zero_partial_blocks(handle, inode, offset, len);
>   	if (ret >= 0)
> @@ -4648,23 +4651,34 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>   		     FALLOC_FL_COLLAPSE_RANGE | FALLOC_FL_ZERO_RANGE |
>   		     FALLOC_FL_INSERT_RANGE))
>   		return -EOPNOTSUPP;
> +	ext4_fc_track_range(inode, offset >> blkbits,
> +			(offset + len - 1) >> blkbits);
> 
> -	if (mode & FALLOC_FL_PUNCH_HOLE)
> -		return ext4_punch_hole(inode, offset, len);
> +	ext4_fc_start_update(inode);
> +
> +	if (mode & FALLOC_FL_PUNCH_HOLE) {
> +		ret = ext4_punch_hole(inode, offset, len);
> +		goto exit;
> +	}
> 
>   	ret = ext4_convert_inline_data(inode);
>   	if (ret)
> -		return ret;
> +		goto exit;
> 
> -	if (mode & FALLOC_FL_COLLAPSE_RANGE)
> -		return ext4_collapse_range(inode, offset, len);
> -
> -	if (mode & FALLOC_FL_INSERT_RANGE)
> -		return ext4_insert_range(inode, offset, len);
> +	if (mode & FALLOC_FL_COLLAPSE_RANGE) {
> +		ret = ext4_collapse_range(inode, offset, len);
> +		goto exit;
> +	}
> 
> -	if (mode & FALLOC_FL_ZERO_RANGE)
> -		return ext4_zero_range(file, offset, len, mode);
> +	if (mode & FALLOC_FL_INSERT_RANGE) {
> +		ret = ext4_insert_range(inode, offset, len);
> +		goto exit;
> +	}
> 
> +	if (mode & FALLOC_FL_ZERO_RANGE) {
> +		ret = ext4_zero_range(file, offset, len, mode);
> +		goto exit;
> +	}
>   	trace_ext4_fallocate_enter(inode, offset, len, mode);
>   	lblk = offset >> blkbits;
> 
> @@ -4698,12 +4712,14 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>   		goto out;
> 
>   	if (file->f_flags & O_SYNC && EXT4_SB(inode->i_sb)->s_journal) {
> -		ret = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
> -						EXT4_I(inode)->i_sync_tid);
> +		ret = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
> +					EXT4_I(inode)->i_sync_tid);
>   	}
>   out:
>   	inode_unlock(inode);
>   	trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
> +exit:
> +	ext4_fc_stop_update(inode);
>   	return ret;
>   }
> 
> @@ -5291,6 +5307,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
>   		ret = PTR_ERR(handle);
>   		goto out_mmap;
>   	}
> +	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE);
> 
>   	down_write(&EXT4_I(inode)->i_data_sem);
>   	ext4_discard_preallocations(inode, 0);
> @@ -5329,6 +5346,7 @@ static int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
> 
>   out_stop:
>   	ext4_journal_stop(handle);
> +	ext4_fc_stop_ineligible(sb);
>   out_mmap:
>   	up_write(&EXT4_I(inode)->i_mmap_sem);
>   out_mutex:
> @@ -5429,6 +5447,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
>   		ret = PTR_ERR(handle);
>   		goto out_mmap;
>   	}
> +	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_FALLOC_RANGE);
> 
>   	/* Expand file to avoid data loss if there is error while shifting */
>   	inode->i_size += len;
> @@ -5503,6 +5522,7 @@ static int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
> 
>   out_stop:
>   	ext4_journal_stop(handle);
> +	ext4_fc_stop_ineligible(sb);
>   out_mmap:
>   	up_write(&EXT4_I(inode)->i_mmap_sem);
>   out_mutex:
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index f2d11b4c6b62..1c3e5f39d643 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -7,13 +7,1211 @@
>    *
>    * Ext4 fast commits routines.
>    */
> +#include "ext4.h"
>   #include "ext4_jbd2.h"
> +#include "ext4_extents.h"
> +#include "mballoc.h"
> +
> +/*
> + * Ext4 Fast Commits
> + * -----------------
> + *
> + * Ext4 fast commits implement fine grained journalling for Ext4.
> + *
> + * Fast commits are organized as a log of tag-length-value (TLV) structs. (See
> + * struct ext4_fc_tl). Each TLV contains some delta that is replayed TLV by
> + * TLV during the recovery phase. For the scenarios for which we currently
> + * don't have replay code, fast commit falls back to full commits.
> + * Fast commits record delta in one of the following three categories.
> + *
> + * (A) Directory entry updates:
> + *
> + * - EXT4_FC_TAG_UNLINK		- records directory entry unlink
> + * - EXT4_FC_TAG_LINK		- records directory entry link
> + * - EXT4_FC_TAG_CREAT		- records inode and directory entry creation
> + *
> + * (B) File specific data range updates:
> + *
> + * - EXT4_FC_TAG_ADD_RANGE	- records addition of new blocks to an inode
> + * - EXT4_FC_TAG_DEL_RANGE	- recordd deletion of blocks from an inode

s/recordd/records


> + *
> + * (C) Inode metadata (mtime / ctime etc):
> + *
> + * - EXT4_FC_TAG_INODE_FULL	- record the inode that should be replayed as is
> + *				  during recovery. This tag is useful for
> + *				  setting up a new inode. This tag is necessary
> + *				  for CREAT tag.
> + * - EXT4_FC_TAG_INODE_PARTIAL	- record the inode that should only partially be
> + *				  be replayed during recovery. This tag is
> + *				  useful when the same inode has had updates
> + *				  recorded as ADD_RANGE or DEL_RANGE tags.

So we keep INODE_FULL only to track the inode metadata, and
INODE_PARTIAL to track any extent add/del range operations, right?
Is there anything else that I am missing?


> + *
> + * Commit Operation
> + * ----------------
> + * With fast commits, we maintain all the directory entry operations in the
> + * order in which they are issued in an in-memory queue. This queue is flushed
> + * to disk during the commit operation. We also maintain a list of inodes
> + * that need to be committed during a fast commit in another in memory queue of
> + * inodes. During the commit operation, we commit in the following order:
> + *
> + * [1] Lock inodes for any further data updates by setting COMMITTING state
> + * [2] Submit data buffers of all the inodes
> + * [3] Wait for [2] to complete
> + * [4] Commit all the directory entry updates in the fast commit space
> + * [5] Commit all the changed inode structures
> + * [6] Write tail tag (this tag ensures the atomicity, please read the following
> + *     section for more details).
> + * [7] Wait for [4], [5] and [6] to complete.
> + *
> + * All the inode updates must call ext4_fc_start_update() before starting an
> + * update. If such an ongoing update is present, fast commit waits for it to
> + * complete. The completion of such an update is marked by
> + * ext4_fc_stop_update().
> + *
> + * Fast Commit Ineligibility
> + * -------------------------
> + * Not all operations are supported by fast commits today (e.g extended
> + * attributes). Fast commit ineligiblity is marked by calling one of the
> + * two following functions:
> + *
> + * - ext4_fc_mark_ineligible(): This makes next fast commit operation to fall
> + *   back to full commit. This is useful in case of transient errors.
> + *
> + * - ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() - This makes all
> + *   the fast commits happening between ext4_fc_start_ineligible() and
> + *   ext4_fc_stop_ineligible() and one fast commit after the call to
> + *   ext4_fc_stop_ineligible() to fall back to full commits. It is important to
> + *   make one more fast commit to fall back to full commit after stop call so
> + *   that it guaranteed that the fast commit ineligible operation contained
> + *   within ext4_fc_start_ineligible() and ext4_fc_stop_ineligible() is
> + *   followed by at least 1 full commit.
> + *
> + * Atomicity of commits
> + * --------------------
> + * In order to gaurantee atomicity during the commit operation, fast commit
> + * uses "EXT4_FC_TAG_TAIL" tag that marks a fast commit as complete. Tail
> + * tag contains CRC of the contents and TID of the transaction after which
> + * this fast commit should be applied. Recovery code replays fast commit
> + * logs only if there's at least 1 valid tail present. For every fast commit
> + * operation, there is 1 tail. This means, we may end up with multiple tails
> + * in the fast commit space. Here's an example:
> + *
> + * - Create a new file A and remove existing file B
> + * - fsync()
> + * - Append contents to file A
> + * - Truncate file A
> + * - fsync()
> + *
> + * The fast commit space at the end of above operations would look like this:
> + *      [HEAD] [CREAT A] [UNLINK B] [TAIL] [ADD_RANGE A] [DEL_RANGE A] [TAIL]
> + *             |<---  Fast Commit 1   --->|<---      Fast Commit 2     ---->|
> + *
> + * Replay code should thus check for all the valid tails in the FC area.
> + *
> + * TODOs
> + * -----
> + * 1) Make fast commit atomic updates more fine grained. Today, a fast commit
> + *    eligible update must be protected within ext4_fc_start_update() and
> + *    ext4_fc_stop_update(). These routines are called at much higher
> + *    routines. This can be made more fine grained by combining with
> + *    ext4_journal_start().
> + *
> + * 2) Same above for ext4_fc_start_ineligible() and ext4_fc_stop_ineligible()
> + *
> + * 3) Handle more ineligible cases.
> + */
> +
> +#include <trace/events/ext4.h>
> +static struct kmem_cache *ext4_fc_dentry_cachep;
> +
> +static void ext4_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
> +{
> +	BUFFER_TRACE(bh, "");
> +	if (uptodate) {
> +		ext4_debug("%s: Block %lld up-to-date",
> +			   __func__, bh->b_blocknr);
> +		set_buffer_uptodate(bh);
> +	} else {
> +		ext4_debug("%s: Block %lld not up-to-date",
> +			   __func__, bh->b_blocknr);
> +		clear_buffer_uptodate(bh);
> +	}
> +
> +	unlock_buffer(bh);
> +}
> +
> +static inline void ext4_fc_reset_inode(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	ei->i_fc_lblk_start = 0;
> +	ei->i_fc_lblk_len = 0;
> +}
> +
> +void ext4_fc_init_inode(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	ext4_fc_reset_inode(inode);
> +	ext4_clear_inode_state(inode, EXT4_STATE_FC_COMMITTING);
> +	INIT_LIST_HEAD(&ei->i_fc_list);
> +	init_waitqueue_head(&ei->i_fc_wait);
> +	atomic_set(&ei->i_fc_updates, 0);
> +}
> +
> +/*
> + * Inform Ext4's fast about start of an inode update
> + *
> + * This function is called by the high level call VFS callbacks before
> + * performing any inode update. This function blocks if there's an ongoing
> + * fast commit on the inode in question.
> + */
> +void ext4_fc_start_update(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return;
> +
> +restart:
> +	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +	if (list_empty(&EXT4_I(inode)->i_fc_list))

We can use ei->i_fc_list directly.



> +		goto out;
> +
> +	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> +		wait_queue_head_t *wq;
> +#if (BITS_PER_LONG < 64)
> +		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
> +				EXT4_STATE_FC_COMMITTING);
> +		wq = bit_waitqueue(&ei->i_state_flags,
> +				   EXT4_STATE_FC_COMMITTING);
> +#else
> +		DEFINE_WAIT_BIT(wait, &ei->i_flags,
> +				EXT4_STATE_FC_COMMITTING);
> +		wq = bit_waitqueue(&ei->i_flags,
> +				   EXT4_STATE_FC_COMMITTING);
> +#endif
> +		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> +		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +		schedule();
> +		finish_wait(wq, &wait.wq_entry);
> +		goto restart;
> +	}
> +out:
> +	atomic_inc(&ei->i_fc_updates);
> +	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +}
> +
> +/*
> + * Stop inode update and wake up waiting fast commits if any.
> + */
> +void ext4_fc_stop_update(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return;
> +
> +	if (atomic_dec_and_test(&ei->i_fc_updates))
> +		wake_up_all(&ei->i_fc_wait);
> +}
> +
> +/*
> + * Remove inode from fast commit list. If the inode is being committed
> + * we wait until inode commit is done.
> + */
> +void ext4_fc_del(struct inode *inode)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return;
> +
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return;


Redundant check, so let's remove this.



> +
> +restart:
> +	spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +	if (list_empty(&ei->i_fc_list)) {
> +		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +		return;
> +	}
> +
> +	if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> +		wait_queue_head_t *wq;
> +#if (BITS_PER_LONG < 64)
> +		DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
> +				EXT4_STATE_FC_COMMITTING);
> +		wq = bit_waitqueue(&ei->i_state_flags,
> +				   EXT4_STATE_FC_COMMITTING);
> +#else
> +		DEFINE_WAIT_BIT(wait, &ei->i_flags,
> +				EXT4_STATE_FC_COMMITTING);
> +		wq = bit_waitqueue(&ei->i_flags,
> +				   EXT4_STATE_FC_COMMITTING);
> +#endif
> +		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> +		spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +		schedule();
> +		finish_wait(wq, &wait.wq_entry);
> +		goto restart;
> +	}
> +	if (!list_empty(&ei->i_fc_list))
> +		list_del_init(&ei->i_fc_list);
> +	spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> +}
> +
> +/*
> + * Mark file system as fast commit ineligible. This means that next commit
> + * operation would result in a full jbd2 commit.
> + */
> +void ext4_fc_mark_ineligible(struct super_block *sb, int reason)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> +	sbi->s_mount_state |= EXT4_FC_INELIGIBLE;
> +	WARN_ON(reason >= EXT4_FC_REASON_MAX);
> +	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
> +}
> +
> +/*
> + * Start a fast commit ineligible update. Any commits that happen while
> + * such an operation is in progress fall back to full commits.
> + */
> +void ext4_fc_start_ineligible(struct super_block *sb, int reason)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> +	WARN_ON(reason >= EXT4_FC_REASON_MAX);
> +	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
> +	atomic_inc(&sbi->s_fc_ineligible_updates);
> +}
> +
> +/*
> + * Stop a fast commit ineligible update. We set EXT4_FC_INELIGIBLE flag here
> + * to ensure that after stopping the ineligible update, at least one full
> + * commit takes place.
> + */
> +void ext4_fc_stop_ineligible(struct super_block *sb)
> +{
> +	EXT4_SB(sb)->s_mount_state |= EXT4_FC_INELIGIBLE;
> +	atomic_dec(&EXT4_SB(sb)->s_fc_ineligible_updates);
> +}
> +
> +static inline int ext4_fc_is_ineligible(struct super_block *sb)
> +{
> +	return (EXT4_SB(sb)->s_mount_state & EXT4_FC_INELIGIBLE) ||
> +		atomic_read(&EXT4_SB(sb)->s_fc_ineligible_updates);
> +}
> +
> +/*
> + * Generic fast commit tracking function. If this is the first time this we are
> + * called after a full commit, we initialize fast commit fields and then call
> + * __fc_track_fn() with update = 0. If we have already been called after a full
> + * commit, we pass update = 1. Based on that, the track function can determine
> + * if it needs to track a field for the first time or if it needs to just
> + * update the previously tracked value.
> + *
> + * If enqueue is set, this function enqueues the inode in fast commit list.
> + */
> +static int ext4_fc_track_template(
> +	struct inode *inode, int (*__fc_track_fn)(struct inode *, void *, bool),
> +	void *args, int enqueue)
> +{
> +	tid_t running_txn_tid;
> +	bool update = false;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +	int ret;
> +
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +		return -EOPNOTSUPP;
> +
> +	if (ext4_fc_is_ineligible(inode->i_sb))
> +		return -EINVAL;
> +
> +	running_txn_tid = sbi->s_journal ?
> +		sbi->s_journal->j_commit_sequence + 1 : 0;
> +
> +	mutex_lock(&ei->i_fc_lock);
> +	if (running_txn_tid == ei->i_sync_tid) {
> +		update = true;
> +	} else {
> +		ext4_fc_reset_inode(inode);
> +		ei->i_sync_tid = running_txn_tid;
> +	}
> +	ret = __fc_track_fn(inode, args, update);
> +	mutex_unlock(&ei->i_fc_lock);
> +
> +	if (!enqueue)
> +		return ret;
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	if (list_empty(&EXT4_I(inode)->i_fc_list))
> +		list_add_tail(&EXT4_I(inode)->i_fc_list,
> +				(sbi->s_mount_state & EXT4_FC_COMMITTING) ?
> +				&sbi->s_fc_q[FC_Q_STAGING] :
> +				&sbi->s_fc_q[FC_Q_MAIN]);
> +	spin_unlock(&sbi->s_fc_lock);
> +
> +	return ret;
> +}
> +
> +struct __track_dentry_update_args {
> +	struct dentry *dentry;
> +	int op;
> +};
> +
> +/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */
> +static int __track_dentry_update(struct inode *inode, void *arg, bool update)
> +{
> +	struct ext4_fc_dentry_update *node;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct __track_dentry_update_args *dentry_update =
> +		(struct __track_dentry_update_args *)arg;
> +	struct dentry *dentry = dentry_update->dentry;
> +	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> +
> +	mutex_unlock(&ei->i_fc_lock);
> +	node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS);
> +	if (!node) {
> +		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_MEM);
> +		mutex_lock(&ei->i_fc_lock);
> +		return -ENOMEM;
> +	}
> +
> +	node->fcd_op = dentry_update->op;
> +	node->fcd_parent = dentry->d_parent->d_inode->i_ino;
> +	node->fcd_ino = inode->i_ino;
> +	if (dentry->d_name.len > DNAME_INLINE_LEN) {
> +		node->fcd_name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);


Shouldn't this be GFP_NOFS too?



> +		if (!node->fcd_name.name) {
> +			kmem_cache_free(ext4_fc_dentry_cachep, node);
> +			ext4_fc_mark_ineligible(inode->i_sb,
> +				EXT4_FC_REASON_MEM);
> +			mutex_lock(&ei->i_fc_lock);
> +			return -ENOMEM;
> +		}
> +		memcpy((u8 *)node->fcd_name.name, dentry->d_name.name,
> +			dentry->d_name.len);
> +	} else {
> +		memcpy(node->fcd_iname, dentry->d_name.name,
> +			dentry->d_name.len);
> +		node->fcd_name.name = node->fcd_iname;
> +	}
> +	node->fcd_name.len = dentry->d_name.len;
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	if (sbi->s_mount_state & EXT4_FC_COMMITTING)
> +		list_add_tail(&node->fcd_list,
> +				&sbi->s_fc_dentry_q[FC_Q_STAGING]);
> +	else
> +		list_add_tail(&node->fcd_list, &sbi->s_fc_dentry_q[FC_Q_MAIN]);
> +	spin_unlock(&sbi->s_fc_lock);
> +	mutex_lock(&ei->i_fc_lock);
> +
> +	return 0;
> +}
> +
> +void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry)
> +{
> +	struct __track_dentry_update_args args;
> +	int ret;
> +
> +	args.dentry = dentry;
> +	args.op = EXT4_FC_TAG_UNLINK;
> +
> +	ret = ext4_fc_track_template(inode, __track_dentry_update,
> +					(void *)&args, 0);
> +	trace_ext4_fc_track_unlink(inode, dentry, ret);
> +}
> +
> +void ext4_fc_track_link(struct inode *inode, struct dentry *dentry)
> +{
> +	struct __track_dentry_update_args args;
> +	int ret;
> +
> +	args.dentry = dentry;
> +	args.op = EXT4_FC_TAG_LINK;
> +
> +	ret = ext4_fc_track_template(inode, __track_dentry_update,
> +					(void *)&args, 0);
> +	trace_ext4_fc_track_link(inode, dentry, ret);
> +}
> +
> +void ext4_fc_track_create(struct inode *inode, struct dentry *dentry)
> +{
> +	struct __track_dentry_update_args args;
> +	int ret;
> +
> +	args.dentry = dentry;
> +	args.op = EXT4_FC_TAG_CREAT;
> +
> +	ret = ext4_fc_track_template(inode, __track_dentry_update,
> +					(void *)&args, 0);
> +	trace_ext4_fc_track_create(inode, dentry, ret);
> +}
> +
> +/* __track_fn for inode tracking */
> +static int __track_inode(struct inode *inode, void *arg, bool update)
> +{
> +	if (update)
> +		return -EEXIST;
> +
> +	EXT4_I(inode)->i_fc_lblk_len = 0;
> +
> +	return 0;
> +}
> +
> +void ext4_fc_track_inode(struct inode *inode)
> +{
> +	int ret;
> +
> +	if (S_ISDIR(inode->i_mode))
> +		return;
> +
> +	ret = ext4_fc_track_template(inode, __track_inode, NULL, 1);
> +	trace_ext4_fc_track_inode(inode, ret);
> +}
> +
> +struct __track_range_args {
> +	ext4_lblk_t start, end;
> +};
> +
> +/* __track_fn for tracking data updates */
> +static int __track_range(struct inode *inode, void *arg, bool update)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	ext4_lblk_t oldstart;
> +	struct __track_range_args *__arg =
> +		(struct __track_range_args *)arg;
> +
> +	if (inode->i_ino < EXT4_FIRST_INO(inode->i_sb)) {
> +		ext4_debug("Special inode %ld being modified\n", inode->i_ino);
> +		return -ECANCELED;
> +	}
> +
> +	oldstart = ei->i_fc_lblk_start;
> +
> +	if (update && ei->i_fc_lblk_len > 0) {
> +		ei->i_fc_lblk_start = min(ei->i_fc_lblk_start, __arg->start);
> +		ei->i_fc_lblk_len =
> +			max(oldstart + ei->i_fc_lblk_len - 1, __arg->end) -
> +				ei->i_fc_lblk_start + 1;
> +	} else {
> +		ei->i_fc_lblk_start = __arg->start;
> +		ei->i_fc_lblk_len = __arg->end - __arg->start + 1;
> +	}
> +
> +	return 0;
> +}
> +
> +void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
> +			 ext4_lblk_t end)
> +{
> +	struct __track_range_args args;
> +	int ret;
> +
> +	if (S_ISDIR(inode->i_mode))
> +		return;
> +
> +	args.start = start;
> +	args.end = end;
> +
> +	ret = ext4_fc_track_template(inode,  __track_range, &args, 1);
> +
> +	trace_ext4_fc_track_range(inode, start, end, ret);
> +}
> +
> +static void ext4_fc_submit_bh(struct super_block *sb)
> +{
> +	int write_flags = REQ_SYNC;
> +	struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
> +
> +	if (test_opt(sb, BARRIER))
> +		write_flags |= REQ_FUA | REQ_PREFLUSH;
> +	lock_buffer(bh);
> +	clear_buffer_dirty(bh);
> +	set_buffer_uptodate(bh);

Please don't mind me asking this, but shouldn't we clear the buffer's
dirty bit only after the IO is submitted, and set uptodate once the IO
has completed?
Maybe I am missing something here.


> +	bh->b_end_io = ext4_end_buffer_io_sync;
> +	submit_bh(REQ_OP_WRITE, write_flags, bh);
> +	EXT4_SB(sb)->s_fc_bh = NULL;
> +}
> +
> +/* Ext4 commit path routines */
> +
> +/* memzero and update CRC */
> +static void *ext4_fc_memzero(struct super_block *sb, void *dst, int len,
> +				u32 *crc)
> +{
> +	void *ret;
> +
> +	ret = memset(dst, 0, len);
> +	if (crc)
> +		*crc = ext4_chksum(EXT4_SB(sb), *crc, dst, len);
> +	return ret;
> +}
> +
> +/*
> + * Allocate len bytes on a fast commit buffer.
> + *
> + * During the commit time this function is used to manage fast commit
> + * block space. We don't split a fast commit log onto different
> + * blocks. So this function makes sure that if there's not enough space
> + * on the current block, the remaining space in the current block is
> + * marked as unused by adding EXT4_FC_TAG_PAD tag. In that case,
> + * new block is from jbd2 and CRC is updated to reflect the padding
> + * we added.
> + */
> +static u8 *ext4_fc_reserve_space(struct super_block *sb, int len, u32 *crc)
> +{
> +	struct ext4_fc_tl *tl;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct buffer_head *bh;
> +	int bsize = sbi->s_journal->j_blocksize;
> +	int ret, off = sbi->s_fc_bytes % bsize;
> +	int pad_len;
> +
> +	if (bsize - off - 1 > len + sizeof(struct ext4_fc_tl)) {
> +		/*
> +		 * Only allocate from current buffer if we have enough space for
> +		 * this request AND we have space to add a zero byte padding.
> +		 */
> +		if (!sbi->s_fc_bh) {
> +			ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
> +			if (ret)
> +				return NULL;
> +			sbi->s_fc_bh = bh;
> +		}
> +		sbi->s_fc_bytes += len;
> +		return sbi->s_fc_bh->b_data + off;
> +	}
> +	/* Need to add PAD tag */
> +	tl = (struct ext4_fc_tl *)(sbi->s_fc_bh->b_data + off);
> +	tl->fc_tag = cpu_to_le16(EXT4_FC_TAG_PAD);
> +	pad_len = bsize - off - 1 - sizeof(struct ext4_fc_tl);
> +	tl->fc_len = cpu_to_le16(pad_len);
> +	if (crc)
> +		*crc = ext4_chksum(sbi, *crc, tl, sizeof(*tl));
> +	if (pad_len > 0)
> +		ext4_fc_memzero(sb, tl + 1, pad_len, crc);
> +	ext4_fc_submit_bh(sb);
> +
> +	ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
> +	if (ret)
> +		return NULL;
> +	sbi->s_fc_bh = bh;
> +	sbi->s_fc_bytes = (sbi->s_fc_bytes / bsize + 1) * bsize + len;

Why do we need the +1 here?
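
To make the question concrete, here is my reading of the arithmetic as a
small userspace sketch. It is not kernel code, the helper name
fc_bytes_after_pad() is made up for illustration, and it assumes that
s_fc_bytes is a running byte count across all fast commit blocks:

```c
#include <assert.h>

/*
 * Hypothetical userspace model of the s_fc_bytes update that follows the
 * PAD tag in ext4_fc_reserve_space(). fc_bytes is the running byte count,
 * bsize is the journal block size, len is the size being reserved.
 */
static unsigned int fc_bytes_after_pad(unsigned int fc_bytes,
				       unsigned int bsize, unsigned int len)
{
	unsigned int cur_blk;

	assert(bsize > 0);

	/* Index of the block that was just padded and submitted. */
	cur_blk = fc_bytes / bsize;

	/* The +1 skips to the start of the next block; len is reserved there. */
	return (cur_blk + 1) * bsize + len;
}
```

With bsize = 4096, fc_bytes = 4000 and len = 200 this returns 4296, i.e.
offset 200 in the second block. Without the +1 it would return 200, which
points back into the block that was just padded. If that is the intent, a
short comment or round_up() might make it more obvious.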



> +	return sbi->s_fc_bh->b_data;
> +}
> +
> +/* memcpy to fc reserved space and update CRC */
> +static void *ext4_fc_memcpy(struct super_block *sb, void *dst, const void *src,
> +				int len, u32 *crc)
> +{
> +	if (crc)
> +		*crc = ext4_chksum(EXT4_SB(sb), *crc, src, len);
> +	return memcpy(dst, src, len);
> +}
> +
> +/*
> + * Complete a fast commit by writing tail tag.
> + *
> + * Writing tail tag marks the end of a fast commit. In order to guarantee
> + * atomicity, after writing tail tag, even if there's space remaining
> + * in the block, next commit shouldn't use it. That's why tail tag
> + * has the length as that of the remaining space on the block.
> + */
> +static int ext4_fc_write_tail(struct super_block *sb, u32 crc)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_fc_tl tl;
> +	struct ext4_fc_tail tail;
> +	int off, bsize = sbi->s_journal->j_blocksize;
> +	u8 *dst;
> +
> +	/*
> +	 * ext4_fc_reserve_space takes care of allocating an extra block if
> +	 * there's not enough space on this block for accommodating this tail.
> +	 */
> +	dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc);
> +	if (!dst)
> +		return -ENOSPC;
> +
> +	off = sbi->s_fc_bytes % bsize;
> +
> +	tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL);
> +	tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail));
> +	sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize);
> +
> +	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc);
> +	dst += sizeof(tl);
> +	tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid);
> +	ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc);
> +	dst += sizeof(tail.fc_tid);
> +	tail.fc_crc = cpu_to_le32(crc);
> +	ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL);
> +
> +	ext4_fc_submit_bh(sb);
> +
> +	return 0;
> +}
> +
> +/*
> + * Adds tag, length, value and updates CRC. Returns true if tlv was added.
> + * Returns false if there's not enough space.
> + */
> +static bool ext4_fc_add_tlv(struct super_block *sb, u16 tag, u16 len, u8 *val,
> +			   u32 *crc)
> +{
> +	struct ext4_fc_tl tl;
> +	u8 *dst;
> +
> +	dst = ext4_fc_reserve_space(sb, sizeof(tl) + len, crc);
> +	if (!dst)
> +		return false;
> +
> +	tl.fc_tag = cpu_to_le16(tag);
> +	tl.fc_len = cpu_to_le16(len);
> +
> +	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
> +	ext4_fc_memcpy(sb, dst + sizeof(tl), val, len, crc);
> +
> +	return true;
> +}
> +
> +/* Same as above, but adds dentry tlv. */
> +static  bool ext4_fc_add_dentry_tlv(struct super_block *sb, u16 tag,
> +					int parent_ino, int ino, int dlen,
> +					const unsigned char *dname,
> +					u32 *crc)
> +{
> +	struct ext4_fc_dentry_info fcd;
> +	struct ext4_fc_tl tl;
> +	u8 *dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(fcd) + dlen,
> +					crc);
> +
> +	if (!dst)
> +		return false;
> +
> +	fcd.fc_parent_ino = cpu_to_le32(parent_ino);
> +	fcd.fc_ino = cpu_to_le32(ino);
> +	tl.fc_tag = cpu_to_le16(tag);
> +	tl.fc_len = cpu_to_le16(sizeof(fcd) + dlen);
> +	ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
> +	dst += sizeof(tl);
> +	ext4_fc_memcpy(sb, dst, &fcd, sizeof(fcd), crc);
> +	dst += sizeof(fcd);
> +	ext4_fc_memcpy(sb, dst, dname, dlen, crc);
> +	dst += dlen;
> +
> +	return true;
> +}
> +
> +/*
> + * Writes inode in the fast commit space under TLV with tag @tag.
> + * Returns 0 on success, error on failure.
> + */
> +static int ext4_fc_write_inode(struct inode *inode, u32 *crc, int tag)
> +{
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
> +	int ret;
> +	struct ext4_iloc iloc;
> +	struct ext4_fc_inode fc_inode;
> +	struct ext4_fc_tl tl;
> +	u8 *dst;
> +
> +	ret = ext4_get_inode_loc(inode, &iloc);
> +	if (ret)
> +		return ret;
> +
> +	if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
> +		inode_len += ei->i_extra_isize;
> +
> +	fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
> +	tl.fc_tag = cpu_to_le16(tag);
> +	tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino));
> +
> +	dst = ext4_fc_reserve_space(inode->i_sb,
> +			sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc);
> +	if (!dst)
> +		return -ECANCELED;
> +
> +	if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc))
> +		return -ECANCELED;
> +	dst += sizeof(tl);
> +	if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc))
> +		return -ECANCELED;
> +	dst += sizeof(fc_inode);
> +	if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc),
> +					inode_len, crc))
> +		return -ECANCELED;
> +
> +	return 0;
> +}
> +
> +/*
> + * Writes updated data ranges for the inode in question. Updates CRC.
> + * Returns 0 on success, error otherwise.
> + */
> +static int ext4_fc_write_data(struct inode *inode, u32 *crc)
> +{
> +	ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	struct ext4_map_blocks map;
> +	struct ext4_fc_add_range fc_ext;
> +	struct ext4_fc_del_range lrange;
> +	struct ext4_extent *ex;
> +	int ret;
> +
> +	mutex_lock(&ei->i_fc_lock);
> +	if (ei->i_fc_lblk_len == 0) {
> +		mutex_unlock(&ei->i_fc_lock);
> +		return 0;
> +	}
> +	old_blk_size = ei->i_fc_lblk_start;
> +	new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> +	ei->i_fc_lblk_len = 0;
> +	mutex_unlock(&ei->i_fc_lock);
> +
> +	cur_lblk_off = old_blk_size;
> +	jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> +		  __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> +
> +	while (cur_lblk_off <= new_blk_size) {
> +		map.m_lblk = cur_lblk_off;
> +		map.m_len = new_blk_size - cur_lblk_off + 1;
> +		ret = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (ret < 0)
> +			return -ECANCELED;
> +
> +		if (map.m_len == 0) {
> +			cur_lblk_off++;
> +			continue;
> +		}
> +
> +		if (ret == 0) {
> +			lrange.fc_ino = cpu_to_le32(inode->i_ino);
> +			lrange.fc_lblk = cpu_to_le32(map.m_lblk);
> +			lrange.fc_len = cpu_to_le32(map.m_len);
> +			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
> +					    sizeof(lrange), (u8 *)&lrange, crc))
> +				return -ENOSPC;
> +		} else {
> +			fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
> +			ex = (struct ext4_extent *)&fc_ext.fc_ex;
> +			ex->ee_block = cpu_to_le32(map.m_lblk);
> +			ex->ee_len = cpu_to_le32(map.m_len);
> +			ext4_ext_store_pblock(ex, map.m_pblk);
> +			if (map.m_flags & EXT4_MAP_UNWRITTEN)
> +				ext4_ext_mark_unwritten(ex);
> +			else
> +				ext4_ext_mark_initialized(ex);
> +			if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
> +					    sizeof(fc_ext), (u8 *)&fc_ext, crc))
> +				return -ENOSPC;
> +		}
> +
> +		cur_lblk_off += map.m_len;
> +	}
> +
> +	return 0;
> +}
> +
> +/*
> + * Commit data inode. If tag == EXT4_FC_TAG_INODE_PARTIAL, we write the
> + * updated ranges for that inode before committing the inode itself.
> + * This ensures that during the replay inode->i_block[] is properly
> + * initialized before the partial replay of the inode.
> + */
> +static int ext4_fc_commit_inode(journal_t *journal, struct inode *inode,
> +				u32 *crc, int tag)
> +{
> +	int ret;
> +	int commit_full_inode =
> +		tag == EXT4_FC_TAG_INODE_FULL || ext4_has_inline_data(inode);
> +
> +	if (commit_full_inode) {
> +		ret = ext4_fc_write_inode(inode, crc, tag);
> +		if (ret < 0)
> +			return ret;
> +	}
> +
> +	ret = ext4_fc_write_data(inode, crc);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (!commit_full_inode)
> +		ret = ext4_fc_write_inode(inode, crc, tag);
> +
> +	return ret;
> +}
> +
> +/* Submit data for all the fast commit inodes */
> +static int ext4_fc_submit_inode_data_all(journal_t *journal)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_inode_info *ei;
> +	struct list_head *pos;
> +	int ret = 0;
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	sbi->s_mount_state |= EXT4_FC_COMMITTING;
> +	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
> +		ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
> +		ext4_set_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING);
> +		while (atomic_read(&ei->i_fc_updates)) {
> +			DEFINE_WAIT(wait);
> +
> +			prepare_to_wait(&ei->i_fc_wait, &wait,
> +						TASK_UNINTERRUPTIBLE);
> +			if (atomic_read(&ei->i_fc_updates)) {
> +				spin_unlock(&sbi->s_fc_lock);
> +				schedule();
> +				spin_lock(&sbi->s_fc_lock);
> +			}
> +			finish_wait(&ei->i_fc_wait, &wait);
> +		}
> +		spin_unlock(&sbi->s_fc_lock);
> +		ret = jbd2_submit_inode_data(journal, ei->jinode);
> +		if (ret)
> +			return ret;
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +
> +	return ret;
> +}
> +
> +/* Wait for completion of data for all the fast commit inodes */
> +static int ext4_fc_wait_inode_data_all(journal_t *journal)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_inode_info *pos, *n;
> +	int ret = 0;
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	list_for_each_entry_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
> +		if (!ext4_test_inode_state(&pos->vfs_inode,
> +					   EXT4_STATE_FC_COMMITTING))
> +			continue;
> +		spin_unlock(&sbi->s_fc_lock);
> +
> +		ret = jbd2_wait_inode_data(journal, pos->jinode);
> +		if (ret)
> +			return ret;
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +
> +	return 0;
> +}
> +
> +/* Commit all the directory entry updates */
> +static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_fc_dentry_update *fc_dentry;
> +	struct inode *inode;
> +	struct list_head *pos, *n, *fcd_pos, *fcd_n;
> +	struct ext4_inode_info *ei;
> +	int ret;
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	if (list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
> +		spin_unlock(&sbi->s_fc_lock);
> +		return 0;
> +	}
> +	list_for_each_safe(fcd_pos, fcd_n, &sbi->s_fc_dentry_q[FC_Q_MAIN]) {
> +		fc_dentry = list_entry(fcd_pos, struct ext4_fc_dentry_update,
> +					fcd_list);
> +		if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT) {
> +			spin_unlock(&sbi->s_fc_lock);
> +			if (!ext4_fc_add_dentry_tlv(
> +				sb, fc_dentry->fcd_op,
> +				fc_dentry->fcd_parent, fc_dentry->fcd_ino,
> +				fc_dentry->fcd_name.len,
> +				fc_dentry->fcd_name.name, crc)) {
> +				return -ENOSPC;
> +			}
> +			spin_lock(&sbi->s_fc_lock);
> +			continue;
> +		}
> +
> +		inode = NULL;
> +		list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
> +			ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
> +			if (ei->vfs_inode.i_ino == fc_dentry->fcd_ino) {
> +				inode = &ei->vfs_inode;
> +				break;
> +			}
> +		}
> +		/*
> +		 * If we don't find inode in our list, then it was deleted,
> +		 * in which case, we don't need to record its create tag.
> +		 */
> +		if (!inode)
> +			continue;
> +		spin_unlock(&sbi->s_fc_lock);
> +
> +		/*
> +		 * We first write the inode and then the create dirent. This
> +		 * allows the recovery code to create an unnamed inode first
> +		 * and then link it to a directory entry. This allows us
> +		 * to use namei.c routines almost as is and simplifies
> +		 * the recovery code.
> +		 */
> +		ret = ext4_fc_commit_inode(journal, inode, crc,
> +						EXT4_FC_TAG_INODE_FULL);
> +		if (ret)
> +			return ret;
> +
> +		if (!ext4_fc_add_dentry_tlv(
> +			sb, fc_dentry->fcd_op,
> +			fc_dentry->fcd_parent, fc_dentry->fcd_ino,
> +			fc_dentry->fcd_name.len,
> +			fc_dentry->fcd_name.name, crc))
> +			return -ENOSPC;
> +
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +	return 0;
> +}
> +
> +static int ext4_fc_perform_commit(journal_t *journal)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_inode_info *iter;
> +	struct ext4_fc_head head;
> +	struct list_head *pos;
> +	struct inode *inode;
> +	struct blk_plug plug;
> +	int ret = 0;
> +	u32 crc = 0;
> +
> +	ret = ext4_fc_submit_inode_data_all(journal);
> +	if (ret)
> +		return ret;
> +
> +	ret = ext4_fc_wait_inode_data_all(journal);
> +	if (ret)
> +		return ret;
> +
> +	blk_start_plug(&plug);
> +	if (sbi->s_fc_bytes == 0) {
> +		/*
> +		 * Add a head tag only if this is the first fast commit
> +		 * in this TID.
> +		 */
> +		head.fc_features = cpu_to_le32(EXT4_FC_SUPPORTED_FEATURES);
> +		head.fc_tid = cpu_to_le32(
> +			sbi->s_journal->j_running_transaction->t_tid);
> +		if (!ext4_fc_add_tlv(sb, EXT4_FC_TAG_HEAD, sizeof(head),
> +			(u8 *)&head, &crc))
> +			goto out;
> +	}
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	if (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {

This check looks redundant, since we check for list emptiness inside
ext4_fc_commit_dentry_updates() anyway, with s_fc_lock held.
Dropping it should simplify this code block; I guess we then only need
to call ext4_fc_commit_dentry_updates().
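
Roughly what I have in mind, as a userspace model rather than a real
patch. All of the names below (struct fc_model and the model_* helpers)
are invented for illustration; the int flag only stands in for
s_fc_lock so that the lock/unlock pairing can be checked with asserts:

```c
#include <assert.h>

/* Hypothetical userspace stand-in for the relevant ext4_sb_info state. */
struct fc_model {
	int locked;		/* models sbi->s_fc_lock being held */
	int nr_dentries;	/* models length of s_fc_dentry_q[FC_Q_MAIN] */
	int committed;		/* dentry updates written so far */
};

static void model_lock(struct fc_model *m)
{
	assert(!m->locked);
	m->locked = 1;
}

static void model_unlock(struct fc_model *m)
{
	assert(m->locked);
	m->locked = 0;
}

/* Models ext4_fc_commit_dentry_updates(): it checks emptiness itself. */
static int model_commit_dentry_updates(struct fc_model *m)
{
	int i;

	model_lock(m);
	if (m->nr_dentries == 0) {
		model_unlock(m);
		return 0;
	}
	for (i = 0; i < m->nr_dentries; i++)
		m->committed++;
	model_unlock(m);
	return 0;
}

/* Models the caller with the outer list_empty() check dropped. */
static int model_perform_commit(struct fc_model *m)
{
	int ret;

	ret = model_commit_dentry_updates(m);
	if (ret)
		return ret;

	model_lock(m);
	/* ... the walk over s_fc_q[FC_Q_MAIN] would go here ... */
	model_unlock(m);
	return 0;
}
```

The caller then takes the lock once, right before the inode list walk,
and the empty-queue case is handled in exactly one place.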

> +		spin_unlock(&sbi->s_fc_lock);
> +		ret = ext4_fc_commit_dentry_updates(journal, &crc);
> +		if (ret)
> +			goto out;
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +
> +	list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
> +		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
> +		inode = &iter->vfs_inode;
> +		if (!ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING))
> +			continue;
> +
> +		spin_unlock(&sbi->s_fc_lock);
> +		/*
> +		 * Given that we have already written data range updates for
> +		 * this inode, we don't want the recovery code to recreate
> +		 * the entire inode at this point. Also, the recovery code
> +		 * allocator may make slightly different decisions for
> +		 * extent tree blocks allocations, resulting in potentially
> +		 * different inode.i_blocks and inode.i_block[]. So, until
> +		 * we have identical allocator in recovery path, we should
> +		 * record only partial inode here.
> +		 */
> +		ret = ext4_fc_commit_inode(journal, inode, &crc,
> +						EXT4_FC_TAG_INODE_PARTIAL);
> +		if (ret)
> +			goto out;
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +
> +	ret = ext4_fc_write_tail(sb, crc);
> +
> +out:
> +	blk_finish_plug(&plug);
> +	return ret;
> +}
> +
> +/*
> + * The main commit entry point. Performs a fast commit for transaction
> + * commit_tid if needed. If it's not possible to perform a fast commit
> + * due to various reasons, we fall back to full commit. Returns 0
> + * on success, error otherwise.
> + */
> +int ext4_fc_commit(journal_t *journal, tid_t commit_tid)
> +{
> +	struct super_block *sb = (struct super_block *)(journal->j_private);
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	int nblks = 0, ret, bsize = journal->j_blocksize;
> +	int subtid = atomic_read(&sbi->s_fc_subtid);
> +	int reason = EXT4_FC_REASON_OK, fc_bufs_before = 0;
> +	ktime_t start_time, commit_time;
> +
> +	trace_ext4_fc_commit_start(sb);
> +
> +	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
> +		(ext4_fc_is_ineligible(sb))) {
> +		reason = EXT4_FC_REASON_INELIGIBLE;
> +		goto out;
> +	}
> +
> +	start_time = ktime_get();
> +restart_fc:
> +	ret = jbd2_fc_start(journal, commit_tid);
> +	if (ret == -EALREADY) {
> +		/* There was an ongoing commit, check if we need to restart */
> +		if (atomic_read(&sbi->s_fc_subtid) <= subtid &&
> +			commit_tid > journal->j_commit_sequence)
> +			goto restart_fc;
> +		reason = EXT4_FC_REASON_ALREADY_COMMITTED;
> +		goto out;
> +	} else if (ret) {
> +		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
> +		reason = EXT4_FC_REASON_FC_START_FAILED;
> +		goto out;
> +	}
> +
> +	fc_bufs_before = (sbi->s_fc_bytes + bsize - 1) / bsize;
> +	ret = ext4_fc_perform_commit(journal);
> +	if (ret < 0) {
> +		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
> +		reason = EXT4_FC_REASON_FC_FAILED;
> +		goto out;
> +	}
> +	nblks = (sbi->s_fc_bytes + bsize - 1) / bsize - fc_bufs_before;
> +	ret = jbd2_fc_wait_bufs(journal, nblks);
> +	if (ret < 0) {
> +		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
> +		reason = EXT4_FC_REASON_FC_FAILED;
> +		goto out;
> +	}
> +	atomic_inc(&sbi->s_fc_subtid);
> +	jbd2_fc_stop(journal);
> +out:
> +	/* Has any ineligible update happened since we started? */
> +	if (reason == EXT4_FC_REASON_OK && ext4_fc_is_ineligible(sb)) {
> +		sbi->s_fc_stats.fc_ineligible_reason_count[EXT4_FC_COMMIT_FAILED]++;
> +		reason = EXT4_FC_REASON_INELIGIBLE;
> +	}
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	if (reason != EXT4_FC_REASON_OK &&
> +		reason != EXT4_FC_REASON_ALREADY_COMMITTED) {
> +		sbi->s_fc_stats.fc_ineligible_commits++;
> +	} else {
> +		sbi->s_fc_stats.fc_num_commits++;
> +		sbi->s_fc_stats.fc_numblks += nblks;
> +	}
> +	spin_unlock(&sbi->s_fc_lock);
> +	nblks = (reason == EXT4_FC_REASON_OK) ? nblks : 0;
> +	trace_ext4_fc_commit_stop(sb, nblks, reason);
> +	commit_time = ktime_to_ns(ktime_sub(ktime_get(), start_time));
> +	/*
> +	 * weight the commit time higher than the average time so we don't
> +	 * react too strongly to vast changes in the commit time
> +	 */
> +	if (likely(sbi->s_fc_avg_commit_time))
> +		sbi->s_fc_avg_commit_time = (commit_time +
> +				sbi->s_fc_avg_commit_time * 3) / 4;
> +	else
> +		sbi->s_fc_avg_commit_time = commit_time;
> +	jbd_debug(1,
> +		"Fast commit ended with blks = %d, reason = %d, subtid - %d",
> +		nblks, reason, subtid);
> +	if (reason == EXT4_FC_REASON_FC_FAILED)
> +		return jbd2_fc_stop_do_commit(journal, commit_tid);
> +	if (reason == EXT4_FC_REASON_FC_START_FAILED ||
> +		reason == EXT4_FC_REASON_INELIGIBLE)
> +		return jbd2_complete_transaction(journal, commit_tid);
> +	return 0;
> +}
> +
>   /*
>    * Fast commit cleanup routine. This is called after every fast commit and
>    * full commit. full is true if we are called after a full commit.
>    */
>   static void ext4_fc_cleanup(journal_t *journal, int full)
>   {
> +	struct super_block *sb = journal->j_private;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_inode_info *iter;
> +	struct ext4_fc_dentry_update *fc_dentry;
> +	struct list_head *pos, *n;
> +
> +	if (full && sbi->s_fc_bh)
> +		sbi->s_fc_bh = NULL;
> +
> +	jbd2_fc_release_bufs(journal);
> +
> +	spin_lock(&sbi->s_fc_lock);
> +	list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
> +		iter = list_entry(pos, struct ext4_inode_info, i_fc_list);
> +		list_del_init(&iter->i_fc_list);
> +		ext4_clear_inode_state(&iter->vfs_inode,
> +				       EXT4_STATE_FC_COMMITTING);
> +		ext4_fc_reset_inode(&iter->vfs_inode);
> +		/* Make sure EXT4_STATE_FC_COMMITTING bit is clear */
> +		smp_mb();
> +#if (BITS_PER_LONG < 64)
> +		wake_up_bit(&iter->i_state_flags, EXT4_STATE_FC_COMMITTING);
> +#else
> +		wake_up_bit(&iter->i_flags, EXT4_STATE_FC_COMMITTING);
> +#endif
> +	}
> +
> +	while (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
> +		fc_dentry = list_first_entry(&sbi->s_fc_dentry_q[FC_Q_MAIN],
> +					     struct ext4_fc_dentry_update,
> +					     fcd_list);
> +		list_del_init(&fc_dentry->fcd_list);
> +		spin_unlock(&sbi->s_fc_lock);
> +
> +		if (fc_dentry->fcd_name.name &&
> +			fc_dentry->fcd_name.len > DNAME_INLINE_LEN)
> +			kfree(fc_dentry->fcd_name.name);
> +		kmem_cache_free(ext4_fc_dentry_cachep, fc_dentry);
> +		spin_lock(&sbi->s_fc_lock);
> +	}
> +
> +	list_splice_init(&sbi->s_fc_dentry_q[FC_Q_STAGING],
> +				&sbi->s_fc_dentry_q[FC_Q_MAIN]);
> +	list_splice_init(&sbi->s_fc_q[FC_Q_STAGING],
> +				&sbi->s_fc_q[FC_Q_STAGING]);
> +
> +	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
> +	sbi->s_mount_state &= ~EXT4_FC_INELIGIBLE;
> +
> +	if (full)
> +		sbi->s_fc_bytes = 0;
> +	spin_unlock(&sbi->s_fc_lock);
> +	trace_ext4_fc_stats(sb);
>   }
> 
>   void ext4_fc_init(struct super_block *sb, journal_t *journal)
> @@ -26,3 +1224,14 @@ void ext4_fc_init(struct super_block *sb, journal_t *journal)
>   		ext4_clear_feature_fast_commit(sb);
>   	}
>   }
> +
> +int __init ext4_fc_init_dentry_cache(void)
> +{
> +	ext4_fc_dentry_cachep = KMEM_CACHE(ext4_fc_dentry_update,
> +					   SLAB_RECLAIM_ACCOUNT);
> +
> +	if (ext4_fc_dentry_cachep == NULL)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
> index 8362bf5e6e00..a541d2bbe24b 100644
> --- a/fs/ext4/fast_commit.h
> +++ b/fs/ext4/fast_commit.h
> @@ -6,4 +6,115 @@
>   /* Number of blocks in journal area to allocate for fast commits */
>   #define EXT4_NUM_FC_BLKS		256
> 
> +/* Fast commit tags */
> +#define EXT4_FC_TAG_ADD_RANGE		0x0001
> +#define EXT4_FC_TAG_DEL_RANGE		0x0002
> +#define EXT4_FC_TAG_CREAT		0x0003
> +#define EXT4_FC_TAG_LINK		0x0004
> +#define EXT4_FC_TAG_UNLINK		0x0005
> +#define EXT4_FC_TAG_INODE_FULL		0x0006
> +#define EXT4_FC_TAG_INODE_PARTIAL	0x0007
> +#define EXT4_FC_TAG_PAD			0x0008
> +#define EXT4_FC_TAG_TAIL		0x0009
> +#define EXT4_FC_TAG_HEAD		0x000A
> +
> +#define EXT4_FC_SUPPORTED_FEATURES	0x0
> +
> +/* On disk fast commit tlv value structures */
> +
> +/* Fast commit on disk tag length structure */
> +struct ext4_fc_tl {
> +	__le16 fc_tag;
> +	__le16 fc_len;
> +};
> +
> +/* Value structure for tag EXT4_FC_TAG_HEAD. */
> +struct ext4_fc_head {
> +	__le32 fc_features;
> +	__le32 fc_tid;
> +};
> +
> +/* Value structure for EXT4_FC_TAG_ADD_RANGE. */
> +struct ext4_fc_add_range {
> +	__le32 fc_ino;
> +	__u8 fc_ex[12];
> +};
> +
> +/* Value structure for tag EXT4_FC_TAG_DEL_RANGE. */
> +struct ext4_fc_del_range {
> +	__le32 fc_ino;
> +	__le32 fc_lblk;
> +	__le32 fc_len;
> +};
> +
> +/*
> + * This is the value structure for tags EXT4_FC_TAG_CREAT, EXT4_FC_TAG_LINK
> + * and EXT4_FC_TAG_UNLINK.
> + */
> +struct ext4_fc_dentry_info {
> +	__le32 fc_parent_ino;
> +	__le32 fc_ino;
> +	u8 fc_dname[0];
> +};
> +
> +/* Value structure for EXT4_FC_TAG_INODE and EXT4_FC_TAG_INODE_PARTIAL. */
> +struct ext4_fc_inode {
> +	__le32 fc_ino;
> +	__u8 fc_raw_inode[0];
> +};
> +
> +/* Value structure for tag EXT4_FC_TAG_TAIL. */
> +struct ext4_fc_tail {
> +	__le32 fc_tid;
> +	__le32 fc_crc;
> +};
> +
> +/*
> + * In memory list of dentry updates that are performed on the file
> + * system used by fast commit code.
> + */
> +struct ext4_fc_dentry_update {
> +	int fcd_op;		/* Type of update create / unlink / link */
> +	int fcd_parent;		/* Parent inode number */
> +	int fcd_ino;		/* Inode number */
> +	struct qstr fcd_name;	/* Dirent name */
> +	unsigned char fcd_iname[DNAME_INLINE_LEN];	/* Dirent name string */
> +	struct list_head fcd_list;
> +};
> +
> +/*
> + * Fast commit reason codes
> + */
> +enum {
> +	/*
> +	 * Commit status codes:
> +	 */
> +	EXT4_FC_REASON_OK = 0,
> +	EXT4_FC_REASON_INELIGIBLE,
> +	EXT4_FC_REASON_ALREADY_COMMITTED,
> +	EXT4_FC_REASON_FC_START_FAILED,
> +	EXT4_FC_REASON_FC_FAILED,
> +
> +	/*
> +	 * Fast commit ineligibility reasons:
> +	 */
> +	EXT4_FC_REASON_XATTR = 0,
> +	EXT4_FC_REASON_CROSS_RENAME,
> +	EXT4_FC_REASON_JOURNAL_FLAG_CHANGE,
> +	EXT4_FC_REASON_MEM,
> +	EXT4_FC_REASON_SWAP_BOOT,
> +	EXT4_FC_REASON_RESIZE,
> +	EXT4_FC_REASON_RENAME_DIR,
> +	EXT4_FC_REASON_FALLOC_RANGE,
> +	EXT4_FC_COMMIT_FAILED,
> +	EXT4_FC_REASON_MAX
> +};
> +
> +struct ext4_fc_stats {
> +	int fc_ineligible_reason_count[EXT4_FC_REASON_MAX];
> +	int fc_num_commits;
> +	int fc_ineligible_commits;
> +	int fc_numblks;
> +};

I guess all of the above counters should be unsigned int or unsigned long,
right?
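
Something along these lines, sketched in userspace. The struct and
helper names are made up, FC_REASON_MAX is a stand-in for
EXT4_FC_REASON_MAX (9 with the enum as posted), and the exact widths are
only a suggestion:

```c
#include <limits.h>

#define FC_REASON_MAX 9	/* hypothetical stand-in for EXT4_FC_REASON_MAX */

/* Hypothetical variant of struct ext4_fc_stats with unsigned counters. */
struct fc_stats_model {
	unsigned int fc_ineligible_reason_count[FC_REASON_MAX];
	unsigned long fc_num_commits;
	unsigned long fc_ineligible_commits;
	unsigned long fc_numblks;
};

/* Account one successful fast commit that wrote nblks blocks. */
static void fc_stats_account(struct fc_stats_model *s, unsigned int nblks)
{
	s->fc_num_commits++;
	s->fc_numblks += nblks;	/* unsigned: wraps instead of overflowing */
}
```

These values only ever count up, and unsigned arithmetic has
well-defined wraparound in C, whereas overflowing a signed int is
undefined behaviour.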




> +
>   #endif /* __FAST_COMMIT_H__ */
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index e608ce3fb535..018f698caa71 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -260,6 +260,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>   	if (iocb->ki_flags & IOCB_NOWAIT)
>   		return -EOPNOTSUPP;
> 
> +	ext4_fc_start_update(inode);
>   	inode_lock(inode);
>   	ret = ext4_write_checks(iocb, from);
>   	if (ret <= 0)
> @@ -271,6 +272,7 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> 
>   out:
>   	inode_unlock(inode);
> +	ext4_fc_stop_update(inode);
>   	if (likely(ret > 0)) {
>   		iocb->ki_pos += ret;
>   		ret = generic_write_sync(iocb, ret);
> @@ -527,6 +529,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   	if (unaligned_io)
>   		inode_dio_wait(inode);
> 
> +	ext4_fc_start_update(inode);
>   	if (extend) {
>   		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>   		if (IS_ERR(handle)) {
> @@ -556,6 +559,7 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   		inode_unlock_shared(inode);
>   	else
>   		inode_unlock(inode);
> +	ext4_fc_stop_update(inode);
> 
>   	if (ret >= 0 && iov_iter_count(from)) {
>   		ssize_t err;
> @@ -654,8 +658,8 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>   #endif
>   	if (iocb->ki_flags & IOCB_DIRECT)
>   		return ext4_dio_write_iter(iocb, from);
> -
> -	return ext4_buffered_write_iter(iocb, from);
> +	else
> +		return ext4_buffered_write_iter(iocb, from);
>   }
> 
>   #ifdef CONFIG_FS_DAX
> @@ -755,6 +759,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>   	if (!daxdev_mapping_supported(vma, dax_dev))
>   		return -EOPNOTSUPP;
> 
> +	ext4_fc_start_update(inode);
>   	file_accessed(file);
>   	if (IS_DAX(file_inode(file))) {
>   		vma->vm_ops = &ext4_dax_vm_ops;
> @@ -762,6 +767,7 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>   	} else {
>   		vma->vm_ops = &ext4_file_vm_ops;
>   	}
> +	ext4_fc_stop_update(inode);
>   	return 0;
>   }
> 
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 1d668c8f131f..7c9fa35a01d1 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -112,7 +112,7 @@ static int ext4_fsync_journal(struct inode *inode, bool datasync,
>   	    !jbd2_trans_will_send_data_barrier(journal, commit_tid))
>   		*needs_barrier = true;
> 
> -	return jbd2_complete_transaction(journal, commit_tid);
> +	return ext4_fc_commit(journal, commit_tid);
>   }
> 
>   /*
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 77543f988258..26eed76812f9 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -729,6 +729,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>   			if (ret)
>   				return ret;
>   		}
> +		ext4_fc_track_range(inode, map->m_lblk,
> +			    map->m_lblk + map->m_len - 1);
>   	}
> 
>   	if (retval < 0)
> @@ -4073,6 +4075,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
> 
>   		up_write(&EXT4_I(inode)->i_data_sem);
>   	}
> +	ext4_fc_track_range(inode, first_block, stop_block);
>   	if (IS_SYNC(inode))
>   		ext4_handle_sync(handle);
> 
> @@ -4703,6 +4706,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>   	for (block = 0; block < EXT4_N_BLOCKS; block++)
>   		ei->i_data[block] = raw_inode->i_block[block];
>   	INIT_LIST_HEAD(&ei->i_orphan);
> +	ext4_fc_init_inode(&ei->vfs_inode);
> 
>   	/*
>   	 * Set transaction id's of transactions that have to be committed
> @@ -5149,7 +5153,7 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
>   		if (wbc->sync_mode != WB_SYNC_ALL || wbc->for_sync)
>   			return 0;
> 
> -		err = jbd2_complete_transaction(EXT4_SB(inode->i_sb)->s_journal,
> +		err = ext4_fc_commit(EXT4_SB(inode->i_sb)->s_journal,
>   						EXT4_I(inode)->i_sync_tid);
>   	} else {
>   		struct ext4_iloc iloc;
> @@ -5278,6 +5282,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>   		if (error)
>   			return error;
>   	}
> +	ext4_fc_start_update(inode);
>   	if ((ia_valid & ATTR_UID && !uid_eq(attr->ia_uid, inode->i_uid)) ||
>   	    (ia_valid & ATTR_GID && !gid_eq(attr->ia_gid, inode->i_gid))) {
>   		handle_t *handle;
> @@ -5301,6 +5306,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
> 
>   		if (error) {
>   			ext4_journal_stop(handle);
> +			ext4_fc_stop_update(inode);
>   			return error;
>   		}
>   		/* Update corresponding info in inode so that everything is in
> @@ -5323,11 +5329,15 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>   		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
>   			struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> 
> -			if (attr->ia_size > sbi->s_bitmap_maxbytes)
> +			if (attr->ia_size > sbi->s_bitmap_maxbytes) {
> +				ext4_fc_stop_update(inode);
>   				return -EFBIG;
> +			}
>   		}
> -		if (!S_ISREG(inode->i_mode))
> +		if (!S_ISREG(inode->i_mode)) {
> +			ext4_fc_stop_update(inode);
>   			return -EINVAL;
> +		}
> 
>   		if (IS_I_VERSION(inode) && attr->ia_size != inode->i_size)
>   			inode_inc_iversion(inode);
> @@ -5351,7 +5361,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>   		rc = ext4_break_layouts(inode);
>   		if (rc) {
>   			up_write(&EXT4_I(inode)->i_mmap_sem);
> -			return rc;
> +			goto err_out;
>   		}
> 
>   		if (attr->ia_size != inode->i_size) {
> @@ -5372,6 +5382,21 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>   				inode->i_mtime = current_time(inode);
>   				inode->i_ctime = inode->i_mtime;
>   			}
> +
> +			if (shrink)
> +				ext4_fc_track_range(inode,
> +					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> +					inode->i_sb->s_blocksize_bits,
> +					(oldsize > 0 ? oldsize - 1 : 0) >>
> +					inode->i_sb->s_blocksize_bits);
> +			else
> +				ext4_fc_track_range(
> +					inode,
> +					(oldsize > 0 ? oldsize - 1 : oldsize) >>
> +					inode->i_sb->s_blocksize_bits,
> +					(attr->ia_size > 0 ? attr->ia_size - 1 : 0) >>
> +					inode->i_sb->s_blocksize_bits);
> +
>   			down_write(&EXT4_I(inode)->i_data_sem);
>   			EXT4_I(inode)->i_disksize = attr->ia_size;
>   			rc = ext4_mark_inode_dirty(handle, inode);
> @@ -5430,9 +5455,11 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
>   		rc = posix_acl_chmod(inode, inode->i_mode);
> 
>   err_out:
> -	ext4_std_error(inode->i_sb, error);
> +	if  (error)
> +		ext4_std_error(inode->i_sb, error);
>   	if (!error)
>   		error = rc;
> +	ext4_fc_stop_update(inode);
>   	return error;
>   }
> 
> @@ -5614,6 +5641,8 @@ int ext4_mark_iloc_dirty(handle_t *handle,
>   		put_bh(iloc->bh);
>   		return -EIO;
>   	}
> +	ext4_fc_track_inode(inode);
> +
>   	if (IS_I_VERSION(inode))
>   		inode_inc_iversion(inode);
> 
> @@ -5937,6 +5966,8 @@ int ext4_change_inode_journal_flag(struct inode *inode, int val)
>   	if (IS_ERR(handle))
>   		return PTR_ERR(handle);
> 
> +	ext4_fc_mark_ineligible(inode->i_sb,
> +		EXT4_FC_REASON_JOURNAL_FLAG_CHANGE);
>   	err = ext4_mark_inode_dirty(handle, inode);
>   	ext4_handle_sync(handle);
>   	ext4_journal_stop(handle);
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 36eca3bc036a..d2f8f50deef6 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -165,6 +165,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
>   		err = -EINVAL;
>   		goto err_out;
>   	}
> +	ext4_fc_start_ineligible(sb, EXT4_FC_REASON_SWAP_BOOT);
> 
>   	/* Protect extent tree against block allocations via delalloc */
>   	ext4_double_down_write_data_sem(inode, inode_bl);
> @@ -247,6 +248,7 @@ static long swap_inode_boot_loader(struct super_block *sb,
> 
>   err_out1:
>   	ext4_journal_stop(handle);
> +	ext4_fc_stop_ineligible(sb);
>   	ext4_double_up_write_data_sem(inode, inode_bl);
> 
>   err_out:
> @@ -807,7 +809,7 @@ static int ext4_ioctl_get_es_cache(struct file *filp, unsigned long arg)
>   	return error;
>   }
> 
> -long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> +static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>   {
>   	struct inode *inode = file_inode(filp);
>   	struct super_block *sb = inode->i_sb;
> @@ -1074,6 +1076,7 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> 
>   		err = ext4_resize_fs(sb, n_blocks_count);
>   		if (EXT4_SB(sb)->s_journal) {
> +			ext4_fc_mark_ineligible(sb, EXT4_FC_REASON_RESIZE);
>   			jbd2_journal_lock_updates(EXT4_SB(sb)->s_journal);
>   			err2 = jbd2_journal_flush(EXT4_SB(sb)->s_journal);
>   			jbd2_journal_unlock_updates(EXT4_SB(sb)->s_journal);
> @@ -1308,6 +1311,17 @@ long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>   	}
>   }
> 
> +long ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> +{
> +	long ret;
> +
> +	ext4_fc_start_update(file_inode(filp));
> +	ret = __ext4_ioctl(filp, cmd, arg);
> +	ext4_fc_stop_update(file_inode(filp));
> +
> +	return ret;
> +}
> +
>   #ifdef CONFIG_COMPAT
>   long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>   {
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index 153a9fbe1dd0..563243f3b682 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -2611,7 +2611,7 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
>   		       bool excl)
>   {
>   	handle_t *handle;
> -	struct inode *inode;
> +	struct inode *inode, *inode_save;
>   	int err, credits, retries = 0;
> 
>   	err = dquot_initialize(dir);
> @@ -2629,7 +2629,11 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
>   		inode->i_op = &ext4_file_inode_operations;
>   		inode->i_fop = &ext4_file_operations;
>   		ext4_set_aops(inode);
> +		inode_save = inode;
> +		ihold(inode_save);
>   		err = ext4_add_nondir(handle, dentry, &inode);
> +		ext4_fc_track_create(inode_save, dentry);

Shouldn't we check for err before calling ext4_fc_track_create?
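
For illustration only, here is a small userspace model of the pattern the ext4_mknod() hunk further down already uses (track the create only when the add succeeded). The model_* names and the refcount/tracked fields are hypothetical stand-ins, not the real ext4 types or helpers:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for struct inode; not the real kernel type. */
struct model_inode {
	int refcount;	/* models the reference taken by ihold()/iput() */
	int tracked;	/* number of times the create was tracked */
};

/* Models ext4_add_nondir(): returns 0 on success or a negative errno. */
static int model_add_nondir(struct model_inode *inode, int fail)
{
	(void)inode;
	return fail ? -ENOSPC : 0;
}

/* Models ext4_fc_track_create(): records that the create was tracked. */
static void model_fc_track_create(struct model_inode *inode)
{
	inode->tracked++;
}

/* Same shape as the ext4_mknod() hunk: track only if the add succeeded. */
static int model_create(struct model_inode *inode, int fail)
{
	int err;

	inode->refcount++;		/* ihold(inode_save) */
	err = model_add_nondir(inode, fail);
	if (!err)
		model_fc_track_create(inode);
	inode->refcount--;		/* iput(inode_save) */
	return err;
}
```

In this model a failed add leaves nothing tracked, while the extra reference is dropped on both paths.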


> +		iput(inode_save);
>   	}
>   	if (handle)
>   		ext4_journal_stop(handle);
> @@ -2644,7 +2648,7 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
>   		      umode_t mode, dev_t rdev)
>   {
>   	handle_t *handle;
> -	struct inode *inode;
> +	struct inode *inode, *inode_save;
>   	int err, credits, retries = 0;
> 
>   	err = dquot_initialize(dir);
> @@ -2661,7 +2665,12 @@ static int ext4_mknod(struct inode *dir, struct dentry *dentry,
>   	if (!IS_ERR(inode)) {
>   		init_special_inode(inode, inode->i_mode, rdev);
>   		inode->i_op = &ext4_special_inode_operations;
> +		inode_save = inode;
> +		ihold(inode_save);
>   		err = ext4_add_nondir(handle, dentry, &inode);
> +		if (!err)
> +			ext4_fc_track_create(inode_save, dentry);
> +		iput(inode_save);
>   	}
>   	if (handle)
>   		ext4_journal_stop(handle);
> @@ -2825,6 +2834,7 @@ static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
>   		iput(inode);
>   		goto out_retry;
>   	}
> +	ext4_fc_track_create(inode, dentry);
>   	ext4_inc_count(handle, dir);
>   	ext4_update_dx_flag(dir);
>   	err = ext4_mark_inode_dirty(handle, dir);
> @@ -3165,6 +3175,7 @@ static int ext4_rmdir(struct inode *dir, struct dentry *dentry)
>   		goto end_rmdir;
>   	ext4_dec_count(handle, dir);
>   	ext4_update_dx_flag(dir);
> +	ext4_fc_track_unlink(inode, dentry);
>   	retval = ext4_mark_inode_dirty(handle, dir);
> 
>   #ifdef CONFIG_UNICODE
> @@ -3251,6 +3262,8 @@ static int ext4_unlink(struct inode *dir, struct dentry *dentry)
>   	inode->i_ctime = current_time(inode);
>   	retval = ext4_mark_inode_dirty(handle, inode);
> 
> +	if (!retval)
> +		ext4_fc_track_unlink(d_inode(dentry), dentry);
>   #ifdef CONFIG_UNICODE
>   	/* VFS negative dentries are incompatible with Encoding and
>   	 * Case-insensitiveness. Eventually we'll want avoid
> @@ -3872,6 +3885,22 @@ static int ext4_rename(struct inode *old_dir, struct dentry *old_dentry,
>   	retval = ext4_mark_inode_dirty(handle, old.dir);
>   	if (unlikely(retval))
>   		goto end_rename;
> +
> +	if (S_ISDIR(old.inode->i_mode)) {
> +		/*
> +		 * We disable fast commits here that's because the
> +		 * replay code is not yet capable of changing dot dot
> +		 * dirents in directories.
> +		 */
> +		ext4_fc_mark_ineligible(old.inode->i_sb,
> +			EXT4_FC_REASON_RENAME_DIR);
> +	} else {
> +		if (new.inode)
> +			ext4_fc_track_unlink(new.inode, new.dentry);
> +		ext4_fc_track_link(old.inode, new.dentry);
> +		ext4_fc_track_unlink(old.inode, old.dentry);
> +	}
> +
>   	if (new.inode) {
>   		retval = ext4_mark_inode_dirty(handle, new.inode);
>   		if (unlikely(retval))
> @@ -4015,7 +4044,8 @@ static int ext4_cross_rename(struct inode *old_dir, struct dentry *old_dentry,
>   	retval = ext4_mark_inode_dirty(handle, new.inode);
>   	if (unlikely(retval))
>   		goto end_rename;
> -
> +	ext4_fc_mark_ineligible(new.inode->i_sb,
> +				EXT4_FC_REASON_CROSS_RENAME);
>   	if (old.dir_bh) {
>   		retval = ext4_rename_dir_finish(handle, &old, new.dir->i_ino);
>   		if (retval)
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 94aaaf940449..2a959e2c7a91 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1149,6 +1149,8 @@ static struct inode *ext4_alloc_inode(struct super_block *sb)
>   	ei->i_datasync_tid = 0;
>   	atomic_set(&ei->i_unwritten, 0);
>   	INIT_WORK(&ei->i_rsv_conversion_work, ext4_end_io_rsv_work);
> +	ext4_fc_init_inode(&ei->vfs_inode);
> +	mutex_init(&ei->i_fc_lock);
>   	return &ei->vfs_inode;
>   }
> 
> @@ -1166,6 +1168,10 @@ static int ext4_drop_inode(struct inode *inode)
>   static void ext4_free_in_core_inode(struct inode *inode)
>   {
>   	fscrypt_free_inode(inode);
> +	if (!list_empty(&(EXT4_I(inode)->i_fc_list))) {
> +		pr_warn("%s: inode %ld still in fc list",
> +			__func__, inode->i_ino);
> +	}
>   	kmem_cache_free(ext4_inode_cachep, EXT4_I(inode));
>   }
> 
> @@ -1191,6 +1197,7 @@ static void init_once(void *foo)
>   	init_rwsem(&ei->i_data_sem);
>   	init_rwsem(&ei->i_mmap_sem);
>   	inode_init_once(&ei->vfs_inode);
> +	ext4_fc_init_inode(&ei->vfs_inode);
>   }
> 
>   static int __init init_inodecache(void)
> @@ -1219,6 +1226,7 @@ static void destroy_inodecache(void)
> 
>   void ext4_clear_inode(struct inode *inode)
>   {
> +	ext4_fc_del(inode);
>   	invalidate_inode_buffers(inode);
>   	clear_inode(inode);
>   	ext4_discard_preallocations(inode, 0);
> @@ -4540,6 +4548,19 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>   	INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
>   	mutex_init(&sbi->s_orphan_lock);
> 
> +	/* Initialize fast commit stuff */
> +	atomic_set(&sbi->s_fc_subtid, 0);
> +	atomic_set(&sbi->s_fc_ineligible_updates, 0);
> +	INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_MAIN]);
> +	INIT_LIST_HEAD(&sbi->s_fc_q[FC_Q_STAGING]);
> +	INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_MAIN]);
> +	INIT_LIST_HEAD(&sbi->s_fc_dentry_q[FC_Q_STAGING]);
> +	sbi->s_fc_bytes = 0;
> +	sbi->s_mount_state &= ~EXT4_FC_INELIGIBLE;
> +	sbi->s_mount_state &= ~EXT4_FC_COMMITTING;
> +	spin_lock_init(&sbi->s_fc_lock);
> +	memset(&sbi->s_fc_stats, 0, sizeof(sbi->s_fc_stats));
> +
>   	sb->s_root = NULL;
> 
>   	needs_recovery = (es->s_last_orphan != 0 ||
> @@ -6312,6 +6333,10 @@ static ssize_t ext4_quota_write(struct super_block *sb, int type,
>   	brelse(bh);
>   out:
>   	if (inode->i_size < off + len) {
> +		ext4_fc_track_range(inode,
> +			(inode->i_size > 0 ? inode->i_size - 1 : 0)
> +				>> inode->i_sb->s_blocksize_bits,
> +			(off + len) >> inode->i_sb->s_blocksize_bits);
>   		i_size_write(inode, off + len);
>   		EXT4_I(inode)->i_disksize = inode->i_size;
>   		err2 = ext4_mark_inode_dirty(handle, inode);
> @@ -6440,6 +6465,11 @@ static int __init ext4_init_fs(void)
>   	err = init_inodecache();
>   	if (err)
>   		goto out1;
> +
> +	err = ext4_fc_init_dentry_cache();
> +	if (err)
> +		goto out05;
> +
>   	register_as_ext3();
>   	register_as_ext2();
>   	err = register_filesystem(&ext4_fs_type);
> @@ -6450,6 +6480,7 @@ static int __init ext4_init_fs(void)
>   out:
>   	unregister_as_ext2();
>   	unregister_as_ext3();
> +out05:
>   	destroy_inodecache();
>   out1:
>   	ext4_exit_mballoc();
> diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
> index cba4b877c606..6127e94ea4f5 100644
> --- a/fs/ext4/xattr.c
> +++ b/fs/ext4/xattr.c
> @@ -2419,6 +2419,7 @@ ext4_xattr_set_handle(handle_t *handle, struct inode *inode, int name_index,
>   		if (IS_SYNC(inode))
>   			ext4_handle_sync(handle);
>   	}
> +	ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR);
> 
>   cleanup:
>   	brelse(is.iloc.bh);
> @@ -2496,6 +2497,7 @@ ext4_xattr_set(struct inode *inode, int name_index, const char *name,
>   		if (error == 0)
>   			error = error2;
>   	}
> +	ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR);
> 
>   	return error;
>   }
> @@ -2928,6 +2930,7 @@ int ext4_xattr_delete_inode(handle_t *handle, struct inode *inode,
>   					 error);
>   			goto cleanup;
>   		}
> +		ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_XATTR);
>   	}
>   	error = 0;
>   cleanup:
> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index ba35ecb18616..dadd9994e74b 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -202,6 +202,47 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
>   	return ret;
>   }
> 
> +/* Send all the data buffers related to an inode */
> +int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> +{
> +	struct address_space *mapping;
> +	loff_t dirty_start;
> +	loff_t dirty_end;
> +	int ret;
> +
> +	if (!jinode)
> +		return 0;
> +
> +	dirty_start = jinode->i_dirty_start;
> +	dirty_end = jinode->i_dirty_end;
> +
> +	if (!(jinode->i_flags & JI_WRITE_DATA))
> +		return 0;

This check can be folded into the !jinode check above.

> +
> +	dirty_start = jinode->i_dirty_start;
> +	dirty_end = jinode->i_dirty_end;

Redundant assignment: dirty_start and dirty_end were already set above.
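
Taken together, the two comments above could look roughly like the following. This is a userspace sketch under assumptions, not a replacement hunk: struct model_jinode, MODEL_JI_WRITE_DATA and model_submit_range() are hypothetical stand-ins for struct jbd2_inode, JI_WRITE_DATA and journal_submit_inode_data_buffers().

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_JI_WRITE_DATA 0x1	/* stand-in for JI_WRITE_DATA */

/* Hypothetical reduced version of struct jbd2_inode. */
struct model_jinode {
	unsigned int i_flags;
	long long i_dirty_start;
	long long i_dirty_end;
};

/* Records what the stubbed writeback call was asked to submit. */
struct model_submit {
	int calls;
	long long start;
	long long end;
};

/* Stub for the real submission helper; it only records the range. */
static int model_submit_range(struct model_submit *rec,
			      long long start, long long end)
{
	rec->calls++;
	rec->start = start;
	rec->end = end;
	return 0;
}

static int model_submit_inode_data(struct model_jinode *jinode,
				   struct model_submit *rec)
{
	/* One combined early return replaces the two separate checks. */
	if (!jinode || !(jinode->i_flags & MODEL_JI_WRITE_DATA))
		return 0;

	/* The dirty range is read once, right where it is used. */
	return model_submit_range(rec, jinode->i_dirty_start,
				  jinode->i_dirty_end);
}
```

Behaviour is unchanged in this model: a NULL jinode or one without the write-data flag submits nothing.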

> +
> +	mapping = jinode->i_vfs_inode->i_mapping;
> +
> +	trace_jbd2_submit_inode_data(jinode->i_vfs_inode);
> +	ret = journal_submit_inode_data_buffers(mapping, dirty_start,
> +						dirty_end);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(jbd2_submit_inode_data);
> +
> +int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> +{
> +	if (!jinode || !(jinode->i_flags & JI_WAIT_DATA) ||
> +		!jinode->i_vfs_inode || !jinode->i_vfs_inode->i_mapping)
> +		return 0;
> +	return filemap_fdatawait_range_keep_errors(
> +		jinode->i_vfs_inode->i_mapping, jinode->i_dirty_start,
> +		jinode->i_dirty_end);
> +}
> +EXPORT_SYMBOL(jbd2_wait_inode_data);
> +
>   /*
>    * Submit all the data buffers of inode associated with the transaction to
>    * disk.
> @@ -434,6 +475,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
>   			commit_transaction->t_tid);
> 
>   	write_lock(&journal->j_state_lock);
> +	journal->j_fc_off = 0;
>   	J_ASSERT(commit_transaction->t_state == T_RUNNING);
>   	commit_transaction->t_state = T_LOCKED;
> 
> diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> index 17a30a2c38f9..0661e00b46c0 100644
> --- a/fs/jbd2/journal.c
> +++ b/fs/jbd2/journal.c
> @@ -157,7 +157,9 @@ static void commit_timeout(struct timer_list *t)
>    *
>    * 1) COMMIT:  Every so often we need to commit the current state of the
>    *    filesystem to disk.  The journal thread is responsible for writing
> - *    all of the metadata buffers to disk.
> + *    all of the metadata buffers to disk. If a fast commit is ongoing
> + *    journal thread waits until it's done and then continues from
> + *    there on.
>    *
>    * 2) CHECKPOINT: We cannot reuse a used section of the log file until all
>    *    of the data in that part of the log has been rewritten elsewhere on
> @@ -848,6 +850,109 @@ int jbd2_journal_next_log_block(journal_t *journal, unsigned long long *retp)
>   	return jbd2_journal_bmap(journal, blocknr, retp);
>   }
> 
> +/* Map one fast commit buffer for use by the file system */
> +int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out)
> +{
> +	unsigned long long pblock;
> +	unsigned long blocknr;
> +	int ret = 0;
> +	struct buffer_head *bh;
> +	int fc_off;
> +
> +	*bh_out = NULL;
> +	write_lock(&journal->j_state_lock);
> +
> +	if (journal->j_fc_off + journal->j_first_fc < journal->j_last_fc) {
> +		fc_off = journal->j_fc_off;
> +		blocknr = journal->j_first_fc + fc_off;
> +		journal->j_fc_off++;


Not sure whether we need to roll back j_fc_off if this function
returns an error.

-ritesh
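
To make the question concrete, here is a minimal userspace model of the offset bookkeeping. It is only a sketch: struct model_journal and both helpers are hypothetical, and locking is left out (in the kernel both steps would presumably run under j_state_lock). The point it shows is that a plain decrement is safe only while the failed slot is still the most recently reserved one.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical reduced journal: only the fast commit offset fields. */
struct model_journal {
	int first_fc;	/* first block of the fast commit area */
	int last_fc;	/* one past the last usable block */
	int fc_off;	/* next free slot, relative to first_fc */
};

/* Reserve the next fast commit slot, or fail when the area is full. */
static int model_fc_reserve(struct model_journal *j, int *slot)
{
	if (j->fc_off + j->first_fc >= j->last_fc)
		return -EINVAL;
	*slot = j->fc_off++;
	return 0;
}

/*
 * Give a slot back after a later step (block mapping, buffer lookup)
 * failed. Returns 1 if the offset was rolled back, 0 if another caller
 * has already reserved a later slot and the offset had to stay put.
 */
static int model_fc_unreserve(struct model_journal *j, int slot)
{
	if (j->fc_off != slot + 1)
		return 0;
	j->fc_off = slot;
	return 1;
}
```

If the roll-back cannot happen, the slot is simply lost until the offset is reset at the next full commit, which may well be acceptable.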


> +	} else {
> +		ret = -EINVAL;
> +	}
> +	write_unlock(&journal->j_state_lock);
> +
> +	if (ret)
> +		return ret;
> +
> +	ret = jbd2_journal_bmap(journal, blocknr, &pblock);
> +	if (ret)
> +		return ret;
> +
> +	bh = __getblk(journal->j_dev, pblock, journal->j_blocksize);
> +	if (!bh)
> +		return -ENOMEM;
> +
> +	lock_buffer(bh);
> +
> +	clear_buffer_uptodate(bh);
> +	set_buffer_dirty(bh);
> +	unlock_buffer(bh);
> +	journal->j_fc_wbuf[fc_off] = bh;
> +
> +	*bh_out = bh;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(jbd2_fc_get_buf);
> +
> +/*
> + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
> + * for completion.
> + */
> +int jbd2_fc_wait_bufs(journal_t *journal, int num_blks)
> +{
> +	struct buffer_head *bh;
> +	int i, j_fc_off;
> +
> +	read_lock(&journal->j_state_lock);
> +	j_fc_off = journal->j_fc_off;
> +	read_unlock(&journal->j_state_lock);
> +
> +	/*
> +	 * Wait in reverse order to minimize chances of us being woken up before
> +	 * all IOs have completed
> +	 */
> +	for (i = j_fc_off - 1; i >= j_fc_off - num_blks; i--) {
> +		bh = journal->j_fc_wbuf[i];
> +		wait_on_buffer(bh);
> +		put_bh(bh);
> +		journal->j_fc_wbuf[i] = NULL;
> +		if (unlikely(!buffer_uptodate(bh)))
> +			return -EIO;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL(jbd2_fc_wait_bufs);
> +
> +/*
> + * Wait on fast commit buffers that were allocated by jbd2_fc_get_buf
> + * for completion.
> + */
> +int jbd2_fc_release_bufs(journal_t *journal)
> +{
> +	struct buffer_head *bh;
> +	int i, j_fc_off;
> +
> +	read_lock(&journal->j_state_lock);
> +	j_fc_off = journal->j_fc_off;
> +	read_unlock(&journal->j_state_lock);
> +
> +	/*
> +	 * Wait in reverse order to minimize chances of us being woken up before
> +	 * all IOs have completed
> +	 */
> +	for (i = j_fc_off - 1; i >= 0; i--) {
> +		bh = journal->j_fc_wbuf[i];
> +		if (!bh)
> +			break;
> +		put_bh(bh);
> +		journal->j_fc_wbuf[i] = NULL;
> +	}
> +
> +	return 0;
> +}
> +
>   /*
>    * Conversion of logical to physical block numbers for the journal
>    *
> @@ -1559,6 +1664,7 @@ int jbd2_journal_update_sb_log_tail(journal_t *journal, tid_t tail_tid,
>   static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
>   {
>   	journal_superblock_t *sb = journal->j_superblock;
> +	bool had_fast_commit = false;
> 
>   	BUG_ON(!mutex_is_locked(&journal->j_checkpoint_mutex));
>   	lock_buffer(journal->j_sb_buffer);
> @@ -1572,9 +1678,20 @@ static void jbd2_mark_journal_empty(journal_t *journal, int write_op)
> 
>   	sb->s_sequence = cpu_to_be32(journal->j_tail_sequence);
>   	sb->s_start    = cpu_to_be32(0);
> +	if (jbd2_has_feature_fast_commit(journal)) {
> +		/*
> +		 * When journal is clean, no need to commit fast commit flag and
> +		 * make file system incompatible with older kernels.
> +		 */
> +		jbd2_clear_feature_fast_commit(journal);
> +		had_fast_commit = true;
> +	}
> 
>   	jbd2_write_superblock(journal, write_op);
> 
> +	if (had_fast_commit)
> +		jbd2_set_feature_fast_commit(journal);
> +
>   	/* Log is no longer empty */
>   	write_lock(&journal->j_state_lock);
>   	journal->j_flags |= JBD2_FLUSHED;
> diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> index aad986a9f3ff..11c4ec967662 100644
> --- a/include/linux/jbd2.h
> +++ b/include/linux/jbd2.h
> @@ -1567,6 +1567,12 @@ int jbd2_fc_init(journal_t *journal, int num_fc_blks);
>   int jbd2_fc_start(journal_t *journal, tid_t tid);
>   int jbd2_fc_stop(journal_t *journal);
>   int jbd2_fc_stop_do_commit(journal_t *journal, tid_t tid);
> +int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out);
> +int jbd2_fc_wait_bufs(journal_t *journal, int num_blks);
> +int jbd2_fc_release_bufs(journal_t *journal);
> +int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode);
> +int jbd2_wait_inode_data(journal_t *journal, struct jbd2_inode *jinode);
> +
>   /*
>    * is_journal_abort
>    *
> diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
> index 4c8b99ec8606..c8cb43189f02 100644
> --- a/include/trace/events/ext4.h
> +++ b/include/trace/events/ext4.h
> @@ -95,6 +95,16 @@ TRACE_DEFINE_ENUM(ES_REFERENCED_B);
>   	{ FALLOC_FL_COLLAPSE_RANGE,	"COLLAPSE_RANGE"},	\
>   	{ FALLOC_FL_ZERO_RANGE,		"ZERO_RANGE"})
> 
> +#define show_fc_reason(reason)						\
> +	__print_symbolic(reason,					\
> +		{ EXT4_FC_REASON_XATTR,		"XATTR"},		\
> +		{ EXT4_FC_REASON_CROSS_RENAME,	"CROSS_RENAME"},	\
> +		{ EXT4_FC_REASON_JOURNAL_FLAG_CHANGE, "JOURNAL_FLAG_CHANGE"}, \
> +		{ EXT4_FC_REASON_MEM,	"NO_MEM"},			\
> +		{ EXT4_FC_REASON_SWAP_BOOT,	"SWAP_BOOT"},		\
> +		{ EXT4_FC_REASON_RESIZE,	"RESIZE"},		\
> +		{ EXT4_FC_REASON_RENAME_DIR,	"RENAME_DIR"},		\
> +		{ EXT4_FC_REASON_FALLOC_RANGE,	"FALLOC_RANGE"})
> 
>   TRACE_EVENT(ext4_other_inode_update_time,
>   	TP_PROTO(struct inode *inode, ino_t orig_ino),
> @@ -2791,6 +2801,168 @@ TRACE_EVENT(ext4_lazy_itable_init,
>   		  MAJOR(__entry->dev), MINOR(__entry->dev), __entry->group)
>   );
> 
> +TRACE_EVENT(ext4_fc_commit_start,
> +	TP_PROTO(struct super_block *sb),
> +
> +	TP_ARGS(sb),
> +
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev = sb->s_dev;
> +	),
> +
> +	TP_printk("fast_commit started on dev %d,%d",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev))
> +);
> +
> +TRACE_EVENT(ext4_fc_commit_stop,
> +	    TP_PROTO(struct super_block *sb, int nblks, int reason),
> +
> +	TP_ARGS(sb, nblks, reason),
> +
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(int, nblks)
> +		__field(int, reason)
> +		__field(int, num_fc)
> +		__field(int, num_fc_ineligible)
> +		__field(int, nblks_agg)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->dev = sb->s_dev;
> +		__entry->nblks = nblks;
> +		__entry->reason = reason;
> +		__entry->num_fc = EXT4_SB(sb)->s_fc_stats.fc_num_commits;
> +		__entry->num_fc_ineligible =
> +			EXT4_SB(sb)->s_fc_stats.fc_ineligible_commits;
> +		__entry->nblks_agg = EXT4_SB(sb)->s_fc_stats.fc_numblks;
> +	),
> +
> +	TP_printk("fc on [%d,%d] nblks %d, reason %d, fc = %d, ineligible = %d, agg_nblks %d",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->nblks, __entry->reason, __entry->num_fc,
> +		  __entry->num_fc_ineligible, __entry->nblks_agg)
> +);
> +
> +#define FC_REASON_NAME_STAT(reason)					\
> +	show_fc_reason(reason),						\
> +	__entry->sbi->s_fc_stats.fc_ineligible_reason_count[reason]
> +
> +TRACE_EVENT(ext4_fc_stats,
> +	    TP_PROTO(struct super_block *sb),
> +
> +	    TP_ARGS(sb),
> +
> +	    TP_STRUCT__entry(
> +		    __field(dev_t, dev)
> +		    __field(struct ext4_sb_info *, sbi)
> +		    __field(int, count)
> +		    ),
> +
> +	    TP_fast_assign(
> +		    __entry->dev = sb->s_dev;
> +		    __entry->sbi = EXT4_SB(sb);
> +		    ),
> +
> +	    TP_printk("dev %d:%d fc ineligible reasons:\n"
> +		      "%s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s:%d, %s,%d; "
> +		      "num_commits:%d, ineligible: %d, numblks: %d",
> +		      MAJOR(__entry->dev), MINOR(__entry->dev),
> +		      FC_REASON_NAME_STAT(EXT4_FC_REASON_XATTR),
> +		      FC_REASON_NAME_STAT(EXT4_FC_REASON_CROSS_RENAME),
> +		      FC_REASON_NAME_STAT(EXT4_FC_REASON_JOURNAL_FLAG_CHANGE),
> +		      FC_REASON_NAME_STAT(EXT4_FC_REASON_MEM),
> +		      FC_REASON_NAME_STAT(EXT4_FC_REASON_SWAP_BOOT),
> +		      FC_REASON_NAME_STAT(EXT4_FC_REASON_RESIZE),
> +		      FC_REASON_NAME_STAT(EXT4_FC_REASON_RENAME_DIR),
> +		      FC_REASON_NAME_STAT(EXT4_FC_REASON_FALLOC_RANGE),
> +		      __entry->sbi->s_fc_stats.fc_num_commits,
> +		      __entry->sbi->s_fc_stats.fc_ineligible_commits,
> +		      __entry->sbi->s_fc_stats.fc_numblks)
> +
> +);
> +
> +#define DEFINE_TRACE_DENTRY_EVENT(__type)				\
> +	TRACE_EVENT(ext4_fc_track_##__type,				\
> +	    TP_PROTO(struct inode *inode, struct dentry *dentry, int ret), \
> +									\
> +	    TP_ARGS(inode, dentry, ret),				\
> +									\
> +	    TP_STRUCT__entry(						\
> +		    __field(dev_t, dev)					\
> +		    __field(int, ino)					\
> +		    __field(int, error)					\
> +		    ),							\
> +									\
> +	    TP_fast_assign(						\
> +		    __entry->dev = inode->i_sb->s_dev;			\
> +		    __entry->ino = inode->i_ino;			\
> +		    __entry->error = ret;				\
> +		    ),							\
> +									\
> +	    TP_printk("dev %d:%d, inode %d, error %d, fc_%s",		\
> +		      MAJOR(__entry->dev), MINOR(__entry->dev),		\
> +		      __entry->ino, __entry->error,			\
> +		      #__type)						\
> +	)
> +
> +DEFINE_TRACE_DENTRY_EVENT(create);
> +DEFINE_TRACE_DENTRY_EVENT(link);
> +DEFINE_TRACE_DENTRY_EVENT(unlink);
> +
> +TRACE_EVENT(ext4_fc_track_inode,
> +	    TP_PROTO(struct inode *inode, int ret),
> +
> +	    TP_ARGS(inode, ret),
> +
> +	    TP_STRUCT__entry(
> +		    __field(dev_t, dev)
> +		    __field(int, ino)
> +		    __field(int, error)
> +		    ),
> +
> +	    TP_fast_assign(
> +		    __entry->dev = inode->i_sb->s_dev;
> +		    __entry->ino = inode->i_ino;
> +		    __entry->error = ret;
> +		    ),
> +
> +	    TP_printk("dev %d:%d, inode %d, error %d",
> +		      MAJOR(__entry->dev), MINOR(__entry->dev),
> +		      __entry->ino, __entry->error)
> +	);
> +
> +TRACE_EVENT(ext4_fc_track_range,
> +	    TP_PROTO(struct inode *inode, long start, long end, int ret),
> +
> +	    TP_ARGS(inode, start, end, ret),
> +
> +	    TP_STRUCT__entry(
> +		    __field(dev_t, dev)
> +		    __field(int, ino)
> +		    __field(long, start)
> +		    __field(long, end)
> +		    __field(int, error)
> +		    ),
> +
> +	    TP_fast_assign(
> +		    __entry->dev = inode->i_sb->s_dev;
> +		    __entry->ino = inode->i_ino;
> +		    __entry->start = start;
> +		    __entry->end = end;
> +		    __entry->error = ret;
> +		    ),
> +
> +	    TP_printk("dev %d:%d, inode %d, error %d, start %ld, end %ld",
> +		      MAJOR(__entry->dev), MINOR(__entry->dev),
> +		      __entry->ino, __entry->error, __entry->start,
> +		      __entry->end)
> +	);
> +
>   #endif /* _TRACE_EXT4_H */
> 
>   /* This part must be outside protection */
> 


* Re: [PATCH v9 7/9] ext4: fast commit recovery path
  2020-09-19  0:54 ` [PATCH v9 7/9] ext4: " Harshad Shirwadkar
  2020-09-19 14:15   ` kernel test robot
@ 2020-10-09 17:14   ` Ritesh Harjani
  2020-10-13  0:27     ` harshad shirwadkar
  1 sibling, 1 reply; 29+ messages in thread
From: Ritesh Harjani @ 2020-10-09 17:14 UTC (permalink / raw)
  To: Harshad Shirwadkar, linux-ext4; +Cc: tytso



On 9/19/20 6:24 AM, Harshad Shirwadkar wrote:
> This patch adds fast commit recovery path support for Ext4 file
> system. We add several helper functions that are similar in spirit to
> e2fsprogs journal recovery path handlers. Example of such functions
> include - a simple block allocator, idempotent block bitmap update
> function etc. Using these routines and the fast commit log in the fast
> commit area, the recovery path (ext4_fc_replay()) performs fast commit
> log recovery.
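
As an aside on what "idempotent" buys during replay: a bitmap update that sets bits to a given state, rather than toggling them, can be applied any number of times with the same result, so replaying the same fast commit log twice is harmless. A minimal userspace sketch follows. model_mark_bb() and model_count_used() are hypothetical names; the state argument only mirrors the shape of the ext4_mb_mark_bb() prototype added by this patch and says nothing about how that function is implemented.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Set (state != 0) or clear (state == 0) bits [start, start + len). */
static void model_mark_bb(unsigned char *bitmap, size_t start, size_t len,
			  int state)
{
	size_t i;

	for (i = start; i < start + len; i++) {
		unsigned char mask = (unsigned char)(1u << (i % CHAR_BIT));

		if (state)
			bitmap[i / CHAR_BIT] |= mask;
		else
			bitmap[i / CHAR_BIT] &= (unsigned char)~mask;
	}
}

/* Count set bits; stands in for accounting derived from the bitmap. */
static size_t model_count_used(const unsigned char *bitmap, size_t nbits)
{
	size_t i, used = 0;

	for (i = 0; i < nbits; i++)
		if (bitmap[i / CHAR_BIT] & (1u << (i % CHAR_BIT)))
			used++;
	return used;
}
```

Because counters are recomputed from the bitmap here instead of being adjusted per call, a repeated update cannot double count.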
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>   fs/ext4/balloc.c            |   7 +-
>   fs/ext4/ext4.h              |  26 ++
>   fs/ext4/ext4_jbd2.c         |   2 +-
>   fs/ext4/extents.c           | 261 +++++++++++
>   fs/ext4/extents_status.c    |  24 +
>   fs/ext4/fast_commit.c       | 881 +++++++++++++++++++++++++++++++++++-
>   fs/ext4/fast_commit.h       |  40 ++
>   fs/ext4/ialloc.c            | 165 ++++++-
>   fs/ext4/inode.c             |  89 ++--
>   fs/ext4/ioctl.c             |   6 +-
>   fs/ext4/mballoc.c           | 208 ++++++++-
>   fs/ext4/namei.c             | 149 +++---
>   fs/ext4/super.c             |  21 +
>   include/trace/events/ext4.h |  56 ++-
>   14 files changed, 1804 insertions(+), 131 deletions(-)
> 
> diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
> index 48c3df47748d..77108c99ae90 100644
> --- a/fs/ext4/balloc.c
> +++ b/fs/ext4/balloc.c
> @@ -368,7 +368,12 @@ static int ext4_validate_block_bitmap(struct super_block *sb,
>   				      struct buffer_head *bh)
>   {
>   	ext4_fsblk_t	blk;
> -	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
> +	struct ext4_group_info *grp;
> +
> +	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return 0;
> +
> +	grp = ext4_get_group_info(sb, block_group);
> 
>   	if (buffer_verified(bh))
>   		return 0;
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 27d48d166e5d..372a38292ed1 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1167,6 +1167,7 @@ struct ext4_inode_info {
>   #define EXT4_FC_COMMITTING		0x0010	/* File system underoing a fast
>   						 * commit.
>   						 */
> +#define EXT4_FC_REPLAY			0x0020	/* Fast commit replay ongoing */
> 
>   /*
>    * Misc. filesystem flags
> @@ -1658,6 +1659,10 @@ struct ext4_sb_info {
>   	struct buffer_head *s_fc_bh;
>   	struct ext4_fc_stats s_fc_stats;
>   	u64 s_fc_avg_commit_time;
> +#ifdef CONFIG_EXT4_DEBUG
> +	int s_fc_debug_max_replay;
> +#endif
> +	struct ext4_fc_replay_state s_fc_replay_state;
>   };
> 
>   static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> @@ -2700,6 +2705,7 @@ extern int ext4fs_dirhash(const struct inode *dir, const char *name, int len,
>   			  struct dx_hash_info *hinfo);
> 
>   /* ialloc.c */
> +extern int ext4_mark_inode_used(struct super_block *sb, int ino);
>   extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t,
>   				      const struct qstr *qstr, __u32 goal,
>   				      uid_t *owner, __u32 i_flags,
> @@ -2741,6 +2747,8 @@ void ext4_fc_stop_ineligible(struct super_block *sb);
>   void ext4_fc_start_update(struct inode *inode);
>   void ext4_fc_stop_update(struct inode *inode);
>   void ext4_fc_del(struct inode *inode);
> +bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t block);
> +void ext4_fc_replay_cleanup(struct super_block *sb);
>   int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
>   int __init ext4_fc_init_dentry_cache(void);
> 
> @@ -2773,8 +2781,12 @@ extern int ext4_group_add_blocks(handle_t *handle, struct super_block *sb,
>   				ext4_fsblk_t block, unsigned long count);
>   extern int ext4_trim_fs(struct super_block *, struct fstrim_range *);
>   extern void ext4_process_freed_data(struct super_block *sb, tid_t commit_tid);
> +extern void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
> +		       int len, int state);
> 
>   /* inode.c */
> +void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
> +			 struct ext4_inode_info *ei);
>   int ext4_inode_is_fast_symlink(struct inode *inode);
>   struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
>   struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
> @@ -2821,6 +2833,8 @@ extern int  ext4_sync_inode(handle_t *, struct inode *);
>   extern void ext4_dirty_inode(struct inode *, int);
>   extern int ext4_change_inode_journal_flag(struct inode *, int);
>   extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *);
> +extern int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
> +			  struct ext4_iloc *iloc);
>   extern int ext4_inode_attach_jinode(struct inode *inode);
>   extern int ext4_can_truncate(struct inode *inode);
>   extern int ext4_truncate(struct inode *);
> @@ -2854,12 +2868,15 @@ extern int ext4_ind_remove_space(handle_t *handle, struct inode *inode,
>   /* ioctl.c */
>   extern long ext4_ioctl(struct file *, unsigned int, unsigned long);
>   extern long ext4_compat_ioctl(struct file *, unsigned int, unsigned long);
> +extern void ext4_reset_inode_seed(struct inode *inode);
> 
>   /* migrate.c */
>   extern int ext4_ext_migrate(struct inode *);
>   extern int ext4_ind_migrate(struct inode *inode);
> 
>   /* namei.c */
> +extern int ext4_init_new_dir(handle_t *handle, struct inode *dir,
> +			     struct inode *inode);
>   extern int ext4_dirblock_csum_verify(struct inode *inode,
>   				     struct buffer_head *bh);
>   extern int ext4_orphan_add(handle_t *, struct inode *);
> @@ -3426,6 +3443,10 @@ extern int ext4_handle_dirty_dirblock(handle_t *handle, struct inode *inode,
>   extern int ext4_ci_compare(const struct inode *parent,
>   			   const struct qstr *fname,
>   			   const struct qstr *entry, bool quick);
> +extern int __ext4_unlink(struct inode *dir, const struct qstr *d_name,
> +			 struct inode *inode);
> +extern int __ext4_link(struct inode *dir, struct inode *inode,
> +		       struct dentry *dentry);
> 
>   #define S_SHIFT 12
>   static const unsigned char ext4_type_by_mode[(S_IFMT >> S_SHIFT) + 1] = {
> @@ -3526,6 +3547,11 @@ extern int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu);
>   extern int ext4_datasem_ensure_credits(handle_t *handle, struct inode *inode,
>   				       int check_cred, int restart_cred,
>   				       int revoke_cred);
> +extern void ext4_ext_replay_shrink_inode(struct inode *inode, ext4_lblk_t end);
> +extern int ext4_ext_replay_set_iblocks(struct inode *inode);
> +extern int ext4_ext_replay_update_ex(struct inode *inode, ext4_lblk_t start,
> +		int len, int unwritten, ext4_fsblk_t pblk);
> +extern int ext4_ext_clear_bb(struct inode *inode);
> 
> 
>   /* move_extent.c */
> diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
> index 760b9ee49dc0..0fd0c42a4f7d 100644
> --- a/fs/ext4/ext4_jbd2.c
> +++ b/fs/ext4/ext4_jbd2.c
> @@ -100,7 +100,7 @@ handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line,
>   		return ERR_PTR(err);
> 
>   	journal = EXT4_SB(sb)->s_journal;
> -	if (!journal)
> +	if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
>   		return ext4_get_nojournal();
>   	return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds,
>   				   GFP_NOFS, type, line);
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 8de236fedade..29945f1172fc 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -5804,3 +5804,264 @@ int ext4_clu_mapped(struct inode *inode, ext4_lblk_t lclu)
> 
>   	return err ? err : mapped;
>   }
> +
> +/*
> + * Updates physical block address and unwritten status of extent starting at
> + * lblk start and of len. If such an extent doesn't exist, this function
> + * splits the extent tree appropriately to create an extent like this.
> + * This function is called in Ext4 fast commit replay path. Returns 0 on success
> + * and error on failure.
> + */
> +int ext4_ext_replay_update_ex(struct inode *inode, ext4_lblk_t start,
> +		int len, int unwritten, ext4_fsblk_t pblk)
> +{
> +	struct ext4_ext_path *path = NULL, *ppath;
> +	struct ext4_extent *ex;
> +	int ret;
> +
> +	path = ext4_find_extent(inode, start, NULL, 0);
> +	if (!path)
> +		return -EINVAL;
> +	ex = path[path->p_depth].p_ext;
> +	if (!ex) {
> +		ret = -EFSCORRUPTED;
> +		goto out;
> +	}
> +
> +	if (le32_to_cpu(ex->ee_block) != start ||
> +		ext4_ext_get_actual_len(ex) != len) {
> +		/* We need to split this extent to match our extent first */
> +		ppath = path;
> +		down_write(&EXT4_I(inode)->i_data_sem);
> +		ret = ext4_force_split_extent_at(NULL, inode, &ppath, start, 1);
> +		up_write(&EXT4_I(inode)->i_data_sem);
> +		if (ret)
> +			goto out;
> +		kfree(path);
> +		path = ext4_find_extent(inode, start, NULL, 0);
> +		if (IS_ERR(path))
> +			return -1;
> +		ppath = path;
> +		ex = path[path->p_depth].p_ext;
> +		WARN_ON(le32_to_cpu(ex->ee_block) != start);
> +		if (ext4_ext_get_actual_len(ex) != len) {
> +			down_write(&EXT4_I(inode)->i_data_sem);
> +			ret = ext4_force_split_extent_at(NULL, inode, &ppath,
> +							 start + len, 1);
> +			up_write(&EXT4_I(inode)->i_data_sem);
> +			if (ret)
> +				goto out;
> +			kfree(path);
> +			path = ext4_find_extent(inode, start, NULL, 0);
> +			if (IS_ERR(path))
> +				return -EINVAL;
> +			ex = path[path->p_depth].p_ext;
> +		}
> +	}
> +	if (unwritten)
> +		ext4_ext_mark_unwritten(ex);
> +	else
> +		ext4_ext_mark_initialized(ex);
> +	ext4_ext_store_pblock(ex, pblk);
> +	down_write(&EXT4_I(inode)->i_data_sem);
> +	ret = ext4_ext_dirty(NULL, inode, &path[path->p_depth]);
> +	up_write(&EXT4_I(inode)->i_data_sem);
> +out:
> +	ext4_ext_drop_refs(path);
> +	kfree(path);
> +	ext4_mark_inode_dirty(NULL, inode);
> +	return ret;
> +}
> +
> +/* Try to shrink the extent tree */
> +void ext4_ext_replay_shrink_inode(struct inode *inode, ext4_lblk_t end)
> +{
> +	struct ext4_ext_path *path = NULL;
> +	struct ext4_extent *ex;
> +	ext4_lblk_t old_cur, cur = 0;
> +
> +	while (cur < end) {
> +		path = ext4_find_extent(inode, cur, NULL, 0);
> +		if (IS_ERR(path))
> +			return;
> +		ex = path[path->p_depth].p_ext;
> +		if (!ex) {
> +			ext4_ext_drop_refs(path);
> +			kfree(path);
> +			ext4_mark_inode_dirty(NULL, inode);
> +			return;
> +		}
> +		old_cur = cur;
> +		cur = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
> +		if (cur <= old_cur)
> +			cur = old_cur + 1;
> +		ext4_ext_try_to_merge(NULL, inode, path, ex);
> +		down_write(&EXT4_I(inode)->i_data_sem);
> +		ext4_ext_dirty(NULL, inode, &path[path->p_depth]);
> +		up_write(&EXT4_I(inode)->i_data_sem);
> +		ext4_mark_inode_dirty(NULL, inode);
> +		ext4_ext_drop_refs(path);
> +		kfree(path);
> +	}
> +}
> +
> +/* Check if *cur is a hole and if it is, skip it */
> +static void skip_hole(struct inode *inode, ext4_lblk_t *cur)
> +{
> +	int ret;
> +	struct ext4_map_blocks map;
> +
> +	map.m_lblk = *cur;
> +	map.m_len = ((inode->i_size) >> inode->i_sb->s_blocksize_bits) - *cur;
> +
> +	ret = ext4_map_blocks(NULL, inode, &map, 0);
> +	if (ret != 0)
> +		return;
> +	*cur = *cur + map.m_len;
> +}
> +
> +/* Count number of blocks used by this inode and update i_blocks */
> +int ext4_ext_replay_set_iblocks(struct inode *inode)
> +{
> +	struct ext4_ext_path *path = NULL, *path2 = NULL;
> +	struct ext4_extent *ex;
> +	ext4_lblk_t cur = 0, end;
> +	int numblks = 0, i, ret = 0;
> +	ext4_fsblk_t cmp1, cmp2;
> +	struct ext4_map_blocks map;
> +
> +	/* Determin the size of the file first */
> +	path = ext4_find_extent(inode, EXT_MAX_BLOCKS - 1, NULL,
> +					EXT4_EX_NOCACHE);
> +	if (IS_ERR(path))
> +		return PTR_ERR(path);
> +	ex = path[path->p_depth].p_ext;
> +	if (!ex) {
> +		ext4_ext_drop_refs(path);
> +		kfree(path);
> +		goto out;
> +	}
> +	end = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
> +	ext4_ext_drop_refs(path);
> +	kfree(path);
> +
> +	/* Count the number of data blocks */
> +	cur = 0;
> +	while (cur < end) {
> +		map.m_lblk = cur;
> +		map.m_len = end - cur;
> +		ret = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (ret < 0)
> +			break;
> +		if (ret > 0)
> +			numblks += ret;
> +		cur = cur + map.m_len;
> +	}
> +
> +	/*
> +	 * Count the number of extent tree blocks. We do it by looking up
> +	 * two successive extents and determining the difference between
> +	 * their paths. When path is different for 2 successive extents
> +	 * we compare the blocks in the path at each level and increment
> +	 * iblocks by total number of differences found.
> +	 */
> +	cur = 0;
> +	skip_hole(inode, &cur);
> +	path = ext4_find_extent(inode, cur, NULL, 0);
> +	if (IS_ERR(path))
> +		goto out;
> +	numblks += path->p_depth;
> +	ext4_ext_drop_refs(path);
> +	kfree(path);
> +	while (cur < end) {
> +		path = ext4_find_extent(inode, cur, NULL, 0);
> +		if (IS_ERR(path))
> +			break;
> +		ex = path[path->p_depth].p_ext;
> +		if (!ex) {
> +			ext4_ext_drop_refs(path);
> +			kfree(path);
> +			return 0;
> +		}
> +		cur = max(cur + 1, le32_to_cpu(ex->ee_block) +
> +					ext4_ext_get_actual_len(ex));
> +		skip_hole(inode, &cur);
> +
> +		path2 = ext4_find_extent(inode, cur, NULL, 0);
> +		if (IS_ERR(path2)) {
> +			ext4_ext_drop_refs(path);
> +			kfree(path);
> +			break;
> +		}
> +		ex = path2[path2->p_depth].p_ext;
> +		for (i = 0; i <= max(path->p_depth, path2->p_depth); i++) {
> +			cmp1 = cmp2 = 0;
> +			if (i <= path->p_depth)
> +				cmp1 = path[i].p_bh ?
> +					path[i].p_bh->b_blocknr : 0;
> +			if (i <= path2->p_depth)
> +				cmp2 = path2[i].p_bh ?
> +					path2[i].p_bh->b_blocknr : 0;
> +			if (cmp1 != cmp2 && cmp2 != 0)
> +				numblks++;
> +		}
> +		ext4_ext_drop_refs(path);
> +		ext4_ext_drop_refs(path2);
> +		kfree(path);
> +		kfree(path2);
> +	}
> +
> +out:
> +	inode->i_blocks = numblks << (inode->i_sb->s_blocksize_bits - 9);
> +	ext4_mark_inode_dirty(NULL, inode);
> +	return 0;
> +}
> +
> +int ext4_ext_clear_bb(struct inode *inode)
> +{
> +	struct ext4_ext_path *path = NULL;
> +	struct ext4_extent *ex;
> +	ext4_lblk_t cur = 0, end;
> +	int j, ret = 0;
> +	struct ext4_map_blocks map;
> +
> +	/* Determin the size of the file first */
> +	path = ext4_find_extent(inode, EXT_MAX_BLOCKS - 1, NULL,
> +					EXT4_EX_NOCACHE);
> +	if (IS_ERR(path))
> +		return PTR_ERR(path);
> +	ex = path[path->p_depth].p_ext;
> +	if (!ex) {
> +		ext4_ext_drop_refs(path);
> +		kfree(path);
> +		return 0;
> +	}
> +	end = le32_to_cpu(ex->ee_block) + ext4_ext_get_actual_len(ex);
> +	ext4_ext_drop_refs(path);
> +	kfree(path);
> +
> +	cur = 0;
> +	while (cur < end) {
> +		map.m_lblk = cur;
> +		map.m_len = end - cur;
> +		ret = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (ret < 0)
> +			break;
> +		if (ret > 0) {
> +			path = ext4_find_extent(inode, map.m_lblk, NULL, 0);
> +			if (!IS_ERR_OR_NULL(path)) {
> +				for (j = 0; j < path->p_depth; j++) {
> +
> +					ext4_mb_mark_bb(inode->i_sb,
> +							path[j].p_block, 1, 0);
> +				}
> +				ext4_ext_drop_refs(path);
> +				kfree(path);
> +			}
> +			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
> +		}
> +		cur = cur + map.m_len;
> +	}
> +
> +	return 0;
> +}
> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
> index e75171535375..0a729027322d 100644
> --- a/fs/ext4/extents_status.c
> +++ b/fs/ext4/extents_status.c
> @@ -311,6 +311,9 @@ void ext4_es_find_extent_range(struct inode *inode,
>   			       ext4_lblk_t lblk, ext4_lblk_t end,
>   			       struct extent_status *es)
>   {
> +	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return;
> +
>   	trace_ext4_es_find_extent_range_enter(inode, lblk);
> 
>   	read_lock(&EXT4_I(inode)->i_es_lock);
> @@ -361,6 +364,9 @@ bool ext4_es_scan_range(struct inode *inode,
>   {
>   	bool ret;
> 
> +	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return false;
> +
>   	read_lock(&EXT4_I(inode)->i_es_lock);
>   	ret = __es_scan_range(inode, matching_fn, lblk, end);
>   	read_unlock(&EXT4_I(inode)->i_es_lock);
> @@ -404,6 +410,9 @@ bool ext4_es_scan_clu(struct inode *inode,
>   {
>   	bool ret;
> 
> +	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return false;
> +
>   	read_lock(&EXT4_I(inode)->i_es_lock);
>   	ret = __es_scan_clu(inode, matching_fn, lblk);
>   	read_unlock(&EXT4_I(inode)->i_es_lock);
> @@ -812,6 +821,9 @@ int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>   	int err = 0;
>   	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> 
> +	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return 0;
> +
>   	es_debug("add [%u/%u) %llu %x to extent status tree of inode %lu\n",
>   		 lblk, len, pblk, status, inode->i_ino);
> 
> @@ -873,6 +885,9 @@ void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
>   	struct extent_status newes;
>   	ext4_lblk_t end = lblk + len - 1;
> 
> +	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return;
> +
>   	newes.es_lblk = lblk;
>   	newes.es_len = len;
>   	ext4_es_store_pblock_status(&newes, pblk, status);
> @@ -908,6 +923,9 @@ int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
>   	struct rb_node *node;
>   	int found = 0;
> 
> +	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return 0;
> +
>   	trace_ext4_es_lookup_extent_enter(inode, lblk);
>   	es_debug("lookup extent in block %u\n", lblk);
> 
> @@ -1419,6 +1437,9 @@ int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
>   	int err = 0;
>   	int reserved = 0;
> 
> +	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return 0;
> +
>   	trace_ext4_es_remove_extent(inode, lblk, len);
>   	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
>   		 lblk, len, inode->i_ino);
> @@ -1969,6 +1990,9 @@ int ext4_es_insert_delayed_block(struct inode *inode, ext4_lblk_t lblk,
>   	struct extent_status newes;
>   	int err = 0;
> 
> +	if (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return 0;
> +
>   	es_debug("add [%u/1) delayed to extent status tree of inode %lu\n",
>   		 lblk, inode->i_ino);
> 
> diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> index 6e251b5682b4..63429076ad59 100644
> --- a/fs/ext4/fast_commit.c
> +++ b/fs/ext4/fast_commit.c
> @@ -170,7 +170,8 @@ void ext4_fc_start_update(struct inode *inode)
>   {
>   	struct ext4_inode_info *ei = EXT4_I(inode);
> 
> -	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
> +	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
>   		return;
> 
>   restart:
> @@ -209,7 +210,8 @@ void ext4_fc_stop_update(struct inode *inode)
>   {
>   	struct ext4_inode_info *ei = EXT4_I(inode);
> 
> -	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
> +	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
>   		return;
> 
>   	if (atomic_dec_and_test(&ei->i_fc_updates))
> @@ -224,11 +226,8 @@ void ext4_fc_del(struct inode *inode)
>   {
>   	struct ext4_inode_info *ei = EXT4_I(inode);
> 
> -	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> -		return;
> -
> -
> -	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
> +	    (EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY))
>   		return;
> 
>   restart:
> @@ -270,6 +269,10 @@ void ext4_fc_mark_ineligible(struct super_block *sb, int reason)
>   {
>   	struct ext4_sb_info *sbi = EXT4_SB(sb);
> 
> +	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
> +	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
> +		return;
> +
>   	sbi->s_mount_state |= EXT4_FC_INELIGIBLE;
>   	WARN_ON(reason >= EXT4_FC_REASON_MAX);
>   	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
> @@ -283,6 +286,10 @@ void ext4_fc_start_ineligible(struct super_block *sb, int reason)
>   {
>   	struct ext4_sb_info *sbi = EXT4_SB(sb);
> 
> +	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
> +	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
> +		return;
> +
>   	WARN_ON(reason >= EXT4_FC_REASON_MAX);
>   	sbi->s_fc_stats.fc_ineligible_reason_count[reason]++;
>   	atomic_inc(&sbi->s_fc_ineligible_updates);
> @@ -295,6 +302,10 @@ void ext4_fc_start_ineligible(struct super_block *sb, int reason)
>    */
>   void ext4_fc_stop_ineligible(struct super_block *sb)
>   {
> +	if (!test_opt2(sb, JOURNAL_FAST_COMMIT) ||
> +	    (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))
> +		return;
> +
>   	EXT4_SB(sb)->s_mount_state |= EXT4_FC_INELIGIBLE;
>   	atomic_dec(&EXT4_SB(sb)->s_fc_ineligible_updates);
>   }
> @@ -325,7 +336,8 @@ static int ext4_fc_track_template(
>   	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>   	int ret;
> 
> -	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> +	if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT) ||
> +	    (sbi->s_mount_state & EXT4_FC_REPLAY))
>   		return -EOPNOTSUPP;
> 
>   	if (ext4_fc_is_ineligible(inode->i_sb))
> @@ -1214,13 +1226,864 @@ static void ext4_fc_cleanup(journal_t *journal, int full)
>   	trace_ext4_fc_stats(sb);
>   }
> 
> +/* Ext4 Replay Path Routines */
> +
> +/* Get length of a particular tlv */
> +static inline int ext4_fc_tag_len(struct ext4_fc_tl *tl)
> +{
> +	return le16_to_cpu(tl->fc_len);
> +}
> +
> +/* Get a pointer to "value" of a tlv */
> +static inline u8 *ext4_fc_tag_val(struct ext4_fc_tl *tl)
> +{
> +	return (u8 *)tl + sizeof(*tl);
> +}
> +
> +/* Helper struct for dentry replay routines */
> +struct dentry_info_args {
> +	int parent_ino, dname_len, ino, inode_len;
> +	char *dname;
> +};
> +
> +static inline void tl_to_darg(struct dentry_info_args *darg,
> +				struct  ext4_fc_tl *tl)
> +{
> +	struct ext4_fc_dentry_info *fcd;
> +
> +	fcd = (struct ext4_fc_dentry_info *)ext4_fc_tag_val(tl);
> +
> +	darg->parent_ino = le32_to_cpu(fcd->fc_parent_ino);
> +	darg->ino = le32_to_cpu(fcd->fc_ino);
> +	darg->dname = fcd->fc_dname;
> +	darg->dname_len = ext4_fc_tag_len(tl) -
> +			sizeof(struct ext4_fc_dentry_info);
> +}
> +
> +/* Unlink replay function */
> +static int ext4_fc_replay_unlink(struct super_block *sb, struct ext4_fc_tl *tl)
> +{
> +	struct inode *inode, *old_parent;
> +	struct qstr entry;
> +	struct dentry_info_args darg;
> +	int ret = 0;
> +
> +	tl_to_darg(&darg, tl);
> +
> +	trace_ext4_fc_replay(sb, EXT4_FC_TAG_UNLINK, darg.ino,
> +			darg.parent_ino, darg.dname_len);
> +
> +	entry.name = darg.dname;
> +	entry.len = darg.dname_len;
> +	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
> +
> +	if (IS_ERR_OR_NULL(inode)) {
> +		jbd_debug(1, "Inode %d not found", darg.ino);
> +		return 0;
> +	}
> +
> +	old_parent = ext4_iget(sb, darg.parent_ino,
> +				EXT4_IGET_NORMAL);
> +	if (IS_ERR_OR_NULL(old_parent)) {
> +		jbd_debug(1, "Dir with inode  %d not found", darg.parent_ino);
> +		iput(inode);
> +		return 0;
> +	}
> +
> +	ret = __ext4_unlink(old_parent, &entry, inode);
> +	/* -ENOENT ok coz it might not exist anymore. */
> +	if (ret == -ENOENT)
> +		ret = 0;
> +	iput(old_parent);
> +	iput(inode);
> +	return ret;
> +}
> +
> +static int ext4_fc_replay_link_internal(struct super_block *sb,
> +				struct dentry_info_args *darg,
> +				struct inode *inode)
> +{
> +	struct inode *dir = NULL;
> +	struct dentry *dentry_dir = NULL, *dentry_inode = NULL;
> +	struct qstr qstr_dname = QSTR_INIT(darg->dname, darg->dname_len);
> +	int ret = 0;
> +
> +	dir = ext4_iget(sb, darg->parent_ino, EXT4_IGET_NORMAL);
> +	if (IS_ERR(dir)) {
> +		jbd_debug(1, "Dir with inode %d not found.", darg->parent_ino);
> +		dir = NULL;
> +		goto out;
> +	}
> +
> +	dentry_dir = d_obtain_alias(dir);
> +	if (IS_ERR(dentry_dir)) {
> +		jbd_debug(1, "Failed to obtain dentry");
> +		dentry_dir = NULL;
> +		goto out;
> +	}
> +
> +	dentry_inode = d_alloc(dentry_dir, &qstr_dname);
> +	if (!dentry_inode) {
> +		jbd_debug(1, "Inode dentry not created.");
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = __ext4_link(dir, inode, dentry_inode);
> +	/*
> +	 * It's possible that link already existed since data blocks
> +	 * for the dir in question got persisted before we crashed OR
> +	 * we replayed this tag and crashed before the entire replay
> +	 * could complete.
> +	 */
> +	if (ret && ret != -EEXIST) {
> +		jbd_debug(1, "Failed to link\n");
> +		goto out;
> +	}
> +
> +	ret = 0;
> +out:
> +	if (dentry_dir) {
> +		d_drop(dentry_dir);
> +		dput(dentry_dir);
> +	} else if (dir) {
> +		iput(dir);
> +	}
> +	if (dentry_inode) {
> +		d_drop(dentry_inode);
> +		dput(dentry_inode);
> +	}
> +
> +	return ret;
> +}
> +
> +/* Link replay function */
> +static int ext4_fc_replay_link(struct super_block *sb, struct ext4_fc_tl *tl)
> +{
> +	struct inode *inode;
> +	struct dentry_info_args darg;
> +	int ret = 0;
> +
> +	tl_to_darg(&darg, tl);
> +	trace_ext4_fc_replay(sb, EXT4_FC_TAG_LINK, darg.ino,
> +			darg.parent_ino, darg.dname_len);
> +
> +	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
> +	if (IS_ERR_OR_NULL(inode)) {
> +		jbd_debug(1, "Inode not found.");
> +		return 0;
> +	}
> +
> +	ret = ext4_fc_replay_link_internal(sb, &darg, inode);
> +	iput(inode);
> +	return ret;
> +}
> +
> +/*
> + * Record all the modified inodes during replay. We use this later to setup
> + * block bitmaps correctly.
> + */
> +static int ext4_fc_record_modified_inode(struct super_block *sb, int ino)
> +{
> +	struct ext4_fc_replay_state *state;
> +	int i;
> +
> +	state = &EXT4_SB(sb)->s_fc_replay_state;
> +	for (i = 0; i < state->fc_modified_inodes_used; i++)
> +		if (state->fc_modified_inodes[i] == ino)
> +			return 0;
> +	if (state->fc_modified_inodes_used == state->fc_modified_inodes_size) {
> +		state->fc_modified_inodes_size +=
> +			EXT4_FC_REPLAY_REALLOC_INCREMENT;
> +		state->fc_modified_inodes = krealloc(
> +					state->fc_modified_inodes, sizeof(int) *
> +					state->fc_modified_inodes_size,
> +					GFP_KERNEL);
> +		if (!state->fc_modified_inodes)
> +			return -ENOMEM;
> +	}
> +	state->fc_modified_inodes[state->fc_modified_inodes_used++] = ino;
> +	return 0;
> +}
> +
> +/*
> + * Inode replay function
> + *
> + * If the tag is EXT4_FC_TAG_INODE_FULL, copy the entire inode to its location.
> + * If the tag is EXT4_FC_TAG_INODE_PARTIAL, copy everything except i_block.
> + * This is useful if i_block has been modified due to previous ADD_RANGE /
> + * DEL_RANGE tags.
> + */
> +static int ext4_fc_replay_inode(struct super_block *sb, struct ext4_fc_tl *tl)
> +{
> +	struct ext4_fc_inode *fc_inode;
> +	u8 *raw_fc_inode;
> +	struct inode *inode = NULL;
> +	struct ext4_iloc iloc;
> +	int inode_len, ino, ret, tag = le16_to_cpu(tl->fc_tag);
> +
> +	fc_inode = (struct ext4_fc_inode *)ext4_fc_tag_val(tl);
> +
> +	ino = le32_to_cpu(fc_inode->fc_ino);
> +	trace_ext4_fc_replay(sb, tag, ino, 0, 0);
> +
> +	inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
> +	if (!IS_ERR_OR_NULL(inode)) {
> +		ext4_ext_clear_bb(inode);
> +		iput(inode);
> +	}
> +
> +	ext4_fc_record_modified_inode(sb, ino);
> +
> +	raw_fc_inode = fc_inode->fc_raw_inode;
> +	ret = ext4_get_fc_inode_loc(sb, ino, &iloc);
> +	if (ret)
> +		goto out;
> +
> +	inode_len = ext4_fc_tag_len(tl) - sizeof(struct ext4_fc_inode);
> +
> +	if (tag == EXT4_FC_TAG_INODE_FULL) {
> +		memcpy(ext4_raw_inode(&iloc), raw_fc_inode, inode_len);
> +	} else {
> +		memcpy(ext4_raw_inode(&iloc), raw_fc_inode,
> +			offsetof(struct ext4_inode, i_block));
> +		memcpy(&ext4_raw_inode(&iloc)->i_generation,
> +			&((struct ext4_inode *)(raw_fc_inode))->i_generation,
> +			inode_len -
> +			offsetof(struct ext4_inode, i_generation));
> +	}
> +
> +	/* Immediately update the inode on disk. */
> +	ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh);
> +	sync_dirty_buffer(iloc.bh);
> +
> +	ret = ext4_mark_inode_used(sb, ino);
> +	if (ret)
> +		goto out;
> +
> +	/* Given that we just wrote the inode on disk, this SHOULD succeed. */
> +	inode = ext4_iget(sb, ino, EXT4_IGET_NORMAL);
> +	if (IS_ERR_OR_NULL(inode)) {
> +		jbd_debug(1, "Inode not found.");
> +		return -EFSCORRUPTED;
> +	}
> +
> +	/*
> +	 * Our allocator could have made different decisions than before
> +	 * crashing. This should be fixed but until then, we calculate
> +	 * the number of blocks the inode.
> +	 */
> +	if (tag == EXT4_FC_TAG_INODE_PARTIAL)
> +		ext4_ext_replay_set_iblocks(inode);
> +
> +	inode->i_generation = le32_to_cpu(ext4_raw_inode(&iloc)->i_generation);
> +	ext4_reset_inode_seed(inode);
> +
> +	ext4_inode_csum_set(inode, ext4_raw_inode(&iloc), EXT4_I(inode));
> +	ret = ext4_handle_dirty_metadata(NULL, NULL, iloc.bh);
> +	sync_dirty_buffer(iloc.bh);
> +	brelse(iloc.bh);
> +out:
> +	iput(inode);
> +	if (!ret)
> +		blkdev_issue_flush(sb->s_bdev, GFP_KERNEL);
> +
> +	return 0;
> +}
> +
> +/*
> + * Dentry create replay function.
> + *
> + * EXT4_FC_TAG_CREAT is preceded by EXT4_FC_TAG_INODE_FULL. Which means, the
> + * inode for which we are trying to create a dentry here, should already have
> + * been replayed before we start here.
> + */
> +static int ext4_fc_replay_create(struct super_block *sb, struct ext4_fc_tl *tl)
> +{
> +	int ret = 0;
> +	struct inode *inode = NULL;
> +	struct inode *dir = NULL;
> +	struct dentry_info_args darg;
> +
> +	tl_to_darg(&darg, tl);
> +
> +	trace_ext4_fc_replay(sb, EXT4_FC_TAG_CREAT, darg.ino,
> +			darg.parent_ino, darg.dname_len);
> +
> +	/* This takes care of update group descriptor and other metadata */
> +	ret = ext4_mark_inode_used(sb, darg.ino);
> +	if (ret)
> +		goto out;
> +
> +	inode = ext4_iget(sb, darg.ino, EXT4_IGET_NORMAL);
> +	if (IS_ERR_OR_NULL(inode)) {
> +		jbd_debug(1, "inode %d not found.", darg.ino);
> +		inode = NULL;
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	if (S_ISDIR(inode->i_mode)) {
> +		/*
> +		 * If we are creating a directory, we need to make sure that the
> +		 * dot and dot dot dirents are setup properly.
> +		 */
> +		dir = ext4_iget(sb, darg.parent_ino, EXT4_IGET_NORMAL);
> +		if (IS_ERR_OR_NULL(dir)) {
> +			jbd_debug(1, "Dir %d not found.", darg.ino);
> +			goto out;
> +		}
> +		ret = ext4_init_new_dir(NULL, dir, inode);
> +		iput(dir);
> +		if (ret) {
> +			ret = 0;
> +			goto out;
> +		}
> +	}
> +	ret = ext4_fc_replay_link_internal(sb, &darg, inode);
> +	if (ret)
> +		goto out;
> +	set_nlink(inode, 1);
> +	ext4_mark_inode_dirty(NULL, inode);
> +out:
> +	if (inode)
> +		iput(inode);
> +	return ret;
> +}
> +
> +/*
> + * Record physical disk regions which are in use as per fast commit area. Our
> + * simple replay phase allocator excludes these regions from allocation.
> + */
> +static int ext4_fc_record_regions(struct super_block *sb, int ino,
> +		ext4_lblk_t lblk, ext4_fsblk_t pblk, int len)
> +{
> +	struct ext4_fc_replay_state *state;
> +	struct ext4_fc_alloc_region *region;
> +
> +	state = &EXT4_SB(sb)->s_fc_replay_state;
> +	if (state->fc_regions_used == state->fc_regions_size) {
> +		state->fc_regions_size +=
> +			EXT4_FC_REPLAY_REALLOC_INCREMENT;
> +		state->fc_regions = krealloc(
> +					state->fc_regions,
> +					state->fc_regions_size *
> +					sizeof(struct ext4_fc_alloc_region),
> +					GFP_KERNEL);
> +		if (!state->fc_regions)
> +			return -ENOMEM;
> +	}
> +	region = &state->fc_regions[state->fc_regions_used++];
> +	region->ino = ino;
> +	region->lblk = lblk;
> +	region->pblk = pblk;
> +	region->len = len;
> +
> +	return 0;
> +}
> +
> +/* Replay add range tag */
> +static int ext4_fc_replay_add_range(struct super_block *sb,
> +				struct ext4_fc_tl *tl)
> +{
> +	struct ext4_fc_add_range *fc_add_ex;
> +	struct ext4_extent newex, *ex;
> +	struct inode *inode;
> +	ext4_lblk_t start, cur;
> +	int remaining, len;
> +	ext4_fsblk_t start_pblk;
> +	struct ext4_map_blocks map;
> +	struct ext4_ext_path *path = NULL;
> +	int ret;
> +
> +	fc_add_ex = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl);
> +	ex = (struct ext4_extent *)&fc_add_ex->fc_ex;
> +
> +	trace_ext4_fc_replay(sb, EXT4_FC_TAG_ADD_RANGE,
> +		le32_to_cpu(fc_add_ex->fc_ino), le32_to_cpu(ex->ee_block),
> +		ext4_ext_get_actual_len(ex));
> +
> +	inode = ext4_iget(sb, le32_to_cpu(fc_add_ex->fc_ino),
> +				EXT4_IGET_NORMAL);
> +	if (IS_ERR_OR_NULL(inode)) {
> +		jbd_debug(1, "Inode not found.");
> +		return 0;
> +	}
> +
> +	ret = ext4_fc_record_modified_inode(sb, inode->i_ino);
> +
> +	start = le32_to_cpu(ex->ee_block);
> +	start_pblk = ext4_ext_pblock(ex);
> +	len = ext4_ext_get_actual_len(ex);
> +
> +	cur = start;
> +	remaining = len;
> +	jbd_debug(1, "ADD_RANGE, lblk %d, pblk %lld, len %d, unwritten %d, inode %ld\n",
> +		  start, start_pblk, len, ext4_ext_is_unwritten(ex),
> +		  inode->i_ino);
> +
> +	while (remaining > 0) {
> +		map.m_lblk = cur;
> +		map.m_len = remaining;
> +		map.m_pblk = 0;
> +		ret = ext4_map_blocks(NULL, inode, &map, 0);
> +
> +		if (ret < 0) {
> +			iput(inode);
> +			return 0;
> +		}
> +
> +		if (ret == 0) {
> +			/* Range not mapped */
> +			path = ext4_find_extent(inode, cur, NULL, 0);
> +			if (!path)
> +				continue;
> +			memset(&newex, 0, sizeof(newex));
> +			newex.ee_block = cpu_to_le32(cur);
> +			ext4_ext_store_pblock(
> +				&newex, start_pblk + cur - start);
> +			newex.ee_len = cpu_to_le16(map.m_len);
> +			if (ext4_ext_is_unwritten(ex))
> +				ext4_ext_mark_unwritten(&newex);
> +			down_write(&EXT4_I(inode)->i_data_sem);
> +			ret = ext4_ext_insert_extent(
> +				NULL, inode, &path, &newex, 0);
> +			up_write((&EXT4_I(inode)->i_data_sem));
> +			ext4_ext_drop_refs(path);
> +			kfree(path);
> +			if (ret) {
> +				iput(inode);
> +				return 0;
> +			}
> +			goto next;
> +		}
> +
> +		if (start_pblk + cur - start != map.m_pblk) {
> +			/* Logical to physical mapping changed */


Sorry, I am not sure I understand this correctly. Could we please add more
comments on when and how this condition can happen?
I am sure I am missing something.

Also, what if the mapping changed and the start pblk is different, but the
new mapping still overlaps the old one?
Do we handle that case here? I ask because we clear the block bitmap
bits for map.m_len blocks starting at map.m_pblk below.
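
To make the overlap case concrete, here is a minimal userspace sketch. The
struct and helper names are hypothetical and are not part of the patch; they
only model two half-open physical block ranges. Suppose, purely as an
example, the old mapping covers pblk 100..107 and the replayed extent wants
pblk 104..111 for the same logical range. If I read the code right, clearing
map.m_len bits starting at map.m_pblk would then also clear 104..107, which
the new mapping still uses.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper type, not from the patch: a half-open range of
 * physical blocks [start, start + len).
 */
struct blk_range {
	uint64_t start;
	uint64_t len;
};

/* Return true if the two ranges share at least one block. */
static bool blk_ranges_overlap(struct blk_range a, struct blk_range b)
{
	return a.len && b.len &&
	       a.start < b.start + b.len &&
	       b.start < a.start + a.len;
}

/* Return the number of blocks present in both ranges (0 if disjoint). */
static uint64_t blk_ranges_shared(struct blk_range a, struct blk_range b)
{
	uint64_t lo, hi;

	if (!blk_ranges_overlap(a, b))
		return 0;
	lo = a.start > b.start ? a.start : b.start;
	hi = a.start + a.len < b.start + b.len ?
		a.start + a.len : b.start + b.len;
	return hi - lo;
}
```

With old = {100, 8} and new = {104, 8}, blk_ranges_shared() reports the four
blocks that are in both ranges. That is the part I am worried about when the
old range is cleared in the bitmap after the extent has been updated.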

> +			ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
> +					ext4_ext_is_unwritten(ex),
> +					start_pblk + cur - start);
> +			if (ret) {
> +				iput(inode);
> +				return 0;
> +			}
> +			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
> +			goto next;
> +		}
> +
> +		/* Range is mapped and needs a state change */
> +		jbd_debug(1, "Converting from %d to %d %lld",
> +				map.m_flags & EXT4_MAP_UNWRITTEN,
> +			ext4_ext_is_unwritten(ex), map.m_pblk);
> +		ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
> +					ext4_ext_is_unwritten(ex), map.m_pblk);
> +		if (ret) {
> +			iput(inode);
> +			return 0;
> +		}
> +		/*
> +		 * We may have split the extent tree while toggling the state.
> +		 * Try to shrink the exten tree now.

s/exten/extent



> +		 */
> +		ext4_ext_replay_shrink_inode(inode, start + len);
> +next:
> +		cur += map.m_len;
> +		remaining -= map.m_len;
> +	}
> +	ext4_ext_replay_shrink_inode(inode, i_size_read(inode) >>
> +					sb->s_blocksize_bits);
> +	iput(inode);
> +	return 0;
> +}
> +
> +/* Replay DEL_RANGE tag */
> +static int
> +ext4_fc_replay_del_range(struct super_block *sb, struct ext4_fc_tl *tl)
> +{
> +	struct inode *inode;
> +	struct ext4_fc_del_range *lrange;
> +	struct ext4_map_blocks map;
> +	ext4_lblk_t cur, remaining;
> +	int ret;
> +
> +	lrange = (struct ext4_fc_del_range *)ext4_fc_tag_val(tl);
> +	cur = le32_to_cpu(lrange->fc_lblk);
> +	remaining = le32_to_cpu(lrange->fc_len);
> +
> +	trace_ext4_fc_replay(sb, EXT4_FC_TAG_DEL_RANGE,
> +		le32_to_cpu(lrange->fc_ino), cur, remaining);
> +
> +	inode = ext4_iget(sb, le32_to_cpu(lrange->fc_ino), EXT4_IGET_NORMAL);
> +	if (IS_ERR_OR_NULL(inode)) {
> +		jbd_debug(1, "Inode %d not found", le32_to_cpu(lrange->fc_ino));
> +		return 0;
> +	}
> +
> +	ret = ext4_fc_record_modified_inode(sb, inode->i_ino);
> +
> +	jbd_debug(1, "DEL_RANGE, inode %ld, lblk %d, len %d\n",
> +			inode->i_ino, le32_to_cpu(lrange->fc_lblk),
> +			le32_to_cpu(lrange->fc_len));
> +	while (remaining > 0) {
> +		map.m_lblk = cur;
> +		map.m_len = remaining;
> +
> +		ret = ext4_map_blocks(NULL, inode, &map, 0);
> +		if (ret < 0) {
> +			iput(inode);
> +			return 0;
> +		}
> +		if (ret > 0) {
> +			remaining -= ret;
> +			cur += ret;
> +			ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
> +		} else {
> +			remaining -= map.m_len;
> +			cur += map.m_len;
> +		}
> +	}
> +
> +	ret = ext4_punch_hole(inode,
> +		le32_to_cpu(lrange->fc_lblk) << sb->s_blocksize_bits,
> +		le32_to_cpu(lrange->fc_len) <<  sb->s_blocksize_bits);
> +	if (ret)
> +		jbd_debug(1, "ext4_punch_hole returned %d", ret);
> +	ext4_ext_replay_shrink_inode(inode,
> +		i_size_read(inode) >> sb->s_blocksize_bits);
> +	ext4_mark_inode_dirty(NULL, inode);
> +	iput(inode);
> +
> +	return 0;
> +}
> +
> +static inline const char *tag2str(u16 tag)
> +{
> +	switch (tag) {
> +	case EXT4_FC_TAG_LINK:
> +		return "TAG_ADD_ENTRY";
> +	case EXT4_FC_TAG_UNLINK:
> +		return "TAG_DEL_ENTRY";
> +	case EXT4_FC_TAG_ADD_RANGE:
> +		return "TAG_ADD_RANGE";
> +	case EXT4_FC_TAG_CREAT:
> +		return "TAG_CREAT_DENTRY";
> +	case EXT4_FC_TAG_DEL_RANGE:
> +		return "TAG_DEL_RANGE";
> +	case EXT4_FC_TAG_INODE_FULL:
> +		return "TAG_INODE_FULL";
> +	case EXT4_FC_TAG_INODE_PARTIAL:
> +		return "TAG_INODE_PARTIAL";
> +	case EXT4_FC_TAG_PAD:
> +		return "TAG_PAD";
> +	case EXT4_FC_TAG_TAIL:
> +		return "TAG_TAIL";
> +	case EXT4_FC_TAG_HEAD:
> +		return "TAG_HEAD";
> +	default:
> +		return "TAG_ERROR";
> +	}
> +}
> +
> +void ext4_fc_set_bitmaps_and_counters(struct super_block *sb)

Should this be static?

> +{
> +	struct ext4_fc_replay_state *state;
> +	struct inode *inode;
> +	struct ext4_ext_path *path = NULL;
> +	struct ext4_map_blocks map;
> +	int i, ret, j;
> +	ext4_lblk_t cur, end;
> +
> +	state = &EXT4_SB(sb)->s_fc_replay_state;
> +	for (i = 0; i < state->fc_modified_inodes_used; i++) {
> +		inode = ext4_iget(sb, state->fc_modified_inodes[i],
> +			EXT4_IGET_NORMAL);
> +		if (IS_ERR_OR_NULL(inode)) {
> +			jbd_debug(1, "Inode %d not found.",
> +				state->fc_modified_inodes[i]);
> +			continue;
> +		}
> +		cur = 0;
> +		end = EXT_MAX_BLOCKS;
> +		while (cur < end) {
> +			map.m_lblk = cur;
> +			map.m_len = end - cur;
> +
> +			ret = ext4_map_blocks(NULL, inode, &map, 0);
> +			if (ret < 0)
> +				break;
> +
> +			if (ret > 0) {
> +				path = ext4_find_extent(inode, map.m_lblk, NULL, 0);
> +				if (!IS_ERR_OR_NULL(path)) {
> +					for (j = 0; j < path->p_depth; j++)
> +						ext4_mb_mark_bb(inode->i_sb,
> +							path[j].p_block, 1, 1);
> +					ext4_ext_drop_refs(path);
> +					kfree(path);
> +				}
> +				cur += ret;
> +				ext4_mb_mark_bb(inode->i_sb, map.m_pblk,
> +							map.m_len, 1);
> +			} else {
> +				cur = cur + (map.m_len ? map.m_len : 1);
> +			}
> +		}
> +		iput(inode);
> +	}
> +}
> +
> +/*
> + * Check if block is in excluded regions for block allocation. The simple
> + * allocator that runs during the replay phase calls this function to see
> + * if it is okay to use a block.
> + */
> +bool ext4_fc_replay_check_excluded(struct super_block *sb, ext4_fsblk_t blk)
> +{
> +	int i;
> +	struct ext4_fc_replay_state *state;
> +
> +	state = &EXT4_SB(sb)->s_fc_replay_state;
> +	for (i = 0; i < state->fc_regions_valid; i++) {
> +		if (state->fc_regions[i].ino == 0 ||
> +			state->fc_regions[i].len == 0)
> +			continue;
> +		if (blk >= state->fc_regions[i].pblk &&
> +		    blk < state->fc_regions[i].pblk + state->fc_regions[i].len)
> +			return true;
> +	}
> +	return false;
> +}
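
The check above treats each recorded region as the half-open physical
block interval [pblk, pblk + len) and skips slots with a zero inode
number or zero length. A minimal userspace sketch of the same test
follows; toy_region and toy_is_excluded are hypothetical names that only
stand in for the kernel structures, and nothing here is kernel code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ext4_fc_alloc_region. */
struct toy_region {
	uint64_t pblk;	/* first physical block of the region */
	int ino;	/* owning inode; 0 marks an unused slot */
	int len;	/* number of blocks; 0 marks an empty region */
};

/* Return true if blk lies in [pblk, pblk + len) of any valid region. */
static bool toy_is_excluded(const struct toy_region *r, size_t n,
			    uint64_t blk)
{
	size_t i;

	for (i = 0; i < n; i++) {
		/* Skip slots that were never filled in. */
		if (r[i].ino == 0 || r[i].len == 0)
			continue;
		if (blk >= r[i].pblk &&
		    blk < r[i].pblk + (uint64_t)r[i].len)
			return true;
	}
	return false;
}
```

With a region starting at block 100 and spanning 8 blocks, blocks 100
through 107 are excluded and block 108 is free to use again.
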
> +
> +/* Cleanup function called after replay */
> +void ext4_fc_replay_cleanup(struct super_block *sb)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +
> +	sbi->s_mount_state &= ~EXT4_FC_REPLAY;
> +	kfree(sbi->s_fc_replay_state.fc_regions);
> +	kfree(sbi->s_fc_replay_state.fc_modified_inodes);
> +}
> +
> +/*
> + * Recovery Scan phase handler
> + *
> + * This function is called during the scan phase and is responsible
> + * for doing following things:
> + * - Make sure the fast commit area has valid tags for replay
> + * - Count number of tags that need to be replayed by the replay handler
> + * - Verify CRC
> + * - Create a list of excluded blocks for allocation during replay phase
> + *
> + * This function returns JBD2_FC_REPLAY_CONTINUE to indicate that SCAN is
> + * incomplete and JBD2 should send more blocks. It returns JBD2_FC_REPLAY_STOP
> + * to indicate that scan has finished and JBD2 can now start replay phase.
> + * It returns a negative error to indicate that there was an error. At the end
> + * of a successful scan phase, sbi->s_fc_replay_state.fc_replay_num_tags is set
> + * to indicate the number of tags that need to be replayed during the replay phase.
> + */
> +static int ext4_fc_replay_scan(journal_t *journal,
> +				struct buffer_head *bh, int off,
> +				tid_t expected_tid)
> +{
> +	struct super_block *sb = journal->j_private;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_fc_replay_state *state;
> +	int ret = JBD2_FC_REPLAY_CONTINUE;
> +	struct ext4_fc_add_range *ext;
> +	struct ext4_fc_tl *tl;
> +	struct ext4_fc_tail *tail;
> +	__u8 *start, *end;
> +	struct ext4_fc_head *head;
> +	struct ext4_extent *ex;
> +
> +	state = &sbi->s_fc_replay_state;
> +
> +	start = (u8 *)bh->b_data;
> +	end = (__u8 *)bh->b_data + journal->j_blocksize - 1;
> +
> +	if (state->fc_replay_expected_off == 0) {
> +		state->fc_cur_tag = 0;
> +		state->fc_replay_num_tags = 0;
> +		state->fc_crc = 0;
> +		state->fc_regions = NULL;
> +		state->fc_regions_valid = state->fc_regions_used =
> +			state->fc_regions_size = 0;
> +		/* Check if we can stop early */
> +		if (le16_to_cpu(((struct ext4_fc_tl *)start)->fc_tag)
> +			!= EXT4_FC_TAG_HEAD)
> +			return 0;
> +	}
> +
> +	if (off != state->fc_replay_expected_off) {
> +		ret = -EFSCORRUPTED;
> +		goto out_err;
> +	}
> +
> +	state->fc_replay_expected_off++;
> +	fc_for_each_tl(start, end, tl) {
> +		jbd_debug(3, "Scan phase, tag:%s, blk %lld\n",
> +			  tag2str(le16_to_cpu(tl->fc_tag)), bh->b_blocknr);
> +		switch (le16_to_cpu(tl->fc_tag)) {
> +		case EXT4_FC_TAG_ADD_RANGE:
> +			ext = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl);
> +			ex = (struct ext4_extent *)&ext->fc_ex;
> +			ret = ext4_fc_record_regions(sb,
> +				le32_to_cpu(ext->fc_ino),
> +				le32_to_cpu(ex->ee_block), ext4_ext_pblock(ex),
> +				ext4_ext_get_actual_len(ex));
> +			if (ret < 0)
> +				break;
> +			ret = JBD2_FC_REPLAY_CONTINUE;
> +			fallthrough;
> +		case EXT4_FC_TAG_DEL_RANGE:
> +		case EXT4_FC_TAG_LINK:
> +		case EXT4_FC_TAG_UNLINK:
> +		case EXT4_FC_TAG_CREAT:
> +		case EXT4_FC_TAG_INODE_FULL:
> +		case EXT4_FC_TAG_INODE_PARTIAL:
> +		case EXT4_FC_TAG_PAD:
> +			state->fc_cur_tag++;
> +			state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
> +					sizeof(*tl) + ext4_fc_tag_len(tl));
> +			break;
> +		case EXT4_FC_TAG_TAIL:
> +			state->fc_cur_tag++;
> +			tail = (struct ext4_fc_tail *)ext4_fc_tag_val(tl);
> +			state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
> +						sizeof(*tl) +
> +						offsetof(struct ext4_fc_tail,
> +						fc_crc));
> +			if (le32_to_cpu(tail->fc_tid) == expected_tid &&
> +				le32_to_cpu(tail->fc_crc) == state->fc_crc) {
> +				state->fc_replay_num_tags = state->fc_cur_tag;
> +				state->fc_regions_valid =
> +					state->fc_regions_used;
> +			} else {
> +				ret = state->fc_replay_num_tags ?
> +					JBD2_FC_REPLAY_STOP : -EFSBADCRC;
> +			}
> +			state->fc_crc = 0;
> +			break;
> +		case EXT4_FC_TAG_HEAD:
> +			head = (struct ext4_fc_head *)ext4_fc_tag_val(tl);
> +			if (le32_to_cpu(head->fc_features) &
> +				~EXT4_FC_SUPPORTED_FEATURES) {
> +				ret = -EOPNOTSUPP;
> +				break;
> +			}
> +			if (le32_to_cpu(head->fc_tid) != expected_tid) {
> +				ret = JBD2_FC_REPLAY_STOP;
> +				break;
> +			}
> +			state->fc_cur_tag++;
> +			state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
> +					sizeof(*tl) + ext4_fc_tag_len(tl));


Why do we need to calculate state->fc_crc for HEAD?
I don't see us comparing it anywhere. Is there anything I missed?
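
For context, as far as the quoted scan loop shows, state->fc_crc is a
running value: every tag from HEAD onward is folded into it, and the
only comparison happens in the TAIL case, after which it is reset to 0.
A minimal userspace sketch of that accumulate-then-compare pattern
follows. All toy_* names are hypothetical, toy_sum() is a trivial
stand-in rather than ext4_chksum(), and the TAIL coverage is simplified
to the TLV header only.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tag values, for this sketch only. */
enum { TOY_TAG_HEAD = 1, TOY_TAG_DATA = 2, TOY_TAG_TAIL = 3 };

/* Toy TLV header: tag, then length of the value that follows. */
struct toy_tl {
	uint16_t tag;
	uint16_t len;
};

/* Stand-in running checksum (not crc32c): fold bytes into the seed. */
static uint32_t toy_sum(uint32_t seed, const void *buf, size_t len)
{
	const uint8_t *p = buf;
	size_t i;

	for (i = 0; i < len; i++)
		seed = seed * 31 + p[i];
	return seed;
}

/* Append one TLV at offset off; return the offset just past it. */
static size_t toy_put(uint8_t *buf, size_t off, uint16_t tag,
		      const void *val, uint16_t len)
{
	struct toy_tl tl = { .tag = tag, .len = len };

	memcpy(buf + off, &tl, sizeof(tl));
	if (len)
		memcpy(buf + off + sizeof(tl), val, len);
	return off + sizeof(tl) + len;
}

/* Build HEAD, DATA, TAIL; TAIL stores the checksum of all that precedes. */
static size_t toy_build(uint8_t *buf)
{
	struct toy_tl tail = { .tag = TOY_TAG_TAIL, .len = sizeof(uint32_t) };
	uint32_t val = 7, crc;
	size_t off = 0;

	off = toy_put(buf, off, TOY_TAG_HEAD, &val, sizeof(val));
	off = toy_put(buf, off, TOY_TAG_DATA, &val, sizeof(val));
	crc = toy_sum(0, buf, off);		/* HEAD and DATA */
	crc = toy_sum(crc, &tail, sizeof(tail));	/* TAIL header */
	return toy_put(buf, off, TOY_TAG_TAIL, &crc, sizeof(crc));
}

/*
 * Walk the TLVs. Returns 1 if a TAIL matches the running checksum,
 * 0 if a TAIL is found but does not match, -1 if no valid TAIL exists.
 */
static int toy_scan(const uint8_t *buf, size_t size)
{
	uint32_t crc = 0, stored;
	struct toy_tl tl;
	size_t off = 0;

	while (off + sizeof(tl) <= size) {
		memcpy(&tl, buf + off, sizeof(tl));
		if (off + sizeof(tl) + tl.len > size)
			return -1;
		if (tl.tag == TOY_TAG_TAIL) {
			if (tl.len != sizeof(stored))
				return -1;
			/* Cover the TAIL header, not the stored value. */
			crc = toy_sum(crc, buf + off, sizeof(tl));
			memcpy(&stored, buf + off + sizeof(tl),
			       sizeof(stored));
			return stored == crc;
		}
		/* HEAD and every other tag feed the running checksum. */
		crc = toy_sum(crc, buf + off, sizeof(tl) + tl.len);
		off += sizeof(tl) + tl.len;
	}
	return -1;
}
```

In this model, flipping a single byte of the HEAD value makes the TAIL
comparison fail, which is the only place the HEAD contribution is
observed.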

> +			break;
> +		default:
> +			ret = state->fc_replay_num_tags ?
> +				JBD2_FC_REPLAY_STOP : -ECANCELED;
> +		}
> +		if (ret < 0 || ret == JBD2_FC_REPLAY_STOP)
> +			break;
> +	}
> +
> +out_err:
> +	trace_ext4_fc_replay_scan(sb, ret, off);
> +	return ret;
> +}
> +
>   /*
>    * Main recovery path entry point.
> + * The meaning of return codes is similar as above.
>    */
>   static int ext4_fc_replay(journal_t *journal, struct buffer_head *bh,
>   				enum passtype pass, int off, tid_t expected_tid)
>   {
> -	return 0;
> +	struct super_block *sb = journal->j_private;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_fc_tl *tl;
> +	__u8 *start, *end;
> +	int ret = JBD2_FC_REPLAY_CONTINUE;
> +	struct ext4_fc_replay_state *state = &sbi->s_fc_replay_state;
> +	struct ext4_fc_tail *tail;
> +
> +	if (pass == PASS_SCAN) {
> +		state->fc_current_pass = PASS_SCAN;
> +		return ext4_fc_replay_scan(journal, bh, off, expected_tid);
> +	}
> +
> +	if (state->fc_current_pass != pass) {
> +		state->fc_current_pass = pass;
> +		sbi->s_mount_state |= EXT4_FC_REPLAY;
> +	}
> +	if (!sbi->s_fc_replay_state.fc_replay_num_tags) {
> +		jbd_debug(1, "Replay stops\n");
> +		ext4_fc_set_bitmaps_and_counters(sb);
> +		return 0;
> +	}
> +
> +#ifdef CONFIG_EXT4_DEBUG
> +	if (sbi->s_fc_debug_max_replay && off >= sbi->s_fc_debug_max_replay) {
> +		pr_warn("Dropping fc block %d because max_replay set\n", off);
> +		return -EINVAL;
> +	}
> +#endif
> +
> +	start = (u8 *)bh->b_data;
> +	end = (__u8 *)bh->b_data + journal->j_blocksize - 1;
> +
> +	fc_for_each_tl(start, end, tl) {
> +		if (state->fc_replay_num_tags == 0) {
> +			ret = JBD2_FC_REPLAY_STOP;
> +			ext4_fc_set_bitmaps_and_counters(sb);
> +			break;
> +		}
> +		jbd_debug(3, "Replay phase, tag:%s\n",
> +				tag2str(le16_to_cpu(tl->fc_tag)));
> +		state->fc_replay_num_tags--;
> +		switch (le16_to_cpu(tl->fc_tag)) {
> +		case EXT4_FC_TAG_LINK:
> +			ret = ext4_fc_replay_link(sb, tl);
> +			break;
> +		case EXT4_FC_TAG_UNLINK:
> +			ret = ext4_fc_replay_unlink(sb, tl);
> +			break;
> +		case EXT4_FC_TAG_ADD_RANGE:
> +			ret = ext4_fc_replay_add_range(sb, tl);
> +			break;
> +		case EXT4_FC_TAG_CREAT:
> +			ret = ext4_fc_replay_create(sb, tl);
> +			break;
> +		case EXT4_FC_TAG_DEL_RANGE:
> +			ret = ext4_fc_replay_del_range(sb, tl);
> +			break;
> +		case EXT4_FC_TAG_INODE_PARTIAL:
> +		case EXT4_FC_TAG_INODE_FULL:
> +			ret = ext4_fc_replay_inode(sb, tl);
> +			break;
> +		case EXT4_FC_TAG_PAD:
> +			trace_ext4_fc_replay(sb, EXT4_FC_TAG_PAD, 0,
> +				ext4_fc_tag_len(tl), 0);
> +			break;
> +		case EXT4_FC_TAG_TAIL:
> +			trace_ext4_fc_replay(sb, EXT4_FC_TAG_TAIL, 0,
> +				ext4_fc_tag_len(tl), 0);
> +			tail = (struct ext4_fc_tail *)ext4_fc_tag_val(tl);
> +			WARN_ON(le32_to_cpu(tail->fc_tid) != expected_tid);
> +			break;
> +		case EXT4_FC_TAG_HEAD:
> +			break;
> +		default:
> +			trace_ext4_fc_replay(sb, le16_to_cpu(tl->fc_tag), 0,
> +				ext4_fc_tag_len(tl), 0);
> +			ret = -ECANCELED;
> +			break;
> +		}
> +		if (ret < 0)
> +			break;
> +		ret = JBD2_FC_REPLAY_CONTINUE;
> +	}
> +	return ret;
>   }
> 
>   void ext4_fc_init(struct super_block *sb, journal_t *journal)
> diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
> index a541d2bbe24b..cf4d8772d055 100644
> --- a/fs/ext4/fast_commit.h
> +++ b/fs/ext4/fast_commit.h
> @@ -117,4 +117,44 @@ struct ext4_fc_stats {
>   	int fc_numblks;
>   };
> 
> +#define EXT4_FC_REPLAY_REALLOC_INCREMENT	4
> +
> +/*
> + * Physical block regions added to different inodes due to fast commit
> + * recovery. These are set during the SCAN phase. During the replay phase,
> + * our allocator excludes these from its allocation. This ensures that
> + * we don't accidentally allocate a block that is going to be used by
> + * another inode.
> + */
> +struct ext4_fc_alloc_region {
> +	ext4_lblk_t lblk;
> +	ext4_fsblk_t pblk;
> +	int ino, len;
> +};
> +
> +/*
> + * Fast commit replay state.
> + */
> +struct ext4_fc_replay_state {
> +	int fc_replay_num_tags;
> +	int fc_replay_expected_off;
> +	int fc_current_pass;
> +	int fc_cur_tag;
> +	int fc_crc;
> +	struct ext4_fc_alloc_region *fc_regions;
> +	int fc_regions_size, fc_regions_used, fc_regions_valid;
> +	int *fc_modified_inodes;
> +	int fc_modified_inodes_used, fc_modified_inodes_size;
> +};
> +
> +#define region_last(__region) (((__region)->lblk) + ((__region)->len) - 1)
> +
> +#define fc_for_each_tl(__start, __end, __tl)				\
> +	for (tl = (struct ext4_fc_tl *)start;				\
> +		(u8 *)tl < (u8 *)end;					\
> +		tl = (struct ext4_fc_tl *)((u8 *)tl +			\
> +					sizeof(struct ext4_fc_tl) +	\
> +					+ le16_to_cpu(tl->fc_len)))
> +
> +
>   #endif /* __FAST_COMMIT_H__ */
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index df25d38d6539..db9b9eeb9560 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -82,7 +82,12 @@ static int ext4_validate_inode_bitmap(struct super_block *sb,
>   				      struct buffer_head *bh)
>   {
>   	ext4_fsblk_t	blk;
> -	struct ext4_group_info *grp = ext4_get_group_info(sb, block_group);
> +	struct ext4_group_info *grp;
> +
> +	if (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)
> +		return 0;
> +
> +	grp = ext4_get_group_info(sb, block_group);
> 
>   	if (buffer_verified(bh))
>   		return 0;
> @@ -284,15 +289,17 @@ void ext4_free_inode(handle_t *handle, struct inode *inode)
>   	bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
>   	bitmap_bh = ext4_read_inode_bitmap(sb, block_group);
>   	/* Don't bother if the inode bitmap is corrupt. */
> -	grp = ext4_get_group_info(sb, block_group);
>   	if (IS_ERR(bitmap_bh)) {
>   		fatal = PTR_ERR(bitmap_bh);
>   		bitmap_bh = NULL;
>   		goto error_return;
>   	}
> -	if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
> -		fatal = -EFSCORRUPTED;
> -		goto error_return;
> +	if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
> +		grp = ext4_get_group_info(sb, block_group);
> +		if (unlikely(EXT4_MB_GRP_IBITMAP_CORRUPT(grp))) {
> +			fatal = -EFSCORRUPTED;
> +			goto error_return;
> +		}
>   	}
> 
>   	BUFFER_TRACE(bitmap_bh, "get_write_access");
> @@ -742,6 +749,119 @@ static int find_inode_bit(struct super_block *sb, ext4_group_t group,
>   	return 1;
>   }
> 
> +int ext4_mark_inode_used(struct super_block *sb, int ino)
> +{
> +	unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count);
> +	struct buffer_head *inode_bitmap_bh = NULL, *group_desc_bh = NULL;
> +	struct ext4_group_desc *gdp;
> +	ext4_group_t group;
> +	int bit;
> +	int err = -EFSCORRUPTED;
> +
> +	if (ino < EXT4_FIRST_INO(sb) || ino > max_ino)
> +		goto out;
> +
> +	group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
> +	bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
> +	inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
> +	if (IS_ERR(inode_bitmap_bh))
> +		return PTR_ERR(inode_bitmap_bh);
> +
> +	if (ext4_test_bit(bit, inode_bitmap_bh->b_data)) {
> +		err = 0;
> +		goto out;
> +	}
> +
> +	gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
> +	if (!gdp || !group_desc_bh) {
> +		err = -EINVAL;
> +		goto out;
> +	}
> +
> +	ext4_set_bit(bit, inode_bitmap_bh->b_data);
> +
> +	BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> +	err = ext4_handle_dirty_metadata(NULL, NULL, inode_bitmap_bh);
> +	if (err) {
> +		ext4_std_error(sb, err);
> +		goto out;
> +	}
> +	sync_dirty_buffer(inode_bitmap_bh);

Shouldn't we handle error from sync_dirty_buffer()?

> +	BUFFER_TRACE(group_desc_bh, "get_write_access");

The above BUFFER_TRACE() is not correct. We should remove it from here.



> +
> +	/* We may have to initialize the block bitmap if it isn't already */
> +	if (ext4_has_group_desc_csum(sb) &&
> +	    gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> +		struct buffer_head *block_bitmap_bh;
> +
> +		block_bitmap_bh = ext4_read_block_bitmap(sb, group);
> +		if (IS_ERR(block_bitmap_bh)) {
> +			err = PTR_ERR(block_bitmap_bh);
> +			goto out;
> +		}
> +
> +		BUFFER_TRACE(block_bitmap_bh, "dirty block bitmap");
> +		err = ext4_handle_dirty_metadata(NULL, NULL, block_bitmap_bh);
> +		sync_dirty_buffer(block_bitmap_bh);
> +
> +		/* recheck and clear flag under lock if we still need to */
> +		ext4_lock_group(sb, group);
> +		if (ext4_has_group_desc_csum(sb) &&
> +		    (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT))) {
> +			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
> +			ext4_free_group_clusters_set(sb, gdp,
> +				ext4_free_clusters_after_init(sb, group, gdp));
> +			ext4_block_bitmap_csum_set(sb, group, gdp,
> +						   block_bitmap_bh);
> +			ext4_group_desc_csum_set(sb, group, gdp);
> +		}
> +		ext4_unlock_group(sb, group);
> +		brelse(block_bitmap_bh);
> +
> +		if (err) {
> +			ext4_std_error(sb, err);
> +			goto out;
> +		}
> +	}
> +
> +	/* Update the relevant bg descriptor fields */
> +	if (ext4_has_group_desc_csum(sb)) {
> +		int free;
> +
> +		ext4_lock_group(sb, group); /* while we modify the bg desc */
> +		free = EXT4_INODES_PER_GROUP(sb) -
> +			ext4_itable_unused_count(sb, gdp);
> +		if (gdp->bg_flags & cpu_to_le16(EXT4_BG_INODE_UNINIT)) {
> +			gdp->bg_flags &= cpu_to_le16(~EXT4_BG_INODE_UNINIT);
> +			free = 0;
> +		}
> +
> +		/*
> +		 * Check the relative inode number against the last used
> +		 * relative inode number in this group. if it is greater
> +		 * we need to update the bg_itable_unused count
> +		 */
> +		if (bit >= free)
> +			ext4_itable_unused_set(sb, gdp,
> +					(EXT4_INODES_PER_GROUP(sb) - bit - 1));
> +	} else {
> +		ext4_lock_group(sb, group);
> +	}
> +
> +	ext4_free_inodes_set(sb, gdp, ext4_free_inodes_count(sb, gdp) - 1);
> +	if (ext4_has_group_desc_csum(sb)) {
> +		ext4_inode_bitmap_csum_set(sb, group, gdp, inode_bitmap_bh,
> +					   EXT4_INODES_PER_GROUP(sb) / 8);
> +		ext4_group_desc_csum_set(sb, group, gdp);
> +	}
> +
> +	ext4_unlock_group(sb, group);
> +	err = ext4_handle_dirty_metadata(NULL, NULL, group_desc_bh);
> +	sync_dirty_buffer(group_desc_bh);
> +out:
> +	return err;
> +}
> +
>   /*
>    * There are two policies for allocating an inode.  If the new inode is
>    * a directory, then a forward search is made for a block group with both
> @@ -771,7 +891,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>   	struct inode *ret;
>   	ext4_group_t i;
>   	ext4_group_t flex_group;
> -	struct ext4_group_info *grp;
> +	struct ext4_group_info *grp = NULL;
>   	int encrypt = 0;
> 
>   	/* Cannot create files in a deleted directory */
> @@ -909,15 +1029,21 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>   		if (ext4_free_inodes_count(sb, gdp) == 0)
>   			goto next_group;
> 
> -		grp = ext4_get_group_info(sb, group);
> -		/* Skip groups with already-known suspicious inode tables */
> -		if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
> -			goto next_group;
> +		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
> +			grp = ext4_get_group_info(sb, group);
> +			/*
> +			 * Skip groups with already-known suspicious inode
> +			 * tables
> +			 */
> +			if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp))
> +				goto next_group;
> +		}
> 
>   		brelse(inode_bitmap_bh);
>   		inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
>   		/* Skip groups with suspicious inode tables */
> -		if (EXT4_MB_GRP_IBITMAP_CORRUPT(grp) ||
> +		if (((!(sbi->s_mount_state & EXT4_FC_REPLAY))
> +		     && EXT4_MB_GRP_IBITMAP_CORRUPT(grp)) ||
>   		    IS_ERR(inode_bitmap_bh)) {
>   			inode_bitmap_bh = NULL;
>   			goto next_group;
> @@ -936,7 +1062,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>   			goto next_group;
>   		}
> 
> -		if (!handle) {
> +		if ((!(sbi->s_mount_state & EXT4_FC_REPLAY)) && !handle) {
>   			BUG_ON(nblocks <= 0);
>   			handle = __ext4_journal_start_sb(dir->i_sb, line_no,
>   				 handle_type, nblocks, 0,
> @@ -1040,9 +1166,15 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>   	/* Update the relevant bg descriptor fields */
>   	if (ext4_has_group_desc_csum(sb)) {
>   		int free;
> -		struct ext4_group_info *grp = ext4_get_group_info(sb, group);
> -
> -		down_read(&grp->alloc_sem); /* protect vs itable lazyinit */
> +		struct ext4_group_info *grp = NULL;
> +
> +		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
> +			grp = ext4_get_group_info(sb, group);
> +			down_read(&grp->alloc_sem); /*
> +						     * protect vs itable
> +						     * lazyinit
> +						     */
> +		}
>   		ext4_lock_group(sb, group); /* while we modify the bg desc */
>   		free = EXT4_INODES_PER_GROUP(sb) -
>   			ext4_itable_unused_count(sb, gdp);
> @@ -1058,7 +1190,8 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>   		if (ino > free)
>   			ext4_itable_unused_set(sb, gdp,
>   					(EXT4_INODES_PER_GROUP(sb) - ino));
> -		up_read(&grp->alloc_sem);
> +		if (!(sbi->s_mount_state & EXT4_FC_REPLAY))
> +			up_read(&grp->alloc_sem);
>   	} else {
>   		ext4_lock_group(sb, group);
>   	}
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 26eed76812f9..9dce088171cc 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -101,8 +101,8 @@ static int ext4_inode_csum_verify(struct inode *inode, struct ext4_inode *raw,
>   	return provided == calculated;
>   }
> 
> -static void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
> -				struct ext4_inode_info *ei)
> +void ext4_inode_csum_set(struct inode *inode, struct ext4_inode *raw,
> +			 struct ext4_inode_info *ei)
>   {
>   	__u32 csum;
> 
> @@ -514,7 +514,8 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>   		return -EFSCORRUPTED;
> 
>   	/* Lookup extent status tree firstly */
> -	if (ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
> +	if (!(EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY) &&
> +	    ext4_es_lookup_extent(inode, map->m_lblk, NULL, &es)) {
>   		if (ext4_es_is_written(&es) || ext4_es_is_unwritten(&es)) {
>   			map->m_pblk = ext4_es_pblock(&es) +
>   					map->m_lblk - es.es_lblk;
> @@ -827,7 +828,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
>   	int create = map_flags & EXT4_GET_BLOCKS_CREATE;
>   	int err;
> 
> -	J_ASSERT(handle != NULL || create == 0);
> +	J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +		 || handle != NULL || create == 0);
> 
>   	map.m_lblk = block;
>   	map.m_len = 1;
> @@ -843,7 +845,8 @@ struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
>   		return ERR_PTR(-ENOMEM);
>   	if (map.m_flags & EXT4_MAP_NEW) {
>   		J_ASSERT(create != 0);
> -		J_ASSERT(handle != NULL);
> +		J_ASSERT((EXT4_SB(inode->i_sb)->s_mount_state & EXT4_FC_REPLAY)
> +			 || (handle != NULL));
> 
>   		/*
>   		 * Now that we do not always journal data, we should
> @@ -4255,22 +4258,22 @@ int ext4_truncate(struct inode *inode)
>    * data in memory that is needed to recreate the on-disk version of this
>    * inode.
>    */
> -static int __ext4_get_inode_loc(struct inode *inode,
> -				struct ext4_iloc *iloc, int in_mem)
> +static int __ext4_get_inode_loc(struct super_block *sb, unsigned long ino,
> +				struct ext4_iloc *iloc, int in_mem,
> +				ext4_fsblk_t *ret_block)
>   {
>   	struct ext4_group_desc	*gdp;
>   	struct buffer_head	*bh;
> -	struct super_block	*sb = inode->i_sb;
>   	ext4_fsblk_t		block;
>   	struct blk_plug		plug;
>   	int			inodes_per_block, inode_offset;
> 
>   	iloc->bh = NULL;
> -	if (inode->i_ino < EXT4_ROOT_INO ||
> -	    inode->i_ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
> +	if (ino < EXT4_ROOT_INO ||
> +	    ino > le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count))
>   		return -EFSCORRUPTED;
> 
> -	iloc->block_group = (inode->i_ino - 1) / EXT4_INODES_PER_GROUP(sb);
> +	iloc->block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
>   	gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
>   	if (!gdp)
>   		return -EIO;
> @@ -4279,7 +4282,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
>   	 * Figure out the offset within the block group inode table
>   	 */
>   	inodes_per_block = EXT4_SB(sb)->s_inodes_per_block;
> -	inode_offset = ((inode->i_ino - 1) %
> +	inode_offset = ((ino - 1) %
>   			EXT4_INODES_PER_GROUP(sb));
>   	block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
>   	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
> @@ -4380,7 +4383,7 @@ static int __ext4_get_inode_loc(struct inode *inode,
>   		 * has in-inode xattrs, or we don't have this inode in memory.
>   		 * Read the block from disk.
>   		 */
> -		trace_ext4_load_inode(inode);
> +		trace_ext4_load_inode(sb, ino);
>   		get_bh(bh);
>   		bh->b_end_io = end_buffer_read_sync;
>   		submit_bh(REQ_OP_READ, REQ_META | REQ_PRIO, bh);
> @@ -4388,8 +4391,8 @@ static int __ext4_get_inode_loc(struct inode *inode,
>   		wait_on_buffer(bh);
>   		if (!buffer_uptodate(bh)) {
>   		simulate_eio:
> -			ext4_error_inode_block(inode, block, EIO,
> -					       "unable to read itable block");
> +			if (ret_block)
> +				*ret_block = block;
>   			brelse(bh);
>   			return -EIO;
>   		}
> @@ -4399,11 +4402,43 @@ static int __ext4_get_inode_loc(struct inode *inode,
>   	return 0;
>   }
> 
> +static int __ext4_get_inode_loc_noinmem(struct inode *inode,
> +					struct ext4_iloc *iloc)
> +{
> +	ext4_fsblk_t err_blk;
> +	int ret;
> +
> +	ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc, 0,
> +					&err_blk);
> +
> +	if (ret == -EIO)
> +		ext4_error_inode_block(inode, err_blk, EIO,
> +					"unable to read itable block");
> +
> +	return ret;
> +}
> +
>   int ext4_get_inode_loc(struct inode *inode, struct ext4_iloc *iloc)
>   {
> +	ext4_fsblk_t err_blk;
> +	int ret;
> +
>   	/* We have all inode data except xattrs in memory here. */
> -	return __ext4_get_inode_loc(inode, iloc,
> -		!ext4_test_inode_state(inode, EXT4_STATE_XATTR));
> +	ret = __ext4_get_inode_loc(inode->i_sb, inode->i_ino, iloc,
> +		!ext4_test_inode_state(inode, EXT4_STATE_XATTR), &err_blk);
> +
> +	if (ret == -EIO)
> +		ext4_error_inode_block(inode, err_blk, EIO,
> +					"unable to read itable block");
> +
> +	return ret;
> +}
> +
> +
> +int ext4_get_fc_inode_loc(struct super_block *sb, unsigned long ino,
> +			  struct ext4_iloc *iloc)
> +{
> +	return __ext4_get_inode_loc(sb, ino, iloc, 0, NULL);
>   }
> 
>   static bool ext4_should_enable_dax(struct inode *inode)
> @@ -4569,7 +4604,7 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>   	ei = EXT4_I(inode);
>   	iloc.bh = NULL;
> 
> -	ret = __ext4_get_inode_loc(inode, &iloc, 0);
> +	ret = __ext4_get_inode_loc_noinmem(inode, &iloc);
>   	if (ret < 0)
>   		goto bad_inode;
>   	raw_inode = ext4_raw_inode(&iloc);
> @@ -4615,10 +4650,11 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>   					      sizeof(gen));
>   	}
> 
> -	if (!ext4_inode_csum_verify(inode, raw_inode, ei) ||
> -	    ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) {
> -		ext4_error_inode_err(inode, function, line, 0, EFSBADCRC,
> -				     "iget: checksum invalid");
> +	if ((!ext4_inode_csum_verify(inode, raw_inode, ei) ||
> +	    ext4_simulate_fail(sb, EXT4_SIM_INODE_CRC)) &&
> +	     (!(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY))) {
> +		ext4_error_inode_err(inode, function, line, 0,
> +				EFSBADCRC, "iget: checksum invalid");
>   		ret = -EFSBADCRC;
>   		goto bad_inode;
>   	}
> @@ -4772,9 +4808,10 @@ struct inode *__ext4_iget(struct super_block *sb, unsigned long ino,
>   		goto bad_inode;
>   	} else if (!ext4_has_inline_data(inode)) {
>   		/* validate the block references in the inode */
> -		if (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> -		   (S_ISLNK(inode->i_mode) &&
> -		    !ext4_inode_is_fast_symlink(inode))) {
> +		if (!(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY) &&
> +			(S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode) ||
> +			(S_ISLNK(inode->i_mode) &&
> +			!ext4_inode_is_fast_symlink(inode)))) {
>   			if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>   				ret = ext4_ext_check_inode(inode);
>   			else
> @@ -5158,7 +5195,7 @@ int ext4_write_inode(struct inode *inode, struct writeback_control *wbc)
>   	} else {
>   		struct ext4_iloc iloc;
> 
> -		err = __ext4_get_inode_loc(inode, &iloc, 0);
> +		err = __ext4_get_inode_loc_noinmem(inode, &iloc);
>   		if (err)
>   			return err;
>   		/*
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index d2f8f50deef6..f0381876a7e5 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -86,7 +86,7 @@ static void swap_inode_data(struct inode *inode1, struct inode *inode2)
>   	i_size_write(inode2, isize);
>   }
> 
> -static void reset_inode_seed(struct inode *inode)
> +void ext4_reset_inode_seed(struct inode *inode)
>   {
>   	struct ext4_inode_info *ei = EXT4_I(inode);
>   	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> @@ -200,8 +200,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
> 
>   	inode->i_generation = prandom_u32();
>   	inode_bl->i_generation = prandom_u32();
> -	reset_inode_seed(inode);
> -	reset_inode_seed(inode_bl);
> +	ext4_reset_inode_seed(inode);
> +	ext4_reset_inode_seed(inode_bl);
> 
>   	ext4_discard_preallocations(inode, 0);
> 
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 132c118d12e1..ea894529118a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -1508,14 +1508,16 @@ static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
> 
>   		blocknr = ext4_group_first_block_no(sb, e4b->bd_group);
>   		blocknr += EXT4_C2B(sbi, block);
> -		ext4_grp_locked_error(sb, e4b->bd_group,
> -				      inode ? inode->i_ino : 0,
> -				      blocknr,
> -				      "freeing already freed block "
> -				      "(bit %u); block bitmap corrupt.",
> -				      block);
> -		ext4_mark_group_bitmap_corrupted(sb, e4b->bd_group,
> +		if (!(sbi->s_mount_state & EXT4_FC_REPLAY)) {
> +			ext4_grp_locked_error(sb, e4b->bd_group,
> +					      inode ? inode->i_ino : 0,
> +					      blocknr,
> +					      "freeing already freed block (bit %u); block bitmap corrupt.",
> +					      block);
> +			ext4_mark_group_bitmap_corrupted(
> +				sb, e4b->bd_group,
>   				EXT4_GROUP_INFO_BBITMAP_CORRUPT);
> +		}
>   		mb_regenerate_buddy(e4b);
>   		goto done;
>   	}
> @@ -3302,6 +3304,86 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
>   	return err;
>   }
> 
> +/*
> + * Idempotent helper for Ext4 fast commit replay path to set the state of
> + * blocks in bitmaps and update counters.
> + */
> +void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
> +			int len, int state)
> +{
> +	struct buffer_head *bitmap_bh = NULL;
> +	struct ext4_group_desc *gdp;
> +	struct buffer_head *gdp_bh;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	ext4_group_t group;
> +	ext4_fsblk_t cluster;

I guess the variable cluster is never used. We might as well drop it.

-ritesh


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 2/9] ext4: add fast_commit feature and handling for extended mount options
  2020-09-19  0:54 ` [PATCH v9 2/9] ext4: add fast_commit feature and handling for extended mount options Harshad Shirwadkar
@ 2020-10-09 17:58   ` Theodore Y. Ts'o
  2020-10-13  0:27     ` harshad shirwadkar
  0 siblings, 1 reply; 29+ messages in thread
From: Theodore Y. Ts'o @ 2020-10-09 17:58 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Fri, Sep 18, 2020 at 05:54:44PM -0700, Harshad Shirwadkar wrote:
> We are running out of mount option bits. Add handling for using
> s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
> ability to turn on / off the fast commit feature in Ext4.

Shouldn't that read "...ability to turn off the fast commit feature via a
mount option"?

> @@ -2207,10 +2211,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
>  			WARN_ON(1);
>  			return -1;
>  		}
> -		if (arg != 0)
> -			sbi->s_mount_opt |= m->mount_opt;
> -		else
> -			sbi->s_mount_opt &= ~m->mount_opt;
> +		if (m->flags & MOPT_2) {
> +			if (arg != 0)
> +				sbi->s_mount_opt2 |= m->mount_opt;
> +			else
> +				sbi->s_mount_opt2 &= ~m->mount_opt;
> +		} else {
> +			if (arg != 0)
> +				sbi->s_mount_opt |= m->mount_opt;
> +			else
> +				sbi->s_mount_opt &= ~m->mount_opt;
> +		}
>  	}
>  	return 1;
>  }


This requires a matching change in _ext4_show_options(), so that the
MOPT_2 options are properly displayed in /proc/mounts.
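
To illustrate the symmetry being asked for: whichever word
handle_mount_opt() sets a bit in, the display side has to test that same
word. A minimal userspace sketch under stated assumptions follows;
toy_sbi, toy_mopt, TOY_MOPT_2 and the toy_* helpers are hypothetical
names, and this is not the actual _ext4_show_options() code.

```c
#include <stdbool.h>

/* Hypothetical flag: the option bit lives in the second word. */
#define TOY_MOPT_2 0x1

struct toy_sbi {
	unsigned int s_mount_opt;
	unsigned int s_mount_opt2;
};

struct toy_mopt {
	unsigned int mount_opt;	/* bit to set or clear */
	unsigned int flags;	/* TOY_MOPT_2 selects the second word */
};

/* Pick the option word that this table entry refers to. */
static unsigned int *toy_opt_word(struct toy_sbi *sbi,
				  const struct toy_mopt *m)
{
	return (m->flags & TOY_MOPT_2) ? &sbi->s_mount_opt2
				       : &sbi->s_mount_opt;
}

/* Parsing side: set or clear the bit in the selected word. */
static void toy_set_opt(struct toy_sbi *sbi, const struct toy_mopt *m,
			int arg)
{
	unsigned int *word = toy_opt_word(sbi, m);

	if (arg != 0)
		*word |= m->mount_opt;
	else
		*word &= ~m->mount_opt;
}

/* Display side: must consult the same word the parser wrote to. */
static bool toy_opt_is_set(struct toy_sbi *sbi, const struct toy_mopt *m)
{
	return (*toy_opt_word(sbi, m) & m->mount_opt) != 0;
}
```

Routing both sides through one selector keeps two entries that share a
bit value but live in different words from being confused with each
other.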

						- Ted

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature
  2020-09-19  0:54 ` [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
  2020-09-22 17:50   ` Darrick J. Wong
@ 2020-10-09 18:28   ` Theodore Y. Ts'o
  2020-10-13  0:27     ` harshad shirwadkar
  1 sibling, 1 reply; 29+ messages in thread
From: Theodore Y. Ts'o @ 2020-10-09 18:28 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Fri, Sep 18, 2020 at 05:54:43PM -0700, Harshad Shirwadkar wrote:
> This patch adds necessary documentation for fast commits.
> 
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> ---
>  Documentation/filesystems/ext4/journal.rst | 66 ++++++++++++++++++++++
>  Documentation/filesystems/journalling.rst  | 28 +++++++++
>  2 files changed, 94 insertions(+)
> 
> diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> index ea613ee701f5..c2e4d010a201 100644
> --- a/Documentation/filesystems/ext4/journal.rst
> +++ b/Documentation/filesystems/ext4/journal.rst
> @@ -609,3 +620,58 @@ bytes long (but uses a full block):
>       - h\_commit\_nsec
>       - Nanoseconds component of the above timestamp.
>  
> +Fast commits
> +~~~~~~~~~~~~
> +
> +Fast commit area is organized as a log of tag tag length values. Each TLV has

s/tag tag/tag/

> +
> +File system is free to perform fast commits as and when it wants as long as it
> +gets permission from JBD2 to do so by calling the function
> +:c:func:`jbd2_fc_start()`. Once a fast commit is done, the client
> +file  system should tell JBD2 about it by calling :c:func:`jbd2_fc_stop()`.
> +If file system wants JBD2 to perform a full commit immediately after stopping
> +the fast commit it can do so by calling :c:func:`jbd2_fc_stop_do_commit()`.
> +This is useful if fast commit operation fails for some reason and the only way
> +to guarantee consistency is for JBD2 to perform the full traditional commit.

One of the things which is a bit confusing is that there is a
substantial part of the fast commit functionality which is implemented
in ext4, and not in the jbd2 layer.

We can't just talk about ext4_fc_start_update() and
ext4_fc_stop_update() here, since it would be a bit of a layering
violation.  But some kind of explanation of how a file system would
use the jbd2 fast commit framework would be useful, along with the big
picture view of how the ext4 fast commit infrastructure (which is
currently documented in the top-level comments of
fs/ext4/fast_commit.c) fits into the jbd2 infrastructure.

Maybe put the big picture explanation in fs/ext4/fast_commit.c and
then put a pointer in journalling.rst to the comments in
fs/ext4/fast_commit.c as an example of how the jbd2 fast_commit
infrastructure would get used (for example, if ocfs2 ever got
interested in doing something similar)?  Or maybe we need to move some
of the description from comments in fast_commit.c to a file in
Documentation/filesystems/ext4/fast_commit.rst, perhaps?

						- Ted


* Re: [PATCH v9 5/9] ext4: main fast-commit commit path
  2020-09-19  0:54 ` [PATCH v9 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
  2020-09-19  8:19   ` kernel test robot
  2020-10-09 17:04   ` Ritesh Harjani
@ 2020-10-09 19:14   ` Theodore Y. Ts'o
  2020-10-13  0:27     ` harshad shirwadkar
  2 siblings, 1 reply; 29+ messages in thread
From: Theodore Y. Ts'o @ 2020-10-09 19:14 UTC (permalink / raw)
  To: Harshad Shirwadkar; +Cc: linux-ext4

On Fri, Sep 18, 2020 at 05:54:47PM -0700, Harshad Shirwadkar wrote:
>  fs/jbd2/commit.c            |   42 ++
>  fs/jbd2/journal.c           |  119 +++-

Why are these changes here instead of the previous commit (jbd2: add
fast commit machinery)?

> diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> index ba35ecb18616..dadd9994e74b 100644
> --- a/fs/jbd2/commit.c
> +++ b/fs/jbd2/commit.c
> @@ -202,6 +202,47 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
>  	return ret;
>  }
>  
> +/* Send all the data buffers related to an inode */
> +int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> +{
> +	struct address_space *mapping;
> +	loff_t dirty_start;
> +	loff_t dirty_end;
> +	int ret;
> +
> +	if (!jinode)
> +		return 0;
> +
> +	dirty_start = jinode->i_dirty_start;
> +	dirty_end = jinode->i_dirty_end;
> +
> +	if (!(jinode->i_flags & JI_WRITE_DATA))
> +		return 0;
> +
> +	dirty_start = jinode->i_dirty_start;
> +	dirty_end = jinode->i_dirty_end;

Why are dirty_start and dirty_end initialized twice?

Also, this is going to conflict with Mauricio's data=journal patches,
which you'll notice when you rebase these patches on the current dev branch.

(The dev branch temporarily had your v9 patches merged in, so we could
get the test bots to comment on your changes, but I've since pulled
the fc patches back out.)

					- Ted


* Re: [PATCH v9 5/9] ext4: main fast-commit commit path
  2020-10-09 17:04   ` Ritesh Harjani
@ 2020-10-13  0:25     ` harshad shirwadkar
  0 siblings, 0 replies; 29+ messages in thread
From: harshad shirwadkar @ 2020-10-13  0:25 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Ext4 Developers List, Theodore Y. Ts'o

Replies inlined, stripping down the original patch to make this more readable.

On Fri, Oct 9, 2020 at 10:04 AM Ritesh Harjani <riteshh@linux.ibm.com> wrote:
>
>
>
> On 9/19/20 6:24 AM, Harshad Shirwadkar wrote:
> > This patch adds main fast commit commit path handlers. The overall
> > patch can be divided into two inter-related parts:
> >
> > (A) Metadata updates tracking
> >
> >      This part consists of helper functions to track changes that need
> >      to be committed during a commit operation. These updates are
> >      maintained by Ext4 in different in-memory queues. Following are
> >      the APIs and their short description that are implemented in this
> >      patch:
> >
> >      - ext4_fc_track_link/unlink/creat() - Track unlink. link and creat
> >        operations
> >      - ext4_fc_track_range() - Track changed logical block offsets
> >        inodes
> >      - ext4_fc_track_inode() - Track inodes
> >      - ext4_fc_mark_ineligible() - Mark file system fast commit
> >        ineligible()
> >      - ext4_fc_start_update() / ext4_fc_stop_update() /
> >        ext4_fc_start_ineligible() / ext4_fc_stop_ineligible() These
> >        functions are useful for co-ordinating inode updates with
> >        commits.
> >
> > (B) Main commit Path
> >
> >      This part consists of functions to convert updates tracked in
> >      in-memory data structures into on-disk commits. Function
> >      ext4_fc_commit() is the main entry point to commit path.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > ---
> >   fs/ext4/acl.c               |    2 +
> >   fs/ext4/ext4.h              |   61 ++
> >   fs/ext4/extents.c           |   48 +-
> >   fs/ext4/fast_commit.c       | 1209 +++++++++++++++++++++++++++++++++++
> >   fs/ext4/fast_commit.h       |  111 ++++
> >   fs/ext4/file.c              |   10 +-
> >   fs/ext4/fsync.c             |    2 +-
> >   fs/ext4/inode.c             |   41 +-
> >   fs/ext4/ioctl.c             |   16 +-
> >   fs/ext4/namei.c             |   36 +-
> >   fs/ext4/super.c             |   31 +
> >   fs/ext4/xattr.c             |    3 +
> >   fs/jbd2/commit.c            |   42 ++
> >   fs/jbd2/journal.c           |  119 +++-
> >   include/linux/jbd2.h        |    6 +
> >   include/trace/events/ext4.h |  172 +++++
> >   16 files changed, 1882 insertions(+), 27 deletions(-)
> >
> > diff --git a/fs/ext4/acl.c b/fs/ext4/acl.c
> > index 76f634d185f1..68aaed48315f 100644
> > --- a/fs/ext4/acl.c
> > +++ b/fs/ext4/acl.c
> > @@ -242,6 +242,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
> >       handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits);
> >       if (IS_ERR(handle))
> >               return PTR_ERR(handle);
> > +     ext4_fc_start_update(inode);
> >
> >       if ((type == ACL_TYPE_ACCESS) && acl) {
> >               error = posix_acl_update_mode(inode, &mode, &acl);
> > @@ -259,6 +260,7 @@ ext4_set_acl(struct inode *inode, struct posix_acl *acl, int type)
> >       }
> >   out_stop:
> >       ext4_journal_stop(handle);
> > +     ext4_fc_stop_update(inode);
> >       if (error == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> >               goto retry;
> >       return error;
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 9af3971dd12e..27d48d166e5d 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -1022,6 +1022,27 @@ struct ext4_inode_info {
> >
> >       struct list_head i_orphan;      /* unlinked but open inodes */
> >
> > +     /* Fast commit related info */
> > +
> > +     struct list_head i_fc_list;     /*
> > +                                      * inodes that need fast commit
> > +                                      * protected by sbi->s_fc_lock.
> > +                                      */
> > +
> > +     /* Start of lblk range that needs to be committed in this fast commit */
> > +     ext4_lblk_t i_fc_lblk_start;
> > +
> > +     /* End of lblk range that needs to be committed in this fast commit */
> > +     ext4_lblk_t i_fc_lblk_len;
> > +
> > +     /* Number of ongoing updates on this inode */
> > +     atomic_t  i_fc_updates;
> > +
> > +     /* Fast commit wait queue for this inode */
> > +     wait_queue_head_t i_fc_wait;
> > +
> > +     struct mutex i_fc_lock;
>
> Although the code does explain the use of i_fc_lock.
> But a small comment here explaining what does this mutex lock protects
> will be helpful.
Sounds good, I'll add that.
>
> >       /*
> >        * i_disksize keeps track of what the inode size is ON DISK, not
> >        * in memory.  During truncate, i_size is set to the new size by
> > @@ -1142,6 +1163,10 @@ struct ext4_inode_info {
> >   #define     EXT4_VALID_FS                   0x0001  /* Unmounted cleanly */
> >   #define     EXT4_ERROR_FS                   0x0002  /* Errors detected */
> >   #define     EXT4_ORPHAN_FS                  0x0004  /* Orphans being recovered */
> > +#define EXT4_FC_INELIGIBLE           0x0008  /* Fast commit ineligible */
> > +#define EXT4_FC_COMMITTING           0x0010  /* File system underoing a fast
> > +                                              * commit.
> > +                                              */
> >
> >   /*
> >    * Misc. filesystem flags
> > @@ -1614,6 +1639,25 @@ struct ext4_sb_info {
> >       /* Record the errseq of the backing block device */
> >       errseq_t s_bdev_wb_err;
> >       spinlock_t s_bdev_wb_lock;
> > +
> > +     /* Ext4 fast commit stuff */
> > +     atomic_t s_fc_subtid;
> > +     atomic_t s_fc_ineligible_updates;
> > +     /*
> > +      * After commit starts, the main queue gets locked, and the further
> > +      * updates get added in the the staging queue
> > +      */
> > +#define FC_Q_MAIN    0
> > +#define FC_Q_STAGING 1
> > +     struct list_head s_fc_q[2];     /* Inodes staged for fast commit
> > +                                      * that have data changes in them.
> > +                                      */
> > +     struct list_head s_fc_dentry_q[2];      /* directory entry updates */
> > +     int s_fc_bytes;
>
> We don't need unsigned long for s_fc_bytes here is it?
Yes, we do. I'll fix this in V10.
>
>
> > +     spinlock_t s_fc_lock;
>
> some comment pls explaining what all this s_fc_lock protects?
Thanks, I'll add that.
>
> > +     struct buffer_head *s_fc_bh;
> > +     struct ext4_fc_stats s_fc_stats;
> > +     u64 s_fc_avg_commit_time;
> >   };
>
>
> Also I really like this i_fc_** & s_fc_** & ext4_fc_**() convention :)
Thanks! :)
>
>
> >
> >   static inline struct ext4_sb_info *EXT4_SB(struct super_block *sb)
> > @@ -1724,6 +1768,7 @@ enum {
> >       EXT4_STATE_EXT_PRECACHED,       /* extents have been precached */
> >       EXT4_STATE_LUSTRE_EA_INODE,     /* Lustre-style ea_inode */
> >       EXT4_STATE_VERITY_IN_PROGRESS,  /* building fs-verity Merkle tree */
> > +     EXT4_STATE_FC_COMMITTING,       /* Fast commit ongoing */
> >   };
> >
> >   #define EXT4_INODE_BIT_FNS(name, field, offset)                             \
> > @@ -2683,6 +2728,22 @@ extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
> >   /* fast_commit.c */
> >
> >   void ext4_fc_init(struct super_block *sb, journal_t *journal);
> > +void ext4_fc_init_inode(struct inode *inode);
> > +void ext4_fc_track_range(struct inode *inode, ext4_lblk_t start,
> > +                      ext4_lblk_t end);
> > +void ext4_fc_track_unlink(struct inode *inode, struct dentry *dentry);
> > +void ext4_fc_track_link(struct inode *inode, struct dentry *dentry);
> > +void ext4_fc_track_create(struct inode *inode, struct dentry *dentry);
> > +void ext4_fc_track_inode(struct inode *inode);
> > +void ext4_fc_mark_ineligible(struct super_block *sb, int reason);
> > +void ext4_fc_start_ineligible(struct super_block *sb, int reason);
> > +void ext4_fc_stop_ineligible(struct super_block *sb);
> > +void ext4_fc_start_update(struct inode *inode);
> > +void ext4_fc_stop_update(struct inode *inode);
> > +void ext4_fc_del(struct inode *inode);
> > +int ext4_fc_commit(journal_t *journal, tid_t commit_tid);
> > +int __init ext4_fc_init_dentry_cache(void);
> > +
> >   /* mballoc.c */
> >   extern const struct seq_operations ext4_mb_seq_groups_ops;
> >   extern long ext4_mb_stats;
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index a0481582187a..8de236fedade 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -3723,6 +3723,7 @@ static int ext4_convert_unwritten_extents_endio(handle_t *handle,
> >       err = ext4_ext_dirty(handle, inode, path + path->p_depth);
> >   out:
> >       ext4_ext_show_leaf(inode, path);
> > +     ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
> >       return err;
> >   }
> >
> > @@ -3794,6 +3795,7 @@ convert_initialized_extent(handle_t *handle, struct inode *inode,
> >       if (*allocated > map->m_len)
> >               *allocated = map->m_len;
> >       map->m_len = *allocated;
> > +     ext4_fc_track_range(inode, ee_block, ee_block + ee_len - 1);
> >       return 0;
> >   }
> >
> > @@ -4327,7 +4329,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
> >       map->m_len = ar.len;
> >       allocated = map->m_len;
> >       ext4_ext_show_leaf(inode, path);
> > -
> > +     ext4_fc_track_range(inode, map->m_lblk, map->m_len);
>
> Shouldn't the last argument be map->m_lblk + map->m_len - 1, no?
Thanks for catching this! I'll fix this in V10.
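
For reference, the other ext4_fc_track_range() call sites in this patch pass
an inclusive last block (ee_block + ee_len - 1), so a length has to be
converted before it is passed in. A minimal sketch of that conversion, where
lblk_t and range_last_lblk() are hypothetical names used only for this
example:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t lblk_t;	/* stand-in for ext4_lblk_t */

/* Last logical block covered by len blocks starting at start (inclusive). */
static inline lblk_t range_last_lblk(lblk_t start, lblk_t len)
{
	assert(len > 0);	/* an empty range has no last block */
	return start + len - 1;
}
```

With this, a mapping of 8 blocks starting at block 100 ends at block 107, not
at block 8.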
>
>
>
> > + * (B) File specific data range updates:
> > + *
> > + * - EXT4_FC_TAG_ADD_RANGE   - records addition of new blocks to an inode
> > + * - EXT4_FC_TAG_DEL_RANGE   - recordd deletion of blocks from an inode
>
> s/recordd/records
Ack
>
>
> > + *
> > + * (C) Inode metadata (mtime / ctime etc):
> > + *
> > + * - EXT4_FC_TAG_INODE_FULL  - record the inode that should be replayed as is
> > + *                             during recovery. This tag is useful for
> > + *                             setting up a new inode. This tag is necessary
> > + *                             for CREAT tag.
> > + * - EXT4_FC_TAG_INODE_PARTIAL       - record the inode that should only partially be
> > + *                             be replayed during recovery. This tag is
> > + *                             useful when the same inode has had updates
> > + *                             recorded as ADD_RANGE or DEL_RANGE tags.
>
> So we keep INODE_FULL to only track the inode metdata.
> and PARTIAL is to track for any extent add/del range operations right?
> Is there anything else that I am missing?
The main difference between INODE_FULL and PARTIAL is whether the
i_block[] fields in the inode should be overwritten. If we are
dealing with an inode that has inline data, the fast commit recovery
operation needs to overwrite the old data with the new data. So the
following flow of operations will result in a fast commit with an
"INODE_FULL" tag:

- New Inode A created with inline data "old"
- Full commit operation (now A is persisted on disk)
- Inode A's contents were modified to "new"
- fsync() (in this case Inode A will be recorded on disk with "INODE_FULL" tag)

In the add range / del range case, however, we modify the inode's
extent tree, so the resulting inode->i_block[] may be different from
what it was before the crash and should not be overwritten. That's why
we also need a partial tag.
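
The distinction can be modelled in a few lines of userspace C. This is only
a sketch under assumptions: struct model_inode, enum replay_mode and
replay_inode() are hypothetical and much simpler than the real on-disk inode
and replay code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified inode layout. */
struct model_inode {
	unsigned int mtime;
	unsigned int size;
	unsigned int i_block[15];
	unsigned int generation;
};

enum replay_mode { REPLAY_FULL, REPLAY_PARTIAL };

/* Apply a logged inode image to the on-disk copy. */
static inline void replay_inode(struct model_inode *disk,
				const struct model_inode *logged,
				enum replay_mode mode)
{
	if (mode == REPLAY_FULL) {
		/* Overwrite everything, including i_block[] (inline data). */
		*disk = *logged;
		return;
	}
	/* Partial: copy the fields before i_block[] ... */
	memcpy(disk, logged, offsetof(struct model_inode, i_block));
	/* ... and the fields after it, leaving i_block[] untouched. */
	disk->generation = logged->generation;
}
```

In the partial case i_block[] keeps whatever the replayed range tags left
there, while the remaining metadata comes from the logged image.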
>
>
> > +/*
> > + * Inform Ext4's fast about start of an inode update
> > + *
> > + * This function is called by the high level call VFS callbacks before
> > + * performing any inode update. This function blocks if there's an ongoing
> > + * fast commit on the inode in question.
> > + */
> > +void ext4_fc_start_update(struct inode *inode)
> > +{
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +
> > +     if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> > +             return;
> > +
> > +restart:
> > +     spin_lock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> > +     if (list_empty(&EXT4_I(inode)->i_fc_list))
>
> We can use ei->i_fc_list directly.
Ack
>
>
>
> > +             goto out;
> > +
> > +     if (ext4_test_inode_state(inode, EXT4_STATE_FC_COMMITTING)) {
> > +             wait_queue_head_t *wq;
> > +#if (BITS_PER_LONG < 64)
> > +             DEFINE_WAIT_BIT(wait, &ei->i_state_flags,
> > +                             EXT4_STATE_FC_COMMITTING);
> > +             wq = bit_waitqueue(&ei->i_state_flags,
> > +                                EXT4_STATE_FC_COMMITTING);
> > +#else
> > +             DEFINE_WAIT_BIT(wait, &ei->i_flags,
> > +                             EXT4_STATE_FC_COMMITTING);
> > +             wq = bit_waitqueue(&ei->i_flags,
> > +                                EXT4_STATE_FC_COMMITTING);
> > +#endif
> > +             prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> > +             spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> > +             schedule();
> > +             finish_wait(wq, &wait.wq_entry);
> > +             goto restart;
> > +     }
> > +out:
> > +     atomic_inc(&ei->i_fc_updates);
> > +     spin_unlock(&EXT4_SB(inode->i_sb)->s_fc_lock);
> > +}
> > +
> > +/*
> > + * Stop inode update and wake up waiting fast commits if any.
> > + */
> > +void ext4_fc_stop_update(struct inode *inode)
> > +{
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +
> > +     if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> > +             return;
> > +
> > +     if (atomic_dec_and_test(&ei->i_fc_updates))
> > +             wake_up_all(&ei->i_fc_wait);
> > +}
> > +
> > +/*
> > + * Remove inode from fast commit list. If the inode is being committed
> > + * we wait until inode commit is done.
> > + */
> > +void ext4_fc_del(struct inode *inode)
> > +{
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +
> > +     if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> > +             return;
> > +
> > +
> > +     if (!test_opt2(inode->i_sb, JOURNAL_FAST_COMMIT))
> > +             return;
>
>
> Redundant check, so let's remove this.
Ack
>
>
>
> > +
> > +/* __track_fn for directory entry updates. Called with ei->i_fc_lock. */
> > +static int __track_dentry_update(struct inode *inode, void *arg, bool update)
> > +{
> > +     struct ext4_fc_dentry_update *node;
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +     struct __track_dentry_update_args *dentry_update =
> > +             (struct __track_dentry_update_args *)arg;
> > +     struct dentry *dentry = dentry_update->dentry;
> > +     struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> > +
> > +     mutex_unlock(&ei->i_fc_lock);
> > +     node = kmem_cache_alloc(ext4_fc_dentry_cachep, GFP_NOFS);
> > +     if (!node) {
> > +             ext4_fc_mark_ineligible(inode->i_sb, EXT4_FC_REASON_MEM);
> > +             mutex_lock(&ei->i_fc_lock);
> > +             return -ENOMEM;
> > +     }
> > +
> > +     node->fcd_op = dentry_update->op;
> > +     node->fcd_parent = dentry->d_parent->d_inode->i_ino;
> > +     node->fcd_ino = inode->i_ino;
> > +     if (dentry->d_name.len > DNAME_INLINE_LEN) {
> > +             node->fcd_name.name = kmalloc(dentry->d_name.len, GFP_KERNEL);
>
>
> Shouldn't this be GFP_NOFS too?
Ack, will fix this.
>
>
>
> > +static void ext4_fc_submit_bh(struct super_block *sb)
> > +{
> > +     int write_flags = REQ_SYNC;
> > +     struct buffer_head *bh = EXT4_SB(sb)->s_fc_bh;
> > +
> > +     if (test_opt(sb, BARRIER))
> > +             write_flags |= REQ_FUA | REQ_PREFLUSH;
> > +     lock_buffer(bh);
> > +     clear_buffer_dirty(bh);
> > +     set_buffer_uptodate(bh);
>
> Please don't mind me asking this.
> But shouldn't we clear the buffer dirty after the IO is submitted
> and update will be set once the IO is completed?
> Maybe I am missing something here.
Thanks for asking this. What you say makes sense to me, but what I did
here is basically copied from fs/jbd2/commit.c. Is there a reason why
we clear dirty flag and set uptodate in jbd2/commit.c before
submitting bh?
>
>
> > +     bh->b_end_io = ext4_end_buffer_io_sync;
> > +     submit_bh(REQ_OP_WRITE, write_flags, bh);
> > +     EXT4_SB(sb)->s_fc_bh = NULL;
> > +}
> > +
> > +/* Ext4 commit path routines */
> > +
> > +/* memzero and update CRC */
> > +static void *ext4_fc_memzero(struct super_block *sb, void *dst, int len,
> > +                             u32 *crc)
> > +{
> > +     void *ret;
> > +
> > +     ret = memset(dst, 0, len);
> > +     if (crc)
> > +             *crc = ext4_chksum(EXT4_SB(sb), *crc, dst, len);
> > +     return ret;
> > +}
> > +
> > +/*
> > + * Allocate len bytes on a fast commit buffer.
> > + *
> > + * During the commit time this function is used to manage fast commit
> > + * block space. We don't split a fast commit log onto different
> > + * blocks. So this function makes sure that if there's not enough space
> > + * on the current block, the remaining space in the current block is
> > + * marked as unused by adding EXT4_FC_TAG_PAD tag. In that case,
> > + * new block is from jbd2 and CRC is updated to reflect the padding
> > + * we added.
> > + */
> > +static u8 *ext4_fc_reserve_space(struct super_block *sb, int len, u32 *crc)
> > +{
> > +     struct ext4_fc_tl *tl;
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     struct buffer_head *bh;
> > +     int bsize = sbi->s_journal->j_blocksize;
> > +     int ret, off = sbi->s_fc_bytes % bsize;
> > +     int pad_len;
> > +
> > +     if (bsize - off - 1 > len + sizeof(struct ext4_fc_tl)) {
> > +             /*
> > +              * Only allocate from current buffer if we have enough space for
> > +              * this request AND we have space to add a zero byte padding.
> > +              */
> > +             if (!sbi->s_fc_bh) {
> > +                     ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
> > +                     if (ret)
> > +                             return NULL;
> > +                     sbi->s_fc_bh = bh;
> > +             }
> > +             sbi->s_fc_bytes += len;
> > +             return sbi->s_fc_bh->b_data + off;
> > +     }
> > +     /* Need to add PAD tag */
> > +     tl = (struct ext4_fc_tl *)(sbi->s_fc_bh->b_data + off);
> > +     tl->fc_tag = cpu_to_le16(EXT4_FC_TAG_PAD);
> > +     pad_len = bsize - off - 1 - sizeof(struct ext4_fc_tl);
> > +     tl->fc_len = cpu_to_le16(pad_len);
> > +     if (crc)
> > +             *crc = ext4_chksum(sbi, *crc, tl, sizeof(*tl));
> > +     if (pad_len > 0)
> > +             ext4_fc_memzero(sb, tl + 1, pad_len, crc);
> > +     ext4_fc_submit_bh(sb);
> > +
> > +     ret = jbd2_fc_get_buf(EXT4_SB(sb)->s_journal, &bh);
> > +     if (ret)
> > +             return NULL;
> > +     sbi->s_fc_bh = bh;
> > +     sbi->s_fc_bytes = (sbi->s_fc_bytes / bsize + 1) * bsize + len;
>
> why do we need +1 here?
Since we just added padding, we want to skip the remaining bytes in
the current bh and we want s_fc_bytes to go len bytes into the next
block. The value of ((sbi->s_fc_bytes / bsize + 1) * bsize) is
basically s_fc_bytes advanced to the start of the next block.
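
As a worked example of that arithmetic, here is a small userspace sketch.
fc_bytes_after_pad() is a hypothetical helper written for this mail, not a
function in the patch.

```c
#include <assert.h>

/*
 * Byte count of the fast commit area after padding out the current block
 * and reserving len bytes at the start of the next one.
 */
static inline int fc_bytes_after_pad(int fc_bytes, int bsize, int len)
{
	assert(bsize > 0 && fc_bytes >= 0 && len >= 0);
	/* Integer division drops the offset inside the current block. */
	return (fc_bytes / bsize + 1) * bsize + len;
}
```

With a 4096 byte block, 5000 bytes already used and a 20 byte request, the
result is 2 * 4096 + 20 = 8212. Unlike round_up(), the expression moves a
full block forward when fc_bytes is already block aligned.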
>
>
>
> > +     return sbi->s_fc_bh->b_data;
> > +}
> > +
> > +/* memcpy to fc reserved space and update CRC */
> > +static void *ext4_fc_memcpy(struct super_block *sb, void *dst, const void *src,
> > +                             int len, u32 *crc)
> > +{
> > +     if (crc)
> > +             *crc = ext4_chksum(EXT4_SB(sb), *crc, src, len);
> > +     return memcpy(dst, src, len);
> > +}
> > +
> > +/*
> > + * Complete a fast commit by writing tail tag.
> > + *
> > + * Writing tail tag marks the end of a fast commit. In order to guarantee
> > + * atomicity, after writing tail tag, even if there's space remaining
> > + * in the block, next commit shouldn't use it. That's why tail tag
> > + * has the length as that of the remaining space on the block.
> > + */
> > +static int ext4_fc_write_tail(struct super_block *sb, u32 crc)
> > +{
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     struct ext4_fc_tl tl;
> > +     struct ext4_fc_tail tail;
> > +     int off, bsize = sbi->s_journal->j_blocksize;
> > +     u8 *dst;
> > +
> > +     /*
> > +      * ext4_fc_reserve_space takes care of allocating an extra block if
> > +      * there's no enough space on this block for accommodating this tail.
> > +      */
> > +     dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(tail), &crc);
> > +     if (!dst)
> > +             return -ENOSPC;
> > +
> > +     off = sbi->s_fc_bytes % bsize;
> > +
> > +     tl.fc_tag = cpu_to_le16(EXT4_FC_TAG_TAIL);
> > +     tl.fc_len = cpu_to_le16(bsize - off - 1 + sizeof(struct ext4_fc_tail));
> > +     sbi->s_fc_bytes = round_up(sbi->s_fc_bytes, bsize);
> > +
> > +     ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), &crc);
> > +     dst += sizeof(tl);
> > +     tail.fc_tid = cpu_to_le32(sbi->s_journal->j_running_transaction->t_tid);
> > +     ext4_fc_memcpy(sb, dst, &tail.fc_tid, sizeof(tail.fc_tid), &crc);
> > +     dst += sizeof(tail.fc_tid);
> > +     tail.fc_crc = cpu_to_le32(crc);
> > +     ext4_fc_memcpy(sb, dst, &tail.fc_crc, sizeof(tail.fc_crc), NULL);
> > +
> > +     ext4_fc_submit_bh(sb);
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Adds tag, length, value and updates CRC. Returns true if tlv was added.
> > + * Returns false if there's not enough space.
> > + */
> > +static bool ext4_fc_add_tlv(struct super_block *sb, u16 tag, u16 len, u8 *val,
> > +                        u32 *crc)
> > +{
> > +     struct ext4_fc_tl tl;
> > +     u8 *dst;
> > +
> > +     dst = ext4_fc_reserve_space(sb, sizeof(tl) + len, crc);
> > +     if (!dst)
> > +             return false;
> > +
> > +     tl.fc_tag = cpu_to_le16(tag);
> > +     tl.fc_len = cpu_to_le16(len);
> > +
> > +     ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
> > +     ext4_fc_memcpy(sb, dst + sizeof(tl), val, len, crc);
> > +
> > +     return true;
> > +}
> > +
> > +/* Same as above, but adds dentry tlv. */
> > +static  bool ext4_fc_add_dentry_tlv(struct super_block *sb, u16 tag,
> > +                                     int parent_ino, int ino, int dlen,
> > +                                     const unsigned char *dname,
> > +                                     u32 *crc)
> > +{
> > +     struct ext4_fc_dentry_info fcd;
> > +     struct ext4_fc_tl tl;
> > +     u8 *dst = ext4_fc_reserve_space(sb, sizeof(tl) + sizeof(fcd) + dlen,
> > +                                     crc);
> > +
> > +     if (!dst)
> > +             return false;
> > +
> > +     fcd.fc_parent_ino = cpu_to_le32(parent_ino);
> > +     fcd.fc_ino = cpu_to_le32(ino);
> > +     tl.fc_tag = cpu_to_le16(tag);
> > +     tl.fc_len = cpu_to_le16(sizeof(fcd) + dlen);
> > +     ext4_fc_memcpy(sb, dst, &tl, sizeof(tl), crc);
> > +     dst += sizeof(tl);
> > +     ext4_fc_memcpy(sb, dst, &fcd, sizeof(fcd), crc);
> > +     dst += sizeof(fcd);
> > +     ext4_fc_memcpy(sb, dst, dname, dlen, crc);
> > +     dst += dlen;
> > +
> > +     return true;
> > +}
> > +
> > +/*
> > + * Writes inode in the fast commit space under TLV with tag @tag.
> > + * Returns 0 on success, error on failure.
> > + */
> > +static int ext4_fc_write_inode(struct inode *inode, u32 *crc, int tag)
> > +{
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +     int inode_len = EXT4_GOOD_OLD_INODE_SIZE;
> > +     int ret;
> > +     struct ext4_iloc iloc;
> > +     struct ext4_fc_inode fc_inode;
> > +     struct ext4_fc_tl tl;
> > +     u8 *dst;
> > +
> > +     ret = ext4_get_inode_loc(inode, &iloc);
> > +     if (ret)
> > +             return ret;
> > +
> > +     if (EXT4_INODE_SIZE(inode->i_sb) > EXT4_GOOD_OLD_INODE_SIZE)
> > +             inode_len += ei->i_extra_isize;
> > +
> > +     fc_inode.fc_ino = cpu_to_le32(inode->i_ino);
> > +     tl.fc_tag = cpu_to_le16(tag);
> > +     tl.fc_len = cpu_to_le16(inode_len + sizeof(fc_inode.fc_ino));
> > +
> > +     dst = ext4_fc_reserve_space(inode->i_sb,
> > +                     sizeof(tl) + inode_len + sizeof(fc_inode.fc_ino), crc);
> > +     if (!dst)
> > +             return -ECANCELED;
> > +
> > +     if (!ext4_fc_memcpy(inode->i_sb, dst, &tl, sizeof(tl), crc))
> > +             return -ECANCELED;
> > +     dst += sizeof(tl);
> > +     if (!ext4_fc_memcpy(inode->i_sb, dst, &fc_inode, sizeof(fc_inode), crc))
> > +             return -ECANCELED;
> > +     dst += sizeof(fc_inode);
> > +     if (!ext4_fc_memcpy(inode->i_sb, dst, (u8 *)ext4_raw_inode(&iloc),
> > +                                     inode_len, crc))
> > +             return -ECANCELED;
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Writes updated data ranges for the inode in question. Updates CRC.
> > + * Returns 0 on success, error otherwise.
> > + */
> > +static int ext4_fc_write_data(struct inode *inode, u32 *crc)
> > +{
> > +     ext4_lblk_t old_blk_size, cur_lblk_off, new_blk_size;
> > +     struct ext4_inode_info *ei = EXT4_I(inode);
> > +     struct ext4_map_blocks map;
> > +     struct ext4_fc_add_range fc_ext;
> > +     struct ext4_fc_del_range lrange;
> > +     struct ext4_extent *ex;
> > +     int ret;
> > +
> > +     mutex_lock(&ei->i_fc_lock);
> > +     if (ei->i_fc_lblk_len == 0) {
> > +             mutex_unlock(&ei->i_fc_lock);
> > +             return 0;
> > +     }
> > +     old_blk_size = ei->i_fc_lblk_start;
> > +     new_blk_size = ei->i_fc_lblk_start + ei->i_fc_lblk_len - 1;
> > +     ei->i_fc_lblk_len = 0;
> > +     mutex_unlock(&ei->i_fc_lock);
> > +
> > +     cur_lblk_off = old_blk_size;
> > +     jbd_debug(1, "%s: will try writing %d to %d for inode %ld\n",
> > +               __func__, cur_lblk_off, new_blk_size, inode->i_ino);
> > +
> > +     while (cur_lblk_off <= new_blk_size) {
> > +             map.m_lblk = cur_lblk_off;
> > +             map.m_len = new_blk_size - cur_lblk_off + 1;
> > +             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > +             if (ret < 0)
> > +                     return -ECANCELED;
> > +
> > +             if (map.m_len == 0) {
> > +                     cur_lblk_off++;
> > +                     continue;
> > +             }
> > +
> > +             if (ret == 0) {
> > +                     lrange.fc_ino = cpu_to_le32(inode->i_ino);
> > +                     lrange.fc_lblk = cpu_to_le32(map.m_lblk);
> > +                     lrange.fc_len = cpu_to_le32(map.m_len);
> > +                     if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_DEL_RANGE,
> > +                                         sizeof(lrange), (u8 *)&lrange, crc))
> > +                             return -ENOSPC;
> > +             } else {
> > +                     fc_ext.fc_ino = cpu_to_le32(inode->i_ino);
> > +                     ex = (struct ext4_extent *)&fc_ext.fc_ex;
> > +                     ex->ee_block = cpu_to_le32(map.m_lblk);
> > +                     ex->ee_len = cpu_to_le32(map.m_len);
> > +                     ext4_ext_store_pblock(ex, map.m_pblk);
> > +                     if (map.m_flags & EXT4_MAP_UNWRITTEN)
> > +                             ext4_ext_mark_unwritten(ex);
> > +                     else
> > +                             ext4_ext_mark_initialized(ex);
> > +                     if (!ext4_fc_add_tlv(inode->i_sb, EXT4_FC_TAG_ADD_RANGE,
> > +                                         sizeof(fc_ext), (u8 *)&fc_ext, crc))
> > +                             return -ENOSPC;
> > +             }
> > +
> > +             cur_lblk_off += map.m_len;
> > +     }
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Commit data inode. If tag == EXT4_FC_TAG_INODE_PARTIAL, we write the
> > + * updated ranges for that inode before committing the inode itself.
> > + * This ensures that the during the replay inode->i_block[] is properly
> > + * initiated before the partial replay of the inode.
> > + */
> > +static int ext4_fc_commit_inode(journal_t *journal, struct inode *inode,
> > +                             u32 *crc, int tag)
> > +{
> > +     int ret;
> > +     int commit_full_inode =
> > +             tag == EXT4_FC_TAG_INODE_FULL || ext4_has_inline_data(inode);
> > +
> > +     if (commit_full_inode) {
> > +             ret = ext4_fc_write_inode(inode, crc, tag);
> > +             if (ret < 0)
> > +                     return ret;
> > +     }
> > +
> > +     ret = ext4_fc_write_data(inode, crc);
> > +     if (ret < 0)
> > +             return ret;
> > +
> > +     if (!commit_full_inode)
> > +             ret = ext4_fc_write_inode(inode, crc, tag);
> > +
> > +     return ret;
> > +}
> > +
> > +/* Submit data for all the fast commit inodes */
> > +static int ext4_fc_submit_inode_data_all(journal_t *journal)
> > +{
> > +     struct super_block *sb = (struct super_block *)(journal->j_private);
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     struct ext4_inode_info *ei;
> > +     struct list_head *pos;
> > +     int ret = 0;
> > +
> > +     spin_lock(&sbi->s_fc_lock);
> > +     sbi->s_mount_state |= EXT4_FC_COMMITTING;
> > +     list_for_each(pos, &sbi->s_fc_q[FC_Q_MAIN]) {
> > +             ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
> > +             ext4_set_inode_state(&ei->vfs_inode, EXT4_STATE_FC_COMMITTING);
> > +             while (atomic_read(&ei->i_fc_updates)) {
> > +                     DEFINE_WAIT(wait);
> > +
> > +                     prepare_to_wait(&ei->i_fc_wait, &wait,
> > +                                             TASK_UNINTERRUPTIBLE);
> > +                     if (atomic_read(&ei->i_fc_updates)) {
> > +                             spin_unlock(&sbi->s_fc_lock);
> > +                             schedule();
> > +                             spin_lock(&sbi->s_fc_lock);
> > +                     }
> > +                     finish_wait(&ei->i_fc_wait, &wait);
> > +             }
> > +             spin_unlock(&sbi->s_fc_lock);
> > +             ret = jbd2_submit_inode_data(journal, ei->jinode);
> > +             if (ret)
> > +                     return ret;
> > +             spin_lock(&sbi->s_fc_lock);
> > +     }
> > +     spin_unlock(&sbi->s_fc_lock);
> > +
> > +     return ret;
> > +}
> > +
> > +/* Wait for completion of data for all the fast commit inodes */
> > +static int ext4_fc_wait_inode_data_all(journal_t *journal)
> > +{
> > +     struct super_block *sb = (struct super_block *)(journal->j_private);
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     struct ext4_inode_info *pos, *n;
> > +     int ret = 0;
> > +
> > +     spin_lock(&sbi->s_fc_lock);
> > +     list_for_each_entry_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN], i_fc_list) {
> > +             if (!ext4_test_inode_state(&pos->vfs_inode,
> > +                                        EXT4_STATE_FC_COMMITTING))
> > +                     continue;
> > +             spin_unlock(&sbi->s_fc_lock);
> > +
> > +             ret = jbd2_wait_inode_data(journal, pos->jinode);
> > +             if (ret)
> > +                     return ret;
> > +             spin_lock(&sbi->s_fc_lock);
> > +     }
> > +     spin_unlock(&sbi->s_fc_lock);
> > +
> > +     return 0;
> > +}
> > +
> > +/* Commit all the directory entry updates */
> > +static int ext4_fc_commit_dentry_updates(journal_t *journal, u32 *crc)
> > +{
> > +     struct super_block *sb = (struct super_block *)(journal->j_private);
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     struct ext4_fc_dentry_update *fc_dentry;
> > +     struct inode *inode;
> > +     struct list_head *pos, *n, *fcd_pos, *fcd_n;
> > +     struct ext4_inode_info *ei;
> > +     int ret;
> > +
> > +     spin_lock(&sbi->s_fc_lock);
> > +     if (list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
> > +             spin_unlock(&sbi->s_fc_lock);
> > +             return 0;
> > +     }
> > +     list_for_each_safe(fcd_pos, fcd_n, &sbi->s_fc_dentry_q[FC_Q_MAIN]) {
> > +             fc_dentry = list_entry(fcd_pos, struct ext4_fc_dentry_update,
> > +                                     fcd_list);
> > +             if (fc_dentry->fcd_op != EXT4_FC_TAG_CREAT) {
> > +                     spin_unlock(&sbi->s_fc_lock);
> > +                     if (!ext4_fc_add_dentry_tlv(
> > +                             sb, fc_dentry->fcd_op,
> > +                             fc_dentry->fcd_parent, fc_dentry->fcd_ino,
> > +                             fc_dentry->fcd_name.len,
> > +                             fc_dentry->fcd_name.name, crc)) {
> > +                             return -ENOSPC;
> > +                     }
> > +                     spin_lock(&sbi->s_fc_lock);
> > +                     continue;
> > +             }
> > +
> > +             inode = NULL;
> > +             list_for_each_safe(pos, n, &sbi->s_fc_q[FC_Q_MAIN]) {
> > +                     ei = list_entry(pos, struct ext4_inode_info, i_fc_list);
> > +                     if (ei->vfs_inode.i_ino == fc_dentry->fcd_ino) {
> > +                             inode = &ei->vfs_inode;
> > +                             break;
> > +                     }
> > +             }
> > +             /*
> > +              * If we don't find inode in our list, then it was deleted,
> > +              * in which case, we don't need to record it's create tag.
> > +              */
> > +             if (!inode)
> > +                     continue;
> > +             spin_unlock(&sbi->s_fc_lock);
> > +
> > +             /*
> > +              * We first write the inode and then the create dirent. This
> > +              * allows the recovery code to create an unnamed inode first
> > +              * and then link it to a directory entry. This allows us
> > +              * to use namei.c routines almost as is and simplifies
> > +              * the recovery code.
> > +              */
> > +             ret = ext4_fc_commit_inode(journal, inode, crc,
> > +                                             EXT4_FC_TAG_INODE_FULL);
> > +             if (ret)
> > +                     return ret;
> > +
> > +             if (!ext4_fc_add_dentry_tlv(
> > +                     sb, fc_dentry->fcd_op,
> > +                     fc_dentry->fcd_parent, fc_dentry->fcd_ino,
> > +                     fc_dentry->fcd_name.len,
> > +                     fc_dentry->fcd_name.name, crc))
> > +                     return -ENOSPC;
> > +
> > +             spin_lock(&sbi->s_fc_lock);
> > +     }
> > +     spin_unlock(&sbi->s_fc_lock);
> > +     return 0;
> > +}
> > +
> > +static int ext4_fc_perform_commit(journal_t *journal)
> > +{
> > +     struct super_block *sb = (struct super_block *)(journal->j_private);
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     struct ext4_inode_info *iter;
> > +     struct ext4_fc_head head;
> > +     struct list_head *pos;
> > +     struct inode *inode;
> > +     struct blk_plug plug;
> > +     int ret = 0;
> > +     u32 crc = 0;
> > +
> > +     ret = ext4_fc_submit_inode_data_all(journal);
> > +     if (ret)
> > +             return ret;
> > +
> > +     ret = ext4_fc_wait_inode_data_all(journal);
> > +     if (ret)
> > +             return ret;
> > +
> > +     blk_start_plug(&plug);
> > +     if (sbi->s_fc_bytes == 0) {
> > +             /*
> > +              * Add a head tag only if this is the first fast commit
> > +              * in this TID.
> > +              */
> > +             head.fc_features = cpu_to_le32(EXT4_FC_SUPPORTED_FEATURES);
> > +             head.fc_tid = cpu_to_le32(
> > +                     sbi->s_journal->j_running_transaction->t_tid);
> > +             if (!ext4_fc_add_tlv(sb, EXT4_FC_TAG_HEAD, sizeof(head),
> > +                     (u8 *)&head, &crc))
> > +                     goto out;
> > +     }
> > +
> > +     spin_lock(&sbi->s_fc_lock);
> > +     if (!list_empty(&sbi->s_fc_dentry_q[FC_Q_MAIN])) {
>
> This looks like a redundant check here, since we check for list
> emptiness inside ext4_fc_commit_dentry_updates() anyway, with
> s_fc_lock held.
> That should simplify this code block. I guess we then only need to call
> ext4_fc_commit_dentry_updates().
>
> > +
> > +struct ext4_fc_stats {
> > +     int fc_ineligible_reason_count[EXT4_FC_REASON_MAX];
> > +     int fc_num_commits;
> > +     int fc_ineligible_commits;
> > +     int fc_numblks;
> > +};
>
> I guess all of the above counters should be unsigned int or unsigned long, right?
Ack, thanks, will fix this.
>
>
>
>
> > +
> > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> > index 153a9fbe1dd0..563243f3b682 100644
> > --- a/fs/ext4/namei.c
> > +++ b/fs/ext4/namei.c
> > @@ -2611,7 +2611,7 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
> >                      bool excl)
> >   {
> >       handle_t *handle;
> > -     struct inode *inode;
> > +     struct inode *inode, *inode_save;
> >       int err, credits, retries = 0;
> >
> >       err = dquot_initialize(dir);
> > @@ -2629,7 +2629,11 @@ static int ext4_create(struct inode *dir, struct dentry *dentry, umode_t mode,
> >               inode->i_op = &ext4_file_inode_operations;
> >               inode->i_fop = &ext4_file_operations;
> >               ext4_set_aops(inode);
> > +             inode_save = inode;
> > +             ihold(inode_save);
> >               err = ext4_add_nondir(handle, dentry, &inode);
> > +             ext4_fc_track_create(inode_save, dentry);
>
> Shouldn't we check for err before calling ext4_fc_track_create?
Makes sense, will fix this.
>
>
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index ba35ecb18616..dadd9994e74b 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -202,6 +202,47 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
> >       return ret;
> >   }
> >
> > +/* Send all the data buffers related to an inode */
> > +int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> > +{
> > +     struct address_space *mapping;
> > +     loff_t dirty_start;
> > +     loff_t dirty_end;
> > +     int ret;
> > +
> > +     if (!jinode)
> > +             return 0;
> > +
> > +     dirty_start = jinode->i_dirty_start;
> > +     dirty_end = jinode->i_dirty_end;
> > +
> > +     if (!(jinode->i_flags & JI_WRITE_DATA))
> > +             return 0;
>
> This can be clubbed with above if logic to check if !jinode
Ack
>
> > +
> > +     dirty_start = jinode->i_dirty_start;
> > +     dirty_end = jinode->i_dirty_end;
>
> redundant check
Ack
>
> >
> > +/* Map one fast commit buffer for use by the file system */
> > +int jbd2_fc_get_buf(journal_t *journal, struct buffer_head **bh_out)
> > +{
> > +     unsigned long long pblock;
> > +     unsigned long blocknr;
> > +     int ret = 0;
> > +     struct buffer_head *bh;
> > +     int fc_off;
> > +
> > +     *bh_out = NULL;
> > +     write_lock(&journal->j_state_lock);
> > +
> > +     if (journal->j_fc_off + journal->j_first_fc < journal->j_last_fc) {
> > +             fc_off = journal->j_fc_off;
> > +             blocknr = journal->j_first_fc + fc_off;
> > +             journal->j_fc_off++;
>
>
> Not sure if we should roll back j_fc_off in case of errors from
> this func.
We can do that. Given that this function is called serially (at least
today), we can reset j_fc_off. Will add that in V10.
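
A rough user-space sketch of the rollback idea, not the kernel code:
struct fc_area and map_block() are hypothetical stand-ins for the
journal_t fields and the block mapping step, and the error codes are
picked only for illustration. It relies on callers being serialized,
as noted above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fast commit fields of journal_t. */
struct fc_area {
	unsigned long first_fc;	/* first block of the fast commit area */
	unsigned long last_fc;	/* end of the area (exclusive in this sketch) */
	unsigned long fc_off;	/* number of blocks handed out so far */
};

/* Hypothetical mapping step; fails on request to simulate an I/O error. */
static int map_block(unsigned long blocknr, bool fail, unsigned long *pblock)
{
	if (fail)
		return -EIO;
	*pblock = blocknr;	/* identity mapping, for illustration only */
	return 0;
}

/*
 * Reserve the next fast commit block. If a later step fails, undo the
 * reservation so that the failed call does not consume a block.
 */
static int fc_get_block(struct fc_area *fc, bool fail, unsigned long *pblock)
{
	unsigned long blocknr;
	int ret;

	assert(fc && pblock);
	if (fc->fc_off + fc->first_fc >= fc->last_fc)
		return -EINVAL;	/* area is full; code chosen for the sketch */

	blocknr = fc->first_fc + fc->fc_off;
	fc->fc_off++;

	ret = map_block(blocknr, fail, pblock);
	if (ret)
		fc->fc_off--;	/* roll back the reservation */
	return ret;
}
```

With the decrement in place, a failed call leaves fc_off exactly where
it was, so the next successful call reuses the same block.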

Thanks,
Harshad
>
> -ritesh
>
>


* Re: [PATCH v9 5/9] ext4: main fast-commit commit path
  2020-10-09 19:14   ` Theodore Y. Ts'o
@ 2020-10-13  0:27     ` harshad shirwadkar
  0 siblings, 0 replies; 29+ messages in thread
From: harshad shirwadkar @ 2020-10-13  0:27 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List

On Fri, Oct 9, 2020 at 12:14 PM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Sep 18, 2020 at 05:54:47PM -0700, Harshad Shirwadkar wrote:
> >  fs/jbd2/commit.c            |   42 ++
> >  fs/jbd2/journal.c           |  119 +++-
>
> Why are these changes here instead of the previous commit (jbd2: add
> fast commit machinery)?
Makes sense, I'll move these functions to the previous commit.
>
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index ba35ecb18616..dadd9994e74b 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -202,6 +202,47 @@ static int journal_submit_inode_data_buffers(struct address_space *mapping,
> >       return ret;
> >  }
> >
> > +/* Send all the data buffers related to an inode */
> > +int jbd2_submit_inode_data(journal_t *journal, struct jbd2_inode *jinode)
> > +{
> > +     struct address_space *mapping;
> > +     loff_t dirty_start;
> > +     loff_t dirty_end;
> > +     int ret;
> > +
> > +     if (!jinode)
> > +             return 0;
> > +
> > +     dirty_start = jinode->i_dirty_start;
> > +     dirty_end = jinode->i_dirty_end;
> > +
> > +     if (!(jinode->i_flags & JI_WRITE_DATA))
> > +             return 0;
> > +
> > +     dirty_start = jinode->i_dirty_start;
> > +     dirty_end = jinode->i_dirty_end;
>
> Why are dirty_start and dirty_end initialized twice?
Thanks for catching this. I'll fix this in V10.
>
> Also, this is going to conflict with Mauricio's data=journal patches,
> which you'll notice when you rebase these patches on the current dev branch.
Thanks for the heads up.
- Harshad
>
> (The dev branch temporarily had your v9 patches merged in, so we could
> get the test bots to comment on your changes, but I've since pulled
> the fc patches back out.)
>
>                                         - Ted


* Re: [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature
  2020-10-09 18:28   ` Theodore Y. Ts'o
@ 2020-10-13  0:27     ` harshad shirwadkar
  0 siblings, 0 replies; 29+ messages in thread
From: harshad shirwadkar @ 2020-10-13  0:27 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List

On Fri, Oct 9, 2020 at 11:28 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Sep 18, 2020 at 05:54:43PM -0700, Harshad Shirwadkar wrote:
> > This patch adds necessary documentation for fast commits.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > ---
> >  Documentation/filesystems/ext4/journal.rst | 66 ++++++++++++++++++++++
> >  Documentation/filesystems/journalling.rst  | 28 +++++++++
> >  2 files changed, 94 insertions(+)
> >
> > diff --git a/Documentation/filesystems/ext4/journal.rst b/Documentation/filesystems/ext4/journal.rst
> > index ea613ee701f5..c2e4d010a201 100644
> > --- a/Documentation/filesystems/ext4/journal.rst
> > +++ b/Documentation/filesystems/ext4/journal.rst
> > @@ -609,3 +620,58 @@ bytes long (but uses a full block):
> >       - h\_commit\_nsec
> >       - Nanoseconds component of the above timestamp.
> >
> > +Fast commits
> > +~~~~~~~~~~~~
> > +
> > +Fast commit area is organized as a log of tag tag length values. Each TLV has
>
> s/tag tag/tag/
ack
>
> > +
> > +File system is free to perform fast commits as and when it wants as long as it
> > +gets permission from JBD2 to do so by calling the function
> > +:c:func:`jbd2_fc_start()`. Once a fast commit is done, the client
> > +file  system should tell JBD2 about it by calling :c:func:`jbd2_fc_stop()`.
> > +If file system wants JBD2 to perform a full commit immediately after stopping
> > +the fast commit it can do so by calling :c:func:`jbd2_fc_stop_do_commit()`.
> > +This is useful if fast commit operation fails for some reason and the only way
> > +to guarantee consistency is for JBD2 to perform the full traditional commit.
>
> One of the things which is a bit confusing is that there is a
> substantial part of the fast commit functionality which is implemented
> in ext4, and not in the jbd2 layer.
>
> We can't just talk about ext4_fc_start_update() and
> ext4_fc_stop_update() here, since it would be a bit of a layering
> violation.  But some kind of explanation of how a file system would
> use the jbd2 fast commit framework would be useful, and the big
> picture view of how the ext4 fast commit infrastructure (which is
> currently documented in the top-level comments of
> fs/ext4/fast_commit.c) fits into the jbd2 infrastructure.
As we discussed offline, the names "jbd2_fc_start()" and
"jbd2_fc_stop()" are kind of confusing. I didn't mean to put any
Ext4-specific information here, and it sounds like
"jbd2_fc_start/stop()" should be renamed to something like
jbd2_fc_begin_commit() and jbd2_fc_end_commit(). My goal is to add
documentation here that explains how a client FS can use JBD2 fast
commits.
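
To make the intended calling convention concrete, here is a minimal
user-space model of it. Everything in it is hypothetical: the model_*
functions only mimic the roles the quoted documentation gives to
jbd2_fc_start(), jbd2_fc_stop() and jbd2_fc_stop_do_commit(), they do
not reproduce the real signatures, and write_fc stands for whatever
the client FS does to write its fast commit blocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical journal model that only counts what the client asked for. */
struct model_journal {
	bool fc_active;		/* a fast commit is in progress */
	int fast_commits;	/* fast commits that completed */
	int full_commits;	/* fallbacks to a full commit */
};

/* Ask for permission to fast commit (the real call would wait instead). */
static int model_fc_start(struct model_journal *j)
{
	if (j->fc_active)
		return -1;
	j->fc_active = true;
	return 0;
}

/* Tell the journal that the fast commit is done. */
static void model_fc_stop(struct model_journal *j)
{
	assert(j->fc_active);
	j->fc_active = false;
	j->fast_commits++;
}

/* Stop the fast commit and request a full commit right away. */
static void model_fc_stop_do_commit(struct model_journal *j)
{
	assert(j->fc_active);
	j->fc_active = false;
	j->full_commits++;
}

/* Client side: start, write, then stop or fall back to a full commit. */
static int client_fast_commit(struct model_journal *j, int (*write_fc)(void))
{
	int ret = model_fc_start(j);

	if (ret)
		return ret;
	if (write_fc() == 0)
		model_fc_stop(j);
	else
		model_fc_stop_do_commit(j);
	return 0;
}

/* Two trivial write callbacks for trying the model out. */
static int write_fc_ok(void) { return 0; }
static int write_fc_fail(void) { return -1; }
```

Calling client_fast_commit() with write_fc_ok bumps fast_commits, while
write_fc_fail bumps full_commits, which is the fallback path described
in the quoted documentation.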
>
> Maybe put the big picture explanation in fs/ext4/fast_commit.c and
> then put a pointer in journalling.rst to the comments in
> fs/ext4/fast_commit.c as an example of how the jbd2 fast_commit
> infrastructure would get used (for example, if ocfs2 ever got
> interested in doing something similar)?  Or maybe we need to move some
> of the description from comments in fast_commit.c to a file in
> Documentation/filesystems/ext4/fast_commit.rst, perhaps?
I like the first option. I'll update the docs accordingly in V10.

Thanks,
Harshad.
>
>                                                 - Ted


* Re: [PATCH v9 2/9] ext4: add fast_commit feature and handling for extended mount options
  2020-10-09 17:58   ` Theodore Y. Ts'o
@ 2020-10-13  0:27     ` harshad shirwadkar
  0 siblings, 0 replies; 29+ messages in thread
From: harshad shirwadkar @ 2020-10-13  0:27 UTC (permalink / raw)
  To: Theodore Y. Ts'o; +Cc: Ext4 Developers List

On Fri, Oct 9, 2020 at 10:58 AM Theodore Y. Ts'o <tytso@mit.edu> wrote:
>
> On Fri, Sep 18, 2020 at 05:54:44PM -0700, Harshad Shirwadkar wrote:
> > We are running out of mount option bits. Add handling for using
> > s_mount_opt2. Add ext4 and jbd2 fast commit feature flag and also add
> > ability to turn on / off the fast commit feature in Ext4.
>
> Shouldn't that read "...ability to turn off the fast commit feature via a
> mount option"?
>
> > @@ -2207,10 +2211,17 @@ static int handle_mount_opt(struct super_block *sb, char *opt, int token,
> >                       WARN_ON(1);
> >                       return -1;
> >               }
> > -             if (arg != 0)
> > -                     sbi->s_mount_opt |= m->mount_opt;
> > -             else
> > -                     sbi->s_mount_opt &= ~m->mount_opt;
> > +             if (m->flags & MOPT_2) {
> > +                     if (arg != 0)
> > +                             sbi->s_mount_opt2 |= m->mount_opt;
> > +                     else
> > +                             sbi->s_mount_opt2 &= ~m->mount_opt;
> > +             } else {
> > +                     if (arg != 0)
> > +                             sbi->s_mount_opt |= m->mount_opt;
> > +                     else
> > +                             sbi->s_mount_opt &= ~m->mount_opt;
> > +             }
> >       }
> >       return 1;
> >  }
>
>
> This requires a matching change in _ext4_show_options(), so that the
> MOPT_2 options are properly displayed in /proc/mounts.
Thanks for pointing that out, I'll add this in V10.
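
For reference, a small user-space sketch of why the show side has to
mirror the set side. The table entries, bit values and struct names
below are made up for illustration (only the MOPT_2 name comes from
the patch); this is not the _ext4_show_options() code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MOPT_2 0x1	/* hypothetical value: option lives in the second word */

struct mopt {
	const char *name;
	unsigned int bit;
	int flags;
};

struct sb_opts {
	unsigned int opt;	/* models s_mount_opt */
	unsigned int opt2;	/* models s_mount_opt2 */
};

/* Set or clear an option in the word selected by MOPT_2. */
static void apply_opt(struct sb_opts *sb, const struct mopt *m, int arg)
{
	unsigned int *word = (m->flags & MOPT_2) ? &sb->opt2 : &sb->opt;

	if (arg != 0)
		*word |= m->bit;
	else
		*word &= ~m->bit;
}

/* List the enabled options, testing each bit in the word it was set in. */
static void show_opts(const struct sb_opts *sb, const struct mopt *tbl,
		      size_t n, char *buf, size_t size)
{
	size_t used = 0;
	size_t i;

	assert(buf && size > 0);
	buf[0] = '\0';
	for (i = 0; i < n; i++) {
		unsigned int word = (tbl[i].flags & MOPT_2) ? sb->opt2 : sb->opt;
		int w;

		if (!(word & tbl[i].bit))
			continue;
		w = snprintf(buf + used, size - used, "%s%s",
			     used ? "," : "", tbl[i].name);
		if (w < 0 || (size_t)w >= size - used)
			break;	/* output truncated, stop here */
		used += (size_t)w;
	}
}
```

If show_opts() only looked at the first word, an option set through
MOPT_2 would either be missing or be confused with a first-word option
that happens to use the same bit.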

Thanks,
Harshad
>
>                                                 - Ted


* Re: [PATCH v9 7/9] ext4: fast commit recovery path
  2020-10-09 17:14   ` Ritesh Harjani
@ 2020-10-13  0:27     ` harshad shirwadkar
  0 siblings, 0 replies; 29+ messages in thread
From: harshad shirwadkar @ 2020-10-13  0:27 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Ext4 Developers List, Theodore Y. Ts'o

Replies inline (I have also trimmed the original inlined patch to
make this more readable).

On Fri, Oct 9, 2020 at 10:14 AM Ritesh Harjani <riteshh@linux.ibm.com> wrote:
>
>
>
> > +/* Replay add range tag */
> > +static int ext4_fc_replay_add_range(struct super_block *sb,
> > +                             struct ext4_fc_tl *tl)
> > +{
> > +     struct ext4_fc_add_range *fc_add_ex;
> > +     struct ext4_extent newex, *ex;
> > +     struct inode *inode;
> > +     ext4_lblk_t start, cur;
> > +     int remaining, len;
> > +     ext4_fsblk_t start_pblk;
> > +     struct ext4_map_blocks map;
> > +     struct ext4_ext_path *path = NULL;
> > +     int ret;
> > +
> > +     fc_add_ex = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl);
> > +     ex = (struct ext4_extent *)&fc_add_ex->fc_ex;
> > +
> > +     trace_ext4_fc_replay(sb, EXT4_FC_TAG_ADD_RANGE,
> > +             le32_to_cpu(fc_add_ex->fc_ino), le32_to_cpu(ex->ee_block),
> > +             ext4_ext_get_actual_len(ex));
> > +
> > +     inode = ext4_iget(sb, le32_to_cpu(fc_add_ex->fc_ino),
> > +                             EXT4_IGET_NORMAL);
> > +     if (IS_ERR_OR_NULL(inode)) {
> > +             jbd_debug(1, "Inode not found.");
> > +             return 0;
> > +     }
> > +
> > +     ret = ext4_fc_record_modified_inode(sb, inode->i_ino);
> > +
> > +     start = le32_to_cpu(ex->ee_block);
> > +     start_pblk = ext4_ext_pblock(ex);
> > +     len = ext4_ext_get_actual_len(ex);
> > +
> > +     cur = start;
> > +     remaining = len;
> > +     jbd_debug(1, "ADD_RANGE, lblk %d, pblk %lld, len %d, unwritten %d, inode %ld\n",
> > +               start, start_pblk, len, ext4_ext_is_unwritten(ex),
> > +               inode->i_ino);
> > +
> > +     while (remaining > 0) {
> > +             map.m_lblk = cur;
> > +             map.m_len = remaining;
> > +             map.m_pblk = 0;
> > +             ret = ext4_map_blocks(NULL, inode, &map, 0);
> > +
> > +             if (ret < 0) {
> > +                     iput(inode);
> > +                     return 0;
> > +             }
> > +
> > +             if (ret == 0) {
> > +                     /* Range not mapped */
> > +                     path = ext4_find_extent(inode, cur, NULL, 0);
> > +                     if (!path)
> > +                             continue;
> > +                     memset(&newex, 0, sizeof(newex));
> > +                     newex.ee_block = cpu_to_le32(cur);
> > +                     ext4_ext_store_pblock(
> > +                             &newex, start_pblk + cur - start);
> > +                     newex.ee_len = cpu_to_le16(map.m_len);
> > +                     if (ext4_ext_is_unwritten(ex))
> > +                             ext4_ext_mark_unwritten(&newex);
> > +                     down_write(&EXT4_I(inode)->i_data_sem);
> > +                     ret = ext4_ext_insert_extent(
> > +                             NULL, inode, &path, &newex, 0);
> > +                     up_write((&EXT4_I(inode)->i_data_sem));
> > +                     ext4_ext_drop_refs(path);
> > +                     kfree(path);
> > +                     if (ret) {
> > +                             iput(inode);
> > +                             return 0;
> > +                     }
> > +                     goto next;
> > +             }
> > +
> > +             if (start_pblk + cur - start != map.m_pblk) {
> > +                     /* Logical to physical mapping changed */
>
>
> Sorry, I am not sure if I understand this correctly. Can we pls put more
> comments on when and how this condition can happen?
> I am sure I am missing something.
Sorry, I realized that the code is a bit too cryptic. I'll add more
comments here in V10 to explain what's going on. This condition can
happen in the following scenario:
- fallocate insert range on file f at offset 4k, length 8k
- write to this range
- sync
- fallocate remove range
- fallocate insert range again. At this point, the lblk -> pblk mapping
of the range has changed since the last sync. Calling fsync at this
point would just result in an "ADD_RANGE" tag with the newly added
mapping.
In this particular scenario, the recovery code would hit this
condition. Does that make sense?
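
Put differently, the replay loop in the quoted code handles each chunk
in one of three ways. Here is a stand-alone sketch of just that
decision; the enum and function names are made up for illustration,
and the real code of course acts on the extent tree rather than
returning a value.

```c
#include <assert.h>

/* The three outcomes of replaying one chunk of an ADD_RANGE tag. */
enum replay_action {
	REPLAY_INSERT_EXTENT,	/* range is not mapped on disk yet */
	REPLAY_REMAP,		/* mapped, but to different physical blocks */
	REPLAY_UPDATE_STATE,	/* mapped to the same physical blocks */
};

/*
 * start, start_pblk: first logical and physical block of the logged extent.
 * cur:               logical block currently being replayed.
 * mapped_len:        blocks the on-disk lookup found at cur (0 = unmapped).
 * mapped_pblk:       physical block the on-disk lookup returned.
 */
static enum replay_action classify_chunk(unsigned int start,
					 unsigned long long start_pblk,
					 unsigned int cur, int mapped_len,
					 unsigned long long mapped_pblk)
{
	assert(cur >= start && mapped_len >= 0);
	if (mapped_len == 0)
		return REPLAY_INSERT_EXTENT;
	if (start_pblk + (cur - start) != mapped_pblk)
		return REPLAY_REMAP;
	return REPLAY_UPDATE_STATE;
}
```

In the scenario above, the on-disk lookup still returns the physical
blocks from before the remove/insert pair, so the comparison fails and
the chunk falls into the REPLAY_REMAP case.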
>
> Also, what if the mapping changed and the start pblk is different,
> but it's still an overlapping mapping?
> Do we take care of that case here? I ask because we are
> clearing the block bitmaps for map.m_len below.
We record every inode that is modified by an "ADD_RANGE" /
"DEL_RANGE" operation. Overlapping ranges are handled by
ext4_fc_set_bitmaps_and_counters(), which is called at the end of the
replay. It traverses all the ranges in the modified inodes and makes
sure that every block that belongs to an inode is marked in use.
>
> > +                     ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
> > +                                     ext4_ext_is_unwritten(ex),
> > +                                     start_pblk + cur - start);
> > +                     if (ret) {
> > +                             iput(inode);
> > +                             return 0;
> > +                     }
> > +                     ext4_mb_mark_bb(inode->i_sb, map.m_pblk, map.m_len, 0);
> > +                     goto next;
> > +             }
> > +
> > +             /* Range is mapped and needs a state change */
> > +             jbd_debug(1, "Converting from %d to %d %lld",
> > +                             map.m_flags & EXT4_MAP_UNWRITTEN,
> > +                     ext4_ext_is_unwritten(ex), map.m_pblk);
> > +             ret = ext4_ext_replay_update_ex(inode, cur, map.m_len,
> > +                                     ext4_ext_is_unwritten(ex), map.m_pblk);
> > +             if (ret) {
> > +                     iput(inode);
> > +                     return 0;
> > +             }
> > +             /*
> > +              * We may have split the extent tree while toggling the state.
> > +              * Try to shrink the exten tree now.
>
> s/exten/extent
Ack
>
> > +             return "TAG_TAIL";
> > +     case EXT4_FC_TAG_HEAD:
> > +             return "TAG_HEAD";
> > +     default:
> > +             return "TAG_ERROR";
> > +     }
> > +}
> > +
> > +void ext4_fc_set_bitmaps_and_counters(struct super_block *sb)
>
> static ?
Ack
>
> > +static int ext4_fc_replay_scan(journal_t *journal,
> > +                             struct buffer_head *bh, int off,
> > +                             tid_t expected_tid)
> > +{
> > +     struct super_block *sb = journal->j_private;
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     struct ext4_fc_replay_state *state;
> > +     int ret = JBD2_FC_REPLAY_CONTINUE;
> > +     struct ext4_fc_add_range *ext;
> > +     struct ext4_fc_tl *tl;
> > +     struct ext4_fc_tail *tail;
> > +     __u8 *start, *end;
> > +     struct ext4_fc_head *head;
> > +     struct ext4_extent *ex;
> > +
> > +     state = &sbi->s_fc_replay_state;
> > +
> > +     start = (u8 *)bh->b_data;
> > +     end = (__u8 *)bh->b_data + journal->j_blocksize - 1;
> > +
> > +     if (state->fc_replay_expected_off == 0) {
> > +             state->fc_cur_tag = 0;
> > +             state->fc_replay_num_tags = 0;
> > +             state->fc_crc = 0;
> > +             state->fc_regions = NULL;
> > +             state->fc_regions_valid = state->fc_regions_used =
> > +                     state->fc_regions_size = 0;
> > +             /* Check if we can stop early */
> > +             if (le16_to_cpu(((struct ext4_fc_tl *)start)->fc_tag)
> > +                     != EXT4_FC_TAG_HEAD)
> > +                     return 0;
> > +     }
> > +
> > +     if (off != state->fc_replay_expected_off) {
> > +             ret = -EFSCORRUPTED;
> > +             goto out_err;
> > +     }
> > +
> > +     state->fc_replay_expected_off++;
> > +     fc_for_each_tl(start, end, tl) {
> > +             jbd_debug(3, "Scan phase, tag:%s, blk %lld\n",
> > +                       tag2str(le16_to_cpu(tl->fc_tag)), bh->b_blocknr);
> > +             switch (le16_to_cpu(tl->fc_tag)) {
> > +             case EXT4_FC_TAG_ADD_RANGE:
> > +                     ext = (struct ext4_fc_add_range *)ext4_fc_tag_val(tl);
> > +                     ex = (struct ext4_extent *)&ext->fc_ex;
> > +                     ret = ext4_fc_record_regions(sb,
> > +                             le32_to_cpu(ext->fc_ino),
> > +                             le32_to_cpu(ex->ee_block), ext4_ext_pblock(ex),
> > +                             ext4_ext_get_actual_len(ex));
> > +                     if (ret < 0)
> > +                             break;
> > +                     ret = JBD2_FC_REPLAY_CONTINUE;
> > +                     fallthrough;
> > +             case EXT4_FC_TAG_DEL_RANGE:
> > +             case EXT4_FC_TAG_LINK:
> > +             case EXT4_FC_TAG_UNLINK:
> > +             case EXT4_FC_TAG_CREAT:
> > +             case EXT4_FC_TAG_INODE_FULL:
> > +             case EXT4_FC_TAG_INODE_PARTIAL:
> > +             case EXT4_FC_TAG_PAD:
> > +                     state->fc_cur_tag++;
> > +                     state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
> > +                                     sizeof(*tl) + ext4_fc_tag_len(tl));
> > +                     break;
> > +             case EXT4_FC_TAG_TAIL:
> > +                     state->fc_cur_tag++;
> > +                     tail = (struct ext4_fc_tail *)ext4_fc_tag_val(tl);
> > +                     state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
> > +                                             sizeof(*tl) +
> > +                                             offsetof(struct ext4_fc_tail,
> > +                                             fc_crc));
> > +                     if (le32_to_cpu(tail->fc_tid) == expected_tid &&
> > +                             le32_to_cpu(tail->fc_crc) == state->fc_crc) {
> > +                             state->fc_replay_num_tags = state->fc_cur_tag;
> > +                             state->fc_regions_valid =
> > +                                     state->fc_regions_used;
> > +                     } else {
> > +                             ret = state->fc_replay_num_tags ?
> > +                                     JBD2_FC_REPLAY_STOP : -EFSBADCRC;
> > +                     }
> > +                     state->fc_crc = 0;
> > +                     break;
> > +             case EXT4_FC_TAG_HEAD:
> > +                     head = (struct ext4_fc_head *)ext4_fc_tag_val(tl);
> > +                     if (le32_to_cpu(head->fc_features) &
> > +                             ~EXT4_FC_SUPPORTED_FEATURES) {
> > +                             ret = -EOPNOTSUPP;
> > +                             break;
> > +                     }
> > +                     if (le32_to_cpu(head->fc_tid) != expected_tid) {
> > +                             ret = JBD2_FC_REPLAY_STOP;
> > +                             break;
> > +                     }
> > +                     state->fc_cur_tag++;
> > +                     state->fc_crc = ext4_chksum(sbi, state->fc_crc, tl,
> > +                                     sizeof(*tl) + ext4_fc_tag_len(tl));
>
>
> Why do we need to calculate state->fc_crc for HEAD?
> I don't see us comparing this anywhere, right? Anything I missed?
Since head is the first tag in the fc area, CRC is first calculated
here. This CRC is modified after every tag found in the FC area, until
we reach a valid tail at which point, the CRC calculated till now is
verified against the CRC found in the tail tag itself.

FC area would look something like:
[HEAD][T1][T2][T3][TAIL1][T4][T5][T6][TAIL2]

For every commit operation, a tail gets written. That's why we see
multiple tails in the FC area. In this example, the CRC stored in Tail1
is calculated as CRC(head, T1, T2, T3, Tail1). Similarly, the CRC in
Tail2 is CRC(T4, T5, T6, Tail2). In the scan phase, we maintain
state->fc_crc as a running CRC until we find a tail. Once a tail is
found, the calculated CRC is compared against the CRC stored in that
tail, and the running CRC is reset for the next commit.
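
Here is a small user-space sketch of that scan logic. It is only a
model: struct tag, seal_tails() and scan_valid_tags() are made-up
names, the tag layout is simplified, and crc_update() is a plain
bitwise CRC that is not meant to be bit-compatible with ext4_chksum().
As in the quoted code, the tail is covered only up to (and excluding)
its stored CRC field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TAG_HEAD = 1, TAG_DATA, TAG_TAIL };

/* Simplified tag: for TAG_TAIL, val holds the stored CRC. */
struct tag {
	uint16_t kind;
	uint32_t val;
};

/* Bitwise CRC using the reflected Castagnoli polynomial. */
static uint32_t crc_update(uint32_t crc, const void *data, size_t len)
{
	const uint8_t *p = data;
	int k;

	while (len--) {
		crc ^= *p++;
		for (k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Fold one tag into the running CRC; a tail's stored CRC is excluded. */
static uint32_t crc_tag(uint32_t crc, const struct tag *t)
{
	crc = crc_update(crc, &t->kind, sizeof(t->kind));
	if (t->kind != TAG_TAIL)
		crc = crc_update(crc, &t->val, sizeof(t->val));
	return crc;
}

/* Commit side: store the running CRC in each tail, then start over. */
static void seal_tails(struct tag *tags, size_t n)
{
	uint32_t crc = 0;
	size_t i;

	assert(tags || n == 0);
	for (i = 0; i < n; i++) {
		crc = crc_tag(crc, &tags[i]);
		if (tags[i].kind == TAG_TAIL) {
			tags[i].val = crc;
			crc = 0;
		}
	}
}

/* Scan side: count the leading tags that are covered by a valid tail. */
static size_t scan_valid_tags(const struct tag *tags, size_t n)
{
	uint32_t crc = 0;
	size_t valid = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		crc = crc_tag(crc, &tags[i]);
		if (tags[i].kind != TAG_TAIL)
			continue;
		if (tags[i].val != crc)
			break;		/* bad tail: stop here */
		valid = i + 1;		/* everything up to this tail is good */
		crc = 0;		/* next commit starts a fresh CRC */
	}
	return valid;
}
```

For the [HEAD][T1][T2][T3][TAIL1][T4][T5][T6][TAIL2] layout, corrupting
T5 makes the scan stop after TAIL1, so only the first commit would be
replayed.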
>
> > +int ext4_mark_inode_used(struct super_block *sb, int ino)
> > +{
> > +     unsigned long max_ino = le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count);
> > +     struct buffer_head *inode_bitmap_bh = NULL, *group_desc_bh = NULL;
> > +     struct ext4_group_desc *gdp;
> > +     ext4_group_t group;
> > +     int bit;
> > +     int err = -EFSCORRUPTED;
> > +
> > +     if (ino < EXT4_FIRST_INO(sb) || ino > max_ino)
> > +             goto out;
> > +
> > +     group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
> > +     bit = (ino - 1) % EXT4_INODES_PER_GROUP(sb);
> > +     inode_bitmap_bh = ext4_read_inode_bitmap(sb, group);
> > +     if (IS_ERR(inode_bitmap_bh))
> > +             return PTR_ERR(inode_bitmap_bh);
> > +
> > +     if (ext4_test_bit(bit, inode_bitmap_bh->b_data)) {
> > +             err = 0;
> > +             goto out;
> > +     }
> > +
> > +     gdp = ext4_get_group_desc(sb, group, &group_desc_bh);
> > +     if (!gdp || !group_desc_bh) {
> > +             err = -EINVAL;
> > +             goto out;
> > +     }
> > +
> > +     ext4_set_bit(bit, inode_bitmap_bh->b_data);
> > +
> > +     BUFFER_TRACE(inode_bitmap_bh, "call ext4_handle_dirty_metadata");
> > +     err = ext4_handle_dirty_metadata(NULL, NULL, inode_bitmap_bh);
> > +     if (err) {
> > +             ext4_std_error(sb, err);
> > +             goto out;
> > +     }
> > +     sync_dirty_buffer(inode_bitmap_bh);
>
> Shouldn't we handle error from sync_dirty_buffer()?
Yeah that would be good. I'll do that.
>
> > +     BUFFER_TRACE(group_desc_bh, "get_write_access");
>
> The above BUFFER_TRACE() is not correct. We should remove it from here.
Ack, will do.
> > +/*
> > + * Idempotent helper for Ext4 fast commit replay path to set the state of
> > + * blocks in bitmaps and update counters.
> > + */
> > +void ext4_mb_mark_bb(struct super_block *sb, ext4_fsblk_t block,
> > +                     int len, int state)
> > +{
> > +     struct buffer_head *bitmap_bh = NULL;
> > +     struct ext4_group_desc *gdp;
> > +     struct buffer_head *gdp_bh;
> > +     struct ext4_sb_info *sbi = EXT4_SB(sb);
> > +     ext4_group_t group;
> > +     ext4_fsblk_t cluster;
>
> I guess we never use this variable cluster. We can as well drop it.
Yeah, sorry about this, I'll clean it up in V10.

Thanks,
Harshad
>
> -ritesh
>


* Re: [PATCH v9 4/9] jbd2: add fast commit machinery
  2020-10-09 16:16   ` Ritesh Harjani
@ 2020-10-13  0:27     ` harshad shirwadkar
  0 siblings, 0 replies; 29+ messages in thread
From: harshad shirwadkar @ 2020-10-13  0:27 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Ext4 Developers List, Theodore Y. Ts'o

On Fri, Oct 9, 2020 at 9:16 AM Ritesh Harjani <riteshh@linux.ibm.com> wrote:
>
>
>
> On 9/19/20 6:24 AM, Harshad Shirwadkar wrote:
> > This patch implements following APIs in JBD2 to allow for fast
> > commits:
> >
> > jbd2_fc_start(): Start a new fast commit. This function waits for any
> > existing fast commit or full commit to complete.
> >
> > jbd2_fc_stop(): Stop fast commit. This function ends current fast
> > commit and wakes up either the journal thread or the other fast commit
> > waiting for current fast commit to complete.
> >
> > jbd2_fc_stop_do_commit(): Stop fast commit and perform a full
> > commit. This is same as above but also performs a full commit.
> >
> > This patch also adds a cleanup handler in journal_t that is called
> > after every full and fast commit.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > ---
> >   fs/ext4/fast_commit.c |  8 ++++++
> >   fs/jbd2/commit.c      | 19 ++++++++++++
> >   fs/jbd2/journal.c     | 67 +++++++++++++++++++++++++++++++++++++++++++
> >   include/linux/jbd2.h  | 21 ++++++++++++++
> >   4 files changed, 115 insertions(+)
> >
> > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > index 0dad8bdb1253..f2d11b4c6b62 100644
> > --- a/fs/ext4/fast_commit.c
> > +++ b/fs/ext4/fast_commit.c
> > @@ -8,11 +8,19 @@
> >    * Ext4 fast commits routines.
> >    */
> >   #include "ext4_jbd2.h"
> > +/*
> > + * Fast commit cleanup routine. This is called after every fast commit and
> > + * full commit. full is true if we are called after a full commit.
> > + */
> > +static void ext4_fc_cleanup(journal_t *journal, int full)
> > +{
> > +}
> >
> >   void ext4_fc_init(struct super_block *sb, journal_t *journal)
> >   {
> >       if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
> >               return;
> > +     journal->j_fc_cleanup_callback = ext4_fc_cleanup;
> >       if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
> >               pr_warn("Error while enabling fast commits, turning off.");
> >               ext4_clear_feature_fast_commit(sb);
> > diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
> > index 6d2da8ad0e6f..ba35ecb18616 100644
> > --- a/fs/jbd2/commit.c
> > +++ b/fs/jbd2/commit.c
> > @@ -413,6 +413,20 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >       J_ASSERT(journal->j_running_transaction != NULL);
> >       J_ASSERT(journal->j_committing_transaction == NULL);
> >
> > +     write_lock(&journal->j_state_lock);
> > +     journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
>
> Shouldn't we set this flag only after the while loop ends and before
> releasing the write lock()? Like how we are doing in jbd2_fc_start()?
The reason why we need to do this before the while loop is to ensure
that a full commit takes priority over a fast commit. So, if we have
reached here, we are going to perform a full commit. Setting this flag
before the while loop ensures that any other ext4 sync operations that
may start after we have entered this while loop don't start another
fast commit. This ensures that the commit thread doesn't starve. I'll
add a comment in the code explaining this.
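
To make the ordering argument concrete, here is a small single-threaded
model of the two flags. It is a sketch only: the toy_* names are
invented for this example, j_state_lock and the wait queue are left
out, and only the flag values are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Same bit values as JBD2_FAST_COMMIT_ONGOING / JBD2_FULL_COMMIT_ONGOING. */
#define TOY_FAST_COMMIT_ONGOING	0x100
#define TOY_FULL_COMMIT_ONGOING	0x200

/* Hypothetical stand-in for the j_flags member of journal_t. */
struct toy_journal {
	unsigned int flags;
};

/* Models the admission check in jbd2_fc_start(). */
static bool toy_fc_try_start(struct toy_journal *j)
{
	if (j->flags & (TOY_FAST_COMMIT_ONGOING | TOY_FULL_COMMIT_ONGOING))
		return false;	/* the real code sleeps on the wait queue */
	j->flags |= TOY_FAST_COMMIT_ONGOING;
	return true;
}

/* Models the flag update in __jbd2_fc_stop() without fallback. */
static void toy_fc_stop(struct toy_journal *j)
{
	j->flags &= ~TOY_FAST_COMMIT_ONGOING;
}

/* The full commit announces itself before it starts waiting. */
static void toy_full_commit_announce(struct toy_journal *j)
{
	j->flags |= TOY_FULL_COMMIT_ONGOING;
}

/* The full commit may proceed once no fast commit is in flight. */
static bool toy_full_commit_can_proceed(const struct toy_journal *j)
{
	return !(j->flags & TOY_FAST_COMMIT_ONGOING);
}

/* Models the flag clearing at the end of the full commit. */
static void toy_full_commit_done(struct toy_journal *j)
{
	assert(j->flags & TOY_FULL_COMMIT_ONGOING);
	j->flags &= ~TOY_FULL_COMMIT_ONGOING;
}
```

If the announce step were moved after the wait, a fast commit that
starts between toy_fc_stop() and the flag being set would be admitted,
and the full commit would have to wait again. Setting the flag first
closes that window in this model.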
>
>
> > +     while (journal->j_flags & JBD2_FAST_COMMIT_ONGOING) {
> > +             DEFINE_WAIT(wait);
> > +
> > +             prepare_to_wait(&journal->j_wait_fc, &wait,
> > +                             TASK_UNINTERRUPTIBLE);
> > +             write_unlock(&journal->j_state_lock);
> > +             schedule();
> > +             write_lock(&journal->j_state_lock);
> > +             finish_wait(&journal->j_wait_fc, &wait);
> > +     }
> > +     write_unlock(&journal->j_state_lock);
> > +
> >       commit_transaction = journal->j_running_transaction;
> >
> >       trace_jbd2_start_commit(journal, commit_transaction);
> > @@ -1119,12 +1133,16 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >
> >       if (journal->j_commit_callback)
> >               journal->j_commit_callback(journal, commit_transaction);
> > +     if (journal->j_fc_cleanup_callback)
> > +             journal->j_fc_cleanup_callback(journal, 1);
> >
> >       trace_jbd2_end_commit(journal, commit_transaction);
> >       jbd_debug(1, "JBD2: commit %d complete, head %d\n",
> >                 journal->j_commit_sequence, journal->j_tail_sequence);
> >
> >       write_lock(&journal->j_state_lock);
> > +     journal->j_flags &= ~JBD2_FULL_COMMIT_ONGOING;
> > +     journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
> >       spin_lock(&journal->j_list_lock);
> >       commit_transaction->t_state = T_FINISHED;
> >       /* Check if the transaction can be dropped now that we are finished */
> > @@ -1136,6 +1154,7 @@ void jbd2_journal_commit_transaction(journal_t *journal)
> >       spin_unlock(&journal->j_list_lock);
> >       write_unlock(&journal->j_state_lock);
> >       wake_up(&journal->j_wait_done_commit);
> > +     wake_up(&journal->j_wait_fc);
> >
> >       /*
> >        * Calculate overall stats
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 736a1736619f..17a30a2c38f9 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -714,6 +714,72 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
> >       return err;
> >   }
> >
> > +/*
> > + * Start a fast commit. If there's an ongoing fast or full commit wait for
> > + * it to complete. Returns 0 if a new fast commit was started. Returns -EALREADY
> > + * if a fast commit is not needed, either because there's an already a commit
> > + * going on or this tid has already been committed. Returns -EINVAL if no jbd2
> > + * commit has yet been performed.
> > + */
> > +int jbd2_fc_start(journal_t *journal, tid_t tid)
> > +{
> > +     /*
> > +      * Fast commits only allowed if at least one full commit has
> > +      * been processed.
> > +      */
> > +     if (!journal->j_stats.ts_tid)
> > +             return -EINVAL;
> > +
> > +     if (tid <= journal->j_commit_sequence)
> > +             return -EALREADY;
> > +
> > +     write_lock(&journal->j_state_lock);
> > +     if (journal->j_flags & JBD2_FULL_COMMIT_ONGOING ||
> > +         (journal->j_flags & JBD2_FAST_COMMIT_ONGOING)) {
> > +             DEFINE_WAIT(wait);
> > +
> > +             prepare_to_wait(&journal->j_wait_fc, &wait,
> > +                             TASK_UNINTERRUPTIBLE);
> > +             write_unlock(&journal->j_state_lock);
> > +             schedule();
> > +             finish_wait(&journal->j_wait_fc, &wait);
> > +             return -EALREADY;
> > +     }
> > +     journal->j_flags |= JBD2_FAST_COMMIT_ONGOING;
> > +     write_unlock(&journal->j_state_lock);
> > +
> > +     return 0;
> > +}
> > +
> > +/*
> > + * Stop a fast commit. If fallback is set, this function starts commit of
> > + * TID tid before any other fast commit can start.
> > + */
> > +static int __jbd2_fc_stop(journal_t *journal, tid_t tid, bool fallback)
> > +{
> > +     if (journal->j_fc_cleanup_callback)
> > +             journal->j_fc_cleanup_callback(journal, 0);
> > +     write_lock(&journal->j_state_lock);
> > +     journal->j_flags &= ~JBD2_FAST_COMMIT_ONGOING;
> > +     if (fallback)
> > +             journal->j_flags |= JBD2_FULL_COMMIT_ONGOING;
> > +     write_unlock(&journal->j_state_lock);
> > +     wake_up(&journal->j_wait_fc);
> > +     if (fallback)
> > +             return jbd2_complete_transaction(journal, tid);
> > +     return 0;
> > +}
> > +
> > +int jbd2_fc_stop(journal_t *journal)
> > +{
> > +     return __jbd2_fc_stop(journal, 0, 0);
> > +}
> > +
> > +int jbd2_fc_stop_do_commit(journal_t *journal, tid_t tid)
> > +{
> > +     return __jbd2_fc_stop(journal, tid, 1);
> > +}
> > +
> >   /* Return 1 when transaction with given tid has already committed. */
> >   int jbd2_transaction_committed(journal_t *journal, tid_t tid)
> >   {
> > @@ -1140,6 +1206,7 @@ static journal_t *journal_init_common(struct block_device *bdev,
> >       init_waitqueue_head(&journal->j_wait_commit);
> >       init_waitqueue_head(&journal->j_wait_updates);
> >       init_waitqueue_head(&journal->j_wait_reserved);
> > +     init_waitqueue_head(&journal->j_wait_fc);
> >       mutex_init(&journal->j_abort_mutex);
> >       mutex_init(&journal->j_barrier);
> >       mutex_init(&journal->j_checkpoint_mutex);
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index 36f65a818366..aad986a9f3ff 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -858,6 +858,13 @@ struct journal_s
> >        */
> >       wait_queue_head_t       j_wait_reserved;
> >
> > +     /**
> > +      * @j_wait_fc:
> > +      *
> > +      * Wait queue to wait for completion of async fast commits.
> > +      */
> > +     wait_queue_head_t       j_wait_fc;
>
> If we follow the naming convention then j_fc_wait, will be more
> convenient.
Makes sense, will do that in V10.

Thanks,
Harshad

>
> > +
> >       /**
> >        * @j_checkpoint_mutex:
> >        *
> > @@ -1208,6 +1215,15 @@ struct journal_s
> >        */
> >       struct lockdep_map      j_trans_commit_map;
> >   #endif
> > +
> > +     /**
> > +      * @j_fc_cleanup_callback:
> > +      *
> > +      * Clean-up after fast commit or full commit. JBD2 calls this function
> > +      * after every commit operation.
> > +      */
> > +     void (*j_fc_cleanup_callback)(struct journal_s *journal, int);
> > +
> >   };
> >
> >   #define jbd2_might_wait_for_commit(j) \
> > @@ -1292,6 +1308,8 @@ JBD2_FEATURE_INCOMPAT_FUNCS(fast_commit,        FAST_COMMIT)
> >   #define JBD2_ABORT_ON_SYNCDATA_ERR  0x040   /* Abort the journal on file
> >                                                * data write error in ordered
> >                                                * mode */
> > +#define JBD2_FAST_COMMIT_ONGOING     0x100   /* Fast commit is ongoing */
> > +#define JBD2_FULL_COMMIT_ONGOING     0x200   /* Full commit is ongoing */
> >
> >   /*
> >    * Function declarations for the journaling transaction and buffer
> > @@ -1546,6 +1564,9 @@ extern int jbd2_cleanup_journal_tail(journal_t *);
> >
> >   /* Fast commit related APIs */
> >   int jbd2_fc_init(journal_t *journal, int num_fc_blks);
> > +int jbd2_fc_start(journal_t *journal, tid_t tid);
> > +int jbd2_fc_stop(journal_t *journal);
> > +int jbd2_fc_stop_do_commit(journal_t *journal, tid_t tid);
> >   /*
> >    * is_journal_abort
> >    *
> >


* Re: [PATCH v9 3/9] ext4 / jbd2: add fast commit initialization
  2020-10-09 16:10   ` Ritesh Harjani
@ 2020-10-13  0:28     ` harshad shirwadkar
  0 siblings, 0 replies; 29+ messages in thread
From: harshad shirwadkar @ 2020-10-13  0:28 UTC (permalink / raw)
  To: Ritesh Harjani; +Cc: Ext4 Developers List, Theodore Y. Ts'o

Thanks Ritesh for taking a look at the patches! I know that a couple
of patches in this series are really big, so I really appreciate you
taking a look at them!

On Fri, Oct 9, 2020 at 9:10 AM Ritesh Harjani <riteshh@linux.ibm.com> wrote:
>
>
> Sorry about the delay. Few comments below.
>
> On 9/19/20 6:24 AM, Harshad Shirwadkar wrote:
> > This patch adds fast commit area trackers in the journal_t
> > structure. These are initialized via the jbd2_fc_init() routine that
> > this patch adds. This patch also adds ext4/fast_commit.c and
> > ext4/fast_commit.h files for fast commit code that will be added in
> > subsequent patches in this series.
> >
> > Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > ---
> >   fs/ext4/Makefile      |  2 +-
> >   fs/ext4/ext4.h        |  4 ++++
> >   fs/ext4/fast_commit.c | 20 +++++++++++++++++
> >   fs/ext4/fast_commit.h |  9 ++++++++
> >   fs/ext4/super.c       |  1 +
> >   fs/jbd2/journal.c     | 52 ++++++++++++++++++++++++++++++++++++++-----
> >   include/linux/jbd2.h  | 39 ++++++++++++++++++++++++++++++++
> >   7 files changed, 121 insertions(+), 6 deletions(-)
> >   create mode 100644 fs/ext4/fast_commit.c
> >   create mode 100644 fs/ext4/fast_commit.h
> >
> > diff --git a/fs/ext4/Makefile b/fs/ext4/Makefile
> > index 2e42f47a7f98..49e7af6cc93f 100644
> > --- a/fs/ext4/Makefile
> > +++ b/fs/ext4/Makefile
> > @@ -10,7 +10,7 @@ ext4-y      := balloc.o bitmap.o block_validity.o dir.o ext4_jbd2.o extents.o \
> >               indirect.o inline.o inode.o ioctl.o mballoc.o migrate.o \
> >               mmp.o move_extent.o namei.o page-io.o readpage.o resize.o \
> >               super.o symlink.o sysfs.o xattr.o xattr_hurd.o xattr_trusted.o \
> > -             xattr_user.o
> > +             xattr_user.o fast_commit.o
> >
> >   ext4-$(CONFIG_EXT4_FS_POSIX_ACL)    += acl.o
> >   ext4-$(CONFIG_EXT4_FS_SECURITY)             += xattr_security.o
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 82e889d5c2ed..9af3971dd12e 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -964,6 +964,7 @@ do {                                                                             \
> >   #endif /* defined(__KERNEL__) || defined(__linux__) */
> >
> >   #include "extents_status.h"
> > +#include "fast_commit.h"
> >
> >   /*
> >    * Lock subclasses for i_data_sem in the ext4_inode_info structure.
> > @@ -2679,6 +2680,9 @@ extern int ext4_init_inode_table(struct super_block *sb,
> >                                ext4_group_t group, int barrier);
> >   extern void ext4_end_bitmap_read(struct buffer_head *bh, int uptodate);
> >
> > +/* fast_commit.c */
> > +
> > +void ext4_fc_init(struct super_block *sb, journal_t *journal);
> >   /* mballoc.c */
> >   extern const struct seq_operations ext4_mb_seq_groups_ops;
> >   extern long ext4_mb_stats;
> > diff --git a/fs/ext4/fast_commit.c b/fs/ext4/fast_commit.c
> > new file mode 100644
> > index 000000000000..0dad8bdb1253
> > --- /dev/null
> > +++ b/fs/ext4/fast_commit.c
> > @@ -0,0 +1,20 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +
> > +/*
> > + * fs/ext4/fast_commit.c
> > + *
> > + * Written by Harshad Shirwadkar <harshadshirwadkar@gmail.com>
> > + *
> > + * Ext4 fast commits routines.
> > + */
> > +#include "ext4_jbd2.h"
> > +
> > +void ext4_fc_init(struct super_block *sb, journal_t *journal)
> > +{
> > +     if (!test_opt2(sb, JOURNAL_FAST_COMMIT))
> > +             return;
> > +     if (jbd2_fc_init(journal, EXT4_NUM_FC_BLKS)) {
> > +             pr_warn("Error while enabling fast commits, turning off.");
> > +             ext4_clear_feature_fast_commit(sb);
> > +     }
> > +}
> > diff --git a/fs/ext4/fast_commit.h b/fs/ext4/fast_commit.h
> > new file mode 100644
> > index 000000000000..8362bf5e6e00
> > --- /dev/null
> > +++ b/fs/ext4/fast_commit.h
> > @@ -0,0 +1,9 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +
> > +#ifndef __FAST_COMMIT_H__
> > +#define __FAST_COMMIT_H__
> > +
> > +/* Number of blocks in journal area to allocate for fast commits */
> > +#define EXT4_NUM_FC_BLKS             256
>
> Just wanted to understand how is this value determined?
> Do you think this needs to be configurable?
> Just thinking since, on some platforms blksz could be of 64K.
I see. I chose this value experimentally. In my experiments with very
aggressive journal commits (such as fs_mark and NFS), I found that
256 blocks was enough to guarantee that this space doesn't get filled
up before the mandatory periodic full commit (happening at the default
interval of 5 seconds). But I realize that it's probably better to make
this configurable. Another option is to have this value be statically
defined as a percentage of the total number of blocks available for
JBD2. The latter has the advantage that the on-disk format doesn't
need to be updated. Performance gains with fast commits are
achieved by delaying full commits as much as possible.
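
As a rough illustration of the percentage option, the sizing could
look something like the sketch below. The toy_* names and the percent
parameter are hypothetical and not part of this series; toy_split()
only mirrors the arithmetic that journal_reset() does in this patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: size the fast commit area as a percentage of
 * the journal. The multiplication is done in 64 bits so that large
 * journals do not overflow.
 */
static uint32_t toy_fc_blocks(uint32_t journal_blocks, uint32_t percent)
{
	assert(percent <= 100);
	return (uint32_t)(((uint64_t)journal_blocks * percent) / 100);
}

/* Boundaries as journal_reset() computes them in this patch. */
struct toy_layout {
	uint32_t last;		/* j_last */
	uint32_t first_fc;	/* j_first_fc */
	uint32_t last_fc;	/* j_last_fc */
};

/* Carve the fast commit area out of the end of the journal. */
static struct toy_layout toy_split(uint32_t last, uint32_t fc_blocks)
{
	struct toy_layout l;

	assert(fc_blocks < last);
	l.last_fc = last;
	l.last = last - fc_blocks;
	l.first_fc = l.last + 1;
	return l;
}
```

For scale, the fixed 256 blocks correspond to 1 MiB of journal with a
4K block size and 16 MiB with a 64K block size, whereas a percentage
keeps the share of the journal the same regardless of block size.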
>
> > +
> > +#endif /* __FAST_COMMIT_H__ */
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index b62858ee420b..94aaaf940449 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -4962,6 +4962,7 @@ static void ext4_init_journal_params(struct super_block *sb, journal_t *journal)
> >       journal->j_commit_interval = sbi->s_commit_interval;
> >       journal->j_min_batch_time = sbi->s_min_batch_time;
> >       journal->j_max_batch_time = sbi->s_max_batch_time;
> > +     ext4_fc_init(sb, journal);
> >
> >       write_lock(&journal->j_state_lock);
> >       if (test_opt(sb, BARRIER))
> > diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
> > index 17fdc482f554..736a1736619f 100644
> > --- a/fs/jbd2/journal.c
> > +++ b/fs/jbd2/journal.c
> > @@ -1179,6 +1179,14 @@ static journal_t *journal_init_common(struct block_device *bdev,
> >       if (!journal->j_wbuf)
> >               goto err_cleanup;
> >
> > +     if (journal->j_fc_wbufsize > 0) {
> > +             journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> > +                                     sizeof(struct buffer_head *),
> > +                                     GFP_KERNEL);
> > +             if (!journal->j_fc_wbuf)
> > +                     goto err_cleanup;
> > +     }
> > +
> >       bh = getblk_unmovable(journal->j_dev, start, journal->j_blocksize);
> >       if (!bh) {
> >               pr_err("%s: Cannot get buffer for journal superblock\n",
> > @@ -1192,11 +1200,22 @@ static journal_t *journal_init_common(struct block_device *bdev,
> >
> >   err_cleanup:
> >       kfree(journal->j_wbuf);
> > +     kfree(journal->j_fc_wbuf);
> >       jbd2_journal_destroy_revoke(journal);
> >       kfree(journal);
> >       return NULL;
> >   }
> >
> > +int jbd2_fc_init(journal_t *journal, int num_fc_blks)
> > +{
> > +     journal->j_fc_wbufsize = num_fc_blks;
> > +     journal->j_fc_wbuf = kmalloc_array(journal->j_fc_wbufsize,
> > +                             sizeof(struct buffer_head *), GFP_KERNEL);
> > +     if (!journal->j_fc_wbuf)
> > +             return -ENOMEM;
> > +     return 0;
> > +}
> > +
> >   /* jbd2_journal_init_dev and jbd2_journal_init_inode:
> >    *
> >    * Create a journal structure assigned some fixed set of disk blocks to
> > @@ -1314,11 +1333,20 @@ static int journal_reset(journal_t *journal)
> >       }
> >
> >       journal->j_first = first;
> > -     journal->j_last = last;
> >
> > -     journal->j_head = first;
> > -     journal->j_tail = first;
> > -     journal->j_free = last - first;
> > +     if (jbd2_has_feature_fast_commit(journal) &&
> > +         journal->j_fc_wbufsize > 0) {
> > +             journal->j_last_fc = last;
> > +             journal->j_last = last - journal->j_fc_wbufsize;
> > +             journal->j_first_fc = journal->j_last + 1;
> > +             journal->j_fc_off = 0;
> > +     } else {
> > +             journal->j_last = last;
> > +     }
> > +
> > +     journal->j_head = journal->j_first;
> > +     journal->j_tail = journal->j_first;
> > +     journal->j_free = journal->j_last - journal->j_first;
> >
> >       journal->j_tail_sequence = journal->j_transaction_sequence;
> >       journal->j_commit_sequence = journal->j_transaction_sequence - 1;
> > @@ -1663,9 +1691,18 @@ static int load_superblock(journal_t *journal)
> >       journal->j_tail_sequence = be32_to_cpu(sb->s_sequence);
> >       journal->j_tail = be32_to_cpu(sb->s_start);
> >       journal->j_first = be32_to_cpu(sb->s_first);
> > -     journal->j_last = be32_to_cpu(sb->s_maxlen);
> >       journal->j_errno = be32_to_cpu(sb->s_errno);
> >
> > +     if (jbd2_has_feature_fast_commit(journal) &&
> > +         journal->j_fc_wbufsize > 0) {
> > +             journal->j_last_fc = be32_to_cpu(sb->s_maxlen);
> > +             journal->j_last = journal->j_last_fc - journal->j_fc_wbufsize;
> > +             journal->j_first_fc = journal->j_last + 1;
> > +             journal->j_fc_off = 0;
> > +     } else {
> > +             journal->j_last = be32_to_cpu(sb->s_maxlen);
> > +     }
> > +
> >       return 0;
> >   }
> >
> > @@ -1726,6 +1763,9 @@ int jbd2_journal_load(journal_t *journal)
> >        */
> >       journal->j_flags &= ~JBD2_ABORT;
> >
> > +     if (journal->j_fc_wbufsize > 0)
> > +             jbd2_journal_set_features(journal, 0, 0,
> > +                                       JBD2_FEATURE_INCOMPAT_FAST_COMMIT);
> >       /* OK, we've finished with the dynamic journal bits:
> >        * reinitialise the dynamic contents of the superblock in memory
> >        * and reset them on disk. */
> > @@ -1809,6 +1849,8 @@ int jbd2_journal_destroy(journal_t *journal)
> >               jbd2_journal_destroy_revoke(journal);
> >       if (journal->j_chksum_driver)
> >               crypto_free_shash(journal->j_chksum_driver);
> > +     if (journal->j_fc_wbufsize > 0)
> > +             kfree(journal->j_fc_wbuf);
> >       kfree(journal->j_wbuf);
> >       kfree(journal);
> >
> > diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
> > index f438257d7f31..36f65a818366 100644
> > --- a/include/linux/jbd2.h
> > +++ b/include/linux/jbd2.h
> > @@ -915,6 +915,30 @@ struct journal_s
> >        */
> >       unsigned long           j_last;
> >
> > +     /**
> > +      * @j_first_fc:
> > +      *
> > +      * The block number of the first fast commit block in the journal
> > +      * [j_state_lock].
> > +      */
> > +     unsigned long           j_first_fc;
> > +
> > +     /**
> > +      * @j_fc_off:
> > +      *
> > +      * Number of fast commit blocks currently allocated.
> > +      * [j_state_lock].
> > +      */
> > +     unsigned long           j_fc_off;
>
> I guess choosing a single naming convention for fast commit would be
> very helpful for grepping/searching.
> So for e.g. we could have everything using j_fc_**
> If you agree, then we may have to change other members of this structure
> accordingly.
That makes sense, I'll rename the variables / functions where this
convention is not followed.

Thanks,
Harshad

>
> -ritesh
>
> > +
> > +     /**
> > +      * @j_last_fc:
> > +      *
> > +      * The block number one beyond the last fast commit block in the journal
> > +      * [j_state_lock].
> > +      */
> > +     unsigned long           j_last_fc;
> > +
> >       /**
> >        * @j_dev: Device where we store the journal.
> >        */
> > @@ -1065,6 +1089,12 @@ struct journal_s
> >        */
> >       struct buffer_head      **j_wbuf;
> >
> > +     /**
> > +      * @j_fc_wbuf: Array of fast commit bhs for
> > +      * jbd2_journal_commit_transaction.
> > +      */
> > +     struct buffer_head      **j_fc_wbuf;
> > +
> >       /**
> >        * @j_wbufsize:
> >        *
> > @@ -1072,6 +1102,13 @@ struct journal_s
> >        */
> >       int                     j_wbufsize;
> >
> > +     /**
> > +      * @j_fc_wbufsize:
> > +      *
> > +      * Size of @j_fc_wbuf array.
> > +      */
> > +     int                     j_fc_wbufsize;
> > +
> >       /**
> >        * @j_last_sync_writer:
> >        *
> > @@ -1507,6 +1544,8 @@ void __jbd2_log_wait_for_space(journal_t *journal);
> >   extern void __jbd2_journal_drop_transaction(journal_t *, transaction_t *);
> >   extern int jbd2_cleanup_journal_tail(journal_t *);
> >
> > +/* Fast commit related APIs */
> > +int jbd2_fc_init(journal_t *journal, int num_fc_blks);
> >   /*
> >    * is_journal_abort
> >    *
> >


Thread overview: 29+ messages
2020-09-19  0:54 [PATCH v9 0/9] ext4: add fast commits feature Harshad Shirwadkar
2020-09-19  0:54 ` [PATCH v9 1/9] doc: update ext4 and journalling docs to include fast commit feature Harshad Shirwadkar
2020-09-22 17:50   ` Darrick J. Wong
2020-09-24  6:56     ` harshad shirwadkar
2020-10-09 18:28   ` Theodore Y. Ts'o
2020-10-13  0:27     ` harshad shirwadkar
2020-09-19  0:54 ` [PATCH v9 2/9] ext4: add fast_commit feature and handling for extended mount options Harshad Shirwadkar
2020-10-09 17:58   ` Theodore Y. Ts'o
2020-10-13  0:27     ` harshad shirwadkar
2020-09-19  0:54 ` [PATCH v9 3/9] ext4 / jbd2: add fast commit initialization Harshad Shirwadkar
2020-09-19 15:22   ` kernel test robot
2020-10-09 16:10   ` Ritesh Harjani
2020-10-13  0:28     ` harshad shirwadkar
2020-09-19  0:54 ` [PATCH v9 4/9] jbd2: add fast commit machinery Harshad Shirwadkar
2020-10-09 16:16   ` Ritesh Harjani
2020-10-13  0:27     ` harshad shirwadkar
2020-09-19  0:54 ` [PATCH v9 5/9] ext4: main fast-commit commit path Harshad Shirwadkar
2020-09-19  8:19   ` kernel test robot
2020-10-09 17:04   ` Ritesh Harjani
2020-10-13  0:25     ` harshad shirwadkar
2020-10-09 19:14   ` Theodore Y. Ts'o
2020-10-13  0:27     ` harshad shirwadkar
2020-09-19  0:54 ` [PATCH v9 6/9] jbd2: fast commit recovery path Harshad Shirwadkar
2020-09-19  0:54 ` [PATCH v9 7/9] ext4: " Harshad Shirwadkar
2020-09-19 14:15   ` kernel test robot
2020-10-09 17:14   ` Ritesh Harjani
2020-10-13  0:27     ` harshad shirwadkar
2020-09-19  0:54 ` [PATCH v9 8/9] ext4: add a mount opt to forcefully turn fast commits on Harshad Shirwadkar
2020-09-19  0:54 ` [PATCH v9 9/9] ext4: add fast commit stats in procfs Harshad Shirwadkar
